[DO NOT MERGE] Test new MI300 ROCm CI nodes #140989

saienduri · 2024-11-18T23:20:29Z

This commit enables PyTorch ROCm CI on MI300 nodes. It also removes a few steps that shouldn't be in the workflows: they are not universal to all runners and should be handled as part of the runner setup itself.

Fixes #ISSUE_NUMBER

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

linux-foundation-easycla · 2024-11-18T23:20:33Z

The committers listed above are authorized under a signed CLA.

✅ login: saienduri (9563ddf, 71c3797, 2689c08, 3204bfe, 4f9a667, 12aabce, 4a33e6b, d75cbe1, 8f1339f, 4ac8c36, e7373da, 0d9436e, dd46713, 4c7b1b8, 7235317, fc48575, 5b8fd78, abed6f4, 8232e18, e09f5ca, fd1f36b, ad85afe)
✅ login: jithunnair-amd / name: Jithun Nair (763179c)

pytorch-bot · 2024-11-18T23:20:34Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140989

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 15 New Failures, 1 Unrelated Failure

As of commit 763179c with merge base fd553b9:

NEW FAILURES - The following jobs have failed:

docker-builds / docker-build (linux.12xlarge, pytorch-linux-focal-cuda12.4-cudnn9-py3.13-gcc9-inductor-benchmarks) (gh)
#2 ERROR: docker.io/nvidia/cuda:12.4-devel-ubuntu20.04: not found
docker-builds / docker-build (linux.12xlarge, pytorch-linux-focal-py3.13-clang10) (gh)
64.43 Getting requirements to build wheel: finished with status 'error'
docker-builds / docker-build (linux.12xlarge, pytorch-linux-jammy-py3-clang12-executorch) (gh)
curl: (22) The requested URL returned error:
inductor-rocm / rocm6.2-py3.10-inductor / test (inductor, 1, 2, linux-mi300-gpu-2) (gh)
distributed/tensor/parallel/test_micro_pipeline_tp.py::MicroPipelineTPTest::test_fuse_scaled_matmul_reduce_scatter_A_dims_3_scatter_dim_2
inductor-rocm / rocm6.2-py3.10-inductor / test (inductor, 2, 2, linux-mi300-gpu-2) (gh)
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_none_args_aot_codegen_cuda
Lint / lintrunner-noclang / linux-job (gh)
>>> Lint for test/test_transformers.py:
periodic / linux-focal-cuda11.8-py3.9-gcc9 / test (multigpu, 1, 1, lf.linux.g5.12xlarge.nvidia.gpu, oncall:distributed) (gh)
'Test'
periodic / linux-focal-rocm6.2-py3.10 / test (distributed, 3, 3, linux-mi300-gpu-4, module:rocm, oncall:distributed) (gh)
distributed/tensor/parallel/test_micro_pipeline_tp.py::MicroPipelineTPTest::test_fuse_scaled_matmul_reduce_scatter_A_dims_3_scatter_dim_2
pull / linux-focal-cuda12.4-py3.10-gcc9-sm89 / test (default, 1, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu) (gh)
[ FAILED ] RNNTest.BidirectionalMultilayerGRU_CPU_vs_CUDA
pull / linux-focal-py3.13-clang10 / build (gh)
#21 73.54 Getting requirements to build wheel: finished with status 'error'
rocm / linux-focal-rocm6.2-py3.10 / test (default, 1, 6, linux-mi300-gpu-2) (gh)
[ FAILED ] StaticRuntime.autogen_isin_Scalar_Tensor
rocm / linux-focal-rocm6.2-py3.10 / test (default, 4, 6, linux-mi300-gpu-2) (gh)
test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_float8_basics_cuda
rocm / linux-focal-rocm6.2-py3.10 / test (default, 5, 6, linux-mi300-gpu-2) (gh)
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_none_args_aot_codegen_cuda
rocm / linux-focal-rocm6.2-py3.10 / test (default, 6, 6, linux-mi300-gpu-2) (gh)
inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qconv2d_int8_mixed_bf16
trunk / linux-focal-rocm6.2-py3.10 / test (default, 1, 2, linux-mi300-gpu-2) (gh)
test_cuda.py::TestCudaMallocAsync::test_power_draw

FLAKY - The following job failed but was likely due to flakiness present on trunk:

periodic / linux-focal-cuda11.8-py3.10-gcc9-debug / test (default, 2, 5, lf.linux.4xlarge.nvidia.gpu, oncall:debug-build) (gh) (disabled by #141329 but the issue was closed recently and a rebase is needed to make it pass)
inductor/test_mkldnn_pattern_matcher.py::TestDynamicPatternMatcher::test_conv2d_binary_dynamic_shapes

This comment was automatically generated by Dr. CI and updates every 15 minutes.
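
The test job names in the failure list above appear to follow a common shape: workflow, build environment, then `test (config, shard, total shards, runner[, extra labels])`. As a minimal sketch under that assumption, the snippet below splits one such name into fields, for example to see which failures ran on `linux-mi300-*` runners. `parse_job_name` and the `TestJob` field names are hypothetical and not part of Dr. CI.

```python
import re
from typing import NamedTuple, Optional


class TestJob(NamedTuple):
    workflow: str
    build_env: str
    config: str
    shard: int
    num_shards: int
    runner: str


# Assumed shape, inferred from the job names listed above:
#   "<workflow> / <build env> / test (<config>, <shard>, <num shards>, <runner>[, labels...])"
_JOB_RE = re.compile(
    r"^(?P<workflow>[^/]+?) / (?P<build_env>.+?) / test "
    r"\((?P<config>[^,]+), (?P<shard>\d+), (?P<num_shards>\d+), (?P<runner>[^,)]+)"
)


def parse_job_name(name: str) -> Optional[TestJob]:
    """Return the parsed fields, or None for non-test jobs (builds, lint)."""
    m = _JOB_RE.match(name)
    if m is None:
        return None
    return TestJob(
        m["workflow"], m["build_env"], m["config"],
        int(m["shard"]), int(m["num_shards"]), m["runner"],
    )


job = parse_job_name(
    "rocm / linux-focal-rocm6.2-py3.10 / test (default, 4, 6, linux-mi300-gpu-2) (gh)"
)
print(job)
```

Build and lint jobs do not carry the `test (...)` suffix, so the helper returns `None` for them instead of guessing.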

saienduri · 2024-11-19T10:44:20Z

@pytorchbot ciflow rerun

pytorch-bot · 2024-11-19T10:44:24Z

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'ciflow' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick', 'close')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.
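
The error above has the shape of an argparse-style subcommand check: `ciflow` is not one of the listed subcommands, whereas `label ciflow/periodic` (tried later in this thread) parses but is then refused on permissions. A minimal sketch of that parsing step follows, assuming a Python `argparse` parser; the subcommand list is copied from the usage line, while `build_parser`, `is_valid`, and the catch-all `args` positional are hypothetical and not the bot's actual implementation.

```python
import argparse

# Subcommands copied from the usage line in the bot's error message.
COMMANDS = ["merge", "revert", "rebase", "label", "drci", "cherry-pick", "close"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="@pytorchbot")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name in COMMANDS:
        sub = subparsers.add_parser(name)
        # Hypothetical catch-all for each subcommand's own arguments.
        sub.add_argument("args", nargs="*")
    return parser


def is_valid(comment: str) -> bool:
    """True if the comment's first word is a known subcommand."""
    try:
        build_parser().parse_args(comment.split())
        return True
    except SystemExit:
        # argparse reports the invalid choice on stderr and exits.
        return False


print(is_valid("ciflow rerun"))
print(is_valid("label ciflow/periodic"))
```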

saienduri · 2024-11-19T10:46:23Z

@pytorchbot rebase

pytorch-bot · 2024-11-19T10:46:27Z

You don't have permissions to rebase this PR since you are a first time contributor. If you think this is a mistake, please contact PyTorch Dev Infra.

saienduri · 2024-11-21T02:01:30Z

@pytorchbot label ciflow/periodic

pytorch-bot · 2024-11-21T02:01:38Z

Can't add following labels to PR: ciflow/periodic. Please ping one of the reviewers for help.

.github/actions/diskspace-cleanup/action.yml

jithunnair-amd · 2024-11-21T04:34:03Z

.github/actions/setup-rocm/action.yml

-      shell: bash
-      run: |
-        killall runsvc.sh
-


@saienduri Is this meant to be temporary? While some checks might need to be disabled for MI300 nodes, it seems like a good idea to still fail if the applicable health checks fail?

saienduri · 2024-11-21

Yes, this was meant to be temporary (deleted while dealing with other failures). Added back now, thanks.

jithunnair-amd · 2024-11-27T16:58:25Z

.github/actions/setup-rocm/action.yml

-        cat ~/.docker/config.json || true
-        # https://stackoverflow.com/questions/64455468/error-when-logging-into-ecr-with-docker-login-error-saving-credentials-not
-        rm -f ~/.docker/config.json
-


@saienduri @amdfaa We need to make sure the changes here work not just for the MI300 nodes but also for our existing CI nodes. Please discuss offline how to resolve that. AFAIU, setting DOCKER_HOST is necessary to have our existing nodes working?

amdfaa · 2024-12-03

I should be able to make this conditional on the type of node being used. The step on line 8 appears to be necessary, but I don't think the one on line 12 is.

saienduri · 2024-12-03

You can add environment variables as part of the runner startup if needed for some machines in the actions-runner/.env file:

We shouldn't be doing any machine-specific setup in the workflow files, @amdfaa.

jithunnair-amd · 2024-12-04

@amdfaa This seems like a reasonable suggestion. Can you please update your runner setup scripts to use the actions-runners/.env file to set any required env variables for the runner, and update all the non-MI300 runners to have that file?
Let us know if you have any concerns.

amdfaa · 2024-12-04

A big concern is having to stop and restart the runners: the actions-runners/.env file has to be in place before the service is started.

jithunnair-amd · 2025-01-08

@amdfaa I believe we are now done with setting up the actions-runners/.env files on all the MI2xx runners, so we're good with this change?
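
The thread above settles on keeping machine-specific variables in the runner's `.env` file instead of the workflow files. As a minimal sketch of why that file must exist before the service starts, the snippet below reads `KEY=VALUE` lines into an environment once, at startup. The file contents and the `DOCKER_HOST` value are hypothetical (the thread names the variable but not its value), and `load_env_file` is an illustrative helper, not the actions runner's own loader.

```python
from typing import Dict


def load_env_file(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and '#' comments."""
    env: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment; ignore it
        env[key.strip()] = value.strip()
    return env


# Hypothetical contents of one runner's .env file.
EXAMPLE = """
# machine-specific settings live here, not in workflow YAML
DOCKER_HOST=unix:///hypothetical/docker.sock
"""

# Read once when the service starts; later edits need a restart to apply.
startup_env = load_env_file(EXAMPLE)
print(startup_env["DOCKER_HOST"])
```

Because the values are captured at startup, changing the file on a live runner has no effect until the service is restarted, which is the operational concern raised in the thread.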

.github/actions/setup-rocm/action.yml

Signed-off-by: saienduri <saimanas.enduri@amd.com>

…uctor-rocm` (#143768)

This helps to make `continue-through-error`/`keep-going` work as expected on `inductor-rocm` workflow jobs. Without this, the code here doesn't enter the `if` condition: https://github.com/pytorch/pytorch/blob/6ccb8ed1868984d9d2ea4e48a085508d1027cd9b/.github/scripts/filter_test_configs.py#L577

Tested via [this PR](#140989):

Without this change: https://hud.pytorch.org/pytorch/pytorch/pull/140989?sha=8232e18957f987d99c946efc0cf6da9be9b52067: https://github.com/pytorch/pytorch/actions/runs/12164558045/job/34192442187#step:13:144

With this change: https://hud.pytorch.org/pytorch/pytorch/pull/140989?sha=763179c5e421791ee05c8e2a600379b29a1c8c33: https://github.com/pytorch/pytorch/actions/runs/12261943684/job/34213300153#step:13:145

Pull Request resolved: #143768
Approved by: https://github.com/huydhn
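
To illustrate the kind of mismatch this commit message describes, here is a minimal sketch. It assumes tags of the form `ciflow/<workflow>/<PR number>` and assumes the fix widened a `\w+` segment to admit hyphens. Both patterns are assumptions for illustration and are not quoted from `filter_test_configs.py`.

```python
import re

# Assumed "before" pattern: \w does not match '-', so hyphenated
# workflow names such as inductor-rocm cannot match.
OLD_TAG_REGEX = re.compile(r"^ciflow/\w+/(?P<pr_number>\d+)$")
# Assumed "after" pattern: the character class also admits '-'.
NEW_TAG_REGEX = re.compile(r"^ciflow/[\w\-]+/(?P<pr_number>\d+)$")

for tag in ("ciflow/trunk/140989", "ciflow/inductor-rocm/140989"):
    old = OLD_TAG_REGEX.match(tag) is not None
    new = NEW_TAG_REGEX.match(tag) is not None
    print(f"{tag}: old={old} new={new}")
```

Under these assumptions, only the hyphenated workflow name changes behavior, which would explain why other workflows were unaffected.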

…x MI300 related failed tests (#143673)

This PR
* makes changes to the workflow files and scripts so we can run CI workflows on the MI300 runners
* skips and fixes several tests that failed on MI300, as observed in #140989

Skipped due to unsupported Float8_e4m3fn data type on MI300 (need to update test code to use datatypes supported by MI300):
- distributed.tensor.parallel.test_micro_pipeline_tp.py::MicroPipelineTPTest::test_fuse_all_gather_scaled_matmul_A_dims_\*_gather_dim_\* (24 tests across inductor/distributed configs)
- distributed.tensor.parallel.test_micro_pipeline_tp.py::test_fuse_scaled_matmul_reduce_scatter_A_dims_\*_scatter_dim_\* (12 tests across inductor/distributed configs)
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_cast_and_t
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_pattern_2

Skipped due to AssertionError on MI300:
- inductor.test_mkldnn_pattern_matcher.py::test_qconv2d_int8_mixed_bf16
- distributed._tools.test_sac_ilp::TestSACILP::test_sac_ilp_case1

Skipped:
- test_cuda.py::TestCudaMallocAsync::test_clock_speed
- test_cuda.py::TestCudaMallocAsync::test_power_draw
- test_torch.py::TestTorchDeviceTypeCUDA::test_deterministic_cumsum_cuda

Skipped flaky tests on MI300:
- distributed.test_c10d_gloo.py::ProcessGroupGlooTest::test_gather_stress_cuda
- inductor.test_cpu_repro::CPUReproTests::test_lstm_packed_unbatched_False* (256 tests)

Fixed:
- test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_float8_basics_cuda

Features:
- inductor/test_fp8.py - declares a new function to convert FP8 datatypes to ROCm-supported FP8 datatypes. It keeps the same test names for CUDA and ROCm and allows enabling Inductor FP8 tests on CPU

Pull Request resolved: #143673
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/pruthvistony
Co-authored-by: saienduri <saimanas.enduri@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
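
The FP8 feature item in this commit message describes a helper that converts FP8 datatypes to ones ROCm supports. A minimal sketch of that idea follows, using dtype names as plain strings so it runs without PyTorch. The function name `to_rocm_fp8`, the `is_rocm` flag, and the mapping to the `fnuz` variants are assumptions for illustration, not the code added in #143673.

```python
# Assumed mapping from FP8 dtype names used on CUDA to "fnuz" variants.
_ROCM_FP8 = {
    "float8_e4m3fn": "float8_e4m3fnuz",
    "float8_e5m2": "float8_e5m2fnuz",
}


def to_rocm_fp8(dtype_name: str, is_rocm: bool) -> str:
    """Return the dtype name a test should use on the current backend."""
    if is_rocm:
        # Non-FP8 names (and names already converted) pass through unchanged.
        return _ROCM_FP8.get(dtype_name, dtype_name)
    return dtype_name


print(to_rocm_fp8("float8_e4m3fn", is_rocm=True))
print(to_rocm_fp8("float8_e4m3fn", is_rocm=False))
```

Converting inside the test body, rather than in the parametrization, is one way a suite could keep identical test names on both backends, as the commit message says the real helper does.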

github-actions · 2025-03-09T05:34:15Z

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

saienduri requested review from a team and jeffdaily as code owners November 18, 2024 23:20

pytorch-bot bot added the module: rocm ("AMD GPU support for Pytorch") and topic: not user facing ("topic category") labels Nov 18, 2024

pytorchbot added the open source label Nov 18, 2024

jithunnair-amd added the ciflow/trunk label ("Trigger trunk jobs on your pull request") Nov 19, 2024

zou3519 added the triaged label ("This issue has been looked at a team member, and triaged and prioritized into an appropriate module") Nov 19, 2024

jithunnair-amd added labels Nov 21, 2024:
- ciflow/periodic ("Trigger jobs ran periodically on master (periodic.yml) on the PR")
- ciflow/inductor-rocm ("Trigger "inductor" config CI on ROCm")
- ciflow/rocm ("Trigger "default" config CI on ROCm")

jithunnair-amd reviewed Nov 21, 2024

.github/actions/diskspace-cleanup/action.yml (resolved)

jithunnair-amd reviewed Nov 21, 2024

saienduri force-pushed the mi300-cluster branch from 28ef33e to a28d530 November 26, 2024 00:15

jithunnair-amd reviewed Nov 27, 2024

.github/actions/setup-rocm/action.yml (resolved)

saienduri added 7 commits November 27, 2024 11:23

test mi300 cluster for CI

4f9a667

Signed-off-by: saienduri <saimanas.enduri@amd.com>

switch to min of 2 gpu runners

3204bfe

comment out docker setup steps

4ac8c36

This is an empty commit

9563ddf

comment out kilall runner health check

71c3797

check if docker root dir exists

8f1339f

try privileged docker run

dd46713

saienduri added 13 commits November 27, 2024 11:23

try more rocm flags for device permissions

d75cbe1

add render group

abed6f4

change mechanism to add to render group

12aabce

This is an empty commit

5b8fd78

This is an empty commit

7235317

uncomment docker and runner health steps. try --network=host

fd1f36b

comment out prunneccesary cker and killall

2689c08

remove steps that are not needed

4a33e6b

change inductor rocm labels too

0d9436e

add back runner health check on failure

e7373da

this is an empty commit

4c7b1b8

this is an empty commit

fc48575

switch rocm.yml back to original runners

ad85afe

saienduri force-pushed the mi300-cluster branch from 97c2dce to ad85afe November 27, 2024 17:23

jithunnair-amd added the keep-going label ("Don't stop on first failure, keep running tests until the end") Nov 27, 2024

saienduri and others added 3 commits December 3, 2024 22:25

This is an empty commit

e09f5ca

switch rocm.yml back too

8232e18

Update ciflow regex to allow ciflow/inductor-rocm

763179c

This was referenced Dec 23, 2024

[ROCm] Enable post-merge trunk workflow on MI300 runners; skip and fix MI300 related failed tests #143673

Closed

Update tag_regex in filter_test_configs.py for workflows such as inductor-rocm #143768

Closed

github-actions bot added the Stale label Mar 9, 2025

github-actions bot closed this Apr 8, 2025


