[DO NOT MERGE] Test new MI300 ROCm CI nodes #140989
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140989

Note: Links to docs will display an error until the docs builds have been completed.

❌ 15 New Failures, 1 Unrelated Failure as of commit 763179c with merge base fd553b9.

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot ciflow rerun

❌ 🤖 pytorchbot command failed:
Try
@pytorchbot rebase

You don't have permissions to rebase this PR since you are a first-time contributor. If you think this is a mistake, please contact PyTorch Dev Infra.
@pytorchbot label ciflow/periodic

Can't add the following labels to PR: ciflow/periodic. Please ping one of the reviewers for help.
```yaml
shell: bash
run: |
  killall runsvc.sh
```
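To address the concern below about silently skipping health checks, a step like the one above could both take the runner offline and still fail the job. This is a hedged sketch, not this PR's actual script: the `check_gpu_health` function is a stub, and a real check might query `rocm-smi` instead.

```shell
# Hypothetical health-check step; the check itself is a stub (assumption).
check_gpu_health() {
  # A real script might query `rocm-smi` here; stubbed to succeed for the sketch.
  return 0
}

if check_gpu_health; then
  echo "health check passed"
else
  echo "health check failed; taking runner offline"
  # killall runsvc.sh   # stop the self-hosted runner service, as in the step above
  exit 1                # but still fail the job so the failure stays visible
fi
```

The point of the `exit 1` is that killing the runner service alone would leave the job's status ambiguous; failing explicitly keeps the unhealthy node visible on the dashboard.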
@saienduri Is this meant to be temporary? While some checks might need to be disabled for MI300 nodes, it seems like a good idea to still fail if the applicable health checks fail?
Yes, meant to be temporary (deleted while dealing with other failures). Added back now, thanks
Force-pushed from 28ef33e to a28d530.
```sh
cat ~/.docker/config.json || true
# https://stackoverflow.com/questions/64455468/error-when-logging-into-ecr-with-docker-login-error-saving-credentials-not
rm -f ~/.docker/config.json
```
@saienduri @amdfaa We need to make sure the changes here work not just for the MI300 nodes but also for our existing CI nodes. Please discuss offline how to resolve that. As far as I understand, setting DOCKER_HOST is necessary to keep our existing nodes working?
I should be able to conditionalize on the type of node being used. The step on line 8 appears to be necessary, but I don't think the one on line 12 is.
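Conditionalizing on node type could look roughly like the following. This is an illustrative sketch only: the `RUNNER_TYPE` variable, the `mi300` value, and the socket path are assumptions, not taken from this PR.

```shell
# Hypothetical sketch: branch the Docker setup on runner type.
# RUNNER_TYPE and the "mi300" value are illustrative assumptions.
RUNNER_TYPE="${RUNNER_TYPE:-mi2xx}"

case "$RUNNER_TYPE" in
  mi300*)
    # MI300 nodes: use the default Docker socket, no override needed
    unset DOCKER_HOST
    ;;
  *)
    # existing nodes reportedly need DOCKER_HOST pointing at their daemon
    export DOCKER_HOST="${DOCKER_HOST:-unix:///var/run/docker.sock}"
    ;;
esac

echo "runner type: $RUNNER_TYPE"
```

As the discussion below concludes, this kind of per-machine branching arguably belongs in runner setup rather than in the workflow files.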
You can add environment variables as part of the runner startup, if needed for some machines, in the actions-runner/.env file.
We shouldn't be doing any machine-specific setup in the workflow files @amdfaa
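For illustration, such an .env file might look like the fragment below; the variable values are assumptions, not taken from this PR.

```
# actions-runner/.env — read by the runner service at startup.
# Values are illustrative assumptions, not taken from this PR.
DOCKER_HOST=unix:///var/run/docker.sock
```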
@amdfaa This seems like a reasonable suggestion. Can you please update your runner setup scripts to use the actions-runners/.env file to set any required env variables for the runner, and update all the non-MI300 runners to have that file?
Let us know if you have any concerns.
A big concern is starting and stopping the runners: the actions-runners/.env file has to be set prior to starting the service.
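The ordering concern can be sketched as follows. The `svc.sh` calls follow the standard GitHub self-hosted runner layout, and `RUNNER_DIR` is a stand-in directory; treat this as an assumption about the setup, not this PR's actual script (the real service commands are stubbed so the sketch is runnable without privileges).

```shell
# Sketch of the ordering: write .env before (re)starting the runner service.
RUNNER_DIR="${RUNNER_DIR:-$(mktemp -d)}"   # stand-in for actions-runner/

run_svc() {
  # in a real setup this would be: (cd "$RUNNER_DIR" && sudo ./svc.sh "$1")
  echo "svc.sh $1"
}

run_svc stop                               # stop the service first
cat > "$RUNNER_DIR/.env" <<'EOF'
DOCKER_HOST=unix:///var/run/docker.sock
EOF
run_svc start                              # the restarted service picks up .env
```

The key point is simply that the .env write happens between the stop and the start, since the runner reads the file only at service startup.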
@amdfaa I believe we are now done with setting up the actions-runners/.env files on all the MI2xx runners, so are we good with this change?
Signed-off-by: saienduri <saimanas.enduri@amd.com>
Force-pushed from 97c2dce to ad85afe.
…uctor-rocm` (#143768)

This helps to make `continue-through-error`/`keep-going` work as expected on `inductor-rocm` workflow jobs. Without this, the code here doesn't enter the `if` condition: https://github.com/pytorch/pytorch/blob/6ccb8ed1868984d9d2ea4e48a085508d1027cd9b/.github/scripts/filter_test_configs.py#L577

Tested via [this PR](#140989):
- Without this change: https://hud.pytorch.org/pytorch/pytorch/pull/140989?sha=8232e18957f987d99c946efc0cf6da9be9b52067: https://github.com/pytorch/pytorch/actions/runs/12164558045/job/34192442187#step:13:144
- With this change: https://hud.pytorch.org/pytorch/pytorch/pull/140989?sha=763179c5e421791ee05c8e2a600379b29a1c8c33: https://github.com/pytorch/pytorch/actions/runs/12261943684/job/34213300153#step:13:145

Pull Request resolved: #143768
Approved by: https://github.com/huydhn
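The gist of that fix can be sketched as a workflow-name check. This is a hedged illustration only: the real logic is Python in .github/scripts/filter_test_configs.py and may differ in detail; the function name here is invented for the sketch.

```shell
# Hypothetical sketch: whether keep-going handling applies to a workflow name.
# Before the fix, the check reportedly matched "inductor" but not "inductor-rocm".
keep_going_applies() {
  case "$1" in
    inductor|inductor-rocm) return 0 ;;  # after the fix, the rocm variant matches too
    *) return 1 ;;
  esac
}

keep_going_applies "inductor-rocm" && echo "keep-going honored for inductor-rocm"
```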
…x MI300 related failed tests (#143673)

This PR:
- makes changes to the workflow files and scripts so we can run CI workflows on the MI300 runners
- skips and fixes several tests that failed on MI300, observed in #140989

Skipped due to unsupported Float8_e4m3fn data type on MI300 (need to update test code to use datatypes supported by MI300):
- distributed.tensor.parallel.test_micro_pipeline_tp.py::MicroPipelineTPTest::test_fuse_all_gather_scaled_matmul_A_dims_\*_gather_dim_\* (24 tests across inductor/distributed configs)
- distributed.tensor.parallel.test_micro_pipeline_tp.py::test_fuse_scaled_matmul_reduce_scatter_A_dims_\*_scatter_dim_\* (12 tests across inductor/distributed configs)
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_cast_and_t
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_pattern_2

Skipped due to AssertionError on MI300:
- inductor.test_mkldnn_pattern_matcher.py::test_qconv2d_int8_mixed_bf16
- distributed._tools.test_sac_ilp::TestSACILP::test_sac_ilp_case1

Skipped:
- test_cuda.py::TestCudaMallocAsync::test_clock_speed
- test_cuda.py::TestCudaMallocAsync::test_power_draw
- test_torch.py::TestTorchDeviceTypeCUDA::test_deterministic_cumsum_cuda

Skipped flaky tests on MI300:
- distributed.test_c10d_gloo.py::ProcessGroupGlooTest::test_gather_stress_cuda
- inductor.test_cpu_repro::CPUReproTests::test_lstm_packed_unbatched_False* (256 tests)

Fixed:
- test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_float8_basics_cuda

Features:
- inductor/test_fp8.py: declares a new function to convert FP8 datatypes to ROCm-supported FP8 datatypes. It keeps test names for CUDA and ROCm and allows enabling Inductor FP8 tests on CPU.

Pull Request resolved: #143673
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/pruthvistony
Co-authored-by: saienduri <saimanas.enduri@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as stale.
This commit enables PyTorch ROCm CI on MI300 nodes. It also removes a few steps that shouldn't be in the workflows: they are not universal to all runners and should be handled as part of the runner setup itself.
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd