@saienduri saienduri commented Nov 18, 2024

This commit enables PyTorch ROCm CI on MI300 nodes. It also removes a few steps that shouldn't be in the workflows: they are not universal to all runners and should be handled as part of the runner setup itself.

Fixes #ISSUE_NUMBER

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@saienduri saienduri requested review from a team and jeffdaily as code owners November 18, 2024 23:20

linux-foundation-easycla bot commented Nov 18, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@pytorch-bot pytorch-bot bot added module: rocm AMD GPU support for Pytorch topic: not user facing topic category labels Nov 18, 2024

pytorch-bot bot commented Nov 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140989

Note: Links to docs will display an error until the docs builds have been completed.

❌ 15 New Failures, 1 Unrelated Failure

As of commit 763179c with merge base fd553b9 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jithunnair-amd jithunnair-amd added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 19, 2024

saienduri commented Nov 19, 2024

@pytorchbot ciflow rerun


pytorch-bot bot commented Nov 19, 2024

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'ciflow' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick', 'close')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

saienduri commented

@pytorchbot rebase


pytorch-bot bot commented Nov 19, 2024

You don't have permissions to rebase this PR since you are a first time contributor. If you think this is a mistake, please contact PyTorch Dev Infra.

@zou3519 zou3519 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Nov 19, 2024
saienduri commented

@pytorchbot label ciflow/periodic


pytorch-bot bot commented Nov 21, 2024

Can't add following labels to PR: ciflow/periodic. Please ping one of the reviewers for help.

@jithunnair-amd jithunnair-amd added ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/inductor-rocm Trigger "inductor" config CI on ROCm ciflow/rocm Trigger "default" config CI on ROCm labels Nov 21, 2024
shell: bash
run: |
  killall runsvc.sh
Collaborator

@saienduri Is this meant to be temporary? While some checks might need to be disabled for MI300 nodes, it seems like a good idea to still fail if the applicable health checks fail?

Collaborator Author

Yes, meant to be temporary (deleted while dealing with other failures). Added back now, thanks

cat ~/.docker/config.json || true
# https://stackoverflow.com/questions/64455468/error-when-logging-into-ecr-with-docker-login-error-saving-credentials-not
rm -f ~/.docker/config.json
Collaborator

@saienduri @amdfaa We need to make sure the changes here work not just for the MI300 nodes but also for our existing CI nodes. Please discuss offline how to resolve that. AFAIU, setting DOCKER_HOST is necessary to keep our existing nodes working?

Contributor

I should be able to make this conditional on the type of node being used. The step on line 8 appears to be necessary, but I don't think the one on line 12 is.

saienduri commented Dec 3, 2024

You can add environment variables as part of the runner startup, if needed for some machines, in the actions-runner/.env file. [screenshot]
We shouldn't be doing any machine-specific setup in the workflow files, @amdfaa.
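
A minimal sketch of how a runner setup script might maintain that `.env` file, assuming the runner lives in a directory such as `~/actions-runner`. The helper name `upsert_env_var` and the `DOCKER_HOST` value in the usage comment are hypothetical placeholders, not taken from this PR. The runner reads `.env` when it starts, so the service would need a restart after any change.

```python
from pathlib import Path


def upsert_env_var(env_file: Path, key: str, value: str) -> None:
    """Set KEY=VALUE in a dotenv-style file, replacing any existing entry for KEY."""
    lines = env_file.read_text().splitlines() if env_file.exists() else []
    # Drop any earlier assignment of this key so the file holds exactly one entry.
    kept = [line for line in lines if not line.startswith(f"{key}=")]
    kept.append(f"{key}={value}")
    env_file.write_text("\n".join(kept) + "\n")


# Hypothetical usage (path and value are assumptions; adjust per machine):
#   upsert_env_var(Path.home() / "actions-runner" / ".env",
#                  "DOCKER_HOST", "unix:///run/user/1000/docker.sock")
```

Because the helper replaces an existing entry rather than appending a duplicate, it can be re-run safely from a provisioning script.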

Collaborator

@amdfaa This seems like a reasonable suggestion. Can you please update your runner setup scripts to use the actions-runner/.env file to set any required env variables for the runner, and update all the non-MI300 runners to have that file?
Let us know if you have any concerns.

Contributor

A big concern is having to stop and restart the runners: the actions-runner/.env file has to be in place before the service starts.

Collaborator

@amdfaa I believe we are now done with setting up the actions-runner/.env files on all the MI2xx runners, so are we good with this change?

@jithunnair-amd jithunnair-amd added the keep-going Don't stop on first failure, keep running tests until the end label Nov 27, 2024
pytorchmergebot pushed a commit that referenced this pull request Jan 9, 2025
…x MI300 related failed tests (#143673)

This PR
* makes changes to the workflow files and scripts so we can run CI workflows on the MI300 runners
* skips and fixes several tests that failed on MI300, as observed in #140989

Skipped due to unsupported Float8_e4m3fn data type on MI300 (need to update test code to use datatypes supported by MI300):
- distributed.tensor.parallel.test_micro_pipeline_tp.py::MicroPipelineTPTest::test_fuse_all_gather_scaled_matmul_A_dims_*_gather_dim_* (24 tests across inductor/distributed configs)
- distributed.tensor.parallel.test_micro_pipeline_tp.py::test_fuse_scaled_matmul_reduce_scatter_A_dims_*_scatter_dim_* (12 tests across inductor/distributed configs)
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_cast_and_t
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_pattern_2

Skipped due to AssertionError on MI300:
- inductor.test_mkldnn_pattern_matcher.py::test_qconv2d_int8_mixed_bf16
- distributed._tools.test_sac_ilp::TestSACILP::test_sac_ilp_case1

Skipped:
- test_cuda.py::TestCudaMallocAsync::test_clock_speed
- test_cuda.py::TestCudaMallocAsync::test_power_draw
- test_torch.py::TestTorchDeviceTypeCUDA::test_deterministic_cumsum_cuda

Skipped flaky tests on MI300:
- distributed.test_c10d_gloo.py::ProcessGroupGlooTest::test_gather_stress_cuda
- inductor.test_cpu_repro::CPUReproTests::test_lstm_packed_unbatched_False* (256 tests)

Fixed:
- test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_float8_basics_cuda

Features:
- inductor/test_fp8.py - declares a new function that converts FP8 datatypes to ROCm-supported FP8 datatypes. This keeps the test names unchanged for CUDA and ROCm and allows the Inductor FP8 tests to be enabled on CPU
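
A rough illustration of the kind of FP8 datatype conversion described in the Features item, not the function added in inductor/test_fp8.py. It assumes MI300 uses the FNUZ variants (`float8_e4m3fnuz`, `float8_e5m2fnuz`) in place of `float8_e4m3fn` and `float8_e5m2`, and it works on dtype names so that it runs without PyTorch installed. The function and table names are hypothetical.

```python
# Assumed mapping from the default FP8 dtype names to the FNUZ variants
# supported on MI300; verify against your ROCm and PyTorch versions.
_ROCM_FP8_EQUIVALENTS = {
    "float8_e4m3fn": "float8_e4m3fnuz",
    "float8_e5m2": "float8_e5m2fnuz",
}


def fp8_dtype_name_for_platform(dtype_name: str, is_rocm: bool) -> str:
    """Return the dtype name a test should use on the current platform."""
    if not is_rocm:
        # CUDA and CPU runs keep the dtype the test was parametrized with.
        return dtype_name
    # Non-FP8 dtypes (and FP8 dtypes without a known equivalent) pass through.
    return _ROCM_FP8_EQUIVALENTS.get(dtype_name, dtype_name)


# In a real test, `is_rocm` could come from `torch.version.hip is not None`,
# and the returned name could be resolved with `getattr(torch, name)`.
```

Doing the conversion inside the test body, rather than in the parametrization, is what lets one test name cover both platforms.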

Pull Request resolved: #143673
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/pruthvistony

Co-authored-by: saienduri <saimanas.enduri@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>

github-actions bot commented Mar 9, 2025

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Mar 9, 2025
@github-actions github-actions bot closed this Apr 8, 2025
Labels

- ciflow/inductor-rocm: Trigger "inductor" config CI on ROCm
- ciflow/periodic: Trigger jobs ran periodically on master (periodic.yml) on the PR
- ciflow/rocm: Trigger "default" config CI on ROCm
- ciflow/trunk: Trigger trunk jobs on your pull request
- keep-going: Don't stop on first failure, keep running tests until the end
- module: rocm: AMD GPU support for Pytorch
- open source
- Stale
- topic: not user facing: topic category
- triaged: This issue has been looked at a team member, and triaged and prioritized into an appropriate module

5 participants