@jithunnair-amd (Collaborator) commented Dec 23, 2024

  • Will enable us to target periodic/distributed CI jobs at 4-GPU runners through a separate label, linux.rocm.gpu.4
  • Use 2-GPU runners for trunk, pull and slow as well (in addition to inductor-rocm). This currently changes nothing, since all our MI2xx runners carry both the linux.rocm.gpu and linux.rocm.gpu.2 labels, but that will change in the future (see next point)
  • Continue to use the linux.rocm.gpu label for any job that doesn't need more than 1 GPU, e.g. binary test jobs in workflows/generated-linux-binary-manywheel-nightly.yml
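
For readers unfamiliar with label-based routing, here is a minimal Python sketch of the idea behind the points above: a job can only land on a runner that carries every label the job requests. The runner names and the label set of the 4-GPU runner are hypothetical, and `eligible_runners` is an illustrative helper, not part of any PyTorch or GitHub tooling.

```python
# Hypothetical runner inventory; names and label sets are illustrative only.
# The first two mirror the description: MI2xx runners carrying both the
# linux.rocm.gpu and linux.rocm.gpu.2 labels. The third is an assumed
# 4-GPU runner that carries only the new label.
RUNNERS = {
    "runner-a": {"linux.rocm.gpu", "linux.rocm.gpu.2"},
    "runner-b": {"linux.rocm.gpu", "linux.rocm.gpu.2"},
    "runner-c": {"linux.rocm.gpu.4"},
}


def eligible_runners(requested_labels, runners=RUNNERS):
    """Return the sorted names of runners that carry every requested label."""
    wanted = set(requested_labels)
    # A runner qualifies only if the requested labels are a subset of its labels.
    return sorted(name for name, labels in runners.items() if wanted <= labels)


if __name__ == "__main__":
    for label in ("linux.rocm.gpu", "linux.rocm.gpu.2", "linux.rocm.gpu.4"):
        print(label, "->", eligible_runners([label]))
```

Under these assumptions, a job asking for linux.rocm.gpu.4 can only be scheduled on the 4-GPU runner, while jobs asking for linux.rocm.gpu or linux.rocm.gpu.2 currently see the same pool, which is why the second point changes nothing for now.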

cc @jeffdaily @sunway513 @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

pytorch-bot bot commented Dec 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143769

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 75 Pending, 3 Unrelated Failures

As of commit 7035cf6 with merge base 2ab698e:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the ciflow/rocm, module: rocm, and topic: not user facing labels Dec 23, 2024
@jithunnair-amd changed the title from "[ROCm] Use linux.rocm.gpu.2 for 2-GPU runners" to "[ROCm] Use linux.rocm.gpu.2 for 2-GPU and linux.rocm.gpu.4 for 4-GPU runners" Dec 23, 2024
@jithunnair-amd marked this pull request as ready for review December 24, 2024 00:11
@jithunnair-amd requested a review from a team as a code owner December 24, 2024 00:11
@jithunnair-amd marked this pull request as draft December 24, 2024 00:14
@jithunnair-amd added the ciflow/trunk label Dec 24, 2024
@jithunnair-amd (Collaborator, Author) commented

Will merge this once we have added the linux.rocm.gpu.4 label to the ROCm runners.

@jithunnair-amd added the ciflow/periodic label Dec 24, 2024

@jithunnair-amd (Collaborator, Author) commented

Distributed config jobs launched successfully on runners with the linux.rocm.gpu.4 label: https://github.com/pytorch/pytorch/actions/runs/12474788837/job/34820089359

@jithunnair-amd marked this pull request as ready for review December 24, 2024 08:00

@jithunnair-amd (Collaborator, Author) commented

@pytorchbot merge -f "Label updates successfully picked up runners. CI failures are not related to this PR"

@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@github-actions bot deleted the update_rocm_yml_labels branch January 24, 2025 02:03
Labels

ciflow/periodic, ciflow/rocm, ciflow/trunk, Merged, module: rocm, open source, topic: not user facing

4 participants