Conversation

@ngimel (Collaborator) commented Aug 28, 2025

@pytorch-bot bot commented Aug 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161741

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit e05819a with merge base 82f63c8:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the ciflow/h100-symm-mem, ciflow/rocm (Trigger "default" config CI on ROCm), module: rocm (AMD GPU support for PyTorch), oncall: distributed (Add this issue/PR to distributed oncall triage queue), and release notes: distributed (c10d) (release notes category) labels on Aug 28, 2025
@ngimel (Collaborator, Author) commented Aug 28, 2025

@yuankaichen-amd my PR couldn't possibly have caused those ROCm failures; I'll leave it to you to figure them out and land the PR.

@yuankaichen-amd

Hi Natalia,

Please rebase your branch onto head and try again. The two failures are unrelated to this PR:

  1. test_baddmm_search_space_EXHAUSTIVE: we should disable this test, IIUC. @chuanqi129, please comment on whether this has already been disabled for ROCm; if not, how should we disable it?
  2. test_compile_selective_checkpoint_custom_rule_cuda: @ngimel, your branch is somehow out of date. This parameter was removed and the removal was later reverted, so rebasing onto head should fix it.

@ngimel (Collaborator, Author) commented Sep 2, 2025

@pytorchbot rebase -b main

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased ngimel/rocm_handle_type onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout ngimel/rocm_handle_type && git pull --rebase)

@jeffdaily (Collaborator)

Please note that the ciflow/rocm runners (MI200s) are having infra issues, so that signal will take a long time. For this PR the relevant test only runs on MI300 anyway, so you can force-merge once you get a good MI300 signal.

@jeffdaily added the ciflow/periodic-rocm-mi300 (Trigger "distributed" config CI on ROCm MI300) label on Sep 2, 2025
@ngimel (Collaborator, Author) commented Sep 3, 2025

@yuankaichen-amd I see the CI is still failing for reasons unrelated to my PR. Feel free to take this PR over and do with it what you want.

@jeffdaily (Collaborator)

@pytorchbot rebase

@jeffdaily removed the ciflow/rocm (Trigger "default" config CI on ROCm) label on Sep 3, 2025
@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Tried to rebase and push PR #161741, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

@jeffdaily (Collaborator)

@pytorchbot merge -f "unrelated rocm failures"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@yuankaichen-amd

Thank you all!

markc-614 pushed commits to markc-614/pytorch that referenced this pull request Sep 17, 2025
mansiag05 pushed commits to mansiag05/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed commits to dsashidh/pytorch that referenced this pull request Sep 26, 2025
@github-actions bot deleted the ngimel/rocm_handle_type branch on October 4, 2025 02:05
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Oct 14, 2025

Labels

ciflow/h100-symm-mem, ciflow/periodic-rocm-mi300 (Trigger "distributed" config CI on ROCm MI300), Merged, module: rocm (AMD GPU support for PyTorch), oncall: distributed (Add this issue/PR to distributed oncall triage queue), release notes: distributed (c10d) (release notes category)


Development

Successfully merging this pull request may close these issues.

Symmetric memory seems broken for AMD GPUs in Pytorch nightly: "RuntimeError: handle_type_ != Expandable_Segments_Handle_Type::UNSPECIFIED"
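
For context on that issue, here is a minimal, hypothetical sketch of the kind of multi-GPU symmetric-memory usage that reportedly triggered the RuntimeError on ROCm. It is not code from this PR; it assumes the private torch.distributed._symmetric_memory API (empty/rendezvous), whose signatures are unstable and may differ between nightlies, and the script name and buffer size are made up for illustration.

```python
# Hypothetical repro sketch, not taken from this PR. Assumes the private
# torch.distributed._symmetric_memory API (empty/rendezvous), which is
# unstable and may change between nightlies.
# Launch with e.g.: torchrun --nproc_per_node=2 symm_mem_smoke.py
import os

import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem


def main() -> None:
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)      # ROCm builds expose HIP devices under the "cuda" namespace
    dist.init_process_group("nccl")  # backed by RCCL on ROCm

    # Allocating and rendezvousing a symmetric-memory buffer is the step the
    # linked issue reports failing on AMD GPUs with
    # "handle_type_ != Expandable_Segments_Handle_Type::UNSPECIFIED".
    buf = symm_mem.empty(1024, dtype=torch.float32, device=f"cuda:{rank}")
    # On some versions rendezvous may expect a group name string rather than
    # a ProcessGroup object.
    symm_mem.rendezvous(buf, dist.group.WORLD)

    buf.fill_(float(rank))
    dist.barrier()
    print(f"rank {rank}: symmetric-memory rendezvous succeeded")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With the allocator handle-type fix from this PR, the expectation is that each rank on an MI300 node prints the success line instead of raising the handle-type error.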

5 participants