Conversation

@ngimel (Collaborator) commented Aug 28, 2025

@pytorch-bot bot commented Aug 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161741

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit e05819a with merge base 82f63c8:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the ciflow/h100-symm-mem, ciflow/rocm (Trigger "default" config CI on ROCm), module: rocm (AMD GPU support for PyTorch), oncall: distributed (Add this issue/PR to distributed oncall triage queue), and release notes: distributed (c10d) (release notes category) labels on Aug 28, 2025
@ngimel (Collaborator, Author) commented Aug 28, 2025

@yuankaichen-amd my PR couldn't possibly have caused those ROCm failures; I'll leave it to you to figure them out and land the PR.

@yuankaichen-amd

Hi Natalia,

Please rebase your branch onto head and try again. The two failures are unrelated to this PR:

  1. test_baddmm_search_space_EXHAUSTIVE: we should disable this test, IIUC. @chuanqi129, please comment on whether this has already been disabled for ROCm; if not, how should we disable it?
  2. test_compile_selective_checkpoint_custom_rule_cuda: @ngimel, your branch is somehow out of date. This parameter was removed and the removal was later reverted, so rebasing onto head should fix it.

@ngimel (Collaborator, Author) commented Sep 2, 2025

@pytorchbot rebase -b main

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased ngimel/rocm_handle_type onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout ngimel/rocm_handle_type && git pull --rebase)

@jeffdaily (Collaborator)

Please note that the ciflow/rocm runners (MI200s) are having infra issues, so that signal will take a long time. For this PR the relevant test only runs on MI300 anyway, so you can force-merge once you get a good MI300 signal.

@jeffdaily added the ciflow/periodic-rocm-mi300 (Trigger "distributed" config CI on ROCm MI300) label on Sep 2, 2025
@ngimel (Collaborator, Author) commented Sep 3, 2025

@yuankaichen-amd I see the CI is still failing for reasons unrelated to my PR. Feel free to take this PR over and do with it what you want.

@jeffdaily (Collaborator)

@pytorchbot rebase

@jeffdaily removed the ciflow/rocm (Trigger "default" config CI on ROCm) label on Sep 3, 2025
@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Tried to rebase and push PR #161741, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

@jeffdaily (Collaborator)

@pytorchbot merge -f "unrelated rocm failures"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@yuankaichen-amd

Thank you all!

markc-614 pushed commits to markc-614/pytorch that referenced this pull request Sep 17, 2025
mansiag05 pushed commits to mansiag05/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed commits to dsashidh/pytorch that referenced this pull request Sep 26, 2025
@github-actions bot deleted the ngimel/rocm_handle_type branch on October 4, 2025 02:05
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Oct 14, 2025

Labels

ciflow/h100-symm-mem, ciflow/periodic-rocm-mi300 (Trigger "distributed" config CI on ROCm MI300), Merged, module: rocm (AMD GPU support for PyTorch), oncall: distributed (Add this issue/PR to distributed oncall triage queue), release notes: distributed (c10d) (release notes category)


Development

Successfully merging this pull request may close these issues.

Symmetric memory seems broken for AMD GPUs in Pytorch nightly: "RuntimeError: handle_type_ != Expandable_Segments_Handle_Type::UNSPECIFIED"
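
For context on that issue, here is a minimal, hypothetical sketch of the kind of multi-GPU symmetric-memory usage that reportedly triggered the RuntimeError on ROCm. It is not code from this PR; it assumes the private torch.distributed._symmetric_memory API (empty/rendezvous), whose signatures are unstable and may differ between nightlies, and the script name and buffer size are made up for illustration.

```python
# Hypothetical repro sketch, not taken from this PR. Assumes the private
# torch.distributed._symmetric_memory API (empty/rendezvous), which is
# unstable and may change between nightlies.
# Launch with e.g.: torchrun --nproc_per_node=2 symm_mem_smoke.py
import os

import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem


def main() -> None:
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)      # ROCm builds expose HIP devices under the "cuda" namespace
    dist.init_process_group("nccl")  # backed by RCCL on ROCm

    # Allocating and rendezvousing a symmetric-memory buffer is the step the
    # linked issue reports failing on AMD GPUs with
    # "handle_type_ != Expandable_Segments_Handle_Type::UNSPECIFIED".
    buf = symm_mem.empty(1024, dtype=torch.float32, device=f"cuda:{rank}")
    # On some versions rendezvous may expect a group name string rather than
    # a ProcessGroup object.
    symm_mem.rendezvous(buf, dist.group.WORLD)

    buf.fill_(float(rank))
    dist.barrier()
    print(f"rank {rank}: symmetric-memory rendezvous succeeded")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With the allocator handle-type fix from this PR, the expectation is that each rank on an MI300 node prints the success line instead of raising the handle-type error.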

5 participants