Conversation

kwen2501
Contributor

@kwen2501 kwen2501 commented Aug 27, 2025

Stack from ghstack (oldest at bottom):

`test_symmetric_memory.py` hangs like this:

```
SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_False PASSED [5.6364s]
SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_True ...
```

This set of tests parameterizes whether the user sets the device before calling `symm_mem.empty`.
However, such parametrization does not work well with `MultiProcContinuousTest`, because the device set in one test "contaminates" the next test function run in the same process.

The solution is to move the "set device" tests into a separate test suite based on the traditional `MultiProcessTestCase`, which respawns processes for every test.

The hang is gone now.
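
As a minimal sketch of what the relocated suite looks like (the class name, test name, and body below are illustrative, not the actual test code): `MultiProcessTestCase` spawns fresh worker processes for every test, so a `torch.cuda.set_device()` call made inside one test cannot leak into the next, whereas `MultiProcContinuousTest` keeps the same workers alive across test functions.

```python
# Minimal sketch, assuming a 2-GPU setup; names and bodies are illustrative,
# not the actual contents of test_symmetric_memory.py.
import torch
from torch.testing._internal.common_distributed import MultiProcessTestCase
from torch.testing._internal.common_utils import run_tests


class SymmMemSetDeviceTest(MultiProcessTestCase):
    @property
    def world_size(self) -> int:
        return 2

    def setUp(self) -> None:
        super().setUp()
        self._spawn_processes()  # fresh worker processes for every test

    def test_empty_strided_p2p_persistent_set_device(self) -> None:
        # Setting the device here is safe: this process only lives for the
        # duration of this single test, so nothing can be "contaminated".
        torch.cuda.set_device(self.rank)
        ...  # process-group setup and symm_mem.empty() call elided


if __name__ == "__main__":
    run_tests()
```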

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta

[ghstack-poisoned]

pytorch-bot bot commented Aug 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161668

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 822bec9 with merge base 763053d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 added a commit that referenced this pull request Aug 27, 2025
ghstack-source-id: 287dfe3
Pull-Request-resolved: #161668
@pytorch-bot pytorch-bot bot added the ciflow/h100-symm-mem, oncall: distributed, and topic: not user facing labels Aug 27, 2025
@kwen2501 kwen2501 requested review from fegin and ngimel August 27, 2025 23:23
fegin added a commit that referenced this pull request Aug 28, 2025
test_empty_strided_p2p_persistent allocates persistent symm memory tensors. However, it uses the same alloc_id for different tests, which could cause trouble if these tests are run in the same process. This PR fixes the issue by using a different alloc_id for each test.

#161668 should also fix the issue, but we can land this PR as well to make the test safer.
ghstack-source-id: 72f67de
Pull-Request-resolved: #161677
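
For illustration only (the helper below is hypothetical, not the code landed in #161677): the idea is simply that each test gets its own alloc_id instead of every test hardcoding the same constant, so persistent allocations made by different tests running in one long-lived worker process cannot collide.

```python
import itertools

# Hypothetical sketch of the alloc_id idea described above, not the actual
# fix: a module-level counter hands out a fresh alloc_id per test instead of
# every test reusing the same constant.
_alloc_id_counter = itertools.count(start=1)


def next_alloc_id() -> int:
    """Return an alloc_id that no earlier test in this process has used."""
    return next(_alloc_id_counter)
```
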
@kwen2501
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Aug 28, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

fegin added a commit that referenced this pull request Aug 28, 2025
test_empty_strided_p2p_persistent allocates persistent symm memory tensors. However, it uses the same alloc_id for different tests, which could cause trouble if these tests are run in the same process. This PR fixes the issue by using a different alloc_id for each test.

#161668 should also fix the issue, but we can land this PR as well to make the test safer.
ghstack-source-id: 0e89aff
Pull-Request-resolved: #161677
fegin added a commit that referenced this pull request Aug 28, 2025
test_empty_strided_p2p_persistent allocates persistent symm memory tensors. However, it uses the same alloc_id for different tests, which could cause trouble if these tests are run in the same process. This PR fixes the issue by using a different alloc_id for each test.

#161668 should also fix the issue, but we can land this PR as well to make the test safer.
ghstack-source-id: fa91e94
Pull-Request-resolved: #161677
pytorchmergebot pushed a commit that referenced this pull request Aug 29, 2025
test_empty_strided_p2p_persistent allocates persistent symm memory tensors. However, it uses the same alloc_id for different tests, which could cause trouble if these tests are run in the same process. This PR fixes the issue by using a different alloc_id for each test.

#161668 should also fix the issue, but we can land this PR as well to make the test safer.

Pull Request resolved: #161677
Approved by: https://github.com/kwen2501
ghstack dependencies: #161676
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
`test_symmetric_memory.py` hangs like this:
```
SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_False PASSED [5.6364s]
SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_True ...
```

This set of tests parameterizes whether the user sets the device before calling `symm_mem.empty`.
However, such parametrization does not work well with `MultiProcContinuousTest`, because the device set in one test "contaminates" the next test function run in the same process.

The solution is to move the "set device" tests into a separate test suite based on the traditional `MultiProcessTestCase`, which respawns processes for every test.

The hang is gone now.

Pull Request resolved: pytorch#161668
Approved by: https://github.com/fegin
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
test_empty_strided_p2p_persistent allocates persistent symm memory tensors. However, it uses the same alloc_id for different tests, which could cause trouble if these tests are run in the same process. This PR fixes the issue by using a different alloc_id for each test.

pytorch#161668 should also fix the issue, but we can land this PR as well to make the test safer.

Pull Request resolved: pytorch#161677
Approved by: https://github.com/kwen2501
ghstack dependencies: pytorch#161676
@github-actions github-actions bot deleted the gh/kwen2501/219/head branch September 28, 2025 02:16