Conversation

kwen2501
Contributor

@kwen2501 kwen2501 commented Aug 27, 2025

Stack from ghstack (oldest at bottom):

`test_symmetric_memory.py` hangs like this:

```
SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_False PASSED [5.6364s]
SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_True ...
```

This set of tests parameterizes whether the user sets the device before calling `symm_mem.empty`.
However, such parametrization does not work well with `MultiProcContinuousTest`, because the device set in one test "contaminates" the next test function run in the same process.

The solution is to move the "set device" tests into a separate test suite based on the traditional `MultiProcessTestCase`, which respawns processes for every test.

The hang is gone now.
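
As a minimal sketch of what the relocated suite looks like (the class name, test name, and body below are illustrative, not the actual test code): `MultiProcessTestCase` spawns fresh worker processes for every test, so a `torch.cuda.set_device()` call made inside one test cannot leak into the next, whereas `MultiProcContinuousTest` keeps the same workers alive across test functions.

```python
# Minimal sketch, assuming a 2-GPU setup; names and bodies are illustrative,
# not the actual contents of test_symmetric_memory.py.
import torch
from torch.testing._internal.common_distributed import MultiProcessTestCase
from torch.testing._internal.common_utils import run_tests


class SymmMemSetDeviceTest(MultiProcessTestCase):
    @property
    def world_size(self) -> int:
        return 2

    def setUp(self) -> None:
        super().setUp()
        self._spawn_processes()  # fresh worker processes for every test

    def test_empty_strided_p2p_persistent_set_device(self) -> None:
        # Setting the device here is safe: this process only lives for the
        # duration of this single test, so nothing can be "contaminated".
        torch.cuda.set_device(self.rank)
        ...  # process-group setup and symm_mem.empty() call elided


if __name__ == "__main__":
    run_tests()
```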

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta

[ghstack-poisoned]

pytorch-bot bot commented Aug 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161668

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 822bec9 with merge base 763053d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 added a commit that referenced this pull request Aug 27, 2025
ghstack-source-id: 287dfe3
Pull-Request-resolved: #161668
@pytorch-bot pytorch-bot bot added the ciflow/h100-symm-mem, oncall: distributed, and topic: not user facing labels Aug 27, 2025
@kwen2501 kwen2501 requested review from fegin and ngimel August 27, 2025 23:23
fegin added a commit that referenced this pull request Aug 28, 2025
test_empty_strided_p2p_persistent allocates persistent symm memory tensors. However, it uses the same alloc_id for different tests, which could cause trouble if these tests are run in the same process. This PR fixes the issue by using a different alloc_id for each test.

#161668 should also fix the issue, but we can land this PR as well to make the test safer.
ghstack-source-id: 72f67de
Pull-Request-resolved: #161677
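
For illustration only (the helper below is hypothetical, not the code landed in #161677): the idea is simply that each test gets its own alloc_id instead of every test hardcoding the same constant, so persistent allocations made by different tests running in one long-lived worker process cannot collide.

```python
import itertools

# Hypothetical sketch of the alloc_id idea described above, not the actual
# fix: a module-level counter hands out a fresh alloc_id per test instead of
# every test reusing the same constant.
_alloc_id_counter = itertools.count(start=1)


def next_alloc_id() -> int:
    """Return an alloc_id that no earlier test in this process has used."""
    return next(_alloc_id_counter)
```
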
@kwen2501
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Aug 28, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

fegin added a commit that referenced this pull request Aug 28, 2025
test_empty_strided_p2p_persistent allocates persistent symm memory tensors. However, it uses the same alloc_id for different tests, which could cause trouble if these tests are run in the same process. This PR fixes the issue by using a different alloc_id for each test.

#161668 should also fix the issue, but we can land this PR as well to make the test safer.
ghstack-source-id: 0e89aff
Pull-Request-resolved: #161677
fegin added a commit that referenced this pull request Aug 28, 2025
test_empty_strided_p2p_persistent allocates persistent symm memory tensors. However, it uses the same alloc_id for different tests, which could cause trouble if these tests are run in the same process. This PR fixes the issue by using a different alloc_id for each test.

#161668 should also fix the issue, but we can land this PR as well to make the test safer.
ghstack-source-id: fa91e94
Pull-Request-resolved: #161677
pytorchmergebot pushed a commit that referenced this pull request Aug 29, 2025
test_empty_strided_p2p_persistent allocates persistent symm memory tensors. However, it uses the same alloc_id for different tests, which could cause trouble if these tests are run in the same process. This PR fixes the issue by using a different alloc_id for each test.

#161668 should also fix the issue, but we can land this PR as well to make the test safer.

Pull Request resolved: #161677
Approved by: https://github.com/kwen2501
ghstack dependencies: #161676
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
`test_symmetric_memory.py` hangs like this:
```
SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_False PASSED [5.6364s]
SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_True ...
```

This set of tests parameterizes whether the user sets the device before calling `symm_mem.empty`.
However, such parametrization does not work well with `MultiProcContinuousTest`, because the device set in one test "contaminates" the next test function run in the same process.

The solution is to move the "set device" tests into a separate test suite based on the traditional `MultiProcessTestCase`, which respawns processes for every test.

The hang is gone now.

Pull Request resolved: pytorch#161668
Approved by: https://github.com/fegin
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
test_empty_strided_p2p_persistent allocates persistent symm memory tensors. However, it uses the same alloc_id for different tests, which could cause trouble if these tests are run in the same process. This PR fixes the issue by using a different alloc_id for each test.

pytorch#161668 should also fix the issue, but we can land this PR as well to make the test safer.

Pull Request resolved: pytorch#161677
Approved by: https://github.com/kwen2501
ghstack dependencies: pytorch#161676
@github-actions github-actions bot deleted the gh/kwen2501/219/head branch September 28, 2025 02:16