[SymmMem] Barrier on team instead of world #163298

kwen2501 · 2025-09-18T21:41:56Z

Stack from ghstack (oldest at bottom):

As titled: barrier on the team instead of the world. This avoids a potential hang when running dispatch and combine in subgroups.

The rest just rearranges the tests to create a sub-group test class (no substantial change).
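
To illustrate why a world-wide barrier can hang when only a subgroup takes part, here is a minimal pure-Python sketch. It is an analogy only, not the NVSHMEM code touched by this PR: threads stand in for ranks, `threading.Barrier` stands in for the device-side barrier, and `WORLD_SIZE`, `TEAM_RANKS`, and `run` are hypothetical names chosen for the example. A timeout is used so the "hang" is observable instead of blocking forever.

```python
import threading

WORLD_SIZE = 4        # assumed total number of ranks (hypothetical)
TEAM_RANKS = [0, 1]   # assumed subgroup that runs dispatch/combine (hypothetical)


def run(barrier_parties, timeout=0.5):
    """Let only TEAM_RANKS arrive at a barrier that expects `barrier_parties` ranks."""
    barrier = threading.Barrier(barrier_parties)
    results = {}

    def rank_fn(rank):
        try:
            barrier.wait(timeout=timeout)
            results[rank] = "passed"
        except threading.BrokenBarrierError:
            # The barrier never saw all expected parties before the timeout.
            results[rank] = "hung"

    threads = [threading.Thread(target=rank_fn, args=(r,)) for r in TEAM_RANKS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Barrier sized for the world: ranks outside the team never arrive.
world_result = run(WORLD_SIZE)
# Barrier sized for the team: every expected participant arrives.
team_result = run(len(TEAM_RANKS))

print("world barrier:", world_result)
print("team barrier: ", team_result)
```

In this sketch the world-sized barrier times out for every team rank, while the team-sized barrier completes, which mirrors the subgroup hang described above.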

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci

[ghstack-poisoned]

pytorch-bot · 2025-09-18T21:41:59Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163298

✅ No Failures

As of commit ce9d19b with merge base 4840a1a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 · 2025-09-19T20:06:20Z

@pytorchbot merge

pytorchmergebot · 2025-09-19T20:08:09Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot · 2025-09-19T20:13:54Z

Starting merge as part of PR stack under #162680

Problem: Without MemPool, it looks like the nvshmem backend never deallocates memory.

Cause: Handles in `symm_mems_` (a map) keep references to memory allocations.

Solution:
- Remove the reference to the allocation from handles -- the reference is never used anyway.
- Use `unique_ptr` instead of `shared_ptr` to wrap the allocation to ensure single ownership.

Pull Request resolved: #162680
Approved by: https://github.com/ezyang
ghstack dependencies: #163298
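
The cause described above (an unused reference held by handles in a map keeps the allocation alive) can be sketched in Python as an analogy. This is not the C++ code from #162680: `Allocation`, `HandleWithRef`, `HandleNoRef`, and the `symm_mems` dict are hypothetical stand-ins, and the check relies on CPython's reference counting freeing an object as soon as its last strong reference goes away.

```python
import weakref


class Allocation:
    """Hypothetical stand-in for a symmetric memory allocation."""

    def __init__(self, nbytes):
        self.nbytes = nbytes


class HandleWithRef:
    """Handle that keeps a strong (and unused) reference to the allocation."""

    def __init__(self, alloc):
        self.alloc = alloc


class HandleNoRef:
    """Handle that copies what it needs and keeps no reference."""

    def __init__(self, alloc):
        self.nbytes = alloc.nbytes


def is_freed_after_release(handle_cls):
    symm_mems = {}                  # hypothetical stand-in for the `symm_mems_` map
    alloc = Allocation(1024)
    probe = weakref.ref(alloc)      # observes the allocation without owning it
    symm_mems["buf"] = handle_cls(alloc)
    del alloc                       # the single owner releases the allocation
    return probe() is None          # True if the allocation was actually freed


leaks = not is_freed_after_release(HandleWithRef)
fixed = is_freed_after_release(HandleNoRef)

print("handle holds reference, allocation leaked:", leaks)
print("handle holds no reference, allocation freed:", fixed)
```

In the C++ fix, `unique_ptr` plays the role of the single owner here; Python has no direct equivalent, so the sketch only shows the effect of dropping the extra reference.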

kwen2501 · 2025-09-19T21:32:06Z

@pytorchbot cherry-pick --onto release/2.9 -c critical

pytorchbot · 2025-09-19T21:38:24Z

Cherry picking #163298

The cherry pick PR is at #163376 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:

[v.2.9.0] Release Tracker #162497 (comment)

[SymmMem] Fix memory allocation hold-up (#162680)

Problem: Without MemPool, it looks like the nvshmem backend never deallocates memory.

Cause: Handles in `symm_mems_` (a map) keep references to memory allocations.

Solution:
- Remove the reference to the allocation from handles -- the reference is never used anyway.
- Use `unique_ptr` instead of `shared_ptr` to wrap the allocation to ensure single ownership.

Pull Request resolved: #162680
Approved by: https://github.com/ezyang
ghstack dependencies: #163298
(cherry picked from commit 7130b17)

Co-authored-by: Ke Wen <kw2501@meta.com>

[SymmMem] Barrier on team instead of world (#163298)

As titled. This avoids a potential hang when running dispatch and combine in subgroups.

The rest just rearranges the tests to create a sub-group test class (no substantial change).

Pull Request resolved: #163298
Approved by: https://github.com/fegin
(cherry picked from commit f8fb437)

Co-authored-by: Ke Wen <kw2501@meta.com>

Update

9a179e9

[ghstack-poisoned]

pytorch-bot bot added the ciflow/h100-symm-mem, oncall: distributed, and release notes: distributed (c10d) labels Sep 18, 2025

This was referenced Sep 18, 2025

[SymmMem] Fix memory allocation hold-up #162680

Closed

[SymmMem] Add num_active_allocations and dealloc checks #162681

Open

kwen2501 requested review from fduwjj, fegin and ngimel September 18, 2025 21:45

Update

ce9d19b

[ghstack-poisoned]

kwen2501 added the ciflow/trunk label Sep 19, 2025

fegin approved these changes Sep 19, 2025

pytorchmergebot added the merging label Sep 19, 2025

pytorchmergebot closed this in f8fb437 Sep 19, 2025

pytorchmergebot added the Merged label Sep 19, 2025

pytorchbot mentioned this pull request Sep 19, 2025

[SymmMem] Fix memory allocation hold-up #163375

Merged

pytorchbot mentioned this pull request Sep 19, 2025

[SymmMem] Barrier on team instead of world #163376

Merged

pytorchbot mentioned this pull request Sep 19, 2025

[v.2.9.0] Release Tracker #162497

Closed

pytorchmergebot removed the merging label Sep 19, 2025

github-actions bot deleted the gh/kwen2501/254/head branch October 20, 2025 02:18

4 participants
