[a2av] Test must allocate tensors symmetrically #155835
No code enqueues entries into `ptr_to_symm_mem_`, so it is always empty. This PR removes it and serves the functionality that relied on it via the `allocations_` map. Pull Request resolved: #155968 Approved by: https://github.com/Skylion007 ghstack dependencies: #155506, #155835
`NVSHMEMSymmetricMemory.cu` and `nvshmem_extension.cu` are now built under the same compilation condition (i.e. only when `USE_NVSHMEM=True`), see https://github.com/pytorch/pytorch/blob/main/caffe2/CMakeLists.txt#L1013-L1018. Therefore there is no need for an extra layer to hide the dependency. Pull Request resolved: #155971 Approved by: https://github.com/Skylion007 ghstack dependencies: #155506, #155835, #155968
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): Call `nvshmem_free` when an `NVSHMEMAllocation` is destructed. Use `is_finalizing()` as a guard, as done in `CUDASymmetricMemory.cu`, to avoid a "driver shutting down" error (the static destruction fiasco). Pull Request resolved: #155975 Approved by: https://github.com/ngimel ghstack dependencies: #155506, #155835, #155968, #155971
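For illustration, here is a minimal Python analogue of that guard pattern (the actual change is C++ in `NVSHMEMSymmetricMemory.cu`; the class and the `alloc_fn`/`free_fn` callables below are hypothetical stand-ins):

```python
import sys

class Allocation:
    """Owns a symmetric-memory buffer and frees it on destruction."""

    def __init__(self, alloc_fn, free_fn, nbytes):
        self._free_fn = free_fn  # e.g. a wrapper around nvshmem_free
        self.ptr = alloc_fn(nbytes)

    def __del__(self):
        # Skip the free call during interpreter/process shutdown: the
        # driver may already be torn down, which is exactly the
        # "driver shutting down" error the is_finalizing() guard in
        # CUDASymmetricMemory.cu avoids.
        if sys.is_finalizing():
            return
        self._free_fn(self.ptr)
```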
The rank-to-global-rank exchange is a major overhead in `NVSHMEMSymmetricMemory` creation. We should cache its result on a per-group basis.

Before this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 18
```
After this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 1
```
Pull Request resolved: #156116 Approved by: https://github.com/fegin, https://github.com/ngimel ghstack dependencies: #155506, #155835, #155968, #155971, #155975
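A minimal sketch of the caching idea in Python (the cache actually lives in the C++ backend; the function name and the `exchange_fn` collective below are hypothetical):

```python
# Cache the rank-to-global-rank mapping per process group, so the
# expensive exchange runs once per group instead of once per
# symmetric-memory rendezvous.
_rank_map_cache: dict[str, list[int]] = {}

def get_rank_to_global_rank(group_name: str, exchange_fn) -> list[int]:
    if group_name not in _rank_map_cache:
        # exchange_fn performs the costly collective exchange.
        _rank_map_cache[group_name] = exchange_fn(group_name)
    return _rank_map_cache[group_name]
```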
This lets us pick the default backend for `SymmetricMemory` without fully relying on the env var `TORCH_SYMMMEM=CUDA | NVSHMEM`. On the Python side, the following API is added: `torch.distributed._symmetric_memory.is_nvshmem_available()`. Pull Request resolved: #156291 Approved by: https://github.com/Skylion007 ghstack dependencies: #155506, #155835, #155968, #155971, #155975, #156116, #156117
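A possible usage pattern for the new API (the fallback logic here is illustrative, not the library's actual backend-selection code):

```python
import torch.distributed._symmetric_memory as symm_mem

# Prefer the NVSHMEM backend when the library is available; otherwise
# fall back to the CUDA backend.
backend = "NVSHMEM" if symm_mem.is_nvshmem_available() else "CUDA"
print(f"SymmetricMemory backend: {backend}")
```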
Stack from ghstack (oldest at bottom):
This is a requirement of most SHMEM backends; otherwise, allocations may be misaligned across ranks.
In this PR, we make the (total) input size and output size constant across ranks, even though the split sizes are generated randomly, as sketched below. (Previously we summed the splits up as the input size, which created misalignment in the SHMEM heap across ranks.)
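Roughly, the fix looks like this (a sketch only; the sizes, tensor names, and plain `torch.empty` calls are illustrative, not the exact test code, which goes through the symmetric-memory allocator):

```python
import torch

world_size = 4          # illustrative
max_inp_numel = 1024    # constant total size, identical on every rank

# Random splits per rank, but written into fixed-size buffers, so every
# rank requests the same number of bytes from the SHMEM heap.
splits = torch.randint(1, 64, (world_size,))
inp = torch.empty(max_inp_numel, device="cuda")               # same on all ranks
out = torch.empty(max_inp_numel * world_size, device="cuda")  # same on all ranks

# Previously: inp = torch.empty(splits.sum(), ...), whose size differed
# across ranks and misaligned subsequent symmetric allocations.
```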
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k