Contributor

@kwen2501 kwen2501 commented Jun 12, 2025

Stack from ghstack (oldest at bottom):

Most SHMEM backends require every rank to allocate the same size from the symmetric heap. Otherwise, allocations may misalign across ranks.

In this PR, we make the (total) input size and output size constant, even though the split sizes are generated randomly. (Previously, we summed the splits to get the input size, which created misalignment in the SHMEM heap across ranks.)
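
The difference between the two sizing schemes can be sketched in plain Python. This is only an illustration of the idea, not the test code from this PR: `WORLD_SIZE`, `MAX_INP_NUMEL`, and `random_splits` are hypothetical names, and the ranks are simulated in a single process.

```python
import random

WORLD_SIZE = 4        # hypothetical number of ranks
MAX_INP_NUMEL = 1024  # hypothetical constant buffer size, identical on all ranks


def random_splits(rank: int) -> list[int]:
    """Draw random per-peer split sizes whose sum never exceeds MAX_INP_NUMEL."""
    rng = random.Random(rank)  # each simulated rank draws its own splits
    per_peer_cap = MAX_INP_NUMEL // WORLD_SIZE
    return [rng.randint(0, per_peer_cap) for _ in range(WORLD_SIZE)]


splits_per_rank = [random_splits(rank) for rank in range(WORLD_SIZE)]

# Old scheme: the buffer size follows the random splits, so it can differ
# from rank to rank.
old_alloc_sizes = [sum(splits) for splits in splits_per_rank]

# New scheme: every rank allocates the same constant size; the splits only
# decide how much of that buffer is actually used.
new_alloc_sizes = [MAX_INP_NUMEL for _ in splits_per_rank]

print("old:", old_alloc_sizes)
print("new:", new_alloc_sizes)
```

Capping each split at `MAX_INP_NUMEL // WORLD_SIZE` is one simple way to guarantee that the random splits always fit in the fixed-size buffer.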

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

pytorch-bot bot commented Jun 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155835

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit a1af1d3 with merge base 4d9d884 (image):

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 added a commit that referenced this pull request Jun 12, 2025
ghstack-source-id: e88ef2b
Pull-Request-resolved: #155835
@pytorch-bot pytorch-bot bot added the "oncall: distributed" (Add this issue/PR to distributed oncall triage queue) and "topic: not user facing" (topic category) labels Jun 12, 2025
@kwen2501 kwen2501 requested review from fduwjj and fegin and removed request for fegin June 12, 2025 20:38
@kwen2501 kwen2501 added the "ciflow/trunk" (Trigger trunk jobs on your pull request) label Jun 12, 2025
Contributor

@fduwjj fduwjj left a comment

LGTM

@kwen2501
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job waited more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information, see the pytorch-bot wiki.

@kwen2501
Contributor Author

@pytorchbot merge -f "merge timed out"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

pytorchmergebot pushed a commit that referenced this pull request Jun 14, 2025
No code inserts entries into `ptr_to_symm_mem_`, so it is always empty.
This PR removes it and supports the functionality that relied on it via the `allocations_` map.

Pull Request resolved: #155968
Approved by: https://github.com/Skylion007
ghstack dependencies: #155506, #155835
pytorchmergebot pushed a commit that referenced this pull request Jun 14, 2025
`NVSHMEMSymmetricMemory.cu` and `nvshmem_extension.cu` are under the same compilation condition now (i.e. only when `USE_NVSHMEM=True`), see https://github.com/pytorch/pytorch/blob/main/caffe2/CMakeLists.txt#L1013-L1018.

Therefore, there is no need to build an extra layer to hide the dependency.

Pull Request resolved: #155971
Approved by: https://github.com/Skylion007
ghstack dependencies: #155506, #155835, #155968
pytorchmergebot pushed a commit that referenced this pull request Jun 15, 2025
Call `nvshmem_free` when an `NVSHMEMAllocation` is destructed.

Use `is_finalizing()` as a guard, as done in `CUDASymmetricMemory.cu`, to avoid the "driver shutting down" error (the static destruction order fiasco).
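
The guard pattern can be illustrated with a Python analogue. This is not the C++ code from this commit: `Allocation` and `fake_free` are hypothetical stand-ins for `NVSHMEMAllocation` and `nvshmem_free`, and Python's real `sys.is_finalizing()` plays the role of the C++ `is_finalizing()` check.

```python
import sys

freed_handles = []


def fake_free(handle: int) -> None:
    """Hypothetical stand-in for nvshmem_free; just records the handle."""
    freed_handles.append(handle)


class Allocation:
    """Hypothetical analogue of NVSHMEMAllocation owning one buffer handle."""

    def __init__(self, handle: int) -> None:
        self.handle = handle

    def __del__(self) -> None:
        # During interpreter shutdown the runtime we would call into may
        # already be torn down, so skip the free instead of risking an error.
        if sys.is_finalizing():
            return
        fake_free(self.handle)


alloc = Allocation(handle=1)
del alloc  # in CPython the destructor runs here, while the runtime is still alive
print(freed_handles)
```

Objects that are still alive at shutdown skip the free and leave cleanup to process teardown, which is the trade-off this kind of guard accepts.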

Pull Request resolved: #155975
Approved by: https://github.com/ngimel
ghstack dependencies: #155506, #155835, #155968, #155971
pytorchmergebot pushed a commit that referenced this pull request Jun 17, 2025
The rank-to-global-rank exchange is a major overhead in `NVSHMEMSymmetricMemory` creation.
We should cache its result on a per-group basis.

Before this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 18
```

After this change:
```
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
exchanged_n_times: 1
```

Pull Request resolved: #156116
Approved by: https://github.com/fegin, https://github.com/ngimel
ghstack dependencies: #155506, #155835, #155968, #155971, #155975
pytorchmergebot pushed a commit that referenced this pull request Jun 17, 2025
Avoid a copy; callers are not expected to change its value.

Pull Request resolved: #156117
Approved by: https://github.com/fegin
ghstack dependencies: #155506, #155835, #155968, #155971, #155975, #156116
pytorchmergebot pushed a commit that referenced this pull request Jun 19, 2025
so that we can pick the default backend for SymmetricMemory without
fully relying on env var `TORCH_SYMMMEM=CUDA | NVSHMEM`

On the Python side, the following API is added:
`torch.distributed._symmetric_memory.is_nvshmem_available()`
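
As a rough sketch of how a caller could combine this probe with the env var, consider the helper below. The selection policy in `pick_backend` (env var first, then NVSHMEM if available, then CUDA) is an assumption made for illustration and is not necessarily the logic PyTorch implements; `pick_backend` and `nvshmem_available` are hypothetical helper names.

```python
from typing import Mapping


def nvshmem_available() -> bool:
    """Probe the API named above; treat a missing torch build as unavailable."""
    try:
        import torch.distributed._symmetric_memory as symm_mem

        return bool(symm_mem.is_nvshmem_available())
    except (ImportError, AttributeError):
        return False


def pick_backend(env: Mapping[str, str], has_nvshmem: bool) -> str:
    """Hypothetical policy: honor TORCH_SYMMMEM if set, else probe availability."""
    requested = env.get("TORCH_SYMMMEM")
    if requested in ("CUDA", "NVSHMEM"):
        return requested
    return "NVSHMEM" if has_nvshmem else "CUDA"


print(pick_backend({}, has_nvshmem=True))
print(pick_backend({}, has_nvshmem=False))
print(pick_backend({"TORCH_SYMMMEM": "CUDA"}, has_nvshmem=True))
```

In a real program the call would be `pick_backend(os.environ, nvshmem_available())`; the probe is passed in as a plain boolean so the policy can be exercised without a GPU build.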

Pull Request resolved: #156291
Approved by: https://github.com/Skylion007
ghstack dependencies: #155506, #155835, #155968, #155971, #155975, #156116, #156117
@github-actions github-actions bot deleted the gh/kwen2501/167/head branch July 14, 2025 02:22
Labels

ciflow/trunk (Trigger trunk jobs on your pull request), Merged, oncall: distributed (Add this issue/PR to distributed oncall triage queue), topic: not user facing (topic category)

5 participants