
Conversation

@codingwithsurya codingwithsurya commented Jun 24, 2025

Summary

This PR removes unnecessary dist.barrier calls in our Triton NVSHMEM test suite and adds signal_op support, a lightweight device-side signaling mechanism. A test for this was added to our wait_until kernel, along with the corresponding core.extern wrapper.

Why did we drop the dist.barrier() calls?
We dropped the host-side dist.barrier() in all Triton NVSHMEM tests (except the raw put/get cases) because every other test already uses NVSHMEM collectives or device-side sync primitives (fence/quiet/signal/wait), making the extra barrier redundant. This keeps synchronization entirely on the GPU and leverages NVSHMEM's native ordering guarantees for clearer, more efficient tests.

test_triton_wait_until update

  • Rank 1: after put_kernel writes the data, it launches signal_op_kernel to atomically set Rank 0's flag via nvshmemx_signal_op
  • Rank 0: drops its old dist.barrier() and simply calls wait_until_kernel to spin-wait on the device flag, then asserts data correctness
  • Changes made per this comment
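
The two-rank handshake above can be sketched host-side with Python threads. This is a hypothetical analogy, not the actual Triton/NVSHMEM code: a shared list stands in for the symmetric-memory buffer, and a one-element flag stands in for the signal variable that nvshmemx_signal_op sets atomically.

```python
# Hypothetical host-side analogy of the put -> signal / wait_until pattern
# from test_triton_wait_until. Thread 1 plays Rank 1 (put_kernel then
# signal_op_kernel); the main thread plays Rank 0 (wait_until_kernel).
import threading
import time

buf = [0] * 8   # stand-in for the symmetric-memory data buffer
flag = [0]      # stand-in for the device-side signal flag

def rank1_put_and_signal():
    # "put_kernel": write the payload first...
    for i in range(len(buf)):
        buf[i] = i + 1
    # ...then "signal_op_kernel": atomically set the peer's flag
    # (SIGNAL_SET semantics: overwrite the flag with the value).
    flag[0] = 1

def rank0_wait_until(expected=1):
    # "wait_until_kernel": spin until the flag compares equal (CMP_EQ).
    while flag[0] != expected:
        time.sleep(0)  # yield, like a device-side spin-wait
    return list(buf)

t = threading.Thread(target=rank1_put_and_signal)
t.start()
data = rank0_wait_until()
t.join()
```

In the real test, the ordering guarantee that the payload lands before the signal is delivered comes from NVSHMEM's signal semantics, which is why no host-side barrier is needed.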

Testing

TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py

Stack from ghstack (oldest at bottom):

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k


pytorch-bot bot commented Jun 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156684

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit a3c5209 with merge base 455dfd2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jun 24, 2025
@codingwithsurya codingwithsurya changed the title adding signal_op kernel, remove unncessary dist.barrier calls [SymmMem] Remove redundant dist.barrier in Triton NVSHMEM tests & add device‐side signal_op support Jun 24, 2025
@codingwithsurya codingwithsurya self-assigned this Jun 24, 2025
@codingwithsurya codingwithsurya force-pushed the gh/codingwithsurya/6/head branch 2 times, most recently from 373dc9d to def902f Compare June 24, 2025 16:45
@kwen2501 kwen2501 added the release notes: distributed (symm_mem) release note label for symmetric memory label Jun 24, 2025

@kwen2501 kwen2501 left a comment

Minor comment re argument type

Comment on lines +618 to +620
```python
    signal: tl.constexpr,
    sig_op: tl.constexpr,
    peer: tl.constexpr,
```

Should these three operands be constexpr? Can they be regular variables? Just to avoid re-compilation.


@codingwithsurya codingwithsurya Jun 24, 2025

Talked offline; these make sense as regular variables.


I'm going to update this in the top PR of this stack.
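
The recompilation concern behind this review thread can be illustrated with a hypothetical compile-cache sketch in plain Python (not Triton's actual cache): a JIT that specializes on constexpr arguments keys its cache by their values, so each new signal/sig_op/peer combination forces a fresh compile, while regular runtime arguments reuse one compiled binary.

```python
# Hypothetical model of constexpr specialization: the cache key includes
# constexpr values but not runtime argument values, so passing peer as a
# constexpr compiles once per distinct peer, while passing it as a regular
# variable compiles once total. Kernel names here are illustrative only.
compile_cache = {}

def launch(kernel_name, constexpr_args, runtime_args):
    """Simulate a JIT launch: only constexpr values enter the cache key."""
    key = (kernel_name, tuple(constexpr_args))
    if key not in compile_cache:
        # Cache miss: simulate a (slow) kernel compilation.
        compile_cache[key] = "binary::%s::%r" % (kernel_name, constexpr_args)
    return compile_cache[key]  # runtime_args would be passed to the binary

# peer as constexpr: two peer values -> two separate compiles
launch("signal_op_kernel", [0], [])
launch("signal_op_kernel", [1], [])
# peer as a runtime argument: one compile serves both peers
launch("signal_op_kernel_v2", [], [0])
launch("signal_op_kernel_v2", [], [1])
```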


```python
peer = 1 - rank
NVSHMEM_CMP_EQ = 0  # from nvshmem.h
NVSHMEM_SIGNAL_SET = 0  # atomic set operation
```

I see we can also pass in NVSHMEM_SIGNAL_ADD here? (https://docs.nvidia.com/nvshmem/api/gen/api/signal.html#available-signal-operators)

Do we see any scenario where it could be useful? @kwen2501


Yes, when the sender wants to monotonically update a counter as versioning info to the receiver.
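
The SET-versus-ADD distinction can be sketched with a hypothetical host-side model. NVSHMEM_SIGNAL_SET is 0 per the diff above; the numeric value used here for NVSHMEM_SIGNAL_ADD is a placeholder, not the real nvshmem.h constant.

```python
# Hypothetical model of the two signal operators: SET overwrites the
# receiver's signal variable, ADD accumulates into it, which is what makes
# ADD usable as a monotonically increasing version counter.
NVSHMEM_SIGNAL_SET = 0
NVSHMEM_SIGNAL_ADD = 1  # placeholder; the real value comes from nvshmem.h

def signal_op(flag, value, sig_op):
    """Apply a signal operation to a one-element flag buffer."""
    if sig_op == NVSHMEM_SIGNAL_SET:
        flag[0] = value       # atomic set: overwrite
    elif sig_op == NVSHMEM_SIGNAL_ADD:
        flag[0] += value      # atomic add: accumulate
    else:
        raise ValueError("unknown signal op")

flag = [0]
signal_op(flag, 1, NVSHMEM_SIGNAL_SET)  # "data ready"
signal_op(flag, 1, NVSHMEM_SIGNAL_ADD)  # version bump
signal_op(flag, 1, NVSHMEM_SIGNAL_ADD)  # version bump
```

A receiver spinning with a greater-than comparison on such a counter can detect each new version without the sender ever resetting the flag.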

@codingwithsurya

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 25, 2025
@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot

Merge failed

Reason: 4 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@codingwithsurya

Some checks need to be rerun; rebasing.

@codingwithsurya

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


pytorchmergebot pushed a commit that referenced this pull request Jun 27, 2025
…d file (#156685)

## Summary

Moved the Triton-specific NVSHMEM tests in `test_nvshmem.py` into a dedicated `test_nvshmem_triton.py` file. Also put the shared Triton JIT kernels at the top level of the new file for reusability.

## Testing

```bash
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem_triton.py
```

All 16 original tests pass with no functionality changes.

Pull Request resolved: #156685
Approved by: https://github.com/mandroid6, https://github.com/kwen2501
ghstack dependencies: #156684
@github-actions github-actions bot deleted the gh/codingwithsurya/6/head branch July 27, 2025 02:22