
Enable NCCL zero-copy (user buffer registration) for FSDP2 #150564

Closed
wants to merge 20 commits

Conversation

@lw lw (Contributor) commented Apr 2, 2025

Stack from ghstack (oldest at bottom):

Recent versions of NCCL introduced support for "user buffer registration": user-owned memory (such as regular PyTorch tensors) can be "registered" (pinned, page-locked, etc.) with the various interconnects (NVLink, InfiniBand, ...) to enable zero-copy transfers. This accelerates communication and reduces the resource footprint of NCCL's kernels, which in turn reduces contention.

This is already exposed in PyTorch through a custom allocator provided by the NCCL process group. DDP already uses it, via a memory pool that allows buffers to be cached and reused.
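For illustration only, the DDP-style pattern looks roughly like the sketch below. It assumes the NCCL backend exposes its allocator as mem_allocator and that torch.cuda.MemPool / torch.cuda.use_mem_pool are available; the exact API surface varies across PyTorch versions, so treat this as a hedged sketch rather than the code path used by this PR:

    import torch
    import torch.distributed as dist

    # Assumes torch.distributed has already been initialized with the NCCL backend.
    pg = dist.distributed_c10d._get_default_group()
    backend = pg._get_backend(torch.device("cuda"))

    # Route allocations through the process group's allocator so the resulting
    # buffers can be registered with NCCL and used for zero-copy transfers.
    pool = torch.cuda.MemPool(backend.mem_allocator)
    with torch.cuda.use_mem_pool(pool):
        comm_buffer = torch.empty(1024 * 1024, device="cuda")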

FSDP2 is particularly well suited to leverage user buffer registration because the buffers it passes to NCCL are allocated by FSDP2 itself: it needs to (de)interleave the parameters to/from these private staging buffers anyway.

This PR adds an extra flag to FSDP2 that tells it to use the ProcessGroup allocator for these private buffers, thus allowing it to leverage NCCL zero-copy (when supported).
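As a rough usage sketch (the method name comes from the diff reviewed below; the model and the fully_shard import path are placeholders and may differ across PyTorch versions):

    import torch.nn as nn
    from torch.distributed.fsdp import fully_shard  # import path may vary by version

    model = nn.Linear(1024, 1024, device="cuda")  # placeholder module
    fully_shard(model)  # the module now also behaves as an FSDPModule

    # Opt in to allocating FSDP2's private all-gather / reduce-scatter buffers
    # through the ProcessGroup allocator so NCCL can register them for zero-copy.
    model.set_allocate_memory_from_process_group_for_comm(True)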

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov


pytorch-bot bot commented Apr 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150564

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 2 Pending

As of commit ef21112 with merge base 2e0e085:
💚 Looks good so far! There are no failures yet. 💚

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/inductor, oncall: distributed, and release notes: distributed (fsdp) labels Apr 2, 2025
lw added a commit that referenced this pull request Apr 2, 2025
@lw lw requested review from eqy and syed-ahmed as code owners April 3, 2025 08:09
lw added a commit that referenced this pull request Apr 3, 2025
lw added a commit that referenced this pull request Apr 3, 2025
lw added a commit that referenced this pull request Apr 3, 2025
lw added a commit that referenced this pull request Apr 4, 2025
lw added a commit that referenced this pull request Apr 5, 2025
@lw lw changed the title from "Experiment with user buffer registration for FSDP2" to "Enable NCCL zero-copy (user buffer registration) for FSDP2" Jun 13, 2025
lw added a commit that referenced this pull request Jun 13, 2025
lw added a commit that referenced this pull request Jun 13, 2025
@@ -523,6 +523,22 @@ def set_unshard_in_backward(self, unshard_in_backward: bool) -> None:
        if (fsdp_param_group := state._fsdp_param_group) is not None:
            fsdp_param_group.unshard_in_backward = unshard_in_backward

    def set_allocate_memory_from_process_group_for_comm(self, enable: bool) -> None:
@kwen2501 kwen2501 (Contributor) Jun 13, 2025

What if there are multiple FSDP modules? Would users need to call .set_... one by one?
It seems this needs to be a class method at least.

And users would be advised to call FSDPModule.set_... before doing anything, just to be safe.

@lw lw (Contributor Author)

I very much prefer to make this a per-instance method, rather than a global classmethod, as I think there could be value in some cases in setting this differently for different modules.

I can understand, however, the desire to make this method "recursive". I don't have the impression that other methods in this class do that, so I don't know exactly how to achieve it, but if you have pointers please let me know!
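For illustration, the per-instance semantics can already be applied across a whole model by iterating over its submodules; this is only a sketch using the names from this PR, not a committed API:

    from torch.distributed.fsdp import FSDPModule  # import path may vary by version

    # Roughly what a "recursive" or class-level setter would do: enable the flag
    # on every FSDP-wrapped submodule of an already-sharded model.
    for submodule in model.modules():
        if isinstance(submodule, FSDPModule):
            submodule.set_allocate_memory_from_process_group_for_comm(True)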

lw added a commit that referenced this pull request Jun 16, 2025
@lw lw added the ciflow/trunk (Trigger trunk jobs on your pull request) label Jun 16, 2025
lw added a commit that referenced this pull request Jun 16, 2025
lw added a commit that referenced this pull request Jun 16, 2025
@lw lw requested a review from a team as a code owner June 17, 2025 08:36
lw added a commit that referenced this pull request Jun 17, 2025
@lw lw (Contributor Author) commented Jun 17, 2025

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

@Skylion007 (Collaborator)

Will this be added to FSDP1 o-o?

@weifengpy (Contributor)

Will this be added to FSDP1 o-o?

@Skylion007 no plans. it's hard to commit to maintain zero-copy in fsdp1

@lw lw deleted the gh/lw/10/head branch June 19, 2025 08:29
Labels
ciflow/h100-distributed, ciflow/inductor, ciflow/trunk, Merged, module: inductor, oncall: distributed, release notes: distributed (fsdp)