
Enable NCCL zero-copy (user buffer registration) for FSDP2 #150564

Closed
wants to merge 20 commits

Conversation

@lw lw (Contributor) commented Apr 2, 2025

Stack from ghstack (oldest at bottom):

Recent versions of NCCL introduced support for "user buffer registration": user-owned memory (such as regular PyTorch tensors) can be "registered" (pinned, page-locked, etc.) with the various interconnects (NVLink, InfiniBand, ...) to enable zero-copy transfers. This accelerates communication and reduces the resource footprint of NCCL's kernels, which in turn reduces contention.

This is already exposed in PyTorch through a custom allocator provided by the NCCL process group. DDP already uses it, via a memory pool that allows buffers to be cached and reused.
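For illustration only, the DDP-style pattern looks roughly like the sketch below. It assumes the NCCL backend exposes its allocator as mem_allocator and that torch.cuda.MemPool / torch.cuda.use_mem_pool are available; the exact API surface varies across PyTorch versions, so treat this as a hedged sketch rather than the code path used by this PR:

    import torch
    import torch.distributed as dist

    # Assumes torch.distributed has already been initialized with the NCCL backend.
    pg = dist.distributed_c10d._get_default_group()
    backend = pg._get_backend(torch.device("cuda"))

    # Route allocations through the process group's allocator so the resulting
    # buffers can be registered with NCCL and used for zero-copy transfers.
    pool = torch.cuda.MemPool(backend.mem_allocator)
    with torch.cuda.use_mem_pool(pool):
        comm_buffer = torch.empty(1024 * 1024, device="cuda")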

FSDP2 is particularly well suited to leverage user buffer registration because the buffers it passes to NCCL are allocated by FSDP2 itself: it needs to (de)interleave the parameters to/from these private staging buffers anyway.

This PR adds an extra flag to FSDP2 that tells it to use the ProcessGroup allocator for these private buffers, thus allowing it to leverage NCCL zero-copy (when supported).
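As a rough usage sketch (the method name comes from the diff reviewed below; the model and the fully_shard import path are placeholders and may differ across PyTorch versions):

    import torch.nn as nn
    from torch.distributed.fsdp import fully_shard  # import path may vary by version

    model = nn.Linear(1024, 1024, device="cuda")  # placeholder module
    fully_shard(model)  # the module now also behaves as an FSDPModule

    # Opt in to allocating FSDP2's private all-gather / reduce-scatter buffers
    # through the ProcessGroup allocator so NCCL can register them for zero-copy.
    model.set_allocate_memory_from_process_group_for_comm(True)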

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov


pytorch-bot bot commented Apr 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150564

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 2 Pending

As of commit ef21112 with merge base 2e0e085:
💚 Looks good so far! There are no failures yet. 💚

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/inductor, oncall: distributed, and release notes: distributed (fsdp) labels Apr 2, 2025
lw added a commit that referenced this pull request Apr 2, 2025
@lw lw requested review from eqy and syed-ahmed as code owners April 3, 2025 08:09
lw added a commit that referenced this pull request Apr 3, 2025
lw added a commit that referenced this pull request Apr 3, 2025
lw added a commit that referenced this pull request Apr 3, 2025
lw added a commit that referenced this pull request Apr 4, 2025
lw added a commit that referenced this pull request Apr 5, 2025
@lw lw changed the title from "Experiment with user buffer registration for FSDP2" to "Enable NCCL zero-copy (user buffer registration) for FSDP2" Jun 13, 2025
lw added a commit that referenced this pull request Jun 13, 2025
lw added a commit that referenced this pull request Jun 13, 2025
@@ -523,6 +523,22 @@ def set_unshard_in_backward(self, unshard_in_backward: bool) -> None:
        if (fsdp_param_group := state._fsdp_param_group) is not None:
            fsdp_param_group.unshard_in_backward = unshard_in_backward

    def set_allocate_memory_from_process_group_for_comm(self, enable: bool) -> None:
@kwen2501 kwen2501 (Contributor) Jun 13, 2025

What if there are multiple FSDP modules? Would users need to call .set_... one by one?
It seems this needs to be a class method at least.

And users would be advised to call FSDPModule.set_... before doing anything, just to be safe.

@lw lw (Contributor Author)

I very much prefer to make this a per-instance method, rather than a global classmethod, as I think there could be value in some cases in setting this differently for different modules.

I can understand, however, the desire to make this method "recursive". I don't have the impression that other methods in this class do that, so I don't know exactly how to achieve it, but if you have pointers please let me know!
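For illustration, the per-instance semantics can already be applied across a whole model by iterating over its submodules; this is only a sketch using the names from this PR, not a committed API:

    from torch.distributed.fsdp import FSDPModule  # import path may vary by version

    # Roughly what a "recursive" or class-level setter would do: enable the flag
    # on every FSDP-wrapped submodule of an already-sharded model.
    for submodule in model.modules():
        if isinstance(submodule, FSDPModule):
            submodule.set_allocate_memory_from_process_group_for_comm(True)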

lw added a commit that referenced this pull request Jun 16, 2025
@lw lw added the ciflow/trunk (Trigger trunk jobs on your pull request) label Jun 16, 2025
lw added a commit that referenced this pull request Jun 16, 2025
lw added a commit that referenced this pull request Jun 16, 2025
@lw lw requested a review from a team as a code owner June 17, 2025 08:36
lw added a commit that referenced this pull request Jun 17, 2025
@lw lw (Contributor Author) commented Jun 17, 2025

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

@Skylion007 (Collaborator)

Will this be added to FSDP1 o-o?

@weifengpy (Contributor)

Will this be added to FSDP1 o-o?

@Skylion007 no plans. it's hard to commit to maintain zero-copy in fsdp1

@lw lw deleted the gh/lw/10/head branch June 19, 2025 08:29
Labels
ciflow/h100-distributed, ciflow/inductor, ciflow/trunk, Merged, module: inductor, oncall: distributed, release notes: distributed (fsdp)