
[c10d] use allocator trace callbacks for NCCL PG register #112850

Closed
wants to merge 1 commit

Conversation

@minsii (Contributor) commented Nov 3, 2023

Summary:
We need to register all cache segments allocated by the CUDA caching allocator, so that NCCL can apply zero-copy algorithms in collective and point-to-point operations.

How we track and register all cache segments (a conceptual sketch of the wiring follows the list):

  • We attach a register hook and a deregister hook to the caching allocator as trace-tracker callbacks; they observe SEGMENT_ALLOC and SEGMENT_FREE trace entries, respectively. When a SEGMENT_ALLOC entry is observed, the register hook registers the new segment with the PG's communicators on the same device. Similarly, when a SEGMENT_FREE entry is observed, the deregister hook deregisters the segment before cudaFree is called.
  • When a new NCCL communicator is created, we dump the snapshot from the caching allocator and register all existing cache segments with it at once.
  • When an NCCL communicator is aborted, we deregister all segments that were registered with that communicator.
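
To make the hook flow above concrete, here is a minimal C++ sketch of how such wiring could look. It is illustrative only and is not the code added by this PR: it assumes the caching allocator exposes `c10::cuda::CUDACachingAllocator::attachAllocatorTraceTracker` and `snapshot()` (the exact `TraceEntry`/`SegmentInfo` field names below are assumptions), and it assumes an NCCL build that provides the user-buffer registration API `ncclCommRegister`/`ncclCommDeregister` (NCCL >= 2.19, or a CTRAN-enabled internal build). The `CommRegistrations` bookkeeping and the global communicator list are hypothetical simplifications of the per-ProcessGroup state.

```
// Illustrative sketch only -- not the exact code added by this PR.
// Assumed APIs: c10::cuda::CUDACachingAllocator::attachAllocatorTraceTracker()
// and snapshot() on the PyTorch side; ncclCommRegister()/ncclCommDeregister()
// (user-buffer registration, NCCL >= 2.19 or a CTRAN-enabled build) on the NCCL side.
#include <c10/cuda/CUDACachingAllocator.h>
#include <nccl.h>

#include <mutex>
#include <unordered_map>
#include <vector>

namespace {

// Hypothetical bookkeeping: one entry per live communicator, mapping
// segment base pointer -> NCCL registration handle, so segments can be
// deregistered on SEGMENT_FREE or when the communicator is aborted.
struct CommRegistrations {
  ncclComm_t comm{nullptr};
  int device{-1};
  std::unordered_map<void*, void*> handles;
};

std::mutex g_mutex;
std::vector<CommRegistrations> g_comms;

// Register one new segment with every communicator on the same device.
void registerSegment(void* ptr, size_t size, int device) {
  std::lock_guard<std::mutex> lock(g_mutex);
  for (auto& cr : g_comms) {
    if (cr.device != device) {
      continue;
    }
    void* handle = nullptr;
    if (ncclCommRegister(cr.comm, ptr, size, &handle) == ncclSuccess) {
      cr.handles.emplace(ptr, handle);
    }
  }
}

// Deregister one segment from every communicator that registered it.
// This must run before the allocator calls cudaFree on the segment.
void deregisterSegment(void* ptr, int device) {
  std::lock_guard<std::mutex> lock(g_mutex);
  for (auto& cr : g_comms) {
    if (cr.device != device) {
      continue;
    }
    auto it = cr.handles.find(ptr);
    if (it != cr.handles.end()) {
      ncclCommDeregister(cr.comm, it->second);
      cr.handles.erase(it);
    }
  }
}

} // anonymous namespace

// First bullet: hook the allocator once, e.g. when the first ProcessGroupNCCL is
// created. SEGMENT_ALLOC -> register, SEGMENT_FREE -> deregister (before cudaFree).
void attachSegmentHooks() {
  using namespace c10::cuda::CUDACachingAllocator;
  attachAllocatorTraceTracker([](const TraceEntry& te) {
    if (te.action_ == TraceEntry::SEGMENT_ALLOC) {
      registerSegment(reinterpret_cast<void*>(te.addr_), te.size_, te.device_);
    } else if (te.action_ == TraceEntry::SEGMENT_FREE) {
      deregisterSegment(reinterpret_cast<void*>(te.addr_), te.device_);
    }
  });
}

// Second bullet: when a new communicator is created, walk the allocator snapshot
// and register every segment that already exists on that device.
// (SegmentInfo field names here are assumptions.)
void registerExistingSegments(ncclComm_t comm, int device) {
  CommRegistrations cr;
  cr.comm = comm;
  cr.device = device;
  const auto snap = c10::cuda::CUDACachingAllocator::snapshot();
  for (const auto& seg : snap.segments) {
    if (seg.device != device) {
      continue;
    }
    void* ptr = reinterpret_cast<void*>(seg.address);
    void* handle = nullptr;
    if (ncclCommRegister(comm, ptr, seg.total_size, &handle) == ncclSuccess) {
      cr.handles.emplace(ptr, handle);
    }
  }
  std::lock_guard<std::mutex> lock(g_mutex);
  g_comms.push_back(std::move(cr));
}
```

With this bookkeeping in place, abort handling (the last bullet) reduces to walking the aborted communicator's recorded handles and calling `ncclCommDeregister` on each before the communicator is destroyed.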

Test Plan: See test in D50726971

Reviewed By: wconstab

Differential Revision: D50726970

pytorch-bot (bot) commented Nov 3, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112850

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e0270ca with merge base 9c1fb2c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D50726970

minsii added a commit to minsii/pytorch that referenced this pull request Nov 3, 2023

@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D50726970

minsii added a commit to minsii/pytorch that referenced this pull request Nov 5, 2023

Test Plan:
```
 NCCL_CTRAN_REGISTER=1 NCCL_ALLGATHER_ALGO=ctran:direct NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL pytest test/distributed/test_c10d_nccl.py -vsk test_tensor_register_hook
```

```
============================= test session starts ==============================
platform linux -- Python 3.10.13, pytest-7.4.3, pluggy-1.3.0 -- /home/msi/conda/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/data/users/msi/git/pytorch/.hypothesis/examples'))
rootdir: /data/users/msi/git/pytorch
configfile: pytest.ini
plugins: typeguard-3.0.2, hypothesis-6.88.1
collecting ... collected 153 items / 152 deselected / 1 selected

test/distributed/test_c10d_nccl.py::ProcessGroupNCCLTest::test_tensor_register_hook NCCL version 2.18.3meta-exp git-bc6c420+cuda11.8

2023-11-03T11:05:30-0700 devgpu001:3050625:3052473 [1] 1210737 ctran/mapper/ctranMapper.cc:67 NCCL WARN CTRAN: IB backend not enabled
2023-11-03T11:05:30-0700 devgpu001:3050625:3050625 [1] 1211154 NCCL INFO CTRAN-MAPPER: register buffer 0x7fc49a400000 len 2097152 (cached 0 registered 1 total cached 0 total registered 1 total dynamically registered 0)
2023-11-03T11:05:30-0700 devgpu001:3050625:3050625 [1] 1211276 NCCL INFO AllGather: opCount 0 sendbuff 0x7fc49a400000 recvbuff 0x7fc49a400200 count 8 datatype 0 op 0 root 0 comm 0x55063ff0 commHash 5314677976282377676 [nranks=2] stream 0x54809370

2023-11-03T11:05:30-0700 devgpu001:3050624:3051241 [0] 4572989 ctran/mapper/ctranMapper.cc:67 NCCL WARN CTRAN: IB backend not enabled
2023-11-03T11:05:30-0700 devgpu001:3050624:3050624 [0] 4573408 NCCL INFO CTRAN-MAPPER: register buffer 0x7f0fec400000 len 2097152 (cached 0 registered 1 total cached 0 total registered 1 total dynamically registered 0)
2023-11-03T11:05:30-0700 devgpu001:3050624:3050624 [0] 4573527 NCCL INFO AllGather: opCount 0 sendbuff 0x7f0fec400000 recvbuff 0x7f0fec400200 count 8 datatype 0 op 0 root 0 comm 0x41a01590 commHash 5314677976282377676 [nranks=2] stream 0x4118d470
PASSED [13.2144s]

====================== 1 passed, 152 deselected in 18.02s ======================
```

## Facebook
Performance study with xlformer 150b model on 16 nodes:
https://docs.google.com/document/d/1YJe1yplTb4IE2TtpYiuHTCTOZHJ10OxLm4JZXX95wfE/edit?usp=sharing

Reviewed By: wconstab

Differential Revision: D50726970

@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D50726970

@facebook-github-bot (Contributor) commented:
@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

pytorch-bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Nov 6, 2023
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Nov 7, 2023

Pull Request resolved: pytorch#112850
Approved by: https://github.com/wconstab
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023

Labels: ciflow/trunk (Trigger trunk jobs on your pull request), fb-exported, Merged, release notes: distributed (c10d)

4 participants