[c10d] use allocator trace callbacks for NCCL PG register #112850
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112850
✅ No Failures
As of commit e0270ca with merge base 9c1fb2c.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D50726970
f35b398 to 492b37e
…2850)

Summary:
We need to register all cache segments allocated by the cache allocator so that NCCL can apply zero-copy algorithms in collective and point-to-point operations.

How to track and register all cache segments:
- The PG registers a register hook and a deregister hook with the cache allocator as action tracker callbacks, tracking SEGMENT_ALLOC and SEGMENT_FREE trace entries, respectively. When SEGMENT_ALLOC is tracked, the register hook registers the segment with the PG's communicators on the same device. Similarly, when SEGMENT_FREE is tracked, the deregister hook handles deregistration before cudaFree.
- When a new NCCL communicator is created, it dumps the snapshot from the cache allocator to register all existing cache segments at once.
- When an NCCL communicator is aborted, it deregisters all segments that have been registered by this communicator.

Test Plan: See test in D50726971

Reviewed By: wconstab

Differential Revision: D50726970
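The three tracking rules in this commit message can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions, not the c10d implementation: `ToyAllocator`, `ToyComm`, and `ToyProcessGroup` are hypothetical names invented for illustration, segments are plain `(device, address, size)` tuples, and no CUDA or NCCL calls are made. Only the `SEGMENT_ALLOC` and `SEGMENT_FREE` action names come from the description above.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical stand-in for a cache segment: (device, address, size in bytes).
Segment = Tuple[int, int, int]


class ToyAllocator:
    """Toy cache allocator that reports segment events to trace callbacks."""

    def __init__(self) -> None:
        self._segments: Set[Segment] = set()
        self._callbacks: List[Callable[[str, Segment], None]] = []
        self._next_addr = 0x1000

    def attach_trace_callback(self, cb: Callable[[str, Segment], None]) -> None:
        self._callbacks.append(cb)

    def snapshot(self) -> List[Segment]:
        # All segments that are currently live in the cache.
        return sorted(self._segments)

    def alloc_segment(self, device: int, size: int) -> Segment:
        seg = (device, self._next_addr, size)
        self._next_addr += size
        self._segments.add(seg)
        self._emit("SEGMENT_ALLOC", seg)
        return seg

    def free_segment(self, seg: Segment) -> None:
        # The trace entry fires before the segment is released, mirroring
        # "deregistration before cudaFree" in the description.
        self._emit("SEGMENT_FREE", seg)
        self._segments.remove(seg)

    def _emit(self, action: str, seg: Segment) -> None:
        for cb in self._callbacks:
            cb(action, seg)


class ToyComm:
    """Toy communicator that remembers which segments it has registered."""

    def __init__(self, device: int) -> None:
        self.device = device
        self.registered: Set[Segment] = set()


class ToyProcessGroup:
    """Toy process group wiring the hooks to its per-device communicators."""

    def __init__(self, allocator: ToyAllocator) -> None:
        self._allocator = allocator
        self._comms: Dict[int, ToyComm] = {}
        allocator.attach_trace_callback(self._on_trace)

    def _on_trace(self, action: str, seg: Segment) -> None:
        comm = self._comms.get(seg[0])  # communicator on the same device
        if comm is None:
            return
        if action == "SEGMENT_ALLOC":
            comm.registered.add(seg)
        elif action == "SEGMENT_FREE":
            comm.registered.discard(seg)

    def create_comm(self, device: int) -> ToyComm:
        comm = ToyComm(device)
        # Register every segment that already exists on this device at once.
        for seg in self._allocator.snapshot():
            if seg[0] == device:
                comm.registered.add(seg)
        self._comms[device] = comm
        return comm

    def abort_comm(self, device: int) -> None:
        # Deregister everything this communicator had registered.
        comm = self._comms.pop(device)
        comm.registered.clear()
```

In this model, a segment allocated before `create_comm` is picked up through the snapshot, while later allocations and frees on the same device reach the communicator through the trace callback.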
492b37e to 9edfff5
…2850)

Test Plan:
```
NCCL_CTRAN_REGISTER=1 NCCL_ALLGATHER_ALGO=ctran:direct NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL pytest test/distributed/test_c10d_nccl.py -vsk test_tensor_register_hook
```

```
============================= test session starts ==============================
platform linux -- Python 3.10.13, pytest-7.4.3, pluggy-1.3.0 -- /home/msi/conda/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/data/users/msi/git/pytorch/.hypothesis/examples'))
rootdir: /data/users/msi/git/pytorch
configfile: pytest.ini
plugins: typeguard-3.0.2, hypothesis-6.88.1
collecting ... collected 153 items / 152 deselected / 1 selected

test/distributed/test_c10d_nccl.py::ProcessGroupNCCLTest::test_tensor_register_hook NCCL version 2.18.3meta-exp git-bc6c420+cuda11.8
2023-11-03T11:05:30-0700 devgpu001:3050625:3052473 [1] 1210737 ctran/mapper/ctranMapper.cc:67 NCCL WARN CTRAN: IB backend not enabled
2023-11-03T11:05:30-0700 devgpu001:3050625:3050625 [1] 1211154 NCCL INFO CTRAN-MAPPER: register buffer 0x7fc49a400000 len 2097152 (cached 0 registered 1 total cached 0 total registered 1 total dynamically registered 0)
2023-11-03T11:05:30-0700 devgpu001:3050625:3050625 [1] 1211276 NCCL INFO AllGather: opCount 0 sendbuff 0x7fc49a400000 recvbuff 0x7fc49a400200 count 8 datatype 0 op 0 root 0 comm 0x55063ff0 commHash 5314677976282377676 [nranks=2] stream 0x54809370
2023-11-03T11:05:30-0700 devgpu001:3050624:3051241 [0] 4572989 ctran/mapper/ctranMapper.cc:67 NCCL WARN CTRAN: IB backend not enabled
2023-11-03T11:05:30-0700 devgpu001:3050624:3050624 [0] 4573408 NCCL INFO CTRAN-MAPPER: register buffer 0x7f0fec400000 len 2097152 (cached 0 registered 1 total cached 0 total registered 1 total dynamically registered 0)
2023-11-03T11:05:30-0700 devgpu001:3050624:3050624 [0] 4573527 NCCL INFO AllGather: opCount 0 sendbuff 0x7f0fec400000 recvbuff 0x7f0fec400200 count 8 datatype 0 op 0 root 0 comm 0x41a01590 commHash 5314677976282377676 [nranks=2] stream 0x4118d470
PASSED [13.2144s]

====================== 1 passed, 152 deselected in 18.02s ======================
```

## Facebook
Performance study with xlformer 150b model on 16 nodes: https://docs.google.com/document/d/1YJe1yplTb4IE2TtpYiuHTCTOZHJ10OxLm4JZXX95wfE/edit?usp=sharing

Reviewed By: wconstab

Differential Revision: D50726970
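One way to read the log above is to check that each AllGather's `sendbuff` and `recvbuff` start inside a buffer that CTRAN-MAPPER registered in the same process. The sketch below does that arithmetic on the four `NCCL INFO` lines copied from the test output. The helper names are hypothetical, and the bracketed index (`[0]`, `[1]`) is assumed to identify the process that emitted the line. Only start addresses are checked, because the log alone does not give the byte extent of each buffer.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# NCCL INFO lines copied verbatim from the test output above.
LOG = """\
2023-11-03T11:05:30-0700 devgpu001:3050625:3050625 [1] 1211154 NCCL INFO CTRAN-MAPPER: register buffer 0x7fc49a400000 len 2097152 (cached 0 registered 1 total cached 0 total registered 1 total dynamically registered 0)
2023-11-03T11:05:30-0700 devgpu001:3050625:3050625 [1] 1211276 NCCL INFO AllGather: opCount 0 sendbuff 0x7fc49a400000 recvbuff 0x7fc49a400200 count 8 datatype 0 op 0 root 0 comm 0x55063ff0 commHash 5314677976282377676 [nranks=2] stream 0x54809370
2023-11-03T11:05:30-0700 devgpu001:3050624:3050624 [0] 4573408 NCCL INFO CTRAN-MAPPER: register buffer 0x7f0fec400000 len 2097152 (cached 0 registered 1 total cached 0 total registered 1 total dynamically registered 0)
2023-11-03T11:05:30-0700 devgpu001:3050624:3050624 [0] 4573527 NCCL INFO AllGather: opCount 0 sendbuff 0x7f0fec400000 recvbuff 0x7f0fec400200 count 8 datatype 0 op 0 root 0 comm 0x41a01590 commHash 5314677976282377676 [nranks=2] stream 0x4118d470
"""

REGISTER_RE = re.compile(
    r"\[(\d+)\] \d+ NCCL INFO CTRAN-MAPPER: register buffer (0x[0-9a-f]+) len (\d+)"
)
ALLGATHER_RE = re.compile(
    r"\[(\d+)\] \d+ NCCL INFO AllGather: .*?sendbuff (0x[0-9a-f]+) recvbuff (0x[0-9a-f]+)"
)


def registered_ranges(log: str) -> Dict[int, List[Tuple[int, int]]]:
    """Map bracketed index -> list of (base address, length) registrations."""
    ranges: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
    for idx, base, length in REGISTER_RE.findall(log):
        ranges[int(idx)].append((int(base, 16), int(length)))
    return dict(ranges)


def collective_buffers(log: str) -> Dict[int, List[Tuple[int, int]]]:
    """Map bracketed index -> list of (sendbuff, recvbuff) start addresses."""
    bufs: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
    for idx, send, recv in ALLGATHER_RE.findall(log):
        bufs[int(idx)].append((int(send, 16), int(recv, 16)))
    return dict(bufs)


def starts_in_registered(addr: int, ranges: List[Tuple[int, int]]) -> bool:
    """True if addr falls in [base, base + length) of any registered range."""
    return any(base <= addr < base + length for base, length in ranges)


def all_buffers_registered(log: str) -> bool:
    """Check every logged send/recv start address against that index's ranges."""
    ranges = registered_ranges(log)
    return all(
        starts_in_registered(addr, ranges.get(idx, []))
        for idx, pairs in collective_buffers(log).items()
        for pair in pairs
        for addr in pair
    )
```

For these lines, each process registers one 2097152-byte (2 MiB) buffer, and both of its AllGather buffers start inside it, with `recvbuff` 0x200 (512) bytes past `sendbuff`.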
9edfff5 to e0270ca
@pytorchbot merge
(Initiating merge automatically since Phabricator Diff has merged)

Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours).
…2850)

Pull Request resolved: pytorch#112850
Approved by: https://github.com/wconstab
Summary:
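We need to register all cache segments allocated by the cache allocator so that NCCL can apply zero-copy algorithms in collective and point-to-point operations.
How to track and register all cache segments:
- The PG registers a register hook and a deregister hook with the cache allocator as action tracker callbacks, tracking SEGMENT_ALLOC and SEGMENT_FREE trace entries, respectively. When SEGMENT_ALLOC is tracked, the register hook registers the segment with the PG's communicators on the same device. Similarly, when SEGMENT_FREE is tracked, the deregister hook handles deregistration before cudaFree.
- When a new NCCL communicator is created, it dumps the snapshot from the cache allocator to register all existing cache segments at once.
- When an NCCL communicator is aborted, it deregisters all segments that have been registered by this communicator.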
Test Plan: See test in D50726971
Reviewed By: wconstab
Differential Revision: D50726970