
[c10d] use allocator trace callbacks for NCCL PG register #112850

Closed
wants to merge 1 commit

Conversation

@minsii (Contributor) commented Nov 3, 2023

Summary:
We need to register all cache segments allocated by the CUDA caching allocator, so that NCCL can apply zero-copy algorithms in collective and point-to-point operations.

How we track and register all cache segments (a conceptual sketch of the wiring follows the list):

  • We attach a register hook and a deregister hook to the caching allocator as trace-tracker callbacks; they observe SEGMENT_ALLOC and SEGMENT_FREE trace entries, respectively. When a SEGMENT_ALLOC entry is observed, the register hook registers the new segment with the PG's communicators on the same device. Similarly, when a SEGMENT_FREE entry is observed, the deregister hook deregisters the segment before cudaFree is called.
  • When a new NCCL communicator is created, we dump the snapshot from the caching allocator and register all existing cache segments with it at once.
  • When an NCCL communicator is aborted, we deregister all segments that were registered with that communicator.
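
To make the hook flow above concrete, here is a minimal C++ sketch of how such wiring could look. It is illustrative only and is not the code added by this PR: it assumes the caching allocator exposes `c10::cuda::CUDACachingAllocator::attachAllocatorTraceTracker` and `snapshot()` (the exact `TraceEntry`/`SegmentInfo` field names below are assumptions), and it assumes an NCCL build that provides the user-buffer registration API `ncclCommRegister`/`ncclCommDeregister` (NCCL >= 2.19, or a CTRAN-enabled internal build). The `CommRegistrations` bookkeeping and the global communicator list are hypothetical simplifications of the per-ProcessGroup state.

```
// Illustrative sketch only -- not the exact code added by this PR.
// Assumed APIs: c10::cuda::CUDACachingAllocator::attachAllocatorTraceTracker()
// and snapshot() on the PyTorch side; ncclCommRegister()/ncclCommDeregister()
// (user-buffer registration, NCCL >= 2.19 or a CTRAN-enabled build) on the NCCL side.
#include <c10/cuda/CUDACachingAllocator.h>
#include <nccl.h>

#include <mutex>
#include <unordered_map>
#include <vector>

namespace {

// Hypothetical bookkeeping: one entry per live communicator, mapping
// segment base pointer -> NCCL registration handle, so segments can be
// deregistered on SEGMENT_FREE or when the communicator is aborted.
struct CommRegistrations {
  ncclComm_t comm{nullptr};
  int device{-1};
  std::unordered_map<void*, void*> handles;
};

std::mutex g_mutex;
std::vector<CommRegistrations> g_comms;

// Register one new segment with every communicator on the same device.
void registerSegment(void* ptr, size_t size, int device) {
  std::lock_guard<std::mutex> lock(g_mutex);
  for (auto& cr : g_comms) {
    if (cr.device != device) {
      continue;
    }
    void* handle = nullptr;
    if (ncclCommRegister(cr.comm, ptr, size, &handle) == ncclSuccess) {
      cr.handles.emplace(ptr, handle);
    }
  }
}

// Deregister one segment from every communicator that registered it.
// This must run before the allocator calls cudaFree on the segment.
void deregisterSegment(void* ptr, int device) {
  std::lock_guard<std::mutex> lock(g_mutex);
  for (auto& cr : g_comms) {
    if (cr.device != device) {
      continue;
    }
    auto it = cr.handles.find(ptr);
    if (it != cr.handles.end()) {
      ncclCommDeregister(cr.comm, it->second);
      cr.handles.erase(it);
    }
  }
}

} // anonymous namespace

// First bullet: hook the allocator once, e.g. when the first ProcessGroupNCCL is
// created. SEGMENT_ALLOC -> register, SEGMENT_FREE -> deregister (before cudaFree).
void attachSegmentHooks() {
  using namespace c10::cuda::CUDACachingAllocator;
  attachAllocatorTraceTracker([](const TraceEntry& te) {
    if (te.action_ == TraceEntry::SEGMENT_ALLOC) {
      registerSegment(reinterpret_cast<void*>(te.addr_), te.size_, te.device_);
    } else if (te.action_ == TraceEntry::SEGMENT_FREE) {
      deregisterSegment(reinterpret_cast<void*>(te.addr_), te.device_);
    }
  });
}

// Second bullet: when a new communicator is created, walk the allocator snapshot
// and register every segment that already exists on that device.
// (SegmentInfo field names here are assumptions.)
void registerExistingSegments(ncclComm_t comm, int device) {
  CommRegistrations cr;
  cr.comm = comm;
  cr.device = device;
  const auto snap = c10::cuda::CUDACachingAllocator::snapshot();
  for (const auto& seg : snap.segments) {
    if (seg.device != device) {
      continue;
    }
    void* ptr = reinterpret_cast<void*>(seg.address);
    void* handle = nullptr;
    if (ncclCommRegister(comm, ptr, seg.total_size, &handle) == ncclSuccess) {
      cr.handles.emplace(ptr, handle);
    }
  }
  std::lock_guard<std::mutex> lock(g_mutex);
  g_comms.push_back(std::move(cr));
}
```

With this bookkeeping in place, abort handling (the last bullet) reduces to walking the aborted communicator's recorded handles and calling `ncclCommDeregister` on each before the communicator is destroyed.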

Test Plan: See test in D50726971

Reviewed By: wconstab

Differential Revision: D50726970

pytorch-bot (bot) commented Nov 3, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112850

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e0270ca with merge base 9c1fb2c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D50726970

minsii added a commit to minsii/pytorch that referenced this pull request Nov 3, 2023

@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D50726970

minsii added a commit to minsii/pytorch that referenced this pull request Nov 5, 2023

Test Plan:
```
 NCCL_CTRAN_REGISTER=1 NCCL_ALLGATHER_ALGO=ctran:direct NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL pytest test/distributed/test_c10d_nccl.py -vsk test_tensor_register_hook
```

```
============================= test session starts ==============================
platform linux -- Python 3.10.13, pytest-7.4.3, pluggy-1.3.0 -- /home/msi/conda/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/data/users/msi/git/pytorch/.hypothesis/examples'))
rootdir: /data/users/msi/git/pytorch
configfile: pytest.ini
plugins: typeguard-3.0.2, hypothesis-6.88.1
collecting ... collected 153 items / 152 deselected / 1 selected

test/distributed/test_c10d_nccl.py::ProcessGroupNCCLTest::test_tensor_register_hook NCCL version 2.18.3meta-exp git-bc6c420+cuda11.8

2023-11-03T11:05:30-0700 devgpu001:3050625:3052473 [1] 1210737 ctran/mapper/ctranMapper.cc:67 NCCL WARN CTRAN: IB backend not enabled
2023-11-03T11:05:30-0700 devgpu001:3050625:3050625 [1] 1211154 NCCL INFO CTRAN-MAPPER: register buffer 0x7fc49a400000 len 2097152 (cached 0 registered 1 total cached 0 total registered 1 total dynamically registered 0)
2023-11-03T11:05:30-0700 devgpu001:3050625:3050625 [1] 1211276 NCCL INFO AllGather: opCount 0 sendbuff 0x7fc49a400000 recvbuff 0x7fc49a400200 count 8 datatype 0 op 0 root 0 comm 0x55063ff0 commHash 5314677976282377676 [nranks=2] stream 0x54809370

2023-11-03T11:05:30-0700 devgpu001:3050624:3051241 [0] 4572989 ctran/mapper/ctranMapper.cc:67 NCCL WARN CTRAN: IB backend not enabled
2023-11-03T11:05:30-0700 devgpu001:3050624:3050624 [0] 4573408 NCCL INFO CTRAN-MAPPER: register buffer 0x7f0fec400000 len 2097152 (cached 0 registered 1 total cached 0 total registered 1 total dynamically registered 0)
2023-11-03T11:05:30-0700 devgpu001:3050624:3050624 [0] 4573527 NCCL INFO AllGather: opCount 0 sendbuff 0x7f0fec400000 recvbuff 0x7f0fec400200 count 8 datatype 0 op 0 root 0 comm 0x41a01590 commHash 5314677976282377676 [nranks=2] stream 0x4118d470
PASSED [13.2144s]

====================== 1 passed, 152 deselected in 18.02s ======================
```

## Facebook
Performance study with xlformer 150b model on 16 nodes:
https://docs.google.com/document/d/1YJe1yplTb4IE2TtpYiuHTCTOZHJ10OxLm4JZXX95wfE/edit?usp=sharing

Reviewed By: wconstab

Differential Revision: D50726970

@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D50726970

@facebook-github-bot (Contributor) commented:
@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

pytorch-bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Nov 6, 2023
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Nov 7, 2023

Pull Request resolved: pytorch#112850
Approved by: https://github.com/wconstab
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023

Labels: ciflow/trunk (Trigger trunk jobs on your pull request), fb-exported, Merged, release notes: distributed (c10d)

4 participants