[cuda] introduce trace tracker callback in cache allocator #112238
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112238
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (4 unrelated failures) As of commit f289e51 with merge base 9d09d29: UNSTABLE - the following jobs failed, but they were likely due to flakiness present on trunk and have been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D50726971
This looks good; I only have minor comments. It will need some tests from within PyTorch to make sure the events get called.
Addressed comments and added a unit test.
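Not the exact unit test added in this PR, but a minimal sketch of how such a check could look from within PyTorch. It assumes the `attachAllocatorTraceTracker` entry point and `TraceEntry` record described in the summary below, and that attaching a tracker is enough for trace entries to be delivered (otherwise history recording would need to be enabled first); exact headers, names, and signatures may differ.

```cpp
// Hypothetical sketch, not the actual test from this PR.
#include <gtest/gtest.h>
#include <atomic>
#include <c10/cuda/CUDACachingAllocator.h>

namespace alloc = c10::cuda::CUDACachingAllocator;

// Static so it safely outlives the test body; the sketch assumes there is no
// way to detach a tracker once it has been attached.
static std::atomic<int> num_trace_events{0};

TEST(CUDACachingAllocatorTest, TraceTrackerCallbackFires) {
  // Attach a tracker that counts every recorded TraceEntry.
  alloc::attachAllocatorTraceTracker(
      [](const alloc::TraceEntry& /*entry*/) { ++num_trace_events; });

  // Trigger allocator activity so alloc/free trace entries are recorded.
  void* ptr = alloc::raw_alloc(1024);
  alloc::raw_delete(ptr);

  // The allocation and the free should each have produced at least one event.
  EXPECT_GE(num_trace_events.load(), 2);
}
```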
@pytorchbot label "topic: not user facing"
Looks good! Thanks for getting that test to run
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Summary:
This patch prototypes a trace tracker callback mechanism based on the existing TraceEntry records.
- It allows code outside the caching allocator to attach trace tracker callbacks.
- When a TraceEntry is recorded, all attached callbacks are triggered; a callback can act selectively based on the trace action.
- **RISK**: An attached callback is invoked from within the allocator's call stack (e.g., a free during an allocate call). A deadlock can occur if the callback takes other locks that have an interdependency with the device allocator lock. It is the callback developer's responsibility to avoid any potential deadlock.
- **ADVICE**: The callback mechanism is designed **only for PyTorch internal use**. We should not expose it to the Python layer, because the Python GIL would cause a deadlock.

See the example in D50726970, which attaches NCCL register/deregister hooks via the trace tracker callback so that all CUDA segments allocated by the allocator can be registered with NCCL communicators before any NCCL communication happens. This enables fast zero-copy algorithms in NCCL.
Reviewed By: zdevito
Differential Revision: D50726971
Pull Request resolved: pytorch#112238
Approved by: https://github.com/zdevito
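As a rough illustration of the D50726970 use case mentioned above (not the actual diff), the sketch below attaches a tracker that registers each newly allocated CUDA segment with an NCCL communicator and deregisters it when the segment is freed. The allocator-side names (`attachAllocatorTraceTracker`, `TraceEntry`, the `SEGMENT_ALLOC`/`SEGMENT_FREE` actions, and the `addr_`/`size_` fields) follow this PR's description and may not match the final API exactly; `ncclCommRegister`/`ncclCommDeregister` are NCCL's user buffer registration calls (NCCL 2.19+).

```cpp
// Hypothetical sketch of the NCCL registration hook described above; names
// and field types follow the PR description and may differ from the final API.
#include <cstdint>
#include <mutex>
#include <unordered_map>

#include <c10/cuda/CUDACachingAllocator.h>
#include <nccl.h>

namespace alloc = c10::cuda::CUDACachingAllocator;

// Registration handles returned by ncclCommRegister, keyed by segment address.
static std::mutex handles_mutex;
static std::unordered_map<uintptr_t, void*> segment_handles;

void attachNcclSegmentHooks(ncclComm_t comm) {
  alloc::attachAllocatorTraceTracker([comm](const alloc::TraceEntry& te) {
    // NOTE: this runs inside the allocator call stack (see RISK above), so it
    // must not take any lock that could interdepend with the allocator lock.
    if (te.action_ == alloc::TraceEntry::SEGMENT_ALLOC) {
      void* handle = nullptr;
      // Register the whole cudaMalloc'd segment with the communicator so that
      // later collectives can use zero-copy paths on this memory.
      if (ncclCommRegister(comm, reinterpret_cast<void*>(te.addr_),
                           te.size_, &handle) == ncclSuccess) {
        std::lock_guard<std::mutex> guard(handles_mutex);
        segment_handles[te.addr_] = handle;
      }
    } else if (te.action_ == alloc::TraceEntry::SEGMENT_FREE) {
      void* handle = nullptr;
      {
        std::lock_guard<std::mutex> guard(handles_mutex);
        auto it = segment_handles.find(te.addr_);
        if (it != segment_handles.end()) {
          handle = it->second;
          segment_handles.erase(it);
        }
      }
      // Deregister before the segment is returned to the CUDA driver.
      if (handle != nullptr) {
        ncclCommDeregister(comm, handle);
      }
    }
  });
}
```

The bookkeeping mutex here is taken only inside the callback body itself, so it cannot form a lock cycle with the device allocator lock described in the RISK bullet.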