[cuda] introduce trace tracker callback in cache allocator #112238
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112238
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (4 unrelated failures) As of commit f289e51 with merge base 9d09d29: UNSTABLE - the following jobs failed, but they were likely due to flakiness present on trunk and have been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D50726971
This looks good; I only have minor comments. It will need some tests from within PyTorch to make sure the events get called.
Addressed comments and added a unit test.
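Not the exact unit test added in this PR, but a minimal sketch of how such a check could look from within PyTorch. It assumes the `attachAllocatorTraceTracker` entry point and `TraceEntry` record described in the summary below, and that attaching a tracker is enough for trace entries to be delivered (otherwise history recording would need to be enabled first); exact headers, names, and signatures may differ.

```cpp
// Hypothetical sketch, not the actual test from this PR.
#include <gtest/gtest.h>
#include <atomic>
#include <c10/cuda/CUDACachingAllocator.h>

namespace alloc = c10::cuda::CUDACachingAllocator;

// Static so it safely outlives the test body; the sketch assumes there is no
// way to detach a tracker once it has been attached.
static std::atomic<int> num_trace_events{0};

TEST(CUDACachingAllocatorTest, TraceTrackerCallbackFires) {
  // Attach a tracker that counts every recorded TraceEntry.
  alloc::attachAllocatorTraceTracker(
      [](const alloc::TraceEntry& /*entry*/) { ++num_trace_events; });

  // Trigger allocator activity so alloc/free trace entries are recorded.
  void* ptr = alloc::raw_alloc(1024);
  alloc::raw_delete(ptr);

  // The allocation and the free should each have produced at least one event.
  EXPECT_GE(num_trace_events.load(), 2);
}
```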
@pytorchbot label "topic: not user facing"
Looks good! Thanks for getting that test to run
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Summary:
This patch prototypes a trace tracker callback mechanism based on the existing TraceEntry records.
- It allows code outside the caching allocator to attach trace tracker callbacks.
- When a TraceEntry is recorded, all attached callbacks are triggered; a callback can act selectively based on the trace action.
- **RISK**: An attached callback is invoked from within the allocator's call stack (e.g., a free during an allocate call). A deadlock can occur if the callback takes other locks that have an interdependency with the device allocator lock. It is the callback developer's responsibility to avoid any potential deadlock.
- **ADVICE**: The callback mechanism is designed **only for PyTorch internal use**. We should not expose it to the Python layer, because the Python GIL would cause a deadlock.

See the example in D50726970, which attaches NCCL register/deregister hooks via the trace tracker callback so that all CUDA segments allocated by the allocator can be registered with NCCL communicators before any NCCL communication happens. This enables fast zero-copy algorithms in NCCL.
Reviewed By: zdevito
Differential Revision: D50726971
Pull Request resolved: pytorch#112238
Approved by: https://github.com/zdevito
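As a rough illustration of the D50726970 use case mentioned above (not the actual diff), the sketch below attaches a tracker that registers each newly allocated CUDA segment with an NCCL communicator and deregisters it when the segment is freed. The allocator-side names (`attachAllocatorTraceTracker`, `TraceEntry`, the `SEGMENT_ALLOC`/`SEGMENT_FREE` actions, and the `addr_`/`size_` fields) follow this PR's description and may not match the final API exactly; `ncclCommRegister`/`ncclCommDeregister` are NCCL's user buffer registration calls (NCCL 2.19+).

```cpp
// Hypothetical sketch of the NCCL registration hook described above; names
// and field types follow the PR description and may differ from the final API.
#include <cstdint>
#include <mutex>
#include <unordered_map>

#include <c10/cuda/CUDACachingAllocator.h>
#include <nccl.h>

namespace alloc = c10::cuda::CUDACachingAllocator;

// Registration handles returned by ncclCommRegister, keyed by segment address.
static std::mutex handles_mutex;
static std::unordered_map<uintptr_t, void*> segment_handles;

void attachNcclSegmentHooks(ncclComm_t comm) {
  alloc::attachAllocatorTraceTracker([comm](const alloc::TraceEntry& te) {
    // NOTE: this runs inside the allocator call stack (see RISK above), so it
    // must not take any lock that could interdepend with the allocator lock.
    if (te.action_ == alloc::TraceEntry::SEGMENT_ALLOC) {
      void* handle = nullptr;
      // Register the whole cudaMalloc'd segment with the communicator so that
      // later collectives can use zero-copy paths on this memory.
      if (ncclCommRegister(comm, reinterpret_cast<void*>(te.addr_),
                           te.size_, &handle) == ncclSuccess) {
        std::lock_guard<std::mutex> guard(handles_mutex);
        segment_handles[te.addr_] = handle;
      }
    } else if (te.action_ == alloc::TraceEntry::SEGMENT_FREE) {
      void* handle = nullptr;
      {
        std::lock_guard<std::mutex> guard(handles_mutex);
        auto it = segment_handles.find(te.addr_);
        if (it != segment_handles.end()) {
          handle = it->second;
          segment_handles.erase(it);
        }
      }
      // Deregister before the segment is returned to the CUDA driver.
      if (handle != nullptr) {
        ncclCommDeregister(comm, handle);
      }
    }
  });
}
```

The bookkeeping mutex here is taken only inside the callback body itself, so it cannot form a lock cycle with the device allocator lock described in the RISK bullet.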