
Conversation

minsii
Contributor

@minsii minsii commented Oct 27, 2023

Summary:
This patch prototypes a trace tracker callback mechanism based on existing TraceEntry records.

  • It allows code outside the caching allocator to "attach" trace tracker callbacks.
  • When a TraceEntry is recorded, it triggers all attached callbacks; a callback can act selectively based on the trace action (see the sketch below).
  • RISK: An attached callback is invoked from within the allocator call stack (e.g., a free during an allocate call). A deadlock may occur if the callback acquires other locks that have an interdependency with the device allocator lock. It is the callback developer's responsibility to avoid any potential deadlock.
  • ADVICE: The callback mechanism is designed only for PyTorch internal use. We should not expose it to the Python layer, since holding the Python GIL inside the allocator could cause a deadlock.
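A minimal sketch of how an internal component might attach a tracker. It assumes the attach entry point is named along the lines of attachAllocatorTraceTracker and that TraceEntry exposes an action_ field with values such as SEGMENT_ALLOC/SEGMENT_FREE; these names follow the CUDACachingAllocator headers but are not verified against this diff:

```cpp
// Sketch only: attachAllocatorTraceTracker and the TraceEntry members used
// below are assumptions based on this PR's description, not a verified API.
#include <c10/cuda/CUDACachingAllocator.h>

namespace {

using c10::cuda::CUDACachingAllocator::TraceEntry;

void onAllocatorTrace(const TraceEntry& te) {
  // This runs inside the allocator call stack while the device allocator
  // lock is held, so do not acquire interdependent locks here (see RISK).
  switch (te.action_) {
    case TraceEntry::SEGMENT_ALLOC:
      // React to a new cudaMalloc'd segment.
      break;
    case TraceEntry::SEGMENT_FREE:
      // React to a segment being returned to CUDA.
      break;
    default:
      // Ignore actions this tracker does not care about.
      break;
  }
}

} // namespace

void registerMyTracker() {
  // Attach once during initialization; every TraceEntry recorded afterwards
  // invokes the callback.
  c10::cuda::CUDACachingAllocator::attachAllocatorTraceTracker(&onAllocatorTrace);
}
```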

See the example in D50726970, which attaches NCCL register/deregister hooks via the trace tracker callback so that all CUDA segments allocated by the allocator can be registered with NCCL communicators before any NCCL communication happens. This enables fast zero-copy algorithms in NCCL.
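An illustrative sketch of that idea, not the actual D50726970 code: it assumes NCCL >= 2.19 (where ncclCommRegister/ncclCommDeregister exist) and the TraceEntry field names assumed in the previous sketch (action_, addr_, size_).

```cpp
// Hypothetical NCCL hook: register each allocator segment as it is created,
// deregister it when the segment is released back to CUDA.
#include <c10/cuda/CUDACachingAllocator.h>
#include <nccl.h>
#include <unordered_map>

using c10::cuda::CUDACachingAllocator::TraceEntry;

// Segment base address -> NCCL registration handle.
static std::unordered_map<void*, void*> g_handles;

void ncclSegmentHook(const TraceEntry& te, ncclComm_t comm) {
  // NOTE: this runs inside the allocator call stack (the RISK above); a real
  // hook must ensure these calls cannot deadlock with the allocator lock.
  void* addr = reinterpret_cast<void*>(te.addr_);
  if (te.action_ == TraceEntry::SEGMENT_ALLOC) {
    void* handle = nullptr;
    // Register the freshly cudaMalloc'd segment so NCCL can use zero-copy
    // transports for buffers carved out of it.
    if (ncclCommRegister(comm, addr, te.size_, &handle) == ncclSuccess) {
      g_handles[addr] = handle;
    }
  } else if (te.action_ == TraceEntry::SEGMENT_FREE) {
    auto it = g_handles.find(addr);
    if (it != g_handles.end()) {
      ncclCommDeregister(comm, it->second);
      g_handles.erase(it);
    }
  }
}
```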

Differential Revision: D50726971

@pytorch-bot

pytorch-bot bot commented Oct 27, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112238

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures)

As of commit f289e51 with merge base 9d09d29:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D50726971

@minsii minsii force-pushed the export-D50726971 branch 2 times, most recently from 4e2a581 to 3e1dcd0 on October 27, 2023 16:47

Contributor

@zdevito zdevito left a comment


This looks good, I only have minor comments. It will need some tests from within PyTorch to make sure the events get called.


@minsii
Contributor Author

minsii commented Oct 28, 2023

Addressed comments and added a unit test
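For reference, a hedged sketch of the shape such a unit test could take; the actual test added in this PR may differ in location and API, and attachAllocatorTraceTracker plus raw_alloc/raw_delete usage are assumptions:

```cpp
// Sketch: attach a tracker, drive the caching allocator, verify the
// callback observed trace events.
#include <gtest/gtest.h>
#include <c10/cuda/CUDACachingAllocator.h>

TEST(AllocatorTraceTracker, CallbackFires) {
  static int seen = 0;
  // Trackers cannot be detached in this prototype, so use static state.
  // If the implementation only records traces while history recording is
  // enabled, the test would first turn that on (assumption).
  c10::cuda::CUDACachingAllocator::attachAllocatorTraceTracker(
      [](const c10::cuda::CUDACachingAllocator::TraceEntry&) { ++seen; });

  // Allocating and freeing through the caching allocator should record
  // TraceEntry events (ALLOC/FREE, plus SEGMENT_ALLOC on a fresh segment).
  void* p = c10::cuda::CUDACachingAllocator::raw_alloc(1024);
  c10::cuda::CUDACachingAllocator::raw_delete(p);

  EXPECT_GT(seen, 0);
}
```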

@minsii
Contributor Author

minsii commented Oct 29, 2023

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing label Oct 29, 2023

@minsii minsii force-pushed the export-D50726971 branch 2 times, most recently from 0cc4515 to 213be0b on October 30, 2023 19:21

Contributor

@zdevito zdevito left a comment


Looks good! Thanks for getting that test to run

…pytorch#112238)

Reviewed By: zdevito

Differential Revision: D50726971

@minsii minsii changed the title [prototype][cuda] introduce trace tracker callback in cache allocator [cuda] introduce trace tracker callback in cache allocator Nov 2, 2023
@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Nov 3, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Nov 7, 2023
…12238)


Pull Request resolved: pytorch#112238
Approved by: https://github.com/zdevito
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023
…12238)


Labels

ciflow/trunk, fb-exported, Merged, topic: not user facing
