-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Fixed dead lock in execution trace #136892
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136892
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 8379465 with merge base 10f16cc ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
This pull request was exported from Phabricator. Differential Revision: D63556608 |
f198602 to
f0e0469
Compare
|
This pull request was exported from Phabricator. Differential Revision: D63556608 |
f0e0469 to
d073b06
Compare
|
This pull request was exported from Phabricator. Differential Revision: D63556608 |
|
What use case was impacting this? Profiler only? |
d073b06 to
b35fc36
Compare
Summary: This DIFF is to fix dead lock issue in execution issue. ExecutionTraceObserver get a lock in recordOperatorStart and onFunctionExit. However, inside these two functions, the input/ouput values are evaluated, which will triger python GIL in some use cases. In this case, the lock order is ET locker -> GIL. One of the ads application get GIL first, then call all-gather to collect some metrics from all ranks. When ET is on, all-gather is captured by ET observer. In this case, the lock order is: GIL -> ET locker That is the reason why dead lock happens. To fix it, I changed the ET locker scope, so the input/output evaluation is no longer inside the scope of the ET locker. Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda Reviewed By: aaronenyeshi Differential Revision: D63556608
|
This pull request was exported from Phabricator. Differential Revision: D63556608 |
Summary: This DIFF is to fix dead lock issue in execution issue. ExecutionTraceObserver get a lock in recordOperatorStart and onFunctionExit. However, inside these two functions, the input/ouput values are evaluated, which will triger python GIL in some use cases. In this case, the lock order is ET locker -> GIL. One of the ads application get GIL first, then call all-gather to collect some metrics from all ranks. When ET is on, all-gather is captured by ET observer. In this case, the lock order is: GIL -> ET locker That is the reason why dead lock happens. To fix it, I changed the ET locker scope, so the input/output evaluation is no longer inside the scope of the ET locker. I created a stress test in this diff https://www.internalfb.com/diff/D64220205 Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda Reviewed By: aaronenyeshi Differential Revision: D63556608
b35fc36 to
8379465
Compare
|
This pull request was exported from Phabricator. Differential Revision: D63556608 |
|
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged) |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Summary: This DIFF is to fix dead lock issue in execution issue. ExecutionTraceObserver get a lock in recordOperatorStart and onFunctionExit. However, inside these two functions, the input/ouput values are evaluated, which will triger python GIL in some use cases. In this case, the lock order is ET locker -> GIL. One of the ads application get GIL first, then call all-gather to collect some metrics from all ranks. When ET is on, all-gather is captured by ET observer. In this case, the lock order is: GIL -> ET locker That is the reason why dead lock happens. To fix it, I changed the ET locker scope, so the input/output evaluation is no longer inside the scope of the ET locker. Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda Differential Revision: D63556608 Pull Request resolved: #136892 Approved by: https://github.com/aaronenyeshi
Summary:
This DIFF is to fix dead lock issue in execution issue. ExecutionTraceObserver get a lock in recordOperatorStart and onFunctionExit. However, inside these two functions, the input/ouput values are evaluated, which will triger python GIL in some use cases. In this case, the lock order is ET locker -> GIL.
One of the ads application get GIL first, then call all-gather to collect some metrics from all ranks. When ET is on, all-gather is captured by ET observer. In this case, the lock order is: GIL -> ET locker
That is the reason why dead lock happens. To fix it, I changed the ET locker scope, so the input/output evaluation is no longer inside the scope of the ET locker.
Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda
Differential Revision: D63556608