Back out "[Kineto] Initialize libkineto profilers during torch init process during pybind set-up (#112623)" #116201
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/116201
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 78e1a8c with merge base 06ae9b7.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D52339142
Force-pushed from b7b381c to 4ca9f97.
This pull request was exported from Phabricator. Differential Revision: D52339142
1 similar comment
Force-pushed from 4ca9f97 to e0557dc.
Force-pushed from e0557dc to cfadea5.
Back out "[Kineto] Initialize libkineto profilers during torch init process during pybind set-up (pytorch#112623)" (pytorch#116201)
Summary: This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error. https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095
Test Plan: CI
Differential Revision: D52339142
This pull request was exported from Phabricator. Differential Revision: D52339142
Force-pushed from cfadea5 to 78e1a8c.
This pull request was exported from Phabricator. Differential Revision: D52339142
This was landed more than a month ago. Why are we only seeing it now?
@albanD We detected this issue one month ago: pytorch/benchmark#2064. The error could be caused by CUDA memory OOM (cuBLAS couldn't allocate a new handle). I am wondering if we could run the memory profiler and submit a forward fix.
That is an interesting hypothesis. I had to back out a different lazy CUPTI init as well due to some crashes, and I am talking to NVIDIA about it, so we are safe to back out this diff until then. Do we also want to integrate the memory profiler or memory snapshot into TorchBench for future debugging?
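For reference, a minimal sketch of the memory-snapshot workflow suggested above, assuming a recent PyTorch build where the private `torch.cuda.memory._record_memory_history` / `_dump_snapshot` APIs are available (they are underscore-prefixed and may change); the workload and filename below are placeholders, not part of this PR.

```python
# Sketch: record a CUDA memory history around a benchmark iteration and dump a
# snapshot that can be inspected at https://pytorch.org/memory_viz.
import torch

torch.cuda.memory._record_memory_history(max_entries=100_000)

# Placeholder workload; in TorchBench this would be one model iteration,
# e.g. llama_v2_7b_16h.
x = torch.randn(1024, 1024, device="cuda")
y = x @ x

# Write the snapshot to disk (hypothetical filename) and stop recording.
torch.cuda.memory._dump_snapshot("llama_v2_7b_16h_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)
```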
@pytorchbot merge -f 'Landed internally' (Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Great idea! I will work on that. |
I wonder if this is also a reason for cuInit being called during `import torch`.
Back out "[Kineto] Initialize libkineto profilers during torch init process during pybind set-up (#112623)" (#116201)
Summary: This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error. https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095
Test Plan: CI
Differential Revision: D52339142
Pull Request resolved: #116201
Approved by: https://github.com/xuzhao9
(cherry picked from commit a357a0f)
Back out "[Kineto] Initialize libkineto profilers during torch init process during pybind set-up (#112623)" (#116201) (#116332)
Summary: This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error. https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095
Test Plan: CI
Differential Revision: D52339142
Pull Request resolved: #116201
Approved by: https://github.com/xuzhao9
(cherry picked from commit a357a0f)
Co-authored-by: Aaron Shi <aaronshi@meta.com>
By making a driver API call in a subprocess and expecting it to return `CUDA_ERROR_NOT_INITIALIZED`.
Test Plan: run it on nightlies before #116201 got reverted and observe the failure.
This is very important for lots of distributed launchers.
Fixes #116276
Pull Request resolved: #117010
Approved by: https://github.com/albanD
By making a driver API call in a subprocess and expecting it to return `CUDA_ERROR_NOT_INITIALIZED`.
Test Plan: run it on nightlies before #116201 got reverted and observe the failure.
This is very important for lots of distributed launchers.
Fixes #116276
Cherry-pick of #117010 into release/2.2
Co-authored-by: Nikita Shulga <nshulga@meta.com>
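For illustration, a hedged sketch (not the exact test from #117010) of how such a regression check can be written: import torch in a subprocess, then call a CUDA driver API through ctypes and assert it reports `CUDA_ERROR_NOT_INITIALIZED` (error code 3), which proves that the import alone did not call cuInit. It assumes a Linux host with `libcuda.so.1` available.

```python
# Sketch: verify that `import torch` does not initialize the CUDA driver.
import subprocess
import sys
import textwrap

CHILD = textwrap.dedent("""
    import ctypes
    import torch  # must not trigger cuInit by itself

    CUDA_ERROR_NOT_INITIALIZED = 3
    libcuda = ctypes.CDLL("libcuda.so.1")  # Linux-only assumption
    count = ctypes.c_int()
    rc = libcuda.cuDeviceGetCount(ctypes.byref(count))
    assert rc == CUDA_ERROR_NOT_INITIALIZED, f"cuInit was already called (rc={rc})"
""")

result = subprocess.run([sys.executable, "-c", CHILD], capture_output=True, text=True)
print("PASS" if result.returncode == 0 else f"FAIL:\n{result.stderr}")
```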
Summary:
This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error.
https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095
Test Plan: CI
Differential Revision: D52339142
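To make the hypothesis above concrete, here is a minimal sketch of the suspected failure mode: if nearly all device memory is consumed before the first GEMM, the lazily created cuBLAS handle/workspace may fail to allocate and surface a cuBLAS init error rather than a plain OOM. This is an illustration only and may or may not reproduce on a given GPU; the sizes are arbitrary.

```python
# Sketch: exhaust most device memory, then trigger cuBLAS handle creation.
import torch

assert torch.cuda.is_available()
device = torch.device("cuda")

hog = None
try:
    # Grab most of the free device memory so little is left for cuBLAS.
    free_bytes, _total = torch.cuda.mem_get_info(device)
    hog = torch.empty(int(free_bytes * 0.98), dtype=torch.uint8, device=device)

    a = torch.randn(512, 512, device=device)
    b = torch.randn(512, 512, device=device)
    _ = a @ b  # first matmul creates the cuBLAS handle and its workspace
except RuntimeError as err:
    print(f"failed under memory pressure: {err}")
finally:
    del hog
    torch.cuda.empty_cache()
```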