Back out "[Kineto] Initialize libkineto profilers during torch init process during pybind set-up (#112623)" #116201

Closed
aaronenyeshi wants to merge 1 commit

Conversation

aaronenyeshi
Member

Summary:
This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error.
https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095

Test Plan: CI

Differential Revision: D52339142


pytorch-bot bot commented Dec 20, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/116201

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 78e1a8c with merge base 06ae9b7:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D52339142

aaronenyeshi added a commit to aaronenyeshi/pytorch that referenced this pull request Dec 20, 2023
…rocess during pybind set-up (pytorch#112623)" (pytorch#116201)

Summary:

This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error.
https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095

Test Plan: CI

Differential Revision: D52339142
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D52339142

1 similar comment
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D52339142

aaronenyeshi added a commit to aaronenyeshi/pytorch that referenced this pull request Dec 20, 2023
…rocess during pybind set-up (pytorch#112623)" (pytorch#116201)

Summary:
Pull Request resolved: pytorch#116201

This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error.
https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095

Test Plan: CI

Differential Revision: D52339142

fbshipit-source-id: 1d9586c1fd205c5636fecee08c617a0548e6c8a2
aaronenyeshi added a commit to aaronenyeshi/pytorch that referenced this pull request Dec 20, 2023
…rocess during pybind set-up (pytorch#112623)" (pytorch#116201)

Summary:

This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error.
https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095

Test Plan: CI

Differential Revision: D52339142
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D52339142

@aaronenyeshi added the topic: not user facing and ciflow/trunk labels Dec 20, 2023
…rocess during pybind set-up (pytorch#112623)" (pytorch#116201)

Summary:

This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error.
https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095

Test Plan: CI

Differential Revision: D52339142
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D52339142

@albanD
Collaborator

albanD commented Dec 20, 2023

This was landed more than a month ago. Why are we only seeing it now?

@xuzhao9
Contributor

xuzhao9 commented Dec 20, 2023

@albanD We detected this issue one month ago: pytorch/benchmark#2064

This error could be caused by a CUDA out-of-memory condition (cuBLAS couldn't allocate a new handle). I am wondering if we could run the memory profiler and submit a forward fix.

@aaronenyeshi
Member (Author)

> @albanD We detected this issue one month ago: pytorch/benchmark#2064
>
> This error could be caused by a CUDA out-of-memory condition (cuBLAS couldn't allocate a new handle). I am wondering if we could run the memory profiler and submit a forward fix.

That is an interesting hypothesis. I had to back out a different lazy CUPTI init as well, due to some crashes, and am talking to NVIDIA about it. So we are safe to back out this diff until then. Do we also want to integrate the memory profiler or memory snapshot into TorchBench for future debugging?
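
For reference, a minimal sketch of what wiring a CUDA memory snapshot around such a benchmark run could look like. This is only an illustration, not TorchBench code: `run_benchmark` is a hypothetical entry point, and the underscore-prefixed snapshot APIs (`torch.cuda.memory._record_memory_history`, `torch.cuda.memory._dump_snapshot`) are the ones available in recent PyTorch builds and may change.

```python
import torch

def run_with_memory_snapshot(run_benchmark, out_path="llama_v2_7b_16h_mem.pickle"):
    # Record allocator events (with stack traces) so the snapshot shows who is
    # holding CUDA memory at the point where cuBLAS fails to create its handle.
    torch.cuda.memory._record_memory_history(max_entries=100_000)
    try:
        run_benchmark()  # hypothetical TorchBench model run
    finally:
        # Dump the snapshot even if the run dies with the cuBLAS init error;
        # the pickle can be inspected at https://pytorch.org/memory_viz.
        torch.cuda.memory._dump_snapshot(out_path)
        torch.cuda.memory._record_memory_history(enabled=None)
```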

@facebook-github-bot
Contributor

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@xuzhao9
Contributor

xuzhao9 commented Dec 21, 2023

> > @albanD We detected this issue one month ago: pytorch/benchmark#2064
> > This error could be caused by a CUDA out-of-memory condition (cuBLAS couldn't allocate a new handle). I am wondering if we could run the memory profiler and submit a forward fix.
>
> That is an interesting hypothesis. I had to back out a different lazy CUPTI init as well, due to some crashes, and am talking to NVIDIA about it. So we are safe to back out this diff until then. Do we also want to integrate the memory profiler or memory snapshot into TorchBench for future debugging?

Great idea! I will work on that.

@malfet added this to the 2.2.0 milestone Dec 21, 2023
@malfet
Contributor

malfet commented Dec 21, 2023

I wonder if this is also a reason for calling cuInit during `import torch`.

malfet pushed a commit that referenced this pull request Dec 22, 2023
…rocess during pybind set-up (#112623)" (#116201)

Summary:
This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error.
https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095

Test Plan: CI

Differential Revision: D52339142

Pull Request resolved: #116201
Approved by: https://github.com/xuzhao9

(cherry picked from commit a357a0f)
atalman pushed a commit that referenced this pull request Dec 24, 2023
…rocess during pybind set-up (#112623)" (#116201) (#116332)

Summary:
This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error.
https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095

Test Plan: CI

Differential Revision: D52339142

Pull Request resolved: #116201
Approved by: https://github.com/xuzhao9

(cherry picked from commit a357a0f)

Co-authored-by: Aaron Shi <aaronshi@meta.com>
pytorchmergebot pushed a commit that referenced this pull request Jan 9, 2024
By making a driver API call in subprocess and expecting it to return `CUDA_ERROR_NOT_INITIALIZED`

Test Plan: run it on nightlies before #116201 got reverted and observe the failure

This is very important for lots of distributed launchers

Fixes #116276

Pull Request resolved: #117010
Approved by: https://github.com/albanD
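
For context, the technique the commit above describes could be exercised with a sketch along these lines. This is not the actual test from #117010, just an assumed illustration for Linux with an NVIDIA driver (libcuda.so.1 on the loader path): spawn a subprocess, `import torch`, then call a CUDA driver API via ctypes and expect `CUDA_ERROR_NOT_INITIALIZED` (error code 3), which indicates cuInit was never called during import.

```python
import subprocess
import sys
import textwrap

CUDA_ERROR_NOT_INITIALIZED = 3  # value from cuda.h

child = textwrap.dedent("""
    import ctypes
    import torch  # importing torch should NOT initialize the CUDA driver

    libcuda = ctypes.CDLL("libcuda.so.1")
    count = ctypes.c_int(0)
    # cuDeviceGetCount returns CUDA_ERROR_NOT_INITIALIZED unless cuInit
    # has already been called in this process.
    print(libcuda.cuDeviceGetCount(ctypes.byref(count)))
""")

out = subprocess.run([sys.executable, "-c", child],
                     capture_output=True, text=True, check=True)
assert int(out.stdout.strip()) == CUDA_ERROR_NOT_INITIALIZED, \
    "import torch unexpectedly initialized the CUDA driver"
```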
atalman pushed a commit to atalman/pytorch that referenced this pull request Jan 9, 2024
By making a driver API call in subprocess and expecting it to return `CUDA_ERROR_NOT_INITIALIZED`

Test Plan: run it on nightlies before pytorch#116201 got reverted and observe the failure

This is very important for lots of distributed launchers

Fixes pytorch#116276

Pull Request resolved: pytorch#117010
Approved by: https://github.com/albanD
malfet added a commit that referenced this pull request Jan 9, 2024
By making a driver API call in subprocess and expecting it to return `CUDA_ERROR_NOT_INITIALIZED`

Test Plan: run it on nightlies before #116201 got reverted and observe the failure

This is very important for lots of distributed launchers

Fixes #116276

Cherry-pick of #117010 into release/2.2

Co-authored-by: Nikita Shulga <nshulga@meta.com>
Labels
ciflow/rocm, ciflow/trunk, fb-exported, Merged, topic: not user facing