Back out "[Kineto] Initialize libkineto profilers during torch init process during pybind set-up (#112623)" #116201

Closed
aaronenyeshi wants to merge 1 commit

Conversation

aaronenyeshi
Member

Summary:
This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error.
https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095

Test Plan: CI

Differential Revision: D52339142


pytorch-bot bot commented Dec 20, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/116201

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 78e1a8c with merge base 06ae9b7:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D52339142

aaronenyeshi added a commit to aaronenyeshi/pytorch that referenced this pull request Dec 20, 2023
…rocess during pybind set-up (pytorch#112623)" (pytorch#116201)

Summary:

This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error.
https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095

Test Plan: CI

Differential Revision: D52339142
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D52339142

1 similar comment
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D52339142

aaronenyeshi added a commit to aaronenyeshi/pytorch that referenced this pull request Dec 20, 2023
…rocess during pybind set-up (pytorch#112623)" (pytorch#116201)

Summary:
Pull Request resolved: pytorch#116201

This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error.
https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095

Test Plan: CI

Differential Revision: D52339142

fbshipit-source-id: 1d9586c1fd205c5636fecee08c617a0548e6c8a2
aaronenyeshi added a commit to aaronenyeshi/pytorch that referenced this pull request Dec 20, 2023
…rocess during pybind set-up (pytorch#112623)" (pytorch#116201)

Summary:

This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error.
https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095

Test Plan: CI

Differential Revision: D52339142
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D52339142

@aaronenyeshi added the topic: not user facing and ciflow/trunk labels Dec 20, 2023
…rocess during pybind set-up (pytorch#112623)" (pytorch#116201)

Summary:

This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error.
https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095

Test Plan: CI

Differential Revision: D52339142
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D52339142

@albanD
Collaborator

albanD commented Dec 20, 2023

This was landed more than a month ago. Why are we only seeing it now?

@xuzhao9
Contributor

xuzhao9 commented Dec 20, 2023

@albanD We detected this issue one month ago: pytorch/benchmark#2064

This error could be caused by a CUDA out-of-memory condition (cuBLAS couldn't allocate a new handle). I am wondering if we could run the memory profiler and submit a forward fix.

@aaronenyeshi
Member (Author)

> @albanD We detected this issue one month ago: pytorch/benchmark#2064
>
> This error could be caused by a CUDA out-of-memory condition (cuBLAS couldn't allocate a new handle). I am wondering if we could run the memory profiler and submit a forward fix.

That is an interesting hypothesis. I had to back out a different lazy CUPTI init as well, due to some crashes, and am talking to NVIDIA about it. So we are safe to back out this diff until then. Do we also want to integrate the memory profiler or memory snapshot into TorchBench for future debugging?
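
For reference, a minimal sketch of what wiring a CUDA memory snapshot around such a benchmark run could look like. This is only an illustration, not TorchBench code: `run_benchmark` is a hypothetical entry point, and the underscore-prefixed snapshot APIs (`torch.cuda.memory._record_memory_history`, `torch.cuda.memory._dump_snapshot`) are the ones available in recent PyTorch builds and may change.

```python
import torch

def run_with_memory_snapshot(run_benchmark, out_path="llama_v2_7b_16h_mem.pickle"):
    # Record allocator events (with stack traces) so the snapshot shows who is
    # holding CUDA memory at the point where cuBLAS fails to create its handle.
    torch.cuda.memory._record_memory_history(max_entries=100_000)
    try:
        run_benchmark()  # hypothetical TorchBench model run
    finally:
        # Dump the snapshot even if the run dies with the cuBLAS init error;
        # the pickle can be inspected at https://pytorch.org/memory_viz.
        torch.cuda.memory._dump_snapshot(out_path)
        torch.cuda.memory._record_memory_history(enabled=None)
```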

@facebook-github-bot
Contributor

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@xuzhao9
Contributor

xuzhao9 commented Dec 21, 2023

> > @albanD We detected this issue one month ago: pytorch/benchmark#2064
> > This error could be caused by a CUDA out-of-memory condition (cuBLAS couldn't allocate a new handle). I am wondering if we could run the memory profiler and submit a forward fix.
>
> That is an interesting hypothesis. I had to back out a different lazy CUPTI init as well, due to some crashes, and am talking to NVIDIA about it. So we are safe to back out this diff until then. Do we also want to integrate the memory profiler or memory snapshot into TorchBench for future debugging?

Great idea! I will work on that.

@malfet added this to the 2.2.0 milestone Dec 21, 2023
@malfet
Contributor

malfet commented Dec 21, 2023

I wonder if this is also a reason for calling cuInit during `import torch`.

malfet pushed a commit that referenced this pull request Dec 22, 2023
…rocess during pybind set-up (#112623)" (#116201)

Summary:
This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error.
https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095

Test Plan: CI

Differential Revision: D52339142

Pull Request resolved: #116201
Approved by: https://github.com/xuzhao9

(cherry picked from commit a357a0f)
atalman pushed a commit that referenced this pull request Dec 24, 2023
…rocess during pybind set-up (#112623)" (#116201) (#116332)

Summary:
This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error.
https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095

Test Plan: CI

Differential Revision: D52339142

Pull Request resolved: #116201
Approved by: https://github.com/xuzhao9

(cherry picked from commit a357a0f)

Co-authored-by: Aaron Shi <aaronshi@meta.com>
pytorchmergebot pushed a commit that referenced this pull request Jan 9, 2024
By making a driver API call in subprocess and expecting it to return `CUDA_ERROR_NOT_INITIALIZED`

Test Plan: run it on nightlies before #116201 got reverted and observe the failure

This is very important for lots of distributed launchers

Fixes #116276

Pull Request resolved: #117010
Approved by: https://github.com/albanD
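
For context, the technique the commit above describes could be exercised with a sketch along these lines. This is not the actual test from #117010, just an assumed illustration for Linux with an NVIDIA driver (libcuda.so.1 on the loader path): spawn a subprocess, `import torch`, then call a CUDA driver API via ctypes and expect `CUDA_ERROR_NOT_INITIALIZED` (error code 3), which indicates cuInit was never called during import.

```python
import subprocess
import sys
import textwrap

CUDA_ERROR_NOT_INITIALIZED = 3  # value from cuda.h

child = textwrap.dedent("""
    import ctypes
    import torch  # importing torch should NOT initialize the CUDA driver

    libcuda = ctypes.CDLL("libcuda.so.1")
    count = ctypes.c_int(0)
    # cuDeviceGetCount returns CUDA_ERROR_NOT_INITIALIZED unless cuInit
    # has already been called in this process.
    print(libcuda.cuDeviceGetCount(ctypes.byref(count)))
""")

out = subprocess.run([sys.executable, "-c", child],
                     capture_output=True, text=True, check=True)
assert int(out.stdout.strip()) == CUDA_ERROR_NOT_INITIALIZED, \
    "import torch unexpectedly initialized the CUDA driver"
```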
atalman pushed a commit to atalman/pytorch that referenced this pull request Jan 9, 2024
By making a driver API call in subprocess and expecting it to return `CUDA_ERROR_NOT_INITIALIZED`

Test Plan: run it on nightlies before pytorch#116201 got reverted and observe the failure

This is very important for lots of distributed launchers

Fixes pytorch#116276

Pull Request resolved: pytorch#117010
Approved by: https://github.com/albanD
malfet added a commit that referenced this pull request Jan 9, 2024
By making a driver API call in subprocess and expecting it to return `CUDA_ERROR_NOT_INITIALIZED`

Test Plan: run it on nightlies before #116201 got reverted and observe the failure

This is very important for lots of distributed launchers

Fixes #116276

Cherry-pick of #117010 into release/2.2

Co-authored-by: Nikita Shulga <nshulga@meta.com>
Labels
ciflow/rocm, ciflow/trunk, fb-exported, Merged, topic: not user facing