Skip to content

Conversation

davidberard98
Copy link
Contributor

@davidberard98 davidberard98 commented Aug 12, 2025

Stack from ghstack (oldest at bottom):

Context: During jit.script, the TorchScript frontend maintains a callstack of Python frames, which is used to present the corresponding user code in case TorchScript errors. The callstack is maintained via ErrorReport::CallStack RAII guards. Before recursing into a function, an ErrorReport::CallStack guard is created and the CallStack guard pushes the frame information onto a thread_local callstack (a list of calls); and after exiting, the frame information is popped off the callstack. Note that the CallStack guards are also sometimes used in python via pybindings.

The problem is that sometimes another thread can obtain a reference to the CallStack guard (if it's a Python CallStack guard). This means that the destructor for a CallStack guard can be called from a different thread than the constructor was called. When this happens, it causes a segfault.

This PR makes the callstack vector thread-safe to access, and each CallStack guard will store a reference to the callstack vector onto which it pushed. When the CallStack guard is destructed, it pops off the appropriate callstack vector. Although this could potentially lead to mangled callstacks, it should prevent segfaults.

Added a test test_thread_safe_error_stacks which segfaults prior to these changes, and no longer segfaults.

cc @EikanWang @jgong5 @wenzhe-nrv @sanchitintel

Differential Revision: D80054972

Copy link

pytorch-bot bot commented Aug 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160386

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4fc831a with merge base f33ce40 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: jit release notes category label Aug 12, 2025
davidberard98 added a commit that referenced this pull request Aug 12, 2025
ghstack-source-id: 6d69083
Pull Request resolved: #160386
@facebook-github-bot facebook-github-bot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Aug 12, 2025
cc EikanWang jgong5 wenzhe-nrv sanchitintel

[ghstack-poisoned]
davidberard98 added a commit that referenced this pull request Aug 12, 2025
ghstack-source-id: c3d2587
Pull Request resolved: #160386
@davidberard98
Copy link
Contributor Author

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 12, 2025
@davidberard98 davidberard98 marked this pull request as draft August 12, 2025 00:56
cc EikanWang jgong5 wenzhe-nrv sanchitintel

Differential Revision: [D80054972](https://our.internmc.facebook.com/intern/diff/D80054972)

[ghstack-poisoned]
davidberard98 added a commit that referenced this pull request Aug 12, 2025
ghstack-source-id: b8dbb61
Pull Request resolved: #160386
cc EikanWang jgong5 wenzhe-nrv sanchitintel

Differential Revision: [D80054972](https://our.internmc.facebook.com/intern/diff/D80054972)

[ghstack-poisoned]
davidberard98 added a commit that referenced this pull request Aug 12, 2025
ghstack-source-id: 1bc9789
Pull Request resolved: #160386
@davidberard98
Copy link
Contributor Author

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@davidberard98 davidberard98 marked this pull request as ready for review August 12, 2025 16:22
@davidberard98 davidberard98 requested a review from eellison August 12, 2025 16:22
Copy link
Contributor

@eellison eellison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good!

Context: During jit.script, the TorchScript frontend maintains a callstack of Python frames, which is used to present the corresponding user code in case TorchScript errors. The callstack is maintained via ErrorReport::CallStack RAII guards. Before recursing into a function, an ErrorReport::CallStack guard is created and the CallStack guard pushes the frame information onto a thread_local callstack (a list of calls); and after exiting, the frame information is popped off the callstack. Note that the CallStack guards are also sometimes used in python via pybindings.

The problem is that sometimes another thread can obtain a reference to the CallStack guard (if it's a Python CallStack guard). **This means that the destructor for a CallStack guard can be called from a different thread than the constructor was called**. When this happens, it causes a segfault.

This PR makes the callstack vector thread-safe to access, and each CallStack guard will store a reference to the callstack vector onto which it pushed. When the CallStack guard is destructed, it pops off the appropriate callstack vector. Although this could potentially lead to mangled callstacks, it should prevent segfaults.

Added a test `test_thread_safe_error_stacks` which segfaults prior to these changes, and no longer segfaults.

cc EikanWang jgong5 wenzhe-nrv sanchitintel

Differential Revision: [D80054972](https://our.internmc.facebook.com/intern/diff/D80054972)

[ghstack-poisoned]
davidberard98 added a commit that referenced this pull request Aug 12, 2025
ghstack-source-id: 568d7b2
Pull Request resolved: #160386
@davidberard98
Copy link
Contributor Author

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Context: During jit.script, the TorchScript frontend maintains a callstack of Python frames, which is used to present the corresponding user code in case TorchScript errors. The callstack is maintained via ErrorReport::CallStack RAII guards. Before recursing into a function, an ErrorReport::CallStack guard is created and the CallStack guard pushes the frame information onto a thread_local callstack (a list of calls); and after exiting, the frame information is popped off the callstack. Note that the CallStack guards are also sometimes used in python via pybindings.

The problem is that sometimes another thread can obtain a reference to the CallStack guard (if it's a Python CallStack guard). **This means that the destructor for a CallStack guard can be called from a different thread than the constructor was called**. When this happens, it causes a segfault.

This PR makes the callstack vector thread-safe to access, and each CallStack guard will store a reference to the callstack vector onto which it pushed. When the CallStack guard is destructed, it pops off the appropriate callstack vector. Although this could potentially lead to mangled callstacks, it should prevent segfaults.

Added a test `test_thread_safe_error_stacks` which segfaults prior to these changes, and no longer segfaults.

cc EikanWang jgong5 wenzhe-nrv sanchitintel

Differential Revision: [D80054972](https://our.internmc.facebook.com/intern/diff/D80054972)

[ghstack-poisoned]
davidberard98 added a commit that referenced this pull request Aug 12, 2025
ghstack-source-id: 2f56244
Pull Request resolved: #160386
@davidberard98
Copy link
Contributor Author

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@davidberard98
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

chuanhaozhuge pushed a commit that referenced this pull request Aug 14, 2025
Context: During jit.script, the TorchScript frontend maintains a callstack of Python frames, which is used to present the corresponding user code in case TorchScript errors. The callstack is maintained via ErrorReport::CallStack RAII guards. Before recursing into a function, an ErrorReport::CallStack guard is created and the CallStack guard pushes the frame information onto a thread_local callstack (a list of calls); and after exiting, the frame information is popped off the callstack. Note that the CallStack guards are also sometimes used in python via pybindings.

The problem is that sometimes another thread can obtain a reference to the CallStack guard (if it's a Python CallStack guard). **This means that the destructor for a CallStack guard can be called from a different thread than the constructor was called**. When this happens, it causes a segfault.

This PR makes the callstack vector thread-safe to access, and each CallStack guard will store a reference to the callstack vector onto which it pushed. When the CallStack guard is destructed, it pops off the appropriate callstack vector. Although this could potentially lead to mangled callstacks, it should prevent segfaults.

Added a test `test_thread_safe_error_stacks` which segfaults prior to these changes, and no longer segfaults.

Differential Revision: [D80054972](https://our.internmc.facebook.com/intern/diff/D80054972)
Pull Request resolved: #160386
Approved by: https://github.com/eellison
chuanhaozhuge pushed a commit that referenced this pull request Aug 18, 2025
Context: During jit.script, the TorchScript frontend maintains a callstack of Python frames, which is used to present the corresponding user code in case TorchScript errors. The callstack is maintained via ErrorReport::CallStack RAII guards. Before recursing into a function, an ErrorReport::CallStack guard is created and the CallStack guard pushes the frame information onto a thread_local callstack (a list of calls); and after exiting, the frame information is popped off the callstack. Note that the CallStack guards are also sometimes used in python via pybindings.

The problem is that sometimes another thread can obtain a reference to the CallStack guard (if it's a Python CallStack guard). **This means that the destructor for a CallStack guard can be called from a different thread than the constructor was called**. When this happens, it causes a segfault.

This PR makes the callstack vector thread-safe to access, and each CallStack guard will store a reference to the callstack vector onto which it pushed. When the CallStack guard is destructed, it pops off the appropriate callstack vector. Although this could potentially lead to mangled callstacks, it should prevent segfaults.

Added a test `test_thread_safe_error_stacks` which segfaults prior to these changes, and no longer segfaults.

Differential Revision: [D80054972](https://our.internmc.facebook.com/intern/diff/D80054972)
Pull Request resolved: #160386
Approved by: https://github.com/eellison
can-gaa-hou pushed a commit to can-gaa-hou/pytorch that referenced this pull request Aug 22, 2025
Context: During jit.script, the TorchScript frontend maintains a callstack of Python frames, which is used to present the corresponding user code in case TorchScript errors. The callstack is maintained via ErrorReport::CallStack RAII guards. Before recursing into a function, an ErrorReport::CallStack guard is created and the CallStack guard pushes the frame information onto a thread_local callstack (a list of calls); and after exiting, the frame information is popped off the callstack. Note that the CallStack guards are also sometimes used in python via pybindings.

The problem is that sometimes another thread can obtain a reference to the CallStack guard (if it's a Python CallStack guard). **This means that the destructor for a CallStack guard can be called from a different thread than the constructor was called**. When this happens, it causes a segfault.

This PR makes the callstack vector thread-safe to access, and each CallStack guard will store a reference to the callstack vector onto which it pushed. When the CallStack guard is destructed, it pops off the appropriate callstack vector. Although this could potentially lead to mangled callstacks, it should prevent segfaults.

Added a test `test_thread_safe_error_stacks` which segfaults prior to these changes, and no longer segfaults.

Differential Revision: [D80054972](https://our.internmc.facebook.com/intern/diff/D80054972)
Pull Request resolved: pytorch#160386
Approved by: https://github.com/eellison
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: jit Add this issue/PR to JIT oncall triage queue release notes: jit release notes category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants