[TorchScript] thread-safe ErrorReport::CallStack #160386

davidberard98 · 2025-08-12T00:53:56Z

Stack from ghstack (oldest at bottom):

-> [TorchScript] thread-safe ErrorReport::CallStack #160386

Context: During jit.script, the TorchScript frontend maintains a callstack of Python frames, which it uses to show the corresponding user code when TorchScript raises an error. The callstack is maintained via ErrorReport::CallStack RAII guards. Before recursing into a function, an ErrorReport::CallStack guard is created and pushes the frame information onto a thread_local callstack (a list of calls); after the function exits, the guard pops the frame information off the callstack. Note that CallStack guards are also sometimes used from Python via Python bindings.

The problem is that another thread can sometimes obtain a reference to a CallStack guard (if it's a Python CallStack guard). **This means that the destructor for a CallStack guard can run on a different thread than the one that ran its constructor.** When this happens, the destructor pops from a thread_local callstack other than the one the guard pushed onto, which causes a segfault.
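The failure mode can be modeled in a few lines of Python. This is only a sketch of the idea, not the C++ implementation: `NaiveCallStackGuard`, `_calls`, and `release_on_other_thread` are hypothetical names, `release()` stands in for the C++ destructor, and Python's `IndexError` stands in for the undefined behavior that shows up as a segfault in C++.

```python
import threading

# Hypothetical stand-in for the thread_local callstack in the C++ frontend.
_tls = threading.local()


def _calls():
    # Each thread lazily gets its own, initially empty, list of calls.
    if not hasattr(_tls, "calls"):
        _tls.calls = []
    return _tls.calls


class NaiveCallStackGuard:
    """Pushes on construction and pops on release, always using the
    callstack of whichever thread happens to be running."""

    def __init__(self, name):
        _calls().append(name)

    def release(self):
        # Stands in for the C++ destructor.
        _calls().pop()


def release_on_other_thread(guard):
    # Run the "destructor" on a second thread and report what happened.
    outcome = {}

    def worker():
        try:
            guard.release()
            outcome["error"] = None
        except IndexError as exc:
            outcome["error"] = exc

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return outcome["error"]


guard = NaiveCallStackGuard("fn")        # pushed onto the main thread's stack
error = release_on_other_thread(guard)   # popped from the worker's empty stack
leftover = list(_calls())                # the main thread's entry is never removed
```

In this model the worker thread pops from its own empty list, and the frame pushed by the main thread is left behind.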

This PR makes the callstack vector thread-safe to access, and each CallStack guard now stores a reference to the callstack vector onto which it pushed. When the guard is destructed, it pops from that same vector, regardless of which thread runs the destructor. Although this could potentially lead to mangled callstacks, it should prevent segfaults.
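A minimal Python sketch of the approach described above, under stated assumptions: `SharedCallStack`, `current_stack`, and `CallStackGuard` are hypothetical names, `release()` stands in for the C++ destructor, and the empty-stack check in `pop` is a defensive choice for this sketch rather than something the PR description confirms.

```python
import threading


class SharedCallStack:
    """A list of calls whose every access is guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._calls = []

    def push(self, name):
        with self._lock:
            self._calls.append(name)

    def pop(self):
        with self._lock:
            # Defensive: never pop an empty stack (assumption of this sketch).
            if self._calls:
                self._calls.pop()

    def snapshot(self):
        with self._lock:
            return list(self._calls)


_tls = threading.local()


def current_stack():
    # Each thread still owns one callstack, created on first use.
    if not hasattr(_tls, "stack"):
        _tls.stack = SharedCallStack()
    return _tls.stack


class CallStackGuard:
    def __init__(self, name):
        # Remember the stack we pushed onto, not just "the current one".
        self._stack = current_stack()
        self._stack.push(name)

    def release(self):
        # Pops the remembered stack, whichever thread calls this.
        self._stack.pop()


main_stack = current_stack()
guard = CallStackGuard("fn")
before = main_stack.snapshot()

# Release the guard from a different thread than the one that created it.
t = threading.Thread(target=guard.release)
t.start()
t.join()
after = main_stack.snapshot()
```

Note that `pop` removes whatever is on top of the remembered stack. If guards are released out of order across threads, the removed entry may not be the one the guard pushed, which is the "mangled callstacks" trade-off mentioned above.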

Added a test, `test_thread_safe_error_stacks`, which segfaults prior to these changes and no longer segfaults with them.

cc @EikanWang @jgong5 @wenzhe-nrv @sanchitintel

Differential Revision: D80054972

[ghstack-poisoned]

pytorch-bot · 2025-08-12T00:53:59Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160386

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4fc831a with merge base f33ce40:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


davidberard98 · 2025-08-12T00:55:36Z

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


davidberard98 · 2025-08-12T03:39:11Z

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

eellison

looks good!


davidberard98 · 2025-08-12T17:00:23Z

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


davidberard98 · 2025-08-12T18:50:26Z

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

davidberard98 · 2025-08-12T19:18:33Z

@pytorchbot merge

pytorchmergebot · 2025-08-12T19:20:33Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


Differential Revision: [D80054972](https://our.internmc.facebook.com/intern/diff/D80054972) Pull Request resolved: #160386 Approved by: https://github.com/eellison

[TorchScript] thread-safe ErrorReport::CallStack

83db7f1

[ghstack-poisoned]

pytorch-bot bot added the `release notes: jit` label (release notes category) Aug 12, 2025

davidberard98 added a commit that referenced this pull request Aug 12, 2025

[TorchScript] thread-safe ErrorReport::CallStack

9bb2038

ghstack-source-id: 6d69083 Pull Request resolved: #160386

facebook-github-bot added the `oncall: jit` label (Add this issue/PR to JIT oncall triage queue) Aug 12, 2025

Update on "[TorchScript] thread-safe ErrorReport::CallStack"

124385b

cc EikanWang jgong5 wenzhe-nrv sanchitintel [ghstack-poisoned]

davidberard98 added a commit that referenced this pull request Aug 12, 2025

[TorchScript] thread-safe ErrorReport::CallStack

2a541a0

ghstack-source-id: c3d2587 Pull Request resolved: #160386

pytorch-bot bot added the `ciflow/trunk` label (Trigger trunk jobs on your pull request) Aug 12, 2025

davidberard98 marked this pull request as draft August 12, 2025 00:56

Update on "[TorchScript] thread-safe ErrorReport::CallStack"

22f4dd8

cc EikanWang jgong5 wenzhe-nrv sanchitintel Differential Revision: [D80054972](https://our.internmc.facebook.com/intern/diff/D80054972) [ghstack-poisoned]

davidberard98 added a commit that referenced this pull request Aug 12, 2025

[TorchScript] thread-safe ErrorReport::CallStack

8ac585a

ghstack-source-id: b8dbb61 Pull Request resolved: #160386

Update on "[TorchScript] thread-safe ErrorReport::CallStack"

604e97f

cc EikanWang jgong5 wenzhe-nrv sanchitintel Differential Revision: [D80054972](https://our.internmc.facebook.com/intern/diff/D80054972) [ghstack-poisoned]

davidberard98 added a commit that referenced this pull request Aug 12, 2025

[TorchScript] thread-safe ErrorReport::CallStack

1ab7992

ghstack-source-id: 1bc9789 Pull Request resolved: #160386

davidberard98 marked this pull request as ready for review August 12, 2025 16:22

davidberard98 requested a review from eellison August 12, 2025 16:22

eellison approved these changes Aug 12, 2025


davidberard98 added a commit that referenced this pull request Aug 12, 2025

[TorchScript] thread-safe ErrorReport::CallStack

3f64f05

ghstack-source-id: 568d7b2 Pull Request resolved: #160386

davidberard98 added a commit that referenced this pull request Aug 12, 2025

[TorchScript] thread-safe ErrorReport::CallStack

810274e

ghstack-source-id: 2f56244 Pull Request resolved: #160386

pytorchmergebot added the merging label Aug 12, 2025

pytorchmergebot added the Merged label Aug 12, 2025

pytorchmergebot closed this in 78a2fe1 Aug 12, 2025

pytorchmergebot removed the merging label Aug 12, 2025
