[C10D] Flight recorder - disable c++ stacktrace by default #114651
Processing C++ stack traces (symbolization) takes a long time on some systems that use a particular version of addr2line. On slow systems, this makes flight-recorder dumping slow enough to time out even on toy programs. Setting TORCH_NCCL_TRACE_CPP_STACK=True re-enables C++ stack trace collection as part of the flight recorder. C++ stack trace collection is fast enough for use on certain OS configurations. We can investigate moving to LLVM's symbolizer as a replacement.
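As a rough illustration of opting back in, the sketch below builds an environment for a launched training process with `TORCH_NCCL_TRACE_CPP_STACK` set or cleared. The helper name `flight_recorder_env` is hypothetical, and the sketch assumes the variable has to be present in the environment before the process that creates the NCCL process group starts; only the variable name and the value `True` come from this PR.

```python
import os


def flight_recorder_env(trace_cpp_stack=False, base_env=None):
    """Return a copy of the environment with C++ stack tracing toggled.

    Hypothetical helper for illustration only. The variable name and the
    value "True" are taken from the PR description; everything else is an
    assumption about how a launcher might use them.
    """
    # Copy so the caller's mapping (or os.environ) is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    if trace_cpp_stack:
        env["TORCH_NCCL_TRACE_CPP_STACK"] = "True"
    else:
        # Unset means "use the default", which this PR makes disabled.
        env.pop("TORCH_NCCL_TRACE_CPP_STACK", None)
    return env


if __name__ == "__main__":
    env = flight_recorder_env(trace_cpp_stack=True, base_env={})
    print(env)
```

The returned mapping could then be passed as `env=` to `subprocess.run` when launching the job, which keeps the opt-in scoped to that one run.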
Dr. CI: 1 new failure as of commit c958bc2 with merge base e4b1378. Artifacts and rendered test results: hud.pytorch.org/pr/114651
@pytorchbot merge -f "flaky dynamic-shapes test"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Do we know when we want to turn this flag on and when to turn it off?
Add TORCH_NCCL_DUMP_DEBUG_INFO env to control dumping independently of the desync debug feature. It currently defaults to disabled (so no behavior change by default), but the plan is to default it to true after validation.
Move the 'sleep for 30 sec' that used to come after desync debug to before it. In my view, sleeping before desync debug is equivalent since we always sleep for the same duration, and it keeps the code simpler.
Fixes #114433
Pull Request resolved: #114614
Approved by: https://github.com/zdevito
ghstack dependencies: #114651
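To make the "disabled unless explicitly set" behavior concrete, here is a minimal model of an opt-in boolean environment flag. This is a sketch, not PyTorch's actual parser: the helper name `env_flag` and the set of accepted spellings are assumptions, and only the variable name `TORCH_NCCL_DUMP_DEBUG_INFO` and its disabled default come from the commit message.

```python
import os

# Assumed spellings for "on"; the real parser may accept a different set.
_TRUTHY = {"1", "true", "yes", "y"}


def env_flag(name, default=False, environ=None):
    """Read a boolean flag from the environment (illustrative only)."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        # Variable not set: fall back to the default (disabled here).
        return default
    return raw.strip().lower() in _TRUTHY


if __name__ == "__main__":
    # Unset: dumping stays off, so there is no behavior change by default.
    print(env_flag("TORCH_NCCL_DUMP_DEBUG_INFO", environ={}))
    # Explicitly set: dumping is requested, regardless of any other flag.
    print(env_flag("TORCH_NCCL_DUMP_DEBUG_INFO",
                   environ={"TORCH_NCCL_DUMP_DEBUG_INFO": "1"}))
```

Because the flag is read on its own, flipping its default to true later would only require changing `default`, without touching how the desync debug feature is enabled.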
On devserver with C++ stacktraces disabled/enabled:
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @yf225 @kiukchung @d4l3k @LucasLLC