Conversation

@wconstab wconstab commented Nov 28, 2023

Stack from ghstack (oldest at bottom):

C++ stacktrace processing (the symbolizer) takes a long time on some systems
using a particular version of addr2line. On such systems, this makes
flight-recorder dumping slow enough to time out even on toy programs.

Setting TORCH_NCCL_TRACE_CPP_STACK=True re-enables C++ stacktrace
collection as part of the flight recorder.

C++ stacktrace collection is fast enough for use on certain OS/toolchain
combinations. We can investigate moving to LLVM's symbolizer as a
replacement.

On a devserver, with C++ stacktraces disabled and then enabled:

python test/distributed/test_c10d_nccl.py -k test_short
Ran 1 test in 12.175s

TORCH_NCCL_TRACE_CPP_STACK=1 python test/distributed/test_c10d_nccl.py -k test_short
Ran 1 test in 53.338s
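
The env-var gating described above can be sketched as follows (a hypothetical helper, not PyTorch's actual implementation; the real logic lives in the C++ flight recorder):

```python
# Sketch: gate expensive C++ stacktrace capture behind an env var so
# the default flight-recorder dump path stays fast.
import os

def cpp_stacktraces_enabled() -> bool:
    # Default is disabled, since addr2line-based symbolization can be
    # slow enough to make dumps time out on some systems.
    val = os.environ.get("TORCH_NCCL_TRACE_CPP_STACK", "0")
    return val.strip().lower() in ("1", "true", "yes")
```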

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @yf225 @kiukchung @d4l3k @LucasLLC


pytorch-bot bot commented Nov 28, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/114651

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit c958bc2 with merge base e4b1378:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@wconstab
Contributor Author

@pytorchbot merge -f "flaky dynamic-shapes test"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f only as a last resort; consider -i/--ignore-current instead, which continues the merge while ignoring current failures and allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@fduwjj
Contributor

fduwjj commented Nov 28, 2023

Do we know when we want to enable this flag and when to turn it off?

pytorchmergebot pushed a commit that referenced this pull request Nov 28, 2023
Add TORCH_NCCL_DUMP_DEBUG_INFO env to control dumping independently
of desync debug feature.

Currently default to disabled (so no behavior change by default),
but plan to default this to true after validation.

Moves the 'sleep for 30 sec' that used to run after desync debug to before
it. Sleeping before desync debug is equivalent, since we always sleep for
the same duration, and it keeps the code simpler.

Fixes #114433

Pull Request resolved: #114614
Approved by: https://github.com/zdevito
ghstack dependencies: #114651
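
The decoupling described in this commit message might look roughly like the following sketch (a simplified model, not PyTorch's actual code; `_env_flag` and `on_timeout` are hypothetical names, and the assumption is that TORCH_NCCL_DESYNC_DEBUG gates the existing desync-debug path):

```python
# Sketch: debug-info dumping is controlled by TORCH_NCCL_DUMP_DEBUG_INFO,
# independently of desync debug, and the 30 s sleep happens before the
# desync report rather than after it.
import os

def _env_flag(name: str, default: str = "0") -> bool:
    return os.environ.get(name, default).strip().lower() in ("1", "true", "yes")

def on_timeout() -> list:
    """Return the ordered actions taken when a collective times out."""
    actions = []
    if _env_flag("TORCH_NCCL_DUMP_DEBUG_INFO"):  # default off for now
        actions.append("dump debug info")
    actions.append("sleep 30s")  # moved to before desync debug
    if _env_flag("TORCH_NCCL_DESYNC_DEBUG"):
        actions.append("desync debug report")
    return actions
```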
@facebook-github-bot facebook-github-bot deleted the gh/wconstab/228/head branch December 2, 2023 15:28
@albanD added the oncall: distributed label and removed the module: distributed label on Dec 8, 2023
Labels: Merged, oncall: distributed, release notes: distributed (c10d)