Conversation

@wconstab wconstab commented Nov 28, 2023

Stack from ghstack (oldest at bottom):

C++ stacktrace processing (the symbolizer) takes a long time on some systems
using a particular version of addr2line. On such systems, this makes
flight-recorder dumping slow enough to time out even on toy programs.

Setting TORCH_NCCL_TRACE_CPP_STACK=True re-enables C++ stacktrace
collection as part of the flight recorder.

C++ stacktrace collection is fast enough for use on certain OS/toolchain
combinations. We can investigate moving to LLVM's symbolizer as a
replacement.

On a devserver, with C++ stacktraces disabled and then enabled:

python test/distributed/test_c10d_nccl.py -k test_short
Ran 1 test in 12.175s

TORCH_NCCL_TRACE_CPP_STACK=1 python test/distributed/test_c10d_nccl.py -k test_short
Ran 1 test in 53.338s
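
The env-var gating described above can be sketched as follows (a hypothetical helper, not PyTorch's actual implementation; the real logic lives in the C++ flight recorder):

```python
# Sketch: gate expensive C++ stacktrace capture behind an env var so
# the default flight-recorder dump path stays fast.
import os

def cpp_stacktraces_enabled() -> bool:
    # Default is disabled, since addr2line-based symbolization can be
    # slow enough to make dumps time out on some systems.
    val = os.environ.get("TORCH_NCCL_TRACE_CPP_STACK", "0")
    return val.strip().lower() in ("1", "true", "yes")
```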

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @yf225 @kiukchung @d4l3k @LucasLLC


pytorch-bot bot commented Nov 28, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/114651

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit c958bc2 with merge base e4b1378:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@wconstab
Contributor Author

@pytorchbot merge -f "flaky dynamic-shapes test"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f only as a last resort; consider -i/--ignore-current instead, which continues the merge while ignoring current failures and allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@fduwjj
Contributor

fduwjj commented Nov 28, 2023

Do we know when we want to enable this flag and when to turn it off?

pytorchmergebot pushed a commit that referenced this pull request Nov 28, 2023
Add TORCH_NCCL_DUMP_DEBUG_INFO env to control dumping independently
of desync debug feature.

Currently default to disabled (so no behavior change by default),
but plan to default this to true after validation.

Moves the 'sleep for 30 sec' that used to run after desync debug to before
it. Sleeping before desync debug is equivalent, since we always sleep for
the same duration, and it keeps the code simpler.

Fixes #114433

Pull Request resolved: #114614
Approved by: https://github.com/zdevito
ghstack dependencies: #114651
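
The decoupling described in this commit message might look roughly like the following sketch (a simplified model, not PyTorch's actual code; `_env_flag` and `on_timeout` are hypothetical names, and the assumption is that TORCH_NCCL_DESYNC_DEBUG gates the existing desync-debug path):

```python
# Sketch: debug-info dumping is controlled by TORCH_NCCL_DUMP_DEBUG_INFO,
# independently of desync debug, and the 30 s sleep happens before the
# desync report rather than after it.
import os

def _env_flag(name: str, default: str = "0") -> bool:
    return os.environ.get(name, default).strip().lower() in ("1", "true", "yes")

def on_timeout() -> list:
    """Return the ordered actions taken when a collective times out."""
    actions = []
    if _env_flag("TORCH_NCCL_DUMP_DEBUG_INFO"):  # default off for now
        actions.append("dump debug info")
    actions.append("sleep 30s")  # moved to before desync debug
    if _env_flag("TORCH_NCCL_DESYNC_DEBUG"):
        actions.append("desync debug report")
    return actions
```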
@facebook-github-bot facebook-github-bot deleted the gh/wconstab/228/head branch December 2, 2023 15:28
@albanD added the oncall: distributed label and removed the module: distributed label on Dec 8, 2023
Labels: Merged, oncall: distributed, release notes: distributed (c10d)