
Conversation


@hongxiayang hongxiayang commented Mar 14, 2024

When testing all-reduce with an alternative RCCL replacement backend, my test script crashed. After debugging, I found that ncclGetLastError(NULL) returned null, and the code then constructed a std::string from that return value, which crashes with the exception basic_string::_M_construct null not valid.

This pull request fixes that edge case so that the program exits gracefully with useful information.
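
For context, a minimal standalone reproduction of the underlying failure, independent of NCCL/RCCL (the behavior shown is what libstdc++ produces; the C++ standard simply calls this undefined behavior):

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

int main() {
  const char* msg = nullptr;  // stands in for ncclGetLastError(NULL) returning null
  try {
    std::string s(msg);  // constructing std::string from a null char* is invalid
  } catch (const std::logic_error& e) {
    // libstdc++ reports: "basic_string::_M_construct null not valid"
    std::cout << e.what() << std::endl;
  }
  return 0;
}
```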

Test:
Before the fix, my test script exited like this:

File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2051, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: basic_string::_M_construct null not valid

After the fix, my test script exited with a useful message like this:

[rank0]:   File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
[rank0]:     work = group.allreduce([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:272, internal error - please report this issue to the NCCL developers, NCCL version 0.4.2
[rank0]: ncclInternalError: Internal check failed.
[rank0]:  Last error: Unknown NCCL Error
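
For illustration, here is a minimal sketch of the guard this PR describes (not the actual diff; the helper name and signature are hypothetical, and the "Unknown NCCL Error" fallback mirrors the message shown above):

```cpp
#include <string>
#include <nccl.h>  // ncclResult_t, ncclGetErrorString, ncclGetLastError (NCCL >= 2.13)

// Hypothetical helper modeled on the error-detail path in NCCLUtils.hpp.
std::string errorDetailSketch(ncclResult_t error) {
  const char* last = ncclGetLastError(/*comm=*/nullptr);
  // Guard: some backends (e.g. an RCCL replacement) may return a null pointer here;
  // constructing std::string directly from nullptr is what crashed before the fix.
  std::string lastError = (last != nullptr) ? last : "Unknown NCCL Error";
  return std::string(ncclGetErrorString(error)) + ". Last error: " + lastError;
}
```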

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang


pytorch-bot bot commented Mar 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/121905


✅ You can merge normally! (1 Unrelated Failure)

As of commit da0c633 with merge base 5891c5b:

FLAKY - The following job failed but was likely due to flakiness present on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Mar 14, 2024
@hongxiayang hongxiayang marked this pull request as draft March 14, 2024 15:49
@hongxiayang hongxiayang marked this pull request as ready for review March 22, 2024 15:02
@jeffdaily

@pytorchbot rebase

@pytorchmergebot

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot

Successfully rebased nccl_error_crash onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout nccl_error_crash && git pull --rebase)

@jeffdaily

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Mar 25, 2024
@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

@jeffdaily jeffdaily added the topic: not user facing label Mar 25, 2024
pytorch-bot bot pushed a commit that referenced this pull request Apr 22, 2024
…cclErrorDetailStr (#121905)

Pull Request resolved: #121905
Approved by: https://github.com/wconstab
