(torchelastic) make --max_restarts explicit in the quickstart and runner docs (#65675) #65838

Closed

kiukchung
Collaborator

@kiukchung kiukchung commented Sep 29, 2021

Summary:
closes #65675

The default --max_restarts for torch.distributed.run was changed from 3 to 0 for backwards compatibility with torch.distributed.launch. Because the default used to be greater than 0, none of our example code passed --max_restarts explicitly. With the new default of 0, those examples no longer restart failed workers unless the flag is given.
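As a rough illustration of what passing the flag explicitly looks like, here is a minimal sketch that assembles a single-node launch command with --max_restarts spelled out. The helper name `build_elastic_command`, the script name `train.py`, and the chosen values are assumptions made for this example; they do not come from the PR or from the docs it changes.

```python
import sys


def build_elastic_command(script, nproc_per_node, max_restarts=3, script_args=()):
    """Return an argv list for torch.distributed.run with --max_restarts set.

    Hypothetical helper for illustration only. It does not launch anything;
    it just makes the restart budget visible instead of relying on the
    launcher's default (which is 0 after the change described above).
    """
    if max_restarts < 0:
        raise ValueError("max_restarts must be non-negative")
    return [
        sys.executable, "-m", "torch.distributed.run",
        "--nnodes=1",                          # single node, for simplicity
        f"--nproc_per_node={nproc_per_node}",  # workers to start on this node
        f"--max_restarts={max_restarts}",      # explicit restart budget
        script,
        *script_args,
    ]


# "train.py" is a placeholder training script name.
cmd = build_elastic_command("train.py", nproc_per_node=4, max_restarts=3)
print(" ".join(cmd[1:]))
```

The resulting list can be handed to `subprocess.run(cmd)` on a machine with PyTorch installed. Keeping the value as a named parameter means a later change to the launcher default does not silently change the job's fault-tolerance behavior.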

Test Plan: N/A doc change only

Differential Revision: D31279544

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang

@facebook-github-bot
Contributor

facebook-github-bot commented Sep 29, 2021

💊 CI failures summary and remediations

As of commit a3e326e (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build linux-bionic-py3.6-clang9 / test (default, 1, 2, linux.2xlarge) (1/1)

Step: "Test"

2021-09-29T21:50:01.1579094Z RuntimeError: test_jit failed! Received signal: SIGSEGV
2021-09-29T21:50:00.1739959Z   test_pretty_printer (__main__.TestJit) ... ok (0.025s)
2021-09-29T21:50:00.1747631Z   test_print_classes_module (__main__.TestJit) ... ok (0.001s)
2021-09-29T21:50:00.1754277Z   test_print_op_module (__main__.TestJit) ... ok (0.001s)
2021-09-29T21:50:00.1762264Z   test_print_torch_ops_modules (__main__.TestJit) ... ok (0.001s)
2021-09-29T21:50:00.1874271Z   test_profiler (__main__.TestJit) ... ok (0.011s)
2021-09-29T21:50:01.1573406Z   test_python_bindings (__main__.TestJit) ... Traceback (most recent call last):
2021-09-29T21:50:01.1574038Z   File "test/run_test.py", line 1030, in <module>
2021-09-29T21:50:01.1575992Z     main()
2021-09-29T21:50:01.1576586Z   File "test/run_test.py", line 1008, in main
2021-09-29T21:50:01.1578544Z     raise RuntimeError(err_message)
2021-09-29T21:50:01.1579094Z RuntimeError: test_jit failed! Received signal: SIGSEGV
2021-09-29T21:50:01.4199271Z 
2021-09-29T21:50:01.4199710Z real	9m50.934s
2021-09-29T21:50:01.4200134Z user	22m29.936s
2021-09-29T21:50:01.4200649Z sys	1m0.184s
2021-09-29T21:50:01.4200996Z + cleanup
2021-09-29T21:50:01.4201283Z + retcode=1
2021-09-29T21:50:01.4201543Z + set +x
2021-09-29T21:50:01.4202860Z =================== sccache compilation log ===================
2021-09-29T21:50:01.4634003Z =========== If your build fails, please take a look at the log above for possible reasons ===========
2021-09-29T21:50:01.4656435Z Compile requests                     28

This comment was automatically generated by Dr. CI.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D31279544

@facebook-github-bot facebook-github-bot added the "oncall: distributed" label Sep 29, 2021
Member

@H-Huang H-Huang left a comment

LGTM!

docs/source/elastic/quickstart.rst (review thread outdated, resolved)
…ner docs (pytorch#65838)

Summary:
Pull Request resolved: pytorch#65838

closes pytorch#65675

The default `--max_restarts` for `torch.distributed.run` was changed to `0` from `3` to make things backwards compatible with `torch.distributed.launch`. Since the default `--max_restarts` used to be greater than `0` we never documented passing `--max_restarts` explicitly in any of our example code.

Test Plan: N/A doc change only

Reviewed By: d4l3k

Differential Revision: D31279544

fbshipit-source-id: afabb4ac492ba0640752efbc8e36ed485d97ec5a

Labels: cla signed, fb-exported, oncall: distributed