
Better handling of Autograd+Fork errors. #33885

Closed

Conversation

@VitalyFedyunin (Contributor) commented Feb 27, 2020

Stack from ghstack:

Fixes: #32835
Fixes: #5834

This cannot be combined with CUDA's implementation, as each requires its own `std::once_flag` as well as a different `forked_autograd_child` function. The CUDA version defers to a Python module, while autograd uses `TORCH_CHECK` to report the error to both Python and C++.

Differential Revision: D20144024

@albanD (Collaborator) left a comment

LGTM

void Engine::initialize_threads_pool() {
  track_bad_autograd_forks();
  TORCH_CHECK(!in_bad_autograd_fork,
      "Unable to handle autograd's threading in combination with fork. "
Collaborator:

nit: "fork" -> "fork-based multiprocessing"
Maybe some people use the multiprocessing package without knowing what fork is?

@dr-ci bot commented Feb 27, 2020

💊 CircleCI build failures summary and remediations

As of commit a3c88af:

None of the build failures appear to be your fault.

  • 1/2 broken upstream at merge base 095de1e since Feb 27

Please rebase on the viable/strict branch:

    If your commit is newer than viable/strict, you can try basing on an older, stable commit:

    git fetch origin viable/strict
    git rebase --onto viable/strict $(git merge-base origin/master HEAD)
    

    If your commit is older than viable/strict:

    git fetch origin viable/strict
    git rebase viable/strict
    

    Check out the recency history of this "viable master" tracking branch.

  • 1/2 recognized as flaky ❄️

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

❄️ 1 failure recognized as flaky

The following build failures have been detected as flaky and may not be your fault:

See CircleCI build pytorch_linux_xenial_cuda10_1_cudnn7_py3_nogpu_test (1/1)

Step: "Test" ❄️

Feb 27 20:25:35 ---------------------------------------------------------------------- 
Feb 27 20:25:35 Traceback (most recent call last): 
Feb 27 20:25:35   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 175, in wrapper 
Feb 27 20:25:35     self._join_processes(fn) 
Feb 27 20:25:35   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 256, in _join_processes 
Feb 27 20:25:35     self._check_return_codes(elapsed_time) 
Feb 27 20:25:35   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 276, in _check_return_codes 
Feb 27 20:25:35     self.assertEqual(p.exitcode, first_process.exitcode) 
Feb 27 20:25:35   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 896, in assertEqual 
Feb 27 20:25:35     super(TestCase, self).assertLessEqual(abs(x - y), prec, message) 
Feb 27 20:25:35 AssertionError: 12 not less than or equal to 1e-05 :  
Feb 27 20:25:35  
Feb 27 20:25:35 ---------------------------------------------------------------------- 
Feb 27 20:25:35 Ran 84 tests in 171.177s 
Feb 27 20:25:35  
Feb 27 20:25:35 FAILED (failures=1, skipped=1) 
Feb 27 20:25:35  
Feb 27 20:25:35 Generating XML reports... 
Feb 27 20:25:35 Traceback (most recent call last): 
Feb 27 20:25:35   File "test/run_test.py", line 493, in <module> 
Feb 27 20:25:35     main() 

🚧 1 upstream failure recognized by patterns:

These builds matched patterns, but were probably caused by upstream breakages:



VitalyFedyunin added a commit that referenced this pull request Feb 27, 2020
ghstack-source-id: 5b62de9df7aa09124b075c81a21d0219da99f21f
Pull Request resolved: #33885
@VitalyFedyunin VitalyFedyunin changed the title Better handing of Autograd+Fork errors. Better handling of Autograd+Fork errors. Feb 27, 2020
@@ -821,15 +834,6 @@ def test_is_shared_cuda(self):
t = torch.randn(5, 5).cuda()
self.assertTrue(t.is_shared())

@unittest.skip('this test occasionally fails and deadlocks; see https://github.com/pytorch/pytorch/issues/5834')
Collaborator:

Nice catch!

@facebook-github-bot (Contributor) commented:

@VitalyFedyunin merged this pull request in 877ab3a.

hczhu pushed a commit that referenced this pull request Feb 28, 2020
Summary:
Pull Request resolved: #33885

Fixes: #32835
Fixes: #5834

This cannot be combined with CUDA's implementation, as each requires its own `std::once_flag` as well as a different `forked_autograd_child` function. The CUDA version defers to a Python module, while autograd uses `TORCH_CHECK` to report the error to both Python and C++.

Test Plan: Imported from OSS

Differential Revision: D20144024

Pulled By: VitalyFedyunin

fbshipit-source-id: e7cf30568fff5110e9df7fe5b23f18ed992fa17f
@facebook-github-bot facebook-github-bot deleted the gh/VitalyFedyunin/116/head branch March 2, 2020 15:16
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020