
Better handling of Autograd+Fork errors. #33885

Closed

Conversation

@VitalyFedyunin (Contributor) commented Feb 27, 2020

Stack from ghstack:

Fixes: #32835
Fixes: #5834

This cannot be combined with CUDA's implementation, as each requires its own `std::once_flag` as well as a different `forked_autograd_child` function. The CUDA version defers to a Python module, while autograd uses `TORCH_CHECK` to report the error to both Python and C++.

Differential Revision: D20144024

@albanD (Collaborator) left a comment

LGTM

void Engine::initialize_threads_pool() {
  track_bad_autograd_forks();
  TORCH_CHECK(!in_bad_autograd_fork,
      "Unable to handle autograd's threading in combination with fork. "
Collaborator:

nit: "fork" -> "fork-based multiprocessing"
Maybe some people use the multiprocessing package without knowing what fork is?

@dr-ci bot commented Feb 27, 2020

💊 CircleCI build failures summary and remediations

As of commit a3c88af:

None of the build failures appear to be your fault.

  • 1/2 broken upstream at merge base 095de1e since Feb 27

Please rebase on the viable/strict branch:

    If your commit is newer than viable/strict, you can try basing on an older, stable commit:

    git fetch origin viable/strict
    git rebase --onto viable/strict $(git merge-base origin/master HEAD)
    

    If your commit is older than viable/strict:

    git fetch origin viable/strict
    git rebase viable/strict
    

    Check out the recency history of this "viable master" tracking branch.

  • 1/2 recognized as flaky ❄️

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

❄️ 1 failure recognized as flaky

The following build failures have been detected as flaky and may not be your fault:

See CircleCI build pytorch_linux_xenial_cuda10_1_cudnn7_py3_nogpu_test (1/1)

Step: "Test" ❄️

Feb 27 20:25:35 ---------------------------------------------------------------------- 
Feb 27 20:25:35 Traceback (most recent call last): 
Feb 27 20:25:35   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 175, in wrapper 
Feb 27 20:25:35     self._join_processes(fn) 
Feb 27 20:25:35   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 256, in _join_processes 
Feb 27 20:25:35     self._check_return_codes(elapsed_time) 
Feb 27 20:25:35   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 276, in _check_return_codes 
Feb 27 20:25:35     self.assertEqual(p.exitcode, first_process.exitcode) 
Feb 27 20:25:35   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 896, in assertEqual 
Feb 27 20:25:35     super(TestCase, self).assertLessEqual(abs(x - y), prec, message) 
Feb 27 20:25:35 AssertionError: 12 not less than or equal to 1e-05 :  
Feb 27 20:25:35  
Feb 27 20:25:35 ---------------------------------------------------------------------- 
Feb 27 20:25:35 Ran 84 tests in 171.177s 
Feb 27 20:25:35  
Feb 27 20:25:35 FAILED (failures=1, skipped=1) 
Feb 27 20:25:35  
Feb 27 20:25:35 Generating XML reports... 
Feb 27 20:25:35 Traceback (most recent call last): 
Feb 27 20:25:35   File "test/run_test.py", line 493, in <module> 
Feb 27 20:25:35     main() 

🚧 1 upstream failure recognized by patterns:

These builds matched patterns, but were probably caused by upstream breakages:



VitalyFedyunin added a commit that referenced this pull request Feb 27, 2020
ghstack-source-id: 5b62de9df7aa09124b075c81a21d0219da99f21f
Pull Request resolved: #33885
@VitalyFedyunin VitalyFedyunin changed the title Better handing of Autograd+Fork errors. Better handling of Autograd+Fork errors. Feb 27, 2020
@@ -821,15 +834,6 @@ def test_is_shared_cuda(self):
t = torch.randn(5, 5).cuda()
self.assertTrue(t.is_shared())

@unittest.skip('this test occasionally fails and deadlocks; see https://github.com/pytorch/pytorch/issues/5834')
Collaborator:

Nice catch!

@facebook-github-bot (Contributor) commented:

@VitalyFedyunin merged this pull request in 877ab3a.

hczhu pushed a commit that referenced this pull request Feb 28, 2020
Summary:
Pull Request resolved: #33885

Fixes: #32835
Fixes: #5834

This cannot be combined with CUDA's implementation, as each requires its own `std::once_flag` as well as a different `forked_autograd_child` function. The CUDA version defers to a Python module, while autograd uses `TORCH_CHECK` to report the error to both Python and C++.

Test Plan: Imported from OSS

Differential Revision: D20144024

Pulled By: VitalyFedyunin

fbshipit-source-id: e7cf30568fff5110e9df7fe5b23f18ed992fa17f
@facebook-github-bot facebook-github-bot deleted the gh/VitalyFedyunin/116/head branch March 2, 2020 15:16
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020