Add basic GPU support to distributed autograd. #40312

Closed

Conversation

@pritamdamania87 (Contributor) commented Jun 19, 2020

Stack from ghstack:

As part of #40255, we realized that GPU support for distributed autograd was broken by our multithreaded autograd change.

To fix this in the short term for 1.6, this PR includes the following changes (a minimal sketch of the resulting threading pattern follows the list):

1. Add a long-lived CPU thread to DistEngine to execute GPU->CPU continuations in the autograd graph.
2. The long-lived CPU thread has its own ready_queue, and this queue is used for all GraphTasks created by DistEngine.
3. Because of the new CPU thread added in 1), thread_main() can no longer assume a thread should exit once its GraphTask is done processing.
4. To resolve this, thread_main() now takes a `device_thread` parameter instead of `reentrant_thread`. When `device_thread` is true, we expect this to be a long-lived device thread that does not exit.
5. When `device_thread` is false, thread_main() is expected to run a GraphTask and return once done.
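Here is a minimal, hedged sketch of the pattern described in 1) and 2): a long-lived CPU thread draining a dedicated ready queue. The `NodeTask`, `ReadyQueue`, and `DistEngineSketch` names are simplified stand-ins for illustration, not the actual PyTorch types.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// Simplified stand-ins; the real engine's NodeTask and ReadyQueue carry
// much more state (GraphTask ownership, shutdown sentinels, priorities).
using NodeTask = std::function<void()>;

class ReadyQueue {
 public:
  void push(NodeTask task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push_back(std::move(task));
    }
    cv_.notify_one();
  }
  NodeTask pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !queue_.empty(); });
    NodeTask task = std::move(queue_.front());
    queue_.pop_front();
    return task;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<NodeTask> queue_;
};

class DistEngineSketch {
 public:
  DistEngineSketch() {
    // Change 1): a long-lived CPU thread that never exits; it executes
    // GPU->CPU continuations as device threads push them onto the queue.
    // The real engine is a process-long singleton, so detaching a thread
    // that captures `this` is not a lifetime concern there.
    std::thread([this] {
      while (true) {
        cpu_ready_queue_.pop()();  // block for the next task, then run it
      }
    }).detach();
  }

  // Change 2): every GraphTask created by this engine shares this queue,
  // so all CPU continuations funnel through the single long-lived thread.
  ReadyQueue& cpuReadyQueue() { return cpu_ready_queue_; }

 private:
  ReadyQueue cpu_ready_queue_;
};
```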

Differential Revision: [D22146183](https://our.internmc.facebook.com/intern/diff/D22146183/)

@dr-ci bot commented Jun 19, 2020

💊 CI failures summary and remediations

As of commit be9f57d (more details on the Dr. CI page):



🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_backward_compatibility_check_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

```
Jun 23 02:19:21 The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not.
Jun 23 02:19:21 processing existing schema:  __str__(__torch__.torch.classes._TorchScriptTesting._StackString _0) -> (str _0)
Jun 23 02:19:21 processing existing schema:  __init__(__torch__.torch.classes._TorchScriptTesting._PickleTester _0, int[] _1) -> (None _0)
Jun 23 02:19:21 processing existing schema:  __getstate__(__torch__.torch.classes._TorchScriptTesting._PickleTester _0) -> (int[] _0)
Jun 23 02:19:21 processing existing schema:  __setstate__(__torch__.torch.classes._TorchScriptTesting._PickleTester _0, int[] _1) -> (None _0)
Jun 23 02:19:21 processing existing schema:  top(__torch__.torch.classes._TorchScriptTesting._PickleTester _0) -> (int _0)
Jun 23 02:19:21 processing existing schema:  pop(__torch__.torch.classes._TorchScriptTesting._PickleTester _0) -> (int _0)
Jun 23 02:19:21 processing existing schema:  get(__torch__.torch.classes._TorchScriptTesting._LiteInterpreterTest _0, Tensor _1) -> (str _0)
Jun 23 02:19:21 processing existing schema:  __getstate__(__torch__.torch.classes._TorchScriptTesting._LiteInterpreterTest _0) -> (int _0)
Jun 23 02:19:21 processing existing schema:  __setstate__(__torch__.torch.classes._TorchScriptTesting._LiteInterpreterTest _0, int _1) -> (None _0)
Jun 23 02:19:21 processing existing schema:  __init__(__torch__.torch.classes.dist_rpc.WorkerInfo _0, str _1, int _2) -> (None _0)
Jun 23 02:19:21 The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not.
Jun 23 02:19:21
Jun 23 02:19:21 Broken ops: [
Jun 23 02:19:21 	aten::mkldnn_max_pool3d(Tensor self, int[3] kernel_size, int[3] stride=[], int[3] padding=[0, 0, 0], int[3] dilation=[1, 1, 1], bool ceil_mode=False) -> (Tensor)
Jun 23 02:19:21 	aten::mkldnn_reorder_conv3d_weight(Tensor self, int[3] padding=[0, 0, 0], int[3] stride=[1, 1, 1], int[3] dilation=[1, 1, 1], int groups=1) -> (Tensor)
Jun 23 02:19:21 ]
Jun 23 02:19:21 + cleanup
Jun 23 02:19:21 + retcode=1
Jun 23 02:19:21 + set +x
Jun 23 02:19:21 =================== sccache compilation log ===================
Jun 23 02:19:21 =========== If your build fails, please take a look at the log above for possible reasons ===========
```


@pritamdamania87 (Contributor, Author) commented:

Verified the RPC examples are working with this PR.

@wanchaol (Collaborator) left a comment:

Looks good, but the `device_thread` naming is confusing, since reentrant backwards could happen in the device threads as well. Maybe call it something like `spin` or `long_live`?

```diff
@@ -443,7 +432,7 @@ void Engine::reentrant_thread_init() {
   // set the local_ready_queue to the ready queue on the graph_task->owner_ device
   local_ready_queue = ready_queue_by_index(graph_task->cpu_ready_queue_, graph_task->owner_);
   total_depth = graph_task->reentrant_depth_;
-  thread_main(graph_task, /* reentrant thread*/ true);
+  thread_main(graph_task, /* device_thread */ false);
```
A collaborator commented on this change:
I'm a bit confused by the argument change here; device threads can also have reentrant backwards, right?

Another collaborator replied:
I would agree here. Since we have multiple use cases for this, it might be clearer to just describe what it does.
Or just remove this flag, which is completely redundant with the first argument :D
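To make the redundancy point concrete, here is a hedged sketch of the loop structure being debated. It is not the engine's actual implementation; the `completed()` helper and the call-site invariant (`device_thread` is true exactly when `graph_task` is null) are assumptions drawn from the discussion above.

```cpp
#include <atomic>

// Assumed stand-in: the real GraphTask tracks outstanding work with more
// elaborate bookkeeping than a single counter.
struct GraphTask {
  std::atomic<int> outstanding_tasks_{0};
  bool completed() const { return outstanding_tasks_.load() == 0; }
};

// Assumed call sites:
//   thread_main(nullptr, /*device_thread=*/true);     // long-lived thread
//   thread_main(graph_task, /*device_thread=*/false); // run one task, return
void thread_main(GraphTask* graph_task, bool device_thread) {
  // device_thread == true: loop forever, executing NodeTasks as they
  // arrive on this thread's local ready queue.
  // device_thread == false: execute NodeTasks only until the supplied
  // GraphTask completes, then return to the caller.
  while (device_thread || !graph_task->completed()) {
    // ... pop one NodeTask from the local ready queue and execute it
    //     (elided; only the loop/exit structure is at issue here) ...
  }
  // The reviewers' observation: at both call sites device_thread equals
  // (graph_task == nullptr), so the flag could be dropped and the loop
  // written as:
  //   while (graph_task == nullptr || !graph_task->completed()) { ... }
}
```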

pritamdamania87 pushed a commit that referenced this pull request Jun 20, 2020
Pull Request resolved: #40312

ghstack-source-id: 106287186

Differential Revision: [D22146183](https://our.internmc.facebook.com/intern/diff/D22146183/)
pritamdamania87 pushed a commit that referenced this pull request Jun 21, 2020
Pull Request resolved: #40312

ghstack-source-id: 106299861

Differential Revision: [D22146183](https://our.internmc.facebook.com/intern/diff/D22146183/)
@albanD (Collaborator) left a comment:

Good catch on the increment/decrement issues! Just a minor point about argument naming, but I think the overall logic is good.


pritamdamania87 pushed a commit that referenced this pull request Jun 22, 2020
Pull Request resolved: #40312

ghstack-source-id: 106352396

Differential Revision: [D22146183](https://our.internmc.facebook.com/intern/diff/D22146183/)
@albanD (Collaborator) left a comment:
Thanks for the update. LGTM

pritamdamania87 pushed a commit that referenced this pull request Jun 23, 2020
Pull Request resolved: #40312

ghstack-source-id: 106391329

Differential Revision: [D22146183](https://our.internmc.facebook.com/intern/diff/D22146183/)
@mrshenli (Contributor) commented:

Shall we land this today before branch cut?

@facebook-github-bot (Contributor) commented:

This pull request has been merged in 54c05fa.

@facebook-github-bot deleted the gh/pritamdamania87/144/head branch June 27, 2020 14:16