Add basic GPU support to distributed autograd. #40312
Conversation
As part of #40255, we realized that GPU support for distributed autograd was broken by our multithreaded autograd change. To fix this in the short term for 1.6, this PR includes the following changes:

1. A long-lived CPU thread in DistEngine to execute GPU->CPU continuations in the autograd graph.
2. The long-lived CPU thread has its own ready_queue, and this queue is used for all GraphTasks created by DistEngine.
3. In thread_main(), the CPU thread cannot exit once the GraphTask is done processing, because of the new CPU thread added in 1).
4. To resolve this, thread_main() now has a parameter `device_thread` instead of `reentrant_thread`. When `device_thread` is true, we expect this to be a long-lived device thread that does not exit.
5. When `device_thread` is false, thread_main() is expected to run a GraphTask and return once done.

Differential Revision: [D22146183](https://our.internmc.facebook.com/intern/diff/D22146183/)
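To make the threading model concrete, here is a minimal C++ sketch of the pattern the description outlines. It is not the actual PyTorch engine code: `GraphTask`, `NodeTask`, `ReadyQueue`, and `thread_main` exist in the real engine but with different signatures and far more state, and `start_cpu_worker` is a made-up name standing in for the DistEngine setup.

```cpp
// A minimal sketch, NOT the real engine code: simplified stand-ins for
// GraphTask/NodeTask/ReadyQueue showing how a long-lived worker thread and a
// "run one GraphTask and return" call can share the same thread_main().
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

struct GraphTask {
  bool completed = false;  // flipped when every node of this backward pass has run
};

struct NodeTask {
  std::shared_ptr<GraphTask> graph_task;
  // ... the autograd function and its inputs are elided in this sketch ...
};

class ReadyQueue {
 public:
  void push(NodeTask t) {
    {
      std::lock_guard<std::mutex> guard(mutex_);
      queue_.push(std::move(t));
    }
    cv_.notify_one();
  }
  NodeTask pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !queue_.empty(); });
    NodeTask t = std::move(queue_.front());
    queue_.pop();
    return t;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<NodeTask> queue_;
};

// device_thread == true:  long-lived worker, never returns (point 4).
// device_thread == false: run until `graph_task` finishes, then return (point 5).
void thread_main(std::shared_ptr<GraphTask> graph_task,
                 bool device_thread,
                 ReadyQueue& local_ready_queue) {
  while (device_thread || !graph_task->completed) {
    NodeTask task = local_ready_queue.pop();
    (void)task;  // in the real engine: execute the node, enqueue successors,
                 // and mark task.graph_task->completed when its work count hits 0
  }
}

// Simplified DistEngine-style startup (points 1-2): one long-lived CPU worker
// that owns `queue` and serves GPU->CPU continuations for all GraphTasks.
void start_cpu_worker(ReadyQueue& queue) {
  std::thread worker([&queue] {
    thread_main(/*graph_task=*/nullptr, /*device_thread=*/true, queue);
  });
  worker.detach();  // lives for the life of the process
}
```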
💊 CI failures summary and remediations. As of commit be9f57d (more details on the Dr. CI page):

🕵️ 1 new failure recognized by patterns. The following CI failures do not appear to be due to upstream breakages.
Verified the RPC examples are working with this PR.
Looks good. The `device_thread` naming is confusing, though, as reentrant backwards could happen in the device threads as well. Maybe call it something like `spin` or `long_live`?
torch/csrc/autograd/engine.cpp (Outdated)
@@ -443,7 +432,7 @@ void Engine::reentrant_thread_init() {
  // set the local_ready_queue to the ready queue on the graph_task->owner_ device
  local_ready_queue = ready_queue_by_index(graph_task->cpu_ready_queue_, graph_task->owner_);
  total_depth = graph_task->reentrant_depth_;
- thread_main(graph_task, /* reentrant thread*/ true);
+ thread_main(graph_task, /* device_thread */ false);
I'm a bit confused about the argument change here; device threads can also have reentrant backwards, right?
I would agree here. Since we have multiple use cases for this, it might be clearer to just describe what it does. Or just remove this flag, since it is completely redundant with the first argument :D
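On the "remove the flag" idea, a hypothetical sketch only (reusing the toy types from the sketch above, and not necessarily what this PR ends up doing): the long-lived behaviour could be inferred from the first argument, e.g. a null GraphTask meaning "spin forever".

```cpp
// Hypothetical variant: infer the long-lived behaviour from the first argument
// instead of a separate flag (toy GraphTask/NodeTask/ReadyQueue from the
// earlier sketch; not the actual engine code).
void thread_main(std::shared_ptr<GraphTask> graph_task,
                 ReadyQueue& local_ready_queue) {
  const bool long_lived = (graph_task == nullptr);
  while (long_lived || !graph_task->completed) {
    NodeTask task = local_ready_queue.pop();
    (void)task;  // execute the node and do bookkeeping as in the sketch above
  }
}
```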
Good catch on the increment/decrement issues! Just a minor point about argument naming, but I think the overall logic is good.
Thanks for the update. LGTM
Shall we land this today before branch cut?
This pull request has been merged in 54c05fa.