
[DDP] Log if graph is static at end of training #61871

Closed · wants to merge 9 commits

Conversation

@rohan-varma (Member) commented Jul 19, 2021

Stack from ghstack:

When `set_static_graph=False`, the only type of dynamism DDP really supports is a dynamic set of unused parameters, which must be explicitly enabled with `find_unused_parameters=True`. However, some workflows have a static set of unused parameters, so it would be good to detect this and record it in the logging data to identify workflows that are candidates for the static graph optimization.

Differential Revision: [D29773962](https://our.internmc.facebook.com/intern/diff/D29773962/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook-specific changes or comments; please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D29773962/)!
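The detection described above can be sketched in plain Python. This is a hypothetical illustration of the idea, not the actual C++ `Reducer` implementation, and the class and method names below are invented: track the set of unused parameters each iteration and report at the end of training whether it ever changed.

```python
# Hypothetical sketch, not PyTorch's actual Reducer code: track the set of
# unused parameters seen each iteration and report at the end of training
# whether the graph was effectively static.

class GraphStaticnessTracker:
    def __init__(self):
        self.ddp_graph_static = True   # assume static until proven otherwise
        self.prev_unused = None        # unused-parameter set from last iteration

    def record_iteration(self, unused_param_names):
        """Record the unused parameters observed in one training iteration."""
        unused = frozenset(unused_param_names)
        if self.prev_unused is not None and unused != self.prev_unused:
            # The set of unused parameters changed: the graph is dynamic.
            self.ddp_graph_static = False
        self.prev_unused = unused

    def end_of_training_log(self):
        """Produce the staticness entry to add to DDP's logging data."""
        return f"ddp_graph_static={self.ddp_graph_static}"


tracker = GraphStaticnessTracker()
tracker.record_iteration(["decoder.bias"])
tracker.record_iteration(["decoder.bias"])
print(tracker.end_of_training_log())  # -> ddp_graph_static=True
```

A workflow that ends up logging `ddp_graph_static=True` despite running with `find_unused_parameters=True` is exactly the kind of static-graph candidate this PR aims to surface.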
@facebook-github-bot (Contributor) commented Jul 19, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 90f1395 (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



❄️ 2 failures tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_bionic_cuda10_2_cudnn7_py3_9_gcc7_test2 (1/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Jul 27 03:22:30 unknown file: Failure
Jul 27 03:22:30 [ RUN      ] GradModeTest.TestRequiresGradInplaceOp
Jul 27 03:22:30 [       OK ] GradModeTest.TestRequiresGradInplaceOp (0 ms)
Jul 27 03:22:30 [ RUN      ] GradModeTest.TestRequiresGradViewOp
Jul 27 03:22:30 [       OK ] GradModeTest.TestRequiresGradViewOp (0 ms)
Jul 27 03:22:30 [ RUN      ] GradModeTest.TestRequiresGradViewOpExiting
Jul 27 03:22:30 [       OK ] GradModeTest.TestRequiresGradViewOpExiting (3 ms)
Jul 27 03:22:30 [----------] 4 tests from GradModeTest (3 ms total)
Jul 27 03:22:30 
Jul 27 03:22:30 [----------] 9 tests from ParallelTest
Jul 27 03:22:30 [ RUN      ] ParallelTest.DifferentiableScatter_MultiCUDA
Jul 27 03:22:30 unknown file: Failure
Jul 27 03:22:30 C++ exception with description "All inputs to Gather must be CUDA tensors, got UndefinedType
Jul 27 03:22:30 Exception raised from apply at /var/lib/jenkins/workspace/torch/csrc/autograd/functions/comm.cpp:84 (most recent call first):
Jul 27 03:22:30 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7fc7c24284bb in /var/lib/jenkins/workspace/build/lib/libc10.so)
Jul 27 03:22:30 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xce (0x7fc7c242410e in /var/lib/jenkins/workspace/build/lib/libc10.so)
Jul 27 03:22:30 frame #2: torch::autograd::Gather::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0xb49 (0x7fc7c3912259 in /var/lib/jenkins/workspace/build/lib/libtorch_cuda.so)
Jul 27 03:22:30 frame #3: <unknown function> + 0x3b55ccd (0x7fc7d5d9fccd in /var/lib/jenkins/workspace/build/lib/libtorch_cpu.so)
Jul 27 03:22:30 frame #4: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1533 (0x7fc7d5d9b313 in /var/lib/jenkins/workspace/build/lib/libtorch_cpu.so)
Jul 27 03:22:30 frame #5: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x431 (0x7fc7d5d9bea1 in /var/lib/jenkins/workspace/build/lib/libtorch_cpu.so)
Jul 27 03:22:30 frame #6: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x99 (0x7fc7d5d93779 in /var/lib/jenkins/workspace/build/lib/libtorch_cpu.so)
Jul 27 03:22:30 frame #7: <unknown function> + 0xc9039 (0x7fc7dde8a039 in /opt/conda/lib/libstdc++.so.6)

See CircleCI build pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test2 (2/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Jul 27 03:56:27 unknown file: Failure
Jul 27 03:56:27 [ RUN      ] GradModeTest.TestRequiresGradInplaceOp
Jul 27 03:56:27 [       OK ] GradModeTest.TestRequiresGradInplaceOp (0 ms)
Jul 27 03:56:27 [ RUN      ] GradModeTest.TestRequiresGradViewOp
Jul 27 03:56:27 [       OK ] GradModeTest.TestRequiresGradViewOp (0 ms)
Jul 27 03:56:27 [ RUN      ] GradModeTest.TestRequiresGradViewOpExiting
Jul 27 03:56:27 [       OK ] GradModeTest.TestRequiresGradViewOpExiting (3 ms)
Jul 27 03:56:27 [----------] 4 tests from GradModeTest (4 ms total)
Jul 27 03:56:27 
Jul 27 03:56:27 [----------] 9 tests from ParallelTest
Jul 27 03:56:27 [ RUN      ] ParallelTest.DifferentiableScatter_MultiCUDA
Jul 27 03:56:27 unknown file: Failure
Jul 27 03:56:27 C++ exception with description "All inputs to Gather must be CUDA tensors, got UndefinedType
Jul 27 03:56:27 Exception raised from apply at /var/lib/jenkins/workspace/torch/csrc/autograd/functions/comm.cpp:84 (most recent call first):
Jul 27 03:56:27 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7fe61117542b in /var/lib/jenkins/workspace/build/lib/libc10.so)
Jul 27 03:56:27 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xce (0x7fe611170bee in /var/lib/jenkins/workspace/build/lib/libc10.so)
Jul 27 03:56:27 frame #2: torch::autograd::Gather::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0xb49 (0x7fe5f86a6d49 in /var/lib/jenkins/workspace/build/lib/libtorch_cuda_cu.so)
Jul 27 03:56:27 frame #3: <unknown function> + 0x3b71cdd (0x7fe615132cdd in /var/lib/jenkins/workspace/build/lib/libtorch_cpu.so)
Jul 27 03:56:27 frame #4: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1533 (0x7fe61512e323 in /var/lib/jenkins/workspace/build/lib/libtorch_cpu.so)
Jul 27 03:56:27 frame #5: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x431 (0x7fe61512eeb1 in /var/lib/jenkins/workspace/build/lib/libtorch_cpu.so)
Jul 27 03:56:27 frame #6: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x99 (0x7fe615126789 in /var/lib/jenkins/workspace/build/lib/libtorch_cpu.so)
Jul 27 03:56:27 frame #7: <unknown function> + 0xc9039 (0x7fe621df8039 in /opt/conda/lib/libstdc++.so.6)

This comment was automatically generated by Dr. CI.

@facebook-github-bot added the `oncall: distributed` and `cla signed` labels Jul 19, 2021
@zhaojuanmao (Contributor) left a comment:

one nit

@@ -1211,6 +1216,14 @@ void Reducer::search_unused_parameters(
"flag off. Note that this warning may be a false positive if your model "
"has flow control causing later iterations to have unused parameters.");
}
if (!static_graph_) {
nit: use `if (!static_graph_ && ddp_graph_static_)`, so that the block body can be skipped once `ddp_graph_static_` is already false
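The suggested short-circuit can be illustrated with a small standalone sketch (hypothetical names; the real code is C++ in the Reducer): once the graph has been observed to be dynamic, the flag stays `False` and the per-iteration comparison is skipped entirely.

```python
# Hypothetical illustration of the reviewer's nit: guard the check with both
# flags so the comparison work is skipped once the graph is known dynamic.

def update_graph_static(static_graph, ddp_graph_static, prev_unused, unused):
    """Return the updated ddp_graph_static flag after one iteration."""
    if not static_graph and ddp_graph_static:
        # Only do the set comparison while the graph still looks static;
        # after the flag flips to False, this branch is never entered again.
        if prev_unused is not None and set(unused) != set(prev_unused):
            return False
    return ddp_graph_static

flag = True
flag = update_graph_static(False, flag, None, ["bias"])        # first iteration
flag = update_graph_static(False, flag, ["bias"], ["weight"])  # set changed
print(flag)  # -> False
```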

rohan-varma added a commit that referenced this pull request Jul 19, 2021
Pull Request resolved: #61871
ghstack-source-id: 133830868
rohan-varma added a commit that referenced this pull request Jul 21, 2021
Pull Request resolved: #61871
ghstack-source-id: 133995573
driazati added a commit that referenced this pull request Jul 21, 2021
driazati added a commit that referenced this pull request Jul 21, 2021
ghstack-source-id: 4b52956580b8374f3357bb82b8ec7f5937cae7ee
Pull Request resolved: #62005
rohan-varma added a commit that referenced this pull request Jul 23, 2021
Pull Request resolved: #61871
ghstack-source-id: 134209980
rohan-varma added a commit that referenced this pull request Jul 24, 2021
Pull Request resolved: #61871
ghstack-source-id: 134253626
rohan-varma added a commit that referenced this pull request Jul 25, 2021
Pull Request resolved: #61871
ghstack-source-id: 134264548
rohan-varma added a commit that referenced this pull request Jul 27, 2021
Pull Request resolved: #61871
ghstack-source-id: 134371429
@facebook-github-bot (Contributor): This pull request has been merged in 2cbc0ed.

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/356/head branch July 31, 2021 14:17
Labels
cla signed · Merged · oncall: distributed
3 participants