fix handling of replica parameters in DataParallel #33907

Closed
ngimel wants to merge 4 commits

Collaborator

@ngimel ngimel commented Feb 27, 2020

In DataParallel, replica parameters are not leaves (because they are computed via broadcast from master parameters), and should be treated as such. Fixes #33552
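
A minimal sketch of what "not leaves" means here. It assumes `clone()` as a CPU-only stand-in for the broadcast step (DataParallel itself broadcasts across CUDA devices), and the names `master` and `replica` are illustrative only. The point is that a tensor computed from a parameter is a non-leaf, so gradients accumulate on the original parameter rather than on the copy.

```python
import torch

# "Master" parameter: a leaf tensor that owns its gradient.
master = torch.nn.Parameter(torch.ones(3))

# Stand-in for the broadcast (an assumption so the sketch runs on CPU):
# any differentiable copy of a parameter is a non-leaf with a grad_fn.
replica = master.clone()

print(master.is_leaf)   # True
print(replica.is_leaf)  # False

# Use the replica in a computation and backpropagate.
loss = (replica * 2.0).sum()
loss.backward()

# The gradient flows back through the copy and lands on the master.
print(master.grad)
```

Treating the replica as if it were a leaf parameter would hide this relationship, which is consistent with the missing-gradient symptom the PR description refers to.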

Collaborator

@mruberry mruberry left a comment

Awesome fix! Nice to see a test go from doing nothing to actually testing something.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

dr-ci bot commented Feb 28, 2020

💊 CircleCI build failures summary and remediations

As of commit d70e4cd (more details on the Dr. CI page):


  • 11/11 failures introduced in this PR

🕵️ 10 new failures recognized by patterns

The following build failures do not appear to be due to upstream breakages (reran 8 jobs to discount flakiness):

See CircleCI build binary_linux_manywheel_2_7mu_cpu_devtoolset7_build (1/10)

Step: "Build" (full log | pattern match details) <confirmed not flaky by 2 failures>

Mar 09 23:04:49 ImportError: No module named setuptools
Mar 09 23:04:49 ++ PATCHELF_BIN=/usr/local/bin/patchelf 
Mar 09 23:04:49 +++ /usr/local/bin/patchelf --version 
Mar 09 23:04:49 ++ patchelf_version='patchelf 0.10' 
Mar 09 23:04:49 ++ echo 'patchelf version: ' patchelf 0.10 
Mar 09 23:04:49 patchelf version:  patchelf 0.10 
Mar 09 23:04:49 ++ [[ patchelf 0.10 == \p\a\t\c\h\e\l\f\ \0\.\9 ]] 
Mar 09 23:04:49 ++ python setup.py clean 
Mar 09 23:04:49 Traceback (most recent call last): 
Mar 09 23:04:49   File "setup.py", line 165, in <module> 
Mar 09 23:04:49     from setuptools import setup, Extension, distutils, find_packages 
Mar 09 23:04:49 ImportError: No module named setuptools 

See CircleCI build binary_linux_libtorch_2_7m_cpu_devtoolset7_shared-with-deps_build (2/10)

Step: "Build" (full log | pattern match details) <confirmed not flaky by 2 failures>

Mar 09 23:04:51 ImportError: No module named setuptools
Mar 09 23:04:51 ++ PATCHELF_BIN=/usr/local/bin/patchelf 
Mar 09 23:04:51 +++ /usr/local/bin/patchelf --version 
Mar 09 23:04:51 ++ patchelf_version='patchelf 0.10' 
Mar 09 23:04:51 ++ echo 'patchelf version: ' patchelf 0.10 
Mar 09 23:04:51 patchelf version:  patchelf 0.10 
Mar 09 23:04:51 ++ [[ patchelf 0.10 == \p\a\t\c\h\e\l\f\ \0\.\9 ]] 
Mar 09 23:04:51 ++ python setup.py clean 
Mar 09 23:04:51 Traceback (most recent call last): 
Mar 09 23:04:51   File "setup.py", line 165, in <module> 
Mar 09 23:04:51     from setuptools import setup, Extension, distutils, find_packages 
Mar 09 23:04:51 ImportError: No module named setuptools 

See CircleCI build pytorch_linux_xenial_py3_clang5_android_ndk_r19c_mobile_code_analysis (3/10)

Step: "Build" (full log | pattern match details) <confirmed not flaky by 2 failures>

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:07597f23-fa81-474c-8bef-5c8a91b50595 not found
DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:07597f23-fa81-474c-8bef-5c8a91b50595 
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:07597f23-fa81-474c-8bef-5c8a91b50595 not found 

See CircleCI build pytorch_linux_xenial_py3_clang5_asan_build (4/10)

Step: "Build" (full log | pattern match details) <confirmed not flaky by 2 failures>

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-asan:07597f23-fa81-474c-8bef-5c8a91b50595 not found
DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-asan:07597f23-fa81-474c-8bef-5c8a91b50595 
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-asan:07597f23-fa81-474c-8bef-5c8a91b50595 not found 

See CircleCI build pytorch_linux_xenial_cuda10_1_cudnn7_py3_gcc7_build (5/10)

Step: "Build" (full log | pattern match details) <confirmed not flaky by 2 failures>

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7:07597f23-fa81-474c-8bef-5c8a91b50595 not found
DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7:07597f23-fa81-474c-8bef-5c8a91b50595 
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7:07597f23-fa81-474c-8bef-5c8a91b50595 not found 

See CircleCI build pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build (6/10)

Step: "Build" (full log | pattern match details) <confirmed not flaky by 2 failures>

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:07597f23-fa81-474c-8bef-5c8a91b50595 not found
DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:07597f23-fa81-474c-8bef-5c8a91b50595 
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:07597f23-fa81-474c-8bef-5c8a91b50595 not found 

See CircleCI build pytorch_libtorch_linux_xenial_cuda10_1_cudnn7_py3_gcc7_build (7/10)

Step: "Build" (full log | pattern match details) <confirmed not flaky by 2 failures>

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7:07597f23-fa81-474c-8bef-5c8a91b50595 not found
DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7:07597f23-fa81-474c-8bef-5c8a91b50595 
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7:07597f23-fa81-474c-8bef-5c8a91b50595 not found 

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (8/10)

Step: "Build" (full log | pattern match details) <confirmed not flaky by 2 failures>

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:07597f23-fa81-474c-8bef-5c8a91b50595 not found
DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:07597f23-fa81-474c-8bef-5c8a91b50595 
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:07597f23-fa81-474c-8bef-5c8a91b50595 not found 

See CircleCI build pytorch_xla_linux_xenial_py3_6_clang7_build (9/10)

Step: "Build" (full log | pattern match details) <confirmed not flaky by 2 failures>

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-clang7:07597f23-fa81-474c-8bef-5c8a91b50595 not found
DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-clang7:07597f23-fa81-474c-8bef-5c8a91b50595 
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-clang7:07597f23-fa81-474c-8bef-5c8a91b50595 not found 

See CircleCI build pytorch_linux_xenial_py3_clang5_mobile_build (10/10)

Step: "Build" (full log | pattern match details) <confirmed not flaky by 2 failures>

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-asan:07597f23-fa81-474c-8bef-5c8a91b50595 not found
DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-asan:07597f23-fa81-474c-8bef-5c8a91b50595 
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-asan:07597f23-fa81-474c-8bef-5c8a91b50595 not found 

1 failure not recognized by patterns:

Job Step Status
CircleCI caffe2_onnx_py2_gcc5_ubuntu16_04_build Build

This comment was automatically generated by Dr. CI.

This comment has been revised 6 times.

Collaborator Author

ngimel commented Feb 29, 2020

cc @suo. JIT tests are now passing because I no longer delete an attribute for scripted modules. However, scripted RNN + DataParallel did not work before this PR and is still broken after it. The reason is that scripting does not call the overridden _apply from the RNN module when moving a scripted module to CUDA, and does not call the overridden __setattr__ from the RNN module when creating replicas; the generic implementations of those do not work for RNNs (that's why we needed those overrides).
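
To illustrate the failure mode described above, here is a toy sketch in plain Python. It is not the PyTorch RNN code or the scripting machinery: `CachedWeights`, `generic_replace`, and the string "weights" are all hypothetical. It only shows how a class that relies on an overridden attribute-setting hook (`__setattr__`) to keep a cache in sync ends up with stale state when a generic code path writes attributes without going through that hook.

```python
class CachedWeights:
    """Toy module that keeps a flat cache of its weights (hypothetical)."""

    def __init__(self, weight):
        self._names = ["weight"]
        self._flat = [None]
        self.weight = weight  # goes through the override below

    def __setattr__(self, name, value):
        # Keep the flat cache in sync whenever a tracked attribute is set.
        names = self.__dict__.get("_names", [])
        if name in names:
            self._flat[names.index(name)] = value
        super().__setattr__(name, value)


def generic_replace(module, name, value):
    # Generic path that writes straight into __dict__, skipping the override.
    module.__dict__[name] = value


m = CachedWeights("cpu_weight")
m.weight = "gpu0_weight"                      # override runs, cache follows
generic_replace(m, "weight", "gpu1_weight")   # override skipped, cache is stale

print(m.weight, m._flat)  # gpu1_weight ['gpu0_weight']
```

The attribute and the cache now disagree, which is the kind of inconsistency an override exists to prevent.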

Member

@suo suo left a comment

Change lgtm, can you file an issue about the brokenness of RNN + DataParallel + JIT?

Contributor

@facebook-github-bot facebook-github-bot left a comment

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Collaborator Author

ngimel commented Mar 9, 2020

#34513

Contributor

@ngimel merged this pull request in cd9d9a2.

mrshenli pushed a commit to mrshenli/pytorch that referenced this pull request Apr 13, 2020
DDP was expecting .parameters() to return all parameters
of a replicated module. However, after pytorch#33907, we no longer
populate _parameters for replicated modules. This PR fixes
that problem by keeping params in a _former_parameters ordered
dict on every module replica.
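
A rough sketch of the idea in that commit message, in plain Python. `ToyReplica` and `replica_parameters` are hypothetical names, not the DDP implementation; only `_parameters` and `_former_parameters` come from the message above. It shows a lookup that consults only `_parameters` coming back empty on a replica, while a walk over `_former_parameters` still recovers every parameter.

```python
from collections import OrderedDict


class ToyReplica:
    """Hypothetical stand-in for a module replica (not torch.nn.Module)."""

    def __init__(self, former_params, children=()):
        self._parameters = OrderedDict()  # left empty on replicas
        self._former_parameters = OrderedDict(former_params)
        self._children = list(children)

    def parameters(self):
        # Mimics a lookup that only consults _parameters.
        found = list(self._parameters.values())
        for child in self._children:
            found.extend(child.parameters())
        return found


def replica_parameters(replica):
    """Collect parameters from _former_parameters across the replica tree."""
    found = list(replica._former_parameters.values())
    for child in replica._children:
        found.extend(replica_parameters(child))
    return found


leaf = ToyReplica([("weight_hh", "w_hh")])
root = ToyReplica([("weight", "w"), ("bias", "b")], children=[leaf])

print(root.parameters())         # []
print(replica_parameters(root))  # ['w', 'b', 'w_hh']
```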
cskyan added a commit to ncbi-nlp/PhenoRerank that referenced this pull request Jun 17, 2021
Successfully merging this pull request may close these issues.

RNNs do not have gradients while using DataParallel in 1.4.0
4 participants