fix handling of replica parameters in DataParallel #33907

ngimel · 2020-02-27T22:25:12Z

In DataParallel, replica parameters are not leaves (because they are computed via broadcast from master parameters), and should be treated as such. Fixes #33552
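To make the leaf/non-leaf distinction concrete, here is a minimal CPU-only sketch. It assumes nothing about DataParallel internals: the multiplication is only a stand-in for the broadcast, and the names `master` and `replica` are illustrative.

```python
import torch

# "Master" parameter: created directly by the user, so it is an autograd leaf.
master = torch.nn.Parameter(torch.ones(3))

# Stand-in for a replica: any tensor computed from the master through a
# differentiable op is a non-leaf. (DataParallel uses a broadcast across
# GPUs; the multiplication here is only a CPU-friendly substitute.)
replica = master * 1.0

print(master.is_leaf)   # True
print(replica.is_leaf)  # False

# Gradients flow back through the replica and accumulate on the master.
replica.sum().backward()
print(master.grad)      # tensor([1., 1., 1.])
```

Because the replica has a `grad_fn`, wrapping it as a fresh leaf parameter would cut it off from the master, which is why it should be kept as a plain non-leaf tensor.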

mruberry

Awesome fix! Nice to see a test go from doing nothing to actually testing something.

facebook-github-bot

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

dr-ci · 2020-02-28T00:01:22Z

💊 CircleCI build failures summary and remediations

As of commit d70e4cd (more details on the Dr. CI page):

11/11 failures introduced in this PR

🕵️ 10 new failures recognized by patterns

The following build failures do not appear to be due to upstream breakages (reran 8 jobs to discount flakiness):

binary_linux_manywheel_2_7mu_cpu_devtoolset7_build (1/10)

Step: "Build" <confirmed not flaky by 2 failures>

Mar 09 23:04:49 ImportError: No module named setuptools

Mar 09 23:04:49 ++ PATCHELF_BIN=/usr/local/bin/patchelf 
Mar 09 23:04:49 +++ /usr/local/bin/patchelf --version 
Mar 09 23:04:49 ++ patchelf_version='patchelf 0.10' 
Mar 09 23:04:49 ++ echo 'patchelf version: ' patchelf 0.10 
Mar 09 23:04:49 patchelf version:  patchelf 0.10 
Mar 09 23:04:49 ++ [[ patchelf 0.10 == \p\a\t\c\h\e\l\f\ \0\.\9 ]] 
Mar 09 23:04:49 ++ python setup.py clean 
Mar 09 23:04:49 Traceback (most recent call last): 
Mar 09 23:04:49   File "setup.py", line 165, in <module> 
Mar 09 23:04:49     from setuptools import setup, Extension, distutils, find_packages 
Mar 09 23:04:49 ImportError: No module named setuptools

binary_linux_libtorch_2_7m_cpu_devtoolset7_shared-with-deps_build (2/10)

Step: "Build" <confirmed not flaky by 2 failures>

Mar 09 23:04:51 ImportError: No module named setuptools

Mar 09 23:04:51 ++ PATCHELF_BIN=/usr/local/bin/patchelf 
Mar 09 23:04:51 +++ /usr/local/bin/patchelf --version 
Mar 09 23:04:51 ++ patchelf_version='patchelf 0.10' 
Mar 09 23:04:51 ++ echo 'patchelf version: ' patchelf 0.10 
Mar 09 23:04:51 patchelf version:  patchelf 0.10 
Mar 09 23:04:51 ++ [[ patchelf 0.10 == \p\a\t\c\h\e\l\f\ \0\.\9 ]] 
Mar 09 23:04:51 ++ python setup.py clean 
Mar 09 23:04:51 Traceback (most recent call last): 
Mar 09 23:04:51   File "setup.py", line 165, in <module> 
Mar 09 23:04:51     from setuptools import setup, Extension, distutils, find_packages 
Mar 09 23:04:51 ImportError: No module named setuptools

pytorch_linux_xenial_py3_clang5_android_ndk_r19c_mobile_code_analysis (3/10)

Step: "Build" <confirmed not flaky by 2 failures>

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:07597f23-fa81-474c-8bef-5c8a91b50595 not found

DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:07597f23-fa81-474c-8bef-5c8a91b50595 
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:07597f23-fa81-474c-8bef-5c8a91b50595 not found

pytorch_linux_xenial_py3_clang5_asan_build (4/10)

Step: "Build" <confirmed not flaky by 2 failures>

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-asan:07597f23-fa81-474c-8bef-5c8a91b50595 not found

DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-asan:07597f23-fa81-474c-8bef-5c8a91b50595 
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-asan:07597f23-fa81-474c-8bef-5c8a91b50595 not found

pytorch_linux_xenial_cuda10_1_cudnn7_py3_gcc7_build (5/10)

Step: "Build" <confirmed not flaky by 2 failures>

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7:07597f23-fa81-474c-8bef-5c8a91b50595 not found

DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7:07597f23-fa81-474c-8bef-5c8a91b50595 
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7:07597f23-fa81-474c-8bef-5c8a91b50595 not found

pytorch_linux_xenial_py3_clang5_android_ndk_r19c_x86_32_build (6/10)

Step: "Build" <confirmed not flaky by 2 failures>

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:07597f23-fa81-474c-8bef-5c8a91b50595 not found

DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:07597f23-fa81-474c-8bef-5c8a91b50595 
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:07597f23-fa81-474c-8bef-5c8a91b50595 not found

pytorch_libtorch_linux_xenial_cuda10_1_cudnn7_py3_gcc7_build (7/10)

Step: "Build" <confirmed not flaky by 2 failures>

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7:07597f23-fa81-474c-8bef-5c8a91b50595 not found

DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7:07597f23-fa81-474c-8bef-5c8a91b50595 
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.1-cudnn7-py3-gcc7:07597f23-fa81-474c-8bef-5c8a91b50595 not found

pytorch_linux_xenial_py3_6_gcc5_4_build (8/10)

Step: "Build" <confirmed not flaky by 2 failures>

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:07597f23-fa81-474c-8bef-5c8a91b50595 not found

DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:07597f23-fa81-474c-8bef-5c8a91b50595 
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:07597f23-fa81-474c-8bef-5c8a91b50595 not found

pytorch_xla_linux_xenial_py3_6_clang7_build (9/10)

Step: "Build" <confirmed not flaky by 2 failures>

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-clang7:07597f23-fa81-474c-8bef-5c8a91b50595 not found

DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-clang7:07597f23-fa81-474c-8bef-5c8a91b50595 
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-clang7:07597f23-fa81-474c-8bef-5c8a91b50595 not found

pytorch_linux_xenial_py3_clang5_mobile_build (10/10)

Step: "Build" <confirmed not flaky by 2 failures>

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-asan:07597f23-fa81-474c-8bef-5c8a91b50595 not found

DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-asan:07597f23-fa81-474c-8bef-5c8a91b50595 
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-asan:07597f23-fa81-474c-8bef-5c8a91b50595 not found

1 failure not recognized by patterns:

Job	Step	Status
caffe2_onnx_py2_gcc5_ubuntu16_04_build	Build

This comment was automatically generated by Dr. CI (expand for details).

This comment has been revised 6 times.

ngimel · 2020-02-29T18:43:13Z

cc @suo. JIT tests are now passing because I don't delete an attribute for scripted modules. However, scripted RNN + DataParallel did not work before this PR and is still broken after it. The reason is that scripting does not call the overridden _apply from the RNN module when moving a scripted module to CUDA, and does not call the overridden __setattr__ from the RNN module when creating replicas. The generic implementations of those do not work for RNNs (that's why we needed those overrides).
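The failure mode described here (a generic code path skipping an override that keeps derived state in sync) can be shown with a toy model in plain Python. This is only an analogy: `ToyRNN`, `flat_weights`, and both `replicate_*` functions are hypothetical names, not PyTorch's actual RNN or replication code.

```python
class ToyRNN:
    """Toy model only; not PyTorch's RNN implementation."""

    def __init__(self, weight):
        self.flat_weights = [weight]  # derived cache that must mirror `weight`
        self.weight = weight

    def __setattr__(self, name, value):
        # The override keeps the cache in sync on every normal assignment.
        if name == "weight":
            self.flat_weights[0] = value
        object.__setattr__(self, name, value)


def replicate_with_override(src, new_weight):
    """Replica construction that assigns normally, so the override runs."""
    dst = ToyRNN(src.weight)
    dst.weight = new_weight
    return dst


def replicate_generic(src, new_weight):
    """Generic construction that writes the attribute directly."""
    dst = ToyRNN(src.weight)
    dst.__dict__["weight"] = new_weight  # bypasses ToyRNN.__setattr__
    return dst


master = ToyRNN("w_cpu")
synced = replicate_with_override(master, "w_gpu")
stale = replicate_generic(master, "w_gpu")

print(synced.weight, synced.flat_weights)  # w_gpu ['w_gpu']
print(stale.weight, stale.flat_weights)    # w_gpu ['w_cpu']
```

In the second replica the attribute is updated but the cache still points at the old weight, which is the kind of inconsistency the overrides exist to prevent.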

suo

Change LGTM. Can you file an issue about the brokenness of RNN + DataParallel + JIT?

facebook-github-bot

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

ngimel · 2020-03-09T23:10:50Z

#34513

facebook-github-bot · 2020-03-10T20:24:52Z

@ngimel merged this pull request in cd9d9a2.

DDP expected that .parameters() would return all parameters of a replicated module. However, after pytorch#33907, we no longer populate _parameters for replicated modules. This PR fixes that problem by keeping the params in a _former_parameters ordered dict on every module replica.
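For code outside DDP, one consequence is that a call such as `next(module.parameters())` can find nothing on a replica. The sketch below shows one defensive pattern that avoids relying on the private `_former_parameters` dict. It is an illustration under assumptions: `first_tensor` and `ReplicaLike` are hypothetical names, and `ReplicaLike` only imitates a replica on CPU by storing its weight as a plain non-leaf tensor attribute.

```python
import torch
from torch import nn


def first_tensor(module: nn.Module) -> torch.Tensor:
    """Hypothetical helper: return some tensor owned by `module`."""
    # Ordinary modules have a populated _parameters dict.
    param = next(module.parameters(), None)
    if param is not None:
        return param
    # Replica-like modules may hold weights as plain tensor attributes.
    for submodule in module.modules():
        for value in vars(submodule).values():
            if torch.is_tensor(value):
                return value
    raise ValueError("module holds no tensors")


class ReplicaLike(nn.Module):
    """Stand-in for a replica: its weight is a non-leaf plain tensor."""

    def __init__(self):
        super().__init__()
        # The product is a Tensor, not an nn.Parameter, so it is not
        # registered in _parameters and lands in the instance __dict__.
        self.weight = nn.Parameter(torch.ones(2, dtype=torch.float64)) * 1.0


ordinary = nn.Linear(2, 2)
replica_like = ReplicaLike()

print(len(list(replica_like.parameters())))  # 0
print(first_tensor(ordinary).dtype)          # torch.float32 (default dtype)
print(first_tensor(replica_like).dtype)      # torch.float64
```

A helper like this is typically used to look up a module's dtype or device inside `forward`, where the module may be either the original or a replica.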

ngimel requested a review from mruberry February 27, 2020 22:25

ngimel requested a review from apaszke as a code owner February 27, 2020 22:25

mruberry approved these changes Feb 27, 2020

facebook-github-bot reviewed Feb 27, 2020

suo approved these changes Mar 9, 2020

ngimel added 4 commits March 9, 2020 16:00

fix handling of replica parameters

79225ef

fix data_parallel rnn test

2d64a38

lint

caee6bc

fix handling of scripted modules

d70e4cd

ngimel force-pushed the data_parallel branch from 65a0708 to d70e4cd Compare March 9, 2020 23:01

facebook-github-bot reviewed Mar 9, 2020

facebook-github-bot closed this in cd9d9a2 Mar 10, 2020

facebook-github-bot added the merged label Mar 10, 2020

ousou mentioned this pull request Mar 24, 2020

GRU model learns very slowly when using DataParallel with multiple GPUs #33238

Closed

mrshenli mentioned this pull request Apr 13, 2020

Fix DDP bug in single process multiple device use cases #36523

Closed

mrshenli mentioned this pull request Apr 22, 2020

Result parameters from Single-Process Multi-GPU DDP training on RNN do not match local training #37079

Open

This was referenced May 11, 2020

Pytorch 1.5 DataParallel huggingface/transformers#3936

Closed

Fix nn.DataParallel compatibility in PyTorch 1.5 huggingface/transformers#4300

Merged

mrshenli mentioned this pull request May 14, 2020

Error out when parameters() is called on replicated models #38493

Closed

mrshenli mentioned this pull request Jun 23, 2020

DataParallel with Torch 1.5 #40457

Open

kan-bayashi mentioned this pull request Jul 24, 2020

Update chainer to v6.7.0? espnet/espnet#2188

Closed

pritamdamania87 mentioned this pull request Jul 30, 2020

Wrong in nn.ParameterList and nn.DataParallel, pytorch1.6 #42327

Closed

mrshenli mentioned this pull request Aug 3, 2020

nn.Parameter{List,Dict} not copied to gpus in forward pass when nn.DataParallel is used #36035

Closed

mruberry added the Merged label Oct 28, 2020

cskyan added a commit to ncbi-nlp/PhenoRerank that referenced this pull request Jun 17, 2021

Fix the bugs caused by the new way PyTorch handle nn.Parameters in Da…

83f367f

…taParallel replicas. ([pytorch/pytorch#33907](pytorch/pytorch#33907))

cskyan added a commit to ncbi-nlp/PhenoRerank that referenced this pull request Jun 17, 2021

Fix the bugs caused by the new way PyTorch handles nn.Parameters in D…

9be04f3

…ataParallel replicas. (pytorch/pytorch#33907)

cskyan added a commit to ncbi-nlp/PhenoRerank that referenced this pull request Jun 17, 2021

Fix the bugs caused by the new way PyTorch handles nn.Parameters in D…

869503b

…ataParallel replicas. (pytorch/pytorch#33907) Same issue is reported at huggingface/transformers#3936
