Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dist_optim] serialize compilation when creating dist_optim #45871

Closed
wants to merge 3 commits into from

Conversation

wanchaol
Copy link
Contributor

@wanchaol wanchaol commented Oct 5, 2020

Stack from ghstack:

This PR fixes #45845

The issue to the case is where request_callback spawn threads in the same process and when instantiating ScriptModule, it tries to compile concurrently. TorchScript compilation is not thread safe, so it results in the following failure:

RuntimeError("Can't redefine method: step on class: __torch__.torch.distributed.optim.optimizer._ScriptLocalOptimizer",)

This PR introduce a lock in ScriptLocalOptimizer to protect/serialize the compilation.

Differential Revision: D24125209

@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Oct 5, 2020
wanchaol added a commit that referenced this pull request Oct 5, 2020
Attempt to fix #45845

ghstack-source-id: 03ca7dfcdd5cb44796c5baa651dabad3ae1486e9
Pull Request resolved: #45871
@codecov
Copy link

codecov bot commented Oct 6, 2020

Codecov Report

Merging #45871 into gh/wanchaol/136/base will decrease coverage by 0.00%.
The diff coverage is 50.00%.

Impacted file tree graph

@@                   Coverage Diff                    @@
##           gh/wanchaol/136/base   #45871      +/-   ##
========================================================
- Coverage                 68.32%   68.32%   -0.01%     
========================================================
  Files                       410      410              
  Lines                     52992    52997       +5     
========================================================
+ Hits                      36208    36210       +2     
- Misses                    16784    16787       +3     
Impacted Files Coverage Δ
torch/distributed/optim/optimizer.py 37.50% <50.00%> (+0.15%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4ab73c1...54ec8d2. Read the comment docs.

wanchaol added a commit that referenced this pull request Oct 7, 2020
Attempt to fix #45845

ghstack-source-id: b8bc30a5752f15095b0977575ee34db9a3d07ea8
Pull Request resolved: #45871
Copy link
Contributor

@pritamdamania87 pritamdamania87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Can you update the PR summary to include details of why the test was flaky and how this PR fixed it.

@mrshenli
Copy link
Contributor

mrshenli commented Oct 7, 2020

Do we need this to be cherry-picked into v1.7?

@wanchaol
Copy link
Contributor Author

wanchaol commented Oct 7, 2020

Do we need this to be cherry-picked into v1.7?

Yes I will submit the cherry-pick PR soon :)

@facebook-github-bot
Copy link
Contributor

@wanchaol merged this pull request in 505be08.

wanchaol added a commit that referenced this pull request Oct 9, 2020
Summary:
Pull Request resolved: #45871

Attempt to fix #45845

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D24125209

Pulled By: wanchaol

fbshipit-source-id: e3697dd6ef107d8153d2a82d78a17c66d109b4fa
@facebook-github-bot facebook-github-bot deleted the gh/wanchaol/136/head branch October 11, 2020 14:18
malfet pushed a commit that referenced this pull request Oct 13, 2020
…46071)

Summary:
Pull Request resolved: #45871

Attempt to fix #45845

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D24125209

Pulled By: wanchaol

fbshipit-source-id: e3697dd6ef107d8153d2a82d78a17c66d109b4fa
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Merged oncall: distributed Add this issue/PR to distributed oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants