[ZeroRedundancyOptimizer] Elastic and pytorch compatible state dict #52760
Conversation
💊 Dr. CI: As of commit d02adb1, CI looks good so far — no failures detected.
o.step()
self.assertEqual(x, torch.tensor([0.9], device=DEVICE))

def test_local_state_dict(self):
no more "local state dict" concept, a bit error prone and not elastic, there was no user that I know of for this in fairscale
else:
    # Dispatch this rank's state dictionary to the wrapped shard optimizer
    self.load_local_state_dict(ZeroRedundancyOptimizer.rank_local_state_dict(self.rank, state_dict))
# NOTE: PyTorch 1.5 does not index linearly but with the id(params) at saving time
This makes ZeroRedundancyOptimizer compatible with a normal PyTorch checkpoint.
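For context, the v1.5 workaround mentioned above amounts to remapping state entries keyed by `id(param)` at save time onto the linear indices used by current torch.optim state dicts. The helper below is an illustrative sketch only: the function name and the exact keying are assumptions for the example, not code from this PR.

```python
# Illustrative sketch: remap an optimizer state dict whose "state" entries
# were keyed by id(param) at save time into linear-index keys. The helper
# name and the assumption about the old keying are hypothetical.
from typing import Any, Dict


def remap_state_keys_to_indices(state_dict: Dict[str, Any]) -> Dict[str, Any]:
    # param_groups list the per-group parameter keys in a deterministic
    # order, which defines the save-time-key -> linear-index mapping.
    old_keys = [k for group in state_dict["param_groups"] for k in group["params"]]
    key_to_index = {old: i for i, old in enumerate(old_keys)}

    return {
        "state": {key_to_index[k]: v for k, v in state_dict["state"].items()},
        "param_groups": [
            {**group, "params": [key_to_index[k] for k in group["params"]]}
            for group in state_dict["param_groups"]
        ],
    }
```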
LGTM! Added some minor comments. The main one is whether we need the v1.5 workaround.
OrderedDict()
)  # device, rank, params
self._param_rank: Dict[torch.Tensor, int] = {}
self._param_to_index: Dict[int, int] = {}
(This predates this PR.) Curious: what if we directly use Tensor instead of id(param) as the key, would that result in a wrong mapping? If so, shall we add a comment here to mention that?
I saw the comments below. Would I be correct to assume the concern with using Tensor as the map key is that a Tensor's hash depends on the Tensor implementation, which might 1) depend on the Tensor's value and 2) be more expensive than id()?
Hmm, good question. I didn't benchmark that; maybe it would be nicer to move to Tensor for consistency? The context is that PyTorch state-dict code typically uses id(param), so I did the same. When trying to "cache" this it moved into a more generic place, and I agree it does look inconsistent with the line just above.
It just returns the id, haha:

Lines 599 to 602 in 7a178a8:

def __hash__(self):
    if has_torch_function_unary(self):
        return handle_torch_function(Tensor.__hash__, (self,), self)
    return id(self)
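As a quick sanity check (not from the PR itself): for a plain tensor with no `__torch_function__` override, hashing falls back to the object id, so a dict keyed by Tensor behaves like a dict keyed by id(param).

```python
# Demonstrates that Tensor hashing falls back to id(), so Tensor-keyed and
# id()-keyed dicts resolve to the same entries for the same parameter object.
import torch

p = torch.nn.Parameter(torch.zeros(3))
assert hash(p) == id(p)

by_tensor = {p: 0}
by_id = {id(p): 0}
assert by_tensor[p] == by_id[id(p)]
```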
Summary: [ZeroRedundancyOptimizer] Elastic and pytorch compatible state dict
Test Plan: CircleCI / unit tests
Differential Revision: D26703501
fbshipit-source-id: 9f42e16735a192eaca24a8ed9108b50cd13460c3
Summary: Same as #52760, which I could not get to land. I just could not live with ghstack/ghimport randomly breaking things (I break enough of them myself), so this is a fresh copy without the ghstack shenanigans. I'm hopeful that this can land relatively bug free, and am sorry for the duplication. What this does:
- call the common_utils test runner instead of unittest, since that seems to be how it should be done
- change the state returned by ZeroRedundancyOptimizer to be PyTorch compliant, which has the added benefit of being elastic (world-size independent)

Pull Request resolved: #52960
Reviewed By: mrshenli
Differential Revision: D26710932
Pulled By: blefaudeux
fbshipit-source-id: 1d914bc9221442ba1bb2b48f5df10c313e674ece
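For reference, below is a small, single-process sketch (my illustration, not part of the PR) of the "PyTorch compliant" layout referred to above: a vanilla torch.optim state dict keys its state by linear parameter indices rather than by rank or by id, which is what makes it world-size independent.

```python
# Toy example showing the standard torch.optim state_dict layout:
# state keyed by linear parameter indices, matching param_groups["params"].
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

opt.zero_grad()
model(torch.randn(1, 4)).sum().backward()
opt.step()

sd = opt.state_dict()
print(sd["param_groups"][0]["params"])  # e.g. [0, 1] -- indices, not ids
print(sorted(sd["state"].keys()))       # per-param state keyed by the same indices
```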