[ZeroRedundancyOptimizer] Pack parameters in tensor views, handle training change #52764
Conversation
needs rebasing, doing that
Codecov Report

```diff
@@               Coverage Diff                @@
##           gh/blefaudeux/3/base   #52764      +/-   ##
========================================================
- Coverage                 80.46%   80.46%    -0.01%
========================================================
  Files                      1972     1972
  Lines                    216335   216351       +16
========================================================
+ Hits                     174079   174081        +2
- Misses                    42256    42270       +14
```
```
Keyword Args:
    optim (torch.nn.Optimizer): optimizer to shard
    group (group): torch.distributed group (default: group.WORLD)
    bucket_cap (int): the size of the buffer used to batch the small parameter tensors,
```
Is it because we don't need to overlap the comm with backward or the next forward, so it is always faster to use larger buckets? And since params are now bucket views, we also don't need to worry about doubling the memory consumption? Hence we don't need to expose the bucket size knob to users?
Yes, sorry, I should have added an explanation here. With these tensor views there's no need for bucketing, really: memory-wise the parameters are part of one optimal big bucket, but from the outside they're exposed as normal parameters (i.e. they're tensor views into the bucket). When the shards are sent (broadcast/all-gather), the backing buffer is synced between the ranks, so there's one call per rank pair instead of a lot of small calls.

To add more context and compare with the DDP/ShardedDDP reduce buckets: it's not the same situation, because here we don't wipe anything over time and there's no latency cost in using buckets. The parameters are always 100% there (the update process is sharded, but not the params as such), so there's no drawback in having maximal "tensor view buckets": they don't take any extra space, and they just remove calls. In the case of reduce with DDP, the buckets are more of a tradeoff, for instance because you need to wait for them to fill up, so there it makes sense to expose the knob, I think.

There's currently no overlap between the broadcast and the forward pass, and frankly it's killing performance; if you can think of a way to get that, it would be game changing. I spent some time on it, and outside of breaking the model into subparts (what FSDP does) I could not find anything. Within a node it's no problem, but between nodes with slow interconnect it's easily the bottleneck.
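To illustrate the call-count argument, here is a minimal runnable sketch; the single-process gloo group, the shapes, and the `bucket`/`params` names are all assumptions for illustration, not code from this PR. Broadcasting the flat bucket once syncs every parameter, because the parameters are views into it:

```python
import os
import torch
import torch.distributed as dist

# Single-process "gloo" group, only to make the snippet self-contained.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Parameters exposed as views into one flat buffer (cf. the packing loop below).
bucket = torch.randn(4 * 4 + 16 + 8)
shapes, params, offset = [(4, 4), (16,), (8,)], [], 0
for shape in shapes:
    n = torch.Size(shape).numel()
    params.append(bucket[offset:offset + n].view(shape))
    offset += n

# Naive sync: one collective call per parameter.
for p in params:
    dist.broadcast(p, src=0)

# Bucketed sync: one call per bucket; the parameter views pick it up for free.
dist.broadcast(bucket, src=0)

dist.destroy_process_group()
```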
```python
for param in trainable_params:
    offset_next = offset + param.numel()
    bucket[offset:offset_next].copy_(param.data.flatten())
    param.data = bucket[offset:offset_next].view_as(param.data)
```
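As a self-contained illustration of what this hunk achieves, here is a hedged sketch; the `trainable_params` and `bucket` setup and the trailing `offset` bookkeeping are stand-ins for surrounding code not shown in the diff:

```python
import torch

# Hypothetical stand-ins for the surrounding code not shown in the diff.
trainable_params = [torch.nn.Parameter(torch.randn(2, 3)) for _ in range(2)]
bucket = torch.empty(sum(p.numel() for p in trainable_params))

offset = 0
for param in trainable_params:
    offset_next = offset + param.numel()
    bucket[offset:offset_next].copy_(param.data.flatten())      # pack the value
    param.data = bucket[offset:offset_next].view_as(param.data)  # alias the bucket
    offset = offset_next

# The params now alias the bucket: mutating the bucket mutates the params.
bucket.zero_()
assert all(bool((p.data == 0).all()) for p in trainable_params)
```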
I wonder if we should still keep the bucket size knob and allow users to choose whether params should be bucket views or not, because changing `param.data` might break applications if the user has created separate views on the parameters. See the following code snippet: changing `x.data` won't update `y`.
```python
>>> import torch
>>> x = torch.zeros(2, 2)
>>> y = x.view(1, 4)
>>> y
tensor([[0., 0., 0., 0.]])
>>> x
tensor([[0., 0.],
        [0., 0.]])
>>> x.data = torch.ones(2, 2)
>>> x
tensor([[1., 1.],
        [1., 1.]])
>>> y
tensor([[0., 0., 0., 0.]])
```
Does it make sense to keep the bucket size knob? Then -1 would mean using bucket views with one bucket per device, as implemented in this PR, and other bucket size values would mean dedicated buckets and no views.
Yes, that makes sense, but in that case I would make it a binary choice: if you agree not to touch `.data`, then tensor view buckets are OK; otherwise, don't bucket at all, for safety? Without the tensor views, the buckets are a lot of code for not-so-great effects (extra memory cost, even if temporary, plus the need to unroll the bucket after the fact). What do you think?
One option might be to keep the PR as-is (always use bucket views) and clarify the side effect in the documentation, so that we don't surprise users. If we see requests for a no-view option, we can add that later. If this looks OK to you, I will stamp the PR.
I think your points were very valid, and adding an option to disable that is not a lot of work, so I would just do that. Adding the partial buckets with copies and unrolling is not worth it though, in my view, also because it complicates the code a fair bit, and people looking for the best performance all around will probably dig into more complex solutions.
In general LGTM! The main concern is whether we should keep the bucket size flag and allow users to choose whether they want to use bucket views as param data.
```
        optimizer to shard
    group (group):
        torch.distributed group (default: group.WORLD)
    parameters_as_bucket_view (bool):
```
Adding this new setting, similar to DDP's `gradient_as_bucket_view`.
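For reference, a hedged usage sketch of the new flag; the module path and the keyword names (`optim`, `parameters_as_bucket_view`) follow this diff's docstring and may differ from the API that eventually lands:

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer  # assumed module path

# Single-process group just to make the sketch self-contained.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(32, 32)
opt = ZeroRedundancyOptimizer(
    model.parameters(),
    optim=torch.optim.SGD,           # optimizer to shard (name per this docstring)
    parameters_as_bucket_view=True,  # params become views into a flat per-device bucket
    lr=0.01,
)
```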
```python
# Update the bucketing strategy accordingly
self._setup_bucket_strategy()
self._setup_flat_buffers()
```
Does this also need to be guarded by `parameters_as_bucket_view`?
LGTM! The only concern in this PR is whether we should guard all occurrences of `_setup_flat_buffers` with `parameters_as_bucket_view`.

A general topic for discussion: does it make sense to start with a small public API surface (i.e., mark APIs as private if they are not absolutely necessary)? It is a lot easier to add APIs than to retire them.
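A toy sketch of the guard under discussion, with stand-in methods rather than the real optimizer's code:

```python
class _ZeroOptimSketch:
    """Hypothetical stand-in to illustrate the suggested guard."""

    def __init__(self, parameters_as_bucket_view: bool) -> None:
        self.parameters_as_bucket_view = parameters_as_bucket_view

    def _setup_bucket_strategy(self) -> None:
        pass  # recompute which params belong to which rank and bucket

    def _setup_flat_buffers(self) -> None:
        pass  # (re)allocate the flat buffers and re-point params at them

    def _on_trainability_change(self) -> None:
        # Update the bucketing strategy accordingly
        self._setup_bucket_strategy()
        # Suggested guard: flat buffers only exist when params are bucket views
        if self.parameters_as_bucket_view:
            self._setup_flat_buffers()
```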
```python
self._setup_flat_buffers()
```
```python
@property
def per_device_params(self) -> Dict[torch.device, List[List[Parameter]]]:
```
Do we expect users to access this for any reason? Can this be private?
ShardedDDP typically uses that, but it can do so even if it's kept private; this is Python after all :)
- bucket as tensor views, optional
- make a lot of attributes private
- minor unit test refactor
- adding coverage in the unit test for with and without bucket views
Superseded by #52987, without ghstack.
…terface (#52987)

Summary: Updated version following #52764 (including comments from Shen), but this one I expect to be able to land.

ZeroRedundancyOptimizer:
- bucket as tensor views, optional
- make a lot of attributes private
- minor unit test refactor
- adding coverage in the unit test for with and without bucket views

Pull Request resolved: #52987
Reviewed By: mrshenli
Differential Revision: D26728851
Pulled By: blefaudeux
fbshipit-source-id: f8c745966719c9076c20a554ef56198fb838856c