
Conversation

blefaudeux
Contributor

@blefaudeux blefaudeux commented Feb 24, 2021

Stack from ghstack:

@facebook-github-bot
Contributor

facebook-github-bot commented Feb 24, 2021

💊 CI failures summary and remediations

As of commit 830d1b0 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-scanned failure(s)


blefaudeux added a commit that referenced this pull request Feb 24, 2021
…ining change

ghstack-source-id: ecd0bcb
Pull Request resolved: #52764
@blefaudeux
Contributor Author

needs rebasing, doing that

blefaudeux added a commit that referenced this pull request Feb 24, 2021
…ining change

ghstack-source-id: 175b0f9
Pull Request resolved: #52764
@codecov

codecov bot commented Feb 25, 2021

Codecov Report

Merging #52764 (830d1b0) into gh/blefaudeux/3/base (123ad39) will decrease coverage by 0.01%.
The diff coverage is 7.27%.

@@                   Coverage Diff                    @@
##           gh/blefaudeux/3/base   #52764      +/-   ##
========================================================
- Coverage                 80.46%   80.46%   -0.01%     
========================================================
  Files                      1972     1972              
  Lines                    216335   216351      +16     
========================================================
+ Hits                     174079   174081       +2     
- Misses                    42256    42270      +14     

Keyword Args:
optim (torch.optim.Optimizer): optimizer to shard
group (group): torch.distributed group (default: group.WORLD)
bucket_cap (int): the size of the buffer used to batch the small parameter tensors,
Contributor

Is it because we don't need to overlap the comm with backward or the next forward so it is always faster to use larger buckets? And since params are now bucket views, we also don't need to worry about double the memory consumption? Hence we don't need to expose the bucket size knob to users?

Contributor Author

Yes, sorry, I should have added an explanation here: with these tensor views there's no need for bucketing per se. Memory-wise, the parameters are part of one optimal big bucket, but from the outside they're exposed as normal parameters (i.e., they're tensor views into the bucket). When the shards are sent (broadcast/all-gather), the "background" buffer is synced between the ranks, so there's one call per rank<>rank pair instead of a lot of small calls.

To add more context and compare with the DDP/ShardedDDP reduce buckets: it's not the same situation, because here we don't wipe anything over time and there's no latency cost in using buckets. The parameters are always 100% there (the update process is sharded, but not the params as such), so there's no drawback to maximal "tensor view buckets": they don't take any extra space and simply remove calls. In the reduce case with DDP the buckets are more of a tradeoff, for instance because you need to wait for them to fill up, so there it makes sense to expose the knob I think.

There's currently no overlap between the broadcast and the FW, and frankly it's killing performance; if you can think of a way to do that, it would be game changing. I spent some time on it, and outside of breaking the model into subparts (what FSDP does) I could not find anything. Within a node it's no problem, but between nodes with slow interconnect it's easily the bottleneck.
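The "parameters as bucket views" idea described above can be sketched in isolation. This is an illustrative minimal example, not the PR's actual code; the parameter shapes and single-bucket layout are assumptions:

```python
import torch

# Sketch of "parameters as bucket views": all trainable parameters share
# one contiguous buffer, so a single broadcast/all-gather per rank pair
# can sync every parameter at once, instead of one small call per tensor.
trainable_params = [torch.nn.Parameter(torch.randn(3, 4)),
                    torch.nn.Parameter(torch.randn(5))]

bucket = torch.empty(sum(p.numel() for p in trainable_params))

offset = 0
for p in trainable_params:
    end = offset + p.numel()
    bucket[offset:end].copy_(p.data.flatten())   # move the values into the bucket
    p.data = bucket[offset:end].view_as(p.data)  # expose a view back as the param
    offset = end

# Writing through the bucket (as a collective would) updates every param.
bucket.zero_()
assert all(p.data.abs().sum().item() == 0.0 for p in trainable_params)
```

Since the views share the bucket's storage, no unrolling step is needed after a collective completes.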

for param in trainable_params:
    offset_next = offset + param.numel()
    bucket[offset:offset_next].copy_(param.data.flatten())
    param.data = bucket[offset:offset_next].view_as(param.data)
Contributor

I wonder if we should still keep the bucket size knob and allow users to choose whether params should be bucket views or not, because changing param.data might break applications where the user created separate views on parameters. See the following code snippet: changing x.data won't update y.

>>> import torch
>>> x = torch.zeros(2, 2)
>>> y = x.view(1, 4)
>>> y
tensor([[0., 0., 0., 0.]])
>>> x
tensor([[0., 0.],
        [0., 0.]])
>>> x.data = torch.ones(2, 2)
>>> x
tensor([[1., 1.],
        [1., 1.]])
>>> y
tensor([[0., 0., 0., 0.]])

Does it make sense to keep the bucket size knob, where -1 means using bucket views with one bucket per device as implemented in this PR, and other bucket size values mean a dedicated bucket and no views?
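For contrast with the REPL session above, here is a small sketch (not from the PR) showing that an in-place `copy_` keeps existing views consistent, while rebinding `.data` leaves them behind:

```python
import torch

x = torch.zeros(2, 2)
y = x.view(1, 4)  # y shares x's storage

# In-place write: goes through the shared storage, so y sees the update.
x.copy_(torch.ones(2, 2))
assert y.sum().item() == 4.0

# Rebinding .data: x now points at new storage; y keeps the old values.
x.data = torch.zeros(2, 2)
assert y.sum().item() == 4.0  # y still holds the ones
assert x.sum().item() == 0.0
```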

Contributor Author

@blefaudeux blefaudeux Feb 26, 2021

Yes, makes sense, but in that case I would make it a binary choice: if you agree not to touch .data -> OK for tensor view buckets, else don't bucket, for safety? Without the tensor views the buckets are a lot of code for not-so-great effects (extra memory cost, even if temporary, plus the need to unroll the bucket after the fact). What do you think?

Contributor

One option might be we keep the PR as-is (always use bucket view) and clarify the side-effect in the documents, so that we don't surprise users. If we see requests to add the no-view option, we can add that later. If this looks OK to you, I will stamp the PR.

Contributor Author

I think your points were very valid, and adding an option to disable that is not a lot of work, so I would just do that. Adding the partial buckets with copies and unrolls is not worth it though, in my view, also because it complicates the code a fair bit, and people looking for the best perf all around will probably dig into more complex solutions.

Contributor

@mrshenli mrshenli left a comment

In general LGTM! The main concern is whether we should keep the bucket size flag and allow users to choose whether they want to use bucket views as param data.

optimizer to shard
group (group):
torch.distributed group (default: group.WORLD)
parameters_as_bucket_view (bool):
Contributor Author

adding this new setting, similar to DDP's 'gradient_as_bucket_view'
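As a usage sketch of the flag discussed here (hedged: the keyword names below follow the released torch.distributed.optim API, which may differ slightly from this PR's revision; the single-process gloo group is only there so the example is self-contained):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer

# Single-process group just so the optimizer can be constructed;
# real training would get this from the distributed launcher.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 8)
opt = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.SGD,
    parameters_as_bucket_view=True,  # the option discussed in this thread
    lr=0.01,
)

model(torch.randn(4, 8)).sum().backward()
opt.step()
dist.destroy_process_group()
```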


# Update the bucketing strategy accordingly
self._setup_bucket_strategy()
self._setup_flat_buffers()
Contributor

Does this also need to be guarded by parameters_as_bucket_view?

Contributor

@mrshenli mrshenli left a comment

LGTM! The only concern in this PR is whether we should guard all occurrences of _setup_flat_buffers with parameters_as_bucket_view.

A general topic for discussion: does it make sense to start with a small public API surface (i.e., mark APIs as private if they are not absolutely necessary)? Because it is a lot easier to add APIs than to retire them.

self._setup_flat_buffers()

@property
def per_device_params(self) -> Dict[torch.device, List[List[Parameter]]]:
Contributor

do we expect users to access this for any reason? can this be private?

Contributor Author

ShardedDDP typically uses that, but it can still do so even if we keep it private; this is Python after all :)

blefaudeux added a commit to blefaudeux/pytorch that referenced this pull request Mar 1, 2021
- bucket as tensor views, optional
- make a lot of attributes private
- minor unit test refactor
- adding coverage in the unit test for with and without bucket views
@blefaudeux
Contributor Author

Superseded with #52987, without ghstack

@blefaudeux blefaudeux closed this Mar 1, 2021
facebook-github-bot pushed a commit that referenced this pull request Mar 1, 2021
…terface (#52987)

Summary:
Updated version following  #52764 (including comments from Shen), but this one I expect to be able to land.
ZeroRedundancyOptimizer:
- bucket as tensor views, optional
- make a lot of attributes private
- minor unit test refactor
- adding coverage in the unit test for with and without bucket views

Pull Request resolved: #52987

Reviewed By: mrshenli

Differential Revision: D26728851

Pulled By: blefaudeux

fbshipit-source-id: f8c745966719c9076c20a554ef56198fb838856c
aocsa pushed a commit to Quansight/pytorch that referenced this pull request Mar 15, 2021
…terface (pytorch#52987)
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
…terface (pytorch#52987)
@facebook-github-bot facebook-github-bot deleted the gh/blefaudeux/3/head branch April 1, 2021 14:19

Labels

cla signed · oncall: distributed