[ZeroRedundancyOptimizer] Pack parameters in tensor views, handle training change #52764
Conversation
needs rebasing, doing that
Codecov Report

```diff
@@               Coverage Diff                @@
##           gh/blefaudeux/3/base   #52764      +/-   ##
========================================================
- Coverage                 80.46%   80.46%    -0.01%
========================================================
  Files                      1972     1972
  Lines                    216335   216351       +16
========================================================
+ Hits                     174079   174081        +2
- Misses                    42256    42270       +14
```
```
Keyword Args:
    optim (torch.nn.Optimizer): optimizer to shard
    group (group): torch.distributed group (default: group.WORLD)
    bucket_cap (int): the size of the buffer used to batch the small parameter tensors,
```
Is it because we don't need to overlap the comm with backward or the next forward, so it is always faster to use larger buckets? And since params are now bucket views, we also don't need to worry about doubling the memory consumption? Hence we don't need to expose the bucket size knob to users?
Yes, sorry, I should have added an explanation here. With these tensor views there's no need for bucketing, really: memory-wise the parameters are part of one optimal big bucket, but from the outside they're exposed as normal parameters (i.e. they're tensor views into the bucket). When the shards are sent (broadcast/all-gather), the backing buffer is synced between the ranks, so there's one call per rank pair instead of a lot of small calls.

To add more context and compare with the DDP/ShardedDDP reduce buckets: it's not the same situation, because here we don't wipe anything over time and there's no latency cost in using buckets. The parameters are always 100% there (the update process is sharded, but not the params as such), so there's no drawback in having maximal "tensor view buckets": they don't take any extra space, and they just remove calls. In the case of reduce with DDP, the buckets are more of a tradeoff, for instance because you need to wait for them to fill up, so there it makes sense to expose the knob, I think.

There's currently no overlap between the broadcast and the forward pass, and frankly it's killing performance; if you can think of a way to get that, it would be game changing. I spent some time on it, and outside of breaking the model into subparts (what FSDP does) I could not find anything. Within a node it's no problem, but between nodes with slow interconnect it's easily the bottleneck.
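To illustrate the call-count argument, here is a minimal runnable sketch; the single-process gloo group, the shapes, and the `bucket`/`params` names are all assumptions for illustration, not code from this PR. Broadcasting the flat bucket once syncs every parameter, because the parameters are views into it:

```python
import os
import torch
import torch.distributed as dist

# Single-process "gloo" group, only to make the snippet self-contained.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Parameters exposed as views into one flat buffer (cf. the packing loop below).
bucket = torch.randn(4 * 4 + 16 + 8)
shapes, params, offset = [(4, 4), (16,), (8,)], [], 0
for shape in shapes:
    n = torch.Size(shape).numel()
    params.append(bucket[offset:offset + n].view(shape))
    offset += n

# Naive sync: one collective call per parameter.
for p in params:
    dist.broadcast(p, src=0)

# Bucketed sync: one call per bucket; the parameter views pick it up for free.
dist.broadcast(bucket, src=0)

dist.destroy_process_group()
```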
```python
for param in trainable_params:
    offset_next = offset + param.numel()
    bucket[offset:offset_next].copy_(param.data.flatten())
    param.data = bucket[offset:offset_next].view_as(param.data)
```
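As a self-contained illustration of what this hunk achieves, here is a hedged sketch; the `trainable_params` and `bucket` setup and the trailing `offset` bookkeeping are stand-ins for surrounding code not shown in the diff:

```python
import torch

# Hypothetical stand-ins for the surrounding code not shown in the diff.
trainable_params = [torch.nn.Parameter(torch.randn(2, 3)) for _ in range(2)]
bucket = torch.empty(sum(p.numel() for p in trainable_params))

offset = 0
for param in trainable_params:
    offset_next = offset + param.numel()
    bucket[offset:offset_next].copy_(param.data.flatten())      # pack the value
    param.data = bucket[offset:offset_next].view_as(param.data)  # alias the bucket
    offset = offset_next

# The params now alias the bucket: mutating the bucket mutates the params.
bucket.zero_()
assert all(bool((p.data == 0).all()) for p in trainable_params)
```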
I wonder if we should still keep the bucket size knob and allow users to choose whether params should be bucket views or not, because changing `param.data` might break applications if the user has created separate views on the parameters. See the following code snippet: changing `x.data` won't update `y`.
```python
>>> import torch
>>> x = torch.zeros(2, 2)
>>> y = x.view(1, 4)
>>> y
tensor([[0., 0., 0., 0.]])
>>> x
tensor([[0., 0.],
        [0., 0.]])
>>> x.data = torch.ones(2, 2)
>>> x
tensor([[1., 1.],
        [1., 1.]])
>>> y
tensor([[0., 0., 0., 0.]])
```
Does it make sense to keep the bucket size knob? Then -1 would mean using bucket views with one bucket per device, as implemented in this PR, and other bucket size values would mean dedicated buckets and no views.
Yes, that makes sense, but in that case I would make it a binary choice: if you agree not to touch `.data`, then tensor view buckets are OK; otherwise, don't bucket at all, for safety? Without the tensor views, the buckets are a lot of code for not-so-great effects (extra memory cost, even if temporary, plus the need to unroll the bucket after the fact). What do you think?
One option might be to keep the PR as-is (always use bucket views) and clarify the side effect in the documentation, so that we don't surprise users. If we see requests for a no-view option, we can add that later. If this looks OK to you, I will stamp the PR.
I think your points were very valid, and adding an option to disable that is not a lot of work, so I would just do that. Adding the partial buckets with copies and unrolling is not worth it though, in my view, also because it complicates the code a fair bit, and people looking for the best performance all around will probably dig into more complex solutions.
In general LGTM! The main concern is whether we should keep the bucket size flag and allow users to choose whether they want to use bucket views as param data.
```
        optimizer to shard
    group (group):
        torch.distributed group (default: group.WORLD)
    parameters_as_bucket_view (bool):
```
Adding this new setting, similar to DDP's `gradient_as_bucket_view`.
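For reference, a hedged usage sketch of the new flag; the module path and the keyword names (`optim`, `parameters_as_bucket_view`) follow this diff's docstring and may differ from the API that eventually lands:

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer  # assumed module path

# Single-process group just to make the sketch self-contained.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(32, 32)
opt = ZeroRedundancyOptimizer(
    model.parameters(),
    optim=torch.optim.SGD,           # optimizer to shard (name per this docstring)
    parameters_as_bucket_view=True,  # params become views into a flat per-device bucket
    lr=0.01,
)
```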
```python
# Update the bucketing strategy accordingly
self._setup_bucket_strategy()
self._setup_flat_buffers()
```
Does this also need to be guarded by `parameters_as_bucket_view`?
LGTM! The only concern in this PR is whether we should guard all occurrences of `_setup_flat_buffers` with `parameters_as_bucket_view`.

A general topic for discussion: does it make sense to start with a small public API surface (i.e., mark APIs as private if they are not absolutely necessary)? It is a lot easier to add APIs than to retire them.
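A toy sketch of the guard under discussion, with stand-in methods rather than the real optimizer's code:

```python
class _ZeroOptimSketch:
    """Hypothetical stand-in to illustrate the suggested guard."""

    def __init__(self, parameters_as_bucket_view: bool) -> None:
        self.parameters_as_bucket_view = parameters_as_bucket_view

    def _setup_bucket_strategy(self) -> None:
        pass  # recompute which params belong to which rank and bucket

    def _setup_flat_buffers(self) -> None:
        pass  # (re)allocate the flat buffers and re-point params at them

    def _on_trainability_change(self) -> None:
        # Update the bucketing strategy accordingly
        self._setup_bucket_strategy()
        # Suggested guard: flat buffers only exist when params are bucket views
        if self.parameters_as_bucket_view:
            self._setup_flat_buffers()
```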
```python
self._setup_flat_buffers()
```
```python
@property
def per_device_params(self) -> Dict[torch.device, List[List[Parameter]]]:
```
Do we expect users to access this for any reason? Can this be private?
ShardedDDP typically uses that, but it can do so even if it's kept private; this is Python after all :)
- bucket as tensor views, optional
- make a lot of attributes private
- minor unit test refactor
- adding coverage in the unit test for with and without bucket views
Superseded by #52987, without ghstack.
…terface (#52987)

Summary: Updated version following #52764 (including comments from Shen), but this one I expect to be able to land.

ZeroRedundancyOptimizer:
- bucket as tensor views, optional
- make a lot of attributes private
- minor unit test refactor
- adding coverage in the unit test for with and without bucket views

Pull Request resolved: #52987
Reviewed By: mrshenli
Differential Revision: D26728851
Pulled By: blefaudeux
fbshipit-source-id: f8c745966719c9076c20a554ef56198fb838856c