
Fix C++ data parallel #20910

Closed
wants to merge 4 commits

Conversation

mrshenli
Contributor

Fixes #19540

CC @nmerrill67

C++ data parallel was using Module.clone() to create module replicas on every destination device. However, clone() does not set up gradient edges to point from replicas to the original module. As a result, gradients will not be aggregated into the original module. This commit fixes the problem by manually setting gradient edges from every parameter X in every replica to the same parameter X in the original module.

Failed Attempt

Initially I tried implementing what we did in replicate.py, which:

  1. creates module replicas;
  2. uses the Python Broadcast autograd function to broadcast every parameter in the original module to all destination devices;
  3. assigns the resulting broadcast parameters to the module replicas' _parameters dicts.

This works in Python because a derived module's member field (e.g., Linear.weight) and the base module's _parameters entry (e.g., Linear._parameters['weight']) reference the same parameter instance, so assigning to one applies to both. However, in C++, even though I can modify a Module's parameters_ values and gradient edges to point to the broadcast source, I cannot touch the weight and bias member fields in Linear, because replicate cannot (and should not) add special-case handlers for every module type. (See Linear.h and Linear.cpp.) Although the member field and the parameters_ entry initially point to the same TensorImpl instance, after assigning to Module.parameters_['weight'], it will be different from Linear.weight.
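A minimal sketch of that aliasing behavior, using plain tensors to stand in for Linear.weight and Module.parameters_['weight'] (names and shapes below are illustrative, not the real module internals):

    #include <torch/torch.h>
    #include <iostream>

    int main() {
      // Two handles to one TensorImpl, like Linear::weight and the matching
      // parameters_ entry right after register_parameter().
      torch::Tensor member = torch::randn({4, 3});
      torch::Tensor dict_entry = member;  // shares the same TensorImpl

      // Rebinding the "dict" handle (as the replicate.py approach would do with
      // the broadcast result) leaves the member handle on the old TensorImpl.
      dict_entry = torch::zeros({4, 3});
      std::cout << std::boolalpha
                << (member.unsafeGetTensorImpl() == dict_entry.unsafeGetTensorImpl())
                << std::endl;  // prints false
    }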

Solution Options

@gchanan and I had several discussions on this issue and came up with two solutions.

Option One [implemented in this PR]

Replicate the module in two steps:

  1. Call Module.clone() to create a module replica on every destination device.
  2. Manually set gradient edges from every parameter in every replica to the same parameter in the original module (see the sketch after this section).
  • Pro: Does not require changing any existing module, and is relatively easy to implement.
  • Con: It is a little hackish.

Option Two

Implement a Replicatable class (similar to Cloneable), and make it a friend class of Module. For more details see Note [Replicating Modules] in the code change.

  • Pro: Maybe this aligns more with our existing approach implemented in Cloneable?
  • Con: Requires changes to every existing module.

I am inclined to go with option one, because replicate will only be used for data parallel. It feels like overkill to change all existing module implementations just to satisfy a data parallel requirement.
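As a rough illustration of Option One, here is a heavily hedged sketch of linking a single replica parameter back to its original, assuming the internal helpers torch::autograd::impl::gradient_edge and torch::autograd::impl::set_gradient_edge from torch/csrc/autograd/variable.h are available; the PR's actual implementation routes gradients through a ReduceAdd node rather than linking parameters one-to-one:

    #include <torch/torch.h>
    #include <torch/csrc/autograd/variable.h>

    // Point the replica parameter's gradient edge at the original parameter's
    // gradient edge (for a leaf, that is its AccumulateGrad node), so a backward
    // pass through the replica accumulates into the original module's parameter.
    void link_replica_to_original(torch::Tensor& replica_param,
                                  const torch::Tensor& original_param) {
      torch::autograd::impl::set_gradient_edge(
          replica_param,
          torch::autograd::impl::gradient_edge(original_param));
    }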

Before this commit, C++ data parallel uses Module.clone() to create
module replicas on every destination device. However, clone() does not
set up gradient edges to point from replicas to the original module.
As a result, the gradient will not be aggregated into the original
module. This commit fixes the bug by manually setting gradient edges
from every parameter X in every replica to the same parameter X in
the original module.
@@ -17,6 +20,13 @@
using namespace torch::autograd;
using namespace torch::nn;

template <typename T>
bool almost_equal(torch::Tensor left, torch::Tensor right, T tolerance = 1e-4) {
Contributor

You can consider using allclose for this if it helps.
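For reference, a possible drop-in using torch::allclose; the tolerance handling here is an assumption that mirrors the 1e-4 default above:

    #include <torch/torch.h>

    // Sketch: compare with an absolute tolerance, leaving the relative tolerance off.
    bool almost_equal(const torch::Tensor& left, const torch::Tensor& right,
                      double tolerance = 1e-4) {
      return torch::allclose(left, right, /*rtol=*/0.0, /*atol=*/tolerance);
    }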

@ezyang
Contributor

ezyang commented May 24, 2019

Just replicating here what I told @mrshenli in person: if the problem is that the two copies of Tensor in Module can get out of sync in C++ but not in Python, the correct solution is to make sure they stay in sync, one way or another.

Taking a closer look now.

@ezyang
Contributor

ezyang commented May 24, 2019

Some clarification about how replicate.py works:

This works in Python because a derived module's member field (e.g., Linear.weight) and the base module's _parameters entry (e.g., Linear._parameters['weight']) reference the same parameter instance, so assigning to one applies to both.

This is because there isn't actually a Linear.weight. It is provided by the __getattr__ magic method on Module:

    def __getattr__(self, name):
        if '_parameters' in self.__dict__:
            _parameters = self.__dict__['_parameters']
            if name in _parameters:
                return _parameters[name]

@ezyang
Contributor

ezyang commented May 24, 2019

After reminding myself about how parameter registration works in C++, I now agree that it is impossible to do assignment after clone. So I think that fixing up the cloned tensor directly is the right way to go.

@ssnl
Collaborator

ssnl commented May 27, 2019

This is completely unrelated to this PR, but out of curiosity I checked the C++ DP code. It seems that C++ DP is quite a bit slower than the Python one because module.clone() does .clone() before .to(*device):

std::shared_ptr<Module> clone(
    const optional<Device>& device = nullopt) const override {
  NoGradGuard no_grad;
  const auto& self = static_cast<const Derived&>(*this);
  auto copy = std::make_shared<Derived>(self);
  copy->parameters_.clear();
  copy->buffers_.clear();
  copy->children_.clear();
  copy->reset();
  TORCH_CHECK(
      copy->parameters_.size() == parameters_.size(),
      "The cloned module does not have the same number of "
      "parameters as the original module after calling reset(). "
      "Are you sure you called register_parameter() inside reset() "
      "and not the constructor?");
  for (const auto& parameter : parameters_) {
    auto data = autograd::Variable(*parameter).clone();
    copy->parameters_[parameter.key()].set_data(
        device ? data.to(*device) : data);
  }
  TORCH_CHECK(
      copy->buffers_.size() == buffers_.size(),
      "The cloned module does not have the same number of "
      "buffers as the original module after calling reset(). "
      "Are you sure you called register_buffer() inside reset() "
      "and not the constructor?");
  for (const auto& buffer : buffers_) {
    auto data = autograd::Variable(*buffer).clone();
    copy->buffers_[buffer.key()].set_data(device ? data.to(*device) : data);
  }
  TORCH_CHECK(
      copy->children_.size() == children_.size(),
      "The cloned module does not have the same number of "
      "child modules as the original module after calling reset(). "
      "Are you sure you called register_module() inside reset() "
      "and not the constructor?");
  for (const auto& child : children_) {
    copy->children_[child.key()]->clone_(*child.value(), device);
  }
  return copy;
}

Moreover, it is also incorrect with BNs (and other things that rely on in-place changes to buffers/parameters), because module.clone() always deep-copies, i.e., the buffers and parameters are always cloned for each replica, even if the source module lives on the correct device. As far as I can tell, this makes it impossible to perform BN correctly.

Edit: typo related -> unrelated
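A small sketch of the buffer concern, with plain tensors standing in for a batch norm running_mean (no real module involved):

    #include <torch/torch.h>

    int main() {
      torch::Tensor running_mean = torch::zeros({8});

      // What Module::clone() gives each replica: an independent deep copy.
      torch::Tensor replica_copy = running_mean.clone();
      // A BN forward pass updates running stats in place on the replica...
      replica_copy.add_(1.0);
      // ...but the original buffer never sees the update: running_mean is
      // still all zeros here.

      // If the replica instead shared the buffer (two handles, one TensorImpl),
      // the in-place update would be visible to the original module.
      torch::Tensor shared_handle = running_mean;
      shared_handle.add_(1.0);  // now running_mean is all ones as well
    }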

@mrshenli
Contributor Author

@ssnl thanks for the catch. I added #20995 to remove the unnecessary clone. For BN use cases, let me revise this PR and add a test for it.

namespace {

// Note [Replicating Modules]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~
Contributor

Thank you for a nice note, it is much appreciated!

"All inputs of ReduceAdd must have the same dtype, but got ",
input.dtype(), " and ", inputs[0].dtype());

// TODO: use nccl reduce
Contributor

Does this TODO mean that C++ data parallel is less efficient than Python data parallel at the moment? Seems worth a big warning in that case.

Contributor Author

Yes, the current version would be slower. Let me implement nccl reduce and remove the TODO here.
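For context, a minimal sketch of what a ReduceAdd-style backward node does, assuming the internal torch::autograd::Node API from torch/csrc/autograd/function.h; this is not the PR's exact node, and the planned nccl-based version would replace the naive per-tensor copies and adds:

    #include <torch/torch.h>
    #include <torch/csrc/autograd/function.h>

    // Backward-only node: sums the gradients arriving from all replicas onto
    // the destination (original module's) device.
    struct ReduceAddSketch : torch::autograd::Node {
      explicit ReduceAddSketch(torch::Device destination)
          : destination_(destination) {}

      torch::autograd::variable_list apply(
          torch::autograd::variable_list&& grads) override {
        auto result = grads[0].to(destination_);
        for (size_t i = 1; i < grads.size(); ++i) {
          result = result + grads[i].to(destination_);
        }
        return {result};
      }

      torch::Device destination_;
    };

In the graph the PR sets up, each replica parameter's gradient edge would point at one input of such a node, with the node's output feeding the original parameter.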

// Configure gradient edges to point from replica parameters to original
// module parameters. See [Replicating Modules]
std::vector<std::shared_ptr<Module>> untyped_replicas(
replicas.begin(), replicas.end());
Contributor

I don't see why you have to allocate a vector here. Can't you just pass the iterators to the function?

Contributor

I guess this is because of the way you handle children: you concatenate the replicas of all sibling children (not just the current child), and then you don't have iterators. But that seems like a pretty unusual way to write the recursion: normally you'd just recurse on child modules. Is there a reason it was done this way?

Contributor Author

I remember I ran into problems when collecting child modules on line 126. Let me see if I can drop this untyped_replicas.

Contributor Author

I think I can recurse on child modules by creating a map from *parameter to grad_fn, and retrieve that grad_fn in every recursion.

//
// ReduceAdd can ONLY be used during the backward pass of data parallel. Forward
// pass cannot use this function as it does not set up gradient function and
// history at all. Do NOT try to use ReduceAdd for any other purposes.
Contributor

Do we support double backwards through replicate in Python? I don't see any indication in issues that we don't...

Contributor Author

Haven't tried yet. Let me add a test.

@ezyang
Contributor

ezyang commented May 29, 2019

If you want to land this to fix an urgent bug, I suppose I can be convinced to land it in its current state (I agree it's an improvement), but what I am confused about is why you had to implement ReduceAdd from scratch. Ostensibly, the Python implementation of DataParallel sets up some autograd backwards function, and to me, it seems like you ought to use whatever that backwards function was, and not make up a new one (this would also fix your TODO, if we are using nccl reduce in Python).

@ezyang
Contributor

ezyang commented May 29, 2019 via email

@mrshenli
Contributor Author

mrshenli commented May 29, 2019

@ezyang

but what I am confused about is why you had to implement ReduceAdd from scratch. Ostensibly, the Python implementation of DataParallel sets up some autograd backwards function, and to me, it seems like you ought to use whatever that backwards function was, and not make up a new one (this would also fix your TODO, if we are using nccl reduce in Python).

Python's DataParallel uses Broadcast and ReduceAddCoalesced functions which are implemented in Python. How about letting me create a new PR to move these two functions to C++?

@ezyang
Contributor

ezyang commented May 29, 2019

Yes, that seems like the right way to do it. If you want to apply this hotfix on master, I'll let it through, but ideally this should just all live in C++.

@mrshenli
Contributor Author

If you want to apply this hotfix on master, I'll let it through, but ideally this should just all live in C++.

Thanks. Let me first address the smaller revision requests and create issues to track the bigger ones. Will do that later today or tomorrow; I need to finish up something else now.

@facebook-github-bot left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mrshenli
Contributor Author

The latest commit addresses @vishwakftw's comment on using allclose and @ezyang's comment on removing untyped_replicas. I didn't modify the recursion strategy, because on second thought I believe the current one is cleaner and easier to understand: it recurses based on the original module's structure instead of creating a big map from all params/buffers to grad_fns upfront and then recursing on every original-replica pair. The current solution adds a typename ModuleType to replicate_grad_edges; the template only matters in the very first invocation, and ModuleType becomes Module in all subsequent invocations.
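A hedged sketch of that recursion shape (not the PR's actual code; it assumes the internal gradient-edge helpers from torch/csrc/autograd/variable.h and ignores the ReduceAdd indirection):

    #include <torch/torch.h>
    #include <torch/csrc/autograd/variable.h>

    // The ModuleType template parameter only matters for the very first call;
    // every recursive call on children sees plain torch::nn::Module.
    template <typename ModuleType>
    void replicate_grad_edges_sketch(
        const std::shared_ptr<ModuleType>& module,
        const std::vector<std::shared_ptr<ModuleType>>& replicas) {
      // Link every replica parameter's gradient edge back to the original.
      for (const auto& param : module->named_parameters(/*recurse=*/false)) {
        for (const auto& replica : replicas) {
          torch::autograd::impl::set_gradient_edge(
              replica->named_parameters(/*recurse=*/false)[param.key()],
              torch::autograd::impl::gradient_edge(param.value()));
        }
      }
      // Recurse following the original module's structure.
      for (const auto& child : module->named_children()) {
        std::vector<std::shared_ptr<torch::nn::Module>> child_replicas;
        for (const auto& replica : replicas) {
          child_replicas.push_back(replica->named_children()[child.key()]);
        }
        replicate_grad_edges_sketch(child.value(), child_replicas);
      }
    }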

TODOs are in #21144.

@ezyang is the current solution acceptable?

@ezyang
Contributor

ezyang commented Jun 5, 2019

@pytorchbot rebase this please

@ezyang left a comment

Let's go ahead and land this for now. Looking forward to the follow-up where we line up the Python and C++ versions exactly.

@facebook-github-bot left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mrshenli
Contributor Author

mrshenli commented Jun 5, 2019

@pytorchbot rebase this please

@facebook-github-bot left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mrshenli
Contributor Author

mrshenli commented Jun 6, 2019

The failed test passed on rerun.

@facebook-github-bot
Contributor

@mrshenli merged this pull request in b7b6b61.

facebook-github-bot pushed a commit that referenced this pull request Jul 26, 2019
Summary:
As pointed out by SsnL in #20910, when the clone destination is different from the module's device,
`Cloneable` currently calls `clone()` and then `to()` on every parameter and buffer, where the first clone is unnecessary.
Pull Request resolved: #20995

Differential Revision: D15517353

Pulled By: mrshenli

fbshipit-source-id: 6b6dc01560540a63845663f863dea0a948021fa5
@yulinhuyang

Hi mrshenli!
I use my module like this in C++:

struct ConvModel : torch::nn::Module
{
  // query_weight is assumed to be set elsewhere (e.g., loaded before forward()).
  torch::Tensor query_weight;

  torch::Tensor forward(at::Tensor gallery)
  {
    auto result = torch::nn::functional::conv1d(gallery, query_weight);
    return result;
  }
};

and I use torch::nn::parallel::replicate,

but I get this error:

clone() has not been implemented for ConvModel. Subclass torch::nn::Cloneable instead of torch::nn::Module to inherit the ability to clone

If I try to inherit from torch::nn::Cloneable, I get another error. Can you give me some suggestions? @mrshenli
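For what it's worth, a hedged sketch of what that error message asks for: derive from torch::nn::Cloneable<ConvModel> and register the weight inside reset() so clone()/replicate() can rebuild it per replica (the weight shape here is a made-up example):

    #include <torch/torch.h>

    struct ConvModel : torch::nn::Cloneable<ConvModel> {
      ConvModel() {
        reset();
      }

      // Cloneable::clone() calls reset() on the copy, so all parameters must be
      // registered here rather than in the constructor body.
      void reset() override {
        query_weight = register_parameter("query_weight", torch::randn({8, 4, 3}));
      }

      torch::Tensor forward(torch::Tensor gallery) {
        return torch::nn::functional::conv1d(gallery, query_weight);
      }

      torch::Tensor query_weight;
    };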

Labels: Merged, module: cpp (Related to C++ API)
Successfully merging this pull request may close these issues: C++ Frontend data_parallel Does Not Update Weights

8 participants