Change AccumulateGrad to yield .grads that match weights' memory layout #34904
Conversation
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Just adding the BC-breaking tag to make sure it is tracked.
DO NOT MERGE before the corresponding distributed fix, this will break DP/DDP.
💊 CI failures summary and remediations — as of commit 679480f (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚 (This comment was automatically generated by Dr. CI and has been revised 125 times.)
Requesting changes following the comment above.
@ngimel do you have a reference for these other changes? And more information on how this would break DP/DDP?
@albanD DDP changes are being added to this PR currently.
OK! Good to know.
For DDP allreduces as currently written, the fix must be in DDP's Reducer, because Reducer performs its own bucket flattening/unflattening (and those are the points where my fixes are applied). Interestingly, to perform the param broadcast during construction, DDP does not use its own bucketing logic. Instead it calls into BroadcastWork, a handy self-contained broadcaster in torch/csrc/distributed/c10d/comm.cpp. BroadcastWork takes a list of tensors, creates flat buckets, broadcasts the buckets, and copies the broadcasted data back into the original tensors. It handles NHWC-contiguous tensors just fine. Creating an analogous AllreduceWork class for Reducer to call is worth considering, but probably out of scope for today.
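For readers following along, here is a minimal Python-level sketch of that flatten → broadcast → unflatten pattern using public `torch.distributed` APIs. It illustrates the idea only, not the actual C++ `BroadcastWork`; the helper name `broadcast_params` is made up for this example, and an initialized process group is assumed.

```python
import torch
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

def broadcast_params(tensors, src=0):
    # Flatten into one contiguous bucket, broadcast the bucket, then copy the
    # broadcasted values back into the original tensors (whatever their strides).
    flat = _flatten_dense_tensors(tensors)
    dist.broadcast(flat, src=src)
    for t, synced in zip(tensors, _unflatten_dense_tensors(flat, tensors)):
        t.copy_(synced)  # elementwise copy, so NHWC destinations still get correct values
```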
…P gradient and DDP changes to see what CI thinks of raw collective changes alone (compare 6dfb14f to 68ef85e)
LGTM :)
```cpp
if (!global_unused) {
  if (!grad.defined()) {
    grad = at::empty(bucket_view.sizes(), bucket_view.options());
    // Creates grad according to the "Gradient Layout Contract"
```
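As a rough Python-level rendering of what "create the grad according to the Gradient Layout Contract" means (illustrative only; the actual logic lives in the C++ autograd/DDP code, and `new_grad_like` is a made-up helper name): if the param is non-overlapping and dense, give the grad the param's strides, otherwise fall back to rowmajor contiguous. `torch.preserve_format` already encodes that rule.

```python
import torch

def new_grad_like(param: torch.Tensor) -> torch.Tensor:
    # preserve_format keeps param's strides when param is non-overlapping and
    # dense, and falls back to rowmajor contiguous otherwise.
    return torch.zeros_like(param, memory_format=torch.preserve_format)

p = torch.randn(8, 3, 4, 4).contiguous(memory_format=torch.channels_last)
g = new_grad_like(p)
assert g.stride() == p.stride()  # grad layout matches the (channels-last) param
```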
@mrshenli are the diffs here ok? (again, grad might not be rowmajor contiguous here). Also under what circumstances does the grad need to be "written back" and what does "written back" entail? The grad is modified in place, so references to it elsewhere don't need to be explicitly modified (although I guess copies in other processes would be)?
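A tiny self-contained illustration of the in-place point above, within a single process (cross-process copies are a separate question, as noted):

```python
import torch

p = torch.randn(3, requires_grad=True)
p.sum().backward()

alias = p.grad         # some other reference held elsewhere
p.grad.mul_(0.5)       # in-place modification of the grad
assert torch.equal(alias, p.grad)             # the alias sees the change
assert alias.data_ptr() == p.grad.data_ptr()  # same underlying storage
```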
Also, why isn't there an equivalent `runGradCallbackForVariable` in `finalize_bucket_sparse`? Was that an oversight in the `dist_autograd` PR, or was that intentionally omitted?
> are the diffs here ok?

This looks OK to me.

> Also under what circumstances does the grad need to be "written back" and what does "written back" entail?

If you are referring to the "written back" comment below, I think that comment is inaccurate. IIUC, it should return true when we are certain that the grad value is final, and we can now launch cbs on it.

> Also, why isn't there an equivalent runGradCallbackForVariable in finalize_bucket_sparse? Was that an oversight in the dist_autograd PR, or was that intentionally omitted?

I guess this is because #37998 didn't mean to support sparse tensors for RPC + DDP. @pritamdamania87 if this is the case, should we also remove the `runGradCallbackForVariable` in `mark_variable_ready_sparse` to make it consistent?
I think this was probably an oversight in #37998; we should probably call `runGradCallbackForVariable` in `finalize_bucket_sparse` as well. I'll fix this in a separate diff.
SGTM. I wrote my other thoughts about sparse grads in the other thread https://github.com/pytorch/pytorch/pull/34904/files#r439117597
```cpp
return false;
replica.contents.div_(process_group_->getSize());
// The grad is modified in place and needs to be written back.
return true;
```
@mrshenli are the diffs here ok? In `mark_variable_ready_dense`, `mul_out` averages the pre-allreduce gradients (which doesn't affect grad itself, because the result is written to the allreduce bucket). For `mark_variable_ready_sparse` to mirror that control flow, I added `div_` here. For sparse gradients, `replica.contents = grad`, though, so grad itself IS affected in place. Therefore, I also changed `return false` to `return true`. Admittedly I don't fully understand the implications of doing so.
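A small demonstration of that aliasing point, sketched with public APIs (the divisor 4 is just an assumed world size for illustration): for a sparse grad, dividing "the bucket contents" in place is dividing `param.grad` itself.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 4, sparse=True)
emb(torch.tensor([1, 2, 3])).sum().backward()   # produces a sparse weight grad

before = emb.weight.grad.to_dense().clone()
contents = emb.weight.grad   # stand-in for replica.contents aliasing the grad
contents.div_(4)             # in-place predivision by an assumed world_size of 4

after = emb.weight.grad.to_dense()
assert torch.allclose(after, before / 4)   # .grad itself was scaled in place
```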
Maybe the `div_` + `return true` here actually fixes a bug with the `dist_autograd` changes. In current master, the bucket predivision is carried out in `mark_variable_ready`. For sparse gradients, the predivision alters the grad in place, but there's no mention at that point of `runGradCallbackForVariable`, so that change isn't communicated anywhere. Again, admittedly I have no idea what's going on.
IIUC, the criterion for returning true is that we want to trigger cbs on that grad. So the prior comments (return true when the grad is modified) might not be accurate.
We want to trigger those cbs when the grad is ready, i.e., after the allreduce sync done by DDP. So I think we should still return false here?
@pritamdamania87 please correct me if I am wrong.
I think for consistency with local autograd it's better to return true here (since for local autograd `.grad` would change, so for dist_autograd the grads in the map should change too). Although, is it possible to do the division at the end, in `finalize_backward`? That way we modify the grad in only one place.
Wait a minute: since for sparse grads the grad is the bucket, it makes no sense for anything external (local or remote) to look at its values between when the allreduce hook triggers and when the allreduce is finalized. From the perspective of external code in this thread, the allreduce hook carries out a `div_` and immediately kicks off an allreduce which spends a while updating grad's values in place. Any access to the grad before `finalize_backward` is racy.
We could make `div_` an out-of-place `div` for sparse gradients. Then `variable.grad` itself will remain untouched until `finalize_bucket_sparse` (which makes sparse grad control flow more consistent with dense grads, which use separate memory for their allreduce buckets). However, as I said earlier, `finalize_bucket_sparse` makes no reference to `runGradCallbackForVariable` anyway, so `dist_autograd`'s "map" is never informed even after the allreduce completes and grad is safe to access. Do we need to add `runGradCallbackForVariable` to `finalize_bucket_sparse`?
> is it possible to do the division at the end, in finalize_backward?

That's a reasonable idea but I'm wary of it. For some networks that allreduce FP16 gradients at scale, we've noticed that post-allreduce averaging caused nonconvergence, but pre-allreduce averaging was fine. I'm not aware off the top of my head of any cases where we observed the reverse, so I prefer the predivision that DDP implements currently.
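For concreteness, here is what the two orderings look like with public `torch.distributed` calls (a sketch assuming an initialized process group; it illustrates the ordering only and makes no claim about any particular network's convergence). One plausible mechanism for why predivision behaves better with FP16 is that dividing first keeps the summed magnitudes smaller, though the comment above only reports the empirical observation.

```python
import torch
import torch.distributed as dist

def average_predivide_(grad: torch.Tensor) -> None:
    # Divide first, then sum across ranks: partial values shrink before the
    # reduction, so intermediate sums stay smaller (relevant for FP16 grads).
    grad.div_(dist.get_world_size())
    dist.all_reduce(grad)

def average_postdivide_(grad: torch.Tensor) -> None:
    # Sum first, then divide: the intermediate sum can grow by a factor of
    # world_size before the division.
    dist.all_reduce(grad)
    grad.div_(dist.get_world_size())
```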
Yes, it is true that nothing should access the grad between the allreduce hooks and `finalize_backward`. Although, my point was mostly from a consistency standpoint. In the local autograd engine case, we are modifying the grad here, and technically anyone looking at the variable's `.grad` after the `div_` operation would see a change. So if we're updating `.grad` here, we should update it in distributed autograd's map as well. The other option is to update neither `.grad` nor the dist autograd map. Although, I'd prefer the in-place `div_` to avoid creating a new tensor.

> Do we need to add runGradCallbackForVariable to finalize_bucket_sparse?

Yes, we do, and this was a bug in #37998.
```cpp
auto wrapped = c10::scalar_to_tensor(double(1.)/process_group_->getSize());
wrapped.unsafeGetTensorImpl()->set_wrapped_number(true);
// Divides while copying into the bucket view.
at::native::mul_out(bucket_view, grad, wrapped);
```
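At the Python level, the divide-while-copying idea corresponds to something like the following (illustrative only; the actual change is the C++ `mul_out` call above, and `world_size`, `grad`, and `bucket_view` here are stand-in values):

```python
import torch

world_size = 4                           # assumed for illustration
grad = torch.randn(8, 3, 4, 4).contiguous(memory_format=torch.channels_last)
bucket_view = torch.empty(grad.shape)    # stand-in for a flat-bucket slice

# One fused op: scale by 1/world_size while writing into the bucket view,
# instead of copying first and dividing separately. grad itself is unchanged.
torch.mul(grad, 1.0 / world_size, out=bucket_view)
```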
@mrshenli are the diffs here ok? grad is now divided while being copied into bucket_view, but it's still not being modified, so I think it's ok.
A more subtle thing is that with this PR's other diffs, grad may no longer be rowmajor contiguous. I'm not sure how that affects `dist_autograd` in general if the callback ends up routing into the `dist_autograd` context.
This LGTM.
Curious, what difference does `set_wrapped_number` make? Is it that `mul_out` only supports scalar tensors, or is there a shortcut for this?
Regarding dist autograd: since DDP + RPC is a very new feature that will be released as beta, it still needs more tests and apps to verify it and try it out. I think even if it does not work with this PR, it should not block us from landing this. We can fix that in follow-up PRs.
Hold on. Let me think again about dist autograd.
`at::native::mul_out` only has an overload that accepts three Tensor args, so I need to convert the scalar to a tensor.
I'm not sure exactly why `set_wrapped_number` in particular is needed. I don't think it's needed for the TensorIterator kernels inside `mul_out`, because the lambda capturing of CPU scalars for GPU kernels is based on `TensorIterator::is_cpu_scalar`, which does not rely on `Tensor::is_wrapped_number`. I included `set_wrapped_number` because it seems to be the standard practice when converting scalars to Tensors (e.g. in BinaryOps.cpp and python_arg_parser.cpp).
@albanD has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@jeffdaily I'm seeing timeouts for my DDP tests on ROCm. Is it expected that DDP tests with … pass?
I'm decorating these with skip_if_rocm in the meantime, but if it's expected that they pass we should figure out why they failed.
This is what I see in the CI log:
It looks to me like MIOpen has encountered kernels that it hasn't compiled yet. The error creating the MIOpen lock file indicates that two or more processes raced to compile the same kernel (the race was resolved when one couldn't acquire the lock). However, this has likely led to a timeout: one process got the MIOpen lock, compiled the kernel, then released the lock to the other process, which may or may not re-compile the same kernel (depending on timing). One solution might be to increase the timeout, which might get the test to pass for ROCm. Another solution will be available starting in ROCm 3.5, where we can install precompiled kernels as a separate deb package, but that isn't available to you yet. Skipping them for now to unblock your work is understandable for the short term. This wouldn't be the first time we had a timing issue caused by first-use compilation of MIOpen kernels.
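The race being described follows a familiar pattern: processes serialize first-use compilation on a lock file, and the loser waits for (or redoes) the work, which can blow past test timeouts. A rough sketch of that pattern in Python, purely to illustrate the timing issue (this is not MIOpen's actual implementation; `compile_fn` and the cache path are hypothetical):

```python
import fcntl
import os

def get_compiled_kernel(cache_path: str, compile_fn) -> str:
    # Two processes wanting the same kernel serialize on a lock file. The loser
    # blocks here while the winner compiles the kernel.
    lock_path = cache_path + ".lock"
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if not os.path.exists(cache_path):   # loser may find the winner's output
                compile_fn(cache_path)           # slow first-use compile
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return cache_path
```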
@jeffdaily Thanks for the quick reply, just wanted to make sure you were aware.
@albanD is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…memory layout (#40129)

Summary: #34904 was reverted because it had a misconfigured 4 GPU test that for some reason wasn't caught by external CI ([example failure](https://app.circleci.com/pipelines/github/pytorch/pytorch/181719/workflows/cfb37cd9-9a0c-4738-898b-d683934cd308/jobs/5868948/steps)). This PR reverts the revert, and adds diffs that should repair the misconfigured test.

Pull Request resolved: #40129
Differential Revision: D22079377
Pulled By: albanD
fbshipit-source-id: 9bd2b7e0c34fdaf887497b52037cfe82cba709c1
…yout (pytorch#34904)

Summary:

Currently, whether `AccumulateGrad` [steals](https://github.com/pytorch/pytorch/blob/67cb0184625ca3c30f44e02cc21ebfa7382c75c5/torch/csrc/autograd/functions/accumulate_grad.h#L42) or [clones](https://github.com/pytorch/pytorch/blob/67cb0184625ca3c30f44e02cc21ebfa7382c75c5/torch/csrc/autograd/functions/accumulate_grad.h#L80) an incoming gradient, the gradient ends up rowmajor contiguous, regardless of its param's layout. If the param's layout is channels last, or otherwise not rowmajor contiguous, later kernels that apply gradients to params are forced into an uncoalesced memory access pattern for either the param or the gradient. This may not sound like a big deal, but for any binary op on large tensors it's a >3X increase in gmem traffic => 3X slowdown.

The present PR changes `AccumulateGrad` to prefer, where possible, stashing gradients that match their params' layouts (the ["Gradient Layout Contract"](https://github.com/pytorch/pytorch/pull/34904/files#diff-ef1a56d24f66b280dcdb401502d6a796R29-R38)).

Allowing `AccumulateGrad` to stash non-rowmajor-contiguous grads means DDP allreduces and DP reduces must allow non-rowmajor-contiguous grads. This PR extends DDP and DP to allow gradients with non-rowmajor-contiguous strides as long as their layout is nonoverlapping and dense.

For good measure, I include changes that allow all five nccl primitives (allreduce, reduce, broadcast, allgather, reducescatter) to act on non-rowmajor-contiguous tensors (again, as long as each input's layout is nonoverlapping and dense, and as long as all tensors participating in a given collective have the same layout). The primitive comm changes aren't necessary to enable the DDP changes, but I wasn't sure this would end up true until I had written both sets of changes. I think primitive comm enablement is reasonable to keep in the PR, especially since the code for it is simple.

Channels last params will be a major beneficiary of this PR, but I don't see it as a channels-last-specific fix. The spirit is layout matching in general:

- Grads should be stashed with memory layouts matching their params.
- Src and dst tensors on opposite ends of collectives should have matching dense layouts.

This PR also updates the autograd docs to describe the potential BC-breaking changes below.

## BC notes

@ngimel @albanD @gchanan

#### BC-breaking

In the common case where the user lets AccumulateGrad decide grad layouts, strides for grads of dense but non-rowmajor-contiguous params will change. Any user code that was accustomed to `view(-1)`ing these grads will break.

Also, the circumstances under which a grad can be stolen directly from the backward function that created it, as opposed to deep-copied by AccumulateGrad, have changed. In most cases we expect silent performance improvement, because we expect channels-last-aware backward kernels will create channels last gradients for channels last params. Now those can be stolen, whereas before this PR they were cloned and made rowmajor contiguous. IMO this is a mild BC breakage. Param backward hooks still see grads come in with whatever format the backward kernel gave them. The only BC breakage potential I see is if user code relies somehow on a grad in a hook having or not having the same deep memory as the eventual `param.grad`. Any such users hopefully know they're off the edge of the map and understand how to update their expectations.

#### BC escape hatches

At @albanD's recommendation, this PR's changes to AccumulateGrad do not alter the pre-PR code's decisions about whether grad is accumulated in or out of place. Accumulations of new grads onto an existing `.grad` attribute were (usually) in-place before this PR and remain in-place after this PR, keeping the existing `.grad`'s layout. After this PR, if the user wants to force accumulation into a grad with a particular layout, they can preset `param.grad` to a zeroed tensor with the desired strides or call `grad.contiguous(desired format)`. This likely won't be as performant as letting AccumulateGrad establish grad layouts by cloning or stealing grads with contract-compliant strides, but at least users have a control point.

One limitation (present before this PR and unchanged by this PR): presetting `param.grad` does not ensure in-place accumulation all the time. For example, if `create_graph=True`, or if incoming `new_grad` is dense and existing `variable_grad` is sparse, accumulation occurs out of place, and the out-of-place result may not match the existing grad's strides.

----------------------------

I also noticed some potential DDP improvements that I considered out of scope but want to mention for visibility:

1. Make sure Reducer's ops sync with AccumulateGrad streams.
2. ~~To reduce CPU overhead and incur fewer kernel launches, lazily create flat `contents` tensors by a single `cat` kernel only when a bucket is full, instead of `copy_`ing grads into `contents` individually as soon as they are received.~~ The PR includes a [minor change](https://github.com/pytorch/pytorch/pull/34904/files#diff-c269190a925a4b0df49eda8a8f6c5bd3R312-R315) to divide grads while copying them into flat buffers, instead of copying them in, then dividing separately. Without cat+div fusion, div-while-copying is the best we can do.
3. pytorch#38942 (`local_used_maps_dev_` when `find_unused_param=False`).

Pull Request resolved: pytorch#34904
Differential Revision: D20496044
Pulled By: albanD
fbshipit-source-id: 248d680f4b1bf77b0a986451844ec6e254469217
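To make the "BC escape hatches" paragraph above concrete, here is an illustrative user-level snippet (not code from the PR): preset `param.grad` to a zeroed tensor with the desired strides, and in the common in-place accumulation case that layout is kept.

```python
import torch

param = torch.randn(8, 3, 4, 4, requires_grad=True)

# Escape hatch: preset .grad with the desired (channels-last) strides so that
# in-place accumulation keeps that layout.
param.grad = torch.zeros_like(param).contiguous(memory_format=torch.channels_last)

(param * 2).sum().backward()   # common case: accumulates into the preset .grad in place
assert param.grad.is_contiguous(memory_format=torch.channels_last)
```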