
Conversation

mcarilli
Collaborator

@mcarilli mcarilli commented Mar 17, 2020

Currently, whether AccumulateGrad steals or clones an incoming gradient, the gradient ends up rowmajor contiguous, regardless of its param's layout. If the param's layout is channels last, or otherwise not rowmajor contiguous, later kernels that apply gradients to params are forced into an uncoalesced memory access pattern for either the param or the gradient. This may not sound like a big deal, but for any binary op on large tensors it's a >3X increase in gmem traffic => 3X slowdown.

The present PR changes AccumulateGrad to prefer, where possible, stashing gradients that match their params' layouts ("Gradient Layout Contract").
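A minimal Python sketch of the intended behavior (module and tensor names are illustrative only; assumes a channels-last-aware backward kernel so the produced grad can be stolen):

import torch

conv = torch.nn.Conv2d(8, 8, 3).to(memory_format=torch.channels_last)
x = torch.randn(2, 8, 16, 16).to(memory_format=torch.channels_last)
conv(x).sum().backward()
# After this PR, the stashed grad prefers its param's (channels-last) strides
# instead of being forced rowmajor contiguous.
print(conv.weight.stride(), conv.weight.grad.stride())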

Allowing AccumulateGrad to stash non-rowmajor-contiguous grads means DDP allreduces and DP reduces must allow non-rowmajor-contiguous grads. This PR extends DDP and DP to allow gradients with non-rowmajor-contiguous strides as long as their layout is nonoverlapping and dense.

For good measure, I include changes that allow all five nccl primitives (allreduce, reduce, broadcast, allgather, reducescatter) to act on non-rowmajor-contiguous tensors (again as long as each input's layout is nonoverlapping and dense, and as long as all tensors participating in a given collective have the same layout). The primitive comm changes aren't necessary to enable the DDP changes, but I wasn't sure this would end up true until I had written both sets of changes. I think primitive comm enablement is reasonable to keep in the PR, especially since the code for it is simple.
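As a hedged sketch of what that enables at the Python level (assumes an already-initialized NCCL process group; this is not code from the PR):

import torch
import torch.distributed as dist

# Each rank passes a tensor with the same non-overlapping, dense
# (here channels-last) layout; no rowmajor .contiguous() copy is required.
t = torch.randn(2, 8, 4, 4, device="cuda").to(memory_format=torch.channels_last)
dist.all_reduce(t)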

Channels last params will be a major beneficiary of this PR, but I don't see it as a channels-last-specific fix. The spirit is layout matching in general:

  • Grads should be stashed with memory layouts matching their params.
  • Src and dst tensors on opposite ends of collectives should have matching dense layouts.

This PR also updates autograd docs to describe potential BC-breaking changes below.

BC notes

@ngimel @albanD @gchanan

BC-breaking

In the common case where the user lets AccumulateGrad decide grad layouts, strides for grads of dense but non-rowmajor-contiguous params will change. Any user code that was accustomed to view(-1)ing these grads will break.
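A hypothetical sketch of that breakage (the tensor here just stands in for such a grad):

import torch

g = torch.randn(2, 8, 4, 4).to(memory_format=torch.channels_last)
try:
    g.view(-1)             # no longer valid: g is not rowmajor contiguous
except RuntimeError:
    flat = g.reshape(-1)   # copies to a rowmajor layout if needed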

Also, the circumstances under which a grad can be stolen directly from the backward function that created it, as opposed to deep-copied by AccumulateGrad, have changed. In most cases we expect silent performance improvement, because we expect channels-last-aware backward kernels to create channels last gradients for channels last params. Now those can be stolen, whereas before this PR they were cloned and made rowmajor contiguous. IMO this is a mild BC breakage. Param backward hooks still see grads come in with whatever format the backward kernel gave them. The only BC breakage potential I see is if user code somehow relies on whether a grad seen in a hook shares memory with the eventual param.grad. Any such users hopefully know they're off the edge of the map and understand how to update their expectations.

BC escape hatches

At @albanD's recommendation, this PR's changes to AccumulateGrad do not alter the pre-PR code's decisions about whether grad is accumulated in or out of place. Accumulations of new grads onto an existing .grad attribute were (usually) in-place before this PR and remain in-place after this PR, keeping the existing .grad's layout. After this PR, if the user wants to force accumulation into a grad with a particular layout, they can preset param.grad to a zeroed tensor with the desired strides or call grad.contiguous(desired format). This likely won't be as performant as letting AccumulateGrad establish grad layouts by cloning or stealing grads with contract-compliant strides, but at least users have a control point.
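A sketch of that escape hatch (the module is illustrative; assumes the usual in-place accumulation path):

import torch

conv = torch.nn.Conv2d(8, 8, 3).to(memory_format=torch.channels_last)
w = conv.weight
# Preset .grad with the strides you want subsequent in-place accumulations to keep,
w.grad = torch.zeros_like(w, memory_format=torch.contiguous_format)
# or coerce an existing grad instead:
# w.grad = w.grad.contiguous(memory_format=torch.contiguous_format)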

One limitation (present before this PR and unchanged by this PR): Presetting param.grad does not ensure in-place accumulation all the time. For example, if create_graph=True, or if incoming new_grad is dense and existing variable_grad is sparse, accumulation occurs out of place, and the out-of-place result may not match the existing grad's strides.
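For instance (a hedged sketch with a hypothetical param):

import torch

param = torch.nn.Parameter(torch.randn(8, 8, 3, 3))
param.grad = torch.zeros_like(param, memory_format=torch.channels_last)
(param * 2).sum().backward(create_graph=True)
# With create_graph=True the accumulation is out of place, so param.grad's
# strides are not guaranteed to match the preset channels-last strides.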


I also noticed some potential DDP improvements that I considered out of scope but want to mention for visibility:

  1. make sure Reducer's ops sync with AccumulateGrad streams
  2. to reduce CPU overhead and incur fewer kernel launches, lazily create flat contents tensors by a single cat kernel only when a bucket is full, instead of copy_ing grads into contents individually as soon as they are received. PR includes a minor change to divide grads while copying them into flat buffers, instead of copying them in, then dividing separately. Without cat+div fusion, div-while-copying is the best we can do.
  3. Skip allreducing local_used_maps_dev_ when find_unused_param=False #38942

@mcarilli mcarilli requested review from albanD and apaszke as code owners March 17, 2020 19:14
@mcarilli mcarilli added the module: memory format Memory format/layout related issues/changes (channels_last, nhwc) label Mar 17, 2020
Contributor

@facebook-github-bot facebook-github-bot left a comment


@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@albanD
Collaborator

albanD commented Mar 17, 2020

Just adding the BC-breaking tag to make sure it is tracked.
But it is fine to merge.

@albanD albanD added the module: bc-breaking Related to a BC-breaking change label Mar 17, 2020
@ngimel
Collaborator

ngimel commented Mar 17, 2020

DO NOT MERGE before the corresponding distributed fix; this will break DP/DDP.

@dr-ci

dr-ci bot commented Mar 17, 2020

💊 CI failures summary and remediations

As of commit 679480f (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



Collaborator

@albanD albanD left a comment


Requesting changes following the comment above.
@ngimel do you have a reference for these other changes? And more information on how this would break DP/DDP?

@ngimel
Collaborator

ngimel commented Mar 18, 2020

@albanD ddp changes are being added to this PR currently.
The reason why dp/ddp could break with the AccumulateGrad changes above is that the nccl backend, at least (and possibly other backends), can all-reduce only legacy contiguous tensors, and after this PR grads can be channels-last contiguous.
At this point I'm not sure what the best place to fix it is: dp/ddp, or enabling distributed backends to handle permuted tensors.

@albanD
Collaborator

albanD commented Mar 18, 2020

nccl backend at least can all-reduce only legacy contiguous tensors

Ok ! Good to know.

@mcarilli
Collaborator Author

At this point I'm not sure what's the best place to fix it - is it dp/ddp, or enabling distributed backends to handle permuted tensors.

For DDP allreduces as currently written, the fix must be in DDP's Reducer, because Reducer performs its own bucket flattening/unflattening (and those are the points where my fixes are applied).

Interestingly, to perform the param broadcast during construction, DDP does not use its own bucketing logic. Instead it calls into BroadcastWork, a handy self-contained broadcaster in torch/csrc/distributed/c10d/comm.cpp. BroadcastWork takes a list of tensors, creates flat buckets, broadcasts the buckets, and copies the broadcasted data back into the original tensors. It handles NHWC-contiguous tensors just fine. Creating an analogous AllreduceWork class for Reducer to call is worth considering, but probably out of scope for today.

@yf225 yf225 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Mar 20, 2020
@mcarilli mcarilli force-pushed the nhwc_accumulate_grad branch from 6dfb14f to 68ef85e Compare May 15, 2020 04:11
Collaborator

@albanD albanD left a comment


LGTM :)

if (!global_unused) {
  if (!grad.defined()) {
    grad = at::empty(bucket_view.sizes(), bucket_view.options());
    // Creates grad according to the "Gradient Layout Contract"
Collaborator Author

@mcarilli mcarilli Jun 11, 2020


@mrshenli are the diffs here ok? (Again, grad might not be rowmajor contiguous here.) Also, under what circumstances does the grad need to be "written back", and what does "written back" entail? The grad is modified in place, so references to it elsewhere don't need to be explicitly updated (although I guess copies in other processes would be).

Collaborator Author

@mcarilli mcarilli Jun 11, 2020


Also, why isn't there an equivalent runGradCallbackForVariable in finalize_bucket_sparse? Was that an oversight in the dist_autograd PR, or was that intentionally omitted?

Contributor


are the diffs here ok?

This looks OK to me.

Also under what circumstances does the grad need to be "written back" and what does "written back" entail?

If you are referring to the written back comment below, I think that comment is inaccurate. IIUC, it should return true when we are certain that the grad value is final, and we can now launch cbs on it.

Also, why isn't there an equivalent runGradCallbackForVariable in finalize_bucket_sparse? Was that an oversight in the dist_autograd PR, or was that intentionally omitted?

I guess this is because #37998 didn't mean to support sparse tensors for RPC + DDP. @pritamdamania87 if this is the case, should we also remove the runGradCallbackForVariable in mark_variable_ready_sparse to make it consistent?

Contributor


I think this was probably an oversight in #37998, we should probably call runGradCallbackForVariable in finalize_bucket_sparse as well. I'll fix this in a separate diff.

Collaborator Author


SGTM. I wrote my other thoughts about sparse grads in the other thread https://github.com/pytorch/pytorch/pull/34904/files#r439117597

-      return false;
+      replica.contents.div_(process_group_->getSize());
+      // The grad is modified in place and needs to be written back.
+      return true;
Collaborator Author

@mcarilli mcarilli Jun 11, 2020


@mrshenli are the diffs here ok? In mark_variable_ready_dense mul_out averages the pre-allreduce gradients (which doesn't affect grad itself, because the result is written to the allreduce bucket). For mark_variable_ready_sparse to mirror that control flow, I added div_ here. For sparse gradients, replica.contents = grad, though, so grad itself IS affected in place. Therefore, I also changed return false to return true. Admittedly I don't fully understand the implications of doing so.

Collaborator Author


Maybe the div_+return true here actually fixes a bug with the dist_autograd changes. In current master, the bucket predivision is carried out in mark_variable_ready. For sparse gradients, the predivision alters the grad in place, but there's no mention at that point of runGradCallbackForVariable, so that change isn't communicated anywhere. Again, admittedly I have no idea what's going on.

Contributor

@mrshenli mrshenli Jun 11, 2020


IIUC, the criteria for returning true is when we want to trigger cbs on that grad. So the prior (return true when the grad is modified) comments might not be accurate.

We want to trigger those cbs when the grad is ready, i.e., after the allreduce sync done by DDP. So I think we should still return false here?

@pritamdamania87 please correct me if I am wrong.

Contributor


I think for consistency with local autograd it's better to return true here (since for local autograd .grad would change, so for dist_autograd grads in the map should change too). Although, is it possible to do the division at the end, in finalize_backward? That way we modify the grad in only one place.

Collaborator Author


Wait a minute: since for sparse grads the grad is the bucket, it makes no sense for anything external (local or remote) to look at its values between when the allreduce hook triggers and when the allreduce is finalized. From the perspective of external code in this thread, the allreduce hook carries out a div_ and immediately kicks off an allreduce which spends a while updating grad's values in place. Any access to the grad before finalize_backward is racy.

We could make div_ an out-of-place div for sparse gradients. Then variable.grad itself will remain untouched until finalize_bucket_sparse (which makes sparse grad control flow more consistent with dense grads which use separate memory for their allreduce buckets). However, as I said earlier, finalize_bucket_sparse makes no reference to runGradCallbackForVariable anyway, so dist_autograd's "map" is never informed even after the allreduce completes and grad is safe to access. Do we need to add runGradCallbackForVariable to finalize_bucket_sparse?

is it possible to do the division in the end in finalize_backward

That's a reasonable idea but I'm wary of it. For some networks that allreduce FP16 gradients at scale, we've noticed that post-allreduce averaging caused nonconvergence, but pre-allreduce averaging was fine. I'm not aware off the top of my head of any cases where we observed the reverse, so I prefer the predivision that DDP implements currently.
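A toy numeric illustration of why pre-division can matter for FP16 (a sketch, not DDP's actual code path; numbers chosen to overflow fp16):

import torch

world_size = 4096
g = torch.full((1,), 32.0, dtype=torch.float16)
post = torch.stack([g] * world_size).sum(0) / world_size  # fp16 sum overflows to inf
pre = torch.stack([g / world_size] * world_size).sum(0)   # pre-divided sum stays finite
print(post.item(), pre.item())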

Contributor


Yes, it is true that nothing should access the grad between allreduce hooks and finalize_backward. Although, my point was mostly from a consistency standpoint. In the local autograd engine case, we are modifying the grad here and technically anyone looking at variable's .grad after the div_ operation would see a change. So if we're updating .grad here, we should update it in distributed autograd's map as well. The other option is not to update in both .grad and dist autograd map. Although, I'd prefer the in place div_ to avoid creating a new tensor.

Do we need to add runGradCallbackForVariable to finalize_bucket_sparse?

Yes, we do and this was a bug in #37998.

auto wrapped = c10::scalar_to_tensor(double(1.)/process_group_->getSize());
wrapped.unsafeGetTensorImpl()->set_wrapped_number(true);
// Divides while copying into the bucket view.
at::native::mul_out(bucket_view, grad, wrapped);
Collaborator Author


@mrshenli are the diffs here ok? grad is now divided while being copied into bucket_view, but it's still not being modified, so I think it's ok.
A more subtle thing is that with this PR's other diffs, grad may no longer be rowmajor contiguous. I'm not sure how that affects dist_autograd in general if the callback ends up routing into the dist_autograd context.

Contributor

@mrshenli mrshenli Jun 11, 2020


This LGTM.

Curious, what difference does set_wrapped_number make? Is it that mul_out only supports scalar tensors, or is there a shortcut for this?

Regarding dist autograd: as DDP + RPC is a very new feature and will be released as beta, it still needs more tests and apps to verify and exercise it. I think even if it does not work with this PR, that should not block us from landing this. We can fix it in follow-up PRs.

Hold on. Let me think again about dist autograd.

Collaborator Author


at::native::mul_out only has an overload that accepts three Tensor args, so I need to convert the scalar to a tensor.

I'm not sure exactly why set_wrapped_number in particular is needed. I don't think it's needed for TensorIterator kernels inside mul_out, because the lambda capturing of CPU scalars for GPU kernels is based on TensorIterator::is_cpu_scalar which does not rely on Tensor::is_wrapped_number. I included set_wrapped_number because it seems to be the standard practice when converting scalars to Tensors (e.g. in BinaryOps.cpp and python_arg_parser.cpp).
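At the Python level, the div-while-copy amounts to something like this sketch (tensor names are illustrative, not the Reducer's actual variables):

import torch

world_size = 8
grad = torch.randn(2, 8, 4, 4).to(memory_format=torch.channels_last)
flat_bucket = torch.empty(grad.numel())
bucket_view = flat_bucket.view_as(grad)  # rowmajor view into the flat bucket
# One fused kernel writes grad / world_size straight into the bucket view,
# instead of copy_() followed by a separate div_().
torch.mul(grad, 1.0 / world_size, out=bucket_view)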

Contributor

@facebook-github-bot facebook-github-bot left a comment


@albanD has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mcarilli
Collaborator Author

mcarilli commented Jun 15, 2020

@jeffdaily I'm seeing timeouts for my DDP tests on ROCm. Is it expected that DDP tests with the @requires_nccl() decorator will run on ROCm? If so, do you know why they might hang?
example:

17:44:58 ERROR: test_grad_layout_1devicemodule_1replicaperprocess (__main__.DistributedDataParallelTest)
17:44:58 ----------------------------------------------------------------------
17:44:58 Traceback (most recent call last):
17:44:58   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 204, in wrapper
17:44:58     self._join_processes(fn)
17:44:58   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 306, in _join_processes
17:44:58     self._check_return_codes(elapsed_time)
17:44:58   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 344, in _check_return_codes
17:44:58     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time))
17:44:58 RuntimeError: Process 0 terminated or timed out after 100.08511734008789 seconds

from https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-xenial-rocm3.3-py3.6-test2/452/console

I'm decorating these with skip_if_rocm in the meantime, but if they're expected to pass, we should figure out why they failed.

@jeffdaily
Collaborator

This is what I see in the CI log:

17:34:16 test_grad_layout_1devicemodule_1replicaperprocess (__main__.DistributedDataParallelTest) ... [W Module.cpp:449] Warning: Disabling benchmark mode for MIOpen is NOT supported. Overriding value to True (function operator())
17:34:16 [W Module.cpp:449] Warning: Disabling benchmark mode for MIOpen is NOT supported. Overriding value to True (function operator())
17:34:34 MIOpen(HIP): Warning [ForwardBackwardGetWorkSpaceSizeImplicitGemm] /root/driver/MLOpen/src/lock_file.cpp:75: Error creating file </var/lib/jenkins/.config/miopen//miopen.udb.lock> for locking.
17:35:56 ERROR
17:35:56 test_grad_layout_1devicemodule_2replicaperprocess (__main__.DistributedDataParallelTest) ... Timing out after 100 seconds and killing subprocesses.

It looks to me like MIOpen has encountered kernels that it hasn't compiled yet. The error creating the miopen lock file indicates that two or more processes raced to compile the same kernel (the race was resolved when one couldn't acquire the lock). However, this likely led to a timeout: one process got the miopen lock, compiled the kernel, then released the lock to the other process, which may or may not re-compile the same kernel (depending on timing).

One solution might be to increase the timeout; that might get the test to pass for ROCm. Another solution will be available starting in ROCm 3.5, where we can install precompiled kernels as a separate deb package, but that isn't available to you yet.

Skipping them for now to unblock your work is understandable in the short term. It wouldn't be the first time we had a timing issue caused by first-use compilation of MIOpen kernels.

@mcarilli
Collaborator Author

mcarilli commented Jun 15, 2020

@jeffdaily Thanks for the quick reply, just wanted to make sure you were aware.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@albanD is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@albanD merged this pull request in 2beb969.

facebook-github-bot pushed a commit that referenced this pull request Jun 17, 2020
…memory layout (#40129)

Summary:
#34904 was reverted because it had a misconfigured 4 GPU test that for some reason wasn't caught by external CI ([example failure](https://app.circleci.com/pipelines/github/pytorch/pytorch/181719/workflows/cfb37cd9-9a0c-4738-898b-d683934cd308/jobs/5868948/steps)).

This PR reverts the revert, and adds diffs that should repair the misconfigured test.
Pull Request resolved: #40129

Differential Revision: D22079377

Pulled By: albanD

fbshipit-source-id: 9bd2b7e0c34fdaf887497b52037cfe82cba709c1
xwang233 pushed a commit to xwang233/pytorch that referenced this pull request Jun 20, 2020
…yout (pytorch#34904)

Summary:
Same as the PR description at the top of this conversation.
Pull Request resolved: pytorch#34904

Differential Revision: D20496044

Pulled By: albanD

fbshipit-source-id: 248d680f4b1bf77b0a986451844ec6e254469217
xwang233 pushed a commit to xwang233/pytorch that referenced this pull request Jun 20, 2020
…memory layout (pytorch#40129)

Summary:
pytorch#34904 was reverted because it had a misconfigured 4 GPU test that for some reason wasn't caught by external CI ([example failure](https://app.circleci.com/pipelines/github/pytorch/pytorch/181719/workflows/cfb37cd9-9a0c-4738-898b-d683934cd308/jobs/5868948/steps)).

This PR reverts the revert, and adds diffs that should repair the misconfigured test.
Pull Request resolved: pytorch#40129

Differential Revision: D22079377

Pulled By: albanD

fbshipit-source-id: 9bd2b7e0c34fdaf887497b52037cfe82cba709c1

Labels

Merged
module: bc-breaking - Related to a BC-breaking change
module: memory format - Memory format/layout related issues/changes (channels_last, nhwc)
open source
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
