
Memory Format support for Resnet models #23403

Closed
2 of 10 tasks
VitalyFedyunin opened this issue Jul 25, 2019 · 18 comments
Assignees
Labels
module: cudnn Related to torch.backends.cudnn, and CuDNN support
module: memory format Memory format/layout related issues/changes (channels_last, nhwc)
triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@VitalyFedyunin
Contributor

VitalyFedyunin commented Jul 25, 2019

We define a 4D tensor as stored in the channels last memory format when its dimension order is NCHW and its strides satisfy C-stride < W-stride < H-stride < N-stride (if any dimension has size 1, that dimension's stride is not taken into account).

A channels last contiguous tensor is a channels last tensor that occupies a contiguous memory block. So x.is_contiguous(memory_format=torch.channels_last) checks whether a tensor is channels last contiguous.
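
A minimal sketch of these definitions (assuming a recent PyTorch build; the concrete stride values depend on the sizes chosen here):

import torch

# NCHW-contiguous tensor and its channels last counterpart: sizes stay NCHW,
# only the strides change.
x = torch.randn(2, 3, 4, 5)
y = x.contiguous(memory_format=torch.channels_last)

print(x.stride())  # (60, 20, 5, 1)
print(y.stride())  # (60, 1, 15, 3) -> C-stride < W-stride < H-stride < N-stride
print(y.is_contiguous(memory_format=torch.channels_last))  # True
print(y.is_contiguous())                                    # False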

The goal of this experiment is to use the channels last memory format in all operators of the Resnet models (https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py) and to measure the performance gains on Volta devices with the cuDNN library available.

This experiment requires:

  1. Update operator kernels to follow this rule: if one of the operator's inputs is a channels last tensor, all outputs should also be in the channels last memory format.
  2. For a better performance gain, update the DataLoader to output channels last tensors (a sketch follows this list).
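
A possible sketch of point 2, assuming a recent PyTorch; channels_last_collate and dataset are hypothetical names used only for illustration:

import torch
from torch.utils.data import DataLoader

# Convert each collated batch to channels last right after stacking, so the model
# receives NHWC-strided inputs without any change to the dataset itself.
def channels_last_collate(batch):
    images = torch.stack([sample[0] for sample in batch])
    labels = torch.tensor([sample[1] for sample in batch])
    return images.contiguous(memory_format=torch.channels_last), labels

# loader = DataLoader(dataset, batch_size=32, collate_fn=channels_last_collate)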

To avoid changing the model itself and, more importantly, to bring this optimization to existing saved models, we need to introduce the following changes.

Writing memory format aware operators requires the special functions introduced in #23391:

auto memory_format = input_tensor.suggest_memory_format();
auto output_tensor =
    at::empty(output_shape, input_tensor.options(), memory_format);

switch (memory_format) {
  case MemoryFormat::ChannelsLast: {
    auto input_cl_contiguous = input_tensor.contiguous(
        MemoryFormat::ChannelsLast); // if the kernel requires a memory
                                     // contiguous tensor
    // .... kernel code
    break;
  }
  case MemoryFormat::Contiguous: {
    // .... standard kernel
    break;
  }
  default:
    TORCH_CHECK(
        false,
        "Unsupported memory format. Supports only ChannelsLast, Contiguous");
}

Notes

  • Resnet calls x = x.reshape(x.size(0), -1) before the linear layer; we are going to update the reshape and view code to convert the tensor's memory format to torch.contiguous_format at this step.
  • Making old models/libraries run faster also requires BC sacrifices, such as:
    empty_like (and all _like operators) will return a channels last tensor if the input is channels last, and the same will apply to to, clone, and resize_as. We are thinking about the ability to control the suggest_memory_format behavior with a global variable (a sketch of the _like behavior follows these notes).
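
A sketch of the _like behavior described in the last note (this is the default in later PyTorch releases; at the time of this issue it was still a proposal):

import torch

x = torch.randn(2, 3, 8, 8).contiguous(memory_format=torch.channels_last)

# _like operators and clone follow the input's memory format ...
y = torch.empty_like(x)
z = x.clone()
print(y.is_contiguous(memory_format=torch.channels_last))  # True
print(z.is_contiguous(memory_format=torch.channels_last))  # True

# ... but an NCHW-contiguous result can still be requested explicitly.
w = torch.empty_like(x, memory_format=torch.contiguous_format)
print(w.is_contiguous())  # True
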
@VitalyFedyunin VitalyFedyunin added the module: cudnn, triaged, and module: memory format labels Jul 25, 2019
@VitalyFedyunin VitalyFedyunin self-assigned this Jul 25, 2019
@VitalyFedyunin
Contributor Author

CC @ezyang @ifedan @zou3519 @csarofeen @dzhulgakov @soumith
Draft, as I'm going to create a bunch of PRs and issues covering some of the tasks.

@fmassa
Member

fmassa commented Jul 25, 2019

Question about the reshape / view comment above: does this perform any permutation on the data to put it in NCHW order? I suppose not, as that would be very unintuitive.
But if that's the case, then I believe we won't be able to run models in different memory formats (while keeping the same pre-trained weights) without hand-modifying the weights of the Linear layer.

@VitalyFedyunin
Contributor Author

At the moment I have changed x = x.reshape(x.size(0), -1) to x = x.contiguous().reshape(x.size(0), -1) inside the model code. I suggest we do it internally (if x is channels last) and warn the user to update the model if possible. This way the pre-trained weights will match the proper positions.

@VitalyFedyunin
Contributor Author

Exactly the same applies to AlexNet: x = x.view(x.size(0), 256 * 6 * 6) -> x = x.contiguous().view(x.size(0), 256 * 6 * 6)
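
For illustration, a small sketch (recent PyTorch) of why the extra .contiguous() is needed here: view() cannot flatten the C, H, W dimensions of an NHWC-strided tensor without a copy.

import torch

x = torch.randn(2, 256, 6, 6).contiguous(memory_format=torch.channels_last)

try:
    x.view(x.size(0), 256 * 6 * 6)        # fails: the strides are not flattenable
except RuntimeError as err:
    print("view failed:", err)

flat = x.contiguous().view(x.size(0), 256 * 6 * 6)  # copy back to NCHW, then view
print(flat.shape)  # torch.Size([2, 9216])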

@fmassa
Member

fmassa commented Jul 25, 2019

What if we replace the occurrences of reshape with flatten, and leave the possibility to specify some kind of memory format in flatten itself?

@VitalyFedyunin
Contributor Author

Well, ultimately flatten is reshape and reshape is view. So regardless of the operator, we need to match the positions of the weights, and this requires exactly one memory copy via a .contiguous() call or a similar op.

The good news is:

  • At this point the tensor size is relatively small.
  • We can look at how to optimize the conversion from torch.channels_last to torch.contiguous_format (see the sketch below).
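
A rough sketch of that single copy, assuming a recent PyTorch:

import torch

# One permuting copy back to the NCHW layout before the linear layer; the
# subsequent flatten is then a free view.
x = torch.randn(32, 512, 7, 7).contiguous(memory_format=torch.channels_last)
x_nchw = x.contiguous(memory_format=torch.contiguous_format)
flat = x_nchw.view(x_nchw.size(0), -1)
print(flat.shape)  # torch.Size([32, 25088])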

@VitalyFedyunin
Contributor Author

To be honest, this point is the most questionable, but at the same time the easiest to solve one way or another, and we can move it out of the performance test task. I just wanted to write it down, since the question might come up.

@fmassa
Member

fmassa commented Jul 25, 2019

Flatten is a subset of reshape: it only merges dimensions together and can't create new dimensions.

I think it can be very mysterious to users to switch from .contiguous().reshape to just reshape and have a model silently stop working because the Linear layer expects the data laid out differently. At the very least we should make the layout in the contiguous call explicit, to hint that it is actually necessary.

But I need to think a bit more about all this

@apaszke
Contributor

apaszke commented Jul 26, 2019

I would be very careful about .clone() preserving the memory format, because it's likely that there is code out there which assumes that its output is contiguous in the traditional sense and, e.g., passes it off to C++ code.

Why should we bother with layouts in .resize_as_? That op should probably be made internal anyway. Where is it used in a ResNet?
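
For reference, a small sketch (recent PyTorch) of how code that truly needs a densely packed NCHW buffer, e.g. before handing a data pointer to external C++ code, can request it explicitly instead of relying on the .clone() default:

import torch

x = torch.randn(2, 3, 4, 4).contiguous(memory_format=torch.channels_last)

y = x.clone(memory_format=torch.contiguous_format)  # force an NCHW-contiguous copy
print(y.is_contiguous())  # True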

@soumith
Member

soumith commented Jul 26, 2019

I'm not worried about the .clone() semantics, @apaszke. For example, no user has noticed that TensorIterator-based torch.* functions went from outputs always being contiguous to outputs matching input layouts. It happened somewhere around 1.0, and we didn't receive a single bug report or forum post.
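
A quick check of that behavior on a recent PyTorch build (the exact set of layout-preserving ops at the time of this issue may have differed):

import torch

x = torch.randn(2, 3, 4, 4).contiguous(memory_format=torch.channels_last)

# Pointwise TensorIterator ops follow the input layout instead of always
# producing NCHW-contiguous outputs.
y = torch.relu(x) + 1.0
print(y.is_contiguous(memory_format=torch.channels_last))  # True
print(y.is_contiguous())                                    # False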

@soumith
Member

soumith commented Jul 26, 2019

@fmassa

I think it can be very mysterious to users to switch from .contiguous().reshape to just reshape and have a model silently stop working because the Linear layer expects the data laid out differently.

The worst case for memory formats is an extra transpose and copy. Memory formats can never give wrong results.

@fmassa
Member

fmassa commented Jul 26, 2019

@soumith I think I misunderstood what was going to happen.

Resnet calls x = x.reshape(x.size(0), -1) before linear layers, we are going to update reshape and view code and convert tensor's memory format to torch.contiguous_format at this step.

So now inside of torch.reshape there will be a prior step where x.contiguous(memory_format=torch.contiguous_format) is called?

In this case, there will be indeed no problems for the case I was thinking.

But this could also be confusing to users, as the semantics of .reshape now become more complicated and are equivalent to .permute().reshape().


But if the solution that we are proposing is to keep the behavior of reshape as is, but instead modify the model definitions to add an extra contiguous() call before reshape, then this might give unexpected results to users.

Imagine the two situations:

x = x.contiguous().reshape(x.shape[0], -1)
# or
x = x.reshape(x.shape[0], -1)

currently, the two are equivalent and one can just use the .reshape variant.

But with memory layouts, this doesn't hold true anymore.
Indeed,

x = x.contiguous().reshape(x.shape[0], -1)
# is equivalent to
x = x.contiguous(memory_format=torch.contiguous_format).reshape(x.shape[0], -1)
# but not to
x = x.reshape(x.shape[0], -1)

@soumith
Member

soumith commented Jul 26, 2019

@fmassa memory_format as an entire feature doesn't change sizes; it only changes strides.
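
A quick illustration of that point, assuming a recent PyTorch: sizes, values, and the result of reshape are unchanged; only the strides differ.

import torch

x = torch.randn(2, 3, 4, 5)
y = x.contiguous(memory_format=torch.channels_last)

print(x.shape == y.shape)                               # True
print(torch.equal(x.reshape(2, -1), y.reshape(2, -1)))  # True
print(x.stride(), y.stride())                           # (60, 20, 5, 1) (60, 1, 15, 3)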

@ngimel
Collaborator

ngimel commented Jul 28, 2019

While narrow_copy_dense would still be functionally correct with a layout-preserving .clone(), it would use more memory than expected if dim is not 0:

Tensor narrow_copy_dense(const Tensor& self, int64_t dim, int64_t start, int64_t length) {
  return self.narrow(dim, start, length).clone();
}

If a layout-preserving .clone() also preserves zero strides, then gradient accumulation will break here:

variable.grad() = new_grad.clone();

The gradient will be a zero-strided tensor, and the subsequent in-place addition of the gradient in line 81 will error out.
As far as I can tell at a quick glance, SpectralOps relies on the output of .clone() being contiguous.
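
A toy illustration of the zero-stride concern (the broadcast tensor below is only one example of how zero strides can arise in a gradient):

import torch

g = torch.zeros(3).expand(4, 3)   # stride (0, 1): every row aliases the same memory
print(g.stride())

try:
    g.add_(torch.ones(4, 3))      # what accumulation would hit if clone() kept the strides
except RuntimeError as err:
    print("in-place add failed:", err)

# Today's default clone() materializes a dense tensor, so accumulation works.
print(g.clone().add_(torch.ones(4, 3)).stride())  # (3, 1)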

@VitalyFedyunin
Contributor Author

Thanks @ngimel. That's why my hacky implementation preserved .clone() strides only if the input tensor is channels last, and I never had issues with 0-dim grads. But I will take a look at this code, as it might become important if we decide that the .clone() and .to() operations should preserve strides in all scenarios.

@jjsjann123
Collaborator

Just to put it out there for tracking who's working on what, so that we are not duplicating effort.

I'm currently looking at the last two things on the list:

  • cudnn_batch_norm and cudnn_batch_norm_backward should support channels last memory format.
  • cudnn_convolution_forward and cudnn_convolution_backward should support the channels last memory format (a rough sketch of the target setup follows).
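
An end-to-end sketch of the setup these kernels target (assumes a CUDA device with cuDNN available; the layer sizes are arbitrary):

import torch
import torch.nn as nn

if torch.cuda.is_available():
    # Both the weights and the input are converted to channels last so the cuDNN
    # convolution and batch norm kernels can operate in NHWC.
    model = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
    ).cuda().to(memory_format=torch.channels_last)

    x = torch.randn(8, 3, 224, 224, device="cuda").contiguous(memory_format=torch.channels_last)
    out = model(x)
    print(out.is_contiguous(memory_format=torch.channels_last))  # expected: True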

@jjsjann123
Collaborator

As discussed with Vitaly, I'm taking on these now:

  • adaptive_avg_pool2d_cuda and adaptive_avg_pool2d_backward_cuda should have channels last optimized kernels.

@VitalyFedyunin
Contributor Author

Moving progress tracking and extending scope in #28619
