
Memory Format support for Resnet models #23403

Closed
2 of 10 tasks
VitalyFedyunin opened this issue Jul 25, 2019 · 18 comments
Assignees
Labels
module: cudnn Related to torch.backends.cudnn, and CuDNN support
module: memory format Memory format/layout related issues/changes (channels_last, nhwc)
triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@VitalyFedyunin
Contributor

VitalyFedyunin commented Jul 25, 2019

We define a 4D tensor as stored in the channels last memory format when its dimension order is NCHW and its strides satisfy C-stride < W-stride < H-stride < N-stride (if any dimension has size 1, that dimension's stride is not taken into account).

A channels last contiguous tensor is a channels last tensor that occupies a contiguous memory block. So x.is_contiguous(memory_format=torch.channels_last) checks whether a tensor is channels last contiguous.
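
A minimal sketch of these definitions (assuming a recent PyTorch build; the concrete stride values depend on the sizes chosen here):

import torch

# NCHW-contiguous tensor and its channels last counterpart: sizes stay NCHW,
# only the strides change.
x = torch.randn(2, 3, 4, 5)
y = x.contiguous(memory_format=torch.channels_last)

print(x.stride())  # (60, 20, 5, 1)
print(y.stride())  # (60, 1, 15, 3) -> C-stride < W-stride < H-stride < N-stride
print(y.is_contiguous(memory_format=torch.channels_last))  # True
print(y.is_contiguous())                                    # False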

The goal of this experiment is to use the channels last memory format in all operators of the Resnet models (https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py) and to measure the performance gains on Volta devices with the cuDNN library available.

This experiment requires:

  1. Update operator kernels to follow this rule: if one of the operator's inputs is a channels last tensor, all outputs should also be in the channels last memory format.
  2. For a better performance gain, update the DataLoader to output channels last tensors (a sketch follows this list).
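
A possible sketch of point 2, assuming a recent PyTorch; channels_last_collate and dataset are hypothetical names used only for illustration:

import torch
from torch.utils.data import DataLoader

# Convert each collated batch to channels last right after stacking, so the model
# receives NHWC-strided inputs without any change to the dataset itself.
def channels_last_collate(batch):
    images = torch.stack([sample[0] for sample in batch])
    labels = torch.tensor([sample[1] for sample in batch])
    return images.contiguous(memory_format=torch.channels_last), labels

# loader = DataLoader(dataset, batch_size=32, collate_fn=channels_last_collate)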

To avoid changing the model itself and, more importantly, to bring this optimization to existing saved models, we need to introduce the following changes.

Writing memory format aware operators requires the special functions introduced in #23391:

auto memory_format = input_tensor.suggest_memory_format();
auto output_tensor =
    at::empty(output_shape, input_tensor.options(), memory_format);

switch (memory_format) {
  case MemoryFormat::ChannelsLast: {
    auto input_cl_contiguous = input_tensor.contiguous(
        MemoryFormat::ChannelsLast); // if the kernel requires a memory
                                     // contiguous tensor
    // .... kernel code
    break;
  }
  case MemoryFormat::Contiguous: {
    // .... standard kernel
    break;
  }
  default:
    TORCH_CHECK(
        false,
        "Unsupported memory format. Supports only ChannelsLast, Contiguous");
}

Notes

  • Resnet calls x = x.reshape(x.size(0), -1) before the linear layer; we are going to update the reshape and view code to convert the tensor's memory format to torch.contiguous_format at this step.
  • Making old models/libraries run faster also requires BC sacrifices, such as:
    empty_like (and all _like operators) will return a channels last tensor if the input is channels last, and the same will apply to to, clone, and resize_as. We are thinking about the ability to control the suggest_memory_format behavior with a global variable (a sketch of the _like behavior follows these notes).
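
A sketch of the _like behavior described in the last note (this is the default in later PyTorch releases; at the time of this issue it was still a proposal):

import torch

x = torch.randn(2, 3, 8, 8).contiguous(memory_format=torch.channels_last)

# _like operators and clone follow the input's memory format ...
y = torch.empty_like(x)
z = x.clone()
print(y.is_contiguous(memory_format=torch.channels_last))  # True
print(z.is_contiguous(memory_format=torch.channels_last))  # True

# ... but an NCHW-contiguous result can still be requested explicitly.
w = torch.empty_like(x, memory_format=torch.contiguous_format)
print(w.is_contiguous())  # True
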
@VitalyFedyunin VitalyFedyunin added the module: cudnn, triaged, and module: memory format labels Jul 25, 2019
@VitalyFedyunin VitalyFedyunin self-assigned this Jul 25, 2019
@VitalyFedyunin
Contributor Author

CC @ezyang @ifedan @zou3519 @csarofeen @dzhulgakov @soumith
Draft, as I'm going to create a bunch of PRs and issues covering some of the tasks.

@fmassa
Member

fmassa commented Jul 25, 2019

Question about the reshape / view comment above: does this perform any permutation on the data to put it in NCHW order? I suppose not, as that would be very unintuitive.
But if that's the case, then I believe we won't be able to run models in different memory formats (while keeping the same pre-trained weights) without hand-modifying the weights of the Linear layer.

@VitalyFedyunin
Contributor Author

At the moment I have changed x = x.reshape(x.size(0), -1) to x = x.contiguous().reshape(x.size(0), -1) inside the model code. I suggest we do it internally (if x is channels last) and warn the user to update the model if possible. This way the pre-trained weights will match the proper positions.

@VitalyFedyunin
Contributor Author

Exactly the same applies to AlexNet: x = x.view(x.size(0), 256 * 6 * 6) -> x = x.contiguous().view(x.size(0), 256 * 6 * 6)
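
For illustration, a small sketch (recent PyTorch) of why the extra .contiguous() is needed here: view() cannot flatten the C, H, W dimensions of an NHWC-strided tensor without a copy.

import torch

x = torch.randn(2, 256, 6, 6).contiguous(memory_format=torch.channels_last)

try:
    x.view(x.size(0), 256 * 6 * 6)        # fails: the strides are not flattenable
except RuntimeError as err:
    print("view failed:", err)

flat = x.contiguous().view(x.size(0), 256 * 6 * 6)  # copy back to NCHW, then view
print(flat.shape)  # torch.Size([2, 9216])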

@fmassa
Member

fmassa commented Jul 25, 2019

What if we replace the occurrences of reshape with flatten, and leave the possibility to specify some kind of memory format in flatten itself?

@VitalyFedyunin
Contributor Author

Well, ultimately flatten is reshape and reshape is view. So regardless of the operator, we need to match the positions of the weights, and this requires exactly one memory copy via a .contiguous() call or a similar op.

The good news is:

  • At this point the tensor size is relatively small.
  • We can look at how to optimize the conversion from torch.channels_last to torch.contiguous_format (see the sketch below).
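
A rough sketch of that single copy, assuming a recent PyTorch:

import torch

# One permuting copy back to the NCHW layout before the linear layer; the
# subsequent flatten is then a free view.
x = torch.randn(32, 512, 7, 7).contiguous(memory_format=torch.channels_last)
x_nchw = x.contiguous(memory_format=torch.contiguous_format)
flat = x_nchw.view(x_nchw.size(0), -1)
print(flat.shape)  # torch.Size([32, 25088])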

@VitalyFedyunin
Contributor Author

To be honest, this point is the most questionable, but at the same time the easiest to solve one way or another, and we can move it out of the performance test task. I just wanted to write it down, since the question might come up.

@fmassa
Member

fmassa commented Jul 25, 2019

Flatten is a subset of reshape: it only merges dimensions together and can't create new dimensions.

I think it can be very mysterious to users to switch from .contiguous().reshape to just reshape and have a model silently stop working because the Linear layer expects the data laid out differently. At the very least we should make the layout in the contiguous call explicit, to hint that it is actually necessary.

But I need to think a bit more about all this

@apaszke
Contributor

apaszke commented Jul 26, 2019

I would be very careful about .clone() preserving the memory format, because it's likely that there is code out there which assumes that its output is contiguous in the traditional sense and, e.g., passes it off to C++ code.

Why should we bother with layouts in .resize_as_? That op should probably be made internal anyway. Where is it used in a ResNet?
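
For reference, a small sketch (recent PyTorch) of how code that truly needs a densely packed NCHW buffer, e.g. before handing a data pointer to external C++ code, can request it explicitly instead of relying on the .clone() default:

import torch

x = torch.randn(2, 3, 4, 4).contiguous(memory_format=torch.channels_last)

y = x.clone(memory_format=torch.contiguous_format)  # force an NCHW-contiguous copy
print(y.is_contiguous())  # True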

@soumith
Member

soumith commented Jul 26, 2019

I'm not worried about the .clone() semantics, @apaszke. For example, no user has noticed that TensorIterator-based torch.* functions went from outputs always being contiguous to outputs matching input layouts. It happened somewhere around 1.0, and we didn't receive a single bug report or forum post.
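
A quick check of that behavior on a recent PyTorch build (the exact set of layout-preserving ops at the time of this issue may have differed):

import torch

x = torch.randn(2, 3, 4, 4).contiguous(memory_format=torch.channels_last)

# Pointwise TensorIterator ops follow the input layout instead of always
# producing NCHW-contiguous outputs.
y = torch.relu(x) + 1.0
print(y.is_contiguous(memory_format=torch.channels_last))  # True
print(y.is_contiguous())                                    # False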

@soumith
Member

soumith commented Jul 26, 2019

@fmassa

I think it can be very mysterious to users to switch from .contiguous().reshape to just reshape and have a model silently stop working because the Linear layer expects the data laid out differently.

The worst case for memory formats is an extra transpose and copy. Memory formats can never give wrong results.

@fmassa
Member

fmassa commented Jul 26, 2019

@soumith I think I misunderstood what was going to happen.

Resnet calls x = x.reshape(x.size(0), -1) before linear layers, we are going to update reshape and view code and convert tensor's memory format to torch.contiguous_format at this step.

So now inside of torch.reshape there will be a prior step where x.contiguous(memory_format=torch.contiguous_format) is called?

In this case, there will be indeed no problems for the case I was thinking.

But this could also be confusing to users, as the semantics of .reshape now become more complicated and are equivalent to .permute().reshape().


But if the solution that we are proposing is to keep the behavior of reshape as is, but instead modify the model definitions to add an extra contiguous() call before reshape, then this might give unexpected results to users.

Imagine the two situations:

x = x.contiguous().reshape(x.shape[0], -1)
# or
x = x.reshape(x.shape[0], -1)

currently, the two are equivalent and one can just use the .reshape variant.

But with memory layouts, this doesn't hold true anymore.
Indeed,

x = x.contiguous().reshape(x.shape[0], -1)
# is equivalent to
x = x.contiguous(memory_format=torch.contiguous_format).reshape(x.shape[0], -1)
# but not to
x = x.reshape(x.shape[0], -1)

@soumith
Member

soumith commented Jul 26, 2019

@fmassa memory_format as an entire feature doesn't change sizes; it only changes strides.
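
A quick illustration of that point, assuming a recent PyTorch: sizes, values, and the result of reshape are unchanged; only the strides differ.

import torch

x = torch.randn(2, 3, 4, 5)
y = x.contiguous(memory_format=torch.channels_last)

print(x.shape == y.shape)                               # True
print(torch.equal(x.reshape(2, -1), y.reshape(2, -1)))  # True
print(x.stride(), y.stride())                           # (60, 20, 5, 1) (60, 1, 15, 3)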

@ngimel
Collaborator

ngimel commented Jul 28, 2019

While narrow_copy_dense would still be functionally correct with a layout-preserving .clone(), it would use more memory than expected if dim is not 0:

Tensor narrow_copy_dense(const Tensor& self, int64_t dim, int64_t start, int64_t length) {
  return self.narrow(dim, start, length).clone();
}

If a layout-preserving .clone() also preserves zero strides, then gradient accumulation will break here:

variable.grad() = new_grad.clone();

The gradient will be a zero-strided tensor, and the subsequent in-place addition of the gradient in line 81 will error out.
As far as I can tell at a quick glance, SpectralOps relies on the output of .clone() being contiguous.
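
A toy illustration of the zero-stride concern (the broadcast tensor below is only one example of how zero strides can arise in a gradient):

import torch

g = torch.zeros(3).expand(4, 3)   # stride (0, 1): every row aliases the same memory
print(g.stride())

try:
    g.add_(torch.ones(4, 3))      # what accumulation would hit if clone() kept the strides
except RuntimeError as err:
    print("in-place add failed:", err)

# Today's default clone() materializes a dense tensor, so accumulation works.
print(g.clone().add_(torch.ones(4, 3)).stride())  # (3, 1)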

@VitalyFedyunin
Contributor Author

Thanks @ngimel. That's why my hacky implementation preserved .clone() strides only if the input tensor is channels last, and I never had issues with 0-dim grads. But I will take a look at this code, as it might become important if we decide that the .clone() and .to() operations should preserve strides in all scenarios.

@jjsjann123
Collaborator

Just to put it out there for tracking who's working on what, so that we are not duplicating effort.

I'm currently looking at the last two things on the list:

  • cudnn_batch_norm and cudnn_batch_norm_backward should support channels last memory format.
  • cudnn_convolution_forward and cudnn_convolution_backward should support the channels last memory format (a rough sketch of the target setup follows).
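
An end-to-end sketch of the setup these kernels target (assumes a CUDA device with cuDNN available; the layer sizes are arbitrary):

import torch
import torch.nn as nn

if torch.cuda.is_available():
    # Both the weights and the input are converted to channels last so the cuDNN
    # convolution and batch norm kernels can operate in NHWC.
    model = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
    ).cuda().to(memory_format=torch.channels_last)

    x = torch.randn(8, 3, 224, 224, device="cuda").contiguous(memory_format=torch.channels_last)
    out = model(x)
    print(out.is_contiguous(memory_format=torch.channels_last))  # expected: True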

@jjsjann123
Collaborator

As discussed with Vitaly, I'm taking on these now:

  • adaptive_avg_pool2d_cuda and adaptive_avg_pool2d_backward_cuda should have channels last optimized kernels.

@VitalyFedyunin
Contributor Author

Moving progress tracking and extending scope in #28619
