
ROCm MIOpen NHWC Convolution support #63617

Closed

Conversation

amathews-amd (Contributor) commented Aug 19, 2021

  • Added 2D convolution NHWC support
    • on ROCm 4.3, with the PYTORCH_MIOPEN_SUGGEST_NHWC=1 flag
    • MIOpen may need to be forced to search for solutions (see the examples below for the relevant flags)

PYTORCH_MIOPEN_SUGGEST_NHWC Environment Flag
MIOpen does not officially support NHWC yet, although convolution support has been added to MIOpen's tip-of-tree. This is intended to be a short-lived flag that explicitly turns NHWC support on until ROCm officially supports NHWC and its performance is verified.

Examples

  1. Example usage 1: run the tests on ROCm 4.3:
    PYTORCH_TEST_WITH_ROCM=1 PYTORCH_MIOPEN_SUGGEST_NHWC=1 MIOPEN_FIND_ENFORCE=4 MIOPEN_DEBUG_CONV_GEMM=0 MIOPEN_FIND_MODE=1 pytest test_nn.py -v -k "test_conv_cudnn_nhwc"
  2. Example usage 2: run the following script with PYTORCH_MIOPEN_SUGGEST_NHWC=1 on ROCm 4.3:
#!/usr/bin/env python3
import torch
model = torch.nn.Conv2d(8, 4, 3).cuda().half()
model = model.to(memory_format=torch.channels_last)
input = torch.randint(1, 10, (2, 8, 4, 4), dtype=torch.float32, requires_grad=True)
input = input.to(device="cuda", memory_format=torch.channels_last, dtype=torch.float16)

# should print True for is_contiguous(channels_last), and strides must match NHWC format
print(input.is_contiguous(memory_format=torch.channels_last), input.shape, input.stride())

out = model(input)

# should print True for is_contiguous(channels_last), and strides must match NHWC format
print("Contiguous channel last :", out.is_contiguous(memory_format=torch.channels_last), " out shape :",  out.shape, "out stride :", out.stride() ) 

See https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html for more examples.

cc @jeffdaily @sunway513 @jithunnair-amd @ROCmSupport

facebook-github-bot (Contributor) commented Aug 19, 2021

💊 CI failures summary and remediations (Dr. CI)

As of commit 2ef3bfb:

💚 💚 Looks good so far! There are no failures yet. 💚 💚

github-actions bot added the module: rocm (AMD GPU support for Pytorch) label Aug 19, 2021

bool can_use_miopen_channels_last_2d = false;
#if defined(USE_ROCM) && (ROCM_VERSION >= 40300)
can_use_miopen_channels_last_2d = PYTORCH_MIOPEN_SUGGEST_NHWC && *PYTORCH_MIOPEN_SUGGEST_NHWC && (
Collaborator:

What happens when the optional is a nullopt because the user set the env var to something other than 0 or 1? Do we care about that use case and failing gracefully?

Aside: c10::utils::check_env appears to have been added recently (#59052) but is currently not used anywhere else. ATen seems to use getenv directly. Does upstream have a preference for how to parse env vars?

Contributor Author:

You can see that I am testing both PYTORCH_MIOPEN_SUGGEST_NHWC and *PYTORCH_MIOPEN_SUGGEST_NHWC: the first is the nullopt check, the second is the true/false check. As for handling other values like 'True'/'False', that's up to check_env.

Yes, c10::utils::check_env is new and not used elsewhere. I figured I would use it since the API is supported, and it might be the direction the PyTorch devs want to go.
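
(For reference, a minimal sketch of the pattern under discussion; miopen_suggest_nhwc is a hypothetical wrapper around the check shown in the diff above.)

#include <c10/util/env.h>

bool miopen_suggest_nhwc() {
  // check_env returns an optional<bool>: nullopt when the variable is
  // unset (handling of values other than 0/1 is up to check_env, as
  // noted above), so both the presence check and the dereferenced
  // value check are needed.
  static c10::optional<bool> PYTORCH_MIOPEN_SUGGEST_NHWC =
      c10::utils::check_env("PYTORCH_MIOPEN_SUGGEST_NHWC");
  return PYTORCH_MIOPEN_SUGGEST_NHWC.has_value() &&
      *PYTORCH_MIOPEN_SUGGEST_NHWC;
}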

Collaborator:

Sorry, I missed the nullopt check. Looking forward to hearing from an upstream reviewer how they'd like to handle env vars.

Contributor:

Can you explain why this feature has to be gated by an environment variable? For alternative layout support, ordinarily you would simply run the correct layout algorithm depending on the layout of the weights. By comparison, cuDNN supports alternative layouts without needing an env var to modulate them.

Contributor Author:

The reason it has to be gated is that MIOpen does not officially support NHWC yet, although convolution support has been added to MIOpen's tip-of-tree. The plan is to replace the environment-variable check with a ROCm version check once support is officially added.
(Also, the MIOpen teams do need this support in PyTorch in the meantime, since we are testing the performance of the newly added NHWC code on application workloads.)

Contributor:

OK, but it's pretty difficult to "accidentally" end up running NHWC when you didn't intend to (the weights have to be NHWC). Wouldn't you just rather error in that case for now?

Contributor:

In any case, the plan of record was not clear from either the PR description or the code.

Contributor Author:

Some application workloads request NHWC explicitly now, and I wanted them to fall back to NCHW until support is officially added and performance is verified.

I have opened a ticket to track future removal of the flag: #64427
Let me add that to the code too.

Comment on lines +1176 to +1180
def wrap_fn(self, *args, **kwargs):
    if self.device_type == 'cuda':
        if not TEST_WITH_ROCM:
            reason = "ROCm not available"
            raise unittest.SkipTest(reason)
Collaborator:

I think this will cause CUDA to skip any test using this decorator, because CUDA never sets TEST_WITH_ROCM. I think the logic you were going for was:

if self.device_type == 'cuda' and torch.version.hip is not None:
    # ROCm version parsing etc.

Contributor Author:

I am not sure how the ROCm CI sets PYTORCH_TEST_WITH_ROCM=1, but TEST_WITH_ROCM is just the internal representation of that flag (read in common_utils.py). I got the idea to use this flag from the skipCUDAIfNotRocm() function just a few lines above.

When I am testing, I am manually setting PYTORCH_TEST_WITH_ROCM, as in the description of this PR.

@jeffdaily (Collaborator) left a comment:

I'm confused why we need both skipIfRocmVersionLessThan and skipCUDAIfRocmVersionLessThan.

@amathews-amd (Contributor Author) commented Sep 1, 2021

> I'm confused why we need both skipIfRocmVersionLessThan and skipCUDAIfRocmVersionLessThan.

'device_type' is not defined for the TestNN class, but it exists for TestNNDeviceType (likely because of the instantiate_device_type_tests(TestNNDeviceType, globals()) line in test_nn.py), so we can't just use skipCUDAIfRocmVersionLessThan everywhere.

Can we use just skipIfRocmVersionLessThan? Possibly, since the applicable tests are wrapped in @onlyCUDA anyway. However, that goes against the design of the skip decorators and would also skip on CPU (if NHWC support is ever added on CPU), so I am making the conservative choice here.

// Make sure that NC11 strides follow formula
bias_contig.resize_(bias_contig.sizes(), memory_format );

// TODO: Workaround since MIOpen does not support NHWC bias
Contributor Author:

Note: an internal ticket has been opened against frameworks/PyTorch to fix this later:
https://ontrack-internal.amd.com/browse/SWDEV-301466
See that ticket for the linked MIOpen ticket on NHWC bias support.

Contributor:

Link the ticket in the code.

      // Pass-through
      stride[i] = t.stride(i);
    }
  }
Contributor:

I'm confused by this. Why don't you just read the strides out of the tensor instead of recomputing them here? (Does MIOpen require some specific canonical form of strides?)

Contributor Author:

For the pad dimensions, the stride needs to be set to 1, so this addition is needed.
In the ChannelsLast path, the new code essentially does what you propose, i.e., it reads the strides out of the tensor.
In the NCHW (and other) path, on line 117, I kept the recompute only because the original code recomputed. I don't think we need to recompute; if you agree, I can change both paths to simply pass the strides through.
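
(For reference, a minimal sketch of the pass-through-plus-pad behavior described above; fill_sizes_and_strides is a hypothetical helper, not the PR's exact code.)

#include <ATen/ATen.h>

// Hypothetical helper: real dimensions pass the tensor's sizes and strides
// through unchanged, while the extra padded dimensions get size 1 and
// stride 1, which is what the descriptor expects for the pad dims.
void fill_sizes_and_strides(const at::Tensor& t, int64_t pad,
                            int* size, int* stride) {
  const int64_t dim = t.dim();
  for (int64_t i = 0; i < dim; ++i) {
    size[i] = static_cast<int>(t.size(i));
    stride[i] = static_cast<int>(t.stride(i)); // pass-through
  }
  for (int64_t i = dim; i < pad; ++i) {
    size[i] = 1;
    stride[i] = 1; // pad dimensions need stride 1
  }
}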

Contributor:

OK, I see. You don't have to fix the original code in this PR, but I would like to see it fixed (unless there is a reason to recompute, in which case the code should have a comment explaining why).

Contributor Author:

I decided to fix the original code; tested locally, and test_nn passes. Let's see if the CI tests pass. If they do, we are good.

@@ -90,17 +90,17 @@ std::ostream& operator<<(std::ostream & out, const TensorDescriptor& d) {

void TensorDescriptor::print() { std::cout << *this; }

void FilterDescriptor::set(const at::Tensor &t, int64_t pad) {
void FilterDescriptor::set(const at::Tensor &t, const at::MemoryFormat memory_format, int64_t pad) {
Contributor:

I actually think passing memory format here explicitly is kind of suspect, but it is symmetric with cuDNN so I'll let it slide.

Contributor Author:

Yes, I referred to the cuDNN implementation and tried to match it as much as possible.

@@ -109,9 +109,25 @@ void FilterDescriptor::set(const at::Tensor &t, int64_t pad) {
  for (int i = dim; i < pad; ++i) {
    size[i] = (int) 1;
  }
  for (int i = dim - 1; i >= 0; --i) {
    stride[i] = (i == dim - 1) ? 1 : stride[i+1] * size[i+1];
  if( memory_format != at::MemoryFormat::ChannelsLast ) {
Contributor:

Don't do this. Do a switch on the memory format and explicitly error if it is an unexpected format; this will keep the code robust if a new memory format gets added.
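
(For reference, a minimal sketch of the suggested switch; the function name and error message are illustrative.)

#include <c10/core/MemoryFormat.h>
#include <c10/util/Exception.h>

void set_strides_for(c10::MemoryFormat memory_format /* , ... */) {
  switch (memory_format) {
    case c10::MemoryFormat::Contiguous:
      // ... packed NCHW stride computation ...
      break;
    case c10::MemoryFormat::ChannelsLast:
      // ... read NHWC strides from the tensor ...
      break;
    default:
      // Explicitly reject anything unexpected so that a newly added
      // memory format fails loudly instead of silently taking a branch.
      TORCH_CHECK(false, "Unsupported memory format ", memory_format);
  }
}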

Contributor Author:

Updated; this is gone.

shape[output_channels_dim] = -1;
at::Tensor bias_contig = bias->reshape(shape).contiguous(memory_format);
// Make sure that NC11 strides follow formula
bias_contig.resize_(bias_contig.sizes(), memory_format );
Contributor:

This looks totally unnecessary

Contributor Author:

I agree that it's not optimal to keep doing this; I was following the pattern in the cuDNN implementation.

The memory format tutorial (https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html) describes this issue:

> For general cases the two APIs behave the same. However in special cases for a 4D tensor with size NCHW when either: C==1 or H==1 && W==1, only `to` would generate a proper stride to represent channels last memory format.
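
(For reference, a minimal standalone sketch of that degenerate case; shapes are illustrative.)

#include <ATen/ATen.h>
#include <iostream>

int main() {
  // For shape (2, 4, 1, 1), the packed NCHW strides (4, 1, 1, 1) already
  // pass the channels-last contiguity check (size-1 dims are ambiguous),
  // so contiguous(ChannelsLast) may return the tensor unchanged. resize_
  // with an explicit memory format forces the canonical NHWC strides
  // (4, 1, 4, 4), which is what the resize_ call above is for.
  auto bias_contig =
      at::randn({2, 4, 1, 1}).contiguous(at::MemoryFormat::ChannelsLast);
  bias_contig.resize_(bias_contig.sizes(), at::MemoryFormat::ChannelsLast);
  for (auto s : bias_contig.strides()) std::cout << s << " ";
  std::cout << std::endl;
  return 0;
}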

memory_format = (weight->ndimension() == 5) ? /*at::MemoryFormat::ChannelsLast3d*/at::MemoryFormat::Contiguous : at::MemoryFormat::ChannelsLast;
}

auto output_t = at::native::empty_cuda(
Contributor:

What's going on here?

Contributor:

It looks like this was cargo-culted from ConvShared.cpp.

Contributor Author:

This empty_cuda function sets the strides correctly given the memory layout, which is why I used it.
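
(For reference, a minimal sketch using the public at::empty equivalent, which should behave the same way with respect to strides; the PR calls at::native::empty_cuda directly.)

#include <ATen/ATen.h>

at::Tensor make_nhwc_output(int64_t n, int64_t c, int64_t h, int64_t w) {
  // Allocating with an explicit memory format produces the canonical NHWC
  // strides up front, rather than allocating NCHW and restriding afterwards.
  return at::empty(
      {n, c, h, w},
      at::TensorOptions().dtype(at::kHalf).device(at::kCUDA),
      at::MemoryFormat::ChannelsLast);
}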


Tensor outputBias = at::squeeze( at::sum(grad_output_t, discard_dims, true) );
if( outputBias.dim() == 0 ) {
// always return a tensor of shape [_]
Contributor:

How come?

Contributor Author:

If the result has just one element and one dim, at::squeeze will make the result a (0-dim) scalar. There are tests in test_nn.py that check this return value against a tensor of shape [1].
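
(A minimal sketch of that degenerate case; the reshape illustrates the shape-[1] guarantee and is not necessarily the PR's exact fix.)

#include <ATen/ATen.h>

at::Tensor bias_grad_1d(const at::Tensor& summed) {
  // at::squeeze on a single-element tensor drops every size-1 dim and
  // yields a 0-dim scalar; reshape it back so callers always receive a
  // 1-D tensor of shape [1].
  at::Tensor outputBias = at::squeeze(summed);
  if (outputBias.dim() == 0) {
    outputBias = outputBias.reshape({1});
  }
  return outputBias;
}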

@ezyang (Contributor) commented Sep 2, 2021

This looks OK, but I'm skeptical about the environment variable.

@facebook-github-bot (Contributor):

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang (Contributor) commented Sep 8, 2021

Sorry about the delay. Need some warning cleanup in this PR:

caffe2/aten/src/ATen/native/ConvUtils.h:114:30: error: unused variable 'PYTORCH_MIOPEN_SUGGEST_NHWC' [-Werror,-Wunused-variable]
  static c10::optional<bool> PYTORCH_MIOPEN_SUGGEST_NHWC = c10::utils::check_env("PYTORCH_MIOPEN_SUGGEST_NHWC");
                             ^
caffe2/aten/src/ATen/native/ConvUtils.h:123:8: error: unused variable 'input_memory_format' [-Werror,-Wunused-variable]
  auto input_memory_format = input.suggest_memory_format();
       ^
caffe2/aten/src/ATen/native/ConvUtils.h:124:8: error: unused variable 'weight_memory_format' [-Werror,-Wunused-variable]
  auto weight_memory_format = weight.suggest_memory_format();

@amathews-amd (Contributor Author):

> Sorry about the delay. Need some warning cleanup in this PR: […]

Fixed, and merged upstream to branch.

@facebook-github-bot (Contributor):

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor):

@ezyang merged this pull request in 63b180b.

Labels: cla signed, Merged, module: rocm (AMD GPU support for Pytorch), open source, triaged

6 participants