
ROCm MIOpen NHWC Convolution support #63617

Closed

Conversation

amathews-amd (Contributor) commented Aug 19, 2021

  • Added 2D convolution NHWC support
    • on ROCm 4.3, with the PYTORCH_MIOPEN_SUGGEST_NHWC=1 flag
    • MIOpen may need to be forced to search for solutions (see the examples below for the relevant flags)

PYTORCH_MIOPEN_SUGGEST_NHWC Environment Flag
MIOpen does not officially support NHWC yet, although convolution support has been added to MIOpen's tip-of-tree. This is intended to be a short-lived flag that explicitly turns NHWC support on until ROCm officially supports NHWC and its performance is verified.

Examples

  1. Example usage 1: run the tests on ROCm 4.3:
    PYTORCH_TEST_WITH_ROCM=1 PYTORCH_MIOPEN_SUGGEST_NHWC=1 MIOPEN_FIND_ENFORCE=4 MIOPEN_DEBUG_CONV_GEMM=0 MIOPEN_FIND_MODE=1 pytest test_nn.py -v -k "test_conv_cudnn_nhwc"
  2. Example usage 2: run the following script with PYTORCH_MIOPEN_SUGGEST_NHWC=1 on ROCm 4.3:
#!/usr/bin/env python3
import torch
model = torch.nn.Conv2d(8, 4, 3).cuda().half()
model = model.to(memory_format=torch.channels_last)
input = torch.randint(1, 10, (2, 8, 4, 4), dtype=torch.float32, requires_grad=True)
input = input.to(device="cuda", memory_format=torch.channels_last, dtype=torch.float16)

# should print True for is_contiguous(channels_last), and strides must match NHWC format
print(input.is_contiguous(memory_format=torch.channels_last), input.shape, input.stride())

out = model(input)

# should print True for is_contiguous(channels_last), and strides must match NHWC format
print("Contiguous channel last :", out.is_contiguous(memory_format=torch.channels_last), " out shape :",  out.shape, "out stride :", out.stride() ) 

See https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html for more examples.

cc @jeffdaily @sunway513 @jithunnair-amd @ROCmSupport

facebook-github-bot (Contributor) commented Aug 19, 2021

💊 CI failures summary and remediations (Dr. CI)

As of commit 2ef3bfb:

💚 💚 Looks good so far! There are no failures yet. 💚 💚

github-actions bot added the module: rocm (AMD GPU support for Pytorch) label Aug 19, 2021

bool can_use_miopen_channels_last_2d = false;
#if defined(USE_ROCM) && (ROCM_VERSION >= 40300)
can_use_miopen_channels_last_2d = PYTORCH_MIOPEN_SUGGEST_NHWC && *PYTORCH_MIOPEN_SUGGEST_NHWC && (
Collaborator:

What happens when the optional is a nullopt because the user set the env var to something other than 0 or 1? Do we care about that use case and failing gracefully?

Aside: c10::utils::check_env appears to have been added recently (#59052) but is currently not used anywhere else. ATen seems to use getenv directly. Does upstream have a preference for how to parse env vars?

Contributor Author:

You can see that I am testing both PYTORCH_MIOPEN_SUGGEST_NHWC and *PYTORCH_MIOPEN_SUGGEST_NHWC: the first is the nullopt check, the second is the true/false check. As for handling other values like 'True'/'False', that's up to check_env.

Yes, c10::utils::check_env is new and not used elsewhere. I figured I would use it since the API is supported, and it might be the direction the PyTorch devs want to go.
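
(For reference, a minimal sketch of the pattern under discussion; miopen_suggest_nhwc is a hypothetical wrapper around the check shown in the diff above.)

#include <c10/util/env.h>

bool miopen_suggest_nhwc() {
  // check_env returns an optional<bool>: nullopt when the variable is
  // unset (handling of values other than 0/1 is up to check_env, as
  // noted above), so both the presence check and the dereferenced
  // value check are needed.
  static c10::optional<bool> PYTORCH_MIOPEN_SUGGEST_NHWC =
      c10::utils::check_env("PYTORCH_MIOPEN_SUGGEST_NHWC");
  return PYTORCH_MIOPEN_SUGGEST_NHWC.has_value() &&
      *PYTORCH_MIOPEN_SUGGEST_NHWC;
}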

Collaborator:

Sorry, I missed the nullopt check. Looking forward to hearing from an upstream reviewer how they'd like to handle env vars.

Contributor:

Can you explain why this feature has to be gated by an environment variable? For alternative layout support, ordinarily you would simply run the correct layout algorithm depending on the layout of the weights. By comparison, cuDNN supports alternative layouts without needing an env var to modulate them.

Contributor Author:

The reason it has to be gated is that MIOpen does not officially support NHWC yet, although convolution support has been added to MIOpen's tip-of-tree. The plan is to replace the environment-variable check with a ROCm version check once support is officially added.
(Also, the MIOpen teams do need this support in PyTorch in the meantime, since we are testing the performance of the newly added NHWC code on application workloads.)

Contributor:

OK, but it's pretty difficult to "accidentally" end up running NHWC when you didn't intend to (the weights have to be NHWC). Wouldn't you just rather error in that case for now?

Contributor:

In any case, the plan of record was not clear from either the PR description or the code.

Contributor Author:

Some application workloads request NHWC explicitly now, and I wanted them to fall back to NCHW until support is officially added and performance is verified.

I have opened a ticket to track future removal of the flag: #64427
Let me add that to the code too.

Comment on lines +1176 to +1180
def wrap_fn(self, *args, **kwargs):
    if self.device_type == 'cuda':
        if not TEST_WITH_ROCM:
            reason = "ROCm not available"
            raise unittest.SkipTest(reason)
Collaborator:

I think this will cause CUDA to skip any test using this decorator, because CUDA never sets TEST_WITH_ROCM. I think the logic you were going for was:

if self.device_type == 'cuda' and torch.version.hip is not None:
    # ROCm version parsing etc.

Contributor Author:

I am not sure how the ROCm CI sets PYTORCH_TEST_WITH_ROCM=1, but TEST_WITH_ROCM is just the internal representation of that flag (read in common_utils.py). I got the idea to use this flag from the skipCUDAIfNotRocm() function just a few lines above.

When I am testing, I am manually setting PYTORCH_TEST_WITH_ROCM, as in the description of this PR.

@jeffdaily (Collaborator) left a comment:

I'm confused why we need both skipIfRocmVersionLessThan and skipCUDAIfRocmVersionLessThan.

@amathews-amd (Contributor Author) commented Sep 1, 2021

> I'm confused why we need both skipIfRocmVersionLessThan and skipCUDAIfRocmVersionLessThan.

'device_type' is not defined for the TestNN class, but it exists for TestNNDeviceType (likely because of the instantiate_device_type_tests(TestNNDeviceType, globals()) line in test_nn.py), so we can't just use skipCUDAIfRocmVersionLessThan everywhere.

Can we use just skipIfRocmVersionLessThan? Possibly, since the applicable tests are wrapped in @onlyCUDA anyway. However, that goes against the design of the skip decorators and would also skip on CPU (if NHWC support is ever added on CPU), so I am making the conservative choice here.

// Make sure that NC11 strides follow formula
bias_contig.resize_(bias_contig.sizes(), memory_format );

// TODO: Workaround since MIOpen does not support NHWC bias
Contributor Author:

Note: an internal ticket has been opened against frameworks/PyTorch to fix this later:
https://ontrack-internal.amd.com/browse/SWDEV-301466
See that ticket for the linked MIOpen ticket on NHWC bias support.

Contributor:

Link the ticket in the code.

      // Pass-through
      stride[i] = t.stride(i);
    }
  }
Contributor:

I'm confused by this. Why don't you just read the strides out of the tensor instead of recomputing them here? (Does MIOpen require some specific canonical form of strides?)

Contributor Author:

For the pad dimensions, the stride needs to be set to 1, so this addition is needed.
In the ChannelsLast path, the new code essentially does what you propose, i.e., it reads the strides out of the tensor.
In the NCHW (and other) path, on line 117, I kept the recompute only because the original code recomputed. I don't think we need to recompute; if you agree, I can change both paths to simply pass the strides through.
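
(For reference, a minimal sketch of the pass-through-plus-pad behavior described above; fill_sizes_and_strides is a hypothetical helper, not the PR's exact code.)

#include <ATen/ATen.h>

// Hypothetical helper: real dimensions pass the tensor's sizes and strides
// through unchanged, while the extra padded dimensions get size 1 and
// stride 1, which is what the descriptor expects for the pad dims.
void fill_sizes_and_strides(const at::Tensor& t, int64_t pad,
                            int* size, int* stride) {
  const int64_t dim = t.dim();
  for (int64_t i = 0; i < dim; ++i) {
    size[i] = static_cast<int>(t.size(i));
    stride[i] = static_cast<int>(t.stride(i)); // pass-through
  }
  for (int64_t i = dim; i < pad; ++i) {
    size[i] = 1;
    stride[i] = 1; // pad dimensions need stride 1
  }
}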

Contributor:

OK, I see. You don't have to fix the original code in this PR, but I would like to see it fixed (unless there is a reason to recompute, in which case the code should have a comment explaining why).

Contributor Author:

I decided to fix the original code; tested locally, and test_nn passes. Let's see if the CI tests pass. If they do, we are good.

@@ -90,17 +90,17 @@ std::ostream& operator<<(std::ostream & out, const TensorDescriptor& d) {

void TensorDescriptor::print() { std::cout << *this; }

void FilterDescriptor::set(const at::Tensor &t, int64_t pad) {
void FilterDescriptor::set(const at::Tensor &t, const at::MemoryFormat memory_format, int64_t pad) {
Contributor:

I actually think passing memory format here explicitly is kind of suspect, but it is symmetric with cuDNN so I'll let it slide.

Contributor Author:

Yes, I referred to the cuDNN implementation and tried to match it as much as possible.

@@ -109,9 +109,25 @@ void FilterDescriptor::set(const at::Tensor &t, int64_t pad) {
  for (int i = dim; i < pad; ++i) {
    size[i] = (int) 1;
  }
  for (int i = dim - 1; i >= 0; --i) {
    stride[i] = (i == dim - 1) ? 1 : stride[i+1] * size[i+1];
  if( memory_format != at::MemoryFormat::ChannelsLast ) {
Contributor:

Don't do this. Do a switch on the memory format and explicitly error if it is an unexpected format; this will keep the code robust if a new memory format gets added.
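
(For reference, a minimal sketch of the suggested switch; the function name and error message are illustrative.)

#include <c10/core/MemoryFormat.h>
#include <c10/util/Exception.h>

void set_strides_for(c10::MemoryFormat memory_format /* , ... */) {
  switch (memory_format) {
    case c10::MemoryFormat::Contiguous:
      // ... packed NCHW stride computation ...
      break;
    case c10::MemoryFormat::ChannelsLast:
      // ... read NHWC strides from the tensor ...
      break;
    default:
      // Explicitly reject anything unexpected so that a newly added
      // memory format fails loudly instead of silently taking a branch.
      TORCH_CHECK(false, "Unsupported memory format ", memory_format);
  }
}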

Contributor Author:

Updated; this is gone.

shape[output_channels_dim] = -1;
at::Tensor bias_contig = bias->reshape(shape).contiguous(memory_format);
// Make sure that NC11 strides follow formula
bias_contig.resize_(bias_contig.sizes(), memory_format );
Contributor:

This looks totally unnecessary

Contributor Author:

I agree that it's not optimal to keep doing this; I was following the pattern in the cuDNN implementation.

The memory format tutorial (https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html) describes this issue:

> For general cases the two APIs behave the same. However in special cases for a 4D tensor with size NCHW when either: C==1 or H==1 && W==1, only `to` would generate a proper stride to represent channels last memory format.
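
(For reference, a minimal standalone sketch of that degenerate case; shapes are illustrative.)

#include <ATen/ATen.h>
#include <iostream>

int main() {
  // For shape (2, 4, 1, 1), the packed NCHW strides (4, 1, 1, 1) already
  // pass the channels-last contiguity check (size-1 dims are ambiguous),
  // so contiguous(ChannelsLast) may return the tensor unchanged. resize_
  // with an explicit memory format forces the canonical NHWC strides
  // (4, 1, 4, 4), which is what the resize_ call above is for.
  auto bias_contig =
      at::randn({2, 4, 1, 1}).contiguous(at::MemoryFormat::ChannelsLast);
  bias_contig.resize_(bias_contig.sizes(), at::MemoryFormat::ChannelsLast);
  for (auto s : bias_contig.strides()) std::cout << s << " ";
  std::cout << std::endl;
  return 0;
}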

memory_format = (weight->ndimension() == 5) ? /*at::MemoryFormat::ChannelsLast3d*/at::MemoryFormat::Contiguous : at::MemoryFormat::ChannelsLast;
}

auto output_t = at::native::empty_cuda(
Contributor:

What's going on here?

Contributor:

It looks like this was cargo-culted from ConvShared.cpp.

Contributor Author:

This empty_cuda function sets the strides correctly given the memory layout, which is why I used it.
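
(For reference, a minimal sketch using the public at::empty equivalent, which should behave the same way with respect to strides; the PR calls at::native::empty_cuda directly.)

#include <ATen/ATen.h>

at::Tensor make_nhwc_output(int64_t n, int64_t c, int64_t h, int64_t w) {
  // Allocating with an explicit memory format produces the canonical NHWC
  // strides up front, rather than allocating NCHW and restriding afterwards.
  return at::empty(
      {n, c, h, w},
      at::TensorOptions().dtype(at::kHalf).device(at::kCUDA),
      at::MemoryFormat::ChannelsLast);
}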


Tensor outputBias = at::squeeze( at::sum(grad_output_t, discard_dims, true) );
if( outputBias.dim() == 0 ) {
// always return a tensor of shape [_]
Contributor:

How come?

Contributor Author:

If the result has just one element and one dim, at::squeeze will make the result a (0-dim) scalar. There are tests in test_nn.py that check this return value against a tensor of shape [1].
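
(A minimal sketch of that degenerate case; the reshape illustrates the shape-[1] guarantee and is not necessarily the PR's exact fix.)

#include <ATen/ATen.h>

at::Tensor bias_grad_1d(const at::Tensor& summed) {
  // at::squeeze on a single-element tensor drops every size-1 dim and
  // yields a 0-dim scalar; reshape it back so callers always receive a
  // 1-D tensor of shape [1].
  at::Tensor outputBias = at::squeeze(summed);
  if (outputBias.dim() == 0) {
    outputBias = outputBias.reshape({1});
  }
  return outputBias;
}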

@ezyang (Contributor) commented Sep 2, 2021

This looks OK, but I'm skeptical about the environment variable.

@facebook-github-bot (Contributor):

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang (Contributor) commented Sep 8, 2021

Sorry about the delay. Need some warning cleanup in this PR:

caffe2/aten/src/ATen/native/ConvUtils.h:114:30: error: unused variable 'PYTORCH_MIOPEN_SUGGEST_NHWC' [-Werror,-Wunused-variable]
  static c10::optional<bool> PYTORCH_MIOPEN_SUGGEST_NHWC = c10::utils::check_env("PYTORCH_MIOPEN_SUGGEST_NHWC");
                             ^
caffe2/aten/src/ATen/native/ConvUtils.h:123:8: error: unused variable 'input_memory_format' [-Werror,-Wunused-variable]
  auto input_memory_format = input.suggest_memory_format();
       ^
caffe2/aten/src/ATen/native/ConvUtils.h:124:8: error: unused variable 'weight_memory_format' [-Werror,-Wunused-variable]
  auto weight_memory_format = weight.suggest_memory_format();

@amathews-amd (Contributor Author):

> Sorry about the delay. Need some warning cleanup in this PR: […]

Fixed, and merged upstream to branch.

@facebook-github-bot (Contributor):

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor):

@ezyang merged this pull request in 63b180b.

Labels: cla signed, Merged, module: rocm (AMD GPU support for Pytorch), open source, triaged

6 participants