ROCm MIOpen NHWC Convolution support #63617
Conversation
💊 CI failures summary (Dr. CI): As of commit 2ef3bfb, there are no failures yet.
…onvolution that does not work on AMD gpus

bool can_use_miopen_channels_last_2d = false;
#if defined(USE_ROCM) && (ROCM_VERSION >= 40300)
can_use_miopen_channels_last_2d = PYTORCH_MIOPEN_SUGGEST_NHWC && *PYTORCH_MIOPEN_SUGGEST_NHWC && (
What happens when the optional is a nullopt because the user set the env var to something other than 0 or 1? Do we care about that use case and failing gracefully?
Aside: The c10::utils::check_env appears to have been added recently (#59052), but is currently not used anywhere else. ATen seems to use getenv directly. Does upstream have a preference how to parse env vars?
You can see that I am testing both PYTORCH_MIOPEN_SUGGEST_NHWC and *PYTORCH_MIOPEN_SUGGEST_NHWC. The first is the nullopt check; the second is the true/false check. Handling other values like 'True'/'False' is up to check_env.
Yes, c10::utils::check_env is new and not used elsewhere yet. I figured I would use it since the API is supported, and it might be the direction the PyTorch devs want to go.
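To make the tri-state discussed above concrete, here is a minimal Python sketch of the semantics being assumed for c10::utils::check_env: "1" maps to true, "0" maps to false, and anything else (including unset) maps to the nullopt case. The exact C++ behaviour for unrecognized values may differ; this is an illustration of the two-step guard `FLAG && *FLAG`, not the actual ATen code.

```python
import os
from typing import Optional

def check_env(name: str) -> Optional[bool]:
    """Sketch of the assumed check_env semantics: True for "1",
    False for "0", None (the nullopt case) otherwise."""
    value = os.environ.get(name)
    if value == "1":
        return True
    if value == "0":
        return False
    return None

def suggest_nhwc() -> bool:
    """The C++ guard `FLAG && *FLAG`: first the nullopt check,
    then the true/false check."""
    flag = check_env("PYTORCH_MIOPEN_SUGGEST_NHWC")
    return flag is not None and flag
```

With these semantics, an unparseable value like "true" falls through to the nullopt case and NHWC stays disabled, which is the graceful-failure behaviour the review question is probing.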
Sorry, I missed the nullopt check. Looking forward to hearing from an upstream reviewer how they'd like to handle env vars.
Can you explain why this feature has to be gated by an environment variable? For alternative layout support, ordinarily you would simply run the correct layout algorithm depending on what the layout of the weights is. To compare, cuDNN supports alternative layouts without needing an env var to modulate.
The reason it has to be gated is that MIOpen does not officially support NHWC yet, although convolution support has been added to MIOpen's tip-of-tree. The plan is to replace the environment variable check with a ROCm version check once support is officially added.
(Also, the MIOpen teams do need this support in PyTorch during this time, since we are testing performance of the newly added NHWC code on application loads.)
OK, but it's pretty difficult to "accidentally" end up running NHWC when you didn't intend to (the weights have to be NHWC). Wouldn't you just rather error in that case for now?
In any case, the plan on record was not clear from the PR description nor the code
Some of the application loads request NHWC explicitly now, and I wanted it to fall back on NCHW until support is officially added, and performance verified.
I have opened a ticket to track future removal of the flag: #64427
Let me add that to the code too.
def wrap_fn(self, *args, **kwargs):
    if self.device_type == 'cuda':
        if not TEST_WITH_ROCM:
            reason = "ROCm not available"
            raise unittest.SkipTest(reason)
I think this will cause CUDA to skip any test using this decorator because CUDA never sets TEST_WITH_ROCM. I think the logic you were going for was
if self.device_type == 'cuda' and torch.version.hip is not None:
    # rocm version parsing etc.
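A minimal sketch of the decorator logic being suggested, without importing torch: a hypothetical `hip_version` attribute on the test instance stands in for torch.version.hip (None on a CUDA build, a version tuple on a ROCm build). This is an illustration of the suggested control flow, not the actual common_device_type.py code.

```python
import functools
import unittest

def skipCUDAIfRocmVersionLessThan(version):
    """Only consider skipping when the test targets 'cuda' AND the
    build is actually ROCm/HIP; a plain CUDA build (hip_version is
    None) runs the test unconditionally."""
    def dec_fn(fn):
        @functools.wraps(fn)
        def wrap_fn(self, *args, **kwargs):
            if self.device_type == 'cuda' and self.hip_version is not None:
                if self.hip_version < version:
                    raise unittest.SkipTest(
                        "ROCm {} or newer is required".format(version))
            return fn(self, *args, **kwargs)
        return wrap_fn
    return dec_fn
```

On a CUDA build the inner condition is never true, so the decorator no longer skips every CUDA test the way the TEST_WITH_ROCM check would.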
I am not sure how the ROCm CI sets PYTORCH_TEST_WITH_ROCM=1, but TEST_WITH_ROCM is just the internal representation of that flag (read in common_utils.py). I got the idea to use this flag from the skipCUDAIfNotRocm() function just a few lines above.
When I am testing, I manually set PYTORCH_TEST_WITH_ROCM, as in the description of this PR.
I'm confused why we need both skipIfRocmVersionLessThan and skipCUDAIfRocmVersionLessThan.
'device_type' is not defined for the class TestNN, but it exists for TestNNDeviceType (possibly due to …). Can we just use …?
// Make sure that NC11 strides follow formula
bias_contig.resize_(bias_contig.sizes(), memory_format );

// TODO: Workaround since MIOpen does not support NHWC bias
Note: An internal ticket has been opened against frameworks/pytorch to fix this later:
https://ontrack-internal.amd.com/browse/SWDEV-301466
See that ticket for the linked MIOpen ticket on NHWC bias support.
link the ticket in the code
// Pass-through
stride[i] = t.stride(i);
  }
}
I'm confused by this. Why don't you just read out the strides from the tensor, instead of recomputing them here? (Does miopen require some specific canonical form of strides?)
For the padded dimensions, the stride needs to be set to 1, so this addition is needed.
In the ChannelsLast code path, the new code is essentially doing what you propose, i.e., reading the strides out of the tensor.
In the NCHW (and other) code, on line 117, I kept the recompute only because the original code was recomputing. I don't think we need to recompute; if you agree, I can change both code paths to just pass through.
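The recompute-versus-pass-through distinction can be sketched in a few lines of Python (hypothetical helper names; the actual descriptor code is C++). For a contiguous NCHW tensor the two approaches produce identical strides, which is why the recompute is redundant; the padded trailing dimensions get size and stride 1 either way.

```python
def dense_strides(sizes):
    """Recompute contiguous (NCHW-style) strides from sizes,
    as the original descriptor code does."""
    strides = [0] * len(sizes)
    running = 1
    for i in range(len(sizes) - 1, -1, -1):
        strides[i] = running
        running *= sizes[i]
    return strides

def pad_descriptor(sizes, strides, pad):
    """Pass the tensor's own strides through, padding both sizes
    and strides with 1 up to `pad` dimensions."""
    extra = max(0, pad - len(sizes))
    return list(sizes) + [1] * extra, list(strides) + [1] * extra
```

For a contiguous tensor, `pad_descriptor(sizes, dense_strides(sizes), pad)` and passing the tensor's real strides through give the same result; only for non-contiguous layouts (like channels-last) do they diverge, which is the point of the new pass-through path.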
OK, I see. You don't have to fix the original code in this PR, but I would like to see it fixed (unless there is a reason to recompute, in which case it should be commented why).
I decided to fix the original code; tested locally and test_nn passes. Let's see if the CI tests pass. If they do, we are good.
@@ -90,17 +90,17 @@ std::ostream& operator<<(std::ostream & out, const TensorDescriptor& d) {

void TensorDescriptor::print() { std::cout << *this; }

void FilterDescriptor::set(const at::Tensor &t, int64_t pad) {
void FilterDescriptor::set(const at::Tensor &t, const at::MemoryFormat memory_format, int64_t pad) {
I actually think passing memory format here explicitly is kind of suspect, but it is symmetric with cuDNN so I'll let it slide.
Yes, I referred to the cuDNN implementation and tried to match it as much as possible.
aten/src/ATen/miopen/Descriptors.cpp (Outdated)
@@ -109,9 +109,25 @@ void FilterDescriptor::set(const at::Tensor &t, int64_t pad) {
for (int i = dim; i < pad; ++i) {
  size[i] = (int) 1;
}
for (int i = dim - 1; i >=0; --i) {
  stride[i] = (i == dim - 1) ? 1 : stride[i+1] * size[i+1];
if( memory_format != at::MemoryFormat::ChannelsLast ) {
Don't do this. Do a switch on the memory format and explicitly error if it is an unexpected format; this will make the code robust if a new memory format gets added.
Updated; this is gone.
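The pattern the reviewer asked for can be sketched in Python (a 4D illustration with string-valued formats; the real code switches on at::MemoryFormat in C++): branch explicitly on each supported format and raise on anything else, so a newly added format fails loudly instead of silently falling into the NCHW branch.

```python
def filter_strides(sizes, memory_format):
    """Canonical filter strides for a 4D tensor, with an explicit
    error on unexpected formats (sketch of the suggested structure,
    not the actual Descriptors.cpp code)."""
    n, c, h, w = sizes
    if memory_format == "contiguous":       # NCHW
        return (c * h * w, h * w, w, 1)
    elif memory_format == "channels_last":  # NHWC
        return (h * w * c, 1, w * c, c)
    else:
        raise ValueError("unsupported memory format: %r" % (memory_format,))
```

Compared with the original `if format != ChannelsLast` test, an unknown format (e.g. a future ChannelsLast3d path routed here by mistake) raises immediately rather than being treated as NCHW.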
shape[output_channels_dim] = -1;
at::Tensor bias_contig = bias->reshape(shape).contiguous(memory_format);
// Make sure that NC11 strides follow formula
bias_contig.resize_(bias_contig.sizes(), memory_format );
This looks totally unnecessary
I agree that it's not optimal to keep doing this; I was following the pattern in the cuDNN implementation.
The NHWC tutorial (https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html) describes this issue: "For general cases the two APIs behave the same. However, in special cases for a 4D tensor with size NCHW, when either C==1 or H==1 && W==1, only `to` would generate a proper stride to represent channels last memory format."
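The degenerate NC11 case can be seen by computing the two canonical stride formulas by hand (plain Python arithmetic; no PyTorch involved). For a bias reshaped to, say, (1, 8, 1, 1), the size-1 dimensions never advance the data pointer, so both stride sets describe the same memory and `contiguous(memory_format)` may legitimately return either one; the explicit `resize_` forces the canonical channels-last formula that downstream `is_channels_last`-style checks expect.

```python
def contiguous_strides(n, c, h, w):
    # canonical NCHW (contiguous) strides
    return (c * h * w, h * w, w, 1)

def channels_last_strides(n, c, h, w):
    # canonical NHWC (channels-last) strides -- the "formula"
    # the resize_ call in the PR enforces
    return (h * w * c, 1, w * c, c)
```

For (1, 8, 1, 1) the formulas give (8, 1, 1, 1) versus (8, 1, 8, 8): same element addressing, different stride metadata, hence the ambiguity the tutorial warns about.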
memory_format = (weight->ndimension() == 5) ? /*at::MemoryFormat::ChannelsLast3d*/at::MemoryFormat::Contiguous : at::MemoryFormat::ChannelsLast;
}

auto output_t = at::native::empty_cuda(
What's going on here?
It looks like this was cargo culted from ConvShared.cpp
This empty_cuda function sets the strides correctly given the memory layout, which is why I used it.
Tensor outputBias = at::squeeze( at::sum(grad_output_t, discard_dims, true) );
if( outputBias.dim() == 0 ) {
  // always return a tensor of shape [_]
How come?
If the result is a single element in a single dimension, at::squeeze will turn it into a 0-dim scalar. There are tests in test_nn.py that check this return value against a tensor of shape [1].
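The shape behaviour being guarded can be sketched with plain shape tuples (hypothetical helper names; at::squeeze operates on real tensors): squeezing drops every size-1 dimension, so a one-element reduction collapses to a 0-dim scalar, and the guard restores shape (1,).

```python
def squeeze_shape(shape):
    """Drop every size-1 dimension, as at::squeeze does."""
    return tuple(d for d in shape if d != 1)

def bias_grad_shape(reduced_shape):
    """The guard from the PR: if squeezing left a 0-dim scalar,
    return shape (1,) so it matches a [1]-shaped expected tensor."""
    squeezed = squeeze_shape(reduced_shape)
    return squeezed if squeezed else (1,)
```

A bias-gradient reduction over a single output channel yields shape (1, 1, 1, 1); without the guard it would squeeze to (), failing a comparison against a tensor of shape [1].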
This looks OK, but I'm skeptical about the environment variable.
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Sorry about the delay. Need some warning cleanup in this PR.

Fixed, and merged upstream to branch.
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
PYTORCH_MIOPEN_SUGGEST_NHWC Environment Flag

MIOpen does not officially support NHWC yet, although convolution support has been added to MIOpen's tip-of-tree. PYTORCH_MIOPEN_SUGGEST_NHWC is intended to be a short-lived flag that explicitly turns on NHWC support until ROCm officially supports NHWC and performance is verified.

Examples

PYTORCH_TEST_WITH_ROCM=1 PYTORCH_MIOPEN_SUGGEST_NHWC=1 MIOPEN_FIND_ENFORCE=4 MIOPEN_DEBUG_CONV_GEMM=0 MIOPEN_FIND_MODE=1 pytest test_nn.py -v -k "test_conv_cudnn_nhwc"

Use PYTORCH_MIOPEN_SUGGEST_NHWC=1 on ROCm 4.3. See https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html for more examples.
cc @jeffdaily @sunway513 @jithunnair-amd @ROCmSupport