Support for XNNPACK max pooling operator. #33766
Conversation
@AshkanAliabadi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
💊 CircleCI build failures summary and remediations
As of commit 24f19b5 (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns
The following build failures do not appear to be due to upstream breakages (reran 1 job to discount flakiness):
Is ceil_mode used somewhere by XNNPACK for the maxpool op? PT documentation suggests that it is used to compute the output shape: https://pytorch.org/docs/stable/nn.html#torch.nn.MaxPool2d.
If it is not used, then don't we need to figure out what output shape XNNPACK wants and be sure to provide a buffer with that shape?
And also impose a restriction on whether maxpool2d can be mapped to XNNPACK or not, depending on the value of ceil_mode?
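For reference, here is a minimal sketch of the output-size computation that ceil_mode affects, following the MaxPool2d documentation linked above; this is illustrative and not the PR's code.

#include <cstdint>

// Illustrative only: output size of one spatial dimension for max pooling,
// per the formula in the MaxPool2d docs. With ceil_mode the division rounds
// up instead of down.
int64_t pooling_output_size(
    int64_t input_size,
    int64_t kernel,
    int64_t padding,
    int64_t stride,
    int64_t dilation,
    bool ceil_mode) {
  int64_t numerator =
      input_size + 2 * padding - dilation * (kernel - 1) - 1 +
      (ceil_mode ? stride - 1 : 0);
  int64_t output_size = numerator / stride + 1;
  // With ceil_mode, the last window must still start inside the input (not
  // entirely in the padding); otherwise the output shrinks by one.
  if (ceil_mode && (output_size - 1) * stride >= input_size + padding) {
    --output_size;
  }
  return output_size;
}

So if XNNPACK ignores ceil_mode, the output buffer it expects can be one element smaller per dimension than what ceil_mode=True would produce, which is the mismatch being asked about.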
I wonder if the seg faults in tests are related to the ceil_mode stuff.
Right, maybe. I'll investigate, thanks.
aten/src/ATen/native/Pooling.cpp (Outdated)
Why are we directly exposing this here? I thought, based on conv and linear, our philosophy was not to expose this directly but to require explicit opt-in via an "xnnpackify"-style transform of the network. @dreiss for comment.
Also note that this will make it a little harder to fuse maxpool + relu.
Binding it directly is good, as it directly benefits existing models. It doesn't preclude us from doing further optimizations like the fusion you described.
I agree with Dima. We can have a separate diff that exposes the fused version.
Great.
Thanks Ashkan. Overall looks great. I have left a couple of comments.
Looks good from an overall point of view!
How does this work when kernel.size() == 1? Layout::Parameter::width is 1, and you can't have kernel[1]. Ah, I see, you are expanding kernel in create, but then the check for kernel.size() should be different.
I'm sorry, I didn't get your point about the kernel.size() check being different; can you elaborate? Yes, I'm allowing sizes 1 and 2, but expanding to 2 prior to use.
If you require 2, then you should not allow 1, and it looks like you do require 2, because otherwise kernel[1] will segfault.
Aaah ... how could I have missed that?! :/ Thanks.
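For what it's worth, the normalization being discussed could look roughly like the sketch below: accept either one value or a (height, width) pair and expand to two entries before anything indexes kernel[1]. The helper name is hypothetical.

#include <ATen/ATen.h>
#include <vector>

// Hypothetical helper: normalize a 1- or 2-element parameter to (height, width).
std::vector<int64_t> expand_hw(const at::IntArrayRef param) {
  TORCH_CHECK(
      param.size() == 1 || param.size() == 2,
      "Expected a parameter with 1 or 2 elements!");
  if (param.size() == 1) {
    // A single value applies to both height and width.
    return {param[0], param[0]};
  }
  return {param[0], param[1]};
}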
nit: you'll probably need a similar structure for AvgPool and other kinds of pooling, so it makes sense not to put it in the max_pool2d namespace.
Thanks, I'll move it in a future patch after Kimish merges his. The only place I can put it right now is inside Common.h, which is starting to become a cluttered mess.
OK, done now that Kimish's patch is merged.
Integration looks good. Should have some unit tests added for this.
aten/src/ATen/native/Pooling.cpp (Outdated)
These changes should probably be a separate diff.
Yes, please. Can you remove these changes? Otherwise one of us will run into a merge conflict.
OK, no worries, will revert.
If the output channel count is always equal to the input channel count, and you assert later that it will be, why have it as a separate parameter?
Removed.
Message needs to be updated. Also, shouldn't this be checked when the context is created?
Updated the message.
The input tensor is not available at creation time: available() checks the parameters provided at creation time, while usable() gates anything that depends on the input tensor.
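In other words, the split looks roughly like the following sketch; the signatures are illustrative, not the PR's exact code. available() sees only what exists at context-creation time, and usable() sees the runtime input.

#include <ATen/ATen.h>

// Illustrative creation-time check: only the hyper-parameters are known here.
bool available(
    const at::IntArrayRef kernel,
    const at::IntArrayRef padding,
    const at::IntArrayRef stride,
    const at::IntArrayRef dilation) {
  return (kernel.size() == 2) && (kernel[0] > 0) && (kernel[1] > 0) &&
         (padding.size() == 2) && (stride.size() == 2) &&
         (dilation.size() == 2);
}

// Illustrative run-time check: gates anything that depends on the actual input.
bool usable(const at::Tensor& input) {
  return (input.dim() == 4) &&
         input.device().is_cpu() &&
         (input.scalar_type() == at::kFloat);
}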
Do we need to check that input_nhwc was created with a guarding allocator, or is that only needed for the output?
Yes, this should use allocate_padded_if_needed() from Kimish's patch: only re-allocate the input if it is not already allocated with this allocator.
Why not just return output_nhwc?
I think PyTorch's convention is to return tensors in the same memory layout they came in, so we'll have to switch back to NCHW if that's the layout the input tensor was in. Dima / Natalia can confirm.
If that's the convention, then it makes sense to stick with it.
We should make sure that if we're returning NHWC, we're doing so with a guarding allocator to get maximum performance from a sequence of XNNPACK ops.
contiguous() is a no-op if the memory is already in the requested layout. In other words, if the input tensor is already in NHWC, contiguous() will short-circuit.
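A small illustration of that point, assuming nothing beyond the public Tensor::contiguous(MemoryFormat) API:

#include <ATen/ATen.h>

void contiguous_short_circuits_for_matching_layout() {
  const at::Tensor nchw = at::rand({1, 3, 8, 8});
  const at::Tensor nhwc = nchw.contiguous(at::MemoryFormat::ChannelsLast);
  // Already ChannelsLast, so this call does not copy; it returns a tensor
  // aliasing the same storage.
  const at::Tensor nhwc_again = nhwc.contiguous(at::MemoryFormat::ChannelsLast);
  TORCH_CHECK(nhwc_again.data_ptr() == nhwc.data_ptr());
}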
This seems like too tight a coupling. Can you add a separate check for dim == 4 instead?
Done.
Thanks for the comments. Will address and upload a new patch.
@AshkanAliabadi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Note: Had to add this to prevent a test in test/test_namedtensor.py from failing. Said test expects tensor names to propagate.
Can you separate the named tensor and formatting/comment changes into separate diffs?
const Tensor input_nhwc = input.contiguous(MemoryFormat::ChannelsLast);
const Tensor padded_input_nhwc = allocate_padded_if_needed(input_nhwc);
This is potentially two copies. Can we combine these into one?
return max_pool2d::available(
           input.size(Layout::Activation4D::channels),
           parameters.kernel,
           parameters.padding,
           parameters.stride,
           parameters.dilation,
           ceil_mode,
           internal::max_pool2d::Context::kMin,
           internal::max_pool2d::Context::kMax) &&
       max_pool2d::usable(
Why do we need available and usable as separate functions?
Tensor create_and_run(
    const Tensor& input,
    const IntArrayRef kernel,
    const IntArrayRef padding,
    const IntArrayRef stride,
    const IntArrayRef dilation,
    const bool ceil_mode,
    const float output_min,
    const float output_max) {
  using namespace internal;

  return internal::max_pool2d::run(
      internal::max_pool2d::create(
          input.size(Layout::Activation4D::channels),
          kernel,
          padding,
          stride,
          dilation,
          ceil_mode,
          output_min,
          output_max),
      input);
}
This seems like an unnecessary layer of abstraction since we're not persisting the Context. If you inline both create and run directly into max_pool2d, can you eliminate Context and shorten/simplify the entire diff?
Don't we want to maintain the ability to separate create and run in the future?
Why? We don't expect it will ever have a significant perf improvement over this implementation, right?
Breaking the PR into smaller chunks per David's request: #35354. Closing.
Add support for the XNNPACK 2D max pool operator. The operator is enabled as a result of integration into at::max_pool2d(...), itself registered through native_functions.yaml.
Test Plan: CI
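Since the operator is wired into the existing at::max_pool2d entry point, existing callers pick up the XNNPACK path automatically when the build and the arguments support it. A minimal usage sketch, nothing here is new API:

#include <ATen/ATen.h>

int main() {
  const at::Tensor input = at::rand({1, 3, 224, 224});
  // The same call as before this PR; the XNNPACK kernel is selected
  // internally when the availability/usability checks discussed above pass.
  const at::Tensor output = at::max_pool2d(
      input,
      /*kernel_size=*/{2, 2},
      /*stride=*/{2, 2},
      /*padding=*/{0, 0},
      /*dilation=*/{1, 1},
      /*ceil_mode=*/false);
  TORCH_CHECK(output.size(2) == 112 && output.size(3) == 112);
  return 0;
}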