Converting hardswish to structured kernels with metatensor support #66899
Conversation
a few qs.
-  if (input.data_ptr() == padded_input.data_ptr()) {
-    hardswish_impl(input, input);
-    return input;
+  if (result.data_ptr() == padded_input.data_ptr()) {
This check looks a bit funny. It should only be true if the `result` and `input` tensors are the same and `input` wasn't padded, which is pretty much the in-place case.
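To spell out why, here is a short sketch based on the code in this hunk (the comments only restate what the existing helpers do; nothing new is proposed):

```cpp
// allocate_padded_contiguous_if_needed returns `input` itself when it is already
// contiguous with the required tail padding, and a padded copy otherwise.
Tensor padded_input = mobile::allocate_padded_contiguous_if_needed(
    input, input.suggest_memory_format());

// Therefore result.data_ptr() == padded_input.data_ptr() can only hold when
//   (a) no copy was made (input was already padded and contiguous), and
//   (b) result aliases input, i.e. the in-place hardswish_ call.
```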
We need to double check what should happen if the input was padded and doesn't fit into the output tensor anymore. @ezyang do you know whom we should bug about xnnpack?
> It should only be true if the `result` and `input` tensors are the same and `input` wasn't padded, which is pretty much the in-place case.

This is true, but it then prompts the question of why not just dispatch to the in-place version? I guess that wouldn't be using the structured kernel framework, though.
@albanD since @ezyang is OOO, I was wondering if you might know either:
a) whether there's an example of structured kernels without going through `_out`?
b) whether we could add `_out` with the semantics that if the input is padded, we replace the out tensor? On second thought that seems fine, and we maintain the in-place semantics this way.
a) Not sure we have one; @bdhirsh would know, as he is working on that in detail.
b) I don't think we should have `_out` functions with different semantics from the other `_out` functions. These semantics are tricky enough as it is. Also, some people actually use the `_out` version with a Tensor that is not the input as the `out` argument.
Note that the code here is not doing that, though. It checks whether the out argument is the same as the padded input, and in that case it is valid to use it, yes.
The benefit of making this op structured is that you only have to write the out= variant; the in-place and out-of-place variants come for free.
It looks like the mobile kernel is coupled with the CPU kernel (guarded by a build flag), so that benefit should extend to the mobile kernel: we shouldn't need an `xnnpack::hardswish` or `xnnpack::hardswish_`, just an `xnnpack::hardswish_out` function.
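For reference, a minimal sketch of what the structured-kernel wiring usually looks like for a unary op (the exact code in this PR may differ; this just illustrates why only the out= kernel needs to be written):

```cpp
// Meta function: determines the output's size/dtype and runs shape checks.
TORCH_META_FUNC(hardswish) (const Tensor& self) {
  build_unary_op(maybe_get_output(), self);
}

// Only the out= kernel is written by hand; the structured-kernels codegen
// derives hardswish() and hardswish_() from it.
TORCH_IMPL_FUNC(hardswish_out) (const Tensor& self, const Tensor& result) {
  hardswish_stub(device_type(), *this);
}
```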
> if we could add `_out` with the semantics that if the input is padded, we replace the out tensor?

Hmm, I don't think this is actually replacing the tensor? This check looks like an optimization for the in-place case: if we're in-place-modifying the input, and it's contiguous and not padded, then we don't need to do the computation on an intermediate tensor. In all other cases (including the out= and functional cases) we do the computation on an intermediate tensor, and then `copy_()` that intermediate back to the output.
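In code, that's roughly this control flow (a hypothetical sketch of what's being described, not the exact diff; `at::empty_like` stands in for however the padded temporary is actually allocated):

```cpp
if (result.data_ptr() == padded_input.data_ptr()) {
  // In-place on a tensor that is already contiguous and padded:
  // compute directly, no intermediate needed.
  hardswish_impl(result, result);
} else {
  // Otherwise compute into a temporary and copy the result back into the output.
  Tensor intermediate = at::empty_like(padded_input);
  hardswish_impl(padded_input, intermediate);
  result.copy_(intermediate);
}
```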
Agreed that we shouldn't need the functional and in-place variants with this change.
However, I guess I should instead be padding the result tensor and checking it against itself, and yes, thereby also covering the case where the out tensor is pre-padded but not equal to the input tensor.
DECLARE_DISPATCH(hardsigmoid_backward_fn, hardsigmoid_backward_stub);
DECLARE_DISPATCH(hardswish_fn, hardswish_stub);
DECLARE_DISPATCH(hardswish_backward_fn, hardswish_backward_stub);
DECLARE_DISPATCH(structured_activation_fn, hardsigmoid_stub);
Where are we switching `hardsigmoid` to the structured kernel?
`hardsigmoid` is already a structured kernel; it was already using the `structured_activation_fn` signature. I'm just cleaning up a variable.
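For context, the two stub signatures in question look roughly like this (assumed from the declarations in this hunk; see Activation.h for the exact typedefs):

```cpp
// Pre-structured stub signature.
using hardswish_fn = void (*)(TensorIterator&);
// Stub signature used by structured kernels, driven by the TensorIteratorBase
// that the meta function builds.
using structured_activation_fn = void (*)(TensorIteratorBase&);
```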
 const Tensor& hardswish_out(const Tensor& input, const Tensor& result) {
   Tensor padded_input = mobile::allocate_padded_contiguous_if_needed(
       input, input.suggest_memory_format());

-  // Don't need to allocate output if input is contiguous & already padded
-  if (input.data_ptr() == padded_input.data_ptr()) {
-    hardswish_impl(input, input);
-    return input;
+  // Don't need to allocate output if result is contiguous & already padded
+  if (mobile::is_padded_contiguous(result, result.suggest_memory_format())) {
+    hardswish_impl(padded_input, result);
@albanD @bdhirsh @Krovatkin Ok, rewrote this given that it seems that we only need to allocate a temporary output if the result is not properly allocated. Would still love eyes on this, because I don't know this part of the codebase well.
Oh okay, so I think the meta function needs to change a little. The idea of the meta function is that running it should fully determine the output's size/shape (and run any shape checks). Looking at the original code for the out-of-place `hardswish()`, it looks like it conditionally decides to sometimes not use TensorIterator to create the output tensor for mobile. The logic for allocating the output looks something like:
#if defined(C10_MOBILE) && defined(USE_XNNPACK)
if (xnnpack::use_hardswish(...) && !maybe_get_output().defined()) {
  // use special xnnpack logic to determine output's size and dtype
  at::native::resize_(maybe_get_output(), computed_size, computed_options);
}
#endif
// use TensorIteratorBase to allocate the output
It's a little hairy, since in the mobile case the op only decides to use TensorIterator to allocate the output some of the time.
Also, it looks like the mobile logic only happens in the `hardswish` and `hardswish_` case (and not for `out=`), so you want to check for that in the meta function. You can probably do that with something like:
`if (maybe_get_output().defined() && maybe_get_output().data_ptr() != input().data_ptr()) // out= case`
I guess that would technically result in different behavior if someone tried to call `hardswish(x, out=x)`, but I'm pretty sure that pattern is buggy in a bunch of other ways throughout PyTorch anyway.
Good point. However, I thought the XNNPACK version of the code should result in the same shape/size, just with the XNNPACK padded memory format. I am wondering whether the XNNPACK implementation would deviate from the other implementation in other ways that meta tensors care about. Regardless, this deserves at least a comment in hardswish's meta function.
And if this is incorrect, then I think the XNNPACK implementation is the thing we should scrutinize and fix first.
Yes, the whole thing is hairy, and I would prefer to avoid putting weird logic in the meta function as much as possible.
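A small illustration of that point, using the helper already in this file (the assertions are just for exposition, not proposed code):

```cpp
Tensor padded = mobile::allocate_padded_contiguous_if_needed(
    input, input.suggest_memory_format());
// The padding only adds tail bytes to the underlying storage; the metadata the
// meta function has to reproduce (sizes, dtype) is unchanged.
TORCH_INTERNAL_ASSERT(padded.sizes() == input.sizes());
TORCH_INTERNAL_ASSERT(padded.scalar_type() == input.scalar_type());
```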
I did intend to change the logic of the function to use XNNPACK if possible for the `out` case because it would make the logic cleaner; I assumed it was not implemented only because the person who originally added XNNPACK support didn't want to support it.
> I am wondering whether the XNNPACK implementation would deviate from the other implementation in other ways that meta tensors care about. ... And if this is incorrect, then I think the XNNPACK implementation is the thing we should scrutinize and fix first.

Yeah, I think you're right (the code that creates the output tensor for mobile uses the same size/dtype, it just uses the custom mobile allocator). So yep, probably ignore what I said about needing mobile-specific logic in the meta function.
I think the only thing I'm worried about is that we want the above check (`mobile::is_padded_contiguous(...)`) to return true in the out-of-place `hardswish()` case. That'll only be true if the mobile allocator is used to create the output tensor (during the call to `build_unary_op()` in the meta function). Which... I think it is? Because we only ever enter this code path when `C10_MOBILE` is set (and there's logic that sets the mobile allocator during mobile builds).

> I did intend to change the logic of the function to use XNNPACK if possible for the out case because it would make the logic cleaner

Ok yeah, this sounds reasonable.
Sounds like we need a few unit tests to check that the behavior of this op is as expected, and that it properly goes down the various XNNPACK paths. For example, we should check that the out-of-place `hardswish()` doesn't do unexpected copies and goes down the intended path. I have no familiarity with how to test for that. @bdhirsh, do you have pointers for where to add this test and how to test this?
Not that I know of. Hey @dhruvbird, do you happen to know if we have internal tests for ops specifically on mobile? E.g. in this case, that `hardswish` goes down the XNNPACK fast path when built for mobile (and that it continues to after this change).
I mean, I'm pretty confident that it does from staring at the code :). But an automated test would definitely help me feel better too.
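For what it's worth, here is a hypothetical gtest-style sketch of the kind of check being discussed (assuming a mobile build with USE_XNNPACK and the out= overload this PR adds; the test name and location are placeholders, not an existing test):

```cpp
#include <gtest/gtest.h>
#include <ATen/ATen.h>

TEST(XnnpackHardswishTest, OutVariantMatchesReference) {
  const auto input = at::rand({2, 3, 8, 8});
  const auto expected = at::hardswish(input);  // functional reference
  auto out = at::empty_like(input);
  at::hardswish_out(out, input);               // out= variant under test
  ASSERT_TRUE(at::allclose(out, expected));
  // Note: this only checks numerical agreement; verifying that the XNNPACK
  // fast path (rather than the fallback) was taken would need extra instrumentation.
}
```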
cc @kimishpatel for tests related to XNNPACK.
@bdhirsh if you import this PR to internal and put up a diff, then it should run tests against a bunch of models (around 150, of which probably 50 use XNNPACK). If they don't crash, at least you'll have some signal.
What @dhruvbird said should work. I wonder if we can have "code coverage" for those tests to see whether the changed lines are covered. Not sure if there are any tools for that, though.
bool use_hardswish(const Tensor& input) {
  return xnnpack::internal::available() && (1 <= input.ndimension()) &&
      (input.device().is_cpu()) && (kFloat == input.scalar_type()) &&
      !input.requires_grad() && true;
I know it was copy-pasted, but we don't need the `&& true`, right? :)
True :P
@Gamrix has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This pull request has been reverted by bb8978f. To re-land this change, follow these steps.
//
bool use_hardswish(const Tensor& input);
Tensor hardswish(const Tensor& input);
Tensor& hardswish_(Tensor& input);
Looks like there are some mobile tests in vulkan_api_test.cpp that expect to use this. Maybe it's worth refactoring them to use the out= variant? (Or, worst case, adding this back.)
… support" Differential Revision: [D32175963](https://our.internmc.facebook.com/intern/diff/D32175963) [ghstack-poisoned]
… support" Differential Revision: [D32175963](https://our.internmc.facebook.com/intern/diff/D32175963) [ghstack-poisoned]
… support" Differential Revision: [D32175963](https://our.internmc.facebook.com/intern/diff/D32175963) [ghstack-poisoned]
… support" = [ghstack-poisoned]
… support" = [ghstack-poisoned]
@Gamrix has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
… support" = Differential Revision: [D32535014](https://our.internmc.facebook.com/intern/diff/D32535014) [ghstack-poisoned]
@Gamrix has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
rip this PR
Stack from ghstack:
Differential Revision: D32535014