
Conversation

@Gamrix (Contributor) commented Oct 19, 2021

@pytorch-probot bot commented Oct 19, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/2ff625b054814f0614af70cfc48cfbcff36d38a9/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows | Labels (bold = enabled) | Status
Triggered Workflows
linux-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla ✅ triggered
linux-vulkan-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3.6-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers ✅ triggered
linux-xenial-py3.6-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx ✅ triggered
linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/win ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
docker-builds ciflow/all 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-full-jit ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64-full-jit ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos 🚫 skipped
macos-10-15-py3-x86-64 ciflow/all, ciflow/macos 🚫 skipped
parallelnative-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.6-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@facebook-github-bot (Contributor) commented Oct 19, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 2ff625b (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build linux-xenial-py3.6-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge) (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2021-11-19T20:36:04.0519064Z processing existing schema:  alltoall_base(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor _1, Tensor _2, int[] _3, int[] _4) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-11-19T20:36:04.0520441Z processing existing schema:  alltoall(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor[] _1, Tensor[] _2) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-11-19T20:36:04.0521750Z processing existing schema:  send(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor[] _1, int _2, int _3) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-11-19T20:36:04.0523062Z processing existing schema:  recv(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor[] _1, int _2, int _3) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-11-19T20:36:04.0524388Z processing existing schema:  recv_anysource(__torch__.torch.classes.dist_c10d.ProcessGroup _0, Tensor[] _1, int _2) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-11-19T20:36:04.0525668Z processing existing schema:  barrier(__torch__.torch.classes.dist_c10d.ProcessGroup _0) -> (__torch__.torch.classes.dist_c10d.Work _0)
2021-11-19T20:36:04.0526754Z processing existing schema:  __init__(__torch__.torch.classes.dist_c10d.frontend _0) -> (NoneType _0)
2021-11-19T20:36:04.0528212Z processing existing schema:  new_process_group_helper(__torch__.torch.classes.dist_c10d.frontend _0, int _1, int _2, int[] _3, str _4, __torch__.torch.classes.dist_c10d.Store _5, str? _6, int _7) -> (__torch__.torch.classes.dist_c10d.ProcessGroup _0)
2021-11-19T20:36:04.0529794Z processing existing schema:  get_process_group_by_name(__torch__.torch.classes.dist_c10d.frontend _0, str _1) -> (__torch__.torch.classes.dist_c10d.ProcessGroup _0)
2021-11-19T20:36:04.0531192Z processing existing schema:  get_name_of_process_group(__torch__.torch.classes.dist_c10d.frontend _0, __torch__.torch.classes.dist_c10d.ProcessGroup _1) -> (str _0)
2021-11-19T20:36:04.0532376Z The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not. 
2021-11-19T20:36:04.0532984Z 
2021-11-19T20:36:04.0533237Z Broken ops: [
2021-11-19T20:36:04.0533997Z 	aten::_scatter_reduce(Tensor self, int dim, Tensor index, str reduce, *, int? output_size=None) -> (Tensor)
2021-11-19T20:36:04.0535232Z 	aten::_scatter_reduce.out(Tensor self, int dim, Tensor index, str reduce, *, int? output_size=None, Tensor(a!) out) -> (Tensor(a!))
2021-11-19T20:36:04.0535782Z ]
2021-11-19T20:36:04.0840089Z + cleanup
2021-11-19T20:36:04.0840409Z + retcode=1
2021-11-19T20:36:04.0840779Z + set +x
2021-11-19T20:36:04.0874306Z ##[error]Process completed with exit code 1.
2021-11-19T20:36:04.0917329Z ##[group]Run # Ensure the working directory gets chowned back to the current user

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Gamrix added a commit that referenced this pull request Oct 19, 2021
@Gamrix Gamrix requested review from Krovatkin and eellison October 20, 2021 00:03
@Gamrix Gamrix mentioned this pull request Oct 21, 2021
@Krovatkin (Contributor) left a comment:

a few qs.

if (input.data_ptr() == padded_input.data_ptr()) {
  hardswish_impl(input, input);
  return input;
if (result.data_ptr() == padded_input.data_ptr()) {
Contributor:

This check looks a bit funny. It should only be true if the result and input tensors are the same and the input wasn't padded, which is pretty much the inplace case.

Contributor:

We need to double-check what should happen if the input was padded and no longer fits into the output tensor. @ezyang do you know whom we should bug about xnnpack?

Contributor:

It should only be true if result and input tensors are the same and input wasn't padded which is pretty much the inplace case.

This is true, but it prompts the question: why not just dispatch to the inplace version? I guess that wouldn't be using the structured kernel framework, though.

Contributor:

@albanD since @ezyang is OoO, I was wondering if you might know either:
a) whether there's an example of structured kernels w/o going through _out?
b) whether we could add _out with the semantic that if the input is padded we replace the out tensor? It seems fine on second thought, and we'd be maintaining the inplace semantics this way.

Collaborator:

a) Not sure we have one; @bdhirsh would know, as he is working on that in detail.
b) I don't think we should have _out functions whose semantics differ from the other _out functions. These semantics are tricky enough as it is. Also, some people actually use the _out version with a Tensor that is not the input as the out argument.

Note that the code here is not doing that, though. It checks whether the out argument is the same as the padded input, and in that case it is valid to use it, yes.
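
(For illustration, the standard out= contract: the destination tensor need not alias the input. A generic sketch using at::add_out as a stand-in, not this PR's operator:)

#include <ATen/ATen.h>

// Generic illustration of the out= contract: the destination tensor does not
// have to alias the input. Uses at::add_out as a stand-in, not this PR's op.
void out_variant_example() {
  at::Tensor a = at::randn({4});
  at::Tensor out = at::empty({4});
  at::add_out(out, a, a);  // writes a + a into `out`, which does not alias `a`
}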

Contributor:

The benefit of making this op structured is that you only have to write the out= variant, and the inplace and out-of-place variants come for free.

It looks like the mobile kernel is coupled with the cpu kernel (guarded with a build flag), so that benefit should extend to the mobile kernel; we shouldn't need an xnnpack::hardswish or xnnpack::hardswish_; just an xnnpack::hardswish_out function.

if we add _out with the semantic that if input is padded we will replacing the out tensor?

Hmm I don't think this is actually replacing the tensor? This check looks like an optimization for the inplace case: if we're inplace-modifying input, and it's contiguous and not padded, then we don't need to do the computation on an intermediate tensor. In all other cases (including in the out= and functional case) we do the computation on an intermediate tensor, and then copy_() that intermediate back to the output.
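
(For context, a rough sketch of the structured-kernel pattern being described, modeled on how structured unary ops such as hardsigmoid are written; the signatures below are illustrative, not this PR's code:)

// Rough sketch of the structured-kernel pattern, modeled on structured unary
// ops like hardsigmoid; illustrative only, not this PR's code.

// The meta function only determines the output's shape/dtype; build_unary_op
// configures a TensorIterator that allocates the output when none is supplied.
TORCH_META_FUNC(hardswish) (const Tensor& self) {
  build_unary_op(maybe_get_output(), self);
}

// Only the out= kernel is written by hand; codegen derives hardswish() and
// hardswish_() from it via structured_delegate in native_functions.yaml.
TORCH_IMPL_FUNC(hardswish_out) (const Tensor& self, const Tensor& result) {
  hardswish_stub(device_type(), *this);
}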

Contributor Author:

Agreed that we shouldn't need the functional and inplace variants with this change.

However, I guess I should instead be padding the result tensor and checking it against itself, and yes, thereby also covering the case where the out tensor is pre-padded but not equal to the input tensor.

DECLARE_DISPATCH(hardsigmoid_backward_fn, hardsigmoid_backward_stub);
DECLARE_DISPATCH(hardswish_fn, hardswish_stub);
DECLARE_DISPATCH(hardswish_backward_fn, hardswish_backward_stub);
DECLARE_DISPATCH(structured_activation_fn, hardsigmoid_stub);
Contributor:

Where are we switching hardsigmoid to the structured kernel?

Contributor Author:

hardsigmoid is already a structured kernel; it was already using the structured_activation_fn signature. I'm just cleaning up a variable.

@Krovatkin (Contributor) left a comment:

:shipit:

Comment on lines 54 to 60
const Tensor& hardswish_out(const Tensor& input, const Tensor& result) {
  Tensor padded_input = mobile::allocate_padded_contiguous_if_needed(
      input, input.suggest_memory_format());

  // Don't need to allocate output if input is contiguous & already padded
  if (input.data_ptr() == padded_input.data_ptr()) {
    hardswish_impl(input, input);
    return input;
  // Don't need to allocate output if result is contiguous & already padded
  if (mobile::is_padded_contiguous(result, result.suggest_memory_format())) {
    hardswish_impl(padded_input, result);
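
(For readability, a cleaned-up reading of the new code in the hunk above; the else branch for a result tensor that is not already padded contiguous is an assumption, since the hunk is truncated:)

// Cleaned-up reading of the new hardswish_out flow shown above. The else
// branch is an assumption, since the hunk is truncated.
const Tensor& hardswish_out(const Tensor& input, const Tensor& result) {
  Tensor padded_input = mobile::allocate_padded_contiguous_if_needed(
      input, input.suggest_memory_format());

  if (mobile::is_padded_contiguous(result, result.suggest_memory_format())) {
    // result already has the padded, contiguous layout XNNPACK expects:
    // write into it directly, no temporary needed.
    hardswish_impl(padded_input, result);
  } else {
    // Otherwise run the kernel on a padded temporary and copy back into result.
    Tensor padded_result = mobile::allocate_padded_contiguous_if_needed(
        result, result.suggest_memory_format());
    hardswish_impl(padded_input, padded_result);
    result.copy_(padded_result);
  }
  return result;
}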
Contributor Author:

@albanD @bdhirsh @Krovatkin Ok, rewrote this given that it seems that we only need to allocate a temporary output if the result is not properly allocated. Would still love eyes on this, because I don't know this part of the codebase well.

@bdhirsh (Contributor) commented Oct 29, 2021:

Oh okay - so I think the meta function needs to change a little. The idea of the meta function is that running it should fully determine the output's size/shape (and run any shape checks). Looking at the original code for hardswish() (out-of-place), it looks like it conditionally decides to sometimes not use TensorIterator to create the output tensor for mobile. The logic for allocating the output looks something like:

  #if defined(C10_MOBILE) && defined(USE_XNNPACK)
  if (xnnpack::use_hardswish(...) && !maybe_get_output().defined()) {
    // use special xnnpack logic to determine output's size and dtype
    at::native::resize_(maybe_get_output(), computed_size, computed_options);
  }
  #endif
  // use TensorIteratorBase to allocate the output

It's a little hairy, since in the mobile case the op only decides to use TensorIterator to allocate the output some of the time.

Also, it looks like the mobile logic only happens in the hardswish and hardswish_ cases (and not for out=), so you want to check for that in the meta function. You can probably do that with something like:

if (maybe_get_output().defined() && maybe_get_output().data_ptr() != input().data_ptr()) // out= case

I guess that would technically result in different behavior if someone tried to call hardswish(x, out=x), but I'm pretty sure that pattern is buggy in a bunch of other ways throughout pytorch anyway.

@Gamrix (Contributor Author) commented Nov 1, 2021:

Good point. However, I thought the XNNPack version of the code should produce the same shape/size, though with the XNNPack padded memory format. I am wondering whether the XNNPack implementation would deviate from the other implementation in other ways that meta tensors care about. Regardless, this deserves at least a comment in HardSwish's meta function.

And if this is incorrect, then I think the xnn implementation is the thing we should scrutinize and fix first.

Yes, the whole thing is hairy, and I would prefer to avoid trying to put weird logic in the meta function as much as possible.

Contributor Author:

I did intend to change the logic of the function to use xnn for the out= case where possible, because it would make the logic cleaner; I assumed it wasn't implemented only because the person who originally added XNNPack support didn't want to support it.

Contributor:

I am wondering if the XNNPack implementation would deviate from the other implementation in other ways that metatensors cares about. ... And if this is incorrect then I think the xnn implementation is the thing we should scrutinize, and fix first.

Yeah, I think you're right (and the code that creates the output tensor for mobile uses the same size/dtype, but just uses the custom mobile allocator). So yep, probably ignore what I said about needing mobile-specific logic in the meta function

I think the only thing that I'm worried about is that we want the above check (mobile::is_padded_contiguous(...)) to return true in the out-of-place hardswish() case. That'll only be true if the mobile allocator is used to create the output tensor (during the call to build_unary_op() in the meta function). Which... I think it is? Because we only ever enter this code path when C10_MOBILE is set (and there's this logic that sets the mobile allocator during mobile builds).

I did intend on changing the logic of the function to use xnn if possible for the out case because it would make the logic cleaner

Ok yeah, this sounds reasonable.

@Gamrix (Contributor Author) commented Nov 2, 2021:

Sounds like we need a few unit tests to check that the behavior of this op is as expected and that it properly goes down the various xnnpack paths. For example, we should check that the out-of-place hardswish() doesn't do unexpected copies and goes down the intended path. I have no familiarity with how to test for that. @bdhirsh, do you have pointers for where to add this test and how to test this?
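
(As a starting point, a minimal correctness baseline - it does not verify which backend path is taken - assuming a gtest harness linked against ATen; the test name is hypothetical:)

// Minimal correctness baseline for hardswish; it does NOT verify that the
// XNNPACK fast path is taken. Assumes a gtest harness linked against ATen;
// the test name is hypothetical.
#include <gtest/gtest.h>
#include <ATen/ATen.h>

TEST(HardswishTest, MatchesReferenceFormula) {
  const auto x = at::randn({2, 3, 8, 8});
  // hardswish(x) = x * clamp(x + 3, 0, 6) / 6
  const auto expected = x * at::clamp(x + 3.0, 0.0, 6.0) / 6.0;

  // Out-of-place variant.
  EXPECT_TRUE(at::allclose(at::hardswish(x), expected));

  // In-place variant should produce the same values.
  auto y = x.clone();
  at::hardswish_(y);
  EXPECT_TRUE(at::allclose(y, expected));
}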

Contributor:

Not that I know of. Hey @dhruvbird, do you happen to know if we have internal tests for ops specifically on mobile? E.g. in this case, that hardswish goes down the XNNPACK fastpath when built with mobile (and that it continues to after this change).

I mean, I'm pretty confident that it is from staring at the code :). But an automated test would definitely help me feel better too.

Contributor:

cc @kimishpatel for tests related to XNNPACK.

@bdhirsh if you import this PR to internal and put up a diff, it should run tests against a bunch of models (around 150, of which probably 50 use XNNPACK) - if they don't crash, at least you'll have some signal.

Contributor:

What @dhruvbird said should work. I wonder if we can have "code coverage" for those tests to see whether the changed lines are covered. Not sure if there are any tools for that, though.

bool use_hardswish(const Tensor& input) {
  return xnnpack::internal::available() && (1 <= input.ndimension()) &&
      (input.device().is_cpu()) && (kFloat == input.scalar_type()) &&
      !input.requires_grad() && true;
Contributor:

I know it was copy-pasted, but we don't need the && true, right :)

Contributor Author:

True :P

@Gamrix (Contributor Author) commented Nov 4, 2021

@Gamrix has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor):
@Gamrix merged this pull request in 57335a9.

@facebook-github-bot (Contributor):
This pull request has been reverted by bb8978f. To re-land this change, follow these steps.

//
bool use_hardswish(const Tensor& input);
Tensor hardswish(const Tensor& input);
Tensor& hardswish_(Tensor& input);
Contributor:

Looks like there are some mobile tests at vulkan_api_test.cpp that expect to use this. Maybe it's worth refactoring them to use the out= variant? (Or worst case, adding this back)

@facebook-github-bot facebook-github-bot deleted the gh/gamrix/18/head branch November 8, 2021 15:16
@Gamrix Gamrix restored the gh/gamrix/18/head branch November 15, 2021 19:11
@Gamrix Gamrix mentioned this pull request Nov 15, 2021
@Gamrix Gamrix reopened this Nov 15, 2021
Gamrix added a commit that referenced this pull request Nov 16, 2021
Gamrix added a commit that referenced this pull request Nov 16, 2021
Gamrix added a commit that referenced this pull request Nov 17, 2021
@Gamrix (Contributor Author) commented Nov 18, 2021

@Gamrix has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Gamrix added a commit that referenced this pull request Nov 19, 2021
@Gamrix (Contributor Author) commented Nov 19, 2021

@Gamrix has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@Gamrix Gamrix closed this Dec 22, 2021
@facebook-github-bot facebook-github-bot deleted the gh/gamrix/18/head branch December 25, 2021 15:16
@ezyang (Contributor) commented Jan 7, 2022

rip this PR
