
Conversation

Contributor

@AshkanAliabadi AshkanAliabadi commented Feb 25, 2020

Add support for the XNNPACK 2D max pool operator. The operator is enabled through integration into at::max_pool2d(...), which is itself registered through native_functions.yaml.
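For illustration, a minimal C++ usage sketch (not taken from this PR; it assumes an eligible float NCHW input and relies only on the public at::max_pool2d call, which routes to XNNPACK internally when possible):

#include <ATen/ATen.h>

int main() {
  // NCHW float input; when the input and parameters are eligible, the
  // XNNPACK path is selected inside at::max_pool2d itself, so call sites
  // do not change.
  const at::Tensor input = at::rand({1, 3, 224, 224});
  const at::Tensor output = at::max_pool2d(
      input,
      /*kernel_size=*/{3, 3},
      /*stride=*/{2, 2},
      /*padding=*/{1, 1},
      /*dilation=*/{1, 1},
      /*ceil_mode=*/false);
  return output.defined() ? 0 : 1;
}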

Test Plan: CI

Contributor

@facebook-github-bot facebook-github-bot left a comment

@AshkanAliabadi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@dr-ci

dr-ci bot commented Feb 25, 2020

💊 CircleCI build failures summary and remediations

As of commit 24f19b5 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following build failures do not appear to be due to upstream breakages (reran 1 job to discount flakiness):

See CircleCI build pytorch_xla_linux_xenial_py3_6_clang7_test (1/1)

Step: "Test" (confirmed not flaky by 2 failures)

Mar 16 01:43:56 unknown file: Failure
Mar 16 01:38:38 [       OK ] AtenXlaTensorTest.TestAdaptiveAvgPool2DNoBatchBackward (300 ms) 
Mar 16 01:38:38 [ RUN      ] AtenXlaTensorTest.TestConv2DBackward 
Mar 16 01:40:20 [       OK ] AtenXlaTensorTest.TestConv2DBackward (101956 ms) 
Mar 16 01:40:20 [ RUN      ] AtenXlaTensorTest.TestTransposedConv2DBackward 
Mar 16 01:41:03 [       OK ] AtenXlaTensorTest.TestTransposedConv2DBackward (42557 ms) 
Mar 16 01:41:03 [ RUN      ] AtenXlaTensorTest.TestConv3DBackward 
Mar 16 01:42:41 [       OK ] AtenXlaTensorTest.TestConv3DBackward (97983 ms) 
Mar 16 01:42:41 [ RUN      ] AtenXlaTensorTest.TestTransposedConv3DBackward 
Mar 16 01:43:56 [       OK ] AtenXlaTensorTest.TestTransposedConv3DBackward (75418 ms) 
Mar 16 01:43:56 [ RUN      ] AtenXlaTensorTest.TestMaxPool2DBackward 
Mar 16 01:43:56 unknown file: Failure 
Mar 16 01:43:56 C++ exception with description "element 0 of tensors does not require grad and does not have a grad_fn0 (run_backward at /var/lib/jenkins/workspace/torch/csrc/autograd/autograd.cpp:74) 
Mar 16 01:43:56 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x4a (0x7fe72d66d6da in /var/lib/jenkins/workspace/torch/lib/libc10.so) 
Mar 16 01:43:56 frame #1: <unknown function> + 0x30e4f33 (0x7fe71bcb4f33 in /var/lib/jenkins/workspace/torch/lib/libtorch_cpu.so) 
Mar 16 01:43:56 frame #2: torch::autograd::backward(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, c10::optional<bool>, bool) + 0x72 (0x7fe71bcb5572 in /var/lib/jenkins/workspace/torch/lib/libtorch_cpu.so) 
Mar 16 01:43:56 frame #3: <unknown function> + 0x347855e (0x7fe71c04855e in /var/lib/jenkins/workspace/torch/lib/libtorch_cpu.so) 
Mar 16 01:43:56 frame #4: void c10::KernelFunction::callUnboxed<void, at::Tensor const&, at::Tensor const&, bool, bool>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&, bool, bool) const + 0x113 (0x57ca33 in ./test_ptxla) 
Mar 16 01:43:56 frame #5: torch_xla::cpp_test::TestBackward(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, c10::Device const&, std::function<at::Tensor (std::vector<at::Tensor, std::allocator<at::Tensor> > const&)> const&, double, double) + 0x604 (0x5723c4 in ./test_ptxla) 
Mar 16 01:43:56 frame #6: ./test_ptxla() [0x6d1241] 
Mar 16 01:43:56 frame #7: torch_xla::cpp_test::ForEachDevice(std::function<void (c10::Device const&)> const&) + 0x2d (0x56f5dd in ./test_ptxla) 
Mar 16 01:43:56 frame #8: torch_xla::cpp_test::AtenXlaTensorTest_TestMaxPool2DBackward_Test::TestBody() + 0x97 (0x624d27 in ./test_ptxla) 

This comment was automatically generated by Dr. CI and has been revised 74 times.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@AshkanAliabadi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

Is ceil_mode used anywhere by XNNPACK for the maxpool op? The PyTorch documentation suggests it is used to compute the output shape, https://pytorch.org/docs/stable/nn.html#torch.nn.MaxPool2d.
If it is not used, don't we need to figure out what output shape XNNPACK wants and make sure to provide a buffer with that shape?
And also restrict whether maxpool2d can be mapped to XNNPACK depending on the value of ceil_mode?
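For reference, the output extent the documentation describes can be sketched as follows (an illustrative helper, not code from this PR; it follows the documented floor/ceil formula and ignores the extra corner-case rule about windows that would start entirely inside the padding):

#include <cmath>
#include <cstdint>

// Expected pooled extent along one spatial dimension, per the MaxPool2d
// documentation: floor by default, ceil when ceil_mode is set.
std::int64_t pooled_size(std::int64_t input, std::int64_t kernel,
                         std::int64_t padding, std::int64_t stride,
                         std::int64_t dilation, bool ceil_mode) {
  const double span = input + 2 * padding - dilation * (kernel - 1) - 1;
  const double size = span / static_cast<double>(stride) + 1.0;
  return static_cast<std::int64_t>(ceil_mode ? std::ceil(size)
                                             : std::floor(size));
}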

Contributor

I wonder if the seg faults in tests are related to the ceil_mode stuff.

Contributor Author

Right, maybe. I'll investigate, thanks.

Contributor

Why are we exposing this directly here? I thought, based on conv and linear, our philosophy was going to be to not expose this directly but to require explicit opt-in via an "xnnpackify.."-style transform of the network. @dreiss for comment.

Contributor

Also note that this will make it a little harder to fuse maxpool + relu.

Collaborator

Binding it directly is good, as it directly benefits existing models. It doesn't preclude us from doing further optimizations like the fusion you described.

Contributor

I agree with Dima. We can have a separate diff that exposes the fused version.

Contributor Author

Great.
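As a side note on the fusion discussed above, maxpool + relu reduces to a lower clamp at zero, which is the kind of thing an output_min-style clamp on the backend op could absorb in a later diff. A small illustrative check (not part of this PR):

#include <ATen/ATen.h>

int main() {
  const at::Tensor x = at::randn({1, 8, 16, 16});
  const at::Tensor pooled = at::max_pool2d(x, /*kernel_size=*/{2, 2});
  // relu(max_pool2d(x)) is the same as clamping the pooled output at zero.
  const bool same = at::allclose(at::relu(pooled), at::clamp_min(pooled, 0));
  return same ? 0 : 1;
}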

Contributor

@kimishpatel kimishpatel left a comment

Thanks Ashkan. Overall looks great. I have left a couple of comments.

Collaborator

@dzhulgakov dzhulgakov left a comment

Looks good from overall point of view!


Collaborator

How does this work when kernel.size() == 1? Layout::Parameter::width is 1, and you can't index kernel[1]. Ah, I see, you are expanding kernel in create, but then the check on kernel.size() should be different.

Contributor Author

I'm sorry, I didn't get your point about the kernel.size() check being different; can you elaborate? Yes, I'm allowing sizes 1 and 2 but expanding to 2 prior to use.

Collaborator

If you require 2, then you should not allow 1, and it looks like you do require 2, because otherwise kernel[1] will segfault.
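A sketch of the guard being discussed, using a hypothetical helper rather than the PR's actual code: allow a 1- or 2-element kernel_size, but expand it before anything indexes kernel[1].

#include <ATen/ATen.h>
#include <vector>

// Hypothetical helper: accept a 1- or 2-element kernel and expand it to
// {h, w} before any kernel[1] access, so a single value can never segfault.
std::vector<int64_t> expand_kernel(const at::IntArrayRef kernel) {
  TORCH_CHECK(kernel.size() == 1 || kernel.size() == 2,
              "max_pool2d: kernel_size must have 1 or 2 elements");
  return kernel.size() == 1
      ? std::vector<int64_t>{kernel[0], kernel[0]}
      : std::vector<int64_t>{kernel[0], kernel[1]};
}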

Contributor Author

Aaah ... how could I have missed that?! :/ Thanks.

Collaborator

nit: you'll probably need a similar structure for AvgPool and other kinds of pooling, so it makes sense not to put it in the max_pool2d namespace.

Contributor Author

Thanks, I'll move it in a future patch after Kimish merges his. The only place I can put it right now is inside Common.h, which is starting to become a cluttered mess.

Contributor Author

OK, done now that Kimish's patch is merged.

Contributor

@dreiss dreiss left a comment

Integration looks good. Should have some unit tests added for this.


Contributor

These changes should probably be a separate diff.

Contributor

Yes, please. Can you remove these changes? Otherwise one of us will run into a merge conflict.

Contributor Author

OK, no worries, will revert.

Comment on lines 236 to 237
Contributor

If output channels is always equal to input channels, and you assert later that it always will be equal to input channels, why have it as a separate parameter?

Contributor Author

Removed.

Contributor

Message needs to be updated. Also, shouldn't this be checked when the context is created?

Contributor Author

Updated the message.

The input tensor is not available at creation time: available() checks the parameters provided at creation time, while usable() gates anything that depends on the input tensor.
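Schematically, the split might look like the following (simplified, hypothetical signatures rather than the PR's actual ones):

#include <ATen/ATen.h>

// available(): depends only on parameters known when the context is created.
bool available(const at::IntArrayRef kernel,
               const at::IntArrayRef padding,
               const at::IntArrayRef stride,
               const at::IntArrayRef dilation) {
  return kernel.size() == 2 && padding.size() == 2 &&
         stride.size() == 2 && dilation.size() == 2 &&
         kernel[0] > 0 && kernel[1] > 0;
}

// usable(): gates on properties of the input tensor, which only exists at
// run time.
bool usable(const at::Tensor& input) {
  return input.defined() &&
         input.dim() == 4 &&
         input.scalar_type() == at::kFloat &&
         !input.requires_grad();
}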

Contributor

Do we need to check that input_nhwc was created with a guarding allocator, or is that only needed for the output?

Contributor Author

Yes, this should use allocate_padded_if_needed() from Kimish's patch: only re-allocate the input if it was not already allocated with this allocator.

Contributor

Why not just return output_nhwc?

Contributor Author

I think PyTorch's convention is to return tensors in the same layout they came in, so we'll have to switch back to NCHW if that's the layout the input tensor was in. Dima / Natalia can confirm.

Contributor

If that's the convention, then it makes sense to stick with it.

We should make sure that if we're returning NHWC, we're doing so with a guarding allocator to get maximum performance from a sequence of XNNPACK ops.

Contributor Author

@AshkanAliabadi AshkanAliabadi Mar 13, 2020

contiguous() is a no-op if the memory is already in the requested layout. In other words, if the input tensor is already in NHWC, contiguous() will short-circuit.
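A small illustrative check of that short-circuit behavior (not from this PR):

#include <ATen/ATen.h>

int main() {
  const at::Tensor nhwc =
      at::rand({1, 4, 8, 8}).contiguous(at::MemoryFormat::ChannelsLast);
  // Requesting the layout the tensor is already in returns the same
  // underlying TensorImpl, i.e. contiguous() short-circuits without a copy.
  const at::Tensor again = nhwc.contiguous(at::MemoryFormat::ChannelsLast);
  return again.is_same(nhwc) ? 0 : 1;
}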

Contributor

This seems like too tight of a coupling. Can you add a separate check for dim==4 instead?

Contributor Author

Done.

@AshkanAliabadi
Contributor Author

Thanks for the comments. Will address and upload a new patch.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@AshkanAliabadi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor Author

Note: Had to add this to prevent a test in test/test_namedtensor.py from failing. Said test expects tensor names to propagate.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@AshkanAliabadi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@dreiss dreiss left a comment

Can you separate the named tensor and formatting/comment changes into separate diffs?

Comment on lines +169 to +170
const Tensor input_nhwc = input.contiguous(MemoryFormat::ChannelsLast);
const Tensor padded_input_nhwc = allocate_padded_if_needed(input_nhwc);
Contributor

This is potentially two copies. Can we combine these into one?

Comment on lines +301 to +310
return max_pool2d::available(
input.size(Layout::Activation4D::channels),
parameters.kernel,
parameters.padding,
parameters.stride,
parameters.dilation,
ceil_mode,
internal::max_pool2d::Context::kMin,
internal::max_pool2d::Context::kMax) &&
max_pool2d::usable(
Contributor

Why do we need available and usable as separate functions?

Comment on lines +212 to +234
Tensor create_and_run(
const Tensor& input,
const IntArrayRef kernel,
const IntArrayRef padding,
const IntArrayRef stride,
const IntArrayRef dilation,
const bool ceil_mode,
const float output_min,
const float output_max) {
using namespace internal;

return internal::max_pool2d::run(
internal::max_pool2d::create(
input.size(Layout::Activation4D::channels),
kernel,
padding,
stride,
dilation,
ceil_mode,
output_min,
output_max),
input);
}
Contributor

This seems like an unnecessary layer of abstraction since we're not persisting the Context. If you inline both create and run directly into max_pool2d, can you eliminate Context and shorten/simplify the entire diff?

Contributor Author

Don't we want to maintain the ability to separate create and run in the future?

Contributor

Why? We don't expect it will ever have a significant perf improvement over this implementation, right?

facebook-github-bot pushed a commit that referenced this pull request Mar 24, 2020
…95. (#35081)

Summary:
Required to fix a build issue in #33766.
Pull Request resolved: #35081

Reviewed By: dreiss

Differential Revision: D20567230

Pulled By: AshkanAliabadi

fbshipit-source-id: 1ed61708851402f60b80abc818ae7330e43adb83
@AshkanAliabadi
Contributor Author

Breaking PR into smaller chunks per David's request. #35354. Closing.
