
Conversation

jjsjann123
Collaborator

Initial kernel support added for optimized NHWC tensors.

TODO: the backward kernel currently spits out a tensor with NHWC strides.
Unfortunately autograd restores the grad to contiguous (in either the copy or
the add path), which makes real perf tuning annoying to do, since I cannot
easily measure end-to-end time in my Python script.

My current kernel is blazing fast compared to the original NCHW kernel in fp16,
since I avoided atomicAdd. I'll finish perf tuning after we merge a future PR
expanding NHWC support in the core.
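
For reference, a minimal Python sketch of the kind of script described here (not taken from this PR; it assumes a build where the torch.channels_last memory format is available):

import torch
import torch.nn.functional as F

# Exercise the NHWC path: give adaptive_avg_pool2d a channels_last input,
# then run backward once.
x = torch.randn(128, 256, 64, 64, device="cuda", dtype=torch.half)
x = x.contiguous(memory_format=torch.channels_last).requires_grad_()

y = F.adaptive_avg_pool2d(x, (32, 32))
y.sum().backward()

# Per the TODO above, autograd may restore the grad to a contiguous (NCHW)
# layout, so this is not guaranteed to print True.
print(x.grad.is_contiguous(memory_format=torch.channels_last))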

@pytorchbot added the module: cuda (Related to torch.cuda, and CUDA support in general) and module: operators labels on Aug 15, 2019
@jjsjann123
Collaborator Author

Supporting #23403.
Tests will follow once my local build finishes after the rebase.
cc'ing @VitalyFedyunin @csarofeen @ptrblck

@VitalyFedyunin
Contributor

cc @ifedan

@ifedan
Contributor

ifedan commented Aug 15, 2019

@jjsjann123 Do you have any performance metrics? NCHW vs NHWC

@jjsjann123
Collaborator Author

jjsjann123 commented Aug 15, 2019

I do not have a general speedup number showing the perf improvement yet, because of the extra kernel I mentioned in this PR (hence perf tuning will be in a follow-up PR). I did run a sweep test and saw rough perf improvements, so it should be fine.

To give a specific data point: for an input of [128, 256, 64, 64] and an output of [128, 256, 32, 32], here are the kernel times. You can see a slight perf improvement for fp32, and roughly 2x on the fp16 backward kernel.

                    5.29%  7.5944ms         4  1.8986ms  1.7728ms  2.2706ms  void at::native::_GLOBAL__N__57_tmpxft_00002ce2_00000000_6_AdaptiveAveragePooling_cpp1_ii_eb1948c3::atomicadaptiveaveragegradinput<float>(float*, float, int, int, int, int)
                    3.86%  5.5528ms         4  1.3882ms  1.3844ms  1.3993ms  void at::native::_GLOBAL__N__57_tmpxft_00002ce2_00000000_6_AdaptiveAveragePooling_cpp1_ii_eb1948c3::adaptiveaveragegradinputnhwc<int, float>(float*, float, int, int, int, int, int, int, int, float*, float*, float*)
                    3.76%  5.4044ms         4  1.3511ms  1.3504ms  1.3522ms  void at::native::_GLOBAL__N__57_tmpxft_00002ce2_00000000_6_AdaptiveAveragePooling_cpp1_ii_eb1948c3::atomicadaptiveaveragegradinput<c10::Half>(c10::Half*, c10::Half, int, int, int, int)
                    3.33%  4.7840ms         4  1.1960ms  1.0737ms  1.5595ms  void at::native::_GLOBAL__N__57_tmpxft_00002ce2_00000000_6_AdaptiveAveragePooling_cpp1_ii_eb1948c3::adaptiveaveragepool<float>(float*, float, int, int, int, int, long, long, long)
                    2.93%  4.2095ms         4  1.0524ms  1.0517ms  1.0531ms  void at::native::_GLOBAL__N__57_tmpxft_00002ce2_00000000_6_AdaptiveAveragePooling_cpp1_ii_eb1948c3::adaptiveaveragepool<c10::Half>(c10::Half*, c10::Half, int, int, int, int, long, long, long)
                    2.85%  4.0876ms         4  1.0219ms  1.0175ms  1.0328ms  void at::native::_GLOBAL__N__57_tmpxft_00002ce2_00000000_6_AdaptiveAveragePooling_cpp1_ii_eb1948c3::adaptiveaveragepoolnhwc<int, c10::Half>(c10::Half*, c10::Half, int, int, int, int, int, int, int, c10::Half*, c10::Half*, c10::Half*)
                    2.75%  3.9492ms         4  987.31us  986.35us  988.31us  void at::native::_GLOBAL__N__57_tmpxft_00002ce2_00000000_6_AdaptiveAveragePooling_cpp1_ii_eb1948c3::adaptiveaveragepoolnhwc<int, float>(float*, float, int, int, int, int, int, int, int, float*, float*, float*)
                    1.71%  2.4591ms         4  614.76us  613.58us  616.78us  void at::native::_GLOBAL__N__57_tmpxft_00002ce2_00000000_6_AdaptiveAveragePooling_cpp1_ii_eb1948c3::adaptiveaveragegradinputnhwc<int, c10::Half>(c10::Half*, c10::Half, int, int, int, int, int, int, int, c10::Half*, c10::Half*, c10::Half*)

I haven't spent enough time looking at the forward kernel yet; I will revisit that in the later perf-tuning PR.
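
For reference, a hedged sketch of one way to collect per-kernel timings like the table above (the listing looks like nvprof output; this uses the autograd profiler instead, and only the shapes are taken from the example):

import torch
import torch.nn.functional as F

def run_once(dtype, memory_format):
    x = torch.randn(128, 256, 64, 64, device="cuda", dtype=dtype)
    x = x.contiguous(memory_format=memory_format).requires_grad_()
    F.adaptive_avg_pool2d(x, (32, 32)).sum().backward()

# Warm up once, then profile both layouts in fp32 and fp16.
run_once(torch.float, torch.contiguous_format)
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for dtype in (torch.float, torch.half):
        for fmt in (torch.contiguous_format, torch.channels_last):
            run_once(dtype, fmt)

print(prof.key_averages().table(sort_by="cuda_time_total"))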

@jjsjann123
Collaborator Author

jjsjann123 commented Aug 17, 2019

@ifedan Here is a graph showing the relative speedup compared to the PyTorch native kernel.
I could have arranged the graph in a better way (each config is a simple product of a few knobs), but overall there is a decent speedup for the backward kernel.

[image: relative speedup per configuration vs. the native NCHW kernel]
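
A hypothetical sketch of the kind of knob sweep behind a graph like this (the knob names and values below are made up for illustration; the x.grad = None reset is explained in the next comment):

import itertools
import time

import torch
import torch.nn.functional as F

def bench(n, c, hw, out, dtype, fmt, iters=20):
    x = torch.randn(n, c, hw, hw, device="cuda", dtype=dtype)
    x = x.contiguous(memory_format=fmt).requires_grad_()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        x.grad = None  # keep grad accumulation out of the measurement
        F.adaptive_avg_pool2d(x, out).sum().backward()
    torch.cuda.synchronize()
    return (time.time() - start) / iters

# Each config is a simple product of a few knobs (placeholder values).
for n, c, hw, out, dt in itertools.product((32, 128), (64, 256), (32, 64),
                                           (8, 32), (torch.float, torch.half)):
    nchw = bench(n, c, hw, out, dt, torch.contiguous_format)
    nhwc = bench(n, c, hw, out, dt, torch.channels_last)
    print(f"{n}x{c}x{hw}x{hw} -> {out}x{out} {dt}: {nchw / nhwc:.2f}x speedup")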

@jjsjann123
Collaborator Author

jjsjann123 commented Aug 17, 2019

Note that my benchmark was done via the following hack:

+++ b/torch/csrc/autograd/functions/accumulate_grad.cpp
@@ -43,7 +43,8 @@ auto AccumulateGrad::apply(variable_list&& grads) -> variable_list {
     // under following condition, we can avoid clone()
     if (!GradMode::is_enabled()
         && !new_grad.is_sparse()
-        && new_grad.is_contiguous()
+        //&& new_grad.is_contiguous(variable.get()->is_strides_like_channels_last()? at::MemoryFormat::ChannelsLast : at::MemoryFormat::Contiguous)
+        && new_grad.is_contiguous(at::MemoryFormat::ChannelsLast)
         && new_grad.use_count() <= 1 + !post_hooks().empty()) {
       // first check it is in first-order grad only mode
       // then check not sparse before is_contiguous

I have to explicitly set x.grad = None so it goes through this hacked code path during grad accumulation.
Otherwise there will be a TensorIterator kernel (either copy or add) that skews the measurement.
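
A small sketch of what that measurement loop might look like (assuming the hacked build from the diff above and the channels_last memory-format API):

import torch
import torch.nn.functional as F

x = torch.randn(128, 256, 64, 64, device="cuda")
x = x.contiguous(memory_format=torch.channels_last).requires_grad_()

for _ in range(10):
    # Reset the grad so AccumulateGrad takes the no-clone path from the hack
    # above instead of issuing an extra TensorIterator copy/add kernel.
    x.grad = None
    F.adaptive_avg_pool2d(x, (32, 32)).sum().backward()

# With the hack in place, the stored grad should keep its NHWC strides.
print(x.grad.stride())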

@pytorchbot added the module: nn (Related to torch.nn) label on Aug 17, 2019
@cpuhrsch added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label on Oct 11, 2019
@cpuhrsch requested a review from ifedan on October 11, 2019 07:05
@VitalyFedyunin
Contributor

Please rebase, we are getting ready to land it.

The previous kernel did not stride on the channel dimension, and it used shared
memory to store temporary results (to break data dependencies and expose more
parallelism).

This resulted in requesting more resources than are available.

Fix:
added striding on C to reduce shared-memory usage per CTA.
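
A rough back-of-the-envelope illustration of that resource problem (the per-channel-accumulator assumption and the numbers are illustrative, not taken from the kernel):

# Assume each CTA keeps one fp32 partial result per channel for each output
# row it works on; the shared-memory request then grows linearly with C.
BYTES_PER_FLOAT = 4
SMEM_LIMIT_PER_CTA = 48 * 1024  # common static shared-memory limit

def smem_bytes(channels, rows_per_cta=8):
    return channels * rows_per_cta * BYTES_PER_FLOAT

print(smem_bytes(2048), SMEM_LIMIT_PER_CTA)  # 65536 > 49152: launch would fail
print(smem_bytes(256))                       # 8192: a fixed C tile fits easily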
@jjsjann123
Collaborator Author

Rebased my code and cherry-picked the patch from #25102

Contributor

@VitalyFedyunin VitalyFedyunin left a comment

Trying to look at all call inputs. Also flagged some 'dev' practice issues (e.g. switch vs. if usage got changed).

@jjsjann123
Collaborator Author

Should have addressed all review comments. Feel free to take another look.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@VitalyFedyunin merged this pull request in e263dd3.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 24, 2019
Summary: Initial kernel support added for optimized NHWC tensor.
Pull Request resolved: pytorch/pytorch#24396

Differential Revision: D18115941

Pulled By: VitalyFedyunin

fbshipit-source-id: 57b4922b7bf308430ffe1406681f68629baf8834
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
Summary: Initial kernel support added for optimized NHWC tensor.
Pull Request resolved: pytorch#24396

Differential Revision: D18115941

Pulled By: VitalyFedyunin

fbshipit-source-id: 57b4922b7bf308430ffe1406681f68629baf8834