
Conversation

jjsjann123
Collaborator

Initial kernel support added for optimized NHWC tensors.

TODO: the backward kernel currently spits out a tensor with NHWC strides.
Unfortunately autograd restores the grad to contiguous (in either the copy or
the add path), which makes real perf tuning annoying to do, since I cannot
easily measure end-to-end time in my Python script.

My current kernel is blazing fast compared to the original NCHW kernel in fp16,
since I avoided atomicAdd. I'll finish perf tuning after we merge a future PR
expanding NHWC support in the core.
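
For reference, a minimal Python sketch of the kind of script described here (not taken from this PR; it assumes a build where the torch.channels_last memory format is available):

import torch
import torch.nn.functional as F

# Exercise the NHWC path: give adaptive_avg_pool2d a channels_last input,
# then run backward once.
x = torch.randn(128, 256, 64, 64, device="cuda", dtype=torch.half)
x = x.contiguous(memory_format=torch.channels_last).requires_grad_()

y = F.adaptive_avg_pool2d(x, (32, 32))
y.sum().backward()

# Per the TODO above, autograd may restore the grad to a contiguous (NCHW)
# layout, so this is not guaranteed to print True.
print(x.grad.is_contiguous(memory_format=torch.channels_last))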

@pytorchbot added the module: cuda (Related to torch.cuda, and CUDA support in general) and module: operators labels on Aug 15, 2019
@jjsjann123
Collaborator Author

Supporting #23403.
Tests will follow once my local build finishes after the rebase.
cc'ing @VitalyFedyunin @csarofeen @ptrblck

@VitalyFedyunin
Contributor

cc @ifedan

@ifedan
Contributor

ifedan commented Aug 15, 2019

@jjsjann123 Do you have any performance metrics? NCHW vs NHWC

@jjsjann123
Collaborator Author

jjsjann123 commented Aug 15, 2019

I do not have a general speedup number showing the perf improvement yet, because of the extra kernel I mentioned in this PR (hence perf tuning will be in a follow-up PR). I did run a sweep test and saw rough perf improvements, so it should be fine.

To give a specific data point: for an input of [128, 256, 64, 64] and an output of [128, 256, 32, 32], here are the kernel times. You can see a slight perf improvement for fp32, and roughly 2x on the fp16 backward kernel.

                    5.29%  7.5944ms         4  1.8986ms  1.7728ms  2.2706ms  void at::native::_GLOBAL__N__57_tmpxft_00002ce2_00000000_6_AdaptiveAveragePooling_cpp1_ii_eb1948c3::atomicadaptiveaveragegradinput<float>(float*, float, int, int, int, int)
                    3.86%  5.5528ms         4  1.3882ms  1.3844ms  1.3993ms  void at::native::_GLOBAL__N__57_tmpxft_00002ce2_00000000_6_AdaptiveAveragePooling_cpp1_ii_eb1948c3::adaptiveaveragegradinputnhwc<int, float>(float*, float, int, int, int, int, int, int, int, float*, float*, float*)
                    3.76%  5.4044ms         4  1.3511ms  1.3504ms  1.3522ms  void at::native::_GLOBAL__N__57_tmpxft_00002ce2_00000000_6_AdaptiveAveragePooling_cpp1_ii_eb1948c3::atomicadaptiveaveragegradinput<c10::Half>(c10::Half*, c10::Half, int, int, int, int)
                    3.33%  4.7840ms         4  1.1960ms  1.0737ms  1.5595ms  void at::native::_GLOBAL__N__57_tmpxft_00002ce2_00000000_6_AdaptiveAveragePooling_cpp1_ii_eb1948c3::adaptiveaveragepool<float>(float*, float, int, int, int, int, long, long, long)
                    2.93%  4.2095ms         4  1.0524ms  1.0517ms  1.0531ms  void at::native::_GLOBAL__N__57_tmpxft_00002ce2_00000000_6_AdaptiveAveragePooling_cpp1_ii_eb1948c3::adaptiveaveragepool<c10::Half>(c10::Half*, c10::Half, int, int, int, int, long, long, long)
                    2.85%  4.0876ms         4  1.0219ms  1.0175ms  1.0328ms  void at::native::_GLOBAL__N__57_tmpxft_00002ce2_00000000_6_AdaptiveAveragePooling_cpp1_ii_eb1948c3::adaptiveaveragepoolnhwc<int, c10::Half>(c10::Half*, c10::Half, int, int, int, int, int, int, int, c10::Half*, c10::Half*, c10::Half*)
                    2.75%  3.9492ms         4  987.31us  986.35us  988.31us  void at::native::_GLOBAL__N__57_tmpxft_00002ce2_00000000_6_AdaptiveAveragePooling_cpp1_ii_eb1948c3::adaptiveaveragepoolnhwc<int, float>(float*, float, int, int, int, int, int, int, int, float*, float*, float*)
                    1.71%  2.4591ms         4  614.76us  613.58us  616.78us  void at::native::_GLOBAL__N__57_tmpxft_00002ce2_00000000_6_AdaptiveAveragePooling_cpp1_ii_eb1948c3::adaptiveaveragegradinputnhwc<int, c10::Half>(c10::Half*, c10::Half, int, int, int, int, int, int, int, c10::Half*, c10::Half*, c10::Half*)

I haven't spent enough time looking at the forward kernel yet; I will revisit that in the later perf-tuning PR.
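
For reference, a hedged sketch of one way to collect per-kernel timings like the table above (the listing looks like nvprof output; this uses the autograd profiler instead, and only the shapes are taken from the example):

import torch
import torch.nn.functional as F

def run_once(dtype, memory_format):
    x = torch.randn(128, 256, 64, 64, device="cuda", dtype=dtype)
    x = x.contiguous(memory_format=memory_format).requires_grad_()
    F.adaptive_avg_pool2d(x, (32, 32)).sum().backward()

# Warm up once, then profile both layouts in fp32 and fp16.
run_once(torch.float, torch.contiguous_format)
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for dtype in (torch.float, torch.half):
        for fmt in (torch.contiguous_format, torch.channels_last):
            run_once(dtype, fmt)

print(prof.key_averages().table(sort_by="cuda_time_total"))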

@jjsjann123
Collaborator Author

jjsjann123 commented Aug 17, 2019

@ifedan Here is a graph showing the relative speedup compared to the PyTorch native kernel.
I could have arranged the graph in a better way (each config is a simple product of a few knobs), but overall there is a decent speedup for the backward kernel.

[image: relative speedup per configuration vs. the native NCHW kernel]
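
A hypothetical sketch of the kind of knob sweep behind a graph like this (the knob names and values below are made up for illustration; the x.grad = None reset is explained in the next comment):

import itertools
import time

import torch
import torch.nn.functional as F

def bench(n, c, hw, out, dtype, fmt, iters=20):
    x = torch.randn(n, c, hw, hw, device="cuda", dtype=dtype)
    x = x.contiguous(memory_format=fmt).requires_grad_()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        x.grad = None  # keep grad accumulation out of the measurement
        F.adaptive_avg_pool2d(x, out).sum().backward()
    torch.cuda.synchronize()
    return (time.time() - start) / iters

# Each config is a simple product of a few knobs (placeholder values).
for n, c, hw, out, dt in itertools.product((32, 128), (64, 256), (32, 64),
                                           (8, 32), (torch.float, torch.half)):
    nchw = bench(n, c, hw, out, dt, torch.contiguous_format)
    nhwc = bench(n, c, hw, out, dt, torch.channels_last)
    print(f"{n}x{c}x{hw}x{hw} -> {out}x{out} {dt}: {nchw / nhwc:.2f}x speedup")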

@jjsjann123
Collaborator Author

jjsjann123 commented Aug 17, 2019

Note that my benchmark was done via the following hack:

+++ b/torch/csrc/autograd/functions/accumulate_grad.cpp
@@ -43,7 +43,8 @@ auto AccumulateGrad::apply(variable_list&& grads) -> variable_list {
     // under following condition, we can avoid clone()
     if (!GradMode::is_enabled()
         && !new_grad.is_sparse()
-        && new_grad.is_contiguous()
+        //&& new_grad.is_contiguous(variable.get()->is_strides_like_channels_last()? at::MemoryFormat::ChannelsLast : at::MemoryFormat::Contiguous)
+        && new_grad.is_contiguous(at::MemoryFormat::ChannelsLast)
         && new_grad.use_count() <= 1 + !post_hooks().empty()) {
       // first check it is in first-order grad only mode
       // then check not sparse before is_contiguous

I have to explicitly set x.grad = None so it goes through this hacked code path during grad accumulation.
Otherwise there will be a TensorIterator kernel (either copy or add) that skews the measurement.
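
A small sketch of what that measurement loop might look like (assuming the hacked build from the diff above and the channels_last memory-format API):

import torch
import torch.nn.functional as F

x = torch.randn(128, 256, 64, 64, device="cuda")
x = x.contiguous(memory_format=torch.channels_last).requires_grad_()

for _ in range(10):
    # Reset the grad so AccumulateGrad takes the no-clone path from the hack
    # above instead of issuing an extra TensorIterator copy/add kernel.
    x.grad = None
    F.adaptive_avg_pool2d(x, (32, 32)).sum().backward()

# With the hack in place, the stored grad should keep its NHWC strides.
print(x.grad.stride())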

@pytorchbot added the module: nn (Related to torch.nn) label on Aug 17, 2019
@cpuhrsch added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label on Oct 11, 2019
@cpuhrsch requested a review from ifedan on October 11, 2019 07:05
@VitalyFedyunin
Contributor

Please rebase, we are getting ready to land it.

The previous kernel did not stride on the channel dimension, and it used shared
memory to store temporary results (to break data dependencies and expose more
parallelism).

This resulted in requesting more resources than are available.

Fix:
added striding on C to reduce shared-memory usage per CTA.
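
A rough back-of-the-envelope illustration of that resource problem (the per-channel-accumulator assumption and the numbers are illustrative, not taken from the kernel):

# Assume each CTA keeps one fp32 partial result per channel for each output
# row it works on; the shared-memory request then grows linearly with C.
BYTES_PER_FLOAT = 4
SMEM_LIMIT_PER_CTA = 48 * 1024  # common static shared-memory limit

def smem_bytes(channels, rows_per_cta=8):
    return channels * rows_per_cta * BYTES_PER_FLOAT

print(smem_bytes(2048), SMEM_LIMIT_PER_CTA)  # 65536 > 49152: launch would fail
print(smem_bytes(256))                       # 8192: a fixed C tile fits easily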
@jjsjann123
Collaborator Author

Rebased my code and cherry-picked the patch from #25102

Contributor

@VitalyFedyunin VitalyFedyunin left a comment

Trying to look at all call inputs. Also flagged some 'dev' practice issues (e.g. switch vs. if usage got changed).

@jjsjann123
Collaborator Author

Should have addressed all review comments. Feel free to take another look.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@VitalyFedyunin merged this pull request in e263dd3.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 24, 2019
Summary: Initial kernel support added for optimized NHWC tensor.
Pull Request resolved: pytorch/pytorch#24396

Differential Revision: D18115941

Pulled By: VitalyFedyunin

fbshipit-source-id: 57b4922b7bf308430ffe1406681f68629baf8834
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
Summary: Initial kernel support added for optimized NHWC tensor.
Pull Request resolved: pytorch#24396

Differential Revision: D18115941

Pulled By: VitalyFedyunin

fbshipit-source-id: 57b4922b7bf308430ffe1406681f68629baf8834