Add support for reductions to TensorIterator #11908
Awesome stuff! I can't wait to switch einsum and tensordot to this.
ping @colesbury
This adds support for reductions like sum() and mul() to TensorIterator. Performance is similar to existing optimized code for CPU, and generally better than existing code for CUDA kernels. The templatized CUDA kernel requires fewer instantiations than the existing THCReduce/THCReduceAll code. For example, sum() previously generated 43 CUDA kernels, while it now requires only one (larger) CUDA kernel. I suspect this should reduce code-size and compilation time, but I haven't measured it.
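For orientation only (this snippet is not part of the diff), these are the kinds of reductions affected, written against the ATen C++ API; the tensor shape mirrors one of the benchmark cases reported later in this thread:

```cpp
#include <ATen/ATen.h>

void example() {
  at::Tensor t = at::randn({1024, 10, 1024});
  at::Tensor all  = at::sum(t);     // reduce over all elements -> scalar tensor
  at::Tensor dim0 = at::sum(t, 0);  // reduce over dim 0 -> [10, 1024]
  at::Tensor dim1 = at::sum(t, 1);  // reduce over dim 1 -> [1024, 1024]
  at::Tensor dim2 = at::sum(t, 2);  // reduce over dim 2 -> [1024, 10]
}
```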
Force-pushes during review: aeb0b9d → 2156e2b; 59a787e → 18b9b19; 18b9b19 → 0d5e19b.
colesbury has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This looks great. I didn't look through all the math; just have a few questions about functionality.
aten/src/ATen/native/cuda/Reduce.cuh (outdated diff)
```cpp
}

template <typename scalar_t, typename func_t>
struct Reduction {
```
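For context, here is a minimal sketch (not the code in this diff) of why a functor-templated reduction needs only one instantiation per (scalar type, op) pair: the reduced extent is a runtime argument instead of being baked into many separately compiled kernels.

```cpp
// Hypothetical sketch, not the PR's kernel: one instantiation per
// (scalar_t, func_t) pair can reduce rows of any length, because `cols`
// is a runtime parameter.
template <typename scalar_t>
struct SumOp {
  __device__ scalar_t operator()(scalar_t a, scalar_t b) const { return a + b; }
};

template <typename scalar_t, typename func_t>
__global__ void reduce_rows(const scalar_t* in, scalar_t* out, int64_t cols,
                            func_t op, scalar_t ident) {
  // One block per output element. Assumes blockDim.x is a power of two <= 256.
  __shared__ scalar_t smem[256];
  int64_t row = blockIdx.x;
  scalar_t acc = ident;
  for (int64_t i = threadIdx.x; i < cols; i += blockDim.x) {
    acc = op(acc, in[row * cols + i]);
  }
  smem[threadIdx.x] = acc;
  __syncthreads();
  // Standard tree reduction of the per-thread partials in shared memory.
  for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) {
      smem[threadIdx.x] = op(smem[threadIdx.x], smem[threadIdx.x + s]);
    }
    __syncthreads();
  }
  if (threadIdx.x == 0) {
    out[row] = smem[0];
  }
}
// e.g. reduce_rows<<<rows, 256>>>(in, out, cols, SumOp<float>{}, 0.0f);
```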
```cpp
#include <sstream>

namespace at { namespace native { namespace {
```
```cpp
static ScalarType get_dtype(Tensor& result, const Tensor& self, optional<ScalarType> dtype,
                            bool promote_integers=false) {
  if (dtype.has_value()) {
    return dtype.value();
```
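The hunk is truncated here. A plausible completion of this helper, assuming the usual dtype-resolution logic (explicit dtype wins, then an already-defined result tensor, then the input's type with optional integer-to-long promotion); this is a sketch, not necessarily the exact diff:

```cpp
  } else if (result.defined()) {
    // A pre-allocated output fixes the dtype.
    return result.type().scalarType();
  }
  ScalarType src_type = self.type().scalarType();
  // For reductions like sum(), integral inputs are accumulated in int64
  // to avoid overflow (sketch of the promotion rule).
  if (promote_integers && at::isIntegralType(src_type)) {
    return kLong;
  }
  return src_type;
}
```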
colesbury has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: This adds support for reductions like sum() and mul() to TensorIterator. Performance is similar to existing optimized code for CPU, and generally better than existing code for CUDA kernels. The templatized CUDA kernel requires fewer instantiations than the existing THCReduce/THCReduceAll code. For example, sum() previously generated 43 CUDA kernels, while it now requires only one (larger) CUDA kernel. I suspect this should reduce code-size and compilation time, but I haven't measured it.

Below are timings for sum() on [CPU](https://ark.intel.com/products/81908/Intel-Xeon-Processor-E5-2680-v3-30M-Cache-2_50-GHz) (12 threads and 1 thread) and CUDA with various tensor sizes.

CPU

| Reduction (dim) | Master | PR | Master (1 thread) | PR (1 thread) |
|----------------------|---------|---------|-------------------|---------------|
| 1024x1024 (all) | 22 us | 34 us | 136 us | 147 us |
| 1024x1024 (0) | 30 us | 28 us | 160 us | 160 us |
| 1024x1024 (1) | 25 us | 25 us | 171 us | 146 us |
| 1024x10x1024 (all) | 542 us | 550 us | 4.14 ms | 3.11 ms |
| 1024x10x1024 (0) | 658 us | 690 us | 6.80 ms | 5.93 ms |
| 1024x10x1024 (1) | 761 us | 757 us | 3.34 ms | 3.52 ms |
| 1024x10x1024 (2) | 538 us | 545 us | 3.73 ms | 3.04 ms |
| 1024x1024x1024 (all) | 72 ms | 71 ms | 364 ms | 357 ms |
| 1024x1024x1024 (0) | 94 ms | 90 ms | 935 ms | 927 ms |
| 1024x1024x1024 (1) | 80 ms | 86 ms | 881 ms | 688 ms |
| 1024x1024x1024 (2) | 71 ms | 71 ms | 456 ms | 354 ms |

CUDA

| Reduction (dim) | M40 base | M40 PR | P100 base | P100 PR |
|----------------------|----------|---------|-----------|-----------|
| 1024x10x1024 (all) | 238 us | 182 us | 136 us | 97 us |
| 1024x10x1024 (0) | 166 us | 179 us | 105 us | 84 us |
| 1024x10x1024 (1) | 181 us | 182 us | 89 us | 91 us |
| 1024x10x1024 (2) | 180 us | 168 us | 88 us | 79 us |
| 1024x1024x1024 (all) | 17.5 ms | 16.4 ms | 8.23 ms | 7.48 ms |
| 1024x1024x1024 (0) | 27.2 ms | 28.6 ms | 7.63 ms | 7.38 ms |
| 1024x1024x1024 (1) | 16.5 ms | 16.3 ms | 7.66 ms | 7.40 ms |
| 1024x1024x1024 (2) | 17.8 ms | 16.4 ms | 8.37 ms | 7.31 ms |

Timings were generated with this script: https://gist.github.com/colesbury/d3238b266d8a9872fe6f68f77619b379

Pull Request resolved: pytorch/pytorch#11908
Differential Revision: D10071760
Pulled By: colesbury
fbshipit-source-id: 40e37a0e6803f1628b94cc5a52a10dfbb601f3d6
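The linked gist is the script actually used for the numbers above. Purely as an illustration of the methodology (not the real benchmark code), a minimal C++ timing loop for one of the CPU rows could look like this; the iteration count and output format are arbitrary:

```cpp
#include <ATen/ATen.h>
#include <chrono>
#include <cstdio>

int main() {
  at::Tensor t = at::randn({1024, 10, 1024});
  at::sum(t, 1);  // warm-up so allocation and dispatch are not timed
  constexpr int iters = 100;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    at::Tensor s = at::sum(t, 1);
  }
  auto end = std::chrono::steady_clock::now();
  double us = std::chrono::duration<double, std::micro>(end - start).count() / iters;
  std::printf("sum over dim 1: %.1f us per call\n", us);
  return 0;
}
```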