
Conversation

colesbury (Member)

This adds support for reductions like sum() and mul() to TensorIterator.
Performance is similar to the existing optimized CPU code, and generally
better than the existing CUDA kernels.

The templatized CUDA kernel requires fewer instantiations than the
existing THCReduce/THCReduceAll code. For example, sum() previously
generated 43 CUDA kernels; it now requires only one (larger) CUDA
kernel. I suspect this should reduce code size and compilation time,
but I haven't measured it.
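As a rough illustration of why one templated kernel can replace dozens of specializations (a minimal sketch under assumed names — `reduce_kernel` and `SumOp` are hypothetical, not the code in this PR): the op is a template parameter, so sum, prod, etc. share a single kernel body.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Hypothetical functor: swapping this type switches the reduction
// (sum, prod, max, ...) without writing another kernel body.
template <typename scalar_t>
struct SumOp {
  __device__ scalar_t operator()(scalar_t a, scalar_t b) const { return a + b; }
};

// One generic kernel, templated on scalar type and op. Each block writes
// a single partial result; a second pass combines the partials.
template <typename scalar_t, typename func_t>
__global__ void reduce_kernel(const scalar_t* in, scalar_t* out,
                              int64_t n, scalar_t ident, func_t op) {
  __shared__ scalar_t buf[256];  // assumes blockDim.x == 256
  scalar_t acc = ident;
  // Grid-stride loop: each thread folds its share of the input.
  for (int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x; i < n;
       i += (int64_t)gridDim.x * blockDim.x) {
    acc = op(acc, in[i]);
  }
  buf[threadIdx.x] = acc;
  __syncthreads();
  // Tree reduction within the block (blockDim.x must be a power of two).
  for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s)
      buf[threadIdx.x] = op(buf[threadIdx.x], buf[threadIdx.x + s]);
    __syncthreads();
  }
  if (threadIdx.x == 0) out[blockIdx.x] = buf[0];
}
```

In this shape, each reduction op costs one instantiation per scalar type, rather than one kernel per op/dimensionality/index-type combination as in THCReduce/THCReduceAll.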

Below are timings for sum() on [CPU](https://ark.intel.com/products/81908/Intel-Xeon-Processor-E5-2680-v3-30M-Cache-2_50-GHz) (12 threads and 1 thread) and CUDA with various tensor sizes.

CPU

| Reduction (dim)      | Master  | PR      | Master (1 thread) | PR (1 thread) |
|----------------------|---------|---------|-------------------|---------------|
| 1024x1024 (all)      | 22 us   | 34 us   | 136 us            | 147 us        |
| 1024x1024 (0)        | 30 us   | 28 us   | 160 us            | 160 us        |
| 1024x1024 (1)        | 25 us   | 25 us   | 171 us            | 146 us        |
| 1024x10x1024 (all)   | 542 us  | 550 us  | 4.14 ms           | 3.11 ms       |
| 1024x10x1024 (0)     | 658 us  | 690 us  | 6.80 ms           | 5.93 ms       |
| 1024x10x1024 (1)     | 761 us  | 757 us  | 3.34 ms           | 3.52 ms       |
| 1024x10x1024 (2)     | 538 us  | 545 us  | 3.73 ms           | 3.04 ms       |
| 1024x1024x1024 (all) | 72 ms   | 71 ms   | 364 ms            | 357 ms        |
| 1024x1024x1024 (0)   | 94 ms   | 90 ms   | 935 ms            | 927 ms        |
| 1024x1024x1024 (1)   | 80 ms   | 86 ms   | 881 ms            | 688 ms        |
| 1024x1024x1024 (2)   | 71 ms   | 71 ms   | 456 ms            | 354 ms        |

CUDA

| Reduction (dim)      | M40 base | M40 PR  | P100 base | P100 PR |
|----------------------|----------|---------|-----------|---------|
| 1024x10x1024 (all)   | 238 us   | 182 us  | 136 us    | 97 us   |
| 1024x10x1024 (0)     | 166 us   | 179 us  | 105 us    | 84 us   |
| 1024x10x1024 (1)     | 181 us   | 182 us  | 89 us     | 91 us   |
| 1024x10x1024 (2)     | 180 us   | 168 us  | 88 us     | 79 us   |
| 1024x1024x1024 (all) | 17.5 ms  | 16.4 ms | 8.23 ms   | 7.48 ms |
| 1024x1024x1024 (0)   | 27.2 ms  | 28.6 ms | 7.63 ms   | 7.38 ms |
| 1024x1024x1024 (1)   | 16.5 ms  | 16.3 ms | 7.66 ms   | 7.40 ms |
| 1024x1024x1024 (2)   | 17.8 ms  | 16.4 ms | 8.37 ms   | 7.31 ms |

Timings were generated with this script:
https://gist.github.com/colesbury/d3238b266d8a9872fe6f68f77619b379
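
The gist above (Python/PyTorch) is the authoritative benchmark. As a minimal sketch of how per-kernel CUDA timings like those in the table can be measured with CUDA events — assuming the hypothetical `reduce_kernel` and `SumOp` from the earlier sketch are in the same translation unit:

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

int main() {
  const int64_t n = 1024LL * 10 * 1024;
  const int threads = 256, blocks = 1024;
  float *in, *out;
  cudaMalloc(&in, n * sizeof(float));
  cudaMemset(in, 0, n * sizeof(float));
  cudaMalloc(&out, blocks * sizeof(float));

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Warm-up launch so the timed run excludes one-time setup costs.
  reduce_kernel<<<blocks, threads>>>(in, out, n, 0.0f, SumOp<float>{});

  cudaEventRecord(start);
  reduce_kernel<<<blocks, threads>>>(in, out, n, 0.0f, SumOp<float>{});
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);  // elapsed time on the GPU timeline
  printf("reduce: %.1f us\n", ms * 1000.0f);

  cudaFree(in);
  cudaFree(out);
  return 0;
}
```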

t-vi (Collaborator) commented Sep 21, 2018:

Awesome stuff! I can't wait to switch einsum and tensordot to this.


fmassa (Member) commented Sep 25, 2018:

ping @colesbury

colesbury force-pushed the tensor_iterator_sum branch 2 times, most recently from 59a787e to 18b9b19, on September 26, 2018 at 19:50.
facebook-github-bot (Contributor) left a comment:

colesbury has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

gchanan (Contributor) left a comment:

This looks great. I didn't look through all the math; just have a few questions about functionality.

```cpp
}

template <typename scalar_t, typename func_t>
struct Reduction {
// ...
```


```cpp
#include <sstream>

namespace at { namespace native { namespace {
// ...
```

```cpp
static ScalarType get_dtype(Tensor& result, const Tensor& self, optional<ScalarType> dtype,
                            bool promote_integers=false) {
  if (dtype.has_value()) {
    return dtype.value();
// ...
```

facebook-github-bot (Contributor) left a comment:

colesbury has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 25, 2018
Pull Request resolved: pytorch/pytorch#11908

Differential Revision: D10071760

Pulled By: colesbury

fbshipit-source-id: 40e37a0e6803f1628b94cc5a52a10dfbb601f3d6
ezyang added the merged label on Jun 26, 2019.