Add support for reductions to TensorIterator #11908
Awesome stuff! I can't wait to switch einsum and tensordot to this.
ping @colesbury
This adds support for reductions like sum() and mul() to TensorIterator. Performance is similar to existing optimized code for CPU, and generally better than existing code for CUDA kernels. The templatized CUDA kernel requires fewer instantiations than the existing THCReduce/THCReduceAll code. For example, sum() previously generated 43 CUDA kernels, while it now requires only one (larger) CUDA kernel. I suspect this should reduce code-size and compilation time, but I haven't measured it.
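For orientation only (this snippet is not part of the diff), these are the kinds of reductions affected, written against the ATen C++ API; the tensor shape mirrors one of the benchmark cases reported later in this thread:

```cpp
#include <ATen/ATen.h>

void example() {
  at::Tensor t = at::randn({1024, 10, 1024});
  at::Tensor all  = at::sum(t);     // reduce over all elements -> scalar tensor
  at::Tensor dim0 = at::sum(t, 0);  // reduce over dim 0 -> [10, 1024]
  at::Tensor dim1 = at::sum(t, 1);  // reduce over dim 1 -> [1024, 1024]
  at::Tensor dim2 = at::sum(t, 2);  // reduce over dim 2 -> [1024, 10]
}
```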
Force-pushes during review: aeb0b9d → 2156e2b; 59a787e → 18b9b19; 18b9b19 → 0d5e19b.
colesbury has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This looks great. I didn't look through all the math; just have a few questions about functionality.
aten/src/ATen/native/cuda/Reduce.cuh (outdated diff)
```cpp
}

template <typename scalar_t, typename func_t>
struct Reduction {
```
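For context, here is a minimal sketch (not the code in this diff) of why a functor-templated reduction needs only one instantiation per (scalar type, op) pair: the reduced extent is a runtime argument instead of being baked into many separately compiled kernels.

```cpp
// Hypothetical sketch, not the PR's kernel: one instantiation per
// (scalar_t, func_t) pair can reduce rows of any length, because `cols`
// is a runtime parameter.
template <typename scalar_t>
struct SumOp {
  __device__ scalar_t operator()(scalar_t a, scalar_t b) const { return a + b; }
};

template <typename scalar_t, typename func_t>
__global__ void reduce_rows(const scalar_t* in, scalar_t* out, int64_t cols,
                            func_t op, scalar_t ident) {
  // One block per output element. Assumes blockDim.x is a power of two <= 256.
  __shared__ scalar_t smem[256];
  int64_t row = blockIdx.x;
  scalar_t acc = ident;
  for (int64_t i = threadIdx.x; i < cols; i += blockDim.x) {
    acc = op(acc, in[row * cols + i]);
  }
  smem[threadIdx.x] = acc;
  __syncthreads();
  // Standard tree reduction of the per-thread partials in shared memory.
  for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) {
      smem[threadIdx.x] = op(smem[threadIdx.x], smem[threadIdx.x + s]);
    }
    __syncthreads();
  }
  if (threadIdx.x == 0) {
    out[row] = smem[0];
  }
}
// e.g. reduce_rows<<<rows, 256>>>(in, out, cols, SumOp<float>{}, 0.0f);
```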
```cpp
#include <sstream>

namespace at { namespace native { namespace {
```
```cpp
static ScalarType get_dtype(Tensor& result, const Tensor& self, optional<ScalarType> dtype,
                            bool promote_integers=false) {
  if (dtype.has_value()) {
    return dtype.value();
```
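The hunk is truncated here. A plausible completion of this helper, assuming the usual dtype-resolution logic (explicit dtype wins, then an already-defined result tensor, then the input's type with optional integer-to-long promotion); this is a sketch, not necessarily the exact diff:

```cpp
  } else if (result.defined()) {
    // A pre-allocated output fixes the dtype.
    return result.type().scalarType();
  }
  ScalarType src_type = self.type().scalarType();
  // For reductions like sum(), integral inputs are accumulated in int64
  // to avoid overflow (sketch of the promotion rule).
  if (promote_integers && at::isIntegralType(src_type)) {
    return kLong;
  }
  return src_type;
}
```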
colesbury has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: This adds support for reductions like sum() and mul() to TensorIterator. Performance is similar to existing optimized code for CPU, and generally better than existing code for CUDA kernels. The templatized CUDA kernel requires fewer instantiations than the existing THCReduce/THCReduceAll code. For example, sum() previously generated 43 CUDA kernels, while it now requires only one (larger) CUDA kernel. I suspect this should reduce code-size and compilation time, but I haven't measured it.

Below are timings for sum() on [CPU](https://ark.intel.com/products/81908/Intel-Xeon-Processor-E5-2680-v3-30M-Cache-2_50-GHz) (12 threads and 1 thread) and CUDA with various tensor sizes.

CPU

| Reduction (dim) | Master | PR | Master (1 thread) | PR (1 thread) |
|----------------------|---------|---------|-------------------|---------------|
| 1024x1024 (all) | 22 us | 34 us | 136 us | 147 us |
| 1024x1024 (0) | 30 us | 28 us | 160 us | 160 us |
| 1024x1024 (1) | 25 us | 25 us | 171 us | 146 us |
| 1024x10x1024 (all) | 542 us | 550 us | 4.14 ms | 3.11 ms |
| 1024x10x1024 (0) | 658 us | 690 us | 6.80 ms | 5.93 ms |
| 1024x10x1024 (1) | 761 us | 757 us | 3.34 ms | 3.52 ms |
| 1024x10x1024 (2) | 538 us | 545 us | 3.73 ms | 3.04 ms |
| 1024x1024x1024 (all) | 72 ms | 71 ms | 364 ms | 357 ms |
| 1024x1024x1024 (0) | 94 ms | 90 ms | 935 ms | 927 ms |
| 1024x1024x1024 (1) | 80 ms | 86 ms | 881 ms | 688 ms |
| 1024x1024x1024 (2) | 71 ms | 71 ms | 456 ms | 354 ms |

CUDA

| Reduction (dim) | M40 base | M40 PR | P100 base | P100 PR |
|----------------------|----------|---------|-----------|-----------|
| 1024x10x1024 (all) | 238 us | 182 us | 136 us | 97 us |
| 1024x10x1024 (0) | 166 us | 179 us | 105 us | 84 us |
| 1024x10x1024 (1) | 181 us | 182 us | 89 us | 91 us |
| 1024x10x1024 (2) | 180 us | 168 us | 88 us | 79 us |
| 1024x1024x1024 (all) | 17.5 ms | 16.4 ms | 8.23 ms | 7.48 ms |
| 1024x1024x1024 (0) | 27.2 ms | 28.6 ms | 7.63 ms | 7.38 ms |
| 1024x1024x1024 (1) | 16.5 ms | 16.3 ms | 7.66 ms | 7.40 ms |
| 1024x1024x1024 (2) | 17.8 ms | 16.4 ms | 8.37 ms | 7.31 ms |

Timings were generated with this script: https://gist.github.com/colesbury/d3238b266d8a9872fe6f68f77619b379

Pull Request resolved: pytorch/pytorch#11908
Differential Revision: D10071760
Pulled By: colesbury
fbshipit-source-id: 40e37a0e6803f1628b94cc5a52a10dfbb601f3d6
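The linked gist is the script actually used for the numbers above. Purely as an illustration of the methodology (not the real benchmark code), a minimal C++ timing loop for one of the CPU rows could look like this; the iteration count and output format are arbitrary:

```cpp
#include <ATen/ATen.h>
#include <chrono>
#include <cstdio>

int main() {
  at::Tensor t = at::randn({1024, 10, 1024});
  at::sum(t, 1);  // warm-up so allocation and dispatch are not timed
  constexpr int iters = 100;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    at::Tensor s = at::sum(t, 1);
  }
  auto end = std::chrono::steady_clock::now();
  double us = std::chrono::duration<double, std::micro>(end - start).count() / iters;
  std::printf("sum over dim 1: %.1f us per call\n", us);
  return 0;
}
```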