
implement sum over multiple dimensions (fixes #2006) #6152

Merged: 1 commit merged into pytorch:master on May 3, 2018

Conversation

@t-vi (Collaborator) commented Mar 30, 2018

Hello,

this implements summing over multiple dimensions as an ATen native function.

  • As IntList and int64_t are considered the same for the jit signatures, I handle the single-dimension case in the multi-dimension one by fast-tracking it.
  • In this context, for sum_out, I manually dispatch in ReductionOps.cpp instead of using native_function's mechanism.
  • The multiple-index version iterates over the one-dimensional op.
    I'll add a test and adapt the docs, but I'd appreciate feedback on the approach.

This patch addresses #2006 and would supersede #2116.
Of course, there are a ton of other ops (prod, mean, squeeze, unsqueeze) that could be handled similarly.
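For reference, a minimal Python sketch of the semantics this targets (sum_multi and x are illustrative names, not part of the patch; the actual implementation is the ATen native function described above):

import torch

def sum_multi(t, dims, keepdim=False):
    # Reduce one dimension at a time; keepdim=True inside the loop so the
    # remaining dim indices stay valid, then squeeze afterwards if requested.
    for d in dims:
        t = t.sum(d, keepdim=True)
    if not keepdim:
        for d in sorted(dims, reverse=True):  # assumes non-negative dims
            t = t.squeeze(d)
    return t

x = torch.randn(2, 3, 4)
print(torch.allclose(sum_multi(x, (2, 1)), x.sum(2).sum(1)))  # True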

Best regards

Thomas

@ssnl (Collaborator) commented Mar 30, 2018

Can you update the doc and add test cases for this please?

@ssnl (Collaborator) commented Mar 30, 2018

@pytorchbot test this please

@t-vi (Collaborator, Author) commented Mar 30, 2018

@pytorchbot retest this please

2 similar comments
@ssnl (Collaborator) commented Mar 30, 2018

@pytorchbot retest this please

@ezyang (Contributor) commented Mar 30, 2018

@pytorchbot retest this please

@apaszke (Contributor) left a review comment

(Not a complete review. Just a few comments)

@@ -27,6 +27,17 @@ static inline int64_t maybe_wrap_dim(int64_t dim, int64_t dim_post_expr, bool wr
return dim;
}

static inline std::vector<bool> dim_list_to_vector(IntList dims, int64_t ndims, bool wrap_scalar=true) {
std::vector<bool> seen(ndims, false);


}
}
size_t ndims = self.dim();
std::vector<bool> seen(ndims, false);



// MULTI DIM REDUCE ###########################################################

Tensor sum(const Tensor &self, IntList dims_, bool keepdim) {


@t-vi (Collaborator, Author) commented Apr 1, 2018

@pytorchbot retest this please

(can I do this?)

@apaszke (Contributor) commented Apr 1, 2018

@pytorchbot add to whitelist

@t-vi (Collaborator, Author) commented Apr 1, 2018

Hi,

I could use a hint how to resolve the ambiguity that the windows compile stumbles over (Error C2666 regarding the use of bitfield in WrapDimUtils.h).
@peterjc123 maybe?

Thank you

Thomas

@peterjc123 (Collaborator) commented:

A solution may be to define a flag in WrapDimUtils.h and make it wrap the division functions of the half tensors in THCHalfAutoNumerics.cuh.

}
size_t ndims = self.dim();
AT_ASSERT(ndims <= 64, "tensor dimension must be <= 64 for multiple dims");
std::bitset<64> seen;


// non-explicit half conversion in THCUNN/THCHalfAutoNumerics.cuh
// so this is host-code only

static inline std::bitset<64> dim_list_to_vector(IntList dims, int64_t ndims, bool wrap_scalar=true) {


@t-vi (Collaborator, Author) commented Apr 2, 2018

So at last the Windows build works, after moving the bitset-using functions into a different header (one that is not included by .cu files).
Thank you for your input!

constexpr size_t dim_bitset_size = 64;

static inline std::bitset<dim_bitset_size> dim_list_to_vector(IntList dims, int64_t ndims, bool wrap_scalar=true) {
AT_ASSERT(ndims <= (int64_t) dim_bitset_size, "tensor dimension must be <= %zu for multiple dims", dim_bitset_size);


for (size_t i = 0; i < dims.size(); i++) {
size_t dim = maybe_wrap_dim(dims[i], ndims);
if (seen[dim])
AT_ERROR("repeated dim");




AT_ERROR("repeated dim");
seen[dim] = true;
result = reduce_1(result, dim, true);
}


auto dim = maybe_wrap_dim(dims_[i], ndims);
if (seen[dim])
AT_ERROR("repeated dim in sum");
seen[dim] = true;


@@ -611,13 +611,10 @@
CPU: _sum_cpu
CUDA: _sum_cuda

- func: sum(Tensor self, int64_t dim, bool keepdim=False) -> Tensor
- func: sum(Tensor self, IntList[1] dim, bool keepdim=False) -> Tensor


@t-vi (Collaborator, Author) commented Apr 4, 2018

So far I have assumed that the user prescribes the order of summation. The obvious alternative is to reduce in ascending or descending dimension order by iterating over the bitset dims instead of the user-provided list dims_.

Even more radically, one could consider permuting+reshaping the reduced axes together. Then one would only sum once and not have intermediate results...
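For illustration, the permute+reshape alternative could look roughly like this sketch (sum_via_reshape is a hypothetical helper; keepdim handling is omitted):

import torch

def sum_via_reshape(t, dims):
    # Move the reduced dims to the end, flatten them into one axis,
    # and reduce once instead of iterating the one-dimensional op.
    dims = sorted(d % t.dim() for d in dims)
    keep = [d for d in range(t.dim()) if d not in dims]
    t = t.permute(*keep, *dims).contiguous()
    kept_sizes = [t.size(i) for i in range(len(keep))]
    return t.reshape(*kept_sizes, -1).sum(-1)

x = torch.randn(3, 4, 5)
print(torch.allclose(sum_via_reshape(x, (0, 2)), x.sum(2).sum(0)))  # True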

@apaszke (Contributor) left a review comment

LGTM, but I think there are still minor things that could be improved. Should be good to go after this

std::bitset<dim_bitset_size> seen;
for (size_t i = 0; i < dims.size(); i++) {
size_t dim = maybe_wrap_dim(dims[i], ndims);
AT_ASSERT(!seen[dim], "dim %zu appears multiple times in the list of reduced dims", dim);


return self;
}
size_t ndims = self.dim();
std::bitset<dim_bitset_size> seen = dim_list_to_bitset(dims_, ndims);


Tensor result = self;
for (size_t i = 0; i < dims_.size(); i++) {
size_t dim = maybe_wrap_dim(dims_[i], ndims);
result = reduce_1(result, dim, true);



template <Tensor (reduce_1)(const Tensor &, int64_t, bool),
Tensor& (reduce_1_out)(Tensor& result, const Tensor &, int64_t, bool)>
inline Tensor& reduce_multi_out(Tensor &result, const Tensor &self, IntList dims_, bool keepdim) {


res1 = torch.sum(x, (2, 1))
res2 = torch.Tensor()
torch.sum(x, (2, 1), out=res2)
self.assertEqual(res1, res2)


@t-vi (Collaborator, Author) commented Apr 6, 2018 via email

@t-vi (Collaborator, Author) commented Apr 9, 2018

So I looked into the handling of keeping or not keeping dimensions, per @apaszke's comment.
I make the working assumption that we want to use the given order of reductions (please correct me if you believe I should not). Then it would be possible to replace "keep first, squeeze later" if one distinguishes the two cases and converts the indices to incremental reductions. However, that is a bit elaborate, so in terms of code size (t-vi@48bb8d8) I think keeping it as-is is the most practical solution.
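For concreteness, the "convert the indices to incremental reductions" variant would be along these lines (an illustrative sketch, not the code in t-vi@48bb8d8):

import torch

def sum_no_keepdim(t, dims):
    # Reduce without keepdim by shifting each remaining index down by the
    # number of dimensions already removed (dims sorted ascending, so the
    # shift is simply the loop counter).
    for i, d in enumerate(sorted(d % t.dim() for d in dims)):
        t = t.sum(d - i)
    return t

x = torch.randn(2, 3, 4, 5)
print(sum_no_keepdim(x, (1, 3)).shape)  # torch.Size([2, 4])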

@apaszke (Contributor) commented Apr 9, 2018

I'm pretty sure we'll never really want to use the order in which the dimensions were given; take this as an example:

In [1]: x = torch.randn(100, 100, 100, 100)

In [2]: %timeit x.sum(3).sum(2).sum(1).sum(0)
9.44 ms ± 8.49 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %timeit x.sum(0).sum(0).sum(0).sum(0)
12 ms ± 7.99 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit x.view(-1).sum(0)
9.23 ms ± 15 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This makes sense, because reducing in order starting from the innermost dim is good for data locality, as earlier dimensions will have lower strides once you get to them. (NB: the difference only grows when I use fewer cores.)

An additional improvement would be to collapse the pseudo-contiguous pairs of dimensions (stride[i+1] == size[i] * stride[i]) into one. This is what happens in the last line, because such a procedure gives you a 1D tensor if the input is contiguous.

We can implement those later, but I feel like sorting dimensions would let us simplify the code a bit.
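A sketch of the sorting idea for sum (illustrative only; the sorted order replaces the user-given order):

import torch

def sum_sorted(t, dims):
    # Reduce starting from the innermost requested dim; going from the back
    # also keeps the earlier indices valid without keepdim bookkeeping.
    for d in sorted((d % t.dim() for d in dims), reverse=True):
        t = t.sum(d)
    return t

x = torch.randn(100, 100, 100, 100)
y = sum_sorted(x, (0, 1, 2, 3))  # same as x.sum(3).sum(2).sum(1).sum(0)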

@fmassa (Member) commented Apr 9, 2018

I have one comment about reduction over multiple axes: I think we should follow numpy behavior.

For many functions where the order of operations doesn't matter (like sum or mean), this is not a problem. But for other operations like median, there is a difference between performing the reduction over the different axes independently and doing transpose + reshape + operation.

It seems that numpy performs the operations differently than what is implemented here:

a = np.random.rand(3, 3, 3)

m1 = np.median(a, axis=[0, 2])

# perform a single median, after putting the
# reduction dimensions in together
m2 = np.median(a.transpose((1, 0, 2)).reshape(3, -1), axis=1)

# independently perform the reductions
# on each different axis
m3 = np.median(np.median(a, axis=0), axis=1)
m4 = np.median(np.median(a, axis=2), axis=0)

print (np.all(m1 == m2))  # True
print(np.all(m1 == m3))  # False
print(np.all(m1 == m4))  # False

It might be good to benchmark it, but I have the feeling that it might also be faster to perform permute + reshape + op, instead of n times performing op.

@apaszke (Contributor) commented Apr 9, 2018

@fmassa good point, the current implementation requires that the function is effectively commutative and associative, but that is OK for sum. From some very limited benchmarks it seems like we really want to take the current path for operations that have this property:

In [1]: x = torch.randn(10, 20, 100, 10, 100, 10)

In [2]: %timeit x.sum(2).sum(4)
37 ms ± 30.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit x.permute(0, 1, 3, 5, 2, 4).sum(-1).sum(-1)
96.8 ms ± 96 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit x.permute(0, 1, 3, 5, 2, 4).contiguous().view(10, 20, 10, 10, -1).sum(-1)
188 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@fmassa (Member) commented Apr 9, 2018

Indeed, in some cases the cost of permute + contiguous outweighs the single execution of op.
Might just be good to keep this subtlety in mind when extending multiple-axis support to other ops.

btw, I believe your first timeit should have been something like x.sum(2).sum(3) (because we remove the dimension by default). But still, this option is the fastest.

@apaszke (Contributor) commented Apr 9, 2018

Right, my code was incorrect in the first case, but changing it doesn't affect the final run time for me.

@t-vi (Collaborator, Author) commented Apr 9, 2018 via email

@ezyang (Contributor) commented Apr 17, 2018

CC @zdevito @apaszke @jamesr66a. I'm not sure you should try a different approach; it might just be that we need to support this in the JIT.

@apaszke (Contributor) left a review comment

Accidental update of the gloo submodule

@t-vi (Collaborator, Author) commented Apr 18, 2018

So while rebasing this, would you have a pointer to what * by itself does in native_functions.yaml? Suddenly there are a lot more sum operators. I can see this is for the dtype support, but I'm a bit lost as to what the detailed implications are (except that I could just ignore my ignorance and pass through what I don't understand)...

@apaszke (Contributor) commented Apr 18, 2018

* means that all arguments following it are keyword-only. The same syntax works in Python 3:

def f(arg1, arg2, *, kwonlyarg1):
    pass
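For example, with the definition above:

f(1, 2, kwonlyarg1=3)  # OK: the keyword-only argument is passed by name
f(1, 2, 3)             # TypeError: f() takes 2 positional arguments but 3 were given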

@t-vi (Collaborator, Author) commented Apr 18, 2018

So after a rebase against master:
The tests fail in test_jit.py. The reason is that TestJit.test_keyword jit-compiles torch.sum(x, dim=0, keepdim=False). The jit does not know about dim being declared IntList[1] (i.e. with automatic conversion from int64_t to IntList), but it would accept torch.sum(x, dim=[0], keepdim=False).
Interestingly, the behaviour seems to differ between passing the arguments positionally and as keywords:
jit-compiling torch.nn.functional.adaptive_avg_pool1d has a similar quirk, with (x, 0) being OK as arguments but (x, output_size=0) not, and (x, output_size=[0]) being OK but not (x, [0]).
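Spelled out, the quirk described above looks like this (behaviour of the jit at the time):

torch.nn.functional.adaptive_avg_pool1d(x, 0)                # accepted
torch.nn.functional.adaptive_avg_pool1d(x, output_size=0)    # rejected
torch.nn.functional.adaptive_avg_pool1d(x, output_size=[0])  # accepted
torch.nn.functional.adaptive_avg_pool1d(x, [0])              # rejected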

So there might be something to @ezyang 's comment that one might consider a change in the jit here.

Best regards

Thomas

@ezyang (Contributor) commented Apr 26, 2018

Repinging @zdevito, @apaszke, @jamesr66a on the JIT interaction

@t-vi (Collaborator, Author) commented Apr 26, 2018

The JIT interaction is solved by @elanmart 's #6965 (Thank you Marcin!).

@t-vi (Collaborator, Author) commented Apr 30, 2018

So now that the jit handles IntList[k] (thank you, @elanmart!), I'll look at refreshing the patch and also check that I don't cause a stark performance hit on 1-d.

@t-vi (Collaborator, Author) commented Apr 30, 2018

Comparing the PR with the master commit it is rebased on (6a55d86) does not show a difference in performance for a quick run of:

a = torch.randn(10,10,10,10)
%timeit b = a.sum(2)

(It gets warnings about caching, but I don't think that influences whether there is a measurable difference between before and after the PR.)
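For a slightly broader check, something along these lines could be used (an illustrative sketch assuming the new multi-dim overload from this PR; timings depend on the machine):

import timeit
import torch

a = torch.randn(10, 10, 10, 10)
t_multi = timeit.timeit(lambda: a.sum((2, 1)), number=1000)   # new multi-dim path
t_chain = timeit.timeit(lambda: a.sum(2).sum(1), number=1000) # chained 1-d sums
print(t_multi, t_chain)  # expect the two to be comparable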

@t-vi (Collaborator, Author) commented May 1, 2018

The failed build seems to say something about virtual memory. Is that me or the CI? My own build with gcc 7.3 and py 3.6 seems to work...
Edit: ...but with the merge conflict, I have a new chance anyway.

@ezyang merged commit 07513cf into pytorch:master on May 3, 2018
@bddppq (Contributor) commented May 3, 2018

@t-vi @ezyang
Our onnx integration tests have caught a memory error; could you take a look? https://ci.pytorch.org/jenkins/job/onnx-fb-universe-builds/job/py2-gcc5-ubuntu16.04/22776/console

The specific failed test case is here: https://github.com/onnxbot/onnx-fb-universe/blob/master/test/test_operators.py#L310

@t-vi (Collaborator, Author) commented May 3, 2018

@bddppq I can offer https://github.com/t-vi/pytorch/tree/fix_onnx_sum
Now I changed the signature where mypy failed.
