
implement sum over multiple dimensions (fixes #2006) #6152

Merged: 1 commit merged into pytorch:master on May 3, 2018

Conversation

@t-vi (Collaborator) commented Mar 30, 2018

Hello,

this implements summing over multiple dimensions as an ATen native function.

  • As IntList and int64_t are considered the same for the jit signatures, I handle the single-dimension case in the multi-dimension one by fast-tracking it.
  • In this context, for sum_out, I manually dispatch in ReductionOps.cpp instead of using native_function's mechanism.
  • The multiple-index version iterates over the one-dimensional op.
    I'll add a test and adapt the docs, but I'd appreciate feedback on the approach.

This patch addresses #2006 and would supersede #2116.
Of course, there are a ton of other ops (prod, mean, squeeze, unsqueeze) that could be handled similarly.
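For reference, a minimal Python sketch of the semantics this targets (sum_multi and x are illustrative names, not part of the patch; the actual implementation is the ATen native function described above):

import torch

def sum_multi(t, dims, keepdim=False):
    # Reduce one dimension at a time; keepdim=True inside the loop so the
    # remaining dim indices stay valid, then squeeze afterwards if requested.
    for d in dims:
        t = t.sum(d, keepdim=True)
    if not keepdim:
        for d in sorted(dims, reverse=True):  # assumes non-negative dims
            t = t.squeeze(d)
    return t

x = torch.randn(2, 3, 4)
print(torch.allclose(sum_multi(x, (2, 1)), x.sum(2).sum(1)))  # True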

Best regards

Thomas

@ssnl (Collaborator) commented Mar 30, 2018

Can you update the doc and add test cases for this please?

@ssnl (Collaborator) commented Mar 30, 2018

@pytorchbot test this please

@t-vi (Collaborator, Author) commented Mar 30, 2018

@pytorchbot retest this please

2 similar comments
@ssnl (Collaborator) commented Mar 30, 2018

@pytorchbot retest this please

@ezyang (Contributor) commented Mar 30, 2018

@pytorchbot retest this please

@apaszke (Contributor) left a review comment

(Not a complete review. Just a few comments)

@@ -27,6 +27,17 @@ static inline int64_t maybe_wrap_dim(int64_t dim, int64_t dim_post_expr, bool wr
return dim;
}

static inline std::vector<bool> dim_list_to_vector(IntList dims, int64_t ndims, bool wrap_scalar=true) {
std::vector<bool> seen(ndims, false);


}
}
size_t ndims = self.dim();
std::vector<bool> seen(ndims, false);



// MULTI DIM REDUCE ###########################################################

Tensor sum(const Tensor &self, IntList dims_, bool keepdim) {


@t-vi (Collaborator, Author) commented Apr 1, 2018

@pytorchbot retest this please

(can I do this?)

@apaszke (Contributor) commented Apr 1, 2018

@pytorchbot add to whitelist

@t-vi (Collaborator, Author) commented Apr 1, 2018

Hi,

I could use a hint how to resolve the ambiguity that the windows compile stumbles over (Error C2666 regarding the use of bitfield in WrapDimUtils.h).
@peterjc123 maybe?

Thank you

Thomas

@peterjc123 (Collaborator) commented:

A solution may be to define a flag in WrapDimUtils.h and make it wrap the division functions of the half tensors in THCHalfAutoNumerics.cuh.

}
size_t ndims = self.dim();
AT_ASSERT(ndims <= 64, "tensor dimension must be <= 64 for multiple dims");
std::bitset<64> seen;


// non-explicit half conversion in THCUNN/THCHalfAutoNumerics.cuh
// so this is host-code only

static inline std::bitset<64> dim_list_to_vector(IntList dims, int64_t ndims, bool wrap_scalar=true) {


@t-vi (Collaborator, Author) commented Apr 2, 2018

So at last the Windows build works, after moving the bitset-using functions into a different header (one that is not included by .cu files).
Thank you for your input!

constexpr size_t dim_bitset_size = 64;

static inline std::bitset<dim_bitset_size> dim_list_to_vector(IntList dims, int64_t ndims, bool wrap_scalar=true) {
AT_ASSERT(ndims <= (int64_t) dim_bitset_size, "tensor dimension must be <= %zu for multiple dims", dim_bitset_size);


for (size_t i = 0; i < dims.size(); i++) {
size_t dim = maybe_wrap_dim(dims[i], ndims);
if (seen[dim])
AT_ERROR("repeated dim");




AT_ERROR("repeated dim");
seen[dim] = true;
result = reduce_1(result, dim, true);
}


auto dim = maybe_wrap_dim(dims_[i], ndims);
if (seen[dim])
AT_ERROR("repeated dim in sum");
seen[dim] = true;


@@ -611,13 +611,10 @@
CPU: _sum_cpu
CUDA: _sum_cuda

- func: sum(Tensor self, int64_t dim, bool keepdim=False) -> Tensor
- func: sum(Tensor self, IntList[1] dim, bool keepdim=False) -> Tensor


@t-vi (Collaborator, Author) commented Apr 4, 2018

So far I have assumed that the user prescribes the order of summation. The obvious alternative is to reduce in ascending or descending dimension order by iterating over the bitset dims instead of the user-provided list dims_.

Even more radically, one could consider permuting+reshaping the reduced axes together. Then one would only sum once and not have intermediate results...
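For illustration, the permute+reshape alternative could look roughly like this sketch (sum_via_reshape is a hypothetical helper; keepdim handling is omitted):

import torch

def sum_via_reshape(t, dims):
    # Move the reduced dims to the end, flatten them into one axis,
    # and reduce once instead of iterating the one-dimensional op.
    dims = sorted(d % t.dim() for d in dims)
    keep = [d for d in range(t.dim()) if d not in dims]
    t = t.permute(*keep, *dims).contiguous()
    kept_sizes = [t.size(i) for i in range(len(keep))]
    return t.reshape(*kept_sizes, -1).sum(-1)

x = torch.randn(3, 4, 5)
print(torch.allclose(sum_via_reshape(x, (0, 2)), x.sum(2).sum(0)))  # True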

@apaszke (Contributor) left a review comment

LGTM, but I think there are still minor things that could be improved. Should be good to go after this

std::bitset<dim_bitset_size> seen;
for (size_t i = 0; i < dims.size(); i++) {
size_t dim = maybe_wrap_dim(dims[i], ndims);
AT_ASSERT(!seen[dim], "dim %zu appears multiple times in the list of reduced dims", dim);


return self;
}
size_t ndims = self.dim();
std::bitset<dim_bitset_size> seen = dim_list_to_bitset(dims_, ndims);


Tensor result = self;
for (size_t i = 0; i < dims_.size(); i++) {
size_t dim = maybe_wrap_dim(dims_[i], ndims);
result = reduce_1(result, dim, true);



template <Tensor (reduce_1)(const Tensor &, int64_t, bool),
Tensor& (reduce_1_out)(Tensor& result, const Tensor &, int64_t, bool)>
inline Tensor& reduce_multi_out(Tensor &result, const Tensor &self, IntList dims_, bool keepdim) {


res1 = torch.sum(x, (2, 1))
res2 = torch.Tensor()
torch.sum(x, (2, 1), out=res2)
self.assertEqual(res1, res2)


@t-vi (Collaborator, Author) commented Apr 6, 2018 via email

@t-vi (Collaborator, Author) commented Apr 9, 2018

So I looked into the handling of keeping or not keeping dimensions, per @apaszke's comment.
I make the working assumption that we want to use the given order of reductions (please correct me if you believe I should not). Then it would be possible to replace "keep first, squeeze later" if one distinguishes the two cases and converts the indices to incremental reductions. However, that is a bit elaborate, so in terms of code size (t-vi@48bb8d8) I think keeping it as-is is the most practical solution.
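For concreteness, the "convert the indices to incremental reductions" variant would be along these lines (an illustrative sketch, not the code in t-vi@48bb8d8):

import torch

def sum_no_keepdim(t, dims):
    # Reduce without keepdim by shifting each remaining index down by the
    # number of dimensions already removed (dims sorted ascending, so the
    # shift is simply the loop counter).
    for i, d in enumerate(sorted(d % t.dim() for d in dims)):
        t = t.sum(d - i)
    return t

x = torch.randn(2, 3, 4, 5)
print(sum_no_keepdim(x, (1, 3)).shape)  # torch.Size([2, 4])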

@apaszke (Contributor) commented Apr 9, 2018

I'm pretty sure we'll never really want to use the order in which the dimensions were given; take this as an example:

In [1]: x = torch.randn(100, 100, 100, 100)

In [2]: %timeit x.sum(3).sum(2).sum(1).sum(0)
9.44 ms ± 8.49 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %timeit x.sum(0).sum(0).sum(0).sum(0)
12 ms ± 7.99 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit x.view(-1).sum(0)
9.23 ms ± 15 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This makes sense, because reducing in order starting from the innermost dim is good for data locality, as earlier dimensions will have lower strides once you get to them. (NB: the difference only grows when I use fewer cores.)

An additional improvement would be to collapse the pseudo-contiguous pairs of dimensions (stride[i+1] == size[i] * stride[i]) into one. This is what happens in the last line, because such a procedure gives you a 1D tensor if the input is contiguous.

We can implement those later, but I feel like sorting dimensions would let us simplify the code a bit.
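A sketch of the sorting idea for sum (illustrative only; the sorted order replaces the user-given order):

import torch

def sum_sorted(t, dims):
    # Reduce starting from the innermost requested dim; going from the back
    # also keeps the earlier indices valid without keepdim bookkeeping.
    for d in sorted((d % t.dim() for d in dims), reverse=True):
        t = t.sum(d)
    return t

x = torch.randn(100, 100, 100, 100)
y = sum_sorted(x, (0, 1, 2, 3))  # same as x.sum(3).sum(2).sum(1).sum(0)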

@fmassa (Member) commented Apr 9, 2018

I have one comment about reduction over multiple axes: I think we should follow numpy behavior.

For many functions where the order of operations doesn't matter (like sum or mean), this is not a problem. But for other operations like median, there is a difference between performing the reduction over the different axes independently and doing transpose + reshape + operation.

It seems that numpy performs the operations differently than what is implemented here:

a = np.random.rand(3, 3, 3)

m1 = np.median(a, axis=[0, 2])

# perform a single median, after putting the
# reduction dimensions in together
m2 = np.median(a.transpose((1, 0, 2)).reshape(3, -1), axis=1)

# independently perform the reductions
# on each different axis
m3 = np.median(np.median(a, axis=0), axis=1)
m4 = np.median(np.median(a, axis=2), axis=0)

print (np.all(m1 == m2))  # True
print(np.all(m1 == m3))  # False
print(np.all(m1 == m4))  # False

It might be good to benchmark it, but I have the feeling that it might also be faster to perform permute + reshape + op, instead of n times performing op.

@apaszke (Contributor) commented Apr 9, 2018

@fmassa good point, the current implementation requires that the function is effectively commutative and associative, but that is OK for sum. From some very limited benchmarks it seems like we really want to take the current path for operations that have this property:

In [1]: x = torch.randn(10, 20, 100, 10, 100, 10)

In [2]: %timeit x.sum(2).sum(4)
37 ms ± 30.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit x.permute(0, 1, 3, 5, 2, 4).sum(-1).sum(-1)
96.8 ms ± 96 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit x.permute(0, 1, 3, 5, 2, 4).contiguous().view(10, 20, 10, 10, -1).sum(-1)
188 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@fmassa (Member) commented Apr 9, 2018

Indeed, in some cases the cost of permute + contiguous outweighs the single execution of op.
Might just be good to keep this subtlety in mind when extending multiple-axis support to other ops.

btw, I believe your first timeit should have been something like x.sum(2).sum(3) (because we remove the dimension by default). But still, this option is the fastest.

@apaszke (Contributor) commented Apr 9, 2018

Right, my code was incorrect in the first case, but changing it doesn't affect the final run time for me.

@t-vi (Collaborator, Author) commented Apr 9, 2018 via email

@ezyang (Contributor) commented Apr 17, 2018

CC @zdevito @apaszke @jamesr66a. I'm not sure you should try a different approach; it might just be that we need to support this in the JIT.

@apaszke (Contributor) left a review comment

Accidental update of the gloo submodule

@t-vi (Collaborator, Author) commented Apr 18, 2018

So while rebasing this, would you have a pointer to what * by itself does in native_functions.yaml? Suddenly there are a lot more sum operators. I can see this is for the dtype support, but I'm a bit lost as to what the detailed implications are (except that I could just ignore my ignorance and pass through what I don't understand)...

@apaszke (Contributor) commented Apr 18, 2018

* means that all arguments following it are keyword-only. The same syntax works in Python 3:

def f(arg1, arg2, *, kwonlyarg1):
    pass
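For example, with the definition above:

f(1, 2, kwonlyarg1=3)  # OK: the keyword-only argument is passed by name
f(1, 2, 3)             # TypeError: f() takes 2 positional arguments but 3 were given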

@t-vi (Collaborator, Author) commented Apr 18, 2018

So after a rebase against master:
The tests fail in test_jit.py. The reason is that TestJit.test_keyword jit-compiles torch.sum(x, dim=0, keepdim=False). The jit does not know about dim being declared IntList[1] (i.e. with automatic conversion from int64_t to IntList), but it would accept torch.sum(x, dim=[0], keepdim=False).
Interestingly, the behaviour seems to differ between passing the arguments positionally and as keywords:
jit-compiling torch.nn.functional.adaptive_avg_pool1d has a similar quirk, with (x, 0) being OK as arguments but (x, output_size=0) not, and (x, output_size=[0]) being OK but not (x, [0]).
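Spelled out, the quirk described above looks like this (behaviour of the jit at the time):

torch.nn.functional.adaptive_avg_pool1d(x, 0)                # accepted
torch.nn.functional.adaptive_avg_pool1d(x, output_size=0)    # rejected
torch.nn.functional.adaptive_avg_pool1d(x, output_size=[0])  # accepted
torch.nn.functional.adaptive_avg_pool1d(x, [0])              # rejected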

So there might be something to @ezyang 's comment that one might consider a change in the jit here.

Best regards

Thomas

@ezyang (Contributor) commented Apr 26, 2018

Repinging @zdevito, @apaszke, @jamesr66a on the JIT interaction

@t-vi (Collaborator, Author) commented Apr 26, 2018

The JIT interaction is solved by @elanmart 's #6965 (Thank you Marcin!).

@t-vi (Collaborator, Author) commented Apr 30, 2018

So now that the jit handles IntList[k] (thank you, @elanmart!), I'll look at refreshing the patch and also check that I don't cause a stark performance hit on 1-d.

@t-vi (Collaborator, Author) commented Apr 30, 2018

Comparing the PR with the master commit it is rebased on (6a55d86) does not show a difference in performance for a quick run of:

a = torch.randn(10,10,10,10)
%timeit b = a.sum(2)

(It gets warnings about caching, but I don't think that influences whether there is a measurable difference between before and after the PR.)
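For a slightly broader check, something along these lines could be used (an illustrative sketch assuming the new multi-dim overload from this PR; timings depend on the machine):

import timeit
import torch

a = torch.randn(10, 10, 10, 10)
t_multi = timeit.timeit(lambda: a.sum((2, 1)), number=1000)   # new multi-dim path
t_chain = timeit.timeit(lambda: a.sum(2).sum(1), number=1000) # chained 1-d sums
print(t_multi, t_chain)  # expect the two to be comparable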

@t-vi (Collaborator, Author) commented May 1, 2018

The failed build seems to say something about virtual memory. Is that me or the CI? My own build with gcc 7.3 and py 3.6 seems to work...
Edit: ...but with the merge conflict, I have a new chance anyway.

@ezyang merged commit 07513cf into pytorch:master on May 3, 2018
@bddppq (Contributor) commented May 3, 2018

@t-vi @ezyang
Our onnx integration tests have caught a memory error; could you take a look? https://ci.pytorch.org/jenkins/job/onnx-fb-universe-builds/job/py2-gcc5-ubuntu16.04/22776/console

The specific failed test case is here: https://github.com/onnxbot/onnx-fb-universe/blob/master/test/test_operators.py#L310

@t-vi (Collaborator, Author) commented May 3, 2018

@bddppq I can offer https://github.com/t-vi/pytorch/tree/fix_onnx_sum
Now I changed the signature where mypy failed.
