
[sparse] torch.sparse.sum() #12430

Closed · wants to merge 7 commits

Conversation

Contributor

@weiyangfb weiyangfb commented Oct 7, 2018

  • to fix torch.sum() for sparse tensor #12241
  • add _sparse_sum() to ATen and expose it as torch.sparse.sum(); SparseTensor.sum() is not supported currently (a short usage sketch appears at the end of this description)
  • this PR depends on [sparse] Autograd get_indices/values and sparse_coo ctor #11253, and will need to be updated once it lands
  • implement forward
  • implement backward
  • performance benchmark script (https://gist.github.com/weiyangfb/f4c55c88b6092ef8f7e348f6b9ad8946#file-sparse_sum_benchmark-py):
    • summing all dims is fastest for a sparse tensor
    • when the input is sparse enough (nnz = 0.1%), summing a sparse tensor is faster than a dense one on CPU, but not necessarily on CUDA
    • CUDA backward is comparable (<2x) between summing several dims and summing all dims in sparse
    • CPU backward, which uses binary search, is still slow in sparse: summing dims [0, 2, 3] takes 5x the time of summing all dims
      • optimize CUDA backward for now
        • using thrust for sort and binary search, but the runtime did not improve
    • both CPU and CUDA forward are slow in sparse (summing several dims vs summing all dims): at most 20x slower on CPU and 10x on CUDA
      • improve CPU and CUDA forward kernels
(nnz, sizes, sum_dims, keepdim, sum all or dims, bk=backward) | CPU (sparse vs dense) | CUDA (sparse vs dense)
-- | -- | --
(1000, [1000, 1000, 2, 2], [0, 1], False, sumAll) | 8.77 µs vs 72.9 µs | 42.5 µs vs 108 µs
(1000, [1000, 1000, 2, 2], [0, 1], False, sumD) | 112 µs vs 4.47 ms | 484 µs vs 407 µs
(1000, [1000, 1000, 2, 2], [0, 1], False, sumAll, bk) | 141 µs vs 148 µs | 647 µs vs 231 µs
(1000, [1000, 1000, 2, 2], [0, 1], False, sumD, bk) | 235 µs vs 1.23 ms | 781 µs vs 213 µs
(1000, [1000, 1000, 2, 2], [2, 3], False, sumD) | 48.5 µs vs 360 µs | 160 µs vs 2.03 ms
(1000, [1000, 1000, 2, 2], [2, 3], False, sumD, bk) | 258 µs vs 1.22 ms | 798 µs vs 224 µs
(1000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD) | 204 µs vs 882 µs | 443 µs vs 133 µs
(1000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD, bk) | 709 µs vs 1.15 ms | 893 µs vs 202 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumAll) | 39.8 µs vs 81 µs | 42.4 µs vs 113 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumD) | 747 µs vs 4.7 ms | 2.4 ms vs 414 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumAll, bk) | 1.04 ms vs 126 µs | 5.03 ms vs 231 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumD, bk) | 1.12 ms vs 1.24 ms | 5.99 ms vs 213 µs
(10000, [1000, 1000, 2, 2], [2, 3], False, sumD) | 133 µs vs 366 µs | 463 µs vs 2.03 ms
(10000, [1000, 1000, 2, 2], [2, 3], False, sumD, bk) | 1.56 ms vs 1.22 ms | 6.11 ms vs 229 µs
(10000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD) | 1.53 ms vs 799 µs | 824 µs vs 134 µs
(10000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD, bk) | 5.15 ms vs 1.09 ms | 7.02 ms vs 205 µs
  • after improving the CPU and CUDA forward kernels:
    • in the (1000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD) forward, CPU takes 171 µs, of which 130 µs is spent in coalesce(); on CUDA the total time is 331 µs, of which 141 µs is spent in coalesce(). We need to reduce the time spent outside coalesce().
    • after a few simple tweaks, the forward is now at most 10x slower on CPU and 7x slower on CUDA. Summing the dense dims only ([2, 3]) takes ~2x the time of summing all dims, and summing all sparse dims ([0, 1]) is on par with summing all dims.
(nnz, sizes, sum_dims, keepdim, sum all or dims, bk=backward) | CPU (sparse vs dense) | CUDA (sparse vs dense)
-- | -- | --
(1000, [1000, 1000, 2, 2], [0, 1], False, sumAll) | 7 µs vs 69.5 µs | 31.5 µs vs 61.6 µs
(1000, [1000, 1000, 2, 2], [0, 1], False, sumD) | 11.3 µs vs 4.72 ms | 35.2 µs vs 285 µs
(1000, [1000, 1000, 2, 2], [0, 1], False, sumAll, bk) | 197 µs vs 124 µs | 857 µs vs 134 µs
(1000, [1000, 1000, 2, 2], [0, 1], False, sumD, bk) | 124 µs vs 833 µs | 796 µs vs 106 µs
(1000, [1000, 1000, 2, 2], [2, 3], False, sumD) | 20.5 µs vs 213 µs | 39.4 µs vs 1.24 ms
(1000, [1000, 1000, 2, 2], [2, 3], False, sumD, bk) | 131 µs vs 830 µs | 881 µs vs 132 µs
(1000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD) | 95.8 µs vs 409 µs | 246 µs vs 87.2 µs
(1000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD, bk) | 624 µs vs 820 µs | 953 µs vs 124 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumAll) | 45.3 µs vs 72.9 µs | 33.9 µs vs 57.2 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumD) | 81.4 µs vs 4.49 ms | 39.7 µs vs 280 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumAll, bk) | 984 µs vs 111 µs | 6.41 ms vs 121 µs
(10000, [1000, 1000, 2, 2], [0, 1], False, sumD, bk) | 1.45 ms vs 828 µs | 6.77 ms vs 113 µs
(10000, [1000, 1000, 2, 2], [2, 3], False, sumD) | 74.9 µs vs 209 µs | 37.7 µs vs 1.23 ms
(10000, [1000, 1000, 2, 2], [2, 3], False, sumD, bk) | 1.48 ms vs 845 µs | 6.96 ms vs 132 µs
(10000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD) | 1.14 ms vs 411 µs | 252 µs vs 87.8 µs
(10000, [1000, 1000, 2, 2], [0, 2, 3], False, sumD, bk) | 4.53 ms vs 851 µs | 7.12 ms vs 128 µs
  • the CUDA backward of sparse takes a very long time with large variance (for nnz = 10000 it normally takes 6-7 ms). To improve the backward of sparse ops, we will need to debug places other than the CUDA kernels. Here is a benchmark of Tensor.copy_():
>>> d = [1000, 1000, 2, 2]
>>> nnz = 10000
>>> I = torch.cat([torch.randint(0, d[0], size=(nnz,)), 
               torch.randint(0, d[1], size=(nnz,))], 0).reshape(2, nnz)
>>> V = torch.randn(nnz, d[2], d[3])
>>> size = torch.Size(d)
>>> S = torch.sparse_coo_tensor(I, V, size).coalesce().cuda()
>>> S2 = torch.sparse_coo_tensor(I, V, size).coalesce().cuda().requires_grad_()
>>> data = S2.clone()
>>> S.copy_(S2)
>>> y = S * 2
>>> torch.cuda.synchronize()
>>> %timeit y.backward(data, retain_graph=True); torch.cuda.synchronize()
7.07 ms ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 1000 loops each)
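
For reference, here is a minimal usage sketch of the torch.sparse.sum() API added by this PR (the shapes and values are arbitrary and only illustrative; the result stays sparse unless all sparse dims are summed):

```python
import torch

# A small COO tensor with 2 sparse dims (3 x 4) and 1 dense dim of size 2.
i = torch.tensor([[0, 1, 1],
                  [2, 0, 2]])
v = torch.randn(3, 2)
S = torch.sparse_coo_tensor(i, v, size=(3, 4, 2))

total = torch.sparse.sum(S)              # sum over all dims -> 0-dim dense tensor
part  = torch.sparse.sum(S, dim=[1, 2])  # sparse dim 0 survives -> sparse, shape (3,)
dense = torch.sparse.sum(S, dim=[0, 1])  # all sparse dims summed -> dense, shape (2,)
```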

Tensor sparse_sum(const SparseTensor& t, IntList dims, bool keepdim) {

const int64_t total_dims = t.dim();
check_dims_errors(dims, total_dims);

// check if number of axis in dim is valid
AT_CHECK(flip_dims_size > 0 && flip_dims_size <= total_dims,
"flip dims size out of range, got flip dims size=", flip_dims_size);
static inline void check_dims_errors(IntList dims, int64_t total_dims) {

@@ -1790,6 +1790,23 @@
SparseCPU: norm_sparse
SparseCUDA: norm_sparse

# TODO: reduce signatures down to one when optional args are available

apaszke previously requested changes Oct 9, 2018
Contributor

@apaszke apaszke left a comment


Why are we adding a new function for this instead of simply using torch.sum?

- func: sparse_sum(Tensor self, *, ScalarType dtype) -> Tensor
variant: method, function

- func: sparse_sum(Tensor self, IntList[1] dims, bool keepdim) -> Tensor

}

Tensor sparse_sum(const SparseTensor& t, ScalarType dtype) {
return t._values().sum().to(dtype);

@weiyangfb
Contributor Author

@apaszke It is for the sake of autograd support. Ops prefixed with sparse_ will have autograd support, with gradients zeroed out at zero-valued input locations during the backward. But yes, I can also provide sum(SparseTensor) without backward.
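
As an illustration of the behavior described above, a minimal sketch using the torch.sparse.sum() API from this PR (indices and values are arbitrary; the comments describe the intended gradient semantics, not a guaranteed storage format):

```python
import torch

i = torch.tensor([[0, 2], [1, 3]])
v = torch.tensor([1.0, 2.0])
S = torch.sparse_coo_tensor(i, v, size=(4, 4)).coalesce().requires_grad_()

out = torch.sparse.sum(S)   # summing all dims returns a 0-dim dense tensor
out.backward()

# The gradient lives only at S's stored (nnz) locations; entries of the dense
# 4x4 tensor that were never materialized receive no gradient at all.
print(S.grad)
```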

@apaszke
Contributor

apaszke commented Oct 9, 2018

Why is zeroing those gradients a good thing? That's not the real gradient of this operation.

@weiyangfb
Contributor Author

@apaszke I agree, zeroing out gradients may not make sense in my application, but it might be useful for other use cases, such as Graph Networks @AntoinePrv (#10043)

@apaszke
Contributor

apaszke commented Oct 10, 2018

I see the use case, but I don't feel like adding a sparse_X variant for every single operation we have is a good solution. The right way to do it would be to simply have some kind of specialized masked tensor type, which also holds on to a mask specifying which entries are valid (where the mask can, of course, be represented in a sparse way). That's what @AntoinePrv originally proposed, and it is very reasonable.

@weiyangfb
Contributor Author

@apaszke OK, I think the concern here is mainly about the proliferation of sparse_X ops. To address it, how about adding a sparse=bool arg to every op that supports SparseTensor? Similar to embedding, sparse=True would indicate sparse gradients during the backward. As for the masked tensor representation, it is more of one possible way to implement the backward, since it might not be universally optimal for all ops.

@ezyang
Contributor

ezyang commented Oct 10, 2018

@apaszke Are you recommending that we have a new backend, SparseMaskedTensor? Even if we do it that way, you still shouldn't call the operation sum, because it's not a sum. I suppose that if SparseMaskedTensor is just a Python-level only wrapper on top of tensor and not actually a tensor, you're allowed to do that, but if you want it to be an honest to goodness Tensor you have to respect our semantics which say, at the very least, that the derivative of all operations called "sum" should be the same.

@apaszke
Contributor

apaszke commented Oct 11, 2018

@weiyangfb adding a sparse flag to every single operation we have is having a tail wag the dog.

@ezyang why is that not a sum? I don't understand. If the sparsity pattern encodes the "valid entries" on which you want to do the compute, then it's exactly the sum. In this case you shouldn't think of it as a tensor with zeros defined everywhere, but as one holding sentinel values that always act as neutral elements of every operation. I don't know if I'm proposing a backend, because "backend" is heavily overloaded in our vocabulary. What I want is just a tensor-like object that overloads operations which we already have.

Of course the new type would be some kind of wrapper around other tensor operations, and in cases where something can't be expressed using regular tensor math, it can always desugar to more specialized implementations in the backend (think something like sparse_sum, except that would be an internal function). However, none of this should be of concern to the user. Otherwise, we'll end up with 100000 variations of every function for every weird quirk that people want, and both maintenance and discoverability will be a problem.
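
Purely to illustrate the proposal above, a toy Python-level wrapper could look like the sketch below. The MaskedSparseTensor class and everything about it are hypothetical, invented for this example; it is not part of this PR or of PyTorch.

```python
import torch

class MaskedSparseTensor:
    """Toy wrapper: the stored entries of a COO tensor are the only 'valid'
    ones; reductions ignore everything else instead of treating it as zero."""

    def __init__(self, sparse):
        self.data = sparse.coalesce()

    def sum(self, dim=None):
        if dim is None:
            # Reduce only over the stored values; unstored entries never
            # enter the computation at all.
            return self.data.values().sum()
        # Desugar to a specialized sparse reduction for per-dim sums.
        return torch.sparse.sum(self.data, dim=dim)

# Same spelling as dense `t.sum()`, with no sparse_-prefixed API exposed to users.
i = torch.tensor([[0, 1], [1, 2]])
v = torch.tensor([3.0, 4.0])
m = MaskedSparseTensor(torch.sparse_coo_tensor(i, v, size=(3, 3)))
print(m.sum())          # tensor(7.)
print(m.sum(dim=[0]))   # sparse tensor of shape (3,)
```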

@weiyangfb
Contributor Author

@apaszke Thanks for the clarification! I think your proposal sounds reasonable to me now. One important point is to define the zero locations of a SparseTensor as masked locations, i.e., not involved in computations during either forward or backward. This allows SparseTensor ops to share the same names as dense tensor ops while supporting autograd. A new type sounds fine to me. Is it meant to replace the current dispatch mechanism for SparseTensor?

@weiyangfb
Contributor Author

Hi @apaszke, would you like to comment on this further so that I can unblock this work? Thanks!

@apaszke
Contributor

apaszke commented Oct 14, 2018

I'm not sure what the current dispatch for sparse tensors is. My point is that we should avoid increasing our API surface by duplicating every single function with a sparse_ prefix. If this is blocking a few models then we can merge this PR, but I'd like us to clean this mess up in the future (by deprecating these functions and replacing them with a new tensor type).

@soumith
Member

soumith commented Oct 15, 2018

We could put them in a torch.sparse.* namespace, for example torch.sparse.sum, to clearly indicate that the gradients are sparsified approximations, and not true gradients.

We can clearly declare this at the top-level doc page for torch.sparse.*.
I think that's the best way forward. @apaszke thoughts?

@apaszke
Contributor

apaszke commented Oct 15, 2018

@soumith I don't really care if they're called torch.sparse_sum or torch.sparse.sum (although it does seem slightly nicer to put it in a submodule). My point is that it's not the right API for this kind of thing, and should get deprecated sooner or later.

@weiyangfb
Contributor Author

@apaszke The current dispatch for SparseTensor relies on files like pytorch/aten/src/ATen/native/LegacyBridge.cpp, which dispatch to different backends depending on the type of the input tensor. Without introducing a new type for SparseTensor, we can still keep the API surface the same by relying on the current dispatch along with the masked-positions rule (masked positions are not involved in computations in either forward or backward). This way we can support autograd for sparse with the same API names. On the other hand, a new type is welcome if it makes more sense than the existing dispatching mechanism.

@apaszke
Contributor

apaszke commented Oct 17, 2018

I thought you meant that we would want to have both sum and sparse_sum, where the latter returns the incorrect gradient. In that case we need an entirely new sparse tensor type (which says that "zero" entries are not really zero, but "invalid").

@weiyangfb
Contributor Author

@apaszke actually I meant to keep sum() only. But yes, it will have a backward, and zero entries are treated as invalid. We will need to define the "correctness" of the gradient for sparse tensors in this way.

@apaszke
Contributor

apaszke commented Oct 18, 2018

Hmmm, I'm not sure we want this to be the default meaning of a sparse tensor. In my view they should really be a drop-in replacement for dense semantics, except that we would be optimizing most entries out for space reasons. On the other hand, allocating large dense tensors as their gradients might OOM too...

@weiyangfb
Contributor Author

Hi @ezyang, I think the PR is good for review again :) Thanks!

@weiyangfb
Contributor Author

@soumith can I get a review?

Member

@soumith soumith left a comment


I'm just getting back up to speed with sparse, so parts of my review are probably ill-informed.
I left some comments in-line.

Still to review: the CUDA implementation and the backward.

@@ -1 +1,92 @@
# The Tensor classes are added to this module by python_tensor.cpp
import torch

// Ex2:
// dims_to_flatten = [1]
// new_indices = [ 3, 1, 3 ] # uncoalesced
inline LongTensor flatten_indices_by_dims(const LongTensor& indices, const IntList& sizes, const IntList& dims_to_flatten){
  LongTensor new_indices = at::zeros({indices.size(1)}, indices.options());
  for (auto d : dims_to_flatten) {
    new_indices.mul_(sizes[d]);
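
A rough Python rendering of the flattening loop above, just to make the row-major folding explicit (an illustrative sketch, not the ATen implementation):

```python
import torch

def flatten_indices_by_dims(indices, sizes, dims_to_flatten):
    # Fold the selected index rows into one linear index, row-major:
    # new = (... (0 * s_d0 + i_d0) * s_d1 + i_d1 ...) over dims_to_flatten.
    new_indices = torch.zeros(indices.size(1), dtype=indices.dtype)
    for d in dims_to_flatten:
        new_indices = new_indices * sizes[d] + indices[d]
    return new_indices  # may be uncoalesced (duplicates possible)

# Ex2 from the comment above: flatten only dim 1.
indices = torch.tensor([[1, 0, 1],
                        [3, 1, 3]])
print(flatten_indices_by_dims(indices, sizes=[2, 4], dims_to_flatten=[1]))
# tensor([3, 1, 3])
```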

// for ops like sum, max, and min.
// --------------------------------------------------------------------
Tensor _sparse_sum(const SparseTensor& input) {
return input.coalesce().values().sum();

else {
if (keepdim) {
new_indices = at::zeros_like(indices);
if (!sum_all_sparse_dim) {

}
}
else {
if (sum_all_sparse_dim) {

new_indices = at::zeros_like(indices);
if (!sum_all_sparse_dim) {
for (int64_t d = 0; d < sparse_dim; d++) {
if (!dims_to_sum_b[d]) new_indices[d].copy_(indices[d]);

@skipIfRocm
def test_sparse_sum(self):

def run_tests(S, td=None, k=False):

S_grad_dense = S_grad.to_dense() if S_grad.is_sparse else S_grad
self.assertEqual(S_grad_dense, D_grad)

nnz = 10

@weiyangfb weiyangfb force-pushed the sparse_sum branch 2 times, most recently from 6ebca4b to b41fba6 on November 21, 2018 06:12
@weiyangfb
Contributor Author

I removed the keepdim arg and addressed the comments; the CI failures do not look related. This PR is ready for review again. cc @soumith

Contributor

@facebook-github-bot facebook-github-bot left a comment


@weiyangfb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


- move sparse_sum to sparse.sum
- require the input sparse tensor to be coalesced in sparse_sum to ease autograd support
- optimize the CPU backward kernel with binary search (a conceptual sketch of the index matching follows this list)
- optimize runtime of forward and backward; use the cheap sparse tensor ctor _sparse_coo_tensor_with_dims_and_tensors()
- optimize the CUDA backward kernel with binary search; runtime doesn't seem to improve at all, not sure why
- improve the speed of forward when summing all sparse dims; address comments
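
The binary-search backward mentioned in the commits above matches each input nonzero against the nonzeros of the output gradient via their flattened indices. A rough conceptual sketch of that matching step (torch.searchsorted stands in for the hand-written kernel; gather_grad_values and the toy data below are made up for this example):

```python
import torch

def gather_grad_values(input_flat_idx, grad_flat_idx_sorted, grad_values):
    """For each input nonzero, binary-search its flattened index in the sorted,
    flattened indices of the output gradient; unmatched entries get zero."""
    pos = torch.searchsorted(grad_flat_idx_sorted, input_flat_idx)
    pos = pos.clamp(max=grad_flat_idx_sorted.numel() - 1)
    found = grad_flat_idx_sorted[pos] == input_flat_idx
    out = torch.zeros(input_flat_idx.numel(), dtype=grad_values.dtype)
    out[found] = grad_values[pos[found]]
    return out

# Toy example: input nonzeros at flattened indices [0, 2, 2, 5];
# grad has nonzeros at [2, 5] with values [10., 20.].
inp_idx  = torch.tensor([0, 2, 2, 5])
grad_idx = torch.tensor([2, 5])
grad_val = torch.tensor([10.0, 20.0])
print(gather_grad_values(inp_idx, grad_idx, grad_val))
# tensor([ 0., 10., 10., 20.])
```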

zdevito pushed a commit to zdevito/ATen that referenced this pull request Nov 28, 2018
Summary: same as the PR description at the top of this conversation (including the benchmark tables and the copy_ benchmark).
Pull Request resolved: pytorch/pytorch#12430

Differential Revision: D12878313

Pulled By: weiyangfb

fbshipit-source-id: e16dc7681ba41fdabf4838cf05e491ca9108c6fe
@ezyang ezyang added the merged label Jun 26, 2019