
torch.to_dense backward ignores unspecified elements in sparse inputs #95550

Closed

pearu opened this issue Feb 25, 2023 · 6 comments
Assignees: pearu
Labels: bug · module: autograd (Related to torch.autograd, and the autograd engine in general) · module: bc-breaking (Related to a BC-breaking change) · module: sparse (Related to torch.sparse) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

pearu (Collaborator) commented Feb 25, 2023

Issue description

For historical reasons, the backward of torch.to_dense on sparse inputs implements masked semantics, which contradicts the current interpretation of sparse tensors: a sparse tensor is semantically equivalent to a strided tensor, and the use of a sparse format is considered merely a memory optimization that does not define a mask for operations on sparse tensors.
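
To make the reported behavior concrete, here is a minimal sketch (exact results depend on the PyTorch version; the expected values in the comments assume the masked to_dense backward described above):

```python
import torch

# (0, 0) becomes an unspecified element because to_sparse() drops zeros.
a = torch.tensor([[0., 1.], [2., 3.]], dtype=torch.float64).to_sparse().requires_grad_()
a.to_dense().sum().backward()

# With the masked backward reported here, the gradient is projected onto the
# input's sparsity pattern, so the (0, 0) entry is silently dropped:
#   a.grad.to_dense() -> [[0., 1.], [1., 1.]]
# Treating the sparse tensor as equivalent to a strided tensor would instead
# give an all-ones gradient.
print(a.grad.to_dense())
```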

Masked tensor semantics is currently implemented in torch.masked (the future home of masked semantics) and torch.sparse (to be deprecated).

This issue breaks autograd on sparse tensors because the to_dense method is required for gradcheck on operations that produce sparse outputs, for example:

>>> import torch
>>> a = torch.tensor([[0, 1], [2, 3]], dtype=torch.float64).to_sparse().requires_grad_()
>>> torch.autograd.gradcheck(torch.Tensor.t, a)
<snip>
ValueError: Sparse output is not supported at gradcheck yet. Please call to_dense() on the output of fn for gradcheck.

Code example

Following the recommendation from the error message above, gradcheck under non-masked semantics fails:

>>> torch.autograd.gradcheck(lambda x: torch.Tensor.t(x).to_dense(), a, masked=False)
<snip>
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor([[1.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 1.0000, 0.0000],
        [0.0000, 1.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 1.0000]], dtype=torch.float64)
analytical:tensor([[0., 0., 0., 0.],
        [0., 0., 1., 0.],
        [0., 1., 0., 0.],
        [0., 0., 0., 1.]], dtype=torch.float64)

but succeeds under masked semantics:

>>> torch.autograd.gradcheck(lambda x: torch.Tensor.t(x).to_dense(), a, masked=True, check_sparse_nnz=True)
True

Possible solutions

  1. One solution is implemented in #94728 (Add sparse semantics context manager), which introduces a global Context flag that defines how operations and their backward implementations should interpret the unspecified elements of sparse tensors (a hypothetical usage sketch appears after this list).

  2. Introduce a masked kw argument (default False) to to_dense that enables explicit control of the semantics when using the to_dense method. For example, the following calls should succeed:

torch.autograd.gradcheck(lambda x: torch.Tensor.t(x).to_dense(masked=False), a, masked=False)

torch.autograd.gradcheck(lambda x: torch.Tensor.t(x).to_dense(masked=True), a, masked=True)
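
For option 1, a hypothetical usage sketch; the name `sparse_masked_semantics` and the module-level flag below are invented for illustration and may not match the actual API proposed in #94728:

```python
from contextlib import contextmanager

_MASKED_SEMANTICS = False  # hypothetical global default: non-masked semantics

@contextmanager
def sparse_masked_semantics(enabled: bool):
    # Hypothetical context manager, not the #94728 API: toggles how sparse
    # operations and their backwards interpret unspecified elements.
    global _MASKED_SEMANTICS
    previous, _MASKED_SEMANTICS = _MASKED_SEMANTICS, enabled
    try:
        yield
    finally:
        _MASKED_SEMANTICS = previous

# A backward implementation could then branch on the flag, e.g.
#   grad_input = grad.sparse_mask(input) if _MASKED_SEMANTICS else grad.to_sparse()
```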

Discussion

Both solutions have pros and cons (see the discussion in #94728), and both will be BC-breaking because the default semantics of operations in the torch namespace needs to switch to non-masked semantics. This cannot be avoided when pursuing the idea that sparse tensors are semantically equivalent to strided tensors. Fortunately, most tensor operations (from the torch namespace) on sparse tensors already implement non-masked semantics, for example:

>>> torch.autograd.gradcheck(lambda x: torch.Tensor.t(x)[0, 0], a, masked=False)
True
>>> torch.autograd.gradcheck(lambda x: torch.Tensor.t(x)[0, 0], a, masked=True, check_sparse_nnz=True)
<snip>
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor([[0.],
        [0.],
        [0.],
        [0.]], dtype=torch.float64)
analytical:tensor([[1.],
        [0.],
        [0.],
        [0.]], dtype=torch.float64)
>>> torch.autograd.gradcheck(lambda x: torch.Tensor.t(x)[0, 1], a, masked=True, check_sparse_nnz=True)
True
>>> torch.autograd.gradcheck(lambda x: torch.Tensor.t(x)[0, 1], a, masked=False)
True

The failure above is expected because both transposing and indexing use non-masked semantics (the indexing op circumvents the "Sparse output is not supported at gradcheck yet" exception as well as the current issue with to_dense).

In the future, when masked and non-masked semantics are well separated between the torch.masked and torch namespaces, the masked kw argument in both gradcheck and to_dense will become unnecessary because the semantics will be defined by the operations themselves (whether they come from torch.masked or torch). Ironically, this sounds very similar to the original plan of having the torch.sparse and torch namespaces carry different semantics. The main difference is that the memory-optimization and masked-semantics aspects of using sparse tensors become decoupled features.

System Info

  • PyTorch version: a build that includes #95405, which implements masked kw argument support in gradcheck.

cc @alexsamardzic @nikitaved @cpuhrsch @amjames @bhosmer @ezyang @gchanan @albanD @zou3519 @gqchen @soulitzer @lezcano @Varal7

pearu added the module: sparse, module: bc-breaking, module: autograd, and bug labels on Feb 25, 2023
vadimkantorov (Contributor) commented:
Personally, I'm in favor of more explicit solutions first (such as masked=True arguments), and then, if needed, context managers.

nikitaved (Collaborator) commented Feb 27, 2023

I am not sure we have to enforce any masked semantics, be it in autograd (to_dense specifically), torch.masked, and/or context managers, because:

  1. We have the tools to implement any sparse semantics with differentiable sparse_mask (#95165).
    For example, using sparse_mask it is easy to mimic the behavior of torch.sparse.mm:
In [1]: import torch

In [2]: x_data = torch.rand(3, 3)

In [3]: y_data = torch.rand(3, 3)

In [4]: x1 = x_data.mul(x_data < 0.5).to_sparse().requires_grad_(True)

In [5]: x2 = x_data.mul(x_data < 0.5).to_sparse().requires_grad_(True)

In [6]: y1 = y_data.mul(y_data < 0.5).to_sparse().requires_grad_(True)

In [7]: y2 = y_data.mul(y_data < 0.5).to_sparse().requires_grad_(True)

In [8]: def custom_sparse_mm(x, y):
   ...:     x = x.sparse_mask(x)
   ...:     y = y.sparse_mask(y)
   ...:     res = x @ y
   ...:     res = res.sparse_mask(res)
   ...:     return res
   ...: 

In [9]: torch.autograd.grad(custom_sparse_mm(y1, x1), (y1, x1), torch.ones(3, 3).to_sparse())
<ipython-input-8-4d9400ee8449>:4: UserWarning: Sparse CSR tensor support is in beta state. If you miss a functionality in the sparse tensor support, please submit a feature request to https://github.com/pytorch/pytorch/issues. (Triggered internally at /home/nik/git/Quansight/pytorch/aten/src/ATen/SparseCsrTensorImpl.cpp:54.)
  res = x @ y
Out[9]: 
(tensor(indices=tensor([[0, 1, 2, 2, 2],
                        [1, 0, 0, 1, 2]]),
        values=tensor([0.8071, 0.6476, 0.6476, 0.8071, 0.2095]),
        size=(3, 3), nnz=5, layout=torch.sparse_coo),
 tensor(indices=tensor([[0, 0, 0, 1, 1, 1, 2],
                        [0, 1, 2, 0, 1, 2, 2]]),
        values=tensor([0.1705, 0.1705, 0.1705, 0.6261, 0.6261, 0.6261, 0.3517]),
        size=(3, 3), nnz=7, layout=torch.sparse_coo))

In [10]: torch.autograd.grad(torch.sparse.mm(y2, x2), (y2, x2), torch.ones(3, 3).to_sparse())
Out[10]: 
(tensor(indices=tensor([[0, 1, 2, 2, 2],
                        [1, 0, 0, 1, 2]]),
        values=tensor([0.8071, 0.6476, 0.6476, 0.8071, 0.2095]),
        size=(3, 3), nnz=5, layout=torch.sparse_coo),
 tensor(indices=tensor([[0, 0, 0, 1, 1, 1, 2],
                        [0, 1, 2, 0, 1, 2, 2]]),
        values=tensor([0.1705, 0.1705, 0.1705, 0.6261, 0.6261, 0.6261, 0.3517]),
        size=(3, 3), nnz=7, layout=torch.sparse_coo))


In [11]: x1
Out[11]: 
tensor(indices=tensor([[0, 0, 0, 1, 1, 1, 2],
                       [0, 1, 2, 0, 1, 2, 2]]),
       values=tensor([0.0543, 0.1362, 0.4572, 0.0271, 0.4451, 0.3349, 0.2095]),
       size=(3, 3), nnz=7, layout=torch.sparse_coo, requires_grad=True)

In [12]: y1
Out[12]: 
tensor(indices=tensor([[0, 1, 2, 2, 2],
                       [1, 0, 0, 1, 2]]),
       values=tensor([0.3481, 0.0818, 0.0887, 0.2780, 0.3517]),
       size=(3, 3), nnz=5, layout=torch.sparse_coo, requires_grad=True)

In general, any composition of operations `(f_n ∘ ... ∘ f_1)(x)` can be turned into a "masked" operation `[(m_n ∘ f_n) ∘ ... ∘ (m_1 ∘ f_1)](x.sparse_mask(x))`, where `m_i(x) := id(x)` or `x.sparse_mask(mask_i)`, with `mask_i` potentially equal to `x` (a generic sketch of this pattern appears at the end of this comment).

This approach:
a. Is very flexible and does not rely on any global behavior; sparse remains just an optimization layout.
b. Masking of the form x.sparse_mask(x) is a trivial operation in the forward pass, with the only overhead being an additional node in the computational graph.
c. Masks can trivially be shared along the composition of functions for a reduced memory footprint.
d. There is no need to duplicate our codebase at all! In fact, we can remove some parts :)

  2. Masked semantics (just like with sparse.mm: gradient projection onto the specified elements) implies a sparse parametrization only in linear cases.

  3. Instead of masked semantics, it is better to have a notion of sparse parametrization. This is what is expected from sparse in some cases: Triangular solver for sparse matrices #87358 (comment).

  4. Autograd can treat sparse tensors just the same as dense tensors. If needed, the output can be modified with output = output.sparse_mask(output); return output.to_dense(), so we can eliminate sparse-specific parts of the gradcheck code such as check_sparse_nnz and related. But I am not sure it is that easy, cc @albanD.
    The example of the failing gradcheck can be modified to not rely on any explicit masked=True semantics:

>>> a = torch.tensor([[0, 1], [2, 3]], dtype=torch.float64).to_sparse().requires_grad_()
>>> def f(a):
...     a = a.sparse_mask(a)
...     res = torch.Tensor.t(a)
...     res = res.sparse_mask(res)
...     return res.to_dense()
... 
>>> torch.autograd.gradcheck(lambda a: f(a), a, check_sparse_nnz=True)
True

Alternatively, without explicit masked=True and check_sparse_nnz=True:

>>> a = torch.tensor([[0, 1], [2, 3]], dtype=torch.float64).requires_grad_(True)
>>> mask = a.detach().to_sparse()
>>> def f(a, mask):
...     res = a.sparse_mask(mask)
...     res = torch.Tensor.t(res)
...     res = res.sparse_mask(res)
...     return res.to_dense()
... 
>>> torch.autograd.gradcheck(lambda a: f(a, mask), (a,))
True
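
A generic sketch of the composition pattern described earlier in this comment; the helper `masked_compose` is illustrative only and not an existing PyTorch API, and it assumes the differentiable sparse_mask from #95165:

```python
import torch

def masked_compose(x: torch.Tensor, *steps) -> torch.Tensor:
    # Illustrative only: `steps` are callables, either the operations f_i or
    # masking steps m_i such as `lambda t: t.sparse_mask(mask_i)`; self-masking
    # steps like `lambda t: t.sparse_mask(t)` are allowed as well.
    out = x.sparse_mask(x)  # seed the computation with x's own pattern
    for step in steps:
        out = step(out)
    return out

# For example, custom_sparse_mm(x1, y1) from above would roughly correspond to
#   masked_compose(x1, lambda t: t @ y1.sparse_mask(y1), lambda t: t.sparse_mask(t))
```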

pearu (Collaborator, Author) commented Feb 27, 2023

> I am not sure we have to enforce any masked semantics, be it autograd (looking at to_dense specifically), torch.masked and/or context managers.

@nikitaved Do you agree or disagree that the to_dense_backward implementation is currently wrong in that it applies the input as a mask to grad when the input is a tensor with a sparse layout?
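
For reference, a rough COO-oriented Python sketch of the two behaviors under discussion; the actual implementation lives in derivatives.yaml / C++, so this is illustrative only:

```python
import torch

def to_dense_backward_masked(grad: torch.Tensor, input: torch.Tensor) -> torch.Tensor:
    # The reported behavior: project the incoming (dense) gradient onto the
    # sparse input's pattern, so gradients of unspecified elements are dropped.
    return grad.sparse_mask(input.coalesce())

def to_dense_backward_unmasked(grad: torch.Tensor, input: torch.Tensor) -> torch.Tensor:
    # The non-masked alternative: keep every gradient entry and only convert
    # the layout back to sparse.
    return grad.to_sparse()
```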

nikitaved (Collaborator) commented Feb 27, 2023

@pearu, I agree with that statement. But I also propose an alternative solution to the ones you posited.

pearu (Collaborator, Author) commented Feb 27, 2023

> @pearu, I agree with that statement. But I also propose an alternative solution to the ones you posited.

Great!

IIUC, your solution is to have the masked kw argument in neither gradcheck nor to_dense; both will assume non-masked semantics. That is:

  • in the gradcheck implementation, sparse inputs will always be densified for numerical Jacobian calculations (a rough densification sketch follows this list)
  • in the to_dense backward implementation, remove the sparse_mask call
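
A rough sketch of what densifying sparse inputs for the numerical Jacobian could look like; this is illustrative only and not gradcheck's actual implementation:

```python
import torch

def densify_inputs(inputs):
    # Illustrative helper: replace every sparse input with an equivalent
    # strided leaf tensor, so finite-difference perturbations reach all
    # elements, including those left unspecified by the sparse layout.
    out = []
    for t in inputs:
        if isinstance(t, torch.Tensor) and t.layout != torch.strided:
            out.append(t.detach().to_dense().requires_grad_(t.requires_grad))
        else:
            out.append(t)
    return tuple(out)
```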

With the above, the numerical Jacobian of torch.sparse.mm will be the same as the numerical Jacobian of torch.mm, but the corresponding analytical Jacobians will be different.

So, how would you test torch.sparse.mm autograd correctness using gradcheck?

ezyang added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Feb 27, 2023
nikitaved (Collaborator) commented Feb 27, 2023

@pearu, like this. Based on our PRs:

diff --git a/tools/autograd/derivatives.yaml b/tools/autograd/derivatives.yaml
index 71ee6f03773..bc0a1bae4cb 100644
--- a/tools/autograd/derivatives.yaml
+++ b/tools/autograd/derivatives.yaml
@@ -1679,7 +1679,7 @@
 # - name: to_dense(Tensor self, ScalarType? dtype=None) -> Tensor
 #
 - name: _to_dense(Tensor self, ScalarType? dtype=None) -> Tensor
-  self: to_dense_backward(grad, self)
+  self: grad.to_sparse()

>>> x = torch.tensor([[0, 1], [2, 3]], dtype=torch.float64).to_sparse().requires_grad_()
>>> mask = x.detach()
>>> torch.autograd.gradcheck(lambda t: t.sparse_mask(mask).to_dense(), (x,), masked=False)
True
>>> torch.autograd.gradcheck(lambda t: torch.Tensor.t(t.sparse_mask(mask)).to_dense(), (x,), masked=False)
True
>>> def f(x, mask):
...     x = x.sparse_mask(mask)
...     x = torch.sparse.mm(x, x)
...     return x.to_dense()
>>> torch.autograd.gradcheck(lambda x: f(x, mask), (x,), masked=False)
True

sparse_mask enforces sparse semantics and makes sure that both forward and backward respect it.

@pearu pearu self-assigned this Mar 6, 2023
pearu added a commit that referenced this issue on Mar 6, 2023

As in the title.

The masked kw argument is required for `to_dense` backward to distinguish the expected semantics of sparse tensors. `masked=True` means that the `to_dense` backward will apply a mask to the returned gradient, where the mask is defined by the input indices. The default semantics implies `masked==False`.

The PR is BC-breaking in the sense that masked semantics has been the default for `to_dense` (its backward ignores unspecified elements in the input), and this PR makes non-masked semantics the default.

As a consequence, existing code that is run through the autograd engine must replace `.to_dense()` calls with `.to_dense(masked=True)`. For example,
```python
torch.autograd.gradcheck(lambda x: torch.sparse.sum(x, [0]).to_dense())
```
must be updated to
```python
torch.autograd.gradcheck(lambda x: torch.sparse.sum(x, [0]).to_dense(masked=True))
```

Fixes #95550

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10
pearu added a commit that referenced this issue on Mar 9, 2023

As in the title.

The masked kw argument is required for `to_dense` backward to distinguish the expected semantics of sparse tensors. `masked=True` means that the `to_dense` backward will apply a mask to the returned gradient, where the mask is defined by the input indices. The default semantics implies `masked==False` for BC, but see the [comment](https://github.com/pytorch/pytorch/pull/96095/files#diff-d4df180433a09071e891d552426911c227b30ae9b8a8e56da31046e7ecb1afbeR501-R513) in `to_dense_backward`.

~~The PR is BC-breaking in the sense that masked semantics has been the default for `to_dense` (its backward ignores unspecified elements in the input), and this PR makes non-masked semantics the default.~~

As a consequence, existing code that is run through the autograd engine must replace `.to_dense()` calls with `.to_dense(masked=False)`. For example,
```python
torch.autograd.gradcheck(lambda x: torch.sum(x, [0]).to_dense())
torch.autograd.gradcheck(lambda x: torch.sparse.sum(x, [0]).to_dense())
```
(recall that gradcheck has `masked=False` as the default) must be updated to
```python
torch.autograd.gradcheck(lambda x: torch.sum(x, [0]).to_dense(masked=False))
torch.autograd.gradcheck(lambda x: torch.sparse.sum(x, [0]).to_dense(masked=True), masked=True)
```

Fixes #95550

cc alexsamardzic nikitaved cpuhrsch amjames bhosmer ezyang albanD zou3519 gqchen soulitzer Lezcano Varal7 jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10
pearu added a commit that referenced this issue on Mar 14, 2023

As in the title.

The `masked_grad` kw argument is required for `to_dense` backward to distinguish the expected semantics of sparse tensors. `masked_grad=True` means that the `to_dense` backward will apply a mask to the returned gradient, where the mask is defined by the input indices. The default semantics implies `masked_grad==True` for BC, but see the [comment](https://github.com/pytorch/pytorch/pull/96095/files#diff-d4df180433a09071e891d552426911c227b30ae9b8a8e56da31046e7ecb1afbeR501-R513) in `to_dense_backward`.

As a consequence, existing code that is run through the autograd engine must replace `.to_dense()` calls with `.to_dense(masked_grad=False)`. For example,
```python
torch.autograd.gradcheck(lambda x: torch.sum(x, [0]).to_dense())
torch.autograd.gradcheck(lambda x: torch.sparse.sum(x, [0]).to_dense())
```
(recall that gradcheck has `masked=False` as the default) must be updated to
```python
torch.autograd.gradcheck(lambda x: torch.sum(x, [0]).to_dense(masked_grad=False))
torch.autograd.gradcheck(lambda x: torch.sparse.sum(x, [0]).to_dense(masked_grad=True), masked=True)
```

Fixes #95550

cc alexsamardzic nikitaved cpuhrsch amjames bhosmer ezyang albanD zou3519 gqchen soulitzer Lezcano Varal7 jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10
cyyever pushed a commit to cyyever/pytorch_private that referenced this issue on Mar 23, 2023 and again on Mar 27, 2023 (same commit message as above).

Pull Request resolved: pytorch/pytorch#96095
Approved by: https://github.com/cpuhrsch