
Memory bug for backward on torch.sparse.mm? #41128

Open
liu-jc opened this issue Jul 8, 2020 · 10 comments
Labels
module: autograd Related to torch.autograd, and the autograd engine in general module: sparse Related to torch.sparse triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@liu-jc

liu-jc commented Jul 8, 2020

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

import torch
import numpy as np

print(f'GPU memory usage: {torch.cuda.memory_allocated(device=None)/10**9}')
n = 110000
nnz = 860000
# Random sparse (n, n) COO matrix with nnz non-zeros, and a dense (n, 200) matrix, both on GPU
rows = np.random.randint(0, n, nnz)
cols = np.random.randint(0, n, nnz)
values = torch.randn(nnz)
X_sparse = torch.sparse_coo_tensor([rows,cols], values, size=(n,n)).cuda().requires_grad_(True)
Y_dense = torch.randn((n,200)).cuda().requires_grad_(True)
print(f'GPU memory usage: {torch.cuda.memory_allocated(device=None)/10**9}')
input("Press")
# Sparse @ dense matmul; the OOM occurs in the backward call below
t = torch.sparse.mm(X_sparse, Y_dense).sum()
t.backward()
print(f'SPMM, x.grad: {X_sparse.grad}, y.grad: {Y_dense.grad}')
print(f'GPU memory usage: {torch.cuda.memory_allocated(device=None)/10**9}')

Expected behavior

I got an OOM error only in the backward step: RuntimeError: CUDA out of memory. Tried to allocate 45.08 GiB (GPU 0; 10.73 GiB total capacity; 202.00 MiB already allocated; 9.68 GiB free; 204.00 MiB reserved in total by PyTorch).
I would expect the memory usage of backward to be similar to that of forward. Before backward there is still ~9 GB free, so I did not expect it to try to allocate 45 GiB. Is there a bug in the backward support for torch.sparse.mm? Feel free to correct me if I am wrong.

Environment

Please copy and paste the output from our
environment collection script
(or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
  • PyTorch Version (e.g., 1.0): 1.4.0
  • OS (e.g., Linux): Ubuntu 18.04.2 LTS
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.7
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: GPU 0: GeForce RTX 2080 Ti
  • Any other relevant information:

Additional context

I am not sure whether this is a bug. If it is, and anyone can point out the problem, I am willing to help and dig further into how to fix it.

cc @ezyang @ssnl @albanD @zou3519 @gqchen @vincentqb @aocsa

@liu-jc
Author

liu-jc commented Jul 9, 2020

I think it might be caused by the dense gradient? I found a related comment here: #12430 (comment). But shouldn't the gradient w.r.t. a sparse tensor be a sparse tensor as well?
@apaszke Could you help me confirm whether this is also related to the dense gradient? Thank you.

@t-vi
Collaborator

t-vi commented Jul 9, 2020

So yes, the problem is that the gradient will be dense.
If we look at C = A @ B and some l = f(C), with A sparse and B, C dense, we have dl/dA = (dl/dC) @ B.t(), and both arguments in that matmul are dense matrices.
If we wanted to support a "sparsity-preserving gradient", we would need to add an entirely different backward.
As a workaround, you could probably use scatter ops (an external module) to emulate spmm.
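
To make that concrete, here is a tiny illustration (my own sketch, not code from the thread) of the formula above; the intermediate (dl/dC) @ B.t() is a dense n x n matrix regardless of how sparse A is, which is what blows up for n = 110000:

import torch

n, m = 5, 3
A = torch.eye(n).to_sparse()      # sparse (n, n)
B = torch.randn(n, m)             # dense  (n, m)
C = torch.sparse.mm(A, B)         # dense  (n, m)

grad_C = torch.ones_like(C)       # dl/dC for l = C.sum()
grad_A = grad_C @ B.t()           # dense (n, n) intermediate that the generic backward materializes
print(grad_A.shape)               # torch.Size([5, 5]) here; (110000, 110000) in the report above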

@albanD albanD added module: autograd Related to torch.autograd, and the autograd engine in general module: sparse Related to torch.sparse triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Jul 13, 2020
@romanpogodin

romanpogodin commented Sep 16, 2020

Adding to @t-vi's suggestion of scatter ops, the torch-sparse package (which is related to torch-scatter) doesn't seem to have this issue. The OP's problem fits on a Tesla T4 (16 GB VRAM) just fine:

import torch
import torch_sparse
import numpy as np
print(f'GPU memory usage: {torch.cuda.memory_allocated(device=None)/10**9}')
n = 110000
nnz = 860000
rows = np.random.randint(0, n, nnz)
cols = np.random.randint(0, n, nnz)

indices = torch.tensor([rows,cols], dtype=torch.long).cuda()
values = torch.randn(nnz).cuda().requires_grad_(True)
Y_dense = torch.randn((n,200)).cuda().requires_grad_(True)
print(f'GPU memory usage: {torch.cuda.memory_allocated(device=None)/10**9}')
t = torch_sparse.spmm(indices, values, n, n, Y_dense).sum()
t.backward()
print(f'GPU memory usage: {torch.cuda.memory_allocated(device=None)/10**9}')
print(f'GPU max memory usage: {torch.cuda.max_memory_allocated(device=None)/10**9}')

Outputs:

GPU memory usage: 0.0
GPU memory usage: 0.106200576
GPU memory usage: 0.1977216
GPU max memory usage: 2.861641728

@liu-jc
Author

liu-jc commented Sep 17, 2020


Thanks for sharing. I used torch-sparse as well, and it solves the problem perfectly. Sorry for not updating here. Is there any plan to support sparse gradients natively in PyTorch? I think it would be an important feature, as a lot of applications and research rely on sparse tensors.

@ezyang
Contributor

ezyang commented Sep 17, 2020

cc @pearu

@maqy1995

I'm a little curious about this: in the original code above, why does the backward pass need about 45 GiB of GPU memory? Can someone provide a simple calculation?

@albanD
Collaborator

albanD commented Dec 17, 2020

@maqy1995 there is an explanation in @t-vi's comment above: since the gradient needs to be dense, autograd will allocate a dense tensor matching the sparse tensor's size, and that dense tensor requires this much memory.
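
As a rough sanity check (my own arithmetic, not from the thread), a dense float32 gradient matching the (n, n) sparse tensor from the original snippet accounts exactly for the number in the error message:

n = 110000
dense_grad_bytes = n * n * 4        # float32 dense gradient of shape (n, n)
print(dense_grad_bytes / 2**30)     # ~45.08 GiB, matching "Tried to allocate 45.08 GiB"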

@pearu pearu added this to To do in Sparse tensors Aug 10, 2021
@tvercaut

tvercaut commented Jan 6, 2022

Out of curiosity, I checked whether the behaviour would be different when requiring the gradient on the values of the sparse COO matrix (rather than on the sparse matrix itself, so as to clarify that we are not after a dense result as per #41128 (comment)), but unfortunately the same out-of-memory issue arises. I can also confirm that pytorch_sparse doesn't face the memory bottleneck.

Also, reading through the documentation of torch.sparse.mm, I can see that

Note that the gradients of mat1 is a coalesced sparse tensor.

As such, I would expect that if A is a sparse matrix, the gradient of C = torch.sparse.mm(A, B) with respect to "sparse A" should be the gradient with respect to the non-zero values in A (and not with respect to a dense view of A). This is indeed what happens in practice with the current version of the code (when there is enough memory to avoid the OOM issue). Therefore, I don't think the reason pointed out in #41128 (comment) and re-stated in #41128 (comment) is a fundamental showstopper. This analysis seems to confirm that the issue is an implementation limitation rather than a semantics-related one, but I might be missing something as well...
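
For reference, a small-size check (my own sketch, not from the thread) of the behaviour described above: when the dense intermediate fits in memory, the gradient torch.sparse.mm returns for the sparse operand is a sparse COO tensor holding the gradient w.r.t. the non-zero entries only.

import torch

A = torch.sparse_coo_tensor([[0, 1], [1, 0]], torch.tensor([1.0, 2.0]), size=(2, 2)).requires_grad_(True)
B = torch.randn(2, 3, requires_grad=True)

torch.sparse.mm(A, B).sum().backward()
print(A.grad.layout)   # torch.sparse_coo
print(A.grad)          # values only at the positions where A has non-zeros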

Here (and on colab) is an expanded version of the first test case in this issue:

import torch
print(torch.__version__)

torchdevice = torch.device('cpu')
if torch.cuda.is_available():
  torchdevice = torch.device('cuda')
  print('Default GPU is ' + torch.cuda.get_device_name(torch.device('cuda')))
print('Running on ' + str(torchdevice))

!pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-{torch.__version__}.html
import torch_sparse

# Convenience wrapper around the torch_sparse matmul
def ts_spmm(A,B):
  Ats = torch_sparse.SparseTensor.from_torch_sparse_coo_tensor(A)
  AB = torch_sparse.matmul(Ats,B)
  return AB

print(f'GPU memory usage: {torch.cuda.memory_allocated(device=torchdevice)/10**9}')

# Dimension of the square sparse matrix
n = 100000
# Number of non-zero elements
nnz = 20000
# Second dimension of the dense matrix
m = 200

rowidx = torch.randint(low=0, high=n, size=(nnz,), device=torchdevice)
colidx = torch.randint(low=0, high=n, size=(nnz,), device=torchdevice)
itemidx = torch.vstack((rowidx,colidx))
xvalues = torch.randn(nnz, device=torchdevice)

Y_dense = torch.randn((n,m), device=torchdevice)


# Require gradients on the values of the sparse matrix (not the matrix itself)
xvalues_0 = xvalues.detach().clone().requires_grad_(True)
X_sparse_0 = torch.sparse_coo_tensor(itemidx, xvalues_0, size=(n,n))
Y_dense_0 = Y_dense.detach().clone().requires_grad_(True)

xvalues_1 = xvalues.detach().clone().requires_grad_(True)
X_sparse_1 = torch.sparse_coo_tensor(itemidx, xvalues_1, size=(n,n))
Y_dense_1 = Y_dense.detach().clone().requires_grad_(True)

print(f'GPU memory usage: {torch.cuda.memory_allocated(device=None)/10**9}')

# torch_sparse path
t0 = ts_spmm(X_sparse_0, Y_dense_0).sum()
t0.backward()

print(f'torch_sparse SPMM, x.grad: {xvalues_0.grad}, y.grad: {Y_dense_0.grad}')
print(f'GPU memory usage: {torch.cuda.memory_allocated(device=None)/10**9}')

# vanilla pytorch path
t1 = torch.sparse.mm(X_sparse_1, Y_dense_1).sum()
t1.backward()

print(f'Vanilla torch SPMM, x.grad: {xvalues_1.grad}, y.grad: {Y_dense_1.grad}')
print(f'GPU memory usage: {torch.cuda.memory_allocated(device=None)/10**9}')

@tvercaut

tvercaut commented Oct 20, 2022

Following the discussion in #86963 and #87358 (in particular #87358 (comment)), I understand that the semantics of torch.sparse.mm when it comes to the gradient with respect to the sparse matrix (i.e. maintaining sparsity) may be unconventional for PyTorch. However, getting the gradient with respect to the vector of non-zero input in the sparse matrix is often the desired semantics (as done in https://github.com/rusty1s/pytorch_sparse and https://github.com/flaport/torch_sparse_solve).

This expected sparse semantics may be available at some point with MaskedTensors as suggested by @lezcano in #87358 (comment) but for now MaskedTensor is labelled as a prototype library only.

In the interim, there remains a need for a sparse matrix multiplication with a sparsity-preserving gradient, i.e. one that provides a memory-efficient gradient with respect to the vector of non-zero values in the sparse matrix.

While rusty1s/pytorch_sparse offers a solution for COO matrices, it doesn't support CSR matrices and its interaction with PyTorch can be fiddly.

As of now, the least problematic solution I found is to write a custom sparse @ dense multiplication operation where I manually specify the backward pass. This would certainly benefit from being handled better in PyTorch proper though.

The rough custom op, copied here for convenience, seems to require much less memory than a direct use of torch.sparse.mm:

class MySparseMatMul(torch.autograd.Function):
  @staticmethod
  def forward(ctx, a, b):
    # Is the detach needed / helpful?
    ad = a.detach()
    bd = b.detach()
    x = torch.sparse.mm(ad, bd)
    if a.requires_grad or b.requires_grad:
      # Not sure if the following is needed / helpful
      x.requires_grad = True
    # Save context for backward pass
    ctx.save_for_backward(ad, bd)
    return x

  @staticmethod
  def backward(ctx, prev_grad):
    # Recover context
    a, b = ctx.saved_tensors
    # The gradient with respect to the matrix a seen as a dense matrix would
    # lead to a backprop rule as follows
    # grad_a = prev_grad @ b.T
    # but we are only interested in the gradient with respect to
    # the (non-zero) values of a. To save memory, instead of computing the full
    # dense matrix prev_grad @ b and then subsampling at the nnz locations in a,
    # we can directly only compute the required values:
    # grad_a[i,j] = dotprod(prev_grad[i,:], b[j,:])
    # We start by getting the i and j indices
    if(a.layout == torch.sparse_coo):
      grad_a_idx = a.indices()
      grad_a_row_idx = grad_a_idx[0,:]
      grad_a_col_idx = grad_a_idx[1,:]
    elif(a.layout == torch.sparse_csr):
      grad_a_col_idx = a.col_indices()
      grad_a_crow_idx = a.crow_indices()
      # uncompress row indices
      grad_a_row_idx = torch.repeat_interleave(
          torch.arange(a.size()[0], device=a.device),
          grad_a_crow_idx[1:]-grad_a_crow_idx[:-1] )
    else:
      raise ValueError(f"Unsupported layout: {a.layout}")
    
    # Get prev_grad[a_row_idx,:]
    prev_grad_select = prev_grad.index_select(0,grad_a_row_idx)
    # Get b[a_col_idx,:]
    b_select = b.index_select(0,grad_a_col_idx)
    # Element-wise multiplication
    prev_grad_b_ewise = prev_grad_select * b_select
    if b.dim() == 1:
      # if b is a vector, the dot prod and elementwise multiplication are the same
      grad_a_vals = prev_grad_b_ewise
    else:
      # if b is a matrix, the dot prod requires summation
      grad_a_vals = torch.sum( prev_grad_b_ewise, dim=1 )
    # Create a sparse matrix of the gradient with respect to the nnz of a
    if(a.layout == torch.sparse_coo):
      grad_a = torch.sparse_coo_tensor(grad_a_idx, grad_a_vals, a.shape)
    elif(a.layout == torch.sparse_csr):
      grad_a = torch.sparse_csr_tensor(grad_a_crow_idx, grad_a_col_idx,
                                       grad_a_vals, a.shape)

    # Now compute the (dense) gradient with respect to b
    grad_b = torch.t(a) @ prev_grad
    return grad_a, grad_b

my_sparse_mm = MySparseMatMul.apply
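
For illustration, a hypothetical usage sketch (my own, reusing n, m, itemidx and xvalues from the earlier colab snippet, and coalescing the sparse input since the backward above calls a.indices()):

X_sparse = torch.sparse_coo_tensor(itemidx, xvalues, size=(n, n)).coalesce().requires_grad_(True)
Y_dense = torch.randn((n, m), device=torchdevice, requires_grad=True)

out = my_sparse_mm(X_sparse, Y_dense).sum()
out.backward()

print(X_sparse.grad)         # sparse gradient with the same sparsity pattern as X_sparse
print(Y_dense.grad.shape)    # dense gradient for the dense operand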

@tvercaut

Just confirming that this memory issue in the backward pass of torch.sparse.mm (sparse x dense) is still present in PyTorch 2.0.1:
https://colab.research.google.com/drive/1gbjux4UEVA_nOttcJuMJzTIWyX88Wpqd?usp=sharing

The workaround suggested above is now also available here for convenience (with additional support for batched operations):
https://github.com/cai4cai/torchsparsegradutils/blob/main/torchsparsegradutils/sparse_matmul.py
