
Little idea about sparse matrix multiplication in torch_sparse #356

Closed · MrShouxingMa opened this issue Dec 10, 2023 · 6 comments
MrShouxingMa commented Dec 10, 2023
First of all, thank you very much for building PyTorch Geometric; I use it all the time and it's very smooth!

While debugging the base code, I noticed that for sparse matrix multiplication you call torch.sparse.mm() directly. However, as far as I know, PyTorch offers another method for sparse matrix multiplication, torch.spmm().
I would like to know why you ultimately chose torch.sparse.mm(), since in my experiments torch.spmm() seems to be faster.

In addition, I know there is another way to implement sparse matrix multiplication in NumPy, by calling the relevant functions through the SciPy library. To test the efficiency of each approach, I ran an experiment on a separate server with the following code and results.

import torch
import numpy as np
from time import time
import scipy.sparse as sp

torch.manual_seed(2022)
np.random.seed(2022)


a = np.random.rand(512, 512)
row = np.random.choice(np.arange(a.shape[0]), replace=False,
                       size=int(a.shape[0] * 0.5))
col = np.random.choice(np.arange(a.shape[1]), replace=False,
                       size=int(a.shape[1] * 0.5))
a[row, col] = 0
sp_a = sp.coo_matrix(a)
np_scipy_start_time = time()
y0 = sp_a.dot(sp_a)
# y0 = sp.coo_matrix(sp_a * sp_a)
np_scipy_end_time = time()

a = torch.tensor(a).to_sparse().cuda()
sparse_start_time = time()
y1 = torch.sparse.mm(a, a)
sparse_end_time = time()
spmm_start_time = time()
y2 = torch.spmm(a, a)
spmm_end_time = time()
print(y0)
print("=============================================")
print(y1)
print("=============================================")
print(y2)
print("=============================================")
print("np_scipy:       ", (np_scipy_end_time - np_scipy_start_time), "s")
print("torch.sparse.mm: ", (sparse_end_time - sparse_start_time), "s")
print("torch.spmm:      ", (spmm_end_time - spmm_start_time), "s")
Output:

np_scipy:        0.2647557258605957 s
torch.sparse.mm:  0.07332849502563477 s
torch.spmm:       0.04523587226867676 s

Since torch_sparse is built on top of PyTorch, I also compared torch.sparse.mm and torch.spmm on their own:

import torch
from time import time
torch.manual_seed(2022)

a = torch.rand(512, 512, dtype=torch.double).to_sparse().cuda()
b = torch.rand(512, 512, dtype=torch.double).to_sparse().cuda()
sparse_start_time = time()
y1 = torch.sparse.mm(a, b)
sparse_end_time = time()
spmm_start_time = time()
y2 = torch.spmm(a, b)
spmm_end_time = time()

print(y1)
print("=============================================")
print(y2)
print("=============================================")

print("torch.sparse.mm: ", (sparse_end_time - sparse_start_time), "s")
print("torch.spmm:      ", (spmm_end_time - spmm_start_time), "s")
Output:

torch.sparse.mm:  0.13776421546936035 s
torch.spmm:       0.07683801651000977 s

The server configuration is:
2x Intel Xeon Gold 6226R 2.9GHz 16 cores 22MB L3 Cache (Max Turbo Freq. 3.9GHz, Min 3.6GHz)
192GB 3200MHz ECC DDR4-RAM (Six Channel)
2x 1.2TB 10,000 RPM SAS II Hard Drives (Raid 1) and 6x 2.4TB 10,000 RPM SAS II Hard Drives (Raid 10)
2x NVIDIA Quadro RTX 6000 Passive (4,608 Cores, 576 Tensor Cores, 24GB Memory) (GPU)

Dependency package specifics:

_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
blas                      1.0                         mkl  
brotli-python             1.0.9            py38h6a678d5_7  
bzip2                     1.0.8                h7b6447c_0  
ca-certificates           2023.08.22           h06a4308_0  
certifi                   2023.11.17       py38h06a4308_0  
cffi                      1.16.0           py38h5eee18b_0  
charset-normalizer        2.0.4              pyhd3eb1b0_0  
cryptography              41.0.7           py38hdda0065_0  
cuda-cudart               12.1.105                      0    nvidia
cuda-cupti                12.1.105                      0    nvidia
cuda-libraries            12.1.0                        0    nvidia
cuda-nvrtc                12.1.105                      0    nvidia
cuda-nvtx                 12.1.105                      0    nvidia
cuda-opencl               12.3.101                      0    nvidia
cuda-runtime              12.1.0                        0    nvidia
daal4py                   2023.1.1         py38h79cecc1_0  
dal                       2023.1.1         hdb19cb5_48680  
ffmpeg                    4.3                  hf484d3e_0    pytorch
filelock                  3.13.1           py38h06a4308_0  
freetype                  2.12.1               h4a9f257_0  
giflib                    5.2.1                h5eee18b_3  
gmp                       6.2.1                h295c915_3  
gmpy2                     2.1.2            py38heeb90bb_0  
gnutls                    3.6.15               he1e5248_0  
idna                      3.4              py38h06a4308_0  
intel-openmp              2023.1.0         hdb19cb5_46306  
jinja2                    3.1.2            py38h06a4308_0  
joblib                    1.2.0            py38h06a4308_0  
jpeg                      9e                   h5eee18b_1  
lame                      3.100                h7b6447c_0  
lcms2                     2.12                 h3be6417_0  
ld_impl_linux-64          2.38                 h1181459_1  
lerc                      3.0                  h295c915_0  
libcublas                 12.1.0.26                     0    nvidia
libcufft                  11.0.2.4                      0    nvidia
libcufile                 1.8.1.2                       0    nvidia
libcurand                 10.3.4.101                    0    nvidia
libcusolver               11.4.4.55                     0    nvidia
libcusparse               12.0.2.55                     0    nvidia
libdeflate                1.17                 h5eee18b_1  
libffi                    3.4.4                h6a678d5_0  
libgcc-ng                 11.2.0               h1234567_1  
libgfortran-ng            11.2.0               h00389a5_1  
libgfortran5              11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libiconv                  1.16                 h7f8727e_2  
libidn2                   2.3.4                h5eee18b_0  
libjpeg-turbo             2.0.0                h9bf148f_0    pytorch
libnpp                    12.0.2.50                     0    nvidia
libnvjitlink              12.1.105                      0    nvidia
libnvjpeg                 12.1.1.14                     0    nvidia
libpng                    1.6.39               h5eee18b_0  
libstdcxx-ng              11.2.0               h1234567_1  
libtasn1                  4.19.0               h5eee18b_0  
libtiff                   4.5.1                h6a678d5_0  
libunistring              0.9.10               h27cfd23_0  
libwebp                   1.3.2                h11a3e52_0  
libwebp-base              1.3.2                h5eee18b_0  
llvm-openmp               14.0.6               h9e868ea_0  
lz4-c                     1.9.4                h6a678d5_0  
markupsafe                2.1.1            py38h7f8727e_0  
mkl                       2023.1.0         h213fc3f_46344  
mkl-service               2.4.0            py38h5eee18b_1  
mkl_fft                   1.3.8            py38h5eee18b_0  
mkl_random                1.2.4            py38hdb19cb5_0  
mpc                       1.1.0                h10f8cd9_1  
mpfr                      4.0.2                hb69a4c5_1  
mpi                       1.0                       mpich  
mpich                     4.1.1                hbae89fd_0  
mpmath                    1.3.0            py38h06a4308_0  
ncurses                   6.4                  h6a678d5_0  
nettle                    3.7.3                hbbd107a_1  
networkx                  3.1              py38h06a4308_0  
numpy                     1.24.3           py38hf6e8229_1  
numpy-base                1.24.3           py38h060ed82_1  
openh264                  2.1.1                h4ff587b_0  
openjpeg                  2.4.0                h3ad879b_0  
openssl                   3.0.12               h7f8727e_0  
packaging                 23.1             py38h06a4308_0  
pillow                    10.0.1           py38ha6cbd5a_0  
pip                       23.3.1           py38h06a4308_0  
platformdirs              3.10.0           py38h06a4308_0  
pooch                     1.7.0            py38h06a4308_0  
psutil                    5.9.0            py38h5eee18b_0  
pycparser                 2.21               pyhd3eb1b0_0  
pyg                       2.4.0           py38_torch_2.1.0_cu121    pyg
pyg-lib                   0.3.1+pt21cu121          pypi_0    pypi
pyopenssl                 23.2.0           py38h06a4308_0  
pyparsing                 3.0.9            py38h06a4308_0  
pysocks                   1.7.1            py38h06a4308_0  
python                    3.8.18               h955ad1f_0  
pytorch                   2.1.0           py3.8_cuda12.1_cudnn8.9.2_0    pytorch
pytorch-cuda              12.1                 ha16c6d3_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch
pyyaml                    6.0.1            py38h5eee18b_0  
readline                  8.2                  h5eee18b_0  
requests                  2.31.0           py38h06a4308_0  
scikit-learn              1.3.0            py38h1128e8f_0  
scikit-learn-intelex      2023.1.1         py38h06a4308_0  
scipy                     1.10.1           py38hf6e8229_1  
setuptools                68.0.0           py38h06a4308_0  
sqlite                    3.41.2               h5eee18b_0  
sympy                     1.12             py38h06a4308_0  
tbb                       2021.8.0             hdb19cb5_0  
threadpoolctl             2.2.0              pyh0d69192_0  
tk                        8.6.12               h1ccaba5_0  
torch-cluster             1.6.3+pt21cu121          pypi_0    pypi
torch-scatter             2.1.2+pt21cu121          pypi_0    pypi
torch-sparse              0.6.18+pt21cu121          pypi_0    pypi
torch-spline-conv         1.2.2+pt21cu121          pypi_0    pypi
torchaudio                2.1.0                py38_cu121    pytorch
torchtriton               2.1.0                      py38    pytorch
torchvision               0.16.0               py38_cu121    pytorch
tqdm                      4.65.0           py38hb070fc8_0  
typing_extensions         4.7.1            py38h06a4308_0  
urllib3                   1.26.18          py38h06a4308_0  
wheel                     0.41.2           py38h06a4308_0  
xz                        5.4.5                h5eee18b_0  
yaml                      0.2.5                h7b6447c_0  
zlib                      1.2.13               h5eee18b_0  
zstd                      1.5.5                hc292b87_0

By the way, when I try to increase the tensor size in order to test larger sparse matrices, the program reports an error:

y1 = torch.sparse.mm(a, a)
RuntimeError: CUDA error: call "cusparseSpGEMM_workEstimation(handle, opA, opB, &alpha, matA, matB, &beta, matC, computeType, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc, &bufferSize1, dBuffer1)" failed with "insufficient resources"

I've seen your reply under this error before, but I still don't understand why it happens; my interpretation is that some kind of memory leak is causing the exception.

Finally, I also had a look at the underlying torch.sparse.mm and torch.spmm code, and it seems that torch.sparse.mm supports gradient backpropagation whereas torch.spmm does not; furthermore, I can't find the underlying code for torch.spmm, so I suspect it is implemented directly in C++.
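
For reference, this is the minimal check I would run to see whether gradients flow through torch.sparse.mm (a sketch only; it assumes a recent PyTorch where the gradient of the sparse input comes back as a sparse COO tensor):

import torch

# Minimal sketch: verify that torch.sparse.mm is differentiable w.r.t. both inputs.
# Assumes a recent PyTorch; the gradient of the sparse input should be a COO tensor.
indices = torch.tensor([[0, 1, 1], [2, 0, 2]])
values = torch.tensor([3.0, 4.0, 5.0])
a = torch.sparse_coo_tensor(indices, values, (2, 3)).requires_grad_()
b = torch.rand(3, 2, requires_grad=True)

out = torch.sparse.mm(a, b)
out.sum().backward()

print(a.grad)  # sparse gradient w.r.t. a
print(b.grad)  # dense gradient w.r.t. b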

Maybe my test code is incomplete; I just had this doubt while running it and wanted to share and discuss it with you! 😊

Looking forward to your reply! 😘

rusty1s commented Dec 10, 2023

Thanks for the issue. The main reason we swapped to torch.sparse.mm is to utilize the reduce argument that PyTorch introduced for the CPU path. Besides that, on GPU, both functions should map to the same underlying implementation. You can verify this by correcting your benchmark: when benchmarking on GPUs, you need to avoid measuring warm-up times. The following code fixes this:

from time import time

import torch

torch.manual_seed(2022)

a = torch.rand(512, 512, dtype=torch.double).to_sparse().cuda()
b = torch.rand(512, 512, dtype=torch.double).to_sparse().cuda()

for i in range(100):
    if i == 20:
        sparse_start_time = time()
    y1 = torch.sparse.mm(a, b)
sparse_end_time = time()

for i in range(100):
    if i == 20:
        spmm_start_time = time()
    y2 = torch.spmm(a, b)
spmm_end_time = time()

print(y1)
print("=============================================")
print(y2)
print("=============================================")

print("torch.sparse.mm: ", (sparse_end_time - sparse_start_time), "s")
print("torch.spmm:      ", (spmm_end_time - spmm_start_time), "s")

Output:

torch.sparse.mm:  5.9526426792144775 s
torch.spmm:       5.965423583984375 s
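
For reference, the reduce argument mentioned above looks roughly like this (a minimal sketch; as far as I know it currently requires a sparse CSR matrix and only runs on the CPU path):

import torch

# Minimal sketch of the reduce argument of torch.sparse.mm (CPU, CSR only as far as I know).
a = torch.rand(512, 512)
a[a < 0.9] = 0.0                    # make the matrix roughly 90% sparse
a_csr = a.to_sparse_csr()
b = torch.rand(512, 64)

out_sum = torch.sparse.mm(a_csr, b)                   # default reduction: sum
out_mean = torch.sparse.mm(a_csr, b, reduce="mean")   # mean over non-zero entries per row
print(out_sum.shape, out_mean.shape)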

MrShouxingMa (Author) commented:
Thank you very much for your prompt reply! I redid the test following your reminder, and the results are consistent with what you said! 👍

import torch
import numpy as np
from time import time
import scipy.sparse as sp

torch.manual_seed(2022)
np.random.seed(2022)

a = np.random.rand(512, 512)
row = np.random.choice(np.arange(a.shape[0]), replace=False,
                       size=int(a.shape[0] * 0.5))
col = np.random.choice(np.arange(a.shape[1]), replace=False,
                       size=int(a.shape[1] * 0.5))
a[row, col] = 0
sp_a = sp.coo_matrix(a)

for i in range(10000):
    if i == 20:
        np_scipy_start_time = time()
    y0 = sp_a.dot(sp_a)
np_scipy_end_time = time()

# y0 = sp.coo_matrix(sp_a * sp_a)
a = torch.tensor(a).to_sparse().cuda()

for i in range(10000):
    if i == 20:
        sparse_start_time = time()
    y1 = torch.sparse.mm(a, a)
sparse_end_time = time()

for i in range(10000):
    if i == 20:
        spmm_start_time = time()
    y2 = torch.spmm(a, a)
spmm_end_time = time()

print("np_scipy:       ", (np_scipy_end_time - np_scipy_start_time), "s")
print("torch.sparse.mm: ", (sparse_end_time - sparse_start_time), "s")
print("torch.spmm:      ", (spmm_end_time - spmm_start_time), "s")

Output:

np_scipy:        1961.8122715950012 s
torch.sparse.mm:  275.561555147171 s
torch.spmm:       275.70165848731995 s

guohaoqiang commented:
> Thanks for the issue. The main reason we swapped to torch.sparse.mm is to utilize the reduce argument that PyTorch introduced for the CPU path. Besides that, on GPU, both functions should map to the same underlying implementation. [...] When benchmarking on GPUs, you need to avoid measuring warm-up times.

Hi, why does your approach avoid warm-up times? Thanks!
Btw, does the backend kernel for spmm call cuSPARSE kernels?

rusty1s commented Jan 17, 2024

Warm-up times are avoided by only measuring time from the 20th iteration onwards. I don't know if there is a better way to do this, but that's what I use all the time and have found to work quite well.
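
If you want to avoid the manual iteration-counting trick, a sketch of an alternative (assuming a CUDA build of PyTorch) is to run a few warm-up iterations first and then time with CUDA events, which also forces the necessary synchronization:

import torch

# Sketch: warm-up explicitly, then time GPU work with CUDA events (assumes CUDA is available).
a = torch.rand(512, 512, dtype=torch.double).to_sparse().cuda()
b = torch.rand(512, 512, dtype=torch.double).to_sparse().cuda()

for _ in range(20):              # warm-up: kernel loading, caching allocator, etc.
    torch.sparse.mm(a, b)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(80):
    torch.sparse.mm(a, b)
end.record()
torch.cuda.synchronize()
print("torch.sparse.mm:", start.elapsed_time(end) / 1000, "s")  # elapsed_time returns ms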

Backward pass does basically two things: (1) Compute the transposed version of the sparse matrix and (2) perform grad_mat = sparse_mat.t() @ grad_out

guohaoqiang commented Jan 17, 2024

> Warm-up times are avoided by just measuring time from the 20th iteration onwards. [...] Backward pass does basically two things: (1) Compute the transposed version of the sparse matrix and (2) perform grad_mat = sparse_mat.t() @ grad_out

Thank you!
That's what ge-spmm does (https://github.com/hgyhungry/ge-spmm/blob/master/pytorch-custom/op.py).
However, I also found this post (https://discuss.pytorch.org/t/manually-calculate-the-gradient-of-a-sparse-matrix/86203/2?u=jiuhnny) which says that this only gives the gradient of the dense matrix. The gradient of the sparse matrix is also required, namely df/dA = (df/dY) @ B.t() as in the post.
(Screenshot of the referenced forum post.)
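
Concretely, a rough sketch of such a custom op (in the spirit of the ge-spmm op linked above, not the torch_sparse implementation) that returns both gradients could look like this: the sparse gradient is (df/dY) @ B.t() restricted to A's sparsity pattern, and the dense gradient is A.t() @ (df/dY).

import torch

# Rough sketch of an spmm with gradients for both inputs (hypothetical, simplified).
class SpMM(torch.autograd.Function):
    @staticmethod
    def forward(ctx, sparse_a, dense_b):
        ctx.save_for_backward(sparse_a, dense_b)
        return torch.sparse.mm(sparse_a, dense_b)

    @staticmethod
    def backward(ctx, grad_out):
        sparse_a, dense_b = ctx.saved_tensors
        grad_a = grad_b = None
        if ctx.needs_input_grad[0]:
            # df/dA = (df/dY) @ B.t(), kept only on A's non-zero positions
            idx = sparse_a.coalesce().indices()
            dense_grad_a = grad_out @ dense_b.t()
            grad_a = torch.sparse_coo_tensor(
                idx, dense_grad_a[idx[0], idx[1]], sparse_a.shape)
        if ctx.needs_input_grad[1]:
            # df/dB = A.t() @ (df/dY)
            grad_b = torch.sparse.mm(sparse_a.t(), grad_out)
        return grad_a, grad_b

# Usage: out = SpMM.apply(sparse_a, dense_b)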

rusty1s commented Jan 17, 2024

Yes, that is correct.
