
Bug in element-wise multiplication of torch.sparse_csr_tensors on GPU - 0's in result considered significant - PyTorch 2.1.1 #114529

Open · Mystic-Slice opened this issue Nov 25, 2023 · 7 comments
Labels: module: regression · module: sparse · triaged

@Mystic-Slice commented Nov 25, 2023

🐛 Describe the bug

Problem:
The problem occurs with PyTorch 2.1.1. Element-wise multiplication of two torch.sparse_csr_tensors treats 0s produced by the multiplication as significant values and retains them in the sparse representation. This is not the expected behavior. Moreover, the bug occurs only when the operation runs on a GPU.

Code:

import torch

A = [[0, 0], 
     [1, 0], 
     [0, 2]]

B = [[1, 0],
     [0, 0],
     [2, 3]]

a = torch.tensor(A, device='cuda:0').float().to_sparse_csr()
b = torch.tensor(B, device='cuda:0').float().to_sparse_csr()

# a = torch.tensor(A).float().to_sparse_csr() # for runs on cpu
# b = torch.tensor(B).float().to_sparse_csr()

print("Torch version: ", torch.__version__)
print(a * b)

Output (Torch 2.0.0 - GPU): Expected Output

Torch version:  2.0.0
tensor(crow_indices=tensor([0, 0, 0, 1]),
       col_indices=tensor([1]),
       values=tensor([6.]), device='cuda:0', size=(3, 2), nnz=1, layout=torch.sparse_csr)

Output (Torch 2.1.1 - CPU): Expected Output

Torch version:  2.1.1
tensor(crow_indices=tensor([0, 0, 0, 1]),
       col_indices=tensor([1]),
       values=tensor([6.]), size=(3, 2), nnz=1, layout=torch.sparse_csr)

Output (Torch 2.1.1 - GPU):

Torch version:  2.1.1
tensor(crow_indices=tensor([0, 0, 1, 2]),
       col_indices=tensor([0, 1]),
       values=tensor([0., 6.]), device='cuda:0', size=(3, 2), nnz=2, layout=torch.sparse_csr)

The 0 produced by the multiplication at position (1, 0) is stored as a significant (explicit) value, so nnz is 2 instead of 1.
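A quick way to confirm that only the sparsity structure diverges, not the computed values, is to compare against the dense product. A minimal sketch, assuming the same A and B as above:

import torch

A = [[0, 0], [1, 0], [0, 2]]
B = [[1, 0], [0, 0], [2, 3]]

a = torch.tensor(A, device='cuda:0').float().to_sparse_csr()
b = torch.tensor(B, device='cuda:0').float().to_sparse_csr()

c = a * b
# The materialized values agree with the dense product in every version;
# only nnz and the index arrays differ on the 2.1.1 CUDA path.
assert torch.equal(c.to_dense(), a.to_dense() * b.to_dense())
print(c.values())  # 2.1.1 GPU: tensor([0., 6.], device='cuda:0')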

Versions

PyTorch version: 2.1.1
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU
Nvidia driver version: 526.98
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             16
On-line CPU(s) list:                0-15
Vendor ID:                          AuthenticAMD
Model name:                         AMD Ryzen 7 5800H with Radeon Graphics
CPU family:                         25
Model:                              80
Thread(s) per core:                 2
Core(s) per socket:                 8
Socket(s):                          1
Stepping:                           0
BogoMIPS:                           6387.84
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
Virtualization:                     AMD-V
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          256 KiB (8 instances)
L1i cache:                          256 KiB (8 instances)
L2 cache:                           4 MiB (8 instances)
L3 cache:                           16 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.0
[pip3] torch==2.1.1
[pip3] torchaudio==2.1.1
[pip3] torchvision==0.16.1
[pip3] triton==2.1.0
[conda] blas                      1.0                         mkl
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] libjpeg-turbo             2.0.0                h9bf148f_0    pytorch
[conda] mkl                       2023.1.0         h213fc3f_46344
[conda] mkl-service               2.4.0           py311h5eee18b_1
[conda] mkl_fft                   1.3.8           py311h5eee18b_0
[conda] mkl_random                1.2.4           py311hdb19cb5_0
[conda] numpy                     1.26.0          py311h08b1b3b_0
[conda] numpy-base                1.26.0          py311hf175353_0
[conda] pytorch                   2.1.1           py3.11_cuda12.1_cudnn8.9.2_0    pytorch
[conda] pytorch-cuda              12.1                 ha16c6d3_5    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                2.1.1               py311_cu121    pytorch
[conda] torchtriton               2.1.0                     py311    pytorch
[conda] torchvision               0.16.1              py311_cu121    pytorch

cc @alexsamardzic @nikitaved @pearu @cpuhrsch @amjames @bhosmer

@malfet added the module: sparse, triaged, and module: regression labels on Nov 27, 2023
@malfet (Contributor) commented Nov 27, 2023

Just to clarify: when you use 2.0 vs 2.1, are you using the same CUDA version or different ones?

@Mystic-Slice (Author) commented

I checked the versions via torch.version.cuda; they are different:

  • Torch 2.0.0 with CUDA 11.7
  • Torch 2.1.1 with CUDA 12.1

@cpuhrsch (Contributor) commented

@amjames Can you take a look?

@amjames (Collaborator) commented Nov 27, 2023

It took a while to trace through, but I confirmed there is a deviation in behavior here between the CUDA and CPU paths.

Some notes:

  • CSR * CSR will convert to COO * COO; you can also reproduce the issue by converting the inputs to COO (a short repro is sketched after this list).
  • The CPU implementation does not dispatch to the generic intersection kernel when both inputs are coalesced. There is a note suggesting the generic intersection could be slower than "the brute-force solution below" in this case.
  • The CUDA version will unconditionally end up in the generic intersection kernel; there is no alternative "brute-force" solution on the CUDA path.
  • If we remove the "brute-force" solution from the CPU path and forward coalesced * coalesced inputs into the generic kernel, we get behavior parity with CUDA, but then both paths produce a result with an explicit zero.
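As noted in the first point, the issue reproduces through the COO path directly. A minimal sketch, assuming the same A and B as in the report:

import torch

A = [[0, 0], [1, 0], [0, 2]]
B = [[1, 0], [0, 0], [2, 3]]

# Both inputs are coalesced COO tensors on CUDA; on 2.1.1 this also
# yields an explicit zero at (1, 0) via the generic intersection kernel.
a = torch.tensor(A, device='cuda:0').float().to_sparse().coalesce()
b = torch.tensor(B, device='cuda:0').float().to_sparse().coalesce()
print(a * b)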

We don't explicitly promise that we won't produce outputs with explicit zeros. We do strive to have coalesced inputs produce coalesced outputs wherever possible, but technically our definition of coalesced does not include "no explicit zeros will be stored".
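To illustrate that point: a coalesced COO tensor can legitimately store explicit zeros, since coalescing only merges duplicate indices and sorts them. A small sketch:

import torch

# coalesce() merges duplicate indices and sorts them; it does not drop
# stored zero values, so this tensor is coalesced with nnz == 2.
indices = torch.tensor([[1, 2], [0, 1]])
values = torch.tensor([0.0, 6.0])
t = torch.sparse_coo_tensor(indices, values, (3, 2)).coalesce()
print(t.is_coalesced(), t._nnz())  # True 2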

The comment does seem to hold up, though. Applying a quick modification that forces all sparse-sparse multiplies on the CPU through the generic algorithm used for CUDA, and comparing against the path taken currently, the generic kernel is noticeably slower:

algo     size    t (us)
-------  -----  --------
brute    512        3061
generic  512       16676
brute    1024      12343
generic  1024      82904
brute    2048      63633
generic  2048     318669
brute    4096     244440
generic  4096    1373275
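For reference, timings of this sort can be gathered at the Python level with a harness along these lines (a rough sketch with assumed sizes and ~1% density; switching between the two internal algorithms required the source modification described above and is not exposed from Python):

import time
import torch

def bench_sparse_mul(n, density=0.01, iters=10):
    # Build two sparse CSR operands of size n x n at the given density.
    a = (torch.rand(n, n) * (torch.rand(n, n) < density)).to_sparse_csr()
    b = (torch.rand(n, n) * (torch.rand(n, n) < density)).to_sparse_csr()
    start = time.perf_counter()
    for _ in range(iters):
        _ = a * b
    return (time.perf_counter() - start) / iters * 1e6  # microseconds

for n in (512, 1024, 2048, 4096):
    print(f"{n:5d}  {bench_sparse_mul(n):10.0f} us")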

The unexpected result is not really wrong; the expectation is slightly flawed, since a sparse tensor may contain explicit zeros. But the divergence in behavior between CPU and CUDA here is something I would consider a minor bug.

Working on a fix now.

@pearu (Collaborator) commented Nov 28, 2023

I think it would make sense to keep the operation algorithm separate from the algorithm that eliminates explicit zeros. This matters for efficiency: one may want to eliminate explicit zeros only as a last step, after applying a chain of operations to sparse tensors, because zero elimination is an expensive operation. It also matters for having generic algorithms that support both masked and non-masked semantics of sparse tensors: in masked semantics, keeping explicit zeros may be preferred when mask invariance of operations is required.
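As a concrete illustration of that separation, explicit zeros can be stripped in a standalone post-processing pass. A minimal sketch for COO (eliminate_explicit_zeros is a hypothetical helper, not a public PyTorch API):

import torch

def eliminate_explicit_zeros(t: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: drop stored zeros from a COO tensor as a
    # separate pass, run once after a chain of sparse operations.
    t = t.coalesce()
    keep = t.values() != 0
    return torch.sparse_coo_tensor(
        t.indices()[:, keep], t.values()[keep], t.shape
    )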

@amjames (Collaborator) commented Nov 28, 2023

@pearu In this case, the algorithm that does not produce explicit zeros in the output is already separate from the generic intersection kernel, and it is faster, at least in the case where it is used (both inputs coalesced).

@nikitaved (Collaborator) commented Nov 28, 2023

Compressed formats need their own intersection primitive to avoid such issues. It should be fairly straightforward to implement, since it shares some logic with the COO case. With COO, however, materializing zeros is the only way to avoid device synchronizations when at least one argument is coalesced.
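For intuition, the per-row index intersection such a dedicated CSR primitive would perform is essentially a sorted-list merge. A pure-Python illustration (not the proposed kernel):

def csr_row_mul(cols_a, vals_a, cols_b, vals_b):
    # Two-pointer intersection of one CSR row's sorted column indices;
    # only columns present in both rows produce an output entry, so no
    # explicit zeros are materialized from non-overlapping structure.
    out_cols, out_vals = [], []
    i = j = 0
    while i < len(cols_a) and j < len(cols_b):
        if cols_a[i] == cols_b[j]:
            out_cols.append(cols_a[i])
            out_vals.append(vals_a[i] * vals_b[j])
            i += 1
            j += 1
        elif cols_a[i] < cols_b[j]:
            i += 1
        else:
            j += 1
    return out_cols, out_vals

# Row 1 of the example: A stores col 0, B stores nothing -> empty result.
print(csr_row_mul([0], [1.0], [], []))  # ([], [])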
