
Bug in element-wise multiplication of torch.sparse_csr_tensors on GPU - 0's in result considered significant - PyTorch 2.1.1 #114529

Open · Mystic-Slice opened this issue Nov 25, 2023 · 7 comments
Labels: module: regression · module: sparse · triaged

@Mystic-Slice commented Nov 25, 2023

🐛 Describe the bug

Problem:
The problem occurs with PyTorch 2.1.1. Element-wise multiplication of two torch.sparse_csr_tensors treats 0s produced by the multiplication as significant values and retains them in the sparse representation. This is not the expected behavior. Moreover, the bug occurs only when the operation runs on a GPU.

Code:

import torch

A = [[0, 0], 
     [1, 0], 
     [0, 2]]

B = [[1, 0],
     [0, 0],
     [2, 3]]

a = torch.tensor(A, device='cuda:0').float().to_sparse_csr()
b = torch.tensor(B, device='cuda:0').float().to_sparse_csr()

# a = torch.tensor(A).float().to_sparse_csr() # for runs on cpu
# b = torch.tensor(B).float().to_sparse_csr()

print("Torch version: ", torch.__version__)
print(a * b)

Output (Torch 2.0.0 - GPU): Expected Output

Torch version:  2.0.0
tensor(crow_indices=tensor([0, 0, 0, 1]),
       col_indices=tensor([1]),
       values=tensor([6.]), device='cuda:0', size=(3, 2), nnz=1, layout=torch.sparse_csr)

Output (Torch 2.1.1 - CPU): Expected Output

Torch version:  2.1.1
tensor(crow_indices=tensor([0, 0, 0, 1]),
       col_indices=tensor([1]),
       values=tensor([6.]), size=(3, 2), nnz=1, layout=torch.sparse_csr)

Output (Torch 2.1.1 - GPU):

Torch version:  2.1.1
tensor(crow_indices=tensor([0, 0, 1, 2]),
       col_indices=tensor([0, 1]),
       values=tensor([0., 6.]), device='cuda:0', size=(3, 2), nnz=2, layout=torch.sparse_csr)

The 0 produced by the multiplication at position (1, 0) is stored as a significant (explicit) value, so nnz is 2 instead of 1.
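A quick way to confirm that only the sparsity structure diverges, not the computed values, is to compare against the dense product. A minimal sketch, assuming the same A and B as above:

import torch

A = [[0, 0], [1, 0], [0, 2]]
B = [[1, 0], [0, 0], [2, 3]]

a = torch.tensor(A, device='cuda:0').float().to_sparse_csr()
b = torch.tensor(B, device='cuda:0').float().to_sparse_csr()

c = a * b
# The materialized values agree with the dense product in every version;
# only nnz and the index arrays differ on the 2.1.1 CUDA path.
assert torch.equal(c.to_dense(), a.to_dense() * b.to_dense())
print(c.values())  # 2.1.1 GPU: tensor([0., 6.], device='cuda:0')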

Versions

PyTorch version: 2.1.1
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU
Nvidia driver version: 526.98
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             16
On-line CPU(s) list:                0-15
Vendor ID:                          AuthenticAMD
Model name:                         AMD Ryzen 7 5800H with Radeon Graphics
CPU family:                         25
Model:                              80
Thread(s) per core:                 2
Core(s) per socket:                 8
Socket(s):                          1
Stepping:                           0
BogoMIPS:                           6387.84
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
Virtualization:                     AMD-V
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          256 KiB (8 instances)
L1i cache:                          256 KiB (8 instances)
L2 cache:                           4 MiB (8 instances)
L3 cache:                           16 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.0
[pip3] torch==2.1.1
[pip3] torchaudio==2.1.1
[pip3] torchvision==0.16.1
[pip3] triton==2.1.0
[conda] blas                      1.0                         mkl
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] libjpeg-turbo             2.0.0                h9bf148f_0    pytorch
[conda] mkl                       2023.1.0         h213fc3f_46344
[conda] mkl-service               2.4.0           py311h5eee18b_1
[conda] mkl_fft                   1.3.8           py311h5eee18b_0
[conda] mkl_random                1.2.4           py311hdb19cb5_0
[conda] numpy                     1.26.0          py311h08b1b3b_0
[conda] numpy-base                1.26.0          py311hf175353_0
[conda] pytorch                   2.1.1           py3.11_cuda12.1_cudnn8.9.2_0    pytorch
[conda] pytorch-cuda              12.1                 ha16c6d3_5    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                2.1.1               py311_cu121    pytorch
[conda] torchtriton               2.1.0                     py311    pytorch
[conda] torchvision               0.16.1              py311_cu121    pytorch

cc @alexsamardzic @nikitaved @pearu @cpuhrsch @amjames @bhosmer

@malfet added the module: sparse, triaged, and module: regression labels on Nov 27, 2023
@malfet (Contributor) commented Nov 27, 2023

Just to clarify: when you use 2.0 vs 2.1, are you using the same CUDA version or different ones?

@Mystic-Slice (Author) commented

I checked the versions via torch.version.cuda; they are different:

  • Torch 2.0.0 with CUDA 11.7
  • Torch 2.1.1 with CUDA 12.1

@cpuhrsch (Contributor) commented

@amjames Can you take a look?

@amjames (Collaborator) commented Nov 27, 2023

It took a while to trace through, but I confirmed there is a deviation in behavior here between the CUDA and CPU paths.

Some notes:

  • CSR * CSR will convert to COO * COO; you can also reproduce the issue by converting the inputs to COO (a short repro is sketched after this list).
  • The CPU implementation does not dispatch to the generic intersection kernel when both inputs are coalesced. There is a note suggesting the generic intersection could be slower than "the brute-force solution below" in this case.
  • The CUDA version will unconditionally end up in the generic intersection kernel; there is no alternative "brute-force" solution on the CUDA path.
  • If we remove the "brute-force" solution from the CPU path and forward coalesced * coalesced inputs into the generic kernel, we get behavior parity with CUDA, but then both paths produce a result with an explicit zero.
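As noted in the first point, the issue reproduces through the COO path directly. A minimal sketch, assuming the same A and B as in the report:

import torch

A = [[0, 0], [1, 0], [0, 2]]
B = [[1, 0], [0, 0], [2, 3]]

# Both inputs are coalesced COO tensors on CUDA; on 2.1.1 this also
# yields an explicit zero at (1, 0) via the generic intersection kernel.
a = torch.tensor(A, device='cuda:0').float().to_sparse().coalesce()
b = torch.tensor(B, device='cuda:0').float().to_sparse().coalesce()
print(a * b)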

We don't explicitly promise that we won't produce outputs with explicit zeros. We do strive to have coalesced inputs produce coalesced outputs wherever possible, but technically our definition of coalesced does not include "no explicit zeros will be stored".
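To illustrate that point: a coalesced COO tensor can legitimately store explicit zeros, since coalescing only merges duplicate indices and sorts them. A small sketch:

import torch

# coalesce() merges duplicate indices and sorts them; it does not drop
# stored zero values, so this tensor is coalesced with nnz == 2.
indices = torch.tensor([[1, 2], [0, 1]])
values = torch.tensor([0.0, 6.0])
t = torch.sparse_coo_tensor(indices, values, (3, 2)).coalesce()
print(t.is_coalesced(), t._nnz())  # True 2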

The comment does seem to hold up, though. Applying a quick modification that forces all sparse-sparse multiplies on the CPU through the generic algorithm used for CUDA, and comparing against the path taken currently, the generic kernel is noticeably slower:

algo     size    t (us)
-------  -----  --------
brute    512        3061
generic  512       16676
brute    1024      12343
generic  1024      82904
brute    2048      63633
generic  2048     318669
brute    4096     244440
generic  4096    1373275
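For reference, timings of this sort can be gathered at the Python level with a harness along these lines (a rough sketch with assumed sizes and ~1% density; switching between the two internal algorithms required the source modification described above and is not exposed from Python):

import time
import torch

def bench_sparse_mul(n, density=0.01, iters=10):
    # Build two sparse CSR operands of size n x n at the given density.
    a = (torch.rand(n, n) * (torch.rand(n, n) < density)).to_sparse_csr()
    b = (torch.rand(n, n) * (torch.rand(n, n) < density)).to_sparse_csr()
    start = time.perf_counter()
    for _ in range(iters):
        _ = a * b
    return (time.perf_counter() - start) / iters * 1e6  # microseconds

for n in (512, 1024, 2048, 4096):
    print(f"{n:5d}  {bench_sparse_mul(n):10.0f} us")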

The unexpected result is not really wrong; the expectation is slightly flawed, since a sparse tensor may contain explicit zeros. But the divergence in behavior between CPU and CUDA here is something I would consider a minor bug.

Working on a fix now.

@pearu (Collaborator) commented Nov 28, 2023

I think it would make sense to keep the operation algorithm separate from the algorithm that eliminates explicit zeros. This matters for efficiency: one may want to eliminate explicit zeros only as a last step, after applying a chain of operations to sparse tensors, because zero elimination is an expensive operation. It also matters for having generic algorithms that support both masked and non-masked semantics of sparse tensors: in masked semantics, keeping explicit zeros may be preferred when mask invariance of operations is required.
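As a concrete illustration of that separation, explicit zeros can be stripped in a standalone post-processing pass. A minimal sketch for COO (eliminate_explicit_zeros is a hypothetical helper, not a public PyTorch API):

import torch

def eliminate_explicit_zeros(t: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: drop stored zeros from a COO tensor as a
    # separate pass, run once after a chain of sparse operations.
    t = t.coalesce()
    keep = t.values() != 0
    return torch.sparse_coo_tensor(
        t.indices()[:, keep], t.values()[keep], t.shape
    )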

@amjames (Collaborator) commented Nov 28, 2023

@pearu In this case, the algorithm that does not produce explicit zeros in the output is already separate from the generic intersection kernel, and it is faster, at least in the case where it is used (both inputs coalesced).

@nikitaved (Collaborator) commented Nov 28, 2023

Compressed formats need their own intersection primitive to avoid such issues. It should be fairly straightforward to implement, since it shares some logic with the COO case. With COO, however, materializing zeros is the only way to avoid device synchronizations when at least one argument is coalesced.
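For intuition, the per-row index intersection such a dedicated CSR primitive would perform is essentially a sorted-list merge. A pure-Python illustration (not the proposed kernel):

def csr_row_mul(cols_a, vals_a, cols_b, vals_b):
    # Two-pointer intersection of one CSR row's sorted column indices;
    # only columns present in both rows produce an output entry, so no
    # explicit zeros are materialized from non-overlapping structure.
    out_cols, out_vals = [], []
    i = j = 0
    while i < len(cols_a) and j < len(cols_b):
        if cols_a[i] == cols_b[j]:
            out_cols.append(cols_a[i])
            out_vals.append(vals_a[i] * vals_b[j])
            i += 1
            j += 1
        elif cols_a[i] < cols_b[j]:
            i += 1
        else:
            j += 1
    return out_cols, out_vals

# Row 1 of the example: A stores col 0, B stores nothing -> empty result.
print(csr_row_mul([0], [1.0], [], []))  # ([], [])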
