Sparse CSR layout GPU backend tracking issue #60854
Comments
Looks great! Does the listing of tasks represent the preferred order of execution / priority?

Yes, it's ordered by priority.
Hi, this is great! One thing I would consider is if we should use …

I would wait a little before naming functions with …

All three options (…)
I am in favor of adding functions named based on consensus across the vendor libraries we're going to bind, or as specified by already existing popular libraries. Consistency across frameworks, I think, makes it easier for users to work on them and find more information in other contexts than a more verbose or "clear" name specific to our framework.
The last part that needs to be done from the original list of tasks is batched CSR support. |
Summary: This is the first portion of changes required to enable the batched CSR format described in #60854 (comment). Currently, only the same batch shape for indices and values is allowed. In the future, we could enable "broadcasting" of indices with batched values, as done in xFormers (https://github.com/facebookresearch/xformers/blob/dd96b8d8beda5308fb433c1ef3ff04b7f178c263/xformers/components/attention/_sputnik_sparse.py#L441). This PR adds the ability to construct a batched CSR matrix with `torch.sparse_csr_tensor`; the batched CSR tensor can be converted to a dense tensor with a `.to_dense()` call. Pull Request resolved: #74542 Approved by: https://github.com/cpuhrsch Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/c7ae23b50e5f96261889ab9d55df1be7a6b1d55f Reviewed By: b0noI Differential Revision: D35485699 fbshipit-source-id: fa1c0c5cf256ac886717a9016a83e62ea2772f75
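As a sketch of what the PR enables (assuming a PyTorch build in which this batched-CSR support has landed), a batch of CSR matrices is built by giving `crow_indices`, `col_indices`, and `values` a leading batch dimension:

```python
import torch

# A batch of two 2x3 CSR matrices; the batch dimension is the leading
# dimension of crow_indices/col_indices/values (same batch shape for all).
crow_indices = torch.tensor([[0, 1, 2],
                             [0, 1, 2]])
col_indices = torch.tensor([[0, 1],
                            [1, 2]])
values = torch.tensor([[1., 2.],
                       [3., 4.]])

batched = torch.sparse_csr_tensor(crow_indices, col_indices, values,
                                  size=(2, 2, 3))
dense = batched.to_dense()  # shape (2, 2, 3)
```

Note that, as the PR description says, "broadcasting" a single index set across batched values is not yet allowed; indices and values must share the batch shape.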
PyTorch 1.11
Currently, cuSPARSE is already used in PyTorch for some operations with the COO sparse matrix format. Internally, COO indices are converted to a low-level CSR representation that is used to call cuSPARSE routines, and the result is reconstructed back to COO.
For PyTorch 1.11 we're focusing on improving sparse CSR support and performance on GPUs.
This work aims to utilize all available functionality of cuSPARSE to expand operator support for PyTorch CSR matrices.
Go to the list of tasks.
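As a minimal illustration of the two layouts discussed above (runnable on CPU with any recent PyTorch):

```python
import torch

dense = torch.tensor([[0., 1.],
                      [2., 0.]])
coo = dense.to_sparse()        # COO: one (row, col) index pair per nonzero
csr = dense.to_sparse_csr()    # CSR: compressed row pointers + column indices

# CSR stores nrows+1 row pointers instead of one row index per nonzero,
# which is the representation cuSPARSE routines consume directly.
print(csr.crow_indices())      # tensor([0, 1, 2])
```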
cuSPARSE Generic API
cuSPARSE has started deprecating its low-level BLAS-like functions in favor of a higher-level "generic" interface.
Hopefully the performance cost of the generic API relative to the low-level one is small; in any case, we have no choice here, since the low-level BLAS-like functions are deprecated and being removed.
CUDA version requirement for cuSPARSE
The cuSPARSE Generic API was introduced in CUDA 10.1, but it was missing from the Windows distribution until CUDA 11.
For now, the requirement will be CUDA 11+, because it has more functionality. This requirement can be revisited later on a function-by-function basis.
PyTorch will raise a runtime error if an older CUDA version is used while calling sparse operators on GPU.
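A guard for this requirement could look like the sketch below. The function name is illustrative, not a PyTorch API; it only inspects the CUDA toolkit version PyTorch was built with.

```python
import torch

def cuda_supports_csr_ops() -> bool:
    # Illustrative check: sparse CSR CUDA kernels require CUDA 11+.
    # torch.version.cuda is None for CPU-only builds.
    if torch.version.cuda is None or not torch.cuda.is_available():
        return False
    major = int(torch.version.cuda.split(".")[0])
    return major >= 11
```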
Not all architectures are supported. If a specific GPU model does not provide native support for a given data type, the routine returns the `CUSPARSE_STATUS_ARCH_MISMATCH` error. Unsupported data types and Compute Capability (CC):

`__half` on GPUs with CC < 53 (e.g. Kepler)
`__nv_bfloat16` on GPUs with CC < 80 (e.g. Kepler, Maxwell, Pascal, Volta, Turing)

Sparse/Dense Matrix/Vector descriptors
cuSPARSE Generic API uses descriptor structs to collect the information about the tensor (size, indices, values, etc.)
There are 4 descriptors: `cusparseSpMatDescr_t` (sparse matrix), `cusparseDnMatDescr_t` (dense matrix), `cusparseSpVecDescr_t` (sparse vector), and `cusparseDnVecDescr_t` (dense vector).
Each descriptor struct has constructors, destructors and a few attribute getters/setters defined in cuSPARSE.
cuDNN follows a similar model and, in PyTorch, those descriptors are wrapped in `aten/src/ATen/cudnn/Descriptors.h`.

Indices dtype
PyTorch uses `int64` indices for the COO format. The current implementation of CSR supports both `int32` and `int64` dtypes for indices. cuSPARSE supports choosing 32-bit or 64-bit indexing at runtime. Within the generic API, only the SpGEMM function lacks 64-bit indexing support, and the legacy `cusparse<t>csrgeam2` doesn't support it either.

Batched computation support
cuSPARSE descriptors for dense and sparse matrices have the functionality to set a batch count and batch stride. Currently, only the cusparseSpMM function can use this information. This poses the requirement that indices and values are batch-strided (contiguous) tensors. For functions that do not have batched support, we can use a for loop over the batch dimension.
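The per-batch fallback loop mentioned above could be sketched as follows. The helper name is hypothetical, not a PyTorch API; it assumes `torch.mm` handles a single CSR-times-dense product, as it does on recent PyTorch.

```python
import torch

def spmm_batched_fallback(sparse_mats, dense_batch):
    # Fallback for routines without native batch support: apply the
    # non-batched op to each batch element and stack the results.
    return torch.stack(
        [torch.mm(a, b) for a, b in zip(sparse_mats, dense_batch)]
    )

mats = [torch.tensor([[0., 2.], [1., 0.]]).to_sparse_csr() for _ in range(3)]
batch = torch.randn(3, 2, 4)
out = spmm_batched_fallback(mats, batch)  # shape (3, 2, 4)
```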
PyTorch 1.11 Tasks
If you are planning to work on a specific task please put your name next to the task or leave a comment below. For any related pull requests use the `module: sparse` label. Feel free to tag @IvanYashchuk as a reviewer.

A list of tasks enumerated by cuSPARSE routine (ordered by priority):
cusparseSpMM (sparse matrix - dense matrix multiplication, dense result) (@IvanYashchuk)
cuSPARSE supports batching Cᵢ = A ⋅ Bᵢ, Cᵢ = Aᵢ ⋅ B, Cᵢ = Aᵢ ⋅ Bᵢ.
The deterministic algorithm is available (only column-major dense matrices, no batching).
Both 32-bit and 64-bit indices are supported.
Uniform precision and mixed-precision computations are supported.
Dtypes: float32, float64, complex64, complex128, float16, bfloat16 (`AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(kHalf, kBFloat16)`)

Backend for:
PR: Sparse CSR CUDA: Add torch.baddbmm and torch.bmm #68711
done via a hack (see `pytorch/test/test_sparse_csr.py`, lines 649 to 651 at bd3db01)
PR in progress: Add cuSPARSE descriptors and update CSR addmm #60838
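A usage sketch for the ops in this task: on CUDA CSR tensors, calls like `torch.addmm` dispatch to cusparseSpMM; this CPU example exercises the same Python API (assuming a PyTorch version with CSR `addmm` support).

```python
import torch

a = torch.tensor([[0., 2., 0.],
                  [1., 0., 3.]]).to_sparse_csr()  # sparse CSR operand
b = torch.randn(3, 4)
c = torch.zeros(2, 4)

# beta*c + alpha*(a @ b); sparse matrix, dense matrix, dense result.
out = torch.addmm(c, a, b)
```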
cusparseSpGEMM (sparse matrix - sparse matrix multiplication, sparse result) (@IvanYashchuk)
Only deterministic algorithm is available.
Only 32-bit indices are supported.
Only opA, opB equal to `CUSPARSE_OPERATION_NON_TRANSPOSE` are supported.
Only uniform-precision computation is supported.
Dtypes: float32, float64, complex64, complex128, float16, bfloat16 (`AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(kHalf, kBFloat16)`)

PR in progress: Sparse CSR CUDA: add `torch.addmm` with all inputs sparse #63511

cusparseSpMV (sparse matrix - dense vector multiplication, dense result) (@IvanYashchuk)
Deterministic algorithm is available (only `opA == CUSPARSE_OPERATION_NON_TRANSPOSE`).
Both 32-bit and 64-bit indices are supported.
Uniform precision and mixed-precision computations are supported.
Dtypes: float32, float64, complex64, complex128, float16, bfloat16 (`AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(kHalf, kBFloat16)`)

Backend for:

PR in progress: Sparse CSR CUDA: add `addmv_out` #61407

cusparseSpSM (sparse matrix - dense matrix triangular solver, dense result) (@IvanYashchuk)
cusparseSpSV (sparse matrix - dense vector triangular solver, dense result)
Only deterministic algorithm is available.
Both 32-bit and 64-bit indices are supported.
Dtypes: float32, float64, complex64, complex128 (`AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES`)

Backend for:

PR in progress: Sparse CSR CUDA: add `triangular_solve_out` #61858

cusparse<t>csrgeam2 (sparse matrix - sparse matrix addition, sparse result) (@IvanYashchuk)
Only 32-bit indices are supported.
Dtypes: float32, float64, complex64, complex128 (`AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES`)

Backend for:

PR in progress: Sparse CSR CUDA: add `torch.add` with all inputs sparse #63948

cusparseSDDMM (sparse mask, dense matrix - dense matrix multiplication, sparse result)
Only a deterministic algorithm is available.
Both 32-bit and 64-bit indices are supported.
Dtypes: float32, float64, complex64, complex128 (`AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES`)

Backend for:
PR: Sparse CSR CUDA: Add torch.sparse.sampled_addmm #68007
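The semantics of SDDMM can be illustrated with a small reference helper. This dense implementation is purely illustrative (not the actual CUDA kernel, and the function name is made up): it computes the dense-dense product only at positions where the sparse mask is nonzero.

```python
import torch

def sddmm_reference(mask_csr, a, b, alpha=1.0, beta=1.0):
    # Reference semantics of cusparseSDDMM / torch.sparse.sampled_addmm:
    # alpha * (a @ b) restricted to the mask's sparsity pattern, plus
    # beta * mask. Returns a sparse CSR tensor like the real op.
    mask = mask_csr.to_dense()
    spy = (mask != 0).to(a.dtype)          # sparsity pattern of the mask
    return (alpha * (a @ b) * spy + beta * mask).to_sparse_csr()

mask = torch.eye(2).to_sparse_csr()
a = torch.randn(2, 3)
b = torch.randn(3, 2)
out = sddmm_reference(mask, a, b)          # sparse CSR result
```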
All of the above should support batched CSR input.
Note that tensors with the CSR layout on a CUDA device use the `SparseCsrCUDA` dispatch key in the `native_functions.yaml` file.

Possible future work
There is a new library for structured sparse-dense matrix multiplication, cuSPARSELt; it might have better performance than cuSPARSE for some sparsity patterns (#62153).
cc @ngimel @nikitaved @pearu @cpuhrsch @IvanYashchuk