
Sparse CSR layout GPU backend tracking issue #60854

Open
19 of 24 tasks
IvanYashchuk opened this issue Jun 28, 2021 · 7 comments
Assignees: IvanYashchuk
Labels: module: cuda (Related to torch.cuda, and CUDA support in general), module: sparse (Related to torch.sparse), tracker (A tracking issue), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@IvanYashchuk (Collaborator) commented Jun 28, 2021

PyTorch 1.11

Currently, cuSPARSE is already used in PyTorch for some operations with the COO sparse matrix format. Internally, the COO indices are converted to a low-level CSR representation, the cuSPARSE routines are called, and the result is converted back to COO.

For PyTorch 1.11 we're focusing on improving sparse CSR support and performance on GPUs.
This work aims to utilize all available functionality of cuSPARSE to expand operator support for PyTorch CSR matrices.
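
To make this concrete, below is a minimal sketch (not taken from the task list; shapes and values are made up for illustration) of the kind of user-facing call this work targets: a CSR matrix on a CUDA device multiplied by a dense matrix, which is expected to dispatch to cuSPARSE SpMM.

```python
import torch

# Hedged sketch: a CSR matrix on a CUDA device multiplied by a dense matrix.
# Once the CSR GPU backend is in place, this kind of call is expected to
# dispatch to cuSPARSE SpMM.
crow_indices = torch.tensor([0, 2, 3])   # row offsets, length num_rows + 1
col_indices = torch.tensor([0, 2, 1])    # column index of each stored value
values = torch.tensor([1.0, 2.0, 3.0])   # the stored values

a = torch.sparse_csr_tensor(crow_indices, col_indices, values,
                            size=(2, 3), device="cuda")
b = torch.randn(3, 4, device="cuda")
c = a.matmul(b)                          # dense result of shape (2, 4)
```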

Go to the list of tasks.

cuSPARSE Generic API

cuSPARSE has started deprecating its low-level, BLAS-like functions in favor of a higher-level "generic" interface:

The cuSPARSE Generic APIs allow computing the most common sparse linear algebra operations, such as sparse matrix-vector (SpMV) and sparse matrix-matrix multiplication (SpMM), in a flexible way. The new APIs have the following capabilities and features:

  • Set matrix data layouts, number of batches, and storage formats (for example, CSR, COO, and so on)
  • Set input/output/compute data types. This also allows mixed data-type computation
  • Set types of sparse matrix indices
  • Choose the algorithm for the computation
  • Provide external device memory for internal operations
  • Provide extensive consistency checks across input matrices and vectors for a given routine. This includes the validation of matrix sizes, data types, layout, allowed operations, etc.

Hopefully the performance cost of the generic API relative to the low-level API is not large; in any case we have no choice here, as the low-level BLAS-like functions are deprecated and being removed.

CUDA version requirement for cuSPARSE

The cuSPARSE Generic API was introduced in CUDA 10.1, but the Windows distribution was missing until CUDA 11.
For now, the requirement will be CUDA 11+, because it provides more functionality. This requirement can be revisited later on a function-by-function basis.
PyTorch will raise a runtime error if sparse operators are called on the GPU with an older CUDA version.

Not all architectures are supported. If a specific GPU model does not provide native support for a given data type, the routine returns the CUSPARSE_STATUS_ARCH_MISMATCH error.
Unsupported data types and Compute Capability (CC):

  • __half on GPUs with CC < 53 (e.g. Kepler)
  • __nv_bfloat16 on GPUs with CC < 80 (e.g. Kepler, Maxwell, Pascal, Volta, Turing)
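
As a hedged illustration (not part of the issue; the helper name and error messages are hypothetical), these version and compute-capability constraints can be checked from Python before using float16/bfloat16 values on the GPU:

```python
import torch

# Hypothetical helper: check the CUDA-version and compute-capability
# constraints listed above before using float16/bfloat16 CSR tensors on GPU.
def check_csr_gpu_support(dtype):
    major_cuda = int(torch.version.cuda.split(".")[0])   # e.g. "11.3" -> 11
    if major_cuda < 11:
        raise RuntimeError("sparse CSR CUDA operators require CUDA 11+")
    cc_major, cc_minor = torch.cuda.get_device_capability()
    cc = cc_major * 10 + cc_minor
    if dtype is torch.float16 and cc < 53:
        raise RuntimeError("__half requires compute capability >= 5.3")
    if dtype is torch.bfloat16 and cc < 80:
        raise RuntimeError("__nv_bfloat16 requires compute capability >= 8.0")

check_csr_gpu_support(torch.float16)
```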

Sparse/Dense Matrix/Vector descriptors

The cuSPARSE Generic API uses descriptor structs to collect information about a tensor (size, indices, values, etc.).
There are four descriptors:

  • cusparseSpVecDescr - sparse vector
  • cusparseSpMatDescr - sparse matrix (COO or CSR format)
  • cusparseDnVecDescr - dense vector
  • cusparseDnMatDescr - dense matrix

Each descriptor struct has a constructor, a destructor, and a few attribute getters/setters defined in cuSPARSE.
cuDNN follows a similar model, and in PyTorch its descriptors are wrapped in aten/src/ATen/cudnn/Descriptors.h.
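
At the Python level, the information such a sparse matrix descriptor is built from is already exposed on a CSR tensor; a hedged sketch (not the actual ATen wrapper) of gathering it:

```python
import torch

# Hedged sketch (not the actual ATen descriptor wrapper): the pieces of
# information a cusparseSpMatDescr for a CSR matrix needs are exactly what a
# PyTorch CSR tensor already exposes.
a = torch.sparse_csr_tensor(torch.tensor([0, 2, 3]),
                            torch.tensor([0, 2, 1]),
                            torch.tensor([1.0, 2.0, 3.0]),
                            size=(2, 3), device="cuda")

descriptor_inputs = {
    "rows": a.shape[0],
    "cols": a.shape[1],
    "nnz": a.values().numel(),
    "crow_indices": a.crow_indices(),        # row offsets
    "col_indices": a.col_indices(),          # column indices
    "values": a.values(),                    # stored values
    "index_dtype": a.crow_indices().dtype,   # 32- or 64-bit indices
    "value_dtype": a.dtype,                  # value data type
}
```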

Indices dtype

PyTorch uses int64 indices for COO format. The current implementation of CSR supports both int32 and int64 dtypes for indices.

cuSPARSE supports choosing 32-bit or 64-bit indexing at runtime. Among the generic API routines, only the SpGEMM function does not support 64-bit indexing; the legacy cusparse<t>csrgeam2 routine does not support it either.
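
A hedged sketch of the index dtype choice as it appears at the Python level (values made up for illustration):

```python
import torch

# The CSR constructor accepts both int32 and int64 index tensors, matching
# the runtime index-type choice offered by the cuSPARSE generic API.
crow32 = torch.tensor([0, 2, 3], dtype=torch.int32)
col32 = torch.tensor([0, 2, 1], dtype=torch.int32)
vals = torch.tensor([1.0, 2.0, 3.0])

a32 = torch.sparse_csr_tensor(crow32, col32, vals, size=(2, 3), device="cuda")
a64 = torch.sparse_csr_tensor(crow32.long(), col32.long(), vals,
                              size=(2, 3), device="cuda")

print(a32.crow_indices().dtype)  # torch.int32
print(a64.crow_indices().dtype)  # torch.int64
```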

Batched computation support

cuSPARSE descriptors for dense and sparse matrices have the functionality to set batch count and batch stride. Currently, only the cusparseSpMM function can use this information. This poses the requirement that indices and values are batch strided (contiguous) tensors, that is
attribute       batched shape                single matrix shape
crow_indices    (batch_shape, num_rows+1)    (num_rows+1,)
col_indices     (batch_shape, nnz)           (nnz,)
values          (batch_shape, nnz)           (nnz,)

For functions that do not have batched support, we can fall back to a for loop over the batch dimension (see the sketch below).
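
A hedged sketch of the batched layout from the table above, together with the for-loop fallback (a batch of two 2x3 matrices with the same nnz; values made up for illustration; batched construction and .to_dense() assume a build that includes the batched CSR support from the commits referenced below):

```python
import torch

# Batched CSR components laid out as in the table above:
# crow_indices (batch, num_rows + 1), col_indices (batch, nnz), values (batch, nnz).
crow = torch.tensor([[0, 2, 3], [0, 1, 3]])
col = torch.tensor([[0, 2, 1], [1, 0, 2]])
vals = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

a = torch.sparse_csr_tensor(crow, col, vals, size=(2, 2, 3), device="cuda")
dense = a.to_dense()                       # shape (2, 2, 3)

# Fallback for routines without batched cuSPARSE support: loop over the batch
# dimension, building an unbatched CSR matrix per batch element.
b = torch.randn(2, 3, 4, device="cuda")
out = torch.stack([
    torch.sparse_csr_tensor(crow[i], col[i], vals[i],
                            size=(2, 3), device="cuda").matmul(b[i])
    for i in range(crow.shape[0])
])                                         # shape (2, 2, 4)
```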

PyTorch 1.11 Tasks

If you are planning to work on a specific task, please put your name next to the task or leave a comment below. For any related pull requests, use the module: sparse label. Feel free to tag @IvanYashchuk as a reviewer.

A list of tasks enumerated by cuSPARSE routine (ordered by priority):

Note that tensors with the CSR layout on a CUDA device use the SparseCsrCUDA dispatch key in the native_functions.yaml file.

Possible future work

There is a new library for structured sparse-dense matrix multiplication, cuSPARSELt; it might have better performance than cuSPARSE for some sparsity patterns (#62153).

cc @ngimel @nikitaved @pearu @cpuhrsch @IvanYashchuk

@IvanYashchuk added the module: sparse (Related to torch.sparse) label Jun 28, 2021
@IvanYashchuk self-assigned this Jun 28, 2021
@jbschlosser added the module: cuda (Related to torch.cuda, and CUDA support in general) and triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) labels Jun 28, 2021
@cpuhrsch (Contributor):

Looks great! Does the listing of tasks represent the preferred order of execution / priority?

@IvanYashchuk (Collaborator, Author):

Yes, it's ordered by priority.

@heitorschueroff added the tracker (A tracking issue) label Jul 9, 2021
@pearu added this to In progress in Sparse tensors Aug 10, 2021
@pearu moved this from In progress to To Do: CSR in Sparse tensors Aug 10, 2021
@fmassa (Member) commented Aug 20, 2021

Hi,

This is great!

One thing I would consider is whether we should use torch.sparse.masked_matmul as a name, or instead name it torch.sparse.sddmm (as it is known in the literature). I'm not sure how common / clear masked_matmul is, and maybe sampled_matmul or sampled_dense_dense_matmul would be more appropriate as names.
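
(For reference, whatever the final name, here is a hedged dense-reference sketch of the semantics under discussion; the function name below is hypothetical and not the eventual PyTorch API.)

```python
import torch

# SDDMM / "sampled" matmul: compute (A @ B) only at the positions where a
# sparse mask has stored elements, returning a result restricted to that
# sparsity pattern. Dense reference implementation for illustration only.
def sampled_dense_dense_matmul_reference(a, b, mask_csr):
    dense = a @ b                                  # full dense product
    sampled = dense * (mask_csr.to_dense() != 0)   # keep only masked positions
    return sampled.to_sparse_csr()

a = torch.randn(4, 8)
b = torch.randn(8, 5)
mask = torch.eye(4, 5).to_sparse_csr()             # sample only the diagonal
out = sampled_dense_dense_matmul_reference(a, b, mask)
```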

@pearu (Collaborator) commented Aug 20, 2021

I would wait a little before naming functions with the masked_ prefix because, according to the Masked reductions proposal (TBA, @cpuhrsch), a masked_op takes pairs (tensor, mask) as inputs and returns similar pairs as outputs, and it would be good to settle this convention before implementing other possible masked-tensor-input-output conventions.

@IvanYashchuk (Collaborator, Author):

All three options (sddmm, sampled_matmul, sampled_dense_dense_matmul) sound good to me. Maybe the latter two are better, as there is a tendency in PyTorch to use more descriptive names rather than acronyms.

@cpuhrsch (Contributor):

I am in favor of naming functions based on consensus across the vendor libraries we're going to bind, or as specified by existing popular libraries. I think consistency across frameworks makes it easier for users to work with these functions and to find more information in other contexts than a more verbose or "clear" name within our framework would.

@IvanYashchuk (Collaborator, Author):

The last part that needs to be done from the original list of tasks is batched CSR support.

facebook-github-bot pushed a commit that referenced this issue Apr 8, 2022
Summary:
This is the first portion of changes required to enable the Batched CSR format described in #60854 (comment).

Currently, only the same batch shape for indices and values is allowed. In the future, we could enable "broadcasting" of indices and batched values, as done in xFormers (https://github.com/facebookresearch/xformers/blob/dd96b8d8beda5308fb433c1ef3ff04b7f178c263/xformers/components/attention/_sputnik_sparse.py#L441).

This PR adds the possibility to construct a batched CSR matrix with `torch.sparse_csr_tensor`, and this batched CSR tensor can be converted to a dense tensor with a `.to_dense()` call.

Pull Request resolved: #74542
Approved by: https://github.com/cpuhrsch

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/c7ae23b50e5f96261889ab9d55df1be7a6b1d55f

Reviewed By: b0noI

Differential Revision: D35485699

fbshipit-source-id: fa1c0c5cf256ac886717a9016a83e62ea2772f75