Sparse CSR layout GPU backend tracking issue #60854
Comments
Looks great! Does the listing of tasks represent the preferred order of execution / priority?

Yes, it's ordered by priority.
Hi, this is great! One thing I would consider is if we should use …

I would wait a little before naming functions with …

All three options (…)
I am in favor of adding functions named based on consensus across the vendor libraries we're going to bind, or as specified by already existing popular libraries. Consistency across frameworks, I think, makes it easier for users to work on them and find more information in other contexts than a more verbose or "clear" name specific to our framework.
The last part that needs to be done from the original list of tasks is batched CSR support. |
Summary: This is the first portion of changes required to enable the batched CSR format described in #60854 (comment). Currently, only the same batch shape for indices and values is allowed. In the future, we could enable "broadcasting" of indices with batched values, as done in xFormers (https://github.com/facebookresearch/xformers/blob/dd96b8d8beda5308fb433c1ef3ff04b7f178c263/xformers/components/attention/_sputnik_sparse.py#L441). This PR adds the ability to construct a batched CSR matrix with `torch.sparse_csr_tensor`; the batched CSR tensor can be converted to a dense tensor with a `.to_dense()` call. Pull Request resolved: #74542 Approved by: https://github.com/cpuhrsch Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/c7ae23b50e5f96261889ab9d55df1be7a6b1d55f Reviewed By: b0noI Differential Revision: D35485699 fbshipit-source-id: fa1c0c5cf256ac886717a9016a83e62ea2772f75
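As a sketch of what the PR enables (assuming a PyTorch build in which this batched-CSR support has landed), a batch of CSR matrices is built by giving `crow_indices`, `col_indices`, and `values` a leading batch dimension:

```python
import torch

# A batch of two 2x3 CSR matrices; the batch dimension is the leading
# dimension of crow_indices/col_indices/values (same batch shape for all).
crow_indices = torch.tensor([[0, 1, 2],
                             [0, 1, 2]])
col_indices = torch.tensor([[0, 1],
                            [1, 2]])
values = torch.tensor([[1., 2.],
                       [3., 4.]])

batched = torch.sparse_csr_tensor(crow_indices, col_indices, values,
                                  size=(2, 2, 3))
dense = batched.to_dense()  # shape (2, 2, 3)
```

Note that, as the PR description says, "broadcasting" a single index set across batched values is not yet allowed; indices and values must share the batch shape.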
PyTorch 1.11
Currently, cuSPARSE is already used in PyTorch for some operations with the COO sparse matrix format. Internally, COO indices are converted to a low-level CSR representation that is used to call cuSPARSE routines, and the result is reconstructed back to COO.
For PyTorch 1.11 we're focusing on improving sparse CSR support and performance on GPUs.
This work aims to utilize all available functionality of cuSPARSE to expand operator support for PyTorch CSR matrices.
Go to the list of tasks.
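As a minimal illustration of the two layouts discussed above (runnable on CPU with any recent PyTorch):

```python
import torch

dense = torch.tensor([[0., 1.],
                      [2., 0.]])
coo = dense.to_sparse()        # COO: one (row, col) index pair per nonzero
csr = dense.to_sparse_csr()    # CSR: compressed row pointers + column indices

# CSR stores nrows+1 row pointers instead of one row index per nonzero,
# which is the representation cuSPARSE routines consume directly.
print(csr.crow_indices())      # tensor([0, 1, 2])
```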
cuSPARSE Generic API
cuSPARSE has started deprecating its low-level BLAS-like functions in favor of a higher-level "generic" interface.
Hopefully the performance cost of the generic API relative to the low-level one is small; in any case, we have no choice here, since the low-level BLAS-like functions are deprecated and being removed.
CUDA version requirement for cuSPARSE
The cuSPARSE Generic API was introduced in CUDA 10.1, but it was missing from the Windows distribution until CUDA 11.
For now, the requirement will be CUDA 11+, because it has more functionality. This requirement can be revisited later on a function-by-function basis.
PyTorch will raise a runtime error if an older CUDA version is used while calling sparse operators on GPU.
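A guard for this requirement could look like the sketch below. The function name is illustrative, not a PyTorch API; it only inspects the CUDA toolkit version PyTorch was built with.

```python
import torch

def cuda_supports_csr_ops() -> bool:
    # Illustrative check: sparse CSR CUDA kernels require CUDA 11+.
    # torch.version.cuda is None for CPU-only builds.
    if torch.version.cuda is None or not torch.cuda.is_available():
        return False
    major = int(torch.version.cuda.split(".")[0])
    return major >= 11
```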
Not all architectures are supported. If a specific GPU model does not provide native support for a given data type, the routine returns the `CUSPARSE_STATUS_ARCH_MISMATCH` error. Unsupported data types and Compute Capability (CC):

`__half` on GPUs with CC < 53 (e.g. Kepler)
`__nv_bfloat16` on GPUs with CC < 80 (e.g. Kepler, Maxwell, Pascal, Volta, Turing)

Sparse/Dense Matrix/Vector descriptors
cuSPARSE Generic API uses descriptor structs to collect the information about the tensor (size, indices, values, etc.)
There are 4 descriptors: `cusparseSpMatDescr_t` (sparse matrix), `cusparseDnMatDescr_t` (dense matrix), `cusparseSpVecDescr_t` (sparse vector), and `cusparseDnVecDescr_t` (dense vector).
Each descriptor struct has constructors, destructors and a few attribute getters/setters defined in cuSPARSE.
cuDNN follows a similar model and, in PyTorch, those descriptors are wrapped in `aten/src/ATen/cudnn/Descriptors.h`.

Indices dtype
PyTorch uses `int64` indices for the COO format. The current implementation of CSR supports both `int32` and `int64` dtypes for indices. cuSPARSE supports choosing 32-bit or 64-bit indexing at runtime. Within the generic API, only the SpGEMM function lacks 64-bit indexing support, and the legacy `cusparse<t>csrgeam2` doesn't support it either.

Batched computation support
cuSPARSE descriptors for dense and sparse matrices have the functionality to set a batch count and batch stride. Currently, only the cusparseSpMM function can use this information. This poses the requirement that indices and values are batch-strided (contiguous) tensors. For functions that do not have batched support, we can use a for loop over the batch dimension.
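The per-batch fallback loop mentioned above could be sketched as follows. The helper name is hypothetical, not a PyTorch API; it assumes `torch.mm` handles a single CSR-times-dense product, as it does on recent PyTorch.

```python
import torch

def spmm_batched_fallback(sparse_mats, dense_batch):
    # Fallback for routines without native batch support: apply the
    # non-batched op to each batch element and stack the results.
    return torch.stack(
        [torch.mm(a, b) for a, b in zip(sparse_mats, dense_batch)]
    )

mats = [torch.tensor([[0., 2.], [1., 0.]]).to_sparse_csr() for _ in range(3)]
batch = torch.randn(3, 2, 4)
out = spmm_batched_fallback(mats, batch)  # shape (3, 2, 4)
```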
PyTorch 1.11 Tasks
If you are planning to work on a specific task please put your name next to the task or leave a comment below. For any related pull requests use the `module: sparse` label. Feel free to tag @IvanYashchuk as a reviewer.

A list of tasks enumerated by cuSPARSE routine (ordered by priority):
cusparseSpMM (sparse matrix - dense matrix multiplication, dense result) (@IvanYashchuk)
cuSPARSE supports batching Cᵢ = A ⋅ Bᵢ, Cᵢ = Aᵢ ⋅ B, Cᵢ = Aᵢ ⋅ Bᵢ.
The deterministic algorithm is available (only column-major dense matrices, no batching).
Both 32-bit and 64-bit indices are supported.
Uniform precision and mixed-precision computations are supported.
Dtypes: float32, float64, complex64, complex128, float16, bfloat16 (`AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(kHalf, kBFloat16)`)

Backend for:
PR: Sparse CSR CUDA: Add torch.baddbmm and torch.bmm #68711
done via a hack (see `pytorch/test/test_sparse_csr.py`, lines 649 to 651 at bd3db01)
PR in progress: Add cuSPARSE descriptors and update CSR addmm #60838
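A usage sketch for the ops in this task: on CUDA CSR tensors, calls like `torch.addmm` dispatch to cusparseSpMM; this CPU example exercises the same Python API (assuming a PyTorch version with CSR `addmm` support).

```python
import torch

a = torch.tensor([[0., 2., 0.],
                  [1., 0., 3.]]).to_sparse_csr()  # sparse CSR operand
b = torch.randn(3, 4)
c = torch.zeros(2, 4)

# beta*c + alpha*(a @ b); sparse matrix, dense matrix, dense result.
out = torch.addmm(c, a, b)
```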
cusparseSpGEMM (sparse matrix - sparse matrix multiplication, sparse result) (@IvanYashchuk)
Only deterministic algorithm is available.
Only 32-bit indices are supported.
Only opA, opB equal to `CUSPARSE_OPERATION_NON_TRANSPOSE` are supported.
Only uniform-precision computation is supported.
Dtypes: float32, float64, complex64, complex128, float16, bfloat16 (`AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(kHalf, kBFloat16)`)

PR in progress: Sparse CSR CUDA: add `torch.addmm` with all inputs sparse #63511

cusparseSpMV (sparse matrix - dense vector multiplication, dense result) (@IvanYashchuk)
Deterministic algorithm is available (only `opA == CUSPARSE_OPERATION_NON_TRANSPOSE`).
Both 32-bit and 64-bit indices are supported.
Uniform precision and mixed-precision computations are supported.
Dtypes: float32, float64, complex64, complex128, float16, bfloat16 (`AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND2(kHalf, kBFloat16)`)

Backend for:

PR in progress: Sparse CSR CUDA: add `addmv_out` #61407

cusparseSpSM (sparse matrix - dense matrix triangular solver, dense result) (@IvanYashchuk)
cusparseSpSV (sparse matrix - dense vector triangular solver, dense result)
Only deterministic algorithm is available.
Both 32-bit and 64-bit indices are supported.
Dtypes: float32, float64, complex64, complex128 (`AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES`)

Backend for:

PR in progress: Sparse CSR CUDA: add `triangular_solve_out` #61858

cusparse<t>csrgeam2 (sparse matrix - sparse matrix addition, sparse result) (@IvanYashchuk)
Only 32-bit indices are supported.
Dtypes: float32, float64, complex64, complex128 (`AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES`)

Backend for:

PR in progress: Sparse CSR CUDA: add `torch.add` with all inputs sparse #63948

cusparseSDDMM (sparse mask, dense matrix - dense matrix multiplication, sparse result)
Only a deterministic algorithm is available.
Both 32-bit and 64-bit indices are supported.
Dtypes: float32, float64, complex64, complex128 (`AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES`)

Backend for:
PR: Sparse CSR CUDA: Add torch.sparse.sampled_addmm #68007
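The semantics of SDDMM can be illustrated with a small reference helper. This dense implementation is purely illustrative (not the actual CUDA kernel, and the function name is made up): it computes the dense-dense product only at positions where the sparse mask is nonzero.

```python
import torch

def sddmm_reference(mask_csr, a, b, alpha=1.0, beta=1.0):
    # Reference semantics of cusparseSDDMM / torch.sparse.sampled_addmm:
    # alpha * (a @ b) restricted to the mask's sparsity pattern, plus
    # beta * mask. Returns a sparse CSR tensor like the real op.
    mask = mask_csr.to_dense()
    spy = (mask != 0).to(a.dtype)          # sparsity pattern of the mask
    return (alpha * (a @ b) * spy + beta * mask).to_sparse_csr()

mask = torch.eye(2).to_sparse_csr()
a = torch.randn(2, 3)
b = torch.randn(3, 2)
out = sddmm_reference(mask, a, b)          # sparse CSR result
```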
All of the above should support batched CSR input.
Note that tensors with the CSR layout on a CUDA device use the `SparseCsrCUDA` dispatch key in the `native_functions.yaml` file.

Possible future work
There is a new library for structured sparse-dense matrix multiplication, cuSPARSELt; it might have better performance than cuSPARSE for some sparsity patterns (#62153).
cc @ngimel @nikitaved @pearu @cpuhrsch @IvanYashchuk