Sparse CSR CUDA: add addmv_out
#61407
Conversation
This PR adds `addmv_out_sparse_csr_cuda`. The operation computes matrix-vector multiplication. Since structured_delegate is used, we only need to implement the out variant; the in-place and normal variants are autogenerated. Working on this PR revealed that float16 (and probably bfloat16) inputs do not work correctly in cuSPARSE, so for this case `addmm` is used with squeezes and unsqueezes. [ghstack-poisoned]
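For readers unfamiliar with the operation: `addmv` computes `out = beta * self + alpha * (mat @ vec)` where `mat` is sparse CSR. A minimal pure-Python sketch of the CSR mat-vec product at the heart of it (illustrative only — the PR dispatches to cuSPARSE, not a loop like this):

```python
def csr_mv(crow_indices, col_indices, values, vec):
    """Multiply a CSR matrix by a dense vector.

    crow_indices[i]:crow_indices[i+1] delimits the nonzeros of row i;
    col_indices[k] and values[k] give their column positions and values.
    """
    n_rows = len(crow_indices) - 1
    out = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(crow_indices[i], crow_indices[i + 1]):
            out[i] += values[k] * vec[col_indices[k]]
    return out

# The matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR form:
crow = [0, 2, 3, 5]
cols = [0, 2, 1, 0, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
print(csr_mv(crow, cols, vals, [1.0, 2.0, 3.0]))  # → [7.0, 6.0, 19.0]
```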
💊 CI failures summary and remediations — as of commit e78419a (Dr. CI): 💚 Looks good so far! There are no failures yet.
@ngimel, could you please review and help merge this stack starting from this PR?
@IvanYashchuk what are the bugs with bfloat16/float16 mv? cc @xwang233 to follow up with cusparse
This looks good, I've left minor comments. Do you know if there's an advantage to using cusparseSpMV compared to cusparseSpMM (that is used anyway for bfloat16/half)?
// cuSPARSE doesn't support non-contiguous vectors
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(input.is_contiguous());
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(input.is_non_overlapping_and_dense());
Is this guaranteed to be true if `is_contiguous` is true?
That's right, I'll remove it. Good thing it's only for debugging 🙂
Lines 2299 to 2300 in 5f15186:
is_non_overlapping_and_dense_ =
    is_contiguous_ || compute_non_overlapping_and_dense();
@@ -39,7 +39,13 @@ class CuSparseDescriptor {
 class TORCH_CUDA_CPP_API CuSparseDnMatDescriptor
     : public CuSparseDescriptor<cusparseDnMatDescr, &cusparseDestroyDnMat> {
  public:
-  CuSparseDnMatDescriptor(const Tensor& input);
+  explicit CuSparseDnMatDescriptor(const Tensor& input);
nice!
aten/src/ATen/native/Blas.cpp (outdated)
@@ -17,6 +17,10 @@ TORCH_META_FUNC(addmv)(const Tensor &self, const Tensor &mat, const Tensor &vec,
     "size mismatch, got ", self.size(0), ", ", mat.size(0), "x", mat.size(1), ",", vec.size(0));
 auto names = at::namedinference::propagate_names_for_addmv(mat, vec, self);
 set_output(0, IntArrayRef(mat.sizes().data(), 1), {}, mat.options(), names);
+auto result = maybe_get_output(0);
these lines were removed in #65686, is there a conflict?
Yes, I resolved the conflict incorrectly. Will fix that.
TORCH_CHECK(mat.dim() == 2, "addmv: Expected mat to be 2-D");
TORCH_CHECK(vec.dim() == 1, "addmv: Expected vec to be 1-D");

TensorArg args[]{{result, "out", 0}, {self, "self", 1}, {mat, "mat", 2}, {vec, "vec", 3}};
Do we still need TensorArgs? It's a perf penalty for the very small convenience of using checkAllSameGPU. #62653 is landing soon, which will enable these checks conveniently on the Tensors.
Also, out of curiosity, how do get_device and is_cuda in checkAllSameGPU work for a sparse mat?
Alright, we don't need this check here at all, because it's already there in the generated code. SparseCsrCUDA is part of is_cuda_dispatch_key, and the device check is generated using:
device_check = RegisterDispatchKey.gen_device_check(f.device_check, list(device_check_args), name)
Example of generated code for addmv:
at::Tensor & wrapper_out_addmv_out_out(const at::Tensor & self, const at::Tensor & mat, const at::Tensor & vec, const at::Scalar & beta, const at::Scalar & alpha, at::Tensor & out) {
c10::optional<Device> common_device = nullopt;
(void)common_device; // Suppress unused variable warning
c10::impl::check_and_update_common_device(common_device, out, "wrapper_out_addmv_out_out", "out");
c10::impl::check_and_update_common_device(common_device, self, "wrapper_out_addmv_out_out", "self");
c10::impl::check_and_update_common_device(common_device, mat, "wrapper_out_addmv_out_out", "mat");
c10::impl::check_and_update_common_device(common_device, vec, "wrapper_out_addmv_out_out", "vec");
const OptionalDeviceGuard device_guard(device_of(self));
return at::native::addmv_out_sparse_csr_cuda(self, mat, vec, beta, alpha, out);
}
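The pattern in the generated wrapper above can be sketched in pure Python (a simplified illustration — not the actual c10::impl implementation):

```python
def check_and_update_common_device(common_device, device, method_name, arg_name):
    """Mimic the generated guard: the first tensor argument fixes the
    common device; every subsequent argument must be on that same device,
    otherwise we raise."""
    if common_device is None:
        return device
    if device != common_device:
        raise RuntimeError(
            f"{method_name}: expected '{arg_name}' on {common_device}, "
            f"got {device}")
    return common_device

# All four arguments of addmv_out on the same device passes the check:
common = None
for arg, dev in [("out", "cuda:0"), ("self", "cuda:0"),
                 ("mat", "cuda:0"), ("vec", "cuda:0")]:
    common = check_and_update_common_device(common, dev, "addmv_out", arg)
print(common)  # → cuda:0
```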
I was thinking that device checks and guards are not generated for sparse because of #59058, but the checks are skipped only for SparseCPU + dense CUDA.
is_cuda is a TensorImpl method, and it's simply specialized for the case of key_set_ containing DispatchKey::SparseCsrCUDA:
Lines 853 to 860 in 5f15186:
bool is_cuda() const {
  // NB: This method is not virtual and avoid dispatches for performance
  // reasons.
  return key_set_.has(DispatchKey::CUDA) ||
      key_set_.has(DispatchKey::SparseCUDA) ||
      key_set_.has(DispatchKey::SparseCsrCUDA) ||
      key_set_.has(DispatchKey::QuantizedCUDA);
}
addmm_out_cuda_impl (the dense CUDA implementation) has this "sameGPU" check, and probably it shouldn't be there.
pytorch/aten/src/ATen/native/cuda/Blas.cpp, Lines 99 to 100 in 5f15186:
TensorArg args[]{{result, "out", 0}, {self, "self", 1}, {mat1, "mat1", 2}, {mat2, "mat2", 3}};
checkAllSameGPU(__func__, args);
@skipCUDAIfNoCusparseGeneric
@dtypes(*torch.testing.floating_types())
@dtypesIfCUDA(*get_all_complex_dtypes(),
              *get_all_fp_dtypes(include_half=SM53OrLater, include_bfloat16=SM80OrLater))
Is SM80OrLater the correct guard for bfloat16? For regular addmm, bfloat16 is supported (with perf equivalent to fp32) on earlier architectures.
Unfortunately, cuSPARSE raises CUSPARSE_STATUS_ARCH_MISMATCH for earlier architectures.
From the documentation:
Unsupported data types and Compute Capability (CC):
__half on GPUs with CC < 53 (e.g. Kepler)
__nv_bfloat16 on GPUs with CC < 80 (e.g. Kepler, Maxwell, Pascal, Volta, Turing)
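Those documented constraints can be expressed as a tiny capability gate (a hedged sketch with a hypothetical helper name — the test suite uses the SM53OrLater / SM80OrLater flags instead):

```python
def cusparse_dtype_supported(dtype, compute_capability):
    """Return whether cuSPARSE generic SpMV/SpMM supports `dtype` on a GPU
    with the given (major, minor) compute capability, per the quoted docs."""
    if dtype == "float16":
        return compute_capability >= (5, 3)   # __half needs CC >= 5.3
    if dtype == "bfloat16":
        return compute_capability >= (8, 0)   # __nv_bfloat16 needs CC >= 8.0
    return True  # other float/complex dtypes are not gated this way

print(cusparse_dtype_supported("float16", (7, 5)))   # Turing → True
print(cusparse_dtype_supported("bfloat16", (7, 5)))  # Turing → False
```

Tuple comparison handles minor versions correctly here, e.g. `(5, 3) <= (6, 1)` for Pascal.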
test/test_sparse_csr.py (outdated)
@dtypes(*torch.testing.floating_types())
@dtypesIfCUDA(*get_all_complex_dtypes(),
              *get_all_fp_dtypes(include_half=SM53OrLater, include_bfloat16=SM80OrLater))
@precisionOverride({torch.bfloat16: 1e-2, torch.float16: 1e-2})
hm, 1e-2 seems high for float16?
It is. It seems that cuSPARSE uses a different accumulation strategy, or something else differs, leading to less accurate results than cuBLAS computes.
I'll verify the tolerances again.
Unfortunately, 1e-2 is required for the float16 tests to pass.
Running python -m pytest test/test_sparse_csr.py -k "test_csr_matvec" -vvv fails with:
Tensors failed to compare as equal! With rtol=0.001 and atol=0.001, found 5 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.0078125 (4.2578125 vs. 4.25), which occurred at index 69.
Interestingly, running the specific test python -m pytest test/test_sparse_csr.py -k "test_csr_matvec_cuda_float16" -vvv to generate a different input of the same size gives exactly the same greatest difference of 0.0078125!
Tensors failed to compare as equal! With rtol=0.001 and atol=0.001, found 4 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.0078125 (-3.404296875 vs. -3.412109375), which occurred at index 92.
Tested on CUDA 11.4.2 and a Turing card.
Using SpMV instead of SpMM makes the test pass without precision overrides for float16. So I'll remove it here.
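For reference, the half-precision fallback described in the PR summary reshapes the vector into an n×1 matrix, runs the mat-mat kernel, and squeezes the result back. A shape-level sketch in pure Python (illustrative only; the real code calls addmm on CUDA tensors):

```python
def mv_via_mm(mat, vec):
    """Emulate mat @ vec via a mat-mat product: unsqueeze vec to n x 1,
    multiply, then squeeze the result back to 1-D."""
    col = [[v] for v in vec]  # unsqueeze(-1): vector -> n x 1 matrix
    prod = [[sum(m * c for m, (c,) in zip(row, col))] for row in mat]  # mm
    return [row[0] for row in prod]  # squeeze(-1): m x 1 -> vector

print(mv_via_mm([[1, 2], [3, 4]], [5, 6]))  # → [17, 39]
```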
#include <c10/util/MaybeOwned.h>

namespace at {
Can you also move addmm_out_sparse_csr_dense_cuda from SparseCsrTensorMath.cu here? That would be a logical place.
Moved addmm_out_sparse_csr_dense_cuda in a separate PR, #66485.
There are also real unused-variable errors on this PR (probably not caused by it, likely just requiring a rebase); unfortunately, they are too hard to find in the logs.
I checked for unused variables, and there are no warnings for the files that this PR touches.
I didn't actually do any performance comparisons. I just expect SpMV to be at least as good as SpMM, or slightly faster, since they added a separate function. Let's hope it's not only for dropping a few dimension arguments.
I tried to compile a standalone cpp file with bfloat16/float16 SpMV, and it works correctly. So the problem is somewhere in my code in this PR. For some reason, for small sizes the result is all zeros, and for larger sizes some parts of the result are zeros:

In [1]: import torch
In [2]: dtype = torch.float16
In [3]: a = torch.tensor([[1, 0, 2, 3], [0, 4, 0, 0], [5, 0, 6, 7], [0, 8, 0, 9]], dtype=dtype, device='cuda')
In [4]: b = torch.tensor([1, 2, 3, 4], dtype=dtype, device='cuda')
In [5]: aa = a.to_sparse_csr()
In [6]: torch.mv(aa, b)
Out[6]: tensor([0., 0., 0., 0.], device='cuda:0', dtype=torch.float16)
# expected 19.0 8.0 51.0 52.0

UPD: found the problem. alpha and beta must be of type
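The expected values quoted in the session above can be double-checked with a plain dense mat-vec in Python (just a sanity check, independent of CUDA/cuSPARSE):

```python
a = [[1, 0, 2, 3], [0, 4, 0, 0], [5, 0, 6, 7], [0, 8, 0, 9]]
b = [1, 2, 3, 4]
# Dense reference for torch.mv(aa, b): dot product of each row with b.
result = [sum(aij * bj for aij, bj in zip(row, b)) for row in a]
print(result)  # → [19, 8, 51, 52]
```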
I updated the pull request.
Unused-variable errors are still generated; maybe a rebase is needed?
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Pull Request resolved: #61407

This PR adds `addmv_out_sparse_csr_cuda`. The operation computes matrix-vector multiplication. Since structured_delegate is used, we only need to implement the out variant; the in-place and normal variants are autogenerated. Working on this PR revealed that float16 (and probably bfloat16) inputs do not work correctly in cuSPARSE, so for this case `addmm` is used with squeezes and unsqueezes.

cc nikitaved pearu cpuhrsch IvanYashchuk ngimel

Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D31584499
Pulled By: ngimel
fbshipit-source-id: 4c507791471ada88969116b88eeaaba7a7536431
Stack from ghstack:

- triangular_solve_out #62180
- triangular_solve_out #61858
- torch.addmm #65606
- torch.add with all inputs sparse #64391
- torch.add with all inputs sparse #63948
- torch.addmm with all inputs sparse #63511
- addmv_out #61536
- addmm and mm #66485
- addmv_out #61407 (this PR)

This PR adds addmv_out_sparse_csr_cuda. The operation is used to compute matrix-vector multiplication. Since structured_delegate is used, we only need to implement the out variant; the in-place and normal variants are autogenerated.

Working on this PR revealed that float16 (and probably bfloat16) inputs do not work correctly in cuSPARSE, therefore for this case addmm is used with squeezes and unsqueezes.

cc @nikitaved @pearu @cpuhrsch @IvanYashchuk @ngimel

Differential Revision: D31584499