
Sparse CSR CUDA: add addmv_out #61407

Closed
wants to merge 32 commits into from

Conversation

@IvanYashchuk (Collaborator) commented Jul 8, 2021

This PR adds `addmv_out_sparse_csr_cuda`. The operation computes a
matrix-vector product. Since `structured_delegate` is used, we only need to
implement the out variant; the in-place and normal variants are
autogenerated.
Working on this PR revealed that float16 (and probably bfloat16) inputs
do not work correctly in cuSPARSE, so for this case `addmm` is used
together with squeezes and unsqueezes.

[ghstack-poisoned]
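The fallback described above can be sketched in eager PyTorch. This is a hedged illustration of the idea only, not the PR's C++ implementation: a matrix-vector product is routed through `addmm` by viewing the vector as an n×1 matrix and squeezing the result back to 1-D.

```python
import torch

# Illustrative dense eager sketch of the float16/bfloat16 fallback:
# addmv(y, A, x) == squeeze(addmm(unsqueeze(y), A, unsqueeze(x)))
A = torch.randn(3, 4)
x = torch.randn(4)
y = torch.randn(3)
alpha, beta = 1.0, 0.5

addmv_result = torch.addmv(y, A, x, beta=beta, alpha=alpha)
# Same computation routed through addmm with unsqueeze/squeeze:
addmm_result = torch.addmm(
    y.unsqueeze(-1), A, x.unsqueeze(-1), beta=beta, alpha=alpha
).squeeze(-1)
assert torch.allclose(addmv_result, addmm_result)
```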
@facebook-github-bot (Contributor) commented Jul 8, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit e78419a (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Jul 8, 2021
@IvanYashchuk IvanYashchuk added labels module: sparse (Related to torch.sparse) and module: cuda (Related to torch.cuda, and CUDA support in general) Jul 8, 2021
IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Jul 9, 2021
IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Jul 9, 2021
IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Jul 12, 2021
IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Jul 26, 2021
IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Oct 11, 2021
@IvanYashchuk (Collaborator, Author):
@ngimel, could you please review and help merge this stack starting from this PR?

@ngimel (Collaborator) commented Oct 11, 2021

@IvanYashchuk what are the bugs with bfloat16/float16 mv? cc @xwang233 to follow up with cusparse

@ngimel (Collaborator) left a comment:

This looks good, I've left minor comments. Do you know if there's an advantage to using cusparseSpMV compared to cusparseSpMM (that is used anyway for bfloat16/half)?


// cuSPARSE doesn't support non-contiguous vectors
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(input.is_contiguous());
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(input.is_non_overlapping_and_dense());
Collaborator:

this is guaranteed to be true if is_contiguous is true?

Collaborator Author:

That's right, I'll remove it. Good that it's only a debug-build check 🙂

pytorch/c10/core/TensorImpl.h, lines 2299-2300 at 5f15186:

is_non_overlapping_and_dense_ =
    is_contiguous_ || compute_non_overlapping_and_dense();

@@ -39,7 +39,13 @@ class CuSparseDescriptor {
class TORCH_CUDA_CPP_API CuSparseDnMatDescriptor
: public CuSparseDescriptor<cusparseDnMatDescr, &cusparseDestroyDnMat> {
public:
CuSparseDnMatDescriptor(const Tensor& input);
explicit CuSparseDnMatDescriptor(const Tensor& input);
Collaborator:

nice!

@@ -17,6 +17,10 @@ TORCH_META_FUNC(addmv)(const Tensor &self, const Tensor &mat, const Tensor &vec,
"size mismatch, got ", self.size(0), ", ", mat.size(0), "x", mat.size(1), ",", vec.size(0));
auto names = at::namedinference::propagate_names_for_addmv(mat, vec, self);
set_output(0, IntArrayRef(mat.sizes().data(), 1), {}, mat.options(), names);
auto result = maybe_get_output(0);
Collaborator:

these lines were removed in #65686, is there a conflict?

Collaborator Author:

Yes, I resolved the conflict incorrectly. Will fix that.

TORCH_CHECK(mat.dim() == 2, "addmv: Expected mat to be 2-D");
TORCH_CHECK(vec.dim() == 1, "addmv: Expected vec to be 1-D");

TensorArg args[]{{result, "out", 0}, {self, "self", 1}, {mat, "mat", 2}, {vec, "vec", 3}};
Collaborator:

Do we still need TensorArgs? It's a perf penalty for the very small convenience of using checkAllSameGPU. #62653 is landing soon, which will enable these checks conveniently on the Tensors.
Also, out of curiosity, how do get_device and is_cuda in checkAllSameGPU work for a sparse mat?

Collaborator Author:

Alright, we don't need this check here at all because it's already present in the generated code.
SparseCsrCUDA is part of is_cuda_dispatch_key, and the device check is generated using

device_check = RegisterDispatchKey.gen_device_check(f.device_check, list(device_check_args), name)

Example of generated code for addmv:

at::Tensor & wrapper_out_addmv_out_out(const at::Tensor & self, const at::Tensor & mat, const at::Tensor & vec, const at::Scalar & beta, const at::Scalar & alpha, at::Tensor & out) {
  c10::optional<Device> common_device = nullopt;
(void)common_device; // Suppress unused variable warning

  c10::impl::check_and_update_common_device(common_device, out, "wrapper_out_addmv_out_out", "out");
  c10::impl::check_and_update_common_device(common_device, self, "wrapper_out_addmv_out_out", "self");
  c10::impl::check_and_update_common_device(common_device, mat, "wrapper_out_addmv_out_out", "mat");
  c10::impl::check_and_update_common_device(common_device, vec, "wrapper_out_addmv_out_out", "vec");

  const OptionalDeviceGuard device_guard(device_of(self));
  return at::native::addmv_out_sparse_csr_cuda(self, mat, vec, beta, alpha, out);
}

I was thinking that device checks and guards are not generated for sparse because of #59058, but the checks are omitted only for SparseCPU and dense CUDA.

is_cuda is TensorImpl's method, and it's just specialized for the case of key_set_ containing DispatchKey::SparseCsrCUDA:

bool is_cuda() const {
  // NB: This method is not virtual and avoid dispatches for performance
  // reasons.
  return key_set_.has(DispatchKey::CUDA) ||
      key_set_.has(DispatchKey::SparseCUDA) ||
      key_set_.has(DispatchKey::SparseCsrCUDA) ||
      key_set_.has(DispatchKey::QuantizedCUDA);
}
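The key-set membership test quoted above can be mimicked with a plain Python set standing in for `c10::DispatchKeySet` (an illustrative sketch, not PyTorch code):

```python
# Keys that make a tensor report is_cuda() == true, per the C++ above.
CUDA_BACKEND_KEYS = {"CUDA", "SparseCUDA", "SparseCsrCUDA", "QuantizedCUDA"}

def is_cuda(key_set):
    # Mirrors the chain of key_set_.has(...) calls: any CUDA-flavored
    # backend key makes the tensor count as a CUDA tensor.
    return bool(key_set & CUDA_BACKEND_KEYS)

assert is_cuda({"SparseCsrCUDA"})   # sparse CSR CUDA tensors are CUDA tensors
assert not is_cuda({"SparseCsrCPU"})
```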

Collaborator Author:

addmm_out_cuda_impl (dense CUDA implementation) has this "sameGPU" check and probably it shouldn't be there.

TensorArg args[]{{result, "out", 0}, {self, "self", 1}, {mat1, "mat1", 2}, {mat2, "mat2", 3}};
checkAllSameGPU(__func__, args);

@skipCUDAIfNoCusparseGeneric
@dtypes(*torch.testing.floating_types())
@dtypesIfCUDA(*get_all_complex_dtypes(),
              *get_all_fp_dtypes(include_half=SM53OrLater, include_bfloat16=SM80OrLater))
Collaborator:

Is SM80OrLater the correct guard for bfloat16? For regular addmm, bfloat16 is supported (with perf equivalent to fp32) on earlier architectures.

Collaborator Author:

Unfortunately, cuSPARSE raises CUSPARSE_STATUS_ARCH_MISMATCH for earlier architectures.
From the documentation:

Unsupported data types and Compute Capability (CC):
  • __half on GPUs with CC < 53 (e.g. Kepler)
  • __nv_bfloat16 on GPUs with CC < 80 (e.g. Kepler, Maxwell, Pascal, Volta, Turing)
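Those documented limits translate into a simple guard. A sketch with a hypothetical helper name, mirroring what the SM53OrLater/SM80OrLater test decorators encode:

```python
def cusparse_supports(dtype: str, cc: int) -> bool:
    # cc is 10*major + minor, e.g. 53 for sm_53, 75 for Turing, 80 for Ampere.
    if dtype == "float16":
        return cc >= 53   # __half needs CC >= 53
    if dtype == "bfloat16":
        return cc >= 80   # __nv_bfloat16 needs CC >= 80
    return True           # float32/float64/complex are unrestricted here

assert cusparse_supports("float16", 53)
assert not cusparse_supports("bfloat16", 75)  # e.g. Turing
assert cusparse_supports("bfloat16", 80)
```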

@dtypes(*torch.testing.floating_types())
@dtypesIfCUDA(*get_all_complex_dtypes(),
              *get_all_fp_dtypes(include_half=SM53OrLater, include_bfloat16=SM80OrLater))
@precisionOverride({torch.bfloat16: 1e-2, torch.float16: 1e-2})
Collaborator:

hm, 1e-2 seems high for float16?

Collaborator Author:

It is. It seems that cuSPARSE uses a different accumulation strategy, or something else differs, leading to less accurate results than cuBLAS computes.
I'll verify the tolerances again.

Collaborator Author:

Unfortunately, 1e-2 is required for the float16 tests to pass.
Running python -m pytest test/test_sparse_csr.py -k "test_csr_matvec" -vvv fails with

Tensors failed to compare as equal! With rtol=0.001 and atol=0.001, found 5 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.0078125 (4.2578125 vs. 4.25), which occurred at index 69.

Interestingly, running the specific test python -m pytest test/test_sparse_csr.py -k "test_csr_matvec_cuda_float16" -vvv to generate a different input with the same size gives exactly the same greatest difference of 0.0078125!

Tensors failed to compare as equal! With rtol=0.001 and atol=0.001, found 4 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.0078125 (-3.404296875 vs. -3.412109375), which occurred at index 92.

Tested on CUDA 11.4.2 and a Turing card.
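The recurring 0.0078125 is no coincidence: it equals 2**-7, which is only a handful of float16 ulps at these magnitudes. A quick pure-Python check (illustrative, not part of the PR):

```python
import math

def float16_ulp(x: float) -> float:
    # float16 has 10 explicit mantissa bits, so for a normal value the gap
    # between neighbouring representable numbers is 2**(exponent - 10).
    e = math.floor(math.log2(abs(x)))
    return 2.0 ** (e - 10)

# The reported greatest differences amount to a few ulps:
assert 0.0078125 == 2 ** -7
assert 0.0078125 / float16_ulp(4.25) == 2.0   # 2 ulps at ~4.25
assert 0.0078125 / float16_ulp(3.41) == 4.0   # 4 ulps at ~3.41
```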

Collaborator Author:

Using SpMV instead of SpMM makes the test pass without precision overrides for float16. So I'll remove it here.


#include <c10/util/MaybeOwned.h>

namespace at {
Collaborator:

can you also move addmm_out_sparse_csr_dense_cuda from SparseCsrTensorMath.cu here? That would be a logical place.

Collaborator Author:

Moved addmm_out_sparse_csr_dense_cuda in a separate PR, #66485.

@ngimel (Collaborator) commented Oct 11, 2021

There are also real unused-variable errors on this PR (probably not caused by it and just requiring a rebase); unfortunately, they are too hard to find in the logs:

/var/lib/jenkins/workspace/torch/csrc/generic/StorageMethods.cpp: In function 'PyObject* THPQUInt2x4Storage_new(PyObject*, PyObject*)':
/var/lib/jenkins/workspace/torch/csrc/generic/StorageMethods.cpp:62:8: error: unused variable 'self' [-Werror=unused-variable]
   auto self = (THPStorage*)_self;
        ^~~~
/var/lib/jenkins/workspace/torch/csrc/generic/StorageMethods.cpp: In function 'PyObject* THPQUInt2x4Storage_newWithFile(PyObject*, PyObject*)':
/var/lib/jenkins/workspace/torch/csrc/generic/StorageMethods.cpp:289:13: error: unused variable 'fd_obj' [-Werror=unused-variable]
   PyObject *fd_obj = PyTuple_GetItem(args, 0);
             ^~~~~~
cc1plus: all warnings being treated as errors

@IvanYashchuk (Collaborator, Author):

There are also real unused variable errors on this PR (probably not caused by it, and just require a rebase), unfortunately they are too hard to find in the logs

I checked for unused variables and there are no warnings for the files that this PR touches.

@IvanYashchuk (Collaborator, Author):

Do you know if there's an advantage to using cusparseSpMV compared to cusparseSpMM (that is used anyway for bfloat16/half)?

I didn't do any performance comparisons, actually. I just expect SpMV to be at least as good as SpMM, or slightly faster, since a separate function was added for it. Let's hope it's not only for dropping a few dimension arguments.
I will do the performance checks later once we finish this stack.

@IvanYashchuk (Collaborator, Author) commented Oct 12, 2021

@IvanYashchuk what are the bugs with bfloat16/float16 mv? cc @xwang233 to follow up with cusparse

I tried to compile a standalone cpp file with bfloat16/float16 SpMV, and it works correctly, so the problem is somewhere in my code in this PR. For some reason, for small sizes the result is all zeros, and for larger sizes parts of the result are zeros:

In [1]: import torch

In [2]: dtype = torch.float16

In [3]: a = torch.tensor([[1, 0, 2, 3], [0, 4, 0, 0], [5, 0, 6, 7], [0, 8, 0, 9]], dtype=dtype, device='cuda')

In [4]: b = torch.tensor([1, 2, 3, 4], dtype=dtype, device='cuda')

In [5]: aa = a.to_sparse_csr()

In [6]: torch.mv(aa, b)
Out[6]: tensor([0., 0., 0., 0.], device='cuda:0', dtype=torch.float16)
# expected 19.0 8.0 51.0 52.0

UPD: found the problem. alpha and beta must be of the computeType (so float in this case); if they are __half, the result is wrong.
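For reference, the expected values in the session above follow from a plain CSR matrix-vector product. A minimal pure-Python sketch (hypothetical helper, not the PR's code) reproduces them:

```python
def csr_mv(crow_indices, col_indices, values, x):
    # Reference CSR SpMV: row i's nonzeros live in the half-open slice
    # values[crow_indices[i]:crow_indices[i+1]].
    y = []
    for row in range(len(crow_indices) - 1):
        acc = 0.0
        for k in range(crow_indices[row], crow_indices[row + 1]):
            acc += values[k] * x[col_indices[k]]
        y.append(acc)
    return y

# CSR form of [[1, 0, 2, 3], [0, 4, 0, 0], [5, 0, 6, 7], [0, 8, 0, 9]]:
crow = [0, 3, 4, 7, 9]
cols = [0, 2, 3, 1, 0, 2, 3, 1, 3]
vals = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(csr_mv(crow, cols, vals, [1, 2, 3, 4]))  # [19.0, 8.0, 51.0, 52.0]
```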

IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Oct 12, 2021
@IvanYashchuk (Collaborator, Author):

I updated the pull request:

  • removed device checks, it's code generated anyway
  • fixed float16/bfloat16 case to use SpMV instead of SpMM
  • removed precision overrides in the test because SpMV for float16/bfloat16 is more accurate than SpMM path so these overrides are not needed anymore

@ngimel (Collaborator) commented Oct 12, 2021

Unused var errors are still generated, maybe a rebase is needed?

/var/lib/jenkins/workspace/torch/csrc/generic/StorageMethods.cpp: In function 'PyObject* THPQUInt2x4Storage_elementSize(PyObject*, PyObject*)':
/var/lib/jenkins/workspace/torch/csrc/generic/StorageMethods.cpp:54:8: error: unused variable 'self' [-Werror=unused-variable]
   auto self = (THPStorage*)_self;
        ^~~~
/var/lib/jenkins/workspace/torch/csrc/generic/StorageMethods.cpp: In function 'PyObject* THPQUInt2x4Storage_new(PyObject*, PyObject*)':
/var/lib/jenkins/workspace/torch/csrc/generic/StorageMethods.cpp:62:8: error: unused variable 'self' [-Werror=unused-variable]
   auto self = (THPStorage*)_self;
        ^~~~
/var/lib/jenkins/workspace/torch/csrc/generic/StorageMethods.cpp: In function 'PyObject* THPQUInt2x4Storage_newWithFile(PyObject*, PyObject*)':
/var/lib/jenkins/workspace/torch/csrc/generic/StorageMethods.cpp:289:13: error: unused variable 'fd_obj' [-Werror=unused-variable]
   PyObject *fd_obj = PyTuple_GetItem(args, 0);
             ^~~~~~
cc1plus: all warnings being treated as errors

IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Oct 12, 2021
Sparse tensors automation moved this from Review in progress to Reviewer approved Oct 12, 2021
@ngimel (Collaborator) commented Oct 12, 2021

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Sparse tensors automation moved this from Reviewer approved to Done Oct 13, 2021
@facebook-github-bot facebook-github-bot deleted the gh/ivanyashchuk/27/head branch October 16, 2021 14:30
wconstab pushed a commit that referenced this pull request Oct 20, 2021
Summary:
Pull Request resolved: #61407


Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31584499

Pulled By: ngimel

fbshipit-source-id: 4c507791471ada88969116b88eeaaba7a7536431
Labels
cla signed module: cuda Related to torch.cuda, and CUDA support in general module: sparse Related to torch.sparse open source
6 participants