Sparse CSR CUDA: add addmv_out
#61407
Conversation
This PR adds `addmv_out_sparse_csr_cuda`. The operation computes matrix-vector multiplication. Since structured_delegate is used, we only need to implement the out variant; the in-place and normal variants are autogenerated. Working on this PR revealed that float16 (and probably bfloat16) inputs do not work correctly in cuSPARSE, so for this case `addmm` is used with squeezes and unsqueezes. [ghstack-poisoned]
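For readers unfamiliar with the operation: `addmv` computes `out = beta * self + alpha * (mat @ vec)` where `mat` is sparse CSR. A minimal pure-Python sketch of the CSR mat-vec product at the heart of it (illustrative only — the PR dispatches to cuSPARSE, not a loop like this):

```python
def csr_mv(crow_indices, col_indices, values, vec):
    """Multiply a CSR matrix by a dense vector.

    crow_indices[i]:crow_indices[i+1] delimits the nonzeros of row i;
    col_indices[k] and values[k] give their column positions and values.
    """
    n_rows = len(crow_indices) - 1
    out = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(crow_indices[i], crow_indices[i + 1]):
            out[i] += values[k] * vec[col_indices[k]]
    return out

# The matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR form:
crow = [0, 2, 3, 5]
cols = [0, 2, 1, 0, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
print(csr_mv(crow, cols, vals, [1.0, 2.0, 3.0]))  # → [7.0, 6.0, 19.0]
```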
💊 CI failures summary and remediations — as of commit e78419a (Dr. CI): 💚 Looks good so far! There are no failures yet.
@ngimel, could you please review and help merge this stack starting from this PR?
@IvanYashchuk what are the bugs with bfloat16/float16 mv? cc @xwang233 to follow up with cusparse
This looks good, I've left minor comments. Do you know if there's an advantage to using cusparseSpMV compared to cusparseSpMM (that is used anyway for bfloat16/half)?
// cuSPARSE doesn't support non-contiguous vectors
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(input.is_contiguous());
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(input.is_non_overlapping_and_dense());
Is this guaranteed to be true if `is_contiguous` is true?
That's right, I'll remove it. Good thing it's only for debugging 🙂
Lines 2299 to 2300 in 5f15186:
is_non_overlapping_and_dense_ =
    is_contiguous_ || compute_non_overlapping_and_dense();
@@ -39,7 +39,13 @@ class CuSparseDescriptor {
 class TORCH_CUDA_CPP_API CuSparseDnMatDescriptor
     : public CuSparseDescriptor<cusparseDnMatDescr, &cusparseDestroyDnMat> {
  public:
-  CuSparseDnMatDescriptor(const Tensor& input);
+  explicit CuSparseDnMatDescriptor(const Tensor& input);
nice!
aten/src/ATen/native/Blas.cpp (outdated)
@@ -17,6 +17,10 @@ TORCH_META_FUNC(addmv)(const Tensor &self, const Tensor &mat, const Tensor &vec,
     "size mismatch, got ", self.size(0), ", ", mat.size(0), "x", mat.size(1), ",", vec.size(0));
 auto names = at::namedinference::propagate_names_for_addmv(mat, vec, self);
 set_output(0, IntArrayRef(mat.sizes().data(), 1), {}, mat.options(), names);
+auto result = maybe_get_output(0);
these lines were removed in #65686, is there a conflict?
Yes, I resolved the conflict incorrectly. Will fix that.
TORCH_CHECK(mat.dim() == 2, "addmv: Expected mat to be 2-D");
TORCH_CHECK(vec.dim() == 1, "addmv: Expected vec to be 1-D");

TensorArg args[]{{result, "out", 0}, {self, "self", 1}, {mat, "mat", 2}, {vec, "vec", 3}};
Do we still need TensorArgs? It's a perf penalty for the very small convenience of using checkAllSameGPU. #62653 is landing soon, which will enable these checks conveniently on the Tensors.
Also, out of curiosity, how do get_device and is_cuda in checkAllSameGPU work for a sparse mat?
Alright, we don't need this check here at all, because it's already there in the generated code. SparseCsrCUDA is part of is_cuda_dispatch_key, and the device check is generated using:
device_check = RegisterDispatchKey.gen_device_check(f.device_check, list(device_check_args), name)
Example of generated code for addmv:
at::Tensor & wrapper_out_addmv_out_out(const at::Tensor & self, const at::Tensor & mat, const at::Tensor & vec, const at::Scalar & beta, const at::Scalar & alpha, at::Tensor & out) {
c10::optional<Device> common_device = nullopt;
(void)common_device; // Suppress unused variable warning
c10::impl::check_and_update_common_device(common_device, out, "wrapper_out_addmv_out_out", "out");
c10::impl::check_and_update_common_device(common_device, self, "wrapper_out_addmv_out_out", "self");
c10::impl::check_and_update_common_device(common_device, mat, "wrapper_out_addmv_out_out", "mat");
c10::impl::check_and_update_common_device(common_device, vec, "wrapper_out_addmv_out_out", "vec");
const OptionalDeviceGuard device_guard(device_of(self));
return at::native::addmv_out_sparse_csr_cuda(self, mat, vec, beta, alpha, out);
}
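The pattern in the generated wrapper above can be sketched in pure Python (a simplified illustration — not the actual c10::impl implementation):

```python
def check_and_update_common_device(common_device, device, method_name, arg_name):
    """Mimic the generated guard: the first tensor argument fixes the
    common device; every subsequent argument must be on that same device,
    otherwise we raise."""
    if common_device is None:
        return device
    if device != common_device:
        raise RuntimeError(
            f"{method_name}: expected '{arg_name}' on {common_device}, "
            f"got {device}")
    return common_device

# All four arguments of addmv_out on the same device passes the check:
common = None
for arg, dev in [("out", "cuda:0"), ("self", "cuda:0"),
                 ("mat", "cuda:0"), ("vec", "cuda:0")]:
    common = check_and_update_common_device(common, dev, "addmv_out", arg)
print(common)  # → cuda:0
```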
I was thinking that device checks and guards are not generated for sparse because of #59058, but the checks are skipped only for SparseCPU + dense CUDA.
is_cuda is a TensorImpl method, and it's simply specialized for the case of key_set_ containing DispatchKey::SparseCsrCUDA:
Lines 853 to 860 in 5f15186:
bool is_cuda() const {
  // NB: This method is not virtual and avoid dispatches for performance
  // reasons.
  return key_set_.has(DispatchKey::CUDA) ||
      key_set_.has(DispatchKey::SparseCUDA) ||
      key_set_.has(DispatchKey::SparseCsrCUDA) ||
      key_set_.has(DispatchKey::QuantizedCUDA);
}
addmm_out_cuda_impl (the dense CUDA implementation) has this "sameGPU" check, and probably it shouldn't be there.
pytorch/aten/src/ATen/native/cuda/Blas.cpp, Lines 99 to 100 in 5f15186:
TensorArg args[]{{result, "out", 0}, {self, "self", 1}, {mat1, "mat1", 2}, {mat2, "mat2", 3}};
checkAllSameGPU(__func__, args);
@skipCUDAIfNoCusparseGeneric
@dtypes(*torch.testing.floating_types())
@dtypesIfCUDA(*get_all_complex_dtypes(),
              *get_all_fp_dtypes(include_half=SM53OrLater, include_bfloat16=SM80OrLater))
Is SM80OrLater the correct guard for bfloat16? For regular addmm, bfloat16 is supported (with perf equivalent to fp32) on earlier architectures.
Unfortunately, cuSPARSE raises CUSPARSE_STATUS_ARCH_MISMATCH for earlier architectures.
From the documentation:
Unsupported data types and Compute Capability (CC):
__half on GPUs with CC < 53 (e.g. Kepler)
__nv_bfloat16 on GPUs with CC < 80 (e.g. Kepler, Maxwell, Pascal, Volta, Turing)
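Those documented constraints can be expressed as a tiny capability gate (a hedged sketch with a hypothetical helper name — the test suite uses the SM53OrLater / SM80OrLater flags instead):

```python
def cusparse_dtype_supported(dtype, compute_capability):
    """Return whether cuSPARSE generic SpMV/SpMM supports `dtype` on a GPU
    with the given (major, minor) compute capability, per the quoted docs."""
    if dtype == "float16":
        return compute_capability >= (5, 3)   # __half needs CC >= 5.3
    if dtype == "bfloat16":
        return compute_capability >= (8, 0)   # __nv_bfloat16 needs CC >= 8.0
    return True  # other float/complex dtypes are not gated this way

print(cusparse_dtype_supported("float16", (7, 5)))   # Turing → True
print(cusparse_dtype_supported("bfloat16", (7, 5)))  # Turing → False
```

Tuple comparison handles minor versions correctly here, e.g. `(5, 3) <= (6, 1)` for Pascal.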
test/test_sparse_csr.py (outdated)
@dtypes(*torch.testing.floating_types())
@dtypesIfCUDA(*get_all_complex_dtypes(),
              *get_all_fp_dtypes(include_half=SM53OrLater, include_bfloat16=SM80OrLater))
@precisionOverride({torch.bfloat16: 1e-2, torch.float16: 1e-2})
hm, 1e-2 seems high for float16?
It is. It seems that cuSPARSE uses a different accumulation strategy, or something else differs, leading to less accurate results than cuBLAS computes.
I'll verify the tolerances again.
Unfortunately, 1e-2 is required for the float16 tests to pass.
Running python -m pytest test/test_sparse_csr.py -k "test_csr_matvec" -vvv fails with:
Tensors failed to compare as equal! With rtol=0.001 and atol=0.001, found 5 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.0078125 (4.2578125 vs. 4.25), which occurred at index 69.
Interestingly, running the specific test python -m pytest test/test_sparse_csr.py -k "test_csr_matvec_cuda_float16" -vvv to generate a different input of the same size gives exactly the same greatest difference of 0.0078125!
Tensors failed to compare as equal! With rtol=0.001 and atol=0.001, found 4 element(s) (out of 100) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.0078125 (-3.404296875 vs. -3.412109375), which occurred at index 92.
Tested on CUDA 11.4.2 and a Turing card.
Using SpMV instead of SpMM makes the test pass without precision overrides for float16. So I'll remove it here.
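For reference, the half-precision fallback described in the PR summary reshapes the vector into an n×1 matrix, runs the mat-mat kernel, and squeezes the result back. A shape-level sketch in pure Python (illustrative only; the real code calls addmm on CUDA tensors):

```python
def mv_via_mm(mat, vec):
    """Emulate mat @ vec via a mat-mat product: unsqueeze vec to n x 1,
    multiply, then squeeze the result back to 1-D."""
    col = [[v] for v in vec]  # unsqueeze(-1): vector -> n x 1 matrix
    prod = [[sum(m * c for m, (c,) in zip(row, col))] for row in mat]  # mm
    return [row[0] for row in prod]  # squeeze(-1): m x 1 -> vector

print(mv_via_mm([[1, 2], [3, 4]], [5, 6]))  # → [17, 39]
```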
#include <c10/util/MaybeOwned.h>

namespace at {
Can you also move addmm_out_sparse_csr_dense_cuda from SparseCsrTensorMath.cu here? That would be a logical place.
Moved addmm_out_sparse_csr_dense_cuda in a separate PR, #66485.
There are also real unused-variable errors on this PR (probably not caused by it, likely just requiring a rebase); unfortunately, they are too hard to find in the logs.
I checked for unused variables, and there are no warnings for the files that this PR touches.
I didn't actually do any performance comparisons. I just expect SpMV to be at least as good as SpMM, or slightly faster, since they added a separate function. Let's hope it's not only for dropping a few dimension arguments.
I tried to compile a standalone cpp file with bfloat16/float16 SpMV, and it works correctly. So the problem is somewhere in my code in this PR. For some reason, for small sizes the result is all zeros, and for larger sizes some parts of the result are zeros:

In [1]: import torch
In [2]: dtype = torch.float16
In [3]: a = torch.tensor([[1, 0, 2, 3], [0, 4, 0, 0], [5, 0, 6, 7], [0, 8, 0, 9]], dtype=dtype, device='cuda')
In [4]: b = torch.tensor([1, 2, 3, 4], dtype=dtype, device='cuda')
In [5]: aa = a.to_sparse_csr()
In [6]: torch.mv(aa, b)
Out[6]: tensor([0., 0., 0., 0.], device='cuda:0', dtype=torch.float16)
# expected 19.0 8.0 51.0 52.0

UPD: found the problem. alpha and beta must be of type
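The expected values quoted in the session above can be double-checked with a plain dense mat-vec in Python (just a sanity check, independent of CUDA/cuSPARSE):

```python
a = [[1, 0, 2, 3], [0, 4, 0, 0], [5, 0, 6, 7], [0, 8, 0, 9]]
b = [1, 2, 3, 4]
# Dense reference for torch.mv(aa, b): dot product of each row with b.
result = [sum(aij * bj for aij, bj in zip(row, b)) for row in a]
print(result)  # → [19, 8, 51, 52]
```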
I updated the pull request.
Unused-variable errors are still generated; maybe a rebase is needed?
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Pull Request resolved: #61407

This PR adds `addmv_out_sparse_csr_cuda`. The operation computes matrix-vector multiplication. Since structured_delegate is used, we only need to implement the out variant; the in-place and normal variants are autogenerated. Working on this PR revealed that float16 (and probably bfloat16) inputs do not work correctly in cuSPARSE, so for this case `addmm` is used with squeezes and unsqueezes.

cc nikitaved pearu cpuhrsch IvanYashchuk ngimel

Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D31584499
Pulled By: ngimel
fbshipit-source-id: 4c507791471ada88969116b88eeaaba7a7536431
Stack from ghstack:

- triangular_solve_out #62180
- triangular_solve_out #61858
- torch.addmm #65606
- torch.add with all inputs sparse #64391
- torch.add with all inputs sparse #63948
- torch.addmm with all inputs sparse #63511
- addmv_out #61536
- addmm and mm #66485
- addmv_out #61407 (this PR)

This PR adds addmv_out_sparse_csr_cuda. The operation is used to compute matrix-vector multiplication. Since structured_delegate is used, we only need to implement the out variant; the in-place and normal variants are autogenerated.

Working on this PR revealed that float16 (and probably bfloat16) inputs do not work correctly in cuSPARSE, therefore for this case addmm is used with squeezes and unsqueezes.

cc @nikitaved @pearu @cpuhrsch @IvanYashchuk @ngimel

Differential Revision: D31584499