Conversation

@kurtamohler (Collaborator) commented Feb 17, 2020

Add sparse-dense BMM operation for CUDA and CPU.

Closes #5672
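
For context, here is a minimal usage sketch of the call pattern this adds, written against the C++ (libtorch) API; the shapes are illustrative only and not taken from the PR's tests.

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  // Batch of 10 sparse COO matrices (64 x 64) times a batch of dense matrices (64 x 32).
  auto a = torch::rand({10, 64, 64}).to_sparse();
  auto b = torch::rand({10, 64, 32});
  auto c = torch::bmm(a, b);  // dense result of shape (10, 64, 32)
  std::cout << c.sizes() << std::endl;
  return 0;
}
```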

@dr-ci bot commented Feb 17, 2020

💊 Build failures summary and remediations

As of commit d01b921 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

XLA failure

Job pytorch_xla_linux_xenial_py3_6_clang7_test is failing. Please create an issue with title prefixed by [PT_BREAK] in pytorch/xla and link to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.


This comment was automatically generated by Dr. CI. It has been revised 266 times.

@kurtamohler force-pushed the bmm-sparse-dense-5672 branch 2 times, most recently from 05ef9cc to 06b33ee on February 18, 2020 05:47
@kurtamohler (Collaborator, Author)

The clang-tidy check is failing to install clang-tidy:

+ sudo apt-get install -y clang-tidy-8
Reading package lists...
Building dependency tree...
Reading state information...
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 clang-tidy-8 : Depends: libllvm8 (= 1:8.0.1+svn369350-1~exp1~20200112113617.82) but 1:8.0.1+svn369350-1~exp1~20200114191400.80 is to be installed
                Depends: clang-tools-8 but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
##[error]Process completed with exit code 100.

@kurtamohler changed the title from "Bmm sparse dense 5672" to "Bmm sparse dense" on Feb 18, 2020
@ezyang (Contributor) commented Feb 19, 2020

@pearu @nikitaved How about you guys do the first pass reviewing the algorithm? I can help with more framework review but I'd like you guys to do the bulk of the actual algorithm review.

@yf225 added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) and module: sparse (Related to torch.sparse) labels on Feb 19, 2020
@kurtamohler (Collaborator, Author)

Below are charts of sparse-dense bmm's performance compared to a workaround function that someone posted to the original issue. My performance script is here: https://github.com/kurtamohler/pytorch-perf-test-scripts/blob/master/bmm-sparse-dense/bmm_perf.py

The "input cases" that the x-axis refers to are the input size combinations found in the same order in the table below the charts. "Pre-coalescing" means that before bmm() is called, I'm coalescing the sparse matrix outside the timed loop so that it does not have to be done inside the timed bmm() call. Measuring with and without the pre-coalescing step allows us to see how much of an impact coalescing has on the run time. Note that the way the workaround works, it does not need a coalesced sparse matrix, so it can skip that step. It looks like coalescing increases run time significantly in some cases but not in others.

The builtin methods are almost always faster than the workarounds, and CUDA is almost always faster than CPU.

BMM sparse-dense performance (not pre-coalesced)

BMM sparse-dense performance (pre-coalesced)

Input cases (in every case, nnz = num_matrices × squ_mat_elements × (1 − sparsity)):

| num_matrices | squ_mat_elements | output_size | sparsity | nnz |
|--|--|--|--|--|
| 10 | 100 | 10000 | 0.9 | 100 |
| 10 | 100 | 100000 | 0 | 1000 |
| 10 | 10000 | 1000000 | 0.999 | 100 |
| 10 | 10000 | 10000000 | 0.99 | 1000 |
| 10 | 10000 | 100000000 | 0.9 | 10000 |
| 10 | 10000 | 1000000000 | 0 | 100000 |
| 10 | 1000000 | 10000000000 | 0.999 | 10000 |
| 10 | 1000000 | 100000000000 | 0.99 | 100000 |
| 100 | 100 | 100000 | 0.9 | 1000 |
| 100 | 100 | 1000000 | 0 | 10000 |
| 100 | 10000 | 10000000 | 0.999 | 1000 |
| 100 | 10000 | 100000000 | 0.99 | 10000 |
| 100 | 10000 | 1000000000 | 0.9 | 100000 |
| 100 | 1000000 | 100000000000 | 0.999 | 100000 |
| 1000 | 100 | 1000000 | 0.9 | 10000 |
| 1000 | 100 | 10000000 | 0 | 100000 |
| 1000 | 10000 | 100000000 | 0.999 | 10000 |
| 1000 | 10000 | 1000000000 | 0.99 | 100000 |
| 10000 | 100 | 10000000 | 0.9 | 100000 |
| 10000 | 10000 | 1000000000 | 0.999 | 100000 |

@kurtamohler (Collaborator, Author)

I'm getting some CI errors related to JIT:

Feb 21 00:52:01 ======================================================================
Feb 21 00:52:01 ERROR [0.108s]: test_nested2 (__main__.EagerModePostTrainingQuantTest)
Feb 21 00:52:01 ----------------------------------------------------------------------
Feb 21 00:52:01 Traceback (most recent call last):
Feb 21 00:52:01   File "test_quantization.py", line 195, in test_nested2
Feb 21 00:52:01     checkQuantized(model)
Feb 21 00:52:01   File "test_quantization.py", line 193, in checkQuantized
Feb 21 00:52:01     self.checkScriptable(model, self.calib_data)
Feb 21 00:52:01   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_quantization.py", line 140, in checkScriptable
Feb 21 00:52:01     self._checkScriptable(orig_mod, traced, calib_data, check_save_load)
Feb 21 00:52:01   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_quantization.py", line 144, in _checkScriptable
Feb 21 00:52:01     self._checkModuleCorrectnessAgainstOrig(orig_mod, script_mod, calib_data)
Feb 21 00:52:01   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_quantization.py", line 161, in _checkModuleCorrectnessAgainstOrig
Feb 21 00:52:01     scripted_output = test_mod(inp)
Feb 21 00:52:01   File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
Feb 21 00:52:01     result = self.forward(*input, **kwargs)
Feb 21 00:52:01   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 90, in prof_meth_call
Feb 21 00:52:01     return prof_callable(meth_call, *args, **kwargs)
Feb 21 00:52:01   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 84, in prof_callable
Feb 21 00:52:01     return callable(*args, **kwargs)
Feb 21 00:52:01 RuntimeError: 
Feb 21 00:52:01 Couldn't find an operator for aten::bmm(Tensor self, Tensor mat2) -> Tensor. Do you have to update a set of hardcoded JIT ops?

I'm not really sure what I need to do.

@ezyang (Contributor) commented Mar 4, 2020

@kurtamohler Alright, it's this problem. Can you split bmm into two functions, bmm(Tensor, Tensor) and _bmm(Tensor, Tensor, bool deterministic)? Then have bmm just call _bmm with the deterministic flag set, but define autograd separately on bmm and _bmm. I think that will suffice to make this error go away.

cc'ing @eellison if you have a better protocol. (I am aware that we could also just bash this out by finding the schema string in JIT and updating it, but I kind of don't want to do that here; I feel the JIT doesn't want to see the determinism flag.)
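
For illustration, a rough C++ sketch of the split described above; the function bodies and the bmm_sparse_dense_impl helper are hypothetical stand-ins, not the actual ATen declarations.

```cpp
#include <ATen/ATen.h>
using at::Tensor;

// Hypothetical stand-in for the real sparse-dense kernel.
Tensor bmm_sparse_dense_impl(const Tensor& self, const Tensor& mat2, bool deterministic);

// Internal variant: carries the extra flag and gets its own autograd entry.
//   aten::_bmm(Tensor self, Tensor mat2, bool deterministic=False) -> Tensor
Tensor _bmm(const Tensor& self, const Tensor& mat2, bool deterministic) {
  return bmm_sparse_dense_impl(self, mat2, deterministic);
}

// Public op: the JIT-visible schema stays aten::bmm(Tensor self, Tensor mat2) -> Tensor.
Tensor bmm(const Tensor& self, const Tensor& mat2) {
  return _bmm(self, mat2, /*deterministic=*/false);
}
```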

@eellison (Contributor) commented Mar 4, 2020

@ezyang @kurtamohler is the error in shape analysis because aten::bmm(Tensor self, Tensor mat2) -> Tensor doesn't exist? If that's the case, I would say just update it and hope that we are done deprecating it completely soon (cc @Krovatkin). Otherwise I'm not sure.

@kurtamohler requested a review from apaszke as a code owner on March 5, 2020 20:44
@kurtamohler (Collaborator, Author)

Alright, well, I'm not sure whether the change I just pushed fixes the issue. I haven't figured out how to run the failing test manually, so I'll just let CI run it.

@kurtamohler force-pushed the bmm-sparse-dense-5672 branch from 716b27c to f1aba62 on March 5, 2020 23:31
@kurtamohler (Collaborator, Author) commented Mar 10, 2020

Yesterday I discovered that my CPU implementation has a flaw in its method of searching the 3-D sparse tensor's indices for each 2-D matrix. Apparently, depending on how you create the sparse tensor, the tensor of indices can be laid out in either row-major or column-major order in memory. It looks like the dense_tensor.to_sparse() method gives the opposite index matrix ordering from what you get if you create a sparse matrix directly with something like torch.sparse.Tensor(). My implementation was handling only one of these cases, so the following gave the wrong result:

a = torch.rand([2,2,2]).to_sparse()
b = torch.rand([2,2,2])
a.bmm(b)

Here is an example of how two different sparse tensors' index matrices can have different strides:

>>> b = torch.sparse.FloatTensor(torch.LongTensor([[0,0,1],[0,1,1],[0,1,1]]),torch.FloatTensor([1,2,3]),torch.Size([2,2,2])).coalesce()
>>> b.indices().stride()
(3, 1)
>>> b = b.to_dense().to_sparse()
>>> b.indices().stride()
(1, 3)

I have a working fix to my search function that takes this into account, and I will push it shortly.
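
One way such a fix could look (an assumed approach for illustration; the actual change may instead make the search stride-aware) is to normalize the indices layout before searching, so both construction paths behave identically:

```cpp
#include <ATen/ATen.h>
using at::Tensor;

// Force the (sparse_dim x nnz) indices tensor into row-major layout before the
// per-batch index search. After contiguous(), stride() is (nnz, 1) regardless of
// whether the tensor came from to_sparse() or from a direct COO constructor.
Tensor canonicalize_indices(const Tensor& coalesced_indices) {
  return coalesced_indices.contiguous();
}
```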

@kurtamohler force-pushed the bmm-sparse-dense-5672 branch 2 times, most recently from 035faa4 to 61c1f64 on March 10, 2020 22:24
@ezyang (Contributor) commented Mar 11, 2020

Re the HIP errors (cc @iotamudelta):

22:43:07 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/hip/SparseHIPTensorMath.hip:793:1: error: unknown type name 'hipDataType_t'
22:43:07 hipDataType_t getTensorCudaDataType(Tensor self) {
22:43:07 ^
22:43:07 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/hip/SparseHIPTensorMath.hip:794:3: error: unknown type name 'hipDataType_t'
22:43:07   hipDataType_t cuda_data_type;
22:43:07   ^
22:43:07 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/hip/SparseHIPTensorMath.hip:797:24: error: use of undeclared identifier 'hipR32F'
22:43:07       cuda_data_type = hipR32F;
22:43:07                        ^
22:43:07 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/hip/SparseHIPTensorMath.hip:800:24: error: use of undeclared identifier 'hipR64F'
22:43:07       cuda_data_type = hipR64F;
22:43:07                        ^
22:43:07 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/hip/SparseHIPTensorMath.hip:880:3: error: unknown type name 'cusparseSpMMAlg_t'
22:43:07   cusparseSpMMAlg_t mm_alg = deterministic ? CUSPARSE_COOMM_ALG2 : CUSPARSE_COOMM_ALG1;
22:43:07   ^
22:43:07 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/hip/SparseHIPTensorMath.hip:880:46: error: use of undeclared identifier 'CUSPARSE_COOMM_ALG2'
22:43:07   cusparseSpMMAlg_t mm_alg = deterministic ? CUSPARSE_COOMM_ALG2 : CUSPARSE_COOMM_ALG1;
22:43:07                                              ^
22:43:07 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/hip/SparseHIPTensorMath.hip:880:68: error: use of undeclared identifier 'CUSPARSE_COOMM_ALG1'
22:43:07   cusparseSpMMAlg_t mm_alg = deterministic ? CUSPARSE_COOMM_ALG2 : CUSPARSE_COOMM_ALG1;
22:43:07                                                                    ^
22:43:07 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/hip/SparseHIPTensorMath.hip:899:11: error: unknown type name 'hipDataType_t'
22:43:07           hipDataType_t cuda_data_type = getTensorCudaDataType(mat2_contig);
22:43:07           ^
22:43:07 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/hip/SparseHIPTensorMath.hip:904:11: error: unknown type name 'cusparseSpMatDescr_t'; did you mean 

For now you should just ifdef out the code in the HIP case and say that it's not supported on HIP.

@iotamudelta (Contributor)

It looks like either missing features in ROCm or (more likely) a mis-hipification. Agreed with @ezyang that this can be ifdef'd out for now; we'll have a look at what's going on there and may also ping you if we need some input.

@kurtamohler (Collaborator, Author)

Alright, thanks for letting me know. I'll put in the ifdef.
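
A sketch of the kind of guard being discussed; the wrapper name and the placeholder body are hypothetical and only there so the sketch compiles, but __HIP_PLATFORM_HCC__ and TORCH_CHECK are the real macros.

```cpp
#include <ATen/ATen.h>
using at::Tensor;

// Hypothetical wrapper; the point of the sketch is the preprocessor guard.
Tensor bmm_sparse_dense_cuda(const Tensor& self, const Tensor& mat2) {
#if defined(__HIP_PLATFORM_HCC__)
  // ROCm builds: fail loudly until the hipification issues above are resolved.
  TORCH_CHECK(false, "bmm sparse-dense is not supported on HIP/ROCm");
  return self;  // unreachable
#else
  // ... the cusparseSpMM-based implementation would go here ...
  return at::bmm(self.to_dense(), mat2);  // placeholder only, not the real kernel
#endif
}
```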

@kurtamohler force-pushed the bmm-sparse-dense-5672 branch from 61c1f64 to dbc89bb on March 11, 2020 18:59
@kurtamohler (Collaborator, Author) commented Apr 14, 2020

@peterjc123, I think you're right, _WIN32 should be enough. I think I was using TORCH_INTERNAL_ASSERT incorrectly: I mistakenly thought it was an unconditional version of TORCH_CHECK that only needs one argument (the error message). So changing back to TORCH_CHECK with the condition set to false must be the real reason the error is now thrown under the correct conditions. I think this might be a demonstration of why preprocessor macros should be avoided if possible.
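
For illustration of the mix-up (the Windows guard here is an assumed context; TORCH_CHECK and TORCH_INTERNAL_ASSERT are the real c10 macros): with only a message argument, TORCH_INTERNAL_ASSERT treats the string as its condition, which is always truthy, whereas TORCH_CHECK(false, ...) throws unconditionally.

```cpp
#include <c10/util/Exception.h>

// Hypothetical helper, for illustration only.
void guard_windows_unsupported() {
#ifdef _WIN32
  // Earlier mistake: the string literal becomes the condition (always true),
  // so this assert can never fire.
  // TORCH_INTERNAL_ASSERT("bmm sparse-dense is not supported on Windows");

  // Intended behavior: a false condition makes TORCH_CHECK throw every time.
  TORCH_CHECK(false, "bmm sparse-dense is not supported on Windows");
#endif
}
```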

@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang (Contributor) commented Apr 15, 2020

Drat, it looks like our internal version of cusparse is too old. Is there a way you can add macro ifdefs that appropriately test the version of cusparse before making the API calls?

stderr: caffe2/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(807): warning: statement is unreachable
caffe2/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(898): error: identifier "cusparseSpMMAlg_t" is undefined
caffe2/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(898): error: identifier "CUSPARSE_COOMM_ALG2" is undefined
caffe2/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(898): error: identifier "CUSPARSE_COOMM_ALG1" is undefined
caffe2/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(902): error: identifier "cusparseSpMatDescr_t" is undefined
caffe2/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(902): error: identifier "CUSPARSE_INDEX_32I" is undefined
caffe2/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(902): error: identifier "cusparseCreateCoo" is undefined
caffe2/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(902): error: identifier "cusparseDnMatDescr_t" is undefined
caffe2/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(902): error: identifier "CUSPARSE_ORDER_COL" is undefined
caffe2/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(902): error: identifier "cusparseCreateDnMat" is undefined
caffe2/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(902): error: identifier "cusparseDnMatDescr_t" is undefined
caffe2/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(902): error: identifier "CUSPARSE_ORDER_COL" is undefined
caffe2/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(902): error: identifier "cusparseCreateDnMat" is undefined
caffe2/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(902): error: identifier "cusparseSpMM_bufferSize" is undefined
caffe2/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(902): error: identifier "cusparseSpMM" is undefined
caffe2/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(902): error: identifier "cusparseDestroySpMat" is undefined
caffe2/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(902): error: identifier "cusparseDestroyDnMat" is undefined
caffe2/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(902): error: identifier "cusparseDestroyDnMat" is undefined

@kurtamohler (Collaborator, Author)

Alright, I changed the ifdefs so that an error is thrown if the CUDA version is less than 10.1. I also added a test to make sure the error is thrown correctly, and the other tests are skipped when the CUDA version is below 10.1.
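
A sketch of the version guard being described (an assumed form; the generic cusparseSpMM API first ships with CUDA 10.1, whose CUDART_VERSION value is 10010, and the wrapper name and placeholder body here are hypothetical):

```cpp
#include <cuda_runtime.h>  // defines CUDART_VERSION
#include <ATen/ATen.h>
using at::Tensor;

Tensor bmm_sparse_dense_versioned(const Tensor& self, const Tensor& mat2) {
#if defined(CUDART_VERSION) && CUDART_VERSION >= 10010
  // CUDA 10.1+: the generic cusparseSpMM API is available.
  return at::bmm(self.to_dense(), mat2);  // placeholder only, not the real kernel
#else
  // Older toolkits lack the generic SpMM API, so fail with a clear message;
  // the Python tests can then skip themselves when they detect CUDA < 10.1.
  TORCH_CHECK(false, "bmm(sparse, dense) requires CUDA 10.1 or newer");
  return self;  // unreachable
#endif
}
```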

@kurtamohler (Collaborator, Author)

Looks like the macOS environment doesn't have the torch._C._cuda_getCompiledVersion() function that I was using to decide whether to skip tests. I changed it to use torch.version.cuda instead; hopefully that's available in all the testing environments.

@facebook-github-bot (Contributor) left a comment

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@kurtamohler (Collaborator, Author)

Darn it, the version string comparison module I was using doesn't exist on macOS and Windows:

    from packaging import version
ModuleNotFoundError: No module named 'packaging'

@kurtamohler (Collaborator, Author)

I removed the packaging module import and chose to use this method of comparison instead:

[int(x) for x in torch.version.cuda.split(".")] >= [10, 1]

Hopefully that's robust enough not to break.

@kurtamohler (Collaborator, Author)

Wasn't robust enough. I think this will be.

@facebook-github-bot (Contributor) left a comment

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang (Contributor) commented Apr 16, 2020

This looks like it was sufficient. Unfortunately it looks like we need to merge with master.

@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@kurtamohler (Collaborator, Author)

@ezyang, do you know what caused Facebook Internal to fail?

@ezyang (Contributor) commented Apr 20, 2020

it's fake, I think

@facebook-github-bot (Contributor)

@ezyang merged this pull request in c7cf4c1.

facebook-github-bot pushed a commit that referenced this pull request on Aug 4, 2020:
Summary:
Fixes #42406

### cusparse Xcsrmm2 API:

(#37202)

- new: https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm
- old (deprecated in cuda 11): https://docs.nvidia.com/cuda/archive/10.2/cusparse/index.html#csrmm2

Before:

|cuda ver | windows | linux |
|--|--|--|
| 10.1 | old api | old api  |
| 10.2 | old api | new api |
| 11    | old api (build error claimed in #42406) | new api |

After:

|cuda ver | windows | linux |
|--|--|--|
| 10.1 | old api | old api  |
| 10.2 | old api | **old api** |
| 11    | **new api** | new api |

### cusparse bmm-sparse-dense API

<details><summary>reverted, will be revisited in the future</summary>
(cc kurtamohler #33430)

- new: https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm

Before:

|cuda ver | windows | linux |
|--|--|--|
| 10.1 | not supported | new api  |
| 10.2 | not supported | new api |
| 11    | not supported | new api |

After:

|cuda ver | windows | linux |
|--|--|--|
| 10.1 | not supported | new api  |
| 10.2 | not supported | new api |
| 11    | **new api** | new api |

</details>

Pull Request resolved: #42412

Reviewed By: agolynski

Differential Revision: D22892032

Pulled By: ezyang

fbshipit-source-id: cded614af970f0efdc79c74e18e1d9ea8a46d012

Labels

Merged · module: sparse (Related to torch.sparse) · open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[feature request] sparse x dense bmm

10 participants