[WIP] Batched sparse QR factorizations and solves with cusolver #1716
Conversation
```c
THCIndexTensor *rowIndices = THCIndexTensor_(newSelect)(state, Ai, 0, 0);
THCIndexTensor *colIndices = THCIndexTensor_(newSelect)(state, Ai, 0, 1);
THCudaIntTensor *csr = THCSTensor_(toCSR)(state, rowIndices, m, nnz);
THCudaIntTensor *colIndicesInt = THCudaIntTensor_newWithSize1d(state, colIndices->size[0]);
```
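For context, `toCSR` compresses the per-entry row indices of the COO layout into a row-pointer array of length `m + 1`. A minimal pure-Python sketch of that conversion, assuming 0-based, row-sorted indices (the helper name is illustrative, not part of this PR):

```python
def coo_rows_to_csr(row_indices, m):
    """Turn sorted COO row indices into a CSR row-pointer array."""
    rowptr = [0] * (m + 1)
    for r in row_indices:      # count entries per row
        rowptr[r + 1] += 1
    for i in range(m):         # prefix-sum counts into pointers
        rowptr[i + 1] += rowptr[i]
    return rowptr
```

For example, `coo_rows_to_csr([0, 0, 1, 3], 4)` gives `[0, 2, 3, 3, 4]`: row `i`'s entries occupy positions `rowptr[i]` to `rowptr[i + 1]`.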
```c
THAssert(cusparse_status == CUSPARSE_STATUS_SUCCESS);

cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ZERO);
```
This is looking nice! I have a remark though:
Hmm, for the CPU version I don't think (though I'm not sure) that any libraries capable of sparse non-batched QR factorizations and solves are linked into PyTorch. It looks like somebody at Intel posted a prototype MKL interface almost a year ago, but it hasn't hit a release yet: https://software.intel.com/en-us/articles/intel-sparse-qr-factorization-prototype-preview-package What about linking in and calling SuiteSparse?
I'm ok with it as long as it's optional.
Hey @bamos! So, if I understand correctly, cuSolver only supports CSR/CSC tensors, but PyTorch's current sparse representation only supports COO at the moment. So, there is nothing to talk about: you simply can't use the existing representation to represent the inputs. Maybe at some point we should add support for CSR/CSC.
@ezyang but he is converting to CSR before calling cuSolver, right?
Hi, is there anything I can do now to help get this through, or do you all still want more time to think about batched sparse matrices with the same sparsity pattern? I'd like to merge this version with GPU support as soon as we've agreed on the best way to format the data, so that people using my qpth library can use PyTorch master for GPU operations. Afterwards, can we file another issue for CPU support?
I think it would still require a CPU implementation, unless @apaszke is ok with having a GPU-only operation for the moment?
This looks good, but I think two things should be changed:

- The signature should be `(sparseA, denseB)` (i.e. use SparseTensors to represent the matrix). The implementation would go to THCS then.
- We should really rethink the naming of these functions, as it's becoming a mess. Right now we have `gesv` and `btrisolve` (which are already two different conventions), and this PR adds `spbqrfactsolve`, which is still different from both of these. I think we should drop the `qrfact` part, as the algorithm used to solve the system is a bit of an implementation detail (the factorization is not returned). Also, I'm wondering if it's not possible to make it part of `torch.gesolve`, so that it works with both dense and sparse systems.
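To illustrate the unified-name idea, here is a hypothetical dispatch sketch: a single solve entry point that routes to a sparse or dense backend based on the matrix representation. All names and types here are illustrative, not real PyTorch API.

```python
class SparseCOO:
    """Stand-in for a sparse COO matrix wrapper (illustrative only)."""
    def __init__(self, indices, values, shape):
        self.indices, self.values, self.shape = indices, values, shape

def gesolve(A, B, dense_solver, sparse_solver):
    # Route to the sparse backend (e.g. a QR-based cuSOLVER path) when A
    # is sparse, otherwise to a dense backend (e.g. an LU-based gesv path).
    # The caller never sees which factorization was used internally.
    if isinstance(A, SparseCOO):
        return sparse_solver(A, B)
    return dense_solver(A, B)
```

The point of the sketch is that the factorization becomes an implementation detail of the chosen backend, which is the argument for dropping `qrfact` from the name.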
```c
                                  &size_qr);
THAssert(cusolver_status == CUSOLVER_STATUS_SUCCESS);

cuda_status = cudaMalloc((void**)&buffer_qr, size_qr);
```
```c
THCIndexTensor *colIndices = THCIndexTensor_(newSelect)(state, Ai, 0, 1);
THCudaIntTensor *csr = THCSTensor_(toCSR)(state, rowIndices, m, nnz);
THCudaIntTensor *colIndicesInt = THCudaIntTensor_newWithSize1d(state, colIndices->size[0]);
THCudaIntTensor_copyCudaLong(state, colIndicesInt, colIndices);
```
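The `copyCudaLong` into `colIndicesInt` narrows PyTorch's 64-bit indices to the 32-bit ones the cuSPARSE/cuSOLVER entry points take. A pure-Python sketch of what a checked narrowing looks like (the helper name is illustrative):

```python
INT32_MAX = 2**31 - 1

def narrow_indices(idx64):
    """Narrow 64-bit indices to int32 range, failing loudly on overflow
    rather than silently truncating."""
    for i in idx64:
        if not (0 <= i <= INT32_MAX):
            raise OverflowError(f"index {i} does not fit in int32")
    return [int(i) for i in idx64]
```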
```c
csrqrInfo_t info;

cusolverSpHandle_t cusolverH;
cusolver_status = cusolverSpCreate(&cusolverH);
```
```c
                  THCTensor_(data)(state, rx_),
                  nBatch, info,
                  buffer_qr);
THAssert(cusolver_status == CUSOLVER_STATUS_SUCCESS);
```
```c
                  buffer_qr);
THAssert(cusolver_status == CUSOLVER_STATUS_SUCCESS);

cusolverSpDestroyCsrqrInfo(info);
```
I'm ok with merging a CPU implementation later.
I'm currently rebasing this. I'm noticing that the batch dimension for A (in the

@apaszke you said this function would be useful for some torch.distributions things. Would those still work with the assumption that the batch dimension is last for sparse tensors like this?
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as stale.
This PR is connected to issue #1178.
This is still a little hacky and a WIP, but I'm sending it in to start a conversation and move this toward something mergeable. I'm currently using it with the sparse solvers I have in qpth and it's working well: https://github.com/locuslab/qpth/blob/master/qpth/solvers/pdipm/spbatch.py
I think that batched sparse tensor operations with the same sparsity pattern will come up frequently as a way to take advantage of GPU parallelism, and I started a discussion a while ago about handling this case: https://discuss.pytorch.org/t/3d-sparse-batch-tensors-with-the-same-sparsity-pattern/2806
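A minimal pure-Python sketch of the layout that discussion proposes: one index list shared across the whole batch, with a separate values array per batch element. A batched sparse mat-vec is shown as the example operation; all names are illustrative, not part of this PR.

```python
def batched_spmv(indices, values_batch, xs, m):
    """Apply each sparse matrix in the batch to its own vector.

    indices:      shared (row, col) pairs, identical for every batch element
    values_batch: one list of nnz values per batch element
    xs:           one input vector per batch element
    m:            number of rows of each matrix
    """
    out = []
    for vals, x in zip(values_batch, xs):
        y = [0.0] * m
        for (r, c), v in zip(indices, vals):
            y[r] += v * x[c]
        out.append(y)
    return out
```

Because the index structure is shared, the per-batch work differs only in the values, which is what makes a single batched factorization with one symbolic analysis attractive on the GPU.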
We could move the broader conversation back to that thread. I could add some transpose operations to my code here and use hybrid sparse tensors as @ezyang mentioned. For simplicity, though, and because I'm not sure what you all prefer, I haven't done that: I'm not using the sparse tensor wrapper and am just passing the appropriate tensors around. What are your opinions on how I should modify this function and its interface?