[WIP] Batched sparse QR factorizations and solves with cusolver #1716
Conversation
```c
THCIndexTensor *rowIndices = THCIndexTensor_(newSelect)(state, Ai, 0, 0);
THCIndexTensor *colIndices = THCIndexTensor_(newSelect)(state, Ai, 0, 1);
THCudaIntTensor *csr = THCSTensor_(toCSR)(state, rowIndices, m, nnz);
THCudaIntTensor *colIndicesInt = THCudaIntTensor_newWithSize1d(state, colIndices->size[0]);
```
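For context, `toCSR` compresses the per-entry row indices of the COO layout into a row-pointer array of length `m + 1`. A minimal pure-Python sketch of that conversion, assuming 0-based, row-sorted indices (the helper name is illustrative, not part of this PR):

```python
def coo_rows_to_csr(row_indices, m):
    """Turn sorted COO row indices into a CSR row-pointer array."""
    rowptr = [0] * (m + 1)
    for r in row_indices:      # count entries per row
        rowptr[r + 1] += 1
    for i in range(m):         # prefix-sum counts into pointers
        rowptr[i + 1] += rowptr[i]
    return rowptr
```

For example, `coo_rows_to_csr([0, 0, 1, 3], 4)` gives `[0, 2, 3, 3, 4]`: row `i`'s entries occupy positions `rowptr[i]` to `rowptr[i + 1]`.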
```c
THAssert(cusparse_status == CUSPARSE_STATUS_SUCCESS);

cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ZERO);
```
This is looking nice! I have a remark though:
Hmm, for the CPU version I don't think (though I'm not sure) that any libraries capable of sparse non-batched QR factorizations and solves are linked into PyTorch. It looks like somebody at Intel posted a prototype MKL interface almost a year ago, but it hasn't hit a release yet: https://software.intel.com/en-us/articles/intel-sparse-qr-factorization-prototype-preview-package What about linking in and calling SuiteSparse?
I'm ok with it as long as it's optional.
Hey @bamos! So, if I understand correctly, cuSolver only supports CSR/CSC tensors, but PyTorch's current sparse representation only supports COO at the moment. So, there is nothing to talk about: you simply can't use the existing representation to represent the inputs. Maybe at some point we should add support for CSR/CSC.
@ezyang but he is converting to CSR before calling cuSolver, right?
Hi, is there anything I can do now to help get this through, or do you all still want more time to think about batched sparse matrices with the same sparsity pattern? I'd like to merge this version with GPU support as soon as we've agreed on the best way to format the data, so that people using my qpth library can use PyTorch master for GPU operations. Afterwards, can we file another issue for CPU support?
I think it would still require a CPU implementation, unless @apaszke is ok with having a GPU-only operation for the moment?
This looks good, but I think two things should be changed:

- The signature should be `(sparseA, denseB)` (i.e. use SparseTensors to represent the matrix). The implementation would go to THCS then.
- We should really rethink the naming of these functions, as it's becoming a mess. Right now we have `gesv` and `btrisolve` (which are already two different conventions), and this PR adds `spbqrfactsolve`, which is still different from both of these. I think we should drop the `qrfact` part, as the algorithm used to solve the system is a bit of an implementation detail (the factorization is not returned). Also, I'm wondering if it's not possible to make it part of `torch.gesolve`, so that it works with both dense and sparse systems.
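To illustrate the unified-name idea, here is a hypothetical dispatch sketch: a single solve entry point that routes to a sparse or dense backend based on the matrix representation. All names and types here are illustrative, not real PyTorch API.

```python
class SparseCOO:
    """Stand-in for a sparse COO matrix wrapper (illustrative only)."""
    def __init__(self, indices, values, shape):
        self.indices, self.values, self.shape = indices, values, shape

def gesolve(A, B, dense_solver, sparse_solver):
    # Route to the sparse backend (e.g. a QR-based cuSOLVER path) when A
    # is sparse, otherwise to a dense backend (e.g. an LU-based gesv path).
    # The caller never sees which factorization was used internally.
    if isinstance(A, SparseCOO):
        return sparse_solver(A, B)
    return dense_solver(A, B)
```

The point of the sketch is that the factorization becomes an implementation detail of the chosen backend, which is the argument for dropping `qrfact` from the name.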
```c
                                  &size_qr);
THAssert(cusolver_status == CUSOLVER_STATUS_SUCCESS);

cuda_status = cudaMalloc((void**)&buffer_qr, size_qr);
```
```c
THCIndexTensor *colIndices = THCIndexTensor_(newSelect)(state, Ai, 0, 1);
THCudaIntTensor *csr = THCSTensor_(toCSR)(state, rowIndices, m, nnz);
THCudaIntTensor *colIndicesInt = THCudaIntTensor_newWithSize1d(state, colIndices->size[0]);
THCudaIntTensor_copyCudaLong(state, colIndicesInt, colIndices);
```
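The `copyCudaLong` into `colIndicesInt` narrows PyTorch's 64-bit indices to the 32-bit ones the cuSPARSE/cuSOLVER entry points take. A pure-Python sketch of what a checked narrowing looks like (the helper name is illustrative):

```python
INT32_MAX = 2**31 - 1

def narrow_indices(idx64):
    """Narrow 64-bit indices to int32 range, failing loudly on overflow
    rather than silently truncating."""
    for i in idx64:
        if not (0 <= i <= INT32_MAX):
            raise OverflowError(f"index {i} does not fit in int32")
    return [int(i) for i in idx64]
```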
```c
csrqrInfo_t info;

cusolverSpHandle_t cusolverH;
cusolver_status = cusolverSpCreate(&cusolverH);
```
```c
                  THCTensor_(data)(state, rx_),
                  nBatch, info,
                  buffer_qr);
THAssert(cusolver_status == CUSOLVER_STATUS_SUCCESS);
```
```c
                  buffer_qr);
THAssert(cusolver_status == CUSOLVER_STATUS_SUCCESS);

cusolverSpDestroyCsrqrInfo(info);
```
I'm ok with merging a CPU implementation later.
I'm currently rebasing this. I'm noticing that the batch dimension for A (in the

@apaszke you said this function would be useful for some torch.distributions things. Would those still work with the assumption that the batch dimension is last for sparse tensors like this?
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as stale.
This PR is connected to issue #1178.
This is still a little hacky and a WIP, but I'm sending it in to start a conversation and move this toward something mergeable. I'm currently using it with the sparse solvers I have in qpth and it's working well: https://github.com/locuslab/qpth/blob/master/qpth/solvers/pdipm/spbatch.py
I think that batched sparse tensor operations with the same sparsity pattern will come up frequently as a way to take advantage of GPU parallelism, and I started a discussion a while ago about handling this case: https://discuss.pytorch.org/t/3d-sparse-batch-tensors-with-the-same-sparsity-pattern/2806
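A minimal pure-Python sketch of the layout that discussion proposes: one index list shared across the whole batch, with a separate values array per batch element. A batched sparse mat-vec is shown as the example operation; all names are illustrative, not part of this PR.

```python
def batched_spmv(indices, values_batch, xs, m):
    """Apply each sparse matrix in the batch to its own vector.

    indices:      shared (row, col) pairs, identical for every batch element
    values_batch: one list of nnz values per batch element
    xs:           one input vector per batch element
    m:            number of rows of each matrix
    """
    out = []
    for vals, x in zip(values_batch, xs):
        y = [0.0] * m
        for (r, c), v in zip(indices, vals):
            y[r] += v * x[c]
        out.append(y)
    return out
```

Because the index structure is shared, the per-batch work differs only in the values, which is what makes a single batched factorization with one symbolic analysis attractive on the GPU.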
We could move the broader conversation back to that thread. I could add some transpose operations to my code here and use hybrid sparse tensors as @ezyang mentioned. For simplicity, though, and because I'm not sure what you all prefer, I haven't done that: I'm not using the sparse tensor wrapper and am just passing the appropriate tensors around. What are your opinions on how I should modify this function and its interface?