
[WIP] Batched sparse QR factorizations and solves with cusolver #1716

Closed
wants to merge 5 commits

Conversation

bamos
Contributor

@bamos bamos commented Jun 4, 2017

This PR is connected to issue #1178.

This is still a little hacky and a WIP but I'm sending it in to get a conversation started and move this towards something mergable. I am currently using this with the sparse solvers I have in qpth and it's working well: https://github.com/locuslab/qpth/blob/master/qpth/solvers/pdipm/spbatch.py

I think that batched operations on sparse tensors with the same sparsity pattern will come up frequently as a way to take advantage of GPU parallelism, and I started this discussion a while ago about handling that case: https://discuss.pytorch.org/t/3d-sparse-batch-tensors-with-the-same-sparsity-pattern/2806

We could move this broader conversation back to that thread. I could add some transpose operations to my code here and use hybrid sparse tensors, as @ezyang mentioned. However, for simplicity, and because I'm not sure what you all prefer, I've implemented this without them: I'm not using the sparse tensor wrapper and am just passing the appropriate tensors around. What are your opinions on how I should modify this function and interface?
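
For concreteness, here is a minimal sketch of the layout being described, with illustrative names rather than this PR's exact interface: the indices are stored once and shared, while each matrix's values (and its right-hand side) are stacked along a trailing batch dimension.

```
#include <stdio.h>

/* Illustrative only: a batch of nBatch sparse m-by-m matrices that share
 * one sparsity pattern. The indices are stored once; the values carry a
 * trailing batch dimension. */
enum { m = 3, nnz = 4, nBatch = 2 };

int main(void) {
    /* Shared COO indices: Ai[0][k] is the row and Ai[1][k] the column
     * of the k-th structural nonzero, for every matrix in the batch. */
    long Ai[2][nnz] = {{0, 0, 1, 2}, {0, 2, 1, 1}};

    /* Values: Av[k][j] is the k-th nonzero of the j-th matrix. */
    double Av[nnz][nBatch] = {{1, 5}, {2, 6}, {3, 7}, {4, 8}};

    /* Right-hand sides, batched the same way: b[i][j]. */
    double b[m][nBatch] = {{1, 1}, {2, 2}, {3, 3}};

    printf("batch of %d matrices sharing %d nonzeros (first index: %ld,%ld)\n",
           nBatch, nnz, Ai[0][0], Ai[1][0]);
    (void) Av; (void) b;
    return 0;
}
```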

```
// Split the 2 x nnz COO index tensor into row and column indices, then
// compress the row indices into a CSR row-pointer array of length m + 1.
THCIndexTensor *rowIndices = THCIndexTensor_(newSelect)(state, Ai, 0, 0);
THCIndexTensor *colIndices = THCIndexTensor_(newSelect)(state, Ai, 0, 1);
THCudaIntTensor *csr = THCSTensor_(toCSR)(state, rowIndices, m, nnz);
THCudaIntTensor *colIndicesInt = THCudaIntTensor_newWithSize1d(state, colIndices->size[0]);
```
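
For readers unfamiliar with the conversion: toCSR compresses the sorted COO row indices into a row-pointer array of length m + 1. A standalone sketch of the idea (not the actual THCS kernel, which runs on the GPU):

```
#include <stdio.h>

/* Compress sorted COO row indices into a CSR row-pointer array:
 * count the nonzeros per row, then take a prefix sum. */
static void coo_rows_to_csr(const int *rows, int nnz, int m, int *rowPtr) {
    for (int i = 0; i <= m; i++) rowPtr[i] = 0;
    for (int k = 0; k < nnz; k++) rowPtr[rows[k] + 1]++;    /* per-row counts */
    for (int i = 0; i < m; i++) rowPtr[i + 1] += rowPtr[i]; /* prefix sum */
}

int main(void) {
    int rows[] = {0, 0, 1, 2};  /* row indices of a 3-by-3 COO matrix */
    int rowPtr[4];
    coo_rows_to_csr(rows, 4, 3, rowPtr);
    for (int i = 0; i < 4; i++) printf("%d ", rowPtr[i]);   /* prints: 0 2 3 4 */
    printf("\n");
    return 0;
}
```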


```
THAssert(cusparse_status == CUSPARSE_STATUS_SUCCESS);

// Describe A as a general (unstructured) matrix with zero-based indices.
cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ZERO);
```
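
For context, the full descriptor lifecycle around these two calls is create, configure, use, destroy, as in the cuSPARSE documentation; a minimal sketch with error handling reduced to asserts:

```
#include <assert.h>
#include <cusparse.h>

int main(void) {
    cusparseMatDescr_t descrA;
    assert(cusparseCreateMatDescr(&descrA) == CUSPARSE_STATUS_SUCCESS);

    /* General matrix (no symmetry assumed), zero-based (C-style) indices. */
    cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ZERO);

    /* ... pass descrA to the cuSPARSE/cuSOLVER routines ... */

    cusparseDestroyMatDescr(descrA);
    return 0;
}
```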


@fmassa
Member

fmassa commented Jun 5, 2017

This is looking nice! I have one remark, though:
for it to be merged into master, it will need a CPU implementation. It can be naive, but preferably better than converting the sparse tensors to dense tensors.
Do you have an idea of what we could do for the CPU side?

@bamos
Contributor Author

bamos commented Jun 5, 2017

Hmm, for the CPU version I don't think (though I'm not sure) that any libraries linked into PyTorch can do sparse non-batched QR factorizations and solves. It looks like somebody at Intel posted a prototype MKL interface almost a year ago, but it hasn't hit a release yet: https://software.intel.com/en-us/articles/intel-sparse-qr-factorization-prototype-preview-package

What about linking in and calling suitesparse?
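
For reference, SPQR's C interface is small; a sketch along the lines of its demo program, assuming SuiteSparse were linked in (a batched CPU path would presumably loop this per matrix, since SPQR itself has no batched mode):

```
#include "SuiteSparseQR_C.h"

int main(void) {
    cholmod_common cc;
    cholmod_l_start(&cc);  /* start CHOLMOD, which SPQR builds on */

    /* Read a sparse A (MatrixMarket on stdin) and build b = ones(nrow, 1). */
    int mtype;
    cholmod_sparse *A = (cholmod_sparse *) cholmod_l_read_matrix(stdin, 1, &mtype, &cc);
    cholmod_dense  *b = cholmod_l_ones(A->nrow, 1, A->xtype, &cc);

    /* Sparse QR solve, x = A \ b (least-squares if A is rectangular). */
    cholmod_dense *x = SuiteSparseQR_C_backslash_default(A, b, &cc);

    cholmod_l_free_dense(&x, &cc);
    cholmod_l_free_dense(&b, &cc);
    cholmod_l_free_sparse(&A, &cc);
    cholmod_l_finish(&cc);
    return 0;
}
```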

@fmassa
Member

fmassa commented Jun 6, 2017

@apaszke @soumith what do you think about linking other libraries in pytorch? Seems like the easiest thing to do in this case.

@apaszke
Contributor

apaszke commented Jun 6, 2017

I'm ok with it as long as it's optional

@ezyang
Contributor

ezyang commented Jun 7, 2017

Hey @bamos! So, if I understand correctly, cuSolver only supports CSR/CSC tensors, but PyTorch's current sparse representation only supports COO at the moment. So, there is nothing to talk about: you simply can't use the existing representation to represent the inputs.

Maybe at some point we should add support for CSR/CSC.
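
The difference in a nutshell, for a 3-by-3 matrix with nonzeros a at (0,0), b at (0,2), and c at (2,1):

```
/* COO, what PyTorch's sparse tensors store: one (row, col) pair per value. */
int    cooRow[] = {0, 0, 2};
int    cooCol[] = {0, 2, 1};
double cooVal[] = {/* a */ 1.0, /* b */ 2.0, /* c */ 3.0};

/* CSR, what cuSOLVER consumes: row i's entries occupy
 * val[rowPtr[i] .. rowPtr[i+1]); row 1 is empty, so rowPtr[1] == rowPtr[2]. */
int    csrRowPtr[] = {0, 2, 2, 3};
int    csrColInd[] = {0, 2, 1};
double csrVal[]    = {1.0, 2.0, 3.0};
```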

@fmassa
Member

fmassa commented Jun 7, 2017

@ezyang but he is converting to CSR before calling cuSolver, right?

@bamos
Contributor Author

bamos commented Jun 13, 2017

Hi, is there anything I can do now to help get this through, or do you all still want more time to think about batched sparse matrices with the same sparsity pattern?

I'd like to merge this version with GPU support as soon as we've agreed on the best way to format the data, so that people using my qpth library can use PyTorch master for GPU operations. Afterwards, can we file another issue for CPU support?

@fmassa
Member

fmassa commented Jun 13, 2017

I think it would still require a CPU implementation, unless @apaszke is OK with having a GPU-only operation for the moment?

Contributor

@apaszke apaszke left a comment


This looks good, but I think two things should be changed:

  1. The signature should be (sparseA, denseB) (i.e. use SparseTensors to represent the matrix). The implementation would then go into THCS.
  2. We should really rethink the naming of these functions, as it's becoming a mess. Right now we have gesv and btrisolve (which already follow two different conventions), and this PR adds spbqrfactsolve, which is different from both of them. I think we should drop the qrfact part, as the algorithm used to solve the system is a bit of an implementation detail (the factorization is not returned). Also, I'm wondering whether it could be made part of torch.gesolve, so that it works with both dense and sparse systems.

```
                           &size_qr);
THAssert(cusolver_status == CUSOLVER_STATUS_SUCCESS);

// Allocate the device workspace that the batched QR routines need.
cuda_status = cudaMalloc((void**)&buffer_qr, size_qr);
```


```
THCIndexTensor *colIndices = THCIndexTensor_(newSelect)(state, Ai, 0, 1);
THCudaIntTensor *csr = THCSTensor_(toCSR)(state, rowIndices, m, nnz);
THCudaIntTensor *colIndicesInt = THCudaIntTensor_newWithSize1d(state, colIndices->size[0]);
// cuSOLVER takes 32-bit indices, so narrow the 64-bit column indices here.
THCudaIntTensor_copyCudaLong(state, colIndicesInt, colIndices);
```


```
csrqrInfo_t info;

// One cuSOLVER sparse handle and one csrqr info object serve the whole batch.
cusolverSpHandle_t cusolverH;
cusolver_status = cusolverSpCreate(&cusolverH);
```


```
// Tail of the batched solve call: solutions for all nBatch systems land in rx_.
                  THCTensor_(data)(state, rx_),
                  nBatch, info,
                  buffer_qr);
THAssert(cusolver_status == CUSOLVER_STATUS_SUCCESS);
```


```
                  buffer_qr);
THAssert(cusolver_status == CUSOLVER_STATUS_SUCCESS);

// Release the factorization metadata once the batched solves are done.
cusolverSpDestroyCsrqrInfo(info);
```
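
Pieced together, the fragments above follow the standard batched csrqr sequence from the cuSOLVER documentation. A self-contained sketch of that sequence (not this PR's exact code; error checks shortened to asserts):

```
#include <assert.h>
#include <cuda_runtime.h>
#include <cusolverSp.h>

/* Solve nBatch systems A_j x_j = b_j where every A_j shares one CSR
 * pattern (d_csrVal holds nnz * nBatch values, one block per matrix). */
void batched_csrqr_solve(int m, int n, int nnz, cusparseMatDescr_t descrA,
                         const double *d_csrVal, const int *d_csrRowPtr,
                         const int *d_csrColInd, const double *d_b,
                         double *d_x, int nBatch) {
    cusolverSpHandle_t handle;
    csrqrInfo_t info;
    assert(cusolverSpCreate(&handle) == CUSOLVER_STATUS_SUCCESS);
    assert(cusolverSpCreateCsrqrInfo(&info) == CUSOLVER_STATUS_SUCCESS);

    /* 1. Symbolic analysis on the shared pattern, done once per batch. */
    cusolverSpXcsrqrAnalysisBatched(handle, m, n, nnz, descrA,
                                    d_csrRowPtr, d_csrColInd, info);

    /* 2. Query how much workspace the numeric phase needs, then allocate. */
    size_t internalBytes = 0, workspaceBytes = 0;
    cusolverSpDcsrqrBufferInfoBatched(handle, m, n, nnz, descrA,
                                      d_csrVal, d_csrRowPtr, d_csrColInd,
                                      nBatch, info,
                                      &internalBytes, &workspaceBytes);
    void *buffer = NULL;
    assert(cudaMalloc(&buffer, workspaceBytes) == cudaSuccess);

    /* 3. Factorize and solve all nBatch systems in one call. */
    cusolverSpDcsrqrsvBatched(handle, m, n, nnz, descrA,
                              d_csrVal, d_csrRowPtr, d_csrColInd,
                              d_b, d_x, nBatch, info, buffer);

    cudaFree(buffer);
    cusolverSpDestroyCsrqrInfo(info);
    cusolverSpDestroy(handle);
}
```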


@apaszke
Contributor

apaszke commented Jun 13, 2017

I'm ok with merging a CPU implementation later

@soumith soumith changed the title Batched sparse QR factorizations and solves with cusolver [WIP] Batched sparse QR factorizations and solves with cusolver Jun 22, 2017
@soumith soumith added this to Ready in PR Status Jul 3, 2017
@soumith soumith moved this from Ready for Review to In Progress in PR Status Jul 3, 2017
@soumith soumith moved this from In Progress to In Progresss in PR Status Jul 4, 2017
@zou3519
Contributor

zou3519 commented Jan 4, 2018

I'm currently rebasing this. I'm noticing that the batch dimension for A (in the Ax = b solver function) is the last one: the shape of A is (m, n, nBatch). This makes sense because PyTorch's sparse tensors can have dense dimensions only as the last dimensions (following the sparse dimensions), but it breaks a little from the convention that the batch dimension comes first.

@apaszke you said this function would be useful for some torch.distributions things. Would those still work with the assumption that the batch dimension is last for sparse tensors like this?

houseroad added a commit to houseroad/pytorch that referenced this pull request Jan 10, 2019
…b9fdd5

Summary:
Previous import was 8384c788939bc65463f9754b6a7a00b212b18ba1

Included changes:
- **[7abd834](onnx/onnx@7abd834)**: Clarify some aspects of the Loop spec. (pytorch#1587) <Scott McKay>
- **[5a5b15f](onnx/onnx@5a5b15f)**: Support rtol and atol at the model granularity (pytorch#1723) <Lu Fang>
- **[ba76e45](onnx/onnx@ba76e45)**: print some information (pytorch#1724) <Lu Fang>
- **[797390d](onnx/onnx@797390d)**: Update README.md (pytorch#1722) <Prasanth Pulavarthi>
- **[40cdb5f](onnx/onnx@40cdb5f)**: repaire convtranspose shape inference (pytorch#1660) <peter yang>
- **[68fdb3f](onnx/onnx@68fdb3f)**: [Minor] Fix Windows line ending in test coverage generating script (pytorch#1717) <Raymond Yang>
- **[00101bf](onnx/onnx@00101bf)**: Remove ConstantLike op. Updates to ConstantOfShape op. (pytorch#1716) <Spandan Tiwari>
- **[c59e90a](onnx/onnx@c59e90a)**: add a shape inference test for group conv (pytorch#1719) <Lu Fang>

Reviewed By: zrphercule

Differential Revision: D13629499

fbshipit-source-id: 61d4b30be2018f6b1e39a6acf9d80f8a5f26d7fc
facebook-github-bot pushed a commit that referenced this pull request Jan 11, 2019
…b9fdd5 (#15942)

Summary:
Pull Request resolved: #15942

Previous import was 8384c788939bc65463f9754b6a7a00b212b18ba1

Included changes: identical to the list in the commit above.

Reviewed By: zrphercule

Differential Revision: D13629499

fbshipit-source-id: 4b3e4cb29bdb84c3777a8fb26263548efb20f317
@pearu pearu added this to In progress in Sparse tensors Aug 10, 2021
@github-actions

Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark it as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
Stale pull requests will automatically be closed 30 days after being marked Stale.

@github-actions github-actions bot added the Stale label Mar 23, 2022
@github-actions github-actions bot closed this Apr 22, 2022
Sparse tensors automation moved this from In progress to Done Apr 22, 2022
jjsjann123 pushed a commit to jjsjann123/pytorch that referenced this pull request May 24, 2022
…1716)

I feel that this could be helpful for debugging. Using the example that breaks our system in csarofeen#1692:

Before
```
TransformPrinter : 
T0_g[ iS0{i1}, iS1{i2} ]
 root domain : (iS0{i1},iS1{i2})
T2_l[ iS11{( ceilDiv(i1, 3) )}, iS15{( 3 * ( ceilDiv(i2, 4) ) )}rf, rS14{4}rf ]
 root domain : (iS9{i1},iS15{( 3 * ( ceilDiv(i2, 4) ) )}rf,rS14{4}rf)
  Split: iS9{i1} by factor 3 -> iS11{( ceilDiv(i1, 3) )}, iS12{3}, start offset: 0, stop offset: 0
T1_g[ rS17{( ceilDiv(i1, 3) )}, rS19{( 3 * ( ceilDiv(i2, 4) ) )} ]
 root domain : (rS16{i1},rS19{( 3 * ( ceilDiv(i2, 4) ) )})
  Split: rS16{i1} by factor 3 -> rS17{( ceilDiv(i1, 3) )}, rS18{3}, start offset: 0, stop offset: 0
}
```

After:
```
TransformPrinter : 
T0_g[ iS0{i1}, iS1{i2} ]
 root domain : (iS0{i1},iS1{i2})
T2_l[ iS11{( ceilDiv(i1, 3) )}, iS15{( 3 * ( ceilDiv(i2, 4) ) )}rf, rS14{4}rf ]
 root domain : (iS9{i1},rS10{i2}rf)
  Split: iS9{i1} by factor 3 -> iS11{( ceilDiv(i1, 3) )}, iS12{3}, start offset: 0, stop offset: 0
  Split: rS10{i2}rf by factor 4 -> iS13{( ceilDiv(i2, 4) )}rf, rS14{4}rf, start offset: 0, stop offset: 0
  Merge: iS12{3} and iS13{( ceilDiv(i2, 4) )}rf -> iS15{( 3 * ( ceilDiv(i2, 4) ) )}rf
 rfactor domain : (iS9{i1},iS15{( 3 * ( ceilDiv(i2, 4) ) )}rf,rS14{4}rf)
  Split: iS9{i1} by factor 3 -> iS11{( ceilDiv(i1, 3) )}, iS12{3}, start offset: 0, stop offset: 0
T1_g[ rS17{( ceilDiv(i1, 3) )}, rS19{( 3 * ( ceilDiv(i2, 4) ) )} ]
 root domain : (rS16{i1},rS19{( 3 * ( ceilDiv(i2, 4) ) )})
  Split: rS16{i1} by factor 3 -> rS17{( ceilDiv(i1, 3) )}, rS18{3}, start offset: 0, stop offset: 0
}
```
malfet pushed a commit that referenced this pull request Jun 8, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

A few bigger updates:
1. Initial support of cp.async and cp.async.wait: csarofeen#1619
2. Emulate ampere's mma 16816 with Turing's mma 1688, for a unified interface: csarofeen#1643
3. Extending the infrastructure to support mma operators on turing and ampere arch: csarofeen#1440

Commits that are actually in this PR from the csarofeen branch:
```
* dd23252 (csarofeen/devel) Fusion Segmenter: Unify single kernel and multi-kernel runtime path (#1710)
* b3d1c3f Fix missing cooperative launch (#1726)
* dc670a2 Async gmem copy support on sm80+ (#1619)
* 5e6a8da Add turing mma support and test (#1643)
* d6d6b7d Fix rFactor when there are indirect root domain(s), and refactor (#1723)
* 7093e39 Mma op integration on ampere (#1440)
* fade8da patch python test for bfloat16 (#1724)
* 8fbd0b1 Fine-grained kernel profiling (#1720)
* 77c1b4f Adding dry run mode to skip arch dependent checks (#1702)
* 151d95b More precise concretization analysis (#1719)
* f4d3630 Enable complex python tests (#1667)
* 4ceeee5 Minor bugfix in transform_rfactor.cpp (#1715)
* 3675c70 Separate root domain and rfactor domain in TransformPrinter (#1716)
* f68b830 Fix scheduling with polymorphic broadcast (#1714)
* 4ab5ef7 updating_ci_machine (#1718)
* 56585c5 Merge pull request #1711 from csarofeen/upstream_master_bump_0517
* 174d453 Allow using nvFuser on CUDA extension (#1701)
* 18bee67 Validate LOOP concrete IDs have complete IterDomains (#1676)
```
Pull Request resolved: #78244
Approved by: https://github.com/csarofeen, https://github.com/malfet
facebook-github-bot pushed a commit that referenced this pull request Jun 8, 2022
Summary: syncing the nvfuser devel branch to upstream master, with the same description and commit list as the commit above.

Pull Request resolved: #78244

Reviewed By: ejguan

Differential Revision: D36678948

Pulled By: davidberard98

fbshipit-source-id: 0ccde965acbd31da67d99c6adb2eaaa888948105