Performance Issues with `sprs-ldl`. #199

PTNobel · 2020-06-18T22:50:44Z

I've been using sprs-ldl to solve some symmetric sparse matrix systems and found the performance to be surprisingly poor. lsolve_csc_dense_rhs has had significantly better performance.

Here is the benchmark I've been using: https://github.com/PTNobel/sprs-performance-test

which produced the following output:

LdlNumeric took 5775ms to solve 3 sparse systems
lsolve_csc_dense_rhs took 11ms to solve 3 sparse systems

I'm happy to answer more questions if needed, but that'll probably have to come on the weekend.

This issue was filed per https://www.reddit.com/r/rust/comments/gzazna/whats_the_current_state_of_rusts_sparse_linear/fv9hf0e/?context=3

The text was updated successfully, but these errors were encountered:

rth · 2020-06-19T11:43:47Z

Thanks for the report @PTNobel ! Similarly to #188 it would probably be nice to adapt your code to add it to the benchmarks as a starting point. I'm not familiar with the implementation so can't comment on the reasons behind this difference in performance. Someone would probably need to profile it to see what is happening..

vbarrielle · 2020-06-19T20:15:58Z

Hello @PTNobel and thanks for the report. I'm afraid the performance of sprs-ldl is not that good right now, for several reasons:

I have not done much code-specific optimizations yet, meaning there may be some hot spots in the implementation. It would be interesting to also benchmark against https://crates.io/crates/sprs_suitesparse_ldl which is a wrapper around a state of the art Cholesky factorization.
Cholesky factorization alone is not enough for solving a sparse system efficiently, because a factorization is bound to create some fill-in which can sometimes dramatically increase the number of nonzero values. Therefore, a fill-in reduction method must be used. Currently, only reverse Cuthil-McKee ordering has been implemented (see Fill in reduction with Cuthill-McKee ordering #182 and Implementation reverse Cuthill-McKee #186), which is a start, but far from state of the art. An implementation of Approximate Minimum Degree and/or Nested Dissection would be good, but so far I've been unable to implement these. These algorithms are significantly more complex to implement and I currently don't have the time to try and implement them. Maybe binding their suitesparse implementation could be the way. It would be however interesting to see how using reverse Cuthill-McKee would fare in your benchmark. This is not yet published on crates.io though, only on the master branch.
There is a symmetry check in the algorithm, if you're confident your system matrix is symmetric this could be skipped (though the API to do that is probably inconvenient right now).

Something troubles me about your benchmark though: it is expected that lsolve_csc_dense_rhs should be 2 to 3 times faster than a solve with LDL on an already factorized system (because LDL actually does two triangular solves and a diagonal solve). However lsolve_csc_dense_rhs requires its input to be lower triangular, and does not check. So it's quite possible the solve results in your benchmark are quite different.

Another thing about the benchmark: it includes the cost to convert from triplet format to csc, but it would probably be better to avoid counting this as solve time.

I should investigate this once I'm done improving the matrix product performance (currently investigating parallelization opportunities). Please note this can take some time, I don't have much free time at the moment. Any help is welcome of course :)

Thanks for the report!

PTNobel · 2020-06-20T08:53:55Z

@vbarrielle, I think it sounds like the lowest hanging fruit here is to

move the csc transforms out of the timed section
Verify that the two solutions have the same results
Add the sprs_suitsparse_ldl as a test against
Apply Reverse Cuthill-McKee as another test
Incorporate the tests into the sprs benchmark framework and merge into here.
Figure out how to turn off the symmetry test

vbarrielle · 2020-07-03T20:50:23Z

Hello, just a note that I can start investigating now that #201 is done.

PTNobel · 2020-07-09T00:26:28Z

Anything you want me to do? I'm happy to try and work through my list of items from my last comment.

vbarrielle · 2020-07-09T20:27:56Z

@PTNobel I've checked 1, and the csc transform does not take too much time so it does not matter if it stays in the timed section.

I'm going to focus on 4 and 6, which mostly requires that I publish a new version of sprs with Cuthill-McKee exposed, once it's done I can enhance the API of sprs-ldl to let the caller ask for a fill-in reduction method, and disable the symmetry check.

So if you want you can look into 2, 3 or 5. Thanks for proposing your help by the way.

vbarrielle · 2020-07-09T21:09:15Z

I wanted to have comparison points with other factorization algorithms, so I serialized the first jacobian in your example and loaded it in scipy.

There I've been able to test LU factorization (scipy uses the library SuperLU under the hood):

In [23]: %timeit scipy.sparse.linalg.splu(jaco)
1 loop, best of 5: 1.9 s per loop

I've also timed the CHOLMOD algorithm, which is probably the best library for Cholesky decomposition. I've had to add a diagonal offset to avoid an error signaling the matrix is not positive definite:

In [20]: %timeit sksparse.cholmod.cholesky(jaco, 100.)
1 loop, best of 5: 462 ms per loop

For the record, on my machine, using Reverse Cuthill-McKee, I get the following timing on your benchmark:

LdlNumeric took 4379ms to solve 3 sparse systems

which would mean about 1.46s per solve but this is not an accurate comparison as the symbolic part is re-used but the backsolve is included.

In general, the number that governs the performance of the factorization is the fill-in, ie the number of additional nonzeros created in the factorization. The jacobian has 41818 nonzeros. Factoring with sprs-ldl gives 1654098 nonzeros, which can be reduced to 1498310 nonzeros with Reverse Cuthill-McKee. SuperLU has 1354237 nonzeros in its L factor, and 1354558 in its U factor. CHOLMOD has 1168635 nonzeros.

So part of the better performance in CHOLMOD is probably explained by its usage of a better fill-in reducing permutation (it probably uses AMD). It's also an extremely well tuned library.

However I find the fill-in quite big, even for CHOLMOD. But I notice the graph is wired randomly, and I'm pretty sure fill-in reducing permutations work by discovering structure in the nonzero graph, so that means this is quite a pathological example. Maybe we can expect better numbers if the graph is for instance a regular grid, or a surface mesh. That's something to try.

PTNobel · 2020-07-10T06:18:21Z

But I notice the graph is wired randomly

I'll add, that this is because my use-case encounters genuinely random graphs quite frequently sadly, the more structure present, the work we're doing is less useful.

PTNobel · 2020-07-10T07:09:19Z

So as you suspected, the results being returned are wildly different suggesting that lsolve_csc_dense_rhs was fast because it is wrong... I'll push this code to my repo tonight resolving 2.

PTNobel · 2020-07-10T07:15:43Z

Question:

What is the difference between LdlLongNumeric and LdlNumeric in sprs_suitesparse_ldl?

vbarrielle · 2020-07-11T12:28:54Z

The difference comes from the type of the integer type that the underlying C library expects for the storage of the indices and indptr of the system matrix, the former expects a libc::c_long while the latter expects a libc::c_int. This mostly matters if your number of rows/columns if bigger than what can fit in an integer. Here it shouldn't matter. You probably need to call CsMat::to_other_types to convert to the correct C types.

vbarrielle · 2020-08-13T21:26:38Z

Latest micro-optimizations in sprs-ldl (see #207) have put its performance on-par with the original C implementation of LDL (see PTNobel/sprs-performance-test#2).

I guess the way to go now would be to have a better fill-in reducing permutation, which will require binding to those available in suitesparse, or implementing one of them.

I'll probably look into the former in the near future.

vbarrielle · 2020-10-09T20:43:54Z

The binding of SuiteSparse's CAMD has enabled some more performance gains, and, as mentionned in PTNobel/sprs-performance-test#2 (comment), I don't think there's much more improvements to be hoped for using LDL. A good way to improve performance would be to have a supernodal Cholesky decomposition (like CHOLMOD), or a multifrontal one to exploit parallelism. I think the latter is a bit more accessible than the former, so I'll probably have a look into it at some point in the future, but it's probably going to take a while, as my free time is scarce these days.

rwl · 2022-04-29T13:05:54Z

An implementation of Approximate Minimum Degree and/or Nested Dissection would be good...

I published a pure Rust translation of AMD:

https://crates.io/crates/amd

It works well with this solver:

https://crates.io/crates/rlu

mulimoen · 2022-05-01T08:00:13Z

@rwl That seems like it could fit very well with this crate!

rwl · 2022-05-01T08:38:57Z

I also translated QDLDL from the OSQP solver project:

https://crates.io/crates/ldl

It is published under the Apache 2.0 license and could also be a good fit.

vbarrielle self-assigned this Jun 19, 2020

vbarrielle mentioned this issue Jul 20, 2020

Enhance ldl API #203

Merged

This was referenced Aug 5, 2020

Update to sprs v0.8 and sprs-ldl v0.6 to enable more tests PTNobel/sprs-performance-test#1

Merged

Faster ldl decomposition #207

Merged

vbarrielle mentioned this issue Sep 15, 2020

Bind suitesparse camd #223

Merged

vbarrielle mentioned this issue Oct 9, 2020

Update to latest sprs-ldl version to show performance improvements PTNobel/sprs-performance-test#2

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance Issues with `sprs-ldl`. #199

Performance Issues with `sprs-ldl`. #199

PTNobel commented Jun 18, 2020

rth commented Jun 19, 2020

vbarrielle commented Jun 19, 2020

PTNobel commented Jun 20, 2020

vbarrielle commented Jul 3, 2020

PTNobel commented Jul 9, 2020

vbarrielle commented Jul 9, 2020 •

edited

Loading

vbarrielle commented Jul 9, 2020

PTNobel commented Jul 10, 2020

PTNobel commented Jul 10, 2020

PTNobel commented Jul 10, 2020

vbarrielle commented Jul 11, 2020

vbarrielle commented Aug 13, 2020

vbarrielle commented Oct 9, 2020

rwl commented Apr 29, 2022

mulimoen commented May 1, 2022

rwl commented May 1, 2022

Performance Issues with sprs-ldl. #199

Performance Issues with sprs-ldl. #199

Comments

PTNobel commented Jun 18, 2020

rth commented Jun 19, 2020

vbarrielle commented Jun 19, 2020

PTNobel commented Jun 20, 2020

vbarrielle commented Jul 3, 2020

PTNobel commented Jul 9, 2020

vbarrielle commented Jul 9, 2020 • edited Loading

vbarrielle commented Jul 9, 2020

PTNobel commented Jul 10, 2020

PTNobel commented Jul 10, 2020

PTNobel commented Jul 10, 2020

vbarrielle commented Jul 11, 2020

vbarrielle commented Aug 13, 2020

vbarrielle commented Oct 9, 2020

rwl commented Apr 29, 2022

mulimoen commented May 1, 2022

rwl commented May 1, 2022

Performance Issues with `sprs-ldl`. #199

Performance Issues with `sprs-ldl`. #199

vbarrielle commented Jul 9, 2020 •

edited

Loading