
Quadratic complexity of sparse matrices multiplication #184

Closed · espdev opened this issue Mar 4, 2020 · 12 comments

espdev commented Mar 4, 2020

Hello!

I'm trying to use the sprs package in my experimental code, and I have run into very slow multiplication of large sparse matrices.

My code uses large diagonal matrices in CSR format. The matrices may contain anywhere from hundreds or thousands to millions of NNZ elements. Multiplying such matrices in sprs has almost quadratic complexity, O(nnz^2), so we cannot currently use sprs in production code.

I wrote some tests in Rust+sprs and Python+scipy. SciPy uses the SMMP algorithm (Sparse Matrix Multiplication Package) and multiplies such matrices in tens to hundreds of milliseconds.

The code of my benchmarks can be found on gist: https://gist.github.com/espdev/d278962828ce7360d653b8679d7bb511

The plot for mat * mat.T:

[plot: sprs_scipy_matmul]

Could you consider using the SMMP algorithm in sprs? It has complexity O(n_row*K^2 + max(n_row, n_col)), where K is the maximum nnz in a row of A and a column of B.
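For reference, here is roughly the shape of the sprs side of the benchmark (a minimal sketch with illustrative sizes; `CsMat::eye` stands in for building the actual diagonal matrices from the gist):

```rust
use std::time::Instant;

use sprs::CsMat;

fn main() {
    for &n in &[10_000usize, 100_000, 1_000_000] {
        // An identity matrix stands in for a large diagonal CSR matrix
        // here: nnz == n, which is the regime described above.
        let a: CsMat<f64> = CsMat::eye(n);
        let start = Instant::now();
        let _c = &a * &a; // the sparse-sparse product being timed
        println!("n = {:>9}: {:?}", n, start.elapsed());
    }
}
```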

rth (Contributor) commented Mar 4, 2020

Thanks for investigating this! I have always wondered how sprs would compare to scipy.sparse performance-wise (see scipy/scipy#10201 (comment)). It would also be interesting to have this comparison on random sparse arrays, to evaluate the impact of shape and density.

vbarrielle (Collaborator) commented:

That's a problem indeed, thanks for reporting it. I'll try and find some time to investigate.

espdev (Author) commented Mar 4, 2020

I have compared sprs and scipy on random sparse matrices of various shapes and densities, and the results are interesting. It seems I incorrectly characterized how sprs's complexity depends on nnz: the mat*mat complexity in sprs also depends directly on the matrix dimensions. Performance drops dramatically as the matrix sizes grow, independently of nnz. In scipy the dependency is quite different.

Here are some plots from my benchmarks.

scipy:
[plot: scipy_sparse_matmul]

sprs:
[plot: sprs_sparse_matmul]

scipy vs sprs:
[plots: sprs_scipy_sparse_matmul_0, sprs_scipy_sparse_matmul_1]

The benchmark code can be found in this gist: https://gist.github.com/espdev/f16f139f8007d3b37d1a57835fee6a19

vbarrielle (Collaborator) commented:

Interesting. Here's the main loop of the matrix multiplication in sprs: https://github.com/vbarrielle/sprs/blob/master/src/sparse/prod.rs#L207-L226

It accumulates each row of the result matrix into a dense vector before converting it to sparse. That explains the dependency on the dimension (actually, the number of columns). On the other hand, this approach becomes more interesting as the matrix gets denser.

I've started a branch to implement SMMP. We'll know more once I'm done implementing it. But your graphs suggest it might be worth keeping the original product implementation around, perhaps with a modification to sparsify the structure that accumulates the result row.
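To make that concrete, here is a self-contained sketch of the dense-accumulator strategy (illustrative code with made-up names, not the actual sprs implementation). The O(n_cols) sparsification scan at the end of each result row is where the dependency on the dimension comes from:

```rust
/// Sketch of a CSR * CSR product using a dense accumulator per result row.
/// `a` is m x k and `b` is k x n, both given as raw CSR arrays.
fn csr_mul_dense_acc(
    a_indptr: &[usize], a_indices: &[usize], a_data: &[f64],
    b_indptr: &[usize], b_indices: &[usize], b_data: &[f64],
    n_cols: usize,
) -> (Vec<usize>, Vec<usize>, Vec<f64>) {
    let n_rows = a_indptr.len() - 1;
    let mut scratch = vec![0.0f64; n_cols]; // dense accumulator, length n
    let mut indptr = vec![0usize];
    let mut indices = Vec::new();
    let mut data = Vec::new();
    for i in 0..n_rows {
        // Accumulate row i of the result: cost proportional to the nnz
        // of the rows of `b` that row i of `a` touches.
        for ai in a_indptr[i]..a_indptr[i + 1] {
            let (k, av) = (a_indices[ai], a_data[ai]);
            for bi in b_indptr[k]..b_indptr[k + 1] {
                scratch[b_indices[bi]] += av * b_data[bi];
            }
        }
        // Sparsification: an O(n_cols) scan per row, regardless of nnz.
        for (col, v) in scratch.iter_mut().enumerate() {
            if *v != 0.0 {
                indices.push(col);
                data.push(*v);
                *v = 0.0; // reset the accumulator for the next row
            }
        }
        indptr.push(indices.len());
    }
    (indptr, indices, data)
}
```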

vbarrielle (Collaborator) commented Mar 6, 2020

Progress report: I've implemented the SMMP symbolic and numeric phases in the smmp branch. I won't have time to check the performance this weekend, but I should be able to look at it next week.
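For context, here is a sketch of the symbolic phase of SMMP (illustrative code, not the contents of the smmp branch): the columns of each result row are threaded through a `next` array acting as a linked list, so the pattern is found in roughly O(K^2) work per row without scanning all n_cols entries.

```rust
const UNSEEN: usize = usize::MAX; // column not yet in the current row's list
const END: usize = usize::MAX - 1; // end-of-list marker

/// Sketch of SMMP's symbolic phase: computes the indptr of C = A * B.
fn smmp_symbolic(
    a_indptr: &[usize], a_indices: &[usize],
    b_indptr: &[usize], b_indices: &[usize],
    n_cols: usize,
) -> Vec<usize> {
    let n_rows = a_indptr.len() - 1;
    let mut next = vec![UNSEEN; n_cols]; // linked list threaded over columns
    let mut indptr = vec![0usize];
    for i in 0..n_rows {
        let mut head = END;
        let mut count = 0;
        for &k in &a_indices[a_indptr[i]..a_indptr[i + 1]] {
            for &j in &b_indices[b_indptr[k]..b_indptr[k + 1]] {
                if next[j] == UNSEEN {
                    next[j] = head; // prepend column j to the row's list
                    head = j;
                    count += 1;
                }
            }
        }
        indptr.push(indptr[i] + count);
        // Unlink the list so `next` is clean for the next row.
        while head != END {
            let nxt = next[head];
            next[head] = UNSEEN;
            head = nxt;
        }
    }
    indptr
}
```

The numeric phase walks the same linked list while accumulating values. Note that the list yields columns in insertion order, not sorted order, which is where the sorting concern discussed below comes from.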

I also have a request for you, @espdev: would you be OK with contributing your benchmarks to the examples folder of sprs? I'd like to have them around to monitor performance, and I want to be sure you're fine with contributing them under sprs' license.

espdev (Author) commented Mar 6, 2020

@vbarrielle

Wow, that's great, thank you!
I will clean up and refactor my benchmarks soon so they can be contributed to the sprs examples.

vbarrielle (Collaborator) commented Mar 9, 2020

Here are the performance results for smmp on your first benchmark:

[plot: sprs_vs_scipy]

(EDIT: these results come with a caveat: I generated them on my unplugged laptop.)

Looks good. Now I have to take a look at your second benchmark. It will probably show a complexity difference between scipy's implementation of SMMP and mine: I have to sort the indices of the matrices, since sorted indices are a guarantee of the CsMat type. That makes the complexity O((n_row*K^2 + max(n_row, n_col)) * K*log(K)). It should not matter for most workloads, though.

vbarrielle (Collaborator) commented:

Here are the results for the random matrices:

[plot: sprs_vs_scipy_random_mats]

We can see that sprs is lagging behind now. I've profiled it, and most of the time is spent sorting the indices. If I disable index sorting, sprs is more performant:

[plot: sprs_unsorted_vs_scipy_random_mats]

I'm going to investigate alternative index-sorting methods. I could try using an array of booleans when the nnz count gets too large.
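Roughly, that idea could look like this (a hypothetical sketch, not code from the branch): mark the collected column indices in a reusable boolean bitmap, then recover sorted order with a linear scan over the touched range, trading the O(K log K) sort for an O(span) scan.

```rust
/// Emit the columns of one result row in sorted order without sorting.
/// `seen` is a scratch bitmap of length n_cols, all false between calls.
fn sorted_cols_via_scan(cols: &[usize], seen: &mut [bool], out: &mut Vec<usize>) {
    if cols.is_empty() {
        return;
    }
    let (mut lo, mut hi) = (usize::MAX, 0);
    for &c in cols {
        seen[c] = true;
        lo = lo.min(c);
        hi = hi.max(c);
    }
    // Linear scan over the touched range: O(hi - lo + 1).
    for c in lo..=hi {
        if seen[c] {
            out.push(c);
            seen[c] = false; // reset during the scan, no extra pass needed
        }
    }
}
```

The bitmap is reset during the scan itself, so the scratch array stays clean for the next row without an extra O(n_cols) pass.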

espdev (Author) commented Mar 10, 2020

@vbarrielle

I noticed that the sizes of the matrices differ in the legend text on the graphs:

sprs (15000, 2500)
scipy (15000, 25000)

It looks like it is just a typo.

vbarrielle (Collaborator) commented:

Indeed, I meant to type (15000, 25000) for both.

vbarrielle (Collaborator) commented Mar 11, 2020

Small update: I've figured out that I can dynamically switch between using the "linked list" indices from SMMP followed by a sort, and using a linear scan over the encountered nnz indices, according to the expected amount of work.

This makes performance much better when there is a large number of non-zeros:

[plot: sprs_smart_vs_scipy_random_mats]

(EDIT: the legend is wrong once again; I forgot to remove the "(unsorted)". It is sorted.)

My test for switching between the two methods is a bit naive right now, but it seems to work well enough. In short, for each row I compare the number of elements I would need to scan (the difference between the maximum and minimum encountered nnz columns) against the amount of work sorting would take (k * log(k), with k the number of nnz in the row); see the sketch below. I should probably do some timings to determine a more precise bound. There are also a few optimizations that could be done for the linear scan, as it could take advantage of SIMD.
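For illustration, such a per-row test could look something like this (my sketch of the heuristic just described, not the actual code):

```rust
/// Decide whether to emit a row's columns via a linear scan over the
/// encountered range, or by sorting the k collected indices.
/// `k >= 1` is assumed; the log2 is an integer approximation.
fn use_linear_scan(min_col: usize, max_col: usize, k: usize) -> bool {
    let scan_work = max_col - min_col + 1;
    let log2_k = (usize::BITS - k.leading_zeros()) as usize;
    let sort_work = k * log2_k;
    scan_work <= sort_work
}
```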

vbarrielle (Collaborator) commented:

Quick mention that sparse product performance has been further improved in #201.
