ENH: sparse: efficient conversion from DIA to CSR format #23009
Conversation
By the way, simply removing the current Python + NumPy implementations of tocsc() and tocoo() also makes those conversions faster. However, as line 63 in 7ef6aeb shows, … even for canonical data. So there are two possible solutions:
While implementing the second “solution” from the previous comment, I've discovered missing bound checks for ill-formed DIA arrays (when diagonals are longer than the array width or lie completely outside it), so this is corrected in 677f10b. The remaining new commits are related to eliminating the inefficient …
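The "ill-formed" cases mentioned here can be illustrated with plain scipy (hypothetical data, not the PR's tests): a stored diagonal longer than the array width has to be clipped, with the out-of-range part contributing nothing:

```python
import numpy as np
from scipy import sparse

# Illustrative example (not the PR's test data): the DIA buffer stores a
# 4-element diagonal for a 2x2 array; only the first two entries are in range.
data = np.array([[1, 2, 3, 4]])
A = sparse.dia_array((data, [0]), shape=(2, 2))
dense = A.toarray()  # conversion must clip the overlong diagonal
```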
Here are more extensive benchmarks for arrays of various sizes, comparing conversions with arithmetic operations on an equal footing. Using … So conversion from DIA becomes 3–6 times faster to CSR, more than 2 times faster to BSR, and somewhat faster to all other formats. It is still slower than simple arithmetic operations, but not by as much as it currently is (about an order of magnitude for the 10000 × 10000 size).
dschult
left a comment
Thanks for this!
The timing results look good -- and I agree with your choice to implement the second approach to has_canonical_format in tocoo. Adding the tocoo method and setting has_canonical_format inside _csrbase.tocoo should work fine. The result of converting canonical CSR to COO is canonical. Using the has_canonical_format property on CSR can trigger a check of canonical form if that property has not been accessed before, but this is a case where that is desirable, IIUC.
This C++ code allocates the space for the output directly, whereas most other functions in this suite of tools have the allocation occur before calling and then "prune" the results after returning. I don't know enough about std::vector to know the impacts of that approach. And I see that the vectors get resized twice. But we don't have to prune after the return. Can you say a few words about this choice to allocate inside the C++ function and to resize twice? Are there advantages over preallocating and pruning via numpy?
The addition of copy=copy in COO __init__ as well as the very minor tweaks to the
dia_matmat code look good.
And I have a request below for you to add a comment block at the top of the new C++ dia_tocsr function.
Thanks!
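The canonical-format point can be checked with ordinary scipy.sparse usage (illustrative, not the PR's code): a canonical CSR array converts to a COO array whose entries are already in row-major order with no duplicates.

```python
import numpy as np
from scipy import sparse

# A CSR array built from dense data is in canonical form; accessing the
# property below runs the canonicality check the first time.
A = sparse.csr_array(np.array([[1, 0], [0, 2]]))
print(A.has_canonical_format)

# Its COO conversion comes out sorted by (row, col) with no duplicates.
C = A.tocoo()
order = np.lexsort((C.col, C.row))
print((order == np.arange(C.nnz)).all())
```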
scipy/sparse/_dia.py
Outdated
idx_dtype = self._get_index_dtype(maxval=self.data.size)
order = np.argsort(self.offsets).astype(idx_dtype, copy=False)
indices, indptr, data = dia_tocsr(*self.shape, *self.data.shape,
                                  self.offsets, self.data, order)
# If possible, shrink indexing dtype using actual nnz.
idx_dtype = self._get_index_dtype(maxval=indptr[-1])
Typically for get_index_dtype we set maxval based on self.shape as well as nnz. This tocsr() uses self.data.size which is slightly different. Can you say a little bit about that? self.data.size includes multiple diagonals and also all the zeros in those diagonals. But it doesn't directly include the shape of the matrix. Is it easy to see why this would be a not-too-high upper bound on the shape and nnz?
And when you recheck the index dtype after conversion, do we need to include the shape rather than just nnz?
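A small sanity check of the bound being discussed (illustrative only): every stored DIA entry, including explicit zeros and clipped padding, produces at most one CSR entry, so data.size bounds nnz; column index values, however, are bounded by the shape, not by data.size.

```python
import numpy as np
from scipy import sparse

# Two diagonals of length 4 stored for a 4x4 array: data.size == 8.
A = sparse.dia_array((np.ones((2, 4)), [0, 1]), shape=(4, 4))
B = A.tocsr()

# nnz can never exceed the number of stored DIA entries...
assert B.nnz <= A.data.size
# ...but the index values themselves are bounded by the shape.
assert B.indices.max() < A.shape[1]
```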
+1, let's use self.shape here as well (in case we have indices on the minor axis that are really huge).
Done.
@dschult, thank you for reviewing so quickly! (This time I was busy, so replying only now.)
I'm not familiar enough with SciPy internals and strategies to be sure. Where can I read about that? And how is the connection between NumPy arrays and C++ vectors done internally (the '*V' and '*W' codes in …)? The general idea was to reduce memory allocations and copying.
On the other hand, now I see that …
Definitely will do. And I'll also look at the rest slightly later. P.S. By the way, there is a bug in DIA's nnz for ill-formed arrays: since …
Can you add a separate PR for the fix of DIA.nnz in ill-formed cases? I am intrigued by the idea of reducing allocation and copying -- with … Perhaps @perimosocordiae has some perspective on the tradeoffs of "pre-allocate via numpy && prune" versus allocating std::vectors inside the C++ function and resizing.
Sure, see gh-23055. Your review was in fact very helpful, because I've tried preallocating NumPy output arrays using the new … So I'll put this PR on hold until gh-23055 is merged, and then will refactor to accommodate those changes.
@MikhailRyazanov is this PR still in draft mode, or are you ready for review?
scipy/sparse/tests/test_base.py
Outdated
d1 = [[1]]        # diagonal shorter than width
d3 = [[1, 2, 3]]  # diagonal longer than width
…
cases = [(d1, [-1], 1, [[0, 0], [1, 0]]),  # within
Let's convert each of these comments into a string, and include that in the assert message (i.e., as the third argument to assert_array_equal). This will make it a lot easier to figure out which case failed in the event of a test failure.
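The suggested pattern could look roughly like this (hypothetical data, not the PR's actual cases; the description string is the third argument, err_msg, of assert_array_equal):

```python
import numpy as np
from numpy.testing import assert_array_equal
from scipy import sparse

# Hypothetical cases: (data, offsets, expected, description)
cases = [
    ([[1, 2]], [0], [[1, 0], [0, 2]], "main diagonal within the array"),
    ([[1, 2]], [-1], [[0, 0], [1, 0]], "subdiagonal clipped at the bottom edge"),
]
for data, offsets, expected, msg in cases:
    A = sparse.dia_array((np.array(data), offsets), shape=(2, 2))
    # the description shows up in the failure report
    assert_array_equal(A.toarray(), expected, msg)
```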
Done, please take a look.
Still a draft. I need to refactor it using the gh-23055 changes (the recent commit was only to get them into this branch) and the discussions above; then I will reopen it as "ready for review".
… than direct Python/NumPy implementations
otherwise TestCOO::test_constructor1_base fails "not sparse_may_share_memory" assertion for formats with default tocoo(copy=False)
These tests are generic for both `dia_array` and `dia_matrix`, thus should not use a specific type (overlooked in gh-21180).
@dschult, @perimosocordiae: Benchmark comparisons for various sizes (see above). For small sizes, the performance is still better than the original, but slightly worse than in the previous attempt with … I've also tried two alternative approaches:
I don't think that the first alternative is really better, despite its slightly better best-case performance. The second alternative has some advantages and some disadvantages, so I'm not sure. What do you think?
scipy/sparse/sparsetools/dia.h
Outdated
j = i + offsets[n];  // column
if (j < 0 or j >= j_end)
    continue;
const T x = (data + npy_intp(L) * n)[j];
Let's use a different indexing expression here, to be more consistent with the rest of the usages in sparsetools:
data[L*n + j]
This was basically to be consistent with the idiom
const T* diag = diags + npy_intp(L) * n;
... diag[j] ...
used throughout dia.h.
Do you also want to change other occurrences, like line 219 here:
scipy/sparse/sparsetools/dia.h
Lines 218 to 224 in 98e24c5
// element at i-th row, k-th column in A
const T a = (A_data + npy_intp(A_L) * n)[k];
// k-th row in B
const T* B_row = B_data + npy_intp(B_cols) * k;
// loop over columns in current output row
for (I j = 0; j < cols; ++j)
    row[j] += a * B_row[j];
to
const T a = A_data[npy_intp(A_L) * n + k]; ?
Yeah, let's avoid mixing pointer arithmetic and indexing in the same expression.
Done!
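The equivalence behind the two indexing styles above is just flat versus two-dimensional indexing of the same C-contiguous buffer; in NumPy terms (illustrative, not the C++ code itself):

```python
import numpy as np

# A (num_diags, L) buffer laid out row-major, as in the DIA data array.
L, num_diags = 5, 3
data = np.arange(num_diags * L, dtype=float).reshape(num_diags, L)

n, j = 2, 4
# data[L*n + j] on the flat buffer addresses the same element as data[n, j]
assert data.ravel()[L * n + j] == data[n, j]
```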
Thanks @MikhailRyazanov! Based on your benchmark results, I think it's reasonable to keep the current approach (…).
dschult
left a comment
This looks good to me!
Thanks @MikhailRyazanov !
Reference issue
Addendum to gh-21180, should close gh-21039.
What does this implement/fix?
Direct and efficient conversion from DIA format to CSR format, to speed up all operations not directly implemented for DIA.
Additional information
After merging gh-21180, almost all arithmetic operations with DIA arrays became faster than conversion from DIA to other formats, needed for other operations, such as arithmetic with arrays in other formats and passing DIA arrays to methods not implemented for them directly.
For comparison — arithmetic operations:
and conversions:
(keep in mind that conversion benchmarks use $100^2 \times 100^2$ arrays, while arithmetic benchmarks are for $250^2 \times 250^2$, and generally with more diagonals, so the relative performance of conversions is even worse than these benchmarks show). This is because direct conversion is currently implemented only to CSC and COO, which is quite inefficient.
The direct and efficient conversion to CSR format implemented in this PR gives it ~5-fold speed-up (which also significantly improves conversion to BSR and slightly to LIL and DOK), making it no longer much slower than arithmetic operations:
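A rough way to reproduce such a comparison locally (not the asv benchmark that produced the numbers above; the size and diagonals are arbitrary) is to time the conversion against a simple arithmetic operation:

```python
import numpy as np
from timeit import timeit
from scipy import sparse

# Tridiagonal 10000 x 10000 DIA array.
n = 10_000
A = sparse.dia_array((np.ones((3, n)), [-1, 0, 1]), shape=(n, n))

# Compare DIA -> CSR conversion with DIA + DIA addition.
t_conv = timeit(lambda: A.tocsr(), number=10)
t_add = timeit(lambda: A + A, number=10)
print(f"tocsr: {t_conv:.3f}s  A + A: {t_add:.3f}s")
```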
P.S.
- … the --quick option for asv, for better reproducibility and more meaningful numbers.
- With the --compare main approach, it apparently started rebuilding the whole of SciPy instead of just recompiling the changed files. Am I missing something?
- UPD: Conversion to CSC and COO formats can also be made faster by simply removing the current tocsc() and tocoo() methods, see below.