Conversation

@MikhailRyazanov (Contributor) commented May 17, 2025

Reference issue

Addendum to gh-21180, should close gh-21039.

What does this implement/fix?

Direct and efficient conversion from the DIA format to the CSR format, to speed up all operations not implemented directly for DIA.

Additional information

After gh-21180 was merged, almost all arithmetic operations with DIA arrays became faster than converting DIA to other formats, which is needed for everything else: arithmetic with arrays in other formats and passing DIA arrays to methods not implemented for them directly.
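For a rough local illustration (a sketch, not the asv benchmark code; the matrix, sizes and `number` here are my own, and absolute timings will of course vary):

import numpy as np
import timeit
from scipy.sparse import diags_array

n = 10_000
A = diags_array([np.full(n, 2.0), np.full(n - 1, -1.0)],
                offsets=[0, 1], format='dia')

print(timeit.timeit(lambda: A + A, number=100))      # stays in DIA: fast
print(timeit.timeit(lambda: A.tocsr(), number=100))  # conversion is the bottleneck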

For comparison — arithmetic operations:

sparse.Arithmetic.time_arithmetic
======== ==== ============= ============= ============= =============
--                                       op                          
------------- -------------------------------------------------------
 format   XY     __add__       __sub__       multiply      __mul__   
======== ==== ============= ============= ============= =============
  dia     AA     600±20μs      617±20μs    1.34±0.05ms    6.18±0.2ms 
  dia     AB   2.18±0.01ms   2.92±0.02ms   1.33±0.04ms     12.8±2ms  
  dia     BA   2.22±0.04ms   2.21±0.05ms   1.28±0.04ms     12.7±2ms  
  dia     BB   1.26±0.01ms   1.32±0.02ms   3.17±0.04ms    26.5±0.3ms 
======== ==== ============= ============= ============= =============

and conversions:

sparse.Conversion.time_conversion
============= ============= ============= ============= ============= ============ ============ =============
--                                                       to_format                                           
------------- -----------------------------------------------------------------------------------------------
 from_format       csr           csc           coo           dia          lil          dok           bsr     
============= ============= ============= ============= ============= ============ ============ =============
     dia       1.74±0.05ms     866±40μs      691±10μs      212±5ns     19.8±0.5ms   22.4±0.3ms   2.28±0.06ms 
============= ============= ============= ============= ============= ============ ============ =============

(Keep in mind that the conversion benchmarks use $100^2 \times 100^2$ arrays, while the arithmetic benchmarks use $250^2 \times 250^2$ arrays, generally with more diagonals, so the relative performance of conversions is even worse than these numbers show.) The conversions are slow because direct conversion is currently implemented only to CSC and COO, and quite inefficiently.

The direct and efficient conversion to the CSR format implemented in this PR gives a ~5-fold speed-up (which also significantly improves conversion to BSR, and slightly LIL and DOK), so that it is no longer much slower than arithmetic operations:

sparse.Conversion.time_conversion
============= ============= ============= ============= ============= ============ ============ =============
--                                                       to_format                                           
------------- -----------------------------------------------------------------------------------------------
 from_format       csr           csc           coo           dia          lil          dok           bsr     
============= ============= ============= ============= ============= ============ ============ =============
     dia         273±2μs       814±5μs       597±8μs       217±9ns     17.4±0.1ms   20.5±0.07ms     781±2μs   
============= ============= ============= ============= ============= ============ ============ =============
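For reference, the conversion algorithm itself is simple; here is a pure-NumPy sketch of what the new C++ dia_tocsr() routine does (illustrative only, not the actual implementation; it assumes explicit zeros are dropped, as in the existing DIA → COO conversion):

import numpy as np
from scipy.sparse import csr_array

def dia_to_csr(offsets, data, shape):
    M, N = shape
    offsets = np.asarray(offsets)
    data = np.asarray(data)
    L = data.shape[1]            # stored diagonal length
    order = np.argsort(offsets)  # sorted offsets give sorted column indices
    indptr = np.zeros(M + 1, dtype=np.int64)
    indices, values = [], []
    for i in range(M):           # build one output row at a time
        for n in order:
            j = i + offsets[n]   # column of diagonal n in row i
            if 0 <= j < min(N, L) and data[n, j] != 0:
                indices.append(j)
                values.append(data[n, j])
        indptr[i + 1] = len(indices)
    return csr_array((values, indices, indptr), shape=shape)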
P.S.
  • All the benchmarks above were done without the --quick option for asv, for better reproducibility and more meaningful numbers.
  • When I tried to use the --compare main approach, it apparently started rebuilding the whole SciPy instead of just recompiling the changed files. Am I missing something?

UPD: Conversion to the CSC and COO formats can also be made faster by simply removing the current tocsc() and tocoo() methods; see below.

@github-actions bot added the scipy.sparse, C/C++ (items related to the internal C/C++ code base) and enhancement (a new feature or improvement) labels May 17, 2025
@MikhailRyazanov (Contributor, Author) commented May 18, 2025

By the way, simply removing the current Python + NumPy implementations of the .tocsc() and .tocoo() methods, so that these conversions are done indirectly as DIA → CSR → CSC/COO, in fact speeds them up by >50%:

============= ============= ============= ============= ============= ============= ============= =============
--                                                        to_format                                            
------------- -------------------------------------------------------------------------------------------------
 from_format       csr           csc           coo           dia           lil           dok           bsr     
============= ============= ============= ============= ============= ============= ============= =============
     dia         270±2μs       484±7μs       386±3μs       207±2ns      17.2±0.2ms    21.2±0.1ms     777±5μs   
============= ============= ============= ============= ============= ============= ============= =============

However, TestDIA::test_tocoo_gh10050 starts failing because the converted COO matrix has .has_canonical_format = False even though it is in fact canonical. This seems to be caused by the COO constructor indiscriminately setting

self.has_canonical_format = False

even for canonical data. So there are two possible solutions:

  1. Relax the test so that it fails only for unsafe results (meaning that m.has_canonical_format == False is safe when inds_are_sorted == True, as implied by the COO constructor, but the opposite is an error).
  2. Explicitly add tocoo() to _csr.py, transferring has_canonical_format to the conversion result (as far as I understand, canonical CSR always produces canonical COO); a sketch follows below. Might also be relevant for #22936 (TST: sparse: more tests for sparse has_canonical_format flag).
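A minimal sketch of the second option, assuming a method body inside the CSR class in _csr.py (not necessarily the eventual PR code):

def tocoo(self, copy=True):
    coo = super().tocoo(copy=copy)  # generic compressed -> COO conversion
    # A canonical CSR matrix (sorted column indices, no duplicates) converts
    # to a canonical COO matrix, so the flag can be transferred directly.
    coo.has_canonical_format = self.has_canonical_format
    return coo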

@MikhailRyazanov MikhailRyazanov marked this pull request as draft May 18, 2025 03:20
@MikhailRyazanov (Contributor, Author) commented May 18, 2025

While implementing the second “solution” from the previous comment, I've discovered missing bound checks for ill-formed DIA arrays (when diagonals are longer than the array width or lie completely outside it), so this is corrected in 677f10b.

The remaining new commits are related to eliminating the inefficient tocsc() and tocoo() methods and then dealing with the deficiencies in COO construction that this revealed. I'm not sure that the last change (1d03664) is completely correct, but it seems more appropriate than setting the default copy=True in all tocoo() methods (or assuming that internal/intermediate operations will copy the relevant data for methods with the default copy=False).

@MikhailRyazanov MikhailRyazanov marked this pull request as ready for review May 18, 2025 05:56
@MikhailRyazanov (Contributor, Author):
Here are more extensive benchmarks for arrays of various sizes, comparing conversions with arithmetic operations on an equal footing. “main:” means how it is now, “this:” means this PR; the arithmetic benchmarks are not affected.
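For context, poisson2d comes from SciPy's benchmark suite and builds the standard 5-point Laplacian; an equivalent construction of the n² × n² five-diagonal test matrix could look like this (a sketch, not the benchmark's exact code):

import numpy as np
from scipy.sparse import diags_array, eye_array, kron

def poisson2d(n):
    # 1-D second-difference (tridiagonal) operator
    d = diags_array([np.full(n - 1, -1.0), np.full(n, 2.0), np.full(n - 1, -1.0)],
                    offsets=[-1, 0, 1])
    # 2-D Laplacian on an n x n grid, as an n^2 x n^2 DIA array
    return (kron(eye_array(n), d) + kron(d, eye_array(n))).todia()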

Using poisson2d(20, ...) = 400 × 400 array:

      sparse.Conversion.time_conversion
      --                                      to_format
      ------------- ---------------------------------------------------------
       from_format      csr        csc       coo      lil     dok      bsr   
      ============= =========== ========= ========= ======= ======= =========
main:      dia      225±2μs     122±10μs  117±0.3μs 882±3μs 721±5μs 296±0.3μs
this:      dia       57.0±0.3μs 104±0.7μs 111±0.3μs 648±5μs 522±2μs 114±0.3μs

sparse.Arithmetic.time_arithmetic
--                              op
------------- ---------------------------------------
 format   XY     __add__      __sub__      multiply  
======== ==== ============= ============ ============
  dia     AA   43.0±0.5μs    43.3±0.4μs   64.9±0.5μs  

Using poisson2d(100, ...) = 10000 × 10000 array:

      sparse.Conversion.time_conversion
      --                                    to_format
      ------------- ---------------------------------------------------------------
       from_format      csr       csc      coo       lil        dok         bsr    
      ============= =========== ======== ======== ========== ========== ===========
main:     dia       1.74±0.05ms 866±40μs 691±10μs 19.8±0.5ms 22.4±0.3ms 2.28±0.06ms
this:     dia         270±2μs   484±7μs  386±3μs  17.2±0.2ms 21.2±0.1ms   777±5μs  

sparse.Arithmetic.time_arithmetic
--                               op
------------- -----------------------------------------
 format   XY     __add__       __sub__       multiply  
======== ==== ============= ============= =============
  dia     AA    99.8±0.2μs    101±0.3μs      168±1μs   

Using poisson2d(250, ...) = 62500 × 62500 array:

      sparse.Conversion.time_conversion
      --                                        to_format
      ------------- ----------------------------------------------------------------
       from_format     csr         csc        coo        lil       dok        bsr   
      ============= =========== ========== ========== ========= ========= ==========
main:      dia      15.5±0.6ms  7.31±0.6ms 8.89±1ms   126±1ms   142±1ms   19.3±0.6ms
this:      dia       4.55±0.3ms 6.65±0.1ms 5.24±0.3ms 114±0.7ms 130±0.4ms 7.99±0.2ms

sparse.Arithmetic.time_arithmetic
--                               op
------------- -----------------------------------------
 format   XY     __add__       __sub__       multiply  
======== ==== ============= ============= =============
  dia     AA     600±20μs      617±20μs    1.34±0.05ms 

So conversion from DIA becomes 3–6 times faster to CSR, >2 times faster to BSR, and somewhat faster to all other formats. It is still slower than simple arithmetic operations, but not by as much as it currently is (about an order of magnitude for the 10000 × 10000 size).

@dschult (Contributor) left a comment:

Thanks for this!
The timing results look good -- and I agree with your choice to implement the second approach to has_canonical_format in tocoo. Adding the tocoo method and setting has_canonical_format inside _csrbase.tocoo should work fine: the result of converting canonical CSR to COO is canonical. Reading the has_canonical_format property on CSR can trigger a canonicity check if that property has not been accessed before, but this is a case where that is desirable, IIUC.

This C++ code allocates the space for the output directly, whereas most other functions in this suite of tools have the allocation occur before calling and then "prune" the results after returning. I don't know enough about std::vector to know the impacts of that approach. And I see that the vectors get resized twice. But we don't have to prune after the return. Can you say a few words about this choice to allocate inside the C++ function and to resize twice? Are there advantages over preallocating and pruning via numpy?

The addition of copy=copy in COO __init__ as well as the very minor tweaks to the dia_matmat code look good.

And I have a request below for you to add a comment block at the top of the new C++ dia_tocsr function.

Thanks!

Comment on lines 419 to 421
idx_dtype = self._get_index_dtype(maxval=self.data.size)
order = np.argsort(self.offsets).astype(idx_dtype, copy=False)
indices, indptr, data = dia_tocsr(*self.shape, *self.data.shape,
                                  self.offsets, self.data, order)
# If possible, shrink indexing dtype using actual nnz.
idx_dtype = self._get_index_dtype(maxval=indptr[-1])
Contributor:

Typically for get_index_dtype we set maxval based on self.shape as well as nnz. This tocsr() uses self.data.size, which is slightly different. Can you say a little about that? self.data.size includes multiple diagonals and also all the zeros in those diagonals, but it doesn't directly include the shape of the matrix. Is it easy to see why this would be a not-too-high upper bound on the shape and nnz?

And when you recheck the index dtype after conversion, do we need to include the shape rather than just nnz?

Member:

+1, let's use self.shape here as well (in case we have indices on the minor axis that are really huge).

Contributor Author:

Done.

@MikhailRyazanov (Contributor, Author):

@dschult, thank you for reviewing so quickly! (This time I was busy, so I'm replying only now.)

> This C++ code allocates the space for the output directly, whereas most other functions in this suite of tools have the allocation occur before calling and then "prune" the results after returning. I don't know enough about std::vector to know the impacts of that approach. And I see that the vectors get resized twice. But we don't have to prune after the return. Can you say a few words about this choice to allocate inside the C++ function and to resize twice? Are there advantages over preallocating and pruning via numpy?

I'm not familiar enough with SciPy internals and strategies to be sure. Where can I read about that? And how is the connection between NumPy arrays and C++ vectors done internally (the '*V' and '*W' codes in _generate_sparsetools.py)?

The general idea was to reduce memory allocations and copying.

nnz is not used in the beginning because, as I understand, DIA doesn't store its value but calculates it using _getnnz() with a Python loop, so doing the same thing in C++ is faster (and trivial to implement). Also, pruning with scipy._lib._prune_array() makes a copy, which C++ vector::resize() avoids (it only changes the internal size variable, since memory reallocation is most likely unnecessary in this case).

On the other hand, now I see that _prune_array() actually tolerates up to a factor-of-two memory overhead. From this perspective, it might actually be fine to just directly allocate DIA's data.size for the output arrays, since nnz never exceeds data.size, and on average (for uniformly distributed diagonals) nnz ≈ data.size/2. Then all estimations and pruning could be removed completely. What do you think?
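For reference, _prune_array is essentially the following (paraphrased from scipy._lib._util as I read it, not copied verbatim):

import numpy as np

def _prune_array(array):
    # copy only if the array is a view wasting more than half of its base allocation
    if array.base is not None and array.size < array.base.size // 2:
        return array.copy()
    return array

buf = np.empty(100)
print(_prune_array(buf[:10]).base is None)  # True: copied
print(_prune_array(buf[:60]).base is buf)   # True: kept as a view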

> And I have a request below for you to add a comment block at the top of the new C++ dia_tocsr function.

Definitely will do. And I will also look at the rest slightly later.

P.S. By the way, there is a bug in _getnnz() for ill-constructed arrays:

>>> from scipy.sparse import dia_array
>>> d = dia_array(([[1, 2, 3]], [5]), shape=(3, 3))
>>> d.toarray()
array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])
>>> d.nnz
-2

Since sparse/tests currently does have some tests with such weird inputs, I suspect that _getnnz() must return 0 here.
Should I just add a commit to correct that? Or open a separate issue/PR?
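A bound-checked count is straightforward, something like the following sketch (my own illustration, not the eventual fix):

import numpy as np

def dia_getnnz(offsets, data, shape):
    # count stored elements that actually fall inside the M x N array
    M, N = shape
    L = np.asarray(data).shape[1]  # stored diagonal length
    nnz = 0
    for off in offsets:
        # diagonal `off` occupies columns max(off, 0) .. min(N, M + off, L) - 1
        nnz += max(min(N, M + off, L) - max(off, 0), 0)
    return nnz

print(dia_getnnz([5], [[1, 2, 3]], (3, 3)))  # 0, not -2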

@dschult (Contributor) commented May 23, 2025

Can you add a separate PR for the fix of DIA.nnz in ill-formed cases?
I think it will likely speed up both sets of changes to have them separated -- it helps focus the discussions.

I am intrigued by the idea of reducing allocation and copying -- with std::vector or using numpy allocations. But I suspect the current sparse approach looked at that question back in the day.

Perhaps @perimosocordiae has some perspective on the tradeoffs of "pre-allocate via numpy && _prune_array" vs "allocate and resize std::vector in C++".

@MikhailRyazanov (Contributor, Author):

@dschult,

> Can you add a separate PR for the fix of DIA.nnz in ill-formed cases?

Sure, see gh-23055.

Your review was in fact very helpful, because I've tried preallocating NumPy output arrays using the new _getnnz() and discovered that the overall conversion performance improves significantly. Apparently this is because sparsetools.cxx constructs an ndarray from a std::vector by always making a copy (not sure why this is necessary), while writing from C++ into an existing ndarray is direct.

So I'll put this PR on hold until gh-23055 is merged, and then will refactor to accommodate those changes.

@perimosocordiae (Member):

@MikhailRyazanov is this PR still in draft mode, or are you ready for review?


d1 = [[1]] # diagonal shorter than width
d3 = [[1, 2, 3]] # diagonal longer than width

cases = [(d1, [-1], 1, [[0, 0], [1, 0]]), # within
Member:

Let's convert each of these comments into a string, and include that in the assert message (i.e., as the third argument to assert_array_equal). This will make it a lot easier to figure out which case failed in the event of a test failure.
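For example, the requested pattern could look like this (a sketch with hypothetical case data, not the PR's actual test):

from numpy.testing import assert_array_equal
from scipy.sparse import dia_array

d1 = [[1]]  # diagonal shorter than width

cases = [(d1, [-1], [[0, 0], [1, 0]], "short diagonal within the array"),
         (d1, [5], [[0, 0], [0, 0]], "diagonal completely outside")]
for data, offsets, expected, label in cases:
    m = dia_array((data, offsets), shape=(2, 2))
    # the label appears in the failure message as the third argument
    assert_array_equal(m.tocsr().toarray(), expected, label)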

Contributor Author:

Done, please take a look.

@MikhailRyazanov (Contributor, Author):

> @MikhailRyazanov is this PR still in draft mode, or are you ready for review?

Still a draft. I need to refactor it using the gh-23055 changes (the recent commit was only to get them into this branch) and the discussions above, then I will reopen it as "ready for review".

@MikhailRyazanov (Contributor, Author):

@dschult, @perimosocordiae:
OK, here is the improved variant, with the output arrays allocated by NumPy based on nnz and then pruned, hopefully also addressing all the review comments above.

Benchmark comparisons for various sizes (see above); “main” is the current implementation from the main branch, “this” is this PR (other lines are explained below):

          sparse.Conversion.time_conversion
          --                                                 to_format
          ------------- ------------------------------------------------------------------------------------
           from_format        csr           csc          coo          lil           dok            bsr
          ============= ============== ============= ============ ============= ============= ==============
400 × 400:
  main:        dia       225±2μs        122±10μs      117±0.3μs    882±3μs       721±5μs       296±0.3μs
  this:        dia        71.5±0.09μs   122±0.5μs     123±0.3μs    661±1μs       544±6μs       131±0.1μs

  no nnz:      dia        51.9±2μs       97.5±0.3μs   106±0.8μs    635±3μs       511±1μs       109±0.5μs
  +copy:       dia        54.1±0.2μs    101±0.7μs     107±0.6μs    644±2μs       518±2μs       111±0.3μs

  cntnz:       dia        50.4±0.2μs     99.5±0.3μs   108±1μs      644±5μs       565±60μs      110±1μs

10000 × 10000:
  main:        dia       1.74±0.05ms    866±40μs      691±10μs     19.8±0.5ms    22.4±0.3ms    2.28±0.06ms
  this:        dia         175±1μs      392±4μs       290±7μs      17.2±0.08ms   18.8±0.1ms      692±8μs

  no nnz:      dia         157±0.5μs    366±3μs       266±2μs      16.9±0.08ms   18.6±0.06ms     662±3μs
  +copy:       dia         198±1μs      417±3μs       318±3μs      17.3±0.07ms   18.8±0.09ms     714±7μs

  cntnz:       dia         195±1μs      421±10μs      305±0.7μs    17.1±0.1ms    18.6±0.2ms      695±7μs

62500 × 62500:
  main:        dia       15.5±0.6ms     7.31±0.6ms    8.89±1ms     126±1ms       142±1ms       19.3±0.6ms
  this:        dia        1.06±0.02ms   5.94±0.07ms   3.05±0.6ms   113±0.5ms     126±0.3ms      6.80±0.06ms

  no nnz:      dia        1.07±0.04ms   5.76±0.08ms   3.66±0.9ms   113±1ms       127±0.5ms      6.81±0.2ms
  +copy:       dia        4.06±0.2ms    6.93±0.5ms    4.66±0.2ms   115±0.9ms     130±0.6ms      7.76±0.3ms

  cntnz:       dia        1.38±0.04ms   6.24±0.2ms    3.50±0.9ms   113±0.7ms     131±0.8ms      7.13±0.2ms

For small sizes, the performance is still better than the original, but slightly worse than in the previous attempt with nnz calculated during the conversion (likely because ._getnnz() adds a constant, size-independent overhead). However, for larger sizes, the performance gain is now even better than before.

I've also tried two alternative approaches:

  1. Using max_nnz = self.data.size instead of max_nnz = self.nnz, labeled “no nnz” in the benchmarks above. This improves the timings by ~20 μs, which is noticeable for small sizes, but is also more likely to trigger copying when pruning the output, with a huge impact for large sizes (see the “+copy” lines above, which artificially trigger it), plus higher memory usage (which might be very undesirable when sizes are large).
  2. Using self.count_nonzero() to get the exact output size before the conversion and thus avoid any memory overhead and reallocations (see the NumPy sketch after this list). This would be useless with the current count_nonzero() implementation, which is slower than the conversion itself, but reimplementing it in C++ (see the dia-cntnz branch in my fork) gives a ~7-fold speed-up; the overall results are shown as “cntnz” above. They are quite good for small sizes, but for large sizes are ~25% slower than “this”, although still ~10× faster than “main”, and they cannot degrade to the worst-case “+copy”. However, the additional C++ function (dia_count_nonzero()) increases the library size by about as much as this PR does (with only dia_tocsr() added), so this approach would be reasonable only if a fast count_nonzero() for DIA arrays is useful on its own.
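A NumPy sketch of the exact-count idea (illustrative; the dia-cntnz branch implements it in C++ as dia_count_nonzero()):

import numpy as np

def dia_count_nonzero(offsets, data, shape):
    M, N = shape
    data = np.asarray(data)
    total = 0
    for off, diag in zip(offsets, data):
        # in-bounds column range of this diagonal
        start = max(off, 0)
        end = min(N, M + off, len(diag))
        if end > start:
            total += np.count_nonzero(diag[start:end])
    return total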

I don't think that the first alternative is really better, despite its slightly better best-case performance. The second alternative has some advantages and some disadvantages, so I'm not sure. What do you think?

@MikhailRyazanov MikhailRyazanov marked this pull request as ready for review June 21, 2025 01:58
j = i + offsets[n]; // column
if (j < 0 or j >= j_end)
    continue;
const T x = (data + npy_intp(L) * n)[j];
Member:

Let's use a different indexing expression here, to be more consistent with the rest of the usages in sparsetools:

data[L*n + j]

@MikhailRyazanov (Contributor, Author) commented Jun 23, 2025:

This was basically to be consistent with the idiom

const T* diag = diags + npy_intp(L) * n;
... diag[j] ...;

used throughout dia.h.
Do you also want to change other occurrences, like line 219 here:

// element at i-th row, k-th column in A
const T a = (A_data + npy_intp(A_L) * n)[k];
// k-th row in B
const T* B_row = B_data + npy_intp(B_cols) * k;
// loop over columns in current output row
for (I j = 0; j < cols; ++j)
    row[j] += a * B_row[j];

to const T a = A_data[npy_intp(A_L) * n + k];?

Member:

Yeah, let's avoid mixing pointer arithmetic and indexing in the same expression.

Contributor Author:

Done!

@perimosocordiae (Member):

Thanks @MikhailRyazanov! Based on your benchmark results, I think it's reasonable to keep the current approach (max_nnz = self.nnz) for now. We can then test the more precise count_nonzero() version as a follow-up (if you're interested).

@dschult (Contributor) left a comment:

This looks good to me!
Thanks @MikhailRyazanov !

@dschult dschult merged commit 47d0ffb into scipy:main Jul 3, 2025
38 checks passed
@j-bowhay j-bowhay added this to the 1.17.0 milestone Jul 3, 2025
@lucascolley lucascolley added the needs-release-note label (a maintainer should add a release note written by a reviewer/author to the wiki) Nov 21, 2025
@nickodell nickodell removed the needs-release-note label Nov 22, 2025
Successfully merging this pull request may close these issues:

ENH: Implement all relevant methods for sparse.dia_array