
ENH: sparse: Generalize coo_array to support 1d shapes #18530

Merged
merged 20 commits into scipy:main on Dec 23, 2023

Conversation

@perimosocordiae (Member) commented May 25, 2023

This PR rewrites scipy.sparse.coo_array in a way that theoretically could support arbitrary n-dimensional shapes. In practice, coo_array will only fully support 1d and 2d shapes. The coo_matrix subclass will preserve its existing 2d-only semantics.

The main transformation is replacing the individual .row and .col attribute arrays with a single .indices attribute that holds a tuple of index arrays. We maintain the invariant that len(x.shape) == len(x.indices) == x.ndim.

Support checklist:

  • constructors (shape, dense, sparse, tuple)
  • reshape
  • nnz
  • transpose (including axes=(...) keyword)
  • resize
  • tocsc / tocsr / todia
    • these probably only make sense for the 2d case, so we raise an exception
    • as of now todok also raises, but we could easily add n-d support there too
  • tocoo
  • diagonal and _setdiag
    • raises an error for non-2d cases
  • _with_data
  • sum_duplicates
  • eliminate_zeros
  • toarray (1d and 2d only)
  • _add_dense (1d and 2d only)
  • _mul_vector (1d and 2d only)
  • _mul_multivector (1d and 2d only)

Note: the last 4 items on the checklist are routing 1d array data through the 2d native code paths, which requires the creation of a dummy col array containing all zeros. If we want to squeeze the last bit of performance out of these code paths, we'll want to implement specialized C++ overloads. That's out of scope for this PR, though.
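The routing trick can be sketched in plain NumPy (an illustration of the idea only, not the actual C++ code path; which axis receives the all-zeros dummy array is an internal detail):

```python
import numpy as np

# 1d COO data: values `data` stored at positions `idx` in a length-5 array.
idx = np.array([0, 2, 3])
data = np.array([1.0, 2.0, 3.0])

# Reuse a 2d scatter kernel by pairing the real index array with an
# all-zeros dummy index array, i.e. viewing the 1d array as shape (1, 5).
dummy = np.zeros_like(idx)
dense2d = np.zeros((1, 5))
dense2d[dummy, idx] = data

assert np.array_equal(dense2d.ravel(), np.array([1.0, 0.0, 2.0, 3.0, 0.0]))
```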

@perimosocordiae perimosocordiae added scipy.sparse enhancement A new feature or improvement labels May 25, 2023
@rgommers (Member)

Interesting. Are you discussing the scope of changes, or putting together some sort of roadmap/plan in Seattle, @perimosocordiae? I'd expect general n-D support for scipy.sparse to be infeasible, and pure-Python changes for COO only not to be competitive with PyData Sparse (which already has an n-D COO class with a better, numpy-compatible API than scipy.sparse has).

@perimosocordiae (Member, Author)

We're planning to put together a report of the various scipy.sparse work we're doing here in Seattle (including a roadmap for future work) hopefully by the end of day tomorrow.

This PR is a bit exploratory, to see how feasible basic n-D support would be. My initial goal is to get 1-d COO arrays working reasonably well, and then take a look at other formats. With the ongoing change to use array semantics, users will start asking for sparse 1-d arrays (as the result of foo[:, 3] for example), and I don't want to tell them that they have to stop using scipy.sparse entirely.

@perimosocordiae (Member, Author)

It's not exactly a roadmap, but the final report from Seattle is available here: https://hackmd.io/1Q2832LDR_2Uv_-cV-wnYg

@rgommers (Member)

Thanks for sharing - looks like that was a productive sprint!

My main points of feedback on the "what remains to be done" are:

  1. It seems to me like the key goal of the introduction of sparse array classes is to get rid of the need for np.matrix, so it can be removed from numpy and we get rid of * working like @. That is likely to involve deprecation and removal of *_matrix classes - but that goal and the plan for deprecation is left completely implicit. It'd be great to make that explicit.
  2. n-D support for scipy.sparse does not seem feasible (given the state of the C++ code) nor the right plan (accelerating PyData Sparse development seems much preferred).

@perimosocordiae (Member, Author)

@rgommers I completely agree on (1). Would you like to see the roadmap updated here? https://docs.scipy.org/doc/scipy/dev/roadmap-detailed.html#sparse

For (2), I've been trying to keep this PR's implementation as general as possible rather than special-casing the 1d and 2d shapes, as I find this leads to simpler code. We'll probably decide to limit users to those shapes in the end, but it has been useful to test the boundaries of what we can naturally support.

@rgommers (Member)

@rgommers I completely agree on (1). Would you like to see the roadmap updated here? https://docs.scipy.org/doc/scipy/dev/roadmap-detailed.html#sparse

I'd suggest opening a new RFC issue with the plan. This may be the most impactful deprecation we've ever done, and it's unclear if we should go ahead with that and on what timescale. It requires discussion with scikit-learn and other major users. Scikit-learn is particularly conservative, and their minimum SciPy version right now is 1.5.0: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/_min_dependencies.py#L17. So it'd be great to work out if and when *_matrix classes can be deprecated, how long they should stick around after that, and then also what can be done with np.matrix.

@perimosocordiae (Member, Author)

@dschult @stefanv @rossbar @ivirshup @jjerphan This should now be ready for review!

@ivirshup (Contributor) left a comment

🤩

Just taking a look now. Do you have a solid idea of what is in scope and what should be follow-up? Before taking too close a look at the code, some of the things I tried were csr[1, :] and concatenation.

One possibly in scope improvement:

.toarray for nD

I think it should be easy to do .toarray for nD. Here's an implementation:

from scipy import sparse
import numpy as np

def to_dense(coo, out=None):
    # Scatter the stored entries into a dense array. Assumes canonical
    # format: with duplicate indices, entries overwrite rather than sum.
    if out is None:
        out = np.zeros(coo.shape, dtype=coo.dtype)
    out[coo.indices] = coo.data
    return out

This can be faster than the current implementation for 2d:

coo = sparse.coo_array(
    sparse.random(10_000, 10_000, density=.01, random_state=np.random.default_rng())
)

%timeit coo.toarray()
# 457 ms ± 4.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit to_dense(coo)
# 413 ms ± 5.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And works for nD:

dense = np.random.poisson(size=(6, 6, 6))
coo = sparse.coo_array(dense)
assert np.array_equal(dense, to_dense(coo))

@ivirshup (Contributor)

As discussed in the meeting, we should come up with a list of things that need implementing and figure out which are being pushed to follow up PRs. Features include:

  • more testing
  • toarray() that doesn't go through the current native code path
  • indexing
    • For coo arrays
    • So other classes (e.g. CSC, CSR) return 1d coo
  • broadcasting for math ops

@dschult (Contributor) commented Jul 22, 2023

I've got a rudimentary version of TestCommon1d. Meaning it does things like "add two sparse arrays" or "multiply by scalars". The tests show that coo 1d arrays are not yet able to do some arithmetic operations, e.g. the _spbase class handles _add_sparse and uses .tocsr() which raises for 1D arrays.

I have thought about this occasionally over the past two weeks. Rather than overriding _add_sparse in the _coobase class, or special-casing 1D in tocsr, I would prefer we merge this PR knowing that it doesn't provide a full suite of 1D functionality. The next steps toward more functionality might be to get _base.py and _csr.py working with 1d. Luckily, in 1D, the csr and coo formats are very similar: indptr in 1D becomes [0, self.nnz], and the indices have to be (un)wrapped in a tuple in the conversion. So I think it should be fairly straightforward to move from the coo format to the csr format for 1D.
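The 1D COO-to-CSR correspondence described above is small enough to sketch in plain NumPy (hypothetical variable names, not SciPy's implementation):

```python
import numpy as np

# A 1d sparse array of length 10 in COO form: a one-tuple of index
# arrays plus a data array.
coo_indices = (np.array([1, 4, 7]),)
data = np.array([5.0, 6.0, 7.0])
nnz = len(data)

# CSR equivalent in 1d: indptr degenerates to [0, nnz], and the COO
# index array is unwrapped from its tuple to become the CSR indices.
indptr = np.array([0, nnz])
csr_indices = coo_indices[0]

assert indptr.tolist() == [0, 3]
assert csr_indices.tolist() == [1, 4, 7]
```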

This PR provides a nice coo format with 1d support including matmul but not addition and a good foundation for an eventual nd. Following this we will have a separate PR that allows 1D in csr, csc classes (and base class support). Thus at each PR's merge, the main repo supports all existing 2d operations and incrementally adds 1D capabilities.

Thoughts?

@jjerphan (Contributor) left a comment

Thank you, @perimosocordiae!

Here is a first review of this contribution.

@jjerphan (Contributor) left a comment

LGTM.

Thank you, @perimosocordiae.

@@ -732,13 +735,18 @@ class coo_matrix(spmatrix, _coo_base):
ndim : int
Number of dimensions (this is always 2)
nnz
Number of stored values, including explicit zeros
size

Suggested change
size
size
Number of values, including implicit zeros.

has_canonical_format : bool
Whether the matrix has sorted indices and no duplicates
format
T

Suggested change
T
T
The transpose of the matrix.

data
COO format data array of the matrix
row
COO format row index array of the matrix
col
COO format column index array of the matrix
has_canonical_format : bool
Whether the matrix has sorted indices and no duplicates
format

Suggested change
format
format
A string for the format of the matrix, here "coo".

@@ -310,9 +310,54 @@ def format(self) -> str:
"""Format string for matrix."""
return self._format

@property
def A(self) -> np.ndarray:
"""DEPRECATED: Return a dense array.
Contributor:

Side-note: Should we remove the usage of those deprecated properties in SciPy then?

Member Author:

Yeah, we should.

Contributor:

Is removing usage of deprecated properties in SciPy something we should do in this PR, or can/should it be a separate PR?

@dschult (Contributor) commented Oct 25, 2023

I've made a PR to this PR's branch with the following:

Some misc fixes of remaining errors for coo-1d.

  1. Mypy error on dtype not being defined in _spbase.
    We could set it to something, expecting it to be overwritten by the subclass. But self.dtype.type appeared before this PR. It turns out that adding -> str typing in the function signature caused stricter checking by mypy. I removed the -> str, thinking we can fix the mypy error on dtype in a different PR. If you prefer, we can define it to be a vacuous value that will be overwritten. I think with typing you can also declare the type without assigning a value, but that choice is yours to make. :)

  2. Mypy error on max_size = np.prod(shape) because the result might not be an integer. I changed it to manually multiplying the values (which is 100 times faster anyway, though both are very fast).

  3. den.resize(arg) was raising because existing references to the array prevent resizing. So I added refcheck=False, which was suggested by the test message. The reference in question is res, which we create ourselves, so we know it isn't going to cause problems. [This occurs in 3 tests of resize within test_coo.py.]

  4. a test error complaining about intc not equaling int32 when we used type(res) == type(exp). That idiom is used in a few other tests to check that both are ndarrays. But in the 1d _mul_vector case, the result is a scalar, so the type becomes the dtype of the original arrays, and that can be intc for scipy and int32 for numpy. I don't think we need to test the type of the results; we should check that the result is a scalar and then check that the values are equal. The np.isscalar documentation suggests checking for a scalar via res.ndim == 0, so I replaced the existing check with that. The next line checks that res equals exp, so they both must be scalars.

  5. errors due to bounds exceeded within np.ravel_multi_index inside _coo_base.reshape(). This had a TODO comment (which took me way too long to notice) pointing to a PR where the reshape code accounts for this limitation of ravel_multi_index. The 1d case was actually easy to handle (the index is already flat), so I added code to mimic what was there before (with indices instead of row and col). It is passing tests and hopefully OK.

  6. I removed white space on two empty lines in test_coo.py.

I think this could fix the failed tests in Linux Meson, Windows and Mypy. That should give us a green check if all goes well. Fingers crossed.
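As an aside on item 2, the np.prod(shape) issue comes from the fact that it returns a fixed-width NumPy scalar rather than a Python int; a small illustration (not the PR's exact code):

```python
import math
import numpy as np

shape = (200_000, 200_000)

# Multiplying Python ints keeps arbitrary precision and never overflows.
max_size = 1
for dim in shape:
    max_size *= dim
assert max_size == 40_000_000_000
assert max_size == math.prod(shape)

# np.prod returns a fixed-width NumPy scalar, not a Python int,
# which is what trips up both mypy and (on narrow platforms) overflow.
assert not isinstance(np.prod(shape), int)
```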

@stefanv stefanv modified the milestone: 1.12.0 Dec 5, 2023
@stefanv (Member) commented Dec 5, 2023

We intend to postpone these changes until after the 1.12 branch.

@dschult (Contributor) left a comment

I've reviewed the use of issparse in scipy/sparse and found only 3 places in the core package that might lead to a consuming exception when passed a 1d array. It looks like the csgraph stuff is protected already.

But the linalg stuff has not been protected against 1d (or 3d) pydata-sparse arrays. (It handles pydata.sparse and scipy.sparse in the same code-blocks -- and assumes they are 2d). That is evidence that making coo 1d is unlikely to disrupt linalg functions much.
But maybe a separate PR protecting linalg from non-2d sparse arrays would be appropriate. I can take a stab at a separate PR. Or I could put it into this PR if that's preferred.

I've made a PR to this PR with hopeful resolutions. The 3 places in scipy/sparse class code where 1d might cause trouble are as follows:

  1. In _spbase._mul_dispatch we should avoid sending the 1d array on to _mul_sparse_matrix. We could either raise an exception or, for now, convert to dense via self._mul_vector(other.toarray()). In the PR-to-this-PR I raised a ValueError.
  2. In _compressed, the multiply function doesn't handle 1d sparse well unless both self and other are the same shape (where a ValueError says 1d cannot convert to csr). So I put a TypeError about broadcasting not being ready for 1d into the PR-to-this-PR.
  3. In _construct the block_diag function doesn't handle a 1d block. We could either raise an exception or reshape the 1d input as a 1xN 2d array. In the PR-to-this-PR it reshapes to a 2d row array.

I also found a place to switch to self._shape_as_2d in _base.py in the PR.
And I had a question about a tuple (below). :}

else:
coo = arg1.tocoo()
self.row = coo.row
self.col = coo.col
self.indices = tuple(coo.indices)
Contributor:

Do we need the tuple(...)? We don't make a copy of the data, so maybe we shouldn't be making a copy of the indices either.

Member Author:

The tuple() call here is doing a shallow copy: it makes a new tuple, but the arrays inside that tuple are not copied.
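A tiny demonstration of that shallow-copy behavior:

```python
import numpy as np

row = np.array([0, 1, 2])
col = np.array([3, 4, 5])
indices = tuple([row, col])   # build a new tuple container

assert indices[0] is row      # the arrays themselves are not copied
assert indices[1] is col
indices[0][0] = 99            # mutating an array is visible through both names
assert row[0] == 99
```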

@dschult (Contributor) commented Dec 8, 2023

Further investigation of the _construct.py module shows that using 1d in e.g. kron needs a fairly easy fix. But block_array, hstack etc. are more complicated. In Numpy, numpy.hstack raises an exception if the blocks are not all the same dimension. So we should probably add some code to do that here too.

This seems to be getting away from the focus of this PR though. Rewriting _construct.py to handle both 1d and 2d should probably be a different PR. The exception obtained when trying to use a 1d-coo_array in kron and vstack (after merging the main version of _construct.py) is IndexError: tuple index out of range because A.shape[1] doesn't exist.

So, I think we will need a PR to get _construct.py working with 1d. We could do it in this PR, but it would delay things.

Note for future nd consideration:
As we move to nd we might want to reuse some code from numpy.block to figure out how to handle concatenation/block dimensions. There are a bunch of helper functions in there. It'd be great to copy the input processing and just replace the core concatenation routine. It would be a shame to have to reconstruct the handling of nd-list representations of blocks. I'm sure reusing that code would not be easy either... but probably better than rebuilding the logic from scratch.

@perimosocordiae perimosocordiae changed the title ENH: sparse: Generalize coo_array to n dimensions ENH: sparse: Generalize coo_array to support 1d shapes Dec 18, 2023
@perimosocordiae (Member, Author)

Ok, all the nd wording is now replaced with 1d, and the code itself checks that shapes are either 1d or 2d.

Assuming that CI is green, I'd like to get this merged soon to maximize the time in which we can shake out any unforeseen issues before the 1.13 release.

Many many thanks to @dschult, @stefanv, @jjerphan, @ivirshup, and all the other contributors who have been so instrumental in making this PR happen!

@stefanv (Member) left a comment

Thanks @perimosocordiae! I left a few questions, but none that require changes, so let's get this in.

return len(self._shape)

@property
def _shape_as_2d(self):

Name is OK for now, but needs to be changed if we ever support any other dimensionality than 1 and 2.

self.has_canonical_format = True

if dtype is not None:
self.data = self.data.astype(dtype, copy=False)

self._check()

@property
def row(self):
return self.indices[-2] if self.ndim > 1 else np.zeros_like(self.col)

Again special cased for 1/2; should we add comments where these special cases occur?

# When reducing the number of dimensions, we need to be careful about
# index overflow. This is why we can't simply call
# `np.ravel_multi_index()` followed by `np.unravel_index()` here.
flat_indices = _ravel_indices(self.indices, self.shape, order=order)

Is this a bug/lack of functionality in ravel_multi_index? If so, happy to file a NumPy issue.


# Handle copy here rather than passing on to the constructor so that no
# copy will be made of new_row and new_col regardless
# copy will be made of `new_indices` regardless.

Admittedly, this comment doesn't make a lot of sense to me, but perhaps will if I dig deeper into the machinery.

self.row = np.asarray(self.row, dtype=idx_dtype)
self.col = np.asarray(self.col, dtype=idx_dtype)
for i, idx in enumerate(self.indices):
if idx.dtype.kind != 'i':

What is this meant to catch?

if len(set(axes)) != self.ndim:
raise ValueError("repeated axis in transpose")
elif axes != (1, 0):
raise ValueError("Sparse matrices do not support an 'axes' "

Is this again because we are limited to 2D?

@stefanv stefanv merged commit ccee92e into scipy:main Dec 23, 2023
28 checks passed
@jjerphan (Contributor) commented Jan 4, 2024

FYI, this new implementation of coo_array created a regression in scikit-learn.

See scikit-learn/scikit-learn#28047.

@lesteve (Contributor) commented Jan 4, 2024

To give a bit more context, the change of behaviour happens when creating sparse arrays from a 1d array-like object. This was possible before this PR and would create a sparse array with a single row.

With this PR the following code

from scipy import sparse

sparse.csr_array([1., 2., 3.])

now yields an error:

ValueError: Cannot convert a 1d sparse array to csr format

We were able to adapt the scikit-learn tests without too much work, but this does feel like a breaking change for sparse arrays. I'll leave it to the scipy maintainers to weigh this breaking change against the functionality this PR enables.

Here is a summary of the things that broke in scikit-learn tests:

  • sparse arrays created from a 1d array-like used to be fine and create a 2d sparse container with a single row, e.g. csr_array([1., 2., 3.]). After this PR you get an error. To be honest, there did not seem to be a particularly good reason why we were creating sparse containers from 1d array-like, other than it works and creates the data we want ...
  • on the same topic coo_array([1., 2., 3.]) is 1d after this PR and used to be 2d, this could affect some downstream code, although scikit-learn tests were not affected by this (we probably don't use coo format that much I guess)
  • the last change is a bit more edge-casy and is probably less of an issue. In our tests we create sparse arrays with 64 bit indices (mostly to check that they are supported in some scikit-learn code), we were doing things like:
    array = coo_array([1., 2.])
    array.row = array.row.astype('int64')
    array.col = array.col.astype('int64')
    After this PR the source of truth seems to be .indices and we now do:
    array.indices = tuple(each.astype('int64') for each in array.indices)

@dschult (Contributor) commented Jan 5, 2024

Creating a sparse array from a 1d array-like will now try to create a 1d array.
At the moment, coo is the only allowed format, though dok and probably csr should soon be possible too. Before the next release this creation statement will construct a 1d array instead of raising an error.

But... it is definitely going to trip up existing code, since 1d array-like input creates a 2d sparse matrix but a 1d sparse array. We will need to make that clear in any transition document, because it is a desired change. I suppose we could raise an error for sparse matrices too (no creating a 2d matrix from 1d array-like input). But that seems extreme.

The other important feature this issue brings to light is that A.row for a 1d array returns np.zeros_like(self.col), so it has the dtype of A.col. Thus code that used to convert the dtypes of the index arrays like this:

A.row = A.row.astype(new_dtype)
A.col = A.col.astype(new_dtype)

won't change the dtype of the row array, because A.row gets created on the fly as an array of zeros with the dtype of A.col. By the time we change the dtype of A.col, it is too late to change the already-created A.row. It would work to reverse the order:

A.col = A.col.astype(new_dtype)
A.row = A.row.astype(new_dtype)

This works, because col gets changed before row gets created.

I think it is probably better to reduce surprises by raising on A.row when A is 1d. There are no row indices -- and while we can construct some to make it "like" a 2d array with rows and columns, it really isn't. Returning an array of zeros constructed just for the row property does not allow interactive adaptation of the value of the row indices.

Proposal:

  • make A.row raise an exception when A is 1d.
  • make a lot of noise for the transition about making sure that array-like input is 2d when you want 2d output.

The second could be facilitated by changing spmatrix to raise on 1d array-like input. But I don't propose that.

@lesteve (Contributor) commented Jan 5, 2024

OK thanks for the detailed answer, this helps a lot to understand the general direction.

When support for 1d csr/csc is added in the scipy development version, we will probably want to test thoroughly what happens in scikit-learn. I would guess that some of our code assumes issparse implies 2d, and that feeding it a 1d sparse array may produce some not-very-user-friendly errors.

I agree that making spmatrix raise an error when constructed from a 1d array does not seem desirable.
