New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: sparse: Generalize coo_array to support 1d shapes #18530
Conversation
Interesting. Are you discussion the scope of changes or some sort of roadmap/plan in Seattle @perimosocordiae? I'd expect general n-D support for |
We're planning to put together a report of the various This PR is a bit exploratory, to see how feasible basic n-D support would be. My initial goal is to get 1-d COO arrays working reasonably well, and then take a look at other formats. With the ongoing change to use array semantics, users will start asking for sparse 1-d arrays (as the result of |
It's not exactly a roadmap, but the final report from Seattle is available here: https://hackmd.io/1Q2832LDR_2Uv_-cV-wnYg |
Thanks for sharing - looks like that was a productive sprint! My main points of feedback on the "what remains to be done" are:
|
@rgommers I completely agree on (1). Would you like to see the roadmap updated here? https://docs.scipy.org/doc/scipy/dev/roadmap-detailed.html#sparse For (2), I've been trying to keep this PR's implementation as general as possible rather than special-casing the 1d and 2d shapes, as I find this leads to simpler code. We'll probably decide to limit users to those shapes in the end, but it has been useful to test the boundaries of what we can naturally support. |
I'd suggest opening a new RFC issue with the plan. This may be the most impactful deprecation we've ever done, and it's unclear if we should go ahead with that and on what timescale. It requires discussion with scikit-learn and other major users. Scikit-learn is particularly conservative, and their minimum SciPy version right now is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤩
Just taking a look now. Do you have a solid idea of what is in scope what should be follow up? Before taking to close a look at the code some of the things I tried were: csr[1, :]
and concatenation.
One possibly in scope improvement:
.toarray
for nD
I think it should be easy to do .toarray
for nD. Here's an implementation:
from scipy import sparse
import numpy as np
def to_dense(coo, out=None):
if out is None:
out = np.zeros(coo.shape, dtype=coo.dtype)
out[coo.indices] = coo.data
return out
This can be faster than the current implementation for 2d:
coo = sparse.coo_array(
sparse.random(10_000, 10_000, density=.01, random_state=np.random.default_rng())
)
%timeit coo.toarray()
# 457 ms ± 4.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit to_dense(coo)
# 413 ms ± 5.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
And works for nD:
dense = np.random.poisson(size=(6, 6, 6))
coo = sparse.coo_array(dense)
assert np.array_equal(dense, to_dense(coo))
As discussed in the meeting, we should come up with a list of things that need implementing and figure out which are being pushed to follow up PRs. Features include:
|
I've got a rudimentary version of I have thought about this occasionally over the past two weeks. Rather than overriding This PR provides a nice Thoughts? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @perimosocordiae!
Here is a first review of this contribution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Thank you, @perimosocordiae.
scipy/sparse/_coo.py
Outdated
@@ -732,13 +735,18 @@ class coo_matrix(spmatrix, _coo_base): | |||
ndim : int | |||
Number of dimensions (this is always 2) | |||
nnz | |||
Number of stored values, including explicit zeros | |||
size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
size | |
size | |
Number of values, including implicit zeros. |
scipy/sparse/_coo.py
Outdated
has_canonical_format : bool | ||
Whether the matrix has sorted indices and no duplicates | ||
format | ||
T |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
T | |
T | |
The transpose of the matrix. |
scipy/sparse/_coo.py
Outdated
data | ||
COO format data array of the matrix | ||
row | ||
COO format row index array of the matrix | ||
col | ||
COO format column index array of the matrix | ||
has_canonical_format : bool | ||
Whether the matrix has sorted indices and no duplicates | ||
format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
format | |
format | |
A string for the format of the matrix, here "coo". |
scipy/sparse/_base.py
Outdated
@@ -310,9 +310,54 @@ def format(self) -> str: | |||
"""Format string for matrix.""" | |||
return self._format | |||
|
|||
@property | |||
def A(self) -> np.ndarray: | |||
"""DEPRECATED: Return a dense array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side-note: Should we remove the usage of those deprecated properties in SciPy then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we should.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is removing usage of deprecated properties from usage in SciPy something we should do in this PR or can/should it be a separate PR?
fe1dd04
to
91f7b7c
Compare
91f7b7c
to
45aa3fa
Compare
I've made a PR to this PR's branch with the following: Some misc fixes of remaining errors for coo-1d.
I think this could fix the failed tests in Linux Meson, Windows and Mypy. That should give us a green check if all goes well. Finger's crossed. |
fix misc tests and mypy errors
manually unravel to avoid 32bit overflows
Switch to leftpadding self.shape in coo-1d arrays when using 2d tools
We don't want to support dimensions > 2 just yet.
We intend to postpone these changes until after the 1.12 branch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reviewed the use of issparse
in scipy/sparse
and found only 3 places in the core package that might lead to a consuming exception when passed a 1d array. It looks like the csgraph
stuff is protected already.
But the linalg
stuff has not been protected against 1d (or 3d) pydata-sparse arrays. (It handles pydata.sparse and scipy.sparse in the same code-blocks -- and assumes they are 2d). That is evidence that making coo 1d is unlikely to disrupt linalg
functions much.
But maybe a separate PR protecting linalg from non-2d sparse arrays would be appropriate. I can take a stab at a separate PR. Or I could put it into this PR if that's preferred.
I've made a PR to this PR with hopeful resolutions. The 3 places in scipy/sparse
class code where 1d might cause trouble are as follows.:
- In
_spbase._mul_dispatch
we should avoid sending the 1d-array on to_mul_sparse_matrix
. We could either raise an exception, or convert to dense for now toself._mul_vector(other.toarray())
. In PR-PR I raised a ValueError. - In
_compressed
, themultiply
function doesn't handle 1d sparse well unless bothself
andother
are the same shape (where a ValueError says 1d cannot convert to csr). So I put a TypeError about broadcasting not ready for 1d into the PR-PR. - In
_construct
theblockdiag
function doesn't handle a 1d block. We could either raise an exception, or reshape 1d as a 1xN 2d array. In PR-PR it reshapes to a 2d row array.
I also found a place to switch to self._shape_as_2d
in _base.py in the PR.
And I had a question about a tuple
(below). :}
else: | ||
coo = arg1.tocoo() | ||
self.row = coo.row | ||
self.col = coo.col | ||
self.indices = tuple(coo.indices) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the tuple(...)
? We don't make a copy of the data, so maybe we shouldn't be making a copy of the indices either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tuple()
call here is doing a shallow copy: it makes a new tuple, but the arrays inside that tuple are not copied.
Further investigation of the This seems to be getting away from the focus of this PR though. Rewriting So, I think we will need a PR to get Note for future |
Handle 1d with care outside of _coo.py
Ok, all the nd wording is now replaced with 1d, and the code itself checks that shapes are either 1d or 2d. Assuming that CI is green, I'd like to get this merged soon to maximize the time in which we can shake out any unforeseen issues before the 1.13 release. Many many thanks to @dschult, @stefanv, @jjerphan, @ivirshup, and all the other contributors who have been so instrumental in making this PR happen! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @perimosocordiae! I left a few question, but none that require changes, so let's get this in.
return len(self._shape) | ||
|
||
@property | ||
def _shape_as_2d(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name is OK for now, but needs to be changed if we ever support any other dimensionality than 1 and 2.
self.has_canonical_format = True | ||
|
||
if dtype is not None: | ||
self.data = self.data.astype(dtype, copy=False) | ||
|
||
self._check() | ||
|
||
@property | ||
def row(self): | ||
return self.indices[-2] if self.ndim > 1 else np.zeros_like(self.col) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again special cased for 1/2; should we add comments where these special cases occur?
# When reducing the number of dimensions, we need to be careful about | ||
# index overflow. This is why we can't simply call | ||
# `np.ravel_multi_index()` followed by `np.unravel_index()` here. | ||
flat_indices = _ravel_indices(self.indices, self.shape, order=order) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a bug/lack of functionality in ravel_multi_index
? If so, happy to file a NumPy issue.
|
||
# Handle copy here rather than passing on to the constructor so that no | ||
# copy will be made of new_row and new_col regardless | ||
# copy will be made of `new_indices` regardless. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Admittedly, this comment doesn't make a lot of sense to me, but perhaps will if I dig deeper into the machinery.
self.row = np.asarray(self.row, dtype=idx_dtype) | ||
self.col = np.asarray(self.col, dtype=idx_dtype) | ||
for i, idx in enumerate(self.indices): | ||
if idx.dtype.kind != 'i': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this meant to catch?
if len(set(axes)) != self.ndim: | ||
raise ValueError("repeated axis in transpose") | ||
elif axes != (1, 0): | ||
raise ValueError("Sparse matrices do not support an 'axes' " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this again because we are limited to 2D?
FYI, this new implementation of |
To give a bit more context, the change of behaviour happens when creating sparse arrays from a 1d array-like object. This was possible before this PR and would create a sparse array with a single row. With this PR the following code from scipy import sparse
sparse.csr_array([1., 2., 3.]) now yields an error:
We were able to adapt the scikit-learn tests without too much work, but this does feel like a breaking change for sparse arrays. I'll leave it to scipy maintainers to reflect about the trade-off between this breaking change and the functionality allowed by this PR. Here is a summary the things that broke in scikit-learn tests:
|
Creating a sparse array from a 1d array-like will now try to create a 1d array. But... it is definitely going to mix up existing code that 1d array-like input creates 2d sparse matrix but a 1d sparse array. We will need to make that clear in any transition document because it is a desired change. I suppose we could raise an error for that code with sparse matrices (no creating 2d matrix with 1d array-like input). But that seems extreme. The other important feature this issue brings to light is that A.row.astype(new_dtype)
A.col.astype(new_dtype) won't change the type of the A.col.astype(new_dtype)
A.row.astype(new_dtype) This works!! because the I think it is probably better to reduce surprises by raising on Proposal:
The second could be facilitated by changing |
OK thanks for the detailed answer, this helps a lot to understand the general direction. When support to 1d I agree that making |
This PR rewrites
scipy.sparse.coo_array
in a way that theoretically could support arbitrary n-dimensional shapes. In practice,coo_array
will only fully support 1d and 2d shapes. Thecoo_matrix
subclass will preserve its existing 2d-only semantics.The main transformation is replacing the individual
.row
and.col
attribute arrays with a single.indices
attribute that holds a tuple of index arrays. We maintain the invariant thatlen(x.shape) == len(x.indices) == x.ndim
.Support checklist:
reshape
nnz
transpose
(includingaxes=(...)
keyword)resize
tocsc
/tocsr
/todia
todok
also raises, but we could easily add n-d support there tootocoo
diagonal
and_setdiag
_with_data
sum_duplicates
eliminate_zeros
toarray
(1d and 2d only)_add_dense
(1d and 2d only)_mul_vector
(1d and 2d only)_mul_multivector
(1d and 2d only)Note: the last 4 items on the checklist are routing 1d array data through the 2d native code paths, which requires the creation of a dummy
col
array containing all zeros. If we want to squeeze the last bit of performance out of these code paths, we'll want to implement specialized C++ overloads. That's out of scope for this PR, though.