BUG: CSR/CSC Division Yields COO #20169
Comments
cc: @ivirshup
For example, could we do `dividend.data / divisor[dividend.nonzero()]` to get the new `data`?
I would add:

```python
from scipy import sparse
import numpy as np

csr = sparse.random_array((100, 100), format="csr", density=0.01)

type(csr / 2)
# scipy.sparse._csr.csr_array
type(csr / np.full(100, 2))
# scipy.sparse._coo.coo_array
type(csr * 2)
# scipy.sparse._csr.csr_array
type(csr * np.full(100, 2))
# scipy.sparse._coo.coo_array
```
The code path for the division is essentially:

```python
recip = np.true_divide(1., vector)
return csr.multiply(recip)
```

Then, in the handling of the multiplication:

```python
coo = csr.tocoo()
coo.data = np.multiply(coo.data, recip[coo.row, coo.col])
return coo
```

This strips away a bunch of branches for shape-checking and other stuff, but that's the core of the operation.
I wouldn't necessarily call this a correctness bug, because we don't generally guarantee that sparse formats will be preserved, but I can see a case for it being a performance bug in some situations.
I would also call it more of a performance bug, and probably unexpected behaviour.
Based on that, would changing this be an acceptable change in behavior? For reference, here is how scikit-learn implements the similar operation. It's pure Python, though with allocations we could skip in compiled code. It can look like:

```python
from scipy import sparse
import numpy as np
from operator import mul, truediv

def broadcast_csr_by_vec(X, vec, op, axis):
    if axis == 0:
        new_data = op(X.data, np.repeat(vec, np.diff(X.indptr)))
    elif axis == 1:
        new_data = op(X.data, vec.take(X.indices, mode="clip"))
    return X._with_data(new_data)
```

With some promise for performance:

```python
In [28]: X = sparse.random_array((10_000, 1_000), format="csr", density=0.1)
    ...: y = np.random.randn(10_000)

In [29]: broadcast_csr_by_vec(X, y, mul, axis=0) != (X * y[:, None])
Out[29]:
<10000x1000 sparse array of type '<class 'numpy.bool_'>'
        with 0 stored elements in Compressed Sparse Row format>

In [30]: %timeit broadcast_csr_by_vec(X, y, mul, axis=0)
4.89 ms ± 453 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [31]: %timeit (X * y[:, None])
18.9 ms ± 974 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
Added bonus of the approach above: since we're not "faking" division, the results are actually equivalent to NumPy (which they currently aren't):

```python
dense_op = X.toarray() / y[:, None]
sparse_op = (X / y[:, None]).toarray()
new_op = broadcast_csr_by_vec(X, y, truediv, axis=0).toarray()

np.testing.assert_array_equal(dense_op, new_op)
assert not np.array_equal(dense_op, sparse_op)
```
I think a PR to fix this may be a little bigger than expected due to the number of cases that currently return COO, though they don't have to.
Describe your issue.
I would expect that dividing a CSR or CSC matrix by an `ndarray` of the same size would yield a CSR/CSC matrix, since the operation should just scale `data`, but instead we get `COO`. I see there's an in-place row scale that should be similar, except it doesn't need to broadcast. But the problem then becomes one of just indexing the divisor, and then calling this?
Reproducing Code Example
Error message
SciPy/NumPy/Python version and system information