Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Operations between numpy.array and scipy.sparse matrix return inconsistent array type #7510

Closed
zietzm opened this issue Jun 20, 2017 · 7 comments
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.sparse
Milestone

Comments

@zietzm
Copy link

zietzm commented Jun 20, 2017

It appears that adding or subtracting numpy.ndarrays with scipy.sparse matrices returns a numpy.matrix.

Is the inconsistency in returned array type (see code below) a bug or is it intentional?

It is confusing because different operations return either numpy arrays or numpy matrices, and since we are working with arrays that are often inter-converted between sparse and dense, it would be very helpful to have consistency in output array type (ie. operations between sparse and dense always return either a numpy array or always a numpy matrix.)

It makes sense that operations between like types return the same type as the input type. However, it would be nice if operations between numpy.arrays and scipy.sparse matrices would return always the same array type.

Reproducing code example:

import numpy as np
from scipy.sparse import csc_matrix

mat = np.array([[1,1],
                [1,0]])
sp_mat = csc_matrix([[1,1],
                     [1,0]])
mat + mat
>>> array([[2, 2],
           [2, 0]])
sp_mat + sp_mat
>>> <2x2 sparse matrix of type '<class 'numpy.int64'>'
	with 3 stored elements in Compressed Sparse Column format>
mat + sp_mat
>>> matrix([[2, 2],
            [2, 0]], dtype=int64)
sp_mat * mat
>>> array([[2, 1],
           [1, 1]], dtype=int64)
sp_mat == mat
>>> matrix([[ True,  True],
            [ True,  True]], dtype=bool)

Scipy/Numpy/Python version information:

print(scipy.__version__, numpy.__version__, sys.version_info)
>>> 0.19.0 1.12.1 sys.version_info(major=3, minor=6, micro=0, releaselevel='final', serial=0)
@dhimmel
Copy link

dhimmel commented Jun 20, 2017

For me, the issue here is that numpy.array + scipy.sparse returns a numpy.matrix. I think the ideal behavior is:

  • numpy.array + scipy.sparse returns a numpy.array
  • numpy.matrix + scipy.sparse returns a numpy.matrix

This issue also applies to subtraction. Actually the way we noticed it was from code like:

numpy_array -= scipy_sparse_matrix

This command changed the type of numpy_array to numpy.matrix which caused downstream problems.

@perimosocordiae
Copy link
Member

In general, the rule we try to follow is: if you replace the sparse matrix with a numpy matrix, the resulting type should not change (unless the result itself is sparse).

Following this rule:

  • array -= sparse converting to matrix is certainly a bug (and I can reproduce it locally).
  • array + sparse returning matrix is working as intended (even though it's a bit strange).
  • array * sparse returning array is a bug.

@perimosocordiae perimosocordiae added defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.sparse labels Jun 21, 2017
@perimosocordiae perimosocordiae added this to the 1.0 milestone Jun 21, 2017
@dhimmel
Copy link

dhimmel commented Jun 21, 2017

In general, the rule we try to follow is: if you replace the sparse matrix with a numpy matrix, the resulting type should not change (unless the result itself is sparse).

@perimosocordiae so it sounds like goal is to imitate the type behavior of numpy.matrix. Let's work with the following definitions:

import numpy
from scipy.sparse import csc_matrix

data = [[1,1],
        [1,0]]

array = numpy.array(data)
matrix = numpy.matrix(data)
sparse = csc_matrix(data)

The following pure numpy operations all return a matrix: array + matrix, matrix + array, array - matrix, matrix - array. Hence according to @perimosocordiae's rule, array + sparse, sparse + array, array - sparse, and sparse - array should all return a matrix (which they do). However, since array += matrix and array -= matrix, keep array as an array, so should array += sparse and array -= sparse.

For the dot product case, I will use @ rather than * for clarity. array @ matrix and matrix @ array both return a matrix. Hence, array @ sparse and sparse @ array should return a matrix, but instead they return an array.

(unless the result itself is sparse).

@perimosocordiae how do you know whether the output should be sparse? For example, why doesn't matrix @ sparse return sparse?

From the scipy.linalg docs:

Despite its convenience, the use of the numpy.matrix class is discouraged, since it adds nothing that cannot be accomplished with 2D numpy.ndarray objects, and may lead to a confusion of which class is being used.

This makes a lot of sense to me. In fact, for our project, we've tried to avoid using matrix at all. Primarily because we sometimes have to do elementwise multiplication. The separate operators for dot product and elementwise multiplication is helpful. However, scipy.sparse seems pretty committed to the matrix rather than ndarray design.

So my question is, for a project that mixes 2d-arrays and scipy.sparse matrices, should we just migrate from ndarray to matrix entirely? While this forces us to use the "discouraged" class, it's an uphill battle to use ndarrays when everything is always getting converted to matrices?

@perimosocordiae
Copy link
Member

Your analysis looks correct to me.

We determine if the output is sparse without considering the contents of the operands, based on the worst-case behavior. Thus, sparse - dense is always dense, even if the particular dense operand happens to be mostly zeros. In the case of dot products, the worst-case is that the dense operand has no zeros and the sparse operand is full-rank, which will result in an entirely dense result.

In general, I agree that users should avoid numpy.matrix whenever possible. Unfortunately, the scipy.sparse API design decision to mimic numpy.matrix can't be changed now. There's a growing list of efforts to produce a sparse ndarray-like library that may someday make it into scipy, but that will take considerable time.

To answer your question, I suggest taking a close look at the operations producing your arrays/matrices, and deciding what the appropriate type is on a case-by-case basis. It's fairly common to have separate code paths for sparse vs dense inputs, as the time/space complexity concerns often change pretty significantly between these cases. If you find that you often have sparse matrices that end up densifying, it might be faster/cheaper to avoid the sparse representation altogether.

@evfro
Copy link

evfro commented Apr 14, 2020

@perimosocordiae
Hi, is there any work going on to fix the bug you outlined above?

array -= sparse converting to matrix is certainly a bug

@perimosocordiae
Copy link
Member

gh-7826 has more discussion of that particular problem, and why it's not trivial to fix. I like the suggestion in #7826 (comment) about raising a warning, though.

As far as I know there aren't any open PRs along those lines yet.

@evfro
Copy link

evfro commented Apr 18, 2020

@perimosocordiae thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.sparse
Projects
None yet
Development

No branches or pull requests

4 participants