ENH much more efficient sparse min/max with axis #3248

Merged
pv merged 10 commits into scipy:master on Jan 30, 2014

4 participants
Contributor

jnothman commented Jan 28, 2014

The current implementation of min and max for sparse matrices with a specified axis is very slow, and the use of np.where(M.indices == col) wastes both time and memory. I don't promise this is the tightest possible alternative, but it certainly works much faster at scale!
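For context, the pattern being replaced looks roughly like this. This is an illustrative sketch only (the helper name slow_row_min and the example matrix are hypothetical, not the exact master code): each row triggers a full scan of mat.indices, so the whole thing is O(n_rows * nnz) with a temporary index array per row.

```python
import numpy as np
import scipy.sparse as sp

def slow_row_min(M):
    """Sketch of the old per-row pattern (illustrative, not the exact master code)."""
    mat = M.tocsc()  # in CSC, mat.indices holds *row* indices
    out = np.zeros((M.shape[0], 1), dtype=M.dtype)
    for i in range(M.shape[0]):
        # np.where scans every stored entry once per row -- this is the waste
        row_data = mat.data[np.where(mat.indices == i)]
        if len(row_data) < M.shape[1]:  # row has implicit zeros
            row_data = np.append(row_data, M.dtype.type(0))
        out[i, 0] = row_data.min()
    return out

M = sp.csr_matrix(np.array([[0, 2, -1], [0, 0, 0], [3, 0, 4]], dtype=float))
print(slow_row_min(M).ravel())   # [-1.  0.  0.]
```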

Benchmarks in seconds:

side    density master   commit1  latest
100     0.001   0.0232   0.0011   0.0009
100     0.01    0.0238   0.0033   0.0009
100     0.1     0.0162   0.0050   0.0009
1000    0.001   0.1574   0.0275   0.0010
1000    0.01    0.1858   0.0281   0.0013 
1000    0.1     0.4279   0.0302   0.0032 
10000   0.001   4.2201   0.2791   0.0048
10000   0.01   27.7040   0.2979   0.0188
10000   0.1          -   0.5101   0.2276

Coverage Status

Coverage remained the same when pulling 151fdd9 on jnothman:fast_minmax into 8ad48cd on scipy:master.

@pv pv commented on an outdated diff Jan 28, 2014

scipy/sparse/data.py
-            get_min_or_max = getattr(ith_row_data, min_or_max)
-            out_mat[i, 0] = get_min_or_max()
-        return self.__class__(out_mat)
+        min_or_max = getattr(np, min_or_max)
+        mat = self.tocsc() if axis == 0 else self.tocsr()
+        mat.sum_duplicates()
+        N = mat.shape[axis]
+
+        zero = self.dtype.type(0)
+        out_mat = lil_matrix((1, len(mat.indptr) - 1), dtype=self.dtype)
+        row = out_mat.rows[0]
+        data = out_mat.data[0]
+
+        for i, (start, stop) in enumerate(izip(mat.indptr, mat.indptr[1:])):
+            if start == stop:
+                continue
@pv

pv Jan 28, 2014

Owner

Coverage shows this line is never tested: https://coveralls.io/files/125056678
The code is correct, though.

@pv pv and 1 other commented on an outdated diff Jan 28, 2014

scipy/sparse/data.py
-        zero = self.dtype.type(0)
-        for i in range(self.shape[0]):
-            ith_row_data_indices = np.argwhere(mat.indices == i)
-            ith_row_data = mat.data[ith_row_data_indices]
-
-            # Add a zero if needed
-            if len(ith_row_data) < self.shape[1]:
-                ith_row_data = np.append(zero, ith_row_data)
-
-            get_min_or_max = getattr(ith_row_data, min_or_max)
-            out_mat[i, 0] = get_min_or_max()
-        return self.__class__(out_mat)
+        min_or_max = getattr(np, min_or_max)
+        mat = self.tocsc() if axis == 0 else self.tocsr()
+        mat.sum_duplicates()
+        N = mat.shape[axis]
@pv

pv Jan 28, 2014

Owner
if N == 0:
    raise ValueError("zero-size array to reduction operation")

should probably also be added, to match the behavior of np.max/np.min.

@jnothman

jnothman Jan 28, 2014

Contributor

That needs to be tested for either axis, and for the axis=None case...

@pv

pv Jan 28, 2014

Owner

Numpy seems to allow reductions as long as the axis reduced over is nonzero:
np.max(np.zeros((0, 5)), axis=1) vs. np.max(np.zeros((0, 5)), axis=0)
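Concretely, the asymmetry can be checked directly:

```python
import numpy as np

a = np.zeros((0, 5))
print(np.max(a, axis=1).shape)   # reducing over the length-5 axis works: shape (0,)
try:
    np.max(a, axis=0)            # reducing over the zero-length axis raises
except ValueError as e:
    print("ValueError:", e)
```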

@jnothman

jnothman Jan 28, 2014

Contributor

Oh. I didn't see that last comment :s I'll need to change a few things!


Owner

pv commented Jan 28, 2014

Looks good to me!

Contributor

jnothman commented Jan 28, 2014

Those are now tested, with a refactor thrown in as a bonus.

Maybe add?

if axis < 0: 
    axis += 2

Coverage remained the same when pulling bbed0c6 on jnothman:fast_minmax into 8ad48cd on scipy:master.

@jaimefrio jaimefrio and 2 others commented on an outdated diff Jan 28, 2014

scipy/sparse/data.py
-            mat = self.tocsc()
-            if not mat.has_sorted_indices:
-                mat.sort_indices()
+        N = self.shape[axis]
+        if N == 0:
+            raise ValueError("zero-size array to reduction operation")
+
+        mat = self.tocsc() if axis == 0 else self.tocsr()
+        mat.sum_duplicates()
+
+        zero = self.dtype.type(0)
+        out_mat = lil_matrix((1, len(mat.indptr) - 1), dtype=self.dtype)
+        row = out_mat.rows[0]
+        data = out_mat.data[0]
+
+        for i, (start, stop) in enumerate(izip(mat.indptr, mat.indptr[1:])):
@jaimefrio

jaimefrio Jan 28, 2014

Member

Maybe you have already tested it and it isn't any faster, but you could vectorize, e.g., the min operation for all rows of a CSR matrix by doing something along the lines of np.minimum.reduceat(mat.data, mat.indptr[:-1]), rather than looping over the slices.

@jnothman

jnothman Jan 28, 2014

Contributor

I am not familiar with these ufunc min/max forms, nor with reduceat (I need to brush up on my ufuncs). But I think you're right that it's possible, and I'd like to see iteration through slices avoided when possible. We could try:

x = np.minimum.reduceat(mat.data, mat.indptr[:-1])
d = np.diff(mat.indptr)
x = np.minimum(x, 0, where=d < N, out=x)
x[d == 0] = 0

which I think is logically equivalent, but produces a dense result, performs unnecessary mins on single elements, etc.

@pv

pv Jan 28, 2014

Owner

reduceat is somewhat quirky numpy/numpy#4238 (behavior inherited from the mists of time, it seems)
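The quirk in question bites exactly the empty-row case: when indices[i] >= indices[i+1], reduceat returns the single element a[indices[i]] instead of reducing an empty slice. The numpy documentation's own example shows it:

```python
import numpy as np

a = np.arange(8)
# slot 0: reduce a[0:4] -> 6; slot 1: 4 >= 1, so just a[4] -> 4;
# slot 2: reduce a[1:5] -> 10; slot 3: reduce a[5:8] -> 18
print(np.add.reduceat(a, [0, 4, 1, 5]))   # [ 6  4 10 18]
```

For CSR data, an empty row means two equal consecutive indptr entries, so the naive reduceat silently picks up a neighbouring row's first element.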

@jnothman

jnothman Jan 28, 2014

Contributor

Hmm. We don't just need to remove the last value of indptr, we need to slice off all values of indptr equal to the last (i.e. all empty rows at the end of the matrix). Is there a nice way to do that?

Is the code obfuscation going to be worthwhile?

@jnothman

jnothman Jan 28, 2014

Contributor

reduceat is somewhat quirky

maybe reduceat needs a where argument too, with which we could write:

out = np.zeros(len(mat.indptr) - 1)
d = np.diff(mat.indptr)
np.minimum.reduceat(mat.data, mat.indptr[:-1], where=d, out=out)
out = np.minimum(out, 0, where=d < N, out=out)
@pv

pv Jan 28, 2014

Owner

You can use np.searchsorted to find the index where to cut it off.
The worst case for the current code is probably one element per row, where the Python overhead will dominate.
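As a toy illustration of the searchsorted cut-off (the indptr values are made up, assuming a 5-row matrix whose last two rows are empty):

```python
import numpy as np

# indptr of a 5-row matrix with nnz == 5; rows 3 and 4 are empty,
# so the trailing entries repeat nnz
indptr = np.array([0, 2, 2, 5, 5, 5])
trunc = np.searchsorted(indptr, indptr[-1])  # first position equal to nnz
print(trunc)            # 3: from here on, every start index is a trailing empty row
print(indptr[:trunc])   # [0 2 2] -- all < len(data), safe to pass to reduceat
```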

@jaimefrio

jaimefrio Jan 28, 2014

Member

See what you think of the following code, it computes the min or max over rows of a CSR matrix, using boolean indexing to minimize calculations. It may be possible to improve it by not considering rows with a single item, but I doubt there is much to improve there:

import scipy.sparse as sps

def row_minmax(csr_mat, minmax='min'):
    if minmax == 'min':
        ufunc = np.minimum
        compare = np.greater_equal
    elif minmax == 'max':
        ufunc = np.maximum
        compare = np.less_equal
    else:
        raise ValueError('not \'min\' or \'max\'')

    mask_indptr = np.diff(csr_mat.indptr) > 0 # rows not all zeros
    new_data = ufunc.reduceat(csr_mat.data,
                              csr_mat.indptr[mask_indptr]) # min (max) of non-zeros in rows not all zeros
    mask_data = compare(new_data, 0) # wrong entries in new_data, should be zero
    new_data = new_data[~mask_data] # remove them
    mask_indptr[mask_indptr] ^= mask_data # set to False those rows in mask_indptr
    return sps.csc_matrix((new_data,
                           np.nonzero(mask_indptr)[0],
                           [0, len(new_data)]), shape=(csr_mat.shape[0], 1))
@jnothman

jnothman Jan 28, 2014

Contributor

Yes, I'd considered something like that. Given that I don't think this is code that needs to be run at high frequency, I've tried to minimise obfuscation...


Contributor

jnothman commented Jan 28, 2014

Wow. Thanks @jaimefrio: that gave a time reduction on the order of the density for a big matrix, as you can see in the benchmarks above.

The where keyword is new in Numpy 1.7, and we are still trying to stick with 1.5...
This can probably be written as

mask = (nnz < N)
out[mask] = min_or_max(out[mask], zero)
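Putting the pieces together, a minimal self-contained sketch of this masked reduceat approach (the helper name row_min and the example matrix are illustrative, not the PR's code):

```python
import numpy as np
import scipy.sparse as sp

def row_min(M):
    """Row-wise min of a sparse matrix, accounting for implicit zeros."""
    mat = M.tocsr()
    mat.sum_duplicates()
    N = mat.shape[1]
    nnz = np.diff(mat.indptr)
    out = np.zeros(mat.shape[0], dtype=mat.dtype)
    # reduceat only over the start offsets of non-empty rows,
    # avoiding the equal-consecutive-indices quirk
    nonempty = nnz > 0
    out[nonempty] = np.minimum.reduceat(mat.data, mat.indptr[:-1][nonempty])
    # rows with implicit zeros must also be compared against zero
    partial = nnz < N
    out[partial] = np.minimum(out[partial], 0)
    return out

M = sp.csr_matrix(np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [5.0, 0.0, -2.0]]))
print(row_min(M))   # [ 1.  0. -2.]
```

Row 0 is full, so its min stays 1; row 1 is all zeros; row 2 has an implicit zero but its stored min is smaller.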

@jaimefrio jaimefrio and 1 other commented on an outdated diff Jan 28, 2014

scipy/sparse/data.py
-        elif axis == 1:
-            mat = self.tocsc()
-            if not mat.has_sorted_indices:
-                mat.sort_indices()
+        N = self.shape[axis]
+        if N == 0:
+            raise ValueError("zero-size array to reduction operation")
+
+        mat = self.tocsc() if axis == 0 else self.tocsr()
+        mat.sum_duplicates()
+
+        indptr = mat.indptr
+        zero = self.dtype.type(0)
+        out = np.zeros(len(indptr) - 1, dtype=self.dtype)
+        # can't use indices > data length with reduceat
+        trunc = np.searchsorted(indptr, indptr[-1])
@jaimefrio

jaimefrio Jan 28, 2014

Member

Are you sure you don't want to use boolean indexing before doing .reduceat? Lines 118 through 125 can be rewritten more compactly, probably also faster, and I think more readable, as:

nnz = np.diff(indptr)
# look only at rows with non-zero entries
mask = nnz > 0
# compute min_or_max of non-zero entries in each row
min_or_max.reduceat(mat.data, indptr[mask], out=out[mask]) 
# the min_or_max of non-full rows may be zero
mask &= nnz < N
min_or_max(out[mask], zero, out=out[mask])
@jnothman

jnothman Jan 28, 2014

Contributor

Firstly, I don't think using out=out[mask] could possibly work, since the masking is a copying operation.

Secondly, I am trying something very similar to that, but I'd like to encapsulate part of it in a function on CSR and CSC called "reduce_minor(ufunc)". This can also be used in sum where the axis is minor, which gives gains over matrix multiplication for a large minor axis.
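The copying behaviour is easy to demonstrate: a boolean-masked view is a temporary, so a ufunc writing into out=out[mask] writes into the copy and the result is discarded.

```python
import numpy as np

out = np.zeros(3)
mask = np.array([True, False, True])
# out[mask] materialises a copy; the ufunc writes into that temporary
np.minimum(np.array([-1.0, -2.0]), 0, out=out[mask])
print(out)   # still [0. 0. 0.]
```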


@jnothman

jnothman Jan 28, 2014

Contributor

But half of the pain is putting it into a sparse result (which itself is weird, because sum returns a dense matrix), since we can't access sp.csr_matrix directly without importing it locally in the function, due to dependency cycles.


@jaimefrio

jaimefrio Jan 28, 2014

Member

You are quite right about the out=, although this does work:

out[mask] = min_or_max.reduceat(mat.data, indptr[mask])
...
out[mask] = min_or_max(out[mask], zero)

Anyway, it's your PR, so do as you wish: the speed-up is going to be great either way.

Contributor

jnothman commented Jan 28, 2014

WDYT?

sum is a little slower when the minor size is small, but I've seen it cut the time in half when it is large.

Contributor

jnothman commented Jan 28, 2014

Benchmark for sum over the minor axis (of size N) with the refactored _reduce_minor:

M       N       density reduce  mult    gain
100     100     0.001   0.0001  0.0001  1.3x
100     100     0.01    0.0001  0.0001  1.0x
100     100     0.1     0.0001  0.0001  0.9x
100     1000    0.001   0.0001  0.0001  1.0x
100     1000    0.01    0.0001  0.0001  1.0x
100     1000    0.1     0.0001  0.0001  1.1x
100     10000   0.001   0.0001  0.0001  1.2x
100     10000   0.01    0.0001  0.0001  1.4x
100     10000   0.1     0.0002  0.0004  1.9x
100     100000  0.001   0.0001  0.0003  2.6x
100     100000  0.01    0.0002  0.0008  3.5x
100     100000  0.1     0.0013  0.0033  2.5x
1000    100     0.001   0.0001  0.0001  0.8x
1000    100     0.01    0.0001  0.0001  0.7x
1000    100     0.1     0.0001  0.0001  0.6x
1000    1000    0.001   0.0001  0.0001  0.7x
1000    1000    0.01    0.0001  0.0001  0.7x
1000    1000    0.1     0.0002  0.0002  1.1x
1000    10000   0.001   0.0001  0.0001  0.8x
1000    10000   0.01    0.0002  0.0003  1.4x
1000    10000   0.1     0.0014  0.0024  1.7x
1000    100000  0.001   0.0002  0.0005  2.3x
1000    100000  0.01    0.0013  0.0034  2.5x
1000    100000  0.1     0.0118  0.0304  2.6x
10000   100     0.001   0.0002  0.0001  0.4x
10000   100     0.01    0.0005  0.0002  0.4x
10000   100     0.1     0.0008  0.0003  0.4x
10000   1000    0.001   0.0005  0.0002  0.4x
10000   1000    0.01    0.0008  0.0003  0.4x
10000   1000    0.1     0.0018  0.0018  1.0x
10000   10000   0.001   0.0008  0.0003  0.4x
10000   10000   0.01    0.0018  0.0023  1.3x
10000   10000   0.1     0.0125  0.0228  1.8x
10000   100000  0.001   0.0018  0.0038  2.1x
10000   100000  0.01    0.0120  0.0310  2.6x
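The idea behind such a reduce-based sum can be sketched as follows (minor_sum is a hypothetical stand-in for the private _minor_reduce-based helper, not the PR's exact code):

```python
import numpy as np
import scipy.sparse as sp

def minor_sum(M):
    """Sum over the minor axis (the columns of a CSR matrix) via reduceat."""
    mat = M.tocsr()
    mat.sum_duplicates()
    nnz = np.diff(mat.indptr)
    out = np.zeros(mat.shape[0], dtype=mat.dtype)
    nonempty = nnz > 0
    # one reduceat call replaces the multiply-by-ones matvec
    out[nonempty] = np.add.reduceat(mat.data, mat.indptr[:-1][nonempty])
    return out

M = sp.csr_matrix(np.array([[1.0, 2.0], [0.0, 0.0], [3.0, 4.0]]))
print(minor_sum(M))           # [3. 0. 7.]
print(M.dot(np.ones(2)))      # the matvec equivalent: [3. 0. 7.]
```

The matvec has to touch a dense ones vector of length N, which is where the reduce version gains for a large minor axis.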

@pv pv and 1 other commented on an outdated diff Jan 28, 2014

scipy/sparse/data.py
-            # Add a zero if needed
-            if len(ith_col_data) < self.shape[0]:
-                ith_col_data = np.append(zero, ith_col_data)
+        major_index, value = mat._minor_reduce(min_or_max)
+        min_or_max(value, 0, out=value)
@pv

pv Jan 28, 2014

Owner

min_or_max(value, 0, value)
in case Numpy 1.5 doesn't like the kwarg

@jnothman

jnothman Jan 29, 2014

Contributor

Argh! Just realised this no longer handles the completely full case. Which means doing the indptr diff again :(
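The full-matrix case in question: for a row with no implicit zeros, unconditionally comparing the reduction against zero gives the wrong answer, which is why the nnz diff is needed again.

```python
import numpy as np
import scipy.sparse as sp

M = sp.csr_matrix(np.array([[2.0, 3.0]]))   # a completely full row; its min is 2
value = np.minimum.reduceat(M.data, M.indptr[:-1])
# Blindly clamping against zero would wrongly report 0 for this row:
print(np.minimum(value, 0))   # [0.] -- wrong
nnz = np.diff(M.indptr)
partial = nnz < M.shape[1]    # only rows with implicit zeros compare to zero
value[partial] = np.minimum(value[partial], 0)
print(value)                  # [2.] -- correct
```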

@pv

pv Jan 29, 2014

Owner

This doesn't seem to have been caught by the tests, so it would be useful to test it.

@jnothman

jnothman Jan 29, 2014

Contributor

You're commenting on an old diff, so I'm not sure what you mean...


@pv

pv Jan 29, 2014

Owner

I mean that since the issue was not caught by the test suite, the suite evidently lacks a test case with a full matrix, so it would be useful to add one that checks things also work in that case.

@jnothman

jnothman Jan 29, 2014

Contributor

Yes, true. In ensuring the test worked with empty rows, I forgot to leave a full one.


@pv pv and 1 other commented on an outdated diff Jan 28, 2014

scipy/sparse/data.py
-            get_min_or_max = getattr(ith_col_data, min_or_max)
-            out_mat[0, i] = get_min_or_max()
-        return self.__class__(out_mat)
+        out = np.zeros(len(mat.indptr) - 1, dtype=self.dtype)
@pv

pv Jan 28, 2014

Owner

Can we construct the sparse matrix directly here?

@jnothman

jnothman Jan 29, 2014

Contributor

Yes, I'd like to. =)

@pv

pv Jan 29, 2014

Owner

Looks almost like COO, coo_matrix((major_index, np.ones_like(major_index), value))

@jnothman

jnothman Jan 29, 2014

Contributor

Those should be zeros, and it will need a shape arg. It also almost looks like CSR or CSC...
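For reference, a COO construction along those lines, with zeros for the column indices and an explicit shape (the values here are illustrative):

```python
import numpy as np
from scipy.sparse import coo_matrix

n_rows = 5
major_index = np.array([0, 2, 4])        # rows whose reduction is nonzero
value = np.array([-1.0, 3.0, 2.0])       # the reduced values for those rows
col = np.zeros_like(major_index)         # every entry lands in column 0
out = coo_matrix((value, (major_index, col)), shape=(n_rows, 1))
print(out.toarray().ravel())   # [-1.  0.  3.  0.  2.]
```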


@jnothman

jnothman Jan 29, 2014

Contributor

Do you mind if I import coo_matrix locally? I can't do it at the module level.

Owner

pv commented Jan 28, 2014

This is starting to look like the right way to do it. I think we can now converge to this method and get it merged.

Coverage remained the same when pulling 720ef74 on jnothman:fast_minmax into c683ff9 on scipy:master.

Contributor

jnothman commented Jan 29, 2014

I've got tests for those edge cases now.

pv merged commit 720ef74 into scipy:master Jan 30, 2014

1 check passed: the Travis CI build passed.
Owner

pv commented Jan 30, 2014

Thanks, merged in 4844c63.
Some tests added, Numpy < 1.8 compatibility fixed, and one (unrelated) bugfix in BSR.
