
ENH: adding sum_duplicates method to COO sparse matrix #3646

Merged 5 commits into scipy:master on Aug 23, 2014

Conversation

perimosocordiae
Member

Fixes #3624 by adding a sum_duplicates method and calling it before the todok conversion.

Also adds some test cases for both sum_duplicates and todok with duplicates.
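For reference, a minimal reproduction of the gh-3624 behavior this fixes (the matrix values are illustrative):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Two entries stored at the same (row, col) position; the logical
# value of the matrix at (0, 1) is their sum, 3.0.
A = coo_matrix((np.array([1.0, 2.0]),
                (np.array([0, 0]), np.array([1, 1]))), shape=(2, 2))

# With this patch, todok() sums duplicates first instead of letting
# one entry silently overwrite the other.
D = A.todok()
```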

@argriffing
Contributor

The other sparse matrices that sum duplicates also have bookkeeping to remember that they are in a canonical format (indices ordered and duplicates summed). Should this be added to coo too?

@coveralls

Coverage Status

Coverage remained the same when pulling 168ef17 on perimosocordiae:patch-4 into 7ff4e90 on scipy:master.

@perimosocordiae
Member Author

Commit f198f74 adds a has_canonical_format attribute. I've made an attempt at setting it intelligently in the constructor, but it's not propagated across all operations (like transpose). The result is that the value is somewhat pessimistic.

There's no guarantee that a user won't change the row and/or col attributes directly, which might cause has_canonical_format to be true when duplicates exist, which might result in downstream bugs. This seems like a documentation issue, however. If the consensus is for this approach, I can add a note to the coo_matrix constructor.
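For illustration, the flag behaves roughly like this with the post-merge API (the value of has_canonical_format before sum_duplicates depends on how the matrix was constructed):

```python
import numpy as np
from scipy.sparse import coo_matrix

A = coo_matrix((np.array([1.0, 2.0]),
                (np.array([0, 0]), np.array([1, 1]))), shape=(2, 2))

A.sum_duplicates()  # sorts indices and merges the duplicate entry
# The matrix is now marked canonical, but assigning to A.row / A.col
# directly would not clear the flag -- hence the documentation caveat.
```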

@coveralls

Coverage Status

Coverage remained the same when pulling f198f74 on perimosocordiae:patch-4 into 7ff4e90 on scipy:master.

@argriffing
Contributor

I checked the reasons for the failing build -- one is that the pep8 robocop is requiring the number of blank lines that it likes, and the other is that older but supported numpy versions apparently do not allow the dtype keyword in ones_like.

@coveralls

Coverage Status

Coverage remained the same when pulling 0814fed on perimosocordiae:patch-4 into 7ff4e90 on scipy:master.

prev_idx = 0
prev_inds = (self.row[0], self.col[0])
mask = np.ones(len(self.row), dtype=bool)
for idx, inds in enumerate(izip(self.row[1:], self.col[1:]), 1):
Member

This should be written in Cython or C++ (see _csparsetools.pyx or coo.h and generate_sparsetools.py), as it's going to be pretty slow. I'd recommend C++ (easier type templating).

Member Author

I agree. Should we merge this PR first and then rewrite for speed, or roll the fast version into this PR?

Member

I think you can get away without Cython and vectorize the whole thing with numpy doing something along the lines of:

idx = self.row * self.shape[1] + self.col  # flat row-major index; shape[1] is the column count
order = np.argsort(idx)
idx = idx[order]
unq_idx, unq_inv = np.unique(idx, return_inverse=True)
unq_cnt = np.bincount(unq_inv)
if np.any(unq_cnt > 1):
    add_idx = np.concatenate(([0], np.cumsum(unq_cnt[:-1])))
    self.data = np.add.reduceat(self.data[order], add_idx)
    self.row = unq_idx // self.shape[1]
    self.col = unq_idx % self.shape[1]

Member

A C/Cython version will however probably be some factors faster still, as it requires fewer temporaries.
(Also, it is likely simpler to understand, as the operation is somewhat involved to vectorize.)

OTOH, this code path is probably usually not speed-critical, as it's a rare case, so a Numpy version will probably be mostly fine.

Member

Using a flat index for sorting may overflow int32, so care should here be taken in choosing the integer type. Use get_index_dtype(maxval=self.shape[0]*self.shape[1]) to upcast to int64 when necessary... Or just use lexsort, which will be even safer.
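To illustrate the overflow concern, here is a hypothetical helper (`flat_index` is not scipy API; `get_index_dtype` is scipy's internal utility for this) that upcasts before forming the flat index:

```python
import numpy as np

def flat_index(row, col, shape):
    """Flat (row-major) index for COO coordinates, upcast to int64 when
    the matrix is large enough that row * ncols + col could overflow
    the 32-bit index arrays."""
    nrows, ncols = shape
    if nrows * ncols > np.iinfo(np.int32).max:
        row = row.astype(np.int64)
        col = col.astype(np.int64)
    return row * ncols + col
```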

Member

The main reason for a flat index is allowing the use of np.unique. But it wouldn't be too hard to rewrite the functionality for two index arrays using np.unique's source code as a guide. It would also help performance, since np.unique would re-sort an array that is already sorted. And in the latest numpy master, the implementation can compute unq_cnt faster than calling np.bincount on unq_inv. It would look something like this:

order = np.lexsort((self.row, self.col))
self.row = self.row[order]
self.col = self.col[order]
self.data = self.data[order]
flag = np.concatenate(([True], (self.row[1:] != self.row[:-1]) | (self.col[1:] != self.col[:-1])))
self.row = self.row[flag]
self.col = self.col[flag]
add_idx = np.nonzero(flag)[0]
self.data = np.add.reduceat(self.data, add_idx)

This is probably even less readable, but I wouldn't be surprised if it was 2x faster.
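Wrapped into a standalone function for experimentation (the name and signature are illustrative; in the PR this logic lives inside coo_matrix.sum_duplicates), the lexsort approach looks like:

```python
import numpy as np

def sum_duplicates(row, col, data):
    """Return (row, col, data) with duplicate (row, col) entries summed.

    lexsort's last key is primary, so entries come back sorted by
    column, then row; any consistent order groups duplicates together.
    """
    if len(data) == 0:
        return row, col, data
    order = np.lexsort((row, col))
    row, col, data = row[order], col[order], data[order]
    # True at the first occurrence of each distinct (row, col) pair
    unique_mask = (row[1:] != row[:-1]) | (col[1:] != col[:-1])
    unique_mask = np.append(True, unique_mask)
    row, col = row[unique_mask], col[unique_mask]
    unique_inds, = np.nonzero(unique_mask)
    # Sum each run of duplicates in a single vectorized pass
    return row, col, np.add.reduceat(data, unique_inds)
```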

@perimosocordiae
Member Author

@jaimefrio thanks! Your version does appear to be faster, and I don't think it sacrifices much readability.

@coveralls

Coverage Status

Coverage remained the same when pulling b9c4e62 on perimosocordiae:patch-4 into 7ff4e90 on scipy:master.

unique_mask = np.append(True, unique_mask)
self.row = self.row[unique_mask]
self.col = self.col[unique_mask]
unique_inds, = np.nonzero(unique_mask)
Member

Nice way of unpacking a single item tuple! I hate having to add [0] at the end...

@argriffing
Contributor

I don't have an opinion about whether or not it's a good idea to cache the canonical status, I just thought I'd mention it because the other sparse matrix classes seem to do it.

@perimosocordiae
Member Author

I think I'm against it, actually:

  • sum_duplicates is only being called for the todok conversion, which is unlikely to happen more than once in real code.
  • There are cases where it could actually cause bugs, as mentioned above.
  • The lexsort, which is the most expensive step, is faster for pre-sorted inputs, so calling sum_duplicates twice will be faster the second time anyway.

Unless anyone has a good case for caching the status, I'll pull that part out and we can hopefully merge this in.

@argriffing
Contributor

Unless anyone has a good case for caching the status,
I'll pull that part out and we can hopefully merge this in.

Sounds good to me.

@pv
Member

pv commented May 16, 2014

There is a good argument for caching the sum_duplicates, as certain operations expect non-duplicated input, and it is currently not called appropriately in several of them. There are tests, but they are disabled for the moment.

I think it is reasonable to expect the user to invalidate the canonical flags if they muck with the data, or otherwise we have either to (i) accept the performance penalty in common operations, or (ii) accept incorrect results with duplicates.

Moreover, I suspect the speed of the operation is constrained more by copying the data several times over in memory than by the lexsort itself.

@pv
Member

pv commented May 16, 2014

Disabled tests here: https://github.com/scipy/scipy/blob/master/scipy/sparse/tests/test_base.py#L3753 They didn't seem to catch this todok() issue, so maybe there are more...

@perimosocordiae
Member Author

Fair points. I didn't realize that so much is broken in the presence of duplicates, and I agree that silent failure isn't the best behavior.

Currently, much of the broken functionality isn't COO-specific (max, min, various ufuncs, etc.), so it's not clear how to insert a sum_duplicates call (or throw an error, maybe). Overriding the various mixin methods just to check the canonical-format flag seems inelegant. In any case, I think those fixes are beyond the scope of this PR.

@jnothman
Contributor

They didn't seem to catch this todok() issue, so maybe there are more...

See https://github.com/scipy/scipy/blob/master/scipy/sparse/tests/test_base.py#L3764 ("format conversion broken with non-canonical matrix")

@pv pv removed the PR label Aug 13, 2014
pv added a commit that referenced this pull request Aug 23, 2014
ENH: sparse: adding `sum_duplicates` method to COO sparse matrix
@pv pv merged commit 24ab79f into scipy:master Aug 23, 2014
@pv
Member

pv commented Aug 23, 2014

LGTM, merged.

@pv pv added this to the 0.15.0 milestone Aug 24, 2014
Successfully merging this pull request may close these issues.

coo -> dok -> dense loses repeats somewhere
7 participants