
ENH: adding sum_duplicates method to COO sparse matrix #3646

Merged 5 commits into scipy:master on Aug 23, 2014

Conversation

perimosocordiae
Member

Fixes #3624 by adding a sum_duplicates method and calling it before the todok conversion.

Also adds some test cases for both sum_duplicates and todok with duplicates.
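For reference, a minimal reproduction of the gh-3624 behavior this fixes (the matrix values are illustrative):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Two entries stored at the same (row, col) position; the logical
# value of the matrix at (0, 1) is their sum, 3.0.
A = coo_matrix((np.array([1.0, 2.0]),
                (np.array([0, 0]), np.array([1, 1]))), shape=(2, 2))

# With this patch, todok() sums duplicates first instead of letting
# one entry silently overwrite the other.
D = A.todok()
```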

@argriffing
Contributor

The other sparse matrices that sum duplicates also have bookkeeping to remember that they are in a canonical format (indices ordered and duplicates summed). Should this be added to coo too?

@coveralls

Coverage Status

Coverage remained the same when pulling 168ef17 on perimosocordiae:patch-4 into 7ff4e90 on scipy:master.

@perimosocordiae
Member Author

Commit f198f74 adds a has_canonical_format attribute. I've made an attempt at setting it intelligently in the constructor, but it's not propagated across all operations (like transpose). The result is that the value is somewhat pessimistic.

There's no guarantee that a user won't change the row and/or col attributes directly, which might cause has_canonical_format to be true when duplicates exist, which might result in downstream bugs. This seems like a documentation issue, however. If the consensus is for this approach, I can add a note to the coo_matrix constructor.
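For illustration, the flag behaves roughly like this with the post-merge API (the value of has_canonical_format before sum_duplicates depends on how the matrix was constructed):

```python
import numpy as np
from scipy.sparse import coo_matrix

A = coo_matrix((np.array([1.0, 2.0]),
                (np.array([0, 0]), np.array([1, 1]))), shape=(2, 2))

A.sum_duplicates()  # sorts indices and merges the duplicate entry
# The matrix is now marked canonical, but assigning to A.row / A.col
# directly would not clear the flag -- hence the documentation caveat.
```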

@coveralls

Coverage Status

Coverage remained the same when pulling f198f74 on perimosocordiae:patch-4 into 7ff4e90 on scipy:master.

@argriffing
Contributor

I checked the reasons for the failing build -- one is that the pep8 robocop is requiring the number of blank lines that it likes, and the other is that older but supported numpy versions apparently do not allow the dtype keyword in ones_like.

@coveralls

Coverage Status

Coverage remained the same when pulling 0814fed on perimosocordiae:patch-4 into 7ff4e90 on scipy:master.

prev_idx = 0
prev_inds = (self.row[0], self.col[0])
mask = np.ones(len(self.row), dtype=bool)
for idx, inds in enumerate(izip(self.row[1:], self.col[1:]), 1):
Member

This should be written in Cython or C++ (see _csparsetools.pyx or coo.h and generate_sparsetools.py), as it's going to be pretty slow. I'd recommend C++ (easier type templating).

Member Author

I agree. Should we merge this PR first and then rewrite for speed, or roll the fast version into this PR?

Member

I think you can get away without Cython and vectorize the whole thing with numpy doing something along the lines of:

idx = self.row * self.shape[1] + self.col  # flat row-major index; shape[1] is the column count
order = np.argsort(idx)
idx = idx[order]
unq_idx, unq_inv = np.unique(idx, return_inverse=True)
unq_cnt = np.bincount(unq_inv)
if np.any(unq_cnt > 1):
    add_idx = np.concatenate(([0], np.cumsum(unq_cnt[:-1])))
    self.data = np.add.reduceat(self.data[order], add_idx)
    self.row = unq_idx // self.shape[1]
    self.col = unq_idx % self.shape[1]

Member

A C/Cython version will however probably be some factors faster still, as it requires fewer temporaries.
(Also, it is likely simpler to understand, as the operation is somewhat involved to vectorize.)

OTOH, this code path is probably usually not speed-critical, as it's a rare case, so a Numpy version will probably be mostly fine.

Member

Using a flat index for sorting may overflow int32, so care should here be taken in choosing the integer type. Use get_index_dtype(maxval=self.shape[0]*self.shape[1]) to upcast to int64 when necessary... Or just use lexsort, which will be even safer.
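To illustrate the overflow concern, here is a hypothetical helper (`flat_index` is not scipy API; `get_index_dtype` is scipy's internal utility for this) that upcasts before forming the flat index:

```python
import numpy as np

def flat_index(row, col, shape):
    """Flat (row-major) index for COO coordinates, upcast to int64 when
    the matrix is large enough that row * ncols + col could overflow
    the 32-bit index arrays."""
    nrows, ncols = shape
    if nrows * ncols > np.iinfo(np.int32).max:
        row = row.astype(np.int64)
        col = col.astype(np.int64)
    return row * ncols + col
```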

Member

The main reason for a flat index is allowing the use of np.unique. But it wouldn't be too hard to rewrite the functionality for two index arrays using np.unique's source code as a guide. It would also help performance, since np.unique would re-sort an array that is already sorted. And in the latest numpy master, the implementation can compute unq_cnt faster than calling np.bincount on unq_inv. It would look something like this:

order = np.lexsort((self.row, self.col))
self.row = self.row[order]
self.col = self.col[order]
self.data = self.data[order]
flag = np.concatenate(([True], (self.row[1:] != self.row[:-1]) | (self.col[1:] != self.col[:-1])))
self.row = self.row[flag]
self.col = self.col[flag]
add_idx = np.nonzero(flag)[0]
self.data = np.add.reduceat(self.data, add_idx)

This is probably even less readable, but I wouldn't be surprised if it was 2x faster.
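Wrapped into a standalone function for experimentation (the name and signature are illustrative; in the PR this logic lives inside coo_matrix.sum_duplicates), the lexsort approach looks like:

```python
import numpy as np

def sum_duplicates(row, col, data):
    """Return (row, col, data) with duplicate (row, col) entries summed.

    lexsort's last key is primary, so entries come back sorted by
    column, then row; any consistent order groups duplicates together.
    """
    if len(data) == 0:
        return row, col, data
    order = np.lexsort((row, col))
    row, col, data = row[order], col[order], data[order]
    # True at the first occurrence of each distinct (row, col) pair
    unique_mask = (row[1:] != row[:-1]) | (col[1:] != col[:-1])
    unique_mask = np.append(True, unique_mask)
    row, col = row[unique_mask], col[unique_mask]
    unique_inds, = np.nonzero(unique_mask)
    # Sum each run of duplicates in a single vectorized pass
    return row, col, np.add.reduceat(data, unique_inds)
```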

@perimosocordiae
Member Author

@jaimefrio thanks! Your version does appear to be faster, and I don't think it sacrifices much readability.

@coveralls

Coverage Status

Coverage remained the same when pulling b9c4e62 on perimosocordiae:patch-4 into 7ff4e90 on scipy:master.

unique_mask = np.append(True, unique_mask)
self.row = self.row[unique_mask]
self.col = self.col[unique_mask]
unique_inds, = np.nonzero(unique_mask)
Member

Nice way of unpacking a single item tuple! I hate having to add [0] at the end...

@argriffing
Contributor

I don't have an opinion about whether or not it's a good idea to cache the canonical status, I just thought I'd mention it because the other sparse matrix classes seem to do it.

@perimosocordiae
Member Author

I think I'm against it, actually:

  • sum_duplicates is only being called for the todok conversion, which is unlikely to happen more than once in real code.
  • There are cases where it could actually cause bugs, as mentioned above.
  • The lexsort, which is the most expensive step, is faster for pre-sorted inputs, so calling sum_duplicates twice will be faster the second time anyway.

Unless anyone has a good case for caching the status, I'll pull that part out and we can hopefully merge this in.

@argriffing
Contributor

Unless anyone has a good case for caching the status,
I'll pull that part out and we can hopefully merge this in.

Sounds good to me.

@pv
Member

pv commented May 16, 2014

There is a good argument for caching the sum_duplicates, as certain operations expect non-duplicated input, and it is currently not called appropriately in several of them. There are tests, but they are disabled for the moment.

I think it is reasonable to expect the user to invalidate the canonical flags if they muck with the data, or otherwise we have either to (i) accept the performance penalty in common operations, or (ii) accept incorrect results with duplicates.

Moreover, I suspect the speed of the operation is constrained more by copying the data several times over in memory than by the lexsort itself.

@pv
Member

pv commented May 16, 2014

Disabled tests here: https://github.com/scipy/scipy/blob/master/scipy/sparse/tests/test_base.py#L3753 They didn't seem to catch this todok() issue, so maybe there are more...

@perimosocordiae
Member Author

Fair points. I didn't realize that so much is broken in the presence of duplicates, and I agree that silent failure isn't the best behavior.

Currently, much of the broken functionality isn't COO-specific (max, min, various ufuncs, etc.), so it's not clear how to insert a sum_duplicates call (or throw an error, maybe). Overriding the various mixin methods just to check the canonical-format flag seems inelegant. In any case, I think those fixes are beyond the scope of this PR.

@jnothman
Contributor

They didn't seem to catch this todok() issue, so maybe there are more...

See https://github.com/scipy/scipy/blob/master/scipy/sparse/tests/test_base.py#L3764 ("format conversion broken with non-canonical matrix")

@pv pv removed the PR label Aug 13, 2014
pv added a commit that referenced this pull request Aug 23, 2014
ENH: sparse: adding `sum_duplicates` method to COO sparse matrix
@pv pv merged commit 24ab79f into scipy:master Aug 23, 2014
@pv
Member

pv commented Aug 23, 2014

LGTM, merged.

@pv pv added this to the 0.15.0 milestone Aug 24, 2014
Successfully merging this pull request may close these issues.

coo -> dok -> dense loses repeats somewhere
7 participants