ENH: adding `sum_duplicates` method to COO sparse matrix (#3646)
Conversation
The other sparse matrices that sum duplicates also have bookkeeping to remember that they are in a canonical format (indices ordered and duplicates summed). Should this be added to […]?
Commit f198f74 adds a […]. There's no guarantee that a user won't change the […].
I checked the reasons for the failing build -- one is that the pep8 robocop is requiring the number of blank lines that it likes, and the other is that older but supported numpy versions apparently do not allow the dtype keyword in […].
prev_idx = 0
prev_inds = (self.row[0], self.col[0])
mask = np.ones(len(self.row), dtype=bool)
for idx, inds in enumerate(izip(self.row[1:], self.col[1:]), 1):
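The loop body is cut off in this excerpt. A self-contained sketch of the same pure-Python masking idea might look like the following (hypothetical helper name, plain arrays instead of `coo_matrix` attributes, and Python 3's `zip` in place of `izip`); it assumes the entries are already sorted so that duplicate coordinates are adjacent:

```python
import numpy as np

def sum_duplicates_loop(row, col, data):
    """Fold duplicate (row, col) entries into their first occurrence.

    Assumes identical coordinates are adjacent (i.e. the arrays are
    sorted); duplicates are summed in place and masked out.
    """
    if len(row) == 0:
        return row, col, data
    mask = np.ones(len(row), dtype=bool)
    prev_idx = 0
    prev_inds = (row[0], col[0])
    for idx, inds in enumerate(zip(row[1:], col[1:]), 1):
        if inds == prev_inds:
            data[prev_idx] += data[idx]  # accumulate into first occurrence
            mask[idx] = False            # drop the duplicate entry
        else:
            prev_idx, prev_inds = idx, inds
    return row[mask], col[mask], data[mask]

row = np.array([0, 0, 1])
col = np.array([1, 1, 2])
data = np.array([1.0, 2.0, 3.0])
r, c, d = sum_duplicates_loop(row, col, data)
# (0, 1) appears twice: r -> [0, 1], c -> [1, 2], d -> [3.0, 3.0]
```

As the reviewers note below, an interpreted per-element loop like this is the slow path; it is shown only to make the truncated diff concrete.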
This should be written in Cython or C++ (see `_csparsetools.pyx`, `coo.h`, and `generate_sparsetools.py`), as it's going to be pretty slow. I'd recommend C++ (easier type templating).
I agree. Should we merge this PR first and then rewrite for speed, or roll the fast version into this PR?
I think you can get away without Cython and vectorize the whole thing with numpy doing something along the lines of:
idx = self.row * self.shape[1] + self.col
order = np.argsort(idx)
idx = idx[order]
unq_idx, unq_inv = np.unique(idx, return_inverse=True)
unq_cnt = np.bincount(unq_inv)
if np.any(unq_cnt > 1):
    add_idx = np.concatenate(([0], np.cumsum(unq_cnt[:-1])))
    self.data = np.add.reduceat(self.data[order], add_idx)
    self.row = unq_idx // self.shape[1]
    self.col = unq_idx % self.shape[1]
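A standalone version of this flat-index approach, with plain arrays in place of the `coo_matrix` attributes (hypothetical function name), makes it easy to check by hand; the flat index is `row * n_cols + col` so that each (row, col) pair maps to a distinct integer:

```python
import numpy as np

def sum_duplicates_flat(row, col, data, shape):
    """Vectorized duplicate summing via a flattened linear index.

    Sorts the flat indices, finds unique coordinates with np.unique,
    and sums each run of duplicates with np.add.reduceat.
    """
    n_cols = shape[1]
    idx = row * n_cols + col                  # unique integer per (row, col)
    order = np.argsort(idx)
    idx = idx[order]
    unq_idx, unq_inv = np.unique(idx, return_inverse=True)
    unq_cnt = np.bincount(unq_inv)            # multiplicity of each coordinate
    if np.any(unq_cnt > 1):
        # start offset of each run of equal indices in the sorted order
        add_idx = np.concatenate(([0], np.cumsum(unq_cnt[:-1])))
        data = np.add.reduceat(data[order], add_idx)
        row = unq_idx // n_cols               # recover coordinates
        col = unq_idx % n_cols
    return row, col, data

row = np.array([0, 1, 0])
col = np.array([2, 0, 2])
data = np.array([1.0, 5.0, 4.0])
r, c, d = sum_duplicates_flat(row, col, data, (2, 3))
# (0, 2) appears twice: r -> [0, 1], c -> [2, 0], d -> [5.0, 5.0]
```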
A C/Cython version will, however, probably be a few factors faster still, as it requires fewer temporaries.
(Also, it is likely simpler to understand, as the operation is somewhat involved to vectorize.)
OTOH, this code path is probably usually not speed-critical, as it's a rare case, so a NumPy version will probably be mostly fine.
Using a flat index for sorting may overflow int32, so care should be taken here in choosing the integer type. Use `get_index_dtype(maxval=self.shape[0]*self.shape[1])` to upcast to int64 when necessary. Or just use `lexsort`, which will be even safer.
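The overflow concern is easy to demonstrate with concrete numbers. The sketch below uses a minimal stand-in for the dtype choice (`index_dtype_for` is a hypothetical re-implementation of the idea behind scipy's internal helper, not the scipy function itself):

```python
import numpy as np

# With int32 row/col arrays, row * n_cols + col can silently wrap around:
# for a 100_000 x 100_000 matrix the largest flat index is 9_999_999_999,
# which is far beyond 2**31 - 1.
n_rows, n_cols = 100_000, 100_000
max_flat = n_rows * n_cols - 1              # exact Python int
assert max_flat > np.iinfo(np.int32).max

def index_dtype_for(maxval):
    """Pick the smallest index dtype that can hold maxval (illustrative)."""
    return np.int32 if maxval <= np.iinfo(np.int32).max else np.int64

row = np.array([99_999], dtype=np.int32)
col = np.array([99_999], dtype=np.int32)
dtype = index_dtype_for(n_rows * n_cols)
flat = row.astype(dtype) * n_cols + col     # upcast BEFORE multiplying
# flat[0] == 9_999_999_999; the same product in int32 would have wrapped
```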
The main reason for a flat index is allowing the use of `np.unique`. But it wouldn't be too hard to rewrite the functionality for two index arrays using `np.unique`'s source code as a guide. It would also help performance, as calling `np.unique` sorts an already sorted array. And in the latest numpy master, the implementation can compute `unq_cnt` faster than calling `np.bincount` on `unq_inv`. It would look something like this:
order = np.lexsort((self.row, self.col))
self.row = self.row[order]
self.col = self.col[order]
self.data = self.data[order]
flag = np.concatenate(([True], (self.row[1:] != self.row[:-1]) | (self.col[1:] != self.col[:-1])))
self.row = self.row[flag]
self.col = self.col[flag]
add_idx = np.nonzero(flag)[0]
self.data = np.add.reduceat(self.data, add_idx)
This is probably even less readable, but I wouldn't be surprised if it was 2x faster.
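Packaged as a standalone function over plain arrays (hypothetical name), the lexsort variant can be tested directly; since it never builds a flat index, the int32 overflow concern disappears entirely:

```python
import numpy as np

def sum_duplicates_lexsort(row, col, data):
    """Duplicate summing via lexsort and a coordinate-change flag.

    Sorts entries so equal coordinates are adjacent, marks the first
    entry of each run, and sums each run with np.add.reduceat.
    """
    if len(row) == 0:
        return row, col, data
    # last key (col) is the primary sort key; any total order works here
    order = np.lexsort((row, col))
    row, col, data = row[order], col[order], data[order]
    # True where a new (row, col) pair starts
    unique_mask = (row[1:] != row[:-1]) | (col[1:] != col[:-1])
    unique_mask = np.append(True, unique_mask)
    add_idx = np.nonzero(unique_mask)[0]      # start of each run
    return row[unique_mask], col[unique_mask], np.add.reduceat(data, add_idx)

row = np.array([1, 0, 1])
col = np.array([2, 0, 2])
data = np.array([10.0, 1.0, 30.0])
r, c, d = sum_duplicates_lexsort(row, col, data)
# (1, 2) appears twice: r -> [0, 1], c -> [0, 2], d -> [1.0, 40.0]
```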
@jaimefrio thanks! Your version does appear to be faster, and I don't think it sacrifices much readability.
unique_mask = np.append(True, unique_mask)
self.row = self.row[unique_mask]
self.col = self.col[unique_mask]
unique_inds, = np.nonzero(unique_mask)
Nice way of unpacking a single item tuple! I hate having to add `[0]` at the end...
I don't have an opinion about whether or not it's a good idea to cache the canonical status; I just thought I'd mention it because the other sparse matrix classes seem to do it.
I think I'm against it, actually. Unless anyone has a good case for caching the status, I'll pull that part out and we can hopefully merge this in.
Sounds good to me.
There is a good argument for caching the result of `sum_duplicates`: certain operations expect non-duplicated input, and it is currently not called appropriately in several of them. There are tests, but they are disabled for the moment. I think it is reasonable to expect the user to invalidate the canonical flags if they muck with the data; otherwise we have to either (i) accept the performance penalty in common operations, or (ii) accept incorrect results with duplicates. Moreover, I suspect the speed of the operation is constrained more by copying the data several times over in memory than by the lexsort itself.
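The caching pattern under discussion can be sketched with a toy container: a boolean flag records that duplicates have been summed, so repeated calls are free, and the user is expected to reset the flag if they mutate the arrays. (This is an illustrative class, not scipy's actual `coo_matrix` implementation.)

```python
import numpy as np

class CooLike:
    """Toy COO-style container illustrating canonical-format caching."""

    def __init__(self, row, col, data):
        self.row, self.col, self.data = row, col, data
        self.has_canonical_format = False   # invalidate manually after mutation

    def sum_duplicates(self):
        if self.has_canonical_format:
            return                          # cached: skip the sort/reduce work
        order = np.lexsort((self.row, self.col))
        row, col, data = self.row[order], self.col[order], self.data[order]
        mask = np.append(True, (row[1:] != row[:-1]) | (col[1:] != col[:-1]))
        self.row, self.col = row[mask], col[mask]
        self.data = np.add.reduceat(data, np.nonzero(mask)[0])
        self.has_canonical_format = True

m = CooLike(np.array([0, 0]), np.array([1, 1]), np.array([2.0, 3.0]))
m.sum_duplicates()
# m.data -> [5.0]; a second call returns immediately via the flag
```

The trade-off the comment describes is exactly this: the flag makes repeated canonicalization cheap in common operations, at the cost of trusting the user to clear it after mucking with the data.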
Disabled tests here: https://github.com/scipy/scipy/blob/master/scipy/sparse/tests/test_base.py#L3753. They didn't seem to catch this.
Fair points. I didn't realize that so much is broken in the presence of duplicates, and I agree that silent failure isn't the best behavior. Currently, much of the broken functionality isn't COO-specific […].
See https://github.com/scipy/scipy/blob/master/scipy/sparse/tests/test_base.py#L3764 ("format conversion broken with non-canonical matrix") |
ENH: sparse: adding `sum_duplicates` method to COO sparse matrix
LGTM, merged.
Fixes #3624 by adding a `sum_duplicates` method and calling it before the `todok` conversion. Also adds some test cases for both `sum_duplicates` and `todok` with duplicates.