dok_matrix updates #7673
Thanks for the contribution!
It appears that dok_matrix was already pickleable, but there are potentially some useful changes to make here.
I've left my comments in-line.
scipy/sparse/dok.py
if aij != 0:
new[i, j] = aij
for i, j in itertools.product(xrange(M), xrange(N)):
aij = dict.get(self, (i, j), 0) + other
This isn't bypassing any bounds-checking. In fact, it's equivalent to what we had before due to subclassing.
The usage of itertools.product does allow us to avoid packing/unpacking the index tuple, though:
for ij in itertools.product(xrange(M), xrange(N)):
    aij = self.get(ij, 0) + other
    if aij:
        new[ij] = aij
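For readers without the scipy source at hand, the loop suggested above can be sketched on a plain dict. This is only an illustration: `add_scalar`, `entries`, and `shape` are hypothetical names, a plain dict stands in for dok_matrix, and `range` replaces `xrange` so the sketch runs on Python 3.

```python
import itertools


def add_scalar(entries, shape, scalar):
    """Add a scalar to every cell of a dict-of-keys matrix (hypothetical helper).

    `entries` maps (row, col) tuples to nonzero values; missing keys count as 0.
    """
    M, N = shape
    result = {}
    # product() yields ready-made (i, j) tuples, so the index never has to be
    # unpacked and re-packed before it is used as a dict key.
    for ij in itertools.product(range(M), range(N)):
        value = entries.get(ij, 0) + scalar
        if value:  # keep only nonzero results, as a sparse format would
            result[ij] = value
    return result
```

Note that adding a nonzero scalar touches all M * N cells, so the result is generally dense regardless of how the loop is written.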
You are right, I totally missed this one with itertools. On the other hand, I don't think you are correct about self.get vs dict.get. Can you run the script I provided and let me know the results you get? The script is not perfect, but it should give some ideas.
scipy/sparse/dok.py
@@ -290,21 +291,19 @@ def __setitem__(self, index, x):
del self[key]

def __add__(self, other):
# First check if argument is a scalar
if isscalarlike(other):
if isscalarlike(other): # check if argument is a scalar
I'd just remove the comment altogether. (Same with the __radd__ comment below.)
scipy/sparse/dok.py
# new.dtype.char = self.dtype.char
elif isinstance(other, dok_matrix):
elif isspmatrix_dok(other):
I don't think this is any clearer than the original code.
To me it's not any clearer either, but there is a function isspmatrix_dok that, I guess, was meant to be used for this. Otherwise it has no purpose?
scipy/sparse/dok.py
if other.shape != self.shape:
raise ValueError("matrix dimensions are not equal")
raise ValueError("matrix dimensions are not equal.")
Capitalize Matrix
scipy/sparse/dok.py
@@ -414,6 +411,9 @@ def __itruediv__(self, other):
else:
return NotImplemented

def __reduce__(self):
return dict.__reduce__(self)
This isn't necessary.
The test_pickle that I added fails on python3 for me on 2 ubuntu machines. I just tested, and it works in python 2 though. From debugging I found that it goes to spmatrix.__reduce__ for whatever reason, and that returns an empty dict. I am still not completely sure that this is the right fix, though, because of all the nuances of subclassing a dict.
Ok, I can reproduce the failure on python 3. I would prefer implementing __getstate__ and __setstate__ (https://docs.python.org/3.1/library/pickle.html#pickle-state), though.
I tried that but didn't manage to get it working: it never entered __setstate__. I think that's because dict is subclassed, which forces the use of __reduce__? The current solution just calls dict.__reduce__ instead of spmatrix.__reduce__ in py3, and I assume that's what was already happening in py2. If you have an idea for another implementation, let me know so I can give it a shot.
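To make the delegation discussed here concrete, below is a minimal sketch on a toy class, not the actual dok_matrix code. `ShapedDict` and `roundtrip` are hypothetical names. The assumption is that `dict.__reduce__` resolves to the generic `object.__reduce__`, which rebuilds the object through the `dict` constructor and restores instance attributes afterwards, so the bounds-checking `__setitem__` is never invoked during unpickling.

```python
import pickle


class ShapedDict(dict):
    """Hypothetical stand-in: a dict subclass whose __setitem__ needs `shape`."""

    def __init__(self, shape):
        dict.__init__(self)
        self.shape = shape

    def __setitem__(self, key, value):
        i, j = key
        M, N = self.shape  # fails if `shape` has not been set yet
        if not (0 <= i < M and 0 <= j < N):
            raise IndexError("index out of bounds")
        dict.__setitem__(self, key, value)

    def __reduce__(self):
        # Delegate to the generic reduction: the stored items travel as a
        # plain dict and the instance attributes are restored afterwards.
        return dict.__reduce__(self)


def roundtrip(obj, protocol):
    """Pickle and unpickle `obj` with the given protocol."""
    return pickle.loads(pickle.dumps(obj, protocol=protocol))
```

A usage check would loop over `range(pickle.HIGHEST_PROTOCOL + 1)` and compare both the stored items and `shape` of each copy against the original.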
scipy/sparse/dok.py
for key, value in iteritems(self):
new[key[1], key[0]] = value
for (lhs, rhs), value in iteritems(self):
new[rhs, lhs] = value
I'd prefer i and j over lhs and rhs.
I wasn't sure about this either, but since the two members are swapped I figured it was better not to use i, j since they look similar. Looking at it now, though, rhs, lhs does not look much better either.
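For illustration only, the index swap under discussion can be written on a plain dict as follows, using the i, j naming the reviewer prefers. `transpose` is a hypothetical helper, not the scipy method.

```python
def transpose(entries):
    """Return a new dict-of-keys mapping with row and column indices swapped."""
    # Unpacking the key in the loop header avoids key[0] / key[1] indexing.
    return {(j, i): value for (i, j), value in entries.items()}
```

The shape of the result would be the original shape reversed; the sketch leaves shape bookkeeping out.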
scipy/sparse/tests/test_base.py
@@ -3942,6 +3942,25 @@ def test_ticket1160(self):
b[:,0] = 0
assert_(len(b.keys()) == 0, "Unexpected entries in keys")

def test_copy(self):
We already have a test for this: https://github.com/scipy/scipy/blob/master/scipy/sparse/tests/test_base.py#L1784
Sorry, I totally missed this when I was searching for it 😄
9eb553b
to
9a37448
Compare
191bf39
to
a52cdea
Compare
A few more nitpicky changes, but I think this should be ready to merge soon.
If possible, could you run the ASV benchmarks to compare your changes to master?
# After committing your changes to your local branch...
asv continuous -b sparse master
To see all changes, even the insignificant ones, you can pass the --factor 1 flag.
scipy/sparse/dok.py
def _update(self, data):
""" An update method for dict data defined for direct access to
`dok_matrix` data. Main purpose is to be used for effient conversion
typo: effient -> efficient
scipy/sparse/dok.py
def getnnz(self, axis=None):
if axis is not None:
raise NotImplementedError("getnnz over an axis is not implemented "
"for DOK format")
raise NotImplementedError("Getnnz over an axis is not implemented "
Don't capitalize the method name. Could reorder like: "DOK format doesn't support getnnz() over an axis"
scipy/sparse/dok.py
@@ -140,7 +155,7 @@ def get(self, key, default=0.):
return dict.get(self, key, default)

def __getitem__(self, index):
"""If key=(i,j) is a pair of integers, return the corresponding
""" If key=(i,j) is a pair of integers, return the corresponding
Don't add space at the start of docstrings (here and elsewhere).
scipy/sparse/dok.py
for (i,j),v in iteritems(self):
M, N = self.shape
n_vecs = other.shape[1]
result = np.zeros((M, n_vecs), dtype=upcast(self.dtype, other.dtype))
Can clean this a little further:
result_shape = (self.shape[0], other.shape[1])
result_dtype = upcast(self.dtype, other.dtype)
result = np.zeros(result_shape, dtype=result_dtype)
scipy/sparse/dok.py
for j in range(self.shape[1]):
out[0, j] = self[i, j]
return out
""" Returns a copy of row with index `i` of the matrix as a (1 x n) DOK
Returns the i-th row as a (1 x n) DOK matrix.
scipy/sparse/dok.py
for i in range(self.shape[0]):
out[i, 0] = self[i, j]
return out
""" Returns a copy of column with index j of the matrix as a (m x 1)
Returns the j-th column as a (m x 1) DOK matrix.
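As a rough sketch of what row and column extraction on a dict-of-keys layout can look like, the helpers below operate on a plain dict rather than on dok_matrix. `get_row`, `get_col`, and the `(entries, shape)` return convention are hypothetical, and the PR's actual implementation may differ.

```python
def get_row(entries, shape, i):
    """Return the i-th row as ((1 x n) entries, shape); hypothetical helper."""
    M, N = shape
    if not 0 <= i < M:
        raise IndexError("row index out of bounds")
    # Copy only the cells that are actually stored; absent cells stay implicit.
    row = {(0, j): entries[i, j] for j in range(N) if (i, j) in entries}
    return row, (1, N)


def get_col(entries, shape, j):
    """Return the j-th column as ((m x 1) entries, shape); hypothetical helper."""
    M, N = shape
    if not 0 <= j < N:
        raise IndexError("column index out of bounds")
    col = {(i, 0): entries[i, j] for i in range(M) if (i, j) in entries}
    return col, (M, 1)
```

Checking membership before copying keeps explicit zeros out of the result, which matters for a format whose entry count is its length.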
scipy/sparse/tests/test_base.py
for protocol in range(pickle.HIGHEST_PROTOCOL):
sploaded = pickle.loads(pickle.dumps(datsp, protocol=protocol))
assert_equal(datsp.shape, sploaded.shape)
assert_array_equal(datsp.todense(), sploaded.todense())
Use .toarray() instead.
scipy/sparse/tests/test_base.py
for key, val in datsp.__dict__.items():
if isinstance(val, np.ndarray):
assert_array_equal(val, sploaded.__dict__[key])
continue
I'd prefer to drop the continue and just use an else-block for the non-ndarray check.
return NotImplemented
return NotImplemented

def __reduce__(self):
I'm still not entirely convinced that this is the correct approach, but we can clean it up later if needed. For now, there should be a comment explaining what this is and why it's needed.
I think one can get it to work with __getstate__ and __setstate__, but __new__ and __getnewargs__ would also need to be implemented. The reason is that __setstate__ is never reached: __setitem__ is called before it, and at that point the object has no shape attribute, hence everything breaks. If you know some other way to achieve this let me know, but I think this one is reasonably robust since it just redirects the work to dict.
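The ordering described in this comment can be observed with a small experiment on a toy class. This is a sketch under assumptions: `Tracked` and `EVENTS` are hypothetical names, and the class relies on the default reduction that pickle applies to dict subclasses at protocol 2, where the stored items are replayed through `__setitem__` before the instance state is restored.

```python
import pickle

EVENTS = []  # records which hooks run during unpickling, in order


class Tracked(dict):
    """Hypothetical dict subclass with a `shape` attribute and logging hooks."""

    def __init__(self, shape):
        dict.__init__(self)
        self.shape = shape

    def __setitem__(self, key, value):
        # Log whether `shape` is available at the moment an item is stored.
        EVENTS.append(("setitem", hasattr(self, "shape")))
        dict.__setitem__(self, key, value)

    def __setstate__(self, state):
        EVENTS.append(("setstate", True))
        self.__dict__.update(state)


original = Tracked((2, 2))
original[0, 1] = 1.5
EVENTS.clear()  # discard the event logged by the assignment above
restored = pickle.loads(pickle.dumps(original, protocol=2))
print(EVENTS)
```

On CPython 3 the log should show the `setitem` entry first, with `shape` not yet present, followed by `setstate`. A `__setitem__` that reads `self.shape` would therefore raise before `__setstate__` ever runs, which is consistent with the behaviour reported above.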
@perimosocordiae I've updated the PR to include all your last comments. I've also added one commit where there is a hacky constructor that constructs |
Let's leave off the iterator-constructor (commit fa308ad) for now, and we can address that after this PR gets merged.
scipy/sparse/dok.py
from other spmatrix classes. Has no checking if `data` is valid."""
return dict.update(self, data)
The PEP8 checker is complaining about the extra newline here.
Your benchmark results look off: too many unrelated timings changed pretty drastically. Are you sure you're doing a fair comparison?
Yes, they look very weird, but it was too late last night for me to figure out why. I ran them twice and got completely weird results both times. I'll remove the iterator constructor and fix that pep8 issue.
2d47a86
to
315851c
Compare
scipy/sparse/dok.py
newM, newN = shape
M, N = self.shape
if newM < M or newN < N:
# Remove all elements outside new dimensions
for (i, j) in _list(self.keys()):
for (i, j) in list(self.keys()):
Makes an unnecessary copy on Python 2.
Should be fixed. I also found another place which used dict.keys(). How would you go about measuring whether a copy is made? I used %memit in ipython but did not find any memory difference for a dictionary with 10^7 elements. There is a measurable time difference, though: dict.keys() takes twice as much time as dict.iterkeys().
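For context on why a key snapshot is needed in the shrink loop at all, here is a minimal Python 3 sketch on a plain dict; `shrink` is a hypothetical helper, not the scipy method. On Python 3, `keys()` returns a view, and deleting entries while iterating over that view raises a RuntimeError, so the keys are copied once with `list()`. On Python 2, `keys()` already returns a list, which is why wrapping it in `list()` there makes the second copy the reviewer points out.

```python
def shrink(entries, new_shape):
    """Drop, in place, all entries that fall outside `new_shape`."""
    new_m, new_n = new_shape
    # Snapshot the keys first: the dict is mutated inside the loop.
    for i, j in list(entries.keys()):
        if i >= new_m or j >= new_n:
            del entries[i, j]
    return entries
```

An alternative that avoids the snapshot is to collect only the out-of-bounds keys in a first pass and delete them in a second, which copies fewer keys when most entries survive the resize.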
ff40922
to
315851c
Compare
I ran the benchmarks locally (using
No significant regressions (COO -> CSC conversions aren't affected by this PR, so that's probably noise) and some very nice speedups on |
Merged. Thanks again, @akstrfn!
@perimosocordiae I ran it in a debugger and |
Sure, send the next PR when you're ready. I'd recommend working from a new branch in your fork (that is, not |
Hi,
I recently had a problem when using pickle on dok_matrix, and I saw that it had been unfixed for quite some time (issue #1188), so I decided to fix it.
I also added one additional test (not sure if necessary) and made some changes to dok_matrix's __add__ and __radd__ methods. Namely, there is some sort of bounds checking that should probably be avoided in these methods. I've also written a small script to benchmark some of these things, and it looks like using iterkeys or keys is comparably slower. Note that I tested only py3, so the difference on py2 should be even bigger when using keys.
Please let me know what you think of it, and if this looks good I can try to adapt other methods to access the data directly, i.e. without bounds checks, when useful.
Summary of changes in this PR:
dok_matrix in python3 (related issue: cPickle protocol 2 fails for sparse.dok_matrix (Trac #661) #1188)
itertools.product
dict to avoid additional checks
dok_matrix.update since it directly sets the data for dok_matrix with no checks if the input is valid.
dict.update whenever possible
Cheers,
Aleks