
dok_matrix updates #7673

Merged 14 commits into scipy:master on Aug 13, 2017
Conversation

Contributor

@akstrfn akstrfn commented Jul 27, 2017

Hi,

I had a problem recently when using pickle on dok_matrix, and I saw that it had been unfixed for quite some time (issue #1188), hence I decided to fix it.

But I also added one additional test (not sure if it's necessary) and made some changes to dok_matrix's __add__ and __radd__ methods: namely, they do some bounds checking that should probably be avoided. I've also written a small script to benchmark some of these things, and it looks like using iterkeys or keys is comparatively slower. Note that I tested only on py3, so the difference on py2 should be even bigger when using keys.

Please let me know what you think, and if this looks good I can try to adapt other methods to directly access the data, i.e. without bounds checks, where useful.

Summary of changes in this PR:

  • fix pickle for dok_matrix in python3 (issue #1188: cPickle protocol 2 fails for sparse.dok_matrix, Trac #661)
  • double for loops changed to itertools.product
  • changed some calls to go directly through dict to avoid additional checks
  • prevent unintentional usage of dok_matrix.update, since it directly sets the data for dok_matrix without validating the input
  • use a generator in combination with dict.update whenever possible
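The generator-plus-dict.update pattern from the last bullet can be sketched like this (an illustrative simplification, not the actual scipy code):

```python
# Sketch of "generator + dict.update": build the (key, value) pairs lazily
# and hand them to dict.update in one call, avoiding per-item __setitem__.
src = {(0, 1): 2.0, (1, 0): 3.0}

out = {}
out.update(((i, j), v * 10) for (i, j), v in src.items())
print(sorted(out.items()))  # [((0, 1), 20.0), ((1, 0), 30.0)]
```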

Cheers,
Aleks

Member

@perimosocordiae perimosocordiae left a comment

Thanks for the contribution!

It appears that dok_matrix was already pickleable, but there are potentially some useful changes to make here.
I've left my comments in-line.

for i, j in itertools.product(xrange(M), xrange(N)):
    aij = dict.get(self, (i, j), 0) + other
    if aij != 0:
        new[i, j] = aij
Member

This isn't bypassing any bounds-checking. In fact, it's equivalent to what we had before due to subclassing.

The usage of itertools.product does allow us to avoid packing/unpacking the index tuple, though:

for ij in itertools.product(xrange(M), xrange(N)):
    aij = self.get(ij, 0) + other
    if aij:
        new[ij] = aij

Contributor Author

@akstrfn akstrfn Jul 28, 2017

You are right, I totally missed this one with itertools. On the other hand, I don't think you are correct about self.get vs dict.get; can you run the script I provided and let me know the results you get? The script is not perfect, but it should give some idea.
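A minimal version of such a micro-benchmark (illustrative only; the script mentioned above is not shown here, and `Sub` is a hypothetical stand-in for a dict subclass like dok_matrix):

```python
import timeit

class Sub(dict):
    pass  # stand-in for a dict subclass such as dok_matrix

d = Sub()
for i in range(1000):
    d[(i, i)] = float(i)

# Compare looking up through the (possibly overridden) method vs. calling
# the dict built-in directly on the instance.
t_self = timeit.timeit(lambda: d.get((500, 500), 0), number=100_000)
t_dict = timeit.timeit(lambda: dict.get(d, (500, 500), 0), number=100_000)
print(f"self.get: {t_self:.4f}s  dict.get: {t_dict:.4f}s")
```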

@@ -290,21 +291,19 @@ def __setitem__(self, index, x):
del self[key]

def __add__(self, other):
# First check if argument is a scalar
if isscalarlike(other):
if isscalarlike(other): # check if argument is a scalar
Member

I'd just remove the comment altogether. (Same with the __radd__ comment below.)

# new.dtype.char = self.dtype.char
elif isinstance(other, dok_matrix):
elif isspmatrix_dok(other):
Member

I don't think this is any clearer than the original code.

Contributor Author

To me it's also not any clearer, but there is a function isspmatrix_dok that, I guess, was supposed to be used for this; otherwise it has no purpose?

if other.shape != self.shape:
raise ValueError("matrix dimensions are not equal")
raise ValueError("matrix dimensions are not equal.")
Member

Capitalize Matrix

@@ -414,6 +411,9 @@ def __itruediv__(self, other):
else:
return NotImplemented

def __reduce__(self):
return dict.__reduce__(self)
Member

This isn't necessary.

Contributor Author

The test_pickle test that I added fails on python3 for me on 2 Ubuntu machines. I just tested and it works in python 2, though. From debugging I found that it goes to spmatrix.__reduce__ for whatever reason, and that returns an empty dict. I am still not completely sure this is the right fix, though, because of all the nuances of subclassing a dict.

Member

Ok, I can reproduce the failure on python 3. I would prefer implementing __getstate__ and __setstate__ (https://docs.python.org/3.1/library/pickle.html#pickle-state), though.

Contributor Author

I tried that, but I didn't manage to get it working; __setstate__ was never entered. I think that's because dict is subclassed, which forces the use of __reduce__. The current solution just calls dict.__reduce__ instead of spmatrix.__reduce__ on py3, and I assume that's what was already happening on py2. If you have an idea for another implementation, let me know and I'll give it a shot.
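The failure mode being discussed can be illustrated with a minimal sketch. `Base` below is a hypothetical stand-in for a base class whose `__reduce__` drops the dict items (analogous to the spmatrix behaviour described above); the real scipy classes differ.

```python
import pickle

class Base:
    # Hypothetical base: its __reduce__ rebuilds the object from its
    # attributes only, silently dropping the stored dict items.
    def __reduce__(self):
        return (self.__class__, (self.shape,), self.__dict__)

class Broken(Base, dict):
    def __init__(self, shape):
        dict.__init__(self)
        self.shape = shape

class Fixed(Broken):
    def __reduce__(self):
        # Delegate to dict so the stored items survive the round-trip.
        return dict.__reduce__(self)

b = Broken((2, 2)); b[(0, 1)] = 3.0
f = Fixed((2, 2)); f[(0, 1)] = 3.0

print(dict(pickle.loads(pickle.dumps(b))))  # {} -- items lost
print(dict(pickle.loads(pickle.dumps(f))))  # {(0, 1): 3.0}
```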

for key, value in iteritems(self):
new[key[1], key[0]] = value
for (lhs, rhs), value in iteritems(self):
new[rhs, lhs] = value
Member

I'd prefer i and j over lhs and rhs.

Contributor Author

I wasn't sure about this either, but since there is a swap between the members I figured it was better not to use i, j, since they look too similar. Looking at it now, lhs, rhs doesn't look much better either.

@@ -3942,6 +3942,25 @@ def test_ticket1160(self):
b[:,0] = 0
assert_(len(b.keys()) == 0, "Unexpected entries in keys")

def test_copy(self):
Member

Contributor Author

sorry, totally missed this when I was searching for it 😄

@akstrfn akstrfn force-pushed the master branch 2 times, most recently from 9eb553b to 9a37448 Compare July 29, 2017 14:54
@akstrfn akstrfn changed the title WIP: dok_matrix updates dok_matrix updates Jul 29, 2017
@akstrfn akstrfn force-pushed the master branch 2 times, most recently from 191bf39 to a52cdea Compare July 31, 2017 21:02
Member

@perimosocordiae perimosocordiae left a comment

A few more nitpicky changes, but I think this should be ready to merge soon.

If possible, could you run the ASV benchmarks to compare your changes to master?

# After committing your changes to your local branch...
asv continuous -b sparse master

To see all changes, even the insignificant ones, you can pass the --factor 1 flag.


def _update(self, data):
""" An update method for dict data defined for direct access to
`dok_matrix` data. Main purpose is to be used for effient conversion
Member

typo: effient -> efficient

def getnnz(self, axis=None):
if axis is not None:
raise NotImplementedError("getnnz over an axis is not implemented "
"for DOK format")
raise NotImplementedError("Getnnz over an axis is not implemented "
Member

Don't capitalize the method name. Could reorder like: "DOK format doesn't support getnnz() over an axis"

@@ -140,7 +155,7 @@ def get(self, key, default=0.):
return dict.get(self, key, default)

def __getitem__(self, index):
"""If key=(i,j) is a pair of integers, return the corresponding
""" If key=(i,j) is a pair of integers, return the corresponding
Member

Don't add space at the start of docstrings (here and elsewhere).

for (i,j),v in iteritems(self):
M, N = self.shape
n_vecs = other.shape[1]
result = np.zeros((M, n_vecs), dtype=upcast(self.dtype, other.dtype))
Member

Can clean this a little further:

result_shape = (self.shape[0], other.shape[1])
result_dtype = upcast(self.dtype, other.dtype)
result = np.zeros(result_shape, dtype=result_dtype)

for j in range(self.shape[1]):
out[0, j] = self[i, j]
return out
""" Returns a copy of row with index `i` of the matrix as a (1 x n) DOK
Member

Returns the i-th row as a (1 x n) DOK matrix.

for i in range(self.shape[0]):
out[i, 0] = self[i, j]
return out
""" Returns a copy of column with index j of the matrix as a (m x 1)
Member

Returns the j-th column as a (m x 1) DOK matrix.

for protocol in range(pickle.HIGHEST_PROTOCOL):
sploaded = pickle.loads(pickle.dumps(datsp, protocol=protocol))
assert_equal(datsp.shape, sploaded.shape)
assert_array_equal(datsp.todense(), sploaded.todense())
Member

Use .toarray() instead.

for key, val in datsp.__dict__.items():
if isinstance(val, np.ndarray):
assert_array_equal(val, sploaded.__dict__[key])
continue
Member

I'd prefer to drop the continue and just use an else-block for the non-ndarray check.

return NotImplemented
return NotImplemented

def __reduce__(self):
Member

I'm still not entirely convinced that this is the correct approach, but we can clean it up later if needed. For now, there should be a comment explaining what this is and why it's needed.

Contributor Author

I think one can get it to work with __getstate__ and __setstate__, but __new__ and __getnewargs__ would also need to be implemented. The reason is that __setstate__ is not called: __setitem__ is called before it, and since the shape attribute does not exist yet, everything breaks. If you know some other way to achieve this let me know, but I think this one is reasonably robust, since it just redirects the work to dict.

Contributor Author

akstrfn commented Aug 3, 2017

@perimosocordiae I've updated the PR to include all your last comments. I've also added one commit with a hacky constructor that builds a dok_matrix from an iterator, but that can probably be done better. I ran the benchmarks; some are good and some are not. Here is the output: bench.txt.

Member

@perimosocordiae perimosocordiae left a comment

Let's leave off the iterator-constructor (commit fa308ad) for now, and we can address that after this PR gets merged.

from other spmatrix classes. Has no checking if `data` is valid."""
return dict.update(self, data)


Member

The PEP8 checker is complaining about the extra newline here.

@perimosocordiae
Member

Your benchmark results look off: too many unrelated timings changed pretty drastically. Are you sure you're doing a fair comparison?

Contributor Author

akstrfn commented Aug 3, 2017

Yes, they look very weird, but it was too late last night for me to figure out why. I ran them twice and both times got completely weird results. I'll remove the iterator constructor and fix that PEP8 issue.

@akstrfn akstrfn force-pushed the master branch 2 times, most recently from 2d47a86 to 315851c Compare August 3, 2017 15:08
newM, newN = shape
M, N = self.shape
if newM < M or newN < N:
# Remove all elements outside new dimensions
for (i, j) in _list(self.keys()):
for (i, j) in list(self.keys()):
Member

Makes an unnecessary copy on Python 2.

Contributor Author

Should be fixed; I also found another place that used dict.keys(). How would you go about measuring whether a copy is made? I used %memit in IPython but did not find any memory difference for a dictionary with 10^7 elements. There is a measurable time difference, though: dict.keys() takes twice as much time as dict.iterkeys().
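On Python 3 the same question can be probed by comparing direct iteration against iterating a materialized key list (a rough sketch; on Python 2 one would compare keys() against iterkeys() instead, and the numbers will vary by machine):

```python
# Illustrative timing: iterating a dict's keys directly vs. via a
# materialized list. The list() call pays an extra allocation and copy.
import timeit

d = {(i, i): float(i) for i in range(10**6)}

direct = timeit.timeit(lambda: sum(1 for _ in d), number=10)
copied = timeit.timeit(lambda: sum(1 for _ in list(d.keys())), number=10)
print(f"direct iteration: {direct:.3f}s, via list copy: {copied:.3f}s")
```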

@akstrfn akstrfn force-pushed the master branch 2 times, most recently from ff40922 to 315851c Compare August 4, 2017 19:12
@perimosocordiae
Member

I ran the benchmarks locally (using python runtests.py --bench-compare HEAD~1 sparse\.). The before column is current master (87385d6) and the after column is the current state of this PR:

    before     after       ratio
+  195.06μs   279.28μs      1.43  sparse.Conversion.time_conversion('coo', 'csc')
+  407.28μs   494.69μs      1.21  sparse.Sum.time_sum_axis1(0.5, 'coo')
+  192.39μs   231.07μs      1.20  sparse.Conversion.time_conversion('coo', 'csr')
+  145.90μs   167.86μs      1.15  sparse.Conversion.time_conversion('csc', 'csr')
+    8.36ms     9.21ms      1.10  sparse.Sum.time_sum_axis0(0.1, 'dia')
+    4.58μs     4.89μs      1.07  sparse.Getset.track_fancy_setitem(1, 'different', 'dok')
+    7.95μs     8.49μs      1.07  sparse.Diagonal.time_diagonal(0.5, 'dia')
+    9.88ms    10.51ms      1.06  sparse.Getset.track_fancy_setitem(10000, 'different', 'csc')
+  241.44μs   255.33μs      1.06  sparse.Densify.time_toarray('coo', 'C')
+   30.39μs    32.00μs      1.05  sparse.NullSlice.time_getrow(0.01, 'csr')
-   47.66μs    45.22μs      0.95  sparse.Getset.time_fancy_getitem(10, 'same', 'dok')
-  718.53μs   676.00μs      0.94  sparse.Conversion.time_conversion('dia', 'coo')
-   16.28ms    15.23ms      0.94  sparse.Conversion.time_conversion('lil', 'dia')
-   48.20μs    45.04μs      0.93  sparse.Getset.time_fancy_getitem(10, 'different', 'dok')
-  713.79μs   666.32μs      0.93  sparse_linalg_onenormest.BenchmarkOneNormEst.time_onenormest(5, 'exact')
-    5.68ms     5.30ms      0.93  sparse.Getset.track_fancy_setitem(1000, 'different', 'csr')
-   10.47ms     9.71ms      0.93  sparse.Matvecs.time_matvecs('dia')
-   12.80ms    11.77ms      0.92  sparse.NullSlice.time_3_rows(0.05, 'csc')
-   30.09ms    25.06ms      0.83  sparse_linalg_lobpcg.Bench.time_sakurai(50, 'lobpcg')
-  103.40μs    76.41μs      0.74  sparse.Getset.time_fancy_getitem(100, 'different', 'dok')
-  106.44μs    75.97μs      0.71  sparse.Getset.time_fancy_getitem(100, 'same', 'dok')
-     1.43s   101.08ms      0.07  sparse.Sum.time_sum_axis0(0.1, 'dok')
-  135.88ms     7.16ms      0.05  sparse.Sum.time_sum_axis0(0.01, 'dok')

No significant regressions (COO -> CSC conversions aren't affected by this PR, so that's probably noise) and some very nice speedups on time_sum_axis0, which somewhat unintuitively exercises the _mul_vector method.

@perimosocordiae perimosocordiae merged commit 5ce91f1 into scipy:master Aug 13, 2017
@perimosocordiae
Member

Merged. Thanks again, @akstrfn!

Contributor Author

akstrfn commented Aug 13, 2017

@perimosocordiae I ran it in a debugger and __rmul__ is called, so it's the transpose method of dok_matrix that makes this difference. That makes sense, since transpose should be much faster now that it uses the dict builtins and operates directly on the data.
I had to check this because the benchmarks would not have made sense otherwise: not much changed in _mul_vector except perhaps faster slicing, and the difference was too large to be explained by better slicing alone. Should I submit the PR with the constructor for dok_matrix that would take iterators?
Thanks for the feedback and benchmarks!
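The transpose speedup described above comes from working on the raw dict data; a simplified sketch of the pattern, using a hypothetical TinyDok stand-in rather than the actual scipy implementation:

```python
class TinyDok(dict):
    """Hypothetical dict-subclass stand-in for dok_matrix."""
    def __init__(self, shape):
        dict.__init__(self)
        self.shape = shape

    def transpose(self):
        M, N = self.shape
        new = TinyDok((N, M))
        # Feed a generator of swapped keys straight into dict.update:
        # no per-item __setitem__ bounds checks, no intermediate list.
        dict.update(new, (((j, i), v) for (i, j), v in dict.items(self)))
        return new

m = TinyDok((2, 3))
m[(0, 2)] = 5.0
t = m.transpose()
print(t.shape, dict(t))  # (3, 2) {(2, 0): 5.0}
```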

@perimosocordiae
Member

Sure, send the next PR when you're ready. I'd recommend working from a new branch in your fork (that is, not master), to avoid git headaches down the road.
