
dok_matrix updates #7673

Merged 14 commits into scipy:master on Aug 13, 2017
Conversation

Contributor

@akstrfn akstrfn commented Jul 27, 2017

Hi,

I had a problem recently when using pickle on dok_matrix, and I saw that it had been unfixed for quite some time (issue #1188), hence I decided to fix it.

But I also added one additional test (not sure if it's necessary) and made some changes to dok_matrix's __add__ and __radd__ methods: namely, they do some bounds checking that should probably be avoided. I've also written a small script to benchmark some of these things, and it looks like using iterkeys or keys is comparatively slower. Note that I tested only on py3, so the difference on py2 should be even bigger when using keys.

Please let me know what you think, and if this looks good I can try to adapt other methods to directly access the data, i.e. without bounds checks, where useful.

Summary of changes in this PR:

  • fix pickle for dok_matrix in python3 (issue #1188: cPickle protocol 2 fails for sparse.dok_matrix, Trac #661)
  • double for loops changed to itertools.product
  • changed some calls to go directly through dict to avoid additional checks
  • prevent unintentional usage of dok_matrix.update, since it directly sets the data for dok_matrix without validating the input
  • use a generator in combination with dict.update whenever possible
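The generator-plus-dict.update pattern from the last bullet can be sketched like this (an illustrative simplification, not the actual scipy code):

```python
# Sketch of "generator + dict.update": build the (key, value) pairs lazily
# and hand them to dict.update in one call, avoiding per-item __setitem__.
src = {(0, 1): 2.0, (1, 0): 3.0}

out = {}
out.update(((i, j), v * 10) for (i, j), v in src.items())
print(sorted(out.items()))  # [((0, 1), 20.0), ((1, 0), 30.0)]
```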

Cheers,
Aleks

Member

@perimosocordiae perimosocordiae left a comment

Thanks for the contribution!

It appears that dok_matrix was already pickleable, but there are potentially some useful changes to make here.
I've left my comments in-line.

for i, j in itertools.product(xrange(M), xrange(N)):
    aij = dict.get(self, (i, j), 0) + other
    if aij != 0:
        new[i, j] = aij
Member

This isn't bypassing any bounds-checking. In fact, it's equivalent to what we had before due to subclassing.

The usage of itertools.product does allow us to avoid packing/unpacking the index tuple, though:

for ij in itertools.product(xrange(M), xrange(N)):
    aij = self.get(ij, 0) + other
    if aij:
        new[ij] = aij

Contributor Author

@akstrfn akstrfn Jul 28, 2017

You are right, I totally missed this one with itertools. On the other hand, I don't think you are correct about self.get vs dict.get; can you run the script I provided and let me know the results you get? The script is not perfect, but it should give some idea.
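A minimal version of such a micro-benchmark (illustrative only; the script mentioned above is not shown here, and `Sub` is a hypothetical stand-in for a dict subclass like dok_matrix):

```python
import timeit

class Sub(dict):
    pass  # stand-in for a dict subclass such as dok_matrix

d = Sub()
for i in range(1000):
    d[(i, i)] = float(i)

# Compare looking up through the (possibly overridden) method vs. calling
# the dict built-in directly on the instance.
t_self = timeit.timeit(lambda: d.get((500, 500), 0), number=100_000)
t_dict = timeit.timeit(lambda: dict.get(d, (500, 500), 0), number=100_000)
print(f"self.get: {t_self:.4f}s  dict.get: {t_dict:.4f}s")
```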

@@ -290,21 +291,19 @@ def __setitem__(self, index, x):
del self[key]

def __add__(self, other):
# First check if argument is a scalar
if isscalarlike(other):
if isscalarlike(other): # check if argument is a scalar
Member

I'd just remove the comment altogether. (Same with the __radd__ comment below.)

# new.dtype.char = self.dtype.char
elif isinstance(other, dok_matrix):
elif isspmatrix_dok(other):
Member

I don't think this is any clearer than the original code.

Contributor Author

To me it's also not any clearer, but there is a function isspmatrix_dok that, I guess, was supposed to be used for this; otherwise it has no purpose?

if other.shape != self.shape:
raise ValueError("matrix dimensions are not equal")
raise ValueError("matrix dimensions are not equal.")
Member

Capitalize Matrix

@@ -414,6 +411,9 @@ def __itruediv__(self, other):
else:
return NotImplemented

def __reduce__(self):
return dict.__reduce__(self)
Member

This isn't necessary.

Contributor Author

The test_pickle test that I added fails on python3 for me on 2 Ubuntu machines. I just tested and it works in python 2, though. From debugging I found that it goes to spmatrix.__reduce__ for whatever reason, and that returns an empty dict. I am still not completely sure this is the right fix, though, because of all the nuances of subclassing a dict.

Member

Ok, I can reproduce the failure on python 3. I would prefer implementing __getstate__ and __setstate__ (https://docs.python.org/3.1/library/pickle.html#pickle-state), though.

Contributor Author

I tried that, but I didn't manage to get it working; __setstate__ was never entered. I think that's because dict is subclassed, which forces the use of __reduce__. The current solution just calls dict.__reduce__ instead of spmatrix.__reduce__ on py3, and I assume that's what was already happening on py2. If you have an idea for another implementation, let me know and I'll give it a shot.
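The failure mode being discussed can be illustrated with a minimal sketch. `Base` below is a hypothetical stand-in for a base class whose `__reduce__` drops the dict items (analogous to the spmatrix behaviour described above); the real scipy classes differ.

```python
import pickle

class Base:
    # Hypothetical base: its __reduce__ rebuilds the object from its
    # attributes only, silently dropping the stored dict items.
    def __reduce__(self):
        return (self.__class__, (self.shape,), self.__dict__)

class Broken(Base, dict):
    def __init__(self, shape):
        dict.__init__(self)
        self.shape = shape

class Fixed(Broken):
    def __reduce__(self):
        # Delegate to dict so the stored items survive the round-trip.
        return dict.__reduce__(self)

b = Broken((2, 2)); b[(0, 1)] = 3.0
f = Fixed((2, 2)); f[(0, 1)] = 3.0

print(dict(pickle.loads(pickle.dumps(b))))  # {} -- items lost
print(dict(pickle.loads(pickle.dumps(f))))  # {(0, 1): 3.0}
```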

for key, value in iteritems(self):
new[key[1], key[0]] = value
for (lhs, rhs), value in iteritems(self):
new[rhs, lhs] = value
Member

I'd prefer i and j over lhs and rhs.

Contributor Author

I wasn't sure about this either, but since there is a swap between the members I figured it was better not to use i, j, since they look too similar. Looking at it now, lhs, rhs doesn't look much better either.

@@ -3942,6 +3942,25 @@ def test_ticket1160(self):
b[:,0] = 0
assert_(len(b.keys()) == 0, "Unexpected entries in keys")

def test_copy(self):
Member

Contributor Author

sorry, totally missed this when I was searching for it 😄

@akstrfn akstrfn force-pushed the master branch 2 times, most recently from 9eb553b to 9a37448 Compare July 29, 2017 14:54
@akstrfn akstrfn changed the title WIP: dok_matrix updates dok_matrix updates Jul 29, 2017
@akstrfn akstrfn force-pushed the master branch 2 times, most recently from 191bf39 to a52cdea Compare July 31, 2017 21:02
Member

@perimosocordiae perimosocordiae left a comment

A few more nitpicky changes, but I think this should be ready to merge soon.

If possible, could you run the ASV benchmarks to compare your changes to master?

# After committing your changes to your local branch...
asv continuous -b sparse master

To see all changes, even the insignificant ones, you can pass the --factor 1 flag.


def _update(self, data):
""" An update method for dict data defined for direct access to
`dok_matrix` data. Main purpose is to be used for effient conversion
Member

typo: effient -> efficient

def getnnz(self, axis=None):
if axis is not None:
raise NotImplementedError("getnnz over an axis is not implemented "
"for DOK format")
raise NotImplementedError("Getnnz over an axis is not implemented "
Member

Don't capitalize the method name. Could reorder like: "DOK format doesn't support getnnz() over an axis"

@@ -140,7 +155,7 @@ def get(self, key, default=0.):
return dict.get(self, key, default)

def __getitem__(self, index):
"""If key=(i,j) is a pair of integers, return the corresponding
""" If key=(i,j) is a pair of integers, return the corresponding
Member

Don't add space at the start of docstrings (here and elsewhere).

for (i,j),v in iteritems(self):
M, N = self.shape
n_vecs = other.shape[1]
result = np.zeros((M, n_vecs), dtype=upcast(self.dtype, other.dtype))
Member

Can clean this a little further:

result_shape = (self.shape[0], other.shape[1])
result_dtype = upcast(self.dtype, other.dtype)
result = np.zeros(result_shape, dtype=result_dtype)

for j in range(self.shape[1]):
out[0, j] = self[i, j]
return out
""" Returns a copy of row with index `i` of the matrix as a (1 x n) DOK
Member

Returns the i-th row as a (1 x n) DOK matrix.

for i in range(self.shape[0]):
out[i, 0] = self[i, j]
return out
""" Returns a copy of column with index j of the matrix as a (m x 1)
Member

Returns the j-th column as a (m x 1) DOK matrix.

for protocol in range(pickle.HIGHEST_PROTOCOL):
sploaded = pickle.loads(pickle.dumps(datsp, protocol=protocol))
assert_equal(datsp.shape, sploaded.shape)
assert_array_equal(datsp.todense(), sploaded.todense())
Member

Use .toarray() instead.

for key, val in datsp.__dict__.items():
if isinstance(val, np.ndarray):
assert_array_equal(val, sploaded.__dict__[key])
continue
Member

I'd prefer to drop the continue and just use an else-block for the non-ndarray check.

return NotImplemented
return NotImplemented

def __reduce__(self):
Member

I'm still not entirely convinced that this is the correct approach, but we can clean it up later if needed. For now, there should be a comment explaining what this is and why it's needed.

Contributor Author

I think one can get it to work with __getstate__ and __setstate__, but __new__ and __getnewargs__ would also need to be implemented. The reason is that __setstate__ is not called: __setitem__ is called before it, and since the shape attribute does not exist yet, everything breaks. If you know some other way to achieve this let me know, but I think this one is reasonably robust, since it just redirects the work to dict.

Contributor Author

akstrfn commented Aug 3, 2017

@perimosocordiae I've updated the PR to include all your last comments. I've also added one commit with a hacky constructor that builds a dok_matrix from an iterator, but that can probably be done better. I ran the benchmarks; some are good and some are not. Here is the output: bench.txt.

Member

@perimosocordiae perimosocordiae left a comment

Let's leave off the iterator-constructor (commit fa308ad) for now, and we can address that after this PR gets merged.

from other spmatrix classes. Has no checking if `data` is valid."""
return dict.update(self, data)


Member

The PEP8 checker is complaining about the extra newline here.

@perimosocordiae
Member

Your benchmark results look off: too many unrelated timings changed pretty drastically. Are you sure you're doing a fair comparison?

Contributor Author

akstrfn commented Aug 3, 2017

Yes, they look very weird, but it was too late last night for me to figure out why. I ran them twice and both times got completely weird results. I'll remove the iterator constructor and fix that PEP8 issue.

@akstrfn akstrfn force-pushed the master branch 2 times, most recently from 2d47a86 to 315851c Compare August 3, 2017 15:08
newM, newN = shape
M, N = self.shape
if newM < M or newN < N:
# Remove all elements outside new dimensions
for (i, j) in _list(self.keys()):
for (i, j) in list(self.keys()):
Member

Makes an unnecessary copy on Python 2.

Contributor Author

Should be fixed; I also found another place that used dict.keys(). How would you go about measuring whether a copy is made? I used %memit in IPython but did not find any memory difference for a dictionary with 10^7 elements. There is a measurable time difference, though: dict.keys() takes twice as much time as dict.iterkeys().
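On Python 3 the same question can be probed by comparing direct iteration against iterating a materialized key list (a rough sketch; on Python 2 one would compare keys() against iterkeys() instead, and the numbers will vary by machine):

```python
# Illustrative timing: iterating a dict's keys directly vs. via a
# materialized list. The list() call pays an extra allocation and copy.
import timeit

d = {(i, i): float(i) for i in range(10**6)}

direct = timeit.timeit(lambda: sum(1 for _ in d), number=10)
copied = timeit.timeit(lambda: sum(1 for _ in list(d.keys())), number=10)
print(f"direct iteration: {direct:.3f}s, via list copy: {copied:.3f}s")
```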

@akstrfn akstrfn force-pushed the master branch 2 times, most recently from ff40922 to 315851c Compare August 4, 2017 19:12
@perimosocordiae
Member

I ran the benchmarks locally (using python runtests.py --bench-compare HEAD~1 sparse\.). The before column is current master (87385d6) and the after column is the current state of this PR:

    before     after       ratio
+  195.06μs   279.28μs      1.43  sparse.Conversion.time_conversion('coo', 'csc')
+  407.28μs   494.69μs      1.21  sparse.Sum.time_sum_axis1(0.5, 'coo')
+  192.39μs   231.07μs      1.20  sparse.Conversion.time_conversion('coo', 'csr')
+  145.90μs   167.86μs      1.15  sparse.Conversion.time_conversion('csc', 'csr')
+    8.36ms     9.21ms      1.10  sparse.Sum.time_sum_axis0(0.1, 'dia')
+    4.58μs     4.89μs      1.07  sparse.Getset.track_fancy_setitem(1, 'different', 'dok')
+    7.95μs     8.49μs      1.07  sparse.Diagonal.time_diagonal(0.5, 'dia')
+    9.88ms    10.51ms      1.06  sparse.Getset.track_fancy_setitem(10000, 'different', 'csc')
+  241.44μs   255.33μs      1.06  sparse.Densify.time_toarray('coo', 'C')
+   30.39μs    32.00μs      1.05  sparse.NullSlice.time_getrow(0.01, 'csr')
-   47.66μs    45.22μs      0.95  sparse.Getset.time_fancy_getitem(10, 'same', 'dok')
-  718.53μs   676.00μs      0.94  sparse.Conversion.time_conversion('dia', 'coo')
-   16.28ms    15.23ms      0.94  sparse.Conversion.time_conversion('lil', 'dia')
-   48.20μs    45.04μs      0.93  sparse.Getset.time_fancy_getitem(10, 'different', 'dok')
-  713.79μs   666.32μs      0.93  sparse_linalg_onenormest.BenchmarkOneNormEst.time_onenormest(5, 'exact')
-    5.68ms     5.30ms      0.93  sparse.Getset.track_fancy_setitem(1000, 'different', 'csr')
-   10.47ms     9.71ms      0.93  sparse.Matvecs.time_matvecs('dia')
-   12.80ms    11.77ms      0.92  sparse.NullSlice.time_3_rows(0.05, 'csc')
-   30.09ms    25.06ms      0.83  sparse_linalg_lobpcg.Bench.time_sakurai(50, 'lobpcg')
-  103.40μs    76.41μs      0.74  sparse.Getset.time_fancy_getitem(100, 'different', 'dok')
-  106.44μs    75.97μs      0.71  sparse.Getset.time_fancy_getitem(100, 'same', 'dok')
-     1.43s   101.08ms      0.07  sparse.Sum.time_sum_axis0(0.1, 'dok')
-  135.88ms     7.16ms      0.05  sparse.Sum.time_sum_axis0(0.01, 'dok')

No significant regressions (COO -> CSC conversions aren't affected by this PR, so that's probably noise) and some very nice speedups on time_sum_axis0, which somewhat unintuitively exercises the _mul_vector method.

@perimosocordiae perimosocordiae merged commit 5ce91f1 into scipy:master Aug 13, 2017
@perimosocordiae
Member

Merged. Thanks again, @akstrfn!

Contributor Author

akstrfn commented Aug 13, 2017

@perimosocordiae I ran it in a debugger and __rmul__ is called, so it's the transpose method of dok_matrix that makes this difference. That makes sense, since transpose should be much faster now that it uses the dict builtins and operates directly on the data.
I had to check this because the benchmarks would not have made sense otherwise: not much changed in _mul_vector except perhaps faster slicing, and the difference was too large to be explained by better slicing alone. Should I submit the PR with the constructor for dok_matrix that would take iterators?
Thanks for the feedback and benchmarks!
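The transpose speedup described above comes from working on the raw dict data; a simplified sketch of the pattern, using a hypothetical TinyDok stand-in rather than the actual scipy implementation:

```python
class TinyDok(dict):
    """Hypothetical dict-subclass stand-in for dok_matrix."""
    def __init__(self, shape):
        dict.__init__(self)
        self.shape = shape

    def transpose(self):
        M, N = self.shape
        new = TinyDok((N, M))
        # Feed a generator of swapped keys straight into dict.update:
        # no per-item __setitem__ bounds checks, no intermediate list.
        dict.update(new, (((j, i), v) for (i, j), v in dict.items(self)))
        return new

m = TinyDok((2, 3))
m[(0, 2)] = 5.0
t = m.transpose()
print(t.shape, dict(t))  # (3, 2) {(2, 0): 5.0}
```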

@perimosocordiae
Member

Sure, send the next PR when you're ready. I'd recommend working from a new branch in your fork (that is, not master), to avoid git headaches down the road.
