Partition #2

mrocklin · 2014-04-06T16:04:40Z

OK, here is a "getting the feet wet" pull request.

At first I just relied on zip and izip_longest, as we do in toolz. This yielded a 30% speedup.

Then I tried making an iterator class and ran into weird issues. I get a fair number of stackoverflows. In particular I get some strange integer wraparound at 256. Does Cython use one byte integers somehow? Anyway, if you have any thoughts on debugging this I'd be interested.

Fail case on my machine

import cytoolz
list(cytoolz.partition(5, range(1000)))

oddly this works

list(cytoolz.partition(5, range(200)))

but this doesn't

timeit list(cytoolz.partition(5, range(200)))

Conflicts: cytoolz/itertoolz/core.pyx cytoolz/itertoolz/tests/test_core.py

mrocklin · 2014-04-06T16:05:02Z

@eriknw do you get notifications on this repository?

eriknw · 2014-04-06T18:18:27Z

Yeah, I get notifications.

PyTuple_SET_ITEM "steals" a reference (it takes responsibility for garbage collecting the element), so you need to do Py_INCREF(val) before calling PyTuple_SET_ITEM. Unless you handle PyObject* values, you almost never need to manually increment or decrements reference counts. Cython is actually pretty nice in this regard. Manually setting tuples and lists are the only exceptions I know about.

I do this for get. It was one of the lessons I learned and wrote down. I should probably share these in a wiki!

mrocklin · 2014-04-06T18:24:42Z

Thanks for the tip. That worked wonderfully.

Unfortunately my custom code is slower than the zip solution.

In [1]: import cytoolz

In [2]: import toolz

In [3]: data = range(1000)

In [4]: timeit list(toolz.partition(5, data))
100000 loops, best of 3: 12.9 µs per loop

In [5]: timeit list(cytoolz.partition(5, data))
100000 loops, best of 3: 16.2 µs per loop

I'll revert or (if I get adventurous), look into line profiling (I remember being impressed with cython tools for this).

eriknw · 2014-04-06T19:22:32Z

Yeah, I wasn't sure how to best go about partition. Since your (perfectly reasonable) approach didn't beat the original, I bet the original implementation is pretty good! I think there is room for improvement using advanced (read: dirty-ish) techniques, but can you benchmark the following for ideal partitions to see if it's worth pursuing?

    def __next__(self):
        cdef tuple result
        cdef int i 
        cdef object val
        result = PyTuple_New(self.n)
        for i in range(self.n):  # efficient loop if `i` is cdef integer type
            val = next(self.seq)
            Py_INCREF(val)
            PyTuple_SET_ITEM(result, i, val)
        return result

I actually haven't done much with profile yet. I should probably try it out. Annotations--cython -a some_file.pxd, which creates an html file--are also useful.

mrocklin · 2014-04-06T19:30:00Z

No dice. I'll rebase out the recent commits and stick with the solution in mrocklin@4c29e14

In [1]: import toolz
In [2]: import cytoolz

In [3]: data = range(1000)

In [4]: timeit list(toolz.partition(5, data))
100000 loops, best of 3: 12.5 µs per loop

In [5]: timeit list(cytoolz.partition(5, data))
100000 loops, best of 3: 14.5 µs per loop

eriknw · 2014-04-06T19:48:27Z

Awesome (yes, I think it's awesome when toolz wins)!

Does map return an iterator or a list? I actually haven't used map in cytoolz yet, and you may have noticed that cytoolz.compatibility doesn't yet exist.

Another pointer: enumerate also makes efficient loops if i is an integer type. It's not more efficient than if you do it yourself, but it is as efficient and can be more readable.

Nice job getting your feet wet by the way.

I can't wait to flatten the directory structure in cytoolz, which will reduce the number of things to do when adding a new function.

eriknw · 2014-04-08T15:17:56Z

I played around with this some more, and, wow, the toolz implementation is insane! We should definitely stick with it.

We'll need to add cytoolz/compatibility.py for things like map, zip, zip_longest, etc.

I'm going to go ahead and flatten cytoolz like you did for toolz. It will then be easy to add the stock implementation of partition, and--unless you want to--I can go ahead and do it if you don't mind. It's awesome that toolz.partition performs so well; thanks for this discovery!

mrocklin · 2014-04-08T15:19:27Z

Yeah, I was confused the most recent time that I did timings, toolz was beating cytoolz. Perhaps this is always the way that it was and when I reported the 30% improvement I had it flipped around.

eriknw · 2014-04-08T15:23:19Z

Perhaps this is always the way that it was and when I reported the 30% improvement I had it flipped around.

I wouldn't be surprised. I don't see where there is 30% to be gained. Heh, you chose a heck of a function with which to get your feet wet.

…fast!

eriknw · 2014-04-09T02:37:59Z

Closing. I just added the stock version after flattening the directory structure.

mrocklin added 3 commits April 4, 2014 16:22

add partition

4c29e14

Merge branch 'master' into partition

c2f0658

Conflicts: cytoolz/itertoolz/core.pyx cytoolz/itertoolz/tests/test_core.py

class solution for partition

aaabf7a

eriknw added a commit that referenced this pull request Apr 9, 2014

Add partition. As @mrocklin discovered in #2, the toolz version is …

4f564e3

…fast!

eriknw closed this Apr 9, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Partition #2

Partition #2

mrocklin commented Apr 6, 2014

mrocklin commented Apr 6, 2014

eriknw commented Apr 6, 2014

mrocklin commented Apr 6, 2014

eriknw commented Apr 6, 2014

mrocklin commented Apr 6, 2014

eriknw commented Apr 6, 2014

eriknw commented Apr 8, 2014

mrocklin commented Apr 8, 2014

eriknw commented Apr 8, 2014

eriknw commented Apr 9, 2014

Partition #2

Partition #2

Conversation

mrocklin commented Apr 6, 2014

mrocklin commented Apr 6, 2014

eriknw commented Apr 6, 2014

mrocklin commented Apr 6, 2014

eriknw commented Apr 6, 2014

mrocklin commented Apr 6, 2014

eriknw commented Apr 6, 2014

eriknw commented Apr 8, 2014

mrocklin commented Apr 8, 2014

eriknw commented Apr 8, 2014

eriknw commented Apr 9, 2014