Skip to content

Conversation

@groutr
Copy link
Contributor

@groutr groutr commented Jun 5, 2018

This PR fixes #387.

There is a slight performance hit, however for large sequences it is relatively small:

In [31]: seq = list(range(10000000))
In [32]: %timeit list(partition_all_old(3, seq))
411 ms ± 6.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [33]: %timeit list(partition_all(3, seq))
409 ms ± 1.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The overhead is a little more noticeable for small sequences:

In [40]: seq = list(range(11))
In [41]: %timeit list(partition_all_old(2, seq))
1.87 µs ± 5.79 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [42]: %timeit list(partition_all(2, seq))
2.17 µs ± 21.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@eriknw
Copy link
Member

eriknw commented Jun 5, 2018

Cool, thanks @groutr!

TravisCI isn't happy about pep8 check. Need to add a space between - operator.

Also, any chance of adding a regression test?

@groutr
Copy link
Contributor Author

groutr commented Jun 5, 2018

@eriknw I can add some more tests. I think I might be able to be smarter about calculating the index of no_pad and avoid looping altogether.
Also, my performance tests are a little flawed because I misunderstood the role of n. In the worst case, performance can actually drop quite a bit.

# seq = list(range(10000001))
In [23]: %timeit list(partition_all_old(len(seq)//2, seq))
381 ms ± 2.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [24]: %timeit list(partition_all(len(seq)//2, seq))
684 ms ± 19.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I'm going to think about this a bit more tonight.

@eriknw
Copy link
Member

eriknw commented Jun 5, 2018

Thanks, I appreciate your attention to detail.

@groutr
Copy link
Contributor Author

groutr commented Jun 5, 2018

In [75]: seq = list(range(1000001))
In [76]: %timeit list(partition_all(len(seq)//2, seq))
34.6 ms ± 75.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [77]: %timeit list(partition_all_old(len(seq)//2, seq))
34.7 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [78]: %timeit list(partition_all_old(len(seq)//2, iter(seq)))
34.7 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [79]: %timeit list(partition_all(len(seq)//2, iter(seq)))
35.3 ms ± 655 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Awww, better. (These times are not comparable to previous times, different machine).

@groutr
Copy link
Contributor Author

groutr commented Jun 5, 2018

@eriknw I'll add regression test and pep8 stuff later tonight.

@groutr
Copy link
Contributor Author

groutr commented Jun 6, 2018

@eriknw, one nice benefit of this PR is that we no longer do any actual equality tests here. Only identity testing is done.
One pathological case for the old version of partition_all would be a list of these objects

class SlowCompare(object):
    def __eq__(self, other):
        time.sleep(1)
        return self.__class__ == other.__class__

And the numbers

In [5]: %timeit list(partition_all_old(11, [SlowCompare()]*21)) # <--- *very* slow
10 s ± 753 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Use iter(...) to trigger slow path.
In [6]: %timeit list(partition_all(11, iter([SlowCompare()]*21)))
4.8 µs ± 55.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@groutr
Copy link
Contributor Author

groutr commented Jun 12, 2018

@eriknw does this look good to you?

@eriknw
Copy link
Member

eriknw commented Jun 13, 2018

LGTM, thanks again @groutr! Merging.

@eriknw eriknw merged commit 2bd9139 into pytoolz:master Jun 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

partition_all fails for sequence of pd.DataFrames

2 participants