
PERF: speed up multi-key groupby #8128

Merged
merged 1 commit into pandas-dev:master from behzadnouri:groupby-speed-up on Aug 29, 2014

Conversation

behzadnouri
Contributor

Improves multi-key groupby speed. On master:

In [3]: pd.__version__
Out[3]: '0.14.1-276-g995f91c'

In [4]: np.random.seed(2718281)

In [5]: n = 20000

In [6]: df = pd.DataFrame(np.random.randint(1, n, (n, 3)),
   ...:         columns=['jim', 'joe', 'jolie'])

In [7]: %timeit df.groupby(['jim', 'joe'])['jolie'].transform('max')
1 loops, best of 3: 1.09 s per loop

In [8]: df['joe'] = df['jim']

In [9]: %timeit df.groupby(['jim', 'joe'])['jolie'].transform('max')
1 loops, best of 3: 1.02 s per loop

Note that on master the timing is not responsive to the reduction in the number of groups. With this patch:

In [9]: %timeit df.groupby(['jim', 'joe'])['jolie'].transform('max')
10 loops, best of 3: 122 ms per loop

In [10]: df['joe'] = df['jim']

In [11]: %timeit df.groupby(['jim', 'joe'])['jolie'].transform('max')
10 loops, best of 3: 82.3 ms per loop
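As a rough illustration of why the timings can now track the observed number of groups (a sketch of the general idea of compressing multi-key labels, not necessarily the exact code in this patch — the factorize-based combination below is an assumption for illustration):

import numpy as np
import pandas as pd

np.random.seed(2718281)
n = 20000
df = pd.DataFrame(np.random.randint(1, n, (n, 3)),
                  columns=['jim', 'joe', 'jolie'])

# factorize each key, then combine into one flat label per row
jim_codes, jim_uniques = pd.factorize(df['jim'])
joe_codes, joe_uniques = pd.factorize(df['joe'])
combined = jim_codes.astype(np.int64) * len(joe_uniques) + joe_codes

# compress the combined labels down to the groups that actually occur,
# so later work scales with observed groups, not the Cartesian product
comp_ids, obs_group_ids = pd.factorize(combined)

print(len(obs_group_ids))  # number of observed (jim, joe) groups, at most n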

vbench results:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_transform_multi_key2                 |  48.2820 | 810.2150 |   0.0596 |
groupby_transform_multi_key4                 | 137.5050 | 1986.8030 |   0.0692 |
groupby_transform_multi_key1                 |  70.1749 | 841.8140 |   0.0834 |
groupby_transform_multi_key3                 | 713.8393 | 2613.6247 |   0.2731 |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [11cc057] : Merge branch 'groupby-speed-up' of https://github.com/behzadnouri/pandas into behzadnouri-groupby-speed-up
Base   [3bb0803] : Merge pull request #8103 from sinhrks/pivot_dt

BUG: pivot_table raises KeyError with nameless index and columns

@jreback
Contributor

jreback commented Aug 28, 2014

  • need a vbench (or 2) (and then show the results)
  • release note

@jreback jreback added this to the 0.15.0 milestone Aug 28, 2014
@behzadnouri behzadnouri changed the title ENH: speed up multi-key groupby PERF: speed up multi-key groupby Aug 28, 2014
@behzadnouri
Contributor Author

I had some dependency issues, so I ran the benchmarks manually.
On master:

>>> groupby_transform_multi_key1.run()
{'loops': 1, 'timing': 1880.3930282592773, 'repeat': 3, 'succeeded': True, 'units': 'ms'}
>>> groupby_transform_multi_key2.run()
{'loops': 1, 'timing': 1886.4881992340088, 'repeat': 3, 'succeeded': True, 'units': 'ms'}
>>> groupby_transform_multi_key3.run()
{'loops': 1, 'timing': 5588.175058364868, 'repeat': 3, 'succeeded': True, 'units': 'ms'}
>>> groupby_transform_multi_key4.run()
{'loops': 1, 'timing': 4688.298940658569, 'repeat': 3, 'succeeded': True, 'units': 'ms'}

on branch:

>>> groupby_transform_multi_key1.run()
{'loops': 1, 'timing': 106.5061092376709, 'repeat': 3, 'succeeded': True, 'units': 'ms'}
>>> groupby_transform_multi_key2.run()
{'loops': 10, 'timing': 73.46320152282715, 'repeat': 3, 'succeeded': True, 'units': 'ms'}
>>> groupby_transform_multi_key3.run()
{'loops': 1, 'timing': 1123.687982559204, 'repeat': 3, 'succeeded': True, 'units': 'ms'}
>>> groupby_transform_multi_key4.run()
{'loops': 1, 'timing': 200.47903060913086, 'repeat': 3, 'succeeded': True, 'units': 'ms'}

groupby_transform_multi_key[3|4] need about 6.5 GB of memory to run on master.
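For reference, a minimal sketch of how one of these benchmarks could be defined and run directly — the setup mirrors the timing session in the description, but the vbench import and Benchmark arguments here are assumptions, not the exact vb_suite entry added in this PR:

from vbench.api import Benchmark

setup = """
import numpy as np
import pandas as pd

np.random.seed(2718281)
n = 20000
df = pd.DataFrame(np.random.randint(1, n, (n, 3)),
                  columns=['jim', 'joe', 'jolie'])
"""

groupby_transform_multi_key1 = Benchmark(
    "df.groupby(['jim', 'joe'])['jolie'].transform('max')",
    setup,
    name='groupby_transform_multi_key1',
)

# run() reports a dict with 'loops', 'timing' (ms), 'repeat' and 'succeeded',
# as in the output above
print(groupby_transform_multi_key1.run())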

@jreback
Contributor

jreback commented Aug 29, 2014

why is the memory usage so high? the Cartesian product of the groups is not represented here (it's only the compressed space)

@behzadnouri
Contributor Author

The master branch calls into groupsort_indexer with ngroups = np.prod(shape). np.prod(shape) is the size of the Cartesian product space of unique values across the keys, and groupsort_indexer allocates a counts array of that size.

Elsewhere too, the code falls back on argsort to avoid a memory error.
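To put the Cartesian product in perspective, a rough sketch (not the patch itself; shape here stands in for the per-key unique counts that the groupby machinery computes internally) comparing its size with the number of groups actually observed, using the same data as the timings above:

import numpy as np
import pandas as pd

np.random.seed(2718281)
n = 20000
df = pd.DataFrame(np.random.randint(1, n, (n, 3)),
                  columns=['jim', 'joe', 'jolie'])

# number of unique values observed per key
shape = [df['jim'].nunique(), df['joe'].nunique()]

# master passes ngroups = np.prod(shape) to groupsort_indexer,
# which sizes its counts array accordingly
cartesian = np.prod(shape)

# groups that actually occur in the data (at most n, one per row)
observed = df.groupby(['jim', 'joe']).ngroups

print(cartesian, observed)  # the Cartesian space is orders of magnitude larger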

@jreback
Contributor

jreback commented Aug 29, 2014

the argsort docs say it is O(n**2) in the worst case (as the default is quicksort)... but it's only used on the smaller ones anyhow.
thanks.. this is great! (and the mem issue is avoided!)

@jreback jreback merged commit c5a3514 into pandas-dev:master Aug 29, 2014
@jreback
Contributor

jreback commented Aug 29, 2014

thanks! this was great!

@behzadnouri
Contributor Author

@jreback On further testing, it seems to me that we need a stable sorter for group_index; see for example

I need to change the code to .argsort(kind='mergesort'). Should I make a new pull request, or can it be done within this one?

I did some tests with merge sort and the benchmarks still look good. It is also in line with the fact that Wes uses merge sort here.
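As a small illustration of the stability issue (an assumed example, not a test from this PR): NumPy's default argsort kind is quicksort, which does not guarantee the order of equal keys, whereas kind='mergesort' is the stable option, so rows within each group keep their original order.

import numpy as np

# a group_index with many ties
group_index = np.array([1, 0, 1, 0, 1, 0, 1, 0], dtype=np.int64)

stable = group_index.argsort(kind='mergesort')  # ties keep their original order
default = group_index.argsort()                 # quicksort: tie order not guaranteed

print(stable)   # [1 3 5 7 0 2 4 6] -- original row order preserved within each group
print(default)  # same keys sorted, but the order of ties may differ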

@jreback
Contributor

jreback commented Aug 29, 2014

ok make a new pr

hmm no tests break
can u show a test case?

Labels: Groupby, Performance