
PERF: speed up multi-key groupby #8128

Merged
merged 1 commit into pandas-dev:master from behzadnouri:groupby-speed-up on Aug 29, 2014

Conversation

behzadnouri
Contributor

Improves multi-key groupby speed. On master:

In [3]: pd.__version__
Out[3]: '0.14.1-276-g995f91c'

In [4]: np.random.seed(2718281)

In [5]: n = 20000

In [6]: df = pd.DataFrame(np.random.randint(1, n, (n, 3)),
   ...:         columns=['jim', 'joe', 'jolie'])

In [7]: %timeit df.groupby(['jim', 'joe'])['jolie'].transform('max')
1 loops, best of 3: 1.09 s per loop

In [8]: df['joe'] = df['jim']

In [9]: %timeit df.groupby(['jim', 'joe'])['jolie'].transform('max')
1 loops, best of 3: 1.02 s per loop

Note that on master the timing is not responsive to the reduction in the number of groups. With this patch:

In [9]: %timeit df.groupby(['jim', 'joe'])['jolie'].transform('max')
10 loops, best of 3: 122 ms per loop

In [10]: df['joe'] = df['jim']

In [11]: %timeit df.groupby(['jim', 'joe'])['jolie'].transform('max')
10 loops, best of 3: 82.3 ms per loop
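As a rough illustration of why the timings can now track the observed number of groups (a sketch of the general idea of compressing multi-key labels, not necessarily the exact code in this patch — the factorize-based combination below is an assumption for illustration):

import numpy as np
import pandas as pd

np.random.seed(2718281)
n = 20000
df = pd.DataFrame(np.random.randint(1, n, (n, 3)),
                  columns=['jim', 'joe', 'jolie'])

# factorize each key, then combine into one flat label per row
jim_codes, jim_uniques = pd.factorize(df['jim'])
joe_codes, joe_uniques = pd.factorize(df['joe'])
combined = jim_codes.astype(np.int64) * len(joe_uniques) + joe_codes

# compress the combined labels down to the groups that actually occur,
# so later work scales with observed groups, not the Cartesian product
comp_ids, obs_group_ids = pd.factorize(combined)

print(len(obs_group_ids))  # number of observed (jim, joe) groups, at most n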

vbench results:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_transform_multi_key2                 |  48.2820 | 810.2150 |   0.0596 |
groupby_transform_multi_key4                 | 137.5050 | 1986.8030 |   0.0692 |
groupby_transform_multi_key1                 |  70.1749 | 841.8140 |   0.0834 |
groupby_transform_multi_key3                 | 713.8393 | 2613.6247 |   0.2731 |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [11cc057] : Merge branch 'groupby-speed-up' of https://github.com/behzadnouri/pandas into behzadnouri-groupby-speed-up
Base   [3bb0803] : Merge pull request #8103 from sinhrks/pivot_dt

BUG: pivot_table raises KeyError with nameless index and columns

@jreback
Contributor

jreback commented Aug 28, 2014

  • need a vbench (or 2) (and then show the results)
  • release note

@jreback jreback added this to the 0.15.0 milestone Aug 28, 2014
@behzadnouri behzadnouri changed the title ENH: speed up multi-key groupby PERF: speed up multi-key groupby Aug 28, 2014
@behzadnouri
Contributor Author

I had some dependency issues, so I ran the benchmarks manually.
On master:

>>> groupby_transform_multi_key1.run()
{'loops': 1, 'timing': 1880.3930282592773, 'repeat': 3, 'succeeded': True, 'units': 'ms'}
>>> groupby_transform_multi_key2.run()
{'loops': 1, 'timing': 1886.4881992340088, 'repeat': 3, 'succeeded': True, 'units': 'ms'}
>>> groupby_transform_multi_key3.run()
{'loops': 1, 'timing': 5588.175058364868, 'repeat': 3, 'succeeded': True, 'units': 'ms'}
>>> groupby_transform_multi_key4.run()
{'loops': 1, 'timing': 4688.298940658569, 'repeat': 3, 'succeeded': True, 'units': 'ms'}

on branch:

>>> groupby_transform_multi_key1.run()
{'loops': 1, 'timing': 106.5061092376709, 'repeat': 3, 'succeeded': True, 'units': 'ms'}
>>> groupby_transform_multi_key2.run()
{'loops': 10, 'timing': 73.46320152282715, 'repeat': 3, 'succeeded': True, 'units': 'ms'}
>>> groupby_transform_multi_key3.run()
{'loops': 1, 'timing': 1123.687982559204, 'repeat': 3, 'succeeded': True, 'units': 'ms'}
>>> groupby_transform_multi_key4.run()
{'loops': 1, 'timing': 200.47903060913086, 'repeat': 3, 'succeeded': True, 'units': 'ms'}

groupby_transform_multi_key[3|4] need about 6.5 GB of memory to run on master.
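For reference, a minimal sketch of how one of these benchmarks could be defined and run directly — the setup mirrors the timing session in the description, but the vbench import and Benchmark arguments here are assumptions, not the exact vb_suite entry added in this PR:

from vbench.api import Benchmark

setup = """
import numpy as np
import pandas as pd

np.random.seed(2718281)
n = 20000
df = pd.DataFrame(np.random.randint(1, n, (n, 3)),
                  columns=['jim', 'joe', 'jolie'])
"""

groupby_transform_multi_key1 = Benchmark(
    "df.groupby(['jim', 'joe'])['jolie'].transform('max')",
    setup,
    name='groupby_transform_multi_key1',
)

# run() reports a dict with 'loops', 'timing' (ms), 'repeat' and 'succeeded',
# as in the output above
print(groupby_transform_multi_key1.run())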

@jreback
Contributor

jreback commented Aug 29, 2014

why is the memory usage so high? the Cartesian product of the groups is not represented here (it's only the compressed space)

@behzadnouri
Contributor Author

The master branch calls into groupsort_indexer with ngroups = np.prod(shape). np.prod(shape) is the size of the Cartesian product space of unique values across the keys, and groupsort_indexer allocates a counts array of that size.

Elsewhere too, the code falls back on argsort to avoid a memory error.
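To put the Cartesian product in perspective, a rough sketch (not the patch itself; shape here stands in for the per-key unique counts that the groupby machinery computes internally) comparing its size with the number of groups actually observed, using the same data as the timings above:

import numpy as np
import pandas as pd

np.random.seed(2718281)
n = 20000
df = pd.DataFrame(np.random.randint(1, n, (n, 3)),
                  columns=['jim', 'joe', 'jolie'])

# number of unique values observed per key
shape = [df['jim'].nunique(), df['joe'].nunique()]

# master passes ngroups = np.prod(shape) to groupsort_indexer,
# which sizes its counts array accordingly
cartesian = np.prod(shape)

# groups that actually occur in the data (at most n, one per row)
observed = df.groupby(['jim', 'joe']).ngroups

print(cartesian, observed)  # the Cartesian space is orders of magnitude larger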

@jreback
Contributor

jreback commented Aug 29, 2014

the argsort docs say it is O(n**2) in the worst case (as the default is quicksort)... but it's only used on the smaller ones anyhow.
thanks.. this is great! (and the mem issue is avoided!)

@jreback jreback merged commit c5a3514 into pandas-dev:master Aug 29, 2014
@jreback
Contributor

jreback commented Aug 29, 2014

thanks! this was great!

@behzadnouri
Contributor Author

@jreback On further testing, it seems to me that we need a stable sorter for group_index; see for example

I need to change the code to .argsort(kind='mergesort'). Should I make a new pull request, or can it be done within this one?

I did some tests with merge sort and the benchmarks still look good. It is also in line with the fact that Wes uses merge sort here.
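As a small illustration of the stability issue (an assumed example, not a test from this PR): NumPy's default argsort kind is quicksort, which does not guarantee the order of equal keys, whereas kind='mergesort' is the stable option, so rows within each group keep their original order.

import numpy as np

# a group_index with many ties
group_index = np.array([1, 0, 1, 0, 1, 0, 1, 0], dtype=np.int64)

stable = group_index.argsort(kind='mergesort')  # ties keep their original order
default = group_index.argsort()                 # quicksort: tie order not guaranteed

print(stable)   # [1 3 5 7 0 2 4 6] -- original row order preserved within each group
print(default)  # same keys sorted, but the order of ties may differ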

@jreback
Contributor

jreback commented Aug 29, 2014

ok make a new pr

hmm no tests break
can u show a test case?

Labels: Groupby, Performance