Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up NdMapping.groupby #349

Merged
merged 8 commits into from Dec 12, 2015

Conversation

Projects
None yet
3 participants
@philippjfr
Copy link
Contributor

philippjfr commented Dec 11, 2015

This PR refactors the NdMapping.groupby operation into a separate function and provides an alternative implementation based on Pandas, which is significantly faster for large datasets. You can see the linear performance scaling by the old implementation and might just make out the sublinear performance of pandas, which becomes very significant for large datasets >10000 items. This is a temporary workaround until we come up with a general solution data API for NdMapping types that's being discussed in #347.

image

@philippjfr philippjfr added this to the v1.4.1 milestone Dec 11, 2015

import pandas
ndmapping_groupby = ndmapping_groupby_pandas
except:
ndmapping_groupby = ndmapping_groupby_python

This comment has been minimized.

@jlstevens

jlstevens Dec 11, 2015

Contributor

I would make this a parameterized function (i.e a class) that uses one of two possible bothmethods in __call__. My only other comment is that we need some docstrings here...

yield [((), (v.ix[0],))]


def ndmapping_groupby_pandas(ndmapping, dimensions, container_type, group_type, **kwargs):

This comment has been minimized.

@jlstevens

jlstevens Dec 11, 2015

Contributor

From what I can see, this is actually used at the MultiDimensionalMapping level. Just a minor comment, as this name is certainly less unwieldy than multidimensional_groupby_pandas and I doubt this would cause much confusion.

@jlstevens

This comment has been minimized.

Copy link
Contributor

jlstevens commented Dec 11, 2015

Definitely a very valuable PR: it cleans up groupby on MultiDimensionalMapping, moves useful functionality into util and most importantly offers a major performance improvement (for people with pandas installed) with very little code.

I'm happy to implement my suggestion of turning ndmapping_groupby into a parameterized function once the tests are passing...

@philippjfr

This comment has been minimized.

Copy link
Contributor Author

philippjfr commented Dec 12, 2015

I'm now done with this, I'd be happy if you refactored it into ParameterizedFunction, then we can merge.

Just plotted the improvement factor as a function of samples, huge difference for large N:

image

@jlstevens

This comment has been minimized.

Copy link
Contributor

jlstevens commented Dec 12, 2015

Great! The key thing is that all the tests are passing now...

My only comment now is whether you are happy for me to make this into a parameterized function? Or do you object to having a single parameterized function for groupby?

Shouldn't take long to do and I am happy to change it if you are busy.

@philippjfr

This comment has been minimized.

Copy link
Contributor Author

philippjfr commented Dec 12, 2015

My only comment now is whether you are happy for me to make this into a parameterized function? Or do you object to having a single parameterized function for groupby?

Must have missed my comment somehow:

I'm now done with this, I'd be happy if you refactored it into [a] ParameterizedFunction, then we can merge.

@jlstevens

This comment has been minimized.

Copy link
Contributor

jlstevens commented Dec 12, 2015

Sorry yes - I skimmed your reply too quickly. I'll do the final refactor now.

Philipp Rudiger and others added some commits Dec 12, 2015

Philipp Rudiger Philipp Rudiger
Small fix to ndmapping_groupby_pandas function
Avoids hardcoding the 'Index' dimension used for NdElement types
@jlstevens

This comment has been minimized.

Copy link
Contributor

jlstevens commented Dec 12, 2015

If the tests pass now, I'll go ahead and merge.

jlstevens added a commit that referenced this pull request Dec 12, 2015

Merge pull request #349 from ioam/ndmapping_groupby
Significant speed up of NdMapping.groupby

@jlstevens jlstevens merged commit 6b9fe08 into master Dec 12, 2015

3 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.02%) to 71.238%
Details

@jlstevens jlstevens deleted the ndmapping_groupby branch Dec 12, 2015

@jbednar

This comment has been minimized.

Copy link
Contributor

jbednar commented Dec 12, 2015

Excellent!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.