Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up NdMapping.groupby #349

Merged
merged 8 commits into from
Dec 12, 2015
Merged

Speed up NdMapping.groupby #349

merged 8 commits into from
Dec 12, 2015

Conversation

philippjfr
Copy link
Member

This PR refactors the NdMapping.groupby operation into a separate function and provides an alternative implementation based on Pandas, which is significantly faster for large datasets. You can see the linear performance scaling by the old implementation and might just make out the sublinear performance of pandas, which becomes very significant for large datasets >10000 items. This is a temporary workaround until we come up with a general solution data API for NdMapping types that's being discussed in #347.

image

@philippjfr philippjfr added this to the v1.4.1 milestone Dec 11, 2015
import pandas
ndmapping_groupby = ndmapping_groupby_pandas
except:
ndmapping_groupby = ndmapping_groupby_python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make this a parameterized function (i.e a class) that uses one of two possible bothmethods in __call__. My only other comment is that we need some docstrings here...

@jlstevens
Copy link
Contributor

Definitely a very valuable PR: it cleans up groupby on MultiDimensionalMapping, moves useful functionality into util and most importantly offers a major performance improvement (for people with pandas installed) with very little code.

I'm happy to implement my suggestion of turning ndmapping_groupby into a parameterized function once the tests are passing...

@philippjfr
Copy link
Member Author

I'm now done with this, I'd be happy if you refactored it into ParameterizedFunction, then we can merge.

Just plotted the improvement factor as a function of samples, huge difference for large N:

image

@jlstevens
Copy link
Contributor

Great! The key thing is that all the tests are passing now...

My only comment now is whether you are happy for me to make this into a parameterized function? Or do you object to having a single parameterized function for groupby?

Shouldn't take long to do and I am happy to change it if you are busy.

@philippjfr
Copy link
Member Author

My only comment now is whether you are happy for me to make this into a parameterized function? Or do you object to having a single parameterized function for groupby?

Must have missed my comment somehow:

I'm now done with this, I'd be happy if you refactored it into [a] ParameterizedFunction, then we can merge.

@jlstevens
Copy link
Contributor

Sorry yes - I skimmed your reply too quickly. I'll do the final refactor now.

@jlstevens
Copy link
Contributor

If the tests pass now, I'll go ahead and merge.

jlstevens added a commit that referenced this pull request Dec 12, 2015
Significant speed up of NdMapping.groupby
@jlstevens jlstevens merged commit 6b9fe08 into master Dec 12, 2015
@jlstevens jlstevens deleted the ndmapping_groupby branch December 12, 2015 15:17
@jbednar
Copy link
Member

jbednar commented Dec 12, 2015

Excellent!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants