
Superhash: binby, groupby, unique, value_counts and xarray support #197

Merged
43 commits merged on Apr 24, 2019

Conversation

@maartenbreddels (Member) commented on Apr 2, 2019

This is the start of using hashmaps in vaex for

  • groupby support
  • binby support
  • unique
  • value_counts

value_counts is 235x faster than the old implementation and 4.7x faster than pandas (benchmarked on 1e8 random integers between 0 and 99).
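The quoted numbers come from the author's own runs; as a rough sketch, a benchmark with the same data shape (1e8 random integers between 0 and 99) could be reproduced along these lines:

import timeit

import numpy as np
import pandas as pd
import vaex

# Same data shape as in the benchmark described above:
# 1e8 random integers between 0 and 99.
values = np.random.randint(0, 100, 100_000_000)

df_vaex = vaex.from_arrays(x=values)
df_pandas = pd.DataFrame({'x': values})

# Compare the hashmap-based value_counts from this PR against pandas.
print('vaex  :', timeit.timeit(lambda: df_vaex.x.value_counts(), number=3))
print('pandas:', timeit.timeit(lambda: df_pandas.x.value_counts(), number=3))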

@maartenbreddels force-pushed the superhash branch 2 times, most recently from dcf3eee to fef2fe7, on April 3, 2019 07:58
@maartenbreddels force-pushed the superhash branch 2 times, most recently from 9d38553 to 60241a4, on April 12, 2019 07:52
@maartenbreddels changed the title from "Superhash: groupby, unique, value_counts" to "Superhash: binby, groupby, unique, value_counts and xarray support" on Apr 15, 2019
@maartenbreddels (Member, Author)

cc @rabernat

I've chosen to split 'groupby' into two methods:

  • A 'traditional' groupby, which simply returns a new DataFrame.
  • A new binby method, which returns an xarray DataArray, a much friendlier data structure to work with when you do higher-dimensional 'groupbys'. It also seems like a perfect match for vaex, since it produces regular N-d arrays, and it makes visualization very easy to do (see the sketch below the screenshot):

(screenshot: visualization produced from a binby result)
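As an illustration of the split, a minimal sketch of what the two entry points might look like from the user's side; the agg spelling is an assumption about the API introduced here, not its final form:

import numpy as np
import vaex

df = vaex.from_arrays(
    x=np.random.randint(0, 10, 1_000_000),
    y=np.random.randint(0, 5, 1_000_000),
)

# 'Traditional' groupby: returns a new vaex DataFrame.
counts_df = df.groupby(by='x', agg='count')

# binby: returns an xarray DataArray, i.e. a regular N-d array with labelled axes.
counts_da = df.binby(by=[df.x, df.y], agg='count')

# xarray makes the visualization shown above a one-liner.
counts_da.plot()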

@JovanVeljanoski (Member)

I found a potential issue:

import vaex
import numpy as np

x = np.array(['a', 'a', 'a', 'b', 'b', 'c'], dtype=object)  # string/object column; np.object is deprecated, the builtin object behaves the same
ds = vaex.from_arrays(x=x)

ds.x.value_counts()

that returns

Series([], dtype: float64)

For big datasets this also sometimes kills the kernel, although I can't always reproduce it.

@maartenbreddels (Member, Author)

Fundamental issues with the aggregation code led to a complete refactor, which also gives a 20-100% speedup in certain cases for 1- or 2-dimensional binning of non-float columns (and even float columns in some cases).
This leaves a lot of legacy code in place: a few aggregations still use the old code path. Once we have fully migrated, we can cut out a lot of that legacy code.
We'll do this in a separate PR.
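For context, the kind of call this refactor speeds up is a 1- or 2-dimensional binned aggregation over non-float columns, along these lines (vaex.agg.mean is an assumption about the aggregation helpers this work builds towards):

import numpy as np
import vaex

n = 10_000_000
df = vaex.from_arrays(
    i=np.random.randint(0, 100, n),   # non-float columns used for binning
    j=np.random.randint(0, 100, n),
    value=np.random.random(n),
)

# 2-d binning with an aggregation: the code path that was refactored.
result = df.binby(by=[df.i, df.j], agg=vaex.agg.mean(df.value))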

@maartenbreddels (Member, Author)

I've squashed some commits (mainly CI debugging), but since it's a big change, I'm going to merge this as-is to keep a fine-grained history.
