
Superhash: binby, groupby, unique, value_counts and xarray support #197

Merged
43 commits merged on Apr 24, 2019

Conversation

@maartenbreddels (Member) commented on Apr 2, 2019

This is the start of using hashmaps in vaex for

  • groupby support
  • binby support
  • unique
  • value_counts

value_counts is 235x faster than the old implementation and 4.7x faster than pandas (benchmarked on 1e8 random integers between 0 and 99).
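The quoted numbers come from the author's own runs; as a rough sketch, a benchmark with the same data shape (1e8 random integers between 0 and 99) could be reproduced along these lines:

import timeit

import numpy as np
import pandas as pd
import vaex

# Same data shape as in the benchmark described above:
# 1e8 random integers between 0 and 99.
values = np.random.randint(0, 100, 100_000_000)

df_vaex = vaex.from_arrays(x=values)
df_pandas = pd.DataFrame({'x': values})

# Compare the hashmap-based value_counts from this PR against pandas.
print('vaex  :', timeit.timeit(lambda: df_vaex.x.value_counts(), number=3))
print('pandas:', timeit.timeit(lambda: df_pandas.x.value_counts(), number=3))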

@maartenbreddels force-pushed the superhash branch 2 times, most recently from dcf3eee to fef2fe7, on April 3, 2019 07:58
@maartenbreddels force-pushed the superhash branch 2 times, most recently from 9d38553 to 60241a4, on April 12, 2019 07:52
@maartenbreddels changed the title from "Superhash: groupby, unique, value_counts" to "Superhash: binby, groupby, unique, value_counts and xarray support" on Apr 15, 2019
@maartenbreddels (Member, Author)

cc @rabernat

I've chosen to split 'groupby' into two methods:

  • A 'traditional' groupby, which simply returns a new DataFrame.
  • A new binby method, which returns an xarray DataArray, a much friendlier data structure to work with when you do higher-dimensional 'groupbys'. It also seems like a perfect match for vaex, since it produces regular N-d arrays, and it makes visualization very easy to do (see the sketch below the screenshot):

(screenshot: visualization produced from a binby result)
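As an illustration of the split, a minimal sketch of what the two entry points might look like from the user's side; the agg spelling is an assumption about the API introduced here, not its final form:

import numpy as np
import vaex

df = vaex.from_arrays(
    x=np.random.randint(0, 10, 1_000_000),
    y=np.random.randint(0, 5, 1_000_000),
)

# 'Traditional' groupby: returns a new vaex DataFrame.
counts_df = df.groupby(by='x', agg='count')

# binby: returns an xarray DataArray, i.e. a regular N-d array with labelled axes.
counts_da = df.binby(by=[df.x, df.y], agg='count')

# xarray makes the visualization shown above a one-liner.
counts_da.plot()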

@JovanVeljanoski (Member)

I found a potential issue:

import vaex
import numpy as np

x = np.array(['a', 'a', 'a', 'b', 'b', 'c'], dtype=object)  # string/object column; np.object is deprecated, the builtin object behaves the same
ds = vaex.from_arrays(x=x)

ds.x.value_counts()

that returns

Series([], dtype: float64)

For big datasets this also sometimes kills the kernel, although I can't always reproduce it.

@maartenbreddels (Member, Author)

Fundamental issues with the aggregation code led to a complete refactor, which also gives a 20-100% speedup in certain cases for 1- or 2-dimensional binning of non-float columns (and even float columns in some cases).
This leaves a lot of legacy code in place: a few aggregations still use the old code path. Once we have fully migrated, we can cut out a lot of that legacy code.
We'll do this in a separate PR.
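For context, the kind of call this refactor speeds up is a 1- or 2-dimensional binned aggregation over non-float columns, along these lines (vaex.agg.mean is an assumption about the aggregation helpers this work builds towards):

import numpy as np
import vaex

n = 10_000_000
df = vaex.from_arrays(
    i=np.random.randint(0, 100, n),   # non-float columns used for binning
    j=np.random.randint(0, 100, n),
    value=np.random.random(n),
)

# 2-d binning with an aggregation: the code path that was refactored.
result = df.binby(by=[df.i, df.j], agg=vaex.agg.mean(df.value))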

@maartenbreddels (Member, Author)

I've squashed some commits (mainly CI debugging), but since it's a big change, I'm going to merge this as-is to keep a fine-grained history.
