In [1]:
%run ~/ipynb/startup.py groupby-lib

Adding local packages to sys.path: ['groupby-lib']


Welcome to groupby-lib, a Python library for accelerating groupby operations on large in-memory datasets, aimed at users of Pandas and Polars. It's fast, flexible and convenient: in many cases it speeds up the equivalent Pandas operations by an order of magnitude or more, and in some cases it also significantly outperforms Polars, with much less verbose code.
It's particularly useful for quickly generating cuts while exploring and summarizing large tabular data, something I did a lot of for over a decade in quant finance.

In this short video, I'll demo some of the core features and run some comparisons against Pandas and Polars.

In [2]:
import numpy as np
import polars
import pandas as pd
from groupby_lib import GroupBy
from groupby_lib import install_groupby_fast
install_groupby_fast()

✅ groupby-lib patches installed methods installed!
   Use df.groupby_fast() and series.groupby_fast() for optimized performance


In [3]:
N = 20_000_000
df = pd.DataFrame(dict(
    floats=np.random.rand(N), 
    ints=np.random.randint(0, 1000, N),
))
df["categorical"] = pd.Categorical.from_codes(df.ints % 10, list("qwertyuiop"))
df_pl = polars.DataFrame(df)

### The GroupBy Class & groupby_fast monkey patch

#### Two ways of utilizing the functionality 

In [4]:
from groupby_lib import GroupBy
from groupby_lib import install_groupby_fast
install_groupby_fast()

✅ groupby-lib patches installed methods installed!
   Use df.groupby_fast() and series.groupby_fast() for optimized performance
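
For readers unfamiliar with the monkey-patch route, here is a minimal sketch of how a class-based entry point and a patched DataFrame method can share one code path. Every name in it (`DemoGroupBy`, `install_demo_patch`, `groupby_demo`) is hypothetical and it simply delegates to pandas; it is not groupby-lib's implementation.

```python
import pandas as pd


class DemoGroupBy:
    """Hypothetical stand-in for an accelerated grouper (NOT groupby-lib's class)."""

    def __init__(self, key):
        self.key = key

    def sum(self, values: pd.DataFrame) -> pd.DataFrame:
        # Delegates to pandas here; a real accelerator would call its own kernels.
        return values.groupby(self.key, observed=True).sum()


class _Bound:
    """Pairs a grouper with the value columns so `.sum()` needs no arguments."""

    def __init__(self, grouper: DemoGroupBy, values: pd.DataFrame):
        self.grouper = grouper
        self.values = values

    def sum(self) -> pd.DataFrame:
        return self.grouper.sum(self.values)


def install_demo_patch() -> None:
    """Attach a `groupby_demo` method to DataFrame (illustrative name)."""

    def groupby_demo(self, by):
        return _Bound(DemoGroupBy(self[by]), self.drop(columns=by))

    pd.DataFrame.groupby_demo = groupby_demo


frame = pd.DataFrame({"k": ["a", "b", "a"], "x": [1, 2, 3]})
install_demo_patch()

via_class = DemoGroupBy(frame["k"]).sum(frame[["x"]])  # explicit class
via_patch = frame.groupby_demo("k").sum()              # patched method
```

Both routes return the same frame; the patched method is only a convenience wrapper around the class.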


#### ***~15x*** faster when grouping by a categorical (4-5x vs. Polars)

In [5]:
%timeit -n 1 df.groupby("categorical", observed=True).sum()   # observed=True is passed only to avoid a pandas warning
%timeit -n 1 df_pl.group_by("categorical").sum()
%timeit -n 1 GroupBy(df.categorical).sum(df); 
%timeit -n 1 GroupBy.sum(df.categorical, df); 
%timeit -n 1 df.groupby_fast("categorical").sum()

140 ms ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
43.3 ms ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The slowest run took 120.53 times longer than the fastest. This could mean that an intermediate result is being cached.
170 ms ± 374 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
9.11 ms ± 960 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
10.3 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
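
One plausible reason a categorical key is cheap to group on is that its codes are already dense integers, so no hashing or factorization pass is needed. The sketch below illustrates that idea with plain NumPy on a tiny array; it is an assumption about the general technique, not a description of groupby-lib's internals.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
codes = rng.integers(0, 3, size=12)
cat = pd.Categorical.from_codes(codes, categories=list("abc"))
values = rng.random(12)

# cat.codes is a dense integer array in [0, n_categories), so a single
# weighted bincount pass yields the per-group sums.
sums = np.bincount(cat.codes, weights=values, minlength=len(cat.categories))
fast = pd.Series(sums, index=cat.categories)

# Reference result from pandas for comparison.
reference = pd.Series(values).groupby(cat, observed=False).sum()
```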


#### ***~3x*** faster when grouping by an integer column (randomly distributed, 1000 uniques)

In [29]:
%timeit -n 1 df.groupby("ints", observed=True).mean(numeric_only=True);
%timeit -n 1 df_pl.group_by("ints").mean()
%timeit -n 1 df.groupby_fast("ints").mean();

131 ms ± 50.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
70.8 ms ± 8.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
43.2 ms ± 23.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
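
For a non-categorical key, the values first have to be mapped to dense group codes, which is the extra step a categorical key avoids. A minimal sketch of that factorize-then-aggregate pattern, using `pd.factorize` and `np.bincount` on toy data (illustrative only, not the library's code):

```python
import numpy as np
import pandas as pd

keys = np.array([7, 3, 7, 42, 3, 7])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# sort=True orders the codes like pandas' default sorted group index.
codes, uniques = pd.factorize(keys, sort=True)

sums = np.bincount(codes, weights=values, minlength=len(uniques))
counts = np.bincount(codes, minlength=len(uniques))
means = pd.Series(sums / counts, index=uniques)

# Reference result from pandas for comparison.
reference = pd.Series(values).groupby(keys).mean()
```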


#### ***~3-4x*** faster aggregating already grouped data

##### 3-8x Faster on 2 columns

In [7]:
for key in ["ints", "categorical"]:
    print(key)
    for gb in [
        df.groupby(key, observed=True),
        df.groupby_fast(key)
    ]:
        gb[["floats", "ints"]].mean(); # this is to pay the one-off setup for Pandas groupby
        %timeit -n 1 gb[["floats", "ints"]].mean();
    print()

ints
81.3 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
28.1 ms ± 7.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

categorical
91.3 ms ± 4.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
8.92 ms ± 264 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)



##### 4x faster on 10 columns

In [8]:
wide_df = pd.DataFrame({i: np.random.rand(N // 1) for i in range(10)})
for key in ["ints", "categorical"]:
    print(key)
    key = df[key][:len(wide_df)]
    for gb in [
        wide_df.groupby(key, observed=True),
        wide_df.groupby_fast(key),
    ]:
        gb.mean(); # this is to pay the one-off setup for Pandas groupby
        %timeit -n 1 gb.mean();
    print()

ints
261 ms ± 6.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
101 ms ± 937 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)

categorical
201 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
54.1 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



#### In-line filtering/masking

#### 5-10x faster with a row mask (with 80% positive rate)

In [9]:
mask = df.floats.between( *df.floats.quantile([.1, .9]))
for key in ["ints", "categorical"]:
    print(f"{key} grouper")
    %timeit -n 1 df.loc[mask].groupby(key, observed=True).mean(numeric_only=True)
    %timeit -n 1 df.groupby_fast(key).mean(mask=mask)
    print()

ints grouper
232 ms ± 67.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
32.5 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

categorical grouper
248 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The slowest run took 8.25 times longer than the fastest. This could mean that an intermediate result is being cached.
58.1 ms ± 70.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
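
The Pandas version pays for `df.loc[mask]`, which copies every surviving row before grouping starts. An in-line mask can avoid that copy by giving masked-out rows zero weight in both the sum and the count. The sketch below shows that idea in NumPy on toy data; it assumes no NaNs and is not taken from groupby-lib.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "key": [0, 1, 0, 1, 2, 2],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
mask = df_demo["x"] > 1.5  # drops only the first row

codes = df_demo["key"].to_numpy()
x = df_demo["x"].to_numpy()
m = mask.to_numpy()

# Masked-out rows contribute 0 to both the numerator and the denominator.
sums = np.bincount(codes, weights=np.where(m, x, 0.0))
counts = np.bincount(codes, weights=m.astype(float))
with np.errstate(invalid="ignore", divide="ignore"):
    masked_mean = sums / counts  # a group with no surviving rows becomes NaN

# Reference: filter first, then group.
reference = df_demo.loc[mask].groupby("key")["x"].mean()
```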



#### Multi-Key and Margins

In [10]:
multi_key = ["categorical", df.ints % 3]

In [11]:
%timeit -n 1 df.groupby(multi_key, observed=True).sum(numeric_only=True)
%timeit -n 1 df.groupby_fast(multi_key).sum()

420 ms ± 83.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The slowest run took 5.58 times longer than the fastest. This could mean that an intermediate result is being cached.
206 ms ± 189 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [12]:
df.groupby_fast(multi_key).sum(
    margins=True, mask=df.categorical.isin(["q", "p"])
).style.format(precision=2)

categorical,ints,floats,ints
p,0,340341.49,343084527
p,1,329939.75,329328144
p,2,330163.44,336294574
p,All,1000444.68,1008707245
q,0,340376.15,336242220
q,1,330119.29,323301180
q,2,329774.95,330036310
q,All,1000270.4,989579710
All,0,680717.64,679326747
All,1,660059.04,652629324
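
The `All` rows are subtotals, and for sums they can be derived from the already-aggregated cells rather than by rescanning the raw rows. Here is a small pandas-only sketch of that derivation; the column names `cat`, `bucket` and `x` are made up for the example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "cat": ["p", "p", "q", "q", "p", "q"],
    "bucket": [0, 1, 0, 1, 0, 1],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Fine-grained cells: one sum per (cat, bucket) pair.
cells = df_demo.groupby(["cat", "bucket"])["x"].sum()

# Each margin is a sum over one level of the aggregated cells.
by_cat = cells.groupby(level="cat").sum()
by_bucket = cells.groupby(level="bucket").sum()

with_margins = pd.concat([
    cells,
    pd.Series(by_cat.to_numpy(),
              index=pd.MultiIndex.from_product([by_cat.index, ["All"]])),
    pd.Series(by_bucket.to_numpy(),
              index=pd.MultiIndex.from_product([["All"], by_bucket.index])),
    pd.Series([cells.sum()],
              index=pd.MultiIndex.from_tuples([("All", "All")])),
])
totals = with_margins.to_dict()  # {(cat, bucket): sum}
```

This shortcut works for sums and counts; a statistic such as a median cannot be rebuilt from per-cell results in the same way.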


#### Rolling / EMA functions ~10x / 40x faster

In [13]:
df_small = df.iloc[:N // 4]

In [15]:
# 10x faster with same behaviour
%timeit -n 1 df_small.groupby("categorical", observed=True).rolling(5, min_periods=1).sum()
%timeit -n 1 df_small.groupby_fast("categorical").rolling(5, min_periods=1).sum(index_by_groups=True)
a = df_small.groupby("categorical", observed=True).rolling(5, min_periods=1).sum()
b = df_small.groupby_fast("categorical").rolling(5, min_periods=1).sum(index_by_groups=True)
assert np.isclose(a, b).all()

899 ms ± 66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
91.2 ms ± 5.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [17]:
# 40x faster with output aligned to input
%timeit -n 1 df_small.groupby_fast("categorical").rolling(5, min_periods=1).sum();

29.8 ms ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
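
The two output layouts differ only in indexing. Pandas returns the grouped rolling result re-indexed by (group key, original row label), whereas an input-aligned result keeps the original row order. The pandas-only sketch below shows how one maps to the other; reading `index_by_groups=True` as the first layout is an inference from the cells above, not documented behaviour.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Pandas layout: MultiIndex of (key, original row label), rows regrouped by key.
by_groups = df_demo.groupby("key")["x"].rolling(2, min_periods=1).sum()

# Input-aligned layout: drop the key level and restore the original row order.
aligned = by_groups.droplevel("key").sort_index()
```

Skipping the regrouped index avoids reordering the output, which is consistent with the aligned variant being the faster one in the timings above.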


#### EMAs - 10x / 70x faster

In [25]:
ema = df.groupby("categorical", observed=True).ewm(alpha=.5).mean()
ema_fast = df.groupby_fast("categorical").ema(alpha=.5, index_by_groups=True)
assert np.isclose(ema, ema_fast).all()

In [19]:
# 10x faster with same behaviour
%timeit -n 1 df.groupby("categorical", observed=True).ewm(alpha=.5).mean()
%timeit -n 1 df.groupby_fast("categorical").ema(alpha=.5, index_by_groups=True)

3.78 s ± 95.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
314 ms ± 26.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### Result aligned to input (like Polars) - 6-8x faster than Polars (70x vs. Pandas)

In [28]:
%timeit -n 1 df.groupby_fast("categorical").ema(alpha=.5,)

49.3 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [27]:
%%timeit -n 1   
# YUCK
df_pl.select([
    c.ewm_mean(alpha=.5).over("categorical") for c in [polars.col.floats, polars.col.ints]
])

363 ms ± 37.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
ema_gbl = df.groupby_fast("categorical").ema(alpha=.5,)
ema_pl = df_pl.select([
    col.ewm_mean(alpha=.5).over("categorical") for col in [polars.col.floats, polars.col.ints]
])
assert np.isclose(ema_gbl, ema_pl.to_pandas()).all()
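
An input-aligned grouped EMA can be computed in one pass over the rows by keeping a running numerator and denominator per group. The sketch below uses the `adjust=True` weighting that `ewm(alpha=...)` applies by default in pandas; `grouped_ema` is a hypothetical helper, written as a plain Python loop for clarity with no NaN handling, and is not groupby-lib's kernel.

```python
import numpy as np
import pandas as pd


def grouped_ema(codes, values, alpha, n_groups):
    """Single-pass EMA per group, output aligned to the input order."""
    decay = 1.0 - alpha
    num = np.zeros(n_groups)  # running weighted sum per group
    den = np.zeros(n_groups)  # running sum of weights per group
    out = np.empty(len(values))
    for i, (g, x) in enumerate(zip(codes, values)):
        num[g] = x + decay * num[g]
        den[g] = 1.0 + decay * den[g]
        out[i] = num[g] / den[g]
    return out


codes = np.array([0, 1, 0, 1, 0])
values = np.array([1.0, 10.0, 3.0, 20.0, 5.0])
ema = grouped_ema(codes, values, alpha=0.5, n_groups=2)

# Reference: pandas grouped EWM, re-aligned to the input order.
reference = (
    pd.Series(values).groupby(codes).ewm(alpha=0.5).mean()
    .droplevel(0).sort_index()
)
```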

#### Medians and quantiles

In [54]:
%timeit -n 1 df.groupby("ints", observed=True).median(numeric_only=True);
%timeit -n 1  df.groupby_fast("ints").median();

332 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
228 ms ± 4.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [61]:
%timeit -n 1 df.groupby("categorical", observed=True).median(numeric_only=True);
%timeit -n 1  df.groupby_fast("categorical").median();

524 ms ± 8.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
153 ms ± 2.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### Quantiles 6-10x Faster

In [59]:
q = [.25, .5, .75]
%timeit -n 1 df.groupby("ints", observed=True).quantile(q=q, numeric_only=True);
%timeit -n 1 df.groupby_fast("ints").quantile(q=q);

1.47 s ± 105 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
258 ms ± 4.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [60]:
%timeit -n 1 df.groupby("categorical", observed=True).quantile(q=q, numeric_only=True);
%timeit -n 1 df.groupby_fast("categorical").quantile(q=q);

2.32 s ± 88.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
178 ms ± 7.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
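
Grouped quantiles can be obtained from a single sort by (group, value) followed by linear interpolation at the right rank inside each group's slice. The sketch below is one way to do that in NumPy, matching pandas' default `linear` interpolation; `grouped_quantile` is a hypothetical helper that assumes every group is non-empty and contains no NaNs.

```python
import numpy as np
import pandas as pd


def grouped_quantile(codes, values, q, n_groups):
    """Quantile q per group via one lexsort plus linear interpolation."""
    order = np.lexsort((values, codes))  # primary key: codes, then values
    sorted_vals = values[order]
    counts = np.bincount(codes, minlength=n_groups)
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))
    pos = q * (counts - 1)               # fractional rank within each group
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, counts - 1)
    frac = pos - lo
    return sorted_vals[starts + lo] * (1 - frac) + sorted_vals[starts + hi] * frac


codes = np.array([0, 1, 0, 1, 0, 1, 0])
values = np.array([5.0, 1.0, 2.0, 8.0, 9.0, 4.0, 7.0])
q25 = grouped_quantile(codes, values, 0.25, n_groups=2)

# Reference result from pandas for comparison.
reference = pd.Series(values).groupby(codes).quantile(0.25)
```

Several quantiles can reuse the same sorted array, so asking for three of them costs little more than asking for one.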


#### Faster crosstabs (4x/8x faster without/with margins)

In [80]:
# Without margins: 4x faster
%timeit -n 1 pd.crosstab(df.ints, df.categorical, df.floats, aggfunc="mean",)
%timeit -n 1 df.pivot_table("floats", "ints", "categorical", aggfunc="sum", observed=True)
%timeit -n 1 crosstab(df.ints, df.categorical, df.floats, aggfunc="mean", )

389 ms ± 6.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
311 ms ± 3.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
108 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [79]:
# With margins: 8x faster
%timeit -n 1 pd.crosstab(df.ints, df.categorical, df.floats, aggfunc="sum", margins=True)
%timeit -n 1 crosstab(df.ints, df.categorical, df.floats, aggfunc="sum", margins=True)

796 ms ± 66.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
110 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [76]:
# With margins & mask: ~5x faster
mask = df.floats > .2
%timeit -n 1 df.loc[mask].pivot_table("floats", "ints", "categorical", aggfunc="sum", margins=True, observed=True)
%timeit -n 1 crosstab(df.ints, df.categorical, df.floats, aggfunc="sum", margins=True, mask=mask)

616 ms ± 22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
127 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
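
A crosstab is a two-key groupby laid out as a grid, so the two sets of codes can be fused into one flat cell index and aggregated in a single pass, with margins then read off as row and column sums. The NumPy sketch below illustrates this on toy data with the codes given directly; it is an illustration of the idea, not the library's `crosstab`.

```python
import numpy as np
import pandas as pd

rows = np.array([0, 0, 1, 1, 0, 1])
cols = np.array([0, 1, 0, 1, 1, 0])
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
n_rows, n_cols = 2, 2

# Fuse both keys into one flat cell id, aggregate once, reshape to a grid.
flat = rows * n_cols + cols
table = np.bincount(flat, weights=vals, minlength=n_rows * n_cols)
table = table.reshape(n_rows, n_cols)

# Margins come from the small aggregated grid, not from the raw rows.
with_margins = np.empty((n_rows + 1, n_cols + 1))
with_margins[:-1, :-1] = table
with_margins[:-1, -1] = table.sum(axis=1)
with_margins[-1, :-1] = table.sum(axis=0)
with_margins[-1, -1] = table.sum()

# Reference result from pandas for comparison.
reference = pd.crosstab(rows, cols, values=vals, aggfunc="sum", margins=True)
```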


#### Masking Variants

In [None]:
mask = df.floats < df.floats.quantile(.2)
%time df.loc[mask].groupby("categorical", observed=True).mean();
%time df.groupby_fast("categorical").mean(mask=mask);