GroupBy shifting performance issue #2162

Closed
wesm opened this Issue · 2 comments

2 participants

@wesm
Owner
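import numpy as np
import pandas as pd

# 48000 groups ("pools"), each with a random length of at least 1 row
ids = np.arange(48000)
lens = np.maximum(np.round(15+9.5*np.random.randn(48000)), 1.0).astype(int)
id_vec = np.repeat(ids, lens)
# month counter that restarts at 0 within each pool
lens_shift = np.concatenate(([0], lens[:-1]))
mon_vec = np.arange(lens.sum()) - np.repeat(np.cumsum(lens_shift), lens)
n = len(mon_vec)
# note: DataFrame.from_items was removed in pandas 1.0; build from a dict there
df = pd.DataFrame.from_items([('pool', id_vec), ('month', mon_vec)] + [(c, np.random.rand(n)) for c in 'abcde'])
df = df.set_index(['pool', 'month'])
# shift each pool's rows up by one (%time is an IPython magic)
%time df_shift = df.groupby(level=0).transform(lambda x: x.shift(-1))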

xref http://stackoverflow.com/questions/13180499/most-efficient-way-to-shift-multiindex-time-series

@jreback
Owner

This is obviously affected by the issues fixed in #3145.
We should still probably add it to the vbenches and see what else we can do.

In [11]: %time df_shift = df.groupby(level=0).transform(lambda x: x.shift(-1))
CPU times: user 10.13 s, sys: 0.18 s, total: 10.31 s
Wall time: 10.35 s

In [12]: pd.__version__
Out[12]: '0.11.0.dev-e6140e9'

In [11]: %time df_shift = df.groupby(level=0).transform(lambda x: x.shift(-1))
CPU times: user 51.07 s, sys: 0.25 s, total: 51.32 s
Wall time: 51.48 s

In [13]: pd.__version__
Out[13]: '0.10.1'

  Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    96005    1.092    0.000    2.360    0.000 common.py:456(take_nd)
    48000    0.560    0.000    3.790    0.000 index.py:1536(values)
    96006    0.486    0.000    0.895    0.000 internals.py:762(make_block)
   144007    0.474    0.000    0.764    0.000 common.py:690(_maybe_promote)
    96006    0.450    0.000    1.504    0.000 internals.py:818(__init__)
    48000    0.445    0.000    4.000    0.000 frame.py:3975(shift)
    48000    0.426    0.000    0.958    0.000 index.py:1817(__getitem__)
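
A listing like the one above can be produced with the standard-library profiler. The following is only a sketch under assumptions: the frame is a small stand-in (200 pools of 5 rows, chosen for illustration) rather than the 48000-pool frame from the report, and the thread does not say how the original profile was collected.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

# Small stand-in frame (sizes are illustrative, not the 48000 pools above)
pool = np.repeat(np.arange(200), 5)
df = pd.DataFrame({"pool": pool, "a": np.random.rand(len(pool))}).set_index("pool")

# Profile only the groupby/transform call
profiler = cProfile.Profile()
profiler.enable()
df_shift = df.groupby(level=0).transform(lambda x: x.shift(-1))
profiler.disable()

# "tottime" is the sort key that pstats reports as "internal time"
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("tottime").print_stats(7)
report = buf.getvalue()
print(report)
```

Sorting by `tottime` ranks functions by the time spent in their own bodies, which is what makes per-group overhead such as `take_nd` and `make_block` stand out.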

We should probably have a shift-like operator built into apply/transform rather than going through a generic apply;
that would obviously be much faster.
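
For illustration, later pandas releases expose a `shift` method directly on the groupby object, which avoids calling a Python lambda once per group. The sketch below assumes such a release and uses a tiny hypothetical frame (3 pools, columns `a` and `b`). It is not necessarily the change that closed this issue.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame: 3 pools with 2, 3 and 1 rows (sizes are illustrative)
lens = np.array([2, 3, 1])
pool = np.repeat(np.arange(len(lens)), lens)
month = np.concatenate([np.arange(k) for k in lens])
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"pool": pool, "month": month,
     "a": rng.random(len(pool)), "b": rng.random(len(pool))}
).set_index(["pool", "month"])

# Generic path: one Python-level call per group
slow = df.groupby(level=0).transform(lambda x: x.shift(-1))

# Built-in group-wise shift: same result without the per-group lambda
fast = df.groupby(level=0).shift(-1)

pd.testing.assert_frame_equal(slow, fast)
```

Both calls leave NaN in the last row of each pool, since there is no following row within that group to shift from.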

@jreback
Owner

fixed as indicated above

@jreback closed this