Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quick dataframe shift #5609

Closed
halleygithub opened this issue Nov 28, 2013 · 17 comments · Fixed by #6672
Closed

Quick dataframe shift #5609

halleygithub opened this issue Nov 28, 2013 · 17 comments · Fixed by #6672
Labels
Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@halleygithub
Copy link

Quick implementation of dataframe shift

def quick_shift(df, N=1):
    '''Quick implementation of dataframe shift'''

    shift_value = np.roll(df.values,N,axis=0)
    shift_value[0:N] = np.NaN
    return DataFrame(shift_value, index=df.index, columns=df.columns)
DataFrame.quick_shift = quick_shift

Perf benchmark:

df = DataFrame(np.random.rand(10000,500))

%timeit df1 = df.quick_shift(1)
10 loops, best of 3: 55.3 ms per loop

%timeit df2 = df.shift(1)
1 loops, best of 3: 238 ms per loop

Related to #4095

@jreback
Copy link
Contributor

jreback commented Nov 28, 2013

this would need to be done in core/internals.py

needs to handle non trivial shifts (eg something that is not so straightforward like this) and multi dtypes

@jreback
Copy link
Contributor

jreback commented Jan 25, 2014

@halleygithub putting on the plate for 0.14. Care to do a PR for this?

@halleygithub
Copy link
Author

@jreback , as I am not familiar with github, so I am not sure what can I do . But you are free to handle the code as you want. :)

@jreback
Copy link
Contributor

jreback commented Jan 28, 2014

it's a nice excuse to get familiar

see here: http://pandas.pydata.org/developers.html

@gouthambs
Copy link
Contributor

@jreback I was wondering if I can take a stab at this?

@jreback
Copy link
Contributor

jreback commented Feb 15, 2014

of course

lmk if u need help

@gouthambs
Copy link
Contributor

@jreback I see shift function in core/internals.py in the Block class and in core/generic.py in the NDFrame class. Can you point me to a place where I can understand some basic code flow? Or can you give me some quick pointers?

@jreback
Copy link
Contributor

jreback commented Feb 15, 2014

core/generic /shift sets up what to do eg how many periods to shift and such and translates it to move the data up 3 or whatever; this then calls the internals shift

core/internals/ Block/shift is where to make this change

simply swap out the implantation for the new
make sure tests pass
write a vbench for this (pandas/vb_suite) - find a file with similar methods and stick it their
u can even use the above benchmarks

@gouthambs
Copy link
Contributor

@jreback Thanks. I will do that.

@jreback
Copy link
Contributor

jreback commented Feb 16, 2014

FYI, I just fixed a bug in shift, see here: #6373

The soln proposed above is still valid in any case....make a change in core/internals/Block/shift

@gouthambs
Copy link
Contributor

Sure. I will take your changes and merge with mine.

@gouthambs
Copy link
Contributor

@jreback I swapped the current shift with a roll, in order to see if I get a speed up. I tried swapping in core/internals.py. But this did not get any performance improvements. On the other hand when I use a roll in core/generic.py (just as a test), I see good speedup. The two implementations are shown below:

diff --git a/pandas/core/generic.py b/pandas/core/generic.py
index d607be6..753b146 100644
--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -3169,9 +3169,10 @@ class NDFrame(PandasObject):
             return self

         if freq is None and not len(kwds):
-            block_axis = self._get_block_manager_axis(axis)
-            indexer = com._shift_indexer(len(self._get_axis(axis)), periods)
-            new_data = self._data.shift(indexer=indexer, periods=periods, axis=block_axis)
+            #block_axis = self._get_block_manager_axis(axis)
+            #indexer = com._shift_indexer(len(self._get_axis(axis)), periods)
+            #new_data = self._data.shift(indexer=indexer, periods=periods, axis=block_axis)
+            new_data    = np.roll(self.values,periods,axis=axis)
         else:
             return self.tshift(periods, freq, **kwds)

diff --git a/pandas/core/internals.py b/pandas/core/internals.py
index c89aac0..e56697b 100644
--- a/pandas/core/internals.py
+++ b/pandas/core/internals.py
@@ -931,7 +931,8 @@ class Block(PandasObject):
     def shift(self, indexer, periods, axis=0):
         """ shift the block by periods, possibly upcast """

-        new_values = self.values.take(indexer, axis=axis)
+        #new_values = self.values.take(indexer, axis=axis)
+        new_values = np.roll(self.values,periods, axis=axis)
         # convert integer to float if necessary. need to do a lot more than
         # that, handle boolean etc also
         new_values, fill_value = com._maybe_upcast(new_values)

Do you have any ideas why np.roll speeds up the NDFrame but not the Block class?

The timings for the above mentioned example are as follows:

  • Original Implementation: 242ms
  • Quick Shift : 27 ms
  • Roll in core/internals : 239 ms
  • Roll in core/generic:27 ms

@jreback
Copy link
Contributor

jreback commented Feb 18, 2014

you can't do it in generic.py because that only handles a single dtype. The orientation needs to be changed as the blocks store values 'flipped' from what you think they are. You need to adjust the roll (which IIRC is just an axis swap) to account for this.

In internals you are effectively using axis 1 for the shift (on a row shift), that's what the block_axis is.
transpose the values before roll, then transpose again after the setting

@jreback
Copy link
Contributor

jreback commented Feb 18, 2014

[goat-jreback-~/pandas] git diff

diff --git a/pandas/core/internals.py b/pandas/core/internals.py
index e68fc8d..d33386e 100644
--- a/pandas/core/internals.py
+++ b/pandas/core/internals.py
@@ -946,11 +946,10 @@ class Block(PandasObject):
     def shift(self, indexer, periods, axis=0):
         """ shift the block by periods, possibly upcast """

-        new_values = self.values.take(indexer, axis=axis)
+        new_values, fill_value = com._maybe_upcast(self.values)
         # convert integer to float if necessary. need to do a lot more than
         # that, handle boolean etc also
-        new_values, fill_value = com._maybe_upcast(new_values)
-
+        new_values = np.roll(self.values.T,periods, axis=axis)
         axis_indexer = [ slice(None) ] * self.ndim
         if periods > 0:
             axis_indexer[axis] = slice(None,periods)
@@ -959,7 +958,7 @@ class Block(PandasObject):
             axis_indexer[axis] = slice(periods,None)
         new_values[tuple(axis_indexer)] = fill_value

-        return [make_block(new_values, self.items, self.ref_items,
+        return [make_block(new_values.T, self.items, self.ref_items,
                            ndim=self.ndim, fastpath=True)]

     def eval(self, func, other, raise_on_error=True, try_cast=False):

@jreback
Copy link
Contributor

jreback commented Feb 18, 2014

you can also remove _shift_indexer from core/common.py as I think that's the only thing that uses it

@jreback
Copy link
Contributor

jreback commented Feb 18, 2014

pls add a vbench for this as well (vb_suite/frame_methods.py)

@gouthambs
Copy link
Contributor

Thanks. I will try this and add one to vbench.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants