
PERF: Possible performance regression for indexing from 0.12 to 0.13.1 #6882

Closed
pmorissette opened this issue Apr 14, 2014 · 11 comments
Labels
Performance Memory or execution speed performance

Comments

@pmorissette

Hey all,

Just upgraded my pandas version from 0.12 to 0.13.1 and noticed a significant performance regression for indexing operations (get, set, and windowing).

Here is my test setup code:

import pandas as pd

ts1 = pd.TimeSeries(data=100.0, index=pd.date_range('2000-01-01', periods=1000))
ts2 = pd.TimeSeries(data=200.0, index=pd.date_range('2000-01-01', periods=1000))
ts3 = pd.TimeSeries(data=300.0, index=pd.date_range('2000-01-01', periods=1000))
df = pd.DataFrame({'ts1': ts1, 'ts2': ts2, 'ts3': ts3})

dt = ts1.index[500]

Here is a table showing the results of IPython's %timeit magic (times in microseconds):

test              0.12    0.13.1
ts1[dt]           3.78    8.5
ts1.ix[dt]        11.8    30.7
ts1.loc[dt]       12.7    37.7
ts1[dt] = 1       1.86    4.32
ts1.ix[dt] = 1    12.5    65.9
ts1.loc[dt] = 1   36.2    65.7
ts1[:dt]          78.2    101
ts1.ix[:dt]       53.1    106
ts1.loc[:dt]      59.5    101
df.ix[dt]         45.3    77.9
df.ix[:dt]        63.3    85.9

I did not see up-to-date data on http://pandas.pydata.org/pandas-docs/vbench/vb_indexing.html - most charts end in June 2012. Am I looking at the right benchmark data?

Can someone confirm this slowdown?

I am using numpy 1.8.1 by the way - let me know if you need any other version numbers.

Thanks in advance!

@immerrr
Contributor

immerrr commented Apr 15, 2014

Are those numbers microseconds? There was some microsecond-level overhead added in 0.13.1, which I've seen and tried to address, but it was agreed that shaving off several dozen (or hundred?) additional function calls might not be worth it, because that overhead didn't scale with container size. For example, on my 3.3GHz i3, one microsecond is about 6 function calls:

In [1]: def foo(x): return x

In [2]: timeit foo(1)
10000000 loops, best of 3: 149 ns per loop

FTR, there's a pull request with a lot of big-container indexing benchmarks, most likely including the ones shown here. I remember it showing some unexpected slowdowns for datetime indices, but I haven't yet looked into them.

@pmorissette
Author

Hey @immerrr thanks for the quick reply!

Yes, these numbers are in microseconds. The reason I noticed is that I have a program that updates a large number of pre-allocated time series and data frames date after date, so the increase in time was noticeable. I understand this is not a huge issue, but I thought I'd bring it up since I saw no mention of it elsewhere. For my application it did lead to a significant slowdown (~1.7 times slower overall).

@jreback
Contributor

jreback commented Apr 15, 2014

@pmorissette you need to make sure that you are vectorizing. Accessing/setting single elements via most indexers is not that fast, as they handle lots of cases. You can try using iat/at in those cases, but it behooves you to vectorize as much as possible.
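As a rough illustration of this advice, here is a minimal sketch contrasting the scalar .at/.iat indexers with a single vectorized assignment (using pd.Series, the modern equivalent of the pd.TimeSeries in the original snippet; the values are illustrative):

```python
import pandas as pd

ts = pd.Series(100.0, index=pd.date_range('2000-01-01', periods=1000))
dt = ts.index[500]

# Scalar indexers: .at is label-based, .iat is positional; both skip
# much of the general-purpose machinery behind .loc/.ix.
ts.at[dt] = 1.0
ts.iat[501] = 2.0

# Vectorized equivalent: one call updates many elements at once,
# amortizing the per-call overhead across the whole slice.
ts.iloc[500:510] = 3.0
```

The vectorized form pays the indexing overhead once per call instead of once per element, which is why it scales better for bulk updates.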

@pmorissette
Author

Hey @jreback, yeah, vectorizing would indeed be the way to go, but for my application this is not possible. The values I am updating are only known at time t, and I must loop through all the dates one at a time. It is convenient to have the data in a pandas TimeSeries for my application, but perhaps a quicker storage solution could work, and I could create a TimeSeries on demand when necessary. Some testing will be in order.

Also, I will look at iat/at to see if I can get a speed improvement. Thanks for the help!
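A minimal sketch of the alternative-storage idea mentioned above: keep the values in a pre-allocated NumPy buffer, update it positionally in the per-date loop, and only wrap it in a Series when one is actually needed (the loop body here is just a placeholder for the real per-date computation):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2000-01-01', periods=1000)
buf = np.full(len(idx), np.nan)  # pre-allocated storage

# Update one position per date with plain ndarray indexing,
# which avoids pandas' per-call indexing overhead entirely.
for i in range(len(idx)):
    buf[i] = float(i)  # placeholder for the value known at time t

# Build the Series only when it is actually needed downstream.
ts = pd.Series(buf, index=idx)
```

Constructing the Series once at the end costs a single allocation instead of thousands of indexed writes through pandas.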

@pmorissette
Author

@jreback

I just ran my benchmark using .iat and .at, and they too are slower in 0.13.1 vs 0.12. These two methods are also slower than basic bracket indexing. Again, these are microseconds - not a big deal individually, but it adds up in my use case.

test           0.12    0.13.1
ts1[dt]        3.78    8.5
ts1.iat[500]   15.5    26.8
ts1.at[dt]     7       15

@jreback
Contributor

jreback commented Apr 15, 2014

you realize that substantial changes took place in 0.13 - see the whatsnew

if these microseconds matter to you, then you need to do the indexing some other way

@pmorissette
Author

@jreback sounds good - I just wanted to bring it up since I didn't see this issue mentioned elsewhere. Pandas is great, and I appreciate all the hard work that goes into this library. Thanks again.

@immerrr
Contributor

immerrr commented Apr 15, 2014

@pmorissette, sometimes i = ts1.index.get_loc(dt); ts1.iloc[i] was faster for me than simply ts1.loc[dt] - maybe that could help...
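A short sketch of this suggestion (using pd.Series in place of the older pd.TimeSeries): resolve the label to a position once with Index.get_loc, then use the positional .iloc indexer, which skips the label-resolution work on each access.

```python
import pandas as pd

ts1 = pd.Series(100.0, index=pd.date_range('2000-01-01', periods=1000))
dt = ts1.index[500]

# Resolve the datetime label to an integer position once...
i = ts1.index.get_loc(dt)

# ...then index positionally, bypassing label lookup.
value = ts1.iloc[i]
```

This split is most useful when the same label is accessed repeatedly, since the get_loc cost is paid only once.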

@pmorissette
Author

@immerrr ok cool I'll take a look!

@jreback
Contributor

jreback commented Apr 15, 2014

my point before is that iat/at are faster than iloc/loc

they are all probably a bit slower than in 0.12

we normally don't optimize to the microsecond; if that actually matters, you are generally going about the problem in the wrong way

@pmorissette
Author

@jreback ok understood. Thanks for the heads up.
