PERF: non-info axes slicing on Panels is slow #6484

Closed
aldanor opened this issue Feb 26, 2014 · 11 comments · Fixed by #6486
Labels
Indexing: Related to indexing on series/frames, not to indexes themselves
Performance: Memory or execution speed performance

Comments

Contributor

aldanor commented Feb 26, 2014

Assume we have two panels:

>>> pn1.shape
(13, 5412, 162)
>>> pn2.shape
(12, 5412, 162)

12 fields are float64 while one field in pn1 is datetime64[ns]. This slows pretty much all operations (slicing, querying, anything) down by a huge factor:

>>> %timeit pn2.values
100000 loops, best of 3: 4.06 µs per loop

>>> %timeit pn1.values
1 loops, best of 3: 5.31 s per loop

Is there an unofficial rule against using datetime64 in the first place? Is it some weird coercion bug (it seems to try to coerce everything down to floats)? Does it have anything to do with panels?

Contributor

jreback commented Feb 26, 2014

.values coerces to a common dtype, object in this case. In general it is not a good idea to use .values anyway. What are you trying to do?

Slicing etc should be just fine as long as you use pandas methods.
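
A minimal sketch of the coercion described above. It uses a DataFrame instead of a Panel (an assumption, so that it also runs on pandas versions that no longer ship Panel), and the column names are hypothetical:

```python
import pandas as pd

# Hypothetical frame: one float64 column and one datetime64 column.
df = pd.DataFrame({
    "price": [1.0, 2.0, 3.0],
    "stamp": pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-03"]),
})

# No single numeric dtype can hold both columns, so .values falls back to object.
mixed = df.values
print(mixed.dtype)

# Selecting only the float column first keeps the result float64.
floats_only = df[["price"]].values
print(floats_only.dtype)
```

Building an object array means boxing every element, which is one plausible reason the mixed-dtype case is so much slower than the all-float one.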

Contributor Author

aldanor commented Feb 26, 2014

I actually ran into the whole datetime64 thing while trying to work out where the 100x speed drop comes from in the case below (the example is for a pure float64 panel) and whether it could be related to the datetime64 field (as in the previous example):

>>> pn = pd.Panel(np.random.random((12, 5412, 162)))

>>> ix = np.ones(pn.shape[1], dtype=bool)

>>> ix[np.random.random(ix.size) > 0.5] = 0

>>> %timeit pn.loc[0, ix]  # one field, as fast as dataframe
1000 loops, best of 3: 690 µs per loop

>>> %timeit pn.loc[[0, 1], ix]  # two fields, 233x slower
10 loops, best of 3: 161 ms per loop

>>> %timeit pn.loc[:, ix]
1 loops, best of 3: 666 ms per loop

>>> pn_t = pn.swapaxes(0, 1).copy()  # try .loc on the first axis

>>> %timeit pn_t.loc[ix]
1 loops, best of 3: 1.27 s per loop

>>> def pn_slice_major(pn, ix):  # a really stupid way of doing this
   ..:     slices = dict((item, pn.loc[item, ix]) for item in pn.items)
   ..:     pn_s = pd.Panel(major_axis=slices[0].index, minor_axis=slices[0].columns)
   ..:     for item in pn.items:
   ..:         pn_s[item] = slices[item]
   ..:     return pn_s
   ..:

>>> %timeit pn_slice_major(pn, ix)  # 13.5x faster than .loc[:, ix]
10 loops, best of 3: 49.5 ms per loop

Still have no idea what's the exact reason for this ^.
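
The workaround above amounts to masking each item separately and reassembling the pieces. A rough NumPy-only sketch of that idea follows; plain arrays stand in for the Panel and the shape is shrunk, both assumptions made purely for illustration. It only shows that the two routes produce identical data, not where the time goes inside pandas:

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random((12, 500, 16))        # items x major x minor, smaller than in the thread
ix = rng.random(arr.shape[1]) > 0.5    # boolean mask over the major axis

# All items at once: boolean-mask the second axis of the 3-D array.
all_at_once = arr[:, ix]

# Item by item: mask each 2-D slab separately, then stack the results.
per_item = np.stack([arr[i][ix] for i in range(arr.shape[0])])

print(np.array_equal(all_at_once, per_item))
```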

Contributor

jreback commented Feb 26, 2014

#6440 can probably help with this

the indexing code is very tricky

if you can implement this in a generic way, go for it; please look into contributing a fix.

you have to be really careful with this, because if, for example, there are a lot of items, this would be way slower. indexing is optimized for 0th-axis access (otherwise you are doing a lot of cross-section indexing).

jreback added this to the 0.15.0 milestone Feb 26, 2014

Contributor Author

aldanor commented Feb 26, 2014

I can certainly write hacks and workarounds that work faster for specific use cases, but solving it generally is quite a bit more complicated in pandas, especially when you're not that familiar with the entire internal API :/ I'll try looking into it a bit later; maybe it's something more or less obvious.

I find Panels generally very useful (esp for financial market data, where you often have date/symbol/field/etc) and tried using them to avoid indexing similarly indexed data multiple times.. but ironically that's exactly what I have to do now because it's faster.

As for a lot of items: see above where I index the transposed panel.

Contributor

jreback commented Feb 26, 2014

@aldanor great...I use them for the same reason!

I DO think there is a perf degradation from 0.12 to 0.13.1...but as you noted the slicing is pretty tricky.

should be pretty straightforward in this case to at least see where it's coming from; then we can address it.

as an aside, it is sometimes more efficient to transpose, then slice and transpose back (but it's a bit tricky to make this work correctly), since the 0th axis and the -1th axis have different slicing characteristics because of how numpy lays out memory.

I generally line the panels up the way I use them (and sometimes this is different from how I store them in HDF5), e.g. generally something like: items x dates x symbols
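
To make the point about the 0th versus the -1th axis concrete, here is a small NumPy sketch. The array shape is an arbitrary assumption, and a plain ndarray stands in for the Panel's underlying block:

```python
import numpy as np

arr = np.zeros((12, 541, 162))   # items x major x minor, C-ordered by default

along_first = arr[0]             # one item: a single contiguous slab
along_last = arr[..., 0]         # one minor entry: a strided view across the buffer

print(along_first.flags["C_CONTIGUOUS"], along_first.strides)
print(along_last.flags["C_CONTIGUOUS"], along_last.strides)

# Transposing and copying makes the former last axis the contiguous one.
moved = np.ascontiguousarray(arr.transpose(2, 0, 1))
print(moved[0].flags["C_CONTIGUOUS"])
```

Slicing along axis 0 reads one unbroken run of memory, while slicing along the last axis has to hop through the whole buffer, which is why the axis order a panel is stored in can matter for speed.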

Contributor Author

aldanor commented Feb 26, 2014

Good point, I'll try to bench all the above from 0.12 through to 0.13.1 -- I bet it wasn't that slow before.

Btw! More weirdness (data from the very first example):

>>> %timeit pn2.ix[['volumes', 'discounts']]  # panel w/o timestamps; both fields are floats
100 loops, best of 3: 16.9 ms per loop

>>> %timeit pn1.ix[['volumes', 'discounts']]  # panel w/ timestamps; both fields are floats
1 loops, best of 3: 1.42 s per loop  # WTF?

Note that I'm not calling .values or anything that would (or should, at least) trigger coercion. Just pulling 2 items out of 12 should be sort of instant.

Contributor

jreback commented Feb 26, 2014

Panels are very tricky with multi-dtypes. Look at pn2._data and see if the blocks are correct, and then do the same with the sliced panel.

Contributor

jreback commented Feb 26, 2014

Your data is probably lined up cross-sectionally, which causes conversion to object

In [12]: pn = pd.Panel(np.random.random((12, 5412, 162)))

In [13]: pn._data
Out[13]: 
BlockManager
Items: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64')
Axis 1: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')
Axis 2: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')
FloatBlock: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], 12 x 5412 x 162, dtype: float64

In [14]: %timeit pn.ix[[0,1]]
100 loops, best of 3: 14.6 ms per loop

In [15]: pn['foo'] = DataFrame({ 0 : { 0 : Timestamp('20130101') }})

In [16]: pn._data
Out[16]: 
BlockManager
Items: Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, u'foo'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')
Axis 2: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')
FloatBlock: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], 12 x 5412 x 162, dtype: float64
DatetimeBlock: [foo], 1 x 5412 x 162, dtype: datetime64[ns]

In [17]: %timeit pn.ix[[0,1]]
10 loops, best of 3: 30.7 ms per loop

Contributor

jreback commented Feb 26, 2014

@aldanor #6486 fixes this case. Can you add some vbenches for these cases (and then see what effect this has on the other things you are testing)? thanks

jreback modified the milestones: 0.14.0, 0.15.0 Feb 26, 2014

Contributor Author

aldanor commented Feb 27, 2014

@jreback Hey, sorry, I won't have time to look into this until the weekend. Mind if I add a couple of edge cases to the vbench regarding panels?

Contributor

jreback commented Feb 27, 2014

np, submit a PR at your leisure
(keep in mind that #6440 is systematically putting in vbenches, though it may not cover these cases)
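
As a rough idea of what timing these cases could look like, here is a sketch using the standard-library timeit module. This is not vbench's API; the case names are hypothetical, and NumPy arrays stand in for the panels so the snippet runs without Panel:

```python
import timeit

import numpy as np

arr = np.random.random((12, 500, 16))      # items x major x minor, shrunk for speed
ix = np.random.random(arr.shape[1]) > 0.5  # boolean mask over the major axis

# Hypothetical benchmark cases mirroring the slicing patterns in the thread.
cases = {
    "one_item": lambda: arr[0][ix],
    "two_items": lambda: arr[[0, 1]][:, ix],
    "all_items": lambda: arr[:, ix],
}

# Best of 3 repeats, 100 calls each, reported per call in seconds.
timings = {
    name: min(timeit.repeat(fn, number=100, repeat=3)) / 100
    for name, fn in cases.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} us per call")
```

Taking the minimum over repeats follows the same "best of 3" convention that %timeit reports in the transcripts above.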
