Multiindex slicing df.loc[idx[dim1,dim2,dim3],:] not working right in some cases #12896

tntdynamight · 2016-04-14T01:53:33Z

Code Sample, a copy-pastable example if possible

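import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the values shown below are reproducible
idx = pd.IndexSlice
# Three-level index: one label, 20 groups, 7100 consecutive integers per group
midx = pd.MultiIndex.from_product([['CS'], range(20), range(-1100, 6000)])
df = pd.DataFrame(np.random.randn(7100 * 20, 3), columns=['dat1', 'dat2', 'dat3'], index=midx)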

Output

> df.loc[idx['CS', :, -1000:-950], :].head()
                dat1      dat2      dat3
CS 0 -1000 -1.306527  1.658131 -0.118164
     -999  -0.680178  0.666383 -0.460720
     -998  -1.334258 -1.346718  0.693773
     -997  -0.159573 -0.133702  1.077744
     -996  -1.126826 -0.730678 -0.384880

> df.loc[idx['CS', :, -1000:-50], :].head()
                dat1      dat2      dat3
CS 0 -1100  1.764052  0.400157  0.978738  # <<< Index Level 2 should start at -1000
     -1099  2.240893  1.867558 -0.977278
     -1098  0.950088 -0.151357 -0.103219
     -1097  0.410599  0.144044  1.454274
     -1096  0.761038  0.121675  0.443863
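
One way to check the report without reading the printed rows is to compare the `.loc` slice with a plain boolean mask on the third index level. This is only a sketch: the names `sliced`, `level2`, `expected` and `rows_match` are illustrative and not part of the original report, and the comparison is expected to hold only on a pandas release that contains the fix for this issue.

```python
import numpy as np
import pandas as pd

# Same setup as the code sample in the report
np.random.seed(0)
idx = pd.IndexSlice
midx = pd.MultiIndex.from_product([['CS'], range(20), range(-1100, 6000)])
df = pd.DataFrame(np.random.randn(7100 * 20, 3),
                  columns=['dat1', 'dat2', 'dat3'], index=midx)

# The slice from the report that returned extra rows on 0.18.0
sliced = df.loc[idx['CS', :, -1000:-50], :]

# Reference result: a boolean mask on the third level (label slices are inclusive)
level2 = df.index.get_level_values(2)
expected = df[(level2 >= -1000) & (level2 <= -50)]

# True when the slice selected exactly the rows the mask selects
rows_match = sliced.index.equals(expected.index)
print(rows_match, len(sliced), len(expected))
```

On an affected version the two lengths should differ, since the slice also returns rows below -1000.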

output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-59-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.2.2
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.0.3
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0

colinalexander · 2016-04-14T02:05:29Z

I believe idx = pd.IndexSlice

jreback · 2016-04-14T02:52:25Z

hmm, something funny going on.

This gives something odd as well.

In [33]: df.loc[idx[:, :, -1000:-950], :].head()
Out[33]: 
                dat1      dat2      dat3
CS 0 -1000 -1.306527  1.658131 -0.118164
     -999  -0.680178  0.666383 -0.460720
     -998  -1.334258 -1.346718  0.693773
     -997  -0.159573 -0.133702  1.077744
     -996  -1.126826 -0.730678 -0.384880

In [34]: df.loc[idx[:, :, -1000:-50], :].head()
Out[34]: 
                dat1      dat2      dat3
CS 0 -1100  1.764052  0.400157  0.978738
     -1099  2.240893  1.867558 -0.977278
     -1098  0.950088 -0.151357 -0.103219
     -1097  0.410599  0.144044  1.454274
     -1096  0.761038  0.121675  0.443863
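
Since the first level holds only the label 'CS', selecting it by label or with `:` should pick the same rows. The sketch below compares the two forms directly; `by_label`, `by_slice`, `same_rows` and `first_label` are illustrative names, and the result is assumed to hold only on a pandas release with the fix.

```python
import numpy as np
import pandas as pd

# Same setup as the original code sample
np.random.seed(0)
idx = pd.IndexSlice
midx = pd.MultiIndex.from_product([['CS'], range(20), range(-1100, 6000)])
df = pd.DataFrame(np.random.randn(7100 * 20, 3),
                  columns=['dat1', 'dat2', 'dat3'], index=midx)

# First level selected by label versus selected with a full slice
by_label = df.loc[idx['CS', :, -1000:-50], :]
by_slice = df.loc[idx[:, :, -1000:-50], :]

# Both forms should select identical rows
same_rows = by_label.index.equals(by_slice.index)
# First selected index entry; it should carry -1000 on the last level
first_label = by_slice.index[0]
print(same_rows, first_label)
```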

max-sixty · 2016-05-08T01:51:40Z

Any idea what this could be? It seems to be related to the size of the index - how could that be?

Correct with an index of 1500:

In [240]: l=1500

In [241]:  ints = (pd.np.random.rand(l)*1e6).round().astype('int')

In [242]: index=pd.MultiIndex.from_arrays([list('abc')*(l//3), ints])


In [243]:  series=pd.Series(np.random.rand(l), index=index)

In [244]:  series.sort_index().loc[(slice(None), slice(1e5))]
Out[244]: 
a  2360     0.501724
   5253     0.892526
   10122    0.158961
   15737    0.927828
...
   94452    0.460249
   96376    0.248980
   97572    0.514986
   99746    0.719964
dtype: float64

Incorrect with an index of 15000:

In [245]: l=15000

In [246]:  ints = (pd.np.random.rand(l)*1e6).round().astype('int')

In [247]: index=pd.MultiIndex.from_arrays([list('abc')*(l//3), ints])

In [248]: series=pd.Series(np.random.rand(l), index=index)


In [249]: series.sort_index().loc[(slice(None), slice(1e5))]
Out[249]: 
a  409       0.317578
   582       0.526421
   584       0.620082
   838       0.139467
...
   859804    0.510514
   947555    0.951258
dtype: float64
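
The size dependence can be checked in one script by running the same slice at both lengths and comparing each result with a boolean mask. This is a sketch under a few assumptions that differ from the session above: the helper `slice_matches_mask` is hypothetical, the random integers are drawn without replacement so the index has no duplicate entries, and the upper bound is the integer 100000 rather than the float 1e5.

```python
import numpy as np
import pandas as pd

def slice_matches_mask(n, upper=100_000, seed=0):
    """Return True if the MultiIndex slice selects the same rows as a mask."""
    rng = np.random.default_rng(seed)
    # Assumption: unique integers, unlike the rand()-based values in the thread
    ints = rng.choice(1_000_000, size=n, replace=False)
    index = pd.MultiIndex.from_arrays([list('abc') * (n // 3), ints])
    series = pd.Series(rng.random(n), index=index).sort_index()

    # Slice every first-level label up to `upper` on the second level
    sliced = series.loc[(slice(None), slice(upper))]

    # Reference result from a boolean mask on the second level
    level1 = series.index.get_level_values(1)
    expected = series[level1 <= upper]
    return sliced.index.equals(expected.index)

# The two index sizes compared in the thread
results = {n: slice_matches_mask(n) for n in (1500, 15000)}
print(results)
```

On a release with the fix, both entries should be True; on 0.18.0 the larger size would be expected to fail the comparison.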

max-sixty · 2016-05-08T02:15:07Z

Here? https://github.com/pydata/pandas/blob/master/pandas/index.pyx#L305-L305

jreback · 2016-05-08T14:08:06Z

more like here: https://github.com/pydata/pandas/blob/master/pandas/indexes/multi.py#L1734

max-sixty · 2016-05-08T16:17:41Z

Yup. But there must be somewhere where the behavior is different depending on the size of the index?

kawochen · 2016-05-08T16:29:44Z

sounds interesting. I'll give this a go

tntdynamight · 2016-05-10T03:09:05Z

Looks solved, thanks for the effort everyone! I guess you will close this upon merge, or I will once it's through. Cheers!

jreback added the Bug, Indexing (related to indexing on series/frames, not to indexes themselves), MultiIndex, and Difficulty Advanced labels Apr 14, 2016

jreback added this to the 0.18.1 milestone Apr 14, 2016

jreback modified the milestones: 0.18.1, 0.18.2 Apr 26, 2016

jreback mentioned this issue May 8, 2016

BUG: Large MultiIndex-ed series fails on slicing #13113

Closed

kawochen mentioned this issue May 8, 2016

BUG: GH12896 where extra elements are returned in MultiIndex slicing #13117

Closed

jreback closed this as completed in 2de2884 May 14, 2016

jreback mentioned this issue Jul 1, 2016

partial slicing FAILS with a datetimeindex #13539

Closed

jreback mentioned this issue Aug 20, 2016

Partial datetime indexing of Multiindex by year only #14049

Closed
