Multiindex slicing df.loc[idx[dim1,dim2,dim3],:] not working right in some cases #12896

Closed
tntdynamight opened this issue Apr 14, 2016 · 8 comments
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex

@tntdynamight

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
np.random.seed(0) 
idx = pd.IndexSlice
midx = pd.MultiIndex.from_product([['CS'], range(20), range(-1100, 6000)]) 
df = pd.DataFrame(np.random.randn(7100*20, 3), columns=['dat1', 'dat2', 'dat3'], index=midx)

Output

> df.loc[idx['CS', :, -1000:-950], :].head()
                dat1      dat2      dat3
CS 0 -1000 -1.306527  1.658131 -0.118164
     -999  -0.680178  0.666383 -0.460720
     -998  -1.334258 -1.346718  0.693773
     -997  -0.159573 -0.133702  1.077744
     -996  -1.126826 -0.730678 -0.384880
> df.loc[idx['CS', :, -1000:-50], :].head()
                dat1      dat2      dat3
CS 0 -1100  1.764052  0.400157  0.978738  # <<< Index Level 2 should start at -1000
     -1099  2.240893  1.867558 -0.977278
     -1098  0.950088 -0.151357 -0.103219
     -1097  0.410599  0.144044  1.454274
     -1096  0.761038  0.121675  0.443863
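One way to confirm which rows the slice ought to return is to build the same selection with a boolean mask on level 2, which does not go through the MultiIndex slicing path. The following is a minimal sketch, assuming the `df` and `idx` from the code sample above; the names `level2`, `expected`, `result` and `matches` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
idx = pd.IndexSlice
midx = pd.MultiIndex.from_product([['CS'], range(20), range(-1100, 6000)])
df = pd.DataFrame(np.random.randn(7100 * 20, 3),
                  columns=['dat1', 'dat2', 'dat3'], index=midx)

# Reference selection: filter on the values of index level 2 directly.
level2 = df.index.get_level_values(2)
expected = df[(level2 >= -1000) & (level2 <= -50)]

# Selection under test: the slice from the report.
result = df.loc[idx['CS', :, -1000:-50], :]

# True when the slice and the mask select exactly the same rows.
matches = result.equals(expected)
print(matches)
```

On a pandas version affected by this issue the two selections should differ, because the slice starts at -1100 instead of -1000; on a version without the bug they are expected to agree.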

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-59-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.2.2
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.0.3
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0

@colinalexander

I believe idx = pd.IndexSlice

@jreback
Contributor

jreback commented Apr 14, 2016

hmm, something funny going on.

This gives something odd as well.

In [33]: df.loc[idx[:, :, -1000:-950], :].head()
Out[33]: 
                dat1      dat2      dat3
CS 0 -1000 -1.306527  1.658131 -0.118164
     -999  -0.680178  0.666383 -0.460720
     -998  -1.334258 -1.346718  0.693773
     -997  -0.159573 -0.133702  1.077744
     -996  -1.126826 -0.730678 -0.384880

In [34]: df.loc[idx[:, :, -1000:-50], :].head()
Out[34]: 
                dat1      dat2      dat3
CS 0 -1100  1.764052  0.400157  0.978738
     -1099  2.240893  1.867558 -0.977278
     -1098  0.950088 -0.151357 -0.103219
     -1097  0.410599  0.144044  1.454274
     -1096  0.761038  0.121675  0.443863

@jreback jreback added Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Difficulty Advanced labels Apr 14, 2016
@jreback jreback added this to the 0.18.1 milestone Apr 14, 2016
@jreback jreback modified the milestones: 0.18.1, 0.18.2 Apr 26, 2016
@max-sixty
Contributor

Any idea what this could be? It seems to be related to the size of the index - how could that be?

Correct with an index of 1500:

In [240]: l=1500

In [241]:  ints = (pd.np.random.rand(l)*1e6).round().astype('int')

In [242]: index=pd.MultiIndex.from_arrays([list('abc')*(l//3), ints])


In [243]:  series=pd.Series(np.random.rand(l), index=index)

In [244]:  series.sort_index().loc[(slice(None), slice(1e5))]
Out[244]: 
a  2360     0.501724
   5253     0.892526
   10122    0.158961
   15737    0.927828
...
   94452    0.460249
   96376    0.248980
   97572    0.514986
   99746    0.719964
dtype: float64

Incorrect with an index of 15000:

In [245]: l=15000

In [246]:  ints = (pd.np.random.rand(l)*1e6).round().astype('int')

In [247]: index=pd.MultiIndex.from_arrays([list('abc')*(l//3), ints])

In [248]: series=pd.Series(np.random.rand(l), index=index)


In [249]: series.sort_index().loc[(slice(None), slice(1e5))]
Out[249]: 
a  409       0.317578
   582       0.526421
   584       0.620082
   838       0.139467
...
   859804    0.510514
   947555    0.951258
dtype: float64
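The size dependence can be checked without reading the output by eye. The sketch below is a rough check rather than a diagnosis: it compares the slice against a boolean mask for both index sizes. The helper `slice_matches_mask` and its parameters are hypothetical names, and unlike the session above it draws the integers without replacement so the index is unique, which is a simplification.

```python
import numpy as np
import pandas as pd


def slice_matches_mask(n, upper=100000, seed=0):
    """Return True if slicing level 1 up to `upper` agrees with a boolean mask."""
    rng = np.random.RandomState(seed)
    # Unique integers below 1e6 (assumption: uniqueness is not part of the report).
    ints = rng.choice(10**6, size=n, replace=False)
    index = pd.MultiIndex.from_arrays([list('abc') * (n // 3), ints])
    series = pd.Series(rng.rand(n), index=index).sort_index()

    # Selection under test: all of level 0, level 1 up to `upper`.
    sliced = series.loc[(slice(None), slice(upper))]

    # Reference selection: filter on the level values directly.
    level1 = series.index.get_level_values(1)
    masked = series[level1 <= upper]
    return sliced.equals(masked)


results = {n: slice_matches_mask(n) for n in (1500, 15000)}
print(results)
```

If the behavior described here is present, the entry for 15000 would be expected to come out False while the one for 1500 stays True; on a version without the bug both should be True.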

@max-sixty
Contributor

@jreback
Contributor

jreback commented May 8, 2016

@max-sixty
Contributor

Yup. But there must be somewhere where the behavior is different depending on the size of the index?

@kawochen
Contributor

kawochen commented May 8, 2016

sounds interesting. I'll give this a go

@tntdynamight
Author

Looks solved, thanks for the effort everyone! I guess you will close this upon merge, or I will once it's through. Cheers!
