BUG: iloc fails with non lex-sorted MultiIndex #13797

ygriku · 2016-07-26T01:55:40Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
ind = [
        ['AA','AA','AA','BB','BB'],
        ['A' ,'B' ,'B' ,'a' ,'b']
] 
ind_nonLex = [
        ['CC','CC','CC','BB','BB'],
        ['A' ,'B' ,'B' ,'a' ,'b']
] 

strCol=pd.DataFrame([['fooA'],['fooB'],['fooC'],['fooD'],['fooE']])

dat=np.arange(1,26).reshape(5,5)
df=pd.concat([strCol, pd.DataFrame(dat)], axis=1)
df1=pd.DataFrame(df.values, index=ind_nonLex)
df2=pd.DataFrame(df.values, index=ind)

df1 (whose index is not lex sorted) fails with iloc access:

>>> df1.iloc[0,0]
C:\Users\rikuhiro\Anaconda3\envs\pd-check\lib\site-packages\ipykernel\__main__.py:1: PerformanceWarning: indexing past lexsort depth may impact performance.
  if __name__ == '__main__':
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-37-884ae2904642> in <module>()
----> 1 df1.iloc[0,0]

C:\Users\rikuhiro\Anaconda3\envs\pd-check\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1292 
   1293         if type(key) is tuple:
-> 1294             return self._getitem_tuple(key)
   1295         else:
   1296             return self._getitem_axis(key, axis=0)

C:\Users\rikuhiro\Anaconda3\envs\pd-check\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
   1561 
   1562             # if the dim was reduced, then pass a lower-dim the next time
-> 1563             if retval.ndim < self.ndim:
   1564                 axis -= 1
   1565 

AttributeError: 'str' object has no attribute 'ndim'

Expected Output

df2 (whose index is lex sorted) works as expected:

>>> df2.iloc[0,0]
'fooA'

This attributeError does not occur when the DataFrame.values consist of numpy objects (e.g. numpy.int32) because they have the ndim attribute. (Although the performance warning remains, it may be another issue).

I found that the addition of an if statement can remedy this in pandas/core/indexing.py. This just makes the _getitem_tuple(self, tup) be aware of objects without the ndim attribute, as _getitem_nested_tuple(self, tup) is (I will prepare a pull request if it is helpful.)

 @@ -1569,6 +1569,10 @@ def _getitem_tuple(self, tup):
              retval = getattr(retval, self.name)._getitem_axis(key, axis=axis)

 +            # if we have a scalar, we are done
 +            if lib.isscalar(retval) or not hasattr(retval, 'ndim'):
 +                break
 +
              # if the dim was reduced, then pass a lower-dim the next time
              if retval.ndim < self.ndim:
                  axis -= 1

output of `pd.show_versions()`

>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 23.0.0
Cython: None
numpy: 1.11.1
scipy: None
statsmodels: None
xarray: None
IPython: 5.0.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jreback · 2016-07-26T10:41:25Z

@YG-Riku actually the fix should start here

.iloc should not hit the MultiIndex branch testing at all. If you want to give this a try would be great.

ygriku · 2016-07-26T11:03:09Z

@jreback
Thank you very much for your response.

That's the point where I stumbled on too. Just commenting out the lines:

#        ax0 = self.obj._get_axis(0)
#        if isinstance(ax0, MultiIndex):
#            result = self._handle_lowerdim_multi_index_axis0(tup)
#            if result is not None:
#                return result

diminishes the error (and the warning too). But I am not sure about the consequence of this removal of the codes (any side-effect?).

jreback · 2016-07-26T11:11:37Z

almost certainly needs to be more selective
eg dispatch on self.name == 'iloc'

lots and lots of indexing tests must still pass

ygriku · 2016-07-27T00:24:03Z

I think I can do some trials-and-errors with the existing test cases of pandas/tests. But if there is a need to prepare a decent set of new test codes, that will be too much burden for me. (Anyway, I will start looking into the test cases ... )

jreback · 2016-07-27T00:26:36Z

no as I said there are hundreds of tests already
you can just add this case

ygriku · 2016-07-27T11:00:41Z

Thank you for the comment. I will try. It is exciting to have a chance to contribute!. (But, please don't wait for me. Probably the time I can spare for this would be mostly on weekends.)

ygriku · 2016-08-15T11:07:34Z

I found that the following simple amendment works as jreback suggested.
This passed nosetests pandas/tests/test_multilevel.py and no significant performance change was detected (at least in some trial).

@@ -903,7 +903,7 @@ class _NDFrameIndexer(object):

         # we maybe be using a tuple to represent multiple dimensions here
         ax0 = self.obj._get_axis(0)
-        if isinstance(ax0, MultiIndex):
+        if isinstance(ax0, MultiIndex) and self.name != 'iloc':
             result = self._handle_lowerdim_multi_index_axis0(tup)
             if result is not None:
                 return result

What is the next step?

jreback · 2016-08-15T11:56:01Z

you submit a PR with tests and the fix

see here: http://pandas.pydata.org/pandas-docs/stable/contributing.html

* commit 'v0.19.0rc1-25-ga7469cf': (471 commits) ENH: Add divmod to series and index. (pandas-dev#14208) Fix generator tests to run (pandas-dev#14245) BUG: GH13629 Binned groupby median function with empty bins (pandas-dev#14225) TST/TEMP: fix pyqt to 4.x for plotting tests (pandas-dev#14240) DOC: added example to Series.map showing use of na_action parameter (GH14231) DOC: split docstring into multiple lines in excel.py (pandas-dev#14073) MAINT: Use __module__ in _DeprecatedModule. (pandas-dev#14181) ENH: Allow true_values and false_values options in read_excel (pandas-dev#14002) DOC: fix incorrect example in unstack docstring (GH14206) (pandas-dev#14211) BUG: iloc fails with non lex-sorted MultiIndex pandas-dev#13797 BUG: add check for infinity in __call__ of EngFormatter In gbq.to_gbq allow the DataFrame column order to differ from schema BLD: require cython if tempita is needed DOC: add source links to api docs (pandas-dev#14200) BUG: compat with Stata ver 111 Fix: F999 dictionary key '2000q4' repeated with different values (pandas-dev#14198) BLD: Test for Python 3.5 with C locale BUG: DatetimeTZBlock can't assign values near dst boundary BUG: union_categorical with Series and cat idx BUG: fix str.contains for series containing only nan values ...

sinhrks added Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Jul 26, 2016

jorisvandenbossche added this to the Next Major Release milestone Jul 26, 2016

ygriku mentioned this issue Aug 19, 2016

BUG: iloc fails with non lex-sorted MultiIndex #13797 #14038

Closed

4 tasks

jreback modified the milestones: 0.19.0, Next Major Release Sep 13, 2016

jreback closed this as completed in fb25cca Sep 13, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: iloc fails with non lex-sorted MultiIndex #13797

BUG: iloc fails with non lex-sorted MultiIndex #13797

ygriku commented Jul 26, 2016

jreback commented Jul 26, 2016

ygriku commented Jul 26, 2016

jreback commented Jul 26, 2016

ygriku commented Jul 27, 2016

jreback commented Jul 27, 2016

ygriku commented Jul 27, 2016

ygriku commented Aug 15, 2016

jreback commented Aug 15, 2016

BUG: iloc fails with non lex-sorted MultiIndex #13797

BUG: iloc fails with non lex-sorted MultiIndex #13797

Comments

ygriku commented Jul 26, 2016

Code Sample, a copy-pastable example if possible

Expected Output

output of pd.show_versions()

jreback commented Jul 26, 2016

ygriku commented Jul 26, 2016

jreback commented Jul 26, 2016

ygriku commented Jul 27, 2016

jreback commented Jul 27, 2016

ygriku commented Jul 27, 2016

ygriku commented Aug 15, 2016

jreback commented Aug 15, 2016

output of `pd.show_versions()`