Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: iloc fails with non lex-sorted MultiIndex #13797

Closed
ygriku opened this issue Jul 26, 2016 · 8 comments
Closed

BUG: iloc fails with non lex-sorted MultiIndex #13797

ygriku opened this issue Jul 26, 2016 · 8 comments
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Milestone

Comments

@ygriku
Copy link
Contributor

ygriku commented Jul 26, 2016

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
ind = [
        ['AA','AA','AA','BB','BB'],
        ['A' ,'B' ,'B' ,'a' ,'b']
] 
ind_nonLex = [
        ['CC','CC','CC','BB','BB'],
        ['A' ,'B' ,'B' ,'a' ,'b']
] 

strCol=pd.DataFrame([['fooA'],['fooB'],['fooC'],['fooD'],['fooE']])

dat=np.arange(1,26).reshape(5,5)
df=pd.concat([strCol, pd.DataFrame(dat)], axis=1)
df1=pd.DataFrame(df.values, index=ind_nonLex)
df2=pd.DataFrame(df.values, index=ind)

df1 (whose index is not lex sorted) fails with iloc access:

>>> df1.iloc[0,0]
C:\Users\rikuhiro\Anaconda3\envs\pd-check\lib\site-packages\ipykernel\__main__.py:1: PerformanceWarning: indexing past lexsort depth may impact performance.
  if __name__ == '__main__':
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-37-884ae2904642> in <module>()
----> 1 df1.iloc[0,0]

C:\Users\rikuhiro\Anaconda3\envs\pd-check\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1292 
   1293         if type(key) is tuple:
-> 1294             return self._getitem_tuple(key)
   1295         else:
   1296             return self._getitem_axis(key, axis=0)

C:\Users\rikuhiro\Anaconda3\envs\pd-check\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
   1561 
   1562             # if the dim was reduced, then pass a lower-dim the next time
-> 1563             if retval.ndim < self.ndim:
   1564                 axis -= 1
   1565 

AttributeError: 'str' object has no attribute 'ndim'

Expected Output

df2 (whose index is lex sorted) works as expected:

>>> df2.iloc[0,0]
'fooA'

This attributeError does not occur when the DataFrame.values consist of numpy objects (e.g. numpy.int32) because they have the ndim attribute. (Although the performance warning remains, it may be another issue).

I found that the addition of an if statement can remedy this in pandas/core/indexing.py. This just makes the _getitem_tuple(self, tup) be aware of objects without the ndim attribute, as _getitem_nested_tuple(self, tup) is (I will prepare a pull request if it is helpful.)

 @@ -1569,6 +1569,10 @@ def _getitem_tuple(self, tup):
              retval = getattr(retval, self.name)._getitem_axis(key, axis=axis)

 +            # if we have a scalar, we are done
 +            if lib.isscalar(retval) or not hasattr(retval, 'ndim'):
 +                break
 +
              # if the dim was reduced, then pass a lower-dim the next time
              if retval.ndim < self.ndim:
                  axis -= 1

output of pd.show_versions()

>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 23.0.0
Cython: None
numpy: 1.11.1
scipy: None
statsmodels: None
xarray: None
IPython: 5.0.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None
@sinhrks sinhrks added Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Jul 26, 2016
@jorisvandenbossche jorisvandenbossche added this to the Next Major Release milestone Jul 26, 2016
@jreback
Copy link
Contributor

jreback commented Jul 26, 2016

@YG-Riku actually the fix should start here

.iloc should not hit the MultiIndex branch testing at all. If you want to give this a try would be great.

@ygriku
Copy link
Contributor Author

ygriku commented Jul 26, 2016

@jreback
Thank you very much for your response.

That's the point where I stumbled on too. Just commenting out the lines:

#        ax0 = self.obj._get_axis(0)
#        if isinstance(ax0, MultiIndex):
#            result = self._handle_lowerdim_multi_index_axis0(tup)
#            if result is not None:
#                return result

diminishes the error (and the warning too). But I am not sure about the consequence of this removal of the codes (any side-effect?).

@jreback
Copy link
Contributor

jreback commented Jul 26, 2016

almost certainly needs to be more selective
eg dispatch on self.name == 'iloc'

lots and lots of indexing tests must still pass

@ygriku
Copy link
Contributor Author

ygriku commented Jul 27, 2016

I think I can do some trials-and-errors with the existing test cases of pandas/tests. But if there is a need to prepare a decent set of new test codes, that will be too much burden for me. (Anyway, I will start looking into the test cases ... )

@jreback
Copy link
Contributor

jreback commented Jul 27, 2016

no as I said there are hundreds of tests already
you can just add this case

@ygriku
Copy link
Contributor Author

ygriku commented Jul 27, 2016

Thank you for the comment. I will try. It is exciting to have a chance to contribute!. (But, please don't wait for me. Probably the time I can spare for this would be mostly on weekends.)

@ygriku
Copy link
Contributor Author

ygriku commented Aug 15, 2016

I found that the following simple amendment works as jreback suggested.
This passed nosetests pandas/tests/test_multilevel.py and no significant performance change was detected (at least in some trial).

@@ -903,7 +903,7 @@ class _NDFrameIndexer(object):

         # we maybe be using a tuple to represent multiple dimensions here
         ax0 = self.obj._get_axis(0)
-        if isinstance(ax0, MultiIndex):
+        if isinstance(ax0, MultiIndex) and self.name != 'iloc':
             result = self._handle_lowerdim_multi_index_axis0(tup)
             if result is not None:
                 return result

What is the next step?

@jreback
Copy link
Contributor

jreback commented Aug 15, 2016

you submit a PR with tests and the fix

see here: http://pandas.pydata.org/pandas-docs/stable/contributing.html

@jreback jreback modified the milestones: 0.19.0, Next Major Release Sep 13, 2016
yarikoptic added a commit to neurodebian/pandas that referenced this issue Oct 13, 2016
* commit 'v0.19.0rc1-25-ga7469cf': (471 commits)
  ENH: Add divmod to series and index. (pandas-dev#14208)
  Fix generator tests to run (pandas-dev#14245)
  BUG: GH13629 Binned groupby median function with empty bins (pandas-dev#14225)
  TST/TEMP: fix pyqt to 4.x for plotting tests (pandas-dev#14240)
  DOC: added example to Series.map showing use of na_action parameter (GH14231)
  DOC: split docstring into multiple lines in excel.py (pandas-dev#14073)
  MAINT: Use __module__ in _DeprecatedModule. (pandas-dev#14181)
  ENH: Allow true_values and false_values options in read_excel (pandas-dev#14002)
  DOC: fix incorrect example in unstack docstring (GH14206) (pandas-dev#14211)
  BUG: iloc fails with non lex-sorted MultiIndex pandas-dev#13797
  BUG: add check for infinity in __call__ of EngFormatter
  In gbq.to_gbq allow the DataFrame column order to differ from schema
  BLD: require cython if tempita is needed
  DOC: add source links to api docs (pandas-dev#14200)
  BUG: compat with Stata ver 111
  Fix: F999 dictionary key '2000q4' repeated with different values (pandas-dev#14198)
  BLD: Test for Python 3.5 with C locale
  BUG: DatetimeTZBlock can't assign values near dst boundary
  BUG: union_categorical with Series and cat idx
  BUG: fix str.contains for series containing only nan values
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants