Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NaN label in MultiIndex is assigned a non NaN value when writing to excel file #13511

Closed
mpuels opened this issue Jun 25, 2016 · 4 comments · Fixed by #13551
Closed

NaN label in MultiIndex is assigned a non NaN value when writing to excel file #13511

mpuels opened this issue Jun 25, 2016 · 4 comments · Fixed by #13551
Labels
Bug IO Excel read_excel, to_excel MultiIndex
Milestone

Comments

@mpuels
Copy link
Contributor

mpuels commented Jun 25, 2016

Given a DataFrame which has a MultiIndex. When a label of the MultiIndex has the value NaN and the DataFrame is written to an excel file, the label will have a value which is not NaN in the excel file.

Code Sample, a copy-pastable example if possible

df = pd.DataFrame({'c1': [1,1,2,2],
                   'c2': [None] + "b a b".split()})
df

returns a DataFrame where the first element of column 'c2' is NaN:

    c1  c2
0   1   
1   1   b
2   2   a
3   2   b

Set both columns as the index:

df_midx = df.set_index(['c1', 'c2'])
df_midx

returns

c1  c2
1   
1   b
2   a
2   b

Write DataFrame to excel file and read it back in:

df_midx.to_excel('df_midx.xlsx')
df_midx_from_xlsx = pd.read_excel('df_midx.xlsx')
df_midx_from_xlsx

returns

    c1  c2
0   1.0 b
1       b
2   2.0 a
3       b

The first element of column 'c2' is now set to 'b' instead of NaN.

Expected Output

    c1  c2
0   1.0
1       b
2   2.0 a
3       b

output of pd.show_versions()

INSTALLED VERSIONS

commit: ac174349b0e1525475c2354e1c0b8ee1ed1cabad
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-88-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: None
pip: 1.5.4
setuptools: 2.2
Cython: None
numpy: 1.11.0
scipy: 0.16.1
statsmodels: 0.6.1
xarray: None
IPython: 4.0.1
sphinx: None
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: 2.5.2
matplotlib: 1.5.0
openpyxl: 2.3.5
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@jorisvandenbossche
Copy link
Member

This indeed seems like a bug (your example is a bit strange, as you end up with a empty dataframe with a mutli-index (as you did set all columns as the index), but it occurs as well for a non-empty dataframe).

Interested in trying to look for a fix?

@jreback
Copy link
Contributor

jreback commented Jun 25, 2016

this is prob a duplicate of #6322 or #5286

@mpuels
Copy link
Contributor Author

mpuels commented Jun 27, 2016

I think the bug is in the method pandas.formats.format.ExcelFormatter._format_hierarchical_rows() which is called by the public method pandas.formats.format.ExcelFormatter.get_formatted_cells(). More precisely the problem is the statement

values = levels.take(labels)

(here) where each of levels and labels correspond to the same level in a MultiIndex. levels contains the possible labels for that level (e.g. ['bar', 'foo']) and labels contains indices (e.g. [0,0,0,1,1,1,1]) which point to elements in levels. If a label is null, it is represented by -1 in labels (e.g. [0,0,-1,1,1,1], if the label of the third row is null). The problem in the above statement is that it doesn't treat the index -1 as a special value.

To fix the bug, the above statement could be replaced by

if levels._can_hold_na:
    values = levels.take(labels, fill_value=True)
else:
    values = levels.take(labels)

What follows is an example:

df = (pd.DataFrame({'c1': [1,1,2,2],
                    'c2': [None] + "b c d".split(),
                    'v' : [6,7,8,9]})
        .set_index(['c1', 'c2']))

df

yields

c1  c2  v
1       6
1   b   7
2   c   8
2   d   9

df.index

yields

MultiIndex(levels=[[1, 2], [u'b', u'c', u'd']],
           labels=[[0, 0, 1, 1], [-1, 0, 1, 2]],
           names=[u'c1', u'c2'])

for levels, labels in zip(df.index.levels, df.index.labels):
    print levels.take(labels)
    print levels._can_hold_na
    if levels._can_hold_na:
        print levels.take(labels, fill_value=True)
    print levels
    print labels
    print "------"

yields

Int64Index([1, 1, 2, 2], dtype='int64', name=u'c1')
False
Int64Index([1, 2], dtype='int64', name=u'c1')
FrozenNDArray([0, 0, 1, 1], dtype='int8')
------
Index([u'd', u'b', u'c', u'd'], dtype='object', name=u'c2')
True
Index([nan, u'b', u'c', u'd'], dtype='object', name=u'c2')
Index([u'b', u'c', u'd'], dtype='object', name=u'c2')
FrozenNDArray([-1, 0, 1, 2], dtype='int8')
------

I'd like to fix that bug, but I haven't found any unit tests for pandas.formats.format.ExcelFormatter. Shall I create one in pandas.tests.formats.test_format.py? And do you have any suggestions on how to assert that the correct ExcelCells are returned by get_formatted_cells()?

@jorisvandenbossche
Copy link
Member

That looks like a very reasonable explanation! PR very welcome.

For the tests, I don't think we have tests for ExcelFormatter directly, but tests for read_excel/to_excel are in https://github.com/pydata/pandas/blob/master/pandas/io/tests/test_excel.py. The basic tests are eg here: https://github.com/pydata/pandas/blob/master/pandas/io/tests/test_excel.py#L1311. Those typically read the written file back in to check it's correctness. That approach should be possible here as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel MultiIndex
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants