read_csv clobbers values of columns with duplicate names #9424

stevenmanton · 2015-02-05T22:20:07Z

xref #10577 (has test for duplicates with empty data)

I don't think this is the correct behavior, although it's always possible I'm doing something wrong. Importing data with the names keyword clobbers the values of columns whose names are duplicated. For example:

from StringIO import StringIO
import pandas as pd

data = """a,1
b,2
c,3"""
names = ['field', 'field']

print pd.read_csv(StringIO(data), names=names, mangle_dupe_cols=True)
print pd.read_csv(StringIO(data), names=names, mangle_dupe_cols=False)

returns

   field  field
0      1      1
1      2      2
2      3      3
   field  field
0      1      1
1      2      2
2      3      3

However, this produces the correct result:

df = pd.read_csv(StringIO(data), header=None)
df.columns = names
print df

   field  field
0      a      1
1      b      2
2      c      3

Interestingly, it works if the field names are in the header:

data_with_header = "field,field\n" + data
print pd.read_csv(StringIO(data_with_header))

  field  field.1
0     a        1
1     b        2
2     c        3

Is this a bug or am I doing something wrong?


yoavram · 2015-04-26T08:27:21Z

I've come across something similar. When using mangle_dupe_cols=False I get duplicate column names, but the data in the duplicated columns is identical, even though it differs in the file I read. It seems that the last column with a given name overrides all the other columns with that name.

import pandas as pd
from StringIO import StringIO

data = """A,A,B,B,B
    1,2,3,4,5
    6,7,8,9,10
    11,12,13,14,15
    """

df1 = pd.read_table(StringIO(data), sep=',', mangle_dupe_cols=True)
df2 = pd.read_table(StringIO(data), sep=',', mangle_dupe_cols=False)

Now df1 is:

	A	A.1	B	B.1	B.2
0	1	2	3	4	5
1	6	7	8	9	10
2	11	12	13	14	15

which has the original data but non-duplicate column names;

and df2 is:

	A	A	B	B	B
0	2	2	5	5	5
1	7	7	10	10	10
2	12	12	15	15	15

which has duplicate column names, but their respective data has been overridden.
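The "last column wins" pattern is what you would get if parsed columns were stored in a mapping keyed by column name. The following is only a minimal sketch of that idea in plain Python 3, not pandas' actual parser code; `by_name`, `clobbered`, and `preserved` are hypothetical names introduced for illustration, and whether pandas does this internally is an assumption.

```python
import csv
import io

data = "A,A,B,B,B\n1,2,3,4,5\n6,7,8,9,10\n11,12,13,14,15\n"

rows = list(csv.reader(io.StringIO(data)))
names, body = rows[0], rows[1:]
# Transpose the row-major records into one list of values per column.
columns = [list(col) for col in zip(*body)]

# Assumption for illustration: columns are stored keyed by NAME.
# Each later column with the same name overwrites the earlier one.
by_name = {}
for name, values in zip(names, columns):
    by_name[name] = values

# Rebuilding the frame by looking up each name returns the last
# column stored under that name, for every duplicate.
clobbered = [(name, by_name[name]) for name in names]

# Keying by POSITION instead keeps every column's own data.
preserved = list(zip(names, columns))

for name, values in clobbered:
    print(name, values)
```

With this input, both `A` entries in `clobbered` carry the values of the second `A` column, and all three `B` entries carry the values of the last `B` column, which matches the df2 output above. `preserved` keeps the original values under the duplicated names.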

Reproducible bug in IPython notebook: http://nbviewer.ipython.org/github/yoavram/ipython-notebooks/blob/master/pandas%20duplicate%20column%20bug.ipynb

Pandas version 0.16.0. Python 2.7.

shoyer · 2015-04-27T08:11:04Z

This seems very strange to me -- I don't think there's any good reason for this behavior. I'm going to label it as a bug.

gfyoung · 2016-04-22T22:41:56Z

xref #7160

Deduplicates the 'names' parameter by default if there are duplicate names. Also raises when 'mangle_dupe_cols' is False to prevent data overwrite. Closes gh-7160. Closes gh-9424.
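For readers unfamiliar with the deduplication described here, the sketch below shows the naming scheme seen earlier in this thread (`field`, `field.1`; `A`, `A.1`, `B`, `B.1`, `B.2`). It is a standalone Python 3 illustration, not the code from the referenced change; `mangle_dupe_names` is a hypothetical helper, and it does not handle the edge case where a suffixed name such as `A.1` already appears in the input.

```python
from collections import Counter


def mangle_dupe_names(names):
    """Return names with repeats suffixed as name.1, name.2, ...

    Hypothetical helper for illustration only; not a pandas API.
    """
    seen = Counter()
    result = []
    for name in names:
        count = seen[name]
        # First occurrence keeps its name; later ones get a numeric suffix.
        result.append(name if count == 0 else "{}.{}".format(name, count))
        seen[name] += 1
    return result


print(mangle_dupe_names(["field", "field"]))
print(mangle_dupe_names(["A", "A", "B", "B", "B"]))
```

Once every name is unique, storing columns by name can no longer overwrite data, which is presumably why the change raises instead of proceeding when mangling is turned off.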

yoavram pushed a commit to yoavram/ipython-notebooks that referenced this issue Apr 26, 2015

for issue pandas-dev/pandas#9424

5576c0f

shoyer added Bug IO CSV read_csv, to_csv labels Apr 27, 2015

shoyer added this to the Next Major Release milestone Apr 27, 2015

This was referenced Jul 17, 2015

read_table / read_csv with duplicate names leads to column duplication #10496

Closed

Fixed bug where read_csv ignores dtype arg if input is empty. #10577

Merged

gfyoung mentioned this issue Apr 22, 2016

BUG, ENH: Add support for parsing duplicate columns #12935

Closed

jreback closed this as completed in 9a6ce07 May 23, 2016
