read_csv clobbers values of columns with duplicate names #9424

Closed
stevenmanton opened this issue Feb 5, 2015 · 3 comments
Labels
Bug IO CSV read_csv, to_csv
@stevenmanton

xref #10577 (has test for duplicates with empty data)

I don't think this is the intended behavior, although it's always possible I'm doing something wrong. When data is imported with the names keyword and a name is duplicated, the values of the columns sharing that name are clobbered: every such column ends up with the values of the last one. For example:

from StringIO import StringIO
import pandas as pd

data = """a,1
b,2
c,3"""
names = ['field', 'field']

print pd.read_csv(StringIO(data), names=names, mangle_dupe_cols=True)
print pd.read_csv(StringIO(data), names=names, mangle_dupe_cols=False)

returns

   field  field
0      1      1
1      2      2
2      3      3
   field  field
0      1      1
1      2      2
2      3      3

However, this produces the correct result:

df = pd.read_csv(StringIO(data), header=None)
df.columns = names
print df
   field  field
0      a      1
1      b      2
2      c      3
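
For readers on Python 3, the same workaround can be sketched as below. This is a port under assumptions rather than part of the original report: `io.StringIO` stands in for the Python 2 `StringIO` module and `print` is called as a function. It reads the data without names and assigns the duplicate labels afterwards.

```python
import io

import pandas as pd

data = """a,1
b,2
c,3"""
names = ['field', 'field']

# Read without any names so that no column is looked up by label,
# then attach the duplicate labels once the data is already in place.
df = pd.read_csv(io.StringIO(data), header=None)
df.columns = names
print(df)
```

Assigning `df.columns` only relabels the existing columns, so each one keeps its own values.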

Interestingly, it works if the field names are in the header:

data_with_header = "field,field\n" + data
print pd.read_csv(StringIO(data_with_header))
  field  field.1
0     a        1
1     b        2
2     c        3

Is this a bug or am I doing something wrong?

@yoavram

yoavram commented Apr 26, 2015

I've come across something similar. When using mangle_dupe_cols=False I get duplicate column names, but the columns sharing a name all contain the same data, even though the values I read differ. It seems that the last column with a given name overrides all the other columns of that name.

import pandas as pd
from StringIO import StringIO

data = """A,A,B,B,B
    1,2,3,4,5
    6,7,8,9,10
    11,12,13,14,15
    """

df1 = pd.read_table(StringIO(data), sep=',', mangle_dupe_cols=True)
df2 = pd.read_table(StringIO(data), sep=',', mangle_dupe_cols=False)

Now df1 is:

    A  A.1   B  B.1  B.2
0   1    2   3    4    5
1   6    7   8    9   10
2  11   12  13   14   15

which has the original data but non-duplicate column names;

and df2 is:

    A   A   B   B   B
0   2   2   5   5   5
1   7   7  10  10  10
2  12  12  15  15  15

which has duplicate column names, but their respective data has been overridden.
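
One plausible way such an overwrite can arise is sketched below in plain Python. This is an illustration only and is not taken from the pandas source: it assumes a hypothetical parser that stores parsed columns in a dict keyed by column name, so later duplicates replace earlier ones, and then rebuilds the table by looking each header name up again.

```python
header = ["A", "A", "B", "B", "B"]
rows = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
]

# Transpose the rows into columns.
columns = [list(col) for col in zip(*rows)]

# Hypothetical storage keyed by name: each duplicate name overwrites
# the entry written by the previous column with the same name.
by_name = {}
for name, values in zip(header, columns):
    by_name[name] = values

# Rebuilding by name gives every duplicate the data of the last column.
clobbered = [by_name[name] for name in header]
print(clobbered[0])  # [2, 7, 12], the same values as both "A" columns in df2
```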

Reproducible bug in IPython notebook: http://nbviewer.ipython.org/github/yoavram/ipython-notebooks/blob/master/pandas%20duplicate%20column%20bug.ipynb

Pandas version 0.16.0. Python 2.7.

yoavram pushed a commit to yoavram/ipython-notebooks that referenced this issue Apr 26, 2015
shoyer added the Bug and IO CSV (read_csv, to_csv) labels Apr 27, 2015
shoyer added this to the Next Major Release milestone Apr 27, 2015
@shoyer
Member

shoyer commented Apr 27, 2015

This seems very strange to me -- I don't think there's any good reason for this behavior. I'm going to label it as a bug.

@gfyoung
Member

gfyoung commented Apr 22, 2016

xref #7160

gfyoung added a commit to forking-repos/pandas that referenced this issue May 23, 2016
Deduplicates the 'names' parameter by default if
there are duplicate names. Also raises when
'mangle_dupe_cols' is False to prevent data overwrite.

Closes pandas-devgh-7160.
Closes pandas-devgh-9424.
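
The behavior described in this commit message can be sketched roughly as follows. The function `dedupe_names` is a hypothetical helper written for illustration, not the actual pandas implementation. It follows the `field`, `field.1` suffix pattern shown earlier in the thread and raises when mangling is disabled and duplicates are present.

```python
from collections import Counter


def dedupe_names(names, mangle_dupe_cols=True):
    """Return names with duplicates renamed to name.1, name.2, ...

    Hypothetical sketch: raises ValueError instead of letting duplicate
    names silently overwrite each other's data. It does not guard against
    a generated name colliding with a name that is already in the list.
    """
    if not mangle_dupe_cols and len(set(names)) != len(names):
        raise ValueError("duplicate names require mangle_dupe_cols=True")

    counts = Counter()
    result = []
    for name in names:
        seen = counts[name]
        # First occurrence keeps its name; later ones get a numeric suffix.
        result.append(name if seen == 0 else "{}.{}".format(name, seen))
        counts[name] += 1
    return result


print(dedupe_names(["field", "field"]))  # ['field', 'field.1']
```

With the header from the second report, `dedupe_names(["A", "A", "B", "B", "B"])` gives `['A', 'A.1', 'B', 'B.1', 'B.2']`, the same labels as df1.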