Pandas inconsistently handles identically named columns in csv export and merging #3468

darindillon · 2013-04-26T21:25:39Z

Using pandas 0.10.1

Pandas allows creating a dataframe with two columns with the same name. (I disagree that it should be allowed, but it is allowed, so OK). However, it doesn't handle that correctly in several cases.
Pandas ought to either completely disallow duplicate named columns or handle them everywhere. But it shouldn't handle them in some cases but not others.

Problem 1: Round-trip to a CSV
Dump the dataframe to a CSV and then read it back. Even though duplicate columns are supposed to be legal, Pandas won't allow that in the CSV import/export.

import pandas

df = pandas.DataFrame([[1,2]], columns=['a','a'])
# Note: df now has two identically named columns
df.to_csv('foo.csv')
df2 = pandas.DataFrame.from_csv('foo.csv')
# Note: df2 does NOT have the same columns anymore; the second "a" was renamed "a.1"
# At the least, you ought to get some warning that dumping to a CSV will
# lose column names.
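
For anyone reproducing Problem 1 on a recent pandas release rather than 0.10.1, here is a minimal sketch of the same round trip done in memory. It assumes `pandas.read_csv` in place of `DataFrame.from_csv` (which later releases no longer provide) and uses `Index.is_unique` to flag the duplicates up front. The variable names are illustrative only.

```python
import io

import pandas as pd

df = pd.DataFrame([[1, 2]], columns=["a", "a"])

# Check for duplicate column names before exporting.
has_dupes = not df.columns.is_unique

# With no path given, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False)
header_line = csv_text.splitlines()[0]

# Read the text back; repeated header names are renamed on read by default.
df2 = pd.read_csv(io.StringIO(csv_text))
roundtrip_columns = list(df2.columns)

print(has_dupes, header_line, roundtrip_columns)
```

The header written to the CSV still contains both names; the renaming only happens on the read side.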

Problem 2: Merging to a dataframe with dup columns does not work

df = pandas.DataFrame([[1,2,3]], columns=['a','a','b'])
df.merge(df, how='left', on='b')
#Result: Very misleading error message
# edit: reindexing error

I'd be ok with almost any solution (disallowing duplicate named columns, giving you a warning when you do it, handling it correctly in the merge and csv read, etc). But it should be consistently one way or the other.

jreback · 2013-04-26T21:34:48Z

This is completely fixed in master: the round trip to csv works. You will still get a renamed column on readback; that is a feature.

ghost · 2013-04-27T16:03:15Z

@jreback actually brought this up a few days ago, and I talked him down because
I didn't think the mangling mattered to anyone.

Name mangling

to_csv was broken for dupe columns, and it was only fixed in 0.11+ master, though
the name mangling is still there right now (done on read)

My guess is that the column name mangling was put there before dupe column
support to allow people to more easily import 3rd party data that had them. I imagine
editing the column names of a 30MB csv file manually created some issues for
more casual users. Under those assumptions, I feel that was a good compromise.

As it turns out, it's very easy to undo this behavior and it only breaks one test, the one
which specifically tested for it. I think that can go away soon, unless
other issues are raised (breaking existing code, for one). Maybe we can set a
mode.read_mangle_dupe_cols or some such to handle that, defaulting
to old behavior.
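
For readers unfamiliar with the mangling discussed here, the following is a rough sketch of the renaming scheme as it appears from the outside ("a", "a.1", "a.2", ...). `mangle_dupe_names` is a hypothetical helper written for illustration; it is not pandas's actual implementation and it ignores edge cases such as a file that already contains a column literally named "a.1".

```python
from collections import Counter


def mangle_dupe_names(names):
    """Hypothetical helper: suffix repeated names with .1, .2, ... in order."""
    seen = Counter()
    result = []
    for name in names:
        count = seen[name]
        # The first occurrence keeps its name; later ones get a numeric suffix.
        result.append(name if count == 0 else f"{name}.{count}")
        seen[name] += 1
    return result


print(mangle_dupe_names(["a", "a", "b", "a"]))
```

Because the suffix is derived only from the position of each repeat, the original header cannot be recovered from the mangled names without knowing that mangling took place.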

Merge behavior

The merge behavior is naturally more subtle. Though the error message is unclear
from a user's pov, it's actually very apt if you realize why it's happening. Reindexing is crucial
to pandas's good perf re auto-alignment, but its semantics break down when there are dupe columns,
at least currently. I'm not sure it's worth the effort to fix generally or even if it's possible to do
cleanly. Note that there is a significant subset of pandas that works fine for dupe cols,
so that can be useful.

What I will concede, is that:

In [1]: df = pandas.DataFrame([[1,2]], columns=['a','a'])

In [2]: df
Out[2]: 
   a  a
0  1  2

In [3]: df.columns=['a','a.1']
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-3-d9fa6a61a781> in <module>()
----> 1 df.columns=['a','a.1']

Exception: Reindexing only valid with uniquely valued Index objects

Is really terrible usability, because once a frame with dupe cols is created the user can't
manually override to put the frame into working order for the rest of pandas.

If we can fix that somehow, I think that should be a reasonable compromise.
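
As a sketch of the workflow this compromise would enable, assuming a pandas release in which assigning to `df.columns` on a frame with duplicate labels succeeds (it raised in the session above), the user renames the duplicates by hand and then merges as usual. The replacement label "a.1" is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])

# Replace the whole column index with unique labels.
df.columns = ["a", "a.1", "b"]

# With unique labels, the self-merge from Problem 2 goes through.
merged = df.merge(df, how="left", on="b")
merged_columns = list(merged.columns)

print(merged_columns)
```

The default merge suffixes `_x` and `_y` are appended to the overlapping non-key columns, so the renamed labels stay distinguishable in the result.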

@jreback , what do you think?

jreback · 2013-04-27T19:41:55Z

I think adding an option in read_csv to pass through dup columns (rather than mangle them) is doable and probably should be the default (API change); maybe make a separate issue for this.
Assigning a new index to a dup index should work as well; let's make an issue about this as well
(not too hard to fix either).

ghost · 2013-04-27T20:03:54Z

The mangling code sits in a place that's reused by several io paths, so I'm
not sure a read_csv option is a better fit than a config option, but either is fine.

darindillon · 2013-04-29T16:19:38Z

Do we really need an option to make read_csv name-mangle the columns? Yeah, we want to be backwards compatible and all, but really, is there even one person in the world who has a CSV that has duplicate columns and does NOT want pandas to create a dataframe with the exact same column names as in the file?
If pandas can handle the duplicate columns, and if the CSV has duplicate columns, then why would anyone NOT want to just load the column names as-is without mangling? I don't see the use case for who would want the name-mangling (and if you did really need it, it's easy to just rename the column yourself).

ghost · 2013-04-29T16:38:09Z

Some users will have existing code that depends on this behavior. When users
upgrade pandas and their scripts no longer work, they are reasonably upset.
If it's possible to address legacy issues without breaking existing code, that's what we do.

What the default behavior should be (opt-in to new, or activate "legacy" mode) is a bike shed topic
we all can argue from both sides.

ghost · 2013-05-05T09:14:49Z

#3511 merged, unmangled will become the default in 0.12.

jreback · 2013-07-10T13:45:29Z

@y-p I think this can be closed? (or wait to reverse the mangle default in 0.13?)

ghost · 2013-07-10T13:53:19Z

definitely, you nailed this one long ago.

jreback mentioned this issue Apr 29, 2013

BUG: GH3468 Fix assigning a new index to a duplicate index in a DataFrame would fail #3483

Closed

darindillon mentioned this issue Apr 29, 2013

Can't handle duplicate column names in sql read #3487

Closed

This was referenced Apr 30, 2013

ENH: create BlockManager positional indexer (for easier dupe cols support) #3092

Closed

BUG/CLN: Allow the BlockManager to have a non-unique items (axis 0) #3509

Merged

ghost mentioned this issue May 2, 2013

ENH: add mode.mangle_dupe_cols option GH3468 #3511

Merged

jreback mentioned this issue Jun 28, 2013

Assign column values to non-unique DataFrame #4067

Closed

ghost closed this as completed Jul 10, 2013

atheyjohnc mentioned this issue Jul 7, 2015

Reading from Excel mangles columns #10523

Closed

This issue was closed.
