Pandas inconsistently handles identically named columns in csv export and merging #3468

Closed
darindillon opened this issue Apr 26, 2013 · 9 comments · Fixed by #3509
Labels
Enhancement IO CSV read_csv, to_csv IO Data IO issues that don't fit into a more specific label

@darindillon

Using pandas 0.10.1

Pandas allows creating a dataframe with two columns with the same name. (I disagree that it should be allowed, but it is allowed, so OK). However, it doesn't handle that correctly in several cases.
Pandas ought to either completely disallow duplicate named columns or handle them everywhere. But it shouldn't handle them in some cases but not others.

Problem 1: Round-trip to a CSV
Dump the dataframe to a CSV and then read it back. Even though duplicate columns are supposed to be legal, Pandas won't allow that in the CSV import/export.

import pandas

df = pandas.DataFrame([[1,2]], columns=['a','a'])
# Note: df now has two identically named columns.
df.to_csv('foo.csv')
df2 = pandas.DataFrame.from_csv('foo.csv')
# Note: df2 does NOT have the same columns anymore; the second "a" was renamed "a.1".
# At the least, you ought to get some warning that dumping to a CSV will
# lose column names.
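
For readers on a recent pandas release, the round trip above can be sketched in memory as follows. This is only an illustration under assumptions: it uses pandas.read_csv in place of DataFrame.from_csv (which newer releases no longer provide) and io.StringIO in place of foo.csv, so it is not the exact 0.10.1 behavior reported here.

```python
import io

import pandas as pd

# Frame with two identically named columns, as in the report.
df = pd.DataFrame([[1, 2]], columns=["a", "a"])

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv()
header_line = csv_text.splitlines()[0]

# Read the text back, using the first (unnamed) column as the index.
df2 = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(header_line)        # header as written to the CSV text
print(list(df2.columns))  # column labels after reading back
```

Comparing the two printed lines shows whether the duplicate name survived the write and whether it was renamed on read.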

Problem 2: Merging to a dataframe with dup columns does not work

df = pandas.DataFrame([[1,2,3]], columns=['a','a','b'])
df.merge(df, how='left', on='b')
#Result: Very misleading error message
# edit: reindexing error

I'd be ok with almost any solution (disallowing duplicate named columns, giving you a warning when you do it, handling it correctly in the merge and csv read, etc). But it should be consistently one way or the other.
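
One of the options listed above, failing early with a clear message, can be sketched in user code. The wrapper name merge_checked is hypothetical and not part of pandas; the sketch assumes a pandas release where Index.is_unique and Index.duplicated are available.

```python
import pandas as pd


def merge_checked(left, right, **kwargs):
    """Hypothetical wrapper: refuse frames that have duplicate column names."""
    for label, frame in (("left", left), ("right", right)):
        if not frame.columns.is_unique:
            # Collect each repeated name once for the error message.
            dupes = sorted(set(frame.columns[frame.columns.duplicated()]))
            raise ValueError(f"{label} frame has duplicate column names: {dupes}")
    return left.merge(right, **kwargs)


df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])
message = None
try:
    merge_checked(df, df, how="left", on="b")
except ValueError as exc:
    message = str(exc)

print(message)
```

The check runs before the merge, so the user sees which names are repeated rather than an error raised from inside the reindexing machinery.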

@jreback
Contributor

jreback commented Apr 26, 2013

  1. is completely fixed in master (that is, the round trip to CSV works; you will get a renamed column on readback, which is a feature)

@ghost

ghost commented Apr 27, 2013

@jreback actually brought this up a few days ago, and I talked him down because
I didn't think the mangling mattered to anyone.

Name mangling

to_csv was broken for dupe columns, and it was only fixed in 0.11+ master, though
the name mangling is still there right now (done on read)

My guess is that the column name mangling was put there before dupe-column
support existed, to allow people to more easily import 3rd party data that had them. I imagine
editing the column names of a 30MB csv file manually created some issues for
more casual users. Under those assumptions, I feel that was a good compromise.

As it turns out, it's very easy to undo this behavior, and it only breaks one test, the one
which specifically tested for it. I think that can go away soon, unless
other issues are raised (breaking existing code, for one). Maybe we can set a
mode.read_mangle_dupe_cols or some such to handle that, defaulting
to old behavior.
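
To make the renaming scheme concrete, here is a minimal pure-Python sketch of the mangling described above (a repeated "a" becomes "a.1", then "a.2", and so on). The function name mangle_dupe_names is hypothetical and this is not the pandas implementation; it only mirrors the naming pattern mentioned in this thread.

```python
from collections import Counter


def mangle_dupe_names(names):
    """Hypothetical helper: rename repeats as name.1, name.2, ..."""
    seen = Counter()
    result = []
    for name in names:
        count = seen[name]
        # The first occurrence keeps its name; later ones get a numeric suffix.
        result.append(name if count == 0 else f"{name}.{count}")
        seen[name] += 1
    return result


mangled = mangle_dupe_names(["a", "a", "b", "a"])
print(mangled)
```

Note that this sketch does not guard against a file that already contains a column literally named "a.1", which is one reason the behavior is arguably better left as an opt-in.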

Merge behavior

The merge behavior is naturally more subtle. Though the error message is unclear
from a user's pov, it's actually very much to the point if you realize why it's happening. Reindexing is crucial
to pandas's good perf re auto-alignment, but its semantics break down when there are dupe columns,
at least currently. I'm not sure it's worth the effort to fix generally, or even if it's possible to do
cleanly. Note that there is a significant subset of pandas that works fine for dupe cols,
so that can be useful.

What I will concede is that:

In [1]: df = pandas.DataFrame([[1,2]], columns=['a','a'])

In [2]: df
Out[2]: 
   a  a
0  1  2

In [3]: df.columns=['a','a.1']
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-3-d9fa6a61a781> in <module>()
----> 1 df.columns=['a','a.1']

Exception: Reindexing only valid with uniquely valued Index objects

is really terrible usability, because once a frame with dupe cols is created, the user can't
manually override the column names to put the frame into working order for the rest of pandas.

If we can fix that somehow, I think that should be a reasonable compromise.

@jreback, what do you think?
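
For comparison, the manual override from In [3] can be written as the short sketch below. It assumes a recent pandas release, where assigning unique labels to a frame with duplicate column names is expected to succeed; it does not reproduce the exception shown for the version discussed in this thread.

```python
import pandas as pd

df = pd.DataFrame([[1, 2]], columns=["a", "a"])

# Replace the duplicate labels with unique ones, mirroring In [3] above.
df.columns = ["a", "a.1"]

# With unique labels, selecting "a" yields a single column (a Series).
first = df["a"]

print(list(df.columns))
```

Once the labels are unique, the frame can be passed to the parts of pandas that rely on reindexing.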

@jreback
Contributor

jreback commented Apr 27, 2013

  1. I think adding an option in read_csv to pass thru dup columns (rather than mangle) is doable and prob should be the default (API change); maybe make a separate issue for this
  2. assigning a new index to a dup index should work as well; let's make an issue about this as well
    (not too hard to fix either)
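
Until such a pass-through option exists, the same effect can be approximated in user code. The sketch below is an assumption-laden workaround, not the option proposed above: the helper name read_csv_keep_dupes is hypothetical, it reads the header with the standard csv module, and it relies on a pandas release where assigning duplicate labels to df.columns works.

```python
import csv
import io

import pandas as pd


def read_csv_keep_dupes(text):
    """Hypothetical helper: read CSV text and keep header names verbatim."""
    header = next(csv.reader(io.StringIO(text)))
    # Skip the header row so pandas never sees (or renames) the labels.
    frame = pd.read_csv(io.StringIO(text), header=None, skiprows=1)
    frame.columns = header
    return frame


df = read_csv_keep_dupes("a,a,b\n1,2,3\n")
print(list(df.columns))
```

Because the labels are assigned after parsing, the data types are still inferred by pandas as usual.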

@ghost

ghost commented Apr 27, 2013

The mangling code sits in a place that's reused by several IO paths,
so I'm not sure a read_csv option is a better fit than a config option, but either is fine.

@darindillon
Author

Do we really need an option to make read_csv name-mangle the columns? Yeah, we want to be backwards compatible and all, but really, is there even one person in the world who has a CSV that has duplicate columns and does NOT want pandas to create a dataframe with the exact same column names as in the file?
If pandas can handle the duplicate columns, and if the CSV has duplicate columns, then why would anyone NOT want to just load the column names as-is without mangling? I don't see the use case for wanting the name-mangling (and if you really did need it, it's easy to just rename the column yourself).

@ghost

ghost commented Apr 29, 2013

Some users will have existing code that depends on this behavior. When users
upgrade pandas and their scripts no longer work, they are reasonably upset.
If it's possible to address legacy issues without breaking existing code, that's what we do.

What the default behavior should be (opt-in to new, or activate "legacy" mode) is a bike shed topic
we all can argue from both sides.

@ghost

ghost commented May 5, 2013

#3511 merged, unmangled will become the default in 0.12.

@jreback
Contributor

jreback commented Jul 10, 2013

@y-p I think this can be closed? (or wait to reverse the mangle default in 0.13?)

@ghost

ghost commented Jul 10, 2013

definitely, you nailed this one long ago.

This issue was closed.