Behavior of pandas.DataFrame.duplicated() #8505

Closed
wikiped opened this issue Oct 8, 2014 · 2 comments · Fixed by #10236
Labels: Bug; Indexing (related to indexing on series/frames, not to indexes themselves); Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

wikiped commented Oct 8, 2014

While trying to get a handle on duplicated records, I stumbled upon behavior that led me to conclude that .duplicated(take_last=True) seems to take the first of the duplicate rows and .duplicated(take_last=False) takes the last rows.
Here is an illustration:

import pandas as pd
data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2, 2],
        'key2': [2, 2, 1, 2, 2, 2, 2, 2, 2],
        'dup': ['d1_1', 'd2_1', 'n_d', 'd1_2', 'd2_2', 'n_d', 'd2_3', 'd2_4', 'd2_5']}
df = pd.DataFrame(data, columns=['key1', 'key2', 'dup'])
print(df)
   key1  key2   dup
0     1     2  d1_1
1     2     2  d2_1
2     3     1   n_d
3     1     2  d1_2
4     2     2  d2_2
5     3     2   n_d
6     2     2  d2_3
7     2     2  d2_4
8     2     2  d2_5

Now with take_last=False it would be fair to expect the dN_1 rows (d1_1 and d2_1) to be in the output, but this is not the case:

c1 = df.duplicated(['key1', 'key2'], take_last=False)
df[c1]
   key1  key2   dup
3     1     2  d1_2
4     2     2  d2_2
6     2     2  d2_3
7     2     2  d2_4
8     2     2  d2_5

And take_last=True outputs the first rows:

c2 = df.duplicated(['key1', 'key2'], take_last=True)
df[c2]
   key1  key2   dup
0     1     2  d1_1
1     2     2  d2_1
4     2     2  d2_2
6     2     2  d2_3
7     2     2  d2_4

Unless I am misunderstanding the doc:

take_last : boolean, default False
    Take the last observed row in a row. Defaults to the first row

it does feel that .duplicated() could be improved by fixing this behavior.
Additionally, it could gain one more parameter:

take_all : boolean, default False
    Take all observed rows. Overrides take_last

or alternatively a keyword parameter:

take : 'last', 'first', 'all', default 'last'
    Sets which observed duplicated rows to take. Default: take last observed rows.

Right now, getting all observed duplicates requires combining the two conditions above: c1 | c2.
This was done with pandas 0.14.1.
Thank you.
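
For readers on a newer pandas, the c1 | c2 workaround above can be compared against a single call. This is only a sketch, and it assumes a pandas version in which duplicated() accepts a keep argument ('first', 'last', or False) in place of take_last; that argument is not part of the 0.14.1 API discussed in this report.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2, 2],
        'key2': [2, 2, 1, 2, 2, 2, 2, 2, 2],
        'dup': ['d1_1', 'd2_1', 'n_d', 'd1_2', 'd2_2', 'n_d', 'd2_3', 'd2_4', 'd2_5']}
df = pd.DataFrame(data, columns=['key1', 'key2', 'dup'])
keys = ['key1', 'key2']

# Assumption: `keep` is available (it replaces the older `take_last` flag).
c1 = df.duplicated(keys, keep='first')   # marks every repeat except the first occurrence
c2 = df.duplicated(keys, keep='last')    # marks every repeat except the last occurrence
all_dups = df.duplicated(keys, keep=False)  # marks every row that belongs to a duplicate group

# The union of the two one-sided masks is the same as the keep=False mask.
assert (c1 | c2).equals(all_dups)

all_dup_labels = df.loc[all_dups, 'dup'].tolist()
print(all_dup_labels)
```

The two n_d rows are absent from the result because their (key1, key2) pairs, (3, 1) and (3, 2), differ.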

shoyer (Member) commented Oct 8, 2014

The arguments to duplicated make more sense if you consider drop_duplicates and realize that df.drop_duplicates(take_last) is equivalent to df[~df.duplicated(take_last)].

I agree that this is rather confusing. There is some symmetry here with the dropped duplicates always being the complement of the non-dropped duplicates, but I have no idea what "Take the last observed row in a row. Defaults to the first row" is supposed to mean. At the very least, it would be good to fix this docstring.

I don't think it's a good idea to entirely reverse the meaning of take_last here, because that would silently break existing code relying on the existing behavior. But it is an option to introduce a new parameter like you suggest and deprecate take_last.
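
The equivalence between drop_duplicates and the negated duplicated mask described in this comment can be checked directly. The sketch below assumes a pandas version where both methods take keep rather than take_last; the variable names are illustrative only.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2, 2],
        'key2': [2, 2, 1, 2, 2, 2, 2, 2, 2],
        'dup': ['d1_1', 'd2_1', 'n_d', 'd1_2', 'd2_2', 'n_d', 'd2_3', 'd2_4', 'd2_5']}
df = pd.DataFrame(data, columns=['key1', 'key2', 'dup'])
keys = ['key1', 'key2']

kept = {}
for keep in ('first', 'last', False):
    # Assumption: `keep` is the modern spelling of the old `take_last` flag.
    via_method = df.drop_duplicates(subset=keys, keep=keep)
    via_mask = df[~df.duplicated(subset=keys, keep=keep)]
    # The rows that survive drop_duplicates are exactly the rows NOT marked by duplicated.
    assert via_method.equals(via_mask)
    kept[keep] = via_method['dup'].tolist()

print(kept['first'])  # rows kept when the first occurrence survives
print(kept['last'])   # rows kept when the last occurrence survives
```

Seen this way, the flag names the occurrence that survives the drop, which is why duplicated() marks all the other occurrences.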

wikiped (Author) commented Oct 8, 2014

It does look like the existing underlying logic is correct, meaning that:

full_df = df_without_duplicates + df_with_dropped_duplicates

It is just that the parameter names sound a bit confusing to a newcomer.
If the parameter were linked to the action "coded" in the name of the function, it would be easier to understand:
duplicated() suggests (to me at least :) that it can SHOW duplicated items, so a parameter named take, as in "show me", makes sense. Then take last would mean "show me the last <row(s)> of the duplicated records".

But in the case of drop_duplicates, the action DROP can suggest either what to KEEP or what to DROP.
Here take is confusing because it is not clear what you take the rows for (to keep or to drop).
I guess ideally drop_duplicates needs a parameter like drop or keep (the first seems more obvious), taking the same values as in duplicated but with a clearer meaning.

df.duplicated(take=first) == df.drop_duplicates(drop=last)
True

And in no way did I think of replacing the existing parameters, but rather of deprecating them, as you suggested.
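
The "show me" reading of take described in this comment can be sketched as a small helper. To be clear, take_duplicates and its take argument are hypothetical names invented for illustration; they are not pandas API. The sketch also assumes a pandas version where duplicated() accepts keep.

```python
import pandas as pd

def take_duplicates(df, subset, take='last'):
    """Hypothetical helper: SHOW the first, last, or all rows of each duplicate group."""
    # Rows that belong to any group with more than one member.
    in_dup_group = df.duplicated(subset, keep=False)
    if take == 'all':
        return df[in_dup_group]
    if take in ('first', 'last'):
        # duplicated(keep=take) marks every member of a group except the `take`
        # occurrence, so negating it inside the group isolates that occurrence.
        return df[in_dup_group & ~df.duplicated(subset, keep=take)]
    raise ValueError("take must be 'first', 'last', or 'all'")

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2, 2],
        'key2': [2, 2, 1, 2, 2, 2, 2, 2, 2],
        'dup': ['d1_1', 'd2_1', 'n_d', 'd1_2', 'd2_2', 'n_d', 'd2_3', 'd2_4', 'd2_5']}
df = pd.DataFrame(data, columns=['key1', 'key2', 'dup'])
keys = ['key1', 'key2']

first_rows = take_duplicates(df, keys, take='first')['dup'].tolist()
last_rows = take_duplicates(df, keys, take='last')['dup'].tolist()
all_rows = take_duplicates(df, keys, take='all')['dup'].tolist()
print(first_rows, last_rows)
```

With this reading, take='first' returns the dN_1 rows that the original report expected to see.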

jreback added the Bug, Indexing, and Reshaping labels Aug 8, 2015
jreback added this to the 0.17.0 milestone Aug 8, 2015