Behavior of pandas.DataFrame.duplicated() #8505

wikiped · 2014-10-08T06:34:28Z

While trying to get a handle on duplicated records, I stumbled upon behavior which led me to conclude that .duplicated(take_last=True) seems to take the first of the duplicate rows, while .duplicated(take_last=False) takes the last rows.
Here is an illustration:

import pandas as pd
data = { 'key1':[1,2,3,1,2,3,2,2,2],
      'key2':[2,2,1,2,2,2,2,2,2],
      'dup':['d1_1','d2_1', 'n_d','d1_2','d2_2', 'n_d','d2_3','d2_4','d2_5']}
df = pd.DataFrame(data,columns=['key1','key2','dup'])
print(df)
   key1  key2   dup
0     1     2  d1_1
1     2     2  d2_1
2     3     1   n_d
3     1     2  d1_2
4     2     2  d2_2
5     3     2   n_d
6     2     2  d2_3
7     2     2  d2_4
8     2     2  d2_5

Now with take_last=False it would be fair to expect the first row of each duplicate group (d1_1 and d2_1) to be in the output, but this is not the case:

c1 = df.duplicated(['key1', 'key2'], take_last=False)
df[c1]
   key1  key2   dup
3     1     2  d1_2
4     2     2  d2_2
6     2     2  d2_3
7     2     2  d2_4
8     2     2  d2_5

And take_last=True outputs the first rows:

c2 = df.duplicated(['key1', 'key2'], take_last=True)
df[c2]
   key1  key2   dup
0     1     2  d1_1
1     2     2  d2_1
4     2     2  d2_2
6     2     2  d2_3
7     2     2  d2_4

Unless I am misunderstanding the doc:

take_last : boolean, default False
    Take the last observed row in a row. Defaults to the first row

it does feel that .duplicated() could be improved by fixing this behavior.
Additionally, it could gain one more parameter:

take_all : boolean, default False
    Take all observed rows. Overrides take_last

or alternatively a keyword parameter:

take : 'last', 'first', 'all', default 'last'
    Sets which observed duplicated rows to take. Default: take last observed rows.

Right now, getting all observed duplicates requires combining the two conditions above: c1 | c2.
This was done with pandas 0.14.1.
Thank you.
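
For reference, the c1 | c2 workaround above can be sketched with the keep keyword that the linked change #10236 introduced (milestone 0.17.0). This is a minimal sketch and assumes a pandas version that accepts keep='first', keep='last' and keep=False in place of take_last; it will not run on 0.14.1.

```python
import pandas as pd

df = pd.DataFrame({
    'key1': [1, 2, 3, 1, 2, 3, 2, 2, 2],
    'key2': [2, 2, 1, 2, 2, 2, 2, 2, 2],
    'dup': ['d1_1', 'd2_1', 'n_d', 'd1_2', 'd2_2', 'n_d', 'd2_3', 'd2_4', 'd2_5'],
})
keys = ['key1', 'key2']

# keep='first': the first row of each group is NOT flagged (old take_last=False)
c1 = df.duplicated(subset=keys, keep='first')
# keep='last': the last row of each group is NOT flagged (old take_last=True)
c2 = df.duplicated(subset=keys, keep='last')
# keep=False: every row that belongs to a duplicate group is flagged
all_dups = df.duplicated(subset=keys, keep=False)

# The single keep=False call selects the same rows as the c1 | c2 workaround
print(all_dups.equals(c1 | c2))
print(df[all_dups])
```

The keep=False mask selects rows 0, 1, 3, 4, 6, 7 and 8 here, which is the "all observed duplicates" set requested above.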

shoyer · 2014-10-08T08:00:12Z

The arguments to duplicated make more sense if you consider drop_duplicates and realize that df.drop_duplicates(take_last) is equivalent to df[~df.duplicated(take_last)].
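
A minimal sketch of that equivalence, written with the later keep keyword (from #10236) rather than take_last, so it assumes a pandas version that has keep. The sample frame is the one from the original report.

```python
import pandas as pd

df = pd.DataFrame({
    'key1': [1, 2, 3, 1, 2, 3, 2, 2, 2],
    'key2': [2, 2, 1, 2, 2, 2, 2, 2, 2],
    'dup': ['d1_1', 'd2_1', 'n_d', 'd1_2', 'd2_2', 'n_d', 'd2_3', 'd2_4', 'd2_5'],
})
keys = ['key1', 'key2']

kept_index = {}
for keep in ('first', 'last'):
    # Rows that survive drop_duplicates ...
    via_method = df.drop_duplicates(subset=keys, keep=keep)
    # ... are exactly the rows that duplicated() does NOT flag
    via_mask = df[~df.duplicated(subset=keys, keep=keep)]
    assert via_method.equals(via_mask)
    kept_index[keep] = via_method.index.tolist()

print(kept_index)
# {'first': [0, 1, 2, 5], 'last': [2, 3, 5, 8]}
```

Read this way, the flag names what drop_duplicates keeps, and duplicated() marks everything else.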

I agree that this is rather confusing. There is some symmetry here with the dropped duplicates always being the complement of the non-dropped duplicates, but I have no idea what "Take the last observed row in a row. Defaults to the first row" is supposed to mean. At the very least, it would be good to fix this docstring.

I don't think it's a good idea to entirely reverse the meaning of take_last here, because that would silently break existing code relying on the current behavior. But it is an option to introduce a new parameter, as you suggest, and deprecate take_last.

wikiped · 2014-10-08T10:40:21Z

It does look like the existing underlying logic is correct, meaning that:

full_df = df_without_duplicates + df_with_dropped_duplicates
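
The "+" above is shorthand for combining rows, not for arithmetic. A rough sketch of the same check, assuming the later keep keyword (from #10236) and using pd.concat to stand in for the "+":

```python
import pandas as pd

df = pd.DataFrame({
    'key1': [1, 2, 3, 1, 2, 3, 2, 2, 2],
    'key2': [2, 2, 1, 2, 2, 2, 2, 2, 2],
    'dup': ['d1_1', 'd2_1', 'n_d', 'd1_2', 'd2_2', 'n_d', 'd2_3', 'd2_4', 'd2_5'],
})
keys = ['key1', 'key2']

mask = df.duplicated(subset=keys, keep='first')
df_without_duplicates = df[~mask]       # what drop_duplicates would return
df_with_dropped_duplicates = df[mask]   # the rows it would discard

# Stacking the two pieces and restoring row order gives back the full frame
rebuilt = pd.concat([df_without_duplicates, df_with_dropped_duplicates]).sort_index()

print(len(df_without_duplicates), len(df_with_dropped_duplicates), len(rebuilt))
# 4 5 9
```

Every row lands in exactly one of the two pieces, whichever keep value is chosen.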

It is just that the parameter names sound a bit confusing to a newcomer.
If the parameter were linked to the action "coded" in the name of the function, it would be easier to understand:
duplicated() suggests (to me at least :) that it can SHOW duplicated items, so a parameter named take, as in "show me", makes sense. Then take last would mean "show me the last <row(s)> of the duplicated records".

But in the case of drop_duplicates, the action DROP can suggest either what to KEEP or what to DROP.
Here take is confusing, because it is not clear what you take the rows for (to keep them or to drop them).
I guess ideally drop_duplicates needs a parameter like drop or keep (the first seems more obvious), taking the same values as duplicated but with a clearer meaning.

df.duplicated(take=first) == df.drop_duplicates(drop=last)
True

And in no way did I think of replacing the existing parameters, but rather of deprecating them, as you suggested.

shoyer mentioned this issue Nov 23, 2014

pandas.DataFrame.duplicated to allow take_all #6511

Closed

shoyer mentioned this issue Jan 7, 2015

DOC: Edited doc string of pandas/core/frame.duplicated(). Redefined take... #9203

Closed

sinhrks mentioned this issue May 30, 2015

ENH: duplicated and drop_duplicates now accept keep kw #10236

Merged

jreback added the Bug, Indexing, and Reshaping labels Aug 8, 2015

jreback added this to the 0.17.0 milestone Aug 8, 2015

sinhrks closed this as completed in #10236 Aug 8, 2015
