The arguments to duplicated make more sense if you consider drop_duplicates, and realize that df.drop_duplicates(take_last) is equivalent to df[~df.duplicated(take_last)].
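A small sketch of that equivalence. It assumes a later pandas release in which the keep argument ('first' or 'last') has replaced take_last, so keep='last' stands in for take_last=True here; the frame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are identical, as are rows 1 and 3.
df = pd.DataFrame({"key": ["a", "b", "a", "b", "c"],
                   "val": [1, 2, 1, 2, 3]})

for keep in ("first", "last"):
    mask = df.duplicated(keep=keep)           # True marks the rows that would be dropped
    via_mask = df[~mask]                      # keep every row that is not marked
    via_drop = df.drop_duplicates(keep=keep)  # the same selection in one call
    assert via_mask.equals(via_drop)
```

Read this way, duplicated() returns the rows that drop_duplicates() would throw away, which is why the flag can look inverted when you inspect the mask directly.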
I agree that this is rather confusing. There is some symmetry here with the dropped duplicates always being the complement of the non-dropped duplicates, but I have no idea what "Take the last observed row in a row. Defaults to the first row" is supposed to mean. At the very least, it would be good to fix this docstring.
I don't think it's a good idea to entirely reverse the meaning of take_last here, because that would silently break existing code relying on the existing behavior. But it is an option to introduce a new parameter like you suggest and deprecate take_last.
It is just that the parameters sound a bit confusing to a newcomer.
If the parameter were linked to the action "coded" in the function's name, it would be easier to understand: duplicated() suggests (to me at least :) that it can SHOW duplicated items, so the parameter take, as in "show me", makes sense. Then take last would mean "show me the last <row(s)> of the duplicated records".
But in the case of drop_duplicates, the action DROP can suggest either what to KEEP or what to DROP.
And here take is confusing, as it is not clear what you take it for (to keep or to drop).
I guess ideally drop_duplicates needs a parameter like drop or keep (the first seems more obvious) taking the same values as duplicated but with a clearer meaning.
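To make the naming idea concrete, here is a plain-Python sketch of what a keep-style parameter could look like. The helper, its key argument, and the sample rows are all hypothetical and are not part of pandas; the point is only that keep names the row that survives, so there is no ambiguity about what take refers to.

```python
def drop_duplicates(rows, key, keep="first"):
    """Hypothetical helper: drop rows whose key repeats, keeping one occurrence."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    chosen = {}  # key value -> position of the occurrence that survives
    for pos, row in enumerate(rows):
        k = key(row)
        # For 'first', only the first sighting is recorded;
        # for 'last', each later sighting overwrites the earlier one.
        if keep == "last" or k not in chosen:
            chosen[k] = pos
    survivors = set(chosen.values())
    return [row for pos, row in enumerate(rows) if pos in survivors]


# Made-up records: the second field only labels which occurrence a row is.
rows = [("a", "a_1st"), ("b", "b_1st"), ("a", "a_2nd")]
kept_first = drop_duplicates(rows, key=lambda r: r[0], keep="first")
kept_last = drop_duplicates(rows, key=lambda r: r[0], keep="last")
```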
While trying to get a handle on duplicated records, I stumbled upon this, which led to the conclusion that .duplicated(take_last=True) seems to take the first of the duplicate rows and .duplicated(take_last=False) takes the last rows.
Here is an illustration:
Now with take_last=False it would be fair to expect dn_1s to be in the output, but this is not the case:
And take_last=True outputs the first rows:
Unless I am misunderstanding the doc:
it does feel that .duplicated() could be improved by fixing this behavior.
And additionally it would have one more parameter:
or alternatively a keyword parameter:
Right now, getting all observed duplicates requires applying the two conditions above, c1 | c2.
This was done with pandas 0.14.1.
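The c1 | c2 combination can be sketched as follows. This assumes a later pandas release where keep replaces take_last (keep='last' corresponding to take_last=True); the data is invented for illustration and is not the frame from the original report.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are duplicates of each other.
df = pd.DataFrame({"key": ["a", "b", "a", "c"], "val": [1, 2, 1, 3]})

c1 = df.duplicated(keep="first")  # marks every occurrence except the first
c2 = df.duplicated(keep="last")   # marks every occurrence except the last
all_dupes = df[c1 | c2]           # every row that has at least one twin
```

In releases that have keep, passing keep=False to duplicated() produces the same mask as c1 | c2 in a single call.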
Thank you.