
drop_duplicates destroys non-duplicated data under 0.17 #11376

Closed
RPGillespie6 opened this issue Oct 19, 2015 · 12 comments
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone
0.17.1
Comments

@RPGillespie6

The drop_duplicates() function in Python 3 is broken. Take the following example snippet:

import pandas as pd

raw_data = {'x': [7,6,3,3,4,8,0],'y': [0,6,5,5,9,1,2]}
df = pd.DataFrame(raw_data, columns = ['x', 'y'])

print("Before:", df)
df = df.drop_duplicates()
print("After:", df)

When run under Python 2 the results are correct, but under Python 3 pandas removes the row (6, 6) from the frame, which is completely unique. With large CSV files this function silently destroys thousands of lines of unique data.

See:
http://stackoverflow.com/questions/33224356/why-is-pandas-dropping-unique-rows

@jreback
Contributor

jreback commented Oct 19, 2015

pls pd.show_versions()

@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 19, 2015
@jreback jreback added this to the 0.17.1 milestone Oct 19, 2015
@jreback
Contributor

jreback commented Oct 19, 2015

cc @sinhrks

@oscar6echo

In the same area, and maybe related(?): I ran into a Python 2 case where the .duplicated() method applied to a DataFrame returned rows that were NOT duplicates. The example is too large to paste here, but I wanted to mention it in case it's the same bug under the hood.

@kawochen
Contributor

Able to reproduce on master (PY3).

@dsm054
Contributor

dsm054 commented Oct 20, 2015

I see the loss of (6, 6) under both 2.7.6 and 3.5.0 with 0.17.0; that is, I see no Python 2 vs. 3 difference in this example. (My notebook has a somewhat rare 32-bit Linux Python environment, so problems sometimes manifest differently.)

>>> df = pd.DataFrame([[1,0],[0,2]])
>>> df
   0  1
0  1  0
1  0  2
>>> df.duplicated()
0    False
1     True
dtype: bool
>>> df.astype(str).duplicated()
0    False
1    False
dtype: bool
>>> df.astype(np.float64).duplicated()
0    False
1    False
dtype: bool

@dsm054
Contributor

dsm054 commented Oct 20, 2015

I'm not convinced that the comment "# if we have integers we can directly index with these" before the int shortcut in f is correct. I'm not sure get_group_index plays nicely with the integers if they're not ranked by column. For example:

In [202]: get_group_index([np.array([1,0]), np.array([0,2])], np.array([2,2]), False, False)
Out[202]: array([2, 2], dtype=int64)

Anyway, one route to a quick fix would be to push everything along the factorize branch, IIUC.
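A hypothetical sketch of the label-folding that get_group_index performs (not pandas' actual implementation) shows why unranked integer labels collide: each column's codes are folded into one key by multiplying by the declared number of possible values per column, so a raw value outside [0, shape) aliases another row's key.

```python
import numpy as np

# Hypothetical fold: combine per-column integer codes into a single
# group key, column by column, assuming codes lie in [0, size).
def combined_key(labels, shape):
    out = np.zeros_like(labels[0])
    for codes, size in zip(labels, shape):
        out = out * size + codes
    return out

# Raw column values used as codes, with shape taken from the number
# of uniques (2 each) -- the value 2 falls outside [0, 2) and both
# rows collapse onto the same key, matching Out[202] above:
print(combined_key([np.array([1, 0]), np.array([0, 2])], [2, 2]))  # [2 2]

# Properly factorized codes in [0, n_uniques) stay distinct:
print(combined_key([np.array([1, 0]), np.array([0, 1])], [2, 2]))  # [2 1]
```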

@sinhrks
Member

sinhrks commented Oct 20, 2015

I haven't been able to track it down fully yet, but the part below looks like it breaks the factorized labels.

@behzadnouri Any idea?

@pag

pag commented Oct 21, 2015

Just a 'me too'. I'm running into this problem under Python 2.7.10 now that I've upgraded to pandas 0.17. I can also reproduce it using the example given in the stackoverflow post:

Before
   x  y
0  7  0
1  6  6
2  3  5
3  3  5
4  4  9
5  8  1
6  0  2

After:
   x  y
0  7  0
2  3  5
4  4  9
5  8  1
6  0  2

i.e. it dropped the duplicated (3, 5) row (correct) but also the (6, 6) row (incorrect).

@evanpw
Contributor

evanpw commented Oct 21, 2015

If all of the integers are non-negative, we can index directly with integers, but the shape should be the largest integer present + 1, rather than the number of unique integers.
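A minimal sketch of that point, using hypothetical folding arithmetic rather than pandas internals: with values [1, 0] and [0, 2] as codes, a shape of 2 (the count of uniques) lets the two rows collide, while a shape of max + 1 = 3 keeps them apart.

```python
import numpy as np

x, y = np.array([1, 0]), np.array([0, 2])

# shape = number of uniques (2): value 2 overflows and rows collide.
bad = x * 2 + y    # [2 2]

# shape = largest value + 1 (3): keys stay distinct.
good = x * 3 + y   # [3 2]
print(bad, good)
```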

@behzadnouri
Contributor

@jreback
@sinhrks

This is broken by #10917 (core/frame.py#L2997-L3002): factorize cannot be replaced by unique1d.

>>> a
array([ 5,  8, 11])
>>> unique1d(a)
array([ 5,  8, 11])
>>> factorize(a)[0]
array([0, 1, 2])
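The same distinction can be reproduced with pandas' public API (pd.unique standing in here for the internal unique1d): unique preserves the original values, while factorize returns dense codes in [0, n_uniques). Machinery that expects codes bounded by the number of uniques breaks when handed raw values like 11.

```python
import numpy as np
import pandas as pd

a = np.array([5, 8, 11])

# pd.unique preserves the original values...
print(pd.unique(a))            # [ 5  8 11]

# ...while pd.factorize returns dense codes in [0, n_uniques):
codes, uniques = pd.factorize(a)
print(codes)                   # [0 1 2]
```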

@RPGillespie6 RPGillespie6 changed the title drop_duplicates destroys non-duplicated data under Python 3 drop_duplicates destroys non-duplicated data under 0.17 Oct 21, 2015
@fridiculous

👍 on duplicated issues with 0.17.0.

I'm seeing a similar problem: the following should return an empty DataFrame, but instead returns:

df[df.duplicated(['charstring','number'], keep=False)]

index   charstring                              number
7989    E1FDF0E7-DFBD-428F-A6E7-5D48EAD3A559    435
7990    B3308C59-B9CF-42CB-A2C6-A406CA36EF2B    0
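For reference, keep=False marks every member of a duplicated group, so on a frame whose (charstring, number) pairs are all unique the mask should be all-False and the selection empty. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical data: every (charstring, number) pair is unique.
df = pd.DataFrame({'charstring': ['a', 'b'], 'number': [1, 2]})

# keep=False keeps only rows that belong to a duplicated group,
# so with all-unique pairs the result is empty.
out = df[df.duplicated(['charstring', 'number'], keep=False)]
print(len(out))  # 0
```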

@jreback
Contributor

jreback commented Oct 25, 2015

closed by #11403
