DataFrame .duplicated() / .drop_duplicates() flagging unique rows as duplicated in 0.17.1 #11864

capelastegui · 2015-12-18T19:30:22Z

DataFrame.duplicated() flags rows as duplicates when they are in fact distinct. This happens with large DataFrames when calling duplicated(keep=False):

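import pandas as pd, numpy as np
# The three Series have different lengths; pandas aligns them on the index and pads with NaN.
# list(...) keeps the repetition working on Python 3, where range() is not a list.
df = pd.DataFrame({'a': pd.Series(range(1,100000)),
                   'b': pd.Series(range(10,1000000)),
                   'c': pd.Series(3*list(range(2,200000,2)))})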
df.head()

np.sum(df.duplicated())

Out[]: 0

np.sum(df.duplicated(keep=False))

Out[]: 110

Changing column order results in different (but still incorrect) behavior.

np.sum(df[['c','b','a']].duplicated(keep=False))

Out[]: 2138
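
The two counts above contradict each other: a row flagged with keep=False must also be flagged with keep='first' or keep='last', so a count of 0 for the default mode should rule out most nonzero keep=False counts. A minimal sketch of that consistency check follows. The helper name keep_modes_consistent is hypothetical, and the small frame is only a stand-in for the large one in this report.

```python
import pandas as pd

def keep_modes_consistent(df):
    """Return True if keep=False flags exactly the rows flagged by
    keep='first' or keep='last' (hypothetical helper, not a pandas API)."""
    first = df.duplicated(keep='first')
    last = df.duplicated(keep='last')
    both = df.duplicated(keep=False)
    return bool((both == (first | last)).all())

# 17 distinct rows: 16 constant rows plus one row [1, 0, ..., 0].
df = pd.DataFrame([[i] * 9 for i in range(16)] + [[1] + [0] * 8])

# On a pandas version without the bug, this check holds.
print(keep_modes_consistent(df))
```

On an affected version the check would be expected to fail for frames that trigger the bug, which makes it a cheap sanity test before trusting drop_duplicates(keep=False).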

Tested on 0.17.1. Environment details are provided below:
>> pd.util.print_versions.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.17.1
nose: None
pip: 7.1.2
setuptools: 18.5
Cython: None
numpy: 1.10.1
scipy: 0.16.0
statsmodels: None
IPython: 4.0.0
sphinx: 1.3.3
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext)
Jinja2: None

This looks like the same kind of problem described in #11668, though the specific examples provided in that issue work properly in 0.17.1.

jreback · 2015-12-18T19:35:37Z

cc @behzadnouri
cc @evanpw

evanpw · 2015-12-23T18:10:32Z

64-bit integers are getting truncated to 32 bits in duplicated_int64. This is a simpler DataFrame that exhibits the same problem:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([i] * 9 for i in range(16))

In [3]: df = df.append([[1] + [0] * 8], ignore_index=True)

In [4]: df
Out[4]:
     0   1   2   3   4   5   6   7   8
0    0   0   0   0   0   0   0   0   0
1    1   1   1   1   1   1   1   1   1
2    2   2   2   2   2   2   2   2   2
3    3   3   3   3   3   3   3   3   3
4    4   4   4   4   4   4   4   4   4
5    5   5   5   5   5   5   5   5   5
6    6   6   6   6   6   6   6   6   6
7    7   7   7   7   7   7   7   7   7
8    8   8   8   8   8   8   8   8   8
9    9   9   9   9   9   9   9   9   9
10  10  10  10  10  10  10  10  10  10
11  11  11  11  11  11  11  11  11  11
12  12  12  12  12  12  12  12  12  12
13  13  13  13  13  13  13  13  13  13
14  14  14  14  14  14  14  14  14  14
15  15  15  15  15  15  15  15  15  15
16   1   0   0   0   0   0   0   0   0

I've got an easy fix (just change a type from int to int64_t). I'll do a PR tonight.
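
To illustrate why a 32-bit truncation produces spurious matches on the DataFrame above, here is a minimal NumPy sketch. It assumes each row is packed into a single mixed-radix integer key (one base-16 digit per column). The helper pack_row_keys is hypothetical and is not pandas' actual implementation; it only shows the arithmetic.

```python
import numpy as np

def pack_row_keys(codes, radix):
    """Pack each row of small integer codes into one int64 key
    (hypothetical helper; mixed-radix encoding, one digit per column)."""
    keys = np.zeros(codes.shape[0], dtype=np.int64)
    for col in range(codes.shape[1]):
        keys = keys * radix + codes[:, col]
    return keys

# Same data as the DataFrame above: 16 constant rows plus [1, 0, ..., 0].
rows = [[i] * 9 for i in range(16)] + [[1] + [0] * 8]
codes = np.array(rows, dtype=np.int64)

keys64 = pack_row_keys(codes, radix=16)
keys32 = keys64.astype(np.int32)  # keeps only the low 32 bits, like a C int

print(len(np.unique(keys64)))  # 17: every row has its own key
print(len(np.unique(keys32)))  # 16: two rows now share a key
print(keys64[16], keys32[16], keys32[0])  # 4294967296 0 0
```

Row 16 packs to 16**8 = 2**32, which truncates to 0 and collides with the all-zero row 0. Nine columns of 16 values each need 36 bits, so the keys only stay distinct when they are stored as int64.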

jreback · 2016-01-07T15:15:15Z

closed by #11894

jreback added the Bug, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), and Difficulty Intermediate labels Dec 18, 2015

jreback added this to the 0.18.0 milestone Dec 18, 2015

evanpw mentioned this issue Dec 24, 2015

BUG: Spurious matches in DataFrame.duplicated when keep=False #11894

Closed

jreback pushed a commit that referenced this issue Jan 7, 2016

BUG: Spurious matches in DataFrame.duplicated when keep=False, #11864

b431f85

jreback closed this as completed Jan 7, 2016

drewhouston mentioned this issue Jun 10, 2019

duplicated() (and drop_duplicates()) incorrectly identifying unique/different rows as duplicates #26762

Closed
