# How do I find and remove duplicate rows in pandas?

In [15]:
import pandas as pd

In [16]:
# Column names for the file, which has no header row.
user_cols = ["user_id", "age", "gender", "occupation", "ZIP_Code"]
users = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols, index_col="user_id")
# Alternative to index_col: users.set_index('user_id', inplace=True)
users.head()

| user_id | age | gender | occupation | ZIP_Code |
|---|---|---|---|---|
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |


In [18]:
users.shape

(943, 4)

###### Checking for duplicate ZIP codes in the data

In [19]:
users.ZIP_Code.duplicated().head()

user_id
1    False
2    False
3    False
4    False
5    False
Name: ZIP_Code, dtype: bool

In [20]:
type(users.ZIP_Code.duplicated())

pandas.core.series.Series

The `duplicated()` method returns a Series of boolean values. A value is marked `True` if the same ZIP code has already appeared at an earlier position in the Series.
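To see this logic on something small enough to check by eye, here is a minimal sketch with a made-up Series (the ZIP codes and the `zips` name are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical ZIP codes: "85711" appears twice and "94043" three times.
zips = pd.Series(
    ["85711", "94043", "85711", "32067", "94043", "94043"],
    name="ZIP_Code",
)

# True wherever the value has already been seen at an earlier position.
is_dup = zips.duplicated()
print(is_dup.tolist())  # [False, False, True, False, True, True]
```

The first occurrence of each ZIP code stays `False`; every later repeat is `True`.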

###### Counting Duplicates

In [21]:
users.ZIP_Code.duplicated().sum()

148

In the boolean Series above, `True` is treated as 1 and `False` as 0, so calling `sum()` on it gives the number of duplicates.
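As a quick sanity check on a made-up Series (values are illustrative), the count from `sum()` should match the number of values minus the number of distinct values. This sketch assumes there are no missing values, since `nunique()` ignores them by default:

```python
import pandas as pd

# Hypothetical ZIP codes with three repeats in total.
zips = pd.Series(["85711", "94043", "85711", "32067", "94043", "94043"])

# Each True counts as 1, each False as 0.
n_dups = zips.duplicated().sum()

# Cross-check: total values minus distinct values (valid here because no NaN).
n_dups_alt = len(zips) - zips.nunique()

print(n_dups, n_dups_alt)  # 3 3
```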

So far, `duplicated()` has been applied to a Series. It can also be applied to a DataFrame and follows the same logic: a row is marked as a duplicate if it is identical to a row that appeared earlier.

In [22]:
users.duplicated().sum()

7
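A minimal sketch of the row-level check on a small made-up DataFrame (the values and the `people` name are illustrative). Only the column values are compared; the index, here `user_id`, is not part of the comparison:

```python
import pandas as pd

# Hypothetical data: row 3 repeats row 1 in every column; row 4 differs in ZIP_Code.
people = pd.DataFrame(
    {
        "age": [24, 53, 24, 24],
        "gender": ["M", "F", "M", "M"],
        "ZIP_Code": ["85711", "94043", "85711", "43537"],
    },
    index=pd.Index([1, 2, 3, 4], name="user_id"),
)

# A row is True only if all of its column values match an earlier row.
row_dups = people.duplicated()
print(row_dups.tolist())  # [False, False, True, False]
print(row_dups.sum())     # 1
```

This is why `user_id` is kept as the index in this walkthrough: if it were an ordinary column, every row would have a distinct `user_id` and no row would count as a duplicate.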

##### Checking duplicated rows

In [23]:
users.loc[users.duplicated(),:]

| user_id | age | gender | occupation | ZIP_Code |
|---|---|---|---|---|
| 496 | 21 | F | student | 55414 |
| 572 | 51 | M | educator | 20003 |
| 621 | 17 | M | student | 60402 |
| 684 | 28 | M | student | 55414 |
| 733 | 44 | F | other | 60630 |
| 805 | 27 | F | other | 20009 |
| 890 | 32 | M | student | 97301 |


These are the 7 duplicated rows. `loc` is used to filter the DataFrame with the boolean Series: the first argument selects only the rows where the condition is `True`, and the second argument (`:`) selects all columns.
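The two arguments to `loc` can be varied independently. The sketch below uses a made-up DataFrame (the values and the `people` name are illustrative) to show the same row filter with all columns and with a chosen subset of columns:

```python
import pandas as pd

# Hypothetical data: the row with user_id 3 repeats the row with user_id 1.
people = pd.DataFrame(
    {
        "age": [24, 53, 24],
        "gender": ["M", "F", "M"],
        "ZIP_Code": ["85711", "94043", "85711"],
    },
    index=pd.Index([1, 2, 3], name="user_id"),
)

mask = people.duplicated()

# First argument: rows where the mask is True. Second argument: all columns.
dup_rows = people.loc[mask, :]

# Same rows, but only two of the columns.
dup_rows_some_cols = people.loc[mask, ["age", "ZIP_Code"]]

print(dup_rows)
print(dup_rows_some_cols)
```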

##### How does duplicated() work?

As mentioned earlier, `duplicated()` marks a value as a duplicate (`True`) if the same value has already appeared earlier. This behavior can be changed with the `keep` parameter, so that the earlier occurrences are marked instead, or so that all occurrences are marked. The examples below show each option.

In [24]:
users.loc[users.duplicated(keep='first'),:]

| user_id | age | gender | occupation | ZIP_Code |
|---|---|---|---|---|
| 496 | 21 | F | student | 55414 |
| 572 | 51 | M | educator | 20003 |
| 621 | 17 | M | student | 60402 |
| 684 | 28 | M | student | 55414 |
| 733 | 44 | F | other | 60630 |
| 805 | 27 | F | other | 20009 |
| 890 | 32 | M | student | 97301 |


In [25]:
users.loc[users.duplicated(keep='first'),:].count()

age           7
gender        7
occupation    7
ZIP_Code      7
dtype: int64

The default value of `keep` is `'first'`, which marks every duplicate as `True` except the first occurrence.

In [26]:
users.loc[users.duplicated(keep='last'),:]

| user_id | age | gender | occupation | ZIP_Code |
|---|---|---|---|---|
| 67 | 17 | M | student | 60402 |
| 85 | 51 | M | educator | 20003 |
| 198 | 21 | F | student | 55414 |
| 350 | 32 | M | student | 97301 |
| 428 | 28 | M | student | 55414 |
| 437 | 27 | F | other | 20009 |
| 460 | 44 | F | other | 60630 |


In [27]:
users.loc[users.duplicated(keep='last'),:].count()

age           7
gender        7
occupation    7
ZIP_Code      7
dtype: int64

With `keep='last'`, the earlier occurrences are marked as duplicates and the last one is kept. Note that the index values differ from the previous result.

In [28]:
users.loc[users.duplicated(keep=False),:]

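| user_id | age | gender | occupation | ZIP_Code |
|---|---|---|---|---|
| 67 | 17 | M | student | 60402 |
| 85 | 51 | M | educator | 20003 |
| 198 | 21 | F | student | 55414 |
| 350 | 32 | M | student | 97301 |
| 428 | 28 | M | student | 55414 |
| 437 | 27 | F | other | 20009 |
| 460 | 44 | F | other | 60630 |
| 496 | 21 | F | student | 55414 |
| 572 | 51 | M | educator | 20003 |
| 621 | 17 | M | student | 60402 |
| 684 | 28 | M | student | 55414 |
| 733 | 44 | F | other | 60630 |
| 805 | 27 | F | other | 20009 |
| 890 | 32 | M | student | 97301 |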


In [29]:
users.loc[users.duplicated(keep=False),:].count()

age           14
gender        14
occupation    14
ZIP_Code      14
dtype: int64

Using `keep=False` marks every occurrence of a duplicated row as a duplicate, including the first.
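The three `keep` options are easier to compare side by side. This is a minimal sketch on a made-up DataFrame (the values and the `df` name are illustrative) in which one row appears three times:

```python
import pandas as pd

# Hypothetical data: rows 0, 2 and 4 are identical.
df = pd.DataFrame(
    {
        "age": [21, 51, 21, 17, 21],
        "occupation": ["student", "educator", "student", "student", "student"],
    }
)

# keep='first' (default): every repeat after the first occurrence is True.
first = df.duplicated(keep="first").tolist()

# keep='last': every occurrence before the last one is True.
last = df.duplicated(keep="last").tolist()

# keep=False: every occurrence of a repeated row is True.
every = df.duplicated(keep=False).tolist()

print(first)  # [False, False, True, False, True]
print(last)   # [True, False, True, False, False]
print(every)  # [True, False, True, False, True]
```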

##### Drop duplicates

In [30]:
users.drop_duplicates(keep='first').shape

(936, 4)

In [31]:
users.drop_duplicates(keep='last').shape

(936, 4)

In [32]:
users.drop_duplicates(keep=False).shape

(929, 4)
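The shapes above follow from the earlier counts: 943 - 7 = 936 rows for `keep='first'` and `keep='last'`, and 943 - 14 = 929 rows for `keep=False`. The sketch below checks the same relationship on a made-up DataFrame (the values and names are illustrative). Note that `drop_duplicates()` returns a new DataFrame, so the result has to be assigned to keep it:

```python
import pandas as pd

# Hypothetical data: rows 0, 2 and 4 are identical.
df = pd.DataFrame(
    {
        "age": [21, 51, 21, 17, 21],
        "occupation": ["student", "educator", "student", "student", "student"],
    }
)

kept_first = df.drop_duplicates(keep="first")  # keeps rows 0, 1, 3
kept_last = df.drop_duplicates(keep="last")    # keeps rows 1, 3, 4
kept_none = df.drop_duplicates(keep=False)     # keeps rows 1, 3

# Rows remaining = total rows minus rows flagged by duplicated() with the same keep.
assert len(kept_first) == len(df) - df.duplicated(keep="first").sum()
assert len(kept_none) == len(df) - df.duplicated(keep=False).sum()

# The original DataFrame is unchanged.
print(df.shape, kept_first.shape, kept_last.shape, kept_none.shape)
```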

###### Considering only specific columns when identifying duplicates

In [34]:
users.duplicated(subset=['age','ZIP_Code']).sum()

16

In [35]:
users.drop_duplicates(subset=['age','ZIP_Code']).shape

(927, 4)
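With `subset`, two rows count as duplicates when they agree on the listed columns, even if they differ elsewhere. A minimal sketch on a made-up DataFrame (the values and the `people` name are illustrative):

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 share age and ZIP_Code but differ in gender.
people = pd.DataFrame(
    {
        "age": [24, 24, 33, 24],
        "gender": ["M", "F", "F", "M"],
        "ZIP_Code": ["85711", "85711", "15213", "43537"],
    }
)

# Comparing whole rows: nothing is duplicated.
full_row = people.duplicated().tolist()

# Comparing only age and ZIP_Code: row 1 repeats row 0.
by_subset = people.duplicated(subset=["age", "ZIP_Code"]).tolist()

# Dropping on the subset keeps the first matching row, with all of its columns.
deduped = people.drop_duplicates(subset=["age", "ZIP_Code"])

print(full_row)   # [False, False, False, False]
print(by_subset)  # [False, True, False, False]
print(deduped)
```

The dropped row's other columns (here `gender`) are lost, so it is worth checking that the subset really identifies records that should be treated as the same.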