# Remove Duplicates in Pandas

Pandas makes it easy to find and remove duplicate rows. You can check whether duplicates exist, count them, display them, and drop them, comparing either whole rows or only selected columns.

In [1]:
import pandas as pd

### Create some duplicate data

In [13]:
data = {'first': ['James', 'Jane', 'Adam', 'Tom', 'Tom', 'Tom'], 
        'last': ['Smith', 'Watson', 'Miller', 'Thompson', 'Piper', 'Piper'], 
        'age': [18, 18, 19, 19, 20, 20], 
        'height_cm': [75.7, 163, 176.5, 163.3, 168.5, 168.5],
        'weight_kg': [66.9, 56.7, 68.9, 58, 57.5, 57.5],
        'income':['1,000', 800, 350, 980, '2,500', '2,500']}

# Create a DataFrame, listing the columns explicitly to fix their order
df = pd.DataFrame(data, columns=['first', 'last', 'age', 'height_cm', 'weight_kg', 'income'])

df

Unnamed: 0,first,last,age,height_cm,weight_kg,income
0,James,Smith,18,75.7,66.9,1000
1,Jane,Watson,18,163.0,56.7,800
2,Adam,Miller,19,176.5,68.9,350
3,Tom,Thompson,19,163.3,58.0,980
4,Tom,Piper,20,168.5,57.5,2500
5,Tom,Piper,20,168.5,57.5,2500


### Check for duplicates

In [17]:
df.duplicated().any()

# Or, if the frame is small and you want to see which rows are flagged,
# use df.duplicated() on its own: it returns a boolean Series with one entry per row

True
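To make the check above more concrete, here is a minimal sketch of what `df.duplicated()` returns before `.any()` collapses it to a single value. It rebuilds the same frame from the data above; the names `mask` and `has_duplicates` are illustrative only.

```python
import pandas as pd

data = {'first': ['James', 'Jane', 'Adam', 'Tom', 'Tom', 'Tom'],
        'last': ['Smith', 'Watson', 'Miller', 'Thompson', 'Piper', 'Piper'],
        'age': [18, 18, 19, 19, 20, 20],
        'height_cm': [75.7, 163, 176.5, 163.3, 168.5, 168.5],
        'weight_kg': [66.9, 56.7, 68.9, 58, 57.5, 57.5],
        'income': ['1,000', 800, 350, 980, '2,500', '2,500']}
df = pd.DataFrame(data)

# One boolean per row: True marks a row that repeats an earlier row exactly
mask = df.duplicated()
print(mask.tolist())  # [False, False, False, False, False, True]

# .any() reduces the mask to a single yes/no answer
has_duplicates = bool(mask.any())
print(has_duplicates)  # True
```

Only the second Tom Piper row is flagged, because by default the first occurrence of a repeated row is not counted as a duplicate.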

### Count number of duplicate rows

In [24]:
df.duplicated().sum()

1

### Count duplicates in a particular column

In [25]:
df['first'].duplicated().sum()

2
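The count of 2 says how many entries repeat an earlier value, but not which values they are. One way to see that, sketched here with `value_counts()` on the same `first` column (the names `counts` and `repeated` are illustrative):

```python
import pandas as pd

first = pd.Series(['James', 'Jane', 'Adam', 'Tom', 'Tom', 'Tom'], name='first')

# Number of entries that repeat an earlier value (matches the count above)
print(first.duplicated().sum())  # 2

# How often each value occurs, filtered to values that occur more than once
counts = first.value_counts()
repeated = counts[counts > 1]
print(repeated.to_dict())  # {'Tom': 3}
```

Tom appears three times, so two of those entries are duplicates of the first one.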

### Show duplicate rows

In [9]:
df[df.duplicated()]

Unnamed: 0,first,last,age,height_cm,weight_kg,income
5,Tom,Piper,20,168.5,57.5,2500
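The result above shows only the later copy of the repeated row. If you would rather see every copy, including the first, `duplicated()` accepts `keep=False`. A minimal sketch on the same data (the name `all_copies` is illustrative):

```python
import pandas as pd

data = {'first': ['James', 'Jane', 'Adam', 'Tom', 'Tom', 'Tom'],
        'last': ['Smith', 'Watson', 'Miller', 'Thompson', 'Piper', 'Piper'],
        'age': [18, 18, 19, 19, 20, 20],
        'height_cm': [75.7, 163, 176.5, 163.3, 168.5, 168.5],
        'weight_kg': [66.9, 56.7, 68.9, 58, 57.5, 57.5],
        'income': ['1,000', 800, 350, 980, '2,500', '2,500']}
df = pd.DataFrame(data)

# keep=False flags every member of a duplicate group, not just the repeats
all_copies = df[df.duplicated(keep=False)]
print(all_copies.index.tolist())  # [4, 5]
```

This is handy when you want to inspect the duplicated records side by side before deciding which one to keep.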


### Drop duplicate rows

In [11]:
df.drop_duplicates(keep='first')

Unnamed: 0,first,last,age,height_cm,weight_kg,income
0,James,Smith,18,75.7,66.9,1000
1,Jane,Watson,18,163.0,56.7,800
2,Adam,Miller,19,176.5,68.9,350
3,Tom,Thompson,19,163.3,58.0,980
4,Tom,Piper,20,168.5,57.5,2500
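`keep='first'` is the default, so the call above behaves the same as `df.drop_duplicates()`. The sketch below shows the other `keep` options along with `ignore_index=True` (available in pandas 1.0 and later), which renumbers the result. The variable names are illustrative.

```python
import pandas as pd

data = {'first': ['James', 'Jane', 'Adam', 'Tom', 'Tom', 'Tom'],
        'last': ['Smith', 'Watson', 'Miller', 'Thompson', 'Piper', 'Piper'],
        'age': [18, 18, 19, 19, 20, 20],
        'height_cm': [75.7, 163, 176.5, 163.3, 168.5, 168.5],
        'weight_kg': [66.9, 56.7, 68.9, 58, 57.5, 57.5],
        'income': ['1,000', 800, 350, 980, '2,500', '2,500']}
df = pd.DataFrame(data)

# Keep the last copy of each duplicated row instead of the first
last_kept = df.drop_duplicates(keep='last')
print(last_kept.index.tolist())   # [0, 1, 2, 3, 5]

# Drop every copy of a duplicated row
none_kept = df.drop_duplicates(keep=False)
print(none_kept.index.tolist())   # [0, 1, 2, 3]

# Renumber the surviving rows from 0 instead of keeping the old labels
renumbered = df.drop_duplicates(keep='last', ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]

# drop_duplicates returns a new DataFrame; the original is unchanged
print(len(df))                    # 6
```

Because the original frame is left untouched, assign the result back (for example `df = df.drop_duplicates()`) if you want the change to persist.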


### Drop duplicates based on column

In [14]:
df.drop_duplicates(subset='first', keep='first')

Unnamed: 0,first,last,age,height_cm,weight_kg,income
0,James,Smith,18,75.7,66.9,1000
1,Jane,Watson,18,163.0,56.7,800
2,Adam,Miller,19,176.5,68.9,350
3,Tom,Thompson,19,163.3,58.0,980


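**Above**, both **Tom Piper** rows were dropped because we told Pandas to compare only the `first` column, so every row after the first **Tom** (Tom Thompson) counts as a duplicate. Supplying **multiple columns** makes Pandas treat rows as duplicates only when all of those columns match.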

In [16]:
df.drop_duplicates(subset=['first', 'last'], keep='first')

Unnamed: 0,first,last,age,height_cm,weight_kg,income
0,James,Smith,18,75.7,66.9,1000
1,Jane,Watson,18,163.0,56.7,800
2,Adam,Miller,19,176.5,68.9,350
3,Tom,Thompson,19,163.3,58.0,980
4,Tom,Piper,20,168.5,57.5,2500
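Before dropping on a combination of columns, it can help to check which combinations actually repeat. The sketch below does this in two ways, assuming only the `first` and `last` columns from the data above: `duplicated()` accepts the same column list through `subset`, and `groupby(...).size()` counts rows per name pair. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'first': ['James', 'Jane', 'Adam', 'Tom', 'Tom', 'Tom'],
    'last': ['Smith', 'Watson', 'Miller', 'Thompson', 'Piper', 'Piper'],
})

# Flag rows whose (first, last) pair repeats an earlier row
mask = df.duplicated(subset=['first', 'last'])
print(mask.sum())  # 1

# Count rows per (first, last) pair and keep only the repeated pairs
pair_counts = df.groupby(['first', 'last']).size()
repeated_pairs = pair_counts[pair_counts > 1]
print(repeated_pairs.to_dict())  # {('Tom', 'Piper'): 2}
```

Tom Thompson is not flagged here, since his last name differs from the Piper rows even though the first name matches.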
