# Pandas Duplicate Data

This notebook explains how to identify and handle duplicate rows with `pandas`.

## Packages

This tutorial uses:
* [pandas](https://pandas.pydata.org/docs/)

In [1]:
import pandas as pd

## Creating the data

For this example, we will create a dataframe that contains several duplicated rows.

In [2]:
df = pd.DataFrame({'A': ['A']*2 + ['A', 'A', 'B', 'A', 'B']*3 + ['A', 'A', 'B'],
                   'B': ['A']*2 + ['A', 'a', 'B', 'A', 'b']*3 + ['A', 'a', 'B'],
                   'C': ['A']*2 + ['A', 'B', 'C']*5 + ['A', 'A', 'B'],
                   'D': ['A']*2 + ['A', 'a', 'B']*5 + ['A', 'A', 'B']
                  })
df

Unnamed: 0,A,B,C,D
0,A,A,A,A
1,A,A,A,A
2,A,A,A,A
3,A,a,B,a
4,B,B,C,B
5,A,A,A,A
6,B,b,B,a
7,A,A,C,B
8,A,a,A,A
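9,B,B,B,a
10,A,A,C,B
11,B,b,A,A
12,A,A,B,a
13,A,a,C,B
14,B,B,A,A
15,A,A,B,a
16,B,b,C,B
17,A,A,A,A
18,A,a,A,A
19,B,B,B,B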


## Identify duplicates

### Duplicate in all columns

The `duplicated` method returns a Boolean series indicating whether each row is a duplicate of another row. The `keep` parameter controls which occurrence in each set of duplicates is left unmarked: **'first'** (default) marks the first occurrence **False** and the rest **True**, **'last'** marks the last occurrence **False** and the rest **True**, and **False** marks every occurrence **True**.

In [3]:
dups = df.duplicated()
dups

0     False
1      True
2      True
3     False
4     False
5      True
6     False
7     False
8     False
9     False
10     True
11    False
12    False
13    False
14    False
15     True
16    False
17     True
18     True
19    False
dtype: bool

To see the duplicate rows, use the Boolean series **dups** to select rows from the original dataframe.

In [4]:
df[dups]

Unnamed: 0,A,B,C,D
1,A,A,A,A
2,A,A,A,A
5,A,A,A,A
10,A,A,C,B
15,A,A,B,a
17,A,A,A,A
18,A,a,A,A
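
The three `keep` settings are easier to compare side by side. The following is a minimal sketch on a small, hypothetical dataframe named `toy` (not the `df` used in this notebook), in which rows 0, 2, and 3 are identical.

```python
import pandas as pd

# Hypothetical toy data for illustration: rows 0, 2, and 3 are identical.
toy = pd.DataFrame({'A': ['x', 'y', 'x', 'x'],
                    'B': [1, 2, 1, 1]})

first = toy.duplicated()             # keep='first' is the default
last = toy.duplicated(keep='last')   # only the last occurrence stays False
every = toy.duplicated(keep=False)   # every repeated row is marked True

print(first.tolist())  # [False, False, True, True]
print(last.tolist())   # [True, False, True, False]
print(every.tolist())  # [True, False, True, True]
```

Passing `keep=False` is handy when you want to inspect every member of a duplicate group rather than only the repeats.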


### Duplicate in selected columns

When the `subset` parameter is passed a list of columns (in this case, **A** and **B**), `duplicated` compares only those columns when deciding whether a row is a duplicate.

In [5]:
dups = df.duplicated(subset=['A', 'B'])
dups

0     False
1      True
2      True
3     False
4     False
5      True
6     False
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14     True
15     True
16     True
17     True
18     True
19     True
dtype: bool

In [6]:
df[dups]

Unnamed: 0,A,B,C,D
1,A,A,A,A
2,A,A,A,A
5,A,A,A,A
7,A,A,C,B
8,A,a,A,A
9,B,B,B,a
10,A,A,C,B
11,B,b,A,A
12,A,A,B,a
13,A,a,C,B
14,B,B,A,A
15,A,A,B,a
16,B,b,C,B
17,A,A,A,A
18,A,a,A,A
19,B,B,B,B
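
Because the result of `duplicated` is a Boolean series, summing it counts the duplicates. The sketch below uses a hypothetical dataframe `toy` with illustrative column names `key` and `value`. It compares the full-row count with the count for a single column, then combines `subset` with `keep=False` to list every row that shares a key.

```python
import pandas as pd

# Hypothetical toy data: 'key' repeats while 'value' is always different.
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'c', 'b'],
                    'value': [1, 2, 3, 4, 5]})

# No row is repeated across all columns ...
n_full = int(toy.duplicated().sum())
# ... but two rows repeat an earlier value of 'key'.
n_key = int(toy.duplicated(subset=['key']).sum())
print(n_full, n_key)  # 0 2

# keep=False selects every row whose 'key' occurs more than once;
# a stable sort groups them while preserving their original order.
groups = toy[toy.duplicated(subset=['key'], keep=False)]
groups = groups.sort_values('key', kind='mergesort')
print(groups.index.tolist())  # [0, 2, 1, 4]
```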


## Delete duplicates

### Delete only if all columns are duplicated

The `drop_duplicates` method returns a new dataframe with duplicate rows removed; by default, the original dataframe is left unchanged. The `keep` parameter can be **'first'** (default) to keep the first occurrence and drop the rest, **'last'** to keep the last occurrence and drop the rest, or **False** to drop every row that has a duplicate.

In [7]:
dedup_df = df.drop_duplicates()
dedup_df

Unnamed: 0,A,B,C,D
0,A,A,A,A
3,A,a,B,a
4,B,B,C,B
6,B,b,B,a
7,A,A,C,B
8,A,a,A,A
9,B,B,B,a
11,B,b,A,A
12,A,A,B,a
13,A,a,C,B
14,B,B,A,A
16,B,b,C,B
19,B,B,B,B
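
To illustrate how `keep` changes what survives, here is a minimal sketch on a hypothetical dataframe `toy` (not the notebook's `df`). It also checks that `drop_duplicates` gives the same result as filtering with the inverted `duplicated` mask.

```python
import pandas as pd

# Hypothetical toy data for illustration: rows 0, 2, and 3 are identical.
toy = pd.DataFrame({'A': ['x', 'y', 'x', 'x'],
                    'B': [1, 2, 1, 1]})

kept_first = toy.drop_duplicates()             # keeps index 0 and 1
kept_last = toy.drop_duplicates(keep='last')   # keeps index 1 and 3
kept_none = toy.drop_duplicates(keep=False)    # keeps index 1 only

print(kept_first.index.tolist())  # [0, 1]
print(kept_last.index.tolist())   # [1, 3]
print(kept_none.index.tolist())   # [1]

# Dropping duplicates is equivalent to selecting the rows not flagged by duplicated().
via_mask = toy[~toy.duplicated()]
print(kept_first.equals(via_mask))  # True
```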


### Delete only if specified columns are duplicated

When the `subset` parameter is passed a list of columns (in this case, **A** and **B**), `drop_duplicates` drops rows that repeat an earlier combination of values in just those columns, keeping the first occurrence by default.

In [8]:
dedup_df = df.drop_duplicates(subset=['A', 'B'])
dedup_df

Unnamed: 0,A,B,C,D
0,A,A,A,A
3,A,a,B,a
4,B,B,C,B
6,B,b,B,a
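
The `subset` and `keep` parameters can be combined, for instance to keep only the most recent row for each key. The sketch below assumes a hypothetical dataframe `toy` with illustrative columns `key` and `value`, and uses `ignore_index=True` (available in pandas 1.0 and later) to renumber the result from 0.

```python
import pandas as pd

# Hypothetical toy data: later rows represent newer values for the same key.
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'c', 'b'],
                    'value': [1, 2, 3, 4, 5]})

# Keep the last row seen for each key and reset the index to 0..n-1.
latest = toy.drop_duplicates(subset=['key'], keep='last', ignore_index=True)

print(latest['key'].tolist())    # ['a', 'c', 'b']
print(latest['value'].tolist())  # [3, 4, 5]
print(latest.index.tolist())     # [0, 1, 2]
```

Without `ignore_index=True`, the result keeps the original row labels (2, 3, and 4 here), which matches the behavior shown in the outputs above.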
