It is often said that a data scientist's job is 80% data cleaning and 20% modelling. In this post, I show how you can deduplicate records more quickly using the dedupe library. The dedupe library, from the company Dedupe.io, makes identifying duplicate records easy: you train a model and it clusters the duplicates. Thankfully, the company released it as an open source library that anyone who can code can use. If you would rather not write code, I suggest checking out their GUI software at <a href="http://dedupe.io" target="_blank">dedupe.io</a>.

This post focuses on pandas dedupe, a library that I have contributed to. It brings the power of dedupe to pandas, making it interactive within a Jupyter notebook. The pandas dedupe library can be found at:

<a href="https://github.com/Lyonk71/pandas-dedupe" target="_blank">https://github.com/Lyonk71/pandas-dedupe</a>

# Install Pandas Dedupe Library

In [1]:
# GitHub no longer serves the unauthenticated git:// protocol, so install over HTTPS
!pip install git+https://github.com/Lyonk71/pandas-dedupe.git

Collecting git+git://github.com/Lyonk71/pandas-dedupe.git
  Cloning git://github.com/Lyonk71/pandas-dedupe.git to c:\users\tyler\appdata\local\temp\pip-req-build-xytctzaq
Building wheels for collected packages: pandas-dedupe
  Building wheel for pandas-dedupe (setup.py): started
  Building wheel for pandas-dedupe (setup.py): finished with status 'done'
  Stored in directory: C:\Users\tyler\AppData\Local\Temp\pip-ephem-wheel-cache-yfm93kjy\wheels\13\cd\fe\56faa6c628f81a5fac9c9f4245ab7b1f57dc2df48159f0a9b5
Successfully built pandas-dedupe




# Example of Deduplication

In [2]:
from pandas_dedupe import dedupe_dataframe

In [3]:
import pandas as pd

## Generate Fake Data

In this section I generate some fake data and duplicate some records.

In [4]:
import faker

In [5]:
fake = faker.Faker()

In [6]:
data = {
    'Name': [],
    'Address': [],
}

for i in range(100):
    data['Name'].append(fake.name())
    data['Address'].append(fake.address())
df = pd.DataFrame(data)

## Duplicate Records

Here I duplicate some records so that dedupe has something to find. Note that once you have trained a model, pandas_dedupe reads the file saved from that training run and reuses it for clustering, which is why no training prompts appear in the output below.

In [7]:
df = pd.concat([df, df.sample(frac=0.2)])

In [8]:
len(df)

120

In [9]:
len(df.drop_duplicates())

100
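The copies made above are exact, which is why `drop_duplicates` recovers all 100 rows on its own. Real duplicates usually differ by a typo or a formatting change, and that is the case dedupe is meant for. A minimal sketch of the difference, using made-up records (the names and addresses here are hypothetical and not part of the dataset above):

```python
import pandas as pd

# Hypothetical records: row 1 is an exact copy of row 0,
# row 2 is a near-duplicate (a missing letter and an extra comma).
people = pd.DataFrame({
    'Name': ['Jane Doe', 'Jane Doe', 'Jane Do'],
    'Address': ['1 Main St Springfield', '1 Main St Springfield', '1 Main St, Springfield'],
})

# drop_duplicates compares rows for exact equality,
# so only the exact copy is removed and the near-duplicate survives.
exact_deduped = people.drop_duplicates()
print(len(people), len(exact_deduped))
```

The near-duplicate row is still present after `drop_duplicates`, so catching it needs a fuzzy approach such as the trained model used in the next cell.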

In [10]:
dedupe_df = dedupe_dataframe(df, ['Name', 'Address'])



importing data ...
reading from dedupe_dataframe_learned_settings
clustering...
# duplicate sets 6
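The "reading from dedupe_dataframe_learned_settings" line in the output shows that a previously saved settings file was reused. If you want to go through training again, one option is to remove that file first. The helper below is a rough sketch and not part of pandas_dedupe: the function name is hypothetical, and it assumes that the file sits in the current working directory and that deleting it makes the next call start training from scratch.

```python
from pathlib import Path

# File name taken from the "reading from ..." line in the output above.
# Its location (the current working directory) is an assumption.
SETTINGS_FILE = Path('dedupe_dataframe_learned_settings')


def reset_learned_settings(settings_file=SETTINGS_FILE):
    """Delete a saved settings file if present; return True when a file was removed."""
    settings_file = Path(settings_file)
    if settings_file.exists():
        settings_file.unlink()
        return True
    return False
```

You would call `reset_learned_settings()` before `dedupe_dataframe(...)`. If the library also keeps a separate file with your labeled examples, that file is not touched by this sketch.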


## Illustrate Dedupe

During training, dedupe prompts you with pairs of records that it thinks are similar. You tell it which pairs are duplicates and which are not so that the model can give better results. The prompts appear when you run:

dedupe_df = dedupe_dataframe(df, ['Name', 'Address'])

## Dedupe Output

Once the training and clustering process is complete, you are presented with a dataframe that adds a cluster id and a confidence to each record. Records that share the same cluster id are considered duplicates of each other. The confidence is a certainty score from 0 to 1.
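A common next step is to collapse each cluster down to a single record. The sketch below shows one way to do that with plain pandas. The dataframe is a hypothetical stand-in that reuses the column names `cluster id` and `confidence` with made-up values, and it assumes that records dedupe did not cluster carry a missing cluster id.

```python
import pandas as pd

# Hypothetical output shaped like dedupe_df: same column names, made-up values.
clustered = pd.DataFrame({
    'Name': ['jane doe', 'jane doe', 'john roe', 'ann poe'],
    'Address': ['1 main st', '1 main st', '2 oak ave', '3 elm rd'],
    'cluster id': [0.0, 0.0, 1.0, None],
    'confidence': [0.91, 0.88, 0.75, None],
})

# Assumption: rows without a cluster id were not matched to anything, so keep them all.
unclustered = clustered[clustered['cluster id'].isna()]

# Within each cluster, keep only the row with the highest confidence.
best_per_cluster = (
    clustered.dropna(subset=['cluster id'])
    .sort_values('confidence', ascending=False)
    .drop_duplicates(subset='cluster id', keep='first')
)

# Recombine and restore the original row order.
deduplicated = pd.concat([best_per_cluster, unclustered]).sort_index()
print(deduplicated[['Name', 'cluster id', 'confidence']])
```

Keeping the highest-confidence row is only one possible rule. Depending on your data you may prefer the most complete record in each cluster, or to review low-confidence clusters by hand before dropping anything.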

In [11]:
dedupe_df.sort_values(['confidence', 'cluster id'], ascending=False)

| | Address | Name | cluster id | confidence |
|---|---|---|---|---|
| 52 | 028 michael orchard suite 654 carterside, in 6... | michael pugh | 3.0 | 0.434289 |
| 98 | 881 adrian centers apt. 030 lake timothymouth,... | brett bowman | 3.0 | 0.434289 |
| 9 | unit 1117 box 7761 dpo ap 36039 | tami sanders | 1.0 | 0.406689 |
| 62 | unit 5686 box 6589 dpo ap 37867 | holly zuniga | 1.0 | 0.406689 |
| 62 | unit 5686 box 6589 dpo ap 37867 | holly zuniga | 1.0 | 0.406689 |
| 57 | 0871 ross court apt. 021 north josephstad, dc ... | edward patrick | 4.0 | 0.399691 |
| 57 | 0871 ross court apt. 021 north josephstad, dc ... | edward patrick | 4.0 | 0.399691 |
| 79 | 873 leblanc rapid apt. 613 new virginia, nd 66835 | veronica barnes | 4.0 | 0.359868 |
| 13 | psc 7201, box 7089 apo ap 08072 | donna rodriguez | 0.0 | 0.358922 |
| 59 | 873 becker lodge haroldfort, wa 06364 | carolyn cruz | 4.0 | 0.355432 |


Notice that during training I told dedupe to treat some records that are not true duplicates as similar. This affects the resulting clusters, but for demonstration purposes it suffices.