# DedupliPy

## Simple deduplication

Load your data. In this example we take a sample dataset that comes with DedupliPy:

In [1]:
from deduplipy.datasets import load_data

In [2]:
df = load_data(kind="voters")

Column names: 'name', 'suburb', 'postcode'


In [3]:
df.head(2)

| | name | suburb | postcode |
|---|---|---|---|
| 0 | khimerc thomas | charlotte | 2826g |
| 1 | lucille richardst | kannapolis | 28o81 |


Create a `Deduplicator` instance and provide the column names to be used for deduplication:

In [4]:
from deduplipy.deduplicator import Deduplicator

In [5]:
myDedupliPy = Deduplicator(["name", "suburb", "postcode"])

Fit the `Deduplicator` by active learning: for each pair shown, enter whether it is a match (y) or not (n). Once training has converged, you are notified and can finish training by entering 'f'.

In [None]:
myDedupliPy.fit(df)

Apply the trained `Deduplicator` to (new) data. The column `deduplication_id` identifies a cluster: rows with the same `deduplication_id` are predicted to be the same real-world entity.

In [7]:
res = myDedupliPy.predict(df)
res.sort_values("deduplication_id").head(10)

| | name | suburb | postcode | deduplication_id |
|---|---|---|---|---|
| 252 | kiera matthews | charlotte | 28216 | 1 |
| 1380 | kiea matthews | charlotte | 28218 | 1 |
| 0 | khimerc thomas | charlotte | 2826g | 2 |
| 1302 | chimerc thmas | chaflotte | 28269 | 2 |
| 1190 | chimerc thomas | charlotte | 28269 | 2 |
| 15 | kimbefly craddock | charlotte | 28264 | 6 |
| 1313 | kimberly craddoclc | charlotte | 282\|4 | 6 |
| 1255 | kimberly craddock | charlotte | 28214 | 6 |
| 1139 | l douglas loudin | charlotte | 28205 | 9 |
| 39 | l douglas loujdin | charlotte | 28225 | 9 |
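
Once every row carries a `deduplication_id`, ordinary pandas operations are enough to inspect the clusters or collapse each one to a single record. The sketch below is not part of DedupliPy. It rebuilds a few rows from the output above by hand so that it runs on its own, and it assumes that keeping the first row of each cluster is an acceptable way to pick a representative; in practice `res` would be the frame returned by `predict`.

```python
import pandas as pd

# A few rows copied from the output above; normally `res` comes from predict().
res = pd.DataFrame(
    {
        "name": ["kiera matthews", "kiea matthews", "khimerc thomas",
                 "chimerc thmas", "chimerc thomas"],
        "suburb": ["charlotte", "charlotte", "charlotte",
                   "chaflotte", "charlotte"],
        "postcode": ["28216", "28218", "2826g", "28269", "28269"],
        "deduplication_id": [1, 1, 2, 2, 2],
    },
    index=[252, 1380, 0, 1302, 1190],
)

# Number of rows assigned to each cluster.
cluster_sizes = res.groupby("deduplication_id").size()

# Assumption: the first row of each cluster serves as its representative.
deduplicated = res.drop_duplicates(subset="deduplication_id", keep="first")

print(cluster_sizes)
print(deduplicated)
```

Keeping the first row is the simplest choice. Depending on the data, picking the most frequent value per column within a cluster may give a cleaner record.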


After training, the `Deduplicator` instance can be saved as a pickle file and later loaded to apply it to new data:

In [8]:
import pickle

In [9]:
with open("mypickle.pkl", "wb") as f:
    pickle.dump(myDedupliPy, f)

In [10]:
with open("mypickle.pkl", "rb") as f:
    loaded_obj = pickle.load(f)

In [11]:
res = loaded_obj.predict(df)
res.sort_values("deduplication_id").head(10)

| | name | suburb | postcode | deduplication_id |
|---|---|---|---|---|
| 252 | kiera matthews | charlotte | 28216 | 1 |
| 1380 | kiea matthews | charlotte | 28218 | 1 |
| 0 | khimerc thomas | charlotte | 2826g | 2 |
| 1302 | chimerc thmas | chaflotte | 28269 | 2 |
| 1190 | chimerc thomas | charlotte | 28269 | 2 |
| 15 | kimbefly craddock | charlotte | 28264 | 6 |
| 1313 | kimberly craddoclc | charlotte | 282\|4 | 6 |
| 1255 | kimberly craddock | charlotte | 28214 | 6 |
| 1139 | l douglas loudin | charlotte | 28205 | 9 |
| 39 | l douglas loujdin | charlotte | 28225 | 9 |
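
The save and load steps above can be wrapped in two small helpers. This is a minimal sketch rather than anything DedupliPy provides: `save_model` and `load_model` are hypothetical names, and a plain dictionary stands in for the trained `Deduplicator` so that the example runs without a fitted model. Note that `pickle.load` can execute arbitrary code, so it should only be used on files you created yourself or otherwise trust.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize `model` to `path` with pickle (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a pickled object; only use this on files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in object so the sketch runs without a trained Deduplicator.
stand_in = {"col_names": ["name", "suburb", "postcode"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "mypickle.pkl"
    save_model(stand_in, model_path)
    restored = load_model(model_path)

# The restored object is an equal but separate copy of the original.
print(restored == stand_in, restored is stand_in)
```

Unpickling also requires the class definitions to be importable, so a pickled `Deduplicator` is best loaded in an environment with the same DedupliPy version that created it.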
