# DedupliPy

## Advanced deduplication

Load your data. In this example we take a sample dataset that comes with DedupliPy:

In [1]:
from deduplipy.datasets import load_data

In [2]:
df = load_data(kind="voters")

Column names: 'name', 'suburb', 'postcode'


In [3]:
df.head(2)

| | name | suburb | postcode |
|---|---|---|---|
| 0 | khimerc thomas | charlotte | 2826g |
| 1 | lucille richardst | kannapolis | 28o81 |


Create a `Deduplicator` instance and provide advanced settings:

- The similarity metrics per field are entered in a dict. A similarity metric can be any function that takes two strings and returns a number.

In [4]:
from fuzzywuzzy.fuzz import partial_ratio, ratio, token_set_ratio, token_sort_ratio

from deduplipy.deduplicator import Deduplicator

In [5]:
field_info = {
    "name": [ratio, partial_ratio],
    "suburb": [token_set_ratio, token_sort_ratio],
    "postcode": [ratio],
}
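Because a similarity metric is just a function of two strings, a custom one can be written by hand. The sketch below is illustrative only: `sequence_ratio` is a hypothetical helper built on the standard library's `difflib`, and the 0-100 scale is chosen to mirror the fuzzywuzzy metrics used above, not because DedupliPy is known to require it.

```python
from difflib import SequenceMatcher


def sequence_ratio(a: str, b: str) -> int:
    """Return a 0-100 similarity score for two strings (hypothetical helper)."""
    # SequenceMatcher.ratio() returns a float in [0, 1]; scale it to 0-100
    # so it is on the same scale as the fuzzywuzzy scores.
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Compare a clean postcode with a corrupted copy of it.
score = sequence_ratio("28081", "28o81")
```

Such a function could then be listed next to the library metrics, for example `"postcode": [ratio, sequence_ratio]`.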

- We define our own blocking rule and apply it only to the 'name' column. Only rows for which the rule returns the same value are compared with each other.

In [6]:
def first_two_characters(x):
    return x[:2]
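To see why a blocking rule reduces the work, the following sketch groups values by the rule's output and pairs up only values within the same group. `candidate_pairs` is a hypothetical helper written for illustration; DedupliPy's internal blocking may be implemented differently.

```python
from collections import defaultdict
from itertools import combinations


def first_two_characters(x):
    return x[:2]


def candidate_pairs(values, rule):
    """Pair up only the values that share a blocking key (illustrative helper)."""
    blocks = defaultdict(list)
    for value in values:
        blocks[rule(value)].append(value)
    pairs = []
    for members in blocks.values():
        # Compare values within a block only, never across blocks.
        pairs.extend(combinations(members, 2))
    return pairs


names = ["lucille richardst", "lucille richards", "lutta baldwin", "rebecca harrell"]
pairs = candidate_pairs(names, first_two_characters)
```

Here three of the four names share the key `"lu"`, so blocking yields 3 candidate pairs instead of the 6 pairs needed to compare every name with every other name.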

- `interaction=True` makes the classifier include interaction features, e.g. `ratio('name') * token_set_ratio('suburb')`. When interaction features are included, the logistic regression classifier applies L1 regularisation to prevent overfitting.
- We set `verbose=1` to get information on the progress and a distribution of scores.

In [7]:
myDedupliPy = Deduplicator(
    field_info=field_info,
    interaction=True,
    rules={"name": [first_two_characters]},
    verbose=1,
)
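As a rough picture of what interaction features are, the sketch below appends the product of every pair of base similarity scores to the feature vector. `add_interactions` and the example scores are hypothetical; this is not DedupliPy's implementation, only an illustration of the idea behind `interaction=True`.

```python
from itertools import combinations


def add_interactions(scores):
    """Append the product of every pair of base scores (illustrative only)."""
    return list(scores) + [a * b for a, b in combinations(scores, 2)]


# Hypothetical similarity scores for one candidate pair, rescaled to [0, 1].
base = [0.9, 0.5, 1.0]
features = add_interactions(base)
```

If only pairwise products are used, the five metrics in `field_info` would give 5 base features plus 10 products. L1 regularisation can shrink the coefficients of uninformative features to exactly zero, which is why it helps when the feature count grows like this.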

Fit the `Deduplicator` by active learning: for each pair shown, enter whether it is a match (y) or not (n). You are notified when training has converged, after which you can finish training by entering 'f'.

In [None]:
myDedupliPy.fit(df)

Apply the trained `Deduplicator` to (new) data. Based on the histogram of scores, we decide to ignore all pairs with a similarity probability lower than 0.1 when predicting, which is done by passing `score_threshold=0.1`.

The column `deduplication_id` is the identifier for a cluster. Rows with the same `deduplication_id` are predicted to be the same real-world entity:

In [9]:
res = myDedupliPy.predict(df, score_threshold=0.1)
res.sort_values("deduplication_id").head(10)

blocking started
blocking finished
Nr of pairs: 27350
scoring started
scoring finished
Nr of filtered pairs: 892
Clustering started
Clustering finished


| | name | suburb | postcode | deduplication_id |
|---|---|---|---|---|
| 1 | lucille richardst | kannapolis | 28o81 | 1 |
| 1194 | lucille richards | kannapolis | 28081 | 1 |
| 604 | lutta baldwin | whiteville | 28472 | 3 |
| 995 | lutta baldwin | whitevill | 28475 | 3 |
| 2 | reb3cca bauerboand | raleigh | 27615 | 5 |
| 1134 | rebecca bauerband | raleigh | 27615 | 5 |
| 1456 | rebecca harrell | winton | 27986 | 7 |
| 1024 | rebecca harrell | witnon | 27926 | 7 |
| 92 | repecca harrell | winton | 27q86 | 7 |
| 675 | rebeccah shelton | whittier | 28789 | 10 |
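The effect of a score threshold on the clusters can be sketched with a simplified model: drop pairs scoring below the threshold, then treat every group of rows connected by the remaining pairs as one cluster. `cluster_pairs` is a hypothetical helper, the scores are invented for illustration, and DedupliPy's actual clustering step may work differently.

```python
def cluster_pairs(scored_pairs, score_threshold):
    """Group row ids linked by pairs scoring at or above the threshold."""
    parent = {}

    def find(x):
        # Return the representative of x's group, registering x if new.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= score_threshold:
            parent[find(a)] = find(b)  # merge the two groups

    clusters = {}
    for row_id in list(parent):
        clusters.setdefault(find(row_id), set()).add(row_id)
    return sorted(clusters.values(), key=min)


# (row id, row id, hypothetical match probability)
scored = [(1, 1194, 0.97), (604, 995, 0.88), (1, 604, 0.03)]
clusters = cluster_pairs(scored, score_threshold=0.1)
```

With the threshold at 0.1 the weak link between rows 1 and 604 is ignored, so the two clusters stay separate. With a threshold of 0 it would merge all four rows into one cluster.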
