# Train an ML Classifier to Link FEBRL People Data

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rachhouse/intro-to-data-linking/)

In this tutorial, we'll train a machine learning classifier to score candidate pairs for linking, using supervised learning. We will use the same training dataset as the SimSum classification tutorial, as well as the same augmentation, blocking, and comparing functions. The functions have been included in a separate `.py` file for re-use and convenience, so we can focus on code unique to this tutorial.

The SimSum classification tutorial walks through augmentation, blocking, and comparing in more detail. Since we reuse the same functions here, those steps are covered only briefly. Please see the SimSum tutorial if you need a refresher.

In [1]:
import itertools

import altair as alt
import pandas as pd

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

In [2]:
# Import our previously used linking functions.
# On Colab, the linking functions file must first be grabbed from GitHub
# and saved locally so that this import can find it.
import linking_tutorial_functions as tutorial

## Load Training Data and Ground Truth Labels

In [3]:
df_A, df_B, df_ground_truth = tutorial.load_febrl_training_data()

## Data Augmentation

In [4]:
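# Assign the results back explicitly; rebinding a loop variable would only
# work if augment_data modified each DataFrame in place.
df_A = tutorial.augment_data(df_A)
df_B = tutorial.augment_data(df_B)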

## Blocking

In [5]:
candidate_links = tutorial.block(df_A, df_B)

## Comparing

In [6]:
%%time

features = tutorial.compare(candidate_links, df_A, df_B)

CPU times: user 42.9 s, sys: 970 ms, total: 43.8 s
Wall time: 43.4 s


In [7]:
features.head()

person_id_A,person_id_B,soundex_surname,soundex_firstname,nysiis_surname,nysiis_firstname,last_name,first_name,address_1,address_2,suburb,postcode,state,date_of_birth,phone_number,ssn
002cf4ec-57d0-4ebf-a31b-88db4441ff2e,061b9c3f-afbe-41e5-923b-3de29a4e5b82,0,1,0,1,0.0,1.0,0.263158,0.225806,0.384615,0.2,0.25,0.0,0.333333,0.0
002cf4ec-57d0-4ebf-a31b-88db4441ff2e,081ec178-99a1-4895-b96e-7c03cf8bbfdc,0,1,0,1,0.577778,1.0,0.210526,0.032258,0.230769,0.4,1.0,0.0,0.333333,0.142857
002cf4ec-57d0-4ebf-a31b-88db4441ff2e,17b274b5-aa3f-43cc-96ae-21283b7d1ca5,0,1,0,1,0.588889,1.0,0.176471,0.032258,0.384615,0.2,0.25,0.0,0.333333,0.0
002cf4ec-57d0-4ebf-a31b-88db4441ff2e,1f70d4cd-3106-4d15-af9f-1617a43ca83f,0,1,0,1,0.455556,1.0,0.235294,0.193548,0.307692,0.2,0.25,0.0,0.083333,0.142857
002cf4ec-57d0-4ebf-a31b-88db4441ff2e,201c4dba-825a-42f2-b7a8-832b792af90b,0,0,0,0,0.577778,0.611111,0.083333,0.129032,0.153846,0.6,0.25,0.0,0.416667,0.571429


## Add Labels to Feature Vectors

We've augmented, blocked, and compared, so now we're ready to train a classification model which can score candidate record pairs on how likely it is that they are a link. As we did when classifying links via SimSum, we'll append our ground truth values to the features DataFrame.

In [8]:
df_ground_truth["ground_truth"] = df_ground_truth["ground_truth"].apply(lambda x: 1.0 if x else 0.0)

df_labeled_features = pd.merge(
    features,
    df_ground_truth,
    on=["person_id_A", "person_id_B"],
    how="left"
)

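# Candidate pairs missing from the ground truth are treated as non-links.
# Assign the result back rather than calling fillna(inplace=True) on a column selection.
df_labeled_features["ground_truth"] = df_labeled_features["ground_truth"].fillna(0)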
df_labeled_features.head()

person_id_A,person_id_B,soundex_surname,soundex_firstname,nysiis_surname,nysiis_firstname,last_name,first_name,address_1,address_2,suburb,postcode,state,date_of_birth,phone_number,ssn,ground_truth
002cf4ec-57d0-4ebf-a31b-88db4441ff2e,061b9c3f-afbe-41e5-923b-3de29a4e5b82,0,1,0,1,0.0,1.0,0.263158,0.225806,0.384615,0.2,0.25,0.0,0.333333,0.0,0.0
002cf4ec-57d0-4ebf-a31b-88db4441ff2e,081ec178-99a1-4895-b96e-7c03cf8bbfdc,0,1,0,1,0.577778,1.0,0.210526,0.032258,0.230769,0.4,1.0,0.0,0.333333,0.142857,0.0
002cf4ec-57d0-4ebf-a31b-88db4441ff2e,17b274b5-aa3f-43cc-96ae-21283b7d1ca5,0,1,0,1,0.588889,1.0,0.176471,0.032258,0.384615,0.2,0.25,0.0,0.333333,0.0,0.0
002cf4ec-57d0-4ebf-a31b-88db4441ff2e,1f70d4cd-3106-4d15-af9f-1617a43ca83f,0,1,0,1,0.455556,1.0,0.235294,0.193548,0.307692,0.2,0.25,0.0,0.083333,0.142857,0.0
002cf4ec-57d0-4ebf-a31b-88db4441ff2e,201c4dba-825a-42f2-b7a8-832b792af90b,0,0,0,0,0.577778,0.611111,0.083333,0.129032,0.153846,0.6,0.25,0.0,0.416667,0.571429,0.0
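
To make the labeling step concrete, here is a minimal sketch on made-up data. The toy frames (`toy_features`, `toy_truth`) and their values are illustrative assumptions, not the tutorial's data, and the sketch assumes the ground truth table does not contain every candidate pair. It shows how a left merge keeps every candidate pair and how unmatched pairs end up labeled `0`.

```python
import pandas as pd

# Toy comparison vectors indexed by (person_id_A, person_id_B); values are made up.
index = pd.MultiIndex.from_tuples(
    [("a1", "b1"), ("a1", "b2"), ("a2", "b2")],
    names=["person_id_A", "person_id_B"],
)
toy_features = pd.DataFrame({"first_name": [1.0, 0.2, 0.9]}, index=index)

# Toy ground truth: only one of the three candidate pairs appears here.
toy_truth = pd.DataFrame(
    {"person_id_A": ["a1"], "person_id_B": ["b1"], "ground_truth": [1.0]}
)

# A left merge keeps every candidate pair; pairs absent from the ground
# truth get NaN, which we then fill with 0 (non-link).
toy_labeled = (
    toy_features.reset_index()
    .merge(toy_truth, on=["person_id_A", "person_id_B"], how="left")
    .set_index(["person_id_A", "person_id_B"])
)
toy_labeled["ground_truth"] = toy_labeled["ground_truth"].fillna(0.0)

print(toy_labeled)
```

The sketch resets and restores the index explicitly so the pair identifiers are handled as ordinary columns during the merge; this is a stylistic choice for clarity, not a requirement.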


## Separate Candidate Links into Train/Test

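Next, we'll split our labeled features DataFrame into train and test sets. We hold out 20% of the candidate pairs for testing and stratify on the label, so both sets keep the same proportion of links.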

In [9]:
X = df_labeled_features.drop("ground_truth", axis=1)
y = df_labeled_features["ground_truth"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2
)
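
Stratification matters because true links are typically rare among candidate pairs. The sketch below uses made-up labels (the 5% link rate, `X_toy`, and `y_toy` are assumptions for illustration) to show that `stratify` keeps the link rate the same in both splits. It also passes `random_state`, which the cell above does not, purely so the sketch is repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels: 10 links among 200 candidate pairs (a 5% link rate), made up.
y_toy = pd.Series([1.0] * 10 + [0.0] * 190)
X_toy = pd.DataFrame({"feature": np.arange(200)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=0
)

# With stratification, both splits keep the overall link rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train link rate: {train_rate:.3f}, test link rate: {test_rate:.3f}")
```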

## Train ML Classifier

Though we're using a very simple machine learning model here, the key takeaway is to treat the classification step as a black box that outputs a score indicating how likely a given candidate record pair is to be a link. The only requirement is that a score comes out; *how* it is generated is up to you. You might use SimSum, which can be considered an extremely simple "model", or build a neural net that ingests the comparison vectors and produces a score. In linking, the classification model is generally the simplest piece, and much more work goes into your blockers and comparators.

In [34]:
classifier = AdaBoostClassifier(n_estimators=64, learning_rate=0.5)

In [35]:
classifier.fit(X_train, y_train)

AdaBoostClassifier(learning_rate=0.5, n_estimators=64)
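
The "black box" idea can be sketched as two interchangeable scorers that both map comparison vectors to one score per candidate pair. Everything here is an illustrative assumption: the toy data, the helper names `simsum_score` and `model_score`, and the choice to normalize the SimSum-style score by averaging rather than summing.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)

# Toy comparison vectors (made up): links get high similarities, non-links low.
n = 200
labels = np.array([1.0] * 40 + [0.0] * 160)
sims = np.where(
    labels[:, None] == 1.0,
    rng.uniform(0.6, 1.0, (n, 3)),
    rng.uniform(0.0, 0.5, (n, 3)),
)
toy_X = pd.DataFrame(sims, columns=["first_name", "surname", "suburb"])


def simsum_score(X):
    """Hypothetical SimSum-style scorer: mean of the comparison values."""
    return X.mean(axis=1).to_numpy()


def model_score(model, X):
    """Score from a fitted scikit-learn classifier: probability of class 1."""
    return model.predict_proba(X)[:, 1]


model = AdaBoostClassifier(n_estimators=16, learning_rate=0.5).fit(toy_X, labels)

# Both scorers return one score per candidate pair, so everything downstream
# (thresholding, evaluation) can stay the same whichever one is used.
scores_simsum = simsum_score(toy_X)
scores_model = model_score(model, toy_X)
```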

## Predict Using ML Classifier

Here, we'll generate scores for our test set, and format those predictions in a form useful for evaluation.

In [36]:
# Column 1 of predict_proba holds the probability of the positive (link) class.
y_pred = classifier.predict_proba(X_test)[:, 1]

In [37]:
df_predictions = X_test.copy()
df_predictions["model_score"] = y_pred
df_predictions["ground_truth"] = y_test
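
Taking column `1` of `predict_proba` relies on the classifier's class ordering. As a small sketch on made-up data (`X_demo`, `y_demo`, and `clf` are illustrative names), the column for the link class can be looked up from `classes_` instead of being hard-coded:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Tiny made-up dataset: one feature, labels 0.0 (non-link) and 1.0 (link).
X_demo = np.array([[0.1], [0.2], [0.8], [0.9]])
y_demo = np.array([0.0, 0.0, 1.0, 1.0])
clf = AdaBoostClassifier(n_estimators=8).fit(X_demo, y_demo)

# predict_proba columns follow the order of clf.classes_, which is sorted.
link_col = list(clf.classes_).index(1.0)
proba = clf.predict_proba(X_demo)
link_scores = proba[:, link_col]
```

With labels `0.0` and `1.0`, the link class lands in column `1`, which matches the indexing used above.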

## Choosing a Linking Model Score Threshold

As with SimSum, we're able to examine the resulting score distribution and precision/recall vs. model score threshold plot to determine where the cutoff should be set.

### Model Score Distribution

In [38]:
tutorial.plot_model_score_distribution(df_predictions)

### Precision and Recall vs. Model Score

In [39]:
blocking_eval, df_eval, df_top_links, df_bottom_links = tutorial.evaluate_linking(
    df=df_predictions,
    df_true_links=df_ground_truth,
    df_left=df_A,
    df_right=df_B,
)

In [40]:
tutorial.plot_precision_recall_vs_threshold(df_eval)
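
To show what a precision/recall vs. threshold curve summarizes, here is a minimal sketch that computes both metrics by hand on made-up scores. The `toy` frame and the helper `precision_recall_at` are hypothetical; the tutorial's `evaluate_linking` function may compute these differently (for example, by also counting true links that never made it through blocking).

```python
import pandas as pd

# Made-up scored candidate pairs.
toy = pd.DataFrame({
    "model_score": [0.9, 0.8, 0.7, 0.4, 0.3, 0.1],
    "ground_truth": [1.0, 1.0, 0.0, 1.0, 0.0, 0.0],
})


def precision_recall_at(df, threshold):
    """Precision and recall when pairs scoring >= threshold are called links."""
    predicted = df["model_score"] >= threshold
    actual = df["ground_truth"] == 1.0
    true_positives = int((predicted & actual).sum())
    precision = true_positives / int(predicted.sum()) if predicted.any() else 1.0
    recall = true_positives / int(actual.sum()) if actual.any() else 0.0
    return precision, recall


# Raising the threshold trades recall for precision.
sweep = {t: precision_recall_at(toy, t) for t in (0.2, 0.5, 0.75)}
for threshold, (precision, recall) in sweep.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy data, moving the threshold from 0.2 to 0.75 raises precision from 0.6 to 1.0 while recall drops from 1.0 to about 0.67, which is the trade-off the plot lets you inspect when choosing a cutoff.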

### Top Scoring `k` Links

In [41]:
display_cols = [
    "first_name", "surname",
    "street_number", "address_1", "address_2", "suburb", "postcode", "state",
    "date_of_birth", "age", "phone_number", "soc_sec_id",
    "soundex_surname", "soundex_firstname",
    "nysiis_surname", "nysiis_firstname",
]

display_cols = [[f"{col}_x", f"{col}_y"] for col in display_cols]
display_cols = list(itertools.chain.from_iterable(display_cols))

In [42]:
with pd.option_context('display.max_columns', None):
    display(df_top_links[["person_id_A", "person_id_B", "model_score", "ground_truth"] + display_cols])

Unnamed: 0,person_id_A,person_id_B,model_score,ground_truth,first_name_x,first_name_y,surname_x,surname_y,street_number_x,street_number_y,address_1_x,address_1_y,address_2_x,address_2_y,suburb_x,suburb_y,postcode_x,postcode_y,state_x,state_y,date_of_birth_x,date_of_birth_y,age_x,age_y,phone_number_x,phone_number_y,soc_sec_id_x,soc_sec_id_y,soundex_surname_x,soundex_surname_y,soundex_firstname_x,soundex_firstname_y,nysiis_surname_x,nysiis_surname_y,nysiis_firstname_x,nysiis_firstname_y
0,93dc17f3-a02d-412e-8ccb-4cd2d24082fe,1c34fae0-84f6-4c66-9ccf-36e24d05b854,0.874489,1.0,isabelle,isabelle,reid,reid,37,,o'connor circuit,o'connor curcuit,,,hawthorn,hawthorn,2210,2210,nsw,nsw,19201126,19201126,,,08 33822741,08 38323741,4187751,4187751,R300,R300,I214,I214,RAD,RAD,ISABAL,ISABAL
1,84ac7aac-0504-4a76-818e-bab8b8b43e07,1111ebfb-b5e4-41d1-8e5c-80546f16ad40,0.874489,1.0,georgia,georgia,clarke,clarke,18,28.0,burrinjuck crescent,burrinjuckscrescent,walnut grove estate,walnut grove estate,cronulla,cronullla,4220,4220,nsw,nsw,19150906,19150906,31.0,,03 99665759,03 99665759,6785464,6785464,C462,C462,G620,G620,CLARC,CLARC,GARG,GARG
2,a1c7e20f-f340-442a-a8ce-9a676b3902ed,6e8bd289-5aa8-412f-b983-71e58108b691,0.874489,1.0,zac,zac,reid,reid,34,34.0,brebner street,brebner street,riverside professional centre (cnr banks,riverside professional centre (cnr banks,rosebud,rosenud,4075,4075,vic,vic,19390613,19390613,30.0,30.0,07 07659113,07 07695113,5176004,5176004,R300,R300,Z200,Z200,RAD,RAD,ZAC,ZAC
3,7652739c-10c1-421b-a96b-e890d3667e6e,c25fd504-a2c0-4d0b-abdf-9af4c4f4538c,0.874489,1.0,oliver,oliver,hendricks,hendrcks,15,11.0,ainsworth street,ainsworth street,,,ballina,ballina,4076,4076,tas,tas,19020213,19020213,26.0,26.0,02 30152278,02 30152278,3659732,3659732,H536,H536,O416,O416,HANDRAC,HANDRC,OLAVAR,OLAVAR
4,67b496c1-9370-413a-956f-ccf9d88f8b86,0f1321a8-9eaf-4f2d-a4e5-a36cebf21e71,0.874489,1.0,riley,riley,verschoor,verschoor,76,76.0,kallaroo road,kallaroo troad,,,warrandyte,warrandyte,2228,2228,,,19231028,19231028,,,03 27434832,03 27434832,4500486,4500486,V626,V626,R400,R400,VARSSAR,VARSSAR,RALY,RALY
5,1023c0bf-9fa9-4d6d-9819-0201b9a50c32,2a1a6625-36c5-4286-bb6d-a884b972bb1f,0.874489,1.0,harrison,harrison,purdon,purdon,6,6.0,oldham court,oldhamzcourt,,,mount samson,mount samson,6168,6168,sa,sa,19080226,19080226,32.0,32.0,03 23206117,03 23291617,1755521,1755521,P635,P635,H625,H625,PARDAN,PARDAN,HARASAN,HARASAN
6,a228612f-f95a-456e-b755-49346d9c4ecd,c8c22c1c-edb2-4988-948a-c592f5026208,0.874489,1.0,breeanne,breeanne,byers,byers,29,29.0,henning place,henning place,,,rosebery,roseery,6010,6010,act,act,19200606,19200606,32.0,32.0,08 27010960,08 27102960,5985372,5985372,B620,B620,B650,B650,BYAR,BYAR,BRAN,BRAN
7,ef793ffa-1c97-471b-849b-242bf5c7e5ef,ce36bce3-e5c6-4b40-82d1-c2981cd983f9,0.874489,1.0,tara,tara,millar,milla r,30,49.0,mcgill street,mcgill street,,,west lakes,west lakes,2722,2722,vic,vic,19030424,19030424,28.0,27.0,03 19754147,03 19754147,5852983,5852983,M460,M460,T600,T600,MALAR,MAL,TAR,TAR
8,76ee2b5a-d705-414a-a481-3bf49e190397,30546807-be3b-4087-ba8f-221476c0dd55,0.874489,1.0,chase,chase,fitzpatrick,fitzpafrick,62,64.0,shumack street,shumackistreet,,,camden,camden,2251,2251,,,19900906,19900906,28.0,28.0,02 65419222,02 65419222,5364907,5364907,F321,F321,C200,C200,FATSPATRAC,FATSPAFRAC,CAS,CAS
9,606e16cf-6fd2-4d71-ba15-f3db9d25fd8d,ee0c1beb-70b3-4f1e-87b0-c9afced34524,0.874489,1.0,reece,reece,mclaren-gates,mclaren-zates,7,7.0,totterdell street,totterdell street,,,canley heights,canley heights,6104,6104,vic,vic,19600507,19600507,12.0,12.0,03 25095374,03 25095374,6710205,6710205,M246,M246,R200,R200,MCLARAN-GAT,MCLARAN-SAT,RAC,RAC


### Bottom Scoring `k` Links

In [43]:
with pd.option_context('display.max_columns', None):
    display(df_bottom_links[["person_id_A", "person_id_B", "model_score", "ground_truth"] + display_cols])

Unnamed: 0,person_id_A,person_id_B,model_score,ground_truth,first_name_x,first_name_y,surname_x,surname_y,street_number_x,street_number_y,address_1_x,address_1_y,address_2_x,address_2_y,suburb_x,suburb_y,postcode_x,postcode_y,state_x,state_y,date_of_birth_x,date_of_birth_y,age_x,age_y,phone_number_x,phone_number_y,soc_sec_id_x,soc_sec_id_y,soundex_surname_x,soundex_surname_y,soundex_firstname_x,soundex_firstname_y,nysiis_surname_x,nysiis_surname_y,nysiis_firstname_x,nysiis_firstname_y
0,c740e993-b82a-4a5d-829c-5ec78f794365,ec5cb3ca-f6f5-47b6-b72b-e32b203e8cff,0.148199,0.0,luke,,clarke,dsvid,54,10.0,wilkins street,la perouse street,kilburnie,,moree,moonahcwest,3564,3201,,qld,,,,30.0,,03 54067859,7944844,9029332,C462,D213,L200,,CLARC,DSVAD,LAC,
1,8028bbd7-2e04-405e-b4e7-156b3e296574,62780386-2b5a-4fd4-a6eb-8a24a94ec9f8,0.148199,0.0,ryan,kaite,lambropoulos,white,90,285.0,pollock street,bindaga soreet,moline village,k tobru hse,highett,laverton,2600,5371,nsw,sa,,,39.0,,07 67858345,02 96573662,5669846,3215665,L516,W300,R500,K300,LANBRAPAL,WAT,RYAN,CAT
2,433ef249-be41-4466-80ba-e54fcad9f736,eb23d4ca-5d15-4493-99a9-116953a29d9e,0.148199,0.0,zara,dahiel,mayer,webb,105,78.0,musgrave street,coplan drove,mindaree,,bunbury,moffat beach,6022,3141,nsw,tas,,,34.0,23.0,,03 90147946,1517433,2094089,M600,W100,Z600,D400,MAYAR,WAB,ZAR,DAHAL
3,b9ee9210-d521-4082-9c6c-aebc662d3477,ee16f179-b750-416e-9e9b-7d405c3d2e0a,0.148199,0.0,georgia,ruyb,berry,clarke,45,17.0,merfield place,pickles street,sheep station,hartford,lindfield,surfers piradise,2082,7009,,sa,,,19.0,32.0,04 07568426,02 44900986,5218074,7470726,B600,C462,G620,R100,BARY,CLARC,GARG,RAYB
4,fa098171-37cd-48a1-920a-f2ba4ec677e9,75fc14a3-f516-4beb-ada7-9c1e9fb17877,0.148199,0.0,alessandra,harrison,stancombe,wigth,69,20.0,vincent place,trusselle place,,virginia estate,east hills,yass,2564,6032,vic,nsw,,,,37.0,07 38542744,08 41152090,2423598,3292243,S352,W230,A425,H625,STANCANB,WAGT,ALASANDR,HARASAN
5,1d1091e0-b447-4e38-90e7-74f020fdc86e,de95c3e5-1e0f-47e3-b7d4-2ea927625394,0.148199,0.0,shaun,henry,caruana,wooley,46,57.0,lark place,barada crescent,,warrina lakes,north maclean,mount nelson,3052,3141,qld,,,,26.0,10.0,03 12654759,04 53734835,1520280,4731590,C650,W400,S500,H560,CARAN,WALY,SAN,HANRY
6,cbf3a4f0-f51d-49b0-8121-902bcd4b92e8,0a4d9968-68bd-4b47-84ee-3e175ff7b6b8,0.148199,0.0,demie,demue,odfeldt,bhall,11,2.0,fitchett street,mckail crescent,chippendale village,,canley heights,lilydale,3216,5272,,nsw,19330226.0,19070803.0,31.0,29.0,03 57324830,08 15918275,2774327,5991902,O314,B400,D500,D500,ODFALD,BAL,DANY,DAN
7,96dfe9e6-d61a-4b63-82ef-6ab6ac2ad86e,734c0f47-1742-4523-a8c1-95a660eb84aa,0.148199,0.0,micah,rhiannni,gillis,,102,41.0,bamford street,blackwood terrace,hylows,busselton holiday village,picnic point,deakin,2145,2464,,vic,,,26.0,23.0,,04 43881157,5041960,9632442,G420,,M200,R500,GAL,,MAC,RAN
8,13fedda5-eda0-44fb-8003-83166c31d89d,7394560f-ce7d-428b-8ae9-73c3f54d6995,0.148199,0.0,cooper,ashleigh,dosmyk,mason,14,,gaunson crescent,albermarle place,sunnyview,wingara,goodwood,kedron,4872,2112,nsw,,,,,23.0,07 29344576,03 01730091,8766084,4087812,D252,M250,C160,A242,DASNYC,MASAN,CAPAR,ASLAG
9,88bb1235-d416-4403-a1a7-505290b44943,ad61643b-fce0-49c1-b137-483e536a6865,0.148199,0.0,blake,courtney,d'apollonio,harringgon,21,8.0,barraclough crescent,esperancdqstreet,,poitrel,north ryde,clifton springs,3585,3930,vic,wa,,,,26.0,04 20209991,,6358617,3749445,D145,H652,B420,C635,D'APALAN,HARANGAN,BLAC,CARTNY
