# Train an ML Classifier to Link FEBRL People Data

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rachhouse/intro-to-data-linking/blob/main/tutorial_notebooks/02_Link_FEBRL_Data_with_ML_Classifier.ipynb)

In this tutorial, we'll train a machine learning classifier to score candidate pairs for linking, using supervised learning. We will use the same training dataset as the SimSum classification tutorial, as well as the same augmentation, blocking, and comparing functions. The functions have been included in a separate `.py` file for re-use and convenience, so we can focus on code unique to this tutorial.

The SimSum classification tutorial walks through augmentation, blocking, and comparing in more detail. Since we're using the same functions here, we'll keep the details light for those steps. Please see the SimSum tutorial if you need a refresher.

In [1]:
import itertools

import altair as alt
import pandas as pd

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

In [2]:
# Grab the linking functions file from github and save locally for Colab.
# We'll import our previously used linking functions from this file.
import linking_tutorial_functions as tutorial

## Load Training Data and Ground Truth Labels

In [3]:
df_A, df_B, df_ground_truth = tutorial.load_febrl_training_data()

## Data Augmentation

In [4]:
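# Assign the results explicitly: rebinding a loop variable would not
# update df_A or df_B if augment_data returns a new DataFrame.
df_A = tutorial.augment_data(df_A)
df_B = tutorial.augment_data(df_B)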

## Blocking

In [5]:
candidate_links = tutorial.block(df_A, df_B)

## Comparing

In [6]:
%%time

features = tutorial.compare(candidate_links, df_A, df_B)

CPU times: user 43.4 s, sys: 1.33 s, total: 44.7 s
Wall time: 43.9 s


In [7]:
features.head()

| person_id_A | person_id_B | soundex_surname | soundex_firstname | nysiis_surname | nysiis_firstname | last_name | first_name | address_1 | address_2 | suburb | postcode | state | date_of_birth | phone_number | ssn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 002cf4ec-57d0-4ebf-a31b-88db4441ff2e | 061b9c3f-afbe-41e5-923b-3de29a4e5b82 | 0 | 1 | 0 | 1 | 0.0 | 1.0 | 0.263158 | 0.225806 | 0.384615 | 0.2 | 0.25 | 0.0 | 0.333333 | 0.0 |
| 002cf4ec-57d0-4ebf-a31b-88db4441ff2e | 081ec178-99a1-4895-b96e-7c03cf8bbfdc | 0 | 1 | 0 | 1 | 0.577778 | 1.0 | 0.210526 | 0.032258 | 0.230769 | 0.4 | 1.0 | 0.0 | 0.333333 | 0.142857 |
| 002cf4ec-57d0-4ebf-a31b-88db4441ff2e | 17b274b5-aa3f-43cc-96ae-21283b7d1ca5 | 0 | 1 | 0 | 1 | 0.588889 | 1.0 | 0.176471 | 0.032258 | 0.384615 | 0.2 | 0.25 | 0.0 | 0.333333 | 0.0 |
| 002cf4ec-57d0-4ebf-a31b-88db4441ff2e | 1f70d4cd-3106-4d15-af9f-1617a43ca83f | 0 | 1 | 0 | 1 | 0.455556 | 1.0 | 0.235294 | 0.193548 | 0.307692 | 0.2 | 0.25 | 0.0 | 0.083333 | 0.142857 |
| 002cf4ec-57d0-4ebf-a31b-88db4441ff2e | 201c4dba-825a-42f2-b7a8-832b792af90b | 0 | 0 | 0 | 0 | 0.577778 | 0.611111 | 0.083333 | 0.129032 | 0.153846 | 0.6 | 0.25 | 0.0 | 0.416667 | 0.571429 |


## Add Labels to Feature Vectors

We've augmented, blocked, and compared, so now we're ready to train a classification model that scores each candidate record pair by how likely it is to be a link. As we did when classifying links via SimSum, we'll append our ground truth values to the features DataFrame.

In [8]:
df_ground_truth["ground_truth"] = df_ground_truth["ground_truth"].apply(lambda x: 1.0 if x else 0.0)

df_labeled_features = pd.merge(
    features,
    df_ground_truth,
    on=["person_id_A", "person_id_B"],
    how="left"
)

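# Candidate pairs absent from the ground truth are non-links.
# Assign the result back rather than calling fillna(inplace=True) on a column
# selection, which newer pandas versions warn about and may not apply.
df_labeled_features["ground_truth"] = df_labeled_features["ground_truth"].fillna(0)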
df_labeled_features.head()

| person_id_A | person_id_B | soundex_surname | soundex_firstname | nysiis_surname | nysiis_firstname | last_name | first_name | address_1 | address_2 | suburb | postcode | state | date_of_birth | phone_number | ssn | ground_truth |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 002cf4ec-57d0-4ebf-a31b-88db4441ff2e | 061b9c3f-afbe-41e5-923b-3de29a4e5b82 | 0 | 1 | 0 | 1 | 0.0 | 1.0 | 0.263158 | 0.225806 | 0.384615 | 0.2 | 0.25 | 0.0 | 0.333333 | 0.0 | 0.0 |
| 002cf4ec-57d0-4ebf-a31b-88db4441ff2e | 081ec178-99a1-4895-b96e-7c03cf8bbfdc | 0 | 1 | 0 | 1 | 0.577778 | 1.0 | 0.210526 | 0.032258 | 0.230769 | 0.4 | 1.0 | 0.0 | 0.333333 | 0.142857 | 0.0 |
| 002cf4ec-57d0-4ebf-a31b-88db4441ff2e | 17b274b5-aa3f-43cc-96ae-21283b7d1ca5 | 0 | 1 | 0 | 1 | 0.588889 | 1.0 | 0.176471 | 0.032258 | 0.384615 | 0.2 | 0.25 | 0.0 | 0.333333 | 0.0 | 0.0 |
| 002cf4ec-57d0-4ebf-a31b-88db4441ff2e | 1f70d4cd-3106-4d15-af9f-1617a43ca83f | 0 | 1 | 0 | 1 | 0.455556 | 1.0 | 0.235294 | 0.193548 | 0.307692 | 0.2 | 0.25 | 0.0 | 0.083333 | 0.142857 | 0.0 |
| 002cf4ec-57d0-4ebf-a31b-88db4441ff2e | 201c4dba-825a-42f2-b7a8-832b792af90b | 0 | 0 | 0 | 0 | 0.577778 | 0.611111 | 0.083333 | 0.129032 | 0.153846 | 0.6 | 0.25 | 0.0 | 0.416667 | 0.571429 | 0.0 |
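
The labeling step can be tried on a toy example. The sketch below uses made-up IDs and similarity values: a left merge keeps every candidate pair, pairs missing from the ground truth come back as `NaN`, and those are filled with 0. Resetting the index before the merge is an assumption made here to keep the join keys explicit; it is not necessarily required for the tutorial's `features` DataFrame.

```python
import pandas as pd

# Toy candidate-pair features indexed by (person_id_A, person_id_B).
# All IDs and similarity values here are made up for illustration.
toy_features = pd.DataFrame(
    {"first_name": [1.0, 0.4, 0.9], "postcode": [1.0, 0.2, 0.6]},
    index=pd.MultiIndex.from_tuples(
        [("a1", "b1"), ("a1", "b2"), ("a2", "b2")],
        names=["person_id_A", "person_id_B"],
    ),
)

# The ground truth lists only the true links.
toy_truth = pd.DataFrame(
    {
        "person_id_A": ["a1", "a2"],
        "person_id_B": ["b1", "b2"],
        "ground_truth": [1.0, 1.0],
    }
)

# Left merge: every candidate pair survives, matched or not.
toy_labeled = pd.merge(
    toy_features.reset_index(),
    toy_truth,
    on=["person_id_A", "person_id_B"],
    how="left",
).set_index(["person_id_A", "person_id_B"])

# Unmatched pairs have NaN labels; treat them as non-links.
toy_labeled["ground_truth"] = toy_labeled["ground_truth"].fillna(0.0)

print(toy_labeled)
```

The pair `("a1", "b2")` is a candidate that blocking produced but that is not a true link, so it ends up with a label of 0.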


## Separate Candidate Links into Train/Test

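Next, we'll split our labeled features into train and test sets, holding out 20% of the candidate pairs for testing and stratifying on the label so that both sets keep the same proportion of links.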

In [9]:
X = df_labeled_features.drop("ground_truth", axis=1)
y = df_labeled_features["ground_truth"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2
)
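
To see what `stratify` does, here is a minimal sketch on made-up labels where only 10% of the pairs are links. The toy arrays and the `random_state` value are assumptions added for this illustration; the tutorial's own split does not fix a seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 10 links among 100 candidate pairs (made-up numbers).
y_toy = np.array([1.0] * 10 + [0.0] * 90)
X_toy = np.arange(100).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=0
)

# Both sets keep the 10% link rate of the full toy data.
print(y_tr.mean(), y_te.mean())
```

Without stratification, a rare positive class can be over- or under-represented in a small test set purely by chance, which makes precision and recall estimates noisier.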

## Train ML Classifier

Though we're using a very simple machine learning model here, the important takeaway is to think of the classification step as a black box that produces a score indicating how likely it is that a given candidate record pair is a link. The only requirement is an output score; *how* that score is generated is up to you. Perhaps you just want to use SimSum, which could be considered an extremely simple "model". Maybe you want to build a neural net that ingests the comparison vectors and produces a score. Generally, in linking, the classification model is the simplest piece, and much more work will go into your blockers and comparators.

In [10]:
classifier = AdaBoostClassifier(n_estimators=64, learning_rate=0.5)

In [11]:
classifier.fit(X_train, y_train)

AdaBoostClassifier(learning_rate=0.5, n_estimators=64)
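
The "black box that produces a score" idea can be sketched as two interchangeable scoring functions. Everything below is illustrative: the data is randomly generated, and `simsum_scores` and `classifier_scores` are hypothetical helper names, not part of the tutorial's functions file. The SimSum variant here divides by the feature count so both scorers land in the range 0 to 1, which is an assumption made for comparability.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)

# Made-up comparison vectors: links have high similarities, non-links low.
X_links = rng.uniform(0.7, 1.0, size=(20, 4))
X_nonlinks = rng.uniform(0.0, 0.4, size=(80, 4))
X_toy = np.vstack([X_links, X_nonlinks])
y_toy = np.array([1.0] * 20 + [0.0] * 80)


def simsum_scores(X):
    """SimSum-style score: sum of similarities, scaled by the feature count."""
    X = np.asarray(X, dtype=float)
    return X.sum(axis=1) / X.shape[1]


def classifier_scores(model, X):
    """Score from any fitted model exposing predict_proba (probability of class 1)."""
    return model.predict_proba(X)[:, 1]


toy_model = AdaBoostClassifier(n_estimators=16, random_state=0).fit(X_toy, y_toy)

# Either scorer yields one number per candidate pair; downstream
# thresholding and evaluation do not need to know which one was used.
scores_simsum = simsum_scores(X_toy)
scores_model = classifier_scores(toy_model, X_toy)
print(scores_simsum[:3], scores_model[:3])
```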

## Predict Using ML Classifier

Here, we'll generate scores for our test set and arrange those predictions in a form useful for evaluation.

In [12]:
# Column 1 is the probability of the positive class (1.0, a link).
y_pred = classifier.predict_proba(X_test)[:, 1]

In [13]:
df_predictions = X_test.copy()
df_predictions["model_score"] = y_pred
df_predictions["ground_truth"] = y_test
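
Selecting column 1 of `predict_proba` works because scikit-learn orders the probability columns to match the fitted model's `classes_` attribute, and our labels are 0.0 and 1.0. A small sketch of looking the column up explicitly, using a `DecisionTreeClassifier` on made-up data as a stand-in for the tutorial's classifier:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up training set with a single similarity feature.
X_toy = np.array([[0.0], [0.1], [0.9], [1.0]])
y_toy = np.array([0.0, 0.0, 1.0, 1.0])

toy_model = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# predict_proba returns one column per class, ordered as in classes_.
positive_col = list(toy_model.classes_).index(1.0)

proba = toy_model.predict_proba(np.array([[0.05], [0.95]]))
link_scores = proba[:, positive_col]
print(toy_model.classes_, link_scores)
```

Looking up the column this way guards against surprises if the labels are ever encoded differently, for example as strings.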

## Choosing a Linking Model Score Threshold

As with SimSum, we're able to examine the resulting score distribution and precision/recall vs. model score threshold plot to determine where the cutoff should be set.

### Model Score Distribution

In [14]:
tutorial.plot_model_score_distribution(df_predictions)

### Precision and Recall vs. Model Score

In [15]:
df_eval, df_top_links, df_bottom_links = tutorial.evaluate_linking(
    df=df_predictions,
    df_true_links=df_ground_truth,
    df_left=df_A,
    df_right=df_B,
)

In [16]:
tutorial.plot_precision_recall_vs_threshold(df_eval)
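
The trade-off behind a precision/recall vs. threshold curve can be computed by hand. This is a minimal sketch on made-up scores and labels; `precision_recall_at` is a hypothetical helper, not the implementation inside `tutorial.evaluate_linking`.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when pairs scoring >= threshold are called links."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    # Convention chosen here: precision is 1.0 when nothing is predicted.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up model scores and ground truth labels for six candidate pairs.
toy_scores = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]

for threshold in (0.3, 0.5, 0.7):
    precision, recall = precision_recall_at(toy_scores, toy_labels, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Raising the threshold from 0.3 to 0.7 lifts precision from 0.75 to 1.00 while recall drops from 1.00 to 0.67. Note that this toy version only counts pairs that were scored; true links that never became candidate pairs during blocking would also lower recall in a full evaluation.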

### Top Scoring `k` Links

In [17]:
display_cols = [
    "first_name", "surname",
    "street_number", "address_1", "address_2", "suburb", "postcode", "state",
    "date_of_birth", "age", "phone_number", "soc_sec_id",
    "soundex_surname", "soundex_firstname",
    "nysiis_surname", "nysiis_firstname",
]

display_cols = [[f"{col}_x", f"{col}_y"] for col in display_cols]
display_cols = list(itertools.chain.from_iterable(display_cols))
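
The two lines above interleave each column's left and right versions so matching fields sit side by side. The `_x` and `_y` suffixes are the defaults pandas applies to overlapping column names in a merge; that `evaluate_linking` relies on those defaults is an assumption based on the column names in its output. A small sketch with two column names:

```python
import itertools

cols = ["first_name", "surname"]

# Build [left, right] pairs, then flatten them into one ordered list.
pairs = [[f"{col}_x", f"{col}_y"] for col in cols]
interleaved = list(itertools.chain.from_iterable(pairs))

print(interleaved)
# ['first_name_x', 'first_name_y', 'surname_x', 'surname_y']
```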

In [18]:
with pd.option_context('display.max_columns', None):
    display(df_top_links[["person_id_A", "person_id_B", "model_score", "ground_truth"] + display_cols])

|  | person_id_A | person_id_B | model_score | ground_truth | first_name_x | first_name_y | surname_x | surname_y | street_number_x | street_number_y | address_1_x | address_1_y | address_2_x | address_2_y | suburb_x | suburb_y | postcode_x | postcode_y | state_x | state_y | date_of_birth_x | date_of_birth_y | age_x | age_y | phone_number_x | phone_number_y | soc_sec_id_x | soc_sec_id_y | soundex_surname_x | soundex_surname_y | soundex_firstname_x | soundex_firstname_y | nysiis_surname_x | nysiis_surname_y | nysiis_firstname_x | nysiis_firstname_y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | d410e7a9-f529-4bf9-93f1-870b620b154b | 89a4f194-7136-4525-9124-eaecf315532e | 0.884431 | 1.0 | matteus | matteus | brayton | brayton | 888 | 888.0 | hall street | hall street |  |  | st ives | st ives | 3163 | 3163 | vic | vic | 19950306 | 19950306 |  |  | 04 44743841 | 04 44743841 | 7386748 | 7386748 | B635 | B635 | M320 | M320 | BRAYTAN | BRAYTAN | MAT | MAT |
| 1 | a228612f-f95a-456e-b755-49346d9c4ecd | c8c22c1c-edb2-4988-948a-c592f5026208 | 0.884431 | 1.0 | breeanne | breeanne | byers | byers | 29 | 29.0 | henning place | henning place |  |  | rosebery | roseery | 6010 | 6010 | act | act | 19200606 | 19200606 | 32.0 | 32.0 | 08 27010960 | 08 27102960 | 5985372 | 5985372 | B620 | B620 | B650 | B650 | BYAR | BYAR | BRAN | BRAN |
| 2 | 4588c4dd-0245-44ed-b21b-fda73d13d76d | acd3c67d-3364-45b8-a6b4-26c522b3f1c7 | 0.884431 | 1.0 | georgia | georgia | vincent | vincent | 25 | 25.0 | kyabra place | kyabra place |  |  | terrigal | terrigal | 4152 | 4152 | nsw | nsw | 19651104 | 19651104 |  |  | 07 42166622 | 07 42166622 | 4498407 | 4498407 | V525 | V525 | G620 | G620 | VANCAD | VANCAD | GARG | GARG |
| 3 | a4d6f6b5-673c-4a2a-9370-a038a23ee2b2 | f81bbdba-2ebf-4bb2-ac7d-058597615521 | 0.884431 | 1.0 | kirrah | kirrah | campbell | campbell | 109 | 1097.0 | mockridge crescent | mockridge grescent |  |  | grenfell | grenfelyl | 2902 | 2902 | nsw | nsw | 19901120 | 19901120 | 35.0 | 35.0 | 03 43918504 | 03 43918504 | 8101391 | 8101391 | C514 | C514 | K600 | K600 | CANPBAL | CANPBAL | CAR | CAR |
| 4 | 70fa191d-27c4-4d63-b226-e5c77b0db09c | 076632cd-daeb-49f8-806e-0595f0000954 | 0.884431 | 1.0 | jack | jack | clarke | clarke | 39 |  | pennefather street | pennefather street | surfrider carvn park | surfrider carvn park | orange | orange | 3186 | 3186 | nsw | nsw | 19901012 | 19901012 |  |  | 07 86614625 | 07 86625625 | 2257410 | 2257410 | C462 | C462 | J200 | J200 | CLARC | CLARC | JAC | JAC |
| 5 | bf461373-7b2f-4253-a40e-8f7e8132f604 | 0a27a29f-1d39-4b30-9bee-a6bee00124b5 | 0.884431 | 1.0 | riley | riley | clarke | clarke | 3 | 2.0 | wilkinson street | wilkinsoh street |  |  | willetton | willetton | 2452 | 2452 | nsw | nsw | 19480428 | 19480428 | 30.0 |  | 07 31713294 | 07 31713294 | 6921362 | 6921362 | C462 | C462 | R400 | R400 | CLARC | CLARC | RALY | RALY |
| 6 | 7b07437d-d090-468a-b8e8-93b561e926ee | edbc54ed-8673-4ec6-adf2-945ae1460495 | 0.884431 | 1.0 | harrison | harrison | satterley | satterley | 11 | 12.0 | connibere crescent | connibere crescent |  |  | nambour | nambiur | 6108 | 6108 | vic | vic | 19320711 | 19320711 |  |  | 08 55063090 | 08 55063090 | 2920537 | 2920537 | S364 | S364 | H625 | H625 | SATARLY | SATARLY | HARASAN | HARASAN |
| 7 | 151e0fe3-ca19-4448-bdd8-19302d974ffb | a9b3a7ca-f316-4d11-8191-63571dd791ae | 0.884431 | 1.0 | shane | shane | thorpe | thorpe | 212 | 215.0 | findlay street | findlay sireet | rowethorpe | rowethorpe | altona meadows | altona meadows | 4214 | 4214 | sa | sa | 19640706 | 19640706 | 23.0 | 23.0 | 07 43138893 | 07 43138893 | 7423946 | 7423946 | T610 | T610 | S500 | S500 | TARP | TARP | SAN | SAN |
| 8 | 67b496c1-9370-413a-956f-ccf9d88f8b86 | 0f1321a8-9eaf-4f2d-a4e5-a36cebf21e71 | 0.884431 | 1.0 | riley | riley | verschoor | verschoor | 76 | 76.0 | kallaroo road | kallaroo troad |  |  | warrandyte | warrandyte | 2228 | 2228 |  |  | 19231028 | 19231028 |  |  | 03 27434832 | 03 27434832 | 4500486 | 4500486 | V626 | V626 | R400 | R400 | VARSSAR | VARSSAR | RALY | RALY |
| 9 | c9554bae-effe-4817-99cf-86a774139882 | 4ac75e27-d9a7-45aa-aba0-27ef26d2e510 | 0.884431 | 1.0 | shona | shona | zimmermann | zimmermann | 8 | 9.0 | coree place | coree place | nittymarra | nittymarra | croydon north | croydon nirth | 3071 | 3071 | nsw | nsw | 19951214 | 19951214 |  |  | 07 25079244 | 07 25079244 | 4616045 | 4616045 | Z565 | Z565 | S500 | S500 | ZANARNAN | ZANARNAN | SAN | SAN |


### Bottom Scoring `k` Links

In [19]:
with pd.option_context('display.max_columns', None):
    display(df_bottom_links[["person_id_A", "person_id_B", "model_score", "ground_truth"] + display_cols])

|  | person_id_A | person_id_B | model_score | ground_truth | first_name_x | first_name_y | surname_x | surname_y | street_number_x | street_number_y | address_1_x | address_1_y | address_2_x | address_2_y | suburb_x | suburb_y | postcode_x | postcode_y | state_x | state_y | date_of_birth_x | date_of_birth_y | age_x | age_y | phone_number_x | phone_number_y | soc_sec_id_x | soc_sec_id_y | soundex_surname_x | soundex_surname_y | soundex_firstname_x | soundex_firstname_y | nysiis_surname_x | nysiis_surname_y | nysiis_firstname_x | nysiis_firstname_y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | b122b00d-d302-4aef-9c74-c11e6b2cc2e4 | 1d675f68-1304-4bb3-8c34-8443112da07c | 0.143183 | 0.0 | kayla | kyle | purser |  | 123 | 4.0 | alexander mackie circuit | sec 597 | tasman nursing home | pickles street | auburn | hayborough | 3170 | 3315 | nsw | wa | 19321111.0 | 19760280.0 | 52.0 | 33.0 | 08 03949807 | 07 48733932 | 2273139 | 3075588 | P626 |  | K400 | K400 | PARSAR |  | CAYL | CYL |
| 1 | 13fedda5-eda0-44fb-8003-83166c31d89d | be12466a-9206-48e5-b94e-6e555700b820 | 0.143183 | 0.0 | cooper | yasmin | dosmyk | fysh | 14 | 73.0 | gaunson crescent | skinner stdreet | sunnyview |  | goodwood | hamilton | 4872 | 4128 | nsw |  |  |  |  | 22.0 | 07 29344576 | 03 55583544 | 8766084 | 3940540 | D252 | F200 | C160 | Y255 | DASNYC | FY | CAPAR | YASNAN |
| 2 | 9a4ba8d9-7ddf-4e81-bf26-bc54250ba458 | 820f0562-ec1c-4b31-8001-e20c27a6e073 | 0.143183 | 0.0 | alice | aleci | blake | allanby | 111 |  | eddy crescent | barnett close | timaru heights |  | withcott | east melbourne | 2106 | 7235 | qld | nsw | 19610331.0 |  | 29.0 | 30.0 | 03 87603447 | 02 72232796 | 8926230 | 5534198 | B420 | A451 | A420 | A420 | BLAC | ALANBY | ALAC | ALAC |
| 3 | 1d4c6b0e-7364-4bd0-807c-dd5b8db0fd4e | 3981cc8a-6a30-45cc-a0bf-e781627c3591 | 0.143183 | 0.0 | heather | alexander | kusuma | bartel | 36 | 32.0 | o'shanassy street | owen dixon drive | kilvinton village | redhill | northmead | toomwooba | 2539 | 3644 | vic | qld |  |  |  | 30.0 | 07 29169503 | 03 85823530 | 4987129 | 6003762 | K250 | B634 | H360 | A425 | CASAN | BARTAL | HATAR | ALAXANDAR |
| 4 | 4b8d5ac6-0af3-43fa-b5b5-c016855bfe94 | dac45922-550a-48c7-90c7-21a8fa38c29a | 0.143183 | 0.0 | alexander | josyua |  |  | 42 | 8.0 | cherry street | james smkth circuit | villa 5 | bakers creek | eastwood | rose bay | 2028 | 4510 | tas | viy |  | 19070511.0 | 23.0 |  | 07 48262653 | 03 06398499 | 2961703 | 9862336 |  |  | A425 | J200 |  |  | ALAXANDAR | JASY |
| 5 | 1356140d-f406-49c0-a900-e478acb0430f | fd59acb6-2999-4213-8e1c-aadc4b11ce63 | 0.143183 | 0.0 | riley | zara | eglinton | crouch | 24 | 0.0 | govett place | florentine circuit | villa 2 |  | hoppers crossing | yennora | 3051 | 2508 | vic | saf |  |  | 28.0 |  | 04 01453310 | 03 92382226 | 4048177 | 9635100 | E245 | C620 | R400 | Z600 | EGLANTAN | CRAC | RALY | ZAR |
| 6 | 8eb65eff-12a2-462e-bd5f-f780b515542e | 098ebe8a-1125-4752-a525-60e662624684 | 0.143183 | 0.0 | sascha | patrick | curry | boma | 48 | 27.0 | clisby close | enid lorimer circuit | carinya ski ranch |  | yarraville | denililquin | 2765 | 4020 | nsw |  |  |  | 21.0 | 30.0 | 02 99388974 | 02 13063777 | 4847714 | 9980243 | C600 | B500 | S200 | P362 | CARY | BAN | SASS | PATRAC |
| 7 | 47fac021-d314-4dce-bdfb-4ffb1e5f2f78 | 9337f1fa-d19b-4256-87e1-f0e6ae3723f1 | 0.143183 | 0.0 | alexander | alessandria | monz | zellmer | 20 | 611.0 | albermarle place | grimmett close |  | ocean hunter | canley vale | robertson | 2011 | 3170 | qld | vic | 19511018.0 | 19567107.0 | 35.0 | 33.0 | 04 70580988 | 08 99877508 | 4796528 | 2563427 | M520 | Z456 | A425 | A425 | MAN | ZALNAR | ALAXANDAR | ALASANDR |
| 8 | b5acaf6d-d0e8-47a7-9e8a-807d0e17c451 | 0cd60f41-d51b-4952-b1c0-d190585ee67b | 0.143183 | 0.0 | jack | joshua | renfrey | matthews | 22 | 32.0 | gurney place | watts street | erina gardns |  | dover heights | southbank | 3133 | 3340 |  | nsw | 19520114.0 | 19521122.0 | 27.0 |  | 03 36075988 | 07 89548454 | 5650110 | 3331727 | R516 | M320 | J200 | J200 | RANFRY | MATAE | JAC | JAS |
| 9 | 595eb080-9857-4534-bff0-70d606a98af1 | 2078a609-8ed2-43a6-8c39-7dd9bb3621dc | 0.143183 | 0.0 | shona | sean | lowe | george | 5 | 237.0 | eungella street | colebatch place | tralee cottage |  | lindfield | isle of capri | 3672 | 4807 | nsw | tla | 19680223.0 | 19095208.0 | 28.0 |  | 02 46089981 | 04 43707917 | 6039938 | 7930217 | L000 | G620 | S500 | S500 | LAO | GARG | SAN | SAN |
