# Train an ML Classifier to Link FEBRL People Data

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rachhouse/intro-to-data-linking/blob/main/tutorial_notebooks/02_Link_FEBRL_Data_with_ML_Classifier.ipynb)

In this tutorial, we'll train a machine learning classifier to score candidate pairs for linking, using supervised learning. We will use the same training dataset as the SimSum classification tutorial, as well as the same augmentation, blocking, and comparing functions. The functions have been included in a separate `.py` file for re-use and convenience, so we can focus on code unique to this tutorial.

The SimSum classification tutorial walks through augmentation, blocking, and comparing in more detail. Since we're using the same functions here, we'll keep those steps brief. Please see the SimSum tutorial if you need a refresher.

## Google Colab Setup

In [1]:
# Check if we're running locally, or in Google Colab.
try:
    import google.colab
    COLAB = True
except ModuleNotFoundError:
    COLAB = False
    
# If we're running in Colab, download the tutorial functions file 
# to the Colab session local directory, and install required libraries.
if COLAB:
    import requests
    
    tutorial_functions_url = "https://raw.githubusercontent.com/rachhouse/intro-to-data-linking/main/tutorial_notebooks/linking_tutorial_functions.py"
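    r = requests.get(tutorial_functions_url)
    # Fail loudly on a bad download rather than saving an error page as a .py file.
    r.raise_for_status()
    
    with open("linking_tutorial_functions.py", "w") as fh:
        fh.write(r.text)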
    
    !pip install -q recordlinkage jellyfish altair

## Imports

In [2]:
import itertools

import altair as alt
import pandas as pd

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

In [3]:
# Import our previously used linking functions from the tutorial functions file
# (downloaded to the session directory in the setup cell when running in Colab).
import linking_tutorial_functions as tutorial

## Load Training Data and Ground Truth Labels

In [4]:
df_A, df_B, df_ground_truth = tutorial.load_febrl_training_data(COLAB)

## Data Augmentation

In [5]:
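# Note: this loop relies on augment_data modifying each DataFrame in place.
# Rebinding the loop variable `df` does not itself update df_A or df_B.
for df in [df_A, df_B]:
    df = tutorial.augment_data(df)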

## Blocking

In [6]:
candidate_links = tutorial.block(df_A, df_B)

## Comparing

In [7]:
%%time

features = tutorial.compare(candidate_links, df_A, df_B)

CPU times: user 43.8 s, sys: 1.33 s, total: 45.1 s
Wall time: 44.4 s


In [8]:
features.head()

person_id_A,person_id_B,soundex_surname,soundex_firstname,nysiis_surname,nysiis_firstname,last_name,first_name,address_1,address_2,suburb,postcode,state,date_of_birth,phone_number,ssn
002cf4ec-57d0-4ebf-a31b-88db4441ff2e,061b9c3f-afbe-41e5-923b-3de29a4e5b82,0,1,0,1,0.0,1.0,0.263158,0.225806,0.384615,0.2,0.25,0.0,0.333333,0.0
002cf4ec-57d0-4ebf-a31b-88db4441ff2e,081ec178-99a1-4895-b96e-7c03cf8bbfdc,0,1,0,1,0.577778,1.0,0.210526,0.032258,0.230769,0.4,1.0,0.0,0.333333,0.142857
002cf4ec-57d0-4ebf-a31b-88db4441ff2e,17b274b5-aa3f-43cc-96ae-21283b7d1ca5,0,1,0,1,0.588889,1.0,0.176471,0.032258,0.384615,0.2,0.25,0.0,0.333333,0.0
002cf4ec-57d0-4ebf-a31b-88db4441ff2e,1f70d4cd-3106-4d15-af9f-1617a43ca83f,0,1,0,1,0.455556,1.0,0.235294,0.193548,0.307692,0.2,0.25,0.0,0.083333,0.142857
002cf4ec-57d0-4ebf-a31b-88db4441ff2e,201c4dba-825a-42f2-b7a8-832b792af90b,0,0,0,0,0.577778,0.611111,0.083333,0.129032,0.153846,0.6,0.25,0.0,0.416667,0.571429


## Add Labels to Feature Vectors

We've augmented, blocked, and compared, so now we're ready to train a classification model that scores candidate record pairs by how likely they are to be a link. As we did when classifying links via SimSum, we'll append our ground truth values to the features DataFrame. Candidate pairs that do not appear in the ground truth are labeled as non-links.

In [9]:
df_ground_truth["ground_truth"] = df_ground_truth["ground_truth"].apply(lambda x: 1.0 if x else 0.0)

df_labeled_features = pd.merge(
    features,
    df_ground_truth,
    on=["person_id_A", "person_id_B"],
    how="left"
)

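# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which newer pandas versions warn about and may not apply.
df_labeled_features["ground_truth"] = df_labeled_features["ground_truth"].fillna(0)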
df_labeled_features.head()

person_id_A,person_id_B,soundex_surname,soundex_firstname,nysiis_surname,nysiis_firstname,last_name,first_name,address_1,address_2,suburb,postcode,state,date_of_birth,phone_number,ssn,ground_truth
002cf4ec-57d0-4ebf-a31b-88db4441ff2e,061b9c3f-afbe-41e5-923b-3de29a4e5b82,0,1,0,1,0.0,1.0,0.263158,0.225806,0.384615,0.2,0.25,0.0,0.333333,0.0,0.0
002cf4ec-57d0-4ebf-a31b-88db4441ff2e,081ec178-99a1-4895-b96e-7c03cf8bbfdc,0,1,0,1,0.577778,1.0,0.210526,0.032258,0.230769,0.4,1.0,0.0,0.333333,0.142857,0.0
002cf4ec-57d0-4ebf-a31b-88db4441ff2e,17b274b5-aa3f-43cc-96ae-21283b7d1ca5,0,1,0,1,0.588889,1.0,0.176471,0.032258,0.384615,0.2,0.25,0.0,0.333333,0.0,0.0
002cf4ec-57d0-4ebf-a31b-88db4441ff2e,1f70d4cd-3106-4d15-af9f-1617a43ca83f,0,1,0,1,0.455556,1.0,0.235294,0.193548,0.307692,0.2,0.25,0.0,0.083333,0.142857,0.0
002cf4ec-57d0-4ebf-a31b-88db4441ff2e,201c4dba-825a-42f2-b7a8-832b792af90b,0,0,0,0,0.577778,0.611111,0.083333,0.129032,0.153846,0.6,0.25,0.0,0.416667,0.571429,0.0
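
Conceptually, the left merge followed by `fillna(0)` marks a candidate pair as a link only if it appears in the ground truth. Here is a minimal plain-Python sketch of that logic on made-up pair IDs. The names `label_pairs`, `candidate_pairs`, and `true_links` are illustrative and are not part of the tutorial functions.

```python
def label_pairs(candidate_pairs, true_links):
    """Return {pair: 1.0 or 0.0}; pairs absent from the ground truth get 0.0."""
    true_link_set = set(true_links)
    return {pair: 1.0 if pair in true_link_set else 0.0 for pair in candidate_pairs}


# Hypothetical IDs, for illustration only.
candidate_pairs = [("a1", "b1"), ("a1", "b2"), ("a2", "b2")]
true_links = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]

labels = label_pairs(candidate_pairs, true_links)
print(labels)
```

In this toy data, `("a3", "b3")` is a true link that never became a candidate pair, so it receives no label at all. A true link lost during blocking cannot be recovered by the classifier.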


## Separate Candidate Links into Train/Test

Next, we'll split the labeled features into train and test sets, stratifying on the label so that both sets keep the same proportion of links.

In [10]:
X = df_labeled_features.drop("ground_truth", axis=1)
y = df_labeled_features["ground_truth"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2
)
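
Stratifying matters because true links are usually a small fraction of the candidate pairs, so an unstratified split could leave one of the sets with very few links. The sketch below illustrates the idea in plain Python on toy labels. It is not how scikit-learn implements `train_test_split`, and the helper name `stratified_split` is made up for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Split indices into train/test so each class keeps roughly the same share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 10 links among 100 candidate pairs.
labels = [1.0] * 10 + [0.0] * 90
train_idx, test_idx = stratified_split(labels, test_size=0.2)

n_test_links = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), n_test_links)
```

Each class is shuffled and split separately, so the test set receives 20% of the links and 20% of the non-links.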

## Train ML Classifier

Though we're using a very simple machine learning model here, the important takeaway is to treat the classification step as a black box that produces a score indicating how likely a given candidate record pair is to be a link. The only requirement is an output score; *how* that score is generated is up to you. Perhaps you just want to use SimSum, which could be considered an extremely simple "model". Maybe you want to build a neural net that ingests the comparison vectors and produces a score. Generally, in linking, the classification model is the simplest piece, and much more work will go into your blockers and comparators.
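
To make the black box idea concrete, here is a small sketch in which two interchangeable scorers map a comparison vector to a score. The function names, toy vectors, and weights are invented for illustration. The only contract is "comparison vector in, score out".

```python
import math


def simsum_score(comparison_vector):
    """SimSum: add up the individual comparison similarities."""
    return sum(comparison_vector)


def weighted_logistic_score(comparison_vector, weights, bias):
    """A stand-in for any trained model: weighted sum squashed into (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, comparison_vector))
    return 1.0 / (1.0 + math.exp(-z))


# Toy comparison vectors: (first name, surname, date of birth) similarities.
likely_link = [1.0, 0.9, 1.0]
unlikely_link = [0.2, 0.1, 0.0]

# Made-up weights; a real model would learn these from labeled pairs.
weights, bias = [2.0, 2.0, 3.0], -4.0

for vector in (likely_link, unlikely_link):
    print(simsum_score(vector), weighted_logistic_score(vector, weights, bias))
```

Either scorer can feed the same thresholding and evaluation steps, which is why the classifier can be swapped without touching the blockers or comparators.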

In [11]:
classifier = AdaBoostClassifier(n_estimators=64, learning_rate=0.5)

In [12]:
classifier.fit(X_train, y_train)

AdaBoostClassifier(learning_rate=0.5, n_estimators=64)

## Predict Using ML Classifier

Here, we'll generate scores for our test set and collect them alongside the ground truth labels in a form convenient for evaluation.

In [13]:
# Column 1 holds the predicted probability of the positive (link) class.
y_pred = classifier.predict_proba(X_test)[:, 1]

In [14]:
df_predictions = X_test.copy()
df_predictions["model_score"] = y_pred
df_predictions["ground_truth"] = y_test

## Choosing a Linking Model Score Threshold

As with SimSum, we can examine the resulting score distribution and the plot of precision and recall vs. model score threshold to decide where to set the cutoff.
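
For reference, precision and recall at a given threshold can be computed as in the sketch below. This is a plain-Python illustration on made-up scores, not the implementation inside `tutorial.evaluate_linking`. In particular, the optional `n_true_links` argument reflects an assumption that recall may be measured against all known true links rather than only those that survived blocking.

```python
def precision_recall_at(threshold, scores, labels, n_true_links=None):
    """Precision and recall when pairs scoring >= threshold are called links."""
    predicted = [label for score, label in zip(scores, labels) if score >= threshold]
    true_positives = sum(predicted)
    if n_true_links is None:
        n_true_links = sum(labels)
    # Convention used here: with no predicted links, precision is reported as 1.0.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / n_true_links if n_true_links else 0.0
    return precision, recall


# Made-up model scores and ground truth labels for six candidate pairs.
scores = [0.88, 0.81, 0.60, 0.45, 0.20, 0.15]
labels = [1, 1, 0, 1, 0, 0]

for threshold in (0.3, 0.5, 0.7):
    precision, recall = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, recall never increases as the threshold rises, while precision can move in either direction, which is why it helps to look at both curves together.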

### Model Score Distribution

In [15]:
tutorial.plot_model_score_distribution(df_predictions)
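
If you want the numbers behind a score distribution plot, a histogram split by label can be tallied directly. The sketch below uses made-up scores and a hypothetical helper, `score_histogram`; the tutorial's plotting function may bin the scores differently.

```python
from collections import Counter


def score_histogram(scores, labels, n_bins=5):
    """Count scores per equal-width bin on [0, 1], separately for each label."""
    counts = Counter()
    for score, label in zip(scores, labels):
        # Clamp so that a score of exactly 1.0 falls in the last bin.
        bin_index = min(int(score * n_bins), n_bins - 1)
        counts[(bin_index, label)] += 1
    return counts


# Made-up scores and labels, for illustration only.
scores = [0.88, 0.81, 0.65, 0.45, 0.25, 0.15, 0.15, 1.0]
labels = [1, 1, 0, 1, 0, 0, 0, 1]

histogram = score_histogram(scores, labels)
for bin_index in range(5):
    links = histogram[(bin_index, 1)]
    non_links = histogram[(bin_index, 0)]
    print(f"bin {bin_index}: links={links}, non-links={non_links}")
```

A well-separated model concentrates links in the highest bins and non-links in the lowest ones; bins containing both are where the threshold choice matters.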

### Precision and Recall vs. Model Score

In [16]:
df_eval, df_top_links, df_bottom_links = tutorial.evaluate_linking(
    df=df_predictions,
    df_true_links=df_ground_truth,
    df_left=df_A,
    df_right=df_B,
)

In [17]:
tutorial.plot_precision_recall_vs_threshold(df_eval)

### Top Scoring `k` Links

In [18]:
display_cols = [
    "first_name", "surname",
    "street_number", "address_1", "address_2", "suburb", "postcode", "state",
    "date_of_birth", "age", "phone_number", "soc_sec_id",
    "soundex_surname", "soundex_firstname",
    "nysiis_surname", "nysiis_firstname",
]

display_cols = [[f"{col}_x", f"{col}_y"] for col in display_cols]
display_cols = list(itertools.chain.from_iterable(display_cols))

In [19]:
with pd.option_context('display.max_columns', None):
    display(df_top_links[["person_id_A", "person_id_B", "model_score", "ground_truth"] + display_cols])

,person_id_A,person_id_B,model_score,ground_truth,first_name_x,first_name_y,surname_x,surname_y,street_number_x,street_number_y,address_1_x,address_1_y,address_2_x,address_2_y,suburb_x,suburb_y,postcode_x,postcode_y,state_x,state_y,date_of_birth_x,date_of_birth_y,age_x,age_y,phone_number_x,phone_number_y,soc_sec_id_x,soc_sec_id_y,soundex_surname_x,soundex_surname_y,soundex_firstname_x,soundex_firstname_y,nysiis_surname_x,nysiis_surname_y,nysiis_firstname_x,nysiis_firstname_y
0,abef26d3-1003-45f1-a2dd-5780817f6c30,0b03d953-9e99-4e55-bfde-8ed3f96f7b65,0.880828,1.0,bella,bella,stanley,stanley,113,100.0,beasley street,beasley street,,,cudal,cudal,4807,4807,act,act,19640112,19640112,34.0,,07 11189696,07 11189696,9117455,9117455,S354,S354,B400,B400,STANLY,STANLY,BAL,BAL
1,80593dc0-864c-4de6-88c3-8acaba0e888d,4871a8f6-e219-4cb0-ade7-634eeba085be,0.880828,1.0,freya,freya,binns,binns,38,0.0,amagula avenue,amagula avenue,,,mount isa,mountm isa,4355,4355,,,19041205,19041205,32.0,32.0,07 92241480,07 92241480,7793455,7793455,B520,B520,F600,F600,BAN,BAN,FRAY,FRAY
2,49d22fd4-764f-4082-b6df-b279450a0c8c,58ea2735-398d-40a4-b48f-51a1733b65db,0.880828,1.0,james,james,morrison,morrison,11,15.0,ingram street,ingram street,villa 2,villa 2,noble park,noble park,2148,2148,nsw,nsw,19810409,19810409,30.0,,04 07562543,04 07562543,3227052,3227052,M625,M625,J520,J520,MARASAN,MARASAN,JAN,JAN
3,bf461373-7b2f-4253-a40e-8f7e8132f604,0a27a29f-1d39-4b30-9bee-a6bee00124b5,0.880828,1.0,riley,riley,clarke,clarke,3,2.0,wilkinson street,wilkinsoh street,,,willetton,willetton,2452,2452,nsw,nsw,19480428,19480428,30.0,,07 31713294,07 31713294,6921362,6921362,C462,C462,R400,R400,CLARC,CLARC,RALY,RALY
4,c5fc7631-afc5-4107-8457-0d960219b324,a933aeae-ec54-48cb-b8e4-7d5d629368fc,0.880828,1.0,dylan,dylan,kelley,kelley,5,7.0,sid barnes crescent,sid barnes crescent,,,millbank,millbank,3108,3108,vic,vic,19020621,19020621,23.0,37.0,04 43032020,04 43032020,7831682,7831682,K400,K400,D450,D450,CALY,CALY,DYLAN,DYLAN
5,1d5b6895-eca8-4628-a3ec-c68838fdddab,fa404b4e-b2f5-45bf-8aea-7051978e3f2d,0.880828,1.0,genevieve,genevieve,campbell,campbell,6,6.0,spica street,spica street,westmead accom,westmeaev accom,salisbury east,salisbury east,5086,5086,vic,vic,19720614,19720614,33.0,33.0,07 51302450,07 51302450,9459186,9459186,C514,C514,G511,G511,CANPBAL,CANPBAL,GANAFAAF,GANAFAAF
6,1c7576d8-19dc-4db3-8759-cab289fd0989,e44d9d4c-060f-405c-adc0-949a266d02cf,0.880828,1.0,lachlan,lachlan,turale,turale,313,313.0,marconi crescent,marconi cerscent,,,forrestfield,forrestfield,3669,3669,nsw,nsw,19280322,19280322,24.0,,02 52599111,02 52599111,7721100,7721100,T640,T640,L245,L245,TARAL,TARAL,LACLAN,LACLAN
7,19a54fc1-5086-4659-bb96-ff2b246e9587,00a71293-0d63-4ac3-a772-65b8c61ff31e,0.880828,1.0,riley,riley,green,green,57,4.0,searle place,searle place,ethelton,ethelton,strathalbyn,strathakbyn,5159,5159,vic,vic,19960104,19960104,,,03 04721186,03 04721186,4245711,4245711,G650,G650,R400,R400,GRAN,GRAN,RALY,RALY
8,bf501d9c-b1d2-4fd4-a0c4-2ef401d65888,6f386c16-8b08-43be-9b7d-eb33b057c66a,0.880828,1.0,madison,madison,green,green,116,111.0,white crescent,white crescent,,,swan hill,swan hill,4306,4306,nsw,nsw,19910712,19910712,,,04 42108404,04 42104904,7691535,7691535,G650,G650,M325,M325,GRAN,GRAN,MADASAN,MADASAN
9,589304c5-04dc-47b7-979d-3e76ad1dc8ba,b618b865-9705-4254-9846-4a2936d85516,0.880828,1.0,liam,liam,dixon,dixon,100,,balfour crescent,balfour crxscent,,,avondale heights,avondale biights,2560,2560,vic,vic,19640504,19640504,32.0,32.0,04 52257300,04 52257300,2141541,2141541,D250,D250,L500,L500,DAXAN,DAXAN,LAN,LAN


### Bottom Scoring `k` Links

In [20]:
with pd.option_context('display.max_columns', None):
    display(df_bottom_links[["person_id_A", "person_id_B", "model_score", "ground_truth"] + display_cols])

,person_id_A,person_id_B,model_score,ground_truth,first_name_x,first_name_y,surname_x,surname_y,street_number_x,street_number_y,address_1_x,address_1_y,address_2_x,address_2_y,suburb_x,suburb_y,postcode_x,postcode_y,state_x,state_y,date_of_birth_x,date_of_birth_y,age_x,age_y,phone_number_x,phone_number_y,soc_sec_id_x,soc_sec_id_y,soundex_surname_x,soundex_surname_y,soundex_firstname_x,soundex_firstname_y,nysiis_surname_x,nysiis_surname_y,nysiis_firstname_x,nysiis_firstname_y
0,0244e89d-eb8b-480e-a749-71ea4d80c330,89c88233-e4c8-464f-865d-78b4346d8ece,0.148794,0.0,oliver,tarp,oddy,sedorkw,177,418,sproule circuit,roughley place,,villa 444 the village glen,deer park,eden,2227,3930,,vic,,,,,02 47762768,03 96637595,2049594,4730440,O300,S362,O416,T610,ODY,SADARCW,OLAVAR,TARP
1,471e0b2b-2c35-4efd-adc2-fe61ccc934cd,f857c553-b7c5-40e4-b913-108db4bbcb38,0.148794,0.0,henry,chelsse,stapley,perin,17,11,calder crescent,rowe plkace,posetto,,peakhurst,whalan,2083,4183,nsw,vix,,,28.0,22.0,08 38682853,03 48813481,7894319,4034576,S314,P650,H560,C420,STAPLY,PARAN,HANRY,CALS
2,f6a52f3d-d667-44c4-86fc-f5ba45011d51,e5969f04-70c8-4a31-95a1-9e7bf2461506,0.148794,0.0,anthony,joshua,tregloan,,6,53,legge street,hardie close,rivulett cottage,,queenscliff,parkdfle,6054,2616,,ws,,,24.0,,02 24213411,03 31298292,2272039,7768091,T624,,A535,J200,TRAGLAN,,ANTANY,JAS
3,ed5b595e-5ba7-4a0e-9ac9-ac1ea289735d,6c188e34-7ef1-4c51-ae45-4ebe82f7498a,0.148794,0.0,jordan,georgiz,blake,daisp,143,0,bennett street,currong street,barkool,,hamilton,cleveland,3942,2161,vic,nsw,,,23.0,34.0,02 55033822,,6068538,8087116,B420,D210,J635,G622,BLAC,DASP,JARDAN,GARG
4,4607801e-4448-44be-9043-6948ddee41ca,c2cdd4bf-8b33-45e3-90cc-6521879c8488,0.148794,0.0,,,roberts-yates,hiltovn,32,4,kambalda crescent,rowntree rescent,brentwood vlge,cheviot hills,ruse,parramatta,2615,3056,vic,wa,19380618.0,19371103.0,10.0,27.0,08 91734054,02 51850443,6303115,4759937,R163,H431,,,RABARTS-YAT,HALTAVN,,
5,359e07cb-9997-48c7-a1b6-6f95c7ee229f,62014cbf-f083-41d5-9f7d-69729314b177,0.148794,0.0,sarah,talia,green,whitw,76,36,dwyer street,longerenong street,,brentwood vlge,port augusta,oakleigh,2380,2312,,sa,,,29.0,33.0,02 05411606,03 38212360,4786823,9690114,G650,W300,S600,T400,GRAN,WATW,SAR,TAL
6,31b6cad9-19ad-45de-89ed-5ddfc7897329,81296b2a-4c5d-45e0-a5e7-e5637ed78f1c,0.148794,0.0,charlie,jonah,cullinane,mcgregwor,13,35,corinna street,pinschof place,,glenview,parkville,emerald,2464,2650,nsw,,,,,28.0,,04 36186071,4342822,3588806,C455,M262,C640,J500,CALANAN,MCGRAGWAR,CARLY,JAN
7,fc846804-61b8-479b-a9fd-31eec5c9b76b,3a1ee3d8-f5db-4210-a77a-cfa8c967b562,0.148794,0.0,kira,fervus,blake,camp,37,16,terry close,,tudor,pine hill,ingham,phillip,4151,6011,nsw,sa,,,34.0,22.0,02 34426186,02 50817810,1930220,9443281,B420,C510,K600,F612,BLAC,CANP,CAR,FARV
8,1fcabdce-1474-40c0-a63d-a1eda9e571a6,5e01b1b1-edbf-4044-8939-91350bb3c15c,0.148794,0.0,hannah,chloe,spagnoletti,beams,83,159,kauper street,henslowe place,the manor garden,alexander's folly,marsden,bankgor,2611,5052,,nsz,,,24.0,28.0,07 45650733,02 82303969,5866492,9120603,S125,B520,H500,C400,SPAGNALAT,BAN,HAN,CL
9,d11bedaf-4bd1-4f38-8024-ad69a25f3531,e25465a3-0fd8-4528-a6f2-61d8aa6e425b,0.148794,0.0,holly,jaden,ryan,beckmith,10,6,fremantle drive,smeaton circuit,,bulliac,st agnes,woollahra,2830,3173,nsw,qld,,,36.0,3.0,02 04081167,08 59980211,6562489,7454955,R500,B253,H400,J350,RYAN,BACNAT,HALY,JADAN
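
Selecting the top and bottom `k` pairs by score is straightforward. The sketch below does it with the standard library on made-up data; the record layout and the helper names `top_k` and `bottom_k` are assumptions, not the internals of `tutorial.evaluate_linking`. Because many pairs can share exactly the same score, as in the tables above, which of the tied pairs make the cut should be treated as arbitrary.

```python
import heapq

# Made-up (pair_id, model_score) records.
scored_pairs = [
    ("pair_a", 0.88), ("pair_b", 0.15), ("pair_c", 0.88),
    ("pair_d", 0.52), ("pair_e", 0.15), ("pair_f", 0.73),
]


def top_k(pairs, k):
    """The k highest-scoring pairs, best first."""
    return heapq.nlargest(k, pairs, key=lambda pair: pair[1])


def bottom_k(pairs, k):
    """The k lowest-scoring pairs, worst first."""
    return heapq.nsmallest(k, pairs, key=lambda pair: pair[1])


print(top_k(scored_pairs, 2))
print(bottom_k(scored_pairs, 2))
```

Reviewing both extremes is a quick sanity check: the top of the list should look like obvious matches and the bottom like obvious non-matches.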
