# Train an ML Classifier to Link FEBRL People Data

<a href="https://colab.research.google.com/github/rachhouse/intro-to-data-linking/blob/main/tutorial_notebooks/02_Link_FEBRL_Data_with_ML_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>

In this tutorial, we'll train a machine learning classifier to score candidate pairs for linking, using supervised learning. We will use the same training dataset as the SimSum classification tutorial, as well as the same augmentation, blocking, and comparing functions. The functions have been included in a separate `.py` file for re-use and convenience, so we can focus on code unique to this tutorial.

The SimSum classification tutorial walks through augmentation, blocking, and comparing in detail. Because we reuse the same functions here, this tutorial covers those steps only briefly. Please see the SimSum tutorial if you need a refresher.

## Google Colab Setup

In [1]:
# Check if we're running locally, or in Google Colab.
try:
    import google.colab
    COLAB = True
except ModuleNotFoundError:
    COLAB = False
    
# If we're running in Colab, download the tutorial functions file 
# to the Colab session local directory, and install required libraries.
if COLAB:
    import requests
    
    tutorial_functions_url = "https://raw.githubusercontent.com/rachhouse/intro-to-data-linking/main/tutorial_notebooks/linking_tutorial_functions.py"
    r = requests.get(tutorial_functions_url)
    r.raise_for_status()  # Fail fast if the download did not succeed.
    
    with open("linking_tutorial_functions.py", "w") as fh:
        fh.write(r.text)
    
    !pip install -q recordlinkage jellyfish altair

## Imports

In [2]:
import itertools

import altair as alt
import pandas as pd

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

In [3]:
# Import our previously used linking functions from the tutorial functions file
# (downloaded in the setup cell above when running in Colab).
import linking_tutorial_functions as tutorial

## Load Training Data and Ground Truth Labels

In [4]:
df_A, df_B, df_ground_truth = tutorial.load_febrl_training_data(COLAB)

## Data Augmentation

In [5]:
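# Assign the results back explicitly: rebinding a loop variable would not
# update df_A or df_B if augment_data returned a new DataFrame.
df_A = tutorial.augment_data(df_A)
df_B = tutorial.augment_data(df_B)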

## Blocking

In [6]:
candidate_links = tutorial.block(df_A, df_B)

## Comparing

In [7]:
%%time

features = tutorial.compare(candidate_links, df_A, df_B)

CPU times: user 1min 18s, sys: 2.25 s, total: 1min 20s
Wall time: 1min 19s


In [8]:
features.head()

person_id_A,person_id_B,soundex_surname,soundex_firstname,nysiis_surname,nysiis_firstname,last_name,first_name,address_1,address_2,suburb,postcode,state,date_of_birth,phone_number,ssn
00062cca-a85a-43fe-a309-2d1a47a58323,02759653-c8db-4587-8821-d154a4c32498,0,1,0,0,0.6,0.933333,0.352941,0.076923,0.222222,0.4,0.333333,0.0,0.333333,0.0
00062cca-a85a-43fe-a309-2d1a47a58323,02ce7446-e904-4c51-ab30-c69e6a0f8ff0,0,0,0,0,0.466667,0.577778,0.133333,0.0625,0.357143,0.2,0.25,0.0,0.416667,0.428571
00062cca-a85a-43fe-a309-2d1a47a58323,033c561a-5a00-4a50-a576-28481298630c,1,0,1,0,1.0,0.577778,0.230769,1.0,0.1,0.2,0.25,0.0,0.083333,0.0
00062cca-a85a-43fe-a309-2d1a47a58323,04562435-59aa-4740-b84f-af3ba0f1463a,1,0,1,0,1.0,0.0,0.25,0.090909,0.1875,0.4,0.25,0.0,0.333333,0.0
00062cca-a85a-43fe-a309-2d1a47a58323,07cbfeea-7430-467d-98fc-dca36acc9853,1,0,0,0,0.88,0.425926,0.230769,1.0,0.444444,0.6,0.25,0.0,0.416667,0.0


## Add Labels to Feature Vectors

We've augmented, blocked, and compared, so now we're ready to train a classification model which can score candidate record pairs on how likely it is that they are a link. As we did when classifying links via SimSum, we'll append our ground truth values to the features DataFrame.

In [9]:
df_ground_truth["ground_truth"] = df_ground_truth["ground_truth"].apply(lambda x: 1.0 if x else 0.0)

df_labeled_features = pd.merge(
    features,
    df_ground_truth,
    on=["person_id_A", "person_id_B"],
    how="left"
)

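# Candidate pairs absent from the ground truth are non-links, so label them 0.
# Assigning the result back avoids the deprecated chained in-place fillna.
df_labeled_features["ground_truth"] = df_labeled_features["ground_truth"].fillna(0)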
df_labeled_features.head()

person_id_A,person_id_B,soundex_surname,soundex_firstname,nysiis_surname,nysiis_firstname,last_name,first_name,address_1,address_2,suburb,postcode,state,date_of_birth,phone_number,ssn,ground_truth
00062cca-a85a-43fe-a309-2d1a47a58323,02759653-c8db-4587-8821-d154a4c32498,0,1,0,0,0.6,0.933333,0.352941,0.076923,0.222222,0.4,0.333333,0.0,0.333333,0.0,0.0
00062cca-a85a-43fe-a309-2d1a47a58323,02ce7446-e904-4c51-ab30-c69e6a0f8ff0,0,0,0,0,0.466667,0.577778,0.133333,0.0625,0.357143,0.2,0.25,0.0,0.416667,0.428571,0.0
00062cca-a85a-43fe-a309-2d1a47a58323,033c561a-5a00-4a50-a576-28481298630c,1,0,1,0,1.0,0.577778,0.230769,1.0,0.1,0.2,0.25,0.0,0.083333,0.0,0.0
00062cca-a85a-43fe-a309-2d1a47a58323,04562435-59aa-4740-b84f-af3ba0f1463a,1,0,1,0,1.0,0.0,0.25,0.090909,0.1875,0.4,0.25,0.0,0.333333,0.0,0.0
00062cca-a85a-43fe-a309-2d1a47a58323,07cbfeea-7430-467d-98fc-dca36acc9853,1,0,0,0,0.88,0.425926,0.230769,1.0,0.444444,0.6,0.25,0.0,0.416667,0.0,0.0
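
In plain Python terms, the left merge followed by `fillna(0)` amounts to a membership test: every candidate pair is kept, and it is labeled 1.0 only if it appears in the ground truth. The sketch below uses made-up IDs and plain tuples rather than the tutorial's DataFrames, so treat it as an illustration of the logic and not of the pandas call itself.

```python
# Hypothetical candidate pairs produced by blocking (IDs are made up).
candidate_pairs = [("a1", "b1"), ("a1", "b2"), ("a2", "b2")]

# Hypothetical ground truth: only known true links are listed.
# ("a3", "b3") is a true link that blocking failed to propose.
true_links = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}

# Equivalent of the left merge + fillna(0): keep every candidate pair,
# label it 1.0 if it is a known link, and 0.0 otherwise.
labels = {
    pair: (1.0 if pair in true_links else 0.0)
    for pair in candidate_pairs
}

# True links that never became candidates get no row at all.
missed_by_blocking = true_links - set(candidate_pairs)

print(labels)
print(missed_by_blocking)
```

One consequence worth noticing: a true link that blocking never proposed does not show up in the labeled features, so the classifier can neither learn from it nor be scored on it.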


## Separate Candidate Links into Train/Test

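Next, we'll split the labeled feature vectors into training and test sets, holding out 20% of the candidate pairs for testing. Passing `stratify=y` keeps the proportion of true links roughly the same in both sets.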

In [10]:
X = df_labeled_features.drop("ground_truth", axis=1)
y = df_labeled_features["ground_truth"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2
)
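
True links are typically a small minority of candidate pairs, which is why a stratified split is a sensible choice here. The sketch below is a hand-rolled illustration of what stratification aims for, using only the standard library; `stratified_split` is a hypothetical helper and not how scikit-learn implements `train_test_split`.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with similar class ratios in both sets."""
    rng = random.Random(seed)

    # Group row positions by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then carve off the test fraction.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    return sorted(train_idx), sorted(test_idx)


# Toy labels: 10 links among 100 candidate pairs.
toy_labels = [1.0] * 10 + [0.0] * 90
train_idx, test_idx = stratified_split(toy_labels)

train_rate = sum(toy_labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

Without stratification, a random 20% sample of such imbalanced data could end up with very few links, making the test-set precision and recall noisy.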

## Train ML Classifier

Though we're using a very simple machine learning model here, the important takeaway is to treat the classification step as a black box that outputs a score indicating how likely a given candidate record pair is to be a link. The only hard requirement is that output score; *how* it is generated is up to you. Perhaps you just want to use SimSum, which could be considered an extremely simple "model". Maybe you want to build a neural net that ingests the comparison vectors and produces a score. Generally, in linking, the classification model is the simplest piece, and much more work will go into your blockers and comparators.

In [11]:
classifier = AdaBoostClassifier(n_estimators=64, learning_rate=0.5)

In [12]:
classifier.fit(X_train, y_train)

AdaBoostClassifier(learning_rate=0.5, n_estimators=64)
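
To make the "black box" idea concrete, here is a minimal sketch in which anything that maps a comparison vector to a score can serve as the classifier. The feature names, weights, and both scorer functions are hypothetical, and SimSum is shown normalized by the number of features so that both scorers return values between 0 and 1.

```python
from typing import Callable, Dict, List

FeatureVector = Dict[str, float]
Scorer = Callable[[FeatureVector], float]


def simsum_scorer(features: FeatureVector) -> float:
    """SimSum as a 'model': the sum of similarities, divided by their count."""
    return sum(features.values()) / len(features)


def weighted_scorer(features: FeatureVector) -> float:
    """A hypothetical hand-weighted alternative (weights are made up)."""
    weights = {"first_name": 0.2, "last_name": 0.3, "date_of_birth": 0.5}
    return sum(weights[name] * value for name, value in features.items())


def score_pairs(pairs: List[FeatureVector], scorer: Scorer) -> List[float]:
    """Downstream code only needs a score per pair, not the scorer's internals."""
    return [scorer(pair) for pair in pairs]


pairs = [
    {"first_name": 1.0, "last_name": 0.5, "date_of_birth": 0.0},
    {"first_name": 1.0, "last_name": 1.0, "date_of_birth": 1.0},
]

simsum_scores = score_pairs(pairs, simsum_scorer)
weighted_scores = score_pairs(pairs, weighted_scorer)
print(simsum_scores, weighted_scores)
```

Swapping in a trained model only changes the scorer; thresholding and evaluation can stay the same.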

## Predict Using ML Classifier

Here, we'll generate scores for our test set and collect those predictions in a DataFrame suited to evaluation.

In [13]:
# predict_proba returns one column per class; column 1 is the link class.
y_pred = classifier.predict_proba(X_test)[:, 1]

In [14]:
df_predictions = X_test.copy()
df_predictions["model_score"] = y_pred
df_predictions["ground_truth"] = y_test

## Choosing a Linking Model Score Threshold

As with SimSum, we can examine the resulting score distribution and the precision/recall vs. model score threshold plot to decide where to set the cutoff.

### Model Score Distribution

In [15]:
tutorial.plot_model_score_distribution(df_predictions)

### Precision and Recall vs. Model Score

In [16]:
df_eval = tutorial.evaluate_linking(df_predictions)

In [17]:
tutorial.plot_precision_recall_vs_threshold(df_eval)

In [18]:
tutorial.plot_f1_score_vs_threshold(df_eval)
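
The internals of `tutorial.evaluate_linking` are not shown here, but a threshold sweep generally has the shape sketched below: for each candidate cutoff, treat pairs scoring at or above it as predicted links and compute precision, recall, and F1. The scores and labels are made-up toy values, `precision_recall_f1` is a hypothetical helper, and reporting a precision of 1.0 when nothing is predicted is just one common convention.

```python
def precision_recall_f1(scores, labels, threshold):
    """Compute precision, recall, and F1 for one score threshold."""
    predicted = [score >= threshold for score in scores]

    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)

    # Convention: with no predicted links, precision is reported as 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy model scores and ground truth labels (1 = link, 0 = non-link).
toy_scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
toy_truth = [1, 1, 0, 1, 0, 0]

sweep = {
    threshold: precision_recall_f1(toy_scores, toy_truth, threshold)
    for threshold in (0.25, 0.5, 0.75)
}
best_threshold = max(sweep, key=lambda threshold: sweep[threshold][2])

for threshold, (precision, recall, f1) in sweep.items():
    print(threshold, round(precision, 3), round(recall, 3), round(f1, 3))
```

Raising the threshold trades recall for precision. Picking the F1-maximizing cutoff is one option, but the right tradeoff depends on whether false links or missed links are more costly for your use case.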

### Top Scoring Non-Links

In [19]:
display_cols = [
    "first_name", "surname",
    "street_number", "address_1", "address_2", "suburb", "postcode", "state",
    "date_of_birth", "age", "phone_number", "soc_sec_id",
    "soundex_surname", "soundex_firstname",
    "nysiis_surname", "nysiis_firstname",
]

display_cols = [[f"{col}_A", f"{col}_B"] for col in display_cols]
display_cols = list(itertools.chain.from_iterable(display_cols))

In [20]:
df_top_scoring_negatives = df_predictions[
    df_predictions["ground_truth"] == 0  # Labels are floats: 0.0 = non-link.
][["model_score", "ground_truth"]].sort_values("model_score", ascending=False).head(n=10)

df_top_scoring_negatives = tutorial.augment_scored_pairs(df_top_scoring_negatives, df_A, df_B)

with pd.option_context('display.max_columns', None):
    display(df_top_scoring_negatives[["model_score", "ground_truth"] + display_cols])

person_id_A,person_id_B,model_score,ground_truth,first_name_A,first_name_B,surname_A,surname_B,street_number_A,street_number_B,address_1_A,address_1_B,address_2_A,address_2_B,suburb_A,suburb_B,postcode_A,postcode_B,state_A,state_B,date_of_birth_A,date_of_birth_B,age_A,age_B,phone_number_A,phone_number_B,soc_sec_id_A,soc_sec_id_B,soundex_surname_A,soundex_surname_B,soundex_firstname_A,soundex_firstname_B,nysiis_surname_A,nysiis_surname_B,nysiis_firstname_A,nysiis_firstname_B
65b5ca85-b18e-489b-9ea7-60ffe2b4a53f,eff1fcd5-be94-4b4c-a161-7930c1aafbb7,0.418382,0.0,jade,jamee,rees,godfrey,11,261.0,liverpool street,conner close,,,scarborough,orange,3677,5127,qld,vic,19150910.0,19800820.0,34.0,82,08 33192380,08 66058612,4665894,4665894,R200,G316,J300,J500,R,GADFRY,JAD,JANY
ccd36b1b-1cf7-43fc-971c-81028966e9a9,1a9a1f66-cd8d-4d28-9792-ab09885b9f0c,0.407103,0.0,sophie,oliver,rivers,hage,2,5.0,balonne street,balonne jtreet,,low glen dairy,,,3107,2500,nsw,nsw,,,37.0,23,03 78305011,04 36540820,5311737,3449996,R162,H200,S100,O416,RAVAR,HAG,SAFY,OLAVAR
ad1dfbd6-0132-4133-9c7f-2a76059c1714,ed99f955-5a59-4777-97ea-f6595dacf369,0.386337,0.0,mikhayla,mikayla,feeney,gronow,25,122.0,clambe place,blamey crescent,,,auburn,albury,2223,2207,vic,qld,19190923.0,19260727.0,19.0,36,,,9413778,9729461,F500,G650,M240,M240,FANY,GRANAO,MACKAYL,MACAYL
3a9895dd-3222-4024-abaa-d86720ef6f02,323616bb-8b35-4c5e-8015-eefc007fe1e0,0.384094,0.0,kaysey,,morrison,mccargy,36,356.0,faunce crescent,fawkner street,,killarney,plumpton,kingston,2795,2795,nsw,nss,19560907.0,19560907.0,,22,07 65223520,03 81104473,8559208,5558602,M625,M262,K200,,MARASAN,MCARGY,CAYSY,
ab12aa85-0819-4f77-a9c9-e7d2db42c53f,048ce9df-79cc-4004-bd2a-d23f417fc477,0.38263,0.0,alexandra,abbey,kerkham,smeaton,15,6.0,le hunte street,howitt street,,,toowong,kempsey,2326,2759,qld,vic,19010530.0,19351203.0,10.0,27,04 93152910,02 51603324,2745287,2745287,K625,S535,A425,A100,CARCKAN,SNATAN,ALAXANDR,ABY
9a8e3fff-be44-4079-af4c-6714e3db9a31,b159c049-7bca-486c-80ac-8c67f702d0fc,0.38263,0.0,zoe,cheld,lodge,,8,0.0,kirchaufs street,noala street,,,maryborough,jingellic,5052,3216,sa,qmy,19051024.0,,,27,,03 85377136,8649581,8649581,L320,,Z000,C430,LADG,,Z,CALD
5af6f4b9-2c24-4f0d-ac57-f1e1cf718fc5,628099ae-d20a-41bd-9ab2-efe977046682,0.375734,0.0,reece,nathan,cirelli,,78,39.0,,,mianga,,lockington,corrimal,2540,2594,nsw,nsw,,,31.0,32,,,8567221,7417565,C640,,R200,N350,CARAL,,RAC,NATAN
d5f0268b-0441-4e18-9dc3-e41166ec30a0,c6b8aa08-d17c-486f-8640-f1d3026d0363,0.372096,0.0,lewis,emiily,hinks,blackwell,10,,,,ben nevis estate,,thornleigh,gorokan,6167,3723,sa,vc,19250323.0,19012579.0,,38,,,5766355,5573355,H520,B424,L200,E540,HANC,BLACWAL,LAE,ENALY
525cbb20-c8f2-4911-82b4-538c7b42d39a,8a75ef59-18f0-4f7f-87a1-eba4ac703538,0.365435,0.0,andrew,oliver,webb,webb,117,73.0,grylls crescent,keira street,,parraweena,ormond,hurlstone park,2650,2650,tas,qld,19201017.0,19981012.0,28.0,31,,,9658403,8798378,W100,W100,A536,O416,WAB,WAB,ANDRAE,OLAVAR
880b9c04-b04c-40c0-97bb-ad424c16f37d,00711705-fb8a-4367-be5d-bf216893abfe,0.365363,0.0,cooper,cooper,birkill,teagme,1410,15.0,gurrang avenue,emblinggsreet,,ardlui,camperdown,clarence park,2029,2302,vic,vic,19571026.0,19880623.0,26.0,21,,,6299032,6200903,B624,T250,C160,C160,BARCAL,TAGN,CAPAR,CAPAR


### Lowest Scoring True Links

In [21]:
df_lowest_scoring_positives = df_predictions[
    df_predictions["ground_truth"] == 1  # Labels are floats: 1.0 = true link.
][["model_score", "ground_truth"]].sort_values("model_score").head(n=10)

df_lowest_scoring_positives = tutorial.augment_scored_pairs(df_lowest_scoring_positives, df_A, df_B)

with pd.option_context('display.max_columns', None):
    display(df_lowest_scoring_positives[["model_score", "ground_truth"] + display_cols])

person_id_A,person_id_B,model_score,ground_truth,first_name_A,first_name_B,surname_A,surname_B,street_number_A,street_number_B,address_1_A,address_1_B,address_2_A,address_2_B,suburb_A,suburb_B,postcode_A,postcode_B,state_A,state_B,date_of_birth_A,date_of_birth_B,age_A,age_B,phone_number_A,phone_number_B,soc_sec_id_A,soc_sec_id_B,soundex_surname_A,soundex_surname_B,soundex_firstname_A,soundex_firstname_B,nysiis_surname_A,nysiis_surname_B,nysiis_firstname_A,nysiis_firstname_B
3f24f870-eb86-42de-92e4-7ff778d3d4c4,fb5a880c-e16b-4f7d-a2bb-9e01770c9079,0.505958,1.0,julia,julia,ryan,ryaf,198.0,198.0,fawkner street,fawkner street,,,buff point,point uff,4000.0,3280,qld,qld,19960408.0,19960048.0,,,08 37106392,08 08861539,7699546,7699456,R500,R100,J400,J400,RYAN,RYAF,JAL,JAL
7d521342-0c6a-4089-95a0-f3ec2a55617c,00a46b4d-a9db-41c5-af07-8b5d26d93bd5,0.545607,1.0,jessica,,indelicato,indelicato,19.0,19.0,darke street,darke srreet,bonnie vale,bonnie vale,berriedale,,4214.0,4214,nsw,nsw,19811024.0,,39.0,39.0,08 16991308,08 16991308,2337844,6524248,I534,I534,J220,,INDALACAT,INDALACAT,JASAC,
58967f10-8805-4ad8-8b28-6745b0c370a2,4fc5a4a6-55c3-430b-b88a-e37edcfac745,0.598963,1.0,hayden,hayen,fullwood,fullwood,30.0,30.0,hindmarsh drive,pyramid carwipark,pyramid cara park,hindmarsh drive,broome,broome,3550.0,3550,qld,qld,19850822.0,19580822.0,29.0,29.0,02 28552602,02 28552602,4233824,8455011,F430,F430,H350,H500,FALWAD,FALWAD,HAYDAN,HAYAN
d1fad992-2445-4736-8f79-f329736d72fb,a4d1b3d6-ad6f-4ed5-b3ae-b2cf42e8c04c,0.629916,1.0,emma,egan,egan,emma,8.0,8.0,lucy gullett circuit,lucy gullett circuit,yuvam,,iluka,2611,2611.0,iluka,vic,vic,19210708.0,19210708.0,28.0,28.0,08 10150546,08 10105546,2426178,2426177,E250,E500,E500,E250,EGAN,EN,EN,EGAN
14ed19d6-fc48-4bbf-9f39-31df46ed56dd,70c88c9b-7587-4632-b087-ac76722dce1e,0.634237,1.0,kylee,kylee,crossman,croswmn,6.0,6.0,oakey place,oakeypece,,,ulladulla,ulladulla,2032.0,2032,nsw,nsw,,,26.0,26.0,08 58868220,08 58868220,8412700,5550977,C625,C625,K400,K400,CRASNAN,CRASWN,CYLY,CYLY
cf245053-4dc4-46ec-b909-49378b4b3931,05aa786a-4adb-421f-8685-e45b283493eb,0.650695,1.0,ned,ned,cochrane,cochrane,16.0,16.0,goldfinch circuit,goldfinch circuit,,,forest,for ef,2077.0,,nsw,nsw,19861112.0,19861112.0,28.0,28.0,07 13112055,03 26947618,5278133,5278313,C265,C265,N300,N300,CACRAN,CACRAN,NAD,NAD
aa3c726b-b4f8-481f-a9a4-4c2ff98b29d6,04279dcf-6065-4fc0-bf27-dcedc3c64b59,0.655554,1.0,monique,westbeook,westbrook,monique,77.0,77.0,burnie street,,,,south nanango,southxnnango,2450.0,2450,nsw,nsw,19511220.0,19511220.0,,,02 51542062,08 98696270,4102167,4101167,W231,M520,M520,W231,WASTBRAC,MANAG,MANAG,WASTBAC
de27ad63-e666-4fd7-8a89-f44027faf965,92b2cb22-256b-4194-9a0e-1351462698f1,0.662368,1.0,alexander,alexander,green,green,4.0,4.0,were street,were street,,,kaimkillenbun,kaimkillenbun,6011.0,6012,,,19550820.0,19550280.0,23.0,23.0,03 08556723,03 08556723,5665398,9280078,G650,G650,A425,A425,GRAN,GRAN,ALAXANDAR,ALAXANDAR
bb11fe61-3e9a-47eb-ab1f-ae99004df009,878dabf6-2d21-4a34-a703-5637ab913909,0.663146,1.0,mikhayla,mikhayla,philps,philps,,,roebuck street,roebucastfeet,robgill farm,robgill farm,raymond terrace,raymond terrace,,,nsw,nsw,19740830.0,19620401.0,35.0,35.0,03 46824534,03 46824534,3007137,8437095,P412,P412,M240,M240,FALP,FALP,MACKAYL,MACKAYL
7b9da434-5e96-43e4-9ac0-389b882ea29a,9fea4ddb-b4a7-4a1b-9be9-4d83d058951e,0.673982,1.0,jack,jack,mason,mason,15.0,15.0,malcolm place,malcolhalace,,,wahroonga,wahroonga,3549.0,3594,qld,qld,19331105.0,,32.0,32.0,07 66241824,07 66241824,4528779,4529797,M250,M250,J200,J200,MASAN,MASAN,JAC,JAC
