# Use SimSum Classification to Link FEBRL People Data

<a href="https://colab.research.google.com/github/rachhouse/intro-to-data-linking/blob/main/tutorial_notebooks/01_Link_FEBRL_Data_with_SimSum_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>

In this tutorial, we'll link synthesized people datasets generated by the [Freely Extensible Biomedical Record Linkage (FEBRL)](https://sourceforge.net/projects/febrl/) project. The FEBRL-generated datasets represent cleaned datasets, so in this notebook, we will step through:
* data augmentation,
* blocking,
* comparing, and
* classification using the SimSum methodology.

## Google Colab Setup

In [1]:
# Check if we're running locally, or in Google Colab.
try:
    import google.colab
    COLAB = True
except ModuleNotFoundError:
    COLAB = False
    
# If we're running in Colab, download the tutorial functions file 
# to the Colab session local directory, and install required libraries.
if COLAB:
    import requests
    
    tutorial_functions_url = "https://raw.githubusercontent.com/rachhouse/intro-to-data-linking/main/tutorial_notebooks/linking_tutorial_functions.py"
    r = requests.get(tutorial_functions_url)
    
    with open("linking_tutorial_functions.py", "w") as fh:
        fh.write(r.text)
    
    !pip install -q recordlinkage jellyfish altair

## Imports

In [2]:
import itertools
import re

from typing import Dict, Tuple, Optional

import altair as alt
import jellyfish
import numpy as np
import pandas as pd
import recordlinkage as rl

# We have a couple helper functions from this file that we'll use for evaluation.
import linking_tutorial_functions as tutorial

## Define Filepaths

First, let's set up access to a few data resources that we'll need for the tutorial.

In [3]:
TRAINING_DATASET_A, TRAINING_DATASET_B, TRAINING_LABELS = tutorial.get_training_data_paths(COLAB)

## Load (Cleaned) Training Datasets

We'll load our training datasets into pandas DataFrames. We want to be able to take advantage of pandas indexing as we link our data (plus, the `recordlinkage` package that we'll be using later needs input DataFrames to be indexed by record id), so we'll set an index on each training DataFrame.

As mentioned above, we can consider the cleaning step of linking to be already done - the data generated by FEBRL is in a consistent format, and equivalent attributes have been encoded in the same manner for the two synthesized people datasets.

In [4]:
df_A = pd.read_csv(TRAINING_DATASET_A)
df_A = df_A.set_index("person_id_A")
df_A.head()

person_id_A,first_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,age,phone_number,soc_sec_id
c538959d-35b6-4b4f-aa9d-12e2195e57bd,marcus,butt,98,kirkwood crescent,euroka,terrigal,2409,nsw,19420616,30,02 40555328,7758524
17f19297-13ab-457b-ac0e-bdda526a8c51,jessica,white,15,sabine close,springdale,yungaburra,2046,,19100318,27,03 84921725,7406466
ecc89e8a-847a-4fd5-bf00-3e1e65a94e90,jay,voarino,108,howitt street,,childers,2147,,19700411,26,02 95550035,7232789
defd07dd-a969-44e2-aefe-0ceb046d5ad3,jackson,miles,6,clive steele avenue,,castella,3078,vic,19391016,27,08 95639180,2079318
caf3bb89-6892-4059-99bf-93c744597e2f,sienna,beattie,4,hooley place,,elsternwick,6164,sa,19120225,37,02 48925933,2667388


In [5]:
df_B = pd.read_csv(TRAINING_DATASET_B)
df_B = df_B.set_index("person_id_B")
df_B.head()

person_id_B,first_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,age,phone_number,soc_sec_id
43ff859e-7109-421a-8787-065ad587b1a2,neneh,,82,eldridge street,corcoran,allambie heights,6170,nsw,19960225,30,03 41232173,7887265
3550a60a-c53b-4480-b94a-a864b971a7c1,brodee,whktd,58,goulburn street,,st albans,4575,vic,19370707,24,07 99650106,9475664
3c2832c0-64d8-4dff-a538-4ae6862eb2fa,montana,reu,3,kingsford drhqdrive,mountain view village,braddon,4114,,19631111,31,02 54174642,5390716
9b674709-d9cc-40c5-b6c7-76651d6a30a2,rorty,patejrson,2,cabena court,living springs,alice springs,2018,nsw,19111013,34,08 45475549,8580401
2b072378-aa64-452a-ade5-844119f30040,madison,petitph,8,chewings street,,sheidow park,3029,nsha,19900305,39,07 75149511,4001101


## Load Training Ground Truth Labels

One of the advantages of synthesized data, especially for tutorials and learning, is that we have ground truth labels for the data. (This is rarely the case when you encounter linking problems in the wild.) We'll load our known true links into a pandas DataFrame below.

In [6]:
df_ground_truth = pd.read_csv(TRAINING_LABELS)
df_ground_truth = df_ground_truth.set_index(["person_id_A", "person_id_B"])
df_ground_truth["ground_truth"] = df_ground_truth["ground_truth"] == 1
df_ground_truth.head()

person_id_A,person_id_B,ground_truth
a213c59c-5135-4b94-a458-96fd4f3b8cd2,518b4b80-a2b5-4192-9859-ba2a8035e311,True
765f15a9-c5a8-4019-89c6-b770ffb5073b,4f15c33d-9c55-4f8d-a3d2-eab63694a0e2,True
8d59f7e6-75c1-4c35-9f7b-00e071c1f5a7,236ee781-37f0-4cf4-a59e-e91fe8d0f5e3,True
4fbaf334-8dd5-4fe3-b441-16b5601a2cae,836cca68-84b7-496d-927e-a27710d41f4b,True
790ac78b-e649-4a9a-a70f-f1f23962d0b7,90ec4239-6489-4cd2-9842-adf5a146f4ff,True


## Data Augmentation

Let's take a look at our data, and consider what we have currently available for blocking and comparing.

In [7]:
df_A.head(n=2)

person_id_A,first_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,age,phone_number,soc_sec_id
c538959d-35b6-4b4f-aa9d-12e2195e57bd,marcus,butt,98,kirkwood crescent,euroka,terrigal,2409,nsw,19420616,30,02 40555328,7758524
17f19297-13ab-457b-ac0e-bdda526a8c51,jessica,white,15,sabine close,springdale,yungaburra,2046,,19100318,27,03 84921725,7406466


It would probably make sense to block on people's first and last name, but, as we've noted, the realities of data entry typos, nicknames, aliases, OCR mishaps, and speech-to-text blips mean that using an exact blocker isn't going to work well. These fields are prime candidates for phonetic encoding!

We'll use the python [jellyfish library](https://pypi.org/project/jellyfish/) to encode our `first_name` and `surname` fields via two phonetic encoding algorithms, [**Soundex**](https://en.wikipedia.org/wiki/Soundex) and [**NYSIIS**](https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System).
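
To make the idea of phonetic encoding concrete, here is a minimal sketch of a simplified American Soundex written in plain Python. It is for illustration only: `soundex_sketch` and `SOUNDEX_CODES` are hypothetical names, and this is not the implementation that `jellyfish.soundex` uses, although it should agree on simple lowercase names like the ones in this dataset.

```python
# Simplified American Soundex, written out for illustration only.
# The tutorial itself relies on jellyfish.soundex; this is not that code.
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex_sketch(name: str) -> str:
    """Encode a name as its first letter plus up to three consonant digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""

    encoded = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")

    for ch in letters[1:]:
        if ch in "hw":
            # 'h' and 'w' are skipped and do not separate repeated codes.
            continue
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            encoded += digit
        # Vowels reset `previous`, so a repeated code after a vowel is kept.
        previous = digit
        if len(encoded) == 4:
            break

    # Pad short encodings with zeros to a fixed length of four.
    return encoded.ljust(4, "0")


for name in ["jock", "jak", "marcus", "butt"]:
    print(name, soundex_sketch(name))
```

Names that are spelled differently but sound alike, such as "jock" and "jak", end up with the same code, which is exactly what makes these encodings useful as blocking keys.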

We could also use a truncated exact blocking approach with the `soc_sec_id` field. For this, we'll create a new attribute containing the last three digits of the SSID.

And lastly, we'll cast the `date_of_birth` field to a pandas Timestamp field so that we can compare it more easily down the road.

In [8]:
def dob_to_date(dob: str) -> Optional[pd.Timestamp]:
    """ Transform string date in YYYYMMDD format to a pd.Timestamp.
        Return None if transformation is not successful.
    """
    date_pattern = r"(\d{4})(\d{2})(\d{2})"
    dob_timestamp = None
    
    try:
        m = re.match(date_pattern, dob.strip())
        if m:
            dob_timestamp = pd.Timestamp(int(m.group(1)), int(m.group(2)), int(m.group(3)))
    except Exception:
        # Missing (NaN) or malformed values fall through and return None.
        pass

    return dob_timestamp

In [9]:
%%time

for df in [df_A, df_B]:
    
    # Update NaNs to empty strings or jellyfish will choke.
    df["surname"] = df["surname"].fillna("")
    df["first_name"] = df["first_name"].fillna("")

    # Soundex phonetic encodings.
    df["soundex_surname"] = df["surname"].apply(lambda x: jellyfish.soundex(x))
    df["soundex_firstname"] = df["first_name"].apply(lambda x: jellyfish.soundex(x))
    
    # NYSIIS phonetic encodings.    
    df["nysiis_surname"] = df["surname"].apply(lambda x: jellyfish.nysiis(x))
    df["nysiis_firstname"] = df["first_name"].apply(lambda x: jellyfish.nysiis(x))
    
    # Last 3 of SSID.
    df["ssid_last3"] = df["soc_sec_id"].apply(lambda x: str(x)[-3:].zfill(3) if x else None)
    df["soc_sec_id"] = df["soc_sec_id"].astype(str)
    
    # DOB to date object.
    df["dob"] = df["date_of_birth"].apply(lambda x: dob_to_date(x))

CPU times: user 108 ms, sys: 3.67 ms, total: 112 ms
Wall time: 110 ms


Let's take a look at a sample of our new columns:

In [10]:
df_A.head(n=2)

person_id_A,first_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,age,phone_number,soc_sec_id,soundex_surname,soundex_firstname,nysiis_surname,nysiis_firstname,ssid_last3,dob
c538959d-35b6-4b4f-aa9d-12e2195e57bd,marcus,butt,98,kirkwood crescent,euroka,terrigal,2409,nsw,19420616,30,02 40555328,7758524,B300,M622,BAT,MARC,524,1942-06-16
17f19297-13ab-457b-ac0e-bdda526a8c51,jessica,white,15,sabine close,springdale,yungaburra,2046,,19100318,27,03 84921725,7406466,W300,J220,WAT,JASAC,466,1910-03-18


## Blocking

Now that we've augmented our datasets, let's try some blocking! We'll use the python [`recordlinkage` library](https://github.com/J535D165/recordlinkage) for blocking. 

First, let's see how many candidate record pairs we would generate with a full blocker - meaning if we compared every record in dataset A to every record in dataset B. This produces the [Cartesian product](https://en.wikipedia.org/wiki/Cartesian_product) of the two datasets.

In [11]:
indexer = rl.Index()
indexer.add(rl.index.Full())

full_blocker_pairs = indexer.index(df_A, df_B)
max_candidate_record_pairs = full_blocker_pairs.shape[0]

print("\ndataset A size * dataset B size = maximum candidate record pairs")
print(f"{df_A.shape[0]:,} * {df_B.shape[0]:,} = {df_A.shape[0]*df_B.shape[0]:,}")

print(f"\n{max_candidate_record_pairs:,} total pairs.")


dataset A size * dataset B size = maximum candidate record pairs
6,500 * 6,500 = 42,250,000

42,250,000 total pairs.


`indexer.index` returns a pandas MultiIndex of the candidate record pairs: 

In [12]:
full_blocker_pairs

MultiIndex([('c538959d-35b6-4b4f-aa9d-12e2195e57bd', ...),
            ('c538959d-35b6-4b4f-aa9d-12e2195e57bd', ...),
            ('c538959d-35b6-4b4f-aa9d-12e2195e57bd', ...),
            ('c538959d-35b6-4b4f-aa9d-12e2195e57bd', ...),
            ('c538959d-35b6-4b4f-aa9d-12e2195e57bd', ...),
            ('c538959d-35b6-4b4f-aa9d-12e2195e57bd', ...),
            ('c538959d-35b6-4b4f-aa9d-12e2195e57bd', ...),
            ('c538959d-35b6-4b4f-aa9d-12e2195e57bd', ...),
            ('c538959d-35b6-4b4f-aa9d-12e2195e57bd', ...),
            ('c538959d-35b6-4b4f-aa9d-12e2195e57bd', ...),
            ...
            ('be3fbec8-58b0-47b4-b2e5-969c0e01ea04', ...),
            ('be3fbec8-58b0-47b4-b2e5-969c0e01ea04', ...),
            ('be3fbec8-58b0-47b4-b2e5-969c0e01ea04', ...),
            ('be3fbec8-58b0-47b4-b2e5-969c0e01ea04', ...),
            ('be3fbec8-58b0-47b4-b2e5-969c0e01ea04', ...),
            ('be3fbec8-58b0-47b4-b2e5-969c0e01ea04', ...),
            ('be3fbec8-58b0-47b4-b2e5-96

Even for very small datasets, like our training data, we're looking at a huge number of candidate record pairs to compare, unless we employ more selective blocking.

Recall that successful and efficient blocking minimizes:
* the quantity of generated candidate record pairs
* missed true links
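
The two quantities above can be sketched on a toy example in plain Python. Everything here is made up for illustration: the records, the hypothetical `true_links` set, and the helper `block_on_key`. The reduction ratio is computed as one minus the share of the full Cartesian product that survives blocking.

```python
from collections import defaultdict

# Toy records: id -> blocking key (e.g. a surname). All values are made up.
records_a = {"a1": "smith", "a2": "jones", "a3": "brown", "a4": "smith"}
records_b = {"b1": "smith", "b2": "jones", "b3": "browne", "b4": "taylor"}

# Hypothetical ground truth: a3/b3 is a true link with a typo in the key.
true_links = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}


def block_on_key(left: dict, right: dict) -> set:
    """Return candidate pairs whose blocking keys match exactly."""
    right_by_key = defaultdict(list)
    for record_id, key in right.items():
        right_by_key[key].append(record_id)
    return {
        (left_id, right_id)
        for left_id, key in left.items()
        for right_id in right_by_key.get(key, [])
    }


candidates = block_on_key(records_a, records_b)
total_pairs = len(records_a) * len(records_b)

# Fraction of the full comparison space that blocking removed.
reduction_ratio = 1 - len(candidates) / total_pairs
# Fraction of true links that survived blocking.
retained = len(true_links & candidates) / len(true_links)

print(f"{len(candidates)} of {total_pairs} pairs kept")
print(f"reduction ratio: {reduction_ratio:.3f}, true links retained: {retained:.3f}")
```

Exact blocking removes most of the 16 possible pairs, but it also drops the `a3`/`b3` link because of the typo in the key. That is the trade-off we need to measure on the real data.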

So, first let's define a method which measures the percentage of true links captured by blocking, as well as the search space reduction.

In [13]:
def evaluate_blocking(
    candidate_pairs: pd.MultiIndex,
    df_left: pd.DataFrame,
    df_right: pd.DataFrame,
    df_true_links: pd.DataFrame
) -> Tuple[float, float]:
    """ Function to calculate blocking search space reduction and retained true links.
        Reports and returns search space reduction percentage and retained true links percentage.
    """
    
    # Calculate search space reduction.
    search_space_reduction = round(rl.reduction_ratio(candidate_pairs.shape[0], df_left, df_right), 3)
    
    # Calculate retained true links percentage.
    total_true_links = df_true_links.shape[0]
    true_links_after_blocking = pd.merge(
        df_true_links,
        candidate_pairs.to_frame(),
        left_index=True,
        right_index=True,
        how="inner"
    ).shape[0]
    
    retained_true_link_percent = round((true_links_after_blocking/total_true_links) * 100, 2)
    
    print(f"{candidate_pairs.shape[0]:,} pairs after full blocking: {search_space_reduction}% search space reduction.")
    print(f"{retained_true_link_percent}% true links retained after full blocking.")
    
    return search_space_reduction, retained_true_link_percent 

We can evaluate the full blocker as follows:

In [14]:
%%time
_, _ = evaluate_blocking(full_blocker_pairs, df_A, df_B, df_ground_truth)

42,250,000 pairs after full blocking: 0.0% search space reduction.
100.0% true links retained after full blocking.
CPU times: user 37.9 s, sys: 5.74 s, total: 43.7 s
Wall time: 45 s


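This makes sense. If we use a full blocker, we won't have reduced our search space at all. And, since we consider every possible candidate pair, this will include all true links.

Note that the reported search space reduction is the ratio returned by `rl.reduction_ratio`, a fraction between 0 and 1, even though our helper prints it with a `%` sign. A value of 0.997 therefore means that 99.7% of the full search space was eliminated. The helper's message also always reads "full blocking", whichever blocker is in use.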

However, let's see if we can do better. Let's experiment with a few sets of different blockers.
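
The pair counts reported in this section grow as more blockers are added to a single indexer, which is consistent with the candidate pairs of the individual blockers being combined as a union. Here is a minimal sketch of that idea using plain Python sets; the records and the helper `block_on` are made up for illustration and do not use `recordlinkage`.

```python
# Toy records with two blocking attributes each. All values are made up.
records_a = {
    "a1": {"surname": "smith", "first_name": "anna"},
    "a2": {"surname": "jones", "first_name": "bob"},
}
records_b = {
    "b1": {"surname": "smyth", "first_name": "anna"},  # surname typo
    "b2": {"surname": "jones", "first_name": "rob"},   # first-name variant
    "b3": {"surname": "jones", "first_name": "anna"},
}

# Hypothetical ground truth for the toy data.
true_links = {("a1", "b1"), ("a2", "b2")}


def block_on(field: str) -> set:
    """Return pairs of record ids that agree exactly on `field`."""
    return {
        (id_a, id_b)
        for id_a, rec_a in records_a.items()
        for id_b, rec_b in records_b.items()
        if rec_a[field] == rec_b[field]
    }


surname_pairs = block_on("surname")
first_name_pairs = block_on("first_name")

# Combining blockers as a union keeps a pair if ANY blocker proposes it.
combined_pairs = surname_pairs | first_name_pairs

for label, pairs in [
    ("surname", surname_pairs),
    ("first_name", first_name_pairs),
    ("combined", combined_pairs),
]:
    print(label, len(pairs), "pairs,", len(true_links & pairs), "true links")
```

Each blocker on its own misses one of the two true links, while the union captures both at the cost of more candidate pairs.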

In [15]:
indexer = rl.Index()

indexer.add(rl.index.Block("surname"))

candidate_pairs = indexer.index(df_A, df_B)

_, _ = evaluate_blocking(candidate_pairs, df_A, df_B, df_ground_truth)

142,875 pairs after full blocking: 0.997% search space reduction.
54.37% true links retained after full blocking.


In [16]:
indexer = rl.Index()

indexer.add(rl.index.Block("surname"))
indexer.add(rl.index.Block("first_name"))

candidate_pairs = indexer.index(df_A, df_B)

_, _ = evaluate_blocking(candidate_pairs, df_A, df_B, df_ground_truth)

290,640 pairs after full blocking: 0.993% search space reduction.
82.38% true links retained after full blocking.


In [17]:
indexer = rl.Index()

indexer.add(rl.index.Block("soundex_surname"))
indexer.add(rl.index.Block("soundex_firstname"))
indexer.add(rl.index.Block("nysiis_surname"))
indexer.add(rl.index.Block("nysiis_firstname"))

candidate_pairs = indexer.index(df_A, df_B)

_, _ = evaluate_blocking(candidate_pairs, df_A, df_B, df_ground_truth)

499,837 pairs after full blocking: 0.988% search space reduction.
89.75% true links retained after full blocking.


In [18]:
indexer = rl.Index()

indexer.add(rl.index.Block("soundex_surname"))
indexer.add(rl.index.Block("soundex_firstname"))
indexer.add(rl.index.Block("nysiis_surname"))
indexer.add(rl.index.Block("nysiis_firstname"))
indexer.add(rl.index.Block("ssid_last3"))
indexer.add(rl.index.Block("date_of_birth"))

candidate_pairs = indexer.index(df_A, df_B)

_, _ = evaluate_blocking(candidate_pairs, df_A, df_B, df_ground_truth)

1,219,176 pairs after full blocking: 0.971% search space reduction.
99.72% true links retained after full blocking.


## Comparing

After we're reasonably satisfied with our blockers, we can move on to comparing our candidate record pairs. Recall that in the comparison step, for each candidate record pair, we compare their attributes to generate a comparison vector. Once again, we'll use [`recordlinkage`](https://github.com/J535D165/recordlinkage) to define our comparators. `recordlinkage` offers a variety of built-in comparators to use for string, numeric, and datetime fields.

* We can use exact comparators for our phonetic encoding fields.
* We'll use Jaro-Winkler comparison for the name fields, as this approach was designed with name matching in mind.
* For the other string fields, we'll opt for Damerau-Levenshtein, which does a nice job of accommodating data entry typos.
* For the DOB, we'll use a date comparison.
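
To give a feel for how an edit-distance comparator turns two strings into a score, here is a minimal sketch in plain Python. It implements the optimal string alignment variant of Damerau-Levenshtein (a restricted form that is simpler to write down), and it scales the distance by the longer string's length to get a similarity between 0 and 1. Both `osa_distance` and `similarity` are hypothetical helpers, and the normalization is an assumption for illustration rather than a statement of what `recordlinkage` does internally.

```python
def osa_distance(s: str, t: str) -> int:
    """Edit distance counting insertions, deletions, substitutions, and
    transpositions of adjacent characters (optimal string alignment)."""
    rows, cols = len(s) + 1, len(t) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swapping two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


def similarity(s: str, t: str) -> float:
    """Scale the distance into [0, 1]; 1.0 means identical strings."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1 - osa_distance(s, t) / longest


print(osa_distance("smith", "smiht"), similarity("smith", "smiht"))
print(osa_distance("kitten", "sitting"), similarity("kitten", "sitting"))
```

A swapped pair of letters, a very common typing mistake, costs only one edit here, which is why this family of measures copes well with data entry typos.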

In [19]:
%%time

comparer = rl.Compare()

# Phonetic encodings.
comparer.add(rl.compare.Exact("soundex_surname", "soundex_surname", label="soundex_surname"))
comparer.add(rl.compare.Exact("soundex_firstname", "soundex_firstname", label="soundex_firstname"))
comparer.add(rl.compare.Exact("nysiis_surname", "nysiis_surname", label="nysiis_surname"))
comparer.add(rl.compare.Exact("nysiis_firstname", "nysiis_firstname", label="nysiis_firstname"))

# First & last name.
comparer.add(rl.compare.String("surname", "surname", method="jarowinkler", label="last_name"))
comparer.add(rl.compare.String("first_name", "first_name", method="jarowinkler", label="first_name"))

# Address.
comparer.add(rl.compare.String("address_1", "address_1", method="damerau_levenshtein", label="address_1"))
comparer.add(rl.compare.String("address_2", "address_2", method="damerau_levenshtein", label="address_2"))
comparer.add(rl.compare.String("suburb", "suburb", method="damerau_levenshtein", label="suburb"))
comparer.add(rl.compare.String("postcode", "postcode", method="damerau_levenshtein", label="postcode"))
comparer.add(rl.compare.String("state", "state", method="damerau_levenshtein", label="state"))

# Other fields.
comparer.add(rl.compare.Date("dob", "dob", label="date_of_birth"))
comparer.add(rl.compare.String("phone_number", "phone_number", method="damerau_levenshtein", label="phone_number"))
comparer.add(rl.compare.String("soc_sec_id", "soc_sec_id", method="damerau_levenshtein", label="ssn"))

features = comparer.compute(candidate_pairs, df_A, df_B)

CPU times: user 1min 19s, sys: 1.94 s, total: 1min 21s
Wall time: 1min 20s


You can see that the output of the compare step is a collection of comparison/feature vectors, one for each candidate record pair. `recordlinkage` returns these vectors as a pandas DataFrame, indexed on the record pair ids.

In [20]:
features

person_id_A,person_id_B,soundex_surname,soundex_firstname,nysiis_surname,nysiis_firstname,last_name,first_name,address_1,address_2,suburb,postcode,state,date_of_birth,phone_number,ssn
00062cca-a85a-43fe-a309-2d1a47a58323,02759653-c8db-4587-8821-d154a4c32498,0,1,0,0,0.600000,0.933333,0.352941,0.076923,0.222222,0.4,0.333333,0.0,0.333333,0.000000
00062cca-a85a-43fe-a309-2d1a47a58323,02ce7446-e904-4c51-ab30-c69e6a0f8ff0,0,0,0,0,0.466667,0.577778,0.133333,0.062500,0.357143,0.2,0.250000,0.0,0.416667,0.428571
00062cca-a85a-43fe-a309-2d1a47a58323,033c561a-5a00-4a50-a576-28481298630c,1,0,1,0,1.000000,0.577778,0.230769,1.000000,0.100000,0.2,0.250000,0.0,0.083333,0.000000
00062cca-a85a-43fe-a309-2d1a47a58323,04562435-59aa-4740-b84f-af3ba0f1463a,1,0,1,0,1.000000,0.000000,0.250000,0.090909,0.187500,0.4,0.250000,0.0,0.333333,0.000000
00062cca-a85a-43fe-a309-2d1a47a58323,07cbfeea-7430-467d-98fc-dca36acc9853,1,0,0,0,0.880000,0.425926,0.230769,1.000000,0.444444,0.6,0.250000,0.0,0.416667,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
fffae586-12e1-4e28-ab49-a90de2adeb20,effb2ec1-6eab-4de6-a401-36991072d168,0,0,0,0,0.490741,0.416667,0.200000,0.083333,0.200000,0.4,1.000000,0.0,0.333333,0.428571
fffae586-12e1-4e28-ab49-a90de2adeb20,f0718a9e-2dc5-406c-82ba-94c901f67b90,0,1,0,1,0.540741,1.000000,0.250000,0.166667,0.142857,0.2,0.250000,0.0,0.250000,0.142857
fffae586-12e1-4e28-ab49-a90de2adeb20,f07cb4eb-858d-479e-9062-0aa47fddf3ff,0,0,0,0,0.458333,0.000000,0.277778,0.083333,0.111111,0.4,1.000000,0.0,0.416667,0.428571
fffae586-12e1-4e28-ab49-a90de2adeb20,f3904939-7613-4510-a9c2-9397ce2ff6c7,0,1,0,1,0.502646,1.000000,0.307692,0.166667,0.333333,0.2,0.250000,0.0,0.083333,0.142857


Here's a look at an individual comparison vector:

In [21]:
display(features.iloc[0].name)
display(features.iloc[0])

('00062cca-a85a-43fe-a309-2d1a47a58323',
 '02759653-c8db-4587-8821-d154a4c32498')

soundex_surname      0.000000
soundex_firstname    1.000000
nysiis_surname       0.000000
nysiis_firstname     0.000000
last_name            0.600000
first_name           0.933333
address_1            0.352941
address_2            0.076923
suburb               0.222222
postcode             0.400000
state                0.333333
date_of_birth        0.000000
phone_number         0.333333
ssn                  0.000000
Name: (00062cca-a85a-43fe-a309-2d1a47a58323, 02759653-c8db-4587-8821-d154a4c32498), dtype: float64

## Add labels to feature vectors

We've generated our comparison/feature vectors, so now we're ready to classify! To begin, we'll add our ground truth labels to the features DataFrame. Note that `df_ground_truth` just contains the true links, so we'll use a left join and then `fillna` with `False` for any records that are not true links.

In [22]:
df_labeled_features = pd.merge(
    features,
    df_ground_truth,
    on=["person_id_A", "person_id_B"],
    how="left"
)

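# Assign the result back rather than filling in place on a column selection,
# which newer pandas versions warn about and may not apply.
df_labeled_features["ground_truth"] = df_labeled_features["ground_truth"].fillna(False).astype(bool)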
df_labeled_features.head()

person_id_A,person_id_B,soundex_surname,soundex_firstname,nysiis_surname,nysiis_firstname,last_name,first_name,address_1,address_2,suburb,postcode,state,date_of_birth,phone_number,ssn,ground_truth
00062cca-a85a-43fe-a309-2d1a47a58323,02759653-c8db-4587-8821-d154a4c32498,0,1,0,0,0.6,0.933333,0.352941,0.076923,0.222222,0.4,0.333333,0.0,0.333333,0.0,False
00062cca-a85a-43fe-a309-2d1a47a58323,02ce7446-e904-4c51-ab30-c69e6a0f8ff0,0,0,0,0,0.466667,0.577778,0.133333,0.0625,0.357143,0.2,0.25,0.0,0.416667,0.428571,False
00062cca-a85a-43fe-a309-2d1a47a58323,033c561a-5a00-4a50-a576-28481298630c,1,0,1,0,1.0,0.577778,0.230769,1.0,0.1,0.2,0.25,0.0,0.083333,0.0,False
00062cca-a85a-43fe-a309-2d1a47a58323,04562435-59aa-4740-b84f-af3ba0f1463a,1,0,1,0,1.0,0.0,0.25,0.090909,0.1875,0.4,0.25,0.0,0.333333,0.0,False
00062cca-a85a-43fe-a309-2d1a47a58323,07cbfeea-7430-467d-98fc-dca36acc9853,1,0,0,0,0.88,0.425926,0.230769,1.0,0.444444,0.6,0.25,0.0,0.416667,0.0,False


## Calculate SimSum Scores

Once again, SimSum is the simplest approach to linking classification. To generate scores for the candidate record pairs, we simply sum the attribute comparison scores into a single score for each candidate record pair.
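
As a quick sanity check of the idea, we can sum a single comparison vector by hand. The values below are copied from the first comparison vector displayed earlier in this notebook; the variable names are just for illustration.

```python
# Comparison vector for the first candidate pair shown in the features output.
comparison_vector = {
    "soundex_surname": 0.0,
    "soundex_firstname": 1.0,
    "nysiis_surname": 0.0,
    "nysiis_firstname": 0.0,
    "last_name": 0.600000,
    "first_name": 0.933333,
    "address_1": 0.352941,
    "address_2": 0.076923,
    "suburb": 0.222222,
    "postcode": 0.400000,
    "state": 0.333333,
    "date_of_birth": 0.0,
    "phone_number": 0.333333,
    "ssn": 0.0,
}

# SimSum is simply the sum of the per-attribute similarity scores.
simsum = sum(comparison_vector.values())

# Every comparator here returns a value in [0, 1], so the largest
# possible score equals the number of comparison features.
max_possible = len(comparison_vector)

print(f"simsum = {simsum:.4f} out of a possible {max_possible}")
```

Up to rounding of the displayed inputs, this agrees with the `simsum` value of 4.252086 that pandas reports for the same pair. Note that every feature carries equal weight, so the four phonetic name encodings together count for as much as four independent attributes.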

In [23]:
df_labeled_features["simsum"] = df_labeled_features.drop("ground_truth", axis=1).sum(axis=1)
df_labeled_features.head()

person_id_A,person_id_B,soundex_surname,soundex_firstname,nysiis_surname,nysiis_firstname,last_name,first_name,address_1,address_2,suburb,postcode,state,date_of_birth,phone_number,ssn,ground_truth,simsum
00062cca-a85a-43fe-a309-2d1a47a58323,02759653-c8db-4587-8821-d154a4c32498,0,1,0,0,0.6,0.933333,0.352941,0.076923,0.222222,0.4,0.333333,0.0,0.333333,0.0,False,4.252086
00062cca-a85a-43fe-a309-2d1a47a58323,02ce7446-e904-4c51-ab30-c69e6a0f8ff0,0,0,0,0,0.466667,0.577778,0.133333,0.0625,0.357143,0.2,0.25,0.0,0.416667,0.428571,False,2.892659
00062cca-a85a-43fe-a309-2d1a47a58323,033c561a-5a00-4a50-a576-28481298630c,1,0,1,0,1.0,0.577778,0.230769,1.0,0.1,0.2,0.25,0.0,0.083333,0.0,False,5.44188
00062cca-a85a-43fe-a309-2d1a47a58323,04562435-59aa-4740-b84f-af3ba0f1463a,1,0,1,0,1.0,0.0,0.25,0.090909,0.1875,0.4,0.25,0.0,0.333333,0.0,False,4.511742
00062cca-a85a-43fe-a309-2d1a47a58323,07cbfeea-7430-467d-98fc-dca36acc9853,1,0,0,0,0.88,0.425926,0.230769,1.0,0.444444,0.6,0.25,0.0,0.416667,0.0,False,5.247806


## Choosing a SimSum Classification Threshold

Now that we've generated scores for all of our candidate record pairs, the next step is to determine a threshold at which we can classify a record pair as a link, or not-a-link. To do this, it's first helpful to look at the score distribution.

### "Model" Score Distribution

The score distribution shows a fairly clear boundary between non-links and links. The two classes overlap a little between scores of roughly 7 and 9.5, so we'll probably want to set the cutoff somewhere in that range.

In [24]:
tutorial.plot_model_score_distribution(
    df_labeled_features,
    score_column_name="simsum",  
)

### Precision and Recall at Varying Thresholds

Next, we'll take a look at the calculated precision and recall at varying model score thresholds. Below is a function which calculates precision and recall for a range of scores.

In [25]:
def evaluate_linking(
    df: pd.DataFrame,
    score_column_name: str = "score",
    ground_truth_column_name: str = "ground_truth",
) -> pd.DataFrame:
    """ Use model results to calculate precision & recall metrics.
    
        Args:
            df: dataframe containing model scores, and ground truth labels
                indexed on df_left index, df_right index
            score_column_name: Optional string name of column containing model scores 
            ground_truth_column_name: Optional string name of column containing ground
                truth values
                
        Returns:
            pandas dataframe with precision and recall evaluation data
            at varying score thresholds
    """
    eval_data = []
    max_score = max(1, max(df[score_column_name]))
    
    # Calculate eval data at threshold intervals from zero to max score. 
    # Max score is generally 1.0 if using a ML model, but with SimSum it
    # can get much larger.
    for threshold in np.linspace(0, max_score, 50):
        tp = df[(df[score_column_name] >= threshold) & (df[ground_truth_column_name] == True)].shape[0]
        fp = df[(df[score_column_name] >= threshold) & (df[ground_truth_column_name] == False)].shape[0]
        tn = df[(df[score_column_name] < threshold) & (df[ground_truth_column_name] == False)].shape[0]
        fn = df[(df[score_column_name] < threshold) & (df[ground_truth_column_name] == True)].shape[0]
        
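        # Guard against empty denominators, e.g. when no true links score
        # at or above the threshold.
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * ((precision * recall)/(precision + recall)) if (precision + recall) > 0 else 0.0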
        
        eval_data.append(
            {
                "threshold" : threshold,
                "tp" : tp,
                "fp" : fp,
                "tn" : tn,
                "fn" : fn,
                "precision" : precision,
                "recall" : recall,
                "f1" : f1
            }
        )
        
    return pd.DataFrame(eval_data)

In [26]:
df_eval = evaluate_linking(
    df=df_labeled_features,
    score_column_name = "simsum",
)

In [27]:
df_eval.head()

,threshold,tp,fp,tn,fn,precision,recall,f1
0,0.0,5983,1213193,0,0,0.004907,1.0,0.009767
1,0.285714,5983,1213193,0,0,0.004907,1.0,0.009767
2,0.571429,5983,1213193,0,0,0.004907,1.0,0.009767
3,0.857143,5983,1213182,11,0,0.004907,1.0,0.009767
4,1.142857,5983,1212667,526,0,0.00491,1.0,0.009771


The plot of precision and recall at varying score thresholds reinforces what we noted earlier in the score distribution: our most suitable cutoff lies in the range of 7 to 9.5. Exactly where to set the cutoff depends on your particular use case (e.g., is recall more important than precision, or vice versa?).
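
One way to make that trade-off explicit is to state a requirement, such as a minimum acceptable precision, and then pick the threshold that gives the best recall while meeting it. The sketch below does this over a handful of evaluation rows. The numbers are invented for illustration and are not results from this notebook, and `best_threshold` is a hypothetical helper.

```python
from typing import Dict, List, Optional

# Hypothetical evaluation rows; the numbers are invented for illustration.
eval_rows = [
    {"threshold": 6.0, "precision": 0.62, "recall": 0.99},
    {"threshold": 7.0, "precision": 0.85, "recall": 0.98},
    {"threshold": 8.0, "precision": 0.97, "recall": 0.95},
    {"threshold": 9.0, "precision": 0.995, "recall": 0.88},
    {"threshold": 10.0, "precision": 1.0, "recall": 0.70},
]


def best_threshold(rows: List[Dict[str, float]], min_precision: float) -> Optional[float]:
    """Return the threshold with the highest recall among the rows that
    meet the precision floor, or None if no row qualifies."""
    eligible = [row for row in rows if row["precision"] >= min_precision]
    if not eligible:
        return None
    return max(eligible, key=lambda row: row["recall"])["threshold"]


print(best_threshold(eval_rows, min_precision=0.95))
print(best_threshold(eval_rows, min_precision=0.99))
```

A stricter precision requirement pushes the chosen cutoff higher and gives up some recall. The same selection could be applied to the rows of `df_eval`.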

In [28]:
tutorial.plot_precision_recall_vs_threshold(df_eval)

We can also take a look at the F1 score at varying model thresholds. F1 is the harmonic mean of precision and recall, which provides us with a single figure to consider.
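
For reference, here is how the three metrics follow from the raw counts. The sketch uses the counts from the first row of `df_eval` shown above (threshold 0.0); `precision_recall_f1` is a hypothetical helper that mirrors the formulas in `evaluate_linking`, with guards against empty denominators.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Counts from the first row of df_eval (threshold 0.0).
precision, recall, f1 = precision_recall_f1(tp=5983, fp=1213193, fn=0)
print(f"precision={precision:.6f} recall={recall:.1f} f1={f1:.6f}")
```

At a threshold of zero every candidate pair is classified as a link, so recall is perfect while precision is tiny. Because the harmonic mean is dominated by the smaller of the two values, F1 stays close to the precision.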

In [29]:
tutorial.plot_f1_score_vs_threshold(df_eval)

## Examining Individual Links

Another way to gain insight into the performance of link classification is examining individual links (including their original attribute values) in score ranges of interest. This can be particularly helpful where you see overlap of classes in your score distribution - i.e. where you see highly scored non-links and poorly scored true links. These cases can highlight model confusion, and shed light on potential feature improvements.

Below, we've:
* Defined a helper function to join scored pairs with their original entity attribute data
* Captured the top scoring non-links (negatives) as well as the lowest scoring true links (positives)

In [30]:
def augment_scored_pairs(
    df: pd.DataFrame,
    df_left: pd.DataFrame,
    df_right: pd.DataFrame,
    score_column_name: Optional[str] = "score",
    ground_truth_column_name: Optional[str] = "ground_truth"
) -> pd.DataFrame:
    """ Augment scored pairs with original entity attribute data.
    
        Args:
            df: dataframe containing pairs for examination that includes
                model scores and ground truth labels, and is indexed on
                df_left index, df_right index
            df_left: dataframe containing attributes for "left"-linked entities
            df_right: dataframe containing attributes for "right"-linked entities
            score_column_name: Optional string name of column containing model scores 
            ground_truth_column_name: Optional string name of column containing ground
                truth values
                
        Returns:
            pandas dataframe containing pairs augmented with original entity attributes 
    """
    
    df = df[[score_column_name, ground_truth_column_name]]

    # Suffix our original attribute fields for display convenience when
    # we examine the links in the notebook.
    df_left = df_left.copy()
    df_left.columns = df_left.columns.map(lambda x: str(x) + '_A')

    df_right = df_right.copy()
    df_right.columns = df_right.columns.map(lambda x: str(x) + '_B')
    
    # Join the original link entity data via the dataframe indices.
    # This gives us the model score as well as the actual human-readable attributes
    # for each link.
    df_augmented_pairs = pd.merge(
        df,
        df_left,
        left_on=df_left.index.name,
        right_index=True,
    )

    # Join data from right entities.
    df_augmented_pairs = pd.merge(
        df_augmented_pairs,
        df_right,
        left_on=df_right.index.name,
        right_index=True,
    ) 
    
    return df_augmented_pairs

In [31]:
display_cols = [
    "first_name", "surname",
    "street_number", "address_1", "address_2", "suburb", "postcode", "state",
    "date_of_birth", "age", "phone_number", "soc_sec_id",
    "soundex_surname", "soundex_firstname",
    "nysiis_surname", "nysiis_firstname",
]

display_cols = [[f"{col}_A", f"{col}_B"] for col in display_cols]
display_cols = list(itertools.chain.from_iterable(display_cols))

### Top Scoring Non-Links

In [32]:
df_top_scoring_negatives = df_labeled_features[
    df_labeled_features["ground_truth"] == False
][["simsum", "ground_truth"]].sort_values("simsum", ascending=False).head(n=10)

df_top_scoring_negatives = augment_scored_pairs(df_top_scoring_negatives, df_A, df_B, score_column_name="simsum")

with pd.option_context('display.max_columns', None):
    display(df_top_scoring_negatives[["simsum", "ground_truth"] + display_cols])

person_id_A,person_id_B,simsum,ground_truth,first_name_A,first_name_B,surname_A,surname_B,street_number_A,street_number_B,address_1_A,address_1_B,address_2_A,address_2_B,suburb_A,suburb_B,postcode_A,postcode_B,state_A,state_B,date_of_birth_A,date_of_birth_B,age_A,age_B,phone_number_A,phone_number_B,soc_sec_id_A,soc_sec_id_B,soundex_surname_A,soundex_surname_B,soundex_firstname_A,soundex_firstname_B,nysiis_surname_A,nysiis_surname_B,nysiis_firstname_A,nysiis_firstname_B
3ce956aa-c72a-4367-9cff-dc2370c1e29d,cdee68d1-9d91-412c-83b7-79b1c405461c,9.755556,False,jock,jak,crouch,crouch,12,3,,,,,coffs harbour,cond er,4812,6229,qld,qld,,19831129.0,29.0,26.0,07 17186392,03 20107748,7172103,9248231,C620,C620,J200,J200,CRAC,CRAC,JAC,JAC
b620422c-1657-4bb5-856c-cad05c314dae,4b1c3ca3-7a0f-46be-93d1-f93bdccd5947,9.621558,False,jessica,jessica,green,green,28,20,kidston crescent,namatjir rive,,,longreach,balwyn north,2747,5046,nsw,nsw,19570710.0,19790826.0,23.0,30.0,02 26200605,07 90115044,2472114,2496105,G650,G650,J220,J220,GRAN,GRAN,JASAC,JASAC
6b4b3595-668f-44f2-a6f8-4f5523cf1882,b3e7c4a5-9d50-424a-86ef-4ae5e9453422,9.522024,False,amy,amy,white,white,3,26,balamara street,edwards street,,,paddington,whittington,2213,3124,nsw,nk,19531112.0,,37.0,,03 82841454,07 52785366,2845058,1355677,W300,W300,A500,A500,WAT,WAT,ANY,ANY
d034860e-1fc1-4f68-aaf9-7986a8c48143,731f45e2-152e-4002-87f2-eafad3163a38,9.505238,False,jakob,jacob,webb,webb,179,107,zadow place,court jauncey,,,robina,st ives,2011,2073,vic,vic,,19391023.0,21.0,33.0,02 62672567,03 74007096,6862734,6216860,W100,W100,J210,J210,WAB,WAB,JACAB,JACAB
a6f296f0-5d1a-4359-bf06-08f87186c63f,731f45e2-152e-4002-87f2-eafad3163a38,9.292857,False,jacob,jacob,webb,webb,20,107,badgery street,court jauncey,,,moss vale,st ives,7054,2073,vic,vic,19931219.0,19391023.0,31.0,33.0,04 52981263,03 74007096,7129001,6216860,W100,W100,J210,J210,WAB,WAB,JACAB,JACAB
defd07dd-a969-44e2-aefe-0ceb046d5ad3,6e970b0e-27e9-4773-b26f-5653142ad860,9.452381,False,jackson,jackson,miles,miles,6,2,clive steele avenue,molvig street,,,castella,blacktown,3078,4720,vic,vic,19391016.0,19520620.0,27.0,,08 95639180,04 61326890,2079318,9743885,M420,M420,J250,J250,MAL,MAL,JACSAN,JACSAN
50dee8b3-afac-475f-b15c-3565e2dba05a,3e90a4c3-6dc0-4714-a042-c1252ffc2504,9.43619,False,kelsye,kelsey,white,whit,158,56,de graaff street,warramoo crescent,,,greenwood,nicholls,810,3860,vic,vic,19080108.0,19700719.0,,20.0,08 53715657,08 08891386,1156431,2915954,W300,W300,K420,K420,WAT,WAT,CALSY,CALSY
35ba8725-6c10-4e93-abe3-2d14db6ac846,a40113a4-ee35-4aa6-a856-6eef3127ca8d,9.33912,False,hannah,hanna,green,gren,6,28,osmand street,couchman crescent,,,thornbury,parramatta,3977,3394,qld,qld,19501027.0,19611027.0,,24.0,08 08358735,02 79661009,9167048,7487333,G650,G650,H500,H500,GRAN,GRAN,HAN,HAN
c7374abc-3aed-4f60-bcbc-ae2225ff4f36,ca239cb4-e408-40c2-8cf9-bad7fd0f33e6,9.319372,False,isabella,isabelle,petersen,petersen,8,41,ern florence crescent,ellerston avenue,,,kariong,garbutt,4218,3840,nsw,nsw,,19859127.0,36.0,32.0,02 66363547,03 62145936,1833716,4497730,P362,P362,I214,I214,PATARSAN,PATARSAN,ISABAL,ISABAL
25ad1c3e-5ddf-4ddd-a068-76215b66d5b3,a77e8030-ff77-4a84-a76b-424e9914573e,9.298413,False,daniel,daniella,mason,mason,19,0,yamba place,,,,st clair,cammpsie,2564,2900,vic,vic,19890204.0,19902905.0,32.0,35.0,02 22417988,02 48407012,6895302,7068640,M250,M250,D540,D540,MASAN,MASAN,DANAL,DANAL
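
The rows above are not in strictly descending `simsum` order (9.292857 appears between 9.505238 and 9.452381), which suggests that the joins in `augment_scored_pairs` do not always preserve the sort order of the incoming pairs. If order matters when reviewing the pairs, one option is to re-sort after augmenting. A minimal sketch on a toy frame follows; the ids and scores in it are made up.

```python
import pandas as pd

# Toy augmented pairs whose rows came back from a join out of score order.
df_pairs = pd.DataFrame(
    {"simsum": [9.5, 9.2, 9.4], "ground_truth": [False, False, False]},
    index=pd.MultiIndex.from_tuples(
        [("a1", "b1"), ("a2", "b1"), ("a3", "b2")],
        names=["person_id_A", "person_id_B"],
    ),
)

# Restore descending score order for display.
df_sorted = df_pairs.sort_values("simsum", ascending=False)
print(df_sorted["simsum"].tolist())
```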


### Lowest Scoring True Links

In [33]:
# Take the ten lowest-scoring candidate pairs whose ground truth is "link".
df_lowest_scoring_positives = df_labeled_features[
    df_labeled_features["ground_truth"] == True
][["simsum", "ground_truth"]].sort_values("simsum").head(n=10)

df_lowest_scoring_positives = augment_scored_pairs(df_lowest_scoring_positives, df_A, df_B, score_column_name="simsum")

with pd.option_context('display.max_columns', None):
    display(df_lowest_scoring_positives[["simsum", "ground_truth"] + display_cols])

person_id_A,person_id_B,simsum,ground_truth,first_name_A,first_name_B,surname_A,surname_B,street_number_A,street_number_B,address_1_A,address_1_B,address_2_A,address_2_B,suburb_A,suburb_B,postcode_A,postcode_B,state_A,state_B,date_of_birth_A,date_of_birth_B,age_A,age_B,phone_number_A,phone_number_B,soc_sec_id_A,soc_sec_id_B,soundex_surname_A,soundex_surname_B,soundex_firstname_A,soundex_firstname_B,nysiis_surname_A,nysiis_surname_B,nysiis_firstname_A,nysiis_firstname_B
446d89af-0a42-4869-be26-395f956ac5b3,4726b263-cac4-42bf-9b36-ae23a3702046,4.98663,True,emiily,procter,procter,emiily,242.0,242.0,hytten place,hytten oace,mountain view retirement vlge,,mosman,mosmn,7010,7010,wa,wa,19420427.0,19420827.0,39.0,39.0,03 81814963,04 14761579,7458939,7458939,P623,E540,E540,P623,PRACTAR,ENALY,ENALY,PRACTAR
1fb2bfa2-fc4f-49b5-a86b-165de7507137,fa63fea3-180c-4155-b39d-6b2c0fa2d797,5.519841,True,harrison,matthews,matthews,godfrey,,,titheradge place,titheradgpe place,braeburn,,loftus,loftufs,2672,2672,sa,ss,,,10.0,10.0,,,9870844,4392278,M320,G316,H625,M320,MATAE,GADFRY,HARASAN,MATAE
a9110c73-4c8a-4f08-9617-2272b57f6fdf,420d9b98-3bad-4a3f-8db9-c3a62fa76d1a,5.798316,True,chantelle,lombardi,lombardi,chantelle,56.0,5.0,fraenkel street,fraenkel street,northfield,,tucki tucki,,5353,5335,qld,qld,19540204.0,,10.0,10.0,07 83432316,07 83432316,5454387,5454387,L516,C534,C534,L516,LANBARD,CANTAL,CANTAL,LANBARD
b785b486-34ff-4096-a551-f236ae9e7de6,4a2db454-ba83-4fc5-ac16-ad8cd2d8d7cf,5.8625,True,lucy,,clutterbuck,,1.0,1.0,higgerson street,higgerson street,,ormerod cottage,sandgate,sandgate,3520,3529,sa,sa,19230603.0,19235763.0,35.0,35.0,,,8181136,8181136,C436,,L200,,CLATARBAC,,LACY,
2efae440-0c12-43a7-aec4-88e209a2f89d,b11d0601-44ec-4bb8-8840-29fa26e71eb8,6.052632,True,patrick,lodge,lodge,patrick,5.0,5.0,mainwaring rich circuit,mainwaring rich circuit,palm garden villas,,ballarat,ballarat,2257,2257,qld,qld,19140723.0,19140273.0,27.0,27.0,08 66418661,08 66418661,6882001,6882001,L320,P362,P362,L320,LADG,PATRAC,PATRAC,LADG
beaf22f0-2ce6-46f3-88fa-887430ce48e2,c48976a0-aeb4-4fb3-8f8d-c8e7451d4245,6.105,True,amelia,,white,wigcht,452.0,452.0,brennan street,brennan street,rockley,,falmouth,670h,6701,falmouth,vic,vic,19420216.0,19420216.0,21.0,21.0,08 35282377,08 35822377,4734629,4734629,W300,W230,A540,,WAT,WAGCT,ANAL,
a0a29c16-1226-4251-aa5e-d1eafa23a31d,d52228f8-05f7-4e19-9ea0-e9bafe096262,6.115789,True,amalia,trowse,trowse,amalia,14.0,14.0,tregellas crescent,leahy close,,,cherry gardens,cherry gardens,4210,4280,sa,sa,,,,,04 10238698,04 10238698,4489113,4489113,T620,A540,A540,T620,TRAOS,ANAL,ANAL,TRAOS
c17df105-0130-46be-89ca-985571d6a4d2,122f8336-3cb2-468c-9b18-9e5f0b37a91b,6.142857,True,jessica,curry,curry,jessxa,60.0,60.0,lempriere crescent,lempriere crescent,adventure bay,adventure bay,frankston,frankston,2904,2904,nsw,nsw,,,34.0,34.0,02 14000167,02 14000167,4771896,9832666,C600,J200,J220,C600,CARY,JASX,JASAC,CARY
12a654dd-c9c5-4d0d-b4da-d25af2eae9ea,24bcc518-2e19-44e3-b250-b9c479a1ba13,6.25,True,ben,southpgate,southgate,be,1.0,1.0,oliver street,oliver street,,,coffs harbour,coffs harbour,3192,3192,sa,vic,19810423.0,,28.0,28.0,08 75233611,08 75233611,6808970,6808970,S323,B000,B500,S312,SATGAT,B,BAN,SATPGAT
d0e61bfc-8f98-4ce1-a378-d91896bb9eab,ef593234-6a42-421d-862a-aaa718819037,6.284848,True,john,johh,vlach,,12.0,12.0,pinkerton circuit,pinkerton circuit,tarwyn park,,labrador,labrayddor,2210,2210,qld,qld,19240422.0,19245032.0,33.0,33.0,07 22389543,07 66300763,2475801,2475801,V420,,J500,J000,VLAC,,JAN,JAH
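
Several of the lowest-scoring true links above have first name and surname transposed between the two datasets (for example `emiily`/`procter` in A versus `procter`/`emiily` in B), so field-by-field name comparisons contribute little to their scores. One possible way to account for this, sketched below, is to score the names both aligned and swapped and keep the better result. This is not part of the tutorial's feature set: the helper names are hypothetical, and `difflib.SequenceMatcher` from the standard library is only a stand-in for whichever string similarity you actually use. The sketch also assumes both values are strings, so missing names would need handling first.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; swap in any metric you prefer."""
    return SequenceMatcher(None, a, b).ratio()


def name_similarity_allowing_swap(
    first_a: str, surname_a: str, first_b: str, surname_b: str
) -> float:
    """Score names as aligned and as transposed, and keep the better of the two."""
    aligned = (similarity(first_a, first_b) + similarity(surname_a, surname_b)) / 2
    swapped = (similarity(first_a, surname_b) + similarity(surname_a, first_b)) / 2
    return max(aligned, swapped)


# Name values taken from the first row of the table above.
score = name_similarity_allowing_swap("emiily", "procter", "procter", "emiily")
print(score)
```

Taking the maximum means an exact transposition scores as highly as an exact aligned match, at the cost of also rewarding unrelated pairs whose names happen to cross-match.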
