# Use SimSum Classification to Link FEBRL People Data

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rachhouse/intro-to-data-linking/blob/linking-work/notebooks/link_febrl_data.ipynb)

In this tutorial, we'll link synthesized people datasets generated by the [Freely Extensible Biomedical Record Linkage (FEBRL)](https://sourceforge.net/projects/febrl/) project. The FEBRL-generated datasets are already cleaned, so this notebook skips preprocessing and steps through:
* data augmentation,
* blocking,
* comparing, and
* classification using the SimSum methodology.

In [1]:
import datetime
import itertools
import os
import pathlib
import re
import uuid

from typing import Tuple, Optional

import altair as alt
import numpy as np
import pandas as pd
import recordlinkage as rl
import jellyfish
import sklearn

## Define Filepaths

In [2]:
DATA_DIR = pathlib.Path(os.path.abspath('')).parent / "data"

TRAINING_DATASET_A = DATA_DIR / "febrl_training_a.csv"
TRAINING_DATASET_B = DATA_DIR / "febrl_training_b.csv"
TRAINING_LABELS = DATA_DIR / "febrl_training_labels.csv"

## Load Training Datasets (cleaned)

In [3]:
df_A = pd.read_csv(TRAINING_DATASET_A)
df_A = df_A.set_index("person_id_A")
df_A.head()

person_id_A,first_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,age,phone_number,soc_sec_id
f343cef9-cef0-445f-b688-972db2a029ca,dakota,geraghty,69,maclean street,skeers property,dandenong north,2529,nsw,19380417.0,31.0,03 01783133,6629995
2b5d49ca-06a5-468f-867e-0630bb7222f4,james,colquhoun,118,conlon crescent,,birkdale,5043,nsw,19680112.0,,07 14327140,5350518
fabd142c-9269-4f52-9899-3a82cccfe9e8,ruby,butt,103,,wollartukkee,east fremantle,4814,wa,19430120.0,30.0,02 88839517,3225206
1f719d2e-c842-49c0-ade9-c265f70288ae,marcus,rees,5,charlick place,lindoran,ballarat,4216,nsw,,27.0,08 17239266,7355062
7859d4fb-04fc-46fd-aa0f-5955603d35d9,jassim,belperio,36,john russell circuit,,eastwood,3131,nsw,19460129.0,20.0,02 61510457,9190750


In [4]:
df_B = pd.read_csv(TRAINING_DATASET_B)
df_B = df_B.set_index("person_id_B")
df_B.head()

person_id_B,first_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,age,phone_number,soc_sec_id
49bf7b37-c6af-41c8-91d7-5eb64c496a6c,charlotte,leukg,301,domain street,locn 1699,alma bay,2710,vic,,29.0,07 05109263,6356142
304b9d58-b06a-4d1c-970b-020e81efd1ff,callie,heerscgap,23,dudi lzce,,mill park,2324,tas,19820623.0,9.0,02 82637596,6775114
c080c996-dbef-4ec8-aa0d-0150629cd367,alanx,nguyen,6,callaghan street,,albury,4575,nsw,19220115.0,27.0,08 82171717,5275665
1c1d7e32-a925-47ab-9fa2-d6fe95f87de6,willjam,dud,83,purbrick street,glenveagh,muttabrra,6100,,19871212.0,23.0,07 54557966,7073899
80ea777d-6088-4d82-b839-2b566091d61a,lucy,baillie,34,hurley street,,glen iqnnes,5038,sa,19310448.0,,08 19431835,6880723


## Load Training Data Labels

In [5]:
df_labels = pd.read_csv(TRAINING_LABELS)
df_labels = df_labels.set_index(['person_id_A', 'person_id_B'])
df_labels.head()

person_id_A,person_id_B,label
99945930-2ee8-4b4b-be7e-4f6e196b4ae4,8daea3a6-e54d-4cdb-9962-985eaa6f6839,1
d18a65af-9cfc-46df-a8ee-565095125bf6,b0f8c021-c43d-436c-8236-a3623223d91c,1
c4336ddb-8b50-4f8a-aa93-3e27478f909a,d2970d4f-1601-4aa6-99b0-c79898bab323,1
7cbb6367-5268-49fe-83a9-053ddfb0f2f8,e817653e-f486-4d70-9de5-5ce3fe1fac36,1
7622a53c-e004-48b5-87ab-9cdf4d84f186,2fe290ca-d919-463f-bd33-e60214ec2834,1


## Data Augmentation

Here, we'll augment our people data with fields that we can use for blocking and comparing.

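**Phonetic Blocking**

We use the [jellyfish](https://pypi.org/project/jellyfish/) library to compute Soundex and NYSIIS phonetic encodings of first names and surnames. These encodings serve as blocking keys and as comparison features later in the notebook.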

In [6]:
def dob_to_date(dob: str) -> Optional[pd.Timestamp]:
    """ Transform string date in YYYYMMDD format to a pd.Timestamp.
        Return None if transformation is not successful.
    """
    date_pattern = r"(\d{4})(\d{2})(\d{2})"
    dob_timestamp = None
    
    try:
        if m := re.match(date_pattern, dob.strip()):
            dob_timestamp = pd.Timestamp(int(m.group(1)), int(m.group(2)), int(m.group(3)))
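    except (AttributeError, TypeError, ValueError):
        # Non-string input (e.g. NaN) or an invalid calendar date: leave as None.
        pass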

    return dob_timestamp
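
To make the failure modes concrete, here is a small standalone sketch of the same parsing idea. It uses the standard library's `datetime.date` instead of `pd.Timestamp` so that it runs without pandas. The name `parse_yyyymmdd` is ours and is not part of the notebook, and the explicit type check is an assumption about how missing values arrive (as NaN floats).

```python
import datetime
import re
from typing import Optional

DATE_PATTERN = re.compile(r"(\d{4})(\d{2})(\d{2})")


def parse_yyyymmdd(value: object) -> Optional[datetime.date]:
    """Hypothetical stand-in for dob_to_date that returns a datetime.date."""
    if not isinstance(value, str):
        return None  # e.g. a NaN float from a missing CSV cell

    match = DATE_PATTERN.match(value.strip())
    if match is None:
        return None

    year, month, day = (int(part) for part in match.groups())
    try:
        return datetime.date(year, month, day)
    except ValueError:
        return None  # the pattern matched, but it is not a real calendar date


print(parse_yyyymmdd("19380417"))    # a well-formed date
print(parse_yyyymmdd("19310448"))    # day 48 does not exist
print(parse_yyyymmdd(float("nan")))  # missing value
```

Values such as `19310448` do occur in dataset B, so returning `None` rather than raising keeps the augmentation loop running over corrupted records.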

In [7]:
%%time

for df in [df_A, df_B]:
    
    # Update NaNs to empty strings or jellyfish will choke.
    df["surname"] = df["surname"].fillna("")
    df["first_name"] = df["first_name"].fillna("")

    # Soundex phonetic encodings.
    df["soundex_surname"] = df["surname"].apply(lambda x: jellyfish.soundex(x))
    df["soundex_firstname"] = df["first_name"].apply(lambda x: jellyfish.soundex(x))
    
    # NYSIIS phonetic encodings.    
    df["nysiis_surname"] = df["surname"].apply(lambda x: jellyfish.nysiis(x))
    df["nysiis_firstname"] = df["first_name"].apply(lambda x: jellyfish.nysiis(x))
    
    # Last 3 of SSID.
    df["ssid_last3"] = df["soc_sec_id"].apply(lambda x: str(x)[-3:].zfill(3) if x else None)
    df["soc_sec_id"] = df["soc_sec_id"].astype(str)
    
    # DOB to date object.
    df["dob"] = df["date_of_birth"].apply(lambda x: dob_to_date(x))

CPU times: user 84.7 ms, sys: 3.69 ms, total: 88.4 ms
Wall time: 87 ms


Let's take a look at a sample of our new columns.

In [8]:
df_A.head()

person_id_A,first_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,age,phone_number,soc_sec_id,soundex_surname,soundex_firstname,nysiis_surname,nysiis_firstname,ssid_last3,dob
f343cef9-cef0-445f-b688-972db2a029ca,dakota,geraghty,69,maclean street,skeers property,dandenong north,2529,nsw,19380417.0,31.0,03 01783133,6629995,G623,D230,GARAGTY,DACAT,995,1938-04-17
2b5d49ca-06a5-468f-867e-0630bb7222f4,james,colquhoun,118,conlon crescent,,birkdale,5043,nsw,19680112.0,,07 14327140,5350518,C425,J520,CALGAHAN,JAN,518,1968-01-12
fabd142c-9269-4f52-9899-3a82cccfe9e8,ruby,butt,103,,wollartukkee,east fremantle,4814,wa,19430120.0,30.0,02 88839517,3225206,B300,R100,BAT,RABY,206,1943-01-20
1f719d2e-c842-49c0-ade9-c265f70288ae,marcus,rees,5,charlick place,lindoran,ballarat,4216,nsw,,27.0,08 17239266,7355062,R200,M622,R,MARC,62,NaT
7859d4fb-04fc-46fd-aa0f-5955603d35d9,jassim,belperio,36,john russell circuit,,eastwood,3131,nsw,19460129.0,20.0,02 61510457,9190750,B416,J250,BALPAR,JASAN,750,1946-01-29
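
For readers unfamiliar with Soundex, the sketch below shows how such codes can be derived. It is an illustrative pure-Python version of the classic American Soundex rules, not the jellyfish implementation, and the function name `soundex_sketch` is ours. It reproduces the Soundex codes shown in the sample above.

```python
# Consonant groups used by American Soundex; vowels, y, h and w have no code.
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex_sketch(name: str) -> str:
    """Return a four-character Soundex code, or "" for an empty name."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""

    encoded = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")

    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        previous = code  # a vowel resets this, so a repeated code counts again

    return encoded[:4].ljust(4, "0")


for name in ["geraghty", "colquhoun", "butt", "rees"]:
    print(name, soundex_sketch(name))
```

Because spelling variants of a name tend to share a code, an exact match on the code is a cheap way to group records that plain string equality would separate.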


## Blocking

In [9]:
# Look and see how many pairs we would need to process with a full (cartesian join) blocker.

indexer = rl.Index()
indexer.add(rl.index.Full())

candidate_links = indexer.index(df_A, df_B)
full_blocker_pairs = candidate_links.shape[0]

print(f"{full_blocker_pairs:,} total pairs.")

25,000,000 total pairs.


In [10]:
indexer = rl.Index()

indexer.add(rl.index.Block("soundex_surname"))
indexer.add(rl.index.Block("soundex_firstname"))
indexer.add(rl.index.Block("nysiis_surname"))
indexer.add(rl.index.Block("nysiis_firstname"))
indexer.add(rl.index.Block("ssid_last3"))
indexer.add(rl.index.Block("date_of_birth"))

candidate_links = indexer.index(df_A, df_B)
blocked_pairs = candidate_links.shape[0]

search_space_reduction = round((1 - (blocked_pairs/full_blocker_pairs)) * 100, 2)

print(f"{blocked_pairs:,} pairs after blocking: {search_space_reduction}% search space reduction.")

653,588 pairs after blocking: 97.39% search space reduction.
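
To illustrate what the indexer does conceptually, the sketch below blocks two tiny made-up record sets on two keys and takes the union of the resulting pairs, which is our understanding of how `rl.Index` combines several `Block` rules. The records, the helper `block_pairs`, and the choice to skip empty blocking values are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Records = Dict[str, Dict[str, str]]


def block_pairs(left: Records, right: Records, key: str) -> Set[Tuple[str, str]]:
    """Pair up left and right records that share the same value for `key`."""
    right_by_value = defaultdict(list)
    for right_id, record in right.items():
        if record.get(key):  # skip missing or empty blocking values
            right_by_value[record[key]].append(right_id)

    return {
        (left_id, right_id)
        for left_id, record in left.items()
        for right_id in right_by_value.get(record.get(key), [])
    }


left = {
    "a1": {"soundex_surname": "R200", "ssid_last3": "062"},
    "a2": {"soundex_surname": "B300", "ssid_last3": "206"},
}
right = {
    "b1": {"soundex_surname": "R200", "ssid_last3": "999"},
    "b2": {"soundex_surname": "N250", "ssid_last3": "206"},
    "b3": {"soundex_surname": "D300", "ssid_last3": "835"},
}

# A pair becomes a candidate if it agrees on at least one blocking key.
candidates: Set[Tuple[str, str]] = set()
for key in ["soundex_surname", "ssid_last3"]:
    candidates |= block_pairs(left, right, key)

full_pairs = len(left) * len(right)
reduction = round((1 - len(candidates) / full_pairs) * 100, 2)
print(sorted(candidates), f"{reduction}% search space reduction")
```

The same arithmetic gives the figure reported above: `1 - 653,588 / 25,000,000` is roughly 0.9739, i.e. a 97.39% reduction.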


In [11]:
# Show what candidate links look like.
# candidate_links

## Comparing

In [12]:
%%time

comparer = rl.Compare()

# Phonetic encodings.
comparer.add(rl.compare.Exact("soundex_surname", "soundex_surname", label="soundex_surname"))
comparer.add(rl.compare.Exact("soundex_firstname", "soundex_firstname", label="soundex_firstname"))
comparer.add(rl.compare.Exact("nysiis_surname", "nysiis_surname", label="nysiis_surname"))
comparer.add(rl.compare.Exact("nysiis_firstname", "nysiis_firstname", label="nysiis_firstname"))

# First & last name.
comparer.add(rl.compare.String("surname", "surname", method="jarowinkler", label="last_name"))
comparer.add(rl.compare.String("first_name", "first_name", method="jarowinkler", label="first_name"))

# Address.
comparer.add(rl.compare.String("address_1", "address_1", method="damerau_levenshtein", label="address_1"))
comparer.add(rl.compare.String("address_2", "address_2", method="damerau_levenshtein", label="address_2"))
comparer.add(rl.compare.String("suburb", "suburb", method="damerau_levenshtein", label="suburb"))
comparer.add(rl.compare.String("postcode", "postcode", method="damerau_levenshtein", label="postcode"))
comparer.add(rl.compare.String("state", "state", method="damerau_levenshtein", label="state"))

# Other fields.
comparer.add(rl.compare.Date("dob", "dob", label="date_of_birth"))
comparer.add(rl.compare.String("phone_number", "phone_number", method="damerau_levenshtein", label="phone_number"))
comparer.add(rl.compare.String("soc_sec_id", "soc_sec_id", method="damerau_levenshtein", label="ssn"))

features = comparer.compute(candidate_links, df_A, df_B)

CPU times: user 43.3 s, sys: 823 ms, total: 44.2 s
Wall time: 43.9 s


In [13]:
features

person_id_A,person_id_B,soundex_surname,soundex_firstname,nysiis_surname,nysiis_firstname,last_name,first_name,address_1,address_2,suburb,postcode,state,date_of_birth,phone_number,ssn
0011209e-8a5a-498d-a52f-505ec17b43e6,001eab52-59ea-47f7-a663-f8f60a71b022,0,0,0,0,0.000000,0.000000,0.166667,0.142857,0.153846,0.4,0.25,0.0,0.250000,0.000000
0011209e-8a5a-498d-a52f-505ec17b43e6,00830b20-97e8-4817-a749-c6d00c53dd39,0,0,0,0,0.430303,0.411111,0.500000,0.142857,0.250000,0.4,1.00,0.0,0.500000,0.285714
0011209e-8a5a-498d-a52f-505ec17b43e6,0131a9fd-42ad-47e2-b869-5b94b2fcd181,0,0,0,0,0.455556,0.000000,0.555556,0.133333,0.250000,0.2,0.25,0.0,0.333333,0.142857
0011209e-8a5a-498d-a52f-505ec17b43e6,017a9093-dd17-465c-b94a-56e8799f6641,0,0,0,0,0.561905,0.483333,0.350000,0.142857,0.272727,0.4,0.25,0.0,0.500000,0.000000
0011209e-8a5a-498d-a52f-505ec17b43e6,01d6b605-9b08-4bdf-a78e-30df177f029e,0,0,0,0,0.447619,0.577778,0.222222,0.142857,0.142857,0.4,0.25,0.0,0.500000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
fffd38f2-20c9-40f2-95f5-0988a4f2ce05,e6c95f30-8b14-439f-9673-d86076afc45d,0,1,0,1,0.447619,1.000000,0.266667,1.000000,0.166667,0.6,1.00,0.0,0.500000,0.142857
fffd38f2-20c9-40f2-95f5-0988a4f2ce05,ea665738-5743-46ef-84c4-a08ac51436b5,0,1,0,1,0.436508,1.000000,0.200000,0.062500,0.222222,0.2,0.25,0.0,0.250000,0.000000
fffd38f2-20c9-40f2-95f5-0988a4f2ce05,f20ff3d2-3f42-4663-98b1-6a68bee57db0,0,1,0,0,0.427579,0.970833,0.411765,0.062500,0.166667,0.6,1.00,0.0,0.333333,0.000000
fffd38f2-20c9-40f2-95f5-0988a4f2ce05,fa58faef-f152-4a2b-963c-3b6648bd9480,0,1,0,1,0.447619,1.000000,0.266667,1.000000,0.166667,0.4,0.25,0.0,0.416667,0.285714
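
The string features above are similarities between 0 and 1. As a rough illustration of how an edit distance can be turned into such a score, the sketch below computes the optimal string alignment variant of Damerau-Levenshtein distance and normalises it by the longer string's length. The normalisation `1 - distance / max_length` and the handling of two empty strings are assumptions; recordlinkage's own computation may differ in detail, for example in how missing values are treated.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance allowing insert, delete, substitute and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition

    return d[-1][-1]


def edit_similarity(a: str, b: str) -> float:
    """Map edit distance to a 0-1 similarity (1.0 means identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1 - osa_distance(a, b) / longest


print(edit_similarity("glen iqnnes", "glen innes"))
print(edit_similarity("pymble", "pymble"))
```

A single stray character in an eleven-character suburb name still scores above 0.9, which is why these features tolerate the typos injected into dataset B.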


In [14]:
display(features.iloc[0].name)
display(features.iloc[0])

('0011209e-8a5a-498d-a52f-505ec17b43e6',
 '001eab52-59ea-47f7-a663-f8f60a71b022')

soundex_surname      0.000000
soundex_firstname    0.000000
nysiis_surname       0.000000
nysiis_firstname     0.000000
last_name            0.000000
first_name           0.000000
address_1            0.166667
address_2            0.142857
suburb               0.153846
postcode             0.400000
state                0.250000
date_of_birth        0.000000
phone_number         0.250000
ssn                  0.000000
Name: (0011209e-8a5a-498d-a52f-505ec17b43e6, 001eab52-59ea-47f7-a663-f8f60a71b022), dtype: float64

## Add Labels to Feature Vectors

In [15]:
df_labeled_features = pd.merge(
    features,
    df_labels,
    on=['person_id_A', 'person_id_B'],
    how="left"
)

# Pairs without a matching label row are not true links.
df_labeled_features["label"] = df_labeled_features["label"].fillna(0)
df_labeled_features

person_id_A,person_id_B,soundex_surname,soundex_firstname,nysiis_surname,nysiis_firstname,last_name,first_name,address_1,address_2,suburb,postcode,state,date_of_birth,phone_number,ssn,label
0011209e-8a5a-498d-a52f-505ec17b43e6,001eab52-59ea-47f7-a663-f8f60a71b022,0,0,0,0,0.000000,0.000000,0.166667,0.142857,0.153846,0.4,0.25,0.0,0.250000,0.000000,0.0
0011209e-8a5a-498d-a52f-505ec17b43e6,00830b20-97e8-4817-a749-c6d00c53dd39,0,0,0,0,0.430303,0.411111,0.500000,0.142857,0.250000,0.4,1.00,0.0,0.500000,0.285714,0.0
0011209e-8a5a-498d-a52f-505ec17b43e6,0131a9fd-42ad-47e2-b869-5b94b2fcd181,0,0,0,0,0.455556,0.000000,0.555556,0.133333,0.250000,0.2,0.25,0.0,0.333333,0.142857,0.0
0011209e-8a5a-498d-a52f-505ec17b43e6,017a9093-dd17-465c-b94a-56e8799f6641,0,0,0,0,0.561905,0.483333,0.350000,0.142857,0.272727,0.4,0.25,0.0,0.500000,0.000000,0.0
0011209e-8a5a-498d-a52f-505ec17b43e6,01d6b605-9b08-4bdf-a78e-30df177f029e,0,0,0,0,0.447619,0.577778,0.222222,0.142857,0.142857,0.4,0.25,0.0,0.500000,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
fffd38f2-20c9-40f2-95f5-0988a4f2ce05,e6c95f30-8b14-439f-9673-d86076afc45d,0,1,0,1,0.447619,1.000000,0.266667,1.000000,0.166667,0.6,1.00,0.0,0.500000,0.142857,0.0
fffd38f2-20c9-40f2-95f5-0988a4f2ce05,ea665738-5743-46ef-84c4-a08ac51436b5,0,1,0,1,0.436508,1.000000,0.200000,0.062500,0.222222,0.2,0.25,0.0,0.250000,0.000000,0.0
fffd38f2-20c9-40f2-95f5-0988a4f2ce05,f20ff3d2-3f42-4663-98b1-6a68bee57db0,0,1,0,0,0.427579,0.970833,0.411765,0.062500,0.166667,0.6,1.00,0.0,0.333333,0.000000,0.0
fffd38f2-20c9-40f2-95f5-0988a4f2ce05,fa58faef-f152-4a2b-963c-3b6648bd9480,0,1,0,1,0.447619,1.000000,0.266667,1.000000,0.166667,0.4,0.25,0.0,0.416667,0.285714,0.0
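
Conceptually, the left merge followed by filling missing labels with 0 marks a candidate pair as a true link only if it appears in the label set. The dictionary-based sketch below mirrors that logic with made-up ids. It also shows a side effect worth keeping in mind: a true link that blocking failed to surface never receives a row at all.

```python
# Hypothetical ids, for illustration only.
true_links = {("a1", "b1"), ("a2", "b7"), ("a9", "b9")}
candidate_pairs = [("a1", "b1"), ("a1", "b2"), ("a2", "b7"), ("a3", "b3")]

# Equivalent of the left merge plus fillna(0): every candidate pair keeps its
# row, and only pairs found in the label set are labeled 1.
labels = {pair: (1.0 if pair in true_links else 0.0) for pair in candidate_pairs}

# True links that are absent from the candidate pairs are silently dropped.
missed_by_blocking = true_links - set(candidate_pairs)

print(labels)
print(missed_by_blocking)
```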


## Calculate SimSum Scores

In [16]:
df_labeled_features["simsum"] = df_labeled_features.drop("label", axis=1).sum(axis=1)
df_labeled_features

person_id_A,person_id_B,soundex_surname,soundex_firstname,nysiis_surname,nysiis_firstname,last_name,first_name,address_1,address_2,suburb,postcode,state,date_of_birth,phone_number,ssn,label,simsum
0011209e-8a5a-498d-a52f-505ec17b43e6,001eab52-59ea-47f7-a663-f8f60a71b022,0,0,0,0,0.000000,0.000000,0.166667,0.142857,0.153846,0.4,0.25,0.0,0.250000,0.000000,0.0,1.363370
0011209e-8a5a-498d-a52f-505ec17b43e6,00830b20-97e8-4817-a749-c6d00c53dd39,0,0,0,0,0.430303,0.411111,0.500000,0.142857,0.250000,0.4,1.00,0.0,0.500000,0.285714,0.0,3.919986
0011209e-8a5a-498d-a52f-505ec17b43e6,0131a9fd-42ad-47e2-b869-5b94b2fcd181,0,0,0,0,0.455556,0.000000,0.555556,0.133333,0.250000,0.2,0.25,0.0,0.333333,0.142857,0.0,2.320635
0011209e-8a5a-498d-a52f-505ec17b43e6,017a9093-dd17-465c-b94a-56e8799f6641,0,0,0,0,0.561905,0.483333,0.350000,0.142857,0.272727,0.4,0.25,0.0,0.500000,0.000000,0.0,2.960823
0011209e-8a5a-498d-a52f-505ec17b43e6,01d6b605-9b08-4bdf-a78e-30df177f029e,0,0,0,0,0.447619,0.577778,0.222222,0.142857,0.142857,0.4,0.25,0.0,0.500000,0.000000,0.0,2.683333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
fffd38f2-20c9-40f2-95f5-0988a4f2ce05,e6c95f30-8b14-439f-9673-d86076afc45d,0,1,0,1,0.447619,1.000000,0.266667,1.000000,0.166667,0.6,1.00,0.0,0.500000,0.142857,0.0,7.123810
fffd38f2-20c9-40f2-95f5-0988a4f2ce05,ea665738-5743-46ef-84c4-a08ac51436b5,0,1,0,1,0.436508,1.000000,0.200000,0.062500,0.222222,0.2,0.25,0.0,0.250000,0.000000,0.0,4.621230
fffd38f2-20c9-40f2-95f5-0988a4f2ce05,f20ff3d2-3f42-4663-98b1-6a68bee57db0,0,1,0,0,0.427579,0.970833,0.411765,0.062500,0.166667,0.6,1.00,0.0,0.333333,0.000000,0.0,4.972677
fffd38f2-20c9-40f2-95f5-0988a4f2ce05,fa58faef-f152-4a2b-963c-3b6648bd9480,0,1,0,1,0.447619,1.000000,0.266667,1.000000,0.166667,0.4,0.25,0.0,0.416667,0.285714,0.0,6.233333
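
As a worked check, the SimSum score of the first row above can be reproduced by summing its fourteen feature values by hand. The sketch below copies those values from the table. The upper bound of 14 assumes that every feature lies between 0 and 1.

```python
# Feature values of the first candidate pair, copied from the table above.
first_row = {
    "soundex_surname": 0,
    "soundex_firstname": 0,
    "nysiis_surname": 0,
    "nysiis_firstname": 0,
    "last_name": 0.000000,
    "first_name": 0.000000,
    "address_1": 0.166667,
    "address_2": 0.142857,
    "suburb": 0.153846,
    "postcode": 0.4,
    "state": 0.25,
    "date_of_birth": 0.0,
    "phone_number": 0.250000,
    "ssn": 0.000000,
}

# SimSum is the unweighted sum of all similarity features.
simsum = sum(first_row.values())

# With each feature in [0, 1], a pair that agrees on everything scores this much.
max_possible = len(first_row)

print(round(simsum, 6), max_possible)
```

Because the sum is unweighted, a perfect phone number match counts exactly as much as a perfect state match, even though the former is far stronger evidence of a link.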


In [17]:
df_labeled_features.shape

(653588, 16)

## Choosing a SimSum Classification Threshold

In [18]:
df_sim_sum_dist = df_labeled_features[["simsum", "label"]].copy()
df_sim_sum_dist["label"] = df_sim_sum_dist["label"].apply(lambda x: "True Link" if x == 1 else "Not a Link")
df_sim_sum_dist["simsum"] = df_sim_sum_dist["simsum"].apply(lambda x: round(x, 2))
df_sim_sum_dist["count"] = df_sim_sum_dist["label"]
df_sim_sum_dist = df_sim_sum_dist.groupby(["simsum", "label"]).count().reset_index()
df_sim_sum_dist

,simsum,label,count
0,0.71,Not a Link,1
1,0.74,Not a Link,1
2,0.84,Not a Link,1
3,0.85,Not a Link,1
4,0.86,Not a Link,1
...,...,...,...
1409,13.95,True Link,2
1410,13.96,True Link,1
1411,13.97,True Link,4
1412,13.98,True Link,1


In [19]:
min(df_sim_sum_dist["simsum"])

0.71

In [20]:
sorted(list(df_sim_sum_dist["label"].unique()))

['Not a Link', 'True Link']

In [21]:
legend_selection = alt.selection_multi(fields=["label"], bind="legend")

color_scale = alt.Scale(
    domain=["True Link", "Not a Link"],
    scheme="tableau10",
)

alt.Chart(df_sim_sum_dist, title="SimSum Score Distribution").mark_bar(opacity=0.7, binSpacing=0).encode(
    alt.X(
        "simsum:Q",
        bin=alt.Bin(extent=[0, max(df_sim_sum_dist["simsum"])], step=0.01),
        axis=alt.Axis(tickCount=5, title="SimSum Score (Binned)"),
    ),
    alt.Y("count", stack=None, axis=alt.Axis(title="Count of Links")),
    alt.Color(
        "label",
        scale=color_scale,
        legend=alt.Legend(title="Ground Truth Label"),
    ),
    opacity=alt.condition(legend_selection, alt.value(0.7), alt.value(0.2)),
    tooltip=[
        alt.Tooltip("simsum", title="SimSum Score"),
        alt.Tooltip("label", title="Ground Truth"),
        alt.Tooltip("count", title="Count of Links"),
    ],
).properties(
    height=200, width=800
).add_selection(legend_selection).interactive()

In [22]:
def evaluate_linking(
    df: pd.DataFrame,
    df_left: pd.DataFrame,
    df_right: pd.DataFrame,
    df_true_links: pd.DataFrame,
    score_column_name: str = "score",
    ground_truth_column_name: str = "ground_truth",
    k: int = 10
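) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """ Calculate precision and recall for model results at a range of score thresholds.

        Args:
            df: Dataframe of candidate links containing model scores and ground
                truth labels, indexed on the (left id, right id) pair.
            df_left: Left entity data, indexed on the left entity id.
            df_right: Right entity data, indexed on the right entity id.
            df_true_links: Dataframe of true links, indexed on the same
                (left id, right id) pair as df.
            score_column_name: Name of the column in df holding the model score.
            ground_truth_column_name: Name of the column in df holding the
                ground truth label (1 for a true link, 0 otherwise).
            k: Number of highest- and lowest-scoring links to return for review.

        Returns:
            A tuple of (evaluation dataframe with one row per threshold,
            top k links joined to entity data, bottom k links joined to entity data).
    """

    # First, check how many of the true links survived blocking by comparing
    # the true link ids to the candidate link ids in df.
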
    total_true_links = df_true_links.shape[0]
    true_links_after_blocking = pd.merge(
        df_true_links,
        df,
        left_index=True,
        right_index=True,
        how="inner"
    ).shape[0]
    
    true_link_pct_after_blocking = round((true_links_after_blocking / total_true_links)*100, 0)
    
    # True Links present in df.
    print(f"{true_link_pct_after_blocking}% true links present after blocking. ({true_links_after_blocking}/{total_true_links})")
    
    eval_data = []
    
    # Calculate true positives (tp), false positives (fp), true negatives (tn), false negatives (fn)
    # at threshold intervals from zero to max score.
    max_score = max(1, max(df[score_column_name]))

    for threshold in np.linspace(0, max_score, 50):
        tp = df[(df[score_column_name] >= threshold) & (df[ground_truth_column_name] == 1)].shape[0]
        fp = df[(df[score_column_name] >= threshold) & (df[ground_truth_column_name] == 0)].shape[0]
        tn = df[(df[score_column_name] < threshold) & (df[ground_truth_column_name] == 0)].shape[0]
        fn = df[(df[score_column_name] < threshold) & (df[ground_truth_column_name] == 1)].shape[0]
        
        eval_data.append(
            {
                "threshold" : threshold,
                "tp" : tp,
                "fp" : fp,
                "tn" : tn,
                "fn" : fn,
                "recall" : tp / (tp + fn),
                "precision" : tp / (tp + fp)
            }
        )

    
    def join_original_entity_data_to_links(df_k_links: pd.DataFrame, df_left, df_right) -> pd.DataFrame:
        """Helper function to join entity data to a dataframe of link results."""
        
        # Join data from left entities.
        df_k_links = pd.merge(
            df_k_links,
            df_left,
            left_on=df_left.index.name,
            right_index=True,
        )
        
        # Join data from right entities.
        return pd.merge(
            df_k_links,
            df_right,
            left_on=df_right.index.name,
            right_index=True,
        )  
        

    df_top_k_links = join_original_entity_data_to_links(
        df[[score_column_name, ground_truth_column_name]].sort_values(score_column_name, ascending=False).head(n=k).reset_index(),
        df_left,
        df_right
    )
    
    df_bottom_k_links = join_original_entity_data_to_links(
        df[[score_column_name, ground_truth_column_name]].sort_values(score_column_name).head(n=k).reset_index(),
        df_left,
        df_right    
    )
    
    return pd.DataFrame(eval_data), df_top_k_links, df_bottom_k_links

df_eval, df_top_links, df_bottom_links = evaluate_linking(
    df=df_labeled_features,
    df_left=df_A,
    df_right=df_B,
    df_true_links=df_labels,
    score_column_name = "simsum",
    ground_truth_column_name = "label",  
)

100.0% true links present after blocking. (5000/5000)


In [25]:
display_cols = [
    'first_name', 'surname', 'street_number', 'address_1',
    'address_2', 'suburb', 'postcode', 'state', 'date_of_birth', 'age',
    'phone_number', 'soc_sec_id',
    "soundex_surname", "soundex_firstname",
    "nysiis_surname", "nysiis_firstname",
]

display_cols = [[f"{col}_x", f"{col}_y"] for col in display_cols]
display_cols = list(itertools.chain.from_iterable(display_cols))

In [26]:
with pd.option_context('display.max_columns', None):
    display(df_top_links[["person_id_A", "person_id_B", "simsum", "label"] + display_cols])

,person_id_A,person_id_B,simsum,label,first_name_x,first_name_y,surname_x,surname_y,street_number_x,street_number_y,address_1_x,address_1_y,address_2_x,address_2_y,suburb_x,suburb_y,postcode_x,postcode_y,state_x,state_y,date_of_birth_x,date_of_birth_y,age_x,age_y,phone_number_x,phone_number_y,soc_sec_id_x,soc_sec_id_y,soundex_surname_x,soundex_surname_y,soundex_firstname_x,soundex_firstname_y,nysiis_surname_x,nysiis_surname_y,nysiis_firstname_x,nysiis_firstname_y
0,0fe47250-302d-4760-a03d-9db132cbc108,1c3ded27-5776-44d7-b89e-3395374ea37a,14.0,1.0,olivia,olivia,beams,beams,28,61,bingara place,bingara place,,,pymble,pymble,3909,3909,nsw,nsw,19520717,19520717,23.0,23.0,04 14201344,04 14201344,2202477,2202477,B520,B520,O410,O410,BAN,BAN,OLAV,OLAV
1,f947662c-24b9-430e-9faa-6470b04a68f9,f2d8a764-ddf5-46a2-b349-a642d3ce3abb,14.0,1.0,james,james,morrison,morrison,11,15,ingram street,ingram street,villa 2,villa 2,noble park,noble park,2148,2148,nsw,nsw,19810409,19810409,30.0,,04 07562543,04 07562543,3227052,3227052,M625,M625,J520,J520,MARASAN,MARASAN,JAN,JAN
2,f936c127-9210-40b2-9f6c-c4b92001ce56,71ee0ae5-716f-4959-ab51-ec67fa0bf653,14.0,1.0,finley,finley,goode,goode,8,4,blair street,blair street,phillip island,phillip island,kirra,kirra,4740,4740,,,19880919,19880919,35.0,,03 60273146,03 60273146,6292183,6292183,G300,G300,F540,F540,GAD,GAD,FANLY,FANLY
3,fd284b19-29b5-4cc4-86a7-61bf9b6d2bb0,85978205-de2d-489b-93a0-2ac2d1ea2a4a,14.0,1.0,dylan,dylan,kelley,kelley,5,7,sid barnes crescent,sid barnes crescent,,,millbank,millbank,3108,3108,vic,vic,19020621,19020621,23.0,37.0,04 43032020,04 43032020,7831682,7831682,K400,K400,D450,D450,CALY,CALY,DYLAN,DYLAN
4,2b227b4e-fcad-4ce7-98c2-3158693816a8,a392d816-38af-4b27-b1cd-85434c73c672,14.0,1.0,meg,meg,afford,afford,14,1,folingsby street,folingsby street,jumble springs,jumble springs,seelands,seelands,3109,3109,sa,sa,19620528,19620528,10.0,10.0,08 53021222,08 53021222,9977867,9977867,A163,A163,M200,M200,AFAD,AFAD,MAG,MAG
5,57184252-789f-4d67-a8a5-22f0ad77e0e7,e9472358-7b80-44e9-a448-1f3421a861dd,14.0,1.0,ebony,ebony,tuting,tuting,59,595,,,,,warnbro,warnbro,3228,3228,tas,tas,19980802,19980802,28.0,26.0,04 87699696,04 87699696,2892050,2892050,T352,T352,E150,E150,TATANG,TATANG,EBANY,EBANY
6,e3601ac1-895a-4c09-991b-cefa527f1c38,050c6b62-e634-47fb-914d-38d657032b7b,14.0,1.0,connor,connor,wilde,wilde,10,18,edwards street,edwards street,,,evandale,evandale,2914,2914,sa,sa,19150613,19150613,37.0,37.0,02 50996268,02 50996268,8133686,8133686,W430,W430,C560,C560,WALD,WALD,CANAR,CANAR
7,cfba1182-6e6c-48e5-a9a8-035aec45dbf0,3b650107-fef2-494a-9c1f-b7fda9ef5191,14.0,1.0,matteus,matteus,brayton,brayton,888,888,hall street,hall street,,,st ives,st ives,3163,3163,vic,vic,19950306,19950306,,,04 44743841,04 44743841,7386748,7386748,B635,B635,M320,M320,BRAYTAN,BRAYTAN,MAT,MAT
8,67910ee0-edb4-49d7-bf78-25b692b77e0c,6016f014-12ca-43a2-a428-c72efe6f3e94,14.0,1.0,oliver,oliver,hathaway,hathaway,21,21,alabaster street,alabaster street,,,westmeadows,westmeadows,6110,6110,vic,vic,19130805,19130805,32.0,,07 26700519,07 26700519,8589628,8589628,H300,H300,O416,O416,HATY,HATY,OLAVAR,OLAVAR
9,76fed60f-bed8-4736-945f-56db4f74bd9c,ccfeb32a-4563-47eb-9b4d-13e25d30ba55,14.0,1.0,lachlan,lachlan,wiseman,wiseman,26,27,staunton place,staunton place,,,wanniassa,wanniassa,4740,4740,vic,vic,19080721,19080721,9.0,9.0,02 38019795,02 38019795,2647224,2647224,W255,W255,L245,L245,WASANAN,WASANAN,LACLAN,LACLAN


In [27]:
with pd.option_context('display.max_columns', None):
    display(df_bottom_links[["person_id_A", "person_id_B", "simsum", "label"] + display_cols])

,person_id_A,person_id_B,simsum,label,first_name_x,first_name_y,surname_x,surname_y,street_number_x,street_number_y,address_1_x,address_1_y,address_2_x,address_2_y,suburb_x,suburb_y,postcode_x,postcode_y,state_x,state_y,date_of_birth_x,date_of_birth_y,age_x,age_y,phone_number_x,phone_number_y,soc_sec_id_x,soc_sec_id_y,soundex_surname_x,soundex_surname_y,soundex_firstname_x,soundex_firstname_y,nysiis_surname_x,nysiis_surname_y,nysiis_firstname_x,nysiis_firstname_y
0,e496d82d-ca68-4bf4-9109-3d45bb528303,38887005-60d7-414f-88f8-31cb99d6aefe,0.714534,0.0,nicholas,jayb,rees,humphfcys,34,32,,higgerson street,,windsor dental centre,mitcham,balwyn north,2190,4802,nsw,wa,,,31.0,21.0,03 48152407,,8000601,3725158,R200,H512,N242,J100,R,HANFCY,NACAL,JAYB
1,100ef930-71df-4d95-b219-b1028cc2d1db,f0f076ee-6b99-4f1d-81b7-ab8f1299e624,0.740559,0.0,gabriel,joxhua,filipov,prodw,56,173,bavin street,,dudley specialist medical centre,,elwood,mona vale,7008,2672,nsw,,,,37.0,29.0,07 44471940,,6353487,1978244,F411,P630,G164,J200,FALAPAV,PRADW,GABRAL,JAX
6,1180b166-4689-4e49-9459-e757b9edaad4,f0f076ee-6b99-4f1d-81b7-ab8f1299e624,0.876923,0.0,brinley,joxhua,millar,prodw,74,173,shumack street,,kerry street,,dalby,mona vale,3805,2672,vic,,,,28.0,29.0,03 29449716,,5414901,1978244,M460,P630,B654,J200,MALAR,PRADW,BRANLY,JAX
9,40c83340-eb2d-4df5-aedf-ae45687446fe,f0f076ee-6b99-4f1d-81b7-ab8f1299e624,0.888763,0.0,emiily,joxhua,hammer,prodw,16,173,neumayer street,,rp 31513,,leichhardt,mona vale,4500,2672,vic,,,,34.0,29.0,07 75170221,,4013468,1978244,H560,P630,E540,J200,HANAR,PRADW,ENALY,JAX
2,3bd73e59-f0fa-4b21-a0de-782981a11961,bcd62e5f-c4a3-4abd-a22c-c4b13a69f6a2,0.842857,0.0,hollie,zac,woodbury,canini,86,1716,,whalan lace,,oxford,south perth,terreyhills,3143,4270,wa,nsw,,,37.0,32.0,,04 40897322,8392168,9700884,W316,C550,H400,Z200,WADBARY,CANAN,HALY,ZAC
3,180a2281-8d31-48c0-a572-e7f13025ca08,13388a90-3582-4313-8c96-5a82182d3146,0.848352,0.0,xani,declen,ponter,kiss,52,33,,flecker place,inglewood,warra warra,campbelltown,bonny hills,4421,5052,nsw,qlc,,,13.0,25.0,07 68697185,,6980982,9377051,P536,K200,X500,D245,PANTAR,C,XAN,DACLAN
4,fcf8cd20-4262-485d-b353-9366bdf313f9,761e9baa-0935-478e-a47e-7d70ec1f9ea2,0.859524,0.0,jordan,mitchell,,koolen,10,13,valder place,crowleyndourt,,upper meroo,winmalee,bray park,5049,4814,qld,,,,29.0,29.0,,04 66042159,5686855,2748162,,K450,J635,M324,,CALAN,JARDAN,MATCAL
5,19937d07-0d8c-4765-9be1-58f2c3722194,e23a28a4-c4e7-4bb8-9f53-127cd0f1c3ce,0.866667,0.0,james,ruby,miles,rafandlli,29,35,churchill way,elliott street,little glencoe,,peregian beach,ballarat,6230,4511,nsw,,,,24.0,,08 21445124,,7227769,6348843,M420,R153,J520,R100,MAL,RAFANDL,JAN,RABY
7,49198340-81f9-404d-a860-cf2a69c91c38,0a3d3a17-28a2-49b0-8a70-c5b0227f96bf,0.886447,0.0,emiily,,dixon,blake,9,8,antill street,clement zlace,mcivor house,,botany,mermaid waters,4560,5097,vic,,,,22.0,31.0,07 27293537,,4367087,5274614,D250,B420,E540,,DAXAN,BLAC,ENALY,
8,70500462-bf17-468b-b409-b4d7d9002a37,797c3e99-58be-448a-888a-ad49b827a509,0.888095,0.0,connor,asha,rees,gao,17,15,bugden avenue,,,the meadows,guyra,brookvale,3040,5167,vic,nds,,,,,,04 63180654,8638037,1590501,R200,G000,C560,A200,R,G,CANAR,AS


In [28]:
df_eval.head()

,threshold,tp,fp,tn,fn,recall,precision
0,0.0,5000,648588,0,0,1.0,0.00765
1,0.285714,5000,648588,0,0,1.0,0.00765
2,0.571429,5000,648588,0,0,1.0,0.00765
3,0.857143,5000,648584,4,0,1.0,0.00765
4,1.142857,5000,648390,198,0,1.0,0.007652
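
The evaluation table can be understood with a much smaller example. The sketch below counts true and false positives and negatives at a given threshold for a handful of made-up scored pairs, then sweeps 50 evenly spaced thresholds from 0 to the maximum score, mirroring `np.linspace(0, max_score, 50)`. The function name, the sample data, and the guard against empty denominators are our additions and are not part of the notebook's `evaluate_linking`.

```python
from typing import Dict, List, Tuple


def evaluate_at_threshold(
    scored_pairs: List[Tuple[float, int]], threshold: float
) -> Dict[str, float]:
    """Treat every pair scoring >= threshold as a predicted link and tally outcomes."""
    tp = fp = tn = fn = 0
    for score, label in scored_pairs:
        predicted_link = score >= threshold
        if predicted_link and label == 1:
            tp += 1
        elif predicted_link:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1

    # Guard against empty denominators (no predicted links, or no true links).
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return {"threshold": threshold, "tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "precision": precision, "recall": recall}


# Hypothetical (simsum score, ground truth label) pairs.
scored_pairs = [(13.9, 1), (12.5, 1), (7.1, 1), (6.2, 0), (4.6, 0), (2.3, 0)]

max_score = 14
thresholds = [i * max_score / 49 for i in range(50)]  # 50 evenly spaced points
results = [evaluate_at_threshold(scored_pairs, t) for t in thresholds]

print(evaluate_at_threshold(scored_pairs, 5.0))
print(evaluate_at_threshold(scored_pairs, 13.0))
```

The first row of the table above follows the same arithmetic: at a threshold of 0 every candidate pair is predicted to be a link, so precision is `5000 / (5000 + 648588)`, or about 0.00765, while recall is 1.0.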


In [29]:
model_legend_select = alt.selection_multi(fields=["variable"], bind="legend")

alt.Chart(
    df_eval[["threshold", "recall", "precision"]].melt(id_vars=["threshold"]),
    title="Precision and Recall vs. Model Threshold"
).mark_line().encode(
    alt.X("threshold:Q", axis=alt.Axis(title="Model Threshold")),
    alt.Y(
        "value:Q",
        scale=alt.Scale(domain=(0, 1)),
        axis=alt.Axis(title="Precision/Recall Value"),
    ),
    alt.Color(
        "variable:N", legend=alt.Legend(title="Variable")
    ),
    tooltip=alt.Tooltip(["variable", "threshold", "value"]),
).add_selection(
    model_legend_select
).properties(height=400, width=800)