# Use Active Learning to Link FEBRL People Data

<a href="https://colab.research.google.com/github/rachhouse/intro-to-data-linking/blob/main/tutorial_notebooks/03_Link_FEBRL_Data_with_Active_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>

In this tutorial, we'll use the [dedupe library](https://github.com/dedupeio/dedupe) to experiment with an active learning approach to linking our FEBRL people datasets.

## Google Colab Setup

In [1]:
# Check if we're running locally, or in Google Colab.
try:
    import google.colab
    COLAB = True
except ModuleNotFoundError:
    COLAB = False
    
# If we're running in Colab, download the tutorial functions file 
# to the Colab session local directory, and install required libraries.
if COLAB:
    import requests
    
    tutorial_functions_url = "https://raw.githubusercontent.com/rachhouse/intro-to-data-linking/main/tutorial_notebooks/linking_tutorial_functions.py"
    r = requests.get(tutorial_functions_url)
    r.raise_for_status()  # Fail early if the download did not succeed.
    
    with open("linking_tutorial_functions.py", "w") as fh:
        fh.write(r.text)
    
    !pip install -q dedupe altair

In [2]:
import datetime
import itertools
import os
import pathlib
import re
from typing import Any, Dict, Optional

import dedupe
import pandas as pd

import linking_tutorial_functions as tutorial

In [3]:
# Dedupe RecordLink/active learning file locations.
ACTIVE_LEARNING_DIR = pathlib.Path(os.path.abspath('')) / "active_learning"
ACTIVE_LEARNING_DIR.mkdir(exist_ok=True)  # Create the directory so later file writes succeed.

OUTPUT_FILE = ACTIVE_LEARNING_DIR / "output.csv"
SETTINGS_FILE = ACTIVE_LEARNING_DIR / "learned_settings"
TRAINING_FILE = ACTIVE_LEARNING_DIR / "training.json"

## Load Training Dataset and Ground Truth Labels

In [4]:
df_A, df_B, df_ground_truth = tutorial.load_febrl_training_data(COLAB)

Let's refresh our memory of the dataset columns and formats.

In [5]:
df_A.head()

person_id_A,first_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,age,phone_number,soc_sec_id
c538959d-35b6-4b4f-aa9d-12e2195e57bd,marcus,butt,98,kirkwood crescent,euroka,terrigal,2409,nsw,19420616,30,02 40555328,7758524
17f19297-13ab-457b-ac0e-bdda526a8c51,jessica,white,15,sabine close,springdale,yungaburra,2046,,19100318,27,03 84921725,7406466
ecc89e8a-847a-4fd5-bf00-3e1e65a94e90,jay,voarino,108,howitt street,,childers,2147,,19700411,26,02 95550035,7232789
defd07dd-a969-44e2-aefe-0ceb046d5ad3,jackson,miles,6,clive steele avenue,,castella,3078,vic,19391016,27,08 95639180,2079318
caf3bb89-6892-4059-99bf-93c744597e2f,sienna,beattie,4,hooley place,,elsternwick,6164,sa,19120225,37,02 48925933,2667388


## Data Augmentation

We'll do minimal data augmentation before feeding our data to `dedupe`: we reformat the date of birth from `YYYYMMDD` to `mm/dd/yy`, convert every column to a string, and strip leading and trailing whitespace. Additionally, `dedupe` requires input data as a dictionary that maps each record id to a dictionary of that record's fields, so we'll convert our dataframes to this format.

In [6]:
def format_dob(dob: str) -> Optional[str]:
    """ Transform date of birth format from YYYYMMDD to mm/dd/yy.
        If DOB cannot be transformed, return None.
    """
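    try:
        if re.fullmatch(r"\d{8}", dob):
            return datetime.datetime.strptime(dob, "%Y%m%d").strftime("%m/%d/%y")
    except (TypeError, ValueError):
        # TypeError: dob is not a string (e.g. None); ValueError: not a real date.
        pass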

    return None

def strip_and_null(x: Any) -> Optional[str]:
    """ Stringify incoming variable, remove trailing/leading whitespace
        and return resulting string. Return None if resulting string is empty.
    """
    x = str(x).strip()
    
    if x == "":
        return None
    else:
        return x
    
def convert_df_to_dict(df: pd.DataFrame) -> Dict[str, Dict]:
    """ Convert pandas DataFrame to dict keyed by record id.
        Convert all fields to strings or Nones to satisfy dedupe.
        Transform date format of date_of_birth field.
        Note: the input DataFrame is modified in place.
    """

    for col in df.columns:
        df[col] = df[col].apply(strip_and_null)

    df["date_of_birth"] = df["date_of_birth"].apply(format_dob)

    return df.to_dict("index")

In [7]:
records_A = convert_df_to_dict(df_A)
records_B = convert_df_to_dict(df_B)

We can examine a small sample of the resulting transformed records:

In [8]:
list(records_A.values())[:2]

[{'first_name': 'marcus',
  'surname': 'butt',
  'street_number': '98',
  'address_1': 'kirkwood crescent',
  'address_2': 'euroka',
  'suburb': 'terrigal',
  'postcode': '2409',
  'state': 'nsw',
  'date_of_birth': '06/16/42',
  'age': '30',
  'phone_number': '02 40555328',
  'soc_sec_id': '7758524'},
 {'first_name': 'jessica',
  'surname': 'white',
  'street_number': '15',
  'address_1': 'sabine close',
  'address_2': 'springdale',
  'suburb': 'yungaburra',
  'postcode': '2046',
  'state': None,
  'date_of_birth': '03/18/10',
  'age': '27',
  'phone_number': '03 84921725',
  'soc_sec_id': '7406466'}]
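
As a quick standalone check of the date conversion, the sketch below reproduces the `YYYYMMDD` to `mm/dd/yy` step using only the standard library. The helper name `to_short_date` is hypothetical and not part of the tutorial code. Note that the two-digit year drops the century, so a 1942 and a 2042 birth date render identically.

```python
import datetime

def to_short_date(dob: str) -> str:
    """Hypothetical helper: reformat a YYYYMMDD string as mm/dd/yy."""
    return datetime.datetime.strptime(dob, "%Y%m%d").strftime("%m/%d/%y")

first = to_short_date("19420616")   # date of birth of the first sample record
second = to_short_date("20420616")  # same month and day, different century
print(first, second)
```

If the century matters for your data, a four-digit year format would avoid this ambiguity.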

In [11]:
%%time

# Define fields that the dedupe linker pays attention to.
fields = [
    { "field" : "first_name", "type" : "Name" },
    { "field" : "surname", "type" : "Name" },
    { "field" : "address_1", "type" : "ShortString" },
    { "field" : "address_2", "type" : "ShortString" },
    { "field" : "suburb", "type" : "ShortString" },
    { "field" : "postcode", "type" : "Exact" },
    { "field" : "state", "type" : "Exact" },
    { "field" : "date_of_birth", "type" : "DateTime" },
    { "field" : "soc_sec_id", "type" : "Exact" },
]

linker = dedupe.RecordLink(fields)

if TRAINING_FILE.exists():
    # If we have a previously saved training data file from a previous
    # linker run, include it for the linker in prepare_training.
    with open(TRAINING_FILE) as fh:
        linker.prepare_training(records_A, records_B, training_file=fh)
else:
    linker.prepare_training(records_A, records_B)

INFO:dedupe.canopy_index:Removing stop word  s
INFO:dedupe.canopy_index:Removing stop word ee
INFO:dedupe.canopy_index:Removing stop word re
INFO:dedupe.canopy_index:Removing stop word st
INFO:dedupe.canopy_index:Removing stop word ce
INFO:dedupe.canopy_index:Removing stop word en
INFO:dedupe.canopy_index:Removing stop word la
INFO:dedupe.canopy_index:Removing stop word et
INFO:dedupe.canopy_index:Removing stop word tr
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(PartialPredicate: (suffixArray, surname, Surname), SimplePredicate: (commonSixGram, first_name))


CPU times: user 2min 1s, sys: 407 ms, total: 2min 1s
Wall time: 2min 1s


## Active Learning Labeling Session!

In [12]:
dedupe.console_label(linker)

first_name : caitlin
surname : lock
address_1 : owen crescent
address_2 : None
suburb : hermit park
postcode : 2567
state : qld
date_of_birth : 05/11/59
soc_sec_id : 1830310

first_name : caiftn
surname : lpcd
address_1 : None
address_2 : None
suburb : hermit park
postcode : 2567
state : qld
date_of_birth : 05/11/59
soc_sec_id : 1830310

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


 y


first_name : bailey
surname : fauser
address_1 : dashwood retreat
address_2 : the shanty
suburb : wangaratta
postcode : 2444
state : wa
date_of_birth : 12/30/61
soc_sec_id : 4365312

first_name : talua
surname : neville
address_1 : None
address_2 : wee wilbertree
suburb : None
postcode : 2340
state : wa
date_of_birth : None
soc_sec_id : 5856189

1/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(PartialPredicate: (suffixArray, surname, Surname), SimplePredicate: (commonSixGram, first_name))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, suburb), SimplePredicate: (dayPredicate, date_of_birth))
first_name : jackson
surname : wooldridge-curtis
address_1 : fred williams crescent
address_2 : None
suburb : thornbury
postcode : 3775
state : nsw
date_of_birth : None
soc_sec_id : 3843063

first_name : jonathon
surname : bullcm
address_1 : None
address_2 : glebe retirement villa
suburb : None
postcode : 3047
state : vic
date_of_birth : 01/02/55
soc_sec_id : 1014005

1/10 positive, 1/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


first_name : micaela
surname : reid
address_1 : None
address_2 : mornington ret vlg
suburb : cape paterson
postcode : 4035
state : nsw
date_of_birth : 12/05/57
soc_sec_id : 5896116

first_name : micaela
surname : teug
address_1 : None
address_2 : mornington ret vlg
suburb : cape paterson
postcode : 4035
state : nsw
date_of_birth : 12/09/57
soc_sec_id : 5896116

1/10 positive, 2/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


first_name : amber
surname : mednis
address_1 : pennington crescent
address_2 : None
suburb : moree
postcode : 2295
state : nsw
date_of_birth : 11/06/76
soc_sec_id : 6250047

first_name : amber
surname : mednis
address_1 : pennington crescent
address_2 : None
suburb : moree
postcode : 2295
state : nsw
date_of_birth : 11/06/76
soc_sec_id : 6250047

1/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


first_name : victoria
surname : vincent
address_1 : brand street
address_2 : None
suburb : brighton
postcode : 7140
state : nsw
date_of_birth : 05/28/47
soc_sec_id : 8442501

first_name : victosia
surname : vincent
address_1 : brand street
address_2 : None
suburb : brighton
postcode : 7140
state : nsw
date_of_birth : 05/28/47
soc_sec_id : 8442501

2/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(PartialPredicate: (suffixArray, surname, Surname), SimplePredicate: (wholeFieldPredicate, address_1))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, suburb), SimplePredicate: (dayPredicate, date_of_birth))
first_name : andrew
surname : hingston
address_1 : atherton street
address_2 : None
suburb : dorrigo
postcode : 5290
state : qld
date_of_birth : 07/06/40
soc_sec_id : 7679919

first_name : andrew
surname : hingston
address_1 : None
address_2 : None
suburb : dorrigo
postcode : 5290
state : qld
date_of_birth : 07/06/40
soc_sec_id : 7679919

3/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


first_name : liam
surname : bradshaw
address_1 : guthridge crescent
address_2 : None
suburb : allawah
postcode : 3073
state : wa
date_of_birth : 06/01/26
soc_sec_id : 8647489

first_name : liam
surname : bradshaw
address_1 : guthridge brescent
address_2 : None
suburb : allawah
postcode : 3073
state : wa
date_of_birth : 06/01/26
soc_sec_id : 8647489

4/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(PartialPredicate: (suffixArray, surname, Surname), SimplePredicate: (wholeFieldPredicate, postcode))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, suburb), SimplePredicate: (dayPredicate, date_of_birth))
first_name : nan
surname : chin quan
address_1 : dallachy place
address_2 : None
suburb : campbelltown
postcode : 3859
state : nsw
date_of_birth : 01/03/77
soc_sec_id : 3552194

first_name : nan
surname : chin quan
address_1 : dallachy place
address_2 : kookaburra village
suburb : campbelltown
postcode : 3859
state : nsw
date_of_birth : 01/03/77
soc_sec_id : 3552194

5/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


first_name : tiarna
surname : soulsby
address_1 : newman morris circuit
address_2 : None
suburb : macgregor
postcode : 4874
state : vic
date_of_birth : 01/25/57
soc_sec_id : 6247158

first_name : soulsby
surname : nan
address_1 : newman morris circuit
address_2 : None
suburb : macgregor
postcode : 4874
state : vic
date_of_birth : 01/25/57
soc_sec_id : 6247158

6/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(PartialPredicate: (suffixArray, surname, Surname), SimplePredicate: (wholeFieldPredicate, postcode))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, surname), SimplePredicate: (suffixArray, suburb))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, suburb), SimplePredicate: (dayPredicate, date_of_birth))
first_name : daniel
surname : ryan
address_1 : windradyne street
address_2 : bishops creek
suburb : ferndale
postcode : 4558
state : vic
date_of_birth : 04/27/22
soc_sec_id : 6361655

first_name : daniel
surname : ryan
address_1 : windradyne street
address_2 : None
suburb : ferndale
postcode : 4558
state : vic
date_of_birth : 04/27/22
soc_sec_id : 6361655

7/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(PartialPredicate: (suffixArray, surname, Surname), SimplePredicate: (wholeFieldPredicate, postcode))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, surname), SimplePredicate: (suffixArray, suburb))
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, address_1), SimplePredicate: (exclusiveDayPredicate, date_of_birth))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, suburb), SimplePredicate: (dayPredicate, date_of_birth))
first_name : kierra
surname : gaylard
address_1 : namatjira drive
address_2 : yarrabee
suburb : hampstead gardens
postcode : 2143
state : nsw
date_of_birth : 12/24/40
soc_sec_id : 5464799

first_name : kieerc
surname : gaykard
address_1 : namatjira drive
address_2 : yarrabee
suburb : None
postcode : 2143
state : nsw
date_of_birth : 12/24/40
soc_sec_id : 5464799

8/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(PartialPredicate: (suffixArray, surname, Surname), SimplePredicate: (wholeFieldPredicate, postcode))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, surname), SimplePredicate: (suffixArray, suburb))
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, address_1), SimplePredicate: (exclusiveDayPredicate, date_of_birth))
INFO:dedupe.training:(PartialPredicate: (commonSixGram, first_name, Surname), SimplePredicate: (firstTokenPredicate, address_1))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, suburb), SimplePredicate: (dayPredicate, date_of_birth))
first_name : koula
surname : nan
address_1 : batchelor street
address_2 : None
suburb : tusmore
postcode : 3101
state : nsw
date_of_birth : 08/12/77
soc_sec_id : 9642747

first_name : nan
surname : nan
address_1 : batchelor street
address_2 : None
suburb : tusmore
postcode : 3101
state : nsw
date_of_birth : 08/12/77
soc_sec_id : 9642747

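8/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
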

 n


first_name : laura
surname : hilton
address_1 : savige street
address_2 : None
suburb : lansvale
postcode : 4132
state : nsw
date_of_birth : 04/18/24
soc_sec_id : 3338888

first_name : laura
surname : hilton
address_1 : savigestreet
address_2 : kerein hills
suburb : lansgae
postcode : 4132
state : nsw
date_of_birth : 04/18/24
soc_sec_id : 3338888

8/10 positive, 5/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


first_name : alexandra
surname : goldsworthy
address_1 : barnett close
address_2 : None
suburb : preston west
postcode : 2148
state : qld
date_of_birth : 11/20/53
soc_sec_id : 4659337

first_name : alexandra
surname : goldsworthy
address_1 : barnett close
address_2 : None
suburb : carlingrford
postcode : 2149
state : qld
date_of_birth : 11/20/53
soc_sec_id : 4659337

9/10 positive, 5/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (dayPredicate, date_of_birth), SimplePredicate: (fingerprint, surname))
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, address_1), SimplePredicate: (exclusiveDayPredicate, date_of_birth))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, suburb), SimplePredicate: (dayPredicate, date_of_birth))
first_name : nathan
surname : beattie
address_1 : namatjira drive
address_2 : None
suburb : bonython
postcode : 3030
state : nsw
date_of_birth : 03/27/72
soc_sec_id : 3403730

first_name : nan
surname : beatteie
address_1 : namatjira drive
address_2 : None
suburb : bontynh
postcode : 3030
state : nsw
date_of_birth : 03/27/72
soc_sec_id : 3403730

10/10 positive, 5/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


first_name : alysha
surname : simonuik
address_1 : brunswick circuit
address_2 : masonic retirement village
suburb : ariah park
postcode : 2107
state : None
date_of_birth : None
soc_sec_id : 9461009

first_name : alysha
surname : karrannis
address_1 : None
address_2 : None
suburb : werrnvton
postcode : 6053
state : None
date_of_birth : 10/11/75
soc_sec_id : 3072094

10/10 positive, 6/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 n


first_name : jasmine
surname : bihar
address_1 : faulkner place
address_2 : tantallon
suburb : loganlea
postcode : 3082
state : vic
date_of_birth : 09/13/12
soc_sec_id : 3660223

first_name : jasmine
surname : biharl
address_1 : faulkner place
address_2 : None
suburb : loganlea
postcode : 3082
state : vic
date_of_birth : 09/13/12
soc_sec_id : 3660223

10/10 positive, 7/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 y


first_name : alexander
surname : hawes
address_1 : stein place
address_2 : mulvra
suburb : mentone
postcode : 5330
state : vic
date_of_birth : 08/16/91
soc_sec_id : 7104389

first_name : alexander
surname : hawfs
address_1 : steinplace
address_2 : None
suburb : mentone
postcode : 5330
state : vic
date_of_birth : 08/16/91
soc_sec_id : 7104389

11/10 positive, 7/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


 f


Finished labeling
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (dayPredicate, date_of_birth), TfidfNGramSearchPredicate: (0.8, surname))
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, address_1), SimplePredicate: (exclusiveDayPredicate, date_of_birth))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, suburb), SimplePredicate: (dayPredicate, date_of_birth))
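
The labeling session above is an instance of active learning: rather than labeling pairs at random, the labeler is shown pairs that are expected to be most informative. A minimal, generic sketch of one such strategy (uncertainty sampling) follows. It is not `dedupe`'s internal algorithm, and the function name, ids, and scores are made up for illustration.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]

def most_uncertain_pair(scores: Dict[Pair, float]) -> Pair:
    """Return the pair whose match probability is closest to 0.5."""
    return min(scores, key=lambda pair: abs(scores[pair] - 0.5))

# Toy match probabilities from a hypothetical, partially trained model.
toy_scores = {
    ("a1", "b1"): 0.98,
    ("a2", "b2"): 0.55,
    ("a3", "b3"): 0.03,
}

next_to_label = most_uncertain_pair(toy_scores)
print(next_to_label)
```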


In [13]:
%%time
linker.train()

INFO:rlr.crossvalidation:using cross validation to find optimum alpha...
  matthews_cc = ((true_dupes * true_distinct
INFO:rlr.crossvalidation:optimum alpha: 0.000100, score -0.049184540457719465
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (dayPredicate, date_of_birth), TfidfNGramSearchPredicate: (0.8, surname))
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, address_1), SimplePredicate: (exclusiveDayPredicate, date_of_birth))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, suburb), SimplePredicate: (dayPredicate, date_of_birth))


CPU times: user 2.27 s, sys: 808 ms, total: 3.07 s
Wall time: 2.41 s


In [14]:
# Save training and settings to disk.
with open(TRAINING_FILE, "w") as fh:
    linker.write_training(fh)

with open(SETTINGS_FILE, "wb") as fh:
    linker.write_settings(fh)

## Examine Learned Blockers

Let's take a look at the predicates (blockers) that `dedupe` learned during our active learning labeling session.

In [15]:
linker.predicates

((SimplePredicate: (dayPredicate, date_of_birth),
  TfidfNGramSearchPredicate: (0.8, surname)),
 (SimplePredicate: (commonThreeTokens, address_1),
  SimplePredicate: (exclusiveDayPredicate, date_of_birth)),
 (SimplePredicate: (commonTwoTokens, suburb),
  SimplePredicate: (dayPredicate, date_of_birth)))
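
To build intuition for what a predicate does, the toy sketch below blocks on a single made-up rule: two records become a candidate pair only if they share the same date of birth. This is a simplified illustration in plain Python, not `dedupe`'s implementation; the record ids, the `date_of_birth_key` function, and the data are hypothetical.

```python
from collections import defaultdict
from itertools import product

toy_A = {"a1": {"date_of_birth": "06/16/42"}, "a2": {"date_of_birth": "03/18/10"}}
toy_B = {"b1": {"date_of_birth": "06/16/42"}, "b2": {"date_of_birth": "11/20/53"}}

def date_of_birth_key(record):
    """Hypothetical blocking key: the full date_of_birth string."""
    return record["date_of_birth"]

def block(records):
    """Group record ids by blocking key, skipping records without one."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        key = date_of_birth_key(record)
        if key is not None:
            blocks[key].append(record_id)
    return blocks

blocks_A, blocks_B = block(toy_A), block(toy_B)

# Only records that share a key are compared.
toy_candidates = [
    pair
    for key in blocks_A.keys() & blocks_B.keys()
    for pair in product(blocks_A[key], blocks_B[key])
]
print(toy_candidates)
```

Here four possible pairs shrink to one candidate pair, which is the same trade-off the learned predicates make at scale.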

Next, let's examine the resulting candidate pairs and look at our blocking efficiency.

In [16]:
candidate_pairs = list(linker.pairs(records_A, records_B))
print(f"{len(candidate_pairs):,} candidate pairs generated from blocking.")

2,366 candidate pairs generated from blocking.


You'll notice that, in contrast to `recordlinkage`, our post-blocking candidate pairs contain both the record ids and the record metadata.

In [17]:
candidate_pairs[0]

(('4880a2e2-43ab-4b1f-9a07-9a1791b30bf1',
  {'first_name': 'nan',
   'surname': 'highet',
   'street_number': '43',
   'address_1': 'winnecke street',
   'address_2': 'leumeah',
   'suburb': 'bucasia',
   'postcode': '4114',
   'state': 'wa',
   'date_of_birth': '09/07/33',
   'age': '28',
   'phone_number': '02 77098151',
   'soc_sec_id': '6707345'}),
 ('7d9b4dee-46f2-4a1d-9831-b048f0fd0317',
  {'first_name': 'nan',
   'surname': 'highet',
   'street_number': '43',
   'address_1': 'winnecke street',
   'address_2': 'leumeah',
   'suburb': 'bucasuva',
   'postcode': '4114',
   'state': 'wa',
   'date_of_birth': '09/07/33',
   'age': '28',
   'phone_number': '02 77098151',
   'soc_sec_id': '6707345'}))

In [18]:
df_candidate_links = pd.DataFrame(
    [(x[0][0], x[1][0]) for x in candidate_pairs]
).rename(columns={0 : "person_id_A", 1 : "person_id_B"}).set_index(["person_id_A", "person_id_B"])

df_candidate_links.head()

person_id_A,person_id_B
4880a2e2-43ab-4b1f-9a07-9a1791b30bf1,7d9b4dee-46f2-4a1d-9831-b048f0fd0317
2ebf8a1a-3507-4e20-b9f7-16cd665e173c,a41ae5af-297c-46a0-a05b-5be750221820
4abff4b0-682b-40db-9fd1-03f386e1eba4,4199e562-f339-4f24-969b-c0276a6bc00c
14628973-b2f2-4e6a-b496-ba0c2d760876,98d0d4c1-a3a1-4c18-8dfe-ab9a447be162
84fef202-cd86-4a4d-9d4d-5659a8241e9b,9ecea496-3f45-4058-99d4-00c7ff6650b4


In [19]:
max_candidate_pairs = df_A.shape[0]*df_B.shape[0]

print(f"{max_candidate_pairs:,} total possible pairs.")

# Calculate search space reduction.
search_space_reduction = round((1 - len(candidate_pairs)/max_candidate_pairs) * 100, 4)
print(f"\n{len(candidate_pairs):,} pairs after full blocking: {search_space_reduction}% search space reduction.")

# Calculate retained true links percentage.
total_true_links = df_ground_truth.shape[0]
true_links_after_blocking = pd.merge(
    df_ground_truth,
    df_candidate_links,
    left_index=True,
    right_index=True,
    how="inner"
).shape[0]

retained_true_link_percent = round((true_links_after_blocking/total_true_links) * 100, 2)
print(f"{retained_true_link_percent}% true links retained after full blocking.")

42,250,000 total possible pairs.

2,366 pairs after full blocking: 99.9944% search space reduction.
39.0% true links retained after full blocking.
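
These two measures are commonly called the reduction ratio and pairs completeness. As a worked check of the first, the sketch below recomputes it from the pair counts printed above and formats it as a percentage. The function name is illustrative.

```python
def reduction_ratio(candidate_pairs: int, total_pairs: int) -> float:
    """Fraction of the full comparison space removed by blocking."""
    return 1 - candidate_pairs / total_pairs

total_pairs = 42_250_000   # all possible A x B pairs, from the output above
candidates = 2_366         # pairs remaining after blocking, from the output above

ratio = reduction_ratio(candidates, total_pairs)
print(f"{ratio:.4%} of the search space removed")
```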


## Score Pairs and Examine Learned Classifier

After `dedupe` has trained blockers and a classification model based on our labeling session, we can block & classify the records in our training dataset.

In [20]:
%%time
linked_records = linker.join(records_A, records_B, threshold=0.0, constraint="one-to-one")

CPU times: user 5.6 s, sys: 570 ms, total: 6.17 s
Wall time: 16.5 s


`linker.join` returns a list of scored pairs, each of the form `((record_id_A, record_id_B), score)`.

In [21]:
linked_records[0:3]

[(('ffded31e-99e0-44c8-94ad-ce6f5bae1d03',
   '424ce16b-771f-437f-be7b-de52330d429a'),
  1.0),
 (('ff98793a-82b3-47d8-90da-68c9165e7b3f',
   '2cc658ab-ba1e-4fd1-aa36-e06d80b9ae50'),
  1.0),
 (('fd4d60aa-823f-4f9a-b2c7-23c5f67618c9',
   '8ddc1c44-7fe7-4228-97fe-d583fd77dabe'),
  1.0)]
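
Because the result is a plain list of `((id_A, id_B), score)` tuples, applying a score threshold is a simple filter. The sketch below uses made-up ids and scores, and the 0.5 cutoff is an arbitrary placeholder rather than a recommended value.

```python
toy_linked_records = [
    (("a1", "b1"), 1.0),
    (("a2", "b2"), 0.62),
    (("a3", "b3"), 0.04),
]

THRESHOLD = 0.5  # arbitrary placeholder, not a tuned value

# Keep only the id pairs whose score meets the threshold.
predicted_links = [ids for ids, score in toy_linked_records if score >= THRESHOLD]
print(predicted_links)
```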

In [22]:
df_predictions = pd.DataFrame(
    [ {"person_id_A" : x[0][0], "person_id_B" : x[0][1], "model_score" : x[1]} for x in linked_records]
)

df_predictions = df_predictions.set_index(["person_id_A", "person_id_B"])
df_predictions.head()

person_id_A,person_id_B,model_score
ffded31e-99e0-44c8-94ad-ce6f5bae1d03,424ce16b-771f-437f-be7b-de52330d429a,1.0
ff98793a-82b3-47d8-90da-68c9165e7b3f,2cc658ab-ba1e-4fd1-aa36-e06d80b9ae50,1.0
fd4d60aa-823f-4f9a-b2c7-23c5f67618c9,8ddc1c44-7fe7-4228-97fe-d583fd77dabe,1.0
fbfb5fe2-064a-40fd-9bba-5cf5aaf41570,4416630d-afc4-496c-af50-d454905c079b,1.0
fbaa0b32-f76f-40d8-b942-5b62f6927ff8,1c8d0c38-fd87-4e3d-b779-0a7315ff6507,1.0


In [23]:
df_predictions = pd.merge(
    df_predictions,
    df_ground_truth,
    left_index=True,
    right_index=True,
    how="left",
)

df_predictions["ground_truth"] = df_predictions["ground_truth"].fillna(False)
df_predictions

person_id_A,person_id_B,model_score,ground_truth
ffded31e-99e0-44c8-94ad-ce6f5bae1d03,424ce16b-771f-437f-be7b-de52330d429a,1.0,True
ff98793a-82b3-47d8-90da-68c9165e7b3f,2cc658ab-ba1e-4fd1-aa36-e06d80b9ae50,1.0,True
fd4d60aa-823f-4f9a-b2c7-23c5f67618c9,8ddc1c44-7fe7-4228-97fe-d583fd77dabe,1.0,True
fbfb5fe2-064a-40fd-9bba-5cf5aaf41570,4416630d-afc4-496c-af50-d454905c079b,1.0,True
fbaa0b32-f76f-40d8-b942-5b62f6927ff8,1c8d0c38-fd87-4e3d-b779-0a7315ff6507,1.0,True
...,...,...,...
e054386e-bbc7-4a5f-96f4-7ca4ff70599d,d123f546-de1b-4337-b0c1-bcda08540f12,0.0,True
cf0f9478-6355-474d-858c-df6aba3cbf69,6ed61212-8df8-4bdb-93a6-b11bba225984,0.0,False
7adff38f-4d0c-4c32-bbfb-41e8d6761797,04a5ee25-fb66-4772-9533-bbe590657074,0.0,False
342e1d61-c028-4d3c-944c-9db17bc26e42,a53859dd-259e-4f50-a3b9-44e76a55097e,0.0,True


In [None]:
# import random

# record_a_id = random.choice(list(records_A.keys()))
# record_a = records_A[record_a_id]
# record_b_id = random.choice(list(records_B.keys()))
# record_b = records_B[record_b_id]

In [None]:
# linker.score([((record_a_id, record_a), (record_b_id, record_b))])

## Choosing a Linking Model Score Threshold

### Model Score Distribution

In [24]:
tutorial.plot_model_score_distribution(df_predictions)

### Precision and Recall vs. Model Score

In [25]:
df_eval = tutorial.evaluate_linking(
    df=df_predictions
)

In [26]:
df_eval.head()

,threshold,tp,fp,tn,fn,precision,recall,f1
0,0.0,2337,9,0,0,0.996164,1.0,0.998078
1,0.020408,1799,2,7,538,0.99889,0.76979,0.869502
2,0.040816,1763,2,7,574,0.998867,0.754386,0.859581
3,0.061224,1735,2,7,602,0.998849,0.742405,0.851743
4,0.081633,1728,2,7,609,0.998844,0.739409,0.849766
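
Each row's precision, recall, and F1 follow directly from its `tp`, `fp`, and `fn` counts. As a check, the sketch below recomputes the row at threshold 0.020408 using the standard definitions. The function name is illustrative, and this is not necessarily how `tutorial.evaluate_linking` is implemented.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard definitions of precision, recall, and F1."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the row with threshold 0.020408.
precision, recall, f1 = precision_recall_f1(tp=1799, fp=2, fn=538)
print(round(precision, 6), round(recall, 6), round(f1, 6))
```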


In [27]:
tutorial.plot_precision_recall_vs_threshold(df_eval)

## Examine Scored Pairs

In [28]:
HEAD_SIZE = 10

In [29]:
display_cols = [
    "first_name", "surname",
    "street_number", "address_1", "address_2", "suburb", "postcode", "state",
    "date_of_birth", "age", "phone_number", "soc_sec_id",
]

display_cols = [[f"{col}_A", f"{col}_B"] for col in display_cols]
display_cols = list(itertools.chain.from_iterable(display_cols))

### Top Scoring Non-Links

In [30]:
df_top_scoring_negatives = df_predictions[
    df_predictions["ground_truth"] == False
][["model_score", "ground_truth"]].sort_values("model_score", ascending=False).head(n=HEAD_SIZE)

df_top_scoring_negatives = tutorial.augment_scored_pairs(df_top_scoring_negatives, df_A, df_B, score_column_name="model_score")

with pd.option_context('display.max_columns', None):
    display(df_top_scoring_negatives[["model_score", "ground_truth"] + display_cols])

person_id_A,person_id_B,model_score,ground_truth,first_name_A,first_name_B,surname_A,surname_B,street_number_A,street_number_B,address_1_A,address_1_B,address_2_A,address_2_B,suburb_A,suburb_B,postcode_A,postcode_B,state_A,state_B,date_of_birth_A,date_of_birth_B,age_A,age_B,phone_number_A,phone_number_B,soc_sec_id_A,soc_sec_id_B
85bfe847-3e35-4614-9e33-5426296bf0b4,3a5bc242-052e-4e72-a84c-afa66503f011,0.9999998,False,casey,tayla,dudley,dudley,101,603,weatherburn place,brooks sneret,,rowethorpe,mitcham,moruya,2039,3324,vic,qld,07/02/09,07/02/09,,13.0,02 84773710,08 29906509,9415805,5143854
683b8ab6-cbf8-4f2b-9d4e-e78d0c5adaf4,7b1563b9-0e92-43cf-93eb-35b77c5402c4,0.9994612,False,alyssa,kat,matthews,matthews,3,99,rohan rivett crescent,sid barnes crescent,,backwoodlands,broadmeadows,brighton-le-sands,3128,4810,nsw,,05/25/03,05/25/03,41.0,32.0,07 93040742,,9383564,8490430
8dab336a-aa76-48a3-a364-e5b8d4a8edc6,23440d94-7adf-461b-87ac-c63ff5b84a07,0.0003951516,False,charlie,isaac,matthews,matthews,54,19,campbell street,zeitz court,,aldinga beach court,st clair,barooga,6154,6100,vic,nsw,03/29/14,03/29/14,,41.0,08 32595673,04 90686717,7252580,1566855
fde60053-2d86-4254-85ba-f015662215cd,b03bda5b-2cae-4c56-a628-f960323756f1,1.115634e-25,False,nathan,jessica,,,39,64,outtrim avenue,,,burraroo,corrimal,south brisbane,2594,3860,nsw,qld,04/10/63,04/10/63,31.0,19.0,,04 65736008,7417565,7647740
fe70f9f7-b098-42bb-bfd5-536410ad1379,f69a2ad7-aa99-46e4-90ff-ef119fe648f9,0.0,False,daniele,claudia,carbone,harrijhyon,12,1,ross smith crescent,ross smith crescent,,,paralowie,simpson,3195,2079,,nsw,04/19/45,08/19/19,22.0,,07 01658686,04 28056742,3410983,3404962
e4832a5e-8b3f-4862-882d-8a5b2bb06bc8,a739b8c3-f018-4565-b720-05044404d01a,0.0,False,kelsey,allegra,morrison,stanley,3,21,clive steele avenue,clive steele avenue,avenal,,seaforth,croydonk north,4207,5109,nsw,qld,02/17/96,12/17/17,34.0,,02 31262797,08 14555372,2422208,2525959
cf0f9478-6355-474d-858c-df6aba3cbf69,6ed61212-8df8-4bdb-93a6-b11bba225984,0.0,False,,katelin,scheringer,rosa,745,20,newman morris circuit,newman morris circuit,coralyn,,sunshine north,rosebud,3158,5000,wa,vic,09/01/69,07/01/65,,32.0,02 06166214,07 13939593,7738996,6612876
7adff38f-4d0c-4c32-bbfb-41e8d6761797,04a5ee25-fb66-4772-9533-bbe590657074,0.0,False,alannah,sol,broadby,heuer,31,169,lewis luxton avenue,lewis luxton avenue,waratana,,morwell,toowoomba,3186,4405,nsw,vic,01/12/61,11/12/23,23.0,,03 08301065,08 48628122,8623146,7821247
10af4f35-20bd-419a-8acb-f1342da64918,7738b53b-a119-4e76-a110-ecdb8fceb104,0.0,False,larissa,emma,garcia,ciotti,89,73,la perouse street,la perouse street,villa 5,,burleigh heads,kinga foy,7051,3068,qld,vic,12/09/21,09/09/09,32.0,33.0,03 61551764,04 11698410,4634634,4345599


### Lowest Scoring True Links

In [31]:
df_lowest_scoring_positives = df_predictions[
    df_predictions["ground_truth"] == True
][["model_score", "ground_truth"]].sort_values("model_score").head(n=HEAD_SIZE)

df_lowest_scoring_positives = tutorial.augment_scored_pairs(df_lowest_scoring_positives, df_A, df_B, score_column_name="model_score")

with pd.option_context('display.max_columns', None):
    display(df_lowest_scoring_positives[["model_score", "ground_truth"] + display_cols])

person_id_A,person_id_B,model_score,ground_truth,first_name_A,first_name_B,surname_A,surname_B,street_number_A,street_number_B,address_1_A,address_1_B,address_2_A,address_2_B,suburb_A,suburb_B,postcode_A,postcode_B,state_A,state_B,date_of_birth_A,date_of_birth_B,age_A,age_B,phone_number_A,phone_number_B,soc_sec_id_A,soc_sec_id_B
342e1d61-c028-4d3c-944c-9db17bc26e42,a53859dd-259e-4f50-a3b9-44e76a55097e,0.0,True,ruby,george,george,ruby,12,12,captain cook crescent,captain cook crescent,,,pascoe vale south,pascoe vale south,4078,4078,nsw,nsw,10/26/11,12/26/12,20.0,20.0,03 10457848,03 10457848,9420195,9420195
e054386e-bbc7-4a5f-96f4-7ca4ff70599d,d123f546-de1b-4337-b0c1-bcda08540f12,0.0,True,crystal,sachin,george,george,21,2,blair street,blair street,canberra west,farm shed,coffs harbour,coffs harbour,3616,3616,vic,vij,11/23/55,11/23/55,11.0,11.0,07 27974382,07 27974382,1561548,1561548
13743517-3138-4147-839b-1b4794f8c60c,e34a5f0d-9b16-402a-9c05-a506136a5cc2,7.3386e-42,True,neneh,neneh,schumann,schumann,104,104,mitta place,,break-o-day,mitta place,dandenong north,dandenong north,6450,6450,nsw,nsw,06/28/63,06/28/63,10.0,10.0,07 65238388,07 65238388,1566591,1565291
d1d58d7a-d04d-4dc2-bdc7-54d7fa7621d7,21b1a719-6f1a-430c-853b-c38fff3ccf2e,3.248602e-40,True,oliver,olievr,hawes,hawes,95,95,howard street,howard street,bowan brae,burraroo,mill park,mill zpank,3199,3199,,,03/06/91,03/06/91,26.0,26.0,04 35969025,04 35969025,9392322,9392322
e6bb45aa-cb3a-419c-a451-019004fcb172,06754fb8-d6d8-4073-b471-804032b4dded,9.023493000000001e-39,True,david,,brock,brock,77,77,lawrence wackett crescent,lawrence wackett crescent,palm lake resort,maroochy river cvn park,cremorne,cremorne,3875,3875,sa,sa,08/24/61,08/24/61,12.0,21.0,03 01688358,03 01688358,8921176,8921176
0b01ff1f-ecd0-4d94-a6d5-1ffbe53712d9,26cc6aaa-834a-4462-825b-ad3424063d47,1.020727e-36,True,liam,liuo,couzens,couzens,56,56,abercrombie circuit,abercrombie circuit,villa 3,rowethorpe,thornbury,thornbury,3138,3138,,,01/19/43,01/19/43,31.0,31.0,07 97412698,07 97412698,6548859,6541859
61d57051-032f-4853-b0ed-01b6133072cc,9ccf4b5f-b3cb-40cb-8df2-45680c4e9751,1.875579e-36,True,silas,siloas,ilmberger,ilmberger,22,22,badimara street,badimara street,brentwood vlge,vlge brentwod,parkdale,parkdale,2117,2117,vic,vic,06/28/56,06/28/56,19.0,19.0,03 32498174,03 32498174,1577506,1577508
e3795280-6ee7-4767-ae79-62ec1349128e,eb60f359-530d-439f-82de-222235a1cfdc,2.602759e-35,True,zachary,zachary,george,george,265,267,tindall place,tindalld place,wattle brae,bega lats,lang lang,lang lang,2602,2602,vic,vic,09/18/45,09/18/45,,,02 99451938,02 99451938,2167938,2167938
5824d377-0ef6-40db-9d3f-86bf2e6173f7,66f9da1c-f454-45a0-85ad-b83e1b4e8f6e,2.479958e-33,True,callum,,rothe,callun,29,29,torrens street,torrenssceeet,fleetview,fleetview,bedford park,bedford park,2759,2655,vic,vic,10/17/61,10/17/61,30.0,30.0,07 16808853,07 16808853,6130642,6130642
b92f9308-6000-46a8-a246-22114b3effdb,7dda3f1a-11b2-463a-92f1-22d3eaa140de,4.2768780000000004e-33,True,mitchell,sophino,,,35,35,,,winchester,winchestaer,booragul,booragul,2770,2770,sa,sa,01/14/62,01/14/62,13.0,13.0,,,3487568,3487268
