<a href="https://colab.research.google.com/github/tripuragorla/CMPE-297-Assignments/blob/main/Assignment%207(Makeup)/activeLearning_end2end.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Use Active Learning to Link FEBRL People Data

<a href="https://colab.research.google.com/github/rachhouse/intro-to-data-linking/blob/main/tutorial_notebooks/03_Link_FEBRL_Data_with_Active_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>

In this tutorial, we'll use the [dedupe library](https://github.com/dedupeio/dedupe) to experiment with an active learning approach to linking our FEBRL people datasets.

Once again, we'll use the same training dataset and evaluation functions as the SimSum classification tutorial; these have been included in a separate `.py` file for re-use and convenience, so we can focus on code unique to this tutorial.

## Google Colab Setup

In [4]:
# Check if we're running locally, or in Google Colab.
try:
    import google.colab
    COLAB = True
except ModuleNotFoundError:
    COLAB = False
    
# If we're running in Colab, download the tutorial functions file 
# to the Colab session local directory, and install required libraries.
if COLAB:
    import requests
    
    tutorial_functions_url = "https://raw.githubusercontent.com/rachhouse/intro-to-data-linking/main/tutorial_notebooks/linking_tutorial_functions.py"
    r = requests.get(tutorial_functions_url)
    
    with open("linking_tutorial_functions.py", "w") as fh:
        fh.write(r.text)
    
    !pip install -q altair dedupe dedupe-variable-name jellyfish recordlinkage 

In [3]:
pip install -U numpy 

Collecting numpy
  Downloading numpy-1.21.4-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
     |████████████████████████████████| 15.7 MB 5.5 MB/s
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.19.5
    Uninstalling numpy-1.19.5:
      Successfully uninstalled numpy-1.19.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.3.post1 requires numpy<1.20,>=1.16.0, but you have numpy 1.21.4 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed numpy-1.21.4


In [1]:
import datetime
import itertools
import os
import pathlib
import re
from typing import Any, Dict, Optional

import dedupe
import pandas as pd

import linking_tutorial_functions as tutorial

INFO:root:Generating grammar tables from /usr/lib/python3.7/lib2to3/Grammar.txt
INFO:root:Generating grammar tables from /usr/lib/python3.7/lib2to3/PatternGrammar.txt


## Define Working Filepaths

For convenience, we'll define a `pathlib.Path` to reference our current working directory.

In [2]:
WORKING_DIR = pathlib.Path(os.path.abspath(''))
WORKING_DIR

PosixPath('/content')

## Load Training Dataset and Ground Truth Labels

In [5]:
df_A, df_B, df_ground_truth = tutorial.load_febrl_training_data(COLAB)

Let's take a quick look at our training dataset to refresh on the columns, formats, and data.

In [6]:
df_A.head()

| person_id_A | first_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | age | phone_number | soc_sec_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| fbc4143d-15f9-4f27-b5f0-dedbadce6616 | matilda | struck | 8 | ballard place | | west perth | 2470 | qld | 19611002 | 32.0 | 03 05903135 | 8276847 |
| 48a56cad-7ba6-45e1-97cd-517ba65bdab5 | lachlan | eglinton | 36 | kambalda crescent | villa 427 | auburn | 5109 | | 19260108 | 27.0 | | 9937958 |
| b1792d21-e4be-4b86-8dea-454ffa5194c5 | mikayla | asher | 588 | britten-jones drive | | miami | 4218 | nsw | 19251102 | 32.0 | 03 33770501 | 7017310 |
| 96653d73-bebc-4459-94f3-c3f0a8c514d4 | grace | bristow | 7 | | wandella park snowy | cardiff | 6163 | nsw | 19400120 | | 07 37864073 | 3535974 |
| 41f038b8-77c0-45a5-9e1f-e62b8637ffd1 | wilson | bishop | 11 | chisholm street | | bronte | 2490 | nsw | 19210305 | 27.0 | 04 15209769 | 5573522 |


## Data Augmentation

We'll do minimal data augmentation before feeding our training data to `dedupe`; we just want to format the date of birth data as `mm/dd/yy`, and ensure all columns are in string format and stripped of trailing/leading whitespace. Additionally, `dedupe` requires input data to be in dictionaries, using the record id as the key and the record metadata as the value. So, we'll convert our dataframes to this format.

In [7]:
def format_dob(dob: str) -> Optional[str]:
    """ Transform date of birth format from YYYYMMDD to mm/dd/yy.
        If DOB cannot be transformed, return None.
    """
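    try:
        # fullmatch requires the whole string to be exactly eight digits.
        if re.fullmatch(r"\d{8}", dob):
            return datetime.datetime.strptime(dob, "%Y%m%d").strftime("%m/%d/%y")
    except (TypeError, ValueError):
        # TypeError: dob is None or not a string.
        # ValueError: eight digits that are not a real calendar date.
        pass

    return None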

def strip_and_null(x: Any) -> Optional[str]:
    """ Stringify incoming variable, remove trailing/leading whitespace
        and return resulting string. Return None if resulting string is empty.
    """
    x = str(x).strip()
    
    if x == "":
        return None
    else:
        return x
    
def convert_df_to_dict(df: pd.DataFrame) -> Dict[str, Dict]:
    """ Convert pandas DataFrame to dict keyed by record id.
        Convert all fields to strings or Nones to satisfy dedupe.
        Transform date format of date_of_birth field.
    """    

    # Work on a copy so the caller's DataFrame is not modified in place.
    df = df.copy()

    for col in df.columns:
        df[col] = df[col].apply(strip_and_null)

    df["date_of_birth"] = df["date_of_birth"].apply(format_dob)

    return df.to_dict("index")

In [8]:
records_A = convert_df_to_dict(df_A)
records_B = convert_df_to_dict(df_B)

We can examine a small sample of the resulting transformed records:

In [9]:
[records_A[k] for k in list(records_A.keys())[0:2]]

[{'address_1': 'ballard place',
  'address_2': None,
  'age': '32',
  'date_of_birth': '10/02/61',
  'first_name': 'matilda',
  'phone_number': '03 05903135',
  'postcode': '2470',
  'soc_sec_id': '8276847',
  'state': 'qld',
  'street_number': '8',
  'suburb': 'west perth',
  'surname': 'struck'},
 {'address_1': 'kambalda crescent',
  'address_2': 'villa 427',
  'age': '27',
  'date_of_birth': '01/08/26',
  'first_name': 'lachlan',
  'phone_number': None,
  'postcode': '5109',
  'soc_sec_id': '9937958',
  'state': None,
  'street_number': '36',
  'suburb': 'auburn',
  'surname': 'eglinton'}]
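
One property of the `mm/dd/yy` format is worth keeping in mind: it drops the century. The short sketch below round-trips the second record's date of birth using only Python's standard `datetime` module. It says nothing about how `dedupe`'s `DateTime` comparator resolves two-digit years; it only illustrates what information the string itself no longer carries.

```python
import datetime

# Round-trip a FEBRL-style date of birth through the two-digit-year format.
original = datetime.datetime.strptime("19260108", "%Y%m%d")
short = original.strftime("%m/%d/%y")
reparsed = datetime.datetime.strptime(short, "%m/%d/%y")

print(short)          # 01/08/26
print(original.year)  # 1926
print(reparsed.year)  # 2026, because strptime maps two-digit years 00-68 to 2000-2068
```

If the century matters for your data, a four-digit year format such as `%m/%d/%Y` would avoid the ambiguity.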

## Prepare Training

When we linked our data via SimSum and supervised learning, we defined our blockers and comparators manually with `recordlinkage`. The `dedupe` library takes an active learning approach to blocking and classification and will use our feedback gathered during the labeling session to learn blocking rules and train a classifier. 

To prepare our `dedupe.RecordLink` object for training, first we'll define the fields that we think `dedupe` should pay attention to when matching records - these definitions will serve as the comparators. The `field` contains the name of the attribute to use for comparison, and the `type` defines the comparison type.

In [10]:
%%time

fields = [
    { "field" : "first_name", "type" : "Name" },
    { "field" : "surname", "type" : "Name" },
    { "field" : "address_1", "type" : "ShortString" },
    { "field" : "address_2", "type" : "ShortString" },
    { "field" : "suburb", "type" : "ShortString" },
    { "field" : "postcode", "type" : "Exact" },
    { "field" : "state", "type" : "Exact" },
    { "field" : "date_of_birth", "type" : "DateTime" },
    { "field" : "soc_sec_id", "type" : "Exact" },
]

linker = dedupe.RecordLink(fields)
linker.prepare_training(records_A, records_B)

INFO:dedupe.canopy_index:Removing stop word re
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (commonThreeTokens, address_2)


CPU times: user 49.7 s, sys: 777 ms, total: 50.5 s
Wall time: 50 s
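
To build some intuition for the difference between an exact comparator and a string-similarity comparator, here is a minimal sketch. It is not `dedupe`'s internal distance function: `difflib.SequenceMatcher` from the standard library is used as a stand-in, and the helper names `exact_compare` and `fuzzy_compare` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional


def exact_compare(a: Optional[str], b: Optional[str]) -> Optional[int]:
    """Return 1 if both values are present and identical, 0 if they differ,
    and None if either value is missing."""
    if a is None or b is None:
        return None
    return int(a == b)


def fuzzy_compare(a: Optional[str], b: Optional[str]) -> Optional[float]:
    """Return a similarity in [0, 1] (stand-in metric, not dedupe's own),
    or None if either value is missing."""
    if a is None or b is None:
        return None
    return SequenceMatcher(None, a, b).ratio()


# Values taken from a pair of records that differ only by typos.
postcode_score = exact_compare("2470", "2407")
name_score = fuzzy_compare("matilda", "matikda")
missing_score = fuzzy_compare("ballard place", None)

print(postcode_score)  # 0: a single transposed digit makes an exact comparison fail
print(name_score)      # high, but below 1.0
print(missing_score)   # None
```

An exact comparator treats any typo as a complete mismatch, which is why it suits fields like `state`, while name and address fields benefit from a graded similarity.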


## Active Learning Labeling Session!

At this point, we're ready to provide feedback to `dedupe` via an active learning labeling session. For this, `dedupe` supplies a convenience method to iterate through pairs it is uncertain about. As you provide feedback for each pair, dedupe learns blocking rules and recalculates its linking model weights.

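You can use `y` (yes, match), `n` (no, not a match), and `u` (unsure) to provide feedback on candidate links. After the first pair, `p` returns to the previous pair so you can change your answer. When you're ready to exit the labeling session, use `f`. Only these single letters are accepted; on any other input the prompt is simply shown again.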
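
The general idea behind showing "uncertain" pairs first can be sketched as uncertainty sampling: ask the human about the pair whose current match probability is closest to 0.5. This is only an illustration of the concept with made-up scores and a hypothetical helper name; `dedupe`'s own pair-selection strategy is more involved and is not reproduced here.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def most_uncertain_pair(scores: Dict[Pair, float]) -> Pair:
    """Return the candidate pair whose match probability is closest to 0.5."""
    return min(scores, key=lambda pair: abs(scores[pair] - 0.5))


# Hypothetical model scores for three candidate pairs.
scores = {
    ("a1", "b1"): 0.97,  # model is already confident this is a match
    ("a2", "b2"): 0.52,  # model cannot decide: most informative to label
    ("a3", "b3"): 0.08,  # model is already confident this is not a match
}

next_pair = most_uncertain_pair(scores)
print(next_pair)  # ('a2', 'b2')
```

Labeling the pairs a model is least sure about tends to improve it faster than labeling pairs it already scores confidently.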

In [11]:
dedupe.console_label(linker)

first_name : kiandra
surname : dunstone
address_1 : None
address_2 : None
suburb : oaklands park
postcode : 6163
state : wa
date_of_birth : 10/29/11
soc_sec_id : 5277244

first_name : kiandra
surname : dunstone
address_1 : None
address_2 : None
suburb : oaklands park
postcode : 6163
state : wa
date_of_birth : None
soc_sec_id : 5277244

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


y


first_name : isabella
surname : brain
address_1 : key street
address_2 : None
suburb : condon
postcode : 2323
state : wa
date_of_birth : 02/09/92
soc_sec_id : 2590964

first_name : isabellw
surname : brain
address_1 : None
address_2 : None
suburb : condon
postcode : 2323
state : wa
date_of_birth : 02/09/92
soc_sec_id : 2590964

1/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
first_name : fyynlay
surname : wangel
address_1 : None
address_2 : None
suburb : rosebud
postcode : 4500
state : qld
date_of_birth : 05/16/58
soc_sec_id : 5725414

first_name : sofia
surname : webc
address_1 : archelplace
address_2 : None
suburb : rosebud
postcode : 2211
state : nsv
date_of_birth : None
soc_sec_id : 3201434

2/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : jayden
surname : white
address_1 : phillip avenue
address_2 : None
suburb : toodyay
postcode : 3021
state : vic
date_of_birth : 12/10/85
soc_sec_id : 5956709

first_name : niamh
surname : nan
address_1 : phillip avenue
address_2 : None
suburb : None
postcode : 3939
state : vic
date_of_birth : None
soc_sec_id : 8913923

3/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, suburb)
first_name : niamh
surname : nan
address_1 : phillip avenue
address_2 : parlour mountain
suburb : None
postcode : 3939
state : vic
date_of_birth : 11/17/99
soc_sec_id : 8913923

first_name : niamh
surname : nan
address_1 : phillip avenue
address_2 : None
suburb : None
postcode : 3939
state : vic
date_of_birth : None
soc_sec_id : 8913923

3/10 positive, 1/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


u


first_name : harry
surname : jolly
address_1 : vidal street
address_2 : stockland cairns
suburb : wahroonga
postcode : 6062
state : nsw
date_of_birth : 04/16/13
soc_sec_id : 3416923

first_name : alex
surname : mcgrzehor
address_1 : None
address_2 : the summerdhouse
suburb : ermington
postcode : 2156
state : nsw
date_of_birth : None
soc_sec_id : 6348428

3/10 positive, 1/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


p


first_name : niamh
surname : nan
address_1 : phillip avenue
address_2 : parlour mountain
suburb : None
postcode : 3939
state : vic
date_of_birth : 11/17/99
soc_sec_id : 8913923

first_name : niamh
surname : nan
address_1 : phillip avenue
address_2 : None
suburb : None
postcode : 3939
state : vic
date_of_birth : None
soc_sec_id : 8913923

3/10 positive, 1/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


y


first_name : harry
surname : jolly
address_1 : vidal street
address_2 : stockland cairns
suburb : wahroonga
postcode : 6062
state : nsw
date_of_birth : 04/16/13
soc_sec_id : 3416923

first_name : alex
surname : mcgrzehor
address_1 : None
address_2 : the summerdhouse
suburb : ermington
postcode : 2156
state : nsw
date_of_birth : None
soc_sec_id : 6348428

4/10 positive, 1/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, suburb)
first_name : lauren
surname : de lucia
address_1 : mccawley street
address_2 : coolalie
suburb : bunya mountain
postcode : 5152
state : vic
date_of_birth : 08/10/55
soc_sec_id : 7275131

first_name : tristan
surname : nan
address_1 : None
address_2 : dookanooka
suburb : spencer park
postcode : 3621
state : nsw
date_of_birth : None
soc_sec_id : 4894286

4/10 positive, 2/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


first_name : abbey
surname : ryan
address_1 : maccallum circuit
address_2 : None
suburb : o'connor
postcode : 2113
state : None
date_of_birth : 02/26/05
soc_sec_id : 9364158

first_name : abbey
surname : ryann
address_1 : maccallum circuit
address_2 : None
suburb : o'conkor
postcode : 2113
state : None
date_of_birth : 02/26/05
soc_sec_id : 9346158

4/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : nicole
surname : bradshaw
address_1 : lander crescent
address_2 : None
suburb : None
postcode : 4061
state : wa
date_of_birth : 08/09/41
soc_sec_id : 7863588

first_name : nicole
surname : bradwamw
address_1 : lander crescent
address_2 : None
suburb : None
postcode : 6195
state : wa
date_of_birth : 08/09/41
soc_sec_id : 7865388

5/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, suburb)
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, address_1)
first_name : mia
surname : hignett
address_1 : treacy place
address_2 : None
suburb : None
postcode : 3824
state : nsw
date_of_birth : 06/21/10
soc_sec_id : 8964765

first_name : mia
surname : hignehg
address_1 : treac plce
address_2 : None
suburb : None
postcode : 3824
state : nsw
date_of_birth : 06/21/10
soc_sec_id : 8964765

6/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


first_name : brianna
surname : boyes
address_1 : mcinnes street
address_2 : None
suburb : None
postcode : 2680
state : nsw
date_of_birth : 08/01/27
soc_sec_id : 4306495

first_name : brianna
surname : boye
address_1 : xrreet mcfinnes
address_2 : None
suburb : None
postcode : 2680
state : nsw
date_of_birth : 08/01/27
soc_sec_id : 4306495

6/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


first_name : daniel
surname : van akker
address_1 : None
address_2 : None
suburb : biggera waters
postcode : 4032
state : wa
date_of_birth : 01/17/08
soc_sec_id : 8117255

first_name : dann
surname : van akker
address_1 : None
address_2 : None
suburb : biggera wayers
postcode : 4032
state : wr
date_of_birth : None
soc_sec_id : 8117255

6/10 positive, 5/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


u


first_name : pascale
surname : pagden
address_1 : lowrie street
address_2 : rosehill
suburb : rivett
postcode : 2480
state : None
date_of_birth : 05/13/54
soc_sec_id : 4011202

first_name : pascale
surname : pagden
address_1 : None
address_2 : rosehill
suburb : rivef
postcode : 2480
state : None
date_of_birth : 06/23/54
soc_sec_id : 4011202

6/10 positive, 5/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


p


first_name : daniel
surname : van akker
address_1 : None
address_2 : None
suburb : biggera waters
postcode : 4032
state : wa
date_of_birth : 01/17/08
soc_sec_id : 8117255

first_name : dann
surname : van akker
address_1 : None
address_2 : None
suburb : biggera wayers
postcode : 4032
state : wr
date_of_birth : None
soc_sec_id : 8117255

6/10 positive, 5/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


n


first_name : pascale
surname : pagden
address_1 : lowrie street
address_2 : rosehill
suburb : rivett
postcode : 2480
state : None
date_of_birth : 05/13/54
soc_sec_id : 4011202

first_name : pascale
surname : pagden
address_1 : None
address_2 : rosehill
suburb : rivef
postcode : 2480
state : None
date_of_birth : 06/23/54
soc_sec_id : 4011202

6/10 positive, 6/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : louis
surname : traforti
address_1 : barnett close
address_2 : tilburoo
suburb : pacific paradise
postcode : 2518
state : vic
date_of_birth : 05/12/38
soc_sec_id : 1913191

first_name : louis
surname : trafcrati
address_1 : None
address_2 : tilburoo
suburb : None
postcode : 2518
state : vic
date_of_birth : 05/12/38
soc_sec_id : 1913191

7/10 positive, 6/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


f


Finished labeling
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
INFO:dedupe.training:SimplePredicate: (dayPredicate, date_of_birth)
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, suburb)


We can now train our linker, based on the labeling session feedback.

In [12]:
%%time
linker.train()

INFO:rlr.crossvalidation:using cross validation to find optimum alpha...
  * (true_distinct + false_distinct)))
INFO:rlr.crossvalidation:optimum alpha: 1.000000, score 0.15210872754635707
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, suburb)
INFO:dedupe.training:(SimplePredicate: (fingerprint, address_1), PartialIndexLevenshteinSearchPredicate: (1, surname, Surname), SimplePredicate: (suffixArray, first_name))
INFO:dedupe.training:(SimplePredicate: (doubleMetaphone, address_2), PartialPredicate: (metaphoneToken, first_name, Surname), PartialIndexLevenshteinSearchPredicate: (2, surname, Surname))
INFO:dedupe.training:(SimplePredicate: (fingerprint, address_1), SimplePredicate: (commonSixGram, first_name), PartialIndexLevenshteinSearchPredicate: (1, first_name, Surname))


CPU times: user 4.69 s, sys: 592 ms, total: 5.28 s
Wall time: 4.76 s


Let's persist our training data (captured during the labeling session), as well as the learned model weights.

In [13]:
ACTIVE_LEARNING_DIR = WORKING_DIR / "dedupe_active_learning"
ACTIVE_LEARNING_DIR.mkdir(parents=True, exist_ok=True)

SETTINGS_FILE = ACTIVE_LEARNING_DIR / "dedupe_learned_settings"
TRAINING_FILE = ACTIVE_LEARNING_DIR / "dedupe_training.json"

with open(TRAINING_FILE, "w") as fh:
    linker.write_training(fh)
    
with open(SETTINGS_FILE, "wb") as sf:
    linker.write_settings(sf)

## Examine Learned Blockers

Now, let's take a look at the predicates (blockers) that `dedupe` learned during our active learning labeling session. Note that `dedupe` can learn composite predicates/blockers, i.e. individual predicates can be combined with logical operators.

In [14]:
linker.predicates

(SimplePredicate: (wholeFieldPredicate, suburb),
 (SimplePredicate: (fingerprint, address_1),
  PartialIndexLevenshteinSearchPredicate: (1, surname, Surname),
  SimplePredicate: (suffixArray, first_name)),
 (SimplePredicate: (doubleMetaphone, address_2),
  PartialPredicate: (metaphoneToken, first_name, Surname),
  PartialIndexLevenshteinSearchPredicate: (2, surname, Surname)),
 (SimplePredicate: (fingerprint, address_1),
  SimplePredicate: (commonSixGram, first_name),
  PartialIndexLevenshteinSearchPredicate: (1, first_name, Surname)))
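
To make the idea of a blocking predicate concrete, here is a minimal pure-Python sketch of a whole-field blocker on `suburb`: two records become a candidate pair only if they share exactly the same field value. The function name `whole_field_pairs` and the toy records are hypothetical, and this is not how `dedupe` implements its predicates internally.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, Optional, Set, Tuple

Records = Dict[str, Dict[str, Optional[str]]]


def whole_field_pairs(records_a: Records, records_b: Records, field: str) -> Set[Tuple[str, str]]:
    """Pair up records from A and B that share an identical, non-missing field value."""
    blocks = defaultdict(lambda: ([], []))  # block key -> (ids from A, ids from B)

    for record_id, record in records_a.items():
        if record.get(field) is not None:
            blocks[record[field]][0].append(record_id)

    for record_id, record in records_b.items():
        if record.get(field) is not None:
            blocks[record[field]][1].append(record_id)

    pairs = set()
    for ids_a, ids_b in blocks.values():
        # Only records inside the same block are ever compared.
        pairs.update(product(ids_a, ids_b))
    return pairs


toy_a = {"a1": {"suburb": "rosebud"}, "a2": {"suburb": "condon"}, "a3": {"suburb": None}}
toy_b = {"b1": {"suburb": "rosebud"}, "b2": {"suburb": "auburn"}}

toy_pairs = whole_field_pairs(toy_a, toy_b, "suburb")
print(toy_pairs)                                  # {('a1', 'b1')}
print(len(toy_pairs), "of", len(toy_a) * len(toy_b), "possible pairs kept")
```

Note that the record with a missing `suburb` never lands in any block, which is one way true links can be lost during blocking.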

Next, let's examine the resulting candidate pairs and look at our blocking efficiency. The `.pairs` method will give us all candidate record pairs that are generated by blocking with the learned blockers.

In [15]:
candidate_pairs = [x for x in linker.pairs(records_A, records_B)]
print(f"{len(candidate_pairs):,} candidate pairs generated from blocking.")

10,820 candidate pairs generated from blocking.


You'll notice that, in contrast to `recordlinkage`, our post-blocking candidate pairs contain both the record ids and the record metadata.

In [16]:
candidate_pairs[0]

(('fbc4143d-15f9-4f27-b5f0-dedbadce6616',
  {'address_1': 'ballard place',
   'address_2': None,
   'age': '32',
   'date_of_birth': '10/02/61',
   'first_name': 'matilda',
   'phone_number': '03 05903135',
   'postcode': '2470',
   'soc_sec_id': '8276847',
   'state': 'qld',
   'street_number': '8',
   'suburb': 'west perth',
   'surname': 'struck'}),
 ('a9f5a761-83d6-452e-9f27-a452b3d06a4e',
  {'address_1': 'ballard place',
   'address_2': None,
   'age': '32',
   'date_of_birth': '10/02/61',
   'first_name': 'matikda',
   'phone_number': '03 05903135',
   'postcode': '2407',
   'soc_sec_id': '8276847',
   'state': 'qld',
   'street_number': '0',
   'suburb': 'west perth',
   'surname': 'strucl'}))

We can assemble our candidate pair ids into an indexed pandas dataframe for easier comparison with our known true links.

In [17]:
df_candidate_links = pd.DataFrame(
    [(x[0][0], x[1][0]) for x in candidate_pairs]
).rename(columns={0 : "person_id_A", 1 : "person_id_B"}).set_index(["person_id_A", "person_id_B"])

df_candidate_links.head()

person_id_A,person_id_B
fbc4143d-15f9-4f27-b5f0-dedbadce6616,a9f5a761-83d6-452e-9f27-a452b3d06a4e
fbc4143d-15f9-4f27-b5f0-dedbadce6616,afbb48d7-b06e-40f3-8023-7c3852a6ad5d
fbc4143d-15f9-4f27-b5f0-dedbadce6616,aec70f38-c323-4d7d-9394-0488c22676a6
48a56cad-7ba6-45e1-97cd-517ba65bdab5,51bf6fb2-ec9e-415f-91d4-463a0f42fb08
48a56cad-7ba6-45e1-97cd-517ba65bdab5,0507647b-8ca2-41f2-8cbf-96f90cad647c


Now, let's take a look at our learned blocker performance.

In [18]:
max_candidate_pairs = df_A.shape[0]*df_B.shape[0]

print(f"{max_candidate_pairs:,} total possible pairs.")

# Calculate search space reduction.
search_space_reduction = round((1 - len(candidate_pairs)/max_candidate_pairs) * 100, 4)
print(f"\n{len(candidate_pairs):,} pairs after full blocking: {search_space_reduction}% search space reduction.")

# Calculate retained true links percentage.
total_true_links = df_ground_truth.shape[0]
true_links_after_blocking = pd.merge(
    df_ground_truth,
    df_candidate_links,
    left_index=True,
    right_index=True,
    how="inner"
).shape[0]

retained_true_link_percent = round((true_links_after_blocking/total_true_links) * 100, 2)
print(f"{retained_true_link_percent}% true links retained after blocking.")

10,562,500 total possible pairs.

10,820 pairs after full blocking: 99.8976% search space reduction.
77.7% true links retained after blocking.


## Score Pairs and Examine Learned Classifier

After `dedupe` has trained blockers and a classification model based on our labeling session, we can link the records in our training dataset via the `.join` method.

In [19]:
%%time
linked_records = linker.join(records_A, records_B, threshold=0.0, constraint="one-to-one")

CPU times: user 781 ms, sys: 95.9 ms, total: 877 ms
Wall time: 6.45 s


`linker.join` will return the links, along with a model confidence.

In [20]:
linked_records[0:3]

[(('fffcadcc-3c22-4d1f-b04d-651aaddeac57',
   '614cc809-04fa-40d1-8904-5e006cd8463a'),
  1.0),
 (('fff044ab-8dca-4946-bfa4-1675ee7d56b5',
   '94b962d7-2ae4-4cb5-baf7-1a2e602d8db8'),
  1.0),
 (('ffd668ac-2f63-4c05-a6a3-58ebcf1f4a80',
   'e601f439-c5b9-4ee7-a2fb-d437e6d23b76'),
  1.0)]
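
The `constraint="one-to-one"` argument asks for links in which each record from A and each record from B appears at most once. One simple way to enforce such a constraint is a greedy pass over the scored pairs from highest to lowest score, sketched below with made-up scores. Treat it as an illustration of the constraint only; the function name is hypothetical and `dedupe` may resolve conflicts differently.

```python
from typing import List, Tuple

ScoredPair = Tuple[Tuple[str, str], float]


def greedy_one_to_one(scored_pairs: List[ScoredPair]) -> List[ScoredPair]:
    """Keep the highest-scoring links such that no record id is used twice."""
    used_a, used_b = set(), set()
    kept = []

    for (id_a, id_b), score in sorted(scored_pairs, key=lambda x: x[1], reverse=True):
        if id_a in used_a or id_b in used_b:
            continue  # one side of this pair is already linked
        used_a.add(id_a)
        used_b.add(id_b)
        kept.append(((id_a, id_b), score))

    return kept


scored = [
    (("a1", "b1"), 0.9),
    (("a1", "b2"), 0.8),  # dropped: a1 is already linked to b1
    (("a2", "b2"), 0.7),
    (("a2", "b1"), 0.6),  # dropped: both a2 and b1 are already linked
]

links = greedy_one_to_one(scored)
print(links)  # [(('a1', 'b1'), 0.9), (('a2', 'b2'), 0.7)]
```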

We'll format the `dedupe` linker predictions into a format that we can use with our existing evaluation functions.

In [21]:
df_predictions = pd.DataFrame(
    [ {"person_id_A" : x[0][0], "person_id_B" : x[0][1], "model_score" : x[1]} for x in linked_records]
)

df_predictions = df_predictions.set_index(["person_id_A", "person_id_B"])

df_predictions = pd.merge(
    df_predictions,
    df_ground_truth,
    left_index=True,
    right_index=True,
    how="left",
)

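# Pairs missing from the ground truth are not true links; assign the result
# back rather than calling fillna(inplace=True) on a column selection.
df_predictions["ground_truth"] = df_predictions["ground_truth"].fillna(False).astype(bool)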
df_predictions

| person_id_A | person_id_B | model_score | ground_truth |
| --- | --- | --- | --- |
| fffcadcc-3c22-4d1f-b04d-651aaddeac57 | 614cc809-04fa-40d1-8904-5e006cd8463a | 1.000000 | False |
| fff044ab-8dca-4946-bfa4-1675ee7d56b5 | 94b962d7-2ae4-4cb5-baf7-1a2e602d8db8 | 1.000000 | False |
| ffd668ac-2f63-4c05-a6a3-58ebcf1f4a80 | e601f439-c5b9-4ee7-a2fb-d437e6d23b76 | 1.000000 | False |
| ffb8ff59-209a-438e-bf5d-0c0864cb4d7b | a37ad5f0-ce2e-4501-b98f-2d20090b66b7 | 1.000000 | False |
| ffa8ddc9-ed8d-4bfc-bd1a-aa7e8b86122b | 3b7136be-2898-4b83-a975-de87bc1ae845 | 1.000000 | False |
| ... | ... | ... | ... |
| a90236c5-28aa-4353-a0ac-13ec568070c6 | 8a85898f-c423-4e21-ad4a-7976e7db0317 | 0.001327 | False |
| 461a2862-2530-443b-9cfc-cd17c792fad8 | 7983fb9b-090e-4234-91bb-41e8ac6f380b | 0.001124 | False |
| d0f5f0e4-86e8-4d16-90e1-e879cc2b6a94 | 3afd6b5a-d025-406f-aec0-779ea562f32e | 0.000881 | False |
| 3357e5d4-2cd5-4df4-8d55-a783b6216c38 | 63921136-968b-4995-9910-50b281b28d13 | 0.000759 | False |


## Choosing a Linking Model Score Threshold

The `dedupe` `.join` method that we used to score our training data directly incorporates the learned blockers. Thus, note that the scored pairs appearing in the score distribution are only the pairs that survived blocking, and that our blockers *significantly* reduced the candidate pair search space.

### Model Score Distribution

In [22]:
df_predictions["ground_truth"].value_counts()

False    1377
True     1008
Name: ground_truth, dtype: int64

In [23]:
tutorial.plot_model_score_distribution(df_predictions)

INFO:numexpr.utils:NumExpr defaulting to 2 threads.


### Precision and Recall vs. Model Score

In [24]:
df_eval = tutorial.evaluate_linking(
    df=df_predictions
)

In [25]:
df_eval.head()

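| | threshold | tp | fp | tn | fn | precision | recall | f1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 1008 | 1377 | 0 | 0 | 0.422642 | 1.0 | 0.594164 |
| 1 | 0.020408 | 1006 | 1304 | 73 | 2 | 0.435498 | 0.998016 | 0.606389 |
| 2 | 0.040816 | 998 | 1279 | 98 | 10 | 0.438296 | 0.990079 | 0.60761 |
| 3 | 0.061224 | 995 | 1267 | 110 | 13 | 0.439876 | 0.987103 | 0.608563 |
| 4 | 0.081633 | 989 | 1257 | 120 | 19 | 0.440338 | 0.981151 | 0.607867 |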
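
For reference, the `precision`, `recall`, and `f1` columns follow from the confusion-matrix counts in the usual way. The sketch below is not the implementation inside `tutorial.evaluate_linking` (that code is not shown here); it is a small hypothetical helper that applies the standard formulas to the counts from the row at threshold 0.020408.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts.
    Returns 0.0 for a metric whose denominator would be zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Counts from the row at threshold 0.020408.
precision, recall, f1 = precision_recall_f1(tp=1006, fp=1304, fn=2)
print(round(precision, 6), round(recall, 6), round(f1, 6))
# 0.435498 0.998016 0.606389
```

Keep in mind that recall here is measured only over the pairs that `.join` returned, so true links lost during blocking are not counted as false negatives in this table.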


In [26]:
tutorial.plot_precision_recall_vs_threshold(df_eval)

## Iterating with Active Learning

When using active learning, we iterate on our linking solution, and incorporate progressively more labeled training data. Perhaps we're not satisfied with the current performance of the blockers or classifier, and we'd like to create more labeled examples for dedupe to train on.

Recall that earlier, we saved our existing training data from the first labeling session. We can load this persisted data into a new `dedupe` linker and kick off another labeling session. Perhaps, after investigating the data during our first cycle, we don't think that `dedupe` should include `address_1` and `address_2` in its comparators.

### Tweak the Linker and Use Existing Training Data

In [27]:
%%time

fields = [
    { "field" : "first_name", "type" : "Name" },
    { "field" : "surname", "type" : "Name" },
    { "field" : "suburb", "type" : "ShortString" },
    { "field" : "postcode", "type" : "Exact" },
    { "field" : "state", "type" : "Exact" },
    { "field" : "date_of_birth", "type" : "DateTime" },
    { "field" : "soc_sec_id", "type" : "Exact" },
]

linker2 = dedupe.RecordLink(fields)

with open(TRAINING_FILE, "r") as fh:
    linker2.prepare_training(records_A, records_B, training_file=fh)

INFO:dedupe.api:reading training from file
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, date_of_birth)
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, date_of_birth)
INFO:dedupe.training:SimplePredicate: (fingerprint, suburb)


CPU times: user 36.5 s, sys: 525 ms, total: 37.1 s
Wall time: 36.6 s


Now, we can kick off a second active learning/labeling session.

In [28]:
dedupe.console_label(linker2)

first_name : lachlan
surname : exalto
suburb : taree
postcode : 3199
state : nsw
date_of_birth : 07/14/22
soc_sec_id : 8570981

first_name : kayden
surname : bishop
suburb : dianella
postcode : 3730
state : None
date_of_birth : 07/22/22
soc_sec_id : 2038603

7/10 positive, 6/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


y


first_name : patrick
surname : laing
suburb : flagstaff hill
postcode : 2574
state : nsw
date_of_birth : 11/28/46
soc_sec_id : 3966773

first_name : cade
surname : ulale
suburb : goodna
postcode : 3072
state : None
date_of_birth : 11/29/46
soc_sec_id : 6482089

8/10 positive, 6/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
INFO:dedupe.training:SimplePredicate: (monthPredicate, date_of_birth)
INFO:dedupe.training:SimplePredicate: (fingerprint, suburb)
first_name : lukas
surname : le lievre
suburb : sadleir
postcode : 3169
state : nsw
date_of_birth : None
soc_sec_id : 8477822

first_name : melni
surname : crouch
suburb : burleigh waters
postcode : 2204
state : vic
date_of_birth : 04/23/44
soc_sec_id : 1291160

8/10 positive, 7/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


u


first_name : brody
surname : widdowson
suburb : katandra west
postcode : 3135
state : vic
date_of_birth : 06/05/60
soc_sec_id : 4290634

first_name : stadtmiller
surname : nan
suburb : None
postcode : 2132
state : wa
date_of_birth : 12/18/70
soc_sec_id : 3197490

8/10 positive, 7/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : giaan
surname : david
suburb : elizabeth vale
postcode : 3156
state : nsw
date_of_birth : 05/17/37
soc_sec_id : 3587379

first_name : harrizon
surname : oatey
suburb : None
postcode : 2323
state : ns
date_of_birth : 05/25/42
soc_sec_id : 3932970

9/10 positive, 7/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:PartialIndexLevenshteinSearchPredicate: (1, surname, CorporationName)
INFO:dedupe.training:SimplePredicate: (monthPredicate, date_of_birth)
first_name : cain
surname : inall
suburb : goodna
postcode : 2650
state : vic
date_of_birth : 12/08/24
soc_sec_id : 8473822

first_name : cade
surname : ulale
suburb : goodna
postcode : 3072
state : None
date_of_birth : 11/29/46
soc_sec_id : 6482089

10/10 positive, 7/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


first_name : hanna
surname : gibb
suburb : bayswater
postcode : 2118
state : vic
date_of_birth : 06/15/56
soc_sec_id : 3746797

first_name : etqy
surname : blowes
suburb : bayswater north
postcode : 3261
state : qsd
date_of_birth : 06/09/92
soc_sec_id : 7096224

10/10 positive, 8/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


u


first_name : lynton
surname : matthews
suburb : patterson lakes
postcode : 4404
state : None
date_of_birth : 02/10/56
soc_sec_id : 3856928

first_name : lynton
surname : matthewws
suburb : patterson lakes
postcode : 4408
state : None
date_of_birth : None
soc_sec_id : 3856928

10/10 positive, 8/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


p


first_name : hanna
surname : gibb
suburb : bayswater
postcode : 2118
state : vic
date_of_birth : 06/15/56
soc_sec_id : 3746797

first_name : etqy
surname : blowes
suburb : bayswater north
postcode : 3261
state : qsd
date_of_birth : 06/09/92
soc_sec_id : 7096224

10/10 positive, 8/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


p


(y)es / (n)o / (u)nsure / (f)inished


y


first_name : lynton
surname : matthews
suburb : patterson lakes
postcode : 4404
state : None
date_of_birth : 02/10/56
soc_sec_id : 3856928

first_name : lynton
surname : matthewws
suburb : patterson lakes
postcode : 4408
state : None
date_of_birth : None
soc_sec_id : 3856928

11/10 positive, 8/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


n


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:PartialIndexLevenshteinSearchPredicate: (1, surname, CorporationName)
INFO:dedupe.training:SimplePredicate: (monthPredicate, date_of_birth)
INFO:dedupe.training:SimplePredicate: (sameSevenCharStartPredicate, suburb)
first_name : joel
surname : lowe
suburb : None
postcode : 2429
state : nsw
date_of_birth : 12/29/18
soc_sec_id : 8931185

first_name : daniel
surname : tinhg
suburb : brunswick west
postcode : 4562
state : sa
date_of_birth : 12/04/70
soc_sec_id : 6662264

11/10 positive, 9/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


f


Finished labeling


### Retrain the Linker and Examine Blocking Performance

Now let's retrain the linker and examine blocking performance. Ideally, we'll see improved true link retention following our second labeling session.

In [29]:
%%time
linker2.train()

INFO:rlr.crossvalidation:using cross validation to find optimum alpha...
  scores = np.exp(scores + self.bias) / (1 + np.exp(scores + self.bias))
  scores = np.exp(scores + self.bias) / (1 + np.exp(scores + self.bias))
INFO:rlr.crossvalidation:optimum alpha: 0.000010, score -0.04363034122413712
INFO:dedupe.training:Final predicate set:


CPU times: user 14.5 s, sys: 2.88 s, total: 17.4 s
Wall time: 14 s


In [None]:
# Materialize the candidate pairs produced by the learned blocking rules.
candidate_pairs = list(linker2.pairs(records_A, records_B))
print(f"{len(candidate_pairs):,} candidate pairs generated from blocking.")

df_candidate_links = pd.DataFrame(
    [(x[0][0], x[1][0]) for x in candidate_pairs]
).rename(columns={0 : "person_id_A", 1 : "person_id_B"}).set_index(["person_id_A", "person_id_B"])

max_candidate_pairs = df_A.shape[0]*df_B.shape[0]

print(f"{max_candidate_pairs:,} total possible pairs.")

# Calculate search space reduction as a percentage (not a fraction),
# so the value matches the "%" in the printed message.
search_space_reduction = round((1 - len(candidate_pairs)/max_candidate_pairs) * 100, 4)
print(f"\n{len(candidate_pairs):,} pairs after full blocking: {search_space_reduction}% search space reduction.")

# Calculate retained true links percentage.
total_true_links = df_ground_truth.shape[0]
true_links_after_blocking = pd.merge(
    df_ground_truth,
    df_candidate_links,
    left_index=True,
    right_index=True,
    how="inner"
).shape[0]

retained_true_link_percent = round((true_links_after_blocking/total_true_links) * 100, 2)
print(f"{retained_true_link_percent}% true links retained after blocking.")
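
The two blocking metrics computed above can be isolated in a small, dependency-free sketch. It assumes candidate pairs and true links are both available as sets of `(id_A, id_B)` tuples; the `blocking_metrics` helper and the toy IDs are hypothetical and are not part of the tutorial functions file.

```python
def blocking_metrics(candidate_pairs, true_links, n_left, n_right):
    """Return (reduction_pct, retained_pct) for a blocking step.

    candidate_pairs and true_links are sets of (id_A, id_B) tuples;
    n_left and n_right are the record counts of the two datasets.
    """
    # Every record in A could be compared with every record in B.
    max_pairs = n_left * n_right

    # Share of the full comparison space that blocking eliminated.
    reduction_pct = (1 - len(candidate_pairs) / max_pairs) * 100

    # Share of known true links that survived blocking.
    retained = candidate_pairs & true_links
    retained_pct = len(retained) / len(true_links) * 100

    return reduction_pct, retained_pct


# Toy data (hypothetical IDs, not FEBRL records).
candidates = {("a1", "b1"), ("a2", "b2"), ("a3", "b9"), ("a4", "b7")}
truth = {("a1", "b1"), ("a2", "b2"), ("a5", "b5"), ("a6", "b6")}

reduction_pct, retained_pct = blocking_metrics(
    candidates, truth, n_left=10, n_right=10
)
print(f"{reduction_pct:.2f}% search space reduction")
print(f"{retained_pct:.2f}% true links retained")
```

The two numbers pull in opposite directions: stricter blocking rules shrink the candidate set but risk discarding true links, which is why both are reported together.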

### Evaluate Classification Performance

In [None]:
%%time
linked_records = linker2.join(records_A, records_B, threshold=0.0, constraint="one-to-one")
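
The `constraint="one-to-one"` argument asks that each record in either dataset appear in at most one link. One simple way to enforce such a constraint is greedy matching: sort the scored pairs from highest to lowest and accept a pair only if neither of its records has already been linked. The sketch below illustrates that idea on toy scores. It is an illustration under that assumption, not a description of dedupe's internals, and `greedy_one_to_one` is a hypothetical helper.

```python
def greedy_one_to_one(scored_pairs, threshold=0.0):
    """Greedily pick the highest-scoring pairs so that no ID is reused.

    scored_pairs is an iterable of ((id_a, id_b), score) tuples, the same
    shape this notebook unpacks from `linked_records`.
    """
    used_a, used_b = set(), set()
    links = []

    # Visit pairs from most to least confident.
    for (id_a, id_b), score in sorted(
        scored_pairs, key=lambda pair: pair[1], reverse=True
    ):
        if score < threshold:
            break  # All remaining pairs score lower still.
        if id_a in used_a or id_b in used_b:
            continue  # One of the records is already linked.
        used_a.add(id_a)
        used_b.add(id_b)
        links.append(((id_a, id_b), score))

    return links


# Toy scores (hypothetical IDs and values).
scored = [
    (("a1", "b1"), 0.95),
    (("a1", "b2"), 0.80),
    (("a2", "b2"), 0.60),
    (("a3", "b1"), 0.40),
]

links = greedy_one_to_one(scored)
print(links)
```

Greedy selection is fast but not guaranteed to maximize the total score across all links; an optimal assignment method would be the alternative if that matters.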

In [None]:
df_predictions = pd.DataFrame(
    [ {"person_id_A" : x[0][0], "person_id_B" : x[0][1], "model_score" : x[1]} for x in linked_records]
)

df_predictions = df_predictions.set_index(["person_id_A", "person_id_B"])

df_predictions = pd.merge(
    df_predictions,
    df_ground_truth,
    left_index=True,
    right_index=True,
    how="left",
)

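# Predicted pairs absent from the ground truth are non-links.
# Assign the result back rather than using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply.
df_predictions["ground_truth"] = df_predictions["ground_truth"].fillna(False)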
df_predictions

In [None]:
df_predictions["ground_truth"].value_counts()

In [None]:
tutorial.plot_model_score_distribution(df_predictions)

In [None]:
df_eval = tutorial.evaluate_linking(
    df=df_predictions
)

tutorial.plot_precision_recall_vs_threshold(df_eval)
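
To make the precision/recall-versus-threshold plot concrete, here is a minimal sketch of how both metrics can be computed at a single score threshold. It is not necessarily how `tutorial.evaluate_linking` is implemented: the `precision_recall_at` helper, the toy scores, and the choice to measure recall against all true links (including any lost during blocking) are assumptions made for illustration.

```python
def precision_recall_at(scored_labels, threshold, total_true_links):
    """Compute (precision, recall) when pairs scoring >= threshold are links.

    scored_labels is a list of (model_score, is_true_link) tuples for the
    predicted pairs; total_true_links is the number of links in the ground
    truth, whether or not they survived blocking.
    """
    # Ground-truth labels of every pair we would predict as a link.
    predicted = [is_link for score, is_link in scored_labels if score >= threshold]
    true_positives = sum(predicted)

    # With no predicted links, report precision as 1.0 (a convention chosen here).
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / total_true_links

    return precision, recall


# Toy predictions (hypothetical scores and labels).
scored_labels = [
    (0.95, True),
    (0.90, True),
    (0.70, False),
    (0.60, True),
    (0.20, False),
]
total_true_links = 4  # Assume one true link never reached the classifier.

for threshold in (0.5, 0.8):
    precision, recall = precision_recall_at(scored_labels, threshold, total_true_links)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Raising the threshold typically trades recall for precision, which is the relationship the plot above visualizes across the full range of thresholds.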