Splink steps:

1) Prepare data
2) Exploratory analysis 
3) Blocking:
    - Create your blocking rules for prediction
4) Estimate model parameters:
    - Define comparisons
    - Define your model using the link type, the comparisons and your blocking rules
    - Estimate the parameters of the model and visualise to see what the model is doing
5) Predict results
    - Generate match_weight and match_probability scores
    - Assign records to clusters 
6) Visualise predictions to see what the model is doing
7) Evaluate against labelled data (if poss!)

Data prerequisites

Unique IDs:
- Each input dataset must have a unique ID column (unique within the dataset)
- By default, Splink assumes this will be called unique_id, but this can be changed

Conformant datasets:
- If using multiple datasets, they must share the same column names and data formats (order doesn't matter)

Cleaning:
- Ensure consistency by cleaning the data, e.g. standardising date formats, matching text case, handling invalid data
- Usual data cleaning of obvious errors

Ensure nulls are consistently and correctly represented:
- Make sure that nulls are represented as true nulls, not empty strings, because Splink handles the two differently
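
A minimal sketch of this clean-up step in plain Python, not a Splink helper. The names `NULL_LIKE` and `normalise_nulls` are hypothetical, and the list of placeholder strings is an assumption to adapt to your own data.

```python
# Hypothetical placeholder strings that should be treated as missing values.
NULL_LIKE = {"", "n/a", "na", "null", "none", "unknown"}


def normalise_nulls(record):
    """Return a copy of the record with empty or placeholder strings set to None."""
    cleaned = {}
    for column, value in record.items():
        if isinstance(value, str) and value.strip().lower() in NULL_LIKE:
            cleaned[column] = None  # a true null rather than an empty string
        else:
            cleaned[column] = value
    return cleaned


raw = {"unique_id": 7, "first_name": "", "surname": " N/A ", "city": "Portsmouth"}
clean = normalise_nulls(raw)
print(clean)
```

The same idea applies to a pandas DataFrame: replace the placeholder strings with a real missing value before passing the data to Splink.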

In [1]:
#import packages

import splink #https://moj-analytical-services.github.io/splink/

from splink.internals.duckdb.database_api import DuckDBAPI
#splink implements data linking computations by generating SQL and submitting the SQL statements to a backend of our choice
#syntax is almost exactly the same between backends 
#worth using DuckDB first to explore data as it is fast, then migrate to a different backend if desired

from splink import block_on
from splink import Linker, SettingsCreator

from splink.datasets import splink_datasets #some datasets available with splink for practice https://moj-analytical-services.github.io/splink/api_docs/datasets.html
from splink.exploratory import completeness_chart #https://moj-analytical-services.github.io/splink/api_docs/exploratory.html
from splink.exploratory import profile_columns


from splink.blocking_analysis import cumulative_comparisons_to_be_scored_from_blocking_rules_chart #https://moj-analytical-services.github.io/splink/api_docs/blocking_analysis.html
from splink.blocking_analysis import count_comparisons_from_blocking_rule

import splink.comparison_library as cl #https://moj-analytical-services.github.io/splink/api_docs/comparison_library.html



Exploratory analysis

- Splink has a range of visuals to help with EDA:

In [2]:
#import data
df = splink_datasets.fake_1000 #a synthetic dataset with some duplicates
df = df.drop(columns = ["cluster"]) #drop the cluster column
df.head(10) #check out the data

Unnamed: 0,unique_id,first_name,surname,dob,city,email
0,0,Robert,Alan,1971-06-24,,robert255@smith.net
1,1,Robert,Allen,1971-05-24,,roberta25@smith.net
2,2,Rob,Allen,1971-06-24,London,roberta25@smith.net
3,3,Robert,Alen,1971-06-24,Lonon,
4,4,Grace,,1997-04-26,Hull,grace.kelly52@jones.com
5,5,Grace,Kelly,1991-04-26,,grace.kelly52@jones.com
6,6,Logan,pMurphy,1973-08-01,,
7,7,,,2015-03-03,Portsmouth,evied56@harris-bailey.net
8,8,,Dean,2015-03-03,,
9,9,Evie,Dean,2015-03-03,Pootsmruth,evihd56@earris-bailey.net


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   unique_id   1000 non-null   int64 
 1   first_name  831 non-null    object
 2   surname     819 non-null    object
 3   dob         1000 non-null   object
 4   city        813 non-null    object
 5   email       789 non-null    object
dtypes: int64(1), object(5)
memory usage: 47.0+ KB


Analyse nulls - columns with higher numbers of nulls are less useful for data linking

In [4]:
db_api = DuckDBAPI()
completeness_chart(df, db_api=db_api)

- The completeness chart shows the percentage of non-null values in each column of the dataset

Assess distribution of values
- Columns with higher cardinality (number of distinct values) are more useful for linking
- Columns that are more equally distributed are more useful for linking

In [5]:
#profile_columns (table_or_tables, db_api, column_expressions = None, top_n = 10, bottom_n = 10)

#profile all columns by leaving the column_expressions argument empty
#dist plot showing the count of values at each percentile
#top n chart showing the count of the top n values within column
#bottom n chart showing the count of the bottom n values in the column

profile_columns(df, db_api = DuckDBAPI(), top_n = 10, bottom_n = 5)

Blocking

Choosing blocking rules to optimise runtime
- To link records, we need to compare pairs of records and decide which pairs are matches
- For most large datasets, it won't be computationally possible to compare every row with every other row (the number of comparisons rises quadratically with the number of records)
- To decide which ones to compare we use blocking rules - these specify which pairwise comparisons to generate
- These are defined as SQL expressions, e.g.:

    from splink import block_on
    block_on("first_name", "surname")

    is equivalent to:
    
    SELECT *
    FROM input_tables as l
    INNER JOIN input_tables as r
    on l.first_name = r.first_name AND l.surname = r.surname
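
To make the effect of a blocking rule concrete, here is a small sketch in plain Python that mimics the self-join above on a handful of made-up records. The records and the helpers `total_pairs` and `blocked_pairs` are hypothetical; Splink itself does this work in SQL.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records; only first_name and surname matter here.
records = [
    {"unique_id": 0, "first_name": "Robert", "surname": "Allen"},
    {"unique_id": 1, "first_name": "Robert", "surname": "Allen"},
    {"unique_id": 2, "first_name": "Rob", "surname": "Allen"},
    {"unique_id": 3, "first_name": "Grace", "surname": "Kelly"},
    {"unique_id": 4, "first_name": "Grace", "surname": "Kelly"},
    {"unique_id": 5, "first_name": None, "surname": "Dean"},
]


def total_pairs(n):
    """Number of distinct pairs when every record is compared with every other."""
    return n * (n - 1) // 2


def blocked_pairs(rows, columns):
    """Pairs of unique_ids whose values agree on all the blocking columns."""
    blocks = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in columns)
        if None in key:  # in SQL a null never equals anything, so it never joins
            continue
        blocks[key].append(row["unique_id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


pairs = blocked_pairs(records, ["first_name", "surname"])
print(total_pairs(len(records)), "possible pairs,", len(pairs), "after blocking")
print(total_pairs(1000), "possible pairs for a 1,000-row dataset")
```

Note how the record with a null first name and the record with "Rob" instead of "Robert" drop out of the blocked pairs entirely.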


The aims of blocking rules are:
- Eliminate enough non-matching comparison pairs so that we can computationally handle the number of pairwise comparisons
- Eliminate as few true matching pairs as possible
- Splink has some tools to help us choose effective rules
- Several strict blocking rules are usually better than a few loose ones: each strict rule on its own is likely to exclude a lot of true matches, but with multiple strict rules it becomes implausible that a truly matching pair is missed by all of them, e.g.:

    block_on("first_name", "dob")
    will retain all matching pairs except those with errors or nulls in the first name or dob fields
    and
    block_on("email")
    will retain all matching pairs except those with errors or nulls in the email column

- Individually, each rule would probably miss true matches where the records contain typos, but between them it is unlikely that many matching pairs would have errors in first name or dob and also in email
- The more strict blocking rules we add, the less likely it is that a matching pair slips through the cracks
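
A small sketch of the "multiple strict rules" argument, using made-up true matches. The pairs and the agreement flags are hypothetical; the point is only that the union of two strict rules keeps more true matches than either rule alone.

```python
# Hypothetical true matching pairs, flagged by whether each rule's fields agree.
true_matches = [
    {"pair": (0, 1), "name_dob_agree": True, "email_agree": True},
    {"pair": (0, 2), "name_dob_agree": False, "email_agree": True},   # typo in first name
    {"pair": (3, 4), "name_dob_agree": True, "email_agree": False},   # null email
    {"pair": (5, 6), "name_dob_agree": False, "email_agree": False},  # errors in both
]

kept_by_name_dob = {m["pair"] for m in true_matches if m["name_dob_agree"]}
kept_by_email = {m["pair"] for m in true_matches if m["email_agree"]}
kept_by_either = kept_by_name_dob | kept_by_email  # a pair survives if any rule keeps it


def recall(kept):
    """Share of true matches that survive blocking."""
    return len(kept) / len(true_matches)


print(recall(kept_by_name_dob), recall(kept_by_email), recall(kept_by_either))
```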

In [6]:
#counting comparisons created by a single rule
#this is a good idea so we know that we are not generating too many comparisons and wasting time computing them

br = block_on("substr(first_name, 1, 1)", "surname") #initial of first name and surname match
counts = count_comparisons_from_blocking_rule(
    table_or_tables=df,
    blocking_rule=br,
    link_type="dedupe_only",
    db_api=db_api
)
counts

{'number_of_comparisons_generated_pre_filter_conditions': 1632,
 'number_of_comparisons_to_be_scored_post_filter_conditions': 473,
 'filter_conditions_identified': '',
 'equi_join_conditions_identified': 'SUBSTR(l.first_name, 1, 1) = SUBSTR(r.first_name, 1, 1) AND l."surname" = r."surname"',
 'link_type_join_condition': 'where l."unique_id" < r."unique_id"'}

In [7]:
#a different blocking rule

br = "l.first_name = r.first_name and levenshtein(l.surname, r.surname) < 2" #first names match and levenshtein difference is < 2

#levenshtein distance: measure of the difference between two strings - the number of insertions, deletions and substitutions needed to transform one string into the other
#https://www.geeksforgeeks.org/introduction-to-levenshtein-distance/:

#example of levenshtein distance
#import duckdb
#duckdb.sql("SELECT levenshtein ('DAVE', 'DAVI')").df().iloc[0,0]

#example of Damerau-Levenshtein distance
#duckdb.sql("SELECT damerau_levenshtein ('DAVE', 'DAVI')").df().iloc[0,0]

#count_comparisons_from_blocking_rule(table_or_tables, blocking_rule, link_type, db_api, unique_id_column_name = "unique_id",
#source_dataset_column_name = None, compute_post_filter_count = True, max_rows_limit = int(1000000000.0))
#returns: dict(str, Union[int, str])
counts = count_comparisons_from_blocking_rule(
    table_or_tables=df,
    blocking_rule=br,
    link_type="dedupe_only",
    db_api=db_api
)
counts

{'number_of_comparisons_generated_pre_filter_conditions': 4827,
 'number_of_comparisons_to_be_scored_post_filter_conditions': 372,
 'filter_conditions_identified': 'LEVENSHTEIN(l.surname, r.surname) < 2',
 'equi_join_conditions_identified': 'l.first_name = r.first_name',
 'link_type_join_condition': 'where l."unique_id" < r."unique_id"'}
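
For intuition, the Levenshtein distance used in the rule above can be sketched in plain Python with the classic dynamic-programming approach. This is an illustration only, not the DuckDB implementation.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        previous = current
    return previous[-1]


print(levenshtein("DAVE", "DAVI"))
print(levenshtein("Allen", "Alen"))
print(levenshtein("kitten", "sitting"))
```

With a threshold of `< 2`, surnames such as "Allen" and "Alen" are still compared, while surnames that differ by two or more edits are not.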

In [13]:
#counting the number of comparisons created by a list of blocking rules
blocking_rules_for_analysis = [
    block_on("substr(first_name, 1, 1)", "surname"),
    block_on("surname"),
    block_on("email"),
    block_on("city", "first_name"), 
    "l.first_name = r.first_name and levenshtein(l.surname, r.surname) < 2",
]

#cumulative_comparisons_to_be_scored_from_blocking_rules_chart(table_or_tables, blocking_rules, link_type, db_api, unique_id_column_name = "unique_id", 
#max_rows_limit = int(1000000000.0), source_dataset_column_name = None)
cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
    table_or_tables=df,
    blocking_rules=blocking_rules_for_analysis,
    db_api = db_api,
    link_type="dedupe_only"
)

Building and estimating the model
- Estimating the model will help us understand the relative importance of different parts of the data for data linking
- The relative importance is captured in the partial match weights (which are added to compute the overall match score)
- To build a model, we tell Splink which partial match weights to estimate by defining comparisons, which specify how the data in pairs of records should be compared
- A comparison represents how data from one or more input columns is compared
- A model is composed of many comparisons, which between them assess the similarity of all the columns being used for linking the data
- Each comparison contains two or more ComparisonLevels, which define discrete gradations of similarity between the input columns within the comparison
- Splink has a library of comparison functions which are split into:
    - generic comparison functions which apply a particular fuzzy matching pattern (e.g. levenshtein distance)
    - tailored comparison functions for specific data types

- There are 3 ways of specifying comparisons:
    - Using "out-of-the-box" Comparisons
    - Composing pre-defined ComparisonLevels
    - Writing a full dictionary spec of a Comparison by hand

- "Out-of-the-box" Comparisons:
    - The ComparisonLibrary has pre-baked similarity functions that cover many common use cases
    - These functions generate an entire Comparison, composed of ComparisonLevels
    - Include non-data-specific and data-specific comparisons

- Composing pre-defined ComparisonLevels
    - Compose our own Comparisons

- Full dictionary spec
    - All Comparisons are eventually turned into a dictionary
    - The library functions are convenience functions that provide a shorthand way to produce valid dictionaries, but we can specify our own Comparisons directly as a dictionary to get maximum control
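
As an illustration of the full dictionary spec, here is a sketch of a Levenshtein comparison on `city` written out by hand. The key names (`output_column_name`, `comparison_levels`, `sql_condition`, `label_for_charts`, `is_null_level`) are given as I understand the Splink settings format and should be checked against the settings documentation for your version; the variable name `city_comparison_spec` is hypothetical.

```python
# Hand-written comparison spec: levels are evaluated in order, most similar first.
city_comparison_spec = {
    "output_column_name": "city",
    "comparison_levels": [
        {
            "sql_condition": '"city_l" IS NULL OR "city_r" IS NULL',
            "label_for_charts": "city is NULL",
            "is_null_level": True,  # null levels carry no evidence either way
        },
        {
            "sql_condition": '"city_l" = "city_r"',
            "label_for_charts": "Exact match on city",
        },
        {
            "sql_condition": 'levenshtein("city_l", "city_r") <= 2',
            "label_for_charts": "Levenshtein distance of city <= 2",
        },
        {
            "sql_condition": "ELSE",
            "label_for_charts": "All other comparisons",
        },
    ],
}

for level in city_comparison_spec["comparison_levels"]:
    print(level["label_for_charts"], "->", level["sql_condition"])
```

Writing the spec by hand is mainly useful when none of the library comparisons fits, since every SQL condition is then under our control.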

In [14]:
#generic comparison
city_comparison = cl.LevenshteinAtThresholds("city", 2) #L distance <= 2
print(city_comparison.get_comparison("duckdb").human_readable_description)

Comparison 'LevenshteinAtThresholds' of "city".
Similarity is assessed using the following ComparisonLevels:
    - 'city is NULL' with SQL rule: "city_l" IS NULL OR "city_r" IS NULL
    - 'Exact match on city' with SQL rule: "city_l" = "city_r"
    - 'Levenshtein distance of city <= 2' with SQL rule: levenshtein("city_l", "city_r") <= 2
    - 'All other comparisons' with SQL rule: ELSE



In [15]:
#tailored comparison
email_comparison = cl.EmailComparison("email")
print(email_comparison.get_comparison("duckdb").human_readable_description)

Comparison 'EmailComparison' of "email".
Similarity is assessed using the following ComparisonLevels:
    - 'email is NULL' with SQL rule: "email_l" IS NULL OR "email_r" IS NULL
    - 'Exact match on email' with SQL rule: "email_l" = "email_r"
    - 'Exact match on username' with SQL rule: NULLIF(regexp_extract("email_l", '^[^@]+', 0), '') = NULLIF(regexp_extract("email_r", '^[^@]+', 0), '')
    - 'Jaro-Winkler distance of email >= 0.88' with SQL rule: jaro_winkler_similarity("email_l", "email_r") >= 0.88
    - 'Jaro-Winkler >0.88 on username' with SQL rule: jaro_winkler_similarity(NULLIF(regexp_extract("email_l", '^[^@]+', 0), ''), NULLIF(regexp_extract("email_r", '^[^@]+', 0), '')) >= 0.88
    - 'All other comparisons' with SQL rule: ELSE



In [16]:
#Comparisons are specified as part of the Splink settings, a Python dictionary which controls the configuration of a Splink model

settings = SettingsCreator(
    link_type = "dedupe_only", #only deduping not linking
    comparisons=[
        cl.NameComparison("first_name"),#define the info we want to use to link the data
        cl.NameComparison("surname"),
        cl.LevenshteinAtThresholds("dob", 1),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True), #for columns where some values appear much more frequently than others we set this as true
        cl.EmailComparison("email")
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "city"), #we only want to compare records where either the first name and city or surname match
        block_on("surname"),
    ],
    retain_intermediate_calculation_columns=True, #let's have a look at extra info to help us understand the calculations - this slows things down though!
)

linker = Linker(df, settings, db_api=DuckDBAPI())

Term frequency adjustments
- The Fellegi-Sunter model doesn't account for skew in the distributions of linking variables
- Consider, for example, a binary gender variable where males outnumber females by 10:1
- This doesn't affect the m probability: given that two records are a match, the gender fields should match with roughly the same probability for males and females
- The u probability, however, is affected: given that two records are not a match, it is much more likely that both records will be male than that they will both be female, so a single u probability is too low for the more common value and too high for the rarer one
- One option might be to create different comparison levels for the gender variable, but this means we have to calculate more probabilities, and we would need many comparison levels if we had higher cardinality values
- To deal with the problem we can add an independent TF adjustment term for each comparison
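
A worked version of the gender example, as a simplified sketch rather than a description of Splink's internals. It assumes that values are independent across non-matching records and that the m probability does not depend on the value; the m probability of 0.95 is a made-up number.

```python
# Term frequencies for the 10:1 gender split described in the text.
tf = {"male": 10 / 11, "female": 1 / 11}

m_exact = 0.95  # hypothetical m probability for an exact match on gender

# Unadjusted u: chance that two random non-matching records agree on gender at all.
u_exact = sum(freq ** 2 for freq in tf.values())
bf_unadjusted = m_exact / u_exact

# Value-specific Bayes factors under the stated assumptions:
#   Pr(both have value v | match)     = m_exact * tf[v]
#   Pr(both have value v | non-match) = tf[v] ** 2
bf_by_value = {v: (m_exact * f) / (f ** 2) for v, f in tf.items()}

# The adjustment is the multiplier that turns the unadjusted factor into the value-specific one.
tf_adjustment = {v: bf_by_value[v] / bf_unadjusted for v in tf}

for value in tf:
    print(value, round(bf_by_value[value], 3), round(tf_adjustment[value], 3))
```

Agreement on the common value ("male") ends up with an adjustment below 1, and agreement on the rare value ("female") with an adjustment well above 1, which is the behaviour the bullets above describe.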

Estimating the parameters of the model

- Here we estimate the three kinds of parameter in the Splink model: lambda, the m probabilities and the u probabilities

Estimating lambda
- 𝜆 = Pr(Records match) = probability that two records match
- In some cases we might know lambda, for example, if we know that there is a one-to-one match between datasets
- In most cases we don't know this, so we combine a set of deterministic matching rules and a guess of the recall corresponding to these rules

In [17]:
deterministic_rules = [
    block_on("first_name", "dob"),
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    block_on("email")
]

linker.training.estimate_probability_two_random_records_match(deterministic_rules, recall = 0.7)

Probability two random records match is estimated to be  0.00298.
This means that amongst all possible pairwise record comparisons, one in 335.56 are expected to match.  With 499,500 total possible comparisons, we expect a total of around 1,488.57 matching pairs
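
The arithmetic behind this output can be reproduced by hand. In the sketch below, the number of pairs found by the deterministic rules (1,042) is an assumption inferred from the output above (1,488.57 x 0.7); Splink does not print it.

```python
n_records = 1000
total_comparisons = n_records * (n_records - 1) // 2  # all possible pairs when deduping

recall = 0.7  # our guess at the share of true matches the deterministic rules find

# Assumption: inferred from the output above, not reported by Splink.
pairs_found_by_rules = 1042

# If the rules find 70% of the true matches, scale up to estimate all of them.
expected_matches = pairs_found_by_rules / recall
probability_two_random_records_match = expected_matches / total_comparisons

print(total_comparisons)
print(round(expected_matches, 2))
print(round(probability_two_random_records_match, 5))
print(round(1 / probability_two_random_records_match, 2))
```

A lower recall guess therefore pushes lambda up, because the same number of deterministic matches is assumed to represent a smaller share of the true matches.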


Estimating m and u probabilities

- m and u probabilities quantify the strength of the evidence we have in our data 
- m = Pr(scenario|records match)
- u = Pr(scenario|records do not match)
- What is important is the relative size of these values - this is the Bayes Factor:
        K = m/u = Pr(scenario|records match)/Pr(scenario|records do not match)
- Bayes Factors act as a relative multiplier that increases or decreases the overall prediction of whether the records match

Estimating u probabilities
- Once we have lambda, we can estimate u probabilities
- We use the estimate_u_using_random_sampling method which samples random pairs of records, since most random pairs will be non-matches
- Over these non-matches we compute the distribution of ComparisonLevels for each comparison

In [18]:
linker.training.estimate_u_using_random_sampling(max_pairs = 1e6) #the larger the random sample, the more accurate the u estimates

You are using the default value for `max_pairs`, which may be too small and thus lead to inaccurate estimates for your model's u-parameters. Consider increasing to 1e8 or 1e9, which will result in more accurate estimates, but with a longer run time.
----- Estimating u probabilities using random sampling -----

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - first_name (no m values are trained).
    - surname (no m values are trained).
    - dob (no m values are trained).
    - city (no m values are trained).
    - email (no m values are trained).


- We now need to estimate m - for this we have to have some idea of what the true matches are
- We can use an iterative maximum likelihood approach called Expectation Maximisation:
    - Iterative optimisation method that finds maximum likelihood estimates of parameters in models that have latent variables (unobserved variables that can only be inferred indirectly through their effects on observed variables)
    - Made up of an expectation (E) step and a maximisation (M) step:
        - E - use the current parameter estimates to compute the expected value of the latent variables, and hence the expected log-likelihood (the log of the likelihood function, which measures the goodness of fit between the data and the model)
        - M - find the parameters that maximise the expected log-likelihood obtained in the E step, and update the model parameters accordingly
- This estimates the m values by generating pairwise record comparisons and using them to maximise a likelihood function
- Each estimation pass requires us to configure an estimation blocking rule to reduce the number of record comparisons so it is manageable
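
The E and M steps can be sketched on a toy problem with two binary comparisons (agree or disagree). This is a teaching sketch under strong assumptions, not Splink's implementation: the pattern counts are made up, the u probabilities are held fixed, and the comparisons are assumed to be independent.

```python
# Each key is a pattern of agreement (1) or disagreement (0) on two comparisons;
# each value is the hypothetical number of record pairs showing that pattern.
pattern_counts = {(1, 1): 83, (1, 0): 52, (0, 1): 52, (0, 0): 813}

u = [0.05, 0.05]  # u probabilities, held fixed (assumed already estimated)
m = [0.8, 0.8]    # starting guesses for the m probabilities
lam = 0.05        # starting guess for the probability that a pair is a match


def pattern_likelihood(pattern, probs):
    """Pr(pattern | class), assuming the comparisons are independent."""
    result = 1.0
    for agrees, p in zip(pattern, probs):
        result *= p if agrees else (1 - p)
    return result


for iteration in range(50):
    # E step: probability that pairs with each pattern are matches (the latent variable).
    weights = {}
    for pattern in pattern_counts:
        match_part = lam * pattern_likelihood(pattern, m)
        non_match_part = (1 - lam) * pattern_likelihood(pattern, u)
        weights[pattern] = match_part / (match_part + non_match_part)

    # M step: re-estimate lambda and the m probabilities from the expected matches.
    expected_matches = sum(weights[p] * n for p, n in pattern_counts.items())
    lam = expected_matches / sum(pattern_counts.values())
    m = [
        sum(weights[p] * n * p[k] for p, n in pattern_counts.items()) / expected_matches
        for k in range(len(m))
    ]

print(round(lam, 3), [round(value, 3) for value in m])
```

No labelled matches are used anywhere: the estimates come purely from how often each agreement pattern occurs.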

In [19]:
training_session_fname_sname = (linker.training.estimate_parameters_using_expectation_maximisation(block_on("first_name", "surname")))
training_session_dob = (linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob")))


----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
(l."first_name" = r."first_name") AND (l."surname" = r."surname")

Parameter estimates will be made for the following comparison(s):
    - dob
    - city
    - email

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - first_name
    - surname

Level Jaro-Winkler >0.88 on username on comparison email not observed in dataset, unable to train m value

Iteration 1: Largest change in params was -0.521 in the m_probability of dob, level `Exact match on dob`
Iteration 2: Largest change in params was 0.0516 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0183 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.00744 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.00349 in probability_two_random_records_match
Iteration 6: Larg

In [20]:
#now we can visualise the model parameters
#see the final estimated match weights
linker.visualisations.match_weights_chart()

- The match weights chart shows results of a trained Splink model
- Each comparison is represented in a bar chart - each bar shows the evidence for two records being a match at that comparison level
- The first bar is our prior (the Bayesian prior), which represents our belief that two random records will be a match

Things to focus on:

Match weights should gradually reduce within a comparison:
- Comparison levels are order-dependent - so the most similar levels come first and the levels get gradually less similar
- So we would expect that the match weight will reduce as we move down the levels

We might want to combine comparison levels that are very similar
- Comparisons are broken up into levels to show different levels of similarity
- Because of this, we expect the amount of evidence (match weight) to vary between comparison levels
- Two levels with the same match weight do not provide the model with any additional information, so we should combine very similar levels into a single level

We might want to simplify a model that has a number of highly predictive features
- Where we have a large variation between comparison levels, it indicates that we have a highly predictive feature (consider the difference between an exact match on email and all other comparisons on email)
- If we have a lot of highly predictive features, we might consider simplifying the model by using only the more predictive features

Logically walk through each comparison level
- Check the amount of evidence (match weight) that has been allocated by the model for each comparison level
- Logically consider each level and how much evidence matches give us - do they make sense knowing our data

In [19]:
#m and u values
linker.visualisations.m_u_parameters_chart()

- Left shows estimated m probabilities - the probability of a given comparison level when two records are a match - the proportion of matching records allocated to the comparison level
- Right shows estimated u probabilities - the probability of a given comparison level when two records do not match - the proportion of non-matching records allocated to the comparison level

Things to focus on:

Logically walk through each comparison level
- Consider, for example, how often exact matches and fuzzy levels occur in non-matching comparisons
- Consider the cardinality of the features - high cardinality features will generally have lower likelihood of "all other comparisons"


In [20]:
#comparisons
linker.visualisations.parameter_estimate_comparisons_chart()

- Shows how parameter estimates have differed across different estimation methods

In [21]:
#we can then save our model to a .json file for future use
settings = linker.misc.save_model_to_json(
    "file_path", overwrite = True #enter a file path here to save
)

Unlinkable records
- Before we generate our predictions we can detect records deemed unlinkable - those that don't contain enough info to be linked
- We do this by linking records to themselves (if even when matched to themselves, they do not meet match thresholds, they will never match to anything)
- Shows us the proportion of records that are unlinkable at specific threshold match weight/match probability
- The record may link to other records but there's not enough information to disambiguate potential links

In [22]:
linker.evaluation.unlinkables_chart()
#this will show us the percentage of unlinkable records at different threshold match weights

Making predictions


In [23]:
#display settings: show all columns of wide dataframes
import pandas as pd
pd.options.display.max_columns = 1000


In [24]:
import json
import urllib
import urllib.request

url = "link" #enter url here if saved to git

with urllib.request.urlopen(url) as u:
    settings = json.loads(u.read().decode())

linker = Linker(df, settings, db_api=DuckDBAPI())

Predicting match weights
- linker.inference.predict() runs the model
- This generates all pairwise record comparisons that match at least one of the blocking_rules_to_generate_predictions
- Uses the rules we defined in the comparisons to evaluate the similarity of the input data
- Uses the estimated match weights, applying term frequency adjustments where we have set this as true, to produce the match_weight and match_probability scores
- We can also define a threshold_match_probability or threshold_match_weight to drop any rows where the predicted score is below a certain threshold

Bayes Factors -> probabilities

- The prior is our existing belief that two random records match (our belief of the scenario before we have any evidence)
- The posterior is our belief that two records match given the evidence we have (the data we have about the records)

- Mathematically:

    posterior odds = prior odds * Bayes Factor

- And Bayes Theorem is:

    Pr(a|b) = Pr(b|a) * Pr(a) / Pr(b)

- or:

    posterior probability = likelihood * prior probability / evidence

- so if we consider one column (e.g. first name):

    Pr(match|first name matches) = Pr(first name matches|match) * Pr(match) / Pr(first name matches)

- which we can also write as:

    Pr(match|first name matches) = Pr(first name matches|match) * Pr(match) / (Pr(first name matches|match) * Pr(match) + Pr(first name matches|non match) * Pr(non match))

- m = Pr(scenario|records match)
- u = Pr(scenario|records do not match)

- so this is the same as:

    posterior probability = m * prior probability / (m * prior probability + u * (1 - prior probability))

- Odds is:

    odds = p / (1 - p)

- so:

    posterior odds = (prior / (1 - prior)) * (m / u)

- so for a specific scenario:

    posterior odds = prior odds * Bayes Factor

- This formula can account for evidence from multiple scenarios (e.g. a match on several columns, or a match on some and not on others), as in a Naive Bayes classifier:

    posterior odds = prior odds * Bayes Factor 1 * Bayes Factor 2 ... * Bayes Factor n

- Which means:
    posterior odds = prior odds * (m1 * m2 * ... * mn) / (u1 * u2 * ... * un)
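
The chain of formulas above can be sketched as a small function. The prior and the m and u values used here are made-up numbers for illustration, and `posterior_probability` is a hypothetical helper, not part of Splink.

```python
def posterior_probability(prior_probability, bayes_factors):
    """Convert a prior probability and a list of Bayes factors into a posterior probability."""
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = prior_odds
    for factor in bayes_factors:
        posterior_odds *= factor  # each piece of evidence multiplies the odds
    return posterior_odds / (1 + posterior_odds)  # odds back to probability


prior = 0.01  # hypothetical prior that two random records match

# Hypothetical evidence: first names agree (m = 0.9, u = 0.02), dobs disagree (m = 0.05, u = 0.95).
k_first_name = 0.9 / 0.02
k_dob = 0.05 / 0.95

print(posterior_probability(prior, []))
print(posterior_probability(prior, [k_first_name]))
print(posterior_probability(prior, [k_first_name, k_dob]))
```

A Bayes factor above 1 moves the posterior up and a factor below 1 moves it down, so agreement on first name raises the probability and the disagreement on dob then pulls it back.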



In [25]:
df_predictions = linker.inference.predict(threshold_match_probability=0.2)
df_predictions.as_pandas_dataframe(limit = 5)

Blocking time: 0.01 seconds
Predict time: 0.51 seconds

You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
    m values not fully trained


Unnamed: 0,match_weight,match_probability,unique_id_l,unique_id_r,first_name_l,first_name_r,gamma_first_name,tf_first_name_l,tf_first_name_r,bf_first_name,bf_tf_adj_first_name,surname_l,surname_r,gamma_surname,tf_surname_l,tf_surname_r,bf_surname,bf_tf_adj_surname,dob_l,dob_r,gamma_dob,bf_dob,city_l,city_r,gamma_city,tf_city_l,tf_city_r,bf_city,bf_tf_adj_city,email_l,email_r,gamma_email,tf_email_l,tf_email_r,bf_email,bf_tf_adj_email,match_key
0,8.522364,0.997288,110,112,Oliver,Oliver,4,0.033694,0.033694,84.821765,0.171945,Atkinnos,Atkinson,3,0.001221,0.008547,79.354325,1.0,2009-12-21,2010-01-20,0,0.460743,London,London,1,0.212792,0.212792,10.20126,0.259162,oliver.atkinson@moran-smith.com,oliver.atkinson@moran-smith.com,4,0.006337,0.006337,252.050601,0.346193,0
1,17.447412,0.999994,110,114,Oliver,Oliver,4,0.033694,0.033694,84.821765,0.171945,Atkinnos,Atkinson,3,0.001221,0.008547,79.354325,1.0,2009-12-21,2009-12-21,2,223.957757,London,London,1,0.212792,0.212792,10.20126,0.259162,oliver.atkinson@moran-smith.com,oliver.atkinson@moran-smith.com,4,0.006337,0.006337,252.050601,0.346193,0
2,17.514207,0.999995,227,228,Julia,Julia,4,0.00361,0.00361,84.821765,1.604819,Smith,Smith,4,0.013431,0.013431,88.870507,0.364081,2004-04-27,2004-04-26,1,93.268001,Luton,Luton,1,0.00369,0.00369,10.20126,14.944992,,julia.smith@english.org,-1,,0.002535,1.0,1.0,0
3,13.065817,0.999883,256,257,Sofia,Sofia,4,0.00361,0.00361,84.821765,1.604819,Russell,Russell,4,0.01221,0.01221,88.870507,0.400489,2014-09-03,2014-09-03,2,223.957757,London,London,1,0.212792,0.212792,10.20126,0.259162,,sofiarussell9@rivera.com,-1,,0.003802,1.0,1.0,0
4,9.575398,0.998691,286,287,Bailey,Bailey,4,0.002407,0.002407,84.821765,2.407229,Freddie,Freddie,4,0.007326,0.007326,88.870507,0.667482,1985-01-05,1986-02-04,0,0.460743,Huddersfield,Huddersfield,1,0.0123,0.0123,10.20126,4.483498,,fbailey@schneider.biz,-1,,0.002535,1.0,1.0,0


Clustering records
- From linker.inference.predict() we get a list of pairwise record comparisons and their scores
- We now need to convert the pairwise results into clusters

In [26]:
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predictions, threshold_match_probability=0.5
)
clusters.as_pandas_dataframe(limit = 10)

Completed iteration 1, num representatives needing updating: 2
Completed iteration 2, num representatives needing updating: 0


Unnamed: 0,cluster_id,unique_id,first_name,surname,dob,city,email
0,0,0,Robert,Alan,1971-06-24,,robert255@smith.net
1,1,1,Robert,Allen,1971-05-24,,roberta25@smith.net
2,1,2,Rob,Allen,1971-06-24,London,roberta25@smith.net
3,3,3,Robert,Alen,1971-06-24,Lonon,
4,4,4,Grace,,1997-04-26,Hull,grace.kelly52@jones.com
5,5,5,Grace,Kelly,1991-04-26,,grace.kelly52@jones.com
6,6,6,Logan,pMurphy,1973-08-01,,
7,7,7,,,2015-03-03,Portsmouth,evied56@harris-bailey.net
8,8,8,,Dean,2015-03-03,,
9,8,9,Evie,Dean,2015-03-03,Pootsmruth,evihd56@earris-bailey.net
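
One way to picture the clustering step is as connected components: any two records joined by a pair scoring above the threshold end up in the same cluster. The sketch below uses a small union-find in plain Python. The pair scores are hypothetical (they are not taken from the predictions above), and Splink's own implementation runs in SQL and may differ in detail.

```python
# Hypothetical pairwise predictions: (unique_id_l, unique_id_r, match_probability).
predictions = [(1, 2, 0.97), (8, 9, 0.81), (0, 3, 0.32), (4, 5, 0.41)]
threshold = 0.5
record_ids = range(10)

# Every record starts as the representative of its own cluster.
parent = {record_id: record_id for record_id in record_ids}


def find(x):
    """Follow parent links to the representative of x's cluster."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # shorten the path as we go
        x = parent[x]
    return x


def union(a, b):
    """Merge two clusters, keeping the smaller id as the representative."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        low, high = sorted((root_a, root_b))
        parent[high] = low


for id_l, id_r, probability in predictions:
    if probability >= threshold:  # pairs below the threshold are ignored
        union(id_l, id_r)

cluster_of = {record_id: find(record_id) for record_id in record_ids}
print(cluster_of)
```

Raising the threshold breaks clusters apart and lowering it merges them, so the clustering threshold is worth checking alongside the visualisations that follow.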


In [29]:
sql = f"""
select *
from {df_predictions.physical_name}
limit 2
"""

linker.misc.query_sql(sql)

Unnamed: 0,match_weight,match_probability,unique_id_l,unique_id_r,first_name_l,first_name_r,gamma_first_name,tf_first_name_l,tf_first_name_r,bf_first_name,bf_tf_adj_first_name,surname_l,surname_r,gamma_surname,tf_surname_l,tf_surname_r,bf_surname,bf_tf_adj_surname,dob_l,dob_r,gamma_dob,bf_dob,city_l,city_r,gamma_city,tf_city_l,tf_city_r,bf_city,bf_tf_adj_city,email_l,email_r,gamma_email,tf_email_l,tf_email_r,bf_email,bf_tf_adj_email,match_key
0,8.522364,0.997288,110,112,Oliver,Oliver,4,0.033694,0.033694,84.821765,0.171945,Atkinnos,Atkinson,3,0.001221,0.008547,79.354325,1.0,2009-12-21,2010-01-20,0,0.460743,London,London,1,0.212792,0.212792,10.20126,0.259162,oliver.atkinson@moran-smith.com,oliver.atkinson@moran-smith.com,4,0.006337,0.006337,252.050601,0.346193,0
1,17.447412,0.999994,110,114,Oliver,Oliver,4,0.033694,0.033694,84.821765,0.171945,Atkinnos,Atkinson,3,0.001221,0.008547,79.354325,1.0,2009-12-21,2009-12-21,2,223.957757,London,London,1,0.212792,0.212792,10.20126,0.259162,oliver.atkinson@moran-smith.com,oliver.atkinson@moran-smith.com,4,0.006337,0.006337,252.050601,0.346193,0


Visualising predictions
- Visualising our results will help us to understand how the model is working
- Through visualising we need to look for areas where we think our model isn't working well so we can fix these issues

In [30]:
#waterfall chart shows us how Splink computed the final match weight for a particular record comparison
records_to_view = df_predictions.as_record_dict(limit = 5)
linker.visualisations.waterfall_chart(records_to_view, filter_nulls = False)

The waterfall chart shows how much each comparison adds to or subtracts from the evidence in favour of a match.
The overall match weight represents the strength of evidence that two records match, and is calculated by summing the partial match weights:

ωPrior + ωfirst_name + ωsurname + ωdob + ωcity + ωemail (+ term frequency adjustments where applied)
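
As a check on this sum, the sketch below recomputes the match weight for the pair with unique_id 110 and 114, using the Bayes factors shown in the predictions table. It assumes that each partial match weight is log2 of the corresponding Bayes factor and that the model's prior is the 0.00298 estimated earlier; both are assumptions of the sketch rather than statements about Splink's internals.

```python
import math

# Assumption: the prior is the probability estimated earlier in the notebook.
prior_probability = 0.00298

# Bayes factors (bf_*) and term frequency adjustments (bf_tf_adj_*) for pair (110, 114),
# copied from the predictions table.
bayes_factors = {
    "first_name": 84.821765, "tf_adj_first_name": 0.171945,
    "surname": 79.354325, "tf_adj_surname": 1.0,
    "dob": 223.957757,
    "city": 10.20126, "tf_adj_city": 0.259162,
    "email": 252.050601, "tf_adj_email": 0.346193,
}

prior_weight = math.log2(prior_probability / (1 - prior_probability))
partial_weights = {name: math.log2(bf) for name, bf in bayes_factors.items()}

match_weight = prior_weight + sum(partial_weights.values())
match_probability = 2 ** match_weight / (1 + 2 ** match_weight)

print(round(prior_weight, 3))
print(round(match_weight, 3), round(match_probability, 6))
```

Under these assumptions the result agrees, to within rounding, with the match_weight and match_probability reported for that pair in the table.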

In [33]:
#Comparison viewer dashboard gives us an interactive dashboard with example predictions from the spectrum of match scores
linker.visualisations.comparison_viewer_dashboard(df_predictions, "file_path", overwrite = True) #enter file path here

In [34]:
#cluster studio dashboard gives us an interactive dashboard that visualises the clustering of our predictions
#this will show us examples of clusters of different sizes - the shape and size of clusters can show us problems and therefore help us to find false positives and negatives

linker.visualisations.cluster_studio_dashboard(
    df_predictions,
    clusters,
    "file_path", #enter file path here
    sampling_method="by_cluster_size",
    overwrite = True
)

Evaluating the model
- Splink also provides more formal accuracy analysis functions
- These help us understand the prevalence of false positives and false negatives
- They rely on having a sample of labelled record pairs (both matches and non-matches)

In [54]:
#reload the data, set the blocking rules for prediction and re-run predict
from splink.datasets import splink_dataset_labels

df = splink_datasets.fake_1000

settings["blocking_rules_to_generate_predictions"] = [
    block_on("first_name"),
    block_on("city"),
    block_on("email"),
    block_on("dob")
]

linker = Linker(df, settings, db_api=DuckDBAPI())
df_predictions = linker.inference.predict(threshold_match_probability=0.01)

#load in labels and register them with the linker
df_labels = splink_dataset_labels.fake_1000_labels
labels_table = linker.table_management.register_labels_table(df_labels)
df_labels.head(5)

Blocking time: 0.03 seconds
Predict time: 1.93 seconds

You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
    m values not fully trained


Unnamed: 0,unique_id_l,source_dataset_l,unique_id_r,source_dataset_r,clerical_match_score
0,0,fake_1000,1,fake_1000,1.0
1,0,fake_1000,2,fake_1000,1.0
2,0,fake_1000,3,fake_1000,1.0
3,0,fake_1000,4,fake_1000,0.0
4,0,fake_1000,5,fake_1000,0.0


In [44]:
#see false negatives
splink_df = linker.evaluation.prediction_errors_from_labels_table(
    labels_table, include_false_negatives=True, include_false_positives=False
)
false_negatives = splink_df.as_record_dict(limit = 5)
linker.visualisations.waterfall_chart(false_negatives)

In [45]:
#see false positives
splink_df = linker.evaluation.prediction_errors_from_labels_table(
    labels_table,
    include_false_negatives=False,
    include_false_positives=True,
    threshold_match_probability=0.01, #low threshold, otherwise no false positives are returned
)
false_positives = splink_df.as_record_dict(limit = 5)
linker.visualisations.waterfall_chart(false_positives)
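
To make the terms concrete, the sketch below shows one way a labelled pair could be sorted into true/false positives and negatives at a given threshold. It is an illustration, not Splink's implementation: classify_pair is a hypothetical helper, treating a clerical_match_score of 0.5 or above as a true match is an assumption, and the probabilities passed in are illustrative values rather than results from this notebook. A pair that never appears in the predictions (for example because no blocking rule generated it) is represented by None and treated as a predicted non-match.

```python
def classify_pair(clerical_match_score, match_probability, threshold=0.5):
    """Label one clerically reviewed pair as TP, FP, TN or FN at the given threshold.

    match_probability=None stands for a pair that is missing from the
    predictions, which is treated here as a predicted non-match.
    """
    # Assumption: labels of 0.5 or above count as true matches
    is_true_match = clerical_match_score >= 0.5
    predicted_match = match_probability is not None and match_probability >= threshold
    if predicted_match:
        return "TP" if is_true_match else "FP"
    return "FN" if is_true_match else "TN"


# Illustrative values only
print(classify_pair(1.0, 0.997288))             # labelled match, scored above the threshold
print(classify_pair(1.0, None))                 # labelled match that was never scored
print(classify_pair(0.0, 0.2))                  # labelled non-match, below the default threshold
print(classify_pair(0.0, 0.2, threshold=0.01))  # same pair, much lower threshold
```

The last two calls show why the threshold is lowered when looking for false positives: a labelled non-match with a low score only counts as a false positive once the threshold drops below that score.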

In [55]:
#threshold selection chart shows key accuracy stats at each match weight threshold
linker.evaluation.accuracy_analysis_from_labels_table(
    labels_table, output_type="threshold_selection", add_metrics=["f1"])

In [56]:
#ROC curve
linker.evaluation.accuracy_analysis_from_labels_table(
    labels_table, output_type = "roc"
)

In [58]:
#truth table
roc_table = linker.evaluation.accuracy_analysis_from_labels_table(
    labels_table, output_type="table"
)
roc_table.as_pandas_dataframe(limit = 5)

Unnamed: 0,truth_threshold,match_probability,total_clerical_labels,p,n,tp,tn,fp,fn,P_rate,N_rate,tp_rate,tn_rate,fp_rate,fn_rate,precision,recall,specificity,npv,accuracy,f1,f2,f0_5,p4,phi
0,-18.9,2e-06,3176.0,2031.0,1145.0,1709.0,1103.0,42.0,322.0,0.639484,0.360516,0.841457,0.963319,0.036681,0.158543,0.976014,0.841457,0.963319,0.774035,0.88539,0.903755,0.865316,0.945766,0.880476,0.776931
1,-16.7,9e-06,3176.0,2031.0,1145.0,1709.0,1119.0,26.0,322.0,0.639484,0.360516,0.841457,0.977293,0.022707,0.158543,0.985014,0.841457,0.977293,0.776544,0.890428,0.907594,0.866721,0.952514,0.88601,0.789637
2,-12.8,0.00014,3176.0,2031.0,1145.0,1709.0,1125.0,20.0,322.0,0.639484,0.360516,0.841457,0.982533,0.017467,0.158543,0.988433,0.841457,0.982533,0.777471,0.892317,0.909043,0.867249,0.955069,0.888076,0.794416
3,-12.5,0.000173,3176.0,2031.0,1145.0,1708.0,1125.0,20.0,323.0,0.639484,0.360516,0.840965,0.982533,0.017467,0.159035,0.988426,0.840965,0.982533,0.776934,0.892003,0.908752,0.866829,0.954937,0.887763,0.793897
4,-12.4,0.000185,3176.0,2031.0,1145.0,1705.0,1132.0,13.0,326.0,0.639484,0.360516,0.839488,0.988646,0.011354,0.160512,0.992433,0.839488,0.988646,0.776406,0.893262,0.909576,0.866186,0.957542,0.889225,0.797936
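
The derived columns in this table follow from the four counts tp, tn, fp and fn. As a check, the sketch below recomputes a few of them for the first row (truth_threshold -18.9) using the standard definitions; accuracy_metrics is a hypothetical helper, not a Splink function.

```python
def accuracy_metrics(tp, tn, fp, fn):
    """Compute common accuracy statistics from confusion matrix counts."""
    precision = tp / (tp + fp)       # share of predicted matches that are true matches
    recall = tp / (tp + fn)          # share of true matches that were found
    specificity = tn / (tn + fp)     # share of true non-matches correctly rejected
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


# Counts copied from the first row of the table above
row0 = accuracy_metrics(tp=1709, tn=1103, fp=42, fn=322)
for name, value in row0.items():
    print(f"{name:>12}: {value:.6f}")
```

Moving down the table, raising the threshold trades recall for precision: fp falls from 42 to 13 while fn rises from 322 to 326 across the five rows shown.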


In [59]:
#unlinkables chart
linker.evaluation.unlinkables_chart()