### Record linkage with Splink

This notebook provides a short worked example using the Splink [quick and dirty persons model](https://moj-analytical-services.github.io/splink/demos/examples/duckdb/quick_and_dirty_persons.html).

In [19]:
from splink.duckdb.linker import DuckDBLinker
from splink.duckdb.blocking_rule_library import block_on
import splink.duckdb.comparison_library as cl
import pandas as pd

I'm using a slightly modified dataset of [Miami-Dade County Jail Bookings](https://gis-mdc.opendata.arcgis.com/datasets/c2275711ced240c6bc4e998ee1910e85). 

The dataset used in this notebook can also be found within this GitHub repository. It contains 266,656 unique combinations of name (`defendant`) and date of birth (`dob`).

In [20]:
df = pd.read_csv('data_md_jb_v2.csv', low_memory=False)
df.head()

Unnamed: 0,unique_id,book_date,date_eu,defendant,surname,first_name,address,city_state_zip,dob,charge1,charge2,charge3,zip,city,state
0,322567,2021/12/01 05:00:00+00,01/12/2021,GERT ALLYSSON,GERT,ALLYSSON,JAKOBFUGLISTRASSE 18 804,ZURICH YY,1995-06-02,CONT SUBS/POSS,CONT SUBS/POSS,DRUG PARAPHERNA/POSN,,ZURICH,YY
1,402605,2023/09/29 04:00:00+00,29/09/2023,THOMAS COLMY,THOMAS,COLMY,1368 JAMES CT,ZIONVILLE IN 46077,1981-07-05,OUT-OF-CNTY/WARRANT,,,46077.0,ZIONVILLE,IN
2,300730,2021/05/29 04:00:00+00,29/05/2021,FARMER CORNELL L,FARMER CORNELL,L,2209 ESHCOL AVE,ZION FL 60099,1988-08-20,BATTERY,,,60099.0,ZION,FL
3,78183,2016/09/24 04:00:00+00,24/09/2016,SCHAPPERT COLE,SCHAPPERT,COLE,1106 CARDINAL DRIVE,ZION IL,1991-05-28,UTTER FORGED INSTRU,RESIST OFF W/O VIOL,DEBIT CARD/UNLAW/USE,,ZION,IL
4,377120,2023/03/09 05:00:00+00,09/03/2023,BENNETT TERRY T,BENNETT TERRY,T,1109 PHEASANT RUN,ZION IL 60099,1992-06-05,SMO/CAN/M/HE/PP/PROH,,,60099.0,ZION,IL


In [21]:
unique_persons_raw = df.drop_duplicates(subset=['defendant', 'dob']).shape[0]
print("Number of unique individuals in raw dataset by combining name and dob:", unique_persons_raw)

Number of unique individuals in raw dataset by combining name and dob: 266656


### 1. Settings

I've modified the suggested settings in the Splink walkthrough to reflect the columns available in the sample dataset.

A full guide on the blocking rules can be found at this link: https://moj-analytical-services.github.io/splink/topic_guides/blocking/blocking_rules.html

In [22]:
settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on("defendant"),
        block_on(["substr(defendant,1,6)", "dob"]),
    ],
    "comparisons": [
        cl.jaro_at_thresholds("defendant", [0.9, 0.7], term_frequency_adjustments=True),
        cl.levenshtein_at_thresholds("dob", [1, 2]),
    ],
}
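To make the two blocking rules concrete, here is a minimal pure-Python sketch of which record pairs they would let through for scoring. The four records are made up for illustration and are not taken from the bookings data, and the helper names are hypothetical. The brute-force loop over all pairs is only for readability; a SQL backend would typically generate the same pairs with joins rather than by comparing everything to everything.

```python
from itertools import combinations

# Toy records (hypothetical, not from the jail bookings data).
records = [
    {"unique_id": 1, "defendant": "SMITH JOHN", "dob": "1980-01-01"},
    {"unique_id": 2, "defendant": "SMITH JOHN", "dob": "1980-01-02"},
    {"unique_id": 3, "defendant": "SMITH JOHNNY", "dob": "1980-01-01"},
    {"unique_id": 4, "defendant": "JONES MARY", "dob": "1975-05-05"},
]


def rule_exact_name(left, right):
    """Mirrors block_on("defendant"): the full name must be identical."""
    return left["defendant"] == right["defendant"]


def rule_prefix_and_dob(left, right):
    """Mirrors block_on(["substr(defendant,1,6)", "dob"]).

    SQL substr(x, 1, 6) is the first six characters, i.e. x[:6] in Python.
    """
    same_prefix = left["defendant"][:6] == right["defendant"][:6]
    return same_prefix and left["dob"] == right["dob"]


def candidate_pairs(rows, rules):
    """Return the id pairs that satisfy at least one blocking rule."""
    pairs = set()
    for left, right in combinations(rows, 2):
        if any(rule(left, right) for rule in rules):
            pairs.add((left["unique_id"], right["unique_id"]))
    return pairs


pairs = candidate_pairs(records, [rule_exact_name, rule_prefix_and_dob])
print(sorted(pairs))
```

Of the six possible pairs, only two survive: records 1 and 2 share an exact name, and records 1 and 3 share a six-character name prefix and a dob. Records 2 and 3 are never compared, which shows how blocking trades a small risk of missed matches for far fewer comparisons.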

### 2. Parameters

See the Splink tutorial on estimating model parameters: https://moj-analytical-services.github.io/splink/demos/tutorials/04_Estimating_model_parameters.html

In [23]:
# these are unchanged from SplinkQD documentation tutorial
linker = DuckDBLinker(df, settings, set_up_basic_logging=False)
deterministic_rules = [
    "l.defendant = r.defendant",
    "l.defendant = r.defendant and l.dob = r.dob",
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.6)

linker.estimate_u_using_random_sampling(max_pairs=2e6)
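The idea behind estimating the prior from deterministic rules can be sketched with toy numbers: count the pairs the rules match, scale that count up by the assumed recall, and divide by the number of possible pairwise comparisons. This is a simplified illustration under that assumption, not Splink's actual code; the function name and the inputs are hypothetical.

```python
def prob_two_random_records_match(n_records, n_deterministic_matches, recall):
    """Rough prior that two randomly chosen records refer to the same person.

    n_deterministic_matches: pairs matched by the deterministic rules.
    recall: assumed share of all true matches that those rules capture.
    """
    # Number of distinct pairs when deduplicating a single dataset.
    total_pairs = n_records * (n_records - 1) // 2
    # If the rules only find a fraction `recall` of true matches,
    # the estimated number of true matches is larger by 1 / recall.
    estimated_true_matches = n_deterministic_matches / recall
    return estimated_true_matches / total_pairs


# Toy inputs: 1,000 records, 300 rule-matched pairs, recall of 0.6.
p = prob_two_random_records_match(1000, 300, 0.6)
print(p)
```

With these toy inputs the 300 observed pairs are scaled up to about 500 estimated true matches out of 499,500 possible pairs. A lower assumed recall therefore raises the prior, so the value passed as `recall` is worth choosing deliberately.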

### 3. Results

In [24]:
results = linker.predict(threshold_match_probability=0.75)


You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'defendant':
    m values not fully trained
Comparison: 'dob':
    m values not fully trained
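The `threshold_match_probability` passed to `predict()` can also be read as a match weight, since in the Fellegi-Sunter framing the match weight is the log base 2 of the odds of a match. The sketch below shows the conversion in both directions; the function names are illustrative and not part of Splink.

```python
import math


def weight_to_probability(match_weight):
    """Convert a log2-odds match weight into a match probability."""
    odds = 2 ** match_weight
    return odds / (1 + odds)


def probability_to_weight(match_probability):
    """Convert a match probability into a log2-odds match weight."""
    return math.log2(match_probability / (1 - match_probability))


# A probability threshold of 0.75 is odds of 3 to 1, i.e. log2(3).
threshold_weight = probability_to_weight(0.75)
print(round(threshold_weight, 3))

# The first rows of the predictions output have match_weight 1.634583.
print(round(weight_to_probability(1.634583), 4))
```

A threshold of 0.75 corresponds to a match weight of about 1.585, so pairs with a weight of 1.634583 (probability of roughly 0.7564) only just clear it.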


In [25]:
results.as_pandas_dataframe(limit=5)

Unnamed: 0,match_weight,match_probability,unique_id_l,unique_id_r,defendant_l,defendant_r,gamma_defendant,dob_l,dob_r,gamma_dob,match_key
0,1.634583,0.756393,4922,11063,MORILLO RICHARD,MORILLO RICHARD,3,1977-09-19,1977-09-17,2,0
1,1.634583,0.756393,3854,11063,MORILLO RICHARD,MORILLO RICHARD,3,1977-09-19,1977-09-17,2,0
2,1.634583,0.756393,264718,267793,LEEMAN COURTNEY BROCK,LEEMAN COURTNEY BROCK,3,1979-02-23,1979-06-23,2,0
3,1.634583,0.756393,258668,267793,LEEMAN COURTNEY BROCK,LEEMAN COURTNEY BROCK,3,1979-02-23,1979-06-23,2,0
4,1.634583,0.756393,165746,267793,LEEMAN COURTNEY BROCK,LEEMAN COURTNEY BROCK,3,1979-02-23,1979-06-23,2,0
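The `gamma_dob` column records which comparison level each pair fell into. The following sketch reproduces that level for the dob pairs in the output above, assuming the usual ordering for `levenshtein_at_thresholds("dob", [1, 2])`: exact match is the top level, then distance of at most 1, then distance of at most 2, then everything else, with -1 reserved for nulls. The functions are a hand-written illustration, not Splink's implementation.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def gamma_dob(left, right, thresholds=(1, 2)):
    """Comparison level for two dob strings (assumed level ordering)."""
    if left is None or right is None:
        return -1
    if left == right:
        return len(thresholds) + 1
    distance = levenshtein(left, right)
    for rank, threshold in enumerate(thresholds):
        if distance <= threshold:
            return len(thresholds) - rank
    return 0


print(gamma_dob("1977-09-19", "1977-09-17"))
print(gamma_dob("1979-02-23", "1979-06-23"))
```

Both pairs differ by a single character, so both land in level 2, matching the `gamma_dob` values in the table. Note that `1979-02-23` and `1979-06-23` are four months apart yet only one edit apart: Levenshtein distance on a date string captures typos, not closeness in time.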


### 4. Adding unique id to persons and using results

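The Splink quick and dirty model reduces the 266,656 unique name and dob combinations in the raw data to 239,166 clusters, a difference of 27,490 individuals who were previously counted more than once.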

In [26]:
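# Add a cluster id to each record.
# Note: `results` only holds pairs scored at 0.75 or above (see predict()),
# so the 0.5 threshold here does not admit any additional pairs.
clusters = linker.cluster_pairwise_predictions_at_threshold(results, threshold_match_probability=0.5)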
clusters.as_pandas_dataframe(limit=5)

Unnamed: 0,cluster_id,unique_id,book_date,date_eu,defendant,surname,first_name,address,city_state_zip,dob,charge1,charge2,charge3,zip,city,state,__splink_salt,tf_defendant
0,368679,368679,2022/12/31 05:00:00+00,31/12/2022,GARCIA JULIET,GARCIA,JULIET,965 SW 136TH PL,MIAMI FL 33184,1974-12-31,RETAIL THEFT/750>,,,33184,MIAMI,FL,0.96566,2e-06
1,368681,368681,2022/12/31 05:00:00+00,31/12/2022,GARCIA RAY,GARCIA,RAY,716 SW 6TH STREET,HALLANDALE FL 33009,1985-08-22,CCF BEF 7/1/23,POSS OF CANNABIS,,33009,HALLANDALE,FL,0.948272,2e-06
2,368682,368682,2022/12/29 05:00:00+00,29/12/2022,ZAMOR EDWIN,ZAMOR,EDWIN,13645 NE 12TH AVE,NORTH MIAMI FL 33161,1988-04-26,WEAPON/OPENLY CARRY,,,33161,NORTH MIAMI,FL,0.4621,2e-06
3,368683,368683,2022/12/31 05:00:00+00,31/12/2022,GARCIASALES LISANDRO DAVID,GARCIASALES LISANDRO,DAVID,43 NE 183RD TER,MIAMI FL 33179,1982-02-27,RESIST OFF W/O VIOL,,,33179,MIAMI,FL,0.126729,2e-06
4,368685,368685,2022/12/31 05:00:00+00,31/12/2022,GIULIESI EMANUELE,GIULIESI,EMANUELE,1423 COLLINS AVE 319,MIAMI BEACH FL 33139,1975-09-05,BATTERY,,,33139,MIAMI BEACH,FL,0.891717,2e-06
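Clustering turns pairwise predictions into groups by treating each accepted pair as an edge and taking connected components, so records linked through a shared neighbour end up with the same `cluster_id`. Below is a small union-find sketch using the five pairs shown in the predictions output plus one unlinked record from the table above. Labelling each cluster with its smallest `unique_id` is a choice made for this sketch, and the real clusters may contain further records that do not appear in those five rows.

```python
def connected_components(node_ids, edges):
    """Map each node to a cluster label (the smallest id in its component)."""
    parent = {node: node for node in node_ids}

    def find(node):
        # Follow parent links to the root, shortening the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for left, right in edges:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Attach the larger root under the smaller one so the
            # root of every component is its minimum id.
            low, high = sorted((root_left, root_right))
            parent[high] = low

    return {node: find(node) for node in node_ids}


# Pairs from the first five rows of the predictions output.
edges = [
    (4922, 11063),
    (3854, 11063),
    (264718, 267793),
    (258668, 267793),
    (165746, 267793),
]
# The linked records plus one record with no predicted match.
node_ids = [3854, 4922, 11063, 165746, 258668, 264718, 267793, 368679]

cluster_of = connected_components(node_ids, edges)
n_clusters = len(set(cluster_of.values()))
print(cluster_of)
print(n_clusters)
```

Records 4922 and 3854 are never directly paired, but both link to 11063, so all three share a cluster. The eight records collapse to three clusters, which is the same counting logic applied to the full dataset in the cells that follow.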


In [29]:
clusters.to_csv('cluster_results.csv', overwrite=True)
results_df = pd.read_csv('cluster_results.csv', low_memory=False)

unique_persons_splinkQD = results_df.drop_duplicates('cluster_id').shape[0]
print("Number of unique individuals after Splink quick and dirty:", unique_persons_splinkQD)

Number of unique individuals after Splink quick and dirty: 239166


In [30]:
difference = unique_persons_raw - unique_persons_splinkQD
print("Number of duplicate persons found in raw data:", difference)

Number of duplicate persons found in raw data: 27490
