<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Overview" data-toc-modified-id="Overview-1">Overview</a></span></li><li><span><a href="#Problem-statement" data-toc-modified-id="Problem-statement-2">Problem statement</a></span></li><li><span><a href="#Setup" data-toc-modified-id="Setup-3">Setup</a></span></li><li><span><a href="#Load-data" data-toc-modified-id="Load-data-4">Load data</a></span></li><li><span><a href="#Does-this-linkage-task-meet-the-requirements-of-Name-Match?" data-toc-modified-id="Does-this-linkage-task-meet-the-requirements-of-Name-Match?-5">Does this linkage task meet the requirements of Name Match?</a></span></li><li><span><a href="#Preparing-the-data-for-linkage" data-toc-modified-id="Preparing-the-data-for-linkage-6">Preparing the data for linkage</a></span></li><li><span><a href="#Running-Name-Match" data-toc-modified-id="Running-Name-Match-7">Running Name Match</a></span></li><li><span><a href="#So-how-many-of-the-potential-candidate-have-run-for-office-before?" data-toc-modified-id="So-how-many-of-the-potential-candidate-have-run-for-office-before?-8">So how many of the potential candidate have run for office before?</a></span></li><li><span><a href="#Understanding-results" data-toc-modified-id="Understanding-results-9">Understanding results</a></span></li><li><span><a href="#Evaluating-the-results-(for-real)" data-toc-modified-id="Evaluating-the-results-(for-real)-10">Evaluating the results (for real)</a></span></li><li><span><a href="#Creating-custom-constraints-(optional/advanced)" data-toc-modified-id="Creating-custom-constraints-(optional/advanced)-11">Creating custom constraints (optional/advanced)</a></span></li></ul></div>

# Overview

This tutorial demonstrates how to use Name Match to link two datasets. It walks through step-by-step:
* How to determine if the data meets the requirements of Name Match
* How to prepare the data for linkage
* Running Name Match
    * Setting up the config file
    * Linking the datasets using the Name Match package
* How to investigate the quality of the results
* Improving your match with custom constraints (optional, for more advanced users with complex tasks)

# Problem statement

There is an upcoming election in Brazil, and reporters are starting to speculate about who will be running for various offices. You want to know which of these potential candidates have run for election in the past -- what position(s) they ran for when, who they ran against, etc. 

To answer these questions, you need to link two datasets: 
* The list of potential candidates being discussed in the news
* The official ballot records from the last three election years

# Setup

Import packages and configure notebook settings

In [1]:
import pandas as pd
import numpy as np
from namematch.namematcher import NameMatcher

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from IPython.display import display, Markdown

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Load data

In [2]:
potential_candidates = pd.read_csv('raw_data/potential_candidates.csv')
past_candidates = pd.read_csv('raw_data/past_candidates.csv')

In [3]:
print(f"\nThere are {len(potential_candidates)} candidates thinking about running in the upcoming election.\n")

potential_candidates.head()


There are 10588 candidates thinking about running in the upcoming election.



Unnamed: 0,name,dob,sex,race,marital_status
0,ANTONIA EUFRASIA VIEIRA DE SOUSA,1975-01-01,M,BRANCA,CASADO(A)
1,LAURA SOEIRO RODRIGUES,1964-04-12,F,BRANCA,DIVORCIADO(A)
2,RODOLFO FRANCA GALVAO SEGUNDO,1982-07-19,M,PARDA,CASADO(A)
3,ROSIMAR FERREIRA DE ARAUJO,1981-07-23,M,PARDA,SOLTEIRO(A)
4,BRUNA GOMIDE DA SILVA,1982-07-02,F,PARDA,SOLTEIRO(A)


In [4]:
print(f"\nThere are {len(past_candidates)} ballot records from the last three election years ",
      f"({past_candidates.candidate_id.nunique()} unique political candidates).\n")

past_candidates.head()


There are 11731 ballot records from the last three election years  (9722 unique political candidates).



Unnamed: 0,candidacy_id,candidate_id,full_name,dob,gender,race,marital_status,election_date,home_state
0,674185040,40845818449,AZMAZETE BERNARDINO DE SENA PAIVA,21/12/1966,FEMININO,PARDA,DIVORCIADO(A),05/10/2014,PE
1,391482866,94465061234,CLAUDIA GOMES ROLIM,15/05/1981,FEMININO,PARDA,SOLTEIRO(A),05/10/2014,RO
2,351167154,1356367798,JOANA D ARC MAGESKI DE SOUSA,22/05/1973,FEMININO,BRANCA,CASADO(A),05/10/2014,ES
3,617984557,26771321843,MARIA ZILDA SILVA,28/02/1969,FEMININO,BRANCA,DIVORCIADO(A),05/10/2014,SP
4,441638425,31813569649,HELENA MARIA DE SOUSA,09/08/1953,FEMININO,PRETA,SOLTEIRO(A),05/10/2014,MG


# Does this linkage task meet the requirements of Name Match?

The requirements of Name Match are: 
1. At least some of the records involved in the link already have a unique person-identifier
2. For the records that already have a unique person-identifier, it must be possible for a person to show up multiple times with variation in their identifying information (e.g. typos, nicknames, last-name-changes, etc.)

#### Does this task meet requirement #1?

<font color='green'><b>Yes!</b></font> The `past_candidates` file has a `candidate_id` field with non-null values.

In [5]:
past_candidates.candidate_id.head()

0    40845818449
1    94465061234
2     1356367798
3    26771321843
4    31813569649
Name: candidate_id, dtype: int64

#### Does this task meet requirement #2?

To answer this question, we'll need to do a little exploratory data analysis. The question we're trying to answer is: "Is it possible for a candidate_id to appear multiple times, with different values in fields like name or dob?"

In [6]:
# calculate the number of unique names and dobs per candidate id

by_candidate_id = past_candidates.groupby('candidate_id')[['full_name', 'dob']].nunique()
by_candidate_id.head()

candidate_id,full_name,dob
578274,1,2
2184052,1,1
2498669,2,1
4233514,1,1
4772733,2,1


When we count the number of unique names and dobs per `candidate_id`, we see that not all values are 1. This means a candidate can appear multiple times with variations in the spelling of their name or the recording of their dob.

This means that <font color='green'><b>Yes!</b></font> the second requirement of Name Match is met.
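
To put a number on this check, one could compute the share of person identifiers that show more than one distinct value in at least one identifying field. The sketch below runs on a small made-up table rather than the tutorial data, and the helper name `share_of_ids_with_variation` is hypothetical.

```python
import pandas as pd

def share_of_ids_with_variation(df, id_col, fields):
    """Share of ids with more than one distinct value in at least one field."""
    counts = df.groupby(id_col)[fields].nunique()
    return (counts > 1).any(axis=1).mean()

# toy data (made up for illustration): only id 1 has a name variant
toy = pd.DataFrame({
    'candidate_id': [1, 1, 2, 2, 3],
    'full_name': ['ANA SILVA', 'ANA DA SILVA', 'JOAO COSTA', 'JOAO COSTA', 'RITA LIMA'],
    'dob': ['1980-01-01', '1980-01-01', '1975-05-05', '1975-05-05', '1990-09-09'],
})

share = share_of_ids_with_variation(toy, 'candidate_id', ['full_name', 'dob'])
print(f"{share:.1%} of ids show variation")
```

A share of zero would suggest the second requirement is not met; any positive share means there is at least some variation to learn from.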

# Preparing the data for linkage

### Deciding which fields should be used to inform the match

First and last name fields are required for running Name Match. Date-of-birth and age fields are also required, though Name Match can tolerate a small amount of missingness in those fields. Using other fields about a person's identity (e.g. race, gender, address, middle name/initial) is optional, but can help the algorithm distinguish between people. Thus, it is a good idea to include them when they are available for the majority of records. 

A field should only be considered for inclusion if it: 1) contains information that could be useful for identifying the person and 2) is present in all of the datasets being linked*.

Let's see what fields are available in the datasets we're linking:

In [7]:
potential_candidates.columns.tolist()
past_candidates.columns.tolist()

['name', 'dob', 'sex', 'race', 'marital_status']

['candidacy_id',
 'candidate_id',
 'full_name',
 'dob',
 'gender',
 'race',
 'marital_status',
 'election_date',
 'home_state']

The fields that should be used to inform the matching algorithm in this task are: 
* First name, last name, middle name (these fields will have to be generated from the `name` and `full_name` fields in the raw data during pre-processing -- see below)
* Date of birth
* Age (needs to be created from DOB -- see below)
* Gender
* Race
* Marital status

Notice that we don't include information about the candidate's home state, even though that information is related to the candidate's identity -- this is because it is only present in one of our two datasets and would therefore not be helpful when linking.

One might think that including information that can change somewhat frequently (e.g. address, marital status, etc.) would be a bad idea. Though this is an empirical question that depends on the specific matching task, we've seen that this information often does help the algorithm disambiguate people. For example, a change in marital status might help signal to the algorithm that it's okay to link two records with the same first name and different last names. 

\* It's possible to include fields that are missing from one or more of the datasets being linked, but it is ill-advised in most circumstances.
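
One way to judge whether an optional field is "available for the majority of records" is to compute the share of non-missing values per field in each dataset. The following is a minimal sketch on made-up data; `field_coverage` and the toy tables are hypothetical and not part of Name Match.

```python
import pandas as pd

def field_coverage(datasets, field_map):
    """Share of non-missing values per variable and dataset.

    datasets:  {dataset_name: DataFrame}
    field_map: {variable_name: {dataset_name: column_name}}
    """
    rows = {}
    for variable, cols in field_map.items():
        rows[variable] = {
            name: datasets[name][col].notna().mean() for name, col in cols.items()
        }
    return pd.DataFrame.from_dict(rows, orient='index')

# toy data (made up for illustration)
toy_a = pd.DataFrame({'sex': ['M', None, 'F', 'F'],
                      'race': ['PARDA', 'BRANCA', None, None]})
toy_b = pd.DataFrame({'gender': ['F', 'M'],
                      'race': ['PRETA', 'PARDA']})

coverage = field_coverage(
    {'a': toy_a, 'b': toy_b},
    {'gender': {'a': 'sex', 'b': 'gender'},
     'race': {'a': 'race', 'b': 'race'}},
)
print(coverage)
```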

### Preprocessing

#### Basic data cleaning

There might be a few basic data cleaning steps that need to be taken before running Name Match, such as separating fields into smaller components or converting timestamps to dates. All we need to do for this matching task is to separate the full name field into distinct first, middle, and last name fields.

In [8]:
def assign_name_components(name_components):
    
    # NOTE, this implementation is just an example and may not be 
    # correct for these specific datasets
    
    # np.nan is used instead of np.NaN, an alias that was removed in NumPy 2.0
    if len(name_components) == 1:
        first_name = np.nan
        middle_name = np.nan
        last_name = name_components[0]
    elif len(name_components) == 2:
        first_name = name_components[0]
        middle_name = np.nan
        last_name = name_components[1]
    else:
        first_name = name_components[0]
        middle_name = name_components[1]
        last_name = ' '.join(name_components[2:])
        
    return first_name, middle_name, last_name
    

def create_separate_name_columns(df, name_col):
    
    df = df.copy()
    
    # split on any run of whitespace (the raw string avoids an invalid-escape warning)
    name_components = df[name_col].str.strip().str.split(r'\s+')
    name_components = name_components.apply(assign_name_components)
    
    name_components_df = pd.DataFrame(name_components.tolist(), index=df.index, 
                                      columns=['first_name', 'middle_name', 'last_name'])
    
    df = pd.concat([df, name_components_df], axis=1)
    
    return df

In [9]:
potential_candidates = create_separate_name_columns(potential_candidates, 'name')
past_candidates = create_separate_name_columns(past_candidates, 'full_name')

In [10]:
potential_candidates.head()
past_candidates.head()

Unnamed: 0,name,dob,sex,race,marital_status,first_name,middle_name,last_name
0,ANTONIA EUFRASIA VIEIRA DE SOUSA,1975-01-01,M,BRANCA,CASADO(A),ANTONIA,EUFRASIA,VIEIRA DE SOUSA
1,LAURA SOEIRO RODRIGUES,1964-04-12,F,BRANCA,DIVORCIADO(A),LAURA,SOEIRO,RODRIGUES
2,RODOLFO FRANCA GALVAO SEGUNDO,1982-07-19,M,PARDA,CASADO(A),RODOLFO,FRANCA,GALVAO SEGUNDO
3,ROSIMAR FERREIRA DE ARAUJO,1981-07-23,M,PARDA,SOLTEIRO(A),ROSIMAR,FERREIRA,DE ARAUJO
4,BRUNA GOMIDE DA SILVA,1982-07-02,F,PARDA,SOLTEIRO(A),BRUNA,GOMIDE,DA SILVA


Unnamed: 0,candidacy_id,candidate_id,full_name,dob,gender,race,marital_status,election_date,home_state,first_name,middle_name,last_name
0,674185040,40845818449,AZMAZETE BERNARDINO DE SENA PAIVA,21/12/1966,FEMININO,PARDA,DIVORCIADO(A),05/10/2014,PE,AZMAZETE,BERNARDINO,DE SENA PAIVA
1,391482866,94465061234,CLAUDIA GOMES ROLIM,15/05/1981,FEMININO,PARDA,SOLTEIRO(A),05/10/2014,RO,CLAUDIA,GOMES,ROLIM
2,351167154,1356367798,JOANA D ARC MAGESKI DE SOUSA,22/05/1973,FEMININO,BRANCA,CASADO(A),05/10/2014,ES,JOANA,D,ARC MAGESKI DE SOUSA
3,617984557,26771321843,MARIA ZILDA SILVA,28/02/1969,FEMININO,BRANCA,DIVORCIADO(A),05/10/2014,SP,MARIA,ZILDA,SILVA
4,441638425,31813569649,HELENA MARIA DE SOUSA,09/08/1953,FEMININO,PRETA,SOLTEIRO(A),05/10/2014,MG,HELENA,MARIA,DE SOUSA
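
As a quick sanity check on the split, one could confirm that no name tokens were lost by rejoining the three components and comparing them with the whitespace-normalized original. This is a sketch on made-up rows, and `names_round_trip` is a hypothetical helper. It only checks that tokens are preserved; it would not flag a questionable assignment such as `D` becoming the middle name in `JOANA D ARC MAGESKI DE SOUSA`, so spot checking is still worthwhile.

```python
import numpy as np
import pandas as pd

def names_round_trip(df, name_col):
    """True where first + middle + last rebuilds the normalized original name."""
    parts = df[['first_name', 'middle_name', 'last_name']]
    # join the non-missing components with single spaces
    rebuilt = parts.apply(lambda row: ' '.join(row.dropna().astype(str)), axis=1)
    original = df[name_col].str.strip().str.replace(r'\s+', ' ', regex=True)
    return rebuilt == original

# toy data (made up for illustration)
toy = pd.DataFrame({
    'name': ['MARIA ZILDA SILVA', ' CHER ', 'ANA  COSTA'],
    'first_name': ['MARIA', np.nan, 'ANA'],
    'middle_name': ['ZILDA', np.nan, np.nan],
    'last_name': ['SILVA', 'CHER', 'COSTA'],
})

ok = names_round_trip(toy, 'name')
print(ok.all())
```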


#### Standardizing the two data files

It's important that the files that are being linked together represent certain information in the same way. 

For example, one of our datasets encodes gender as M vs F, while the other encodes it as MASCULINO vs. FEMININO. If left as is, the algorithm will compare F and FEMININO and determine that the genders do not match. Since we want this to be seen as a matching gender, we need to standardize the way gender information is represented. 

Here we'll do that by shortening `MASCULINO` and `FEMININO` to M and F, respectively, in the past_candidates dataset.

In [11]:
past_candidates['gender'] = past_candidates.gender.map({'MASCULINO':'M', 'FEMININO':'F'})
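
Note that `Series.map` with a dictionary turns every value that is not a key of the dictionary into `NaN`. Before applying a mapping like the one above, it may be worth listing the values it would silently drop. A minimal sketch on made-up data, with `unmapped_values` as a hypothetical helper:

```python
import pandas as pd

gender_map = {'MASCULINO': 'M', 'FEMININO': 'F'}

def unmapped_values(series, mapping):
    """Distinct non-missing values that the mapping does not cover."""
    observed = series.dropna().unique()
    return sorted(v for v in observed if v not in mapping)

# toy data (made up for illustration): one value has unexpected casing
toy_gender = pd.Series(['MASCULINO', 'FEMININO', 'feminino', None, 'FEMININO'])
leftover = unmapped_values(toy_gender, gender_map)
print(leftover)
```

An empty list means the mapping covers every observed value.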

The only other type of standardization needed for this matching task relates to the date fields -- specifically `dob`. One dataset encodes dates in the yyyy-mm-dd format, while the other encodes dates as dd/mm/yyyy. We'll fix this by encoding all dates in the yyyy-mm-dd format.

In [12]:
past_candidates['dob'] = pd.to_datetime(past_candidates.dob, format='%d/%m/%Y')
past_candidates['election_date'] = pd.to_datetime(past_candidates.election_date, format='%d/%m/%Y')
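
With an explicit `format`, `pd.to_datetime` raises an error on the first value that does not fit. If the raw file might contain a few malformed dates, one option is to parse with `errors='coerce'` and inspect what failed. The sketch below uses made-up values; `unparseable_dates` is a hypothetical helper.

```python
import pandas as pd

def unparseable_dates(series, fmt):
    """Non-missing values that cannot be parsed with the given format."""
    parsed = pd.to_datetime(series, format=fmt, errors='coerce')
    return series[parsed.isna() & series.notna()]

# toy data (made up): an impossible date and a value in the wrong format
toy_dates = pd.Series(['21/12/1966', '31/02/1970', '1981-05-15', None])
bad = unparseable_dates(toy_dates, '%d/%m/%Y')
print(bad.tolist())
```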

Note that standardizing column names (e.g. gender vs sex) is not necessary during pre-processing. There will be a way to refer to each variable by its original name when configuring the match -- see below.

#### Creating a single-reference age field

Age can be a useful field during record matching, but only if it is calculated as of a single reference date, such as "age as of January 1st, 2025" below. This is necessary because it is likely that the records associated with a given person were not all generated on the same day. For example, if a person is 18 in a record from 2010 and 26 in a record from 2016, we don't want the algorithm to see 18 and 26 and assume it's not the same person. 

This may mean that you need to adjust any pre-existing `age` fields in your input dataset. For this dataset, however, there was not an existing age field -- so we simply calculate it from DOB. 

In [13]:
past_candidates['age_in_2025'] = (pd.to_datetime('2025-01-01') - past_candidates.dob).astype('<m8[Y]').astype(int)
potential_candidates['age_in_2025'] = (pd.to_datetime('2025-01-01') - pd.to_datetime(potential_candidates.dob)).astype('<m8[Y]').astype(int)
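
The cell above derives age by casting a timedelta to year units, which some newer pandas versions may reject. An alternative is to compute age from calendar components: subtract the birth year from the reference year, then subtract one if the birthday has not yet occurred by the reference date. This is a sketch under that assumption, with `age_on` as a hypothetical helper, shown on three of the birth dates from the tables above.

```python
import pandas as pd

def age_on(dob, reference_date):
    """Age in completed years as of reference_date, based on calendar dates."""
    dob = pd.to_datetime(dob)
    ref = pd.Timestamp(reference_date)
    # has the birthday already occurred on or before the reference date?
    birthday_passed = (dob.dt.month < ref.month) | (
        (dob.dt.month == ref.month) & (dob.dt.day <= ref.day)
    )
    return ref.year - dob.dt.year - (~birthday_passed).astype(int)

dobs = pd.Series(['1975-01-01', '1964-04-12', '1982-07-19'])
ages = age_on(dobs, '2025-01-01')
print(ages.tolist())  # [50, 60, 42]
```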

#### Locating or creating a record id

Name Match must have a way of uniquely identifying each record in a given dataset. Therefore, each input dataset must have a column with a different value for every record. The `past_candidates` dataset already meets this requirement because of the `candidacy_id` field.

In [14]:
len(past_candidates) == past_candidates.candidacy_id.nunique()

True

However, the `potential_candidates` dataset does not come with a unique record id -- so one must be created during pre-processing.

In [15]:
potential_candidates['record_id'] = np.arange(len(potential_candidates))
potential_candidates['record_id'].head()

0    0
1    1
2    2
3    3
4    4
Name: record_id, dtype: int64
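
Whether a record id column already exists or was just created, a short check that it is fully populated and unique can catch problems before the match is run. A minimal sketch on made-up values, with `is_valid_record_id` as a hypothetical helper:

```python
import pandas as pd

def is_valid_record_id(series):
    """True if every record has an id and no id is repeated."""
    return bool(series.notna().all() and series.is_unique)

# toy data (made up for illustration)
ids_ok = pd.Series(range(5))
ids_dup = pd.Series([1, 2, 2, 3])
ids_missing = pd.Series([1, None, 3])

print(is_valid_record_id(ids_ok), is_valid_record_id(ids_dup), is_valid_record_id(ids_missing))
```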

#### Outputting the prepared datasets

After all of these preparation steps, our two datasets are ready for Name Match! We'll write them out as CSVs.

In [16]:
potential_candidates.head()
past_candidates.head()

Unnamed: 0,name,dob,sex,race,marital_status,first_name,middle_name,last_name,age_in_2025,record_id
0,ANTONIA EUFRASIA VIEIRA DE SOUSA,1975-01-01,M,BRANCA,CASADO(A),ANTONIA,EUFRASIA,VIEIRA DE SOUSA,50,0
1,LAURA SOEIRO RODRIGUES,1964-04-12,F,BRANCA,DIVORCIADO(A),LAURA,SOEIRO,RODRIGUES,60,1
2,RODOLFO FRANCA GALVAO SEGUNDO,1982-07-19,M,PARDA,CASADO(A),RODOLFO,FRANCA,GALVAO SEGUNDO,42,2
3,ROSIMAR FERREIRA DE ARAUJO,1981-07-23,M,PARDA,SOLTEIRO(A),ROSIMAR,FERREIRA,DE ARAUJO,43,3
4,BRUNA GOMIDE DA SILVA,1982-07-02,F,PARDA,SOLTEIRO(A),BRUNA,GOMIDE,DA SILVA,42,4


Unnamed: 0,candidacy_id,candidate_id,full_name,dob,gender,race,marital_status,election_date,home_state,first_name,middle_name,last_name,age_in_2025
0,674185040,40845818449,AZMAZETE BERNARDINO DE SENA PAIVA,1966-12-21,F,PARDA,DIVORCIADO(A),2014-10-05,PE,AZMAZETE,BERNARDINO,DE SENA PAIVA,58
1,391482866,94465061234,CLAUDIA GOMES ROLIM,1981-05-15,F,PARDA,SOLTEIRO(A),2014-10-05,RO,CLAUDIA,GOMES,ROLIM,43
2,351167154,1356367798,JOANA D ARC MAGESKI DE SOUSA,1973-05-22,F,BRANCA,CASADO(A),2014-10-05,ES,JOANA,D,ARC MAGESKI DE SOUSA,51
3,617984557,26771321843,MARIA ZILDA SILVA,1969-02-28,F,BRANCA,DIVORCIADO(A),2014-10-05,SP,MARIA,ZILDA,SILVA,55
4,441638425,31813569649,HELENA MARIA DE SOUSA,1953-08-09,F,PRETA,SOLTEIRO(A),2014-10-05,MG,HELENA,MARIA,DE SOUSA,71


In [17]:
# past_candidates.to_csv('preprocessed_data/past_candidates.csv', index=False)
# potential_candidates.to_csv('preprocessed_data/potential_candidates.csv', index=False)

# Running Name Match

### Create the config

There are certain decisions the user makes about what files to link, which variables to use, and what settings the algorithm should take -- the config is where you record these decisions! 

The Name Match package can accept the config in two formats: a python dictionary, as we'll show below, or a YAML file. A YAML file is simply a way of writing key-value pairs in a flat file. Thus, it operates much like a python dictionary. Neither option is better or worse than the other, so do what makes sense to you based on what you're most familiar with and how you're calling Name Match. 

There are three main "sections" of the config: the data files, the variables, and general parameters/settings. 

The **data files** section consists of a single `data_files` dictionary of dictionaries. The keys are the reference names of each dataset being linked, and the values are dictionaries with information like `filepath` and `record_id_col` for each dataset.

The **variables** section is comprised of a single `variables` list of dictionaries. Each element in the list is a dictionary with information about one of the variables that should be used in the match. Each variable gets a reference `name` (e.g. gender), a `compare_type` (e.g. String, Date, Category, etc.), and a set of `<dataset>_col` definitions. The `<dataset>_col` field is where you specify the name of the field in the underlying dataset (e.g. `sex` for one dataset and `gender` for the other). For each variable, there should be one `<dataset>_col` field per input dataset defined in the `data_files` section. Each variable dictionary can take optional fields like `drop`, `check`, and `set_missing`. See the Documentation for more information. 

The **parameters** section of the config is simply a series of key/value pairs where you tell the algorithm anything special about how you want it to run. All parameters have default values (see Documentation) that can be left alone for all simple matches. These defaults make this section of the config optional. 

In [18]:
config_dict = {
    
    # define the data files that will be linked and/or deduplicated
    # -------------------------------------------------------------
    
    'data_files': {
        'potential_candidates': {
            'filepath' : 'preprocessed_data/potential_candidates.csv',
            'record_id_col' : 'record_id'
        },
        'past_candidates': {
            'filepath' : 'preprocessed_data/past_candidates.csv',
            'record_id_col' : 'candidacy_id'
        }        
    },
    
    # define the variables that should be used in the matching process
    # ----------------------------------------------------------------
    
    'variables': [
        {
            'name' : 'first_name',
            'compare_type' : 'String',
            'potential_candidates_col' : 'first_name',
            'past_candidates_col' : 'first_name',
            'drop': [''] # don't include records with missing first-name in the match
        },
        {
            'name' : 'last_name',
            'compare_type' : 'String',
            'potential_candidates_col' : 'last_name',
            'past_candidates_col' : 'last_name',
            'drop': [''] # don't include records with missing last-name in the match
        },
        {
            'name' : 'dob',
            'compare_type' : 'Date',
            'potential_candidates_col' : 'dob',
            'past_candidates_col' : 'dob',
            'check' : 'Date - %Y-%m-%d' # optional, check to make sure that all input values
                                        # are in this format; otherwise, set to missing
        },
        {
            'name' : 'age',
            'compare_type' : 'Number',
            'potential_candidates_col' : 'age_in_2025',
            'past_candidates_col' : 'age_in_2025',
        },
        {
            'name' : 'middle_name',
            'compare_type' : 'String',
            'potential_candidates_col' : 'middle_name',
            'past_candidates_col' : 'middle_name'            
        },
        {
            'name' : 'gender',
            'compare_type' : 'Category',
            'potential_candidates_col' : 'sex', # notice we refer to the original column name, which differs by dataset
            'past_candidates_col' : 'gender',
            'check' : 'M,F' # optional, check that input values are either M or F; otherwise, set to missing
        },
        {
            'name' : 'race',
            'compare_type' : 'Category',
            'potential_candidates_col' : 'race',
            'past_candidates_col' : 'race'
        },
        {
            'name' : 'marital_status',
            'compare_type' : 'Category',
            'potential_candidates_col' : 'marital_status',
            'past_candidates_col' : 'marital_status'
        },
        {
            # exactly one of the variables defined in the config should have compare_type of "UniqueID";
            # this is the pre-existing unique person-identifier that we identified when checking if our 
            # datasets met the requirements of Name Match
            'name' : 'official_candidate_id',
            'compare_type' : 'UniqueID', 
            'potential_candidates_col' : '', # the potential candidates dataset does not have this 
                                             # column -- therefore the col value is empty ('')
            'past_candidates_col' : 'candidate_id'
        }
    ],
    
    # set parameter values (optional -- defaults work for most basic matching tasks)
    # ------------------------------------------------------------------------------
    
    'num_workers': 8,
    'pct_train': .9,
    'allow_clusters_w_multiple_unique_ids': False,
    'missingness_model': None,
    'negate_exact_match_variables': ['middle_name']
    
}
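
Two of the rules described above are easy to check mechanically before a run: every variable should define one `<dataset>_col` entry per dataset in `data_files`, and exactly one variable should have a `compare_type` of `UniqueID`. The sketch below is a hypothetical pre-flight check written for this tutorial, not part of the Name Match package, and it runs on a deliberately broken toy config.

```python
def check_config(config):
    """Return a list of human-readable problems found in a config dictionary."""
    problems = []
    datasets = list(config['data_files'])

    # each variable needs one '<dataset>_col' entry per dataset
    for var in config['variables']:
        for ds in datasets:
            if f'{ds}_col' not in var:
                problems.append(f"{var['name']}: missing '{ds}_col'")

    # exactly one variable should be the pre-existing person identifier
    n_unique_id = sum(v['compare_type'] == 'UniqueID' for v in config['variables'])
    if n_unique_id != 1:
        problems.append(f"expected exactly one UniqueID variable, found {n_unique_id}")

    return problems

# toy config (made up): 'gender' lacks a column for dataset 'b', and no UniqueID is defined
toy_config = {
    'data_files': {
        'a': {'filepath': 'a.csv', 'record_id_col': 'record_id'},
        'b': {'filepath': 'b.csv', 'record_id_col': 'record_id'},
    },
    'variables': [
        {'name': 'first_name', 'compare_type': 'String', 'a_col': 'first', 'b_col': 'first'},
        {'name': 'gender', 'compare_type': 'Category', 'a_col': 'sex'},
    ],
}

problems = check_config(toy_config)
print(problems)
```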

### Link the data

To run Name Match,
1. Instantiate a NameMatcher object by passing in the config dictionary created above
2. Call the run() command on that object

Depending on the size of your input datasets, the run command may take several hours* to complete. For this tutorial, it should only take a few minutes.

In [19]:
nm = NameMatcher(
    config=config_dict, 
    output_dir='tutorial_output/'
)

28-Jan-22 14:52:14 INFO     The log file will be located at /projects/2017-007-namematch/melissa_work/dev_master_nm/name_match/examples/tutorial_output/details/name_match.log.


In [20]:
nm.run()

2022-01-28 14:52:15 - INFO     Running task: ProcessInputData
2022-01-28 14:52:15 - INFO     This is not an incremental run.
2022-01-28 14:52:15 - INFO     Reading potential_candidates data.
2022-01-28 14:52:15 - INFO     Done writing potential_candidates data.
2022-01-28 14:52:15 - INFO     Reading past_candidates data.
2022-01-28 14:52:16 - INFO     Done writing past_candidates data.
2022-01-28 14:52:16 - INFO     Number of input records: 22319
2022-01-28 14:52:16 - INFO     Number of valid input records: 22317
2022-01-28 14:52:16 - INFO     Writing stats for task: ProcessInputData
2022-01-28 14:52:16 - INFO     Running task: GenerateMustLinks
2022-01-28 14:52:16 - INFO     Generating "must-link" record pairs.
2022-01-28 14:52:16 - INFO     Writing


0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************


2022-01-28 14:52:39 - INFO     Creating query shingles matrix.
2022-01-28 14:52:40 - INFO     Getting identical candidate pairs.
2022-01-28 14:52:40 - INFO     Querying to get candidate pairs.
2022-01-28 14:52:40 - INFO     Indices to query: 1
2022-01-28 14:52:57 - INFO     Computing cosine similarities.
2022-01-28 14:52:57 - INFO     Number of uncovered pairs: 84
2022-01-28 14:52:57 - INFO     Number of true pairs: 1906
2022-01-28 14:52:57 - INFO     Calculating pair completeness.
2022-01-28 14:52:57 - INFO     Pair completeness, including equal blockstrings (cosine level): 0.974
2022-01-28 14:52:57 - INFO     Pair completeness, including equal blockstrings (cosine + editdistance level): 0.956
2022-01-28 14:52:57 - INFO     Pair completeness, non-equal blockstrings (co

\* See the runtime guidelines in the Documentation for a more specific estimate

##### Look at the final output

The output datasets look exactly like the input datasets, except they now have an additional column called `cluster_id`. This is the unique person identifier that can be used to link records within and across datasets.

In [21]:
matched_potential_candidates = pd.read_csv('tutorial_output/potential_candidates_with_clusterid.csv')
matched_past_candidates = pd.read_csv('tutorial_output/past_candidates_with_clusterid.csv')

matched_potential_candidates.head()
matched_past_candidates.head()

Unnamed: 0,name,dob,sex,race,marital_status,first_name,middle_name,last_name,age_in_2025,record_id,cluster_id
0,ANTONIA EUFRASIA VIEIRA DE SOUSA,1975-01-01,M,BRANCA,CASADO(A),ANTONIA,EUFRASIA,VIEIRA DE SOUSA,50,0,1716
1,LAURA SOEIRO RODRIGUES,1964-04-12,F,BRANCA,DIVORCIADO(A),LAURA,SOEIRO,RODRIGUES,60,1,1717
2,RODOLFO FRANCA GALVAO SEGUNDO,1982-07-19,M,PARDA,CASADO(A),RODOLFO,FRANCA,GALVAO SEGUNDO,42,2,1718
3,ROSIMAR FERREIRA DE ARAUJO,1981-07-23,M,PARDA,SOLTEIRO(A),ROSIMAR,FERREIRA,DE ARAUJO,43,3,1719
4,BRUNA GOMIDE DA SILVA,1982-07-02,F,PARDA,SOLTEIRO(A),BRUNA,GOMIDE,DA SILVA,42,4,1720


Unnamed: 0,candidacy_id,candidate_id,full_name,dob,gender,race,marital_status,election_date,home_state,first_name,middle_name,last_name,age_in_2025,cluster_id
0,674185040,40845818449,AZMAZETE BERNARDINO DE SENA PAIVA,1966-12-21,F,PARDA,DIVORCIADO(A),2014-10-05,PE,AZMAZETE,BERNARDINO,DE SENA PAIVA,58,0
1,391482866,94465061234,CLAUDIA GOMES ROLIM,1981-05-15,F,PARDA,SOLTEIRO(A),2014-10-05,RO,CLAUDIA,GOMES,ROLIM,43,12304
2,351167154,1356367798,JOANA D ARC MAGESKI DE SOUSA,1973-05-22,F,BRANCA,CASADO(A),2014-10-05,ES,JOANA,D,ARC MAGESKI DE SOUSA,51,1202
3,617984557,26771321843,MARIA ZILDA SILVA,1969-02-28,F,BRANCA,DIVORCIADO(A),2014-10-05,SP,MARIA,ZILDA,SILVA,55,12305
4,441638425,31813569649,HELENA MARIA DE SOUSA,1953-08-09,F,PRETA,SOLTEIRO(A),2014-10-05,MG,HELENA,MARIA,DE SOUSA,71,12306


# So how many of the potential candidate have run for office before?

We can answer this question by seeing what share of the `cluster_id` values from the potential candidates dataset appear in the past candidates dataset.

In [22]:
share_run_before__algorithm = matched_potential_candidates.cluster_id.isin(matched_past_candidates.cluster_id).mean()
print(f"{(share_run_before__algorithm * 100).round(1)}%")

32.4%
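
The same `cluster_id` column can also answer the follow-up questions from the problem statement, such as when each potential candidate ran before. One approach is an inner merge of the two matched files on `cluster_id`. The sketch below uses small made-up tables that borrow the tutorial's column names; the values are invented for illustration.

```python
import pandas as pd

# toy stand-ins for the matched output files (values made up)
potential = pd.DataFrame({
    'record_id': [0, 1, 2],
    'name': ['ANA SILVA', 'JOAO COSTA', 'RITA LIMA'],
    'cluster_id': [10, 11, 12],
})
past = pd.DataFrame({
    'candidacy_id': [100, 101, 102],
    'cluster_id': [10, 10, 99],
    'election_date': ['2014-10-05', '2016-10-02', '2014-10-05'],
    'home_state': ['SP', 'SP', 'MG'],
})

# one row per (potential candidate, past candidacy) pair that shares a cluster
history = potential.merge(past, on='cluster_id', how='inner')
print(history[['name', 'candidacy_id', 'election_date', 'home_state']])
```

Potential candidates with no past record drop out of an inner merge; `how='left'` would keep them with missing values in the past-candidacy columns.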


# Understanding results

It is often challenging to get exact estimates of the error in your match, since you needed a probabilistic record linkage solution in the first place precisely because you don't know which records should be linked and which shouldn't. However, there are several steps you can take to ensure there are no major problems with your matching process or results. 

#### Look at the metrics automatically output by Name Match

Name Match generates several performance metrics automatically during processing. While the exact value of these measures will vary task to task, it's a good idea to understand what the metrics mean so you can gauge whether the values you're seeing make sense for your matching task. And since these metrics are output as they are calculated, they can often serve as an early warning system of any big issues with the match config or pre-processing.

##### Pair Completeness

One metric, called Pair Completeness, essentially measures the "recall" of Name Match's blocking step. Specifically, it measures how many of the name/dob pairs that are *known* to belong to the same person make it past the first big match-filtering step of the algorithm. Ideally, this measure will be ~90%+. 

Based on the following line logged during the `run()` call, we can see that the pair completeness metric looks good for this task.

`Pair completeness, including equal blockstrings (cosine + editdistance level): 0.956`
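
As an illustration of the idea, pair completeness can be written as the share of known true pairs that survive blocking. The counts logged during the run (`Number of true pairs: 1906`, `Number of uncovered pairs: 84`) appear consistent with the value above, although reading those two log lines as the inputs to this particular metric is an assumption made here, not something the log states.

```python
def pair_completeness(n_true_pairs, n_uncovered_pairs):
    """Share of known true pairs that make it past the blocking step."""
    return (n_true_pairs - n_uncovered_pairs) / n_true_pairs

# counts taken from the run log above
value = pair_completeness(1906, 84)
print(round(value, 3))  # 0.956
```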

##### Out-of-sample precision, recall, AUC

One of the three main steps in the Name Match algorithm is a standard supervised prediction problem. A match-or-no-match model is learned using 90% (by default) of the labeled pairs. The model is then evaluated on the held-out labeled pairs, using standard binary model evaluation metrics like precision, recall, and AUC. Typical values of successful matches are ~95%+. You can find these metrics in the log under the `----- EVALUATING BASIC MATCH MODEL -----` heading. 

For this task, precision, recall, and AUC are all either 1 or very close to 1.
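
For readers less familiar with these metrics, the sketch below shows how precision and recall are computed from held-out labels and predictions. It is a generic illustration on made-up values, not Name Match's internal evaluation code.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall for binary match (1) / no-match (0) predictions."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)    # predicted match, truly a match
    fp = np.sum(~y_true & y_pred)   # predicted match, truly not a match
    fn = np.sum(y_true & ~y_pred)   # missed match
    precision = tp / (tp + fp) if (tp + fp) else float('nan')
    recall = tp / (tp + fn) if (tp + fn) else float('nan')
    return precision, recall

# toy held-out pairs (made up for illustration)
labels = [1, 1, 1, 0, 0, 1]
preds = [1, 1, 0, 0, 1, 1]
precision, recall = precision_recall(labels, preds)
print(precision, recall)
```

Low precision points to false links (different people merged), while low recall points to missed links (one person split across clusters).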

#### Calculate matching statistics

How many links were made? What share of people in dataset A appear in dataset B? How many records does each person have across all datasets -- min, max, average? These types of matching statistics can be very helpful for understanding the quality of your match.

For example, if the statistic we calculated above (i.e. what share of people in the potential candidates file linked to a record in the past candidates file?) were a lot higher or lower -- e.g. 99% or 1% -- your intuition and understanding of politics might tell you that there was a problem.

As another sanity check, let's see what the maximum number of records a single person has is. To do this, we'll load the "all names" file -- this is simply the complete list of records that were matched by the algorithm.

In [23]:
all_records = pd.read_csv('tutorial_output/details/all_names_with_clusterid.csv', dtype={'official_candidate_id':'str'})
all_records['n_records'] = all_records.groupby('cluster_id').record_id.transform('nunique')
all_records = all_records.drop(columns=['file_type', 'blockstring', 'dataset', 'drop_from_nm'])

In [24]:
all_records.n_records.max()

5

The largest number of records for a single person is 5. This means that a person has either run in five different races in the last three election years, or has run in four different races and is considering running again in the upcoming election. That seems reasonable, given that it's possible for a candidate to run for multiple seats (e.g. state and federal) in the same year. 
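
Beyond the maximum, the minimum and average number of records per person, as well as the share of people with more than one record, give a fuller picture. A minimal sketch on a made-up table with the same `record_id` and `cluster_id` columns as the "all names" file:

```python
import pandas as pd

# toy data (made up for illustration): three people with 3, 1, and 2 records
toy = pd.DataFrame({
    'record_id': ['a__1', 'b__7', 'b__8', 'a__2', 'a__3', 'b__9'],
    'cluster_id': [1, 1, 1, 2, 3, 3],
})

sizes = toy.groupby('cluster_id').record_id.nunique()
summary = sizes.agg(['min', 'max', 'mean'])
share_linked = (sizes > 1).mean()

print(summary)
print(f"{share_linked:.1%} of people have more than one record")
```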

#### Randomly flip through clusters, especially the bigger clusters

It's always a good idea to spot check some of the clusters -- groups of records assigned to the same person -- to get a micro-level view of some example links. I like to look at a random set of clusters (where at least one link was made) and then look specifically at the larger clusters.

##### Random clusters

In [25]:
random_cluster = all_records[all_records.n_records > 1].cluster_id.sample().iloc[0]
random_cluster

all_records[all_records.cluster_id == random_cluster]

1291

Unnamed: 0,record_id,first_name,last_name,dob,age,middle_name,gender,race,marital_status,official_candidate_id,cluster_id,n_records
8012,potential_candidates__8012,WILSON,DA SILVA,1980-01-22,44,GOMES,M,PRETA,SOLTEIRO(A),,1291,4
10709,past_candidates__107360054,WILSON,DA SILVA,1980-01-22,44,GOMES,M,PARDA,SOLTEIRO(A),90915542153.0,1291,4
16568,past_candidates__148146566,WILSON,DA SILVA,1980-01-21,44,GOMES,M,PARDA,SOLTEIRO(A),90915542153.0,1291,4
16881,past_candidates__110778780,WILSON,DA SILVA,1980-01-22,44,GOMES,M,PRETA,SOLTEIRO(A),90915542153.0,1291,4
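
When spot checking more than one cluster, it can be convenient to sample several at once and to fix the random seed so the same clusters can be revisited later. The helper below is a hypothetical convenience function, not part of Name Match, shown on invented toy data. Unlike the cell above, it samples from the list of distinct cluster ids, so large clusters are no more likely to be drawn than small ones.

```python
import pandas as pd

# Toy stand-in for `all_records`; the values are invented for illustration.
toy_records = pd.DataFrame({
    'record_id': ['a1', 'a2', 'b1', 'c1', 'c2', 'c3', 'd1', 'd2'],
    'cluster_id': [1, 1, 2, 3, 3, 3, 4, 4],
})
toy_records['n_records'] = toy_records.groupby('cluster_id').record_id.transform('nunique')


def sample_linked_clusters(records, n_clusters, seed):
    """Return every row of `n_clusters` randomly chosen clusters that have 2+ records."""
    # One entry per cluster that contains at least one link.
    linked_ids = records.loc[records.n_records > 1, 'cluster_id'].drop_duplicates()
    # Never ask for more clusters than exist.
    chosen = linked_ids.sample(n=min(n_clusters, len(linked_ids)), random_state=seed)
    return records[records.cluster_id.isin(chosen)]


sampled = sample_linked_clusters(toy_records, n_clusters=2, seed=0)
print(sampled)
```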


##### Look at the biggest cluster(s)

In [26]:
cluster_n = all_records.groupby('cluster_id').size()
cluster_n.max()
cluster_n[cluster_n >= 5]

5

cluster_id
172     5
1032    5
1118    5
1159    5
1162    5
1181    5
1400    5
dtype: int64

In [27]:
all_records[all_records.cluster_id == 1162]

Unnamed: 0,record_id,first_name,last_name,dob,age,middle_name,gender,race,marital_status,official_candidate_id,cluster_id,n_records
439,potential_candidates__439,MARCO,SOUZA RIBEIRO DA COSTA,1969-06-25,55,ANTONIO,M,BRANCA,CASADO(A),,1162,5
10658,past_candidates__816141244,MARCO,SOUZA RIBEIRO DA COSTA,1969-06-25,55,ANTONIO,M,BRANCA,CASADO(A),9331269803.0,1162,5
14598,past_candidates__113204699,MARCO,SOUZA RIBEIRO DA COSTA,1969-06-25,55,ANTONIO,M,BRANCA,CASADO(A),9331269803.0,1162,5
21142,past_candidates__920501723,MARCO,SOUZA RIBEIRO DA COSTA,1969-06-25,55,ANTONIO,M,BRANCA,CASADO(A),9331269803.0,1162,5
21703,past_candidates__405227063,MARCO,RIBEIRO DA COSTA,1969-06-25,55,ANTONIO,M,BRANCA,CASADO(A),9331269803.0,1162,5


This 5-record cluster looks good. The algorithm linked a potential candidate to 4 records from past candidacies. One of the records has a slightly different representation of last name -- but the matching other fields indicate that including this record in the cluster was the right decision.

In [28]:
all_records[all_records.cluster_id == 1118]

Unnamed: 0,record_id,first_name,last_name,dob,age,middle_name,gender,race,marital_status,official_candidate_id,cluster_id,n_records
1608,potential_candidates__1608,RAIMUNDO,DE MIRANDA,1966-01-20,58,ANTUNES,M,PARDA,CASADO(A),,1118,5
10914,past_candidates__274628643,RAIMUNDO,DE MIRANDA,1966-01-19,58,ANTUNES,M,BRANCA,CASADO(A),49819020425.0,1118,5
20508,past_candidates__400578545,RAIMUNDO,DE MIRANDA,1966-01-20,58,ANTUNES,M,PARDA,CASADO(A),49819020425.0,1118,5
20515,past_candidates__614256201,RAIMUNDO,DE MIRANDA,1966-01-20,58,ANTUNES,M,PARDA,CASADO(A),49819020425.0,1118,5
21619,past_candidates__387740913,RAIMUNDO,DE MIRANDA,1966-01-20,58,ANTUNES,M,BRANCA,CASADO(A),49819020425.0,1118,5


Despite having a record with a one-day-away DOB and having two different race values, the five records in this cluster appear to all belong to the same person.

In [29]:
all_records[all_records.cluster_id == 1400]

Unnamed: 0,record_id,first_name,last_name,dob,age,middle_name,gender,race,marital_status,official_candidate_id,cluster_id,n_records
620,potential_candidates__620,JOSE,DE SOUSA NETO,1958-03-19,66,JULIO,M,BRANCA,CASADO(A),,1400,5
10055,potential_candidates__10055,JOSE,DE SOUSA DINELLY,1958-03-19,66,MARIA,M,PARDA,CASADO(A),,1400,5
11250,past_candidates__420175251,JOSE,DE SOUSA DINELY,1958-03-19,66,MARIA,M,PARDA,CASADO(A),7248857220.0,1400,5
11904,past_candidates__889948730,JOSE,DE SOUZA DINELY,1958-03-19,66,MARIA,M,BRANCA,CASADO(A),7248857220.0,1400,5
21209,past_candidates__454079248,JOSE,DE SOUSA DINELLY,1958-03-19,66,MARIA,M,BRANCA,CASADO(A),7248857220.0,1400,5


This cluster appears to have one False Positive record. Ideally, record `potential_candidates__620` would have been placed in a different cluster. However, some level of error is expected when using a probabilistic matching tool, and the fact that the first name matches, most of the last name matches, and the DOB matches makes the linking of this record to the others understandable (and forgivable!). One extra indication that this record does not belong in the cluster is the fact that the cluster contains two records from the potential-candidates dataset. We would not expect a cluster to contain two of these records, since that dataset only contains one record per person. See the end of this tutorial for a more advanced technique for prohibiting this type of unexpected behavior.
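
The "two potential-candidate records in one cluster" signal can be checked across all clusters rather than by eye. The sketch below assumes, as in the outputs above, that `record_id` starts with the dataset name followed by two underscores. It runs on a handful of rows copied from clusters 1162 and 1400 rather than on the full `all_records` table.

```python
import pandas as pd

# A few rows taken from the cluster outputs shown above.
toy_records = pd.DataFrame({
    'record_id': [
        'potential_candidates__439', 'past_candidates__816141244', 'past_candidates__113204699',
        'potential_candidates__620', 'potential_candidates__10055', 'past_candidates__420175251',
    ],
    'cluster_id': [1162, 1162, 1162, 1400, 1400, 1400],
})

# True for rows that came from the potential candidates file
# (assumes the dataset name is the prefix of record_id).
is_potential = toy_records.record_id.str.startswith('potential_candidates__')

# Count potential-candidate records per cluster and keep clusters with more than one.
n_potential = is_potential.groupby(toy_records.cluster_id).sum()
suspect_clusters = n_potential[n_potential > 1].index.tolist()
print(suspect_clusters)
```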

#### Verify your expectations

You might have a few expectations about the results, either based on what you know about Name Match or based on the parameter values you set in the config. Checking these expectations can be a nice quick way to gain confidence in your results. If these things don't look as expected, there's likely a problem with your preprocessing or config file.

For example, records that exact match on every field (or even just the most important fields like name and dob) should nearly always get clustered together. This is true for our match!

In [30]:
pct_exact_matches_linked = (all_records.groupby(['first_name', 'last_name', 'dob']).cluster_id.nunique() == 1).mean() 
print(f'{np.round(pct_exact_matches_linked * 100, 2)}% of "exact matches" are linked.')

100.0% of "exact matches" are linked.


All records with the same `official_candidate_id` (the field marked in the config with `compare_type: 'UniqueID'`) should be in the same cluster. True!

In [31]:
records_with_candidate_id = all_records[all_records.official_candidate_id.notnull()].copy()

In [32]:
(records_with_candidate_id.groupby('official_candidate_id').cluster_id.nunique() == 1).all()

True

If you used the default value of `False` for the `allow_clusters_w_multiple_unique_ids` parameter, then each cluster should have at most one unique value of `official_candidate_id`. Again, this is true for our results!

In [33]:
(records_with_candidate_id.groupby('cluster_id').official_candidate_id.nunique() == 1).all()

True

# Evaluating the results (for real)

Normally you don't have the answer key, so you must rely on methods from the previous section to evaluate the quality of the match. However, since this is a toy problem and we DO have the answer key -- let's see how we did.

In [34]:
answer_key = pd.read_csv('raw_data/answer_key.csv')

matched_potential_candidates = pd.merge(
    matched_potential_candidates, 
    answer_key,
    on=[col for col in answer_key.columns if col != 'candidate_id']
)

### Recall -- how many actual matches did we recover?

In [35]:
merged = pd.merge(
    matched_potential_candidates,
    matched_past_candidates,     
    on='candidate_id', 
    suffixes=['__potential', '__past']
)

In [36]:
recall = (merged.cluster_id__potential == merged.cluster_id__past).mean()

print(f"{recall.round(3) * 100}%")

99.6%


##### Look at the false negatives

In [37]:
merged[merged.cluster_id__potential != merged.cluster_id__past].filter(regex='^name|full_name|dob|candidate_id|cluster_id')

Unnamed: 0,name,dob__potential,cluster_id__potential,candidate_id,full_name,dob__past,cluster_id__past
63,REJANE MATOS RIBEIRO,1965-02-21,1901,39899772534,REJANE DA SILVA MATOS,1965-02-21,17558
393,MARCOS FEITOSA SOBRAL,1981-04-15,2697,66780934291,MARCOS FEITOSA DOS REIS,1981-04-15,17504
662,ANA LOURINETE COSTA LOBO MONTANHER,1979-02-19,3411,54336953449,LOURINETE EURYDICE COSTA LOBO MONTANHER,1970-02-19,18322
830,FRANCISCO SOTERO DOS SANTOS,1991-05-24,3880,21684219272,FRANCISCO SOTERO DOS SANTOS,1968-06-10,1665
831,FRANCISCO SOTERO DOS SANTOS,1991-05-24,3880,21684219272,FRANCISCO SOTERO DOS SANTOS,1968-05-10,1665
884,SEBASTIAO LEAL DE MORAES,1968-01-20,3995,35998482204,SEBASTIAO LEAL DE MORAES,1986-05-27,18370
1077,ANGELICA SCHNEIDER,1967-10-14,4490,45092788020,ANGELICA SCHNEIDER KAFER,1967-10-14,18272
1213,FRANCISCO MARIANO VENANCIO,1972-03-28,4829,550049894,FRANCISCO MARIANO VENANCIO,1957-02-19,13800
1562,RICELIO LINHARES DE MARTINS,1986-01-17,5608,11034271725,RICELIO LINHARES,1986-01-17,16698
2442,BRUNA RAFAELA NASCIMENTO,1989-02-23,7731,8847828406,GEYMISSON BRUNO DO NASCIMENTO,1989-02-23,16651


### Precision -- how many of the links we made were correct?

In [38]:
merged = pd.merge(
    matched_potential_candidates,
    matched_past_candidates,     
    on='cluster_id', 
    suffixes=['__potential', '__past']
)

In [39]:
precision = (merged.candidate_id__potential == merged.candidate_id__past).mean()

print(f"{precision.round(3) * 100}%")

99.7%
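
To make the two metrics concrete, here is a minimal worked example on invented data that follows the same merge logic as the cells above: recall is computed over pairs that share a true `candidate_id`, and precision over pairs that share a predicted `cluster_id`. The `potential` and `past` tables are toy stand-ins, not the tutorial data.

```python
import pandas as pd

# Toy data: candidate_id is the ground truth, cluster_id is what the match assigned.
potential = pd.DataFrame({
    'candidate_id': ['A', 'B', 'C', 'D'],
    'cluster_id': [1, 2, 3, 4],
})
past = pd.DataFrame({
    'candidate_id': ['A', 'A', 'B', 'B', 'E'],
    'cluster_id': [1, 1, 5, 5, 3],
})

# Recall: of the pairs that truly belong together, what share ended up in the same cluster?
true_pairs = pd.merge(potential, past, on='candidate_id', suffixes=('__potential', '__past'))
recall = (true_pairs.cluster_id__potential == true_pairs.cluster_id__past).mean()

# Precision: of the pairs the match put together, what share truly belong together?
linked_pairs = pd.merge(potential, past, on='cluster_id', suffixes=('__potential', '__past'))
precision = (linked_pairs.candidate_id__potential == linked_pairs.candidate_id__past).mean()

print(f"recall = {recall:.3f}, precision = {precision:.3f}")
```

Here candidate A's two past records are found (two true pairs recovered), candidate B's two past records are missed (two false negatives), and candidate C is wrongly linked to candidate E's record (one false positive), giving a recall of 2/4 and a precision of 2/3.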


##### Look at the false positives

In [40]:
merged[merged.candidate_id__potential != merged.candidate_id__past].filter(regex='^name|full_name|dob|candidate_id|cluster_id')

Unnamed: 0,name,dob__potential,cluster_id,candidate_id__potential,candidate_id__past,full_name,dob__past
61,JOSE AIRTON BRAGA DA SILVA,1957-09-18,1892,5770467215,199955808,JOSE DONIZETTI FERREIRA DA SILVA,1957-09-18
169,ANTONIO CARLOS RODRIGUES,1961-07-19,2169,33074194391,22609270178,ANTONIO RODRIGUES,1961-06-13
193,LUIZ CARLOS DA SILVA JUNIOR,1973-05-26,2219,74815989400,766567729,LUIZ CARLOS DA SILVA,1970-03-26
236,JOSE JULIO DE SOUSA NETO,1958-03-19,1400,32059035600,7248857220,JOSE MARIA DE SOUSA DINELY,1958-03-19
237,JOSE JULIO DE SOUSA NETO,1958-03-19,1400,32059035600,7248857220,JOSE MARIA DE SOUZA DINELY,1958-03-19
238,JOSE JULIO DE SOUSA NETO,1958-03-19,1400,32059035600,7248857220,JOSE MARIA DE SOUSA DINELLY,1958-03-19
299,MARCOS ANTONIO ALVES,1952-05-23,2468,590409808,25083481049,MARCO ANTONIO ALVES,1952-02-03
573,MARIA DA CONCEICAO DE JESUS SOUSA,1966-12-08,438,39773116549,3649700654,MARIA DA CONCEICAO OLIVEIRA,1964-12-08
574,MARIA DA CONCEICAO DE JESUS SOUSA,1966-12-08,438,39773116549,3649700654,MARIA DA CONCEICAO OLIVEIRA,1964-12-08
591,LUIZ CARLOS DE OLIVEIRA,1964-04-17,3181,61289248087,9128522400,LUIZ CARLOS DE OLIVEIRA ARAUJO,1954-04-16


### Accuracy of answer to primary research question

#### According to the algorithm...

In [41]:
share_run_before__algorithm = matched_potential_candidates.cluster_id.isin(matched_past_candidates.cluster_id).mean()

print(f"{(share_run_before__algorithm * 100).round(1)}% of potential candidates have run for office before")

32.4% of potential candidates have run for office before


#### In reality, according to ground truth...

In [42]:
share_run_before__ground_truth = matched_potential_candidates.candidate_id.isin(matched_past_candidates.candidate_id).mean()

print(f"{(share_run_before__ground_truth * 100).round(1)}% of potential candidates have run for office before")

32.5% of potential candidates have run for office before


# Creating custom constraints (optional/advanced)

Sometimes when reviewing the results of a Name Match run, you'll notice a specific type of error that occurs somewhat regularly. For example, maybe your dataset has a lot of father/son pairs with the same name (e.g. John Smith Jr and John Smith Sr) and the records for these two different people are mistakenly getting linked together. You may wish to impose a rule that Name Match is not allowed to link two records with different non-missing suffixes (JR, SR, III, etc.).

Or maybe before you even run Name Match for the first time, you have domain knowledge that tells you certain types of links should never happen. For example, if you know that a person should never have more than 3 records across the datasets being linked, wouldn't it be great if you could tell Name Match to never form clusters with more than 3 records?

This is where the optional user-defined constraints come in!

We have already seen a perfect example in this tutorial of an incorrect link that could have been prevented by encoding user knowledge into a hard matching constraint. Recall that in Cell 29 (cluster 1400), we noticed a cluster with multiple records from the potential candidates dataset. We realized that since the potential candidates dataset has just one record per person, two records from this dataset should never link together -- any such links would be known errors.

How many times does this type of issue occur?

In [43]:
matched_potential_candidates.groupby('cluster_id').size().value_counts()

1    10568
2       10
dtype: int64

There are 10 clusters that contain 2 different potential candidate records. Therefore, we know for sure that these clusters have an error.

Let's see how we could have used our knowledge of the underlying data to prohibit these erroneous links from the beginning. 

### Constraint functions

Name Match can take as input two constraint functions, which by default return True: `is_valid_link` and `is_valid_cluster`.

The logic in the `is_valid_link` function answers the question "would a link between these two records be valid?" The logic in the `is_valid_cluster` function answers the question "if a link was made that generated this cluster, would this cluster be valid?"

For every link the algorithm considers making, these two functions are run. If either were to return False, the link would not be made.

### Our use case

As the user, you can alter these functions however you want. Let's see how we would do so to impose the following constraint.

**Constraint:** Records from the `potential_candidates` dataset cannot link to other records from the `potential_candidates` dataset.

In [44]:
def is_valid_link(predicted_links_df):

    # To start, all potential links are considered valid
    predicted_links_df['valid'] = True

    # If both records come from the dataset `potential_candidates`, 
    # the link is invalid
    predicted_links_df.loc[
        (predicted_links_df.dataset_1 == 'potential_candidates') & (predicted_links_df.dataset_2 == 'potential_candidates'), 
        'valid'] = False
    
    return predicted_links_df['valid']

In [45]:
def is_valid_cluster(cluster):
    
    # If more than one record in the cluster that results from a link are from 
    # the `potential_candidates` dataset, the link is invalid.
    if (cluster['dataset'] == 'potential_candidates').sum() > 1:
        return False
    
    return True

Notice how we're accessing a field called "dataset" -- this is a field that is created automatically by Name Match as the input datasets are loaded. Its values are the reference names assigned to each input data file in the config. The other fields that could be accessed in the constraint functions are the variables specifically defined in the config (e.g. dob, gender).
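
It may not be obvious why both functions are needed for this constraint. The toy simulation below is only an illustration of the idea and is not Name Match's actual clustering code: `toy_cluster`, the `record_id_1`/`record_id_2` column names, and all records are invented. It shows that a link-level rule alone still lets two potential-candidate records end up together when each links to the same past-candidate record, while the cluster-level rule blocks that.

```python
import pandas as pd


def is_valid_link(predicted_links_df):
    # Same rule as above: no link between two potential_candidates records.
    both_potential = (
        (predicted_links_df.dataset_1 == 'potential_candidates')
        & (predicted_links_df.dataset_2 == 'potential_candidates')
    )
    return ~both_potential


def is_valid_cluster(cluster):
    # Same rule as above: at most one potential_candidates record per cluster.
    return (cluster['dataset'] == 'potential_candidates').sum() <= 1


# Invented records: two potential candidates and one past candidate.
records = pd.DataFrame({
    'record_id': ['pot_1', 'pot_2', 'past_1'],
    'dataset': ['potential_candidates', 'potential_candidates', 'past_candidates'],
}).set_index('record_id')

# Invented candidate links, in the order the toy loop considers them.
links = pd.DataFrame({
    'record_id_1': ['pot_1', 'pot_1', 'pot_2'],
    'record_id_2': ['pot_2', 'past_1', 'past_1'],
})
links['dataset_1'] = links.record_id_1.map(records.dataset)
links['dataset_2'] = links.record_id_2.map(records.dataset)


def toy_cluster(records, links, use_cluster_check):
    """Greedily merge clusters along valid links (illustration only)."""
    cluster_of = {rid: {rid} for rid in records.index}  # every record starts alone
    valid_links = links[is_valid_link(links)]
    for a, b in zip(valid_links.record_id_1, valid_links.record_id_2):
        merged = cluster_of[a] | cluster_of[b]
        # Skip the link if the cluster it would create is not allowed.
        if use_cluster_check and not is_valid_cluster(records.loc[sorted(merged)]):
            continue
        for rid in merged:
            cluster_of[rid] = merged
    return sorted(sorted(c) for c in {frozenset(c) for c in cluster_of.values()})


print(toy_cluster(records, links, use_cluster_check=False))
print(toy_cluster(records, links, use_cluster_check=True))
```

Without the cluster check, all three records collapse into one cluster even though the direct `pot_1`/`pot_2` link was rejected. With it, `pot_2` stays in its own cluster.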

To impose these constraints, this is what the call to Name Match would look like: 

In [None]:
from namematch.cluster import Constraints

constraints = Constraints()
constraints.is_valid_link = is_valid_link
constraints.is_valid_cluster = is_valid_cluster

nm = NameMatcher(
    config=config_dict,
    output_dir='tutorial_output/',
    constraints=constraints
)
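
The two other rules mentioned at the start of this section could be sketched in the same way. This is a hedged sketch under stated assumptions: it assumes a `suffix` variable is defined in the config and that it appears in the links table as `suffix_1` and `suffix_2` (by analogy with `dataset_1` and `dataset_2`), and that missing suffixes are represented as null values. Neither assumption is confirmed by this tutorial, and `max_records` is a hypothetical parameter added for illustration.

```python
import pandas as pd


def is_valid_link(predicted_links_df):
    # ASSUMPTION: the links table has `suffix_1` / `suffix_2` columns
    # and missing suffixes are null.
    suffix_1 = predicted_links_df['suffix_1']
    suffix_2 = predicted_links_df['suffix_2']
    both_present = suffix_1.notnull() & suffix_2.notnull()
    # A link is invalid only when both suffixes are present and they differ.
    return ~(both_present & (suffix_1 != suffix_2))


def is_valid_cluster(cluster, max_records=3):
    # A person should never have more than `max_records` records.
    return len(cluster) <= max_records


# Invented links to exercise the suffix rule.
toy_links = pd.DataFrame({
    'suffix_1': ['JR', 'JR', None],
    'suffix_2': ['SR', 'JR', 'SR'],
})
print(is_valid_link(toy_links).tolist())

# Invented clusters to exercise the size rule.
print(is_valid_cluster(pd.DataFrame({'record_id': ['r1', 'r2', 'r3']})))
print(is_valid_cluster(pd.DataFrame({'record_id': ['r1', 'r2', 'r3', 'r4']})))
```

Only the JR/SR pair is rejected by the link rule. A pair where one suffix is missing is left for the model to decide.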