In [10]:
import pandas as pd
import re

Load the surnames obtained from the [2010 US Census](https://www.census.gov/data/developers/data-sets/surnames/2010.html) into a dataframe.

In [11]:
# Skip pandas' default NA strings; otherwise the surname "Null" is read as NaN. Only empty fields count as missing.
df = pd.read_csv("us_surnames.csv", keep_default_na=False, na_values="")
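
To see why the default NA handling matters, here is a minimal sketch on a hypothetical three-row CSV held in memory. The data and the names `csv_text`, `default`, and `strict` are illustrative assumptions, not part of the notebook. It uses the uppercase form `NULL`, which is one of the strings pandas treats as missing by default.

```python
import io
import pandas as pd

# Hypothetical stand-in for the surname file (assumption: one NAME column).
csv_text = "NAME\nSMITH\nNULL\nJONES\n"

# Default parsing: "NULL" is on pandas' built-in list of NA strings.
default = pd.read_csv(io.StringIO(csv_text))

# Same options as the notebook cell: only empty fields become NaN.
strict = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values="")

print(default["NAME"].isna().tolist())  # which rows were read as missing
print(strict["NAME"].tolist())          # surnames kept as plain strings
```

With the default settings the middle row is flagged as missing, while the notebook's options keep it as an ordinary surname.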

## Soundex Implementation

In [12]:
digits = {
    "B": "1", # Labial obstruents
    "F": "1",
    "P": "1",
    "V": "1",
    "C": "2", # Coronals and dorsals
    "G": "2",
    "J": "2",
    "K": "2",
    "Q": "2",
    "S": "2",
    "X": "2",
    "Z": "2",
    "D": "3", # Alveolar Stops
    "T": "3",
    "L": "4", # Lateral approximant
    "M": "5", # Nasals
    "N": "5",
    "R": "6"  # Rhotic
} # Vowels (including Y), H, and W have no digit

def soundex(name):
    """
    Convert a name into its four-character Soundex code
    """
    vowels = r'[AEIOUY]'

    name = name.upper() # Although all surnames in the data set are uppercase, don't assume so
    code = name[0] # The code always starts with the first letter

    last_idx = 0 # Index of the most recently coded character
    for i in range(1, len(name)):
        cur = name[i]

        if cur in digits:
            # Digit of the previously coded character (or the first letter itself if it has no digit)
            last = digits.get(name[last_idx], name[last_idx])
            digit = digits[cur]

            # Do not add a duplicate digit unless there's a vowel between the current character and the previously coded one
            if digit != last or re.search(vowels, name[last_idx: i + 1]):
                code += digit
                last_idx = i

    # Pad with zeros or truncate so the code is exactly 4 characters long
    code = code[:4].ljust(4, "0")

    return code
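
As a sanity check on the rules encoded above, the following is a compact, hypothetical restatement (`soundex_sketch`, `CODES`, and `VOWELS` are names introduced here, not the notebook's function). It assumes the same conventions: a vowel or Y between two letters with the same digit lets the digit repeat, while H and W do not. The sample names are commonly cited Soundex examples.

```python
# Hypothetical restatement of the coding rules; not the notebook's soundex().
CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for letter in letters:
        CODES[letter] = digit

VOWELS = set("AEIOUY")

def soundex_sketch(name):
    name = name.upper()
    code = name[0]
    prev = CODES.get(name[0])  # digit of the last coded letter, if it has one
    for ch in name[1:]:
        if ch in VOWELS:
            prev = None  # a vowel allows the same digit to be added again
        elif ch in CODES:
            if CODES[ch] != prev:
                code += CODES[ch]
            prev = CODES[ch]
        # H, W, and any other character leave prev unchanged
    return code[:4].ljust(4, "0")

for sample in ["Robert", "Rupert", "Ashcraft", "Tymczak", "Pfister", "Honeyman"]:
    print(sample, soundex_sketch(sample))
```

Robert and Rupert should come out with the same code, and Ashcraft shows the H rule: S and C share a digit and are separated only by H, so the digit is added once.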

## Exploring surname codes

First, add a column to the dataframe with each surname's Soundex code:

In [13]:
df["SOUNDEX"] = df["NAME"].apply(soundex)
df

Unnamed: 0,NAME,SOUNDEX
0,SMITH,S530
1,JOHNSON,J525
2,WILLIAMS,W452
3,BROWN,B650
4,JONES,J520
...,...,...
162248,DOBBEN,D150
162249,DIETZMANN,D325
162250,DOKAS,D220
162251,DONLEA,D540


Let's look at all the surnames with the code S530:

In [14]:
code_S530 = df.loc[df["SOUNDEX"] == "S530", "NAME"].tolist()
code_S530

['SMITH',
 'SCHMIDT',
 'SCHMITT',
 'SNEED',
 'SCHMID',
 'SNEAD',
 'SMYTH',
 'SMOOT',
 'SANTOYO',
 'SHUMATE',
 'SANDHU',
 'SANDY',
 'SCHMIT',
 'SAND',
 'SANTO',
 'SMYTHE',
 'SUNDAY',
 'SMIT',
 'SWINT',
 'SINNOTT',
 'SANTA',
 'SMITHEY',
 'SNODDY',
 'SANTEE',
 'SAINT',
 'SHAND',
 'SMIDT',
 'SANDE',
 'SCHWINDT',
 'SCHWANDT',
 'SUMMITT',
 'SANTI',
 'SUND',
 'SENNETT',
 'SANT',
 'SMEAD',
 'SUNDE',
 'SCHWIND',
 'SINNETT',
 'SAMUDIO',
 'SMIDDY',
 'SAMAD',
 'SANDO',
 'SHIMADA',
 'SANDT',
 'SAMET',
 'SUMIDA',
 'SENAT',
 'SHENOUDA',
 'SMET',
 'SHINDE',
 'SINDT',
 'SNEATH',
 'SANDOW',
 'SMID',
 'SANTY',
 'SEMIDEY',
 'SUMMIT',
 'SAWANT',
 'SCINTO',
 'SANTOYA',
 'SEMEDO',
 'SMITHEE',
 'SONODA',
 'SANDA',
 'SCHMIED',
 'SCHMUDE',
 'SNIDE',
 'SONTAY',
 'SANDI',
 'SAMMUT',
 'SANDAU',
 'SAINATO',
 'SCHWENT',
 'SINEATH',
 'SHIMODA',
 'SUNDT',
 'SCINTA',
 'SANTOY',
 'SWANDA',
 'SAMADI',
 'SAMEDI',
 'SHANDY',
 'SCHMITTOU',
 'SANTAY',
 'SENNOTT',
 'SYNNOTT',
 'SZANTO',
 'SCHWEND',
 'SMUDA',
 'SUNADA',
 'SAMM

Eyeballing the results shows that unrelated names can end up under the same code: SMITH, SNEED, SHIMODA, SHOEMATE, and SANTO all map to S530. Plausible variants are grouped together as well, such as SCHMIDT, SCHMITT, and SMYTH for SMITH, and SNEAD for SNEED. Ideally, only phonetically similar names would share a code.
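
One way to move beyond eyeballing is to count how many surnames fall under each code. The sketch below assumes a hypothetical five-row frame named `sample`, with codes copied from the outputs above; on the full data the same calls would run against `df`.

```python
import pandas as pd

# Hypothetical mini-frame; names and codes are taken from the outputs above.
sample = pd.DataFrame({
    "NAME": ["SMITH", "SCHMIDT", "SNEED", "JONES", "BROWN"],
    "SOUNDEX": ["S530", "S530", "S530", "J520", "B650"],
})

# Number of surnames per code, largest groups first
group_sizes = sample["SOUNDEX"].value_counts()
print(group_sizes)

# Names collected per code, for inspecting a single group
names_by_code = sample.groupby("SOUNDEX")["NAME"].apply(list)
print(names_by_code["S530"])
```

Large groups are the ones most likely to mix unrelated names, so sorting by group size is a reasonable starting point for inspection.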