# Korean Maritime Safety Tribunal Vessel Name Matching

This notebook uses fuzzy string matching to identify Korean vessels that appear in two datasets. It:

1. Loads the extracted vessel names and the Korean vessels dataset from CSV files
2. Defines helper functions to:
   - Remove digits from vessel names
   - Score name similarity with `rapidfuzz`'s `fuzz.token_set_ratio`
3. Matches in two stages:
   - First compares the lowercased names with digits removed
   - Then compares the full lowercased names, but only if the digit-free comparison passes
4. Exposes a separate threshold for each stage (defaults: 80 for the digit-free comparison, 70 for the full name)

The goal is to match vessel names reliably despite minor differences in formatting and numbering.


In [1]:
import pandas as pd
from rapidfuzz import fuzz
import re

In [2]:
vessel_names_df = pd.read_csv('../data/extracted_vessel_names.csv')
korean_vessels_df = pd.read_csv('../data/Korean-vessels-for-infraction-scraping-02-18-25.csv')

In [3]:
def remove_digits(text):
    """Remove digits from a string."""
    return re.sub(r'\d+', '', text)
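
To make the effect of digit removal concrete, here is a minimal sketch run on a made-up name. `remove_digits_tidy` is a hypothetical variant that is not part of this notebook; it only collapses the whitespace left behind. Note that punctuation such as `No.` survives either way, so the result is "digit-free" rather than strictly alphabetic.

```python
import re

def remove_digits(text):
    """Remove digits from a string (same regex as the cell above)."""
    return re.sub(r'\d+', '', text)

def remove_digits_tidy(text):
    """Hypothetical variant: drop digits, then collapse leftover whitespace."""
    return re.sub(r'\s+', ' ', remove_digits(text)).strip()

sample = "No. 101 Haneul"  # made-up vessel name, for illustration only
print(repr(remove_digits(sample)))       # 'No.  Haneul' (double space remains)
print(repr(remove_digits_tidy(sample)))  # 'No. Haneul'
```

Because token-based scorers split their input on whitespace, the extra spaces should not change the match score; the tidy variant mainly helps when the stripped names are displayed or compared exactly.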

In [4]:
def fuzzy_name_match(name_a, name_b, alpha_threshold=80, full_threshold=70):
    """
    Return True if the two names, lowercased and with digits removed, score at least
    alpha_threshold, and the full lowercased names score at least full_threshold.
    Scores come from fuzz.token_set_ratio (0-100). Adjust thresholds as needed.
    """
    # Convert to lowercase for consistency
    a_lower = name_a.lower()
    b_lower = name_b.lower()
    
    # Compare alphabetic-only parts
    a_alpha = remove_digits(a_lower)
    b_alpha = remove_digits(b_lower)
    alpha_ratio = fuzz.token_set_ratio(a_alpha, b_alpha)
    
    if alpha_ratio < alpha_threshold:
        # Alphabetic parts aren't similar enough, skip
        return False
    
    # If the alpha parts match strongly, then check full string
    full_ratio = fuzz.token_set_ratio(a_lower, b_lower)
    return full_ratio >= full_threshold
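
When tuning the two thresholds, it can help to look at the raw scores rather than only the final True/False. The sketch below is one way to do that. `match_scores` is a hypothetical helper, and `stand_in_ratio` is a `difflib`-based stand-in so the example runs with the standard library alone; it is not rapidfuzz's algorithm and will produce different numbers. In the notebook, `fuzz.token_set_ratio` would be passed as `scorer` instead. The vessel names are made up.

```python
import re
from difflib import SequenceMatcher

def stand_in_ratio(a, b):
    """Stand-in scorer on a 0-100 scale (difflib, NOT rapidfuzz)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def match_scores(name_a, name_b, scorer=stand_in_ratio):
    """Return (alpha_score, full_score) for two names, case-insensitively."""
    a_lower, b_lower = name_a.lower(), name_b.lower()
    # Stage 1 input: names with digits removed
    a_alpha = re.sub(r'\d+', '', a_lower)
    b_alpha = re.sub(r'\d+', '', b_lower)
    # Stage 2 input: the full lowercased names
    return scorer(a_alpha, b_alpha), scorer(a_lower, b_lower)

# Made-up names that differ only in their number
alpha_score, full_score = match_scores("Haneul 101", "HANEUL 103")
print(f"alpha={alpha_score:.1f} full={full_score:.1f}")
```

With this stand-in scorer, names that differ only in their trailing number still score high on the full-string comparison, so a full threshold of 70 may not separate differently numbered vessels; checking the actual `token_set_ratio` scores on real pairs before settling on thresholds seems worthwhile.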

In [5]:
def check_vessel_name(vessel_name):
    """
    Return True if any name in the extracted list (the global `vessel_names`,
    defined in a later cell) fuzzy-matches `vessel_name` on both the
    digit-free and full-string thresholds.
    """
    for extracted_name in vessel_names:
        if fuzzy_name_match(extracted_name, vessel_name):
            return True
    return False
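
A possible variation, sketched under assumptions: pass the candidate list in explicitly instead of reading a global, skip non-string entries such as the NaN values pandas produces for empty cells, and return the matched name so each match can be audited. `find_first_match`, `alpha_key`, and `stand_in_match` are hypothetical names that are not part of this notebook, and the stand-in matcher is a simple exact comparison on digit-free names used only to keep the example free of third-party imports. The names are made up.

```python
import re

def alpha_key(name):
    """Lowercase, remove digits, collapse whitespace."""
    return re.sub(r'\s+', ' ', re.sub(r'\d+', '', name.lower())).strip()

def stand_in_match(name_a, name_b):
    """Stand-in matcher: exact equality of digit-free keys (not fuzzy)."""
    return alpha_key(name_a) == alpha_key(name_b)

def find_first_match(vessel_name, candidates, matcher=stand_in_match):
    """Return the first candidate that matches vessel_name, else None."""
    if not isinstance(vessel_name, str):
        return None  # e.g. NaN from an empty cell
    for candidate in candidates:
        if isinstance(candidate, str) and matcher(candidate, vessel_name):
            return candidate
    return None

candidates = ["Haneul 101", float("nan"), "Bada Star"]  # made-up names
print(find_first_match("HANEUL 103", candidates))  # Haneul 101
print(find_first_match("Dolphin 7", candidates))   # None
```

In the notebook, `fuzzy_name_match` could be passed as `matcher` and `vessel_names` as `candidates`, for example through `Series.apply(find_first_match, candidates=vessel_names, matcher=fuzzy_name_match)`, which would store the matched name in place of a boolean flag.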

In [7]:
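# iloc[0] takes the first data row of the CSV, so this assumes all extracted
# names are stored in that single row.
vessel_names = vessel_names_df.iloc[0].tolist()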

korean_vessels_df['Match_Found'] = korean_vessels_df['Vessel Name'].apply(check_vessel_name)
matched_vessels = korean_vessels_df[korean_vessels_df['Match_Found']]

In [8]:
matched_vessels

Unnamed: 0,MMSI,Vessel Name,IMO Number,Flag,Call Sign,length_m,Vessel Type / Fishing Method,registry code,Port of Registry,Owner Name,...,North Pacific Fisheries Commission IUU Listing Reason,Southern Indian Ocean Fisheries Agreement IUU Listing Date,Southern Indian Ocean Fisheries Agreement IUU Listing Reason,IUU Combined List Vessel Link,Friend of the Sea Registration Number,Company authorized to sell the Friend of the Sea certified products,FAO fishing area per Friend of the Sea,Targeted species per Friend of the Sea,Friend of the Sea Certificate Status,Match_Found
