# KMST Vessel Name Parser

This notebook parses vessel names from Korean Maritime Safety Tribunal (KMST) case titles. It includes:

- Functions to extract vessel names from case titles
- Handling of various vessel name patterns:
  - Vessels with numbers (e.g. "Gwangjeong 8", "No. 203 Wonchangho") 
  - Multiple vessels in "between X and Y" patterns
  - Vessels with type prefixes (e.g. "fishing boat", "ore carrier")
  - Korean vessel names ending in "ho"
- Pandas DataFrame processing of KMST case data


In [2]:
import ast
import re

import pandas as pd

In [4]:
def extract_vessel_names(case_name):
    """
    Extracts vessel names from case titles, handling:
    - Vessels with numbers (e.g., "Gwangjeong 8", "No. 203 Wonchangho")
    - Multiple vessels in "between X and Y" or "X and Y" patterns
    - Vessels with type prefixes (e.g., "fishing boat", "ore carrier")
    - Korean vessel names ending in "ho"
    
    Returns a list of vessel names with their associated numbers.
    """
    # List of vessel type indicators
    vessel_types = [
        'fishing vessel', 'fishing boat', 'tugboat', 'motorboat', 'cargo ship',
        'oil tanker', 'bulk carrier', 'chemical carrier', 'passenger ship', 'ferry',
        'ore carrier', 'vessel', 'boat', 'carrier', 'ship'
    ]
    vessel_types_pattern = '|'.join([re.escape(vt) for vt in vessel_types])
    
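    # Words that should not be part of vessel names.
    # 'guitar' is intentional: it appears in the translated titles
    # (e.g. "Stella Queen guitar (upper deck damage) incident"), likely as a
    # mistranslation of the Korean word for "other" (기타).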
    stopwords = {
        'incident', 'collision', 'crew', 'injury', 'grounding', 'fire',
        'damage', 'death', 'disappearance', 'of', 'on', 'with', 'the',
        'guitar', 'marine', 'pollution', 'contact', 'sinking'
    }
    
    vessels = []
    
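    # Pattern for "between X and Y" or "X and Y" with optional vessel types.
    # Note: re.IGNORECASE applies to the whole pattern, so [A-Z] also matches
    # lowercase letters and a name is not required to start with a capital.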
    between_pattern = re.compile(
        rf'(?:between\s+)?(?:{vessel_types_pattern}\s+)?([A-Z][A-Za-z\-]+(?:\s+\d+)?)\s+and\s+(?:{vessel_types_pattern}\s+)?([A-Z][A-Za-z\-]+(?:\s+\d+)?)',
        re.IGNORECASE
    )
    
    # Pattern for vessel with type prefix
    type_prefix_pattern = re.compile(
        rf'(?:{vessel_types_pattern})\s+(?:No\.\s*)?(\d+\s+)?([A-Z][A-Za-z\-]+(?:\s+\d+)?)',
        re.IGNORECASE
    )
    
    # Pattern for Korean vessels ending in "ho"
    ho_pattern = re.compile(r'(?:^|\s)((?:\d+\s+)?[A-Z][A-Za-z]+[Hh]o(?:\s+\d+)?)')
    
    # First check for vessels in "between/and" constructions
    for match in between_pattern.finditer(case_name):
        vessels.extend([g.strip() for g in match.groups() if g])
    
    # Then look for vessels with type prefixes
    if not vessels:  # Only if we haven't found vessels yet
        for match in type_prefix_pattern.finditer(case_name):
            number, name = match.groups()
            if number and name:
                vessels.append(f"{number.strip()} {name.strip()}")
            elif name:
                vessels.append(name.strip())
    
    # Finally look for Korean vessels ending in "ho"
    if not vessels:  # Only if we haven't found vessels yet
        for match in ho_pattern.finditer(case_name):
            vessels.append(match.group(1).strip())
    
    # Clean up vessel names
    clean_vessels = []
    for vessel in vessels:
        # Remove any vessel type words that might have been included
        for vt in vessel_types:
            # Escape the type so it is always matched as literal text
            vessel = re.sub(rf'\b{re.escape(vt)}\b', '', vessel, flags=re.IGNORECASE)
        
        # Truncate the name at the first stopword (that word and everything after it are dropped)
        parts = vessel.split()
        cleaned_parts = []
        for part in parts:
            if part.lower() in stopwords:
                break
            cleaned_parts.append(part)
        
        vessel = ' '.join(cleaned_parts).strip()
        if vessel and vessel not in clean_vessels:
            clean_vessels.append(vessel)
    
    return clean_vessels
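
The patterns above compile with `re.IGNORECASE`, which also relaxes the `[A-Z]` check on the first letter of a name. One possible alternative, sketched below, scopes case-insensitivity to the vessel-type words with an inline `(?i:...)` group (Python 3.6+). The shortened `VESSEL_TYPES` list and the names `loose` and `strict` are assumptions made for illustration, not part of the notebook's function.

```python
import re

# Hypothetical, shortened type list for illustration only.
VESSEL_TYPES = ["fishing boat", "cargo ship", "ore carrier"]
TYPES = "|".join(re.escape(t) for t in VESSEL_TYPES)

# Global flag: [A-Z] matches lowercase letters too.
loose = re.compile(rf"(?:{TYPES})\s+([A-Z][A-Za-z\-]+)", re.IGNORECASE)

# Scoped flag: only the vessel-type alternation ignores case,
# so the name must really start with a capital letter.
strict = re.compile(rf"(?i:{TYPES})\s+([A-Z][A-Za-z\-]+)")

title = "Fishing boat collision incident"
loose_hit = loose.search(title)
strict_hit = strict.search(title)

print(loose_hit.group(1))  # collision
print(strict_hit)          # None

named_hit = strict.search("Fishing boat Palpalho")
print(named_hit.group(1))  # Palpalho
```

Whether stricter matching is desirable depends on the data: translated titles with inconsistent capitalization would lose matches under the scoped variant.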

In [5]:
# Read the case data from a single CSV file
df = pd.read_csv('../data/all_decisions_google_api.csv')

In [6]:
# Apply the extraction to each case name
df['vessel_names'] = df['case_name'].apply(extract_vessel_names)


In [7]:
# Display first 20 results
print("First 20 cases with extracted vessel names:")
for case, vessels in zip(df['case_name'][:20], df['vessel_names'][:20]):
    print(f"\nCase: {case}")
    print(f"Vessels: {vessels}")

First 20 cases with extracted vessel names:

Case: Fishing vessel Myungyoonho Fishing vessel Daeyangho collision incident
Vessels: ['Myungyoonho', 'Daeyangho']

Case: The grounding incident of the towed vessel Geumoh 7 by the tugboat Woogukti 5
Vessels: ['Geumoh 7', 'Woogukti 5']

Case: Fishing vessel No. 26 Namseongho grounding incident
Vessels: ['26 Namseongho']

Case: Collision incident between fishing boats Gwangjeong 8 and Gwangjeong 88
Vessels: ['Gwangjeong 8', 'Gwangjeong 88']

Case: Fishing vessel Yeonheungho 2007 collision with refrigerated transport vessel Sing Yue
Vessels: ['Yeonheungho 2007', 'Sing']

Case: Fishing vessel No. 101 Tongyeongho crew casualty incident
Vessels: ['101 Tongyeongho']

Case: Fishing boat Yun Seong-ho and cargo ship JC Ruby collision incident
Vessels: ['Seong-ho', 'cargo']

Case: Iron ore carrier Stella Queen guitar (upper deck damage) incident
Vessels: ['Stella']

Case: Fishing boat 107th window fire incident
Vessels: []

Case: Fishing boat Palpalho

In [11]:
df.head()

Unnamed: 0,case_name,url,vessel_names
0,Fishing vessel Myungyoonho Fishing vessel Daey...,https://www.kmst.go.kr/web/atch/atchFileDownlo...,"[Myungyoonho, Daeyangho]"
1,The grounding incident of the towed vessel Geu...,https://www.kmst.go.kr/web/atch/atchFileDownlo...,"[Geumoh 7, Woogukti 5]"
2,Fishing vessel No. 26 Namseongho grounding inc...,https://www.kmst.go.kr/web/atch/atchFileDownlo...,[26 Namseongho]
3,Collision incident between fishing boats Gwang...,https://www.kmst.go.kr/web/atch/atchFileDownlo...,"[Gwangjeong 8, Gwangjeong 88]"
4,Fishing vessel Yeonheungho 2007 collision with...,https://www.kmst.go.kr/web/atch/atchFileDownlo...,"[Yeonheungho 2007, Sing]"
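
Some titles yield an empty list, so it can help to pull those rows out for manual review. The sketch below assumes a small stand-in frame named `sample`, built from two titles shown in the output above. The `n_vessels` column and the `needs_review` name are hypothetical.

```python
import pandas as pd

# Titles and extracted lists copied from the printed output above.
sample = pd.DataFrame({
    "case_name": [
        "Fishing vessel No. 26 Namseongho grounding incident",
        "Fishing boat 107th window fire incident",
    ],
    "vessel_names": [["26 Namseongho"], []],
})

# Count extracted names per row, then keep rows where nothing was found.
sample["n_vessels"] = sample["vessel_names"].map(len)
needs_review = sample[sample["n_vessels"] == 0]

print(needs_review["case_name"].tolist())
# ['Fishing boat 107th window fire incident']
```

The same two lines should work on `df` once `vessel_names` holds real Python lists.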


In [13]:
# Parse any stringified lists back into Python lists; values that are already lists pass through unchanged
df['vessel_names'] = df['vessel_names'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
# quoting=1 is csv.QUOTE_ALL: every field in the output CSV is quoted
df.to_csv('../data/extracted_vessel_names.csv', index=False, quoting=1)
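
List values are written to CSV as their string representation, so reading the file back gives strings rather than lists. A minimal round-trip sketch follows, using an in-memory buffer in place of the real file. The frame `df_out` and the names `raw` and `parsed` are assumptions for illustration.

```python
import ast
import io

import pandas as pd

df_out = pd.DataFrame({
    "case_name": ["Collision incident between fishing boats Gwangjeong 8 and Gwangjeong 88"],
    "vessel_names": [["Gwangjeong 8", "Gwangjeong 88"]],
})

# Write with every field quoted, as in the cell above.
buffer = io.StringIO()
df_out.to_csv(buffer, index=False, quoting=1)

# A plain read returns the list column as text.
buffer.seek(0)
raw = pd.read_csv(buffer)
print(type(raw.loc[0, "vessel_names"]))  # <class 'str'>

# A converter parses each cell back into a Python list while reading.
buffer.seek(0)
parsed = pd.read_csv(buffer, converters={"vessel_names": ast.literal_eval})
print(parsed.loc[0, "vessel_names"])  # ['Gwangjeong 8', 'Gwangjeong 88']
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing.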
