# Accident Analysis (Step-by-Step)

In this notebook, we extract structured accident details from the free-text summary stored in each record's `Résumé` column.
We will:

1. **Import Libraries**  
2. **Define Regex Patterns**  
3. **Create an Extraction Function**  
4. **Load and Clean the Data**  
5. **Apply the Function**  
6. **Inspect and Save the Results**


## 1) Import Libraries

Let's bring in everything we need:
- **pandas** for data manipulation.
- **re** for regular expressions.
- **typing** for type hints.
- **numpy** (optional; none of the cells below call it directly).

In [None]:
import pandas as pd
import re
from typing import List, Dict, Union
import numpy as np
print("Libraries imported.")

## 2) Define Regex Patterns

In this cell, we create dictionaries holding patterns for:
- **Accident characteristics** (like severity and location)
- **Vehicle types**
- **Parties involved**

In [None]:
# Accident characteristics we want to extract
combined_patterns = {
    "severity": r"Accident (Léger non mortel|Grave non mortel|Mortel)",
    "location_context": r"(En agglomération|Hors agglomération)",
    "road_configuration": r"(En Y|En T|En X|Hors intersection|Place|A plus de 4 branches)",
    "lighting": r"(Plein jour|Crépuscule ou aube|Nuit avec éclairage public allumé|Nuit sans éclairage public)",
    "weather": r"météo (Normale|Pluie légère|Temps couvert|Pluie forte)",
    "surface": r"surface chaussée : (Normale|Mouillée|Non renseigné|Autre|Corps gras - huile|Enneigée|Flaques)"
}

# Vehicle types: keys are plain labels, values are regex patterns
# (regex metacharacters such as parentheses and "+" are escaped in the values only).
vehicle_patterns = {
    "Cyclomoteur <=50 cm3": r"Cyclomoteur <=50 cm3",
    "Véhicule de tourisme (VT)": r"Véhicule de tourisme \(VT\)",
    "Moto ou sidecar > 50 <= 125 cm3": r"Moto ou sidecar > 50 <= 125 cm3",
    "Moto ou sidecar > 125 cm3": r"Moto ou sidecar > 125 cm3",
    "Scooter <= 50 cm3": r"Scooter <= 50 cm3",
    "Scooter > 125 cm3": r"Scooter > 125 cm3",
    "Scooter > 50 <= 125 cm3": r"Scooter > 50 <= 125 cm3",
    "VU seul 1,5T < PTAC <=3,5T": r"VU seul 1,5T < PTAC <=3,5T",
    "PL > 3,5T + remorque": r"PL > 3,5T \+ remorque",
    "PL seul 3,5T <=": r"PL seul 3,5T <=",
    "Autocar": r"Autocar",
    "Autobus": r"Autobus",
    "Bicyclette": r"Bicyclette",
    "EDP-m": r"EDP-m",
    "Voiturette": r"Voiturette",
    "Quad léger <= 50 cm3": r"Quad léger <= 50 cm3",
    "Tramway": r"Tramway",
    "Autre véhicule": r"Autre véhicule",
    "Tracteur routier + semi-remorque": r"Tracteur routier \+ semi-remorque",
    "PL seul PTAC > 7,5T": r"PL seul PTAC > 7,5T",
    "3 RM > 125 cm3": r"3 RM > 125 cm3",
    "3 RM <= 50 cm3": r"3 RM <= 50 cm3",
    "Vélo par assistance électrique": r"Vélo par assistance électrique",
    "Engin spécial": r"Engin spécial",
    "EDP-sm": r"EDP-sm",
    "Quad lourd > 50 cm3": r"Quad lourd > 50 cm3",
    "Tracteur agricole": r"Tracteur agricole",
    "EDP sans moteur": r"Autre engin de déplacement personnel \(EDP\) sans moteur",
    "Indéterminable": r"Indéterminable",
    "Piéton": r"Piéton",
    "PL seul 3,5T <PTAC <= 7,5T": r"PL seul 3,5T <PTAC <= 7,5T",
    "Nouvel engin de déplacement personnel (EDP) à moteur": r"Nouvel engin de déplacement personnel \(EDP\) à moteur",
    "Tracteur routier": r"Tracteur routier",
    "PL seul PTAC <= 7,5T": r"PL seul PTAC <= 7,5T",
    "3 RM > 50 <= 125 cm3": r"3 RM > 50 <= 125 cm3",
    "Autre engin de déplacement personnel (EDP) sans moteur": r"Autre engin de déplacement personnel \(EDP\) sans moteur"
}

# Parties involved
parties_patterns = {
    "Piéton Féminin": r"\b1 Piéton Feminin\b",
    "Piéton Masculin": r"\b1 Piéton Masculin\b",
    "Piéton": r"\b1 Piéton(?!\s(?:Feminin|Masculin))\b",
    "Usager Masculin": r"\b1 usager Masculin(?: de \d+ ans)?(?:\s\(.+?\))?(?!\spassager)",
    "Usager Féminin": r"\b1 usager Feminin(?: de \d+ ans)?(?:\s\(.+?\))?(?!\spassager)",
    "Passager Masculin": r"avec \d+ passager(?:s)? Masculin",
    "Passager Féminin": r"avec \d+ passager(?:s)? Feminin",
    "Passager": r"avec (\d+) passager",
    "Bicyclette": r"\b1\s(?:Bic|Bicyclette)\b",
    "Véhicule de tourisme (VT)": r"heurte 1 Véhicule de tourisme \(VT\)"
}
print("Patterns defined.")
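
As a quick sanity check, the sketch below runs two of the patterns above against short summaries. The two sample strings are invented for illustration and are not taken from the dataset. It shows the two mechanisms the patterns rely on: a capture group that returns only the label, and a negative lookahead that keeps the generic `Piéton` label from matching when a gender follows.

```python
import re

# Copied from combined_patterns: the parentheses capture only the severity label.
severity = r"Accident (Léger non mortel|Grave non mortel|Mortel)"

# Copied from parties_patterns: the negative lookahead rejects a pedestrian
# whose gender is stated, so the generic label is not counted twice.
pieton = r"\b1 Piéton(?!\s(?:Feminin|Masculin))\b"

# Hypothetical summaries, made up for this example.
sample_a = "Accident Grave non mortel : 1 Piéton Feminin heurte 1 Bicyclette"
sample_b = "Accident Mortel : 1 Piéton heurte 1 Autocar"

severity_a = re.search(severity, sample_a).group(1)
generic_a = re.findall(pieton, sample_a)  # gender follows, so no generic match
generic_b = re.findall(pieton, sample_b)  # no gender, so the generic label matches

print(severity_a)  # Grave non mortel
print(generic_a)   # []
print(generic_b)   # ['1 Piéton']
```

If a pattern returns `None` or an empty list on real rows, comparing the row's exact wording (accents, spacing, capitalization) with the pattern is usually the first thing to check.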

## 3) Create an Extraction Function

This function will:
- Check if input is a valid string.
- Find matches for accident characteristics.
- Find all listed vehicle types.
- Determine which parties are involved.
- Return all findings in a dictionary.

In [None]:
def extract_accident_details(resume: str) -> Dict[str, Union[str, None, List[str]]]:
    """Extract characteristics, vehicle types, and parties from one summary."""
    # Missing values (NaN, None) are not strings, so this single check covers them.
    if not isinstance(resume, str):
        return {}

    result = {}

    # 1) Extract accident characteristics.
    for key, pattern in combined_patterns.items():
        match = re.search(pattern, resume)
        result[key] = match.group(1) if match else None

    # 2) Extract vehicle types found.
    all_vehicle_patterns = "|".join(vehicle_patterns.values())
    result["vehicle_types"] = re.findall(all_vehicle_patterns, resume)

    # 3) Identify parties involved.
    result["parties_involved"] = []
    for key, pattern in parties_patterns.items():
        matches = re.findall(pattern, resume)
        if matches:
            # For multiple passengers, store the number.
            if key == "Passager":
                for count in matches:
                    passenger_text = f"avec {count} passager{'s' if int(count) > 1 else ''}"
                    result["parties_involved"].append(passenger_text)
            else:
                # For everything else, add it as many times as matched.
                result["parties_involved"].extend([key] * len(matches))

    return result

print("Extraction function created.")
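
Step 2 of the function joins every vehicle pattern with `|`. Python's `re` tries alternatives from left to right and keeps the first one that matches, so a short label placed before a longer label with the same prefix would shadow it. The dictionary above lists `Tracteur routier + semi-remorque` before `Tracteur routier`, which avoids the problem. The sketch below makes that ordering explicit by sorting longest-first. It assumes a reduced label list and a made-up sentence, and it uses `re.escape` instead of the hand-escaped patterns.

```python
import re

# Reduced, illustrative label list (not the full vehicle_patterns dictionary).
labels = ["Tracteur routier", "Tracteur routier + semi-remorque", "Autocar"]

# Naive order: the shorter label comes first and shadows the longer one.
naive = "|".join(re.escape(label) for label in labels)

# Longest-first order: the most specific label is tried first.
by_length = sorted(labels, key=len, reverse=True)
ordered = "|".join(re.escape(label) for label in by_length)

# Hypothetical summary fragment, made up for this example.
text = "1 Tracteur routier + semi-remorque heurte 1 Autocar"

naive_hits = re.findall(naive, text)
ordered_hits = re.findall(ordered, text)

print(naive_hits)    # ['Tracteur routier', 'Autocar']
print(ordered_hits)  # ['Tracteur routier + semi-remorque', 'Autocar']
```

Sorting by length is only one way to guard against shadowing; keeping the dictionary in a deliberate order, as the notebook does, works equally well as long as new entries respect it.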

## 4) Load and Clean the Data

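We'll read a semicolon-delimited, UTF-8 CSV file, `accidents.csv`, from the `../data/` folder.
Then we'll trim the `Résumé` column and collapse every run of whitespace (spaces, tabs, line breaks) into a single space, leaving non-string values untouched.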

In [None]:
# Make sure the file exists and the path is correct.
df = pd.read_csv("../data/accidents.csv", encoding='utf-8', sep=";")

# Clean up extra spaces.
df['Résumé'] = df['Résumé'].apply(
    lambda x: re.sub(r'\s+', ' ', x.strip()) if isinstance(x, str) else x
)
print("Résumé column cleaned.")
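
To see what the cleaning step does to a single value, here is a minimal standalone sketch. The helper name `normalize_whitespace` and the sample string are assumptions made for this example; the notebook itself uses an inline lambda.

```python
import re

def normalize_whitespace(value):
    """Trim the ends and collapse whitespace runs; pass non-strings through."""
    if isinstance(value, str):
        return re.sub(r"\s+", " ", value.strip())
    return value

# Hypothetical raw value with leading, trailing, and repeated whitespace.
raw = "  Accident   Mortel \n\tHors agglomération  "

cleaned = normalize_whitespace(raw)
missing = normalize_whitespace(None)  # non-strings come back unchanged

print(repr(cleaned))  # 'Accident Mortel Hors agglomération'
print(missing)        # None
```

Normalizing whitespace first matters because the patterns above expect single spaces between words, so a stray double space or line break inside a label would otherwise prevent a match.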

## 5) Apply the Function

We run the function on each row's `Résumé` value and store the resulting dictionary in a new `Résumé_Details` column.

In [None]:
# Apply the extraction function.
df['Résumé_Details'] = df['Résumé'].apply(extract_accident_details)
print("Extraction applied to each résumé.")


## 6) Inspect and Save the Results

We'll expand the extracted dictionaries into separate columns, preview the first rows, and then save the enriched data to a new CSV.

In [None]:
# Expand the dictionaries in 'Résumé_Details'
df_improved = pd.concat([df, df['Résumé_Details'].apply(pd.Series)], axis=1)

# Drop the intermediate dictionary column now that it has been expanded.
df_improved.drop(['Résumé_Details'], axis=1, inplace=True, errors='ignore')

# Preview the first 15 rows as a Markdown table (to_markdown requires the 'tabulate' package).
df_preview = df_improved[
    [
        'Résumé',
        'severity',
        'location_context',
        'road_configuration',
        'lighting',
        'weather',
        'surface',
        'vehicle_types',
        'parties_involved'
    ]
].head(15)

print(df_preview.to_markdown(index=False))

# Save the results.
df_improved.to_csv('../data/accidents-enriched.csv', index=False)
print("Results saved as '../data/accidents-enriched.csv'.")