# Analysis of Paris Accident Data - Part 2

**Goal**: Parse the free-text `report_summary` field with regular expressions, and map each accident on the Boulevard Périphérique to its nearest "Porte" (a gate or interchange on the ring road).

In this notebook, we:
- Extract structured details from free-text accident summaries
- Identify accidents on Boulevard Périphérique
- Map each such accident to the nearest Porte using KML data
- Save the enriched dataset for further analysis

In [1]:
import pandas as pd
import re
import math
from xml.dom import minidom

# Load the cleaned dataset from Notebook 1
df = pd.read_csv('../data/accidents_cleaned.csv', sep=';')
print("Cleaned data loaded successfully!")
df.head(3)

Cleaned data loaded successfully!


Unnamed: 0,accident_date,victim_transport_mode,victim_category,victim_age,victim_sex,environment,address,longitude,latitude,accident_ID,periphery_info,victim_age_group,victim_minor_injuries?,victim_hospitalized?,victim_deceased?,report_summary,district_code,arrondgeo,arrondissement
0,2017-04-03,Piéton,Piéton,62,F,En-Agg,BOULEVARD BEAUMARCHAIS,2.36867,48.855,837613,Paris Intra Muros,55-64 ans,True,False,False,"Accident Léger non mortel, En agglomération, H...",75104,"{""coordinates"": [[[[2.369123881, 48.853166231]...",4
1,2017-08-28,2 Roues Motorisées,Conducteur,30,M,En-Agg,RUE MARBEUF,2.3013,48.8667,837073,Paris Intra Muros,25-34 ans,True,False,False,"Accident Léger non mortel, En agglomération, E...",75108,"{""coordinates"": [[[[2.301737288, 48.863496077]...",8
2,2017-11-06,2 Roues Motorisées,Conducteur,37,M,En-Agg,RUE LA CONDAMINE,2.32163,48.8858,840008,Paris Intra Muros,35-44 ans,True,False,False,"Accident Léger non mortel, En agglomération, E...",75117,"{""coordinates"": [[[[2.303774362, 48.894153779]...",17


## Regex Parsing of `report_summary`

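The summaries appear to follow a fixed sentence template, so we use one regular expression per field to extract the intersection type, lighting, weather, road surface, first vehicle, speed limit (VMA), and the first vehicle's driver sex and age from `report_summary`. Rows where a pattern does not match are left as `NaN`.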

In [2]:
# Extract structured details using regex patterns
summaries = df['report_summary']

# Intersection Type
pattern_intersection = r"En agglomération, ([^,]+)"
intersection_type = summaries.str.extract(pattern_intersection, expand=False)

# Lighting Condition
pattern_lighting = r"^(?:[^,]*,){3}\s*(.*?)\s*,\s*avec une météo"
lighting_condition = summaries.str.extract(pattern_lighting, expand=False)

# Weather Condition
pattern_meteo = r"avec une météo\s+(.*?)\s+et\s+une\s+surface\s+chaussée"
weather_condition = summaries.str.extract(pattern_meteo, expand=False)

# Road Surface
pattern_road_surface = r"et\s+une\s+surface\s+chaussée\s*:\s*(.*?)\."
road_surface = summaries.str.extract(pattern_road_surface, expand=False)

# First Vehicle
pattern_first_vehicle = r"1\s+(.*?)\s+circulant"
first_vehicle = summaries.str.extract(pattern_first_vehicle, expand=False)

# Speed Limit
pattern_max_speed = r"VMA à (\d+)"
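# Cast to float (like driver age) so the limit is numeric and missing values stay NaN
max_speed = summaries.str.extract(pattern_max_speed, expand=False).astype(float)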

# Driver Sex
pattern_driver_sex = r"conduit\s+par\s+1\s+usager\s+([MFmf])\w*"
driver_sex = summaries.str.extract(pattern_driver_sex, expand=False)

# Driver Age
pattern_driver_age = r"conduit\s+par\s+1\s+usager\s+\S+\s+de\s+(\d+)(?:\s+a[n]s?)?"
driver_age = summaries.str.extract(pattern_driver_age, expand=False).astype(float)

# Add extracted fields to the dataframe
df['intersection_type'] = intersection_type
df['lighting_condition'] = lighting_condition
df['weather_condition'] = weather_condition
df['road_surface'] = road_surface
df['first_vehicle'] = first_vehicle
df['max_speed'] = max_speed
df['first_vehicle_driver_sex'] = driver_sex
df['first_vehicle_driver_age'] = driver_age

print("Regex extraction complete.")
df.head(3)

Regex extraction complete.


Unnamed: 0,accident_date,victim_transport_mode,victim_category,victim_age,victim_sex,environment,address,longitude,latitude,accident_ID,...,arrondgeo,arrondissement,intersection_type,lighting_condition,weather_condition,road_surface,first_vehicle,max_speed,first_vehicle_driver_sex,first_vehicle_driver_age
0,2017-04-03,Piéton,Piéton,62,F,En-Agg,BOULEVARD BEAUMARCHAIS,2.36867,48.855,837613,...,"{""coordinates"": [[[[2.369123881, 48.853166231]...",4,Hors intersection,Plein jour,Normale,Non renseigné,Cyclomoteur <=50 cm3,,M,26.0
1,2017-08-28,2 Roues Motorisées,Conducteur,30,M,En-Agg,RUE MARBEUF,2.3013,48.8667,837073,...,"{""coordinates"": [[[[2.301737288, 48.863496077]...",8,En Y,Plein jour,Normale,Normale,Scooter > 125 cm3,,M,30.0
2,2017-11-06,2 Roues Motorisées,Conducteur,37,M,En-Agg,RUE LA CONDAMINE,2.32163,48.8858,840008,...,"{""coordinates"": [[[[2.303774362, 48.894153779]...",17,En X,Plein jour,Normale,Normale,Véhicule de tourisme (VT),,M,45.0
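
To make the patterns above concrete, here is a minimal sketch that applies them to a single string with the standard `re` module. The `sample` text is a made-up summary written to mimic the layout the patterns expect, not a record from the dataset, and `extract_fields` is a hypothetical helper. `re.search` is used because `Series.str.extract` also returns the first match of each pattern.

```python
import re

# Hypothetical summary (assumption): mimics the report_summary layout, not a real record
sample = (
    "Accident Léger non mortel, En agglomération, Hors intersection, "
    "Plein jour, avec une météo Normale et une surface chaussée : Normale. "
    "1 Scooter > 125 cm3 circulant sur une voie à VMA à 50 km/h, "
    "conduit par 1 usager Masculin de 30 ans."
)

# Patterns copied from the extraction cell above
patterns = {
    "intersection_type": r"En agglomération, ([^,]+)",
    "lighting_condition": r"^(?:[^,]*,){3}\s*(.*?)\s*,\s*avec une météo",
    "weather_condition": r"avec une météo\s+(.*?)\s+et\s+une\s+surface\s+chaussée",
    "road_surface": r"et\s+une\s+surface\s+chaussée\s*:\s*(.*?)\.",
    "first_vehicle": r"1\s+(.*?)\s+circulant",
    "max_speed": r"VMA à (\d+)",
    "first_vehicle_driver_sex": r"conduit\s+par\s+1\s+usager\s+([MFmf])\w*",
    "first_vehicle_driver_age": r"conduit\s+par\s+1\s+usager\s+\S+\s+de\s+(\d+)(?:\s+a[n]s?)?",
}

def extract_fields(text):
    """Return the first capture group of each pattern, or None when it does not match."""
    result = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        result[field] = match.group(1) if match else None
    return result

fields = extract_fields(sample)
for field, value in fields.items():
    print(f"{field}: {value}")
```

Running the same helper on a handful of real summaries is a quick way to spot templates the patterns miss before trusting the extracted columns.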


## Mapping Boulevard Périphérique Accidents to the Nearest Porte

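Porte coordinates come from a KML file of Périphérique interchanges. For every accident whose address mentions the Périphérique, we compute the Haversine (great-circle) distance to each Porte and keep the closest one.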
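
For reference, the distance computed by `haversine_distance` can be written as follows, with latitudes $\varphi_1, \varphi_2$ and longitudes $\lambda_1, \lambda_2$ in radians, $\Delta\varphi = \varphi_2 - \varphi_1$, $\Delta\lambda = \lambda_2 - \lambda_1$, and $R = 6371$ km:

$$
a = \sin^2\!\left(\frac{\Delta\varphi}{2}\right) + \cos\varphi_1 \cos\varphi_2 \, \sin^2\!\left(\frac{\Delta\lambda}{2}\right),
\qquad
d = 2R \,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1-a}\right)
$$

This treats the Earth as a sphere, which should be a reasonable approximation at the scale of a single city.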

In [3]:
# Dictionary mapping KML IDs to real Porte names
kml_id_to_porte_name = {
    "#1": "Porte de Bercy",
    "#2": "Quai d'Ivry",
    "#3": "Porte d'Ivry",
    "#4": "Porte d'Italie",
    "#5": "Autoroute A6b",
    "#6": "Porte de Gentilly",
    "#7": "Autoroute A6a",
    "#8": "Porte d'Orléans",
    "#9": "Porte de Châtillon",
    "#10": "Porte de Vanves",
    "#11": "Porte Brancion",
    "#12": "Porte de la Plaine",
    "#13": "Porte de Sèvres",
    "#14": "Quai d'Issy",
    "#15": "Porte de Saint-Cloud - Quai Saint-Exupéry",
    "#16": "Porte de Saint-Cloud",
    "#17": "Porte Molitor",
    "#18": "Porte d'Auteuil",
    "#19": "Porte d'Auteuil (A13)",
    "#20": "Porte de Passy",
    "#21": "Porte de la Muette",
    "#22": "Porte Dauphine",
    "#23": "Porte Maillot",
    "#24": "Porte des Ternes",
    "#25": "Porte de Champerret (1/2 B)",
    "#26": "Porte de Champerret (1/2 H)",
    "#27": "Porte d'Asnières",
    "#28": "Porte de Clichy",
    "#29": "Porte de Saint-Ouen",
    "#30": "Porte de Clignancourt",
    "#31": "Porte de la Chapelle",
    "#32": "Porte d'Aubervilliers",
    "#33": "Porte de la Villette",
    "#34": "Porte de Pantin",
    "#35": "Porte du Pré-Saint-Gervais",
    "#36": "Porte des Lilas",
    "#37": "Porte de Bagnolet",
    "#38": "Porte de Montreuil",
    "#39": "Porte de Vincennes",
    "#40": "Porte de Saint-Mandé",
    "#41": "Porte Dorée",
    "#42": "Porte de Charenton",
    "#43": "Porte de Bercy (autoroute, km 35)"
}

def haversine_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    R = 6371  # Mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return R * c

def parse_kml_interchanges(kml_path):
    dom = minidom.parse(kml_path)
    placemarks = dom.getElementsByTagName("Placemark")
    data = []
    for pm in placemarks:
        name_nodes = pm.getElementsByTagName("name")
        name_value = name_nodes[0].firstChild.nodeValue.strip() if name_nodes else None
        coord_nodes = pm.getElementsByTagName("coordinates")
        if coord_nodes:
            # KML stores "lon,lat[,alt]"; altitude is optional, so keep only the first two values
            coords_text = coord_nodes[0].firstChild.nodeValue.strip()
            lon_str, lat_str = coords_text.split(',')[:2]
            longitude = float(lon_str)
            latitude = float(lat_str)
        else:
            longitude, latitude = None, None
        data.append({
            'name': name_value,
            'longitude': longitude,
            'latitude': latitude
        })
    return pd.DataFrame(data)

# Parse the KML file containing Porte information
df_portes = parse_kml_interchanges("../data/peripherique_interchanges.kml")
df_portes['real_name'] = df_portes['name'].map(kml_id_to_porte_name)
print("KML parsing complete.")
df_portes.head(3)

KML parsing complete.


Unnamed: 0,name,longitude,latitude,real_name
0,#1,2.39139,48.82722,Porte de Bercy
1,#2,2.38417,48.82472,Quai d'Ivry
2,#3,2.373185,48.819569,Porte d'Ivry
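
The parser above relies on each `Placemark` carrying a `name` and a `coordinates` element. The sketch below shows that layout on a small inline document. `SAMPLE_KML` is an assumed structure, not an excerpt of `peripherique_interchanges.kml` (only the two coordinate pairs are taken from the output above), and `read_placemarks` is a hypothetical stand-in for `parse_kml_interchanges` that returns plain tuples.

```python
from xml.dom import minidom

# Assumed KML layout (hypothetical): one Placemark per Porte, "lon,lat[,alt]" coordinates
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>#1</name>
      <Point><coordinates>2.39139,48.82722,0</coordinates></Point>
    </Placemark>
    <Placemark>
      <name>#2</name>
      <Point><coordinates>2.38417,48.82472</coordinates></Point>
    </Placemark>
  </Document>
</kml>"""

def read_placemarks(kml_text):
    """Return (name, longitude, latitude) for every Placemark in a KML string."""
    dom = minidom.parseString(kml_text)
    rows = []
    for pm in dom.getElementsByTagName("Placemark"):
        name = pm.getElementsByTagName("name")[0].firstChild.nodeValue.strip()
        coords = pm.getElementsByTagName("coordinates")[0].firstChild.nodeValue.strip()
        # Altitude is optional in KML, so only the first two values are used
        lon_str, lat_str = coords.split(",")[:2]
        rows.append((name, float(lon_str), float(lat_str)))
    return rows

placemarks = read_placemarks(SAMPLE_KML)
for name, lon, lat in placemarks:
    print(name, lon, lat)
```

Note that KML lists longitude before latitude, the reverse of the argument order used by `haversine_distance`.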


In [4]:
# Identify accidents on Boulevard Périphérique
on_periph_mask = df['address'].str.contains("PERIPHERIQUE", case=False, na=False)
df_periph = df[on_periph_mask].copy()
print(f"Number of accidents on the Périphérique: {len(df_periph)}")

def find_nearest_porte(lat, lon, df_portes):
    """Return (real_name, distance_km) of the Porte closest to the point (lat, lon)."""
    min_dist = float('inf')
    nearest_porte = None
    for _, row in df_portes.iterrows():
        dist = haversine_distance(lat, lon, row['latitude'], row['longitude'])
        if dist < min_dist:
            min_dist = dist
            nearest_porte = row['real_name']
    return nearest_porte, min_dist

porte_names = []
porte_distances = []
for idx, row in df_periph.iterrows():
    name, dist = find_nearest_porte(row['latitude'], row['longitude'], df_portes)
    porte_names.append(name)
    porte_distances.append(dist)

df_periph['nearest_porte_name'] = porte_names
df_periph['nearest_porte_distance_km'] = porte_distances

# Update main dataframe: Périphérique rows get the nearest Porte name as their address;
# all other rows keep their address and get NaN in porte_distance_km
df.loc[df_periph.index, 'address'] = df_periph['nearest_porte_name']
df.loc[df_periph.index, 'porte_distance_km'] = df_periph['nearest_porte_distance_km']

print("Mapping of Périphérique accidents complete.")
df.head(3)

Number of accidents on the Périphérique: 2474
Mapping of Périphérique accidents complete.


Unnamed: 0,accident_date,victim_transport_mode,victim_category,victim_age,victim_sex,environment,address,longitude,latitude,accident_ID,...,arrondissement,intersection_type,lighting_condition,weather_condition,road_surface,first_vehicle,max_speed,first_vehicle_driver_sex,first_vehicle_driver_age,porte_distance_km
0,2017-04-03,Piéton,Piéton,62,F,En-Agg,BOULEVARD BEAUMARCHAIS,2.36867,48.855,837613,...,4,Hors intersection,Plein jour,Normale,Non renseigné,Cyclomoteur <=50 cm3,,M,26.0,
1,2017-08-28,2 Roues Motorisées,Conducteur,30,M,En-Agg,RUE MARBEUF,2.3013,48.8667,837073,...,8,En Y,Plein jour,Normale,Normale,Scooter > 125 cm3,,M,30.0,
2,2017-11-06,2 Roues Motorisées,Conducteur,37,M,En-Agg,RUE LA CONDAMINE,2.32163,48.8858,840008,...,17,En X,Plein jour,Normale,Normale,Véhicule de tourisme (VT),,M,45.0,
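
As a small self-contained check of the nearest-Porte logic, the sketch below runs the same Haversine search over the first three Portes from the KML output, using only the standard library. The accident coordinates are hypothetical, chosen close to the Quai d'Ivry marker, and `haversine_km` and `nearest_porte` are illustrative names rather than the functions defined in the cells above.

```python
from math import radians, sin, cos, atan2, sqrt

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * radius_km * atan2(sqrt(a), sqrt(1 - a))

# First three Portes from the parsed KML output: (name, latitude, longitude)
portes = [
    ("Porte de Bercy", 48.82722, 2.39139),
    ("Quai d'Ivry", 48.82472, 2.38417),
    ("Porte d'Ivry", 48.819569, 2.373185),
]

def nearest_porte(lat, lon, portes):
    """Return (name, distance_km) of the closest Porte to (lat, lon)."""
    name, p_lat, p_lon = min(portes, key=lambda p: haversine_km(lat, lon, p[1], p[2]))
    return name, haversine_km(lat, lon, p_lat, p_lon)

# Hypothetical accident location (assumption), near the Quai d'Ivry marker
name, dist_km = nearest_porte(48.8250, 2.3850, portes)
print(name, round(dist_km, 3))
```

With 43 Portes and a few thousand accidents, a plain loop like this is fast enough. A vectorized distance computation would only be worth considering for much larger inputs.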


## Save the Enriched Dataset

The enriched dataset, with the eight parsed columns and `porte_distance_km`, is saved as a semicolon-separated file, `accidents_parsed.csv`, for further analysis.

In [5]:
df.to_csv('../data/accidents_parsed.csv', index=False, sep=';')
print('Enriched dataset saved as accidents_parsed.csv')

Enriched dataset saved as accidents_parsed.csv
