# Analysis of Paris Accident Data - Part 2

**Goal**: Help Paris mayors implement concrete solutions to reduce road accidents.

In this notebook we:
- **Load** and explore accident data.
- **Parse** the free-text `report_summary` field with regex.
- **Identify** accidents on the Boulevard Périphérique.
- **Match** each Périphérique accident to its nearest Porte (using KML data + Haversine).
- **Replace** the address column for Périphérique accidents with the matched Porte name.
- **Save** the final, enriched dataset for further analysis.

## Step 1: Data Loading - Setting the Stage

Objective: help the Paris district town halls (mairies) put concrete solutions in place to reduce road accidents, including:
- A program to make the most dangerous streets safer
- A prevention campaign targeting risky behaviors
- Improvements to urban infrastructure

Storyline (examples of future analyses):
- Identify the most accident-prone arrondissements and streets
- Analyze timing (critical periods)
- Analyze weather conditions
- Analyze the types of vehicles involved
- Formulate concrete recommendations (e.g., bike lanes)

Let's go!

In [1]:
import re
import pandas as pd
from xml.dom import minidom
import math

# Load the dataset
df = pd.read_csv('../data/accidents_cleaned.csv', sep=';')

# Extract the 'report_summary' column for regex parsing
summaries = df['report_summary']

print("Data loaded successfully!")
df.head()

Data loaded successfully!


Unnamed: 0,accident_date,victim_transport_mode,victim_category,victim_age,victim_sex,environment,address,longitude,latitude,accident_ID,periphery_info,victim_age_group,victim_minor_injuries?,victim_hospitalized?,victim_deceased?,report_summary,district_code,arrondgeo
0,2017-04-03,Piéton,Piéton,62,F,En-Agg,BOULEVARD BEAUMARCHAIS,2.36867,48.855,837613,Paris Intra Muros,55-64 ans,True,False,False,"Accident Léger non mortel, En agglomération, H...",75104,"{""coordinates"": [[[[2.369123881, 48.853166231]..."
1,2017-08-28,2 Roues Motorisées,Conducteur,30,M,En-Agg,RUE MARBEUF,2.3013,48.8667,837073,Paris Intra Muros,25-34 ans,True,False,False,"Accident Léger non mortel, En agglomération, E...",75108,"{""coordinates"": [[[[2.301737288, 48.863496077]..."
2,2017-11-06,2 Roues Motorisées,Conducteur,37,M,En-Agg,RUE LA CONDAMINE,2.32163,48.8858,840008,Paris Intra Muros,35-44 ans,True,False,False,"Accident Léger non mortel, En agglomération, E...",75117,"{""coordinates"": [[[[2.303774362, 48.894153779]..."
3,2017-09-29,Vélo,Conducteur,51,M,En-Agg,BOULEVARD DE L HOPITAL,2.35941,48.8368,838501,Paris Intra Muros,45-54 ans,True,False,False,"Accident Léger non mortel, En agglomération, H...",75113,"{""coordinates"": [[[[2.366087726, 48.844967843]..."
4,2017-12-21,2 Roues Motorisées,Conducteur,50,M,En-Agg,AVENUE DES MINIMES * SAINT MANDE/VINCENN,2.42821,48.8415,838218,Paris Intra Muros,45-54 ans,True,False,False,"Accident Léger non mortel, En agglomération, H...",75112,"{""coordinates"": [[[[2.467319402, 48.839099389]..."


## Step 2: Define our Battle Plan (Regex to the Rescue)

We'll use regular expressions to extract structured details from the free-text `report_summary`. Our approach is rule-based: each pattern targets one fixed phrase of the summary and fills one of the following fields:
- **intersection_type**
- **lighting_condition**
- **weather_condition**
- **road_surface**
- **first_vehicle**
- **max_speed**
- **first_vehicle_driver_sex**
- **first_vehicle_driver_age**

In [2]:
# 2.1 Intersection Type
pattern_intersection = r"En agglomération, ([^,]+)"
intersection_type = summaries.str.extract(pattern_intersection, expand=False)

# 2.2 Lighting Condition
pattern_lighting = r"^(?:[^,]*,){3}\s*(.*?)\s*,\s*avec une météo"
lighting_condition = summaries.str.extract(pattern_lighting, expand=False)

# 2.3 Weather Condition
pattern_meteo = r"avec une météo\s+(.*?)\s+et\s+une\s+surface\s+chaussée"
weather_condition = summaries.str.extract(pattern_meteo, expand=False)

# 2.4 Road Surface
pattern_road_surface = r"et\s+une\s+surface\s+chaussée\s*:\s*(.*?)\."  
road_surface = summaries.str.extract(pattern_road_surface, expand=False)

# 2.5 First Vehicle
pattern_first_vehicle = r"1\s+(.*?)\s+circulant"
first_vehicle = summaries.str.extract(pattern_first_vehicle, expand=False)

# 2.6 Speed Limit (the number following "VMA à"), converted to a numeric value
pattern_max_speed = r"VMA à (\d+)"
max_speed = pd.to_numeric(summaries.str.extract(pattern_max_speed, expand=False), errors='coerce')

# 2.7 Driver Sex
pattern_driver_sex = r"conduit\s+par\s+1\s+usager\s+([MFmf])\w*"
driver_sex = summaries.str.extract(pattern_driver_sex, expand=False)

# 2.8 Driver Age
pattern_driver_age = r"conduit\s+par\s+1\s+usager\s+\S+\s+de\s+(\d+)(?:\s+a[n]s?)?"
driver_age = summaries.str.extract(pattern_driver_age, expand=False).astype(float)

print("Regex extraction complete.")

Regex extraction complete.
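As a quick sanity check, the patterns can be run against a single sentence. The summary string below is hypothetical: it is assembled from the field values visible in this notebook's outputs, and the real `report_summary` wording may differ. The sketch uses the standard `re` module; `Series.str.extract` returns the first match of the same regular expressions.

```python
import re

# Hypothetical summary; the real report_summary wording may differ.
sample_summary = (
    "Accident Léger non mortel, En agglomération, Hors intersection, "
    "Plein jour, avec une météo Normale et une surface chaussée : Normale. "
    "1 Véhicule de tourisme (VT) circulant sur une voie avec VMA à 50 km/h, "
    "conduit par 1 usager Masculin de 26 ans."
)

# Same patterns as in the cell above
patterns = {
    "intersection_type": r"En agglomération, ([^,]+)",
    "lighting_condition": r"^(?:[^,]*,){3}\s*(.*?)\s*,\s*avec une météo",
    "weather_condition": r"avec une météo\s+(.*?)\s+et\s+une\s+surface\s+chaussée",
    "road_surface": r"et\s+une\s+surface\s+chaussée\s*:\s*(.*?)\.",
    "first_vehicle": r"1\s+(.*?)\s+circulant",
    "max_speed": r"VMA à (\d+)",
    "first_vehicle_driver_sex": r"conduit\s+par\s+1\s+usager\s+([MFmf])\w*",
    "first_vehicle_driver_age": r"conduit\s+par\s+1\s+usager\s+\S+\s+de\s+(\d+)",
}

def first_group(pattern, text):
    """Return the first captured group, or None when the pattern does not match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

extracted = {field: first_group(p, sample_summary) for field, p in patterns.items()}
for field, value in extracted.items():
    print(f"{field}: {value}")
```

Every captured value is a string at this point, which is why the driver age is cast to `float` in the notebook.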


## Step 3: Store the Extracted Data (Data, Meet DataFrame)
We now place each extracted element into a new column of our main DataFrame.

In [3]:
df['intersection_type'] = intersection_type
df['lighting_condition'] = lighting_condition
df['weather_condition'] = weather_condition
df['road_surface'] = road_surface
df['first_vehicle'] = first_vehicle
df['max_speed'] = max_speed
df['first_vehicle_driver_sex'] = driver_sex
df['first_vehicle_driver_age'] = driver_age

df.head(3)

Unnamed: 0,accident_date,victim_transport_mode,victim_category,victim_age,victim_sex,environment,address,longitude,latitude,accident_ID,...,district_code,arrondgeo,intersection_type,lighting_condition,weather_condition,road_surface,first_vehicle,max_speed,first_vehicle_driver_sex,first_vehicle_driver_age
0,2017-04-03,Piéton,Piéton,62,F,En-Agg,BOULEVARD BEAUMARCHAIS,2.36867,48.855,837613,...,75104,"{""coordinates"": [[[[2.369123881, 48.853166231]...",Hors intersection,Plein jour,Normale,Non renseigné,Cyclomoteur <=50 cm3,,M,26.0
1,2017-08-28,2 Roues Motorisées,Conducteur,30,M,En-Agg,RUE MARBEUF,2.3013,48.8667,837073,...,75108,"{""coordinates"": [[[[2.301737288, 48.863496077]...",En Y,Plein jour,Normale,Normale,Scooter > 125 cm3,,M,30.0
2,2017-11-06,2 Roues Motorisées,Conducteur,37,M,En-Agg,RUE LA CONDAMINE,2.32163,48.8858,840008,...,75117,"{""coordinates"": [[[[2.303774362, 48.894153779]...",En X,Plein jour,Normale,Normale,Véhicule de tourisme (VT),,M,45.0
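The `max_speed` column is empty in the three rows shown above, so it may be worth measuring how often each pattern matches before relying on a field. Below is a minimal sketch with the standard `re` module; the three summaries are hypothetical stand-ins, not rows from the dataset.

```python
import re

# Hypothetical summaries used only to illustrate the computation
samples = [
    "En agglomération, Hors intersection, Plein jour, avec une météo Normale "
    "et une surface chaussée : Normale. 1 Scooter circulant, VMA à 50 km/h.",
    "En agglomération, En X, Plein jour, avec une météo Pluie légère "
    "et une surface chaussée : Mouillée. 1 Vélo circulant.",
    "Texte libre sans structure reconnue",
]

patterns = {
    "weather_condition": r"avec une météo\s+(.*?)\s+et\s+une\s+surface\s+chaussée",
    "max_speed": r"VMA à (\d+)",
}

def match_rate(pattern, texts):
    """Share of texts in which the pattern matches at least once."""
    hits = sum(1 for text in texts if re.search(pattern, text))
    return hits / len(texts)

rates = {field: match_rate(p, samples) for field, p in patterns.items()}
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} of summaries matched")
```

On the real DataFrame, `df['max_speed'].notna().mean()` gives the same share for an extracted column.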


## Step 4: Map Boulevard Périphérique Accidents to the Nearest "Porte"
We have a KML file with the location of each Porte around the ring road. Its placemarks are only labelled with IDs such as `"#4"`, so we map them to real names (e.g. "Porte d'Italie"), compute the distance from each accident to each Porte, and replace the address in our main DataFrame with the nearest one.

In [4]:
# 4.1 Dictionary for KML ID => Real Porte Names
kml_id_to_porte_name = {
    "#1": "Porte de Bercy",
    "#2": "Quai d'Ivry",
    "#3": "Porte d'Ivry",
    "#4": "Porte d'Italie",
    "#5": "Autoroute A6b",
    "#6": "Porte de Gentilly",
    "#7": "Autoroute A6a",
    "#8": "Porte d'Orléans",
    "#9": "Porte de Châtillon",
    "#10": "Porte de Vanves",
    "#11": "Porte Brancion",
    "#12": "Porte de la Plaine",
    "#13": "Porte de Sèvres",
    "#14": "Quai d'Issy",
    "#15": "Porte de Saint-Cloud - Quai Saint-Exupéry",
    "#16": "Porte de Saint-Cloud",
    "#17": "Porte Molitor",
    "#18": "Porte d'Auteuil",
    "#19": "Porte d'Auteuil (A13)",
    "#20": "Porte de Passy",
    "#21": "Porte de la Muette",
    "#22": "Porte Dauphine",
    "#23": "Porte Maillot",
    "#24": "Porte des Ternes",
    "#25": "Porte de Champerret (1/2 B)",
    "#26": "Porte de Champerret (1/2 H)",
    "#27": "Porte d'Asnières",
    "#28": "Porte de Clichy",
    "#29": "Porte de Saint-Ouen",
    "#30": "Porte de Clignancourt",
    "#31": "Porte de la Chapelle",
    "#32": "Porte d'Aubervilliers",
    "#33": "Porte de la Villette",
    "#34": "Porte de Pantin",
    "#35": "Porte du Pré-Saint-Gervais",
    "#36": "Porte des Lilas",
    "#37": "Porte de Bagnolet",
    "#38": "Porte de Montreuil",
    "#39": "Porte de Vincennes",
    "#40": "Porte de Saint-Mandé",
    "#41": "Porte Dorée",
    "#42": "Porte de Charenton",
    "#43": "Porte de Bercy (autoroute, km 35)"
}

In [5]:
# 4.2 Parse the KML
def parse_kml_interchanges(kml_path):
    dom = minidom.parse(kml_path)
    placemarks = dom.getElementsByTagName("Placemark")
    data = []
    for pm in placemarks:
        name_nodes = pm.getElementsByTagName("name")
        name_value = name_nodes[0].firstChild.nodeValue.strip() if name_nodes else None

        coord_nodes = pm.getElementsByTagName("coordinates")
        if coord_nodes:
            coords_text = coord_nodes[0].firstChild.nodeValue.strip()
            # KML coordinates are "lon,lat[,alt]"; the altitude is optional
            lon_str, lat_str = coords_text.split(',')[:2]
            longitude = float(lon_str)
            latitude  = float(lat_str)
        else:
            longitude = None
            latitude  = None

        data.append({
            'name': name_value,
            'longitude': longitude,
            'latitude': latitude
        })
    return pd.DataFrame(data)

# Parse the KML file (adjust path if needed)
df_portes = parse_kml_interchanges("../data/peripherique_interchanges.kml")

# Map the numeric IDs (#1, #2...) to real porte names
df_portes['real_name'] = df_portes['name'].map(kml_id_to_porte_name)
df_portes.head()

Unnamed: 0,name,longitude,latitude,real_name
0,#1,2.39139,48.82722,Porte de Bercy
1,#2,2.38417,48.82472,Quai d'Ivry
2,#3,2.373185,48.819569,Porte d'Ivry
3,#4,2.36028,48.81611,Porte d'Italie
4,#5,2.35639,48.81639,Autoroute A6b
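The KML file itself is not shown in this notebook. The fragment below is an assumed shape, inferred from the tags the parser reads (`Placemark`, `name`, `coordinates`) and filled with the first two rows of the table above. It illustrates that KML stores coordinates as `lon,lat[,alt]`, longitude first and altitude optional.

```python
from xml.dom import minidom

# Assumed structure; the real peripherique_interchanges.kml may contain more tags.
kml_text = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>#1</name>
      <Point><coordinates>2.39139,48.82722,0</coordinates></Point>
    </Placemark>
    <Placemark>
      <name>#2</name>
      <Point><coordinates>2.38417,48.82472</coordinates></Point>
    </Placemark>
  </Document>
</kml>"""

dom = minidom.parseString(kml_text)
points = []
for pm in dom.getElementsByTagName("Placemark"):
    name = pm.getElementsByTagName("name")[0].firstChild.nodeValue.strip()
    coords = pm.getElementsByTagName("coordinates")[0].firstChild.nodeValue.strip()
    # Keep only longitude and latitude; ignore the altitude when present
    lon, lat = (float(value) for value in coords.split(",")[:2])
    points.append((name, lon, lat))

print(points)
```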


### 4.3 Identify Accidents on the Boulevard Périphérique
We filter for addresses containing `"PERIPHERIQUE"` (case-insensitive; rows with a missing address are treated as non-matches).

In [6]:
on_periph_mask = df['address'].str.contains("PERIPHERIQUE", case=False, na=False)
df_periph = df[on_periph_mask].copy()
print(f"Number of accidents on the Périphérique: {len(df_periph)}")
df_periph.head(3)

Number of accidents on the Périphérique: 2474


Unnamed: 0,accident_date,victim_transport_mode,victim_category,victim_age,victim_sex,environment,address,longitude,latitude,accident_ID,...,district_code,arrondgeo,intersection_type,lighting_condition,weather_condition,road_surface,first_vehicle,max_speed,first_vehicle_driver_sex,first_vehicle_driver_age
5,2017-09-19,2 Roues Motorisées,Conducteur,47,M,En-Agg,BD PERIPHERIQUE EXTERIEUR,2.36115,48.9009,841686,...,75118,"{""coordinates"": [[[[2.351983536, 48.901484899]...",Hors intersection,Plein jour,Pluie légère,Mouillée,Véhicule de tourisme (VT),,F,32.0
7,2017-07-12,4 Roues,Conducteur,30,M,En-Agg,BD PERIPHERIQUE EXTERIEUR,2.37882,48.9003,841591,...,75119,"{""coordinates"": [[[[2.410820319, 48.878436176]...",Hors intersection,Plein jour,Normale,Normale,Véhicule de tourisme (VT),,M,30.0
20,2017-12-18,4 Roues,Conducteur,34,F,En-Agg,BD PERIPHERIQUE EXTERIEUR,2.38222,48.8234,841815,...,75113,"{""coordinates"": [[[[2.366087726, 48.844967843]...",Hors intersection,Plein jour,Normale,Normale,Véhicule de tourisme (VT),,F,34.0
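The filter matches the unaccented spelling only. Whether the dataset also contains accented variants such as "Périphérique" is not known from the outputs shown here; if it does, those rows would be missed. Below is a minimal sketch of an accent-insensitive check using the standard `unicodedata` module. The helper names `strip_accents` and `is_on_periph` are hypothetical.

```python
import unicodedata

def strip_accents(text):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def is_on_periph(address):
    """True when the address mentions the Périphérique, with or without accents."""
    if not isinstance(address, str):
        return False  # missing addresses are treated as non-matches
    return "PERIPHERIQUE" in strip_accents(address).upper()

addresses = ["BD PERIPHERIQUE EXTERIEUR", "Bd Périphérique Intérieur", "RUE MARBEUF", None]
flags = [is_on_periph(a) for a in addresses]
print(flags)
```

With pandas, the same helper could be applied through `df['address'].map(is_on_periph)` to build the mask.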


### 4.4 Haversine Distance
We define a helper to compute the distance (in kilometers) between two lat/long points.

In [7]:
def haversine_distance(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in km
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)

    a = (math.sin(dphi / 2) ** 2 +
         math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return R * c

print("Haversine function ready.")

Haversine function ready.
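The function computes a = sin²(Δφ/2) + cos φ1 · cos φ2 · sin²(Δλ/2) and returns d = 2R · atan2(√a, √(1 − a)). Two quick checks help confirm the implementation. One degree of latitude should be close to R · π/180, about 111.19 km. The first two interchanges from the table above (Porte de Bercy and Quai d'Ivry) should be a few hundred meters apart.

```python
import math

def haversine_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    R = 6371  # Earth radius in km
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2 +
         math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return R * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

one_degree = haversine_distance(0.0, 0.0, 1.0, 0.0)
bercy_to_ivry = haversine_distance(48.82722, 2.39139, 48.82472, 2.38417)

print(f"1 degree of latitude: {one_degree:.2f} km")
print(f"Porte de Bercy to Quai d'Ivry: {bercy_to_ivry:.3f} km")
```

The argument order is `(lat, lon)`, while KML lists longitude first, so swapping the two is an easy mistake to make when calling the function.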


### 4.5 Match Each Périphérique Accident to the Nearest Porte
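We loop over the accidents and, for each one, pick whichever Porte is closest by Haversine distance. With 2,474 accidents and the 43 named interchanges, this brute-force search amounts to roughly 106,000 distance computations, which is cheap enough for a plain Python loop.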

In [8]:
def find_nearest_porte(lat, lon, df_portes):
    # Accidents without coordinates cannot be matched to a Porte
    if pd.isna(lat) or pd.isna(lon):
        return None, float('nan')
    min_dist = float('inf')
    nearest_porte_name = None
    for _, row in df_portes.iterrows():
        dist = haversine_distance(lat, lon, row['latitude'], row['longitude'])
        if dist < min_dist:
            min_dist = dist
            nearest_porte_name = row['real_name']
    return nearest_porte_name, min_dist

porte_names = []
porte_distances = []

for idx, row in df_periph.iterrows():
    lat_acc = row['latitude']
    lon_acc = row['longitude']
    name_, dist_ = find_nearest_porte(lat_acc, lon_acc, df_portes)
    porte_names.append(name_)
    porte_distances.append(dist_)

df_periph['nearest_porte_name'] = porte_names
df_periph['nearest_porte_distance_km'] = porte_distances

df_periph[['address','nearest_porte_name','nearest_porte_distance_km']].head(10)

Unnamed: 0,address,nearest_porte_name,nearest_porte_distance_km
5,BD PERIPHERIQUE EXTERIEUR,Porte de la Chapelle,0.172134
7,BD PERIPHERIQUE EXTERIEUR,Porte d'Aubervilliers,0.572584
20,BD PERIPHERIQUE EXTERIEUR,Quai d'Ivry,0.20475
21,BD PERIPHERIQUE EXTERIEUR,Autoroute A6a,0.260146
23,BD PERIPHERIQUE INTERIEUR,Porte de Charenton,0.320872
25,BD PERIPHERIQUE INTERIEUR,Quai d'Ivry,0.205622
27,BD PERIPHERIQUE INTERIEUR,Porte de Châtillon,0.124988
32,BD PERIPHERIQUE EXTERIEUR,Porte Dorée,0.218766
34,BD PERIPHERIQUE INTERIEUR,Porte d'Orléans,0.360899
36,BD PERIPHERIQUE EXTERIEUR,Porte Dauphine,0.29247
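The same nearest-neighbor search can be written more compactly with `min` and a key function. This sketch is an alternative formulation, not the notebook's code. It uses only the five interchanges listed in the `df_portes` preview and the coordinates displayed for the accident at index 20, which may be rounded.

```python
import math

def haversine_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    R = 6371  # Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2 +
         math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return R * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

# (name, latitude, longitude) for the first five interchanges of the preview
portes = [
    ("Porte de Bercy", 48.82722, 2.39139),
    ("Quai d'Ivry", 48.82472, 2.38417),
    ("Porte d'Ivry", 48.819569, 2.373185),
    ("Porte d'Italie", 48.81611, 2.36028),
    ("Autoroute A6b", 48.81639, 2.35639),
]

def nearest_porte(lat, lon, portes):
    """Return (name, distance_km) of the closest interchange."""
    candidates = (
        (name, haversine_distance(lat, lon, p_lat, p_lon))
        for name, p_lat, p_lon in portes
    )
    return min(candidates, key=lambda pair: pair[1])

# Accident at index 20 in the output above
name, dist = nearest_porte(48.8234, 2.38222, portes)
print(name, round(dist, 3))
```

The result agrees with the table: that accident is matched to Quai d'Ivry at about 0.2 km.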


### 4.6 Replace the `address` Column for These Rows
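A generic address such as *"BD PERIPHERIQUE EXTERIEUR"* becomes a specific one such as *"Porte d'Auteuil"*. We also store the distance to the matched Porte in a new `porte_distance_km` column.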

In [9]:
# In the main df, overwrite 'address' with 'nearest_porte_name'
# (rows without a matched Porte keep their original address)
matched = df_periph['nearest_porte_name'].dropna()
df.loc[matched.index, 'address'] = matched
df.loc[df_periph.index, 'porte_distance_km'] = df_periph['nearest_porte_distance_km']

# Quick check
df[['address','latitude','longitude','porte_distance_km']].sample(5)

Unnamed: 0,address,latitude,longitude,porte_distance_km
30600,BOULEVARD DE L HOPITAL,,,
8642,RUE SAINT CHARLES,48.849963,2.288431,
10542,RUE JULIA BARTET,48.8249,2.3021,
5418,QUAI DE VALMY,48.8711,2.36514,
16007,RUE SAINT AMBROISE,48.861,2.37475,
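The write-back relies on index alignment: `df_periph` is a filtered copy that keeps the row labels of `df`, so assigning through `df.loc[...]` updates exactly the matching rows. Below is a minimal sketch on a toy frame. The row labels and values are illustrative, and the `original_address` column is a suggestion that the notebook does not create.

```python
import pandas as pd

# Toy frame with non-contiguous labels, as left by earlier filtering
df = pd.DataFrame(
    {"address": ["RUE MARBEUF", "BD PERIPHERIQUE EXTERIEUR",
                 "QUAI DE VALMY", "BD PERIPHERIQUE INTERIEUR"]},
    index=[1, 5, 8, 23],
)

mask = df["address"].str.contains("PERIPHERIQUE", case=False, na=False)
df_periph = df[mask].copy()

# Pretend the second accident had no coordinates and could not be matched
df_periph["nearest_porte_name"] = ["Porte de la Chapelle", None]

# Optional: keep the raw address so the overwrite stays reversible
df["original_address"] = df["address"]

# Only rows with a matched Porte are overwritten
matched = df_periph["nearest_porte_name"].dropna()
df.loc[matched.index, "address"] = matched

print(df)
```

Rows outside the Périphérique subset, and the unmatched row, keep their original address.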


## Step 5: Save the Updated Dataset
We'll export our final DataFrame (with new columns from regex and corrected addresses) to a new CSV. This is the dataset we'll feed into further analyses or visualizations.

> "Gentle reminder: Always keep a backup! And never feed your cat near a running vacuum."

Alright, let's do it...

In [10]:
output_path = '../data/accidents_parsed.csv'
df.to_csv(output_path, sep=';', index=False)
print(f"Done! Updated CSV saved to: {output_path}")

Done! Updated CSV saved to: ../data/accidents_parsed.csv
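Because the file is written with `sep=';'`, it must be read back with the same separator. Fields such as `arrondgeo` contain commas and double quotes, so it is worth confirming that they survive a round trip. The sketch below uses the standard `csv` module and an in-memory buffer as a stand-in for `to_csv` / `read_csv`; the row is hypothetical and its geometry is deliberately simplified.

```python
import csv
import io

# Hypothetical row; the geometry string is a shortened placeholder
rows = [
    {
        "address": "Porte de la Chapelle",
        "district_code": "75118",
        "arrondgeo": '{"coordinates": [[2.351983536, 48.901484899]]}',
    }
]
fieldnames = ["address", "district_code", "arrondgeo"]

# Write with ';' as the delimiter, like df.to_csv(..., sep=';')
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter=";")
writer.writeheader()
writer.writerows(rows)

# Read back with the same delimiter and compare
buffer.seek(0)
read_back = list(csv.DictReader(buffer, delimiter=";"))

print(read_back[0]["arrondgeo"])
```

For the real file, the equivalent check would be to reload it with `pd.read_csv(output_path, sep=';')` and compare its shape with `df.shape`.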


# Conclusion & Next Steps

- We **extracted** structured info from the free-text `report_summary` field.
- We **mapped** Périphérique accidents to their nearest **Porte**, making addresses more specific.
- We now have an enriched `accidents_parsed.csv` for deeper analysis:
  - **Identify** which arrondissements have the most accidents.
  - **Isolate** the three or so most dangerous streets in each.
  - **Analyze** time periods (peak months, days, hours) to guide city campaigns.
  - **Assess** weather patterns (rainy months, winter conditions, etc.).
  - **Propose** expansions of bike lanes, better signage, or improved lighting.

Stay tuned for Part 3!