# Analysis of Paris Accident Data - Part 2: Text Parsing and Location Enrichment

**Introduction: Unlocking Hidden Clues**

In Notebook 1, we cleaned and prepped our data. Now, it's time to dig deeper!  This notebook focuses on extracting valuable information from the free-text `report_summary` field and refining location data for accidents on the Boulevard Périphérique. This step is crucial because it provides the data needed for our **solution-driven** analyses (temporal, meteorological, and accident type) that will follow in Notebook 3.

**Notebook Objectives:**

1.  **Load and Explore:**  Reload the cleaned dataset and examine the `report_summary` column.
2.  **Regex Parsing:**  Use regular expressions (regex) to extract structured data from the summaries (e.g., weather, intersection type).
3.  **Périphérique Mapping:**  Identify accidents on the Boulevard Périphérique and map them to the nearest "Porte" (exit) for more precise location data.
4.  **Data Enrichment:**  Add the extracted and refined information to our main DataFrame.
5.  **Save Progress:**  Save the enriched dataset for the final analysis.

Let's get started!

### Step 1: Load and Explore

In [1]:
import re
import pandas as pd
from xml.dom import minidom
import math

# Load the cleaned dataset from Notebook 1.
df = pd.read_csv('../data/accidents_cleaned.csv', sep=';')

# Extract the 'report_summary' column for regex parsing.
summaries = df['report_summary']

print("Data loaded successfully!")
df.head()

Data loaded successfully!


Unnamed: 0,accident_date,victim_transport_mode,victim_category,victim_age,victim_sex,environment,address,longitude,latitude,accident_ID,periphery_info,victim_age_group,victim_minor_injuries?,victim_hospitalized?,victim_deceased?,report_summary,district_code,arrondgeo
0,2017-04-03,Piéton,Piéton,62,F,En-Agg,BOULEVARD BEAUMARCHAIS,2.36867,48.855,837613,Paris Intra Muros,55-64 ans,True,False,False,"Accident Léger non mortel, En agglomération, H...",75104,"{""coordinates"": [[[[2.369123881, 48.853166231]..."
1,2017-08-28,2 Roues Motorisées,Conducteur,30,M,En-Agg,RUE MARBEUF,2.3013,48.8667,837073,Paris Intra Muros,25-34 ans,True,False,False,"Accident Léger non mortel, En agglomération, E...",75108,"{""coordinates"": [[[[2.301737288, 48.863496077]..."
2,2017-11-06,2 Roues Motorisées,Conducteur,37,M,En-Agg,RUE LA CONDAMINE,2.32163,48.8858,840008,Paris Intra Muros,35-44 ans,True,False,False,"Accident Léger non mortel, En agglomération, E...",75117,"{""coordinates"": [[[[2.303774362, 48.894153779]..."
3,2017-09-29,Vélo,Conducteur,51,M,En-Agg,BOULEVARD DE L HOPITAL,2.35941,48.8368,838501,Paris Intra Muros,45-54 ans,True,False,False,"Accident Léger non mortel, En agglomération, H...",75113,"{""coordinates"": [[[[2.366087726, 48.844967843]..."
4,2017-12-21,2 Roues Motorisées,Conducteur,50,M,En-Agg,AVENUE DES MINIMES * SAINT MANDE/VINCENN,2.42821,48.8415,838218,Paris Intra Muros,45-54 ans,True,False,False,"Accident Léger non mortel, En agglomération, H...",75112,"{""coordinates"": [[[[2.467319402, 48.839099389]..."


### Step 2: Define our Battle Plan (Regex to the Rescue)

We'll use regular expressions (regex) to extract structured details from the free-text `report_summary`. Our approach is rule-based, targeting key pieces of information:

*   **intersection_type:** The type of intersection (if applicable).
*   **lighting_condition:**  The lighting conditions at the time of the accident.
*   **weather_condition:** The weather conditions.
*   **road_surface:** The condition of the road surface.
*   **first_vehicle:**  The type of vehicle mentioned first in the summary.
*   **max_speed:**  The posted speed limit (if mentioned).
*   **first_vehicle_driver_sex:** The sex of the driver of the first vehicle.
*   **first_vehicle_driver_age:** The age of the driver of the first vehicle.

In [2]:
# 2.1 Intersection Type
pattern_intersection = r"En agglomération, ([^,]+)"
intersection_type = summaries.str.extract(pattern_intersection, expand=False)

# 2.2 Lighting Condition
pattern_lighting = r"^(?:[^,]*,){3}\\s*(.*?)\\s*,\\s*avec une météo"
lighting_condition = summaries.str.extract(pattern_lighting, expand=False)

# 2.3 Weather Condition
pattern_meteo = r"avec une météo\\s+(.*?)\\s+et\\s+une\\s+surface\\s+chaussée"
weather_condition = summaries.str.extract(pattern_meteo, expand=False)

# 2.4 Road Surface
pattern_road_surface = r"et\\s+une\\s+surface\\s+chaussée\\s*:\\s*(.*?)\\."
road_surface = summaries.str.extract(pattern_road_surface, expand=False)

# 2.5 First Vehicle
pattern_first_vehicle = r"1\\s+(.*?)\\s+circulant"
first_vehicle = summaries.str.extract(pattern_first_vehicle, expand=False)

# 2.6 Speed Limit
pattern_max_speed = r"VMA à (\\d+)"
max_speed = summaries.str.extract(pattern_max_speed, expand=False)

# 2.7 Driver Sex
pattern_driver_sex = r"conduit\\s+par\\s+1\\s+usager\\s+([MFmf])\\w*"
driver_sex = summaries.str.extract(pattern_driver_sex, expand=False)

# 2.8 Driver Age
pattern_driver_age = r"conduit\\s+par\\s+1\\s+usager\\s+\\S+\\s+de\\s+(\\d+)(?:\\s+a[n]s?)?"
driver_age = summaries.str.extract(pattern_driver_age, expand=False).astype(float)

print("Regex extraction complete.")

Regex extraction complete.


### Step 3: Store the Extracted Data (Data, Meet DataFrame)

Now, we'll add the extracted information as new columns to our main DataFrame.

In [3]:
df['intersection_type'] = intersection_type
df['lighting_condition'] = lighting_condition
df['weather_condition'] = weather_condition
df['road_surface'] = road_surface
df['first_vehicle'] = first_vehicle
df['max_speed'] = max_speed
df['first_vehicle_driver_sex'] = driver_sex
df['first_vehicle_driver_age'] = driver_age

df.head(3)

Unnamed: 0,accident_date,victim_transport_mode,victim_category,victim_age,victim_sex,environment,address,longitude,latitude,accident_ID,...,district_code,arrondgeo,intersection_type,lighting_condition,weather_condition,road_surface,first_vehicle,max_speed,first_vehicle_driver_sex,first_vehicle_driver_age
0,2017-04-03,Piéton,Piéton,62,F,En-Agg,BOULEVARD BEAUMARCHAIS,2.36867,48.855,837613,...,75104,"{""coordinates"": [[[[2.369123881, 48.853166231]...",Hors intersection,,,,,,,
1,2017-08-28,2 Roues Motorisées,Conducteur,30,M,En-Agg,RUE MARBEUF,2.3013,48.8667,837073,...,75108,"{""coordinates"": [[[[2.301737288, 48.863496077]...",En Y,,,,,,,
2,2017-11-06,2 Roues Motorisées,Conducteur,37,M,En-Agg,RUE LA CONDAMINE,2.32163,48.8858,840008,...,75117,"{""coordinates"": [[[[2.303774362, 48.894153779]...",En X,,,,,,,


### Step 4: Map Boulevard Périphérique Accidents to the Nearest "Porte"

The Boulevard Périphérique is a ring road around Paris.  Accident locations are often reported as "BD PERIPHERIQUE," which isn't very precise.  We have a KML file containing the coordinates of each "Porte" (exit) on the Périphérique.  We'll use this to map accidents to their nearest Porte, providing much more specific location information.

In [4]:
# 4.1 Dictionary for KML ID => Real Porte Names
kml_id_to_porte_name = {
    "#1": "Porte de Bercy",
    "#2": "Quai d'Ivry",
    "#3": "Porte d'Ivry",
    "#4": "Porte d'Italie",
    "#5": "Autoroute A6b",
    "#6": "Porte de Gentilly",
    "#7": "Autoroute A6a",
    "#8": "Porte d'Orléans",
    "#9": "Porte de Châtillon",
    "#10": "Porte de Vanves",
    "#11": "Porte Brancion",
    "#12": "Porte de la Plaine",
    "#13": "Porte de Sèvres",
    "#14": "Quai d'Issy",
    "#15": "Porte de Saint-Cloud - Quai Saint-Exupéry",
    "#16": "Porte de Saint-Cloud",
    "#17": "Porte Molitor",
    "#18": "Porte d'Auteuil",
    "#19": "Porte d'Auteuil (A13)",
    "#20": "Porte de Passy",
    "#21": "Porte de la Muette",
    "#22": "Porte Dauphine",
    "#23": "Porte Maillot",
    "#24": "Porte des Ternes",
    "#25": "Porte de Champerret (1/2 B)",
    "#26": "Porte de Champerret (1/2 H)",
    "#27": "Porte d'Asnières",
    "#28": "Porte de Clichy",
    "#29": "Porte de Saint-Ouen",
    "#30": "Porte de Clignancourt",
    "#31": "Porte de la Chapelle",
    "#32": "Porte d'Aubervilliers",
    "#33": "Porte de la Villette",
    "#34": "Porte de Pantin",
    "#35": "Porte du Pré-Saint-Gervais",
    "#36": "Porte des Lilas",
    "#37": "Porte de Bagnolet",
    "#38": "Porte de Montreuil",
    "#39": "Porte de Vincennes",
    "#40": "Porte de Saint-Mandé",
    "#41": "Porte Dorée",
    "#42": "Porte de Charenton",
    "#43": "Porte de Bercy (autoroute, km 35)"
}

# 4.2 Parse the KML
def parse_kml_interchanges(kml_path):
    dom = minidom.parse(kml_path)
    placemarks = dom.getElementsByTagName("Placemark")
    data = []
    for pm in placemarks:
        name_nodes = pm.getElementsByTagName("name")
        name_value = name_nodes[0].firstChild.nodeValue.strip() if name_nodes else None

        coord_nodes = pm.getElementsByTagName("coordinates")
        if coord_nodes:
            coords_text = coord_nodes[0].firstChild.nodeValue.strip()
            lon_str, lat_str, _alt = coords_text.split(',')
            longitude = float(lon_str)
            latitude  = float(lat_str)
        else:
            longitude = None
            latitude  = None

        data.append({
            'name': name_value,
            'longitude': longitude,
            'latitude': latitude
        })
    return pd.DataFrame(data)

# Parse the KML file (adjust path if needed)
df_portes = parse_kml_interchanges("../data/peripherique_interchanges.kml")

# Map the numeric IDs (#1, #2...) to real porte names
df_portes['real_name'] = df_portes['name'].map(kml_id_to_porte_name)
df_portes.head()

Unnamed: 0,name,longitude,latitude,real_name
0,#1,2.39139,48.82722,Porte de Bercy
1,#2,2.38417,48.82472,Quai d'Ivry
2,#3,2.373185,48.819569,Porte d'Ivry
3,#4,2.36028,48.81611,Porte d'Italie
4,#5,2.35639,48.81639,Autoroute A6b


### 4.3 Identify Accidents on the Boulevard Périphérique

We filter for addresses containing `"PERIPHERIQUE"`.

In [5]:
on_periph_mask = df['address'].str.contains("PERIPHERIQUE", case=False, na=False)
df_periph = df[on_periph_mask].copy()
print(f"Number of accidents on the Périphérique: {len(df_periph)}")
df_periph.head(3)

Number of accidents on the Périphérique: 2474


Unnamed: 0,accident_date,victim_transport_mode,victim_category,victim_age,victim_sex,environment,address,longitude,latitude,accident_ID,...,district_code,arrondgeo,intersection_type,lighting_condition,weather_condition,road_surface,first_vehicle,max_speed,first_vehicle_driver_sex,first_vehicle_driver_age
5,2017-09-19,2 Roues Motorisées,Conducteur,47,M,En-Agg,BD PERIPHERIQUE EXTERIEUR,2.36115,48.9009,841686,...,75118,"{""coordinates"": [[[[2.351983536, 48.901484899]...",Hors intersection,,,,,,,
7,2017-07-12,4 Roues,Conducteur,30,M,En-Agg,BD PERIPHERIQUE EXTERIEUR,2.37882,48.9003,841591,...,75119,"{""coordinates"": [[[[2.410820319, 48.878436176]...",Hors intersection,,,,,,,
20,2017-12-18,4 Roues,Conducteur,34,F,En-Agg,BD PERIPHERIQUE EXTERIEUR,2.38222,48.8234,841815,...,75113,"{""coordinates"": [[[[2.366087726, 48.844967843]...",Hors intersection,,,,,,,


### 4.4 Haversine Distance

We define a helper to compute the distance (in kilometers) between two lat/long points.

In [6]:
def haversine_distance(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in km
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)

    a = (math.sin(dphi / 2) ** 2 +
         math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return R * c

print("Haversine function ready.")

Haversine function ready.


### 4.5 Match Each Périphérique Accident to the Nearest Porte

We loop over the accidents and pick whichever Porte is closest via Haversine.

In [7]:
def find_nearest_porte(lat, lon, df_portes):
    min_dist = float('inf')
    nearest_porte_name = None
    for _, row in df_portes.iterrows():
        dist = haversine_distance(lat, lon, row['latitude'], row['longitude'])
        if dist < min_dist:
            min_dist = dist
            nearest_porte_name = row['real_name']
    return nearest_porte_name, min_dist

porte_names = []
porte_distances = []

for idx, row in df_periph.iterrows():
    lat_acc = row['latitude']
    lon_acc = row['longitude']
    name_, dist_ = find_nearest_porte(lat_acc, lon_acc, df_portes)
    porte_names.append(name_)
    porte_distances.append(dist_)

df_periph['nearest_porte_name'] = porte_names
df_periph['nearest_porte_distance_km'] = porte_distances

df_periph[['address','nearest_porte_name','nearest_porte_distance_km']].head(10)

Unnamed: 0,address,nearest_porte_name,nearest_porte_distance_km
5,BD PERIPHERIQUE EXTERIEUR,Porte de la Chapelle,0.172134
7,BD PERIPHERIQUE EXTERIEUR,Porte d'Aubervilliers,0.572584
20,BD PERIPHERIQUE EXTERIEUR,Quai d'Ivry,0.20475
21,BD PERIPHERIQUE EXTERIEUR,Autoroute A6a,0.260146
23,BD PERIPHERIQUE INTERIEUR,Porte de Charenton,0.320872
25,BD PERIPHERIQUE INTERIEUR,Quai d'Ivry,0.205622
27,BD PERIPHERIQUE INTERIEUR,Porte de Châtillon,0.124988
32,BD PERIPHERIQUE EXTERIEUR,Porte Dorée,0.218766
34,BD PERIPHERIQUE INTERIEUR,Porte d'Orléans,0.360899
36,BD PERIPHERIQUE EXTERIEUR,Porte Dauphine,0.29247


### 4.6 Replace the `address` Column for These Rows

So that *"BD PERIPHERIQUE EXTERIEUR"* becomes *"Porte d'Auteuil"*, etc.

In [8]:
# In the main df, overwrite 'address' with 'nearest_porte_name'
df.loc[df_periph.index, 'address'] = df_periph['nearest_porte_name']
df.loc[df_periph.index, 'porte_distance_km'] = df_periph['nearest_porte_distance_km']

# Quick check
df[['address','latitude','longitude','porte_distance_km']].sample(5)

Unnamed: 0,address,latitude,longitude,porte_distance_km
27646,RUE DE MADRID,48.8788,2.32221,
13347,RUE D ALESIA,48.83182,2.314137,
26909,PLACE AUGUSTE BARON,48.900103,2.388344,
37972,BOULEVARD DE CLICHY,48.8844,2.32893,
29585,RUE CAMBRONNE,48.841817,2.30326,


### Step 5: Save the Updated Dataset

We'll export our final DataFrame (with new columns from regex and corrected addresses) to a new CSV. This is the dataset we'll feed into further analyses or visualizations.

In [9]:
output_path = '../data/accidents_parsed.csv'
df.to_csv(output_path, sep=';', index=False)
print(f"Done! Updated CSV saved to: {output_path}")

Done! Updated CSV saved to: ../data/accidents_parsed.csv


**Conclusion and Next Steps:**

We've successfully extracted key information from the `report_summary` field and improved the location data for accidents on the Boulevard Périphérique.  We now have a rich dataset, `accidents_parsed.csv`, ready for the crucial **solution-driven** analysis in Notebook 3.

Specifically, we'll be able to:

*   **Identify the top 3 most dangerous arrondissements.**
*   **Identify the top 3 most dangerous streets *within* those arrondissements.**
*   **Perform temporal analysis** to pinpoint peak accident times (months, days, hours).
*   **Link weather condition** to guide solution.
*   **Link victim transport mode** to guide solution.
*   **Perform meteorological analysis** to understand the relationship between weather and accidents.
*   **Analyze accident types** (using the information we extracted and `victim_transport_mode`) to propose targeted infrastructure improvements.

All of these analyses will lead to *concrete recommendations* for the city of Paris, fulfilling our overall objective.  Let's move on to Notebook 3!