# Analysis of Paris accident data - Part 2: parsing the summary field

Looking at the row details, we seem to have enough information for an analysis of Paris traffic accidents. If we want to go deeper, the `Summary` column often offers a short, free-text narrative of how each accident occurred.

But let's be honest, the `Summary` field is a bit of a beast. We've got half-finished sentences, abbreviations that probably made sense to the person writing them but are a mystery to us, and a structure that changes from line to line. It's like the Wild West of data fields. Take this one, for instance: "Minor accident, Non-fatal, In urban area, T-intersection, Daylight, with Normal weather and Normal road surface. 1 Passenger vehicle (PV) traveling on Municipal Road (MR) driven by 1 Male user, 30 years old (Ind) hits 1 Veh". What, I ask you, is a "Veh"? Did they just give up halfway through typing? Another example that is sure to give us a hard time: "Minor accident, Non-fatal, In urban area, Not at an intersection, Daylight, with Normal weather and Normal road surface. 1 Bicycle traveling on Municipal Road (MR) ridden by 1 Male user, 19 years old (Ind) hits 1 Pedestrian Male d". "Male d", really? This man probably deserves more than a letter.

And it's not just about filling in missing words. We also need to figure out what all the abbreviations mean. We're dealing with a whole different dialect here, filled with "VMA" and "EDP-m", which, if you're curious, stand for the maximum speed limit and a motorized personal mobility device. Who knew?

We're not going to use fancy generative AI models for this task, though. We need accuracy and control. These models are great at generating text that looks right, but they can make mistakes, and we can't afford that in data we'll be using for analysis. Plus, we need to understand exactly why a correction was made, and these models are black boxes: it's hard to know what's going on inside. We'll use a more transparent and reliable approach instead.

**First, we need to define the regular expressions that will help us capture the key pieces of information from those messy summaries.**

This is a **simplified, rule-based approach** to extracting structured data from the `Summary` column of the `accidents_cleaned.csv` dataset. We focus on the most common patterns and keep the regular expressions basic for clarity.


In [265]:
import re
import pandas as pd

# Load the dataset
df = pd.read_csv('../data/accidents_cleaned.csv', sep=';')

# Extract the 'Summary' column as a Series
summaries = df['Summary']

# Define the regex pattern
pattern_intersection = r"En agglomération, ([^,]+)"

# Apply the regex to each element in the Series
intersection_type = summaries.str.extract(pattern_intersection, expand=False)

# Display the results
intersection_type.value_counts()


Summary
Hors intersection       12825
En X                    10287
En T                     6633
En Y                     3039
Place                    2831
A plus de 4 branches     1226
En giratoire               70
Autre                      61
Passage à niveau            1
N/C                         1
Name: count, dtype: int64
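
The counts above add up to 36,974, while the lighting and weather counts further down each add up to 37,252, so a few hundred summaries apparently do not match this pattern. A quick way to see what is being missed is to list the rows that have a summary but no extracted value. The sketch below uses a tiny hypothetical sample; the "Hors agglomération" string is an invented example of a non-matching row, not something confirmed in the dataset.

```python
import pandas as pd

# Hypothetical sample; the second string is an invented non-matching case.
sample = pd.Series([
    "En agglomération, En T, Plein jour",
    "Hors agglomération, Hors intersection, Plein jour",
    None,
])

pattern_intersection = r"En agglomération, ([^,]+)"
extracted = sample.str.extract(pattern_intersection, expand=False)

# Rows that have a summary but where the pattern found nothing
unmatched = sample[extracted.isna() & sample.notna()]

print(f"matched: {extracted.notna().sum()} / {sample.notna().sum()}")
print(unmatched.tolist())
```

Running the same check on `summaries` would show which wordings the pattern needs to be widened for.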

In [266]:
# Skip the first three comma-separated fields, then capture everything up to ", avec une météo"
pattern_lighting = r"^(?:[^,]*,){3}\s*(.*?)\s*,\s*avec une météo"

# Apply the regex to each element in the Series
lighting_condition = summaries.str.extract(pattern_lighting, expand=False)

# Display the results
lighting_condition.value_counts()

Summary
Plein jour                               24023
Nuit avec éclairage public  allumé        8702
Crépuscule ou aube                        1973
Nuit avec éclairage public allumé         1930
Nuit sans éclairage public                 419
Nuit avec éclairage public non allumé      205
Name: count, dtype: int64
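
"Nuit avec éclairage public allumé" shows up twice in this output because one spelling has two spaces before "allumé". The same thing happens in the vehicle labels further down. One possible fix, sketched here on a small hypothetical sample, is to collapse runs of whitespace before counting.

```python
import pandas as pd

# Hypothetical sample reproducing the two spellings seen in the output
lighting = pd.Series([
    "Nuit avec éclairage public  allumé",   # two spaces
    "Nuit avec éclairage public allumé",    # one space
    "Plein jour",
])

# Collapse any run of whitespace into one space and trim both ends
lighting_clean = lighting.str.replace(r"\s+", " ", regex=True).str.strip()

print(lighting_clean.value_counts())
```

Applied to `lighting_condition`, this should merge the 8702 and 1930 rows into a single category.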

In [267]:
pattern_meteo = r"avec une météo\s+(.*?)\s+et\s+une\s+surface\s+chaussée"

weather_condition = summaries.str.extract(pattern_meteo, expand=False)

# Display the results
weather_condition.value_counts()

Summary
Normale                29941
Pluie légère            4546
Temps couvert           1699
Pluie forte              658
Temps éblouissant        190
Neige - grèle             89
Autre                     57
Vent fort - tempête       46
Brouillard - fumée        26
Name: count, dtype: int64

In [268]:
pattern_road_surface = r"et\s+une\s+surface\s+chaussée\s*:\s*(.*?)\."

road_surface = summaries.str.extract(pattern_road_surface, expand=False)

# Display or check unique results
road_surface.value_counts()

Summary
Normale               29369
Mouillée               7008
Non renseigné           619
Corps gras - huile       75
Flaques                  58
Autre                    51
Enneigée                 40
Verglacée                27
Inondée                   5
Name: count, dtype: int64

In [269]:
pattern_first_vehicle = r"1\s+(.*?)\s+circulant"

first_vehicle = summaries.str.extract(pattern_first_vehicle, expand=False)

first_vehicle.value_counts()


Summary
Véhicule de tourisme (VT)                                 15363
Bicyclette                                                 3635
Moto ou sidecar > 125 cm3                                  3216
VU seul 1,5T < PTAC <=3,5T                                 2911
Scooter <= 50 cm3                                          2780
Scooter  > 50 <= 125 cm3                                   1975
Cyclomoteur <=50 cm3                                       1105
Scooter > 125 cm3                                          1035
EDP-m                                                       909
Moto ou sidecar  > 50 <= 125 cm3                            850
3 RM > 125 cm3                                              613
Autobus                                                     513
Vélo par assistance électrique                              456
Moto ou sidecar > 50 <= 125 cm3                             424
Scooter > 50 <= 125 cm3                                     320
PL seul PTAC > 7,5T             

In [270]:
pattern_max_speed = r"VMA à (\d+)"

max_speed = summaries.str.extract(pattern_max_speed, expand=False)

max_speed.value_counts()

Summary
30     11799
50     11262
70      3518
90       167
20        66
25        46
10        33
5         16
1         14
15        12
2          5
3          5
110        5
4          2
6          2
300        1
80         1
60         1
35         1
45         1
40         1
Name: count, dtype: int64
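
The extracted speeds are still strings, and values such as 1, 2, 3 or 300 are unlikely to be real speed limits. Here is a minimal sketch for converting them to numbers and flagging the odd ones. The set of plausible limits is an assumption made for illustration, not something taken from the dataset documentation.

```python
import pandas as pd

# Hypothetical sample of extracted strings, as returned by str.extract
max_speed_raw = pd.Series(["30", "50", "1", "300", None, "70"])

# Strings -> nullable integers (missing values become <NA>)
max_speed_num = pd.to_numeric(max_speed_raw, errors="coerce").astype("Int64")

# ASSUMPTION: limits considered plausible; adjust to the data dictionary
PLAUSIBLE_LIMITS = [10, 20, 30, 50, 70, 80, 90, 110, 130]

# Flag values that are present but not in the plausible set
is_suspect = max_speed_num.notna() & ~max_speed_num.isin(PLAUSIBLE_LIMITS)

print(max_speed_num[is_suspect].tolist())
```

Whether to drop, correct, or keep the flagged rows is a judgment call for the analysis.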

We only capture the first letter of Masculin or Feminin because the text is truncated and sometimes shows only "Masc" or "Fe".

In [271]:
pattern_driver_sex = r"conduit\s+par\s+1\s+usager\s+([MFmf])\w*"

driver_sex = summaries.str.extract(pattern_driver_sex, expand=False)

driver_sex.value_counts()

Summary
M    28675
F     6149
Name: count, dtype: int64
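
To see why the first letter is enough, here is a small sketch that runs the same pattern on hypothetical fragments, including truncated ones, and then maps the letter back to a full label. The sample strings and the `driver_sex_label` name are made up for illustration.

```python
import pandas as pd

# Hypothetical fragments, some cut off mid-word
sample = pd.Series([
    "conduit par 1 usager Masculin de 30 ans",
    "conduit par 1 usager Masc",
    "conduit par 1 usager Fe",
    "conduit par 1 usager",        # cut before the sex appears
])

pattern_driver_sex = r"conduit\s+par\s+1\s+usager\s+([MFmf])\w*"
letters = sample.str.extract(pattern_driver_sex, expand=False)

# The pattern also accepts lowercase m/f, so normalise before mapping
driver_sex_label = letters.str.upper().map({"M": "Masculin", "F": "Feminin"})

print(driver_sex_label.tolist())
```

A summary cut before the sex is written gives a missing value rather than a wrong one.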

In [272]:
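# Capture the digits after "de"; the trailing "an"/"ans" is optional because the text may be truncated
pattern_driver_age = r"conduit\s+par\s+1\s+usager\s+\S+\s+de\s+(\d+)(?:\s+ans?)?"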

driver_age = summaries.str.extract(pattern_driver_age, expand=False)

# Display the age distribution (the values are still strings at this point)
driver_age.value_counts()


Summary
26     1011
27      993
25      992
31      972
28      960
       ... 
0         2
9         2
99        1
97        1
119       1
Name: count, Length: 98, dtype: int64
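
The ages come out as strings, and a few of them (0, 119) look more like data-entry problems than real drivers. A minimal sketch of a numeric conversion with a range check follows. The bounds are an assumption chosen for illustration, and `driver_age_clean` is a hypothetical name.

```python
import pandas as pd

# Hypothetical sample of extracted strings
driver_age_raw = pd.Series(["26", "0", "119", None, "45"])

# ASSUMPTION: ages outside these bounds are treated as missing
MIN_AGE, MAX_AGE = 10, 100

ages = pd.to_numeric(driver_age_raw, errors="coerce")   # float64, NaN when missing
in_range = ages.between(MIN_AGE, MAX_AGE)               # NaN compares as False
driver_age_clean = ages.where(in_range).astype("Int64") # out-of-range -> <NA>

print(driver_age_clean.tolist())
```

Replacing out-of-range values with missing ones keeps the row available for the other fields.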

In [273]:
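# Capture what follows "heurte 1", stopping before "conduit", a sex, "de", a parenthesis, or the end of the text
pattern_other_party = r"heurte\s+1\s+(.*?)(?=\s+(?:conduit|Masculin|Feminin|de)|\s*\(|$)"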

other_party = summaries.str.extract(pattern_other_party, expand=False)

# Display the results
other_party.value_counts()

Summary
Piéton                  2326
Véhicule                2268
Piét                     594
Pié                      385
Bicyclette               379
                        ... 
PL seul 3,                 1
Autre véhicule condu       1
Autre véhic                1
Autre véh                  1
Autocar co                 1
Name: count, Length: 343, dtype: int64
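
With 343 distinct values, many of them truncated versions of the same word ("Piét", "Pié"), this column needs consolidating before it can be counted. One transparent option is prefix matching against a list of known labels, sketched below. The `CANONICAL` list and the `expand_truncated` helper are hypothetical: the list is built only from complete values visible in the output above and would need to be extended.

```python
import pandas as pd

# ASSUMPTION: known full labels, taken from complete values seen in the output
CANONICAL = ["Piéton", "Véhicule", "Bicyclette", "Autre véhicule", "Autocar"]

def expand_truncated(label, min_len=3):
    """Map a truncated label to its canonical form when the match is unambiguous."""
    if not isinstance(label, str):
        return label                       # keep missing values as they are
    text = label.strip()
    if text in CANONICAL or len(text) < min_len:
        return label
    # Case 1: the label is a cut-off prefix of exactly one canonical label
    candidates = [c for c in CANONICAL if c.startswith(text)]
    if len(candidates) == 1:
        return candidates[0]
    # Case 2: a canonical label followed by a cut-off "conduit"
    for c in CANONICAL:
        rest = text[len(c):].strip()
        if text.startswith(c + " ") and rest and "conduit".startswith(rest):
            return c
    return label                           # unknown or ambiguous: leave untouched

sample = pd.Series(["Piét", "Pié", "Véhicule", "Autre véhic", "Autocar co", "PL seul 3,", None])
other_party_clean = sample.map(expand_truncated)

print(other_party_clean.tolist())
```

Anything the helper cannot resolve is left unchanged, so every correction can be traced back to an entry in the list.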

In [None]:
df['intersection_type'] = intersection_type
df['lighting_condition'] = lighting_condition
df['weather_condition'] = weather_condition
df['road_surface'] = road_surface
df['first_vehicle'] = first_vehicle
df['max_speed'] = max_speed
df['driver_sex'] = driver_sex
df['driver_age'] = driver_age
df['other_party'] = other_party

# Save the updated DataFrame to a new CSV file (using a semicolon separator)
df.to_csv('../data/accidents_parsed.csv', sep=';', index=False)