# Analysis of Paris Accident Data - Part 2: Parsing the Summary Field

We've cleaned our dataset, addressed language issues, and removed redundant columns. Now, we'll focus on extracting structured data from the free-text "Summary" field. This involves converting the narrative accident descriptions into analyzable, categorical data.

# The Challenge: Understanding the Narrative

Each "Summary" entry provides a brief account of an accident. However, these accounts are often incomplete, use abbreviations, and lack a consistent structure. Our goal is to impose structure on these narratives, assuming a consistent underlying pattern, even when information is missing or truncated.

For instance, a typical Summary, translated here into English (the raw text, and therefore the patterns below, are in French), might read:

"Minor accident, Non-fatal, In urban area, T-intersection, Daylight, with Normal weather and Normal road surface. 1 Passenger vehicle (PV) traveling on Municipal Road (MR) driven by 1 Male user, 30 years old (Ind) hits 1 Veh".

We aim to transform this text into structured data without relying on generative AI models, ensuring accuracy and transparency in our data transformation.

# Step 1: Data Loading - Setting the Stage

In [None]:
import re
import pandas as pd

# Load the dataset
df = pd.read_csv('../data/accidents_cleaned.csv', sep=';')

# Extract the summary column ('report_summary') as a Series
summaries = df['report_summary']


# Step 2: Define Our Battle Plan - Regular Expressions to the Rescue

We'll use regular expressions (regex) to extract specific data points from the summaries. A regular expression is a sequence of characters that defines a search pattern, which lets us pinpoint and extract the information we need. Our approach is rule-based and deliberately simple: it targets the most frequent patterns in the summaries and relies on strong assumptions about where each piece of information appears, when it is present at all. We start with the intersection type.

In [None]:

# Define the regex pattern
pattern_intersection = r"En agglomération, ([^,]+)"

# Apply the regex to each element in the Series
intersection_type = summaries.str.extract(pattern_intersection, expand=False)

# Display the results
intersection_type.value_counts()

report_summary
Hors intersection       12825
En X                    10287
En T                     6633
En Y                     3039
Place                    2831
A plus de 4 branches     1226
En giratoire               70
Autre                      61
Passage à niveau            1
N/C                         1
Name: count, dtype: int64
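
To make the mechanics of `str.extract` concrete, here is a minimal sketch on three made-up summaries. The sample strings are hypothetical and only loosely imitate the real data; the point is that the capture group `([^,]+)` grabs everything up to the next comma, and that rows without a match come back as missing values rather than raising an error.

```python
import pandas as pd

# Hypothetical sample summaries, invented for illustration only
sample = pd.Series([
    "Accident Léger, Non mortel, En agglomération, En T, Plein jour",
    "Accident Léger, Non mortel, En agglomération, Hors intersection, Plein jour",
    "Accident Léger, Non mortel",  # truncated: the anchor text is absent
])

pattern_intersection = r"En agglomération, ([^,]+)"

# expand=False returns a Series holding the single capture group
extracted = sample.str.extract(pattern_intersection, expand=False)
print(extracted.tolist())
```

The third row yields a missing value, which is how truncated summaries will show up in the columns we build later.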

In [2]:
pattern_lighting = r"^(?:[^,]*,){3}\s*(.*?)\s*,\s*avec une météo"

# Apply the regex to each element in the Series
lighting_condition = summaries.str.extract(pattern_lighting, expand=False)

# Display the results
lighting_condition.value_counts()

report_summary
Plein jour                               24023
Nuit avec éclairage public  allumé        8702
Crépuscule ou aube                        1973
Nuit avec éclairage public allumé         1930
Nuit sans éclairage public                 419
Nuit avec éclairage public non allumé      205
Name: count, dtype: int64
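
Note that "Nuit avec éclairage public allumé" appears twice in the counts above, once with a double space before "allumé". One possible way to merge such duplicates is to collapse repeated whitespace before counting. The snippet below is a sketch on hypothetical values, not a step taken in this notebook.

```python
import pandas as pd

# Hypothetical extracted values; the first one contains a double space
lighting = pd.Series([
    "Nuit avec éclairage public  allumé",
    "Nuit avec éclairage public allumé",
    "Plein jour",
])

# Collapse any run of whitespace into one space, then trim both ends
lighting_clean = lighting.str.replace(r"\s+", " ", regex=True).str.strip()

counts = lighting_clean.value_counts()
print(counts)
```

The same normalization could be applied to the other extracted columns if similar near-duplicates show up there.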

In [3]:
pattern_meteo = r"avec une météo\s+(.*?)\s+et\s+une\s+surface\s+chaussée"

weather_condition = summaries.str.extract(pattern_meteo, expand=False)

# Display the results
weather_condition.value_counts()

report_summary
Normale                29941
Pluie légère            4546
Temps couvert           1699
Pluie forte              658
Temps éblouissant        190
Neige - grèle             89
Autre                     57
Vent fort - tempête       46
Brouillard - fumée        26
Name: count, dtype: int64

In [4]:
pattern_road_surface = r"et\s+une\s+surface\s+chaussée\s*:\s*(.*?)\."

road_surface = summaries.str.extract(pattern_road_surface, expand=False)

# Display or check unique results
road_surface.value_counts()

report_summary
Normale               29369
Mouillée               7008
Non renseigné           619
Corps gras - huile       75
Flaques                  58
Autre                    51
Enneigée                 40
Verglacée                27
Inondée                   5
Name: count, dtype: int64

In [5]:
pattern_first_vehicle = r"1\s+(.*?)\s+circulant"

first_vehicle = summaries.str.extract(pattern_first_vehicle, expand=False)

first_vehicle.value_counts()


report_summary
Véhicule de tourisme (VT)                                 15363
Bicyclette                                                 3635
Moto ou sidecar > 125 cm3                                  3216
VU seul 1,5T < PTAC <=3,5T                                 2911
Scooter <= 50 cm3                                          2780
Scooter  > 50 <= 125 cm3                                   1975
Cyclomoteur <=50 cm3                                       1105
Scooter > 125 cm3                                          1035
EDP-m                                                       909
Moto ou sidecar  > 50 <= 125 cm3                            850
3 RM > 125 cm3                                              613
Autobus                                                     513
Vélo par assistance électrique                              456
Moto ou sidecar > 50 <= 125 cm3                             424
Scooter > 50 <= 125 cm3                                     320
PL seul PTAC > 7,5T      

In [6]:
pattern_max_speed = r"VMA à (\d+)"

max_speed = summaries.str.extract(pattern_max_speed, expand=False)

max_speed.value_counts()

report_summary
30     11799
50     11262
70      3518
90       167
20        66
25        46
10        33
5         16
1         14
15        12
2          5
3          5
110        5
4          2
6          2
300        1
80         1
60         1
35         1
45         1
40         1
Name: count, dtype: int64

We match only the first letter of Masculin or Feminin because the text is sometimes truncated and shows only Masc or Fe.

In [7]:
pattern_driver_sex = r"conduit\s+par\s+1\s+usager\s+([MFmf])\w*"

driver_sex = summaries.str.extract(pattern_driver_sex, expand=False)

driver_sex.value_counts()

report_summary
M    28675
F     6149
Name: count, dtype: int64
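
As a quick check that capturing a single letter copes with truncation, the sketch below runs the same pattern over a few hypothetical fragments (invented for illustration) using the standard `re` module.

```python
import re

pattern_driver_sex = r"conduit\s+par\s+1\s+usager\s+([MFmf])\w*"

# Hypothetical fragments: one complete, two cut off mid-word
fragments = [
    "conduit par 1 usager Masculin de 30 ans",
    "conduit par 1 usager Masc",
    "conduit par 1 usager Fe",
]

# group(1) is the single captured letter; \w* consumes whatever remains of the word
letters = [re.search(pattern_driver_sex, text).group(1) for text in fragments]
print(letters)  # ['M', 'M', 'F']
```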

In [8]:
pattern_driver_age = r"conduit\s+par\s+1\s+usager\s+\S+\s+de\s+(\d+)(?:\s+a[n]s?)?"

driver_age = summaries.str.extract(pattern_driver_age, expand=False)

# Display the results (the extracted ages are still strings at this point)
driver_age.value_counts()


report_summary
26     1011
27      993
25      992
31      972
28      960
       ... 
0         2
9         2
99        1
97        1
119       1
Name: count, Length: 98, dtype: int64
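
The extracted ages are strings, so any numeric analysis needs a conversion first. A minimal sketch of one way to do it, assuming we want to keep missing values as missing; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical extracted ages: strings, with None where the regex found no match
ages = pd.Series(["26", "31", None, "119"])

# errors="coerce" turns anything unparseable into NaN;
# "Int64" is pandas' nullable integer dtype, so missing values survive as <NA>
ages_numeric = pd.to_numeric(ages, errors="coerce").astype("Int64")
print(ages_numeric.max())
```

Once numeric, implausible values such as the 119 seen in the counts above become easy to spot and filter if needed.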

# Step 3: Store the Extracted Data - Data, Meet DataFrame

As we extract each piece of information, we add it as a new column to our DataFrame. We end up with new columns such as intersection_type, lighting_condition, and weather_condition, each holding a single category, which structures the data for future analysis.

In [9]:
df['intersection_type'] = intersection_type
df['lighting_condition'] = lighting_condition
df['weather_condition'] = weather_condition
df['road_surface'] = road_surface
df['first_vehicle'] = first_vehicle
df['max_speed'] = max_speed
df['first_vehicle_driver_sex'] = driver_sex
df['first_vehicle_driver_age'] = driver_age
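
Because the patterns rely on strong assumptions, it can be useful to see how often each one found nothing. The sketch below uses a small hypothetical DataFrame as a stand-in for the parsed columns; on the real data the same call could be applied to the columns added above.

```python
import pandas as pd

# Hypothetical stand-in for two of the parsed columns
demo = pd.DataFrame({
    "intersection_type": ["En T", None, "En X", "En Y"],
    "max_speed": ["30", "50", None, None],
})

# Fraction of rows where each pattern produced no match
missing_share = demo.isna().mean()
print(missing_share)
```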

# Step 4: Save the Updated Dataset

With all this newly parsed data integrated into our DataFrame, it is time to save our progress. We export the updated DataFrame to a new CSV file, accidents_parsed.csv, with df.to_csv('../data/accidents_parsed.csv', sep=';', index=False). The parsed dataset is then ready for analysis.

In [None]:
# Save the updated DataFrame to a new CSV file (using a semicolon separator)
df.to_csv('../data/accidents_parsed.csv', sep=';', index=False)
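
Several extracted values, such as "VU seul 1,5T < PTAC <=3,5T", contain commas, so it is worth confirming that they survive a write and read cycle with the semicolon separator. The sketch below does this in memory with a hypothetical one-row DataFrame, so it touches no files.

```python
import io
import pandas as pd

# Hypothetical one-row frame with a value that contains commas
demo = pd.DataFrame({
    "first_vehicle": ["VU seul 1,5T < PTAC <=3,5T"],
    "max_speed": [50],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, sep=";", index=False)

# Read it back with the same separator
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";")
print(restored)
```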