# Analysis of Paris accident data - Part 2: parsing the summary field

Looking at the row details, we seem to have enough information for an analysis of Paris traffic accidents. To go deeper, we can turn to the `Summary` column, which often offers a short, free-text narrative of how each accident occurred.

But let's be honest, the `Summary` field is a bit of a beast. We've got half-finished sentences, abbreviations that probably made sense to the person writing them but are a mystery to us, and a structure that changes from line to line. It's like the Wild West of data fields. Take this one, for instance: "Minor accident, Non-fatal, In urban area, T-intersection, Daylight, with Normal weather and Normal road surface. 1 Passenger vehicle (PV) traveling on Municipal Road (MR) driven by 1 Male user, 30 years old (Ind) hits 1 Veh". What, I ask you, is a "Veh"? Did they just give up halfway through typing? Here is another example that is sure to give us a hard time: "Minor accident, Non-fatal, In urban area, Not at an intersection, Daylight, with Normal weather and Normal road surface. 1 Bicycle traveling on Municipal Road (MR) ridden by 1 Male user, 19 years old (Ind) hits 1 Pedestrian Male d". "Male d", really? This man probably deserves more than a letter.
And it's not just about filling in missing words. We need to figure out what all the abbreviations mean. We're dealing with a whole different dialect here, filled with "VMA" and "EDP-m," which, if you're curious, stand for the maximum speed limit (*vitesse maximale autorisée*) and a motorized personal mobility device (*engin de déplacement personnel motorisé*). Who knew?
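One way to keep track of this dialect is a small glossary that annotates abbreviations in place. The following is only a minimal sketch: the `ABBREVIATIONS` dictionary and the `expand_abbreviations` helper are hypothetical names, and the glossary holds just the two abbreviations decoded above. The sample text is a fragment in the dataset's original French wording.

```python
import re

# Hypothetical glossary: only the abbreviations explained in this article.
ABBREVIATIONS = {
    "VMA": "maximum speed limit",
    "EDP-m": "motorized personal mobility device",
}


def expand_abbreviations(text, glossary=ABBREVIATIONS):
    """Append the meaning in square brackets after each known abbreviation."""
    # Try longer keys first so a longer abbreviation wins over a shorter prefix.
    keys = sorted(glossary, key=len, reverse=True)
    alternatives = "|".join(re.escape(key) for key in keys)
    # The lookarounds prevent matches inside longer words (e.g. "VMAX").
    pattern = re.compile(r"(?<!\w)(" + alternatives + r")(?!\w)")
    return pattern.sub(lambda m: f"{m.group(1)} [{glossary[m.group(1)]}]", text)


print(expand_abbreviations("1 EDP-m circulant sur Voie Communale VC (VMA à 30)"))
```

Keeping the original abbreviation next to its expansion means the annotated text can still be compared with the raw summary, which fits the goal of knowing exactly why each change was made.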

We're not going to use fancy generative AI models for this task, though. We need accuracy and control. These AI models are great at generating text that looks right, but they can make mistakes, and we can't afford that when dealing with data that we'll be using for analysis. Plus, we need to understand exactly why a correction was made, and these models are like black boxes: it's hard to know what's going on inside. We'll use a more transparent and reliable approach instead.

**First, we need to define the regular expressions that will help us capture the key pieces of information from those messy summaries.**

This is a **simplified, rule-based approach** to extracting structured data from the `Summary` column of the `accidents_cleaned.csv` dataset. We will focus on the most common patterns and use basic regular expressions for clarity.
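To make "rule-based" concrete, here is a minimal sketch of one rule applied to a single piece of text, using only the standard `re` module. The fragment is in the dataset's original French wording; the pattern, the group names, and the `parse_driver` helper are assumptions made for illustration, not the rules used in the rest of this notebook.

```python
import re

# A fragment of one summary, in the dataset's French wording (it is cut off).
fragment = (
    "1 Véhicule de tourisme (VT) circulant sur Voie Communale VC "
    "conduit par 1 usager Masculin de 45 ans (Ind)heurte 1 Scooter <= 50 cm"
)

# Hypothetical rule: named groups for the vehicle, the road, and the driver.
DRIVER_PATTERN = re.compile(
    r"1 (?P<vehicle>.+?) circulant sur (?P<road>.+?) "
    r"conduit par 1 usager (?P<sex>\w+) de (?P<age>\d+) ans"
)


def parse_driver(text):
    """Return the captured fields as a dict, or None if the rule does not match."""
    match = DRIVER_PATTERN.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["age"] = int(fields["age"])  # convert the age from text to a number
    return fields


print(parse_driver(fragment))
```

Because each field comes from an explicit capture group, a wrong or missing value can be traced back to the exact part of the pattern that produced it.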


In [16]:
import re
import pandas as pd

# Load the dataset
df = pd.read_csv('../data/accidents_cleaned.csv', sep=';')

# Extract the 'Summary' column as a Series
summaries = df['Summary']

# Define the regex pattern
pattern = r"En agglomération, ([^,]+)"

# Apply the regex to each element in the Series
matches = summaries.str.extract(pattern, expand=False)

# Display the results
matches.unique()


array(['Hors intersection', 'En Y', 'En X', 'En T', 'Place',
       'A plus de 4 branches', nan, 'En giratoire', 'Autre',
       'Passage à niveau', 'N/C'], dtype=object)
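The `nan` in this list is worth a note: `str.extract` returns NaN for every row where the pattern finds no match. Here that most likely means summaries that do not contain "En agglomération, " at all, or summaries that are missing. The plain-Python sketch below mimics that behavior row by row; the sample strings are hypothetical, shortened summaries (not rows from the dataset), and `extract_intersection` is an illustrative helper.

```python
import re
from collections import Counter

pattern = re.compile(r"En agglomération, ([^,]+)")

# Hypothetical, shortened summaries for illustration (not rows from the dataset).
samples = [
    "Accident Léger, En agglomération, En T, Plein jour",
    "Accident Léger, En agglomération, Hors intersection, Plein jour",
    "Accident Léger, Hors agglomération, En T, Plein jour",  # pattern is absent
    None,  # missing summary
]


def extract_intersection(summary):
    """Return the captured intersection type, or None when nothing matches."""
    if not isinstance(summary, str):
        return None
    match = pattern.search(summary)
    return match.group(1) if match else None


# Count every outcome, including the rows that produced no match.
counts = Counter(extract_intersection(s) for s in samples)
print(counts)
```

On the real Series, `matches.value_counts(dropna=False)` gives the same kind of tally and shows how many rows ended up as NaN.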

In [17]:
pattern_meteo = r"avec une météo\s+([^,]+)"

# Apply the regex to extract the "météo" information
meteo_matches = summaries.str.extract(pattern_meteo, expand=False)

# Display the unique weather conditions
unique_meteo = meteo_matches.unique()

print(unique_meteo)

['Normale et une surface chaussée : Non renseigné. 1 Cyclomoteur <=50 cm3 circulant sur Voie Communale VC conduit par 1 usager Masculin de 26 ans (Ind)heurte 1 P'
 'Normale et une surface chaussée : Normale. 1 Scooter > 125 cm3 circulant sur Voie Communale VC conduit par 1 usager Masculin de 30 ans (BL)'
 'Normale et une surface chaussée : Normale. 1 Véhicule de tourisme (VT) circulant sur Voie Communale VC conduit par 1 usager Masculin de 45 ans (Ind)heurte 1 Scooter <= 50 cm'
 ...
 'Normale et une surface chaussée : Autre. \n\n1 Moto ou sidecar  > 50 <= 125 cm3 circulant sur Voie Communale VC (VMA à 30) conduit par 1 usager Masculin de 51 ans (BH)'
 'Normale et une surface chaussée : Normale. \n\n1 Cyclomoteur <=50 cm3 circulant sur Voie Communale VC (VMA à 30) conduit par 1 usager Masculin de 35 ans (BL)\n\nheurte 1 Moto ou sid'
 'Normale et une surface chaussée : Normale. \n\n1 EDP-m circulant sur Voie Communale VC (VMA à 30) conduit par 1 usager Feminin de 22 ans (BL)\n\nheurte 1 