# Data inspection - Check redundancies and make the dataframe slimmer


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../data/accidents.csv', sep=';')

In [3]:
# First look at the dataframe
df.head(3)

Unnamed: 0,IdUsager,Date,PV,Arrondissement,Mode,Catégorie,Gravité,Age,Genre,Milieu,...,Blessés Légers,Blessés hospitalisés,Tué,Résumé,Coordonnées,Nom arrondissement,arronco,arrondgeo,Coordonnées.1,Nom arrondissement.1
0,2389401,2017-04-03,3527,75111,Piéton,Piéton,Blessé léger,62.0,Feminin,En-Agg,...,1.0,,,"Accident Léger non mortel, En agglomération, H...","48.855, 2.36867",Paris 4e Arrondissement,75104,"{""coordinates"": [[[[2.369123881, 48.853166231]...","48.855, 2.36867",Paris 4e Arrondissement
1,2388322,2017-08-28,9113,75108,2 Roues Motorisées,Conducteur,Blessé léger,30.0,Masculin,En-Agg,...,1.0,,,"Accident Léger non mortel, En agglomération, E...","48.8667, 2.3013",Paris 8e Arrondissement,75108,"{""coordinates"": [[[[2.301737288, 48.863496077]...","48.8667, 2.3013",Paris 8e Arrondissement
2,2394191,2017-11-06,11991,75117,2 Roues Motorisées,Conducteur,Blessé léger,37.0,Masculin,En-Agg,...,1.0,,,"Accident Léger non mortel, En agglomération, E...","48.8858, 2.32163",Paris 17e Arrondissement,75117,"{""coordinates"": [[[[2.303774362, 48.894153779]...","48.8858, 2.32163",Paris 17e Arrondissement


In [4]:
# Rename columns to English-friendly names for our beloved readers

df = df.rename(columns={
    'IdUsager': 'UserID',
    'Date': 'Date',
    'PV': 'ReportNumber',
    'Mode': 'TransportMode',
    'Catégorie': 'Category',
    'Gravité': 'Severity',
    'Age': 'Age',
    'Genre': 'Gender',
    'Milieu': 'Environment',
    'Adresse': 'Address',
    'Id accident': 'AccidentID',
    'PIM/BD PERIPHERIQUE': 'PeripheryInfo',
    "Tranche d'age": 'AgeGroup',
    'Blessés Légers': 'MinorInjuries',
    'Blessés hospitalisés': 'HospitalizedInjuries',
    'Tué': 'Fatalities',
    'Résumé': 'Summary',
    'Nom arrondissement': 'DistrictName',           
    'Nom arrondissement.1': 'DistrictName.1',
    'Coordonnées': 'Coordinates',
    'Coordonnées.1': 'Coordinates.1',
    'Arrondissement': 'District',
    'arronco': 'District_code'
})

df.head(3)


Unnamed: 0,UserID,Date,ReportNumber,District,TransportMode,Category,Severity,Age,Gender,Environment,...,MinorInjuries,HospitalizedInjuries,Fatalities,Summary,Coordinates,DistrictName,District_code,arrondgeo,Coordinates.1,DistrictName.1
0,2389401,2017-04-03,3527,75111,Piéton,Piéton,Blessé léger,62.0,Feminin,En-Agg,...,1.0,,,"Accident Léger non mortel, En agglomération, H...","48.855, 2.36867",Paris 4e Arrondissement,75104,"{""coordinates"": [[[[2.369123881, 48.853166231]...","48.855, 2.36867",Paris 4e Arrondissement
1,2388322,2017-08-28,9113,75108,2 Roues Motorisées,Conducteur,Blessé léger,30.0,Masculin,En-Agg,...,1.0,,,"Accident Léger non mortel, En agglomération, E...","48.8667, 2.3013",Paris 8e Arrondissement,75108,"{""coordinates"": [[[[2.301737288, 48.863496077]...","48.8667, 2.3013",Paris 8e Arrondissement
2,2394191,2017-11-06,11991,75117,2 Roues Motorisées,Conducteur,Blessé léger,37.0,Masculin,En-Agg,...,1.0,,,"Accident Léger non mortel, En agglomération, E...","48.8858, 2.32163",Paris 17e Arrondissement,75117,"{""coordinates"": [[[[2.303774362, 48.894153779]...","48.8858, 2.32163",Paris 17e Arrondissement


Let's create a summary of each column, including data types, missing values, and unique values. This helps us understand the data's structure and identify potential issues.

In [5]:
def column_summary(df):
    summary_data = []
    
    for col_name in df.columns:
        col_dtype = df[col_name].dtype
        num_of_nulls = df[col_name].isnull().sum()
        num_of_non_nulls = df[col_name].notnull().sum()
        num_of_distinct_values = df[col_name].nunique()
        
        summary_data.append({
            'col_name': col_name,
            'col_dtype': col_dtype,
            'num_of_nulls': num_of_nulls,
            'num_of_non_nulls': num_of_non_nulls,
            'num_of_distinct_values': num_of_distinct_values,
        })
    
    summary_df = pd.DataFrame(summary_data)
    return summary_df

summary_df = column_summary(df)
display(summary_df)

Unnamed: 0,col_name,col_dtype,num_of_nulls,num_of_non_nulls,num_of_distinct_values
0,UserID,int64,0,41211,41211
1,Date,object,0,41211,2550
2,ReportNumber,int64,0,41211,7105
3,District,int64,0,41211,40
4,TransportMode,object,0,41211,6
5,Category,object,0,41211,3
6,Severity,object,0,41211,3
7,Age,float64,0,41211,103
8,Gender,object,0,41211,2
9,Environment,object,0,41211,2


This summary suggests that some columns might be duplicates (e.g., DistrictName and DistrictName.1, Coordinates and Coordinates.1). We'll investigate these further.

In [6]:
district_mismatch = df[
    df['DistrictName'].notna() &
    (df['DistrictName'] != df['DistrictName.1'])
]
district_mismatch

Unnamed: 0,UserID,Date,ReportNumber,District,TransportMode,Category,Severity,Age,Gender,Environment,...,MinorInjuries,HospitalizedInjuries,Fatalities,Summary,Coordinates,DistrictName,District_code,arrondgeo,Coordinates.1,DistrictName.1
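
The filter above only inspects rows where `DistrictName` is present, so a row with a missing `DistrictName` but a filled `DistrictName.1` would go unnoticed. A minimal sketch of a comparison that treats two missing values as a match could look like this; the helper name `mismatched_rows` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def mismatched_rows(df, col_a, col_b):
    """Return rows where col_a and col_b differ, treating NaN/NaN as a match."""
    a, b = df[col_a], df[col_b]
    # fillna(False) keeps the mask a plain boolean even with nullable dtypes
    same = (a == b).fillna(False) | (a.isna() & b.isna())
    return df[~same]

# Toy data (hypothetical): the last row is missing in one column only
toy = pd.DataFrame({
    'DistrictName':   ['Paris 4e Arrondissement', None, None],
    'DistrictName.1': ['Paris 4e Arrondissement', None, 'Paris 8e Arrondissement'],
})

result = mismatched_rows(toy, 'DistrictName', 'DistrictName.1')
print(result)
```

On the real dataframe, an empty result from `mismatched_rows(df, 'DistrictName', 'DistrictName.1')` would be a slightly stronger guarantee than the filter above.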


The check returns no rows, so the two `DistrictName` columns are redundant. Next, I verify whether `Coordinates` and `Coordinates.1`, as well as `Longitude` and `Latitude`, contain exactly the same information. This is worth confirming before dropping duplicates. Below is a simple equality check for rows with mismatched coordinate data: an empty result means the columns match.


In [7]:
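# Rows where the two coordinate columns disagree
coord_mismatch = df[
    df['Coordinates'].notna() &
    (df['Coordinates'] != df['Coordinates.1'])
]
# display() is needed here: only a cell's last expression is shown automatically
display(coord_mismatch)

# Rows where "Latitude, Longitude" rebuilt as a string differs from Coordinates
latlong_mismatch = df[
    (df['Longitude'].notna()) &
    (df['Latitude'].notna()) &
    (df['Coordinates'].notna()) &
    (df.apply(lambda row: f"{row['Latitude']}, {row['Longitude']}", axis=1) != df['Coordinates'])
]
display(latlong_mismatch)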

Unnamed: 0,UserID,Date,ReportNumber,District,TransportMode,Category,Severity,Age,Gender,Environment,...,MinorInjuries,HospitalizedInjuries,Fatalities,Summary,Coordinates,DistrictName,District_code,arrondgeo,Coordinates.1,DistrictName.1


Since both resulting DataFrames (`coord_mismatch` and `latlong_mismatch`) are empty, I am confident that `Coordinates`, `Coordinates.1`, `Longitude`, and `Latitude` all store the same location details.
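
The string comparison used for `Latitude` and `Longitude` depends on how floats are formatted, so a value like `2.3013` must print exactly as it appears in `Coordinates`. A more tolerant variant, sketched below on toy data, parses the `Coordinates` string into numbers and compares them numerically. The tolerance of `1e-6` degrees and the deliberately wrong third longitude are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Coordinates': ['48.855, 2.36867', '48.8667, 2.3013', '48.8858, 2.32163'],
    'Latitude':  [48.855, 48.8667, 48.8858],
    'Longitude': [2.36867, 2.3013, 2.5],   # last value deliberately wrong
})

# Split "lat, lon" strings and convert each part to a float
parts = toy['Coordinates'].str.split(',', expand=True)
lat = pd.to_numeric(parts[0].str.strip())
lon = pd.to_numeric(parts[1].str.strip())

# Compare numerically with an absolute tolerance (rtol=0 disables the relative one)
close = (
    np.isclose(lat, toy['Latitude'], rtol=0, atol=1e-6)
    & np.isclose(lon, toy['Longitude'], rtol=0, atol=1e-6)
)
latlong_mismatch_numeric = toy[~close]
print(latlong_mismatch_numeric)
```

Rows with missing values compare as "not close", so on the real data they should be filtered out first, as the cell above does with `notna()`.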


### Arrondissement data quality check

From the summary step, we also notice other arrondissement-related columns, such as `District` and `District_code` (originally `arronco`), which might differ slightly. `District` (probably police data) and `District_code` (probably geocoding-based) can conflict near arrondissement boundaries.
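
Such conflicts can be listed by comparing the two code columns directly. The sketch below uses the three rows from the preview above as toy data. It assumes both columns use the same `751xx` code format, which should be checked on the full dataset, since the summary reports 40 distinct `District` values for Paris's 20 arrondissements.

```python
import pandas as pd

toy = pd.DataFrame({
    'UserID': [2389401, 2388322, 2394191],
    'District': [75111, 75108, 75117],        # probably police-reported
    'District_code': [75104, 75108, 75117],   # geocoded (renamed from arronco)
})

# Rows where the two sources disagree on the arrondissement
conflicts = toy[toy['District'] != toy['District_code']]
conflict_share = len(conflicts) / len(toy)

print(conflicts)
print(f"{conflict_share:.1%} of rows disagree")
```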

To visually confirm the boundary issue, I opened Google Maps at the coordinates of an accident with conflicting district data (48.855, 2.3687). 
![alt text](image-1.png)

I see that the accident occurred right where the 4th and 11th districts meet. The police data labeled it as the 11th, whereas the geocoding approach correctly placed it in the 4th.

**This validates my decision to keep only one reliable district-related column in the final dataset. I choose `arronco` (renamed `District_code`) because it is derived from geolocation, which makes it more precise, and it typically has fewer missing values.**

Below, I remove columns that are duplicates or less accurate versions of data I keep. I also convert the remaining columns to appropriate data types and save the dataframe to a new CSV file.

In [8]:
columns_to_drop = [
    'DistrictName',     # Redundant if we keep the geocoding-based District_code (arronco)
    'Coordinates',      # Redundant if we rely on Latitude/Longitude
    'DistrictName.1',   # Also redundant
    'District',         # Less reliable than District_code (arronco)
    'Coordinates.1',    # Also redundant
    'Champ13',          # Unused
    'arrondgeo'         # Polygon boundaries (not needed here)
]

df = df.drop(columns=columns_to_drop)

# Convert 'Date' column to datetime
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# List of numeric columns
numeric_cols = [
    'UserID', 'ReportNumber', 'Age', 'AccidentID', 
    'MinorInjuries', 'HospitalizedInjuries', 'Fatalities'
]

# Convert numeric columns to nullable integer types
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce').astype('Int64')

# List of string columns
string_cols = [
    'TransportMode', 'Category', 'Severity', 'Gender', 'Environment', 
    'Address', 'PeripheryInfo', 'AgeGroup', 'Summary'
]

# Convert string columns to the Pandas string dtype
df[string_cols] = df[string_cols].astype('string')

# Save the new df
df.to_csv('../data/accidents_cleaned.csv', index=False, sep=';')
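
A CSV file does not store dtypes, so the conversions above have to be reapplied when the cleaned file is read back. The following sketch shows one way to do that with `parse_dates` and `dtype`. It uses an in-memory buffer and a small toy frame in place of `accidents_cleaned.csv`, and the column selection is an assumption for illustration.

```python
import io
import pandas as pd

# Toy stand-in for the cleaned dataframe
cleaned = pd.DataFrame({
    'Date': pd.to_datetime(['2017-04-03', '2017-08-28']),
    'Age': pd.array([62, None], dtype='Int64'),
    'Severity': pd.array(['Blessé léger', 'Blessé léger'], dtype='string'),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False, sep=';')
buffer.seek(0)

# Restore the dtypes explicitly while reading
reloaded = pd.read_csv(
    buffer,
    sep=';',
    parse_dates=['Date'],
    dtype={'Age': 'Int64', 'Severity': 'string'},
)
print(reloaded.dtypes)
```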

## Data preparation - Let's look closer at the `Summary` column.

Looking at the row details, we seem to have enough information for an analysis of Paris traffic accidents. To go deeper, the `Summary` column often offers a short, free-text narrative about how each accident occurred.

Now, we're embarking on the data parsing phase of our project. This is where we take the raw, messy text summaries of traffic accidents and transform them into structured data that we can analyze. You might think this would be straightforward, but it's surprisingly tricky.

Here's the problem: These summaries are full of incomplete phrases, like in this example: "Minor accident, Non-fatal, In urban area, T-intersection, Daylight, with Normal weather and Normal road surface. 1 Passenger vehicle (PV) traveling on Municipal Road (MR) driven by 1 Male user, 30 years old (Ind) hits 1 Veh". What does "Veh" stand for? Or consider this one: "Minor accident, Non-fatal, In urban area, Not at an intersection, Daylight, with Normal weather and Normal road surface. 1 Bicycle traveling on Municipal Road (MR) ridden by 1 Male user, 19 years old (Ind) hits 1 Pedestrian Male d". What comes after "d"? We can guess, but we need to be certain. It's not just about filling in the blanks; the structure of the summaries varies, and they use a lot of specialized abbreviations, like "VMA" for the maximum speed limit or "EDP-m" for a motorized personal mobility device.

We're not going to use fancy generative AI models for this task, though. Why? Because we need absolute accuracy and control. These AI models are great at generating text that looks right, but they can make mistakes, and we can't afford that when dealing with data that we'll be using for analysis. Plus, we need to understand exactly why a correction was made, and these models are like black boxes – it's hard to know what's going on inside. We'll use a more transparent and reliable approach, combining carefully crafted rules with some clever techniques to handle these tricky text snippets.
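
To give an idea of what such rules might look like, here is a minimal sketch, not the actual parser. It expands the two abbreviations mentioned above and completes a truncated final word only when exactly one candidate exists, recording every rule that fired so each correction can be traced. The function names and the vocabulary lists are hypothetical, and the examples reuse the English renderings of the summaries quoted above.

```python
import re

# Abbreviations and expansions as described in the text above
ABBREVIATIONS = {
    'VMA': 'maximum speed limit',
    'EDP-m': 'motorized personal mobility device',
}

def expand_abbreviations(summary):
    """Expand known abbreviations and return the rules that were applied."""
    applied = []
    for abbr, full in ABBREVIATIONS.items():
        # Match the abbreviation only as a standalone token
        pattern = r'(?<!\w)' + re.escape(abbr) + r'(?!\w)'
        summary, count = re.subn(pattern, full, summary)
        if count:
            applied.append((abbr, full, count))
    return summary, applied

def complete_last_token(summary, vocabulary):
    """Complete a truncated final token only if exactly one word matches."""
    head, _, fragment = summary.rpartition(' ')
    candidates = [w for w in vocabulary if w.startswith(fragment) and w != fragment]
    if len(candidates) == 1:
        prefix = head + ' ' if head else ''
        return prefix + candidates[0], candidates
    # Zero or several candidates: leave the text untouched rather than guess
    return summary, candidates

# Hypothetical vocabularies, for illustration only
fixed, cands = complete_last_token('hits 1 Veh', ['Vehicle', 'Pedestrian'])
unchanged, ambiguous = complete_last_token('Pedestrian Male d', ['driver', 'deceased'])
text, applied = expand_abbreviations('Speed above VMA on EDP-m')

print(fixed, cands)
print(unchanged, ambiguous)
print(text, applied)
```

Returning the list of candidates alongside the text keeps ambiguous cases visible, so they can be reviewed by hand instead of being silently guessed.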