# Analysis of Paris Accident Data - Part 1

**Goal**: Help Paris mayors implement concrete solutions to reduce road accidents.

In this notebook we:
- Load the dataset, inspect its structure, and identify key attributes like demographics, location, road conditions, and accident timing.
- Rename columns to English-friendly names, detect missing values, and remove redundant or duplicate fields.
- Validate district and coordinate data, ensuring geolocation-based district codes are reliable.
- Drop unnecessary columns, standardize data types, convert relevant fields to booleans, and save the cleaned dataset for further analysis.

# Initial data assessment: A peek under the hood

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../data/accidents.csv', sep=';')

In [3]:
# First look at the top 3 rows of the dataframe
df.head(3)

Unnamed: 0,IdUsager,Date,PV,Arrondissement,Mode,Catégorie,Gravité,Age,Genre,Milieu,...,Blessés Légers,Blessés hospitalisés,Tué,Résumé,Coordonnées,Nom arrondissement,arronco,arrondgeo,Coordonnées.1,Nom arrondissement.1
0,2389401,2017-04-03,3527,75111,Piéton,Piéton,Blessé léger,62.0,Feminin,En-Agg,...,1.0,,,"Accident Léger non mortel, En agglomération, H...","48.855, 2.36867",Paris 4e Arrondissement,75104,"{""coordinates"": [[[[2.369123881, 48.853166231]...","48.855, 2.36867",Paris 4e Arrondissement
1,2388322,2017-08-28,9113,75108,2 Roues Motorisées,Conducteur,Blessé léger,30.0,Masculin,En-Agg,...,1.0,,,"Accident Léger non mortel, En agglomération, E...","48.8667, 2.3013",Paris 8e Arrondissement,75108,"{""coordinates"": [[[[2.301737288, 48.863496077]...","48.8667, 2.3013",Paris 8e Arrondissement
2,2394191,2017-11-06,11991,75117,2 Roues Motorisées,Conducteur,Blessé léger,37.0,Masculin,En-Agg,...,1.0,,,"Accident Léger non mortel, En agglomération, E...","48.8858, 2.32163",Paris 17e Arrondissement,75117,"{""coordinates"": [[[[2.303774362, 48.894153779]...","48.8858, 2.32163",Paris 17e Arrondissement


From this first look, the data appears to describe accident victims, including, but not limited to:

- who: the demographics of the people involved

- where: location info using geographical coordinates as well as administrative districts and address

- how: description of the road setting (urban, countryside, etc.) and means of transportation

- when: date and time of the accident

We need to assess what we're dealing with. The immediate objective is to:

- translate: Get those column names into plain English. We'll rename IdUsager to victim_ID for instance.

- standardize: Give the columns one consistent naming style for clarity. For instance, PIM/BD PERIPHERIQUE becomes periphery_info.

We do this in a few lines of code: we build a dictionary of old: new names and pass it to df.rename. Another quick df.head(3) confirms that the renaming worked.

In [4]:
# Rename columns to English-friendly names for our beloved readers

df = df.rename(columns={
    'IdUsager': 'victim_ID',
    'Date': 'accident_date',
    'PV': 'report_number',
    'Mode': 'victim_transport_mode',
    'Catégorie': 'victim_category',
    'Gravité': 'victim_injury_severity',
    'Age': 'victim_age',
    'Genre': 'victim_sex',
    'Milieu': 'environment',
    'Adresse': 'address',
    'Id accident': 'accident_ID',
    'PIM/BD PERIPHERIQUE': 'periphery_info',
    "Tranche d'age": 'victim_age_group',
    'Blessés Légers': 'victim_minor_injuries?',
    'Blessés hospitalisés': 'victim_hospitalized?',
    'Tué': 'victim_deceased?',
    'Résumé': 'report_summary',
    'Nom arrondissement': 'district_name',           
    'Nom arrondissement.1': 'district_name.1',
    'Coordonnées': 'coordinates',
    'Coordonnées.1': 'coordinates.1',
    'Arrondissement': 'district',
    'arronco': 'district_code',
    'Latitude': 'latitude',
    'Longitude': 'longitude'
})

df.head(3)


Unnamed: 0,victim_ID,accident_date,report_number,district,victim_transport_mode,victim_category,victim_injury_severity,victim_age,victim_sex,environment,...,victim_minor_injuries?,victim_hospitalized?,victim_deceased?,report_summary,coordinates,district_name,district_code,arrondgeo,coordinates.1,district_name.1
0,2389401,2017-04-03,3527,75111,Piéton,Piéton,Blessé léger,62.0,Feminin,En-Agg,...,1.0,,,"Accident Léger non mortel, En agglomération, H...","48.855, 2.36867",Paris 4e Arrondissement,75104,"{""coordinates"": [[[[2.369123881, 48.853166231]...","48.855, 2.36867",Paris 4e Arrondissement
1,2388322,2017-08-28,9113,75108,2 Roues Motorisées,Conducteur,Blessé léger,30.0,Masculin,En-Agg,...,1.0,,,"Accident Léger non mortel, En agglomération, E...","48.8667, 2.3013",Paris 8e Arrondissement,75108,"{""coordinates"": [[[[2.301737288, 48.863496077]...","48.8667, 2.3013",Paris 8e Arrondissement
2,2394191,2017-11-06,11991,75117,2 Roues Motorisées,Conducteur,Blessé léger,37.0,Masculin,En-Agg,...,1.0,,,"Accident Léger non mortel, En agglomération, E...","48.8858, 2.32163",Paris 17e Arrondissement,75117,"{""coordinates"": [[[[2.303774362, 48.894153779]...","48.8858, 2.32163",Paris 17e Arrondissement
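
A quick df.head(3) only shows part of the columns, so a programmatic check can help here. The sketch below, on a toy frame, assumes a mapping stored in a variable called `rename_map` (a hypothetical name, not used in the notebook). It relies on the fact that `DataFrame.rename` silently skips keys that match no column by default.

```python
import pandas as pd

# Toy frame standing in for the raw data (only three of the real columns).
raw = pd.DataFrame({"IdUsager": [1], "Date": ["2017-04-03"], "arrondgeo": ["{}"]})

# Subset of the rename mapping; 'Adresse' is deliberately absent from `raw`.
rename_map = {"IdUsager": "victim_ID", "Date": "accident_date", "Adresse": "address"}

renamed = raw.rename(columns=rename_map)

# Keys that matched no column: rename() ignores these without raising.
unmatched_keys = sorted(set(rename_map) - set(raw.columns))

# Columns that kept their original name because the mapping has no entry for them.
untouched_columns = [c for c in raw.columns if c not in rename_map]

print("unmatched keys:", unmatched_keys)
print("untouched columns:", untouched_columns)
```

Passing `errors='raise'` to `rename` is another option if a missing key should stop the notebook instead of being reported.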


# Objective 2: Data inspection - What's what?

To get a better grip on the dataset and evaluate its quality, we need to look at each column in detail, including data types and missing information. So we want to:

- check data types: Are numbers stored as numbers? Are strings actually strings?

- spot missing values: Find those pesky NaNs hiding in our data.

- count unique values: See how many different values are in each column to check for any obvious issues.

In [5]:
def column_summary(df):
    summary_data = []
    
    for col_name in df.columns:
        col_dtype = df[col_name].dtype
        num_of_nulls = df[col_name].isnull().sum()
        num_of_non_nulls = df[col_name].notnull().sum()
        num_of_distinct_values = df[col_name].nunique()
        
        summary_data.append({
            'col_name': col_name,
            'col_dtype': col_dtype,
            'num_of_nulls': num_of_nulls,
            'num_of_non_nulls': num_of_non_nulls,
            'num_of_distinct_values': num_of_distinct_values,
        })
    
    summary_df = pd.DataFrame(summary_data)
    return summary_df
summary_df = column_summary(df)
display(summary_df)

Unnamed: 0,col_name,col_dtype,num_of_nulls,num_of_non_nulls,num_of_distinct_values
0,victim_ID,int64,0,41211,41211
1,accident_date,object,0,41211,2550
2,report_number,int64,0,41211,7105
3,district,int64,0,41211,40
4,victim_transport_mode,object,0,41211,6
5,victim_category,object,0,41211,3
6,victim_injury_severity,object,0,41211,3
7,victim_age,float64,0,41211,103
8,victim_sex,object,0,41211,2
9,environment,object,0,41211,2
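
The same three measures can also be read off without an explicit loop. This is a minimal sketch on a toy frame (the values are made up), not a replacement for the function above. Note that `nunique()` leaves missing values out of the count by default.

```python
import pandas as pd

# Toy frame with one missing age.
toy = pd.DataFrame({
    "victim_age": [62.0, 30.0, None],
    "victim_sex": ["Feminin", "Masculin", "Masculin"],
})

# Each right-hand side is a Series indexed by column name, so they align into one table.
summary = pd.DataFrame({
    "col_dtype": toy.dtypes.astype(str),
    "num_of_nulls": toy.isna().sum(),
    "num_of_non_nulls": toy.notna().sum(),
    "num_of_distinct_values": toy.nunique(),  # missing values are not counted
})

print(summary)
```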


This summary suggests that some columns might be duplicates (e.g., district_name and district_name.1, coordinates and coordinates.1). We'll investigate these further.

# Objective 3: Duplicate detection - Are we seeing double?

It appears we might have some redundant columns. Two district_name columns? Two coordinates columns?

We use the checks below to look for row-level mismatches between district_name and district_name.1 on the one hand, and between coordinates, coordinates.1, and the latitude/longitude pair on the other.
Pinpointing duplicated data like this matters so that it does not skew the results of any further analysis.

In [6]:
district_mismatch = df[
    df['district_name'].notna() &
    (df['district_name'] != df['district_name.1'])
]

In [7]:
coord_mismatch = df[
    df['coordinates'].notna() &
    (df['coordinates'] != df['coordinates.1'])
]

latlong_mismatch = df[
    (df['longitude'].notna()) & 
    (df['latitude'].notna()) &
    (df['coordinates'].notna()) &
    (df.apply(lambda row: f"{row['latitude']}, {row['longitude']}", axis=1) != df['coordinates'])
]
latlong_mismatch

Unnamed: 0,victim_ID,accident_date,report_number,district,victim_transport_mode,victim_category,victim_injury_severity,victim_age,victim_sex,environment,...,victim_minor_injuries?,victim_hospitalized?,victim_deceased?,report_summary,coordinates,district_name,district_code,arrondgeo,coordinates.1,district_name.1


Our checks show no discrepancies between district_name and district_name.1. The same comparison on the coordinate columns also reveals no differences, confirming that coordinates, coordinates.1, and the latitude/longitude pair carry identical location information. So one copy of each will have to go, to reduce clutter.
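
An empty dataframe is easy to misread, so a plain count can be a handy companion to these checks. The helper below, `count_mismatches`, is a hypothetical name introduced for illustration. It is a sketch on toy data that treats a pair of missing values as a match rather than a difference.

```python
import pandas as pd

# Toy data: row 0 matches, row 1 is missing on both sides, row 2 differs.
toy = pd.DataFrame({
    "district_name": ["Paris 4e Arrondissement", None, "Paris 8e Arrondissement"],
    "district_name.1": ["Paris 4e Arrondissement", None, "Paris 17e Arrondissement"],
})


def count_mismatches(frame, left, right):
    """Count rows where two columns differ, treating missing/missing pairs as equal."""
    both_missing = frame[left].isna() & frame[right].isna()
    differs = frame[left] != frame[right]
    return int((differs & ~both_missing).sum())


n_mismatch = count_mismatches(toy, "district_name", "district_name.1")
print(n_mismatch)
```

A count of zero on the real columns would back up the claim that the two columns are interchangeable.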

# Objective 4: Address district data discrepancies

We have multiple columns telling us where the accident happened. Which one to trust?

district_code was derived from geolocation. It's typically more precise, especially for accidents close to a district boundary.
The dataset also has the district recorded in the police report, but hey, if one says 4th and the other says 11th, we need to figure out which is correct. A quick look at coordinates near boundary lines suggests geocoding usually wins:

![alt text](image-1.png)

*I see that the accident occurred right where the 4th and 11th districts meet. The police data labeled it as the 11th, whereas the geocoding approach correctly placed it in the 4th.*
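
To see how often the two sources disagree, the two code columns can be compared row by row. This is a sketch on toy values that mirror the three rows displayed earlier. It assumes `district` and `district_code` use the same 751xx coding, which should be checked on the full data first.

```python
import pandas as pd

# Toy rows mirroring the first three records shown earlier.
toy = pd.DataFrame({
    "district": [75111, 75108, 75117],       # district from the police report
    "district_code": [75104, 75108, 75117],  # district derived from geolocation
})

# Rows where the police-reported district and the geocoded one disagree.
disagree = toy[toy["district"] != toy["district_code"]]
share_disagreeing = len(disagree) / len(toy)

print(disagree)
print(f"{share_disagreeing:.0%} of rows disagree")
```

Plotting a handful of the disagreeing rows on a map, as in the image above, is then a reasonable way to judge which source to trust.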

# Objective 5: Data slimming and cleaning - Out with the old, in with the new!


Time for some digital decluttering and polishing of what remains! Our aim is to:

- Drop redundant columns: If two columns tell the same story, one has to go!

- Correct data types: Make sure numbers are treated as numbers, text as text, and turn 1s and 0s into proper booleans.

- Fix categorical value labels: Homogenize victim_sex under an M/F coding, and remove the redundancy in the injury severity indicators by converting them to bool type.

In [8]:
columns_to_drop = [
    'district_name',     # Redundant if we keep geocoding-based district_code
    'district_name.1',   # Also redundant
    'coordinates',       # Redundant if we rely on Latitude/Longitude
    'coordinates.1',     # Also redundant
    'district',          # Less reliable than "district_code"
    'Champ13',          # Redundant
    'victim_injury_severity', # Redundant
    'victim_ID', # Not needed for analysis
    'report_number', # Not needed for analysis
]
df = df.drop(columns=columns_to_drop)

# Convert the 'accident_date' column to datetime
df['accident_date'] = pd.to_datetime(df['accident_date'], errors='coerce')

# List of integer columns
int_cols = [
    'victim_age', 'accident_ID', 
    'victim_minor_injuries?', 'victim_hospitalized?', 'victim_deceased?',
]

# List of float columns
float_cols = ['latitude', 'longitude']

# Convert integer columns
df[int_cols] = df[int_cols].apply(pd.to_numeric, errors='coerce').astype('Int64')

# Convert float columns
df[float_cols] = df[float_cols].apply(pd.to_numeric, errors='coerce').astype('float64')

# List of string columns
string_cols = [
    'victim_transport_mode', 'victim_category', 
    'victim_sex', 'environment', 'address', 'periphery_info', 
    'victim_age_group', 'report_summary'
]


# Convert string columns to the Pandas string dtype
df[string_cols] = df[string_cols].astype('string')

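# Convert `victim_minor_injuries?`, `victim_hospitalized?`, and `victim_deceased?` to booleans.
# These columns are nullable Int64 at this point, so compare with 1 and treat
# missing values as False instead of testing the truth value of pd.NA.
bool_cols = ['victim_minor_injuries?', 'victim_hospitalized?', 'victim_deceased?']
df[bool_cols] = df[bool_cols].eq(1).fillna(False).astype(bool)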

# Update `victim_sex` to use `M` or `F`
df['victim_sex'] = df['victim_sex'].map({'Masculin': 'M', 'Feminin': 'F'})
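
The three injury flags are mostly empty, so the boolean conversion has to decide what a missing value means. The sketch below, on made-up toy values, shows one way to map missing flags to False with vectorized calls. The choice of False for missing is an assumption that fits flags where only 1 is ever recorded.

```python
import pandas as pd

# Toy flags: 1.0 where the outcome applies, missing otherwise.
toy = pd.DataFrame({
    "victim_minor_injuries?": [1.0, None, None],
    "victim_hospitalized?": [None, 1.0, None],
})

# Same numeric conversion as in the cleaning cell: nullable integers.
flags = toy.apply(pd.to_numeric, errors="coerce").astype("Int64")

# eq(1) yields True / False / <NA>; fill the <NA> with False, then use plain bool.
as_bool = flags.eq(1).fillna(False).astype(bool)

print(as_bool)
```

Keeping the nullable `boolean` dtype instead of filling would be the alternative if "unknown" needs to stay distinct from "no".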

# Objective 6: Save progress

All this hard work won't be for nothing. Once we have cleaned the data we save the result into a new, spiffier CSV file called accidents_cleaned.csv.


In [9]:
# Save the new df
df.to_csv('../data/accidents_cleaned.csv', index=False, sep=';')
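
One thing to keep in mind: a CSV file stores text only, so the dtypes set above are not saved with it. The sketch below uses an in-memory buffer and a toy frame to show how they could be restored on reload. The columns chosen are a small, made-up subset.

```python
import io

import pandas as pd

# Toy stand-in for the cleaned dataframe.
cleaned = pd.DataFrame({
    "accident_date": pd.to_datetime(["2017-04-03", "2017-08-28"]),
    "victim_age": pd.array([62, 30], dtype="Int64"),
    "victim_deceased?": [False, True],
})

# Write to an in-memory buffer instead of a file, with the same options as above.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False, sep=";")
buffer.seek(0)

# On reload, dates and nullable integers have to be requested explicitly.
reloaded = pd.read_csv(
    buffer,
    sep=";",
    parse_dates=["accident_date"],
    dtype={"victim_age": "Int64"},
)

print(reloaded.dtypes)
```

A binary format that stores dtypes, such as Parquet, would avoid this step, at the cost of an extra dependency.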

# Next up: Cracking the report_summary column and unveiling the hidden stories

The cleaned columns already offer enough insight for a good start, but it would be even better to dig into the unstructured, free-text data in the report_summary column. Each summary is like a tiny story, waiting to be decoded.

Examples:
"Minor accident, Non-fatal, In urban area, T-intersection..."
"... hits 1 Veh"