# Analysis of Paris Accident Data - Part 1: Data Cleaning - From Raw to Refined

**Introduction: Setting the Stage**

Welcome to the first step in our journey to make the streets of Paris safer! Our grand mission is to analyze accident data and provide the city's decision-makers with **concrete, actionable insights** to reduce road incidents.
This notebook focuses on the crucial first step: **data cleaning and preparation**. We'll take the raw dataset, scrub it clean, and get it ready for deeper analysis.

**Notebook Objectives:**

1.  **Initial Data Assessment:** Load the data and get a feel for its structure and contents.
2.  **Translation & Standardization:** Rename columns to make them more understandable (English-friendly!).
3.  **Data Inspection:** Examine data types, missing values, and unique values in each column.
4.  **Duplicate Detection:** Identify and eliminate any redundant columns.
5.  **District Data Discrepancies:** Resolve inconsistencies in location data.
6.  **Data Slimming and Cleaning:** Drop unnecessary columns, correct data types, and standardize categorical values.
7.  **Save Progress:** Preserve our cleaned data for future use.

Let's dive in!

### Step 1: Initial Data Assessment - A Peek Under the Hood

In [1]:
import pandas as pd

# Load the dataset.  Think of this as opening the treasure chest!
df = pd.read_csv('../data/accidents.csv', sep=';')

# Take a quick look at the first 3 rows.  Just a sneak peek!
df.head(3)

Unnamed: 0,IdUsager,Date,PV,Arrondissement,Mode,Catégorie,Gravité,Age,Genre,Milieu,...,Blessés Légers,Blessés hospitalisés,Tué,Résumé,Coordonnées,Nom arrondissement,arronco,arrondgeo,Coordonnées.1,Nom arrondissement.1
0,2389401,2017-04-03,3527,75111,Piéton,Piéton,Blessé léger,62.0,Feminin,En-Agg,...,1.0,,,"Accident Léger non mortel, En agglomération, H...","48.855, 2.36867",Paris 4e Arrondissement,75104,"{""coordinates"": [[[[2.369123881, 48.853166231]...","48.855, 2.36867",Paris 4e Arrondissement
1,2388322,2017-08-28,9113,75108,2 Roues Motorisées,Conducteur,Blessé léger,30.0,Masculin,En-Agg,...,1.0,,,"Accident Léger non mortel, En agglomération, E...","48.8667, 2.3013",Paris 8e Arrondissement,75108,"{""coordinates"": [[[[2.301737288, 48.863496077]...","48.8667, 2.3013",Paris 8e Arrondissement
2,2394191,2017-11-06,11991,75117,2 Roues Motorisées,Conducteur,Blessé léger,37.0,Masculin,En-Agg,...,1.0,,,"Accident Léger non mortel, En agglomération, E...","48.8858, 2.32163",Paris 17e Arrondissement,75117,"{""coordinates"": [[[[2.303774362, 48.894153779]...","48.8858, 2.32163",Paris 17e Arrondissement


The data appears to describe accident victims, providing details about:

*   **Who:** Demographics of those involved (age, sex, etc.)
*   **Where:** Location information (coordinates, district, address)
*   **How:** Description of the road environment and mode of transport
*   **When:** Date and time of the accident

### Step 2: Translation & Standardization - Making it Readable

We need to make this data more user-friendly. The first step is to translate those French column names into English. We'll also standardize the names for consistency.

In [2]:
# Create a dictionary to map old column names to new ones.  Like a translation guide!
column_mapping = {
    'IdUsager': 'victim_ID',
    'Date': 'accident_date',
    'PV': 'report_number',
    'Mode': 'victim_transport_mode',
    'Catégorie': 'victim_category',
    'Gravité': 'victim_injury_severity',
    'Age': 'victim_age',
    'Genre': 'victim_sex',
    'Milieu': 'environment',
    'Adresse': 'address',
    'Id accident': 'accident_ID',
    'PIM/BD PERIPHERIQUE': 'periphery_info',
    "Tranche d'age": 'victim_age_group',
    'Blessés Légers': 'victim_minor_injuries?',
    'Blessés hospitalisés': 'victim_hospitalized?',
    'Tué': 'victim_deceased?',
    'Résumé': 'report_summary',
    'Nom arrondissement': 'district_name',
    'Nom arrondissement.1': 'district_name.1',
    'Coordonnées': 'coordinates',
    'Coordonnées.1': 'coordinates.1',
    'Arrondissement': 'district',
    'arronco': 'district_code',
    'Latitude': 'latitude',
    'Longitude': 'longitude'
}

# Rename the columns using our dictionary.  Voilà!
df = df.rename(columns=column_mapping)

# Check our work – did the renaming work as expected?
df.head(3)

Unnamed: 0,victim_ID,accident_date,report_number,district,victim_transport_mode,victim_category,victim_injury_severity,victim_age,victim_sex,environment,...,victim_minor_injuries?,victim_hospitalized?,victim_deceased?,report_summary,coordinates,district_name,district_code,arrondgeo,coordinates.1,district_name.1
0,2389401,2017-04-03,3527,75111,Piéton,Piéton,Blessé léger,62.0,Feminin,En-Agg,...,1.0,,,"Accident Léger non mortel, En agglomération, H...","48.855, 2.36867",Paris 4e Arrondissement,75104,"{""coordinates"": [[[[2.369123881, 48.853166231]...","48.855, 2.36867",Paris 4e Arrondissement
1,2388322,2017-08-28,9113,75108,2 Roues Motorisées,Conducteur,Blessé léger,30.0,Masculin,En-Agg,...,1.0,,,"Accident Léger non mortel, En agglomération, E...","48.8667, 2.3013",Paris 8e Arrondissement,75108,"{""coordinates"": [[[[2.301737288, 48.863496077]...","48.8667, 2.3013",Paris 8e Arrondissement
2,2394191,2017-11-06,11991,75117,2 Roues Motorisées,Conducteur,Blessé léger,37.0,Masculin,En-Agg,...,1.0,,,"Accident Léger non mortel, En agglomération, E...","48.8858, 2.32163",Paris 17e Arrondissement,75117,"{""coordinates"": [[[[2.303774362, 48.894153779]...","48.8858, 2.32163",Paris 17e Arrondissement
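One thing to keep in mind: `DataFrame.rename` ignores mapping keys that match no column by default (`errors='ignore'`), so a typo in the dictionary would go unnoticed. Below is a minimal sketch of a sanity check. It runs on a small made-up frame rather than the real dataset, and the helper name `unmatched_keys` is hypothetical.

```python
import pandas as pd

def unmatched_keys(df: pd.DataFrame, mapping: dict) -> list:
    """Return the mapping keys that are not column names of df."""
    return [key for key in mapping if key not in df.columns]

# Toy frame standing in for the raw data (illustrative values only).
toy = pd.DataFrame({'Age': [62.0, 30.0], 'Genre': ['Feminin', 'Masculin']})
toy_mapping = {'Age': 'victim_age', 'Genre': 'victim_sex', 'Adresse': 'address'}

# Keys that rename() would skip without any warning.
missing = unmatched_keys(toy, toy_mapping)
print(missing)

# The rename itself still succeeds; the unmatched key is simply ignored.
renamed = toy.rename(columns=toy_mapping)
```

Passing `errors='raise'` to `rename` is a stricter alternative: it raises a `KeyError` when a key is missing instead of skipping it.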


### Step 3: Data Inspection - What's What?

Now, let's get a detailed overview of each column. We'll check data types, count missing values (NaNs), and see how many unique values are present. This helps us assess data quality and plan our next steps.

In [3]:
def column_summary(df):
    """
    Creates a summary DataFrame for each column in the input DataFrame.

    Args:
        df (pd.DataFrame): The input DataFrame.

    Returns:
        pd.DataFrame: A summary DataFrame with column information.
    """
    summary_data = []

    for col_name in df.columns:
        col_dtype = df[col_name].dtype  # Data type of the column
        num_of_nulls = df[col_name].isnull().sum()  # Count of missing values
        num_of_non_nulls = df[col_name].notnull().sum()  # Count of non-missing values
        num_of_distinct_values = df[col_name].nunique()  # Count of unique values

        summary_data.append({
            'col_name': col_name,
            'col_dtype': col_dtype,
            'num_of_nulls': num_of_nulls,
            'num_of_non_nulls': num_of_non_nulls,
            'num_of_distinct_values': num_of_distinct_values,
        })

    summary_df = pd.DataFrame(summary_data)
    return summary_df

# Generate and display the summary.  Like a data report card!
summary_df = column_summary(df)
display(summary_df)

Unnamed: 0,col_name,col_dtype,num_of_nulls,num_of_non_nulls,num_of_distinct_values
0,victim_ID,int64,0,41211,41211
1,accident_date,object,0,41211,2550
2,report_number,int64,0,41211,7105
3,district,int64,0,41211,40
4,victim_transport_mode,object,0,41211,6
5,victim_category,object,0,41211,3
6,victim_injury_severity,object,0,41211,3
7,victim_age,float64,0,41211,103
8,victim_sex,object,0,41211,2
9,environment,object,0,41211,2


This summary suggests potential duplicate columns (e.g., `district_name` and `district_name.1`). We'll investigate these next.

### Step 4: Duplicate Detection - Are We Seeing Double?

It looks like we might have some redundant columns. Do `district_name` and `district_name.1` *really* contain different information? What about the coordinate columns? Let's find out!

In [4]:
# Check for rows where 'district_name' and 'district_name.1' disagree.
district_mismatch = df[
    df['district_name'].notna() &  # Ensure we're comparing valid values
    (df['district_name'] != df['district_name.1'])
]
print(f"District mismatches: {len(district_mismatch)}")


# Check for rows where 'coordinates' and 'coordinates.1' disagree.
coord_mismatch = df[
    df['coordinates'].notna() &
    (df['coordinates'] != df['coordinates.1'])
]
print(f"Coordinate mismatches: {len(coord_mismatch)}")

# Check if coordinates, latitude, and longitude are consistent.
latlong_mismatch = df[
    (df['longitude'].notna()) &
    (df['latitude'].notna()) &
    (df['coordinates'].notna()) &
    (df.apply(lambda row: f"{row['latitude']}, {row['longitude']}", axis=1) != df['coordinates'])
]
print(f"Lat/Long mismatches: {len(latlong_mismatch)}")

District mismatches: 0
Coordinate mismatches: 0
Lat/Long mismatches: 0


Our checks reveal no discrepancies! The duplicate columns contain identical information. This means we can safely remove one of each pair to streamline our data.
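The pairwise checks above could also be generalized. The sketch below compares every pair of columns and reports the ones that are identical row by row. It is an illustration on toy data, not the method used in this notebook, and the function name `identical_column_pairs` is an assumption.

```python
from itertools import combinations

import pandas as pd

def identical_column_pairs(df: pd.DataFrame) -> list:
    """Return (col_a, col_b) pairs whose values are equal row by row.

    Series.equals treats missing values in the same position as equal
    and ignores the Series name.
    """
    pairs = []
    for col_a, col_b in combinations(df.columns, 2):
        if df[col_a].equals(df[col_b]):
            pairs.append((col_a, col_b))
    return pairs

# Toy frame modeled on the columns seen above (illustrative values only).
toy = pd.DataFrame({
    'coordinates': ['48.855, 2.36867', None, '48.8858, 2.32163'],
    'coordinates.1': ['48.855, 2.36867', None, '48.8858, 2.32163'],
    'district_code': [75104, 75108, 75117],
})

pairs = identical_column_pairs(toy)
print(pairs)
```

The number of comparisons grows quadratically with the column count, which is fine for a table of a few dozen columns.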

### Step 5: Address District Data Discrepancies - Location, Location, Location!

We have multiple columns telling us where the accident happened. Which one to trust?

`district_code` was derived from geolocation, so it is typically more precise, especially for accidents near district boundaries.
The police-recorded `district` column sometimes disagrees with it: in the first row shown above, `district` is 75111 (11th) while `district_code` is 75104 (4th), so we need to decide which one is correct. A quick look at coordinates near boundary lines suggests geocoding usually wins:

![Map of an accident located on the boundary between the 4th and 11th districts](image-1.png)

*The accident occurred right where the 4th and 11th districts meet. The police data labeled it as the 11th, whereas the geocoding approach correctly placed it in the 4th.*
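To get a sense of how often the two sources disagree, one could count the rows where `district` and `district_code` differ. The sketch below uses only the three sample rows displayed earlier, so the printed share says nothing about the full dataset.

```python
import pandas as pd

# The three sample rows shown in the head() output above.
toy = pd.DataFrame({
    'district': [75111, 75108, 75117],       # police-recorded district
    'district_code': [75104, 75108, 75117],  # geocoded district
})

# Rows where the two sources name different districts.
disagree = toy[toy['district'] != toy['district_code']]
share = len(disagree) / len(toy)
print(f"{len(disagree)} of {len(toy)} rows disagree ({share:.0%})")
```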

### Step 6: Data Slimming and Cleaning - Out with the Old, In with the New!

Time for some serious data cleaning! We'll:

*   **Drop redundant columns:** Goodbye, duplicates!
*   **Correct data types:** Ensure numbers are numeric, text is string, and 1s/0s are booleans.
*   **Standardize categorical values:** Clean up inconsistencies in text values (e.g., "M"/"F" for sex).

In [5]:
# Columns to drop (redundant or unnecessary for our analysis).
columns_to_drop = [
    'district_name',     # Redundant - using geocoded district_code
    'district_name.1',   # Redundant
    'coordinates',       # Redundant - using latitude/longitude
    'coordinates.1',     # Redundant
    'district',          # Less reliable than district_code
    'Champ13',          # Redundant
    'victim_injury_severity', # Redundant
    'victim_ID', # Not needed for analysis
    'report_number', # Not needed for analysis
]
df = df.drop(columns=columns_to_drop)

# Convert 'accident_date' to datetime objects.  Essential for time-based analysis!
df['accident_date'] = pd.to_datetime(df['accident_date'], errors='coerce')

# List of integer columns
int_cols = [
    'victim_age', 'accident_ID',
    'victim_minor_injuries?', 'victim_hospitalized?', 'victim_deceased?',
]

# List of float columns
float_cols = ['latitude', 'longitude']

# Convert integer columns
df[int_cols] = df[int_cols].apply(pd.to_numeric, errors='coerce').astype('Int64')

# Convert float columns
df[float_cols] = df[float_cols].apply(pd.to_numeric, errors='coerce').astype('float64')

# List of string columns.
string_cols = [
    'victim_transport_mode', 'victim_category',
    'victim_sex', 'environment', 'address', 'periphery_info',
    'victim_age_group', 'report_summary'
]

# Convert string columns to Pandas string dtype (better for text handling).
df[string_cols] = df[string_cols].astype('string')

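# Convert injury columns to booleans (True/False).  More intuitive!
# Missing values (pd.NA after the Int64 conversion) are treated as False;
# a plain `x == 1` test on pd.NA would raise "boolean value of NA is ambiguous".
bool_cols = ['victim_minor_injuries?', 'victim_hospitalized?', 'victim_deceased?']
df[bool_cols] = df[bool_cols].eq(1).fillna(False).astype(bool)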

# Standardize 'victim_sex' to 'M' or 'F'.  Consistency is key!
df['victim_sex'] = df['victim_sex'].map({'Masculin': 'M', 'Feminin': 'F'})
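A caveat on mapping with a dictionary: `Series.map` returns a missing value for any entry not found in the dictionary, so an unexpected spelling would silently become NaN. Here is a minimal sketch of a check for that. The toy values are made up, and `'Féminin'` (with an accent) is a hypothetical variant used only to show the failure mode.

```python
import pandas as pd

sex_map = {'Masculin': 'M', 'Feminin': 'F'}

# Toy values; 'Féminin' is an invented variant, not one observed in the data.
raw = pd.Series(['Masculin', 'Feminin', 'Féminin', None])

mapped = raw.map(sex_map)

# Values that were present before mapping but are missing afterwards.
unmapped = raw[raw.notna() & mapped.isna()].unique().tolist()
print(unmapped)
```

An empty list means every non-missing value was covered by the dictionary.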

### Step 7: Save Progress - Preserving Our Work

All this cleaning deserves to be saved! We'll export the cleaned DataFrame to a new CSV file, `accidents_cleaned.csv`. This is our foundation for further analysis.

In [6]:
# Save the cleaned DataFrame to a new CSV file.
df.to_csv('../data/accidents_cleaned.csv', index=False, sep=';')
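CSV files do not store dtypes, so the datetime, nullable integer, and string types set above have to be restored when the file is read back. The sketch below shows one way to do that, assuming a tiny in-memory stand-in for `accidents_cleaned.csv` with made-up rows and only a few of the columns.

```python
import io

import pandas as pd

# Tiny stand-in for accidents_cleaned.csv (made-up rows, same ';' separator).
csv_text = (
    "accident_date;victim_age;victim_sex;victim_deceased?\n"
    "2017-04-03;62;F;False\n"
    "2017-08-28;;M;False\n"
)

reloaded = pd.read_csv(
    io.StringIO(csv_text),
    sep=';',
    parse_dates=['accident_date'],                          # restore datetimes
    dtype={'victim_age': 'Int64', 'victim_sex': 'string'},  # restore nullable types
)
print(reloaded.dtypes)
```

The True/False column is parsed as boolean automatically, and the empty age field becomes a missing value in the `Int64` column.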

---

**Conclusion and Next Steps:**

We've cleaned and prepared our accident data. We now have a dataset ready for deeper exploration.

Our next quest is to extract information from the `report_summary` column. These free-text descriptions contain valuable details, like intersection type, weather conditions, and more. We'll use regular expressions (regex) to extract these hidden gems.
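As a small preview, here is a sketch of how `Series.str.extract` with named groups might pull fields out of such summaries. It uses only the visible start of the summaries from the output above (the rest is truncated there), and both the pattern and the group names `severity` and `setting` are assumptions for illustration.

```python
import pandas as pd

# Only the visible beginning of the summaries; the remainder is truncated in the output above.
summaries = pd.Series([
    "Accident Léger non mortel, En agglomération",
    "Accident Léger non mortel, En agglomération",
])

# Hypothetical pattern: capture the first two comma-separated fields.
pattern = r'^(?P<severity>[^,]+),\s*(?P<setting>[^,]+)'

# Each named group becomes a column in the result.
parts = summaries.str.extract(pattern)
print(parts)
```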