# Analysis of Paris Accident Data - Part 1

**Goal**: Prepare and clean the raw accident dataset to enable further analysis.

In this notebook, we:
- Load and inspect the raw dataset
- Rename the French column names to descriptive English names
- Check for duplicates and inconsistencies
- Clean, standardize, and convert data types
- Save the cleaned dataset for further analysis

In [None]:
import pandas as pd

# Load the raw dataset
df = pd.read_csv('../data/accidents.csv', sep=';')

# Display the first 3 rows
print(df.head(3))

### Renaming Columns for Clarity

The raw column names are in French. We rename them to descriptive English names so that later steps are easier to read.

In [None]:
df = df.rename(columns={
    'IdUsager': 'victim_ID',
    'Date': 'accident_date',
    'PV': 'report_number',
    'Mode': 'victim_transport_mode',
    'Catégorie': 'victim_category',
    'Gravité': 'victim_injury_severity',
    'Age': 'victim_age',
    'Genre': 'victim_sex',
    'Milieu': 'environment',
    'Adresse': 'address',
    'Id accident': 'accident_ID',
    'PIM/BD PERIPHERIQUE': 'periphery_info',
    "Tranche d'age": 'victim_age_group',
    'Blessés Légers': 'victim_minor_injuries?',
    'Blessés hospitalisés': 'victim_hospitalized?',
    'Tué': 'victim_deceased?',
    'Résumé': 'report_summary',
    'Nom arrondissement': 'district_name',
    'Nom arrondissement.1': 'district_name.1',
    'Coordonnées': 'coordinates',
    'Coordonnées.1': 'coordinates.1',
    'Arrondissement': 'district',
    'arronco': 'district_code',
    'Latitude': 'latitude',
    'Longitude': 'longitude'
})

print(df.head(3))
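
By default, `DataFrame.rename` ignores mapping keys that do not match any column, so a typo or an accent difference in a French header would leave that column untranslated without warning. Below is a minimal sketch of a check to run before renaming. The toy frame and the helper name `missing_rename_keys` are illustrative and not part of the original notebook.

```python
import pandas as pd

def missing_rename_keys(frame, mapping):
    """Return the mapping keys that are not column names of `frame`."""
    return sorted(set(mapping) - set(frame.columns))

# Toy frame standing in for the raw data (hypothetical values).
toy = pd.DataFrame({'Date': ['2019-01-01'], 'Age': [34]})
toy_mapping = {'Date': 'accident_date', 'Age': 'victim_age', 'Genre': 'victim_sex'}

# 'Genre' is in the mapping but not in the frame.
unmatched = missing_rename_keys(toy, toy_mapping)
print(unmatched)  # ['Genre']

# rename() raises no error for the unmatched key by default.
renamed = toy.rename(columns=toy_mapping)
print(list(renamed.columns))  # ['accident_date', 'victim_age']
```

An empty result means every key in the mapping matched a column. Passing `errors='raise'` to `rename` is an alternative that fails on unmatched keys instead.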

### Column Summary & Duplicate Check

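We inspect each column's data type, number of missing values, and number of unique values. We then check whether the columns that appear twice (e.g., `district_name` and `district_name.1`) hold identical values.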

In [None]:
def column_summary(df):
    summary_data = []
    for col in df.columns:
        summary_data.append({
            'col_name': col,
            'dtype': df[col].dtype,
            'num_nulls': df[col].isnull().sum(),
            'num_unique': df[col].nunique()
        })
    return pd.DataFrame(summary_data)

summary_df = column_summary(df)
display(summary_df)

# Count rows where apparently duplicated columns disagree (e.g., district_name vs district_name.1)
district_mismatch = df[df['district_name'].notna() & (df['district_name'] != df['district_name.1'])]
coord_mismatch = df[df['coordinates'].notna() & (df['coordinates'] != df['coordinates.1'])]
print("District mismatches:", district_mismatch.shape[0])
print("Coordinate mismatches:", coord_mismatch.shape[0])
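
The comparison above only looks at rows where the first column is present, so a row that is missing in the first column but filled in the second is not counted. Below is a sketch of a symmetric, NaN-aware variant on made-up data. The helper name `count_mismatches` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def count_mismatches(frame, col_a, col_b):
    """Count rows where two columns differ, treating two missing values as equal."""
    a, b = frame[col_a], frame[col_b]
    both_missing = a.isna() & b.isna()
    return int(((a != b) & ~both_missing).sum())

# Hypothetical values: row 2 is missing only in the first column.
toy = pd.DataFrame({
    'district_name':   ['Louvre', np.nan, np.nan, 'Temple'],
    'district_name.1': ['Louvre', np.nan, 'Bourse', 'Marais'],
})

# Symmetric count: rows 2 and 3 disagree.
symmetric = count_mismatches(toy, 'district_name', 'district_name.1')

# One-sided count, in the style of the check above: only row 3 is seen.
one_sided = int((toy['district_name'].notna()
                 & (toy['district_name'] != toy['district_name.1'])).sum())

print(symmetric, one_sided)  # 2 1
```

If both counts are zero on the real data, the duplicated columns carry no extra information and can be dropped.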

### Data Slimming and Cleaning

We drop redundant columns, convert data types, and standardize values.

In [None]:
columns_to_drop = [
    'district_name', 'district_name.1', 'coordinates', 'coordinates.1', 'district',
    'Champ13', 'victim_injury_severity', 'victim_ID', 'report_number'
]
df = df.drop(columns=columns_to_drop)

# Convert date column to datetime
df['accident_date'] = pd.to_datetime(df['accident_date'], errors='coerce')

# Convert numeric columns
int_cols = ['victim_age', 'accident_ID', 'victim_minor_injuries?', 'victim_hospitalized?', 'victim_deceased?']
df[int_cols] = df[int_cols].apply(pd.to_numeric, errors='coerce').astype('Int64')

float_cols = ['latitude', 'longitude']
df[float_cols] = df[float_cols].apply(pd.to_numeric, errors='coerce').astype('float64')

# Convert string columns
string_cols = [
    'victim_transport_mode', 'victim_category', 'victim_sex', 'environment',
    'address', 'periphery_info', 'victim_age_group', 'report_summary'
]
df[string_cols] = df[string_cols].astype('string')

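# Convert indicator columns to booleans (1 -> True; any other value, including missing, -> False).
# A vectorized comparison avoids the TypeError that `y == 1` raises on pd.NA inside a Python `if`.
bool_cols = ['victim_minor_injuries?', 'victim_hospitalized?', 'victim_deceased?']
df[bool_cols] = df[bool_cols].eq(1).fillna(False).astype(bool)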

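# Standardize victim_sex values. Both the accented and unaccented spellings are accepted;
# any value not listed in the mapping becomes missing.
df['victim_sex'] = df['victim_sex'].map({'Masculin': 'M', 'Feminin': 'F', 'Féminin': 'F'}).astype('string')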

print(df.head(3))
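
With `errors='coerce'`, values that cannot be parsed become missing (`NaT` or `<NA>`) without any error. It can therefore be worth counting how many values a conversion discards. Here is a minimal sketch on made-up values; the variable names are illustrative and not from the notebook.

```python
import pandas as pd

# Hypothetical raw age values: one unparseable entry, one already missing.
raw_age = pd.Series(['34', '7', 'unknown', None])
age = pd.to_numeric(raw_age, errors='coerce').astype('Int64')

# Values that were present before conversion but are missing afterwards.
lost_ages = int(age.isna().sum() - raw_age.isna().sum())
print(lost_ages)  # 1

# The same idea for dates: the unparseable entry becomes NaT.
raw_date = pd.Series(['2019-01-05', 'not a date'])
dates = pd.to_datetime(raw_date, errors='coerce')
lost_dates = int(dates.isna().sum() - raw_date.isna().sum())
print(lost_dates)  # 1
```

A count above zero points to entries that should be inspected in the raw file before the coerced result is relied on.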

In [None]:
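# Derive the arrondissement number from the last two digits of 'district_code'.
# A numeric modulo tolerates missing values and float-encoded codes, which string slicing does not.
df['arrondissement'] = pd.to_numeric(df['district_code'], errors='coerce').mod(100).astype('Int64')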
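
The arrondissement number is taken from the last two digits of the district code. The illustration below uses made-up codes in a `751xx` format, which is an assumption about how `district_code` is encoded. It shows what happens when a missing value forces the column to floats.

```python
import pandas as pd

# One missing value makes pandas store the whole column as float64.
codes = pd.Series([75101, 75112, None, 75120])

# Slicing the string form picks up the trailing '.0' of the float representation.
sliced = codes.astype(str).str[-2:]
print(sliced.iloc[0])  # .0

# A numeric modulo keeps the last two digits and leaves the gap as <NA>.
arrondissement = pd.to_numeric(codes, errors='coerce').mod(100).astype('Int64')
print(arrondissement.tolist())  # [1, 12, <NA>, 20]
```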

### Save the Cleaned Dataset

The cleaned dataset is saved for use in subsequent notebooks.

In [None]:
df.to_csv('../data/accidents_cleaned.csv', index=False, sep=';')
print('Cleaned dataset saved as accidents_cleaned.csv')
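
CSV files do not store data types, so a later notebook has to restore them when it reads the cleaned file. The sketch below shows a round trip through an in-memory buffer with a small made-up frame. The column subset and values are illustrative, and the real file would be read from its path instead.

```python
import io
import pandas as pd

# Small stand-in for the cleaned dataset (hypothetical values).
cleaned = pd.DataFrame({
    'accident_date': pd.to_datetime(['2019-01-05', '2019-02-10']),
    'victim_age': pd.array([34, None], dtype='Int64'),
    'victim_deceased?': [False, True],
})

# Write with the same options as above, into memory rather than to disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False, sep=';')
buffer.seek(0)

# Restore the date and nullable-integer types explicitly on load.
reloaded = pd.read_csv(
    buffer,
    sep=';',
    parse_dates=['accident_date'],
    dtype={'victim_age': 'Int64'},
)
print(reloaded.dtypes)
```

Without `parse_dates` and `dtype`, the date column would be read back as text and the age column as floats.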