# Analysis of Paris Accident Data - Part 1

**Goal**: Prepare and clean the raw accident dataset to enable further analysis.

In this notebook, we:
- Load and inspect the raw dataset
- Rename the French column names to descriptive English ones
- Check for duplicates and inconsistencies
- Clean, standardize, and convert data types
- Save the cleaned dataset for further analysis

In [16]:
import pandas as pd

# Load the raw dataset
df = pd.read_csv('../data/accidents.csv', sep=';')

# Display the first 3 rows
print(df.head(3))

   IdUsager        Date     PV  Arrondissement                Mode  \
0   2389401  2017-04-03   3527           75111              Piéton   
1   2388322  2017-08-28   9113           75108  2 Roues Motorisées   
2   2394191  2017-11-06  11991           75117  2 Roues Motorisées   

    Catégorie       Gravité   Age     Genre  Milieu  ... Blessés Légers  \
0      Piéton  Blessé léger  62.0   Feminin  En-Agg  ...            1.0   
1  Conducteur  Blessé léger  30.0  Masculin  En-Agg  ...            1.0   
2  Conducteur  Blessé léger  37.0  Masculin  En-Agg  ...            1.0   

  Blessés hospitalisés Tué                                             Résumé  \
0                  NaN NaN  Accident Léger non mortel, En agglomération, H...   
1                  NaN NaN  Accident Léger non mortel, En agglomération, E...   
2                  NaN NaN  Accident Léger non mortel, En agglomération, E...   

        Coordonnées        Nom arrondissement arronco  \
0   48.855, 2.36867   Paris 4e Arron

### Renaming Columns for Clarity

We rename the French column headers to descriptive English names so that later code and output are easier to read.

In [17]:
df = df.rename(columns={
    'IdUsager': 'victim_ID',
    'Date': 'accident_date',
    'PV': 'report_number',
    'Mode': 'victim_transport_mode',
    'Catégorie': 'victim_category',
    'Gravité': 'victim_injury_severity',
    'Age': 'victim_age',
    'Genre': 'victim_sex',
    'Milieu': 'environment',
    'Adresse': 'address',
    'Id accident': 'accident_ID',
    'PIM/BD PERIPHERIQUE': 'periphery_info',
    "Tranche d'age": 'victim_age_group',
    'Blessés Légers': 'victim_minor_injuries?',
    'Blessés hospitalisés': 'victim_hospitalized?',
    'Tué': 'victim_deceased?',
    'Résumé': 'report_summary',
    'Nom arrondissement': 'district_name',
    'Nom arrondissement.1': 'district_name.1',
    'Coordonnées': 'coordinates',
    'Coordonnées.1': 'coordinates.1',
    'Arrondissement': 'district',
    'arronco': 'district_code',
    'Latitude': 'latitude',
    'Longitude': 'longitude'
})

print(df.head(3))

   victim_ID accident_date  report_number  district victim_transport_mode  \
0    2389401    2017-04-03           3527     75111                Piéton   
1    2388322    2017-08-28           9113     75108    2 Roues Motorisées   
2    2394191    2017-11-06          11991     75117    2 Roues Motorisées   

  victim_category victim_injury_severity  victim_age victim_sex environment  \
0          Piéton           Blessé léger        62.0    Feminin      En-Agg   
1      Conducteur           Blessé léger        30.0   Masculin      En-Agg   
2      Conducteur           Blessé léger        37.0   Masculin      En-Agg   

   ... victim_minor_injuries? victim_hospitalized? victim_deceased?  \
0  ...                    1.0                  NaN              NaN   
1  ...                    1.0                  NaN              NaN   
2  ...                    1.0                  NaN              NaN   

                                      report_summary       coordinates  \
0  Accident Lég
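
By default, `DataFrame.rename` skips mapping keys that match no column without raising an error, so a typo in one of the French headers could go unnoticed. The following is a minimal sketch of a sanity check, using a small hypothetical frame (`sample`) and an illustrative subset of the mapping (`rename_map`); neither name is part of the notebook.

```python
import pandas as pd

# Hypothetical miniature of the raw data; only the column names matter here.
sample = pd.DataFrame({
    'IdUsager': [2389401],
    'Date': ['2017-04-03'],
    'Genre': ['Feminin'],
})

# Illustrative subset of the mapping, with one key that is absent from the sample.
rename_map = {'IdUsager': 'victim_ID', 'Date': 'accident_date', 'Adresse': 'address'}

# Keys with no matching column are skipped silently by rename().
missing_keys = sorted(set(rename_map) - set(sample.columns))
# Columns that keep their original name because the mapping does not cover them.
unmapped_cols = [c for c in sample.columns if c not in rename_map]

renamed = sample.rename(columns=rename_map)
print('Mapping keys not found:', missing_keys)
print('Columns left unrenamed:', unmapped_cols)
```

Passing `errors='raise'` to `rename` is an alternative when a missing key should stop the notebook with a `KeyError`.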

### Column Summary & Duplicate Check

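We inspect each column's data type, missing-value count, and number of unique values. We then verify that the `.1`-suffixed columns (the suffix pandas typically adds when a header name repeats in the CSV) contain the same values as their originals.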

In [18]:
def column_summary(df):
    summary_data = []
    for col in df.columns:
        summary_data.append({
            'col_name': col,
            'dtype': df[col].dtype,
            'num_nulls': df[col].isnull().sum(),
            'num_unique': df[col].nunique()
        })
    return pd.DataFrame(summary_data)

summary_df = column_summary(df)
display(summary_df)

# Check for duplicate columns (e.g., district_name vs district_name.1)
district_mismatch = df[df['district_name'].notna() & (df['district_name'] != df['district_name.1'])]
coord_mismatch = df[df['coordinates'].notna() & (df['coordinates'] != df['coordinates.1'])]
print("District mismatches:", district_mismatch.shape[0])
print("Coordinate mismatches:", coord_mismatch.shape[0])

                 col_name    dtype  num_nulls  num_unique
0               victim_ID    int64          0       41211
1           accident_date   object          0        2550
2           report_number    int64          0        7105
3                district    int64          0          40
4   victim_transport_mode   object          0           6
5         victim_category   object          0           3
6  victim_injury_severity   object          0           3
7              victim_age  float64          0         103
8              victim_sex   object          0           2
9             environment   object          0           2


District mismatches: 0
Coordinate mismatches: 0
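
The check above only looks at rows where the first column is non-null, because an element-wise `!=` reports two missing values as different. As a complementary sketch, `Series.equals` compares two columns as a whole and treats missing values in the same position as equal. The frame `demo` and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'b' duplicates 'a' (including the missing value), 'c' does not.
demo = pd.DataFrame({
    'a': ['Paris 4e', np.nan, 'Paris 8e'],
    'b': ['Paris 4e', np.nan, 'Paris 8e'],
    'c': ['Paris 4e', 'Paris 11e', np.nan],
})

# Element-wise != counts the NaN/NaN pair as a mismatch.
naive_mismatches = int((demo['a'] != demo['b']).sum())

# Series.equals treats NaNs in the same position as equal.
a_equals_b = demo['a'].equals(demo['b'])
a_equals_c = demo['a'].equals(demo['c'])

print('Naive mismatches between a and b:', naive_mismatches)
print('a equals b:', a_equals_b)
print('a equals c:', a_equals_c)
```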


### Data Slimming and Cleaning

We drop redundant columns, convert data types, and standardize values.

In [19]:
columns_to_drop = [
    'district_name', 'district_name.1', 'coordinates', 'coordinates.1', 'district',
    'Champ13', 'victim_injury_severity', 'victim_ID', 'report_number'
]
df = df.drop(columns=columns_to_drop)

# Convert date column to datetime
df['accident_date'] = pd.to_datetime(df['accident_date'], errors='coerce')

# Convert numeric columns
int_cols = ['victim_age', 'accident_ID', 'victim_minor_injuries?', 'victim_hospitalized?', 'victim_deceased?']
df[int_cols] = df[int_cols].apply(pd.to_numeric, errors='coerce').astype('Int64')

float_cols = ['latitude', 'longitude']
df[float_cols] = df[float_cols].apply(pd.to_numeric, errors='coerce').astype('float64')

# Convert string columns
string_cols = [
    'victim_transport_mode', 'victim_category', 'victim_sex', 'environment',
    'address', 'periphery_info', 'victim_age_group', 'report_summary'
]
df[string_cols] = df[string_cols].astype('string')

# Convert indicator columns to booleans: 1 means True,
# anything else (including a missing value) means False
bool_cols = ['victim_minor_injuries?', 'victim_hospitalized?', 'victim_deceased?']
df[bool_cols] = df[bool_cols].eq(1).fillna(False).astype(bool)

# Standardize victim_sex values; any value other than 'Masculin'/'Feminin' becomes missing
df['victim_sex'] = df['victim_sex'].map({'Masculin': 'M', 'Feminin': 'F'}).astype('string')

print(df.head(3))

  accident_date victim_transport_mode victim_category  victim_age victim_sex  \
0    2017-04-03                Piéton          Piéton          62          F   
1    2017-08-28    2 Roues Motorisées      Conducteur          30          M   
2    2017-11-06    2 Roues Motorisées      Conducteur          37          M   

  environment                 address  longitude  latitude  accident_ID  \
0      En-Agg  BOULEVARD BEAUMARCHAIS    2.36867   48.8550       837613   
1      En-Agg             RUE MARBEUF    2.30130   48.8667       837073   
2      En-Agg        RUE LA CONDAMINE    2.32163   48.8858       840008   

      periphery_info victim_age_group  victim_minor_injuries?  \
0  Paris Intra Muros        55-64 ans                    True   
1  Paris Intra Muros        25-34 ans                    True   
2  Paris Intra Muros        35-44 ans                    True   

   victim_hospitalized?  victim_deceased?  \
0                 False             False   
1                 False    
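
Because the conversions above use `errors='coerce'`, any value that cannot be parsed becomes missing without a warning. The sketch below shows one way to count how many values are lost this way, by comparing missingness before and after conversion. The frame `raw` and its malformed entries are made up for illustration.

```python
import pandas as pd

# Hypothetical raw values, including one malformed date and one non-numeric age.
raw = pd.DataFrame({
    'accident_date': ['2017-04-03', '2017-08-28', 'not a date'],
    'victim_age': ['62', 'unknown', '37'],
})

dates = pd.to_datetime(raw['accident_date'], errors='coerce')
ages = pd.to_numeric(raw['victim_age'], errors='coerce').astype('Int64')

# Values that were present before conversion but are missing afterwards.
lost_dates = int((raw['accident_date'].notna() & dates.isna()).sum())
lost_ages = int((raw['victim_age'].notna() & ages.isna()).sum())

print('Dates lost to coercion:', lost_dates)
print('Ages lost to coercion:', lost_ages)
```

A non-zero count is a cue to inspect the offending rows before relying on the cleaned column.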

In [20]:
# Convert 'district_code' to arrondissement numbers (using last two digits)
df['arrondissement'] = df['district_code'].astype(str).str[-2:].astype(int)
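
The string slice above works when every `district_code` is a whole number with no missing entries. The format of that column is not shown in this notebook, so the following is only a sketch under the assumption that codes may also arrive as floats, strings, or missing values. It takes the last two digits arithmetically instead; `codes` is a hypothetical sample.

```python
import pandas as pd

# Hypothetical codes; the real format of district_code is not shown in this notebook.
codes = pd.Series([75104, 75111.0, '75117', None])

# Going through to_numeric avoids slicing strings such as '75111.0',
# and the nullable Int64 dtype keeps missing codes as <NA>.
numeric_codes = pd.to_numeric(codes, errors='coerce').astype('Int64')

# The remainder modulo 100 is the last two digits of the code.
arrondissement = numeric_codes % 100
print(arrondissement.tolist())
```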

In [21]:
# Upper-case addresses and expand common street-type abbreviations (BD, RTE, AV)
def clean_street_name(address_series):
    return (address_series
            .str.upper()
            .str.replace(r'\bBD\b', 'BOULEVARD', regex=True)
            .str.replace(r'\bRTE\b', 'ROUTE', regex=True)
            .str.replace(r'\bAV\b', 'AVENUE', regex=True))

df['clean_address'] = clean_street_name(df['address'])
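
To illustrate what the word-boundary patterns do, here is a small usage sketch on made-up addresses. The function body is repeated from the cell above so the snippet runs on its own; the sample values are hypothetical.

```python
import pandas as pd

def clean_street_name(address_series):
    # Same logic as the cell above, repeated so this snippet is self-contained.
    return (address_series
            .str.upper()
            .str.replace(r'\bBD\b', 'BOULEVARD', regex=True)
            .str.replace(r'\bRTE\b', 'ROUTE', regex=True)
            .str.replace(r'\bAV\b', 'AVENUE', regex=True))

# Hypothetical addresses chosen to exercise the rules.
sample = pd.Series(
    ['bd Beaumarchais', 'AV DE CLICHY', 'RUE D AVRON', None],
    dtype='string',
)
cleaned = clean_street_name(sample)
print(cleaned.tolist())
```

The `\b` anchors restrict each replacement to whole words, so the "AV" inside "AVRON" is left alone, and missing addresses stay missing.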

### Save the Cleaned Dataset

The cleaned dataset is saved for use in subsequent notebooks.

In [22]:
df.to_csv('../data/accidents_cleaned.csv', index=False, sep=';')
print('Cleaned dataset saved as accidents_cleaned.csv')

Cleaned dataset saved as accidents_cleaned.csv
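
CSV files do not store data types, so a later notebook has to restore them when reading the file back. The sketch below shows one possible way to do that with `parse_dates` and `dtype`. It uses a hypothetical two-row frame and an in-memory buffer rather than the real file, and the column selection is only an example.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the cleaned frame.
cleaned = pd.DataFrame({
    'accident_date': pd.to_datetime(['2017-04-03', '2017-08-28']),
    'victim_age': pd.array([62, None], dtype='Int64'),
    'victim_sex': pd.array(['F', 'M'], dtype='string'),
})

# Write to an in-memory buffer instead of ../data/ so the sketch has no side effects.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False, sep=';')
buffer.seek(0)

# Without these hints the date would come back as plain text and the age as float.
reloaded = pd.read_csv(
    buffer,
    sep=';',
    parse_dates=['accident_date'],
    dtype={'victim_age': 'Int64', 'victim_sex': 'string'},
)
print(reloaded.dtypes)
```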
