In [1]:
import glob
import pandas as pd
import numpy as np
import re

# Data import and primary data curation

All imported files were preprocessed to standardize file name schemes and separators.

In general, issues occurring during import and data type conversion were fixed on the fly.

Further curation will be performed after an in-depth evaluation of the data.


## `usagers` files
- import the files
- fix `dtype` issues on the fly
- fix an issue with the `id_vehicule` variable
- export the file to `users.csv` in `data`

### Data import and dtype issues fixing

In [2]:
# read file names
files = []
for file in glob.glob("../mar23cds_accidents_data_import/usagers*.csv"):
    files.append(file)

# Fix dtype issues
dtype = {'place': 'Int64',
         'an_nais': 'Int64',
         'trajet': 'Int64',
         'secu1': 'Int64',
         'secu2': 'Int64',
         'secu3': 'Int64',
         'locp': 'Int64',
         'etatp': 'Int64',
         'secu': 'Int64'}

# import user files
users = pd.DataFrame()

for i in range(0, len(files)):
    file = pd.read_csv(files[i], sep = ',', dtype = dtype)
    if i == 0:
        users = file
    else:
        users = pd.concat([users, file], axis = 0)

# show info
users.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2509620 entries, 0 to 197497
Data columns (total 16 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   Num_Acc      int64 
 1   id_vehicule  object
 2   num_veh      object
 3   place        Int64 
 4   catu         int64 
 5   grav         int64 
 6   sexe         int64 
 7   an_nais      Int64 
 8   trajet       Int64 
 9   secu1        Int64 
 10  secu2        Int64 
 11  secu3        Int64 
 12  locp         Int64 
 13  actp         object
 14  etatp        Int64 
 15  secu         Int64 
dtypes: Int64(9), int64(4), object(3)
memory usage: 347.0+ MB
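The `info()` output above reports an index running from 0 to 197497 for 2509620 rows, because each file keeps its own row numbers after concatenation. The following is a minimal sketch of a reusable loader that collects all frames in a list and concatenates once. The helper name `load_csv_files` is hypothetical, and `ignore_index=True` is an optional change that is not part of the original import.

```python
import glob

import pandas as pd


def load_csv_files(pattern, **read_csv_kwargs):
    """Hypothetical helper: read every CSV matching `pattern` into one DataFrame."""
    paths = sorted(glob.glob(pattern))  # sorted for a reproducible row order
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    frames = [pd.read_csv(path, **read_csv_kwargs) for path in paths]
    # ignore_index=True builds one continuous index instead of repeating
    # each file's own 0..n-1 row numbers
    return pd.concat(frames, axis=0, ignore_index=True)


# Possible usage with the dtype mapping defined in the cell above:
# users = load_csv_files("../mar23cds_accidents_data_import/usagers*.csv",
#                        sep=',', dtype=dtype)
```

Dropping `ignore_index=True` reproduces the repeated index of the original loop.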


### Fixing the issue in `id_vehicule`
We have a curious encoding in `id_vehicule`.

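The value `\xa0` encodes a non-breaking space, which is used here as a separator between groups of digits inside the number. Built-in stripping functions such as `str.strip` only trim the ends of a string, so they cannot remove it.
Hence, we need a lambda function to:

1. remove all non-numeric characters
2. replace empty values by `nan`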
3. finally, convert the variable to integer
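The three steps can be illustrated on a single value with the standard library only. This is a small sketch using the sample value shown in this notebook; the variable names are chosen for illustration.

```python
import re

raw = '138\xa0306\xa0524'  # sample value from the users table

# str.strip() only trims the ends, so the inner non-breaking spaces survive
stripped = raw.strip()

# removing every non-digit character leaves a string that int() accepts
digits = re.sub(r'[^0-9]', '', raw)

# an empty result means there was no usable number at all
vehicle_id = int(digits) if digits else None

print(repr(stripped))
print(vehicle_id)
```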

In [3]:
# show example
users.iloc[0].id_vehicule

'138\xa0306\xa0524'

In [4]:
# remove all non-numeric characters by a regex
users['id_vehicule'] = users['id_vehicule'].apply(lambda x: re.sub('[^0-9]', '', str(x)))

# replace empty values by nan
users['id_vehicule'] = users['id_vehicule'].replace('', np.nan).astype(float)

# convert to int64
users['id_vehicule'] = users['id_vehicule'].astype('Int64')

In [5]:
# check the result
users.iloc[0].id_vehicule

138306524

In [6]:
users.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2509620 entries, 0 to 197497
Data columns (total 16 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   Num_Acc      int64 
 1   id_vehicule  Int64 
 2   num_veh      object
 3   place        Int64 
 4   catu         int64 
 5   grav         int64 
 6   sexe         int64 
 7   an_nais      Int64 
 8   trajet       Int64 
 9   secu1        Int64 
 10  secu2        Int64 
 11  secu3        Int64 
 12  locp         Int64 
 13  actp         object
 14  etatp        Int64 
 15  secu         Int64 
dtypes: Int64(10), int64(4), object(2)
memory usage: 349.4+ MB
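The same cleaning can also be written with vectorized string methods instead of a row-wise `apply`. The sketch below is one possible variant, not the code used for the results above; `clean_id_column` is a hypothetical helper name and the sample values are made up apart from the first one.

```python
import pandas as pd


def clean_id_column(series):
    """Hypothetical helper: keep digits only and return a nullable Int64 Series."""
    # cast to str first so that missing values do not break the string methods
    digits = series.astype(str).str.replace(r'[^0-9]', '', regex=True)
    # strings without any digit end up empty and are coerced to missing values
    return pd.to_numeric(digits, errors='coerce').astype('Int64')


sample = pd.Series(['138\xa0306\xa0524', None, '42'])
cleaned = clean_id_column(sample)
print(cleaned)
```

As in the original cells, the values pass through `float64` before the final cast, which is exact for identifiers of this length.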


### Export the data

In [7]:
users.to_csv("./data/users.csv", sep = ',', header = True, na_rep = 'n/a', index=False)
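Because missing values are written as `n/a`, a later import has to map that marker back to NA and request the nullable dtype again, otherwise the column comes back as `float64`. The following is a minimal round-trip sketch on an in-memory buffer with made-up rows; it assumes the same `sep` and `na_rep` settings as the export above.

```python
import io

import pandas as pd

# two made-up rows, one with a missing identifier
df = pd.DataFrame({
    'id_vehicule': pd.array([138306524, None], dtype='Int64'),
    'num_veh': ['A01', 'B01'],
})

buffer = io.StringIO()
df.to_csv(buffer, sep=',', header=True, na_rep='n/a', index=False)
buffer.seek(0)

# read the text back, treating 'n/a' as missing and restoring Int64
restored = pd.read_csv(buffer, sep=',', na_values=['n/a'],
                       dtype={'id_vehicule': 'Int64'})
print(restored.dtypes)
```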

## `caracteristique` files
- import the files
- fix `dtype` issues on the fly
- fix issues with the variables `lat`, `long`, and `dep`
- export the file to `features.csv` in `data`

### Data import and dtype issues fixing

- convert `col` and `atm` to `Int64`
- encoding is `latin-1` to map French characters


In [8]:
dtype = {'col': 'Int64',
         'atm': 'Int64'}

# read characteristics (=features) files in data directory
files = []
for file in glob.glob("../mar23cds_accidents_data_import/caracter*.csv"):
    files.append(file)

# import characteristics files
features = pd.DataFrame()

for i in range(0, len(files)):
    file = pd.read_csv(files[i], sep = ',', encoding='latin-1', na_values=['N/A', 'n/a',''], dtype = dtype)
    if i == 0:
        features = file
    else:
        features = pd.concat([features, file], axis = 0)


In [9]:
features.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1121571 entries, 0 to 58396
Data columns (total 16 columns):
 #   Column   Non-Null Count    Dtype 
---  ------   --------------    ----- 
 0   Num_Acc  1121571 non-null  int64 
 1   an       1121571 non-null  int64 
 2   mois     1121571 non-null  int64 
 3   jour     1121571 non-null  int64 
 4   hrmn     1121571 non-null  object
 5   lum      1121571 non-null  int64 
 6   agg      1121571 non-null  int64 
 7   int      1121571 non-null  int64 
 8   atm      1121498 non-null  Int64 
 9   col      1121552 non-null  Int64 
 10  com      1121569 non-null  object
 11  adr      978295 non-null   object
 12  gps      480052 non-null   object
 13  lat      634503 non-null   object
 14  long     634499 non-null   object
 15  dep      1121571 non-null  object
dtypes: Int64(2), int64(7), object(7)
memory usage: 147.6+ MB


In [10]:
features.head()

Unnamed: 0,Num_Acc,an,mois,jour,hrmn,lum,agg,int,atm,col,com,adr,gps,lat,long,dep
0,201400000001,14,5,7,2015,1,2,1,1,3,11,route de don,M,0.0,0.0,590
1,201400000002,14,5,31,430,1,2,1,1,6,11,106 ROUTE DE DON,M,0.0,0.0,590
2,201400000003,14,8,23,1800,1,2,9,1,3,52,75 bis rue jean jaures,M,0.0,0.0,590
3,201400000004,14,6,12,1700,1,2,1,1,1,25,rue des Sablonnieres D41,M,0.0,0.0,590
4,201400000005,14,6,23,500,2,1,1,1,1,25,,M,0.0,0.0,590


### The latitude and longitude data comes in three forms:

1. decimal format with decimal separator
2. string format without decimal separator
3. NAs

The string format encodes the decimal format without the decimal separator.
Hence, we have 3 data formats:

1. `float64`
2. `int64`
3. `NA`

It should be possible to convert the `int64` format to decimals by padding and the NAs to `np.nan`, so that both variables can be converted to `float64`.

In [11]:
# This part was optimised with help of ChatGPT 3.0
# remove all non-numeric characters by a regex
features['lat'] = features['lat'].apply(lambda x: re.sub('[^0-9]', '', str(x)))
features['long'] = features['long'].apply(lambda x: re.sub('[^0-9]', '', str(x)))

# insert the decimal separator after the first two digits
features['lat'] = features['lat'].apply(lambda x: f"{x[:2]}.{x[2:]}" if x and isinstance(x, str) else x)
features['long'] = features['long'].apply(lambda x: f"{x[:2]}.{x[2:]}" if x and isinstance(x, str) else x)

# add nan to all empty values
features['lat'] = features['lat'].replace('', np.nan).astype(float)
features['long'] = features['long'].replace('', np.nan).astype(float)

In [12]:
features.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1121571 entries, 0 to 58396
Data columns (total 16 columns):
 #   Column   Non-Null Count    Dtype  
---  ------   --------------    -----  
 0   Num_Acc  1121571 non-null  int64  
 1   an       1121571 non-null  int64  
 2   mois     1121571 non-null  int64  
 3   jour     1121571 non-null  int64  
 4   hrmn     1121571 non-null  object 
 5   lum      1121571 non-null  int64  
 6   agg      1121571 non-null  int64  
 7   int      1121571 non-null  int64  
 8   atm      1121498 non-null  Int64  
 9   col      1121552 non-null  Int64  
 10  com      1121569 non-null  object 
 11  adr      978295 non-null   object 
 12  gps      480052 non-null   object 
 13  lat      634503 non-null   float64
 14  long     631290 non-null   float64
 15  dep      1121571 non-null  object 
dtypes: Int64(2), float64(2), int64(7), object(5)
memory usage: 147.6+ MB


In [13]:
features.head()

Unnamed: 0,Num_Acc,an,mois,jour,hrmn,lum,agg,int,atm,col,com,adr,gps,lat,long,dep
0,201400000001,14,5,7,2015,1,2,1,1,3,11,route de don,M,0.0,0.0,590
1,201400000002,14,5,31,430,1,2,1,1,6,11,106 ROUTE DE DON,M,0.0,0.0,590
2,201400000003,14,8,23,1800,1,2,9,1,3,52,75 bis rue jean jaures,M,0.0,0.0,590
3,201400000004,14,6,12,1700,1,2,1,1,1,25,rue des Sablonnieres D41,M,0.0,0.0,590
4,201400000005,14,6,23,500,2,1,1,1,1,25,,M,0.0,0.0,590
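One caveat of the digit-only regex used above is that it also removes a leading minus sign and any decimal separator that is already present, so a value such as `-1.5` would be read as `15.0`. The sketch below shows one way to keep both. It is a hypothetical helper, not the code that produced the results above, and it assumes that values without a separator carry two digits before the implied decimal point, as in the split used above.

```python
import math
import re


def parse_coordinate(raw, int_digits=2):
    """Hypothetical helper: convert one raw coordinate value to a float.

    Assumption: values without a decimal separator have `int_digits`
    digits before the implied decimal point.
    """
    text = '' if raw is None else str(raw).strip()
    match = re.fullmatch(r'([+-]?)(\d+)(?:[.,](\d*))?', text)
    if match is None:  # empty, 'nan' or otherwise unparseable
        return math.nan
    sign_text, whole, frac = match.groups()
    sign = -1.0 if sign_text == '-' else 1.0
    if frac is not None:
        # a decimal separator is already present: keep the value as written
        return sign * float(f"{whole}.{frac or '0'}")
    # no separator: insert one after the first `int_digits` digits
    return sign * float(f"{whole[:int_digits]}.{whole[int_digits:] or '0'}")


for value in ['48.8566', '4885660', '-1.5', '', None]:
    print(repr(value), parse_coordinate(value))
```

It could be applied with `features['lat'].apply(parse_coordinate)`. Whether two integer digits is the right assumption for every longitude value needs to be checked against the data.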


### Export the data
The data is exported to `features.csv` in the `data` folder.

In [14]:
features.to_csv("./data/features.csv", sep = ',', header = True, na_rep = 'n/a', index=False)

## `lieux` files
- import the files
- fix `dtype` issues on the fly
- export the file to `places.csv` in `data`

### Data import and dtype issues fixing

- convert the integer-coded columns (`catr`, `circ`, `nbv`, `vosp`, `prof`, `plan`, `surf`, `infra`, `situ`, `vma`, `env1`) to `Int64`
- encoding is `latin-1` to map French characters


In [15]:
# read places files in data directory
files = []
for file in glob.glob("../mar23cds_accidents_data_import/lieux*.csv"):
    files.append(file)

dtype = {'catr': 'Int64',
         'circ': 'Int64',
         'nbv': 'Int64',
         'vosp': 'Int64',
         'prof': 'Int64',
         'plan': 'Int64',
         'surf': 'Int64',
         'infra': 'Int64',
         'situ': 'Int64',
         'vma': 'Int64',
         'env1': 'Int64'
}

# import places files
places = pd.DataFrame()

for i in range(0, len(files)):
    file = pd.read_csv(files[i], sep = ',', encoding='latin-1', na_values=['N/A', 'n/a',''], low_memory=False, dtype=dtype)
    if i == 0:
        places = file
    else:
        places = pd.concat([places, file], axis = 0)

In [16]:
places.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1121571 entries, 0 to 87025
Data columns (total 19 columns):
 #   Column   Non-Null Count    Dtype  
---  ------   --------------    -----  
 0   Num_Acc  1121571 non-null  int64  
 1   catr     1121570 non-null  Int64  
 2   voie     1014391 non-null  object 
 3   v1       485747 non-null   float64
 4   v2       51369 non-null    object 
 5   circ     1119997 non-null  Int64  
 6   nbv      1118840 non-null  Int64  
 7   vosp     1118810 non-null  Int64  
 8   prof     1119622 non-null  Int64  
 9   pr       646087 non-null   object 
 10  pr1      644268 non-null   object 
 11  plan     1119290 non-null  Int64  
 12  lartpc   902739 non-null   object 
 13  larrout  1008730 non-null  object 
 14  surf     1119647 non-null  Int64  
 15  infra    1116163 non-null  Int64  
 16  situ     1116601 non-null  Int64  
 17  vma      163102 non-null   Int64  
 18  env1     953029 non-null   Int64  
dtypes: Int64(11), float64(1), int64(1), object(6)
mem

In [17]:
places.head()

Unnamed: 0,Num_Acc,catr,voie,v1,v2,circ,nbv,vosp,prof,pr,pr1,plan,lartpc,larrout,surf,infra,situ,vma,env1
0,201900000001,1,3,0.0,,3,10,0,1,6,900,2,,,1,2,1,70,
1,201900000002,1,1,0.0,,1,2,0,4,3,845,2,,,1,0,1,70,
2,201900000003,1,86,0.0,,3,8,0,1,10,500,3,,,1,0,1,90,
3,201900000004,1,4,0.0,,3,5,0,1,2,299,1,,,1,0,1,90,
4,201900000005,1,86,0.0,INT,1,3,0,1,41,0,3,,,1,2,1,90,


### Export the data
The data is exported to `places.csv` in the `data` folder.

In [18]:
places.to_csv("./data/places.csv", sep = ',', header = True, na_rep = 'n/a', index=False)

## `vehicules` files
- import the files
- fix `dtype` issues on the fly
- fix an issue with the `id_vehicule` variable
- export the file to `vehicles.csv` in `data`

### Data import and dtype issues fixing

- convert `senc`, `catv`, `occutc`, `obs`, `obsm`, `choc`, `manv`, and `motor` to `Int64`

In [19]:
# read vehicle files in data directory
files = []
for file in glob.glob("../mar23cds_accidents_data_import/vehicules*.csv"):
    files.append(file)

# fix dtype issues
dtype = {'senc': 'Int64',
         'catv': 'Int64',
         'occutc': 'Int64',
         'obs': 'Int64',
         'obsm': 'Int64',
         'choc': 'Int64',
         'manv': 'Int64',
         'motor': 'Int64'}

# import vehicle files
vehicles = pd.DataFrame()

for i in range(0, len(files)):
    file = pd.read_csv(files[i], sep = ',', encoding='utf-8', na_values=['N/A', 'n/a',''], low_memory=False, dtype=dtype)
    if i == 0:
        vehicles = file
    else:
        vehicles = pd.concat([vehicles, file], axis = 0)

In [20]:
vehicles.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1914902 entries, 0 to 131047
Data columns (total 11 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   Num_Acc      int64 
 1   senc         Int64 
 2   catv         Int64 
 3   occutc       Int64 
 4   obs          Int64 
 5   obsm         Int64 
 6   choc         Int64 
 7   manv         Int64 
 8   num_veh      object
 9   id_vehicule  object
 10  motor        Int64 
dtypes: Int64(8), int64(1), object(2)
memory usage: 189.9+ MB


In [21]:
vehicles.head()

Unnamed: 0,Num_Acc,senc,catv,occutc,obs,obsm,choc,manv,num_veh,id_vehicule,motor
0,200500000001,0,7,0,0,2,1,1,A01,,
1,200500000001,0,7,0,0,2,8,10,B02,,
2,200500000002,0,7,0,0,2,7,16,A01,,
3,200500000002,0,2,0,0,2,1,1,B02,,
4,200500000003,0,2,0,0,2,1,1,A01,,


### Fixing the issue in `id_vehicule`

In [22]:
# remove all non-numeric characters by a regex
vehicles['id_vehicule'] = vehicles['id_vehicule'].apply(lambda x: re.sub('[^0-9]', '', str(x)))

# replace empty values by nan
vehicles['id_vehicule'] = vehicles['id_vehicule'].replace('', np.nan).astype(float)

# convert to int64
vehicles['id_vehicule'] = vehicles['id_vehicule'].astype('Int64')

In [23]:
vehicles.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1914902 entries, 0 to 131047
Data columns (total 11 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   Num_Acc      int64 
 1   senc         Int64 
 2   catv         Int64 
 3   occutc       Int64 
 4   obs          Int64 
 5   obsm         Int64 
 6   choc         Int64 
 7   manv         Int64 
 8   num_veh      object
 9   id_vehicule  Int64 
 10  motor        Int64 
dtypes: Int64(9), int64(1), object(1)
memory usage: 191.8+ MB


### Export the data
The data is exported to `vehicles.csv` in the `data` folder.

In [24]:
vehicles.to_csv("./data/vehicles.csv", sep = ',', header = True, na_rep = 'n/a', index=False)

## `registered_vehicles` files
- import the files
- fix `dtype` issues on the fly
- fix issues with the variables `Age véhicule` and `Id_accident`
- export the file to `registered_vehicles.csv` in `data`

### Data import and dtype issues fixing

- the files use `;` as separator
- convert `Age véhicule` to `Int64` after replacing the `#VALEURMULTI` placeholder by `nan`

In [25]:
# read registered vehicle files in data directory
files = []
for file in glob.glob("../mar23cds_accidents_data_import/registered_vehicles*.csv"):
    files.append(file)

# import registered vehicle files
registered_vehicles = pd.DataFrame()

for i in range(0, len(files)):
    file = pd.read_csv(files[i], sep = ';', na_values=['N/A', 'n/a',''], 
                       low_memory=False, encoding="UTF-8")
    #dtype = {'Age véhicule': 'Int64'})
    if i == 0:
        registered_vehicles = file
    else:
        registered_vehicles = pd.concat([registered_vehicles, file], axis = 0)

registered_vehicles.head()

Unnamed: 0,Id_accident,Lettre Conventionnelle Véhicule,Année,Lieu Admin Actuel - Territoire Nom,Type Accident - Libellé,CNIT,Catégorie véhicule,Age véhicule,Type Accident - Libellé (old)
0,900 493,A,2014,DOM,Accident Léger,VF7MFDJYF651296,VT,,
1,900 493,B,2014,DOM,Accident Léger,,Cyclo,,
2,900 494,A,2014,DOM,Accident grave non mortel,,VT,4.0,
3,900 495,A,2014,DOM,Accident grave non mortel,LMP21C10N026,Cyclo,7.0,
4,900 496,A,2014,DOM,Accident grave non mortel,LSY91C10U174,Cyclo,6.0,


In [26]:
registered_vehicles['Age véhicule'] = registered_vehicles['Age véhicule'].replace('#VALEURMULTI', np.nan).astype(float)
registered_vehicles['Age véhicule'] = registered_vehicles['Age véhicule'].astype('Int64')

In [27]:
registered_vehicles.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1200791 entries, 0 to 74663
Data columns (total 9 columns):
 #   Column                              Non-Null Count    Dtype 
---  ------                              --------------    ----- 
 0   Id_accident                         1200791 non-null  object
 1   Lettre Conventionnelle Véhicule     1200790 non-null  object
 2   Année                               1200791 non-null  int64 
 3   Lieu Admin Actuel - Territoire Nom  1200791 non-null  object
 4   Type Accident - Libellé             1078193 non-null  object
 5   CNIT                                776011 non-null   object
 6   Catégorie véhicule                  1200791 non-null  object
 7   Age véhicule                        1096600 non-null  Int64 
 8   Type Accident - Libellé (old)       122598 non-null   object
dtypes: Int64(1), int64(1), object(7)
memory usage: 92.8+ MB
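An alternative to replacing `#VALEURMULTI` after the import is to declare it as a missing-value marker when reading, which lets the nullable dtype be requested directly. The sketch below uses a tiny in-memory stand-in with made-up rows. It assumes the ages are stored as plain integers in the raw files, which has not been verified here.

```python
import io

import pandas as pd

# made-up rows in the ';'-separated layout of the registered vehicle files
csv_text = (
    "Id_accident;Age véhicule\n"
    "900 493;#VALEURMULTI\n"
    "900 494;4\n"
    "900 495;\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=';',
    # the placeholder is treated as missing already at read time
    na_values=['N/A', 'n/a', '', '#VALEURMULTI'],
    dtype={'Age véhicule': 'Int64'},
)
print(df.dtypes)
```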


### Fixing the issue in `Id_accident`

In [28]:
# remove all non-numeric characters by a regex
registered_vehicles['Id_accident'] = registered_vehicles['Id_accident'].apply(lambda x: re.sub('[^0-9]', '', str(x)))

# replace empty values by nan
registered_vehicles['Id_accident'] = registered_vehicles['Id_accident'].replace('', np.nan).astype(float)

# convert to int64
registered_vehicles['Id_accident'] = registered_vehicles['Id_accident'].astype('Int64')

In [29]:
registered_vehicles.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1200791 entries, 0 to 74663
Data columns (total 9 columns):
 #   Column                              Non-Null Count    Dtype 
---  ------                              --------------    ----- 
 0   Id_accident                         1200791 non-null  Int64 
 1   Lettre Conventionnelle Véhicule     1200790 non-null  object
 2   Année                               1200791 non-null  int64 
 3   Lieu Admin Actuel - Territoire Nom  1200791 non-null  object
 4   Type Accident - Libellé             1078193 non-null  object
 5   CNIT                                776011 non-null   object
 6   Catégorie véhicule                  1200791 non-null  object
 7   Age véhicule                        1096600 non-null  Int64 
 8   Type Accident - Libellé (old)       122598 non-null   object
dtypes: Int64(2), int64(1), object(6)
memory usage: 93.9+ MB


In [30]:
registered_vehicles.head()

Unnamed: 0,Id_accident,Lettre Conventionnelle Véhicule,Année,Lieu Admin Actuel - Territoire Nom,Type Accident - Libellé,CNIT,Catégorie véhicule,Age véhicule,Type Accident - Libellé (old)
0,900493,A,2014,DOM,Accident Léger,VF7MFDJYF651296,VT,,
1,900493,B,2014,DOM,Accident Léger,,Cyclo,,
2,900494,A,2014,DOM,Accident grave non mortel,,VT,4.0,
3,900495,A,2014,DOM,Accident grave non mortel,LMP21C10N026,Cyclo,7.0,
4,900496,A,2014,DOM,Accident grave non mortel,LSY91C10U174,Cyclo,6.0,


### Export the data
The data is exported to `registered_vehicles.csv` in the `data` folder.

In [31]:
registered_vehicles.to_csv("./data/registered_vehicles.csv", sep = ',', header = True, na_rep = 'n/a', index=False)