In [2]:
import pandas as pd
import numpy as np

# Final processing of the geolocation
The geolocations were already processed while importing and merging the files. The following steps were applied:

- removed all non-metropolitan departments
- merged the Corsican departments 2A and 2B into a single department 2
- imputed the geolocation from the address where possible
- replaced the spurious (0, 40) geolocation with NA
- set all other missing or invalid geolocations to NA (e.g. (0, 0), which is not in France)
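The two NA-replacement steps above can be sketched roughly as follows. This is a minimal illustration on toy data, not the code used during import: only the column names `lat` and `long` come from this notebook, and reading (0, 40) as `lat == 0` and `long == 40` is an assumption.

```python
import numpy as np
import pandas as pd

# Toy data; the coordinate values are made up for illustration.
toy = pd.DataFrame({
    "lat":  [48.8566, 0.0, 0.0, 43.2965],
    "long": [2.3522, 40.0, 0.0, 5.3698],
})

# Assumption: (0, 40) and (0, 0) are (lat, long) pairs used as placeholders.
is_0_40 = (toy["lat"] == 0) & (toy["long"] == 40)
is_0_0 = (toy["lat"] == 0) & (toy["long"] == 0)

# Replace both coordinates with NA for the placeholder rows only.
toy.loc[is_0_40 | is_0_0, ["lat", "long"]] = np.nan
```

Rows with real coordinates are left untouched, so a later `dropna` on `lat` and `long` removes exactly the placeholder rows.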

In this notebook, we remove the records whose geolocation is still missing and set `gps` to `M` for all remaining records.

In [43]:
merged_table = pd.read_csv('./data/merged_tables.csv', low_memory=False, header = 0, index_col=0, na_values='n/a')

The variables of interest are `lat` and `long`. Each should hold a float value if a geolocation exists and NA otherwise.
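One way to verify this expectation is sketched below on toy data. The string inputs are an assumption, chosen to show what happens when a coordinate column was read as text; `pd.to_numeric` with `errors="coerce"` turns anything non-numeric into NA.

```python
import pandas as pd

# Toy data: coordinates stored as strings, with two kinds of missing values.
toy = pd.DataFrame({
    "lat":  ["48.85", "n/a", "45.76"],
    "long": ["2.35", "4.83", None],
})

# Coerce both columns to float; unparseable entries become NaN.
for col in ["lat", "long"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Count missing values per column and rows missing at least one coordinate.
missing_per_column = toy[["lat", "long"]].isna().sum()
rows_missing_either = toy[["lat", "long"]].isna().any(axis=1).sum()
```

`rows_missing_either` is the number of rows that a `dropna(subset=['lat', 'long'])` call would remove.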

We can remove all rows where either value is NA.

In [44]:
merged_table.dropna(subset = ['lat', 'long'], inplace=True)

## Change values in `gps`
First, check and replace values in `gps`.

In [45]:
merged_table['gps'].unique()

array([nan, 'M', 'A'], dtype=object)

The dataset still contains records from the Antilles (`gps == 'A'`), so we drop them.

In [46]:
merged_table.drop(merged_table[merged_table['gps'] == 'A'].index, inplace = True)
merged_table['gps'].unique()

array([nan, 'M'], dtype=object)
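A note on the drop above: `DataFrame.drop` works on index labels, so if the index is not unique it removes every row sharing a label with a matching row. Whether the index of `merged_table` contains duplicates is not shown here, so the toy example below only illustrates the difference between a label-based drop and a boolean mask; the index values are hypothetical.

```python
import pandas as pd

# Toy data with a deliberately non-unique index (hypothetical IDs).
toy = pd.DataFrame(
    {"gps": ["M", "A", None, "A"]},
    index=[1, 2, 2, 3],
)

# Label-based drop: removes all rows labelled 2 or 3,
# including the row with label 2 whose gps is missing.
by_label = toy.drop(toy[toy["gps"] == "A"].index)

# Boolean mask: removes only the rows where gps is 'A'.
# Missing values compare as "not equal" and are therefore kept.
by_mask = toy[toy["gps"] != "A"]
```

If `gps` is constant within each index label, both variants give the same result.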

Now fill the remaining gaps in `gps` with `M`:

In [48]:
merged_table['gps'] = merged_table['gps'].fillna('M')
merged_table['gps'].unique()

array(['M'], dtype=object)

In [49]:
merged_table.info()

<class 'pandas.core.frame.DataFrame'>
Index: 416924 entries, 201900000001 to 201800049520
Data columns (total 57 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   id_vehicule    347617 non-null  float64
 1   num_veh        416924 non-null  object 
 2   place          412775 non-null  float64
 3   catu           416924 non-null  int64  
 4   grav           416924 non-null  int64  
 5   sexe           416924 non-null  int64  
 6   an_nais        413902 non-null  float64
 7   trajet         416863 non-null  float64
 8   secu1          347617 non-null  float64
 9   secu2          347617 non-null  float64
 10  secu3          347617 non-null  float64
 11  locp           414798 non-null  float64
 12  actp           414793 non-null  object 
 13  etatp          414791 non-null  float64
 14  secu           67378 non-null   float64
 15  an             416924 non-null  int64  
 16  mois           416924 non-null  int64  
 17  jour           41

In [50]:
merged_table.to_csv("./data/merged_tables.csv", sep = ',', header = True, na_rep = 'n/a', index=True)
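The file is written with `na_rep='n/a'` and was read with `na_values='n/a'`, so missing values should survive the round trip. A small sketch on toy data, using an in-memory buffer instead of the real file:

```python
import io

import pandas as pd

# Toy frame with missing values in a text and a numeric column.
toy = pd.DataFrame(
    {"gps": ["M", None], "lat": [48.85, None]},
    index=[10, 11],
)

# Write with the same NA representation as the notebook.
buffer = io.StringIO()
toy.to_csv(buffer, sep=",", header=True, na_rep="n/a", index=True)

# Read back with the matching na_values setting.
buffer.seek(0)
restored = pd.read_csv(buffer, header=0, index_col=0, na_values="n/a")
```

After reading, the `n/a` markers are NA again in both columns, and the numeric column is parsed as float.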