# Step 1: Data mining + DataViz

**Deadline:** Friday, 16 June

- Your first task is to define the context and scope of the project. Take the time to understand the project thoroughly and to learn as much as possible about the concepts it introduces.

- You will then get to know your dataset and carry out a near-exhaustive analysis of it to highlight its structure, its difficulties, and its possible biases.

  You can use this template: [Template - Data audit report](https://docs.google.com/spreadsheets/d/1BZF56pzSsScHQZjJnM945iCcAKyxm2BqRsv7at-1bqY/edit?usp=sharing)

- I also expect at least 5 graphical representations built from your dataset that are visually clear and, above all, relevant. For each of them, I expect:
  - A precise commentary that analyzes the figure and gives a “business” interpretation.
  - A validation of the observation, either by data manipulation or by a statistical test.

**Author:** Tobias Schulze

**Date:** 11 June 2023

In [1]:
# load the required packages
import pandas as pd
import numpy as np
import seaborn as sbn

# Overpass API (OpenStreetMap) client, used for the geo mapping
import overpy as op

In [2]:
# read the data files
features = pd.read_csv("./data/features.csv", na_values="N/A", low_memory=False, index_col=0)
places = pd.read_csv("./data/places.csv", na_values="N/A", low_memory=False, index_col=0)
users = pd.read_csv("./data/users.csv", na_values="N/A", low_memory=False, index_col=0)
vehicles = pd.read_csv("./data/vehicles.csv", na_values="N/A", low_memory=False, index_col=0)
registered_vehicles = pd.read_csv("./data/registered_vehicles.csv", na_values="N/A", low_memory=False, index_col=0)

In [3]:
features.head()

| Num_Acc | an | mois | jour | hrmn | lum | agg | int | atm | col | com | adr | gps | lat | long | dep |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 201400000001 | 14 | 5 | 7 | 2015 | 1 | 2 | 1 | 1.0 | 3.0 | 11 | route de don | M | 0.0 | 0.0 | 590 |
| 201400000002 | 14 | 5 | 31 | 430 | 1 | 2 | 1 | 1.0 | 6.0 | 11 | 106 ROUTE DE DON | M | 0.0 | 0.0 | 590 |
| 201400000003 | 14 | 8 | 23 | 1800 | 1 | 2 | 9 | 1.0 | 3.0 | 52 | 75 bis rue jean jaures | M | 0.0 | 0.0 | 590 |
| 201400000004 | 14 | 6 | 12 | 1700 | 1 | 2 | 1 | 1.0 | 1.0 | 25 | rue des Sablonnieres D41 | M | 0.0 | 0.0 | 590 |
| 201400000005 | 14 | 6 | 23 | 500 | 2 | 1 | 1 | 1.0 | 1.0 | 25 |  | M | 0.0 | 0.0 | 590 |


In [7]:
places.head()

| Num_Acc | an | mois | jour | hrmn | lum | agg | int | atm | col | com | ... | pr1 | vosp | prof | plan | lartpc | larrout | surf | infra | situ | env1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 201400000001 | 14.0 | 5.0 | 7.0 | 2015 | 1.0 | 2.0 | 1.0 | 1.0 | 3.0 | 11 | ... |  |  |  |  |  |  |  |  |  |  |
| 201400000002 | 14.0 | 5.0 | 31.0 | 430 | 1.0 | 2.0 | 1.0 | 1.0 | 6.0 | 11 | ... |  |  |  |  |  |  |  |  |  |  |
| 201400000003 | 14.0 | 8.0 | 23.0 | 1800 | 1.0 | 2.0 | 9.0 | 1.0 | 3.0 | 52 | ... |  |  |  |  |  |  |  |  |  |  |
| 201400000004 | 14.0 | 6.0 | 12.0 | 1700 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 25 | ... |  |  |  |  |  |  |  |  |  |  |
| 201400000005 | 14.0 | 6.0 | 23.0 | 500 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 25 | ... |  |  |  |  |  |  |  |  |  |  |


In [4]:
users.head()

| Num_Acc | id_vehicule | num_veh | place | catu | grav | sexe | an_nais | trajet | secu1 | secu2 | secu3 | locp | actp | etatp | secu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 201900000001 | 138306524.0 | B01 | 2.0 | 2 | 4 | 2 | 2002.0 | 0.0 | 1.0 | 0.0 | -1.0 | -1.0 | -1 | -1.0 |  |
| 201900000001 | 138306524.0 | B01 | 1.0 | 1 | 4 | 2 | 1993.0 | 5.0 | 1.0 | 0.0 | -1.0 | -1.0 | -1 | -1.0 |  |
| 201900000001 | 138306525.0 | A01 | 1.0 | 1 | 1 | 1 | 1959.0 | 0.0 | 1.0 | 0.0 | -1.0 | -1.0 | -1 | -1.0 |  |
| 201900000002 | 138306523.0 | A01 | 1.0 | 1 | 4 | 2 | 1994.0 | 0.0 | 1.0 | 0.0 | -1.0 | -1.0 | -1 | -1.0 |  |
| 201900000003 | 138306520.0 | A01 | 1.0 | 1 | 1 | 1 | 1996.0 | 0.0 | 1.0 | 0.0 | -1.0 | -1.0 | 0 | -1.0 |  |


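`id_vehicule` was read as a float rather than an integer (pandas typically falls back to `float64` when an integer column contains missing values), so convert it to the nullable `Int64` dtype: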

In [7]:
# cast to the nullable integer dtype so that missing IDs are kept as <NA>
users['id_vehicule'] = users['id_vehicule'].astype('Int64')
users.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2509620 entries, 201900000001 to 200500087954
Data columns (total 15 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id_vehicule  Int64  
 1   num_veh      object 
 2   place        float64
 3   catu         int64  
 4   grav         int64  
 5   sexe         int64  
 6   an_nais      float64
 7   trajet       float64
 8   secu1        float64
 9   secu2        float64
 10  secu3        float64
 11  locp         float64
 12  actp         object 
 13  etatp        float64
 14  secu         float64
dtypes: Int64(1), float64(9), int64(3), object(2)
memory usage: 308.7+ MB


In [8]:
users.head()

| Num_Acc | id_vehicule | num_veh | place | catu | grav | sexe | an_nais | trajet | secu1 | secu2 | secu3 | locp | actp | etatp | secu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 201900000001 | 138306524 | B01 | 2.0 | 2 | 4 | 2 | 2002.0 | 0.0 | 1.0 | 0.0 | -1.0 | -1.0 | -1 | -1.0 |  |
| 201900000001 | 138306524 | B01 | 1.0 | 1 | 4 | 2 | 1993.0 | 5.0 | 1.0 | 0.0 | -1.0 | -1.0 | -1 | -1.0 |  |
| 201900000001 | 138306525 | A01 | 1.0 | 1 | 1 | 1 | 1959.0 | 0.0 | 1.0 | 0.0 | -1.0 | -1.0 | -1 | -1.0 |  |
| 201900000002 | 138306523 | A01 | 1.0 | 1 | 4 | 2 | 1994.0 | 0.0 | 1.0 | 0.0 | -1.0 | -1.0 | -1 | -1.0 |  |
| 201900000003 | 138306520 | A01 | 1.0 | 1 | 1 | 1 | 1996.0 | 0.0 | 1.0 | 0.0 | -1.0 | -1.0 | 0 | -1.0 |  |
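
The dtype conversion of `id_vehicule` can be illustrated on a toy series. This is only a minimal sketch: the series `raw` is a made-up stand-in for the real column, and it assumes the float dtype comes from missing values, which should be checked against the actual data.

```python
import pandas as pd

# Toy stand-in for the real column: integer IDs with one missing value.
raw = pd.Series([138306524, None, 138306520], name="id_vehicule")

# With a missing value present, pandas stores the column as float64.
print(raw.dtype)

# The nullable Int64 dtype keeps whole numbers and marks the gap as <NA>.
converted = raw.astype("Int64")
print(converted.dtype)
print(converted.isna().sum())
```

The cast only succeeds when every non-missing value is a whole number, so it also serves as a quick sanity check on the IDs.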


In [5]:
vehicles.head()

| Num_Acc | senc | catv | occutc | obs | obsm | choc | manv | num_veh | id_vehicule | motor |
|---|---|---|---|---|---|---|---|---|---|---|
| 200500000001 | 0.0 | 7 | 0.0 | 0.0 | 2.0 | 1.0 | 1.0 | A01 |  |  |
| 200500000001 | 0.0 | 7 | 0.0 | 0.0 | 2.0 | 8.0 | 10.0 | B02 |  |  |
| 200500000002 | 0.0 | 7 | 0.0 | 0.0 | 2.0 | 7.0 | 16.0 | A01 |  |  |
| 200500000002 | 0.0 | 2 | 0.0 | 0.0 | 2.0 | 1.0 | 1.0 | B02 |  |  |
| 200500000003 | 0.0 | 2 | 0.0 | 0.0 | 2.0 | 1.0 | 1.0 | A01 |  |  |


In [6]:
registered_vehicles.head()

| Id_accident | Lettre Conventionnelle Véhicule | Année | Lieu Admin Actuel - Territoire Nom | Type Accident - Libellé | CNIT | Catégorie véhicule | Age véhicule | Type Accident - Libellé (old) |
|---|---|---|---|---|---|---|---|---|
| 900493 | A | 2014 | DOM | Accident Léger | VF7MFDJYF651296 | VT |  |  |
| 900493 | B | 2014 | DOM | Accident Léger |  | Cyclo |  |  |
| 900494 | A | 2014 | DOM | Accident grave non mortel |  | VT | 4.0 |  |
| 900495 | A | 2014 | DOM | Accident grave non mortel | LMP21C10N026 | Cyclo | 7.0 |  |
| 900496 | A | 2014 | DOM | Accident grave non mortel | LSY91C10U174 | Cyclo | 6.0 |  |
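
For the data audit requested in the task description, a per-column summary of each table is a natural starting point. The sketch below is one possible approach, not the template's prescribed method: the helper `audit_summary` and the frame `toy` are hypothetical names introduced here, and the toy values are made up.

```python
import pandas as pd

def audit_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count and share, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
        "n_unique": df.nunique(dropna=True),
    })

# Toy frame standing in for one of the loaded tables (values are made up).
toy = pd.DataFrame(
    {
        "grav": [4, 4, 1, 1],
        "an_nais": [2002.0, 1993.0, None, 1994.0],
        "secu": [None] * 4,
    },
    index=pd.Index([1, 1, 2, 3], name="Num_Acc"),
)

summary = audit_summary(toy)
print(summary)
```

On the real data this would be called as `audit_summary(users)`, `audit_summary(vehicles)`, and so on. Values such as `-1` in the previews may be codes for "not specified" rather than real measurements; that is an assumption to verify against the dataset documentation, since such codes are not counted as missing here.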
