# Step 1: Data mining + DataViz

**Deadline Friday, 16th June**

- Your first task is to define the context and the scope of the project. Take the time to really understand the project and to learn as much as possible about the concepts it introduces.

- Next, get to know your dataset and analyze it almost exhaustively in order to highlight its structure, its difficulties, and its possible biases. You can use this template: [Template - Data audit report](https://docs.google.com/spreadsheets/d/1BZF56pzSsScHQZjJnM945iCcAKyxm2BqRsv7at-1bqY/edit?usp=sharing)

- I also expect at least 5 graphical representations built from your dataset, visually clear and above all relevant. For each of them, I expect:
  - A precise commentary that analyzes the figure and provides a “business” opinion.
  - A validation of the observation, either by data manipulation or by a statistical test.

**Author** Tobias Schulze
**Date** 11 June 2023

In [1]:
# loading required packages
import pandas as pd
import numpy as np
import seaborn as sbn

# Overpass API client (OpenStreetMap queries) used for the geo mapping
import overpy as op

In [2]:
# read the data files
features = pd.read_csv("./data/features.csv", na_values="N/A", low_memory=False, index_col=0)
places = pd.read_csv("./data/places.csv", na_values="N/A", low_memory=False, index_col=0)
users = pd.read_csv("./data/users.csv", na_values="N/A", low_memory=False, index_col=0)
vehicles = pd.read_csv("./data/vehicles.csv", na_values="N/A", low_memory=False, index_col=0)
registered_vehicles = pd.read_csv("./data/registered_vehicles.csv", na_values="N/A", low_memory=False, index_col=0)

In [10]:
features.head()

| Num_Acc | an | mois | jour | hrmn | lum | agg | int | atm | col | com | adr | gps | lat | long | dep |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 201400000001 | 14 | 5 | 7 | 2015 | 1 | 2 | 1 | 1.0 | 3.0 | 11 | route de don | M | 0.0 | 0.0 | 590 |
| 201400000002 | 14 | 5 | 31 | 430 | 1 | 2 | 1 | 1.0 | 6.0 | 11 | 106 ROUTE DE DON | M | 0.0 | 0.0 | 590 |
| 201400000003 | 14 | 8 | 23 | 1800 | 1 | 2 | 9 | 1.0 | 3.0 | 52 | 75 bis rue jean jaures | M | 0.0 | 0.0 | 590 |
| 201400000004 | 14 | 6 | 12 | 1700 | 1 | 2 | 1 | 1.0 | 1.0 | 25 | rue des Sablonnieres D41 | M | 0.0 | 0.0 | 590 |
| 201400000005 | 14 | 6 | 23 | 500 | 2 | 1 | 1 | 1.0 | 1.0 | 25 |  | M | 0.0 | 0.0 | 590 |


In [11]:
features.info()

<class 'pandas.core.frame.DataFrame'>
Index: 982060 entries, 201400000001 to 201300058397
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   an      982060 non-null  int64  
 1   mois    982060 non-null  int64  
 2   jour    982060 non-null  int64  
 3   hrmn    982060 non-null  object 
 4   lum     982060 non-null  int64  
 5   agg     982060 non-null  int64  
 6   int     982060 non-null  int64  
 7   atm     981987 non-null  float64
 8   col     982041 non-null  float64
 9   com     982058 non-null  object 
 10  adr     850489 non-null  object 
 11  gps     455722 non-null  object 
 12  lat     553725 non-null  object 
 13  long    553721 non-null  object 
 14  dep     982060 non-null  object 
dtypes: float64(2), int64(6), object(7)
memory usage: 119.9+ MB
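
The `info()` output lists `hrmn` as `object`, and the `head()` rows show values such as `2015`, `430`, and `500`. Below is a minimal sketch of how the date and time columns could be combined into a single timestamp. It assumes that `an` is a two-digit year counted from 2000 and that `hrmn` is an HHMM value with its leading zeros dropped. Both assumptions are read off the sample rows and are not confirmed by the data documentation. The helper name `build_timestamp` is hypothetical.

```python
import pandas as pd


def build_timestamp(df: pd.DataFrame) -> pd.Series:
    """Combine an/mois/jour/hrmn into one datetime Series.

    Assumptions (not confirmed by the source): `an` is a two-digit year
    counted from 2000, and `hrmn` is HHMM with leading zeros dropped.
    """
    # Left-pad to four characters so that "430" becomes "0430"
    hhmm = df["hrmn"].astype(str).str.zfill(4)
    parts = pd.DataFrame({
        "year": 2000 + df["an"].astype(int),
        "month": df["mois"].astype(int),
        "day": df["jour"].astype(int),
        "hour": hhmm.str[:2].astype(int),
        "minute": hhmm.str[2:].astype(int),
    })
    # errors="coerce" turns impossible dates or times into NaT instead of raising
    return pd.to_datetime(parts, errors="coerce")


# Three rows copied from the features.head() output
sample = pd.DataFrame(
    {"an": [14, 14, 14], "mois": [5, 5, 6], "jour": [7, 31, 23],
     "hrmn": ["2015", "430", "500"]},
    index=pd.Index([201400000001, 201400000002, 201400000005], name="Num_Acc"),
)
timestamps = build_timestamp(sample)
print(timestamps)
```

If `hrmn` turns out to contain other formats (for example a separator between hours and minutes), the `astype(int)` calls will raise, which is a useful signal that the column needs cleaning before this step.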


### Description of features
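- Georeferencing is incomplete: `gps` is filled for 455,722 of 982,060 rows (about 46%), and `lat`/`long` for roughly 553,700 rows (about 56%).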
- departments 
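
The completeness figures reported by `info()` can be turned into a per-column audit table. This is a minimal sketch, not the author's method: the helper name `missing_report` and the toy frame are hypothetical, and in the notebook the call would be `missing_report(features)`.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    report = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })
    return report.sort_values("pct_missing", ascending=False)


# Hypothetical toy frame that mimics a few columns of `features`
toy = pd.DataFrame({
    "adr": ["route de don", None, "rue A", "rue B"],
    "gps": ["M", None, None, None],
    "lat": [0.0, np.nan, np.nan, 48.85],
    "dep": ["590", "590", "750", "750"],
})
report = missing_report(toy)
print(report)
```

Note that `isna()` only counts true missing values. The `head()` output shows `lat` and `long` equal to `0.0`, which may be placeholders rather than real coordinates, so the actual georeferencing gap could be larger than the null counts suggest.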

In [29]:
places.head()

| Num_Acc | an | mois | jour | hrmn | lum | agg | int | atm | col | com | ... | pr1 | vosp | prof | plan | lartpc | larrout | surf | infra | situ | env1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 201400000001 | 14.0 | 5.0 | 7.0 | 2015 | 1.0 | 2.0 | 1.0 | 1.0 | 3.0 | 11 | ... |  |  |  |  |  |  |  |  |  |  |
| 201400000002 | 14.0 | 5.0 | 31.0 | 430 | 1.0 | 2.0 | 1.0 | 1.0 | 6.0 | 11 | ... |  |  |  |  |  |  |  |  |  |  |
| 201400000003 | 14.0 | 8.0 | 23.0 | 1800 | 1.0 | 2.0 | 9.0 | 1.0 | 3.0 | 52 | ... |  |  |  |  |  |  |  |  |  |  |
| 201400000004 | 14.0 | 6.0 | 12.0 | 1700 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 25 | ... |  |  |  |  |  |  |  |  |  |  |
| 201400000005 | 14.0 | 6.0 | 23.0 | 500 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 25 | ... |  |  |  |  |  |  |  |  |  |  |


In [30]:
places.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1069086 entries, 201400000001 to 200500087954
Data columns (total 32 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   an       982060 non-null  float64
 1   mois     982060 non-null  float64
 2   jour     982060 non-null  float64
 3   hrmn     982060 non-null  object 
 4   lum      982060 non-null  float64
 5   agg      982060 non-null  float64
 6   int      982060 non-null  float64
 7   atm      981987 non-null  float64
 8   col      982041 non-null  float64
 9   com      982058 non-null  object 
 10  adr      850489 non-null  object 
 11  gps      455722 non-null  object 
 12  lat      553725 non-null  object 
 13  long     553721 non-null  object 
 14  dep      982060 non-null  object 
 15  catr     87025 non-null   float64
 16  voie     79497 non-null   float64
 17  v1       86821 non-null   float64
 18  v2       2136 non-null    object 
 19  circ     87026 non-null   float64
 20  nbv      8702