# Instructions

In this challenge, you'll try to predict the severity of car accidents, based on features collected during post-crash police investigations.

This [Kaggle challenge](https://www.kaggle.com/c/accident-severity) comprises 1,000,000 accident reports, split into multiple `.csv` files.

Download the data [here](https://wagon-public-datasets.s3.amazonaws.com/04-ML_08-workflow_datasets.zip). For instance:

```
cd ~/code/$GITHUB_NICKNAME/data-challenges/05-ML/08-Workflow/01-Car-accidents-severity
curl https://wagon-public-datasets.s3.amazonaws.com/car_acccidents_datasets.zip > data.zip
unzip data.zip -d data
rm -rf data.zip
```
**The goal of the model is to predict the severity of car accidents**. The target variable is called `grav` (for 'gravity') in the file `users.csv`. This variable has four levels, but in this challenge, we'll convert it to a binary classification problem.

We will:
- Load data into pandas
- Create a single DataFrame for our model
- Extract the features you think would be relevant and build a data pipeline
- Then, iterate on the different phases and try to get the best model!

We will give you the basic preprocessing, and you will build and improve the models and feature engineering.

⚠️ **Some very important good practices to follow for large exploratory notebooks**
- Build your Notebook linearly so that it can always be run from top to bottom without any errors
- Regularly clean the outputs of your cells that are not needed
- Delete the variables that are no longer needed using Python's built-in `del` statement, or the Jupyter nbextension `variable_inspector`
- Make heavy use of `table_of_content` and `collapsable_headings`

# Data sourcing

Let's get started! The data we want to use is from the `csv` files in `/data/data_training`

## Loading data

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [5]:
cara = pd.read_csv("data/data_training/caracteristics.csv", encoding="ISO-8859-1")
users = pd.read_csv("data/data_training/users.csv", encoding="ISO-8859-1")
places = pd.read_csv("data/data_training/places.csv", encoding="ISO-8859-1")
vehicles = pd.read_csv("data/data_training/vehicles.csv", encoding="ISO-8859-1")


❓ Explore the different tables and their variables using `challenge_variable.md`, which provides a description of the features. More details can be found [here](https://static.data.gouv.fr/resources/base-de-donnees-accidents-corporels-de-la-circulation/20180927-112352/description-des-bases-de-donnees-onisr-annees-2005-a-2017.pdf) if needed, or in the [Kaggle](https://www.kaggle.com/ahmedlahlou/accidents-in-france-from-2005-to-2016/discussion) discussion channel.

In [6]:
# Your code below

## Merge datasets

❓ We will create a single dataset where each row represents a `user` in a car, by merging the data from the different files.  
**Take some time to think about how you would do it yourself**, and only then read carefully through the code below to understand exactly what we did.

In [7]:
# Merge caracteristics and places on 'Num_Acc'
data = cara.merge(places, on='Num_Acc')

In [8]:
# Create a common key to merge users and vehicles on
users['Num_Acc_num_veh'] = users['Num_Acc'].astype(str) + users['num_veh']
vehicles['Num_Acc_num_veh'] = vehicles['Num_Acc'].astype(str) + vehicles['num_veh']
# Remove useless columns
vehicles = vehicles.drop(columns=['index'])
users = users.drop(columns=['index', 'Num_Acc', 'num_veh'])
# Merge vehicles and users
tmp = vehicles.merge(users, on='Num_Acc_num_veh', how='inner')

In [9]:
# Merge all datasets on 'Num_Acc'
data = data.merge(tmp, on='Num_Acc', how='inner')
del tmp

In [10]:
data.shape

(1209362, 54)
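As a side note, here is a minimal sketch of the same idea on toy tables. The values below are made up, and merging directly on the two columns is an alternative to the concatenated key used above, not what the notebook does. The `validate` argument asks pandas to raise an error if a vehicle key is duplicated.

```python
import pandas as pd

# Toy tables (made-up values) mimicking the vehicles / users structure
vehicles_toy = pd.DataFrame({
    "Num_Acc": [1, 1, 2],
    "num_veh": ["A01", "B02", "A01"],
    "catv": [7, 2, 7],
})
users_toy = pd.DataFrame({
    "Num_Acc": [1, 1, 1, 2],
    "num_veh": ["A01", "A01", "B02", "A01"],
    "grav": [1, 3, 4, 2],
})

# Merge on both columns at once: each user row is matched to its vehicle
merged_toy = vehicles_toy.merge(
    users_toy,
    on=["Num_Acc", "num_veh"],
    how="inner",
    validate="one_to_many",  # one vehicle, possibly several users
)
print(merged_toy.shape)
```

Each row of `merged_toy` is one user, which is the granularity we want for the final dataset.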

# Preprocessing

We will apply some preprocessing methods such as standardization, and removal or imputation of missing values.
Remember to look at `challenge_variable.md` for a description of features.

## Clean Dataset

In [11]:
# drop lines without targets (if any)
data_cleaned = data[~np.isnan(data.grav)]

In [None]:
# Check which features have the highest ratio of NaN per column
(data_cleaned.isna().sum() / data_cleaned.shape[0]).sort_values(ascending=False)

In [13]:
# Remove too incomplete features
too_incomplete_features=[
    'locp', 'actp', 'etatp'
]

In [14]:
# Remove features that can be safely considered useless for the predictive power of our model
useless_features=[
    'v2', 'lat', 'long', 'gps', 'pr1', 'pr', 'v1', 'adr', 'voie',
    'index_x', 'Num_Acc', 'Num_Acc_num_veh', 'num_veh', 'index_y',
    'jour', 'an',
    'dep', 'com', 'env1',
]

In [15]:
# Reassign instead of using inplace=True, since data_cleaned is a filtered slice of data
data_cleaned = data_cleaned.drop(columns=too_incomplete_features + useless_features)

In [16]:
# The secu feature seems extremely important. Let's handle this one specifically.
# Drop lines without 'secu'
data_cleaned = data_cleaned[~np.isnan(data_cleaned.secu)]
# Only keep rows with "secu" number consisting of two digits
data_cleaned = data_cleaned[data_cleaned.secu.map(lambda x: len(str(round(x)))) == 2]
# Split as per feature description
data_cleaned['safety_equipment'] = data_cleaned.secu.map(lambda x: str(round(x))[0])
data_cleaned['is_safety_equipment'] = data_cleaned.secu.map(lambda x: str(round(x))[1])
data_cleaned.drop(columns=['secu'], inplace=True)

In [17]:
# Replace hrmn by hh only (minute granularity is considered useless)
data_cleaned['hour_of_day'] = pd.Series(data_cleaned.hrmn.map(lambda x: str(x)[0:-2])).replace('', 0)
data_cleaned.drop(columns=['hrmn'], inplace=True)
data_cleaned.shape

(1125397, 33)

We now have a `data_cleaned` dataset! Let's now engineer our features as needed.

## Prepare features and target

### Cyclical features

In [44]:
features_cyclical = ['hour_of_day', 'mois']

In [57]:
# YOUR CODE BELOW
def preprocess_cyclical_features(X):
    '''
    Input: DataFrame X
    Output: Returns new DataFrame, where all its features X_i have been replaced
    by both their sin(X_i) and cos(X_i), and delete initial feature X_i.
    '''
    pass

In [None]:
# Check your code below
preprocess_cyclical_features(data_cleaned[features_cyclical])

❓ Do you get a Warning "A value is trying to be set on a copy of a slice from a DataFrame"?
If so, it may be because you are trying to modify the input DataFrame `data_cleaned`!

Read this [important blog on copy vs. view](https://www.practicaldatascience.org/html/views_and_copies_in_pandas.html) of pandas DataFrame and try to solve your warning by yourself.



<details>
    <summary>Hint</summary>

`pd.DataFrame.copy()`
</details>
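For reference, here is one possible sketch of a sin/cos encoding on toy data. The helper name `encode_cyclical`, its `periods` argument and the choice to scale each value by its cycle length (24 hours, 12 months) are assumptions made for illustration, not the expected solution. Working on a copy avoids the warning discussed above.

```python
import numpy as np
import pandas as pd

def encode_cyclical(X, periods):
    """Return a new DataFrame where each column is replaced by its sin and cos.

    `periods` (hypothetical argument) maps a column name to its cycle length.
    """
    X = X.copy()  # work on a copy so the caller's DataFrame is left untouched
    for col, period in periods.items():
        angle = 2 * np.pi * X[col].astype(float) / period
        X[f"sin_{col}"] = np.sin(angle)
        X[f"cos_{col}"] = np.cos(angle)
        X = X.drop(columns=[col])
    return X

# Toy input: hours stored as strings (as in data_cleaned), months as integers
toy = pd.DataFrame({"hour_of_day": ["0", "6", "23"], "mois": [1, 6, 12]})
encoded = encode_cyclical(toy, {"hour_of_day": 24, "mois": 12})
print(encoded.columns.tolist())
```

With this scaling, hour 23 ends up close to hour 0 on the circle, which a raw integer encoding would not capture.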

### Numerical features

In [49]:
features_numerical = ['nbv', 'senc', 'an_nais', 'occutc', 'lartpc', 'larrout']

In [50]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler, StandardScaler

def preprocess_numerical_features(X):
    '''
    Returns a new DataFrame with
    - Missing values replaced by Column Mean
    - Features Standard Scaled
    - Original Features names kept in the DataFrame
    '''
    pass
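
As an illustration only, the three requirements in the docstring could be met along these lines. The helper name `scale_numerical` and the toy values are assumptions, and other designs (for example a `ColumnTransformer`) would work as well.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def scale_numerical(X):
    """Mean-impute, then standard-scale, keeping column names and index."""
    pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
    values = pipe.fit_transform(X)
    # fit_transform returns a NumPy array: rebuild a DataFrame around it
    return pd.DataFrame(values, columns=X.columns, index=X.index)

# Toy input with one missing value (made-up numbers)
toy = pd.DataFrame({"nbv": [2.0, np.nan, 4.0], "larrout": [60.0, 70.0, 80.0]})
scaled = scale_numerical(toy)
print(scaled)
```

Rebuilding the DataFrame with the original index matters later, when the three preprocessed blocks are concatenated row by row.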


### Categorical features

❓ Create the last group of features (categorical features) without hardcoding them manually. Then, create the associated preprocessing method.

In [58]:
features_categorical = list(set(data_cleaned.columns) - set(features_numerical) - set(features_cyclical) - {'grav'})

In [60]:
def preprocess_categorical_features(X):   
    ''' Returns a new DataFrame with dummified columns'''
    df = X.copy()
    return pd.get_dummies(df.apply(lambda col: col.astype('category')))

❓ Create the new `data_preprocessed` dataset by concatenating the outputs of all three preprocessing functions, and then drop any remaining NaN that our preprocessing could not handle. You should get a (1125397, 216) shape.

In [55]:
# YOUR CODE BELOW

(1125397, 216)

## Split Dataset
❓ Create X and y, and don't forget to convert the classification into a binary task.
For instance:

```python
data['grav_binary'] = data['grav'].replace({1: 0, 4: 0, 2: 1, 3: 1})
```

In [27]:
# Create X and y

In [28]:
# Create a smaller dataset (X_small, y_small) for investigation purpose only

In [29]:
# Train Test Split both datasets

In [62]:
# (Optional) Create here - only when needed - a train/eval split within the train set itself.
# Some powerful models (XGBoost, neural networks...) that are prone to overfitting the training set need an "early stopping" criterion, to avoid descending the gradient completely and overfitting.
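
A possible way to chain the two splits is sketched below on synthetic data. The split sizes, the `random_state` values and the use of `stratify` are illustrative choices, not requirements of the challenge.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data (roughly 20% positives)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 5))
y_toy = (rng.random(1000) < 0.2).astype(int)

# First split: hold out a test set, keeping the class ratio with `stratify`
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

# Second split, inside the train set only: an eval set for early stopping
X_train_sub, X_eval, y_train_sub, y_eval = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=0
)
print(len(X_train_sub), len(X_eval), len(X_test))
```

The test set is never touched by the second split, so it stays a fair estimate of generalization.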

# Features exploration

You now have a dataset ready for training! 
**Skip directly to section 5 to get a baseline model working ASAP**, and only then come back to this section 4 if you want to better understand your X and get inspiration for the best model to use, or for some feature selection to reduce model complexity.

## Visualization

❓ Investigate your X. Are features strongly correlated? Are some features more important than others?

In [63]:
# YOUR CODE HERE

## PCA

❓ Fit a PCA and plot the cumulative sum of the explained variance ratio of your principal components. Do you see any clear elbow?

In [32]:
# YOUR CODE HERE
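
As a hedged sketch, the cumulative explained variance can be computed as follows. The synthetic data (10 features driven by 3 latent factors) is an assumption chosen so that an elbow is visible; your real X may not show one.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 features generated from 3 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
X_toy = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(500, 10))

pca = PCA().fit(X_toy)
# Cumulative share of variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative.round(3))
```

Passing `cumulative` to `plt.plot` gives the curve asked for above; here it should flatten after the third component.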

## Forest-based most important features

❓ Fit a default RandomForestClassifier on a small sample to estimate the top 20 feature importances. Do they make intuitive sense to you? Do you see any clear elbow for dimension reduction?

In [64]:
# YOUR CODE HERE
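
One way to rank impurity-based importances is sketched here on synthetic data. The feature names `f0`...`f7` are placeholders, and `n_estimators=50` is used only to keep the example fast (the scikit-learn default is 100).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic dataset with placeholder feature names
X_arr, y_toy = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Pair each importance with its column name and sort from most to least important
importances = pd.Series(
    forest.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)
print(importances.head(5))
```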

❓ (Optional) There are better ways to estimate feature importance in a RandomForest. Feel free to try the two following options.

**Option 1** : Recursive method
1. Train a first model, note the top1 feature (computed based on the gini-explicative power of the feature, in each tree)
2. Remove top1 from your X and retrain a RandomForest. Note the new top1 feature and its relative importance
3. Loop

**Option 2** : Permutation method ([sklearn.inspection.permutation_importance](https://scikit-learn.org/stable/modules/permutation_importance.html#permutation-importance)), works with any model!
1. Train a first model, keep track of its accuracy
2. Take one feature and shuffle its column. Compute the new accuracy on the corrupted dataset, and note by how much it has been reduced.
3. Loop over all features and rank them by accuracy reduction

In [33]:
# YOUR CODE HERE
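
Option 2 can be sketched with scikit-learn's `permutation_importance`, shown here on synthetic data. The dataset, the model settings and `n_repeats=5` are illustrative assumptions; with `shuffle=False`, the two informative features are columns 0 and 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: columns 0 and 1 are informative, the other four are noise
X_toy, y_toy = make_classification(
    n_samples=600, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X_toy, y_toy, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each column several times and measure the drop in test accuracy
result = permutation_importance(
    model, X_test, y_test, scoring="accuracy", n_repeats=5, random_state=0
)
ranking = result.importances_mean.argsort()[::-1]  # most important first
print(ranking)
```

Computing the importances on held-out data, as done here, reflects what the model actually relies on to generalize.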

# Modeling

## Baseline performance metrics

❓ What is the class balance of your target?

What would be the dumbest baseline to beat? Print the `classification_report` of this dumb model.

In [None]:
# YOUR CODE HERE
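
For illustration, scikit-learn's `DummyClassifier` can play the role of such a baseline. The 80/20 toy target below is made up and is not the class balance of the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

# Made-up imbalanced target: 80 zeros, 20 ones (features are irrelevant here)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((100, 1))

# Always predict the most frequent class seen during fit
dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_pred = dummy.predict(X_toy)

# zero_division=0 silences the warning for the class that is never predicted
print(classification_report(y_toy, y_pred, zero_division=0))
```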

❓ If you don't want to favor any class over the other, what would be a good performance metric for your problem?
Take some time to think before reading the answer! It's not that obvious.

<details>
    <summary>Answer</summary>

In such an unbalanced problem, accuracy is meaningless: a very dumb model always predicting zero would have great accuracy, to the detriment of its predictive power on class 1, for which precision and recall would both be zero!

The unweighted mean of the F1 scores of both classes, called `f1_macro`, would be a good measure for this type of problem.
</details>
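
In scikit-learn, a metric can be selected by name through the `scoring` argument. The sketch below compares two scoring strings on a synthetic imbalanced dataset; the data and the logistic regression are placeholders, not a recommended model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced dataset (about 80% of class 0)
X_toy, y_toy = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
model = LogisticRegression(max_iter=1000)

# Same model, same folds, two different scoring strings
acc = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy").mean()
f1m = cross_val_score(model, X_toy, y_toy, cv=5, scoring="f1_macro").mean()
print(f"accuracy={acc:.3f}  f1_macro={f1m:.3f}")
```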

## Simple Model (A first iteration)

❓ Create a simple model, fast to train, to classify the severity of the accidents. Start simple. Don't forget to fit on your training set and evaluate the score on your test set. Can you beat the baseline? What about its accuracy? Measure the time it takes on the full dataset with `%%time`.

In [66]:
%%time
# YOUR CODE BELOW (the %%time cell magic must stay on the first line of the cell)

# üî•üî•üî• Advanced Models - LeWagon batch contest ! üî•üî•üî•

‚ùì Now it's your turn to shine! Play with different models and try to find the best one on your training set!
- Send your best score (as defined above) to your slack channel without saying which model you used!
- ‚ö†Ô∏è Only send score tested on the `y_test` of complete size (1M+ rows!)
- Feel free to use your X_small for investigation purpose
- If it takes too long to train, simplify your model, or use better feature preprocessing/selection

The winner will present its notebook to the class during the reboot üí™

(Don't forget, your Notebook should be made to be run from top to bottom in one go!)

In [68]:
# YOUR CODE HERE

### (Optional) - Pipeline most steps (preprocessing & fit) in one single Sklearn Pipeline
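
A minimal sketch of such a pipeline is given below on a toy DataFrame. The two columns, the step names and the logistic regression are illustrative assumptions; the cyclical features would need their own transformer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (made-up values): one numerical and one categorical column
toy = pd.DataFrame({
    "nbv": [2.0, np.nan, 4.0, 2.0, 3.0, 1.0],
    "catu": ["1", "2", "1", "3", "2", "1"],
})
y_toy = np.array([0, 1, 0, 1, 1, 0])

# Route each group of columns to its own preprocessing
preprocessor = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), ["nbv"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["catu"]),
])

# Preprocessing and model are fitted together, on the training data only
pipeline = Pipeline([("preprocess", preprocessor), ("model", LogisticRegression())])
pipeline.fit(toy, y_toy)
preds = pipeline.predict(toy)
print(preds)
```

Because the imputer and scaler are fitted inside the pipeline, calling `fit` on the train set alone keeps test-set statistics out of the preprocessing.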