# [Paris Saclay Center for Data Science](http://www.datascience-paris-saclay.fr)


## [Kaggle Seguro RAMP](http://www.ramp.studio/problems/kaggle_seguro): Kaggle Porto-Seguro safe driver prediction

_Balázs Kégl (LAL/CNRS)_

## Introduction
This is a [Kaggle data challenge](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction) on predicting the probability that a driver will initiate an auto insurance claim in the next year.

### Requirements

* numpy>=1.10.0  
* matplotlib>=1.5.0 
* pandas>=0.19.0  
* scikit-learn>=0.19   

In [5]:
%matplotlib inline
import os
import glob
import numpy as np
from scipy import io
import matplotlib.pyplot as plt
import pandas as pd

## Exploratory data analysis

### Loading the data

The repo contains mock data in `/data`, simulating the format of the official Kaggle data, but smaller in size and containing random features. If you want to run the notebook on the official Kaggle data, sign up for the [challenge](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction), download `train.7z` and `test.7z`, extract them, and place them in `kaggle_data/`. If you want to use the starting kit to generate output in the right Kaggle submission format, you will also need to download `sample_submission.7z`, extract it, and place it in `kaggle_data/`.

In [6]:
train_filename = 'data/train.csv'

In [7]:
data = pd.read_csv(train_filename)

In [8]:
data.head()

Unnamed: 0.1,Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,0,20508,0,1,1,2,1,0,1,0,...,5,0,3,3,0,1,0,0,0,0
1,1,328282,0,1,1,5,1,0,1,0,...,11,2,5,6,0,1,1,0,0,0
2,2,694055,0,4,1,3,1,0,0,0,...,3,2,3,7,0,1,1,0,0,1
3,3,310315,0,0,1,5,1,0,1,0,...,3,1,1,6,0,1,1,0,0,1
4,4,254421,0,2,1,3,1,0,1,1,...,7,0,10,10,0,1,0,1,0,0


In [9]:
data.describe()

Unnamed: 0.1,Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
count,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,...,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0
mean,199.5,746489.3,0.035,1.79,1.3325,4.4525,0.395,0.4775,0.3825,0.2725,...,5.46,1.44,3.065,7.46,0.1175,0.635,0.58,0.2525,0.31,0.1875
std,115.614301,436703.1,0.18401,1.930081,0.610636,2.662794,0.489463,1.495019,0.486606,0.445803,...,2.217135,1.2,1.824582,2.593302,0.322418,0.482033,0.494177,0.434991,0.463072,0.390801
min,0.0,2604.0,0.0,0.0,-1.0,0.0,0.0,-1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,99.75,355203.5,0.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,...,4.0,1.0,2.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,199.5,760247.5,0.0,1.0,1.0,4.0,0.0,0.0,0.0,0.0,...,5.0,1.0,3.0,7.0,0.0,1.0,1.0,0.0,0.0,0.0
75%,299.25,1128489.0,0.0,3.0,2.0,6.0,1.0,0.0,1.0,1.0,...,7.0,2.0,4.0,9.0,0.0,1.0,1.0,1.0,1.0,0.0
max,399.0,1476663.0,1.0,7.0,4.0,11.0,1.0,6.0,1.0,1.0,...,12.0,6.0,10.0,17.0,1.0,1.0,1.0,1.0,1.0,1.0


In [10]:
data.dtypes

Unnamed: 0          int64
id                  int64
target              int64
ps_ind_01           int64
ps_ind_02_cat       int64
ps_ind_03           int64
ps_ind_04_cat       int64
ps_ind_05_cat       int64
ps_ind_06_bin       int64
ps_ind_07_bin       int64
ps_ind_08_bin       int64
ps_ind_09_bin       int64
ps_ind_10_bin       int64
ps_ind_11_bin       int64
ps_ind_12_bin       int64
ps_ind_13_bin       int64
ps_ind_14           int64
ps_ind_15           int64
ps_ind_16_bin       int64
ps_ind_17_bin       int64
ps_ind_18_bin       int64
ps_reg_01         float64
ps_reg_02         float64
ps_reg_03         float64
ps_car_01_cat       int64
ps_car_02_cat       int64
ps_car_03_cat       int64
ps_car_04_cat       int64
ps_car_05_cat       int64
ps_car_06_cat       int64
ps_car_07_cat       int64
ps_car_08_cat       int64
ps_car_09_cat       int64
ps_car_10_cat       int64
ps_car_11_cat       int64
ps_car_11           int64
ps_car_12         float64
ps_car_13         float64
ps_car_14   

In [11]:
data.count()

Unnamed: 0        400
id                400
target            400
ps_ind_01         400
ps_ind_02_cat     400
ps_ind_03         400
ps_ind_04_cat     400
ps_ind_05_cat     400
ps_ind_06_bin     400
ps_ind_07_bin     400
ps_ind_08_bin     400
ps_ind_09_bin     400
ps_ind_10_bin     400
ps_ind_11_bin     400
ps_ind_12_bin     400
ps_ind_13_bin     400
ps_ind_14         400
ps_ind_15         400
ps_ind_16_bin     400
ps_ind_17_bin     400
ps_ind_18_bin     400
ps_reg_01         400
ps_reg_02         400
ps_reg_03         400
ps_car_01_cat     400
ps_car_02_cat     400
ps_car_03_cat     400
ps_car_04_cat     400
ps_car_05_cat     400
ps_car_06_cat     400
ps_car_07_cat     400
ps_car_08_cat     400
ps_car_09_cat     400
ps_car_10_cat     400
ps_car_11_cat     400
ps_car_11         400
ps_car_12         400
ps_car_13         400
ps_car_14         400
ps_car_15         400
ps_calc_01        400
ps_calc_02        400
ps_calc_03        400
ps_calc_04        400
ps_calc_05        400
ps_calc_06

In [12]:
np.unique(data['target'])

array([0, 1])

In [13]:
data.groupby('target').count()[['id']]

Unnamed: 0_level_0,id
target,Unnamed: 1_level_1
0,386
1,14
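
The counts above show a strong class imbalance: only 14 of the 400 mock training rows are positive. A minimal sketch of what this implies, using only the two counts from the table (the variable names are illustrative, not part of the starting kit):

```python
# Class counts taken from the groupby output above (mock data).
n_negative = 386
n_positive = 14
n_total = n_negative + n_positive

# Fraction of drivers who filed a claim.
positive_rate = n_positive / n_total

# Accuracy of a trivial model that always predicts "no claim".
majority_accuracy = n_negative / n_total

print(f"positive rate: {positive_rate:.3f}")
print(f"always-0 accuracy: {majority_accuracy:.3f}")
```

A constant prediction already reaches an accuracy of 0.965, so accuracy says little on this problem; ranking-based scores such as AUC or normalized Gini are more informative.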


## The pipeline

To submit at the [RAMP site](http://ramp.studio), you will have to write two classes, saved in two different files:
* the class `FeatureExtractor`, which will be used to extract features for classification from the dataset and produce a numpy array of size (number of samples $\times$ number of features), and  
* the class `Classifier` to predict the target.
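
To make the division of labor concrete, here is a rough sketch of how the two classes could be chained on a single train/test split. It is not the actual ramp-workflow code: `train_and_predict` is a hypothetical helper, and both classes are toy stand-ins that only mimic the interface described above.

```python
import numpy as np
import pandas as pd


class FeatureExtractor:
    """Stand-in with the interface described above."""

    def fit(self, X_df, y):
        return self

    def transform(self, X_df):
        # Return a (number of samples x number of features) numpy array.
        return X_df.values


class Classifier:
    """Toy stand-in: predicts the training class frequency for every row."""

    def fit(self, X, y):
        self.p1_ = float(np.mean(y))
        return self

    def predict_proba(self, X):
        p1 = np.full(len(X), self.p1_)
        return np.column_stack([1.0 - p1, p1])


def train_and_predict(X_train_df, y_train, X_test_df):
    """Hypothetical helper: chain the two classes on one train/test split."""
    fe = FeatureExtractor()
    fe.fit(X_train_df, y_train)
    clf = Classifier()
    clf.fit(fe.transform(X_train_df), y_train)
    return clf.predict_proba(fe.transform(X_test_df))


X_train_df = pd.DataFrame({"ps_ind_01": [1, 4, 0, 2], "ps_ind_03": [2, 5, 3, 5]})
y_train = np.array([0, 0, 1, 0])
X_test_df = pd.DataFrame({"ps_ind_01": [3, 1], "ps_ind_03": [4, 2]})

y_proba = train_and_predict(X_train_df, y_train, X_test_df)
print(y_proba.shape)  # (2, 2): one row per test sample, one column per class
```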

### Feature extractor

The feature extractor implements a `transform` member function. It is saved in the file [`submissions/starting_kit/feature_extractor.py`](/edit/submissions/starting_kit/feature_extractor.py). It receives the pandas dataframe `X_df` defined at the beginning of the notebook. It should produce a numpy array representing the extracted features, which will then be used for the classification.  

Note that the following code cells are *not* executed in the notebook. The notebook saves their contents in the file specified in the first line of the cell, so you can edit your submission before running the local test below and submitting it at the RAMP site.

In [14]:
%%file submissions/starting_kit/feature_extractor.py
class FeatureExtractor():
    def __init__(self):
        pass

    def fit(self, X_df, y):
        pass

    def transform(self, X_df):
        return X_df.values



Overwriting submissions/starting_kit/feature_extractor.py
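
As one possible next step, the sketch below drops identifier-like columns before returning the array. It assumes that `X_df` may still contain `id` and the `Unnamed: ...` index columns seen in `data.head()`; whether the starting kit removes them upstream is not shown in this notebook, so treat the column choices as assumptions.

```python
import pandas as pd


class FeatureExtractor:
    """Sketch: drop identifier-like columns before returning the array."""

    # Assumption: these columns, if present, carry no predictive signal.
    id_like_prefixes = ("Unnamed",)
    id_like_names = ("id",)

    def fit(self, X_df, y):
        # Remember which columns to keep so train and test stay aligned.
        self.columns_ = [
            c for c in X_df.columns
            if c not in self.id_like_names
            and not c.startswith(self.id_like_prefixes)
        ]
        return self

    def transform(self, X_df):
        return X_df[self.columns_].values


X_df = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "id": [20508, 328282],
    "ps_ind_01": [1, 1],
    "ps_ind_03": [2, 5],
})
fe = FeatureExtractor().fit(X_df, y=None)
X = fe.transform(X_df)
print(fe.columns_)  # ['ps_ind_01', 'ps_ind_03']
print(X.shape)      # (2, 2)
```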


### Classifier

The classifier follows a classical scikit-learn classifier template. It should be saved in the file [`submissions/starting_kit/classifier.py`](/edit/submissions/starting_kit/classifier.py). In its simplest form it creates a scikit-learn estimator or pipeline, assigns it to `self.clf` (in `__init__` or, as in the cell below, in `fit`), then calls its `fit` and `predict_proba` functions in the corresponding member functions.

In [15]:
%%file submissions/starting_kit/classifier.py
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier


class Classifier(BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y):
        self.clf = RandomForestClassifier(
            n_estimators=2, max_leaf_nodes=2, random_state=61)
        self.clf.fit(X, y)

    def predict(self, X):
        return self.clf.predict(X)

    def predict_proba(self, X):
        return self.clf.predict_proba(X)



Overwriting submissions/starting_kit/classifier.py
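
The description above mentions wrapping a scikit-learn pipeline. A minimal sketch of that variant follows; the scaler and logistic regression are swapped in purely for illustration and are not the starting kit's model, and the random toy data only checks that the class runs.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class Classifier(BaseEstimator):
    def __init__(self):
        # Illustrative choice: feature scaling followed by logistic regression.
        self.clf = make_pipeline(StandardScaler(), LogisticRegression())

    def fit(self, X, y):
        self.clf.fit(X, y)
        return self

    def predict_proba(self, X):
        return self.clf.predict_proba(X)


# Toy data: the label depends on the sign of the first feature.
rng = np.random.RandomState(61)
X = rng.normal(size=(40, 3))
y = (X[:, 0] > 0).astype(int)

y_proba = Classifier().fit(X, y).predict_proba(X)
print(y_proba.shape)  # (40, 2)
```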


## Local testing (before submission)

It is <b><span style="color:red">important that you test your submission files before submitting them</span></b>. For this we provide a unit test. Note that the test runs on your files in [`submissions/starting_kit`](/tree/submissions/starting_kit), not on the classes defined in the cells of this notebook.

First `pip install ramp-workflow` or install it from the [github repo](https://github.com/paris-saclay-cds/ramp-workflow). Make sure that the python files `classifier.py` and `feature_extractor.py` are in the  [`submissions/starting_kit`](/tree/submissions/starting_kit) folder, and the data `train.csv` and `test.csv` are in [`data`](/tree/data). Then run

```ramp_test_submission```

If it runs and prints training and test errors on each fold, then you can submit the code.

Note that `kaggle_data/test.csv` is the actual Kaggle test file, so we have no test labels. To keep the test from crashing, we mock all-zero labels for the test points. This means that the **test scores are not meaningful** (only the valid scores are).

In [16]:
!ramp_test_submission

[38;5;178m[1mTesting Kaggle Porto-Seguro safe driver prediction[0m
[38;5;178m[1mReading train and test files from ./data ...[0m
[38;5;178m[1mReading cv ...[0m
[38;5;178m[1mTraining ./submissions/starting_kit ...[0m
[38;5;178m[1mCV fold 0[0m
[38;5;10m	train ngini = 0.118[0m
[38;5;12m	valid ngini = 0.108[0m
[38;5;9m	test ngini = 0.479[0m
[38;5;150m	train auc = 0.557[0m
[38;5;105m	valid auc = 0.557[0m
[38;5;218m	test auc = 0.37[0m
[38;5;150m	train acc = 0.964[0m
[38;5;105m	valid acc = 0.964[0m
[38;5;218m	test acc = 1.0[0m
[38;5;150m	train nll = 0.156[0m
[38;5;105m	valid nll = 0.156[0m
[38;5;218m	test nll = 0.037[0m
[38;5;178m[1m----------------------------[0m
[38;5;178m[1mMean CV scores[0m
[38;5;178m[1m----------------------------[0m
[38;5;10mtrain ngini = 0.118 ± 0.0[0m
[38;5;150mtrain auc = 0.557 ± 0.0[0m
[38;5;150mtrain acc = 0.964 ± 0.0[0m
[38;5;150mtrain nll = 0.156 ± 0.0[0m
[38;5;12mvalid ngini = 0.108 ± 0.0[0m
[38;5;105mval

You can use the `--quick-test` switch to test the notebook on the mock data sets in `data/`. Since the data is random, the scores will not be meaningful, but it can be useful to run this first on your submissions to make sure they run without errors.

In [17]:
!ramp_test_submission --quick-test

[38;5;178m[1mTesting Kaggle Porto-Seguro safe driver prediction[0m
[38;5;178m[1mReading train and test files from ./data ...[0m
[38;5;178m[1mReading cv ...[0m
[38;5;178m[1mTraining ./submissions/starting_kit ...[0m
[38;5;178m[1mCV fold 0[0m
[38;5;10m	train ngini = 0.119[0m
[38;5;12m	valid ngini = 0.112[0m
[38;5;9m	test ngini = 0.486[0m
[38;5;150m	train auc = 0.558[0m
[38;5;105m	valid auc = 0.557[0m
[38;5;218m	test auc = 0.372[0m
[38;5;150m	train acc = 0.697[0m
[38;5;105m	valid acc = 0.697[0m
[38;5;218m	test acc = 1.0[0m
[38;5;150m	train nll = 0.608[0m
[38;5;105m	valid nll = 0.608[0m
[38;5;218m	test nll = 0.358[0m
[38;5;178m[1m----------------------------[0m
[38;5;178m[1mMean CV scores[0m
[38;5;178m[1m----------------------------[0m
[38;5;10mtrain ngini = 0.119 ± 0.0[0m
[38;5;150mtrain auc = 0.558 ± 0.0[0m
[38;5;150mtrain acc = 0.697 ± 0.0[0m
[38;5;150mtrain nll = 0.608 ± 0.0[0m
[38;5;12mvalid ngini = 0.112 ± 0.0[0m
[38;5;105mva

## Other models in the starting kit

You can also keep several other submissions in your work directory [`submissions`](/tree/submissions) and test them using
```
ramp_test_submission --submission <submission_name>
```
where `<submission_name>` is the name of the folder in `submissions/`.

## Submitting to Kaggle

You can use this starting kit to train models and submit their predictions to Kaggle. `problem.save_y_pred` implements outputting the predictions. You can turn this on using the `--save-y-preds` switch:
```
ramp_test_submission --submission <submission_name> --save-y-preds
```
This will create the directory tree
```
submissions/<submission_name>/training_output
├── bagged_test_scores.csv
├── bagged_train_valid_scores.csv
├── fold_0
│   └── y_pred_test.csv
├── ...
├── fold_<k-1>
│   └── y_pred_test.csv
└── y_pred_bagged_test.csv
```
You can find test prediction vectors in each fold folder `submissions/<submission_name>/training_output/fold_<i>` and the bagged prediction vector **`submissions/<submission_name>/training_output/y_pred_bagged_test.csv`**. It is the latter that you should submit to Kaggle.

If your goal is to use this starting kit to optimize your Kaggle submission, besides optimizing your feature extractor and classifier, you can also tune the CV bagging scheme by changing the type of cross-validation, the number of folds, and the test proportion in `problem.get_cv`. We found that `test_size=0.5` worked well with an extremely large number of folds, typically `n_splits=64`, but these parameters depend on the classifier you are testing, so they may need fine-tuning.
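
The sketch below shows what such a cross-validation scheme could look like with scikit-learn's `StratifiedShuffleSplit`. The actual body and signature of `problem.get_cv` are not shown in this notebook, so the keyword arguments and the `random_state` value here are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit


def get_cv(X, y, n_splits=64, test_size=0.5, random_state=57):
    """Return an iterator over (train_indices, valid_indices) pairs."""
    cv = StratifiedShuffleSplit(
        n_splits=n_splits, test_size=test_size, random_state=random_state)
    return cv.split(X, y)


# Toy data with the same class counts as the mock training set.
y = np.array([0] * 386 + [1] * 14)
X = np.zeros((len(y), 1))

folds = list(get_cv(X, y))
train_is, valid_is = folds[0]
print(len(folds), len(train_is), len(valid_is))  # 64 200 200
```

Stratification keeps the share of positive examples the same in the training and validation halves of each fold, which matters with so few positives.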

## Submitting to [ramp.studio](http://ramp.studio)

If you are eligible, you can join the team at [ramp.studio](http://www.ramp.studio). First, if it is your first time using RAMP, [sign up](http://www.ramp.studio/sign_up), otherwise [log in](http://www.ramp.studio/login). Then request to sign up for the event [kaggle_seguro](http://www.ramp.studio/events/kaggle_seguro). Both signups are controlled by RAMP administrators, so there **can be a delay between asking for signup and being able to submit**.

Once your signup request is accepted, you can go to your [sandbox](http://www.ramp.studio/events/kaggle_seguro/sandbox) and copy-paste (or upload) [`feature_extractor.py`](/edit/submissions/starting_kit/feature_extractor.py) and [`classifier.py`](/edit/submissions/starting_kit/classifier.py) from `submissions/starting_kit`. Save it, rename it, then submit it. The submission is trained and tested on our backend in the same way as `ramp_test_submission` does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in [my submissions](http://www.ramp.studio/events/kaggle_seguro/my_submissions). Once it is trained, you receive an email, and your submission shows up on the [public leaderboard](http://www.ramp.studio/events/kaggle_seguro/leaderboard).
If there is an error (despite having tested your submission locally with `ramp_test_submission`), it will show up in the "Failed submissions" table in [my submissions](http://www.ramp.studio/events/kaggle_seguro/my_submissions). You can click on the error to see part of the trace.

After submission, do not forget to give credits to the previous submissions you reused or integrated into your submission.

The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.

The usual way to work with RAMP is to explore solutions, add feature transformations, select models, perhaps do some AutoML/hyperopt, etc., _locally_, and check them with `ramp_test_submission`. The script prints mean cross-validation and test scores
```
----------------------------
train ngini = 0.119 ± 0.007
train auc = 0.559 ± 0.003
train acc = 0.964 ± 0.0
train nll = 0.156 ± 0.0
valid ngini = 0.114 ± 0.005
valid auc = 0.558 ± 0.002
valid acc = 0.964 ± 0.0
valid nll = 0.156 ± 0.0
test ngini = 0.229 ± 0.256
test auc = 0.307 ± 0.064
test acc = 1.0 ± 0.0
test nll = 0.037 ± 0.0
```
and bagged cross-validation and test scores
```
valid ngini = 0.167
test ngini = -0.324
```
The latter combines the cross-validation models pointwise on the validation and test sets, and usually leads to a better score than the mean CV score. The RAMP [leaderboard](http://www.ramp.studio/events/kaggle_seguro/leaderboard) displays this score.
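
A rough sketch of pointwise bagging, assuming the fold predictions are combined by a plain average (the exact combination rule used by ramp-workflow may differ in detail). All arrays and numbers below are made up for illustration.

```python
import numpy as np

n_valid_points, n_test_points, n_folds = 6, 4, 3
rng = np.random.RandomState(0)

# Test set: every fold model predicts every test point, so bagging is a mean.
test_preds = rng.uniform(size=(n_folds, n_test_points))
bagged_test = test_preds.mean(axis=0)

# Validation: a fold only predicts the points held out in that fold.
# NaN marks "this point was in the training part of this fold".
valid_preds = np.full((n_folds, n_valid_points), np.nan)
valid_preds[0, [0, 1, 2]] = [0.2, 0.4, 0.6]
valid_preds[1, [2, 3, 4]] = [0.8, 0.1, 0.3]
valid_preds[2, [0, 4, 5]] = [0.4, 0.5, 0.7]

# Average each point over the folds in which it was actually predicted.
bagged_valid = np.nanmean(valid_preds, axis=0)
print(bagged_valid)
```

Each validation point is averaged only over the folds that held it out, which is why a large number of folds with `test_size=0.5` gives every point many predictions to average.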

The official score in this RAMP (the first score column after "historical contributivity" on the [leaderboard](http://www.ramp.studio/events/kaggle_seguro/leaderboard)) is normalized Gini ("ngini"), so the line that is relevant in the output of `ramp_test_submission` is `valid ngini = 0.167`. When the score is good enough, you can submit it at the RAMP.
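
For reference, here is a small NumPy sketch of the normalized Gini score in the formulation commonly used for this Kaggle challenge. The implementation in the starting kit may differ in detail, and the function names are illustrative. For binary labels and scores without ties, the value equals `2 * AUC - 1`.

```python
import numpy as np


def gini(y_true, y_score):
    """Unnormalized Gini computed from the cumulative share of positives."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    n = len(y_true)
    # Sort by decreasing score; a stable sort breaks ties by original order.
    order = np.argsort(-y_score, kind="mergesort")
    cum_gain = np.cumsum(y_true[order]) / y_true.sum()
    return (cum_gain.sum() - (n + 1) / 2.0) / n


def normalized_gini(y_true, y_score):
    """Gini of the scores divided by the Gini of a perfect ranking."""
    return gini(y_true, y_score) / gini(y_true, y_true)


y_true = [0, 0, 1, 1]
print(normalized_gini(y_true, [0.1, 0.2, 0.8, 0.9]))  # perfect ranking: 1.0
print(normalized_gini(y_true, [0.9, 0.8, 0.2, 0.1]))  # reversed ranking: -1.0
```

Only the ordering of the predicted scores matters, so any monotone rescaling of the probabilities leaves the score unchanged.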

## More information

You can find more information in the [README](https://github.com/paris-saclay-cds/ramp-workflow/blob/master/README.md) of the [ramp-workflow library](https://github.com/paris-saclay-cds/ramp-workflow).

## Contact

Don't hesitate to [contact us](mailto:admin@ramp.studio?subject=kaggle%20seguro%20notebook).