# Safe Driver Prediction

## Table of Contents

1. [Loading the Data](#loading-the-data)
    * [Setup](#setup)
    * [Data](#data)
2. [Understanding the Data](#understanding-the-data)
    * [Conclusions](#conclusions)
3. [Data Preparation](#data-preparation)
    * [Cleaning up bad values](#cleaning-up-bad-values)
    * [Separating values and labels](#separating-values-and-labels)
    * [Splitting up the dataset](#splitting-up-the-dataset)
4. [Machine Learning](#machine-learning)
    * [Setting up the model](#setting-up-the-model)
    * [Training the model](#training-the-model)
    * [Testing the model](#testing-the-model)
5. [Making a Benchmark Submission](#making-a-benchmark-submission)

## Loading the Data <a class="anchor" id="loading-the-data"></a>

Before we do anything, we need to make sure we have all our data ready.

We import NumPy now so that it is ready for later.

In [None]:
import numpy as np

# Set the random seed for reproducibility
np.random.seed(42)

We will use Pandas throughout the notebook to hold and manage our datasets.

In [None]:
import pandas as pd

If you are running this notebook locally, uncomment the code in the next cell and run it. **Warning:** That cell does not work in this Kaggle-hosted notebook; there, run the cell after it instead.

In [None]:
# Reads in the csv-files and creates a dataframe using pandas

# base_set = pd.read_csv('data/housing_data.csv')
# benchmark = pd.read_csv('data/housing_test_data.csv')
# sample_submission = pd.read_csv('data/sample_submission.csv')

In [None]:
base_set = pd.read_csv('../input/porto-seguro-safe-driver-prediction/train.csv')
benchmark = pd.read_csv('../input/porto-seguro-safe-driver-prediction/test.csv')
sample_submission = pd.read_csv('../input/porto-seguro-safe-driver-prediction/sample_submission.csv')

## Understanding the data

Now that we have our data, we need to investigate it so that we are able to leverage it to the fullest extent.

We will use Matplotlib to plot various things throughout the notebook.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
base_set.head()

In [None]:
benchmark.head()

In [None]:
base_set.info()

In [None]:
benchmark.info()

In [None]:
base_set.describe()

Looking at the correlation of each feature with `target` shows which features have the strongest linear relationship with the label.

In [None]:
correlations = base_set.corr()
correlations["target"]
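
To make the strongest relationships easier to spot, the correlation column can be sorted by absolute value. The sketch below uses a small synthetic frame rather than the competition data; the column names `feat_a` and `feat_b` and the way the toy target is generated are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for base_set; column names are hypothetical
rng = np.random.default_rng(0)
toy = pd.DataFrame({"feat_a": rng.normal(size=500),
                    "feat_b": rng.normal(size=500)})
# Make the toy target depend on feat_a only
toy["target"] = (toy["feat_a"] + 0.5 * rng.normal(size=500) > 1).astype(int)

# Correlation of every feature with the target, without the target itself
target_corr = toy.corr()["target"].drop("target")

# Order by absolute correlation, strongest first, keeping the sign
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

Sorting by absolute value keeps strongly negative correlations near the top, which a plain descending sort would push to the bottom.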

In [None]:
base_set.hist(bins=50, figsize=(15,15))
plt.show()

### Conclusions

...

## Data preparation

In this section we will preprocess the data and construct a model, which we will then train so we are able to make predictions with it. Lastly, we will test it on a test set we create.

### Cleaning up bad values

The `id` column in the sets is not needed, so we remove that.

In [None]:
base_set_id = base_set['id']
benchmark_id = benchmark['id']

base_set = base_set.drop(columns=['id'])
benchmark = benchmark.drop(columns=['id'])

Features whose names start with `ps_calc_` appear to be randomly generated noise and should be dropped.

In [None]:
base_non_calc_cols = [c for c in base_set.columns if (not c.startswith('ps_calc_'))]
benchmark_non_calc_cols = [c for c in benchmark.columns if (not c.startswith('ps_calc_'))]

base_set = base_set[base_non_calc_cols]
benchmark = benchmark[benchmark_non_calc_cols]

Some of the columns are categorical and should be one-hot encoded.

In [None]:
from keras.utils import to_categorical

# Not sure how to do this yet
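
One possible way to do this is with `pd.get_dummies` instead of `to_categorical`. The following is only a sketch on a toy frame, assuming that categorical columns can be recognised by a `_cat` suffix in their names; the toy column names are made up.

```python
import pandas as pd

# Toy frame standing in for the real data; column names are hypothetical
toy = pd.DataFrame({
    "ps_demo_01_cat": [0, 1, 2, 1],
    "ps_demo_02": [0.5, 0.1, 0.9, 0.3],
})

# Assumption: categorical columns are identified by the `_cat` suffix
cat_cols = [c for c in toy.columns if c.endswith("_cat")]

# Replace each categorical column with one indicator column per category
encoded = pd.get_dummies(toy, columns=cat_cols)
print(encoded.columns.tolist())
```

The training and benchmark sets would need to end up with the same columns, for example by encoding them together or by reindexing one against the columns of the other.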

Values that have not been recorded are encoded as -1 in the dataset. We fill those with the median of the column.

In [None]:
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
base_set = base_set.replace(-1, np.nan)
benchmark = benchmark.replace(-1, np.nan)

base_set = base_set.fillna(base_set.median())
benchmark = benchmark.fillna(benchmark.median())
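
A variation worth considering is to compute the medians on the training data only and reuse them for the benchmark set, so that both sets are filled with the same values. This is a sketch on toy frames, not a change to the cell above; the column name `feat` is hypothetical.

```python
import numpy as np
import pandas as pd

train_toy = pd.DataFrame({"feat": [1.0, -1, 3.0, 5.0]})
bench_toy = pd.DataFrame({"feat": [-1, 2.0]})

# Treat -1 as missing in both sets
train_toy = train_toy.replace(-1, np.nan)
bench_toy = bench_toy.replace(-1, np.nan)

# Medians come from the training data only (NaN is skipped by default)
train_medians = train_toy.median()

# Fill both sets with the same training medians
train_filled = train_toy.fillna(train_medians)
bench_filled = bench_toy.fillna(train_medians)
print(train_filled["feat"].tolist(), bench_filled["feat"].tolist())
```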

Finally, we check to see that neither of the datasets contains `NaN` values.

In [None]:
base_set.isnull().any()

In [None]:
benchmark.isnull().any()

### Separating values and labels

It is time to split the dataset into values and labels. To do that, we drop the label column and call that `X`, and take the label column alone and call that `Y`. Afterwards we are ready to start shaping our dataset.

In [None]:
labels_column = 'target'

X = base_set.drop(columns=[labels_column])
Y = pd.DataFrame(base_set[labels_column], columns=[labels_column])

In [None]:
X.head()

In [None]:
Y.head()

In [None]:
benchmark.head()

### Splitting up the dataset

We split our base set into separate datasets for training (50%), validation (25%) and testing (25%).

In [None]:
from sklearn.model_selection import train_test_split

train_to_valtest_ratio = .5
validate_to_test_ratio = .5

# First split our main set
(X_train,
 X_validation_and_test,
 Y_train,
 Y_validation_and_test) = train_test_split(X, Y, test_size=train_to_valtest_ratio)

# Then split our second set into validation and test
(X_validation,
 X_test,
 Y_validation,
 Y_test) = train_test_split(X_validation_and_test, Y_validation_and_test, test_size=validate_to_test_ratio)
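
If the positive class is rare, a plain random split can leave the subsets with slightly different class ratios. `train_test_split` accepts a `stratify` argument that preserves the ratio. The sketch below runs on synthetic labels; the 5% positive rate is an assumption for illustration, not a figure taken from this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:50] = 1  # 5% positives (illustrative assumption)

# First split: 50% train, 50% held out, keeping the class ratio
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_toy, y_toy, test_size=0.5, stratify=y_toy, random_state=42)

# Second split: held-out half into validation and test, again stratified
X_val, X_te, y_val, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

print(y_tr.mean(), y_val.mean(), y_te.mean())
```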

## Machine Learning

### Gini scoring function

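The normalized Gini coefficient, which is the evaluation metric of this competition, will be used to score the models. It depends only on how the predictions rank the samples, and a higher value is better.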

In [None]:
def gini(y_true, y_pred):
    # check and get number of samples
    assert y_true.shape == y_pred.shape
    n_samples = y_true.shape[0]
    
    # sort rows on prediction column 
    # (from largest to smallest)
    arr = np.array([y_true, y_pred]).transpose()
    true_order = arr[arr[:,0].argsort()][::-1,0]
    pred_order = arr[arr[:,1].argsort()][::-1,0]
    
    # get Lorenz curves
    L_true = np.cumsum(true_order) / np.sum(true_order)
    L_pred = np.cumsum(pred_order) / np.sum(pred_order)
    L_ones = np.linspace(1/n_samples, 1, n_samples)
    
    # get Gini coefficients (area between curves)
    G_true = np.sum(L_ones - L_true)
    G_pred = np.sum(L_ones - L_pred)
    
    # normalize to true Gini coefficient
    return G_pred/G_true
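
For a binary target, and when the predictions contain no ties, the normalized Gini coefficient should equal 2 · AUC − 1, which gives a convenient cross-check for the function above. The helper name `gini_from_auc` is ours, and the four-sample example is made up.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def gini_from_auc(y_true, y_score):
    """Normalized Gini for a binary target, computed as 2 * AUC - 1."""
    return 2 * roc_auc_score(y_true, y_score) - 1

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# AUC is 0.75 here, so the Gini value is 0.5
print(gini_from_auc(y_true, y_score))
```

Working through `gini` by hand on the same four samples also gives 0.5, so the two approaches agree on this example.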

### Setting up the model

Now, it is time to set up the architecture.

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential([
    Dense(64, activation='relu', input_dim=X_train.shape[1]),
    Dropout(.30),
    Dense(64, activation='relu'),
    Dropout(.15),
    Dense(32, activation='relu'),
    Dropout(.15),
    Dense(16, activation='relu'),
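    # Sigmoid output so predictions are probabilities, as binary_crossentropy expects
    Dense(1, activation='sigmoid'),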
])

model.summary()

In [None]:
import keras.backend as K

model.compile(optimizer='adam', # adam, sgd, adadelta
              loss='binary_crossentropy',
              metrics=['binary_crossentropy'])
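
Binary cross-entropy is defined on probabilities, so it pairs naturally with a sigmoid on the final layer. The NumPy sketch below shows the quantity being minimised, for illustration only; Keras has its own implementation, and the `eps` clipping value here is our choice.

```python
import numpy as np

def sigmoid(z):
    """Map raw model outputs (logits) to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

logits = np.array([-2.0, 0.0, 3.0])
labels = np.array([0.0, 1.0, 1.0])

probs = sigmoid(logits)
loss = binary_crossentropy(labels, probs)
print(probs, loss)
```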

### Training the model

Let's fit the model on the data.

In [None]:
from keras.callbacks import EarlyStopping

early_stopper = EarlyStopping(patience=3)

training_result = model.fit(X_train, Y_train,
                            batch_size=4096,
                            epochs=256,
                            validation_data=(X_validation, Y_validation),
                            callbacks=[early_stopper])
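
By default `EarlyStopping` monitors `val_loss`, so with `patience=3` training stops once three epochs in a row fail to improve on the best validation loss seen so far. The pure-Python sketch below is a simplified version of that rule and ignores options such as `min_delta` and `restore_best_weights`; the function name and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

losses = [1.0, 0.9, 0.95, 0.96, 0.97, 0.5]
print(early_stopping_epoch(losses))
```

In this example training stops at epoch index 4, so the later, better value of 0.5 is never reached. A larger `patience` trades longer training for a lower risk of stopping too early.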

Now, let's look into how the fitting went.

In [None]:
print(training_result.history)

# Plot the binary cross-entropy metric over epoch
plt.plot(training_result.history['binary_crossentropy'])
plt.plot(training_result.history['val_binary_crossentropy'])
plt.title('Model binary cross-entropy')
plt.ylabel('binary cross-entropy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

# Plot model loss over epoch
plt.plot(training_result.history['loss'])
plt.plot(training_result.history['val_loss'])
plt.title('Model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

In [None]:
# evaluate() processes the set in batches and returns [loss, metric]
validate_result = model.evaluate(X_validation, Y_validation, batch_size=4096, verbose=0)
validate_result

### Testing the model

Finally, we churn the test set through the model we created.

In [None]:
# evaluate() processes the set in batches and returns [loss, metric]
test_result = model.evaluate(X_test, Y_test, batch_size=4096, verbose=0)
test_result

### Trying other models

Testing with RandomForestRegressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

rfr_model = RandomForestRegressor()
# Pass the labels as a 1-D Series; a one-column DataFrame triggers a shape warning
rfr_model.fit(X_train, Y_train['target'])

rfr_predictions = rfr_model.predict(X_test)

In [None]:
# Despite the variable name, this is a Gini score (higher is better)
rfr_error = gini(Y_test['target'], rfr_predictions)
rfr_error

Testing with XGBoost

In [None]:
import re

# Match the characters '[', ']' and '<'; the brackets must be escaped
regex = re.compile(r"\[|\]|<")

# XGBoost does not support some of the column names

X_train.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in X_train.columns.values]
X_test.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in X_test.columns.values]

from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV

import scipy.stats as st

one_to_left = st.beta(10, 1)  
from_zero_positive = st.expon(0, 50)

# n_jobs=-1 uses all available cores
xgb_reg = XGBRegressor(n_jobs=-1)

xgb_gs_params = {  
    "n_estimators": st.randint(3, 40),
    "max_depth": st.randint(3, 40),
    "learning_rate": st.uniform(0.05, 0.4),
    "colsample_bytree": one_to_left,
    "subsample": one_to_left,
    "gamma": st.uniform(0, 10),
    'reg_alpha': from_zero_positive,
    "min_child_weight": from_zero_positive,
}

xgb_gs = RandomizedSearchCV(xgb_reg, xgb_gs_params, n_jobs=1)  
xgb_gs.fit(X_train.values, Y_train)  

xgb_model = xgb_gs.best_estimator_ 

xgb_predictions = xgb_model.predict(X_test.values)
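
The SciPy distributions used for the search are parameterised as `(loc, scale)` or `(low, high)`, which is easy to misread: `st.uniform(0.05, 0.4)` covers the interval [0.05, 0.45], and `st.randint(3, 40)` draws integers from 3 to 39. The sketch below samples from each distribution to check the ranges; the sample size and seeds are arbitrary.

```python
import scipy.stats as st

n = 10_000

# uniform(loc, scale) spans [loc, loc + scale]
lr = st.uniform(0.05, 0.4).rvs(n, random_state=0)
# randint(low, high) excludes the upper bound
depth = st.randint(3, 40).rvs(n, random_state=0)
# beta(10, 1) lies in [0, 1] and is skewed towards 1
frac = st.beta(10, 1).rvs(n, random_state=0)
# expon(loc, scale) is non-negative with mean loc + scale
reg = st.expon(0, 50).rvs(n, random_state=0)

print(lr.min(), lr.max())
print(depth.min(), depth.max())
print(frac.mean(), reg.mean())
```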

In [None]:
# Despite the variable name, this is a Gini score (higher is better)
xgb_error = gini(Y_test['target'], xgb_predictions)
xgb_error
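
`RandomizedSearchCV` scores regressors with R² unless told otherwise, which may not select the candidate with the best Gini. One option is a custom scorer built with `make_scorer`. The sketch below runs on synthetic data and uses a `DecisionTreeRegressor` as a stand-in for the XGBoost model; `gini_from_auc` is a helper defined here that relies on the 2 · AUC − 1 identity for binary targets.

```python
import numpy as np
import scipy.stats as st
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

def gini_from_auc(y_true, y_pred):
    """Normalized Gini for a binary target, computed as 2 * AUC - 1."""
    return 2 * roc_auc_score(y_true, y_pred) - 1

# Synthetic binary target that depends on the first feature
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 4))
y_toy = (X_toy[:, 0] + rng.normal(size=400) > 0).astype(int)

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),  # stand-in estimator
    {"max_depth": st.randint(1, 6)},
    n_iter=5,
    scoring=make_scorer(gini_from_auc),  # higher is better by default
    cv=3,
    random_state=0,
)
search.fit(X_toy, y_toy)
print(search.best_params_, search.best_score_)
```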

A comparison of all the models

In [None]:
print(f'NN binary cross-entropy loss:       {test_result[0]}')
print(f'RandomForestRegressor Gini:         {rfr_error}')
print(f'XGBRegressor Gini:                  {xgb_error}')

## Making a Benchmark Submission

For the benchmark data, it is important that we put it through the same preparation steps as the training set.

In [None]:
benchmark.head()

In [None]:
X.head()

Now it's time to make predictions.

In [None]:
target = xgb_model.predict(benchmark.values)

In [None]:
len(target)

In [None]:
target

In [None]:
submission = pd.DataFrame({
    'id': benchmark_id,
    'target': target.flatten()
})

In [None]:
submission.head()

In [None]:
# Stores a csv file to submit to the kaggle competition
submission.to_csv('submission.csv', index=False)
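
Before uploading, a few sanity checks on the frame can catch obvious mistakes. This is a sketch that assumes the expected format is an `id` column followed by a `target` column, as in the frame built above; the helper name `check_submission` and the toy values are made up.

```python
import pandas as pd

def check_submission(df, expected_rows):
    """Basic sanity checks on a submission frame."""
    assert list(df.columns) == ["id", "target"], "unexpected columns"
    assert len(df) == expected_rows, "unexpected number of rows"
    assert df["id"].is_unique, "duplicate ids"
    assert not df["target"].isnull().any(), "missing predictions"
    return True

toy_submission = pd.DataFrame({"id": [0, 1, 2],
                               "target": [0.03, 0.12, 0.05]})
ok = check_submission(toy_submission, expected_rows=3)
print(ok)
```

For the real frame, `expected_rows` could be taken from the length of the benchmark set.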