## Porto Seguro’s Safe Driver Prediction

<b>Competition Goal</b>: help a car insurance company (Porto Seguro) improve its model for predicting, from a customer's profile, the probability that the customer will initiate an auto insurance claim in the next year. More accurate predictions will allow the company to offer auto insurance policies at competitive, customer-specific prices. <br>

### Action Plan
1. run an exploratory analysis on the dataset;
2. set up and train XGBoost model with stratified KFold;
3. stack KFold predictions, evaluate the submission;
4. tune XGBoost model parameters; repeat.

### Steps

- set up the environment
- read in the data
- split features and target variable
- check if the class distribution is balanced
- explore correlation matrix
- define the gini metric for model evaluation
- define training, prediction and submission datasets
- set stratified KFold training structure
- set XGBoost parameters
- train XGBoost model
- make ensemble prediction
- put the submission into csv format

In [None]:
# set up the environment
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import StratifiedKFold
import xgboost as xgb

%matplotlib inline

In [None]:
# read in the data
train = pd.read_csv('train.csv', na_values=-1)
test = pd.read_csv('test.csv', na_values=-1)
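
The `na_values=-1` argument assumes that missing entries are encoded as -1 in the CSV files; pandas then loads them as `NaN`. A minimal sketch on a made-up two-row CSV (the column names and values are illustrative, not taken from the competition files):

```python
import io

import pandas as pd

# Hypothetical miniature CSV; -1 stands in for a missing value.
csv_text = "id,feat_a,feat_b\n1,0.5,-1\n2,-1,3\n"

demo = pd.read_csv(io.StringIO(csv_text), na_values=-1)

# Count the entries per column that were -1 and are now NaN.
missing_per_column = demo.isna().sum()
print(missing_per_column)
```

XGBoost's `DMatrix` treats `NaN` as missing by default, so no imputation step is needed before training in this notebook.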

In [None]:
# split features and target variable
features = train.drop(['id','target'], axis=1).values
targets = train.target.values

In [None]:
# check if the class distribution is balanced
sns.set(font_scale=1.5)  # set the font scale before drawing so it applies to this plot
ax = sns.countplot(x=targets, palette="Set2")
fig = plt.gcf()
fig.set_size_inches(10, 5)
ax.set_ylim(top=700000)
# annotate each bar with its share of all rows
for p in ax.patches:
    ax.annotate('{:.2f}%'.format(100 * p.get_height() / len(targets)), (p.get_x() + 0.3, p.get_height() + 10000))

plt.title('Distribution of {} Targets'.format(len(targets)))
plt.xlabel('Initiation of Auto Insurance Claim Next Year')
plt.ylabel('Count')  # the y-axis shows row counts; percentages are in the bar labels
plt.show()

# class 0 (no claim) heavily outnumbers class 1, so the classes are imbalanced
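
The same imbalance can be read off numerically rather than from the plot. A small sketch, using a made-up label array in place of `targets` (the 5% positive rate is an arbitrary illustration, not the competition's figure):

```python
import numpy as np

# Hypothetical binary labels: 95 negatives and 5 positives.
demo_targets = np.array([0] * 95 + [1] * 5)

counts = np.bincount(demo_targets)   # rows per class
shares = counts / counts.sum()       # fraction of rows per class

for label, (n, share) in enumerate(zip(counts, shares)):
    print('class {}: {} rows ({:.2%})'.format(label, n, share))
```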

In [None]:
# explore the correlation matrix
corr = train.corr()
sns.set(style="white")
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, cmap=cmap, vmax=.3, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()

# ps_calc_ features seem to have minimal correlation with the target variable
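
A heatmap this large is hard to read feature by feature. One way to check the observation is to rank the features by the absolute value of their correlation with `target`. The sketch below uses a small synthetic frame; the column names `ps_demo_signal` and `ps_calc_demo` are hypothetical stand-ins, not real competition columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for `train`: one feature tied to the target, one pure noise.
demo = pd.DataFrame({'target': rng.integers(0, 2, size=n)})
demo['ps_demo_signal'] = demo['target'] + rng.normal(scale=0.5, size=n)
demo['ps_calc_demo'] = rng.normal(size=n)

# Absolute correlation of every feature with the target, weakest first.
target_corr = demo.corr()['target'].drop('target').abs().sort_values()
print(target_corr)
```

Applied to the real data, `train.corr()['target']` would take the place of `demo.corr()['target']`. Note that Pearson correlation only captures linear association, so a near-zero value suggests, but does not prove, that a feature is uninformative.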

In [None]:
# remove ps_calc_ features from the train / test data
unwanted = train.columns[train.columns.str.startswith('ps_calc_')]
train = train.drop(unwanted, axis=1)  
test = test.drop(unwanted, axis=1)

In [None]:
# Define the gini metric for model evaluation
# forked from https://www.kaggle.com/c/ClaimPredictionChallenge/discussion/703#5897
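def gini(actual, pred):
    # Gini coefficient of `actual` when rows are ranked by `pred` (descending)
    assert len(actual) == len(pred)
    # columns: actual, pred, original row index (used to break ties in pred)
    # note: the np.float alias was removed from NumPy; the builtin float is equivalent
    data = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
    data = data[np.lexsort((data[:, 2], -1 * data[:, 1]))]
    total_losses = data[:, 0].sum()
    gini_sum = data[:, 0].cumsum().sum() / total_losses

    gini_sum -= (len(actual) + 1) / 2.
    return gini_sum / len(actual)


def gini_normalized(a, p):
    # 1.0 means that ranking by `p` is as good as ranking by the true labels
    return gini(a, p) / gini(a, a)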

def gini_xgb(preds, dtrain):
    labels = dtrain.get_label()
    gini_score = gini_normalized(labels, preds)
    return 'gini', gini_score
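
For binary labels and predictions without ties, the normalized Gini coefficient equals `2 * AUC - 1`, which gives a quick sanity check for the metric. The sketch below repeats a compact version of the functions above so that it runs on its own, and computes AUC by brute-force pair counting (the helper `auc_by_pairs` is introduced here for illustration); the four labels and scores are made up.

```python
import numpy as np


def gini(actual, pred):
    # Rank rows by prediction (descending), ties broken by original order.
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    order = np.lexsort((np.arange(len(actual)), -pred))
    ranked = actual[order]
    gini_sum = ranked.cumsum().sum() / ranked.sum()
    gini_sum -= (len(actual) + 1) / 2.0
    return gini_sum / len(actual)


def gini_normalized(actual, pred):
    return gini(actual, pred) / gini(actual, actual)


def auc_by_pairs(actual, pred):
    # Fraction of (positive, negative) pairs where the positive scores higher.
    actual = np.asarray(actual)
    pred = np.asarray(pred, dtype=float)
    pos = pred[actual == 1]
    neg = pred[actual == 0]
    return (pos[:, None] > neg[None, :]).mean()


labels = np.array([1, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.3, 0.1])

g = gini_normalized(labels, scores)
auc = auc_by_pairs(labels, scores)
print(g, 2 * auc - 1)
```

In this toy case three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75 and both printed values are 0.5. Because the metric depends only on the ranking of the predictions, any monotonic rescaling of the scores leaves it unchanged.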

In [None]:
# define training, prediction and submission datasets
X = train.drop(['id', 'target'], axis=1).values
y = train.target.values

test_id = test.id.values
test = test.drop('id', axis=1)

sub = pd.DataFrame()
sub['id'] = test_id
sub['target'] = np.zeros(len(test_id))  # float accumulator for the averaged fold predictions

# set stratified KFold training structure
kfold = 5
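# shuffle=True is needed for random_state to have any effect;
# recent scikit-learn versions raise an error if random_state is set without it
skf = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=0)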
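
Stratification matters here because the positive class is rare: each validation fold should contain roughly the same share of positives as the full training set. A small sketch on made-up labels (10% positives, an arbitrary figure) that prints the positive rate of every validation fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives and 10 positives.
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((len(y_demo), 1))  # the features play no role in the split

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_rates = []
for fold, (train_idx, valid_idx) in enumerate(skf_demo.split(X_demo, y_demo)):
    rate = y_demo[valid_idx].mean()
    fold_rates.append(rate)
    print('fold {}: {} rows, positive rate {:.2f}'.format(fold + 1, len(valid_idx), rate))
```

With class counts that divide evenly by the number of folds, as here, every fold reproduces the overall rate; on real data the per-fold rates would only be approximately equal.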

In [None]:
# set XGBoost parameters
params = {
    'min_child_weight': 10.0,
    'objective': 'binary:logistic',
    'max_depth': 7,
    'max_delta_step': 1.8,
    'colsample_bytree': 0.4,
    'subsample': 0.8,
    'eta': 0.025,
    'gamma': 0.65,
    # the number of boosting rounds is not a booster parameter;
    # it is passed to xgb.train() directly in the training loop
    }

In [None]:
# train with XGBoost and make subsequent predictions
# forked from (https://www.kaggle.com/kueipo/stratifiedshufflesplit-xgboost-example-0-28)
for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    print('[Fold %d/%d]' % (i + 1, kfold))
    X_train, X_valid = X[train_index], X[test_index]
    y_train, y_valid = y[train_index], y[test_index]
    
    d_train = xgb.DMatrix(X_train, y_train)
    d_valid = xgb.DMatrix(X_valid, y_valid)
    d_test = xgb.DMatrix(test.values)
    
    watchlist = [(d_train, 'train'), (d_valid, 'valid')]

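    # train the model; early stopping tracks the gini score on the validation fold
    # (newer XGBoost releases deprecate `feval` in favor of `custom_metric`)
    mdl = xgb.train(params, d_train, 1600, watchlist, early_stopping_rounds=70, feval=gini_xgb, maximize=True, verbose_eval=100)
    
    # make prediction using the trees up to the best early-stopping iteration
    print('[Fold %d/%d Prediction:]' % (i + 1, kfold))
    p_test = mdl.predict(d_test, iteration_range=(0, mdl.best_iteration + 1))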
    
    # add prediction to ensemble
    sub['target'] += p_test/kfold
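
Adding `p_test / kfold` on every fold is an incremental way of taking the mean of the fold models' predictions. A minimal sketch with made-up prediction vectors standing in for the `p_test` arrays:

```python
import numpy as np

# Hypothetical test-set predictions from three fold models (four test rows each).
fold_preds = [
    np.array([0.10, 0.20, 0.30, 0.40]),
    np.array([0.20, 0.20, 0.10, 0.50]),
    np.array([0.30, 0.20, 0.20, 0.60]),
]
n_folds = len(fold_preds)

# Running accumulation, as in the training loop.
ensemble = np.zeros(4)
for p in fold_preds:
    ensemble += p / n_folds

# Same result as stacking the predictions and averaging across folds.
stacked_mean = np.mean(np.vstack(fold_preds), axis=0)
print(ensemble)
```

The running form avoids keeping every fold's predictions in memory, which is convenient when the test set is large.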

In [None]:
# Put final predictions into csv format
sub.to_csv('submission.csv', index=False)