# Porto Seguro’s Safe Driver Prediction

### Predicting if a driver will file an insurance claim next year

![Porto Seguro Image](https://www.inbenta.com/wp-content/uploads/2016/11/7266.jpg)

## Introduction

Nothing ruins the thrill of buying a brand new car more quickly than seeing your new insurance bill. The sting’s even more painful when you know you’re a good driver. It doesn’t seem fair that you have to pay so much if you’ve been cautious on the road for years.

Porto Seguro, one of Brazil’s largest auto and homeowner insurance companies, completely agrees. Inaccuracies in car insurance company’s claim predictions raise the cost of insurance for good drivers and reduce the price for bad ones.

In this competition, we’re challenged to build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year. While Porto Seguro has used machine learning for the past 20 years, they’re looking to Kaggle’s machine learning community to explore new, more powerful methods. A more accurate prediction will allow them to further tailor their prices, and hopefully make auto insurance coverage more accessible to more drivers.

## Approach

Using data visualization techniques with the help of useful libraries such as [Matplotlib](https://matplotlib.org) and [Seaborn](http://seaborn.pydata.org), we are able to identify relationships between various features in the given dataset.

Following the identification of such relationships, we impute missing data values. For categorical features, we simply create a new category to account for the missing data (i.e. NA category). For numeric features, we can opt to use the median of the distribution to impute missing data.

After the data is cleaned and processed, we identify features which are informative of the target label. We then conduct feature engineering on our existing pool of features to create new informative features.

Lastly, we fit an [Extreme Gradient Boosting](http://xgboost.readthedocs.io/en/latest/model.html) model (XGBoost, referred to below as the XGB model) to our data. Using cross-validation via the Stratified KFolds method, we select the best model (best number of trees) to predict on the given test set.

## Evaluation

We will use the Normalized Gini Coefficient as our evaluation metric, similar to the evaluation criteria set by Porto Seguro. For a more comprehensive understanding of what exactly the Normalized Gini Coefficient is, please visit this [kernel](https://www.kaggle.com/batzner/gini-coefficient-an-intuitive-explanation).

## Afternote

After submission, it turns out that our XGB model achieved a score of 0.279, which places us at the top 48 percentile of the competition. While not spectacular, I'm just glad that I learnt much more about the specifics behind the Extreme Gradient Boosting model (and its implementation in Python), and have a better idea of how the Normalized Gini Coefficient works now.

Also, in the event that you found this kernel useful, please take a look at some other kernels which I have referenced in my analysis (they were really useful in helping me understand how):

* [HyungsukKang's Stratified KFold+XGBoost+EDA Tutorial(0.281)](https://www.kaggle.com/sudosudoohio/stratified-kfold-xgboost-eda-tutorial-0-281)
* [Rudolph's Porto: xgb+lgb kfold LB 0.282](https://www.kaggle.com/rshally/porto-xgb-lgb-kfold-lb-0-282)
* [Olivier's XGB classifier, upsampling LB 0.283](https://www.kaggle.com/ogrellier/xgb-classifier-upsampling-lb-0-283)

## Table of Contents

1. [Importing key libraries and reading dataframes](#Importing-key-libraries-and-reading-dataframes)
2. [Exploratory Data Analysis](#Exploratory-Data-Analysis)
3. [Feature Selection](#Feature-Selection)
    1. [Binary and Numeric Features](#Binary-and-Numeric-Features)
    2. [Categorical Features](#Categorical-Features)
    3. [Subsetting the dataframe](#Subsetting-the-dataframe)
4. [Missing Data Imputation](#Missing-Data-Imputation)
5. [Feature Importances](#Feature-Importances)
6. [Feature Engineering](#Feature-Engineering)
    1. [Polynomial Features](#Polynomial-Features)
7. [Model Fitting](#Model-Fitting)

### Importing key libraries and reading dataframes

In [None]:
%matplotlib inline
import pandas as pd # Dataframe manipulation
import numpy as np 
import matplotlib.pyplot as plt # Base plotting
import seaborn as sns # Sophisticated plotting (?)
import warnings
# Ignore all warnings - users beware
warnings.filterwarnings("ignore")

In [None]:
# Read dataframe into Python
df_train = pd.read_csv('../input/train.csv')
df_test = pd.read_csv('../input/test.csv')

### Exploratory Data Analysis

Now that we have loaded the dataframe in Python, let's combine the training and testing dataset. We can split them later after we have conducted feature transformation, selection and scaling. 

Then, we take a quick look at the first 5 rows of the data, along with its dimensions.

In [None]:
# Combine the training and test dataset
df = pd.concat([df_train, df_test])

In [None]:
df.set_index('id', inplace = True)
df.head(5)

In [None]:
# print dimensions of dataframes
print(df.shape)
print(df_train.shape)
print(df_test.shape)

Let's call on the `describe` function in Pandas to understand the dataframe better.

In [None]:
df.describe()

From the summary of the dataset, we note that some feature names end with 'bin', while others end with 'cat'. Also, we note that there are negative values in the dataset.

A quick look at the Kaggle page seems to suggest that features that end with the word 'bin' are binary features, while features which end with the word 'cat' are categorical features. 

We do note that from the [Kaggle page](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data) there are missing values (as indicated by a '-1') in the dataset. However, nothing seems to be too alarming from this exercise. Let's check for missing values in the dataframe.

In [None]:
# Share of -1 (missing) entries per column, keeping only columns that have any
missing_share = (df == -1).mean()
missing_share[missing_share != 0]

We note that 3 features, `ps_car_03_cat`, `ps_car_05_cat` and `ps_reg_03`, contain a significant amount of missing values (more than 15% of their values are missing).

While we can assume that Porto Seguro has mapped all missing values to take on the value -1, let's check whether there are any remaining missing values.

In [None]:
np.sum(pd.isnull(df))

It turns out that there are none, apart from the `target` column, which is empty for the rows that come from the test set.

Before we proceed to remove these features, let's take a look at the correlation between our features and the target label. It wouldn't be wise to remove features which are really informative of the target label.

To do this, we separate the features into categorical features, and binary + numeric features.

In [None]:
categorical_features = df.columns[df.columns.str.endswith('cat')].tolist()
binary_features = df.columns[df.columns.str.endswith('bin')].tolist()
numeric_features = [feature for feature in df.columns.tolist()
                    if feature not in categorical_features and feature not in binary_features]

In [None]:
binary_numeric = binary_features + numeric_features

Are there any categorical features which were supposed to be classified as binary features? We can use the `set` function to find the unique values that the feature can take on.

In [None]:
df[categorical_features].apply(set)

It appears that 6 features should (could) be classified as binary features.

In [None]:
for feature in ['ps_car_02_cat', 'ps_car_03_cat', 'ps_car_05_cat', 
                'ps_car_07_cat', 'ps_car_08_cat', 'ps_ind_04_cat']:
    binary_numeric.append(feature)
    binary_features.append(feature)
    categorical_features.remove(feature)

In [None]:
categorical_features

For now, let's take a look at the correlation matrix across different features, regardless of whether they are numeric, binary or categorical features.

In [None]:
df[df == -1] = np.nan
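Converting the `-1` sentinel to `NaN` matters because pandas statistics skip `NaN` but treat `-1` as an ordinary value. The sketch below illustrates this on a toy frame; the column names `a_cat` and `b_bin` and their values are hypothetical and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame in which -1 marks a missing entry
toy = pd.DataFrame({'a_cat': [0, -1, 2, -1], 'b_bin': [1, 0, 1, 1]})

# Share of missing entries per column, computed while -1 is still the sentinel
missing_share = (toy == -1).mean()

# After the conversion, the mean of a_cat ignores the two missing rows
toy_nan = toy.replace(-1, np.nan)
mean_with_sentinel = toy['a_cat'].mean()
mean_with_nan = toy_nan['a_cat'].mean()
print(missing_share)
print(mean_with_sentinel, mean_with_nan)
```

The same effect applies to correlations and medians, which is why the conversion is done before the heatmap is drawn.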

In [None]:
sns.set_style('white')
cmap = sns.diverging_palette(220, 10, as_cmap=True)

plt.figure(figsize=(20,15))

sns.heatmap(df[binary_numeric].corr(), vmin = -1, vmax = 1, cmap=cmap)

plt.show()

From this heatmap, we note that only a handful of features are informative of the target label. In particular, we note that the `ps_calc_` features are not correlated with any other features. Let's take a closer look at the correlation between our features and the target label.

In [None]:
plt.figure(figsize=(20, 15))
(df.corr()
     .target
     .drop('target')
     .sort_values(ascending=False)
     .plot
     .barh())

Before we begin to plot the numeric and binary features, let's see which unique values the categorical features can take on.

From the horizontal bar plot, it appears that many of the features have a correlation close to 0 with the target label. Let's take a look at the distributions of the features.

In [None]:
print('No. of numeric features: %d' % len(numeric_features))
print('No. of binary features: %d' % len(binary_features))

In [None]:
plt.figure(figsize=(20,20))
for idx, num_feat in enumerate(numeric_features):
    plt.subplot(5, 6, idx+1)
    sns.distplot(df[num_feat].dropna(), kde = False, norm_hist=True)

plt.show()

In [None]:
plt.figure(figsize=(20,20))
for idx, bin_feat in enumerate(binary_features):
    plt.subplot(6, 4, idx+1)
    sns.distplot(df[bin_feat].dropna(), kde = False, norm_hist=True)

plt.show()

Let's take a look at our categorical features now.

In [None]:
len(categorical_features)

Of the categorical features, what are their distributions?

In [None]:
plt.figure(figsize=(20,15))

for idx, cat_feat in enumerate(categorical_features):
    plt.subplot(4, 2, idx+1)
    sns.distplot(df[cat_feat].dropna(), kde=False, norm_hist=True)
    
plt.show()

In [None]:
plt.figure(figsize=(20,15))

for idx, cat_feat in enumerate(categorical_features):
    plt.subplot(4, 2, idx+1)
    sns.pointplot(x=cat_feat, y='target', data=df.iloc[:df_train.shape[0]])
    
plt.show()

From these plots, we note that the categorical features might be indicative of the target label.

Upon closer inspection, we find that for dense feature values (categories that contain many observations), the probability of a claim is low. Let's investigate this phenomenon further.

In [None]:
fig, axs = plt.subplots(8, 1, figsize=(20, 25))

for ax, cat_feat in zip(axs, categorical_features):
    ax2 = ax.twinx()
    sns.distplot(df[cat_feat].dropna(), kde=False, norm_hist=True, ax = ax)
    sns.pointplot(x=cat_feat, y='target', data=df.iloc[:df_train.shape[0]], ax=ax2)
    
plt.show()

### Feature Selection

From what we have previously seen, we can now proceed to extract features which are more informative of the target label. For example, we note that the features `ps_car_01_cat` and `ps_car_06_cat` are pretty informative.

#### Binary and Numeric Features

Using the correlation matrix (in the form of a heatmap) computed previously, we impose an artificial threshold of 0.005 on the absolute correlation with the target label to select key binary and numeric features from our dataset.

In [None]:
df[df == -1] = np.nan

# Binary and Numeric Features

no_of_features = sum(df[binary_numeric].corr()
                     .target
                     .abs()
                     .drop('target')
                     .sort_values(ascending=False) > 0.005)
no_of_features

In [None]:
bin_num_features = (df[binary_numeric].corr()
                    .target
                    .abs()
                    .drop('target')
                    .sort_values(ascending = False))[:no_of_features].index.tolist()
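The selection rule can be illustrated on a small frame. This is only a sketch under assumed data: the columns `feat_a`, `feat_b` and `feat_c` are hypothetical, and the threshold of 0.5 was chosen so that the effect is visible on six rows, whereas the kernel uses 0.005 on the real features.

```python
import pandas as pd

# Hypothetical toy data: feat_a moves with the target, feat_b is only
# weakly related to it, and feat_c moves against it.
toy = pd.DataFrame({
    'feat_a': [0.1, 0.2, 0.9, 0.8, 0.7, 0.3],
    'feat_b': [1, 0, 1, 0, 1, 0],
    'feat_c': [5, 4, 1, 2, 1, 5],
    'target': [0, 0, 1, 1, 1, 0],
})

threshold = 0.5  # illustrative value only

# Absolute correlation of every feature with the target, strongest first
corr_with_target = (toy.corr()['target']
                    .drop('target')
                    .abs()
                    .sort_values(ascending=False))

# Keep the features whose absolute correlation clears the threshold
selected = corr_with_target[corr_with_target > threshold].index.tolist()
print(selected)
```

Taking the absolute value keeps strongly negative features such as `feat_c`, which a signed threshold would discard.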

#### Categorical Features

Let's select the key categorical features later, when we plot our feature importances.

In [None]:
cat_features = [feature for feature in df.columns.tolist() 
                if (feature not in bin_num_features) and (feature.endswith('cat'))]

#### Subsetting the dataframe

Using the features selected through the correlation threshold, let's create our new dataframe.

We then call on our heatmap again to better understand the correlation across our numeric and binary features.

In [None]:
# Copy the subset so that adding or deleting columns does not act on a view of df
df_fs1 = df[bin_num_features + cat_features].copy()

df_fs1['target'] = df.target
bin_num_feat = [column for column in df_fs1.columns 
                if column not in cat_features]

In [None]:
sns.set_style('white')
cmap = sns.diverging_palette(220, 10, as_cmap=True)

plt.figure(figsize=(20, 20))
sns.heatmap(df_fs1[bin_num_feat].iloc[:df_train.shape[0]].corr(), vmin = -1, vmax = 1, 
            annot = True, cmap = cmap)
plt.show()

After selecting our key features, we note that some of them are correlated with one another. Why might this be a problem?

[Multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity) occurs when one predictor is highly correlated with another predictor. Its main consequence is imprecise coefficient estimates: the standard errors of the affected predictors tend to be higher.

To test whether multicollinearity is an issue in our case, we can turn to the [Variance Inflation Factor](https://en.wikipedia.org/wiki/Variance_inflation_factor), $\mathrm{VIF} = 1 / (1 - R^2)$. A common rule of thumb flags $\mathrm{VIF} > 10$, which is equivalent to $R^2 > 0.9$. When only 2 features are involved, $R$ is the correlation between them, so the cut-off is $R = \sqrt{0.9} \approx 0.95$.

To be stricter, we impose a correlation threshold of 0.9 (slightly stricter than the 0.95 rule of thumb) in our selection of independent features. Using this threshold, we remove the feature `ps_ind_14`, as its correlation of 0.89 with another feature, `ps_ind_12_bin`, is essentially at this threshold. Also, we note that `ps_ind_14` has a lower correlation with the target label than `ps_ind_12_bin`.
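The rule of thumb above can be checked with a few lines of arithmetic. This sketch assumes the standard definition $\mathrm{VIF} = 1 / (1 - R^2)$ and the commonly quoted cut-off of 10; the helper name `vif` is introduced here purely for illustration.

```python
import math

def vif(r_squared):
    """Variance Inflation Factor for a given R^2 (standard definition, assumed here)."""
    return 1.0 / (1.0 - r_squared)

# Correlation at which the VIF reaches the assumed cut-off of 10
vif_cutoff = 10.0
r_squared_cutoff = 1.0 - 1.0 / vif_cutoff
r_cutoff = math.sqrt(r_squared_cutoff)

# VIF implied by the pairwise correlation of 0.89 between ps_ind_14 and ps_ind_12_bin
vif_at_089 = vif(0.89 ** 2)

print(round(r_cutoff, 3), round(vif_at_089, 2))
```

Under these assumptions, a pairwise correlation of 0.89 corresponds to a VIF of roughly 4.8, so dropping `ps_ind_14` is a more conservative choice than the rule of thumb strictly requires.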

In [None]:
del df_fs1['ps_ind_14']

Before we convert the features to dummies, we first impute missing data.

### Missing Data Imputation

Let's proceed to impute our missing data. 

First, we check whether there are any NA values that need to be imputed.

In [None]:
np.sum(df_fs1.isnull())

Let's remove features where missing values account for more than 20% of the data.

In [None]:
[feat for feat in df_fs1.columns.tolist() 
 if np.sum(pd.isnull(df_fs1[feat])) > (df_fs1.shape[0])*0.20]

Using this simple rule of thumb, we find that the features `ps_car_03_cat` and `ps_car_05_cat` fulfil this criterion.

In [None]:
del df_fs1['ps_car_03_cat']
del df_fs1['ps_car_05_cat']

For categorical features which have missing values, we can circumvent this issue by creating a new category for it.

In [None]:
[feat for feat in df_fs1.columns.tolist() 
 if (feat.endswith('cat'))  and ((np.sum(pd.isnull(df_fs1[feat]))) > 0)]

In [None]:
# Treat missing values as their own category, encoded with the numeric sentinel -1
# (the string '-1' would turn these columns into mixed-type object columns)
for feat in ['ps_car_02_cat', 'ps_car_07_cat', 'ps_ind_04_cat', 'ps_car_01_cat',
             'ps_car_09_cat', 'ps_ind_02_cat', 'ps_ind_05_cat']:
    df_fs1[feat] = df_fs1[feat].fillna(-1)

What other columns require us to fill in missing values?

In [None]:
[feat for feat in df_fs1.columns.tolist() 
 if np.sum(pd.isnull(df_fs1[feat])) > 0]

For these features, let's use the median of these features to impute the missing values.

In [None]:
# Impute the remaining numeric features with their median; assigning the result
# back avoids relying on an in-place fill of a column selection
for feat in ['ps_car_12', 'ps_reg_03', 'ps_car_14']:
    df_fs1[feat] = df_fs1[feat].fillna(df_fs1[feat].median())

Let's check whether there are any more missing values.

In [None]:
np.sum(df_fs1.isnull())

There are no more missing values in our dataset!
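The two imputation strategies used in this section can be seen side by side on a toy frame. The sketch below uses hypothetical columns `colour_cat` and `weight`; it only illustrates the idea and is not the kernel's own data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'colour_cat': [0, 1, np.nan, 1],      # hypothetical categorical feature
    'weight': [1.0, np.nan, 3.0, 10.0],   # hypothetical numeric feature
})

# Categorical: a missing value becomes its own category, encoded as -1
toy['colour_cat'] = toy['colour_cat'].fillna(-1)

# Numeric: a missing value is replaced by the median of the observed values
toy['weight'] = toy['weight'].fillna(toy['weight'].median())

remaining_missing = int(toy.isnull().sum().sum())
print(toy)
```

The median is used rather than the mean so that a single large value (10.0 in this toy column) does not pull the imputed value upwards.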

### Feature Importances

Let's test how significant our features are in predicting the target label, using the `feature_importances_` attribute of a fitted RandomForestClassifier. We can then plot the relative importance of the features using a horizontal barplot.

The code to generate the `feature_importances_` plot was taken from the [Yhat Blog](http://blog.yhat.com/tutorials/5-Feature-Engineering.html).

Let's proceed to select our categorical features, using a RandomForestClassifier.

In [None]:
features = np.array([feature for feature in df_fs1.columns.tolist() 
                     if feature != 'target'])

In [None]:
random_state = 1212

In [None]:
idx = df_fs1[df_fs1.target.notnull()].index.tolist()

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(50, random_state=random_state)
clf.fit(df_fs1[features].loc[idx], df_fs1.target.loc[idx])

In [None]:
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)

plt.figure(figsize=(15, 10))

padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")

plt.show()

From our existing feature set, it appears that most of the categorical and binary features are not really informative in predicting the target label. Nonetheless, let's keep these features for now, and see if there is a need to remove them later.

In [None]:
# Copy the selection so that the new column is not added to a view of df_fs1
combined = df_fs1[features].copy()
combined['target'] = df_train.set_index('id').target

### Feature Engineering

Let's take a look at our **key** features more closely, and see whether we are able to create new features from our existing set.

In [None]:
plt.figure(figsize=(20,20))

plt.subplot(221)
sns.distplot(combined[combined.target == 0].ps_car_13.dropna(),
             bins = np.linspace(0, 4, 41), kde = False, norm_hist = True, color = 'red')
sns.distplot(combined[combined.target == 1].ps_car_13.dropna(),
             bins = np.linspace(0, 4, 41), kde = False, norm_hist = True, color = 'blue')
plt.title('ps_car_13 Distribution')

plt.subplot(222)
sns.distplot(combined[combined.target == 0].ps_reg_03,
             bins = np.linspace(0, 2, 11), kde = False, norm_hist = True, color = 'red')
sns.distplot(combined[combined.target == 1].ps_reg_03,
             bins = np.linspace(0, 2, 11), kde = False, norm_hist = True, color = 'blue')
plt.title('ps_reg_03 Distribution')

plt.subplot(223)
sns.distplot(combined[combined.target == 0].ps_car_14,
             bins = np.linspace(0.2, 0.6, 10), kde = False, norm_hist = True, color = 'red')
sns.distplot(combined[combined.target == 1].ps_car_14, 
             bins = np.linspace(0.2, 0.6, 10), kde = False, norm_hist = True, color = 'blue')
plt.title('ps_car_14 Distribution')

plt.subplot(224)
sns.distplot(combined[combined.target == 0].ps_ind_15.dropna(),
             bins = np.linspace(0, 15, 16), kde = False, norm_hist = True, color = 'red')
sns.distplot(combined[combined.target == 1].ps_ind_15.dropna(),
             bins = np.linspace(0, 15, 16), kde = False, norm_hist = True, color = 'blue')
plt.title('ps_ind_15 Distribution')

In [None]:
plt.figure(figsize=(20,15))

plt.subplot(221)
sns.distplot(combined[combined.target == 0].ps_ind_03.dropna(),
             bins = range(0, 8, 1), kde = False, norm_hist = True, color = 'red')
sns.distplot(combined[combined.target == 1].ps_ind_03.dropna(),
             bins = range(0, 8, 1), kde = False, norm_hist = True, color = 'blue')
plt.title('ps_ind_03 Distribution')

plt.subplot(222)
sns.distplot(combined[combined.target == 0].ps_reg_02.dropna(),
             bins = np.linspace(0, 2, 11), kde = False, norm_hist = True, color = 'red')
sns.distplot(combined[combined.target == 1].ps_reg_02.dropna(),
             bins = np.linspace(0, 2, 11), kde = False, norm_hist = True, color = 'blue')
plt.title('ps_reg_02 Distribution')

plt.subplot(223)
sns.distplot(combined[combined.target == 0].ps_car_11_cat.dropna(), 
             bins = range(0, 110, 5), kde = False, norm_hist = True, color = 'red')
sns.distplot(combined[combined.target == 1].ps_car_11_cat.dropna(), 
             bins = range(0, 110, 5), kde = False, norm_hist = True, color = 'blue')
plt.title('ps_car_11_cat Distribution')

plt.subplot(224)
sns.distplot(combined[combined.target == 0].ps_ind_01.dropna(),
             bins = range(0, 8, 1), kde = False, norm_hist = True, color = 'red')
sns.distplot(combined[combined.target == 1].ps_ind_01.dropna(),
             bins = range(0, 8, 1), kde = False, norm_hist = True, color = 'blue')
plt.title('ps_ind_01 Distribution')

In [None]:
plt.figure(figsize=(20,15))

plt.subplot(221)
sns.distplot(combined[combined.target == 0].ps_car_15.dropna(), 
             kde = False, norm_hist = True, color = 'red')
sns.distplot(combined[combined.target == 1].ps_car_15.dropna(), 
             kde = False, norm_hist = True, color = 'blue')
plt.title('ps_car_15 Distribution')

plt.subplot(222)
sns.distplot(combined[combined.target == 0].ps_reg_01.dropna().astype('float'),
             bins = range(0, 11, 1), kde = False, norm_hist = True, color = 'red')
sns.distplot(combined[combined.target == 1].ps_reg_01.dropna().astype('float'),
             bins = range(0, 11, 1), kde = False, norm_hist = True, color = 'blue')
plt.title('ps_reg_01 Distribution')

plt.subplot(223)
sns.distplot(combined[combined.target == 0].ps_car_01_cat.dropna().astype('float'), 
             bins = range(-1, 11, 1), kde = False, norm_hist = True, color = 'red')
sns.distplot(combined[combined.target == 1].ps_car_01_cat.dropna().astype('float'), 
             bins = range(-1, 11, 1), kde = False, norm_hist = True, color = 'blue')
plt.title('ps_car_01_cat Distribution')

plt.subplot(224)
sns.distplot(combined[combined.target == 0].ps_car_06_cat.dropna(), 
             bins = range(0, 17, 1), kde = False, norm_hist = True, color = 'red')
sns.distplot(combined[combined.target == 1].ps_car_06_cat.dropna(), 
             bins = range(0, 17, 1), kde = False, norm_hist = True, color = 'blue')
plt.title('ps_car_06_cat Distribution')

There don't appear to be any good features we can extract from our existing pool.

Let's take a look at the correlation across features.

In [None]:
combined['target'] = df_train.set_index('id').target

plt.figure(figsize=(20, 15))
sns.heatmap(combined.corr(), annot = True, cmap = cmap)
plt.show()

#### Polynomial Features

Can interaction terms help to improve the fit of our model? We will focus on the interaction terms of the top 10 features from our previous analysis, to minimise the computational complexity.

In [None]:
ind_var = [feature for feature in combined.columns[sorted_idx][-10:] 
           if feature != 'target']
ind_var.reverse()

In [None]:
from sklearn.preprocessing import PolynomialFeatures

train = combined[pd.notnull(combined.target)][ind_var].reset_index(drop=True)

poly = PolynomialFeatures(interaction_only = True, include_bias = False)

train_interaction = pd.DataFrame(poly.fit_transform(train))
train_interaction['target'] = df_train.target

How important are the interaction features relative to the original ones? Let's take a quick look.

In [None]:
features = np.array([feature for feature in train_interaction.columns.tolist()
                     if feature != 'target'])

clf = RandomForestClassifier(50, random_state = random_state)
clf.fit(train_interaction.iloc[:df_train.shape[0]][features], 
        train_interaction.iloc[:df_train.shape[0]]['target'])

In [None]:
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)

plt.figure(figsize=(20, 20))
padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")

plt.show()

Looking at the `feature_importances_` plot, we note that `feature10` (column 10 of the interaction matrix) is a strong indicator for the target label. What is `feature10`?

In [None]:
[feat for feat in ind_var]

Using the list above, it turns out `feature10` is the interaction term between `ps_car_13` and `ps_reg_03`. Let's include feature10 in our dataframe!
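To see why column 10 maps to the product of the first two entries of `ind_var`, it helps to write out the column layout. The sketch below assumes the layout that `PolynomialFeatures(interaction_only=True, include_bias=False)` is understood to produce at its default degree of 2: the original features first, followed by one product per unordered pair in `itertools.combinations` order. The names `x0` to `x9` and the helper `describe` are hypothetical stand-ins.

```python
from itertools import combinations

# Hypothetical stand-ins for the 10 entries of ind_var, most important first
names = ['x%d' % i for i in range(10)]

# Assumed column layout: the 10 original features, then all pairwise products
columns = [(i,) for i in range(len(names))]
columns += list(combinations(range(len(names)), 2))

def describe(col_idx):
    """Return a readable name for the given output column."""
    return ' * '.join(names[i] for i in columns[col_idx])

print(len(columns))   # 10 originals + 45 pairwise products
print(describe(10))   # first interaction column
```

On a fitted transformer in a recent scikit-learn release, the same mapping can be confirmed directly with `poly.get_feature_names_out()`.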

In [None]:
combined['feature10'] = combined['ps_car_13'] * combined['ps_reg_03']

To make sure that these features are informative, let's take a look at the correlation of these features with our target label.

In [None]:
combined['target'] = df_train.set_index('id').target

plt.figure(figsize=(20, 20))
sns.heatmap(combined.corr(), annot = True)
plt.show()

From the heatmap, it appears that `feature10` isn't strongly correlated with most of the other features.

Note: It has a correlation of 0.82 with the feature `ps_reg_03`, but that isn't **really** alarming. Let's keep it.

With the new feature added, let's see the relative importance of each feature in our dataset!

In [None]:
features = np.array([feature for feature in combined.columns.tolist()
                     if feature != 'target'])

clf = RandomForestClassifier(50, random_state = random_state)
clf.fit(combined[pd.notnull(combined.target)][features], 
        combined[pd.notnull(combined.target)].target)

In [None]:
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)

plt.figure(figsize=(20, 20))
padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")

plt.show()

From the `feature_importances_` plot, it appears that our new feature is doing really well! Let's keep it, as it has high relative importance and low correlation with most other features.

Let's remove the target label, and split our dataset into training and testing datasets.

In [None]:
del combined['target']

In [None]:
X_train = combined.reset_index(drop = True).iloc[:df_train.shape[0], ]
X_test = combined.reset_index(drop = True).iloc[df_train.shape[0]:, ]

### Model Fitting

Let's fit an [Extreme Gradient Boosting](http://xgboost.readthedocs.io/en/latest/) model to predict for the probability of insurance claim.

Let's define our evaluation metric and cost function first. This was taken off [Rudolph's iPython Notebook](https://www.kaggle.com/rshally/porto-xgb-lgb-kfold-lb-0-282/notebook).

In [None]:
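def gini(actual, pred, cmpcol = 0, sortcol = 1):
    assert( len(actual) == len(pred) )
    # Columns: actual label, prediction, original position (used to break ties)
    # np.float was removed from recent NumPy releases; the builtin float is equivalent
    arr = np.asarray(np.c_[ actual, pred, np.arange(len(actual)) ], dtype=float)
    # Sort by prediction (descending), then by original position (ascending)
    arr = arr[ np.lexsort((arr[:,2], -1*arr[:,1])) ]
    totalLosses = arr[:,0].sum()
    giniSum = arr[:,0].cumsum().sum() / totalLosses
    
    giniSum -= (len(actual) + 1) / 2.
    return giniSum / len(actual)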
 
def gini_normalized(a, p):
    return gini(a, p) / gini(a, a)

def gini_xgb(preds, dtrain):
    labels = dtrain.get_label()
    gini_score = gini_normalized(labels, preds)
    return 'gini', gini_score
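A small worked example makes the metric easier to read. The sketch below re-implements the same computation in plain Python (no NumPy) and compares it with the AUC obtained by counting correctly ordered positive/negative pairs. For binary labels without tied predictions, the normalized Gini equals 2 × AUC − 1. The helper `auc_by_pairs` and the toy labels and predictions are hypothetical and only serve this illustration.

```python
def gini(actual, pred):
    """Plain-Python version of the Gini computation used in this kernel."""
    n = len(actual)
    # Sort by prediction (descending), breaking ties by original position
    order = sorted(range(n), key=lambda i: (-pred[i], i))
    running = 0.0
    cum_total = 0.0
    for i in order:
        running += actual[i]
        cum_total += running
    return (cum_total / sum(actual) - (n + 1) / 2.0) / n

def gini_normalized(actual, pred):
    # Divide by the Gini of a perfect ranking so that the best score is 1
    return gini(actual, pred) / gini(actual, actual)

def auc_by_pairs(actual, pred):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [p for a, p in zip(actual, pred) if a == 1]
    neg = [p for a, p in zip(actual, pred) if a == 0]
    wins = sum(1 for p in pos for q in neg if p > q)
    return wins / float(len(pos) * len(neg))

actual = [0, 0, 1, 1]
pred = [0.1, 0.4, 0.35, 0.8]

score = gini_normalized(actual, pred)
auc = auc_by_pairs(actual, pred)
perfect = gini_normalized(actual, [0.1, 0.2, 0.8, 0.9])
print(score, auc, perfect)
```

Here three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75 and the normalized Gini is 0.5; a perfect ranking scores 1.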

Defining our features, X, and target labels, y.

In [None]:
features = X_train.columns.tolist()

X = X_train.values
test = X_test.values

y = df_train.set_index('id').target.values

In [None]:
params = {
    'objective': 'binary:logistic',
    'min_child_weight': 12.0,
    'max_depth': 5,
    'colsample_bytree': 0.5,
    'subsample': 0.8,
    'eta': 0.025,
    'gamma': 0.8,
    'max_delta_step': 1.5
}

In [None]:
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold

submission = pd.DataFrame()
submission['id'] = df_test['id'].values
submission['target'] = 0

nrounds=1000
folds = 5
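# random_state only has an effect when shuffle=True, and recent scikit-learn
# versions raise an error if it is passed without shuffling, so it is left out here
skf = StratifiedKFold(n_splits=folds)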

for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    print('XGB KFold: %d: ' % int(i+1))
    
    X_subtrain, X_subtest = X[train_index], X[test_index]
    y_train, y_valid = y[train_index], y[test_index]
    
    d_subtrain = xgb.DMatrix(X_subtrain, y_train) 
    d_subtest = xgb.DMatrix(X_subtest, y_valid) 
    d_test = xgb.DMatrix(test)
    
    watchlist = [(d_subtrain, 'subtrain'), (d_subtest, 'subtest')]
    
    mdl = xgb.train(params, d_subtrain, nrounds, watchlist, early_stopping_rounds=80, 
                    feval=gini_xgb, maximize=True, verbose_eval=50)
    
    # Predict test set based on the best_ntree_limit
    # (newer XGBoost releases replace ntree_limit with iteration_range)
    p_test = mdl.predict(d_test, ntree_limit=mdl.best_ntree_limit)
    
    # Take the average of the prediction via 5 folds to predict for the test set
    submission['target'] += p_test/folds
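Two ideas in the loop above can be shown on toy data: stratified folds keep the share of positives the same in every validation fold, and the final prediction is the plain average of the per-fold predictions. The labels `y_toy` and the per-fold numbers in `fold_preds` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 10 positives among 100 rows
y_toy = np.array([1] * 10 + [0] * 90)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split

skf_toy = StratifiedKFold(n_splits=5)
positive_rates = [y_toy[valid_idx].mean()
                  for _, valid_idx in skf_toy.split(X_toy, y_toy)]

# Made-up predictions for two test rows, one row of numbers per fold
fold_preds = np.array([[0.02, 0.10],
                       [0.04, 0.08],
                       [0.03, 0.12],
                       [0.05, 0.10],
                       [0.01, 0.10]])

# Equivalent to accumulating p_test / folds inside the loop
averaged = fold_preds.mean(axis=0)
print(positive_rates, averaged)
```

Keeping the positive rate constant across folds matters when positives are rare, because an unstratified split could leave a validation fold with very few positives and make its Gini score noisy.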

Looking at the cross-validation scores, it appears that we are performing pretty well across all Stratified Folds (save for the 5th one).

After training our model, it's time to submit. Let's see how well we performed.

In [None]:
submission.to_csv('submission.csv', index=False)

### Conclusion

Our model achieved a Normalized Gini Coefficient score of 0.279, which places us at the top 48 percentile of the Kaggle competition!