# Kaggle Competition - Porto Seguro's Safe Driver Prediction

This notebook is more about learning from some of the top solutions than inventing my own, although I will certainly make my own attempt.

A few things about this competition and why it appeals to me...

This is the biggest Kaggle competition to date, other than Titanic (for which I will soon publish my 'stupid-but-it-works' 100% web-scraping solution). There are 5169 teams in total. What I find amazing is the diversity of the solutions. Michael Jahrer, who took first place, admitted to a lack of creativity in feature engineering and to not having much success with XGBoost. Yet the second-place team, 三个臭皮匠, relied quite heavily on feature engineering, and XGBoost played a vital role in their solution. The third-place solution by utility is noteworthy for jumping 1071 spots. (There is a public leaderboard and a private leaderboard. People sometimes overfit to the data behind the public leaderboard and get punished when the private leaderboard comes out and their solution drops way down.) Another fun contrast: Jahrer says he never drops features, while the first thing utility discusses in their write-up is eliminating features.

This is from the insurance industry, an industry I have some knowledge of and experience in. One of the reasons I moved from the actuarial career path to the data scientist path was that I felt the insurance industry had a bit of a reputation as a big, old, immovable force in need of some new ways of thinking. An oversimplified version of insurance pricing goes like this:

i. You have a base rate and then we determine your risks based on information you provide us.

ii. For each risk, we look up a table, and this table gives some modifier to your base rate.

iii. That number is what we charge you for your premium.

So, for example, if you are 40 and drive a red sports car, we may apply a 0.95 multiplier for your age, add 10 dollars to your premium because your car is red, and then apply a 1.5 multiplier because you drive a sports car. If your base rate was 100, your rate would be (100 x 0.95 + 10) x 1.5 = 157.5. Before data science and machine learning became super popular and were applied to every problem imaginable, this was a great way to do it. You're looking at huge datasets, there's a value for each risk category, and you can come up with a reasonable estimate based on these actuarial tables. It's like a pseudo decision tree.
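The table-lookup pricing above can be sketched in a few lines of Python. The table names and the `premium` function are hypothetical, and the factors are just the made-up numbers from the example, not real rating tables.

```python
# Hypothetical rating tables; the values are the made-up ones from the example.
AGE_FACTOR = {40: 0.95}            # multiplicative modifier by age
COLOR_SURCHARGE = {"red": 10.0}    # flat dollar add-on by car color
BODY_FACTOR = {"sports": 1.5}      # multiplicative modifier by body type


def premium(base_rate, age, color, body):
    """Apply each table lookup in the same order as the worked example."""
    rate = base_rate * AGE_FACTOR.get(age, 1.0)   # unknown ages get no modifier
    rate += COLOR_SURCHARGE.get(color, 0.0)       # unknown colors add nothing
    rate *= BODY_FACTOR.get(body, 1.0)            # unknown body types get no modifier
    return rate


print(premium(100, age=40, color="red", body="sports"))  # 157.5
```

Each risk is handled by an independent lookup, which is why this style of pricing reads like a hand-built decision tree.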

There are also more 'one price fits all' kinds of insurance... You may have seen this for health insurance. There are different tiers of plans with clear details on what they cover or don't cover, then you buy the plan and you roll with it. The theory with insurance is that if you have a pool of people paying premiums and most don't get sick, that will be enough to cover those few people in the pool who get sick and need money.

These one-price-fits-all plans aim to simplify and standardize insurance so that insurers can cut some costs and offer a lower-priced plan. The drawback is that, in reality, people's health problems aren't so cut and dried. For example, take two people. Person A's healthcare costs $200 a year, and Person B's healthcare costs $150 a year. Person A has all their medical needs covered by the bronze plan. Person B has a slightly more unusual medical condition that is only covered by the silver plan, even though the actual cost of their healthcare is less than Person A's. Person B leaves your insurance company because they don't think they should be paying for the silver tier. This example is just one person, but it could just as well be a company with 5000 employees leaving your insurance plan because they want a specific kind of health care coverage.

And that's it! This first notebook is very simple. Subsequent notebooks will go into the top solutions by Kagglers.

# Brief EDA

In [99]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [5]:
dftrain = pd.read_csv('train.csv')
dftest = pd.read_csv('test.csv')

In [8]:
dftrain.shape

(595212, 59)

In [19]:
dftest.shape

(892816, 58)

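The test set has 58 columns: an id plus 57 features. The train set has the same columns plus the target, for 59 in total. The train set consists of 595,212 observations, and the test set has 892,816 observations.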

The target variable is conveniently named target, and is broken down by the counts below.

In [25]:
dftrain.target.value_counts()

0    573518
1     21694
Name: target, dtype: int64

There is a strong imbalance here: only about 3.6% of the rows are 1s. This is something we have to manage, because if we just throw our algorithms at the data, the models could be biased toward the majority class. Common ways to handle this are to upsample the smaller class (the 1s), downsample the larger class (the 0s), or apply class weights in our models.
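To put numbers on the imbalance, here is a small sketch that uses only the counts printed above. The "balanced" weighting shown (total / (number of classes x class count)) is the heuristic scikit-learn documents for `class_weight="balanced"`; it is assumed here purely for illustration, not as something the top solutions necessarily used.

```python
# Class counts copied from the value_counts() output above.
n_neg, n_pos = 573518, 21694
n_total = n_neg + n_pos

pos_rate = n_pos / n_total      # share of rows with target == 1
neg_to_pos = n_neg / n_pos      # how many 0s there are for every 1

# "Balanced" class weights: n_total / (n_classes * class_count).
class_weight = {
    0: n_total / (2 * n_neg),
    1: n_total / (2 * n_pos),
}

print(f"positive rate: {pos_rate:.4f}")
print(f"negatives per positive: {neg_to_pos:.1f}")
print(class_weight)
```

With these weights, each class contributes the same total weight, so the rare 1s are not drowned out by the 0s.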

I think it would be a bit redundant to make my own EDA when some other Kagglers have made really good EDAs, so I will refer you to this very in-depth EDA [here.](https://www.kaggle.com/headsortails/steering-wheel-of-fortune-porto-seguro-eda)

# Evaluation Metric

This competition was evaluated using the normalized Gini coefficient. The [Gini coefficient](https://en.wikipedia.org/wiki/Gini_coefficient) is often used in economics as an indicator of inequality, and it is also used in the actuarial and insurance industry for evaluating predictive models.

A few things about the normalized gini coefficient...

1. The normalization divides the model's Gini coefficient by the Gini coefficient of a perfect model, so a perfect ranking scores 1 and a random ranking scores about 0. A ranking that is worse than random scores below 0.

2. We wouldn't want to use regular accuracy here because of the high degree of imbalance in this dataset. If we just predicted everything as 0, we would already have an accuracy of about 96%.

3. The Gini coefficient isn't too hard to understand, but I like to walk through each step of the algorithm and see how it is calculated.

In [94]:
actual = np.array([1, 1, 0, 1, 0])
pred = np.array([0.1, 0.2, 0.9, 0, 0.1])

In [98]:
all = np.asarray(np.c_[ actual, pred, np.arange(len(actual)) ], dtype=float)
print(all)
print('')
# Sort rows by descending prediction, breaking ties by original index.
all = all[ np.lexsort((all[:,2], -1*all[:,1])) ]
print(all)

[[ 1.   0.1  0. ]
 [ 1.   0.2  1. ]
 [ 0.   0.9  2. ]
 [ 1.   0.   3. ]
 [ 0.   0.1  4. ]]

[[ 0.   0.9  2. ]
 [ 1.   0.2  1. ]
 [ 1.   0.1  0. ]
 [ 0.   0.1  4. ]
 [ 1.   0.   3. ]]


In the first array we have the actual values in the first column, the predictions in the second column, and the original row index in the third column. In the second array the rows are sorted by the predicted probabilities in the second column, from highest to lowest, with ties broken by the original index.

In [100]:
all[:,0].sum()

3.0

Our denominator is simply the sum of our first column, i.e. the number of actual 1s.

In [102]:
print(all[:,0].cumsum())
print(all[:,0].cumsum().sum())

[ 0.  1.  2.  2.  3.]
8.0


Our numerator is the sum of the cumulative sums down our first column. So you can see that if all our 1s were higher up, the cumulative sums would grow sooner and we would get a larger numerator, resulting in a normalized Gini coefficient closer to 1.

All the other steps here are just simple arithmetic meant to make the number more interpretable and pretty, so we'll move on.
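For anyone who does want to see that remaining arithmetic, here is a sketch on the same toy arrays. It follows the same steps as the walkthrough (ratio of numerator to denominator, subtract (n + 1) / 2, divide by n), and then repeats them for a perfect ordering to get the normalizer. The variable names are my own for illustration.

```python
import numpy as np

actual = np.array([1, 1, 0, 1, 0])
pred = np.array([0.1, 0.2, 0.9, 0, 0.1])
n = len(actual)
index = np.arange(n)

# Model ordering: descending prediction, ties broken by original index.
model_order = np.lexsort((index, -pred))
numerator = actual[model_order].cumsum().sum()   # 8 in the walkthrough
denominator = actual.sum()                       # 3 in the walkthrough
gini_model = (numerator / denominator - (n + 1) / 2) / n

# Perfect ordering: score every row with its own label.
perfect_order = np.lexsort((index, -actual))
perfect_numerator = actual[perfect_order].cumsum().sum()
gini_perfect = (perfect_numerator / denominator - (n + 1) / 2) / n

print(gini_model)                  # about -0.0667
print(gini_perfect)                # about 0.2
print(gini_model / gini_perfect)   # about -0.333
```

The negative result says this toy prediction ranks the 1s worse than a random ordering would, which makes sense given that the highest score went to a 0.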

The function definitions are below.

In [30]:
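def gini(actual, pred, cmpcol = 0, sortcol = 1):
    # cmpcol and sortcol are not used in the body.
    assert(len(actual) == len(pred))
    # Columns: actual, predicted, original row index.
    all = np.asarray(np.c_[ actual, pred, np.arange(len(actual)) ], dtype=float)
    # Sort rows by descending prediction, breaking ties by original index.
    all = all[ np.lexsort((all[:,2], -1*all[:,1])) ]
    totalLosses = all[:,0].sum()
    giniSum = all[:,0].cumsum().sum() / totalLosses

    # Subtract the value expected under a random ordering, then scale by the row count.
    giniSum -= (len(actual) + 1) / 2.
    return giniSum / len(actual)

def gini_normalized(a, p):
    # Divide by the Gini of a perfect model (predictions equal to the actuals).
    return gini(a, p) / gini(a, a)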

Back to the data... The feature names are not very revealing at all. The only labels are a group tag (ind, reg, car, or calc) and an optional suffix (cat, bin, or nothing), marking the feature as categorical, binary, or continuous/ordinal respectively.
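Since the names are all we have, it can help to parse them. The sketch below assumes the columns follow a `ps_<group>_<number>[_cat|_bin]` pattern; the sample names, the regular expression, and the `describe` helper are illustrative assumptions, so check them against `dftrain.columns` before relying on them.

```python
import re
from collections import Counter

# Illustrative column names assumed to follow ps_<group>_<nn>[_cat|_bin].
columns = ["ps_ind_01", "ps_ind_02_cat", "ps_ind_06_bin", "ps_reg_01",
           "ps_car_01_cat", "ps_car_11", "ps_calc_01", "ps_calc_15_bin"]

NAME_PATTERN = re.compile(r"^ps_(ind|reg|car|calc)_\d+(?:_(cat|bin))?$")
KIND = {"cat": "categorical", "bin": "binary", None: "continuous or ordinal"}


def describe(column):
    """Return (group, kind) for a feature name, or None if it does not match."""
    match = NAME_PATTERN.match(column)
    if match is None:
        return None
    group, suffix = match.groups()
    return group, KIND[suffix]


for name in columns:
    print(name, describe(name))

# Count how many sample features fall into each kind.
print(Counter(describe(name)[1] for name in columns))
```

On the real frame, columns such as the target would not match the pattern and come back as None, so they would need to be filtered out before counting.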

This is both good and bad. I think at the end of the day, the best predictive model is one that relies on both machines and human knowledge. When the features are anonymized like this, you take away the opportunity to use human intuition for feature engineering. But at the same time, you eliminate a lot of the biases that can arise from human intuition.

So these are just a few of the early thoughts. We will look into Michael Jahrer's first place solution next!