<img src="kaggle_comp.png">

<img src="sub.png">

<img src="load.png">

<img src="scoring.png">

# Data Exploration

The [Porto Seguro](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction) Kaggle competition centers on predicting the likelihood that an insured driver/car pair will file a claim.

## Data Format
The features have all been anonymized and conform to the following naming convention: `ps_CAT_##[_TYPE]`.

`CAT` refers to one of the following categories:
 - `ind` - I assume this means a general "indicator"
 - `reg` - I assume this means the insurance "registrant"
 - `car` - I assume this means "car"
 - `calc` - I assume this means a "calculated" field

`##` refers to the feature number within the category.

`TYPE` is optional and refers to one of the following types:
 - ` ` - No type indicates a scalar
 - `cat` - Indicates a categorical feature
 - `bin` - Indicates a binary feature
    
All missing values are indicated by `-1`.
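
Because `-1` is a sentinel rather than a real measurement, it can help to convert it to `NaN` before computing statistics. The sketch below shows one way to do that on a small made-up frame. The column values are invented for illustration and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame that follows the competition's naming style; the values are invented.
toy = pd.DataFrame({
    "ps_ind_01": [2, -1, 5],
    "ps_car_03_cat": [-1, -1, 1],
    "ps_calc_01": [0.6, 0.3, 0.1],
})

# Treat the -1 sentinel as a true missing value.
toy_nan = toy.replace(-1, np.nan)

# Share of missing entries per column.
missing_share = toy_nan.isnull().mean()
print(missing_share)
```

Once the sentinel is gone, `describe()` and `corr()` skip the missing entries instead of treating `-1` as a legitimate value.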

There is an `id` column used to mark the id of the entry.
The training data has a `target` column that is the target value (binary).
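
The naming convention is regular enough to parse mechanically. Here is a minimal sketch, assuming the pattern is exactly `ps_CAT_##[_TYPE]` as described above. The helper name `parse_feature` and the label `"scalar"` are my own choices, not part of the competition.

```python
import re

# Pattern for the convention ps_CAT_##[_TYPE], written from the description above.
FEATURE_RE = re.compile(r"^ps_(ind|reg|car|calc)_(\d+)(?:_(cat|bin))?$")


def parse_feature(name):
    """Return (category, number, type) for a feature name, or None for other columns."""
    match = FEATURE_RE.match(name)
    if match is None:
        # Columns such as id and target do not follow the convention.
        return None
    category, number, ftype = match.groups()
    # A missing suffix means the feature is treated as a scalar.
    return category, int(number), ftype or "scalar"


print(parse_feature("ps_ind_06_bin"))
print(parse_feature("ps_car_01_cat"))
print(parse_feature("ps_reg_03"))
print(parse_feature("target"))
```

Applying this to every column name gives a quick way to split the features into binary, categorical, and scalar groups.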

## Scoring
This competition is scored with the normalized Gini coefficient. [Wikipedia](https://en.wikipedia.org/wiki/Gini_coefficient) has a mind-numbing explanation. What matters is that the score measures the ORDERING of the predictions, not their magnitude. A perfect ordering scores 1.0, a random ordering scores about 0.0, and an ordering that is worse than random scores below 0.0. For example, let's say you have 4 observations (shown with their targets) in perfect order:
 - A - 1
 - B - 1
 - C - 0
 - D - 0
 
Ordering ABCD gets a perfect score of 1.0. However, let's say you have the following order (shown with predictions):
 - B - 0.9 (1)
 - C - 0.4 (0)
 - A - 0.3 (1)
 - D - 0.1 (0)

You calculate the Gini coefficient by:
 - summing the cumulative actual values
  - `(1) + (1 + 0) + (1 + 0 + 1) + (1 + 0 + 1 + 0) = 6`
 - dividing by the total number of positive targets
  - `6 / 2 = 3`
 - subtracting half of the length plus 1
  - `3 - (4 + 1) / 2 = 3 - 2.5 = 0.5`
 - dividing by the length
  - `0.5 / 4 = 0.125`

To find the normalized score, divide this value by the Gini score of the perfect ordering for the set:

 - `(1) + (1 + 1) + (1 + 1 + 0) + (1 + 1 + 0 + 0) = 7`
 - `7 / 2 = 3.5`
 - `3.5 - (4 + 1) / 2 = 3.5 - 2.5 = 1.0`
 - `1.0 / 4 = 0.25`
 - `0.125 / 0.25 = 0.5`
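
The steps above can be sketched in a few lines of NumPy. This is my own illustration of the calculation, not the competition's official scoring code. The function names `gini` and `gini_normalized` are assumptions, and the stable sort used for ties is a choice made here for reproducibility.

```python
import numpy as np


def gini(actual, pred):
    """Unnormalized Gini, following the four steps listed above."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    # Sort by prediction, highest first; a stable sort keeps ties in input order.
    order = np.argsort(-pred, kind="stable")
    sorted_actual = actual[order]
    n = len(actual)
    # Sum the cumulative actual values and divide by the number of positives.
    score = sorted_actual.cumsum().sum() / actual.sum()
    # Subtract (n + 1) / 2, then divide by n.
    return (score - (n + 1) / 2.0) / n


def gini_normalized(actual, pred):
    """Gini of the predictions divided by the Gini of a perfect ordering."""
    return gini(actual, pred) / gini(actual, actual)


# Observations A, B, C, D from the example, with their predictions.
actual = [1, 1, 0, 0]
pred = [0.3, 0.9, 0.4, 0.1]
print(gini(actual, pred), gini_normalized(actual, pred))  # 0.125 0.5
```

Passing `actual` as its own prediction is a convenient way to get the perfect ordering, since sorting by the targets puts every positive first.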

If all of this was confusing, think of the score as a "shuffle factor": how far the predicted order has been shuffled away from the perfect one.

Thinking about scoring is important. At first glance this looks like a classification problem. However, predicting only 1 or 0 leaves most observations tied, and ties are ordered arbitrarily, which produces horrible results. In a way, this is regression. However, regression is not optimal either, because the RMSE is usually large when the targets are only ever 1 or 0. So the best approach was to use classifiers but pull out the class probability instead of the predicted label.
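
To see why hard labels hurt, the sketch below reuses the four observations from the scoring example and assumes a 0.5 threshold (my choice, purely for illustration). Thresholding turns the predictions 0.9, 0.4, 0.3, 0.1 into 1, 0, 0, 0, so C, A and D are tied and their final order is arbitrary. The helper `normalized_gini_of_ranking` is a hypothetical name; it scores a list of targets that is already in ranked order.

```python
import numpy as np


def normalized_gini_of_ranking(ranked_actual):
    """Normalized Gini for targets listed from highest to lowest prediction."""
    ranked = np.asarray(ranked_actual, dtype=float)
    n = len(ranked)

    def raw(values):
        # Same four steps as the worked example in the Scoring section.
        return (values.cumsum().sum() / values.sum() - (n + 1) / 2.0) / n

    perfect = np.sort(ranked)[::-1]  # all positives first
    return raw(ranked) / raw(perfect)


# Two of the possible ways the tie between C, A and D could be broken.
lucky_order = [1, 1, 0, 0]    # B, A, C, D: the tie breaks in our favour
unlucky_order = [1, 0, 0, 1]  # B, C, D, A: the tie breaks against us

print(normalized_gini_of_ranking(lucky_order))    # 1.0
print(normalized_gini_of_ranking(unlucky_order))  # 0.0
```

The same hard labels can score anywhere between those two values depending on how the tie is broken, whereas the original probabilities pin the score at 0.5.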

In [None]:
# Packages
import math
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
# Data Import
df_train = pd.read_csv("../input/train.csv")
df_test = pd.read_csv("../input/test.csv")

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
df_train.describe()

In [None]:
df_test.describe()

In [None]:
print("train.csv null value count = " + str(df_train.isnull().sum().sum()))
print("train.csv -1 value count = " + str((df_train == -1).sum().sum()))
print("test.csv null value count = " + str(df_test.isnull().sum().sum()))
print("test.csv -1 value count = " + str((df_test == -1).sum().sum()))

In [None]:
def data_density(df):
    """Print how many -1 (missing) values the frame holds, overall and per row/column."""
    print("Number of rows = " + str(df.shape[0]))
    print("Number of columns = " + str(df.shape[1]))
    print("null value count = " + str(df.isnull().sum().sum()))
    neg_1_count = (df == -1).sum().sum()
    print("-1 value count = " + str(neg_1_count) + " ("+str(100. * neg_1_count / df.shape[0] / df.shape[1])+"%)")
    # A row or column counts as sparse if it contains at least one -1
    num_sparse_rows = (df == -1).any(axis=1).sum()
    print("Number of sparse rows = " + str(num_sparse_rows) + " ("+str(100. * num_sparse_rows / df.shape[0])+"%)")
    num_sparse_cols = (df == -1).any(axis=0).sum()
    print("Number of sparse columns = " + str(num_sparse_cols) + " ("+str(100. * num_sparse_cols / df.shape[1])+"%)")

    print("")
    densities = (df != -1).sum() / df.shape[0]
    print("Columns with less than 100% density:")
    print(densities[densities < 1] * 100)


In [None]:
print("Train.csv density analysis:")
data_density(df_train)
print("")
print("Test.csv density analysis:")
data_density(df_test)

In [None]:
# Columns Analysis
res_cols = ["id",  "target"]
bin_cols = [col for col in df_train.columns if col[-3:] == "bin"]
cat_cols = [col for col in df_train.columns if col[-3:] == "cat"]
num_cols = [col for col in df_train.columns if col not in res_cols + bin_cols + cat_cols]

In [None]:
# Standard Heatmap
fig, ax = plt.subplots(figsize=(20,10))
sns.heatmap(df_train.corr(), linewidths=.5, ax=ax)

In [None]:
# heatmap of non-binary columns
fig, ax = plt.subplots(figsize=(20,10))
sns.heatmap(df_train[cat_cols + num_cols + ["target"]].corr(), linewidths=.5, ax=ax)

In [None]:
# Bin variable analysis
df_bin = df_train[["target"] + bin_cols]
df_temp = df_bin.groupby("target").mean()
# DataFrame.append was removed in pandas 2.0, so stack the rows with pd.concat instead
df_temp = pd.concat(
    [df_temp, df_bin[bin_cols].mean().to_frame().T, df_bin[bin_cols].std().to_frame().T],
    ignore_index=True,
).T
df_temp.columns = ["0_mean", "1_mean", "mean", "std"]
df_temp["0_std_dev"] = (df_temp["0_mean"] - df_temp["mean"]) / df_temp["std"]
df_temp["1_std_dev"] = (df_temp["1_mean"] - df_temp["mean"]) / df_temp["std"]
df_temp["std_dev_diff"] = df_temp["0_std_dev"] - df_temp["1_std_dev"]
df_temp["std_dev_diff"] = df_temp["std_dev_diff"].abs()
df_temp.sort_values("std_dev_diff", ascending=False)

# We are interested in seeing if the binary variables are skewed in any direction.
# If they were, the standardized mean for target 0 or 1 would differ significantly.
# In fact, we do see this in some (ps_ind_17_bin, ps_ind_07_bin, ps_ind_06_bin).
# However, this difference is relatively small and unlikely to discriminate well.

In [None]:
df_bin.groupby("target").var().T

In [None]:
# Now let's build a binary tree!
# I made this up... not much interesting here

mean_target = df_bin["target"].mean()
# For every combination of binary features: RMS deviation of the target from the global mean, target mean, and row count
bin_tree = df_bin.groupby(bin_cols).agg([lambda x: math.sqrt(((np.array(x) - mean_target)**2).mean()), "mean", "count"]).reset_index()
bin_tree.columns = bin_tree.columns.get_level_values(0)[0:-3].tolist() + ["rmse", "mean", "count"]
# Keep only combinations with at least the average number of rows per combination
min_count = df_bin.shape[0] / bin_tree.shape[0]
bin_tree[bin_tree["count"] >= min_count].sort_values("rmse", ascending=False)

In [None]:
# Ok, let's take a look at the numeric fields

In [None]:
# Define some helper functions to describe / plot columns

def column_details(df, col, is_cat=False):
    """Print summary statistics for one column and plot its distribution."""
    print("("+col+") Null Values: " + str(df[col].isnull().sum()))
    print(df[col].describe())
    if is_cat:
        sns.countplot(y=col, data=df)
    else:
        # sns.distplot is deprecated; histplot draws the same histogram
        sns.histplot(df[col].dropna(), kde=False)

    plt.show()

def data_details(data, num_cols, cat_cols):
    """Describe numeric columns as histograms and categorical columns as count plots."""
    for col in num_cols:
        column_details(data, col)

    # Binary columns are passed in through cat_cols, so no separate loop over the global bin_cols
    for col in cat_cols:
        column_details(data, col, is_cat=True)

In [None]:
data_details(df_train, num_cols, bin_cols + cat_cols + ["target"])

In [None]:
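# Joint distribution of each column against the target (the target itself is skipped)
for col in df_train.columns.drop("target"):
    sns.jointplot(x=col, y="target", data=df_train)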