In this kernel, I'll explore my curiosities regarding the Porto Seguro dataset. I'll answer a few questions:

1. **How many unique values are there per column? How does this change between the training and test sets?**
2. **For categorical columns, what are the normalized histograms conditional on the target?**
3. **For regression columns, what is the density conditional on the target?**

Let's get started.

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

Now read in our data:

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

## How many unique values are there per column? How does this change between the training and test sets?

In [None]:
unique_counts = pd.DataFrame(train.nunique(),columns=['train_nunique'])
unique_counts = unique_counts.join(pd.DataFrame(test.nunique(),columns=['test_nunique']))

This will tell us how many unique values are in each column, but I'm actually concerned with their difference. For a column, how many values show up in the test set and not in the training set? What about vice versa? Let's answer this.

In [None]:
unique_counts['count_in_train_not_test'] = np.nan
unique_counts['count_in_test_not_train'] = np.nan
cols = set(train.columns.values) - {'target'}
for c in cols:
    train_set_c = set(train[c])
    test_set_c = set(test[c])
    unique_counts.loc[c, 'count_in_train_not_test'] = len(train_set_c - test_set_c)
    unique_counts.loc[c, 'count_in_test_not_train'] = len(test_set_c - train_set_c)
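# Flag columns whose value sets differ between train and test.
# The parentheses matter: `|` binds tighter than `!=`.
# 'target' keeps its NaN counts (it is absent from test), and NaN != 0 is True, so it is flagged too.
unique_counts['interesting'] = (unique_counts['count_in_train_not_test'] != 0) | (unique_counts['count_in_test_not_train'] != 0)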
unique_counts[unique_counts['interesting']]

The good news is there are no categorical values that show up in one and not the other.

But these regression variables are strange. If these continuous variables had infinite precision, we would expect zero overlap. That's not the case - there is a ton of overlap! Which is good news - it makes our test and training sets more similar.
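To make the precision argument concrete, here is a minimal sketch on synthetic data (not the competition data). It assumes two independent uniform samples and compares how many distinct values they share before and after rounding to two decimals; the sample size and the rounding level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(10_000)  # stand-in for a continuous column in "train"
b = rng.random(10_000)  # stand-in for the same column in "test"

# At full float precision, two independent samples share essentially no values.
full_overlap = len(set(a) & set(b))

# Rounded to two decimals there are at most 101 distinct values,
# so the two samples share nearly all of them.
a_rounded = np.round(a, 2)
b_rounded = np.round(b, 2)
rounded_overlap = len(set(a_rounded) & set(b_rounded))

print(full_overlap, rounded_overlap)
```

Heavy overlap between train and test is therefore what we would expect if the continuous columns were stored with limited precision.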

But what about the differences? If the differences are due to precision, but they span the same range, then these differences are no cause for concern. But if they arise because the test and training sets have very different ranges for these variables, then that's a big issue!

So let's explore.

In [None]:
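# Drop 'id' and 'target' by name: both are flagged above, but neither is a feature.
interesting_cols = [c for c in unique_counts[unique_counts['interesting']].index if c not in ('id', 'target')]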
train['which'] = 'training'
test['which'] = 'testing'
cols_pick = interesting_cols + ['which','id']
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames the same way.
both = pd.concat([train[cols_pick], test[cols_pick]])
both = pd.melt(both,id_vars=['id','which'], value_vars=interesting_cols)
plt.figure(figsize=(20,10))
ax = sns.boxplot(x="variable", y="value", hue="which", data=both.sample(frac=.05), palette="Set3")

OK - so they are virtually the same. Nothing to worry about here.
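The boxplot can be backed up with a numeric check. The sketch below uses a hypothetical helper, `compare_ranges`, which is not part of the original kernel. It puts each column's min and max in both sets side by side and flags columns where the test range extends beyond the training range. It is shown on toy frames; on the real data you would pass `train`, `test` and `interesting_cols`.

```python
import pandas as pd

def compare_ranges(train_df, test_df, cols):
    """Return each column's min and max in both frames, side by side."""
    summary = pd.DataFrame({
        'train_min': train_df[cols].min(),
        'train_max': train_df[cols].max(),
        'test_min': test_df[cols].min(),
        'test_max': test_df[cols].max(),
    })
    # True when the test set contains values outside the training range.
    summary['test_outside_train'] = (
        (summary['test_min'] < summary['train_min'])
        | (summary['test_max'] > summary['train_max'])
    )
    return summary

# Toy data for illustration only.
toy_train = pd.DataFrame({'x': [0.1, 0.5, 0.9], 'y': [1.0, 2.0, 3.0]})
toy_test = pd.DataFrame({'x': [0.2, 0.4, 0.8], 'y': [0.5, 2.5, 3.5]})
toy_summary = compare_ranges(toy_train, toy_test, ['x', 'y'])
print(toy_summary)
```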

## For categorical columns, what are the normalized histograms conditional on the target?
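By "normalized histogram conditional on the target" I mean, for each target outcome, the fraction of rows that take each category value. As a minimal sketch of that quantity, `pd.crosstab` with `normalize='columns'` computes it directly. The column name `some_cat` and the values are made up for illustration.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    'some_cat': [0, 0, 1, 1, 1, 2, 0, 1],
    'target':   [0, 0, 0, 0, 1, 1, 1, 0],
})

# normalize='columns' divides each target column by its total,
# so every column sums to 1 and holds P(category value | target).
cond = pd.crosstab(toy['some_cat'], toy['target'], normalize='columns')
print(cond)
```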

In [None]:
col_groups = {'categorical': ['ps_ind_08_bin','ps_calc_20_bin','ps_ind_12_bin','ps_ind_13_bin',
                 'ps_ind_10_bin','ps_calc_18_bin','ps_ind_09_bin','ps_calc_17_bin',
                 'ps_calc_15_bin','ps_ind_16_bin','ps_calc_19_bin','ps_ind_17_bin',
                 'ps_ind_18_bin','ps_ind_07_bin','ps_ind_11_bin','ps_ind_06_bin',
                 'ps_calc_16_bin','ps_car_06_cat','ps_car_01_cat','ps_car_04_cat',
                 'ps_car_09_cat','ps_car_11_cat','ps_ind_05_cat','ps_ind_04_cat',
                 'ps_car_08_cat','ps_car_05_cat','ps_ind_02_cat','ps_car_03_cat',
                 'ps_car_07_cat','ps_car_02_cat','ps_car_10_cat'],
 'regression': ['ps_calc_11','ps_ind_14','ps_calc_05','ps_car_11','ps_calc_03',
                'ps_car_12','ps_reg_01','ps_ind_03','ps_calc_01','ps_ind_15',
                'ps_calc_14','ps_reg_03','ps_car_13','ps_calc_13','ps_car_14',
                'ps_calc_04','ps_calc_10','ps_reg_02','ps_calc_06','ps_calc_08',
                'ps_calc_02','ps_calc_12','ps_car_15','ps_calc_07','ps_ind_01',
                'ps_calc_09']}

def cat_hist_cond(cat_col):
    #This plots the conditional (on the target) histogram for a given categorical column name
    train_temp = train[[cat_col] + ['target']]
    targ_counts = train_temp.groupby('target').count()
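    counts = train_temp.groupby([cat_col] + ['target']).size().reset_index(name='counts')
    # Cast to float so the normalized fractions can be written back in place
    counts['counts'] = counts['counts'].astype(float)
    for i in range(2):
        # Divide by the number of rows with target == i to get the fraction per category value
        counts.loc[counts['target']==i,'counts'] = counts.loc[counts['target']==i,'counts']/targ_counts.loc[i].values[0]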
    if cat_col == 'ps_car_11_cat':
        plt.figure(figsize=(24,6))
    else:
        plt.figure(figsize=(14,6))
    ax = sns.barplot(x=cat_col,y='counts',hue='target',data=counts)
    plt.title('Normalized histogram of ' + cat_col + ' for each target outcome')
    plt.show()

def reg_dens_cond(reg_col,width):
    #This plots the conditional (on the target) density for a given regression column name
    # Copy so that standardizing neither modifies nor warns about the original frame
    train_temp = train[[reg_col] + ['target']].copy()
    #We standardize so we can use the same KDE width for both distributions.
    train_temp[reg_col] = (train_temp[reg_col] - train_temp[reg_col].mean()) / train_temp[reg_col].std(ddof=0)
    plt.figure(figsize=(14,6))
    # seaborn >= 0.11 takes bw_method; older releases used bw=width instead
    sns.kdeplot(train_temp.loc[train_temp['target']==0,reg_col],bw_method=width,label="targ = 0")
    sns.kdeplot(train_temp.loc[train_temp['target']==1,reg_col],bw_method=width,label="targ = 1")
    plt.legend()
    plt.title('Density of ' + reg_col + ' for each target outcome')
    plt.show()

for cc in col_groups['categorical']:
    cat_hist_cond(cc)
for cc in col_groups['regression']:
    reg_dens_cond(cc,.2)

From this, I'm walking away with the following observations:

1. The binary columns ps_ind_16_bin, ps_ind_17_bin, ps_ind_07_bin and ps_ind_06_bin show the largest difference in distribution between the two target outcomes. Maybe we should cross them?
2. For categorical columns, the following values change considerably depending on the target:
    - [0, 1, 11, 15] for ps_car_06_cat
    - [7, 11] for ps_car_01_cat
    - [0] for ps_car_04_cat
    - [104] for ps_car_11_cat
3. For regression columns:
    - ps_reg_01, ps_car_13, ps_reg_02, ps_ind_01 and ps_reg_03 get skewed right for target = 1
    - ps_ind_15 gets skewed left for target = 1
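
One way to cross the four binary columns from the first observation is to join their 0/1 flags into a single label per row. This is only a sketch on made-up rows. The new column name `ind_cross` is hypothetical, and I haven't checked whether such a feature actually helps a model.

```python
import pandas as pd

cross_cols = ['ps_ind_06_bin', 'ps_ind_07_bin', 'ps_ind_16_bin', 'ps_ind_17_bin']

# Toy rows for illustration only.
toy = pd.DataFrame({
    'ps_ind_06_bin': [1, 0, 0, 0],
    'ps_ind_07_bin': [0, 1, 0, 0],
    'ps_ind_16_bin': [1, 1, 0, 1],
    'ps_ind_17_bin': [0, 0, 1, 0],
})

# Join the flags of each row into one string label, e.g. '1010'.
toy['ind_cross'] = toy[cross_cols].astype(str).apply(''.join, axis=1)
print(toy['ind_cross'].value_counts())
```

The crossed column can then be treated like any other categorical column, for example by passing it to `cat_hist_cond` once it has been added to `train`.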
    
I probably should have run statistical tests on the histograms. To be continued.
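
As a starting point for those tests, here is a minimal sketch of a Pearson chi-square statistic for independence between a categorical column and the target. The helper `chi2_statistic` is hypothetical and applies no continuity correction. A p-value could then be obtained with `scipy.stats.chi2.sf(stat, dof)`. The toy inputs are made up.

```python
import numpy as np
import pandas as pd

def chi2_statistic(values, target):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    observed = pd.crosstab(values, target).to_numpy(dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts if the column and the target were independent.
    expected = row_totals @ col_totals / observed.sum()
    stat = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return stat, dof

# Toy data: the first column determines the target, the second is independent of it.
toy_target = pd.Series([0, 0, 1, 1], name='target')
dependent = pd.Series([0, 0, 1, 1], name='dependent')
independent = pd.Series([0, 1, 0, 1], name='independent')

stat_dep, dof_dep = chi2_statistic(dependent, toy_target)
stat_ind, dof_ind = chi2_statistic(independent, toy_target)
print(stat_dep, dof_dep, stat_ind, dof_ind)
```

With roughly 600,000 training rows, even tiny differences are likely to come out as statistically significant, so an effect size would be worth reporting alongside any p-value.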
    