# Porto Seguro's EDA
_F. Daniel (September 2017)_
___

The aim of this notebook is to provide a quick overview of the content of the dataset. In particular, this dataset contains a reasonable number of features, but none of them is directly related to a well-defined quantity (for example, the age of the policyholder or the number of years since they obtained their driving licence). Hence, the analysis will be performed in a _blind_ manner, without any _a priori_ knowledge of the content of the variables. A first step, however, is to understand the behaviour of the categorical variables (which are all labelled with integers) and of the numerical variables.

I first load the packages:

In [None]:
import numpy as np 
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
pd.options.display.max_columns = 100
plt.rcParams["patch.force_edgecolor"] = True
plt.style.use('fivethirtyeight')
mpl.rc('patch', edgecolor = 'dimgray', linewidth=1)
%matplotlib inline

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

and then import the dataframes:

In [None]:
df_train = pd.read_csv('../input/train.csv')
df_test = pd.read_csv('../input/test.csv')

Before continuing, I put aside the **id** and **target** variables:

In [None]:
df_targets = df_train[['id', 'target']]
df_train.drop(['id', 'target'], axis = 1, inplace = True)

## 1. Variables and null values

First, I have a look at the variables in the dataframe and at the number of missing values. In this dataset, undefined values are set to `-1`. I switch to the standard convention and convert the `-1` values to `np.nan` (without this step, `pandas` facilities such as the `isnull()` method, which gives quick access to undefined quantities, could not be used). I define the `get_info()` function, which outputs the data type of each variable, the number of null values and their percentage with respect to the total number of entries:

In [None]:
def get_info(df):
    """
    Gives some info on column types and number of null values.
    Note: the -1 values of `df` are replaced by np.nan in place.
    """
    print('dataframe dimensions:', df.shape)
    tab_info = pd.DataFrame(df.dtypes).T.rename(index={0:'column type'})
    df.replace({-1:np.nan}, inplace = True) # TAG NULL VALUES
    # DataFrame.append was removed in pandas 2.0, so the rows are stacked with pd.concat
    nb_null = pd.DataFrame(df.isnull().sum()).T.rename(index={0:'null values (nb)'})
    pct_null = (pd.DataFrame(df.isnull().sum()/df.shape[0]*100)
                .T.rename(index={0:'null values (%)'}))
    tab_info = pd.concat([tab_info, nb_null, pct_null])
    return tab_info
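
To make the effect of the `-1` to `np.nan` conversion concrete, here is a minimal sketch on a toy dataframe. The column names and values are made up for illustration and are not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Toy dataframe (hypothetical values): -1 marks an undefined entry
toy = pd.DataFrame({'a': [0, -1, 2, -1],
                    'b': [1.5, 0.2, -1.0, 0.7]})

# Without the conversion, pandas does not see any missing value
nulls_before = toy.isnull().sum()

# After replacing -1 by np.nan, isnull() picks up the undefined entries
toy_nan = toy.replace({-1: np.nan})
nulls_after = toy_nan.isnull().sum()
pct_after = nulls_after / toy_nan.shape[0] * 100

print(nulls_before.to_dict())
print(nulls_after.to_dict())
print(pct_after.to_dict())
```

Unlike `get_info()`, this sketch works on a copy (`toy_nan`) and leaves the original dataframe untouched.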

I then look at the training set:

In [None]:
tab_info = get_info(df_train)
tab_info

and at the test set:

In [None]:
get_info(df_test)

Now, we can have a look at the percentage of missing values in each variable:

In [None]:
tab_info = tab_info.T.reset_index()
tab_info = tab_info.sort_values('null values (%)').reset_index(drop = True)
#_____________________________________
y_axis  = tab_info['null values (%)'] 
x_label = tab_info['index']
x_axis  = tab_info.index

fig = plt.figure(figsize=(11, 4))
plt.xticks(rotation=80, fontsize = 14)
plt.yticks(fontsize = 13)
plt.bar(x_axis, y_axis)
plt.xticks(x_axis, x_label, fontsize = 12)
plt.title('Missing values (%)', fontsize = 18);

The **target** variable, which is to be predicted, indicates whether a claim was filed or not.

Apart from this variable, the dataframe contains 58 variables:
- **id**: the identifier of the user
- 57 columns named **ps\_tag1\_NUM(\_tag2)** with:
    * tag1 $\in$ \{ind, reg, car, calc \}
    * NUM $\in$ [1:20]
    * tag2 $\in$ \{bin, cat \}, indicating binary and categorical features respectively. This tag is optional.
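
The naming scheme above can be checked programmatically. The sketch below is only an illustration: `parse_name` and `PATTERN` are hypothetical helpers introduced here, not part of the notebook, and the regular expression simply encodes the convention listed above.

```python
import re

# Pattern for the naming scheme ps_tag1_NUM(_tag2) described above
PATTERN = re.compile(r'^ps_(ind|reg|car|calc)_(\d+)(?:_(bin|cat))?$')

def parse_name(col):
    """Split a column name into (tag1, NUM, tag2); tag2 is None when absent.

    Returns None if the name does not follow the scheme (e.g. 'id').
    """
    m = PATTERN.match(col)
    if m is None:
        return None
    tag1, num, tag2 = m.groups()
    return tag1, int(num), tag2

for name in ['ps_ind_01', 'ps_car_03_cat', 'ps_ind_06_bin', 'id']:
    print(name, '->', parse_name(name))
```

Applied to `df_train.columns`, such a helper would make it easy to group the variables by family or by type.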
  
The tables given above show that some variables contain undefined values, in particular the following ones:
- **ps_reg_03**: 18% 
- **ps_car_03_cat**: 69%
- **ps_car_05_cat**: 45%
- **ps_car_14**: 7%

____
## 2. Categorical values

There are a few categorical variables, identified by the **cat** suffix:

In [None]:
nb = sum(["cat" in s for s in df_train.columns])
print('categorical variables: {} '.format(nb))

These variables contain integers. Below, I plot the number of entries found in each category of these variables:

In [None]:
ind = 0
for col in df_train.columns:
    if "cat" not in col: continue
    ind += 1
    fig = plt.figure(1, figsize=(11,30))
    ax1 = fig.add_subplot(nb,1,ind)    
    x_axis = list(df_train[col].value_counts().index)
    y_axis = list(df_train[col].value_counts())
    x_label = list(map(int,x_axis))
    if len(x_label) > 50:
        x_label = [s if s%2 == 0 else '' for i,s in enumerate(x_label)]
    plt.xticks(x_axis, x_label)
    ax1.bar(x_axis, y_axis, align = 'center', label = col)
    plt.legend(prop={'size': 14})
    if ind == nb: break

## 3. Numerical variables

In [None]:
nb  = sum([("cat" not in s) and ('bin' not in s) for s in df_train.columns])
print('numerical variables: {} '.format(nb))

### 3.1 Distributions

#### 3.1.1 Variables with the **calc** tag

By looking at the numerical variables, we can distinguish a few different kinds of distributions. First, if we consider the variables indexed with the **calc** tag, we see that we have either uniform distributions:

In [None]:
# uniform distributions:
list_cols_uniform = ['ps_calc_01', 'ps_calc_02', 'ps_calc_03']
#____________________________
ind = 0
for col in list_cols_uniform:
    ind += 1
    fig = plt.figure(1, figsize=(11,3))
    ax1 = fig.add_subplot(1, 3, ind)    
    sns.histplot(df_train[col].dropna(), kde=False, ax=ax1)  # distplot is deprecated since seaborn 0.11
    if ind == nb: break
plt.suptitle('Uniform distributions');

or distributions that have a more shallow shape.

In [None]:
# shallow distributions: 
list_cols_shallow = ['ps_car_13', 'ps_reg_03', 'ps_calc_10', 'ps_calc_14', 'ps_calc_11',
                     'ps_ind_03', 'ps_calc_13', 'ps_calc_06', 'ps_calc_07', 'ps_calc_08',
                     'ps_calc_09', 'ps_calc_12', 'ps_calc_04', 'ps_calc_05', 'ps_car_11']
#____________________________
ind = 0
for col in list_cols_shallow:
    ind += 1
    fig = plt.figure(1, figsize=(10,15))
    ax1 = fig.add_subplot(5, 3, ind)    
    sns.histplot(df_train[col].dropna(), kde=False, ax=ax1)  # distplot is deprecated since seaborn 0.11
    if ind == nb: break
fig.suptitle('Shallow distributions')
fig.tight_layout(rect=[0, 0.03, 1, 0.95])

Given the regularity of the distributions found for these variables (and assuming that the **calc** tag carries some meaning), we may assume that the above variables come from some statistical model.
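
The regularity mentioned above can be quantified in a crude way, for instance by comparing the relative frequency of each value with the frequency expected for a uniform distribution. The sketch below uses synthetic data as a stand-in for a **calc** variable; `max_deviation_from_uniform` is a hypothetical helper introduced for illustration, and the number of levels is an assumption.

```python
import numpy as np
import pandas as pd

def max_deviation_from_uniform(series):
    """Largest absolute gap between observed frequencies and 1/n_levels."""
    freq = series.dropna().value_counts(normalize=True)
    return (freq - 1 / len(freq)).abs().max()

rng = np.random.default_rng(0)

# Synthetic stand-in: 10 equally likely levels 0.0, 0.1, ..., 0.9
uniform_like = pd.Series(rng.integers(0, 10, size=100_000) / 10)

# Strongly unbalanced counter-example: 90% of zeros, 10% of ones
skewed = pd.Series([0] * 90 + [1] * 10)

dev_uniform = max_deviation_from_uniform(uniform_like)
dev_skewed = max_deviation_from_uniform(skewed)
print(dev_uniform, dev_skewed)
```

A small deviation is compatible with a uniform draw, but it is not a proof of it; a proper goodness-of-fit test would be needed to go further.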

#### 3.1.2 Empirical distributions

The other numerical variables in the dataframe do not show such smooth shapes and are probably obtained through a census:

In [None]:
list_cols_other   = ['ps_ind_01', 'ps_ind_15', 'ps_reg_01', 'ps_reg_02', 'ps_car_12',
                     'ps_car_14', 'ps_car_15', 'ps_ind_14']
#____________________________
ind = 0
for col in list_cols_other:
    ind += 1
    fig = plt.figure(1, figsize=(10,10))
    ax1 = fig.add_subplot(3, 3, ind)    
    sns.histplot(df_train[col].dropna(), kde=False, ax=ax1)  # distplot is deprecated since seaborn 0.11
    if ind == nb: break
fig.suptitle('Empirical distributions')
fig.tight_layout(rect=[0, 0.03, 1, 0.95])


### 3.2 Correlation coefficients

For some reason, the coefficients I calculate seem to differ from the ones given in other notebooks. I still have to look into this.

In [None]:
list_cols = []
for col in df_train.columns:
    if 'bin' not in col and 'cat' not in col:
        list_cols.append(col)

        
df_corr = df_train.copy(deep=True)
df_corr['target'] = df_targets['target']
corrmat = df_corr[list_cols + ['target']].corr()

In [None]:
df_corr[list_cols + ['target']].dropna(how='any').corr()[:5]

In [None]:
corrmat[:5]
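
One possible source of differences between correlation coefficients, offered here as an assumption and not checked against other notebooks, is the treatment of undefined values: `DataFrame.corr()` excludes missing values pair by pair, whereas keeping the `-1` placeholders or dropping incomplete rows leads to other coefficients. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): -1 marks an undefined entry
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, -1.0, 6.0],
                    'y': [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
                    'z': [5.0, -1.0, 3.0, 2.0, 1.0, 0.0]})

# 1) -1 kept as an ordinary numerical value
r_raw = toy.corr().loc['x', 'y']

toy_nan = toy.replace({-1: np.nan})

# 2) pairwise exclusion: only rows where x or y is undefined are skipped
r_pairwise = toy_nan.corr().loc['x', 'y']

# 3) listwise exclusion: any row with an undefined value is dropped
r_listwise = toy_nan.dropna(how='any').corr().loc['x', 'y']

# 4) np.corrcoef propagates np.nan instead of excluding it
r_numpy = np.corrcoef(toy_nan[['x', 'y']].values.T)[0, 1]

print(r_raw, r_pairwise, r_listwise, r_numpy)
```

The three pandas variants give three different values for the same pair of columns, and the `numpy` variant returns `nan`.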

In [None]:
f, ax = plt.subplots(figsize=(12, 9))
k = 15 # number of variables for heatmap
cols = corrmat.nlargest(k, 'ps_reg_01')['ps_reg_01'].index
# np.corrcoef returns NaN for every column that contains np.nan, so the
# pairwise-complete coefficients already stored in corrmat are reused instead
cm = corrmat.loc[cols, cols].values
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True,
                 fmt='.2f', annot_kws={'size': 10}, linewidth = 0.1, cmap = 'coolwarm',
                 yticklabels=cols.values, xticklabels=cols.values)
f.text(0.5, 0.93, "Correlation coefficients", ha='center', fontsize = 18)
plt.show()