In this notebook, let us explore the data provided for the Porto Seguro competition. Before diving into the data, let us learn a little more about the competition.  
**Porto Seguro**: The company offers car, residential, health, life, and business insurance, as well as consortium plans, pensions, savings bonds, and other financial services.  
**Objective**: Predict the probability that a driver will initiate an insurance claim in the next year.


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA, KernelPCA
import xgboost as xgb
from sklearn.metrics import confusion_matrix
from sklearn.manifold import TSNE

color = sns.color_palette()
%matplotlib inline
pd.options.mode.chained_assignment = None
pd.options.display.max_columns = 999

In [None]:
train_df = pd.read_csv("../input/train.csv")

In [None]:
print("train dataset shape: ", train_df.shape)
train_df.head()

In [None]:
train_df['ps_car_15'].min()

## Data Quality Checks
First, let us check whether the training dataset contains any missing values.

In [None]:
train_df.isnull().values.any()

The null check returns **False**, but that does not mean the data are complete. As described for this dataset: *"Values of -1 indicate that the feature was missing from the observation"*

So let us count how many -1 values each column contains.

In [None]:
missing_df = np.sum(train_df==-1, axis=0)
missing_df.sort_values(ascending=False, inplace=True)

plt.figure(figsize=(10, 20))
sns.barplot(x=missing_df.values, y=missing_df.index)
plt.title("Number of missing values in each column")
plt.xlabel("Count of missing values")
plt.show()

We can observe that 7 of the 59 columns actually contain missing values.
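
Because the data use -1 as a sentinel, pandas' own null checks only become useful after the sentinel is converted. The sketch below works on a small made-up frame (the values are illustrative, not taken from the competition data) and assumes -1 never occurs as a legitimate value. It swaps -1 for NaN and reports the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; the values are made up for illustration
toy = pd.DataFrame({
    "ps_ind_01": [2, 1, 5, 0],
    "ps_car_03_cat": [-1, -1, 1, -1],
    "ps_reg_03": [0.72, -1.0, 0.58, -1.0],
})

# Replace the -1 sentinel with NaN so that isnull() can see the gaps
toy_nan = toy.replace(-1, np.nan)

# Share of missing values per column, largest first
missing_share = toy_nan.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Working with shares rather than raw counts makes it easier to compare columns and to decide which ones are too sparse to keep.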

Let us check the distribution of the target variable.

In [None]:
sns.countplot(x="target", data=train_df)
plt.show()

The target variable is clearly imbalanced: only a very small share of policyholders filed a claim.
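
A count plot shows the imbalance visually; the exact share can be read off with `value_counts(normalize=True)`. The snippet below is a minimal sketch on made-up labels (the 1-in-10 ratio is an assumption for illustration, not the real ratio in `train_df`).

```python
import pandas as pd

# Toy labels standing in for train_df["target"]; the ratio is made up
toy_target = pd.Series([0] * 9 + [1], name="target")

# normalize=True returns fractions instead of raw counts
class_share = toy_target.value_counts(normalize=True)
print(class_share)

# Fraction of rows with label 1 (0.0 if the label never occurs)
positive_rate = class_share.get(1, 0.0)
print("Share of positive labels: {:.1%}".format(positive_rate))
```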

### Bin variable distribution

In [None]:
bin_vars = []
for col in train_df.columns:
    if col.endswith("bin"):
        bin_var = train_df.groupby(col).size()  
        bin_vars.append(bin_var)
        
bin_df = pd.concat(bin_vars, axis=0, keys=[s.index.name for s in bin_vars]).unstack()

_ = bin_df.plot(kind='bar', stacked=True, grid=False, figsize=(10, 8))


There are 4 features, **ps_ind_10_bin, ps_ind_11_bin, ps_ind_12_bin, ps_ind_13_bin**, which are almost always zero, so we should consider removing them from the training dataset.
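
The same conclusion can be reached numerically: for a 0/1 column, the mean is the share of ones. The sketch below flags binary columns whose share of ones falls under a cut-off. The toy values and the `min_share` threshold are assumptions chosen for illustration only.

```python
import pandas as pd

# Toy frame standing in for train_df; the values are made up
toy = pd.DataFrame({
    "ps_ind_06_bin": [1, 0, 1, 0, 0, 1, 0, 1, 0, 0],
    "ps_ind_10_bin": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "ps_ind_11_bin": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

bin_cols = [c for c in toy.columns if c.endswith("bin")]

# For a 0/1 column the mean equals the share of ones
ones_share = toy[bin_cols].mean()

# Hypothetical cut-off: columns with fewer ones than this are flagged
min_share = 0.2
rare_cols = ones_share[ones_share < min_share].index.tolist()
print(rare_cols)
```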

### Category variable distribution

In [None]:
cat_vars = []
for col in train_df.columns:
    if col.endswith("cat"):
        # Number of rows per category value
        cat_var = train_df.groupby(col).size()
        cat_vars.append(cat_var)

cat_df = pd.concat(cat_vars, axis=0, keys=[s.index.name for s in cat_vars]).unstack()

_ = cat_df.plot(kind='bar', stacked=True, grid=False, figsize=(10, 8), legend=False)

**ps_car_10_cat** is completely dominated by the value 1, while **ps_car_11_cat** has too many categories. Both variables should be removed.
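
Both problems, a dominant value and high cardinality, can be summarized in one small table. The sketch below computes the number of distinct categories and the share of the most frequent one for each `cat` column. The toy values are made up and only mimic the two patterns described above.

```python
import pandas as pd

# Toy frame standing in for train_df; the values are made up
toy = pd.DataFrame({
    "ps_car_10_cat": [1, 1, 1, 1, 1, 1, 1, 0],
    "ps_car_11_cat": [12, 55, 3, 104, 87, 64, 28, 41],
})

cat_cols = [c for c in toy.columns if c.endswith("cat")]

summary = pd.DataFrame({
    # Number of distinct values in each column
    "n_categories": toy[cat_cols].nunique(),
    # Share of rows taken by the most frequent value
    "top_share": toy[cat_cols].apply(
        lambda s: s.value_counts(normalize=True).iloc[0]
    ),
})
print(summary)
```

A `top_share` close to 1 points to a nearly constant column, while a large `n_categories` points to one that would explode under one-hot encoding.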

Let us check the correlation between these variables.

In [None]:
corr = train_df.corr()

plt.figure(figsize=(20,15))
sns.heatmap(corr)
plt.show()

We can observe that many columns have no linear correlation with the others, which means that each of these columns contains some independent information. So if we use PCA, these columns will be retained.  
The features **ps_calc_01, ps_calc_02, ..., ps_calc_14** and **ps_calc_15_bin, ps_calc_16_bin, ..., ps_calc_20_bin** show almost zero linear dependence with **target**.
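
Reading individual cells off a large heatmap is error-prone, so it can help to pull out the `target` column of the correlation matrix and sort it. The sketch below does this on synthetic data (the `informative` and `noise` columns are hypothetical names). Keep in mind that Pearson correlation only measures linear dependence, so a value near zero does not prove a feature is useless to a tree-based model.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n = 1000

# Synthetic data: one feature drives the label, the other is pure noise
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "target": (signal + rng.normal(scale=0.5, size=n) > 1.0).astype(int),
    "informative": signal,
    "noise": rng.normal(size=n),
})

# Correlation of every feature with the target, excluding the target itself
target_corr = toy.corr()["target"].drop("target")

# Sort by absolute value so the weakest linear relationships come first
ranked = target_corr.abs().sort_values()
print(ranked)
```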

### Dimension reduction

Let us plot how the data are distributed on a 2D plane using **PCA**.

In [None]:
ignore_columns = ['target', 'id', 'ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin', 'ps_ind_13_bin'] + ['ps_calc_{:02d}'.format(i) for i in range(1, 15)] + ['ps_calc_{:02d}_bin'.format(i) for i in range(15, 21)]
train_columns = [col for col in train_df.columns if col not in ignore_columns]

In [None]:

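X = train_df[train_columns].values
target = train_df.target
print("Training data shape: ", X.shape)

pca = PCA(n_components=2)
reduced_dim = pca.fit_transform(X)

# Draw one set of random row positions and use it for both the points and
# their labels, so that every projected point keeps its own target value
sample_idx = np.random.randint(0, len(reduced_dim), size=10000)
reduced_df = pd.DataFrame(data=reduced_dim[sample_idx], columns=['x', 'y'])
reduced_df['target'] = target.values[sample_idx]

# jointplot creates its own figure; `height` replaces the older `size` argument
sns.jointplot(x='x', y='y', data=reduced_df, height=7, color="g")
plt.show()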

We can observe that many points have x in the range **[-41.99, -31.573]** and overlap heavily.
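
PCA looks for the directions of largest variance, so a feature measured on a much larger scale than the others can dominate the projection. The sketch below illustrates this on a synthetic matrix and shows one possible remedy, standardizing the columns first with scikit-learn's `StandardScaler`. This is an optional step shown under that assumption, not something the analysis above performs.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic matrix: the second column has a far larger scale than the others
X_toy = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 1.0])

# PCA on the raw values versus PCA after standardizing every column
raw_pca = PCA(n_components=2).fit(X_toy)
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X_toy)

print("unscaled:", raw_pca.explained_variance_ratio_)
print("scaled:  ", scaled_pca.named_steps["pca"].explained_variance_ratio_)
```

Without scaling, the first component is almost entirely the large-scale column; after scaling, the variance is spread far more evenly across components.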

In [None]:
# lmplot creates its own figure; `height` replaces the older `size` argument
sns.lmplot(x='x', y='y', hue='target', data=reduced_df, height=7, fit_reg=False)
plt.show()

Define the normalized Gini score, which is used below as a custom evaluation metric for XGBoost

In [None]:
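def gini(actual, pred, cmpcol=0, sortcol=1):
    # cmpcol and sortcol are unused; they are kept so existing calls keep working
    assert len(actual) == len(pred)
    # Columns: actual value, prediction, original position (used as tie-breaker).
    # The np.float alias was removed from NumPy; the builtin float is equivalent.
    data = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
    # Sort by prediction, descending; ties keep their original order
    data = data[np.lexsort((data[:, 2], -1 * data[:, 1]))]
    total_losses = data[:, 0].sum()
    gini_sum = data[:, 0].cumsum().sum() / total_losses

    gini_sum -= (len(actual) + 1) / 2.
    return gini_sum / len(actual)

def gini_normalized(a, p):
    # Divide by the Gini of a perfect ranking so that the best possible score is 1
    return gini(a, p) / gini(a, a)

def gini_xgb(preds, dtrain):
    # Custom evaluation function returning a (name, value) pair for xgb.train
    labels = dtrain.get_label()
    gini_score = gini_normalized(labels, preds)
    return 'gini', gini_score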
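
For binary labels and predictions without ties, the normalized Gini score is related to the area under the ROC curve by Gini = 2 * AUC - 1. The sketch below checks this on four made-up points. `gini_sorted` is a compact restatement of the cumulative-sum formula used in `gini()`, and `auc_pairwise` is a hypothetical helper that computes AUC by brute force; both are written for illustration only.

```python
import numpy as np

def gini_sorted(actual, pred):
    """Cumulative-sum Gini, following the same formula as gini()."""
    actual = np.asarray(actual, dtype=float)
    # Stable sort by prediction, descending, so ties keep their original order
    order = np.argsort(-np.asarray(pred, dtype=float), kind="mergesort")
    cum_share = actual[order].cumsum().sum() / actual.sum()
    return (cum_share - (len(actual) + 1) / 2.0) / len(actual)

def auc_pairwise(actual, pred):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    actual = np.asarray(actual)
    pred = np.asarray(pred, dtype=float)
    pos, neg = pred[actual == 1], pred[actual == 0]
    # Compare every positive score with every negative score (no ties assumed)
    return (pos[:, None] > neg[None, :]).mean()

# Made-up labels and scores, chosen so that one pair is ranked incorrectly
y_true = [1, 0, 1, 0]
y_score = [0.9, 0.8, 0.3, 0.1]

gini_norm = gini_sorted(y_true, y_score) / gini_sorted(y_true, y_true)
auc = auc_pairwise(y_true, y_score)
print(gini_norm, 2 * auc - 1)
```

One practical consequence is that, under the no-ties assumption, a model that improves AUC also improves normalized Gini, so XGBoost's built-in `auc` metric can serve as a proxy when tracking progress.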

Build the first model using **XGBoost**

In [None]:


X = train_df[train_columns]
y = train_df.target
x_train = X[:-100000]
y_train = y[:-100000]
x_val = X[-100000:]
y_val = y[-100000:]

dtrain = xgb.DMatrix(x_train, y_train)
dval = xgb.DMatrix(x_val, y_val)
watchlist = [(dtrain, 'train'), (dval, 'valid')]

xgb_params = {
        'eta': 0.037,
        'max_depth': 5,
        'subsample': 0.80,
        'objective': 'binary:logistic',
        'eval_metric': 'mae',
        'lambda': 0.8,   
        'alpha': 0.4, 
        'base_score': 0.0364,
        'silent': 1
    }

num_boost_rounds = 250
# xgb_params already sets 'silent', so it is passed as-is
model = xgb.train(
    xgb_params,
    dtrain,
    evals=watchlist,
    feval=gini_xgb,
    num_boost_round=num_boost_rounds,
    verbose_eval=20,
)


In [None]:
fig, ax = plt.subplots(figsize=(12,18))
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()

Let us find where the predictions are wrong

In [None]:
predict = model.predict(dval)
idx = np.abs(y_val - predict).nlargest(1400).index.values
y_val[idx].value_counts()
#predict1 = predict > 0.1
#confusion_matrix(y_val, predict1)

All samples with label 1 have a very large residual error. This implies that we should focus more on them.
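
One way to make a booster pay more attention to the rare class without discarding data is XGBoost's `scale_pos_weight` parameter, for which the ratio of negative to positive samples is a common starting value. The sketch below only computes that ratio and builds a parameter dictionary. The toy labels are made up, the other parameter values are placeholders, and weighting the classes this way changes the scale of the predicted probabilities, so treat it as an option to test rather than a recommendation.

```python
import numpy as np

# Toy labels standing in for y_train; the class ratio is made up
y_toy = np.array([0] * 95 + [1] * 5)

n_pos = int((y_toy == 1).sum())
n_neg = int((y_toy == 0).sum())

# Ratio of negatives to positives, a common starting value for scale_pos_weight
scale_pos_weight = n_neg / n_pos

# Placeholder parameter dictionary; only scale_pos_weight is the point here
weighted_params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "scale_pos_weight": scale_pos_weight,
}
print(weighted_params)
```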

Let us try to balance the training set by keeping fewer negative samples

In [None]:
#cat_columns = [col for col in train_df.columns if col.endswith('cat') and (col!='ps_car_11_cat')]
#train_df = pd.get_dummies(train_df, columns=cat_columns, prefix=cat_columns)

ignore_columns = ['target', 'id', 'ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin', 'ps_ind_13_bin'] + ['ps_calc_{:02d}'.format(i) for i in range(1, 15)] + ['ps_calc_{:02d}_bin'.format(i) for i in range(15, 21)]
train_columns = [col for col in train_df.columns if col not in ignore_columns]
#X = train_df[train_columns]
#y = train_df.target

#x_train = X[:-100000]
#y_train = y[:-100000]
positive = train_df[train_df.target==1].head(20000)
negative = train_df[train_df.target==0].head(50000)
train_df = pd.concat([positive, negative], axis=0)
# Shuffle the rows so that the validation slice below contains both classes
train_df = train_df.sample(frac=1.0)
X = train_df[train_columns]
y = train_df.target

print(positive.shape)

x_train = X[:-5000]
y_train = y[:-5000]
x_val = X[-5000:]
y_val = y[-5000:]

dtrain = xgb.DMatrix(x_train, y_train)
dval = xgb.DMatrix(x_val, y_val)
watchlist = [(dtrain, 'train'), (dval, 'valid')]

xgb_params = {
        'eta': 0.037,
        'max_depth': 5,
        'subsample': 0.80,
        'objective': 'reg:logistic',
        'eval_metric': 'auc',
        'lambda': 0.8,   
        'alpha': 0.4, 
        'base_score': 0.01,
        'silent': 1
    }

num_boost_rounds = 250
# xgb_params already sets 'silent', so it is passed as-is
model = xgb.train(
    xgb_params,
    dtrain,
    evals=watchlist,
    feval=gini_xgb,
    num_boost_round=num_boost_rounds,
    verbose_eval=20,
)


In [None]:
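predict = model.predict(dval)
idx = np.abs(y_val - predict).nlargest(1400).index.values
# print() so the counts are shown even though this is not the cell's last expression
print(y_val[idx].value_counts())
# Note: with a threshold of 0. practically every row is flagged as positive,
# because predicted probabilities lie above zero; a higher cut-off is needed
# for an informative confusion matrix
predict1 = predict > 0.
confusion_matrix(y_val, predict1)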

We can observe that, even after balancing the dataset, the wrong predictions remain much the same as before. This hints that the features are not good enough to separate the two labels.

In [None]:
ignore_columns = ['target', 'id', 'ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin', 'ps_ind_13_bin'] + ['ps_calc_{:02d}'.format(i) for i in range(1, 15)] + ['ps_calc_{:02d}_bin'.format(i) for i in range(15, 21)] #+ [col for col in train_df.columns if col.endswith('cat')]
cat_columns = [col for col in train_df.columns if col.endswith('cat')]
train_df = pd.get_dummies(train_df, columns=cat_columns, prefix=cat_columns)

train_columns = [col for col in train_df.columns if col not in ignore_columns]

#for col in train_df.columns:
#    if col.endswith('cat'):
#        count = train_df[col].value_counts()
#        train_df[col] = train_df.replace({col:count})


#log_columns = ['ps_car_12','ps_car_13','ps_car_14','ps_car_15','ps_calc_01','ps_calc_02','ps_calc_03']
#log_columns = [col for col in train_df.columns if 'reg' in col]    
#for col in log_columns:
#    train_df[col] = np.square(train_df[col] +0.00001)
    
X = train_df[train_columns].values[:10000]
target = train_df.target[:10000]
print("Training data shape: ", X.shape)

#pca = KernelPCA(n_components=2, kernel='linear')
#reduced_dim = pca.fit_transform(X)
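reduced_dim = TSNE(n_components=2).fit_transform(X)

# Use one set of random row positions for both the points and their labels.
# Labels are taken by position (.values) because train_df was shuffled, so
# assigning the Series directly would align on its shuffled index instead.
sample_idx = np.random.randint(0, len(reduced_dim), size=10000)
reduced_df = pd.DataFrame(data=reduced_dim[sample_idx], columns=['x', 'y'])
reduced_df['target'] = target.values[sample_idx]

# jointplot and lmplot create their own figures; `height` replaces the older `size`
sns.jointplot(x='x', y='y', data=reduced_df, height=7, color="g")
plt.show()

sns.lmplot(x='x', y='y', hue='target', data=reduced_df, height=7, fit_reg=False)
plt.show()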