<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Preprocessing" data-toc-modified-id="Data-Preprocessing-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Preprocessing</a></span><ul class="toc-item"><li><span><a href="#Loading-data" data-toc-modified-id="Loading-data-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Loading data</a></span></li><li><span><a href="#Basic-feature-analysis" data-toc-modified-id="Basic-feature-analysis-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Basic feature analysis</a></span><ul class="toc-item"><li><span><a href="#Distribution-Analysis" data-toc-modified-id="Distribution-Analysis-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Distribution Analysis</a></span></li><li><span><a href="#Linear-Relationship" data-toc-modified-id="Linear-Relationship-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Linear Relationship</a></span></li></ul></li><li><span><a href="#Feature-Engineering" data-toc-modified-id="Feature-Engineering-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Feature Engineering</a></span><ul class="toc-item"><li><span><a href="#Pattern-of-the-Target" data-toc-modified-id="Pattern-of-the-Target-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Pattern of the Target</a></span></li><li><span><a href="#Encode-Categorical-Data" data-toc-modified-id="Encode-Categorical-Data-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Encode Categorical Data</a></span></li><li><span><a href="#Numeric-Features-Transformation" data-toc-modified-id="Numeric-Features-Transformation-1.3.3"><span class="toc-item-num">1.3.3&nbsp;&nbsp;</span>Numeric Features Transformation</a></span></li></ul></li></ul></li><li><span><a href="#Modeling" data-toc-modified-id="Modeling-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Modeling</a></span><ul class="toc-item"><li><span><a href="#Baseline-Model---Linear-Regression" data-toc-modified-id="Baseline-Model---Linear-Regression-2.1"><span 
class="toc-item-num">2.1&nbsp;&nbsp;</span>Baseline Model - Linear Regression</a></span></li><li><span><a href="#Ridge-Regression" data-toc-modified-id="Ridge-Regression-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Ridge Regression</a></span></li><li><span><a href="#LASSO-Regression" data-toc-modified-id="LASSO-Regression-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>LASSO Regression</a></span></li><li><span><a href="#Xgboost" data-toc-modified-id="Xgboost-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Xgboost</a></span></li></ul></li></ul></div>

**Project Description:** 

In this project, I will build models to predict the severity of Allstate claims using Ridge regression, Lasso regression, and XGBoost.

# Data Preprocessing

## Loading data

In [None]:
# import and some default settings
import warnings
import itertools
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from scipy import sparse
from sklearn import metrics
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import LinearSVC
from scipy.stats import skew, boxcox
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
%matplotlib inline

In [None]:
# load our dataset 
dataset = pd.read_csv("/kaggle/input/allstate-claims-severity/train.csv")
model_index = len(dataset)
dataset.head()

The dataset has been anonymized for privacy, so we only know whether each feature is continuous or categorical. The submission dataset has no "loss" column; it holds the rows that need final predictions. I still combine the two datasets so that each preprocessing step only has to be applied once.

In [None]:
submission = pd.read_csv("/kaggle/input/allstate-claims-severity/test.csv")
full_dataset = pd.concat([dataset,submission]).reset_index(drop=True)

In [None]:
full_dataset.info()

Next, I check for missing values and, for the continuous variables only, negative values. The loss column has missing values because the submission rows have no loss, and there are no negative values in the dataset.

In [None]:
full_dataset.describe()
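
The check described above can also be made explicit rather than read off the `describe()` table. The following is a minimal sketch on a small hypothetical frame standing in for `full_dataset`; the helper name `check_missing_and_negative` and the toy values are assumptions introduced here for illustration.

```python
import numpy as np
import pandas as pd

def check_missing_and_negative(df, numeric_cols):
    """Return per-column counts of missing values and of negative values."""
    missing = df.isna().sum()
    negative = (df[numeric_cols] < 0).sum()
    return missing, negative

# toy stand-in for full_dataset: "loss" is missing for the submission rows
toy = pd.DataFrame({
    "cont1": [0.2, 0.5, 0.9, 0.1],
    "cont2": [0.7, 0.3, 0.4, 0.8],
    "loss": [1200.0, 850.5, np.nan, np.nan],
})
missing, negative = check_missing_and_negative(toy, ["cont1", "cont2"])
print(missing[missing > 0])   # columns that contain missing values
print(negative)               # negative-value counts per continuous column
```

On the real data, the same call would be made with `full_dataset` and `con_variables`.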

In [None]:
dataset.shape,submission.shape,full_dataset.shape

In [None]:
# split our dataset
Y = dataset["loss"]
X = dataset.drop(['id', 'loss'], axis= 1)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

In [None]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

In [None]:
# group features
cat_variables = []
con_variables = []
id_col = 'id'
target_col = 'loss'

for i in dataset.columns:
    if i[:2] == 'ca':
        cat_variables.append(i)
    if i[:2] == 'co':
        con_variables.append(i)

In [None]:
print("The continuous variables: ",con_variables)
print("The categorical variables: ",cat_variables)

## Basic feature analysis

### Distribution Analysis

Here, I want to check whether any distributions are skewed, and whether the train and test distributions are the same.

In [None]:
# check the distribution of each continuous column
sns.set_style('darkgrid')
# create the figure once, so every subplot lands on the same grid
fig = plt.figure(figsize = (15,25))
count = 1

for i in range(len(con_variables)):
    plt.subplot(len(con_variables),2,count)
    sns.violinplot(x=x_train[con_variables[i]],palette="hls")
    plt.title("Train")

    plt.subplot(len(con_variables),2,count+1)
    sns.violinplot(x=x_test[con_variables[i]],palette="Paired")
    plt.title("Test")
    count += 2
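
Besides the visual comparison, skewness and train/test agreement can be quantified. The sketch below is one possible approach, not part of the original analysis: it uses synthetic beta-distributed samples in place of a real continuous column, `scipy.stats.skew` for skewness, and the two-sample Kolmogorov-Smirnov test (`scipy.stats.ks_2samp`) to compare the two samples.

```python
import numpy as np
from scipy.stats import ks_2samp, skew

rng = np.random.default_rng(0)
# synthetic stand-ins for one continuous column in the train and test splits
train_col = rng.beta(2, 5, size=2000)
test_col = rng.beta(2, 5, size=500)

# positive skewness means a longer right tail
col_skew = skew(train_col)

# a small KS statistic (large p-value) gives no evidence that the
# two samples come from different distributions
stat, p_value = ks_2samp(train_col, test_col)

print(f"skewness: {col_skew:.3f}")
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.3f}")
```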

### Linear Relationship 

In [None]:
# plot the heatmap of correlation matrix
plt.figure(figsize=(15,12))
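# restrict to the continuous columns; the categorical columns are still strings here
sns.heatmap(x_train[con_variables].corr(),cmap='coolwarm',linecolor='white',linewidths=0.5,annot=True)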
plt.show()

In [None]:
# find highly correlated features
highcorr = []
# restrict to the continuous columns; the categorical columns are still strings here
corr = x_train[con_variables].corr()
threshold = 0.9

for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        # skip the diagonal and keep pairs whose absolute correlation exceeds the threshold
        if i != j and abs(corr.iloc[i,j]) > threshold:
            highcorr.extend([corr.index[i],corr.columns[j]])

In [None]:
highcorr = list(set(highcorr))
highcorr
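
The same search can be written without the double loop, and it can report which columns form each pair. This is a sketch on synthetic data; the helper name `high_corr_pairs` and the toy columns `a`, `b`, `c` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    """Return (col_a, col_b, corr) for each pair with |corr| above threshold."""
    corr = df.corr()
    # keep only the strict upper triangle so each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() > threshold]
    return [(a, b, float(v)) for (a, b), v in pairs.items()]

rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.05 * rng.normal(size=500),  # nearly a copy of "a"
    "c": rng.normal(size=500),             # independent of the others
})
pairs = high_corr_pairs(toy)
print(pairs)
```

Knowing the pairs makes it easier to decide which single column of each pair to drop.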

In [None]:
sns.set_style('darkgrid')
sns.pairplot(dataset[highcorr],plot_kws=dict(s=4, edgecolor="w", linewidth=.01),markers='o')

In [None]:
sns.jointplot(x="cont1",y="cont9",data=dataset,kind='hex')

In [None]:
sns.jointplot(x="cont12",y="cont11",data=dataset,kind='hex')

In [None]:
def dropColumn(dataset,drop_col,inp=False):
    # DataFrame.drop returns None when inplace=True, otherwise a new frame
    result = dataset.drop(drop_col,axis=1, inplace=inp)
    if not inp:
        return result

In [None]:
drop_col = ["cont11","cont1"]
dropColumn(full_dataset,drop_col,inp=True)

In [None]:
for i in drop_col:
    con_variables.remove(i)

## Feature Engineering

### Pattern of the Target

In [None]:
# visualize the distribution of loss, which is our target
# there are many outliers
sns.set_style('darkgrid')
plt.figure(figsize=(10,6))
sns.boxplot(y_train)

In [None]:
# it's a very skewed distribution
plt.figure(figsize=(10,6))
sns.distplot(y_train)

In [None]:
# after applying log(1+loss), the distribution is much closer to normal
plt.figure(figsize=(10,6))
sns.distplot(np.log1p(y_train))

### Encode Categorical Data

In [None]:
def catEncode(dataset):
    # convert the categorical features to integer codes (done in place on `dataset`)
    le = LabelEncoder()
    dataset[cat_variables] = dataset[cat_variables].apply(lambda col: le.fit_transform(col))
    # then one-hot encode into a sparse matrix, which uses far less memory than a dense one;
    # sparse output is the default, and the `sparse` keyword was renamed to `sparse_output`
    # in later scikit-learn releases, so it is omitted here
    OneHot = OneHotEncoder()
    return OneHot.fit_transform(dataset[cat_variables])

In [None]:
full_dataset_sparse = catEncode(full_dataset)
full_dataset_sparse.shape
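
To see what the encoding produces, here is a small sketch on a hypothetical two-column frame (the column values are made up). Each column contributes one output column per distinct category, so the width of the sparse matrix is the total number of categories across all columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({
    "cat1": ["A", "B", "A", "B"],       # 2 distinct categories
    "cat2": ["A", "B", "C", "A"],       # 3 distinct categories
})
encoder = OneHotEncoder()               # sparse output is the default
encoded = encoder.fit_transform(toy)

print(encoded.shape)                    # rows x (2 + 3) one-hot columns
print(encoded.toarray())                # dense view, only sensible for tiny examples
```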

###  Numeric Features Transformation

I will apply two preprocessing steps to the numeric features:

1. Apply Box-Cox transformations to skewed numeric features.

2. Standardize the numeric features to zero mean and unit variance.

Note that these steps are not necessary for tree-based models such as XGBoost. However, linear models may benefit from them.
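
The effect of the two steps can be seen on synthetic data. The sketch below assumes an exponentially distributed feature as a stand-in for a skewed column; the standardization is written out with NumPy and uses the population standard deviation, which is what `StandardScaler` does by default.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=5000)   # synthetic right-skewed feature

# Box-Cox requires strictly positive input, hence the +1 shift
x_bc, lam = boxcox(x + 1)

# standardize: subtract the mean, divide by the population standard deviation
x_std = (x_bc - x_bc.mean()) / x_bc.std()

print(f"skew before: {skew(x):.3f}, after Box-Cox: {skew(x_bc):.3f}, lambda: {lam:.3f}")
print(f"mean: {x_std.mean():.3f}, std: {x_std.std():.3f}")
```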

In [None]:
# calculate the skewness of each numeric feature
skewed_cols = full_dataset.loc[:,con_variables].apply(lambda x: skew(x.dropna()))
print(skewed_cols.sort_values())

In [None]:
plt.figure(figsize=(10,6))
sns.distplot(full_dataset["cont9"])

In [None]:
plt.figure(figsize=(10,6))
sns.distplot(full_dataset["cont8"])

In [None]:
# apply box-cox transformations
skewed_cols = skewed_cols[abs(skewed_cols) > 0.25].index.values
for skewed_col in skewed_cols:
    full_dataset[skewed_col],lam = boxcox(full_dataset[skewed_col] + 1)

In [None]:
skewed_cols = full_dataset.loc[:,con_variables].apply(lambda x: skew(x.dropna()))
print(skewed_cols.sort_values())

In [None]:
# apply standard scaling
SSL = StandardScaler()

for con_col in con_variables:
     full_dataset[con_col] = SSL.fit_transform(full_dataset[con_col].values.reshape(-1,1))

# Modeling

In [None]:
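# logregobj is a custom training objective for XGBoost (it returns the gradient and hessian);
# log_mae evaluates predictions by mean absolute error on the exponentiated scale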

def logregobj(labels, preds):
    con = 2
    x =preds-labels
    grad =con*x / (np.abs(x)+con)
    hess =con**2 / (np.abs(x)+con)**2
    return grad, hess 

def log_mae(y,yhat):
    return mean_absolute_error(np.exp(y), np.exp(yhat))

log_mae_scorer = metrics.make_scorer(log_mae, greater_is_better = False)
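
The gradient and hessian returned by `logregobj` are those of the Fair loss, $L(x) = c^2\left(\frac{|x|}{c} - \ln\left(1 + \frac{|x|}{c}\right)\right)$, with residual $x$ = `preds - labels` and $c$ = `con` = 2. The following sketch checks this numerically with finite differences; `fair_loss` and the sample residuals are introduced here only for the check.

```python
import numpy as np

def fair_loss(x, con=2.0):
    """Fair loss of the residual x = preds - labels."""
    a = np.abs(x)
    return con**2 * (a / con - np.log1p(a / con))

def logregobj(labels, preds):
    # same formulas as the objective defined in the notebook
    con = 2
    x = preds - labels
    grad = con * x / (np.abs(x) + con)
    hess = con**2 / (np.abs(x) + con)**2
    return grad, hess

labels = np.zeros(5)
preds = np.array([-3.0, -0.5, 0.25, 1.0, 4.0])
grad, hess = logregobj(labels, preds)

# central finite differences of the loss with respect to the residual
eps = 1e-4
x = preds - labels
num_grad = (fair_loss(x + eps) - fair_loss(x - eps)) / (2 * eps)
num_hess = (fair_loss(x + eps) - 2 * fair_loss(x) + fair_loss(x - eps)) / eps**2

print(np.allclose(grad, num_grad, atol=1e-6))
print(np.allclose(hess, num_hess, atol=1e-4))
```

The gradient is bounded by `con` in absolute value, so large residuals from outlying claims pull on the model less than they would under squared error.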

## Baseline Model - Linear Regression

In [None]:
Y = np.log(full_dataset[:model_index]["loss"]+200)
X = full_dataset[:model_index].drop(['id', 'loss'], axis= 1)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
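
The target is modeled as `log(loss + 200)`, so predictions are mapped back with `exp(pred) - 200`. The sketch below, using made-up loss values, illustrates the round trip and why `log_mae` can skip subtracting the shift: the constant 200 cancels inside the absolute difference.

```python
import numpy as np

shift = 200
loss = np.array([0.67, 1200.0, 3500.5, 121012.25])       # hypothetical true losses
pred_loss = np.array([50.0, 1000.0, 4000.0, 100000.0])   # hypothetical predictions

y = np.log(loss + shift)            # training target
y_hat = np.log(pred_loss + shift)   # a model's output on the transformed scale

# inverse transform used when writing the submission
recovered = np.exp(y_hat) - shift

mae_original = np.mean(np.abs(loss - pred_loss))
mae_shifted = np.mean(np.abs(np.exp(y) - np.exp(y_hat)))  # what log_mae computes

print(np.allclose(recovered, pred_loss))
print(np.isclose(mae_original, mae_shifted))
```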

In [None]:
linear_reg = linear_model.LinearRegression()
linear_reg.fit(x_train,y_train)

In [None]:
# plot the coefficient of each feature

fig,ax = plt.subplots(figsize=(15,10))
plt.xticks(rotation=45) 
tick_spacing = 3
ax.plot(x_train.columns,linear_reg.coef_,label='LR')
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
plt.title("Feature coefficient of Linear Regression Model")
plt.xlabel("Features")
plt.ylabel("Coefficient")
plt.legend()
plt.show()

In [None]:
y_pred = linear_reg.predict(x_test)
log_mae(y_test,y_pred)

## Ridge Regression

In [None]:
alpha = [1, 5, 10, 20, 30, 40, 50]

ridge = Ridge()
parameters = {'alpha': alpha}
ridge_regressor = GridSearchCV(ridge, parameters,scoring='neg_mean_squared_error', cv=5)
ridge_regressor.fit(x_train, y_train)

In [None]:
ridge_regressor.best_params_

In [None]:
y_pred = ridge_regressor.predict(x_test)
log_mae(y_test,y_pred)

In [None]:
rrg = linear_model.Ridge(alpha=40)
rrg.fit(x_train, y_train)

## LASSO Regression

In [None]:
larg = linear_model.Lasso(alpha=1e-7)
larg.fit(x_train, y_train)

In [None]:
y_pred = larg.predict(x_test)
log_mae(y_test,y_pred)

In [None]:
fig,ax = plt.subplots(figsize=(15,10))
plt.xticks(rotation=45) 
tick_spacing = 3
ax.plot(x_train.columns,linear_reg.coef_,c='r',label='LR')
ax.plot(x_train.columns,larg.coef_,c='g',label="Lasso")
ax.plot(x_train.columns,rrg.coef_,c='b',label="Ridge")
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
plt.title("Feature coefficient of Three Regression Model")
plt.xlabel("Features")
plt.ylabel("Coefficient")
plt.legend()
plt.show()

Therefore, based on performance, I will use ridge regression. Before moving on to the XGBoost model, I would like to submit my ridge regression result.

In [None]:
sub_x = full_dataset[model_index:]

In [None]:
# assign the result instead of dropping in place on a slice of full_dataset
sub_x = sub_x.drop(["loss","id"],axis=1)

In [None]:
final_predict = np.exp(ridge_regressor.predict(sub_x)) - 200

In [None]:
results1 = pd.DataFrame()
results1['id'] = full_dataset[model_index:].id
results1['loss'] = final_predict
results1.to_csv("sub.csv", index=False)
print("Submission created.")

The score is 1263.56702.

## Xgboost

In [None]:
full_data_sparse = sparse.hstack((full_dataset_sparse,full_dataset[con_variables]), format='csr')
print(full_data_sparse.shape)

# slice the combined matrix so that the continuous features are included as well
model_x = full_data_sparse[:model_index]
submission_x = full_data_sparse[model_index:]
model_y = np.log(full_dataset[:model_index].loss.values + 200)
ID = full_dataset.id[:model_index].values

In [None]:
def search_model(train_x, train_y, est, param_grid, n_jobs, cv, refit=False):
    # grid search for the best model
    # (the `iid` argument is omitted: recent scikit-learn releases no longer accept it)
    model = GridSearchCV(estimator=est,
                         param_grid=param_grid,
                         scoring=log_mae_scorer,
                         verbose=10,
                         n_jobs=n_jobs,
                         refit=refit,
                         cv=cv)
    # fit grid search model
    model.fit(train_x, train_y)
    print("Best score: %0.3f" % model.best_score_)
    print("Best parameters set:", model.best_params_)
    # `cv_results_` replaces the old `grid_scores_` attribute
    print("Scores:", model.cv_results_['mean_test_score'])
    return model

In [None]:
param_grid = {'objective':[logregobj],
              'learning_rate':[0.02, 0.04, 0.06, 0.08],
              'n_estimators':[1500],
              'max_depth': [9],
              'min_child_weight':[50],
              'subsample': [0.78],
              'colsample_bytree':[0.67],
              'gamma':[0.9],
              'nthread': [-1],
              'seed' : [1234]}

# the grid search is slow, so it is disabled by default; change False to True to run it
if False:
    model = search_model(model_x,
                         model_y,
                         xgb.XGBRegressor(),
                         param_grid,
                         n_jobs=1,
                         cv=4,
                         refit=True)

In [None]:
rgr = xgb.XGBRegressor(seed = 1234, 
                       learning_rate = 0.01, # smaller, better results, more time
                       n_estimators = 1500, # Number of boosted trees to fit.
                       max_depth=9, # the maximum depth of a tree
                       min_child_weight=50,
                       colsample_bytree=0.67, # the fraction of columns randomly sampled for each tree
                       subsample=0.78, # the fraction of observations randomly sampled for each tree
                       gamma=0.9, # Minimum loss reduction required to make a further partition on a leaf node of the tree, 
                       # the larger, the more conservative 
                       nthread = -1, # Number of parallel threads used to run xgboost.
                       silent = False # Whether to print messages while running boosting.
                      )
rgr.fit(model_x, model_y)

In [None]:
pred_y = np.exp(rgr.predict(submission_x)) - 200

In [None]:
plt.figure(figsize=(12,8))
plt.bar(range(len(rgr.feature_importances_)), rgr.feature_importances_,color='royalblue')
plt.ylim(0,0.1)
plt.show()

In [None]:
xgb.plot_importance(rgr,max_num_features=5,importance_type='weight')

In [None]:
np.argsort(rgr.feature_importances_)

In [None]:
results2 = pd.DataFrame()
results2['id'] = full_dataset[model_index:].id
results2['loss'] = pred_y
results2.to_csv("sub2.csv", index=False)
print("Submission created.")