## Problem Statement - Task B

*Predict the loss amount for the insurance policies using the historical trends and features.*

I tried to replicate the problem statement and found a [data set](https://www.kaggle.com/c/allstate-claims-severity/data) on Kaggle to present my analysis, which is about measuring the severity of an insurance claim made by a client. As I am new to the industry, I thought it would be a good idea to find a data set rather than create one of my own. I look forward to working on data generation and using it to solve problems in the near future.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score as AUC
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score 
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
from sklearn.model_selection import cross_val_score

from scipy import stats
import seaborn as sns
from copy import deepcopy

#model fitting
import xgboost as xgb
import pickle
import sys
from sklearn.metrics import make_scorer
from scipy.sparse import csr_matrix, hstack
from sklearn.model_selection import KFold, train_test_split
from xgboost import XGBRegressor

%matplotlib inline

## 1. Data Exploration

In this part, we do a short data exploration to see what dataset we have and whether we can find any patterns in it.

Also, we'll discover that some data transformations are needed to prepare the dataset for modelling. A very important part of the data discovery phase is to make sure that the train and test sets are drawn from the same distribution. **This is a required step if we want to use cross-validation on the training set and be sure that our models generalize the same way on the test set.**

### Load data

In [None]:
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

*This data set is loaded from within a Kaggle kernel. Please change the file paths accordingly for reproducibility.*

### Aggregated and basic statistics

First, let's get accustomed to the dataset and get some basic information about it:

In [None]:
train.shape

188K training examples, 132 columns — this does not look like a massive dataset. We can definitely train most of our models on a local machine.

In [None]:
print('First 20 columns:', list(train.columns[:20]))

In [None]:
print('Last 20 columns:', list(train.columns[-20:]))

We see that there are probably 116 categorical columns (as their names suggest) and 14 continuous (numerical) columns. There are also the `id` and `loss` columns. This sums up to 132 columns in total.

Next, let's see a quick statistic summary of our continuous features:

In [None]:
train.describe()

As we see, all continuous features have been scaled to the `[0,1]` interval and have means around 0.5. This is the result of anonymization and some sort of data preprocessing that was done by Allstate.

On the other hand, `loss` values are not scaled (though they might have been preprocessed as well).

### Summary of target variable

In [None]:
train['loss'].describe()

### Testing for missing values

We should always dedicate part of our research to dealing with missing values. Pandas provides an easy way to detect them:

In [None]:
pd.isnull(train).values.any()

There are **no missing values at all.** This is a great relief and this definitely allows us to focus on algorithms and not on data cleansing.
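
When a dataset does contain gaps, a per-column count is usually more informative than a single boolean. The sketch below shows one way to get it; the `toy` frame is a hypothetical stand-in for `train`, with one value deliberately left missing.

```python
import pandas as pd

# Hypothetical toy frame standing in for `train`; one value is deliberately missing.
toy = pd.DataFrame({
    "cat1": ["A", "B", None, "A"],
    "cont1": [0.1, 0.5, 0.9, 0.3],
})

# Count missing entries per column and keep only the columns that have any.
missing_per_column = toy.isnull().sum()
missing_per_column = missing_per_column[missing_per_column > 0]
print(missing_per_column)
```

On the real data, the same two lines applied to `train` would be expected to return an empty result, given the check above.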

### Continuous vs Categorical features

Another way to see the division into categorical and continuous features is to run the `pd.DataFrame.info` method:

In [None]:
train.info()

In here, `float64(15)` covers the 14 continuous features plus `loss`, `int64(1)` is probably `id`, and `object(116)` are the categorical features. We can confirm this:

In [None]:
cat_features = list(train.select_dtypes(include=['object']).columns)
print ("Categorical features:",len(cat_features))

In [None]:
cont_features = [cont for cont in list(train.select_dtypes(
                 include=['float64', 'int64']).columns) if cont not in ['loss', 'id']]
print ("Continuous features:", len(cont_features))
print(cont_features)

In [None]:
id_col = list(train.select_dtypes(include=['int64']).columns)
print ("A column of int64: ", id_col)

### Unique values in categorical features

In [None]:
cat_uniques = []
for cat in cat_features:
    cat_uniques.append(len(train[cat].unique()))
    
# DataFrame.from_items was removed in pandas 1.0; a dict keeps the same column order
uniq_values_in_categories = pd.DataFrame({'cat_name': cat_features, 'unique_values': cat_uniques})

In [None]:
uniq_values_in_categories.head()

Let's plot a histogram of the number of unique categories per feature.

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2)
fig.set_size_inches(16,5)
ax1.hist(uniq_values_in_categories.unique_values, bins=50)
ax1.set_title('Amount of categorical features with X distinct values')
ax1.set_xlabel('Distinct values in a feature')
ax1.set_ylabel('Features')
ax1.annotate('A feature with 326 vals', xy=(322, 2), xytext=(200, 38), arrowprops=dict(facecolor='black'))

ax2.set_xlim(2,30)
ax2.set_title('Zooming in the [0,30] part of left histogram')
ax2.set_xlabel('Distinct values in a feature')
ax2.set_ylabel('Features')
ax2.grid(True)
ax2.hist(uniq_values_in_categories[uniq_values_in_categories.unique_values <= 30].unique_values, bins=30)
ax2.annotate('Binary features', xy=(3, 71), xytext=(7, 71), arrowprops=dict(facecolor='black'))

In [None]:
# Another option is to use Series.value_counts() method, but its
# output is not that nice

uniq_values = uniq_values_in_categories.groupby('unique_values').count()
uniq_values = uniq_values.rename(columns={'cat_name': 'categories'})
uniq_values.sort_values(by='categories', inplace=True, ascending=False)
uniq_values.reset_index(inplace=True)
print (uniq_values)

As we see, most of the categorical features (72 / 116) are binary, and the vast majority (88 / 116) have up to four values. There is one feature with 326 values.
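
The same tally can be obtained without the intermediate frame by chaining `nunique` and `value_counts`. This is only a sketch on a hypothetical three-column `toy` frame, not the notebook's data.

```python
import pandas as pd

# Hypothetical toy frame: two binary features and one with three levels.
toy = pd.DataFrame({
    "cat1": ["A", "B", "A", "B"],
    "cat2": ["A", "A", "B", "B"],
    "cat3": ["A", "B", "C", "A"],
})

# Distinct values per feature, then how many features share each cardinality.
distinct_per_feature = toy.nunique()
features_per_cardinality = distinct_per_feature.value_counts().sort_index()
print(features_per_cardinality)
```

Applied to `train[cat_features]`, the index would be the number of distinct values and the values would be the number of features with that cardinality.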

### Target Feature ('loss') Plots

First, we just plot the target:

In [None]:
plt.figure(figsize=(16,8))
plt.plot(train['id'], train['loss'])
plt.title('Loss values per id')
plt.xlabel('id')
plt.ylabel('loss')
plt.show()

There are several distinctive peaks in the loss values, representing severe accidents. These peaks make the target distribution highly skewed, which can result in suboptimal performance of the regressor.

Basically, skewness measures the asymmetry of the probability distribution of a real-valued random variable about its mean. We can measure the skewness of `loss` with `stats.mstats.skew`:

In [None]:
stats.mstats.skew(train['loss']).data

Yes, the data is skewed. Why? [Because the skewness is much greater than 1](https://help.gooddata.com/doc/en/reporting-and-dashboards/maql-analytical-query-language/maql-expression-reference/aggregation-functions/statistical-functions/predictive-statistical-use-cases/normality-testing-skewness-and-kurtosis).

When we apply `np.log` to this vector, we get better results:

In [None]:
stats.mstats.skew(np.log(train['loss'])).data

To make sure our analysis is correct, we plot a histogram of the loss next to a histogram of the log-transformed loss. The log transform makes patterns easier to interpret, and it does not change the information in the data: it is invertible, so predictions can be back-transformed to the original scale with `np.exp`. Without that back-transformation, the final results would make little sense.

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2)
fig.set_size_inches(16,5)
ax1.hist(train['loss'], bins=50)
ax1.set_title('Train Loss target histogram')
ax1.grid(True)
ax2.hist(np.log(train['loss']), bins=50, color='g')
ax2.set_title('Train Log Loss target histogram')
ax2.grid(True)
plt.show()
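
The effect of the log transform, and the fact that it can be undone, can be checked numerically. The sketch below uses a synthetic log-normal sample as a hypothetical stand-in for `loss`; the parameters are arbitrary and only chosen to produce a right-skewed target.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical right-skewed target standing in for `loss` (log-normal by construction).
loss = rng.lognormal(mean=7.0, sigma=0.8, size=10_000)

log_loss = np.log(loss)       # forward transform used for training
recovered = np.exp(log_loss)  # back-transform used when reporting predictions

print("skewness of raw target:", stats.skew(loss))
print("skewness of log target:", stats.skew(log_loss))
print("round trip recovers the data:", np.allclose(recovered, loss))
```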

### Continuous features

One thing we can do is plot histograms of the numerical features and analyze their distributions:

In [None]:
train[cont_features].hist(bins=50, figsize=(16,12))

We see plots with many spikes which don't follow any reasonable PDF (probability density function). **Such plots suggest that the data might have been converted from categorical to continuous so that it could be represented as real numbers.** I can't confirm this, as all the data was pre-processed by [Allstate Corporation](https://www.allstate.com/).

I'd still transform these features to make their distributions more Gaussian (a.k.a. normal), but I don't expect it to dramatically improve the model's performance.
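
One possible way to do such a transformation is a Box-Cox transform, sketched below. This is an assumption on my side rather than something applied in this notebook: it requires strictly positive values, and the beta-distributed `raw` sample is a hypothetical stand-in for a skewed `[0,1]` feature.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical skewed feature in (0, 1); Box-Cox needs strictly positive inputs.
raw = rng.beta(2.0, 8.0, size=5000)

# With no lambda given, scipy fits it by maximum likelihood and returns it.
transformed, fitted_lambda = stats.boxcox(raw)

print("fitted lambda:", fitted_lambda)
print("skewness before:", stats.skew(raw))
print("skewness after:", stats.skew(transformed))
```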

### Feature Correlation (Pearson)

We can visualize [correlations](https://stackoverflow.com/questions/42128462/in-python-how-to-do-correlation-between-multiple-columns-more-than-2-variables?rq=1) among all numerical features. We use an out-of-the-box solution (`DataFrame.corr`), which computes the Pearson coefficient by default.

In [None]:
plt.subplots(figsize=(16,9))
correlation_mat = train[cont_features].corr()
sns.heatmap(correlation_mat, annot=True)

We see a high correlation among several features. This may be a result of [data-based multicollinearity](https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/), where two or more predictors are highly correlated. It causes many [problems](https://onlinecourses.science.psu.edu/stat501/node/346), so we should be very careful when [implementing](https://github.com/akshayreddykotha/regression-analysis-in-excel#evaluation-of-fit-model) linear regression models on the current dataset.
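
Reading pairs off a heatmap gets tedious with many features, so it can help to list them directly. The helper `high_corr_pairs` and its `threshold` of 0.8 are hypothetical choices for illustration, and the `toy` frame is synthetic.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, r) for pairs with |Pearson r| >= threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle only, so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    # Strongest correlations first.
    return sorted(pairs, key=lambda p: -abs(p[2]))

rng = np.random.default_rng(0)
base = rng.random(1000)
toy = pd.DataFrame({
    "cont_a": base,
    "cont_b": base + rng.normal(scale=0.05, size=1000),  # near-duplicate of cont_a
    "cont_c": rng.random(1000),                          # independent of the others
})

pairs = high_corr_pairs(toy, threshold=0.8)
print(pairs)
```

On the notebook's data this would be called as `high_corr_pairs(train[cont_features])`.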

### Comparing train and test set distributions

In order to get reliable predictions on the test set, we need to make sure that the test data is distributed the same way as the training data. If we confirm that, scores obtained by cross-validating on the training set can be expected to carry over to the test set.

The idea is to mix the training and test sets and see if a classifier (a logistic regression or a random forest) can separate one from the other.

In [None]:
# Simple data preparation

train_d = train.drop(['id','loss'], axis=1)
test_d = test.drop(['id'], axis=1)

# To make sure we can distinguish between two classes
train_d['Target'] = 1
test_d['Target'] = 0

# We concatenate train and test in one big dataset
data = pd.concat((train_d, test_d))

# We use label encoding for categorical features.
# deepcopy creates an independent copy, so `data` itself is not modified
data_le = deepcopy(data)

# data label encoding
for c in cat_features:
    data_le[c] = data_le[c].astype('category').cat.codes

# We use one-hot encoding for categorical features:
data = pd.get_dummies(data=data, columns=cat_features)

#### Recreating training and test sets:

In [None]:
# randomize before splitting them up into train and test sets
data = data.iloc[np.random.permutation(len(data))]
data.reset_index(drop = True, inplace = True)

x = data.drop(['Target'], axis = 1)
y = data.Target

train_examples = 100000

x_train = x[:train_examples]
x_test = x[train_examples:]
y_train = y[:train_examples]
y_test = y[train_examples:]

In [None]:
x.head()
y.head()

Now we train two classifiers: 1) logistic regression and 2) random forest — and use them to predict our test:

In [None]:
# Logistic Regression:
clf = LogisticRegression()
clf.fit(x_train, y_train)
pred = clf.predict_proba(x_test)[:,1]
auc = AUC(y_test, pred)
print("Logistic Regression AUC: ",auc)

# Random Forest, a simple model (100 trees) trained in parallel
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(x_train, y_train)
pred = clf.predict_proba(x_test)[:,1]
auc = AUC(y_test, pred)
print ("Random Forest AUC: ",auc)

# Finally, CV our results (a very simple 2-fold CV):
scores = cross_val_score(LogisticRegression(), x, y, scoring='roc_auc', cv=2) 
print("Mean AUC: {:.2%}, std: {:.2%}\n".format(scores.mean(), scores.std()))

As we see from the results above, the [ROC AUC score](https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5) is close to 0.5 (50%). We may conclude that neither LR nor RF could find a difference between the training and test sets, which suggests that they are drawn from the same distribution. This means we can validate on data held out from the training set; a simple way to do so is to use `train_test_split` from `sklearn.model_selection`.
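
To see what the check looks like when the two sets do differ, here is a small self-contained sketch on synthetic data. The helper `adversarial_auc` is a hypothetical name, and the Gaussian samples are stand-ins for train and test features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def adversarial_auc(a, b, cv=3):
    """Cross-validated ROC AUC of a classifier telling rows of `a` from rows of `b`."""
    x = np.vstack([a, b])
    y = np.concatenate([np.ones(len(a)), np.zeros(len(b))])
    scores = cross_val_score(LogisticRegression(max_iter=1000), x, y,
                             scoring="roc_auc", cv=cv)
    return scores.mean()

rng = np.random.default_rng(0)
reference = rng.normal(size=(2000, 5))
same_dist = rng.normal(size=(2000, 5))           # same distribution as reference
shifted = rng.normal(loc=1.5, size=(2000, 5))    # every feature shifted by 1.5

auc_same = adversarial_auc(reference, same_dist)
auc_shifted = adversarial_auc(reference, shifted)
print("AUC, same distribution:", auc_same)
print("AUC, shifted distribution:", auc_shifted)
```

An AUC near 0.5 means the classifier cannot tell the sets apart, while a value near 1.0 signals a clear distribution shift.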

## Training a model for the ultimate goal of loss amount prediction

In [None]:
# A possibility to use pretrained models to limit the computational time.
USE_PRETRAINED = True

To start with, we train a baseline XGBoost model just to understand how well the whole training goes. To make sure a PC can handle this, we limit ourselves to 3-fold CV for now. Training is stopped early if the error does not decrease for 10 consecutive rounds.

**k-fold CV**: *It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.*
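
As a minimal illustration of the mechanics, the sketch below runs 3-fold CV by hand on synthetic data. `LinearRegression` is used only as a lightweight stand-in for the XGBoost model trained later, and the data and coefficients are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Synthetic regression problem: a linear signal plus a little noise.
X = rng.random((300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=300)

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
fold_mae = []
for train_idx, valid_idx in kfold.split(X):
    # Fit on two folds, score on the held-out fold.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_mae.append(mean_absolute_error(y[valid_idx], model.predict(X[valid_idx])))

print("MAE per fold:", fold_mae)
print("Mean MAE:", np.mean(fold_mae))
```

Each row is used for validation exactly once, which is why the averaged score tends to be less optimistic than a single train/test split.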

#### Why XGBoost?

I am choosing XGBoost because it is one of the most popular algorithms in applied machine learning, especially for tabular data. It also suits my scenario here: limited training data (I feel it is limited as 4GB of RAM could handle it :P), little training time, and little expertise in parameter tuning (this is basically my first implementation of an XGBoost regressor).

#### What does the log-transformed data look like (once again)?

In [None]:
train['log_loss'] = np.log(train['loss'])

In [None]:
train['log_loss'].sample(5)

In [None]:
features = [x for x in train.columns if x not in ['id','loss', 'log_loss']]

cat_features = [x for x in train.select_dtypes(
        include=['object']).columns if x not in ['id','loss', 'log_loss']]
num_features = [x for x in train.select_dtypes(
        exclude=['object']).columns if x not in ['id','loss','log_loss']]

print ("Categorical features:", len(cat_features))
print ("Numerical features:", len(num_features))

Label encoding for categorical variables among all the features:

In [None]:
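ntrain = train.shape[0]

# .copy() avoids pandas' SettingWithCopyWarning when the columns are re-encoded
train_x = train[features].copy()
train_y = train['log_loss'] #target variable
test_x = test[features].copy()

# Build the categories from train and test together, so that the same
# level maps to the same integer code in both sets
for c in cat_features:
    categories = pd.concat([train_x[c], test_x[c]]).astype('category').cat.categories
    train_x[c] = pd.Categorical(train_x[c], categories=categories).codes
    test_x[c] = pd.Categorical(test_x[c], categories=categories).codes
print ("Xtrain:", train_x.shape)
print ("ytrain:", train_y.shape)
print("Xtest:", test_x.shape)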

#### Important note:

It is important to keep in mind that the feature matrix and target were created again (`train_x` and `train_y`), this time for the final predictive model. The main difference from the earlier split is the target variable. Earlier, when RF and LR were fit, it was a classification problem (to check whether the distributions are alike) and the target was 1 or 0, i.e. whether a data point came from the train or the test set. Now the target variable is the log-transform of the `loss` variable.

In [None]:
train_x.head()

In [None]:
test_x.head()

In [None]:
pred_model = xgb.XGBRegressor() #As target variable is a continuous variable
pred_model.fit(train_x, train_y)
y_pred = pred_model.predict(test_x)
predictions = [value for value in y_pred]

#loss-prediction log-transformed
# print(predictions)

# evaluate predictions
# accuracy = accuracy_score(y_test, predictions)
# print("Accuracy: %.2f%%" % (accuracy * 100.0))

### Actual loss value prediction

In [None]:
loss_value = np.exp(predictions)
print(loss_value[:5])

In [None]:
train['loss'].sample(5)

### Model evaluation:

We use a custom evaluation function `xg_eval_mae` which calculates MAE (Mean absolute error), but works with our log-transformed data and uses XGBoost's DMatrix (native format to apply XGBoost):

In [None]:
def xg_eval_mae(yhat, dtrain):
    y = dtrain.get_label()
    return 'mae', mean_absolute_error(np.exp(y), np.exp(yhat))
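
To make the back-transformation inside the metric concrete, here is a small worked example in plain NumPy. The helper `mae_original_scale` and the three loss values are hypothetical; the point is that the error is averaged after `np.exp`, so it is expressed in the original units of `loss`.

```python
import numpy as np

def mae_original_scale(y_log_true, y_log_pred):
    """MAE computed after mapping log-scale values back to the original scale."""
    return float(np.mean(np.abs(np.exp(y_log_true) - np.exp(y_log_pred))))

# Hypothetical true and predicted losses in original units.
true_loss = np.array([1000.0, 2000.0, 4000.0])
pred_loss = np.array([1100.0, 1800.0, 4300.0])

# The model works with logs, so the metric receives log-scale inputs.
mae = mae_original_scale(np.log(true_loss), np.log(pred_loss))
print(mae)  # about 200: the absolute errors are 100, 200 and 300
```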

In [None]:
dtrain = xgb.DMatrix(train_x, train['log_loss'])

In [None]:
#We use some average set of parameters to make XGBoost work:
xgb_params = {
    'seed': 0,
    'eta': 0.1,
    'colsample_bytree': 0.5,
    'verbosity': 0,  # replaces the older 'silent' flag
    'subsample': 0.5,
    'objective': 'reg:squarederror',  # current name for the deprecated 'reg:linear'
    'max_depth': 5,
    'min_child_weight': 3
}
# to be explored in detail for tuning and optimizing the model

Now we run XGBoost and cross-validate the results via the built-in `xgb.cv` function. We use our `xg_eval_mae` function to calculate the error (MAE). [MAE](https://en.wikipedia.org/wiki/Mean_absolute_error) is easy to understand and is typically a starting point for estimating the skill of a model.

In [None]:
bst_cv1 = xgb.cv(xgb_params, dtrain, num_boost_round=100, nfold=3, seed=0, 
                    feval=xg_eval_mae, maximize=False, early_stopping_rounds=10)

print ('CV score:', bst_cv1.iloc[-1,:]['test-mae-mean'])
#bst_cv1

This gives us a starting point: MAE = 1172.098958. Model tuning and optimization can improve the score further, but in the real world there is always a trade-off between predictive power and computational capacity. It is worth striking a balance between the two, since thinking about a problem in practical terms is often more beneficial than fitting the perfect model with a great score.

*Any inputs and feedback are accepted with great interest.*