<a href="https://colab.research.google.com/github/sayyed-uoft/sunlife/blob/main/SunLife_Vector_Institute_Workshop_(Dec_2021).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Institute + Sun Life Financial
## Fundamentals of Random Forests

Welcome to ‘Fundamentals of Random Forests’ by Vector Institute!
This is a Python tutorial in the ‘Fundamentals of Random Forests’ 2-day workshop. 

This program was developed for Sun Life Financial to give a mostly technical audience the opportunity to practice with Decision Tree and Random Forest models using the 'sklearn' Python package with a real and relevant dataset.

Instructor: Sayyed Nezhadi | Assignment Developer: Sayyed Nezhadi | Course Director: Shingai Manjengwa (@Tjido)
Never stop learning!

### Assignment: Regression using Decision Tree and Random Forest models
In this assignment you are going to learn how to process data and how to build and train Decision Tree and Random Forest models to predict the cost, and hence severity, of insurance claims.

## Data: Loading and Analysis
In this part of the code we will load the data, analyze it, and visualize it.

We are going to use a public dataset from [Kaggle](https://www.kaggle.com/c/allstate-claims-severity), provided by the Allstate insurance company in the USA. Allstate is currently developing automated methods of predicting the cost, and hence severity, of claims. Each row in this dataset represents an insurance claim. You must predict the value of the 'loss' column. Variables prefixed with 'cat' are categorical, while those prefixed with 'cont' are continuous.

There are 116 categorical variables and 14 continuous (real) variables. All the column names and categorical values are anonymized for privacy reasons.

The data is provided in two splits, "train" and "test". For this lesson, we will only load the "train" dataset.

### Initializing libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Loading data 

Loading the training dataset from a Zip file online using "pandas":

In [None]:
data = pd.read_csv('https://github.com/sayyed-uoft/sunlife/raw/main/Allstate_Claims_Severity.zip', compression='zip')

### Analyzing the training data

Let's take a quick look at the data. It is clearly anonymized, so we won't be able to use any subject matter expertise to help with the feature engineering.

In [None]:
data.head()

Getting overall information:

In [None]:
data.info()

Checking for missing information - fortunately, the data is already clean.  

In [None]:
# number of missing data by column
data.isnull().sum(axis=0)

In [None]:
# Is there any non-zero count in that list?
data.isnull().sum(axis=0).any()
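
To make the two-step check above concrete, here is a minimal sketch on a toy DataFrame (the column names `a` and `b` are made up for illustration and are not part of the Allstate data). `isnull().sum(axis=0)` counts missing values per column, and `.any()` reports whether at least one of those counts is non-zero.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in column "a" (hypothetical data)
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "z"]})

# Number of missing values in each column
missing_per_column = toy.isnull().sum(axis=0)

# True if at least one column has a non-zero count of missing values
has_missing = missing_per_column.any()

print(missing_per_column)
print(has_missing)
```

On the workshop dataset the same check is expected to come back `False`, since the text notes that the data is already clean.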

Let's look at the distribution of the numerical/continuous variables. It looks like they are already normalized to the (0, 1) range.

In [None]:
data.describe()

Let's check how they are distributed. We could plot the distributions, but we skip that for this assignment. Instead, we will just look at how skewed they are. The result shows that all the numeric columns are fairly symmetric except "loss", which is the output.

In [None]:
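# Skewness is only defined for numeric columns; recent pandas versions
# raise an error on the categorical (string) columns unless we skip them
data.skew(numeric_only=True)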

Now, we can look at the distribution of the "loss" variable using a violin plot.

In [None]:
sns.violinplot(y='loss', data=data)
plt.show()

It is very skewed, and the range of values is very wide too.

### Pre-process data

Let's first convert "id" to an index as this is not a feature:

In [None]:
data.set_index('id', inplace=True)
data.head()

We saw that the "loss" values are very skewed and cover a very wide range. Let's convert them to a logarithmic scale. We also have some noisy/very small values (e.g. loss = 0.67). Therefore, it is better to use log(1+x), which keeps values near zero close to zero instead of mapping them to large negative numbers.

In [None]:
data.loss = np.log1p(data.loss)

Let's look at it again. This looks more symmetric. It will be better to use this as the response variable.

In [None]:
sns.violinplot(y='loss', data=data)
plt.show()

The "skew" metric has dropped significantly:

In [None]:
data.loss.skew()
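
As a rough illustration of why the log(1+x) transform helps, the sketch below applies it to synthetic right-skewed values (drawn from a log-normal distribution, which is an assumption made here for illustration and not the Allstate data). It also checks that `np.expm1` undoes `np.log1p`, which matters later when predictions are converted back to the original scale.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "loss"-like values; NOT the workshop dataset
rng = np.random.default_rng(0)
toy_loss = pd.Series(rng.lognormal(mean=7.0, sigma=1.0, size=10_000))

# log(1 + x) compresses the long right tail
toy_log_loss = np.log1p(toy_loss)

skew_before = toy_loss.skew()
skew_after = toy_log_loss.skew()
print("skew before:", skew_before)
print("skew after: ", skew_after)

# expm1 is the inverse of log1p, so the original values can be recovered
recovered = np.expm1(toy_log_loss)
print("round trip ok:", np.allclose(recovered, toy_loss))
```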


Now, we need to encode the categorical columns into one-hot vectors. Let's first look at the number of unique values in each column (only the first 116 columns, which are the categorical ones).
You can learn more about one-hot encoding [here](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/).

In [None]:
data.iloc[:, :116].nunique().value_counts()

It looks like most of the columns have 2 unique values, so we won't create a huge number of columns. We can now convert them to one-hot / dummy variables (only "0"s and "1"s). We will choose to drop the first category of each column to eliminate redundant data. For example, a column with only two unique values will be converted to a single column.

In [None]:
data = pd.get_dummies(data, drop_first=True)
data.head()
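
The effect of `drop_first=True` is easier to see on a tiny example. The sketch below uses a toy DataFrame with made-up columns (`cat1`, `cat2`, `cont1` only mimic the dataset's naming) and compares the encoded columns with and without dropping the first category.

```python
import pandas as pd

# Toy frame with hypothetical columns mimicking the dataset's naming
toy = pd.DataFrame({
    "cat1": ["A", "B", "A", "B"],    # 2 unique values
    "cat2": ["A", "B", "C", "A"],    # 3 unique values
    "cont1": [0.1, 0.5, 0.3, 0.9],   # numeric, left untouched
})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category of each column dropped

print(list(full.columns))
print(list(reduced.columns))
```

The two-valued column `cat1` ends up as a single column (`cat1_B`), and the three-valued column `cat2` as two columns, since the dropped category is implied when all remaining dummies are zero.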

Now we have 1038 columns:

In [None]:
data.info()

Now, we need to do the following to be ready to train a model using "sklearn":

- Separate the features from labels


In [None]:
features = data.drop(['loss'], axis=1)
labels = data['loss']


- Split the data into training and test (validation) sets for model evaluation. We keep 80% for training and 20% for testing.

In [None]:
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.20, random_state=0)

In [None]:
print("Number of training samples:", labels_train.shape[0])
print("Number of testing samples:", labels_test.shape[0])

## Training and Evaluation

We have prepared our data and are ready to train a model.

We will compare the following two models:

- Decision Tree
- Random Forest

### Decision Tree:

Let's instantiate a Decision Tree model and train (fit) it with the training data. For now, we choose the default parameters without any restriction. 

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Instantiate the model 
model = DecisionTreeRegressor(random_state=0)
# Train the model
model.fit(features_train, labels_train)

Below are the parameters used for this decision tree (default parameters):

In [None]:
model.get_params()

Now, we can use the trained model to predict the response variable for test samples:

In [None]:
# Predict the test labels
preds = model.predict(features_test)
preds

To see how our model performs, we can use a regression metric. One popular metric is MAE (Mean Absolute Error). Don't forget that our model predicts the logarithm of "loss" (log(1+x)), so we first need to reverse the transform using the "expm1" function:

In [None]:
from sklearn.metrics import mean_absolute_error

# Calculate MAE for test data
mean_absolute_error(np.expm1(labels_test), np.expm1(preds))
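
To show what this metric computes, here is a small sketch with made-up numbers (three hypothetical claims, not taken from the dataset). It reverses the log(1+x) transform, computes the MAE by hand as the mean of the absolute differences, and compares it with `mean_absolute_error`. It also prints the MAE on the log scale to show why skipping `expm1` would give a number that is not in loss units.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true losses and predictions, both on the log(1 + x) scale
true_log = np.log1p(np.array([100.0, 2000.0, 500.0]))
pred_log = np.log1p(np.array([150.0, 1800.0, 500.0]))

# Undo the transform, then average the absolute differences
true_loss = np.expm1(true_log)
pred_loss = np.expm1(pred_log)
mae_manual = np.mean(np.abs(true_loss - pred_loss))

# Same quantity via sklearn, plus the (much smaller) log-scale value
mae_sklearn = mean_absolute_error(true_loss, pred_loss)
mae_log_scale = mean_absolute_error(true_log, pred_log)

print("manual MAE (loss units):", mae_manual)
print("sklearn MAE (loss units):", mae_sklearn)
print("MAE on log scale:", mae_log_scale)
```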

Let's see what the error is for training data: 

In [None]:
# Calculate MAE for train data 
mean_absolute_error(np.expm1(labels_train), np.expm1(model.predict(features_train)))

Wow! That's almost zero! That means the tree fits the training data almost perfectly.


 


> **Question:**
> The error on the training data is very low, but the error on the testing data is high. What is this a sign of? Please explain.








The constructed tree is going to be very big and very deep. Let's limit the size of the tree by limiting its depth to 3:

In [None]:
# Create a new model with limited depth
model = DecisionTreeRegressor(max_depth=3, random_state=0)
# Train 
model.fit(features_train, labels_train)
# Predict test labels
preds = model.predict(features_test)
# MAE for test data
mean_absolute_error(np.expm1(labels_test), np.expm1(preds))

Interestingly, the error was reduced even when we limited the tree. Let's check the error on training data. 

In [None]:
# Calculate MAE for train data 
mean_absolute_error(np.expm1(labels_train), np.expm1(model.predict(features_train)))

> **Question:**
> The error on the test data is now lower, and the error on the training data is comparable to it. Please explain why that happened. Is that good?


Let's plot this tree and look at the conditions. 

In [None]:
from sklearn import tree

plt.figure(figsize=(30, 20))
tree.plot_tree(model)
plt.show()

Looks nice! :)

> **Task:** Try different tree depths and see if you can get even better results. Report the depth and the corresponding MAE. You can use the grid search function from "sklearn" (`GridSearchCV`) or you can try it manually.
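
As a starting point for the task above, here is a minimal sketch of a grid search over `max_depth`. It runs on small synthetic data (the arrays `X` and `y` and the list of depths are assumptions made for illustration), so it finishes quickly; to use it in this notebook you would pass `features_train` and `labels_train` instead. Note that the score reported here is on whatever scale the target has, so with the log-transformed "loss" it would be a log-scale MAE, not the MAE in loss units computed earlier.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; NOT the workshop dataset
rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 5))
y = np.sin(6 * X[:, 0]) + X[:, 1] + rng.normal(scale=0.2, size=400)

# Candidate depths to compare (None means no depth limit)
depths = [1, 2, 3, 5, 8, None]

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": depths},
    scoring="neg_mean_absolute_error",  # sklearn maximizes, hence the negated MAE
    cv=3,                               # 3-fold cross-validation
)
search.fit(X, y)

print("best depth:", search.best_params_["max_depth"])
print("cross-validated MAE:", -search.best_score_)
```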

### Random Forest:

Now, we will use a Random Forest model and train (fit) it with the training data. For now, we keep the default parameters, except that we set the number of estimators (trees) to 50.

**Note:** This may take a few minutes.

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Create the model
model = RandomForestRegressor(n_estimators=50, random_state=0)
# Train
model.fit(features_train, labels_train.values)
# Predict test labels
preds = model.predict(features_test)
# MAE 
mean_absolute_error(np.expm1(labels_test), np.expm1(preds))

We got a much better result! 

Let's look at MAE for training data:

In [None]:
# Calculate MAE for train data 
mean_absolute_error(np.expm1(labels_train), np.expm1(model.predict(features_train)))

It's not zero but much lower than the test error.

> **Task:** Play with different parameters and see if you can get a better result while avoiding overfitting.

Below are the parameters we used for our model:

In [None]:
model.get_params()

One great feature of Random Forest is that it gives you the importance of each feature. This is great for feature engineering and for speeding up the training process.

In [None]:
# Importance scores sorted from high to low
np.sort(model.feature_importances_)[::-1]

In [None]:
# Indices of top 10 important features
indices = np.argsort(model.feature_importances_)[-10:][::-1]
indices

In [None]:
# names of the top 10 important features (sorted)
cols = features.columns[indices]
cols

We can plot the top 10 importance scores:

In [None]:
plt.figure(figsize=(10, 5))
plt.bar(x=cols, height=np.sort(model.feature_importances_)[-10:][::-1])
plt.show()
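
To build some intuition for what these scores mean, the sketch below trains a small forest on synthetic data where the answer is known in advance: only the first two of five features influence the target, and the first one has the larger effect (the data and the coefficients 5.0 and 2.0 are assumptions chosen for illustration). The same `argsort` pattern used above should then rank those two features at the top.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only features 0 and 1 affect the target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

forest = RandomForestRegressor(n_estimators=30, random_state=0)
forest.fit(X, y)

# Importance scores are normalized so that they sum to 1
importances = forest.feature_importances_

# Indices of the two most important features, highest first
top2 = np.argsort(importances)[-2:][::-1]

print(importances)
print(top2)
```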

> **Question & Task:** What are other ways to interpret the results of a Regression model? What other metrics or graphs would you suggest? 

**Congratulations, you have completed a tutorial in the ‘Fundamentals of Random Forests’ program!**

Vector Institute & Sun Life Financial | Course Director: Shingai Manjengwa (@Tjido) | Instructor: Sayyed Nezhadi  | Contact: learn@vectorinstitute.com
Never stop learning!