In this notebook, I try to understand two ML models in depth: one linear model (**Regression**) and one tree-based model (**Random Forest**). The aim is to understand the parameters and hyperparameters these models use, so the focus is not on achieving a high score. With this objective in mind, I avoided XGBoost and LightGBM, which are otherwise popular choices for getting a high score.

Key takeaways:
* **Adversarial Validation** - A diagnostic way to check the similarity between Training and Test data sets
* **Feature Transformation** - Log transformation to reduce variable skewness and test normality with QQ-plot
* **Regularizations (L1 and L2 Penalties)** - Effect of regularization parameter on Lasso and Ridge coefficients
* **Residuals Plot and Prediction Error Plot** - Visual analysis of the regression model's performance
* **Visualize Random Forest** - How the model plot changes with different hyperparameters, viz. n_estimators, max_depth, min_samples_leaf.

In [None]:
import pandas as pd
train_data = pd.read_csv("../input/allstate-claims-severity/train.csv") 
test_data = pd.read_csv("../input/allstate-claims-severity/test.csv")

In [None]:
train_data.head()

In [None]:
test_data.head()

**Description of Data :**
* "id": Unique identifier of an insurance claim lodged 
* "cat1" to "cat116": Categorical variables (masked)
* "cont1" to "cont14": Continuous variables (masked)
* "loss": The amount of claim the company has to pay out. This is the target variable present in training data.

Now, we will check the data types of the columns and whether any missing (null) values are present.

In [None]:
# Check for data types of the columns
train_data.info()

In [None]:
test_data.info()

In [None]:
# Check for null values
pd.isnull(train_data).values.any()

In [None]:
pd.isnull(test_data).values.any()

# Initial Data Exploration : 

We will do the initial data exploration using the Data Analysis Baseline Library (dabl).

In [None]:
!pip install dabl

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function() {
    return false;
}

In [None]:
import dabl
import warnings
import matplotlib.pyplot as plt
plt.style.use('ggplot')
warnings.filterwarnings('ignore')
dabl.plot(train_data, target_col = 'loss')

**Synopsis of DABL Plots :**

1. Claim amounts mostly hover below 20,000 USD
2. Among continuous features, cont2, cont7, cont3, cont11, cont12, cont6, cont4, cont8, cont10, cont14, cont9, cont5, cont1 and cont13 contribute most to the loss amount. E.g., the loss amount is high when cont2 lies between 0.6 and 0.8, or when cont14 lies between 0.2 and 0.4 or between 0.6 and 0.8.
3. Among categorical features, cat1, cat2, cat3, cat4, cat5, cat6, cat8, cat9, cat10 and cat11 contribute most to the loss amount. E.g., the loss amount is high when cat1 has the value 'A', and also when cat3 has the value 'A'.

Next, we will visualize loss amount per claim id. This plot will also highlight the outlier claim amounts.

In [None]:
# View outliers for loss amount
plt.figure(figsize=(10,5))
plt.xlabel('Id')
plt.ylabel('Loss Value')
plt.title('Loss Value per Claim Id and Visualization of Outliers')
plt.xlim([-1000, 195000])
plt.ylim([-1000, 140000])
plt.plot(train_data.index, train_data["loss"], marker='o', markeredgecolor='k')
plt.show()

We can spot the furthest outlier, with a loss value of around 120,000 USD.

# Adversarial Validation :

It is important to compare the overall distributions of the train and test sets. For predictions on the test set to be reliable, both data sets should follow nearly the same distribution. To check this, we remove the target labels from the train set and merge it with the test set to make one big data set. Then we label the train data points with "1" and the test data points with "0". Afterwards, we train ML classifiers to tell the two apart and check the ROC_AUC metric. If ROC_AUC is much higher than 50% (say, above 60%), we can conclude that there is a significant difference between the train and test data distributions. This procedure is known as ***Adversarial Validation***.

**Acknowledgement : [YouTube video by Zak Jost](https://www.youtube.com/watch?v=7cUCDRaIZ7I)**

In [None]:
from copy import deepcopy

train_d = train_data.drop(['id','loss'], axis=1)
test_d = test_data.drop(['id'], axis=1)
train_d['Target'] = 1
test_d['Target'] = 0
# Merge
prep_data = pd.concat((train_d, test_d))

# Label encoding for categorical features
data_le = deepcopy(prep_data)

list_of_cat_cols = list(train_data.select_dtypes(include=['object']).columns)
for col in list_of_cat_cols:
    data_le[col] = data_le[col].astype('category').cat.codes

# One-hot encoding for categorical features
prep_data = pd.get_dummies(data=prep_data, columns=list_of_cat_cols)

In [None]:
import numpy as np
data = prep_data.iloc[np.random.permutation(len(prep_data))] # Shuffle data
data.reset_index(drop = True, inplace = True)

x = data.drop(['Target'], axis = 1)
y = data.Target

few_examples = 50000 # Took only some data

x_train = x[:few_examples]
x_test = x[few_examples:]
y_train = y[:few_examples]
y_test = y[few_examples:]

We are using two classifiers, Logistic Regression and Random Forest, to check whether they can separate train and test data points.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score as AUC
from sklearn.model_selection import cross_val_score
import warnings

warnings.filterwarnings('ignore')

clf_lr = LogisticRegression()
clf_lr.fit(x_train, y_train)
pred = clf_lr.predict_proba(x_test)[:,1]
auc_lr = AUC(y_test, pred)
print("Logistic Regression ROC_AUC: {:.2%}".format(auc_lr))

clf_rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf_rf.fit(x_train, y_train)
pred = clf_rf.predict_proba(x_test)[:,1]
auc_rf = AUC(y_test, pred)
print("Random Forest ROC_AUC: {:.2%}".format(auc_rf))

scores_lr = cross_val_score(LogisticRegression(), x, y, scoring='roc_auc', cv=2)
print("Mean ROC_AUC for Logistic Regression : {:.2%}, std: {:.2%}".format( scores_lr.mean(), scores_lr.std()))

scores_rf = cross_val_score(RandomForestClassifier(n_estimators=100, n_jobs=-1), x, y, scoring='roc_auc', cv=2)
print("Mean ROC_AUC for Random Forest : {:.2%}, std: {:.2%}".format( scores_rf.mean(), scores_rf.std()))

The results above show a ROC-AUC score of only about 50%, i.e. no better than random guessing. So the classifiers are not able to separate training and testing data points. We can conclude that the training and testing data points are homogeneous, and there is little risk that a model fitted on the training set will fail on the test set because of a distribution shift.
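
To get a feel for what a passing and a failing check look like, here is a minimal sketch on synthetic data. The helper name `adversarial_auc` and the Gaussian toy data are assumptions made for illustration; they are not part of the Allstate data set or of the cells above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def adversarial_auc(train_X, test_X, seed=0):
    """Cross-validated ROC AUC of a classifier that tries to tell train rows from test rows."""
    X = np.vstack([train_X, test_X])
    y = np.concatenate([np.ones(len(train_X)), np.zeros(len(test_X))])  # train = 1, test = 0
    # rows are stacked train-first, so shuffle them once before cross-validation
    order = np.random.default_rng(seed).permutation(len(y))
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X[order], y[order], scoring="roc_auc", cv=5).mean()

rng = np.random.default_rng(42)
same_train = rng.normal(0.0, 1.0, size=(2000, 5))
same_test = rng.normal(0.0, 1.0, size=(2000, 5))      # same distribution as train
shifted_test = rng.normal(0.5, 1.0, size=(2000, 5))   # every column shifted by 0.5

auc_same = adversarial_auc(same_train, same_test)
auc_shifted = adversarial_auc(same_train, shifted_test)
print(f"same distribution: {auc_same:.3f}")
print(f"shifted test set:  {auc_shifted:.3f}")
```

When both sets come from the same distribution the AUC stays close to 0.5, while the shifted set is clearly separable. A score like the second one on real data would be a signal to investigate which features drive the separation.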

* **Projection into 2-Dimensional PCA :**

Next, we will also visualize both sets of data points using PCA. First, we decompose the features into 2 principal components and view the 2D projection plot.
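
For readers who want to see what the projection does numerically, the following is a rough NumPy-only sketch of PCA via singular value decomposition. The function `pca_project` and the random demo matrix are hypothetical names introduced here; the notebook itself relies on scikit-learn's `PCA`.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components using SVD."""
    X_centered = X - X.mean(axis=0)
    # rows of Vt are the principal directions, ordered by decreasing singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance_ratio = (S ** 2) / np.sum(S ** 2)
    projected = X_centered @ Vt[:n_components].T
    return projected, explained_variance_ratio[:n_components]

rng = np.random.default_rng(0)
# hypothetical data: 500 rows, 6 features; the first feature has a much larger scale
X_demo = rng.normal(size=(500, 6)) * np.array([10.0, 3.0, 1.0, 1.0, 1.0, 1.0])
X_2d, ratio = pca_project(X_demo, n_components=2)
print(X_2d.shape, ratio.round(3))
```

The demo also shows that PCA is sensitive to feature scale: the large-scale column dominates the first component. With label-encoded categorical codes mixed with continuous features in [0, 1], a similar effect is plausible, so the plots are best read as a rough visual check.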

In [None]:
from sklearn.decomposition import PCA

# Shuffle
data_le = data_le.iloc[np.random.permutation(len(data_le))]
X = data_le.iloc[:, :130]
y = data_le.iloc[:, 130:]
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

import matplotlib.pyplot as plt
plt.style.use('ggplot')

# Train = 1, Test = 0
plt.figure(figsize=(16,12))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=np.array(y), 
            edgecolor='white', s=75,
            cmap=plt.get_cmap('Accent', 2))  # plt.cm.get_cmap was removed in recent Matplotlib releases
plt.title('PCA transformed train and test sets')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar()
plt.show()

As we can see, the train and test data points follow a similar pattern.

* **Projection into 3-Dimensional PCA :**

We can also try decomposing the features to 3 principal components and view the 3D projection plot.

In [None]:
import plotly.express as px
import plotly.graph_objects as go
pca_3 = PCA(n_components=3)
X_reduced_3 = pca_3.fit_transform(X)
fig = px.scatter_3d(X_reduced_3, x=X_reduced_3[:,0], y=X_reduced_3[:, 1], z=X_reduced_3[:, 2], color=np.array(y), width=750, height=450, opacity=0.7)
fig.update_layout(scene = dict(
                    xaxis_title='PC1',
                    yaxis_title='PC2',
                    zaxis_title='PC3'))
fig.show()

The yellow points (train) and blue points (test) are closely meshed together, so it is evidently difficult to separate them. Hence, we can conclude that the train and test data points come from the same distribution rather than from different ones, and there is little risk of a train-test mismatch hurting the model.

# Correlation Analysis :

We will now perform correlation analysis among the continuous variables to see if there is multicollinearity, i.e. very high correlation (in the extreme case, a perfect correlation of 1) between any feature pair.

In [None]:
# Correlation for continuous variables
corr_mat = train_data.iloc[:,117:132].corr()

In [None]:
# Spot the continuous feature pairs with high correlation
threshold = 0.8
high_corrs = (corr_mat[abs(corr_mat) > threshold][corr_mat != 1.0]) .unstack().dropna().to_dict()
unique_high_corrs = pd.DataFrame(list(set([(tuple(sorted(key)), high_corrs[key]) for key in high_corrs])), columns=['cont_feature_pair', 'correlation_coefficient'])
unique_high_corrs = unique_high_corrs.loc[abs(unique_high_corrs['correlation_coefficient']).argsort()[::-1]]
unique_high_corrs

So, the highest correlation is between cont11 and cont12. There is no perfect correlation of 1 between any pair. We will also check the correlations by plotting a cluster heatmap. The aim is to identify clusters of continuous features with similar correlation patterns.
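
The pair extraction can also be written more compactly by masking the upper triangle of the correlation matrix. This is only a sketch on a toy DataFrame; `high_corr_pairs` and the columns `a`, `b`, `c` are hypothetical names, not columns of the Allstate data.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr()
    # keep the strict upper triangle: each pair appears once, the diagonal is dropped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    pairs = pairs[pairs.abs() > threshold]
    return pairs.sort_values(key=np.abs, ascending=False)

rng = np.random.default_rng(1)
a = rng.normal(size=1000)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=1000),  # almost a copy of "a"
    "c": rng.normal(size=1000),            # independent of both
})
result = high_corr_pairs(demo)
print(result)
```

Using the strict upper triangle avoids comparing floating-point values against 1.0 to drop the diagonal, and it removes the need to deduplicate mirrored pairs afterwards.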

In [None]:
# Clustermap of correlations of continuous variables
import seaborn as sns
cont_data = train_data.iloc[:,117:132]
cont_data = cont_data.corr().abs()
cluster_map = sns.clustermap(cont_data, annot = True, annot_kws = {'size': 11})  # avoid shadowing the builtin map
plt.setp(cluster_map.ax_heatmap.yaxis.get_majorticklabels(),rotation = 0)
plt.show()

Visibly, there are two major clusters:
1. One with very high correlations among each other: {cont7, cont11, cont12, cont13, cont1, cont9, cont6, cont10}.
2. One with weak correlations among each other: {cont4, cont8, cont2, cont3, cont5, cont14}.

The target variable "loss" is not very highly correlated with any continuous variable.

In [None]:
# Pair plot for judging the inter-relations among the continuous variables in 1st cluster
sns.pairplot(train_data, vars=["cont7", "cont11", "cont12", "cont13", "cont1", "cont9", "cont6", "cont10", "loss"])
plt.show()

* Steep linear relationship is observed between: (cont11, cont12)
* Linear relationship is observed between: (cont1, cont9)

In [None]:
# Pair plot for judging the inter-relations among the continuous variables in 2nd cluster
sns.pairplot(train_data, vars=["cont4", "cont8", "cont2", "cont3", "cont5", "cont14", "loss"])
plt.show()

* No significant trend is observed in above pair plot.

Next, we will encode the categorical variables as integer codes so that they can be used like numeric features.

In [None]:
# Encode categorical features as integer codes with Label Encoding in train data
from sklearn.preprocessing import LabelEncoder
lencoders = {}
for col in train_data.select_dtypes(include=['object']).columns:
    lencoders[col] = LabelEncoder()
    train_data[col] = lencoders[col].fit_transform(train_data[col])

In [None]:
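# Encode categorical features as integer codes with Label Encoding in test data.
# Caveat: these encoders are fitted on the test data separately, so a category can
# get a different code than in train data if the two sets contain different levels.
from sklearn.preprocessing import LabelEncoder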
lencoders_2 = {}
for col in test_data.select_dtypes(include=['object']).columns:
    lencoders_2[col] = LabelEncoder()
    test_data[col] = lencoders_2[col].fit_transform(test_data[col])

In [None]:
# Correlation for categorical variables
# cat1..cat116 sit at column positions 1..116, so the slice must end at 117
corr_mat_2 = train_data.iloc[:,1:117].corr()

In [None]:
# Spot the categorical feature pairs with high correlation
threshold = 0.8
high_corrs_2 = (corr_mat_2[abs(corr_mat_2) > threshold][corr_mat_2 != 1.0]) .unstack().dropna().to_dict()
unique_high_corrs_2 = pd.DataFrame(list(set([(tuple(sorted(key)), high_corrs_2[key]) for key in high_corrs_2])), columns=['cat_feature_pair', 'correlation_coefficient'])
unique_high_corrs_2 = unique_high_corrs_2.loc[abs(unique_high_corrs_2['correlation_coefficient']).argsort()[::-1]]
unique_high_corrs_2

Here, the highest correlation we observe is around 0.9557. Again, no pair has a correlation of 1. Let's visualize it.

In [None]:
# Correlation Heatmap
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid", {'axes.grid' : False})
mask = np.triu(np.ones_like(corr_mat_2, dtype=bool))  # np.bool was removed in NumPy 1.24
f, ax = plt.subplots(figsize=(20, 20))
cmap = sns.diverging_palette(250, 15, as_cmap=True)
sns.heatmap(corr_mat_2, mask=mask, cmap=cmap, vmax=None, center=0,square=True, annot=False, linewidths=.5, cbar_kws={"shrink": 0.9})

# Feature Importance :

Among the 130 features, we will check which ones contribute most to "loss". We have chosen the top 30 features here. You can try a bigger number.

In [None]:
def get_feature_importance_df(feature_importances,
                              column_names, 
                              top_n=30):
     
    imp_dict = dict(zip(column_names, feature_importances))
    
    # get name features sorted
    top_features = sorted(imp_dict, key=imp_dict.get, reverse=True)[0:top_n]
    
    # get values
    top_importances = [imp_dict[feature] for feature in top_features]
    
    # create dataframe with feature_importance
    df = pd.DataFrame(data={'feature': top_features, 'importance': top_importances})
    return df

In [None]:
import numpy as np

def get_col(df: 'dataframe', type_descr: 'numpy') -> list:
  
    try:
        col = (df.describe(include=type_descr).columns)  # pandas.core.indexes.base.Index  
    except ValueError:
        print(f'Dataframe does not contain {type_descr} columns!', end='\n')    
    else:
        return col.tolist()
    
# np.object was removed in NumPy 1.24; the builtin object works the same way here
list_columns = get_col(df=train_data, type_descr=[object, np.number])

In [None]:
list_columns.remove('loss')

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators = train_data.shape[1], # number of component trees
                            max_depth = 8,
                            min_samples_leaf = train_data.shape[1],
                            max_features = 0.2, # each split considers a random 20% of the features
                            n_jobs = -1)

In [None]:
rf.fit(train_data[list_columns], train_data['loss'])
features = train_data[list_columns].columns.values

In [None]:
feature_importance = get_feature_importance_df(rf.feature_importances_, features)
display(feature_importance)

So, we have got the list of top 30 important features. Let's visualize it.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig,ax = plt.subplots()
plt.xticks(rotation=65)

fig.set_size_inches(18,10)
#sns.set_color_codes('pastel')
sns.barplot(data=feature_importance[:30], 
            x="feature", 
            y="importance", 
            edgecolor='k',
            ax=ax)
ax.set(xlabel="Variable Names",
       ylabel='Importance',
       title="Feature Importances")

So we see that mainly categorical variables have high feature importance. Only a few continuous variables, such as cont2, cont7, cont12 and cont11, have high feature importance.

# Feature Transformation 

Let's check the skewness of the target variable. A highly skewed target is undesirable for the linear models used below because it tends to produce non-normal residuals. For this purpose, we will experiment with a simple log transformation of the target variable "loss" and with a log transformation of (loss+100). Later, we will check normality with a QQ-plot.

In [None]:
train_data['loss'].skew()

Yeah, it is very skewed indeed! (A skewness of about 3 is very high compared with the desired skewness of 0.)
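
As a side note on what the skewness number measures, here is a small sketch that computes it by hand on a synthetic right-skewed sample. The `skewness` helper and the log-normal toy data are assumptions for illustration only; pandas' `.skew()` uses a bias-corrected estimator, so its values can differ slightly in small samples.

```python
import numpy as np

def skewness(x):
    """Population (biased) skewness: third central moment divided by std cubed."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

rng = np.random.default_rng(0)
# synthetic, strictly positive, right-skewed sample (NOT the real loss column)
raw = rng.lognormal(mean=7.0, sigma=0.8, size=100_000)

skew_raw = skewness(raw)
skew_log = skewness(np.log(raw))
skew_log_shift = skewness(np.log(raw + 100))
print(f"raw: {skew_raw:.3f}  log: {skew_log:.3f}  log(x + 100): {skew_log_shift:.3f}")
```

On this toy sample the log transform brings the skewness close to zero, which is the effect the transformations in the following cells aim for on the real target.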

In [None]:
# Target Feature: Loss

sns.set_style("darkgrid", {'axes.grid' : False})
plt.figure(figsize = (8, 4))
plt.title('Loss Severity Distribution')
plt.xlabel('Loss Severity')
plt.ylabel('Frequency')
train_data['loss'].hist(bins=50)
plt.tight_layout()
xt = plt.xticks(rotation=45)
plt.xlim([-10000,140000])
plt.ylim([-10000,120000])
plt.annotate('Outliers\n present\n till\n this point', xy=(120000, 100), xytext=(120000, 35000), arrowprops=dict(facecolor='black'), color='black')
plt.show()

In [None]:
# Target Feature: Loss (Focus on Main Region)

plt.figure(figsize = (7, 4))
plt.title('Distribution of Loss Severity (Focus Mode: from 0 - 20,000 USD)')
plt.xlabel('Loss Severity')
plt.ylabel('Frequency')
train_data['loss'].hist(bins=500)
plt.tight_layout()
xt = plt.xticks(rotation=45)
plt.xlim([-1000,20000])
plt.ylim([-1500,20500])
plt.show()

In [None]:
# Feature Transformation Trial 1 : Apply Log on Loss

train_data['log_loss'] = np.log(train_data['loss'])

plt.figure(figsize = (7, 4))
plt.title('Loss Severity Distribution (Log Transformation)')
plt.xlabel('Log Loss Severity')
plt.ylabel('Frequency')
sns.histplot(train_data['log_loss'], kde=True, stat='density', alpha=0.60)  # distplot is deprecated in seaborn
plt.tight_layout()
xt = plt.xticks(rotation=0)
plt.xlim([-1,13])
plt.ylim([-0.01,0.5])
plt.show()

In [None]:
# Feature Transformation Trial 2 : Apply Log on (loss+100)

train_data['log_loss_+_100'] = np.log(100 + train_data['loss'])

plt.figure(figsize = (7, 4))
plt.title('Loss Severity Distribution (Log Transformation on Loss + 100)')
plt.xlabel('Complex Log Loss Severity')
plt.ylabel('Frequency')
sns.histplot(train_data['log_loss_+_100'], kde=True, stat='density', alpha=0.60)  # distplot is deprecated in seaborn
plt.tight_layout()
xt = plt.xticks(rotation=0)
plt.ylim([-0.01,0.55])
plt.show()

Well, both transformations now give a roughly symmetrical, bell-shaped curve. Let's check the skewness of the transformed target variables.

In [None]:
train_data['log_loss'].skew()

In [None]:
train_data['log_loss_+_100'].skew()

So, log(loss) gives lower skewness than log(loss+100). Let's draw **QQ-plots** now to check which one complies with normality better. A **Q-Q plot** is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles come from the same distribution, we should see the points forming a roughly straight line, like the one below. (Ref: [University of Virginia Library](https://data.library.virginia.edu/understanding-q-q-plots/#:~:text=A%20Q%2DQ%20plot%20is%20a,truly%20come%20from%20Normal%20distributions.))

In [None]:
import statsmodels.api as sm
sample = np.random.normal(0,1, 1000)
sm.qqplot(sample, line='45')
plt.title('Ideal QQ-Plot for Normal Distribution')
plt.show()

In [None]:
# Normality Test for the Target Feature
sm.qqplot(train_data['loss'], line='45')
plt.title('QQ-Plot for Loss Severity Distribution (Original)')
plt.show()

In [None]:
sm.qqplot(train_data['log_loss'], line='45')
plt.title('QQ-Plot for Loss Severity Distribution (Log Transformation)')
plt.show()

In [None]:
sm.qqplot(train_data['log_loss_+_100'], line='45')
plt.title('QQ-Plot for Loss Severity Distribution (Log of Loss + 100)')
plt.show()

Here, too, we observe that the QQ-plot for the **simple log transformation of loss** runs parallel to the diagonal red line. So we will work with 'log_loss' as the transformed target variable for training. Next, we split the data set into training and testing parts.
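
The visual comparison can be backed by a number. The sketch below builds the Q-Q points by hand and measures how straight they are; `qq_points`, `qq_straightness` and the log-normal sample are hypothetical and used only for illustration. Note that the `line='45'` option draws y = x, so points from an unstandardized sample can be straight without lying on that line. The sketch standardizes the sample explicitly for that reason (statsmodels' `qqplot` also offers `fit=True` and `line='s'`).

```python
import numpy as np
from statistics import NormalDist

def qq_points(sample):
    """Return (theoretical standard-normal quantiles, standardized sorted sample)."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    # plotting positions (i - 0.5) / n keep the probabilities strictly inside (0, 1)
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    standardized = (x - x.mean()) / x.std()
    return theoretical, standardized

def qq_straightness(sample):
    """Correlation of the two quantile sets; values near 1 mean a near-straight Q-Q plot."""
    theoretical, standardized = qq_points(sample)
    return np.corrcoef(theoretical, standardized)[0, 1]

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=7.0, sigma=0.8, size=5000)  # synthetic stand-in for a skewed target

r_raw = qq_straightness(skewed)
r_log = qq_straightness(np.log(skewed))
print(f"raw: {r_raw:.4f}  log-transformed: {r_log:.4f}")
```

A correlation close to 1 after the log transform corresponds to the near-straight Q-Q plot, while the raw skewed sample bends away from the line.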

# Train-Test Split :

In [None]:
from sklearn.model_selection import train_test_split
seed = 12345

# considering only top 30 imp features
trainx = ['cat80', 'cat79', 'cat87', 'cat57', 'cat101', 'cat12', 'cont2', 'cat81', 'cat89', 'cont7', 'cat7', 'cat10', 'cont12', 'cont11', 'cat1', 'cat72', 'cat103', 'cat94', 
                    'cat2', 'cont3', 'cat11', 'cat106', 'cat111', 'cat114', 'cat53', 'cat13', 'cat9', 'cont6', 'cat100', 'cat44']
trainy = 'log_loss'  # select the target by name instead of by column position
X = train_data[trainx]
Y = train_data[trainy]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=seed)

# Modelling with Linear Regression :

In [None]:
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, make_scorer  # SCORERS was unused and is gone from recent scikit-learn

model1 = LinearRegression(n_jobs=-1)
mae_val= make_scorer(mean_absolute_error, greater_is_better=False)
results1 = cross_val_score(model1, X_train, y_train, cv=5, scoring=mae_val, n_jobs=-1)
print("Linear Regression (Manual Tuning): ({0:.3f}) +/- ({1:.3f})".format(-1*results1.mean(), results1.std()))

In [None]:
X2 = sm.add_constant(X)
model = sm.OLS(Y, X2)
model_ = model.fit()
print(model_.summary())

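So, Ordinary Least Squares regression gives an \\( R^2 \\) value of around 45%, which is weak: the model explains less than half of the variance in the target. Also, the high AIC and BIC values do not suggest a good fit, although these criteria are mainly meaningful when compared across models fitted to the same data.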
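
As a reminder of what \\( R^2 \\) measures, here is a tiny hand computation. The function name `r_squared` and the four-point example are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_demo = np.array([1.0, 2.0, 3.0, 4.0])
r2_perfect = r_squared(y_demo, y_demo)                        # predictions equal the targets
r2_mean = r_squared(y_demo, np.full(4, y_demo.mean()))        # always predict the mean
print(r2_perfect, r2_mean)
```

A model that always predicts the mean scores 0 and a perfect model scores 1, so a value near 0.45 sits roughly halfway between the two.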

# Ridge Regression (L2 Penalty) :

Ridge is a way to regularize regression to curb overfitting. It reduces model complexity by adding a penalty on the sum of squared coefficients (β-coefficients), which shrinks the coefficients of redundant features towards zero (but never exactly to zero), so no feature is dropped outright. In the limit, as λ (the regularization penalty) tends to infinity, β tends to 0. Here, we are using scikit-learn's Ridge regression with regularization parameter (α) = 1.
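
The shrinking behaviour can be seen directly from the closed-form Ridge solution. The sketch below is a simplified illustration without an intercept on synthetic data; `ridge_closed_form` is a hypothetical helper, not the solver scikit-learn uses internally.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve min ||y - X b||^2 + alpha * ||b||^2 (no intercept) via the normal equations."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
# synthetic target; the last feature has a true coefficient of zero
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)

norms = []
for alpha in [0.0, 1.0, 100.0, 10_000.0]:
    beta = ridge_closed_form(X_demo, y_demo, alpha)
    norms.append(np.linalg.norm(beta))
    print(f"alpha={alpha:>8}: beta={beta.round(4)}")
```

The coefficient norm falls steadily as alpha grows, yet every coefficient stays non-zero even at the largest alpha, which is the "shrunk but never exactly zero" property described above.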

In [None]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, make_scorer  # SCORERS was unused and is gone from recent scikit-learn

model2 = Ridge(alpha=1,random_state=seed)
mae_val= make_scorer(mean_absolute_error, greater_is_better=False)
results2 = cross_val_score(model2, X_train, y_train, cv=5, scoring= mae_val, n_jobs=1)
# the scorer returns negated MAE, so flip the sign to report the error itself
print("Linear Regression Ridge (Manual Tuning): ({0:.3f}) +/- ({1:.3f})".format(-1*results2.mean(), results2.std()))

In [None]:
import time

start_time = time.perf_counter()
clf = Ridge()
coefs = []
# this alpha parameter of scikit-learn's Ridge actually corresponds to our lambda
alphas = np.logspace(-6, 9, 200)

for a in alphas:
    clf.set_params(alpha=a)
    clf.fit(X2, Y)
    coefs.append(clf.coef_)

plt.figure(figsize=(18, 6))
plt.subplot(121)
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.xlabel('alpha')
plt.ylabel('beta coefficients')
plt.title('Ridge coefficients (β) as a function of regularization parameter (α)')
plt.axis('tight')
plt.annotate('All coefficients are shrunk to zero at this point \nfor Lasso (see proof later)', 
xy=(1, -0.5), xytext=(1, -0.3), arrowprops=dict(facecolor='black'), color='black')
plt.grid(color='black', linestyle='dotted')
plt.show()

end_time = time.perf_counter()
print("Total Run Time:")
print(end_time - start_time, "Seconds")

In [None]:
start_time = time.perf_counter()

clf = Ridge()
error = []
alphas = np.logspace(-6, 9, 200)

for a in alphas:
    clf.set_params(alpha=a)
    mae_val = make_scorer(mean_absolute_error, greater_is_better=False)
    error.append(cross_val_score(clf, X2, Y, cv=5, scoring=mae_val, n_jobs=1).mean())

plt.figure(figsize=(18, 6))

plt.subplot(121)
ax = plt.gca()
ax.plot(alphas, error)
ax.set_xscale('log')
plt.xlabel('alpha')
plt.ylabel('mean absolute error')
plt.title('Mean absolute error as a function of regularization parameter (α)')
plt.axis('tight')

plt.annotate('Lasso is done at this point \nand all coefficients would be \nshrunk to zero (see proof later)', 
xy=(1, -0.65), xytext=(1, -0.62), arrowprops=dict(facecolor='black'), color='black')
plt.annotate('', 
xy=(1000, -0.48), xytext=(1000, -0.51), arrowprops=dict(facecolor='black'), color='black')
plt.grid(color='black', linestyle='dotted')
plt.ylim([-0.68,-0.45])
plt.show()

end_time = time.perf_counter()
print("Total Run Time:")
print(end_time - start_time, "Seconds")

# Lasso Regression (L1 Penalty) :
Lasso is another way to curb overfitting. It also adds a penalty for non-zero β-coefficients. Unlike Ridge regression, which penalizes the sum of squared coefficients, Lasso penalizes the sum of the absolute values of the β-coefficients. Hence, for sufficiently high values of λ, many β-coefficients are not only shrunk but set exactly to zero under Lasso. Here, we are trying Lasso Regression with regularization parameter (α) = 0.0001.
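
Why the L1 penalty produces exact zeros is easiest to see in a special case. Assuming an orthonormal design (XᵀX = I) and the objectives ½‖y − Xb‖² + λ‖b‖₁ for Lasso and ½‖y − Xb‖² + (λ/2)‖b‖² for Ridge, the solutions reduce to simple per-coefficient rules applied to the OLS estimates. This is a textbook simplification, not scikit-learn's coordinate-descent solver, and the numbers below are made up.

```python
import numpy as np

def soft_threshold(beta_ols, lam):
    """Lasso solution for an orthonormal design: shrink by lam, clip at zero."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

def ridge_shrink(beta_ols, lam):
    """Ridge solution for an orthonormal design: proportional shrinkage."""
    return beta_ols / (1.0 + lam)

beta_ols = np.array([2.0, -1.0, 0.5, 0.1])  # hypothetical OLS coefficients

lasso_small = soft_threshold(beta_ols, 0.3)
lasso_big = soft_threshold(beta_ols, 3.0)
ridge_big = ridge_shrink(beta_ols, 3.0)
print("lasso, lam=0.3:", lasso_small)
print("lasso, lam=3.0:", lasso_big)
print("ridge, lam=3.0:", ridge_big)
```

With a small penalty only the weakest coefficient is zeroed by Lasso, and with a large one all of them are, whereas Ridge merely scales every coefficient down.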

In [None]:
from sklearn.linear_model import Lasso
model3 = Lasso(alpha=0.0001,random_state=seed)
mae_val = make_scorer(mean_absolute_error, greater_is_better=False)
results3 = cross_val_score(model3, X_train, y_train, cv=5, scoring=mae_val, n_jobs=1)
# the scorer returns negated MAE, so flip the sign to report the error itself
print("Linear Regression Lasso (Manual Tuning): ({0:.3f}) +/- ({1:.3f})".format(-1*results3.mean(), results3.std()))

In [None]:
start_time = time.perf_counter()

#clf = Lasso(tol=0.00001, max_iter=10000)
clf = Lasso()
coefs = []
alphas = np.logspace(-6, 2, 200)

for a in alphas:
    clf.set_params(alpha=a)
    clf.fit(X2, Y)
    coefs.append(clf.coef_)

plt.figure(figsize=(18, 6))
plt.subplot(121)
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.xlabel('alpha')
plt.ylabel('coefficients')
plt.title('Lasso coefficients (β) as a function of regularization parameter (α)')
plt.axis('tight')

plt.annotate('see proof now', 
xy=(1, 0.1), xytext=(1, 0.3), arrowprops=dict(facecolor='black'), color='black')
plt.grid(color='black', linestyle='dotted')
plt.show()

end_time = time.perf_counter()
print("Total Run Time:")
print(end_time - start_time, "Seconds")

In the previous figures, we promised to show the proof later. Now we can see that at alpha = 1 the Lasso coefficients shrink to exactly zero.

In [None]:
start_time = time.perf_counter()

clf = Lasso()
error = []
alphas = np.logspace(-6, 9, 200)

for a in alphas:
    clf.set_params(alpha=a)
    mae_val = make_scorer(mean_absolute_error, greater_is_better=False)
    error.append(cross_val_score(clf, X2, Y, cv=5, scoring=mae_val, n_jobs=1).mean())

plt.figure(figsize=(18, 6))

plt.subplot(121)
ax = plt.gca()
ax.plot(alphas, error)
ax.set_xscale('log')
plt.xlabel('alpha')
plt.ylabel('mean absolute error')
plt.title('Mean absolute error as a function of the regularization parameter (α)')
plt.axis('tight')

plt.annotate('Lasso is done at this point \nand all weights are \nshrunk to zero', 
xy=(1, -0.60), xytext=(1, -0.58), arrowprops=dict(facecolor='black'), color='black')
plt.annotate('Ridge still performs well at this point', 
xy=(0.001, -0.49), xytext=(0.001, -0.51), arrowprops=dict(facecolor='black'), color='black')
plt.grid(color='black', linestyle='dotted')
plt.ylim([-0.68,-0.44])
plt.show()

end_time = time.perf_counter()
print("Total Run Time:")
print(end_time - start_time, "Seconds")

**Acknowledgement :** For plotting Ridge and Lasso coefficient as a function of alpha, I consulted [Jerome Blanchet's kernel](https://www.kaggle.com/jeromeblanchet/ridge-lasso-coefficients-as-a-function-of-alpha). All thanks to him for sharing his work.

# Elastic Net Regression:
Elastic Net is a mix of Ridge and Lasso. If we set l1_ratio = 0, the penalty is 100% L2 (Ridge Regression). If we set l1_ratio = 1, it is 100% L1 (Lasso Regression). For any value l1_ratio = x where 0 < x < 1, the penalty is a combination of L1 (with weight x) and L2 (with weight 1 - x).
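
The mixing can be made concrete by computing the penalty term itself. The sketch below follows the penalty form given in scikit-learn's `ElasticNet` documentation, alpha * l1_ratio * ‖w‖₁ + 0.5 * alpha * (1 - l1_ratio) * ‖w‖₂²; the function name `elastic_net_penalty` and the example weights are assumptions for illustration.

```python
import numpy as np

def elastic_net_penalty(coef, alpha, l1_ratio):
    """Penalty term of the Elastic Net objective for a given coefficient vector."""
    coef = np.asarray(coef, dtype=float)
    l1 = np.sum(np.abs(coef))
    l2_squared = np.sum(coef ** 2)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared

w = np.array([1.0, -2.0, 0.0])  # hypothetical coefficients: L1 norm 3, squared L2 norm 5

pure_l1 = elastic_net_penalty(w, alpha=0.1, l1_ratio=1.0)
pure_l2 = elastic_net_penalty(w, alpha=0.1, l1_ratio=0.0)
mixed = elastic_net_penalty(w, alpha=0.1, l1_ratio=0.5)
print(pure_l1, pure_l2, mixed)
```

Because of the 0.5 factor on the L2 part (and the different scaling of the loss term), the alpha of `ElasticNet` with l1_ratio = 0 is not numerically interchangeable with the alpha of `Ridge`.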

In [None]:
from sklearn.linear_model import ElasticNet
model4 = ElasticNet(alpha=0.0001,l1_ratio=0.5,random_state=seed)
mae_val = make_scorer(mean_absolute_error, greater_is_better=False)
results4 = cross_val_score(model4, X_train, y_train, cv=5, scoring=mae_val, n_jobs=1)
# the scorer returns negated MAE, so flip the sign to report the error itself
print("Linear Regression Elastic Net (Manual Tuning): ({0:.3f}) +/- ({1:.3f})".format(-1*results4.mean(), results4.std()))

In [None]:
lasso = Lasso(alpha=0.0001)
lasso.fit(X_train, y_train)
ridge = Ridge(alpha=1,random_state=seed)
ridge.fit(X_train, y_train)
linear = LinearRegression()
linear.fit(X_train, y_train)
enet = ElasticNet(alpha=0.0001, l1_ratio=0.5)
enet.fit(X_train, y_train)

plt.figure(figsize = (16, 6))
plt.plot(enet.coef_, color='red', linewidth=0.5, marker='^', label='Elastic net coefficients with α = 0.0001 & L1 Ratio = 0.5')
plt.plot(lasso.coef_, color='black', linewidth=0.5, marker='s', label='Lasso coefficients with α = 0.0001')
plt.plot(ridge.coef_, color='green', linewidth=0.5, marker='x', label='Ridge coefficients with α = 1')
plt.plot(linear.coef_, color='blue', linewidth=0.5, marker='o', label='Linear coefficients without regularization')
plt.grid(color='black', linestyle='dotted')
plt.ylim([-0.6,0.7])
plt.xlim([-1,30])
plt.legend(loc='best')
plt.title('Coefficient Distribution According to the Linear Regression type')
plt.xlabel('Variables in Ranked Order')
plt.ylabel('Estimated Coefficient Value')
plt.show()

We can see that the estimated coefficients of the four models differ mainly for the variables in positions 10-15. Elsewhere, the curves almost coincide. The fluctuation in coefficient value is highest for the blue curve, i.e. the linear coefficients without regularization, which is expected because nothing shrinks them.

Let's try plotting residuals for Ridge, Lasso and Elastic Net models we used here.

In [None]:
from yellowbrick.regressor import ResidualsPlot
viz_r = ResidualsPlot(ridge)
viz_r.fit(X_train, y_train) 
viz_r.score(X_test, y_test)  
viz_r.show()  

In [None]:
viz_l = ResidualsPlot(lasso)
viz_l.fit(X_train, y_train)  
viz_l.score(X_test, y_test)  
viz_l.show()                 

For the Ridge (α=1) and Lasso (α=0.0001) models we used, the residual plots show similar performance. Both models have similar \\( R^2 \\) values for the train and test data sets, i.e. around 45%. Recall that we also got a similar \\( R^2 \\) value using OLS regression. We can experiment with a higher α value for Lasso.

In [None]:
viz_l = ResidualsPlot(Lasso(alpha=0.1))
viz_l.fit(X_train, y_train)  
viz_l.score(X_test, y_test)  
viz_l.show() 

Increasing α worsens the performance of Lasso. Hence, we stick to the lower α value used earlier. Let's check the plot for Elastic Net with α = 0.0001 and L1 ratio = 0.5.

In [None]:
viz_e = ResidualsPlot(enet)
viz_e.fit(X_train, y_train)  
viz_e.score(X_test, y_test)  
viz_e.show()

We observed similar performance here as well. 

**Common Observation from all Residual Plots :** The main purpose of viewing a residuals plot is to analyze the variance of the regressor's error. Here, the points are not randomly dispersed around the horizontal axis. This indicates that a linear regression model is not appropriate for the data. A non-linear model (perhaps a tree-based model) may be more appropriate.
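
What "not randomly dispersed" means can be illustrated with a toy case where a straight line is fitted to a purely quadratic signal. Everything in this sketch is synthetic and chosen for illustration; it does not use the notebook's data or models.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=2000)
y = x ** 2 + 0.05 * rng.normal(size=2000)  # synthetic target with a purely quadratic signal

# ordinary least squares straight-line fit: y ~ slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

center_mean = residuals[np.abs(x) < 0.3].mean()  # residuals near the middle of the range
edge_mean = residuals[np.abs(x) > 0.8].mean()    # residuals near both ends of the range
print(f"overall: {residuals.mean():.2e}  center: {center_mean:.3f}  edges: {edge_mean:.3f}")
```

The residuals average to zero overall, yet they are systematically negative in the middle and positive at the edges. That kind of structure, rather than the overall mean, is what a residuals plot helps to reveal.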

Before trying a tree-based model, we will check prediction error plot for regression.

In [None]:
from yellowbrick.regressor import PredictionError
viz_pe = PredictionError(enet)
viz_pe.fit(X_train, y_train)  
viz_pe.score(X_test, y_test)  
viz_pe.show() 

The prediction error plot shows the actual targets from the dataset against the predicted values generated by the model (Elastic Net). This allows us to see how much variance is present in the model. The variance can be diagnosed by comparing the best-fit line against the 45 degree diagonal. The 45 degree diagonal line denotes the case when the actual targets exactly match the predicted values generated by the model. (Ref: [Scikit-YB site](https://www.scikit-yb.org/en/latest/api/regressor/peplot.html)). In this case, a substantial amount of variance is present in the model.


Let us try a non-linear tree-based algorithm next. We will try only Random Forest here.

# Random Forest


In [None]:
from sklearn.ensemble import RandomForestRegressor
start_time = time.perf_counter()

model5 = RandomForestRegressor(n_jobs=-1,n_estimators=300, max_features=12, random_state=seed)
mae_val = make_scorer(mean_absolute_error, greater_is_better=False)
results5 = cross_val_score(model5, X_train, y_train, cv=5, scoring=mae_val, n_jobs=1)
print("Random Forest Regressor (Manual Tuning): ({0:.10f}) +/- ({1:.3f})".format(-1*results5.mean(), results5.std()))

end_time = time.perf_counter()
print("Total Run Time:")
print(end_time - start_time, "Seconds")

We started with a forest of 300 trees. Its cross-validated score is about 0.45, compared with about 0.47 for the different regression models. Since the score is a mean absolute error, lower is better, so the forest is slightly ahead of the linear models rather than behind them. Next, we try different numbers of trees to see how n_estimators affects the forest's performance.
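
The sign convention behind these scores is easy to trip over, so here is a small check on synthetic data. The toy regression problem is an assumption for illustration; the scorer construction mirrors the one used in the cells above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=300)

# greater_is_better=False makes the scorer return the NEGATED error
neg_mae = make_scorer(mean_absolute_error, greater_is_better=False)
raw_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring=neg_mae)

# the built-in scorer string follows the same convention
builtin_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5,
                                 scoring="neg_mean_absolute_error")

mae = -raw_scores.mean()  # flip the sign back to get the actual MAE
print(raw_scores.round(3), f"MAE = {mae:.3f}")
```

scikit-learn negates errors so that "higher is better" holds for every scorer during model selection. Once the sign is flipped back for reporting, a smaller number means a better model.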

In [None]:
start_time = time.perf_counter()

# let's check forest performance with different no. of trees (n_estimators)
h = [1, 2, 5, 10, 100, 500, 1000]
scores = []

for val in h:
    model = RandomForestRegressor(n_jobs=-1,n_estimators=val, max_features=12, random_state=seed)
    mae_val = make_scorer(mean_absolute_error, greater_is_better=False)
    results = cross_val_score(model, X_train, y_train, cv=2, scoring=mae_val, n_jobs=1)
    scores.append(-1*results.mean())
        
end_time = time.perf_counter()
print("Total Run Time:")
print(end_time - start_time, "Seconds")

Let's take a look at the results!

In [None]:
scores

In [None]:
import pandas as pd
df = pd.DataFrame(list(zip(h, scores)), columns=['n_estimators','cross_val_scores'])
ax = sns.lineplot(x="n_estimators", y="cross_val_scores", data=df)

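The values plotted are mean absolute errors (the negated scorer output is flipped back to positive), so lower is better. By observation, the cross val score (i.e. the error) increases as we decrease the no. of trees (n_estimators), and there is almost no improvement once the no. of estimators crosses 100. So a single tree gives the weakest result among the values tried, and around 100 trees are enough to get close to the best one. Since the aim of this notebook is to understand the hyperparameters rather than to maximize the score, we still look at very small forests below. That raises a quick question. **Is a random forest with one tree the same as a decision tree model?**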

The answer is **"No"**. Curious Kagglers can find the reasons in [desertnaut's answer on this topic in Stack Overflow](https://stackoverflow.com/questions/48239242/why-is-random-forest-with-a-single-tree-much-better-than-a-decision-tree-classif).
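
Without repeating the linked answer, two scikit-learn settings are worth keeping in mind: `RandomForestRegressor` uses `bootstrap=True` by default, so each tree is trained on a resample of the rows, and here `max_features=12` restricts the features considered at each split. The sketch below uses only NumPy to illustrate the first point: a bootstrap sample of n rows contains roughly 63% of the distinct rows (about 1 - 1/e). The sample size and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 100_000  # arbitrary size, chosen only for illustration

# Draw a bootstrap sample: n_rows indices sampled with replacement.
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

# Fraction of distinct rows that made it into the sample.
unique_fraction = np.unique(bootstrap_idx).size / n_rows

print(f"distinct rows in bootstrap sample: {unique_fraction:.3f}")
print(f"theoretical limit 1 - 1/e:         {1 - np.exp(-1):.3f}")
```

So even a one-tree forest does not see the full training set the way a plain decision tree does.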

In [None]:
from sklearn.ensemble import RandomForestRegressor
model_1n = RandomForestRegressor(n_jobs=-1,n_estimators=1, max_features=12, random_state=seed)
model_1n.fit(X_train, y_train)

In [None]:
model_10n = RandomForestRegressor(n_jobs=-1,n_estimators=10, max_features=12, random_state=seed)
model_10n.fit(X_train, y_train)

Now we will look at the level of randomness in forests made of 1 tree and of 10 trees respectively.

**Acknowledgement :** I have taken the Random Forest visualization functions from [Aysen Tatarinov's work](https://github.com/aysent/random-forest-leaf-visualization). He has written detailed code to visualize forests made of many trees on a simple graph.

In [None]:
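import numpy as np
from sklearn.tree import _tree

def leaf__depths(estimator, nodeid=0):
    """Return the depth (relative to nodeid) of every leaf below nodeid."""
    left__child = estimator.children_left[nodeid]
    right__child = estimator.children_right[nodeid]

    if left__child == _tree.TREE_LEAF:
        # nodeid is itself a leaf: depth 0 relative to itself
        depths = np.array([0])
    else:
        # Recurse into both subtrees; each level adds 1 to the depth
        left__depths = leaf__depths(estimator, left__child) + 1
        right__depths = leaf__depths(estimator, right__child) + 1
        depths = np.append(left__depths, right__depths)

    return depths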
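
To check the logic of the depth calculation by hand, here is a small sketch that is not part of the borrowed code. It walks a hand-built toy tree with an explicit stack instead of recursion. The helper name `leaf_depths_iterative`, the toy tree, and the constant `TREE_LEAF = -1` (assumed to match `sklearn.tree._tree.TREE_LEAF`) are all illustrative assumptions.

```python
import numpy as np
from types import SimpleNamespace

TREE_LEAF = -1  # assumed to equal sklearn.tree._tree.TREE_LEAF

def leaf_depths_iterative(tree, root=0):
    """Return the depth of every leaf reachable from root (hypothetical helper)."""
    depths = []
    stack = [(root, 0)]  # (node id, depth of that node)
    while stack:
        node, depth = stack.pop()
        left = tree.children_left[node]
        right = tree.children_right[node]
        if left == TREE_LEAF:
            depths.append(depth)
        else:
            # Push right first so that the left subtree is visited first.
            stack.append((right, depth + 1))
            stack.append((left, depth + 1))
    return np.array(depths)

# Hand-built toy tree in the same array layout as tree_.children_left/right:
#        0
#       / \
#      1   2
#         / \
#        3   4
toy_tree = SimpleNamespace(
    children_left=np.array([1, -1, 3, -1, -1]),
    children_right=np.array([2, -1, 4, -1, -1]),
)

toy_depths = leaf_depths_iterative(toy_tree)
print(toy_depths)  # [1 2 2]
```

Node 1 is a leaf at depth 1, and nodes 3 and 4 are leaves at depth 2, which is what the recursive function would report for the same tree as well.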

In [None]:
def leaf__samples(estimator, nodeid=0):
    """Return the number of training samples in every leaf below nodeid."""
    left__child = estimator.children_left[nodeid]
    right__child = estimator.children_right[nodeid]

    if left__child == _tree.TREE_LEAF:
        # nodeid is a leaf: report how many samples ended up in it
        samples = np.array([estimator.n_node_samples[nodeid]])
    else:
        # Collect the leaf sizes of both subtrees
        left__samples = leaf__samples(estimator, left__child)
        right__samples = leaf__samples(estimator, right__child)
        samples = np.append(left__samples, right__samples)

    return samples

In [None]:

def visualization__estimator(ensemble, tree_id=0):

     plt.figure(figsize=(8,8))
     plt.subplot(211)

     estimator = ensemble.estimators_[tree_id].tree_
     depths = leaf__depths(estimator)
     
     plt.hist(depths, histtype='step', bins=range(min(depths), max(depths)+1))
     plt.grid(color='black', linestyle='dotted')
     plt.xlabel("Depth of leaf nodes (tree %s)" % tree_id)
     plt.show()

In [None]:
def visualization__forest(ensemble):

     plt.figure(figsize=(8,8))
     plt.subplot(211)

     depths__all = np.array([], dtype=int)

     for x in ensemble.estimators_:
         estimator = x.tree_
         depths = leaf__depths(estimator)
         depths__all = np.append(depths__all, depths)
         plt.hist(depths, histtype='step', bins=range(min(depths), max(depths)+1))

     plt.hist(depths__all, histtype='step',
              bins=range(min(depths__all), max(depths__all)+1), 
              weights=np.ones(len(depths__all))/len(ensemble.estimators_), 
              linewidth=2)
     plt.grid(color='black', linestyle='dotted')
     plt.xlabel("Depth of leaf nodes")
    
     plt.show()

In [None]:
start_time = time.perf_counter()

visualization__estimator(model_1n)

end_time = time.perf_counter()
print("Total Run Time:")
print(end_time - start_time, "Seconds")

In [None]:
start_time = time.perf_counter()

visualization__estimator(model_10n)

end_time = time.perf_counter()
print("Total Run Time:")
print(end_time - start_time, "Seconds")

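**Observation :** We see almost no difference between the two plots. This is expected, because visualization__estimator draws only the first tree (tree_id=0) of each forest. Both forests share the same random_state and settings, so their first trees should be grown in essentially the same way, regardless of how many further trees follow. The execution times are also very close.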

In [None]:
start_time = time.perf_counter()

visualization__forest(model_1n)

end_time = time.perf_counter()
print("Total Run Time:")
print(end_time - start_time, "Seconds")

In [None]:
start_time = time.perf_counter()

visualization__forest(model_10n)

end_time = time.perf_counter()
print("Total Run Time:")
print(end_time - start_time, "Seconds")

**Observation :** While visualizing the forest, we can see 10 overlapping curves for the model with n_estimators = 10 (one thin curve per tree, plus a thicker curve for the per-tree average). The depth of leaf nodes varies between estimators mostly in the middle-placed bins.
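
The thicker curve in visualization__forest comes from the `weights` argument: every leaf counts 1 / (number of trees), so the histogram shows the average number of leaves per tree at each depth. Here is a minimal NumPy sketch of that weighting on made-up leaf depths for three small trees (the depths are assumptions for illustration).

```python
import numpy as np

# Hypothetical leaf depths for a forest of three small trees.
depths_per_tree = [
    np.array([2, 3, 3, 4]),
    np.array([2, 2, 3, 4]),
    np.array([3, 3, 4, 4]),
]
n_trees = len(depths_per_tree)
depths_all = np.concatenate(depths_per_tree)

# One bin per depth value: edges 2, 3, 4, 5 give bins [2,3), [3,4), [4,5].
edges = np.arange(depths_all.min(), depths_all.max() + 2)

# Same weighting as in visualization__forest: each leaf counts 1 / n_trees.
avg_counts, _ = np.histogram(
    depths_all, bins=edges, weights=np.ones(len(depths_all)) / n_trees
)
print(avg_counts)  # average number of leaves per tree at depths 2, 3, 4
```

For a forest with a single tree the weights are all 1, so the thick curve simply coincides with that tree's own curve.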

Now, we will introduce max_depth parameter in the model and check the plots.

In [None]:
model_1n_md = RandomForestRegressor(n_jobs=-1,n_estimators=1, max_features=12, max_depth=20, random_state=seed)
model_1n_md.fit(X_train, y_train)

In [None]:
model_10n_md = RandomForestRegressor(n_jobs=-1,n_estimators=10, max_features=12, max_depth=20, random_state=seed)
model_10n_md.fit(X_train, y_train)

In [None]:
start_time = time.perf_counter()

visualization__forest(model_1n_md)

end_time = time.perf_counter()
print("Total Run Time:")
print(end_time - start_time, "Seconds")

In [None]:
start_time = time.perf_counter()

visualization__forest(model_10n_md)

end_time = time.perf_counter()
print("Total Run Time:")
print(end_time - start_time, "Seconds")

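**Observation :** The number of leaf nodes increases stepwise (in fairly even increments) with depth. However, the increase is exceptionally steep when moving from depth '19' to '20'. This is consistent with the max_depth=20 limit: branches that would otherwise grow deeper are forced to end in a leaf at the maximum depth.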
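
One more detail may contribute to the jump in the last step. The plotting helpers use `bins=range(min(depths), max(depths)+1)`, and, to my understanding, `plt.hist` follows the binning rules of `np.histogram`, where the last bin is closed on both sides. The sketch below shows on made-up depths (an assumption for illustration) that the two deepest levels then end up in the same bin.

```python
import numpy as np

# Hypothetical leaf depths: one leaf at depth 18, two at 19, three at 20.
depths = np.array([18, 19, 19, 20, 20, 20])

# Edges as in the plotting helpers: range(min, max + 1) -> 18, 19, 20.
# This yields only two bins, [18, 19) and [19, 20]; the last bin is closed
# on the right, so depths 19 and 20 are counted together.
counts_as_plotted, _ = np.histogram(
    depths, bins=range(depths.min(), depths.max() + 1)
)

# One extra edge gives every depth its own bin: [18, 19), [19, 20), [20, 21].
counts_per_depth, _ = np.histogram(
    depths, bins=range(depths.min(), depths.max() + 2)
)

print(counts_as_plotted)  # [1 5]
print(counts_per_depth)   # [1 2 3]
```

Using `max(depths)+2` as the upper edge would be one way to give the deepest level its own bin, at the cost of changing how the plots look.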

Next, we will introduce min_samples_leaf parameter and observe the plots further.

In [None]:
model_1n_msl = RandomForestRegressor(n_jobs=-1,n_estimators=1, max_features=12, min_samples_leaf=6, max_depth=20, random_state=seed)
model_1n_msl.fit(X_train, y_train)

In [None]:
model_10n_msl = RandomForestRegressor(n_jobs=-1,n_estimators=10, max_features=12, min_samples_leaf=6, max_depth=20, random_state=seed)
model_10n_msl.fit(X_train, y_train)

In [None]:
start_time = time.perf_counter()

visualization__forest(model_1n_msl)

end_time = time.perf_counter()
print("Total Run Time:")
print(end_time - start_time, "Seconds")

In [None]:
start_time = time.perf_counter()

visualization__forest(model_10n_msl)

end_time = time.perf_counter()
print("Total Run Time:")
print(end_time - start_time, "Seconds")

**Observation :** On introducing min_samples_leaf, the increase across depths is no longer even. Sometimes the curve decreases as well (e.g. for the forest with one tree, there is a drop when moving from depth '18' to '19'). This is logical, because min_samples_leaf is the minimum number of training samples that must end up in each leaf node. A split is only allowed if it leaves at least that many samples on both sides, so with a value as high as '6' many branches stop before reaching max_depth, and the curve fluctuates. For this notebook we prefer a low value and will keep min_samples_leaf at its default (1).
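
The effect of min_samples_leaf can be checked directly on a fitted tree. The following sketch uses a plain `DecisionTreeRegressor` on synthetic data (the data and the helper `leaf_sizes` are assumptions for illustration, not part of the notebook's pipeline) and compares the leaf sizes with and without the constraint.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical regression data, unrelated to the Allstate set.
X = rng.normal(size=(500, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=500)

def leaf_sizes(model):
    """Number of training samples in each leaf of a fitted tree."""
    tree = model.tree_
    is_leaf = tree.children_left == -1  # -1 marks a leaf node
    return tree.n_node_samples[is_leaf]

default_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
constrained_tree = DecisionTreeRegressor(min_samples_leaf=6, random_state=0).fit(X, y)

# Smallest leaf, number of leaves and depth for each tree.
print(leaf_sizes(default_tree).min(), default_tree.get_n_leaves(), default_tree.get_depth())
print(leaf_sizes(constrained_tree).min(), constrained_tree.get_n_leaves(), constrained_tree.get_depth())
```

With the constraint, no leaf holds fewer than 6 samples and the tree has far fewer leaves, which is why the leaf-depth curves change shape.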

**Final Model :** We will choose "model_1n_md" since it has produced the best plot.

In [None]:
pred=test_data.drop('id',axis=1)
test_data_ = test_data[['cat80', 'cat79', 'cat87', 'cat57', 'cat101', 'cat12', 'cont2', 'cat81', 'cat89', 'cont7', 'cat7', 'cat10', 'cont12', 'cont11', 'cat1', 'cat72', 'cat103', 'cat94', 
                       'cat2', 'cont3', 'cat11', 'cat106', 'cat111', 'cat114', 'cat53', 'cat13', 'cat9', 'cont6', 'cat100', 'cat44']]
yp=model_1n_md.predict(test_data_)

In [None]:
submission = pd.read_csv('../input/allstate-claims-severity/sample_submission.csv')
submission['loss'] = yp
submission.head()

In [None]:
submission.to_csv('AllstateClaimsPred.csv', index=False)
print("Submission file saved")