# Exploratory Data Analysis (EDA) and Production of Various Regression Models

This notebook explores the categorical and numerical features of the dataset, then builds a range of classical models and deep neural networks for regression. Rather than focussing on GBMs, I've tried to put more emphasis on deep learning methods within this notebook. Hope you enjoy!


**Table of Contents:**

1. [Load Data and Analyse Overall Dataset Features](#load)
2. [EDA](#EDA)
3. [Data Preparation and Preprocessing](#data-preprocessing)
4. [Creation of a Baseline Model](#baseline) 
5. [Classical Models Exploration](#classical-models)
    - 5.1 [Basic Analysis using Random Forest](#random-forest-analysis)
    - 5.2 [Basic Linear Regression and Visualisation of Residuals](#linear-regression)
    - 5.3 [Exploring PCA for linear regression](#pca)
    - 5.4 [Exploring Ridge and LASSO](#ridge-lasso)
    - 5.5 [Bootstrapped Linear Regression](#bootstrap-linear-regression)
    - 5.6 [More complex models - CatBoost Regressor](#catboost-model)
6. [Deep ANN Models](#ann-models)
    - 6.1 [Model 1 - Deep ANN Regressor with Monte Carlo Dropout and BatchNorm](#ann-model-1)
    - 6.2 [Model 2 - Deep Self-normalising SELU Network with Monte Carlo AlphaDropout](#ann-model-2)
7. [Test Set Predictions](#test-predictions)

In [None]:
import gc
import keras
import keras.backend as K
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
import tensorflow as tf

from catboost import CatBoostRegressor, cv, Pool
from collections import defaultdict

from keras.layers import Dense, Embedding, Flatten, LSTM, GRU, \
        SpatialDropout1D, Bidirectional, Conv1D, MaxPooling1D, BatchNormalization
from keras.models import Sequential, load_model
from keras import models
from keras import layers

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, BaggingRegressor
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, KFold, cross_val_score, cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.pipeline import Pipeline

from tqdm import tqdm

---

<a id="load"></a>
## 1. Load Data and Analyse Overall Dataset Features

In [None]:
data_dir = "/kaggle/input/tabular-playground-series-feb-2021/"

In [None]:
train_df = pd.read_csv(os.path.join(data_dir, "train.csv"))
test_df = pd.read_csv(os.path.join(data_dir, "test.csv"))
train_df.head()

In [None]:
test_df.head()

We seem to have a balanced mix of categorical and numerical features. From first inspection, it appears these features have already been standardised or normalised in some way.

In [None]:
train_df.describe()

In [None]:
# check for any null or missing values
train_df.isna().sum().sum()

Fortunately, we have absolutely no null or missing values, which is a pleasant surprise.

---

<a id="EDA"></a>
## 2. Basic EDA

### 2.1 Analysis of Numerical Features

In [None]:
train_df['target'].hist()
plt.show()

In [None]:
# boxplot comparison
fig = plt.figure(figsize=(4,7))
fig.suptitle('Distribution of target variable', fontsize=16)
ax = fig.add_subplot(111)
sns.boxplot(data=train_df["target"])
ax.set_xticklabels(['Target'])
ax.set_ylabel("Value")
plt.show()

It looks like we might have some outliers within the data, such as a few zero values and others far below the vast majority of data points.
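
To put a rough number on what the boxplot suggests, one option is to apply the same 1.5 x IQR rule that boxplot whiskers use by default. The sketch below is illustrative only: `iqr_outlier_mask` is a hypothetical helper, and `toy_target` is synthetic data standing in for `train_df['target']`.

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = pd.Series(values)
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# synthetic stand-in for train_df['target'] (an assumption for illustration)
rng = np.random.default_rng(13)
toy_target = pd.Series(np.concatenate([rng.normal(7.5, 0.5, 1000), [0.0, 0.1]]))

mask = iqr_outlier_mask(toy_target)
print(f"Flagged {mask.sum()} of {len(toy_target)} points as potential outliers")
```

On the real data, the same helper could be called on `train_df['target']` to count the flagged rows before deciding whether to clip or drop them.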

Let's plot a pairplot, showing the relationship between all of our numerical columns:

In [None]:
plot_cols = train_df.loc[:, train_df.dtypes != object].drop(columns=['id']).columns.values[:7]
plot_cols = np.append(plot_cols, 'target')

sns.pairplot(train_df.loc[:, plot_cols], plot_kws={'alpha':0.1})
plt.show()

It's hard to make sense of any real relationships here, due to the huge number of data instances.

Instead, we can compute the correlation between each pair of variables and illustrate that:

In [None]:
# use all of our continuous variables this time
plot_cols = train_df.loc[:, train_df.dtypes != object].drop(columns=['id']).columns.values

corr = train_df.loc[:, plot_cols].corr()

We can use this correlation matrix to plot a heatmap of correlations:

In [None]:
plt.figure(figsize=(16,10))
sns.heatmap(corr, annot=True)
plt.show()

We have a few strong correlations between our variables, for instance, 'cont5' and 'cont8', 'cont9', 'cont10', 'cont11', and 'cont12'. It's possible that these features carry much of the same information, so it could be worth trying to eliminate some of them in order to make our model simpler and avoid redundancy. 

All things considered, none of these are anywhere near perfect correlation (1.0), so keeping them in should not do too much harm in this case.

Looking at the target variable, no single column appears to be particularly correlated with it. This suggests a more complex relationship between our input data and output variable (assuming of course that such a relationship does in fact exist).
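
Reading the strongest pairs off a large heatmap can be error-prone, so it may help to list them programmatically. The following is a minimal sketch, assuming a square correlation DataFrame such as `corr`; `top_correlated_pairs` and `toy_df` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(corr_df, n=5):
    """Return the n feature pairs with the largest absolute correlation."""
    # indices of the strict upper triangle, so each pair appears exactly once
    rows, cols = np.triu_indices(len(corr_df), k=1)
    pairs = pd.Series(
        corr_df.to_numpy()[rows, cols],
        index=pd.MultiIndex.from_arrays([corr_df.index[rows], corr_df.columns[cols]]),
    )
    order = np.argsort(-pairs.abs().to_numpy())
    return pairs.iloc[order].head(n)

# toy data: 'b' is built from 'a', 'c' is independent noise
rng = np.random.default_rng(13)
a = rng.normal(size=500)
toy_df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})
result = top_correlated_pairs(toy_df.corr(), n=3)
print(result)
```

Calling `top_correlated_pairs(corr, n=10)` on the notebook's own matrix would give a ranked list to compare against the heatmap.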

### 2.2 Analysis of Categorical Features

In [None]:
train_df.nunique()

In [None]:
def custom_countplot(data_df, col_name, ax=None):
    """ Plot seaborn countplot for selected dataframe col """
    c_plot = sns.countplot(x=col_name, data=data_df, ax=ax)
    for g in c_plot.patches:
        c_plot.annotate(f"{g.get_height()}",
                        (g.get_x()+g.get_width()/3,
                         g.get_height()+60))

In [None]:
custom_countplot(data_df=train_df, col_name='cat0')

Let's do similar plots for all of our categorical columns:

In [None]:
cat_cols = train_df.loc[:, train_df.dtypes == object].columns
n = len(cat_cols)

fig, axs = plt.subplots(2, 5, figsize=(20,10))
axs = axs.flatten()

# iterate through each col and plot
for i, col_name in enumerate(cat_cols):
    custom_countplot(train_df, col_name, ax=axs[i])
    axs[i].set_xlabel(f"{col_name}", weight = 'bold')
    axs[i].set_ylabel('Count', weight='bold')
    
    if (i != 0 and i != 5):
        axs[i].set_ylabel('')
        
plt.tight_layout()

Some of our categorical columns have categories that occur very rarely, so these features are heavily imbalanced.

We can address this through many techniques, one of which is to combine the minority categories into a single composite category. We'll look at some simple examples of this next.

First of all, we can automatically select the minority categories to combine within each of our features like so:

In [None]:
def find_minority_cats(cat_cols, data_df, composite_category='z', threshold=0.01):
    """ Find minority categories for each feature column, and create a 
        dictionary that maps those to selected composite category """
    minority_col_dict = {}
    minority_mapping_dict = {}
    
    # find all feature categories whose proportion is below the threshold
    for feature in cat_cols:
        minority_col_dict[feature] = []
        minority_mapping_dict[feature] = {}
        
        for category, proportion in data_df[feature].value_counts(normalize=True).items():
            if proportion < threshold:
                minority_col_dict[feature].append(category)
                
                # map those minority cats to chosen composite feature
                minority_mapping_dict[feature] = { x : composite_category for x 
                                                  in minority_col_dict[feature]}
                
    return minority_mapping_dict, minority_col_dict

In [None]:
cat_min_mappings, minority_cols = find_minority_cats(cat_cols, train_df)

minority_cols

Our function has returned a mapping dict, so that we can automatically update our dataset with the desired composite value:

In [None]:
cat_min_mappings

We have simply combined all of these minority categories into a single composite category for each feature. This should give us a greater balance throughout the dataset, at the expense of losing some information about many data instances. 

Although we're losing information about our data, it's likely this will actually benefit our models in the long run, since it makes them simpler and less prone to the issues of imbalanced data, which would likely be more harmful than the information lost.

Using our dictionary created above, we can simply transform each of our features like so:

In [None]:
train_df_tx = train_df.copy()

for feat in cat_cols:
    train_df_tx[feat] = train_df[feat].replace(cat_min_mappings[feat])

Let's have a look and make sure these are now updated:

In [None]:
fig, axs = plt.subplots(2, 5, figsize=(20,10))
axs = axs.flatten()

# iterate through each col and plot
for i, col_name in enumerate(cat_cols):
    custom_countplot(train_df_tx, col_name, ax=axs[i])
    axs[i].set_xlabel(f"{col_name}", weight = 'bold')
    axs[i].set_ylabel('Count', weight='bold')
    
    # only plot ylabels on lhs
    if (i != 0 and i != 5):
        axs[i].set_ylabel('')
        
plt.tight_layout()

In [None]:
del train_df_tx
gc.collect()

We still have many minority categories, but it's certainly much better than it was originally!
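
For a quick check of the idea on a single column, the same grouping can be written compactly with `value_counts` and `Series.where`. This is only a sketch on made-up data; `combine_rare` is a hypothetical helper, and the 5% threshold and `'z'` label are illustrative choices.

```python
import pandas as pd

def combine_rare(series, threshold=0.05, composite="z"):
    """Replace categories whose relative frequency is below threshold."""
    freqs = series.value_counts(normalize=True)
    rare = freqs[freqs < threshold].index
    # keep values that are not rare, replace the rest with the composite label
    return series.where(~series.isin(rare), composite)

# toy column: 'C' (2%) and 'D' (1%) fall below the 5% threshold
toy = pd.Series(["A"] * 60 + ["B"] * 37 + ["C"] * 2 + ["D"] * 1)
combined = combine_rare(toy)
print(combined.value_counts())
```

Note that if the composite label already exists as a real category in a column, the rare categories would be merged into it, which may or may not be desirable.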

---

<a id="data-preprocessing"></a>
## 3. Data Preprocessing: Creation of a data loader and preprocessor

It's important that we handle our numerical and categorical features appropriately prior to producing our models.

We'll put together some preprocessing functions to encode our categorical features and standardise our numerical features. Whilst doing this, we'll also add support for combining some of the minority categories within our data features (since some are very imbalanced), and add support for producing additional dimensionality-reduced features (using PCA) to our dataset.

These extra features will allow us to experiment and tune to find the best combinations of feature engineering to perform for this problem.

In [None]:
class DataProcessor(object):
    def __init__(self):
        self.encoder = None
        self.standard_scaler = None
        self.num_cols = None
        self.cat_cols = None
        
    def preprocess(self, data_df, train=True, one_hot_encode=False,
                   combine_min_cats=False, add_pca_feats=False):
        """ Preprocess train / test as required """
        
        # if training, fit our transformers
        if train:
            self.train_ids = data_df.loc[:, 'id']
            train_cats = data_df.loc[:, data_df.dtypes == object]
            self.cat_cols = train_cats.columns
            
            # if selected, combine minority categorical feats
            if combine_min_cats:
                self._find_minority_cats(train_cats)
                train_cats = self._combine_minority_feats(train_cats)
            
            # if selected, one hot encode our cat features
            if one_hot_encode:
                self.encoder = OneHotEncoder(handle_unknown='ignore')
                oh_enc = self.encoder.fit_transform(train_cats).toarray()
                # get_feature_names() was removed in scikit-learn 1.2
                train_cats_enc = pd.DataFrame(oh_enc, columns=self.encoder.get_feature_names_out())
                self.final_cat_cols = list(train_cats_enc.columns)
            
            # otherwise just encode our cat feats with ints
            else:
                # encode all of our categorical variables
                self.encoder = defaultdict(LabelEncoder)
                train_cats_enc = train_cats.apply(lambda x: 
                                                  self.encoder[x.name].fit_transform(x))
                self.final_cat_cols = list(self.cat_cols)
            
            
            # standardise all numerical columns
            train_num = data_df.loc[:, data_df.dtypes != object].drop(columns=['target', 'id'])
            self.num_cols = train_num.columns
            self.standard_scaler = StandardScaler()
            train_num_std = self.standard_scaler.fit_transform(train_num)
            
            # add pca reduced num feats if selected, else just combine num + cat feats
            if add_pca_feats:
                pca_feats = self._return_num_pca(train_num_std)
                self.final_num_feats = list(self.num_cols)+list(self.pca_cols)
                
                
                X = pd.DataFrame(np.hstack((train_cats_enc, train_num_std, pca_feats)), 
                        columns=list(self.final_cat_cols)+list(self.num_cols)+list(self.pca_cols))
            else:   
                self.final_num_feats = list(self.num_cols)
                X = pd.DataFrame(np.hstack((train_cats_enc, train_num_std)), 
                        columns=list(self.final_cat_cols)+list(self.num_cols))
        
        # otherwise, treat as test data
        else:
            # transform categorical and numerical data
            self.test_ids = data_df.loc[:, 'id']
            cat_data = data_df.loc[:, self.cat_cols]
            if combine_min_cats:
                cat_data = self._combine_minority_feats(cat_data)
        
            if one_hot_encode:
                oh_enc = self.encoder.transform(cat_data).toarray()
                # get_feature_names() was removed in scikit-learn 1.2
                cats_enc = pd.DataFrame(oh_enc, columns=self.encoder.get_feature_names_out())
            else:
                cats_enc = cat_data.apply(lambda x: self.encoder[x.name].transform(x))
                
            # transform test numerical data
            num_data = data_df.loc[:, self.num_cols]
            num_std = self.standard_scaler.transform(num_data)
            
            if add_pca_feats:
                pca_feats = self._return_num_pca(num_std, train=False)
                
                X = pd.DataFrame(np.hstack((cats_enc, num_std, pca_feats)), 
                        columns=list(self.final_cat_cols)+list(self.num_cols)+list(self.pca_cols))
            
            else:
                X = pd.DataFrame(np.hstack((cats_enc, num_std)), 
                        columns=list(self.final_cat_cols)+list(self.num_cols)) 
        return X
    
    
    def _find_minority_cats(self, data_df, composite_category='z', threshold=0.05):
        """ Find minority categories for each feature column, and create a 
            dictionary that maps those to selected composite category """
        self.min_col_dict = {}
        self.min_cat_mappings = {}
    
        # find all feature categories with less than 5% proportion
        for feature in self.cat_cols:
            self.min_col_dict[feature] = []
            self.min_cat_mappings[feature] = {}
        
            for category, proportion in data_df[feature].value_counts(normalize=True).items():
                if proportion < threshold:
                    self.min_col_dict[feature].append(category)
                
                    # map those minority cats to chosen composite feature
                    self.min_cat_mappings[feature] = {x : composite_category for x 
                                                    in self.min_col_dict[feature]}
    
    
    def _combine_minority_feats(self, data_df):
        """ Combine minority categories into composite for each cat feature """
        new_df = data_df.copy()
        for feat in self.cat_cols:
            # overwrite each column in place with its remapped categories
            new_df[feat] = new_df[feat].replace(self.min_cat_mappings[feat])
        return new_df
    
    
    def _return_num_pca(self, num_df, n_components=0.85, train=True):
        """ return dim reduced numerical features using PCA """
        if train:
            self.pca = PCA(n_components=n_components)
            num_rd = self.pca.fit_transform(num_df)
            
            # create new col names for our reduced features
            self.pca_cols = [f"pca_{x}" for x in range(num_rd.shape[1])]
            
        else:
            num_rd = self.pca.transform(num_df)
        
        return pd.DataFrame(num_rd, columns=self.pca_cols)

Let's transform our initial data into our total training and test sets. For this we'll one hot encode our categorical variables, and standardise our numerical data. We'll also add some additional numerical features using PCA.

In [None]:
data_proc = DataProcessor()
X = data_proc.preprocess(train_df, add_pca_feats=True, one_hot_encode=True)
y = train_df.loc[:, 'target']
X_test = data_proc.preprocess(test_df, train=False, add_pca_feats=True, one_hot_encode=True)

print(f"X: {X.shape} \ny: {y.shape} \nX_test: {X_test.shape}")

Let's convert our data into a more efficient set of dtypes, to avoid any memory issues:

In [None]:
# convert all of our categorical columns to ints before using GBMs
cat_feat_dtype_dict = { x : "int" for x in data_proc.final_cat_cols}
num_feat_dtype_dict = { x : "float32" for x in data_proc.final_num_feats }
df_map_dict = {**cat_feat_dtype_dict, **num_feat_dtype_dict}
X = X.astype(df_map_dict)
X_test = X_test.astype(df_map_dict)

Note that, in the code above, we fit our encoder, standard scaler and PCA transformers on the training set only, and then use them to transform (but never re-fit on) the test data. 
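
The fit-on-train, transform-on-test pattern can be illustrated in isolation. The sketch below uses synthetic arrays (not the competition data) with a deliberately shifted test set, so that the effect of reusing the training statistics is visible.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(13)
train = rng.normal(loc=10.0, scale=2.0, size=(1000, 1))
test = rng.normal(loc=12.0, scale=2.0, size=(200, 1))  # deliberately shifted

scaler = StandardScaler().fit(train)   # statistics come from the training data only
train_std = scaler.transform(train)
test_std = scaler.transform(test)      # reuses the training mean and std

print(f"train mean after scaling: {train_std.mean():.3f}")
print(f"test mean after scaling:  {test_std.mean():.3f}")
```

The scaled test mean stays away from zero because the shift between the two sets is preserved. Re-fitting the scaler on the test data would hide that shift and leak test-set statistics into the preprocessing.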

We next need to break this down into training and validation splits. Our training split allows us to train each of our models through the optimisation of an objective function, which is specific to the model used. Our validation split allows us to estimate the performance of our trained models and lets us make refinements to improve them.

Throughout this entire process, we should not touch our test set until the very end, at which point we make predictions using our final model and submit these to the competition. 

This is how it should work in practice. However, what we often find is that people simply make repeated submissions against the competition test set and maximise that score over time. The problem is that this often results in a model that has become over-fitted specifically to the test set, and thus might not generalise well beyond it. 

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=13)
print(f"X_train: {X_train.shape} \ny_train: {y_train.shape} \nX_val: {X_val.shape}, \ny_val: {y_val.shape}")

These splits of data are still very large, and so in practice we might reduce these significantly into much smaller sub-sets. These can then be used to quickly train and evaluate a range of models, which can be iteratively improved and then tested on the full splits we produced above.

---

<a id="baseline"></a>
## 4. Creation of a baseline

In almost any data science project, we should make use of a simple baseline model from which we can gauge the utility and performance of any intelligent models we produce.

This is important, since if we jump straight in with a complex model we have no idea if it's actually performing well. If it performs worse than a simple statistic (such as the mean or median target value), then it's not useful.

We'll make a baseline that simply predicts the average of the target variable, so that we can gauge how well this performs.

In [None]:
y_baseline = pd.DataFrame(y_train.values, columns=['y_train'])
y_baseline['y_train_avg'] = y_train.mean()

In [None]:
mean_squared_error(y_baseline['y_train'], y_baseline['y_train_avg'])

This is our score on the training data, and we could do this similarly for the validation data in order to get a gauge of how well the baseline performs:

In [None]:
y_baseline = pd.DataFrame(y_val.values, columns=['y_val'])
y_baseline['y_train_avg'] = y_train.mean()

mean_squared_error(y_baseline['y_val'], y_baseline['y_train_avg'])

It seems the validation MSE is roughly consistent with the training MSE. Let's now make a set of baseline predictions on the test set, using exactly the same technique. We'll then submit this as a **basic baseline submission**:

In [None]:
# use average value for all predictions on test set
#submission_df = pd.read_csv(os.path.join(data_dir, "sample_submission.csv"))
#submission_df['target'] = y.mean()
#submission_df.to_csv('submission.csv', index=False)

In [None]:
del y_baseline
gc.collect()

This extremely simple baseline scores 0.88498 on the test set, which isn't bad when considering the lack of effort put in.
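
As an aside, scikit-learn ships a `DummyRegressor` that implements the same mean-prediction baseline behind the usual estimator interface. The sketch below runs on synthetic data; `X_toy`, `y_toy` and the 400/100 split are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(13)
X_toy = rng.normal(size=(500, 3))
y_toy = rng.normal(loc=7.5, scale=0.9, size=500)
X_tr, X_va, y_tr, y_va = X_toy[:400], X_toy[400:], y_toy[:400], y_toy[400:]

# the features are ignored; every prediction is the training-set mean
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
val_mse = mean_squared_error(y_va, baseline.predict(X_va))
print(f"baseline validation MSE: {val_mse:.4f}, RMSE: {np.sqrt(val_mse):.4f}")
```

Because it follows the estimator API, such a baseline can be passed to `cross_val_score` in exactly the same way as the real models.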

---

<a id="classical-models"></a>
## 5. Classical Model Exploration

Let's take a 5% sample of the total data for quickly experimenting with and training models:

In [None]:
_, X_sub, _, y_sub = train_test_split(X, y, test_size=0.05)
X_sub.shape, y_sub.shape

In [None]:
X_train_sub, X_val_sub, y_train_sub, y_val_sub = train_test_split(X_sub, y_sub, test_size=0.2)
X_train_sub.shape, X_val_sub.shape, y_train_sub.shape, y_val_sub.shape

In [None]:
# clear unnecessary variables
del X_sub
del y_sub
gc.collect()

<a id="random-forest-analysis"></a>
### 5.1 Basic Analysis using a Random Forest

In [None]:
rf_reg = RandomForestRegressor(n_jobs=-1)
rf_reg.fit(X_train_sub, y_train_sub)
rf_preds = rf_reg.predict(X_val_sub)

In [None]:
def mse(preds, true):
    """ Return mean squared error score """
    return np.square(preds - true).mean()

print(f"Validation MSE: {mse(rf_preds, y_val_sub)}\n")

In [None]:
rf_preds_cv = cross_val_predict(rf_reg, X_train_sub, y_train_sub, cv=3)
print(f"Cross-validation MSE: {mse(rf_preds_cv, y_train_sub)}")

In [None]:
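# r2_score expects the true values first, then the predictions
r2_score(y_val_sub, rf_preds)

In [None]:
del rf_preds
del rf_preds_cv

With the arguments in this order (true values first, predictions second), a negative $ R^{2} $ score is bad: it means we are doing worse than simply predicting the average, which corresponds to an $ R^{2} $ of zero. Note that $ R^{2} $ is not symmetric in its arguments, so passing the predictions first can produce a negative score even for a reasonable model.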
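
To make the meaning of $ R^{2} $ concrete, here is a small hand-rolled version of the standard definition, $ R^{2} = 1 - SS_{res} / SS_{tot} $, applied to made-up numbers. The function `r2` and the toy arrays are illustrative only. The example also shows that the order of the arguments matters.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # variance around the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
mean_pred = np.full_like(y_true, y_true.mean())      # always predict the mean
shrunk_pred = np.array([2.8, 2.9, 3.0, 3.1, 3.2])    # low-variance predictions

print("mean predictor:      ", r2(y_true, mean_pred))
print("correct arg order:   ", r2(y_true, shrunk_pred))
print("swapped arg order:   ", r2(shrunk_pred, y_true))
```

Predicting the mean gives a score of zero, the low-variance predictions score slightly above zero, and swapping the arguments makes the same predictions look far worse, because the denominator becomes the (tiny) variance of the predictions.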

#### Feature importances using our random forest

Let's also look at our feature importances, as obtained from the Random Forest Regressor:

In [None]:
pd.Series(rf_reg.feature_importances_, index=X.columns).plot.bar(figsize=(12,6))
plt.show()

Based on our smaller sample and random forest model, it doesn't look like many of our categorical features are important relative to the majority of numerical features in our data.

We could try iteratively removing the lowest of these, and seeing how it impacts our performance. For now, we'll simply leave them all in.

<a id="linear-regression"></a>
### 5.2 Basic Linear Regression and Visualisation of Residuals

A basic linear regressor is convenient due to its extreme simplicity and speed. Although it's not going to perform amazingly in general, it can still be useful as a starting point for analysing our problem.

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)
y_train_preds = lr.predict(X_train)
y_val_preds = lr.predict(X_val)

In [None]:
mse(y_val_preds, y_val)

Fortunately, our basic linear regression model is better than our simple baseline, which scored 0.7871 on the validation split.

We can also visualise these predictions and how far they deviate through the use of a residual plot:

In [None]:
def residual_plot(train_labels, train_preds, test_labels=None, test_preds=None, 
                  title="Residual Plot", figsize=(9,6), xlim=[6.5, 9.5]):
    """ Residual plot to evaluate performance of our simple linear regressor """
    plt.figure(figsize=figsize)
    plt.scatter(train_preds, train_preds - train_labels, c='blue', alpha=0.1,
                marker='o', edgecolors='white', label='Training')
    
    if test_labels is not None:
        plt.scatter(test_preds, test_preds - test_labels, c='red', alpha=0.1,
                    marker='^', edgecolors='white', label='Test')
    plt.xlabel('Predicted values')
    plt.ylabel('Residuals')
    plt.hlines(y=0, xmin=xlim[0], xmax=xlim[1], color='black', lw=2)
    plt.xlim(xlim)
    if test_labels is not None:
        plt.legend(loc='best')
    plt.title(title)
    plt.show()
    return

In [None]:
residual_plot(y_train, y_train_preds, y_val, y_val_preds, title="Linear Regressor Residual Plot")

We want our residual plot points to be randomly dispersed close to the horizontal line. Due to the vast number of predictions, this is quite difficult to interpret effectively from just one residual plot. We'll plot the training and validation residuals separately below to make this clearer:

In [None]:
residual_plot(y_train, y_train_preds, title="Lin Reg Residual Plot - Training Set", figsize=(6,5))

In [None]:
residual_plot(y_val, y_val_preds, title="Lin Reg Residual Plot - Validation Set", figsize=(6,5))

Our residuals look very similar in this case. These plots will be useful to compare to more complex models later on.

Let's apply dimensionality reduction and see whether it improves or impedes our basic linear model.

<a id="pca"></a>
### 5.3 Exploring PCA and linear regression

In [None]:
pca_99 = PCA(n_components=0.99)
X_train_rd = pca_99.fit_transform(X_train)
X_val_rd = pca_99.transform(X_val)
X_train_rd.shape, X_val_rd.shape

In [None]:
lr.fit(X_train_rd, y_train)
y_train_preds = lr.predict(X_train_rd)
y_val_preds = lr.predict(X_val_rd)

In [None]:
mse(y_val_preds, y_val)

In [None]:
n_range = np.arange(1, 25)
mse_scores = []

lin_reg = LinearRegression()

for n in n_range:
    pca = PCA(n_components=n)
    lr_model = Pipeline(steps=[('pca', pca), ('linear regression', lin_reg)])
    
    # evaluate using cross-validation
    lr_cv_preds = cross_val_predict(lr_model, X, y, cv=5)

    # compute the MSE between the true targets and the cross-validated predictions
    cv_mse = mean_squared_error(y, lr_cv_preds)
    
    mse_scores.append(cv_mse)

In [None]:
plt.figure(figsize=(12,5))
sns.lineplot(x=n_range, y=mse_scores)
plt.ylabel("Average Cross-Val MSE", weight='bold')
plt.xlabel("PCA n components", weight='bold')
plt.grid()
plt.show()

Clearly our model performs best when as many components as possible are kept. Dropping even a few of them noticeably worsens the MSE score, as shown above. Thus, we'll not spend any more time exploring PCA for this problem.
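
A complementary check, before sweeping `n_components`, is to look at the cumulative explained variance of a fitted PCA. The sketch below uses synthetic data with two underlying latent directions; `X_toy` and `n_for_99` are hypothetical names for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# toy data: 5 observed features driven by 2 latent factors plus small noise
rng = np.random.default_rng(13)
latent = rng.normal(size=(1000, 2))
X_toy = np.hstack([latent, latent @ rng.normal(size=(2, 3))])
X_toy += rng.normal(scale=0.05, size=X_toy.shape)

pca = PCA().fit(X_toy)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# smallest number of components whose cumulative variance reaches 99%
n_for_99 = int(np.searchsorted(cum_var, 0.99) + 1)
print(np.round(cum_var, 4))
print(f"components needed for 99% variance: {n_for_99}")
```

Keep in mind that explained variance only describes the inputs. A component that explains little variance in the features can still matter for predicting the target, which is consistent with the MSE degrading as components are removed.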

In [None]:
# delete existing vars to save memory later
del X_train_rd
del X_val_rd
del y_train_preds
del y_val_preds
gc.collect()

<a id="ridge-lasso"></a>
### 5.4 LASSO and Ridge Regression Exploration

#### Exploring Ridge Hyper-parameters

In [None]:
alpha_values = [0, 0.01, 0.1, 0.5, 1.0, 1.5, 3, 10, 20, 50, 100, 
                200, 250, 500, 750, 1000, 2000, 3000, 4000, 5000]

ridge_mse_scores = []
mean_ridge_mse_scores = []

for alpha in tqdm(alpha_values):
    ridge_reg = Ridge(alpha=alpha, fit_intercept=True)
    ridge_scores = cross_val_score(ridge_reg, X, y, 
                                   scoring='neg_mean_squared_error', cv=5)
    mse_scores = -ridge_scores
    ridge_mse_scores.append(mse_scores)
    mean_ridge_mse_scores.append(mse_scores.mean())

ridge_mse_scores = np.array(ridge_mse_scores)
mean_ridge_mse_scores = np.array(mean_ridge_mse_scores)

print(f"Ridge MSE: {mean_ridge_mse_scores.mean():.5f} +/- {mean_ridge_mse_scores.std():.5f}")

# calculate standard deviation to annotate on our plot
ridge_mse_scores_std = ridge_mse_scores.std(axis=1)

In [None]:
# plot mean and standard deviation of cross-val scores
plt.figure(figsize=(18,6))
sns.lineplot(x=alpha_values, y=mean_ridge_mse_scores)
plt.fill_between(alpha_values, mean_ridge_mse_scores - ridge_mse_scores_std, 
                 mean_ridge_mse_scores + ridge_mse_scores_std, 
                 color='tab:blue', alpha=0.2)
plt.semilogx(alpha_values, mean_ridge_mse_scores + ridge_mse_scores_std, 'b--')
plt.semilogx(alpha_values, mean_ridge_mse_scores - ridge_mse_scores_std, 'b--')
plt.ylabel("Cross-Validation MSE (Average)", weight='bold', size=14)
plt.xlabel("Ridge Alpha Hyper-Parameter", weight='bold', size=14)
plt.show()

#### Exploring LASSO Hyper-parameters

LASSO performs automatic feature selection: its L1 penalty drives the coefficients of low-importance features to exactly zero during training. How aggressively it does so is controlled by the `alpha` hyper-parameter we select at model creation. We'll explore a range of values, similarly to how we did with Ridge regression above:

In [None]:
# log-spaced alpha values between 1e-4 and 10
alpha_values = np.logspace(-4, 1, 20)

lasso_mse_scores = []
mean_lasso_mse_scores = []

for alpha in tqdm(alpha_values):
    lasso_reg = Lasso(alpha=alpha, fit_intercept=True, max_iter=1000, tol=0.5)
    lasso_scores = cross_val_score(lasso_reg, X, y,
                                   scoring='neg_mean_squared_error', cv=5)
    mse_scores = -lasso_scores
    lasso_mse_scores.append(mse_scores)
    mean_lasso_mse_scores.append(mse_scores.mean())

lasso_mse_scores = np.array(lasso_mse_scores)
mean_lasso_mse_scores = np.array(mean_lasso_mse_scores)

print(f"LASSO MSE: {mean_lasso_mse_scores.mean():.5f} +/- {mean_lasso_mse_scores.std():.5f}")

In [None]:
# calculate standard deviation to annotate on our plot
lasso_mse_scores_std = lasso_mse_scores.std(axis=1)

# plot mean and standard deviation of cross-val scores
plt.figure(figsize=(16,6))
sns.lineplot(x=alpha_values, y=mean_lasso_mse_scores)
plt.fill_between(alpha_values, mean_lasso_mse_scores - lasso_mse_scores_std, 
                 mean_lasso_mse_scores + lasso_mse_scores_std, 
                 color='tab:blue', alpha=0.2)
plt.semilogx(alpha_values, mean_lasso_mse_scores + lasso_mse_scores_std, 'b--')
plt.semilogx(alpha_values, mean_lasso_mse_scores - lasso_mse_scores_std, 'b--')
plt.ylabel("Cross-Validation MSE (Average)", weight='bold', size=14)
plt.xlabel("LASSO Alpha Hyper-Parameter", weight='bold', size=14)
plt.show()

It appears that, for this problem, neither LASSO nor Ridge performs any better than a basic linear regressor; all three score about the same. 

To perform well on this dataset, it appears we need more complex models that can better characterise the relations between our input data and dependent target variable.
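
To see the qualitative difference between the two penalties, independent of this dataset, the sketch below fits both models on synthetic data where only two of ten features carry signal. The data, the coefficients and the choice of `alpha=0.1` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(13)
X_toy = rng.normal(size=(500, 10))
true_coef = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])  # 8 irrelevant features
y_toy = X_toy @ true_coef + rng.normal(scale=0.5, size=500)

lasso = Lasso(alpha=0.1).fit(X_toy, y_toy)
ridge = Ridge(alpha=0.1).fit(X_toy, y_toy)

# L1 sets coefficients exactly to zero; L2 only shrinks them towards zero
print("LASSO zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
```

On data like ours, where no small subset of features dominates, this sparsity offers little advantage, which may be part of why the regularised models score so close to plain linear regression.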

<a id="bootstrap-linear-regression"></a>
### 5.5 Bootstrapping a basic linear regression model to obtain a simple ensemble

Before moving on to more complex models, we'll briefly investigate how well a simple linear regression model can perform when combined with bootstrap aggregation (bagging):

In [None]:
lr_bag_reg = BaggingRegressor(LinearRegression(), n_estimators=5, bootstrap=True, 
                              max_samples=0.75, max_features=0.75)

lr_bag_mse_scores = []
mean_lr_bag_mse_scores = []

# times to repeat the cross validation
repeats = 2

for i in tqdm(range(repeats)):
    lr_bag_scores = cross_val_score(lr_bag_reg, X, y,
                                    scoring='neg_mean_squared_error', cv=5)
    mse_scores = -lr_bag_scores
    lr_bag_mse_scores.append(mse_scores)
    mean_lr_bag_mse_scores.append(mse_scores.mean())

lr_bag_mse_scores = np.array(lr_bag_mse_scores)
mean_lr_bag_mse_scores = np.array(mean_lr_bag_mse_scores)

print(f"Lin Reg Bagging MSE: {mean_lr_bag_mse_scores.mean():.5f} +/- {mean_lr_bag_mse_scores.std():.5f}")

<a id="catboost-model"></a>
### 5.6 More complex models - CatBoost Regressor

In [None]:
# if training on GPU:
#cb_reg = CatBoostRegressor(task_type='GPU', random_seed=13, verbose=400)

#cb_reg = CatBoostRegressor(random_seed=13, verbose=400)
#cb_reg.fit(X_train, y_train, eval_set=(X_val, y_val), use_best_model=True, plot=True)

In [None]:
#train_preds = cb_reg.predict(X_train)
#val_preds = cb_reg.predict(X_val)

# calculate root mean squared error on validation split preds
#cat_rmse = np.sqrt(mean_squared_error(y_val, val_preds))

#print(f"CatBoost Regressor RMSE: {cat_rmse}")

Let's have a quick look at the feature importances of this model for insight:

In [None]:
#feat_importances = cb_reg.get_feature_importance(prettified=True)

#plt.figure(figsize=(12, 12))
#sns.barplot(x="Importances", y="Feature Id", data=feat_importances)
#plt.title('CatBoost features importance:')

Similarly, let's see how our residuals look compared to those of the previous simple linear regression models:

In [None]:
#residual_plot(y_train[:10000], train_preds[:10000], 
#              y_val[:10000], val_preds[:10000], 
#              title="CatBoost Residual Plot")

In [None]:
# delete existing vars to save memory later
#del train_preds
#del val_preds
#gc.collect()

Let's make some predictions on the test set, and save them for later should we need them:

In [None]:
#cb_test_preds = cb_reg.predict(X_test)

---

<a id="ann-models"></a>
## 6. Production of various Deep ANN Models

In this section we'll try a few different approaches by building several deep neural networks.

<a id="ann-model-1"></a>
### 6.1 Model 1 - Deep ANN Regressor with Monte Carlo Dropout and Batch Norm

We'll produce a model that uses Monte Carlo dropout: dropout stays active at inference time and many stochastic predictions are averaged, which provides a form of ensembling for our final model predictions.

In [None]:
class MonteCarloDropout(layers.Dropout):
    """ Class that overrides default call function used by standard 
        dropout to keep dropout active during inference """
    def call(self, inputs):
        return super().call(inputs, training=True)

In [None]:
def pred_mc_dropout(model, test_inputs, n_samples=50):
    """ Make n_samples stochastic predictions using the passed model and 
        input features, and return their mean """
    preds = [model.predict(test_inputs) for _ in range(n_samples)]
    return np.mean(preds, axis=0)
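
To make the idea concrete without depending on a trained Keras model, here is a minimal NumPy sketch of what Monte Carlo dropout does at inference time. The toy one-layer network, its random weights, and the helper names `stochastic_forward` and `mc_predict` are assumptions for illustration only and are not part of this notebook's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: one hidden layer with fixed random weights
W1 = rng.normal(size=(8, 32)) / np.sqrt(8)
w2 = rng.normal(size=(32, 1)) / np.sqrt(32)


def stochastic_forward(x, rate=0.45):
    """One forward pass with dropout left on (inverted dropout scaling)."""
    hidden = np.maximum(x @ W1, 0.0)
    mask = rng.random(hidden.shape) >= rate   # keep each unit with prob 1 - rate
    hidden = hidden * mask / (1.0 - rate)
    return hidden @ w2


def mc_predict(x, n_samples=50):
    """Average n_samples stochastic passes; the std is a rough spread estimate."""
    draws = np.stack([stochastic_forward(x) for _ in range(n_samples)])
    return draws.mean(axis=0), draws.std(axis=0)


x_demo = rng.normal(size=(5, 8))
mc_mean, mc_std = mc_predict(x_demo, n_samples=50)
print(mc_mean.ravel())
print(mc_std.ravel())
```

Each call to `stochastic_forward` uses a fresh random mask, so repeated calls on the same input disagree; averaging them is what gives the ensembling effect. The per-row standard deviation comes for free and can serve as a rough indication of how uncertain the model is about each prediction.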

In [None]:
def ann_model(monte_carlo_dropout=False, dropout_val=0.45, lr=2e-3, 
              input_feat_dim=X_train.shape[1]):
    """ Create a Deep NN for regression that supports monte carlo dropout
        and uses Batch Norm """
    model = models.Sequential()
    
    model.add(layers.Dense(200, activation='elu', input_shape=(input_feat_dim,), 
                           kernel_initializer='he_normal'))
    model.add(BatchNormalization())
    if monte_carlo_dropout:
        model.add(MonteCarloDropout(dropout_val))
    else:
        model.add(layers.Dropout(dropout_val))
    model.add(layers.Dense(100, activation='elu', kernel_initializer='he_normal'))
    model.add(BatchNormalization())
    if monte_carlo_dropout:
        model.add(MonteCarloDropout(dropout_val))
    else:
        model.add(layers.Dropout(dropout_val))
    model.add(layers.Dense(50, activation='elu', kernel_initializer='he_normal'))
    model.add(BatchNormalization())
    if monte_carlo_dropout:
        model.add(MonteCarloDropout(dropout_val))
    else:
        model.add(layers.Dropout(dropout_val))
        
    # regression output layer - no activation
    model.add(layers.Dense(1))
        
    model.compile(optimizer=keras.optimizers.Nadam(lr=lr), 
                  loss='mse', metrics=['mse'])
    
    return model

In [None]:
model_1 = ann_model(monte_carlo_dropout=True, lr=1e-3)

In [None]:
def schedule_lr_rate(epoch, lr):
    """ Use initial learning rate for 20 epochs and then
        decrease it exponentially """
    if epoch < 20:
        return lr
    else:
        return lr * tf.math.exp(-0.1)
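
To see what this schedule produces, the sketch below traces the learning rate over 30 epochs in plain Python. It assumes the scheduler callback feeds the current learning rate back into the function each epoch, and it swaps in `math.exp` so that the example has no TensorFlow dependency; `schedule_sketch` and `lr_trace` are illustrative names only.

```python
import math


def schedule_sketch(epoch, lr):
    """Same rule as above: hold for 20 epochs, then decay by exp(-0.1) per epoch."""
    if epoch < 20:
        return lr
    return lr * math.exp(-0.1)


lr = 1e-3
lr_trace = []
for epoch in range(30):
    # the current learning rate is passed back into the schedule each epoch
    lr = schedule_sketch(epoch, lr)
    lr_trace.append(lr)

print(f"epoch 19: {lr_trace[19]:.6f}")
print(f"epoch 29: {lr_trace[29]:.6f}")
```

Because the decay compounds, ten epochs after the hold period the rate has dropped by a factor of `exp(-1)`, i.e. to roughly 37% of its initial value.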

In [None]:
# exponential-decay lr scheduler (ReduceLROnPlateau below gave better performance)
#lr_scheduler = tf.keras.callbacks.LearningRateScheduler(schedule_lr_rate)

# create learning rate scheduler
#lr_scheduler = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, 
#                                                   patience=5, verbose=0, 
#                                                   min_delta=0.0001, mode='min')

# create an early stopper callback
early_stopper = keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)

# list of callbacks to use
trg_callbacks = [early_stopper]
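
As a rough illustration of how a patience-based early stopper behaves, the sketch below applies a simplified patience rule to a list of validation losses. It is not Keras's actual implementation; the function `early_stop_epoch` and the loss values are hypothetical.

```python
def early_stop_epoch(val_losses, patience=20):
    """Return (stop_epoch, best_epoch), 0-indexed, under a simplified patience rule."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # improvement: remember it and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # patience never ran out: training used every epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: improve for 5 epochs, then plateau
demo_losses = [1.0, 0.8, 0.7, 0.65, 0.64] + [0.66] * 10
stop_epoch, best_epoch = early_stop_epoch(demo_losses, patience=3)
print(stop_epoch, best_epoch)
```

With `restore_best_weights=True`, the weights kept after stopping would be those from `best_epoch` rather than from the final epoch that was run.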

In [None]:
history = model_1.fit(X_train, y_train, epochs=50, 
                      batch_size=256, validation_data=(X_val, y_val), 
                      callbacks=trg_callbacks)

In [None]:
def plot_history_results(history, metric='mse', figsize=(12,5)):
    """ Helper function for plotting history from keras model """
    
    # gather desired features
    trg_loss = history.history['loss']
    val_loss = history.history['val_loss']
    epochs = range(1, len(trg_loss) + 1)

    # plot training and validation losses
    fig = plt.figure(figsize=figsize)
    ax = fig.add_subplot(1, 1, 1)
    plt.plot(epochs, trg_loss, marker='o', label='Training Loss')
    plt.plot(epochs, val_loss, marker='x', label='Validation Loss')
    plt.title("Training / Validation Loss")
    ax.set_ylabel("Loss")
    ax.set_xlabel("Epochs")
    plt.legend(loc='best')
    plt.tight_layout()
    plt.ylim(0.0, 2.0)
    plt.show()

In [None]:
plot_history_results(history)

In [None]:
#val_preds = model_1.predict(X_val)
#mse = mean_squared_error(y_val, val_preds)
#print(f'Validation MSE: {mse:.4f}')

Note that a single set of predictions with Monte Carlo dropout enabled is not sufficient. Because the dropout layers stay active during inference, each forward pass gives a different prediction, so we must average a large number of them:

In [None]:
val_preds = pred_mc_dropout(model_1, X_val, n_samples=25)
mse = mean_squared_error(y_val, val_preds)
print(f'Validation MSE with Monte Carlo Preds: {mse:.4f}')

In [None]:
# delete existing vars to save memory later
del val_preds
gc.collect()

# clear keras session
K.clear_session()

The downside of Monte Carlo dropout is the longer inference time, since every prediction requires many forward passes. In return, averaging those passes can give a worthwhile improvement in performance, so the number of samples is a compromise between the two.
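
The trade-off can be sketched numerically. The simulation below assumes, purely for illustration, that each stochastic pass returns the same underlying prediction plus independent unit-variance noise from the dropout masks; real passes will not follow this exactly. Under that assumption, the spread of the averaged prediction shrinks roughly as one over the square root of the number of samples, while inference cost grows linearly.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: every pass = true prediction (0.0) + unit-variance noise
n_trials = 4000
spread = {}
for n_samples in (1, 5, 25, 50):
    passes = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_samples))
    mc_means = passes.mean(axis=1)          # one averaged prediction per trial
    spread[n_samples] = mc_means.std()
    print(f"n_samples={n_samples:>2}: spread of averaged prediction = {spread[n_samples]:.3f}")
```

Going from 25 to 50 samples doubles the inference time but only reduces the spread by a factor of about 1.4, which is why moderate values are often a reasonable choice.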

Let's use this model to make a set of predictions on the test set to use as a submission. First, we'll retrain on the entire training set, rather than just the training subset we used above. We'll only train for a maximum of 40 epochs though, since training began to saturate above.

In [None]:
model_1 = ann_model(monte_carlo_dropout=True, lr=1e-3)

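# create an early stopper callback - no validation data is passed below,
# so monitor the training loss instead of the default val_loss
early_stopper = keras.callbacks.EarlyStopping(monitor='loss', patience=20, 
                                              restore_best_weights=True)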

# list of callbacks to use
trg_callbacks = [early_stopper]

# fit on full set of training data
history = model_1.fit(X, y, epochs=40, batch_size=256, callbacks=trg_callbacks)

In [None]:
test_preds_1 = pred_mc_dropout(model_1, X_test, n_samples=50)

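Let's combine these predictions with the CatBoost ones we made earlier as a simple ensemble. Since the CatBoost cells above are currently commented out, `final_preds` is just a copy of the network's predictions for now.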

In [None]:
final_preds = test_preds_1.copy()
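
If the CatBoost predictions were available, the simple ensemble described above could be sketched as a weighted average. The helper `blend_predictions` and the demo arrays are hypothetical; the only assumption about the real data is that the network returns an `(n, 1)` array while a gradient boosting model returns a flat `(n,)` array.

```python
import numpy as np


def blend_predictions(preds_a, preds_b, weight_a=0.5):
    """Weighted average of two prediction arrays, flattened to 1-D first."""
    a = np.asarray(preds_a, dtype=float).ravel()
    b = np.asarray(preds_b, dtype=float).ravel()
    if a.shape != b.shape:
        raise ValueError("prediction arrays must have the same length")
    return weight_a * a + (1.0 - weight_a) * b


# Hypothetical stand-ins: Keras-style (n, 1) output and a flat (n,) output
nn_preds_demo = np.array([[7.0], [8.0], [9.0]])
gbm_preds_demo = np.array([7.5, 8.5, 8.0])
blended = blend_predictions(nn_preds_demo, gbm_preds_demo)
print(blended)
```

Flattening first matters: adding an `(n, 1)` array to an `(n,)` array directly would broadcast to an `(n, n)` matrix rather than raise an error.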

In [None]:
del test_preds_1
#del cb_test_preds
gc.collect()
K.clear_session()

In [None]:
# save submission in csv format
#submission_df = pd.read_csv(os.path.join(data_dir, "sample_submission.csv"))
#submission_df['target'] = final_preds
#submission_df.to_csv('submission.csv', index=False)

<a id="ann-model-2"></a>
### 6.2 Model 2 - Deep Self-Normalising SELU Network with Monte Carlo AlphaDropout

With traditional deep neural networks, training becomes difficult once the network is very deep, largely because of exploding and vanishing gradients. Certain network architectures overcome these issues, one of which is the self-normalising SELU network.

We'll produce a deep network (with around 20 hidden layers) and see how well it performs on the given regression challenge.

An important consideration for SELU networks is that our data is correctly standardised before it is fed into the network. We also need to modify the way dropout works if we want to use it as a regularisation technique, because standard dropout breaks self-normalisation. For this we can apply AlphaDropout. Like before, we'll also make use of Monte Carlo dropout for our model, modifying it for AlphaDropout accordingly.
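
The self-normalising behaviour can be checked with a small NumPy experiment, independent of Keras. This is only a sketch: it uses random, untrained weights, approximates the `lecun_normal` initialiser with a plain (untruncated) normal distribution, and feeds in synthetic standardised inputs.

```python
import numpy as np

# SELU constants from Klambauer et al. (2017)
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(z):
    # clamp inside exp so large positive values cannot overflow
    negative_part = SELU_ALPHA * (np.exp(np.minimum(z, 0.0)) - 1.0)
    return SELU_SCALE * np.where(z > 0.0, z, negative_part)


rng = np.random.default_rng(0)
n_layers, n_units = 20, 100

# standardised inputs: zero mean, unit variance per feature
activations = rng.normal(size=(2000, n_units))

for layer in range(n_layers):
    # LeCun-style initialisation: std = sqrt(1 / fan_in)
    weights = rng.normal(scale=np.sqrt(1.0 / n_units), size=(n_units, n_units))
    activations = selu(activations @ weights)

final_mean = activations.mean()
final_std = activations.std()
print(f"after {n_layers} layers: mean={final_mean:.2f}, std={final_std:.2f}")
```

The activations should stay close to zero mean and unit standard deviation even after 20 layers, which is what keeps gradients from vanishing or exploding. If the inputs were not standardised, the first layers would start away from this fixed point, which is why the preprocessing step matters.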

In [None]:
class MCAlphaDropout(keras.layers.AlphaDropout):
    """ Custom Monte Carlo Alpha Dropout layer """
    def call(self, inputs):
        return super().call(inputs, training=True)
    
    
def pred_mc_dropout(model, test_inputs, n_samples=50):
    """ Make a large number of predictions (equal to n_samples) using the 
        passed model and input features """
    pred_probs = [model.predict(test_inputs) for samples in range(n_samples)]
    return np.mean(pred_probs, axis=0)
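
For intuition about what AlphaDropout does differently from standard dropout, here is a NumPy sketch following the formulation in the SELU paper (Klambauer et al., 2017). The function `alpha_dropout` is an illustrative re-implementation and not the Keras layer itself. Dropped units are set to SELU's negative saturation value rather than zero, and an affine correction restores zero mean and unit variance.

```python
import numpy as np

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def alpha_dropout(x, rate, rng):
    """Alpha dropout on standardised activations (illustrative sketch)."""
    alpha_p = -SELU_ALPHA * SELU_SCALE          # SELU's negative saturation value
    keep = rng.random(x.shape) >= rate          # True where the unit is kept
    # affine correction so zero-mean / unit-variance inputs stay that way
    a = ((1.0 - rate) * (1.0 + rate * alpha_p ** 2)) ** -0.5
    b = -a * alpha_p * rate
    dropped = np.where(keep, x, alpha_p)
    return a * dropped + b


rng = np.random.default_rng(1)
x_demo = rng.normal(size=200_000)               # standardised activations
out = alpha_dropout(x_demo, rate=0.4, rng=rng)
print(f"mean={out.mean():.3f}, std={out.std():.3f}")
```

The output keeps approximately the same mean and standard deviation as the input, so the self-normalising property of the SELU layers is preserved even with dropout active.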

In [None]:
def dnn_selu(n_layers=20, neurons=100, dropout_rate=0.4):
    """ Deeply stacked SELU network with self-normalisation and Monte
        Carlo AlphaDropout 
    """
    model = keras.models.Sequential()
    
    # stack SELU layers (n_layers avoids shadowing the imported keras `layers`)
    for layer in range(n_layers):
        model.add(keras.layers.Dense(neurons, activation='selu',
                                     kernel_initializer='lecun_normal'))
        
        # add alpha dropout after the last two hidden layers only
        if layer >= (n_layers - 2):
            model.add(MCAlphaDropout(rate=dropout_rate))
    
    # final regression layer - no activation
    model.add(keras.layers.Dense(1))
    
    model.compile(loss='mse', 
                  optimizer=keras.optimizers.Nadam(lr=1e-4), 
                  metrics=['mse'])
    return model

In [None]:
#model_2 = dnn_selu()

# create an early stopper callback and a checkpoint callback
#early_stopper = keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)

#checkpoint_cb = keras.callbacks.ModelCheckpoint("dnn_selu_dropout.h5", save_best_only=True)

# list of callbacks to use
#trg_callbacks = [early_stopper, checkpoint_cb]

# train our model and store the results
#history = model_2.fit(X_train, y_train, epochs=1, 
#                      batch_size=256, validation_data=(X_val, y_val), 
#                      callbacks=trg_callbacks)

In [None]:
#plot_history_results(history)

In [None]:
#val_preds = pred_mc_dropout(model_2, X_val, n_samples=1)
#mse = mean_squared_error(y_val, val_preds)
#print(f'Validation MSE with SELU Model & Monte Carlo Preds: {mse:.4f}')

In [None]:
#test_preds_2 = pred_mc_dropout(model_2, X_test, n_samples=1)

In [None]:
#submission_df = pd.read_csv(os.path.join(data_dir, "sample_submission.csv"))
#submission_df['target'] = test_preds_2
#submission_df.to_csv('submission.csv', index=False)

---

<a id="test-predictions"></a>
## 7. Final Test Predictions

Finally, we'll take the test-set predictions produced by our previous models (`final_preds`) and write them to a submission file.

In [None]:
submission_df = pd.read_csv(os.path.join(data_dir, "sample_submission.csv"))
submission_df['target'] = final_preds
submission_df.to_csv('submission.csv', index=False)

Many thanks for following along - I hope you enjoyed this work!