## Introduction

This notebook uses the Delhi house price dataset to apply and evaluate several Machine Learning (ML) regression methods, comparing how well each predicts house prices in Delhi from the other variables in the dataset.

Specifically, a (i) Support Vector Regressor, (ii) Random Forest Regressor, (iii) K-Nearest Neighbour Regressor and (iv) Deep Learning Multilayer Perceptron (MLP) Regressor are applied. Of the methods, the Random Forest Regressor performed the best over 5 re-runs, followed very closely by the Deep Learning MLP Regressor, and then the Support Vector Regressor and K-Nearest Neighbour Regressor.

This notebook is split up into five different stages as follows:

**1) Stage 1: Data Description & Data Exploration**

**2) Stage 2: Data Preprocessing**

**3) Stage 3: Splitting Training/Testing subsets**

**4) Stage 4: Model Fitting and Evaluation**

**5) Stage 5: Model Comparison**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Stage 1: Data Description/Exploration

The first stage involves describing and exploring the data provided. Section 1.1 describes the data, whilst section 1.2 covers the data exploration.

### 1.1 Data Description

The data is described below using the Pandas '.head()' method, which returns the first few rows of the dataframe; the '.describe()' method, which provides summary statistics for the numeric columns; the '.dtypes' attribute, which lists the datatype in which each column is stored; and the '.shape' attribute, which gives the dimensions of the dataframe.

In [None]:
# load in the delhi house price dataframe
df = pd.read_csv('../input/delhi-house-price-prediction/MagicBricks.csv')

In [None]:
# inspect the first few rows of the dataframe
df.head()

In [None]:
# describe the dataset
df.describe()

In [None]:
# obtain datatypes
print(df.dtypes)

In [None]:
# outline the shape of the dataset
print(df.shape)

### 1.2 Data Exploration

This section provides a brief exploration of the data provided in the form of a Pearson's correlation matrix which examines the relationships between all variables where such a linear correlation can be applied appropriately, and an examination of the distributions of variables.

The correlation matrix computed below indicates that 'Area', 'BHK' (Bedroom, Hall, Kitchen), 'Bathroom' and 'Per_Sqft' are positively correlated with the house price variable that is to be predicted. However, the 'Parking' variable exhibits a negative correlation with house prices, which is perhaps unexpected, as more parking spaces would be expected to increase the value of a house.
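As a minimal sketch of what each cell of the correlation matrix measures, Pearson's coefficient can be computed by hand with NumPy. The helper name `pearson_r` and the toy values are illustrative assumptions and are not taken from the dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centred = x - x.mean()
    y_centred = y - y.mean()
    # covariance term divided by the product of the two spreads
    return float((x_centred * y_centred).sum()
                 / np.sqrt((x_centred ** 2).sum() * (y_centred ** 2).sum()))

# hypothetical toy values, not taken from the dataset
area = [500.0, 750.0, 1000.0, 1500.0, 2000.0]
price = [2.0e6, 3.5e6, 4.0e6, 7.5e6, 9.0e6]
parking = [3.0, 1.0, 2.0, 1.0, 1.0]

r_area_price = pearson_r(area, price)        # positive: price rises with area
r_parking_price = pearson_r(parking, price)  # negative: price falls as parking rises
print(r_area_price, r_parking_price)
```

A coefficient near +1 or -1 indicates a strong linear relationship, whilst a value near 0 indicates little linear relationship.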

In [None]:
# import seaborn and matplotlib.pyplot for data visualisation
import seaborn as sns
import matplotlib.pyplot as plt

# produce a correlation matrix between the numeric variables only
# (numeric_only is required on recent pandas versions, where non-numeric columns would otherwise raise an error)
corr = df.corr(numeric_only=True)

# generate a custom diverging colormap
cmap = sns.diverging_palette(150, 275, as_cmap=True)

# draw the heatmap with the correct aspect ratio
sns.set(rc={'figure.figsize': (17.0, 8.0)}, font_scale=1.2)
sns.heatmap(corr, cmap=cmap, vmax=1.0, vmin=corr.min().min(), center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)

plt.title("A correlation matrix to show the Pearson's \ncorrelation between variables", fontsize=20)
plt.yticks(rotation=0)
plt.xticks(rotation=45)

The distributions of the variables are also examined below with histograms. 'Area' and 'Per_Sqft', and particularly 'Parking' and 'Price', display strongly right-skewed, approximately exponential distributions, whilst 'BHK' and 'Bathroom' display comparatively more normal distributions.

In [None]:
# Set up the matplotlib figure
f, ax = plt.subplots(3, 2, constrained_layout=True)
ax[0,0].hist(df['Area'], color='red', bins=50, edgecolor='black', linewidth=1.2)
ax[0,1].hist(df['BHK'], color='blue', bins=10, edgecolor='black', linewidth=1.2)
ax[1,0].hist(df['Bathroom'], color='green', bins=10, edgecolor='black', linewidth=1.2)
ax[1,1].hist(df['Parking'], color='purple', bins=50, edgecolor='black', linewidth=1.2)
ax[2,0].hist(df['Price'], color='yellow', bins=50, edgecolor='black', linewidth=1.2)
ax[2,1].hist(df['Per_Sqft'], color='pink', bins=50, edgecolor='black', linewidth=1.2)

# set all y-labels
plt.setp(ax[:, :], ylabel='Frequency')

# individually set subplot titles and x axis labels
plt.setp(ax[0, 0], xlabel='Area')
ax[0, 0].set_title('Area value frequency histogram', fontsize=20)

plt.setp(ax[0, 1], xlabel='BHK')
ax[0, 1].set_title('BHK value frequency histogram', fontsize=20)


plt.setp(ax[1, 0], xlabel='Bathroom')
ax[1, 0].set_title('Bathroom value frequency histogram', fontsize=20)


plt.setp(ax[1, 1], xlabel='Parking')
ax[1, 1].set_title('Parking value frequency histogram', fontsize=20)


plt.setp(ax[2, 0], xlabel='Price')
ax[2, 0].set_title('Price value frequency histogram', fontsize=20)


plt.setp(ax[2, 1], xlabel='Per_Sqft')
ax[2, 1].set_title('Per_Sqft value frequency histogram', fontsize=20)

f.tight_layout(pad=1.2)

## Stage 2: Data Preprocessing

With the data introduced and briefly explored, Stage 2 involves preprocessing the data so that it is in a format that can be interpreted by a machine. This involves first cleaning the data (section 2.1), converting the datatypes (section 2.2) and finally scaling the data (section 2.3).

### 2.1 Data cleaning

The data is cleaned by removing any rows that contain NA values.

In [None]:
# remove rows if they contain NA values
df = df.dropna()

### 2.2 Datatype Conversions

The 'BHK' and 'Price' variables are stored as integers and are converted to floats so that all numeric inputs share a consistent floating-point format.

In [None]:
# convert integers to floats
df['BHK'] = df['BHK'].astype(float)
df['Price'] = df['Price'].astype(float)

The columns with categorical variables are also One Hot Encoded. This takes a column containing n unique categorical values and 'splits' it into n new columns, one per unique value. Each row in these new columns contains a 1 if the row held that category and a 0 otherwise, which allows the categorical values to be used as numeric model inputs.

The One Hot Encoding is done using the Pandas 'pd.get_dummies()' function.
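A small illustration of the encoding on a toy column may help. The category names here are hypothetical and serve only to show the shape of the output.

```python
import pandas as pd

# hypothetical toy column; the category names are illustrative only
toy = pd.DataFrame({'Furnishing': ['Furnished', 'Unfurnished',
                                   'Semi-Furnished', 'Furnished']})

# one new column per unique category, holding 1.0 where the row matches
dummies = pd.get_dummies(toy['Furnishing'], dtype=float)
print(dummies)
```

Each row of the result contains exactly one 1.0. When several encoded columns are concatenated, as in this notebook, the `prefix` argument of `pd.get_dummies()` is one way to keep the new column names distinct should two source columns share a category name.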

In [None]:
# get dummies for one hot encoding
furn_dummies = pd.get_dummies(df['Furnishing'], dtype=float)
loca_dummies = pd.get_dummies(df['Locality'], dtype=float)
stat_dummies = pd.get_dummies(df['Status'], dtype=float)
tran_dummies = pd.get_dummies(df['Transaction'], dtype=float)
type_dummies = pd.get_dummies(df['Type'], dtype=float)

# remove old columns
df = df.drop(['Furnishing', 'Locality', 'Status', 'Transaction', 'Type'], axis=1)

# concat the one hot encoded dataframes onto the main dataframe
df = pd.concat([df, furn_dummies, loca_dummies, stat_dummies, tran_dummies, type_dummies], axis=1)
print(df.shape)

### 2.3 Data Scaling

In order to accelerate the calculations made using the various Regressors which are to be applied, the data is scaled using a Min-Max scaler which will scale all values to between 1 and 2. However, prior to scaling between values of 1 and 2 with the Min-Max scaler, the variables 'Area', 'Parking', 'Per_Sqft' and 'Price' are scaled logarithmically since the exploratory analysis indicated that these variables exhibited an exponential distribution. The 'BHK' variable did not have the logarithmic scaling applied, and all remaining variables (One Hot Encoded variables etc.) were not scaled as it was not deemed necessary.

In [None]:
# import the min-max scaler
from sklearn.preprocessing import MinMaxScaler

# define our scaler
scaler = MinMaxScaler(feature_range=(1, 2))

# no reverse transformation required with these columns, so we can use a fit_transform()
df['Area'] = scaler.fit_transform(np.expand_dims(np.log(df['Area']), axis=1))
df['BHK'] = scaler.fit_transform(np.expand_dims(df['BHK'], axis=1))
df['Parking'] = scaler.fit_transform(np.expand_dims(np.log(df['Parking']), axis=1))
df['Per_Sqft'] = scaler.fit_transform(np.expand_dims(np.log(df['Per_Sqft']), axis=1))

# we will need to reverse transform the price values when evaluating the models,
# so a separate scaler instance is fitted and stored for the price variable
price_scaler = MinMaxScaler(feature_range=(1, 2)).fit(np.expand_dims(np.log(df['Price']), axis=1))
df['Price'] = price_scaler.transform(np.expand_dims(np.log(df['Price']), axis=1))
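The log-then-scale transformation applied to the price, and the reverse transformation used later during evaluation, can be sketched on a handful of values. The prices and the name `toy_scaler` are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical prices, for illustration only
prices = np.array([1.0e6, 5.0e6, 2.5e7, 1.0e8])

log_prices = np.log(prices).reshape(-1, 1)   # compress the long right tail
toy_scaler = MinMaxScaler(feature_range=(1, 2)).fit(log_prices)
scaled = toy_scaler.transform(log_prices)    # values now lie between 1 and 2

# undo the two steps in reverse order: inverse min-max, then exponentiate
recovered = np.exp(toy_scaler.inverse_transform(scaled)).ravel()
print(scaled.ravel(), recovered)
```

Because the inverse steps are applied in the opposite order to the forward steps, the recovered values match the original prices up to floating-point error.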

## Stage 3: Splitting Training/Testing sets

The data was divided into testing and training sets with an 80/20 split.

In [None]:
# import the train_test_split function
from sklearn.model_selection import train_test_split

X_values = df.drop(['Price'], axis=1)
y_values = df['Price']

# now split our x and y values into train/test sets with an 80/20 percentage split
X_train, X_test, y_train, y_test = train_test_split(X_values, y_values, test_size=0.2)
print("X_train shape is", X_train.shape)
print("X_test shape is", X_test.shape)
print("y_train shape is", y_train.shape)
print("y_test shape is", y_test.shape)
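The split above is random, so each run of the notebook produces different subsets. Should a repeatable split be wanted, one option is the `random_state` argument of `train_test_split`. The sketch below uses toy arrays, and the seed of 42 is an arbitrary illustrative value that is not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)   # hypothetical features
y_toy = np.arange(10)                  # hypothetical targets

# random_state=42 is an arbitrary illustrative seed, not a value used in the notebook
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# with the same seed, both calls return identical train/test subsets
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split, split_a[0].shape, split_a[1].shape)
```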

## Stage 4: Model Fitting and Evaluation

Stage 4 involved applying four Machine Learning (ML) regressors: the Support Vector Regressor (section 4.1), Random Forest Regressor (section 4.2), K-Nearest Neighbour Regressor (section 4.3) and a custom Deep Learning Multilayer Perceptron (MLP) Regressor (section 4.4). Each Regressor was fit using the training data, and its performance was briefly evaluated using the R-squared and mean squared error (MSE) metrics.
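For reference, both evaluation metrics can be written out in a few lines of NumPy. This is a minimal sketch on toy values; the helper names `mse` and `r_squared` are illustrative, and the notebook itself uses the scikit-learn implementations.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between observed and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# toy values for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]
toy_mse = mse(y_true, y_pred)
toy_r2 = r_squared(y_true, y_pred)
print(toy_mse, toy_r2)
```

A higher R-squared and a lower MSE both indicate predictions that lie closer to the observed values. MSE is expressed in squared price units, which is why the values reported in this notebook are so large.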

### 4.1 Support Vector Regressor

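The Support Vector Regressor applies the Support Vector Machine approach to regression: it fits a function that tolerates errors smaller than a margin (epsilon) and penalises only the predictions that fall outside that margin.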

The Support Vector Regressor is defined and fit as below:

In [None]:
# import sklearn's Support Vector Regression model
from sklearn.svm import SVR

# define the model
sv_regressor = SVR(kernel='linear')  # linear kernel achieved best results

# fit the model
sv_regressor.fit(X_train, y_train)

The Support Vector Regressor is applied on unseen testing X data and evaluated as below:

In [None]:
# import the r2 and mse evaluation metrics
from sklearn.metrics import r2_score, mean_squared_error

# make predictions with unseen testing data
sv_preds = sv_regressor.predict(X_test)

# calculate r-squared on inverse transformed data
sv_r2 = r2_score(np.exp(price_scaler.inverse_transform(np.expand_dims(y_test, axis=1))),
                 np.exp(price_scaler.inverse_transform(np.expand_dims(sv_preds, axis=1))))

# calculate mse on inverse transformed data
sv_mse = mean_squared_error(np.exp(price_scaler.inverse_transform(np.expand_dims(y_test, axis=1))),
                 np.exp(price_scaler.inverse_transform(np.expand_dims(sv_preds, axis=1))))

print('R-Squared: ', sv_r2)
print('Mean Squared Error: ', np.format_float_scientific(np.asarray(sv_mse), precision=3))

Over 5 re-runs, the Support Vector Regressor's testing data predictions recorded R-squared values of 0.710-0.816 (mean 0.764) and Mean Squared Error values of 1.194x10$^{14}$-3.050x10$^{14}$ (mean 1.839x10$^{14}$) against the ground truth. NOTE: later re-runs may record values outside of these ranges due to the random nature of the train/test splitting method.

### 4.2 Random Forest Regressor

The Random Forest Regressor fits a number of decision trees on various sub-samples of the dataset. Averaging is applied to improve the predictive accuracy and control over-fitting.
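The averaging step can be checked directly with scikit-learn: the forest's prediction is the mean of the predictions of its individual trees. The sketch below uses synthetic data, and the 25 estimators and the seeds are illustrative assumptions rather than the settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic data, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(200, 3))
y_toy = X_toy[:, 0] * 2.0 + np.sin(X_toy[:, 1] * 6.0) + rng.normal(0, 0.1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_toy, y_toy)

# collect each tree's prediction, then average across trees
X_new = rng.uniform(0, 1, size=(5, 3))
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X_new)))
```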

The Random Forest Regressor is defined and fit as below:

In [None]:
# load in the Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

# define the model
rf_regressor = RandomForestRegressor(n_estimators=120)  # 120 estimators optimised performance

# fit the model
rf_regressor.fit(X_train, y_train)

The Random Forest Regressor is applied on unseen testing X data and evaluated as below:

In [None]:
# make predictions with unseen testing data
rf_preds = rf_regressor.predict(X_test)

# calculate r-squared on inverse transformed data
rf_r2 = r2_score(np.exp(price_scaler.inverse_transform(np.expand_dims(y_test, axis=1))),
                 np.exp(price_scaler.inverse_transform(np.expand_dims(rf_preds, axis=1))))

# calculate mse on inverse transformed data
rf_mse = mean_squared_error(np.exp(price_scaler.inverse_transform(np.expand_dims(y_test, axis=1))),
                 np.exp(price_scaler.inverse_transform(np.expand_dims(rf_preds, axis=1))))

print('R-Squared: ', rf_r2)
print('Mean Squared Error: ', np.format_float_scientific(np.asarray(rf_mse), precision=3))

Over 5 re-runs, the Random Forest Regressor's testing data predictions recorded R-squared values of 0.827-0.927 (mean 0.876) and Mean Squared Error values of 4.665x10$^{13}$-1.843x10$^{14}$ (mean 9.919x10$^{13}$) against the ground truth. NOTE: later re-runs may record values outside of these ranges due to the random nature of the train/test splitting method.

The higher mean R-squared and lower mean MSE indicate that this Regressor is more accurate than the Support Vector counterpart.

### 4.3 K-Nearest Neighbour Regressor

The K-Nearest Neighbour Regressor predicts the target for a sample by averaging the target values of its K nearest neighbours in the training set.
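A minimal NumPy sketch of the idea is given below, assuming Euclidean distance and an unweighted average, which correspond to the scikit-learn defaults. The helper name `knn_predict` and the toy data are illustrative.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Average the targets of the k training points closest to x_query."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]                    # indices of the k closest points
    return float(y_train[nearest].mean())

# toy one-dimensional data, for illustration only
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([0.0, 10.0, 20.0, 30.0, 100.0])

# the two nearest neighbours of 1.4 are the points at 1.0 and 2.0
prediction = knn_predict(X_toy, y_toy, np.array([1.4]), k=2)
print(prediction)
```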

The K-Nearest Neighbour Regressor is defined and fit as below:

In [None]:
# load in the K-Nearest Neighbour Regressor
from sklearn.neighbors import KNeighborsRegressor

# define the model
kn_regressor = KNeighborsRegressor(n_neighbors=4, algorithm='auto')  # 4 neighbours and 'auto' algorithm optimum

# fit the model
kn_regressor.fit(X_train, y_train)

The K-Nearest Neighbour Regressor is applied on unseen testing X data and evaluated as below:

In [None]:
# make predictions with unseen testing data
kn_preds = kn_regressor.predict(X_test)

# calculate r-squared on inverse transformed data
kn_r2 = r2_score(np.exp(price_scaler.inverse_transform(np.expand_dims(y_test, axis=1))),
                 np.exp(price_scaler.inverse_transform(np.expand_dims(kn_preds, axis=1))))

# calculate mse on inverse transformed data
kn_mse = mean_squared_error(np.exp(price_scaler.inverse_transform(np.expand_dims(y_test, axis=1))),
                 np.exp(price_scaler.inverse_transform(np.expand_dims(kn_preds, axis=1))))

print('R-Squared: ', kn_r2)
print('Mean Squared Error: ', np.format_float_scientific(np.asarray(kn_mse), precision=3))

Over 5 re-runs, the K-Nearest Neighbour Regressor's testing data predictions recorded R-squared values of 0.635-0.777 (mean 0.693) and Mean Squared Error values of 1.611x10$^{14}$-3.881x10$^{14}$ (mean 2.403x10$^{14}$) against the ground truth. NOTE: later re-runs may record values outside of these ranges due to the random nature of the train/test splitting method.

The K-Nearest Neighbour Regressor exhibits a lower mean R-squared value and higher mean MSE value which suggests that it is less accurate than the Support Vector and Random Forest counterparts.

### 4.4 A Deep Learning MLP Regressor

A Deep Learning MLP Regressor was also applied. This was implemented using PyTorch.

First, the data is converted to PyTorch tensors as below:

In [None]:
import torch
import torch.nn as nn

# tensorize our x/y train/test data to form pytorch tensors
X_train_tensor = torch.from_numpy(X_train.to_numpy()).float()
X_test_tensor = torch.from_numpy(X_test.to_numpy()).float()
y_train_tensor = torch.from_numpy(y_train.to_numpy()).float()
y_test_tensor  = torch.from_numpy(y_test.to_numpy()).float()
print("X_train_tensor shape is", X_train_tensor.shape)
print("X_test_tensor shape is", X_test_tensor.shape)
print("y_train_tensor shape is", y_train_tensor.shape)
print("y_test_tensor shape is", y_test_tensor.shape)

Specifically, the model (below) was built with two hidden layers, the first 4000 neurons wide and the second 1000 neurons wide, which together map the 318 input variables to a single output. ReLU activations were applied after each hidden layer to introduce non-linearity, and a small dropout of 0.02 was used in order to reduce overfitting. The Deep Learning MLP Regressor is constructed and defined as below:

In [None]:
# construct the deep learning MLP regressor
class MLP_Regressor(nn.Module):
    def __init__(self, input_dim, layer_sizes, dropout):
        super(MLP_Regressor, self).__init__()
        self.mlp = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(input_dim, layer_sizes[0]),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(layer_sizes[0], layer_sizes[1]),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(layer_sizes[1], 1),
        )        
        
    # the forward pass through the network
    def forward(self, input_tensor):
        
        output_tensor = self.mlp(input_tensor)  # pass the input tensor through the mlp
        
        return output_tensor
    
# now lets define the model
mlp_regressor = MLP_Regressor(X_train_tensor.shape[1],
                               [4000, 1000],
                               0.02)
print(mlp_regressor)
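As a rough check on the size of the architecture described above, the number of trainable parameters can be counted by hand: each linear layer contributes one weight per input-output pair plus one bias per output. The helper `count_mlp_parameters` is illustrative, and the input width of 318 is taken from the text above and will differ if the preprocessing yields a different number of columns.

```python
def count_mlp_parameters(input_dim, layer_sizes, output_dim=1):
    """Count the weights and biases of a fully connected network."""
    sizes = [input_dim, *layer_sizes, output_dim]
    # each layer has n_in * n_out weights plus n_out biases
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

# 318 inputs (as stated in the text), hidden layers of 4000 and 1000 neurons
total = count_mlp_parameters(318, [4000, 1000])
print(f'{total:,}')
```

Under these assumptions the network holds roughly 5.3 million parameters, most of them in the 4000-to-1000 layer. The dropout and ReLU layers add no parameters.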

In [None]:
loss_function = nn.MSELoss()  # mse loss function
optimizer = torch.optim.Adam(mlp_regressor.parameters(), lr=0.0001)  # Adam optimiser
epochs = 1000  # number of epochs
loss_vals_train = []  # hold the training loss values
loss_vals_valid = []  # hold the validation loss values

for i in range(epochs):
    y_pred_tensor = mlp_regressor(X_train_tensor)  # obtain y predictions
    single_loss = loss_function(y_pred_tensor[:-20], torch.unsqueeze(y_train_tensor[:-20], dim=1))  # calculate training loss
    loss_vals_train.append(single_loss.item())
    
    # now calculate the validation loss
    with torch.no_grad():  # disable the autograd engine
        val_loss = loss_function(y_pred_tensor[-20:], torch.unsqueeze(y_train_tensor[-20:], dim=1))  # calculate validation loss
        loss_vals_valid.append(val_loss.item())
    
    optimizer.zero_grad()  # zero the gradients
    single_loss.backward()  # backpropagate through the model
    optimizer.step()  # update parameters
    
    if i%25 == 0:
        print(f'epoch: {i:5} training loss: {single_loss.item():10.8f} validation loss: {val_loss.item():10.8f}')

The training and validation losses are plotted against the number of epochs:

In [None]:
sns.set(rc={'figure.figsize': (45.0, 20.0)})
sns.set(font_scale=8.0)
sns.set_context("notebook", font_scale=5.5, rc={"lines.linewidth": 1.0})
x_vals = np.arange(0, epochs, 1)
ax = sns.lineplot(x=x_vals, y=loss_vals_train)
ax = sns.lineplot(x=x_vals, y=loss_vals_valid)
ax.set_ylabel('Loss', labelpad=20, fontsize=75)
ax.set_xlabel('Epochs', labelpad=20, fontsize=75)
plt.legend(labels=['Training loss', 'Validation loss'], facecolor='white', framealpha=1)
plt.show()

The Deep Learning MLP Regressor is applied on unseen testing X data and evaluated as below:

In [None]:
# activate the evaluation mode for the model
mlp_regressor.eval()

# make predictions with the mlp model
mlp_preds = mlp_regressor(X_test_tensor).detach().numpy()

# calculate r-squared on inverse transformed data
mlp_r2 = r2_score(np.exp(price_scaler.inverse_transform(np.expand_dims(y_test, axis=1))),
                 np.exp(price_scaler.inverse_transform(mlp_preds)))

# calculate mse on inverse transformed data
mlp_mse = mean_squared_error(np.exp(price_scaler.inverse_transform(np.expand_dims(y_test, axis=1))),
                 np.exp(price_scaler.inverse_transform(mlp_preds)))

print('R-Squared: ', mlp_r2)
print('Mean Squared Error: ', np.format_float_scientific(np.asarray(mlp_mse), precision=3))

Over 5 re-runs, the Deep Learning MLP Regressor's testing data predictions recorded R-squared values of 0.845-0.896 (mean 0.872) and Mean Squared Error values of 6.857x10$^{13}$-1.653x10$^{14}$ (mean 1.004x10$^{14}$) against the ground truth. NOTE: later re-runs may record values outside of these ranges due to the random nature of the train/test splitting method and the way that the MLP weights are initialised.

The mean R-squared and mean MSE values from this Regressor are higher and lower respectively than the Support Vector and K-Nearest Neighbour equivalents, indicating that this Regressor is more accurate. The mean R-squared and MSE values are very similar to those of the Random Forest Regressor, although slightly lower and higher respectively, which may indicate slightly lower accuracy. However, the MLP Regressor appears to be somewhat more consistent, as the range of R-squared values recorded over 5 re-runs (0.845-0.896) is narrower than that of the Random Forest Regressor (0.827-0.927).

## Stage 5: Model Comparison

The last stage aims to compare and further evaluate each model. Specifically, the R-squared and MSE of each model's predictions run with the current kernel are plotted as barplots. Additionally, the testing data predictions are plotted against the ground truth in the form of a scatterplot, and the residuals of each model are plotted as histograms.

In [None]:
# store relevant information as lists
r2_vals = [sv_r2, rf_r2, kn_r2, mlp_r2]
mse_vals = [sv_mse, rf_mse, kn_mse, mlp_mse]
models = ['Support Vector \nRegression', 'Random Forest \nRegression',
          'K-Nearest Neighbour \nRegression', 'Deep Learning MLP \nRegression']

# convert lists into a single dataframe
accuracy_df = pd.DataFrame({'Model': models, 'R-Squared': r2_vals, 'MSE': mse_vals})

A barplot to compare the R-Squared of each model's testing data predictions is produced as below:

In [None]:
# plot r2 score as a barplot
sns.set_context("notebook", font_scale=4.5, rc={"lines.linewidth": 0.5})
ax = sns.barplot(x="Model", y="R-Squared", data=accuracy_df)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, horizontalalignment='right')
ax.set_ylabel('R-Squared', labelpad=50, fontsize=85)
ax.set_xlabel('Model', labelpad=50, fontsize=85)

plt.title("A Barplot comparing the testing data R-Squared \nof all four regressors", fontsize=100)
plt.show()

A barplot to compare the MSE of each model's testing data predictions is generated as below:

In [None]:
# plot MSE score as a barplot
sns.set_context("notebook", font_scale=4.5, rc={"lines.linewidth": 0.5})
ax = sns.barplot(x="Model", y="MSE", data=accuracy_df)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, horizontalalignment='right')
ax.set_ylabel('MSE', labelpad=50, fontsize=85)
ax.set_xlabel('Model', labelpad=50, fontsize=85)

plt.title("A Barplot comparing the testing data MSE \nof all four regressors", fontsize=100)
plt.show()

In [None]:
# obtain the ground truth and predictions of each regressor as arrays in the original price units
ground_truth = np.exp(price_scaler.inverse_transform(np.expand_dims(y_test, axis=1)))
sv_preds = np.exp(price_scaler.inverse_transform(np.expand_dims(sv_preds, axis=1)))
rf_preds = np.exp(price_scaler.inverse_transform(np.expand_dims(rf_preds, axis=1)))
kn_preds = np.exp(price_scaler.inverse_transform(np.expand_dims(kn_preds, axis=1)))
mlp_preds = np.exp(price_scaler.inverse_transform(mlp_preds))

The predictions obtained with the testing data of each model are then plotted against the ground truth using the code outlined below:

In [None]:
# Set up the matplotlib figure
f, ax = plt.subplots(2, 2)
sns.scatterplot(x=ground_truth.flatten(), y=sv_preds.flatten(), ax=ax[0, 0], color='red', s=150)
sns.scatterplot(x=ground_truth.flatten(), y=rf_preds.flatten(), ax=ax[0, 1], color='blue', s=150)
sns.scatterplot(x=ground_truth.flatten(), y=kn_preds.flatten(), ax=ax[1, 0], color='green', s=150)
sns.scatterplot(x=ground_truth.flatten(), y=mlp_preds.flatten(), ax=ax[1, 1], color='purple', s=150)

# add subfigure titles
ax[0, 0].set_title('Support Vector Regressor', fontsize=60)
ax[0, 1].set_title('Random Forest Regressor', fontsize=60)
ax[1, 0].set_title('K-Nearest Neighbour Regressor', fontsize=60)
ax[1, 1].set_title('Deep Learning MLP Regressor', fontsize=60)

# generate annotations (columns are accessed by name, as positional access on a labelled row is deprecated in recent pandas)
annotations = []
for _, row in accuracy_df.iterrows():
    annotations.append('R2: ' + str(round(row['R-Squared'], 3)) + '\nMSE: ' + str(np.format_float_scientific(np.asarray(row['MSE']), precision=3)))

# annotate the subfigures with the R-squared and MSE values
ax[0,0].annotate(annotations[0], xy=(0, sv_preds.max()*0.65), fontsize=40, color='dimgrey')
ax[0,1].annotate(annotations[1], xy=(0, rf_preds.max()*0.65), fontsize=40, color='dimgrey')
ax[1,0].annotate(annotations[2], xy=(0, kn_preds.max()*0.65), fontsize=40, color='dimgrey')
ax[1,1].annotate(annotations[3], xy=(0, mlp_preds.max()*0.65), fontsize=40, color='dimgrey')

# set all y-labels
plt.setp(ax[:, :], ylabel='Predicted \nvalues', xlabel='Observed values')

f.tight_layout(pad=2.0)

plt.show()

Plotting the observed values against the predicted values of each Regressor allows the similarity between the two to be visualised: a tighter spread of points and greater adherence to an x=y relationship indicate greater similarity between the observed and predicted values (and thus higher accuracy). With this in mind, the K-Nearest Neighbour Regressor appears to have the loosest spread of points, followed by the Support Vector Regressor, whilst the Deep Learning MLP Regressor and Random Forest Regressor exhibit the tightest spreads (although in this individual run the Random Forest Regressor did display a slightly higher R-squared). The R-squared and MSE values also follow this ordering.

Residuals are calculated, and then plotted as histograms as shown below:

In [None]:
# a function to calculate residuals (observed minus predicted)
def calculate_residuals(truth: np.ndarray, preds: np.ndarray) -> np.ndarray:
    # flatten both inputs so that the result is a 1-D array with one residual per sample
    return np.asarray(truth).ravel() - np.asarray(preds).ravel()

# calculate the residuals from the predictions of each regressor
sv_res = calculate_residuals(ground_truth, sv_preds)
rf_res = calculate_residuals(ground_truth, rf_preds)
kn_res = calculate_residuals(ground_truth, kn_preds)
mlp_res = calculate_residuals(ground_truth, mlp_preds)

In [None]:
# Set up the matplotlib figure
f, ax = plt.subplots(2, 2)
ax[0,0].hist(np.asarray(sv_res), color='red', edgecolor='black', linewidth=1.2)
ax[0,1].hist(np.asarray(rf_res), color='blue', edgecolor='black', linewidth=1.2)
ax[1,0].hist(np.asarray(kn_res), color='green', edgecolor='black', linewidth=1.2)
ax[1,1].hist(np.asarray(mlp_res), color='purple', edgecolor='black', linewidth=1.2)

# add subfigure titles
ax[0, 0].set_title('Support Vector Regressor', fontsize=60)
ax[0, 1].set_title('Random Forest Regressor', fontsize=60)
ax[1, 0].set_title('K-Nearest Neighbour Regressor', fontsize=60)
ax[1, 1].set_title('Deep Learning MLP Regressor', fontsize=60)

# annotate the subfigures with the median values
ax[0,0].annotate("Median: \n" + str(np.format_float_scientific(np.median(np.sort(np.asarray(sv_res))),
                                                             precision=3)), xy=(max(sv_res)*0.3, 40),
                 fontsize=45)

ax[0,1].annotate("Median: \n" + str(np.format_float_scientific(np.median(np.sort(np.asarray(rf_res))),
                                                             precision=3)), xy=(max(rf_res)*0.3, 40),
                 fontsize=45)

ax[1,0].annotate("Median: \n" + str(np.format_float_scientific(np.median(np.sort(np.asarray(kn_res))),
                                                             precision=3)), xy=(max(kn_res)*0.3, 40),
                 fontsize=45)

ax[1,1].annotate("Median: \n" + str(np.format_float_scientific(np.median(np.sort(np.asarray(mlp_res))),
                                                             precision=3)), xy=(max(mlp_res)*0.3, 40),
                 fontsize=45)


# set all x and y-labels
plt.setp(ax[:, :], ylabel='Frequency', xlabel='Residual values')

f.tight_layout(pad=1.4)

plt.show()

The residual histograms above show the distributions of residuals, with the median residual values providing insight into whether the models have a tendency to over or underpredict values.

In the kernel used in this notebook, the Support Vector Regressor, Random Forest Regressor and Deep Learning MLP Regressor appear to have a tendency to underpredict values (positive median residual value) whilst the K-Nearest Neighbour Regressor appears to have a tendency to overpredict values (negative median residual value).
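The sign convention can be illustrated on toy values. Residuals are computed here as observed minus predicted, so a model that predicts too low yields a positive median residual and a model that predicts too high yields a negative one. The two hypothetical models below are assumptions made purely for illustration.

```python
import numpy as np

observed = np.array([100.0, 200.0, 300.0, 400.0, 500.0])
under_preds = observed * 0.9   # hypothetical model that predicts 10% too low
over_preds = observed * 1.1    # hypothetical model that predicts 10% too high

# residual = observed - predicted
median_under = float(np.median(observed - under_preds))
median_over = float(np.median(observed - over_preds))
print(median_under, median_over)
```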

It is also worth noting that the Random Forest Regressor and Deep Learning MLP Regressor both exhibit a higher frequency of residual values close to 0, whilst the Support Vector Regressor, and especially the K-Nearest Neighbour Regressor, display a slightly wider range of residual values and fewer residuals close to 0.

## Conclusion

Of the four Regressors applied, it is clear that the K-Nearest Neighbour Regressor performed the worst. This is evidenced by its low R-squared values and high MSE values, both in this kernel and in the means of the 5 re-runs, as well as by the loose spread of points in the observed vs predicted scatterplot and the comparatively lower frequency of residuals near 0.

With the same indicators in mind, the Support Vector Regressor exhibited the second worst performance. It displayed lower R-squared values and higher MSE values than the Random Forest and Deep Learning MLP Regressors, both in this kernel and in the means of the 5 re-runs, and this is also reflected in the point spread of the scatterplot and the distribution of residuals.

Separating the Random Forest Regressor and Deep Learning MLP Regressor and determining which takes the top spot is not quite as obvious as the two exhibited similar performances in terms of R-squared, MSE, spread of points on the scatterplot and distribution of residuals. However, the Random Forest Regressor did display a slightly higher (0.459%) mean R-squared value and slightly lower (1.205%) mean MSE value than the Deep Learning MLP Regressor as well as slightly higher R-squared and lower MSE values within this kernel. As such, the Random Forest Regressor is considered the best Regressor of the four, with the Deep Learning MLP Regressor closely following it. However, it is worth noting that over the 5 re-runs the Deep Learning MLP Regressor did display slightly higher consistency with regards to its lower range of R-squared values.
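The percentage differences quoted above can be reproduced from the reported means. The sketch below assumes that both differences are expressed relative to the Deep Learning MLP Regressor's mean values, which is the choice of denominator that matches the quoted figures.

```python
# mean values over 5 re-runs, as reported in sections 4.2 and 4.4
rf_mean_r2, mlp_mean_r2 = 0.876, 0.872
rf_mean_mse, mlp_mean_mse = 9.919e13, 1.004e14

# differences relative to the MLP means (assumed denominator)
r2_gain_pct = (rf_mean_r2 - mlp_mean_r2) / mlp_mean_r2 * 100
mse_drop_pct = (mlp_mean_mse - rf_mean_mse) / mlp_mean_mse * 100
print(round(r2_gain_pct, 3), round(mse_drop_pct, 3))
```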