# Introduction: Particle Swarm Regression

In this notebook we will walk through how to use [Particle Swarm Optimisation](https://en.wikipedia.org/wiki/Particle_swarm_optimization) to produce optimal Linear Regression models for a range of custom loss metrics. Using a methodology introduced by [Dietterich (1998)](https://sci2s.ugr.es/keel/pdf/algorithm/articulo/dietterich1998.pdf), we will perform a significance test on the holdout predictions to see whether training on custom loss metrics produces significantly better models.

## Model Error

In regression problems the target variable is continuous, and our model can make errors in two ways:
1. Underpredict the target
2. Overpredict the target

The Ordinary Least Squares and Gradient Descent implementations of Linear Regression both minimise Mean Squared Error as the loss function. Mean Squared Error places equal importance on an underprediction and an overprediction of the same size. In practice, however, our preferences for the two types of error are rarely the same; they are driven by the business problem at hand. For example, to improve efficiency a utility company might want to know how much electricity it can put through a [transformer](https://en.wikipedia.org/wiki/Transformer) without it overheating. For this use case an overprediction would mean a loss of efficiency, but an underprediction could lead to a failure of the transformer, blackouts for customers and heavy regulatory fines.

One solution to this problem is to add "safety" margins to our predictions. However, a more elegant approach is to encode our preferences directly in a loss function and use that function to optimise our model coefficients.
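
To make the idea of encoding preferences in a function concrete, here is a minimal sketch of an asymmetric loss. The name `asymmetric_loss` and the `under_weight` parameter are hypothetical and not part of `twlearn`. The `(predictions, targets)` argument order mirrors the `loss(predictions, Y_test)` call used in the `evaluate` helper later in this notebook. Whether `twlearn` accepts an arbitrary callable as its `loss` is an assumption and is not verified here.

```python
import numpy as np

def asymmetric_loss(predictions, targets, under_weight=3.0):
    """Mean absolute error with underpredictions weighted more heavily.

    under_weight is a hypothetical knob: 1.0 recovers plain MAE, values
    above 1.0 make an underprediction cost more than an overprediction
    of the same size.
    """
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    errors = targets - predictions  # positive error => underprediction
    weights = np.where(errors > 0, under_weight, 1.0)
    return float(np.mean(weights * np.abs(errors)))

# Same absolute error (2 units) in both directions, different penalties
targets = np.array([10.0, 10.0])
under = asymmetric_loss(np.array([8.0, 8.0]), targets)
over = asymmetric_loss(np.array([12.0, 12.0]), targets)
print(f"underprediction loss: {under}, overprediction loss: {over}")
```

With the default weight, missing low by 2 units costs three times as much as missing high by 2 units, which is the kind of preference the transformer example calls for.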

## Particle Swarm Optimisation
Particle swarm optimization (PSO) is a computational method that optimizes a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality. It solves a problem by having a population of candidate solutions, here dubbed particles, and moving these particles around in the search-space according to simple mathematical formulae over the particle's position and velocity. Each particle's movement is influenced by its local best known position, but is also guided toward the best known positions in the search-space, which are updated as better positions are found by other particles. This is expected to move the swarm toward the best solutions.

I have [coded up an implementation](https://github.com/tonyjward/machine-learning-oop/blob/master/twlearn/ParticleSwarm.py) of PSO based on the Wikipedia page linked above, which I'll use throughout this notebook.
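
For readers who want a feel for the update rule before using the library, below is a minimal sketch of a global-best PSO. It is not the `twlearn` implementation. The function name `pso_minimise`, its parameters, the initialisation range of [-5, 5] and the coefficient values (inertia 0.7, cognitive and social weights 1.5) are all illustrative assumptions.

```python
import numpy as np

def pso_minimise(objective, dim, num_particles=30, num_iterations=200,
                 inertia=0.7, cognitive=1.5, social=1.5, seed=0):
    """Minimise `objective` over R^dim with a basic global-best PSO."""
    rng = np.random.default_rng(seed)
    # Assumed search range for the initial positions: [-5, 5] per dimension
    positions = rng.uniform(-5.0, 5.0, size=(num_particles, dim))
    velocities = np.zeros((num_particles, dim))

    # Each particle remembers its own best position and cost so far
    personal_best = positions.copy()
    personal_best_cost = np.array([objective(p) for p in positions])
    best_idx = int(np.argmin(personal_best_cost))
    global_best = personal_best[best_idx].copy()
    global_best_cost = float(personal_best_cost[best_idx])

    for _ in range(num_iterations):
        r1 = rng.random((num_particles, dim))
        r2 = rng.random((num_particles, dim))
        # Velocity = momentum + pull to own best + pull to swarm best
        velocities = (inertia * velocities
                      + cognitive * r1 * (personal_best - positions)
                      + social * r2 * (global_best - positions))
        positions = positions + velocities

        costs = np.array([objective(p) for p in positions])
        improved = costs < personal_best_cost
        personal_best[improved] = positions[improved]
        personal_best_cost[improved] = costs[improved]

        best_idx = int(np.argmin(personal_best_cost))
        if personal_best_cost[best_idx] < global_best_cost:
            global_best = personal_best[best_idx].copy()
            global_best_cost = float(personal_best_cost[best_idx])

    return global_best, global_best_cost

# Toy usage: a quadratic bowl with its minimum at (1, -2)
def shifted_sphere(x):
    return float(np.sum((x - np.array([1.0, -2.0])) ** 2))

best_position, best_cost = pso_minimise(shifted_sphere, dim=2)
print(best_position, best_cost)
```

For regression, the particle position would be the vector of model coefficients and the objective would be the chosen loss evaluated on the training data.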


In [None]:
# NumPy and pandas for data manipulation
import numpy as np
import pandas as pd

# Modelling
from twlearn import LinearRegression

# Evaluation of the model
import sklearn.model_selection
from twlearn.metrics import Rmse, Mae, five_by_two_cv

## Data
For this notebook we will work with the [Boston Housing dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html), available as part of scikit-learn. The objective is to predict the median value of owner-occupied homes by training a model on past data. This is a supervised machine learning regression task: given past data, we want to train a model to predict a continuous outcome on testing data.

In [None]:
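# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release to run as written
from sklearn.datasets import load_boston
boston = load_boston()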

X = np.array(boston.data)
Y = np.array(boston.target)

# split the data into training and testing
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, test_size=0.33, random_state = 1)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")


## Ordinary Least Squares

Now let's train two linear regression models:
1. Using ordinary least squares
2. Using particle swarm optimisation with RMSE as the loss metric (RMSE and MSE are minimised by the same coefficients)

In [None]:
# OLS
OLS_model = LinearRegression()
OLS_model.fit(X_train, Y_train, optimiser = 'OLS')

# Particle Swarm Regression - Rmse
PSO_RMSE_model = LinearRegression()
PSO_RMSE_model.fit(X_train, Y_train, optimiser = 'PSO', loss = Rmse)

OLS_coef = OLS_model.coefficients()['coefficients']
PSO_RMSE_coef = PSO_RMSE_model.coefficients()['coefficients']

print(np.c_[OLS_coef, PSO_RMSE_coef])

Close but not exact. Increasing the number of iterations and particles from their defaults of 500 and 300 respectively gets us much closer to the OLS coefficients.

In [None]:
PSO_RMSE_model.fit(X_train, Y_train, optimiser = 'PSO', loss = Rmse, num_iterations = 5000, no_particles = 500)
PSO_RMSE_coef = PSO_RMSE_model.coefficients()['coefficients']
print(np.c_[OLS_coef, PSO_RMSE_coef])


OK, let's try optimising a different error metric, mean absolute error, using the `num_iterations` and `no_particles` values from above.

In [None]:
PSO_MAE_model = LinearRegression()
PSO_MAE_model.fit(X_train, Y_train, optimiser = 'PSO', loss = Mae, num_iterations = 5000, no_particles = 500)

Now we compare its predictions to those of the original OLS model using a helper function.

In [None]:
def evaluate(model, X_test, Y_test, loss):
    """ Take a fitted model and some test data and produce a test metric
    Arguments:
        model  -- a fitted model
        X_test -- test features - a numpy array
        Y_test -- test targets - a numpy array
        loss   -- a loss function to assess predictions

    Returns:
        loss - calculated loss
    """
    predictions = model.predict(X_test)
    return loss(predictions, Y_test)

print(f"OLS MAE: {evaluate(OLS_model, X_test, Y_test, Mae)}")
print(f"PSO MAE: {evaluate(PSO_MAE_model, X_test, Y_test, Mae)}")


Cool! The particle swarm regression model optimised for MAE performs better than the Ordinary Least Squares model when both are compared using MAE.

## Is this significant?
506 data points is not very many. What could we do to assess the significance? Our first thought might be to do 10-fold cross validation and compare the performance of both algorithms on each fold using Student's paired t-test. However, this would be incorrect, since one of the core assumptions of the t-test (independence) would be violated: the training sets of the folds overlap heavily. A good overview of our options is provided here, and we elect for the 5 by 2 cross validation statistic.

The intuition behind this approach is that we use 50% of the data for training and 50% for testing once to get our base estimate of the treatment effect (the difference between the models). We then do 5 repeats of 2-fold cross validation to estimate the variance of the treatment effect, which we use to calculate a t-statistic.
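
The statistic can be sketched in a few lines of NumPy. This is an illustrative version of the 5 by 2 cross validation t-statistic as described by Dietterich (1998), not the `five_by_two_cv` function from `twlearn`. The helper names `five_by_two_t_statistic`, `loss_differences`, `fit_ols`, `fit_mean` and `mae`, and the synthetic data, are all assumptions made for the example.

```python
import numpy as np

def five_by_two_t_statistic(differences):
    """differences[i, k] = loss of model A minus loss of model B
    on fold k of repeat i; expected shape is (5, 2)."""
    differences = np.asarray(differences, dtype=float)
    if differences.shape != (5, 2):
        raise ValueError("expected a (5, 2) array of loss differences")
    repeat_means = differences.mean(axis=1, keepdims=True)
    # Per-repeat variance estimate from the two folds
    variances = np.sum((differences - repeat_means) ** 2, axis=1)
    # Numerator: the difference from the first fold of the first repeat
    return float(differences[0, 0] / np.sqrt(variances.mean()))

def loss_differences(X, y, fit_a, fit_b, loss, seed=0):
    """Run 5 repeats of 2-fold CV and collect the loss differences."""
    rng = np.random.default_rng(seed)
    n = len(y)
    half = n // 2
    out = np.empty((5, 2))
    for i in range(5):
        order = rng.permutation(n)
        halves = (order[:half], order[half:])
        for k in range(2):
            train, test = halves[k], halves[1 - k]
            predict_a = fit_a(X[train], y[train])
            predict_b = fit_b(X[train], y[train])
            out[i, k] = (loss(predict_a(X[test]), y[test])
                         - loss(predict_b(X[test]), y[test]))
    return out

# Two toy "algorithms": least squares with an intercept, and a constant mean
def fit_ols(X, y):
    coef, *_ = np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)
    return lambda X_new: np.c_[np.ones(len(X_new)), X_new] @ coef

def fit_mean(X, y):
    mean = y.mean()
    return lambda X_new: np.full(len(X_new), mean)

def mae(predictions, targets):
    return float(np.mean(np.abs(predictions - targets)))

# Synthetic data standing in for a real dataset
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

diffs = loss_differences(X, y, fit_ols, fit_mean, mae)
t_stat = five_by_two_t_statistic(diffs)
print(f"t = {t_stat:.2f}")
```

Under the null hypothesis of no difference between the two algorithms, the statistic approximately follows a t distribution with 5 degrees of freedom, so an absolute value above 2.571 rejects the null at the 5% level in a two-sided test.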