# Regression

Regression is a supervised learning task in which a model maps input variables to a continuous output. More formally, a regression problem can be defined as learning a function $f$ that maps input variables $X = (x_1, x_2, \dots, x_m)$ to a continuous target variable $y$ such that $f(X) \approx y$.

So for instance, let's say that we have the following data:

| Variable 1 | Variable 2 | Variable 3 | Variable 4 | Target variable |
|------------|------------|------------|------------|:---------------:|
| 1          | 2          | 3          | 4          | 10              |
| 2          | 3          | 4          | 5          | 14              |
| 3          | 4          | 5          | 6          | 18              |
| ...        | ...        | ...        | ...        | ...             |
| 2000       | 2001       | 2002       | 2003       | 8006            |

We would want to learn some function such that $f(1,2,3,4) = 10$ and $f(2,3,4,5) = 14$ and so on.
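
As a minimal sketch of what "learning $f$" can mean here, the following fits a linear function to the four rows visible in the table using ordinary least squares. The choice of a linear model without intercept, the names `X`, `y`, `weights` and `f`, and the `rcond` cutoff are assumptions made for illustration; the rows elided by `...` are not used.

```python
import numpy as np

# The four rows visible in the table (the elided rows are left out).
X = np.array([[1, 2, 3, 4],
              [2, 3, 4, 5],
              [3, 4, 5, 6],
              [2000, 2001, 2002, 2003]], dtype=float)
y = np.array([10, 14, 18, 8006], dtype=float)

# Least-squares fit of f(x) = w . x. The columns are linearly dependent,
# so many weight vectors fit exactly; lstsq returns the minimum-norm one.
# rcond=1e-10 is an assumed cutoff for treating tiny singular values as zero.
weights, *_ = np.linalg.lstsq(X, y, rcond=1e-10)

def f(x):
    """Learned function: a weighted sum of the input variables."""
    return np.asarray(x, dtype=float) @ weights

print(np.round(weights, 3))
print(round(float(f([1, 2, 3, 4])), 3), round(float(f([2, 3, 4, 5])), 3))
```

On this data the fitted weights should all come out close to 1, which matches the pattern in the table: the target is the sum of the four variables.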

Regression is often compared to curve fitting, since it tries to find a function $f$ whose curve follows the data. For instance:

![Title](../../../source/visualization/images/regression.png)

## Evaluating regression

Machine learning is all about getting better and better at a task. Therefore, we need to define what it means to be _good_.

For instance, given the output of different models compared to the target variable, which model would you say is better, and why?

| Target | 0.55 | 0.72 | 0.6 | 0.54 | 0.42 | 0.65 | 0.44 | 0.89 | 0.96 | 0.38 |
|:-------:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| Model A |  0.69 | 2.17 | 1.36 | 0.66 | 0.86 | 0.98 | 1.93 | 0.68 | 1.27 | -0.47 |
| Model B  | -1.36 | 1.21 | 1.25 | -0.02 | 2.12 | -0.44 | 0.47 | 0.75 | 2.11 | 1.48 |
| Model C |0.59 | 0.81 | 0.38 | 0.04 | 0.33 | 0.69 | 0.75 | 1.19 | 0.86 | 0.3 |
| Model D |0.03 | 0.01 | -0.25 | 1.52 | 0.17 | 0.43 | -0.19 | 1.28 | 0.15 | 0.27 |
| Model E | 0.1 | 0.91 | 0.34 | -0.05 | 0.41 | 0.86 | 0.47 | 1.04 | 0.64 | 0.2 |

This can be difficult to tell, especially when there are more models and predictions. Thankfully, there exist several commonly used metrics to tackle this problem. Let's use the data from the table as an example.

In [30]:
import numpy as np

target = np.array([0.55, 0.72, 0.6, 0.54, 0.42, 0.65, 0.44, 0.89, 0.96, 0.38])

predictions = {"A": np.array([0.69, 2.17, 1.36, 0.66, 0.86, 0.98, 1.93, 0.68, 1.27, -0.47]),
               "B": np.array([-1.36, 1.21, 1.25, -0.02, 2.12, -0.44, 0.47, 0.75, 2.11, 1.48]),
               "C": np.array([0.59, 0.81, 0.38, 0.04, 0.33, 0.69, 0.75, 1.19, 0.86, 0.3]),
               "D": np.array([0.03, 0.01, -0.25, 1.52, 0.17, 0.43, -0.19, 1.28, 0.15, 0.27]),
               "E": np.array([0.1, 0.91, 0.34, -0.05, 0.41, 0.86, 0.47, 1.04, 0.64, 0.2])}

### Mean-Squared Error

$MSE = \frac{1}{n} \sum_{i = 1}^{n} (\hat{Y}_i - Y_i)^2$

The mean-squared error is probably the most commonly used metric for regression. It is often set as the default metric in many machine learning packages.

It is defined as the average of the squared errors, where $\hat{Y}_i$ is the predicted value and $Y_i$ the target for data point $i$. Because the errors are squared, large errors are penalized disproportionately more than small ones.

In [31]:
def MSE(predicted_target, target):
    """Mean of the squared differences between predictions and targets."""
    errors = predicted_target - target

    return np.mean(errors**2)

for model_name, predicted_target in predictions.items():
    print(f"{model_name}: {MSE(predicted_target, target):.4f}")

A: 0.6099
B: 1.1255
C: 0.0520
D: 0.3785
E: 0.0857


### Root Mean-Squared Error

$RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (\hat{Y}_i - Y_i)^2}$

The root mean-squared error is simply the square root of the mean-squared error. It has the advantage of being expressed in the same units as the target variable, so it can be read as a typical distance between the output and the target (with large errors still weighing more heavily than in a plain average).

In [32]:
def RMSE(predicted_target, target):
    """Square root of the MSE, in the same units as the target."""
    return np.sqrt(MSE(predicted_target, target))

for model_name, predicted_target in predictions.items():
    print(f"{model_name}: {RMSE(predicted_target, target):.4f}")

A: 0.7810
B: 1.0609
C: 0.2281
D: 0.6153
E: 0.2927


### Mean Absolute Error

$MAE = \frac{1}{n} \sum_{i = 1}^{n} |\hat{Y}_i - Y_i|$

As opposed to the mean-squared error, the mean absolute error penalizes each error in direct proportion to its size. Large errors are therefore not penalized disproportionately more than small ones.

In [33]:
def MAE(predicted_target, target):
    """Mean of the absolute differences between predictions and targets."""
    errors = predicted_target - target

    return np.mean(np.abs(errors))

for model_name, predicted_target in predictions.items():
    print(f"{model_name}: {MAE(predicted_target, target):.4f}")

A: 0.6100
B: 0.8820
C: 0.1770
D: 0.5470
E: 0.2390
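
To make the contrast with the mean-squared error concrete, the sketch below compares two hypothetical error vectors, invented for illustration and not taken from the table: one with ten equal errors and one with a single large error.

```python
import numpy as np

# Hypothetical errors (prediction minus target), invented for illustration.
even_errors = np.full(10, 0.5)               # ten errors of 0.5
spiky_errors = np.array([5.0] + [0.0] * 9)   # one error of 5.0, nine perfect

for name, errors in [("even", even_errors), ("spiky", spiky_errors)]:
    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    print(f"{name}: MAE = {mae:.2f}, MSE = {mse:.2f}")
```

Both vectors have an MAE of 0.50, while the MSE rises from 0.25 to 2.50. The MAE cannot tell them apart, whereas the MSE singles out the vector with the one large error.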


### R Squared

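$R^{2} = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$, where $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$ is the mean of the target.

R squared is also often referred to as the coefficient of determination, and is closely related to the explained variance. It represents the proportion of the target's variance that is explained by the model's predictions. 1 is best and lower is worse. A model that always predicts $\bar{Y}$ scores 0, so negative values, as obtained below, mean that the model does worse than that constant baseline.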

In [34]:
def RSquared(predicted_target, target):
    """1 minus the ratio of residual to total sum of squares."""
    numerator = np.sum((target - predicted_target)**2)
    denominator = np.sum((target - np.mean(target))**2)

    return 1.0 - (numerator / denominator)

for model_name, predicted_target in predictions.items():
    print(f"{model_name}: {RSquared(predicted_target, target):.4f}")

A: -16.8947
B: -32.0216
C: -0.5265
D: -10.1061
E: -1.5134
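
As a sanity check on the scale of $R^2$, the sketch below evaluates two hypothetical reference predictions that are not among the models in the table: a baseline that always predicts the mean of the target, and a perfect prediction. The function name `r_squared` and the two arrays are introduced here for illustration only.

```python
import numpy as np

def r_squared(predicted_target, target):
    """1 minus (residual sum of squares / total sum of squares)."""
    residual = np.sum((target - predicted_target) ** 2)
    total = np.sum((target - np.mean(target)) ** 2)
    return 1.0 - residual / total

target = np.array([0.55, 0.72, 0.6, 0.54, 0.42, 0.65, 0.44, 0.89, 0.96, 0.38])

# Hypothetical reference predictions, not among the models in the table.
mean_baseline = np.full_like(target, np.mean(target))  # always predicts the mean
perfect = target.copy()                                 # predicts every value exactly

print(f"mean baseline: {r_squared(mean_baseline, target):.4f}")
print(f"perfect:       {r_squared(perfect, target):.4f}")
```

The mean baseline scores 0 and the perfect prediction scores 1, so any model with a negative score is doing worse than simply predicting the average target value.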


### Custom metrics

Of course, it is completely possible to use custom metrics.

A simple example would be to use weighted versions of the aforementioned metrics. This makes it more important to perform well on certain data points than on others. It is also possible to define a fully custom metric based on a custom error function. For instance, your application may make overshooting the target much worse than undershooting it.

The metric should ultimately represent what it means for your regression to be good, whatever it may mean in your application.
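
Both ideas can be sketched in a few lines. The functions `weighted_mse` and `asymmetric_mae`, the `weights` array and the `overshoot_penalty` factor are hypothetical names and values chosen for illustration; the right weighting depends on your application.

```python
import numpy as np

def weighted_mse(predicted_target, target, weights):
    """Mean-squared error where each data point has its own importance."""
    errors = predicted_target - target
    return np.sum(weights * errors ** 2) / np.sum(weights)

def asymmetric_mae(predicted_target, target, overshoot_penalty=3.0):
    """Absolute error where overshooting costs `overshoot_penalty` times more."""
    errors = predicted_target - target
    # errors > 0 means the prediction is above the target (overshoot).
    costs = np.where(errors > 0, overshoot_penalty * errors, -errors)
    return np.mean(costs)

target = np.array([1.0, 2.0, 3.0])
overshooting = np.array([1.5, 2.5, 3.5])    # every prediction 0.5 too high
undershooting = np.array([0.5, 1.5, 2.5])   # every prediction 0.5 too low

print(asymmetric_mae(overshooting, target))   # penalized three times as much
print(asymmetric_mae(undershooting, target))
print(weighted_mse(overshooting, target, weights=np.array([1.0, 1.0, 2.0])))
```

With a plain MAE the two sets of predictions would score the same; the asymmetric version separates them because it encodes which direction of error is more costly.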

## Common challenges

### Over- or underfitting, the bias-variance dilemma

As in most machine learning problems, regression models have to deal with noise in their training data. A big challenge is deciding how closely the model should follow that data. A model with high bias makes strong assumptions about the shape of $f$ and treats much of the variation in the data as noise; if it is too rigid, it underfits. A model with low bias, and therefore high variance, trusts most data points not to be noise and follows them closely; if it is too flexible, it overfits by fitting the noise as well.


![Title](../../../source/visualization/images/regression_bias_variance.png)
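
The trade-off can be sketched by fitting polynomials of increasing degree to noisy data. Everything below is an assumption made for illustration: the synthetic sine data, the noise level, the random seed, and the choice of degrees 1, 3 and 12.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_function(x):
    """Underlying signal that the models try to recover."""
    return np.sin(2 * np.pi * x)

# Synthetic data: a smooth signal plus Gaussian noise.
x_train = np.linspace(0.0, 1.0, 20)
y_train = true_function(x_train) + rng.normal(scale=0.2, size=x_train.shape)
x_test = np.linspace(0.025, 0.975, 20)
y_test = true_function(x_test) + rng.normal(scale=0.2, size=x_test.shape)

train_mse, test_mse = {}, {}
for degree in (1, 3, 12):
    # Least-squares polynomial fit of the given degree.
    model = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse[degree] = np.mean((model(x_train) - y_train) ** 2)
    test_mse[degree] = np.mean((model(x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse[degree]:.4f}, "
          f"test MSE = {test_mse[degree]:.4f}")
```

The training error can only go down as the degree grows, since each higher-degree model contains the lower-degree ones. The error on the held-out points is the more telling number: the straight line is too rigid to follow the sine (high bias), while a very flexible polynomial tends to chase the noise (high variance), so its test error typically stops improving or gets worse.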

## Practical examples

Below, we mention a few examples of regression problems.

+ **Predicting house prices**

_Input variables_: Number of bedrooms, whether it has a garage, living surface, age of the house

_Target variable_: Price of the house

_Example_:

| Bedrooms | Garage | Living surface | Age | Price ($) |
|:------------:|:------------:|:------------:|:------------:|:---------------:|
|3          | 0          | 3000          | 1         | 245000              |
| 2          | 1          | 2650          | 14          | 312040              |
| 4          | 0          | 4000          | 60          | 180000              |
| ...        | ...        | ...        | ...        | ...             |
| 5       | 1       | 5432       | 4       | 800670            |

This could be useful for estimating the value of a house, whether you are selling or buying one.

+ **Predicting students' grades**

_Input variables_: Grade on last test, GPA

_Target variable_: Grade on final exam

_Example_:

| Last test | GPA | Final Exam |
|:------------:|:------------:|:------------:|
|3          | 5.5          | 5          | 
| 10          | 7          | 6          |
| 7          | 8          | 7.5          | 
|...        | ...        | ...        |
| 8       | 9.2       | 10       |

A teacher could use this to identify which students might require additional attention.

+ **Predicting how likely a customer is to default on a loan**

_Input variables_: Income, age, children, married

_Target variable_: Likelihood of defaulting

_Example_:

| Income | Age | Children | Married | Likelihood of defaulting |
|:------------:|:------------:|:------------:|:------------:|:---------------:|
|2500          | 33         | 1          | 1         | 0              |
| 1200         | 42          | 3          | 1         | 1              |
| 0          | 18          | 2          | 0          | 1              |
| ...        | ...        | ...        | ...        | ...             |
| 9000       | 28       | 0       | 0       | 0            |

A bank could use this to decide whether or not to grant a loan.