# Loss Functions for Regression Analysis

## Overview

Machine learning models learn by means of a **`loss function`**: a method of evaluating how well a specific algorithm models the given data.

Various loss functions are used to evaluate the performance of a regression model, and each is preferred in different scenarios.

There is no one-size-fits-all loss function for all machine learning problems. Each loss function highlights a different aspect of model performance and suits particular use cases, so the choice should be considered before implementing one.

<img src='pic/loss_all.png' width="400" height="150">

In the following sections, we dive into some of the most common loss functions, listed below.

* **`MSE`** : Mean Squared Error
* **`RMSE`**: Root Mean Squared Error
* **`MAE`** : Mean Absolute Error
* **`MBE`** : Mean Bias Error
* **`MAPE`**: Mean Absolute Percentage Error
* **`R^2`** : R-Squared + Adj. R-Squared
* **`RMSLE`** : Root Mean Squared Logarithmic Error 

---

## Notebook Setting

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

Here we will use the famous Boston Housing Dataset.

In [5]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston

boston_dataset = load_boston()

boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['Target'] = boston_dataset.target
boston = boston[boston.columns[-1:].append(boston.columns[:-1])]
boston.head()

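| | Target | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24.0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 21.6 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 34.7 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 33.4 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 36.2 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |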


In [6]:
boston.isnull().values.any()

False

No null value is detected. It's good to go!

## Preparation

In [7]:
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

X = boston.drop(columns=['Target'])
y = boston['Target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

## Benchmark Regression Model

In [8]:
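# Note: the `normalize` argument was removed in scikit-learn 1.2, so this call needs an older version
lr = ElasticNetCV(normalize=True, l1_ratio=0.3, cv=10)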
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)

---

## Loss Functions for Evaluating Model Performances

### Mean Squared Error (MSE / Quadratic Loss / L2 Loss)

<img src="pic/mse.png" width="300" height="400">

**`Mean Squared Error (MSE)`** is one of the simplest and most common loss functions in regression analysis. For each data point, it calculates the squared difference between the prediction `ŷ` and the actual value `y`, and then takes the mean of those squared differences.

Because of the squared term, predictions that are far from the actual values are penalized much more heavily than predictions that deviate only slightly. A few large errors can therefore make the model look worse than it is on most of the data.

`MSE` is often preferred over other metrics such as `MAE` as a training objective, because it is **differentiable** everywhere and is therefore easier to optimize with gradient-based methods.

**Recommended Use Cases** :
* **`When Large Errors are undesirable`**:
Since the errors are squared before they are averaged, `MSE` gives disproportionately high weight to large errors. Therefore, if large errors (for example, those caused by outliers) are particularly undesirable, use `MSE` to surface them!


* **`Further Calculation`**:
`MSE` is also widely used because it is differentiable, which makes it convenient for further mathematical treatment such as gradient-based optimization.

**Disadvantages** : 
* **`Overestimates how bad a model is`** : `MSE` is easily affected by outliers. A huge `MSE` often means that outliers are present.


* **`Low Score may imply Overfitting`** : A low `MSE` does not necessarily imply a good model. When it is measured on the training data, the model may simply be overfitting and passing through every data point.


* **`Noisiness`** : If the data is noisy (that is, it includes some randomness or is for whatever reason not entirely reliable), even a “perfect” model may have a high `MSE`, so it becomes hard to judge how well the model is performing.


* **`Clueless about how good the result is`**: For example, in a demand forecasting problem, if you have an `MSE` of 1000, you cannot tell how good the result is, because the value depends on the magnitude of the original data. In this type of problem, `MAPE` might be a better choice.

In [9]:
def mse(pred, y_test):
    sum_err = 0.0
    y = y_test.values
    for i in range(len(y_test)):
        err = pred[i] - y[i]
        sum_err += (err**2)
    return(sum_err / float(len(pred)))

In [10]:
mse(lr_pred, y_test)

36.674277681139735
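
The loop above can also be written as a single vectorized NumPy expression. The following is only an illustrative sketch: `mse_vectorized` and the toy arrays are hypothetical names and values that do not appear in the notebook.

```python
import numpy as np

def mse_vectorized(pred, actual):
    """Mean squared error computed without an explicit Python loop."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.mean((pred - actual) ** 2)

# Toy values chosen for illustration only (not taken from the Boston data)
toy_actual = np.array([3.0, 5.0, 2.0, 7.0])
toy_pred = np.array([2.0, 5.0, 4.0, 8.0])

# Errors are -1, 0, 2, 1, so the squared errors are 1, 0, 4, 1
toy_mse = mse_vectorized(toy_pred, toy_actual)
print(toy_mse)  # (1 + 0 + 4 + 1) / 4 = 1.5
```

On the notebook's own data, `mse_vectorized(lr_pred, y_test)` should agree with `mse(lr_pred, y_test)` up to floating-point rounding.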

### Root Mean Squared Error (RMSE)

<img src="pic/rmse.png" width="300" height="400">

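Root Mean Squared Error is simply the square root of the MSE. Then why bother to create another loss function?

Taking the square root brings the error back to the same units as the target variable, which makes `RMSE` easier to interpret than `MSE`.

`RMSE` is a monotonically increasing function of `MSE`. Thus, a model that has a higher `MSE` than another model will also have a higher `RMSE`.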

In [11]:
# Calculate root mean squared error
def rmse(pred, y_test):
    sum_err = 0.0
    y = y_test.values
    for i in range(len(y_test)):
        err = pred[i] - y[i]
        sum_err += (err**2)
    return(np.sqrt(sum_err / float(len(pred))))

In [12]:
rmse(lr_pred, y_test)

6.05592913442188
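
As a small illustration of how `RMSE` relates to `MSE`, the sketch below compares two hypothetical models whose errors differ by a factor of two. The arrays and the helper `rmse_vectorized` are made up for this example.

```python
import numpy as np

def rmse_vectorized(pred, actual):
    """Root mean squared error: the square root of the mean squared error."""
    diff = np.asarray(pred, dtype=float) - np.asarray(actual, dtype=float)
    return np.sqrt(np.mean(diff ** 2))

# Hypothetical targets and predictions (illustration only)
actual = np.array([20.0, 30.0, 40.0])
model_a = np.array([23.0, 26.0, 40.0])  # errors: 3, -4, 0
model_b = np.array([26.0, 22.0, 40.0])  # errors: 6, -8, 0 (twice as large)

rmse_a = rmse_vectorized(model_a, actual)
rmse_b = rmse_vectorized(model_b, actual)

# Doubling every error doubles the RMSE but quadruples the MSE
print(rmse_b / rmse_a)
print(rmse_b ** 2 / rmse_a ** 2)
```

Both metrics rank model A ahead of model B, but only `RMSE` grows in proportion to the errors and stays in the units of the target.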

### Mean Absolute Error (MAE / L1 Loss)

<img src="pic/mae.png" width="300" height="400">

**`Mean Absolute Error (MAE)`** measures the average magnitude of the absolute differences between each prediction `ŷ` and its paired data point `y`, without considering their direction. Unlike `RMSE` and `MSE`, `MAE` is more robust to extreme values because it has no squared term that amplifies large errors. That is, all the individual differences are weighted equally.

**`Mean Absolute Error`** is widely used in areas like finance, where a `$10` error is usually exactly twice as bad as a `$5` error. 

`MSE`, on the other hand, treats a `$10` error as four times worse than a `$5` error. 
Therefore, `MAE` is often easier to interpret and justify than `MSE`.

**Recommended Use Cases** :
* **`Pay equal attention to all data points`**:
Unlike `MSE`, which puts more emphasis on outliers, `MAE` has no squared term, so it is **suitable** for applications where you want to **pay equal attention to all the data points**. Therefore, if you **have outliers and want to treat them like any other point**, use `MAE`!


* **`Easy comparison between differences`**:
`MAE` is a linear score, which means that all the individual differences are weighted equally in the average. For example, the difference between `10` and `0` counts exactly twice as much as the difference between `5` and `0`.

**Disadvantage** : 
* **`Hard to compute further`** : `MAE` is harder to work with mathematically because of the absolute value function. 

Its gradient with respect to each prediction is either `-1` or `1`, regardless of how large the error is, and at `ŷ == y` it is not differentiable at all!
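
The difference in gradient behaviour can be sketched numerically. The helpers `mse_gradient` and `mae_subgradient` below are hypothetical names, and both divide by the number of samples because they differentiate the mean rather than a single error term.

```python
import numpy as np

def mse_gradient(pred, actual):
    """Gradient of the MSE with respect to each prediction: 2 * error / n."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return 2.0 * (pred - actual) / pred.size

def mae_subgradient(pred, actual):
    """Subgradient of the MAE with respect to each prediction: sign(error) / n.

    np.sign returns 0 where pred == actual, which is one valid choice at the
    point where the absolute value is not differentiable.
    """
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sign(pred - actual) / pred.size

# Illustrative values only: errors are 0, 0.1, 2 and -10
actual = np.array([10.0, 10.0, 10.0, 10.0])
pred = np.array([10.0, 10.1, 12.0, 0.0])

g_mse = mse_gradient(pred, actual)     # shrinks as the error shrinks
g_mae = mae_subgradient(pred, actual)  # same magnitude for every nonzero error
print(g_mse)
print(g_mae)
```

The `MSE` gradient scales with the error, which gives an optimizer smaller steps near the minimum, whereas the `MAE` gradient keeps the same magnitude however close the prediction is.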

In [13]:
def mae(pred, y_test):
    sum_err = 0.0
    y = y_test.values
    for i in range(len(y_test)):
        sum_err += abs(pred[i] - y[i])
    return(sum_err / float(len(pred)))

In [14]:
mae(lr_pred, y_test)

3.5973912813393607
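
To make the robustness claim concrete, the following sketch compares `MAE` and `MSE` on a small made-up example with and without a single outlying prediction. All values and helper names are hypothetical.

```python
import numpy as np

def mae_np(pred, actual):
    return np.mean(np.abs(pred - actual))

def mse_np(pred, actual):
    return np.mean((pred - actual) ** 2)

# Illustrative data: every prediction is off by exactly 1
actual = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
pred_clean = np.array([11.0, 11.0, 12.0, 12.0, 13.0])

# Same predictions, except the last one is now off by 20
pred_outlier = pred_clean.copy()
pred_outlier[-1] = 32.0

mae_clean, mae_out = mae_np(pred_clean, actual), mae_np(pred_outlier, actual)
mse_clean, mse_out = mse_np(pred_clean, actual), mse_np(pred_outlier, actual)

print(mae_clean, mae_out)  # 1.0 -> 4.8
print(mse_clean, mse_out)  # 1.0 -> 80.8
```

A single bad prediction multiplies the `MAE` by about 5 but the `MSE` by about 80, which is the sense in which `MAE` is more robust to outliers.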

Note: I have not fully worked through this article yet: https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d

### Mean Bias Error (MBE)

<img src="pic/mbe.png" width="300" height="400">

**`Mean Bias Error`** is especially useful for detecting the average model bias. In general, the **`MBE`** is the average forecast error, representing the **systematic tendency of a forecast model to under- or over-forecast**.

Note that since positive and negative errors cancel out (the absolute value is not taken), the metric captures only bias and no variation. 

In [15]:
def mbe(pred, y_test):
    sum_err = 0.0
    y = y_test.values
    for i in range(len(y_test)):
        sum_err += pred[i] - y[i]
    return(sum_err / float(len(pred)))

In [16]:
mbe(lr_pred, y_test)

-0.4203096246222728

In practice, **`Mean Bias Error`** can be used to calculate the part of the error in `MSE` that does not result from bias, that is, the error resulting from variance. This term is the variance of the errors. Its square root is the `Standard Deviation of the Error (SD)`, and it can be calculated as below:

$$SD^2 = MSE - MBE^2 = RMSE^2 - MBE^2$$
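
The split of the squared error into a bias part and a variance part can be checked numerically. The sketch below uses synthetic data (an assumed constant bias of `1.5` plus random noise) and relies on the identity that the mean squared error equals the squared mean error plus the population variance of the errors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic targets and predictions, for illustration only
actual = rng.normal(loc=20.0, scale=5.0, size=1000)
pred = actual + 1.5 + rng.normal(scale=2.0, size=1000)  # bias of +1.5 plus noise

err = pred - actual
mse_val = np.mean(err ** 2)  # mean squared error
mbe_val = np.mean(err)       # mean bias error
sd_sq = np.var(err)          # population variance of the errors (ddof=0)

# MSE = MBE^2 + SD^2 holds up to floating-point rounding
print(np.isclose(mse_val, mbe_val ** 2 + sd_sq))  # True
```

Note that the identity holds with the population variance (`ddof=0`). With the sample variance (`ddof=1`) the two sides differ slightly.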

### Mean Absolute Percentage Error (MAPE)

<img src="pic/mape.png" width="400" height="400">

The **`mean absolute percentage error (MAPE)`**, also known as `mean absolute percentage deviation (MAPD)`, usually expresses accuracy as a percentage. Basically, it tells you by how many percentage points your forecasts are off, on average. In practice, it is probably the single most commonly used forecasting metric in demand forecasting.

In [17]:
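def mape(pred, y_test):
    sum_err = 0.0
    y = y_test.values
    
    for i in range(len(y_test)):
        # Absolute error relative to the magnitude of the actual value
        sum_err += np.abs(pred[i] - y[i]) / np.abs(y[i])
        
    # Average the relative errors and express the result as a percentage
    return((100.0 / len(y)) * sum_err)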

In [18]:
mape(lr_pred, y_test)

16.543514913679566

**Recommended Use Cases** :
* **`Scale Independence and Interpretability`**: `MAPE` expresses model performance as a percentage. That is, it is not affected by the scale of the data and can be compared across problems of different sizes, which is why it is so useful in demand forecasting.

**Disadvantages**:

* **`Undefined Values`**: MAPE has the significant disadvantage that it produces infinite or undefined values for zero or close-to-zero actual values. If just a single actual value is zero, $A_t = 0$, then calculating the MAPE involves a division by zero, which is undefined. If the denominator is merely very close to zero, the resulting percentage error becomes extremely large and dominates the average. To rectify this problem, one can use the `Mean Arctangent Absolute Percentage Error (MAAPE)`. For more information, please refer to the journal article: "[A new metric of absolute percentage error for intermittent demand forecasts](https://www.sciencedirect.com/science/article/pii/S0169207016000121)"


* **`Imbalanced Penalty`**: With $A_t$ denoting the actual value and $F_t$ the forecast, MAPE puts a heavier penalty on negative errors ($A_t < F_t$) than on positive errors. As a consequence, when MAPE is used to compare the accuracy of prediction methods, it is biased in that it will systematically select a method whose forecasts are too low. This issue can be overcome by using the logarithm of the accuracy ratio:

$$\log\left(\frac{F_t}{A_t}\right)$$


For more information about the disadvantages of MAPE, check out this wonderful post: [What are the shortcomings of the Mean Absolute Percentage Error (MAPE)?](https://stats.stackexchange.com/questions/299712/what-are-the-shortcomings-of-the-mean-absolute-percentage-error-mape) answered by Stephan Kolassa.
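
The zero-denominator problem, and the way the arctangent bounds it, can be sketched as follows. This is a minimal illustration that assumes MAAPE is defined as the mean of `arctan(|(A_t - F_t) / A_t|)`. The function names and toy arrays are hypothetical.

```python
import numpy as np

def mape_np(actual, forecast):
    """MAPE in percent; becomes inf when an actual value is zero."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        return 100.0 * np.mean(np.abs((actual - forecast) / actual))

def maape_np(actual, forecast):
    """Mean arctangent absolute percentage error, in radians.

    arctan maps a ratio in [0, inf] to [0, pi/2], so a zero actual value
    contributes pi/2 instead of an infinite term. If the actual value and
    the forecast are both zero, the ratio is 0/0 and the result is still nan.
    """
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.abs((actual - forecast) / actual)
    return np.mean(np.arctan(ratio))

# Toy intermittent-demand series with one zero actual (illustration only)
actual = np.array([0.0, 10.0, 20.0])
forecast = np.array([2.0, 12.0, 18.0])

mape_val = mape_np(actual, forecast)
maape_val = maape_np(actual, forecast)
print(mape_val)   # inf
print(maape_val)  # finite, at most pi/2
```

Because the result is an angle rather than a percentage, MAAPE values are not directly comparable with MAPE values.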

### Root Mean Squared Logarithmic Error  ( RMSLE )

<img src="pic/rmsle2.png" width="500" height="400">

Working through the logarithms shows that the main differences between `RMSLE` and `RMSE` are the following:

* `RMSLE` only considers the **relative error between the predicted and the actual value**. **The absolute difference (the scale of the error) is not significant**. `RMSE`, on the other hand, increases in magnitude as the scale of the error increases.


* `RMSLE` penalizes predictions that are lower than the actual values more than predictions that are higher than the actual values by the same amount. For example, with an actual value of `100`, predicting `50` gives a log error of `|log(51/101)| ≈ 0.68`, whereas predicting `150` gives only `|log(151/101)| ≈ 0.40`.

<img src="pic/rmsleviz.png" width="500" height="400">

**Recommended Use Cases**:

* **`When biased penalty is acceptable`** : `RMSLE` incurs a larger penalty for `under-estimation` of the actual values than for `over-estimation`.


* **`When underestimation is not acceptable but overestimation can be allowed`** : `RMSLE` is especially useful for business cases where underestimating the target variable is not acceptable but overestimating it can be tolerated.
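
The point that `RMSLE` tracks relative rather than absolute error can be illustrated with two made-up datasets that differ only in scale. In both, every prediction is 10% too low. The helper names below are hypothetical.

```python
import numpy as np

def rmse_np(pred, actual):
    return np.sqrt(np.mean((pred - actual) ** 2))

def rmsle_np(pred, actual):
    # np.log1p(x) computes log(1 + x)
    return np.sqrt(np.mean((np.log1p(pred) - np.log1p(actual)) ** 2))

# Illustrative values only
small_actual = np.array([100.0, 200.0, 300.0])
small_pred = small_actual * 0.9        # every prediction is 10% too low
large_actual = small_actual * 100.0    # same pattern at 100x the scale
large_pred = large_actual * 0.9

rmse_small, rmse_large = rmse_np(small_pred, small_actual), rmse_np(large_pred, large_actual)
rmsle_small, rmsle_large = rmsle_np(small_pred, small_actual), rmsle_np(large_pred, large_actual)

print(rmse_small, rmse_large)    # grows with the scale of the data
print(rmsle_small, rmsle_large)  # nearly identical at both scales
```

The two `RMSLE` values are not exactly equal because of the `+1` inside the logarithm, which matters less as the values grow.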

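In [20]:
def rmsle(pred, y_test):
    sum_err = 0.0
    y = y_test.values
    
    for i in range(len(y_test)):
        # Squared difference between the log-transformed prediction and actual value
        sum_err += (np.log(pred[i]+1) - np.log(y[i]+1))**2
        
    # Take the square root of the mean so the result is RMSLE rather than MSLE
    return(np.sqrt(sum_err / float(len(pred))))

In [21]:
round(rmsle(lr_pred, y_test), 4)

0.2598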

### R² ( R-Squared / Coefficient of Determination )

Imagine that you get an MSE score of `5.21`. What should you do? Is this a decent value?

To tackle this ambiguity, **$R^2$** is a useful metric for evaluating how well the model fits the data points. It not only solves the problem above, since **$R^2$ is unitless and universally interpretable**, but also captures the value added by the fitted model over a baseline.

<img src="pic/r2.png" width="250" height="400">

<img src="pic/r2-1.png" width="300" height="400">

**`MSE(model)`** : Sum of squared regression errors (averaged over the samples). This MSE comes from whatever regression model we implemented.

**`MSE(baseline)`** : Sum of squared total errors (averaged over the samples). This constant baseline model can be interpreted as the **`simplest model`** we can derive, which is to **always predict the average of all samples**.

**`y̅`** : the mean of the observed yᵢ.

**Simple Explanations**:

1. **$R^2$** is a scale-free metric and is widely used to evaluate how much room for improvement the model has.
2. **$R^2$** compares **how good our model is** with **how good the naive mean model is**.

This metric compares the fit of the chosen model with that of a horizontal straight line (the null model).

A value close to `1` indicates that the model has almost no error (little bias and little variance), and a value close to `0` indicates a model that is barely better than the baseline.

Note that **$R^2$ can actually be negative**: **if the chosen model does not follow the trend of the data and fits worse than a horizontal line, then $R^2$ is negative**. 

Note that, under this definition, $R^2$ is not in fact the square of anything, so it can have a negative value without violating any rules of math. 

<img src="pic/r2neg.png" width="350" height="400">

The simplest way to compute R² is with scikit-learn:

```python 
from sklearn.metrics import r2_score
```
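
A short usage sketch of `r2_score` on made-up values shows the three regimes discussed above: a perfect fit, the mean baseline, and a model that is worse than the baseline. The arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative targets only
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Perfect predictions: no residual error
perfect = r2_score(y_true, y_true)

# Always predicting the mean of the targets: the baseline model
baseline = r2_score(y_true, np.full_like(y_true, y_true.mean()))

# Predictions that run against the trend: worse than the baseline
reversed_trend = r2_score(y_true, y_true[::-1])

print(perfect)         # 1.0
print(baseline)        # 0.0
print(reversed_trend)  # -3.0, since SSR = 40 and SST = 10
```

Note the argument order: `r2_score` takes the true values first and the predictions second.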

In [22]:
def R_squared(pred, y_test):
    y = y_test.values
    SSR = 0.0  # sum of squared residuals of the model
    SST = 0.0  # total sum of squares around the mean of the observed values
    y_mean = np.mean(y)
    for i in range(len(y_test)):
        SSR += (pred[i] - y[i]) ** 2
        SST += (y[i] - y_mean) ** 2
    r2 = 1 - (float(SSR)) / SST
    return(r2)

### Adjusted R² ( Adj. R-Squared )

What is the problem with plain `R²`?

`R²` suffers from the problem that it rewards overfitting. Imagine that you have five data points and obtain an `R²` score of `0.70`. If you want to improve your `R²`, one simple way is to add one more variable to the model. 

However, the score measured on the training data never decreases as terms are added, even when the model is not actually improving. This potential overfitting may mislead the researcher. In extreme cases, you can even get an `R²` of `1` when the model has as many fitted parameters as there are data points.

<img src="pic/adjr2.png" width="400" height="400">

**`p`** : Number of independent variables


**`N`** : Number of observations (Sample Size)

**`Adjusted R²`** is thus introduced to solve this problem.

`Adjusted R²` is never higher than `R²`, as it adjusts for the number of predictors and only improves when an added predictor brings a real improvement.

In [30]:
def adj_R_squared(pred, y_test, num_cols):
    n = len(y_test)  # number of observations being evaluated
    numerator = (1 - R_squared(pred, y_test)) * (n - 1)
    denominator = n - num_cols - 1
    adj_r2 = 1 - numerator / denominator
    return(adj_r2)

In [25]:
# The value depends on the random train/test split (no random_state is set)
R_squared(lr_pred, y_test)

In [31]:
adj_R_squared(lr_pred, y_test, X.shape[1])
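
The effect of adding uninformative predictors can be sketched on synthetic data. This is an illustration under stated assumptions, not the notebook's method: the data are random, the model is plain ordinary least squares fitted with `np.linalg.lstsq`, and scores are computed on the training data. All names are hypothetical.

```python
import numpy as np

def r2_np(pred, actual):
    ss_res = np.sum((actual - pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

def adj_r2_np(r2, n_obs, n_features):
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

def fit_predict_ols(X, y):
    """Fit least squares with an intercept and predict on the same data."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

rng = np.random.default_rng(42)
n = 30
x_real = rng.normal(size=(n, 1))                 # one informative feature
y = 3.0 * x_real[:, 0] + rng.normal(size=n)
X_noise = np.column_stack([x_real, rng.normal(size=(n, 10))])  # plus 10 noise features

r2_small = r2_np(fit_predict_ols(x_real, y), y)
r2_big = r2_np(fit_predict_ols(X_noise, y), y)
adj_small = adj_r2_np(r2_small, n, 1)
adj_big = adj_r2_np(r2_big, n, 11)

print(r2_small, r2_big)    # training R² does not decrease with extra features
print(adj_small, adj_big)  # each adjusted value is below its plain R²
```

The training `R²` of the larger model can never be lower than that of the smaller nested model, while the adjusted score charges a price for each of the ten noise columns and will typically show little or no gain.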

### End Notes

Keep in mind that there is no one-size-fits-all loss function for all machine learning problems. Each loss function highlights a different aspect of model performance and suits particular use cases, so the choice should be considered before implementing one. 

---

## Reference: 


[[Medium] Regression: An Explanation of Regression Metrics And What Can Go Wrong](https://towardsdatascience.com/regression-an-explanation-of-regression-metrics-and-what-can-go-wrong-a39a9793d914)

[[Medium] How to select the Right Evaluation Metric for Machine Learning Models: Part 1 Regression Metrics](https://medium.com/@george.drakos62/how-to-select-the-right-evaluation-metric-for-machine-learning-models-part-1-regrression-metrics-3606e25beae0)


[[ScienceDirect] Mean Bias Error](https://www.sciencedirect.com/topics/engineering/mean-bias-error)

[[ScienceDirect] A new metric of absolute percentage error for intermittent demand forecasts](https://www.sciencedirect.com/science/article/pii/S0169207016000121)

[[Relexsolutions] Measuring Forecast Accuracy: The Complete Guide](https://www.relexsolutions.com/resources/measuring-forecast-accuracy/)

[[DeepLearningAcademy] Loss Functions in Deep Learning](https://www.deeplearning-academy.com/p/ai-wiki-loss-functions-in-deep-learning)

[[StackExchange] **Nick's and Harvey's answers** on "When is R squared negative?"](https://stats.stackexchange.com/questions/12900/when-is-r-squared-negative)

[[StackExchange] **Stephan's answer** on "What are the shortcomings of the Mean Absolute Percentage Error (MAPE)?"](https://stats.stackexchange.com/questions/299712/what-are-the-shortcomings-of-the-mean-absolute-percentage-error-mape)

[[StackOverflow] **Sandipen's answer** on "python sklearn multiple linear regression display r-squared"](https://stackoverflow.com/questions/42033720/python-sklearn-multiple-linear-regression-display-r-squared)

[[Kaggle] **Nasashi's and Dmitriy's posts** on "All about the metric: RMSLE"](https://www.kaggle.com/c/ashrae-energy-prediction/discussion/113064)