# Building Blocks of Neural Networks

## Loss Functions for Regression

In [1]:
import tensorflow as tf
import keras

import numpy as np

### Mean Squared Error

$$
MSE = \frac{1}{n}\sum_{i=1}^{n}(y_{i} - y'_{i})^2
$$

where:

- `n`: number of examples or samples.
- `y`: true values.
- `y'`: predicted values.

A smaller value indicates that the ground truth and predicted values are closer to each other.

However, MSE has a notable disadvantage: it is sensitive to outliers, because squaring magnifies large errors.

In [2]:
def mse_loss(y_true: tf.Tensor, y_pred: tf.Tensor):
    return tf.reduce_mean(tf.square(y_true - y_pred))

In [10]:
y_true = tf.Variable([0.0, 1.0, 0.0, 0.0])
y_pred = tf.Variable([1.0, 1.0, 1.0, 0.0])
print("Calc MSE Loss: ", mse_loss(y_true, y_pred))

tf_mse = keras.losses.MeanSquaredError()
print("Keras MSE Loss: ", tf_mse(y_true, y_pred))
print("Keras MSE Loss: ", keras.losses.mean_squared_error(y_true, y_pred))

Calc MSE Loss:  tf.Tensor(0.5, shape=(), dtype=float32)
Keras MSE Loss:  tf.Tensor(0.5, shape=(), dtype=float32)
Keras MSE Loss:  tf.Tensor(0.5, shape=(), dtype=float32)


### Mean Absolute Error

$$
MAE = \frac{1}{n}\sum_{i=1}^{n}|y_{i} - y'_{i}|
$$

where:

- `n`: number of examples or samples.
- `y`: true values.
- `y'`: predicted values.

A smaller value indicates that the ground truth and predicted values are closer to each other.

It overcomes a disadvantage of MSE: it is less sensitive to outliers.


In [8]:
def mae_loss(y_true: tf.Tensor, y_pred: tf.Tensor):
    return tf.reduce_mean(tf.abs(y_true - y_pred))

In [11]:
y_true = tf.Variable([0.0, 2.0, 0.0, 3.0])
y_pred = tf.Variable([1.0, 1.0, 2.0, 0.0])
print("Calc MAE Loss: ", mae_loss(y_true, y_pred))
print("Keras MAE Loss: ", keras.losses.mean_absolute_error(y_true, y_pred))

Calc MAE Loss:  tf.Tensor(1.75, shape=(), dtype=float32)
Keras MAE Loss:  tf.Tensor(1.75, shape=(), dtype=float32)


#### Outliers Example

In [12]:
y_true = tf.Variable([0.0, 2.0, 0.0, 7.0])
y_pred = tf.Variable([1.0, 1.0, 2.0, 1.0])

print("Keras MSE Loss: ", keras.losses.mean_squared_error(y_true, y_pred))
print("Keras MAE Loss: ", keras.losses.mean_absolute_error(y_true, y_pred))

Keras MSE Loss:  tf.Tensor(10.5, shape=(), dtype=float32)
Keras MAE Loss:  tf.Tensor(2.5, shape=(), dtype=float32)


### MSE or MAE? Which loss function should be used?

As shown above, MSE squares each error term, so a single outlier amplifies the loss far more than it does under MAE: the same predictions give an MSE of 10.5 but an MAE of only 2.5. This is often a useful property, because it pushes the model to reduce large deviations and so perform better across the whole dataset. MAE, in contrast, does not emphasize outliers: an error's contribution grows only linearly with its size, so one large error counts no more than several small errors that add up to the same total. If large errors matter for the task, MAE is therefore less effective, and the resulting model may be accurate most of the time while still making a few very poor predictions.
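To make the amplification concrete, the sketch below recomputes the outlier example in plain Python (no TensorFlow required) and reports how much of each loss comes from the single outlier. The variable names are illustrative only.

```python
# Same values as the outliers example: the last sample is the outlier.
y_true = [0.0, 2.0, 0.0, 7.0]
y_pred = [1.0, 1.0, 2.0, 1.0]

# Per-sample errors and their squared / absolute contributions.
errors = [t - p for t, p in zip(y_true, y_pred)]
squared = [e ** 2 for e in errors]
absolute = [abs(e) for e in errors]

mse = sum(squared) / len(errors)
mae = sum(absolute) / len(errors)

# Fraction of the total loss contributed by the largest error.
outlier_share_mse = max(squared) / sum(squared)
outlier_share_mae = max(absolute) / sum(absolute)

print("MSE:", mse, "outlier share:", round(outlier_share_mse, 3))
print("MAE:", mae, "outlier share:", round(outlier_share_mae, 3))
```

Here the outlier accounts for 36 of the 42 units of squared error but only 6 of the 10 units of absolute error, which is why it dominates MSE much more strongly than MAE.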

Suppose two training runs, one minimizing MAE and one minimizing MSE, both stop at a loss of 30.5. The MSE-trained model is the better result. Because MSE squares the errors, it penalizes large errors much more heavily than MAE, so it can only reach 30.5 if no individual error is very large. In fact, MAE can never exceed the square root of MSE, so an MSE of 30.5 implies an MAE of at most about 5.5. A model that reaches the same numeric value under MSE has therefore handled outliers effectively, balancing small and large errors well.

Most optimizers use differentiation to find the optimal parameter values for the loss function. MSE is differentiable everywhere, since it is a smooth quadratic function of the error (it resembles a parabola). MAE is also continuous, but it is not differentiable at zero, where the absolute value function has a kink, and its gradient has the same magnitude for every nonzero error regardless of size. This makes optimization with gradient descent more challenging and less efficient.
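For reference, differentiating the two loss definitions given earlier with respect to a single prediction gives the following (a short derivation sketch; the sign function $\operatorname{sign}$ is notation added here):

$$
\frac{\partial\, MSE}{\partial y'_{i}} = -\frac{2}{n}(y_{i} - y'_{i}),
\qquad
\frac{\partial\, MAE}{\partial y'_{i}} = -\frac{1}{n}\operatorname{sign}(y_{i} - y'_{i}) \quad \text{for } y_{i} \neq y'_{i}
$$

The MSE gradient shrinks smoothly to zero as the prediction approaches the true value, while the MAE gradient keeps a constant magnitude of $\frac{1}{n}$ and is undefined when $y_{i} = y'_{i}$.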

MSE also tends to converge faster than MAE, due to the quadratic nature of the loss function: its gradient is proportional to the error, so updates are large far from the optimum and shrink as the optimum is approached.
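The difference in convergence can be illustrated with a toy experiment. This is a minimal sketch, not a statement about any particular model: it fits a single constant prediction to three targets with plain gradient descent, and the helper names, learning rate, and step count are all arbitrary assumptions.

```python
def mse_grad(pred, targets):
    # Derivative of mean((t - pred)^2) with respect to pred.
    return -2.0 * sum(t - pred for t in targets) / len(targets)


def mae_grad(pred, targets):
    # Subgradient of mean(|t - pred|); sign(0) is taken to be 0.
    def sign(x):
        return (x > 0) - (x < 0)
    return -sum(sign(t - pred) for t in targets) / len(targets)


def descend(grad_fn, targets, lr=0.1, steps=50, start=0.0):
    # Plain gradient descent on one scalar parameter.
    pred = start
    history = [pred]
    for _ in range(steps):
        pred -= lr * grad_fn(pred, targets)
        history.append(pred)
    return history


targets = [4.0, 5.0, 6.0]  # mean and median are both 5.0
mse_history = descend(mse_grad, targets)
mae_history = descend(mae_grad, targets)

print("MSE-trained prediction after 50 steps:", mse_history[-1])
print("MAE-trained prediction after 50 steps:", mae_history[-1])
```

In this setup both losses share the same optimum (5.0), so the gap between the two final predictions reflects only how the gradients behave: the MSE steps scale with the remaining error, while the MAE steps have a fixed size set by the learning rate.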

So, in most cases, MSE is preferred over MAE.

### Efficiency of MSE over MAE

The MSE loss function is often described as cheaper to compute than the MAE loss function, although in practice the difference is usually small:

- **Squaring is a single multiplication**: MSE squares the differences between predicted and target values, whereas MAE takes their absolute values. Modern processors have heavily optimized hardware for multiplication, so squaring is very cheap. Taking an absolute value is also a cheap elementwise operation, however, so any speed difference in the forward pass is typically minor.

- **Gradient**: When calculating the gradient for backpropagation during training, the derivative of the squared term in MSE is a simple linear function of the error and is defined everywhere. The derivative of the absolute value in MAE is a sign function that is undefined at zero and needs special handling.

When working with increasingly large batches of data during training, both losses scale the same way: the computation time of each grows linearly with the batch size, because each applies one elementwise operation per example followed by a mean. Any per-element advantage of MSE adds up over larger batches, but it does not change how either loss scales.

For example, with a batch of size N, the MSE loss requires N subtractions and N multiplications (to compute the squared differences) followed by a sum, which can be efficiently parallelized. The MAE loss requires N subtractions and N absolute value computations followed by a sum, which parallelizes just as well. As the batch size increases, the number of elementwise operations increases linearly for both losses.
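Relative cost is best checked by measurement rather than assumed. The sketch below is a pure-Python stand-in, using the hypothetical helpers `mse_py`, `mae_py`, and `time_loss`, that times both losses at several batch sizes. Its absolute numbers say nothing about TensorFlow's vectorized kernels; it only shows that each loss makes one pass over the batch.

```python
import timeit


def mse_py(y_true, y_pred):
    # One subtraction and one multiplication per element, then a mean.
    return sum((t - p) * (t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mae_py(y_true, y_pred):
    # One subtraction and one absolute value per element, then a mean.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def time_loss(loss_fn, n, repeats=5):
    # Best-of-`repeats` wall-clock time for one loss evaluation on n samples.
    y_true = [float(i % 7) for i in range(n)]
    y_pred = [float(i % 5) for i in range(n)]
    return min(timeit.repeat(lambda: loss_fn(y_true, y_pred), number=1, repeat=repeats))


for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}  mse: {time_loss(mse_py, n):.6f}s  mae: {time_loss(mae_py, n):.6f}s")
```

To compare the actual Keras losses, the same timing loop could be pointed at `keras.losses.mean_squared_error` and `keras.losses.mean_absolute_error` on tensors of the desired batch size.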