# Loss Functions

In [None]:
import numpy as np

## Loss functions used in regression problems

### 1. Squared Loss (a.k.a. Mean Squared Error; MSE)

$$ MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 $$

It averages the squared difference between the actual value and the predicted value. Because the difference is squared, the order of subtraction doesn't matter.

1. Larger errors are penalized quadratically: doubling the error quadruples its loss, so outliers weigh heavily on the total.
2. It is the most widely used regression loss because it is smooth and differentiable everywhere, which suits the least-squares method and gradient descent.

In [1]:
def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)
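
To make the outlier sensitivity concrete, here is a small sketch with made-up numbers (the arrays and the helper name `mse` are illustrative assumptions). Both predictions have the same total absolute error of 4, but concentrating that error in one sample raises the MSE fourfold.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of squared residuals, same formula as the MSE definition
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_spread = np.array([2.0, 3.0, 4.0, 5.0])   # four residuals of size 1
y_pred_outlier = np.array([1.0, 2.0, 3.0, 8.0])  # one residual of size 4

mse_spread = mse(y_true, y_pred_spread)    # (1 + 1 + 1 + 1) / 4 = 1.0
mse_outlier = mse(y_true, y_pred_outlier)  # (0 + 0 + 0 + 16) / 4 = 4.0
print(mse_spread, mse_outlier)
```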

### 2. Absolute Loss (a.k.a. Mean Absolute Error; MAE)

$$ MAE = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i| $$

It averages the absolute difference between the actual value and the predicted value (order doesn't matter).

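1. It is less sensitive to outliers than MSE, so if the samples split into two groups, the fit tends toward the larger group (for a constant prediction, the minimizer is the median rather than the mean).
2. The absolute value is not differentiable at zero and its gradient has constant magnitude, so there is no least-squares-style closed-form solution and gradient descent can be harder to converge.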

In [2]:
def mae_loss(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))
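
The difference in outlier handling can be checked with a rough sketch: search for the single constant prediction `c` that minimizes each loss on a toy sample. The data, the grid of candidates, and the variable names are assumptions made for illustration only.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # 100 plays the role of an outlier
candidates = np.linspace(0.0, 100.0, 1001)  # constant predictions, step 0.1

# Evaluate both losses for every candidate constant prediction
mse_per_c = np.array([np.mean((y - c) ** 2) for c in candidates])
mae_per_c = np.array([np.mean(np.abs(y - c)) for c in candidates])

best_c_mse = candidates[np.argmin(mse_per_c)]  # lands on the mean of y
best_c_mae = candidates[np.argmin(mae_per_c)]  # lands on the median of y
print(best_c_mse, np.mean(y))
print(best_c_mae, np.median(y))
```

On this sample the MSE-optimal constant is pulled toward the outlier, while the MAE-optimal constant stays with the majority of the points.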

### 3. Huber Loss

$$ \text{Huber}(y_i, \hat{y}_i) = \begin{cases} 
    \frac{1}{2}(y_i - \hat{y}_i)^2 & |y_i - \hat{y}_i| \leq \delta \\
    \delta |y_i - \hat{y}_i| - \frac{1}{2}\delta ^2 & \text{otherwise}
\end{cases}  $$



It combines MSE and MAE.
1. It uses the squared error for small errors (at most $\delta$) and a linear, absolute-error term for large errors, so it behaves like MAE on outliers.
2. It is useful for reducing sensitivity to outliers while staying smooth near zero.

In [3]:
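def huber_loss(y_true, y_pred, delta=1.0):
    # The piecewise rule depends on the magnitude of the error, not its sign
    abs_diff = np.abs(y_true - y_pred)
    # Quadratic for |error| <= delta, linear beyond it
    loss = np.where(abs_diff <= delta, 0.5 * abs_diff**2, delta * (abs_diff - 0.5 * delta))
    return np.mean(loss)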
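
As a sanity check on the piecewise formula, the sketch below evaluates the per-sample Huber loss on a few hand-picked residuals. The helper `huber_elementwise` and the residual values are hypothetical, introduced only for this illustration; it applies the absolute residual in both branches, as the formula requires, so negative and positive errors of equal size get the same loss.

```python
import numpy as np

def huber_elementwise(residual, delta=1.0):
    # Per-sample Huber loss, written directly from the piecewise definition
    abs_r = np.abs(residual)
    quadratic = 0.5 * abs_r ** 2                # used when |r| <= delta
    linear = delta * abs_r - 0.5 * delta ** 2   # used when |r| > delta
    return np.where(abs_r <= delta, quadratic, linear)

residuals = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
losses = huber_elementwise(residuals, delta=1.0)
print(losses)  # [2.5, 0.125, 0.0, 0.125, 2.5]
```

For comparison, the squared error of the residual 3.0 would be 9.0, so the linear branch is what keeps large errors from dominating.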

## Loss functions used in classification problems

### 1. Binary Cross-Entropy Loss

$$
\text{Cross-Entropy Loss} = -\sum_{i=1}^{N}(y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i))
$$

$y_i$: true label (0 or 1)

$\hat{y}_i$: predicted probability of class 1 (between 0 and 1)

For $x$ between 0 and 1, $\log(x)$ is negative, so the sum is multiplied by $-1$ to make the loss non-negative.

It measures how close the prediction is to the actual label. Because $y_i$ and $1-y_i$ multiply the two log terms, only the term for the true class contributes for each sample.

In [None]:
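def crossentropy_loss(y_true, y_pred):
    epsilon = 1e-15
    # Clip predictions to [epsilon, 1 - epsilon] so log() never receives 0 and the loss stays finite
    y_pred = np.clip(y_pred, epsilon, 1. - epsilon)
    # np.mean averages over the samples, i.e. the sum in the formula scaled by 1/N
    return -np.mean(y_true * np.log(y_pred) + (1-y_true) * np.log(1-y_pred))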
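
A short sketch of how the loss reacts to confidence, using invented predictions for two samples with labels 1 and 0. The helper name `bce` and all numbers are assumptions for illustration. It also shows why clipping matters: a prediction of exactly 0 for a positive label still yields a finite loss.

```python
import numpy as np

def bce(y_true, y_pred, epsilon=1e-15):
    # Binary cross-entropy averaged over samples, with clipping for stability
    y_pred = np.clip(y_pred, epsilon, 1.0 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0])

confident_right = bce(y_true, np.array([0.9, 0.1]))  # -log(0.9), about 0.105
unsure = bce(y_true, np.array([0.5, 0.5]))           # -log(0.5), about 0.693
confident_wrong = bce(y_true, np.array([0.1, 0.9]))  # -log(0.1), about 2.303

# Without clipping, log(0) would give -inf here; with clipping the loss is large but finite
clipped = bce(np.array([1.0]), np.array([0.0]))
print(confident_right, unsure, confident_wrong, clipped)
```

Confidently wrong predictions are punished far more than uncertain ones, which is the behavior the log terms are meant to produce.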

### 1-1. Multi-class (Categorical) Cross-Entropy Loss

$$
\text{Multi Cross-Entropy Loss} = -\sum_{i=1}^{C}y_i \log(\hat{y}_i)
$$

It computes the cross-entropy over many classes at once. The label $y$ is one-hot encoded (1 for the correct class and 0 for all others), so only the log-probability of the correct class survives the summation.

$C$: number of classes

$y_i$: one-hot encoded label (correct class: 1, others: 0)

$\hat{y}_i$: predicted probability for each class

ex) prediction [0.1, 0.8, 0.1] with the 2nd class correct => loss = -log(0.8) ≈ 0.223

In [5]:
def categorical_crossentropy_loss(y_true, y_pred):
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1. - epsilon)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
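
The worked example with prediction [0.1, 0.8, 0.1] can be reproduced in a few lines. This is a minimal sketch; the index-based variant at the end is an assumed alternative formulation, not something the notebook itself uses, and it relies on the labels being strictly one-hot.

```python
import numpy as np

y_true = np.array([[0.0, 1.0, 0.0]])  # one sample, 2nd class is correct
y_pred = np.array([[0.1, 0.8, 0.1]])  # predicted class probabilities

# One-hot labels zero out every term except the correct class
per_sample = -np.sum(y_true * np.log(y_pred), axis=1)
loss = np.mean(per_sample)            # -log(0.8), about 0.2231

# Equivalent view: pick the predicted probability of the correct class by index
labels = np.argmax(y_true, axis=1)
picked = y_pred[np.arange(len(labels)), labels]
loss_from_index = np.mean(-np.log(picked))
print(loss, loss_from_index)
```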

### 2. Kullback-Leibler Divergence (KL Divergence)

$$
D_{KL}(P||Q) = \sum_{i=1}^{N}P(x_i)\log\frac{P(x_i)}{Q(x_i)}
$$

$P(x_i)$: actual (true) distribution's probability

$Q(x_i)$: predicted distribution's probability

$D_{KL}(P||Q)$: KL Divergence between distribution P and distribution Q


It is used in models such as the Variational Autoencoder (VAE) or Bayesian networks.

It is an asymmetric loss function that measures the difference between two distributions, so we need to state which distribution serves as the reference (true) one.

1. It is asymmetric: in general $D_{KL}(P||Q) \neq D_{KL}(Q||P)$. It is always non-negative: it equals 0 when P and Q are the same and is positive when they differ.
2. Information loss: KL divergence quantifies the information lost when another distribution is used to approximate the true distribution. In other words, it shows how much information is lost if we use the alternative distribution instead of the true one.

In [6]:
def kl_divergence(P, Q):
    epsilon = 1e-10
    P = np.clip(P, epsilon, 1)
    Q = np.clip(Q, epsilon, 1)
    
    return np.sum(P * np.log(P / Q))
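
The asymmetry and the zero-when-equal property can be verified numerically. The sketch below uses two made-up two-outcome distributions; the helper name `kl` and the values of `P` and `Q` are assumptions chosen only to illustrate the point.

```python
import numpy as np

def kl(p, q, epsilon=1e-10):
    # KL divergence D(p || q) with clipping to avoid log(0) and division by 0
    p = np.clip(p, epsilon, 1)
    q = np.clip(q, epsilon, 1)
    return np.sum(p * np.log(p / q))

P = np.array([0.9, 0.1])
Q = np.array([0.5, 0.5])

kl_pq = kl(P, Q)  # 0.9*log(1.8) + 0.1*log(0.2), about 0.368
kl_qp = kl(Q, P)  # 0.5*log(0.5/0.9) + 0.5*log(5), about 0.511
kl_pp = kl(P, P)  # identical distributions give 0
print(kl_pq, kl_qp, kl_pp)
```

Swapping the arguments changes the result, which is why the reference distribution has to be fixed before the values are compared.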