<a href="https://colab.research.google.com/github/vtecftwy/fastbook/blob/master/resources/06_loss_functions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clarify loss functions terminology

Loss functions used in classifier models are mostly related to the same concept, **Cross Entropy**, but have a wide variety of names depending on the type of model and the library used to code these models.

This notebook gives an overview of the loss functions used for different types of classification problems, and then of the names they go by in PyTorch and fastai.

This [article](https://gombru.github.io/2018/05/23/cross_entropy_loss/) gives a more in-depth review of a similar topic.

# 1. The Concepts

## Multiclass Classifier

This is the case where each input $X$ (image or other input) has a single class out of a set of $\textbf C$ classes. 
- The training set consists of the inputs $X$ and the ground truth or labels $Y$. 
- Each $Y$ is a one-hot-encoded vector, that is $Y = [y_1, y_2, y_3, ... y_C]$ where $y_i \in \{0,1\}$ and $\sum_{i=1}^C y_i = 1$.

The model predicts $\hat Y = f(X)$ in such a way that $\hat Y$ is as close as possible to the ground truth $Y$. 

In deep learning, $f$ is built from a sequence of layers and $\hat Y = f(X)$ is the output of a forward pass in the network. The specific structure of the model is based on the chosen architecture, but it always ends with $\textbf C$ outputs, each representing one class. 

We want the selected model to favor $\hat Y$ close to a one-hot-encoded vector, that is (1) one $\hat y_i$ is close to 1, (2) all the other $\hat y_i$ are close to 0 and (3) the sum of all $\hat y_i$ is equal to 1. We call these values probabilities: each $\hat y_i$ represents the "probability" that class $i$ is the correct class.

The loss function is calculated during training (forward pass) and must represent how far $\hat Y$ is from the ground truth $Y$. 

The diagram below summarizes this:
- Input $X$ going through the model and predicting $\hat Y$
- To ensure that $\hat Y$ is a probability, we pass the network output through a $Softmax$ layer
    - $0 \le \hat y_i \le 1$ and $\sum_{i=1}^C \hat y_i = 1$
- Loss function $Cross \ Entropy$ as a function of $\hat Y$ and $Y$
    - $Cross \ Entropy(Y, \hat Y) = - \sum_{i=1}^C y_i \log(\hat y_i)$

**Loss Function in the case of a Deep Learning Multi Class Classifier model**
<img src="https://raw.githubusercontent.com/vtecftwy/fastbook/master/images/ce_loss_function_01.png">

For each sample $X$ or each mini batch of $X$ passed through the network, we get a single loss value, which is used to compute the gradients and correct the weights/biases through the backward pass.
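
To make the $Softmax$ and $Cross \ Entropy$ steps above concrete, here is a minimal pure-Python sketch for a single sample. The logits and the label are made-up values chosen only for illustration, and the helper names `softmax` and `cross_entropy` are hypothetical, not taken from any library.

```python
import math

def softmax(logits):
    """Turn raw network outputs into values in [0, 1] that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, y_hat):
    """Cross entropy between a one-hot label y and predicted probabilities y_hat."""
    # Terms with y_i = 0 contribute nothing, so they are skipped (this also avoids log(0)).
    return -sum(y_i * math.log(p_i) for y_i, p_i in zip(y, y_hat) if y_i > 0)

logits = [2.0, 0.5, -1.0]   # hypothetical raw outputs of the network, C = 3
y = [1, 0, 0]               # one-hot label: the first class is the correct one

y_hat = softmax(logits)
loss = cross_entropy(y, y_hat)
print(y_hat, loss)
```

Because $Y$ is one-hot, only the term of the correct class survives in the sum, so the loss reduces to $-\log(\hat y_k)$ where $k$ is the correct class.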

## Binary Classifier

Here, we only have two classes and the ground truth values are encoded as $True$/$False$ or $1$/$0$. This means that $Y$ is a single value, $0$ or $1$.

The model logic remains the same, but the fact that there is only one class encoded as a 0 or a 1 makes the structure a little simpler.

We want the selected model to favor $\hat Y$ close to either $1$ or $0$.

The diagram below summarizes this:
- Input $X$ going through the model and predicting $\hat Y$
- Instead of a $Softmax$ layer (which takes more than one input) we can simply use a $Sigmoid$ to squish the output value of the network between 0 and 1.
- The loss function is still cross entropy, but it is expressed slightly differently because we only have one value $Y$ and not two. This version of cross entropy is called $Binary \ Cross \ Entropy$.
    - $Binary \ Cross \ Entropy(Y,\hat Y) = - Y \log (\hat Y) - (1-Y) \log (1 - \hat Y)$.

**Loss Function in the case of a Deep Learning Two Class Classifier model, a.k.a Binary Classifier**
<img src="https://raw.githubusercontent.com/vtecftwy/fastbook/master/images/ce_loss_function_02.png">
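
The sketch below works through the binary case in pure Python, and also checks that $Binary \ Cross \ Entropy$ gives the same value as the multiclass formula applied to the two-element vectors $[Y, 1-Y]$ and $[\hat Y, 1-\hat Y]$. The logit value and the helper names are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Squish a raw network output into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y, y_hat):
    """BCE for a single label y in {0, 1} and a predicted probability y_hat."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

logit = 1.2   # hypothetical single raw output of the network
y = 1         # ground truth: the positive class

y_hat = sigmoid(logit)
loss = binary_cross_entropy(y, y_hat)

# Same loss written as a two-class cross entropy over [Y, 1 - Y] and [y_hat, 1 - y_hat]
y_vec = [y, 1 - y]
y_hat_vec = [y_hat, 1 - y_hat]
loss_two_class = -sum(t * math.log(p) for t, p in zip(y_vec, y_hat_vec))

print(loss, loss_two_class)
```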

## Multi Label Classifier

In this case, we have again $\textbf C$ classes, but each input $X$ (image or other input) can have one or several classes out of these $\textbf C$ classes. 
- The training set consists of the inputs $X$ and the ground truth or labels $Y$. 
- Each $Y$ is encoded as a multi-hot vector, that is $Y = [y_1, y_2, y_3, ... y_C]$ where $y_i \in \{0,1\}$ and $\sum_{i=1}^C y_i \ge 1$.

Now we can have $Y$ like $[y_1=0, y_2=0, y_3=0, y_4=1, y_5=0, y_6=1, ... y_C=0 ]$. Therefore, $\hat Y$ will also be a vector of size $\textbf C$ and the last layer of the network will have $\textbf C$ outputs. 

We also want the model to give us $\hat Y$ with each $\hat y_i$ as close to $0$ or $1$ as possible, to match $Y$. But we know that $\sum_{i=1}^C y_i$ can be greater than $1$, so $Softmax$ is not the layer we want here. In fact, we can consider that the model is trying to perform $\textbf C$ binary classifications at the same time: each $\hat y_i$ is the probability that class $i$ is in the image, evaluated independently from the other classes $j \ne i$. In other words, one network evaluates $\textbf C$ binary classification problems in parallel. 

The diagram below summarizes this:
- Input $X$ going through the model and predicting $\hat Y = [\hat y_1, \hat y_2 ... \hat y_C]$
- To ensure that each $\hat y_i$ is between 0 and 1, each output of the network is passed through a $Sigmoid$ layer.
- The loss function is evaluated in two steps:
    1. For each of the $\textbf C$ classes, $BCE_i = Binary \ Cross \ Entropy(y_i, \hat y_i)$ is evaluated.
    2. Then all the $BCE_i$ losses are aggregated into a sum $\sum_{i=1}^C BCE_i$ or a mean $\frac{\sum_{i=1}^C BCE_i}{C}$ to have a single loss.

**Loss Function in the case of a Deep Learning Multi Label Classifier**
<img src="https://raw.githubusercontent.com/vtecftwy/fastbook/master/images/ce_loss_function_03.png">
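
The two-step evaluation described above can be sketched in pure Python as follows. The logits and the multi-hot label are made-up values for $C = 4$ classes, and the helper names are hypothetical.

```python
import math

def sigmoid(z):
    """Squish a raw network output into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y, y_hat):
    """BCE for a single label y in {0, 1} and a predicted probability y_hat."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

logits = [-2.0, 0.3, 3.1, -0.7]   # hypothetical raw outputs, one per class
y = [0, 0, 1, 1]                  # multi-hot label: classes 3 and 4 are present

# Each output goes through its own sigmoid, independently of the others
y_hat = [sigmoid(z) for z in logits]

# Step 1: one binary cross entropy per class
bce = [binary_cross_entropy(t, p) for t, p in zip(y, y_hat)]

# Step 2: aggregate into a single loss, either as a sum or as a mean
loss_sum = sum(bce)
loss_mean = loss_sum / len(bce)

print(bce, loss_sum, loss_mean)
```

Note that, unlike with $Softmax$, nothing forces the values in `y_hat` to sum to 1, which is what allows several classes to be predicted at once.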

# 2. Loss Functions in PyTorch

While the concept of loss function may seem simple when seen as in the diagrams above, libraries use different names for their implementation, and this makes the whole landscape confusing at first glance.

One of the reasons for this apparent confusion is that these loss functions and concepts borrow from very different fields:
- statistics and econometrics for names such as *Likelihood*
- information theory for names such as *cross entropy* 
- ...

The diagrams below show what the main PyTorch Loss Function Layers do in the network:
- nn.CrossEntropyLoss
- nn.NLLLoss
- nn.BCEWithLogitsLoss
- nn.BCELoss
- nn.MSELoss
- nn.L1Loss

Note that PyTorch supports both *Layers/Modules* and *Functions*:
- *Layers/Modules* are PyTorch classes under `nn.LayerName` ([doc.](https://pytorch.org/docs/stable/nn.html#loss-functions)). They are used when you build a model architecture. They also have `forward` and `__call__` methods that allow an instance of the class to be called like a function.
- *Functions* are simple PyTorch functions under `nn.functional.function_name`, commonly imported as `F` (`import torch.nn.functional as F`) and called as `F.function_name` ([doc.](https://pytorch.org/docs/stable/nn.functional.html#loss-functions)). They can be used to build other functions manually, or within custom Layers/Modules.
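
As a small illustration of the two styles, the sketch below computes the same cross entropy loss once through the module and once through the function. The batch of logits and the targets are random or made-up values. One detail worth noting is that, in this common usage, PyTorch expects the targets as class indices rather than as one-hot vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)             # hypothetical batch: 4 samples, C = 3 classes
targets = torch.tensor([0, 2, 1, 2])   # class indices, not one-hot vectors

# Module style: instantiate the layer, then call the instance like a function
loss_module = nn.CrossEntropyLoss()(logits, targets)

# Function style: call the function directly
loss_function = F.cross_entropy(logits, targets)

print(loss_module.item(), loss_function.item())
```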

The equivalence is summarized in the table below:

|Layer - Module       | Functions                          |
|:-------------------:|:----------------------------------:|
|nn.CrossEntropyLoss  | cross_entropy                      |
|nn.NLLLoss           | nll_loss                           |
|nn.BCEWithLogitsLoss | binary_cross_entropy_with_logits   |
|nn.BCELoss           | binary_cross_entropy               |
|nn.MSELoss           | mse_loss                           |
|nn.L1Loss            | l1_loss                            |



<img src="https://raw.githubusercontent.com/vtecftwy/fastbook/master/images/ce_loss_function_04.png">


<img src="https://raw.githubusercontent.com/vtecftwy/fastbook/master/images/ce_loss_function_05.png">


<img src="https://raw.githubusercontent.com/vtecftwy/fastbook/master/images/ce_loss_function_06.png">


<img src="https://raw.githubusercontent.com/vtecftwy/fastbook/master/images/ce_loss_function_07.png">
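
As a rough numerical check of how these layers relate to each other, the sketch below relies on two behaviours described in the PyTorch documentation: `nn.CrossEntropyLoss` combines a log-softmax with `nn.NLLLoss`, and `nn.BCEWithLogitsLoss` combines a sigmoid with `nn.BCELoss`. The tensors are random or made-up values used only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3)             # hypothetical batch: 4 samples, C = 3 classes

# Multiclass: CrossEntropyLoss on logits vs. NLLLoss on log-softmax outputs
targets = torch.tensor([0, 2, 1, 2])   # class indices
ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

# Multi label: BCEWithLogitsLoss on logits vs. BCELoss on sigmoid outputs
multi_hot = torch.tensor([[1., 0., 1.],
                          [0., 1., 0.],
                          [1., 1., 0.],
                          [0., 0., 1.]])   # float multi-hot targets
bce_with_logits = nn.BCEWithLogitsLoss()(logits, multi_hot)
bce = nn.BCELoss()(torch.sigmoid(logits), multi_hot)

print(ce.item(), nll.item())
print(bce_with_logits.item(), bce.item())
```

In practice the versions that take logits directly (`nn.CrossEntropyLoss`, `nn.BCEWithLogitsLoss`) are usually preferred, since the PyTorch documentation describes them as more numerically stable than applying the activation separately.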

# 3. Loss Functions in fastai
fastai uses specific loss function layers, based on these PyTorch loss function layers/modules, but with additional fastai-friendly features ([see doc](https://docs.fast.ai/losses.html)). 

These classes are:
- `CrossEntropyLossFlat`
- `BCEWithLogitsLossFlat`
- `BCELossFlat`
- `MSELossFlat`
- `L1LossFlat`

The correspondence with the PyTorch loss function layers is clear from the names. They essentially behave like the corresponding PyTorch loss function, applied after flattening the input and the target.
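
The fastai documentation describes `CrossEntropyLossFlat` as the same as `nn.CrossEntropyLoss`, but flattening input and target. The sketch below does not call fastai itself: it is a plain-PyTorch approximation of that flattening step, under the assumption that predictions have the class dimension last. The tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
bs, seq_len, n_classes = 2, 5, 3                     # hypothetical sizes
preds = torch.randn(bs, seq_len, n_classes)          # class dimension last
targs = torch.randint(0, n_classes, (bs, seq_len))   # one class index per position

# "Flat" idea: collapse every leading dimension, then apply the usual loss
flat_loss = F.cross_entropy(preds.reshape(-1, n_classes), targs.reshape(-1))

# Plain PyTorch without flattening expects the class dimension in position 1
plain_loss = F.cross_entropy(preds.permute(0, 2, 1), targs)

print(flat_loss.item(), plain_loss.item())
```

Both calls average over the same `bs * seq_len` individual losses, so they should agree. The flattened form simply avoids having to rearrange dimensions by hand.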