# Introduction

The notebooks in this folder analyze different properties of several loss functions commonly used when training machine learning models. For each loss function, we cover

1. the statistic that the prediction converges to when minimizing the loss function
1. a visualization of the loss function for a single example
1. the maximum-likelihood interpretation of the loss function (TODO)

# Symbols

For a given example, we use

* $\mathbf{x}$ to represent the feature vector
* $t$ to represent the label (aka. target)
* $y$ or $y(\mathbf{x})$ to represent the prediction

We assume the function $y(\mathbf{x})$ is completely flexible (
[PRML](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf),
Page 46), so that it can be chosen independently for each $\mathbf{x}$ and the expected loss can be minimized by setting its derivative wrt. $y(\mathbf{x})$ to zero.
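
To sketch how this assumption is used, take the squared loss as an example. This follows the standard argument in PRML; $p(\mathbf{x}, t)$, the joint density of features and label, is notation not defined above. Setting the functional derivative of the expected loss to zero yields the conditional mean:

\begin{align*}
\mathbb{E}[L] &= \iint \left( y(\mathbf{x}) - t \right)^2 p(\mathbf{x}, t) \, d\mathbf{x} \, dt \\
\frac{\delta \mathbb{E}[L]}{\delta y(\mathbf{x})} &= 2 \int \left( y(\mathbf{x}) - t \right) p(\mathbf{x}, t) \, dt = 0 \\
y(\mathbf{x}) &= \frac{\int t \, p(\mathbf{x}, t) \, dt}{p(\mathbf{x})} = \int t \, p(t|\mathbf{x}) \, dt = \mathbb{E}[t|\mathbf{x}]
\end{align*}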

# Summary

| Loss function name | $L(y(\mathbf{x}), t)$ | $\arg \min_{y(\mathbf{x})} \mathbb{E}[L]$ | Likelihood distribution | $p(t \mid \mathbf{x})$ |
|--------------------|-----------------------|-------------------------------------------|-------------------------|------------------------|
| Minkowski loss (q=2, aka. MSE) | $\left\lvert y(\mathbf{x}) - t \right\rvert^2$ | Conditional mean: $\mathbb{E}[t \mid \mathbf{x}]$ | Gaussian | $\frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left \{- \frac{1}{2} \frac{\lvert t - y(\mathbf{x}) \rvert^2}{\sigma^2} \right \}$ |
| Minkowski loss (q=1, aka. MAE) | $\left\lvert y(\mathbf{x}) - t \right\rvert^1$ | Conditional median: $F_{t \mid \mathbf{x}}^{-1}(0.5)$ | Laplace | $\frac{1}{2 \sigma} \exp \left \{- \frac{\lvert t - y(\mathbf{x}) \rvert}{\sigma} \right \}$ |
| Minkowski loss (q=0) | $\left\lvert y(\mathbf{x}) - t \right\rvert^{q \rightarrow 0}$ | Conditional mode | $-$ | $-$ |
| Log-loss (aka. cross-entropy loss) | $-\left[ t \ln y(\mathbf{x}) + (1 - t) \ln(1 - y(\mathbf{x})) \right]$ | Conditional mean: $\mathbb{E}[t \mid \mathbf{x}]$ | Bernoulli | $\exp \Big \{ t \ln y(\mathbf{x}) + (1 - t) \ln (1 - y(\mathbf{x})) \Big \}$ |
| Poisson loss | $y(\mathbf{x}) - t \ln y(\mathbf{x})$ | Conditional mean: $\mathbb{E}[t \mid \mathbf{x}]$ | Poisson | $\exp \Big \{ t \ln y(\mathbf{x}) - y(\mathbf{x}) - \ln t! \Big\}$ |
| Pinball loss | $(t - y(\mathbf{x})) (\tau - \mathbb{I}(t < y(\mathbf{x})))$ | Conditional quantile: $F_{t \mid \mathbf{x}}^{-1}(\tau)$ | Asymmetric Laplace | $\frac{\tau(1 - \tau)}{ \sigma } \exp \left\{ -\frac{(t - y(\mathbf{x})) (\tau - \mathbb{I}(t < y(\mathbf{x})))}{\sigma} \right \}$ |

Note that the derivation of $\arg \min_{y(\mathbf{x})} \mathbb{E}[L]$ does NOT make any assumption about the distributional form of $p(t|\mathbf{x})$.
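
The entries in the $\arg \min$ column can be checked numerically. The snippet below is a minimal sketch, not code taken from the notebooks: it fixes a single $\mathbf{x}$ (so the prediction is one scalar), draws skewed targets from an exponential distribution (an arbitrary choice for illustration), and searches a grid of candidate predictions. The helper names `best_constant`, `squared`, `absolute`, and `pinball` are hypothetical.

```python
import numpy as np

def best_constant(loss, t, grid):
    """Return the grid value y that minimizes the average loss over targets t."""
    avg = [np.mean(loss(y, t)) for y in grid]
    return grid[int(np.argmin(avg))]

def squared(y, t):
    return (y - t) ** 2

def absolute(y, t):
    return np.abs(y - t)

def pinball(y, t, tau=0.9):
    # (t - y) * (tau - 1[t < y]), as in the summary table
    return (t - y) * (tau - (t < y))

rng = np.random.default_rng(0)
# Skewed targets, so that mean, median and 0.9-quantile are clearly different.
t = rng.exponential(scale=2.0, size=20_000)
grid = np.linspace(0.0, 10.0, 2001)  # candidate predictions, step 0.005

best_mse = best_constant(squared, t, grid)
best_mae = best_constant(absolute, t, grid)
best_pin = best_constant(pinball, t, grid)

print("MSE argmin:", best_mse, "sample mean:", t.mean())
print("MAE argmin:", best_mae, "sample median:", np.median(t))
print("Pinball argmin:", best_pin, "sample 0.9-quantile:", np.quantile(t, 0.9))
```

Up to the grid resolution, each minimizer should agree with the corresponding sample statistic, even though the targets are neither Gaussian nor Laplace distributed.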

# Grouping of loss functions by motivation

* by likelihood function, e.g. MSE, log-loss, Poisson loss
* by the metric of interest, e.g. MAE, MAPE
* by the desired properties of the argmin, e.g. quantile (pinball) loss
* by robustness to outliers, e.g. MAE, Huber loss
* etc.
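
To illustrate the robustness motivation, the sketch below fits a single constant prediction to a small hand-made sample containing one outlier. The data, the grid, and the Huber threshold `delta=1.0` are assumptions chosen for illustration; the Huber loss is written in its usual form (quadratic for small residuals, linear for large ones).

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber loss on residual r: quadratic for |r| <= delta, linear beyond."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def squared(r):
    return r**2

def best_constant(loss, t, grid):
    """Return the grid value y that minimizes the average loss of y - t."""
    avg = [np.mean(loss(y - t)) for y in grid]
    return grid[int(np.argmin(avg))]

# Five targets near 1.0 plus a single outlier at 50.0.
t = np.array([0.8, 0.9, 1.0, 1.1, 1.2, 50.0])
grid = np.linspace(0.0, 50.0, 5001)  # candidate predictions, step 0.01

best = {
    "mse": best_constant(squared, t, grid),
    "mae": best_constant(np.abs, t, grid),
    "huber": best_constant(huber, t, grid),
}
print(best)
```

The squared-error fit is pulled to the sample mean (about 9.17), while the MAE fit stays between 1.0 and 1.1 and the Huber fit lands at about 1.2, close to the bulk of the data.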

Each of these loss functions can be made to correspond to some form of likelihood distribution,

\begin{align*}
p(t|\mathbf{x}) = \frac{1}{Z} \exp \left \{- L(y(\mathbf{x}), t) \right\}
\end{align*}

where $Z$ is the normalizing constant. Even so, it seems unreasonable to assume in the first place that a likelihood would follow a distribution such as the Laplace or asymmetric Laplace distribution.
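
As a sanity check on the normalizer $Z$ in the expression above, the sketch below integrates $\exp\{-L\}$ over $t$ numerically for three losses with the scale fixed to $\sigma = 1$. The helper `normalizer`, the integration range, and the step size are assumptions made for illustration. The closed-form values in the comments follow from the densities in the summary table; for the squared loss written without the factor $\frac{1}{2}$, the normalizer is $\sqrt{\pi}$.

```python
import numpy as np

def normalizer(loss, y=0.0, half_width=200.0, n=400_001):
    """Approximate Z = integral of exp(-loss(y, t)) dt with a Riemann sum."""
    t = np.linspace(y - half_width, y + half_width, n)
    dt = t[1] - t[0]
    return np.sum(np.exp(-loss(y, t))) * dt

def squared(y, t):
    return (y - t) ** 2

def absolute(y, t):
    return np.abs(y - t)

TAU = 0.9

def pinball(y, t):
    return (t - y) * (TAU - (t < y))

z_mse = normalizer(squared)   # closed form: sqrt(pi)
z_mae = normalizer(absolute)  # closed form: 2 (Laplace with sigma = 1)
z_pin = normalizer(pinball)   # closed form: 1 / (tau * (1 - tau))

print(z_mse, np.sqrt(np.pi))
print(z_mae, 2.0)
print(z_pin, 1.0 / (TAU * (1.0 - TAU)))
```

Because $Z$ does not depend on $y(\mathbf{x})$ for these losses, maximizing the likelihood and minimizing the summed loss select the same prediction.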