# Maximum Likelihood

Maximum likelihood (ML):
- is a method (an estimation principle rather than one specific algorithm);
- for fitting, i.e. estimating the parameters $\underline\theta$ of;
- a probability distribution/measure, i.e. a parametric function $f_{\underline\theta}$ from a predetermined family;
- given $N$ <u>presumed</u> samples from that distribution.

Steps:
1. Formulate the likelihood function $\mathcal{L}(\underline\theta) = \prod_{n=1}^N p(y^{(n)} | \underline x^{(n)}; f_{\underline\theta})$. Its negative logarithm plays the role of a "loss" or "cost" function.
2. Find parameters $\underline \theta$ that maximise $\mathcal{L}(\underline \theta)$.

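We usually solve the equivalent problem of minimising $- \log \mathcal{L}(\underline \theta)$, which is easier to compute: the logarithm turns the product into a sum and, being monotonic, leaves the optimal $\underline\theta$ unchanged.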
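To make the two steps and the log trick concrete, here is a minimal sketch in plain Python. It assumes a simplified setting that is not in the text above: an unconditional Bernoulli model with a single parameter `theta`, made-up coin-flip data, and a coarse grid search standing in for a real optimiser.

```python
import math

# Hypothetical data: 2000 coin flips, 1400 of them heads (made-up values).
flips = [1] * 1400 + [0] * 600

def likelihood(theta, data):
    """Step 1, literally: the product of per-sample probabilities."""
    out = 1.0
    for y in data:
        out *= theta if y == 1 else 1.0 - theta
    return out

def neg_log_likelihood(theta, data):
    """The equivalent objective: minus the sum of per-sample log-probabilities."""
    return -sum(math.log(theta if y == 1 else 1.0 - theta) for y in data)

# The raw product underflows to 0.0 in double precision ...
raw = likelihood(0.7, flips)
# ... while the negative log-likelihood stays finite and usable.
nll = neg_log_likelihood(0.7, flips)
print(raw, nll)

# Step 2: pick the parameter that minimises the negative log-likelihood.
# A grid search is used here purely for illustration.
grid = [i / 100 for i in range(1, 100)]
theta_hat = min(grid, key=lambda t: neg_log_likelihood(t, flips))
print(theta_hat)
```

For this toy model the estimate coincides with the fraction of heads in the data, which is the known closed-form maximum likelihood solution for a Bernoulli parameter.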

Based on the distribution family chosen, we get different losses. Here are some common ones.

## Square loss
Assume
$$
p\left(y^{(n)} | x^{(n)}; f_{\underline \theta}\right) = \mathcal{N} \left( y^{(n)}; \mu = f_{\underline \theta}\left( x^{(n)} \right), \sigma_y^2 \right),
$$
for some fixed $\sigma_y^2$, where the mean (assume) can be estimated with e.g. a neural network $f_{\underline\theta}$.

Then,
$$
- \log \mathcal{L}(\underline \theta) = c_1 \sum_n \left( y^{(n)} - f_{\underline \theta}\left( \underline x^{(n)} \right) \right)^2 + c_2,
$$
for some (irrelevant for optimisation) constants $c_1 = \frac{1}{2 \sigma_y^2} > 0$ and $c_2 = \frac{N}{2} \log \left( 2 \pi \sigma_y^2 \right)$.

This is the regular sum of squares loss.
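As a quick numerical sanity check, the sketch below compares the Gaussian negative log-likelihood with a scaled and shifted sum of squares. The targets, predictions and variance are made-up values, and the constants `c1 = 1 / (2 * sigma2)` and `c2 = (N / 2) * log(2 * pi * sigma2)` are what the Gaussian density gives under the fixed-variance assumption.

```python
import math

# Hypothetical targets y and model predictions f_theta(x) (made-up values).
ys = [1.0, 2.5, 0.3, 4.1]
preds = [0.8, 2.9, 0.5, 3.6]
sigma2 = 0.5  # assumed fixed noise variance sigma_y^2

def gaussian_log_pdf(y, mu, var):
    """Log-density of N(y; mu, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (y - mu) ** 2 / (2 * var)

# Negative log-likelihood computed directly from the density.
nll = -sum(gaussian_log_pdf(y, mu, sigma2) for y, mu in zip(ys, preds))

# The same quantity written as c1 * (sum of squares) + c2.
sse = sum((y - mu) ** 2 for y, mu in zip(ys, preds))
c1 = 1 / (2 * sigma2)
c2 = 0.5 * len(ys) * math.log(2 * math.pi * sigma2)

print(nll, c1 * sse + c2)
```

Since `c1` is positive and `c2` does not depend on the predictions, the parameters that minimise the sum of squares also minimise the negative log-likelihood.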

## Binary cross entropy loss

Think about binary classification tasks. That is, assume
$$
p\left( y^{(n)} | x^{(n)}; f_{\underline \theta} \right) =
\begin{cases}
    p^{(n)}, & \text{if } y^{(n)} = 1\\
    1 - p^{(n)}, & \text{if } y^{(n)} = 0,
\end{cases}
$$
where $p^{(n)} \stackrel{\text{e.g.}}{=} \sigma\left( f_{\underline\theta}\left( x^{(n)} \right) \right)$, which (assume) can be estimated with e.g. a neural network $f_{\underline\theta}$.

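Then,
$$
\mathcal{L}(\underline\theta) = \prod_n \left( p^{(n)} \right)^{y^{(n)}} \left( 1 - p^{(n)} \right)^{\left( 1 - y^{(n)} \right)},
$$
and therefore
$$
- \log \mathcal{L}(\underline\theta) = - \sum_{n:y^{(n)} = 1} \log p^{(n)} - \sum_{n: y^{(n)} = 0} \log \left( 1 - p^{(n)} \right) = - \sum_n \left[ y^{(n)} \log p^{(n)} + \left( 1 - y^{(n)} \right) \log \left( 1 - p^{(n)} \right) \right].
$$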
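The different ways of writing the binary cross entropy can be checked against each other numerically. The sketch below uses plain Python with made-up model outputs and labels; the helper `sigmoid` and the variable names are illustrative only.

```python
import math

def sigmoid(z):
    """Logistic function, mapping a real-valued output to a probability."""
    return 1 / (1 + math.exp(-z))

# Hypothetical raw model outputs f_theta(x) and binary labels (made-up values).
logits = [2.0, -1.0, 0.5, -3.0]
labels = [1, 0, 0, 1]
ps = [sigmoid(z) for z in logits]

# Form 1: separate sums over the positive and the negative samples.
nll_split = (
    -sum(math.log(p) for p, y in zip(ps, labels) if y == 1)
    - sum(math.log(1 - p) for p, y in zip(ps, labels) if y == 0)
)

# Form 2: a single sum, using the labels as switches.
nll_single = -sum(
    y * math.log(p) + (1 - y) * math.log(1 - p) for p, y in zip(ps, labels)
)

# Form 3: minus the log of the likelihood written as a product.
likelihood = math.prod(p ** y * (1 - p) ** (1 - y) for p, y in zip(ps, labels))
nll_product = -math.log(likelihood)

print(nll_split, nll_single, nll_product)
```

All three values agree up to floating-point rounding. With many samples the product form would underflow, which is one reason the sum forms are the ones used in practice.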

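Now say we have more than two classes. Let $p^{(n)} = p( y^{(n)} | x^{(n)}; f_{\underline\theta} )$ be the probability the model assigns to the true class of sample $n$ (e.g. the corresponding softmax output). Given training set $T=\{(x^{(n)}, y^{(n)})\}_{n=1}^N$, $\mathcal{L}(\underline\theta) = \prod_n p^{(n)}$, so $- \log \mathcal{L}(\underline\theta) = - \sum_n \log p^{(n)}$. This is the (multi-class) cross entropy loss, also called the negative log-likelihood (NLL) loss.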

In [28]:
import torch

# E.g. for a NN with 3 units in the final layer, to which we applied softmax.
probs = torch.tensor([
    [0.1, 0.4, 0.5],
    [0.7, 0.1, 0.2]
])
labels = torch.tensor([1, 0])

# Pick the probability of the true class in each row, take its log, then average.
# (Averaging instead of summing only rescales the loss by 1/N.)
nll_loss = -probs.log().gather(dim=1, index=labels.unsqueeze(1)).mean()
print(nll_loss)

# Alternatively, without using "gather"
nll_loss = torch.zeros_like(labels, dtype=torch.float)
for row in range(probs.size(0)):
    nll_loss[row] = probs[row][labels[row]].log()
nll_loss = -nll_loss.mean()

print(nll_loss)

tensor(0.6365)
tensor(0.6365)
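
For comparison, PyTorch ships built-in versions of this loss. The sketch below assumes the same toy `probs` and `labels` as the cell above. It computes the mean negative log-probability of the true class by hand, then checks it against `torch.nn.functional.nll_loss`, which expects log-probabilities, and `torch.nn.functional.cross_entropy`, which expects unnormalised scores and applies log-softmax itself. Feeding `probs.log()` to `cross_entropy` is only a trick for this check: the rows already sum to one, so the internal log-softmax leaves them unchanged.

```python
import torch
import torch.nn.functional as F

# Same toy softmax outputs and labels as in the cell above.
probs = torch.tensor([
    [0.1, 0.4, 0.5],
    [0.7, 0.1, 0.2]
])
labels = torch.tensor([1, 0])

# Manual version: mean negative log-probability of the true class.
manual = -probs.log().gather(dim=1, index=labels.unsqueeze(1)).mean()

# Built-in version operating on log-probabilities.
builtin_nll = F.nll_loss(probs.log(), labels)

# Built-in version operating on unnormalised scores (logits).
builtin_ce = F.cross_entropy(probs.log(), labels)

print(manual, builtin_nll, builtin_ce)
```

In a real model one would usually pass the raw final-layer outputs to `cross_entropy` directly rather than applying softmax and then a logarithm, since the fused computation is numerically more stable.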
