# Loss functions

In chapter 1 we discussed how, in supervised learning, we derive a function that aims to predict an output from an input. <br> We measure the ability of this function/model to produce good results with a loss function. <br>
We discussed that this loss measures how far the predicted result is from the true result.

We went through the classic linear regression example where the loss function was the $\text{mean squared error}$.<br> At the end of the chapter we asked: why did we use this formula? What makes a good loss function? How do we choose our loss function?

In this chapter we'll:
1. Justify the use of this function in linear regression.
2. Present a method for choosing a loss function based on a given learning problem.
3. Go through a few common learning problems, apply the method to them, and see the results.

### Definition of the Loss 

$$\boxed{\text{Loss function := } L \ : \ \mathbb{R}^{N \times K} \times \mathbb{R}^{N \times K} \rightarrow \mathbb{R}}$$

The above is the formal description, though it's usually seen as follows:

$$\boxed{L[\phi] = L[f[x_i, \phi], y_i] \in \mathbb{R}}$$

It's a description of the **mismatch** between the model predictions $f[x_i, \phi]$ and the ground-truth outputs $y_i$.<br>
NOTE: We use the shorthand "$L[\phi]$" because the data is fixed: the parameters $\phi$ are the only thing we can change, so the loss is truly a function of $\phi$.
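
As a minimal sketch of this definition, the snippet below evaluates the mean squared error from chapter 1 for two parameter choices. The helper `linear_model`, the toy data, and the parameter values are illustrative assumptions, not part of the chapter's own example.

```python
import numpy as np

def mse_loss(predictions, targets):
    """Mean squared error between model predictions and ground-truth outputs."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean((predictions - targets) ** 2))

def linear_model(x, phi):
    """Hypothetical 1D linear model f[x, phi] = phi[0] + phi[1] * x."""
    return phi[0] + phi[1] * np.asarray(x, dtype=float)

# Toy data generated by y = 1 + 2x (assumed for illustration).
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 5.0])

loss_good = mse_loss(linear_model(x, (1.0, 2.0)), y)  # parameters that fit exactly
loss_bad = mse_loss(linear_model(x, (0.0, 1.0)), y)   # parameters that fit poorly
print(loss_good, loss_bad)
```

The data never changes between the two calls; only $\phi$ does, which is why we write the loss as $L[\phi]$.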


---
---
---

# Probability Recap 


## Basic Definitions



**NOTE: This assumes background knowledge**

Before moving on, we'll go over a few main notions that will help us derive loss functions.

$\text{Probability space := a 3-tuple } (\Omega, F, P)$

$\Omega : \text{Sample space := the set of all possible outcomes}$

$F : \text{Field of events := a collection of events (subsets of } \Omega \text{)}$

$\text{Probability function } P : F \rightarrow [0, 1]$

$\text{Random variable } X \text{ := a quantity (discrete or continuous) whose value depends on an outcome we don't know yet.}$

**Personal Note:** <br> **I aim to write the random variables in CAPITAL to always remind myself that they can be any value until we assign a value in lower case <br> (in vector cases the difference will be clear as well).**



| Component | **Discrete Example: Fair Coin Flip** | **Continuous Example: Person's Height** |
|-----------|--------------------------------------|----------------------------------------|
| **$\Omega$** (Sample Space) | $\{\text{Heads}, \text{Tails}\}$ | $[0, \infty)$ (all possible heights in meters) |
| **$F$** (Field of Events) | $\{\emptyset, \{\text{Heads}\}, \{\text{Tails}\}, \{\text{Heads}, \text{Tails}\}\}$ | Borel $\sigma$-algebra on $[0, \infty)$ (all measurable intervals) |
| **$P$** (Probability Function) | $P(\{\text{Heads}\}) = 0.5$ <br> $P(\{\text{Tails}\}) = 0.5$ <br> $P(\Omega) = 1$ | $X \sim \mathcal{N}(\mu = 1.7, \sigma^2 = 0.1^2)$ <br> $P(X \in [1.6, 1.8]) \approx 0.68$ <br> $P(X = 1.75) = 0$ |
| **$X$** (Random Variable) | $X = \begin{cases} 1 & \text{if Heads} \\ 0 & \text{if Tails} \end{cases}$ | $X : \Omega \rightarrow \mathbb{R}$ <br> Maps outcomes to measured height |
| **Key Property** | $P(X = 0) = 0.5$ (point probabilities exist) | $P(X = c) = 0$ for any specific $c$ (use intervals) |


Consider the coin-flip example: if we have no information about the probabilities of landing heads or tails, all we can do is experiment. <br>
This is known as observing instances of a random variable. <br> The relative frequencies of the observed results give an empirical estimate of the $probability \ distribution \ Pr(X)$.
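
A minimal sketch of such an experiment, assuming we simulate a fair coin with NumPy's random generator (the seed and the number of flips are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Observe 10,000 instances of the random variable X (1 = Heads, 0 = Tails).
flips = rng.integers(0, 2, size=10_000)

# Relative frequencies of the observed values estimate Pr(X).
p_heads = float(flips.mean())
empirical = {"Tails": 1.0 - p_heads, "Heads": p_heads}
print(empirical)
```

With more flips the estimated frequencies move closer to the true probabilities of the coin.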


- For a discrete random variable we have $Pr(X = k) \in [0,1]$, where $k$ is a possible outcome of the event, and these probabilities sum to one. <br>

- For a continuous variable, $Pr(X = a) \ge 0$ denotes the value of the probability density function (PDF) at $a$, for each $a$ in the domain of $X$, and the integral of this density over the domain of $X$ is one.

From now on we'll assume that we're dealing with **continuous random variables**.



## Joint Probability and Conditional



### Joint Probability

Suppose we have two random variables $X$ and $Y$. The $joint \ distribution \ Pr(X, Y)$ tells us how likely it is that $X$ and $Y$ take on a specific combination of values.

$$ \boxed{\int \int Pr(X, Y)\cdot dxdy = 1 }$$

In some cases we can store multiple random variables in a vector $\vec{x}$ so the joint distribution of the vector is $Pr(\vec{x})$
and similarly we can also have $Pr(\vec{x}, \vec{y})$

### Marginalization

If we're provided with $Pr(X, Y)$ over two random variables, we're able to recover the $marginal$ distributions $Pr(X)$ and $Pr(Y)$ by integrating over the other variable.

$$ \boxed{ \int Pr(X, Y) \cdot dx = Pr(Y)}$$

$$ \boxed { \int Pr(X, Y) \cdot dy = Pr(X)}$$

We're computing the distribution of one variable regardless of the value of the other variable. <br> This extends to higher dimensions, where the same process is applied.
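
As a small sketch of marginalization, the snippet below uses a discrete joint table, so the integrals become sums. The table values are made up for illustration.

```python
import numpy as np

# Hypothetical joint table Pr(X, Y): rows index values of X, columns index values of Y.
joint = np.array([
    [0.10, 0.20, 0.10],
    [0.25, 0.05, 0.30],
])

pr_x = joint.sum(axis=1)  # sum over Y to recover Pr(X)
pr_y = joint.sum(axis=0)  # sum over X to recover Pr(Y)
print(pr_x, pr_y)
```

Each marginal is itself a valid distribution: its entries sum to one.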




### Conditional Probability and Likelihood

The $Conditional \ probability \ Pr(X | Y)$ is the probability of variable $X$ taking some value given a known value of $Y$. <br>
NOTE: This definition does not assume causality or direction of influence between $X$ and $Y$.<br>
- We could have $X = \text{ \# of heads landed}$ and $Y = \text{ \# of tails landed}$
- We could have $X = \text{ \# of heads landed}$ and $Y = \text{ \# of passengers entering a bus in an hour}$

$$ \boxed{Pr(X | Y) = \frac{Pr(X, Y)}{Pr(Y)} }$$

$$ \boxed{Pr(Y | X) = \frac{Pr(X, Y)}{Pr(X)} }$$

Read it as: "The probability of X occurring (occurring means taking some value) given that Y occurred is the probability of both X and Y occurring divided by the probability of Y occurring irrespective of X."

**Note: We can obtain the conditional probability from the joint probability using the formulas above.**
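
A minimal sketch of this formula on a discrete joint table (the table values are an assumption chosen for illustration):

```python
import numpy as np

# Hypothetical joint table Pr(X, Y): rows index values of X, columns index values of Y.
joint = np.array([
    [0.10, 0.20, 0.10],
    [0.25, 0.05, 0.30],
])

pr_y = joint.sum(axis=0)     # Pr(Y), shape (3,)
pr_x_given_y = joint / pr_y  # broadcasting divides each column by Pr(Y = y)
print(pr_x_given_y)
```

Every column of `pr_x_given_y` is a distribution over $X$ for one fixed value of $Y$, so every column sums to one.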


### Chain Rule

The joint probability of multiple events can be broken into conditional probabilities:

$$Pr(X_1, X_2) = Pr(X_2 | X_1) \cdot Pr(X_1)$$

$$Pr(X_1, X_2, X_3) = Pr(X_3 | X_2, X_1) \cdot Pr(X_1, X_2) = Pr(X_3 | X_2, X_1) \cdot Pr(X_2 | X_1) \cdot Pr(X_1)$$

$$Pr(X_1, X_2, X_3, X_4) = Pr(X_4 | X_3, X_2, X_1) \cdot Pr(X_3, X_2, X_1) = Pr(X_4 | X_3, X_2, X_1) \cdot Pr(X_3 | X_2, X_1) \cdot Pr(X_2, X_1) = Pr(X_4 | X_3, X_2, X_1) \cdot Pr(X_3 | X_2, X_1) \cdot Pr(X_2 | X_1) \cdot Pr(X_1)$$

**General Explicit Form**
$$\boxed{Pr(X_1, X_2, \ldots, X_n) = Pr(X_n | X_{n-1}, \ldots, X_1) \cdot Pr(X_{n-1} | X_{n-2}, \ldots, X_1) \cdots Pr(X_2 | X_1) \cdot Pr(X_1)}$$

**Compact Form**
$$\boxed{Pr(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} Pr(X_i | X_{i-1}, \ldots, X_1)}$$

**Important Clarification:**

The subscript order ($X_1, X_2, \ldots, X_n$) is **purely notational** and does **not** imply:
- ❌ Temporal ordering (events happening in sequence)
- ❌ Causal relationships between variables
- ❌ That $X_1$ must occur "before" $X_2$

It's simply a **labeling convention** to systematically decompose the joint probability. We could equally write:
$$Pr(X_3, X_1, X_2) = Pr(X_2 | X_3, X_1) \cdot Pr(X_1 | X_3) \cdot Pr(X_3)$$

The ordering just needs to be consistent within the decomposition.
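
The decomposition can be checked numerically. The sketch below builds a random discrete joint distribution over three variables (the shapes and the seed are arbitrary assumptions) and rebuilds it from the chain-rule factors.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Random joint table Pr(X1, X2, X3) with 2, 3 and 4 possible values respectively.
joint = rng.random((2, 3, 4))
joint /= joint.sum()  # normalize so all entries sum to one

pr_x1 = joint.sum(axis=(1, 2))                    # Pr(X1), shape (2,)
pr_x1_x2 = joint.sum(axis=2)                      # Pr(X1, X2), shape (2, 3)
pr_x2_given_x1 = pr_x1_x2 / pr_x1[:, None]        # Pr(X2 | X1)
pr_x3_given_x1_x2 = joint / pr_x1_x2[:, :, None]  # Pr(X3 | X2, X1)

# Chain rule: Pr(X1, X2, X3) = Pr(X3 | X2, X1) * Pr(X2 | X1) * Pr(X1)
rebuilt = pr_x3_given_x1_x2 * pr_x2_given_x1[:, :, None] * pr_x1[:, None, None]
chain_rule_holds = bool(np.allclose(rebuilt, joint))
print(chain_rule_holds)
```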



#### Distinction between Likelihood and Conditional

Suppose $X$ is our **data (observations/outcomes)**.  <br>
Suppose $Y$ is our **parameters or model specification**.

##### Conditional Says $Pr(X| Y)$
-  **We fix $Y$ and let $X$ vary:** "Given these fixed parameters what are the chances of seeing different outcomes?" 
-  This is a **Probability Density Function** or **Probability Mass Function**
-  Since $X$ ranges over all possible outcomes in the sample space, IT MUST SUM (integrate) TO 1.
-  Example: For a fair die, the probabilities of rolling each of $\{1, 2, 3, 4, 5, 6\}$ sum to 1.

##### Likelihood uses $Pr(X | Y)$ but interpreted as a function of $Y$
- **We fix $X$ and let $Y$ vary:** “Given the observed outcome, how plausible are different parameter values?”
- This is called the **Likelihood Function:** $L(Y \mid X=x) \;\equiv\; \Pr(X=x \mid Y)$
- We're comparing different models/parameters against the **same observed data**
- The likelihood is not a probability distribution over Y and therefore does not need to sum to 1
- Example: After observing a roll of “6”, we can compare <br>
$L(\text{fair} \mid 6) = \Pr(6 \mid \text{fair}), \quad
L(\text{weighted} \mid 6) = \Pr(6 \mid \text{weighted})$


##### So why do conditional probability and likelihood use the same expression? 
##### A. Because they use the same mathematical quantity but with different variables fixed
In the **Likelihood**, when we write $Pr(X | Y)$ we mean $Pr(X = x | Y = ?)$ and view it as a function of $Y$. <br>
In the **Conditional**, when we write $Pr(X | Y)$ we mean $Pr(X = ? | Y = y)$ and view it as a function of $X$.
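
The two readings can be sketched with the die example. The probabilities of the "weighted" die below are a made-up assumption; only the fair die comes from the text.

```python
import numpy as np

# Each entry is Pr(X | Y = model) over the faces 1..6.
models = {
    "fair": np.full(6, 1.0 / 6.0),
    "weighted": np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5]),  # hypothetical loaded die
}

# Conditional view: fix the model, vary the outcome. Each distribution sums to 1.
row_sums = {name: float(probs.sum()) for name, probs in models.items()}

# Likelihood view: fix the observed outcome (a roll of 6), vary the model.
observed_face = 6
likelihood = {name: float(probs[observed_face - 1]) for name, probs in models.items()}
likelihood_total = sum(likelihood.values())

print(row_sums)
print(likelihood, likelihood_total)
```

The likelihood values do not sum to one across models, which is exactly why the likelihood is not a probability distribution over $Y$.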



## Bayes' rule




Given the above we can manipulate a few of the formulas: 

$$Pr(X, Y) = Pr(X | Y)Pr(Y) = Pr(Y | X)Pr(X)$$

$$ \downarrow $$

$$\boxed{Pr(X | Y) = \frac{Pr(Y | X)Pr(X)}{Pr(Y)}}$$

$Pr(X | Y) = \textcolor{lightblue}{Posterior \ Probability}$ <br>
$Pr(Y | X) = \textcolor{lightblue}{Likelihood}$ <br>
$Pr(X) = \textcolor{lightblue}{Prior \ Probability}$ <br>
$Pr(Y) = \textcolor{lightblue}{Evidence \ Probability}$ <br>

This equation maps what we know about $X$ before observing $Y$, the prior $Pr(X)$, to the posterior $Pr(X | Y)$: what we know about $X$ after observing $Y$.

This is important since it indicates how observing $Y$ changes what we believe about $X$.
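
A minimal sketch of Bayes' rule, reusing the fair versus weighted die setting. Here $X$ is the unknown die type and $Y$ is an observed roll of 6. The prior and the weighted die's probability of a 6 are assumptions chosen for illustration.

```python
# Prior Pr(X): what we believe about the die before rolling (hypothetical values).
prior = {"fair": 0.9, "weighted": 0.1}

# Likelihood Pr(Y = 6 | X): probability of rolling a 6 under each die.
likelihood = {"fair": 1.0 / 6.0, "weighted": 0.5}

# Evidence Pr(Y = 6): marginalize the joint over X.
evidence = sum(likelihood[m] * prior[m] for m in prior)

# Posterior Pr(X | Y = 6) = Pr(Y = 6 | X) * Pr(X) / Pr(Y = 6).
posterior = {m: likelihood[m] * prior[m] / evidence for m in prior}
print(posterior)
```

Observing a 6 raises our belief in the weighted die from the assumed prior of 0.1 to a posterior of about 0.25.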


### Independence

If the value of the random variable $Y$ tells us nothing about $X$ **AND** vice versa, we say that $X$ and $Y$ are $\text{independent}$, and therefore:

$$Pr(X | Y) = Pr(X)$$
$$Pr(Y | X) = Pr(Y)$$

It means that the distribution $Pr(Y | X = •)$ is the same for every value of $X$.
<br>
Likewise, the distribution $Pr(X | Y = •)$ is the same for every value of $Y$.

$$Pr(X, Y) = Pr(X | Y)Pr(Y) = Pr(X)Pr(Y)$$

$$\downarrow$$

$$Pr(X, Y)  = Pr(X)Pr(Y)$$
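
The factorization gives a simple numerical test for independence on a discrete table: compare the joint with the product of its own marginals. Both tables below are illustrative assumptions, and `is_independent` is a hypothetical helper name.

```python
import numpy as np

def is_independent(joint):
    """Check whether Pr(X, Y) equals Pr(X) * Pr(Y) for every pair of values."""
    pr_x = joint.sum(axis=1)
    pr_y = joint.sum(axis=0)
    return bool(np.allclose(joint, np.outer(pr_x, pr_y)))

# A joint built as a product of marginals is independent by construction.
independent_joint = np.outer([0.4, 0.6], [0.35, 0.25, 0.40])

# A joint with the same marginals that does not factorize.
dependent_joint = np.array([
    [0.10, 0.20, 0.10],
    [0.25, 0.05, 0.30],
])

print(is_independent(independent_joint), is_independent(dependent_joint))
```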




---
---
---

## Maximum Likelihood

Until now we've held the view that the model (our function) produces a **direct** output $f[x, \phi]$ (the model's prediction) based on the input $x$ and the parameters $\phi$.<br>
Let's change this perspective into a probabilistic one, where we consider the model as computing a $\text{Conditional Probability } Pr(Y | X)$ over the possible outputs $Y$ given the inputs $X$.<br>

With this in mind, the loss encourages each training output $Y_i$ to have a high probability under the distribution $Pr(Y_i | X_i)$ computed from the corresponding input $X_i$.

Viewed as a function of the model parameters, this is indeed the likelihood discussed above.

|Example 1 | Example 2 | Example 3 | Example 4|
|----------|-----------|-----------|----------|
| <div align="center"> <img  src="../images/chap4/regesLike.png" alt="Linear Regression Distribution" width="700" /></div> | <div align="center"> <img  src="../images/chap4/classifierLike.png" alt="Discrete Distribution" width="700" /></div>| <div align="center"> <img  src="../images/chap4/classfier2Like.png" width="420" /></div>|  <div align="center"> <img  src="../images/chap4/contLike.png" alt="ReLU Function" width="700" /></div>|
|This is a regression problem where we want to predict $y \in \mathbb{R}$. <br> We look at the data, see what the distribution is, and ask how "likely" $Y = y$ is.| Here the task is to assign a class $\in \{1, 2, 3, 4\}$. Our data is discrete, <br>thus our distribution is in histogram form. <br> We ask the same question: "Given we have data $x$, looking at the distribution, how 'likely' is $Y = y$?" | This task predicts counts, where our input data is continuous, $X \in [0, 10]$. So we produce a histogram and ask how likely $Y=y$ is given the data. | This is a directional problem, so $Y \in (-\pi, \pi]$, and our input data is also continuous, $X \in [0, 10]$.|




### How do we convert a model to compute probability distributions

We need to choose a **parametric** distribution defined over the output domain $Y$. 

$$\boxed{Pr(Y | \theta)}$$

Then we use the network to compute the parameters $\theta$ of this distribution.

For example, if our output $Y \in \mathbb{R}$ then it may be suitable to choose the normal distribution. <br>
The parameters that define it are the mean and variance, so we have $\theta = \{\mu, \sigma^2 \}$.

**Optimization point** <br>

The model in this case only needs to learn the mean $\mu$: $\sigma^2$ can be derived separately, so during training it can technically be treated as a constant.
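
As a sketch of this idea, the snippet below lets a hypothetical linear model predict only the mean $\mu$ of a normal distribution while $\sigma^2$ is held constant. The model form, the parameter values, and $\sigma^2 = 0.25$ are all assumptions for illustration.

```python
import numpy as np

def gaussian_pdf(y, mu, sigma2):
    """Density of the normal distribution with mean mu and variance sigma2."""
    return np.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def model(x, phi):
    """Hypothetical model: predicts the mean, mu = phi[0] + phi[1] * x."""
    return phi[0] + phi[1] * x

phi = (1.0, 2.0)  # assumed model parameters
sigma2 = 0.25     # variance treated as a constant

mu = model(1.5, phi)  # theta for the input x = 1.5

density_at_mean = float(gaussian_pdf(4.0, mu, sigma2))  # output equal to the predicted mean
density_far = float(gaussian_pdf(6.0, mu, sigma2))      # output far from the predicted mean
print(mu, density_at_mean, density_far)
```

An observed output close to the predicted mean receives a high density, and one far away receives a density near zero.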

Below are different distributions used to build loss functions for different prediction goals.

<div align="center">
<img src="../images/chap4/distrLoss.png" width="700" />
</div>



### Maximum Likelihood Criterion



Our model for each training input now looks like this: $$\theta_i  = f[x_i, \phi]$$ 

Each observed training output $y_i$ should have a high probability under the corresponding distribution $Pr(y_i | \theta_i)$.<br>

So we want to choose the parameters $\phi$ that maximize the combined probability of all the observed training outputs.

$$ 
\begin{align} 
\hat{\phi} &= \mathbf{argmax}_{\phi}\big[ Pr(y_1, y_2, \dots, y_N | x_1, x_2, \dots, x_N ) \big] \\ 
& \Downarrow \ i.i.d \ (Independent \ and \ Identically \ Distributed) \\
 &= \mathbf{argmax}_{\phi}\big[\ \prod_{i=1}^N Pr(y_i | x_i) \big] \\
&= \mathbf{argmax}_{\phi}\big[\prod_{i=1}^N Pr(y_i | \theta_i) \big ] \\
&= \mathbf{argmax}_{\phi}\big[\prod_{i=1}^N Pr(y_i | f[x_i, \phi]) \big ]
\end{align}
$$

This is known as the $\textcolor{lightblue}{Maximum \ Likelihood \ Criterion}$




### Maximizing log-Likelihood

Note that we're dealing with a product of probabilities, which can lead to two main issues.

1. Theoretical – If the number of samples is large enough, the product can converge to 0. $^*$ 
2. Implementation – A computer can only keep track of a finite degree of precision, so a very small product underflows.

Fortunately $\log$ will help us here. It is monotonically increasing, so the parameters that maximize the likelihood also maximize its logarithm and no information is lost $^*$. We now see why this function helps our problem.

$$
\begin{align} 
\hat{\phi} 
&= \mathbf{argmax}_{\phi}\left[\prod_{i=1}^N Pr(y_i | f[x_i, \phi]) \right] \\
&= \mathbf{argmax}_{\phi}\left[ \log \left( \prod_{i=1}^N Pr(y_i | f[x_i, \phi]) \right) \right] \\
&= \mathbf{argmax}_{\phi}\left[ \sum_{i=1}^N \log \left(Pr(y_i | f[x_i, \phi]) \right) \right]
\end{align}
$$

This resolves the finite precision problem we had with the product.
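
The precision issue is easy to reproduce. The sketch below assumes 2000 samples that each have likelihood 0.5 (a made-up value) and compares the raw product with the sum of logs in 64-bit floating point.

```python
import numpy as np

# Hypothetical per-sample likelihoods Pr(y_i | f[x_i, phi]).
probs = np.full(2000, 0.5)

product = float(np.prod(probs))                # 0.5**2000 is too small for float64
log_likelihood = float(np.sum(np.log(probs)))  # 2000 * log(0.5) is a moderate number

print(product, log_likelihood)
```

The product underflows to exactly zero, so every parameter setting would look equally bad, while the log-likelihood remains an ordinary finite number that we can still compare and optimize.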

$^{*\text{For the curious, the proof is located in the proof directory}}$ 


### Minimizing negative log-likelihood

Note that in ML we aim to **minimize** the loss, so we flip the sign:

$$
\begin{align} 
\hat{\phi} 
&= \mathbf{argmin}_{\phi}\left[ -\sum_{i=1}^N \log \left(Pr(y_i | f[x_i, \phi]) \right) \right] \\
&= \mathbf{argmin}_{\phi}\big[ L[\phi] \big]
\end{align}
$$

This forms the Final Loss function $L[\phi]$
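
As a worked example, assume the normal distribution with a constant variance $\sigma^2$ and a model that predicts the mean, $\mu_i = f[x_i, \phi]$. Under these assumptions the negative log-likelihood reduces to a sum of squared errors:

$$
\begin{align}
Pr(y_i | f[x_i, \phi]) &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(y_i - f[x_i, \phi])^2}{2\sigma^2} \right] \\
L[\phi] &= -\sum_{i=1}^N \log \left( Pr(y_i | f[x_i, \phi]) \right) \\
&= \sum_{i=1}^N \left[ \frac{(y_i - f[x_i, \phi])^2}{2\sigma^2} + \frac{1}{2}\log\left(2\pi\sigma^2\right) \right] \\
\hat{\phi} &= \mathbf{argmin}_{\phi}\left[ \sum_{i=1}^N (y_i - f[x_i, \phi])^2 \right]
\end{align}
$$

The last step drops the additive term and the positive factor $\frac{1}{2\sigma^2}$, since neither depends on $\phi$ and so neither changes the position of the minimum. Dividing by $N$ gives the mean squared error, which also has the same minimizer.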

### Inference

Since our model now predicts a probability distribution over possible $y$, to perform inference we need to extract the maximum of the distribution. 

$$\boxed{\hat{y} = \mathbf{argmax}_{y}\big[Pr(y | f[x, \hat{\phi}])\big]}$$
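
A minimal sketch of this step for two cases: a continuous output, where we search a grid of candidate values, and a discrete output, where we pick the most probable class. The predicted mean, the variance, the grid, and the class probabilities are all assumed values.

```python
import numpy as np

def gaussian_pdf(y, mu, sigma2):
    """Density of the normal distribution with mean mu and variance sigma2."""
    return np.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

# Continuous case: the model predicted mu_hat for some input x (assumed values).
mu_hat, sigma2 = 4.0, 0.25
y_grid = np.linspace(0.0, 8.0, 801)  # candidate outputs, spaced 0.01 apart
y_hat = float(y_grid[np.argmax(gaussian_pdf(y_grid, mu_hat, sigma2))])

# Discrete case: the model predicted probabilities for classes 1..4 (assumed values).
class_probs = np.array([0.1, 0.6, 0.2, 0.1])
class_hat = int(np.argmax(class_probs)) + 1  # +1 because classes are numbered from 1

print(y_hat, class_hat)
```

For the normal distribution the grid search is not really needed, because the maximum of the density is always at the mean, so $\hat{y} = \mu$.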

## Method: Constructing Loss Functions

$$\boxed{\begin{aligned}
&1. \text{ Given the output choose a suitable probability distribution } Pr(y | \theta) \text{ defined over the domain of predictions} \\
\\
&2. \text{ Set the ML model } f[x, \phi] \text{ to predict all independant parameters } \\
&\quad \text{(and compute the rest of the parameters based on what's learnt) so } \theta = f[x, \phi] \text{ and } Pr(y | \theta) = Pr(y | f[x, \phi]) \\
\\
&3. \text{ We train the model to find the network parameters } \hat{\phi} \text{ that minimizes the negative log-likelihood} \\
&\quad \text{over the training dataset } \{x_i, y_i\}_{i=1}^N \\
\\
&4. \text{ When needed to perform the inference we'll apply the argmax of the distribution } Pr(y | f[x, \hat{\phi}])
\end{aligned}}$$
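
The four steps can be sketched end to end for a regression problem. Everything specific below is an assumption made for illustration: the synthetic data, the linear model, the fixed variance, and the use of plain gradient descent with a hand-picked learning rate.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic training set (hypothetical): y = 1 + 2x + Gaussian noise.
x = rng.uniform(0.0, 10.0, size=200)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=200)

# Step 1: choose a normal distribution over y, with variance treated as a constant.
SIGMA2 = 0.25

def f(x, phi):
    """Step 2: the model predicts the mean, theta = mu = phi[0] + phi[1] * x."""
    return phi[0] + phi[1] * x

def negative_log_likelihood(phi):
    """Step 3: L[phi] = -sum_i log Pr(y_i | f[x_i, phi]) for the normal distribution."""
    residual = y - f(x, phi)
    return float(np.sum(residual ** 2 / (2.0 * SIGMA2)
                        + 0.5 * np.log(2.0 * np.pi * SIGMA2)))

def gradient(phi):
    """Gradient of the negative log-likelihood with respect to phi."""
    residual = y - f(x, phi)
    return np.array([-np.sum(residual), -np.sum(residual * x)]) / SIGMA2

# Step 3 (continued): minimize L[phi] with plain gradient descent.
phi = np.zeros(2)
initial_loss = negative_log_likelihood(phi)
learning_rate = 1e-5
for _ in range(20_000):
    phi = phi - learning_rate * gradient(phi)
final_loss = negative_log_likelihood(phi)

# Step 4: inference. The argmax of a normal distribution is its mean.
y_hat = float(f(3.0, phi))

# For comparison: the ordinary least-squares fit of the same data.
slope, intercept = np.polyfit(x, y, 1)
print(phi, (intercept, slope), y_hat)
```

Under these assumptions the parameters found by minimizing the negative log-likelihood agree with the ordinary least-squares fit, which is the link between this method and the mean squared error from chapter 1.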