## Loss Function

### Bernoulli variable - Log Loss Function

As discussed before, a Bernoulli random variable $X \sim \textbf{Bern}(p)$ has a probability $p$ of turning up $1$ and $1-p$ of turning up $0$.

For succinctness, we can rewrite the probability mass function as:
$$f(x;p) = p^x (1-p)^{1-x}$$

Deciphering this, we see that if $x=1$, the probability is $p$; if $x=0$, the probability is $1-p$.
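As a quick sanity check, the compact form of the probability mass function can be evaluated directly. This is a minimal sketch; the function name `bernoulli_pmf` and the value `p = 0.3` are chosen purely for illustration.

```python
def bernoulli_pmf(x, p):
    """Return f(x; p) = p**x * (1 - p)**(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return p**x * (1 - p) ** (1 - x)


p = 0.3  # arbitrary illustrative value, not taken from the text
print(bernoulli_pmf(1, p))  # the x = 1 case reduces to p
print(bernoulli_pmf(0, p))  # the x = 0 case reduces to 1 - p
```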

<ins>Example 1: Biased coin toss</ins>

Assume we have 100 tosses of a biased coin, and we are trying to estimate the probability $p$ of the coin turning up heads ($H$). We observe 68 $H$. How do we estimate the probability $p$? We can use an MLE approach, writing $x_i = 1$ if toss $i$ is $H$ and $x_i = 0$ otherwise:
$$L(x;p) = \prod_{i=1}^{100} p^{x_i} (1-p)^{1-x_i}$$
$$\ln{L(x;p)} = \sum_{i=1}^{100} \left[ x_i\ln{p} + (1-x_i)\ln{(1-p)} \right]$$

We have recovered the Log-Likelihood function. Negating it and averaging over the 100 tosses gives, by definition, the Log-Loss function:
$$\text{Log-Loss} = -\frac{1}{100} \sum_{i=1}^{100} \left[ x_i\ln{p} + (1-x_i)\ln{(1-p)} \right]$$

Since 68 of the $x_i$ equal $1$ and 32 equal $0$, the Log-Likelihood function simplifies to:
$$\ln{L(x;p)} = 68 \ln{p} + 32 \ln{(1-p)}$$
$$\implies \frac{\partial \ln{L(x;p)}}{\partial p} = \frac{68}{p} - \frac{32}{1-p} = 0 \implies p = \frac{17}{25}$$
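The closed-form result can be checked numerically. The sketch below evaluates the log-likelihood $68 \ln p + 32 \ln(1-p)$ on a grid of candidate values and picks the maximizer; the grid resolution and the function names are assumptions made for illustration. Because the log-loss is just the negated, averaged log-likelihood, minimizing it selects the same $p$.

```python
import math


def log_likelihood(p, heads=68, tails=32):
    """Log-likelihood of observing `heads` H and `tails` T with P(H) = p."""
    return heads * math.log(p) + tails * math.log(1 - p)


def log_loss(p, heads=68, tails=32):
    """Negative log-likelihood averaged over all tosses."""
    return -log_likelihood(p, heads, tails) / (heads + tails)


# Candidate values 0.001, 0.002, ..., 0.999 (endpoints excluded: log(0) is undefined)
grid = [k / 1000 for k in range(1, 1000)]

p_mle = max(grid, key=log_likelihood)  # maximizes the log-likelihood
p_min_loss = min(grid, key=log_loss)   # minimizes the log-loss

print(p_mle, p_min_loss)  # both should agree with 17/25
```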

The Log-Loss function matters most in classification tasks. Assume that we have two types of images: dog and not-dog. The classifier is trying to identify whether an image is a dog or not. Hence, this problem is identical to estimating the probability that an image is a dog $(H)$ or not a dog $(T)$, which is a Bernoulli random variable.
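To make the classification link concrete, here is a minimal sketch of the log-loss when each image gets its own predicted probability of being a dog. The labels and predicted probabilities are made-up illustrative values, and `binary_log_loss` is a hypothetical helper, not a function from any particular library.

```python
import math


def binary_log_loss(labels, probs):
    """Average negative Bernoulli log-likelihood.

    labels: 1 for dog, 0 for not-dog
    probs:  predicted probability that each image is a dog (strictly between 0 and 1)
    """
    total = 0.0
    for y, q in zip(labels, probs):
        total += y * math.log(q) + (1 - y) * math.log(1 - q)
    return -total / len(labels)


# Made-up data for illustration only
labels = [1, 0, 1, 1, 0]
good_probs = [0.9, 0.2, 0.8, 0.7, 0.1]  # mostly confident and correct
bad_probs = [0.4, 0.6, 0.5, 0.3, 0.7]   # mostly leaning the wrong way

print(binary_log_loss(labels, good_probs))
print(binary_log_loss(labels, bad_probs))
```

Predictions that put high probability on the true label yield a lower loss, which is why minimizing the log-loss pushes a classifier toward well-calibrated, correct probabilities.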

### Least squares - Mean Squared Error Loss Function

As we saw in regression-type problems, the mean squared error is:
$$L = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2=\frac{1}{n} \sum_{i=1}^n (y_i - \beta x_i)^2$$

However, to think of the mean squared error loss function in terms of an MLE-type problem, we can write out the original regression equation:
$$y = \beta x + \epsilon$$

Assume that $\epsilon \sim N(0, \sigma^2)$, hence $y - \beta x \sim N(0, \sigma^2)$. If we have $n$ data points and we fit a Gaussian likelihood function with mean $\mu=0$ and standard deviation $\sigma$ to the residuals, then the likelihood function can be written as:
$$L(x; \beta, \sigma) = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}}\exp\left(-\frac{(y_i - \beta x_i)^2}{2\sigma^2}\right)$$
$$\implies \ln{L(x; \beta, \sigma)} = \sum_{i=1}^n \left[ -\frac{1}{2}\ln{(2\pi)} - \ln{\sigma} - \frac{(y_i - \beta x_i)^2}{2\sigma^2} \right]$$
After $\arg \max_{\beta, \sigma}$, i.e. setting the partial derivatives with respect to $\beta$ and $\sigma$ to zero:
$$\beta = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i x_i}$$
$$\sigma^2 = \frac{\sum_{i=1}^n (y_i - \beta x_i)^2}{n}$$

Note that when we compute $\frac{\partial \ln L}{\partial \beta}$, only the last term, $-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \beta x_i)^2$, depends on $\beta$. Since $\sigma^2 > 0$, the factor $\frac{1}{2\sigma^2}$ does not change where the optimum lies, so maximizing over $\beta$ is the same as minimizing $\sum_{i=1}^n (y_i - \beta x_i)^2$. Therefore, we have shown that maximizing the log-likelihood, with the assumption that the error term $\epsilon \sim N(0, \sigma^2)$, is the same as minimizing the mean squared error.
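The equivalence can be illustrated numerically. The sketch below uses a small made-up dataset (not from the text) and the no-intercept model $y = \beta x + \epsilon$; it computes the closed-form $\beta$, then compares the mean squared error and the Gaussian log-likelihood at that value and at two nearby values. The helper names are chosen for illustration.

```python
import math

# Made-up data for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)


def mse(beta):
    """Mean squared error of the no-intercept model y = beta * x."""
    return sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y)) / n


def gaussian_log_likelihood(beta, sigma2):
    """Log-likelihood of the residuals under N(0, sigma2)."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma2) - (yi - beta * xi) ** 2 / (2 * sigma2)
        for xi, yi in zip(x, y)
    )


# Closed-form maximum likelihood estimates
beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
sigma2_hat = mse(beta_hat)  # the MLE of sigma^2 is the mean squared residual

# Moving beta away from beta_hat raises the MSE and lowers the log-likelihood
for beta in (beta_hat - 0.05, beta_hat, beta_hat + 0.05):
    print(beta, mse(beta), gaussian_log_likelihood(beta, sigma2_hat))
```

Holding $\sigma^2$ fixed while varying $\beta$ mirrors the argument above: the log-likelihood is a decreasing function of the sum of squared residuals, so both criteria pick the same $\beta$.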