## 6. Joker card: Linear Regression with Laplacian Noise

In the course, we obtained the closed-form solution for linear regression (OLS) using maximum likelihood estimation (MLE), under the assumption that the data follow a linear relationship with additive, normally distributed (Gaussian) noise. In this exercise, we will study the use of a different noise model.

Let $z \sim \text{Laplace}(\mu, \beta)$ denote a random variable drawn from a univariate Laplace distribution with mean $\mu$ and scale parameter $\beta$. Its PDF is

$$
p(z \mid \mu, \beta) = \frac{1}{2\beta} \exp \Big( \frac{-|z - \mu|}{\beta} \Big)
$$
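
As a quick sanity check of the density above, here is a minimal NumPy sketch. The function name `laplace_pdf` and the chosen values of $\mu$ and $\beta$ are illustrative assumptions, not part of the exercise. It evaluates the PDF and checks numerically that it integrates to roughly 1.

```python
import numpy as np

def laplace_pdf(z, mu=0.0, beta=1.0):
    """Density of Laplace(mu, beta) evaluated at z (scalar or array)."""
    z = np.asarray(z, dtype=float)
    return np.exp(-np.abs(z - mu) / beta) / (2.0 * beta)

# Illustrative parameters (assumed, not from the exercise).
mu, beta = 1.0, 2.0

# Crude Riemann sum over a wide grid; the result should be close to 1.
grid = np.linspace(-50.0, 50.0, 200_001)
step = grid[1] - grid[0]
area = float(np.sum(laplace_pdf(grid, mu, beta)) * step)

# The density peaks at z = mu, where it equals 1 / (2 * beta).
peak = float(laplace_pdf(mu, mu, beta))
```

The peak value $1/(2\beta)$ follows directly from setting $z = \mu$ in the formula.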

Given $\mathbf{X}$, the matrix of size $N \times D$ collecting the input data, and $y$, the vector of labels of length $N$, our goal is to do linear regression, $y = h(\mathrm{w}^T \mathrm{x})$, under the assumption that each label $y_i$ comes from the linear relationship affected by Laplace noise:

$$
y_i \sim \text{Laplace}(\mathrm{w}^T \mathrm{x}_i,\beta)
$$

You will now use the MLE steps, as seen in the course, to obtain the model parameters $\mathrm{w}$. For simplicity, we will assume $\mathrm{w}_0 = 0$ (no intercept).
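
To make the data model concrete, the following sketch draws a synthetic dataset from $y_i \sim \text{Laplace}(\mathrm{w}^T \mathrm{x}_i, \beta)$ with no intercept. The sizes `N` and `D`, the vector `w_true`, the scale `beta`, and the Gaussian inputs are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 200, 3                          # assumed dataset size
beta = 0.5                             # assumed noise scale
w_true = np.array([1.0, -2.0, 0.5])    # hypothetical ground-truth weights

# Inputs: any design matrix works; standard normal is an arbitrary choice.
X = rng.normal(size=(N, D))

# Laplace noise centered at 0 with scale beta, added to the linear signal,
# which is equivalent to y_i ~ Laplace(w^T x_i, beta).
noise = rng.laplace(loc=0.0, scale=beta, size=N)
y = X @ w_true + noise
```

For a Laplace distribution the expected absolute deviation from the mean equals $\beta$, so `np.mean(np.abs(noise))` should land near the chosen scale.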

`(a)` [1 point (bonus)] Write the likelihood function $\mathcal{L}$ for the parameter $\mathbf{w}$

To write the likelihood function $\mathcal{L}$ for the parameter $\mathbf{w}$, we need to consider the probability of observing the data given the parameters. Given the assumption that each label $y_i$ comes from a Laplace distribution with mean $\mathbf{w}^T \mathbf{x}_i$ and scale parameter $\beta$, the probability density function for each $y_i$ is:

$
p(y_i \mid \mathbf{w}, \beta) = \frac{1}{2\beta} \exp \left( \frac{-|y_i - \mathbf{w}^T \mathbf{x}_i|}{\beta} \right)
$

Assuming the data points are independent, the likelihood function $\mathcal{L}$ for the parameter $\mathbf{w}$ is the product of the individual probabilities for all data points:

$
\mathcal{L}(\mathbf{w}) = \prod_{i=1}^N p(y_i \mid \mathbf{w}, \beta)
$

Substituting the PDF of the Laplace distribution into the likelihood function:

$   \framebox[1][10]{ Solution: } $

$
\mathcal{L}(\mathbf{w}) = \prod_{i=1}^N \frac{1}{2\beta} \exp \left( \frac{-|y_i - \mathbf{w}^T \mathbf{x}_i|}{\beta} \right)
$

This can be further simplified: the constant factor $\frac{1}{2\beta}$ appears $N$ times, and a product of exponentials is the exponential of the sum of the exponents:

$
\mathcal{L}(\mathbf{w}) = \left(\frac{1}{2\beta}\right)^N \exp \left( \frac{-1}{\beta} \sum_{i=1}^N |y_i - \mathbf{w}^T \mathbf{x}_i| \right)
$

Thus, the likelihood function for the parameter $\mathbf{w}$ is:

$
\mathcal{L}(\mathbf{w}) = \left(\frac{1}{2\beta}\right)^N \exp \left( \frac{-1}{\beta} \sum_{i=1}^N |y_i - \mathbf{w}^T \mathbf{x}_i| \right)
$

`(b)` [$1/2$ point (bonus)] Write the log likelihood function $\log \mathcal{L}$ for the parameter $\mathrm{w}$ in the simplest form possible.

To write the log likelihood function for the parameter $\mathbf{w}$ in the simplest form possible, we take the natural logarithm of the likelihood function $\mathcal{L}(\mathbf{w})$ obtained previously.

The likelihood function is:
$
\mathcal{L}(\mathbf{w}) = \left(\frac{1}{2\beta}\right)^N \exp \left( \frac{-1}{\beta} \sum_{i=1}^N |y_i - \mathbf{w}^T \mathbf{x}_i| \right)
$

Taking the natural logarithm of both sides:

$
\log \mathcal{L}(\mathbf{w}) = \log \left[ \left(\frac{1}{2\beta}\right)^N \exp \left( \frac{-1}{\beta} \sum_{i=1}^N |y_i - \mathbf{w}^T \mathbf{x}_i| \right) \right]
$

Using the properties of logarithms:

$
\log \mathcal{L}(\mathbf{w}) = \log \left(\frac{1}{2\beta}\right)^N + \log \left[ \exp \left( \frac{-1}{\beta} \sum_{i=1}^N |y_i - \mathbf{w}^T \mathbf{x}_i| \right) \right]
$

Simplifying further:

$
\log \mathcal{L}(\mathbf{w}) = N \log \left( \frac{1}{2\beta} \right) + \frac{-1}{\beta} \sum_{i=1}^N |y_i - \mathbf{w}^T \mathbf{x}_i|
$

$
\log \mathcal{L}(\mathbf{w}) = N \left( \log 1 - \log 2 - \log \beta \right) + \frac{-1}{\beta} \sum_{i=1}^N |y_i - \mathbf{w}^T \mathbf{x}_i|
$

Since $\log 1 = 0$:

$
\log \mathcal{L}(\mathbf{w}) = -N \log 2 - N \log \beta - \frac{1}{\beta} \sum_{i=1}^N |y_i - \mathbf{w}^T \mathbf{x}_i|
$

Thus, the log likelihood function for the parameter $\mathbf{w}$ is:

$
\log \mathcal{L}(\mathbf{w}) = -N \log 2 - N \log \beta - \frac{1}{\beta} \sum_{i=1}^N |y_i - \mathbf{w}^T \mathbf{x}_i|
$

The log likelihood function $\log \mathcal{L}$ for the parameter $\mathbf{w}$ in the simplest form is:

$   \framebox[1][10]{ Solution: } $


$ \log \mathcal{L}(\mathbf{w}) = -N \log(2\beta) - \frac{1}{\beta} \sum_{i=1}^{N} |y_i - \mathbf{w}^T \mathbf{x}_i|. $
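
The simplified expression can be checked numerically against the sum of per-point log densities. This is a small sketch with made-up data. The function name `laplace_log_likelihood` and the toy values are assumptions for illustration only.

```python
import numpy as np

def laplace_log_likelihood(w, X, y, beta):
    """Simplified form: -N log(2 beta) - (1 / beta) * sum_i |y_i - w^T x_i|."""
    residuals = y - X @ w
    n = y.shape[0]
    return -n * np.log(2.0 * beta) - np.sum(np.abs(residuals)) / beta

# Toy data (hypothetical values).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 2.5])
w = np.array([1.0, 2.0])
beta = 1.0

ll = float(laplace_log_likelihood(w, X, y, beta))

# Direct evaluation: log of each Laplace density, summed over data points.
per_point_pdf = np.exp(-np.abs(y - X @ w) / beta) / (2.0 * beta)
ll_direct = float(np.sum(np.log(per_point_pdf)))
```

Here the residuals are $0$, $0$ and $-0.5$, so both computations give $-3\log 2 - 0.5$.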


`(c)` [1 point (bonus)] What is the simplest loss function that leads to the same result as maximizing the likelihood? Write its expression. How does it differ from the loss used in standard least squares regression? ___Hint:___ Question 4.

Maximizing the likelihood function is equivalent to minimizing the negative log likelihood. From the log likelihood function obtained:

$
\log \mathcal{L}(\mathbf{w}) = -N \log 2 - N \log \beta - \frac{1}{\beta} \sum_{i=1}^N |y_i - \mathbf{w}^T \mathbf{x}_i|
$

The negative log likelihood is therefore:

$
-\log \mathcal{L}(\mathbf{w}) = N \log 2 + N \log \beta + \frac{1}{\beta} \sum_{i=1}^N |y_i - \mathbf{w}^T \mathbf{x}_i|
$

Only the last term depends on $\mathbf{w}$. Since $\beta > 0$ is a constant, minimizing the negative log likelihood is equivalent to minimizing:

$
\sum_{i=1}^N |y_i - \mathbf{w}^T \mathbf{x}_i|
$

Therefore, the simplest loss function that can be used is the sum of absolute errors, which is precisely the Least Absolute Deviations (LAD) loss (also known as the **L1 loss**):

$   \framebox[1][10]{ Solution: } $

$
L(\mathbf{w}) = \sum_{i=1}^N |y_i - \mathbf{w}^T \mathbf{x}_i|
$

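This is different from the standard least squares regression (OLS), which uses the sum of squared errors (L2 norm) instead of the sum of absolute errors (L1 norm). As a result, large residuals are penalized linearly rather than quadratically, which makes the fit less sensitive to outliers.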
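
The difference between the two losses can be illustrated on a deliberately simple case, assumed here for illustration: a single feature with $x_i = 1$ for every point, so the model predicts one constant $w$. In that setting the L1 minimizer is the median of the labels and the L2 minimizer is their mean. The sketch below finds both by a plain grid search over hypothetical labels that include one outlier.

```python
import numpy as np

# Hypothetical labels: four values near 1 and one outlier at 10.
y = np.array([1.0, 1.1, 0.9, 1.0, 10.0])

# Candidate values for the single weight w (all inputs x_i are 1).
candidates = np.linspace(0.0, 10.0, 1001)

# Residual matrix: one row per candidate w, one column per data point.
residuals = y[None, :] - candidates[:, None]

l1_loss = np.sum(np.abs(residuals), axis=1)   # sum of absolute errors
l2_loss = np.sum(residuals ** 2, axis=1)      # sum of squared errors

w_l1 = float(candidates[np.argmin(l1_loss)])  # close to the median, 1.0
w_l2 = float(candidates[np.argmin(l2_loss)])  # close to the mean, 2.8
```

The single outlier drags the least squares solution far from the bulk of the data, while the absolute-error solution stays at the median.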


`(d)` [2 points (bonus)] Write the batch gradient descent update rule to minimize the cost function from the previous point. 
___Hint:___ $\frac{d}{dx}|x| = \text{sign}(x)$. You may ignore points where the gradient is undefined.

To derive the batch gradient descent update rule to minimize the Least Absolute Deviations (LAD) loss function, we start with the cost function:

$
L(\mathbf{w}) = \sum_{i=1}^N |y_i - \mathbf{w}^T \mathbf{x}_i|
$

We need to compute the gradient of this cost function with respect to $\mathbf{w}$. Using the hint that $\frac{d}{dx}|x| = \text{sign}(x)$, the gradient with respect to $\mathbf{w}$ is:

$
\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = \sum_{i=1}^N \frac{\partial}{\partial \mathbf{w}} |y_i - \mathbf{w}^T \mathbf{x}_i|
$

Applying the chain rule and the hint:

$
\frac{\partial}{\partial \mathbf{w}} |y_i - \mathbf{w}^T \mathbf{x}_i| = \text{sign}(y_i - \mathbf{w}^T \mathbf{x}_i) \cdot \frac{\partial}{\partial \mathbf{w}} (y_i - \mathbf{w}^T \mathbf{x}_i)
$

Since $\frac{\partial}{\partial \mathbf{w}} (y_i - \mathbf{w}^T \mathbf{x}_i) = -\mathbf{x}_i$:

$
\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = \sum_{i=1}^N \text{sign}(y_i - \mathbf{w}^T \mathbf{x}_i) \cdot (-\mathbf{x}_i)
$

Simplifying this, we get:

$
\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = -\sum_{i=1}^N \text{sign}(y_i - \mathbf{w}^T \mathbf{x}_i) \cdot \mathbf{x}_i
$

The gradient descent update rule is given by:

$
\mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial L(\mathbf{w})}{\partial \mathbf{w}}
$

Substituting the gradient we computed:

$
\mathbf{w} \leftarrow \mathbf{w} + \alpha \sum_{i=1}^N \text{sign}(y_i - \mathbf{w}^T \mathbf{x}_i) \cdot \mathbf{x}_i
$

Thus, the batch gradient descent update rule to minimize the LAD loss function is:

$   \framebox[1][10]{ Solution: } $


$
\mathbf{w} \leftarrow \mathbf{w} + \alpha \sum_{i=1}^N \text{sign}(y_i - \mathbf{w}^T \mathbf{x}_i) \cdot \mathbf{x}_i
$

where $\alpha$ is the learning rate.
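
The update rule above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the synthetic data, the learning rate `alpha`, and the iteration count are all hypothetical. Residuals that are exactly zero use `np.sign(0) = 0`, which is one valid way to handle the points where the gradient is undefined.

```python
import numpy as np

def lad_loss(w, X, y):
    """Sum of absolute errors, the loss from part (c)."""
    return float(np.sum(np.abs(y - X @ w)))

def lad_gradient_descent(X, y, alpha=1e-3, n_iters=1000):
    """Batch update: w <- w + alpha * sum_i sign(y_i - w^T x_i) * x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residuals = y - X @ w
        # X.T @ sign(residuals) equals the sum over i of sign(r_i) * x_i.
        w = w + alpha * (X.T @ np.sign(residuals))
    return w

# Synthetic data with Laplace noise (all values assumed for illustration).
rng = np.random.default_rng(0)
N, D = 200, 2
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(N, D))
y = X @ w_true + rng.laplace(loc=0.0, scale=0.5, size=N)

w_hat = lad_gradient_descent(X, y)
```

With a constant learning rate the iterates typically settle into a small oscillation around the minimizer rather than converging exactly, because the gradient does not shrink near the optimum. A decaying step size is a common remedy.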