# Maximizing Likelihood 📈📈

## Definition

The likelihood of a family of random variables is a function that gives, for each possible realisation of every random variable $X_i$ in the family, the probability that this combination of realisations occurs. In statistical modelling the data are already available, so we already know the realisation of each random variable (i.e. the value of the explanatory variables for each observation). The task is therefore to find the parameters that make the available observations as likely as possible. In other words, we are trying to find the underlying general rule that we suspect exists behind the observed distribution of the data.

Statistical likelihood is a function of conditional probabilities (it is a probability whose law depends on parameters). Let $X = (X_{1}, X_{2}, ..., X_{n})$ be a vector of independent random variables and $\theta = (\beta_{1}, \beta_{2}, ..., \beta_{k})$ the set of parameters on which $X$ depends. In what follows, capital letters represent random variables, e.g. $X_i$, and lowercase letters represent specific realisations of those random variables, $X_i = x_i$. With that in mind, the likelihood of $X$ is written:

$$
L(X)=L(x_{1},x_{2},...,x_{n};\theta) = \prod^{n}_{i=1}f(x_{i};\theta)
$$

With

If $X_i$ is a continuous variable:

$$
f(x_{i};\theta) = f_{\theta}(x_{i})
$$

Where $f_{\theta}$ is the probability density function of $X_i$, and $\theta$ represents the parameters $\beta_1, \beta_2, ..., \beta_k$.

If $X_i$ is a discrete variable:

$$
f(x_{i};\theta) = P_{\theta}(X_{i} = x_{i})
$$

Where $P_{\theta}(X_i = x_i)$ is the probability of getting $X_i = x_i$.
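
As a minimal illustration of the discrete case, the sketch below evaluates the likelihood of a small Bernoulli sample, where the probability of observing $x_i = 1$ is a single parameter $p$, and scans a grid of candidate values for $p$. The sample, the function name `bernoulli_likelihood` and the grid are hypothetical choices made for this example only.

```python
def bernoulli_likelihood(sample, p):
    """Likelihood of an independent Bernoulli(p) sample: product of P(X_i = x_i)."""
    likelihood = 1.0
    for x in sample:
        likelihood *= p if x == 1 else (1.0 - p)
    return likelihood

# Hypothetical observations: 6 successes out of 8 trials.
sample = [1, 0, 1, 1, 0, 1, 1, 1]

# Evaluate the likelihood on a grid of candidate parameters 0.01, 0.02, ..., 0.99.
candidates = [i / 100 for i in range(1, 100)]
best_p = max(candidates, key=lambda p: bernoulli_likelihood(sample, p))

print(best_p)  # the grid point with the highest likelihood
```

On this sample the grid search should land on the observed proportion of successes (6/8), which is the value that makes the observed data most likely.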



### Maximum likelihood estimate for simple linear regression

One way to estimate the parameters of a model is to maximize the corresponding likelihood function. In the case of simple linear regression, this function is:

$$
L(Y = y|\beta) = \prod_{i=1}^{n}f(Y_{i} = y_{i}|\beta)
$$

$$
L(Y = y|\beta) = \prod_{i=1}^{n}f(\beta_{0} + \beta_{1}x_{i} + \epsilon_{i} = y_{i}|\beta_{0}, \beta_{1})
$$

$$
L(Y = y|\beta) = \prod_{i=1}^{n}f(\epsilon_{i} = y_{i}-\beta_{0}-\beta_{1}x_{i}|\beta_{0}, \beta_{1})
$$

$$
L(Y = y|\beta) = \prod_{i=1}^{n}\frac{1}{\sigma\sqrt{2\pi}}\exp(\frac{-(y_{i}-\beta_{0}-\beta_{1}x_{i})^2}{2\sigma^2})
$$

This likelihood equation is obtained by assuming that the errors $\epsilon_i$ are independent and follow a centred normal distribution ($E(\epsilon) = \mu = 0$) with standard deviation $\sigma$. We must therefore find the maximum (if it exists) of this likelihood function. To do this, we take the logarithm of each side of the equation to obtain a sum rather than a product; because the logarithm is increasing, the maximum is reached at the same parameter values.

$$
log(L(Y = y|\beta)) = log(\prod^{n}_{i=1}\frac{1}{\sigma \sqrt{2\pi}}\exp(\frac{-(y_{i}-\beta_{0}-\beta_{1}x_{i})^2}{2\sigma^2}))
$$

$$
l(Y = y|\beta) = -n(log(\sigma)+\frac{1}{2}log(2\pi))-\frac{1}{2\sigma^2}\sum^{n}_{i=1}(y_{i}-\beta_{0}-\beta_{1}x_{i})^2
$$
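
Beyond making the algebra easier, working with a sum of logarithms also matters numerically. The sketch below, written under the normal-error assumption with hypothetical residuals and $\sigma = 1$, multiplies 2000 density values directly and compares that with the sum of their logarithms and with the closed form $-n(\log\sigma + \frac{1}{2}\log 2\pi) - \frac{1}{2\sigma^2}\sum e_i^2$.

```python
import math

def normal_pdf(e, sigma):
    """Density of a centred normal distribution with standard deviation sigma."""
    return math.exp(-e**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical residuals y_i - beta_0 - beta_1 * x_i (values between -0.3 and 0.3).
residuals = [0.1 * ((i % 7) - 3) for i in range(2000)]
sigma = 1.0
n = len(residuals)

# Direct product of densities: every factor is below 1, so it underflows.
product = 1.0
for e in residuals:
    product *= normal_pdf(e, sigma)

# Sum of log-densities: stays finite.
log_sum = sum(math.log(normal_pdf(e, sigma)) for e in residuals)

# Closed form of the log-likelihood under the normal assumption.
closed_form = (-n * (math.log(sigma) + 0.5 * math.log(2 * math.pi))
               - sum(e**2 for e in residuals) / (2 * sigma**2))

print(product)               # underflows to 0.0 in double precision
print(log_sum, closed_form)  # the two values agree up to rounding
```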



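For a fixed $\sigma$, the first term does not depend on the parameters of the model, $\beta_0$ and $\beta_1$. Maximizing the log-likelihood with respect to these parameters therefore amounts to finding the values for which $-\sum^{n}_{i=1}(y_{i}-\beta_{0}-\beta_{1}x_{i})^2$ is maximal, i.e. for which the sum of squared residuals is minimal. The maximum likelihood estimate coincides with the least-squares estimate.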
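
The link between the log-likelihood and the sum of squared residuals can be checked numerically. The sketch below is an illustration under assumptions, not part of the derivation: the data are simulated, $\sigma$ is fixed at 1, and the estimates come from the standard least-squares formulas for simple regression (slope = covariance of $x$ and $y$ divided by variance of $x$). It then verifies that moving away from these estimates lowers the log-likelihood.

```python
import math
import random

random.seed(0)

# Hypothetical data simulated from y = 2 + 0.5 * x + noise.
xs = [0.2 * i for i in range(50)]
ys = [2.0 + 0.5 * x + random.gauss(0.0, 1.0) for x in xs]
n = len(xs)

def log_likelihood(b0, b1, sigma=1.0):
    """Log-likelihood of the simple linear model with normal errors."""
    sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    return -n * (math.log(sigma) + 0.5 * math.log(2 * math.pi)) - sse / (2 * sigma**2)

# Standard least-squares estimates for simple linear regression.
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b1_hat = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
b0_hat = y_bar - b1_hat * x_bar

# Log-likelihood at the estimate and at nearby perturbed parameter values.
best = log_likelihood(b0_hat, b1_hat)
perturbed = [
    log_likelihood(b0_hat + d0, b1_hat + d1)
    for d0 in (-0.1, 0.0, 0.1)
    for d1 in (-0.05, 0.0, 0.05)
    if (d0, d1) != (0.0, 0.0)
]

print(b0_hat, b1_hat)
print(all(value < best for value in perturbed))
```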


### Maximum likelihood estimate for multiple linear regression

The calculation of the maximum likelihood estimator is a classic exercise in statistics, so we present it here for those who are familiar with, or would like to become familiar with, matrix calculus.


$$
L(Y = y|\beta) = f(Y=y|\beta)
$$

$$
L(Y = y|\beta) = f(X\beta+\epsilon = y|\beta)
$$

$$
L(Y = y|\beta) = f(\epsilon = y-X\beta|\beta)
$$

According to the assumptions required to use a multiple linear model, $\epsilon$ follows a centred normal distribution ($E(\epsilon) = 0_{n}$, the zero vector of size $n$) with diagonal covariance matrix $\Sigma = Diag(\sigma^{2}, \sigma^{2}, ... , \sigma^{2}) = \sigma^{2}I_{n}$: the errors are uncorrelated and share the same variance $\sigma^{2}$. This brings us to the following equation:

$$
L(Y = y|\beta) = det(2\pi\Sigma)^{-\frac{1}{2}}\exp(-\frac{1}{2}(y-X\beta)^{t}\Sigma^{-1}(y-X\beta))
$$

Since $\Sigma = \sigma^{2}I_{n}$, we have $\Sigma^{-1} = \frac{1}{\sigma^{2}}I_{n}$, hence:

$$
L(Y = y|\beta) = det(2\pi\Sigma)^{-\frac{1}{2}}\exp(-\frac{1}{2\sigma^{2}}(y-X\beta)^{t}(y-X\beta))
$$

We apply the logarithm, which is increasing and therefore does not change the optimization problem under consideration:

$$
log(L(Y = y|\beta)) = -\frac{1}{2}log(det(2\pi\Sigma))-\frac{1}{2\sigma^{2}}(y-X\beta)^{t}(y-X\beta)
$$


We look for the value of $\beta$ that maximizes the above expression. The first term does not depend on $\beta$, so this is equivalent to minimizing the following quantity:

$$
\beta_{MLE} = \underset{\beta}{\arg\min}\,(y-X\beta)^{t}(y-X\beta)
$$

$$
\beta_{MLE} = \underset{\beta}{\arg\min}\,(y^{t}y-\beta^{t}X^{t}y-y^{t}X\beta+\beta^{t}X^{t}X\beta)
$$

$y^{t}X\beta$ is a scalar (i.e. a $1 \times 1$ matrix), so it is equal to its transpose: $y^{t}X\beta = \beta^{t}X^{t}y$. Hence:

$$
\beta_{MLE} = \underset{\beta}{\arg\min}\,(y^{t}y-2\beta^{t}X^{t}y + \beta^{t}X^{t}X\beta)
$$

We differentiate with respect to $\beta$ and set the gradient to zero:

$$
-2X^{t}y + 2X^{t}X\beta = 0 \Rightarrow X^{t}X\beta = X^{t}y \Rightarrow \beta_{MLE} = (X^{t}X)^{-1}X^{t}y
$$
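
The closed-form estimator can be sketched with NumPy as follows. The data are simulated and the names (`X`, `beta_true`, `beta_mle`) are hypothetical; the design matrix is assumed to contain a column of ones for the intercept. Rather than forming the inverse explicitly, the sketch solves the linear system $X^{t}X\beta = X^{t}y$ and compares the result with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: n observations, p explanatory variables plus an intercept.
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Solve X^t X beta = X^t y, which gives (X^t X)^{-1} X^t y without an explicit inverse.
beta_mle = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The first-order condition should hold at the solution (up to rounding).
normal_eq_residual = X.T @ (X @ beta_mle) - X.T @ y

print(beta_mle)
print(np.allclose(beta_mle, beta_lstsq))
```

Solving the system with `np.linalg.solve` is generally preferred over computing `np.linalg.inv(X.T @ X)` because it is cheaper and less sensitive to rounding errors.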

This solution is well defined only if $X^{t}X$ is an invertible matrix, i.e. if $X$ has full column rank. This is true if the explanatory variables are not collinear, which in particular requires $p < n$ (the number of explanatory variables is less than the number of observations).
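
The invertibility condition can be checked in practice by looking at the rank of the design matrix. The sketch below uses simulated, hypothetical matrices to illustrate the two failure modes: a column that is an exact multiple of another one, and more columns than observations.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Well-behaved design: intercept plus two unrelated explanatory variables.
X_ok = np.column_stack([np.ones(n), x1, x2])

# Collinear design: the third column is exactly twice the second one.
X_collinear = np.column_stack([np.ones(n), x1, 2.0 * x1])

# Too few observations: 3 rows but 5 columns.
X_wide = rng.normal(size=(3, 5))

rank_ok = np.linalg.matrix_rank(X_ok)
rank_collinear = np.linalg.matrix_rank(X_collinear)
rank_wide = np.linalg.matrix_rank(X_wide)

# X^t X is invertible only when the rank equals the number of columns.
print(rank_ok, X_ok.shape[1])
print(rank_collinear, X_collinear.shape[1])
print(rank_wide, X_wide.shape[1])
```

In the last two cases the rank is smaller than the number of columns, so $X^{t}X$ is singular and the estimator is not uniquely defined.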