# Maximum Likelihood Estimation

##### Keywords: maximum likelihood,  parametric model, linear regression, logistic regression, inference,  exponential distribution

## Contents
{:.no_toc}
* 
{: toc}

## Maximum Likelihood Estimation
Maximum Likelihood Estimation is, by far, the best-studied and most widely applied model-fitting technique. As usual, the aim is to work out which set of parameter values we ought to use in our model.

Maximum Likelihood Estimation says that we should pick the parameter values that make the data most likely to arise. We'll see later on that this is equivalent to picking the parameter values that are least surprised by the data.

To apply the MLE procedure:  
1) Write down the (log) likelihood function.  
2) Either analytically or numerically optimize the likelihood by varying the parameters and keeping the observed data fixed.  
3) Use the values of the parameters obtained in step 2 as the official MLE best-guess.

### Via table

Recall our 'spreadsheet' of all possible datasets and all possible parameter settings:

| |parameter setting 1|parameter setting 2|...| |
|-|-|-|-|-|
|**dataset 1**|$P($dataset1$\vert$parameter setting1$)$|$P($dataset1$\vert$parameter setting2$)$|...|Row Sum=?|
|**dataset 2**|$P($dataset2$\vert$parameter setting1$)$|$P($dataset2$\vert$parameter setting2$)$|...|Row Sum=?|
|**dataset 3**|$P($dataset3$\vert$parameter setting1$)$|$P($dataset3$\vert$parameter setting2$)$|...|Row Sum=?|
|...|...|...|...|...|
| |Column Sum=1|Column Sum=1|...| |

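Each entry in the table is the probability of that row's dataset under that column's parameter setting. Read along a single row, with the dataset held fixed, the entries give the likelihood of each parameter setting. To find the MLE we look up our dataset in the table (let's say it's on row 2) and then pick the column with the largest value in that row.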

Of course, the number of possible parameter settings tends to be [highly] infinite, so we tend to resort to calculus to find the best set of parameters, holding the data constant.
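
For a model small enough to enumerate, the table lookup can be carried out literally. The sketch below assumes a coin-flip (binomial) model, which is not part of the notes above: the datasets are the possible head counts in 10 flips, and the parameter settings are a coarse grid of head probabilities. The names `binom_pmf`, `p_grid` and `table` are illustrative only.

```python
import math

n_flips = 10
# Assumed coarse grid of parameter settings: P(heads) = 0.1, 0.2, ..., 0.9
p_grid = [i / 10 for i in range(1, 10)]


def binom_pmf(k, n, p):
    """P(k heads in n flips | head probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# table[k][j] = P(dataset "k heads" | parameter setting p_grid[j])
table = [[binom_pmf(k, n_flips, p) for p in p_grid] for k in range(n_flips + 1)]

# Each column is a probability distribution over datasets, so it sums to 1.
column_sums = [
    sum(table[k][j] for k in range(n_flips + 1)) for j in range(len(p_grid))
]

# Look up the observed dataset's row and pick the column with the largest entry.
observed_heads = 7
row = table[observed_heads]
best_col = max(range(len(p_grid)), key=lambda j: row[j])
p_mle_grid = p_grid[best_col]

print("column sums:", [round(s, 6) for s in column_sums])
print("grid MLE for 7 heads out of 10:", p_mle_grid)
```

With 7 heads out of 10 the lookup picks the grid value 0.7. The rows are not probability distributions over parameter settings, which is why their sums are left as "?" in the table.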

### Via graph

The diagram below illustrates the idea behind the MLE.

![](images/gaussmle.png)

Consider a likelihood function using the parameter value 1.8 (blue) and the same likelihood but with a parameter value of 5.8 (green). Let's say we have 3 data points, at $x=1,2,3$.

Maximum likelihood says we should favor whichever value of the parameter, if true, makes the data more likely to occur.

In our case the blue parameter value has the higher likelihood, since the product of the heights of the 3 vertical blue bars is greater than the product of the heights of the 3 green bars.

Indeed, the question that MLE asks is: how can we move and scale the distribution by changing the parameters so that the product of the 3 bars is maximised?
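
The comparison of the bars can be reproduced numerically. This is a minimal sketch under assumptions the text does not state: the curves are taken to be Gaussian densities, the parameter is taken to be the mean, and the standard deviation is fixed at 1. The function name `likelihood` is illustrative.

```python
import math
from statistics import NormalDist

data = [1.0, 2.0, 3.0]  # the three data points from the text


def likelihood(mu, sigma=1.0):
    """Product of the density heights at the data points.

    Assumes a Gaussian model with mean mu and (assumed) standard deviation sigma.
    """
    model = NormalDist(mu, sigma)
    return math.prod(model.pdf(x) for x in data)


blue = likelihood(1.8)   # parameter value 1.8
green = likelihood(5.8)  # parameter value 5.8

print("blue  (1.8):", blue)
print("green (5.8):", green)
print("blue favored:", blue > green)
```

Under these assumptions the product for 1.8 is far larger than the product for 5.8, because 5.8 places all three points deep in the tail of the curve.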

### Via math
If we assume that the overall likelihood of the dataset is the product of the likelihoods of the individual data points (i.e. the data points are drawn independently), we can write the likelihood as

$$
P(dataset\,|\,parameters)=L(\theta) = \prod_{i=1}^n P(x_i \mid \theta)
$$

This gives us a measure of how likely it is to observe values $x_1,...,x_n$ given the parameters $\theta$. Our goal is to maximize the expression above by picking $\theta$. Remember that we're working within a single row: $x_i$ [the observed data] are considered fixed.

Often it is easier and numerically more stable to maximise the log-likelihood:

$$
\ell(\theta) = \sum_{i=1}^n \ln(P(x_i \mid \theta))
$$

Because log is a monotonically increasing transformation, if the highest likelihood occurs at $\theta=0.7$ before we take the log, it will still occur at $\theta=0.7$ after we take the log. The sum is also much easier to work with than the product.
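
The stability point is easy to see on a moderately large dataset. The sketch below assumes a unit-variance Gaussian model and simulated data (neither comes from the text). With 2000 points the raw product underflows to zero for every candidate parameter, while the sum of logs stays finite and can still be compared.

```python
import math
import random
from statistics import NormalDist

random.seed(0)
# Simulated data: 2000 draws from a standard normal (an assumption for illustration).
data = [random.gauss(0.0, 1.0) for _ in range(2000)]


def likelihood(mu):
    """Raw product of densities; underflows in floating point for large n."""
    model = NormalDist(mu, 1.0)
    return math.prod(model.pdf(x) for x in data)


def log_likelihood(mu):
    """Sum of log densities; stays in a comfortable floating-point range."""
    model = NormalDist(mu, 1.0)
    return sum(math.log(model.pdf(x)) for x in data)


for mu in (0.0, 0.5):
    print(f"mu={mu}: product={likelihood(mu)}, log-likelihood={log_likelihood(mu):.2f}")
```

Both products print as `0.0`, so they cannot be ranked, whereas the log-likelihoods still distinguish the two candidates.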

From here, we take the derivative of the log-likelihood and hunt for places where the derivative [or gradient] is zero or undefined as candidate maxima, or we resort to numerical techniques to find a local maximum.
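
As one example of a numerical technique, here is a minimal gradient-ascent sketch. It assumes a Gaussian model with unknown mean and unit variance, for which the derivative of the average log-likelihood with respect to the mean is the average of $(x_i - \mu)$. The step size, iteration count and starting guess are arbitrary choices, not recommendations.

```python
data = [1.0, 2.0, 3.0]


def avg_score(mu):
    """Derivative of the average log-likelihood w.r.t. mu.

    Assumes a Gaussian model with unit variance, so each term is (x - mu).
    """
    return sum(x - mu for x in data) / len(data)


mu = 0.0         # arbitrary starting guess
step_size = 0.5  # arbitrary; small enough to converge for this problem
for _ in range(100):
    mu += step_size * avg_score(mu)  # move uphill on the log-likelihood

print("numerical MLE:", mu)
print("sample mean:  ", sum(data) / len(data))
```

For this model the derivative is zero exactly at the sample mean, so the numerical answer can be checked against the analytic one.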


### Example: MLE for an Exponential Distribution
The exponential distribution occurs naturally when describing the lengths of the inter-arrival times in a homogeneous Poisson process.

It takes the form:
$$
f(x;\lambda) = \begin{cases}
\lambda e^{-\lambda x} & x \ge 0, \\
0 & x < 0.
\end{cases}
$$

#### Finding the MLE
In this example the observed data are $n$ scalar values $x_i$, and the model is that each point is an iid draw from an exponential distribution with unknown parameter $\lambda$.

We have,
$$
\ell(\lambda) = \ln P(data\,|\,\lambda) = \sum_{i=1}^n \ln(\lambda e^{-\lambda x_i}) = \sum_{i=1}^n \left( \ln(\lambda) - \lambda x_i \right) = n\ln(\lambda) - \lambda \sum_{i=1}^n x_i.
$$

Maximizing this:

$$
\frac{d \ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^n x_i = 0
$$

and thus:

$$
\hat{\lambda}_{MLE} = \frac{n}{\sum_{i=1}^n x_i}
$$
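
As a sanity check that this stationary point is a maximum rather than a minimum, one can differentiate once more (a standard second-derivative test, added here for completeness):

$$
\frac{d^2 \ell}{d\lambda^2} = -\frac{n}{\lambda^2} < 0 \quad \text{for all } \lambda > 0,
$$

so the log-likelihood is concave in $\lambda$ and the solution of $d\ell/d\lambda = 0$ is its unique maximum.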

From here we just need to sum the $x_i$ and plug in. We thus have a way of finding the MLE best-guess for $\lambda$ for any dataset.
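
The closed form can be checked on simulated data. The sketch below draws samples with an assumed true rate of 2.0 (an arbitrary choice), computes $n / \sum_i x_i$, and confirms that the log-likelihood is lower at nearby values of $\lambda$.

```python
import math
import random

random.seed(1)
true_rate = 2.0  # assumed "true" lambda, used only to simulate data
data = [random.expovariate(true_rate) for _ in range(5000)]

n = len(data)
total = sum(data)


def log_likelihood(lam):
    """Exponential log-likelihood: n*ln(lambda) - lambda * sum(x_i)."""
    return n * math.log(lam) - lam * total


lam_hat = n / total  # closed-form MLE derived above

print("MLE estimate:", lam_hat)
for lam in (0.9 * lam_hat, lam_hat, 1.1 * lam_hat):
    print(f"lambda={lam:.4f}  log-likelihood={log_likelihood(lam):.2f}")
```

The estimate is the reciprocal of the sample mean, which matches the fact that an exponential distribution with rate $\lambda$ has mean $1/\lambda$.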

### Nice Properties

#### Asymptotic Normality
In the above examples, we considered the observed data to be fixed. But we know that each of the $x_i$ in the dataset is really a random variable, i.e. we could have easily ended up with a different dataset / on another row of the table.

Thus the particular value of the MLE parameters we calculate depends on our luck during data collection. What does the spread of possible MLE results look like? It turns out that *with enough data, the possible MLE results look more and more like a normal distribution centered on the true parameter value (if there is one) and with variance that decreases as $1/n$*. So with a good amount of data, odds are that our observed MLE isn't too far from the true parameter value, and we could even work out probabilities of being off by a given amount. If there is no true parameter value, the MLE is still asymptotically normally distributed, and centered on the parameters that are closest to the true model (as measured by the Kullback-Leibler divergence, an information-theoretic measure).
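
A small simulation makes the "luck during data collection" concrete. The sketch below assumes an exponential model with a true rate of 2.0 (an arbitrary choice). It repeatedly draws fresh datasets, recomputes the MLE $n / \sum_i x_i$ each time, and compares the spread of the estimates at two sample sizes. The helper names `exp_mle` and `sampling_spread` are illustrative.

```python
import random
from statistics import mean, variance

random.seed(2)
true_rate = 2.0  # assumed "true" lambda for the simulation


def exp_mle(n):
    """Draw one dataset of size n and return its MLE, n / sum(x_i)."""
    total = sum(random.expovariate(true_rate) for _ in range(n))
    return n / total


def sampling_spread(n, n_datasets=2000):
    """Mean and variance of the MLE across many simulated datasets."""
    estimates = [exp_mle(n) for _ in range(n_datasets)]
    return mean(estimates), variance(estimates)


mean_50, var_50 = sampling_spread(50)
mean_200, var_200 = sampling_spread(200)
ratio = var_50 / var_200

print(f"n=50:  mean={mean_50:.3f}  variance={var_50:.5f}")
print(f"n=200: mean={mean_200:.3f}  variance={var_200:.5f}")
print(f"variance ratio: {ratio:.2f}")
```

Quadrupling the sample size should shrink the variance by roughly a factor of 4, in line with the $1/n$ behaviour described above. The match is only approximate at finite $n$.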

#### Efficiency, etc.
The MLE has a host of other nice features like "doesn't leave any information in the data lying on the table". If you want to know more about when, how, and why MLE works well (or doesn't) take a class on statistical inference.