
## Maximum Likelihood Principle (Copy of Book)
In the study of estimators, rather than guessing that some function might make a good estimator and then analysing its bias and variance, we would like to have some principle from which we can derive specific functions that are good estimators for different models. The most common such principle is the Maximum Likelihood Principle.

Consider a set of $m$ examples $X=\lbrace x^{(1)},...,x^{(m)} \rbrace$ drawn independently from the true but unknown data-generating distribution $p_{data}(x)$.

Now let $p_{model}(x;\theta)$ be a parametric family of probability distributions over the same space indexed by $\theta$. In other words, $p_{model}(x;\theta)$ maps any configuration $x$ to a real number estimating the true probability $ p_{data}(x)$.

The maximum likelihood estimator for $\theta$ is then defined as

(5.56)
$$
\theta_{ML}=\arg\max_{\theta}\; p_{model}(X;\theta)
$$
(5.57)
$$
=\arg\max_{\theta}\prod_{i=1}^m p_{model}(x^{(i)};\theta)
$$


However, this product over many probabilities can be inconvenient for various reasons. For example, it is prone to numerical underflow. To obtain a more convenient but equivalent optimization problem, we observe that taking the logarithm of the likelihood does not change its arg max but does conveniently transform a product into a sum:

(5.58)
$$
\theta_{ML}=\arg\max_{\theta}\sum_{i=1}^m \log p_{model}(x^{(i)};\theta)
$$
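
A minimal sketch of the underflow problem, assuming a standard Gaussian model and 2000 made-up observations (both are illustrative choices, not taken from the book). The product of densities collapses to zero in floating point, while the sum of log-densities stays finite:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian N(mu, sigma^2) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical data: 2000 observations, all equal to 0.5 for simplicity.
xs = [0.5] * 2000

# Likelihood as a raw product of densities (as in equation 5.57).
likelihood = 1.0
for x in xs:
    likelihood *= gaussian_pdf(x, 0.0, 1.0)

# Log-likelihood as a sum of log-densities (as in equation 5.58).
log_likelihood = sum(math.log(gaussian_pdf(x, 0.0, 1.0)) for x in xs)

print(likelihood)       # underflows to 0.0
print(log_likelihood)   # finite, roughly -2087.9
```

Once the product has underflowed to zero, every candidate $\theta$ looks equally bad, so the arg max can no longer be found from it. The log form does not have this problem.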

Moreover, because the arg max does not change when we rescale the cost function, we can divide by $m$ to obtain a version of the criterion that is expressed as an expectation with respect to the empirical distribution $\hat p_{data}$ defined by the training data:

(5.59)
$$
\theta_{ML}=\arg\max_{\theta} E_{x\sim\hat p_{data}} \log p_{model}(x;\theta)
$$

## KL Divergence (Copy of Book)

One way to interpret maximum likelihood estimation is to view it as minimizing the dissimilarity between the empirical distribution $\hat p_{data}$, defined by the training set, and the model distribution, with the degree of dissimilarity between the two measured by the **KL divergence**. The KL divergence is given by

(5.60)
$$
D_{KL}(\hat p_{data}||p_{model})=E_{x\sim\hat p_{data}} [\log \hat p_{data}(x)-\log p_{model}(x)]
$$

The term on the left is a function only of the data-generating process, not the model. This means that when we train the model to minimize the KL divergence, we need only minimize

(5.61)
$$
-E_{x\sim\hat p_{data}}[\log p_{model}(x)]
$$

which is of course the same as the maximization in equation (5.59).

We can thus see maximum likelihood as an attempt to make the model distribution match the empirical distribution $\hat p_{data}$. Ideally, we would like to match the true data-generating distribution $p_{data}$, but we have no direct access to this distribution.

While the optimal $\theta$ is the same regardless of whether we are maximizing the likelihood or minimizing the KL divergence, **the values of the objective functions are different.** In software, we often phrase both as minimizing a cost function. Maximum likelihood thus becomes minimization of the negative log-likelihood (NLL), or equivalently, minimization of the cross-entropy. The perspective of maximum likelihood as minimum KL divergence becomes helpful in this case because the KL divergence has a known minimum value of zero. The negative log-likelihood can actually become negative when $x$ is real-valued.
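
The relationship between the two objectives can be checked numerically. The sketch below assumes a hypothetical binary data set (7 ones out of 10) and a Bernoulli model with a single parameter `theta`. These choices, and the grid search, are for illustration only:

```python
import math

# Hypothetical empirical distribution: 3 zeros and 7 ones out of 10 examples.
p_hat = {0: 0.3, 1: 0.7}

def model(x, theta):
    """Bernoulli model: probability of outcome x when P(x = 1) = theta."""
    return theta if x == 1 else 1 - theta

def cross_entropy(theta):
    """Expected negative log-likelihood under p_hat (equation 5.61)."""
    return -sum(p * math.log(model(x, theta)) for x, p in p_hat.items())

def kl(theta):
    """KL divergence from p_hat to the model (equation 5.60)."""
    return sum(p * (math.log(p) - math.log(model(x, theta))) for x, p in p_hat.items())

# Entropy of the empirical distribution: constant with respect to theta.
entropy = -sum(p * math.log(p) for p in p_hat.values())

thetas = [i / 100 for i in range(1, 100)]
best_ce = min(thetas, key=cross_entropy)
best_kl = min(thetas, key=kl)

print(best_ce, best_kl)   # both minimized at the same theta
print(kl(best_kl))        # approximately zero at the optimum
```

The two objectives differ only by the entropy of $\hat p_{data}$, which does not depend on $\theta$, so they share the same minimizer while taking different values.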



## Conditional Log-likelihood and Mean Squared Error (Copy of Book)

The maximum likelihood estimator can readily be generalized to estimate a conditional probability $P(y|x;\theta)$ in order to predict $y$ given $x$. This is actually the most common situation because it forms the basis for most supervised learning.

If $X$ represents all our inputs and $Y$ all our observed targets, then the conditional maximum likelihood estimator is:

(5.62)
$$
\theta_{ML}=\arg\max_{\theta} P(Y|X;\theta)
$$
If the examples are assumed to be i.i.d., then this can be decomposed into:

(5.63)
$$
\theta_{ML}=\arg\max_{\theta}\sum_{i=1}^m \log P(y^{(i)}|x^{(i)};\theta)
$$

## Example: Linear Regression as Maximum Likelihood (Copy of Book)

Linear regression, introduced in section 5.1.4 of the Deep Learning book, may be justified as a maximum likelihood procedure. Previously, we motivated linear regression as an algorithm that learns to take an input $x$ and produce an output value $\hat y$. The mapping from $x$ to $\hat y$ is chosen to minimize mean squared error, a criterion that we introduced more or less arbitrarily. We now revisit linear regression from the point of view of maximum likelihood estimation. Instead of producing a single prediction $\hat y$, we now think of the model as producing a conditional distribution $p(y|x)$. We can imagine that, with an infinitely large training set, we might see several training examples with the same input value $x$ but different values of $y$. The goal of the learning algorithm is now to fit the distribution $p(y|x)$ to all those different $y$ values that are all compatible with $x$. To derive the same linear regression algorithm we obtained before, we define $p(y|x)=N(y;\hat y(x;w),\delta^2)$. The function $\hat y(x;w)$ gives the prediction of the mean of the Gaussian. In this example, we assume that the variance is fixed to some constant $\delta^2$ chosen by the user. We will see that this choice of the functional form of $p(y|x)$ causes the maximum likelihood estimation procedure to yield the same learning algorithm as we developed before. Since the examples are assumed to be i.i.d., the conditional log-likelihood (equation 5.63) is given by:

(5.64)
$$
\sum_{i=1}^m \log P(y^{(i)}|x^{(i)};\theta)
$$
(5.65)
$$
=-m\log\delta- \frac {m} {2}\log(2\pi)-\sum_{i=1}^m\frac {||\hat y^{(i)}-y^{(i)}||^2} {2\delta^2}
$$

where $\hat y^{(i)}$ is the output of the linear regression on the $i$-th input $x^{(i)}$ and $m$ is the number of training examples. Comparing the log-likelihood with the mean squared error,

(5.66)
$$
MSE_{train}=\frac{1}{m}\sum_{i=1}^m {||\hat y^{(i)}-y^{(i)}||^2} 
$$

We immediately see that maximizing the log-likelihood with respect to $w$ yields the same estimate of the parameters $w$ as does minimizing the mean squared error. The two criteria have different values but the same location of the optimum. This justifies the use of the MSE as a maximum likelihood estimation procedure. As we will see, the maximum likelihood estimator has several desirable properties.
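
The equivalence can be checked on a toy problem. The sketch below assumes a hypothetical one-dimensional data set, a model $\hat y = wx$ without an intercept, and $\delta = 1$. It uses a simple grid search instead of a real optimizer to keep the example short:

```python
import math

# Hypothetical training data for a model y_hat = w * x (no intercept).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
delta = 1.0  # fixed standard deviation, chosen by the user

def mse(w):
    """Mean squared error on the training set (equation 5.66)."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def log_likelihood(w):
    """Gaussian conditional log-likelihood (equation 5.65)."""
    m = len(xs)
    sq = sum((w * x - y) ** 2 for x, y in zip(xs, ys))
    return -m * math.log(delta) - m / 2 * math.log(2 * math.pi) - sq / (2 * delta ** 2)

# Grid of candidate weights from 0.000 to 4.000 in steps of 0.001.
ws = [i / 1000 for i in range(4001)]
w_mse = min(ws, key=mse)
w_ml = max(ws, key=log_likelihood)

# Closed-form least-squares solution for the no-intercept model.
w_closed = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print(w_mse, w_ml, w_closed)  # all three agree up to the grid resolution
```

The log-likelihood equals a constant minus $\frac{m}{2\delta^2}\,MSE_{train}$, so any $w$ that minimizes one maximizes the other, whatever value of $\delta$ is chosen.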

## Properties of Maximum Likelihood (Copy of Book)

The main appeal of the maximum likelihood estimator is that it can be shown to be the best estimator asymptotically, as the number of examples $m \to \infty$, in terms of its rate of convergence as $m$ increases.

The maximum likelihood estimator has two key properties: **consistency** (see section 5.4.5) and **good statistical efficiency**.

### The property of consistency (Copy of Book)

Under appropriate conditions, the maximum likelihood estimator has the property of consistency, meaning that as the number of training examples approaches infinity, the maximum likelihood estimate of a parameter converges to the true value of the parameter. These conditions are as follows:
1. The true distribution $p_{data}$ must lie within the model family $p_{model}(\cdot;\theta)$. Otherwise, no estimator can recover $p_{data}$.
2. The true distribution $p_{data}$ must correspond to exactly one value of $\theta$. Otherwise, maximum likelihood can recover the correct $p_{data}$ but will not be able to determine which value of $\theta$ was used by the data-generating process.
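
Consistency can be illustrated with a small simulation. The sketch below assumes a Gaussian data-generating distribution with a hypothetical true mean of 2.0 and known unit variance, so both conditions above hold. The seed and sample sizes are arbitrary choices:

```python
import random

def gaussian_mean_mle(samples):
    """Maximum likelihood estimate of the mean of a Gaussian: the sample mean."""
    return sum(samples) / len(samples)

rng = random.Random(0)           # fixed seed so the run is repeatable
true_mu, true_sigma = 2.0, 1.0   # hypothetical data-generating parameters

estimates = {}
for m in (10, 1000, 100000):
    samples = [rng.gauss(true_mu, true_sigma) for _ in range(m)]
    estimates[m] = gaussian_mean_mle(samples)

for m, mu_hat in estimates.items():
    # The error typically shrinks as m grows.
    print(m, mu_hat, abs(mu_hat - true_mu))
```

A single run like this only suggests the trend. Consistency is a statement about convergence in probability as $m \to \infty$, not a guarantee that every larger sample gives a smaller error.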

### Statistical efficiency (Copy of Book)
There are other inductive principles besides the maximum likelihood estimator, many of which share the property of being consistent estimators. Consistent estimators can differ, however, in their statistical efficiency, meaning that one consistent estimator may obtain lower generalization error for a fixed number of samples $m$, or equivalently, may require fewer examples to obtain a fixed level of generalization error.

Statistical efficiency is typically studied in the parametric case (as in linear regression), where our goal is to estimate the value of a parameter (assuming it is possible to identify the true parameter), not the value of a function. A way to measure how close we are to the true parameter is by the expected mean squared error, computing the squared difference between the estimated and true parameter values, where the expectation is over $m$ training samples from the data-generating distribution. That parametric mean squared error decreases as $m$ increases, and for $m$ large, the Cramér-Rao lower bound (Rao, 1945; Cramér, 1946) shows that no consistent estimator has a lower MSE than the maximum likelihood estimator.
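
As an illustration of efficiency, the sketch below compares two consistent estimators of the mean of a Gaussian: the sample mean, which is the maximum likelihood estimator, and the sample median. The true mean, sample size, number of trials and seed are hypothetical choices for this simulation:

```python
import random
import statistics

rng = random.Random(0)                 # fixed seed so the run is repeatable
true_mu, m, trials = 0.0, 25, 2000     # hypothetical simulation settings

sq_err_mean = 0.0
sq_err_median = 0.0
for _ in range(trials):
    samples = [rng.gauss(true_mu, 1.0) for _ in range(m)]
    mean_estimate = sum(samples) / len(samples)     # maximum likelihood estimator
    median_estimate = statistics.median(samples)    # another consistent estimator
    sq_err_mean += (mean_estimate - true_mu) ** 2
    sq_err_median += (median_estimate - true_mu) ** 2

# Parametric mean squared error of each estimator, averaged over the trials.
mse_mean = sq_err_mean / trials
mse_median = sq_err_median / trials

print(mse_mean, mse_median)  # the sample mean has the lower MSE
```

Both estimators approach the true mean as $m$ grows, but for the same $m$ the median needs more data to reach the same error, which is what lower statistical efficiency means.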

For these reasons (consistency and efficiency), maximum likelihood is often considered the preferred estimator to use for machine learning. When the number of examples is small enough to yield overfitting behavior, regularization strategies such as weight decay may be used to obtain a biased version of maximum likelihood that has less variance when training data is limited.


# A simple example of Maximum Likelihood (My own example by LI Jiawei)

Assume we have a data set containing 10 data points, each representing the time a student took to complete a specific examination (for example, a math exam). The data set is plotted below:

[Photo 07_01]

First, we have to choose a model, that is, a distribution, to describe these data points. Suppose we use a Gaussian distribution. The next step is to choose appropriate values of $\mu$ and $\sigma$, the two parameters that determine the location and shape of the Gaussian.

Let's look at the photo below.

[Photo 07_02]

Suppose we have four candidate Gaussian distributions, f1, f2, f3 and f4, each with different values of $\mu$ and $\sigma$. From the photo it is easy to see that the curve of f1 is the most likely to have generated all 10 data points. That is how maximum likelihood solves the problem, **intuitively**.

**Now, let's do the Maximum Likelihood Estimation mathematically.**

To simplify the calculation, let's use only 3 data points: 9.0, 9.5 and 11.0. How can we use maximum likelihood to find the values of $\mu$ and $\sigma$?

Here, we need to assume that each data point is generated independently, so that the probability of the whole data set is equal to the product of the probabilities of the individual data points.

The probability density of a single point generated by a Gaussian distribution is:

(Equation A)
$$
P(x;\mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$

Therefore, in the example, the joint probability density of the three data points is:

(Equation B)
$$
P(9,9.5,11;\mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(9-\mu)^2}{2\sigma^2}\right)\cdot\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(9.5-\mu)^2}{2\sigma^2}\right)\cdot\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(11-\mu)^2}{2\sigma^2}\right)
$$
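
To get a feel for (Equation B), it can be evaluated for a few candidate parameter pairs. The candidates below are hypothetical values picked for illustration. They are not the f1 to f4 curves from the photo:

```python
import math

data = [9.0, 9.5, 11.0]

def gaussian_pdf(x, mu, sigma):
    """Equation A: Gaussian density at a single point x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def likelihood(mu, sigma):
    """Equation B: product of the densities of all data points."""
    result = 1.0
    for x in data:
        result *= gaussian_pdf(x, mu, sigma)
    return result

# Hypothetical candidate (mu, sigma) pairs.
candidates = [(9.8, 0.85), (8.0, 1.0), (10.0, 3.0)]
scores = {c: likelihood(*c) for c in candidates}
best = max(scores, key=scores.get)

for c, s in scores.items():
    print(c, s)
print("most likely candidate:", best)
```

Comparing a handful of candidates only ranks those candidates. Finding the best $\mu$ and $\sigma$ over all possible values requires maximizing the expression itself.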

The maximum of this expression can then be found by setting its partial derivatives to 0 to locate the stationary point, which gives the values of $\mu$ and $\sigma$.

### log-maximum likelihood

However, the above expression is cumbersome to differentiate directly.

Taking the natural logarithm of (Equation B), we get:

(Equation C)
$$
\ln(P(9,9.5,11;\mu,\sigma))=\ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right)-\frac{(9-\mu)^2}{2\sigma^2}+\ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right)-\frac{(9.5-\mu)^2}{2\sigma^2}+\ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right)-\frac{(11-\mu)^2}{2\sigma^2}
$$

Applying the rules of logarithms, the above equation can be simplified to:

(Equation D)
$$
\ln(P(9,9.5,11;\mu,\sigma))=-3\ln(\sigma)-\frac{3}{2}\ln(2\pi)-\frac{1}{2\sigma^2}[(9-\mu)^2+(9.5-\mu)^2+(11-\mu)^2]
$$

The maximizing values of $\mu$ and $\sigma$ can be obtained by differentiation. Differentiating a sum is much easier than differentiating a product, and because the logarithm is monotonically increasing, the log-likelihood is maximized at the same $\mu$ and $\sigma$ as the likelihood. That is why we work with the log-likelihood rather than the likelihood here.

Taking the partial derivative with respect to $\mu$, we get:

(Equation E)
$$
\frac{\partial \ln(P(9,9.5,11;\mu,\sigma))}{\partial \mu}=\frac{1}{\sigma^2}[9+9.5+11-3\mu]
$$
Setting this derivative to 0, the value of $\mu$ is obtained:

(Equation F)
$$
\mu=\frac{9+9.5+11}{3}\approx 9.833
$$

In a similar way, taking the partial derivative with respect to $\sigma$ and setting it to 0 gives its maximizing value.
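
For the $\sigma$ step, differentiating (Equation D) gives $-\frac{3}{\sigma}+\frac{1}{\sigma^3}[(9-\mu)^2+(9.5-\mu)^2+(11-\mu)^2]$, and setting this to 0 yields $\sigma^2=\frac{1}{3}[(9-\mu)^2+(9.5-\mu)^2+(11-\mu)^2]$. The sketch below computes both estimates for the three data points. The helper `log_likelihood` is a name introduced here for illustration, and the nearby-point comparison is only a sanity check, not a proof:

```python
import math

data = [9.0, 9.5, 11.0]
n = len(data)

# Closed-form maximum likelihood estimates for a Gaussian.
mu_hat = sum(data) / n
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n)

def log_likelihood(mu, sigma):
    """Equation D, written for n data points."""
    sq = sum((x - mu) ** 2 for x in data)
    return -n * math.log(sigma) - n / 2 * math.log(2 * math.pi) - sq / (2 * sigma ** 2)

print(round(mu_hat, 3), round(sigma_hat, 3))  # 9.833 0.85

# Sanity check: nudging either parameter lowers the log-likelihood.
best = log_likelihood(mu_hat, sigma_hat)
print(best > log_likelihood(mu_hat + 0.1, sigma_hat))
print(best > log_likelihood(mu_hat, sigma_hat + 0.1))
```

Note that this $\sigma$ divides by $n$ rather than $n-1$, so it is the maximum likelihood estimate and not the unbiased sample standard deviation.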

That is the example of how to carry out maximum likelihood estimation using the log-likelihood.