# Lecture 8 - Maximum Likelihood and Maximum A Posteriori

In [None]:
import numpy as np
import scipy.stats as stats

import matplotlib.pyplot as plt
%matplotlib inline
# 'seaborn-colorblind' was renamed to 'seaborn-v0_8-colorblind' in Matplotlib 3.6
plt.style.use('seaborn-v0_8-colorblind')

We can look at **Regularized Least Squares** in the "Objective Function world", where we simply add a term to our objective in order to prevent overfitting and, consequently, allow the model to generalize to unseen and unknown data.

# Bayesian Interpretation

Another way to look at Regularized Least Squares is from a Bayesian point-of-view. To see this, let's look at our objective function:

\begin{align}
& \arg_{\mathbf{w}}\min \left(J(\mathbf{w})\right) \\
= & \arg_{\mathbf{w}}\max \left(- J(\mathbf{w})\right) \\
= & \arg_{\mathbf{w}}\max \left(\exp\left(- J(\mathbf{w})\right)\right) \text{, }\exp(\bullet)\text{ is a monotonic function}  
\end{align}

where

$$J(\mathbf{w})= \frac{1}{2}\sum_{n=1}^N \left(t_n - y_n\right)^2 + \frac{\lambda}{2} \sum_{i=0}^M w_i^2$$
and, consider e.g. the polynomial model (this could be *any* model)
$$y_n = \sum_{j=0}^M w_jx_n^j$$

Then,

\begin{align}
& \arg_{\mathbf{w}}\max \left(\exp\left(-\frac{1}{2}\sum_{n=1}^N \left(t_n - y_n\right)^2 - \frac{\lambda}{2} \sum_{i=0}^M w_i^2\right)\right) \\
= & \arg_{\mathbf{w}}\max \left(\exp\left(-\frac{1}{2}\sum_{n=1}^N \left(t_n - y_n\right)^2\right) \exp\left(- \frac{\lambda}{2} \sum_{i=0}^M w_i^2\right)\right) \\
=& \arg_{\mathbf{w}}\max \left(\prod_{n=1}^N \exp\left(-\frac{1}{2}\left(t_n - y_n\right)^2\right) \prod_{i=0}^M \exp \left(-\frac{\lambda}{2} w_i^2\right) \right)\text{, assuming the data }\{(x_n,t_n)\}_{n=1}^N\text{ is i.i.d.}  \\
\approx & \arg_{\mathbf{w}}\max \mathcal{N}\left(\mathbf{t}| \mathbf{y}, 1\right) \mathcal{N}\left(\mathbf{w}| \mathbf{0}, 1/\lambda\right) \\
=& \arg_{\mathbf{w}}\max p(\mathbf{t}|\mathbf{w}) p(\mathbf{w}), \mathbf{y}\text{ is a function of }\mathbf{w}\\
=& \arg_{\mathbf{w}}\max p(\mathbf{w}|\mathbf{t}) p(\mathbf{t}), \text{ using Bayes' Rule} \\
\propto & \arg_{\mathbf{w}}\max p(\mathbf{w}|\mathbf{t}), p(\mathbf{t})\text{ is constant for some fixed training set}  
\end{align}

where $p(\mathbf{t}|\mathbf{w})$ is known as the **data likelihood**, $p(\mathbf{w})$ is known as the **prior** on the parameters, and $p(\mathbf{w}|\mathbf{t})$ is the **posterior probability**.

In Machine Learning, maximizing the posterior in this way is known as **Maximum A Posteriori (MAP)** estimation.
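The equivalence derived above can be checked numerically. The sketch below is only an illustration under assumed values: the toy sine data, the polynomial order `M`, and `lam` are all hypothetical. It compares the closed-form minimizer of $J(\mathbf{w})$ with the maximizer of $\ln\left(p(\mathbf{t}|\mathbf{w})p(\mathbf{w})\right)$ found by a generic optimizer.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical toy data: noisy samples of a sine curve
N, M, lam = 20, 3, 0.1
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

# Polynomial design matrix with columns x^0, x^1, ..., x^M
X = np.vander(x, M + 1, increasing=True)

# Closed-form minimizer of J(w) = 0.5*||t - Xw||^2 + 0.5*lam*||w||^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(M + 1), X.T @ t)

def neg_log_posterior(w):
    """Return -ln[p(t|w) p(w)] (up to a constant) and its gradient."""
    r = t - X @ w
    value = 0.5 * r @ r + 0.5 * lam * w @ w   # Gaussian likelihood + Gaussian prior
    grad = -X.T @ r + lam * w
    return value, grad

# Maximizing the posterior is the same as minimizing its negative logarithm
w_map = minimize(neg_log_posterior, np.zeros(M + 1), jac=True).x

print("ridge:", np.round(w_ridge, 4))
print("MAP  :", np.round(w_map, 4))
```

Both routes should return the same weights up to the optimizer's tolerance, because the two objectives differ only by a monotonic transformation.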

* In practice, this means that we now can rewrite the Regularized Least Squares problem as the product between the *data likelihood* and a *prior distribution* on the parameters. 

    * In particular, for the Least Squares cost function and an L2-regularization term, both the likelihood and the prior are Gaussian distributions.
    
* Now, we can select **any** distribution function for our data and control the regularization also using a probabilistic model!

* **What is the shape of the prior distribution if we had considered the L1-norm or the Lasso regularizer?**

* **Using the same manipulation, what does our optimization function look like *without* a regularization term?**

In [None]:
x = np.linspace(-4,4,1000)
Gaussian = np.exp(-x**2/2)/np.sqrt(2*np.pi) #Gaussian with zero-mean and unit-variance
Laplacian = np.exp(-np.abs(x))/(2) #Laplacian with zero-mean and lambda=1

plt.figure(figsize=(10,7))
plt.plot(x, Gaussian, label='Gaussian Distribution')
plt.plot(x, Laplacian, label='Laplacian Distribution')
plt.legend(loc='best')
plt.xlabel("x")
plt.ylabel("p(x)")
plt.show()

### Another note on Feature Selection

And, again, the L1-norm penalty (or *Lasso regularizer*) term *prefers* to set weight parameters exactly to zero, whereas the squared L2-norm penalty (or *ridge regularizer*) term shrinks the elements of $\mathbf{w}$ toward zero but generally leaves them non-zero.

The Lasso regularizer promotes sparsity, which can be used to perform feature selection.
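The difference between the two penalties is easiest to see in a simplified scalar setting, which is an assumption made here purely for illustration. For a single weight with unregularized estimate $z$, minimizing $\frac{1}{2}(w-z)^2 + \lambda|w|$ gives the soft-thresholding rule, while minimizing $\frac{1}{2}(w-z)^2 + \frac{\lambda}{2}w^2$ gives a uniform shrinkage $z/(1+\lambda)$. The values in `z` and `lam` below are hypothetical.

```python
import numpy as np

def lasso_1d(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*|w| (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_1d(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2 (uniform shrinkage)."""
    return z / (1.0 + lam)

z = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])  # hypothetical unregularized weights
lam = 0.5

w_lasso = lasso_1d(z, lam)
w_ridge = ridge_1d(z, lam)

print("lasso:", w_lasso)
print("ridge:", w_ridge)
print("exact zeros (lasso, ridge):", np.sum(w_lasso == 0), np.sum(w_ridge == 0))
```

Weights whose magnitude is below `lam` are set exactly to zero by the L1 rule, whereas the L2 rule only rescales them.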

## Maximum Likelihood Estimation (MLE) & Maximum A Posteriori (MAP)

Recall that our goal is to find the set of (hyper-)parameters that best fits our data.

For the **Regularized Least Squares** objective function, we just showed that our optimization problem can be reduced to:

* Maximizing the **posterior** probability, which takes the shape of a Gaussian distribution, over the unknown (hyper-)parameters, also known as the **hypothesis** in statistical inference.

For the **Least Squares without regularization** objective function, the same manipulation reduces our optimization problem to:

* Maximizing the **data likelihood**, which takes the shape of a Gaussian distribution, over the unknown (hyper-)parameters, also known as the **hypothesis** in statistical inference.

Recall the decision rules for statistical inferencing:

<div class="alert alert-info" role="alert">
  <strong>Maximum Likelihood (ML) Decision Rule</strong>

Given some observational data $\{x_i,t_i\}_{i=1}^N$, we can perform *classical* (or frequentist) statistical inference by comparing the probability of the data under two hypotheses, $H_0$ and $H_1$. The decision rule is given by:
    
$$P(\text{data}|H_0) \underset{H_1}{\overset{H_0}{\gtrless}} P(\text{data}|H_1)$$
    
</div>

<div class="alert alert-info" role="alert">
  <strong>Maximum A Posteriori (MAP) Decision Rule</strong>

Given some observational data $\{x_i,t_i\}_{i=1}^N$, we can perform Bayesian statistical inference by testing different hypotheses $\{H_i\}, i=1,2,3,4, \dots$, each with an induced **prior** probability $P(H_i)\neq 0, \forall i$. The decision rule is given by:

\begin{align}
P(H_i|\text{data}) &\underset{H_j}{\overset{H_i}{\gtrless}} P(H_j|\text{data}), i\neq j \\
\iff \frac{P(\text{data}|H_i)P(H_i)}{P(\text{data})} &\underset{H_j}{\overset{H_i}{\gtrless}} \frac{P(\text{data}|H_j)P(H_j)}{P(\text{data})}\\
\iff P(\text{data}|H_i)P(H_i) &\underset{H_j}{\overset{H_i}{\gtrless}} P(\text{data}|H_j)P(H_j), P(\text{data})\neq 0
\end{align}
    
</div>

In our problem, the hypotheses are the *unknown* **(hyper-)parameters** $\mathbf{w}$.

* In Bayesian statistical inference, we are then trying to find the $\mathbf{w}$ that maximizes the posterior probability.
* In classical statistical inference, on the other hand, we only compute the probability of the data under some hypothesis (for example, the *null hypothesis*).

<h2 align="center"><span style="color:blue">Maximum Likelihood Estimation (MLE)</span></h2>
<center>(Frequentist approach)</center>

In **Maximum Likelihood Estimation** (also referred to as **MLE** or **ML**) we want to *find the set of parameters* that **maximize** the data likelihood $P(\mathbf{x}|\mathbf{w})$. We want to find the *optimal* set of parameters under some assumed distribution such that the data is most likely.

<h2 align="center"><span style="color:orange">Maximum A Posteriori (MAP)</span></h2>
<center>(Bayesian approach)</center>

In **Maximum A Posteriori** (also referred to as **MAP**) we want to *find the set of parameters* that **maximize** the posterior probability $P(\mathbf{w}|\mathbf{x})$. We want to find the *optimal* set of parameters under some assumed distribution such that the parameters are most probable given the data and some prior beliefs.

## Example

**Problem: Suppose I flip a coin 3 times and observe the event H-H-H. What is the probability of flipping Heads (H) on the next coin flip?**

Let $H_i$ be the event that it comes up heads on flip $i$. The sample space for this experiment is $S=\{H,T\}$. Consider the event $E=H_1\cap H_2\cap H_3$.

1. From Classical probability, what is the probability of heads in the next flip?

    * $P(H) = \frac{|H|}{|S|} = \frac{3}{3} = 1$

2. Bayesian Inference: What is the **hidden state** in this problem?

    * Hidden state: what type of coin was used in the experiment (fair, 2-headed)
    * So, by the Law of Total Probability:
    $P(H) = P(H|\text{fair})P(\text{fair}) + P(H|\text{2-headed})P(\text{2-headed})$
    * Furthermore, we can test different hypotheses by checking which hypothesis has the largest posterior probability value, e.g. if $P(\text{fair}|E) > P(\text{2-headed}|E)$, then hypothesis "fair" is more likely and that is what we will use to make predictions.
    
    
3. Note that the outcomes $H_i$ are **conditionally independent**, that is: $P(H_1\cap H_2|\text{fair}) = P(H_1|\text{fair})P(H_2|\text{fair})$. 
    * This is often an assumption that we make about data samples, we say that the samples are **independent and identically distributed (i.i.d.)**.
    
4. Recall that an experiment is **fair** if and only if (iff) the probability of each possible outcome (H,T) is equally likely to happen.

    * E.g., if $P(H)=P(T)=\frac{1}{2}$ then the experiment is fair.
    * But in this problem we do not know what the probabilities of the outcomes are. In fact, that is exactly what we seek. Just like the polynomial regression problem, where we are finding the best hypothesis model but do not know which parameter values $\mathbf{w}$ to use.
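The hypothesis test over the hidden state can be worked through numerically. The sketch below assumes a prior of 0.9 on the fair coin and 0.1 on the 2-headed coin; these prior values are hypothetical and chosen only to illustrate the calculation for the observed event H-H-H.

```python
# Assumed prior beliefs over the hidden state (hypothetical values)
prior = {"fair": 0.9, "2-headed": 0.1}
# Probability of heads on a single flip under each hypothesis
p_heads = {"fair": 0.5, "2-headed": 1.0}

n_heads = 3  # observed event E = H-H-H

# Likelihood P(E | hypothesis), using conditional independence of the flips
likelihood = {h: p_heads[h] ** n_heads for h in prior}

# Bayes' rule: P(hypothesis | E) = P(E | hypothesis) P(hypothesis) / P(E)
evidence = sum(likelihood[h] * prior[h] for h in prior)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}

# Law of Total Probability for the next flip, using the updated beliefs
p_next_heads = sum(p_heads[h] * posterior[h] for h in prior)

print(posterior)
print("MAP hypothesis:", max(posterior, key=posterior.get))
print("P(next flip is H | E) =", round(p_next_heads, 4))
```

Under this assumed prior, three heads in a row are not yet enough to overturn the belief in a fair coin, although the posterior has moved substantially toward the 2-headed hypothesis.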

Now, let's consider heads=1 and tails=0, so our sample space is $S=\{1,0\}$. The probability of heads is equal to some *unknown* value $\mu$, then:

\begin{align}
& P(x=1 | \mu) = \mu \\
& P(x=0|\mu) = 1-\mu
\end{align}

We can compute the data likelihood as:

$$P(x|\mu) = \mu^x(1-\mu)^{1-x} = \begin{cases}\mu & \text{if }x=1 \\ 1-\mu & \text{if } x=0 \end{cases}$$

* This is the **Bernoulli distribution**. The mean and variance of the Bernoulli distribution are: $E[x] = \mu$ and $E[\left(x- E[x]\right)^2] = \mu(1-\mu)$.

* So, we model every outcome in the event $E$ with a Bernoulli distribution, and the outcomes are pairwise **conditionally independent**. Therefore, the event $E$ contains i.i.d. outcomes.
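As a quick sanity check, the Bernoulli formula above can be compared with `scipy.stats.bernoulli`. The value of `mu` is a hypothetical choice for illustration.

```python
import numpy as np
import scipy.stats as stats

mu = 0.3  # hypothetical probability of heads

def bernoulli_pmf(x, mu):
    """P(x | mu) = mu^x (1 - mu)^(1 - x) for x in {0, 1}."""
    return mu ** x * (1 - mu) ** (1 - x)

x = np.array([0, 1])
pmf_formula = bernoulli_pmf(x, mu)        # formula from the text
pmf_scipy = stats.bernoulli.pmf(x, mu)    # SciPy's implementation

# Mean and variance as reported by SciPy
mean, var = stats.bernoulli.stats(mu, moments="mv")

print(pmf_formula, pmf_scipy)
print("mean:", mean, " variance:", var)
```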

### Method 1: Maximum Likelihood Estimator

For simplicity of calculation, assume that the event contains outcomes: $E=x_1\cap x_2\cap \dots\cap x_N$, where $x_i\in\{0,1\}$ (0 for Tails and 1 for Heads). Then, for an experiment with $N$ samples, we can write the **data likelihood** as:


\begin{align}
P(E|\mu) &= P(x_1\cap x_2\cap \dots\cap x_N|\mu) \\
&= P(x_1|\mu)P(x_2|\mu)\dots P(x_N|\mu) \\
&= \prod_{n=1}^N P(x_n|\mu) \\
&= \prod_{n=1}^N \mu^{x_n} (1-\mu)^{1-x_n}
\end{align}


* Now, we are interested in finding the value of $\mu$ given some data set $E$. 

We now optimize the data likelihood. What trick can we use?

$$\arg_{\mu} \max P(E|\mu) = \arg_{\mu} \max \ln \left( P(E|\mu) \right)$$

because the $\ln(\bullet)$ is a monotonic function.

Where 
$$\ln(P(E|\mu)) = \sum_{n=1}^N \left(x_n \ln(\mu) + (1-x_n)\ln(1-\mu)\right)$$

So now we can take the derivative of this function with respect to $\mu$ and set it equal to zero:

$$\frac{\partial \ln(P(E|\mu))}{\partial \mu} = 0$$

\begin{align}
(1-\mu)\sum_{n=1}^N x_n - \mu \left(N - \sum_{n=1}^N x_n\right) &= 0 \\
\sum_{n=1}^N x_n - \mu\sum_{n=1}^N x_n - \mu N + \mu\sum_{n=1}^N x_n &= 0 \\
\sum_{n=1}^N x_n - \mu N &= 0 \\
\mu &= \frac{1}{N} \sum_{n=1}^N x_n
\end{align}

So the MLE estimate of the probability of seeing heads in the next coin flip is equal to the **relative frequency** of the outcome heads.

* Suppose you flipped the coin only once, and saw Tails. The probability of flipping Heads according to MLE would be 0.

* MLE is **purely data driven**! This is sufficient *when* we have lots and lots of data. 
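The closed-form result can be checked by evaluating the log-likelihood on a grid of candidate values for $\mu$. This is a minimal sketch; the sequence of flips in `x` is hypothetical.

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 0])  # hypothetical flips, 1 = Heads

def log_likelihood(mu, x):
    """ln P(E | mu) for i.i.d. Bernoulli outcomes."""
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

# Evaluate on a grid that excludes 0 and 1, where the logarithm diverges
grid = np.linspace(0.001, 0.999, 999)
values = np.array([log_likelihood(mu, x) for mu in grid])

mu_grid = grid[np.argmax(values)]  # numerical maximizer
mu_mle = x.mean()                  # closed-form result: relative frequency of Heads

print("grid search:", mu_grid)
print("closed form:", mu_mle)
```

The grid maximizer should agree with the sample mean up to the grid spacing.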

### Method 2: Maximum A Posteriori

In the MAP estimation of $\mu$, we are instead optimizing the posterior probability:

\begin{align}
&\arg_{\mu} \max P(\mu|E) \\
=& \arg_{\mu} \max \frac{P(E|\mu) P(\mu)}{P(E)} \\
\propto & \text{  } \arg_{\mu} \max P(E|\mu) P(\mu), P(E)\text{ is some constant value} 
\end{align}

We have defined the data likelihood $P(E|\mu)$, we now need to choose a **prior distribution** $P(\mu)$.

* This prior distribution will *encode* any prior knowledge we have about the hidden state of the problem, in this case, the type of coin that was used.

Let's say our **prior distribution** is a Beta Distribution. A **Beta Distribution** takes the form:

$$\text{Beta}(x|\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1}$$

where $\Gamma(x) = (x-1)!$ for positive integers $x$, and $\alpha,\beta>0$.

The mean and variance of the Beta distribution are: $E[x] = \frac{\alpha}{\alpha+\beta}$ and $E[(x-E[x])^2] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$.
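The moments quoted above, and the factorial property of the Gamma function, can be checked against `math.gamma` and `scipy.stats.beta`. The shape parameters below are hypothetical values picked for illustration.

```python
import math
import scipy.stats as stats

alpha, beta = 2, 10  # hypothetical shape parameters

# Gamma(x) = (x - 1)! for positive integers x
gamma_5 = math.gamma(5)
print(gamma_5, math.factorial(4))

# Mean and variance from the formulas in the text
mean_formula = alpha / (alpha + beta)
var_formula = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))

# Same quantities from SciPy's Beta distribution
prior = stats.beta(alpha, beta)
print(mean_formula, prior.mean())
print(var_formula, prior.var())
```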

* Let's see what that looks like:

In [None]:
import math

a = 2
b = 2
x = np.arange(0,1,0.0001)
Beta = (math.gamma(a+b)/(math.gamma(a)*math.gamma(b)))*x**(a-1)*(1-x)**(b-1)

plt.plot(x, Beta, label='Beta Distribution')
plt.legend(loc='best')
plt.xlabel(r'Probability of Heads, $\mu$',fontsize=15)  # raw string keeps the LaTeX backslash intact
plt.ylabel(r'Prior Probability, p($\mu$)',fontsize=15)
plt.show()

Using the Beta Distribution as our prior, we have:

\begin{align}
P(\mu|\alpha,\beta) &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \mu^{\alpha-1} (1-\mu)^{\beta-1} \\
&\propto \mu^{\alpha-1} (1-\mu)^{\beta-1}
\end{align}

Let:
* $m$ be the number of heads
* $l$ be the number of tails
* $N=m+l$ be the total number of coin flips 

We can write our **posterior probability** as:

\begin{align}
P(\mu|E) &= \frac{P(E|\mu)P(\mu)}{P(E)}\\
&\propto P(E|\mu)P(\mu)\\
&= \left(\prod_{n=1}^N \mu^{x_n} (1-\mu)^{1-x_n}\right) \mu^{\alpha-1} (1-\mu)^{\beta-1} \\
&= \mu^m (1-\mu)^l \mu^{\alpha-1} (1-\mu)^{\beta-1} \\
&= \mu^{m+\alpha-1} (1-\mu)^{l+\beta-1}
\end{align}

* The posterior probability has the same functional form as the prior: up to normalization, it is again a Beta distribution, $\text{Beta}(\mu|m+\alpha, l+\beta)$.

* This is a special case called **Conjugate Prior Relationship**, which happens when the posterior has the same form as the prior.
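Conjugacy can be verified numerically: normalizing the product of likelihood and prior on a grid should reproduce a Beta density with updated parameters $m+\alpha$ and $l+\beta$. The sketch below assumes a Beta(2, 2) prior and the H-H-H data; both choices are illustrative.

```python
import numpy as np
import scipy.stats as stats

alpha, beta = 2, 2   # hypothetical prior shape parameters
m, l = 3, 0          # observed H-H-H: 3 heads, 0 tails

mu = np.linspace(0, 1, 10001)
d_mu = mu[1] - mu[0]

# Unnormalized posterior: likelihood times prior
unnormalized = mu ** m * (1 - mu) ** l * stats.beta.pdf(mu, alpha, beta)

# Normalize numerically (the trapezoidal rule approximates P(E))
evidence = np.sum((unnormalized[:-1] + unnormalized[1:]) / 2) * d_mu
posterior_numeric = unnormalized / evidence

# Conjugacy predicts a Beta(m + alpha, l + beta) posterior
posterior_beta = stats.beta.pdf(mu, m + alpha, l + beta)

max_diff = np.max(np.abs(posterior_numeric - posterior_beta))
print("largest pointwise difference:", max_diff)
```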

We can now optimize our posterior probability, and we will apply the same trick:

$$\arg_{\mu} \max P(\mu|E) = \arg_{\mu} \max \ln \left( P(\mu|E) \right)$$

where

$$ \ln \left( P(\mu|E) \right) =  (m+\alpha-1)\ln(\mu) + (l+\beta-1)\ln(1-\mu)$$

We can now *optimize* our posterior probability:

\begin{align}
\frac{\partial  \ln \left( P(\mu|E) \right)}{\partial \mu} &= 0\\
\frac{m+\alpha-1}{\mu} - \frac{l+\beta-1}{1-\mu} &= 0\\
\mu &= \frac{m+\alpha-1}{m + l + \alpha + \beta -2}
\end{align}

This is our estimation of the probability of heads using MAP!

* Our estimation for the probability of heads, $\mu$, is going to depend on $\alpha$ and $\beta$ introduced by the prior distribution. We saw that they control the level of certainty as well as the center value.

* With only a few samples, the prior will play a bigger role in the decision, but eventually the data dominates the prior.
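Returning to the H-H-H problem, the closed-form MAP estimate can be compared with a grid search over the log posterior and with the MLE. The Beta(2, 2) prior used here is a hypothetical choice that is centered on a fair coin.

```python
import numpy as np

def map_estimate(m, l, alpha, beta):
    """Closed-form MAP estimate of mu under a Beta(alpha, beta) prior."""
    return (m + alpha - 1) / (m + l + alpha + beta - 2)

def log_posterior(mu, m, l, alpha, beta):
    """ln P(mu | E) up to an additive constant."""
    return (m + alpha - 1) * np.log(mu) + (l + beta - 1) * np.log(1 - mu)

m, l = 3, 0            # the H-H-H event from the example
alpha, beta = 2, 2     # hypothetical prior centered on a fair coin

# Grid search excludes 0 and 1, where the logarithm diverges
grid = np.linspace(0.001, 0.999, 999)
mu_grid = grid[np.argmax(log_posterior(grid, m, l, alpha, beta))]

mu_mle = m / (m + l)
mu_map = map_estimate(m, l, alpha, beta)

print("MLE:", mu_mle)
print("MAP (closed form):", mu_map)
print("MAP (grid search):", mu_grid)
```

With this prior, the MAP estimate is pulled away from the MLE value of 1 toward the prior's center, which reflects the remaining belief that Tails is possible.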

Let's run a simulation to compare MAP and MLE estimators.

In [None]:
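trueMU = 0.5 # true probability of Heads (0.5 for a fair coin)
Nflips = 10  # number of coin flips to simulate
a = 2        # Beta prior parameter alpha
b = 10       # Beta prior parameter beta (b > a: the prior favors Tails)

Outcomes = []
for i in range(Nflips):
    # Flip the coin once: 1 = Heads, 0 = Tails
    Outcomes.append(int(stats.bernoulli.rvs(trueMU)))
    print(Outcomes)
    # MLE: relative frequency of Heads observed so far
    print('MLE aka Frequentist Probability of Heads = ', np.sum(Outcomes)/len(Outcomes))
    # MAP: closed-form estimate under the Beta(a, b) prior
    print('MAP aka Bayesian Probability of Heads = ', (np.sum(Outcomes)+a-1)/(len(Outcomes)+a+b-2))
    input('Press enter to flip the coin again...\n')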

<h2 align="center"><span style="color:blue">Maximum Likelihood Estimation (MLE)</span></h2>
<center>(Frequentist approach)</center>

$$\arg_{\mathbf{w}} \max P(\mathbf{x}|\mathbf{w})$$

In **Maximum Likelihood Estimation** we *find the set of parameters* that **maximize** the data likelihood $P(\mathbf{x}|\mathbf{w})$. We find the *optimal* set of parameters under some assumed distribution such that the data is most likely.

* MLE focuses on maximizing the data likelihood, which *usually* provides a pretty good estimate

* A common trick to maximize the data likelihood is to maximize the log likelihood

* MLE is purely data driven 

* MLE works best when we have lots and lots of data

* MLE will likely overfit when we have small amounts of data or, at least, becomes unreliable

* It estimates relative frequency for our model parameters. Therefore it needs incredibly large amounts of data (infinite!) to estimate the true likelihood parameters
    * This is a problem when we want to make inferences and/or predictions outside the range covered by the training data

<h2 align="center"><span style="color:orange">Maximum A Posteriori (MAP)</span></h2>
<center>(Bayesian approach)</center>

\begin{align}
& \arg_{\mathbf{w}} \max P(\mathbf{x}|\mathbf{w})P(\mathbf{w}) \\ 
& \propto \arg_{\mathbf{w}} \max P(\mathbf{w}|\mathbf{x})
\end{align}

In **Maximum A Posteriori** we *find the set of parameters* that **maximize** the posterior probability $P(\mathbf{w}|\mathbf{x})$. We find the *optimal* set of parameters under some assumed distribution such that the parameters are most probable given the data and our prior beliefs.

* MAP focuses on maximizing the posterior probability: the data likelihood weighted by a prior

* A common trick to maximize the posterior probability is to maximize the log posterior

* MAP is driven by both the data and the prior beliefs

* With small amounts of data, MAP is mostly driven by the prior beliefs

* MAP works great with small amounts of data *if* our prior was chosen well

* We need to assume and select a distribution for our prior beliefs
    * A wrong choice of prior distribution can negatively impact our model estimation
    
* When we have lots and lots of data, the data likelihood will take over and the posterior will depend less and less on the prior
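The last point can be illustrated with the coin example. The sketch below uses idealized data in which exactly half of the flips are Heads, and a hypothetical Beta(2, 10) prior that strongly favors Tails; it tracks the gap between the MLE and MAP estimates as the number of flips grows.

```python
alpha, beta = 2, 10   # hypothetical prior that strongly favors Tails
true_mu = 0.5

gaps = []
for n_flips in [10, 100, 1000, 10000]:
    m = int(true_mu * n_flips)   # idealized data: exactly half the flips are Heads
    l = n_flips - m
    mle = m / n_flips
    map_ = (m + alpha - 1) / (m + l + alpha + beta - 2)
    gaps.append(abs(mle - map_))
    print(f"N={n_flips:>5}  MLE={mle:.4f}  MAP={map_:.4f}  gap={gaps[-1]:.4f}")
```

The gap shrinks roughly in proportion to $1/N$, since $\alpha$ and $\beta$ act like a fixed number of pseudo-observations that the real data eventually outweighs.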