# Introduction to Learning Probabilistic Models

Graphical models are a powerful mechanism for capturing structure in distributions that enables efficient inference. We're now ready to talk about how to choose such models. This is the problem of *model selection*. Invariably, domain knowledge guides aspects of the model we choose (for example, the underlying physics, biology, or economics). However, in practice, some aspects of the model must often be learned directly from data.

So now let's turn our attention to how to learn models. We'll begin with a discussion of a rather general framework of learning based on the time-honored principle of *maximum likelihood*. We will then put it to work to help us learn a wide range of tree-structured graphical models, including hidden Markov models and what are referred to as naive Bayes models, which, despite their name, are surprisingly powerful.

In particular, we'll show how to efficiently learn the conditional distribution tables that characterize a tree model. This is called *parameter estimation*. And we will see that the concept of empirical distributions or histograms plays an important role.

Finally, we will show how to solve the problem of finding a tree that best fits the data, which is referred to as *structure learning*. Amazingly, even though there is a super-exponential number of possible trees, we will see how to solve the problem with only quadratic search complexity using the celebrated Chow-Liu algorithm, which is based on measuring empirical mutual information. Exciting applications for the learning methodology we develop are everywhere, from natural language processing and automatic face recognition to phylogenetics and well beyond. So let's begin.

## Notation and Outline

Thus far in the course, we've been given a probabilistic model of the uncertain world, from which we produced predictions given observations. But where do these probabilistic models come from? We now turn to the problem of learning such models (also referred to as model selection since we are selecting which model to use).

As a concrete example that we'll build on later in this section: We don't actually know the underlying probability distribution for what makes an email considered spam vs. “ham" (i.e., not spam). However, we do have access to plenty of training examples of what emails are spam or ham, where a user has manually flagged her or his incoming emails as spam or not. From all these training examples, we could learn a model for what spam emails look like and what ham emails look like.

There are two levels of learning we consider:

- **Parameter learning**: Suppose we know what the edges are in an undirected graphical model but we don't know what the table entries should be for the potentials – how do we estimate these entries?

- **Structure learning**: What if we know neither the parameters nor which edges are present in an undirected graphical model? In this case, we could first figure out what edges are present. After we decide on which edges are present, then the problem reduces to the first problem of parameter learning. 

In both cases, the high-level setup is the same: there is some underlying probability distribution $p$ that we don't know details for but want to learn. The distribution $p$ has some parameter (or a set of parameters) $\theta$. We will assume that we can collect $n$ independent samples $X^{(1)}, \dots , X^{(n)}$ from the distribution (these samples are often referred to as "training data"). Given these samples, we aim to estimate $\theta$ using what's called "maximum likelihood", which tries to learn a model that in some sense best fits the training data we have available. Along the way, we will see that information theory plays a crucial role: it not only helps us develop computer programs that learn probability distributions, but also provides an interpretation of what those programs are doing.

We begin with *parameter learning*, starting with a single node graphical model and working our way to richer graphical models whose potential table entries we aim to learn:

1. A single node undirected graphical model corresponding to a single finite random variable

2. A graphical model called the naive Bayes model which can be used for tasks like email spam detection and handwritten digit recognition

3. General tree-structured undirected graphical models 

We then turn toward *structure learning*, proceeding directly to the most general class of graphical models we consider: tree-structured undirected graphical models, where we learn which edges to include in the tree and what potential tables to assign to the edges. 

# I— Introduction to Maximum Likelihood

## I.1— The Maximum Likelihood Estimator

We consider a single binary random variable $X$, that for simplicity can be thought of as a (possibly biased) coin flip, taking on values in the set $\mathcal{X}=\{ \text {heads},\text {tails}\}$. While this setup is extremely simple, understanding how parameter learning works here will already give us most of the intuition for our parameter learning coverage! We'll generalize to the finite (can be non-binary) random variable case later.

The issue is that the probability of heads, which we denote $\theta$, is unknown, and we'd like to estimate (or "learn") this probability.

The probability table is:
![images_sec-maximum-likelihood-intro](./images/images_sec-maximum-likelihood-intro.png)

**Notation**: Recall that previously we would write the probability table of random variable $X$ as $p_{X}$ or $p_{X}(\cdot )$. However, now to make it explicit that we don't know $\theta$, which we aim to estimate, we will denote the probability table as $p_{X}(\cdot ;\theta )$. The semi-colon is used to say that everything after the semi-colon in the parentheses refers to parameter(s) of the model. The probability that $X=x$ is denoted as $p_{X}(x;\theta )$.

To estimate parameter $\theta$, we assume we have flipped the coin $n$ times to get outcomes $X^{(1)},X^{(2)},\dots ,X^{(n)}$ which are i.i.d. samples from the same distribution as $X$. In other words,
$$
p_{X^{(1)},X^{(2)},\dots ,X^{(n)}}(x^{(1)},x^{(2)},\dots ,x^{(n)};\theta )=\prod _{i=1}^{n}p_{X}(x^{(i)};\theta ).
$$

The above probability is called the "likelihood" of the data — it is the probability of seeing the observed data as a function of the unknown parameter $\theta$. Note that the observed data are treated as fixed constants! We denote the likelihood as $L(\theta )$.

As an example, if we observed the sequence heads, tails, heads, then $n = 3$, $x^{(1)} = \text {heads}$, $x^{(2)} = \text {tails}$, and $x^{(3)} = \text {heads}$, and the likelihood is
$$
L(\theta ) = \underbrace{\theta }_{\text {heads}} \cdot \underbrace{(1 - \theta )}_{\text {tails}} \cdot \underbrace{\theta }_{\text {heads}} = \theta ^2 (1 - \theta ).
$$

Next, to estimate (or learn) $\theta$, we will use “maximum likelihood" which *maximizes the likelihood function $L$* over possible values of the parameter $\theta$.

Put another way, we find whichever probability of heads $\theta$ makes the probability of seeing the samples $X^{(1)}=x^{(1)},X^{(2)}=x^{(2)},\dots ,X^{(n)}=x^{(n)}$ as high as possible.

Formally, the maximum likelihood estimate $\widehat{\theta }$ for parameter $\theta$ is the solution to the following optimization problem:
$$
\widehat{\theta }=\arg \max _{\theta \in [0,1]}\underbrace{\prod _{i=1}^{n}p_{X}(x^{(i)};\theta )}_{\text {likelihood}}.
$$

It turns out the answer in this case is quite simple.

**Claim**: The maximum likelihood estimate $\widehat{\theta }$ for the true probability of heads $\theta$ is simply the fraction of times we see heads in the observed sequence $x^{(1)},\dots ,x^{(n)}$, i.e.,
$$
\widehat{\theta }=\frac{\text {number of times heads appears}}{n}=\frac{\sum _{i=1}^{n}\mathbf{1}\{ x^{(i)}=\text {heads}\} }{n}.
$$

For example, if we observed the sequence heads, tails, heads, then the maximum likelihood estimate for the probability of heads is 2/3 since 2/3 of the tosses were heads.
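As a quick sanity check on the claim, here is a minimal Python sketch. The function names `likelihood` and `coin_mle` and the grid resolution are illustrative choices, not part of the course material. It evaluates the likelihood on a grid of candidate values of $\theta$ and compares the best grid point with the fraction of heads.

```python
from typing import Sequence


def likelihood(theta: float, flips: Sequence[str]) -> float:
    """L(theta): probability of the observed flips when P(heads) = theta."""
    value = 1.0
    for flip in flips:
        value *= theta if flip == "heads" else 1.0 - theta
    return value


def coin_mle(flips: Sequence[str]) -> float:
    """Closed-form ML estimate: the fraction of flips that are heads."""
    return sum(flip == "heads" for flip in flips) / len(flips)


flips = ["heads", "tails", "heads"]

# Brute-force check: evaluate L on a grid over [0, 1] and keep the best point.
grid = [i / 1000 for i in range(1001)]
theta_grid = max(grid, key=lambda t: likelihood(t, flips))

print(coin_mle(flips))  # fraction of heads in the observed sequence
print(theta_grid)       # grid maximizer, which should land next to 2/3
```

The grid search is only a check; the closed-form fraction is what one would use in practice.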

Let's prove the claim.

First, by how the problem is set up:
$$
p_{X}(x^{(i)};\theta)=\begin{cases}
\theta & \text{if }x^{(i)}=\text{heads},\\
1-\theta & \text{if }x^{(i)}=\text{tails}.
\end{cases}
$$

Next, letting $n_{\text {heads}}\triangleq \sum _{i=1}^{n}\mathbf{1}\{ x^{(i)}=\text {heads}\}$ be the number of times heads occurred, and $n_{\text {tails}}\triangleq \sum _{i=1}^{n}\mathbf{1}\{ x^{(i)}=\text {tails}\}$ be the number of times tails occurred, we have, due to independence of the coin flips and because the coin flips all have the same distribution:
$$
\begin{eqnarray}
% \text{likelihood}
L(\theta)
&=& \prod_{i=1}^{n}p_{X}(x^{(i)};\theta) \\
&=& \big(\underbrace{\theta\times\cdots\times\theta}_{n_{\text{heads}}\text{ times}}\big)\times\big(\underbrace{(1-\theta)\times\cdots\times(1-\theta)}_{n_{\text{tails}}\text{ times}}\big) \\
&=& \theta^{n_{\text{heads}}}(1-\theta)^{n_{\text{tails}}}.
\end{eqnarray}
$$

Our next goal is to maximize the likelihood.

Mathematically it'll be easier to work with the log of the likelihood. Note that the value of $\theta$ that maximizes the likelihood is the same as the value of $\theta$ that maximizes the log of the likelihood, because the log function is strictly increasing, so:
$$
\begin{eqnarray}
\widehat{\theta} &=&\arg\max_{\theta\in[0,1]} L(\theta) \\
&=&\arg\max_{\theta\in[0,1]}\log( L(\theta) )\\
&=&\arg\max_{\theta\in[0,1]}\log\big(\theta^{n_{\text{heads}}}(1-\theta)^{n_{\text{tails}}}\big)\\
&=&\arg\max_{\theta\in[0,1]}\big\{\underbrace{n_{\text{heads}}\log\theta+n_{\text{tails}}\log(1-\theta)}_{\triangleq\ell(\theta)}\big\}.
\end{eqnarray}
$$

Let's examine the log likelihood function $\ell$ that we've just defined and that we're aiming to maximize:
$$
\ell (\theta )=n_{\text {heads}}\log \theta +n_{\text {tails}}\log (1-\theta ).
$$

From calculus you may remember that we can optimize this function by looking at what value for $\theta$ makes $\frac{d\ell (\theta )}{d\theta }=0$. It turns out that this will be sufficient and solving this optimization (which we'll spell out shortly) will yield that the optimal $\widehat{\theta }$ is the fraction of heads observed in the $n$ flips.

However, for just this binary random variable case, we'll be a little bit more rigorous and give all the details, which should help you connect what you've seen in calculus to what is happening here in maximum likelihood estimation.

Without further ado, when we maximize $\ell (\theta )$, we break the problem up into three cases:

- If $n_{\text {tails}}=0$, then $\ell (\theta )=n_{\text {heads}}\log \theta$ is maximized when $\theta =1$ since $\log \theta$ increases as $\theta$ ranges from 0 to 1.

- If $n_{\text {heads}}=0$, then $\ell (\theta )=n_{\text {tails}}\log (1-\theta )$ is maximized when $\theta =0$ since $\log (1-\theta )$ decreases as we increase $\theta$ from 0 to 1.

- If neither $n_{\text {heads}}$ nor $n_{\text {tails}}$ is 0, which means that both are positive (they can't be negative, and they can't both be 0 since that would mean we didn't have any observations), then note that $\ell$ is a differentiable real-valued function defined on the domain $0<\theta <1$. To see why this is the case:

    - Note that we can't have $\theta <0$ or $\theta >1$ because in both of these cases, we'd have a term in $\ell (\theta )$ that is the log of a negative number, which isn't defined for real-valued functions.
  
    - When $\theta =0$ or $\theta =1$, we encounter a term $\log 0=-\infty$, which is not a real number, and so $\ell (\theta )$ in this case is not a real number.
  
    - When $\theta \in (0,1)$, then $\log \theta$ and $\log (1-\theta )$ are both real numbers, and so is $\ell (\theta )$. 

    Thus, $\ell$ is only a real-valued function on the interval $\theta \in (0,1)$. In fact, within this interval, we can compute its first and second derivatives (which we'll do shortly). At this point we recall from calculus that to optimize $\ell$, it's sufficient to do two steps:

    1. Find when $\frac{d\ell (\theta )}{d\theta }=0$. Supposing that this happens for only one value of $\theta$, we'll call this best value $\widehat{\theta }$.

    2. Check that the second derivative $\frac{d^{2}\ell }{d\theta ^{2}}$ evaluated at $\widehat{\theta }$ is negative, which implies that the point we found in the first step corresponds to the unique maximum. 

    Let's do step 1:
    $$
    \begin{eqnarray}
    0&=&\Big[\frac{d\ell}{d\theta}\Big]_{\theta=\widehat{\theta}}\\
     &=&\bigg[\frac{d}{d\theta}\big\{ n_{\text{heads}}\log\theta+n_{\text{tails}}\log(1-\theta)\big\}\bigg]_{\theta=\widehat{\theta}}\\
    &=&\bigg[n_{\text{heads}}\frac{1}{\theta}-n_{\text{tails}}\frac{1}{1-\theta}\bigg]_{\theta=\widehat{\theta}}\\
    &=&n_{\text{heads}}\frac{1}{\widehat{\theta}}-n_{\text{tails}}\frac{1}{1-\widehat{\theta}}.
    \end{eqnarray}
    $$

    Multiplying through by $\widehat{\theta }(1-\widehat{\theta })$, we get:
    $$
    \begin{eqnarray}
    0
    &=& n_{\text{heads}}(1-\widehat{\theta})-n_{\text{tails}}\widehat{\theta} \\
    &=& n_{\text{heads}}-n_{\text{heads}}\widehat{\theta}-n_{\text{tails}}\widehat{\theta} \\
    &=& n_{\text{heads}}-(n_{\text{heads}}+n_{\text{tails}})\widehat{\theta}.
    \end{eqnarray}
    $$

    In other words,
    $$
    \widehat{\theta }=\frac{n_{\text {heads}}}{n_{\text {heads}}+n_{\text {tails}}}=\frac{n_{\text {heads}}}{n}.
    $$

    Step 2: The second derivative of $\ell$ with respect to $\theta$ is
    $$
    \frac{d^{2}\ell }{d\theta ^{2}}=-n_{\text {heads}}\frac{1}{\theta ^{2}}-n_{\text {tails}}\frac{1}{(1-\theta )^{2}},
    $$

    which is always negative since $\theta ^{2}$ and $(1-\theta )^{2}$ are always positive when $\theta \in (0,1)$.

    Lastly, we also check the boundary, i.e., we make sure that indeed the log likelihood $\ell$ evaluated at $\widehat{\theta }$ is larger than each of $\ell (0)$ and $\ell (1)$. We have
    $$
    \begin{eqnarray}
    \ell (0) 	& = 	& \underbrace{n_{\text {heads}} \log 0}_{-\infty } + \underbrace{n_{\text {tails}} \log 1}_0 = -\infty , \\	 	 
    \ell (1) 	& = 	& \underbrace{n_{\text {heads}} \log 1}_0 + \underbrace{n_{\text {tails}} \log 0}_{-\infty } = -\infty , 	\\ 	 
    \ell (\widehat{\theta }) 	& = 	& n_{\text {heads}} \underbrace{\log \frac{n_{\text {heads}}}n}_{>-\infty } + n_{\text {tails}} \underbrace{\log \frac{n_{\text {tails}}}n}_{>-\infty } > -\infty ,
    \end{eqnarray}
    $$

    so indeed $\ell (\widehat{\theta }) > \ell (0)$ and $\ell (\widehat{\theta }) > \ell (1)$. 

To summarize:

- If $n_{\text {tails}}=0$, then we set $\widehat{\theta }=1$
- If $n_{\text {heads}}=0$, then we set $\widehat{\theta }=0$
- If $n_{\text {heads}}>0$ and $n_{\text {tails}}>0$, then we set $\widehat{\theta }=\frac{n_{\text {heads}}}{n}$.

Notice that this same formula holds true even for the cases of $n_{\text {tails}}=0$ or $n_{\text {heads}}=0$, where it gives $1$ and $0$ respectively. The only reason we were careful to look at the first two cases separately is that, in those cases, the maximizer sits on the boundary ($\theta =1$ or $\theta =0$), where the stationary-point argument on the open interval $(0,1)$ does not apply.

Putting together the pieces, the maximum likelihood estimate for the probability of heads is equal to the fraction of heads we see amongst our $n$ coin flips:
$$
\widehat{\theta }=\frac{n_{\text {heads}}}{n}.
$$
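The two calculus steps can also be checked numerically. The sketch below is only an illustration: the counts `n_heads = 7` and `n_tails = 3` are hypothetical, and the function names are my own. It evaluates the first and second derivatives of $\ell$ at $\widehat{\theta } = n_{\text {heads}}/n$.

```python
import math


def log_likelihood(theta: float, n_heads: int, n_tails: int) -> float:
    """ell(theta) = n_heads*log(theta) + n_tails*log(1 - theta), for 0 < theta < 1."""
    return n_heads * math.log(theta) + n_tails * math.log(1.0 - theta)


def d_log_likelihood(theta: float, n_heads: int, n_tails: int) -> float:
    """First derivative of ell with respect to theta."""
    return n_heads / theta - n_tails / (1.0 - theta)


def d2_log_likelihood(theta: float, n_heads: int, n_tails: int) -> float:
    """Second derivative of ell with respect to theta."""
    return -n_heads / theta**2 - n_tails / (1.0 - theta) ** 2


n_heads, n_tails = 7, 3  # hypothetical counts with both values positive
theta_hat = n_heads / (n_heads + n_tails)

slope = d_log_likelihood(theta_hat, n_heads, n_tails)       # step 1: ~0 up to rounding
curvature = d2_log_likelihood(theta_hat, n_heads, n_tails)  # step 2: negative

print(theta_hat, slope, curvature)
# Nearby points should have a strictly smaller log likelihood.
print(log_likelihood(theta_hat, n_heads, n_tails)
      > log_likelihood(theta_hat + 0.05, n_heads, n_tails))
```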

**Remark**: If we didn't take the log, we could still proceed with the same calculus tools but immediately the derivatives get messy due to the product rule of derivatives! Note though that it's not always the case that taking the log makes sense, as you'll see in one of the problems. 

## I.2— Bias and Variance of an estimator

We continue off the example of estimating the probability of heads $\theta$ for a coin.

Note that there are different ways in which one can compute an estimate $\widehat{\theta }$. In 6.008.1x, we mainly use maximum likelihood estimation, and as we'll see later, we also use MAP estimation.

### I.2.1— Bias of an estimator & bias of the MLE

For the coin case, the maximum likelihood (ML) estimate is $\widehat{\theta } = \frac{n_{\text {heads}}}n$. Note that $\widehat{\theta }$ is a *function of the training data* $X^{(1)},\dots ,X^{(n)}$. Before conditioning on a specific observed value of the training data, the ML estimate $\widehat{\theta }$ is a random variable since $n_{\text {heads}} \sim \text {Binomial}(n, \theta )$; sometimes, to make this explicit, we write $\widehat{\theta }(X^{(1)},\dots ,X^{(n)})$, making the dependence on the training data clear.

In particular,
$$
\widehat{\theta }(X^{(1)},\dots ,X^{(n)}) = \frac{n_{\text {heads}}}n = \frac1n \sum _{i=1}^ n \mathbf{1}\{  X^{(i)} = \text {heads} \} .
$$

We can compute the expectation of the MLE, $\mathbb {E}[\widehat{\theta }(X^{(1)},\dots ,X^{(n)})]$, as a function of $\theta$, the true unknown parameter, as follows.

We have
$$
\widehat{\theta }(X^{(1)},\dots ,X^{(n)}) = \frac1n \sum _{i=1}^ n \mathbf{1}\{  X^{(i)} = \text {heads} \} ,
$$
so by linearity of expectation,
$$
\mathbb {E}[\widehat{\theta }(X^{(1)},\dots ,X^{(n)})] = \frac1n \sum _{i=1}^ n \mathbb {E}[\mathbf{1}\{  X^{(i)} = \text {heads} \} ] = \frac1n \cdot n \cdot \theta = \boxed {\theta }.
$$

How do we tell how good an estimate $\widehat{\theta }(X^{(1)},\dots ,X^{(n)})$ for $\theta$ is?

For example, if we had an estimate $\widehat{\theta }(X^{(1)},\dots ,X^{(n)})$ that was just always 0 regardless of the training data we collect, then intuitively such an estimate for $\theta$ would probably be quite awful.

The *bias* of an estimate $\widehat{\theta }$ for a parameter $\theta$ is
$$
\mathbb {E}[\widehat{\theta }(X^{(1)},\dots ,X^{(n)})] - \theta ,
$$
where we note that $\widehat{\theta }(X^{(1)},\dots ,X^{(n)})$ is a random variable.

We can easily see that the ML estimate for $\theta$ is unbiased, i.e., it has bias equal to 0:
$$
\mathbb {E}[\widehat{\theta }_{\text {ML}}] - \theta = \theta - \theta = 0.
$$

We can compare with the terrible estimator $\widehat{\theta }(X^{(1)},\dots ,X^{(n)}) = 0$ regardless of what the training data are.

Clearly the expectation of $\widehat{\theta }$ in this case is 0, so the bias is going to be $0 - \theta = \boxed {-\theta }$.
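To make the definition of bias concrete, the following sketch computes $\mathbb {E}[\widehat{\theta }]$ exactly by summing over the Binomial distribution of $n_{\text {heads}}$. The helper names and the values `n = 10`, `theta = 0.3` are illustrative assumptions, not part of the notes.

```python
from math import comb
from typing import Callable


def expected_value_of_estimator(
    estimator: Callable[[int, int], float], n: int, theta: float
) -> float:
    """E[estimator(n_heads, n)] where n_heads ~ Binomial(n, theta)."""
    return sum(
        comb(n, h) * theta**h * (1 - theta) ** (n - h) * estimator(h, n)
        for h in range(n + 1)
    )


def ml_estimator(n_heads: int, n: int) -> float:
    """Fraction of heads."""
    return n_heads / n


def always_zero_estimator(n_heads: int, n: int) -> float:
    """The 'terrible' estimator that ignores the data."""
    return 0.0


n, theta = 10, 0.3  # hypothetical sample size and true parameter

# Bias = E[estimate] - theta.
bias_ml = expected_value_of_estimator(ml_estimator, n, theta) - theta
bias_zero = expected_value_of_estimator(always_zero_estimator, n, theta) - theta

print(bias_ml)    # essentially 0, up to floating-point rounding
print(bias_zero)  # equals 0 - theta
```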

### I.2.2— Variance of an estimator & variance of the maximum likelihood (ML) estimate

The variance of an estimator $\widehat{\theta }$ is
$$
\text {var}(\widehat{\theta }) = \mathbb {E}[(\widehat{\theta } - \mathbb {E}[\widehat{\theta }])^2].
$$

Recall that for any random variable $Z$ and any constant $a \in \mathbb {R}$,
$$
\text {var}(a Z) = a^2 \text {var}(Z),
$$

and for random variables $Z_1, \dots , Z_ n$ that are i.i.d. each with the same distribution as $Z$,
$$
\text {var}\Big(\sum _{i=1}^ n Z_ i\Big) = n \text {var}(Z).
$$

We can thus compute the variance of the ML estimate $\widehat{\theta }(X^{(1)},\dots ,X^{(n)}) = \frac{n_{\text {heads}}}n$ for the probability of heads of the coin as follows.

We have
$$
\begin{eqnarray}
\text {var}(\widehat{\theta }_{\text {ML}}) & = & \text {var}\Big( \frac1n \sum _{i=1}^ n \mathbf{1}\{ X^{(i)} = \text {heads} \} \Big) \\
& = & \frac1{n^2} \text {var}\Big( \sum _{i=1}^ n \mathbf{1}\{ X^{(i)} = \text {heads} \} \Big) \\
& = & \frac1{n^2} \cdot n \cdot \text {var}(\mathbf{1}\{ X^{(1)} = \text {heads} \} ) \\
& = & \frac1{n^2} \cdot n \cdot \text {var}( \text {Ber}(\theta ) ) \\
& = & \frac1{n^2} \cdot n \cdot \theta (1-\theta ) \\
& = & \displaystyle \boxed {\frac{\theta (1-\theta )}{n}}.
\end{eqnarray}
$$

We can compare with the variance of the terrible estimator $\widehat{\theta }(X^{(1)},\dots ,X^{(n)}) = 0$ regardless of the training data:
$$
     \text {var}(\widehat{\theta }(X^{(1)},\dots ,X^{(n)})) = \text {var}(0) = \boxed {0}.
$$
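The formula $\theta (1-\theta )/n$ can be checked against a direct computation from the definition of variance. This is a small sketch under assumed values (`n = 10`, `theta = 0.3`); the function name `mle_variance_exact` is hypothetical.

```python
from math import comb


def mle_variance_exact(n: int, theta: float) -> float:
    """var(n_heads / n) computed from the definition, with n_heads ~ Binomial(n, theta)."""
    pmf = [comb(n, h) * theta**h * (1 - theta) ** (n - h) for h in range(n + 1)]
    mean = sum(p * h / n for h, p in enumerate(pmf))
    return sum(p * (h / n - mean) ** 2 for h, p in enumerate(pmf))


n, theta = 10, 0.3  # hypothetical sample size and true parameter

from_definition = mle_variance_exact(n, theta)
from_formula = theta * (1 - theta) / n

print(from_definition, from_formula)  # the two values should agree up to rounding
```

Note that the always-zero estimator has zero variance but nonzero bias, so low variance alone does not make an estimator good.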

## I.3— Practice Problem: The German Tank Problem

Suppose the Germans have $k$ tanks, numbered $1, \, 2, \, \dots , \, k$. The Allies observe $n \le k$ tanks with numbers $x^{(1)}, \, x^{(2)}, \, \dots , \, x^{(n)}$. Assume that these numbers are drawn uniformly without replacement from $\{ 1, \, 2, \, \dots , \, k \}$. The objective is to get an estimate $\hat{k}$ of the total number of German tanks.

### I.3.1— What is the ML estimate for $k$?

Keep in mind that each $x^{(i)}$ is a value in $\{ 1,\dots ,k\}$. Also, this is one of those maximum likelihood problems where taking the log isn't helpful.

There are ${k \choose n}$ possible choices of which tanks we observe. Each of these is equally likely, so the likelihood is
$$
    p_ X(x; k ) = \frac{\mathbf{1}\{ 1\le x^{(1)} \le k, 1 \le x^{(2)} \le k, \dots , 1 \le x^{(n)} \le k\} }{{k \choose n}}.
$$

The numerator says that each $x^{(i)}$ (i.e., the number the tank is labeled with) must be between $1$ and $k$, the maximum number.

Now notice that with $n$ treated as fixed, then to maximize the likelihood, we want the denominator to be as small as possible, which means that we want $k$ to be as small as possible. However, the smallest $k$ can be so that the numerator is nonzero is when $k = \max (x^{(1)}, \dots , x^{(n)})$. Thus, the ML estimator for $k$ is given by
$$
    \widehat{k}_{\text {ML}} = \max (x^{(1)}, \dots , x^{(n)}).
$$
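The argument above can be illustrated by evaluating the likelihood over a range of candidate values of $k$. In this sketch the observed tank numbers and the upper limit of 200 on candidates are made-up values, and `tank_likelihood` is a hypothetical helper name; the observed numbers are assumed to be distinct.

```python
from math import comb
from typing import Sequence


def tank_likelihood(k: int, observed: Sequence[int]) -> float:
    """Likelihood of k: 1 / C(k, n) if every observed number lies in 1..k, else 0."""
    n = len(observed)
    if min(observed) < 1 or max(observed) > k:
        return 0.0  # the indicator in the numerator is zero
    return 1.0 / comb(k, n)


observed = [19, 40, 42, 60]  # hypothetical, distinct tank numbers
candidates = range(1, 201)   # hypothetical search range for k

# The likelihood is 0 below max(observed) and decreasing in k from there on.
k_ml = max(candidates, key=lambda k: tank_likelihood(k, observed))

print(k_ml, max(observed))
```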

### I.3.2— What is the bias of the ML estimate $\widehat{k}_{\text {ML}}(X^{(1)},\dots ,X^{(n)})$ for $k$?

We want to evaluate the expectation of $\widehat{k}_\text {ML}(X^{(1)},\dots ,X^{(n)}) = \max (X^{(1)},X^{(2)},\dots ,X^{(n)})$.
    	 

The probability that the maximum tank number/label observed is $m$ is $\frac{\binom {m-1}{n-1}}{\binom {k}{n}}$ since there are $\binom {k}{n}$ possible ways to observe $n$ tanks, and $\binom {m-1}{n-1}$ ways in which we definitely picked the tank with number $m$ and among the tanks with smaller numbers (i.e., among $m-1$ tanks) we choose $n-1$ of them.

Then
$$
\begin{eqnarray}
    \mathbb {E}[\widehat{k}_\text {ML}(X^{(1)},\dots ,X^{(n)})] & = & \sum _{m=n}^{k} m \mathbb {P}(\text {max number observed is }m) \\
    &= & \sum _{m=n}^{k} m \frac{ \binom {m-1}{n-1}}{ \binom {k}{n}} = \frac{n (k+1)}{n+1}.
\end{eqnarray}
$$

The last step can be evaluated by calling [Wolfram Alpha](https://www.wolframalpha.com/input/?i=sum++m+choose+n+from+m%3Dn+to+k) or as follows:

$$
\begin{eqnarray}
\sum _{m=n}^{k} m \frac{ \binom {m-1}{n-1}}{ \binom {k}{n}}
&=& \frac{n}{\binom{k}{n}}  \sum _{m=n}^{k} \binom {m}{n} \\
&=& \frac{n}{\binom{k}{n}}  \left[ \binom {n}{n} + \binom {n+1}{n} + \cdots + \binom {k}{n}\right] \\
&=& \frac{n}{\binom{k}{n}}  \binom {k+1}{n+1} \\
&=& \frac{n (k+1)}{n+1}.
\end{eqnarray}
$$

Here the first equality uses $m\binom{m-1}{n-1} = n\binom{m}{n}$, and the third uses the hockey-stick identity $\sum _{m=n}^{k}\binom{m}{n} = \binom{k+1}{n+1}$.

Thus, the bias of the ML estimator for $k$ in this case is
$$
    \mathbb {E}[\widehat{k}_\text {ML}(X^{(1)},\dots ,X^{(n)})] - k = \frac{n (k+1)}{n+1} - k = \frac{n (k+1)}{n+1} - \frac{k (n+1)}{n+1} = \frac{n-k}{n+1},
$$
where as a reminder, note that $n \le k$. This means that the ML estimator on average guesses a value of $k$ that is lower than the true value.

This makes sense since if we just take the maximum of the numbers on the tanks we've seen, then unless we see the tank with number $k$, the maximum we observe is going to be smaller than the true maximum. Of course, if we see all the tanks, i.e., if $n=k$, then we definitely see the tank with max number $k$ so the bias in this case would be 0.
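For small values of $n$ and $k$, the expectation and bias can be verified exactly by enumerating every possible set of observed tanks. The sketch below uses exact rational arithmetic; the values `n = 3`, `k = 8` and the function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def expected_max_from_pmf(n: int, k: int) -> Fraction:
    """E[max] using P(max = m) = C(m-1, n-1) / C(k, n) for m = n, ..., k."""
    return sum(
        (Fraction(m * comb(m - 1, n - 1), comb(k, n)) for m in range(n, k + 1)),
        Fraction(0),
    )


def expected_max_brute_force(n: int, k: int) -> Fraction:
    """E[max] by enumerating every equally likely set of n tanks out of k."""
    subsets = list(combinations(range(1, k + 1), n))
    return Fraction(sum(max(s) for s in subsets), len(subsets))


n, k = 3, 8  # hypothetical small values so the enumeration stays cheap

closed_form = Fraction(n * (k + 1), n + 1)
bias = expected_max_from_pmf(n, k) - k

print(expected_max_from_pmf(n, k), expected_max_brute_force(n, k), closed_form)
print(bias, Fraction(n - k, n + 1))  # bias compared with (n - k) / (n + 1)
```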

### I.3.3—  An unbiased estimator for $k$ based on the maximum likelihood estimate $\widehat{k}_{\text {ML}}$ for $k$.

The bias is given by
$$
    \mathbb {E}[\widehat{k}_{\text {ML}}]-k=\frac{n-k}{n+1}=\frac{n}{n+1}-\frac{k}{n+1}.
$$

Thus,
$$
\mathbb {E}[\widehat{k}_{\text {ML}}]-\frac{n}{n+1} = \underbrace{\Big(1-\frac{1}{n+1}\Big)}_{\frac{n}{n+1}}k.
$$ 

So
$$
    k=\underbrace{\frac{n+1}{n}}_{1+\frac{1}{n}}\mathbb {E}[\widehat{k}_{\text {ML}}]-1.
$$ 

Thus, an estimator
$$
    \widehat{k}=\Big(1+\frac{1}{n}\Big)\widehat{k}_{\text {ML}}-1
$$

will be unbiased since, by linearity of expectation,
$$
    \mathbb {E}[\widehat{k}]=\Big(1+\frac{1}{n}\Big)\mathbb {E}[\widehat{k}_{\text {ML}}]-1=\frac{n+1}{n}\cdot \frac{n(k+1)}{n+1}-1=k.
$$   	
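As a check on the corrected estimator, this sketch averages $\widehat{k}$ over every equally likely set of observed tanks, using exact fractions. The values `n = 3`, `k = 8` and the sample observation are hypothetical, as are the function names.

```python
from fractions import Fraction
from itertools import combinations
from typing import Sequence


def unbiased_estimate(observed: Sequence[int]) -> Fraction:
    """k_hat = (1 + 1/n) * max(observed) - 1."""
    n = len(observed)
    return Fraction(n + 1, n) * max(observed) - 1


def expected_estimate(n: int, k: int) -> Fraction:
    """Average of k_hat over all C(k, n) equally likely observation sets."""
    subsets = list(combinations(range(1, k + 1), n))
    total = sum((unbiased_estimate(s) for s in subsets), Fraction(0))
    return total / len(subsets)


print(expected_estimate(3, 8))              # should equal the true k = 8
print(unbiased_estimate([19, 40, 42, 60]))  # estimate for one hypothetical observation
```

Unlike $\widehat{k}_{\text {ML}}$, this estimate can exceed the largest number observed and need not be an integer.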

## I.4— The Bayesian Approach to Learning Parameters

### I.4.1— Accounts for prior information

The Bayesian approach is an alternative approach to learning parameters that accounts for prior information we may have on what a parameter should be.

We saw previously that the maximum likelihood estimate for the probability of heads was just the fraction of heads that we see in the training data. For example, if our training data consisted of the sequence heads, tails, heads, then the maximum likelihood estimate for the probability of heads would be 2/3. But we might say, "There are only three tosses! That's not enough data to claim that the probability of heads is actually 2/3." The coin could still be fair, for example. As a more extreme example, if we only see one toss and it comes up heads, then the maximum likelihood estimate would be 1, because the fraction of heads is one.

An alternative approach to learning parameters is the Bayesian approach.

Here, instead of treating the parameter as some fixed unknown constant, we treat the parameter $\theta$ as a *random variable*. So rather than writing $p_X(\cdot;\theta)$, we now write $p_{X\mid\Theta}(\cdot\mid\theta)$, where $\Theta$ is a random variable. Now we can impose some prior beliefs about what the parameter should be by choosing a function $p_\Theta(\theta)$ that can be thought of as a non-negative weight for how much we prefer this specific choice $\theta$ before observing any training data.

Technical note: $p_\Theta(\theta)$ here is not like the probability tables we have seen so far in the course because $\Theta$ has an alphabet that is the closed interval $[0, 1]$. This alphabet is neither finite nor countably infinite. $\Theta$ is actually what is called a *continuous random variable*. For the purposes of this class though, we don't need to know the specifics of continuous random variables. It's enough to think of $p_\Theta(\theta)$ as just a non-negative weight given to a specific parameter choice $\theta$.

Previously, we maximized
$$
\widehat{\theta }_{\text {ML}} = \arg\max_{\theta\in[0,1]} L(\theta)
$$

In the Bayesian approach, we will instead compute
$$
\arg\max_{\theta\in[0,1]} p_{X^{(1)},\dots,X^{(n)}\mid\Theta}(x^{(1)},\dots,x^{(n)}\mid\theta)\,p_\Theta(\theta),
$$
which is actually equal to
$$
\arg\max_{\theta\in[0,1]} \underbrace{
\frac{ p_{X^{(1)},\dots,X^{(n)}\mid\Theta}(x^{(1)},\dots,x^{(n)}\mid\theta)\,
p_\Theta(\theta)}
{p_{X^{(1)},\dots,X^{(n)}}(x^{(1)},\dots,x^{(n)})}}
_{p_{\Theta\mid X^{(1)},\dots,X^{(n)}}(\theta\mid x^{(1)},\dots,x^{(n)})},
$$
since the denominator does not depend on $\theta$ and therefore does not change which $\theta$ achieves the maximum.

This $\arg\max$ is thus the *maximum a posteriori* (MAP) estimate of $\Theta$, denoted $\widehat{\theta }_{\text {MAP}}$.

We can see the effect in two examples:

- When there is a reasonable amount of training data relative to the prior:
![MAP_estimate1](./images/MAP_estimate1.png)
- When there is less training data:
![MAP_estimate2](./images/MAP_estimate2.png)

This Bayesian approach to learning parameters thus allows us to account for how much training data we have. If we don't have that much training data, we trust our prior more. If we have tons of training data, then we side with the data more and our answer will look more like the ML estimate.
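The trade-off between prior and data can be seen in a small grid-search sketch of the MAP estimate for the coin. The prior weight $p_\Theta(\theta) \propto \theta (1-\theta )$ used here is purely an assumption for illustration (it mildly prefers values near 1/2), and the function names and counts are hypothetical.

```python
from typing import Callable


def map_estimate_on_grid(
    n_heads: int,
    n_tails: int,
    prior_weight: Callable[[float], float],
    grid_size: int = 1000,
) -> float:
    """Maximize likelihood(theta) * prior_weight(theta) over a grid on [0, 1]."""
    grid = [i / grid_size for i in range(grid_size + 1)]

    def unnormalized_posterior(theta: float) -> float:
        return theta**n_heads * (1 - theta) ** n_tails * prior_weight(theta)

    return max(grid, key=unnormalized_posterior)


def prior_weight(theta: float) -> float:
    """Assumed prior weight, largest at theta = 1/2 (not taken from the notes)."""
    return theta * (1 - theta)


# Same fraction of heads (2/3) but very different amounts of data.
map_small = map_estimate_on_grid(2, 1, prior_weight)
map_large = map_estimate_on_grid(200, 100, prior_weight)

print(map_small)  # pulled noticeably from 2/3 toward 1/2
print(map_large)  # much closer to the ML estimate 2/3
```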

### I.4.2— Conjugate prior, pseudo-observations

In the present example, a convenient choice of prior distribution $p(\theta)$ is one that is a *conjugate prior* to the likelihood distribution $p(x\mid\theta)$.

A prior distribution is a conjugate prior for a likelihood distribution if the posterior distribution $p(\theta\mid x)$ is in the same family as the prior distribution $p(\theta)$.

As a reminder, the posterior distribution is given by:
$$
p(\theta\mid x) = \frac{p(x\mid\theta)p(\theta)}{\int p(x\mid\theta')p(\theta')d\theta'}
$$

Let the likelihood function be considered fixed; the likelihood function is usually well-determined from a statement of the data-generating process. It is clear that different choices of the prior distribution $p(\theta)$ may make the integral more or less difficult to calculate, and the product $p(x\mid\theta)\times p(\theta)$ may take one algebraic form or another. For certain choices of the prior, the posterior has the same algebraic form as the prior (generally with different parameter values). Such a choice is a conjugate prior.

A conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior; otherwise numerical integration may be necessary. Further, conjugate priors may give intuition, by more transparently showing how a likelihood function updates a prior distribution.

#### I.4.2.1— Example: Binomial distribution

The binomial distribution has a probability mass function (PMF):
$$
p_ S(s) = {n \choose s} q^ s (1-q)^{n-s}.
$$

We can deduce a shape for a conjugate prior by expressing the PMF as a function of $q$: $f(q)\propto q^a(1-q)^b$ for some constants $a$ and $b$ (in the binomial PMF above, $a = s$ and $b = n - s$, so $a + b$ is the number of trials $n$). Generally we would multiply this function by a normalizing constant ensuring that the function is a probability distribution, i.e., that its integral over the entire range is 1. This factor will often be a function of $a$ and $b$, but never of $q$, because $q$ is the variable being integrated over.

##### Beta distribution

In fact, the usual conjugate prior is the beta distribution (not to be confused with the beta *function*) with parameters ( $\alpha , \beta$ ):
$$
p(q)={q^{\alpha -1}(1-q)^{\beta -1} \over \mathrm {B} (\alpha ,\beta )}
$$
where $\alpha$  and $\beta$  are chosen to reflect any existing belief or information ( $\alpha  = 1$ and $ \beta  = 1$ would give a uniform distribution).

$B( \alpha ,  \beta )$ is the Beta function acting as a normalising constant. 

In this context, $\alpha$  and $\beta$  are called *hyperparameters* (parameters of the prior), to distinguish them from parameters of the underlying model (here $q$).

It is a typical characteristic of conjugate priors that the dimensionality of the hyperparameters is one greater than that of the parameters of the original distribution. If all parameters are scalar values, then this means that there will be one more hyperparameter than parameter; but this also applies to vector-valued and matrix-valued parameters.

##### Pseudo-observations
It is often useful to think of the hyperparameters of a conjugate prior distribution as corresponding to having observed a certain number of *pseudo-observations* with properties specified by the parameters.

For example, the values $\alpha$  and $\beta$  of a beta distribution can be thought of as corresponding to $\alpha -1$ successes and $\beta -1$ failures if the posterior mode is used to choose an optimal parameter setting, or $\alpha$ successes and $\beta$ failures if the posterior mean is used to choose an optimal parameter setting.

In general, for nearly all conjugate prior distributions, the hyperparameters can be interpreted in terms of pseudo-observations. This can help both in providing an intuition behind the often messy update equations, as well as to help choose reasonable hyperparameters for a prior.
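To illustrate the pseudo-observation reading for the beta prior, here is a minimal sketch. It relies on the standard conjugate update for a beta prior with binomial data, in which a $\text {Beta}(\alpha ,\beta )$ prior combined with $s$ successes and $f$ failures gives a $\text {Beta}(\alpha +s,\beta +f)$ posterior. The function names and the numeric values are hypothetical.

```python
from typing import Tuple


def beta_posterior(
    alpha: float, beta: float, successes: int, failures: int
) -> Tuple[float, float]:
    """Conjugate update: Beta(alpha, beta) prior -> Beta(alpha + s, beta + f) posterior."""
    return alpha + successes, beta + failures


def beta_mode(alpha: float, beta: float) -> float:
    """Mode of Beta(alpha, beta); this formula assumes alpha > 1 and beta > 1."""
    return (alpha - 1) / (alpha + beta - 2)


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of Beta(alpha, beta)."""
    return alpha / (alpha + beta)


# Hypothetical prior Beta(3, 3) and data: 2 heads (successes), 1 tail (failure).
post_alpha, post_beta = beta_posterior(3, 3, successes=2, failures=1)

# Posterior mode: the prior acts like alpha - 1 = 2 extra heads and beta - 1 = 2 extra tails.
print(beta_mode(post_alpha, post_beta), (2 + 2) / (3 + 2 + 2))
# Posterior mean: the prior acts like alpha = 3 extra heads and beta = 3 extra tails.
print(beta_mean(post_alpha, post_beta), (2 + 3) / (3 + 3 + 3))
```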

##### Beta function
The *Beta function* is defined by
$$
\mathrm {B} (x,y)=\int _{0}^{1}t^{x-1}(1-t)^{y-1}\,\mathrm {d} t.
$$
The Beta function is symmetric, meaning that
$$
\mathrm {B} (x,y)=\mathrm {B} (y,x).
$$
A key property of the Beta function is its relationship to the Gamma function:
$$
\mathrm {B} (x,y)={\dfrac {\Gamma (x)\,\Gamma (y)}{\Gamma (x+y)}}.
$$
When $x$ and $y$ are positive integers, it follows from the definition of the gamma function $\Gamma$ that:
$$
\mathrm {B} (x,y)={\dfrac {(x-1)!\,(y-1)!}{(x+y-1)!}}.
$$

Just as the gamma function for integers describes factorials, the beta function can define a binomial coefficient after adjusting indices:

$$
{n \choose k}={\frac {1}{(n+1)\mathrm {B} (n-k+1,k+1)}}.
$$

##### Gamma function
The gamma function $\Gamma$ is an extension of the *factorial function*, with its argument shifted down by 1, to real and complex numbers.

That is, if $n$ is a positive integer:
$$
\Gamma (n)=(n-1)!.
$$

#### I.4.2.2— Example: Categorical distribution
The categorical distribution can have its PMF expressed using the Iverson bracket as:
$$
f(x|{\boldsymbol {p}})=\prod _{i=1}^{k}p_{i}^{[x=i]},
$$
where $[x=i]$ evaluates to 1 if $x=i$, 0 otherwise.

Another formulation makes explicit the connection between the categorical and multinomial distributions by treating the categorical distribution as a special case of the multinomial distribution in which the parameter $n$ of the multinomial distribution (the number of sampled items) is fixed at 1.

In this formulation, the sample space can be considered to be the set of 1-of-K encoded random vectors $x$ of dimension $k$ having the property that exactly one element has the value 1 and the others have the value 0.

The particular element having the value 1 indicates which category has been chosen. The probability mass function f in this formulation is:

$$
f(\mathbf {x} |{\boldsymbol {p}})=\prod _{i=1}^{k}p_{i}^{x_{i}},
$$
where $ p_{i}$ represents the probability of seeing element $i$ and $\textstyle {\sum _{i}p_{i}=1}$. This is the formulation adopted by Bishop.

The conjugate prior of the categorical distribution is then the *Dirichlet distribution*.

##### Dirichlet distribution
The probability density function (PDF) of the Dirichlet distribution is given by:
$$
f\left(x_{1},\cdots ,x_{K};\alpha _{1},\cdots ,\alpha _{K}\right)={\frac {1}{\mathrm {B} ({\boldsymbol {\alpha }})}}\prod _{i=1}^{K}x_{i}^{\alpha _{i}-1},
$$
on the open $(K-1)$-dimensional simplex defined by:
$$
\begin{aligned}
&x_{1},\cdots ,x_{K-1}>0 \\
&x_{1}+\cdots +x_{K-1}<1 \\
&x_{K}=1-x_{1}-\cdots -x_{K-1}
\end{aligned}
$$
and zero elsewhere.

The normalizing constant is the *multivariate Beta function*, which can be expressed in terms of the gamma function:
$$
\mathrm {B} ({\boldsymbol {\alpha }}) = {\frac {\prod _{i=1}^{K}\Gamma (\alpha _{i})}{\Gamma \left(\sum _{i=1}^{K}\alpha _{i}\right)}},\qquad {\boldsymbol {\alpha }}=(\alpha _{1},\cdots ,\alpha _{K}).
$$
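To make the formula concrete, here is a minimal sketch that evaluates the log of the Dirichlet density using log-gamma functions, which avoids overflow for large $\alpha_i$. The function names are hypothetical, and the simplex check is a simple tolerance-based test rather than anything rigorous.

```python
import math


def log_multivariate_beta(alpha):
    """log B(alpha) = sum_i log Gamma(alpha_i) - log Gamma(sum_i alpha_i)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def dirichlet_log_pdf(x, alpha):
    """Log density of Dirichlet(alpha) at x; -inf if x is off the open simplex."""
    if len(x) != len(alpha):
        raise ValueError("x and alpha must have the same length")
    if any(xi <= 0 for xi in x) or not math.isclose(sum(x), 1.0):
        return float("-inf")  # the density is zero outside the open simplex
    log_kernel = sum((a - 1) * math.log(xi) for a, xi in zip(alpha, x))
    return log_kernel - log_multivariate_beta(alpha)


# Dirichlet(1, 1, 1) is uniform on the simplex with constant density
# 1 / B(1, 1, 1) = Gamma(3) = 2.
uniform_density = math.exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [1, 1, 1]))

# With K = 2 the Dirichlet is a Beta distribution:
# Beta(2, 3) at x = 0.5 has density 0.5 * 0.5**2 / B(2, 3) = 0.125 * 12 = 1.5.
beta_density = math.exp(dirichlet_log_pdf([0.5, 0.5], [2, 3]))
```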

## I.5— Example: Maximum Likelihood Estimation of a Poisson Parameter

We say that random variable $X$ follows a Poisson distribution with parameter $\lambda$ if its PMF is given by
$$
p_ X(x) = \frac{\lambda ^ x e^{-\lambda }}{x!}\; ,
$$
with support over alphabet $\{0,1,2,\ldots \}$.

Suppose we observe $X^{(1)},\ldots ,X^{(n)}$, which are all independent Poisson random variables with the same parameter $\lambda$.

How can we compute the maximum likelihood estimate $\hat{\lambda }$ for $\lambda$, given that we observe the sequence $X^{(1)}=x^{(1)},\ldots ,X^{(n)}=x^{(n)}$?

We begin by computing the likelihood:
$$
p_{X^{(1)}, \ldots , X^{(n)}}(x^{(1)}, \ldots , x^{(n)}) = \prod _{i=1}^ n p_ X(x^{(i)}) = \prod _{i=1}^ n \frac{\lambda ^{x^{(i)}} e^{-\lambda }}{x^{(i)}!}\; .
$$

The log-likelihood is then
$$
\log p_{X^{(1)}, \ldots , X^{(n)}}(x^{(1)}, \ldots , x^{(n)}) = \left(\sum _{i=1}^ n x^{(i)}\right)\log \lambda -n\lambda -\sum _{i=1}^ n\log (x^{(i)}!).
$$

Differentiating with respect to $\lambda$ gives $\frac{1}{\lambda }\sum _{i=1}^ n x^{(i)}-n$ (the factorial term does not depend on $\lambda$). Setting this equal to 0, we find
$$
\hat{\lambda } = \frac{1}{n}\sum _{i=1}^ n x^{(i)}\; ,
$$

i.e., the mean of the training data's values.

For example, with observed values 5, 10, 14, 42, the mean of these is $\hat{\lambda } = \frac{1}{4}(5 + 10 + 14 + 42) = 17.75$.
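The same computation can be sketched in a few lines of Python. The function names are hypothetical; `math.lgamma(x + 1)` is used for $\log (x!)$. The comparison at the end is only a numerical sanity check that the sample mean beats two nearby values of $\lambda$, not a proof of optimality.

```python
import math


def poisson_log_likelihood(lam, data):
    """Log-likelihood of i.i.d. Poisson(lam) observations."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in data)


def poisson_mle(data):
    """Maximum likelihood estimate of lambda: the sample mean."""
    return sum(data) / len(data)


data = [5, 10, 14, 42]
lam_hat = poisson_mle(data)
print(lam_hat)  # 17.75

# Sanity check: the log-likelihood at lam_hat exceeds that at nearby values.
best = poisson_log_likelihood(lam_hat, data)
beats_neighbors = (best > poisson_log_likelihood(17.0, data)
                   and best > poisson_log_likelihood(18.5, data))
```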
 
If instead we are given several candidate values for $\lambda$, for instance $\lambda_1$ and $\lambda_2$, we can choose among them simply by taking the candidate with the largest likelihood.

From the previous part, we know that the log-likelihood for the data is either
$$
\log p_{X^{(1)}, \ldots , X^{(n)}}(x^{(1)}, \ldots , x^{(n)}) = \left(\sum _{i=1}^ n x^{(i)}\right)\log \lambda _1-n\lambda _1 -\sum _{i=1}^ n\log (x^{(i)}!),
$$
or
$$
\log p_{X^{(1)}, \ldots , X^{(n)}}(x^{(1)}, \ldots , x^{(n)}) = \left(\sum _{i=1}^ n x^{(i)}\right)\log \lambda _2-n\lambda _2 -\sum _{i=1}^ n\log (x^{(i)}!).
$$
The maximum-likelihood rule says to choose whichever candidate makes the observed data more likely. Therefore, when the first expression is greater than the second, we choose the distribution parametrized by $\lambda _1$, and otherwise we choose the distribution parametrized by $\lambda _2$.
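Because the factorial term does not depend on $\lambda$, it cancels when the two log-likelihoods are compared. As a short worked example, take the four observations from before (so $n=4$ and $\sum _i x^{(i)}=71$) and two hypothetical candidates $\lambda _1=15$ and $\lambda _2=20$, chosen only for illustration. Writing $\mathcal{L}(\lambda )$ for the log-likelihood as a function of $\lambda$:

$$
\begin{aligned}
\mathcal{L}(\lambda _1)-\mathcal{L}(\lambda _2)
&=\left(\sum _{i=1}^ n x^{(i)}\right)\log \frac{\lambda _1}{\lambda _2}-n(\lambda _1-\lambda _2)\\
&=71\log \frac{15}{20}-4(15-20)\approx -20.43+20=-0.43<0.
\end{aligned}
$$

The difference is negative, so in this example the maximum-likelihood rule picks $\lambda _2=20$.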


# II— Parameter Learning

## II.1— THE NAIVE BAYES CLASSIFIER: INTRODUCTION

Previously, we saw how maximum likelihood estimation works for some simple cases. Now, we look at how it works for a more elaborate setup, specifically in the problem of email spam detection. Note that in this section, for simplicity, in showing the maximum likelihood estimates, we will just be setting derivatives equal to 0 without checking second derivatives and boundaries.

We want to build a classifier that, given an email, classifies it as either spam or ham. How do we go about doing the classification? There are many ways to classify data.

Today, we're going to talk about one such way called the *naive Bayes classifier*, which uses a simple classification model and has two algorithms to go with it:
- the first algorithm learns the naive Bayes model parameters from training data, and
- the second algorithm, given the parameters learned, predicts whether a new email is spam or ham.

Suppose we have $n$ training emails. Email $i$ has a known label $c^{(i)}\in \{ \text {spam},\text {ham}\}$. Also, from email $i$, we extract $J$ features $y^{(i)}=(y_{1}^{(i)},y_{2}^{(i)},\dots ,y_{J}^{(i)})$. In particular, for simplicity, we shall assume that each $y_{j}^{(i)}\in \{ 0,1\}$ indicates the presence of the $j$-th word in some dictionary of $J$ words.

Of course, you could use much fancier features such as vector-valued features or even features with non-numerical values, but we'll stick to 0's and 1's that indicate presence of certain words.

For example, maybe the first word in the dictionary is “viagra”, in which case $y_{1}^{(i)}$ is 1 if email $i$ contains the word “viagra” and 0 otherwise. In summary, our training data consists of $y^{(1)},y^{(2)},\dots ,y^{(n)}\in \{ 0,1\} ^{J}$ with respective labels $c^{(1)},c^{(2)},\dots ,c^{(n)}\in \{ \text {spam},\text {ham}\}$.

Next, we specify a probabilistic model that explains how an email is hypothetically generated:

1. Sample a random label $C$ that is equal to spam with probability $s$ and equal to ham with probability $1-s$. (For example, you could flip a coin with probability of heads $s$, and if the coin comes up heads, you assign $C=\text {spam}$, and otherwise you assign $C=\text {ham}$.)

2. For $j=1,2,\dots ,J$:
   
   If $C=\text {spam}$: sample $Y_{j}\sim \text {Ber}(q_{j})$    
   
   If $C=\text {ham}$: sample $Y_{j}\sim \text {Ber}(p_{j})$.

Note that the chance of a word from the dictionary occurring depends on whether the email is spam or ham, much like how you'd imagine that if you see the word “viagra” in an email, the email's probably spam rather than ham.

The above recipe for generating features for a hypothetical email is called a *generative process*, and its corresponding graphical model is as follows:
![images_naive-bayes](./images/images_naive-bayes.png)

**Important observation**: The $Y_{j}$'s are conditionally independent given $C$. This assumption is not actually true for real emails, since certain words are more likely to co-occur. Also, the model does not account for the ordering of the words or for whether a word occurs multiple times. Although the naive Bayes classifier makes these assumptions, in practice it is often applied to data that violates them, and its performance is often still quite good! To quote statistician George Box, “All models are wrong but some are useful.”
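The generative process can be simulated directly. The sketch below is one possible rendering with Python's standard `random` module; the function name `sample_email` and all parameter values are made up for illustration.

```python
import random


def sample_email(s, p, q, rng):
    """Sample (label, features) from the naive Bayes generative process.

    s: probability that the label is spam
    p: per-word probabilities of appearing given ham
    q: per-word probabilities of appearing given spam
    """
    # Step 1: sample the label C.
    label = "spam" if rng.random() < s else "ham"
    # Step 2: sample each Y_j independently given C.
    word_probs = q if label == "spam" else p
    features = [1 if rng.random() < prob else 0 for prob in word_probs]
    return label, features


rng = random.Random(0)                           # fixed seed for reproducibility
s, p, q = 0.3, [0.05, 0.5, 0.2], [0.6, 0.4, 0.1]  # hypothetical parameters
emails = [sample_email(s, p, q, rng) for _ in range(5)]
```

Each word indicator is drawn without looking at the other words, which is exactly the conditional independence assumption of the model.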

We need to estimate the parameters
$$
\theta =\{ s,p_{1},p_{2},\dots ,p_{J},q_{1},q_{2},\dots ,q_{J}\}.
$$

We shall assume that our training data $(c^{(i)},y^{(i)})$ for each $i$ are generated i.i.d. according to the above generative process.

Then, to learn the parameters, we find $\theta$ that maximizes the likelihood, i.e.:
$$
\widehat{\theta } = \arg \max _\theta \prod _{i=1}^ n p_{C, Y_1, \dots , Y_ J}(c^{(i)}, y_1^{(i)}, \dots , y_ J^{(i)}; \theta ).
$$

## II.2— THE NAIVE BAYES CLASSIFIER: TRAINING 

Let's find the log likelihood for our email spam detection setup.

From the generation process and the graphical model assumed, we can write:
$$
\begin{eqnarray}
p_C(c; \theta)
&=& s^{\mathbf{1}\{c = \text{spam}\}} (1-s)^{1 - \mathbf{1}\{c = \text{spam}\}}
\qquad\text{for }c\in\{\text{spam}, \text{ham}\} \\
p_{Y_j \mid C}(y_j \mid \text{ham}; \theta)
&=& p_j^{y_j} (1-p_j)^{1 - y_j}
\qquad\text{for }y_j\in\{0, 1\} \\
p_{Y_j \mid C}(y_j \mid \text{spam}; \theta)
&=& q_j^{y_j} (1-q_j)^{1 - y_j}
\qquad\text{for }y_j\in\{0, 1\}
\end{eqnarray}
$$

The log likelihood is then given by
$$
\begin{eqnarray}
&&\log\left(\prod_{i=1}^{n}p_{C,Y_{1},\dots,Y_{J}}(c^{(i)},y_{1}^{(i)},\dots,y_{J}^{(i)};\theta)\right)\nonumber \\
&&=\log\left(\prod_{i=1}^{n}\left[p_{C}(c^{(i)};\theta)\prod_{j=1}^{J}p_{Y_{j}|C}(y_{j}^{(i)}|c^{(i)};\theta)\right]\right)\nonumber \\
&&=\sum_{i=1}^{n}\left[\log p_{C}(c^{(i)};\theta)+\sum_{j=1}^{J}\log p_{Y_{j}|C}(y_{j}^{(i)}|c^{(i)};\theta)\right]\nonumber \\
&&=\underbrace{\sum_{i=1}^{n}\log p_{C}(c^{(i)};\theta)}_{\text{(*)}}+
\underbrace{\sum_{i=1}^{n}\sum_{j=1}^{J}\log p_{Y_{j}|C}(y_{j}^{(i)}|c^{(i)};\theta)}_{\text{(**)}}.
\end{eqnarray}
$$
We next simplify the expressions (\*) and (\**).

First let's simplify term (\*):
$$
\begin{eqnarray}
\text{(*)} &=&\sum_{i=1}^{n}\log p_{C}(c^{(i)};\theta) \\
&=&\sum_{i=1}^{n}\left[\mathbf{1}\{c^{(i)}=``\text{spam}"\}\log s+\mathbf{1}\{c^{(i)}=``\text{ham}"\}\log(1-s)\right]\\
&=&\left[\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{spam}"\}\right]\log s+\left[\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{ham}"\}\right]\log(1-s)\\
&\triangleq& f(s).
\end{eqnarray}
$$
Next, we simplify (\**), splitting it up as to decouple $p_{j}$ and $q_{j}$. To do this, we can split the summation over $i$ into two sums, one accounting for all the ham emails and one accounting for all the spam emails:
$$
\begin{eqnarray}
&&\text{(**)} \\
&&=\sum_{i=1}^{n}\sum_{j=1}^{J}\log p_{Y_{j}|C}(y_{j}^{(i)}|c^{(i)};\theta)\\
&&=\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{ham}"\}\sum_{j=1}^{J}\log p_{Y_{j}|C}(y_{j}^{(i)}|c^{(i)};\theta)\\
&&\quad+\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{spam}"\}\sum_{j=1}^{J}\log p_{Y_{j}|C}(y_{j}^{(i)}|c^{(i)};\theta)\\
&&=\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{ham}"\}\sum_{j=1}^{J}\left[y_{j}^{(i)}\log p_{j}+(1-y_{j}^{(i)})\log(1-p_{j})\right]\\
&&\quad+\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{spam}"\}\sum_{j=1}^{J}\left[y_{j}^{(i)}\log q_{j}+(1-y_{j}^{(i)})\log(1-q_{j})\right] \\
&&=\sum_{j=1}^{J}\underbrace{\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{ham}"\}\left[y_{j}^{(i)}\log p_{j}+(1-y_{j}^{(i)})\log(1-p_{j})\right]}_{\triangleq g_{j}(p_{j})}\\
&&\quad+\sum_{j=1}^{J}\underbrace{\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{spam}"\}\left[y_{j}^{(i)}\log q_{j}+(1-y_{j}^{(i)})\log(1-q_{j})\right]}_{\triangleq h_{j}(q_{j})}.
\end{eqnarray}
$$
In summary:
$$
\begin{eqnarray}
f(s) &=&\left[\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{spam}"\}\right]\log s+\left[\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{ham}"\}\right]\log(1-s),\\
g_{j}(p_{j}) &=&\left[\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{ham}"\}y_{j}^{(i)}\right]\log p_{j}+\left[\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{ham}"\}(1-y_{j}^{(i)})\right]\log(1-p_{j}),\\
h_{j}(q_{j}) &=&\left[\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{spam}"\}y_{j}^{(i)}\right]\log q_{j}+\left[\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{spam}"\}(1-y_{j}^{(i)})\right]\log(1-q_{j}).
\end{eqnarray}
$$
Setting derivatives to 0:

The ML estimate for $s$ is
$\widehat{s}=\arg \max _{s\in [0,1]}f(s)$, which occurs when $\frac{df}{ds}=0$.

We know that for positive constants $A$ and $B$,
$$
\frac{d}{dt}\left\{  A\log t+B\log (1-t)\right\}  =0\qquad \text {when}\qquad t=\frac{A}{A+B}
$$
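To see where this comes from, the derivative can be worked out in one line (a short derivation, valid for $0<t<1$):

$$
\frac{d}{dt}\left\{ A\log t+B\log (1-t)\right\} =\frac{A}{t}-\frac{B}{1-t}=0\quad \Longleftrightarrow \quad A(1-t)=Bt\quad \Longleftrightarrow \quad t=\frac{A}{A+B}.
$$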

Thus, the function
$$
f(s)=\underbrace{\left[\sum _{i=1}^{n}\mathbf{1}\{ c^{(i)}=``\text {spam}"\} \right]}_{A}\log s+\underbrace{\left[\sum _{i=1}^{n}\mathbf{1}\{ c^{(i)}=``\text {ham}"\} \right]}_{B}\log (1-s)
$$
has derivative equal to 0 when
$$
\widehat{s}=\frac{A}{A+B}=\frac{\sum _{i=1}^{n}\mathbf{1}\{ c^{(i)}=``\text {spam}"\} }{\sum _{i=1}^{n}\mathbf{1}\{ c^{(i)}=``\text {spam}"\} +\sum _{i=1}^{n}\mathbf{1}\{ c^{(i)}=``\text {ham}"\} }=\frac{\sum _{i=1}^{n}\mathbf{1}\{ c^{(i)}=``\text {spam}"\} }{n}.
$$
This result is intuitive: it's the number of emails labeled “spam” divided by the total number of emails.

The ML estimate for $p_{j}$ is $\widehat{p}_{j}=\arg \max _{p_{j}\in [0,1]}g_{j}(p_{j})$, which occurs when $\frac{dg_{j}}{dp_{j}}=0$.
Again, the function
$$
g_{j}(p_{j})=\underbrace{\left[\sum _{i=1}^{n}\mathbf{1}\{ c^{(i)}=``\text {ham}"\} y_{j}^{(i)}\right]}_{A}\log p_{j}+\underbrace{\left[\sum _{i=1}^{n}\mathbf{1}\{ c^{(i)}=``\text {ham}"\} (1-y_{j}^{(i)})\right]}_{B}\log (1-p_{j})
$$
has derivative equal to 0 when
$$
\begin{eqnarray}
\widehat{p}_{j} & =&\frac{A}{A+B}\\
 & =&\frac{\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{ham}"\}y_{j}^{(i)}}{\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{ham}"\}y_{j}^{(i)}+\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{ham}"\}(1-y_{j}^{(i)})}\\
 & =&\frac{\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{ham}"\}y_{j}^{(i)}}{\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{ham}"\}}.
\end{eqnarray}
$$
This result is also intuitive: it's the number of emails labeled “ham” in which word $j$ occurred, divided by the total number of emails labeled “ham”.

Finally, by pattern-matching, the ML estimate for $q_{j}$ is
$$
\widehat{q}_{j}=\frac{\sum _{i=1}^{n}\mathbf{1}\{ c^{(i)}=``\text {spam}"\} y_{j}^{(i)}}{\sum _{i=1}^{n}\mathbf{1}\{ c^{(i)}=``\text {spam}"\} }.
 $$
Wonderful, now we can write up an algorithm that computes all those ML estimates above. Once we learn the parameters $\theta$, we can treat them as fixed and start doing prediction.
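A minimal sketch of such a training algorithm follows. The function name `train_naive_bayes`, the data layout (a list of labels and a list of 0/1 feature lists), and the toy data are all assumptions made for this example.

```python
def train_naive_bayes(labels, features):
    """Maximum likelihood estimates (s_hat, p_hat, q_hat) for the spam model.

    labels:   list of "spam"/"ham" strings, one per training email
    features: list of length-J lists of 0/1 word indicators
    Assumes at least one spam and one ham email, so no denominator is zero.
    """
    n = len(labels)
    num_words = len(features[0])
    n_spam = sum(1 for c in labels if c == "spam")
    n_ham = n - n_spam

    # s_hat: fraction of training emails labeled spam.
    s_hat = n_spam / n
    # p_hat[j]: fraction of ham emails that contain word j.
    p_hat = [
        sum(y[j] for c, y in zip(labels, features) if c == "ham") / n_ham
        for j in range(num_words)
    ]
    # q_hat[j]: fraction of spam emails that contain word j.
    q_hat = [
        sum(y[j] for c, y in zip(labels, features) if c == "spam") / n_spam
        for j in range(num_words)
    ]
    return s_hat, p_hat, q_hat


# Toy data: 2 spam and 3 ham emails over a 2-word dictionary.
labels = ["spam", "spam", "ham", "ham", "ham"]
features = [[1, 0], [1, 1], [0, 1], [0, 0], [0, 1]]
s_hat, p_hat, q_hat = train_naive_bayes(labels, features)
```

On this toy data, `p_hat[0]` comes out as exactly 0 because word 0 never appears in a ham email, and `q_hat[0]` is exactly 1 because it appears in every spam email.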


## II.3— THE NAIVE BAYES CLASSIFIER: LAPLACE SMOOTHING

Assume that a particular word, say word 1, did not appear in any of the training data.
Now for the email we're trying to predict the label for, suppose that we observe that $y_{1}=1$. We would have learned that $p_{1}=q_{1}=0$ (we're leaving hats off to keep notation from getting cluttered).

Marginalizing the learned model over the label $C$, we see that
$$
p_{Y_{1},\dots ,Y_{J}}(y_{1},\dots ,y_{J})=s\prod _{j=1}^{J}q_{j}^{y_{j}}(1-q_{j})^{1-y_{j}}+(1-s)\prod _{j=1}^{J}p_{j}^{y_{j}}(1-p_{j})^{1-y_{j}}=0
$$
since $y_{1}=1$ while $p_{1}=q_{1}=0$.

In other words, the observation is impossible given the model learned!

Hence, computing $\mathbb {P}(C=\text {spam}|Y_{1}=1,Y_{2}=y_{2},\dots ,Y_{J}=y_{J})$ doesn't make sense since we're conditioning on an event that has 0 probability.

Thus, ML estimates aren't robust in this case to words that we don't encounter in training data.

One way to resolve this is to take a Bayesian approach to parameter estimation (we saw this earlier when we put a prior on the bias of a coin) and, in particular, introduce *pseudocounts*. For example, we could set:
$$
\begin{eqnarray*}
\widehat{s} & =&\frac{\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{spam}"\}+1}{n+2},\\
\widehat{p}_{j} & =&\frac{\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{ham}"\}y_{j}^{(i)}+1}{\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{ham}"\}+2},\\
\widehat{q}_{j} & =&\frac{\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{spam}"\}y_{j}^{(i)}+1}{\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{spam}"\}+2}.
\end{eqnarray*}
$$
This ensures that none of the parameters are estimated as 0 or 1.

Note that introducing a pseudocount for each possible outcome has a special name: *Laplace smoothing*, also called *additive smoothing*. For $\widehat{s}$, this means introducing 1 pseudocount for spam and 1 pseudocount for ham. For $\widehat{p}_{j}$, this means introducing 1 pseudocount for word $j$ appearing in ham and 1 pseudocount for word $j$ not appearing in ham. You could, of course, also introduce $\ell$ pseudocounts for each possible outcome instead of just one, i.e.,
$$
\begin{eqnarray*}
\widehat{s} & =&\frac{\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{spam}"\}+\ell}{n+2\ell},\\
\widehat{p}_{j} & =&\frac{\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{ham}"\}y_{j}^{(i)}+\ell}{\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{ham}"\}+2\ell},\\
\widehat{q}_{j} & =&\frac{\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{spam}"\}y_{j}^{(i)}+\ell}{\sum_{i=1}^{n}\mathbf{1}\{c^{(i)}=``\text{spam}"\}+2\ell}.
\end{eqnarray*}
$$
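All three smoothed estimates share the same form, (count + $\ell$) / (total + 2$\ell$), so one small helper covers them. This is a sketch with a hypothetical function name and made-up counts.

```python
def smoothed_estimate(count, total, pseudocount=1.0):
    """Smoothed estimate (count + l) / (total + 2 l) for a binary outcome.

    With pseudocount=0 this reduces to the plain ML estimate count / total.
    """
    return (count + pseudocount) / (total + 2 * pseudocount)


# A word that never appears in 3 ham emails: ML gives 0, smoothing gives 1/5.
p_unseen = smoothed_estimate(count=0, total=3)

# A word that appears in both of 2 spam emails: ML gives 1, smoothing gives 3/4.
q_always = smoothed_estimate(count=2, total=2)

# Larger pseudocounts pull the estimate further toward 1/2.
p_heavier = smoothed_estimate(count=0, total=3, pseudocount=5.0)  # 5/13
```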

## II.4— THE NAIVE BAYES CLASSIFIER: PREDICTION

Once we learn the parameters $\hat{\theta}$, we can treat them as fixed and start doing prediction.

Let's now look at classifying whether a new email that's not in our training data is spam or ham. This new email has random, unobserved label $C$, which we would like to infer, but we only get to see its features $Y_{1} = y_1,Y_{2} = y_2,\dots ,Y_{J} = y_ J$.

Assuming that $\theta$ is known and fixed, we can figure out what the MAP estimate for label $C$ is given $Y_{1} = y_1,Y_{2} = y_2,\dots ,Y_{J} = y_ J$.

Unleashing Bayes' rule,
$$
\begin{eqnarray*}
 &&p_{C|Y_{1},\dots,Y_{J}}(``\text{spam}"|y_{1},\dots,y_{J})\\
 &&=\frac{p_{C}(``\text{spam}")p_{Y_{1},\dots,Y_{J}|C}(y_{1},\dots,y_{J}|``\text{spam}")}{p_{Y_{1},\dots,Y_{J}}(y_{1},\dots,y_{J})}\\
 &&=\frac{p_{C}(``\text{spam}")p_{Y_{1},\dots,Y_{J}|C}(y_{1},\dots,y_{J}|``\text{spam}")}{p_{C}(``\text{spam}")p_{Y_{1},\dots,Y_{J}|C}(y_{1},\dots,y_{J}|``\text{spam}")+p_{C}(``\text{ham}")p_{Y_{1},\dots,Y_{J}|C}(y_{1},\dots,y_{J}|``\text{ham}")} \\
 &&=\frac{p_{C}(``\text{spam}")\prod_{j=1}^{J}p_{Y_{j}|C}(y_{j}|``\text{spam}")}{p_{C}(``\text{spam}")\prod_{j=1}^{J}p_{Y_{j}|C}(y_{j}|``\text{spam}")+p_{C}(``\text{ham}")\prod_{j=1}^{J}p_{Y_{j}|C}(y_{j}|``\text{ham}")}\\
 &&=\frac{s\prod_{j=1}^{J}q_{j}^{y_{j}}(1-q_{j})^{1-y_{j}}}{s\prod_{j=1}^{J}q_{j}^{y_{j}}(1-q_{j})^{1-y_{j}}+(1-s)\prod_{j=1}^{J}p_{j}^{y_{j}}(1-p_{j})^{1-y_{j}}},
\end{eqnarray*}
$$

where for simplicity we've dropped the hats on the parameters even though the parameter values we use are estimated from training data.

Of course,
$$
\begin{eqnarray}
&&p_{C|Y_{1},\dots,Y_{J}}(``\text{ham}"|y_{1},\dots,y_{J}) \\
&&= 1 - p_{C|Y_{1},\dots,Y_{J}}(``\text{spam}"|y_{1},\dots,y_{J}) \\
&&=
\frac{(1-s)\prod_{j=1}^{J}p_{j}^{y_{j}}(1-p_{j})^{1-y_{j}}}{s\prod_{j=1}^{J}q_{j}^{y_{j}}(1-q_{j})^{1-y_{j}}+(1-s)\prod_{j=1}^{J}p_{j}^{y_{j}}(1-p_{j})^{1-y_{j}}}.
\end{eqnarray}
$$
The MAP estimate for $C$ is
$$
\begin{eqnarray}
\widehat{C}_{\text{MAP}}
&=&
\begin{cases}
``\text{spam}"
& \text{if }
p_{C|Y_{1},\dots,Y_{J}}(``\text{spam}"|y_{1},\dots,y_{J})
\ge p_{C|Y_{1},\dots,Y_{J}}(``\text{ham}"|y_{1},\dots,y_{J}) \\
``\text{ham}"
& \text{otherwise}.
\end{cases}
\end{eqnarray}
$$

Note that here we're breaking ties in favor of spam. The above is equivalent to looking at whether the odds ratio
$$
\frac{p_{C|Y_{1},\dots ,Y_{J}}(``\text {spam}"|y_{1},\dots ,y_{J})}{p_{C|Y_{1},\dots ,Y_{J}}(``\text {ham}"|y_{1},\dots ,y_{J})}
$$
is at least 1, or whether the log odds ratio
$$
\log \frac{p_{C|Y_{1},\dots ,Y_{J}}(``\text {spam}"|y_{1},\dots ,y_{J})}{p_{C|Y_{1},\dots ,Y_{J}}(``\text {ham}"|y_{1},\dots ,y_{J})}
$$
is at least 0.

In practice the log odds ratio can be much more numerically stable to compute since, pushing in the log, we end up taking sums and differences of log probabilities rather than multiplying a large number of probabilities.
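Here is a minimal sketch of prediction via the log odds ratio. The function names and the parameter values are hypothetical, and the code assumes every parameter lies strictly between 0 and 1 (for instance because the estimates were smoothed), so that every logarithm is finite.

```python
import math


def log_odds_spam(y, s, p, q):
    """log [ P(spam | y) / P(ham | y) ] under the naive Bayes model.

    Assumes 0 < s < 1 and 0 < p_j, q_j < 1 for all j.
    """
    total = math.log(s) - math.log(1 - s)
    for y_j, p_j, q_j in zip(y, p, q):
        if y_j == 1:
            total += math.log(q_j) - math.log(p_j)
        else:
            total += math.log(1 - q_j) - math.log(1 - p_j)
    return total


def predict(y, s, p, q):
    """MAP label, breaking ties in favor of spam."""
    return "spam" if log_odds_spam(y, s, p, q) >= 0 else "ham"


s, p, q = 0.4, [0.2, 0.6], [0.75, 0.5]   # hypothetical learned parameters
label_a = predict([1, 0], s, p, q)       # odds 0.15 / 0.048 = 3.125, so spam
label_b = predict([0, 1], s, p, q)       # odds 0.05 / 0.288 < 1, so ham
```

The posterior probability of spam can be recovered from the log odds $z$ as $1/(1+e^{-z})$ if a probability rather than a hard label is needed.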

## II.5— Generalizing to Trees


This section generalizes the ideas we've seen with parameter learning for a biased coin and for naive Bayes, focusing on maximum likelihood (so, no Laplace smoothing now). In particular, we look at learning parameters for a general finite random variable (of which estimating the bias of a coin is a special case) and learning all the potential tables in a tree-structured graphical model (of which naive Bayes is a special case).

For the finite random variable case, our derivation will revisit information measures from earlier in the course: entropy and information divergence. We'll see that maximum likelihood estimation relates to what we've seen earlier in the course on histograms – frequencies in which we see outcomes occur.

Learning all the potential tables in a tree-structured graphical model can be phrased in terms of learning parameters for a collection of finite random variables.

Unlike the bulk of our previous coverage of parameter learning, we won't be taking derivatives in this section!

### II.5.1— Parameter Learning for a Finite Random Variable

Let's build off of our coin tossing example, where we had an underlying distribution $p_{X}$ with parameter $\theta$, the probability of heads.

We now explicitly introduce two parameters $\theta _{\text {heads}}$ and $\theta _{\text {tails}}$ so that we can extend to a random variable with a larger alphabet.

In the more general case when there are many outcomes, it will be easier to write out a parameter for every value a random variable can take on rather than a parameter for all but one of them. Thus, using the same representation trick we saw earlier:
$$
\begin{eqnarray*}
p_{X}(x;\theta)&=&\theta_{\text{heads}}^{\mathbf{1}\{x=\text{heads}\}}\theta_{\text{tails}}^{\mathbf{1}\{x=\text{tails}\}}=\begin{cases}
\theta_{\text{heads}} & \text{if }x=\text{heads},\\
\theta_{\text{tails}} & \text{if }x=\text{tails}.
\end{cases}
\end{eqnarray*}
$$

In particular, in the general case when random variable $X$ has alphabet $\mathcal{X}$, then
$$
p_{X}(x;\theta )=\prod _{a\in \mathcal{X}}\theta _{a}^{\mathbf{1}\{ x=a\} }.
$$
	 

(In the coin tossing case, $\mathcal{X}=\{ \text {heads},\text {tails}\}$.)

The training data $X^{(1)},\dots ,X^{(n)}$ are again assumed to be drawn i.i.d. from the distribution $p_{X}(\cdot ;\theta )$, so that the likelihood for observed data $X^{(1)}=x^{(1)},\dots ,X^{(n)}=x^{(n)}$ is
$$
\prod _{i=1}^{n}p_{X}(x^{(i)};\theta )=\prod _{i=1}^{n}\bigg\{ \prod _{a\in \mathcal{X}}\theta _{a}^{\mathbf{1}\{ x^{(i)}=a\} }\bigg\} .
$$	 

The log likelihood is thus
$$
\log \prod _{i=1}^{n}p_{X}(x^{(i)};\theta )=\log \prod _{i=1}^{n}\bigg\{ \prod _{a\in \mathcal{X}}\theta _{a}^{\mathbf{1}\{ x^{(i)}=a\} }\bigg\} =\sum _{i=1}^{n}\sum _{a\in \mathcal{X}}\mathbf{1}\{ x^{(i)}=a\} \log \theta _{a}.
$$ 

Now we exchange the ordering of the summations on the right-hand side to get
$$
\text {log likelihood}=\sum _{a\in \mathcal{X}}\bigg[\sum _{i=1}^{n}\mathbf{1}\{ x^{(i)}=a\} \bigg]\log \theta _{a}.
$$ 

Note that the maximum likelihood estimate $\widehat{\theta }$ of all the parameters $\theta$ (now $\theta$ has a parameter $\theta _{a}$ for each possible value $a\in \mathcal{X}$) is precisely the solution to the following constrained optimization problem:
$$
\begin{eqnarray*}
 &  & \widehat{\theta}=\arg\max_{\theta = \{ \theta_a \text{ for }a\in\mathcal{X} \}}\bigg\{\sum_{a\in\mathcal{X}}\bigg[\sum_{i=1}^{n}\mathbf{1}\{x^{(i)}=a\}\bigg]\log\theta_{a}\bigg\}\\
 &  & \qquad\text{subject to }\sum_{a\in\mathcal{X}}\theta_{a}=1,\text{ and }\theta_{a}\ge0\text{ for all }a\in\mathcal{X}.
\end{eqnarray*}
$$

(Technical detail: If we didn't have the inequality constraints, then it's possible to solve this using a single Lagrange multiplier to enforce the equality constraint that $\sum _{a\in \mathcal{X}}\theta _{a}=1$. If you do this, you will in fact get the correct answer, but it's not straightforward showing why it is definitely the unique global maximum.)

Let's see how information-theoretic measures can help us. Let $\widehat{p}_{X}$ refer to the empirical distribution (this is the histogram of frequencies) of $X$ that we get from our training data $x^{(1)},\dots ,x^{(n)}$, so that $\widehat{p}_{X}(x)$ is precisely the fraction of times we saw $x$ in the training data:
$$
\widehat{p}_{X}(x)=\frac{1}{n}\sum _{i=1}^{n}\mathbf{1}\{ x^{(i)}=x\} .
$$

Then the log likelihood is
$$
\begin{eqnarray*}
 &  & \log\text{likelihood}\\
 &  & =\sum_{a\in\mathcal{X}}\underbrace{\bigg[\sum_{i=1}^{n}\mathbf{1}\{x^{(i)}=a\}\bigg]}_{n\cdot\widehat{p}_{X}(a)}\log\underbrace{\theta_{a}}_{p_{X}(a;\theta)}\\
 &  & =\sum_{a\in\mathcal{X}}n\cdot\widehat{p}_{X}(a)\log p_{X}(a;\theta)\\
 &  & =\sum_{a\in\mathcal{X}}n\cdot\widehat{p}_{X}(a)\log\Big(p_{X}(a;\theta)\frac{\widehat{p}_{X}(a)}{\widehat{p}_{X}(a)}\Big)\\
 &  & =\sum_{a\in\mathcal{X}}n\cdot\widehat{p}_{X}(a)\Big(\log\widehat{p}_{X}(a)+\log\frac{p_{X}(a;\theta)}{\widehat{p}_{X}(a)}\Big)\\
 &  & =n\bigg[\sum_{a\in\mathcal{X}}\widehat{p}_{X}(a)\log\widehat{p}_{X}(a)+\sum_{a\in\mathcal{X}}\widehat{p}_{X}(a)\log\frac{p_{X}(a;\theta)}{\widehat{p}_{X}(a)}\bigg]\\
 &  & =n\big[-H(\widehat{p}_{X})-D\big(\widehat{p}_{X}\parallel p_{X}(\cdot;\theta)\big)\big]\\
 &  & =-n\big[H(\widehat{p}_{X})+D\big(\widehat{p}_{X}\parallel p_{X}(\cdot;\theta)\big)\big].
\end{eqnarray*}
$$

Note that here we are using natural log with entropy and information divergence instead of log base 2 — this is actually commonly done and just results in the overall quantity being scaled by a constant (since $\log _2 x = \frac{\log x}{\log 2}$). Recall that when we used log base 2, we measured information content in terms of *bits*. When using natural log, information content is measured in terms of what are called *nats*.

Importantly, the only expression here that depends on $\theta$ is the divergence, so (because of the minus sign in the last expression) maximizing the likelihood is the same as minimizing the divergence $D\big (\widehat{p}_{X}\parallel p_{X}(\cdot ;\theta )\big )$. Since we treat our training data as fixed, $\widehat{p}_{X}$ is a fixed distribution, and so minimizing the divergence means choosing a distribution $p_{X}(\cdot ;\theta )$. But we know from Gibbs' inequality what the best choice of $p_{X}(\cdot ;\theta )$ is! The divergence is always nonnegative, and it attains its smallest possible value, 0, exactly when $p_{X}(\cdot ;\theta )$ is set equal to the empirical distribution $\widehat{p}_{X}$, so we set
$$
\underbrace{p_{X}(a;\theta )}_{\theta _{a}}=\widehat{p}_{X}(a)\qquad \text {for all }a\in \mathcal{X}.
$$ 

So the maximum likelihood estimate $\widehat{\theta }_{a}$ for parameter $\theta _{a}$ is the fraction of times we see $a$ in the training data $x^{(1)},\dots ,x^{(n)}$.

Note that Gibbs' inequality tells us that we cannot do better.

The unique global maximum is to choose $\widehat{\theta }_{a}=\widehat{p}_{X}(a)$ for all $a\in \mathcal{X}$. There is no need to check second derivatives, boundaries, etc. (Technical note: When we have more than a single variable, justifying whether a point is a local maximum in general requires more work than in the single variable calculus case since there could be saddle points. However, in this case, Gibbs' inequality tells us that there's a unique maximum.) 
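The decomposition of the log likelihood into entropy and divergence can be checked numerically. The sketch below uses a made-up sample over the alphabet {a, b, c} and an arbitrary candidate $\theta$; all function names are hypothetical, and natural logarithms are used throughout, so the quantities are in nats.

```python
import math
from collections import Counter


def empirical_distribution(samples):
    """Fraction of times each symbol appears in the samples."""
    n = len(samples)
    return {a: count / n for a, count in Counter(samples).items()}


def log_likelihood(samples, theta):
    return sum(math.log(theta[x]) for x in samples)


def entropy(p):
    return -sum(pa * math.log(pa) for pa in p.values() if pa > 0)


def divergence(p, q):
    """Information divergence D(p || q), assuming q[a] > 0 wherever p[a] > 0."""
    return sum(pa * math.log(pa / q[a]) for a, pa in p.items() if pa > 0)


samples = ["a", "b", "a", "c", "a", "b"]
n = len(samples)
p_hat = empirical_distribution(samples)      # a: 1/2, b: 1/3, c: 1/6
theta = {"a": 0.4, "b": 0.4, "c": 0.2}       # an arbitrary candidate

# Identity: log likelihood = -n [ H(p_hat) + D(p_hat || p(.; theta)) ].
lhs = log_likelihood(samples, theta)
rhs = -n * (entropy(p_hat) + divergence(p_hat, theta))

# The empirical distribution itself achieves a log likelihood at least as large.
ll_at_p_hat = log_likelihood(samples, p_hat)
```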

### II.5.2— Parameter Learning for an Undirected Tree-Structured Graphical Model

We now look at how to learn potential tables for a tree-structured graphical model. Recall that for the exact same distribution, the way in which we specify potential tables is not unique (i.e., there could be multiple ways to specify potential tables that represent the same distribution). This is fine. We can still learn potential tables systematically with maximum likelihood. In fact, we did this already when we trained a naive Bayes classifier!

#### II.5.2.1— Naïve Bayes classifier

Let's build some intuition from the naive Bayes classifier case (again, we stick to maximum likelihood here — no Laplace smoothing). The graphical model was as follows:
![images_naive-bayes](./images/images_naive-bayes.png)

How we learned the parameters was that we treated node $C$ as the root node. Then the factorization was
$$
\underbrace{p_{C}(c;\theta )}_{\text {1 potential table}}\underbrace{\prod _{j=1}^{J}p_{Y_{j}\mid C}(y_{j}\mid c;\theta )}_{J\text { potential tables}}.
$$

Each of the potential tables is actually specified in terms of different parameters. In particular, $p_{C}$ only depends on parameter $s$, and $p_{Y_{j}\mid C}$ only depends on $p_{j}$ and $q_{j}$.

What we did was we estimated $p_{C}(\cdot ;\theta )$ with the empirical distribution
$$
\widehat{p}_{C}(c)=\text {fraction of times in training data we see label }c=\frac{1}{n}\sum _{i=1}^{n}\mathbf{1}\{ c^{(i)}=c\} .
$$

When $c=\text {spam}$, we called this parameter $s$. Of course when $c=\text {ham}$, then this parameter is just $1-s$, which we didn't have to separately store.

We estimated $p_{Y_{j}\mid C}(\cdot \mid c;\theta )$ with the empirical conditional distribution
$$
\begin{eqnarray}
\widehat{p}_{Y_{j}\mid C}(y_{j}\mid c) 	& = 	& \text {fraction of times in training data we see }y_{j}\text { amongst emails with label }c 	 	 \\
& = 	& \frac{\sum _{i=1}^{n}\mathbf{1}\{ y_{j}^{(i)}=y_{j},c^{(i)}=c\} }{\sum _{i=1}^{n}\mathbf{1}\{ c^{(i)}=c\} }. 	 	 
\end{eqnarray}
$$

When $c=\text {ham}$, we called this parameter $p_{j}$. When $c=\text {spam}$, we called this parameter $q_{j}$.

#### II.5.2.2— General case of learning tree parameters

For the general case of learning maximum likelihood parameters for a given tree-structured graphical model, we can choose an arbitrary root node (which has a node potential table corresponding to the probability of the root node) and then treat the edge potential tables as conditional probability distributions. For the root node, we estimate its node potential table with the empirical distribution for just that node. For the other potential tables, we estimate them using empirical conditional distributions. Let me walk through a specific example. Consider the following graphical model:
![images_sec-graphical-models-five-node-example](./images/images_sec-graphical-models-five-node-example.png)

We assume we have access to training data
$$
\underbrace{X_{1}^{(1)},X_{2}^{(1)},X_{3}^{(1)},X_{4}^{(1)},X_{5}^{(1)}}_{\text {first training data point}},\quad \underbrace{X_{1}^{(2)},X_{2}^{(2)},X_{3}^{(2)},X_{4}^{(2)},X_{5}^{(2)}}_{\text {second training data point}},\quad \dots ,\quad \underbrace{X_{1}^{(n)},X_{2}^{(n)},X_{3}^{(n)},X_{4}^{(n)},X_{5}^{(n)}}_{n\text {-th training data point}}.
$$

The likelihood is
$$
\prod _{i=1}^{n}p_{X_{1},X_{2},X_{3},X_{4},X_{5}}(x_{1}^{(i)},x_{2}^{(i)},x_{3}^{(i)},x_{4}^{(i)},x_{5}^{(i)};\theta ).
$$ 

By choosing node 1 arbitrarily as the root, then we write $p_{X_{1},X_{2},X_{3},X_{4},X_{5}}(\cdot ;\theta )$ as
$$
\underbrace{p_{X_{1},X_{2},X_{3},X_{4},X_{5}}}_{\text {parameters }\theta }=\underbrace{p_{X_{1}}}_{\text {parameters }\theta _{1}}\underbrace{p_{X_{2}\mid X_{1}}}_{\theta _{2\mid 1}}\underbrace{p_{X_{3}\mid X_{1}}}_{\theta _{3\mid 1}}\underbrace{p_{X_{4}\mid X_{2}}}_{\theta _{4\mid 2}}\underbrace{p_{X_{5}\mid X_{2}}}_{\theta _{5\mid 2}},
$$

where we use $\theta _{1}$,$\theta _{2\mid 1}$,$\theta _{3\mid 1}$,$\theta _{4\mid 2}$,$\theta _{5\mid 2}$ to refer to parameters for each of the tables that we aim to learn. The notation here is going to be a bit messy. We'll use the notation
$$
\theta _{1;a}\triangleq p_{X_{1}}(a;\theta )
$$
and
$$
\theta _{2\mid 1;a\mid b}\triangleq p_{X_{2}\mid X_{1}}(a\mid b;\theta _{2\mid 1})
$$
and so forth.

We assume these parameters to be separate from each other (so that similar to the naive Bayes case, we can learn each of these tables separately). Then the log likelihood is
$$
\begin{eqnarray*}
 &  & \log\prod_{i=1}^{n}p_{X_{1},X_{2},X_{3},X_{4},X_{5}}(x_{1}^{(i)},x_{2}^{(i)},x_{3}^{(i)},x_{4}^{(i)},x_{5}^{(i)};\theta)\\
 &  & =\log\prod_{i=1}^{n}\big\{ p_{X_{1}}(x_{1}^{(i)};\theta_{1})p_{X_{2}\mid X_{1}}(x_{2}^{(i)}\mid x_{1}^{(i)};\theta_{2\mid1})p_{X_{3}\mid X_{1}}(x_{3}^{(i)}\mid x_{1}^{(i)};\theta_{3\mid1}) \\
 &  & \qquad\qquad\;\; p_{X_{4}\mid X_{2}}(x_{4}^{(i)}\mid x_{2}^{(i)};\theta_{4\mid2})p_{X_{5}\mid X_{2}}(x_{5}^{(i)}\mid x_{2}^{(i)};\theta_{5\mid2})\big\}\\
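 &  & =\underbrace{\sum_{i=1}^{n}\log p_{X_{1}}(x_{1}^{(i)};\theta_{1})}_{\text{log likelihood for }X_{1}\text{ with parameter }\theta_{1}}\\
 &  & \quad+\sum_{a}\underbrace{\sum_{i=1}^{n}\mathbf{1}\{x_{1}^{(i)}=a\}\log p_{X_{2}\mid X_{1}}(x_{2}^{(i)}\mid a;\theta_{2\mid1})}_{\text{log likelihood for }X_{2}\mid X_{1}=a\text{ with parameter }\theta_{2\mid1}}\\
 &  & \quad+\sum_{a}\underbrace{\sum_{i=1}^{n}\mathbf{1}\{x_{1}^{(i)}=a\}\log p_{X_{3}\mid X_{1}}(x_{3}^{(i)}\mid a;\theta_{3\mid1})}_{\text{log likelihood for }X_{3}\mid X_{1}=a\text{ with parameter }\theta_{3\mid1}}\\
 &  & \quad+\sum_{a}\underbrace{\sum_{i=1}^{n}\mathbf{1}\{x_{2}^{(i)}=a\}\log p_{X_{4}\mid X_{2}}(x_{4}^{(i)}\mid a;\theta_{4\mid2})}_{\text{log likelihood for }X_{4}\mid X_{2}=a\text{ with parameter }\theta_{4\mid2}}\\
 &  & \quad+\sum_{a}\underbrace{\sum_{i=1}^{n}\mathbf{1}\{x_{2}^{(i)}=a\}\log p_{X_{5}\mid X_{2}}(x_{5}^{(i)}\mid a;\theta_{5\mid2})}_{\text{log likelihood for }X_{5}\mid X_{2}=a\text{ with parameter }\theta_{5\mid2}}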
\end{eqnarray*}
$$
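where each log likelihood term is obtained as explained in section II.5.1 (a double sum over the training examples and over the alphabet of the random variable in question), and an indicator such as $\mathbf{1}\{x_{1}^{(i)}=a\}$ signals that only the training examples $i$ in which the conditioning variable takes the value $a$ contribute to the corresponding term.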

To maximize the overall log likelihood, we maximize the log likelihood for each of the five tables separately. The root node corresponds to learning the parameters of a single finite random variable, whereas for each edge we learn the parameters of one finite random variable for every possible value that we condition on. In both cases, the maximum likelihood result for a finite random variable says that the ML estimate is the empirical distribution.

Thus, for the root node, we have
$$
\widehat{\theta }_{1;a}=\widehat{p}_{X_{1}}(a)=\frac{1}{n}\sum _{i=1}^{n}\mathbf{1}\{ x_{1}^{(i)}=a\} ,
$$

and for the conditional probability tables, we have
$$
\begin{eqnarray*}
\widehat{\theta}_{i\mid j;a\mid b} & = & \widehat{p}_{X_{i}\mid X_{j}}(a\mid b)=\frac{\sum_{\ell=1}^{n}\mathbf{1}\{x_{i}^{(\ell)}=a,x_{j}^{(\ell)}=b\}}{\sum_{\ell=1}^{n}\mathbf{1}\{x_{j}^{(\ell)}=b\}}.
\end{eqnarray*}
$$
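The two estimates above amount to counting and normalizing. Here is a minimal sketch in Python, assuming the training data is a list of tuples and the rooted tree is given as a parent map; the names `fit_tree_tables`, `samples`, and `parent`, as well as the toy data, are hypothetical and only for illustration (nodes are renumbered 0 to 4).

```python
from collections import Counter

def fit_tree_tables(samples, parent):
    """ML (empirical) tables for a rooted tree.

    samples: list of tuples, one value per node (nodes are 0-indexed).
    parent:  dict mapping each non-root node to its parent; the root is absent.
    Returns (root, root_table, cond_tables) with
      root_table[a]          = empirical p_{X_root}(a)
      cond_tables[j][(a, b)] = empirical p_{X_j | X_parent(j)}(a | b)
    Pairs (a, b) that never occur are simply absent (estimated probability 0).
    """
    n = len(samples)
    num_nodes = len(samples[0])
    (root,) = [v for v in range(num_nodes) if v not in parent]

    # Root table: fraction of examples in which the root takes each value.
    root_counts = Counter(x[root] for x in samples)
    root_table = {a: c / n for a, c in root_counts.items()}

    # Conditional tables: joint counts divided by counts of the parent value.
    cond_tables = {}
    for j, pj in parent.items():
        joint = Counter((x[j], x[pj]) for x in samples)
        marg = Counter(x[pj] for x in samples)
        cond_tables[j] = {(a, b): c / marg[b] for (a, b), c in joint.items()}
    return root, root_table, cond_tables

# Hypothetical toy data for a five-node tree with edges 0-1, 0-2, 1-3, 1-4.
samples = [
    (0, 0, 1, 0, 1),
    (0, 1, 1, 1, 1),
    (1, 1, 0, 1, 0),
    (1, 1, 0, 0, 0),
]
parent = {1: 0, 2: 0, 3: 1, 4: 1}
root, root_table, cond_tables = fit_tree_tables(samples, parent)
```

Note that a conditional table has no entry for a parent value $b$ that never appears in the training data, since the ratio above is then undefined.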

Note that if we chose the root node to be some other node, of course, the factorization we get would be different, which means that the tables we are learning will be different but will represent the same distribution. 

# III— Structure learning for an Undirected Tree-Structured Graphical Model: The Chow-Liu Algorithm

At this point we've covered parameter learning for undirected trees, where we assume we know the tree structure. But what if we don't know what tree to even use? We now look at an algorithm called the Chow-Liu algorithm that learns which tree to use from training data, again using maximum likelihood. Once more, information measures appear, where mutual information plays a pivotal role. Recall that mutual information tells us how far two random variables are from being independent. A key idea here is that to determine whether an edge between two random variables should be present in a graphical model, we can look at mutual information.

Note (November 14, 2016): While the Chow-Liu algorithm is quite simple to describe, its running time and correctness are a bit more involved than those of other algorithms we have encountered so far. I will post a little bit about this later this week. For those who want the gory details now, look up Kruskal's algorithm. Chow-Liu basically does a preprocessing step (computing empirical mutual information quantities) before just running Kruskal's algorithm. What gets a bit messy is how to efficiently check whether adding an edge creates a cycle; in addition, the proof of correctness uses induction on graphs. These details are beyond the scope of the course but, again, I'll post something about it later this week for those who are interested.



We have random variables $X_1,\dots,X_k$ that can be seen as nodes in a graphical model, but we don't know which edges should link these nodes. Restricting ourselves to trees: which tree should we use?

There is a huge number of possible trees: $k^{k-2}$ by Cayley's formula. This is super-exponential in $k$! For example, $k=10$ nodes already give $10^{8}$ possible trees.

But we have training data $x_1^{(i)},\dots,x_k^{(i)}$ for $i=1,\dots,n$.

We can thus select the tree using maximum likelihood: for each candidate tree $T$ we consider its best parameters $\theta_T$, and we pick the tree that achieves the highest likelihood:
$$
\hat T = \arg\max_T \left\lbrace\max_{\theta_T}\log\prod_{i=1}^n p_{X_1\dots X_k}(x_1^{(i)},\dots,x_k^{(i)};T;\theta_T)\right\rbrace
$$

Let's detail the inner maximization for a specific tree $T$. We root $T$ at an arbitrary node $r$ and write $\pi(j)$ for the parent of node $j$:
$$
\begin{eqnarray}
& \max_{\theta_T}\log\prod_{i=1}^n p_{X_1\dots X_k}(x_1^{(i)},\dots,x_k^{(i)};T;\theta_T) \\
& = \max_{\theta_T}\left\lbrace\log\prod_{i=1}^n
\left[p(x_r^{(i)})\prod_{j\neq r}p(x_j^{(i)}\mid x_{\pi(j)}^{(i)})\right]\right\rbrace\\
& = \max_{\theta_T}\left\lbrace\sum_{i=1}^n\log p(x_r^{(i)})+\sum_{j\neq r}\sum_{i=1}^n \log p(x_j^{(i)}\mid x_{\pi(j)}^{(i)})\right\rbrace
\end{eqnarray}
$$

We know that the best choice for $\theta_T$ is to replace each distribution $p$ with the corresponding empirical distribution $\hat p$ given by the training data. As a reminder, for the root,
$$
\hat p_{X_r}(a) = \frac{1}{n}\sum_{i=1}^n\mathbb{1}\{x_r^{(i)}=a\}
$$
and similarly for the conditional tables. We can thus rewrite the previous expression with the empirical distributions $\hat p$ in place of the distributions $p$.

We can go even further by splitting the sums over the values taken by each random variable:
$$
\sum_a\underbrace{\left\lbrace\sum_{i=1}^n\mathbb{1}\{x_r^{(i)}=a\}\right\rbrace}_{n\hat p_{X_r}(a)}\log \hat p_{X_r}(a)+\sum_{j\neq r}\sum_{a, b}\underbrace{\left\lbrace\sum_{i=1}^n \mathbb{1}\{x_j^{(i)}=a, x_{\pi(j)}^{(i)}=b\}\right\rbrace}_{n\hat p_{X_j, X_{\pi(j)}}(a, b)}\log \hat p_{X_j\mid X_{\pi(j)}}(a\mid b)
$$
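Using $\hat p_{X_j\mid X_{\pi(j)}}(a\mid b)=\hat p_{X_j, X_{\pi(j)}}(a, b)/\hat p_{X_{\pi(j)}}(b)$ and factoring out $n$, this can be rewritten as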
$$
n\left[\underbrace{\sum_a\hat p_{X_r}(a)\log \hat p_{X_r}(a)}_{-H(\hat p_{X_r})}+\sum_{j\neq r}\sum_{a, b}\hat p_{X_j, X_{\pi(j)}}(a, b)\log \frac{\hat p_{X_j, X_{\pi(j)}}(a, b)}{\hat p_{X_{\pi(j)}}(b)}\right]
$$
The second term can be rewritten in terms of mutual information and entropy by multiplying and dividing by $\hat p_{X_j}(a)$ inside the logarithm:
$$
\sum_{j\neq r}\sum_{a, b}\hat p_{X_j, X_{\pi(j)}}(a, b)\log \frac{\hat p_{X_j, X_{\pi(j)}}(a, b)\hat p_{X_j}(a)}{\hat p_{X_{\pi(j)}}(b)\hat p_{X_j}(a)} = \sum_{j\neq r}\left[\underbrace{D\left(\hat p_{X_j, X_{\pi(j)}}\vert\vert\hat p_{X_{\pi(j)}}\hat p_{X_j}\right)}_{\hat I(X_j; X_{\pi(j)})}+\underbrace{\sum_{a, b}\hat p_{X_j, X_{\pi(j)}}(a, b)\log\hat p_{X_j}(a)}_{-H(\hat p_{X_j})}\right]
$$
where $\hat I$ is the empirical mutual information.
Note that in the last double summation, summing $\hat p_{X_j, X_{\pi(j)}}(a, b)$ over $b$ yields the marginal $\hat p_{X_j}(a)$, so the term equals $\sum_a\hat p_{X_j}(a)\log\hat p_{X_j}(a)=-H(\hat p_{X_j})$.
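To make the empirical mutual information concrete, here is a minimal sketch that computes it from two columns of training data by counting. The function name `empirical_mutual_information` and the toy sequences are hypothetical; natural logarithms are assumed, so the result is in nats.

```python
import math
from collections import Counter

def empirical_mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two equal-length sequences.

    Computes sum_{a,b} p(a,b) * log( p(a,b) / (p(a) p(b)) ) where every p is
    an empirical distribution obtained by counting.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # Only pairs that actually occur are summed; unseen pairs contribute 0.
    # With counts c, px[a], py[b], the ratio inside the log is c*n / (px[a]*py[b]).
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )

# Two identical fair binary columns: the mutual information equals the entropy, log 2.
identical = empirical_mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
# Two empirically independent columns: the mutual information is 0.
independent = empirical_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

The two toy cases are the extremes: a column paired with an exact copy of itself, and a pair whose empirical joint distribution factorizes exactly.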

We can summarize the maximum log likelihood over the parameters for a specific tree $T$ as:
$$
\begin{eqnarray}
\max_{\theta_T}\log\prod_{i=1}^n p_{X_1\dots X_k}(x_1^{(i)},\dots,x_k^{(i)};T;\theta_T) & = &n\left[-\sum_{j\in V}H(\hat p_{X_j})+\sum_{j\neq r}\hat I(X_j; X_{\pi(j)})\right]\\
&= &n\left[-\sum_{j\in V}H(\hat p_{X_j})+\sum_{(i,j)\in E}\hat I(X_i; X_j)\right]
\end{eqnarray}
$$
where $V$ and $E$ denote the nodes and edges of $T$, and the last equality holds because mutual information is symmetric, so the orientation of each edge does not matter.
The entropy term depends only on the nodes, which are the same for every tree. To maximize the likelihood across the family of trees, we thus just need to maximize the second term, i.e., the sum of the empirical mutual information over the edges.
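The identity above can be checked numerically. The sketch below, on hypothetical toy data, computes the maximized log likelihood of a small rooted tree directly from the empirical tables and compares it with $n\left[-\sum_j H + \sum_{(i,j)} \hat I\right]$. All function names are illustrative assumptions, and natural logarithms are used throughout.

```python
import math
from collections import Counter

def entropy(xs):
    """Entropy (nats) of the empirical distribution of a sequence."""
    n = len(xs)
    return -sum((c / n) * math.log(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """Empirical mutual information (nats) between two sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in Counter(zip(xs, ys)).items())

def max_log_likelihood(samples, root, parent):
    """Log likelihood of the data under the ML (empirical) tables of a rooted tree."""
    n = len(samples)
    cols = list(zip(*samples))
    # Root term: sum over examples of log p_hat(x_root).
    total = sum(c * math.log(c / n) for c in Counter(cols[root]).values())
    # One term per non-root node: sum over examples of log p_hat(x_j | x_parent).
    for j, pj in parent.items():
        joint = Counter(zip(cols[j], cols[pj]))
        marg = Counter(cols[pj])
        total += sum(c * math.log(c / marg[b]) for (a, b), c in joint.items())
    return total

# Hypothetical data on three nodes; the tree is the chain 0 - 1 - 2 rooted at 0.
samples = [(0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 1, 0), (0, 0, 0)]
root, parent = 0, {1: 0, 2: 1}
cols = list(zip(*samples))
n = len(samples)

lhs = max_log_likelihood(samples, root, parent)
rhs = n * (-sum(entropy(c) for c in cols)
           + sum(mutual_information(cols[j], cols[pj]) for j, pj in parent.items()))
# Same undirected tree, rooted at node 2 instead.
lhs_rerooted = max_log_likelihood(samples, 2, {1: 2, 0: 1})
```

Here `lhs` and `rhs` agree up to floating-point error, and so does `lhs_rerooted`, which illustrates that the maximized likelihood depends only on the undirected edges and not on the choice of root.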

## Chow-Liu algorithm
1. Start with a graph with no edges
2. For all pairs $(i, j)$ with $i\neq j$, compute $\hat I(X_i; X_j)$
3. Sort the empirical mutual information values from highest to lowest
4. Going through the pairs in that order, starting from the pair with the highest empirical mutual information, add an edge for each pair, skipping any edge that would create a cycle

The Chow-Liu algorithm always terminates: there are finitely many pairs to go through, and since a tree with $k$ nodes has exactly $k-1$ edges, we can stop as soon as $k-1$ edges have been added.
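The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the names `empirical_mi` and `chow_liu_edges` and the toy data are hypothetical, and cycle detection uses a simple union-find structure, which is one common choice when running Kruskal's algorithm (an assumption here, not something the steps above prescribe).

```python
import math
from collections import Counter
from itertools import combinations

def empirical_mi(xs, ys):
    """Empirical mutual information (nats) between two sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in Counter(zip(xs, ys)).items())

def chow_liu_edges(samples):
    """Return k-1 tree edges chosen greedily by empirical mutual information."""
    cols = list(zip(*samples))
    k = len(cols)
    # Steps 2 and 3: score every pair, then sort from highest to lowest.
    scored = sorted(
        ((empirical_mi(cols[i], cols[j]), i, j)
         for i, j in combinations(range(k), 2)),
        key=lambda t: t[0],
        reverse=True,
    )
    # Union-find: leader[v] points towards the representative of v's component.
    leader = list(range(k))

    def find(v):
        while leader[v] != v:
            leader[v] = leader[leader[v]]  # path halving
            v = leader[v]
        return v

    # Steps 1 and 4: start with no edges; add each edge unless it closes a cycle.
    edges = []
    for _, i, j in scored:
        ri, rj = find(i), find(j)
        if ri != rj:  # i and j are in different components, so no cycle
            leader[ri] = rj
            edges.append((i, j))
            if len(edges) == k - 1:
                break
    return edges

# Hypothetical data: node 1 mostly copies node 0, and node 2 mostly copies node 1.
samples = (
    [(0, 0, 0)] * 4 + [(0, 0, 1)] + [(0, 1, 1)]
    + [(1, 1, 1)] * 4 + [(1, 1, 0)] + [(1, 0, 0)]
)
edges = chow_liu_edges(samples)
```

On this toy data the pairs (0, 1) and (1, 2) have higher empirical mutual information than the pair (0, 2), so the learned tree is the chain 0 - 1 - 2. When several pairs tie, any order among them gives a tree with the same total mutual information.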