# Maximum Entropy Modelling

import pods
import mlai
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Weather Forecast

How do we forecast tomorrow's weather?

- satellite
- history, or one's experience
- shoe

How do we handle information from various sources?

- **backoff:** choose one
    1. use satellite image by default
    2. rely on statistics (ie, past history) if 1. is not available
    3. try a shoe prediction if 2. is not available

- **interpolation:** weight each evidence, eg, $0.6\times\text{satellite} + 0.3\times\text{history} + 0.1\times\text{shoe}$

- **maximum entropy:** make a least biased decision

### Maximum Entropy

**Coin tossing**

Suppose a weighted coin has probability $h$ of coming up heads. The entropy of a single toss of the coin is given by
$$
    H(X) = h \log_2 \frac{1}{h} + (1 - h) \log_2 \frac{1}{1 - h}
$$

Entropy is maximised when the coin is fair (i.e., unbiased):
<img src="./figs/coin_toss.jpg" width="300" align="center">
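
The claim can be checked numerically. The following is a minimal sketch, assuming only NumPy; the helper name `binary_entropy` and the grid of 101 values of $h$ are illustrative choices, not part of the original notes.

```python
import numpy as np

def binary_entropy(h):
    """Entropy in bits of a single toss of a coin with P(heads) = h."""
    h = np.asarray(h, dtype=float)
    p = np.stack([h, 1.0 - h])
    # Replace zero probabilities by 1 inside the log: log2(1) = 0,
    # so those terms contribute nothing (the limit of p * log(1/p) as p -> 0).
    safe = np.where(p > 0, p, 1.0)
    return -(p * np.log2(safe)).sum(axis=0)

hs = np.linspace(0.0, 1.0, 101)      # candidate values of h (assumed grid)
H = binary_entropy(hs)
best_h = hs[np.argmax(H)]            # value of h with the largest entropy
print(best_h, H.max())
```

On this grid the largest entropy is found at $h = 0.5$, where $H(X) = 1$ bit.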

**Concept**

- model *all* that is known
- assume *nothing* that is unknown

**Principle**

- given a collection of facts, the **maximum entropy** method chooses a model that is *consistent with all facts*, but otherwise *as uniform as possible*

**Simple example**

We wish to estimate a joint probability distribution $p(x, y)$ where $x\in\{x_1, x_2\}$ and $y\in\{y_1, y_2\}$, given the constraints
\begin{align*}
    p(x_1, y_1) + p(x_2, y_1) &= 0.6 \\
    p(x_1, y_1) + p(x_1, y_2) + p(x_2, y_1) + p(x_2, y_2) &= 1
\end{align*}
Given these constraints, our objective is to maximise
$$
    H(X, Y) = \sum_{x\in\{x_1, x_2\}} \sum_{y\in\{y_1, y_2\}} p(x, y) \log\frac{1}{p(x, y)}
$$

One distribution that satisfies the constraints:

.     | $y_1$ | $y_2$ | total
------|-------|-------|-------
$x_1$ | 0.5   | 0.1   |
$x_2$ | 0.1   | 0.3   |
total | 0.6   |       | 1.0

The most uniform distribution that satisfies the constraints:

.     | $y_1$ | $y_2$ | total
------|-------|-------|-------
$x_1$ | 0.3   | 0.2   |
$x_2$ | 0.3   | 0.2   |
total | 0.6   |       | 1.0
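
To make "most uniform" concrete, the entropies of the two tables can be compared directly. This is a small sketch under stated assumptions: the names `entropy_bits`, `some_model` and `uniform_model` are hypothetical, and base-2 logarithms are assumed (the definition of $H(X, Y)$ above leaves the base open).

```python
import numpy as np

def entropy_bits(table):
    """Joint entropy (in bits) of a probability table; zero cells contribute 0."""
    p = np.asarray(table, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Rows are x1, x2; columns are y1, y2.
some_model = np.array([[0.5, 0.1],
                       [0.1, 0.3]])
uniform_model = np.array([[0.3, 0.2],
                          [0.3, 0.2]])

for name, table in [("some model", some_model), ("uniform model", uniform_model)]:
    # Both tables must satisfy the two constraints from the text.
    assert np.isclose(table[:, 0].sum(), 0.6)   # p(x1, y1) + p(x2, y1) = 0.6
    assert np.isclose(table.sum(), 1.0)         # total probability = 1
    print(name, entropy_bits(table))
```

Both tables satisfy the constraints, but the second has the larger entropy.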


### British Weather Forecast

**Problem**

First, we make the overly simplified assumption that five weather types, \{misty, foggy, cloudy, sunny, rainy\}, fully describe British weather. Now suppose that today's weather is \{cloudy\}; what will tomorrow's weather be?

**A single constraint**

Initially we have the *total probability* constraint:
$$
    p(misty) + p(foggy) + p(cloudy) + p(sunny) + p(rainy) = 1
$$
There are infinitely many combinations of probabilities that satisfy the constraint above.

The most intuitively appealing model is
\begin{align*}
    p(misty) &= 0.2 \\
    p(foggy) &= 0.2 \\
    p(cloudy) &= 0.2 \\
    p(sunny) &= 0.2 \\
    p(rainy) &= 0.2
\end{align*}
This model allocates the total probability (that is, 1) evenly among the five possible weather types. It is the most *uniform* model subject to our knowledge... but what exactly is meant by *uniform*?

Analytically, we wish to maximise the entropy $H(Y)$ given the total probability constraint. Using a Lagrange multiplier $\lambda_1$:
$$
    \Lambda = H(Y) + \lambda_1 \times \text{constraint}
    = \left( - \sum_{y\in {\cal Y}} p(y) \log p(y) \right) + \lambda_1 \left( \sum_{y\in {\cal Y}} p(y) - 1 \right)
$$
and take a partial derivative with respect to $p(y)$:
$$
    \frac{\partial\Lambda}{\partial p(y)} = - \log p(y) - 1 + \lambda_1
$$
Setting $\displaystyle \frac{\partial\Lambda}{\partial p(y)} = 0$ gives $p(y) = \exp(\lambda_1 - 1)$, the same value for every $y$; the total probability constraint then fixes $p(y) = 1/5 = 0.2$ for all $y$.

**Two constraints**

Suppose we have *two* constraints:
\begin{align*}
    p(misty) + p(foggy) + p(cloudy) + p(sunny) + p(rainy) &= 1 \\
    p(misty) + p(foggy) &= 0.3
\end{align*}
By observation, the most *uniform* model is
\begin{align*}
    p(misty) &= 0.15 \\
    p(foggy) &= 0.15 \\
    p(cloudy) &= 0.233... \\
    p(sunny) &= 0.233... \\
    p(rainy) &= 0.233...
\end{align*}

Analytically, we maximise $H(Y)$ given those two constraints:
\begin{align*}
    \Lambda = H(Y)
        & + \lambda_1 \{ p(misty) + p(foggy) + p(cloudy) + p(sunny) + p(rainy) - 1 \} \\
        & + \lambda_2 \{ p(misty) + p(foggy) - 0.3 \}
\end{align*}
and calculate partial derivatives:
$$
    \frac{\partial\Lambda}{\partial p(y)} = \left\{ \begin{array}{ll}
        - \log p(y) - 1 + \lambda_1 + \lambda_2	& y = misty, foggy \\
        - \log p(y) - 1 + \lambda_1		& \text{otherwise}
	\end{array} \right.
$$
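Setting $\displaystyle \frac{\partial\Lambda}{\partial p(y)} = 0$ gives $p(misty) = p(foggy) = \exp(\lambda_1 + \lambda_2 - 1)$ and $p(cloudy) = p(sunny) = p(rainy) = \exp(\lambda_1 - 1)$. The two constraints then yield $0.3 / 2 = 0.15$ and $0.7 / 3 = 0.233...$, in agreement with the model found by observation.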
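
As a sanity check on the two-constraint case, one can compare the entropy of the model found by observation with that of many other distributions satisfying the same constraints. The sketch below is illustrative only: the helper names, the random seed and the use of Dirichlet sampling to generate feasible alternatives are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable

def entropy(p):
    """Entropy in nats; zero entries contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Order: misty, foggy, cloudy, sunny, rainy.
claimed = np.array([0.15, 0.15, 0.7 / 3, 0.7 / 3, 0.7 / 3])

def random_feasible():
    """A random distribution with p(misty) + p(foggy) = 0.3 and total 1."""
    first = rng.dirichlet(np.ones(2)) * 0.3   # split 0.3 between misty and foggy
    rest = rng.dirichlet(np.ones(3)) * 0.7    # split 0.7 among the other three
    return np.concatenate([first, rest])

others = [random_feasible() for _ in range(1000)]
best_other = max(entropy(q) for q in others)
print(entropy(claimed), best_other)
```

None of the sampled alternatives should exceed the entropy of the claimed model, since splitting each block of probability evenly is what the zero-derivative conditions require.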

**Three constraints**

Suppose we have *three* constraints:
\begin{align*}
    p(misty) + p(foggy) + p(cloudy) + p(sunny) + p(rainy) &= 1 \\
    p(misty) + p(foggy) &= 0.3 \\
    p(misty) + p(cloudy) &= 0.5
\end{align*}

The solution is no longer obvious, but we can still solve this case analytically.

We maximise $H(Y)$ given three constraints:
\begin{align*}
    \Lambda = H(Y)
        & + \lambda_1 \{ p(misty) + p(foggy) + p(cloudy) + p(sunny) + p(rainy) - 1 \} \\
        & + \lambda_2 \{ p(misty) + p(foggy) - 0.3 \} \\
        & + \lambda_3 \{ p(misty) + p(cloudy) - 0.5 \}
\end{align*}
Thus,
$$
    \frac{\partial\Lambda}{\partial p(y)} = \left\{ \begin{array}{ll}
        - \log p(y) - 1 + \lambda_1 + \lambda_2 + \lambda_3 & y = misty \\
        - \log p(y) - 1 + \lambda_1 + \lambda_2		& y = foggy \\
        - \log p(y) - 1 + \lambda_1 + \lambda_3		& y = cloudy \\
        - \log p(y) - 1 + \lambda_1			& \text{otherwise}
	\end{array} \right.
$$
We set $\displaystyle \frac{\partial\Lambda}{\partial p(y)} = 0$ and get the most *uniform* model:
\begin{align*}
    p(misty) &= \frac{9 - \sqrt{51}}{10} = 0.186... \\
    p(foggy) &= \frac{\sqrt{51} - 6}{10} = 0.114... \\
    p(cloudy) &= \frac{\sqrt{51} - 4}{10} = 0.314... \\
    p(sunny) &= \frac{11 - \sqrt{51}}{20} = 0.193... \\
    p(rainy) &= \frac{11 - \sqrt{51}}{20} = 0.193...
\end{align*}
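
The closed form can be reproduced in a few lines. The zero-derivative conditions imply $\log p(misty) + \log p(sunny) = \log p(foggy) + \log p(cloudy)$, because both sides equal $2\lambda_1 + \lambda_2 + \lambda_3 - 2$. Writing $m = p(misty)$ and substituting the three constraints turns this into the quadratic $m^2 - 1.8m + 0.3 = 0$. The sketch below solves it; the variable names are illustrative.

```python
import math

# Constraints expressed in terms of m = p(misty):
#   p(foggy)  = 0.3 - m
#   p(cloudy) = 0.5 - m
#   p(sunny)  = p(rainy) = (1 - 0.8 + m) / 2 = 0.1 + m / 2
# Stationarity: m * p(sunny) = p(foggy) * p(cloudy)  =>  m^2 - 1.8 m + 0.3 = 0.
disc = 1.8 ** 2 - 4 * 0.3
m = (1.8 - math.sqrt(disc)) / 2   # the other root exceeds 0.3, so it is infeasible

p = {
    "misty": m,
    "foggy": 0.3 - m,
    "cloudy": 0.5 - m,
    "sunny": 0.1 + m / 2,
    "rainy": 0.1 + m / 2,
}
for weather, prob in p.items():
    print(f"{weather:6s} {prob:.3f}")
```

The printed values should agree with the expressions in $\sqrt{51}$ given above.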

### Maximum Entropy Model

**Random process**

Formally, we define a random process as follows:
\begin{align*}
    x &: \text{some information influencing the output}, \ x\in{\cal X} \\
    y &: \text{output value}, \ y\in{\cal Y}
\end{align*}

- (eg) a random process defined for the British weather forecast problem
\begin{align*}
    x &: \text{today's weather}, \ x\in\{cloudy\} \\
    y &: \text{tomorrow's weather}, \ y\in\{misty, foggy, cloudy, sunny, rainy\}
\end{align*}

**Training samples**

We also have training samples $(x_1, y_1), (x_2, y_2), \ \ldots \ , (x_N, y_N)$ .

- (eg) ten training samples for the British weather forecast problem
\begin{align*}
    & (cloudy, cloudy), (cloudy, sunny), (cloudy, sunny), (cloudy, misty), (cloudy, cloudy), \\
    & (cloudy, rainy), (cloudy, misty), (cloudy, foggy), (cloudy, cloudy), (cloudy, rainy)
\end{align*}

**Feature**

For $i = 1,\ldots,n$ ($n$: number of features), we define $f_i(x, y)$, an indicator function of type ${\cal X}\times{\cal Y}\longrightarrow\{0, 1\}$ .

- (eg) a feature set for the British weather forecast problem
\begin{align*}
    f_1(x, y) = 1 \quad & \text{if} \ y \in \{misty, foggy, cloudy, sunny, rainy\} \\
    f_2(x, y) = 1 \quad & \text{if} \ y \in \{misty, foggy\} \\
    f_3(x, y) = 1 \quad & \text{if} \ y \in \{misty, cloudy\}
\end{align*}
otherwise $f_i(x, y) = 0$ for $i = 1,2,3$ .

    - note that $x$ is always $\{cloudy\}$ with this problem
    - if one sample is $(cloudy, foggy)$: $f_1 = 1, f_2 = 1, f_3 = 0$
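
The three indicator features translate directly into code. This is a minimal sketch; the names `WEATHER`, `f1`, `f2`, `f3` and `FEATURES` are hypothetical and chosen only to mirror the notation above.

```python
WEATHER = ("misty", "foggy", "cloudy", "sunny", "rainy")

def f1(x, y):
    """1 for every valid weather type (the total probability feature)."""
    return int(y in WEATHER)

def f2(x, y):
    """1 if tomorrow is misty or foggy."""
    return int(y in ("misty", "foggy"))

def f3(x, y):
    """1 if tomorrow is misty or cloudy."""
    return int(y in ("misty", "cloudy"))

FEATURES = (f1, f2, f3)

# The sample (cloudy, foggy) from the text:
print([f("cloudy", "foggy") for f in FEATURES])   # [1, 1, 0]
```

The argument `x` is unused here because today's weather is always cloudy in this problem.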

**Expected values**

The expected value of $f_i$ with respect to an *empirical distribution* $\tilde{p}(x, y)$ is given by
$$
    \tilde{p}(f_i) \equiv E_{\tilde{p}}[f_i] = \sum_{x\in {\cal X}} \sum_{y\in {\cal Y}} \tilde{p}(x, y) f_i(x, y)
$$

- $\tilde{p}(x, y)$ represents a summary of the training sample, that is,
  $\displaystyle \tilde{p}(x, y) \equiv \frac{1}{N} \times \text{number of times that $(x, y)$ occurs in the sample}$
- some pair $(x, y)$ may not occur at all in the sample

The expected value of $f_i$ with respect to a *model distribution* $p(x, y)$ is given by
$$
    p(f_i) \equiv E_p[f_i] = \sum_{x\in {\cal X}} \sum_{y\in {\cal Y}} p(x, y) f_i(x, y)
    \approx \sum_{x\in {\cal X}} \tilde{p}(x) \sum_{y\in {\cal Y}} p(y | x) f_i(x, y)
$$

- calculating $p(f_i)$ with respect to $p(x, y)$ is of the order of $|\ {\cal X}\times{\cal Y}\ |$, which is often too large
- instead, by using the empirical distribution $\tilde{p}(x)$, the calculation becomes more tractable because we only consider those $x$ that occur in the training sample
- we are likely to have more reliable estimates for $p(y | x)$ than for $p(x, y)$

Now we have the following **constraint** that relates two expected values:
$$
    p(f_i) = \tilde{p}(f_i)
$$
where $\tilde{p}(f_i)$ is a means of representing statistical phenomena in the training sample, and $p(f_i) = \tilde{p}(f_i)$ is a means of requiring that our model generalises these phenomena.

- (eg) constraints for the British weather forecast problem
  \begin{align*}
    p(misty) + p(foggy) + p(cloudy) + p(sunny) + p(rainy) &= 1 \\
    p(misty) + p(foggy) &= 0.3 \\
    p(misty) + p(cloudy) &= 0.5
  \end{align*}
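
The right-hand sides of these constraints are the empirical expectations $\tilde{p}(f_i)$ computed from the ten training samples. The sketch below reproduces them; the variable names are illustrative, and the features are redefined inline so that the block stands alone.

```python
from collections import Counter

# The ten training samples (today's weather, tomorrow's weather) from the text.
samples = [
    ("cloudy", "cloudy"), ("cloudy", "sunny"), ("cloudy", "sunny"),
    ("cloudy", "misty"), ("cloudy", "cloudy"), ("cloudy", "rainy"),
    ("cloudy", "misty"), ("cloudy", "foggy"), ("cloudy", "cloudy"),
    ("cloudy", "rainy"),
]

# Empirical distribution: relative frequency of each (x, y) pair.
N = len(samples)
p_tilde = {pair: count / N for pair, count in Counter(samples).items()}

# Indicator features f_1, f_2, f_3 as defined above.
features = [
    lambda x, y: int(y in {"misty", "foggy", "cloudy", "sunny", "rainy"}),
    lambda x, y: int(y in {"misty", "foggy"}),
    lambda x, y: int(y in {"misty", "cloudy"}),
]

# Empirical expected value of each feature.
expected = [
    sum(prob * f(x, y) for (x, y), prob in p_tilde.items())
    for f in features
]
print([round(e, 3) for e in expected])
```

Up to floating-point rounding, the three values are 1, 0.3 and 0.5, matching the constraints.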

**Maximum entropy model**

Within the set ${\cal P}$ of all conditional probability distributions, the subset ${\cal C} \equiv \{ \ p\in{\cal P} \ | \ p(f_i) = \tilde{p}(f_i) \ \text{ for } \ i = 1,\ldots,n \ \}$ contains the models that are consistent with our knowledge.
The following conditional entropy indicates a *mathematical measure of the uniformity*:
$$
    H_p(Y|X) \equiv - \sum_{x\in {\cal X}} \tilde{p}(x) \sum_{y\in {\cal Y}} p(y | x) \log p(y | x)
$$
where notation $H_p(Y|X)$ emphasises the dependency of the entropy on $p$.

Finally, we have reached the maximum entropy model of the form:
$$
    p^* = \mathop{\rm argmax}_{p\in{\cal C}} H_p(Y|X)
$$

**Parameter estimation procedure**

For each feature $f_i \ (i = 1,\ldots,n)$, we introduce a *Lagrange multiplier* $\lambda_i$, then define the *Lagrangian*:
$$
    \Lambda(p, \lambda) \equiv H_p(Y|X) + \sum_i \lambda_i \{ p(f_i) - \tilde{p}(f_i) \}
$$
    
Holding $\lambda = \{\lambda_i\}$ fixed, we compute the unconstrained maximum of $\Lambda(p, \lambda)$ over all $p\in{\cal P}$ :
\begin{align*}
    p_{\lambda}(y|x) &= \mathop{\rm argmax}_{p\in{\cal P}} \Lambda(p, \lambda) \\
	\Phi(\lambda) &= \Lambda(p_{\lambda}, \lambda)
\end{align*}

The general solution is the exponential model:
\begin{align*}
    p_{\lambda}(y|x) &= \frac{1}{Z_{\lambda}(x)} \exp \left( \sum_i \lambda_i f_i(x,y) \right) \\
    \Phi(\lambda) &= - \sum_{x\in {\cal X}} \tilde{p}(x) \log Z_{\lambda}(x) + \sum_i \lambda_i \tilde{p}(f_i)
\end{align*}
where $Z_{\lambda}(x)$ is a normalising constant given by
$$
    Z_{\lambda}(x) = \sum_{y\in {\cal Y}} \exp \left( \sum_i \lambda_i f_i(x,y) \right)
$$
Technically, $\lambda_i$ is a *Lagrange multiplier*, associated with the feature $f_i(x,y)$, in a certain constrained optimisation problem. In a sense, $\lambda_i$ is a measure of the *importance* of the feature $f_i(x,y)$ .