### week 8

#  Constrained Optimisation and Maximum Entropy Modelling

- entropy and maximum entropy modelling
- constrained optimisation using Lagrange multipliers
- analytical solution and numerical solution

In [1]:
import pods
import notebook as nb
import mlai
import numpy as np
import scipy as sp
from matplotlib import pyplot as plt
%matplotlib inline

### Review and Preview

- last week we looked at unsupervised learning
- clustering, dimensionality reduction and latent variables were introduced
- the topic this week concerns constrained optimisation for a maximum entropy model, which plays an important role in the field of information theory

### Entropy: Definition

We calculate the entropy $H$ by
$$
  H = \sum_{i = 1 \ldots m} p_i \log_b \frac{1}{p_i} = - \sum_{i = 1 \ldots m} p_i \log_b p_i
$$

- Typical choices for the base $b$ are $2$, $e$, $10$
- If the log is taken to the base $2$ (ie, $b=2$), entropy is expressed in *bits*
- We drop $b$ and use the **natural log** with the base $e$, unless otherwise noted
- we may sometimes write $H(p)$ for the above quantity
- as a convention, $\displaystyle 0\log \frac{1}{0} = 0$, because $\displaystyle \lim_{a\rightarrow 0+} a\log \frac{1}{a} = 0$
- $H(X) = \sum_{x\in {\cal X}} p(x) \log \frac{1}{p(x)}$ for a random variable $X$
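
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not part of the course code: the function name `entropy` and its `base` argument are chosen here for convenience, and only the standard `math` module is assumed.

```python
import math

def entropy(probs, base=math.e):
    """Shannon entropy H = sum_i p_i log_b (1 / p_i) of a discrete distribution."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p_i = 0 are skipped, following the convention 0 log(1/0) = 0.
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5], base=2))   # fair coin, in bits
print(entropy([0.5, 0.5]))           # same coin with the natural log (nats)
print(entropy([1.0, 0.0], base=2))   # a certain outcome carries no entropy
```

Changing `base` only rescales the result, which is why the base can be dropped when comparing distributions.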

### Entropy: Examples

1. a flip of a fair coin:
$$
  H = \frac{1}{2}\log_2 2 + \frac{1}{2}\log_2 2 = 1 \ (\text{bit})
$$

2. a biased coin, with heads coming up twice as frequently as tails:
$$
  H = \frac{2}{3}\log_2\frac{3}{2} + \frac{1}{3}\log_2 3 \approx 0.92 \ (\text{bits})
$$

3. a race between four horses, having a chance to win with probabilities $0.4$, $0.3$, $0.2$, and $0.1$:
$$
  H = 0.4\log_2 \frac{1}{0.4} + 0.3\log_2 \frac{1}{0.3} + 0.2\log_2 \frac{1}{0.2} + 0.1\log_2 \frac{1}{0.1} \approx 1.85 \ (\text{bits})
$$

### Coin Tossing

Suppose a weighted coin has probability $h$ of coming up heads. The entropy of tossing the coin once is given by
$$
  H(X) = h \log_2 \frac{1}{h} + (1 - h) \log_2 \frac{1}{1 - h}
$$

Entropy is maximised when the coin is fair (ie, unbiased):
<img src="./figs/coin_toss.jpg", width=300, align=center>
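
As a quick numerical check of this claim, the sketch below evaluates the coin entropy on a grid of values of $h$ and picks the largest. The helper name `binary_entropy` and the grid spacing of $0.01$ are choices made for this illustration.

```python
import math

def binary_entropy(h):
    """Entropy in bits of a single toss of a coin with P(heads) = h."""
    return sum(p * math.log2(1.0 / p) for p in (h, 1.0 - h) if p > 0)

grid = [i / 100 for i in range(101)]     # h = 0.00, 0.01, ..., 1.00
best_h = max(grid, key=binary_entropy)   # value of h with the largest entropy
print(best_h, binary_entropy(best_h))
```

The entropy falls to zero at $h = 0$ and $h = 1$, where the outcome is certain.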

### Maximum Entropy Model

Concept:

- model **all** that is known
- assume **nothing** that is unknown

Maximum entropy principle:

- given a collection of facts, the maximum entropy method chooses a model that is **consistent with all facts**, but otherwise **as uniform as possible**

### Maximum Entropy Model: Simple Example

We wish to estimate a joint probability distribution $p(x, y)$ where $x\in\{x_1, x_2\}$ and $y\in\{y_1, y_2\}$, given the constraints
\begin{align*}
  p(x_1, y_1) + p(x_2, y_1) &= 0.6 \\
  p(x_1, y_1) + p(x_1, y_2) + p(x_2, y_1) + p(x_2, y_2) &= 1
\end{align*}
Given these constraints, our objective is to maximise
$$
  H(X, Y) = \sum_{x\in\{x_1, x_2\}} \sum_{y\in\{y_1, y_2\}} p(x, y) \log\frac{1}{p(x, y)}
$$

One distribution that satisfies the constraints:

.     | $y_1$ | $y_2$ | total
:----:|:-----:|:-----:|:-----:
$x_1$ | 0.5   | 0.1   |
$x_2$ | 0.1   | 0.3   |
total | 0.6   |       | 1.0

The most uniform distribution that satisfies the constraints:

.     | $y_1$ | $y_2$ | total
:----:|:-----:|:-----:|:-----:
$x_1$ | 0.3   | 0.2   |
$x_2$ | 0.3   | 0.2   |
total | 0.6   |       | 1.0
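
The two tables can be compared numerically. The sketch below is an illustration under its own naming (`entropy`, `table` are hypothetical helpers): it computes the entropy of both tables in nats, then runs a coarse grid search over every table that satisfies the two constraints to see which one has the largest entropy.

```python
import math

def entropy(probs):
    """Entropy in nats, skipping zero-probability cells."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)

def table(a, b):
    """Joint table with p(x1,y1) = a and p(x1,y2) = b.

    The remaining cells are fixed by the constraints: the y1 column sums to 0.6
    and all four cells sum to 1.
    """
    return [a, b, 0.6 - a, 0.4 - b]   # p(x1,y1), p(x1,y2), p(x2,y1), p(x2,y2)

h_first = entropy(table(0.5, 0.1))     # first table above
h_uniform = entropy(table(0.3, 0.2))   # second table above

# Coarse grid search (step 0.01) over all tables satisfying the constraints.
best_i, best_j = max(
    ((i, j) for i in range(61) for j in range(41)),
    key=lambda ij: entropy(table(ij[0] / 100, ij[1] / 100)),
)
print(h_first, h_uniform, best_i / 100, best_j / 100)
```

On this grid the search lands on $p(x_1, y_1) = 0.3$ and $p(x_1, y_2) = 0.2$, the second table.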

### (Example) Weather Forecast: Problem Description

First, we make the oversimplified assumption that five weather types
$$
  \{misty, foggy, cloudy, sunny, rainy\}
$$
can fully describe weather.

Now, suppose that today's weather is $\{cloudy\}$. What will tomorrow's weather be?

### Weather Forecast: Single Constraint (I)

Initially we have the **total probability** constraint:
$$
  p(misty) + p(foggy) + p(cloudy) + p(sunny) + p(rainy) = 1
$$
Infinitely many distributions satisfy the constraint above.

The most intuitively appealing model is the one that allocates the total probability evenly among the five possible weather types:
\begin{align*}
  p(misty) &= 0.2 \\
  p(foggy) &= 0.2 \\
  p(cloudy) &= 0.2 \\
  p(sunny) &= 0.2 \\
  p(rainy) &= 0.2
\end{align*}

### Weather Forecast: Single Constraint (II)

Analytically, we wish to maximise the entropy $H(Y)$ given the total probability constraint.
Using a **Lagrange multiplier** $\lambda_1$ :
$$
  \Lambda = H(Y) + \lambda_1 \times \text{constraint} = \left( - \sum_{y\in {\cal Y}} p(y) \log p(y) \right) + \lambda_1 \left( \sum_{y\in {\cal Y}} p(y) - 1 \right)
$$
and calculate the partial derivative with respect to $p(y)$ :
$$
  \frac{\partial\Lambda}{\partial p(y)} = - \log p(y) - 1 + \lambda_1
$$
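then set $\frac{\partial\Lambda}{\partial p(y)} = 0$, which gives $p(y) = \exp(\lambda_1 - 1)$, the same value for every $y$. The total probability constraint over five weather types then yields $p(y) = 0.2$ for all $y$.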

### Weather Forecast: Two Constraints (I)

Suppose we have two constraints:
\begin{align*}
  p(misty) + p(foggy) + p(cloudy) + p(sunny) + p(rainy) &= 1 \\
  p(misty) + p(foggy) &= 0.3
\end{align*}
By inspection, the most uniform model is
\begin{align*}
  p(misty) &= 0.15 \\
  p(foggy) &= 0.15 \\
  p(cloudy) &= 0.233... \\
  p(sunny) &= 0.233... \\
  p(rainy) &= 0.233...
\end{align*}

### Weather Forecast: Two Constraints (II)

Analytically, we maximise $H(Y)$ given those two constraints:
\begin{align*}
  \Lambda = H(Y) & + \lambda_1 \{ p(misty) + p(foggy) + p(cloudy) + p(sunny) + p(rainy) - 1 \} \\
  & + \lambda_2 \{ p(misty) + p(foggy) - 0.3 \}
\end{align*}
and calculate partial derivatives:
$$
  \frac{\partial\Lambda}{\partial p(y)} = \left\{
    \begin{array}{ll}
      - \log p(y) - 1 + \lambda_1 + \lambda_2	& y = misty, foggy \\
      - \log p(y) - 1 + \lambda_1		& \text{otherwise}
	\end{array} \right.
$$
Then set $\frac{\partial\Lambda}{\partial p(y)} = 0$ .
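
As a worked step (the intermediate expressions are filled in here as an illustration), setting the derivatives to zero and substituting into the two constraints gives
\begin{align*}
  p(misty) = p(foggy) &= e^{\lambda_1 + \lambda_2 - 1}, \qquad p(cloudy) = p(sunny) = p(rainy) = e^{\lambda_1 - 1} \\
  2 e^{\lambda_1 + \lambda_2 - 1} &= 0.3 \ \Rightarrow \ p(misty) = p(foggy) = 0.15 \\
  0.3 + 3 e^{\lambda_1 - 1} &= 1 \ \Rightarrow \ p(cloudy) = p(sunny) = p(rainy) = \frac{0.7}{3} = 0.233...
\end{align*}
which agrees with the most uniform model given on the previous slide.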

### Weather Forecast: Three Constraints (I)

Suppose we have *three* constraints:
\begin{align*}
  p(misty) + p(foggy) + p(cloudy) + p(sunny) + p(rainy) &= 1 \\
  p(misty) + p(foggy) &= 0.3 \\
  p(misty) + p(cloudy) &= 0.5
\end{align*}

The solution is no longer obvious, but we can still solve this case analytically.

### Weather Forecast: Three Constraints (II)

We maximise $H(Y)$ given three constraints:
\begin{align*}
  \Lambda = H(Y) & + \lambda_1 \{ p(misty) + p(foggy) + p(cloudy) + p(sunny) + p(rainy) - 1 \} \\
  & + \lambda_2 \{ p(misty) + p(foggy) - 0.3 \} \\
  & + \lambda_3 \{ p(misty) + p(cloudy) - 0.5 \}
\end{align*}
The partial derivatives are
$$
  \frac{\partial\Lambda}{\partial p(y)} = \left\{
    \begin{array}{ll}
      - \log p(y) - 1 + \lambda_1 + \lambda_2 + \lambda_3 & y = misty \\
      - \log p(y) - 1 + \lambda_1 + \lambda_2 & y = foggy \\
      - \log p(y) - 1 + \lambda_1 + \lambda_3 & y = cloudy \\
      - \log p(y) - 1 + \lambda_1 & \text{otherwise}
	\end{array} \right.
$$

### Weather Forecast: Three Constraints (III)

We set $\frac{\partial\Lambda}{\partial p(y)} = 0$ and get the most uniform model:
\begin{align*}
  p(misty) &= \frac{9 - \sqrt{51}}{10} \approx 0.186 \\
  p(foggy) &= \frac{\sqrt{51} - 6}{10} \approx 0.114 \\
  p(cloudy) &= \frac{\sqrt{51} - 4}{10} \approx 0.314 \\
  p(sunny) &= \frac{11 - \sqrt{51}}{20} \approx 0.193 \\
  p(rainy) &= \frac{11 - \sqrt{51}}{20} \approx 0.193
\end{align*}
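
One way to reach these values, sketched here as an illustration rather than as the derivation used in the lecture: adding the stationarity conditions for $misty$ and $sunny$, and those for $foggy$ and $cloudy$, gives the same right-hand side, so $p(misty)\,p(sunny) = p(foggy)\,p(cloudy)$. Combined with the linear constraints this leaves a quadratic in $m = p(misty)$, which the code below solves and compares with the closed-form expressions.

```python
import math

# Eliminate variables using the linear constraints, with m = p(misty):
#   p(foggy) = 0.3 - m, p(cloudy) = 0.5 - m, p(sunny) = p(rainy) = (0.2 + m) / 2
# Substituting into p(misty) p(sunny) = p(foggy) p(cloudy) gives
#   m^2 - 1.8 m + 0.3 = 0
disc = math.sqrt(1.8 ** 2 - 4 * 0.3)
roots = [(1.8 - disc) / 2, (1.8 + disc) / 2]
m = next(r for r in roots if 0.0 <= r <= 0.3)   # p(foggy) = 0.3 - m must be >= 0

p = {
    "misty": m,
    "foggy": 0.3 - m,
    "cloudy": 0.5 - m,
    "sunny": (0.2 + m) / 2,
    "rainy": (0.2 + m) / 2,
}

# Closed-form values quoted in the text.
s = math.sqrt(51)
closed_form = {
    "misty": (9 - s) / 10,
    "foggy": (s - 6) / 10,
    "cloudy": (s - 4) / 10,
    "sunny": (11 - s) / 20,
    "rainy": (11 - s) / 20,
}

for y in p:
    print(y, round(p[y], 3), round(closed_form[y], 3))
```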

### Maximum Entropy Model (I)

Formally we define a **random process** as follows:
\begin{align*}
  x &: \text{some information influencing the output}, \ x\in{\cal X} \\
  y &: \text{output value}, \ y\in{\cal Y}
\end{align*}

(eg) a random process defined for the **weather forecast** problem:
\begin{align*}
  x &: \text{today's weather}, \ x\in\{cloudy\} \\
  y &: \text{tomorrow's weather}, \ y\in\{misty, foggy, cloudy, sunny, rainy\}
\end{align*}

Wikipedia: [Principle of maximum entropy](https://en.wikipedia.org/wiki/Principle_of_maximum_entropy)

### Maximum Entropy Model (II)

We also have $N$ **training samples** $(x_1, y_1), (x_2, y_2), \ \ldots \ , (x_N, y_N)$ .

(eg) ten training samples for the **weather forecast** problem:
\begin{align*}
  & (cloudy, cloudy), (cloudy, sunny), (cloudy, sunny), (cloudy, misty), (cloudy, cloudy), \\
  & (cloudy, rainy), (cloudy, misty), (cloudy, foggy), (cloudy, cloudy), (cloudy, rainy)
\end{align*}

### Maximum Entropy Model (III)

For $i = 1,\ldots,n$ ($n$: number of **features**), we define $f_i(x, y)$, an **indicator function** of type ${\cal X}\times{\cal Y}\longrightarrow\{0, 1\}$ .

(eg) a feature set for the **weather forecast** problem:
\begin{align*}
  f_1(x, y) = 1 \quad & \text{if} \ y \in \{misty, foggy, cloudy, sunny, rainy\} \\
  f_2(x, y) = 1 \quad & \text{if} \ y \in \{misty, foggy\} \\
  f_3(x, y) = 1 \quad & \text{if} \ y \in \{misty, cloudy\}
\end{align*}
otherwise $f_i(x, y) = 0$ for $i = 1,2,3$ .

- note that $x$ is always $\{cloudy\}$ in this problem
- if one sample is $(cloudy, foggy)$: $f_1 = 1, f_2 = 1, f_3 = 0$
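
The three indicator functions can be written directly in Python. This is a minimal sketch; the names `f1`, `f2`, `f3` mirror the notation above, while `WEATHER` and `FEATURES` are labels introduced for this illustration.

```python
WEATHER = ("misty", "foggy", "cloudy", "sunny", "rainy")

def f1(x, y):
    """Active for every weather type (encodes the total probability constraint)."""
    return 1 if y in WEATHER else 0

def f2(x, y):
    """Active when tomorrow is misty or foggy."""
    return 1 if y in ("misty", "foggy") else 0

def f3(x, y):
    """Active when tomorrow is misty or cloudy."""
    return 1 if y in ("misty", "cloudy") else 0

FEATURES = (f1, f2, f3)

# The sample (cloudy, foggy) from the note above.
print([f("cloudy", "foggy") for f in FEATURES])   # [1, 1, 0]
```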

### Maximum Entropy Model (IV)

The **expected value** of $f_i$ with respect to an **empirical distribution** $\tilde{p}(x, y)$ is given by
$$
  \tilde{p}(f_i) \equiv E_{\tilde{p}}[f_i] = \sum_{x\in {\cal X}} \sum_{y\in {\cal Y}} \tilde{p}(x, y) f_i(x, y)
$$

- $\tilde{p}(x, y)$ represents a summary of the training sample, that is,
$$
  \tilde{p}(x, y) \equiv \frac{1}{N} \times \text{number of times that $(x, y)$ occurs in the sample}
$$

- some pair $(x, y)$ may not occur at all in the sample
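
For the ten weather samples listed earlier, the empirical distribution and the empirical feature expectations can be computed as follows. This is a sketch: the features are encoded as sets of active outcomes (`SUPPORT`, a name introduced here) rather than as functions, which is equivalent for this problem because $x$ is always $cloudy$.

```python
from collections import Counter

samples = [
    ("cloudy", "cloudy"), ("cloudy", "sunny"), ("cloudy", "sunny"),
    ("cloudy", "misty"), ("cloudy", "cloudy"), ("cloudy", "rainy"),
    ("cloudy", "misty"), ("cloudy", "foggy"), ("cloudy", "cloudy"),
    ("cloudy", "rainy"),
]

# Empirical distribution: relative frequency of each observed (x, y) pair.
counts = Counter(samples)
p_tilde = {pair: c / len(samples) for pair, c in counts.items()}

# Outcomes y on which each feature takes the value 1.
SUPPORT = {
    "f1": {"misty", "foggy", "cloudy", "sunny", "rainy"},
    "f2": {"misty", "foggy"},
    "f3": {"misty", "cloudy"},
}

def empirical_expectation(name):
    """Sum of p~(x, y) over the pairs on which the named feature is active."""
    return sum(prob for (x, y), prob in p_tilde.items() if y in SUPPORT[name])

for name in SUPPORT:
    print(name, round(empirical_expectation(name), 3))
```

The resulting values, $1$, $0.3$ and $0.5$, are the right-hand sides of the three weather constraints.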

### Maximum Entropy Model (V)

The **expected value** of $f_i$ with respect to a **model distribution** $p(x, y)$ is
$$
  p(f_i) \equiv E_p[f_i] = \sum_{x\in {\cal X}} \sum_{y\in {\cal Y}} p(x, y) f_i(x, y) \approx \sum_{x\in {\cal X}} \tilde{p}(x) \sum_{y\in {\cal Y}} p(y | x) f_i(x, y)
$$

- calculation of $p(f_i)$ with respect to $p(x, y)$ is of the order of $|\ {\cal X}\times{\cal Y}\ |$, which is often too large
- by using the empirical distribution $\tilde{p}(x)$, the calculation gets more tractable because we only consider the training sample
- we are likely to have more reliable estimates for $p(y | x)$ than for $p(x, y)$

### Maximum Entropy Model (VI)

Now we have the following **constraint** that relates two expected values:
$$
  p(f_i) = \tilde{p}(f_i)
$$
- $\tilde{p}(f_i)$ represents statistical phenomena in the training sample
- $p(f_i) = \tilde{p}(f_i)$ is a means of requiring that our model generalises the phenomena

(eg) constraints for the **weather forecast** problem:
\begin{align*}
  p(misty) + p(foggy) + p(cloudy) + p(sunny) + p(rainy) &= 1 \\
  p(misty) + p(foggy) &= 0.3 \\
  p(misty) + p(cloudy) &= 0.5
\end{align*}

### Maximum Entropy Model (VII)

Of all conditional probability distributions ${\cal P}$, we restrict attention to the subset
$$
  {\cal C} \equiv \{ \ p\in{\cal P} \ | \ p(f_i) = \tilde{p}(f_i) \ \text{ for } \ i = 1,\ldots,n \ \}
$$
whose members are consistent with our knowledge from the training sample.

A **mathematical measure of the uniformity** is the conditional entropy:
$$
  H_p(Y|X) \equiv - \sum_{x\in {\cal X}} \tilde{p}(x) \sum_{y\in {\cal Y}} p(y | x) \log p(y | x)
$$
where the notation $H_p$ emphasises the dependency on $p$ .

Finally we have reached the **maximum entropy model** of the form:
$$
  p^* = \mathop{\rm argmax}_{p\in{\cal C}} H_p(Y|X)
$$

### Maximum Entropy Model: Analytical Solution (I)

For each feature $f_i \ (i = 1,\ldots,n)$, we introduce a **Lagrange multiplier** $\lambda_i$, then define the **Lagrangian**:
$$
  \Lambda(p, \lambda) \equiv H_p(Y|X) + \sum_i \lambda_i \{ p(f_i) - \tilde{p}(f_i) \}
$$
    
Holding $\lambda = \{\lambda_i\}$ fixed, we compute the unconstrained maximum of $\Lambda(p, \lambda)$ over all $p\in{\cal P}$ :
$$
  p_{\lambda}(y|x) = \mathop{\rm argmax}_{p\in{\cal P}} \Lambda(p, \lambda)
$$

### Maximum Entropy Model: Analytical Solution (II)

The general solution is the exponential model:
$$
  p_{\lambda}(y|x) = \frac{1}{Z_{\lambda}(x)} \exp \left( \sum_i \lambda_i f_i(x,y) \right)
$$
where $Z_{\lambda}(x) = \sum_{y\in {\cal Y}} \exp \left( \sum_i \lambda_i f_i(x,y) \right)$ is a normalising constant.
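
A sketch of why the solution takes this form, assuming an extra multiplier $\gamma_x$ for each $x$ (not part of the Lagrangian as written above) that enforces $\sum_{y\in {\cal Y}} p(y|x) = 1$:
\begin{align*}
  \frac{\partial}{\partial p(y|x)} \left[ \Lambda(p, \lambda) + \sum_{x\in {\cal X}} \gamma_x \left( \sum_{y\in {\cal Y}} p(y|x) - 1 \right) \right]
    &= - \tilde{p}(x) \left( \log p(y|x) + 1 \right) + \tilde{p}(x) \sum_i \lambda_i f_i(x,y) + \gamma_x = 0 \\
  \Rightarrow \quad p(y|x) &= \exp \left( \sum_i \lambda_i f_i(x,y) \right) \exp \left( \frac{\gamma_x}{\tilde{p}(x)} - 1 \right)
\end{align*}
The second factor does not depend on $y$, so requiring $\sum_{y\in {\cal Y}} p(y|x) = 1$ fixes it to $1 / Z_{\lambda}(x)$.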

- technically, $\lambda_i$ is a **Lagrange multiplier**, associated with the feature $f_i(x,y)$, in a certain constrained optimisation problem
- in a sense, $\lambda_i$ is a measure of the **importance** of the feature $f_i(x,y)$
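
The exponential model can be evaluated directly once the multipliers are given. The sketch below does this for the weather features; `p_lambda`, `WEATHER` and `SUPPORT` are names introduced for this illustration, and the multiplier values in the second call are the ones that reproduce the two-constraint solution $(0.15, 0.15, 0.233..., 0.233..., 0.233...)$.

```python
import math

WEATHER = ("misty", "foggy", "cloudy", "sunny", "rainy")
# Feature i is active on the outcomes in SUPPORT[i] (x is always "cloudy").
SUPPORT = (set(WEATHER), {"misty", "foggy"}, {"misty", "cloudy"})

def f(i, x, y):
    """Indicator feature f_{i+1}(x, y) for the weather problem."""
    return 1 if y in SUPPORT[i] else 0

def p_lambda(x, lambdas):
    """Exponential model p_lambda(y | x) as a dict over the weather types."""
    scores = {
        y: math.exp(sum(lam * f(i, x, y) for i, lam in enumerate(lambdas)))
        for y in WEATHER
    }
    z = sum(scores.values())               # normalising term Z_lambda(x)
    return {y: s / z for y, s in scores.items()}

print(p_lambda("cloudy", [0.0, 0.0, 0.0]))                 # uniform model
print(p_lambda("cloudy", [0.7, math.log(9 / 14), 0.0]))    # two-constraint model
```

Because $f_1$ is active for every $y$, $\lambda_1$ cancels in the normalisation, so its value ($0.7$ in the second call, chosen arbitrarily) does not affect the result.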

### Maximum Entropy Model: Numerical Solution

1. initially set $\lambda_i = 0$ for all $i = 1,\ldots,n$

2. do for each $i = 1,\ldots,n$ :

   **A.** let $\delta_i$ be the solution to
   $$
     \displaystyle \sum_{x\in {\cal X}} \tilde{p}(x) \sum_{y\in {\cal Y}} p_{\lambda}(y|x) f_i(x,y) \exp\left( \delta_i f^{\#}(x,y) \right) = \tilde{p}(f_i)
   $$
   where $f^{\#}(x,y) \equiv \sum_{j = 1,\ldots,n} f_j(x,y)$ is the number of features that are active for $(x, y)$

   **B.** update multipliers by $\lambda_i \leftarrow \lambda_i + \delta_i$

3. go back to step 2, if not all $\lambda_i$ have converged
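
The steps above can be sketched for the weather problem as follows. This is an illustrative implementation under stated assumptions, not the code behind the course figures: $x$ is always $cloudy$, so $\tilde{p}(x) = 1$ and the sum over $x$ disappears; each $\delta_i$ is found by bisection (a choice made here, since the left-hand side of step A increases with $\delta_i$); and $p_{\lambda}$ is recomputed after every individual update. All function and variable names are hypothetical.

```python
import math

WEATHER = ("misty", "foggy", "cloudy", "sunny", "rainy")
SUPPORT = (set(WEATHER), {"misty", "foggy"}, {"misty", "cloudy"})
TARGET = (1.0, 0.3, 0.5)                 # empirical expectations p~(f_i)

def f(i, y):
    """Indicator feature f_{i+1}; x is omitted because it is always cloudy."""
    return 1 if y in SUPPORT[i] else 0

# f#(y): number of active features for each outcome.
F_SHARP = {y: sum(f(i, y) for i in range(3)) for y in WEATHER}

def model(lambdas):
    """Exponential model p_lambda(y) for the current multipliers."""
    scores = {
        y: math.exp(sum(lam * f(i, y) for i, lam in enumerate(lambdas)))
        for y in WEATHER
    }
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def solve_delta(i, p, lo=-20.0, hi=20.0):
    """Step A: bisection for delta_i on the bracket [lo, hi]."""
    def g(delta):
        lhs = sum(p[y] * f(i, y) * math.exp(delta * F_SHARP[y]) for y in WEATHER)
        return lhs - TARGET[i]
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def iterative_scaling(tol=1e-10, max_sweeps=10000):
    lambdas = [0.0, 0.0, 0.0]                        # step 1
    for _ in range(max_sweeps):
        largest = 0.0
        for i in range(3):                           # step 2
            delta = solve_delta(i, model(lambdas))   # step 2A
            lambdas[i] += delta                      # step 2B
            largest = max(largest, abs(delta))
        if largest < tol:                            # step 3: all updates negligible
            break
    return lambdas, model(lambdas)

lambdas, p = iterative_scaling()
for y in WEATHER:
    print(y, round(p[y], 3))
```

Under these assumptions the iteration should settle on the same distribution as the analytical three-constraint solution. $\lambda_1$ stays at zero throughout, because $f_1$ only encodes normalisation, which $Z_{\lambda}$ already enforces.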

### Recall: Weather Forecast with Three Constraints (I)

We have *three* constraints:
\begin{align*}
  p(misty) + p(foggy) + p(cloudy) + p(sunny) + p(rainy) &= 1 \\
  p(misty) + p(foggy) &= 0.3 \\
  p(misty) + p(cloudy) &= 0.5
\end{align*}

### Recall: Weather Forecast with Three Constraints (II)

We reached the most uniform model analytically:
\begin{align*}
  p(misty) &\approx 0.186 \\
  p(foggy) &\approx 0.114 \\
  p(cloudy) &\approx 0.314 \\
  p(sunny) &\approx 0.193 \\
  p(rainy) &\approx 0.193
\end{align*}

(next graph) Numerical calculation of the maximum entropy model

<img src="./figs/isa.png", width=800, align=center>