Ref: https://github.com/zyxue/book-notes-ml-prob-perspective/blob/master/ch9-generalized-linear-models-and-the-exponential-family/definitions-of-exponential-family.ipynb

### Exponential family distribution

According to PRML Eq. (2.194), a distribution in the exponential family has the form

$$
p(\mathbf{x} | \boldsymbol{\eta}) = h(\mathbf{x}) g(\boldsymbol{\eta}) \exp \left \{  \boldsymbol{\eta}^T \mathbf{u}(\mathbf{x}) \right \}
$$

where

* $\boldsymbol{\eta} \in \mathbb{R}^d$ is the vector of natural parameters of the distribution.
* $\mathbf{u}(\mathbf{x}) \in \mathbb{R}^d$ is a function of $\mathbf{x}$, the sufficient statistics.
* $g(\boldsymbol{\eta}) \in \mathbb{R}$ is the normalization factor, which ensures that the distribution sums or integrates to 1.
* $h(\mathbf{x}) \in \mathbb{R}$ is a scaling function that does not depend on $\boldsymbol{\eta}$; it is often the constant 1.

Note that $g$ is a function of $\boldsymbol{\eta}$ only, while $h$ is a function of $\mathbf{x}$ only.

### Common exponential family distributions

| Distribution                                  | $\boldsymbol{\eta}$                                               | $\mathbf{u}(\mathbf{x})$                 | $g(\boldsymbol{\eta})$                                           | $h(\mathbf{x})$         | 
|-----------------------------------------------|-------------------------------------------------------------------|------------------------------------------|------------------------------------------------------------------|-------------------------|
| Bernoulli $p(x \mid \mu)$                     | $\ln \frac{\mu}{1 - \mu}$                                         | $x$                                      | $\sigma(-\eta)$                                                  | $1$                     |
| Multinomial $p(\mathbf{x} \mid \boldsymbol{\mu})$ | $\eta_k = \ln \mu_k$                                          | $\mathbf{x}$                             | $1$                                                              | $1$                     |
| Gaussian $p(x \mid \mu, \sigma^2)$            | $\begin{bmatrix} \mu / \sigma^2 \\  -1 / (2\sigma^2) \end{bmatrix}$ | $\begin{bmatrix} x \\ x^2 \end{bmatrix}$ | $(-2 \eta_2)^{1/2} \exp \left( \frac{\eta_1^2}{4\eta_2} \right)$ | $(2\pi)^{\frac{-1}{2}}$ |

Note that for the Gaussian distribution, substituting $\eta_1 = \mu / \sigma^2$ and $\eta_2 = -1 / (2\sigma^2)$ reduces $g(\boldsymbol{\eta})$ to $\frac{1}{\sigma} \exp \left( - \frac{\mu^2}{2 \sigma^2} \right)$.
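The Bernoulli and Gaussian rows of the table can be checked numerically. The following is a minimal sketch, not taken from the source: the helper `exp_family_pdf` and the parameter values (`mu = 0.3`, `m = 1.5`, `s2 = 0.7`) are assumptions chosen purely for illustration. It evaluates $h(x) g(\boldsymbol{\eta}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(x)\}$ with the table entries and compares the result against the usual closed-form expressions.

```python
import math


def exp_family_pdf(x, eta, u, g, h):
    """Evaluate h(x) * g(eta) * exp(eta^T u(x)); eta and u(x) are lists."""
    dot = sum(e * s for e, s in zip(eta, u(x)))
    return h(x) * g(eta) * math.exp(dot)


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


# Bernoulli row: eta = ln(mu / (1 - mu)), u(x) = x, g = sigma(-eta), h = 1.
mu = 0.3  # illustrative value
eta_bern = [math.log(mu / (1.0 - mu))]


def bernoulli_exp_family(x):
    return exp_family_pdf(
        x, eta_bern,
        u=lambda x: [x],
        g=lambda eta: sigmoid(-eta[0]),
        h=lambda x: 1.0,
    )


# Gaussian row: eta = [mu / sigma^2, -1 / (2 sigma^2)], u(x) = [x, x^2].
m, s2 = 1.5, 0.7  # illustrative mean and variance
eta_gauss = [m / s2, -1.0 / (2.0 * s2)]


def gaussian_exp_family(x):
    return exp_family_pdf(
        x, eta_gauss,
        u=lambda x: [x, x * x],
        g=lambda eta: math.sqrt(-2.0 * eta[1]) * math.exp(eta[0] ** 2 / (4.0 * eta[1])),
        h=lambda x: (2.0 * math.pi) ** -0.5,
    )


def gaussian_direct(x):
    """Standard Gaussian density with mean m and variance s2."""
    return math.exp(-((x - m) ** 2) / (2.0 * s2)) / math.sqrt(2.0 * math.pi * s2)


for x in (0, 1):
    print(x, bernoulli_exp_family(x), mu ** x * (1.0 - mu) ** (1 - x))
for x in (-1.0, 0.0, 2.5):
    print(x, gaussian_exp_family(x), gaussian_direct(x))
```

Each printed pair should agree up to floating-point rounding.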

### Maximum likelihood estimate for exponential family distribution

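Setting the gradient of the log-likelihood with respect to $\boldsymbol{\eta}$ to zero gives the condition satisfied by the maximum likelihood estimator $\boldsymbol{\eta}_{\text{ML}}$: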

$$-\nabla \ln g(\boldsymbol{\eta}_{\text{ML}}) = \frac{1}{N} \sum_{n=1}^N \mathbf{u}(\mathbf{x}_n)$$
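
A short sketch of where this condition comes from, assuming the observations $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ are independent and identically distributed (the set notation $\mathbf{X}$ is introduced here only for illustration):

\begin{align*}
\ln p(\mathbf{X} | \boldsymbol{\eta})
&= \sum_{n=1}^N \ln h(\mathbf{x}_n) + N \ln g(\boldsymbol{\eta}) + \boldsymbol{\eta}^T \sum_{n=1}^N \mathbf{u}(\mathbf{x}_n) \\
\nabla \ln p(\mathbf{X} | \boldsymbol{\eta})
&= N \nabla \ln g(\boldsymbol{\eta}) + \sum_{n=1}^N \mathbf{u}(\mathbf{x}_n)
\end{align*}

The $h(\mathbf{x}_n)$ terms do not depend on $\boldsymbol{\eta}$, so they vanish under the gradient. Setting the gradient to zero at $\boldsymbol{\eta} = \boldsymbol{\eta}_{\text{ML}}$ and dividing by $N$ yields the condition.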

For the **Bernoulli distribution**,

\begin{align*}
g(\eta)
&= \sigma(- \eta) \\
&= \frac{1}{1 + e^\eta} \\
\ln g(\eta)
&= - \ln (1 + e^\eta) \\
\nabla \ln g(\eta)
&= - \frac{1}{1 + e^\eta} e^\eta \\
- \nabla \ln g(\eta)
&= \frac{1}{1 + e^\eta} e^\eta
\end{align*}

Given $\eta = \ln \frac{\mu}{1 - \mu}$,

\begin{align*}
- \nabla \ln g(\eta) 
&= \frac{1}{\frac{1}{1 - \mu}} \frac{\mu}{1 - \mu} = \mu
\end{align*}

Therefore,

$$-\nabla \ln g(\eta_{\text{ML}}) = \mu_{\text{ML}} = \frac{1}{N} \sum_{n=1}^N u(x_n) = \frac{1}{N} \sum_{n=1}^N x_n$$

which matches the result when calculating the maximum likelihood estimate of $\mu$ for a Bernoulli distribution directly.
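
The Bernoulli result can also be checked numerically. This is a rough sketch under stated assumptions: the data are simulated with a hypothetical `true_mu = 0.3`, and the gradient of $\ln g$ is approximated by a central finite difference rather than computed analytically.

```python
import math
import random

random.seed(0)
true_mu = 0.3  # hypothetical parameter, used only to simulate data
xs = [1 if random.random() < true_mu else 0 for _ in range(10_000)]

# Right-hand side of the ML condition: (1/N) * sum of u(x_n), with u(x) = x.
mean_u = sum(xs) / len(xs)

# Natural parameter corresponding to mu = mean_u.
eta_ml = math.log(mean_u / (1.0 - mean_u))


def log_g(eta):
    """ln g(eta) = -ln(1 + e^eta) for the Bernoulli distribution."""
    return -math.log(1.0 + math.exp(eta))


# Left-hand side: -d/d(eta) ln g(eta) at eta_ml, via central finite difference.
eps = 1e-6
neg_grad = -(log_g(eta_ml + eps) - log_g(eta_ml - eps)) / (2.0 * eps)

print(mean_u, neg_grad)
```

The two printed values should agree up to the finite-difference error, which illustrates that the sample mean satisfies the ML condition.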

For the **Gaussian distribution**,

\begin{align*}
g(\boldsymbol{\eta})
&= (-2 \eta_2)^{1/2} \exp \left( \frac{\eta_1^2}{4\eta_2} \right) \\
\ln g(\boldsymbol{\eta})
&= \frac{1}{2} \ln \left( - 2 \eta_2 \right ) + \frac{\eta_1^2}{4 \eta_2} \\
\nabla \ln g(\boldsymbol{\eta})
&= \begin{bmatrix}
\frac{\eta_1}{2 \eta_2} \\ 
\frac{1}{2\eta_2} - \frac{1}{4} \frac{\eta_1^2}{\eta_2^2}
\end{bmatrix} \\
&= \begin{bmatrix}
- \mu \\ 
- \sigma^2 - \mu^2
\end{bmatrix} \\
- \nabla \ln g(\boldsymbol{\eta})
&= \begin{bmatrix}
\mu \\ 
\sigma^2 + \mu^2
\end{bmatrix}
\end{align*}

Therefore,

\begin{align*}
-\nabla \ln g(\boldsymbol{\eta}_{\text{ML}})
&= \begin{bmatrix}
\mu_{\text{ML}} \\ 
\sigma_{\text{ML}}^2 + \mu_{\text{ML}}^2
\end{bmatrix} \\
&= \frac{1}{N} \sum_{n=1}^N \mathbf{u}(\mathbf{x}_n) \\
&= \frac{1}{N} \sum_{n=1}^N \begin{bmatrix}
x_n \\ 
x_n^2
\end{bmatrix} \\
\mu_{\text{ML}}
&= \frac{1}{N} \sum_{n=1}^N x_n \\
\sigma^2_{\text{ML}}
&= \frac{1}{N} \sum_{n=1}^N x_n^2 - \mu_{\text{ML}}^2 \\
&= \frac{1}{N} \sum_{n=1}^N x_n^2 - \left(\frac{1}{N} \sum_{n=1}^N x_n \right )^2 \\
&= \frac{1}{N} \sum_{n=1}^N \left( x_n - \mu_{\text{ML}} \right )^2
\end{align*}

which also matches the results obtained when calculating the maximum likelihood estimates of $\mu$ and $\sigma^2$ for a Gaussian distribution directly.
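
As a numerical illustration of the Gaussian case, the sketch below simulates data (the mean `1.5` and standard deviation `0.8` are arbitrary assumptions), computes the two sample moments, and confirms both that $\sigma^2_{\text{ML}}$ from the moment form equals the centred form and that $-\nabla \ln g(\boldsymbol{\eta}_{\text{ML}})$, evaluated with the gradient derived above, reproduces the averaged sufficient statistics.

```python
import math
import random

random.seed(0)
xs = [random.gauss(1.5, 0.8) for _ in range(5_000)]  # simulated, illustrative data
n = len(xs)

# Averaged sufficient statistics u(x) = [x, x^2].
m1 = sum(xs) / n
m2 = sum(x * x for x in xs) / n

# ML estimates obtained from the moment equations.
mu_ml = m1
var_ml = m2 - m1 ** 2

# Direct ML variance estimate: mean squared deviation from mu_ml.
var_direct = sum((x - mu_ml) ** 2 for x in xs) / n

# Map the estimates to natural parameters and evaluate -grad ln g(eta).
eta1 = mu_ml / var_ml
eta2 = -1.0 / (2.0 * var_ml)
neg_grad = (
    -eta1 / (2.0 * eta2),
    -(1.0 / (2.0 * eta2) - eta1 ** 2 / (4.0 * eta2 ** 2)),
)

print(var_ml, var_direct)
print(neg_grad, (m1, m2))
```

Both printed pairs should agree up to floating-point rounding.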