In [1]:
%%javascript
MathJax.Hub.Config({
    TeX: { equationNumbers: { autoNumber: "AMS" } }
});

This demo estimates the mean $\mu$ of a univariate Gaussian with known variance ($\sigma^2$) using maximum likelihood. The steps are:

1. Derive Fisher information.
1. Sample data from $N(\mu, \sigma^2)$ multiple times.
1. Plot the log-likelihood and its first- and second-order derivatives as functions of the $\mu$ estimate for different samples.

# Theory

The likelihood is

\begin{align}
L
&= \prod_{i=1}^N p(x_i; \mu) \\
&= \prod_{i=1}^N \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left[ - \frac{1}{2 \sigma^2} (x_i - \mu)^2 \right] \\
&= \left( 2 \pi \sigma^2 \right)^{- \frac{N}{2}} \exp \left[ - \frac{1}{2 \sigma^2} \sum_{i=1}^N (x_i - \mu)^2 \right]
\end{align}

The log-likelihood (whose first term does not depend on $\mu$) and its first- and second-order derivatives with respect to $\mu$ are

\begin{align}
\ell &= -\frac{N}{2} \log \left(2 \pi \sigma^2 \right) - \frac{1}{2\sigma^2} \sum_{i=1}^N (x_i - \mu)^2 \\
\nabla_{\mu} \ell &= \frac{1}{\sigma^2} \sum_{i=1}^N (x_i - \mu) \\
\nabla^2_{\mu} \ell &= - \frac{N}{\sigma^2}
\end{align}
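
As a sanity check on the derivatives above, the following sketch compares them with central finite differences of the log-likelihood. The names `log_lik`, `mu0`, and `h`, the seed, and the evaluation point are illustrative choices, not part of the demo itself. Because $\ell$ is quadratic in $\mu$, the finite differences should agree with the analytic expressions up to floating-point error.

```python
import numpy as np


def log_lik(xs, mu, sigma):
    """Gaussian log-likelihood of xs as a function of mu, for known sigma."""
    n = len(xs)
    return -n / 2 * np.log(2 * np.pi * sigma**2) - np.sum((xs - mu) ** 2) / (2 * sigma**2)


rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
sigma, n = 1.0, 5
xs = rng.normal(loc=77.7, scale=sigma, size=n)

mu0 = 77.0  # arbitrary point at which to differentiate
h = 1e-3    # finite-difference step

# Central finite differences of the log-likelihood.
num_first = (log_lik(xs, mu0 + h, sigma) - log_lik(xs, mu0 - h, sigma)) / (2 * h)
num_second = (
    log_lik(xs, mu0 + h, sigma) - 2 * log_lik(xs, mu0, sigma) + log_lik(xs, mu0 - h, sigma)
) / h**2

# Analytic derivatives from the equations above.
ana_first = np.sum(xs - mu0) / sigma**2
ana_second = -n / sigma**2

print(f"first derivative:  numeric={num_first:.6f}, analytic={ana_first:.6f}")
print(f"second derivative: numeric={num_second:.6f}, analytic={ana_second:.6f}")
```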

Then, the Fisher information is the expectation of the squared first-order derivative (the score):

\begin{align}
\mathcal{I}(\mu)
&= \mathbb{E}_X\left[ \left( \nabla_{\mu} \ell \right)^2 \right ] \\
&= \frac{1}{\sigma^4} \mathbb{E}_X\left[ \left( \sum_{i=1}^N (x_i - \mu) \right)^2 \right] \\
&= \frac{1}{\sigma^4} N \sigma^2 \\
&= \frac{N}{\sigma^2}
\end{align}

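Note that in going from the second to the third equality, expanding the square gives $N$ diagonal terms $\mathbb{E}_X[(x_i - \mu)^2] = \sigma^2$, while every cross term $\mathbb{E}_X[(x_i - \mu)(x_j - \mu)]$ with $i \neq j$, i.e. the covariance, equals 0 because $x_i$ is independent of $x_j$.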

Equivalently, Fisher information can also be written as the negative expectation of the second-order derivative:

\begin{align}
\mathcal{I}(\mu)
&= - \mathbb{E}_X \left[ \nabla^2_{\mu} \ell \right ] \\
&= \frac{N}{\sigma^2}
\end{align}

So the Fisher information of $\mu$ for a Gaussian distribution with known variance is the sample size divided by the variance, $N / \sigma^2$.
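
The closed form $N / \sigma^2$ can also be checked by simulation. The sketch below is a minimal Monte Carlo estimate of $\mathbb{E}_X[(\nabla_{\mu} \ell)^2]$ evaluated at the true $\mu$; the helper names `score` and `monte_carlo_fisher`, the number of draws, and the seed are assumptions made for illustration.

```python
import numpy as np


def score(xs, mu, sigma):
    """First derivative of the log-likelihood w.r.t. mu, one value per row of xs."""
    return np.sum(xs - mu, axis=-1) / sigma**2


def monte_carlo_fisher(mu, sigma, n, num_draws=200_000, seed=0):
    """Average the squared score over many independent samples of size n."""
    rng = np.random.default_rng(seed)
    xs = rng.normal(loc=mu, scale=sigma, size=(num_draws, n))
    return np.mean(score(xs, mu, sigma) ** 2)


mu_true, sigma_true, n = 77.7, 1.0, 5
estimate = monte_carlo_fisher(mu_true, sigma_true, n)
exact = n / sigma_true**2
print(f"Monte Carlo: {estimate:.3f}, closed form: {exact:.3f}")
```

The estimate should approach the closed form as `num_draws` grows; the second-derivative form needs no simulation here because $\nabla^2_{\mu} \ell$ does not depend on the data.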

# Demo

In [2]:
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [3]:
from numpy.random import default_rng

In [4]:
rng = default_rng()

Parameters of the true distribution from which the data are sampled.

In [5]:
μ = 77.7
σ = 1

In [6]:
# xs = rng.normal(loc=μ, scale=σ, size=100000)

Log-likelihood, including the term that is constant in $\mu$.

In [7]:
def log_likelihood(xs, mu, sigma):
    """Gaussian log-likelihood of xs as a function of mu, for known sigma."""
    n = len(xs)
    return -n / 2 * np.log(2 * np.pi * sigma**2) - np.sum((xs - mu) ** 2) / (2 * sigma**2)

In [8]:
def first_order_derivative(xs, mu, sigma):
    """Score: first derivative of the log-likelihood with respect to mu."""
    return np.sum(xs - mu) / sigma**2

In [9]:
def second_order_derivative(xs, mu, sigma):
    """Note, second order derivative is independent of mu."""
    return - len(xs) / sigma ** 2

Compute the log-likelihood and its derivatives over a grid of candidate $\mu$ values for several samples, then plot each as a function of the $\mu$ estimate.

In [10]:
dfs = []

μ_estimates = np.linspace(μ - 5, μ + 5, 30)
ml_estimates = []  # To collect maximum likelihood estimates of μ.
num_samples = 3
sample_size = 5
for index in range(num_samples):
    xs = rng.normal(loc=μ, scale=σ, size=sample_size)
    ml_estimates.append(np.mean(xs))
    for μ_estimate in μ_estimates:
        _df = pd.DataFrame(
            {
                "μ_hat": [μ_estimate],
                "log_likelihood": log_likelihood(xs, μ_estimate, sigma=σ),
                "first_order_derivative": first_order_derivative(xs, μ_estimate, sigma=σ),
                "second_order_derivative": second_order_derivative(xs, μ_estimate, sigma=σ),
            }
        ).assign(dataset_id=index)
        dfs.append(_df)

ndf = pd.concat(dfs).reset_index(drop=True)

In [11]:
chart_lines = (
    alt.Chart(ndf)
    .mark_line()
    .encode(x="μ_hat", y="log_likelihood", color="dataset_id:N")
)

chart_ml_estimates = (
    alt.Chart(pd.DataFrame(np.array([ml_estimates]).T, columns=["ml_estimate"]))
    .mark_rule()
    .encode(x="ml_estimate")
)

chart_rule = (
    alt.Chart(pd.DataFrame([[μ]], columns=["truth"]))
    .mark_rule(color="red", strokeDash=[5, 5])
    .encode(x="truth")
)

In [12]:
chart_lines + chart_ml_estimates + chart_rule

In [13]:
chart_lines = (
    alt.Chart(ndf)
    .mark_line()
    .encode(x="μ_hat", y="first_order_derivative", color="dataset_id:N")
)

In [14]:
chart_lines + chart_ml_estimates + chart_rule

In [15]:
chart_lines = (
    alt.Chart(ndf)
    .mark_line()
    .encode(x="μ_hat", y="second_order_derivative", color="dataset_id:N")
)

In [16]:
chart_lines + chart_ml_estimates + chart_rule

**Intuition**: The log-likelihood is concave around its maximum, so the second-order derivative is negative. The more negative the second-order derivative, the greater the curvature at the maximum-likelihood estimate of the parameters $\theta$, which suggests the data contain more information about $\theta$, i.e. higher Fisher information.
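
To make the curvature argument concrete, the sketch below measures how far the log-likelihood drops when $\mu$ moves a fixed distance away from the maximum-likelihood estimate, for several sample sizes. The offset `delta`, the sample sizes, and the seed are illustrative assumptions. For this model the drop equals $N \delta^2 / (2 \sigma^2)$, so it grows linearly with the Fisher information $N / \sigma^2$.

```python
import numpy as np


def log_lik(xs, mu, sigma):
    """Gaussian log-likelihood of xs as a function of mu, for known sigma."""
    n = len(xs)
    return -n / 2 * np.log(2 * np.pi * sigma**2) - np.sum((xs - mu) ** 2) / (2 * sigma**2)


rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
sigma, delta = 1.0, 0.5         # delta: distance from the ML estimate (assumed)

drops = {}
for n in (5, 50, 500):
    xs = rng.normal(loc=77.7, scale=sigma, size=n)
    mu_hat = xs.mean()  # maximum-likelihood estimate of mu
    # Drop in log-likelihood when moving delta away from the peak.
    drops[n] = log_lik(xs, mu_hat, sigma) - log_lik(xs, mu_hat + delta, sigma)
    expected = n * delta**2 / (2 * sigma**2)
    print(f"N={n:>3}: drop={drops[n]:.4f}, N*delta^2/(2*sigma^2)={expected:.4f}")
```

A larger sample makes the peak sharper, so the same displacement in $\mu$ costs more log-likelihood.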

# References

* Fisher Information and Cramer-Rao Lower Bound: Part 1 (https://wp.nyu.edu/kexinhuang/2018/08/16/fisher/).
* For the Fisher information of a multivariate Gaussian with a known covariance matrix, see the **Multivariate Normal Mean** section of https://www2.stat.duke.edu/courses/Spring16/sta532/lec/fish.pdf.