## Sufficient Statistics

The next task relies on the concept of a sufficient statistic, which later lectures cover in more detail.

A function of the sample $T(x_{[n]})$ is called a **sufficient statistic** for a parametric family of distributions $p(x|\theta)$, $\theta \in \Theta$, for a sample of size $n$, if:

$$
p(x_{[n]} | T(x_{[n]}), \theta) = p(x_{[n]} | T(x_{[n]}) ) \quad \text{for all } \theta \in \Theta.
$$

Checking this directly is unpleasant, but the Fisher-Neyman factorization theorem gives a simpler criterion:  
$T$ is a sufficient statistic **if and only if** the likelihood can be written as:

$$
p(x_{[n]} | \theta) = g(T(x_{[n]}), \theta) \cdot h(x_{[n]})
$$

for some functions $g$ and $h$, where $g$ depends only on $T$ and $\theta$, and $h$ is independent of the parameter.

### Why is this useful?

Roughly speaking, a **sufficient statistic contains all information about the parameter present in the sample**.

Formally, this means the posterior distribution of the parameter given the sample equals the posterior given the sufficient statistic:

$$
p(\theta | x_{[n]}) =
\frac{p(x_{[n]} | \theta) p(\theta)}{\int p(x_{[n]} | \theta) p(\theta) d\theta}
=
\frac{p(T | \theta) p(\theta)}{\int p(T | \theta) p(\theta) d\theta}
= p(\theta | T).
$$

So, to compute the posterior, we don’t need the whole sample — only the sufficient statistic and sample size.
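
The middle equality above skips a step. As a sketch, assuming the factorization $p(x_{[n]} | \theta) = g(T, \theta) \cdot h(x_{[n]})$ holds, the factor $h$ does not depend on $\theta$, so it cancels between the numerator and the denominator:

$$
p(\theta | x_{[n]}) =
\frac{g(T, \theta)\, h(x_{[n]})\, p(\theta)}{\int g(T, \theta')\, h(x_{[n]})\, p(\theta')\, d\theta'}
=
\frac{g(T, \theta)\, p(\theta)}{\int g(T, \theta')\, p(\theta')\, d\theta'}
$$

The right-hand side depends on the sample only through $T$. Informally, $p(T | \theta)$ is proportional to $g(T, \theta)$ as a function of $\theta$ (sum or integrate the factorization over all samples that share the same value of $T$), which gives the expression with $p(T | \theta)$.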

---

## Example: Normal Distribution with Known Variance

Consider $x_1, \dots, x_n$ drawn independently from $\mathcal{N}(\mu, \sigma^2)$, where $\sigma^2$ is known and the prior is $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$. The likelihood is:

$$
p(x_{[n]} | \mu) = \prod_i \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} =
\left[ \frac{1}{(2\pi)^{n/2} \sigma^n} e^{-\frac{1}{2\sigma^2} \sum_i x_i^2} \right]
\left[ e^{-\frac{1}{2\sigma^2}(n\mu^2 - 2\mu \sum_i x_i)} \right]
$$

The first bracket depends only on $x_{[n]}$ (and the known $\sigma^2$); the second depends on the sample only through $\sum_i x_i$.  
Thus, $\sum_i x_i$ is a sufficient statistic, and so is the sample mean $\bar{x}$, since $n$ is known.

> If variance is known, it's enough to store the sample size and mean to compute the posterior.
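
The claim can be checked numerically. The following is a minimal sketch, not part of the original task: the helper names `grid_posterior` and `conjugate_posterior`, the two toy samples, and the prior values are all hypothetical. It computes the posterior of $\mu$ on a grid from the full sample and compares it with the standard closed-form normal posterior, which uses only $n$ and $\bar{x}$.

```python
import numpy as np
import scipy.stats as st

def grid_posterior(x, mu_grid, sigma, mu0, sigma0):
    """Posterior of mu on a grid, computed from the full sample (sums to 1)."""
    # Log-likelihood of the whole sample at every grid value of mu
    log_lik = st.norm.logpdf(x[:, None], loc=mu_grid[None, :], scale=sigma).sum(axis=0)
    log_post = log_lik + st.norm.logpdf(mu_grid, loc=mu0, scale=sigma0)
    post = np.exp(log_post - log_post.max())  # subtract max for numerical stability
    return post / post.sum()

def conjugate_posterior(n, xbar, sigma, mu0, sigma0):
    """Closed-form normal posterior (mean, std) using only n and the sample mean."""
    precision = 1.0 / sigma0**2 + n / sigma**2
    mean = (mu0 / sigma0**2 + n * xbar / sigma**2) / precision
    return mean, np.sqrt(1.0 / precision)

# Hypothetical samples: different values, same size and same mean (2.5)
x_a = np.array([1.0, 2.0, 3.0, 4.0])
x_b = np.array([2.5, 2.5, 0.0, 5.0])

sigma, mu0, sigma0 = 1.0, 0.0, 10.0  # assumed known std and prior parameters
mu_grid = np.linspace(-2.0, 7.0, 4501)

post_a = grid_posterior(x_a, mu_grid, sigma, mu0, sigma0)
post_b = grid_posterior(x_b, mu_grid, sigma, mu0, sigma0)
grid_mean = np.sum(mu_grid * post_a)
exact_mean, exact_sd = conjugate_posterior(len(x_a), x_a.mean(), sigma, mu0, sigma0)

print(np.allclose(post_a, post_b))  # same n and mean should give the same posterior
print(grid_mean, exact_mean)        # grid estimate vs. closed form
```

The two samples differ in $\sum_i x_i^2$, but that term only enters the factor $h$ and disappears after normalization.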

---

## Example: Normal Distribution with Unknown Variance

If the variance is unknown, the likelihood depends on the parameters $\mu, \sigma^2$ and on the sample only through $\sum_i x_i$ and $\sum_i x_i^2$, so we can take:

$$
T = \left( \sum_i x_i, \sum_i x_i^2 \right), \quad
g(T, \mu, \sigma^2) = p(x_{[n]} | \mu, \sigma^2), \quad
h \equiv 1
$$

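So $(\sum_i x_i, \sum_i x_i^2)$ is sufficient. For a fixed $n$, the pair $(\bar{x}, s^2)$, where $s^2 = \frac{1}{n-1} \sum_i (x_i - \bar{x})^2$ is the corrected sample variance, is a one-to-one function of these sums, so it is sufficient as well.

> If variance is unknown, it's enough to store the sample size, the sample mean, and the unbiased sample variance.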

In particular, the parameters of $p(\mu | \sigma^2, x_{[n]})$ and $p(\sigma^2 | x_{[n]})$ depend on the sample only via $\bar{x}, s^2, n$.
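
To illustrate the equivalence between the two forms of the statistic, here is a small sketch that recovers $(\bar{x}, s^2)$ from $(n, \sum_i x_i, \sum_i x_i^2)$. The function name `mean_and_var_from_sums` and the toy data are hypothetical, chosen only for this example.

```python
import numpy as np

def mean_and_var_from_sums(n, sum_x, sum_x2):
    """Recover the sample mean and corrected variance from the two sums."""
    xbar = sum_x / n
    # sum (x_i - xbar)^2 = sum x_i^2 - n * xbar^2
    s2 = (sum_x2 - n * xbar**2) / (n - 1)
    return xbar, s2

# Toy data, used only to compare against NumPy's own estimators
x = np.array([0.9, 1.1, 1.3, 0.8, 1.2])
xbar, s2 = mean_and_var_from_sums(len(x), x.sum(), (x**2).sum())

print(np.isclose(xbar, x.mean()), np.isclose(s2, x.var(ddof=1)))
```

This one-pass formula can lose precision when the mean is large relative to the spread, so it is meant as an illustration rather than a recommended way to compute a variance.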

---

## Application: Calcium Outflow Experiment

An experiment was conducted to study the effect of magnetic fields on calcium outflow from chicken brains.  
Two groups of chickens were used: a **control group** of 32 and a **treatment group** of 36.  
Each chicken had one measurement taken — the goal was to estimate the mean flow rates  
$\mu_c$ for the control group and $\mu_t$ for the treatment group.

- Control group (n = 32): sample mean = **1.013**, corrected std dev = **0.24**  
- Treatment group (n = 36): sample mean = **1.173**, corrected std dev = **0.20**

Assume the measurements are independent and normally distributed:

$$
\text{Control: } X_i \sim \mathcal{N}(\mu_c, \sigma_c^2), \quad
\text{Treatment: } Y_i \sim \mathcal{N}(\mu_t, \sigma_t^2)
$$

With uninformative priors:

$$
p(\mu_c, \sigma_c^2, \mu_t, \sigma_t^2) \propto \frac{1}{\sigma_c^2 \sigma_t^2}
$$

Estimate the **central 95% credible interval** for the difference:

$$
\mu_t - \mu_c
$$

and report the lower and upper bounds (separated by a space).

**Acceptable margin of error:** 0.01
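
The solution below relies on a standard result for the normal model with prior $p(\mu, \sigma^2) \propto 1/\sigma^2$, stated here without derivation. For a single group with sample size $n$, sample mean $\bar{x}$, and corrected variance $s^2$:

$$
\sigma^2 | x_{[n]} \sim \text{Inv-Gamma}\left( \frac{n-1}{2}, \frac{(n-1) s^2}{2} \right), \quad
\mu | \sigma^2, x_{[n]} \sim \mathcal{N}\left( \bar{x}, \frac{\sigma^2}{n} \right)
$$

Since both the prior and the likelihood factorize across the two groups, the groups' posteriors are independent, and the difference $\mu_t - \mu_c$ can be simulated by drawing each group separately.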


```python
import numpy as np
import scipy.stats as st

# ---------------------------------------------------------
#  Data from the problem
# ---------------------------------------------------------
# Control group: n=32, mean=1.013, corrected std=0.24
n_c = 32
mean_c = 1.013
sd_c = 0.24
var_c = sd_c**2  # corrected sample variance

# Treatment group: n=36, mean=1.173, corrected std=0.20
n_t = 36
mean_t = 1.173
sd_t = 0.20
var_t = sd_t**2

# ---------------------------------------------------------
#  Posterior parameters for sigma^2 (Inverse-Gamma)
# ---------------------------------------------------------
# For group i: alpha_i = (n_i - 1)/2
#              scale_i = (n_i - 1)*s_i^2 / 2
alpha_c = (n_c - 1) / 2.0
scale_c = ((n_c - 1) * var_c) / 2.0

alpha_t = (n_t - 1) / 2.0
scale_t = ((n_t - 1) * var_t) / 2.0
```


```python
# ---------------------------------------------------------
#  Monte Carlo sampling
# ---------------------------------------------------------
rng = np.random.default_rng(42)
N = 200_000  # number of posterior draws

# 1) Sample sigma_c^2, sigma_t^2 from Inverse-Gamma
#    scipy.stats.invgamma takes parameters (a=shape, scale=scale)
sigma2_c_samples = st.invgamma(a=alpha_c, scale=scale_c).rvs(size=N, random_state=rng)
sigma2_t_samples = st.invgamma(a=alpha_t, scale=scale_t).rvs(size=N, random_state=rng)
```


```python
# 2) Given sigma_c^2, sample mu_c ~ Normal( mean_c, sigma_c^2 / n_c )
mu_c_samples = rng.normal(
    loc=mean_c,
    scale=np.sqrt(sigma2_c_samples / n_c)
)

#    Given sigma_t^2, sample mu_t ~ Normal( mean_t, sigma_t^2 / n_t )
mu_t_samples = rng.normal(
    loc=mean_t,
    scale=np.sqrt(sigma2_t_samples / n_t)
)
```


```python
# 3) Compute the difference
diff_samples = mu_t_samples - mu_c_samples

# 4) Extract 2.5% and 97.5% quantiles
lower_95, upper_95 = np.percentile(diff_samples, [2.5, 97.5])

print(f"95% credible interval for (mu_t - mu_c): [{lower_95:.3f}, {upper_95:.3f}]")
```

```text
95% credible interval for (mu_t - mu_c): [0.051, 0.270]
```
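
As a cross-check, the variance can be integrated out analytically. Under the prior above, a standard result is that the marginal posterior of each mean is a Student's t distribution with $n - 1$ degrees of freedom, location $\bar{x}$, and scale $s / \sqrt{n}$. The sketch below samples the two means directly from these marginals; it is an alternative route to the same interval, not the method used above, and its seed and variable names are chosen only for this example.

```python
import numpy as np
import scipy.stats as st

# Summary statistics from the problem statement
n_c, mean_c, sd_c = 32, 1.013, 0.24
n_t, mean_t, sd_t = 36, 1.173, 0.20

rng = np.random.default_rng(0)
N = 200_000  # number of posterior draws

# Marginal posterior of each mean: Student's t with n-1 degrees of freedom,
# location xbar and scale s / sqrt(n)
mu_c = st.t(df=n_c - 1, loc=mean_c, scale=sd_c / np.sqrt(n_c)).rvs(size=N, random_state=rng)
mu_t = st.t(df=n_t - 1, loc=mean_t, scale=sd_t / np.sqrt(n_t)).rvs(size=N, random_state=rng)

# Central 95% interval for the difference of the two independent means
lower, upper = np.percentile(mu_t - mu_c, [2.5, 97.5])
print(f"95% credible interval for (mu_t - mu_c): [{lower:.3f}, {upper:.3f}]")
```

Up to Monte Carlo error, the bounds should agree with the two-stage sampler within the stated margin of 0.01.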
