# Normal Distribution (`norm`)

The normal (Gaussian) distribution is the canonical model for **additive noise** and **aggregated effects**.

It appears throughout statistics and machine learning via the **Central Limit Theorem**, as the distribution of **measurement errors**, and as the maximum-entropy distribution under mean/variance constraints.


## Notebook roadmap
1) Title & classification
2) Intuition & motivation
3) Formal definition (PDF/CDF)
4) Moments & properties
5) Parameter interpretation
6) Derivations ($\mathbb{E}[X]$, $\mathrm{Var}(X)$, likelihood)
7) Sampling & simulation (NumPy-only)
8) Visualization (PDF, CDF, Monte Carlo)
9) SciPy integration (`scipy.stats.norm`)
10) Statistical use cases
11) Pitfalls
12) Summary


In [None]:
import math
import os

import numpy as np
import scipy
from scipy import special, stats

import plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

SEED = 7
rng = np.random.default_rng(SEED)

np.set_printoptions(precision=4, suppress=True)

print("numpy ", np.__version__)
print("scipy ", scipy.__version__)
print("plotly", plotly.__version__)


## Prerequisites & notation

**Prerequisites**
- comfort with basic calculus (integration by parts)
- basic probability (PDF/CDF, expectation, likelihood)

**Notation**
- $X \sim \mathcal{N}(\mu, \sigma^2)$ means: mean $\mu\in\mathbb{R}$ and variance $\sigma^2$, i.e. standard deviation $\sigma>0$.
- $Z \sim \mathcal{N}(0,1)$ denotes the **standard normal**.
- $\varphi$ and $\Phi$ denote the standard normal **PDF** and **CDF**.

SciPy uses a **location–scale** parameterization: `stats.norm(loc=μ, scale=σ)`.


## 1) Title & classification

- **Name**: `norm` (Normal / Gaussian distribution)
- **Type**: **continuous**
- **Support**: $x \in (-\infty, \infty)$
- **Parameter space**:
  - location (mean): $\mu \in \mathbb{R}$
  - scale (std dev): $\sigma \in (0, \infty)$

Equivalent parameterizations you’ll also see:
- variance $\sigma^2 > 0$
- precision $\tau = 1/\sigma^2 > 0$
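Since SciPy's `scale` is the standard deviation, a variance or precision has to be converted before it is passed in. A minimal sketch of those conversions follows; the helper names `scale_from_variance` and `scale_from_precision` are hypothetical and not part of any library.

```python
import math


def scale_from_variance(variance: float) -> float:
    """Convert a variance σ² into the standard deviation σ (hypothetical helper)."""
    if variance <= 0:
        raise ValueError("variance must be > 0")
    return math.sqrt(variance)


def scale_from_precision(precision: float) -> float:
    """Convert a precision τ = 1/σ² into the standard deviation σ (hypothetical helper)."""
    if precision <= 0:
        raise ValueError("precision must be > 0")
    return 1.0 / math.sqrt(precision)


# Both describe the same distribution: σ² = 4 and τ = 1/4 give σ = 2.
sigma_from_var = scale_from_variance(4.0)
sigma_from_prec = scale_from_precision(0.25)
print(sigma_from_var, sigma_from_prec)
```

Either result can then be used as the `scale` argument of `stats.norm`.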


## 2) Intuition & motivation

### What it models
The normal distribution often models **the sum of many small, independent effects**.
A classic mental model is **measurement error**:

$\text{observed} = \text{true signal} + \text{noise}$, where the noise is approximately Gaussian.

Two key reasons it shows up so often:
1) **Central Limit Theorem (CLT):** standardized sums of many independent (or only weakly dependent) random variables with finite variance tend toward a normal distribution.
2) **Maximum entropy:** among all continuous distributions with a fixed mean and variance, the normal has the largest differential entropy (it is the “least informative” choice under those constraints).
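The CLT claim can be checked with a small simulation. The sketch below sums independent Uniform(0,1) draws, standardizes the sums, and measures how much mass falls within one and two standard deviations. The seed, sample sizes, and variable names are illustrative assumptions.

```python
import numpy as np

rng_clt = np.random.default_rng(0)  # illustrative seed
n_terms, n_reps = 50, 20_000

# Each Uniform(0,1) term has mean 1/2 and variance 1/12.
u = rng_clt.random((n_reps, n_terms))
sums = u.sum(axis=1)

# Standardize: subtract the mean of the sum, divide by its standard deviation.
z_clt = (sums - n_terms * 0.5) / np.sqrt(n_terms / 12.0)

# For a standard normal these fractions are roughly 0.68 and 0.95.
frac_within_1 = float(np.mean(np.abs(z_clt) <= 1.0))
frac_within_2 = float(np.mean(np.abs(z_clt) <= 2.0))
print(frac_within_1, frac_within_2)
```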

### Typical real-world use cases
- **Sensors & experiments:** additive noise in physical measurements
- **Averages/aggregates:** sampling distributions of means (often approximately normal)
- **Error models:** regression residuals, Kalman filters, Gaussian processes
- **Latent-variable models:** Gaussian priors and Gaussian likelihoods (conjugacy)

### Relations to other distributions
- Standardization: if $X \sim \mathcal{N}(\mu,\sigma^2)$, then $(X-\mu)/\sigma \sim \mathcal{N}(0,1)$.
- Chi-square: if $Z \sim \mathcal{N}(0,1)$, then $Z^2 \sim \chi^2_1$.
- Additivity: sums of independent normals are normal (means/variances add).
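- Student-$t$: if $Z \sim \mathcal{N}(0,1)$ and $V \sim \chi^2_\nu$ are independent, then $Z/\sqrt{V/\nu}$ follows a Student-$t$ distribution with $\nu$ degrees of freedom.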
- Lognormal: if $Y \sim \mathcal{N}(\mu,\sigma^2)$, then $\exp(Y)$ is lognormal.
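Two of these relations are easy to confirm by simulation. The sketch below checks that $Z^2$ has the mean and variance of a $\chi^2_1$ variable (1 and 2), and that the sum of two independent normals has the summed mean and variance. The parameter values, seed, and names are illustrative assumptions.

```python
import numpy as np

rng_rel = np.random.default_rng(1)  # illustrative seed
n_rel = 200_000

# Chi-square relation: Z² should have mean 1 and variance 2.
z_rel = rng_rel.standard_normal(n_rel)
z_squared = z_rel**2

# Additivity: N(1, 2²) + N(-3, 1.5²) should be N(-2, 4 + 2.25).
a = rng_rel.normal(1.0, 2.0, n_rel)
b = rng_rel.normal(-3.0, 1.5, n_rel)
total = a + b

print(z_squared.mean(), z_squared.var())
print(total.mean(), total.var())
```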


## 3) Formal definition

Let $X \sim \mathcal{N}(\mu, \sigma^2)$ with $\mu\in\mathbb{R}$ and $\sigma>0$.

### PDF
\[
f(x\mid\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),\qquad x\in\mathbb{R}.
\]

For the standard normal $Z\sim\mathcal{N}(0,1)$, the PDF is
\[
\varphi(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}.
\]

### CDF
The CDF is
\[
F(x\mid\mu,\sigma) = \mathbb{P}(X\le x) = \Phi\!\left(\frac{x-\mu}{\sigma}\right),
\]
where $\Phi$ is the standard normal CDF.

There is no elementary closed form, but it can be written using the error function:
\[
\Phi(z) = \tfrac{1}{2}\left(1 + \operatorname{erf}\!\left(\tfrac{z}{\sqrt{2}}\right)\right).
\]
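The error-function identity translates directly into code using only the standard library. This is a minimal sketch; the function name `std_norm_cdf` is hypothetical, and for extreme negative `z` a log-space routine is the safer choice.

```python
import math


def std_norm_cdf(z: float) -> float:
    """Standard normal CDF via Φ(z) = (1 + erf(z / √2)) / 2 (hypothetical helper)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_cdf_erf(x: float, mu: float, sigma: float) -> float:
    """CDF of N(mu, sigma²), obtained by standardizing x first."""
    if sigma <= 0:
        raise ValueError("sigma must be > 0")
    return std_norm_cdf((x - mu) / sigma)


print(std_norm_cdf(0.0))             # the median of the standard normal
print(std_norm_cdf(1.96))            # close to 0.975
print(norm_cdf_erf(0.7, 0.7, 1.3))   # evaluating at the mean
```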


## 4) Moments & properties

For $X \sim \mathcal{N}(\mu, \sigma^2)$:

### Moments
- **Mean**: $\mathbb{E}[X] = \mu$
- **Variance**: $\mathrm{Var}(X) = \sigma^2$
- **Skewness**: $0$ (symmetry)
- **Kurtosis**: $3$ (excess kurtosis $0$)
- **Median / mode**: $\mu$

### MGF and characteristic function
- **MGF** (all real $t$):
\[
M_X(t) = \mathbb{E}[e^{tX}] = \exp\!\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right).
\]

- **Characteristic function**:
\[
\varphi_X(t) = \mathbb{E}[e^{itX}] = \exp\!\left(i\mu t - \tfrac{1}{2}\sigma^2 t^2\right).
\]

### Entropy (differential, in nats)
\[
H(X) = \tfrac{1}{2}\ln\!\left(2\pi e\,\sigma^2\right).
\]
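The MGF and entropy formulas can be sanity-checked by numerical integration on a fine grid. The sketch below uses a plain Riemann sum, which is very accurate for smooth integrands that decay to zero at both ends. The parameter values, the choice $t=0.4$, and all names are illustrative assumptions.

```python
import numpy as np

mu_h, sigma_h = 0.7, 1.3  # illustrative parameters
x_h = np.linspace(mu_h - 12 * sigma_h, mu_h + 12 * sigma_h, 200_001)
dx = x_h[1] - x_h[0]

z_h = (x_h - mu_h) / sigma_h
log_pdf_h = -0.5 * z_h**2 - np.log(sigma_h * np.sqrt(2.0 * np.pi))
pdf_h = np.exp(log_pdf_h)

# Differential entropy: H = -∫ f(x) ln f(x) dx
entropy_numeric = float(-np.sum(pdf_h * log_pdf_h) * dx)
entropy_formula = 0.5 * np.log(2.0 * np.pi * np.e * sigma_h**2)

# MGF at t: M(t) = ∫ exp(t x) f(x) dx
t = 0.4
mgf_numeric = float(np.sum(np.exp(t * x_h) * pdf_h) * dx)
mgf_formula = np.exp(mu_h * t + 0.5 * sigma_h**2 * t**2)

print(entropy_numeric, entropy_formula)
print(mgf_numeric, mgf_formula)
```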

### Other notable properties
- **Affine invariance**: if $Y=aX+b$ with $a\neq 0$, then $Y$ is normal with mean $a\mu+b$ and variance $a^2\sigma^2$.
- **Additivity**: sums of independent normals are normal; means and variances add (mean vectors and covariance matrices in the multivariate case).
- **Maximum entropy** under fixed mean/variance constraints.


In [None]:
SQRT_2PI = math.sqrt(2.0 * math.pi)


def norm_pdf(x: np.ndarray, loc: float = 0.0, scale: float = 1.0) -> np.ndarray:
    """PDF of N(loc, scale²), evaluated elementwise at x."""
    x = np.asarray(x, dtype=float)
    if scale <= 0:
        raise ValueError("scale must be > 0")
    z = (x - loc) / scale
    return np.exp(-0.5 * z**2) / (scale * SQRT_2PI)


def norm_cdf(x: np.ndarray, loc: float = 0.0, scale: float = 1.0) -> np.ndarray:
    """CDF of N(loc, scale²), via the standard normal CDF `special.ndtr`."""
    x = np.asarray(x, dtype=float)
    if scale <= 0:
        raise ValueError("scale must be > 0")
    z = (x - loc) / scale
    return special.ndtr(z)


def norm_logpdf(x: np.ndarray, loc: float = 0.0, scale: float = 1.0) -> np.ndarray:
    """Log-PDF of N(loc, scale²), computed in log-space to avoid underflow."""
    x = np.asarray(x, dtype=float)
    if scale <= 0:
        raise ValueError("scale must be > 0")
    z = (x - loc) / scale
    return -0.5 * z**2 - math.log(scale) - 0.5 * math.log(2.0 * math.pi)


def norm_loglik(loc: float, scale: float, x: np.ndarray) -> float:
    """Log-likelihood of iid data x; -inf for invalid scale or non-finite data."""
    x = np.asarray(x, dtype=float)
    if scale <= 0 or np.any(~np.isfinite(x)):
        return -np.inf
    return float(np.sum(norm_logpdf(x, loc=loc, scale=scale)))


def norm_mle(x: np.ndarray) -> tuple[float, float]:
    """MLE for (μ, σ) under iid N(μ, σ²).

    Note: the MLE for σ uses ddof=0 (biased as an estimator of σ).
    """

    x = np.asarray(x, dtype=float)
    mu_hat = float(np.mean(x))
    sigma_hat = float(np.sqrt(np.mean((x - mu_hat) ** 2)))
    return mu_hat, sigma_hat


def sample_norm_box_muller(
    n: int,
    loc: float = 0.0,
    scale: float = 1.0,
    rng: np.random.Generator | None = None,
) -> np.ndarray:
    """NumPy-only sampling via the Box–Muller transform.

    Returns n iid samples from N(loc, scale^2).
    """

    if rng is None:
        rng = np.random.default_rng()
    if n < 0:
        raise ValueError("n must be >= 0")
    if scale <= 0:
        raise ValueError("scale must be > 0")

    m = (n + 1) // 2  # number of (Z0, Z1) pairs
    u1 = rng.random(m)
    u2 = rng.random(m)

    # Avoid log(0) when u1 is exactly 0.
    u1 = np.maximum(u1, np.nextafter(0.0, 1.0))

    r = np.sqrt(-2.0 * np.log(u1))
    theta = 2.0 * math.pi * u2

    z0 = r * np.cos(theta)
    z1 = r * np.sin(theta)

    z = np.empty(2 * m, dtype=float)
    z[0::2] = z0
    z[1::2] = z1
    z = z[:n]

    return loc + scale * z


## 5) Parameter interpretation

### Location $\mu$
- Shifts the distribution left/right.
- $\mu$ is the center of symmetry, and it equals the mean/median/mode.

### Scale $\sigma$
- Controls dispersion: larger $\sigma$ spreads mass out and lowers the peak.
- About 68% / 95% / 99.7% of mass lies within $\mu \pm 1\sigma$, $\mu \pm 2\sigma$, $\mu \pm 3\sigma$ (the “68–95–99.7 rule”).
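The 68–95–99.7 figures follow from the identity $\mathbb{P}(|X-\mu|\le k\sigma) = \operatorname{erf}(k/\sqrt{2})$, which holds for every normal distribution regardless of $\mu$ and $\sigma$. A minimal sketch using only the standard library (the name `mass_within` is hypothetical):

```python
import math


def mass_within(k: float) -> float:
    """P(|X - μ| <= kσ) for any normal X, which equals erf(k / √2)."""
    return math.erf(k / math.sqrt(2.0))


coverage = {k: mass_within(k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within {k}σ: {p:.4f}")
```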

### Shape changes
All normal PDFs are bell-shaped and symmetric; changing $\mu$ shifts the bell, changing $\sigma$ changes its width.


In [None]:
x = np.linspace(-8, 8, 800)

params = [
    (0.0, 1.0),
    (0.0, 2.0),
    (1.5, 1.0),
    (-2.0, 0.6),
]

fig = go.Figure()
for mu, sigma in params:
    fig.add_trace(
        go.Scatter(
            x=x,
            y=norm_pdf(x, loc=mu, scale=sigma),
            mode="lines",
            name=f"μ={mu:g}, σ={sigma:g}",
        )
    )
    fig.add_vline(x=mu, line_dash="dot", opacity=0.25)

fig.update_layout(title="Normal PDFs for different (μ, σ)", xaxis_title="x", yaxis_title="f(x)")
fig

## 6) Derivations

We derive $\mathbb{E}[X]$, $\mathrm{Var}(X)$, and the likelihood/MLE.

### Expectation
For the standard normal $Z\sim\mathcal{N}(0,1)$ with PDF $\varphi(z)$,
\[
\mathbb{E}[Z] = \int_{-\infty}^{\infty} z\,\varphi(z)\,dz.
\]
The integrand $z\,\varphi(z)$ is an **odd function** (since $\varphi$ is even), so the integral over a symmetric domain is $0$.

For $X = \mu + \sigma Z$:
\[
\mathbb{E}[X] = \mu + \sigma\,\mathbb{E}[Z] = \mu.
\]

### Variance
First compute $\mathbb{E}[Z^2]$:
\[
\mathbb{E}[Z^2] = \int_{-\infty}^{\infty} z^2\,\varphi(z)\,dz.
\]
Use the fact that $\varphi'(z) = -z\,\varphi(z)$, so $z\,\varphi(z) = -\varphi'(z)$. Then
\[
\mathbb{E}[Z^2] = \int z^2\varphi(z)\,dz = -\int z\,\varphi'(z)\,dz.
\]
Integrate by parts with $u=z$ and $dv=\varphi'(z)\,dz$:
\[
-\int z\,\varphi'(z)\,dz = -\big[z\,\varphi(z)\big]_{-\infty}^{\infty} + \int \varphi(z)\,dz.
\]
The boundary term is $0$ because $z\,\varphi(z)\to 0$ as $|z|\to\infty$, and $\int \varphi(z)\,dz = 1$. Hence $\mathbb{E}[Z^2]=1$, so $\mathrm{Var}(Z)=1$.

For $X=\mu+\sigma Z$:
\[
\mathrm{Var}(X) = \sigma^2\,\mathrm{Var}(Z) = \sigma^2.
\]
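The integrals in this derivation can be confirmed numerically. The sketch below evaluates $\int\varphi$, $\int z\,\varphi$, and $\int z^2\varphi$ with a Riemann sum on a symmetric grid, then applies the location-scale step. The grid bounds, parameter values, and names are illustrative assumptions.

```python
import numpy as np

z_grid = np.linspace(-12.0, 12.0, 240_001)  # symmetric grid around 0
dz = z_grid[1] - z_grid[0]
phi = np.exp(-0.5 * z_grid**2) / np.sqrt(2.0 * np.pi)

total_mass = float(np.sum(phi) * dz)                     # ∫ φ(z) dz
mean_z = float(np.sum(z_grid * phi) * dz)                # ∫ z φ(z) dz (odd integrand)
second_moment_z = float(np.sum(z_grid**2 * phi) * dz)    # ∫ z² φ(z) dz

# Location-scale step: X = μ + σ Z
mu_d, sigma_d = 1.5, 0.8
mean_x = mu_d + sigma_d * mean_z
var_x = sigma_d**2 * (second_moment_z - mean_z**2)

print(total_mass, mean_z, second_moment_z)
print(mean_x, var_x)
```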

### Likelihood and MLE
For iid data $x_1,\dots,x_n$ from $\mathcal{N}(\mu,\sigma^2)$, the likelihood is
\[
L(\mu,\sigma) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right).
\]
The log-likelihood is
\[
\ell(\mu,\sigma) = -n\ln\sigma - \tfrac{n}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2.
\]
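Setting the partial derivatives to zero,
\[
\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu) = 0,\qquad
\frac{\partial \ell}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^n (x_i-\mu)^2 = 0,
\]
gives the MLEs: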
\[
\hat\mu = \bar x,\qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\bar x)^2.
\]
(The familiar unbiased sample variance uses $n-1$ instead of $n$.)
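The difference between the two divisors is visible in simulation: the MLE $\hat\sigma^2$ has expectation $\tfrac{n-1}{n}\sigma^2$, so it underestimates on average, most noticeably for small $n$. A minimal sketch, with an illustrative seed, sample size, and names:

```python
import numpy as np

rng_bias = np.random.default_rng(2)  # illustrative seed
n_small, n_reps_bias, sigma_true = 5, 200_000, 2.0

# Many small samples, one per row.
draws = rng_bias.normal(0.0, sigma_true, size=(n_reps_bias, n_small))

# Average the two variance estimators over all replications.
var_mle_avg = float(draws.var(axis=1, ddof=0).mean())       # divisor n
var_unbiased_avg = float(draws.var(axis=1, ddof=1).mean())  # divisor n - 1

expected_mle = (n_small - 1) / n_small * sigma_true**2
print(var_mle_avg, expected_mle)
print(var_unbiased_avg, sigma_true**2)
```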


In [None]:
# MLE demo on simulated data
true_mu = 1.5
true_sigma = 0.8
n = 600

x = sample_norm_box_muller(n, loc=true_mu, scale=true_sigma, rng=rng)

mu_hat, sigma_hat = norm_mle(x)

loglik_true = norm_loglik(true_mu, true_sigma, x)
loglik_hat = norm_loglik(mu_hat, sigma_hat, x)

true_mu, true_sigma, mu_hat, sigma_hat, loglik_true, loglik_hat


## 7) Sampling & simulation (NumPy-only)

### Box–Muller transform
Let $U_1, U_2 \sim \mathrm{Uniform}(0,1)$ iid. Define
\[
R = \sqrt{-2\ln U_1},\qquad \Theta = 2\pi U_2.
\]
Then
\[
Z_0 = R\cos\Theta,\qquad Z_1 = R\sin\Theta
\]
are iid $\mathcal{N}(0,1)$. Finally, to sample $X\sim\mathcal{N}(\mu,\sigma^2)$, return $X = \mu + \sigma Z$ for each generated $Z \in \{Z_0, Z_1\}$.

**Numerical note:** NumPy's `rng.random` draws from $[0,1)$, so $U_1=0$ is possible; $\ln U_1$ is then undefined, so we clip $U_1$ away from 0.


In [None]:
# Sampling: compare histogram to the true PDF
mu = 0.7
sigma = 1.3
n = 60_000

samples = sample_norm_box_muller(n, loc=mu, scale=sigma, rng=rng)

x_grid = np.linspace(mu - 4.5 * sigma, mu + 4.5 * sigma, 500)

fig = px.histogram(
    samples,
    nbins=70,
    histnorm="probability density",
    title=f"Monte Carlo samples vs PDF (n={n}, μ={mu:g}, σ={sigma:g})",
    labels={"value": "x"},
)
fig.add_trace(go.Scatter(x=x_grid, y=norm_pdf(x_grid, mu, sigma), mode="lines", name="true pdf"))
fig.update_layout(yaxis_title="density")
fig.show()

samples.mean(), samples.std(ddof=0)


## 8) Visualization (PDF, CDF, Monte Carlo)

We’ll visualize:
- the PDF for multiple $\sigma$ values
- the CDF and an empirical CDF from Monte Carlo samples


In [None]:
# PDF and CDF for multiple scales
mu = 0.0
sigmas = [0.5, 1.0, 2.0]
x = np.linspace(-8, 8, 800)

fig_pdf = go.Figure()
fig_cdf = go.Figure()

for s in sigmas:
    fig_pdf.add_trace(go.Scatter(x=x, y=norm_pdf(x, mu, s), mode="lines", name=f"σ={s:g}"))
    fig_cdf.add_trace(go.Scatter(x=x, y=norm_cdf(x, mu, s), mode="lines", name=f"σ={s:g}"))

fig_pdf.update_layout(title="Normal PDF (μ=0)", xaxis_title="x", yaxis_title="f(x)")
fig_cdf.update_layout(title="Normal CDF (μ=0)", xaxis_title="x", yaxis_title="F(x)")

fig_pdf.show()
fig_cdf.show()


In [None]:
# Empirical CDF vs true CDF
mu = -0.5
sigma = 1.2
n = 25_000
samples = sample_norm_box_muller(n, loc=mu, scale=sigma, rng=rng)

xs = np.sort(samples)
ys = np.arange(1, n + 1) / n

x_grid = np.linspace(mu - 4.5 * sigma, mu + 4.5 * sigma, 600)

fig = go.Figure()
fig.add_trace(go.Scatter(x=xs, y=ys, mode="lines", name="empirical CDF"))
fig.add_trace(go.Scatter(x=x_grid, y=norm_cdf(x_grid, mu, sigma), mode="lines", name="true CDF"))
fig.update_layout(
    title=f"Empirical CDF vs true CDF (n={n}, μ={mu:g}, σ={sigma:g})",
    xaxis_title="x",
    yaxis_title="F(x)",
)
fig

## 9) SciPy integration (`scipy.stats.norm`)

SciPy’s `norm` is parameterized as `stats.norm(loc=μ, scale=σ)`.

Useful methods include:
- `pdf`, `logpdf`
- `cdf`, `sf` (survival function), and the numerically stable `logcdf`, `logsf`
- `ppf` (quantiles)
- `rvs` (sampling)
- `fit` (MLE fitting)
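The quantile and tail methods are inverses of one another, which the short sketch below illustrates: `ppf` inverts `cdf`, and `isf` inverts `sf`. The parameter values and variable names are illustrative.

```python
from scipy import stats

dist_q = stats.norm(loc=0.7, scale=1.3)

q975 = dist_q.ppf(0.975)            # 97.5% quantile
roundtrip = dist_q.cdf(q975)        # maps the quantile back to 0.975
upper_tail = dist_q.sf(q975)        # survival function: 1 - cdf
same_quantile = dist_q.isf(0.025)   # inverse survival function

print(q975, roundtrip, upper_tail, same_quantile)
```

For upper-tail probabilities, `sf` is preferable to computing `1 - cdf` by hand, since the subtraction loses precision when `cdf` is close to 1.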


In [None]:
mu = 0.7
sigma = 1.3
dist = stats.norm(loc=mu, scale=sigma)

x = np.linspace(mu - 3 * sigma, mu + 3 * sigma, 7)
pdf_vals = dist.pdf(x)
cdf_vals = dist.cdf(x)

# Sampling
samples = dist.rvs(size=5, random_state=rng)

# Fit (MLE)
big_sample = dist.rvs(size=5_000, random_state=rng)
mu_fit, sigma_fit = stats.norm.fit(big_sample)

x, pdf_vals, cdf_vals, samples, (mu_fit, sigma_fit)


In [None]:
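# Tail stability: far in the tails, cdf/sf underflow to 0.0 in float64,
# so log(cdf) or log(sf) would be -inf, while logcdf/logsf remain finite.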
z = -40.0
cdf_direct = stats.norm.cdf(z)
logcdf_stable = stats.norm.logcdf(z)

z2 = 40.0
sf_direct = stats.norm.sf(z2)
logsf_stable = stats.norm.logsf(z2)

(cdf_direct, logcdf_stable), (sf_direct, logsf_stable)


## 10) Statistical use cases

### Hypothesis testing (z-test for a mean, $\sigma$ known)
If $X_1,\dots,X_n$ are iid $\mathcal{N}(\mu,\sigma^2)$ with known $\sigma$, then under $H_0: \mu=\mu_0$,
\[
Z = \frac{\bar X - \mu_0}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1).
\]
A two-sided p-value is $p = \mathbb{P}(|Z|\ge |z_{obs}|) = 2\,\mathbb{P}(Z\ge |z_{obs}|) = 2\,\bigl(1-\Phi(|z_{obs}|)\bigr)$.

### Bayesian modeling (Normal–Normal conjugacy for a mean, $\sigma$ known)
Prior: $\mu \sim \mathcal{N}(\mu_0,\tau_0^2)$. Likelihood: $X_i\mid\mu \sim \mathcal{N}(\mu,\sigma^2)$ with known $\sigma$.

Posterior: $\mu\mid x \sim \mathcal{N}(\mu_n,\tau_n^2)$ where
\[
\tau_n^2 = \left(\tfrac{1}{\tau_0^2} + \tfrac{n}{\sigma^2}\right)^{-1},\qquad
\mu_n = \tau_n^2\left(\tfrac{\mu_0}{\tau_0^2} + \tfrac{n\bar x}{\sigma^2}\right).
\]

### Generative modeling
Normals are building blocks for generative models:
- **Linear Gaussian models** (e.g., Kalman filters): Gaussian latent states + Gaussian noise
- **Gaussian mixtures** (GMMs): weighted sums of normal densities, used for multi-modal data
- **Multivariate normal**: correlated features via linear transforms of independent normals
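Sampling from a Gaussian mixture takes two steps: pick a component according to the mixture weights, then draw from that component's normal distribution. The sketch below uses a two-component mixture whose weights, means, and standard deviations are illustrative assumptions.

```python
import numpy as np

rng_gmm = np.random.default_rng(3)  # illustrative seed

# Hypothetical two-component mixture.
weights = np.array([0.3, 0.7])
means = np.array([-2.0, 1.5])
stds = np.array([0.5, 1.0])
n_gmm = 100_000

# Step 1: choose a component index for every sample.
component = rng_gmm.choice(len(weights), size=n_gmm, p=weights)

# Step 2: draw from the chosen component (loc and scale broadcast per sample).
gmm_samples = rng_gmm.normal(means[component], stds[component])

# The mixture mean is the weight-averaged component mean.
mixture_mean = float(np.sum(weights * means))
print(gmm_samples.mean(), mixture_mean)
```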


In [None]:
# Hypothesis test example: two-sided z-test for a mean (σ known)
mu0 = 0.0
sigma_known = 2.0
n = 40

# Simulated measurements with true mean != mu0
true_mu = 0.9
data = sample_norm_box_muller(n, loc=true_mu, scale=sigma_known, rng=rng)

xbar = data.mean()
z_obs = (xbar - mu0) / (sigma_known / math.sqrt(n))
p_two_sided = 2.0 * stats.norm.sf(abs(z_obs))

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)
ci = (
    xbar - z_crit * sigma_known / math.sqrt(n),
    xbar + z_crit * sigma_known / math.sqrt(n),
)

xbar, z_obs, p_two_sided, ci


In [None]:
# Bayesian update for μ with known σ (Normal–Normal)
mu0 = 0.0
tau0 = 1.5  # prior std dev
sigma = sigma_known

xbar = data.mean()
tau_n2 = 1.0 / (1.0 / tau0**2 + n / sigma**2)
mu_n = tau_n2 * (mu0 / tau0**2 + n * xbar / sigma**2)
tau_n = math.sqrt(tau_n2)

mu_n, tau_n


In [None]:
# Visualize prior vs posterior over μ
mu_grid = np.linspace(mu_n - 5 * tau0, mu_n + 5 * tau0, 600)

prior = stats.norm(loc=mu0, scale=tau0)
post = stats.norm(loc=mu_n, scale=tau_n)

fig = go.Figure()
fig.add_trace(go.Scatter(x=mu_grid, y=prior.pdf(mu_grid), mode="lines", name="prior"))
fig.add_trace(go.Scatter(x=mu_grid, y=post.pdf(mu_grid), mode="lines", name="posterior"))
fig.update_layout(title="Bayesian update for μ (σ known)", xaxis_title="μ", yaxis_title="density")
fig

In [None]:
# Generative modeling example: 2D correlated Gaussian via a linear transform
n = 3_000
mu_vec = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.8], [0.8, 2.0]])
L = np.linalg.cholesky(Sigma)

z = sample_norm_box_muller(2 * n, loc=0.0, scale=1.0, rng=rng).reshape(n, 2)
x = mu_vec + z @ L.T

df = {"x1": x[:, 0], "x2": x[:, 1]}
fig = px.scatter(df, x="x1", y="x2", opacity=0.35, title="Samples from a correlated 2D Gaussian")
fig.update_layout(xaxis_title="x1", yaxis_title="x2")
fig.show()

x.mean(axis=0), np.cov(x.T)


## 11) Pitfalls

- **Invalid parameters**: $\sigma\le 0$ is not allowed. In code, guard against non-positive `scale`.
- **Overconfidence in normality**: real data may be skewed, heavy-tailed, or multi-modal. Diagnose with histograms/QQ-plots; consider alternatives (e.g., Student-$t$, mixtures, robust losses).
- **Outliers**: the Gaussian log-likelihood penalizes residuals quadratically, so a few outliers can dominate fits.
- **Numerical issues in the tails**: `cdf`/`sf` may underflow to 0; prefer `logcdf`/`logsf` or work in log-space.
- **Sampling edge cases**: Box–Muller requires $U_1>0$; clip `u1` away from 0 to avoid `log(0)`.
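The outlier pitfall is easy to reproduce. In the sketch below, five extreme values are appended to 200 well-behaved draws: the Gaussian MLEs (sample mean and `ddof=0` standard deviation) move substantially, while the median barely changes. The data, seed, and names are illustrative assumptions.

```python
import numpy as np

rng_out = np.random.default_rng(4)  # illustrative seed

clean = rng_out.normal(10.0, 1.0, size=200)
contaminated = np.concatenate([clean, np.full(5, 100.0)])  # five gross outliers

mean_clean = float(clean.mean())
mean_contaminated = float(contaminated.mean())
median_clean = float(np.median(clean))
median_contaminated = float(np.median(contaminated))
std_contaminated = float(contaminated.std(ddof=0))

print("mean:  ", mean_clean, "->", mean_contaminated)
print("median:", median_clean, "->", median_contaminated)
print("std (contaminated):", std_contaminated)
```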


## 12) Summary

- `norm` is a **continuous** distribution on $( -\infty,\infty )$ with parameters $\mu\in\mathbb{R}$, $\sigma>0$.
- PDF: bell-shaped and symmetric; $\mu$ shifts, $\sigma$ spreads.
- Key formulas: $\mathbb{E}[X]=\mu$, $\mathrm{Var}(X)=\sigma^2$, $M_X(t)=\exp(\mu t + \tfrac12\sigma^2 t^2)$, $H=\tfrac12\ln(2\pi e\sigma^2)$.
- MLE: $\hat\mu=\bar x$, $\hat\sigma^2 = \tfrac1n\sum(x_i-\bar x)^2$.
- For tails, prefer `stats.norm.logcdf/logsf` over taking `log` of `cdf/sf`.
