# `nchypergeom_fisher` (Fisher’s noncentral hypergeometric distribution)

The Fisher noncentral hypergeometric distribution models **biased sampling without replacement** from a population containing **two types** of items.

This notebook uses the same parameterization as `scipy.stats.nchypergeom_fisher`:
- `M` = total population size
- `n` = number of **Type I** items in the population
- `N` = number of draws (sample size, without replacement)
- `odds` = odds ratio (>0) favoring Type I over Type II

## Learning goals
By the end you should be able to:
- state the support and write the PMF/CDF
- understand how the odds ratio **tilts** a hypergeometric distribution
- compute moments via the log-partition function (and numerically via the PMF)
- sample from the distribution with a **NumPy-only inverse-CDF** method
- connect `nchypergeom_fisher` to **2×2 contingency tables** and exact inference on odds ratios


In [None]:
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=6, suppress=True)
rng = np.random.default_rng(42)


## 1) Title & Classification

**Name**: `nchypergeom_fisher` (Fisher’s noncentral hypergeometric distribution)  
**Type**: **Discrete**  

**Support**:
Let:
- `M` be the total population size
- `n` be the number of Type I items (so `M-n` are Type II)
- `N` be the number of draws (without replacement)

If \(X\) is the number of Type I items drawn, then:

\[
X \in \{x_{\min}, x_{\min}+1, \dots, x_{\max}\},\qquad
x_{\min} = \max\bigl(0,\, N-(M-n)\bigr),\quad
x_{\max} = \min(n, N).
\]

**Parameter space**:
\[
M \in \{0,1,2,\dots\},\quad
n\in\{0,\dots,M\},\quad
N\in\{0,\dots,M\},\quad
\text{odds} > 0.
\]

Interpretation:
- `M` sets the urn size.
- `n` is how many Type I (“success”) items exist in the population.
- `N` is how many items you draw.
- `odds` controls the bias toward Type I (\(\text{odds}=1\) recovers the usual hypergeometric distribution).
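
As a quick illustration of the support formula, here is a minimal plain-Python sketch. The helper name `support_bounds` is hypothetical (it is not part of SciPy or of this notebook's helpers).

```python
def support_bounds(M: int, n: int, N: int) -> tuple[int, int]:
    """Return (x_min, x_max) for the number of Type I items in the sample."""
    if not (0 <= n <= M and 0 <= N <= M):
        raise ValueError("need 0 <= n <= M and 0 <= N <= M")
    # Once all M - n Type II items are used up, the remaining draws must be Type I.
    x_min = max(0, N - (M - n))
    # Cannot draw more Type I items than exist, nor more than the sample size.
    x_max = min(n, N)
    return x_min, x_max


print(support_bounds(50, 20, 15))  # (0, 15)
print(support_bounds(10, 7, 6))    # (3, 6): only 3 Type II items exist
```

In the second call the urn holds only three Type II items, so at least three of the six draws must be Type I.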


## 2) Intuition & Motivation

### What this distribution models
You have an urn with two types of objects:
- Type I: \(n\) objects
- Type II: \(M-n\) objects

You draw \(N\) objects **without replacement**, but the draw is **biased**.
A convenient way to define Fisher’s version is:

> each possible sample composition is weighted by an **odds ratio** \(\omega\) (“odds”) raised to the number of Type I objects in that sample.

So if a sample contains \(x\) Type I objects (and \(N-x\) Type II objects), it gets a weight proportional to \(\omega^x\), on top of the usual combinatorial counts.

### Typical real-world use cases
- **2×2 contingency tables / case–control studies**: conditional inference about an odds ratio given fixed margins
- **Biased audits / inspections**: a fixed-size sample is collected, but certain types are more likely to appear
- **Ecology / genetics**: sampling a fixed number of individuals where types have different sampling propensities

### Relations to other distributions
- If \(\omega = 1\), this reduces to the **hypergeometric** distribution (unbiased sampling without replacement).
- If \(N\) is small relative to both \(n\) and \(M-n\), the distribution is well approximated by a **binomial** \(\mathrm{Bin}(N, p)\) with bias-adjusted success probability \(p = \omega n / (\omega n + M - n)\).
- Compare with `scipy.stats.nchypergeom_wallenius` (Wallenius’ noncentral hypergeometric):
  - Fisher: the bias is applied to the *final sample composition* (“draw a handful at once”).
  - Wallenius: the bias is applied *sequentially* as you draw items.
  - They coincide at \(\omega=1\) but differ in general.
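
The first two relations above can be checked numerically. The sketch below uses exact rational arithmetic from the standard library; the helper `fisher_pmf` is a hypothetical name, and the binomial success probability \(p = \omega n/(\omega n + M - n)\) is the small-sample limit of the Fisher weights, assumed here for illustration.

```python
from fractions import Fraction
from math import comb


def fisher_pmf(M, n, N, odds):
    """Exact Fisher NCHG PMF as {x: probability} for a rational `odds`."""
    lo, hi = max(0, N - (M - n)), min(n, N)
    w = {x: comb(n, x) * comb(M - n, N - x) * Fraction(odds) ** x
         for x in range(lo, hi + 1)}
    Z = sum(w.values())
    return {x: wx / Z for x, wx in w.items()}


# 1) odds = 1 reproduces the ordinary hypergeometric PMF exactly.
M, n, N = 30, 12, 8
central = fisher_pmf(M, n, N, 1)
hypergeom = {x: Fraction(comb(n, x) * comb(M - n, N - x), comb(M, N)) for x in central}
print(central == hypergeom)  # True

# 2) Small sample relative to both groups: compare with Binomial(N, p).
M, n, N, odds = 10_000, 4_000, 10, 2
p = Fraction(odds * n, odds * n + (M - n))
binom = {x: comb(N, x) * p**x * (1 - p) ** (N - x) for x in range(N + 1)}
fisher = fisher_pmf(M, n, N, odds)
tv = float(sum(abs(fisher[x] - binom[x]) for x in binom) / 2)
print(f"total variation distance: {tv:.5f}")
```

The total variation distance shrinks as \(N\) becomes small relative to \(n\) and \(M-n\); it grows again when the sample exhausts a noticeable share of either group.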


## 3) Formal Definition

Let \(X\) be the number of Type I objects in a biased sample of size \(N\).
Define the combinatorial term
\[
a_x = \binom{n}{x}\binom{M-n}{N-x}
\]
for integers \(x\) in the feasible range.

### PMF
For \(x \in \{x_{\min},\dots,x_{\max}\}\), the PMF is
\[
\Pr(X=x\mid M,n,N,\omega)
= \frac{a_x\,\omega^x}{Z(\omega)},
\qquad
Z(\omega)=\sum_{k=x_{\min}}^{x_{\max}} a_k\,\omega^k,
\qquad \omega=\text{odds}>0.
\]
Outside the support, \(\Pr(X=x)=0\).

### CDF
For a real number \(t\), the CDF is the finite sum
\[
F(t) = \Pr(X\le t) = \sum_{k=x_{\min}}^{\lfloor t\rfloor} \Pr(X=k).
\]

There is no simple closed form for the CDF in general; it is typically evaluated by summation in a numerically stable way.
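
For a small case the PMF and CDF can be evaluated exactly by hand. The sketch below does this with rational arithmetic; the function names are hypothetical and the direct products are only sensible for small parameters.

```python
from fractions import Fraction
from math import comb, floor


def fisher_pmf_exact(M, n, N, odds):
    """Return (support, probabilities) using exact rational arithmetic."""
    lo, hi = max(0, N - (M - n)), min(n, N)
    xs = list(range(lo, hi + 1))
    weights = [comb(n, x) * comb(M - n, N - x) * Fraction(odds) ** x for x in xs]
    Z = sum(weights)  # partition function Z(omega)
    return xs, [w / Z for w in weights]


def fisher_cdf_exact(t, M, n, N, odds):
    """F(t) = P(X <= t) for any real t, summing the PMF up to floor(t)."""
    xs, ps = fisher_pmf_exact(M, n, N, odds)
    return sum((p for x, p in zip(xs, ps) if x <= floor(t)), Fraction(0))


xs, ps = fisher_pmf_exact(10, 4, 5, 2)
for x, p in zip(xs, ps):
    print(x, p)
print(fisher_cdf_exact(1.5, 10, 4, 5, 2))  # 21/197
```

With \(M=10\), \(n=4\), \(N=5\), \(\omega=2\), the weights \(a_x\omega^x\) are \(6, 120, 480, 480, 96\), so \(Z=1182\) and the probabilities are \(1/197, 20/197, 80/197, 80/197, 16/197\).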


## 4) Moments & Properties

A key observation is that Fisher’s noncentral hypergeometric distribution is an **exponential family** in the natural parameter
\(\theta = \log \omega\).

Define the log-partition function
\[
A(\theta) = \log Z(\theta),\qquad
Z(\theta) = \sum_{k=x_{\min}}^{x_{\max}} a_k\,e^{\theta k}.
\]

### Mean, variance, skewness, kurtosis
Derivatives of \(A(\theta)\) give **cumulants**:
\[
\mathbb{E}[X] = A'(\theta),\qquad
\mathrm{Var}(X) = A''(\theta).
\]
More generally, the \(r\)-th cumulant is \(\kappa_r = A^{(r)}(\theta)\). Then
\[
\text{skew}(X) = \frac{\kappa_3}{\kappa_2^{3/2}},
\qquad
\text{excess kurt}(X) = \frac{\kappa_4}{\kappa_2^2}.
\]
In practice, these are often computed numerically from the PMF.

### MGF and characteristic function
Using the partition function, the MGF has a clean ratio form:
\[
M_X(t)=\mathbb{E}[e^{tX}] = \frac{Z(\theta+t)}{Z(\theta)} = \frac{Z(\omega e^{t})}{Z(\omega)}.
\]
Similarly, the characteristic function is
\[
\varphi_X(t)=\mathbb{E}[e^{itX}] = \frac{Z(\theta+it)}{Z(\theta)}.
\]

### Entropy
The (Shannon) entropy is
\[
H(X) = -\sum_x p(x)\log p(x).
\]
For this family, you can also write
\[
H(X) = A(\theta) - \theta\,\mathbb{E}[X] - \mathbb{E}[\log a_X],
\]
which is useful conceptually, but still requires numerical evaluation.
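
The MGF ratio and the entropy identity can both be spot-checked numerically. This is a minimal standard-library sketch; `log_a_and_support` and `log_partition` are hypothetical helper names chosen for illustration.

```python
import math


def log_a_and_support(M, n, N):
    """Support and log a_x = log[C(n, x) C(M - n, N - x)]."""
    lo, hi = max(0, N - (M - n)), min(n, N)
    xs = list(range(lo, hi + 1))
    log_a = [math.log(math.comb(n, x) * math.comb(M - n, N - x)) for x in xs]
    return xs, log_a


def log_partition(theta, xs, log_a):
    """A(theta) = log sum_x a_x exp(theta x), via log-sum-exp."""
    terms = [la + theta * x for x, la in zip(xs, log_a)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


M, n, N, odds = 50, 20, 15, 3.0
theta = math.log(odds)
xs, log_a = log_a_and_support(M, n, N)
A = log_partition(theta, xs, log_a)
pmf = [math.exp(la + theta * x - A) for x, la in zip(xs, log_a)]

mean = sum(x * p for x, p in zip(xs, pmf))
entropy_direct = -sum(p * math.log(p) for p in pmf if p > 0)
entropy_identity = A - theta * mean - sum(p * la for p, la in zip(pmf, log_a))

t = 0.2
mgf_ratio = math.exp(log_partition(theta + t, xs, log_a) - A)
mgf_direct = sum(math.exp(t * x) * p for x, p in zip(xs, pmf))

print(abs(entropy_direct - entropy_identity))  # should be near machine precision
print(abs(mgf_ratio - mgf_direct))
```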


In [None]:
def _validate_M_n_N(M, n, N):
    for name, v in [("M", M), ("n", n), ("N", N)]:
        if isinstance(v, bool) or not isinstance(v, (int, np.integer)):
            raise TypeError(f"{name} must be an integer")

    M = int(M)
    n = int(n)
    N = int(N)

    if M < 0:
        raise ValueError("M must be >= 0")
    if not (0 <= n <= M):
        raise ValueError("n must satisfy 0 <= n <= M")
    if not (0 <= N <= M):
        raise ValueError("N must satisfy 0 <= N <= M")

    return M, n, N


def _validate_params(M, n, N, odds):
    M, n, N = _validate_M_n_N(M, n, N)

    odds = float(odds)
    if not np.isfinite(odds) or odds <= 0:
        raise ValueError("odds must be a positive finite number")

    return M, n, N, odds


def nchypergeom_fisher_support(M, n, N) -> np.ndarray:
    M, n, N = _validate_M_n_N(M, n, N)
    lo = max(0, N - (M - n))
    hi = min(n, N)
    return np.arange(lo, hi + 1, dtype=int)


def _log_factorials(max_n: int) -> np.ndarray:
    max_n = int(max_n)
    if max_n < 0:
        raise ValueError("max_n must be >= 0")

    logfact = np.zeros(max_n + 1, dtype=float)
    if max_n >= 2:
        logfact[2:] = np.cumsum(np.log(np.arange(2, max_n + 1)))
    return logfact


def _log_choose(n: int, k: np.ndarray, logfact: np.ndarray) -> np.ndarray:
    k = np.asarray(k, dtype=int)
    return logfact[n] - logfact[k] - logfact[n - k]


def _logsumexp(a: np.ndarray):
    """Stable log(sum(exp(a))) for real or complex a."""
    a = np.asarray(a)
    m = np.max(a.real)
    return m + np.log(np.sum(np.exp(a - m)))


def _nchgf_loga(M, n, N):
    """Return support x and log a_x = log[C(n,x) C(M-n, N-x)]."""
    x = nchypergeom_fisher_support(M, n, N)
    logfact = _log_factorials(M)
    loga = _log_choose(n, x, logfact) + _log_choose(M - n, N - x, logfact)
    return x, loga


def nchypergeom_fisher_logZ_theta(M, n, N, theta):
    M, n, N = _validate_M_n_N(M, n, N)
    x, loga = _nchgf_loga(M, n, N)
    return _logsumexp(loga + theta * x)


def nchypergeom_fisher_logZ(M, n, N, odds):
    M, n, N, odds = _validate_params(M, n, N, odds)
    return nchypergeom_fisher_logZ_theta(M, n, N, np.log(odds))


def nchypergeom_fisher_logpmf_array(M, n, N, odds):
    M, n, N, odds = _validate_params(M, n, N, odds)

    x, loga = _nchgf_loga(M, n, N)

    theta = np.log(odds)
    logw = loga + theta * x
    logZ = _logsumexp(logw)

    return x, logw - logZ


def nchypergeom_fisher_logpmf(k, M, n, N, odds):
    """Log-PMF evaluated at integer k; non-integers return -inf."""
    M, n, N, odds = _validate_params(M, n, N, odds)
    k_arr = np.asarray(k)
    k_int = k_arr.astype(int)

    xs, logp = nchypergeom_fisher_logpmf_array(M, n, N, odds)
    lo, hi = int(xs[0]), int(xs[-1])

    out = np.full_like(k_arr, -np.inf, dtype=float)
    inside = (k_arr == k_int) & (k_int >= lo) & (k_int <= hi)
    if np.any(inside):
        out[inside] = logp[k_int[inside] - lo]
    return out


def nchypergeom_fisher_pmf_array(M, n, N, odds):
    xs, logp = nchypergeom_fisher_logpmf_array(M, n, N, odds)
    p = np.exp(logp)
    p = p / p.sum()
    return xs, p


def nchypergeom_fisher_pmf(k, M, n, N, odds):
    return np.exp(nchypergeom_fisher_logpmf(k, M, n, N, odds))


def nchypergeom_fisher_cdf_array(M, n, N, odds):
    xs, p = nchypergeom_fisher_pmf_array(M, n, N, odds)
    cdf = np.cumsum(p)
    cdf[-1] = 1.0
    return xs, cdf


def nchypergeom_fisher_cdf(x, M, n, N, odds):
    M, n, N, odds = _validate_params(M, n, N, odds)
    x_arr = np.asarray(x)

    xs = nchypergeom_fisher_support(M, n, N)
    lo, hi = int(xs[0]), int(xs[-1])

    _, cdf = nchypergeom_fisher_cdf_array(M, n, N, odds)

    k = np.floor(x_arr).astype(int)
    out = np.zeros_like(x_arr, dtype=float)
    out[k >= hi] = 1.0

    inside = (k >= lo) & (k < hi)
    if np.any(inside):
        out[inside] = cdf[k[inside] - lo]

    return out


def nchypergeom_fisher_stats(M, n, N, odds):
    xs, p = nchypergeom_fisher_pmf_array(M, n, N, odds)

    mean = float(np.sum(xs * p))
    var = float(np.sum((xs - mean) ** 2 * p))

    if var == 0.0:
        skew = 0.0
        kurt_excess = 0.0
    else:
        std = np.sqrt(var)
        skew = float(np.sum((xs - mean) ** 3 * p) / std**3)
        kurt_excess = float(np.sum((xs - mean) ** 4 * p) / std**4 - 3.0)

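    # Treat 0 * log(0) as 0 so that underflowed tail probabilities do not produce NaN.
    pos = p > 0
    entropy_nats = float(-np.sum(p[pos] * np.log(p[pos])))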

    return {
        "mean": mean,
        "var": var,
        "skew": skew,
        "kurt_excess": kurt_excess,
        "entropy_nats": entropy_nats,
    }


def nchypergeom_fisher_log_mgf(t, M, n, N, odds):
    M, n, N, odds = _validate_params(M, n, N, odds)
    theta = np.log(odds)
    return nchypergeom_fisher_logZ_theta(M, n, N, theta + t) - nchypergeom_fisher_logZ_theta(M, n, N, theta)


def nchypergeom_fisher_mgf(t, M, n, N, odds):
    return np.exp(nchypergeom_fisher_log_mgf(t, M, n, N, odds))


def nchypergeom_fisher_log_cf(t, M, n, N, odds):
    M, n, N, odds = _validate_params(M, n, N, odds)
    theta = np.log(odds)
    return nchypergeom_fisher_logZ_theta(M, n, N, theta + 1j * t) - nchypergeom_fisher_logZ_theta(M, n, N, theta)


def nchypergeom_fisher_cf(t, M, n, N, odds):
    return np.exp(nchypergeom_fisher_log_cf(t, M, n, N, odds))


def nchypergeom_fisher_mean_from_logodds(M, n, N, logodds: float) -> float:
    M, n, N = _validate_M_n_N(M, n, N)
    xs, loga = _nchgf_loga(M, n, N)

    logw = loga + logodds * xs
    logp = logw - _logsumexp(logw)
    p = np.exp(logp)

    return float(np.sum(xs * p))


def fit_nchypergeom_fisher_odds_mle(data, M, n, N, *, max_iter=80, tol=1e-10) -> float:
    # Conditional MLE of odds with (M,n,N) fixed.
    # For Fisher's NCHG, the MLE solves E_theta[X] = sample_mean.

    M, n, N = _validate_M_n_N(M, n, N)

    data = np.asarray(data)
    if not np.issubdtype(data.dtype, np.integer):
        data_int = data.astype(int)
        if np.any(data_int != data):
            raise ValueError("data must be integer-valued")
        data = data_int

    xs = nchypergeom_fisher_support(M, n, N)
    lo, hi = int(xs[0]), int(xs[-1])

    if np.any(data < lo) or np.any(data > hi):
        raise ValueError("data outside the support")

    target = float(np.mean(data))

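    # If the sample mean sits on a boundary of the support, the likelihood is monotone in odds,
    # so no finite positive MLE exists (odds -> 0 or odds -> +inf). Return a finite sentinel instead.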
    if target <= lo + 1e-12:
        return 1e-12
    if target >= hi - 1e-12:
        return 1e12

    theta_lo, theta_hi = -10.0, 10.0
    m_lo = nchypergeom_fisher_mean_from_logodds(M, n, N, theta_lo)
    m_hi = nchypergeom_fisher_mean_from_logodds(M, n, N, theta_hi)

    # Expand the bracket if needed.
    while m_lo > target:
        theta_hi = theta_lo
        theta_lo -= 10.0
        m_lo = nchypergeom_fisher_mean_from_logodds(M, n, N, theta_lo)

    while m_hi < target:
        theta_lo = theta_hi
        theta_hi += 10.0
        m_hi = nchypergeom_fisher_mean_from_logodds(M, n, N, theta_hi)

    # Bisection on theta = log odds.
    for _ in range(max_iter):
        theta_mid = 0.5 * (theta_lo + theta_hi)
        m_mid = nchypergeom_fisher_mean_from_logodds(M, n, N, theta_mid)

        if abs(m_mid - target) < tol:
            break

        if m_mid < target:
            theta_lo = theta_mid
        else:
            theta_hi = theta_mid

    return float(np.exp(theta_mid))


In [None]:
# Quick sanity check + moments

M, n, N, odds = 50, 20, 15, 3.0

xs, pmf = nchypergeom_fisher_pmf_array(M, n, N, odds)

moments = nchypergeom_fisher_stats(M, n, N, odds)

# MGF spot-check: ratio-of-partition vs direct sum

t = 0.2
mgf_ratio = float(nchypergeom_fisher_mgf(t, M, n, N, odds))
mgf_direct = float(np.sum(np.exp(t * xs) * pmf))

{
    "support": (int(xs[0]), int(xs[-1])),
    "pmf_sum": float(pmf.sum()),
    "moments": moments,
    "mgf_ratio": mgf_ratio,
    "mgf_direct": mgf_direct,
    "mgf_abs_diff": float(abs(mgf_ratio - mgf_direct)),
    "cf_at_1.0": complex(nchypergeom_fisher_cf(1.0, M, n, N, odds)),
}


## 5) Parameter Interpretation

### Meaning of the parameters
- **`M` (population size)**: total number of items.
- **`n` (Type I count)**: how many Type I items exist in the population (baseline prevalence \(n/M\)).
- **`N` (draws)**: sample size drawn without replacement.
- **`odds` (odds ratio \(\omega\))**: the *relative* preference for Type I.
  - \(\omega>1\) shifts mass toward larger \(X\)
  - \(\omega<1\) shifts mass toward smaller \(X\)
  - \(\omega=1\) is unbiased (hypergeometric)

### Shape changes
- As \(\omega\to 0\), the distribution concentrates at \(x_{\min}\) (as few Type I draws as possible).
- As \(\omega\to \infty\), it concentrates at \(x_{\max}\) (as many Type I draws as possible).
- The map \(\omega \mapsto \mathbb{E}[X]\) is monotone increasing (useful for estimation).
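
The limiting behavior and the monotonicity of the mean are easy to see on a tiny example. The sketch below uses direct (non-log) weights, which is only safe for small parameters; `fisher_mean` is a hypothetical helper name.

```python
from math import comb


def fisher_mean(M, n, N, odds):
    """E[X] from the unnormalized weights a_x * odds**x (small parameters only)."""
    lo, hi = max(0, N - (M - n)), min(n, N)
    xs = range(lo, hi + 1)
    w = [comb(n, x) * comb(M - n, N - x) * odds**x for x in xs]
    return sum(x * wx for x, wx in zip(xs, w)) / sum(w)


M, n, N = 10, 7, 6  # support is {3, 4, 5, 6}
print(fisher_mean(M, n, N, 1e-9))  # close to x_min = 3
print(fisher_mean(M, n, N, 1.0))   # hypergeometric mean N * n / M = 4.2
print(fisher_mean(M, n, N, 1e9))   # close to x_max = 6

# Monotonicity on a log-spaced grid from 1e-4 to 1e4.
grid = [10 ** (k / 4) for k in range(-16, 17)]
means = [fisher_mean(M, n, N, o) for o in grid]
print(all(a < b for a, b in zip(means, means[1:])))  # True
```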


In [None]:
from plotly.subplots import make_subplots

M, n, N = 60, 20, 15
odds_values = [0.2, 0.5, 1.0, 2.0, 5.0]

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=(
        f"PMF for different odds (M={M}, n={n}, N={N})",
        "Mean as a function of odds",
    ),
)

# Left: PMF curves
for o in odds_values:
    xs, pmf = nchypergeom_fisher_pmf_array(M, n, N, o)
    fig.add_trace(
        go.Scatter(
            x=xs,
            y=pmf,
            mode="lines+markers",
            name=f"odds={o}",
        ),
        row=1,
        col=1,
    )

fig.update_xaxes(title_text="x", row=1, col=1)
fig.update_yaxes(title_text="P(X=x)", row=1, col=1)

# Right: mean curve
odds_grid = np.logspace(-2, 2, 220)
means = np.array([nchypergeom_fisher_stats(M, n, N, o)["mean"] for o in odds_grid])

fig.add_trace(
    go.Scatter(
        x=odds_grid,
        y=means,
        mode="lines",
        name="E[X]",
        showlegend=False,
    ),
    row=1,
    col=2,
)

fig.update_xaxes(title_text="odds (log scale)", type="log", row=1, col=2)
fig.update_yaxes(title_text="E[X]", row=1, col=2)

fig.update_layout(title="How the odds ratio changes the distribution")
fig.show()


## 6) Derivations

Write the PMF in exponential-family form. Let \(\theta = \log \omega\) and
\(
Z(\theta)=\sum_x a_x e^{\theta x},\; A(\theta)=\log Z(\theta).
\)
Then
\[
\Pr(X=x) = \exp\bigl(\log a_x + \theta x - A(\theta)\bigr).
\]

### Expectation
Differentiate \(A(\theta)\):
\[
A'(\theta)
= \frac{Z'(\theta)}{Z(\theta)}
= \frac{\sum_x a_x x e^{\theta x}}{\sum_x a_x e^{\theta x}}
= \sum_x x\,\Pr(X=x)
= \mathbb{E}[X].
\]

### Variance
Differentiate again:
\[
A''(\theta) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \mathrm{Var}(X).
\]
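
The two derivative identities can be verified with central finite differences of \(A(\theta)\). This is a rough numerical sketch: the step size `h = 1e-3` is an arbitrary choice, and the helper names are hypothetical.

```python
import math


def log_weights(theta, M, n, N):
    """Support and log(a_x) + theta * x."""
    lo, hi = max(0, N - (M - n)), min(n, N)
    xs = list(range(lo, hi + 1))
    logw = [math.log(math.comb(n, x) * math.comb(M - n, N - x)) + theta * x for x in xs]
    return xs, logw


def log_partition(theta, M, n, N):
    """A(theta) via log-sum-exp."""
    _, logw = log_weights(theta, M, n, N)
    m = max(logw)
    return m + math.log(sum(math.exp(v - m) for v in logw))


M, n, N, odds = 10, 4, 5, 2.0
theta, h = math.log(odds), 1e-3

A_minus, A_0, A_plus = (log_partition(theta + d, M, n, N) for d in (-h, 0.0, h))
mean_fd = (A_plus - A_minus) / (2 * h)         # approximates A'(theta)
var_fd = (A_plus - 2 * A_0 + A_minus) / h**2   # approximates A''(theta)

# Reference values computed directly from the PMF.
xs, logw = log_weights(theta, M, n, N)
pmf = [math.exp(v - A_0) for v in logw]
mean_pmf = sum(x * p for x, p in zip(xs, pmf))
var_pmf = sum((x - mean_pmf) ** 2 * p for x, p in zip(xs, pmf))

print(mean_fd, mean_pmf)
print(var_fd, var_pmf)
```

For these parameters the exact mean is \(484/197\), so both printed means should agree with it to several decimal places.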

### Likelihood (odds ratio)
With fixed \((M,n,N)\), treat \(\omega\) (or \(\theta=\log\omega\)) as the parameter.

For a single observation \(X=x\), the likelihood is
\[
L(\omega\mid x) \propto \omega^x\, /\, Z(\omega),
\]
and the log-likelihood (in \(\theta\)) is
\[
\ell(\theta) = \theta x - A(\theta) + \text{const}.
\]

For i.i.d. data \(x_1,\dots,x_m\):
\[
\ell(\theta) = \theta\sum_{i=1}^m x_i - m A(\theta) + \text{const}.
\]
The score equation is
\[
\frac{\partial\ell}{\partial\theta} = \sum_{i=1}^m x_i - m\,\mathbb{E}_\theta[X]=0
\quad\Longleftrightarrow\quad
\mathbb{E}_\theta[X] = \bar{x}.
\]
So the conditional MLE for \(\omega\) solves “model mean = sample mean”.
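
Because \(A''(\theta)=\mathrm{Var}_\theta(X)\) is the per-observation Fisher information, the score equation can also be solved by Newton's method on \(\theta\). The sketch below is an alternative to bisection, not the method used elsewhere in this notebook; the function names and the step clamp of 2.0 are assumptions made for illustration, and it presumes \(\bar{x}\) lies strictly inside the support.

```python
import math


def mean_and_var(theta, M, n, N):
    """Return (E[X], Var(X)) at log-odds theta, using shifted log-space weights."""
    lo, hi = max(0, N - (M - n)), min(n, N)
    xs = range(lo, hi + 1)
    logw = [math.log(math.comb(n, x) * math.comb(M - n, N - x)) + theta * x for x in xs]
    m = max(logw)
    w = [math.exp(v - m) for v in logw]
    Z = sum(w)
    mean = sum(x * wx for x, wx in zip(xs, w)) / Z
    var = sum((x - mean) ** 2 * wx for x, wx in zip(xs, w)) / Z
    return mean, var


def newton_log_odds(xbar, M, n, N, theta0=0.0, tol=1e-12, max_iter=50):
    """Solve E_theta[X] = xbar for theta by damped Newton iteration."""
    theta = theta0
    for _ in range(max_iter):
        mean, var = mean_and_var(theta, M, n, N)
        step = (xbar - mean) / var        # score divided by Fisher information
        step = max(-2.0, min(2.0, step))  # crude clamp to limit overshooting
        theta += step
        if abs(step) < tol:
            break
    return theta


theta_hat = newton_log_odds(10.0, 50, 20, 15)
fitted_mean, _ = mean_and_var(theta_hat, 50, 20, 15)
print(math.exp(theta_hat), fitted_mean)
```

Newton's method usually needs only a handful of iterations here, while bisection is slower but cannot overshoot.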


In [None]:
# Visualize the likelihood for odds (single observation)

M, n, N = 50, 20, 15
x_obs = 10

odds_grid = np.logspace(-2, 2, 500)
logL = np.array([float(nchypergeom_fisher_logpmf(x_obs, M, n, N, o)) for o in odds_grid])

odds_hat = fit_nchypergeom_fisher_odds_mle(np.array([x_obs]), M, n, N)

fig = go.Figure()
fig.add_trace(go.Scatter(x=odds_grid, y=logL, mode="lines", name="log-likelihood"))
fig.add_vline(
    x=odds_hat,
    line_dash="dash",
    line_color="black",
    annotation_text=f"MLE oddŝ={odds_hat:.3f}",
)
fig.update_layout(
    title="Fisher NCHG log-likelihood for odds (M,n,N fixed)",
    xaxis_title="odds",
    yaxis_title="log L(odds)",
)
fig.update_xaxes(type="log")
fig.show()

{
    "x_obs": x_obs,
    "odds_hat": odds_hat,
    "E[X] at odds_hat": nchypergeom_fisher_stats(M, n, N, odds_hat)["mean"],
}


## 7) Sampling & Simulation

Because the support is finite, a clean **NumPy-only** approach is inverse transform sampling.

### Inverse CDF algorithm
1. Compute log-weights
   \(\log w_x = \log\binom{n}{x} + \log\binom{M-n}{N-x} + x\log\omega\).
2. Normalize to get the PMF \(p(x)\) (use log-sum-exp for stability).
3. Compute the discrete CDF \(F(x)=\sum_{k\le x} p(k)\).
4. Draw \(U\sim\mathrm{Uniform}(0,1)\) and return the smallest \(x\) with \(F(x)\ge U\).

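Cost: building the PMF takes \(O(M)\) for the log-factorial table plus \(O(|\text{support}|)\) for the weights; drawing `size` samples by binary search on the CDF then takes \(O(\text{size}\cdot\log|\text{support}|)\).
For large parameters, prefer a specialized sampler such as `scipy.stats.nchypergeom_fisher.rvs`.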


In [None]:
def sample_nchypergeom_fisher_inverse_cdf(M, n, N, odds, size=1, *, rng: np.random.Generator):
    M, n, N, odds = _validate_params(M, n, N, odds)

    xs, pmf = nchypergeom_fisher_pmf_array(M, n, N, odds)
    if len(xs) == 1:
        return np.full(size, xs[0], dtype=int)

    cdf = np.cumsum(pmf)
    cdf[-1] = 1.0

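    # rng.random draws U in [0, 1). side="right" returns the first index with cdf > U,
    # which matches "smallest x with F(x) >= U" except on the probability-zero event U == F(x).
    # Because cdf[-1] == 1.0 and U < 1, idx is always a valid index.
    u = rng.random(size=size)
    idx = np.searchsorted(cdf, u, side="right")
    return xs[idx]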


# Monte Carlo check
M, n, N, odds = 60, 20, 15, 2.5
mom = nchypergeom_fisher_stats(M, n, N, odds)

samples = sample_nchypergeom_fisher_inverse_cdf(M, n, N, odds, size=200_000, rng=rng)

{
    "theory_mean": mom["mean"],
    "mc_mean": float(samples.mean()),
    "theory_var": mom["var"],
    "mc_var": float(samples.var(ddof=0)),
}


## 8) Visualization

We’ll visualize:
- the **PMF** on its integer support
- the **CDF** (a step function)
- Monte Carlo samples vs the exact PMF


In [None]:
M, n, N, odds = 50, 20, 15, 3.0

xs, pmf = nchypergeom_fisher_pmf_array(M, n, N, odds)
cdf = np.cumsum(pmf)

fig_pmf = go.Figure()
fig_pmf.add_trace(go.Bar(x=xs, y=pmf, name="PMF"))
fig_pmf.update_layout(
    title=f"Fisher NCHG PMF (M={M}, n={n}, N={N}, odds={odds})",
    xaxis_title="x",
    yaxis_title="P(X=x)",
)
fig_pmf.show()

fig_cdf = go.Figure()
fig_cdf.add_trace(go.Scatter(x=xs, y=cdf, mode="lines", line_shape="hv", name="CDF"))
fig_cdf.update_layout(
    title=f"Fisher NCHG CDF (M={M}, n={n}, N={N}, odds={odds})",
    xaxis_title="x",
    yaxis_title="P(X≤x)",
)
fig_cdf.show()

mc = sample_nchypergeom_fisher_inverse_cdf(M, n, N, odds, size=250_000, rng=rng)
lo = int(xs[0])
counts = np.bincount(mc - lo, minlength=len(xs))
pmf_hat = counts / counts.sum()

fig_mc = go.Figure()
fig_mc.add_trace(go.Bar(x=xs, y=pmf_hat, name="Monte Carlo", opacity=0.6))
fig_mc.add_trace(go.Scatter(x=xs, y=pmf, mode="lines+markers", name="Exact PMF"))
fig_mc.update_layout(
    title="Monte Carlo samples vs exact PMF",
    xaxis_title="x",
    yaxis_title="Probability",
)
fig_mc.show()


## 9) SciPy Integration

SciPy provides a fast, numerically robust implementation via `scipy.stats.nchypergeom_fisher`.

- Use `pmf`, `cdf`, `sf`, `rvs`, `logpmf`, …
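- Distributions built on `rv_discrete` do **not** have a `.fit()` method (as of SciPy 1.15).
  For Fisher’s NCHG, you typically estimate `odds` with custom likelihood-based code while treating `(M,n,N)` as fixed. The generic `scipy.stats.fit` function (SciPy 1.9+) also accepts discrete distributions, but it generally needs user-supplied bounds on the shape parameters.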


In [None]:
from scipy import stats
from scipy.optimize import minimize_scalar

# Compare NumPy implementation to SciPy
M, n, N, odds = 50, 20, 15, 3.0
xs = nchypergeom_fisher_support(M, n, N)

pmf_np = nchypergeom_fisher_pmf(xs, M, n, N, odds)
cdf_np = nchypergeom_fisher_cdf(xs, M, n, N, odds)

pmf_sp = stats.nchypergeom_fisher.pmf(xs, M, n, N, odds)
cdf_sp = stats.nchypergeom_fisher.cdf(xs, M, n, N, odds)

mean_sp, var_sp, skew_sp, kurt_sp = stats.nchypergeom_fisher.stats(M, n, N, odds, moments="mvsk")
ent_sp = stats.nchypergeom_fisher.entropy(M, n, N, odds)

{
    "max_abs_pmf_diff": float(np.max(np.abs(pmf_np - pmf_sp))),
    "max_abs_cdf_diff": float(np.max(np.abs(cdf_np - cdf_sp))),
    "numpy_moments": nchypergeom_fisher_stats(M, n, N, odds),
    "scipy_mean": float(mean_sp),
    "scipy_var": float(var_sp),
    "scipy_skew": float(skew_sp),
    "scipy_kurt_excess": float(kurt_sp),
    "scipy_entropy_nats": float(ent_sp),
}


In [None]:
# "Fit" odds with (M,n,N) fixed: conditional MLE

M, n, N, odds_true = 80, 25, 20, 4.0

data = stats.nchypergeom_fisher.rvs(M, n, N, odds_true, size=5_000, random_state=rng)

odds_hat_bisect = fit_nchypergeom_fisher_odds_mle(data, M, n, N)

# Optimization check on log-odds (should agree with the MLE condition E[X]=mean)

def nll(theta):
    odds = float(np.exp(theta))
    return -float(stats.nchypergeom_fisher.logpmf(data, M, n, N, odds).sum())

res = minimize_scalar(nll, bounds=(-6, 6), method="bounded")
odds_hat_opt = float(np.exp(res.x))

{
    "odds_true": odds_true,
    "odds_hat_bisect": odds_hat_bisect,
    "odds_hat_opt": odds_hat_opt,
    "opt_success": bool(res.success),
}


## 10) Statistical Use Cases

### A) Hypothesis testing (conditional test for an odds ratio)
For a 2×2 table

|            | Exposed | Unexposed |
|------------|---------|-----------|
| Cases      | \(a\)   | \(b\)     |
| Controls   | \(c\)   | \(d\)     |

conditioning on the margins fixes:
- total exposed \(a+c\)
- total cases \(a+b\)
- total sample size \(M=a+b+c+d\)

Then the conditional distribution of the exposed-case count \(A\) (observed value \(a\)) given the margins is Fisher’s NCHG with
\[
M=a+b+c+d,\quad n=a+c,\quad N=a+b,\quad \omega=\text{odds ratio}.
\]
The null hypothesis of independence corresponds to \(\omega=1\) (hypergeometric).

### B) Bayesian modeling (posterior over odds)
With fixed margins, Fisher’s NCHG provides a likelihood for \(\omega\). Combine it with a prior over \(\theta=\log\omega\) (e.g., Normal) to get a posterior.

### C) Generative modeling (simulate tables with fixed margins)
To generate random 2×2 tables with fixed margins and a chosen odds ratio, sample \(A\sim\text{nchypergeom\_fisher}(M,n,N,\omega)\) and fill in the remaining cells deterministically.


In [None]:
from scipy.stats import fisher_exact

# Example 2x2 table
#            Exposed  Unexposed
# Cases         a         b
# Controls      c         d

a, b, c, d = 12, 5, 8, 15

M = a + b + c + d
n = a + c       # total exposed
N = a + b       # total cases

# A) Conditional one-sided p-value under H0: odds=1 (independence)
# For "greater" association: P(A >= a_obs | odds=1)

p_cond_greater = float(stats.nchypergeom_fisher.sf(a - 1, M, n, N, 1.0))

oddsratio_hat, p_fisher_greater = fisher_exact([[a, b], [c, d]], alternative="greater")

# B) Simple Bayesian posterior on odds via grid on theta = log(odds)

a_obs = a
sigma = 1.0  # prior std on log-odds

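theta_grid = np.linspace(-4, 4, 801)  # uniform grid on theta = log(odds); mass beyond |theta| > 4 is ignored
odds_grid = np.exp(theta_grid)
dtheta = theta_grid[1] - theta_grid[0]

loglik = np.array([float(nchypergeom_fisher_logpmf(a_obs, M, n, N, o)) for o in odds_grid])
logprior = -0.5 * (theta_grid / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# Posterior density over theta, normalized on the uniform theta grid
logpost = loglik + logprior
post_theta = np.exp(logpost - np.max(logpost))
post_theta = post_theta / (post_theta.sum() * dtheta)

# Change of variables theta -> odds: p(odds) = p(theta) / odds
post = post_theta / odds_grid

# Quantiles are computed in theta (uniform spacing) and mapped back to odds
cdf_post = np.cumsum(post_theta) * dtheta

post_mean = float(np.sum(odds_grid * post_theta) * dtheta)  # E[odds] = integral of exp(theta) p(theta) dtheta
ci_low = float(odds_grid[np.searchsorted(cdf_post, 0.025)])
ci_high = float(odds_grid[np.searchsorted(cdf_post, 0.975)])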

fig = go.Figure()
fig.add_trace(go.Scatter(x=odds_grid, y=post, mode="lines", name="posterior density"))
fig.add_vline(x=1.0, line_dash="dash", line_color="black", annotation_text="odds=1")
fig.update_layout(
    title="Posterior over odds (log-odds Normal prior; fixed margins)",
    xaxis_title="odds",
    yaxis_title="posterior density",
)
fig.update_xaxes(type="log")
fig.show()

# C) Generative: sample a new table with the same margins at a chosen odds

def sample_table_with_fixed_margins(M, n, N, odds, *, rng: np.random.Generator):
    a_new = int(sample_nchypergeom_fisher_inverse_cdf(M, n, N, odds, size=1, rng=rng)[0])
    b_new = N - a_new
    c_new = n - a_new
    d_new = (M - n) - b_new
    return np.array([[a_new, b_new], [c_new, d_new]], dtype=int)

odds_sim = 2.5
sample_table = sample_table_with_fixed_margins(M, n, N, odds_sim, rng=rng)

{
    "margins": {"M": M, "n_exposed": n, "N_cases": N},
    "p_cond_greater_odds1": p_cond_greater,
    "fisher_exact_greater": float(p_fisher_greater),
    "posterior_mean_odds": post_mean,
    "posterior_95%_CI": (ci_low, ci_high),
    "sample_table_at_odds_sim": sample_table.tolist(),
}


## 11) Pitfalls

- **Parameter confusion**: SciPy uses `M` (total), `n` (Type I in population), `N` (draws), `odds` (bias). This matches `scipy.stats.hypergeom`’s `(M,n,N)` convention.
- **Fisher vs Wallenius**: they correspond to different biased sampling mechanisms. Don’t simulate Fisher’s distribution by sequentially drawing with changing probabilities (that produces Wallenius).
- **Invalid parameters**:
  - `M`, `n`, `N` must be integers with `0 ≤ n ≤ M` and `0 ≤ N ≤ M`.
  - `odds` must be strictly positive.
- **Numerical issues**:
  - Directly computing \(\binom{n}{x}\) can overflow; use log-space + log-sum-exp.
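  - The NumPy helpers rebuild a log-factorial table of length `M + 1` on every call, so very large `M` makes them slow and memory-hungry; prefer SciPy’s implementation there.
- **Degenerate cases**: if the support has length 1 (e.g., `N=0`, `N=M`, `n=0`, or `n=M`), the variance is 0 and standardized skewness/kurtosis are undefined; `nchypergeom_fisher_stats` reports them as `0.0` by convention.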
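
To make the overflow pitfall concrete, the sketch below shows a binomial coefficient that cannot be converted to a float, and a log-space PMF that handles the same scale without trouble. It uses `math.lgamma` rather than the notebook's log-factorial table; the helper names are hypothetical.

```python
import math


def log_choose(n, k):
    """log C(n, k) via log-gamma, without forming the huge integer."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_pmf_logspace(M, n, N, odds):
    """Support and PMF computed entirely from log-weights."""
    lo, hi = max(0, N - (M - n)), min(n, N)
    xs = list(range(lo, hi + 1))
    logw = [log_choose(n, x) + log_choose(M - n, N - x) + x * math.log(odds) for x in xs]
    m = max(logw)  # log-sum-exp shift
    logZ = m + math.log(sum(math.exp(v - m) for v in logw))
    return xs, [math.exp(v - logZ) for v in logw]


# Python can hold the exact integer, but it does not fit in a double.
try:
    float(math.comb(2000, 1000))
    overflowed = False
except OverflowError:
    overflowed = True
print("float(C(2000, 1000)) overflows:", overflowed)

xs, pmf = fisher_pmf_logspace(5000, 2000, 1500, 3.0)
print(len(xs), sum(pmf))
```

Probabilities far in the tails underflow to `0.0` here, which is harmless for the PMF but is why entropy code should skip zero entries before taking logs.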


## 12) Summary

- `nchypergeom_fisher` is a **finite-support discrete** distribution for biased sampling without replacement.
- The PMF is a **tilted hypergeometric**: \(p(x) \propto \binom{n}{x}\binom{M-n}{N-x}\,\omega^x\).
- Moments and the MGF follow naturally from the **log-partition function** \(A(\theta)\).
- A simple NumPy-only sampler uses **inverse CDF** on the finite support.
- The distribution is central in **exact conditional inference** for odds ratios in 2×2 tables.
