# Bayesian Logistic Regression Simulation

This notebook simulates binary data, fits a Bayesian logistic regression, and summarizes posterior inference for $\theta$.

## Model Setup

For $i=1,\ldots,n$, let $y_i\in\{0,1\}$ and $x_i\in\mathbb{R}^p$.

$$
\Pr(y_i=1\mid x_i,\theta)=\sigma(x_i'\theta),\qquad \sigma(t)=\frac{1}{1+e^{-t}}.
$$

The likelihood is
$$
p(y\mid\theta)=\prod_{i=1}^n \sigma(x_i'\theta)^{y_i}\left[1-\sigma(x_i'\theta)\right]^{1-y_i}.
$$

We use a Gaussian prior:
$$
\theta\sim N(0,\tau^2I_p).
$$

The posterior is proportional to
$$
p(\theta\mid y)\propto p(y\mid\theta)\,p(\theta).
$$
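
Taking logs makes the link to the code explicit. Since $\log\sigma(t)=t-\log(1+e^{t})$ and $\log[1-\sigma(t)]=-\log(1+e^{t})$, the log posterior can be written, up to an additive constant that does not depend on $\theta$, as
$$
\log p(\theta\mid y)=\sum_{i=1}^n\left[y_i\,x_i'\theta-\log\left(1+e^{x_i'\theta}\right)\right]-\frac{\theta'\theta}{2\tau^2}+\text{const}.
$$
The sum is what `log_likelihood` evaluates, with `np.logaddexp(0.0, eta)` computing $\log(1+e^{\eta})$, and the quadratic term is `log_prior` with $\tau$ given by `prior_sd`.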

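Sampling uses `emcee` with a Gaussian random-walk Metropolis-Hastings proposal (`emcee.moves.GaussianMove`) in place of the package's default stretch move. Each walker in the ensemble is updated with its own proposal, drawn from a Gaussian centered at that walker's current position.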
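
To make the proposal mechanism concrete, here is a minimal single-chain random-walk Metropolis sketch in plain NumPy. It is an illustration only, not `emcee`'s implementation: `random_walk_metropolis` and `toy_log_prob` are hypothetical names introduced here, and the target is a 3-dimensional standard normal rather than the logistic posterior.

```python
import numpy as np


def random_walk_metropolis(log_prob, start, scale, n_steps, seed=0):
    """Single-chain random-walk Metropolis with an isotropic Gaussian proposal."""
    rng = np.random.default_rng(seed)
    current = np.asarray(start, dtype=float)
    current_lp = log_prob(current)
    chain = np.empty((n_steps, current.size))
    n_accept = 0
    for t in range(n_steps):
        # Propose a Gaussian step centered at the current position.
        proposal = current + scale * rng.normal(size=current.size)
        proposal_lp = log_prob(proposal)
        # The proposal is symmetric, so the acceptance probability is
        # min(1, p(proposal) / p(current)); compare on the log scale.
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
            n_accept += 1
        chain[t] = current
    return chain, n_accept / n_steps


def toy_log_prob(theta):
    """Toy target (assumption): standard normal log density up to a constant."""
    return -0.5 * np.sum(theta ** 2)


chain_small, acc_small = random_walk_metropolis(toy_log_prob, np.zeros(3), 0.1, 5000)
chain_large, acc_large = random_walk_metropolis(toy_log_prob, np.zeros(3), 3.0, 5000)
print(f"scale=0.1: acceptance={acc_small:.2f}; scale=3.0: acceptance={acc_large:.2f}")
```

With a small step nearly every proposal is accepted but the chain moves slowly; with a large step most proposals are rejected. The `proposal_scale` argument of `run_mcmc` plays the same role as `scale` in this sketch.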

In [1]:
import numpy as np
import emcee

In [2]:
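def simulate_logistic_data(n_obs, beta_true, seed):
    """Simulate (X, y) from a logistic model with an intercept and two N(0, 1) covariates."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n_obs)
    x2 = rng.normal(size=n_obs)
    X = np.column_stack([np.ones(n_obs), x1, x2])  # design matrix, shape (n_obs, 3)
    eta = X @ beta_true                            # linear predictor
    p = 1.0 / (1.0 + np.exp(-eta))                 # success probabilities
    y = rng.binomial(1, p, size=n_obs)
    return X, y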


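def log_prior(theta, prior_sd):
    """Log density of the N(0, prior_sd^2 I) prior, up to an additive constant."""
    if not np.all(np.isfinite(theta)):
        return -np.inf
    return -0.5 * np.sum((theta / prior_sd) ** 2)


def log_likelihood(theta, X, y):
    """Bernoulli-logit log-likelihood in the stable form sum(y * eta - log(1 + exp(eta)))."""
    eta = X @ theta
    return np.sum(y * eta - np.logaddexp(0.0, eta))


def log_posterior(theta, X, y, prior_sd):
    """Unnormalized log posterior: log prior plus log-likelihood."""
    lp = log_prior(theta, prior_sd)
    if not np.isfinite(lp):
        return -np.inf
    return lp + log_likelihood(theta, X, y)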


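def run_mcmc(X, y, prior_sd, seed, n_walkers, burn_in, n_steps, thin, proposal_scale):
    """Sample the posterior with emcee; return thinned flat draws and the mean acceptance fraction."""
    rng = np.random.default_rng(seed + 1)  # separate stream from the data simulation
    n_dim = X.shape[1]
    if n_walkers < 2 * n_dim:
        raise ValueError(f"n_walkers must be at least {2 * n_dim}.")

    # Start all walkers in a small Gaussian ball around the origin.
    initial = rng.normal(0.0, 0.2, size=(n_walkers, n_dim))
    # Isotropic Gaussian proposal with standard deviation proposal_scale per coordinate.
    mh_move = emcee.moves.GaussianMove(cov=(proposal_scale**2) * np.eye(n_dim), mode="vector")

    sampler = emcee.EnsembleSampler(
        nwalkers=n_walkers,
        ndim=n_dim,
        log_prob_fn=log_posterior,
        args=(X, y, prior_sd),
        moves=mh_move,
    )

    # Burn in, discard those steps, then continue from the final burn-in state.
    state = sampler.run_mcmc(initial, burn_in, progress=False)
    sampler.reset()
    sampler.run_mcmc(state, n_steps, progress=False)
    samples = sampler.get_chain(flat=True, thin=thin)  # shape: (n_walkers * (n_steps // thin), n_dim)
    accept_rate = float(np.mean(sampler.acceptance_fraction))
    return samples, accept_rate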


def summarize(samples):
    """Posterior mean and equal-tailed 95% credible interval for each coefficient."""
    post_mean = np.mean(samples, axis=0)
    ci_low = np.quantile(samples, 0.025, axis=0)
    ci_high = np.quantile(samples, 0.975, axis=0)
    return post_mean, ci_low, ci_high
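
The `log_likelihood` function above uses `np.logaddexp` rather than taking logs of the fitted probabilities. The sketch below suggests why that matters: the two forms agree for moderate linear predictors, but the probability-based form breaks down when $|\eta|$ is large. The helper names `naive_log_lik` and `stable_log_lik` and the input values are made up for this illustration.

```python
import numpy as np


def naive_log_lik(eta, y):
    """Log-likelihood via explicit probabilities; fragile for large |eta|."""
    p = 1.0 / (1.0 + np.exp(-eta))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))


def stable_log_lik(eta, y):
    """Same quantity written as sum(y * eta - log(1 + exp(eta)))."""
    return np.sum(y * eta - np.logaddexp(0.0, eta))


y = np.array([0, 1, 1])

# Moderate linear predictors: both forms give the same value.
eta_moderate = np.array([-2.0, 0.5, 3.0])
naive_moderate = naive_log_lik(eta_moderate, y)
stable_moderate = stable_log_lik(eta_moderate, y)

# Extreme linear predictors: p rounds to exactly 0 or 1, so the naive
# form evaluates 0 * log(0) and returns nan; the stable form stays finite.
eta_extreme = np.array([-800.0, 0.5, 800.0])
with np.errstate(all="ignore"):
    naive_extreme = naive_log_lik(eta_extreme, y)
stable_extreme = stable_log_lik(eta_extreme, y)

print(naive_moderate, stable_moderate)
print(naive_extreme, stable_extreme)
```

A sampler can propose values of $\theta$ far from the bulk of the posterior, especially early in burn-in, so a log density that stays finite there is the safer choice.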

## Simulation Design

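True parameter: $\theta_0 = (-0.4,\;1.1,\;-1.6)'$, ordered as (intercept, coefficient on $x_1$, coefficient on $x_2$) and stored in the code as `beta_true`.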

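The MCMC settings below are chosen to run quickly while still giving stable posterior summaries: 40 walkers, 1,200 burn-in steps that are discarded, and 2,200 production steps per walker. Thinning by 5 keeps 440 steps per walker, for 17,600 retained draws in total.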
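
One rough way to judge whether summaries are stable is the effective sample size (ESS) of a chain. The sketch below is a simple NumPy estimator that truncates the autocorrelation sum at the first non-positive lag, demonstrated on a toy AR(1) series rather than on the sampler output. The names `autocorr_1d`, `effective_sample_size`, and `simulate_ar1` are hypothetical helpers for this illustration; `emcee` also offers its own estimate through `sampler.get_autocorr_time()`.

```python
import numpy as np


def autocorr_1d(x):
    """Normalized autocorrelation of a 1-D series, computed via FFT."""
    x = np.asarray(x, dtype=float)
    n = x.size
    centered = x - x.mean()
    # Zero-pad to length 2n so the circular correlation equals the linear one.
    f = np.fft.rfft(centered, n=2 * n)
    acov = np.fft.irfft(f * np.conjugate(f), n=2 * n)[:n]
    return acov / acov[0]


def effective_sample_size(x):
    """Crude ESS: n / (1 + 2 * sum of autocorrelations up to the first non-positive lag)."""
    rho = autocorr_1d(x)
    n = rho.size
    nonpos = np.flatnonzero(rho[1:] <= 0.0)
    cutoff = nonpos[0] + 1 if nonpos.size else n  # lag of the first non-positive value
    tau = 1.0 + 2.0 * np.sum(rho[1:cutoff])
    return n / tau


def simulate_ar1(phi, n, rng):
    """Toy correlated series: x[t] = phi * x[t-1] + standard normal noise."""
    x = np.empty(n)
    x[0] = rng.normal()
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x


rng = np.random.default_rng(0)
n = 20000
ess_iid = effective_sample_size(rng.normal(size=n))
ess_ar1 = effective_sample_size(simulate_ar1(0.9, n, rng))
print(f"ESS, independent draws: {ess_iid:.0f} of {n}")
print(f"ESS, AR(1) with phi=0.9: {ess_ar1:.0f} of {n}")
```

For the notebook's sampler, such an estimator would be applied to one walker and one coefficient at a time, for example `sampler.get_chain()[:, w, k]`, because the flattened chain interleaves walkers.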

In [3]:
seed = 2026
n_obs = 1000
beta_true = np.array([-0.4, 1.1, -1.6])
prior_sd = 2.5
n_walkers = 40
burn_in = 1200
n_steps = 2200
thin = 5
proposal_scale = 0.08

X, y = simulate_logistic_data(n_obs=n_obs, beta_true=beta_true, seed=seed)
print(f"n_obs={n_obs}, mean(y)={y.mean():.3f}")

n_obs=1000, mean(y)=0.458


In [4]:
samples, accept_rate = run_mcmc(
    X=X,
    y=y,
    prior_sd=prior_sd,
    seed=seed,
    n_walkers=n_walkers,
    burn_in=burn_in,
    n_steps=n_steps,
    thin=thin,
    proposal_scale=proposal_scale,
)

post_mean, ci_low, ci_high = summarize(samples)
names = ["intercept", "beta_1", "beta_2"]

print(f"Mean acceptance fraction: {accept_rate:.3f}")
print("name      true      post_mean    2.5%       97.5%")
for i, name in enumerate(names):
    print(f"{name:9s} {beta_true[i]:8.3f} {post_mean[i]:11.3f} {ci_low[i]:9.3f} {ci_high[i]:9.3f}")

Mean acceptance fraction: 0.495
name      true      post_mean    2.5%       97.5%
intercept   -0.400      -0.418    -0.580    -0.261
beta_1       1.100       1.032     0.851     1.222
beta_2      -1.600      -1.634    -1.849    -1.422


## Prediction Exercise (Posterior Predictive)

Using posterior draws $\{\theta^{(s)}\}_{s=1}^S$, the posterior predictive probability for a new feature vector $\tilde x$ is
$$
\Pr(\tilde y=1\mid \tilde x, y) \approx \frac{1}{S}\sum_{s=1}^S \sigma(\tilde x'\theta^{(s)}).
$$
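
The average in this formula is taken over $\sigma(\tilde x'\theta^{(s)})$, which is not the same as plugging the posterior mean of $\theta$ into $\sigma$. The sketch below illustrates the gap with synthetic draws; `toy_draws` and `x_new` are invented for the example and are deliberately far more dispersed than the posterior in this notebook, so that the difference is visible.

```python
import numpy as np


def sigmoid(t):
    """Logistic function sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))


rng = np.random.default_rng(0)

# Synthetic "posterior" draws for a two-coefficient model (assumption:
# independent Gaussians; these do not come from the notebook's sampler).
toy_draws = rng.normal(loc=[0.0, 2.0], scale=[0.5, 1.0], size=(20000, 2))
x_new = np.array([1.0, 1.5])  # hypothetical new feature vector

# Posterior predictive: average the probability over draws.
predictive = sigmoid(toy_draws @ x_new).mean()

# Plug-in: evaluate the probability at the posterior mean of theta.
plug_in = sigmoid(toy_draws.mean(axis=0) @ x_new)

print(f"posterior predictive: {predictive:.3f}")
print(f"plug-in at the mean:  {plug_in:.3f}")
```

Averaging over draws pulls the predicted probability toward 0.5 relative to the plug-in value, reflecting parameter uncertainty. When the posterior is tightly concentrated, the two numbers are nearly identical.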

Below we evaluate predictions on a simulated holdout sample and report point-prediction metrics (accuracy and log loss) together with 95% posterior credible intervals for each predicted probability.

In [5]:
# Holdout data from the same DGP
n_test = 300
X_test, y_test = simulate_logistic_data(n_obs=n_test, beta_true=beta_true, seed=seed + 100)

# Posterior predictive probabilities for each test observation
eta_draws = samples @ X_test.T                      # shape: (S, n_test)
pred_draws = 1.0 / (1.0 + np.exp(-eta_draws))      # shape: (S, n_test)
pred_mean = pred_draws.mean(axis=0)
pred_low = np.quantile(pred_draws, 0.025, axis=0)
pred_high = np.quantile(pred_draws, 0.975, axis=0)

# Classification metrics based on posterior mean probability
y_hat = (pred_mean >= 0.5).astype(int)
accuracy = np.mean(y_hat == y_test)
eps = 1e-12
log_loss = -np.mean(y_test * np.log(pred_mean + eps) + (1 - y_test) * np.log(1 - pred_mean + eps))

print(f"Test accuracy (threshold 0.5): {accuracy:.3f}")
print(f"Test log loss: {log_loss:.3f}")
print("")
print("First 5 posterior predictive probabilities with 95% intervals:")
print("obs   y_test   p_mean    p_2.5%    p_97.5%")
for i in range(5):
    print(f"{i:3d}   {y_test[i]:6d}   {pred_mean[i]:6.3f}   {pred_low[i]:7.3f}   {pred_high[i]:7.3f}")

Test accuracy (threshold 0.5): 0.793
Test log loss: 0.404

First 5 posterior predictive probabilities with 95% intervals:
obs   y_test   p_mean    p_2.5%    p_97.5%
  0        0    0.072     0.052     0.096
  1        0    0.021     0.013     0.033
  2        1    0.864     0.826     0.897
  3        0    0.079     0.058     0.105
  4        1    0.653     0.605     0.699


## Interpretation

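- In this run, each posterior mean lies within 0.07 of the corresponding true value. Across repeated simulations, the posterior means should scatter around the true values, since the prior ($\tau=2.5$) is weak relative to $n=1000$ observations.
- All three 95% credible intervals contain the true coefficient here. With a prior this weak at this sample size, coverage in repeated simulations should be close to 95% per coefficient, so an occasional miss is expected.
- The mean acceptance fraction (0.495 in this run) can be tuned with `proposal_scale`: larger values produce bolder proposals and a lower acceptance fraction, while smaller values raise it at the cost of smaller moves.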
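
The coverage statement can be checked directly against the reported summary table. The sketch below copies the printed values by hand (rounded to three decimals) instead of reusing the live arrays, so it runs on its own; the variable names are chosen for this example.

```python
import numpy as np

names = ["intercept", "beta_1", "beta_2"]
truth = np.array([-0.4, 1.1, -1.6])

# Values copied from the printed summary table above.
post_mean = np.array([-0.418, 1.032, -1.634])
ci_low = np.array([-0.580, 0.851, -1.849])
ci_high = np.array([-0.261, 1.222, -1.422])

covered = (ci_low <= truth) & (truth <= ci_high)  # all three entries are True
abs_error = np.abs(post_mean - truth)
width = ci_high - ci_low

for name, c, e, w in zip(names, covered, abs_error, width):
    print(f"{name:9s} covered={c}  |post_mean - true|={e:.3f}  interval width={w:.3f}")
```

A single run only shows whether each interval happened to cover the truth. Estimating the coverage rate would require repeating the whole simulation with many different seeds.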