# Simulation-Based Inference: A Practical Introduction

## 1. Introduction and Motivation

Simulation-based inference (SBI), also known as likelihood-free inference or implicit likelihood inference, addresses a fundamental challenge in modern scientific modeling: **what do we do when we can simulate data from our model but cannot evaluate the likelihood?**



In traditional Bayesian inference, we want to compute the posterior:

$$p(\theta | x) = \frac{p(x | \theta) p(\theta)}{p(x)}$$

This requires:
- A prior $p(\theta)$
- A likelihood $p(x | \theta)$ that we can evaluate
- The evidence $p(x) = \int p(x | \theta) p(\theta) \, d\theta$, or a sampler such as MCMC that only needs the posterior up to this normalizing constant

**However**, in many scientific domains (astrophysics, cosmology, neuroscience, ecology, epidemiology), we have:
- ✅ A simulator that can generate data: $x \sim p(x | \theta)$
- ✅ A prior over parameters: $\theta \sim p(\theta)$
- ❌ No tractable likelihood function $p(x | \theta)$


## 2. The Core Idea of SBI

The key insight: **If we can simulate, we can learn the statistical relationships we need using machine learning.**

Instead of deriving likelihoods analytically, we:
1. Generate many simulations: $(\theta_i, x_i)$ where $\theta_i \sim p(\theta)$, $x_i \sim p(x|\theta_i)$
2. Use neural networks to learn probability distributions from these samples
3. Apply these learned distributions to perform inference

### Three Main Approaches

| Method | What it learns | Target Distribution |
|--------|---------------|-------------------|
| **Neural Posterior Estimation (NPE)** | $p(\theta \mid x)$ directly | Posterior |
| **Neural Likelihood Estimation (NLE)** | $p(x \mid \theta)$ | Likelihood |
| **Neural Ratio Estimation (NRE)** | $r(x,\theta) = \frac{p(x \mid \theta)}{p(x)}$ | Likelihood-to-evidence ratio |



## 3. Mathematical Framework

### 3.1 The Joint Distribution

Start with the joint distribution over parameters and data:

$$p(\theta, x) = p(x | \theta) p(\theta)$$

From simulation, we can sample from this joint distribution:
- Sample $\theta \sim p(\theta)$
- Sample $x \sim p(x|\theta)$ using the simulator

This gives us a dataset: $\mathcal{D} = \{(\theta_i, x_i)\}_{i=1}^N$
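The two sampling steps can be sketched in a few lines of Python. The standard normal prior, the additive-noise `simulator`, and `noise_std=0.5` are illustrative assumptions standing in for a real scientific simulator.

```python
import random

def sample_prior(rng):
    """Draw theta from the assumed prior p(theta) = N(0, 1)."""
    return rng.gauss(0.0, 1.0)

def simulator(theta, rng, noise_std=0.5):
    """Draw x ~ p(x | theta); here theta plus Gaussian noise (toy assumption)."""
    return theta + rng.gauss(0.0, noise_std)

def simulate_dataset(n, seed=0):
    """Return n pairs (theta_i, x_i) drawn from the joint p(theta, x)."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n):
        theta = sample_prior(rng)
        x = simulator(theta, rng)
        dataset.append((theta, x))
    return dataset

dataset = simulate_dataset(5000)
```

Every method in this section consumes a dataset of this form; only the quantity learned from it differs.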

### 3.2 Neural Posterior Estimation (NPE)

**Goal**: Approximate $p(\theta | x)$ directly using a neural network.

**Method**: Train a conditional density estimator $q_\phi(\theta | x)$ to minimize:

$$\mathcal{L}_{\text{NPE}}(\phi) = -\mathbb{E}_{(\theta, x) \sim p(\theta, x)}[\log q_\phi(\theta | x)]$$

This is the **negative log-likelihood** of the training data under our neural density estimator.
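To see why this loss targets the posterior, one can rewrite the expected Kullback-Leibler divergence between the true posterior and $q_\phi$ (a standard identity, given here as a short derivation):

$$\begin{aligned}
\mathbb{E}_{x \sim p(x)}\left[ D_{\text{KL}}\big(p(\theta | x) \,\|\, q_\phi(\theta | x)\big) \right]
&= \mathbb{E}_{(\theta, x) \sim p(\theta, x)}\left[\log p(\theta | x) - \log q_\phi(\theta | x)\right] \\
&= \mathcal{L}_{\text{NPE}}(\phi) + \text{const}
\end{aligned}$$

The constant $\mathbb{E}_{(\theta, x) \sim p(\theta, x)}[\log p(\theta | x)]$ does not depend on $\phi$, so minimizing $\mathcal{L}_{\text{NPE}}$ minimizes the average KL divergence, which reaches zero only when $q_\phi(\theta | x) = p(\theta | x)$ for almost every $x$.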

**Key equation**: With a sufficiently flexible $q_\phi$ and enough simulations, the minimizer satisfies

$$q_{\phi^*}(\theta | x) \approx p(\theta | x)$$

where $\phi^*$ denotes the optimal parameters.
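As a minimal sketch of this training objective, the example below replaces the neural network with a linear-Gaussian family $q(\theta | x) = \mathcal{N}(a x + b, s^2)$, for which minimizing the average negative log-likelihood has a closed-form least-squares solution. The toy model ($\theta \sim \mathcal{N}(0, 1)$, $x | \theta \sim \mathcal{N}(\theta, 0.5^2)$) and all function names are assumptions made for illustration; its exact posterior is $\mathcal{N}(0.8x, 0.2)$, so the fit can be checked.

```python
import math
import random

def simulate_dataset(n, rng, noise_std=0.5):
    """Toy joint samples: theta ~ N(0, 1), x | theta ~ N(theta, noise_std^2)."""
    data = []
    for _ in range(n):
        theta = rng.gauss(0.0, 1.0)
        data.append((theta, theta + rng.gauss(0.0, noise_std)))
    return data

def fit_linear_gaussian_posterior(data):
    """Minimize the average NLL of q(theta | x) = N(a * x + b, s2).

    For this family the minimizer is ordinary least squares of theta on x,
    with s2 equal to the mean squared residual.
    """
    n = len(data)
    mean_t = sum(t for t, _ in data) / n
    mean_x = sum(x for _, x in data) / n
    cov_xt = sum((x - mean_x) * (t - mean_t) for t, x in data) / n
    var_x = sum((x - mean_x) ** 2 for _, x in data) / n
    a = cov_xt / var_x
    b = mean_t - a * mean_x
    s2 = sum((t - (a * x + b)) ** 2 for t, x in data) / n
    return a, b, s2

def log_q(theta, x, params):
    """Log density of the fitted estimator q(theta | x)."""
    a, b, s2 = params
    return -0.5 * math.log(2.0 * math.pi * s2) - (theta - (a * x + b)) ** 2 / (2.0 * s2)

rng = random.Random(0)
params = fit_linear_gaussian_posterior(simulate_dataset(20000, rng))
a, b, s2 = params

# Amortized use: the posterior for any observation is read off directly.
x_o = 1.0
post_mean, post_var = a * x_o + b, s2
```

In practice the linear-Gaussian family would be replaced by a flexible conditional density estimator trained by stochastic gradient descent on the same loss.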

### 3.3 Neural Likelihood Estimation (NLE)

**Goal**: Approximate $p(x | \theta)$ using a neural network.

**Method**: Train a conditional density estimator $q_\phi(x | \theta)$ to minimize:

$$\mathcal{L}_{\text{NLE}}(\phi) = -\mathbb{E}_{(\theta, x) \sim p(\theta, x)}[\log q_\phi(x | \theta)]$$

**Key equations**: Once we have $q_\phi(x | \theta) \approx p(x | \theta)$, we can use it in MCMC or SMC to sample from:

$$p(\theta | x_o) \propto q_\phi(x_o | \theta) p(\theta)$$
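The two stages (fit a likelihood surrogate, then sample with MCMC) can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the surrogate is a linear-Gaussian $q(x | \theta) = \mathcal{N}(a\theta + b, s^2)$ instead of a neural network, the prior is $\mathcal{N}(0, 1)$, and the sampler is a plain random-walk Metropolis algorithm.

```python
import math
import random

def simulate_dataset(n, rng, noise_std=0.5):
    """Toy joint samples: theta ~ N(0, 1), x | theta ~ N(theta, noise_std^2)."""
    data = []
    for _ in range(n):
        theta = rng.gauss(0.0, 1.0)
        data.append((theta, theta + rng.gauss(0.0, noise_std)))
    return data

def fit_likelihood(data):
    """Fit q(x | theta) = N(a * theta + b, s2) by maximum likelihood."""
    n = len(data)
    mean_t = sum(t for t, _ in data) / n
    mean_x = sum(x for _, x in data) / n
    cov = sum((t - mean_t) * (x - mean_x) for t, x in data) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in data) / n
    a = cov / var_t
    b = mean_x - a * mean_t
    s2 = sum((x - (a * t + b)) ** 2 for t, x in data) / n
    return a, b, s2

def log_target(theta, x_o, params):
    """log q(x_o | theta) + log p(theta), up to an additive constant."""
    a, b, s2 = params
    log_lik = -((x_o - (a * theta + b)) ** 2) / (2.0 * s2)
    log_prior = -0.5 * theta ** 2
    return log_lik + log_prior

def metropolis_hastings(log_density, n_steps, rng, step=0.8, start=0.0):
    """Random-walk Metropolis sampler for a 1-D unnormalized log density."""
    theta, log_p = start, log_density(start)
    samples = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step)
        log_p_new = log_density(proposal)
        # Accept with probability min(1, target(proposal) / target(theta)).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            theta, log_p = proposal, log_p_new
        samples.append(theta)
    return samples

rng = random.Random(1)
params = fit_likelihood(simulate_dataset(5000, rng))
x_o = 1.0
chain = metropolis_hastings(lambda th: log_target(th, x_o, params), 20000, rng)
posterior_samples = chain[2000:]  # discard burn-in
posterior_mean = sum(posterior_samples) / len(posterior_samples)
```

The fitted surrogate does not depend on $x_o$, so the same `params` can be reused for other observations; only the MCMC run has to be repeated.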

### 3.4 Neural Ratio Estimation (NRE)

**Goal**: Approximate the likelihood-to-evidence ratio.

**Method**: This uses a clever binary classification trick. Define:

$$r(x, \theta) = \frac{p(x | \theta)}{p(x)} = \frac{p(x, \theta)}{p(\theta)p(x)}$$

Train a binary classifier $d_\phi(x, \theta)$ to distinguish between:
- Positive class: $(x, \theta)$ drawn from joint $p(x, \theta) = p(x|\theta)p(\theta)$
- Negative class: $(x, \theta)$ drawn from marginals $p(x)p(\theta)$

**Key equation**: With equally many positive and negative examples, the optimal classifier (the minimizer of the binary cross-entropy) satisfies:

$$d_\phi^*(x, \theta) = \frac{p(x, \theta)}{p(x, \theta) + p(x)p(\theta)} = \frac{r(x, \theta)}{1 + r(x, \theta)}$$

Therefore:
$$r(x, \theta) = \frac{d_\phi^*(x, \theta)}{1 - d_\phi^*(x, \theta)}$$

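The posterior then follows up to normalization, with the trained classifier $d_\phi$ standing in for $d_\phi^*$; as with NLE, samples are typically drawn with MCMC:
$$p(\theta | x_o) \propto r(x_o, \theta) p(\theta) \approx \frac{d_\phi(x_o, \theta)}{1 - d_\phi(x_o, \theta)} p(\theta)$$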
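The classification trick can be illustrated with a small sketch in which logistic regression on quadratic features plays the role of the neural classifier. The toy Gaussian model, the feature map, and the Newton solver are all assumptions chosen so that the example runs with the standard library only; for this model the exact log-ratio is a linear combination of the chosen features.

```python
import math
import random

def features(x, theta):
    # Quadratic features; the exact log-ratio of the toy model is linear in them.
    return [1.0, x * x, x * theta, theta * theta]

def sigmoid(z):
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def solve(A, b):
    """Solve A u = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    u = [0.0] * n
    for r in range(n - 1, -1, -1):
        u[r] = (M[r][n] - sum(M[r][k] * u[k] for k in range(r + 1, n))) / M[r][r]
    return u

def train_classifier(F, labels, n_iter=15, ridge=1e-6):
    """Logistic regression fitted by Newton's method on the cross-entropy."""
    d = len(F[0])
    w = [0.0] * d
    for _ in range(n_iter):
        grad = [0.0] * d
        # The tiny ridge term only keeps the Hessian safely invertible.
        hess = [[ridge if i == j else 0.0 for j in range(d)] for i in range(d)]
        for f, y in zip(F, labels):
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)))
            for i in range(d):
                grad[i] += (y - p) * f[i]
                for j in range(d):
                    hess[i][j] += p * (1.0 - p) * f[i] * f[j]
        w = [wi + si for wi, si in zip(w, solve(hess, grad))]
    return w

def log_ratio(x, theta, w):
    """log r = log(d / (1 - d)), which is exactly the classifier logit."""
    return sum(wi * fi for wi, fi in zip(w, features(x, theta)))

rng = random.Random(2)
thetas = [rng.gauss(0.0, 1.0) for _ in range(3000)]
xs = [t + rng.gauss(0.0, 0.5) for t in thetas]
shuffled = thetas[:]
rng.shuffle(shuffled)  # breaks the pairing, giving (x, theta) ~ p(x) p(theta)
F = [features(x, t) for x, t in zip(xs, thetas)]
F += [features(x, t) for x, t in zip(xs, shuffled)]
labels = [1.0] * len(xs) + [0.0] * len(xs)
w = train_classifier(F, labels)

# Posterior on a grid: p(theta | x_o) is proportional to r(x_o, theta) p(theta).
x_o = 1.0
grid = [-4.0 + 0.01 * i for i in range(801)]
weights = [math.exp(log_ratio(x_o, t, w) - 0.5 * t * t) for t in grid]
posterior_mean = sum(t * wt for t, wt in zip(grid, weights)) / sum(weights)
```

Shuffling the parameters within a batch is a common way to obtain negative examples without extra simulations. In one dimension the posterior can be normalized on a grid as above; in higher dimensions the learned ratio would be plugged into MCMC.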



## 4. Neural Density Estimators

The key to SBI is using flexible neural networks to represent probability distributions. Here are the most common architectures:

### 4.1 Mixture Density Networks (MDN)

A neural network outputs the parameters of a mixture distribution:

$$q_\phi(\theta | x) = \sum_{k=1}^K \pi_k(x) \mathcal{N}(\theta | \mu_k(x), \Sigma_k(x))$$

where a neural network $f_\phi(x)$ outputs $\{\pi_k, \mu_k, \Sigma_k\}$ for each mixture component.

**Advantages**: 
- Can represent multimodal distributions
- Relatively simple to implement

**Disadvantages**:
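- Flexibility is limited by the number of components $K$ and the Gaussian shape of each component
- Difficult to scale to high dimensions: each full covariance $\Sigma_k$ has $O(d^2)$ entries for $d$ parameters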
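Evaluating the mixture density is straightforward once the network outputs are available. The sketch below handles the one-dimensional case with scalar standard deviations instead of covariance matrices; `toy_network` is a hypothetical stand-in for $f_\phi(x)$, hard-coded to produce a bimodal distribution.

```python
import math

def softmax(logits):
    """Map unconstrained logits to mixture weights that sum to one."""
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gaussian_log_pdf(theta, mean, std):
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((theta - mean) / std) ** 2

def mixture_log_density(theta, weights, means, stds):
    """log sum_k pi_k N(theta | mu_k, sigma_k^2), computed with log-sum-exp."""
    terms = [math.log(w) + gaussian_log_pdf(theta, m, s)
             for w, m, s in zip(weights, means, stds)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def toy_network(x):
    """Hypothetical stand-in for f_phi(x): two modes at +/- sqrt(|x|)."""
    root = math.sqrt(abs(x))
    weights = softmax([0.0, 0.0])  # equal weights
    means = [-root, root]
    stds = [0.3, 0.3]
    return weights, means, stds

weights, means, stds = toy_network(2.25)  # modes at -1.5 and +1.5
log_q_at_mode = mixture_log_density(1.5, weights, means, stds)
```

The log-sum-exp form avoids numerical underflow when a point lies far from every component, which matters when the negative log-likelihood is used as a training loss.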

### 4.2 Normalizing Flows

A normalizing flow transforms a simple base distribution through a series of invertible transformations:

$$\theta = T_\phi(z, x), \quad z \sim p(z)$$

The density is given by the change of variables formula, with the base density evaluated at $z = T_\phi^{-1}(\theta, x)$:

$$q_\phi(\theta | x) = p\left(T_\phi^{-1}(\theta, x)\right) \left| \det \frac{\partial T_\phi^{-1}(\theta, x)}{\partial \theta} \right|$$
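The simplest instance of this formula is a one-dimensional conditional affine flow, $\theta = \mu(x) + \sigma(x) z$ with a standard normal base distribution. In the sketch below, `conditioner` is a hypothetical stand-in for the network that maps $x$ to the shift and log-scale; its coefficients are arbitrary.

```python
import math
import random

def conditioner(x):
    """Hypothetical stand-in for a network: returns (shift, log_scale) given x."""
    shift = 0.8 * x
    log_scale = -0.5 + 0.1 * x
    return shift, log_scale

def forward(z, x):
    """theta = T(z, x): used for sampling."""
    shift, log_scale = conditioner(x)
    return shift + math.exp(log_scale) * z

def inverse(theta, x):
    """z = T^{-1}(theta, x): used for density evaluation."""
    shift, log_scale = conditioner(x)
    return (theta - shift) * math.exp(-log_scale)

def log_density(theta, x):
    """log q(theta | x) = log p(z) + log |d z / d theta|."""
    _, log_scale = conditioner(x)
    z = inverse(theta, x)
    log_base = -0.5 * math.log(2.0 * math.pi) - 0.5 * z * z
    log_abs_det = -log_scale  # d z / d theta = exp(-log_scale)
    return log_base + log_abs_det

def sample(x, rng):
    """Draw theta ~ q(theta | x) by pushing base noise through the flow."""
    return forward(rng.gauss(0.0, 1.0), x)

rng = random.Random(0)
draw = sample(1.3, rng)
```

A single affine layer can only represent a Gaussian; practical flows stack many such invertible layers, and the log-determinants of the layers simply add.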

**Common flow architectures**:
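- **MAF (Masked Autoregressive Flow)**: Fast density evaluation but sequential, slower sampling; well suited to density estimation tasks such as NPE
- **IAF (Inverse Autoregressive Flow)**: Fast sampling but slow density evaluation at arbitrary points
- **Neural Spline Flows**: Use monotonic rational-quadratic splines as the invertible transformations
- **Continuous Normalizing Flows**: Define the transformation as the solution of a neural ODE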

**Advantages**:
- Exact density evaluation
- Exact sampling
- Very flexible

**Disadvantages**:
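- Can be computationally expensive to train and, depending on the architecture, to sample from or to evaluate
- Performance is sensitive to architecture choices such as the number of layers and the type of transformation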

### 4.3 Autoregressive Models

Decompose the joint density using the chain rule:

$$q_\phi(\theta | x) = \prod_{i=1}^d q_\phi(\theta_i | \theta_{<i}, x)$$

Each conditional is modeled by a neural network.
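A two-dimensional sketch of the factorization is given below. Each conditional is Gaussian, and `conditional_params` is a hypothetical stand-in for the per-dimension networks, with arbitrary hard-coded coefficients.

```python
import math
import random

def normal_log_pdf(value, mean, std):
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((value - mean) / std) ** 2

def conditional_params(i, theta_prev, x):
    """Hypothetical stand-in for the network of q(theta_i | theta_<i, x).

    Returns (mean, std); dimension 0 depends on x only, dimension 1 also
    depends on the already-known theta_0.
    """
    if i == 0:
        return 0.8 * x, 0.5
    return 0.5 * theta_prev[0] + 0.2 * x, 0.3

def log_density(theta, x):
    """Chain rule: sum the conditional log densities over dimensions."""
    total = 0.0
    for i in range(len(theta)):
        mean, std = conditional_params(i, theta[:i], x)
        total += normal_log_pdf(theta[i], mean, std)
    return total

def sample(x, rng, dim=2):
    """Ancestral sampling: draw each dimension given the earlier ones."""
    theta = []
    for i in range(dim):
        mean, std = conditional_params(i, theta, x)
        theta.append(rng.gauss(mean, std))
    return theta

rng = random.Random(0)
theta_draw = sample(1.0, rng)
log_q = log_density(theta_draw, 1.0)
```

Density evaluation can process all dimensions independently once $\theta$ is known, whereas sampling is inherently sequential because each dimension needs the previous draws.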




## 5. Sequential vs. Amortized Inference

### 5.1 Amortized Inference

**Single round**: Train once on many simulations, then apply to any observation.

$$(\theta_i, x_i) \sim p(\theta, x) \text{ for } i=1,\ldots,N$$

Train $q_\phi(\theta | x)$ on this dataset. Then for any new observation $x_o$, directly evaluate $q_\phi(\theta | x_o)$.

**Advantages**: Very fast at inference time  
**Disadvantages**: May be inaccurate for observations that are unlikely under the prior predictive distribution, where few training simulations fall

### 5.2 Sequential Inference

**Multiple rounds**: Iteratively focus simulations near observed data.

**Round 1**: Sample from prior, train initial approximation $q_\phi^{(1)}$

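**Round t** (for $t \geq 2$):
- Sample $\theta \sim q_\phi^{(t-1)}(\cdot | x_o)$ (focus near posterior)
- Simulate $x \sim p(x|\theta)$
- Retrain to get $q_\phi^{(t)}$

Because $\theta$ is no longer drawn from the prior in later rounds, NPE needs a correction for the proposal distribution (for example, Automatic Posterior Transformation). NLE and NRE need no such correction: the likelihood does not depend on how $\theta$ was chosen, and the ratio changes only by a factor that is constant in $\theta$.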

**Advantages**: More simulation-efficient and typically more accurate for the specific observation $x_o$  
**Disadvantages**: Not amortized; must re-simulate and retrain for each new observation
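The round structure can be sketched with a sequential variant of likelihood estimation, chosen here because a likelihood surrogate can be trained on parameters drawn from any proposal without a correction term. The linear-Gaussian surrogate, the $\mathcal{N}(0, 1)$ prior, and the closed-form posterior update are toy assumptions that keep the example free of MCMC.

```python
import random

def simulator(theta, rng, noise_std=0.5):
    """Toy simulator: x | theta ~ N(theta, noise_std^2)."""
    return theta + rng.gauss(0.0, noise_std)

def fit_likelihood(pairs):
    """Least-squares fit of q(x | theta) = N(a * theta + b, s2)."""
    n = len(pairs)
    mean_t = sum(t for t, _ in pairs) / n
    mean_x = sum(x for _, x in pairs) / n
    cov = sum((t - mean_t) * (x - mean_x) for t, x in pairs) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in pairs) / n
    a = cov / var_t
    b = mean_x - a * mean_t
    s2 = sum((x - (a * t + b)) ** 2 for t, x in pairs) / n
    return a, b, s2

def posterior_from_likelihood(params, x_o):
    """Gaussian posterior implied by a N(0, 1) prior and the fitted likelihood."""
    a, b, s2 = params
    precision = 1.0 + a * a / s2
    mean = (a * (x_o - b) / s2) / precision
    return mean, 1.0 / precision

def sequential_inference(x_o, n_rounds=3, n_per_round=500, seed=3):
    rng = random.Random(seed)
    proposal_mean, proposal_var = 0.0, 1.0  # round 1 proposes from the prior
    pairs = []
    for _ in range(n_rounds):
        # Simulate at parameters drawn from the current proposal.
        for _ in range(n_per_round):
            theta = rng.gauss(proposal_mean, proposal_var ** 0.5)
            pairs.append((theta, simulator(theta, rng)))
        # Retrain on all simulations so far, then update the proposal.
        params = fit_likelihood(pairs)
        proposal_mean, proposal_var = posterior_from_likelihood(params, x_o)
    return proposal_mean, proposal_var

post_mean, post_var = sequential_inference(x_o=1.5)
```

After the first round, simulations concentrate where the posterior for this particular $x_o$ has mass, which is where the surrogate needs to be accurate. The result cannot be reused for a different observation without running new rounds.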


## 6. Comparison with Traditional Methods

| Method | Requires Likelihood | Flexibility | Simulation Efficiency |
|--------|-------------------|-------------|----------------------|
| **MCMC** | Yes | High | N/A |
| **ABC** | No | Medium | Low (many rejections) |
| **NPE** | No | High | High (amortized) |
| **NLE + MCMC** | No | High | Medium |
| **NRE** | No | High | Medium |
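The low simulation efficiency of ABC can be made concrete with a rejection sampler. The sketch below assumes a toy Gaussian model ($\theta \sim \mathcal{N}(0, 1)$, $x | \theta \sim \mathcal{N}(\theta, 0.5^2)$) and the absolute difference as the distance; both are illustrative choices.

```python
import random

def abc_rejection(x_o, epsilon, n_proposals, seed=4):
    """Keep prior draws whose simulated data land within epsilon of x_o."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_proposals):
        theta = rng.gauss(0.0, 1.0)        # draw from the prior
        x = theta + rng.gauss(0.0, 0.5)    # run the toy simulator
        if abs(x - x_o) < epsilon:         # hand-crafted distance check
            accepted.append(theta)
    return accepted

x_o = 1.0
n_proposals = 50000
results = {}
for epsilon in (1.0, 0.3, 0.1):
    accepted = abc_rejection(x_o, epsilon, n_proposals)
    rate = len(accepted) / n_proposals
    mean = sum(accepted) / len(accepted)
    results[epsilon] = (rate, mean)
    print(f"epsilon={epsilon}: acceptance rate {rate:.3f}, posterior mean {mean:.3f}")
```

Shrinking `epsilon` reduces the approximation bias but discards a growing share of simulations, and the effect becomes far more severe as the dimension of $x$ grows. The neural methods in the table instead use every simulation for training.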

### Key Advantages of SBI:
1. **Works with implicit likelihoods**: No need to derive complex probability densities
2. **Amortization**: Train once, apply to many observations (for NPE)
3. **Scalability**: Modern neural networks handle high-dimensional data
4. **Automatic**: No hand-crafted distance metrics (unlike ABC)

### Challenges:
1. **Diagnostics**: Harder to assess convergence and accuracy
2. **Simulation cost**: Still needs many simulations for training
3. **Extrapolation**: Networks may perform poorly outside training distribution
4. **Simulator noise**: Highly variable simulator outputs require more simulations before the learned distribution becomes reliable

## 7. Practical Considerations

### 7.1 Summary Statistics vs. Raw Data

For high-dimensional data, it is common to replace the raw simulator output with summary statistics:
- Reduces dimensionality
- Focuses on relevant features
- **Caution**: Summaries that are not sufficient for $\theta$ discard information needed for inference
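The caution can be illustrated with a small experiment. Both simulators below are hypothetical: in one, $\theta$ sets the location of 50 Gaussian draws; in the other, it sets their scale. The sample mean is used as the only summary, and its correlation with $\theta$ serves as a crude proxy for how informative it is.

```python
import random

def correlation(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

def location_simulator(theta, rng, n=50):
    """theta controls the mean of the data."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]

def scale_simulator(theta, rng, n=50):
    """theta controls the standard deviation of the data."""
    return [rng.gauss(0.0, theta) for _ in range(n)]

def sample_mean(x):
    return sum(x) / len(x)

rng = random.Random(5)

loc_thetas = [rng.gauss(0.0, 1.0) for _ in range(2000)]
loc_summaries = [sample_mean(location_simulator(t, rng)) for t in loc_thetas]

scale_thetas = [rng.uniform(0.5, 2.0) for _ in range(2000)]
scale_summaries = [sample_mean(scale_simulator(t, rng)) for t in scale_thetas]

corr_location = correlation(loc_thetas, loc_summaries)
corr_scale = correlation(scale_thetas, scale_summaries)
```

For the location model the sample mean tracks $\theta$ closely, while for the scale model it is nearly uncorrelated with $\theta$; a summary such as the sample standard deviation would be needed there. The same summary can therefore be adequate for one simulator and nearly useless for another.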

### 7.2 Model Misspecification

What if the simulator doesn't match reality?

- SBI learns $p(\theta | x, \mathcal{M})$ where $\mathcal{M}$ is your model
- Garbage in, garbage out: poor simulators give poor inference
- Consider **model criticism** and **posterior predictive checks**
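A posterior predictive check can be sketched as follows: draw parameters from the posterior, simulate replicated datasets, and compare a test statistic on the replicates with its value on the observed data. The example assumes a toy simulator with known noise, uses the exact conjugate posterior as a stand-in for an SBI posterior, and picks the sample standard deviation as the statistic; all three are illustrative assumptions.

```python
import random
import statistics

def simulator(theta, rng, n=30, noise_std=0.5):
    """Toy simulator: n draws from N(theta, noise_std^2)."""
    return [rng.gauss(theta, noise_std) for _ in range(n)]

def sample_posterior(x_obs, rng, n_samples, noise_std=0.5):
    """Exact posterior for a N(0, 1) prior; stands in for an SBI posterior."""
    precision = 1.0 + len(x_obs) / noise_std ** 2
    mean = (sum(x_obs) / noise_std ** 2) / precision
    std = precision ** -0.5
    return [rng.gauss(mean, std) for _ in range(n_samples)]

def posterior_predictive_pvalue(x_obs, statistic, rng, n_samples=1000):
    """Fraction of replicated datasets whose statistic is >= the observed one."""
    observed = statistic(x_obs)
    thetas = sample_posterior(x_obs, rng, n_samples)
    replicated = [statistic(simulator(t, rng, n=len(x_obs))) for t in thetas]
    return sum(r >= observed for r in replicated) / n_samples

rng = random.Random(6)
well_specified = [rng.gauss(1.0, 0.5) for _ in range(30)]
misspecified = [rng.gauss(1.0, 2.0) for _ in range(30)]  # noisier than the simulator assumes

p_good = posterior_predictive_pvalue(well_specified, statistics.stdev, rng)
p_bad = posterior_predictive_pvalue(misspecified, statistics.stdev, rng)
```

A value of `p_bad` close to 0 signals that the simulator cannot reproduce the spread of the observed data, even though the posterior over $\theta$ itself may look perfectly reasonable.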

### 7.3 Computational Cost

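Training the neural networks typically requires:
- Many simulations (thousands to millions)
- GPU compute for larger networks and datasets
- Careful hyperparameter tuning

**But**: Once trained, amortized inference is very fast: an NPE network yields the posterior for a new observation without further simulations or retraining, while NLE and NRE still need an MCMC run per observation.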



**Key Papers**:
- Papamakarios & Murray (2016) - Fast ε-free Inference of Simulation Models with Bayesian Conditional Density Estimation
- Greenberg et al. (2019) - Automatic Posterior Transformation for Likelihood-Free Inference
- Cranmer et al. (2020) - The frontier of simulation-based inference

**Software**:
- `sbi`: Python package for SBI (https://sbi-dev.github.io/sbi/)
- `lampe`: Likelihood-free AMortized Posterior Estimation
- `sbibm`: SBI benchmarking framework
