# Data Modeling II: Bayesian Statistics

Bayesian statistics systematically combines our prior knowledge about a situation with new data to refine what we believe is true.
In countless real-world scenarios---ranging from medical diagnostics to fundamental physics experiments---information we already have (like the rarity of a disease or theoretical constraints on a physical parameter) can significantly shape how we interpret fresh evidence.
By framing unknowns as probability distributions, Bayesian methods provide a coherent framework for updating those distributions whenever new observations appear, yielding a posterior that reflects all evidence, old and new.
This unifying perspective makes it possible to quantify uncertainties in a transparent way, avoid common logical pitfalls, and naturally propagate errors to any derived quantities of interest.

## Medical Test "Paradox"

The medical test paradox occurs when a diagnostic test is described as highly accurate, yet a person who tests positive for a rare disease has a much lower chance of actually having it than the stated accuracy suggests.
This seeming contradiction highlights the importance of prior knowledge, or base rates.

Consider a disease that affects only 1% of the population.
Imagine a test that has:
* 99% sensitivity: if you **do** have the disease, it flags you positive 99% of the time.
* 99% specificity: if you **do not** have the disease, it correctly flags you negative 99% of the time.

Many people assume that a "99% accurate" test implies a 99% chance of having the disease if you test positive.
We will see that this is far from true when the disease is rare.

### A Simple Counting Argument

Suppose we have 10,000 people.
About 100 of them are diseased (1%).
The remaining 9,900 are healthy.  

Of the 100 diseased people, 99 will test positive (true positives).
Of the 9,900 healthy people, 1% will falsely test positive (99 people).
We end up with a total of 198 positive results: 99 true positives plus 99 false positives.

Hence, only half of these positives (99 out of 198) are truly diseased.
This implies a 50% chance of actually having the disease, which is far lower than 99%.
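
The same bookkeeping can be sketched in a few lines of Python. This is only an illustration of the counting argument above; the variable names are chosen here for readability and are not part of the original notes.

```python
# Counting argument from the text: 1% prevalence,
# 99% sensitivity, and 99% specificity.
population = 10_000
prevalence = 0.01
sensitivity = 0.99
specificity = 0.99

diseased = round(population * prevalence)              # people with the disease
healthy = population - diseased                        # people without it
true_positives = round(diseased * sensitivity)         # diseased and flagged
false_positives = round(healthy * (1 - specificity))   # healthy but flagged

total_positives = true_positives + false_positives
p_disease_given_positive = true_positives / total_positives

print(f"{true_positives} true positives, {false_positives} false positives")
print(f"P(disease | positive) = {p_disease_given_positive:.2f}")
```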

### Why This Happens

When a condition is rare, most people do not have it.
A small fraction of a large healthy group (the 1% false-positive rate applied to 9,900 healthy people) can match or exceed the positives from the much smaller diseased group.
This is a direct consequence of prior probability: we have to weigh how common the disease is before we interpret a new test result.

## An Intuitive Derivation of Bayes' Theorem

Bayes' Theorem emerges directly from the definition of **conditional probability**.
We start with $P(A \mid B)$, which is read as "the probability of $A$ given that $B$ occurred."
By definition, this is the fraction of times both $A$ and $B$ happen, out of all times $B$ happens:
\begin{align}
P(A \mid B) = \frac{P(A \cap B)}{P(B)}.
\end{align}

Here, $P(A \cap B)$ is the joint probability that both events occur.
We can also express this joint probability in another way:
\begin{align}
P(A \cap B) = P(B \mid A)\,P(A).
\end{align}

Placing this back into our conditional probability formula gives:
\begin{align}
P(A \mid B)
= \frac{P(B \mid A)\,P(A)}{P(B)}.
\end{align}

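Every occurrence of $B$ happens either together with $A$ or together with its complement $\bar{A}$, so we can split $P(B)$ into two disjoint contributions: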
\begin{align}
P(B) = P(B \mid A)\,P(A) \;+\; P(B \mid \bar{A})\,P(\bar{A}).
\end{align}

Putting this all together yields **Bayes' Theorem**:
\begin{align}
P(A \mid B)
= \frac{P(B \mid A)\,P(A)}
       {P(B \mid A)\,P(A) \;+\; P(B \mid \bar{A})\,P(\bar{A})}.
\end{align}

We can connect each term to our **medical test paradox**. In that story, $A$ is "the person has the disease" and $B$ is "the test comes back positive," so:
* $P(A)$ is the **prevalence** (1%).
* $P(\bar{A})$ is the chance of not having the disease (99%).
* $P(B \mid A)$ is the **sensitivity** (99%).
* $P(B \mid \bar{A})$ is the **false-positive rate** (1%, i.e., one minus the 99% specificity).

When we substitute these numbers, we recover the result of the counting argument: a 50% probability of having the disease if you test positive.
This result might seem surprising at first, but it follows naturally once we include both the **base rate** of the disease and the test's **accuracy**.
Bayes' Theorem thus formalizes the intuition behind "counting true positives vs. false positives" and ensures we do not overlook the large fraction of healthy individuals in the population.
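
Written out with the prevalence, sensitivity, and false-positive rate listed above, the substitution reads:
\begin{align}
P(A \mid B)
= \frac{0.99 \times 0.01}{0.99 \times 0.01 \;+\; 0.01 \times 0.99}
= \frac{0.0099}{0.0198}
= 0.5.
\end{align}
The numerator and the second term of the denominator happen to be equal here, which is exactly the "99 true positives versus 99 false positives" balance from the counting argument.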

This same line of reasoning applies to many physics and data-modeling scenarios.
We often start with a **prior** for a parameter (like the prevalence in the medical example) and then update it with **likelihood** information from new observations.
Bayes' Theorem tells us how to combine both pieces of information in a consistent way, yielding a **posterior probability** that captures our updated understanding of the system.

### Why Bayes' Theorem Matters

The key power of Bayes' Theorem is that it forces us to incorporate the **prior probability** $P(A)$ before we look at new evidence $B$.
Once the data (test results) come in, we use the likelihood $P(B \mid A)$ to update this prior, producing the **posterior probability** $P(A \mid B)$.
In the medical context, the update shows that a single positive result for a low-prevalence condition may not be enough for a confident diagnosis.
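
To illustrate how the posterior from one result can serve as the prior for the next, here is a minimal sketch. It assumes a second test whose outcome is conditionally independent of the first given the true disease status, which is an idealization introduced here and not something the example above states. The helper name `posterior` is hypothetical.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(A | B) from Bayes' Theorem for a single positive result."""
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

# First positive test: start from the 1% prevalence.
after_one = posterior(0.01, 0.99, 0.01)

# Second positive test (assumed independent given disease status):
# the previous posterior becomes the new prior.
after_two = posterior(after_one, 0.99, 0.01)

print(f"After one positive test:  {after_one:.2f}")
print(f"After two positive tests: {after_two:.2f}")
```

Under that independence assumption, a second positive result moves the probability from 50% to 99%, which is why a confirmatory test is so informative when the condition is rare.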

## A Physically Motivated Example: Exoplanet Detection

We can adapt the logic from the medical test paradox to a physics or astronomy problem.
Consider exoplanet detection: we look for a slight dip in a star's brightness that could signify a planet passing in front of the star (a "transit").
Even if our detection algorithm is "99% accurate," it may trigger many **false alarms** due to noise or stellar variability.
If only a small fraction of stars have detectable planets, we face a scenario similar to the medical test paradox.

### Setup

1. **Prevalence (Prior):** Suppose only **1%** of stars in our survey have a planet large enough (and with an orbit aligned just right) to cause a detectable transit.
2. **Detection Sensitivity:** If a star truly has a planet, our detection pipeline correctly flags it **99%** of the time.
3. **False Alarm Rate:** If a star does **not** have a planet, the pipeline still flags a **false positive** **1%** of the time (perhaps due to random noise, starspots, or measurement artifacts).

These numbers mirror the "disease prevalence," "test sensitivity," and false-positive rate from the medical example.
We want to know the **posterior probability** that a star truly has a planet given that we have detected a "transit signal."

### Bayes' Theorem for Exoplanets

Let:
* $A$ = "Star has a detectable planet."
* $B$ = "Detection algorithm flags a transit."

By Bayes' Theorem,
\begin{align}
P(\text{Star has planet} \,\mid\, \text{Transit Flag}) =
\frac{P(\text{Transit Flag} \,\mid\, \text{Star has planet}) \times P(\text{Star has planet})}{P(\text{Transit Flag})}.
\end{align}

Here:
1. $P(\text{Star has planet})$ is the 1% prevalence (prior).
2. $P(\text{Transit Flag} \,\mid\, \text{Star has planet})$ is the 99% detection sensitivity.
3. $P(\text{Transit Flag})$ accounts for both real transits and false alarms.

Just like in the medical paradox, we expect the **posterior probability** of having a planet given a positive detection to be around **50%**, not 99%. The rarity (1% prevalence) dilutes the significance of a single positive detection.
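
As a quick check on this expectation, the sketch below evaluates Bayes' Theorem directly, expanding $P(\text{Transit Flag})$ with the law of total probability. The function name `posterior_given_flag` and the extra prevalence values (0.1% and 10%) are illustrative choices made here, not values from the survey setup.

```python
def posterior_given_flag(prevalence, sensitivity, false_alarm_rate):
    """Return (P(planet | flag), P(flag)) using the law of total probability."""
    # P(flag) = real transits that get flagged + false alarms
    p_flag = sensitivity * prevalence + false_alarm_rate * (1 - prevalence)
    return sensitivity * prevalence / p_flag, p_flag

results = {}
for prevalence in (0.001, 0.01, 0.1):
    post, p_flag = posterior_given_flag(prevalence, 0.99, 0.01)
    results[prevalence] = post
    print(f"prevalence = {prevalence:5.3f}  P(flag) = {p_flag:.4f}  "
          f"P(planet | flag) = {post:.3f}")
```

With the section's numbers (1% prevalence) the posterior is 0.5. Making planets ten times rarer drops it to roughly 9%, while making them ten times more common raises it to roughly 92%.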

### Python Demo: Simulating an Exoplanet Survey

Below is a small Python script that simulates a survey of stars to illustrate how "99% detection accuracy" can still yield a large fraction of false positives if planets are rare.

In [None]:
# Number of stars in the survey
N_stars = 100_000

# Prior: fraction of stars with a detectably transiting planet
planet_prevalence = 0.01

# Detection sensitivity: P(flagged transit | planet)
detection_sensitivity = 0.99

# False alarm rate: P(flagged transit | no planet) = 1 - specificity
false_alarm_rate = 0.01

In [None]:
from random import random

# Simulate which stars have planets
stars_have_planet = [
    random() < planet_prevalence
    for _ in range(N_stars)
]

In [None]:
# Simulate detection outcomes
flags = []
for has_planet in stars_have_planet:
    if has_planet:
        # Real transit flagged with probability = detection_sensitivity
        flag = random() < detection_sensitivity
    else:
        # False alarm with probability = false_alarm_rate
        flag = random() < false_alarm_rate
    flags.append(flag)

In [None]:
# Count how many flagged
flagged_count = sum(flags)

# Count how many flagged stars actually have planets
true_planet_count = sum(
    has_planet and flagged 
    for has_planet, flagged in zip(stars_have_planet, flags)
)

In [None]:
print(f"Out of {N_stars} stars, {flagged_count} were flagged.")
if flagged_count > 0:
    posterior_prob = true_planet_count / flagged_count
    print(f"Among flagged stars, {true_planet_count} truly have planets.")
    print(f"Posterior probability of having a planet if flagged: "
          f"{posterior_prob:.2f}")
else:
    print("No transits flagged (very unlikely with these settings)!")

```{exercise}
Adjust the different parameters and check if the results follow your intuition.
```
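
For exploring many parameter settings, a vectorized and seeded variant of the simulation may be convenient. The sketch below swaps the standard-library `random` module for NumPy's random generator. The seed value and the array names are arbitrary choices made here, so treat it as one possible way to run the experiment, not as the reference implementation.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only

n_stars = 100_000
planet_prevalence = 0.01
detection_sensitivity = 0.99
false_alarm_rate = 0.01

# Draw which stars host a detectably transiting planet.
has_planet = rng.random(n_stars) < planet_prevalence

# Each star is flagged with a probability that depends on whether it has a planet.
p_flag = np.where(has_planet, detection_sensitivity, false_alarm_rate)
flagged = rng.random(n_stars) < p_flag

n_flagged = int(flagged.sum())
n_true = int((has_planet & flagged).sum())
simulated_posterior = n_true / n_flagged

print(f"{n_flagged} flagged, {n_true} with real planets")
print(f"Simulated P(planet | flag) = {simulated_posterior:.2f}")
```

Because the result is a Monte Carlo estimate, it fluctuates around the analytic value from run to run unless the seed is fixed.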

## Example: Estimating the Mass of a New Fundamental Particle

In high-energy physics, discovering or characterizing a new particle often boils down to measuring its mass (alongside other properties like spin or decay channels).
Particle masses are typically extracted from observed signatures in a detector, such as energy peaks or invariant-mass distributions of decay products.
Bayesian methods are increasingly used to combine a "prior" (e.g., theoretical constraints or previous measurements) with a "likelihood" (collision data) to infer a posterior distribution for the unknown mass.

Below is a simplified illustration that carries the same Bayesian update over to a continuous parameter, set in the context of particle physics.

### Physical Picture

Suppose theorists predict a new fundamental particle with a mass in the range of 2 to 5 TeV.
We carry out an experiment in a large collider and measure an invariant mass peak from the particle's decay products.
However, our measurement is noisy and uncertain due to detector resolution, background events, etc.
Let:
* $\theta$ = the particle's true mass (in TeV).  
* $m_{\text{obs}}$ = the observed peak from our measurement (in TeV).  

We assume a **Gaussian** error model for simplicity:
\begin{align}
m_{\text{obs}} \sim \mathcal{N}(\theta, \sigma^2),
\end{align}
where $\sigma$ represents the *typical detector resolution* or *statistical uncertainty* in reconstructing the mass peak.

### Prior on the Particle's Mass

We might have a **theoretical prior** stating that $\theta$ lies between 2 and 5 TeV, with no strong preference within that range.
That leads to a **uniform prior**:
\begin{align}
p(\theta) = 
\begin{cases}
\frac{1}{5 - 2}, & 2 \le \theta \le 5, \\
0, & \text{otherwise}.
\end{cases}
\end{align}


In a real analysis, the prior might come from previous measurements or constraints, such as electroweak measurements or indirect searches, but a uniform prior is a straightforward starting point.

### Likelihood of Observed Data

If the measured peak is $m_\text{obs}=3.2$ TeV with an uncertainty $\sigma=0.2$ TeV, the **likelihood** function is:
\begin{align}
p(m_{\text{obs}} \mid \theta) 
= \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\left[-\frac{(m_{\text{obs}}-\theta)^2}{2\sigma^2}\right].
\end{align}

This function is large if $\theta$ is near 3.2 TeV and small if $\theta$ is far from 3.2 TeV.

### Posterior Distribution

Bayes' Theorem states:
\begin{align}
p(\theta \mid m_{\text{obs}}) 
\propto p(m_{\text{obs}} \mid \theta) \, p(\theta).
\end{align}

We then normalize the right-hand side to ensure the posterior integrates to 1 over $\theta \in [2,5]$.
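
For this particular combination of a uniform prior and a Gaussian likelihood, the normalization can also be written in closed form. As a sketch, let $\Phi$ denote the standard normal cumulative distribution function (a symbol introduced here, not used elsewhere in this section). The constant prior $1/3$ cancels between numerator and denominator, leaving a Gaussian truncated to $[2, 5]$:
\begin{align}
p(\theta \mid m_{\text{obs}})
= \frac{\dfrac{1}{\sqrt{2\pi}\,\sigma}
        \exp\left[-\dfrac{(m_{\text{obs}}-\theta)^2}{2\sigma^2}\right]}
       {\Phi\!\left(\dfrac{5 - m_{\text{obs}}}{\sigma}\right)
        - \Phi\!\left(\dfrac{2 - m_{\text{obs}}}{\sigma}\right)},
\qquad 2 \le \theta \le 5,
\end{align}
and $p(\theta \mid m_{\text{obs}}) = 0$ outside that range.
When $m_{\text{obs}}$ sits many $\sigma$ away from both boundaries, the denominator is very close to 1 and the posterior is essentially the untruncated Gaussian.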

### Python Demo: Simple Grid Approximation

Below is a short Python script that computes the posterior distribution by evaluating it on a grid of $\theta$ values from 2 to 5 TeV:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Observed mass peak (TeV) and estimated detector resolution (TeV)
m_obs = 3.2
sigma = 0.2

In [None]:
# Define range for theta (2 to 5 TeV)

theta_min, theta_max = 2.0, 5.0
n_points = 3000
thetas = np.linspace(theta_min, theta_max, n_points)

In [None]:
# Uniform prior in [2, 5]
def prior(theta):
    # 1/(5-2)=1/3 in [2,5], else 0
    return np.where((theta >= theta_min) & (theta <= theta_max), 1/(theta_max - theta_min), 0)

In [None]:
# Gaussian likelihood
def likelihood(m_obs, theta, sigma):
    norm = 1.0 / (np.sqrt(2*np.pi) * sigma)
    return norm * np.exp(- 0.5 * ((m_obs - theta)/sigma)**2)

In [None]:
# Compute unnormalized posterior
unnorm_post = likelihood(m_obs, thetas, sigma) * prior(thetas)

In [None]:
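# Normalize with the trapezoidal rule
# (np.trapezoid needs NumPy >= 2.0; older versions provide np.trapz instead)
norm = np.trapezoid(unnorm_post, thetas)
post = unnorm_post / norm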

In [None]:
# Plot the posterior

plt.plot(thetas, post, color='k', label="Posterior PDF")
plt.axvline(m_obs, ls='--', label=f"Observed peak={m_obs} TeV")
plt.title("Posterior for New Particle Mass")
plt.xlabel("Mass (TeV)")
plt.ylabel("Posterior Density")
plt.legend()
plt.show()

When you run this script, you see a posterior distribution peaked near 3.2 TeV.
The width of the posterior depends on $\sigma$ and how close 3.2 TeV is to the edges (2 or 5 TeV).
If the measured value were near the boundary (e.g., 2.1 TeV), the posterior might be truncated heavily on one side.
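
The boundary effect can be made quantitative by comparing posterior means. The following is a minimal sketch, assuming the same flat prior and Gaussian likelihood as above. The helper `grid_posterior` is a name introduced here, and a simple Riemann sum on the uniform grid stands in for the trapezoidal rule.

```python
import numpy as np

def grid_posterior(m_obs, sigma, theta_min=2.0, theta_max=5.0, n_points=3000):
    """Posterior on a uniform grid for a flat prior and Gaussian likelihood."""
    thetas = np.linspace(theta_min, theta_max, n_points)
    dtheta = thetas[1] - thetas[0]
    # The flat prior is constant on the grid, so it cancels on normalization.
    unnorm = np.exp(-0.5 * ((m_obs - thetas) / sigma) ** 2)
    post = unnorm / (unnorm.sum() * dtheta)
    return thetas, post, dtheta

means = {}
for m_obs in (3.2, 2.1):
    thetas, post, dtheta = grid_posterior(m_obs, sigma=0.2)
    means[m_obs] = (thetas * post).sum() * dtheta
    print(f"m_obs = {m_obs} TeV -> posterior mean = {means[m_obs]:.3f} TeV")
```

For an observation at 3.2 TeV the posterior mean stays at the observed peak, whereas for 2.1 TeV the lower edge of the prior cuts off part of the Gaussian and pushes the mean above the observed value.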

### Interpretation

In this simplified scenario:
1. The prior indicates that $\theta$ should lie somewhere between 2 and 5 TeV.
2. The likelihood becomes sharper if our detector resolution is good (small $\sigma$).
3. The posterior is effectively a constrained, "truncated" Gaussian centered near the observed peak, but strictly between 2 and 5 TeV.

In reality, measuring a new particle mass at a collider typically involves many events rather than a single measurement.
We would accumulate data from multiple collisions, each with some measured mass or energy distribution.
The Bayesian approach, however, remains the same: start with a prior (theoretical constraints), define the likelihood (the probability of observing the data given a hypothesized mass), and compute the posterior distribution over $\theta$.
(See the lab.)
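
As a rough preview of the multi-event case, the sketch below combines several measurements on the same grid. It rests on assumptions made only for illustration: each event yields an independent mass measurement with the same Gaussian resolution, and the data are synthetic, generated from a made-up "true" mass of 3.2 TeV with 25 events.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed

# Hypothetical synthetic data: true mass and event count are made up.
true_mass = 3.2   # TeV
sigma = 0.2       # TeV, per-event resolution
n_events = 25
masses = rng.normal(true_mass, sigma, size=n_events)

thetas = np.linspace(2.0, 5.0, 3000)
dtheta = thetas[1] - thetas[0]

# Independent events: the log-likelihoods add, for every theta on the grid.
residuals = (masses[:, None] - thetas[None, :]) / sigma
log_like = -0.5 * (residuals ** 2).sum(axis=0)

# Flat prior on [2, 5] is constant on the grid; subtract the max for stability.
unnorm = np.exp(log_like - log_like.max())
post = unnorm / (unnorm.sum() * dtheta)

post_mean = (thetas * post).sum() * dtheta
post_std = np.sqrt((((thetas - post_mean) ** 2) * post).sum() * dtheta)
print(f"posterior mean = {post_mean:.3f} TeV, std = {post_std:.3f} TeV")
```

Working with the summed log-likelihood avoids numerical underflow when many events are multiplied together. Under these assumptions the posterior width shrinks roughly like $\sigma/\sqrt{N}$ for $N$ events.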

```{exercise}
Adjust the different parameters and check if the results follow your intuition.
```