<div style="display: flex; align-items: center; gap: 2px;">
  
  <div style="text-align: left; padding: 0;">
   <h2 style="font-size: 1.8em; margin-bottom: 0;"><b>Non Frequentist's guide to Statistical Inference...</b></h2>
   <br>
   <h3 style=" font-size: 1.2em;margin-bottom: 0;">Bayesian Analysis</h3>
   <h3 style="font-size: 1.2em; margin-bottom: 0; color: blue;"><i>Dr. Satadisha Saha Bhowmick</i></h3>
  </div>

  <div style="margin-right: 5px; padding: 0;">
    <img src="images/intro-pic.png" align="right" alt="intro-pic" style="width: 70%;">
    <!-- TEXT NEXT TO IMAGE -->
      <div style="font-size: 0.5em;">
        <p>Woman teaching geometry, from a fourteenth-century edition of Euclid’s geometry book.</p>
      </div>
  </div>

</div>

In [2]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from matplotlib.colors import ListedColormap
import ipywidgets as widgets
from IPython.display import display, clear_output
from PIL import Image
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401
from ipywidgets import interact
import sklearn.metrics as metrics
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

## Module 3: Learning Outcomes
Statistical Inference from a Bayesian's POV

<div style="display: flex; gap: 2px;">

  <div style="flex: 1;">
  <ul>
    <li class="fragment">Understand the basic idea behind Bayesian inference.</li>
    <li class="fragment">Derive a posterior distribution, given a prior one and a likelihood function.</li>
    <li class="fragment">Use conjugate pairs to derive posterior distributions.</li>
  </ul>

  
  </div>
<!--
  <div style="flex: 1;">
  <ul>
    <li class="fragment">Posterior Simulation and Analysis.</li>
    <li class="fragment">Use Markov chains to model certain types of problems.</li>
    <li class="fragment">Find the steady states for certain types of problems.</li>
  </ul>
  </div>
!-->
</div>

## A Bayesian World View

### Bayesian Knowledge Building
<p>As children, it takes a few spills to understand that liquid doesn’t stay in a glass.<br> Or a couple of attempts at conversation to understand that, unlike in cartoons, real dogs can’t talk.</p>
<div style="display: flex; align-items: center; gap: 5px;">

  <div class="fragment" style="flex: 1;">
    <p>Bayesian Knowledge Building...</p> 
    <p>Repeating an iterative process:
    <ul>
    <li>Acknowledge your <b>prior</b> beliefs. We do not walk into inquiries without context.</li>
    <li>Gather <b>data</b> about the world.</li>
    <li>Use data to update knowledge <i><b>a posteriori</b></i>.</li>
    </ul>
    </p>
    <p>Apply this Bayesian process to all rigorous research inquiries.</p>
  </div>

  <div class="fragment" style="flex: 1;">
    <img src="images/bayes_diagram.png" alt="Bayesian Knowledge Building" scale="0.75;" style="width: 90%;">
  </div>
</div>

### Bayesian vs Frequentist

Statistical modeling is broadly divided into these two camps.
- <b>REMINDER!</b> They both believe in Bayes' Theorem!
- Frequentist philosophy has traditionally dominated statistics.

Commonalities:
- Common goal: learn from data about the world around us.
    - Both Bayesian and frequentist analyses use data to fit models, make predictions, and evaluate hypotheses.
- Similar inferences: when working with the same data, they will typically produce a similar set of conclusions.



### Bayesian vs Frequentist: Key Difference

- Meaning of Probability
    - Bayesians believe a probability measures the relative plausibility of an event.
    - Frequentists interpret probability as the long-run relative frequency of a repeatable event.
    - For one-time events, the frequentist interpretation often cracks. 
        - If a pollster says that candidate $A$ has a 0.9 probability of winning the upcoming election, a Bayesian would interpret that, based on election models, the relative plausibility of winning is high, the candidate is 9 times more likely to win than to lose. 
        - For frequentists, the long-run relative frequency concept of observing the election over and over simply doesn’t apply here. They may say that the pollster is simply wrong! Or, in a more flexible interpretation, state that in long-run <i>hypothetical</i> repetitions of the election, candidate $A$ would win roughly $90\%$ of the time.

### Bayesian vs Frequentist: Key Difference

- Treatment of Parameters and Data
    - Bayesians treat parameters as random variables with a *prior* distribution gathered from historical data. Observed data is fixed and is used to update the distribution of beliefs about the parameters.
    - Frequentists rely solely on current experimental data. For them parameters are *fixed constants* that are simply unknown. The data collected is considered a random sample from a fixed population.

**Frequentists require larger samples for reliable estimates since their analysis starts from the observed data alone. Bayesian analysis is more robust with small sample sizes because of the prior information.**

### Big Idea

In classic probability, we build a model that estimates the probability of $\text{Data}$ given our model's parameters $\Theta$.
$$
P(\text{Data}|\Theta)
$$
For example, if our probability model suggests that the data should follow a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$, then parameters $\mu$ and $\sigma$ specify the likelihood of our observed data points.

Bayesian Inference flips this idea on its head. We have measurements of observed data that come from a system (*a probabilistic process*). What we want is to derive an inferred model, *a specific set of model parameters*, that is most likely given our observations. $\rightarrow$ **A statistical inverse problem**

Applying Bayes Theorem,
$$
P(\Theta|\text{Data}) = \frac{P(\text{Data}|\Theta)\times P(\Theta)}{P(\text{Data})}
$$

### Some Important Notations

- $\displaystyle \Theta$: The parameter of interest  
- $\displaystyle P(\Theta)$: The **prior** distribution of the parameter of interest  
- $\displaystyle P(\Theta \mid \text{Data})$: The updated, or **posterior**, distribution of the parameter  
- $\displaystyle P(\text{Data} \mid \Theta)$: The **likelihood** of the data  

### Prior Distribution

- The **prior distribution**, $\displaystyle P(\Theta)$, is the distribution describing beliefs about an unknown quantity (a.k.a., our parameter) before new evidence is taken into account.

- Think of this as your loose preconceived notions about what you are studying.

- The prior distribution is something **YOU** get to pick!
    - If you have no evidence, you might use something called a *naive prior*.
    - If you have some evidence (maybe from a pilot study or previous knowledge), you’d use more of an “educated guess.”

### Likelihood

- The **likelihood**, $P(\text{Data} \mid \Theta)$, provides a framework to compare the relative compatibility of our data with a particular value of $\Theta$.

- Viewed as a function of $\Theta$, it is **NOT** a probability distribution! The normal rules for probabilities do not apply: it need not sum or integrate to one over $\Theta$.

- It is, however, something closely related to probability, and you get to “pick” it (although sometimes picking is just using the most common option in a particular scenario; more on that later).

- Formally, with $n$ independent observations $d_i$:

$$
\ell(\Theta)
=
P(\text{Data} \mid \Theta)
=
\prod_{i=1}^{n} P(d_i \mid \Theta)
$$

- More often than not, you will see the **log-likelihood**, as it’s easier to work with.
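As a minimal sketch of the product formula and its logarithm, assume a Bernoulli model for coin flips. The data in `flips` and the candidate values of $\Theta$ are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical data: 1 = heads, 0 = tails (illustrative values only)
flips = [1, 1, 0, 1, 0, 1, 1, 1]

def likelihood(theta, data):
    """Product of the per-observation Bernoulli probabilities P(d_i | theta)."""
    out = 1.0
    for d in data:
        out *= theta if d == 1 else 1.0 - theta
    return out

def log_likelihood(theta, data):
    """Sum of log-probabilities; avoids underflow when the product has many terms."""
    return sum(math.log(theta if d == 1 else 1.0 - theta) for d in data)

# Compare how compatible the same data are with different values of theta
for theta in (0.3, 0.5, 0.75):
    print(theta, likelihood(theta, flips), log_likelihood(theta, flips))
```

Both functions rank the candidate values of $\Theta$ identically, since the logarithm is monotonic; the log form simply turns a long product into a sum.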

### Posterior Distribution

- The **posterior distribution**, $P(\Theta \mid \text{Data})$, contains information on our beliefs *updated by the data* from which we can estimate $\Theta$.

- The posterior distribution is something you **derive**, using

$$
P(\Theta \mid \text{Data})
\propto
P(\text{Data} \mid \Theta)\, P(\Theta)
$$

- Our statistical inverse!

### Probability of Data

You may have noticed that the formula for the posterior does not include $P(\text{Data})$ even though we did see it earlier. Why?
$$
P(\Theta \mid \text{Data})
\propto
P(\text{Data} \mid \Theta)\, P(\Theta)
$$

- Long story short: we don’t know, and we don’t care.
    - $P(\text{Data})$ in this context is operationally a **normalizing constant**.
    - It’s really hard to find the probability of the data: you can make assumptions, but you don’t really have any way of saying this exact dataset has this exact probability.
    - The posterior distribution **IS A DISTRIBUTION!!** We know the rules for distributions, including that they should integrate to one. We can just figure out a normalizing constant later.
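One way to see the "normalize later" idea is a grid approximation. The sketch below assumes a coin-flip likelihood, a flat prior, and hypothetical data (6 heads, 2 tails); none of these come from the lecture. It multiplies likelihood by prior on a grid of $\Theta$ values and only then rescales so the result integrates to one.

```python
# Grid approximation: posterior ∝ likelihood × prior, normalized at the end.
n_grid = 1001
thetas = [i / (n_grid - 1) for i in range(n_grid)]   # candidate values of Theta in [0, 1]
width = thetas[1] - thetas[0]

heads, tails = 6, 2                                  # hypothetical data

prior = [1.0 for _ in thetas]                        # flat prior (an assumption)
like = [t ** heads * (1 - t) ** tails for t in thetas]
unnormalized = [l * p for l, p in zip(like, prior)]

# The normalizing constant is a Riemann-sum approximation of the
# integral of likelihood × prior over Theta.
evidence = sum(unnormalized) * width
posterior = [u / evidence for u in unnormalized]

area = sum(posterior) * width                        # should be very close to 1
post_mean = sum(t * p for t, p in zip(thetas, posterior)) * width
print(area, post_mean)
```

The shape of `posterior` is fixed by `unnormalized` alone; dividing by `evidence` only rescales the curve.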

### Bayesian Update

Bayesian Statistics is all about applying some prior assumptions about our probability distributions and then updating that prior knowledge *after* observing new evidence.

Posterior is nothing but an updated prior.
- Can get a batch of new data to update priors.
- Or sequentially, one at a time, like a coin flip.

Bayesian Inference *balances* evidence with prior beliefs.
- Upon 3 successive coin flips, each yielding *Heads*, MLE (the frequentist approach) would estimate $\hat{\Theta} = \hat{P}(\text{Heads}) = 1.0$. That is absurd; it could just be the luck of the draw. This is a big problem, especially when working with small samples.
- Starting with a reasonable prior and then gradually updating it with evidence from observed data prevents such reckless conclusions.

### Bayesian Update

- Sequential order of data collection.
- Iteratively updating priors at each time step with posterior computed at the previous time step.

$$
P_k(\Theta \mid \text{Data}) = \frac{P_k(\text{Data} \mid \Theta)\times P_k(\Theta)}{P_k(\text{Data})}
$$
$$
P_{k+1}(\Theta) = P_k(\Theta \mid \text{Data}) 
$$
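The recursion above can be sketched for a two-hypothesis coin. The hypotheses (heads probability 0.5 versus 0.7), the 50/50 starting prior, and the flip sequence are illustrative assumptions. The sketch also checks that updating one flip at a time gives the same answer as a single batch update on all flips.

```python
def update(prior_biased, flip, p_fair=0.5, p_biased=0.7):
    """One Bayes update of P(biased) after a single flip (1 = heads, 0 = tails)."""
    like_biased = p_biased if flip == 1 else 1 - p_biased
    like_fair = p_fair if flip == 1 else 1 - p_fair
    numerator = like_biased * prior_biased
    return numerator / (numerator + like_fair * (1 - prior_biased))

flips = [1, 0, 1, 1, 1, 0, 1]   # hypothetical flips

# Sequential: the posterior at step k becomes the prior at step k + 1.
belief = 0.5
for f in flips:
    belief = update(belief, f)

# Batch: one update using the likelihood of the whole sequence.
h = sum(flips)
t = len(flips) - h
batch_num = 0.7 ** h * 0.3 ** t * 0.5
batch = batch_num / (batch_num + 0.5 ** h * 0.5 ** t * 0.5)

print(belief, batch)
```

The agreement holds because the flips are treated as independent, so the batch likelihood is just the product of the per-flip likelihoods.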


### Bayesian Update

**Question:** During a coin flip experiment, can we use Bayesian hypothesis testing to check if the coin is fair or biased? Here, an unfair or biased coin is one where the probability of obtaining heads on a flip is $0.7$.

In [17]:
# Robust GIF export: Bayesian posterior update after each flip
import matplotlib
matplotlib.use("Agg")
from matplotlib.animation import FuncAnimation, PillowWriter

np.random.seed(42)
gif_path = "images/bayesian_posterior_update.gif"

In [None]:
# -----------------------------
# Settings
# -----------------------------
# True coin (change to 0.5 or 0.7 to experiment)
true_p = 0.7
n_flips = 60

# Hypotheses
p_fair = 0.5
p_biased = 0.7

# Prior probabilities
prior_fair = 0.5
prior_biased = 0.5

# Store posterior over time
posterior_biased_history = []

In [None]:
# -----------------------------
# Sequential Bayesian updates
# -----------------------------

# Simulate coin flips (1 = heads, 0 = tails)
flips = np.random.binomial(1, true_p, n_flips)

for i in range(n_flips):
    flip = flips[i]

    # Likelihoods for this flip
    if flip == 1:
        likelihood_fair = p_fair
        likelihood_biased = p_biased
    else:
        likelihood_fair = 1 - p_fair
        likelihood_biased = 1 - p_biased

    # Bayes update
    numerator = likelihood_biased * prior_biased
    denominator = (
        likelihood_fair * prior_fair +
        likelihood_biased * prior_biased
    )

    posterior_biased = numerator / denominator
    posterior_fair = 1 - posterior_biased

    # Update priors for next step
    prior_biased = posterior_biased
    prior_fair = posterior_fair

    posterior_biased_history.append(posterior_biased)

In [16]:
# Robust GIF export: Bayesian posterior update after each flip
# -----------------------------
# Animate and save GIF
# -----------------------------
fig, ax = plt.subplots()
ax.set_xlim(1, n_flips)
ax.set_ylim(0, 1)
ax.set_xlabel("Flip number")
ax.set_ylabel("Posterior P(biased)")
ax.axhline(0.5, linestyle="--")
line, = ax.plot([], [])
dot,  = ax.plot([], [], marker="o", linestyle="")

def init():
    line.set_data([], [])
    dot.set_data([], [])
    ax.set_title("Bayesian Updating")
    return line, dot

def update(i):
    # i = 0..n_flips-1
    x = np.arange(1, i + 2)
    y = posterior_biased_history[: i + 1]
    line.set_data(x, y)
    dot.set_data([i + 1], [y[-1]])
    ax.set_title(f"Flip {i+1}: {'H' if flips[i]==1 else 'T'}")
    return line, dot

ani = FuncAnimation(fig, update, frames=n_flips, init_func=init, blit=False)

# Important: save *before* closing the figure, keep ani referenced
fps = 5  # frames per second for the GIF (assumed value; adjust to taste)
writer = PillowWriter(fps=fps)
ani.save(gif_path, writer=writer)

plt.close(fig)
# print("Saved:", gif_path)

<img src="images/bayesian_posterior_update.gif" alt="Bayesian Knowledge Building" scale="0.75;" style="width: 90%;">

## Conjugate Priors

- Under carefully chosen settings, the posterior distribution belongs to the same family of distributions as the prior distribution.

- If so, the prior is called a **conjugate prior**. A conjugate prior must have a “matching” likelihood pair.

- Conjugate priors are particularly convenient because they reduce Bayesian updating to modifying the parameters of a distribution.
    - The math behind updating the parameters is often **MUCH** simpler than doing a full derivation, which usually involves integrals.

- Check out <a href="https://en.wikipedia.org/wiki/Conjugate_prior" target="_blank">Wikipedia’s Table of Conjugate Prior Pairs!!</a>

### Some of the most popular conjugate pairs:

| Parameter | Data | Prior | Likelihood | Posterior |
|------------|------|--------|------------|------------|
| $p$ | $d_i \; (N)$ | $\text{Beta}(p \mid \alpha, \beta)$ <br> $\sim p^{\alpha-1}(1-p)^{\beta-1}$ | $\text{Binomial}(p \mid d_i, N)$ <br> $\sim p^{d_i}(1-p)^{N-d_i}$ | $\text{Beta}(p \mid \alpha + d_i, \beta + N - d_i)$ <br> $\sim p^{\alpha+d_i-1}(1-p)^{\beta+N-d_i-1}$ |
| $\lambda$ | $d_i$ | $\text{Gamma}(\lambda \mid \alpha, \beta)$ <br> $\sim \lambda^{\alpha-1} e^{-\beta \lambda}$ | $\text{Poisson}(\lambda \mid d_i)$ <br> $\sim \lambda^{d_i} e^{-\lambda}$ | $\text{Gamma}(\lambda \mid \alpha + d_i, \beta + 1)$ <br> $\sim \lambda^{\alpha-1+d_i} e^{-(\beta+1)\lambda}$ |
| $x$ | $d_i$ | $\text{Normal}(x \mid \mu, \sigma^2)$ <br> $\sim e^{-(x-\mu)^2 / 2\sigma^2}$ | $\text{Normal}(x \mid d_i, s^2)$* <br> $\sim e^{-(x-d_i)^2 / 2s^2}$ | $\text{Normal}\!\left(x \mid \frac{s^2\mu + \sigma^2 d_i}{\sigma^2 + s^2}, \frac{\sigma^2 s^2}{\sigma^2 + s^2}\right)$ <br> $\sim e^{-\left(x - \frac{s^2\mu + \sigma^2 d_i}{\sigma^2 + s^2}\right)^2 \big/ \left(\frac{2\sigma^2 s^2}{\sigma^2 + s^2}\right)}$ |

---

\* If the value of $s$ is unknown, the population variance needs to be estimated from multiple data points.

### Some Fundamental Bayesian Models

- The **Beta-Binomial model**: probability that it rains tomorrow in Australia using data on binary categorical variable $Y$, whether or not it rains for each of 1000 sampled days.
- The **Gamma-Poisson model**: the rate of bald eagle sightings in Ontario, Canada using data on variable $Y$, the **counts** of eagles seen in each of 37 one-week observation periods.
- The **Normal-Normal model**: the average 3 p.m. temperature in Australia using data on the bell-shaped variable $Y$, temperatures on a sample of study days.


## 🪙 The Beta–Binomial Bayesian Model

1️⃣ The Setup: Estimating a gambler’s success probability $p$ as successive bets are placed.<br>
Let $p = \text{probability of success (win)}$.<br>
Assume independent bets. Individual outcomes follow
$Y_i \mid p \sim \text{Bernoulli}(p)$<br>
For $n$ bets:
$
Y = \sum_{i=1}^n Y_i \sim \text{Binomial}(n, p)
$

### 🪙 The Beta–Binomial Bayesian Model

2️⃣ Likelihood

- Suppose we observe $$y = \text{number of wins out of } n \text{ bets}$$
- Likelihood: $P(y \mid p) = \binom{n}{y} p^y (1-p)^{n-y}$
- Ignoring constants in $p$: $L(p) \propto p^y (1-p)^{n-y}$

### 🪙 The Beta–Binomial Bayesian Model

3️⃣ Choosing a Prior

- Since $p \in (0,1)$, a natural prior is $p \sim \text{Beta}(\alpha, \beta)$.
    - A Beta distribution is a type of continuous probability distribution defined on the interval $[0,1]$.
    - It helps us model probabilities. While a Binomial distribution counts "how many successes," the Beta distribution models "how likely is success?"
- PDF: $\pi(p) = \frac{1}{B(\alpha,\beta)}p^{\alpha-1}(1-p)^{\beta-1}$
- Interpretation:
    - $\alpha - 1$ prior "successes"
    - $\beta - 1$ prior "failures"
- Prior mean: $\mathbb{E}[p] = \frac{\alpha}{\alpha+\beta}$

### 🪙 The Beta–Binomial Bayesian Model

4️⃣ Posterior Derivation

We learnt earlier that the posterior probability distribution is given by: $\pi(p \mid y) \propto L(p)\pi(p)$

Substituting our prior and likelihood distributions into this, we get:
$$
\pi(p \mid y) \propto p^y (1-p)^{n-y}\times p^{\alpha-1}(1-p)^{\beta-1}
$$

Upon combining exponents, we can easily recognize the kernel of an *updated* Beta distribution, thereby establishing conjugacy.
$$
\pi(p \mid y) \propto p^{\alpha+y-1}(1-p)^{\beta+n-y-1}
$$
$$
\boxed{
p \mid y \sim \text{Beta}(\alpha+y,\ \beta+n-y)
}
$$
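The conjugacy result can be checked numerically. This sketch, with illustrative values for $\alpha$, $\beta$, $n$ and $y$ (assumptions, not lecture data), normalizes the product of the likelihood and prior kernels on a grid and compares it with the closed-form $\text{Beta}(\alpha+y,\ \beta+n-y)$ density. The helper `beta_pdf` is written here for the example.

```python
import math

def beta_pdf(p, a, b):
    """Beta(a, b) density, with the normalizer B(a, b) computed via log-gamma."""
    log_B = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(p) + (b - 1) * math.log(1 - p) - log_B)

alpha, beta_, n, y = 2.0, 3.0, 10, 7      # illustrative values (assumptions)

# Midpoint grid on (0, 1); avoids log(0) at the endpoints.
m = 2000
grid = [(i + 0.5) / m for i in range(m)]
width = 1.0 / m

# Unnormalized posterior: likelihood kernel × prior kernel.
kernel = [p ** y * (1 - p) ** (n - y) * p ** (alpha - 1) * (1 - p) ** (beta_ - 1)
          for p in grid]
total = sum(kernel) * width
numeric = [k / total for k in kernel]

# Closed-form posterior from the conjugate update.
closed_form = [beta_pdf(p, alpha + y, beta_ + n - y) for p in grid]

max_gap = max(abs(a - b) for a, b in zip(numeric, closed_form))
print(max_gap)
```

A small `max_gap` indicates that the grid-normalized kernel and the updated Beta density agree up to numerical integration error.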

### 🪙 The Beta–Binomial Bayesian Model

5️⃣ Interpretation: Hypothetical Counts (recall this is the allowance 'relaxed frequentists' make too!).<br>
Posterior parameters: $\alpha_{\text{post}} = \alpha + y$; $\beta_{\text{post}} = \beta + n - y$<br>
So Bayesian updating becomes: Prior pseudo-counts + Observed counts

6️⃣ Posterior expectation: $\mathbb{E}[p \mid y] = \frac{\alpha+y}{\alpha+\beta+n}$

With a bit of algebraic manipulation:
$$
\mathbb{E}[p \mid y]
=
\frac{\alpha+\beta}{\alpha+\beta+n}
\cdot
\frac{\alpha}{\alpha+\beta}
+
\frac{n}{\alpha+\beta+n}
\cdot
\frac{y}{n}
$$

> Posterior mean = weighted average of prior mean and sample proportion

As $n \to 0$: Posterior ≈ Prior.<br>
As $n \to \infty$: $\mathbb{E}[p \mid y] \to \frac{y}{n}$, meaning that the data dominate and Bayesian learning gradually shifts belief toward the evidence.
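The weighted-average identity and its limiting behavior can be checked with a few lines of arithmetic. The prior $\text{Beta}(5,5)$ and the hypothetical 70% win rate are assumptions made for this sketch.

```python
def posterior_mean(alpha, beta, y, n):
    """Mean of the Beta(alpha + y, beta + n - y) posterior."""
    return (alpha + y) / (alpha + beta + n)

def weighted_form(alpha, beta, y, n):
    """Same quantity, written as prior-mean weight plus sample-proportion weight."""
    w_prior = (alpha + beta) / (alpha + beta + n)   # weight on the prior mean
    w_data = n / (alpha + beta + n)                 # weight on the sample proportion
    return w_prior * alpha / (alpha + beta) + w_data * y / n

alpha, beta = 5, 5                 # illustrative prior with mean 0.5
for n in (10, 100, 10_000):
    y = 7 * n // 10                # hypothetical data: 70% of bets won
    print(n, posterior_mean(alpha, beta, y, n), weighted_form(alpha, beta, y, n))
```

As `n` grows, the weight on the prior mean shrinks toward zero and the posterior mean moves from 0.5 toward the sample proportion 0.7.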

### 🪙 The Beta–Binomial Bayesian Model

8️⃣ Prior Choices in the Gambler Example

- Naive prior: $\text{Beta}(1,1)$
    - Uniform on $(0,1)$
- Skeptical prior (centered at fairness): $\text{Beta}(5,5)$
    - Expected probability of success at $0.5$.
- Strong belief gambler is skilled: $\text{Beta}(14,6)$
    - Expected probability of success at $0.7$.

Different priors → different posterior behavior for small $n$.
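To make the effect of the prior concrete, the sketch below applies the three priors listed above to two hypothetical samples with the same win rate, one small and one large. The sample sizes are assumptions for illustration.

```python
priors = {
    "naive Beta(1,1)": (1, 1),
    "skeptical Beta(5,5)": (5, 5),
    "skilled Beta(14,6)": (14, 6),
}

def posterior_mean(alpha, beta, y, n):
    """Mean of the Beta(alpha + y, beta + n - y) posterior."""
    return (alpha + y) / (alpha + beta + n)

spreads = {}
for y, n in [(4, 7), (400, 700)]:          # same win rate, small vs. large sample
    means = {name: posterior_mean(a, b, y, n) for name, (a, b) in priors.items()}
    spreads[n] = max(means.values()) - min(means.values())
    print(n, means, spreads[n])
```

With 7 bets the three posterior means disagree noticeably; with 700 bets they nearly coincide, because the observed counts swamp the prior pseudo-counts.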

### 🪙 The Beta–Binomial Bayesian Model

9️⃣ Posterior Predictive Distribution

Probability next bet is a win: $P(Y_{\text{new}} = 1 \mid y)=\mathbb{E}[p \mid y]=\frac{\alpha+y}{\alpha+\beta+n}$

The marginal distribution of wins: $Y \sim \text{Beta-Binomial}(n,\alpha,\beta)$.
- $\mathbb{E}[Y]=n \frac{\alpha}{\alpha + \beta}$

Variance is larger than Binomial variance (because it accounts for parameter uncertainty).
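The variance claim can be checked by computing the Beta-Binomial probabilities directly. In this sketch the values of `n`, `alpha` and `beta` are illustrative assumptions, and `beta_binomial_pmf` is a helper written for the example using the standard formula $\binom{n}{k} B(k+\alpha,\ n-k+\beta) / B(\alpha,\beta)$.

```python
import math

def log_beta_fn(a, b):
    """log B(a, b) via log-gamma, for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(k, n, alpha, beta):
    """P(Y = k) when p ~ Beta(alpha, beta) and Y | p ~ Binomial(n, p)."""
    log_ratio = log_beta_fn(k + alpha, n - k + beta) - log_beta_fn(alpha, beta)
    return math.comb(n, k) * math.exp(log_ratio)

n, alpha, beta = 20, 5, 5                       # illustrative values (assumptions)
pmf = [beta_binomial_pmf(k, n, alpha, beta) for k in range(n + 1)]

mean = sum(k * q for k, q in enumerate(pmf))
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))

p_bar = alpha / (alpha + beta)
binom_var = n * p_bar * (1 - p_bar)             # variance if p were fixed at its mean
print(sum(pmf), mean, var, binom_var)
```

Passing the posterior parameters $\alpha + y$ and $\beta + n - y$ in place of `alpha` and `beta` gives the predictive distribution for a future batch of bets.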

### Back to the Gambler’s Success Proportion

- In his first experiment, he wins 4 bets and loses 3.
- In his second experiment, he wins 9 bets and loses 11.

Let’s use a Beta prior with $\alpha = 1$ and $\beta = 1$.<br>
Recall the Beta density: $P(p \mid \alpha, \beta)=\frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha,\beta)}$

Substitute $\alpha = 1$, $\beta = 1$: $P(p \mid \alpha, \beta)=\frac{p^{1-1}(1-p)^{1-1}}{B(1,1)} = \frac{p^0(1-p)^0}{1}=\frac{1}{1} = 1$.
- **Note**, this is probability density, not probability itself!
- Intuitively this means, at first, all values of $p$ in $(0,1)$ are equally plausible (*a uniform distribution*).
    - Prior mean: $\mathbb{E}[p]=\frac{1}{1+1}=0.5$

### Back to the Gambler’s Success Proportion

After the first experiment we observe: $y = 4, \quad N = 7$.<br>
Using a $\text{Beta}(1,1)$ prior: $\alpha' = \alpha + y = 1 + 4 = 5$, and, $\beta' = \beta + N - y = 1 + 7 - 4 = 4$.<br>
So, the posterior distribution is $p \mid y\sim\text{Beta}(5,4)$ with density $P(p \mid \alpha', \beta') = \frac{p^{5-1}(1-p)^{4-1}}{B(5,4)}$
- $\mathbb{E}[p \mid y]=\frac{5}{5+4} \approx 0.556$
- The numbers of successes and failures keep getting added to the parameters of the prior in each turn.

As soon as we recognize that the distribution is another Beta distribution, we can:
- calculate the expected value $\frac{\alpha'}{\alpha'+\beta'}$ and the variance
- simulate experimental data
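The gambler's two experiments can be chained, with the posterior from the first acting as the prior for the second. This is a minimal sketch: the helper names are made up for the example, and the simulation step uses `random.betavariate` from the standard library with an arbitrary seed.

```python
import random

def update_beta(alpha, beta, wins, losses):
    """Conjugate update: add observed wins and losses to the Beta parameters."""
    return alpha + wins, beta + losses

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

def beta_var(alpha, beta):
    s = alpha + beta
    return alpha * beta / (s * s * (s + 1))

state = (1, 1)                                  # Beta(1, 1) prior
for wins, losses in [(4, 3), (9, 11)]:          # first and second experiment
    state = update_beta(*state, wins, losses)
    print(state, beta_mean(*state), beta_var(*state))

# Simulate plausible values of p from the final posterior.
random.seed(0)                                  # arbitrary seed for reproducibility
samples = [random.betavariate(*state) for _ in range(10_000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)
```

After both experiments the parameters are the prior values plus the total wins (13) and total losses (14), regardless of the order in which the experiments are processed.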

## Gamma–Poisson Model

<b>When Do We Use Gamma–Poisson?</b>

- The core concept is **waiting times** and the parameter of interest is a **rate parameter**, $\lambda$.
    - **Gamma Distribution**: Models the total time until the $n$-th event.
    - **Poisson Distribution**: Flips the question: it models the number of events that occur in a fixed time.

- Data are counts occurring in a fixed interval of time or space.
- The likelihood is **Poisson**.
- The conjugate prior (*and posterior*) is **Gamma**.
- As the number of observed days ($n$) goes to infinity, the prior's $\alpha$ and $\beta$ become negligible and the posterior mean converges to the observed sample mean.

### Model Setup

Suppose we are interested in *the number of spam phone calls I receive per day*.

- **Likelihood**
    - $Y \mid \lambda \sim \text{Poisson}(\lambda)$
    - For a single observation $y$: $P(y \mid \lambda) = \frac{\lambda^y e^{-\lambda}}{y!}$
- **Prior**
    - $\lambda \sim \text{Gamma}(\alpha, \beta)$ 
    - $\alpha$ roughly represents a prior count of events and $\beta$ the number of time intervals over which they were observed, so $\alpha/\beta$ acts as a prior rate.
    - $p(\lambda \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\lambda^{\alpha-1} e^{-\beta \lambda}$
    - Mean: $\mathbb{E}[\lambda] = \frac{\alpha}{\beta}$
- **Posterior**
    - Because Gamma is conjugate to Poisson: $\lambda \mid y \sim \text{Gamma}(\alpha + y,\; \beta + 1)$
    - For $n$ observations: $\lambda \mid y_1,\dots,y_n \sim \text{Gamma}\!\left(\alpha + \sum y_i,\; \beta + n\right)$
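The $n$-observation update is a one-liner in code. The sketch below uses hypothetical daily spam-call counts and an opinionated $\text{Gamma}(50, 2)$ prior, both assumptions for illustration, and tracks how far the posterior mean sits from the sample mean as more days are observed.

```python
def gamma_poisson_update(alpha, beta, counts):
    """Posterior Gamma parameters after observing a list of Poisson counts."""
    return alpha + sum(counts), beta + len(counts)

# Hypothetical daily spam-call counts (illustrative, not real data)
counts = [6, 4, 7, 5, 3, 6, 8, 5, 4, 6]

alpha0, beta0 = 50, 2                      # opinionated prior with mean 25
gaps = []
for n in (1, 5, len(counts)):
    a, b = gamma_poisson_update(alpha0, beta0, counts[:n])
    sample_mean = sum(counts[:n]) / n
    gaps.append(abs(a / b - sample_mean))  # distance between posterior and sample mean
    print(n, (a, b), a / b, sample_mean)
```

The gap shrinks as days accumulate, which is the finite-sample version of the statement that the prior becomes negligible for large $n$.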

### Example 1: Nearly Uninformed Prior

Suppose I observe $y = 6$ spam calls in one day.
- Prior: Smaller parameters are closer to *uninformed priors* 
    - $\alpha = 0.001,\; \beta = 0.001$
- Posterior parameters:
    - $\alpha' = \alpha + y = 6.001$ and $\beta' = \beta + 1 = 1.001$
- Posterior mean: almost entirely driven by the data!
    - $\mathbb{E}[\lambda \mid y] = \frac{6.001}{1.001} \approx 6.00$


### Example 2: Very Opinionated Prior

Suppose instead: $\alpha = 50,\quad \beta = 2$
- Prior mean: $\mathbb{E}[\lambda] = \frac{50}{2} = 25$
    - This prior strongly believes the rate is around 25 calls per day.
- Posterior parameters: with $y = 6$
    - $\alpha' = 50 + 6 = 56$, and $\beta' = 2 + 1 = 3$
- Posterior mean: remains much closer to the prior belief than to the observed value.
    - $\mathbb{E}[\lambda \mid y]=\frac{56}{3}\approx 18.67$
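Why the two examples behave so differently becomes clearer when the posterior mean is written as a weighted average, $\frac{\beta}{\beta+n}\cdot\frac{\alpha}{\beta} + \frac{n}{\beta+n}\cdot\bar{y}$, which follows by rearranging $\frac{\alpha+\sum y_i}{\beta+n}$. The sketch below applies it to both priors; the helper name is made up for the example.

```python
def posterior_mean_weighted(alpha, beta, counts):
    """Return (weight on prior mean, posterior mean) for a Gamma-Poisson model."""
    n = len(counts)
    w_prior = beta / (beta + n)            # weight on the prior mean alpha / beta
    prior_mean = alpha / beta
    sample_mean = sum(counts) / n
    return w_prior, w_prior * prior_mean + (1 - w_prior) * sample_mean

examples = {"nearly uninformed": (0.001, 0.001), "opinionated": (50, 2)}
for label, (alpha, beta) in examples.items():
    w, mean = posterior_mean_weighted(alpha, beta, [6])   # y = 6 calls in one day
    direct = (alpha + 6) / (beta + 1)                     # conjugate formula
    print(label, w, mean, direct)
```

With one observation, the opinionated prior keeps two thirds of the weight, while the nearly uninformed prior keeps about a thousandth, so its posterior mean is almost exactly the observed count.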


## Normal-Normal Model

- The Normal-Normal conjugate pair is used for continuous measurements $y$ that might reasonably follow a normal distribution.
    - The data are modeled as $\text{Normal}(\mu, \sigma^2)$. We usually assume the population variance $\sigma^2$ is known and we are trying to estimate the true mean $\mu$.
    - In practice, you may estimate variance using the standard unbiased estimator. 
        - $s^2 =\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}$

- Suppose I am interested in whether the teaspoons I use to bake are accurate. A standard teaspoon should hold 4.2 grams.
    - Then, the likelihood is the Normal distribution.
    - The prior is also a Normal distribution. The initial belief about $\mu$ is $\text{Normal}(\mu_0, \sigma_0^2)$, where the prior variance $\sigma_0^2$ is distinct from the population variance $\sigma^2$.
- I weigh the amount my teaspoon can hold three times, and come up with measurements of $4.1$, $4.4$, and $4.6$.
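A possible way to finish the teaspoon calculation is sketched below. It rests on several assumptions that are not stated in the text: the prior mean is set to the nominal 4.2 grams, the prior standard deviation is a made-up 0.2 grams, and the population variance is replaced by the unbiased sample estimate $s^2$. The update uses the single-observation Normal-Normal formula with the sample mean as the observation and $s^2/n$ as its variance.

```python
measurements = [4.1, 4.4, 4.6]            # grams, from the text
n = len(measurements)
xbar = sum(measurements) / n

# Unbiased sample variance, standing in for the unknown population variance.
s2 = sum((x - xbar) ** 2 for x in measurements) / (n - 1)

mu0 = 4.2                                 # assumed prior mean: the nominal teaspoon
sigma0_sq = 0.2 ** 2                      # assumed prior variance (hypothetical)

se2 = s2 / n                              # variance of the sample mean

# Normal-Normal update with the sample mean as the single "observation".
post_mean = (se2 * mu0 + sigma0_sq * xbar) / (sigma0_sq + se2)
post_var = sigma0_sq * se2 / (sigma0_sq + se2)

print(xbar, s2, post_mean, post_var)
```

The posterior mean lands between the prior mean and the sample mean, and the posterior variance is smaller than both the prior variance and the variance of the sample mean. A different choice of `sigma0_sq` would shift where between the two the posterior mean falls.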

## Concluding Thoughts

- It’s totally fine to use a well-known conjugate pair if it applies! Again, check the Wikipedia page for things you might be able to use, look at other examples similar to your context, etc.
- Sometimes, a conjugate pair lets the posterior fall out of a simple parameter update.
- What do we do when things are more complicated?
    - The answer is something called a Markov Chain Monte Carlo simulation, but we need to review Markov Chains specifically first.