## Beta-Bernoulli Model

### 1. Model Definition

#### **Notation**

- $N$: Total number of Bernoulli trials
- $y_i$: Outcome of the $i$-th trial, where $y_i \in \{0,1\}$
- $k = \sum_{i=1}^N y_i$: Total number of successes observed
- $\theta \in [0,1]$: Probability of success on each trial

#### **Definition**

We observe $N$ independent Bernoulli trials, of which $k$ are successes. Each trial is assumed to have the same success probability $\theta$, and our goal is to infer $\theta$ from the data.
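
To make the setup concrete, here is a minimal sketch that simulates such trials and computes $N$ and $k$ from the outcomes. The helper name `simulate_trials`, the value `theta=0.4`, and the seed are illustrative assumptions, not part of the model above.

```python
import random


def simulate_trials(theta, n_trials, seed=0):
    """Draw n_trials independent Bernoulli(theta) outcomes, each 0 or 1.

    Hypothetical helper for illustration; theta and seed are assumed values.
    """
    rng = random.Random(seed)
    return [1 if rng.random() < theta else 0 for _ in range(n_trials)]


y = simulate_trials(theta=0.4, n_trials=5)
N = len(y)   # total number of trials
k = sum(y)   # number of successes, k = sum_i y_i
print(f"outcomes={y}, N={N}, k={k}")
```

Only the pair $(N, k)$ enters the inference that follows; the order of the outcomes does not matter.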

---

### 2. Prior

#### **Notation**

- $\alpha, \beta > 0$: Shape parameters of the Beta prior distribution.

#### **Beta Prior**

We place a Beta prior on $\theta$:
$$
  p(\theta) 
  \;=\; \mathrm{Beta}(\theta \mid \alpha, \beta) 
  \;=\; \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}
        \,\theta^{\alpha - 1} \,(1-\theta)^{\beta - 1}
$$

This choice of prior is conjugate to the Bernoulli likelihood, ensuring the posterior also follows a Beta distribution.
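
The density above can be evaluated directly from its formula. The sketch below uses only the standard library; the function names `beta_pdf` and `integrate_midpoint` and the example parameters $\alpha=2$, $\beta=3$ are assumptions chosen for illustration. It also checks numerically that the density integrates to roughly 1.

```python
import math


def beta_pdf(theta, alpha, beta):
    """Beta(theta | alpha, beta) density for 0 < theta < 1."""
    # Work in log space: log Gamma(a+b) - log Gamma(a) - log Gamma(b)
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    log_kernel = (alpha - 1) * math.log(theta) + (beta - 1) * math.log1p(-theta)
    return math.exp(log_norm + log_kernel)


def integrate_midpoint(f, n=10_000):
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) for i in range(n))


density_at_half = beta_pdf(0.5, 2, 3)   # 12 * 0.5 * 0.25 = 1.5
total_mass = integrate_midpoint(lambda t: beta_pdf(t, 2, 3))
print(density_at_half, total_mass)
```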

---

### 3. Likelihood

#### **Notation**

- $p(D \mid \theta)$: Probability of data $D$ (i.e., the observed outcomes) given $\theta$.
- $k$: Number of successes in $N$ trials.

#### **Bernoulli Likelihood**

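Assuming the trials are independent, the likelihood of an observed sequence with $k$ successes and $N-k$ failures is the product of the per-trial Bernoulli probabilities:
$$
  p(D \mid \theta) 
  \;=\; \prod_{i=1}^{N} \theta^{y_i}\,(1-\theta)^{1 - y_i}
  \;=\; \theta^{k}\,(1-\theta)^{N - k}
$$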
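
As a quick illustration, the likelihood can be evaluated on a grid of $\theta$ values. The sketch below uses $N=5$ and $k=2$ as example data; the function name and the grid resolution are assumptions. It shows that the likelihood peaks at the observed proportion $k/N$.

```python
def bernoulli_likelihood(theta, k, n):
    """Likelihood theta^k * (1 - theta)^(n - k) of k successes in n trials."""
    return theta**k * (1 - theta) ** (n - k)


N, k = 5, 2
grid = [i / 1000 for i in range(1001)]   # theta values 0.000, 0.001, ..., 1.000
theta_best = max(grid, key=lambda t: bernoulli_likelihood(t, k, N))
print(f"grid maximizer = {theta_best}, observed proportion k/N = {k / N}")
```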

---

### 4. Posterior

#### **Notation**

- $p(\theta \mid D)$: Posterior distribution of $\theta$ after observing data $D$.
- $p(D)$: Marginal likelihood (normalizing constant).
- $\propto$: “Proportional to,” omitting constant factors that do not depend on $\theta$.

#### **Bayes’ Theorem and Posterior Form**

By Bayes’ Theorem,

$$
  p(\theta \mid D) 
  \;=\; \frac{p(D \mid \theta)\,p(\theta)}{p(D)}
$$
Substitute the Bernoulli likelihood and the Beta prior:
$$
  p(\theta \mid D) 
  \;\propto\; \theta^{k}\,(1-\theta)^{(N - k)} 
               \;\times\; \theta^{\alpha - 1}\,(1-\theta)^{\beta - 1}
$$
Combine the exponents of $\theta$ and $(1-\theta)$:
$$
  p(\theta \mid D) 
  \;\propto\; \theta^{(\alpha - 1) + k}
              \,(1-\theta)^{(\beta - 1) + (N - k)}
$$
Recognizing the Beta kernel, we conclude:
$$
  p(\theta \mid D) 
  = \mathrm{Beta}\Bigl(\theta \;\bigm|\; \alpha + k,\;\beta + (N - k)\Bigr)
$$

Hence, the posterior is another Beta distribution with **updated** shape parameters $\alpha + k$ and $\beta + (N - k)$.
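
The update rule amounts to adding counts to the shape parameters. A minimal sketch follows, with hypothetical helper names `update_beta` and `beta_mean` and example values $\alpha=2$, $\beta=3$, $N=5$, $k=2$. It also illustrates that updating one observation at a time gives the same posterior as a single batch update.

```python
def update_beta(alpha, beta, k, n):
    """Return the posterior shape parameters (alpha + k, beta + (n - k))."""
    return alpha + k, beta + (n - k)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Batch update: prior Beta(2, 3), then 2 successes in 5 trials.
a_post, b_post = update_beta(2, 3, k=2, n=5)
print(a_post, b_post, beta_mean(a_post, b_post))

# Sequential update: one trial at a time, in an arbitrary example order.
a_seq, b_seq = 2, 3
for y_i in [1, 0, 0, 1, 0]:
    a_seq, b_seq = update_beta(a_seq, b_seq, k=y_i, n=1)
print(a_seq, b_seq)
```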

---

### 5. SymPy Verification

Below, we symbolically represent the **unnormalized** posterior and numerically confirm that the integral over $[0,1]$ matches the Beta function normalization.

#### **Notation**

- $\alpha, \beta > 0$: Beta prior parameters.
- $N$: Number of trials.
- $k$: Number of observed successes.
- $\theta$: Success probability.

#### **Unnormalized Posterior**

$$
  \text{posterior}_\text{unnorm}(\theta) 
  \;=\; \bigl[\theta^{\alpha - 1}\,(1 - \theta)^{\beta - 1}\bigr]
         \;\times\;
         \bigl[\theta^{k}\,(1 - \theta)^{(N - k)}\bigr].
$$

When integrated over $\theta\in[0,1]$, it should yield the normalizing constant that turns this expression into a valid Beta density with parameters $\alpha + k$ and $\beta + (N-k)$.

```python
import mpmath
import sympy

theta = sympy.Symbol("theta", positive=True)
alpha_v, beta_v = 2, 3  # Values for alpha and beta
N_v, k_v = 5, 2  # Observed 5 trials, 2 successes

# Unnormalized posterior expression: prior kernel times likelihood
expr_symbolic = (
    theta ** (alpha_v - 1)
    * (1 - theta) ** (beta_v - 1)
    * theta ** (k_v)
    * (1 - theta) ** (N_v - k_v)
)

# Turn the symbolic expression into a numeric function backed by mpmath
f = sympy.lambdify(theta, expr_symbolic, "mpmath")

# Numeric integration from 0 to 1
res = mpmath.quad(f, [0, 1])
print("Unnormalized integral =", res)
```

```
Unnormalized integral = 0.00198412698412698
```


**Expected Result**  

From the Beta conjugacy, the posterior’s normalizing constant for $\alpha_v=2, \beta_v=3, N_v=5, k_v=2$ is:

$$
  \mathrm{B}(\alpha_v + k_v,\, \beta_v + (N_v - k_v)) 
  \;=\; \mathrm{B}(4,6)
$$
which reduces to $1/504$. To see this, recall that:

$$
  \mathrm{B}(x,y) \;=\; \frac{\Gamma(x)\,\Gamma(y)}{\Gamma(x+y)},
  \quad \text{and} 
  \quad \Gamma(n+1)=n! \text{ for positive integers } n.
$$

Substitute $\mathrm{B}(4,6)$:

$$
\mathrm{B}(4,6) 
\;=\; \frac{\Gamma(4)\,\Gamma(6)}{\Gamma(4+6)}
\;=\; \frac{\Gamma(4)\,\Gamma(6)}{\Gamma(10)}.
$$

Convert gamma functions to factorials:

- $\Gamma(4) = 3! = 3 \times 2 \times 1 = 6.$  
- $\Gamma(6) = 5! = 5 \times 4 \times 3 \times 2 \times 1 = 120.$  
- $\Gamma(10) = 9! = 9 \times 8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1 = 362{,}880.$

So,

$$
  \frac{\Gamma(4)\,\Gamma(6)}{\Gamma(10)} 
  \;=\; \frac{3!\,5!}{9!} 
  \;=\; \frac{6 \times 120}{362{,}880}.
$$

Simplify numerically:

$$
  \frac{6 \times 120}{362{,}880} 
  \;=\; \frac{720}{362{,}880}.
$$

Recognizing $362{,}880 / 720 = 504$, we have

$$
  \frac{720}{362{,}880} 
  \;=\; \frac{1}{504}.
$$

Hence, the numerical integral of the **unnormalized** posterior over $\theta \in [0,1]$, $0.001984\ldots$, agrees with the exact value $\tfrac{1}{504} \approx 0.00198413$. This confirms the **Beta conjugacy** result:

$$
  \int_0^1 \text{posterior}_\text{unnorm}(\theta)\,d\theta 
  \;=\; \frac{1}{504}.
$$
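
The factorial arithmetic above can also be checked with exact rational numbers. The sketch below uses only the standard library; the helper name `beta_function_int` is an assumption, and it is valid only for positive integer arguments, where $\Gamma(n) = (n-1)!$.

```python
from fractions import Fraction
from math import factorial


def beta_function_int(x, y):
    """B(x, y) for positive integers x, y, using Gamma(n) = (n - 1)!."""
    return Fraction(factorial(x - 1) * factorial(y - 1), factorial(x + y - 1))


exact = beta_function_int(4, 6)
print(exact)          # exact fraction
print(float(exact))   # decimal value to compare with the numerical integral
```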

## Impact of Different Priors on the Posterior in a Beta-Bernoulli Model

### Model Setup

- **Prior**: Beta distribution with various parameterizations.
- **Likelihood**: Bernoulli trials with observed successes and failures.
- **Posterior**: Updated Beta distribution via conjugacy.

### 🔢 Priors Explored

<center>

| Prior        | Description                              |
|--------------|------------------------------------------|
| Beta(1,1)    | Uniform, non-informative                 |
| Beta(2,2)    | Weakly informative, centered at 0.5      |
| Beta(0.5,0.5)| U-shaped Jeffreys prior, favors 0 and 1  |
| Beta(5,1)    | Informative, favoring high success prob. |
| Beta(1,5)    | Informative, favoring low success prob.  |
| Beta(10,10)  | Strongly centered around 0.5             |

</center>
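
To compare these priors numerically, one can look at their means and variances, using the standard Beta formulas $\mathbb{E}[\theta] = \alpha/(\alpha+\beta)$ and $\mathrm{Var}[\theta] = \alpha\beta / \bigl((\alpha+\beta)^2(\alpha+\beta+1)\bigr)$. The following is a small sketch; the dictionary name and function names are illustrative assumptions.

```python
priors = {
    "Beta(1,1)": (1.0, 1.0),
    "Beta(2,2)": (2.0, 2.0),
    "Beta(0.5,0.5)": (0.5, 0.5),
    "Beta(5,1)": (5.0, 1.0),
    "Beta(1,5)": (1.0, 5.0),
    "Beta(10,10)": (10.0, 10.0),
}


def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta)."""
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    """Variance of Beta(alpha, beta)."""
    total = alpha + beta
    return alpha * beta / (total**2 * (total + 1))


for name, (a, b) in priors.items():
    print(f"{name:14s} mean = {beta_mean(a, b):.3f}, variance = {beta_variance(a, b):.4f}")
```

The sum $\alpha + \beta$ acts like a prior sample size: Beta(10,10) has the smallest variance of the six, while Beta(0.5,0.5) has the largest.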

---

<center>
<img src="diagrams/prior_posterior_plot.png" alt="Prior and Posterior Distributions" width="90%" />
</center>

1. **Uniform Prior (Beta(1,1))**:  
   - This flat prior assigns equal density to every value of $\theta$, so the posterior is proportional to the likelihood: $\mathrm{Beta}(1 + k,\; 1 + N - k)$.
   - As data accumulates, the posterior becomes more concentrated around the observed proportion of successes.

2. **Weakly Informative Prior (Beta(2,2))**:  
   - Slightly informative but still relatively neutral, this prior pulls the posterior slightly toward 0.5 when data is sparse.
   - With more data, the influence of the prior diminishes.

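3. **Prior Favoring Extremes (Beta(0.5,0.5))**:  
   - This U-shaped density (the Jeffreys prior for the Bernoulli model) places most of its mass near 0 and 1.
   - With very little data, the posterior retains some of this preference for extremes; the prior contributes only $\alpha + \beta = 1$ pseudo-observation, so its effect fades quickly.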

4. **Informative Prior Favoring High Probabilities (Beta(5,1))**:  
   - This prior encodes a belief that the success probability is high (prior mean $5/6$).
   - When data contradicts this belief, the posterior updates but remains somewhat biased toward higher values, especially with small datasets.

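5. **Informative Prior Favoring Low Probabilities (Beta(1,5))**:  
   - This prior encodes a belief that the success probability is low (prior mean $1/6$).
   - The posterior updates based on observed data but remains skewed toward lower probabilities when data is scarce.

6. **Prior Strongly Centered at 0.5 (Beta(10,10))**:  
   - This prior behaves like 20 pseudo-observations split evenly between successes and failures.
   - It pulls the posterior toward 0.5 more strongly than Beta(2,2) and needs correspondingly more data to be outweighed.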
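
The observations above can be reproduced numerically. The sketch below applies the conjugate update to each prior from the table, using $N=5$ trials and $k=2$ successes as example data. The data behind the figure is not stated, so these values are an assumption, and the names `priors` and `posterior_summary` are illustrative.

```python
priors = {
    "Beta(1,1)": (1.0, 1.0),
    "Beta(2,2)": (2.0, 2.0),
    "Beta(0.5,0.5)": (0.5, 0.5),
    "Beta(5,1)": (5.0, 1.0),
    "Beta(1,5)": (1.0, 5.0),
    "Beta(10,10)": (10.0, 10.0),
}
N, k = 5, 2  # assumed example data: 2 successes in 5 trials


def posterior_summary(alpha, beta, k, n):
    """Return posterior shape parameters and the posterior mean."""
    a_post = alpha + k
    b_post = beta + (n - k)
    return a_post, b_post, a_post / (a_post + b_post)


summaries = {name: posterior_summary(a, b, k, N) for name, (a, b) in priors.items()}
for name, (a_post, b_post, mean) in summaries.items():
    print(f"{name:14s} -> Beta({a_post:g}, {b_post:g}), posterior mean = {mean:.3f}")
```

With only five trials, the posterior means spread widely around the observed proportion of 0.4, from about 0.27 under Beta(1,5) to about 0.64 under Beta(5,1).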

### General Takeaways

- **More Data Reduces Prior Influence**: As more trials are observed, the posterior relies more on data and less on the prior.
- **Stronger Priors Require More Data to Overcome**: Highly informative priors can significantly shape the posterior when the sample size is small.
- **Prior Selection Matters in Low-Data Regimes**: When data is limited, the choice of prior can heavily influence conclusions.
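
These takeaways follow from writing the posterior mean as a weighted average of the prior mean and the observed proportion: $\frac{\alpha + k}{\alpha + \beta + N} = w\,\frac{\alpha}{\alpha+\beta} + (1 - w)\,\frac{k}{N}$ with $w = \frac{\alpha+\beta}{\alpha+\beta+N}$. The sketch below illustrates this for two opposing priors, assuming a fixed observed success proportion of 40% at several hypothetical sample sizes.

```python
def posterior_mean(alpha, beta, k, n):
    """Posterior mean of theta under a Beta(alpha, beta) prior."""
    return (alpha + k) / (alpha + beta + n)


def prior_weight(alpha, beta, n):
    """Weight w placed on the prior mean in the posterior mean."""
    return (alpha + beta) / (alpha + beta + n)


gaps = []
for n in (5, 50, 500, 5000):   # assumed sample sizes
    k = 2 * n // 5             # 40% successes at every sample size
    mean_high = posterior_mean(5, 1, k, n)   # prior favoring high theta
    mean_low = posterior_mean(1, 5, k, n)    # prior favoring low theta
    gaps.append(mean_high - mean_low)
    print(f"N={n:5d}: Beta(5,1) -> {mean_high:.4f}, Beta(1,5) -> {mean_low:.4f}, "
          f"prior weight w = {prior_weight(5, 1, n):.4f}")
```

The gap between the two posterior means equals $4/(6+N)$ here, so it shrinks steadily as $N$ grows.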