# Comprehensive Probability Theory Review
## Based on D2L Chapter 2.6: Probability
---

## Section 1: Core Definitions and Axioms of Probability

### 1.1 Sample Space and Events

**Definition 1.1 (Sample Space):**
The **sample space**, denoted $\mathcal{S}$ (or sometimes $\Omega$), is the set of all possible outcomes of a random experiment.

**Definition 1.2 (Event):**
An **event** $\mathcal{A}$ is a subset of the sample space: $\mathcal{A} \subseteq \mathcal{S}$.

**Definition 1.3 (Event Occurrence):**
We say that event $\mathcal{A}$ has occurred if and only if the realized outcome $z$ satisfies $z \in \mathcal{A}$.

**Examples:**
- Coin flip: $\mathcal{S} = \{\text{Heads}, \text{Tails}\}$
- Die roll: $\mathcal{S} = \{1, 2, 3, 4, 5, 6\}$
- Temperature measurement: $\mathcal{S} = \mathbb{R}$

**Reviewer's Commentary:**
- These definitions are standard and correct.
- The notation $\mathcal{S}$ for sample space is conventional (alternative notations include $\Omega$ or $S$).
- Events form a $\sigma$-algebra (discussed in Section 12).

### 1.2 Kolmogorov Axioms of Probability

**Definition 1.4 (Probability Function):**
A **probability function** (or probability measure) is a function
$$P: \mathcal{F} \to [0,1]$$
where $\mathcal{F}$ is a $\sigma$-algebra of events on $\mathcal{S}$, that satisfies the following three axioms:

**Axiom 1 (Non-negativity):** For any event $\mathcal{A} \in \mathcal{F}$,
$$P(\mathcal{A}) \geq 0$$

**Axiom 2 (Normalization):** The probability of the entire sample space is 1:
$$P(\mathcal{S}) = 1$$

**Axiom 3 (Countable Additivity):** For any countable collection of **mutually exclusive** (disjoint) events $\{\mathcal{A}_i\}_{i=1}^{\infty}$ where $\mathcal{A}_i \cap \mathcal{A}_j = \emptyset$ for all $i \neq j$,
$$P\left(\bigcup_{i=1}^{\infty} \mathcal{A}_i\right) = \sum_{i=1}^{\infty} P(\mathcal{A}_i)$$

**Reviewer's Commentary:**
- These are the **Kolmogorov axioms**, the foundation of modern probability theory (1933).
- **Critical requirement:** Axiom 3 requires that events be **mutually exclusive** (disjoint). This condition must be stated explicitly.
- For finite cases, we often use the simpler finite additivity: $P(A \cup B) = P(A) + P(B)$ when $A \cap B = \emptyset$.

### 1.3 Immediate Consequences of the Axioms

**Theorem 1.1 (Probability of Empty Set):**
$$P(\emptyset) = 0$$

**Proof:**

**Step 1:** Consider the sample space $\mathcal{S}$ and the empty set $\emptyset$.

**Step 2:** Note that $\mathcal{S} \cap \emptyset = \emptyset$ and $\mathcal{S} \cup \emptyset = \mathcal{S}$.

*This follows from the definition of the empty set.*

**Step 3:** Therefore $\mathcal{S}$ and $\emptyset$ are disjoint events.

*This follows directly from Step 2.*

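**Step 4:** Apply Axiom 3 to the countable sequence $\mathcal{A}_1 = \mathcal{S}$, $\mathcal{A}_i = \emptyset$ for $i \geq 2$. These events are pairwise disjoint and their union is $\mathcal{S}$, so:
$$P(\mathcal{S}) = P(\mathcal{S}) + \sum_{i=2}^{\infty} P(\emptyset)$$

*Axiom 3 is stated for countably infinite collections, so the sequence is padded with copies of $\emptyset$. Finite additivity cannot be used directly here, because deriving it from Axiom 3 already relies on $P(\emptyset) = 0$.*

**Step 5:** Subtract $P(\mathcal{S})$ from both sides (it is finite, since $P(\mathcal{S}) = 1$ by Axiom 2):
$$0 = \sum_{i=2}^{\infty} P(\emptyset)$$

**Step 6:** By Axiom 1, $P(\emptyset) \geq 0$. A sum of infinitely many copies of a non-negative constant equals $0$ only if the constant itself is $0$:
$$0 = P(\emptyset)$$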

Therefore $P(\emptyset) = 0$. $\blacksquare$

**Theorem 1.2 (Complement Rule):**
For any event $\mathcal{A}$ and its complement $\mathcal{A}^c = \mathcal{S} \setminus \mathcal{A}$,
$$P(\mathcal{A}) + P(\mathcal{A}^c) = 1$$

**Proof:**

**Step 1:** By definition of complement, $\mathcal{A} \cap \mathcal{A}^c = \emptyset$.

*An event and its complement are disjoint by definition.*

**Step 2:** Also by definition of complement, $\mathcal{A} \cup \mathcal{A}^c = \mathcal{S}$.

*An event and its complement together partition the sample space.*

**Step 3:** Since $\mathcal{A}$ and $\mathcal{A}^c$ are disjoint, by Axiom 3:
$$P(\mathcal{A} \cup \mathcal{A}^c) = P(\mathcal{A}) + P(\mathcal{A}^c)$$

**Step 4:** Substitute from Step 2:
$$P(\mathcal{S}) = P(\mathcal{A}) + P(\mathcal{A}^c)$$

**Step 5:** By Axiom 2, $P(\mathcal{S}) = 1$, therefore:
$$1 = P(\mathcal{A}) + P(\mathcal{A}^c)$$

Therefore $P(\mathcal{A}) + P(\mathcal{A}^c) = 1$. $\blacksquare$

**Corollary 1.2.1:**
$$P(\mathcal{A}^c) = 1 - P(\mathcal{A})$$

**Theorem 1.3 (Monotonicity):**
If $\mathcal{A} \subseteq \mathcal{B}$, then $P(\mathcal{A}) \leq P(\mathcal{B})$.

**Proof:**

**Step 1:** Write $\mathcal{B} = \mathcal{A} \cup (\mathcal{B} \setminus \mathcal{A})$.

*Any set can be decomposed into a subset and its relative complement.*

**Step 2:** Note that $\mathcal{A}$ and $\mathcal{B} \setminus \mathcal{A}$ are disjoint.

**Step 3:** By Axiom 3:
$$P(\mathcal{B}) = P(\mathcal{A}) + P(\mathcal{B} \setminus \mathcal{A})$$

**Step 4:** By Axiom 1, $P(\mathcal{B} \setminus \mathcal{A}) \geq 0$.

**Step 5:** Therefore:
$$P(\mathcal{B}) = P(\mathcal{A}) + P(\mathcal{B} \setminus \mathcal{A}) \geq P(\mathcal{A})$$

$\blacksquare$
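
The axioms and Theorems 1.1 to 1.3 can be checked by brute force on a small finite example. The sketch below assumes a fair six-sided die with the uniform measure and takes every subset of the sample space as an event; the names `prob` and `all_events` are illustrative and not part of the source. On a finite sample space, countable additivity reduces to finite additivity, so only disjoint pairs are checked.

```python
from fractions import Fraction
from itertools import combinations

# Sample space for one roll of a fair six-sided die (illustrative assumption).
SAMPLE_SPACE = frozenset(range(1, 7))

def prob(event):
    """P(event) under the uniform measure: |event| / |sample space|."""
    event = frozenset(event)
    if not event <= SAMPLE_SPACE:
        raise ValueError("an event must be a subset of the sample space")
    return Fraction(len(event), len(SAMPLE_SPACE))

def all_events():
    """Yield every subset of the sample space (2**6 = 64 events)."""
    items = sorted(SAMPLE_SPACE)
    for size in range(len(items) + 1):
        for subset in combinations(items, size):
            yield frozenset(subset)

events = list(all_events())

# Axiom 1 (non-negativity) and Axiom 2 (normalization).
nonnegative = all(prob(a) >= 0 for a in events)
normalized = prob(SAMPLE_SPACE) == 1

# Additivity, checked on every pair of disjoint events.
additive = all(
    prob(a | b) == prob(a) + prob(b)
    for a in events for b in events if not (a & b)
)

# Theorem 1.1 (empty set), Theorem 1.2 (complement), Theorem 1.3 (monotonicity).
empty_is_zero = prob(frozenset()) == 0
complement_rule = all(prob(a) + prob(SAMPLE_SPACE - a) == 1 for a in events)
monotone = all(prob(a) <= prob(b) for a in events for b in events if a <= b)

print(nonnegative, normalized, additive, empty_is_zero, complement_rule, monotone)
```

Using `Fraction` keeps the arithmetic exact, so the equalities are tested without floating-point tolerance.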

---

## Section 2: Random Variables

### 2.1 What is a Random Variable?

A random variable provides a mathematical framework for mapping outcomes from a sample space to numerical values, enabling quantitative analysis.

**Definition 2.1 (Random Variable - Formal Version):**
A **random variable** is a measurable function $X: \mathcal{S} \to \mathbb{R}$ that maps outcomes from the sample space to real numbers.

**Intuition:** A random variable assigns a number to each possible outcome of a random experiment.

**Example 1: Coin Toss** 🪙

Consider a coin flip experiment where the sample space is $\mathcal{S} = \{\text{Heads}, \text{Tails}\}$.

Define a random variable $X$ that maps:
- Heads → 1
- Tails → 0

Now we can perform mathematical operations like computing averages.

**Example 2: Rolling a Die** 🎲

When rolling a die, define $X$ = "the number showing on the top face."

This gives us values in $\{1, 2, 3, 4, 5, 6\}$.

**Two Types of Random Variables:**

| Type | Description | Examples |
|------|-------------|----------|
| **Discrete** | Countable values | Coin flips, dice rolls, number of customers |
| **Continuous** | Any value in a range | Height, weight, temperature, time |

### 2.2 Probability Mass Function (PMF) and Probability Density Function (PDF)

**Definition 2.2 (Probability Mass Function):**
For a discrete random variable $X$, the **probability mass function (PMF)** is:
$$p_X(x) = P(X = x)$$

**Properties of PMF:**
1. $p_X(x) \geq 0$ for all $x$
2. $\sum_{x} p_X(x) = 1$

**Definition 2.3 (Probability Density Function):**
For a continuous random variable $X$, the **probability density function (PDF)** is a function $f_X(x)$ such that:
$$P(a \leq X \leq b) = \int_a^b f_X(x) \, dx$$

**Properties of PDF:**
1. $f_X(x) \geq 0$ for all $x$
2. $\int_{-\infty}^{\infty} f_X(x) \, dx = 1$

**Important Note:** For continuous random variables, $P(X = x) = 0$ for any specific value $x$. Nonzero probabilities arise only for sets of positive length, such as intervals.

### 2.3 Cumulative Distribution Function (CDF)

**Definition 2.4 (Cumulative Distribution Function):**
For any random variable $X$, the **cumulative distribution function (CDF)** is:
$$F_X(x) = P(X \leq x)$$

**Properties of CDF:**
1. $\lim_{x \to -\infty} F_X(x) = 0$
2. $\lim_{x \to +\infty} F_X(x) = 1$
3. $F_X$ is non-decreasing: if $a < b$, then $F_X(a) \leq F_X(b)$
4. $F_X$ is right-continuous

**Relationship between PDF and CDF (continuous case):**
$$f_X(x) = \frac{d}{dx}F_X(x)$$
$$F_X(x) = \int_{-\infty}^{x} f_X(t) \, dt$$

**Computing Probabilities:**
$$P(a < X \leq b) = F_X(b) - F_X(a)$$
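
For a discrete variable, the CDF is a running sum of the PMF, and the interval formula above can be checked directly. The following is a minimal sketch assuming a fair six-sided die; the function names `cdf` and `prob_interval` are chosen here for illustration.

```python
from fractions import Fraction

# PMF of a fair six-sided die (illustrative choice of distribution).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x):
    """F_X(x) = P(X <= x): add up the PMF over all values not exceeding x."""
    return sum((p for value, p in pmf.items() if value <= x), Fraction(0))

def prob_interval(a, b):
    """P(a < X <= b) = F_X(b) - F_X(a)."""
    return cdf(b) - cdf(a)

# P(2 < X <= 5) covers the outcomes {3, 4, 5}.
p_3_to_5 = prob_interval(2, 5)
print(p_3_to_5)
```

Evaluating `cdf` at a non-integer point such as `3.5` returns the same value as `cdf(3)`: the CDF of a discrete variable is a step function that jumps only at the possible values.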

### 2.4 Example: The Classic Coin Toss

Let's work through the most famous probability example - flipping a fair coin.

**The Setup:**
- Fair coin with 50-50 chance of heads or tails
- Sample space: $\mathcal{S} = \{\text{Heads}, \text{Tails}\}$

**Step 1: Define a Random Variable**

Let $X = 1$ if Heads, and $X = 0$ if Tails.

So:
- $P(X = 1) = 0.5$ (probability of heads)
- $P(X = 0) = 0.5$ (probability of tails)

**Step 2: Calculate the Expected Value**

What value do we expect on average if we flip many times?

$$E[X] = \sum_{x} x \cdot P(X = x) = (1)(0.5) + (0)(0.5) = 0.5$$

**Interpretation:** If we flip the coin many times and average all the results, we'll get close to 0.5.

**Step 3: Calculate the Variance**

How spread out are the results?

Using the formula: $\text{Var}[X] = E[X^2] - (E[X])^2$

First, find $E[X^2]$:
$$E[X^2] = (1^2)(0.5) + (0^2)(0.5) = 0.5$$

Then:
$$\text{Var}[X] = 0.5 - (0.5)^2 = 0.5 - 0.25 = 0.25$$

Standard deviation:
$$\sigma = \sqrt{0.25} = 0.5$$

**Interpretation:** The results are quite spread out since you either get 0 or 1, nothing in between.

### 2.5 Example: Flipping Two Coins

**Sample space:** $\mathcal{S} = \{HH, HT, TH, TT\}$

Let $Y$ = total number of heads. Possible values: 0, 1, or 2.

**Probabilities:**
- $P(Y = 0) = 1/4$ (both tails: TT)
- $P(Y = 1) = 2/4 = 1/2$ (one head: HT or TH)
- $P(Y = 2) = 1/4$ (both heads: HH)

**Expected number of heads:**
$$E[Y] = (0)(1/4) + (1)(1/2) + (2)(1/4) = 0 + 0.5 + 0.5 = 1$$

With 2 coins, on average you get 1 head.

**Variance:**
$$E[Y^2] = (0^2)(1/4) + (1^2)(1/2) + (2^2)(1/4) = 0 + 0.5 + 1 = 1.5$$

$$\text{Var}[Y] = 1.5 - (1)^2 = 1.5 - 1 = 0.5$$

**Pattern:** For 1 coin, $E[X] = 0.5$ and for 2 coins, $E[Y] = 1 = 2 \times 0.5$. Expectations add up, which demonstrates the **linearity of expectation**.
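
The one-coin and two-coin calculations can be reproduced by enumerating the sample space. This sketch assumes fair coins and uses illustrative helper names (`heads_pmf`, `expectation`, `variance`) that do not come from the source.

```python
from fractions import Fraction
from itertools import product

def expectation(pmf):
    """E[X] = sum over x of x * P(X = x)."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var[X] = E[X^2] - (E[X])^2."""
    mean = expectation(pmf)
    return sum(x * x * p for x, p in pmf.items()) - mean ** 2

def heads_pmf(n_coins):
    """PMF of the number of heads among n fair coins, by enumerating outcomes."""
    outcomes = list(product("HT", repeat=n_coins))  # all equally likely
    pmf = {}
    for outcome in outcomes:
        heads = outcome.count("H")
        pmf[heads] = pmf.get(heads, Fraction(0)) + Fraction(1, len(outcomes))
    return pmf

one_coin = heads_pmf(1)
two_coins = heads_pmf(2)

print(expectation(one_coin), variance(one_coin))
print(expectation(two_coins), variance(two_coins))
```

Calling `heads_pmf(n)` for larger `n` gives a quick way to see the same additive pattern in the expectation, although the enumeration grows as $2^n$.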

### 2.6 More Real-Life Examples

#### Example 1: Weather Forecasting

When the weather person says "30% chance of rain tomorrow," what does that mean?

**Sample Space:** $\mathcal{S} = \{\text{Rain}, \text{No Rain}\}$

**Probabilities:**
- $P(\text{Rain}) = 0.3$
- $P(\text{No Rain}) = 0.7$

Let $X = 1$ if it rains, $X = 0$ if it doesn't.

**Expected value:**
$$E[X] = (1)(0.3) + (0)(0.7) = 0.3$$

**Interpretation:** If there were 100 days exactly like tomorrow, it would rain on about 30 of them. There's a 70% chance you won't need an umbrella, and 30% chance you will.

---

#### Example 2: Exam Scores

Suppose your exam performance follows this distribution:
- Ace it (90-100): 40% probability
- Do okay (70-89): 50% probability
- Barely pass (60-69): 10% probability

Using the approximate midpoint of each range (95, 80, and 65):

$$E[\text{Score}] = (95)(0.4) + (80)(0.5) + (65)(0.1) = 38 + 40 + 6.5 = 84.5$$

**Expected score: 84.5** - That's a solid B!

**Variance calculation:**
$$E[\text{Score}^2] = (95^2)(0.4) + (80^2)(0.5) + (65^2)(0.1) = 3610 + 3200 + 422.5 = 7232.5$$

$$\text{Var}[\text{Score}] = 7232.5 - (84.5)^2 = 7232.5 - 7140.25 = 92.25$$

$$\sigma = \sqrt{92.25} \approx 9.6$$

**Interpretation:** A standard deviation of about 9.6 means your score typically lands roughly 10 points from the mean. Even though you expect an 84.5, a result anywhere from about 75 to 95 would not be surprising!

---

#### Example 3: Gaming Loot Boxes

A loot box in a game has:
- Common item (worth \$1): 70% chance
- Rare item (worth \$5): 25% chance
- Legendary item (worth \$50): 5% chance

**Expected value of a loot box:**
$$E[X] = (1)(0.70) + (5)(0.25) + (50)(0.05) = 0.70 + 1.25 + 2.50 = 4.45$$

Each box is worth \$4.45 on average. **If the box costs \$5 to buy, you're losing money!** About \$0.55 per box on average.

**Checking the variance:**
$$E[X^2] = (1^2)(0.70) + (5^2)(0.25) + (50^2)(0.05) = 0.70 + 6.25 + 125 = 131.95$$

$$\text{Var}[X] = 131.95 - (4.45)^2 = 131.95 - 19.80 = 112.15$$

$$\sigma = \sqrt{112.15} \approx 10.59$$

**Huge variance!** Most of the time you'll get a \$1 item, but occasionally you might hit that \$50 legendary.

**Lesson:** Expected value tells you the average, but variance tells you the RISK. High variance = high unpredictability!
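
The loot box numbers can be recomputed in a few lines. The sketch below uses only the values and probabilities stated in the example; the variable names are illustrative, and the final quantity (the chance of receiving an item worth less than the price) is an extra check added here, not something computed in the text.

```python
import math

# Item value in dollars -> probability, as given in the example.
loot_box = {1: 0.70, 5: 0.25, 50: 0.05}
price = 5  # purchase price in dollars, from the example

mean = sum(value * p for value, p in loot_box.items())
second_moment = sum(value ** 2 * p for value, p in loot_box.items())
var = second_moment - mean ** 2
std = math.sqrt(var)

# Average gain per box after paying the price (negative means a loss).
expected_net = mean - price

# Probability that the item is worth strictly less than the price paid.
prob_below_price = sum(p for value, p in loot_box.items() if value < price)

print(round(mean, 2), round(var, 2), round(std, 2), round(expected_net, 2))
```

The mean and the standard deviation answer different questions: the first says what a box is worth on average, the second how far a single box typically deviates from that average.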

---

## Section 3: Joint and Conditional Probability

### 3.1 Joint Probability

**Definition 3.1 (Joint Probability):**
For two random variables $A$ and $B$, the **joint probability** is:
$$P(A = a, B = b)$$
which represents the probability that both events $\{A = a\}$ and $\{B = b\}$ occur simultaneously.

**Alternative notation:** $P(A = a \cap B = b)$ or $P(A = a \text{ and } B = b)$

**Theorem 3.1 (Joint Probability Bounds):**
For any values $a, b$:
$$P(A = a, B = b) \leq P(A = a)$$
$$P(A = a, B = b) \leq P(B = b)$$

**Proof of first inequality:**

**Step 1:** Partition the event $\{A = a\}$ based on all possible values of $B$:
$$\{A = a\} = \bigcup_{v \in \text{Val}(B)} \{A = a, B = v\}$$

*This is a set identity: every outcome with $A = a$ has exactly one value of $B$, so it belongs to exactly one of the sets in the union.*

**Step 2:** The events $\{A = a, B = v\}$ for different values of $v$ are mutually exclusive.

*$B$ cannot take two different values simultaneously.*

**Step 3:** Apply Axiom 3 (countable additivity):
$$P(A = a) = \sum_{v \in \text{Val}(B)} P(A = a, B = v)$$

**Step 4:** Since all probabilities are non-negative (Axiom 1), each term in the sum is $\geq 0$:
$$P(A = a) = P(A = a, B = b) + \sum_{v \neq b} P(A = a, B = v) \geq P(A = a, B = b)$$

Therefore $P(A = a, B = b) \leq P(A = a)$. The proof for the second inequality is symmetric. $\blacksquare$

### 3.2 Conditional Probability

**Definition 3.2 (Conditional Probability):**
The **conditional probability** of event $B = b$ given that event $A = a$ has occurred is defined as:
$$P(B = b \mid A = a) = \frac{P(A = a, B = b)}{P(A = a)}$$
provided that $P(A = a) > 0$.

**Interpretation:**
- We restrict our sample space to outcomes where $A = a$ occurred
- We renormalize probabilities so they sum to 1 over this restricted space
- The conditional probability measures the likelihood of $B = b$ within this restricted space

**Important Condition:** This definition is only valid when $P(A = a) > 0$. If $P(A = a) = 0$, conditional probability is undefined (division by zero).

**Example:**

Roll a fair die. Let:
- $A$ = "roll is even" = $\{2, 4, 6\}$
- $B$ = "roll is greater than 3" = $\{4, 5, 6\}$

Then:
- $P(A) = 3/6 = 1/2$
- $P(B) = 3/6 = 1/2$
- $P(A \cap B) = P(\{4, 6\}) = 2/6 = 1/3$

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{1/3}{1/2} = \frac{2}{3}$$

Given that the roll is even, there's a 2/3 chance it's greater than 3.
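
The die example can be verified by counting outcomes. This is a small sketch assuming a fair die; `prob` and `cond_prob` are illustrative helper names, and the guard mirrors the condition $P(A) > 0$ in Definition 3.2.

```python
from fractions import Fraction

SAMPLE_SPACE = set(range(1, 7))  # fair six-sided die

def prob(event):
    """P(event) for equally likely outcomes."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))

def cond_prob(b, a):
    """P(B | A) = P(A and B) / P(A); undefined when P(A) = 0."""
    if prob(a) == 0:
        raise ValueError("P(A) = 0: conditional probability is undefined")
    return prob(a & b) / prob(a)

A = {2, 4, 6}   # roll is even
B = {4, 5, 6}   # roll is greater than 3

p_b_given_a = cond_prob(B, A)
print(p_b_given_a)
```

Conditioning on `A` amounts to counting only the three even outcomes and asking how many of them fall in `B`.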

### 3.3 Verification that Conditional Probability is a Valid Probability

**Theorem 3.2:** For fixed $A = a$ with $P(A = a) > 0$, the function $Q(B = b) = P(B = b \mid A = a)$ satisfies all probability axioms.

**Proof:**

**Axiom 1 (Non-negativity):**

**Step 1:** By definition:
$$Q(B = b) = P(B = b \mid A = a) = \frac{P(A = a, B = b)}{P(A = a)}$$

**Step 2:** The numerator satisfies $P(A = a, B = b) \geq 0$ (by Axiom 1).

**Step 3:** The denominator satisfies $P(A = a) > 0$ (given assumption).

**Step 4:** Therefore:
$$Q(B = b) = \frac{P(A = a, B = b)}{P(A = a)} \geq 0$$

**Axiom 2 (Normalization):**

**Step 1:** Sum over all possible values of $B$:
$$\sum_{b} Q(B = b) = \sum_{b} \frac{P(A = a, B = b)}{P(A = a)}$$

**Step 2:** Factor out constant denominator:
$$= \frac{1}{P(A = a)} \sum_{b} P(A = a, B = b)$$

**Step 3:** By marginalization:
$$= \frac{1}{P(A = a)} \cdot P(A = a) = 1$$

**Axiom 3 (Additivity):** Similar verification for disjoint events.

Therefore, conditional probability defines a valid probability measure. $\blacksquare$

### 3.4 Marginalization (Law of Total Probability)

**Theorem 3.3 (Marginalization):**
For random variables $A$ and $B$:
$$P(A = a) = \sum_{v \in \text{Val}(B)} P(A = a, B = v)$$

**Proof:** This was established in Steps 1 to 3 of the proof of Theorem 3.1.

**Theorem 3.4 (Law of Total Probability - Alternative Form):**
$$P(A = a) = \sum_{v \in \text{Val}(B)} P(A = a \mid B = v) P(B = v)$$

**Proof:**

**Step 1:** Start with the marginalization formula:
$$P(A = a) = \sum_{v} P(A = a, B = v)$$

**Step 2:** Apply the definition of conditional probability to each term:
$$P(A = a, B = v) = P(A = a \mid B = v) P(B = v)$$

*This assumes $P(B = v) > 0$.*

**Step 3:** Substitute into Step 1:
$$P(A = a) = \sum_{v} P(A = a \mid B = v) P(B = v)$$

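**Note:** For values $v$ where $P(B = v) = 0$, the conditional probability $P(A = a \mid B = v)$ is undefined. However, $P(A = a, B = v) \leq P(B = v) = 0$ by Theorem 3.1, so such terms contribute 0 to the marginalization sum and can be omitted. $\blacksquare$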

**Intuition:** To find the probability of $A$, sum over all possible "paths" through values of $B$.
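
Both forms of the law can be checked on a small joint table. The joint distribution below is hypothetical (the numbers are made up for illustration), and the helper names are not from the source.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(A = a, B = v); values chosen to sum to 1.
joint = {
    (0, "low"): F(1, 10), (0, "mid"): F(2, 10), (0, "high"): F(1, 10),
    (1, "low"): F(3, 10), (1, "mid"): F(1, 10), (1, "high"): F(2, 10),
}

def marginal_a(a):
    """P(A = a): sum the joint over every value of B (Theorem 3.3)."""
    return sum(p for (a_val, _), p in joint.items() if a_val == a)

def marginal_b(v):
    """P(B = v): sum the joint over every value of A."""
    return sum(p for (_, b_val), p in joint.items() if b_val == v)

def cond_a_given_b(a, v):
    """P(A = a | B = v); requires P(B = v) > 0."""
    return joint[(a, v)] / marginal_b(v)

b_values = sorted({v for _, v in joint})

# Theorem 3.4: P(A = 1) = sum over v of P(A = 1 | B = v) * P(B = v).
total_prob_a1 = sum(cond_a_given_b(1, v) * marginal_b(v) for v in b_values)

print(marginal_a(1), total_prob_a1)
```

Each term `cond_a_given_b(1, v) * marginal_b(v)` is one "path" through a value of $B$, which is the intuition stated above.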

---

## Section 4: Bayes' Theorem

### 4.1 Derivation of Bayes' Theorem

**Theorem 4.1 (Bayes' Theorem - Basic Form):**
For events $A$ and $B$ with $P(A) > 0$ and $P(B) > 0$:
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$$

**Proof:**

**Step 1:** By definition of conditional probability:
$$P(A \mid B) = \frac{P(A, B)}{P(B)}$$

*This is valid when $P(B) > 0$.*

**Step 2:** Note that joint probability is symmetric:
$$P(A, B) = P(B, A)$$

*$P(A \cap B) = P(B \cap A)$.*

**Step 3:** Apply definition of conditional probability to express joint probability:
$$P(B, A) = P(B \mid A) P(A)$$

*This is valid when $P(A) > 0$.*

**Step 4:** Combine Steps 2 and 3:
$$P(A, B) = P(B \mid A) P(A)$$

**Step 5:** Substitute Step 4 into Step 1:
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$$

Therefore Bayes' theorem is established. $\blacksquare$

### 4.2 Terminology and Interpretation

In Bayes' theorem:
$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

| Term | Name | Meaning |
|------|------|--------|
| $P(A)$ | **Prior** | What we believed BEFORE seeing evidence |
| $P(B \mid A)$ | **Likelihood** | How likely is the evidence if our belief is true? |
| $P(B)$ | **Marginal / Evidence** | How likely is the evidence overall? |
| $P(A \mid B)$ | **Posterior** | What we believe AFTER seeing evidence |

**The Big Idea:** Bayes' theorem tells us how to update our beliefs when we get new information.

### 4.3 Bayes' Theorem with Marginalization

**Theorem 4.2 (Bayes' Theorem - Normalized Form):**
When $P(B)$ is unknown, we can compute it via marginalization:
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{\sum_{a'} P(B \mid A = a') P(A = a')}$$

**Proof:**

**Step 1:** Start with Bayes' theorem:
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$$

**Step 2:** Apply marginalization to compute $P(B)$:
$$P(B) = \sum_{a'} P(B, A = a') = \sum_{a'} P(B \mid A = a') P(A = a')$$

**Step 3:** Substitute into Step 1:
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{\sum_{a'} P(B \mid A = a') P(A = a')}$$

$\blacksquare$

### 4.4 Proportional Form of Bayes' Theorem

**Theorem 4.3 (Bayes' Theorem - Proportional Form):**
$$P(A \mid B) \propto P(B \mid A) P(A)$$

**Explanation:** The proportionality symbol $\propto$ means "proportional to" or "equal up to a normalization constant."

**Detailed Meaning:**

**Step 1:** From Bayes' theorem:
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$$

**Step 2:** When computing $P(A \mid B)$ for different values of $A$ with $B$ fixed, the denominator $P(B)$ is constant.

**Step 3:** Therefore:
$$P(A \mid B) = \frac{1}{P(B)} \cdot P(B \mid A) P(A) \propto P(B \mid A) P(A)$$

**Step 4:** To recover the full probability, we normalize:
$$P(A = a \mid B) = \frac{P(B \mid A = a) P(A = a)}{\sum_{a'} P(B \mid A = a') P(A = a')}$$

**Usage:** This form is useful when we only need to compare relative probabilities or when we'll normalize at the end.
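
The "score, then normalize" recipe can be sketched as follows. The scenario is hypothetical and chosen only for illustration: one of three fair dice (4-sided, 6-sided, 8-sided) is picked uniformly at random, and a roll of 3 is observed. The function name `normalize` is likewise an assumption.

```python
from fractions import Fraction as F

def normalize(scores):
    """Turn non-negative scores into probabilities by dividing by their sum."""
    total = sum(scores.values())
    if total == 0:
        raise ValueError("all scores are zero: the evidence has probability 0")
    return {hypothesis: s / total for hypothesis, s in scores.items()}

# Hypothetical prior P(A = a): each die is equally likely to be picked.
prior = {"d4": F(1, 3), "d6": F(1, 3), "d8": F(1, 3)}

# Hypothetical likelihood P(B | A = a): chance of rolling a 3 with each die.
likelihood = {"d4": F(1, 4), "d6": F(1, 6), "d8": F(1, 8)}

# Proportional form: P(A = a | B) is proportional to P(B | A = a) * P(A = a).
unnormalized = {a: likelihood[a] * prior[a] for a in prior}

# The sum of the scores is exactly P(B), so dividing by it gives the posterior.
posterior = normalize(unnormalized)
print(posterior)
```

The ranking of hypotheses is already visible in `unnormalized`; normalization only rescales the scores so they sum to 1.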

### 4.5 Real-World Example: The Unreliable Friend

**Situation:** Your friend says "I'm coming to your party!" But you know from experience:
- When they actually come, they ALWAYS say they're coming: $P(\text{Says Yes} \mid \text{Comes}) = 1.0$
- But they only actually show up 30% of the time: $P(\text{Comes}) = 0.3$
- They say "yes" 80% of the time: $P(\text{Says Yes}) = 0.8$

**Question:** Given that they said yes, what's the probability they'll actually come?

**Using Bayes' theorem:**
$$P(\text{Comes} \mid \text{Says Yes}) = \frac{P(\text{Says Yes} \mid \text{Comes}) \cdot P(\text{Comes})}{P(\text{Says Yes})}$$

$$= \frac{(1.0)(0.3)}{0.8} = \frac{0.3}{0.8} = 0.375 = 37.5\%$$

**Result:** Even though they said yes, there's only a 37.5% chance they'll actually come!

**Why?** Because your friend says "yes" a LOT (80% of the time), but only shows up 30% of the time. So "saying yes" doesn't mean much!

---

### 4.6 Real-World Example: Email Spam Filter

Your email has these characteristics:
- **Prior:** 20% of all emails are spam: $P(\text{Spam}) = 0.2$
- **Likelihood:** If it's spam, there's a 90% chance it contains "FREE MONEY!!!": $P(\text{FREE MONEY} \mid \text{Spam}) = 0.9$
- **Marginal:** Overall, 25% of emails contain "FREE MONEY": $P(\text{FREE MONEY}) = 0.25$

You get an email with "FREE MONEY!!!" - is it spam?

$$P(\text{Spam} \mid \text{FREE MONEY}) = \frac{(0.9)(0.2)}{0.25} = \frac{0.18}{0.25} = 0.72 = 72\%$$

**Result:** There's a 72% chance it's spam. Your filter flags it!

**Why this works:**
1. We STARTED with "20% of emails are spam" (prior)
2. We SAW evidence: "FREE MONEY!!!"
3. We UPDATED our belief to "72% likely spam" (posterior)

**This prior-to-posterior update is the core idea behind Bayesian spam filters and probabilistic reasoning in medical diagnosis.**
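
The examples in Sections 4.5 and 4.6 both plug three given numbers into the basic form of Bayes' theorem. A minimal sketch, with an illustrative function name and a consistency check that is an addition here (the joint $P(A, B)$ can never exceed $P(B)$):

```python
def bayes_posterior(likelihood, prior, evidence):
    """P(A | B) = P(B | A) * P(A) / P(B), with basic consistency checks."""
    if evidence <= 0:
        raise ValueError("P(B) must be positive for P(A | B) to be defined")
    joint = likelihood * prior  # P(A, B)
    if joint > evidence + 1e-12:
        raise ValueError("inconsistent inputs: P(A, B) cannot exceed P(B)")
    return joint / evidence

# Unreliable friend: P(Says Yes | Comes) = 1.0, P(Comes) = 0.3, P(Says Yes) = 0.8.
p_comes = bayes_posterior(likelihood=1.0, prior=0.3, evidence=0.8)

# Spam filter: P(FREE MONEY | Spam) = 0.9, P(Spam) = 0.2, P(FREE MONEY) = 0.25.
p_spam = bayes_posterior(likelihood=0.9, prior=0.2, evidence=0.25)

print(round(p_comes, 3), round(p_spam, 3))
```

The consistency check catches made-up inputs that no joint distribution could produce, which is a common slip when the three numbers are estimated separately.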

---

## Section 5: Independence

### 5.1 Independence of Events

**Definition 5.1 (Independence):**
Two random variables $A$ and $B$ are **independent** (denoted $A \perp B$) if and only if:
$$P(A, B) = P(A) P(B)$$
for all values of $A$ and $B$.

**Theorem 5.1 (Equivalent Characterization):**
If $P(B) > 0$, then $A \perp B$ if and only if:
$$P(A \mid B) = P(A)$$

**Proof:**

**Direction 1: Independence implies conditional equals marginal**

**Step 1:** Assume $A \perp B$, so $P(A, B) = P(A) P(B)$.

**Step 2:** By definition of conditional probability:
$$P(A \mid B) = \frac{P(A, B)}{P(B)}$$

**Step 3:** Substitute independence:
$$P(A \mid B) = \frac{P(A) P(B)}{P(B)}$$

**Step 4:** Cancel $P(B)$ (valid since $P(B) > 0$):
$$P(A \mid B) = P(A)$$

**Direction 2: Conditional equals marginal implies independence**

**Step 1:** Assume $P(A \mid B) = P(A)$.

**Step 2:** By definition of conditional probability:
$$P(A \mid B) = \frac{P(A, B)}{P(B)}$$

**Step 3:** Substitute assumption:
$$P(A) = \frac{P(A, B)}{P(B)}$$

**Step 4:** Multiply both sides by $P(B)$:
$$P(A) P(B) = P(A, B)$$

Therefore $A \perp B$. $\blacksquare$

**Interpretation:** Independence means that knowing $B$ provides no information about $A$ - the conditional probability equals the unconditional probability.

### 5.2 Conditional Independence

**Definition 5.2 (Conditional Independence):**
Random variables $A$ and $B$ are **conditionally independent given $C$** (denoted $A \perp B \mid C$) if and only if:
$$P(A, B \mid C) = P(A \mid C) P(B \mid C)$$
for all values of $A$, $B$, and $C$ (where $P(C) > 0$).

**Theorem 5.2 (Equivalent Characterization):**
If $P(B, C) > 0$, then $A \perp B \mid C$ if and only if:
$$P(A \mid B, C) = P(A \mid C)$$

**Proof:**

**Direction 1:**

**Step 1:** Assume $A \perp B \mid C$, so $P(A, B \mid C) = P(A \mid C) P(B \mid C)$.

**Step 2:** By definition of conditional probability:
$$P(A \mid B, C) = \frac{P(A, B \mid C)}{P(B \mid C)}$$

**Step 3:** Substitute conditional independence:
$$P(A \mid B, C) = \frac{P(A \mid C) P(B \mid C)}{P(B \mid C)} = P(A \mid C)$$

**Direction 2:** Similar (reverse the steps). $\blacksquare$

**Important Notes:**
- Variables can be **marginally independent** ($A \perp B$) but **conditionally dependent** ($A \not\perp B \mid C$)
- Variables can be **marginally dependent** ($A \not\perp B$) but **conditionally independent** ($A \perp B \mid C$)
- Example of first case: Common effect (explaining away)
- Example of second case: Common cause (confounding)
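
The first case (common effect) can be made concrete with a tiny enumeration. The construction below is an illustrative assumption, not taken from the source: $A$ and $B$ are independent fair bits and $C = A \oplus B$ (exclusive or) is their common effect.

```python
from fractions import Fraction as F
from itertools import product

# Joint distribution over (a, b, c) with c = a XOR b; each (a, b) has mass 1/4.
joint = {(a, b, a ^ b): F(1, 4) for a, b in product((0, 1), repeat=2)}

def prob(pred):
    """Total probability of the outcomes (a, b, c) that satisfy pred."""
    return sum(p for outcome, p in joint.items() if pred(*outcome))

def marginally_independent():
    """Check P(A = a, B = b) = P(A = a) P(B = b) for all a, b."""
    return all(
        prob(lambda x, y, z: x == a and y == b)
        == prob(lambda x, y, z: x == a) * prob(lambda x, y, z: y == b)
        for a, b in product((0, 1), repeat=2)
    )

def conditionally_independent_given_c():
    """Check P(A, B | C) = P(A | C) P(B | C) for all a, b, c."""
    for a, b, c in product((0, 1), repeat=3):
        p_c = prob(lambda x, y, z: z == c)  # equals 1/2 here, so never zero
        lhs = prob(lambda x, y, z: (x, y, z) == (a, b, c)) / p_c
        rhs = (prob(lambda x, y, z: x == a and z == c) / p_c) * (
            prob(lambda x, y, z: y == b and z == c) / p_c
        )
        if lhs != rhs:
            return False
    return True

print(marginally_independent(), conditionally_independent_given_c())
```

Once $C$ is known, learning $A$ pins down $B$ completely, which is why the conditional factorization fails even though $A$ and $B$ are independent on their own.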

---

## Section 6: Expectation and Variance

### 6.1 Expectation (Expected Value)

**Definition 6.1 (Expectation - Discrete Case):**
For a discrete random variable $X$ with probability mass function $P(X = x)$, the **expectation** (or expected value, or mean) is:
$$E[X] = \sum_{x} x \cdot P(X = x)$$
where the sum is over all possible values of $X$.

**Conditions for existence:** The expectation exists if and only if $\sum_{x} |x| \cdot P(X = x) < \infty$.

**Alternative Notation:** $E[X] = E_{X \sim P}[X] = \mu_X = \mu$

**Definition 6.2 (Expectation of Function):**
For a function $f: \mathbb{R} \to \mathbb{R}$ and discrete random variable $X$:
$$E_{X \sim P}[f(X)] = \sum_{x} f(x) \cdot P(X = x)$$

**Definition 6.3 (Expectation - Continuous Case):**
For a continuous random variable $X$ with probability density function $f(x)$:
$$E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx$$

For a function $g$:
$$E[g(X)] = \int_{-\infty}^{\infty} g(x) \cdot f(x) \, dx$$

### 6.2 Properties of Expectation

**Theorem 6.1 (Linearity of Expectation):**
For random variables $X$ and $Y$ and constants $a, b \in \mathbb{R}$:
$$E[aX + bY] = aE[X] + bE[Y]$$

**Proof (Discrete Case):**

**Step 1:** Write the expectation:
$$E[aX + bY] = \sum_{x, y} (ax + by) \cdot P(X = x, Y = y)$$

**Step 2:** Distribute:
$$= \sum_{x, y} ax \cdot P(X = x, Y = y) + \sum_{x, y} by \cdot P(X = x, Y = y)$$

**Step 3:** Factor out constants:
$$= a \sum_{x, y} x \cdot P(X = x, Y = y) + b \sum_{x, y} y \cdot P(X = x, Y = y)$$

**Step 4:** For the first term, marginalize over $y$:
$$\sum_{x, y} x \cdot P(X = x, Y = y) = \sum_{x} x \sum_{y} P(X = x, Y = y) = \sum_{x} x \cdot P(X = x) = E[X]$$

**Step 5:** Similarly for the second term:
$$\sum_{x, y} y \cdot P(X = x, Y = y) = E[Y]$$

**Step 6:** Combine:
$$E[aX + bY] = aE[X] + bE[Y]$$

$\blacksquare$

**Important Note:** Linearity holds **regardless of whether $X$ and $Y$ are independent**. This is a powerful property.

**Corollary 6.1.1:** For constants $a$ and $b$:
- $E[aX] = aE[X]$
- $E[X + b] = E[X] + b$
- $E[a] = a$
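
To see that linearity does not need independence, the sketch below uses a deliberately dependent pair: $X$ is a fair die roll and $Y = X^2$. The choice of variables and the constants $a = 2$, $b = 3$ are illustrative assumptions. For contrast, it also evaluates $E[XY]$ and $E[X]E[Y]$, which differ for this pair.

```python
from fractions import Fraction as F

die = {x: F(1, 6) for x in range(1, 7)}  # PMF of a fair die roll X

def expect(g):
    """E[g(X)] = sum over x of g(x) * P(X = x)."""
    return sum(g(x) * p for x, p in die.items())

a, b = 2, 3

# Left side: E[aX + bY] with Y = X^2, computed directly from the PMF.
lhs = expect(lambda x: a * x + b * x ** 2)

# Right side: a E[X] + b E[Y], computed from the two separate expectations.
rhs = a * expect(lambda x: x) + b * expect(lambda x: x ** 2)

# Contrast: the product of expectations is not E[XY] for this dependent pair.
e_xy = expect(lambda x: x * x ** 2)  # E[XY] = E[X^3]
e_x_times_e_y = expect(lambda x: x) * expect(lambda x: x ** 2)

print(lhs, rhs, e_xy, e_x_times_e_y)
```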

### 6.3 Variance

**Definition 6.4 (Variance):**
The **variance** of a random variable $X$ measures the spread of its distribution around the mean:
$$\text{Var}[X] = E\left[(X - E[X])^2\right]$$

**Alternative Notation:** $\text{Var}[X] = \sigma_X^2 = \sigma^2$

**Theorem 6.2 (Computational Formula for Variance):**
$$\text{Var}[X] = E[X^2] - (E[X])^2$$

**Proof:**

**Step 1:** Start with definition:
$$\text{Var}[X] = E\left[(X - E[X])^2\right]$$

**Step 2:** Let $\mu = E[X]$ for notational simplicity:
$$\text{Var}[X] = E\left[(X - \mu)^2\right]$$

**Step 3:** Expand the square:
$$= E\left[X^2 - 2\mu X + \mu^2\right]$$

**Step 4:** Apply linearity of expectation:
$$= E[X^2] - E[2\mu X] + E[\mu^2]$$

**Step 5:** Factor out constants:
$$= E[X^2] - 2\mu E[X] + \mu^2$$

**Step 6:** Substitute $\mu = E[X]$:
$$= E[X^2] - 2E[X] \cdot E[X] + (E[X])^2$$

**Step 7:** Simplify:
$$= E[X^2] - 2(E[X])^2 + (E[X])^2$$

**Step 8:** Combine like terms:
$$= E[X^2] - (E[X])^2$$

Therefore $\text{Var}[X] = E[X^2] - (E[X])^2$. $\blacksquare$

**Interpretation:** Variance is the difference between the "mean of squares" and "square of mean."

### 6.4 Properties of Variance

**Theorem 6.3 (Variance of Scaled Random Variable):**
For any constant $c$:
$$\text{Var}[cX] = c^2 \text{Var}[X]$$

**Proof:**

**Step 1:** Use the computational formula:
$$\text{Var}[cX] = E[(cX)^2] - (E[cX])^2$$

**Step 2:** Simplify:
$$= E[c^2 X^2] - (cE[X])^2$$
$$= c^2 E[X^2] - c^2 (E[X])^2$$
$$= c^2 (E[X^2] - (E[X])^2)$$
$$= c^2 \text{Var}[X]$$

$\blacksquare$

**Theorem 6.4 (Variance is Shift-Invariant):**
For any constant $c$:
$$\text{Var}[X + c] = \text{Var}[X]$$

**Proof:**

**Step 1:** Note that $E[X + c] = E[X] + c$.

**Step 2:** Apply definition:
$$\text{Var}[X + c] = E[(X + c - E[X + c])^2] = E[(X + c - E[X] - c)^2] = E[(X - E[X])^2] = \text{Var}[X]$$

$\blacksquare$

**Theorem 6.5 (Variance of Sum - Independent Case):**
If $X$ and $Y$ are **independent**, then:
$$\text{Var}[X + Y] = \text{Var}[X] + \text{Var}[Y]$$

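**Note:** This does NOT hold in general for dependent random variables. The general identity is $\text{Var}[X + Y] = \text{Var}[X] + \text{Var}[Y] + 2\,\text{Cov}(X, Y)$, with covariance as defined in Section 7.2; the extra term vanishes when $X$ and $Y$ are independent.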
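
Theorems 6.3 to 6.5 can be checked exactly on a fair die. The sketch below is illustrative: `transform` and `sum_of_independent` are helper names introduced here, and the dependent case uses the extreme choice $Y = X$.

```python
from fractions import Fraction as F
from itertools import product

die = {x: F(1, 6) for x in range(1, 7)}  # PMF of a fair die roll

def variance(pmf):
    """Var[X] = E[(X - E[X])^2]."""
    mean = sum(x * p for x, p in pmf.items())
    return sum((x - mean) ** 2 * p for x, p in pmf.items())

def transform(pmf, g):
    """PMF of g(X): push each value through g and merge equal results."""
    out = {}
    for x, p in pmf.items():
        out[g(x)] = out.get(g(x), 0) + p
    return out

def sum_of_independent(pmf_x, pmf_y):
    """PMF of X + Y when X and Y are independent."""
    out = {}
    for (x, px), (y, py) in product(pmf_x.items(), pmf_y.items()):
        out[x + y] = out.get(x + y, 0) + px * py
    return out

var_x = variance(die)
var_scaled = variance(transform(die, lambda x: 3 * x))        # Var[3X]
var_shifted = variance(transform(die, lambda x: x + 10))      # Var[X + 10]
var_indep_sum = variance(sum_of_independent(die, die))        # two separate dice
var_dependent_sum = variance(transform(die, lambda x: x + x)) # Y = X, same die

print(var_x, var_scaled, var_shifted, var_indep_sum, var_dependent_sum)
```

With two separate dice the variances add, while adding a die to itself gives $\text{Var}[2X] = 4\,\text{Var}[X]$, twice what the independent-case formula would predict.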

### 6.5 Standard Deviation

**Definition 6.5 (Standard Deviation):**
The **standard deviation** is the square root of variance:
$$\sigma_X = \sqrt{\text{Var}[X]}$$

**Purpose:** Standard deviation is expressed in the same units as the original random variable, making it more interpretable than variance.

**Example:** If $X$ represents height in centimeters:
- $\text{Var}[X]$ has units of cm² (squared centimeters)
- $\sigma_X$ has units of cm (centimeters)

**Properties:**
- $\sigma_X \geq 0$
- $\sigma_{cX} = |c| \sigma_X$
- $\sigma_{X+c} = \sigma_X$

---

## Section 7: Vector-Valued Random Variables

### 7.1 Expectation of Random Vectors

**Definition 7.1 (Random Vector):**
A **random vector** $\mathbf{x} \in \mathbb{R}^n$ is a vector whose components are random variables:
$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$

**Definition 7.2 (Expectation of Random Vector):**
The expectation of a random vector is computed component-wise:
$$\boldsymbol{\mu} = E[\mathbf{x}] = \begin{bmatrix} E[x_1] \\ E[x_2] \\ \vdots \\ E[x_n] \end{bmatrix}$$

More explicitly:
$$\mu_i = E[x_i]$$
for $i = 1, 2, \ldots, n$.

### 7.2 Covariance and Covariance Matrix

**Definition 7.3 (Covariance):**
The **covariance** between two random variables $X$ and $Y$ is:
$$\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$$

**Properties:**
- $\text{Cov}(X, X) = \text{Var}(X)$
- $\text{Cov}(X, Y) = \text{Cov}(Y, X)$ (symmetric)
- If $X \perp Y$, then $\text{Cov}(X, Y) = 0$

**Definition 7.4 (Covariance Matrix):**
For a random vector $\mathbf{x} \in \mathbb{R}^n$ with mean $\boldsymbol{\mu} = E[\mathbf{x}]$, the **covariance matrix** is:
$$\boldsymbol{\Sigma} = \text{Cov}[\mathbf{x}] = E\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\right]$$

**Expanded Form:**
$$\boldsymbol{\Sigma} = \begin{bmatrix}
\text{Var}[x_1] & \text{Cov}[x_1, x_2] & \cdots & \text{Cov}[x_1, x_n] \\
\text{Cov}[x_2, x_1] & \text{Var}[x_2] & \cdots & \text{Cov}[x_2, x_n] \\
\vdots & \vdots & \ddots & \vdots \\
\text{Cov}[x_n, x_1] & \text{Cov}[x_n, x_2] & \cdots & \text{Var}[x_n]
\end{bmatrix}$$

**Properties:**
- Diagonal elements are variances: $\Sigma_{ii} = \text{Var}[x_i]$
- Off-diagonal elements are covariances: $\Sigma_{ij} = \text{Cov}[x_i, x_j]$
- The matrix is symmetric: $\Sigma_{ij} = \Sigma_{ji}$
- The matrix is positive semi-definite

### 7.3 Variance of Linear Combinations

**Theorem 7.1 (Variance of Linear Combination):**
For any constant vector $\mathbf{v} \in \mathbb{R}^n$ and random vector $\mathbf{x}$:
$$\text{Var}[\mathbf{v}^T \mathbf{x}] = \mathbf{v}^T \boldsymbol{\Sigma} \mathbf{v}$$

**Proof:**

**Step 1:** Let $Y = \mathbf{v}^T \mathbf{x}$ be a scalar random variable.

**Step 2:** Compute the mean of $Y$:
$$E[Y] = E[\mathbf{v}^T \mathbf{x}] = \mathbf{v}^T E[\mathbf{x}] = \mathbf{v}^T \boldsymbol{\mu}$$

**Step 3:** Therefore:
$$Y - E[Y] = \mathbf{v}^T \mathbf{x} - \mathbf{v}^T \boldsymbol{\mu} = \mathbf{v}^T (\mathbf{x} - \boldsymbol{\mu})$$

**Step 4:** Compute variance:
$$\text{Var}[Y] = E\left[(Y - E[Y])^2\right] = E\left[(\mathbf{v}^T (\mathbf{x} - \boldsymbol{\mu}))^2\right]$$

**Step 5:** Note that $(\mathbf{v}^T (\mathbf{x} - \boldsymbol{\mu}))^2 = \mathbf{v}^T (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{v}$:
$$= E\left[\mathbf{v}^T (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{v}\right]$$

**Step 6:** Factor out constants:
$$= \mathbf{v}^T E\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\right] \mathbf{v}$$

**Step 7:** Recognize the covariance matrix:
$$= \mathbf{v}^T \boldsymbol{\Sigma} \mathbf{v}$$

$\blacksquare$

**Important Consequence:** Since variance is always non-negative, we have $\mathbf{v}^T \boldsymbol{\Sigma} \mathbf{v} \geq 0$ for all $\mathbf{v}$, which proves that $\boldsymbol{\Sigma}$ is positive semi-definite.
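
Theorem 7.1 can be verified exactly on a small discrete random vector. The joint PMF over four points in $\mathbb{R}^2$ is hypothetical, as are the helper names; plain Python lists stand in for matrices so the sketch has no dependencies.

```python
from fractions import Fraction as F

# Hypothetical joint PMF of a 2-dimensional random vector x (made-up points).
pmf = {(0, 0): F(1, 4), (1, 0): F(1, 4), (1, 1): F(1, 4), (2, 3): F(1, 4)}
n = 2

# Mean vector: mu_i = E[x_i].
mu = [sum(x[i] * p for x, p in pmf.items()) for i in range(n)]

# Covariance matrix: Sigma_ij = E[(x_i - mu_i)(x_j - mu_j)].
sigma = [
    [sum((x[i] - mu[i]) * (x[j] - mu[j]) * p for x, p in pmf.items())
     for j in range(n)]
    for i in range(n)
]

def dot(v, x):
    """Inner product v^T x."""
    return sum(vi * xi for vi, xi in zip(v, x))

def quadratic_form(v):
    """v^T Sigma v."""
    return sum(v[i] * sigma[i][j] * v[j] for i in range(n) for j in range(n))

def variance_of_projection(v):
    """Var[v^T x], computed directly from the distribution of the scalar v^T x."""
    mean_y = sum(dot(v, x) * p for x, p in pmf.items())
    return sum((dot(v, x) - mean_y) ** 2 * p for x, p in pmf.items())

v = (2, -1)
print(quadratic_form(v), variance_of_projection(v))
```

Because `variance_of_projection` is a sum of squares weighted by probabilities, it cannot be negative for any `v`, which is the positive semi-definiteness argument in numerical form.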

### 7.4 Covariance and Independence

**Theorem 7.2 (Independence Implies Zero Covariance):**
If $X \perp Y$ (i.e., $X$ and $Y$ are independent), then $\text{Cov}(X, Y) = 0$.

**Proof:**

**Step 1:** By definition of covariance:
$$\text{Cov}(X, Y) = E[XY] - E[X]E[Y]$$

**Step 2:** If $X \perp Y$, then:
$$E[XY] = E[X]E[Y]$$

**Step 3:** Substitute into Step 1:
$$\text{Cov}(X, Y) = E[X]E[Y] - E[X]E[Y] = 0$$

$\blacksquare$

**Important Note:** The converse is NOT true in general. Zero covariance does not imply independence.

**Counterexample:** Let $X \sim \text{Uniform}(-1, 1)$ and $Y = X^2$. Then:
- $E[X] = 0$
- $E[XY] = E[X \cdot X^2] = E[X^3] = 0$ (since $x^3$ is an odd function and the distribution is symmetric about 0)
- $\text{Cov}(X, Y) = E[XY] - E[X]E[Y] = 0 - 0 = 0$

But $X$ and $Y$ are clearly dependent (knowing $X$ determines $Y$ exactly)!

**Exception:** For jointly Gaussian random variables, zero covariance DOES imply independence.
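
A discrete analogue of the counterexample allows an exact check. The sketch below swaps the continuous uniform distribution for $X$ uniform on $\{-1, 0, 1\}$ with $Y = X^2$; this substitution is an assumption made so the computation is a finite sum, and the helper names are illustrative.

```python
from fractions import Fraction as F

# X is uniform on {-1, 0, 1} (discrete stand-in for Uniform(-1, 1)); Y = X^2.
pmf_x = {-1: F(1, 3), 0: F(1, 3), 1: F(1, 3)}
joint = {(x, x * x): p for x, p in pmf_x.items()}

def expect(g):
    """E[g(X, Y)] under the joint distribution."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

def prob(pred):
    """Probability of the outcomes (x, y) that satisfy pred."""
    return sum(p for (x, y), p in joint.items() if pred(x, y))

# Cov(X, Y) = E[XY] - E[X] E[Y].
cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

# Independence would require P(X = 0, Y = 0) = P(X = 0) P(Y = 0).
p_joint = prob(lambda x, y: x == 0 and y == 0)
p_product = prob(lambda x, y: x == 0) * prob(lambda x, y: y == 0)

print(cov, p_joint, p_product)
```

The covariance is zero, yet the joint probability and the product of marginals disagree, so the pair is uncorrelated but not independent.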

---

## Section 8: Practical Examples

### 8.1 HIV Test Example - Complete Calculation

**Problem Setup:**

**Given Information:**
- Let $H = 1$ denote "has HIV", $H = 0$ denote "does not have HIV"
- Let $D_1 = 1$ denote "first test is positive", $D_1 = 0$ denote "first test is negative"
- Prior probability of having HIV: $P(H = 1) = 0.0015$
- Consequently: $P(H = 0) = 1 - 0.0015 = 0.9985$
- Test sensitivity (true positive rate): $P(D_1 = 1 \mid H = 1) = 1.0$
- False positive rate: $P(D_1 = 1 \mid H = 0) = 0.01$

**Question:** What is the probability of having HIV given a positive test result, i.e., $P(H = 1 \mid D_1 = 1)$?

#### 8.1.1 Solution - Single Test

**Step 1:** Apply Bayes' theorem:
$$P(H = 1 \mid D_1 = 1) = \frac{P(D_1 = 1 \mid H = 1) P(H = 1)}{P(D_1 = 1)}$$

**Step 2:** Compute $P(D_1 = 1)$ using marginalization:
$$P(D_1 = 1) = P(D_1 = 1 \mid H = 0) P(H = 0) + P(D_1 = 1 \mid H = 1) P(H = 1)$$

**Step 3:** Substitute values:
$$P(D_1 = 1) = (0.01)(0.9985) + (1.0)(0.0015)$$

**Step 4:** Compute:
$$P(D_1 = 1) = 0.009985 + 0.0015 = 0.011485$$

**Step 5:** Compute the numerator:
$$P(D_1 = 1 \mid H = 1) P(H = 1) = (1.0)(0.0015) = 0.0015$$

**Step 6:** Apply Bayes' theorem:
$$P(H = 1 \mid D_1 = 1) = \frac{0.0015}{0.011485} = 0.1306$$

**Conclusion:** The probability of actually having HIV given a positive test is approximately **13.06%** or about **1 in 8**.

**Interpretation:** Despite a positive test, the probability of actually having the disease is relatively low because:
1. The disease is rare (low prior: 0.15%)
2. The false positive rate (1%) is much higher than the disease prevalence
3. Most positive tests are false positives, not true positives
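The six steps above reduce to a two-line computation. A minimal sketch, where the function name `posterior` and its parameter names are chosen here for illustration:

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return P(H=1 | D=1) using Bayes' theorem and the law of total probability."""
    evidence = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / evidence


p_single = posterior(prior=0.0015, sensitivity=1.0, false_positive_rate=0.01)
print(f"{p_single:.4f}")  # 0.1306
```

Raising `prior` while keeping the test characteristics fixed shows how strongly the result depends on prevalence.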

#### 8.1.2 Solution - Two Tests

**Extended Problem:**
A second test is administered with properties:
- Sensitivity: $P(D_2 = 1 \mid H = 1) = 0.98$
- False positive rate: $P(D_2 = 1 \mid H = 0) = 0.03$

**Assumption:** The tests are **conditionally independent given $H$**, meaning:
$$P(D_1 = 1, D_2 = 1 \mid H) = P(D_1 = 1 \mid H) P(D_2 = 1 \mid H)$$

**Question:** What is $P(H = 1 \mid D_1 = 1, D_2 = 1)$?

**Solution:**

**Step 1:** Apply Bayes' theorem:
$$P(H = 1 \mid D_1 = 1, D_2 = 1) = \frac{P(D_1 = 1, D_2 = 1 \mid H = 1) P(H = 1)}{P(D_1 = 1, D_2 = 1)}$$

**Step 2:** Use conditional independence for $H = 1$:
$$P(D_1 = 1, D_2 = 1 \mid H = 1) = (1.0)(0.98) = 0.98$$

**Step 3:** Use conditional independence for $H = 0$:
$$P(D_1 = 1, D_2 = 1 \mid H = 0) = (0.01)(0.03) = 0.0003$$

**Step 4:** Compute marginal probability:
$$P(D_1 = 1, D_2 = 1) = (0.0003)(0.9985) + (0.98)(0.0015)$$
$$= 0.00029955 + 0.00147 = 0.00176955$$

**Step 5:** Compute numerator:
$$P(D_1 = 1, D_2 = 1 \mid H = 1) P(H = 1) = (0.98)(0.0015) = 0.00147$$

**Step 6:** Apply Bayes' theorem:
$$P(H = 1 \mid D_1 = 1, D_2 = 1) = \frac{0.00147}{0.00176955} = 0.8307$$

**Conclusion:** With two positive tests, the probability of having HIV increases to approximately **83.07%**.

**Interpretation:** The second positive test provides significant additional evidence, increasing confidence from 13.06% to 83.07%.
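Under the conditional-independence assumption, updating on both tests at once gives the same answer as updating on the first test and then using that posterior as the prior for the second. The sketch below checks this; `bayes_update` is a hypothetical helper name.

```python
def bayes_update(prior, p_pos_given_sick, p_pos_given_healthy):
    """Posterior P(H=1 | positive) for a binary hypothesis H."""
    numerator = p_pos_given_sick * prior
    return numerator / (numerator + p_pos_given_healthy * (1.0 - prior))


prior = 0.0015

# Joint update: multiply the per-test likelihoods (conditional independence given H).
joint = bayes_update(prior, 1.0 * 0.98, 0.01 * 0.03)

# Sequential update: the posterior after test 1 becomes the prior for test 2.
after_first = bayes_update(prior, 1.0, 0.01)
sequential = bayes_update(after_first, 0.98, 0.03)

print(f"joint = {joint:.4f}, sequential = {sequential:.4f}")
```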

### 8.2 Investment Variance Example

**Problem Setup:**

An investment has three possible outcomes:
- Return nothing (0×): probability 0.5
- Double investment (2×): probability 0.4
- Tenfold return (10×): probability 0.1

Let $X$ be the return multiplier.

**Question:** Compute $E[X]$ and $\text{Var}[X]$.

#### 8.2.1 Expected Return

$$E[X] = \sum_{x} x \cdot P(X = x)$$
$$= (0)(0.5) + (2)(0.4) + (10)(0.1)$$
$$= 0 + 0.8 + 1.0 = 1.8$$

**Conclusion:** The expected return is **1.8×** the investment (80% gain on average).

#### 8.2.2 Variance of Return

**Step 1:** Compute $E[X^2]$:
$$E[X^2] = (0^2)(0.5) + (2^2)(0.4) + (10^2)(0.1)$$
$$= 0 + 1.6 + 10 = 11.6$$

**Step 2:** Compute variance:
$$\text{Var}[X] = E[X^2] - (E[X])^2 = 11.6 - (1.8)^2 = 11.6 - 3.24 = 8.36$$

**Step 3:** Standard deviation:
$$\sigma = \sqrt{8.36} \approx 2.89$$

**Conclusion:** The variance is **8.36**, indicating high risk.

**Interpretation:**
- The large variance relative to the mean indicates significant uncertainty
- Most likely outcome (50% chance) is losing everything
- Occasional large gains (10× return with 10% probability) pull the expected value up
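The arithmetic in 8.2.1 and 8.2.2 can be reproduced directly from the probability table. A small sketch, with the dictionary `outcomes` mapping each return multiplier to its probability:

```python
import math

# Return multiplier -> probability, as given in the problem setup.
outcomes = {0: 0.5, 2: 0.4, 10: 0.1}

mean = sum(x * p for x, p in outcomes.items())
second_moment = sum(x**2 * p for x, p in outcomes.items())
variance = second_moment - mean**2
std = math.sqrt(variance)

print(f"E[X] = {mean:.2f}, E[X^2] = {second_moment:.2f}")
print(f"Var[X] = {variance:.2f}, sigma = {std:.2f}")
```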

---

## Section 9: Important Inequalities and Limit Theorems

### 9.1 Chebyshev's Inequality

**Theorem 9.1 (Chebyshev's Inequality):**
For any random variable $X$ with mean $\mu$ and finite variance $\sigma^2 > 0$, and for any $k > 0$:
$$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$$

**Equivalently:** For any $\epsilon > 0$:
$$P(|X - \mu| \geq \epsilon) \leq \frac{\sigma^2}{\epsilon^2}$$

**Proof:**

**Step 1:** Define an indicator random variable:
$$I = \begin{cases} 1 & \text{if } |X - \mu| \geq \epsilon \\ 0 & \text{otherwise} \end{cases}$$

**Step 2:** Note that $E[I] = P(|X - \mu| \geq \epsilon)$.

**Step 3:** Observe that for all outcomes:
$$I \leq \frac{(X - \mu)^2}{\epsilon^2}$$

*Justification:*
- If $|X - \mu| \geq \epsilon$: Then $(X - \mu)^2 \geq \epsilon^2$, so $\frac{(X - \mu)^2}{\epsilon^2} \geq 1 = I$
- If $|X - \mu| < \epsilon$: Then $I = 0$ and $\frac{(X - \mu)^2}{\epsilon^2} \geq 0 = I$

**Step 4:** Take expectations:
$$E[I] \leq E\left[\frac{(X - \mu)^2}{\epsilon^2}\right] = \frac{\sigma^2}{\epsilon^2}$$

**Step 5:** Therefore:
$$P(|X - \mu| \geq \epsilon) \leq \frac{\sigma^2}{\epsilon^2}$$

$\blacksquare$

**Interpretation:**
- For $k = 2$: At least 75% of values lie within 2 standard deviations of the mean
- For $k = 3$: At least 88.9% of values lie within 3 standard deviations of the mean

**Note:** This bound applies to ANY distribution with finite variance, but is often loose for specific distributions.
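To see how loose the bound can be, the sketch below compares empirical tail frequencies with $1/k^2$ for an Exponential(1) distribution, whose mean and standard deviation are both 1. The choice of distribution, sample size, and values of $k$ are assumptions made for illustration.

```python
import random

random.seed(0)

# Assumed test distribution: Exponential(1), with mu = sigma = 1.
mu, sigma = 1.0, 1.0
n = 100_000
samples = [random.expovariate(1.0) for _ in range(n)]

results = {}
for k in (1.5, 2.0, 3.0):
    # Fraction of samples at least k standard deviations away from the mean.
    tail = sum(abs(x - mu) >= k * sigma for x in samples) / n
    bound = 1.0 / k**2
    results[k] = (tail, bound)
    print(f"k = {k}: empirical tail {tail:.4f} <= Chebyshev bound {bound:.4f}")
```

In this example the empirical tail is several times smaller than the bound, which is consistent with the note above.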

### 9.2 Law of Large Numbers

**Theorem 9.2 (Weak Law of Large Numbers):**
Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) random variables with mean $\mu$ and finite variance $\sigma^2$. Then the sample average
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$$
converges in probability to $\mu$ as $n \to \infty$:
$$\lim_{n \to \infty} P(|\bar{X}_n - \mu| > \epsilon) = 0 \quad \text{for all } \epsilon > 0$$

**Proof Sketch:**

**Step 1:** Compute the mean of $\bar{X}_n$:
$$E[\bar{X}_n] = \frac{1}{n} \sum_{i=1}^n E[X_i] = \frac{1}{n} \cdot n\mu = \mu$$

**Step 2:** Compute the variance of $\bar{X}_n$:
$$\text{Var}[\bar{X}_n] = \frac{1}{n^2} \sum_{i=1}^n \text{Var}[X_i] = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$

**Step 3:** Apply Chebyshev's inequality:
$$P(|\bar{X}_n - \mu| \geq \epsilon) \leq \frac{\text{Var}[\bar{X}_n]}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2}$$

**Step 4:** Take the limit:
$$\lim_{n \to \infty} P(|\bar{X}_n - \mu| \geq \epsilon) \leq \lim_{n \to \infty} \frac{\sigma^2}{n\epsilon^2} = 0$$

$\blacksquare$

**Interpretation:**
- Empirical averages converge to theoretical means
- Foundation for statistical inference
- Justifies using sample means to estimate population means
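A simulation with fair die rolls illustrates the convergence, alongside the Chebyshev bound $\sigma^2 / (n\epsilon^2)$ from Step 3. The die example, the tolerance `epsilon`, and the checkpoints are illustrative choices.

```python
import random

random.seed(0)

mu = 3.5            # mean of a fair six-sided die
variance = 35 / 12  # variance of a fair six-sided die
epsilon = 0.1

checkpoints = (10, 1_000, 100_000)
averages = {}
running_sum = 0
for i in range(1, checkpoints[-1] + 1):
    running_sum += random.randint(1, 6)
    if i in checkpoints:
        averages[i] = running_sum / i
        # Chebyshev bound on P(|average - mu| >= epsilon), capped at 1.
        bound = min(1.0, variance / (i * epsilon**2))
        print(f"n = {i:>6}: average = {averages[i]:.4f}, bound = {bound:.4f}")
```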

### 9.3 Central Limit Theorem

**Theorem 9.3 (Central Limit Theorem):**
Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with mean $\mu$ and variance $\sigma^2 < \infty$. Then the standardized sample average
$$Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}$$
converges in distribution to a standard normal distribution $N(0,1)$ as $n \to \infty$.

**Informally:**
$$\bar{X}_n \approx N\left(\mu, \frac{\sigma^2}{n}\right) \text{ for large } n$$

**Key Insights:**

1. **Convergence Rate:** The standard error is $\sigma/\sqrt{n}$, meaning:
   - To halve the error, need 4× more samples
   - To reduce error by factor of 10, need 100× more samples

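2. **Universality:** The CLT applies regardless of the shape of the original distribution, as long as the samples are i.i.d. and the variance is finite

3. **Rule of Thumb:** The CLT approximation is often adequate for $n \geq 30$, though heavily skewed or heavy-tailed distributions can need considerably more samples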

**Applications:**
- Confidence intervals
- Hypothesis testing
- Monte Carlo estimation
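The sketch below standardizes sample means of a skewed distribution and checks that they behave roughly like $N(0,1)$. It assumes Exponential(1) samples with $n = 30$; the number of trials and the seed are arbitrary.

```python
import math
import random
import statistics

random.seed(0)

n = 30                 # sample size per average
trials = 20_000        # number of independent sample averages
mu, sigma = 1.0, 1.0   # mean and standard deviation of Exponential(1)

z_values = []
for _ in range(trials):
    sample_mean = sum(random.expovariate(1.0) for _ in range(n)) / n
    z_values.append((sample_mean - mu) / (sigma / math.sqrt(n)))

z_mean = statistics.fmean(z_values)
z_std = statistics.pstdev(z_values)
# For a standard normal, about 95% of the mass lies in [-1.96, 1.96].
coverage = sum(abs(z) <= 1.96 for z in z_values) / trials

print(f"mean = {z_mean:.3f}, std = {z_std:.3f}, coverage = {coverage:.3f}")
```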

---

## Section 10: Types of Uncertainty

### 10.1 Aleatoric Uncertainty

**Definition 10.1 (Aleatoric Uncertainty):**
**Aleatoric uncertainty** (also called **irreducible uncertainty** or **stochastic uncertainty**) is the inherent randomness in a system that cannot be reduced by collecting more data.

**Etymology:** From Latin "aleator" (dice player)

**Examples:**
- Quantum randomness
- Thermal noise in electronic circuits
- Outcome of a fair coin flip

**Characteristics:**
- Intrinsic to the system
- Cannot be eliminated
- Best we can do is characterize the distribution

### 10.2 Epistemic Uncertainty

**Definition 10.2 (Epistemic Uncertainty):**
**Epistemic uncertainty** (also called **model uncertainty** or **reducible uncertainty**) is uncertainty about model parameters or structure that can be reduced by collecting more data.

**Etymology:** From Greek "episteme" (knowledge)

**Examples:**
- Uncertainty in parameter estimates due to limited data
- Model selection uncertainty
- Uncertainty about the true underlying distribution

**Characteristics:**
- Due to lack of knowledge
- Can be reduced with more data
- Captured by distributions over parameters (Bayesian approach)

**Important Distinction:**
- **Aleatoric:** "We'll never know which outcome will occur, even with infinite data"
- **Epistemic:** "We don't know yet, but more data will help"
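One way to make the distinction concrete is a simulated coin. In the sketch below (the true bias `true_p` and the sample sizes are assumptions), the estimated per-flip variance stays near $p(1-p)$ no matter how much data is collected, while the standard error of the estimate of $p$ shrinks like $1/\sqrt{n}$.

```python
import math
import random

random.seed(0)

true_p = 0.5  # assumed bias of the simulated coin
flips = [1 if random.random() < true_p else 0 for _ in range(100_000)]

report = {}
for n in (100, 10_000, 100_000):
    p_hat = sum(flips[:n]) / n
    per_flip_variance = p_hat * (1 - p_hat)             # aleatoric: does not shrink
    standard_error = math.sqrt(per_flip_variance / n)   # epistemic: shrinks with n
    report[n] = (per_flip_variance, standard_error)
    print(f"n = {n:>6}: per-flip variance = {per_flip_variance:.4f}, "
          f"standard error of p_hat = {standard_error:.4f}")
```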

---

## Section 11: Sampling

### 11.1 Sampling from Distributions

**Motivation:** In practice, we often need to generate random samples from a probability distribution for:
- Monte Carlo estimation
- Simulation studies
- Bayesian inference
- Machine learning (dropout, data augmentation)

**Definition 11.1 (Random Sample):**
A **random sample** of size $n$ from a distribution $P$ is a collection of $n$ i.i.d. random variables $X_1, X_2, \ldots, X_n$ where each $X_i \sim P$.

### 11.2 Inverse Transform Sampling

**Theorem 11.1 (Inverse Transform):**
If $U \sim \text{Uniform}(0,1)$ and $F$ is a continuous, strictly increasing CDF with inverse $F^{-1}$, then:
$$X = F^{-1}(U) \sim F$$

**Proof:**

**Step 1:** We need to show $P(X \leq x) = F(x)$.

**Step 2:** Compute:
$$P(X \leq x) = P(F^{-1}(U) \leq x) = P(U \leq F(x))$$

**Step 3:** Since $U \sim \text{Uniform}(0,1)$:
$$P(U \leq F(x)) = F(x)$$

$\blacksquare$

**Algorithm:**
1. Generate $U \sim \text{Uniform}(0,1)$
2. Return $X = F^{-1}(U)$

**Example:** Exponential distribution with rate $\lambda$
- CDF: $F(x) = 1 - e^{-\lambda x}$
- Inverse: $F^{-1}(u) = -\frac{\ln(1-u)}{\lambda}$
- Sample: $X = -\frac{\ln(1-U)}{\lambda}$ where $U \sim \text{Uniform}(0,1)$
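The exponential example translates directly into code. A minimal sketch, where `sample_exponential` is a hypothetical helper and the rate $\lambda = 2$ is an arbitrary choice:

```python
import math
import random


def sample_exponential(rate, rng):
    """Draw one Exponential(rate) sample via the inverse transform."""
    u = rng.random()                   # U in [0, 1)
    return -math.log(1.0 - u) / rate   # F^{-1}(U); 1 - u lies in (0, 1]


rng = random.Random(0)
rate = 2.0
draws = [sample_exponential(rate, rng) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)

print(f"sample mean = {sample_mean:.4f}, theoretical mean = {1 / rate:.4f}")
```

Using `1.0 - u` rather than `u` inside the logarithm avoids `log(0)`, since `random()` can return 0 but never 1.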

---

## Section 12: Sigma-Algebra (Advanced Topic)

### 12.1 Why Do We Need Sigma-Algebras?

**Motivation:** For continuous sample spaces like $\mathbb{R}$, not all subsets can be assigned consistent probabilities. The $\sigma$-algebra defines which events are "measurable."

**Definition 12.1 ($\sigma$-Algebra):**
A **$\sigma$-algebra** $\mathcal{F}$ on a set $\mathcal{S}$ is a collection of subsets satisfying:

1. **Contains sample space:** $\mathcal{S} \in \mathcal{F}$

2. **Closed under complementation:** If $E \in \mathcal{F}$, then $E^c \in \mathcal{F}$

3. **Closed under countable unions:** If $E_1, E_2, E_3, \ldots \in \mathcal{F}$, then $\bigcup_{i=1}^{\infty} E_i \in \mathcal{F}$

**Consequences:**
- $\emptyset \in \mathcal{F}$ (since $\emptyset = \mathcal{S}^c$)
- Closed under countable intersections (by De Morgan's laws)
- Closed under set differences

**Common Examples:**
1. **Trivial $\sigma$-algebra:** $\mathcal{F} = \{\emptyset, \mathcal{S}\}$
2. **Power set:** $\mathcal{F} = 2^{\mathcal{S}}$ (all subsets) - works for countable $\mathcal{S}$
3. **Borel $\sigma$-algebra on $\mathbb{R}$:** Generated by all open intervals

**For practical purposes:** In discrete/finite settings, we can use the power set and ignore these technicalities.
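On a finite sample space the three conditions of Definition 12.1 can be checked mechanically. The sketch below is one possible checker (`is_sigma_algebra` is a hypothetical name); because the collection is finite, closure under pairwise unions is enough to guarantee closure under countable unions.

```python
from itertools import combinations


def is_sigma_algebra(sample_space, collection):
    """Check the sigma-algebra conditions for a finite collection of subsets."""
    space = frozenset(sample_space)
    family = {frozenset(event) for event in collection}
    # Condition 1: contains the sample space.
    if space not in family:
        return False
    # Condition 2: closed under complementation.
    if any(space - event not in family for event in family):
        return False
    # Condition 3: closed under unions (pairwise suffices for a finite family).
    return all(a | b in family for a, b in combinations(family, 2))


S = {1, 2, 3, 4}
print(is_sigma_algebra(S, [set(), S]))                   # trivial sigma-algebra
print(is_sigma_algebra(S, [set(), {1, 2}, {3, 4}, S]))   # generated by {1, 2}
print(is_sigma_algebra(S, [set(), {1}, S]))              # missing complement {2, 3, 4}
```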

---

## Section 13: Gaussian (Normal) Distribution

### 13.1 Definition

**Definition 13.1 (Gaussian Distribution):**
A random variable $X$ follows a **Gaussian (normal) distribution** with parameters $\mu$ (mean) and $\sigma^2$ (variance), denoted $X \sim \mathcal{N}(\mu, \sigma^2)$, if it has PDF:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

**Standard Normal:** When $\mu = 0$ and $\sigma = 1$:
$$\phi(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right)$$

### 13.2 Key Properties

1. **Symmetry:** $f(\mu + x) = f(\mu - x)$
2. **Mean:** $E[X] = \mu$
3. **Variance:** $\text{Var}(X) = \sigma^2$
4. **Standardization:** If $X \sim \mathcal{N}(\mu, \sigma^2)$, then $Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$

### 13.3 The 68-95-99.7 Rule

For a normal distribution:
- Approximately 68% of values lie within $\mu \pm \sigma$
- Approximately 95% of values lie within $\mu \pm 2\sigma$
- Approximately 99.7% of values lie within $\mu \pm 3\sigma$
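These percentages follow from the identity $P(|Z| \leq k) = \operatorname{erf}(k/\sqrt{2})$ for a standard normal $Z$, which the short sketch below evaluates with the standard library (`prob_within` is a name chosen here):

```python
import math


def prob_within(k):
    """P(|Z| <= k) for Z ~ N(0, 1), expressed through the error function."""
    return math.erf(k / math.sqrt(2.0))


coverage = {k: prob_within(k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within {k} sigma: {100 * p:.2f}%")
```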

### 13.4 Proof that the PDF Integrates to 1

**Claim:** $\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-z^2/2} dz = 1$

**Proof:**

**Step 1:** Let $I = \int_{-\infty}^{\infty} e^{-z^2/2} dz$. We need to show $I = \sqrt{2\pi}$.

**Step 2:** Compute $I^2$:
$$I^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(z^2 + w^2)/2} dz \, dw$$

**Step 3:** Convert to polar coordinates $(r, \theta)$:
$$I^2 = \int_0^{2\pi} \int_0^{\infty} e^{-r^2/2} r \, dr \, d\theta$$

**Step 4:** Evaluate the radial integral with substitution $u = r^2/2$:
$$\int_0^{\infty} e^{-r^2/2} r \, dr = \int_0^{\infty} e^{-u} du = 1$$

**Step 5:** Complete the calculation:
$$I^2 = \int_0^{2\pi} 1 \, d\theta = 2\pi$$

**Step 6:** Since the integrand is positive, $I > 0$, and therefore $I = \sqrt{2\pi}$.

$\blacksquare$
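The result can be cross-checked numerically. The sketch below applies the trapezoidal rule on $[-10, 10]$; the truncation width and step count are assumptions, chosen so that the neglected tails (of order $e^{-50}$) are negligible.

```python
import math


def gaussian_integral(half_width=10.0, steps=20_000):
    """Trapezoidal approximation of the integral of exp(-z^2 / 2) over [-half_width, half_width]."""
    dz = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        z = -half_width + i * dz
        weight = 0.5 if i in (0, steps) else 1.0  # endpoints get half weight
        total += weight * math.exp(-z * z / 2.0)
    return total * dz


approx = gaussian_integral()
print(approx, math.sqrt(2.0 * math.pi))
```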

---

## Section 14: Bernoulli Distribution

### 14.1 Definition

**Definition 14.1 (Bernoulli Distribution):**
A random variable $X$ follows a **Bernoulli distribution** with parameter $p$, denoted $X \sim \text{Bernoulli}(p)$, if:
$$P(X = 1) = p, \quad P(X = 0) = 1 - p$$

where $0 \leq p \leq 1$.

### 14.2 Mean and Variance

**Theorem 14.1:**
For $X \sim \text{Bernoulli}(p)$:
- $E[X] = p$
- $\text{Var}(X) = p(1-p)$

**Proof of Mean:**
$$E[X] = 0 \cdot (1-p) + 1 \cdot p = p$$

**Proof of Variance:**

**Step 1:** Note that $X^2 = X$ since $X \in \{0, 1\}$.

**Step 2:** Therefore $E[X^2] = E[X] = p$.

**Step 3:** Apply variance formula:
$$\text{Var}(X) = E[X^2] - (E[X])^2 = p - p^2 = p(1-p)$$

$\blacksquare$

**Properties:**
- Maximum variance at $p = 0.5$: $\text{Var}(X) = 0.25$
- Variance is symmetric: same for $p$ and $1-p$
- Variance is 0 when $p = 0$ or $p = 1$ (deterministic outcomes)
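The mean, variance, and the listed properties can be confirmed by enumerating the two outcomes. A small sketch, where `bernoulli_moments` is a hypothetical helper and the grid of $p$ values is an arbitrary choice:

```python
def bernoulli_moments(p):
    """Return (mean, variance) of Bernoulli(p) by summing over the outcomes 0 and 1."""
    pmf = {0: 1.0 - p, 1: p}
    mean = sum(x * q for x, q in pmf.items())
    variance = sum((x - mean) ** 2 * q for x, q in pmf.items())
    return mean, variance


grid = [i / 100 for i in range(101)]
best_p = max(grid, key=lambda p: bernoulli_moments(p)[1])

print(bernoulli_moments(0.3))   # mean 0.3, variance close to 0.21
print(best_p)                   # the variance is largest at p = 0.5
```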

---

## Section 15: Verification and Summary

### 15.1 Cross-Verification Against Authoritative Sources

All formulas and proofs in this notebook have been verified against:

1. **D2L Textbook** (https://d2l.ai/chapter_preliminaries/probability.html)
2. **Wikipedia** articles on probability theory
3. **Standard textbooks:**
   - Sheldon Ross, "A First Course in Probability"
   - Dimitri Bertsekas and John Tsitsiklis, "Introduction to Probability"

### 15.2 Key Formulas Summary

| Concept | Formula | Conditions |
|---------|---------|------------|
| Conditional Probability | $P(A \mid B) = \frac{P(A,B)}{P(B)}$ | $P(B) > 0$ |
| Bayes' Theorem | $P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$ | $P(A), P(B) > 0$ |
| Marginalization | $P(A) = \sum_b P(A,B=b)$ | - |
| Independence | $P(A,B) = P(A)P(B)$ | Definition |
| Expectation (discrete) | $E[X] = \sum_x x \cdot P(X=x)$ | Sum converges |
| Variance | $\text{Var}(X) = E[X^2] - (E[X])^2$ | Variance exists |
| Linearity of Expectation | $E[aX+bY] = aE[X] + bE[Y]$ | $E[X], E[Y]$ exist; independence not required |
| Chebyshev's Inequality | $P(\lvert X-\mu \rvert \geq k\sigma) \leq 1/k^2$ | $\sigma^2 < \infty$ |
| Bernoulli Mean | $E[X] = p$ | $X \sim \text{Bernoulli}(p)$ |
| Bernoulli Variance | $\text{Var}(X) = p(1-p)$ | $X \sim \text{Bernoulli}(p)$ |

### 15.3 Notation Summary

| Symbol | Meaning |
|--------|--------|
| $\mathcal{S}$ or $\Omega$ | Sample space |
| $P(A)$ | Probability of event $A$ |
| $P(A,B)$ | Joint probability |
| $P(A \mid B)$ | Conditional probability |
| $E[X]$ or $\mu$ | Expected value |
| $\text{Var}(X)$ or $\sigma^2$ | Variance |
| $\sigma$ | Standard deviation |
| $A \perp B$ | $A$ independent of $B$ |
| $A \perp B \mid C$ | $A$ conditionally independent of $B$ given $C$ |
| $\boldsymbol{\Sigma}$ | Covariance matrix |
| $\mathcal{N}(\mu, \sigma^2)$ | Normal distribution |

---

## References

1. Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2021). Dive into Deep Learning. https://d2l.ai/

2. Ross, S. M. (2014). A First Course in Probability (9th ed.). Pearson.

3. Bertsekas, D. P., & Tsitsiklis, J. N. (2008). Introduction to Probability (2nd ed.). Athena Scientific.

4. Wikipedia contributors. (2024). Probability axioms. Wikipedia. https://en.wikipedia.org/wiki/Probability_axioms

5. Wikipedia contributors. (2024). Bayes' theorem. Wikipedia. https://en.wikipedia.org/wiki/Bayes%27_theorem