# **Introduction**

This notebook provides a comprehensive exploration of **univariate probabilistic models**, focusing on the foundational concepts of probabilistic machine learning. It is designed to build conceptual clarity on how randomness and uncertainty are quantified and modeled using probability theory. The notebook begins by explaining the core ideas of **random variables**, both **discrete** and **continuous**, including how their behavior is described through **probability mass functions (pmf)** and **probability density functions (pdf)**.

Key statistical tools such as the **cumulative distribution function (cdf)** and **quantile functions** are introduced, with intuitive explanations and real-world examples to illustrate how these functions help characterize probability distributions. The notebook also discusses the **moments of a distribution**, including **mean**, **variance**, **standard deviation**, and extends into **conditional moments**, supported by derivations and visualization.

A major part of the notebook emphasizes the distinction between **epistemic (model) uncertainty** and **aleatoric (data) uncertainty**, helping build a foundation for understanding variability in data and predictions. Concepts of **independence** and **conditional independence** between random variables are covered, along with the **law of total expectation** and **law of total variance**, using both mathematical derivations and practical analogies.

The final section highlights the **limitations of summary statistics** using famous examples like **Anscombe’s quartet** and the **Datasaurus Dozen**, demonstrating how datasets with identical low-order statistics can still be fundamentally different in structure and distribution.

Throughout the notebook, the explanations are supported with **mathematical proofs**, making it a self-contained study guide on univariate models and their real-world implications in probabilistic modeling.


# Univariate Model

## Definition:
A univariate model is a model that uses only one input (independent) variable to predict or explain one output (dependent) variable.

## Key Properties:
- Uses only 1 feature (e.g., income, age, temperature).
- Common in simple regression or time series analysis.
- Easy to interpret and visualize (e.g., line plot or scatter plot).

## Example:
Linear regression with one feature:
    Price = β₀ + β₁ × SquareFeet + ε
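
To make the one-feature regression concrete, here is a minimal sketch that fits β₀ and β₁ by ordinary least squares. The data is synthetic and the "true" coefficients, noise level, and names (`fit_line`, `square_feet`, `price`) are assumptions made up for illustration, not values from a real housing dataset.

```python
import random

random.seed(0)

# Hypothetical synthetic data: these coefficients and the noise level are made up.
TRUE_B0, TRUE_B1 = 50_000.0, 120.0
square_feet = [random.uniform(500, 3000) for _ in range(200)]
price = [TRUE_B0 + TRUE_B1 * x + random.gauss(0, 10_000) for x in square_feet]

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

b0_hat, b1_hat = fit_line(square_feet, price)
print(f"estimated intercept: {b0_hat:.0f}, estimated slope: {b1_hat:.1f}")
```

The estimates should land near the coefficients used to generate the data, with the gap reflecting the noise term ε.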

## Use Cases:
- Exploratory analysis
- Simple prediction problems
- Time series forecasting (e.g., AR models)

## Limitations:
- Ignores other potential influencing variables
- Not suitable for complex real-world problems


# Probability: Concepts & Interpretations

## What is Probability?
Probability theory is the mathematical framework for reasoning about uncertainty. It quantifies how likely an event is to occur when the outcome is not certain.

## Two Interpretations of Probability

1. Frequentist Interpretation:
- Defines probability as the long-run frequency of an event.
- Assumes repeated trials under identical conditions.
- Example: In 1000 flips of a fair coin, we expect about 500 heads.
- Limitation: Cannot reason about unique or one-time events.

2. Bayesian Interpretation:
- Defines probability as a measure of belief or uncertainty.
- Based on prior knowledge, updated by evidence.
- Example: Probability that the ice caps will melt by 2030.
- Can model one-time events and adapt based on new data.
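
The frequentist reading (probability as long-run frequency) can be illustrated with a small simulation. This is only a sketch: the function name `heads_frequency` and the chosen flip counts are assumptions for illustration.

```python
import random

random.seed(1)

def heads_frequency(n_flips, p_heads=0.5):
    """Fraction of heads observed in n_flips simulated coin flips."""
    heads = sum(random.random() < p_heads for _ in range(n_flips))
    return heads / n_flips

# The observed frequency tends to settle near 0.5 as the number of flips grows.
for n in (10, 100, 1_000, 100_000):
    print(n, heads_frequency(n))
```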

## Why Bayesian is Used in Machine Learning:
- Captures model uncertainty
- Useful for rare/unique cases
- Naturally aligns with decision-making under uncertainty
- Allows probability updating as new data becomes available

## Key Point:
Both interpretations use the same core probability rules (e.g., Bayes' Theorem, Conditional Probability), even though they interpret them differently.


# Types of Uncertainty

## 1. Epistemic Uncertainty (Model Uncertainty):
- Caused by a lack of knowledge or incomplete models.
- Can be reduced by collecting more or better-quality data.
- Example: Not knowing all variables that affect stock prices.

## 2. Aleatoric Uncertainty (Data/Noise Uncertainty):
- Caused by inherent randomness or noise in the system.
- Cannot be reduced even with more data.
- Example: Tossing a fair coin (p = 0.5 for heads or tails).

## Importance in Machine Learning:
- In active learning, we select samples with high uncertainty (high entropy).
- If uncertainty is epistemic → collecting more data helps.
- If uncertainty is aleatoric → collecting more data doesn’t help.
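
The contrast between the two kinds of uncertainty can be sketched with the coin example. Here the standard error of the estimated heads probability stands in for epistemic uncertainty, and the spread of a single flip stands in for aleatoric uncertainty; that mapping, and the helper name `estimate`, are simplifying assumptions for illustration.

```python
import math
import random

random.seed(2)
TRUE_P = 0.5  # fair coin, as in the aleatoric example

def estimate(n):
    """Return (p_hat, standard error of p_hat, std of a single flip) from n flips."""
    flips = [1 if random.random() < TRUE_P else 0 for _ in range(n)]
    p_hat = sum(flips) / n
    std_error = math.sqrt(p_hat * (1 - p_hat) / n)   # uncertainty about p: shrinks with n
    outcome_std = math.sqrt(p_hat * (1 - p_hat))     # randomness of one flip: stays near 0.5
    return p_hat, std_error, outcome_std

for n in (100, 10_000):
    p_hat, std_error, outcome_std = estimate(n)
    print(f"n={n}: p_hat={p_hat:.3f}, std_error={std_error:.4f}, outcome_std={outcome_std:.3f}")
```

More flips shrink the uncertainty about p, but the outcome of the next flip stays as unpredictable as before.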

## Tip:
- Epistemic = ignorance (fixable).
- Aleatoric = randomness (unfixable).


# Probability Rules

## 1. Conjunction (AND / Joint Probability)
- Pr(A ∧ B) = Pr(A, B)
- If A and B are independent: Pr(A, B) = Pr(A) × Pr(B)

## 2. Union (OR Probability)
- Pr(A ∨ B) = Pr(A) + Pr(B) − Pr(A ∧ B)
- If A and B are mutually exclusive: Pr(A ∨ B) = Pr(A) + Pr(B)

## 3. Conditional Probability
- Pr(B | A) = Pr(A ∧ B) / Pr(A)
- Not defined if Pr(A) = 0

## 4. Independence
- A and B are independent if: Pr(A, B) = Pr(A) × Pr(B)

## 5. Conditional Independence
- A and B are conditionally independent given C:
  Pr(A, B | C) = Pr(A | C) × Pr(B | C)
- Notation: A ⊥ B | C
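
As a quick check of these rules, the sketch below enumerates the 36 outcomes of two fair dice and verifies the conjunction, union, and conditional formulas with exact fractions. The events A and B are arbitrary choices made for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def pr(event):
    """Probability of an event, given as a predicate over (die1, die2)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] % 2 == 0   # first die is even
B = lambda o: o[1] >= 5       # second die shows 5 or 6

p_a_and_b = pr(lambda o: A(o) and B(o))
p_a_or_b = pr(lambda o: A(o) or B(o))
p_b_given_a = p_a_and_b / pr(A)

print(p_a_and_b == pr(A) * pr(B))              # independence: True
print(p_a_or_b == pr(A) + pr(B) - p_a_and_b)   # union rule: True
print(p_b_given_a == pr(B))                    # conditioning on A leaves B unchanged: True
```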


# Random Variables

## What is a Random Variable?
- A variable whose value is uncertain and depends on chance.
- Sample Space: The set of all possible values the variable can take.

## Discrete Random Variables
- Values are countable (e.g., dice rolls, coin flips).
- Probability Mass Function (pmf): 
  - p(x) = Pr(X = x)
  - 0 ≤ p(x) ≤ 1
  - Sum over all p(x) = 1

### Examples:
- Fair die: p(x) = 1/6 for x in {1,2,3,4,5,6}
- Degenerate: p(x) = 1 if x = 1, else 0
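
The two example pmfs can be written as dictionaries and checked against the pmf conditions. This is a minimal sketch; the dictionary representation and the helper `is_valid_pmf` are illustrative choices.

```python
from fractions import Fraction

fair_die = {x: Fraction(1, 6) for x in range(1, 7)}
degenerate = {1: Fraction(1)}   # all mass on x = 1; every other value has p(x) = 0

def is_valid_pmf(pmf):
    """Check 0 <= p(x) <= 1 for every x and that the probabilities sum to 1."""
    return all(0 <= p <= 1 for p in pmf.values()) and sum(pmf.values()) == 1

print(is_valid_pmf(fair_die), is_valid_pmf(degenerate))

# Probability of an event is the sum of the pmf over the values in the event.
p_at_most_two = sum(p for x, p in fair_die.items() if x <= 2)
print(p_at_most_two)
```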

## Continuous Random Variables
- Can take any value in ℝ (e.g., height, temperature)
- Individual values have zero probability; use intervals.

### Cumulative Distribution Function (CDF)
- Notation: P(x) = Pr(X ≤ x)
- Monotonically non-decreasing
- Pr(a < X ≤ b) = P(b) − P(a)

### Probability Density Function (PDF)
- p(x) = d/dx P(x)
- Pr(a < X ≤ b) = ∫ from a to b of p(x) dx
- For small dx: Pr(x < X ≤ x + dx) ≈ p(x) × dx
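
The relationship between the cdf and the pdf can be checked numerically. The sketch below uses a standard normal distribution from Python's `statistics.NormalDist` purely as a convenient example; the interval endpoints and step sizes are arbitrary assumptions.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)
a, b = -1.0, 0.5

# Pr(a < X <= b) from the cdf: P(b) - P(a).
prob_from_cdf = std_normal.cdf(b) - std_normal.cdf(a)

# The same probability from a midpoint Riemann sum of the pdf over (a, b].
n_steps = 10_000
dx = (b - a) / n_steps
prob_from_pdf = sum(std_normal.pdf(a + (i + 0.5) * dx) for i in range(n_steps)) * dx

# For a small interval, Pr(x < X <= x + dx) is close to p(x) * dx.
x, small_dx = 0.3, 1e-4
exact = std_normal.cdf(x + small_dx) - std_normal.cdf(x)
approx = std_normal.pdf(x) * small_dx

print(prob_from_cdf, prob_from_pdf)
print(exact, approx)
```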


# Central Limit Theorem (CLT)

## Statement:
The CLT states that, for independent and identically distributed samples from a
population with finite variance, the distribution of the sample mean approaches
a normal distribution as the sample size increases, whatever the shape of the
original population distribution.

## Conditions:
- Samples must be independent and identically distributed
- Population variance must be finite
- Rule of thumb: n ≥ 30 is often treated as large enough for the approximation (larger is better)

## Mathematical Form:
If X₁, X₂, ..., Xn are i.i.d. with mean μ and variance σ²:
Then as n → ∞,
  Z = (X̄ - μ) / (σ / √n) → N(0, 1)

## Real-Life Example:
Delivery times (skewed) averaged over days → becomes bell-shaped

## Why It Matters:
- Justifies using normal-based methods even when data is not normal
- Forms the basis of confidence intervals, hypothesis testing, etc.
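
A small simulation can illustrate the theorem. The sketch below draws sample means from a right-skewed Exponential(1) population (an assumption chosen because its mean and standard deviation are both 1) and checks that the standardized means behave roughly like N(0, 1).

```python
import random
import statistics

random.seed(3)

def sample_mean(n):
    """Mean of n i.i.d. draws from a right-skewed Exponential(rate=1) distribution."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

n = 50
means = [sample_mean(n) for _ in range(5_000)]

# Exponential(1) has mu = 1 and sigma = 1, so the CLT predicts sample means
# centered near 1 with standard deviation near 1 / sqrt(n).
print(statistics.fmean(means), statistics.stdev(means), 1 / n ** 0.5)

# Standardize and count how many z-values fall within +/- 1.96 (about 95% expected).
z = [(m - 1.0) / (1.0 / n ** 0.5) for m in means]
coverage = sum(abs(v) <= 1.96 for v in z) / len(z)
print(coverage)
```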


# Central Limit Theorem (CLT) Summary with Example

Given:
- Population Mean (μ) = 15
- Population Std. Dev (σ) = 4
- Sample Size (n) = 25

Then:
- Standard Error (SE) = σ / √n = 4 / √25 = 0.8
- The sample means (averages of size-25 samples) will approximately follow a normal distribution:
    ~ N(μ, σ²/n) = N(15, 0.64)

This holds approximately even if the original data is not normally distributed.
The CLT guarantees that the distribution of sample means approaches a normal
distribution as n increases (rule of thumb: n ≥ 30, so with n = 25 the
approximation is rougher for strongly skewed data).
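
The numbers in this example can be reproduced with a few lines. This is a sketch under the normal approximation; the threshold of 16 used at the end is a hypothetical value added only to show how the sampling distribution might be used.

```python
import math
from statistics import NormalDist

mu, sigma, n = 15.0, 4.0, 25

standard_error = sigma / math.sqrt(n)            # 4 / 5 = 0.8
sampling_dist = NormalDist(mu, standard_error)   # N(15, 0.8**2) = N(15, 0.64)

# Central 95% range for the sample mean under the normal approximation.
low, high = sampling_dist.inv_cdf(0.025), sampling_dist.inv_cdf(0.975)

# Approximate probability that a sample mean exceeds 16 (hypothetical threshold).
p_above_16 = 1 - sampling_dist.cdf(16.0)

print(standard_error, sampling_dist.variance)
print(low, high, p_above_16)
```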


# Normal Distribution

## Definition:
A probability distribution where values are symmetrically distributed around the mean in a bell-shaped curve.

## Formula:
f(x) = (1 / sqrt(2πσ²)) * e^(-(x - μ)² / (2σ²))

## Parameters:
- μ (mu): Mean (center)
- σ (sigma): Standard deviation (spread)

## Properties:
- Symmetric shape
- Mean = Median = Mode
- Total area under curve = 1
- Follows 68-95-99.7 Rule
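
The 68-95-99.7 rule can be checked directly from the normal cdf. The helper name `mass_within` is an illustrative assumption.

```python
from statistics import NormalDist

def mass_within(k, mu=0.0, sigma=1.0):
    """Probability that a N(mu, sigma^2) variable lies within k standard deviations of mu."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(k, round(mass_within(k), 4))
```

The result does not depend on μ or σ, which is why the rule applies to every normal distribution.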

## Central Limit Theorem:
The average of a large number of independent samples from any distribution with finite variance tends to be approximately normally distributed.

## Real-Life Examples:
- Human heights
- Exam scores
- Errors in measurements

## Applications:
- Hypothesis testing
- Confidence intervals
- Regression assumptions
- Data normalization


# Quantiles

## What is a Quantile?
- A quantile is the value below which a given percentage of data falls.

## Inverse CDF / Quantile Function / Percent Point Function (PPF)
- P⁻¹(q) = the value x such that Pr(X ≤ x) = q

## Common Quantiles
- Median: P⁻¹(0.5) → 50% of data below
- Lower Quartile: P⁻¹(0.25)
- Upper Quartile: P⁻¹(0.75)
- 95% Interval (for standard Normal): (P⁻¹(0.025), P⁻¹(0.975)) ≈ (−1.96, 1.96)

## Example (Standard Normal: N(0,1))
- 95% of values fall within (−1.96, 1.96)

## General Normal Distribution: N(μ, σ²)
- 95% Interval: (μ − 1.96σ, μ + 1.96σ)
- Approximate: μ ± 2σ

## Real-Life Example
- Average height = 170 cm, σ = 10 cm
- 95% heights: 170 ± 20 = (150 cm, 190 cm)
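
The quantile function is available in Python's standard library as `NormalDist.inv_cdf`. The sketch below reproduces the quantiles quoted above; the height parameters are the ones from the example, and the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()                 # N(0, 1)
heights = NormalDist(mu=170, sigma=10)    # heights in cm, as in the example

median = std_normal.inv_cdf(0.5)
z_low, z_high = std_normal.inv_cdf(0.025), std_normal.inv_cdf(0.975)
h_low, h_high = heights.inv_cdf(0.025), heights.inv_cdf(0.975)

print(median)          # the median of N(0, 1)
print(z_low, z_high)   # close to -1.96 and 1.96
print(h_low, h_high)   # close to 150.4 and 189.6, which the 170 ± 20 rule rounds
```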


# How Quantiles Divide a Probability Distribution

## Definition
- The q-th quantile, x_q, is the value where:
  P(X ≤ x_q) = q
- This divides the distribution into two parts:
  - Left: q probability mass
  - Right: 1 − q probability mass

## Special Case: Median
- Median = 0.5 quantile
- Divides the distribution exactly in half:
  - 50% of values ≤ median
  - 50% ≥ median

## Example (Standard Normal Distribution)
- Q1 (25% quantile): ≈ -0.674
- Median (50% quantile): 0
- Q3 (75% quantile): ≈ +0.674

## Visualization
- Quantiles are vertical "cut-points" on the x-axis of a PDF plot.
- Area under curve to the left = cumulative probability = q
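
The cut-point picture can be checked both in theory and on simulated data. This sketch assumes a standard normal distribution; the sample size is an arbitrary choice.

```python
import random
from statistics import NormalDist, quantiles

random.seed(4)
std_normal = NormalDist()

# Theoretical cut-points and the probability mass on each side of them.
for q in (0.25, 0.5, 0.75):
    x_q = std_normal.inv_cdf(q)
    left = std_normal.cdf(x_q)
    print(f"q={q}: x_q={x_q:+.3f}, left mass={left:.2f}, right mass={1 - left:.2f}")

# Empirical quartiles of simulated draws should land near the same cut-points.
samples = [random.gauss(0.0, 1.0) for _ in range(50_000)]
q1, q2, q3 = quantiles(samples, n=4)
print(q1, q2, q3)
```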

## Conclusion
- Quantiles are useful to summarize data, detect outliers, and understand spread.
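- Each quantile point divides the distribution into two parts with known probabilities, q and 1 − q; a set of evenly spaced quantiles (such as the quartiles) divides it into equal-probability parts.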


# Sets of Related Random Variables

## 1. Joint Distribution
- p(X = x, Y = y) = probability that X takes the value x and Y takes the value y at the same time.
- Represented in a table (if both variables are discrete).
- All values in joint distribution must sum to 1.

## 2. Marginal Distribution
- Probability of a single variable, ignoring others.
- p(X) = ∑ p(X, Y) over all Y (called sum rule).

## 3. Conditional Probability
- p(Y | X) = p(X, Y) / p(X)
- Gives the probability of each value of Y given that a particular value of X has been observed.

## 4. Product Rule
- p(X, Y) = p(X) * p(Y | X)

## 5. Independence
- If X ⊥ Y → p(X, Y) = p(X) * p(Y)
- Requires fewer parameters to define the model.

## 6. Chain Rule (for many variables)
- p(x₁, x₂, ..., x_D) = p(x₁) * p(x₂ | x₁) * ... * p(x_D | x₁:₍D−1₎)

## Real-Life Examples:
- Weather vs Traffic
- Dice rolls
- Multiday weather predictions
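
The sum rule, conditional probability, and product rule can be traced on a small joint table. The weather and traffic probabilities below are made up for illustration, as are the helper names.

```python
# Hypothetical joint distribution p(weather, traffic); the numbers are made up.
joint = {
    ("sunny", "light"): 0.40, ("sunny", "heavy"): 0.20,
    ("rainy", "light"): 0.10, ("rainy", "heavy"): 0.30,
}

def marginal(axis):
    """Sum rule: add up the joint over the other variable."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

p_weather = marginal(0)
p_traffic = marginal(1)

def traffic_given(weather):
    """Conditional: p(traffic | weather) = p(weather, traffic) / p(weather)."""
    return {t: joint[(weather, t)] / p_weather[weather] for t in p_traffic}

# Product rule: p(weather, traffic) = p(weather) * p(traffic | weather).
product_rule_ok = all(
    abs(joint[(w, t)] - p_weather[w] * traffic_given(w)[t]) < 1e-12 for (w, t) in joint
)

# Independence would need p(weather, traffic) = p(weather) * p(traffic) in every cell.
independent = all(
    abs(joint[(w, t)] - p_weather[w] * p_traffic[t]) < 1e-12 for (w, t) in joint
)

print(p_weather, p_traffic)
print(traffic_given("rainy"), product_rule_ok, independent)
```

In this made-up table, traffic depends on the weather, so the independence check fails while the product rule still holds.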



# Independence and Conditional Independence

## Independence:
- X ⊥ Y  ⇔  p(X, Y) = p(X) * p(Y)
- Knowing X doesn't change belief about Y
- Real-life: Two dice rolled separately

## Mutual Independence:
- The joint distribution of every subset of the variables factorizes into the product of its marginals
- Example: Tossing 3 independent coins

## Conditional Independence:
- X ⊥ Y | Z  ⇔  p(X, Y | Z) = p(X | Z) * p(Y | Z)
- Once Z is known, X and Y become independent
- Real-life: 
  - Weather mediates between Ice Cream & Umbrella
  - Disease mediates between Symptoms

## Graph Notation:
- X — Z — Y: Z mediates the relationship between X and Y
- Used in graphical models to simplify complex distributions
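
The ice cream and umbrella example can be turned into a small exact calculation. The sketch builds a joint distribution in which weather (Z) drives both ice cream purchases (X) and umbrella use (Y); all probabilities are hypothetical, and the helper names are illustrative.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical numbers: weather Z influences ice cream (X) and umbrella (Y).
p_z = {"sun": F(1, 2), "rain": F(1, 2)}
p_ice_given_z = {"sun": F(4, 5), "rain": F(1, 5)}
p_umb_given_z = {"sun": F(1, 10), "rain": F(9, 10)}

def bernoulli(p, value):
    """Probability of a binary value (1 or 0) when Pr(value = 1) = p."""
    return p if value == 1 else 1 - p

# Joint built as p(z) * p(x | z) * p(y | z), so X ⊥ Y | Z holds by construction.
joint = {
    (x, y, z): p_z[z] * bernoulli(p_ice_given_z[z], x) * bernoulli(p_umb_given_z[z], y)
    for x, y, z in product((0, 1), (0, 1), p_z)
}

def pr(pred):
    """Probability of the set of (x, y, z) outcomes satisfying pred."""
    return sum(p for (x, y, z), p in joint.items() if pred(x, y, z))

def factorizes(given=None):
    """True if p(x, y | given) = p(x | given) * p(y | given) for all x, y."""
    in_z = (lambda z: True) if given is None else (lambda z: z == given)
    norm = pr(lambda x, y, z: in_z(z))
    for x0, y0 in product((0, 1), repeat=2):
        p_xy = pr(lambda x, y, z: x == x0 and y == y0 and in_z(z)) / norm
        p_x = pr(lambda x, y, z: x == x0 and in_z(z)) / norm
        p_y = pr(lambda x, y, z: y == y0 and in_z(z)) / norm
        if p_xy != p_x * p_y:
            return False
    return True

print(factorizes())                            # without conditioning on the weather
print(factorizes("sun"), factorizes("rain"))   # after conditioning on the weather
```

Without knowing the weather, ice cream and umbrella use are dependent; once the weather is fixed, the joint factorizes.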


* Unconditional independence is rare; Conditional independence is more useful in practice.


# Moments of a Distribution

## 1. Mean (Expected Value)

### Continuous:
$$
E[X] = \int_{-\infty}^{\infty} x \cdot p(x) \, dx
$$

### Discrete:
$$
E[X] = \sum_{x \in \mathcal{X}} x \cdot p(x)
$$

### Linearity of Expectation:
$$
E[aX + b] = aE[X] + b
$$
$$
E\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n E[X_i]
$$

### If Independent:
$$
E\left[\prod_{i=1}^n X_i\right] = \prod_{i=1}^n E[X_i]
$$

## 2. Variance

### Definition:
$$
\text{Var}[X] = E[(X - \mu)^2] = \int (x - \mu)^2 p(x) dx
$$

### Shortcut:
$$
\text{Var}[X] = E[X^2] - \mu^2
$$

### Standard Deviation:
$$
\text{std}[X] = \sqrt{\text{Var}[X]} = \sigma
$$

### Scaling:
$$
\text{Var}[aX + b] = a^2 \cdot \text{Var}[X]
$$

### Variance of Sums (Independent):
$$
\text{Var}\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n \text{Var}[X_i]
$$

### Variance of Products (Independent):
$$
\text{Var}\left[\prod_{i=1}^n X_i\right] = \prod_{i=1}^n \left(\text{Var}[X_i] + (E[X_i])^2\right) - \prod_{i=1}^n (E[X_i])^2
$$
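
These formulas can be verified exactly on a fair die. The sketch below is illustrative: the `expectation` helper and the constants a = 3, b = 10 are assumptions chosen for the check.

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}   # fair die

def expectation(pmf, g=lambda x: x):
    """E[g(X)] for a discrete pmf given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(pmf)                                         # 7/2
var_definition = expectation(pmf, lambda x: (x - mean) ** 2)    # E[(X - mu)^2]
var_shortcut = expectation(pmf, lambda x: x ** 2) - mean ** 2   # E[X^2] - mu^2

# Scaling check with Y = aX + b: the mean shifts and scales, the variance scales by a^2.
a, b = 3, 10
pmf_y = {a * x + b: p for x, p in pmf.items()}
mean_y = expectation(pmf_y)
var_y = expectation(pmf_y, lambda y: (y - mean_y) ** 2)

print(mean, var_definition, var_shortcut)
print(mean_y == a * mean + b, var_y == a ** 2 * var_definition)
```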


# Conditional Moments

## Law of Total Expectation:
$$
\mathbb{E}[X] = \mathbb{E}_Y[\mathbb{E}[X|Y]]
$$

### Meaning:
- Compute expected value of X **within each group Y**
- Average those values weighted by \( P(Y) \)

## Discrete Proof of the Law of Total Expectation

We want to prove:

$$
\mathbb{E}[X] = \mathbb{E}_Y[\mathbb{E}[X|Y]]
$$

### Step-by-step Derivation

Start from the right-hand side:

$$
\mathbb{E}_Y[\mathbb{E}[X|Y]] = \sum_y \left( \sum_x x \cdot p(X = x \mid Y = y) \right) p(Y = y)
$$

By the definition of conditional probability:

$$
= \sum_y \sum_x x \cdot p(X = x, Y = y)
$$

Change the order of summation:

$$
= \sum_x \sum_y x \cdot p(X = x, Y = y)
$$

Factor out \( x \):

$$
= \sum_x x \cdot \left( \sum_y p(X = x, Y = y) \right)
$$

Recognize that summing over all \( y \) gives the marginal \( p(X = x) \):

$$
= \sum_x x \cdot p(X = x)
$$

Which is by definition:

$$
= \mathbb{E}[X]
$$
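
The derivation can be checked on a small joint pmf by computing both sides with exact fractions. The joint table below is hypothetical and chosen only so the arithmetic is easy to follow.

```python
from fractions import Fraction as F

# Hypothetical joint pmf p(X = x, Y = y); values chosen only for illustration.
joint = {
    (1, "a"): F(1, 10), (2, "a"): F(3, 10),
    (1, "b"): F(2, 5),  (5, "b"): F(1, 5),
}

# Left-hand side: E[X] computed directly from the joint (equivalently, the marginal of X).
lhs = sum(x * p for (x, _), p in joint.items())

# Right-hand side: sum over y of E[X | Y = y] * p(Y = y).
rhs = F(0)
for y0 in {y for _, y in joint}:
    p_y = sum(p for (_, y), p in joint.items() if y == y0)
    e_x_given_y = sum(x * p / p_y for (x, y), p in joint.items() if y == y0)
    rhs += e_x_given_y * p_y

print(lhs, rhs, lhs == rhs)
```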

## Real-life Example:

Let X = lightbulb lifetime, Y = factory

Given:
- \( \mathbb{E}[X|Y=1] = 5000 \), \( \mathbb{E}[X|Y=2] = 4000 \)
- \( p(Y=1) = 0.6 \), \( p(Y=2) = 0.4 \)

Then:
$$
\mathbb{E}[X] = 0.6 \cdot 5000 + 0.4 \cdot 4000 = 4600
$$

## Law of Total Variance – Derivation

We want to prove:

$$
\text{Var}[X] = \mathbb{E}_Y[\text{Var}[X|Y]] + \text{Var}_Y[\mathbb{E}[X|Y]]
$$

### Step 1: Start with variance definition

$$
\text{Var}[X] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2
$$

### Step 2: Apply law of total expectation

- For the first term:
  $$
  \mathbb{E}[X^2] = \mathbb{E}_Y[\mathbb{E}[X^2|Y]]
  $$
- For the second term:
  $$
  \mathbb{E}[X] = \mathbb{E}_Y[\mathbb{E}[X|Y]] \Rightarrow (\mathbb{E}[X])^2 = \left( \mathbb{E}_Y[\mathbb{E}[X|Y]] \right)^2
  $$

### Step 3: Plug into the variance formula

$$
\text{Var}[X] = \mathbb{E}_Y[\mathbb{E}[X^2|Y]] - \left( \mathbb{E}_Y[\mathbb{E}[X|Y]] \right)^2
$$

### Step 4: Add and subtract middle term

Add and subtract \( \mathbb{E}_Y[\mathbb{E}[X|Y]^2] \) to split the expression:

$$
\text{Var}[X] = \underbrace{\mathbb{E}_Y[\mathbb{E}[X^2|Y] - \mathbb{E}[X|Y]^2]}_{\text{(1) Expected conditional variance}} + \underbrace{\mathbb{E}_Y[\mathbb{E}[X|Y]^2] - \left( \mathbb{E}_Y[\mathbb{E}[X|Y]] \right)^2}_{\text{(2) Variance of conditional expectation}}
$$

### Final Result

$$
\boxed{
\text{Var}[X] = \mathbb{E}_Y[\text{Var}[X|Y]] + \text{Var}_Y[\mathbb{E}[X|Y]]
}
$$

### Intuition:

- First term: average variance within each group (given \( Y \))
- Second term: variance between group means (how much \( \mathbb{E}[X|Y] \) varies)


## Real-Life Example: Lightbulb Lifetimes from Two Factories

Let:  
- $X$: Lifetime of a lightbulb  
- $Y$: Factory from which the lightbulb comes

Assume:

- **Factory A** ($Y = A$):  
  - Produces 60% of the lightbulbs: $P(Y = A) = 0.6$  
  - Mean lifetime: $\mu_A = \mathbb{E}[X \mid Y = A] = 5000$  
  - Variance: $\sigma_A^2 = \text{Var}(X \mid Y = A) = 100^2 = 10000$

- **Factory B** ($Y = B$):  
  - Produces 40% of the lightbulbs: $P(Y = B) = 0.4$  
  - Mean lifetime: $\mu_B = \mathbb{E}[X \mid Y = B] = 4000$  
  - Variance: $\sigma_B^2 = \text{Var}(X \mid Y = B) = 200^2 = 40000$

### Step 1: Law of Total Expectation

$$
\mathbb{E}[X] = \mathbb{E}[X \mid Y = A] \cdot P(Y = A) + \mathbb{E}[X \mid Y = B] \cdot P(Y = B)
$$

$$
\mathbb{E}[X] = 5000 \cdot 0.6 + 4000 \cdot 0.4 = 3000 + 1600 = \boxed{4600}
$$

The expected lifetime of a randomly selected lightbulb is **4600 hours**.


### Step 2: Law of Total Variance

$$
\text{Var}(X) = \mathbb{E}_Y[\text{Var}(X \mid Y)] + \text{Var}_Y[\mathbb{E}[X \mid Y]]
$$


#### First Term: Expected Conditional Variance

$$
\mathbb{E}_Y[\text{Var}(X \mid Y)] = \sigma_A^2 \cdot P(Y = A) + \sigma_B^2 \cdot P(Y = B)
$$

$$
= 10000 \cdot 0.6 + 40000 \cdot 0.4 = 6000 + 16000 = \boxed{22000}
$$


#### Second Term: Variance of Conditional Means

Let $\mu = \mathbb{E}[X] = 4600$. Then:

$$
\text{Var}_Y[\mathbb{E}[X \mid Y]] = P(Y = A)(\mu_A - \mu)^2 + P(Y = B)(\mu_B - \mu)^2
$$

$$
= 0.6 \cdot (5000 - 4600)^2 + 0.4 \cdot (4000 - 4600)^2
$$

$$
= 0.6 \cdot 160000 + 0.4 \cdot 360000 = 96000 + 144000 = \boxed{240000}
$$

### Final Answer: Total Variance

$$
\text{Var}(X) = 22000 + 240000 = \boxed{262000}
$$


### Summary

- $\mathbb{E}[X] = 4600$ hours  
- $\text{Var}(X) = 262000$ hours²

This shows how total variability arises from both:  
1. **Variation within each factory** (random noise), and  
2. **Variation between factories** (systematic difference in means).
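
The worked example can be cross-checked by simulation. The sketch computes the closed-form values from the two laws and compares them with a Monte Carlo estimate. Drawing lifetimes from a normal distribution within each factory is an assumption added here; the example only specifies the conditional means and variances.

```python
import random
import statistics

random.seed(5)

FACTORIES = {
    "A": {"weight": 0.6, "mean": 5000.0, "std": 100.0},
    "B": {"weight": 0.4, "mean": 4000.0, "std": 200.0},
}

# Closed-form values from the law of total expectation and the law of total variance.
total_mean = sum(f["weight"] * f["mean"] for f in FACTORIES.values())
within = sum(f["weight"] * f["std"] ** 2 for f in FACTORIES.values())
between = sum(f["weight"] * (f["mean"] - total_mean) ** 2 for f in FACTORIES.values())
total_var = within + between

def draw_lifetime():
    """Pick a factory by production share, then draw a lifetime (normality is assumed)."""
    f = FACTORIES["A"] if random.random() < FACTORIES["A"]["weight"] else FACTORIES["B"]
    return random.gauss(f["mean"], f["std"])

samples = [draw_lifetime() for _ in range(200_000)]
sim_mean = statistics.fmean(samples)
sim_var = statistics.pvariance(samples)

print(total_mean, within, between, total_var)
print(sim_mean, sim_var)
```

The simulated mean and variance should sit close to 4600 and 262000, with the between-factory term dominating the total.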



# Limitations of Summary Statistics

## Objective

The objective of this section is to understand why relying solely on summary statistics such as the mean, variance, and correlation can be misleading in data analysis. Through well-known examples such as Anscombe’s Quartet and the Datasaurus Dozen, we will demonstrate that datasets with identical summary statistics can have drastically different underlying distributions. This highlights the importance of combining statistical summaries with graphical data visualization techniques to fully understand the characteristics of data.

## Why Summary Statistics Are Used

Summary statistics are numerical measures that describe and summarize features of a dataset. Common statistics include:

- Mean (average): Indicates the central tendency.
- Variance: Measures the spread or variability of the data.
- Correlation coefficient: Measures the strength and direction of a linear relationship between two variables.

These summaries are often used because they are concise and easy to compute, providing a quick overview of the data. However, they can obscure deeper patterns, anomalies, or structures present in the dataset.

## Case Study 1: Anscombe’s Quartet

In 1973, statistician Francis Anscombe created four small datasets of (x, y) pairs. Each dataset contains 11 data points, and all four share nearly identical summary statistics:

- Mean of x: E[x] = 9
- Sample variance of x: Var[x] = 11
- Mean of y: E[y] ≈ 7.50
- Sample variance of y: Var[y] ≈ 4.12
- Correlation between x and y: ρ(x, y) ≈ 0.816

Despite having nearly identical summary statistics, these four datasets are fundamentally different in structure. For example:

- Dataset I exhibits a linear relationship.
- Dataset II displays a clear non-linear relationship.
- Dataset III includes an outlier that strongly influences the correlation.
- Dataset IV has most data points aligned vertically with one distant outlier.

These differences become immediately obvious when visualizing the data, but remain hidden if one only examines the statistics.
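
The summary statistics can be recomputed from the quartet itself. The values below are the commonly published quartet, typed in here by hand, so they are worth checking against a trusted copy before reuse; the `pearson` helper and the `summary` dictionary are illustrative names.

```python
import statistics

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
anscombe = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

summary = {}
for name, (xs, ys) in anscombe.items():
    summary[name] = {
        "mean_x": statistics.fmean(xs),
        "var_x": statistics.variance(xs),   # sample variance (divides by n - 1)
        "mean_y": statistics.fmean(ys),
        "var_y": statistics.variance(ys),
        "corr": pearson(xs, ys),
    }
    print(name, {k: round(v, 3) for k, v in summary[name].items()})
```

Plotting each (x, y) pair as a scatter plot is what exposes the structural differences that these numbers hide.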

## Case Study 2: The Datasaurus Dozen

A more recent and visually striking demonstration of the same issue is the Datasaurus Dozen, a set of 12 datasets derived from an original dinosaur-shaped dataset (the "Datasaurus"). All of them share the same low-order statistical properties to two decimal places, including:

- Identical means and variances of x and y
- Identical correlation coefficients

However, the shape and distribution of the data vary dramatically. The datasets include:

- A dinosaur-shaped scatter plot (the original "Datasaurus")
- Various geometric shapes like a circle, star, and bullseye
- Patterns such as slanted lines, dots, vertical and horizontal bars

These datasets were generated using simulated annealing, an optimization technique that slightly modifies data points to preserve the original summary statistics while reshaping the data into visually distinct patterns. This process confirms that it is possible to engineer datasets with identical statistics but vastly different distributions.

## Real-Life Example: Factory Production Stability

Consider two machines in a manufacturing plant:

- Machine A produces 1000 units per day, with very little variation (e.g., values between 990 and 1010).
- Machine B also has an average daily output of 1000 units, but with high variability (e.g., sometimes 500 units, sometimes 1500).

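Both machines have the same mean output, so the mean alone cannot tell them apart. The variance would separate these two machines, but even machines matched on both mean and variance can behave very differently (for example, steady output versus swings between two extreme levels). In practice: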

- Machine A is reliable and predictable.
- Machine B is unstable and could lead to supply chain issues.

Without looking at the full distribution of production values or visualizing a time series, the summary statistics alone fail to reveal the operational differences between the two machines.
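
The point that two machines can match on both mean and variance yet behave differently can be sketched with simulated output. Everything here is hypothetical: the standard deviation of 100 units, the bell-shaped output for Machine A, and the two-level output for Machine B are assumptions chosen to make the contrast visible.

```python
import random
import statistics

random.seed(6)
DAYS = 10_000
MEAN, STD = 1000.0, 100.0   # hypothetical daily output mean and standard deviation

# Machine A: bell-shaped output clustered around the mean.
machine_a = [random.gauss(MEAN, STD) for _ in range(DAYS)]

# Machine B: output jumps between two levels, MEAN - STD and MEAN + STD.
machine_b = [MEAN + STD * random.choice((-1, 1)) for _ in range(DAYS)]

def share_near_mean(values, tol=50.0):
    """Fraction of days whose output lies within tol units of the target mean."""
    return sum(abs(v - MEAN) <= tol for v in values) / len(values)

for name, values in (("A", machine_a), ("B", machine_b)):
    print(name, round(statistics.fmean(values), 1), round(statistics.stdev(values), 1),
          round(share_near_mean(values), 3))
```

Both machines report almost the same mean and standard deviation, yet Machine B never produces a day near 1000 units, which only a histogram or time series plot would reveal.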

## Better Visualization Tools

To obtain a complete understanding of a dataset, it is essential to visualize the data using appropriate graphical techniques. Some commonly used visualization methods include:

- **Scatter plots**: Reveal relationships, clusters, and shapes in bivariate data.
- **Box plots**: Summarize distributions while identifying outliers and spread.
- **Violin plots**: Display the full distribution along with summary statistics.
- **Density plots**: Provide a smooth estimation of data distribution.
- **Time series plots**: Capture patterns, trends, and irregularities over time.

These tools complement statistical summaries by revealing aspects of the data that numbers alone cannot capture.

## Boolean Summary Table

The following table summarizes which properties are captured by summary statistics and which are better revealed through visualizations:

| Feature                        | Captured by Summary Stats | Captured by Visualization |
|-------------------------------|----------------------------|----------------------------|
| Mean                          | Yes                        | Yes                        |
| Variance                      | Yes                        | Yes                        |
| Outliers                      | No                         | Yes                        |
| Non-linearity                 | No                         | Yes                        |
| Clustering                    | No                         | Yes                        |
| Distribution shape            | No                         | Yes                        |
| Patterns and symmetry         | No                         | Yes                        |

# **Conclusion**

Summary statistics such as mean, variance, and correlation are useful tools for summarizing data, but they do not tell the whole story. Two datasets may share identical summary statistics yet differ fundamentally in their shape, distribution, and structure. As shown by Anscombe’s Quartet and the Datasaurus Dozen, data visualization is essential for discovering hidden patterns, outliers, and relationships that statistics may conceal.

In practical scenarios, such as evaluating machinery, student performance, or financial markets, reliance on numerical summaries alone can lead to incorrect conclusions. Data visualization allows for deeper insights, improving decision-making and preventing misleading interpretations.

Therefore, the key takeaway is:

**Always visualize data. Never rely solely on summary statistics.**
