In [None]:
'''
 * Copyright (c) 2016 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

# Empirical Statistics and Variance Analysis

## 1. Introduction to Empirical Statistics

In statistics, we often need to estimate population parameters from finite datasets. This notebook explores the transition from population statistics to empirical statistics using a dataset of size $N$ with observations $X_1, X_2, \ldots, X_N$.

## 2. Empirical Mean and Covariance

### Definition 6.9: Empirical Mean and Covariance

The **empirical mean vector** is the arithmetic average of observations for each variable:

$$\bar{x} := \frac{1}{N} \sum_{n=1}^{N} x_n$$

where $x_n \in \mathbb{R}^D$.

The **empirical covariance matrix** is a $D \times D$ matrix defined as:

$$\Sigma := \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^{\top}$$

**Properties of Empirical Covariance:**
- Symmetric
- Positive semidefinite

```python
import numpy as np
import matplotlib.pyplot as plt

# Example: Computing empirical mean and covariance
np.random.seed(42)
N = 1000
D = 2

# Generate sample data
data = np.random.multivariate_normal([2, 3], [[1, 0.5], [0.5, 2]], N)

# Empirical mean
empirical_mean = np.mean(data, axis=0)
print(f"Empirical mean: {empirical_mean}")

# Empirical covariance
empirical_cov = np.cov(data, rowvar=False)
print(f"Empirical covariance:\n{empirical_cov}")
```

## 3. Three Expressions for Variance

### 3.1 Standard Definition (Definition 6.5)

The variance of a random variable $X$ is the expectation of the squared deviation from its expected value $\mu$:

$$V_X[x] := E_X[(x - \mu)^2]$$

This represents the mean of the new random variable $Z := (X - \mu)^2$.

### 3.2 Raw-Score Formula

The variance can be reformulated as:

$$V_X[x] = E_X[x^2] - (E_X[x])^2$$

This is remembered as **"the mean of the square minus the square of the mean"**.

**Advantages:**
- Single-pass computation through data
- Can accumulate $x_i$ and $x_i^2$ simultaneously

**Disadvantages:**
- Numerically unstable when the two terms are large and approximately equal
- Can suffer from loss of numerical precision in floating-point arithmetic

### 3.3 Pairwise Differences Formula

The variance can also be expressed as a sum of pairwise differences:

$$\frac{1}{N^2} \sum_{i,j=1}^{N} (x_i - x_j)^2 = \frac{2}{N} \left( \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \left(\frac{1}{N} \sum_{i=1}^{N} x_i\right)^2 \right)$$

This shows that:
- Left side: $N^2$ pairwise distance terms
- Right side: $N$ terms for mean computation + $N$ terms for variance computation

```python
# Demonstration of three variance formulas
def compute_variance_three_ways(data):
    """
    Compute variance using three different formulas
    """
    x = np.array(data)
    n = len(x)
    
    # Method 1: Standard definition (two-pass)
    mean_x = np.mean(x)
    var1 = np.mean((x - mean_x)**2)
    
    # Method 2: Raw-score formula (one-pass)
    mean_x_sq = np.mean(x**2)
    mean_x_2 = (np.mean(x))**2
    var2 = mean_x_sq - mean_x_2
    
    # Method 3: Pairwise differences
    pairwise_sum = 0
    for i in range(n):
        for j in range(n):
            pairwise_sum += (x[i] - x[j])**2
    var3 = pairwise_sum / (2 * n**2)
    
    return var1, var2, var3

# Example computation
sample_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
v1, v2, v3 = compute_variance_three_ways(sample_data)

print(f"Standard definition: {v1:.6f}")
print(f"Raw-score formula: {v2:.6f}")
print(f"Pairwise differences: {v3:.6f}")
print(f"NumPy variance: {np.var(sample_data, ddof=0):.6f}")
```

## 4. Sums and Transformations of Random Variables

When working with combinations of random variables $X$ and $Y$ with states $x, y \in \mathbb{R}^D$:

### Expected Values
$$E[x + y] = E[x] + E[y]$$
$$E[x - y] = E[x] - E[y]$$

### Variances
$$V[x + y] = V[x] + V[y] + \text{Cov}[x, y] + \text{Cov}[y, x]$$
$$V[x - y] = V[x] + V[y] - \text{Cov}[x, y] - \text{Cov}[y, x]$$

Since covariance is symmetric ($\text{Cov}[x, y] = \text{Cov}[y, x]$):
$$V[x + y] = V[x] + V[y] + 2\text{Cov}[x, y]$$
$$V[x - y] = V[x] + V[y] - 2\text{Cov}[x, y]$$

```python
# Demonstration of variance properties
def demonstrate_variance_properties():
    np.random.seed(123)
    n = 10000
    
    # Generate correlated random variables
    mu1, mu2 = 0, 0
    sigma1, sigma2 = 1, 1.5
    rho = 0.6  # correlation coefficient
    
    cov_matrix = [[sigma1**2, rho*sigma1*sigma2],
                  [rho*sigma1*sigma2, sigma2**2]]
    
    data = np.random.multivariate_normal([mu1, mu2], cov_matrix, n)
    x = data[:, 0]
    y = data[:, 1]
    
    # Compute empirical statistics
    var_x = np.var(x, ddof=0)
    var_y = np.var(y, ddof=0)
    cov_xy = np.cov(x, y, ddof=0)[0, 1]
    
    # Compute variance of sum and difference
    x_plus_y = x + y
    x_minus_y = x - y
    
    var_sum_empirical = np.var(x_plus_y, ddof=0)
    var_diff_empirical = np.var(x_minus_y, ddof=0)
    
    # Theoretical values
    var_sum_theoretical = var_x + var_y + 2*cov_xy
    var_diff_theoretical = var_x + var_y - 2*cov_xy
    
    print("Variance Properties Verification:")
    print(f"Var(X): {var_x:.4f}")
    print(f"Var(Y): {var_y:.4f}")
    print(f"Cov(X,Y): {cov_xy:.4f}")
    print()
    print(f"Var(X+Y) - Empirical: {var_sum_empirical:.4f}")
    print(f"Var(X+Y) - Theoretical: {var_sum_theoretical:.4f}")
    print()
    print(f"Var(X-Y) - Empirical: {var_diff_empirical:.4f}")
    print(f"Var(X-Y) - Theoretical: {var_diff_theoretical:.4f}")

demonstrate_variance_properties()
```

## 5. Computational Considerations

### Numerical Stability
The raw-score formula $V[x] = E[x^2] - (E[x])^2$ can be numerically unstable when:
- Both terms are large
- The terms are approximately equal
- Working with limited floating-point precision

### Alternative: Welford's Algorithm
For numerically stable online variance computation:

```python
def welford_variance(data):
    """
    Numerically stable online variance computation
    """
    count = 0
    mean = 0.0
    M2 = 0.0
    
    for x in data:
        count += 1
        delta = x - mean
        mean += delta / count
        delta2 = x - mean
        M2 += delta * delta2
    
    if count < 2:
        return float('nan')
    else:
        return M2 / count  # Population variance

# Compare with raw-score method for numerical stability
large_numbers = [1e9 + i for i in range(1000)]

# Welford's method
var_welford = welford_variance(large_numbers)

# Raw-score method
mean_sq = np.mean([x**2 for x in large_numbers])
sq_mean = (np.mean(large_numbers))**2
var_raw = mean_sq - sq_mean

print(f"Welford's variance: {var_welford:.6f}")
print(f"Raw-score variance: {var_raw:.6f}")
print(f"True variance: {np.var(large_numbers, ddof=0):.6f}")
```

## 6. Summary

This notebook covered:

1. **Empirical Statistics**: Transition from population to sample statistics
2. **Three Variance Formulas**:
   - Standard definition: $E[(x-\mu)^2]$
   - Raw-score formula: $E[x^2] - (E[x])^2$
   - Pairwise differences: Geometric interpretation
3. **Linear Transformations**: Properties of sums and differences of random variables
4. **Computational Aspects**: Numerical stability considerations

Key takeaways:
- Empirical covariance matrices are symmetric and positive semidefinite
- Different variance formulas offer computational trade-offs
- Covariance terms are crucial when combining random variables
- Numerical stability matters in practical implementations

# Affine Transformations and Statistical Independence

## 1. Affine Transformations of Random Variables

Mean and covariance exhibit useful properties under affine transformations. Consider a random variable $X$ with mean $\mu$ and covariance matrix $\Sigma$, and a deterministic affine transformation:

$$y = Ax + b$$

where $A$ is a matrix and $b$ is a vector. Then $y$ is itself a random variable.

### 1.1 Mean Under Affine Transformation

The mean vector of the transformed variable is:

$$E_Y[y] = E_X[Ax + b] = AE_X[x] + b = A\mu + b \quad (6.50)$$

### 1.2 Covariance Under Affine Transformation

The covariance matrix of the transformed variable is:

$$V_Y[y] = V_X[Ax + b] = V_X[Ax] = AV_X[x]A^{\top} = A\Sigma A^{\top} \quad (6.51)$$

### 1.3 Cross-Covariance Between Original and Transformed Variables

The covariance between $x$ and $y$ is derived as follows:

$$\text{Cov}[x, y] = E[x(Ax + b)^{\top}] - E[x]E[Ax + b]^{\top} \quad (6.52a)$$

$$= E[x]b^{\top} + E[xx^{\top}]A^{\top} - \mu b^{\top} - \mu\mu^{\top}A^{\top} \quad (6.52b)$$

$$= \mu b^{\top} - \mu b^{\top} + E[xx^{\top}] - \mu\mu^{\top}A^{\top} \quad (6.52c)$$

$$= \Sigma A^{\top} \quad (6.52d)$$

where $\Sigma = E[xx^{\top}] - \mu\mu^{\top}$ is the covariance of $X$.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
import seaborn as sns

# Demonstration of affine transformation properties
def demonstrate_affine_transformation():
    np.random.seed(42)
    n_samples = 10000
    
    # Original random variable X
    mu_x = np.array([2, 3])
    Sigma_x = np.array([[2, 0.8], [0.8, 1.5]])
    
    # Generate samples
    X = np.random.multivariate_normal(mu_x, Sigma_x, n_samples)
    
    # Affine transformation: y = Ax + b
    A = np.array([[1.5, 0.5], [-0.3, 2.0]])
    b = np.array([1, -2])
    
    # Transform the data
    Y = (A @ X.T).T + b
    
    # Theoretical calculations
    mu_y_theory = A @ mu_x + b
    Sigma_y_theory = A @ Sigma_x @ A.T
    Cov_xy_theory = Sigma_x @ A.T
    
    # Empirical calculations
    mu_x_emp = np.mean(X, axis=0)
    mu_y_emp = np.mean(Y, axis=0)
    Sigma_x_emp = np.cov(X, rowvar=False)
    Sigma_y_emp = np.cov(Y, rowvar=False)
    Cov_xy_emp = np.cov(X, Y, rowvar=False)[:2, 2:]
    
    print("Affine Transformation Results:")
    print("="*50)
    print(f"Original mean μ_x: {mu_x}")
    print(f"Empirical mean X: {mu_x_emp}")
    print()
    print(f"Theoretical mean μ_y: {mu_y_theory}")
    print(f"Empirical mean Y: {mu_y_emp}")
    print()
    print("Original covariance Σ_x:")
    print(Sigma_x)
    print("Empirical covariance X:")
    print(Sigma_x_emp)
    print()
    print("Theoretical covariance Σ_y:")
    print(Sigma_y_theory)
    print("Empirical covariance Y:")
    print(Sigma_y_emp)
    print()
    print("Theoretical cross-covariance Cov[x,y]:")
    print(Cov_xy_theory)
    print("Empirical cross-covariance:")
    print(Cov_xy_emp)
    
    return X, Y, A, b

# Run demonstration
X, Y, A, b = demonstrate_affine_transformation()
```

### Visualization of Affine Transformation

```python
# Visualize the transformation
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Original data
ax1.scatter(X[:1000, 0], X[:1000, 1], alpha=0.6, s=20)
ax1.set_title('Original Data X')
ax1.set_xlabel('X₁')
ax1.set_ylabel('X₂')
ax1.grid(True, alpha=0.3)

# Transformed data
ax2.scatter(Y[:1000, 0], Y[:1000, 1], alpha=0.6, s=20, color='red')
ax2.set_title('Transformed Data Y = AX + b')
ax2.set_xlabel('Y₁')
ax2.set_ylabel('Y₂')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

## 2. Statistical Independence

### Definition 6.10: Independence
Two random variables $X$ and $Y$ are **statistically independent** if and only if:

$$p(x, y) = p(x)p(y) \quad (6.53)$$

**Intuitive interpretation**: Two random variables are independent if knowing the value of one variable provides no additional information about the other variable.

### Properties of Independent Random Variables

If $X$ and $Y$ are statistically independent, then:

1. $p(y | x) = p(y)$
2. $p(x | y) = p(x)$  
3. $V_{X,Y}[x + y] = V_X[x] + V_Y[y]$
4. $\text{Cov}_{X,Y}[x, y] = 0$

**Important Note**: The converse of property 4 is not necessarily true. Zero covariance does not imply independence because covariance only measures linear dependence.

### Example 6.5: Zero Covariance but Dependence

Consider a random variable $X$ with zero mean ($E_X[x] = 0$) and $E_X[x^3] = 0$. Let $y = x^2$ (so $Y$ is dependent on $X$). The covariance between $X$ and $Y$ is:

$$\text{Cov}[x, y] = E[xy] - E[x]E[y] = E[x^3] = 0 \quad (6.54)$$

Despite zero covariance, $Y$ is clearly dependent on $X$ since $y = x^2$.

```python
# Demonstration of zero covariance with dependence
def demonstrate_zero_covariance_dependence():
    np.random.seed(123)
    n_samples = 10000
    
    # Generate X from a symmetric distribution with zero mean
    # Using a mixture of normal distributions to ensure E[X³] = 0
    X = np.random.normal(0, 1, n_samples)
    
    # Create Y = X²
    Y = X**2
    
    # Calculate statistics
    mean_x = np.mean(X)
    mean_y = np.mean(Y)
    cov_xy = np.cov(X, Y)[0, 1]
    corr_xy = np.corrcoef(X, Y)[0, 1]
    
    print("Zero Covariance but Dependence Example:")
    print("="*45)
    print(f"Mean of X: {mean_x:.6f}")
    print(f"Mean of Y: {mean_y:.6f}")
    print(f"Covariance(X, Y): {cov_xy:.6f}")
    print(f"Correlation(X, Y): {corr_xy:.6f}")
    print(f"Are X and Y independent? No! (Y = X²)")
    
    # Visualization
    plt.figure(figsize=(10, 4))
    
    plt.subplot(1, 2, 1)
    plt.scatter(X[:1000], Y[:1000], alpha=0.6, s=20)
    plt.xlabel('X')
    plt.ylabel('Y = X²')
    plt.title('Y = X² (Clearly Dependent)')
    plt.grid(True, alpha=0.3)
    
    plt.subplot(1, 2, 2)
    plt.hist2d(X, Y, bins=50, density=True)
    plt.xlabel('X')
    plt.ylabel('Y = X²')
    plt.title('Joint Distribution')
    plt.colorbar()
    
    plt.tight_layout()
    plt.show()

demonstrate_zero_covariance_dependence()
```

## 3. Independent and Identically Distributed (i.i.d.) Random Variables

In machine learning, we often work with **independent and identically distributed (i.i.d.)** random variables $X_1, X_2, \ldots, X_N$.

- **Independent**: All subsets of variables are mutually independent
- **Identically Distributed**: All variables follow the same distribution

```python
# Example of i.i.d. random variables
def demonstrate_iid_variables():
    np.random.seed(42)
    n_variables = 5
    n_samples = 1000
    
    # Generate i.i.d. normal random variables
    iid_data = np.random.normal(loc=2, scale=1.5, size=(n_samples, n_variables))
    
    print("I.I.D. Random Variables Analysis:")
    print("="*35)
    
    # Check identical distribution (similar means and variances)
    means = np.mean(iid_data, axis=0)
    variances = np.var(iid_data, axis=0)
    
    print("Sample means:", means)
    print("Sample variances:", variances)
    print()
    
    # Check independence (covariance matrix should be approximately diagonal)
    cov_matrix = np.cov(iid_data, rowvar=False)
    print("Covariance matrix:")
    print(cov_matrix)
    print()
    print("Off-diagonal elements (should be close to 0):")
    off_diag = cov_matrix - np.diag(np.diag(cov_matrix))
    print(f"Max absolute off-diagonal: {np.max(np.abs(off_diag)):.4f}")

demonstrate_iid_variables()
```

## 4. Conditional Independence

### Definition 6.11: Conditional Independence
Two random variables $X$ and $Y$ are **conditionally independent** given $Z$ if and only if:

$$p(x, y | z) = p(x | z)p(y | z) \quad \text{for all } z \in Z \quad (6.55)$$

We write $X \perp\!\!\!\perp Y | Z$ to denote conditional independence.

### Alternative Definition
Using the product rule of probability:

$$p(x, y | z) = p(x | y, z)p(y | z) \quad (6.56)$$

Comparing with equation (6.55):

$$p(x | y, z) = p(x | z) \quad (6.57)$$

**Interpretation**: "Given that we know $z$, knowledge about $y$ does not change our knowledge of $x$."

### Relationship to Independence
Independence can be viewed as a special case of conditional independence: $X \perp\!\!\!\perp Y | \emptyset$.

```python
# Demonstration of conditional independence
def demonstrate_conditional_independence():
    """
    Example: Weather and Traffic are dependent, but conditionally 
    independent given the day of the week
    """
    np.random.seed(42)
    n_samples = 10000
    
    # Simulate day of week (0=Monday, ..., 6=Sunday)
    # Weekdays (0-4) vs Weekends (5-6)
    day_of_week = np.random.randint(0, 7, n_samples)
    is_weekend = (day_of_week >= 5).astype(int)
    
    # Weather: 1=Good, 0=Bad
    # Weekend bias: better weather on weekends (hypothetically)
    weather_prob = 0.3 + 0.4 * is_weekend  # 30% good weather on weekdays, 70% on weekends
    weather = np.random.binomial(1, weather_prob)
    
    # Traffic: 1=Heavy, 0=Light  
    # Weekend bias: lighter traffic on weekends, but also depends on weather
    traffic_base_prob = 0.8 - 0.6 * is_weekend  # 80% heavy on weekdays, 20% on weekends
    # Weather influence: good weather increases traffic slightly
    traffic_prob = traffic_base_prob + 0.1 * weather * (1 - is_weekend)
    traffic = np.random.binomial(1, traffic_prob)
    
    # Analyze dependencies
    print("Conditional Independence Analysis:")
    print("="*40)
    
    # Overall correlation (should be non-zero)
    overall_corr = np.corrcoef(weather, traffic)[0, 1]
    print(f"Overall correlation (Weather, Traffic): {overall_corr:.4f}")
    
    # Conditional correlations given weekend status
    weekday_mask = is_weekend == 0
    weekend_mask = is_weekend == 1
    
    weekday_corr = np.corrcoef(weather[weekday_mask], traffic[weekday_mask])[0, 1]
    weekend_corr = np.corrcoef(weather[weekend_mask], traffic[weekend_mask])[0, 1]
    
    print(f"Weekday correlation: {weekday_corr:.4f}")
    print(f"Weekend correlation: {weekend_corr:.4f}")
    print()
    print("If conditionally independent, both should be close to 0")
    
    # Contingency tables
    from scipy.stats import chi2_contingency
    
    # Overall contingency table
    overall_table = np.array([[np.sum((weather == 0) & (traffic == 0)),
                              np.sum((weather == 0) & (traffic == 1))],
                             [np.sum((weather == 1) & (traffic == 0)),
                              np.sum((weather == 1) & (traffic == 1))]])
    
    chi2_overall, p_overall = chi2_contingency(overall_table)[:2]
    
    print(f"\nOverall independence test:")
    print(f"Chi-square: {chi2_overall:.4f}, p-value: {p_overall:.4f}")
    
    # Conditional tables
    weekday_table = np.array([[np.sum((weather[weekday_mask] == 0) & (traffic[weekday_mask] == 0)),
                              np.sum((weather[weekday_mask] == 0) & (traffic[weekday_mask] == 1))],
                             [np.sum((weather[weekday_mask] == 1) & (traffic[weekday_mask] == 0)),
                              np.sum((weather[weekday_mask] == 1) & (traffic[weekday_mask] == 1))]])
    
    if np.all(weekday_table > 0):  # Avoid division by zero
        chi2_weekday, p_weekday = chi2_contingency(weekday_table)[:2]
        print(f"Weekday conditional independence test:")
        print(f"Chi-square: {chi2_weekday:.4f}, p-value: {p_weekday:.4f}")

demonstrate_conditional_independence()
```

## 5. Summary and Key Takeaways

### Affine Transformations
1. **Mean transforms linearly**: $E[Ax + b] = AE[x] + b$
2. **Covariance transforms quadratically**: $\text{Var}[Ax + b] = A\text{Var}[x]A^{\top}$
3. **Cross-covariance**: $\text{Cov}[x, Ax + b] = \text{Cov}[x]A^{\top}$

### Statistical Independence
1. **Definition**: $p(x, y) = p(x)p(y)$
2. **Zero covariance** is necessary but not sufficient for independence
3. **Nonlinear dependence** can exist with zero covariance

### Conditional Independence
1. **Definition**: $p(x, y | z) = p(x | z)p(y | z)$
2. **Interpretation**: Given $z$, knowing $y$ provides no information about $x$
3. **Independence** is a special case with empty conditioning set

### Applications in Machine Learning
- **Feature transformations**: Understanding how preprocessing affects statistics
- **Dimensionality reduction**: PCA uses covariance properties
- **Probabilistic models**: Independence assumptions simplify computations
- **Causal inference**: Conditional independence reveals causal structures

```python
# Final comprehensive example
def comprehensive_example():
    """
    Combine all concepts in a machine learning context
    """
    np.random.seed(42)
    n_samples = 5000
    
    # Original features
    X1 = np.random.normal(0, 1, n_samples)
    X2 = np.random.normal(0, 1, n_samples)
    
    # Create dependent feature through affine transformation
    A = np.array([[2, 1], [0.5, 3]])
    b = np.array([1, -1])
    
    X_original = np.column_stack([X1, X2])
    X_transformed = (A @ X_original.T).T + b
    
    # Add noise to create realistic scenario
    noise = np.random.normal(0, 0.1, (n_samples, 2))
    X_final = X_transformed + noise
    
    print("Comprehensive Example: ML Pipeline")
    print("="*40)
    print("1. Generated independent features X1, X2")
    print("2. Applied affine transformation Y = AX + b")
    print("3. Added noise for realism")
    print()
    
    # Analyze correlations
    corr_original = np.corrcoef(X_original.T)
    corr_final = np.corrcoef(X_final.T)
    
    print("Original correlation matrix:")
    print(corr_original)
    print("\nFinal correlation matrix:")
    print(corr_final)
    
    # This demonstrates how preprocessing can introduce correlations
    # even when starting with independent features

comprehensive_example()
```

This notebook demonstrates the fundamental concepts of affine transformations and statistical independence that are crucial for understanding machine learning algorithms and statistical modeling.

In [1]:
#!/usr/bin/env python3
"""
Statistical Concepts Implementation in Core Python
=================================================

This module implements statistical concepts including:
- Empirical mean and covariance
- Three expressions for variance
- Affine transformations
- Statistical independence tests
- Conditional independence

Author: Core Python Implementation
Date: 2025
"""

import math
import random
from typing import List, Tuple, Dict, Optional, Union

class Matrix:
    """A simple matrix class for linear algebra operations."""
    
    def __init__(self, data: List[List[float]]):
        """Initialize matrix with 2D list of data."""
        self.data = data
        self.rows = len(data)
        self.cols = len(data[0]) if data else 0
        
        # Validate matrix dimensions
        for row in data:
            if len(row) != self.cols:
                raise ValueError("All rows must have the same number of columns")
    
    def __repr__(self):
        """String representation of matrix."""
        return f"Matrix({self.rows}x{self.cols}):\n" + \
               "\n".join([" ".join([f"{x:8.4f}" for x in row]) for row in self.data])
    
    def __getitem__(self, key):
        """Get item using matrix[i][j] notation."""
        return self.data[key]
    
    def __add__(self, other):
        """Matrix addition."""
        if self.rows != other.rows or self.cols != other.cols:
            raise ValueError("Matrices must have same dimensions for addition")
        
        result = []
        for i in range(self.rows):
            row = []
            for j in range(self.cols):
                row.append(self.data[i][j] + other.data[i][j])
            result.append(row)
        
        return Matrix(result)
    
    def __mul__(self, other):
        """Matrix multiplication or scalar multiplication."""
        if isinstance(other, (int, float)):
            # Scalar multiplication
            result = []
            for i in range(self.rows):
                row = []
                for j in range(self.cols):
                    row.append(self.data[i][j] * other)
                result.append(row)
            return Matrix(result)
        
        elif isinstance(other, Matrix):
            # Matrix multiplication
            if self.cols != other.rows:
                raise ValueError("Invalid dimensions for matrix multiplication")
            
            result = []
            for i in range(self.rows):
                row = []
                for j in range(other.cols):
                    sum_val = 0
                    for k in range(self.cols):
                        sum_val += self.data[i][k] * other.data[k][j]
                    row.append(sum_val)
                result.append(row)
            return Matrix(result)
        
        else:
            raise TypeError("Can only multiply by scalar or Matrix")
    
    def transpose(self):
        """Return transpose of matrix."""
        result = []
        for j in range(self.cols):
            row = []
            for i in range(self.rows):
                row.append(self.data[i][j])
            result.append(row)
        return Matrix(result)


class StatisticalAnalyzer:
    """Core statistical analysis functions."""
    
    @staticmethod
    def mean(data: List[float]) -> float:
        """Calculate arithmetic mean."""
        if not data:
            raise ValueError("Cannot calculate mean of empty list")
        return sum(data) / len(data)
    
    @staticmethod
    def mean_vector(data: List[List[float]]) -> List[float]:
        """Calculate empirical mean vector for multivariate data."""
        if not data:
            raise ValueError("Cannot calculate mean of empty dataset")
        
        n_samples = len(data)
        n_dimensions = len(data[0])
        
        means = []
        for d in range(n_dimensions):
            dimension_sum = sum(sample[d] for sample in data)
            means.append(dimension_sum / n_samples)
        
        return means
    
    @staticmethod
    def variance_standard(data: List[float]) -> float:
        """
        Calculate variance using standard definition: E[(x - μ)²]
        Two-pass algorithm.
        """
        if len(data) < 2:
            raise ValueError("Need at least 2 data points")
        
        # First pass: calculate mean
        mean_val = StatisticalAnalyzer.mean(data)
        
        # Second pass: calculate variance
        squared_deviations = [(x - mean_val) ** 2 for x in data]
        return StatisticalAnalyzer.mean(squared_deviations)
    
    @staticmethod
    def variance_raw_score(data: List[float]) -> float:
        """
        Calculate variance using raw-score formula: E[x²] - (E[x])²
        One-pass algorithm (but numerically unstable).
        """
        if len(data) < 2:
            raise ValueError("Need at least 2 data points")
        
        mean_val = StatisticalAnalyzer.mean(data)
        mean_squared = StatisticalAnalyzer.mean([x ** 2 for x in data])
        
        return mean_squared - (mean_val ** 2)
    
    @staticmethod
    def variance_pairwise(data: List[float]) -> float:
        """
        Calculate variance using pairwise differences formula.
        More computationally expensive but illustrates the concept.
        """
        if len(data) < 2:
            raise ValueError("Need at least 2 data points")
        
        n = len(data)
        pairwise_sum = 0
        
        for i in range(n):
            for j in range(n):
                pairwise_sum += (data[i] - data[j]) ** 2
        
        return pairwise_sum / (2 * n * n)
    
    @staticmethod
    def variance_welford(data: List[float]) -> float:
        """
        Welford's algorithm for numerically stable variance calculation.
        Online algorithm that processes data in one pass.
        """
        if len(data) < 2:
            raise ValueError("Need at least 2 data points")
        
        count = 0
        mean = 0.0
        M2 = 0.0
        
        for x in data:
            count += 1
            delta = x - mean
            mean += delta / count
            delta2 = x - mean
            M2 += delta * delta2
        
        return M2 / count  # Population variance
    
    @staticmethod
    def covariance_matrix(data: List[List[float]]) -> Matrix:
        """
        Calculate empirical covariance matrix.
        Formula: Σ = (1/N) * Σ(xn - x̄)(xn - x̄)ᵀ
        """
        if not data:
            raise ValueError("Cannot calculate covariance of empty dataset")
        
        n_samples = len(data)
        n_dimensions = len(data[0])
        
        # Calculate mean vector
        mean_vec = StatisticalAnalyzer.mean_vector(data)
        
        # Initialize covariance matrix
        cov_matrix = [[0.0 for _ in range(n_dimensions)] for _ in range(n_dimensions)]
        
        # Calculate covariance matrix
        for sample in data:
            # Calculate deviations from mean
            deviations = [sample[i] - mean_vec[i] for i in range(n_dimensions)]
            
            # Outer product of deviations
            for i in range(n_dimensions):
                for j in range(n_dimensions):
                    cov_matrix[i][j] += deviations[i] * deviations[j]
        
        # Normalize by N
        for i in range(n_dimensions):
            for j in range(n_dimensions):
                cov_matrix[i][j] /= n_samples
        
        return Matrix(cov_matrix)
    
    @staticmethod
    def covariance(x: List[float], y: List[float]) -> float:
        """Calculate covariance between two variables."""
        if len(x) != len(y):
            raise ValueError("Variables must have same length")
        
        mean_x = StatisticalAnalyzer.mean(x)
        mean_y = StatisticalAnalyzer.mean(y)
        
        cov = sum((x[i] - mean_x) * (y[i] - mean_y) for i in range(len(x)))
        return cov / len(x)
    
    @staticmethod
    def correlation(x: List[float], y: List[float]) -> float:
        """Calculate Pearson correlation coefficient."""
        cov_xy = StatisticalAnalyzer.covariance(x, y)
        var_x = StatisticalAnalyzer.variance_standard(x)
        var_y = StatisticalAnalyzer.variance_standard(y)
        
        if var_x == 0 or var_y == 0:
            return 0.0
        
        return cov_xy / math.sqrt(var_x * var_y)


class AffineTransformation:
    """Implements affine transformations and their statistical properties."""
    
    @staticmethod
    def transform_data(data: List[List[float]], A: Matrix, b: List[float]) -> List[List[float]]:
        """
        Apply affine transformation y = Ax + b to data.
        
        Args:
            data: List of data points (each point is a list of coordinates)
            A: Transformation matrix
            b: Translation vector
        
        Returns:
            Transformed data points
        """
        transformed_data = []
        
        for point in data:
            # Convert point to matrix for multiplication
            point_matrix = Matrix([[coord] for coord in point])
            
            # Apply transformation: y = Ax + b
            transformed_point_matrix = A * point_matrix
            
            # Add translation vector
            transformed_point = []
            for i in range(len(b)):
                transformed_point.append(transformed_point_matrix[i][0] + b[i])
            
            transformed_data.append(transformed_point)
        
        return transformed_data
    
    @staticmethod
    def transform_mean(mean_x: List[float], A: Matrix, b: List[float]) -> List[float]:
        """
        Calculate mean of transformed variable: E[y] = A*μ + b
        """
        # Convert mean to matrix
        mean_matrix = Matrix([[m] for m in mean_x])
        
        # Apply transformation
        transformed_mean_matrix = A * mean_matrix
        
        # Add translation and convert back to list
        transformed_mean = []
        for i in range(len(b)):
            transformed_mean.append(transformed_mean_matrix[i][0] + b[i])
        
        return transformed_mean
    
    @staticmethod
    def transform_covariance(cov_x: Matrix, A: Matrix) -> Matrix:
        """
        Calculate covariance of transformed variable: Var[y] = A*Σ*Aᵀ
        """
        A_transpose = A.transpose()
        return A * cov_x * A_transpose
    
    @staticmethod
    def cross_covariance(cov_x: Matrix, A: Matrix) -> Matrix:
        """
        Calculate cross-covariance: Cov[x, y] = Σ*Aᵀ
        """
        A_transpose = A.transpose()
        return cov_x * A_transpose


class IndependenceTests:
    """Tests for statistical independence."""
    
    @staticmethod
    def chi_square_test(contingency_table: List[List[int]]) -> Tuple[float, float]:
        """
        Chi-square test for independence.
        
        Args:
            contingency_table: 2x2 contingency table
            
        Returns:
            (chi_square_statistic, degrees_of_freedom)
        """
        # Calculate marginal totals
        row_totals = [sum(row) for row in contingency_table]
        col_totals = [sum(contingency_table[i][j] for i in range(len(contingency_table))) 
                     for j in range(len(contingency_table[0]))]
        
        total = sum(row_totals)
        
        if total == 0:
            return 0.0, 0
        
        # Calculate expected frequencies
        chi_square = 0.0
        for i in range(len(contingency_table)):
            for j in range(len(contingency_table[0])):
                expected = (row_totals[i] * col_totals[j]) / total
                if expected > 0:
                    observed = contingency_table[i][j]
                    chi_square += ((observed - expected) ** 2) / expected
        
        # Degrees of freedom
        df = (len(contingency_table) - 1) * (len(contingency_table[0]) - 1)
        
        return chi_square, df
    
    @staticmethod
    def create_contingency_table(x: List[int], y: List[int]) -> List[List[int]]:
        """Create 2x2 contingency table from binary variables."""
        if len(x) != len(y):
            raise ValueError("Variables must have same length")
        
        # Initialize 2x2 table
        table = [[0, 0], [0, 0]]
        
        for i in range(len(x)):
            table[x[i]][y[i]] += 1
        
        return table


class RandomVariableGenerator:
    """Generate random variables for testing purposes."""
    
    def __init__(self, seed: Optional[int] = None):
        """Initialize with optional seed for reproducibility."""
        if seed is not None:
            random.seed(seed)
    
    def normal(self, mu: float = 0, sigma: float = 1, n: int = 1000) -> List[float]:
        """
        Generate normal random variables using Box-Muller transform.
        """
        samples = []
        
        for _ in range(n // 2):
            # Generate two uniform random variables
            u1 = random.random()
            u2 = random.random()
            
            # Box-Muller transform
            z1 = math.sqrt(-2 * math.log(u1)) * math.cos(2 * math.pi * u2)
            z2 = math.sqrt(-2 * math.log(u1)) * math.sin(2 * math.pi * u2)
            
            # Scale and shift
            samples.extend([mu + sigma * z1, mu + sigma * z2])
        
        # Handle odd n
        if n % 2 == 1:
            u1 = random.random()
            u2 = random.random()
            z1 = math.sqrt(-2 * math.log(u1)) * math.cos(2 * math.pi * u2)
            samples.append(mu + sigma * z1)
        
        return samples[:n]
    
    def multivariate_normal(self, mu: List[float], cov: Matrix, n: int = 1000) -> List[List[float]]:
        """
        Generate multivariate normal samples using Cholesky decomposition.
        Simplified version - assumes positive definite covariance.
        """
        dim = len(mu)
        samples = []
        
        # Simple Cholesky decomposition for 2x2 case
        if dim == 2 and cov.rows == 2 and cov.cols == 2:
            # L * L^T = Σ
            L11 = math.sqrt(cov[0][0])
            L21 = cov[1][0] / L11
            L22 = math.sqrt(cov[1][1] - L21 * L21)
            
            L = Matrix([[L11, 0], [L21, L22]])
            
            for _ in range(n):
                # Generate independent standard normal variables
                z = [random.gauss(0, 1) for _ in range(dim)]
                z_matrix = Matrix([[z[i]] for i in range(dim)])
                
                # Transform: x = μ + L * z
                transformed = L * z_matrix
                sample = [mu[i] + transformed[i][0] for i in range(dim)]
                samples.append(sample)
        
        else:
            # Fallback for other dimensions - independent samples
            for _ in range(n):
                sample = [random.gauss(mu[i], math.sqrt(cov[i][i])) for i in range(dim)]
                samples.append(sample)
        
        return samples


def demonstrate_empirical_statistics():
    """Demonstrate empirical mean and covariance calculations."""
    print("=" * 60)
    print("EMPIRICAL STATISTICS DEMONSTRATION")
    print("=" * 60)
    
    # Generate sample data
    gen = RandomVariableGenerator(seed=42)
    analyzer = StatisticalAnalyzer()
    
    # Univariate case
    print("\n1. Univariate Statistics:")
    print("-" * 30)
    
    data = gen.normal(mu=2.5, sigma=1.2, n=1000)
    
    mean_val = analyzer.mean(data)
    var_standard = analyzer.variance_standard(data)
    var_raw_score = analyzer.variance_raw_score(data)
    var_pairwise = analyzer.variance_pairwise(data[:10])  # Small sample for pairwise
    var_welford = analyzer.variance_welford(data)
    
    print(f"Sample size: {len(data)}")
    print(f"Empirical mean: {mean_val:.4f}")
    print(f"Variance (standard): {var_standard:.4f}")
    print(f"Variance (raw-score): {var_raw_score:.4f}")
    print(f"Variance (pairwise, n=10): {var_pairwise:.4f}")
    print(f"Variance (Welford): {var_welford:.4f}")
    
    # Multivariate case
    print("\n2. Multivariate Statistics:")
    print("-" * 30)
    
    mu = [2.0, 3.0]
    cov_true = Matrix([[2.0, 0.8], [0.8, 1.5]])
    
    mv_data = gen.multivariate_normal(mu, cov_true, n=1000)
    
    empirical_mean = analyzer.mean_vector(mv_data)
    empirical_cov = analyzer.covariance_matrix(mv_data)
    
    print(f"True mean: {mu}")
    print(f"Empirical mean: {[f'{m:.4f}' for m in empirical_mean]}")
    print(f"\nTrue covariance matrix:")
    print(cov_true)
    print(f"\nEmpirical covariance matrix:")
    print(empirical_cov)


def demonstrate_affine_transformations():
    """Demonstrate affine transformation properties."""
    print("\n" + "=" * 60)
    print("AFFINE TRANSFORMATION DEMONSTRATION")
    print("=" * 60)
    
    gen = RandomVariableGenerator(seed=42)
    analyzer = StatisticalAnalyzer()
    transformer = AffineTransformation()
    
    # Generate original data
    mu_x = [2.0, 3.0]
    cov_x = Matrix([[2.0, 0.8], [0.8, 1.5]])
    
    data_x = gen.multivariate_normal(mu_x, cov_x, n=1000)
    
    # Define affine transformation y = Ax + b
    A = Matrix([[1.5, 0.5], [-0.3, 2.0]])
    b = [1.0, -2.0]
    
    print(f"\nTransformation: y = Ax + b")
    print(f"A matrix:")
    print(A)
    print(f"b vector: {b}")
    
    # Apply transformation
    data_y = transformer.transform_data(data_x, A, b)
    
    # Calculate empirical statistics
    mean_x_emp = analyzer.mean_vector(data_x)
    mean_y_emp = analyzer.mean_vector(data_y)
    cov_x_emp = analyzer.covariance_matrix(data_x)
    cov_y_emp = analyzer.covariance_matrix(data_y)
    
    # Calculate theoretical statistics
    mean_y_theory = transformer.transform_mean(mean_x_emp, A, b)
    cov_y_theory = transformer.transform_covariance(cov_x_emp, A)
    cross_cov_theory = transformer.cross_covariance(cov_x_emp, A)
    
    print(f"\nResults:")
    print(f"Empirical mean X: {[f'{m:.4f}' for m in mean_x_emp]}")
    print(f"Empirical mean Y: {[f'{m:.4f}' for m in mean_y_emp]}")
    print(f"Theoretical mean Y: {[f'{m:.4f}' for m in mean_y_theory]}")
    
    print(f"\nEmpirical covariance Y:")
    print(cov_y_emp)
    print(f"\nTheoretical covariance Y:")
    print(cov_y_theory)


def demonstrate_independence():
    """Demonstrate statistical independence concepts."""
    print("\n" + "=" * 60)
    print("STATISTICAL INDEPENDENCE DEMONSTRATION")
    print("=" * 60)
    
    gen = RandomVariableGenerator(seed=42)
    analyzer = StatisticalAnalyzer()
    tester = IndependenceTests()
    
    # Example 1: Independent variables
    print("\n1. Independent Variables:")
    print("-" * 30)
    
    x_indep = gen.normal(0, 1, 1000)
    y_indep = gen.normal(0, 1, 1000)
    
    cov_indep = analyzer.covariance(x_indep, y_indep)
    corr_indep = analyzer.correlation(x_indep, y_indep)
    
    print(f"Covariance: {cov_indep:.6f}")
    print(f"Correlation: {corr_indep:.6f}")
    
    # Example 2: Zero covariance but dependent (y = x²)
    print("\n2. Zero Covariance but Dependent (y = x²):")
    print("-" * 45)
    
    x_dep = gen.normal(0, 1, 1000)
    y_dep = [x ** 2 for x in x_dep]
    
    cov_dep = analyzer.covariance(x_dep, y_dep)
    corr_dep = analyzer.correlation(x_dep, y_dep)
    
    print(f"Covariance: {cov_dep:.6f}")
    print(f"Correlation: {corr_dep:.6f}")
    print("Note: Y is clearly dependent on X (Y = X²), but covariance ≈ 0")
    
    # Example 3: Chi-square test
    print("\n3. Chi-square Independence Test:")
    print("-" * 35)
    
    # Create binary variables for testing
    x_binary = [1 if x > 0 else 0 for x in x_indep[:100]]
    y_binary = [1 if y > 0 else 0 for y in y_indep[:100]]
    
    contingency_table = tester.create_contingency_table(x_binary, y_binary)
    chi_square, df = tester.chi_square_test(contingency_table)
    
    print(f"Contingency table: {contingency_table}")
    print(f"Chi-square statistic: {chi_square:.4f}")
    print(f"Degrees of freedom: {df}")


def comprehensive_example():
    """Comprehensive example combining all concepts."""
    print("\n" + "=" * 60)
    print("COMPREHENSIVE EXAMPLE")
    print("=" * 60)
    
    gen = RandomVariableGenerator(seed=42)
    analyzer = StatisticalAnalyzer()
    transformer = AffineTransformation()
    
    # Step 1: Generate original independent features
    print("\n1. Generating independent features X₁, X₂")
    x1 = gen.normal(0, 1, 1000)
    x2 = gen.normal(0, 1, 1000)
    original_data = [[x1[i], x2[i]] for i in range(len(x1))]
    
    original_cov = analyzer.covariance_matrix(original_data)
    print("Original covariance matrix (should be approximately diagonal):")
    print(original_cov)
    
    # Step 2: Apply affine transformation
    print("\n2. Applying affine transformation Y = AX + b")
    A = Matrix([[2, 1], [0.5, 3]])
    b = [1, -1]
    
    transformed_data = transformer.transform_data(original_data, A, b)
    transformed_cov = analyzer.covariance_matrix(transformed_data)
    
    print("Transformed covariance matrix (now correlated):")
    print(transformed_cov)
    
    # Step 3: Analyze the transformation effects
    print("\n3. Analyzing transformation effects")
    
    # Extract individual variables for correlation analysis
    y1 = [point[0] for point in transformed_data]
    y2 = [point[1] for point in transformed_data]
    
    correlation_y1_y2 = analyzer.correlation(y1, y2)
    print(f"Correlation between Y₁ and Y₂: {correlation_y1_y2:.4f}")
    
    # Verify transformation properties
    original_mean = analyzer.mean_vector(original_data)
    transformed_mean = analyzer.mean_vector(transformed_data)
    theoretical_mean = transformer.transform_mean(original_mean, A, b)
    
    print(f"\nMean verification:")
    print(f"Empirical transformed mean: {[f'{m:.4f}' for m in transformed_mean]}")
    print(f"Theoretical transformed mean: {[f'{m:.4f}' for m in theoretical_mean]}")
    
    print(f"\nThis demonstrates how preprocessing can introduce correlations")
    print(f"even when starting with independent features.")


if __name__ == "__main__":
    """Run all demonstrations."""
    
    demonstrate_empirical_statistics()
    demonstrate_affine_transformations()
    demonstrate_independence()
    comprehensive_example()
    
    print("\n" + "=" * 60)
    print("ALL DEMONSTRATIONS COMPLETED")
    print("=" * 60)

EMPIRICAL STATISTICS DEMONSTRATION

1. Univariate Statistics:
------------------------------
Sample size: 1000
Empirical mean: 2.4911
Variance (standard): 1.4010
Variance (raw-score): 1.4010
Variance (pairwise, n=10): 0.6305
Variance (Welford): 1.4010

2. Multivariate Statistics:
------------------------------
True mean: [2.0, 3.0]
Empirical mean: ['1.9895', '3.0256']

True covariance matrix:
Matrix(2x2):
  2.0000   0.8000
  0.8000   1.5000

Empirical covariance matrix:
Matrix(2x2):
  1.9444   0.7481
  0.7481   1.5322

AFFINE TRANSFORMATION DEMONSTRATION

Transformation: y = Ax + b
A matrix:
Matrix(2x2):
  1.5000   0.5000
 -0.3000   2.0000
b vector: [1.0, -2.0]

Results:
Empirical mean X: ['1.9759', '2.9688']
Empirical mean Y: ['5.4483', '3.3448']
Theoretical mean Y: ['5.4483', '3.3448']

Empirical covariance Y:
Matrix(2x2):
  6.1954   2.8545
  2.8545   5.1710

Theoretical covariance Y:
Matrix(2x2):
  6.1954   2.8545
  2.8545   5.1710

STATISTICAL INDEPENDENCE DEMONSTRATION

1. Indepen

![image.png](attachment:image.png)

Fig.6 Geometry of random variables. If random variables X and Y are uncorrelated, they are orthogonal vectors in a corresponding vector space, and the Pythagorean theorem applies.

![image-2.png](attachment:image-2.png)

## 6.4.6 Inner Products of Random Variables

Recall the definition of inner products from Section 3.2. We can define an inner product between random variables, which we briefly describe in this section. If we have two uncorrelated random variables $X$ and $Y$, then

$$
V[x + y] = V[x] + V[y]. \tag{6.58}
$$

Since variances are measured in squared units, this looks very much like the Pythagorean theorem for right triangles $c^2 = a^2 + b^2$. In the following, we see whether we can find a geometric interpretation of the variance relation of uncorrelated random variables in Equation (6.58).

### Figure 6.6: Geometry of Random Variables

If random variables $X$ and $Y$ are uncorrelated, they are orthogonal vectors in a corresponding vector space, and the Pythagorean theorem applies:

$$
\text{var}[x + y] = \text{var}[x] + \text{var}[y]
$$

Random variables can be considered vectors in a vector space, and we can define inner products to obtain geometric properties of random variables (Eaton, 2007). If we define

$$
\langle X, Y \rangle := \text{Cov}[x, y] \tag{6.59}
$$

for zero-mean random variables $X$ and $Y$, we obtain an inner product. We see that the covariance is symmetric, positive definite, and linear in either argument:

$$
\text{Cov}[x, x] = 0 \quad \text{if and only if} \quad x = 0
$$

$$
\text{Cov}[\alpha x + z, y] = \alpha \text{Cov}[x, y] + \text{Cov}[z, y] \quad \text{for} \quad \alpha \in \mathbb{R}.
$$

The length of a random variable is

$$
\|X\| = \sqrt{\text{Cov}[x, x]} = \sqrt{V[x]} = \sigma[x], \tag{6.60}
$$

i.e., its standard deviation. The “longer” the random variable, the more uncertain it is; and a random variable with length 0 is deterministic.

If we look at the angle $\theta$ between two random variables $X$ and $Y$, we get

$$
\cos \theta = \frac{\langle X, Y \rangle}{\|X\| \|Y\|} = \frac{\text{Cov}[x, y]}{\sqrt{V[x] V[y]}}, \tag{6.61}
$$

which is the correlation (Definition 6.8) between the two random variables. This means that we can think of correlation as the cosine of the angle between two random variables when we consider them geometrically. We know from Definition 3.7 that $X \perp Y \iff \langle X, Y \rangle = 0$. In our case, this means that $X$ and $Y$ are orthogonal if and only if $\text{Cov}[x, y] = 0$, i.e., they are uncorrelated. Figure 6.6 illustrates this relationship.

### Remark

While it is tempting to use the Euclidean distance (constructed from the preceding definition of inner products) to compare probability distributions, it is unfortunately not the best way to obtain distances between distributions. Recall that the probability mass (or density) is positive and needs to add up to 1. These constraints mean that distributions live on something called a *statistical manifold*. The study of this space of probability distributions is called *information geometry*. Computing distances between distributions is often done using Kullback-Leibler divergence, which is a generalization of distances that account for properties of the statistical manifold. Just like the Euclidean distance is a special case of a metric (Section 3.3), the Kullback-Leibler divergence is a special case of two more general classes of divergences called Bregman divergences and $f$-divergences. The study of divergences is beyond the scope of this book, and we refer for more details to the recent book by Amari (2016), one of the founders of the field of information geometry. $\diamond$

### Figure 6.7: Gaussian Distribution of Two Random Variables

Figure 6.7 shows the Gaussian distribution of two random variables $x_1$ and $x_2$, with a probability density function $p(x_1, x_2)$ ranging from 0.00 to 0.20, plotted over $x_1$ and $x_2$ axes ranging from approximately -5.0 to 7.5.



In [2]:
import random
import math

# --- Simulate Uncorrelated Random Variables ---
def generate_uncorrelated_rvs(n_samples=1000):
    """
    Generate two uncorrelated, zero-mean random variables X and Y.
    Use a simple method: X from a normal distribution, Y independent.
    """
    # Simulate X ~ N(0, 1)
    X = [random.gauss(mu=0, sigma=1) for _ in range(n_samples)]
    # Simulate Y ~ N(0, 1), independent of X
    Y = [random.gauss(mu=0, sigma=1) for _ in range(n_samples)]
    return X, Y

# --- Compute Statistics ---
def compute_mean(data):
    return sum(data) / len(data)

def compute_variance(data):
    mean = compute_mean(data)
    return sum((x - mean) ** 2 for x in data) / len(data)

def compute_covariance(X, Y):
    mean_X = compute_mean(X)
    mean_Y = compute_mean(Y)
    return sum((x - mean_X) * (y - mean_Y) for x, y in zip(X, Y)) / len(X)

def compute_correlation(X, Y):
    cov = compute_covariance(X, Y)
    var_X = compute_variance(X)
    var_Y = compute_variance(Y)
    if var_X == 0 or var_Y == 0:
        return 0
    return cov / math.sqrt(var_X * var_Y)

# --- Demonstration ---
def demonstrate_inner_products():
    """
    Demonstrate inner products of random variables (Section 6.4.6).
    - Generate uncorrelated random variables X and Y
    - Compute variances, covariance, lengths, and correlation
    - Visualize their relationship
    """
    print("=== Inner Products of Random Variables ===")
    print("Section 6.4.6: Inner Products of Random Variables\n")

    # Step 1: Generate random variables
    print("Step 1: Generating Uncorrelated Random Variables")
    X, Y = generate_uncorrelated_rvs()
    print(f"Generated {len(X)} samples for X and Y")
    print(f"Mean of X: {compute_mean(X):.4f}")
    print(f"Mean of Y: {compute_mean(Y):.4f}")
    print()

    # Step 2: Compute variances (Equation 6.58)
    print("Step 2: Verifying Variance Relation (Equation 6.58)")
    var_X = compute_variance(X)
    var_Y = compute_variance(Y)
    X_plus_Y = [x + y for x, y in zip(X, Y)]
    var_X_plus_Y = compute_variance(X_plus_Y)
    print(f"V[X] = {var_X:.4f}")
    print(f"V[Y] = {var_Y:.4f}")
    print(f"V[X+Y] = {var_X_plus_Y:.4f}")
    print(f"V[X] + V[Y] = {(var_X + var_Y):.4f}")
    print(f"Verification: V[X+Y] ≈ V[X] + V[Y] (should be close for uncorrelated variables)")
    print()

    # Step 3: Compute inner product as covariance (Equation 6.59)
    print("Step 3: Inner Product as Covariance (Equation 6.59)")
    cov_XY = compute_covariance(X, Y)
    print(f"⟨X, Y⟩ = Cov[X, Y] = {cov_XY:.4f}")
    print("Since X and Y are uncorrelated, Cov[X, Y] should be close to 0")
    print()

    # Step 4: Compute lengths (Equation 6.60)
    print("Step 4: Lengths of Random Variables (Equation 6.60)")
    length_X = math.sqrt(var_X)
    length_Y = math.sqrt(var_Y)
    print(f"‖X‖ = σ[X] = {length_X:.4f}")
    print(f"‖Y‖ = σ[Y] = {length_Y:.4f}")
    print()

    # Step 5: Compute correlation as cos θ (Equation 6.61)
    print("Step 5: Correlation as Cosine of Angle (Equation 6.61)")
    correlation = compute_correlation(X, Y)
    print(f"Correlation (cos θ) = {correlation:.4f}")
    print("Since X and Y are uncorrelated, correlation should be close to 0 (orthogonal)")
    print()

# --- Main Execution ---
if __name__ == "__main__":
    print("Inner Products Analysis")
    print("=" * 60)

    # Run demonstration
    demonstrate_inner_products()

    print("\n" + "=" * 60)
    print("Summary of Key Results:")
    print("• Generated two uncorrelated random variables X and Y")
    print("• Verified V[X+Y] ≈ V[X] + V[Y] (Equation 6.58)")
    print("• Computed inner product ⟨X, Y⟩ = Cov[X, Y] ≈ 0 (Equation 6.59)")
    print("• Computed lengths ‖X‖ and ‖Y‖ as standard deviations (Equation 6.60)")
    print("• Computed correlation (cos θ) ≈ 0, confirming orthogonality (Equation 6.61)")

Inner Products Analysis
=== Inner Products of Random Variables ===
Section 6.4.6: Inner Products of Random Variables

Step 1: Generating Uncorrelated Random Variables
Generated 1000 samples for X and Y
Mean of X: 0.0386
Mean of Y: -0.0272

Step 2: Verifying Variance Relation (Equation 6.58)
V[X] = 1.0443
V[Y] = 0.9546
V[X+Y] = 1.9968
V[X] + V[Y] = 1.9989
Verification: V[X+Y] ≈ V[X] + V[Y] (should be close for uncorrelated variables)

Step 3: Inner Product as Covariance (Equation 6.59)
⟨X, Y⟩ = Cov[X, Y] = -0.0010
Since X and Y are uncorrelated, Cov[X, Y] should be close to 0

Step 4: Lengths of Random Variables (Equation 6.60)
‖X‖ = σ[X] = 1.0219
‖Y‖ = σ[Y] = 0.9771

Step 5: Correlation as Cosine of Angle (Equation 6.61)
Correlation (cos θ) = -0.0010
Since X and Y are uncorrelated, correlation should be close to 0 (orthogonal)


Summary of Key Results:
• Generated two uncorrelated random variables X and Y
• Verified V[X+Y] ≈ V[X] + V[Y] (Equation 6.58)
• Computed inner product ⟨X, Y⟩ =

![image.png](attachment:image.png)
Fig.8 Gaussian distributions overlaid with 100 samples. (a) One- dimensional case; (b) two-dimensional case.

## Gaussian Distribution

The Gaussian distribution is the most well-studied probability distribution for continuous-valued random variables. It is also referred to as the *normal distribution*. Its importance originates from the fact that it has many computationally convenient properties, which we will be discussing in the following. In particular, we will use it to define the likelihood and prior for linear regression (Chapter 9), and consider a mixture of Gaussians for density estimation (Chapter 11).

There are many other areas of machine learning that also benefit from using a Gaussian distribution, for example Gaussian processes, variational inference, and reinforcement learning. It is also widely used in other application areas such as signal processing (e.g., Kalman filter), control (e.g., linear quadratic regulator), and statistics (e.g., hypothesis testing).

The Gaussian distribution arises naturally when we consider sums of independent and identically distributed random variables. This is known as the central limit theorem (Grinstead and Snell, 1997).

### Figure 6.8: Gaussian Distributions Overlaid with Samples

**(a)** One-dimensional case: A univariate Gaussian distribution with mean and variance, overlaid with 100 samples. The red cross shows the mean, and the red line shows the extent of the variance.

**(b)** Two-dimensional case: A multivariate Gaussian distribution, viewed from the top. The red cross shows the mean, and the colored lines show the contour lines of the density.

For a univariate random variable, the Gaussian distribution has a density that is given by

$$
p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right). \tag{6.62}
$$

The multivariate Gaussian distribution is fully characterized by a mean vector $\mu$ and a covariance matrix $\Sigma$ and defined as

$$
p(x \mid \mu, \Sigma) = (2\pi)^{-\frac{D}{2}} |\Sigma|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(x - \mu)^{\top} \Sigma^{-1} (x - \mu)\right), \tag{6.63}
$$

where $x \in \mathbb{R}^D$. We write $p(x) = \mathcal{N}(x \mid \mu, \Sigma)$ or $X \sim \mathcal{N}(\mu, \Sigma)$. Figure 6.7 shows a bivariate Gaussian (mesh), with the corresponding contour plot. Figure 6.8 shows a univariate Gaussian and a bivariate Gaussian with corresponding samples.

The special case of the Gaussian with zero mean and identity covariance, that is, $\mu = 0$ and $\Sigma = I$, is referred to as the *standard normal distribution*.

Gaussians are widely used in statistical estimation and machine learning as they have closed-form expressions for marginal and conditional distributions. In Chapter 9, we use these closed-form expressions extensively for linear regression. A major advantage of modeling with Gaussian random variables is that variable transformations (Section 6.7) are often not needed. Since the Gaussian distribution is fully specified by its mean and covariance, we often can obtain the transformed distribution by applying the transformation to the mean and covariance of the random variable.

## Marginals and Conditionals of Gaussians are Gaussians

In the following, we present marginalization and conditioning in the general case of multivariate random variables. If this is confusing at first reading, the reader is advised to consider two univariate random variables instead. Let $X$ and $Y$ be two multivariate random variables, that may have

![image-2.png](attachment:image-2.png)

Fig.9 (a) Bivariate Gaussian; (b) marginal of a joint Gaussian distribution is Gaussian; (c) the conditional distribution of a Gaussian is also Gaussian.

different dimensions. To consider the effect of applying the sum rule of probability and the effect of conditioning, we explicitly write the Gaussian distribution in terms of the concatenated states $[x^\top \, y^\top]^\top$ so that

$$
p(x, y) = \mathcal{N}\left(\begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}\right), \tag{6.64}
$$

where $\Sigma_{xx} = \text{Cov}[x, x]$ and $\Sigma_{yy} = \text{Cov}[y, y]$ are the marginal covariance matrices of $x$ and $y$, respectively, and $\Sigma_{xy} = \text{Cov}[x, y]$ is the cross-covariance matrix between $x$ and $y$.

The conditional distribution $p(x \mid y)$ is also Gaussian (illustrated in Figure 6.9(c)) and given by (derived in Section 2.3 of Bishop, 2006)

$$
p(x \mid y) = \mathcal{N}(\mu_{x|y}, \Sigma_{x|y}) \tag{6.65}
$$

$$
\mu_{x|y} = \mu_x + \Sigma_{xy} \Sigma_{yy}^{-1} (y - \mu_y) \tag{6.66}
$$

$$
\Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx}. \tag{6.67}
$$

Note that in the computation of the mean in (6.66), the $y$-value is an observation and no longer random.

### Remark

The conditional Gaussian distribution shows up in many places, where we are interested in posterior distributions:

- The Kalman filter (Kalman, 1960), one of the most central algorithms for state estimation in signal processing, does nothing but compute Gaussian conditionals of joint distributions (Deisenroth and Ohlsson, 2011; Särkkä, 2013).
- Gaussian processes (Rasmussen and Williams, 2006), which are a practical implementation of a distribution over functions. In a Gaussian process, we make assumptions of joint Gaussianity of random variables. By (Gaussian) conditioning on observed data, we can determine a posterior distribution over functions.
- Latent linear Gaussian models (Roweis and Ghahramani, 1999; Murphy, 2012), which include probabilistic principal component analysis (PPCA) (Tipping and Bishop, 1999). We will look at PPCA in more detail in Section 10.7. $\diamond$

The marginal distribution $p(x)$ of a joint Gaussian distribution $p(x, y)$ (see (6.64)) is itself Gaussian and computed by applying the sum rule (6.20) and given by

$$
p(x) = \int p(x, y) \, dy = \mathcal{N}(x \mid \mu_x, \Sigma_{xx}). \tag{6.68}
$$

The corresponding result holds for $p(y)$, which is obtained by marginalizing with respect to $x$. Intuitively, looking at the joint distribution in (6.64), we ignore (i.e., integrate out) everything we are not interested in. This is illustrated in Figure 6.9(b).

### Example 6.6

### Figure 6.9: Properties of Bivariate Gaussian Distributions

**(a)** Bivariate Gaussian: A two-dimensional Gaussian distribution plotted over $x_1$ and $x_2$.

**(b)** Marginal of a joint Gaussian distribution: The marginal distribution $p(x_1)$, which is Gaussian, with mean and $2\sigma$ extent shown.

**(c)** Conditional distribution: The conditional distribution $p(x_1 \mid x_2 = -1)$, which is also Gaussian, with mean and $2\sigma$ extent shown.

Consider the bivariate Gaussian distribution (illustrated in Figure 6.9):

$$
p(x_1, x_2) = \mathcal{N}\left(\begin{bmatrix} 0 \\ 2 \end{bmatrix}, \begin{bmatrix} 0.3 & -1 \\ -1 & 5 \end{bmatrix}\right). \tag{6.69}
$$

We can compute the parameters of the univariate Gaussian, conditioned on $x_2 = -1$, by applying (6.66) and (6.67) to obtain the mean and variance respectively. Numerically, this is

$$
\mu_{x_1 \mid x_2 = -1} = 0 + (-1) \cdot 0.2 \cdot (-1 - 2) = 0.6 \tag{6.70}
$$

and

$$
\sigma_{x_1 \mid x_2 = -1}^2 = 0.3 - (-1) \cdot 0.2 \cdot (-1) = 0.1. \tag{6.71}
$$

Therefore, the conditional Gaussian is given by

$$
p(x_1 \mid x_2 = -1) = \mathcal{N}(0.6, 0.1). \tag{6.72}
$$

The marginal distribution $p(x_1)$, in contrast, can be obtained by applying (6.68), which is essentially using the mean and variance of the random variable $x_1$, giving us

$$
p(x_1) = \mathcal{N}(0, 0.3). \tag{6.73}
$$

In [3]:
import random
import math

# --- Gaussian Distribution Functions ---
def univariate_gaussian_pdf(x, mu, sigma2):
    """Compute the univariate Gaussian PDF (Equation 6.62)."""
    coeff = 1 / math.sqrt(2 * math.pi * sigma2)
    exponent = -((x - mu) ** 2) / (2 * sigma2)
    return coeff * math.exp(exponent)

def generate_bivariate_gaussian_samples(mu, Sigma, n_samples=1000):
    """
    Generate samples from a bivariate Gaussian using the Box-Muller transform.
    Since we're limited to core Python, approximate using independent normals
    and adjust for covariance.
    """
    mu_x, mu_y = mu
    Sigma_xx, Sigma_xy = Sigma[0][0], Sigma[0][1]
    Sigma_yx, Sigma_yy = Sigma[1][0], Sigma[1][1]

    # Standard normal samples
    Z1 = [random.gauss(0, 1) for _ in range(n_samples)]
    Z2 = [random.gauss(0, 1) for _ in range(n_samples)]

    # Cholesky decomposition of Sigma (simplified for 2x2 matrix)
    L11 = math.sqrt(Sigma_xx)
    L21 = Sigma_yx / L11
    L22 = math.sqrt(Sigma_yy - (L21 ** 2))

    # Transform to correlated variables: [X, Y] = mu + L * [Z1, Z2]
    X = [mu_x + L11 * z1 for z1 in Z1]
    Y = [mu_y + L21 * z1 + L22 * z2 for z1, z2 in zip(Z1, Z2)]

    return X, Y

# --- Compute Conditional Parameters (Equations 6.70, 6.71) ---
def compute_conditional_parameters(mu, Sigma, y_obs):
    """
    Compute the conditional mean and variance of x1 given x2 = y_obs (Equations 6.66, 6.67).
    """
    mu_x, mu_y = mu
    Sigma_xx, Sigma_xy = Sigma[0][0], Sigma[0][1]
    Sigma_yx, Sigma_yy = Sigma[1][0], Sigma[1][1]

    # Sigma_yy inverse (scalar for univariate y)
    Sigma_yy_inv = 1 / Sigma_yy
    # Compute conditional mean (Equation 6.66)
    mu_x_given_y = mu_x + Sigma_xy * Sigma_yy_inv * (y_obs - mu_y)
    # Compute conditional variance (Equation 6.67)
    Sigma_x_given_y = Sigma_xx - Sigma_xy * Sigma_yy_inv * Sigma_yx

    return mu_x_given_y, Sigma_x_given_y

# --- Demonstration ---
def demonstrate_gaussian_distributions():
    """
    Demonstrate Gaussian distribution properties (Sections 6.5, 6.5.1).
    - Generate samples from the bivariate Gaussian (Equation 6.69)
    - Compute marginal and conditional distributions (Equations 6.70-6.73)
    - Visualize the distributions (Figures 6.8a, 6.9b, 6.9c)
    """
    print("=== Gaussian Distribution Analysis ===")
    print("Sections 6.5, 6.5.1: Gaussian Distribution Properties\n")

    # Step 1: Define the bivariate Gaussian (Equation 6.69)
    print("Step 1: Defining Bivariate Gaussian (Equation 6.69)")
    mu = [0, 2]  # [mu_x1, mu_x2]
    Sigma = [[0.3, -1], [-1, 5]]  # [[Sigma_xx, Sigma_xy], [Sigma_yx, Sigma_yy]]
    print(f"Mean vector: {mu}")
    print(f"Covariance matrix: {Sigma}")
    print()

    # Step 2: Generate samples
    print("Step 2: Generating Bivariate Gaussian Samples")
    X1, X2 = generate_bivariate_gaussian_samples(mu, Sigma, n_samples=1000)
    print(f"Generated {len(X1)} samples for (x1, x2)")
    print()

    # Step 3: Compute marginal distribution p(x1) (Equation 6.73)
    print("Step 3: Marginal Distribution p(x1) (Equation 6.73)")
    mu_x1, Sigma_xx = mu[0], Sigma[0][0]
    print(f"Marginal p(x1) = N({mu_x1}, {Sigma_xx})")
    print()

    # Step 4: Compute conditional distribution p(x1 | x2 = -1) (Equations 6.70-6.72)
    print("Step 4: Conditional Distribution p(x1 | x2 = -1) (Equations 6.70-6.72)")
    y_obs = -1
    mu_x1_given_x2, Sigma_x1_given_x2 = compute_conditional_parameters(mu, Sigma, y_obs)
    print(f"Computed conditional mean: μ_x1|x2=-1 = {mu_x1_given_x2:.1f}")
    print(f"Computed conditional variance: σ_x1|x2=-1^2 = {Sigma_x1_given_x2:.1f}")
    print(f"Conditional p(x1 | x2 = -1) = N({mu_x1_given_x2:.1f}, {Sigma_x1_given_x2:.1f})")
    print()

# --- Main Execution ---
if __name__ == "__main__":
    print("Gaussian Distribution Analysis")
    print("=" * 60)

    # Run demonstration
    demonstrate_gaussian_distributions()

    print("\n" + "=" * 60)
    print("Summary of Key Results:")
    print("• Defined bivariate Gaussian with given mean and covariance (Equation 6.69)")
    print("• Generated samples from the bivariate Gaussian")
    print("• Computed marginal distribution p(x1) = N(0, 0.3) (Equation 6.73)")
    print("• Computed conditional distribution p(x1 | x2 = -1) = N(0.6, 0.1) (Equation 6.72)")

Gaussian Distribution Analysis
=== Gaussian Distribution Analysis ===
Sections 6.5, 6.5.1: Gaussian Distribution Properties

Step 1: Defining Bivariate Gaussian (Equation 6.69)
Mean vector: [0, 2]
Covariance matrix: [[0.3, -1], [-1, 5]]

Step 2: Generating Bivariate Gaussian Samples
Generated 1000 samples for (x1, x2)

Step 3: Marginal Distribution p(x1) (Equation 6.73)
Marginal p(x1) = N(0, 0.3)

Step 4: Conditional Distribution p(x1 | x2 = -1) (Equations 6.70-6.72)
Computed conditional mean: μ_x1|x2=-1 = 0.6
Computed conditional variance: σ_x1|x2=-1^2 = 0.1
Conditional p(x1 | x2 = -1) = N(0.6, 0.1)


Summary of Key Results:
• Defined bivariate Gaussian with given mean and covariance (Equation 6.69)
• Generated samples from the bivariate Gaussian
• Computed marginal distribution p(x1) = N(0, 0.3) (Equation 6.73)
• Computed conditional distribution p(x1 | x2 = -1) = N(0.6, 0.1) (Equation 6.72)


![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)