# 7. Mean and Variance

# Definitions of Mean and Variance

## Mean
The **mean** of a dataset is the sum of all the elements divided by the total number of elements.  
It is also referred to as the "average."

$$
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
$$

Where:
- $\mu$ is the mean
- $N$ is the number of elements in the dataset
- $x_i$ is each individual element in the dataset


## Variance
The **variance** measures the spread of the data points from the mean.  
It is the average of the squared differences between each data point and the mean.

$$
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
$$

Where:
- $\sigma^2$ is the variance
- $\mu$ is the mean
- $N$ is the number of elements in the dataset
- $x_i$ is each individual element in the dataset
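
As a quick check of the two definitions, the sketch below applies them to a small hand-picked dataset and compares the results with NumPy's built-ins. The values in `values` are illustrative assumptions, not data from this notebook. Note that `np.var` divides by $N$ by default (`ddof=0`), which matches the formula above, while `ddof=1` gives the sample variance that divides by $N - 1$.

```python
import numpy as np

# Illustrative dataset (assumed values, chosen so the results are round numbers)
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(values)

# Mean and variance computed directly from the definitions
mu = values.sum() / n
sigma2 = ((values - mu) ** 2).sum() / n

print(mu, sigma2)                        # 5.0 4.0
print(np.mean(values), np.var(values))   # same results: np.var uses ddof=0 by default
print(np.var(values, ddof=1))            # sample variance, divides by N - 1
```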


In [None]:
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
np.set_printoptions(suppress=True, precision=3)

In [None]:
# Generate random 1D dataset
np.random.seed(42)  # For reproducibility
data = np.random.randn(5)  # 5 random points from a standard normal distribution
print(data)
# Calculate mean using definition
mean = sum(data) / len(data)

# Calculate variance using definition
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Output the results
print(f"Mean: {mean}")
print(f"Variance: {variance}")

## 7.2 2D Mean and Covariance
### Mean
For a 2D dataset, the **mean** is calculated separately for each dimension.  
If the dataset consists of points $(x_i, y_i)$, the mean for each dimension is defined as:

$$
\mu_x = \frac{1}{N} \sum_{i=1}^{N} x_i
$$

$$
\mu_y = \frac{1}{N} \sum_{i=1}^{N} y_i
$$


### Variance
The **variance** for a 2D dataset is also calculated separately for each dimension.  
The variance for each dimension is defined as:

$$
\sigma_x^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_x)^2
$$

$$
\sigma_y^2 = \frac{1}{N} \sum_{i=1}^{N} (y_i - \mu_y)^2
$$
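
When the points are stored as rows of an array, the per-dimension formulas above correspond to reducing along the row axis. The following is a minimal sketch with a made-up three-point dataset (`points` is an assumption for illustration); `axis=0` collapses the rows and leaves one value per column.

```python
import numpy as np

# Illustrative 2D dataset: each row is one point (x_i, y_i)
points = np.array([[1.0, 2.0],
                   [2.0, 4.0],
                   [3.0, 9.0]])

# axis=0 reduces over the rows, giving one value per dimension
mean_xy = points.mean(axis=0)   # [mu_x, mu_y]
var_xy = points.var(axis=0)     # [sigma_x^2, sigma_y^2], dividing by N

print(mean_xy)
print(var_xy)
```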



### Covariance
The **covariance** measures the relationship between two dimensions, $x$ and $y$.  
It is defined as the average of the product of their deviations from their respective means:

$$
\text{Cov}(x, y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)
$$

Where:
- $\mu_x$ and $\mu_y$ are the means of the $x$ and $y$ dimensions
- $x_i$ and $y_i$ are the coordinates of the $i$th point
- $\text{Cov}(x, y)$ represents the covariance between $x$ and $y$
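
The same definition can be written in matrix form: subtract the mean from every point, then average the outer products of the centered rows. The sketch below assumes a randomly generated dataset (the seed and size are arbitrary choices) and collects both variances and the covariance into one $2 \times 2$ matrix.

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed, for reproducibility only
points = rng.normal(size=(200, 2))       # rows are (x_i, y_i)

# Subtract (mu_x, mu_y) from every row
centered = points - points.mean(axis=0)

# (1/N) * sum of outer products of the centered rows
cov_matrix = centered.T @ centered / len(points)

# Diagonal entries are the variances, off-diagonal entries are Cov(x, y)
cov_xy = cov_matrix[0, 1]
print(cov_matrix)
```

The result should agree with `np.cov(points, rowvar=False, ddof=0)`, which uses the same $1/N$ normalization.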

---

## Notes
- Variance is a special case of covariance where the two dimensions are the same.
- Covariance indicates how $x$ and $y$ vary together. A positive value means $y$ tends to increase as $x$ increases, a negative value means $y$ tends to decrease as $x$ increases, and a value near zero means there is no linear relationship between them.
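
Both notes can be checked numerically. The sketch below defines a small helper, `covariance` (a hypothetical name introduced here, not part of NumPy), and applies it to toy data in which one series rises with $x$ and another falls.

```python
import numpy as np

def covariance(a, b):
    """Population covariance (1/N) of two equal-length 1D arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.mean((a - a.mean()) * (b - b.mean()))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + 1       # rises as x rises
y_down = -3 * x + 10   # falls as x rises

print(covariance(x, x), np.var(x))   # covariance of x with itself equals its variance
print(covariance(x, y_up))           # positive
print(covariance(x, y_down))         # negative
```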


In [None]:
# Generate random 2D dataset
np.random.seed(42)  # For reproducibility
data = np.random.randn(100, 2)  # 100 random points in 2D (x, y)

# Calculate mean for each dimension
mean_x = sum(data[:, 0]) / len(data[:, 0])
mean_y = sum(data[:, 1]) / len(data[:, 1])

# Calculate variance for each dimension
variance_x = sum((x - mean_x) ** 2 for x in data[:, 0]) / len(data[:, 0])
variance_y = sum((y - mean_y) ** 2 for y in data[:, 1]) / len(data[:, 1])

# Calculate covariance between x and y
covariance_xy = sum((x - mean_x) * (y - mean_y) for x, y in data) / len(data)

# Output results
print(f"Mean (x): {mean_x}, Mean (y): {mean_y}")
print(f"Variance (x): {variance_x}, Variance (y): {variance_y}")
print(f"Covariance (x, y): {covariance_xy}")
print(f"Covariance matrix:\n{np.cov(data, rowvar=False, ddof=0)}")


# 8. Gaussian (Normal) Distribution

The **Gaussian distribution**, or **Normal distribution**, is a bell-shaped probability distribution widely used in statistics and machine learning.

## Key Concepts

### Probability Density Function (PDF)
- The Gaussian PDF is given by:
$$
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}
$$
- $\mu$: Mean
- $\sigma$: Standard deviation
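- The PDF gives the relative likelihood (density) of values near $x$. For a continuous variable, the probability of falling in an interval is the area under the PDF over that interval, not the value of the PDF at a single point.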

### Parameters
- Mean ($\mu$): The center of the distribution.
- Standard Deviation ($\sigma$): Measures the spread or width of the distribution.
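
Two properties of the PDF follow from the formula and are easy to verify numerically: the total area under the curve is 1, and the peak height at $x = \mu$ is $1/\sqrt{2\pi\sigma^2}$. The sketch below uses a plain Riemann sum for the area and assumes illustrative parameters $\mu = 1$, $\sigma = 2$; `gaussian_pdf` is a helper name introduced here.

```python
import math
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density with mean mu and standard deviation sigma."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

mu, sigma = 1.0, 2.0   # illustrative parameters

# Riemann-sum approximation of the area under the PDF over mu +/- 8 sigma
x = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 20001)
dx = x[1] - x[0]
total_area = np.sum(gaussian_pdf(x, mu, sigma)) * dx

# Height of the curve at its center
peak = gaussian_pdf(mu, mu, sigma)

# Probability of landing within one standard deviation of the mean (exact, via erf)
within_one_sigma = math.erf(1 / math.sqrt(2))

print(total_area, peak, within_one_sigma)
```

A larger $\sigma$ lowers the peak and widens the curve, while the total area stays at 1.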

---
In this chapter, we:
1. Generate and visualize a 1D Gaussian distribution.
2. Extend the concept to a 2D Gaussian distribution.
3. Explore covariance matrices and eigendecomposition.

### 8.1 1D Gaussian Distribution

- Generate random samples from a Gaussian distribution.
- Visualize the histogram and overlay the theoretical PDF.

In [None]:
# Generate 1D Gaussian distribution
mean = 0  # Mean of the distribution
std_dev = 1  # Standard deviation of the distribution
num_samples = 1000  # Number of data points

# Generate random samples
data_1d = np.random.normal(mean, std_dev, num_samples)

# Plot the Gaussian distribution
plt.figure(figsize=(10, 6))
plt.hist(data_1d, bins=30, density=True, alpha=0.7, label='Histogram')

# Plot the theoretical PDF
x = np.linspace(mean - 4 * std_dev, mean + 4 * std_dev, 1000)
pdf = (1 / (np.sqrt(2 * np.pi) * std_dev)) * np.exp(-0.5 * ((x - mean) / std_dev) ** 2)
plt.plot(x, pdf, label='Theoretical PDF', color='red')

plt.title("1D Gaussian Distribution")
plt.xlabel("Value")
plt.ylabel("Density")
plt.legend()
plt.grid(True)
plt.show()

### 8.2 2D Gaussian Distribution

- Generate random samples from a 2D Gaussian distribution.
- Plot a scatter of the sampled data alongside a surface plot of the theoretical PDF.
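
For reference, the 2D Gaussian density with mean vector $\boldsymbol{\mu}$ and covariance matrix $\Sigma$ is

$$
f(\mathbf{x}) = \frac{1}{2\pi\sqrt{\det\Sigma}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)
$$

Note that the exponent uses the inverse covariance matrix $\Sigma^{-1}$, not $\Sigma$ itself. A minimal sketch of this formula follows; `gaussian_pdf_2d` is a helper name introduced here, and the parameter values mirror the ones used in this section.

```python
import numpy as np

def gaussian_pdf_2d(points, mean, cov):
    """Evaluate the 2D Gaussian density at each row of `points` (shape (M, 2))."""
    points = np.atleast_2d(np.asarray(points, dtype=float))
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)

    diff = points - mean
    cov_inv = np.linalg.inv(cov)

    # Quadratic form diff^T Sigma^{-1} diff, one value per point
    quad = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    norm = 2 * np.pi * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

mean = [0, 0]
cov = [[1, 0.5], [0.5, 1]]

# Density at the mean, along the correlated direction, and across it
density = gaussian_pdf_2d([[0, 0], [1, 1], [1, -1]], mean, cov)
print(density)
```

Because the covariance between the two dimensions is positive, the density at $(1, 1)$ is higher than at $(1, -1)$ even though both points are the same distance from the mean.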

In [None]:

# Parameters for the 2D Gaussian distribution
mean = [0, 0]  # Mean for x and y
cov = [[1, 0.5], [0.5, 1]]  # Covariance matrix
num_samples = 1000  # Number of data points

# Generate 2D Gaussian distribution samples
data = np.random.multivariate_normal(mean, cov, num_samples)

# Create a grid for the theoretical PDF
x = np.linspace(-4, 4, 100)
y = np.linspace(-4, 4, 100)
X, Y = np.meshgrid(x, y)

# Compute the 2D Gaussian PDF; the exponent uses the inverse covariance matrix
cov_inv = np.linalg.inv(cov)
Z = (1 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))) * \
    np.exp(-0.5 * (cov_inv[0, 0] * (X - mean[0])**2 +
                   cov_inv[1, 1] * (Y - mean[1])**2 +
                   2 * cov_inv[0, 1] * (X - mean[0]) * (Y - mean[1])))

# Plot the 2D Gaussian distribution
fig = plt.figure(figsize=(12, 6))

# Scatter plot of the sampled data
ax1 = fig.add_subplot(121)
ax1.scatter(data[:, 0], data[:, 1], alpha=0.5, label='Samples', s=10)
ax1.set_title("2D Gaussian Samples")
ax1.set_xlabel("X")
ax1.set_ylabel("Y")
ax1.grid(True)
ax1.legend()

# Surface plot of the theoretical PDF
ax2 = fig.add_subplot(122, projection='3d')
ax2.plot_surface(X, Y, Z, cmap='viridis', edgecolor='k', alpha=0.7)
ax2.set_title("2D Gaussian PDF")
ax2.set_xlabel("X")
ax2.set_ylabel("Y")
ax2.set_zlabel("Density")

plt.tight_layout()
plt.show()


### 8.3 Covariance Matrix and Eigendecomposition

- The covariance matrix describes the spread of data in 2D space.
- Eigendecomposition of the covariance matrix gives:
  - Eigenvalues: Variances along principal directions.
  - Eigenvectors: Principal directions.
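
The defining relations can be checked directly: each eigenvector $\mathbf{v}$ satisfies $\Sigma \mathbf{v} = \lambda \mathbf{v}$, and the matrix can be rebuilt as $\Sigma = V \Lambda V^\top$. The sketch below uses `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order; this is a swapped-in alternative to `np.linalg.eig`, whose ordering is not guaranteed.

```python
import numpy as np

cov = np.array([[1.0, 0.5],
                [0.5, 1.0]])

# eigh is intended for symmetric matrices; eigenvalues come back in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Each column v of `eigenvectors` satisfies cov @ v = lambda * v
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(cov @ v, lam * v)

# Rebuild the covariance matrix: Sigma = V diag(lambda) V^T
reconstructed = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T

print(eigenvalues)
print(reconstructed)
```

For this matrix the principal directions lie along the diagonals $y = x$ and $y = -x$, with the larger variance along $y = x$.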

In [None]:
# Eigendecomposition of the covariance matrix (eigenvectors are returned as columns)
eigenvalues, eigenvectors = np.linalg.eig(cov)
print("Eigenvalues:", eigenvalues)
print("Rotation Matrix (Eigenvectors):")
print(eigenvectors)
# Angle (in degrees) between the first eigenvector and the x-axis
angle_deg = np.degrees(np.arctan2(eigenvectors[1, 0], eigenvectors[0, 0]))
print(f"Rotation Angle: {angle_deg:.2f} degrees")

# Scale each eigenvector (column) by the square root of its eigenvalue,
# so each arrow's length is the standard deviation along that principal direction
principal_axes = eigenvectors * np.sqrt(eigenvalues)

# Plot 2D Gaussian with principal axes using green and red for arrows
fig, ax = plt.subplots(figsize=(8, 8))

# Scatter plot of the sampled data
ax.scatter(data[:, 0], data[:, 1], alpha=0.3, label='Samples', s=10)

# Plot the principal axes with specified colors
origin = [mean[0]], [mean[1]]  # Origin point
colors = ['green', 'red']
for i in range(len(eigenvalues)):
    vector = principal_axes[:, i]
    ax.quiver(*origin, *vector, angles='xy', scale_units='xy', scale=1, color=colors[i], 
              label=f'Principal Axis {i+1} (λ={eigenvalues[i]:.2f})')

# Add labels and grid
ax.set_title("2D Gaussian with Principal Axes (Colored)")
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.grid(True)
ax.axis('equal')
ax.legend()
plt.show()
