# Simple 2D PCA Example - Complete Walkthrough

## Introduction

In this notebook, we'll perform PCA step-by-step on a simple 2D dataset. You'll see every calculation and understand exactly what PCA does.

### What You'll Learn
1. Complete manual PCA calculation from scratch
2. Geometric interpretation of PCA transformation
3. How to project data onto principal components
4. How to reconstruct original data from PCA
5. Understanding information preservation

### The Process
```
Original Data → Center Data → Covariance → Eigenvectors → Project → Reconstruct
```

In [None]:
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import FancyArrowPatch
import pandas as pd

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# For better print formatting
np.set_printoptions(precision=3, suppress=True)

print("✓ Libraries imported successfully!")

## Step 1: Create a Simple 2D Dataset

Let's create a small dataset representing soil measurements:
- Feature 1: Nitrogen content (ppm)
- Feature 2: Phosphorus content (ppm)

These nutrients are often correlated in soil.

In [None]:
# Create a simple 2D dataset (8 samples, 2 features)
np.random.seed(42)

# Let's make the data manually for clarity
data = np.array([
    [2.5, 2.4],
    [0.5, 0.7],
    [2.2, 2.9],
    [1.9, 2.2],
    [3.1, 3.0],
    [2.3, 2.7],
    [2.0, 1.6],
    [1.0, 1.1]
])

print("Original Data:")
print("Sample | Nitrogen | Phosphorus")
print("-------|----------|------------")
for i, (n, p) in enumerate(data, 1):
    print(f"  {i}    |   {n:.1f}    |    {p:.1f}")
    
print(f"\nData shape: {data.shape}")
print(f"Number of samples: {data.shape[0]}")
print(f"Number of features: {data.shape[1]}")

In [None]:
# Visualize the original data
plt.figure(figsize=(10, 8))
plt.scatter(data[:, 0], data[:, 1], s=100, alpha=0.7, edgecolors='k', linewidths=2)

# Add labels to points
for i, (x, y) in enumerate(data, 1):
    plt.annotate(f'S{i}', (x, y), xytext=(5, 5), textcoords='offset points', fontsize=10)

plt.xlabel('Nitrogen (ppm)', fontsize=13)
plt.ylabel('Phosphorus (ppm)', fontsize=13)
plt.title('Original 2D Data: Soil Nutrient Measurements', fontsize=15, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.tight_layout()
plt.show()

print("\n💡 Observation: Notice the positive correlation - high N tends to go with high P")

## Step 2: Center the Data

**Why?** PCA cares about variance around the mean, not absolute values. We subtract the mean from each feature to center the data at the origin.

$$X_{centered} = X - \bar{X}$$

In [None]:
# Calculate mean of each feature
mean = data.mean(axis=0)
print("Mean values:")
print(f"  Nitrogen:   {mean[0]:.3f}")
print(f"  Phosphorus: {mean[1]:.3f}")

# Center the data
data_centered = data - mean

print("\nCentered Data:")
print("Sample | Nitrogen | Phosphorus")
print("-------|----------|------------")
for i, (n, p) in enumerate(data_centered, 1):
    print(f"  {i}    |  {n:6.3f}  |  {p:6.3f}")
    
# Verify: mean should now be [0, 0]
print(f"\n✓ Verification - New mean: {data_centered.mean(axis=0)}")

In [None]:
# Visualize original vs centered data
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))

# Original data
ax1.scatter(data[:, 0], data[:, 1], s=100, alpha=0.7, edgecolors='k', linewidths=2, color='blue')
ax1.scatter(mean[0], mean[1], s=300, marker='X', color='red', edgecolors='k', linewidths=2, label='Mean', zorder=5)
ax1.set_xlabel('Nitrogen (ppm)', fontsize=12)
ax1.set_ylabel('Phosphorus (ppm)', fontsize=12)
ax1.set_title('Original Data', fontsize=14, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)
ax1.axis('equal')

# Centered data
ax2.scatter(data_centered[:, 0], data_centered[:, 1], s=100, alpha=0.7, edgecolors='k', linewidths=2, color='green')
ax2.scatter(0, 0, s=300, marker='X', color='red', edgecolors='k', linewidths=2, label='Origin', zorder=5)
ax2.axhline(0, color='gray', linestyle='--', alpha=0.5)
ax2.axvline(0, color='gray', linestyle='--', alpha=0.5)
ax2.set_xlabel('Nitrogen (centered)', fontsize=12)
ax2.set_ylabel('Phosphorus (centered)', fontsize=12)
ax2.set_title('Centered Data (Mean at Origin)', fontsize=14, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)
ax2.axis('equal')

plt.tight_layout()
plt.show()

print("\n💡 Key Point: Centering moves the data cloud to the origin")
print("   This is essential for PCA to find the right directions!")
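
To see why centering matters, the sketch below shifts the nitrogen readings by a hypothetical 10 ppm offset (an assumption made only for illustration) and compares the leading direction found with and without centering. The helper `top_direction` is not part of the notebook; it is a small illustrative function.

```python
import numpy as np

# Same soil data as in Step 1, repeated so this sketch runs on its own
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
                 [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

# Hypothetical offset: pretend every nitrogen reading is 10 ppm higher
shifted = data + np.array([10.0, 0.0])
n = shifted.shape[0]

def top_direction(matrix):
    """Unit eigenvector of a symmetric matrix with the largest eigenvalue."""
    values, vectors = np.linalg.eigh(matrix)  # eigh returns ascending eigenvalues
    return vectors[:, -1]

# With centering: this is the covariance matrix, unaffected by the offset
centered = shifted - shifted.mean(axis=0)
cov_centered = centered.T @ centered / (n - 1)
pc1_centered = top_direction(cov_centered)

# Without centering: the raw second-moment matrix is dominated by the mean
second_moment = shifted.T @ shifted / (n - 1)
pc1_uncentered = top_direction(second_moment)

mean_vector = shifted.mean(axis=0)
mean_direction = mean_vector / np.linalg.norm(mean_vector)

print("PC1 with centering:   ", pc1_centered)
print("PC1 without centering:", pc1_uncentered)
print("Direction of the mean:", mean_direction)
```

Without centering, the leading direction lines up almost exactly with the mean of the data (up to sign) instead of the direction in which the points spread.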

## Step 3: Calculate Covariance Matrix

The covariance matrix tells us how features vary together.

$$\text{Cov}(X) = \frac{1}{n-1}X_{centered}^T X_{centered}$$

For 2 features:
$$\text{Cov} = \begin{bmatrix} \text{Var}(X_1) & \text{Cov}(X_1,X_2) \\ \text{Cov}(X_1,X_2) & \text{Var}(X_2) \end{bmatrix}$$

In [None]:
# Calculate covariance matrix
cov_matrix = np.cov(data_centered.T)

print("Covariance Matrix:")
print(cov_matrix)
print("\nInterpretation:")
print(f"  Variance of Nitrogen:   {cov_matrix[0, 0]:.3f}")
print(f"  Variance of Phosphorus: {cov_matrix[1, 1]:.3f}")
print(f"  Covariance (N, P):      {cov_matrix[0, 1]:.3f}")
print(f"\n  Positive covariance means: When N is high, P tends to be high too!")
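
As a cross-check, the covariance formula can be evaluated by hand and compared with `np.cov`. This is a minimal sketch; the variable names `cov_manual`, `cov_numpy` and `correlation` are chosen here for illustration.

```python
import numpy as np

# Same soil data as in Step 1, repeated so this sketch runs on its own
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
                 [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

n_samples = data.shape[0]
centered = data - data.mean(axis=0)

# Manual covariance: (1 / (n - 1)) * X_centered^T X_centered
cov_manual = centered.T @ centered / (n_samples - 1)

# NumPy's version; rowvar=False means "columns are features"
cov_numpy = np.cov(data, rowvar=False)

# Correlation rescales the covariance by both standard deviations
correlation = cov_manual[0, 1] / np.sqrt(cov_manual[0, 0] * cov_manual[1, 1])

print("Manual covariance:\n", cov_manual)
print("np.cov result:\n", cov_numpy)
print("Matrices agree:", np.allclose(cov_manual, cov_numpy))
print(f"Correlation between N and P: {correlation:.3f}")
```

Dividing by `n - 1` rather than `n` gives the unbiased sample covariance, which is also the default in `np.cov`.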

## Step 4: Calculate Eigenvalues and Eigenvectors

This is the core of PCA!

- **Eigenvectors** = Principal Component directions (where to project)
- **Eigenvalues** = Variance along each PC (how much information)

$$\text{Cov} \cdot v = \lambda v$$

Where $v$ is eigenvector and $\lambda$ is eigenvalue.

In [None]:
# Calculate eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

print("Eigenvalues (variance along each PC):")
print(eigenvalues)
print("\nEigenvectors (PC directions):")
print(eigenvectors)

# Sort by eigenvalue (descending)
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

print("\n" + "="*50)
print("SORTED Results:")
print("="*50)
print(f"\nPC1 (First Principal Component):")
print(f"  Direction: [{eigenvectors[0, 0]:.3f}, {eigenvectors[1, 0]:.3f}]")
print(f"  Variance (eigenvalue): {eigenvalues[0]:.3f}")
print(f"  Variance explained: {100*eigenvalues[0]/eigenvalues.sum():.1f}%")

print(f"\nPC2 (Second Principal Component):")
print(f"  Direction: [{eigenvectors[0, 1]:.3f}, {eigenvectors[1, 1]:.3f}]")
print(f"  Variance (eigenvalue): {eigenvalues[1]:.3f}")
print(f"  Variance explained: {100*eigenvalues[1]/eigenvalues.sum():.1f}%")

print(f"\n✓ Total variance explained: {100*eigenvalues.sum()/eigenvalues.sum():.1f}%")

In [None]:
# Visualize the principal components
plt.figure(figsize=(10, 8))

# Plot centered data
plt.scatter(data_centered[:, 0], data_centered[:, 1], s=100, alpha=0.7, 
           edgecolors='k', linewidths=2, color='lightblue', label='Data points')

# Plot PC1
pc1_scale = 3 * np.sqrt(eigenvalues[0])
plt.arrow(0, 0, pc1_scale*eigenvectors[0, 0], pc1_scale*eigenvectors[1, 0],
         head_width=0.15, head_length=0.15, fc='red', ec='red', linewidth=3,
         label=f'PC1 ({100*eigenvalues[0]/eigenvalues.sum():.1f}% var)')

# Plot PC2
pc2_scale = 3 * np.sqrt(eigenvalues[1])
plt.arrow(0, 0, pc2_scale*eigenvectors[0, 1], pc2_scale*eigenvectors[1, 1],
         head_width=0.15, head_length=0.15, fc='blue', ec='blue', linewidth=3,
         label=f'PC2 ({100*eigenvalues[1]/eigenvalues.sum():.1f}% var)')

plt.axhline(0, color='gray', linestyle='--', alpha=0.3)
plt.axvline(0, color='gray', linestyle='--', alpha=0.3)
plt.xlabel('Nitrogen (centered)', fontsize=13)
plt.ylabel('Phosphorus (centered)', fontsize=13)
plt.title('Principal Components: New Coordinate System', fontsize=15, fontweight='bold')
plt.legend(fontsize=11, loc='upper left')
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print("   • PC1 (red) points where data spreads most")
print("   • PC2 (blue) is perpendicular to PC1")
print("   • Arrow lengths are proportional to the standard deviation along each PC")
print("   • These are our NEW axes for representing the data!")
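
The properties claimed in this step can be checked numerically. The sketch below uses `np.linalg.eigh`, a NumPy routine for symmetric matrices, as an alternative to `np.linalg.eig`. It verifies the eigenvalue equation, the orthogonality of the PCs, and that the eigenvalues sum to the total variance.

```python
import numpy as np

# Same soil data as in Step 1, repeated so this sketch runs on its own
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
                 [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

cov_matrix = np.cov(data, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]      # re-sort so PC1 comes first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Check Cov @ v = lambda * v for each principal component
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    lhs = cov_matrix @ v
    rhs = eigenvalues[i] * v
    print(f"PC{i + 1}: Cov @ v = {lhs}, lambda * v = {rhs}")

# Orthonormal eigenvectors give the identity matrix here
gram = eigenvectors.T @ eigenvectors
print("V^T V =\n", gram)

# The eigenvalues add up to the total variance (trace of the covariance matrix)
print("Sum of eigenvalues:", eigenvalues.sum())
print("Trace of covariance:", np.trace(cov_matrix))
```

The sign of each eigenvector is arbitrary, so a PC may point in the opposite direction from one run or routine to another without changing the analysis.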

## Step 5: Project Data onto Principal Components

Now we transform our data into the new coordinate system defined by the principal components.

$$X_{transformed} = X_{centered} \cdot \text{Eigenvectors}$$

This gives us coordinates along PC1 and PC2.

In [None]:
# Project data onto principal components
data_pca = data_centered.dot(eigenvectors)

print("Data in PCA Space:")
print("Sample |  PC1   |  PC2")
print("-------|--------|--------")
for i, (pc1, pc2) in enumerate(data_pca, 1):
    print(f"  {i}    | {pc1:6.3f} | {pc2:6.3f}")

print(f"\nOriginal shape: {data.shape}")
print(f"Transformed shape: {data_pca.shape}")
print("\n💡 Same number of dimensions, but now in a rotated coordinate system!")
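
One consequence of the projection is worth checking: in the PC coordinate system the features are uncorrelated, and their variances equal the eigenvalues. A minimal sketch, with `scores` as an illustrative name for the projected data:

```python
import numpy as np

# Same soil data as in Step 1, repeated so this sketch runs on its own
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
                 [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

centered = data - data.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]      # PC1 first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project onto the principal components
scores = centered @ eigenvectors

# Covariance of the projected data: diagonal, with the eigenvalues on the diagonal
cov_scores = np.cov(scores, rowvar=False)

np.set_printoptions(precision=3, suppress=True)
print("Covariance in PCA space:\n", cov_scores)
print("Eigenvalues:", eigenvalues)
```

The off-diagonal entries are zero up to floating-point rounding, which is what "PC1 and PC2 are uncorrelated" means in practice.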

In [None]:
# Visualize the transformation
fig = plt.figure(figsize=(16, 7))

# Original space
ax1 = plt.subplot(1, 2, 1)
ax1.scatter(data_centered[:, 0], data_centered[:, 1], s=100, alpha=0.7,
           edgecolors='k', linewidths=2, color='lightblue')
for i, (x, y) in enumerate(data_centered, 1):
    ax1.annotate(f'S{i}', (x, y), xytext=(5, 5), textcoords='offset points')

# Draw PC axes
ax1.arrow(0, 0, 2*eigenvectors[0, 0], 2*eigenvectors[1, 0],
         head_width=0.1, head_length=0.1, fc='red', ec='red', linewidth=2)
ax1.arrow(0, 0, 2*eigenvectors[0, 1], 2*eigenvectors[1, 1],
         head_width=0.1, head_length=0.1, fc='blue', ec='blue', linewidth=2)

ax1.axhline(0, color='gray', linestyle='--', alpha=0.3)
ax1.axvline(0, color='gray', linestyle='--', alpha=0.3)
ax1.set_xlabel('Original Feature 1 (N)', fontsize=12)
ax1.set_ylabel('Original Feature 2 (P)', fontsize=12)
ax1.set_title('Before: Original Space', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.axis('equal')

# PCA space
ax2 = plt.subplot(1, 2, 2)
ax2.scatter(data_pca[:, 0], data_pca[:, 1], s=100, alpha=0.7,
           edgecolors='k', linewidths=2, color='lightgreen')
for i, (x, y) in enumerate(data_pca, 1):
    ax2.annotate(f'S{i}', (x, y), xytext=(5, 5), textcoords='offset points')

ax2.axhline(0, color='gray', linestyle='--', alpha=0.3)
ax2.axvline(0, color='gray', linestyle='--', alpha=0.3)
ax2.set_xlabel('PC1', fontsize=12)
ax2.set_ylabel('PC2', fontsize=12)
ax2.set_title('After: PCA Space (Rotated)', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.axis('equal')

plt.tight_layout()
plt.show()

print("\n💡 Observation:")
print("   • Data is rotated so PC1 is horizontal, PC2 is vertical")
print("   • Data spreads more along PC1 (horizontal) than PC2")
print("   • Relative positions of points are preserved!")
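
The claim that relative positions are preserved can be tested directly: the distance between every pair of samples should be the same before and after the transformation. The helper `pairwise_distances` below is a hypothetical function written for this sketch.

```python
import numpy as np

# Same soil data as in Step 1, repeated so this sketch runs on its own
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
                 [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

centered = data - data.mean(axis=0)
_, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
eigenvectors = eigenvectors[:, ::-1]       # eigh is ascending, so reverse to put PC1 first
scores = centered @ eigenvectors

def pairwise_distances(points):
    """Euclidean distance between every pair of rows (illustrative helper)."""
    diffs = points[:, None, :] - points[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

dist_before = pairwise_distances(centered)
dist_after = pairwise_distances(scores)

print("Largest change in any pairwise distance:",
      np.abs(dist_before - dist_after).max())
```

Because the eigenvector matrix is orthogonal, the transformation is a rotation (possibly combined with a reflection), so all distances survive unchanged.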

## Step 6: Dimensionality Reduction

Here's where the magic happens! We can keep only PC1 and discard PC2, losing minimal information.

**2D → 1D while keeping most information!**

In [None]:
# Keep only PC1 (1D representation)
data_1d = data_pca[:, 0]

print("Reduced to 1D (only PC1):")
print("Sample | PC1 Value")
print("-------|----------")
for i, val in enumerate(data_1d, 1):
    print(f"  {i}    | {val:7.3f}")

print(f"\nOriginal dimensions: {data.shape[1]}")
print(f"Reduced dimensions:  1")
print(f"Information retained: {100*eigenvalues[0]/eigenvalues.sum():.1f}%")
print(f"\n✓ We've reduced dimensions from 2D to 1D with minimal information loss!")

In [None]:
# Visualize the 1D representation
plt.figure(figsize=(12, 4))
plt.scatter(data_1d, np.zeros_like(data_1d), s=150, alpha=0.7,
           edgecolors='k', linewidths=2, c=range(len(data_1d)), cmap='viridis')

for i, val in enumerate(data_1d, 1):
    plt.annotate(f'S{i}', (val, 0), xytext=(0, 10), textcoords='offset points',
                ha='center', fontsize=11)

plt.axhline(0, color='gray', linestyle='-', linewidth=2)
plt.xlabel('PC1 Value', fontsize=13)
plt.yticks([])
plt.title('Data Reduced to 1D (Only PC1)', fontsize=15, fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

print("\n💡 Now we have a 1D representation that captures the main variation!")
print("   This is much easier to visualize and work with.")
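
With more than two features, the number of components to keep is often chosen from the cumulative share of variance. The helper `choose_n_components` below is hypothetical and not part of this notebook; the thresholds of 90% and 99% are arbitrary values picked for illustration.

```python
import numpy as np

def choose_n_components(eigenvalues, threshold=0.95):
    """Smallest k whose top-k eigenvalues reach `threshold` of the total variance.

    Hypothetical helper for illustration only.
    """
    sorted_values = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(sorted_values) / sorted_values.sum()
    # First index where the cumulative share reaches the threshold, as a count
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(sorted_values))      # guard against rounding at threshold = 1.0

# Same soil data as in Step 1, repeated so this sketch runs on its own
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
                 [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])
eigenvalues = np.linalg.eigvalsh(np.cov(data, rowvar=False))

for threshold in (0.90, 0.99):
    k = choose_n_components(eigenvalues, threshold)
    print(f"Keep {k} component(s) to retain at least {threshold:.0%} of the variance")
```

For this dataset a 90% target is met by PC1 alone, while a 99% target requires both components.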

## Step 7: Reconstruction from Reduced Dimensions

We can reconstruct the original data from PC1 only. There will be some loss (from discarding PC2), but it should be small.

$$X_{reconstructed} = X_{PCA} \cdot \text{Eigenvectors}^T + \text{mean}$$

In [None]:
# Reconstruct from 1D (only PC1)
# First, create a 2D array with PC2 = 0
data_pca_reduced = np.column_stack([data_1d, np.zeros_like(data_1d)])

# Transform back to original space
data_reconstructed = data_pca_reduced.dot(eigenvectors.T) + mean

# Also reconstruct with both PCs for comparison
data_reconstructed_full = data_pca.dot(eigenvectors.T) + mean

print("Comparison: Original vs Reconstructed Data")
print("\nSample | Original N | Original P | Recon N | Recon P | N Error | P Error")
print("-------|------------|------------|---------|---------|---------|--------")
for i in range(len(data)):
    n_err = abs(data[i, 0] - data_reconstructed[i, 0])
    p_err = abs(data[i, 1] - data_reconstructed[i, 1])
    print(f"  {i+1}    |   {data[i,0]:5.2f}    |   {data[i,1]:5.2f}    | {data_reconstructed[i,0]:6.2f}  | {data_reconstructed[i,1]:6.2f}  |  {n_err:.3f}  |  {p_err:.3f}")

# Calculate reconstruction error
reconstruction_error = np.mean((data - data_reconstructed)**2)
print(f"\nMean Squared Reconstruction Error: {reconstruction_error:.4f}")
print(f"\n✓ Small error means we didn't lose much information by keeping only PC1!")

In [None]:
# Visualize reconstruction
plt.figure(figsize=(10, 8))

# Original data
plt.scatter(data[:, 0], data[:, 1], s=150, alpha=0.7, color='blue',
           edgecolors='k', linewidths=2, label='Original', zorder=3)

# Reconstructed data (from PC1 only)
plt.scatter(data_reconstructed[:, 0], data_reconstructed[:, 1], s=150, 
           alpha=0.7, color='red', marker='s', edgecolors='k', linewidths=2,
           label='Reconstructed (PC1 only)', zorder=3)

# Draw lines showing reconstruction error
for i in range(len(data)):
    plt.plot([data[i, 0], data_reconstructed[i, 0]], 
            [data[i, 1], data_reconstructed[i, 1]], 
            'k--', alpha=0.3, linewidth=1)

# Draw PC1 line
t = np.linspace(-2, 2, 100)
pc1_line = mean[:, np.newaxis] + eigenvectors[:, 0:1].dot(t[np.newaxis, :])
plt.plot(pc1_line[0, :], pc1_line[1, :], 'g-', linewidth=2, 
        label='PC1 direction', alpha=0.7)

plt.xlabel('Nitrogen (ppm)', fontsize=13)
plt.ylabel('Phosphorus (ppm)', fontsize=13)
plt.title('Data Reconstruction: Original vs Reconstructed', fontsize=15, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.tight_layout()
plt.show()

print("\n💡 Key Observations:")
print("   • Blue circles = original data")
print("   • Red squares = reconstructed data (from PC1 only)")
print("   • Dashed lines = reconstruction error")
print("   • Green line = PC1 direction (all reconstructed points lie on this line!)")
print("   • Small errors confirm we kept most information")
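
The reconstruction error is not arbitrary: the squared error summed over all samples equals $(n-1)$ times the discarded eigenvalue, because the only thing thrown away is the PC2 coordinate. The sketch below checks this relationship, using the same mean-over-all-entries error definition as the cell above; `expected_mse` is a name introduced here.

```python
import numpy as np

# Same soil data as in Step 1, repeated so this sketch runs on its own
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
                 [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

n_samples = data.shape[0]
mean = data.mean(axis=0)
centered = data - mean

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]      # PC1 first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Reconstruct from PC1 only
pc1 = eigenvectors[:, [0]]                 # shape (2, 1)
scores_1d = centered @ pc1                 # shape (n, 1)
reconstructed = scores_1d @ pc1.T + mean

# Reconstruct from both PCs: nothing is discarded, so this is exact
reconstructed_full = (centered @ eigenvectors) @ eigenvectors.T + mean

mse = np.mean((data - reconstructed) ** 2)
expected_mse = (n_samples - 1) * eigenvalues[1] / data.size

print(f"Measured MSE (PC1 only):        {mse:.6f}")
print(f"(n-1) * lambda_2 / (n * d):     {expected_mse:.6f}")
print(f"Max error using both PCs:       {np.abs(data - reconstructed_full).max():.2e}")
```

This is why the eigenvalues alone are enough to predict how much error a given number of retained components will cost.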

## Step 8: Summary of What We Did

Let's review the complete PCA pipeline:

In [None]:
# Create a comprehensive summary
summary = {
    'Original dimensions': data.shape[1],
    'Number of samples': data.shape[0],
    'PC1 variance': eigenvalues[0],
    'PC2 variance': eigenvalues[1],
    'PC1 variance %': 100*eigenvalues[0]/eigenvalues.sum(),
    'PC2 variance %': 100*eigenvalues[1]/eigenvalues.sum(),
    'Reduced dimensions': 1,
    'Information retained': f"{100*eigenvalues[0]/eigenvalues.sum():.1f}%",
    'Reconstruction error': reconstruction_error
}

print("="*60)
print("PCA SUMMARY")
print("="*60)
for key, value in summary.items():
    print(f"{key:.<40} {value}")
print("="*60)

In [None]:
# Visualize the complete pipeline
fig = plt.figure(figsize=(18, 5))

# Step 1: Original data
ax1 = plt.subplot(1, 4, 1)
ax1.scatter(data[:, 0], data[:, 1], s=80, alpha=0.7, edgecolors='k', linewidths=1.5)
ax1.set_xlabel('Nitrogen')
ax1.set_ylabel('Phosphorus')
ax1.set_title('1. Original Data\n(2D)', fontweight='bold')
ax1.grid(True, alpha=0.3)

# Step 2: Centered data with PCs
ax2 = plt.subplot(1, 4, 2)
ax2.scatter(data_centered[:, 0], data_centered[:, 1], s=80, alpha=0.7, 
           edgecolors='k', linewidths=1.5)
ax2.arrow(0, 0, 1.5*eigenvectors[0, 0], 1.5*eigenvectors[1, 0],
         head_width=0.1, head_length=0.1, fc='red', ec='red', linewidth=2)
ax2.arrow(0, 0, 1.5*eigenvectors[0, 1], 1.5*eigenvectors[1, 1],
         head_width=0.1, head_length=0.1, fc='blue', ec='blue', linewidth=2)
ax2.axhline(0, color='gray', linestyle='--', alpha=0.3)
ax2.axvline(0, color='gray', linestyle='--', alpha=0.3)
ax2.set_xlabel('N (centered)')
ax2.set_ylabel('P (centered)')
ax2.set_title('2. Find PCs\n(eigenvectors)', fontweight='bold')
ax2.grid(True, alpha=0.3)

# Step 3: Projected data
ax3 = plt.subplot(1, 4, 3)
ax3.scatter(data_pca[:, 0], data_pca[:, 1], s=80, alpha=0.7,
           edgecolors='k', linewidths=1.5)
ax3.axhline(0, color='gray', linestyle='--', alpha=0.3)
ax3.axvline(0, color='gray', linestyle='--', alpha=0.3)
ax3.set_xlabel('PC1')
ax3.set_ylabel('PC2')
ax3.set_title('3. Project\n(transform)', fontweight='bold')
ax3.grid(True, alpha=0.3)

# Step 4: Reduced 1D
ax4 = plt.subplot(1, 4, 4)
ax4.scatter(data_1d, np.zeros_like(data_1d), s=80, alpha=0.7,
           edgecolors='k', linewidths=1.5, c=range(len(data_1d)), cmap='viridis')
ax4.axhline(0, color='gray', linestyle='-', linewidth=2)
ax4.set_xlabel('PC1')
ax4.set_yticks([])
ax4.set_title(f'4. Reduce to 1D\n({summary["Information retained"]} info)', fontweight='bold')
ax4.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\n✓ Complete PCA Pipeline Visualization")

## Key Takeaways

### What You Learned

1. **PCA Steps**:
   - Center the data (subtract mean)
   - Compute covariance matrix
   - Find eigenvectors (directions) and eigenvalues (variances)
   - Project data onto principal components
   - Keep top components for dimensionality reduction

2. **Key Concepts**:
   - PCs are ordered by variance (PC1 ≥ PC2 ≥ ...)
   - PCs are orthogonal (perpendicular)
   - Can reconstruct approximate original data
   - Trade-off: fewer dimensions vs information loss

3. **Geometric Interpretation**:
   - PCA rotates coordinate system
   - New axes align with data spread
   - Projection = changing coordinate system
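
The steps listed above can be collected into one small function. This is a minimal sketch under the assumptions of this notebook (covariance-based PCA on a small dense array); `pca_sketch` is a hypothetical name, not the implementation from the next section.

```python
import numpy as np

def pca_sketch(X, n_components):
    """Covariance-based PCA following the steps of this walkthrough (illustrative)."""
    X = np.asarray(X, dtype=float)

    # 1. Center the data
    mean = X.mean(axis=0)
    centered = X - mean

    # 2. Covariance matrix (columns are features)
    cov = np.cov(centered, rowvar=False)

    # 3. Eigen-decomposition, sorted so the largest variance comes first
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # 4. Keep the top components and project
    components = eigenvectors[:, :n_components]
    scores = centered @ components
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return scores, components, mean, explained_ratio

# Usage on the soil data from Step 1
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
                 [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

scores, components, mean, explained_ratio = pca_sketch(data, n_components=1)
approximation = scores @ components.T + mean   # back to the original space

print("Scores shape:", scores.shape)
print(f"Variance explained by PC1: {explained_ratio[0]:.1%}")
print(f"Mean squared reconstruction error: {np.mean((data - approximation) ** 2):.4f}")
```

Returning the mean and the components along with the scores makes it possible to reconstruct the data, or to project new samples, with the same transformation.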

### Why This Matters for Agriculture

- Soil has many correlated features (NPK, texture, etc.)
- PCA can reveal underlying soil quality factors
- Reduces complexity while keeping information
- Makes visualization and analysis easier

### Next Steps

Now that you understand PCA deeply, we'll:
1. Implement PCA from scratch in Python (next section)
2. Use sklearn's optimized PCA
3. Apply to real agricultural data

## Exercise (Optional)

Try these on your own:

1. **Experiment with different data**: Create your own 2D dataset and run PCA
2. **Vary correlation**: Make features more or less correlated, observe PC1 variance %
3. **Reconstruction**: Try keeping different numbers of PCs and compare errors

```python
# Your experimentation code here
```
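
As one possible starting point for exercise 2, the sketch below draws synthetic 2D data with a chosen correlation and reports the share of variance captured by PC1. The sample size, seed, and correlation values are assumptions chosen for illustration, and `pc1_variance_fraction` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)             # fixed seed so the sketch is repeatable

def pc1_variance_fraction(X):
    """Share of total variance along PC1 (illustrative helper)."""
    eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))  # ascending order
    return eigenvalues[-1] / eigenvalues.sum()

fractions = {}
for rho in [0.0, 0.5, 0.9, 0.99]:
    # Two unit-variance features with correlation rho
    cov = np.array([[1.0, rho], [rho, 1.0]])
    sample = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=5000)
    fractions[rho] = pc1_variance_fraction(sample)
    print(f"correlation = {rho:4.2f} -> PC1 explains {fractions[rho]:.1%} of the variance")
```

For two unit-variance features, the expected share is $(1 + |\rho|)/2$: uncorrelated features split the variance roughly evenly, while strongly correlated ones put nearly all of it on PC1.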

---

**Congratulations!** You've completed a full PCA walkthrough from scratch.

Continue to: `../2_from_scratch/pca_implementation.py`