# Covariance and Correlation: Relationships Between Variables

## Introduction

🎯 **This is the MOST IMPORTANT notebook for understanding PCA!**

So far, we've learned about:
- **Mean**: The center of a single variable
- **Variance**: How much a single variable spreads

But real-world agricultural data has **multiple variables** that often **relate to each other**:
- 🌾 Higher nitrogen often means higher crop yield
- ☀️ More sunlight often means less moisture
- 🌡️ Temperature and growth rate move together

**Covariance** and **correlation** measure these **relationships**!

### Why This Matters for PCA

🎯 **Principal Component Analysis uses the COVARIANCE MATRIX as its input!**

PCA finds the directions (principal components) by:
1. Computing the covariance matrix
2. Finding its eigenvectors and eigenvalues
3. Ordering components by variance (eigenvalues)

**Without understanding covariance, you cannot understand PCA!**

### What You'll Learn

1. ✅ Understand **covariance** ⭐⭐ - How two variables vary together
2. ✅ Calculate and interpret the **covariance matrix** ⭐⭐
3. ✅ Understand **correlation coefficient** ⭐ - Normalized covariance
4. ✅ Create and interpret **correlation matrices** ⭐
5. ✅ Know when to use covariance vs correlation
6. ✅ Visualize relationships with scatter plots and heatmaps
7. ✅ **Connect to PCA**: See how covariance matrix drives PCA!

Let's begin this critical topic! 🚀

In [None]:
# Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from matplotlib.patches import FancyBboxPatch

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11
np.random.seed(42)

print("✓ Setup complete!")

---

## 1. Understanding Covariance ⭐⭐

### What is Covariance?

**Covariance** measures how two variables **vary together**.

Think of it as asking:
- When X increases, does Y tend to increase? (positive covariance)
- When X increases, does Y tend to decrease? (negative covariance)
- Do X and Y have no relationship? (zero covariance)

### Mathematical Definition

For two variables $X$ and $Y$:

**Population covariance**:
$$\text{Cov}(X, Y) = \sigma_{XY} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)$$

**Sample covariance** (what we usually calculate):
$$\text{Cov}(X, Y) = s_{XY} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
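
The two formulas differ only in the divisor. As a minimal sketch, using made-up numbers (`x` and `y` here are illustrative and not part of the field data used later), NumPy's `ddof` argument selects between them:

```python
import numpy as np

# Illustrative values only (hypothetical measurements)
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Sum of products of deviations (the numerator in both formulas)
sum_products = np.sum((x - x.mean()) * (y - y.mean()))

cov_population = sum_products / len(x)        # divide by N
cov_sample = sum_products / (len(x) - 1)      # divide by n - 1

# np.cov uses ddof=1 (sample) by default; ddof=0 gives the population form
print(cov_population, np.cov(x, y, ddof=0)[0, 1])
print(cov_sample, np.cov(x, y, ddof=1)[0, 1])
```

For large samples the two values are nearly identical; the difference matters mostly for small datasets.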

### Intuition: Product of Deviations

Covariance is the **average product of deviations** from their respective means.

**Step-by-step:**
1. Find how far each $x$ is from $\bar{x}$: $(x_i - \bar{x})$
2. Find how far each $y$ is from $\bar{y}$: $(y_i - \bar{y})$
3. **Multiply** these deviations: $(x_i - \bar{x}) \times (y_i - \bar{y})$
4. Average the products (for a sample, divide the sum by $n-1$)

### Three Cases

1. **Positive Covariance** ($\text{Cov}(X,Y) > 0$):
   - When $x$ is above its mean, $y$ tends to be above its mean
   - Variables move **together in the same direction**
   - Example: Nitrogen and yield both increase together

2. **Negative Covariance** ($\text{Cov}(X,Y) < 0$):
   - When $x$ is above its mean, $y$ tends to be below its mean
   - Variables move **in opposite directions**
   - Example: Temperature up, moisture down

3. **Zero Covariance** ($\text{Cov}(X,Y) \approx 0$):
   - No linear relationship
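   - Variables are **uncorrelated** (independent variables have zero covariance, but zero covariance alone does not guarantee independence)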
   - Example: Field location and random weather events
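
Zero covariance only rules out a *linear* relationship. A small sketch with made-up data, where `y` is completely determined by `x` yet the covariance is numerically zero:

```python
import numpy as np

# Symmetric x values around zero (illustrative)
x = np.linspace(-3, 3, 61)
y = x ** 2          # y depends entirely on x, but not linearly

cov_xy = np.cov(x, y)[0, 1]
print(f"Cov(x, x^2) = {cov_xy:.6f}")
```

The positive and negative products of deviations cancel out because the curve is symmetric around the mean of `x`, which is why a scatter plot is worth checking alongside the number.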

### Agricultural Example: Nitrogen and Yield

In [None]:
# Generate agricultural data with positive relationship
np.random.seed(42)
n_fields = 15

# Nitrogen levels (ppm) and Wheat yield (tons/hectare)
nitrogen = np.array([35, 40, 45, 42, 50, 48, 55, 52, 38, 60, 58, 44, 62, 56, 47])
yield_wheat = np.array([3.2, 3.5, 3.8, 3.6, 4.1, 4.0, 4.4, 4.2, 3.4, 4.6, 4.5, 3.7, 4.7, 4.3, 3.9])

print("Data from 15 fields:")
print(f"Nitrogen (ppm):     {nitrogen}")
print(f"Yield (tons/ha):    {yield_wheat}")
print()

# Calculate means
mean_N = np.mean(nitrogen)
mean_Y = np.mean(yield_wheat)

print(f"Mean Nitrogen: {mean_N:.2f} ppm")
print(f"Mean Yield:    {mean_Y:.2f} tons/hectare")

In [None]:
# Calculate covariance step-by-step
print("=== Calculating Covariance Step-by-Step ===")
print()

# Step 1 & 2: Calculate deviations
dev_N = nitrogen - mean_N
dev_Y = yield_wheat - mean_Y

print("Deviations from mean:")
print(f"{'Field':<8} {'N':<10} {'N-mean':<12} {'Y':<10} {'Y-mean':<12} {'Product':<12}")
print("-" * 70)

products = []
for i in range(min(8, len(nitrogen))):  # Show first 8 rows
    prod = dev_N[i] * dev_Y[i]
    products.append(prod)
    print(f"{i+1:<8} {nitrogen[i]:<10.0f} {dev_N[i]:>+11.2f} {yield_wheat[i]:<10.2f} {dev_Y[i]:>+11.2f} {prod:>+11.4f}")

print("...")
print()

# Step 3: Multiply deviations and sum
all_products = dev_N * dev_Y
sum_products = np.sum(all_products)
print(f"Sum of all products: {sum_products:.4f}")

# Step 4: Divide by n-1
n = len(nitrogen)
covariance_manual = sum_products / (n - 1)
print(f"Covariance = {sum_products:.4f} / {n-1} = {covariance_manual:.4f}")
print()

# Verify with NumPy
covariance_numpy = np.cov(nitrogen, yield_wheat, ddof=1)[0, 1]
print(f"NumPy verification: {covariance_numpy:.4f}")
print(f"✓ Match: {np.isclose(covariance_manual, covariance_numpy)}")
print()

print(f"💡 Interpretation: Cov(N, Y) = {covariance_manual:.4f}")
print(f"   Positive covariance → As nitrogen increases, yield tends to increase!")

In [None]:
# Visualization: Positive covariance
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Scatter plot with quadrants
ax1 = axes[0]
ax1.scatter(nitrogen, yield_wheat, s=150, alpha=0.7, c='forestgreen', edgecolors='darkgreen', linewidth=2)

# Add mean lines
ax1.axvline(mean_N, color='red', linestyle='--', linewidth=2, label=f'Mean N = {mean_N:.1f}')
ax1.axhline(mean_Y, color='blue', linestyle='--', linewidth=2, label=f'Mean Y = {mean_Y:.2f}')

# Add quadrant labels
ax1.text(35, 4.6, 'Quadrant II\n(-,+)\nContributes\nNEGATIVELY', 
        ha='center', fontsize=10, bbox=dict(boxstyle='round', facecolor='lightcoral', alpha=0.7))
ax1.text(60, 4.6, 'Quadrant I\n(+,+)\nContributes\nPOSITIVELY', 
        ha='center', fontsize=10, bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.7))
ax1.text(35, 3.3, 'Quadrant III\n(-,-)\nContributes\nPOSITIVELY', 
        ha='center', fontsize=10, bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.7))
ax1.text(60, 3.3, 'Quadrant IV\n(+,-)\nContributes\nNEGATIVELY', 
        ha='center', fontsize=10, bbox=dict(boxstyle='round', facecolor='lightcoral', alpha=0.7))

ax1.set_xlabel('Nitrogen (ppm)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Yield (tons/hectare)', fontsize=12, fontweight='bold')
ax1.set_title(f'Positive Covariance: {covariance_numpy:.2f}\nMost points in Quadrants I & III', 
             fontsize=13, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Plot 2: Deviation products
ax2 = axes[1]
colors_prod = ['green' if p > 0 else 'red' for p in all_products]
ax2.bar(range(len(all_products)), all_products, color=colors_prod, alpha=0.7, edgecolor='black')
ax2.axhline(0, color='black', linewidth=2)
ax2.set_xlabel('Field Index', fontsize=12, fontweight='bold')
ax2.set_ylabel('(N - mean_N) × (Y - mean_Y)', fontsize=12, fontweight='bold')
ax2.set_title('Product of Deviations\nGreen = Positive contribution, Red = Negative', 
             fontsize=13, fontweight='bold')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Key Observations:")
print("   - Left: Most points in Quadrants I and III (both contribute positively)")
print("   - Right: Mostly green bars (positive products)")
print("   - Result: POSITIVE covariance!")

### Examples of All Three Cases

In [None]:
# Create three synthetic datasets showing different covariances
np.random.seed(42)
n = 50

# Positive covariance: Temperature and evaporation
temp = np.random.uniform(15, 35, n)
evap = 0.5 * temp + np.random.normal(0, 2, n) + 5  # Positive relationship

# Negative covariance: Rainfall and dust
rainfall = np.random.uniform(0, 50, n)
dust = -0.3 * rainfall + np.random.normal(0, 2, n) + 20  # Negative relationship

# Zero covariance: Random variables
field_size = np.random.uniform(5, 30, n)
random_pests = np.random.uniform(10, 40, n)  # No relationship

# Calculate covariances
cov_pos = np.cov(temp, evap)[0, 1]
cov_neg = np.cov(rainfall, dust)[0, 1]
cov_zero = np.cov(field_size, random_pests)[0, 1]

print("Three Different Relationships:")
print(f"1. Temperature vs Evaporation:  Cov = {cov_pos:>8.2f} (POSITIVE)")
print(f"2. Rainfall vs Dust:            Cov = {cov_neg:>8.2f} (NEGATIVE)")
print(f"3. Field Size vs Random Pests:  Cov = {cov_zero:>8.2f} (≈ ZERO)")

In [None]:
# Visualization: Three types of covariance
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot 1: Positive covariance
ax1 = axes[0]
ax1.scatter(temp, evap, s=80, alpha=0.6, c='red', edgecolors='darkred')
z1 = np.polyfit(temp, evap, 1)
p1 = np.poly1d(z1)
ax1.plot(temp, p1(temp), 'r--', linewidth=2, label='Trend')
ax1.set_xlabel('Temperature (°C)', fontsize=11, fontweight='bold')
ax1.set_ylabel('Evaporation (mm)', fontsize=11, fontweight='bold')
ax1.set_title(f'POSITIVE Covariance\nCov = {cov_pos:.2f}\nAs X↑, Y↑', fontsize=12, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Negative covariance
ax2 = axes[1]
ax2.scatter(rainfall, dust, s=80, alpha=0.6, c='blue', edgecolors='darkblue')
z2 = np.polyfit(rainfall, dust, 1)
p2 = np.poly1d(z2)
ax2.plot(rainfall, p2(rainfall), 'b--', linewidth=2, label='Trend')
ax2.set_xlabel('Rainfall (mm)', fontsize=11, fontweight='bold')
ax2.set_ylabel('Dust Level', fontsize=11, fontweight='bold')
ax2.set_title(f'NEGATIVE Covariance\nCov = {cov_neg:.2f}\nAs X↑, Y↓', fontsize=12, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Plot 3: Zero covariance
ax3 = axes[2]
ax3.scatter(field_size, random_pests, s=80, alpha=0.6, c='gray', edgecolors='black')
ax3.axhline(np.mean(random_pests), color='gray', linestyle='--', linewidth=2, label='No trend')
ax3.set_xlabel('Field Size (hectares)', fontsize=11, fontweight='bold')
ax3.set_ylabel('Random Pest Count', fontsize=11, fontweight='bold')
ax3.set_title(f'ZERO Covariance\nCov ≈ {cov_zero:.2f}\nNo relationship', fontsize=12, fontweight='bold')
ax3.legend()
ax3.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Point: Covariance sign tells us the direction of relationship!")

---

## 2. Covariance Matrix ⭐⭐ (PCA Input!)

When you have **multiple variables**, you organize all covariances into a **covariance matrix**.

### Structure of Covariance Matrix

For variables $X_1, X_2, ..., X_p$:

$$\Sigma = \begin{bmatrix}
\text{Var}(X_1) & \text{Cov}(X_1, X_2) & \cdots & \text{Cov}(X_1, X_p) \\
\text{Cov}(X_2, X_1) & \text{Var}(X_2) & \cdots & \text{Cov}(X_2, X_p) \\
\vdots & \vdots & \ddots & \vdots \\
\text{Cov}(X_p, X_1) & \text{Cov}(X_p, X_2) & \cdots & \text{Var}(X_p)
\end{bmatrix}$$

### Key Properties

1. **Diagonal elements** = Variances of each variable
   - $\Sigma_{ii} = \text{Var}(X_i)$

2. **Off-diagonal elements** = Covariances between variables
   - $\Sigma_{ij} = \text{Cov}(X_i, X_j)$

3. **Symmetric matrix**: $\Sigma_{ij} = \Sigma_{ji}$
   - $\text{Cov}(X_i, X_j) = \text{Cov}(X_j, X_i)$

4. **Square matrix**: $p \times p$ for $p$ variables
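
These properties can be checked numerically. The sketch below uses a made-up data matrix `X` (rows are samples, columns are variables) and also checks one additional standard fact not listed above: the eigenvalues of a covariance matrix are never negative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 samples (rows) of 3 variables (columns)
X = rng.normal(size=(100, 3))
X[:, 2] += 0.8 * X[:, 0]            # make column 2 depend on column 0

sigma = np.cov(X, rowvar=False)     # rowvar=False: columns are variables

print(sigma.shape)                                          # (3, 3)
print(np.allclose(np.diag(sigma), X.var(axis=0, ddof=1)))   # diagonal = variances
print(np.allclose(sigma, sigma.T))                          # symmetric
print(np.all(np.linalg.eigvalsh(sigma) >= -1e-12))          # eigenvalues non-negative
```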

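### Why This is CRITICAL for PCA

🎯 **PCA decomposes the covariance matrix!**

```python
# Conceptually, this is what PCA does (X: samples in rows, variables in columns):
cov_matrix = np.cov(X.T)  # Covariance matrix
# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Columns of eigenvectors = principal component directions
```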
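
The snippet above assumes a data matrix `X` already exists. A self-contained sketch on synthetic data (the array below is made up for illustration) shows the same idea end to end, and checks that the variance of the data projected onto each principal direction equals the corresponding eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data: 200 samples, 2 correlated variables
x1 = rng.normal(0, 3, 200)
x2 = 0.5 * x1 + rng.normal(0, 1, 200)
X = np.column_stack([x1, x2])

cov_matrix = np.cov(X, rowvar=False)

# eigh returns ascending eigenvalues; reorder to descending
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]    # columns are principal directions

# Project centered data onto the principal directions
scores = (X - X.mean(axis=0)) @ eigenvectors

# Variance of each projected column matches the corresponding eigenvalue
print(eigenvalues)
print(scores.var(axis=0, ddof=1))
```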

### Agricultural Example: Multi-Variable Soil Data

In [None]:
# Create multi-variable soil dataset
np.random.seed(42)
n_samples = 30

# Four soil properties (with realistic relationships)
pH = np.random.normal(6.5, 0.5, n_samples)
nitrogen = np.random.normal(50, 10, n_samples)
phosphorus = 0.3 * nitrogen + np.random.normal(20, 5, n_samples)  # Related to N
organic_matter = 0.4 * nitrogen + np.random.normal(10, 3, n_samples)  # Related to N

# Create DataFrame
soil_data = pd.DataFrame({
    'pH': pH,
    'Nitrogen': nitrogen,
    'Phosphorus': phosphorus,
    'Organic_Matter': organic_matter
})

print("Soil Data (first 5 samples):")
print(soil_data.head())
print()

# Calculate covariance matrix
cov_matrix = soil_data.cov()

print("\n=== COVARIANCE MATRIX ===")
print(cov_matrix)
print()

print("💡 Reading the Covariance Matrix:")
print(f"   - Diagonal: Variances")
print(f"     Var(pH) = {cov_matrix.loc['pH', 'pH']:.4f}")
print(f"     Var(Nitrogen) = {cov_matrix.loc['Nitrogen', 'Nitrogen']:.4f}")
print()
print(f"   - Off-diagonal: Covariances")
print(f"     Cov(Nitrogen, Phosphorus) = {cov_matrix.loc['Nitrogen', 'Phosphorus']:.4f}")
print(f"     Cov(Nitrogen, Organic_Matter) = {cov_matrix.loc['Nitrogen', 'Organic_Matter']:.4f}")
print()
print(f"   - Symmetry check:")
print(f"     Cov(N, P) = {cov_matrix.loc['Nitrogen', 'Phosphorus']:.4f}")
print(f"     Cov(P, N) = {cov_matrix.loc['Phosphorus', 'Nitrogen']:.4f}  ← Same!")

In [None]:
# Visualization: Covariance matrix as heatmap
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Covariance matrix heatmap
sns.heatmap(cov_matrix, annot=True, fmt='.2f', cmap='RdYlGn', center=0,
            square=True, linewidths=2, cbar_kws={'label': 'Covariance'},
            ax=ax1, vmin=-50, vmax=100)
ax1.set_title('Covariance Matrix\n(Diagonal = Variances, Off-diagonal = Covariances)', 
             fontsize=13, fontweight='bold')

# Plot 2: Annotated matrix structure
ax2.axis('off')
matrix_str = f"""Covariance Matrix Structure:

┌                                                      ┐
│  Var(pH)      Cov(pH,N)    Cov(pH,P)    Cov(pH,OM)  │
│                                                      │
│  Cov(N,pH)    Var(N)       Cov(N,P)     Cov(N,OM)   │
│                                                      │
│  Cov(P,pH)    Cov(P,N)     Var(P)       Cov(P,OM)   │
│                                                      │
│  Cov(OM,pH)   Cov(OM,N)    Cov(OM,P)    Var(OM)     │
└                                                      ┘

Key Properties:
• DIAGONAL: Variances (never negative)
• OFF-DIAGONAL: Covariances (can be +, -, or 0)
• SYMMETRIC: Cov(X,Y) = Cov(Y,X)
• SIZE: 4×4 for 4 variables (p×p for p variables)

🎯 PCA Input: This matrix is decomposed to find
              principal components!
"""
ax2.text(0.1, 0.5, matrix_str, fontsize=11, family='monospace',
        verticalalignment='center',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))
ax2.set_title('Understanding the Structure', fontsize=13, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n🎯 For PCA: This covariance matrix will be decomposed into:")
print("   - Eigenvalues: Amount of variance along each principal component")
print("   - Eigenvectors: Directions of principal components")

---

## 3. Correlation Coefficient ⭐

**Problem with covariance**: The value depends on the **scale** of variables!

Example:
- Cov(pH, yield) might be 0.5
- Cov(nitrogen, yield) might be 50

But which relationship is stronger? Hard to tell!

**Solution**: **Correlation** = Standardized covariance
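
The scale problem can be seen directly by changing units. In this sketch (the data are randomly generated for illustration only), expressing yield in kg/ha instead of tons/ha multiplies the covariance by 1000 but leaves the correlation unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical field data
nitrogen_ppm = rng.normal(50, 10, 40)
yield_t_ha = 0.04 * nitrogen_ppm + rng.normal(2, 0.2, 40)

# Same yields expressed in kg/ha instead of tons/ha
yield_kg_ha = yield_t_ha * 1000

cov_tons = np.cov(nitrogen_ppm, yield_t_ha)[0, 1]
cov_kg = np.cov(nitrogen_ppm, yield_kg_ha)[0, 1]
r_tons = np.corrcoef(nitrogen_ppm, yield_t_ha)[0, 1]
r_kg = np.corrcoef(nitrogen_ppm, yield_kg_ha)[0, 1]

print(f"Covariance:  {cov_tons:.3f} (t/ha)  vs  {cov_kg:.1f} (kg/ha)")
print(f"Correlation: {r_tons:.4f} (t/ha)  vs  {r_kg:.4f} (kg/ha)")
```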

### Pearson Correlation Coefficient

$$r_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2} \cdot \sqrt{\sum(y_i - \bar{y})^2}}$$

### Key Properties

1. **Range**: Always between -1 and +1
   - $-1 \leq r \leq +1$

2. **Interpretation** (the strength cutoffs are common rules of thumb, not strict limits):
   - $r = +1$: Perfect positive linear relationship
   - $r = -1$: Perfect negative linear relationship
   - $r = 0$: No linear relationship
   - $|r| > 0.7$: Strong relationship
   - $0.4 < |r| \leq 0.7$: Moderate relationship
   - $|r| \leq 0.4$: Weak relationship

3. **Unitless**: Can compare across different variable pairs!

4. **Relationship to covariance**:
   $$\text{Correlation} = \frac{\text{Covariance}}{\text{SD}_X \times \text{SD}_Y}$$

In [None]:
print(nitrogen)
print(yield_wheat)

In [None]:
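# `nitrogen` was overwritten by the 30-sample soil dataset in Section 2.
# Restore the original 15 field measurements so they pair with `yield_wheat`.
nitrogen = np.array([35, 40, 45, 42, 50, 48, 55, 52, 38, 60, 58, 44, 62, 56, 47])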

In [None]:
# Calculate correlation from covariance
# Using nitrogen and yield data from earlier
assert len(nitrogen) == len(yield_wheat), "Length mismatch: cannot compute correlation"

cov_NY = np.cov(nitrogen, yield_wheat)[0, 1]
std_N = np.std(nitrogen, ddof=1)
std_Y = np.std(yield_wheat, ddof=1)

# Manual calculation
corr_manual = cov_NY / (std_N * std_Y)

# Using NumPy
corr_numpy = np.corrcoef(nitrogen, yield_wheat)[0, 1]

print("=== Correlation Calculation ===")
print()
print(f"Covariance(N, Y)    = {cov_NY:.4f}")
print(f"SD(Nitrogen)        = {std_N:.4f}")
print(f"SD(Yield)           = {std_Y:.4f}")
print()
print(f"Correlation = Cov(N,Y) / (SD_N × SD_Y)")
print(f"            = {cov_NY:.4f} / ({std_N:.4f} × {std_Y:.4f})")
print(f"            = {cov_NY:.4f} / {std_N * std_Y:.4f}")
print(f"            = {corr_manual:.4f}")
print()
print(f"NumPy verification: r = {corr_numpy:.4f}")
print()
print(f"💡 Interpretation: r = {corr_numpy:.2f}")
if abs(corr_numpy) > 0.7:
    strength = "STRONG"
elif abs(corr_numpy) > 0.4:
    strength = "MODERATE"
else:
    strength = "WEAK"
direction = "POSITIVE" if corr_numpy > 0 else "NEGATIVE"
print(f"   {strength} {direction} linear relationship!")

In [None]:
# Visualization: Different correlation strengths
np.random.seed(42)
n = 50
x = np.random.uniform(0, 10, n)

# Create data with different correlations
# (noise SDs chosen so the expected r, given x ~ Uniform(0, 10), is near the values noted;
#  sample values will differ somewhat)
y_strong = x + np.random.normal(0, 1, n)        # r ≈ 0.95
y_moderate = x + np.random.normal(0, 3, n)      # r ≈ 0.70
y_weak = x + np.random.normal(0, 8, n)          # r ≈ 0.35
y_none = np.random.uniform(0, 10, n)            # r ≈ 0.00

correlations = [
    np.corrcoef(x, y_strong)[0, 1],
    np.corrcoef(x, y_moderate)[0, 1],
    np.corrcoef(x, y_weak)[0, 1],
    np.corrcoef(x, y_none)[0, 1]
]

fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

datasets = [y_strong, y_moderate, y_weak, y_none]
titles = ['Strong Positive', 'Moderate Positive', 'Weak Positive', 'No Correlation']

for i, (ax, y_data, title, r) in enumerate(zip(axes, datasets, titles, correlations)):
    ax.scatter(x, y_data, s=80, alpha=0.6, c='steelblue', edgecolors='darkblue')
    z = np.polyfit(x, y_data, 1)
    p = np.poly1d(z)
    ax.plot(x, p(x), 'r--', linewidth=2, label='Trend line')
    ax.set_xlabel('X', fontsize=11, fontweight='bold')
    ax.set_ylabel('Y', fontsize=11, fontweight='bold')
    ax.set_title(f'{title}\nr = {r:.2f}', fontsize=13, fontweight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Point: Correlation quantifies relationship strength on a -1 to +1 scale!")

---

## 4. Correlation Matrix

Just like the covariance matrix, but with **correlations** instead!

### Properties

1. **Diagonal = 1** (correlation of a variable with itself)
2. **Off-diagonal** = correlations between variables (-1 to +1)
3. **Symmetric**: $r_{ij} = r_{ji}$
4. **Easier to interpret** than covariance matrix (standardized scale)
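
A correlation matrix can be obtained from a covariance matrix by dividing each entry by the product of the two corresponding standard deviations. A minimal sketch, assuming a made-up data matrix `X` with variables in columns:

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical data: 50 samples of 3 variables on different scales
X = rng.normal(size=(50, 3)) * np.array([0.5, 10.0, 3.0])
X[:, 2] += 0.2 * X[:, 1]

cov = np.cov(X, rowvar=False)
sd = np.sqrt(np.diag(cov))           # standard deviations from the diagonal

# r_ij = Cov(X_i, X_j) / (sd_i * sd_j)
corr_from_cov = cov / np.outer(sd, sd)

print(np.allclose(corr_from_cov, np.corrcoef(X, rowvar=False)))
print(np.diag(corr_from_cov))        # diagonal entries are all 1
```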

In [None]:
# Calculate correlation matrix for soil data
corr_matrix = soil_data.corr()

print("=== CORRELATION MATRIX ===")
print(corr_matrix)
print()

print("💡 Reading the Correlation Matrix:")
print(f"   - Diagonal: All 1.0 (perfect correlation with self)")
print(f"   - Off-diagonal: Correlation coefficients (-1 to +1)")
print()
print(f"Strong relationships (|r| > 0.7):")
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        r = corr_matrix.iloc[i, j]
        if abs(r) > 0.7:
            print(f"   {corr_matrix.columns[i]} ↔ {corr_matrix.columns[j]}: r = {r:.3f}")

In [None]:
# Visualization: Covariance vs Correlation matrices side-by-side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Covariance matrix
sns.heatmap(cov_matrix, annot=True, fmt='.1f', cmap='coolwarm', center=0,
            square=True, linewidths=2, cbar_kws={'label': 'Covariance'},
            ax=ax1)
ax1.set_title('Covariance Matrix\n(Scale-dependent)', fontsize=13, fontweight='bold')

# Plot 2: Correlation matrix
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=2, cbar_kws={'label': 'Correlation'},
            ax=ax2, vmin=-1, vmax=1)
ax2.set_title('Correlation Matrix\n(Standardized, -1 to +1)', fontsize=13, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n📊 Comparison:")
print("   - Left: Hard to compare values (different scales)")
print("   - Right: Easy to compare (all on -1 to +1 scale)")
print("\n💡 Correlation matrix is easier to interpret!")

---

## 5. Covariance vs Correlation: When to Use Each

### Comparison Table

| Feature | Covariance | Correlation |
|---------|-----------|-------------|
| **Formula** | $\frac{\sum(x-\bar{x})(y-\bar{y})}{n-1}$ | $\frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$ |
| **Range** | $-\infty$ to $+\infty$ | $-1$ to $+1$ |
| **Units** | Product of X and Y units | Unitless |
| **Scale-dependent** | YES | NO |
| **Interpretation** | Direction only | Direction AND strength |
| **Use for PCA** | YES (covariance matrix) | Sometimes (correlation matrix PCA) |
| **Comparing variables** | Difficult (different scales) | Easy (standardized) |

### When to Use Covariance

✅ **Use covariance when:**
- Variables are in **same units** (all in kg, all in meters, etc.)
- You want to preserve **scale information**
- Doing **PCA on variables with similar scales**
- Building mathematical models requiring actual scales

### When to Use Correlation

✅ **Use correlation when:**
- Variables have **different units** (pH vs kg vs ppm)
- You want to **compare relationship strengths**
- **Interpreting** relationships for reports
- Doing **PCA on variables with very different scales** (use correlation matrix)
- Need **standardized measure** (-1 to +1)

### PCA: Covariance or Correlation Matrix?

🎯 **Critical decision for PCA!**

**Use COVARIANCE matrix** when:
- Variables have similar scales and units
- You want PCA to weight variables by their actual variance
- Example: All features are soil nutrients in ppm

**Use CORRELATION matrix** (standardized PCA) when:
- Variables have very different scales
- You want all variables to contribute equally
- Example: Mixing pH (0-14), nitrogen (ppm), temperature (°C)

```python
# In PCA:
from sklearn.decomposition import PCA

# Covariance-based PCA (default)
pca_cov = PCA()
pca_cov.fit(X)

# Correlation-based PCA (standardize first!)
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
pca_corr = PCA()
pca_corr.fit(X_scaled)  # Equivalent to PCA on the correlation matrix
```
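
The reason standardizing first amounts to correlation-based PCA is that the covariance matrix of z-scored data is the correlation matrix of the original data. A NumPy-only sketch with made-up features (names and values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical features on very different scales
X = np.column_stack([
    rng.normal(6.5, 0.5, 100),     # pH-like
    rng.normal(50, 10, 100),       # nitrogen-like (ppm)
    rng.normal(22, 4, 100),        # temperature-like (°C)
])

# Z-score each column (sample standard deviation, ddof=1)
X_z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_of_standardized = np.cov(X_z, rowvar=False)
corr_of_original = np.corrcoef(X, rowvar=False)

print(np.allclose(cov_of_standardized, corr_of_original))
```

Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), so the covariance of its output differs from the correlation matrix by a constant factor of $n/(n-1)$. A constant factor rescales the eigenvalues but leaves the principal directions unchanged.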

---

## Key Takeaways

### 💡 Main Concepts

1. **Covariance** ⭐⭐:
   - Measures how two variables vary **together**
   - Positive: Move in same direction
   - Negative: Move in opposite directions
   - Zero: No linear relationship
   - Formula: Average product of deviations

2. **Covariance Matrix** ⭐⭐:
   - Organizes all pairwise covariances
   - **THE input to PCA!**
   - Diagonal = variances
   - Off-diagonal = covariances
   - Symmetric, square (p×p)

3. **Correlation** ⭐:
   - **Standardized covariance**
   - Range: -1 to +1
   - Unitless, scale-independent
   - Easier to interpret
   - Shows both direction AND strength

4. **Correlation Matrix**:
   - Standardized version of covariance matrix
   - Diagonal always = 1
   - Easier comparison across variables
   - Sometimes used for PCA (correlation-based PCA)

### 🔗 Connection to PCA

**Why this notebook is MOST CRITICAL:**

1. **PCA Input**: Covariance (or correlation) matrix
2. **PCA Process**:
   ```
   Step 1: Center data (subtract mean)
   Step 2: Compute covariance matrix ← THIS NOTEBOOK!
   Step 3: Find eigenvectors (principal components)
   Step 4: Find eigenvalues (variance along each PC)
   Step 5: Sort by eigenvalues (descending)
   ```
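3. **Interpreting PCA**: Loadings show how strongly each original variable contributes to a PC; in correlation-based PCA, loadings scaled by the square root of the eigenvalue equal the correlations between the original variables and the PCs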
4. **Decision**: Covariance vs correlation matrix affects PCA results!

**You'll see in PCA module:**
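```python
# Conceptually, what PCA does:
X_centered = X - X.mean(axis=0)               # Center
cov_matrix = np.cov(X_centered.T)             # THIS!
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)  # eigh: for symmetric matrices
order = np.argsort(eigenvalues)[::-1]         # Sort by variance, descending
eigenvalues = eigenvalues[order]
principal_components = eigenvectors[:, order] # Columns are the PCs
```

### 🌾 Agricultural Applications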

**Positive covariance/correlation**:
- Nitrogen and yield
- Organic matter and water retention
- Temperature and evaporation
- Fertilizer amount and crop growth

**Negative covariance/correlation**:
- Rainfall and dust
- Pest pressure and yield
- Soil compaction and root depth
- Disease severity and profit

**Zero covariance/correlation**:
- Random weather and field location
- Unrelated soil properties
- Independent management decisions

### 📊 Practical Decision Guide

**Use Covariance Matrix for PCA when:**
- ✅ All variables in same units
- ✅ Similar scales (e.g., all nutrients in ppm)
- ✅ Want to preserve variance information

**Use Correlation Matrix for PCA when:**
- ✅ Variables in different units
- ✅ Very different scales (pH vs kg vs temperature)
- ✅ Want equal contribution from all variables
- ✅ (Requires standardization first!)
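
To see why the choice matters, the sketch below compares the share of variance captured by each component under both approaches. The three features are randomly generated, mutually independent, and purely illustrative; `explained_ratio` is a hypothetical helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
# Hypothetical, independent features on very different scales
ph = rng.normal(6.5, 0.5, n)
nitrogen_ppm = rng.normal(50, 10, n)
rainfall_mm = rng.normal(600, 150, n)
X = np.column_stack([ph, nitrogen_ppm, rainfall_mm])

def explained_ratio(matrix):
    """Share of total variance per component, largest first."""
    eigenvalues = np.linalg.eigvalsh(matrix)[::-1]   # eigvalsh returns ascending order
    return eigenvalues / eigenvalues.sum()

ratio_cov = explained_ratio(np.cov(X, rowvar=False))
ratio_corr = explained_ratio(np.corrcoef(X, rowvar=False))

print("Covariance PCA: ", np.round(ratio_cov, 3))
print("Correlation PCA:", np.round(ratio_corr, 3))
```

With the covariance matrix, the first component is dominated by rainfall simply because its numbers are largest. With the correlation matrix, the three unrelated variables share the variance roughly equally.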

---

## Next Steps

🎉 **Congratulations!** You've completed the MOST CRITICAL notebook for PCA!

You now understand:
- ✅ How variables relate (covariance)
- ✅ The covariance matrix structure
- ✅ How to standardize relationships (correlation)
- ✅ **Why PCA uses the covariance matrix!**

**Continue to the next notebook:**
`04_standardization_normalization.ipynb` - **Also CRITICAL for PCA!**

You'll learn:
- **Z-score standardization** (transforms covariance → correlation)
- When standardization is **required** for PCA
- How to prepare data for PCA analysis
- Effect of scaling on PCA results

**You're mastering the foundations of PCA!** 🚀📊✨