# Central Tendency: Understanding the "Center" of Data

## Introduction

When working with agricultural data - whether it's soil pH measurements across 100 fields, crop yields over multiple seasons, or daily rainfall records - we need ways to **summarize** and **understand** our data at a glance. The most fundamental question we can ask is: **"What's a typical value?"**

This is where **measures of central tendency** come in. They help us find the "center" or "typical" value of a dataset.

### Real-World Agricultural Context

Imagine you're a farm consultant analyzing soil pH across a large agricultural region:
- You have pH measurements from 50 different fields
- Values range from 5.2 to 7.8
- You need to report a "typical" pH to farmers

Should you report the **mean**? The **median**? The **mode**? Each tells a different story!

### What You'll Learn

1. ✅ Calculate and interpret the **mean** (arithmetic average)
2. ✅ Calculate and interpret the **median** (middle value)
3. ✅ Calculate and interpret the **mode** (most frequent value)
4. ✅ Understand when to use each measure
5. ✅ Recognize how outliers affect each measure
6. ✅ Make better decisions with agricultural data
7. ✅ **Connection to PCA**: Understand why PCA centers data at the mean

### Why This Matters for Machine Learning

**Principal Component Analysis (PCA)** - the dimensionality reduction technique you'll learn later in the ML module - **always centers data at the mean** as its first step. Understanding what the mean represents is essential for understanding PCA!

Let's begin! 🌾

In [None]:
# Setup: Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11

# For reproducibility
np.random.seed(42)

print("✓ Setup complete!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

---

## 1. Mean (Arithmetic Average)

The **mean** is what most people call the "average". It's the sum of all values divided by the count.

### Mathematical Definition

For a dataset $X = \{x_1, x_2, ..., x_n\}$:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + ... + x_n}{n}$$

Where:
- $\bar{x}$ ("x-bar") = sample mean
- $n$ = number of observations
- $x_i$ = individual values

### Intuition: The Balance Point

Think of the mean as the **balance point** of your data. If you placed your data points on a see-saw, the mean is where you'd place the fulcrum to balance it perfectly.
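
The balance-point picture can be checked numerically: the deviations above and below the mean cancel, so they sum to zero. A minimal sketch, assuming a small made-up array (`readings` is illustrative and not part of this notebook's data):

```python
import numpy as np

# Illustrative values (hypothetical, chosen for easy arithmetic)
readings = np.array([2.0, 4.0, 9.0])

mean_value = readings.mean()         # (2 + 4 + 9) / 3 = 5.0
deviations = readings - mean_value   # [-3., -1., 4.]

print(f"Mean: {mean_value}")
print(f"Deviations from the mean: {deviations}")
print(f"Sum of deviations: {deviations.sum()}")
```

The negative deviations (-3 and -1) exactly offset the positive one (+4), which is the see-saw balancing.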

### When to Use the Mean

✅ **Use the mean when:**
- Data is roughly symmetric (no extreme outliers)
- You want to know the "typical" value considering all data points equally
- You're doing further statistical calculations (variance, standard deviation)
- Data is continuous (measurements, not categories)

### Agricultural Example: Soil Nitrogen Levels

Let's say we measured nitrogen levels (in ppm) across 10 fields:

In [None]:
# Soil nitrogen levels (ppm) from 10 fields
nitrogen_ppm = np.array([45, 52, 48, 51, 49, 47, 50, 53, 46, 54])

print("Nitrogen levels (ppm) across 10 fields:")
print(nitrogen_ppm)
print()

# Calculate mean manually
mean_manual = np.sum(nitrogen_ppm) / len(nitrogen_ppm)
print(f"Manual calculation: {mean_manual:.2f} ppm")

# Calculate mean using NumPy
mean_numpy = np.mean(nitrogen_ppm)
print(f"NumPy calculation: {mean_numpy:.2f} ppm")

# Verify they match (np.isclose avoids an exact floating-point comparison)
print(f"\n✓ Both methods give the same result: {np.isclose(mean_manual, mean_numpy)}")

print(f"\n💡 Interpretation: The average nitrogen level across all fields is {mean_numpy:.2f} ppm")

In [None]:
# Visualization: Individual values vs mean
fig, ax = plt.subplots(figsize=(12, 6))

# Plot individual values as bars
fields = [f"Field {i+1}" for i in range(len(nitrogen_ppm))]
bars = ax.bar(fields, nitrogen_ppm, alpha=0.7, color='forestgreen', edgecolor='darkgreen')

# Add mean line
ax.axhline(y=mean_numpy, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_numpy:.2f} ppm')

# Add value labels on bars
for i, value in enumerate(nitrogen_ppm):
    ax.text(i, value + 0.5, f'{value}', ha='center', va='bottom', fontweight='bold')

ax.set_xlabel('Field', fontsize=12, fontweight='bold')
ax.set_ylabel('Nitrogen Level (ppm)', fontsize=12, fontweight='bold')
ax.set_title('Soil Nitrogen Levels Across Fields\n(Mean shown as red dashed line)', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print("Notice how the mean (red line) sits roughly in the 'middle' of all the bars.")

### Mean is Sensitive to Outliers

⚠️ **Important caveat**: The mean can be heavily influenced by extreme values (outliers).

Let's see what happens if one field has unusually high nitrogen (maybe recent fertilization):

In [None]:
# Same data, but one field has extreme value
nitrogen_with_outlier = np.array([45, 52, 48, 51, 49, 47, 50, 53, 46, 120])  # Last value is outlier

mean_original = np.mean(nitrogen_ppm)
mean_with_outlier = np.mean(nitrogen_with_outlier)

print("Original data (no outlier):")
print(nitrogen_ppm)
print(f"Mean = {mean_original:.2f} ppm\n")

print("Data with outlier (last field has 120 ppm):")
print(nitrogen_with_outlier)
print(f"Mean = {mean_with_outlier:.2f} ppm\n")

change = mean_with_outlier - mean_original
percent_change = (change / mean_original) * 100

print(f"⚠️  The mean increased by {change:.2f} ppm ({percent_change:.1f}%)")
print(f"💡 One extreme value significantly pulled the mean upward!")

In [None]:
# Visualization: Effect of outliers on mean
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Without outlier
ax1.bar(range(len(nitrogen_ppm)), nitrogen_ppm, alpha=0.7, color='forestgreen', edgecolor='darkgreen')
ax1.axhline(y=mean_original, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_original:.2f} ppm')
ax1.set_xlabel('Field Index', fontsize=12, fontweight='bold')
ax1.set_ylabel('Nitrogen Level (ppm)', fontsize=12, fontweight='bold')
ax1.set_title('Without Outlier', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_ylim([0, 130])

# Plot 2: With outlier
colors = ['forestgreen'] * 9 + ['orange']  # Outlier in different color
ax2.bar(range(len(nitrogen_with_outlier)), nitrogen_with_outlier, alpha=0.7, color=colors, edgecolor='darkgreen')
ax2.axhline(y=mean_with_outlier, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_with_outlier:.2f} ppm')
ax2.axhline(y=mean_original, color='blue', linestyle=':', linewidth=2, label=f'Original mean = {mean_original:.2f} ppm')
ax2.annotate('Outlier!', xy=(9, 120), xytext=(7, 115),
            arrowprops=dict(arrowstyle='->', color='red', lw=2),
            fontsize=12, fontweight='bold', color='red')
ax2.set_xlabel('Field Index', fontsize=12, fontweight='bold')
ax2.set_ylabel('Nitrogen Level (ppm)', fontsize=12, fontweight='bold')
ax2.set_title('With Outlier (Field 10)', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_ylim([0, 130])

plt.tight_layout()
plt.show()

print("\n💡 Key Observation: The outlier pulls the mean upward significantly!")
print("   The mean is no longer 'typical' of most fields.")

---

## 2. Median (Middle Value)

The **median** is the "middle" value when data is sorted. At least half the values are at or below it, and at least half are at or above it.

### How to Calculate the Median

1. **Sort** the data from smallest to largest
2. **If $n$ is odd**: Median is the middle value
3. **If $n$ is even**: Median is the average of the two middle values

### Mathematical Definition

For sorted data $x_{(1)} \leq x_{(2)} \leq ... \leq x_{(n)}$:

$$\text{Median} = \begin{cases}
x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\
\frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2} & \text{if } n \text{ is even}
\end{cases}$$
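
The piecewise definition above translates almost directly into code. The sketch below is one possible reading of it; `median_by_definition` is a hypothetical helper name, and the only subtlety is converting the formula's 1-based positions to Python's 0-based indices.

```python
import numpy as np

def median_by_definition(values):
    """Return the median using the sorted-position rule (hypothetical helper)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    n = len(ordered)
    if n == 0:
        raise ValueError("median is undefined for empty data")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n+1)/2 is 0-based index n//2
        return float(ordered[mid])
    # Even n: average 1-based positions n/2 and n/2 + 1 (0-based mid-1 and mid)
    return float((ordered[mid - 1] + ordered[mid]) / 2)

odd_sample = [7, 3, 5]       # sorted: 3, 5, 7
even_sample = [7, 3, 5, 9]   # sorted: 3, 5, 7, 9

print(median_by_definition(odd_sample))
print(median_by_definition(even_sample))
```

For the odd sample the middle value is 5; for the even sample the two middle values 5 and 7 average to 6. In practice `np.median` does the same job.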

### Intuition: The 50th Percentile

The median is also called the **50th percentile** or **Q2** (second quartile). It divides the data into two equal halves.
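
As a quick check of this equivalence, NumPy's percentile and quantile functions return the same number as `np.median` when asked for the 50% point. A small sketch with illustrative values, assuming NumPy's default (linear) interpolation between neighbouring values:

```python
import numpy as np

sample = np.array([3.0, 5.0, 7.0, 9.0])  # illustrative values

median_value = np.median(sample)
p50 = np.percentile(sample, 50)   # 50th percentile
q2 = np.quantile(sample, 0.5)     # same quantity on a 0-1 scale

print(median_value, p50, q2)
```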

### When to Use the Median

✅ **Use the median when:**
- Data has outliers or is skewed
- You want a "typical" value that isn't affected by extremes
- Data is ordinal (ranked, but distances aren't meaningful)
- You're reporting income, home prices, or similar skewed data

### Agricultural Example: Crop Yield

In [None]:
# Wheat yield (tons/hectare) from 11 fields
wheat_yield = np.array([4.2, 4.5, 3.8, 4.3, 4.1, 4.4, 4.0, 4.6, 3.9, 4.2, 4.3])

print("Wheat yields (tons/hectare) from 11 fields:")
print(wheat_yield)
print()

# Calculate median manually
sorted_yield = np.sort(wheat_yield)
print("Sorted yields:")
print(sorted_yield)
print()

# Since n=11 (odd), median is the middle value (index 5)
n = len(sorted_yield)
middle_index = n // 2
median_manual = sorted_yield[middle_index]

print(f"Middle index: {middle_index}")
print(f"Median (manual): {median_manual} tons/hectare")

# Calculate median using NumPy
median_numpy = np.median(wheat_yield)
print(f"Median (NumPy): {median_numpy} tons/hectare")

# Also calculate mean for comparison
mean_yield = np.mean(wheat_yield)
print(f"Mean for comparison: {mean_yield:.2f} tons/hectare")

print(f"\n💡 Interpretation: Half the fields yield below {median_numpy} tons/hectare, half above.")

In [None]:
# Visualization: Showing median as middle value
fig, ax = plt.subplots(figsize=(12, 6))

# Plot sorted values
x_positions = range(len(sorted_yield))
colors_median = ['lightblue' if i < middle_index else 'lightcoral' if i > middle_index else 'gold' 
                for i in range(len(sorted_yield))]

bars = ax.bar(x_positions, sorted_yield, color=colors_median, edgecolor='black', linewidth=1.5)

# Highlight median
ax.axhline(y=median_numpy, color='darkgreen', linestyle='--', linewidth=2.5, 
          label=f'Median = {median_numpy} tons/hectare')

# Add value labels
for i, value in enumerate(sorted_yield):
    ax.text(i, value + 0.05, f'{value}', ha='center', va='bottom', fontweight='bold')

# Annotations
ax.text(2, 3.5, 'Below Median\n(5 values)', ha='center', fontsize=12, 
       bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))
ax.text(8, 3.5, 'Above Median\n(5 values)', ha='center', fontsize=12,
       bbox=dict(boxstyle='round', facecolor='lightcoral', alpha=0.8))
ax.text(5, 4.7, 'MEDIAN', ha='center', fontsize=14, fontweight='bold',
       bbox=dict(boxstyle='round', facecolor='gold', alpha=0.9))

ax.set_xlabel('Sorted Field Index', fontsize=12, fontweight='bold')
ax.set_ylabel('Wheat Yield (tons/hectare)', fontsize=12, fontweight='bold')
ax.set_title('Median: The Middle Value (50% below, 50% above)', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nThe median (gold bar) divides the data exactly in half.")

### Median is Robust to Outliers

Unlike the mean, the median is **not affected much by extreme values**. Let's demonstrate:

In [None]:
# Original wheat yield data
wheat_normal = np.array([4.2, 4.5, 3.8, 4.3, 4.1, 4.4, 4.0, 4.6, 3.9, 4.2, 4.3])

# Add an outlier (one field had exceptional conditions)
wheat_with_outlier = np.array([4.2, 4.5, 3.8, 4.3, 4.1, 4.4, 4.0, 4.6, 3.9, 4.2, 8.5])  # Outlier!

# Compare mean and median
print("WITHOUT outlier:")
print(f"  Mean   = {np.mean(wheat_normal):.2f} tons/hectare")
print(f"  Median = {np.median(wheat_normal):.2f} tons/hectare")
print()

print("WITH outlier (8.5 tons/hectare):")
print(f"  Mean   = {np.mean(wheat_with_outlier):.2f} tons/hectare  ⚠️  Changed significantly!")
print(f"  Median = {np.median(wheat_with_outlier):.2f} tons/hectare  ✓  Barely changed!")
print()

mean_change = np.mean(wheat_with_outlier) - np.mean(wheat_normal)
median_change = np.median(wheat_with_outlier) - np.median(wheat_normal)

print(f"💡 Mean changed by {mean_change:.2f} tons/hectare")
print(f"💡 Median changed by {median_change:.2f} tons/hectare")
print(f"\nThe median is much more ROBUST to outliers!")

---

## 3. Mode (Most Frequent Value)

The **mode** is the value that appears most frequently in the dataset.

### Key Points About Mode

- Can be used with **any type of data**: numerical, ordinal, or categorical
- A dataset can have:
  - **No mode** (all values unique)
  - **One mode** (unimodal)
  - **Two modes** (bimodal)
  - **Multiple modes** (multimodal)
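
These cases can be told apart by counting how often each value occurs and keeping every value tied for the top count. A minimal sketch using the standard library; `find_modes` is a hypothetical helper, and treating "all values unique" as "no mode" is the convention used in this notebook rather than a universal rule.

```python
from collections import Counter

def find_modes(values):
    """Return every value tied for the highest count (hypothetical helper).

    Returns an empty list when no value repeats, matching the
    "no mode" case described above.
    """
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(find_modes([1, 2, 3, 4]))                                # no mode
print(find_modes([1, 2, 2, 3]))                                # one mode
print(find_modes(['Clay', 'Loam', 'Clay', 'Loam', 'Sandy']))   # two modes
```

Returning a list keeps the bimodal and multimodal cases visible instead of silently reporting only one of the tied values.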

### When to Use the Mode

✅ **Use the mode when:**
- Data is categorical (soil types, crop varieties)
- You want the "most common" value
- Data is discrete with repeated values
- Identifying typical categories

### Agricultural Example: Soil Type Classification

In [None]:
# Soil types across 20 fields
soil_types = ['Clay', 'Loam', 'Sandy', 'Loam', 'Clay', 'Loam', 'Loam', 'Sandy',
              'Loam', 'Clay', 'Loam', 'Loam', 'Sandy', 'Clay', 'Loam', 'Sandy',
              'Loam', 'Clay', 'Loam', 'Loam']

print("Soil types from 20 fields:")
print(soil_types)
print()

# Find the mode with pandas (newer SciPy releases only accept numeric input in stats.mode)
soil_series = pd.Series(soil_types)
mode_value = soil_series.mode()[0]
mode_count = int((soil_series == mode_value).sum())

print(f"Mode: {mode_value}")
print(f"Appears {mode_count} times out of {len(soil_types)} fields")
print(f"\n💡 Interpretation: '{mode_value}' is the most common soil type in this region")

# Count frequency of each type
from collections import Counter
soil_counts = Counter(soil_types)
print(f"\nFrequency of each soil type:")
for soil, count in sorted(soil_counts.items(), key=lambda x: x[1], reverse=True):
    print(f"  {soil}: {count} fields ({count/len(soil_types)*100:.1f}%)")

In [None]:
# Visualization: Soil type frequencies
fig, ax = plt.subplots(figsize=(10, 6))

soil_types_unique = list(soil_counts.keys())
counts = [soil_counts[st] for st in soil_types_unique]

# Color the mode differently
colors_mode = ['gold' if st == mode_value else 'steelblue' for st in soil_types_unique]

bars = ax.bar(soil_types_unique, counts, color=colors_mode, edgecolor='black', linewidth=1.5)

# Add value labels
for i, (soil, count) in enumerate(zip(soil_types_unique, counts)):
    ax.text(i, count + 0.3, f'{count}\n({count/len(soil_types)*100:.1f}%)', 
           ha='center', va='bottom', fontweight='bold', fontsize=11)

# Highlight mode
mode_index = soil_types_unique.index(mode_value)
ax.text(mode_index, counts[mode_index]/2, 'MODE', ha='center', fontsize=14, 
       fontweight='bold', color='darkred',
       bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.8))

ax.set_xlabel('Soil Type', fontsize=12, fontweight='bold')
ax.set_ylabel('Number of Fields', fontsize=12, fontweight='bold')
ax.set_title('Distribution of Soil Types\n(Mode = Most Common Type)', 
            fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print(f"\nThe mode '{mode_value}' appears most frequently (gold bar).")

### Mode with Numerical Data

The mode can also be used with numerical data, especially when values repeat:

In [None]:
# Daily temperature readings (°C) with some repeated values
daily_temps = np.array([22, 23, 22, 24, 22, 25, 23, 22, 26, 22, 24, 23, 22, 25, 22])

print("Daily temperature readings (°C):")
print(daily_temps)
print()

# Find mode
mode_temp = stats.mode(daily_temps, keepdims=True)
mode_value_temp = mode_temp.mode[0]
mode_count_temp = mode_temp.count[0]

# Also calculate mean and median for comparison
mean_temp = np.mean(daily_temps)
median_temp = np.median(daily_temps)

print(f"Mean:   {mean_temp:.2f}°C")
print(f"Median: {median_temp:.2f}°C")
print(f"Mode:   {mode_value_temp}°C (appears {mode_count_temp} times)")
print(f"\n💡 The most frequently occurring temperature was {mode_value_temp}°C")

---

## 4. Comparison and Decision Guide

### Summary Table

| Measure | Formula | Best For | Affected by Outliers? | Works with Categorical Data? |
|---------|---------|----------|----------------------|-----------------------------|
| **Mean** | $\frac{\sum x_i}{n}$ | Symmetric data, further calculations | ✗ YES - very sensitive | ✗ NO |
| **Median** | Middle value when sorted | Skewed data, outliers present | ✓ NO - robust | ✗ NO - unless ordered (ordinal) |
| **Mode** | Most frequent value | Categorical data, identifying common values | ✓ NO - unaffected | ✓ YES |

### Decision Flowchart

```
Is your data CATEGORICAL (soil type, crop variety)?
├─ YES → Use MODE
│
└─ NO (numerical) → Does your data have OUTLIERS or is it SKEWED?
    ├─ YES → Use MEDIAN (robust to outliers)
    │
    └─ NO (roughly symmetric) → Use MEAN (standard choice)
```
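
The flowchart can be sketched as a small helper. This is only one way to encode it: `suggest_measure` is a hypothetical function, and both the 1.5 × IQR outlier rule and the skewness cutoff of 1.0 are illustrative assumptions, not fixed standards.

```python
import numpy as np
import pandas as pd

def suggest_measure(values):
    """Suggest a central tendency measure, following the flowchart above.

    Hypothetical helper: the 1.5 * IQR outlier rule and the skewness
    cutoff of 1.0 are illustrative assumptions, not fixed standards.
    """
    series = pd.Series(values).dropna()

    # Branch 1: categorical (non-numeric) data -> mode
    if not pd.api.types.is_numeric_dtype(series):
        return "mode"

    # Branch 2: numeric data with outliers or strong skew -> median
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    has_outliers = ((series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)).any()
    is_skewed = abs(series.skew()) > 1.0
    if has_outliers or is_skewed:
        return "median"

    # Branch 3: roughly symmetric numeric data -> mean
    return "mean"

print(suggest_measure(['Clay', 'Loam', 'Loam']))
print(suggest_measure([45, 52, 48, 51, 49, 47, 50, 53, 46, 54]))
print(suggest_measure([45, 52, 48, 51, 49, 47, 50, 53, 46, 120]))
```

A helper like this is a starting point for a first look, not a replacement for plotting the data. It would also misclassify categories stored as numeric codes (for example, soil types coded 1, 2, 3), so the data type still needs a human check.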

### Agricultural Examples by Scenario

| Scenario | Best Measure | Reason |
|----------|--------------|--------|
| Average crop yield across fields | **Mean** | Data typically symmetric, used for total production estimates |
| Typical farm income | **Median** | Income is usually right-skewed with high earners |
| Most common pest occurrence | **Mode** | Categorical data (pest types) |
| Central soil pH value | **Mean or Median** | Depends on outliers; check data first |
| Representative field size | **Median** | Often skewed by very large farms |
| Average rainfall | **Mean** | Used for water budgets, but median if extreme storms |

Let's see all three measures in action with a simulated agricultural dataset:

In [None]:
# Create a realistic agricultural dataset: Field sizes (hectares)
np.random.seed(123)

# Most farms are small (5-20 hectares), but a few are very large (outliers)
small_farms = np.random.uniform(5, 20, 40)
medium_farms = np.random.uniform(20, 50, 8)
large_farms = np.array([150, 200])  # Two very large farms (outliers)

farm_sizes = np.concatenate([small_farms, medium_farms, large_farms])
np.random.shuffle(farm_sizes)

print(f"Farm sizes (hectares) for {len(farm_sizes)} farms:")
print(f"Smallest: {np.min(farm_sizes):.1f} ha")
print(f"Largest:  {np.max(farm_sizes):.1f} ha")
print()

# Calculate all three measures
mean_size = np.mean(farm_sizes)
median_size = np.median(farm_sizes)
# For mode with continuous data, we'll use a rounded version
farm_sizes_rounded = np.round(farm_sizes)
mode_size = stats.mode(farm_sizes_rounded, keepdims=True)

print("Central Tendency Measures:")
print(f"  Mean:   {mean_size:.2f} hectares")
print(f"  Median: {median_size:.2f} hectares")
print(f"  Mode:   {mode_size.mode[0]:.0f} hectares (appears {mode_size.count[0]} times)")
print()

# Interpretation
print("💡 Interpretation:")
print(f"   - Mean ({mean_size:.1f} ha) is pulled UP by the two large farms")
print(f"   - Median ({median_size:.1f} ha) better represents the 'typical' farm size")
print(f"   - Mode shows the most common farm size category")
print()
print("   For reporting 'typical farm size', MEDIAN is best here!")

In [None]:
# Comprehensive visualization: All three measures
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Plot 1: Histogram with all measures marked
ax1 = axes[0, 0]
ax1.hist(farm_sizes, bins=20, alpha=0.7, color='skyblue', edgecolor='black')
ax1.axvline(mean_size, color='red', linestyle='--', linewidth=2.5, label=f'Mean = {mean_size:.1f} ha')
ax1.axvline(median_size, color='green', linestyle='--', linewidth=2.5, label=f'Median = {median_size:.1f} ha')
ax1.set_xlabel('Farm Size (hectares)', fontsize=11, fontweight='bold')
ax1.set_ylabel('Number of Farms', fontsize=11, fontweight='bold')
ax1.set_title('Distribution of Farm Sizes\n(Mean vs Median)', fontsize=13, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Plot 2: Box plot showing median and outliers
ax2 = axes[0, 1]
bp = ax2.boxplot(farm_sizes, vert=True, patch_artist=True, 
                 boxprops=dict(facecolor='lightgreen', alpha=0.7),
                 medianprops=dict(color='darkgreen', linewidth=2.5),
                 flierprops=dict(marker='o', markerfacecolor='red', markersize=10))
ax2.axhline(mean_size, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_size:.1f} ha')
ax2.set_ylabel('Farm Size (hectares)', fontsize=11, fontweight='bold')
ax2.set_title('Box Plot: Median (green line) vs Mean (red dashed)\nOutliers shown in red', 
             fontsize=13, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3, axis='y')
ax2.set_xticklabels(['All Farms'])

# Plot 3: Sorted values showing position of mean and median
ax3 = axes[1, 0]
sorted_sizes = np.sort(farm_sizes)
ax3.plot(range(len(sorted_sizes)), sorted_sizes, 'o-', color='steelblue', markersize=6)
ax3.axhline(mean_size, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_size:.1f} ha')
ax3.axhline(median_size, color='green', linestyle='--', linewidth=2, label=f'Median = {median_size:.1f} ha')
ax3.set_xlabel('Farm Index (sorted)', fontsize=11, fontweight='bold')
ax3.set_ylabel('Farm Size (hectares)', fontsize=11, fontweight='bold')
ax3.set_title('Sorted Farm Sizes', fontsize=13, fontweight='bold')
ax3.legend(fontsize=10)
ax3.grid(True, alpha=0.3)

# Plot 4: Comparison bar chart
ax4 = axes[1, 1]
measures = ['Mean', 'Median']
values = [mean_size, median_size]
colors_comp = ['red', 'green']
bars = ax4.bar(measures, values, color=colors_comp, alpha=0.7, edgecolor='black', linewidth=2)
for i, (measure, value) in enumerate(zip(measures, values)):
    ax4.text(i, value + 5, f'{value:.1f} ha', ha='center', va='bottom', 
            fontsize=14, fontweight='bold')
difference = mean_size - median_size
ax4.text(0.5, max(values)/2, f'Difference:\n{difference:.1f} ha\n({difference/median_size*100:.1f}%)',
        ha='center', fontsize=12, bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.8))
ax4.set_ylabel('Farm Size (hectares)', fontsize=11, fontweight='bold')
ax4.set_title('Mean vs Median: Which is "Typical"?', fontsize=13, fontweight='bold')
ax4.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n📊 Visualization Insights:")
print("   - Top-left: The mean is pulled right by outliers")
print("   - Top-right: Box plot shows outliers clearly")
print("   - Bottom-left: Median sits exactly in the middle of sorted data")
print("   - Bottom-right: Mean > Median suggests right-skewed data")

---

## 5. Connection to PCA

### Why Mean Matters for Principal Component Analysis

**Principal Component Analysis (PCA)** - the dimensionality reduction technique you'll learn next in the ML module - has a critical first step:

🎯 **PCA always centers data at the mean**

### What Does "Centering" Mean?

**Centering** means subtracting the mean from each data point:

$$x_{\text{centered}} = x - \bar{x}$$

After centering:
- The mean of the centered data is exactly **0**
- The data's **shape and spread** are preserved
- The data is shifted so its center is at the origin
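
These three properties can be verified on a handful of numbers. A minimal sketch, assuming made-up pH readings (the `ph` array is illustrative):

```python
import numpy as np

# Illustrative pH readings (hypothetical values)
ph = np.array([5.8, 6.1, 6.4, 6.9, 7.3])

ph_centered = ph - ph.mean()

# The mean moves to zero (up to floating-point rounding)
print(f"Mean before: {ph.mean():.2f}, after: {ph_centered.mean():.2e}")

# The spread is unchanged: same standard deviation, same gaps between points
print(np.isclose(ph.std(), ph_centered.std()))
print(np.allclose(np.diff(ph), np.diff(ph_centered)))
```

Subtracting a constant shifts every point by the same amount, so distances between points, and therefore the variance, stay the same.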

### Why PCA Centers Data

PCA finds directions of **maximum variance** in your data. Centering ensures:
1. ✅ Variance calculations are correct
2. ✅ Principal components pass through the data's center
3. ✅ The covariance matrix is computed properly
4. ✅ Results are invariant to the origin location
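
To make the covariance point concrete, the sketch below compares the matrix product $X^\top X / (n-1)$ computed with and without centering against NumPy's own covariance. The dataset is randomly generated for illustration (the feature means of 50 and 30 ppm are assumptions), and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-feature dataset: rows are fields, columns are N and P (ppm)
X = rng.normal(loc=[50.0, 30.0], scale=[10.0, 8.0], size=(30, 2))
n = X.shape[0]

# Covariance from centered data: Xc^T Xc / (n - 1)
X_centered = X - X.mean(axis=0)
cov_centered = X_centered.T @ X_centered / (n - 1)

# The same product WITHOUT centering mixes the feature means into the result
second_moment = X.T @ X / (n - 1)

matches_centered = np.allclose(cov_centered, np.cov(X, rowvar=False))
matches_uncentered = np.allclose(second_moment, np.cov(X, rowvar=False))

print(f"Centered product equals covariance:   {matches_centered}")
print(f"Uncentered product equals covariance: {matches_uncentered}")
```

Only the centered version reproduces the covariance matrix. Without centering, the large feature means dominate the product, so the "direction of maximum variance" would mostly point from the origin toward the data's center rather than along its spread.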

Let's visualize data centering:

In [None]:
# Simple 2D agricultural example: Nitrogen vs Phosphorus levels
np.random.seed(42)
n_samples = 30

# Original data (not centered)
nitrogen = np.random.normal(50, 10, n_samples)  # Mean around 50 ppm
phosphorus = np.random.normal(30, 8, n_samples)  # Mean around 30 ppm

# Calculate means
mean_N = np.mean(nitrogen)
mean_P = np.mean(phosphorus)

print(f"Original data:")
print(f"  Nitrogen mean:    {mean_N:.2f} ppm")
print(f"  Phosphorus mean:  {mean_P:.2f} ppm")
print()

# Center the data (subtract mean)
nitrogen_centered = nitrogen - mean_N
phosphorus_centered = phosphorus - mean_P

print(f"Centered data:")
print(f"  Nitrogen mean:    {np.mean(nitrogen_centered):.10f} ppm (≈ 0)")
print(f"  Phosphorus mean:  {np.mean(phosphorus_centered):.10f} ppm (≈ 0)")
print()
print("💡 After centering, both means are 0 (up to floating-point rounding)!")

In [None]:
# Visualization: Before and after centering
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))

# Plot 1: Original data
ax1.scatter(nitrogen, phosphorus, s=100, alpha=0.6, color='forestgreen', edgecolors='darkgreen', linewidth=1.5)
ax1.axvline(mean_N, color='red', linestyle='--', linewidth=2, label=f'Mean N = {mean_N:.1f} ppm')
ax1.axhline(mean_P, color='blue', linestyle='--', linewidth=2, label=f'Mean P = {mean_P:.1f} ppm')
ax1.plot(mean_N, mean_P, 'r*', markersize=20, label='Center of data', markeredgecolor='darkred', markeredgewidth=2)
ax1.set_xlabel('Nitrogen (ppm)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Phosphorus (ppm)', fontsize=12, fontweight='bold')
ax1.set_title('BEFORE Centering\n(Data centered at ({:.1f}, {:.1f}))'.format(mean_N, mean_P), 
             fontsize=14, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)
ax1.set_xlim([20, 80])
ax1.set_ylim([10, 50])

# Plot 2: Centered data
ax2.scatter(nitrogen_centered, phosphorus_centered, s=100, alpha=0.6, color='forestgreen', 
           edgecolors='darkgreen', linewidth=1.5)
ax2.axvline(0, color='red', linestyle='--', linewidth=2, label='Mean N = 0')
ax2.axhline(0, color='blue', linestyle='--', linewidth=2, label='Mean P = 0')
ax2.plot(0, 0, 'r*', markersize=20, label='Center at origin', markeredgecolor='darkred', markeredgewidth=2)
ax2.set_xlabel('Nitrogen - Mean (ppm)', fontsize=12, fontweight='bold')
ax2.set_ylabel('Phosphorus - Mean (ppm)', fontsize=12, fontweight='bold')
ax2.set_title('AFTER Centering\n(Data centered at (0, 0))', fontsize=14, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)
ax2.set_xlim([-35, 35])
ax2.set_ylim([-25, 25])

plt.tight_layout()
plt.show()

print("\n🎯 Key Observation:")
print(f"   - Left: Original data has center at ({mean_N:.1f}, {mean_P:.1f}), the sample means")
print("   - Right: Centered data has center at (0, 0)")
print("   - The SHAPE of the data is IDENTICAL, just shifted!")
print("\n💡 This is exactly what PCA does as its FIRST STEP before finding principal components!")

---

## Key Takeaways

### 💡 Main Concepts

1. **Mean (Average)**:
   - Sum of values divided by count
   - Balance point of data
   - ⚠️ Sensitive to outliers
   - ✅ Best for symmetric data
   - 🎯 **PCA uses mean for centering!**

2. **Median (Middle Value)**:
   - Middle value when sorted (50th percentile)
   - ✅ Robust to outliers
   - ✅ Best for skewed data
   - Used in income, home prices, farm sizes

3. **Mode (Most Common)**:
   - Most frequently occurring value
   - ✅ Works with categorical data
   - Can have multiple modes
   - Best for soil types, crop varieties, pest species

### 🔗 Connection to PCA

**Why you learned about the mean in this notebook:**

- PCA **always centers data** by subtracting the mean
- Centering shifts data so the mean is at (0, 0, ...)
- This ensures variance is calculated correctly
- Principal components pass through the data center

**When you learn PCA, you'll see:**
```python
# Step 1 of PCA: Center the data
X_centered = X - X.mean(axis=0)  # Subtract mean of each feature
```

Now you understand *why* this step is necessary!

### 📊 Decision Guide

**Use Mean when:**
- Data is roughly symmetric
- No extreme outliers
- You need to do further calculations (variance, SD)
- Reporting total production, average yields

**Use Median when:**
- Data is skewed
- Outliers are present
- Reporting "typical" values (income, prices, farm sizes)
- More representative of the "middle"

**Use Mode when:**
- Data is categorical
- You want the "most common" category
- Identifying typical soil types, crop varieties

### 🌾 Agricultural Applications

- **Soil Analysis**: Mean pH for field management, mode for soil type distribution
- **Yield Planning**: Median yield for realistic expectations (robust to bad years)
- **Economic Analysis**: Median farm income (not skewed by large operations)
- **Pest Management**: Mode for most common pest species
- **Weather Patterns**: Mean rainfall for water budgets, median to avoid storm skew

---

## Next Steps

You've mastered central tendency! Now you understand:
- ✅ How to find the "center" of data
- ✅ When to use mean vs median vs mode
- ✅ Why PCA centers data at the mean

**Continue to the next notebook:**
`02_measures_of_spread.ipynb` - Learn about **variance and standard deviation**

This next topic is **critical for PCA** because:
- PCA finds directions of **maximum variance**
- Variance measures how spread out (informative) data is
- Understanding variance is KEY to understanding PCA!

**Great work!** 🎉 You're building the statistical foundations for machine learning!