# Data Distributions

**Understanding Distribution Shapes and Patterns**

---

## Introduction

In previous notebooks, we learned about summary statistics (mean, variance) and scaling techniques. But **two datasets can have the same mean and variance yet look completely different!**
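
To make that claim concrete, here is a minimal sketch (the seed, sample sizes, and the `standardize` helper are illustrative assumptions, not part of the earlier notebooks). It rescales a symmetric sample and a skewed sample to the same mean and variance, then shows that skewness still tells them apart.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)  # hypothetical seed, chosen only for repeatability

def standardize(x, mean=50.0, sd=10.0):
    """Rescale x so its sample mean and standard deviation match the targets."""
    return (x - x.mean()) / x.std() * sd + mean

symmetric = standardize(rng.normal(size=5000))
skewed = standardize(rng.exponential(size=5000))

for name, data in [("symmetric", symmetric), ("skewed", skewed)]:
    print(f"{name:>9}: mean={data.mean():.2f}, var={data.var():.2f}, skew={skew(data):.2f}")
```

Both samples report the same mean and variance, so only a shape statistic (or a plot) reveals the difference.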

Understanding the **shape** of your data distribution is crucial for:
- Choosing appropriate statistical methods
- Identifying data quality issues
- Interpreting PCA results correctly
- Deciding if transformations are needed

### Why This Matters for PCA

While PCA works on any distribution, understanding your data's shape helps you:
1. **Detect outliers** that might dominate principal components
2. **Identify skewness** that might need transformation
3. **Understand results** - symmetric data is easier to interpret
4. **Choose preprocessing** - some distributions benefit from log transforms

### Learning Objectives

By the end of this notebook, you will:

1. ✓ Visualize distributions with **histograms, density plots, box plots**
2. ✓ Understand **skewness** (left, right, symmetric)
3. ✓ Understand **kurtosis** (heavy vs light tails)
4. ✓ Recognize **normal distributions** and test for normality
5. ✓ Use **Q-Q plots** to assess normality
6. ✓ Identify when **transformations** might help
7. ✓ Connect distribution properties to **PCA interpretation**

**Agricultural Context**: We'll examine distributions of crop yields, rainfall, soil properties, and pest occurrences to see how different agricultural variables have different distribution shapes.

---

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from scipy import stats
from scipy.stats import skew, kurtosis, shapiro, normaltest
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

print("✓ Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"SciPy version: {scipy.__version__}")  # scipy.stats itself has no __version__

## 1. Visualization Techniques

The first step in understanding distributions is **visualizing** them. Let's explore different visualization methods using agricultural data.

### Agricultural Scenario

We have wheat yield data from 200 fields over one growing season. Let's see how yield is distributed.

In [None]:
# Generate realistic wheat yield data (kg/hectare)
n_fields = 200

# Most yields are normal around 5000 kg/ha, but with some variability
wheat_yield = np.random.normal(5000, 800, n_fields)

# Floor yields at 1,000 kg/ha so no field gets a negative or implausibly low value
wheat_yield = np.maximum(wheat_yield, 1000)

print("🌾 Wheat Yield Data Summary")
print("=" * 60)
print(f"Number of fields: {len(wheat_yield)}")
print(f"Mean yield: {np.mean(wheat_yield):.1f} kg/ha")
print(f"Median yield: {np.median(wheat_yield):.1f} kg/ha")
print(f"Std deviation: {np.std(wheat_yield, ddof=1):.1f} kg/ha")
print(f"Min yield: {np.min(wheat_yield):.1f} kg/ha")
print(f"Max yield: {np.max(wheat_yield):.1f} kg/ha")

### Histogram

A histogram is the most common way to visualize a distribution: it shows how many values fall into each bin.

In [None]:
# Create histogram
fig, ax = plt.subplots(figsize=(12, 6))

# Plot histogram
n, bins, patches = ax.hist(wheat_yield, bins=25, color='#4ECDC4', alpha=0.7, 
                            edgecolor='black', linewidth=1.5)

# Add mean and median lines
mean_yield = np.mean(wheat_yield)
median_yield = np.median(wheat_yield)

ax.axvline(mean_yield, color='red', linestyle='--', linewidth=2.5, label=f'Mean: {mean_yield:.0f}')
ax.axvline(median_yield, color='blue', linestyle='--', linewidth=2.5, label=f'Median: {median_yield:.0f}')

# Labels and title
ax.set_xlabel('Wheat Yield (kg/ha)', fontsize=13, fontweight='bold')
ax.set_ylabel('Frequency (Number of Fields)', fontsize=13, fontweight='bold')
ax.set_title('🌾 Distribution of Wheat Yield Across 200 Fields\nHistogram Shows Frequency of Yield Values',
             fontsize=14, fontweight='bold')
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Histogram Insights:")
print("  • Most fields have yields between 4,000 and 6,000 kg/ha")
print("  • Distribution appears roughly symmetric (bell-shaped)")
print("  • Mean and median are very close (sign of symmetry)")

### Density Plot (KDE)

Kernel Density Estimation (KDE) creates a smooth curve showing the probability density.
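
As a rough illustration of what the smooth curve is, the sketch below rebuilds a Gaussian KDE by hand: it places one small normal "bump" on every observation and averages them. The synthetic sample and seed are assumptions for illustration; the bandwidth is taken from SciPy's `gaussian_kde` (Scott's rule by default) so the two results can be compared.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(1)
sample = rng.normal(5000, 800, 200)  # synthetic stand-in for yield data

kde = gaussian_kde(sample)                   # default bandwidth: Scott's rule
bandwidth = kde.factor * sample.std(ddof=1)  # kernel width in data units

grid = np.linspace(sample.min(), sample.max(), 50)
# Average one Gaussian bump per observation, evaluated at every grid point
manual_density = norm.pdf(grid[:, None], loc=sample[None, :], scale=bandwidth).mean(axis=1)

print("manual KDE matches gaussian_kde:", np.allclose(manual_density, kde(grid)))
```

A larger bandwidth gives a smoother curve that can hide real features; a smaller one gives a bumpier curve that follows sampling noise.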

In [None]:
# Create density plot
fig, ax = plt.subplots(figsize=(12, 6))

# Plot histogram with density
ax.hist(wheat_yield, bins=25, density=True, alpha=0.5, color='#4ECDC4', 
        edgecolor='black', linewidth=1, label='Histogram (normalized)')

# Plot KDE
from scipy.stats import gaussian_kde
kde = gaussian_kde(wheat_yield)
x_range = np.linspace(wheat_yield.min(), wheat_yield.max(), 200)
ax.plot(x_range, kde(x_range), color='#FF6B6B', linewidth=3, label='KDE (smooth density)')

# Add mean
ax.axvline(mean_yield, color='red', linestyle='--', linewidth=2, alpha=0.7, label=f'Mean: {mean_yield:.0f}')

# Labels
ax.set_xlabel('Wheat Yield (kg/ha)', fontsize=13, fontweight='bold')
ax.set_ylabel('Density', fontsize=13, fontweight='bold')
ax.set_title('🌾 Wheat Yield Distribution: Density Plot\nSmooth Curve Shows Probability Density',
             fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 KDE Insights:")
print("  • Smooth curve shows the overall shape more clearly")
print("  • Peaks around 5,000 kg/ha (most common yield)")
print("  • Tails show probability of extreme low/high yields")

### Box Plot

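A box plot shows the five-number summary: minimum, Q1, median, Q3, and maximum. With Matplotlib's defaults, the whiskers extend only to the most extreme points within 1.5 × IQR of the box, and anything beyond that is drawn as an individual outlier.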
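
The numbers behind a box plot can be computed directly. This is a minimal sketch on a synthetic sample (the data and seed are assumptions); the 1.5 × IQR fences follow Matplotlib's default whisker rule.

```python
import numpy as np

rng = np.random.default_rng(2)
sample = rng.normal(5000, 800, 200)  # synthetic yields, kg/ha

minimum, q1, median, q3, maximum = np.percentile(sample, [0, 25, 50, 75, 100])
iqr = q3 - q1

# Default whiskers stop at the most extreme points inside these fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print(f"min={minimum:.0f}, Q1={q1:.0f}, median={median:.0f}, Q3={q3:.0f}, max={maximum:.0f}")
print(f"IQR={iqr:.0f}, fences=[{lower_fence:.0f}, {upper_fence:.0f}], outliers={outliers.size}")
```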

In [None]:
# Create box plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
fig.suptitle('🌾 Box Plot: Visualizing the Five-Number Summary', fontsize=14, fontweight='bold')

# Vertical box plot
bp = ax1.boxplot(wheat_yield, vert=True, patch_artist=True,
                 boxprops=dict(facecolor='#4ECDC4', alpha=0.7),
                 medianprops=dict(color='red', linewidth=2.5),
                 whiskerprops=dict(linewidth=1.5),
                 capprops=dict(linewidth=1.5))

ax1.set_ylabel('Wheat Yield (kg/ha)', fontsize=12, fontweight='bold')
ax1.set_title('Vertical Box Plot', fontsize=12)
ax1.grid(True, alpha=0.3, axis='y')

# Add annotations
q1 = np.percentile(wheat_yield, 25)
q2 = np.percentile(wheat_yield, 50)
q3 = np.percentile(wheat_yield, 75)
iqr = q3 - q1

ax1.text(1.15, q1, f'Q1: {q1:.0f}', fontsize=10, va='center')
ax1.text(1.15, q2, f'Median: {q2:.0f}', fontsize=10, va='center', color='red', fontweight='bold')
ax1.text(1.15, q3, f'Q3: {q3:.0f}', fontsize=10, va='center')
ax1.text(1.15, q3 + 0.3*iqr, f'IQR: {iqr:.0f}', fontsize=10, va='center')

# Histogram with a horizontal box plot overlaid (yield on the x-axis for both)
ax2.hist(wheat_yield, bins=25, alpha=0.5,
         color='#4ECDC4', edgecolor='black')
y_top = ax2.get_ylim()[1]  # upper y-limit once the histogram bars are drawn
ax2.boxplot(wheat_yield, vert=False, positions=[y_top * 0.15],
            patch_artist=True, widths=y_top * 0.1, manage_ticks=False,
            boxprops=dict(facecolor='#FF6B6B', alpha=0.7),
            medianprops=dict(color='darkred', linewidth=2.5))

ax2.set_xlabel('Wheat Yield (kg/ha)', fontsize=12, fontweight='bold')
ax2.set_ylabel('Frequency', fontsize=12, fontweight='bold')
ax2.set_title('Box Plot with Histogram', fontsize=12)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Box Plot Insights:")
print(f"  • Q1 (25th percentile): {q1:.0f} kg/ha")
print(f"  • Median (50th percentile): {q2:.0f} kg/ha")
print(f"  • Q3 (75th percentile): {q3:.0f} kg/ha")
print(f"  • IQR (spread of middle 50%): {iqr:.0f} kg/ha")
print(f"  • Box plot shows distribution is roughly symmetric")

### Violin Plot

Combines box plot with density plot - shows both summary statistics and distribution shape.

In [None]:
# Create violin plot
fig, ax = plt.subplots(figsize=(8, 8))

# Create violin plot
parts = ax.violinplot([wheat_yield], positions=[1], widths=0.7,
                      showmeans=True, showmedians=True)

# Customize colors
for pc in parts['bodies']:
    pc.set_facecolor('#4ECDC4')
    pc.set_alpha(0.7)
    pc.set_edgecolor('black')
    pc.set_linewidth(1.5)

parts['cmedians'].set_color('red')
parts['cmedians'].set_linewidth(2.5)
parts['cmeans'].set_color('blue')
parts['cmeans'].set_linewidth(2)

# Labels
ax.set_ylabel('Wheat Yield (kg/ha)', fontsize=13, fontweight='bold')
ax.set_title('🌾 Violin Plot: Box Plot + Distribution Shape\nWidth Shows Density at Each Yield Level',
             fontsize=14, fontweight='bold')
ax.set_xticks([1])
ax.set_xticklabels(['Wheat Yield'])
ax.grid(True, alpha=0.3, axis='y')

# Add legend
from matplotlib.lines import Line2D
legend_elements = [Line2D([0], [0], color='red', linewidth=2.5, label='Median'),
                   Line2D([0], [0], color='blue', linewidth=2, label='Mean')]
ax.legend(handles=legend_elements, loc='upper right', fontsize=11)

plt.tight_layout()
plt.show()

print("\n📊 Violin Plot Insights:")
print("  • Width shows density (wider = more fields at that yield)")
print("  • Symmetric shape is consistent with a normal-like distribution")
print("  • Combines information from both box plot and histogram")

---

## 2. Skewness

**Skewness measures the asymmetry of a distribution.**

### Definition

$$\text{Skewness} = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i - \mu}{\sigma}\right)^3$$

### Interpretation

- **Skewness ≈ 0**: Symmetric distribution (for example, the normal distribution)
- **Skewness > 0**: Right-skewed (positive skew) - long tail to the right
- **Skewness < 0**: Left-skewed (negative skew) - long tail to the left

**Rule of thumb:**
- |Skewness| < 0.5: Fairly symmetric
- 0.5 < |Skewness| < 1: Moderately skewed
- |Skewness| > 1: Highly skewed
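
The definition and the rule of thumb can be sketched in a few lines. The helper names `sample_skewness` and `describe_skew` are hypothetical, and the gamma sample is synthetic. The manual value uses the population standard deviation (`ddof=0`), which is what `scipy.stats.skew` computes by default.

```python
import numpy as np
from scipy.stats import skew

def sample_skewness(x):
    """Mean of cubed z-scores, using the population SD (ddof=0)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

def describe_skew(value):
    """Apply the rule-of-thumb thresholds to a skewness value."""
    size = abs(value)
    if size < 0.5:
        return "fairly symmetric"
    if size < 1:
        return "moderately skewed"
    return "highly skewed"

rng = np.random.default_rng(3)
sample = rng.gamma(2, 10, 500)  # right-skewed synthetic data

g1 = sample_skewness(sample)
print(f"manual={g1:.3f}, scipy={skew(sample):.3f}, verdict={describe_skew(g1)}")
```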

### Agricultural Examples

In [None]:
# Create three agricultural datasets with different skewness

# 1. Symmetric: Soil pH (roughly normal)
soil_pH = np.random.normal(6.5, 0.6, 200)

# 2. Right-skewed: Rainfall (most days little rain, few days heavy rain)
rainfall_mm = np.random.gamma(2, 10, 200)  # Gamma distribution

# 3. Left-skewed: Crop health score (most crops healthy, few sick)
crop_health = 100 - np.random.gamma(2, 5, 200)  # Inverted gamma
crop_health = np.clip(crop_health, 0, 100)

# Calculate skewness
skew_pH = skew(soil_pH)
skew_rainfall = skew(rainfall_mm)
skew_health = skew(crop_health)

print("📊 Skewness Examples")
print("=" * 60)
print(f"\n1. Soil pH (Symmetric):")
print(f"   Skewness: {skew_pH:.3f}  ← Close to 0 (symmetric)")
print(f"   Mean: {np.mean(soil_pH):.2f}, Median: {np.median(soil_pH):.2f}")

print(f"\n2. Daily Rainfall (Right-Skewed):")
print(f"   Skewness: {skew_rainfall:.3f}  ← Positive (right tail)")
print(f"   Mean: {np.mean(rainfall_mm):.2f}, Median: {np.median(rainfall_mm):.2f}")
print(f"   Note: Mean > Median (pulled by high values)")

print(f"\n3. Crop Health Score (Left-Skewed):")
print(f"   Skewness: {skew_health:.3f}  ← Negative (left tail)")
print(f"   Mean: {np.mean(crop_health):.2f}, Median: {np.median(crop_health):.2f}")
print(f"   Note: Mean < Median (pulled by low values)")

In [None]:
# Visualize the three types of skewness
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
fig.suptitle('📊 Understanding Skewness: Three Agricultural Examples',
             fontsize=16, fontweight='bold', y=0.98)

datasets = [
    (soil_pH, 'Soil pH', 'Symmetric\n(Skew ≈ 0)', '#4ECDC4', skew_pH),
    (rainfall_mm, 'Daily Rainfall (mm)', 'Right-Skewed\n(Skew > 0)', '#FF6B6B', skew_rainfall),
    (crop_health, 'Crop Health Score', 'Left-Skewed\n(Skew < 0)', '#45B7D1', skew_health)
]

for idx, (data, xlabel, title, color, skewness) in enumerate(datasets):
    # Histogram
    ax_hist = axes[0, idx]
    ax_hist.hist(data, bins=25, color=color, alpha=0.7, edgecolor='black')
    ax_hist.axvline(np.mean(data), color='red', linestyle='--', linewidth=2, label='Mean')
    ax_hist.axvline(np.median(data), color='blue', linestyle='--', linewidth=2, label='Median')
    ax_hist.set_xlabel(xlabel, fontsize=11, fontweight='bold')
    ax_hist.set_ylabel('Frequency', fontsize=11)
    ax_hist.set_title(f'{title}\nSkewness = {skewness:.3f}', fontsize=12, fontweight='bold')
    ax_hist.legend(fontsize=9)
    ax_hist.grid(True, alpha=0.3)
    
    # Box plot
    ax_box = axes[1, idx]
    bp = ax_box.boxplot(data, vert=False, patch_artist=True,
                         boxprops=dict(facecolor=color, alpha=0.7),
                         medianprops=dict(color='darkred', linewidth=2.5))
    ax_box.set_xlabel(xlabel, fontsize=11, fontweight='bold')
    ax_box.set_title('Box Plot View', fontsize=11)
    ax_box.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n🎯 Key Observations:")
print("\n  LEFT Column (Symmetric):")
print("    • Mean ≈ Median")
print("    • Box plot symmetric around median")
print("    • Tails roughly equal on both sides")
print("\n  MIDDLE Column (Right-Skewed):")
print("    • Mean > Median (pulled right by high values)")
print("    • Long tail on the right")
print("    • Box plot shows upper whisker longer")
print("\n  RIGHT Column (Left-Skewed):")
print("    • Mean < Median (pulled left by low values)")
print("    • Long tail on the left")
print("    • Box plot shows lower whisker longer")

### Why Skewness Matters

**For PCA:**
- Highly skewed data might benefit from transformation (log, sqrt)
- Outliers in skewed data can dominate principal components
- Symmetric data is generally easier to interpret

**For Agriculture:**
- **Right-skewed** is common: Rainfall, pest damage, disease incidence
- **Left-skewed** examples: Crop quality scores, germination rates
- **Symmetric**: Soil pH, temperature, some nutrient levels
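
As a quick check of the transformation advice, the sketch below applies square-root and log transforms to a right-skewed variable and compares the skewness before and after. The `pest_counts` data are simulated from a lognormal distribution purely for illustration.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(4)
pest_counts = rng.lognormal(mean=3.0, sigma=0.8, size=1000)  # hypothetical right-skewed variable

sqrt_counts = np.sqrt(pest_counts)
log_counts = np.log1p(pest_counts)  # log(1 + x), safe when zeros are present

print(f"raw:  skew={skew(pest_counts):.2f}")
print(f"sqrt: skew={skew(sqrt_counts):.2f}")
print(f"log:  skew={skew(log_counts):.2f}")
```

The square root is the milder correction and the log the stronger one, so the choice depends on how long the right tail is. Neither applies directly to negative values.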

---

## 3. Kurtosis

**Kurtosis measures the "tailedness" of a distribution** - how much probability is in the tails vs the center.

### Definition

$$\text{Kurtosis} = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i - \mu}{\sigma}\right)^4 - 3$$

(Subtracting 3 gives the normal distribution a kurtosis of 0; the result is called "excess kurtosis".)

### Interpretation

- **Kurtosis ≈ 0**: Normal amount of tail probability (mesokurtic)
- **Kurtosis > 0**: Heavy tails, more outliers (leptokurtic)
- **Kurtosis < 0**: Light tails, fewer outliers (platykurtic)
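
The formula can be checked against SciPy on distributions whose excess kurtosis is known in theory: 0 for the normal, 3 for the Laplace, and -1.2 for the uniform. This is a sketch on simulated samples; the helper name `excess_kurtosis` and the seed are assumptions.

```python
import numpy as np
from scipy.stats import kurtosis

def excess_kurtosis(x):
    """Mean of fourth-power z-scores minus 3 (population SD, ddof=0)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3

rng = np.random.default_rng(5)
samples = {
    "normal": rng.normal(size=20000),
    "laplace (heavy tails)": rng.laplace(size=20000),
    "uniform (light tails)": rng.uniform(size=20000),
}
for name, data in samples.items():
    print(f"{name:>22}: manual={excess_kurtosis(data):+.2f}, scipy={kurtosis(data):+.2f}")
```

Because deviations are raised to the fourth power, a handful of extreme values drives the statistic, which is why it works as an early warning for outliers.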

### Agricultural Examples

In [None]:
# Create three distributions with different kurtosis
n = 1000

# 1. Normal kurtosis (mesokurtic): Regular yield
normal_yield = np.random.normal(5000, 800, n)

# 2. High kurtosis (leptokurtic): Yield with extreme events
# Mix of normal yield with some extreme values
high_kurt_yield = np.concatenate([
    np.random.normal(5000, 400, int(n*0.9)),  # Most values clustered
    np.random.normal(5000, 2000, int(n*0.1))  # Some extreme values
])

# 3. Low kurtosis (platykurtic): More uniform yield
low_kurt_yield = np.random.uniform(3000, 7000, n)

# Calculate kurtosis
kurt_normal = kurtosis(normal_yield, fisher=True)  # fisher=True for excess kurtosis
kurt_high = kurtosis(high_kurt_yield, fisher=True)
kurt_low = kurtosis(low_kurt_yield, fisher=True)

print("📊 Kurtosis Examples")
print("=" * 60)
print(f"\n1. Normal Distribution (Mesokurtic):")
print(f"   Kurtosis: {kurt_normal:.3f}  ← Close to 0 (normal tails)")

print(f"\n2. Heavy Tails (Leptokurtic):")
print(f"   Kurtosis: {kurt_high:.3f}  ← Positive (more outliers)")
print(f"   Interpretation: More extreme yields than expected")

print(f"\n3. Light Tails (Platykurtic):")
print(f"   Kurtosis: {kurt_low:.3f}  ← Negative (fewer outliers)")
print(f"   Interpretation: Yields more evenly spread")

In [None]:
# Visualize kurtosis differences
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
fig.suptitle('📊 Understanding Kurtosis: Tail Heaviness',
             fontsize=16, fontweight='bold')

datasets_kurt = [
    (normal_yield, 'Normal Kurtosis\n(Mesokurtic)', '#4ECDC4', kurt_normal),
    (high_kurt_yield, 'High Kurtosis\n(Leptokurtic - Heavy Tails)', '#FF6B6B', kurt_high),
    (low_kurt_yield, 'Low Kurtosis\n(Platykurtic - Light Tails)', '#45B7D1', kurt_low)
]

for idx, (data, title, color, kurt_val) in enumerate(datasets_kurt):
    ax = axes[idx]
    
    # Histogram
    ax.hist(data, bins=40, density=True, alpha=0.6, color=color, edgecolor='black')
    
    # KDE
    from scipy.stats import gaussian_kde
    kde = gaussian_kde(data)
    x_range = np.linspace(data.min(), data.max(), 200)
    ax.plot(x_range, kde(x_range), color='darkred', linewidth=2.5, label='KDE')
    
    # Mark mean
    ax.axvline(np.mean(data), color='blue', linestyle='--', linewidth=2, alpha=0.7)
    
    ax.set_xlabel('Yield (kg/ha)', fontsize=11, fontweight='bold')
    ax.set_ylabel('Density', fontsize=11, fontweight='bold')
    ax.set_title(f'{title}\nKurtosis = {kurt_val:.3f}', fontsize=11, fontweight='bold')
    ax.grid(True, alpha=0.3)
    ax.set_xlim([1000, 9000])

plt.tight_layout()
plt.show()

print("\n🎯 Key Observations:")
print("\n  LEFT (Normal Kurtosis ≈ 0):")
print("    • Bell-shaped with moderate tails")
print("    • Typical for many agricultural variables")
print("\n  MIDDLE (High Kurtosis > 0):")
print("    • Sharp peak in center")
print("    • Heavy tails with more extreme values")
print("    • Common when there are rare extreme events")
print("\n  RIGHT (Low Kurtosis < 0):")
print("    • Flat top (more uniform)")
print("    • Light tails with fewer extremes")
print("    • Values more evenly distributed")

---

## 4. Normal Distribution

The **normal (Gaussian) distribution** is the most important distribution in statistics.

### Properties

- **Symmetric** (skewness = 0)
- **Bell-shaped**
- **Mean = Median = Mode**
- **68-95-99.7 Rule**: 
  - 68% of data within 1 SD of mean
  - 95% within 2 SD
  - 99.7% within 3 SD
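
The three percentages are rounded values of the normal cumulative distribution function. A short sketch using `scipy.stats.norm` recovers them (about 68.27%, 95.45%, and 99.73%):

```python
from scipy.stats import norm

coverage = {}
for k in (1, 2, 3):
    # P(mu - k*sigma <= X <= mu + k*sigma) is the same for every normal distribution
    coverage[k] = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} SD of the mean: {coverage[k]:.2%}")
```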

### Why It Matters

- Many agricultural variables are approximately normal
- Central Limit Theorem: Averages tend to be normal
- PCA doesn't require normality, but interpretation is easier with normal data
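
The Central Limit Theorem point can be illustrated with a small simulation. In this sketch (all values hypothetical), each of 5,000 farms averages 30 strongly right-skewed plot measurements; the averages are far less skewed than the raw measurements.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(6)
# 5,000 hypothetical farms, each averaging 30 right-skewed plot measurements
plots = rng.exponential(scale=10.0, size=(5000, 30))
farm_means = plots.mean(axis=1)

print(f"individual plots: skew={skew(plots.ravel()):.2f}")
print(f"farm averages:    skew={skew(farm_means):.2f}")
```

The averages are still slightly right-skewed with only 30 values each; the approximation improves as more values go into each average.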

In [None]:
# Generate a large sample from a normal distribution
mu = 5000  # mean
sigma = 800  # standard deviation
normal_data = np.random.normal(mu, sigma, 10000)

# Visualize with 68-95-99.7 rule
fig, ax = plt.subplots(figsize=(14, 7))

# Histogram
n, bins, patches = ax.hist(normal_data, bins=50, density=True, alpha=0.6, 
                            color='#4ECDC4', edgecolor='black')

# Theoretical normal curve
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 200)
ax.plot(x, stats.norm.pdf(x, mu, sigma), 'r-', linewidth=3, 
        label='Theoretical Normal Distribution')

# Mark standard deviations
colors_sd = ['#FFD93D', '#FFA07A', '#FF6B6B']
for i in range(1, 4):
    ax.axvline(mu - i*sigma, color=colors_sd[i-1], linestyle='--', linewidth=2, alpha=0.7)
    ax.axvline(mu + i*sigma, color=colors_sd[i-1], linestyle='--', linewidth=2, alpha=0.7)

# Add shaded regions
x_1sd = np.linspace(mu - sigma, mu + sigma, 100)
ax.fill_between(x_1sd, stats.norm.pdf(x_1sd, mu, sigma), alpha=0.3, color='yellow',
                label='±1 SD (68%)')
x_2sd_left = np.linspace(mu - 2*sigma, mu - sigma, 100)
x_2sd_right = np.linspace(mu + sigma, mu + 2*sigma, 100)
ax.fill_between(x_2sd_left, stats.norm.pdf(x_2sd_left, mu, sigma), alpha=0.2, color='orange')
ax.fill_between(x_2sd_right, stats.norm.pdf(x_2sd_right, mu, sigma), alpha=0.2, color='orange',
                label='±2 SD (95%)')

# Labels
ax.set_xlabel('Value', fontsize=13, fontweight='bold')
ax.set_ylabel('Density', fontsize=13, fontweight='bold')
ax.set_title('📊 Normal Distribution: The 68-95-99.7 Rule',
             fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

# Add text annotations
ax.text(mu, max(n)*0.8, f'μ = {mu}\nσ = {sigma}',
        ha='center', fontsize=12, 
        bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

plt.tight_layout()
plt.show()

print("\n📊 The 68-95-99.7 Rule (Empirical Rule):")
print("=" * 60)
print(f"  μ ± 1σ: [{mu-sigma:.0f}, {mu+sigma:.0f}] contains ~68% of data")
print(f"  μ ± 2σ: [{mu-2*sigma:.0f}, {mu+2*sigma:.0f}] contains ~95% of data")
print(f"  μ ± 3σ: [{mu-3*sigma:.0f}, {mu+3*sigma:.0f}] contains ~99.7% of data")

# Verify with actual data
within_1sd = np.sum((normal_data >= mu-sigma) & (normal_data <= mu+sigma)) / len(normal_data)
within_2sd = np.sum((normal_data >= mu-2*sigma) & (normal_data <= mu+2*sigma)) / len(normal_data)
within_3sd = np.sum((normal_data >= mu-3*sigma) & (normal_data <= mu+3*sigma)) / len(normal_data)

print("\n✓ Verification with our data:")
print(f"  Within ±1 SD: {within_1sd*100:.1f}% (expected: 68%)")
print(f"  Within ±2 SD: {within_2sd*100:.1f}% (expected: 95%)")
print(f"  Within ±3 SD: {within_3sd*100:.1f}% (expected: 99.7%)")

### Testing for Normality: Q-Q Plot

A **Q-Q (Quantile-Quantile) plot** compares your data's quantiles to theoretical normal distribution quantiles.

**Interpretation:**
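- Points on the diagonal line = data is approximately normal
- Single bowed curve (both ends on the same side of the line) = skewed
- S-shape (ends bending in opposite directions) = heavy or light tails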
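
The same comparison can be made numerically. Called without a plot target, `scipy.stats.probplot` returns the theoretical quantiles, the ordered data, and a least-squares line with its correlation `r`. The sketch below uses synthetic samples and an arbitrary seed; an `r` very close to 1 means the points hug the line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
samples = {
    "normal": rng.normal(size=500),
    "right-skewed": rng.gamma(2, 2, size=500),
}
r_values = {}
for name, data in samples.items():
    # With plot=None (the default), probplot only returns the numbers behind the plot
    (theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(data, dist="norm")
    r_values[name] = r
    print(f"{name:>12}: r = {r:.4f}")
```

The correlation is only a summary; the plot itself shows where the departure happens (one tail, both tails, or the center).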

In [None]:
# Create Q-Q plots for different distributions
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
fig.suptitle('📊 Q-Q Plots: Testing for Normality', fontsize=16, fontweight='bold', y=0.98)

# Three distributions
test_data = [
    (np.random.normal(0, 1, 200), 'Normal Distribution', '#4ECDC4'),
    (np.random.gamma(2, 2, 200), 'Right-Skewed (Gamma)', '#FF6B6B'),
    (np.random.uniform(-2, 2, 200), 'Uniform Distribution', '#45B7D1')
]

for idx, (data, title, color) in enumerate(test_data):
    # Histogram
    ax_hist = axes[0, idx]
    ax_hist.hist(data, bins=20, color=color, alpha=0.7, edgecolor='black', density=True)
    ax_hist.set_title(title, fontsize=12, fontweight='bold')
    ax_hist.set_ylabel('Density', fontsize=11)
    ax_hist.grid(True, alpha=0.3)
    
    # Q-Q plot
    ax_qq = axes[1, idx]
    stats.probplot(data, dist="norm", plot=ax_qq)
    ax_qq.get_lines()[0].set_markerfacecolor(color)
    ax_qq.get_lines()[0].set_markeredgecolor('black')
    ax_qq.get_lines()[0].set_markersize(6)
    ax_qq.get_lines()[1].set_color('red')
    ax_qq.get_lines()[1].set_linewidth(2)
    ax_qq.set_title('Q-Q Plot', fontsize=11)
    ax_qq.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n🔍 Q-Q Plot Interpretation:")
print("=" * 60)
print("\n  LEFT (Normal):")
print("    ✓ Points fall on diagonal line")
print("    ✓ Data is approximately normal")
print("\n  MIDDLE (Right-Skewed):")
print("    ✗ Points bow away from the line in a single curve")
print("    ✗ Both ends sit above the line (short left tail, long right tail)")
print("\n  RIGHT (Uniform):")
print("    ✗ Ends bend in opposite directions (S-shape)")
print("    ✗ Light tails (platykurtic)")

### Statistical Tests for Normality
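
A normality test asks whether the data depart from a normal distribution by more than chance would explain. A small p-value is evidence against normality; a large one only means no departure was detected. The sketch below runs two SciPy tests, `shapiro` (Shapiro-Wilk) and `normaltest` (D'Agostino-Pearson, based on skewness and kurtosis), on synthetic samples chosen for illustration.

```python
import numpy as np
from scipy.stats import shapiro, normaltest

rng = np.random.default_rng(8)
samples = {
    "normal": rng.normal(6.5, 0.6, 300),
    "right-skewed": rng.gamma(2, 10, 300),
}
results = {}
for name, data in samples.items():
    _, p_shapiro = shapiro(data)
    _, p_dagostino = normaltest(data)
    results[name] = (p_shapiro, p_dagostino)
    print(f"{name:>12}: Shapiro-Wilk p={p_shapiro:.4g}, D'Agostino-Pearson p={p_dagostino:.4g}")
```

With large samples these tests flag even trivial departures, so it helps to read the p-value alongside the Q-Q plot and the skewness and kurtosis values.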

In [None]:
# Test normality of our agricultural datasets
test_datasets = [
    (wheat_yield, 'Wheat Yield'),
    (soil_pH, 'Soil pH'),
    (rainfall_mm, 'Daily Rainfall'),
    (crop_health, 'Crop Health Score')
]

print("📊 Normality Tests for Agricultural Data")
print("=" * 70)
print("\nShapiro-Wilk Test: p-value > 0.05 → no evidence against normality")
print("=" * 70)

for data, name in test_datasets:
    # Shapiro-Wilk test
    statistic, p_value = shapiro(data)
    
    # Calculate skewness and kurtosis
    skew_val = skew(data)
    kurt_val = kurtosis(data, fisher=True)
    
    print(f"\n{name}:")
    print(f"  Shapiro-Wilk p-value: {p_value:.4f}", end="")
    if p_value > 0.05:
        print(" ✓ (consistent with normal)")
    else:
        print(" ✗ (normality rejected at the 5% level)")
    print(f"  Skewness: {skew_val:.3f}")
    print(f"  Kurtosis: {kurt_val:.3f}")

print("\n" + "=" * 70)
print("\n💡 Important Note:")
print("  • PCA does NOT require normal distributions")
print("  • But normal data is easier to interpret")
print("  • Highly skewed data might benefit from transformation")

---

## 5. Summary and Key Insights

### What We Learned

1. **Visualization Methods**
   - Histogram: Shows frequency distribution
   - KDE (Density plot): Smooth probability density
   - Box plot: Five-number summary
   - Violin plot: Combines box plot + density

2. **Skewness** (Asymmetry)
   - Skewness ≈ 0: Symmetric
   - Skewness > 0: Right-skewed (long right tail)
   - Skewness < 0: Left-skewed (long left tail)
   - Agricultural examples: Rainfall (right), crop scores (left)

3. **Kurtosis** (Tailedness)
   - Kurtosis ≈ 0: Normal tails
   - Kurtosis > 0: Heavy tails (more outliers)
   - Kurtosis < 0: Light tails (fewer outliers)

4. **Normal Distribution**
   - Symmetric, bell-shaped
   - 68-95-99.7 rule for standard deviations
   - Q-Q plots assess normality visually
   - Shapiro-Wilk test checks for statistically detectable departures from normality

### Connection to PCA

**PCA doesn't require normal distributions**, but understanding distributions helps you:

1. **Identify problematic outliers** before PCA
2. **Decide on transformations** (log for right-skewed data)
3. **Interpret results** more easily with symmetric data
4. **Understand variance** - what PCA maximizes
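
To show why outliers matter here, the sketch below computes the first principal component with a plain NumPy SVD, before and after adding one extreme point. The data, the seed, and the `first_pc` helper are illustrative assumptions, not a full PCA workflow.

```python
import numpy as np

def first_pc(data):
    """Return the unit-length first principal component of the centered data."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

rng = np.random.default_rng(9)
# Two hypothetical variables; most of the variance lies along the first one
clean = rng.normal(size=(200, 2)) * np.array([3.0, 1.0])
with_outlier = np.vstack([clean, [0.0, 60.0]])  # one extreme value on the second variable

print("first PC, clean data:  ", np.round(first_pc(clean), 2))
print("first PC, with outlier:", np.round(first_pc(with_outlier), 2))
```

In the clean data the first component points along the first variable. A single extreme observation is enough to swing it toward the second, because PCA follows variance and one far-away point contributes a large share of it. The sign of a component is arbitrary; only its direction matters.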

### Agricultural Insights

Different agricultural variables have different distribution shapes:

- **Approximately Normal**: Soil pH, temperature, some nutrient levels
- **Right-Skewed**: Rainfall, pest counts, disease incidence
- **Left-Skewed**: Crop quality scores, germination rates
- **Uniform/Mixed**: Some management practices

**Know your data's distribution before analysis!**

### Next Steps

In the next notebook, we'll learn about **outlier detection** - identifying unusual values that might distort our analysis and PCA results.

---

**Remember**: Always visualize your data before analysis! A histogram can reveal insights that summary statistics miss. 📊✨