# Outlier Detection

**Identifying and Handling Unusual Observations in Agricultural Data**

---

## Introduction

**Outliers** are observations that are significantly different from other data points. They can be:
- **Errors**: Measurement mistakes, data entry errors
- **Real extreme values**: Drought years, exceptional yields, equipment failures

### Why Outliers Matter

Outliers can:
1. **Distort statistics**: Shift the mean and inflate the variance
2. **Affect correlations**: Create spurious relationships
3. **Dominate PCA**: Pull principal components toward extreme values
4. **Hide patterns**: Mask relationships in the main data
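
As a small illustration of the correlation effect listed above, the sketch below uses four made-up fields (the values are assumptions chosen so that nitrogen and yield are exactly uncorrelated) and then adds one extreme record. It is only meant to show the size of the effect, not a realistic dataset.

```python
import numpy as np

# Four hypothetical fields laid out so nitrogen and yield are exactly uncorrelated
nitrogen = np.array([80.0, 80.0, 120.0, 120.0])        # ppm
yield_kg = np.array([4500.0, 5500.0, 4500.0, 5500.0])  # kg/ha

r_clean = np.corrcoef(nitrogen, yield_kg)[0, 1]

# Add one extreme record (for example, a data entry error)
nitrogen_out = np.append(nitrogen, 300.0)
yield_out = np.append(yield_kg, 12000.0)

r_outlier = np.corrcoef(nitrogen_out, yield_out)[0, 1]

print(f"Correlation without the extreme point: {r_clean:.2f}")
print(f"Correlation with the extreme point:    {r_outlier:.2f}")
```

A single point far from the rest is enough to move the correlation from zero to a value close to 1, which is why a strong correlation should always be checked against a scatter plot.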

### Why This Matters for PCA ⭐

**PCA is sensitive to outliers** because it:
- Maximizes variance (a few extreme points can contribute most of the variance)
- Uses the covariance matrix (which is itself influenced by outliers)
- Can create components that mainly capture outlier patterns

**Bottom line**: Always check for outliers BEFORE PCA!
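
To make this concrete, here is a minimal sketch that computes the first principal component directly from the covariance matrix, with and without one extreme observation. The data, the random seed, and the helper name `first_pc` are assumptions made for illustration only.

```python
import numpy as np

def first_pc(data):
    """Return the unit eigenvector of the covariance matrix with the largest eigenvalue."""
    cov = np.cov(data, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigenvectors[:, -1]

rng = np.random.default_rng(0)
# Hypothetical two-variable data: variable 1 varies much more than variable 2
clean = np.column_stack([rng.normal(0, 3.0, 200), rng.normal(0, 0.5, 200)])
# One extreme observation along variable 2
with_outlier = np.vstack([clean, [0.0, 100.0]])

pc_clean = first_pc(clean)
pc_outlier = first_pc(with_outlier)

# Angle between the two directions (the sign of an eigenvector is arbitrary)
cos_angle = abs(pc_clean @ pc_outlier)
angle_deg = np.degrees(np.arccos(np.clip(cos_angle, 0, 1)))
print(f"PC1 without outlier: {np.round(pc_clean, 2)}")
print(f"PC1 with outlier:    {np.round(pc_outlier, 2)}")
print(f"Rotation of PC1:     {angle_deg:.1f} degrees")
```

In this setup one record out of 201 is enough to swing the first component from variable 1 toward variable 2, because that single point contributes more variance than all the other points combined.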

### Learning Objectives

By the end of this notebook, you will:

1. ✓ Understand what outliers are and why they occur
2. ✓ Detect outliers using **visual methods** (box plots, scatter plots)
3. ✓ Detect outliers using **statistical methods** (IQR, Z-score)
4. ✓ Use **robust methods** (Modified Z-score)
5. ✓ Identify **multivariate outliers** (Mahalanobis distance)
6. ✓ Decide how to **handle outliers** appropriately
7. ✓ Understand outlier impact on **PCA results**

**Agricultural Context**: We'll examine outliers in crop yield, soil properties, and weather data to learn practical detection and handling strategies.

---

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.spatial.distance import mahalanobis
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

print("✓ Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

## 1. What Are Outliers?

Let's create agricultural data with some outliers to see them in action.

### Agricultural Scenario

You have wheat yield data from 100 fields. Most fields produce 4,000-6,000 kg/ha, but:
- One field had a disease outbreak → very low yield (outlier)
- One field had exceptional conditions → very high yield (outlier)
- One yield was recorded incorrectly → data entry error (outlier)

In [None]:
# Generate wheat yield data with outliers
n_fields = 97

# Normal yields
normal_yields = np.random.normal(5000, 600, n_fields)

# Add outliers
outlier_low = np.array([1200])  # Disease outbreak
outlier_high = np.array([8500])  # Exceptional conditions
outlier_error = np.array([12000])  # Data entry error

# Combine
wheat_yields = np.concatenate([normal_yields, outlier_low, outlier_high, outlier_error])

# Create labels
labels = ['Normal'] * n_fields + ['Low Outlier', 'High Outlier', 'Error']

print("🌾 Wheat Yield Data Summary")
print("=" * 60)
print(f"Total fields: {len(wheat_yields)}")
print(f"Normal fields: {n_fields}")
print(f"Outliers: 3")
print(f"\nMean: {np.mean(wheat_yields):.1f} kg/ha")
print(f"Median: {np.median(wheat_yields):.1f} kg/ha")
print(f"Std Dev: {np.std(wheat_yields, ddof=1):.1f} kg/ha")
print(f"\nMin: {np.min(wheat_yields):.1f} kg/ha (outlier!)")
print(f"Max: {np.max(wheat_yields):.1f} kg/ha (outlier!)")
print(f"\n⚠️  Notice: Mean is pulled by outliers, median is more robust")

In [None]:
# Visualize the impact of outliers
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
fig.suptitle('🚨 Impact of Outliers on Data Visualization', fontsize=14, fontweight='bold')

# Histogram - with outliers
ax1.hist(wheat_yields, bins=30, color='#4ECDC4', alpha=0.7, edgecolor='black')
ax1.axvline(np.mean(wheat_yields), color='red', linestyle='--', linewidth=2.5, 
            label=f'Mean: {np.mean(wheat_yields):.0f}')
ax1.axvline(np.median(wheat_yields), color='blue', linestyle='--', linewidth=2.5, 
            label=f'Median: {np.median(wheat_yields):.0f}')

# Mark outliers
for outlier in [outlier_low[0], outlier_high[0], outlier_error[0]]:
    ax1.axvline(outlier, color='red', linewidth=2, alpha=0.3)
    ax1.text(outlier, ax1.get_ylim()[1]*0.9, '⚠️', ha='center', fontsize=20)

ax1.set_xlabel('Wheat Yield (kg/ha)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Frequency', fontsize=12, fontweight='bold')
ax1.set_title('WITH Outliers', fontsize=12)
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Histogram - without outliers (for comparison)
ax2.hist(normal_yields, bins=30, color='#45B7D1', alpha=0.7, edgecolor='black')
ax2.axvline(np.mean(normal_yields), color='red', linestyle='--', linewidth=2.5, 
            label=f'Mean: {np.mean(normal_yields):.0f}')
ax2.axvline(np.median(normal_yields), color='blue', linestyle='--', linewidth=2.5, 
            label=f'Median: {np.median(normal_yields):.0f}')

ax2.set_xlabel('Wheat Yield (kg/ha)', fontsize=12, fontweight='bold')
ax2.set_ylabel('Frequency', fontsize=12, fontweight='bold')
ax2.set_title('WITHOUT Outliers (Normal Data Only)', fontsize=12)
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n🎯 Key Observations:")
print(f"  WITH outliers: Mean = {np.mean(wheat_yields):.0f}, Median = {np.median(wheat_yields):.0f}")
print(f"  WITHOUT outliers: Mean = {np.mean(normal_yields):.0f}, Median = {np.median(normal_yields):.0f}")
print(f"\n  → Outliers pull the mean {np.mean(wheat_yields) - np.mean(normal_yields):.0f} kg/ha higher!")
print(f"  → Median barely changes (robust to outliers)")

---

## 2. Visual Methods for Outlier Detection

**Always start with visualization!** Visual methods are intuitive and often reveal patterns statistical tests miss.

### Box Plot Method

Box plots automatically identify outliers using the **IQR (Interquartile Range) rule**:

**Outliers are values:**
- Below: Q1 - 1.5 × IQR
- Above: Q3 + 1.5 × IQR

Where IQR = Q3 - Q1
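
The rule can be checked by hand on a tiny example. The values below are made up for illustration, and the quartiles use NumPy's default linear interpolation (other quartile conventions can give slightly different fences).

```python
import numpy as np

# Hypothetical toy measurements; 100 is the suspicious value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1, q3 = np.percentile(values, [25, 75])   # Q1 = 3, Q3 = 7
iqr = q3 - q1                               # IQR = 4
lower_fence = q1 - 1.5 * iqr                # 3 - 6 = -3
upper_fence = q3 + 1.5 * iqr                # 7 + 6 = 13

flagged = values[(values < lower_fence) | (values > upper_fence)]
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Flagged: {flagged}")
```

Only the value 100 falls outside the fences. Note that the fences depend on Q1 and Q3 alone, so the extreme value has no influence on where they sit.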

In [None]:
# Box plot for outlier detection
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
fig.suptitle('📊 Box Plot: Automatic Outlier Detection', fontsize=14, fontweight='bold')

# Vertical box plot
bp = ax1.boxplot(wheat_yields, vert=True, patch_artist=True,
                 boxprops=dict(facecolor='#4ECDC4', alpha=0.7),
                 medianprops=dict(color='red', linewidth=2.5),
                 flierprops=dict(marker='o', markerfacecolor='red', markersize=10,
                                markeredgecolor='black', linewidth=1.5))

ax1.set_ylabel('Wheat Yield (kg/ha)', fontsize=12, fontweight='bold')
ax1.set_title('Vertical Box Plot\n(Red circles = Outliers)', fontsize=12)
ax1.grid(True, alpha=0.3, axis='y')

# Calculate IQR boundaries
q1 = np.percentile(wheat_yields, 25)
q3 = np.percentile(wheat_yields, 75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Add boundary lines
ax1.axhline(lower_bound, color='orange', linestyle='--', linewidth=2, alpha=0.7,
            label=f'Lower Bound: {lower_bound:.0f}')
ax1.axhline(upper_bound, color='orange', linestyle='--', linewidth=2, alpha=0.7,
            label=f'Upper Bound: {upper_bound:.0f}')
ax1.legend(fontsize=10)

# Add annotations
ax1.text(1.15, q1, f'Q1: {q1:.0f}', fontsize=10, va='center')
ax1.text(1.15, q3, f'Q3: {q3:.0f}', fontsize=10, va='center')
ax1.text(1.15, q3 + 0.3*iqr, f'IQR: {iqr:.0f}', fontsize=10, va='center', 
         bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.5))

# Scatter plot showing all data
colors = ['red' if (y < lower_bound or y > upper_bound) else '#4ECDC4' 
          for y in wheat_yields]
sizes = [150 if (y < lower_bound or y > upper_bound) else 50 
         for y in wheat_yields]

ax2.scatter(range(len(wheat_yields)), wheat_yields, c=colors, s=sizes, 
            alpha=0.6, edgecolors='black', linewidths=1)
ax2.axhline(lower_bound, color='orange', linestyle='--', linewidth=2, alpha=0.7,
            label='Outlier Boundaries')
ax2.axhline(upper_bound, color='orange', linestyle='--', linewidth=2, alpha=0.7)
ax2.axhline(np.median(wheat_yields), color='blue', linestyle='-', linewidth=2,
            label=f'Median: {np.median(wheat_yields):.0f}')

ax2.set_xlabel('Field Number', fontsize=12, fontweight='bold')
ax2.set_ylabel('Wheat Yield (kg/ha)', fontsize=12, fontweight='bold')
ax2.set_title('All Fields\n(Red = Outliers)', fontsize=12)
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Identify outliers
outliers_mask = (wheat_yields < lower_bound) | (wheat_yields > upper_bound)
outlier_values = wheat_yields[outliers_mask]

print("\n📊 IQR Method Results")
print("=" * 60)
print(f"Q1 (25th percentile): {q1:.1f} kg/ha")
print(f"Q3 (75th percentile): {q3:.1f} kg/ha")
print(f"IQR: {iqr:.1f} kg/ha")
print(f"\nOutlier Boundaries:")
print(f"  Lower: Q1 - 1.5×IQR = {q1:.1f} - {1.5*iqr:.1f} = {lower_bound:.1f}")
print(f"  Upper: Q3 + 1.5×IQR = {q3:.1f} + {1.5*iqr:.1f} = {upper_bound:.1f}")
print(f"\n⚠️  Detected {len(outlier_values)} outliers:")
for i, val in enumerate(outlier_values, 1):
    print(f"   {i}. {val:.1f} kg/ha")

### Scatter Plot Method (Bivariate)

For relationships between two variables, scatter plots can reveal outliers that don't follow the pattern.

In [None]:
# Create bivariate agricultural data
# Nitrogen (ppm) vs Wheat Yield (kg/ha)

# Normal relationship: higher nitrogen → higher yield
nitrogen_normal = np.random.uniform(40, 160, n_fields)
yield_normal = 2500 + 20 * nitrogen_normal + np.random.normal(0, 300, n_fields)

# Add outliers
nitrogen_outlier = np.array([150, 50, 100])  # Three outlier fields
yield_outlier = np.array([1500, 7500, 2000])  # Yields don't match pattern

# Combine
nitrogen_all = np.concatenate([nitrogen_normal, nitrogen_outlier])
yield_all = np.concatenate([yield_normal, yield_outlier])

# Visualize
fig, ax = plt.subplots(figsize=(12, 7))

# Normal points
ax.scatter(nitrogen_normal, yield_normal, c='#4ECDC4', s=80, alpha=0.6, 
           edgecolors='black', linewidths=1, label='Normal Fields')

# Outlier points
ax.scatter(nitrogen_outlier, yield_outlier, c='red', s=200, alpha=0.8, 
           edgecolors='black', linewidths=2, marker='X', label='Outliers')

# Add trend line for normal data
z = np.polyfit(nitrogen_normal, yield_normal, 1)
p = np.poly1d(z)
ax.plot(nitrogen_normal, p(nitrogen_normal), "r--", linewidth=2, alpha=0.7,
        label=f'Trend: Yield = {z[1]:.0f} + {z[0]:.1f}×Nitrogen')

# Labels
ax.set_xlabel('Nitrogen (ppm)', fontsize=13, fontweight='bold')
ax.set_ylabel('Wheat Yield (kg/ha)', fontsize=13, fontweight='bold')
ax.set_title('🌾 Bivariate Outliers: Nitrogen vs Yield\nOutliers Don\'t Follow the Pattern', 
             fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

# Annotate outliers
annotations = [
    "Low yield despite\nhigh nitrogen",
    "Exceptionally high\nyield for low nitrogen",
    "Very low yield\nfor medium nitrogen"
]

for i, (n, y, ann) in enumerate(zip(nitrogen_outlier, yield_outlier, annotations)):
    ax.annotate(ann, xy=(n, y), xytext=(n + 15, y + 500),
                arrowprops=dict(arrowstyle='->', color='red', lw=2),
                fontsize=10, ha='left',
                bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7))

plt.tight_layout()
plt.show()

print("\n🎯 Bivariate Outlier Insights:")
print("  • Most fields follow trend: more nitrogen → higher yield")
print("  • Red X markers deviate significantly from pattern")
print("  • These could indicate:")
print("    - Disease/pest problems")
print("    - Soil quality issues")
print("    - Data errors")
print("\n  → ALWAYS INVESTIGATE outliers before removing!")

---

## 3. Statistical Methods for Outlier Detection

### Method 1: IQR Method (Programmatic)

The same method box plots use, but calculated explicitly.

In [None]:
def detect_outliers_iqr(data):
    """
    Detect outliers using IQR method.
    
    Outliers are values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR
    """
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    
    outliers_mask = (data < lower_bound) | (data > upper_bound)
    
    return outliers_mask, lower_bound, upper_bound

# Apply to wheat yield data
outliers_mask, lower, upper = detect_outliers_iqr(wheat_yields)

print("📊 IQR Method - Outlier Detection")
print("=" * 60)
print(f"Lower bound: {lower:.1f} kg/ha")
print(f"Upper bound: {upper:.1f} kg/ha")
print(f"\nNumber of outliers: {np.sum(outliers_mask)}")
print(f"Percentage of outliers: {100 * np.sum(outliers_mask) / len(wheat_yields):.1f}%")
print(f"\nOutlier values:")
for val in wheat_yields[outliers_mask]:
    print(f"  • {val:.1f} kg/ha")

print("\n✓ IQR Method Pros:")
print("  • Simple and intuitive")
print("  • Works well for symmetric distributions")
print("  • Widely used and understood")
print("\n⚠️  IQR Method Cons:")
print("  • Fixed threshold (not adaptive)")
print("  • Can over-flag valid points in skewed or heavy-tailed distributions")

### Method 2: Z-Score Method

**Z-score** measures how many standard deviations a value is from the mean.

**Rule**: |Z-score| > 3 indicates an outlier

$$z = \frac{x - \mu}{\sigma}$$

**Note**: This assumes approximately normal distribution!
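
One consequence of computing the mean and SD from the same data is worth seeing in numbers: in a sample of size $n$, no value can have a Z-score (using the sample SD) larger than $(n-1)/\sqrt{n}$. The sketch below uses ten made-up yields, one of them a gross error, to show that the |Z| > 3 rule cannot flag anything when $n \le 10$.

```python
import numpy as np

# Nine typical yields plus one gross error (hypothetical values, kg/ha)
yields = np.array([4800, 5100, 4950, 5200, 5050, 4900, 5000, 5150, 4850, 50000.0])

z = np.abs((yields - yields.mean()) / yields.std(ddof=1))

n = len(yields)
max_possible_z = (n - 1) / np.sqrt(n)   # upper bound on |z| for any sample of size n

print(f"Largest |z| in the sample:      {z.max():.2f}")
print(f"Largest |z| possible for n={n}: {max_possible_z:.2f}")
print(f"Flagged at |z| > 3:             {int((z > 3).sum())}")
```

The value 50000 is ten times larger than every other yield, yet it inflates the SD so much that its own Z-score stays below 3. For small samples, a robust method is usually the safer choice.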

In [None]:
def detect_outliers_zscore(data, threshold=3):
    """
    Detect outliers using Z-score method.
    
    Outliers have |z-score| > threshold (default: 3)
    """
    mean = np.mean(data)
    std = np.std(data, ddof=1)
    
    z_scores = np.abs((data - mean) / std)
    
    outliers_mask = z_scores > threshold
    
    return outliers_mask, z_scores

# Apply to wheat yield data
outliers_mask_z, z_scores = detect_outliers_zscore(wheat_yields, threshold=3)

print("📊 Z-Score Method - Outlier Detection")
print("=" * 60)
print(f"Mean: {np.mean(wheat_yields):.1f} kg/ha")
print(f"Std Dev: {np.std(wheat_yields, ddof=1):.1f} kg/ha")
print(f"\nThreshold: |Z-score| > 3")
print(f"\nNumber of outliers: {np.sum(outliers_mask_z)}")
print(f"Percentage of outliers: {100 * np.sum(outliers_mask_z) / len(wheat_yields):.1f}%")

print(f"\nOutlier details:")
outlier_indices = np.where(outliers_mask_z)[0]
for idx in outlier_indices:
    print(f"  • Value: {wheat_yields[idx]:.1f} kg/ha, Z-score: {z_scores[idx]:.2f}")

print("\n✓ Z-Score Method Pros:")
print("  • Uses standard deviations (familiar concept)")
print("  • Threshold adjustable")
print("  • Good for normal distributions")
print("\n⚠️  Z-Score Method Cons:")
print("  • Assumes normal distribution")
print("  • Mean and SD affected by outliers (circular problem!)")
print("  • May miss outliers if they inflate SD")

### Method 3: Modified Z-Score (Robust)

Uses **median** and **MAD (Median Absolute Deviation)** instead of mean and SD.

**More robust to outliers!**

$$\text{Modified Z-score} = \frac{0.6745 \times (x - \text{median})}{\text{MAD}}$$

Where MAD = median(|x - median(x)|)

**Rule**: |Modified Z-score| > 3.5 indicates an outlier
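
The constant 0.6745 is the 75th percentile of the standard normal distribution. For normally distributed data, MAD / 0.6745 estimates the standard deviation, which puts the modified Z-score on the same scale as an ordinary Z-score. The sketch below checks this on simulated data (the seed and parameters are arbitrary assumptions) and compares it with `scipy.stats.median_abs_deviation`, assuming SciPy 1.5 or newer.

```python
import numpy as np
from scipy import stats

# 0.6745 is the 75th percentile of the standard normal distribution
constant = stats.norm.ppf(0.75)

rng = np.random.default_rng(1)
sample = rng.normal(5000, 600, 10_000)   # hypothetical yields with true SD = 600

mad = np.median(np.abs(sample - np.median(sample)))
sigma_from_mad = mad / constant

# SciPy offers the same rescaling via scale='normal'
sigma_scipy = stats.median_abs_deviation(sample, scale='normal')

print(f"norm.ppf(0.75):        {constant:.4f}")
print(f"Sample SD:             {sample.std(ddof=1):.1f}")
print(f"MAD / 0.6745:          {sigma_from_mad:.1f}")
print(f"SciPy normalized MAD:  {sigma_scipy:.1f}")
```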

In [None]:
def detect_outliers_modified_zscore(data, threshold=3.5):
    """
    Detect outliers using Modified Z-score (robust method).
    
    Uses median and MAD instead of mean and SD.
    """
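    median = np.median(data)
    mad = np.median(np.abs(data - median))
    if mad == 0:
        # More than half the values equal the median, so the score is undefined
        raise ValueError("MAD is zero; modified Z-score is undefined for this data")
    
    # Modified z-scores
    modified_z_scores = 0.6745 * (data - median) / mad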
    
    outliers_mask = np.abs(modified_z_scores) > threshold
    
    return outliers_mask, modified_z_scores

# Apply to wheat yield data
outliers_mask_mod, mod_z_scores = detect_outliers_modified_zscore(wheat_yields, threshold=3.5)

print("📊 Modified Z-Score Method - Outlier Detection")
print("=" * 60)
print(f"Median: {np.median(wheat_yields):.1f} kg/ha")
median_val = np.median(wheat_yields)
mad = np.median(np.abs(wheat_yields - median_val))
print(f"MAD (Median Absolute Deviation): {mad:.1f} kg/ha")
print(f"\nThreshold: |Modified Z-score| > 3.5")
print(f"\nNumber of outliers: {np.sum(outliers_mask_mod)}")
print(f"Percentage of outliers: {100 * np.sum(outliers_mask_mod) / len(wheat_yields):.1f}%")

print(f"\nOutlier details:")
outlier_indices = np.where(outliers_mask_mod)[0]
for idx in outlier_indices:
    print(f"  • Value: {wheat_yields[idx]:.1f} kg/ha, Modified Z: {mod_z_scores[idx]:.2f}")

print("\n✓ Modified Z-Score Method Pros:")
print("  • ROBUST to outliers (median and MAD barely affected)")
print("  • Better for skewed distributions")
print("  • Recommended for agricultural data")
print("\n⚠️  Modified Z-Score Method Cons:")
print("  • Slightly more complex to calculate")
print("  • Less familiar to some users")

### Comparison of Methods

In [None]:
# Compare all three methods
fig, ax = plt.subplots(figsize=(12, 6))

# Apply all methods
mask_iqr, _, _ = detect_outliers_iqr(wheat_yields)
mask_z, _ = detect_outliers_zscore(wheat_yields)
mask_mod, _ = detect_outliers_modified_zscore(wheat_yields)

# Create color array
colors = []
for i in range(len(wheat_yields)):
    if mask_iqr[i] and mask_z[i] and mask_mod[i]:
        colors.append('red')  # All three methods
    elif mask_iqr[i] or mask_z[i] or mask_mod[i]:
        colors.append('orange')  # At least one method
    else:
        colors.append('#4ECDC4')  # Not an outlier

sizes = [150 if c in ['red', 'orange'] else 50 for c in colors]

ax.scatter(range(len(wheat_yields)), wheat_yields, c=colors, s=sizes,
           alpha=0.6, edgecolors='black', linewidths=1)

# Legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='red', edgecolor='black', label='All 3 methods agree'),
    Patch(facecolor='orange', edgecolor='black', label='At least 1 method'),
    Patch(facecolor='#4ECDC4', edgecolor='black', label='Not an outlier')
]

ax.legend(handles=legend_elements, fontsize=11)
ax.set_xlabel('Field Number', fontsize=13, fontweight='bold')
ax.set_ylabel('Wheat Yield (kg/ha)', fontsize=13, fontweight='bold')
ax.set_title('⚖️ Comparing Outlier Detection Methods', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary table
print("\n📊 Method Comparison Summary")
print("=" * 60)
print(f"IQR Method:           {np.sum(mask_iqr)} outliers ({100*np.sum(mask_iqr)/len(wheat_yields):.1f}%)")
print(f"Z-Score Method:       {np.sum(mask_z)} outliers ({100*np.sum(mask_z)/len(wheat_yields):.1f}%)")
print(f"Modified Z-Score:     {np.sum(mask_mod)} outliers ({100*np.sum(mask_mod)/len(wheat_yields):.1f}%)")
print(f"\nAll 3 agree:          {np.sum(mask_iqr & mask_z & mask_mod)} outliers")
print(f"At least 1 method:    {np.sum(mask_iqr | mask_z | mask_mod)} outliers")

print("\n💡 Recommendation:")
print("  • Use MULTIPLE methods for robust detection")
print("  • If methods agree → strong outlier")
print("  • For agricultural data: Modified Z-score + IQR")

---

## 4. Multivariate Outlier Detection ⭐

**Critical for PCA!** A point might not be an outlier in any single variable, but could be an outlier in multivariate space.

### Mahalanobis Distance

Measures how far a point is from the center of the distribution, accounting for correlations.

**Formula:**

$$D = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$$

Where:
- $x$ = data point
- $\mu$ = mean vector
- $S$ = covariance matrix

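**Rule**: Points with Mahalanobis distance above a threshold are outliers. If the data are approximately multivariate normal, $D^2$ follows a chi-square distribution with degrees of freedom equal to the number of variables, so a common threshold is the square root of a chi-square quantile (for example, the 95% quantile).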
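
As a sketch of how the formula maps to code, the example below computes the distance for all rows at once with `np.einsum` and cross-checks it against `scipy.spatial.distance.mahalanobis`, which takes the inverse covariance matrix as its third argument. The simulated pH and nitrogen data and the seed are assumptions for illustration.

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(2)
# Hypothetical pH and nitrogen (ppm) measurements for 50 fields
X = rng.multivariate_normal([6.5, 100], [[0.3, 0.4], [0.4, 400]], size=50)

center = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))

# Vectorized: D_i = sqrt((x_i - mu)^T S^{-1} (x_i - mu)) for every row at once
diff = X - center
d_vectorized = np.sqrt(np.einsum('ij,jk,ik->i', diff, inv_cov, diff))

# Cross-check against SciPy's single-pair function
d_scipy = np.array([mahalanobis(row, center, inv_cov) for row in X])

# Cutoff: square root of the chi-square quantile with df = number of variables
cutoff = np.sqrt(stats.chi2.ppf(0.95, df=X.shape[1]))
print(f"Max difference between methods: {np.max(np.abs(d_vectorized - d_scipy)):.2e}")
print(f"Cutoff for 2 variables at 95%:  {cutoff:.2f}")
print(f"Points beyond cutoff:           {int((d_vectorized > cutoff).sum())}")
```

Even with clean normal data, roughly 5% of points are expected to exceed a 95% cutoff, so a flagged point is a candidate for investigation rather than proof of a problem.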

In [None]:
# Create multivariate agricultural data
np.random.seed(42)
n = 100

# Normal correlated data: pH and Nitrogen
mean = [6.5, 100]
cov = [[0.3, 0.4], [0.4, 400]]  # Positive correlation
normal_data = np.random.multivariate_normal(mean, cov, n-3)

# Add multivariate outliers
outliers = np.array([
    [8.0, 50],   # High pH, low N
    [5.0, 180],  # Low pH, high N
    [7.5, 180]   # Both high
])

# Combine
data_multi = np.vstack([normal_data, outliers])
pH_multi = data_multi[:, 0]
N_multi = data_multi[:, 1]

# Calculate Mahalanobis distance
mean_vec = np.mean(data_multi, axis=0)
cov_matrix = np.cov(data_multi.T)
inv_cov = np.linalg.inv(cov_matrix)

mahal_distances = []
for point in data_multi:
    diff = point - mean_vec
    mahal_dist = np.sqrt(diff @ inv_cov @ diff.T)
    mahal_distances.append(mahal_dist)

mahal_distances = np.array(mahal_distances)

# Set threshold: sqrt of the chi-square 95% quantile with 2 degrees of freedom
threshold = np.sqrt(stats.chi2.ppf(0.95, df=2))  # ~2.45
outliers_mask_mahal = mahal_distances > threshold

print("📊 Mahalanobis Distance - Multivariate Outlier Detection")
print("=" * 60)
print(f"Number of variables: 2 (pH, Nitrogen)")
print(f"Threshold: {threshold:.2f}")
print(f"\nNumber of multivariate outliers: {np.sum(outliers_mask_mahal)}")
print(f"Percentage: {100*np.sum(outliers_mask_mahal)/len(data_multi):.1f}%")

print(f"\nOutlier details:")
for idx in np.where(outliers_mask_mahal)[0]:
    print(f"  • pH={pH_multi[idx]:.2f}, N={N_multi[idx]:.1f} ppm, Distance={mahal_distances[idx]:.2f}")

In [None]:
# Visualize multivariate outliers
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
fig.suptitle('🎯 Multivariate Outlier Detection with Mahalanobis Distance', 
             fontsize=14, fontweight='bold')

# Scatter plot with outliers marked
colors_mahal = ['red' if outlier else '#4ECDC4' for outlier in outliers_mask_mahal]
sizes_mahal = [150 if outlier else 60 for outlier in outliers_mask_mahal]

ax1.scatter(pH_multi, N_multi, c=colors_mahal, s=sizes_mahal, 
            alpha=0.6, edgecolors='black', linewidths=1)

# Add 95% confidence ellipse (axes scaled by the chi-square quantile, not 2 SD)
from matplotlib.patches import Ellipse
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)  # eigh: symmetric matrix
angle = np.degrees(np.arctan2(eigenvectors[1, 0], eigenvectors[0, 0]))
ellipse_scale = np.sqrt(stats.chi2.ppf(0.95, df=2))  # ~2.45
width, height = 2 * ellipse_scale * np.sqrt(eigenvalues)
ellipse = Ellipse(mean_vec, width, height, angle=angle, 
                  facecolor='none', edgecolor='blue', linewidth=2, linestyle='--',
                  label='95% confidence region')
ax1.add_patch(ellipse)

ax1.scatter([mean_vec[0]], [mean_vec[1]], c='blue', s=200, marker='X', 
            edgecolors='black', linewidths=2, label='Center (mean)', zorder=5)

ax1.set_xlabel('Soil pH', fontsize=12, fontweight='bold')
ax1.set_ylabel('Nitrogen (ppm)', fontsize=12, fontweight='bold')
ax1.set_title('Scatter Plot\n(Red = Multivariate Outliers)', fontsize=12)
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Mahalanobis distance plot
ax2.bar(range(len(mahal_distances)), mahal_distances, 
        color=['red' if d > threshold else '#4ECDC4' for d in mahal_distances],
        alpha=0.7, edgecolor='black')
ax2.axhline(threshold, color='orange', linestyle='--', linewidth=2.5, 
            label=f'Threshold: {threshold:.2f}')

ax2.set_xlabel('Sample Index', fontsize=12, fontweight='bold')
ax2.set_ylabel('Mahalanobis Distance', fontsize=12, fontweight='bold')
ax2.set_title('Mahalanobis Distances\n(Above threshold = Outlier)', fontsize=12)
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n🎯 Key Insights:")
print("  • Red points lie outside the 95% confidence ellipse")
print("  • These points are unusual in the JOINT distribution")
print("  • They might not be outliers in pH or N individually")
print("  • But they don't follow the pH-N relationship")
print("\n⚠️  FOR PCA: These multivariate outliers can dominate components!")

---

## 5. Handling Outliers: What to Do?

**NEVER automatically remove outliers!** Follow this decision framework:

### Step 1: Investigate

Ask:
- Is it a **data error**? (typo, sensor failure, etc.)
- Is it a **real extreme event**? (drought, exceptional conditions)
- Does it provide **valuable information**? (rare but important)

### Step 2: Decide on Action

1. **If data error → Remove or correct**
2. **If real but not relevant → Remove (with justification)**
3. **If real and important ‚Üí Keep, but consider:**
   - Robust methods (Modified Z-score, Robust PCA)
   - Separate analysis for outliers
   - Transformation (log, sqrt)
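
To illustrate the transformation option, the sketch below counts IQR-rule flags for a right-skewed variable before and after a log transform. The lognormal data, the seed, and the helper name `count_iqr_outliers` are assumptions for illustration; a log transform only applies to strictly positive values.

```python
import numpy as np

def count_iqr_outliers(values):
    """Count values outside the 1.5 x IQR fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return int(((values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)).sum())

rng = np.random.default_rng(3)
# Hypothetical right-skewed variable, e.g. a soil nutrient concentration (ppm)
concentration = rng.lognormal(mean=3.0, sigma=0.8, size=500)

n_raw = count_iqr_outliers(concentration)
n_log = count_iqr_outliers(np.log(concentration))

print(f"Flagged on the raw scale: {n_raw}")
print(f"Flagged on the log scale: {n_log}")
```

Many of the points flagged on the raw scale are simply the long right tail of a skewed distribution. After the transform, far fewer points stand out, and those that remain are better candidates for investigation.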

### Step 3: Document

Always document:
- How many outliers detected
- Which method used
- Why removed/kept
- Impact on results

In [None]:
# Decision framework example
print("🔍 Outlier Investigation Framework")
print("=" * 70)

# Assume we investigated our wheat yield outliers
investigations = [
    {
        'value': 12000,
        'finding': 'Data entry error (1200 entered as 12000)',
        'action': 'CORRECT to 1200',
        'justification': 'Clear typo, correctable'
    },
    {
        'value': 8500,
        'finding': 'Exceptional growing conditions + optimal management',
        'action': 'KEEP',
        'justification': 'Real data, represents achievable potential'
    },
    {
        'value': 1200,
        'finding': 'Severe disease outbreak documented in field notes',
        'action': 'KEEP but FLAG',
        'justification': 'Real data, important for risk assessment'
    }
]

for i, inv in enumerate(investigations, 1):
    print(f"\nOutlier {i}: {inv['value']:.0f} kg/ha")
    print(f"  Investigation: {inv['finding']}")
    print(f"  Decision: {inv['action']}")
    print(f"  Justification: {inv['justification']}")

print("\n" + "=" * 70)
print("\n✓ Best Practices:")
print("  1. ALWAYS investigate before removing")
print("  2. Document your decisions")
print("  3. Consider separate analysis with/without outliers")
print("  4. Report outlier handling in methods section")
print("  5. For PCA: Try both with/without to see impact")

---

## 6. Summary and Key Insights

### What We Learned

1. **Outlier Definition**
   - Values significantly different from other observations
   - Can be errors OR real extreme values
   - Impact statistics, correlations, and PCA

2. **Visual Detection Methods**
   - Box plots: Automatic using IQR rule
   - Scatter plots: Bivariate outliers
   - Always start with visualization!

3. **Statistical Detection Methods**
   - **IQR method**: Flags values outside Q1 - 1.5×IQR to Q3 + 1.5×IQR
   - **Z-score method**: |Z| > 3 (assumes normality)
   - **Modified Z-score**: Robust to outliers ✓ Recommended

4. **Multivariate Outliers** ⭐ Critical for PCA
   - Mahalanobis distance accounts for correlations
   - Points can be outliers in multivariate space
   - Essential to check before PCA!

5. **Handling Outliers**
   - Investigate FIRST
   - Document decisions
   - Consider multiple approaches

### Connection to PCA ⭐⭐

**Outliers can DOMINATE PCA results because:**

1. PCA maximizes variance → outliers contribute disproportionately to variance
2. Principal components can point toward outliers
3. Loadings affected by extreme values
4. Interpretation becomes misleading

**ALWAYS check for outliers before PCA!**

```python
# Outlier detection workflow before PCA
# 1. Visual inspection
# 2. Univariate detection (IQR, Modified Z-score)
# 3. Multivariate detection (Mahalanobis)
# 4. Investigate and handle
# 5. THEN perform PCA
```

### Agricultural Insights

Common outlier sources in agriculture:
- **Measurement errors**: Sensor failures, data entry mistakes
- **Extreme weather**: Droughts, floods, heat waves
- **Disease/pests**: Severe outbreaks
- **Management practices**: Experimental treatments, equipment failures
- **Soil variability**: Unusual soil pockets

**Always consider agricultural context when handling outliers!**

### Recommended Workflow

1. **Visualize**: Box plots, scatter plots
2. **Detect**: Modified Z-score + IQR method
3. **Multivariate**: Mahalanobis distance
4. **Investigate**: Why are they outliers?
5. **Decide**: Remove/keep/transform
6. **Document**: Record all decisions
7. **Compare**: Analysis with/without outliers
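
The detection steps of this workflow (steps 2 and 3) can be sketched as a single screening function. This is a hypothetical helper, not part of any library: the name `screen_outliers`, its parameters, and the simulated data are assumptions. It only returns flags and never drops rows, since investigating, deciding, and documenting remain manual steps.

```python
import numpy as np
import pandas as pd
from scipy import stats

def screen_outliers(data, mod_z_threshold=3.5, chi2_level=0.95):
    """Flag rows by modified Z-score, IQR fences, and Mahalanobis distance.

    Hypothetical helper: returns boolean flags only and never drops rows.
    """
    flags = pd.DataFrame(index=data.index)

    # Step 2: univariate checks, column by column
    med = data.median()
    mad = (data - med).abs().median()
    mod_z = 0.6745 * (data - med) / mad
    flags['modified_z'] = (mod_z.abs() > mod_z_threshold).any(axis=1)

    q1, q3 = data.quantile(0.25), data.quantile(0.75)
    iqr = q3 - q1
    outside = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
    flags['iqr'] = outside.any(axis=1)

    # Step 3: multivariate check
    diff = (data - data.mean()).to_numpy()
    inv_cov = np.linalg.inv(np.cov(data.to_numpy(), rowvar=False))
    dist = np.sqrt(np.einsum('ij,jk,ik->i', diff, inv_cov, diff))
    flags['mahalanobis'] = dist > np.sqrt(stats.chi2.ppf(chi2_level, df=data.shape[1]))

    # Number of methods that flagged each row
    flags['n_methods'] = flags.sum(axis=1)
    return flags

rng = np.random.default_rng(4)
fields = pd.DataFrame(
    rng.multivariate_normal([6.5, 100], [[0.3, 0.4], [0.4, 400]], size=97),
    columns=['pH', 'nitrogen_ppm'],
)
fields.loc[len(fields)] = [8.0, 300.0]   # one planted extreme record

report = screen_outliers(fields)
print(report[report['n_methods'] > 0])
```

Rows flagged by all three methods are the strongest candidates for investigation. Rows flagged by only one method deserve a look at the plots before any decision is made.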

### Next Steps

**Congratulations!** You've completed all 6 fundamental notebooks on descriptive statistics!

You now understand:
- ✓ Central tendency (mean, median, mode)
- ✓ Spread (variance, SD, IQR)
- ✓ Relationships (covariance, correlation) ⭐⭐
- ✓ Scaling (standardization) ⭐⭐
- ✓ Distributions (skewness, kurtosis, normality)
- ✓ Outliers (detection and handling)

**You're ready for Phase 2**: Building these concepts from scratch with NumPy!

After Phase 2, you'll learn professional tools (SciPy, pandas), and then apply everything to real agricultural datasets.

**Final goal**: Prepare agricultural data for PCA analysis!

---

**Remember**: Outliers are not always bad - they can reveal important patterns! Investigate before removing. 🔍✨