# Statistical Analysis of Irrigation Values Inside and Outside Dam Command Areas (2000 & 2015)

This notebook performs descriptive statistics, visualizations, and hypothesis testing on irrigation pixel values extracted from inside and outside dam command areas for the years 2000 and 2015. The analysis follows advisor feedback and is structured as follows:

**Outline:**
1. Data Loading
2. Descriptive Statistics
3. Distribution Visualization
4. Statistical Hypothesis Testing

---

In [None]:
# Task 1: Data Loading
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import mannwhitneyu
import os
import sys
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
# Add the project root to sys.path so we can import from Code.utils everywhere
project_root = os.path.abspath(os.path.join(os.getcwd(), '..', '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)
from Code.utils.utility import load_config, resolve_path, ssa_iso, africa_iso_countries, africa_iso_countries_filtered
from Code.utils.spatial_utility import load_and_reproject, optimized_clip

# Load configuration
config = load_config()

# Load the four CSV files using resolve_path and config (for consistency with project I/O)
inside_2000 = pd.read_csv(resolve_path(config['Irrigation_Inside_CA_2000_csv_path']))
outside_2000 = pd.read_csv(resolve_path(config['Irrigation_Outside_CA_2000_csv_path']))
inside_2015 = pd.read_csv(resolve_path(config['Irrigation_Inside_CA_2015_csv_path']))
outside_2015 = pd.read_csv(resolve_path(config['Irrigation_Outside_CA_2015_csv_path']))

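# Ensure the value column is named 'irrigation_value'.
# Note: this assumes each CSV holds exactly one column; assigning a
# one-element list to df.columns raises ValueError otherwise.
for df in [inside_2000, outside_2000, inside_2015, outside_2015]:
    if 'irrigation_value' not in df.columns:
        df.columns = ['irrigation_value']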

print('Data loaded successfully. Sample rows:')
print('Inside 2000:', inside_2000.head(3), sep='\n')
print('Outside 2000:', outside_2000.head(3), sep='\n')
print('Inside 2015:', inside_2015.head(3), sep='\n')
print('Outside 2015:', outside_2015.head(3), sep='\n')

FileNotFoundError: [Errno 2] No such file or directory: '/home/waves/data/Africa_Irrigation/Irrigation_Inside_CA_2000_csv_path'

## Task 2: Descriptive Statistics
For each group (inside/outside, 2000/2015), we calculate mean, median, standard deviation, variance, min, max, and quartiles. Variance is highlighted as a key metric.

In [None]:
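def print_descriptive_stats(df, label):
    """Print summary statistics for the 'irrigation_value' column of df."""
    # Request the median explicitly so the '50%' key is always present
    desc = df['irrigation_value'].describe(percentiles=[0.25, 0.5, 0.75])
    variance = df['irrigation_value'].var()  # sample variance (ddof=1)
    print(f'--- Descriptive Statistics for {label} ---')
    # The keys of `desc` are strings, so they must be quoted inside the f-strings
    print(f'Mean: {desc["mean"]:.4f}')
    print(f'Median: {desc["50%"]:.4f}')
    print(f'Std Dev: {desc["std"]:.4f}')
    print(f'Variance: {variance:.4f}')
    print(f'Min: {desc["min"]:.4f}')
    print(f'Q1 (25th): {desc["25%"]:.4f}')
    print(f'Q3 (75th): {desc["75%"]:.4f}')
    print(f'Max: {desc["max"]:.4f}')
    print()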

print_descriptive_stats(inside_2000, 'Irrigation Inside Polygons - 2000')
print_descriptive_stats(outside_2000, 'Irrigation Outside Polygons - 2000')
print_descriptive_stats(inside_2015, 'Irrigation Inside Polygons - 2015')
print_descriptive_stats(outside_2015, 'Irrigation Outside Polygons - 2015')

**Variance Comparison:**
- Compare the variance between inside and outside groups for each year.
- Higher variance may indicate more heterogeneity in irrigation values.
- This is a key point for advisor feedback.
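
The variance comparison described above could be made quantitative with a small helper like the one below. This is a minimal sketch, not part of the original analysis: the function name `compare_variance` is hypothetical, and the choice of the Brown-Forsythe test (`scipy.stats.levene` with `center="median"`) is an assumption, made because it is less sensitive to skewed data than a plain F-test.

```python
import numpy as np
import pandas as pd
from scipy.stats import levene


def compare_variance(df_in, df_out, column="irrigation_value"):
    """Compare sample variances of two groups (hypothetical helper).

    Returns both variances, their ratio (inside / outside), and the
    Brown-Forsythe test statistic and p-value for equal variances.
    """
    x = df_in[column].dropna().to_numpy(dtype=float)
    y = df_out[column].dropna().to_numpy(dtype=float)

    var_in = np.var(x, ddof=1)   # sample variance, matches pandas .var()
    var_out = np.var(y, ddof=1)

    # center="median" gives the Brown-Forsythe variant of Levene's test
    stat, p_val = levene(x, y, center="median")

    # Guard against division by zero when the outside group is constant
    ratio = var_in / var_out if var_out > 0 else float("nan")
    return {
        "var_inside": float(var_in),
        "var_outside": float(var_out),
        "ratio": float(ratio),
        "levene_stat": float(stat),
        "p_value": float(p_val),
    }


# Usage in this notebook, assuming the four DataFrames loaded correctly:
# print(compare_variance(inside_2000, outside_2000))
# print(compare_variance(inside_2015, outside_2015))
```

A ratio well above or below 1 points to one group being more heterogeneous than the other; the p-value only indicates whether that difference is distinguishable from sampling noise.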

## Task 3: Distribution Visualization
We plot histograms and KDEs for inside and outside groups for each year. Both distributions are shown on the same axes for direct comparison.

In [None]:
def plot_distribution(df_in, df_out, year, bins=50):
    plt.figure(figsize=(10,6))
    sns.histplot(df_in['irrigation_value'], bins=bins, color='green', label='Inside Polygons', kde=True, stat='density', alpha=0.5)
    sns.histplot(df_out['irrigation_value'], bins=bins, color='orange', label='Outside Polygons', kde=True, stat='density', alpha=0.5)
    plt.xlabel('Irrigation Value')
    plt.ylabel('Density')
    plt.title(f'Distribution of Irrigation Values - {year}')
    plt.legend()
    plt.tight_layout()
    plt.savefig(f'irrigation_distribution_{year}.png', dpi=300)
    plt.show()

plot_distribution(inside_2000, outside_2000, 2000)
plot_distribution(inside_2015, outside_2015, 2015)

**Interpretation:**
- Visually inspect the plots for overlap, skewness, and spread.
- Note any clear differences or similarities between inside and outside distributions for each year.
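
To back up the visual inspection with numbers, skewness and histogram overlap can be computed directly. The sketch below is an illustration under stated assumptions: `distribution_summary` is a hypothetical name, and the overlap coefficient (the shared area under the two density histograms on common bins) is one of several possible overlap measures, not one prescribed by this notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew


def distribution_summary(df_in, df_out, column="irrigation_value", bins=50):
    """Summarize skewness, spread, and overlap of two samples (hypothetical helper)."""
    x = df_in[column].dropna().to_numpy(dtype=float)
    y = df_out[column].dropna().to_numpy(dtype=float)

    # Shared bin edges so the two density histograms are directly comparable
    edges = np.histogram_bin_edges(np.concatenate([x, y]), bins=bins)
    dens_in, _ = np.histogram(x, bins=edges, density=True)
    dens_out, _ = np.histogram(y, bins=edges, density=True)

    # Overlap coefficient: area under the pointwise minimum of the two densities
    # (1.0 means identical histograms, 0.0 means no shared bins)
    overlap = float(np.sum(np.minimum(dens_in, dens_out) * np.diff(edges)))

    def iqr(values):
        q1, q3 = np.percentile(values, [25, 75])
        return float(q3 - q1)

    return {
        "skew_inside": float(skew(x)),
        "skew_outside": float(skew(y)),
        "iqr_inside": iqr(x),
        "iqr_outside": iqr(y),
        "overlap": overlap,
    }


# Usage in this notebook, assuming the four DataFrames loaded correctly:
# print(distribution_summary(inside_2000, outside_2000))
# print(distribution_summary(inside_2015, outside_2015))
```

The overlap value depends on the number of bins, so it is best read as a rough companion to the plots rather than a standalone statistic.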

## Task 4: Statistical Hypothesis Testing
We use the Mann-Whitney U test to formally compare the distributions of irrigation values inside and outside polygons for each year.

**Null Hypothesis (H0):** The distributions of irrigation values inside and outside polygons are the same.

**Alternative Hypothesis (Ha):** The distributions are different.

In [None]:
def mannwhitney_test(df_in, df_out, year):
    u_stat, p_val = mannwhitneyu(df_in['irrigation_value'], df_out['irrigation_value'], alternative='two-sided')
    print(f'--- Mann-Whitney U Test for {year} ---')
    print(f'U-statistic: {u_stat:.2f}')
    print(f'p-value: {p_val:.4e}')
    if p_val < 0.05:
        print('Result: Statistically significant difference (p < 0.05)')
    else:
        print('Result: No statistically significant difference (p >= 0.05)')
    print()

mannwhitney_test(inside_2000, outside_2000, 2000)
mannwhitney_test(inside_2015, outside_2015, 2015)

**Interpretation:**
- A p-value less than 0.05 suggests a statistically significant difference between the distributions.
- Discuss what this means in the context of irrigation inside vs. outside dam command areas.
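
When the samples are large, as pixel-level extracts often are, even a negligible difference can produce a very small p-value, so it may help to report an effect size next to the test result. The sketch below is an assumption-laden illustration, not part of the original analysis: `rank_biserial` is a hypothetical helper that computes the U statistic for the inside group from pooled ranks and converts it to the common-language effect size and the rank-biserial correlation.

```python
import numpy as np
import pandas as pd
from scipy.stats import rankdata


def rank_biserial(df_in, df_out, column="irrigation_value"):
    """Effect sizes for a two-sample rank comparison (hypothetical helper).

    cles          : estimate of P(inside > outside) + 0.5 * P(inside == outside)
    rank_biserial : 2 * cles - 1, ranging from -1 to 1 (0 means no shift)
    """
    x = df_in[column].dropna().to_numpy(dtype=float)
    y = df_out[column].dropna().to_numpy(dtype=float)
    n1, n2 = len(x), len(y)

    # Rank the pooled sample; ties receive their average rank
    ranks = rankdata(np.concatenate([x, y]))

    # U statistic of the first (inside) sample: rank sum minus its minimum value
    u1 = ranks[:n1].sum() - n1 * (n1 + 1) / 2

    cles = u1 / (n1 * n2)
    return {
        "U_inside": float(u1),
        "cles": float(cles),
        "rank_biserial": float(2 * cles - 1),
    }


# Usage in this notebook, assuming the four DataFrames loaded correctly:
# print(rank_biserial(inside_2000, outside_2000))
# print(rank_biserial(inside_2015, outside_2015))
```

A rank-biserial value near 0 alongside a significant p-value would suggest the difference is statistically detectable but small in practical terms.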

---
*Prepared for advisor review. Please see code comments and outputs for further interpretation and next steps.*