# Task 3: Cross-Country Comparison

Objective: Synthesize the cleaned datasets from Benin, Sierra Leone, and Togo to identify relative solar potential and key differences across countries.

- Data paths (relative to the repository root; the notebook reads them via `../data`): `data/benin_clean.csv`, `data/sierra_leone_clean.csv`, `data/togo_clean.csv`
- Metrics: GHI, DNI, DHI
- Outputs: boxplots per metric, a summary table (mean/median/std), one-way ANOVA and Kruskal–Wallis p-values for GHI, key observations, and a bar chart ranking countries by average GHI.

In [None]:
# Imports
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

sns.set_theme(style="whitegrid")  # set_theme is the current name; sns.set is its alias
DATA_DIR = os.path.join("..", "data")
FILES = {
    "Benin": "benin_clean.csv",
    "Sierra Leone": "sierra_leone_clean.csv",
    "Togo": "togo_clean.csv",
}
METRICS = ["GHI", "DNI", "DHI"]

In [None]:
# Load datasets
frames = []
missing = []
for country, fname in FILES.items():
    fpath = os.path.join(DATA_DIR, fname)
    if not os.path.exists(fpath):
        missing.append((country, fpath))
        continue
    df = pd.read_csv(fpath)
    df["Country"] = country
    frames.append(df)

if missing:
    print("Warning: Missing files:")
    for c, p in missing:
        print(f" - {c}: {p}")

if frames:
    data = pd.concat(frames, ignore_index=True)
    # Keep only the expected metrics, Country, and any region-like columns that are present
    optional_cols = ["region", "Region", "admin_name", "site", "location"]
    wanted = set(METRICS + ["Country"] + optional_cols)
    keep_cols = [c for c in data.columns if c in wanted]
    data = data[keep_cols]
    display(data.head())
else:
    data = pd.DataFrame()
    print("No data loaded.")

In [None]:
# Boxplots: one plot per metric, colored by country
if not data.empty:
    for metric in METRICS:
        if metric not in data.columns:
            print(f"Skipping {metric}: not found in data columns")
            continue
        plt.figure(figsize=(7, 5))
        sns.boxplot(data=data, x="Country", y=metric, palette="Set2")
        plt.title(f"{metric} distribution by Country")
        plt.xlabel("Country")
        plt.ylabel(metric)
        plt.tight_layout()
        plt.show()
else:
    print("No data to plot.")

In [None]:
# Summary table: mean, median, std for each metric by country
if not data.empty:
    summaries = []
    for metric in METRICS:
        if metric in data.columns:
            grp = data.groupby("Country")[metric].agg(["mean", "median", "std"]).reset_index()
            grp.insert(1, "Metric", metric)
            summaries.append(grp)
    if summaries:
        summary_table = pd.concat(summaries, ignore_index=True)
        display(summary_table)
    else:
        print("No metric columns found for summary.")
else:
    print("No data loaded.")

In [None]:
# Statistical testing on GHI: one-way ANOVA and Kruskal-Wallis are both run.
# Prefer the Kruskal-Wallis result if normality or equal variances look doubtful.
if not data.empty and all(m in data.columns for m in ["GHI", "Country"]):
    # Prepare one GHI array per country; names and groups share the groupby order
    country_names, groups = [], []
    for name, grp in data.groupby("Country"):
        country_names.append(name)
        groups.append(grp["GHI"].dropna().values)

    p_anova = np.nan
    p_kruskal = np.nan
    try:
        f_stat, p_anova = stats.f_oneway(*groups)
        print(f"ANOVA p-value: {p_anova:.4g}")
    except Exception as e:
        print("ANOVA failed:", e)

    try:
        h_stat, p_kruskal = stats.kruskal(*groups)
        print(f"Kruskal–Wallis p-value: {p_kruskal:.4g}")
    except Exception as e:
        print("Kruskal–Wallis failed:", e)
else:
    print("Insufficient data for statistical testing.")

In [None]:
# Visual summary: Bar chart ranking countries by average GHI
if not data.empty and "GHI" in data.columns:
    avg_ghi = data.groupby("Country")["GHI"].mean().sort_values(ascending=False)
    plt.figure(figsize=(6,4))
    sns.barplot(x=avg_ghi.values, y=avg_ghi.index, palette="Set2")
    plt.xlabel("Average GHI")
    plt.ylabel("Country")
    plt.title("Average GHI by Country")
    plt.tight_layout()
    plt.show()
else:
    print("No GHI data available for bar chart.")

## Key Observations

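- Benin, Sierra Leone, and Togo are all included in the plots and summaries. After running the analysis above, note which country has the highest median GHI and whether variability (std, box height, outliers) differs across countries.
- Check the ANOVA and Kruskal–Wallis p-values reported above. If p < 0.05, the GHI differences across countries are statistically significant, meaning at least one country differs from the others; the tests do not say which one. ANOVA compares means and assumes roughly normal, equal-variance groups, so prefer the Kruskal–Wallis result when the boxplots show strong skew or unequal spread.
- Compare the average-GHI bar chart with the boxplots: identify the country with the highest mean GHI and check whether it also shows a wider spread or more outliers.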
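When deciding which of the two p-values to rely on, one option is to check the ANOVA assumptions explicitly. The sketch below is not part of the notebook: `choose_group_test` is a hypothetical helper, the 0.05 threshold and the 5000-point Shapiro–Wilk subsample are assumptions, and the demo data is synthetic rather than taken from the country files.

```python
import numpy as np
from scipy import stats


def choose_group_test(groups, alpha=0.05, max_shapiro_n=5000, seed=0):
    """Pick one-way ANOVA or Kruskal-Wallis from simple assumption checks.

    Hypothetical helper: `groups` is a list of 1-D arrays, one per country.
    """
    rng = np.random.default_rng(seed)
    groups = [np.asarray(g, dtype=float) for g in groups]
    groups = [g[~np.isnan(g)] for g in groups]  # drop missing values

    # Shapiro-Wilk per group. Large groups are subsampled (an assumption)
    # because SciPy warns that the p-value may be inaccurate above 5000 points.
    normal = True
    for g in groups:
        if g.size > max_shapiro_n:
            g = rng.choice(g, size=max_shapiro_n, replace=False)
        _, p_shapiro = stats.shapiro(g)
        normal = normal and p_shapiro >= alpha

    # Levene's test (median-centered) for equal variances across groups.
    _, p_levene = stats.levene(*groups, center="median")
    equal_var = p_levene >= alpha

    # Use ANOVA only when both checks pass; otherwise fall back to Kruskal-Wallis.
    if normal and equal_var:
        stat, p_value = stats.f_oneway(*groups)
        test = "anova"
    else:
        stat, p_value = stats.kruskal(*groups)
        test = "kruskal"

    return {
        "test": test,
        "statistic": float(stat),
        "p_value": float(p_value),
        "normal": bool(normal),
        "equal_var": bool(equal_var),
    }


# Synthetic, made-up demo: three skewed groups with shifted locations.
demo_rng = np.random.default_rng(42)
demo_groups = [demo_rng.exponential(1.0, 500) + shift for shift in (0.0, 5.0, 10.0)]
print(choose_group_test(demo_groups))
```

In the notebook, the `groups` list built in the statistical-testing cell could be passed in directly. With very large samples, both the assumption checks and the main test tend to flag even small differences, so the result is best read alongside the boxplots and the summary table.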