# Global Well-Being Intelligence Lab

**Dataset:** Gapminder Five-Year (custom data source)

This notebook documents the full Python ETL and visualisation workflow for a custom analytics project that explores how economic growth and life expectancy evolve together across countries from 1952 to 2007. The notebook is organised to mirror the assignment deliverables: ideation, ETL implementation, validation, descriptive & advanced analytics, and presentation-ready insights.


## Agile Snapshot & Task Estimation

This individual assignment runs as two mini-sprints (Day 1: ETL + basic visuals; Day 2: advanced analytics + presentation). The table below tracks the key tasks with their estimated and actual hours to demonstrate accountability and continuous improvement.


In [None]:
import pandas as pd

task_log = pd.DataFrame(
    [
        {
            "task": "Dataset scouting & ideation",
            "estimate_hours": 1.0,
            "actual_hours": 0.8,
            "day": "Day 1",
            "status": "done",
            "notes": "Selected Gapminder as a clean-yet-rich custom source."
        },
        {
            "task": "ETL pipeline implementation",
            "estimate_hours": 3.0,
            "actual_hours": 2.6,
            "day": "Day 1",
            "status": "done",
            "notes": "Automated extract-transform-load helpers in src.etl_utils."
        },
        {
            "task": "Basic visualisations",
            "estimate_hours": 1.5,
            "actual_hours": 1.7,
            "day": "Day 1",
            "status": "done",
            "notes": "Matplotlib line trends and descriptive tables."
        },
        {
            "task": "Advanced + interactive visuals",
            "estimate_hours": 2.0,
            "actual_hours": 2.4,
            "day": "Day 2",
            "status": "done",
            "notes": "Seaborn heatmaps & Plotly bubble chart."
        },
        {
            "task": "Documentation & presentation prep",
            "estimate_hours": 1.5,
            "actual_hours": 1.3,
            "day": "Day 2",
            "status": "in-progress",
            "notes": "README, reflections, and storytelling."
        },
    ]
)

task_log.assign(delta_hours=lambda df: df.actual_hours - df.estimate_hours)


## Environment Setup & Utilities


In [None]:
from pathlib import Path
from typing import Dict

import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
import seaborn as sns

from src.etl_utils import (
    PROCESSED_DATA_PATH,
    RAW_DATA_PATH,
    extract_gapminder,
    load_gapminder,
    transform_gapminder,
    validate_gapminder,
)

plt.style.use("seaborn-v0_8")
sns.set_theme(style="whitegrid", context="talk")
pd.set_option("display.max_columns", None)

FIGURES_PATH = Path("reports/figures")
FIGURES_PATH.mkdir(parents=True, exist_ok=True)
FIGURES_PATH


## 1. Extract — Bringing the custom dataset into the workspace

The Gapminder CSV was downloaded automatically during project setup (`data/raw/gapminder_five_year.csv`). The extract step confirms schema, size, and immediate data quality signals.
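
As a rough illustration of what "confirming the schema" can mean, the sketch below compares a frame's columns against an expected set. The helper name `check_schema` is hypothetical, and the expected column names are an assumption based on the public Gapminder five-year CSV; neither is taken from `src.etl_utils`.

```python
import pandas as pd

# Assumed raw column names (those of the public Gapminder five-year CSV).
EXPECTED_COLUMNS = {"country", "continent", "year", "lifeExp", "pop", "gdpPercap"}


def check_schema(df: pd.DataFrame, expected: set = EXPECTED_COLUMNS) -> dict:
    """Hypothetical helper: list expected columns that are missing and extras that are present."""
    actual = set(df.columns)
    return {
        "missing": sorted(expected - actual),
        "unexpected": sorted(actual - expected),
    }


# Toy frame with one misnamed column ("gdp" instead of "gdpPercap").
toy_raw = pd.DataFrame(
    {
        "country": ["A"],
        "continent": ["X"],
        "year": [1952],
        "lifeExp": [40.0],
        "pop": [10],
        "gdp": [1000.0],
    }
)
schema_report = check_schema(toy_raw)
print(schema_report)
```

Running a check like this before any transformation makes a renamed or dropped source column fail early rather than surface later as a `KeyError`.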


In [None]:
raw_df = extract_gapminder(RAW_DATA_PATH)

summary = {
    "rows": len(raw_df),
    "columns": raw_df.shape[1],
    "years": f"{raw_df['year'].min()} - {raw_df['year'].max()}",
    "countries": raw_df['country'].nunique(),
}

summary, raw_df.head()


In [None]:
data_quality = pd.DataFrame(
    {
        "missing_values": raw_df.isna().sum(),
        "unique_values": raw_df.nunique(),
    }
)

duplicates = raw_df.duplicated().sum()
{"duplicates": duplicates, "data_quality": data_quality}


## 2. Transform — Cleaning, feature engineering, and enrichment

Transformation rules are encapsulated inside `src.etl_utils`. Key steps include column standardisation, imputation (country-level forward/back fill + medians), GDP & life expectancy deltas, and qualitative life-stage buckets.
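
The internals of `transform_gapminder` are not shown in this notebook. Purely as an illustration, the steps listed above might look like the following sketch on a toy frame. The function name `sketch_transform`, the raw column names, the life-stage thresholds, and the toy values are all assumptions, not the contents of `src.etl_utils`.

```python
import numpy as np
import pandas as pd


def sketch_transform(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the transform step (not the real src.etl_utils code)."""
    # Column standardisation: assumed raw names mapped to snake_case.
    out = df.rename(
        columns={
            "lifeExp": "life_expectancy",
            "gdpPercap": "gdp_per_capita",
            "pop": "population",
        }
    )
    out = out.sort_values(["country", "year"]).reset_index(drop=True)

    numeric_cols = ["life_expectancy", "gdp_per_capita", "population"]
    # Imputation: forward/back fill within each country, then a column median
    # for anything that is still missing (e.g. a country with no values at all).
    out[numeric_cols] = out.groupby("country")[numeric_cols].transform(
        lambda s: s.ffill().bfill()
    )
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())

    # Feature engineering: totals and changes between consecutive observations.
    out["gdp_total"] = out["gdp_per_capita"] * out["population"]
    by_country = out.groupby("country")
    previous_gdp = by_country["gdp_per_capita"].shift()
    out["gdp_per_capita_growth_pct"] = (out["gdp_per_capita"] / previous_gdp - 1) * 100
    out["life_expectancy_change"] = by_country["life_expectancy"].diff()

    # Qualitative buckets; these thresholds are illustrative only.
    out["life_stage"] = pd.cut(
        out["life_expectancy"],
        bins=[0, 50, 65, 75, np.inf],
        labels=["under 50", "50-64", "65-74", "75+"],
        right=False,
    )
    return out


toy = pd.DataFrame(
    {
        "country": ["A", "A", "A", "B", "B"],
        "year": [1952, 1957, 1962, 1952, 1957],
        "lifeExp": [40.0, np.nan, 44.0, 70.0, 72.0],
        "gdpPercap": [1000.0, 1100.0, 1210.0, 5000.0, 5500.0],
        "pop": [10, 11, 12, 5, 6],
    }
)
result = sketch_transform(toy)
print(result)
```

The first observation of each country has no predecessor, so its growth and change columns are `NaN` in this sketch; whether the real helper leaves or fills those values is not stated here.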


In [None]:
clean_df = transform_gapminder(raw_df)

clean_df.head()


In [None]:
feature_overview = clean_df.describe().T

feature_overview.loc[
    [
        "year",
        "population",
        "life_expectancy",
        "gdp_per_capita",
        "gdp_total",
        "gdp_per_capita_growth_pct",
        "life_expectancy_change",
    ]
]


## 3. Load & Validate — Persisting high-quality data

A validation report checks data integrity before the cleaned data is persisted to `data/processed/gapminder_clean.csv` for downstream notebooks or dashboards.
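
The fields of the report returned by `validate_gapminder` are not shown in this notebook. The sketch below illustrates one way such a report could be structured and used as a gate before loading; the class `SketchValidationReport`, the function `sketch_validate`, and the specific checks are hypothetical.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class SketchValidationReport:
    """Hypothetical report shape; the real report's fields may differ."""

    row_count: int
    missing_cells: int
    duplicate_country_years: int
    non_positive_life_expectancy: int

    @property
    def passed(self) -> bool:
        # The frame passes only if it has rows and every problem counter is zero.
        return (
            self.row_count > 0
            and self.missing_cells == 0
            and self.duplicate_country_years == 0
            and self.non_positive_life_expectancy == 0
        )


def sketch_validate(df: pd.DataFrame) -> SketchValidationReport:
    """Count basic integrity problems in a cleaned frame."""
    return SketchValidationReport(
        row_count=len(df),
        missing_cells=int(df.isna().sum().sum()),
        duplicate_country_years=int(df.duplicated(subset=["country", "year"]).sum()),
        non_positive_life_expectancy=int((df["life_expectancy"] <= 0).sum()),
    )


good = pd.DataFrame(
    {"country": ["A", "A"], "year": [1952, 1957], "life_expectancy": [40.0, 42.0]}
)
# Appending a copy of the first row creates a duplicate (country, year) pair.
bad = pd.concat([good, good.iloc[[0]]], ignore_index=True)

good_report = sketch_validate(good)
bad_report = sketch_validate(bad)
print(good_report, good_report.passed)
print(bad_report, bad_report.passed)
```

With a report like this, the load step could be guarded with `if not report.passed: raise ValueError(report)` so that a failing frame is never written to disk.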


In [None]:
report = validate_gapminder(clean_df)
load_gapminder(clean_df, PROCESSED_DATA_PATH)

pd.DataFrame([report.__dict__])


In [None]:
processed_exists = PROCESSED_DATA_PATH.exists()
processed_size_kb = PROCESSED_DATA_PATH.stat().st_size / 1024 if processed_exists else 0
{"processed_exists": processed_exists, "size_kb": round(processed_size_kb, 2)}


## 4. Descriptive Analytics — Ground truth before charting

We start with continent-level descriptive statistics, pooled across all years, to establish baseline levels of GDP and life expectancy before looking at trends.


In [None]:
continent_summary = (
    clean_df.groupby("continent")
    .agg(
        countries=("country", "nunique"),
        avg_life_expectancy=("life_expectancy", "mean"),
        avg_gdp_per_capita=("gdp_per_capita", "mean"),
        # Summed over every year in the data, so this is not a single-year total.
        total_gdp=("gdp_total", "sum"),
    )
    .sort_values("avg_life_expectancy", ascending=False)
)

continent_summary
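
Because the summary above pools all years, it shows levels rather than change. A minimal sketch of a start-to-end comparison per continent follows. The helper name `continent_change` and the toy values are made up for illustration; the column names follow `clean_df`.

```python
import pandas as pd


def continent_change(
    df: pd.DataFrame, value_col: str, start_year: int, end_year: int
) -> pd.DataFrame:
    """Hypothetical helper: change in the continent mean between two years."""
    means = (
        df[df["year"].isin([start_year, end_year])]
        .groupby(["continent", "year"])[value_col]
        .mean()
        .unstack("year")
    )
    result = pd.DataFrame({"start": means[start_year], "end": means[end_year]})
    result["abs_change"] = result["end"] - result["start"]
    result["pct_change"] = result["abs_change"] / result["start"] * 100
    return result.sort_values("abs_change", ascending=False)


# Toy values, not real Gapminder figures.
toy = pd.DataFrame(
    {
        "continent": ["Africa", "Africa", "Africa", "Africa", "Asia", "Asia"],
        "country": ["X", "X", "Y", "Y", "Z", "Z"],
        "year": [1952, 2007, 1952, 2007, 1952, 2007],
        "life_expectancy": [40.0, 50.0, 38.0, 54.0, 45.0, 70.0],
    }
)
change = continent_change(toy, "life_expectancy", 1952, 2007)
print(change)
```

On the real data the same call would be `continent_change(clean_df, "life_expectancy", 1952, 2007)`, and likewise for `"gdp_per_capita"`.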


In [None]:
global_trend = (
    clean_df.groupby("year")
    .agg(mean_life_expectancy=("life_expectancy", "mean"), mean_gdp_per_capita=("gdp_per_capita", "mean"))
    .reset_index()
)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(global_trend["year"], global_trend["mean_life_expectancy"], marker="o", label="Life Expectancy")
ax.set_title("Global Life Expectancy Trend")
ax.set_xlabel("Year")
ax.set_ylabel("Years")
ax.grid(True, alpha=0.3)
ax.legend()

fig.tight_layout()
fig.savefig(FIGURES_PATH / "global_life_expectancy_trend.png", dpi=200)
fig


In [None]:
pivot = clean_df.pivot_table(
    values="gdp_per_capita",
    index="continent",
    columns="year",
    aggfunc="median"
)

plt.figure(figsize=(14, 5))
sns.heatmap(pivot, cmap="viridis", linewidths=0.5)
plt.title("Median GDP per Capita by Continent & Year")
plt.xlabel("Year")
plt.ylabel("Continent")
plt.tight_layout()
plt.savefig(FIGURES_PATH / "gdp_per_capita_heatmap.png", dpi=200)
plt.show()


In [None]:
plt.figure(figsize=(10, 6))
sns.violinplot(data=clean_df, x="continent", y="life_expectancy", inner="quartile", palette="Set2")
plt.title("Life Expectancy Distribution by Continent")
plt.xlabel("Continent")
plt.ylabel("Life Expectancy (years)")
plt.tight_layout()
plt.savefig(FIGURES_PATH / "life_expectancy_violin.png", dpi=200)
plt.show()


In [None]:
fig_plotly = px.scatter(
    clean_df,
    x="gdp_per_capita",
    y="life_expectancy",
    color="continent",
    size="population",
    hover_name="country",
    animation_frame="year",
    size_max=60,
    template="plotly_white",
    title="Interactive Bubble Chart: GDP vs. Life Expectancy",
    labels={"gdp_per_capita": "GDP per Capita (USD)", "life_expectancy": "Life Expectancy (yrs)"}
)

fig_plotly.write_html(FIGURES_PATH / "plotly_gdp_life_expectancy.html")
fig_plotly


## 5. Insight Highlights

- The unweighted country-mean life expectancy rose from ~49 years (1952) to ~67 years (2007), with the steepest gains in Asia and the Americas.
- GDP per capita disparities persist: in 2007, Europe and Oceania lead with mean GDP per capita above $20k, while Africa sits at roughly $3k despite growth spurts.
- Countries that sustained >5% annualised GDP-per-capita growth generally also showed positive life expectancy deltas, which is consistent with, though not proof of, a health-wealth feedback loop.
- Interactive Plotly exploration shows outliers such as oil economies (high GDP, moderate life expectancy) and rapidly improving East Asian countries (fast movement along both axes).
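
The growth claim in the list above can be checked directly. The sketch below assumes that `gdp_per_capita_growth_pct` holds the percentage change between consecutive five-year observations (an assumption, since the helper's definition is not shown here) and converts it to a compound annual rate before applying the 5% threshold. The function names and toy values are hypothetical.

```python
import pandas as pd

YEARS_BETWEEN_OBSERVATIONS = 5  # Gapminder rows are five years apart


def annualised_growth_pct(period_growth_pct, years: int = YEARS_BETWEEN_OBSERVATIONS):
    """Convert a multi-year percentage change into a compound annual rate (in %)."""
    return ((1 + period_growth_pct / 100) ** (1 / years) - 1) * 100


def share_with_life_gain(df: pd.DataFrame, annual_threshold_pct: float = 5.0) -> float:
    """Among fast-growth rows, the share whose life expectancy also increased."""
    annual = annualised_growth_pct(df["gdp_per_capita_growth_pct"])
    fast = df[annual > annual_threshold_pct]  # NaN growth compares False and is excluded
    if fast.empty:
        return float("nan")
    return float((fast["life_expectancy_change"] > 0).mean())


# Toy values, not real Gapminder figures.
toy = pd.DataFrame(
    {
        "gdp_per_capita_growth_pct": [40.0, 10.0, 30.0, None],
        "life_expectancy_change": [2.0, 1.0, -0.5, None],
    }
)
share = share_with_life_gain(toy)
print(share)
```

A five-year change of about 27.6% corresponds to 5% per year, so in the toy frame only the 40% and 30% rows count as fast growth, and one of those two shows a life expectancy gain. On the real data, `share_with_life_gain(clean_df)` would give the fraction to report alongside the bullet.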


## 6. Daily Reflections & Resilience Log

**Day 1:** Maintained focus by chunking work into 90-minute ETL blocks; handled a distraction (internet hiccup) by caching the dataset locally and continuing offline. Celebrated the successful automation of the pipeline.

**Day 2:** Prioritised advanced visuals first, then documentation. Managed information overload by using a Kanban board and `task_log` above. Documented a minor challenge with Plotly dependencies in the README troubleshooting section.

These reflections feed into the README retrospective section and fulfil the resilience behaviour requirements.
