# Summary Statistics from RF Predictions

This notebook:

1. Reads per-country RF prediction GPKGs: `{country}_rf_preds.gpkg`.
2. Applies a **quality filter**: a city is kept only if at least 80% of its blocks have  
   non-null `rf_label` and `POP_SEG`; cities below the threshold are excluded from all QC-filtered outputs.
3. Computes:
   - City-level deprived population (% of residents in deprived segments).
   - Country-level and region-level aggregates.
   - Region × city-size-class deprivation patterns.
   - Region-level counts and population by city-size class.
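The quality filter and the city-level statistic can be illustrated on a toy table. This is a minimal sketch, not the notebook's own implementation: the data and the helper name `city_stats` are hypothetical, and only the column names `UC_NM_MN`, `rf_label`, `POP_SEG` and the 80% threshold are taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy blocks: city "A" has 5/5 valid blocks, city "B" only 1/2.
blocks = pd.DataFrame({
    "UC_NM_MN": ["A", "A", "A", "A", "A", "B", "B"],
    "rf_label": [1, 0, 0, 1, 0, 1, np.nan],
    "POP_SEG": [100.0, 300.0, 200.0, 100.0, 300.0, 50.0, 70.0],
})

def city_stats(blocks, threshold=80.0):
    """Per-city deprived share, keeping only cities that pass the QC threshold."""
    df = blocks.copy()
    # A block is valid when both the label and the population are present
    df["valid"] = df["rf_label"].notna() & df["POP_SEG"].notna()
    # Population of valid blocks, and of valid blocks labelled deprived
    df["pop_valid"] = df["POP_SEG"].where(df["valid"], 0.0)
    df["pop_deprived"] = df["POP_SEG"].where(
        df["valid"] & (df["rf_label"] == 1), 0.0
    )
    out = df.groupby("UC_NM_MN").agg(
        PctValid=("valid", lambda v: v.mean() * 100),
        TotalPop=("pop_valid", "sum"),
        DeprivedPop=("pop_deprived", "sum"),
    )
    out["PctDeprived"] = out["DeprivedPop"] / out["TotalPop"] * 100
    return out[out["PctValid"] >= threshold]

stats = city_stats(blocks)
print(stats)
```

In this toy case city "B" is dropped because only half of its blocks are valid, while city "A" is kept with 200 of 1,000 residents in deprived segments.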

**Inputs**

- RF predictions (per-country GPKG):  
  `2_modelling/02_application/predictions/{country}_rf_preds.gpkg`,  
  produced by `02_apply_rf_predictions.ipynb`.  
  Required columns per GPKG:
  - `REG1_GHSL` (region label, e.g. *Africa*, *Asia*, *Latin America and the Caribbean*)
  - `UC_NM_MN` (city name)
  - `rf_label` (0/1, non-deprived / deprived)
  - `POP_SEG` (segment population)

**City size classification**

City size is classified from each city's `TotalPop` (the sum of `POP_SEG` over its valid blocks), following the UN *World Urbanization Prospects 2018* scheme:

- **Small**: < 500,000  
- **Medium**: 500,000–999,999  
- **Large**: 1,000,000–4,999,999  
- **Very large**: 5,000,000–9,999,999  
- **Megacity**: ≥ 10,000,000  
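The same thresholds can also be applied to a whole column at once. The sketch below uses `pandas.cut` with left-closed bins; the names `SIZE_BINS`, `SIZE_LABELS` and `classify_city_sizes` are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Breakpoints and labels copied from the list above.
# right=False makes each interval closed on the left: [500_000, 1_000_000), etc.
SIZE_BINS = [0, 500_000, 1_000_000, 5_000_000, 10_000_000, np.inf]
SIZE_LABELS = ["Small", "Medium", "Large", "Very large", "Megacity"]

def classify_city_sizes(pops):
    """Vectorised size classification for a Series of city populations."""
    return pd.cut(pops, bins=SIZE_BINS, labels=SIZE_LABELS, right=False)

# Values chosen to sit exactly on or next to the class boundaries
pops = pd.Series([499_999, 500_000, 4_999_999, 5_000_000, 10_000_000])
size_classes = classify_city_sizes(pops)
print(size_classes.tolist())
```

The result is an ordered categorical, so sorting by it follows the size order rather than alphabetical order.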

**Outputs**

CSV files are saved to:

`2_modelling/02_application/summary_statistics/`


## Imports and paths

In [None]:
import geopandas as gpd
import pandas as pd
import numpy as np
from pathlib import Path

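# Folder with per-country RF prediction GPKGs.
# Paths are relative to the working directory, assumed here to be
# 2_modelling/02_application/ (see Inputs and Outputs above).
PREDICTIONS_FOLDER = Path("predictions")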

# Output folder for summary statistics
SUMMARY_DIR = Path("summary_statistics")
SUMMARY_DIR.mkdir(parents=True, exist_ok=True)

print("PREDICTIONS_FOLDER:", PREDICTIONS_FOLDER.resolve())
print("SUMMARY_DIR:", SUMMARY_DIR.resolve())


## Parameters and helpers

In [None]:
# Quality requirement: at least this % of blocks in a city must have valid rf_label and POP_SEG
VALID_BLOCKS_THRESHOLD = 80.0  # %

def classify_city_size(pop):
    """UN WUP 2018 city size classes based on city population."""
    if pop < 500_000:
        return "Small"
    elif pop < 1_000_000:
        return "Medium"
    elif pop < 5_000_000:
        return "Large"
    elif pop < 10_000_000:
        return "Very large"
    else:
        return "Megacity"


## City-level deprivation stats (with 80% QC)

In [None]:
records = []

for file in PREDICTIONS_FOLDER.glob("*_rf_preds.gpkg"):
    country = file.stem.replace("_rf_preds", "")
    
    try:
        gdf = gpd.read_file(
            file,
            columns=["REG1_GHSL", "UC_NM_MN", "rf_label", "POP_SEG"]
        )
    except Exception as e:
        print(f"⚠️ Skipping {file.name} due to error: {e}")
        continue

    # valid block = both rf_label and POP_SEG present
    gdf["valid"] = ~(gdf["rf_label"].isna() | gdf["POP_SEG"].isna())

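    # Note: groupby drops blocks whose UC_NM_MN is missing, and blocks that
    # share a city name within one file are treated as a single city
    for city, sub in gdf.groupby("UC_NM_MN"):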
        total_blocks = len(sub)
        valid_blocks = int(sub["valid"].sum())
        pct_valid = (valid_blocks / total_blocks * 100) if total_blocks > 0 else 0.0

        if pct_valid < VALID_BLOCKS_THRESHOLD:
            # city fails quality threshold – excluded from downstream stats
            continue

        sub_valid = sub[sub["valid"]]

        total_pop = float(sub_valid["POP_SEG"].sum())
        deprived_pop = float(
            sub_valid.loc[sub_valid["rf_label"] == 1, "POP_SEG"].sum()
        )
        pct_deprived = (deprived_pop / total_pop * 100) if total_pop > 0 else 0.0

        records.append({
            "Region": sub["REG1_GHSL"].iloc[0],
            "Country": country,
            "City": city,
            "TotalBlocks": total_blocks,
            "ValidBlocks": valid_blocks,
            "PctValid": pct_valid,
            "TotalPop": total_pop,
            "DeprivedPop": deprived_pop,
            "PctDeprived": pct_deprived,
        })

city_df = pd.DataFrame(records)
print(f"Number of cities passing QC (>= {VALID_BLOCKS_THRESHOLD:g}% valid blocks):", len(city_df))

# Save core city-level deprivation table
city_deprivation_csv = SUMMARY_DIR / "city_deprivation_80pct.csv"
city_df.to_csv(city_deprivation_csv, index=False, encoding="utf-8")
city_df.head()


In [None]:
# City size classification based on TotalPop (UN WUP 2018 scheme)
city_df["CitySizeClass"] = city_df["TotalPop"].apply(classify_city_size)

# Reordered view for manuscript tables / figures
city_table = city_df[[
    "Region", "Country", "City", "CitySizeClass",
    "TotalPop", "DeprivedPop", "PctDeprived"
]].copy()

city_with_size_csv = SUMMARY_DIR / "city_deprivation_with_sizeclass_80pct.csv"
city_table.to_csv(city_with_size_csv, index=False, encoding="utf-8")

city_table.head()


## Country-level aggregation

In [None]:
country_df = (
    city_df.groupby(["Region", "Country"], as_index=False)
    .agg(
        TotalPop=("TotalPop", "sum"),
        DeprivedPop=("DeprivedPop", "sum")
    )
)
country_df["PctDeprived"] = (
    country_df["DeprivedPop"] / country_df["TotalPop"] * 100
)

country_csv = SUMMARY_DIR / "country_deprivation_80pct.csv"
country_df.to_csv(country_csv, index=False, encoding="utf-8")
country_df.head()


## Region-level aggregation

In [None]:
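region_df = (
    city_df.groupby("Region", as_index=False)
    .agg(
        TotalPop=("TotalPop", "sum"),
        DeprivedPop=("DeprivedPop", "sum"),
        Countries=("Country", "nunique"),
        # city_df has one row per (country, city), so count rows: "nunique" on
        # the name alone would merge same-named cities from different countries
        Cities=("City", "size"),
    )
)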
region_df["PctDeprived"] = (
    region_df["DeprivedPop"] / region_df["TotalPop"] * 100
)

regional_csv = SUMMARY_DIR / "regional_deprivation_80pct.csv"
region_df.to_csv(regional_csv, index=False, encoding="utf-8")
region_df


## Region × size-class deprivation

In [None]:
# Region × SizeClass breakdown
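size_summary = (
    city_df.groupby(["Region", "CitySizeClass"], as_index=False)
    .agg(
        # one row per (country, city): count rows so that same-named cities
        # in different countries are not merged
        Cities=("City", "size"),
        TotalPop=("TotalPop", "sum"),
        DeprivedPop=("DeprivedPop", "sum"),
    )
)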
size_summary["PctDeprived"] = (
    size_summary["DeprivedPop"] / size_summary["TotalPop"] * 100
)

region_sizeclass_csv = SUMMARY_DIR / "region_sizeclass_deprivation_80pct.csv"
size_summary.to_csv(region_sizeclass_csv, index=False, encoding="utf-8")
size_summary


## Region-country-city list (no QC, all cities)

In [None]:
# Simple region–country–city inventory: every city name found in each file,
# read from the attribute columns, with no QC filter applied
records_all = []

for file in PREDICTIONS_FOLDER.glob("*_rf_preds.gpkg"):
    country = file.stem.replace("_rf_preds", "")
    
    try:
        gdf = gpd.read_file(file, columns=["REG1_GHSL", "UC_NM_MN"])
    except Exception as e:
        print(f"⚠️ Skipping {file.name} due to error: {e}")
        continue

    gdf = gdf.drop_duplicates(subset=["UC_NM_MN"])

    for _, row in gdf.iterrows():
        records_all.append({
            "Region": row["REG1_GHSL"],
            "Country": country,
            "City": row["UC_NM_MN"]
        })

detailed_df = pd.DataFrame(records_all)

region_country_city_csv = SUMMARY_DIR / "region_country_city.csv"
detailed_df.to_csv(region_country_city_csv, index=False, encoding="utf-8")
detailed_df.head()


## Detailed city classification (80% QC) with size class

In [None]:
# Detailed classification table for cities passing the 80% QC
filtered_df = city_df[[
    "Region", "Country", "City", "TotalPop"
]].copy()
filtered_df["SizeClass"] = filtered_df["TotalPop"].apply(classify_city_size)
filtered_df.rename(columns={"TotalPop": "Population"}, inplace=True)

region_country_city_sizeclass_csv = (
    SUMMARY_DIR / "region_country_city_sizeclass_80pct.csv"
)
filtered_df.to_csv(region_country_city_sizeclass_csv, index=False, encoding="utf-8")
filtered_df.head()


## Region summary with city counts & population by size class

In [None]:
# Base region summary
region_summary = (
    filtered_df.groupby("Region", as_index=False)
    .agg(
        Countries=("Country", "nunique"),
        # one row per (country, city): count rows so that Cities equals the
        # sum of the per-class city counts below
        Cities=("City", "size"),
        TotalPopulation=("Population", "sum"),
    )
)

# Size-class specific city counts and populations, aligned to region_summary
# by region name rather than by row position
labels = ["Small", "Medium", "Large", "Very large", "Megacity"]

for label in labels:
    in_class = filtered_df[filtered_df["SizeClass"] == label]
    per_region = in_class.groupby("Region")["Population"].agg(["size", "sum"])
    # Regions with no city in this class get 0 rather than NaN
    region_summary[f"{label}_Cities"] = (
        region_summary["Region"].map(per_region["size"]).fillna(0).astype(int)
    )
    region_summary[f"{label}_Pop"] = (
        region_summary["Region"].map(per_region["sum"]).fillna(0.0)
    )

# Optional: population in millions for readability
region_summary["TotalPop_Millions"] = (region_summary["TotalPopulation"] / 1_000_000).round(2)
for label in labels:
    region_summary[f"{label}_Pop_Millions"] = (
        region_summary[f"{label}_Pop"] / 1_000_000
    ).round(2)

region_summary_csv = SUMMARY_DIR / "region_summary_sizeclass_population_80pct.csv"
region_summary.to_csv(region_summary_csv, index=False, encoding="utf-8")
region_summary


## CSV outputs produced by this notebook

These files are saved under: `2_modelling/02_application/summary_statistics/`

1. `city_deprivation_80pct.csv`  
   – One row per city (passing ≥80% valid blocks), with:
   - Region, Country, City
   - TotalBlocks, ValidBlocks, PctValid
   - TotalPop, DeprivedPop, PctDeprived

2. `city_deprivation_with_sizeclass_80pct.csv`  
   – City-level table with city size classes:
   - Region, Country, City, CitySizeClass, TotalPop, DeprivedPop, PctDeprived

3. `country_deprivation_80pct.csv`  
   – Country-level aggregate:
   - Region, Country, TotalPop, DeprivedPop, PctDeprived

4. `regional_deprivation_80pct.csv`  
   – Region-level aggregate:
   - Region, TotalPop, DeprivedPop, Countries, Cities, PctDeprived

5. `region_sizeclass_deprivation_80pct.csv`  
   – Region × city-size-class aggregate:
   - Region, CitySizeClass, Cities, TotalPop, DeprivedPop, PctDeprived

6. `region_country_city.csv`  
   – Simple inventory (no QC filter):
   - Region, Country, City

7. `region_country_city_sizeclass_80pct.csv`  
   – Detailed city classification (QC-filtered):
   - Region, Country, City, Population, SizeClass

8. `region_summary_sizeclass_population_80pct.csv`  
   – Region-level summary with counts and population by city-size class:
   - Region, Countries, Cities, TotalPopulation, TotalPop_Millions, plus `{SizeClass}_Cities`, `{SizeClass}_Pop` and `{SizeClass}_Pop_Millions` for each size class
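
When reading the aggregate tables, note that `PctDeprived` at country and region level is computed from summed populations, so it is a population-weighted share and not the mean of the city-level percentages. The sketch below shows the difference on a hypothetical table that only borrows the column names of `city_deprivation_80pct.csv`; the values and the names `pooled` and `unweighted` are illustrative.

```python
import pandas as pd

# Hypothetical city-level rows (values invented for illustration)
city_df = pd.DataFrame({
    "Region": ["Africa", "Africa", "Asia"],
    "Country": ["X", "X", "Y"],
    "City": ["a", "b", "c"],
    "TotalPop": [1_000_000.0, 100_000.0, 2_000_000.0],
    "DeprivedPop": [100_000.0, 50_000.0, 500_000.0],
})
city_df["PctDeprived"] = city_df["DeprivedPop"] / city_df["TotalPop"] * 100

# Pooled aggregation: sum the populations first, then take the ratio
country_df = city_df.groupby(["Region", "Country"], as_index=False)[
    ["TotalPop", "DeprivedPop"]
].sum()
country_df["PctDeprived"] = country_df["DeprivedPop"] / country_df["TotalPop"] * 100

# Country X: pooled share versus the plain mean of its two city percentages
pooled = country_df.loc[country_df["Country"] == "X", "PctDeprived"].iloc[0]
unweighted = city_df.loc[city_df["Country"] == "X", "PctDeprived"].mean()
print(f"pooled: {pooled:.2f}%  unweighted mean: {unweighted:.2f}%")

# Consistency check: aggregated totals must match the city-level totals
assert country_df["TotalPop"].sum() == city_df["TotalPop"].sum()
```

Here the large city dominates the pooled figure, which is why it sits well below the unweighted mean of the two city percentages.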
