# Green Prism — Impact Realization Gap Model (MiniLM + XGBoost)

This notebook trains an **Impact Realization Gap** model.

**Goal:** predict **actual environmental impact** (e.g. tons CO₂ avoided) from:

- Bond / project **disclosure text** (or a structured summary)
- **Claimed** impact (`claimed_impact_co2_tons`)
- Bond / project **metadata**:
  - sector, region, technology (solar/wind/transport/etc.)
  - use-of-proceeds category
  - bond size, tenor

We then compute for each record:

- **Realization ratio**: `ratio = actual / claimed` (at inference time, `predicted_actual / claimed`)
- **Impact gap**: `gap = claimed - actual` (at inference time, `claimed - predicted_actual`)
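
As a quick illustration of the two quantities above, here is a minimal sketch in plain Python. The function name `realization_metrics` and the numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from typing import Optional, Tuple


def realization_metrics(claimed: float, actual: float) -> Tuple[Optional[float], float]:
    """Return (ratio, gap) for one record; ratio is None when claimed is 0."""
    ratio = actual / claimed if claimed != 0 else None
    gap = claimed - actual
    return ratio, gap


# Hypothetical project: 10,000 t claimed, 7,500 t realized.
ratio, gap = realization_metrics(claimed=10_000.0, actual=7_500.0)
print(ratio, gap)  # 0.75 2500.0
```

A ratio below 1 (positive gap) means the project delivered less than it claimed; a ratio above 1 (negative gap) means it delivered more.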

---

## 0. Prerequisites

Before running this notebook, create a unified training CSV at:

`backend/app/data/impact_training_data.csv`

with at least the following columns:

- `text` — disclosure or impact report text (or a stitched summary from ADB / WB tables)
- `claimed_impact_co2_tons` — claimed CO₂ impact (tons)
- `actual_impact_co2_tons` — observed / realized CO₂ impact (tons)

Optional (recommended) metadata columns:

- `sector` — project or issuer sector (e.g. 'Energy', 'Transport')
- `region` — region / country group (e.g. 'EMEA', 'Asia', 'Pacific')
- `technology` — coarse tech bucket (e.g. 'solar', 'wind', 'metro', 'hydro')
- `use_of_proceeds` — UoP category (e.g. 'Renewable Energy', 'Clean Transport')
- `amount_issued_usd` — bond size in USD
- `tenor_years` — bond tenor in years

You can populate this CSV by extracting rows from:

- **ADB Green & Blue Bond Impact Report (Excel)** — projects with claimed + realized impacts
- **World Bank / IFC** impact reports — wherever both claimed & realized metrics are published

The cells below assume that file exists and use it for training.
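
To make the expected layout concrete, the sketch below builds a two-row CSV in memory and checks it for the required columns using only the standard library. The rows are synthetic placeholders, not real ADB or World Bank figures, and `check_columns` is a hypothetical helper.

```python
import csv
import io

REQUIRED = ["text", "claimed_impact_co2_tons", "actual_impact_co2_tons"]

# Synthetic placeholder rows (NOT real report data).
SAMPLE_CSV = """text,claimed_impact_co2_tons,actual_impact_co2_tons,sector,region
"Example solar project summary",10000,7500,Energy,Asia
"Example metro line summary",50000,52000,Transport,EMEA
"""


def check_columns(csv_text: str) -> list:
    """Return the required columns that are missing from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [c for c in REQUIRED if c not in header]


missing = check_columns(SAMPLE_CSV)
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print("missing:", missing, "rows:", len(rows))
```

The optional metadata columns can be added to the header in any order, since the notebook looks columns up by name.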


## 1. Imports & Path Setup

In [None]:
from pathlib import Path
import sys

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

from xgboost import XGBRegressor

import torch
from sentence_transformers import SentenceTransformer

# Repo layout assumptions:
#   repo_root/
#     backend/
#       app/
#         data/impact_training_data.csv
#
REPO_ROOT = Path.cwd().resolve().parents[0] if Path.cwd().name == "notebooks" else Path.cwd().resolve()
BACKEND_ROOT = REPO_ROOT / "backend"
DATA_PATH = BACKEND_ROOT / "app" / "data" / "impact_training_data.csv"

print("CWD:", Path.cwd())
print("REPO_ROOT:", REPO_ROOT)
print("BACKEND_ROOT exists:", BACKEND_ROOT.exists())
print("DATA_PATH:", DATA_PATH)

sys.path.insert(0, str(BACKEND_ROOT))

from app.ml.preprocessing import clean_text


## 2. Load & Inspect Training Data

In [None]:
if not DATA_PATH.exists():
    raise FileNotFoundError(
        f"Training CSV not found at {DATA_PATH}.\n"
        "Create impact_training_data.csv with at least the columns "
        "[text, claimed_impact_co2_tons, actual_impact_co2_tons]."
    )

df = pd.read_csv(DATA_PATH)
print("Raw shape:", df.shape)
df.head()


### 2.1 Basic Cleaning & Column Normalization

In [None]:
# Rename columns if your CSV uses slightly different names.
# Adjust the mapping below to match your file.
col_map = {
    "text": "text",
    "claimed_impact_co2_tons": "claimed_impact_co2_tons",
    "actual_impact_co2_tons": "actual_impact_co2_tons",
    "sector": "sector",
    "region": "region",
    "technology": "technology",
    "use_of_proceeds": "use_of_proceeds",
    "amount_issued_usd": "amount_issued_usd",
    "tenor_years": "tenor_years",
}

df = df.rename(columns=col_map)

required_cols = ["text", "claimed_impact_co2_tons", "actual_impact_co2_tons"]
missing = [c for c in required_cols if c not in df.columns]
if missing:
    raise ValueError(f"Missing required columns in CSV: {missing}")

# Drop rows with missing claimed/actual or text
df = df.dropna(subset=["text", "claimed_impact_co2_tons", "actual_impact_co2_tons"]).copy()
df["text"] = df["text"].astype(str)

# Ensure numeric
for c in ["claimed_impact_co2_tons", "actual_impact_co2_tons",
          "amount_issued_usd", "tenor_years"]:
    if c in df.columns:
        df[c] = pd.to_numeric(df[c], errors="coerce")

df = df.dropna(subset=["claimed_impact_co2_tons", "actual_impact_co2_tons"]).copy()

print("After cleaning:", df.shape)
df[["claimed_impact_co2_tons", "actual_impact_co2_tons"]].describe()


### 2.2 Realization Ratio & Gap (for inspection only)

In [None]:
df["realization_ratio"] = df["actual_impact_co2_tons"] / df["claimed_impact_co2_tons"].replace(0, np.nan)
df["impact_gap"] = df["claimed_impact_co2_tons"] - df["actual_impact_co2_tons"]

df[["claimed_impact_co2_tons", "actual_impact_co2_tons", "realization_ratio", "impact_gap"]].head()


## 3. Clean Text

In [None]:
df["clean_text"] = df["text"].fillna("").astype(str).apply(clean_text)
df[["clean_text"]].head()
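
`clean_text` comes from the project's `app.ml.preprocessing` module, whose implementation is not shown in this notebook. If that module is unavailable, a stand-in along the following lines may be enough to run the remaining cells. `simple_clean_text` is a hypothetical helper and an assumption about what reasonable cleaning looks like, not the project's actual function.

```python
import re


def simple_clean_text(text: str) -> str:
    """Hypothetical stand-in: strip URLs, collapse whitespace, trim."""
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()


cleaned_example = simple_clean_text("Solar  park\n\nsee https://example.org/report  for details")
print(cleaned_example)
```

Sentence-transformer models tokenize raw text themselves, so light cleaning of this kind is usually sufficient before embedding.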


## 4. Technology Buckets (Solar / Wind / Transport / etc.)

In [None]:
# We create a coarse 'tech_bucket' feature based on free-text 'technology' or 'use_of_proceeds'.

def infer_tech_bucket(row):
    text = (str(row.get("technology", "")) + " " + str(row.get("use_of_proceeds", ""))).lower()
    if any(k in text for k in ["solar", "pv", "photovoltaic"]):
        return "solar"
    if any(k in text for k in ["wind", "offshore wind", "onshore wind"]):
        return "wind"
    if any(k in text for k in ["metro", "rail", "bus", "transport", "subway", "tram"]):
        return "transport"
    if any(k in text for k in ["hydro", "geothermal"]):
        return "hydro_geo"
    if any(k in text for k in ["building", "efficiency", "retrofit"]):
        return "buildings_efficiency"
    if any(k in text for k in ["water", "wastewater", "sewer", "drinking water"]):
        return "water"
    if any(k in text for k in ["waste", "landfill", "recycling"]):
        return "waste"
    return "other"

df["tech_bucket"] = df.apply(infer_tech_bucket, axis=1)
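# 'technology' and 'use_of_proceeds' are optional, so only show the columns that exist
df[[c for c in ["technology", "use_of_proceeds", "tech_bucket"] if c in df.columns]].head()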
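
One caveat with plain substring checks is that short keywords can match inside unrelated words, for example `bus` inside `business`. A possible refinement, sketched below under the assumption that whole-word matching is what is wanted, uses regular-expression word boundaries. `bucket_by_word` and its reduced keyword table are illustrative, not part of the notebook.

```python
import re

# Reduced, illustrative keyword table (order matters: first match wins).
KEYWORDS = [
    ("solar", ["solar", "pv", "photovoltaic"]),
    ("transport", ["metro", "rail", "bus", "transport"]),
    ("buildings_efficiency", ["building", "efficiency", "retrofit"]),
]


def bucket_by_substring(text: str) -> str:
    """Substring matching, as in the cell above."""
    text = text.lower()
    for bucket, words in KEYWORDS:
        if any(w in text for w in words):
            return bucket
    return "other"


def bucket_by_word(text: str) -> str:
    """Whole-word matching via regex word boundaries."""
    text = text.lower()
    for bucket, words in KEYWORDS:
        pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"
        if re.search(pattern, text):
            return bucket
    return "other"


sample = "Business park retrofit"
print(bucket_by_substring(sample))  # transport (false match on "bus")
print(bucket_by_word(sample))       # buildings_efficiency
```

The trade-off is that whole-word matching no longer catches plurals such as `buses` or `buildings`, so those forms would need to be listed explicitly.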


## 5. Categorical Encodings & Numeric Features

In [None]:
# Define metadata columns we will use (if present)
cat_cols = [c for c in ["sector", "region", "technology", "use_of_proceeds", "tech_bucket"] if c in df.columns]
num_cols = [c for c in ["claimed_impact_co2_tons", "amount_issued_usd", "tenor_years"] if c in df.columns]

print("Categorical columns:", cat_cols)
print("Numeric columns:", num_cols)

# Simple label-encoding for categoricals
cat_maps = {}
for col in cat_cols:
    df[col] = df[col].fillna("UNKNOWN").astype(str)
    uniques = sorted(df[col].unique())
    mapping = {val: i for i, val in enumerate(uniques)}
    cat_maps[col] = mapping
    df[col + "_idx"] = df[col].map(mapping)

# Build numeric feature array
num_features = []
for col in num_cols:
    vals = df[col].fillna(0.0).astype(float).values.reshape(-1, 1)
    num_features.append(vals)

if num_features:
    num_features = np.concatenate(num_features, axis=1)
else:
    num_features = np.zeros((len(df), 0), dtype=np.float32)

# Add categorical indices as numeric features
for col in cat_cols:
    idx_vals = df[col + "_idx"].astype(float).values.reshape(-1, 1)
    num_features = np.concatenate([num_features, idx_vals], axis=1)

print("Numeric+cat feature shape:", num_features.shape)
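
The label encoding above only contains an `UNKNOWN` entry when a training row was actually missing that field, so an unseen category at inference time may otherwise fall back to index 0. One option, sketched here with hypothetical helper names, is to always reserve an index for `UNKNOWN` when building the mapping.

```python
from typing import Dict, Iterable, Optional

UNKNOWN = "UNKNOWN"


def build_mapping(values: Iterable[Optional[str]]) -> Dict[str, int]:
    """Map each category to an index, always reserving one for UNKNOWN."""
    cleaned = {UNKNOWN if v is None else str(v) for v in values}
    cleaned.add(UNKNOWN)
    return {val: i for i, val in enumerate(sorted(cleaned))}


def encode(value: Optional[str], mapping: Dict[str, int]) -> int:
    """Look up a category, falling back to the reserved UNKNOWN index."""
    key = UNKNOWN if value is None else str(value)
    return mapping.get(key, mapping[UNKNOWN])


mapping = build_mapping(["Energy", "Transport", "Energy"])
print(mapping)
print(encode("Water", mapping))  # unseen category -> UNKNOWN index
```

Because the indices are fed to XGBoost as plain numbers, their ordering is arbitrary; one-hot encoding is a common alternative when the number of categories is small.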


## 6. Text Embeddings (MiniLM Sentence Transformer)

In [None]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"

st_model = SentenceTransformer(MODEL_NAME, device=DEVICE)
print("Loaded sentence-transformers model:", MODEL_NAME, "on", DEVICE)


In [None]:
texts_clean = df["clean_text"].tolist()

print("Embedding texts with MiniLM...")
text_embeddings = st_model.encode(texts_clean, batch_size=32, show_progress_bar=True)
text_embeddings = np.asarray(text_embeddings, dtype=np.float32)
print("Embeddings shape:", text_embeddings.shape)


### 6.1 Build Final Feature Matrix X and Target y

In [None]:
# Combine text embeddings + numeric metadata
X = np.concatenate([text_embeddings, num_features], axis=1)

# Target: actual impact (tons CO2)
y = df["actual_impact_co2_tons"].astype(float).values

print("X shape:", X.shape, "  y shape:", y.shape)


## 7. Train XGBoost Regressor (Actual Impact)

In [None]:
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Optional: scale targets if very skewed; here we keep raw tons CO2 for interpretability.

# A relatively small XGBoost model as MVP
xgb = XGBRegressor(
    n_estimators=400,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.9,
    colsample_bytree=0.9,
    objective="reg:squarederror",
    random_state=42,
    n_jobs=4,
)

xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_val)

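# Take the square root explicitly: the `squared` argument is deprecated in recent scikit-learn releases
rmse = float(np.sqrt(mean_squared_error(y_val, y_pred)))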
r2 = r2_score(y_val, y_pred)

print(f"Validation RMSE: {rmse:.3f} tons CO2")
print(f"Validation R^2: {r2:.3f}")

print("Actual stats: mean", y_val.mean(), "std", y_val.std())
print("Pred stats:   mean", y_pred.mean(), "std", y_pred.std())
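
For readers who want to see what the two validation metrics measure, the following plain-Python sketch computes RMSE and R² from their definitions. The numbers are made up for illustration and the helper names are hypothetical; in the notebook itself scikit-learn does this work.

```python
import math
from typing import Sequence


def rmse_manual(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error, in the same units as the target."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_manual(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true_demo = [100.0, 200.0, 300.0]  # made-up actual impacts (tons CO2)
y_pred_demo = [110.0, 190.0, 300.0]  # made-up predictions
print(rmse_manual(y_true_demo, y_pred_demo), r2_manual(y_true_demo, y_pred_demo))
```

RMSE is expressed in tons of CO₂, so it is dominated by the largest projects; R² is unitless and compares the model against always predicting the mean.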


### 7.1 Realization Ratio & Gap on Validation Set

In [None]:
# To compute ratio/gap metrics on the validation set,
# we need the subset of 'claimed' values aligned with X_val/y_val.

# Re-do split on claimed column with same random_state & test_size
claimed = df["claimed_impact_co2_tons"].astype(float).values
_, claimed_val = train_test_split(
    claimed, test_size=0.2, random_state=42
)

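# Realization ratio = predicted actual / claimed (guarding against claimed == 0),
# consistent with the definition at the top of the notebook and with the inference helper.
ratio_val = y_pred / np.where(claimed_val == 0, np.nan, claimed_val)
gap_val = claimed_val - y_pred

print("Realization ratio (pred_actual / claimed) summary:")
print(pd.Series(ratio_val).describe())

print("\nImpact gap (claimed - pred_actual) summary:")
print(pd.Series(gap_val).describe())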
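
Re-running the split on a second array stays aligned only as long as the array length, `test_size` and `random_state` all match. A less fragile pattern is to split row indices once and reuse them for every array. The sketch below illustrates the idea with the standard library; `split_indices` is a hypothetical helper, and in the notebook a similar effect could be had by passing `claimed` alongside `X` and `y` in a single `train_test_split` call.

```python
import random
from typing import List, Tuple


def split_indices(n: int, test_size: float, seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle row indices once and cut them into train / validation parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = max(1, round(n * test_size))
    return idx[n_val:], idx[:n_val]


# Made-up values; position i in each list belongs to the same record.
claimed_demo = [100.0, 200.0, 300.0, 400.0, 500.0]
actual_demo = [90.0, 150.0, 310.0, 380.0, 250.0]

train_idx, val_idx = split_indices(len(claimed_demo), test_size=0.2, seed=42)
claimed_val_demo = [claimed_demo[i] for i in val_idx]
actual_val_demo = [actual_demo[i] for i in val_idx]
print(val_idx, claimed_val_demo, actual_val_demo)
```

Keeping the validation indices around also makes it easy to join predictions back onto `df` for inspection.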


### 7.2 Diagnostic Plot

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(5, 5))
plt.scatter(y_val, y_pred, alpha=0.4)
plt.xlabel("Actual CO2 impact (tons)")
plt.ylabel("Predicted CO2 impact (tons)")
plt.title("Actual vs Predicted Impact (XGBoost + MiniLM)")
min_val = float(min(y_val.min(), y_pred.min()))
max_val = float(max(y_val.max(), y_pred.max()))
plt.plot([min_val, max_val], [min_val, max_val], "r--", alpha=0.6)
plt.grid(True)
plt.show()


## 8. Save Model Artifact

In [None]:
from joblib import dump

MODEL_DIR = BACKEND_ROOT / "app" / "models"
MODEL_DIR.mkdir(parents=True, exist_ok=True)
MODEL_PATH = MODEL_DIR / "impact_gap_xgb_minilm.joblib"

artifact = {
    "model": xgb,
    "text_model_name": MODEL_NAME,
    "cat_maps": cat_maps,
    "num_cols": num_cols,
    "cat_cols": cat_cols,
}

dump(artifact, MODEL_PATH)
print("Saved impact gap model to:", MODEL_PATH)
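
The saved model only works if inference builds features in exactly the training order: text embedding first, then `num_cols`, then `cat_cols`. A cheap safeguard, sketched below, is to store the expected width alongside the model and check it before predicting. The `embedding_dim` key and both helper functions are assumptions added for illustration; the notebook's artifact does not contain them.

```python
from typing import Dict, Sequence


def expected_width(artifact: Dict) -> int:
    """Embedding size plus one column per numeric and categorical field."""
    return artifact["embedding_dim"] + len(artifact["num_cols"]) + len(artifact["cat_cols"])


def check_features(features: Sequence[float], artifact: Dict) -> None:
    """Raise if a feature vector does not match the training layout."""
    want = expected_width(artifact)
    if len(features) != want:
        raise ValueError(f"expected {want} features, got {len(features)}")


# Hypothetical artifact metadata; "embedding_dim" is an added key, not saved by the notebook.
artifact_meta = {
    "embedding_dim": 384,  # output size of all-MiniLM-L6-v2
    "num_cols": ["claimed_impact_co2_tons", "amount_issued_usd", "tenor_years"],
    "cat_cols": ["sector", "region", "technology", "use_of_proceeds", "tech_bucket"],
}

print(expected_width(artifact_meta))  # 392
check_features([0.0] * 392, artifact_meta)  # passes silently
```

A mismatch here usually means an optional metadata column was present at training time but is missing from the inference payload, or the other way round.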


## 9. Inference Helper — Predict Impact for One Bond/Project

In [None]:
from joblib import load

def load_impact_model(model_path: Path = MODEL_PATH):
    return load(model_path)

def _encode_metadata(row: dict, artifact: dict) -> np.ndarray:
    num_cols = artifact["num_cols"]
    cat_cols = artifact["cat_cols"]
    cat_maps = artifact["cat_maps"]

    feats = []

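    # numeric: missing, non-numeric or NaN values fall back to 0.0,
    # matching the fillna(0.0) applied to these columns during training
    for col in num_cols:
        try:
            v = float(row.get(col, 0.0))
        except (TypeError, ValueError):
            v = 0.0
        if np.isnan(v):
            v = 0.0
        feats.append(v)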

    # categorical as label indices
    for col in cat_cols:
        mapping = cat_maps[col]
        raw_val = str(row.get(col, "UNKNOWN"))
        idx = mapping.get(raw_val, mapping.get("UNKNOWN", 0))
        feats.append(float(idx))

    return np.array(feats, dtype=np.float32).reshape(1, -1)

@torch.no_grad()
def predict_impact_gap_ml(
    text: str,
    claimed_impact_co2_tons: float,
    meta: dict,
    artifact: dict,
    st_model_for_inference: SentenceTransformer,
):
    """
    Returns:
      {
        'predicted_impact_mean': float,
        'predicted_impact_std': float (placeholder, can refine later),
        'gap': claimed - predicted_mean,
        'realization_ratio': predicted_mean / claimed (if claimed>0)
      }
    """
    # text embedding
    cleaned = clean_text(text)
    emb = st_model_for_inference.encode([cleaned])
    emb = np.asarray(emb, dtype=np.float32)  # (1, H)

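    # ensure claimed impact is in meta for numeric features if used
    meta = dict(meta)
    meta.setdefault("claimed_impact_co2_tons", claimed_impact_co2_tons)

    # derive tech_bucket as in training (section 4) when the caller did not supply it;
    # otherwise it would silently fall back to the UNKNOWN / index-0 category
    if "tech_bucket" in artifact["cat_cols"] and "tech_bucket" not in meta:
        meta["tech_bucket"] = infer_tech_bucket(meta)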

    meta_vec = _encode_metadata(meta, artifact)  # (1, M)

    feats = np.concatenate([emb, meta_vec], axis=1)  # (1, H+M)

    model = artifact["model"]
    pred = float(model.predict(feats)[0])

    # Placeholder uncertainty: we can plug in a more advanced method later
    predicted_std = float(abs(pred) * 0.15)  # e.g. 15% of magnitude

    gap = float(claimed_impact_co2_tons - pred)
    realization_ratio = float(pred / claimed_impact_co2_tons) if claimed_impact_co2_tons > 0 else None

    return {
      "predicted_impact_mean": pred,
      "predicted_impact_std": predicted_std,
      "gap": gap,
      "realization_ratio": realization_ratio,
    }

artifact_loaded = load_impact_model()

# Example usage on the first row
example_row = df.iloc[0]
example_meta = {col: example_row.get(col) for col in (cat_cols + num_cols)}

example_result = predict_impact_gap_ml(
    text=example_row["text"],
    claimed_impact_co2_tons=float(example_row["claimed_impact_co2_tons"]),
    meta=example_meta,
    artifact=artifact_loaded,
    st_model_for_inference=st_model,
)

example_row[["claimed_impact_co2_tons", "actual_impact_co2_tons"]], example_result


---
## 10. Next Steps

- Move `predict_impact_gap_ml` into a backend module, e.g.
  `app/ml/impact_gap_model_ml.py`.
- In your FastAPI service, call this from the scoring service instead of (or in
  addition to) the current placeholder impact model.
- In the React UI, display:
  - `claimed` vs `predicted_impact_mean`
  - `gap = claimed - predicted`
  - An uncertainty band: `predicted ± predicted_impact_std`
  - A qualitative label, e.g. `Aligned / Overstated / Understated` based on the gap.
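
The qualitative label in the last bullet could be derived from the gap relative to the claim, with a tolerance band around zero. The sketch below is one possible rule; the function name, the extra `Unknown` label and the 10% tolerance are hypothetical and should be tuned to the product's needs.

```python
def impact_label(claimed: float, predicted: float, tolerance: float = 0.10) -> str:
    """Classify a claim by its relative gap: (claimed - predicted) / claimed."""
    if claimed <= 0:
        return "Unknown"  # relative gap is undefined without a positive claim
    rel_gap = (claimed - predicted) / claimed
    if rel_gap > tolerance:
        return "Overstated"   # claim exceeds the prediction by more than the tolerance
    if rel_gap < -tolerance:
        return "Understated"  # prediction exceeds the claim by more than the tolerance
    return "Aligned"


for claimed, predicted in [(10_000, 9_500), (10_000, 6_000), (10_000, 12_500)]:
    print(claimed, predicted, impact_label(claimed, predicted))
```

Using a relative rather than absolute gap keeps the label comparable across projects of very different sizes.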
