# Prediction Baseline: Gradient Boosted Decision Trees

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ucid-foundation/ucid/blob/main/notebooks/09_prediction_baseline_gbdt.ipynb)

---

## Overview

This notebook trains a gradient boosted decision tree (GBDT) baseline that predicts UCID scores from spatial features. It covers:

1. Feature engineering from spatial data (synthetic in this notebook)
2. LightGBM model training
3. Model evaluation and interpretation
4. Score prediction pipeline

---

In [None]:
%pip install -q ucid lightgbm scikit-learn

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

import ucid

print(f"UCID version: {ucid.__version__}")

---

## 1. Feature Engineering

In [None]:
# Generate synthetic training data
np.random.seed(42)
n_samples = 1000

data = pd.DataFrame(
    {
        "poi_count": np.random.poisson(20, n_samples),
        "transit_stops": np.random.poisson(5, n_samples),
        "green_coverage": np.random.uniform(0, 0.5, n_samples),
        "intersection_density": np.random.uniform(50, 200, n_samples),
        "population_density": np.random.uniform(1000, 20000, n_samples),
    }
)

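# Generate target score: a linear combination of the features plus
# Gaussian noise (standard deviation 5), clipped to the range [0, 100]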
data["score"] = (
    0.3 * data["poi_count"]
    + 2 * data["transit_stops"]
    + 50 * data["green_coverage"]
    + 0.1 * data["intersection_density"]
    + 0.001 * data["population_density"]
    + np.random.normal(0, 5, n_samples)
).clip(0, 100)

print("Dataset shape:", data.shape)
data.head()
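As a sanity check on the synthetic target, its noise-free expected value follows from the means of the sampling distributions: a Poisson variable has mean equal to its rate, and a uniform variable on [a, b] has mean (a + b) / 2. The sketch below ignores the clipping step, which should rarely trigger for these parameters. The dictionary names are illustrative and do not appear elsewhere in the notebook.

```python
# Expected value of each feature under the sampling distributions used above
feature_means = {
    "poi_count": 20.0,                         # Poisson(20)
    "transit_stops": 5.0,                      # Poisson(5)
    "green_coverage": (0.0 + 0.5) / 2,         # Uniform(0, 0.5)
    "intersection_density": (50 + 200) / 2,    # Uniform(50, 200)
    "population_density": (1000 + 20000) / 2,  # Uniform(1000, 20000)
}

# Coefficients from the target formula
weights = {
    "poi_count": 0.3,
    "transit_stops": 2.0,
    "green_coverage": 50.0,
    "intersection_density": 0.1,
    "population_density": 0.001,
}

# Linearity of expectation: E[score] = sum of weight * E[feature]
# (the noise term has mean 0 and drops out)
expected_score = sum(weights[name] * feature_means[name] for name in weights)
print(f"Expected unclipped score: {expected_score:.1f}")
```

Comparing this value with `data["score"].mean()` is a quick way to confirm the generated data matches the intended recipe.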

---

## 2. Model Training

In [None]:
# Split data
features = [
    "poi_count",
    "transit_stops",
    "green_coverage",
    "intersection_density",
    "population_density",
]
X = data[features]
y = data["score"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Train: {len(X_train)}, Test: {len(X_test)}")

In [None]:
try:
    import lightgbm as lgb

    model = lgb.LGBMRegressor(n_estimators=100, random_state=42, verbose=-1)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)

    print(f"RMSE: {rmse:.2f}")
    print(f"R²: {r2:.3f}")
except ImportError:
    print("LightGBM not installed; run the %pip install cell above and re-run this cell")
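An RMSE is easier to interpret next to simple reference points. Because the target noise has standard deviation 5, a test RMSE much below 5 is not expected from any model. The sketch below fits two scikit-learn baselines, a mean predictor and a linear regression, on data built with the same recipe. It is an illustrative comparison rather than part of the original pipeline, and it uses `np.random.default_rng`, so the sampled values differ from the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Rebuild a synthetic dataset with the same recipe as the notebook
# (assumption: a fresh generator, so values differ from np.random.seed(42))
rng = np.random.default_rng(42)
n = 1000
X = pd.DataFrame(
    {
        "poi_count": rng.poisson(20, n),
        "transit_stops": rng.poisson(5, n),
        "green_coverage": rng.uniform(0, 0.5, n),
        "intersection_density": rng.uniform(50, 200, n),
        "population_density": rng.uniform(1000, 20000, n),
    }
)
y = (
    0.3 * X["poi_count"]
    + 2 * X["transit_stops"]
    + 50 * X["green_coverage"]
    + 0.1 * X["intersection_density"]
    + 0.001 * X["population_density"]
    + rng.normal(0, 5, n)
).clip(0, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Two reference models: always predict the training mean, and ordinary least squares
baselines = {
    "mean": DummyRegressor(strategy="mean"),
    "linear": LinearRegression(),
}

results = {}
for name, baseline in baselines.items():
    baseline.fit(X_train, y_train)
    predictions = baseline.predict(X_test)
    results[name] = float(np.sqrt(mean_squared_error(y_test, predictions)))
    print(f"{name:>6} baseline RMSE: {results[name]:.2f}")
```

Since the synthetic target is linear in the features, the linear baseline should land close to the noise floor here. A GBDT that does noticeably worse on this data is likely under-tuned or overfitting rather than limited by the features.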

---

## 3. Feature Importance

In [None]:
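try:
    # LGBMRegressor reports split counts by default (importance_type="split")
    importance = pd.DataFrame(
        {"feature": features, "importance": model.feature_importances_}
    ).sort_values("importance", ascending=False)
    print("Feature Importance:")
    print(importance)
except NameError:
    # `model` is undefined when the LightGBM import in the training cell failed
    print("Model not available")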

---

## Summary

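Key steps:
- **Feature Engineering**: Build a synthetic dataset of five spatial features and a noisy linear target score
- **GBDT Training**: Fit a LightGBM regressor (`LGBMRegressor`, 100 trees) on an 80/20 train/test split
- **Evaluation**: Report RMSE and R² on the held-out test set
- **Interpretation**: Rank features by the model's `feature_importances_`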

---

*Copyright 2026 UCID Foundation. Licensed under EUPL-1.2.*