# Baselines for DeepDemand: OLS Linear, Ridge (L2), Random Forest

This notebook constructs **edge-level features** from each edge's `od_use.feather` by taking **simple means** of the LSOA feature vectors of the unique origin and destination nodes observed in that OD table.

Key settings:
- No travel-time (`t_OD`) features
- No interaction term
- No OD-count feature
- Missing/empty `od_use.feather` => origin and destination summaries are set to zero vectors
- Evaluation:
  - Random 5-fold CV
  - Spatial 9-fold CV (by English region)
- Metrics: MAE, MGEH, R2 using `model.utils` (with the same `scaler` from `load_gt()`)


In [None]:
# path setting
import sys
from pathlib import Path
# Set the notebook's CWD to your repo root (adjust the path for your machine)
%cd D:/deepdemand
ROOT = Path.cwd()   # the repo root, so `config` and `model` are importable
sys.path.insert(0, str(ROOT))

In [16]:
import os
import numpy as np
import pandas as pd
import torch

from config import DATA, TRAINING
from model.dataloader import load_gt, load_json, get_lsoa_vector
import model.utils as utils


## 0) Reproducibility

In [17]:
np.random.seed(TRAINING['seed'])
torch.manual_seed(TRAINING['seed'])


<torch._C.Generator at 0x240ddcc0d50>

## 1) Load GT and build LSOA feature bank

In [18]:
# ---- GT (filtered + optionally normalized) ----
edge_to_gt, scaler = load_gt()
all_edge_ids = list(edge_to_gt.keys())
print('Edges:', len(all_edge_ids), 'Scaler:', type(scaler).__name__ if scaler else None)

# ---- LSOA JSON + node->LSOA mapping ----
lsoa_json = load_json(DATA['lsoa_json'])
node_to_lsoa = load_json('data/node_features/node_to_lsoa.json')

# ---- Build feature bank: {lsoa_code: np.ndarray(feat_dim,)} ----
lsoa_codes = sorted(lsoa_json.keys())
feat_rows = []
for code in lsoa_codes:
    v = get_lsoa_vector(lsoa_json[code])
    feat_rows.append(v.cpu().numpy())
X_lsoa = np.vstack(feat_rows).astype(np.float32)
feature_bank = {code: X_lsoa[i] for i, code in enumerate(lsoa_codes)}

feat_dim = X_lsoa.shape[1]
print('LSOA feature dim:', feat_dim)


Number of valid edges: 5088

=== GT Descriptive Statistics (raw) ===
Min     : 191.405
Max     : 113436.372
Mean    : 25243.410
Median  : 20618.627
Std     : 18893.461

Edges: 5088 Scaler: None
LSOA feature dim: 121


## 2) Edge-level feature construction (simple means)

For each target edge `e`:
- Load the OD table `od_use.feather` (columns `O`, `D`)
- Collect the unique origin nodes and unique destination nodes
- Compute the mean LSOA feature vector across unique origins -> XO
- Compute the mean LSOA feature vector across unique destinations -> XD
- Concatenate: `[XO || XD]`

If the OD table is missing or empty, XO and XD are set to all zeros.
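
The steps above can be illustrated on a toy example. This is a minimal sketch in plain Python rather than the notebook's NumPy code; `mean_vector`, `edge_feature`, and the two-dimensional `node_to_vec` bank are hypothetical names introduced only for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def edge_feature(od_pairs, node_to_vec, feat_dim):
    """Return [XO || XD] for one edge; all zeros when the OD table is empty."""
    # dict.fromkeys de-duplicates while preserving first-seen order.
    uniq_o = list(dict.fromkeys(o for o, _ in od_pairs))
    uniq_d = list(dict.fromkeys(d for _, d in od_pairs))
    zeros = [0.0] * feat_dim
    xo = mean_vector([node_to_vec[n] for n in uniq_o]) if uniq_o else zeros
    xd = mean_vector([node_to_vec[n] for n in uniq_d]) if uniq_d else zeros
    return xo + xd  # list concatenation, i.e. [XO || XD]

# Hypothetical 2-dimensional feature bank keyed by node id.
node_to_vec = {"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]}
# Node "a" appears twice as an origin but is counted once.
od_pairs = [("a", "c"), ("a", "b"), ("b", "c")]

features = edge_feature(od_pairs, node_to_vec, feat_dim=2)
empty_features = edge_feature([], node_to_vec, feat_dim=2)
print(features)        # [2.0, 3.0, 4.0, 5.0]
print(empty_features)  # [0.0, 0.0, 0.0, 0.0]
```

Because nodes are de-duplicated before averaging, a node contributes equally whether it appears in one OD pair or many, which is consistent with the "no OD-count feature" setting.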


In [19]:
SUBGRAPH_ROOT = 'data/subgraphs/subgraphs'

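def _get_lsoa_vec_from_node(node_id_str: str) -> np.ndarray:
    """Look up the LSOA feature vector for a node id."""
    # The first entry of node_to_lsoa[node] is used as the node's LSOA code.
    lsoa_code = node_to_lsoa[str(node_id_str)][0]
    return feature_bank[lsoa_code]

def build_one_edge_feature(edge_id: str) -> np.ndarray:
    """Return [XO || XD] for one edge; zeros if the OD table is missing, unreadable, or empty."""
    fpath = os.path.join(SUBGRAPH_ROOT, edge_id, 'od_use.feather')

    # Default to zero vectors; overwritten below when OD rows are available.
    XO = np.zeros((feat_dim,), dtype=np.float32)
    XD = np.zeros((feat_dim,), dtype=np.float32)

    if os.path.exists(fpath):
        try:
            df = pd.read_feather(fpath, columns=['O', 'D'])
        except Exception:
            # Treat an unreadable file the same as a missing one.
            df = None

        if df is not None and len(df) > 0:
            O = df['O'].astype(str).tolist()
            D = df['D'].astype(str).tolist()

            # De-duplicate while preserving first-seen order.
            uniq_O = list(dict.fromkeys(O))
            uniq_D = list(dict.fromkeys(D))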

            if len(uniq_O) > 0:
                O_mat = np.vstack([_get_lsoa_vec_from_node(n) for n in uniq_O]).astype(np.float32)
                XO = O_mat.mean(axis=0)

            if len(uniq_D) > 0:
                D_mat = np.vstack([_get_lsoa_vec_from_node(n) for n in uniq_D]).astype(np.float32)
                XD = D_mat.mean(axis=0)

    return np.concatenate([XO, XD], axis=0)

def build_dataset(edge_ids: list) -> tuple[np.ndarray, np.ndarray, list]:
    X_rows, y_rows, kept = [], [], []
    for eid in edge_ids:
        x = build_one_edge_feature(eid)
        X_rows.append(x)
        y_rows.append(float(edge_to_gt[eid]))
        kept.append(eid)
    X = np.vstack(X_rows).astype(np.float32)
    y = np.array(y_rows, dtype=np.float32)
    return X, y, kept


### Build full dataset once

In [20]:
X_all, y_all, kept_edges = build_dataset(all_edge_ids)
print('Built X:', X_all.shape, 'y:', y_all.shape)
edge_to_idx = {eid: i for i, eid in enumerate(kept_edges)}


Built X: (5088, 242) y: (5088,)


## 3) Models

- OLS Linear Regression
- Ridge Regression (L2)
- Random Forest Regression

We standardize features for OLS and Ridge; Random Forest does not require scaling.
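
Placing the scaler inside a `Pipeline` means its statistics are learned only from the rows passed to `fit`, so each CV fold standardizes with training-fold statistics. The sketch below shows that idea in plain Python. It follows what `StandardScaler` does by default as far as I understand it (population standard deviation, and a scale of 1 for constant columns); `fit_standardizer` and `transform` are hypothetical helper names.

```python
import math

def fit_standardizer(rows):
    """Per-column mean and population std (ddof=0) from training rows."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = []
    for col, mu in zip(columns, means):
        sd = math.sqrt(sum((v - mu) ** 2 for v in col) / n)
        # Constant columns get scale 1 so they map to 0 instead of dividing by 0.
        stds.append(sd if sd > 0 else 1.0)
    return means, stds

def transform(rows, means, stds):
    """Apply previously fitted statistics to any set of rows."""
    return [[(v - mu) / sd for v, mu, sd in zip(row, means, stds)] for row in rows]

train = [[1.0, 5.0], [3.0, 5.0]]   # second column is constant
test = [[5.0, 5.0]]

means, stds = fit_standardizer(train)      # statistics come from train only
train_z = transform(train, means, stds)
test_z = transform(test, means, stds)      # test rows reuse the train statistics
print(train_z)  # [[-1.0, 0.0], [1.0, 0.0]]
print(test_z)   # [[3.0, 0.0]]
```

The constant-column guard matters here because edges with a missing OD table contribute all-zero rows, and a feature that is zero everywhere in a fold would otherwise produce a division by zero.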


In [21]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor


In [22]:
def make_ols():
    return Pipeline([
        ('scaler', StandardScaler(with_mean=True, with_std=True)),
        ('ols', LinearRegression())
    ])

def make_ridge(alpha: float = 1.0):
    return Pipeline([
        ('scaler', StandardScaler(with_mean=True, with_std=True)),
        ('ridge', Ridge(alpha=alpha, random_state=TRAINING['seed']))
    ])

def make_rf(n_estimators: int = 500, max_depth=None, n_jobs: int = -1):
    return RandomForestRegressor(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=TRAINING['seed'],
        n_jobs=n_jobs,
    )


## 4) Metrics (via `model.utils`)

We evaluate in the same way as DeepDemand, passing the `scaler` from `load_gt()`.
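
For readers without access to `model.utils`, the three metrics can be sketched as follows. This is an illustration based on the standard definitions, not the repository's implementation: how `utils.MAE`, `utils.MGEH`, and `utils.R_square` handle the `scaler`, units, or edge cases is not shown in this notebook, and the zero-denominator guard below is my own assumption.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_geh(y_true, y_pred):
    """Mean of GEH = sqrt(2 * (p - t)^2 / (p + t)) over all samples."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        denom = t + p
        # Assumption: contribute 0 when the denominator is not positive
        # (both volumes zero, or a negative prediction).
        total += math.sqrt(2.0 * (p - t) ** 2 / denom) if denom > 0 else 0.0
    return total / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 300.0]

mae_value = mae(y_true, y_pred)
geh_value = mean_geh(y_true, y_pred)
r2_value = r_squared(y_true, y_pred)
print(round(mae_value, 4), round(geh_value, 4), round(r2_value, 4))
```

Under this definition R2 is negative whenever the model's squared error exceeds that of predicting the mean of the evaluated targets, which is how the negative test R2 values for OLS and Ridge further down should be read.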


In [23]:
def eval_metrics(y_true_np: np.ndarray, y_pred_np: np.ndarray):
    yt = torch.tensor(y_true_np, dtype=torch.float32)
    yp = torch.tensor(y_pred_np, dtype=torch.float32)
    return {
        'MAE': utils.MAE(yt, yp, scaler).item(),
        'MGEH': utils.MGEH(yt, yp, scaler).item(),
        'R2': utils.R_square(yt, yp, scaler).item(),
    }


## 5) CV runners

- Random 5-fold CV: `utils.get_cv_split`
- Spatial 9-fold CV: `utils.get_spatial_cv_split`
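
The two splitting schemes can be sketched as below. This is a hypothetical stand-in, not the code behind `utils.get_cv_split` or `utils.get_spatial_cv_split`: the function names `kfold_split` and `region_split`, the `id_to_region` lookup, and the region labels are all assumptions made for illustration.

```python
import random

def kfold_split(ids, k, fold_idx, seed):
    """Shuffle ids with a fixed seed and hold out every k-th id starting at fold_idx."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[fold_idx::k]
    test_set = set(test)
    train = [i for i in shuffled if i not in test_set]
    return train, test

def region_split(ids, id_to_region, held_out_region):
    """Hold out every id that belongs to one region; train on the rest."""
    test = [i for i in ids if id_to_region[i] == held_out_region]
    train = [i for i in ids if id_to_region[i] != held_out_region]
    return train, test

ids = [f"e{i}" for i in range(10)]
# Hypothetical edge -> region lookup with two regions of unequal size.
id_to_region = {eid: ("north" if n < 4 else "south") for n, eid in enumerate(ids)}

folds = [kfold_split(ids, k=5, fold_idx=f, seed=0) for f in range(5)]
train_sp, test_sp = region_split(ids, id_to_region, "north")
print(test_sp)  # ['e0', 'e1', 'e2', 'e3']
```

Random folds are roughly equal in size and mix all regions, while region folds test on an area the model never saw and can differ a lot in size (the first region in the run below holds out 143 of 5088 edges).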


In [24]:
def ids_to_indices(ids: list[str]) -> np.ndarray:
    return np.array([edge_to_idx[e] for e in ids if e in edge_to_idx], dtype=np.int64)

def run_cv(model_name: str, model_factory, split_type: str):
    # Pick the fold indices and the splitter for the requested scheme.
    if split_type == 'kfold5':
        fold_list = list(range(5))

        def get_split(fold_idx):
            return utils.get_cv_split(
                kept_edges,
                k=5,
                fold_idx=fold_idx,
                seed=TRAINING['seed'],
            )

    elif split_type == 'spatial9':
        fold_list = list(range(1, 10))

        def get_split(fold_idx):
            return utils.get_spatial_cv_split(
                kept_edges,
                fold_idx=fold_idx,
            )

    else:
        raise ValueError('split_type must be kfold5 or spatial9')

    results = []
    for fold_idx in fold_list:
        train_ids, test_ids = get_split(fold_idx)
        tr = ids_to_indices(train_ids)
        te = ids_to_indices(test_ids)

        # Fit a fresh model per fold and score it on both partitions.
        model = model_factory()
        model.fit(X_all[tr], y_all[tr])
        pred_tr = model.predict(X_all[tr])
        pred_te = model.predict(X_all[te])

        m_tr = eval_metrics(y_all[tr], pred_tr)
        m_te = eval_metrics(y_all[te], pred_te)

        results.append({
            'split': split_type,
            'fold': fold_idx,
            'model': model_name,
            'train_MAE': m_tr['MAE'],
            'train_MGEH': m_tr['MGEH'],
            'train_R2': m_tr['R2'],
            'test_MAE': m_te['MAE'],
            'test_MGEH': m_te['MGEH'],
            'test_R2': m_te['R2'],
        })

        print(f'[{split_type}] {model_name} fold={fold_idx} test_MGEH={m_te["MGEH"]:.4f} test_R2={m_te["R2"]:.4f}')

    return pd.DataFrame(results)


## 6) Run baselines

In [25]:
RIDGE_ALPHA = 1.0
RF_TREES = 500
RF_MAX_DEPTH = None

ols_factory   = lambda: make_ols()
ridge_factory = lambda: make_ridge(alpha=RIDGE_ALPHA)
rf_factory    = lambda: make_rf(n_estimators=RF_TREES, max_depth=RF_MAX_DEPTH)

# --- Random 5-fold ---
df_ols_k5   = run_cv('OLS',   ols_factory,   'kfold5')
df_ridge_k5 = run_cv('Ridge', ridge_factory, 'kfold5')
df_rf_k5    = run_cv('RF',    rf_factory,    'kfold5')

# --- Spatial 9-fold ---
df_ols_sp   = run_cv('OLS',   ols_factory,   'spatial9')
df_ridge_sp = run_cv('Ridge', ridge_factory, 'spatial9')
df_rf_sp    = run_cv('RF',    rf_factory,    'spatial9')

df_all = pd.concat([df_ols_k5, df_ridge_k5, df_rf_k5, df_ols_sp, df_ridge_sp, df_rf_sp], ignore_index=True)
df_all.head()


[kfold5] OLS fold=0 test_MGEH=89.6816 test_R2=-0.0821
[kfold5] OLS fold=1 test_MGEH=89.9197 test_R2=0.1376
[kfold5] OLS fold=2 test_MGEH=89.5582 test_R2=0.1037
[kfold5] OLS fold=3 test_MGEH=89.8252 test_R2=-0.0081
[kfold5] OLS fold=4 test_MGEH=88.1109 test_R2=-0.1571
[kfold5] Ridge fold=0 test_MGEH=89.4852 test_R2=-0.0636
[kfold5] Ridge fold=1 test_MGEH=89.7559 test_R2=0.1430
[kfold5] Ridge fold=2 test_MGEH=89.3023 test_R2=0.1126
[kfold5] Ridge fold=3 test_MGEH=89.3838 test_R2=0.0075
[kfold5] Ridge fold=4 test_MGEH=88.1455 test_R2=-0.1409
[kfold5] RF fold=0 test_MGEH=56.4104 test_R2=0.6348
[kfold5] RF fold=1 test_MGEH=56.0289 test_R2=0.6575
[kfold5] RF fold=2 test_MGEH=57.0083 test_R2=0.6479
[kfold5] RF fold=3 test_MGEH=55.6613 test_R2=0.6441
[kfold5] RF fold=4 test_MGEH=55.0646 test_R2=0.6213
[Spatial CV] Validation region: E12000001
[Spatial CV] #val_edges = 143, #train_edges = 4945
[spatial9] OLS fold=1 test_MGEH=77.4586 test_R2=-0.4385
[Spatial CV] Validation region: E12000002
[Spa... (output truncated)

| | split | fold | model | train_MAE | train_MGEH | train_R2 | test_MAE | test_MGEH | test_R2 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | kfold5 | 0 | OLS | 13389.15332 | 85.18866 | 0.205887 | 14285.084961 | 89.681557 | -0.082121 |
| 1 | kfold5 | 1 | OLS | 13339.666016 | 85.244499 | 0.197533 | 14217.740234 | 89.919655 | 0.13761 |
| 2 | kfold5 | 2 | OLS | 13386.791016 | 85.547836 | 0.198674 | 14090.71582 | 89.558212 | 0.103691 |
| 3 | kfold5 | 3 | OLS | 13237.697266 | 84.62204 | 0.204096 | 14440.417969 | 89.825241 | -0.008055 |
| 4 | kfold5 | 4 | OLS | 13276.297852 | 84.857819 | 0.211543 | 14238.849609 | 88.110893 | -0.157149 |


## 7) Summary across folds

In [26]:
def summarize(df: pd.DataFrame):
    metrics = ['train_MAE','train_MGEH','train_R2','test_MAE','test_MGEH','test_R2']
    g = df.groupby(['split','model'])[metrics]
    mean = g.mean().add_suffix('_mean')
    std  = g.std(ddof=1).add_suffix('_std')
    out = pd.concat([mean, std], axis=1).reset_index()
    return out

summary = summarize(df_all)
summary


| | split | model | train_MAE_mean | train_MGEH_mean | train_R2_mean | test_MAE_mean | test_MGEH_mean | test_R2_mean | train_MAE_std | train_MGEH_std | train_R2_std | test_MAE_std | test_MGEH_std | test_R2_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | kfold5 | OLS | 13325.921094 | 85.092171 | 0.203546 | 14254.561719 | 89.419112 | -0.001205 | 67.339243 | 0.359197 | 0.005693 | 126.414383 | 0.744168 | 0.123679 |
| 1 | kfold5 | RF | 3440.684424 | 26.013625 | 0.933649 | 8388.933398 | 56.034692 | 0.641107 | 15.583644 | 0.165597 | 0.000639 | 62.355719 | 0.736474 | 0.013729 |
| 2 | kfold5 | Ridge | 13332.237695 | 85.105069 | 0.202771 | 14207.556445 | 89.214549 | 0.011719 | 69.047982 | 0.385109 | 0.005744 | 116.796758 | 0.621609 | 0.118728 |
| 3 | spatial9 | OLS | 13325.464301 | 85.139524 | 0.19871 | 14813.15625 | 92.48632 | -0.095923 | 222.926806 | 0.903363 | 0.005263 | 3186.239433 | 11.096831 | 0.190004 |
| 4 | spatial9 | RF | 3340.237061 | 25.379456 | 0.935707 | 10211.305122 | 65.69724 | 0.452117 | 46.77331 | 0.26106 | 0.002544 | 2021.438857 | 8.570271 | 0.076875 |
| 5 | spatial9 | Ridge | 13328.38878 | 85.146649 | 0.198125 | 14780.136393 | 92.335804 | -0.084203 | 224.173235 | 0.910478 | 0.005263 | 3215.217252 | 11.250688 | 0.18227 |


## 8) Save outputs

In [27]:
os.makedirs('eval/baselines', exist_ok=True)
df_all.to_csv('eval/baselines/baseline_ols_ridge_rf_simplemean_all_folds.csv', index=False)
summary.to_csv('eval/baselines/baseline_ols_ridge_rf_simplemean_summary.csv', index=False)
print('Saved:')
print(' - eval/baselines/baseline_ols_ridge_rf_simplemean_all_folds.csv')
print(' - eval/baselines/baseline_ols_ridge_rf_simplemean_summary.csv')


Saved:
 - eval/baselines/baseline_ols_ridge_rf_simplemean_all_folds.csv
 - eval/baselines/baseline_ols_ridge_rf_simplemean_summary.csv
