# Gravity-interaction baseline for DeepDemand (distance-decay gravity regression)

This notebook:
- Loads the ground truth (GT) via `load_gt()` (same filtering/normalization as DeepDemand)
- Loads **raw** LSOA features JSON (same structure as normalized)
- Defines mass per LSOA: `mass = population_total(level) + employment_total(level)`
- For each edge, aggregates OD pairs: `S_e(gamma) = sum( M_O * M_D * exp(-gamma * t_OD) )`
- Fits log-linear regression: `log(y) = a0 + a1 * log(S + eps)`
- Chooses gamma per fold via a small inner split on training edges
- Evaluates 5-fold CV and 9-fold spatial CV
- Reports train/test metrics: MGEH, MAE, R2 using your `utils` definitions
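
Written out, the score and regression listed above take the following form. The notation $\mathcal{P}_e$ for the set of OD pairs cached for edge $e$ is an assumption introduced here for readability; $\varepsilon_S$ and $\varepsilon_y$ correspond to the small constants `EPS_S` and `EPS_Y` used in the code, and $n_e$ is the number of OD pairs (`n_od`).

$$
S_e(\gamma) = \sum_{(O,D)\in\mathcal{P}_e} M_O \, M_D \, e^{-\gamma\, t_{OD}},
\qquad
\log\bigl(y_e + \varepsilon_y\bigr) = a_0 + a_1 \log\bigl(S_e(\gamma) + \varepsilon_S\bigr)
$$

$$
\hat{y}_e =
\begin{cases}
\exp\bigl(a_0 + a_1 \log(S_e(\gamma) + \varepsilon_S)\bigr) & \text{if } n_e > 0,\\
0 & \text{if } n_e = 0.
\end{cases}
$$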


In [None]:
# path setting
import sys
from pathlib import Path
# set the notebook's CWD to your repo root (edit this path for your machine)
%cd D:/deepdemand
ROOT = Path.cwd()   # repo root, so that `config` and `model` are importable
sys.path.insert(0, str(ROOT))

In [3]:
import os
import json
import numpy as np
import pandas as pd
import torch

from config import DATA, TRAINING
from model.dataloader import load_gt, load_json
import model.utils as utils

from sklearn.linear_model import LinearRegression


## 0) Reproducibility

In [4]:
np.random.seed(TRAINING['seed'])
torch.manual_seed(TRAINING['seed'])


<torch._C.Generator at 0x1f5cd5b4c50>

## 1) Load GT and RAW LSOA features

You must set `RAW_LSOA_JSON_PATH` to your raw-feature JSON file.
It has the same structure as the normalized file.


In [5]:
# ---- GT (filtered; normalized only when load_gt() returns a scaler) ----
edge_to_gt, scaler = load_gt()
all_edge_ids = list(edge_to_gt.keys())
print('Edges:', len(all_edge_ids), 'Scaler:', type(scaler).__name__ if scaler else None)

# ---- node -> LSOA mapping ----
node_to_lsoa = load_json('data/node_features/node_to_lsoa.json')

# ---- RAW LSOA features ----
# Update this path to your raw JSON.
RAW_LSOA_JSON_PATH = 'data/node_features/lsoa21_features_raw.json'
raw_lsoa_json = load_json(RAW_LSOA_JSON_PATH)
print('Raw LSOAs:', len(raw_lsoa_json))


Number of valid edges: 5088

=== GT Descriptive Statistics (raw) ===
Min     : 191.405
Max     : 113436.372
Mean    : 25243.410
Median  : 20618.627
Std     : 18893.461

Edges: 5088 Scaler: None
Raw LSOAs: 35672


## 2) Define mass per LSOA

Mass is `population_total(level) + employment_total(level)`.

We use the level specified in config:
- `DATA['population_level']` (e.g. 'lv3')
- `DATA['employment_level']` (e.g. 'lv3')

In the raw feature JSON, element `[0]` of each level's array is the total, so that is the value we read.
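
As a quick illustration of the mass definition, here is a minimal sketch on a made-up record. The record contents and the helper names `total_at_level` and `example_mass` are hypothetical and exist only for this example; the notebook's own `lsoa_mass` in the next cell is what is actually used.

```python
# Hypothetical raw record; the numbers are invented for illustration.
example_rec = {
    "population": {"lv3": [1500.0, 700.0, 800.0]},
    "employment": {"lv3": [400.0, 250.0, 150.0]},
}

def total_at_level(rec: dict, key: str, level: str) -> float:
    """Return element [0] of rec[key][level], or 0.0 if anything is missing."""
    block = rec.get(key)
    if not isinstance(block, dict):
        return 0.0
    arr = block.get(level, [])
    return float(arr[0]) if len(arr) > 0 else 0.0

def example_mass(rec: dict, level: str = "lv3") -> float:
    """Mass = population total + employment total at the given level."""
    return total_at_level(rec, "population", level) + total_at_level(rec, "employment", level)

print(example_mass(example_rec))  # 1900.0
print(example_mass({}))           # 0.0 (missing fields fall back to zero)
```

Falling back to zero for missing fields mirrors the defensive checks in `lsoa_mass`, so an LSOA with incomplete data contributes nothing to the gravity score instead of raising an error.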


In [6]:
POP_LEVEL = DATA.get('population_level', 'lv3')
EMP_LEVEL = DATA.get('employment_level', 'lv3')
print('Using mass levels:', POP_LEVEL, EMP_LEVEL)

def lsoa_mass(rec: dict) -> float:
    pop = 0.0
    emp = 0.0
    if isinstance(rec.get('population'), dict):
        arr = rec['population'].get(POP_LEVEL, [])
        if len(arr) > 0:
            pop = float(arr[0])
    if isinstance(rec.get('employment'), dict):
        arr = rec['employment'].get(EMP_LEVEL, [])
        if len(arr) > 0:
            emp = float(arr[0])
    return pop + emp

# Precompute LSOA -> mass
lsoa_to_mass = {code: lsoa_mass(rec) for code, rec in raw_lsoa_json.items()}

masses = np.array(list(lsoa_to_mass.values()), dtype=np.float64)
print('Mass stats: min', masses.min(), 'mean', masses.mean(), 'max', masses.max())


Using mass levels: lv3 lv3
Mass stats: min 991.0 mean 2496.8032069970845 max 392068.0


## 3) Preload OD data per edge (so we don’t re-read files for every gamma in the grid)

For each edge_id we cache:
- `n_od`
- arrays of `mO`, `mD`, `t` for each OD pair

If `od_use.feather` is missing/empty, we treat it as `n_od=0` and later predict 0.


In [7]:
SUBGRAPH_ROOT = 'data/subgraphs/subgraphs'

def node_mass(node_id_str: str) -> float:
    lsoa = node_to_lsoa[str(node_id_str)][0]
    return float(lsoa_to_mass.get(lsoa, 0.0))

edge_cache = {}  # edge_id -> dict(n_od, mO, mD, t)
missing_or_empty = 0

for eid in all_edge_ids:
    fpath = os.path.join(SUBGRAPH_ROOT, eid, 'od_use.feather')
    if not os.path.exists(fpath):
        edge_cache[eid] = {'n_od': 0, 'mO': None, 'mD': None, 't': None}
        missing_or_empty += 1
        continue
    try:
        df = pd.read_feather(fpath, columns=['O','D','t_OD'])
    except Exception:
        edge_cache[eid] = {'n_od': 0, 'mO': None, 'mD': None, 't': None}
        missing_or_empty += 1
        continue

    if len(df) == 0:
        edge_cache[eid] = {'n_od': 0, 'mO': None, 'mD': None, 't': None}
        missing_or_empty += 1
        continue

    O = df['O'].astype(str).tolist()
    D = df['D'].astype(str).tolist()
    t = df['t_OD'].to_numpy(dtype=np.float64)  # use float64 for stability in exp

    mO = np.array([node_mass(o) for o in O], dtype=np.float64)
    mD = np.array([node_mass(d) for d in D], dtype=np.float64)

    edge_cache[eid] = {'n_od': int(len(df)), 'mO': mO, 'mD': mD, 't': t}

print('Cached edges:', len(edge_cache))
print('Edges with missing/empty OD:', missing_or_empty)


Cached edges: 5088
Edges with missing/empty OD: 504


## 4) Gravity score and log-linear regression helper

- For given gamma, compute `S_e(gamma)` for each edge.
- Fit `log(y) ~ log(S + eps)` on edges with `n_od>0`.
- Predict:
  - if `n_od==0` => pred=0 (match DeepDemand)
  - else => exp(a0 + a1 * log(S+eps))

We evaluate metrics using your `utils` by converting raw predictions back to normalized space (if scaler exists).
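
To make the two steps concrete, here is a small self-contained sketch on synthetic data. All values are made up, and `np.polyfit` is swapped in for scikit-learn's `LinearRegression` purely to keep the example dependency-light; for a single feature both fit the same least-squares line. The epsilon terms are omitted because every toy score is strictly positive.

```python
import numpy as np

def gravity_score(m_o, m_d, t, gamma):
    """S = sum(m_o * m_d * exp(-gamma * t)) over one edge's OD pairs."""
    return float(np.sum(m_o * m_d * np.exp(-gamma * t)))

# Synthetic OD pairs for three toy edges (all values are invented).
rng = np.random.default_rng(0)
gamma = 1e-4
scores = []
for n_pairs in (5, 20, 80):
    m_o = rng.uniform(1_000, 3_000, n_pairs)   # origin masses
    m_d = rng.uniform(1_000, 3_000, n_pairs)   # destination masses
    t = rng.uniform(600, 7_200, n_pairs)       # travel times in seconds
    scores.append(gravity_score(m_o, m_d, t, gamma))
S = np.array(scores)

# Build targets that follow an exact power law y = exp(a0) * S**a1,
# so the log-log fit should recover a0 and a1.
a0_true, a1_true = 2.0, 0.5
y = np.exp(a0_true) * S ** a1_true

# polyfit returns coefficients from highest degree down: [slope, intercept].
a1_hat, a0_hat = np.polyfit(np.log(S), np.log(y), deg=1)
y_hat = np.exp(a0_hat + a1_hat * np.log(S))

print("recovered a0, a1:", round(a0_hat, 6), round(a1_hat, 6))
```

Because the regression is linear in log space, `a1` acts as an elasticity: a 1% increase in the gravity score changes the predicted flow by roughly `a1` percent.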


In [8]:
EPS_S = 1e-12  # for log(S+eps)
EPS_Y = 1e-12  # for log(y+eps)

def gravity_score_for_edges(edge_ids: list[str], gamma: float) -> np.ndarray:
    S = np.zeros((len(edge_ids),), dtype=np.float64)
    for i, eid in enumerate(edge_ids):
        rec = edge_cache[eid]
        if rec['n_od'] == 0:
            S[i] = 0.0
            continue
        mO, mD, t = rec['mO'], rec['mD'], rec['t']
        # sum( mO*mD*exp(-gamma*t) )
        S[i] = np.sum((mO * mD) * np.exp(-gamma * t))
    return S

def fit_loglinear_on_train(edge_ids_train: list[str], y_train_raw: np.ndarray, S_train: np.ndarray):
    # fit only on edges that have at least one OD pair (y itself is not filtered)
    n_od = np.array([edge_cache[e]['n_od'] for e in edge_ids_train], dtype=np.int64)
    mask = (n_od > 0)
    # EPS_S / EPS_Y keep the logs finite if S or a GT value is exactly 0
    X = np.log(S_train[mask] + EPS_S).reshape(-1, 1)
    y = np.log(y_train_raw[mask] + EPS_Y)

    model = LinearRegression()
    model.fit(X, y)
    return model

def predict_raw(edge_ids: list[str], S: np.ndarray, model: LinearRegression) -> np.ndarray:
    n_od = np.array([edge_cache[e]['n_od'] for e in edge_ids], dtype=np.int64)
    pred = np.zeros((len(edge_ids),), dtype=np.float64)
    mask = (n_od > 0)
    if np.any(mask):
        X = np.log(S[mask] + EPS_S).reshape(-1, 1)
        logy = model.predict(X)
        pred[mask] = np.exp(logy)
    # n_od==0 stays 0
    pred = np.clip(pred, 0.0, None)
    return pred

def raw_to_norm(y_raw: np.ndarray) -> np.ndarray:
    if scaler is None:
        return y_raw.astype(np.float32)
    yt = torch.tensor(y_raw, dtype=torch.float32)
    yn = scaler.transform(yt).cpu().numpy()
    return yn.astype(np.float32)

def norm_to_raw(y_norm: np.ndarray) -> np.ndarray:
    if scaler is None:
        return y_norm.astype(np.float64)
    yt = torch.tensor(y_norm, dtype=torch.float32)
    yr = scaler.inverse_transform(yt).cpu().numpy()
    return yr.astype(np.float64)

def eval_metrics(y_true_norm: np.ndarray, y_pred_norm: np.ndarray):
    yt = torch.tensor(y_true_norm, dtype=torch.float32)
    yp = torch.tensor(y_pred_norm, dtype=torch.float32)
    return {
        'MAE': utils.MAE(yt, yp, scaler).item(),
        'MGEH': utils.MGEH(yt, yp, scaler).item(),
        'R2': utils.R_square(yt, yp, scaler).item(),
    }


## 5) Choose gamma via inner split (train-only)

We do a simple inner holdout: 80% inner-train, 20% inner-val sampled from the outer training fold.
We select gamma minimizing inner-val MGEH (you can change to MAE if you prefer).
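
The selection logic can be sketched in isolation as follows. This is a simplified stand-in, not the notebook's implementation: `inner_split`, `mean_geh`, and `pick_best` are hypothetical names, and `mean_geh` assumes the textbook GEH statistic, which may differ from what `utils.MGEH` computes.

```python
import numpy as np

def inner_split(n: int, val_frac: float = 0.2, seed: int = 0):
    """Shuffle indices 0..n-1 and return (train_idx, val_idx)."""
    idx = np.random.default_rng(seed).permutation(n)
    n_val = max(1, int(n * val_frac))
    return idx[n_val:], idx[:n_val]

def mean_geh(y_true, y_pred):
    """Mean GEH statistic: sqrt(2 * (pred - true)^2 / (pred + true)).

    ASSUMPTION: textbook GEH formula; utils.MGEH may be defined differently.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.maximum(y_true + y_pred, 1e-12)  # guard against 0/0
    return float(np.mean(np.sqrt(2.0 * (y_pred - y_true) ** 2 / denom)))

def pick_best(grid, score_fn):
    """Return (best_value, best_score) for the grid value with the lowest score."""
    scores = [score_fn(g) for g in grid]
    best = int(np.argmin(scores))   # first minimum wins on ties
    return float(grid[best]), float(scores[best])

train_idx, val_idx = inner_split(10, val_frac=0.2, seed=42)
print(len(train_idx), len(val_idx))   # 8 2
print(mean_geh([0.0], [8.0]))         # 4.0
```

Taking the first minimum on ties matches the strict `<` comparison in `choose_gamma_inner`, so the smallest gamma wins when several grid values score equally.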


In [9]:
def choose_gamma_inner(train_ids: list[str], y_train_norm: np.ndarray, gamma_grid: np.ndarray, inner_frac: float = 0.2):
    rng = np.random.default_rng(TRAINING['seed'])
    idx = np.arange(len(train_ids))
    rng.shuffle(idx)
    n_val = max(1, int(len(idx) * inner_frac))
    idx_val = idx[:n_val]
    idx_tr  = idx[n_val:]
    if len(idx_tr) == 0:
        idx_tr = idx_val

    tr_ids = [train_ids[i] for i in idx_tr]
    va_ids = [train_ids[i] for i in idx_val]

    y_tr_raw = norm_to_raw(y_train_norm[idx_tr])
    y_va_norm = y_train_norm[idx_val]

    best_gamma = None
    best_score = float('inf')

    for gamma in gamma_grid:
        S_tr = gravity_score_for_edges(tr_ids, float(gamma))
        model = fit_loglinear_on_train(tr_ids, y_tr_raw, S_tr)

        S_va = gravity_score_for_edges(va_ids, float(gamma))
        pred_va_raw = predict_raw(va_ids, S_va, model)
        pred_va_norm = raw_to_norm(pred_va_raw)

        m = eval_metrics(y_va_norm, pred_va_norm)
        score = m['MGEH']  # selection criterion
        if score < best_score:
            best_score = score
            best_gamma = float(gamma)

    return best_gamma, best_score


## 6) Outer CV runner (5-fold and spatial 9-fold)

We evaluate:
- 5-fold CV using your `utils.get_cv_split`
- 9-fold spatial CV using your `utils.get_spatial_cv_split`
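
For readers without access to `utils`, the idea behind the random 5-fold split can be sketched as below. `kfold_split` is a hypothetical stand-in written for this example; the real `utils.get_cv_split` may shuffle and partition differently, so the fold memberships will not match.

```python
import random

def kfold_split(ids, k, fold_idx, seed):
    """Shuffle ids once with `seed`, then hold out the fold_idx-th of k slices."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[fold_idx::k]          # every k-th element starting at fold_idx
    held_out = set(test)
    train = [e for e in shuffled if e not in held_out]
    return train, test

toy_ids = [f"edge_{i}" for i in range(10)]
for fold in range(5):
    train, test = kfold_split(toy_ids, k=5, fold_idx=fold, seed=0)
    print(fold, len(train), len(test))
```

Using the same seed for every fold is what guarantees that the five test sets are disjoint and together cover every edge exactly once.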


In [10]:
# Prepare aligned y arrays
kept_edges = all_edge_ids
y_all_norm = np.array([float(edge_to_gt[e]) for e in kept_edges], dtype=np.float32)

def run_gravity_cv(split_type: str, gamma_grid: np.ndarray):
    rows = []

    if split_type == 'kfold5':
        fold_list = list(range(5))
        split_fn = lambda fold: utils.get_cv_split(kept_edges, k=5, fold_idx=fold, seed=TRAINING['seed'])
    elif split_type == 'spatial9':
        fold_list = list(range(1, 10))
        split_fn = lambda fold: utils.get_spatial_cv_split(kept_edges, fold_idx=fold)
    else:
        raise ValueError('split_type must be kfold5 or spatial9')

    edge_to_pos = {e:i for i,e in enumerate(kept_edges)}

    for fold in fold_list:
        train_ids, test_ids = split_fn(fold)

        tr_idx = np.array([edge_to_pos[e] for e in train_ids], dtype=np.int64)
        te_idx = np.array([edge_to_pos[e] for e in test_ids], dtype=np.int64)

        y_tr_norm = y_all_norm[tr_idx]
        y_te_norm = y_all_norm[te_idx]

        # --- pick gamma using inner split on training edges ---
        best_gamma, inner_score = choose_gamma_inner(train_ids, y_tr_norm, gamma_grid)

        # --- fit on full training fold with chosen gamma ---
        S_tr = gravity_score_for_edges(train_ids, best_gamma)
        y_tr_raw = norm_to_raw(y_tr_norm)
        model = fit_loglinear_on_train(train_ids, y_tr_raw, S_tr)

        # --- predict train/test ---
        pred_tr_raw = predict_raw(train_ids, S_tr, model)
        pred_tr_norm = raw_to_norm(pred_tr_raw)

        S_te = gravity_score_for_edges(test_ids, best_gamma)
        pred_te_raw = predict_raw(test_ids, S_te, model)
        pred_te_norm = raw_to_norm(pred_te_raw)

        m_tr = eval_metrics(y_tr_norm, pred_tr_norm)
        m_te = eval_metrics(y_te_norm, pred_te_norm)

        rows.append({
            'split': split_type,
            'fold': fold,
            'model': 'GravityLogLinear',
            'gamma': best_gamma,
            'inner_val_MGEH': inner_score,
            'train_MAE': m_tr['MAE'],
            'train_MGEH': m_tr['MGEH'],
            'train_R2': m_tr['R2'],
            'test_MAE': m_te['MAE'],
            'test_MGEH': m_te['MGEH'],
            'test_R2': m_te['R2'],
        })

        print(f'[{split_type}] fold={fold} gamma={best_gamma:.3e} test_MGEH={m_te["MGEH"]:.4f} test_R2={m_te["R2"]:.4f}')

    return pd.DataFrame(rows)


## 7) Run gravity baseline

The gamma grid is tunable. Since `t_OD` is in **seconds**, gamma has units of 1/seconds.
A reasonable starting grid is log-spaced from 1e-7 to 1e-3 (17 values, four per decade).
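
One way to judge whether a gamma value is plausible is to convert it to a decay half-life, the travel time at which `exp(-gamma * t)` falls to 0.5, which equals `ln(2) / gamma`. The helper name below is hypothetical and exists only for this illustration.

```python
import math

def decay_half_life_seconds(gamma: float) -> float:
    """Travel time (seconds) at which exp(-gamma * t) drops to 0.5."""
    return math.log(2.0) / gamma

for gamma in (1e-7, 1e-5, 1e-3):
    hl = decay_half_life_seconds(gamma)
    print(f"gamma={gamma:.0e}  half-life={hl:,.0f} s  ({hl / 3600:.2f} h)")
```

At the top of the grid (1e-3) the weight halves roughly every 693 seconds, about 11.6 minutes. At the bottom (1e-7) the half-life is about 6.9 million seconds, roughly 80 days, so over realistic travel times the decay term is close to 1 and the score reduces to an almost unweighted sum of mass products.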


In [11]:
gamma_grid = np.logspace(-7, -3, 17)  # 17 values
print('Gamma grid:', gamma_grid)

df_k5 = run_gravity_cv('kfold5', gamma_grid)
df_sp = run_gravity_cv('spatial9', gamma_grid)

df_all = pd.concat([df_k5, df_sp], ignore_index=True)
df_all.head()


Gamma grid: [1.00000000e-07 1.77827941e-07 3.16227766e-07 5.62341325e-07
 1.00000000e-06 1.77827941e-06 3.16227766e-06 5.62341325e-06
 1.00000000e-05 1.77827941e-05 3.16227766e-05 5.62341325e-05
 1.00000000e-04 1.77827941e-04 3.16227766e-04 5.62341325e-04
 1.00000000e-03]
[kfold5] fold=0 gamma=1.000e-07 test_MGEH=65.8560 test_R2=0.5350
[kfold5] fold=1 gamma=1.778e-04 test_MGEH=65.5835 test_R2=0.5648
[kfold5] fold=2 gamma=3.162e-04 test_MGEH=65.7102 test_R2=0.5648
[kfold5] fold=3 gamma=1.778e-04 test_MGEH=65.7839 test_R2=0.5360
[kfold5] fold=4 gamma=1.000e-04 test_MGEH=65.0893 test_R2=0.5351
[Spatial CV] Validation region: E12000001
[Spatial CV] #val_edges = 143, #train_edges = 4945
[spatial9] fold=1 gamma=1.000e-07 test_MGEH=66.2016 test_R2=0.1432
[Spatial CV] Validation region: E12000002
[Spatial CV] #val_edges = 608, #train_edges = 4480
[spatial9] fold=2 gamma=1.000e-07 test_MGEH=72.2736 test_R2=0.4168
[Spatial CV] Validation region: E12000003
[Spatial CV] #val_edges = 580, #train_ed

| | split | fold | model | gamma | inner_val_MGEH | train_MAE | train_MGEH | train_R2 | test_MAE | test_MGEH | test_R2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | kfold5 | 0 | GravityLogLinear | 1e-07 | 62.142414 | 9386.25293 | 65.242035 | 0.545796 | 9300.935547 | 65.856033 | 0.534991 |
| 1 | kfold5 | 1 | GravityLogLinear | 0.0001778279 | 62.138081 | 9386.032227 | 65.499939 | 0.54357 | 9381.845703 | 65.583466 | 0.564786 |
| 2 | kfold5 | 2 | GravityLogLinear | 0.0003162278 | 63.098816 | 9438.019531 | 65.752205 | 0.549133 | 9322.496094 | 65.710228 | 0.564762 |
| 3 | kfold5 | 3 | GravityLogLinear | 0.0001778279 | 65.616547 | 9362.286133 | 65.527046 | 0.549604 | 9546.560547 | 65.783852 | 0.53596 |
| 4 | kfold5 | 4 | GravityLogLinear | 0.0001 | 66.180199 | 9297.393555 | 65.23555 | 0.555723 | 9441.710938 | 65.089348 | 0.535093 |
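
In the output above, fold 0 selects the smallest gamma in the grid. When a selected value sits on a grid boundary, the true optimum may lie outside the searched range, or the criterion may be nearly flat in gamma. A quick check, sketched here with the selected values copied by hand from the kfold5 rows above (the variable names are chosen for this example):

```python
import numpy as np

gamma_grid_example = np.logspace(-7, -3, 17)
# Selected gammas copied from the kfold5 rows of the table above.
selected = np.array([1e-07, 1.778279e-04, 3.162278e-04, 1.778279e-04, 1e-04])

# atol=0 so that the comparison is purely relative; the values are tiny.
on_lower = np.isclose(selected, gamma_grid_example[0], rtol=1e-6, atol=0.0)
on_upper = np.isclose(selected, gamma_grid_example[-1], rtol=1e-6, atol=0.0)

print("on lower edge:", on_lower.tolist())  # [True, False, False, False, False]
print("on upper edge:", on_upper.tolist())  # [False, False, False, False, False]
```

If many folds land on the lower edge, extending the grid below 1e-7 or plotting the inner-validation score against gamma would show whether the choice is meaningful.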


## 8) Summaries (mean ± std across folds)

In [12]:
def summarize(df: pd.DataFrame):
    metrics = ['train_MAE','train_MGEH','train_R2','test_MAE','test_MGEH','test_R2']
    g = df.groupby(['split','model'])[metrics]
    mean = g.mean().add_suffix('_mean')
    std  = g.std(ddof=1).add_suffix('_std')
    out = pd.concat([mean, std], axis=1).reset_index()
    return out

summary = summarize(df_all)
summary


| | split | model | train_MAE_mean | train_MGEH_mean | train_R2_mean | test_MAE_mean | test_MGEH_mean | test_R2_mean | train_MAE_std | train_MGEH_std | train_R2_std | test_MAE_std | test_MGEH_std | test_R2_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | kfold5 | GravityLogLinear | 9373.996875 | 65.451355 | 0.548765 | 9398.709766 | 65.604585 | 0.547118 | 50.988163 | 0.217363 | 0.004611 | 99.198882 | 0.305139 | 0.016122 |
| 1 | spatial9 | GravityLogLinear | 9362.868598 | 65.41461 | 0.544534 | 9565.117947 | 66.813452 | 0.477722 | 138.770902 | 0.649954 | 0.012172 | 1636.469666 | 6.51582 | 0.188032 |


## 9) Save outputs

In [13]:
os.makedirs('eval/baselines', exist_ok=True)
df_all.to_csv('eval/baselines/baseline_gravity_all_folds.csv', index=False)
summary.to_csv('eval/baselines/baseline_gravity_summary.csv', index=False)
print('Saved:')
print(' - eval/baselines/baseline_gravity_all_folds.csv')
print(' - eval/baselines/baseline_gravity_summary.csv')


Saved:
 - eval/baselines/baseline_gravity_all_folds.csv
 - eval/baselines/baseline_gravity_summary.csv
