# 🧩 FlexTrack Challenge — Modeling Setup

**Notebook:** 02_modeling_setup.ipynb  
**Goal:** Establish a robust modeling pipeline starting with  
time-aware, site-aware cross-validation and a baseline classifier.  

---

## 📌 Objectives
- Implement a **cross-validation strategy** that prevents leakage:
  - Time-based splits within each site
  - Optional leave-one-site-out for robustness
- Verify split integrity (train always before validation, no overlap)
- Train and evaluate a **baseline classifier** (LightGBM/XGBoost) with class weights
- Prepare foundation for Phase 2 (regression)
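
The split-integrity objective can be illustrated on toy data before touching the real frame. What follows is a minimal sketch, not the notebook's implementation: `expanding_window_splits`, `build_folds`, and `check_fold` are hypothetical helpers, and the two toy sites with eight rows each are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

Fold = Tuple[List[int], List[int]]


def expanding_window_splits(n_rows: int, n_splits: int) -> List[Tuple[range, range]]:
    """Hypothetical helper: expanding-window (train, valid) positions for one site."""
    fold_size = n_rows // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("too few rows for the requested number of splits")
    splits = []
    for k in range(n_splits):
        train_end = n_rows - (n_splits - k) * fold_size
        splits.append((range(0, train_end), range(train_end, train_end + fold_size)))
    return splits


def build_folds(site_rows: Dict[str, List[int]], n_splits: int) -> List[Fold]:
    """Union the k-th split of every site into the k-th fold."""
    folds: List[Fold] = [([], []) for _ in range(n_splits)]
    for rows in site_rows.values():  # rows are assumed sorted by time within a site
        for k, (tr, va) in enumerate(expanding_window_splits(len(rows), n_splits)):
            folds[k][0].extend(rows[i] for i in tr)
            folds[k][1].extend(rows[i] for i in va)
    return folds


def check_fold(fold: Fold, site_rows: Dict[str, List[int]]) -> None:
    """Assert no overlap and, per site, that every train row precedes every valid row."""
    train, valid = set(fold[0]), set(fold[1])
    assert train.isdisjoint(valid), "train/valid overlap"
    for site, rows in site_rows.items():
        tr_pos = [p for p, r in enumerate(rows) if r in train]
        va_pos = [p for p, r in enumerate(rows) if r in valid]
        if tr_pos and va_pos:
            assert max(tr_pos) < min(va_pos), f"temporal leakage in {site}"


# Toy layout: row ids 0-7 belong to siteA, 8-15 to siteB, already in time order.
site_rows = {"siteA": list(range(0, 8)), "siteB": list(range(8, 16))}
folds = build_folds(site_rows, n_splits=3)
for fold in folds:
    check_fold(fold, site_rows)
print([(len(tr), len(va)) for tr, va in folds])  # [(4, 4), (8, 4), (12, 4)]
```

The training window grows from fold to fold while the validation block stays the same size, which is the same expanding-window idea the notebook applies per site.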

---

## 🔑 Context Recap
- **Target variable (Phase 1):** `Demand_Response_Flag` (-1, 0, +1)  
- **Target variable (Phase 2):** `Demand_Response_Capacity_kW` (continuous, when flag ≠ 0)  
- **EDA findings:**  
  - Severe class imbalance (~97% zeros)  
  - Events occur in business hours, hot/sunny conditions  
  - Building power shows strong autocorrelation  
- **Feature blueprint:**  
  - Time, weather, load dynamics, site identity, persistence, interactions  

---

## 📅 Next Steps
1. Define and test **time-aware CV splitters**
2. Evaluate **baseline classifier** performance across folds
3. Store CV framework for downstream feature engineering and tuning

In [86]:
# --- Core imports ---
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb

# --- Modeling ---
from sklearn.model_selection import TimeSeriesSplit, GroupKFold
from sklearn.metrics import classification_report, confusion_matrix
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# --- Utilities ---
import warnings
warnings.filterwarnings("ignore")

# --- Reproducibility ---
SEED = 42
np.random.seed(SEED)

print("Environment ready. Seed set to", SEED)

Environment ready. Seed set to 42


In [40]:
train_path = "../data/processed/flextrack_train.csv"
df_train = pd.read_csv(train_path, parse_dates=["Timestamp_Local"])

In [41]:
def make_site_timeseries_folds(df, n_splits=3):
    """
    Creates n_splits folds where each fold is a union of
    time-ordered splits made independently within each site.
    Assumes df is already sorted by ['Site','Timestamp_Local'].
    Returns: list of (train_idx, val_idx) tuples (as numpy arrays).
    """
    folds_per_site = {}
    for site, dsite in df.groupby("Site", sort=False):
        idx = dsite.index.to_numpy()
        tscv = TimeSeriesSplit(n_splits=n_splits)
        site_folds = []
        for tr_rel, va_rel in tscv.split(idx):
            site_folds.append((idx[tr_rel], idx[va_rel]))
        folds_per_site[site] = site_folds

    folds = []
    for k in range(n_splits):
        tr_parts, va_parts = [], []
        for site in folds_per_site:
            tr_idx, va_idx = folds_per_site[site][k]
            tr_parts.append(tr_idx)
            va_parts.append(va_idx)
        folds.append((np.concatenate(tr_parts), np.concatenate(va_parts)))
    return folds

def make_leave_one_site_out_folds(df):
    """
    Strict generalization check: train on two sites, validate on the third.
    Returns: list of (train_idx, val_idx) tuples.
    """
    folds = []
    all_idx = df.index.to_numpy()
    for site, dsite in df.groupby("Site", sort=False):
        val_idx = dsite.index.to_numpy()
        train_mask = np.isin(all_idx, val_idx, invert=True)
        tr_idx = all_idx[train_mask]
        folds.append((tr_idx, val_idx))
    return folds

# 🔎 Quick integrity check helper
def summarize_fold(df, tr_idx, va_idx, label=""):
    tr = df.loc[tr_idx]
    va = df.loc[va_idx]
    print(f"{label} | Train: {tr.shape[0]:>6} rows  [{tr['Timestamp_Local'].min()} → {tr['Timestamp_Local'].max()}]")
    print(f"{' '*len(label)} | Valid: {va.shape[0]:>6} rows  [{va['Timestamp_Local'].min()} → {va['Timestamp_Local'].max()}]")
    # Ensure no temporal leakage within each site
    for s in df['Site'].unique():
        tr_s = tr[tr['Site']==s]
        va_s = va[va['Site']==s]
        if not tr_s.empty and not va_s.empty:
            assert tr_s['Timestamp_Local'].max() < va_s['Timestamp_Local'].min(), f"Leakage in site {s}"

# ▶️ Build folds 
ts_folds = make_site_timeseries_folds(df_train, n_splits=3)   # primary
loso_folds = make_leave_one_site_out_folds(df_train)          # secondary (robustness)

In [42]:
# Summarize all time-series folds
for i, (tr_idx, va_idx) in enumerate(ts_folds, 1):
    summarize_fold(df_train, tr_idx, va_idx, label=f"TS Fold {i}")

# Summarize all LOSO folds (strict generalization)
for i, (tr_idx, va_idx) in enumerate(loso_folds, 1):
    summarize_fold(df_train, tr_idx, va_idx, label=f"LOSO Fold {i}")

TS Fold 1 | Train:  26280 rows  [2019-01-01 00:00:00 → 2023-04-02 05:45:00]
          | Valid:  26280 rows  [2019-04-02 06:00:00 → 2023-07-02 11:45:00]
TS Fold 2 | Train:  52560 rows  [2019-01-01 00:00:00 → 2023-07-02 11:45:00]
          | Valid:  26280 rows  [2019-07-02 12:00:00 → 2023-10-01 17:45:00]
TS Fold 3 | Train:  78840 rows  [2019-01-01 00:00:00 → 2023-10-01 17:45:00]
          | Valid:  26280 rows  [2019-10-01 18:00:00 → 2023-12-31 23:45:00]
LOSO Fold 1 | Train:  70080 rows  [2019-01-01 00:00:00 → 2023-12-31 23:45:00]
            | Valid:  35040 rows  [2019-01-01 00:00:00 → 2019-12-31 23:45:00]
LOSO Fold 2 | Train:  70080 rows  [2019-01-01 00:00:00 → 2023-12-31 23:45:00]
            | Valid:  35040 rows  [2019-01-01 00:00:00 → 2019-12-31 23:45:00]
LOSO Fold 3 | Train:  70080 rows  [2019-01-01 00:00:00 → 2019-12-31 23:45:00]
            | Valid:  35040 rows  [2023-01-01 00:00:00 → 2023-12-31 23:45:00]
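
The pooled date ranges printed above overlap (TS Fold 1, for instance, trains up to 2023-04-02 but validates from 2019-04-02). The LOSO output suggests why: two sites appear to cover 2019 and one covers 2023, so a pooled min/max mixes calendar years, and ordering only has to hold within a site. A toy sketch of that effect, with made-up site names and dates:

```python
from datetime import date

# Toy fold: (site, train_start, train_end, valid_start, valid_end).
# Site names and dates are illustrative only, not taken from the dataset.
fold = [
    ("site_2019", date(2019, 1, 1), date(2019, 4, 1), date(2019, 4, 2), date(2019, 7, 1)),
    ("site_2023", date(2023, 1, 1), date(2023, 4, 1), date(2023, 4, 2), date(2023, 7, 1)),
]

pooled_train_max = max(tr_end for _, _, tr_end, _, _ in fold)
pooled_valid_min = min(va_start for _, _, _, va_start, _ in fold)

# Pooled view: training appears to extend past the start of validation.
pooled_looks_leaky = pooled_train_max >= pooled_valid_min

# Per-site view: training always ends before validation starts.
per_site_ok = all(tr_end < va_start for _, _, tr_end, va_start, _ in fold)

print(f"pooled train max {pooled_train_max} vs pooled valid min {pooled_valid_min}")
print("pooled ranges overlap:", pooled_looks_leaky)
print("per-site ordering holds:", per_site_ok)
```

This is why `summarize_fold` asserts the ordering site by site rather than on the pooled timestamps.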


In [82]:
def build_features(df):
    d = df.copy()

    # ensure datetime + sorted
    if not pd.api.types.is_datetime64_any_dtype(d["Timestamp_Local"]):
        d["Timestamp_Local"] = pd.to_datetime(d["Timestamp_Local"], errors="raise")
    d = d.sort_values(["Site","Timestamp_Local"])

    # target (only if available)
    if "Demand_Response_Flag" in d.columns:
        d["y_event"] = (d["Demand_Response_Flag"] != 0).astype(int)
    else:
        d["y_event"] = np.nan  # test data has no labels

    # time
    d["hour"]   = d["Timestamp_Local"].dt.hour
    d["dow"]    = d["Timestamp_Local"].dt.dayofweek
    d["month"]  = d["Timestamp_Local"].dt.month
    d["bizhrs"] = d["hour"].between(10,17).astype(int)
    d["wknd"]   = d["dow"].isin([5,6]).astype(int)

    # weather
    d["temp"] = d["Dry_Bulb_Temperature_C"]
    d["rad"]  = d["Global_Horizontal_Radiation_W/m2"]
    d["temp_x_rad"] = d["temp"] * d["rad"]

    # load lags / rolling (per-site)
    for lag in [1, 4, 12, 24, 96]:
        d[f"power_lag{lag}"] = d.groupby("Site")["Building_Power_kW"].shift(lag)

    d["power_roll4"]  = d.groupby("Site")["Building_Power_kW"].transform(lambda s: s.rolling(4,  min_periods=1).mean())
    d["power_roll12"] = d.groupby("Site")["Building_Power_kW"].transform(lambda s: s.rolling(12, min_periods=1).mean())
    d["power_roll96"] = d.groupby("Site")["Building_Power_kW"].transform(lambda s: s.rolling(96, min_periods=1).mean())
    d["power_std96"]  = d.groupby("Site")["Building_Power_kW"].transform(lambda s: s.rolling(96, min_periods=10).std())

    # deltas
    d["power_delta"] = d["Building_Power_kW"] - d["power_lag1"]

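    # per-site z-score (mean/std come from each site's full series, so within a CV fold
    # the statistic also sees rows that fall after the training window)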
    d["power_z"] = d.groupby("Site")["Building_Power_kW"].transform(
        lambda s: (s - s.mean()) / (s.std() + 1e-6)
    )

    # site one-hot
    d = pd.get_dummies(d, columns=["Site"], drop_first=False)
    return d

In [83]:
# --- paths (adjust if needed) ---
train_path = "../data/processed/flextrack_train.csv"      # or your processed path if you saved one
test_path  = "../data/raw/flextrack-public-test-data-v0.1.csv"       # <- official warm-up test set

# --- load (parse timestamp) ---
df_train = pd.read_csv(train_path, parse_dates=["Timestamp_Local"])
df_test  = pd.read_csv(test_path,  parse_dates=["Timestamp_Local"])

# --- build features for train & test ---
df_feat_train = build_features(df_train)
df_feat_test  = build_features(df_test)

# --- align one-hot columns (Site_*) between train and test ---
train_sites = [c for c in df_feat_train.columns if c.startswith("Site_")]
test_sites  = [c for c in df_feat_test.columns  if c.startswith("Site_")]

# add any missing site dummies to test
for missing in set(train_sites) - set(test_sites):
    df_feat_test[missing] = 0
# drop extra site dummies in test (shouldn't happen, but safe)
for extra in set(test_sites) - set(train_sites):
    df_feat_test.drop(columns=[extra], inplace=True)

# --- define feature list (same as our best CV run) ---
feature_cols = [
    "hour","dow","month","bizhrs","wknd",
    "temp","rad","temp_x_rad",
    "power_lag1","power_lag4","power_lag12","power_lag24","power_lag96",
    "power_roll4","power_roll12","power_roll96","power_std96",
    "power_delta","power_z",
] + [c for c in df_feat_train.columns if c.startswith("Site_")]

# keep only features present in BOTH train & test (after alignment)
feature_cols_final = [c for c in feature_cols if c in df_feat_train.columns and c in df_feat_test.columns]

print("Train shape (feat):", df_feat_train.shape)
print("Test  shape (feat):", df_feat_test.shape)
print("Num features:", len(feature_cols_final))

Train shape (feat): (105120, 29)
Test  shape (feat): (35040, 27)
Num features: 22


In [47]:
def gmean_binary(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0,1]).ravel()
    tpr = tp / (tp + fn) if (tp+fn) > 0 else 0.0   # sensitivity (recall for events)
    tnr = tn / (tn + fp) if (tn+fp) > 0 else 0.0   # specificity (non-events)
    return np.sqrt(tpr * tnr), tpr, tnr
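
With roughly 97% non-events, accuracy rewards a model that never predicts an event, whereas the G-Mean drops to zero as soon as either class is ignored. A small worked example on toy counts (97 non-events and 3 events, chosen to mirror the EDA imbalance; `gmean_and_accuracy` is a hypothetical helper that takes confusion-matrix counts directly):

```python
import math


def gmean_and_accuracy(tn: int, fp: int, fn: int, tp: int):
    """Return (g_mean, accuracy) from confusion-matrix counts."""
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    tnr = tn / (tn + fp) if (tn + fp) > 0 else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return math.sqrt(tpr * tnr), accuracy


# Model A never predicts an event; model B finds 2 of 3 events with 10 false alarms.
always_zero = gmean_and_accuracy(tn=97, fp=0, fn=3, tp=0)
some_recall = gmean_and_accuracy(tn=87, fp=10, fn=1, tp=2)

for name, (gm, acc) in [("always zero", always_zero), ("some recall", some_recall)]:
    print(f"{name:<12} G-Mean={gm:.3f}  accuracy={acc:.3f}")
```

Model A has the higher accuracy (0.97 against 0.89) but a G-Mean of zero, which matches the pattern in a fold that reports TPR=0.000 and TNR=1.000.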

In [58]:
def run_cv_baseline(df_feat, folds, feature_cols):
    scores = []
    for i, (tr_idx, va_idx) in enumerate(folds, 1):
        tr = df_feat.loc[tr_idx].dropna(subset=feature_cols + ["y_event"])
        va = df_feat.loc[va_idx].dropna(subset=feature_cols + ["y_event"])

        X_tr, y_tr = tr[feature_cols], tr["y_event"]
        X_va, y_va = va[feature_cols], va["y_event"]

        # handle imbalance → weight positive (event) class
        pos_weight = (y_tr==0).sum() / max((y_tr==1).sum(), 1)

        clf = LGBMClassifier(
            random_state=42,
            n_estimators=200,
            learning_rate=0.05,
            num_leaves=31,
            class_weight={0:1.0, 1:float(pos_weight)}
        )
        clf.fit(X_tr, y_tr)

        y_pred = clf.predict(X_va)
        gm, tpr, tnr = gmean_binary(y_va, y_pred)
        scores.append(gm)

        print(f"Fold {i}: G-Mean={gm:.4f} (TPR={tpr:.3f}, TNR={tnr:.3f})")

    print(f"\nCV G-Mean: mean={np.mean(scores):.4f}, std={np.std(scores):.4f}")
    return scores


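# ▶️ Run with your time-series folds
# NOTE: `df_feat` is not defined in any cell shown in this notebook; it is assumed here
# to be the engineered training frame, i.e. build_features(df_train) (= df_feat_train).
df_feat = df_feat_train
scores = run_cv_baseline(df_feat, ts_folds, feature_cols)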

[LightGBM] [Info] Number of positive: 977, number of negative: 25015
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000651 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3361
[LightGBM] [Info] Number of data points in the train set: 25992, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000
Fold 1: G-Mean=0.0000 (TPR=0.000, TNR=1.000)
[LightGBM] [Info] Number of positive: 1458, number of negative: 50814
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000917 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3364
[LightGBM] [Info] Number of data points in the train set: 52272, number
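
The `pavg=0.500000` line in the log above is what one would expect if that figure is the weighted share of positives: with a weight of neg/pos on the event class, both classes carry the same total weight. The arithmetic below uses the fold-1 counts from the log (977 events, 25015 non-events) and also shows the softer sqrt(neg/pos) weight that the threshold-tuned run uses. `weighted_positive_share` is a hypothetical helper, and how LightGBM computes `pavg` internally is an assumption here.

```python
import math

pos, neg = 977, 25015  # fold-1 class counts from the LightGBM log

full_weight = neg / pos             # weight used by the baseline via class_weight
soft_weight = math.sqrt(neg / pos)  # softer sqrt(neg/pos) variant


def weighted_positive_share(weight: float) -> float:
    """Share of total sample weight carried by the event class."""
    return weight * pos / (weight * pos + neg)


print(f"neg/pos       = {full_weight:6.2f} -> weighted share {weighted_positive_share(full_weight):.3f}")
print(f"sqrt(neg/pos) = {soft_weight:6.2f} -> weighted share {weighted_positive_share(soft_weight):.3f}")
print(f"unweighted share = {weighted_positive_share(1.0):.6f}")
```

The unweighted share works out to 0.037588, the same value the log reports as `pavg` for the run that uses `scale_pos_weight` instead of `class_weight`.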

In [59]:
def gmean_from_probs(y_true, y_prob, thresholds):
    """
    Returns:
      best_gm, best_thr, best_tpr, best_tnr
    """
    best = (-1.0, 0.5, 0.0, 0.0)  # (gm, thr, tpr, tnr)
    for thr in thresholds:
        y_pred = (y_prob >= thr).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0,1]).ravel()
        tpr = tp / (tp + fn) if (tp+fn)>0 else 0.0
        tnr = tn / (tn + fp) if (tn+fp)>0 else 0.0
        gm  = np.sqrt(tpr * tnr)
        if gm > best[0]:
            best = (gm, thr, tpr, tnr)
    return best

In [None]:
def run_cv_threshold_tuned(df_feat, folds, feature_cols, thresholds=None):
    if thresholds is None:
        thresholds = np.linspace(0.02, 0.5, 25)

    scores, chosen_thrs, per_fold_stats = [], [], []

    for i, (tr_idx, va_idx) in enumerate(folds, 1):
        tr = df_feat.loc[tr_idx].dropna(subset=feature_cols + ["y_event"])
        va = df_feat.loc[va_idx].dropna(subset=feature_cols + ["y_event"])

        X_tr, y_tr = tr[feature_cols], tr["y_event"]
        X_va, y_va = va[feature_cols], va["y_event"]

        # 🔹 softer imbalance handling: sqrt(neg/pos)
        pos = max((y_tr == 1).sum(), 1)
        neg = (y_tr == 0).sum()
        spw = (neg / pos) ** 0.5

        clf = LGBMClassifier(
            random_state=42,
            n_estimators=1500,
            learning_rate=0.03,
            num_leaves=127,
            max_depth=-1,
            min_data_in_leaf=20,     # was 50
            subsample=0.8,
            colsample_bytree=0.8,
            reg_lambda=1.0,
            scale_pos_weight=float(spw),
            n_jobs=-1
        )

        callbacks = [lgb.early_stopping(stopping_rounds=150, verbose=False)]
        clf.fit(
            X_tr, y_tr,
            eval_set=[(X_va, y_va)],
            eval_metric="binary_logloss",
            callbacks=callbacks
        )

        # probabilities (use best iteration if present)
        try:
            y_prob = clf.predict_proba(X_va, num_iteration=clf.best_iteration_)[:, 1]
        except Exception:
            y_prob = clf.predict_proba(X_va)[:, 1]

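        # threshold sweep → best G-Mean
        # NOTE: the threshold (and the early-stopping round) are picked on the same
        # validation fold that is scored, so the reported G-Mean is an optimistic estimate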
        best_gm, best_thr, best_tpr, best_tnr = gmean_from_probs(y_va, y_prob, thresholds)
        scores.append(best_gm)
        chosen_thrs.append(best_thr)
        per_fold_stats.append((best_tpr, best_tnr))

        bi = getattr(clf, "best_iteration_", None)
        print(
            f"Fold {i}: G-Mean={best_gm:.4f} @ thr={best_thr:.3f} "
            f"(TPR={best_tpr:.3f}, TNR={best_tnr:.3f})"
            + (f" | best_iter={bi}" if bi is not None else "")
        )

    print(f"\nCV G-Mean: mean={np.mean(scores):.4f}, std={np.std(scores):.4f}")
    print(f"Chosen thresholds per fold: {[round(t,3) for t in chosen_thrs]}")
    return scores, chosen_thrs, per_fold_stats

In [80]:
scores_tt, thrs_tt, stats_tt = run_cv_threshold_tuned(df_feat, ts_folds, feature_cols)

[LightGBM] [Info] Number of positive: 977, number of negative: 25015
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000686 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3616
[LightGBM] [Info] Number of data points in the train set: 25992, number of used features: 22
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.037588 -> initscore=-3.242744
[LightGBM] [Info] Start training from score -3.242744
Fold 1: G-Mean=0.4406 @ thr=0.020 (TPR=0.208, TNR=0.934) | best_iter=27
[LightGBM] [Info] Number of positive: 1458, number of negative: 50814
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001007 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3619
[LightGBM] [Info] Number of data points in
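
In the fold shown above, the selected threshold (0.020) is the lowest value in the default grid `np.linspace(0.02, 0.5, 25)`, which may mean the best cut-off lies below the grid. A minimal sketch of detecting that case and widening the sweep with log-spaced thresholds follows; `sweep_gmean`, `log_grid`, and the toy scores are hypothetical and not part of the notebook.

```python
import math
from typing import List, Sequence, Tuple


def sweep_gmean(y_true: Sequence[int], y_prob: Sequence[float],
                thresholds: Sequence[float]) -> Tuple[float, float]:
    """Return (best_gmean, best_threshold); ties keep the first threshold tried."""
    best_gm, best_thr = -1.0, thresholds[0]
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    for thr in thresholds:
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= thr)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < thr)
        tpr = tp / n_pos if n_pos else 0.0
        tnr = tn / n_neg if n_neg else 0.0
        gm = math.sqrt(tpr * tnr)
        if gm > best_gm:
            best_gm, best_thr = gm, thr
    return best_gm, best_thr


def log_grid(low: float, high: float, n: int) -> List[float]:
    """n log-spaced thresholds from low to high (inclusive)."""
    step = (math.log10(high) - math.log10(low)) / (n - 1)
    return [10 ** (math.log10(low) + i * step) for i in range(n)]


# Toy scores: events score low in absolute terms but above most non-events.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.001, 0.002, 0.002, 0.004, 0.004, 0.012, 0.008, 0.015]

coarse = [0.02 + i * 0.02 for i in range(25)]  # 0.02 .. 0.50, like the default grid
gm_coarse, thr_coarse = sweep_gmean(y_true, y_prob, coarse)
if thr_coarse == coarse[0]:
    # Best threshold sits on the lower edge: re-sweep on a wider, finer grid.
    gm_wide, thr_wide = sweep_gmean(y_true, y_prob, log_grid(0.001, 0.5, 40))
    print(f"coarse: G-Mean={gm_coarse:.3f} @ thr={thr_coarse:.3f}")
    print(f"wide:   G-Mean={gm_wide:.3f} @ thr={thr_wide:.4f}")
```

On the toy scores the coarse grid never flags an event, while the wider grid finds a cut-off between 0.004 and 0.008. The same check could be passed to `run_cv_threshold_tuned` through its `thresholds` argument.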

In [81]:
scores_loso, thrs_loso, stats_loso = run_cv_threshold_tuned(df_feat, loso_folds, feature_cols)

[LightGBM] [Info] Number of positive: 2338, number of negative: 67550
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001097 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3622
[LightGBM] [Info] Number of data points in the train set: 69888, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.033454 -> initscore=-3.363572
[LightGBM] [Info] Start training from score -3.363572
Fold 1: G-Mean=0.8871 @ thr=0.020 (TPR=0.907, TNR=0.868) | best_iter=64
[LightGBM] [Info] Number of positive: 2000, number of negative: 67888
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001145 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3622
[LightGBM] [Info] Number of data points i

In [87]:
# --- Train full model ---
X = df_feat_train[feature_cols_final]
y = (df_feat_train["Demand_Response_Flag"] != 0).astype(int)

clf = LGBMClassifier(
    random_state=42,
    n_estimators=1000,
    learning_rate=0.03,
    num_leaves=63,
    min_data_in_leaf=20,
    colsample_bytree=0.8,
    subsample=0.8,
    n_jobs=-1
)

clf.fit(X, y)

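# --- Predict on test ---
X_test = df_feat_test[feature_cols_final]
# build_features() sorts rows by Site/Timestamp_Local but keeps the original index,
# so realign the probabilities to df_test's row order before building the submission.
y_prob = (
    pd.Series(clf.predict_proba(X_test)[:, 1], index=X_test.index)
    .reindex(df_test.index)
    .to_numpy()
)

# apply the threshold picked in the CV folds above (0.02, the lowest value in the sweep grid);
# this final model has no scale_pos_weight, so treat the value as a starting point only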
FIXED_THR = 0.02
y_pred = (y_prob >= FIXED_THR).astype(int)

# For warm-up: submit event vs no-event. We'll emit -1 for events, 0 otherwise.
y_flag = np.where(y_pred == 1, -1, 0)

# --- Build submission using the ORIGINAL test df (has Site & Timestamp) ---
sub = pd.DataFrame({
    "Site": df_test["Site"].values,
    "Timestamp_Local": pd.to_datetime(df_test["Timestamp_Local"]).dt.strftime("%Y-%m-%d %H:%M:%S"),
    "Demand_Response_Flag": y_flag
})

# optional sanity check
print("Predicted events in test:", (sub["Demand_Response_Flag"] != 0).sum(), "/", len(sub))

# --- Save to CSV ---
save_path = "../submissions/submission_phase1.csv"
os.makedirs(os.path.dirname(save_path), exist_ok=True)
sub.to_csv(save_path, index=False)
print("Submission saved:", save_path, "| shape:", sub.shape)
sub.head()

[LightGBM] [Info] Number of positive: 3109, number of negative: 102011
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001399 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3624
[LightGBM] [Info] Number of data points in the train set: 105120, number of used features: 22
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.029576 -> initscore=-3.490780
[LightGBM] [Info] Start training from score -3.490780
Predicted events in test: 1649 / 35040
Submission saved: ../submissions/submission_phase1.csv | shape: (35040, 3)


    Site      Timestamp_Local  Demand_Response_Flag
0  siteD  2020-01-01 00:00:00                     0
1  siteD  2020-01-01 00:15:00                     0
2  siteD  2020-01-01 00:30:00                     0
3  siteD  2020-01-01 00:45:00                     0
4  siteD  2020-01-01 01:00:00                     0


In [92]:
def write_submission_from_probs(y_prob, thr, df_test, path):
    y_pred = (y_prob >= thr).astype(int)
    # Warm-up phase: treat any event as -1 (binary evaluation)
    y_flag = np.where(y_pred == 1, -1, 0)
    sub = pd.DataFrame({
        "Site": df_test["Site"].values,
        "Timestamp_Local": pd.to_datetime(df_test["Timestamp_Local"]).dt.strftime("%Y-%m-%d %H:%M:%S"),
        "Demand_Response_Flag": y_flag
    })
    os.makedirs(os.path.dirname(path), exist_ok=True)
    sub.to_csv(path, index=False)
    print(f"thr={thr:0.3f} | events={int((sub['Demand_Response_Flag']!=0).sum())}/{len(sub)} | saved -> {path}")

# thresholds to try (tighter around your current best)
ths = [0.003, 0.004, 0.005]

for thr in ths:
    out = f"../submissions/submission_phase1_thr{str(thr).replace('.','p')}.csv"
    write_submission_from_probs(y_prob, thr, df_test, out)

thr=0.003 | events=2922/35040 | saved -> ../submissions/submission_phase1_thr0p003.csv
thr=0.004 | events=2683/35040 | saved -> ../submissions/submission_phase1_thr0p004.csv
thr=0.005 | events=2529/35040 | saved -> ../submissions/submission_phase1_thr0p005.csv
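
One way to sanity-check these thresholds is to compare the predicted event rate on the test set with the event rate in the training data. The sketch below only reuses counts printed in the outputs above (3109 events in 105120 training rows, 35040 test rows); the comparison itself is a suggestion, not something the notebook computes.

```python
# Counts copied from the notebook outputs above.
n_test = 35040
train_events, n_train = 3109, 105120
predicted_events = {0.02: 1649, 0.005: 2529, 0.004: 2683, 0.003: 2922}

train_rate = train_events / n_train
print(f"training event rate: {train_rate:.2%}")

rates = {}
for thr, n_events in predicted_events.items():
    rates[thr] = n_events / n_test
    print(f"thr={thr:<5} predicted rate={rates[thr]:.2%}  "
          f"({rates[thr] / train_rate:.1f}x training rate)")
```

Every threshold listed flags events more often than they occur in training. That is not wrong in itself when the metric is G-Mean, since recall is weighted as heavily as specificity, but a large gap is worth checking against the leaderboard score before lowering the threshold further.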
