# 02 — Direction Baselines (Rolling Backtest)

**Goal:** Backtest simple probabilistic baselines for the 7-business-day direction of USD/CAD.

We evaluate:
- accuracy (direction)
- log loss / Brier score (probability quality)
- confidence gating (coverage vs accuracy tradeoff)


## 1) Imports

We use:
- `pandas/numpy` for data manipulation
- `pyarrow` for parquet loading
- `scikit-learn` for logistic regression and probabilistic metrics


In [22]:
import pandas as pd
import numpy as np
import pyarrow.parquet as pq
from pathlib import Path

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss, brier_score_loss

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 160)


## 2) Paths (Repo-safe)

We avoid absolute paths so the notebook can run on any machine without leaking local directories into Git.

We locate the repo root by searching upward for `data/` and `src/`, then define file paths relative to that root.


In [23]:
def find_repo_root(start: Path | None = None) -> Path:
    start = start or Path.cwd()
    for p in [start, *start.parents]:
        if (p / "data").exists() and (p / "src").exists():
            return p
    raise RuntimeError("Repo root not found. Run the notebook from inside the repo.")

REPO_ROOT = find_repo_root()
DATA_DIR = REPO_ROOT / "data"
OUT_DIR = REPO_ROOT / "outputs"
OUT_DIR.mkdir(parents=True, exist_ok=True)

PARQUET_PATH = DATA_DIR / "data-USD-CAD.parquet"

def load_gold_parquet(path: Path, series_id: str = "FXUSDCAD") -> pd.DataFrame:
    table = pq.read_table(str(path))
    df = table.to_pandas()

    if "series_id" in df.columns:
        df = df[df["series_id"] == series_id].copy()

    if "obs_date" not in df.columns:
        raise ValueError("Expected column 'obs_date' in gold parquet.")

    df["obs_date"] = pd.to_datetime(df["obs_date"])
    df = df.sort_values("obs_date").reset_index(drop=True).set_index("obs_date")

    # convert numeric-ish columns
    for c in df.columns:
        if c in ("series_id", "base_currency", "quote_currency", "source", "run_id", "processed_at"):
            continue
        # only attempt conversion if dtype is object/string
        if df[c].dtype == "object":
            try:
                df[c] = pd.to_numeric(df[c])
            except Exception:
                pass

    if "value" not in df.columns:
        raise ValueError("Expected column 'value' in gold parquet.")

    df = df[~df["value"].isna()].copy()
    return df

df = load_gold_parquet(PARQUET_PATH, series_id="FXUSDCAD")
df.head()


obs_date,series_id,base_currency,quote_currency,value,prev_value,daily_return,log_return,return_5d,return_21d,lag_1d,lag_2d,lag_3d,lag_5d,lag_21d,rolling_mean_5d,rolling_mean_21d,rolling_std_5d,rolling_std_21d,volatility_ratio,ma_crossover,distance_from_ma21,day_of_week,day_of_month,week_of_year,month,quarter,year,is_month_start,is_month_end,is_quarter_end,is_year_start,is_year_end,target_return_1d,target_direction_1d,target_return_5d,source,run_id,processed_at
2017-01-03,FXUSDCAD,USD,CAD,1.3435,,,,,,,,,,,1.3435,1.3435,,,,0.0,0.0,1,3,1,1,1,2017,False,False,False,False,False,-0.008972,0,-0.016524,bankofcanada_valet,gold_backfill_20251223T210750Z,2025-12-23T21:12:37.627684+00:00
2017-01-04,FXUSDCAD,USD,CAD,1.3315,1.3435,-0.008932,-0.008972,,,1.3435,,,,,1.3375,1.3375,0.008485,,,0.0,-0.004486,2,4,1,1,1,2017,False,False,False,False,False,-0.005347,0,-0.004882,bankofcanada_valet,gold_backfill_20251223T210750Z,2025-12-23T21:12:37.627684+00:00
2017-01-05,FXUSDCAD,USD,CAD,1.3244,1.3315,-0.005332,-0.005347,,,1.3315,1.3435,,,,1.333133,1.333133,0.009654,,,0.0,-0.006551,3,5,1,1,1,2017,False,False,False,False,False,-0.002268,0,-0.01042,bankofcanada_valet,gold_backfill_20251223T210750Z,2025-12-23T21:12:37.627684+00:00
2017-01-06,FXUSDCAD,USD,CAD,1.3214,1.3244,-0.002265,-0.002268,,,1.3244,1.3315,1.3435,,,1.3302,1.3302,0.009826,,,0.0,-0.006616,4,6,1,1,1,2017,False,False,False,False,False,0.001966,1,-0.005524,bankofcanada_valet,gold_backfill_20251223T210750Z,2025-12-23T21:12:37.627684+00:00
2017-01-09,FXUSDCAD,USD,CAD,1.324,1.3214,0.001968,0.001966,,,1.3214,1.3244,1.3315,,,1.32896,1.32896,0.00895,0.00895,1.0,0.0,-0.003732,0,9,2,1,1,2017,False,False,False,False,False,-0.002041,0,-0.006647,bankofcanada_valet,gold_backfill_20251223T210750Z,2025-12-23T21:12:37.627684+00:00


## 3) Load Gold Parquet

We load the Gold-layer parquet and filter to the `FXUSDCAD` series.

We enforce:
- datetime index (`obs_date`)
- sorted chronology
- numeric conversion where appropriate
- removal of invalid `value` rows
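
The guarantees listed above can also be asserted explicitly after loading. The following is a minimal sketch, assuming a frame shaped like the one `load_gold_parquet` returns. `check_gold_frame` is a hypothetical helper that is not part of the pipeline, and the toy frame only stands in for the real parquet data.

```python
import pandas as pd

def check_gold_frame(df: pd.DataFrame) -> None:
    """Hypothetical helper: assert the post-load guarantees on a frame."""
    assert isinstance(df.index, pd.DatetimeIndex), "index must be datetime"
    assert df.index.is_monotonic_increasing, "rows must be in chronological order"
    assert pd.api.types.is_numeric_dtype(df["value"]), "'value' must be numeric"
    assert df["value"].notna().all(), "'value' must not contain NaN"

# Toy stand-in for the loaded frame (values copied from the first rows shown above).
toy = pd.DataFrame(
    {"value": [1.3435, 1.3315, 1.3244]},
    index=pd.to_datetime(["2017-01-03", "2017-01-04", "2017-01-05"]),
)
toy.index.name = "obs_date"

check_gold_frame(toy)  # passes silently when all guarantees hold
```

Running a check like this right after the load cell would surface a schema change in the Gold layer before it reaches the backtest.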


In [31]:
H = 7

df["target_return_7d"] = df["value"].shift(-H) / df["value"] - 1.0
df["target_direction_7d"] = (df["target_return_7d"] > 0).astype(int)

df_model = df.iloc[:-H].copy()

print("Rows (modelable):", len(df_model))
print("Positive class rate:", df_model["target_direction_7d"].mean())
df_model[["value", "target_return_7d", "target_direction_7d"]].tail(10)


Rows (modelable): 2232
Positive class rate: 0.5085125448028673


obs_date,value,target_return_7d,target_direction_7d
2025-11-28,1.3979,-0.009657,0
2025-12-01,1.3979,-0.010301,0
2025-12-02,1.3986,-0.015158,0
2025-12-03,1.3949,-0.012904,0
2025-12-04,1.3952,-0.013045,0
2025-12-05,1.386,-0.008081,0
2025-12-08,1.3837,-0.003975,0
2025-12-09,1.3844,-0.005056,0
2025-12-10,1.3835,-0.003903,0
2025-12-11,1.3774,-0.001888,0


## 4) Define the modeling target (H = 7 business days)

We define a 7-business-day forward return and convert it into a binary direction label:

- `target_return_7d = value[t+7] / value[t] - 1`
- `target_direction_7d = 1` if `target_return_7d > 0`, else `0`

This supports a “direction + confidence” product narrative.
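
To make the label construction concrete, here is a small sketch on an invented five-row series. It assumes a shortened horizon (`H = 2`) so the toy data stays readable; the prices and dates are illustrative and not taken from the USD/CAD data.

```python
import pandas as pd

H = 2  # shortened horizon for the toy series; the notebook uses H = 7
value = pd.Series(
    [1.00, 1.02, 1.01, 1.03, 1.00],
    index=pd.bdate_range("2025-01-06", periods=5),
)

# Forward return over H rows: value[t+H] / value[t] - 1
target_return = value.shift(-H) / value - 1.0

# The last H rows have no future value yet, so their return is NaN.
# NaN > 0 evaluates to False, so those rows would be labelled 0 unless dropped first.
labelled = pd.DataFrame({"value": value, "target_return": target_return}).iloc[:-H].copy()
labelled["target_direction"] = (labelled["target_return"] > 0).astype(int)
print(labelled)
```

Dropping the trailing `H` rows before modeling, as the notebook does with `df.iloc[:-H]`, keeps those undefined labels out of the training data.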


In [32]:
FEATURES = [
    "daily_return", "log_return",
    "return_5d", "return_21d",
    "lag_1d", "lag_2d", "lag_3d", "lag_5d", "lag_21d",
    "rolling_std_5d", "rolling_std_21d",
    "rolling_mean_5d", "rolling_mean_21d",
    "volatility_ratio",
    "day_of_week", "month", "is_month_end",
]

FEATURES = [c for c in FEATURES if c in df_model.columns]
print("Using features:", FEATURES)
print("Feature count:", len(FEATURES))

X_all = df_model[FEATURES].replace([np.inf, -np.inf], np.nan)
X_all = X_all.ffill().fillna(0.0)
y_all = df_model["target_direction_7d"].astype(int).copy()


Using features: ['daily_return', 'log_return', 'return_5d', 'return_21d', 'lag_1d', 'lag_2d', 'lag_3d', 'lag_5d', 'lag_21d', 'rolling_std_5d', 'rolling_std_21d', 'rolling_mean_5d', 'rolling_mean_21d', 'volatility_ratio', 'day_of_week', 'month', 'is_month_end']
Feature count: 17


## 5) Feature set and model-ready matrices

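We use a small, fixed feature set from the Gold layer:
- returns (daily, log, 5-day, 21-day) and lagged values (`lag_1d` to `lag_21d`)
- rolling means and rolling standard deviations (5-day, 21-day), plus `volatility_ratio`
- calendar features (day-of-week, month, month-end)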

We build:
- `X`: feature matrix
- `y`: direction labels (`target_direction_7d`)

We handle:
- `inf` values (replaced with NaN)
- missing values (forward-filled, then set to 0 where no earlier value exists)

The positive class rate printed earlier (about 0.51) indicates that the two classes are roughly balanced.
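
The cleaning steps above can be traced on a tiny invented frame. This sketch reuses two column names from the feature list purely for illustration; the values are made up.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "return_5d": [np.nan, 0.01, np.inf, 0.02],
    "volatility_ratio": [np.nan, np.nan, 1.2, -np.inf],
})

clean = X.replace([np.inf, -np.inf], np.nan)  # inf -> NaN
clean = clean.ffill()                         # carry the last valid value forward
clean = clean.fillna(0.0)                     # leading NaNs have nothing to carry, so use 0
print(clean)
```

Forward-filling only looks at earlier rows, so this imputation does not pull information from the future into a given date.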


In [33]:
def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def coinflip_prob(n: int) -> np.ndarray:
    return np.full(n, 0.5, dtype=float)

def momentum_prob(train_df: pd.DataFrame, n_test: int) -> np.ndarray:
    """
    Simple, stable baseline:
    score = last_log_return / (recent_vol + eps)
    p = sigmoid(score * k)
    """
    eps = 1e-8
    k = 1.5  # strength; keep modest to avoid extreme probabilities

    if "log_return" in train_df.columns:
        last_ret = float(train_df["log_return"].iloc[-1])
    elif "daily_return" in train_df.columns:
        last_ret = float(train_df["daily_return"].iloc[-1])
    else:
        # fallback: use value diff
        v = train_df["value"].astype(float)
        last_ret = float(v.pct_change().iloc[-1])

    if "rolling_std_21d" in train_df.columns:
        vol = float(train_df["rolling_std_21d"].iloc[-1])
    elif "rolling_std_5d" in train_df.columns:
        vol = float(train_df["rolling_std_5d"].iloc[-1])
    else:
        vol = float(train_df["value"].pct_change().rolling(21).std().iloc[-1])

    score = last_ret / (vol + eps)
    p = float(sigmoid(np.array([score * k]))[0])
    return np.full(n_test, p, dtype=float)


## 6) Baseline probability models

We benchmark three probabilistic baselines:

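- **Coinflip**: always predicts `p(up)=0.5` (no-skill baseline). With the `>= 0.5` threshold used in the backtest its label is always "up", so its accuracy equals the share of up days in the evaluation window.
- **Momentum**: takes the most recent log return in the training window, divides it by `rolling_std_21d`, and maps the score to a probability via a sigmoid. In the rows shown earlier, the `rolling_std_*` columns are computed on the price level rather than on returns, so the score is not a pure return z-score.
- **Logistic regression**: simple interpretable ML baseline, refit at every step on an expanding window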

These baselines are intentionally lightweight and fast to backtest.
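
As a worked example of the momentum mapping, the sketch below restates the formula from `momentum_prob` in scalar form. `momentum_probability` is a hypothetical name, and the inputs are illustrative rather than taken from the data.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def momentum_probability(last_ret: float, vol: float, k: float = 1.5, eps: float = 1e-8) -> float:
    """Scalar form of the momentum baseline: p = sigmoid(k * last_ret / (vol + eps))."""
    return sigmoid(k * last_ret / (vol + eps))

# Illustrative inputs (not taken from the data).
p_up = momentum_probability(last_ret=0.002, vol=0.01)     # positive return -> p above 0.5
p_down = momentum_probability(last_ret=-0.002, vol=0.01)  # mirror image -> p below 0.5
p_flat = momentum_probability(last_ret=0.0, vol=0.01)     # no signal -> exactly 0.5
print(round(p_up, 4), round(p_down, 4), p_flat)
```

Because the sigmoid is symmetric, equal and opposite returns give probabilities that sum to 1, and the constant `k` controls how quickly the probability moves away from 0.5.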


In [38]:
def rolling_backtest_direction(
    df_model: pd.DataFrame,
    X_all: pd.DataFrame,
    y_all: pd.Series,
    min_train_size: int = 252 * 2,
    use_logistic: bool = True,
) -> pd.DataFrame:
    rows = []

    n = len(df_model)
    for t in range(min_train_size, n):
        # expanding window
        train_idx = slice(0, t)
        test_idx = t  # one-step evaluation at each date (target already encodes 7d forward)

        train_df = df_model.iloc[train_idx]
        x_train = X_all.iloc[train_idx]
        y_train = y_all.iloc[train_idx]

        x_test = X_all.iloc[[test_idx]]
        y_test = int(y_all.iloc[test_idx])
        obs_date = df_model.index[test_idx]

        # Baselines
        p_coin = 0.5
        p_momo = float(momentum_prob(train_df, 1)[0])

        # Optional logistic regression baseline
        p_log = np.nan
        if use_logistic:
            clf = LogisticRegression(
                max_iter=1000,
                solver="lbfgs",
            )
            clf.fit(x_train, y_train)
            p_log = float(clf.predict_proba(x_test)[0, 1])

        row = {
            "date": obs_date,
            "horizon": 7,
            "y_true": y_test,

            "p_coin": p_coin,
            "p_momentum": p_momo,
            "p_logistic": p_log,
        }

        # Convert probabilities → predicted class (threshold 0.5)
        for name in ["coin", "momentum", "logistic"]:
            p = row[f"p_{name}"]
            if pd.notna(p):
                row[f"yhat_{name}"] = int(p >= 0.5)
                row[f"conf_{name}"] = float(max(p, 1.0 - p))
            else:
                # probability unavailable (e.g. logistic baseline disabled)
                row[f"yhat_{name}"] = np.nan
                row[f"conf_{name}"] = np.nan

        rows.append(row)

    return pd.DataFrame(rows)

bt = rolling_backtest_direction(
    df_model=df_model,
    X_all=X_all,
    y_all=y_all,
    min_train_size=252 * 2,
    use_logistic=True
)

bt.head(), bt.tail(), bt.shape


(        date  horizon  y_true  p_coin  p_momentum  p_logistic  yhat_coin  conf_coin  yhat_momentum  conf_momentum  yhat_logistic  conf_logistic
 0 2019-01-09        7       1     0.5    0.451812    0.568030          1        0.5              0       0.548188              1       0.568030
 1 2019-01-10        7       1     0.5    0.344940    0.547572          1        0.5              0       0.655060              1       0.547572
 2 2019-01-11        7       1     0.5    0.527996    0.525729          1        0.5              1       0.527996              1       0.525729
 3 2019-01-14        7       1     0.5    0.551499    0.479161          1        0.5              1       0.551499              0       0.520839
 4 2019-01-15        7       1     0.5    0.519946    0.488125          1        0.5              1       0.519946              0       0.511875,
            date  horizon  y_true  p_coin  p_momentum  p_logistic  yhat_coin  conf_coin  yhat_momentum  conf_momentum  yhat_logis

## 7) Rolling backtest (expanding window)

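We evaluate one prediction per business day using an expanding training window.

At each time step `t`:
- train on all rows before `t` (rows `0..t-1`)
- predict the probability of “up” for the 7-business-day forward target at `t`
- record probability, predicted label, and confidence

This mirrors a daily inference loop, with one caveat: the labels of the last 6 training rows depend on prices after date `t`, which would not yet be known in production. A stricter backtest would make row `t - 7` the last training row.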
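
A stricter variant of this loop trains only on rows whose labels are already observable at the test date, that is rows `i` with `i + H <= t`. The sketch below generates such index pairs. `embargoed_splits` is a hypothetical helper, not the notebook's implementation, and the sizes in the example are arbitrary.

```python
from typing import Iterator, Tuple

def embargoed_splits(n: int, min_train_size: int, horizon: int) -> Iterator[Tuple[slice, int]]:
    """Yield (train_slice, test_index) pairs for an expanding window.

    Hypothetical helper: training stops at the last row whose label is
    already observable at test row t, i.e. rows i with i + horizon <= t.
    """
    # Start late enough that the first training window still has min_train_size rows.
    for t in range(min_train_size + horizon - 1, n):
        train_end = t - horizon + 1  # exclusive end; the last training row is t - horizon
        yield slice(0, train_end), t

splits = list(embargoed_splits(n=20, min_train_size=10, horizon=7))
first_train, first_test = splits[0]
print(first_train, first_test)
```

Each `train_slice` could replace `slice(0, t)` in a backtest loop; the reported metrics would then need to be recomputed, since the training data at every step changes.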


In [42]:
def overall_metrics(bt: pd.DataFrame, model: str) -> dict:
    y = bt["y_true"].values
    p = bt[f"p_{model}"].values.astype(float)
    yhat = bt[f"yhat_{model}"].values.astype(int)

    return {
        "model": model,
        "n": int(len(bt)),
        "accuracy": float(accuracy_score(y, yhat)),
        "log_loss": float(log_loss(y, p, labels=[0,1])),
        "brier": float(brier_score_loss(y, p)),
        "mean_confidence": float(bt[f"conf_{model}"].mean()),
    }

metrics = pd.DataFrame([
    overall_metrics(bt, "coin"),
    overall_metrics(bt, "momentum"),
    overall_metrics(bt, "logistic"),
]).sort_values("accuracy", ascending=False)

metrics


index,model,n,accuracy,log_loss,brier,mean_confidence
2,logistic,1728,0.53125,0.689689,0.248268,0.544571
0,coin,1728,0.50463,0.693147,0.25,0.5
1,momentum,1728,0.48669,0.764591,0.278661,0.622876


## 8) Overall metrics (probabilistic + classification)

We evaluate:
- **Accuracy** for direction
- **Log loss** for probability quality (lower is better)
- **Brier score** for calibration / probability error (lower is better)

In FX, probability metrics matter because a small edge is only useful if the stated confidence is reliable.
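
As a sanity check on the two probability metrics, the sketch below computes them by hand for a constant 0.5 forecast. The helper names `log_loss_manual` and `brier_manual` are hypothetical and the labels are invented; the notebook itself uses the scikit-learn functions.

```python
import math

def log_loss_manual(y, p):
    """Mean negative log-likelihood of binary labels y under probabilities p."""
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi) for yi, pi in zip(y, p)) / len(y)

def brier_manual(y, p):
    """Mean squared error between probabilities and binary outcomes."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

y = [1, 0, 1, 1]
p_coin = [0.5] * len(y)

# A constant 0.5 forecast scores ln(2) and 0.25 regardless of the labels,
# which is why the coinflip row in the metrics table reads 0.693147 and 0.25.
print(log_loss_manual(y, p_coin), brier_manual(y, p_coin))
```

Any model with a log loss above ln(2) or a Brier score above 0.25 is therefore producing probabilities that are worse than always saying 0.5.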


In [43]:
def confidence_bucket_metrics(bt: pd.DataFrame, model: str, thresholds=(0.50, 0.55, 0.60, 0.65, 0.70)) -> pd.DataFrame:
    y = bt["y_true"].values
    p = bt[f"p_{model}"].values.astype(float)
    yhat = bt[f"yhat_{model}"].values.astype(int)
    conf = bt[f"conf_{model}"].values.astype(float)

    rows = []
    for thr in thresholds:
        mask = conf >= thr
        if mask.sum() == 0:
            rows.append({"model": model, "conf_thr": thr, "coverage": 0.0, "accuracy": np.nan, "log_loss": np.nan, "brier": np.nan})
            continue

        rows.append({
            "model": model,
            "conf_thr": thr,
            "coverage": float(mask.mean()),
            "accuracy": float(accuracy_score(y[mask], yhat[mask])),
            "log_loss": float(log_loss(y[mask], p[mask], labels=[0,1])),
            "brier": float(brier_score_loss(y[mask], p[mask])),
        })

    return pd.DataFrame(rows)

bucket = pd.concat([
    confidence_bucket_metrics(bt, "coin"),
    confidence_bucket_metrics(bt, "momentum"),
    confidence_bucket_metrics(bt, "logistic"),
], ignore_index=True)

bucket.sort_values(["model", "conf_thr"])


index,model,conf_thr,coverage,accuracy,log_loss,brier
0,coin,0.5,1.0,0.50463,0.693147,0.25
1,coin,0.55,0.0,,,
2,coin,0.6,0.0,,,
3,coin,0.65,0.0,,,
4,coin,0.7,0.0,,,
10,logistic,0.5,1.0,0.53125,0.689689,0.248268
11,logistic,0.55,0.390625,0.555556,0.687396,0.247114
12,logistic,0.6,0.051505,0.617978,0.667219,0.237027
13,logistic,0.65,0.004051,0.428571,0.816853,0.308296
14,logistic,0.7,0.000579,1.0,0.342682,0.084179


## 9) Confidence gating (coverage vs performance)

FX rates behave close to a random walk most of the time.

Instead of optimizing global accuracy only, we also measure performance when the model is confident:
- define confidence as `max(p, 1-p)`
- compute metrics only for rows where confidence ≥ threshold (e.g., 0.55, 0.60, 0.65)

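This yields a coverage vs accuracy tradeoff. For the logistic baseline, accuracy rises from 0.531 at full coverage to 0.556 at threshold 0.55 (39% coverage) and 0.618 at threshold 0.60 (5% coverage). The 0.65 and 0.70 buckets contain only 7 and 1 rows respectively, too few to draw conclusions from.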
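
The gating rule can be illustrated without any libraries. This is a minimal sketch on invented labels and probabilities; `gate` is a hypothetical helper that mirrors the confidence definition `max(p, 1-p)` used here.

```python
def gate(y_true, p_up, thr):
    """Return (coverage, accuracy) over rows whose confidence max(p, 1-p) >= thr."""
    kept = [(y, p) for y, p in zip(y_true, p_up) if max(p, 1.0 - p) >= thr]
    if not kept:
        return 0.0, float("nan")
    hits = sum(int(p >= 0.5) == y for y, p in kept)
    return len(kept) / len(y_true), hits / len(kept)

# Toy labels and probabilities, invented for illustration.
y_true = [1, 0, 1, 0, 1, 0]
p_up = [0.52, 0.48, 0.61, 0.58, 0.70, 0.35]

for thr in (0.50, 0.55, 0.60):
    coverage, accuracy = gate(y_true, p_up, thr)
    print(thr, round(coverage, 3), round(accuracy, 3))
```

In this toy data the single wrong call has confidence 0.58, so raising the threshold to 0.60 removes it at the cost of keeping only half of the rows.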


In [44]:
bt_out = OUT_DIR / "direction_baseline_backtest_rows.csv"
overall_out = OUT_DIR / "direction_baseline_metrics_overall.csv"
bucket_out = OUT_DIR / "direction_baseline_metrics_by_confidence.csv"

bt.to_csv(bt_out, index=False)
metrics.to_csv(overall_out, index=False)
bucket.to_csv(bucket_out, index=False)

print("Saved:")
# Print paths relative to the repo root so no local directories end up in the notebook output.
print("-", bt_out.relative_to(REPO_ROOT))
print("-", overall_out.relative_to(REPO_ROOT))
print("-", bucket_out.relative_to(REPO_ROOT))

Saved:
- outputs/direction_baseline_backtest_rows.csv
- outputs/direction_baseline_metrics_overall.csv
- outputs/direction_baseline_metrics_by_confidence.csv


## 10) Save outputs

We export:
- row-level backtest predictions
- overall metrics
- confidence-gated metrics

These outputs are designed to be:
- reproducible
- easy to load into Notion / dashboards
- consistent with future model upgrades (XGBoost, etc.)
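
For downstream use, the row-level CSV can be read back with the date column restored. The sketch below round-trips a two-row stand-in frame through a temporary directory; it assumes only the `date`, `y_true`, and `p_logistic` columns for brevity, whereas the real file has more.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two-row stand-in for the backtest rows (values taken from the output shown earlier).
rows = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-09", "2019-01-10"]),
    "y_true": [1, 1],
    "p_logistic": [0.568030, 0.547572],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "direction_baseline_backtest_rows.csv"
    rows.to_csv(path, index=False)
    # CSV stores dates as text; parse_dates restores the datetime dtype on load.
    reloaded = pd.read_csv(path, parse_dates=["date"])

print(reloaded.dtypes)
```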
