# 03 — Directional Feature Engineering (USD/CAD)

Goal: Construct a leakage-safe, explainable feature set for predicting the
7-business-day direction (UP / DOWN) of USD/CAD.

This notebook follows the Gold-layer data contract:
- obs_date (date)
- series_id (single series per file)
- value (float)
- prev_value (float, guaranteed lag-1 of value)


## Scope

This notebook covers feature construction only.

It:
- reads a Gold-layer parquet snapshot
- validates the Gold contract
- builds a directional target over a fixed horizon
- creates a minimal, defensible set of predictive features

It does **not** train models or run backtests.



In [14]:
from __future__ import annotations

from pathlib import Path
import os
import numpy as np
import pandas as pd

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 140)

RANDOM_SEED = 7
np.random.seed(RANDOM_SEED)

def find_repo_root(start: Path | None = None) -> Path:
    start = start or Path.cwd()
    for p in [start, *start.parents]:
        if (p / "data").is_dir() and (p / "src").is_dir():
            return p
    raise RuntimeError(f"Repo root not found from: {start}. Run notebook from inside repo.")

REPO_ROOT = find_repo_root()

DATA_DIR = REPO_ROOT / "data"
OUT_DIR = REPO_ROOT / "outputs"
OUT_DIR.mkdir(parents=True, exist_ok=True)

PARQUET_PATH = DATA_DIR / "data-USD-CAD.parquet"
if not PARQUET_PATH.exists():
    raise FileNotFoundError(
        f"Parquet not found at: {PARQUET_PATH}\n"
        f"REPO_ROOT={REPO_ROOT}\n"
        f"Contents of DATA_DIR: {[p.name for p in DATA_DIR.glob('*')][:20]}"
    )

H = 7
SAVE_FEATURE_PARQUET = True
FEATURE_OUT_PATH = OUT_DIR / f"usdcad_features_h{H}.parquet"


## Load Gold Data + Validate Contract

We validate:
- required columns exist
- obs_date is parseable as datetime
- exactly one series_id exists in the file
- prev_value matches lag-1(value)



In [15]:
df = pd.read_parquet(PARQUET_PATH)

required_cols = {"obs_date", "series_id", "value", "prev_value"}
missing = required_cols - set(df.columns)
assert not missing, f"Gold contract violated. Missing columns: {missing}"

df["obs_date"] = pd.to_datetime(df["obs_date"])
df = df.sort_values("obs_date").set_index("obs_date")

series_ids = df["series_id"].dropna().unique()
assert len(series_ids) == 1, f"Expected exactly one series_id, found: {series_ids}"
SERIES_ID = series_ids[0]

lag1 = df["value"].shift(1)
mask = lag1.notna()
assert (df.loc[mask, "prev_value"].values == lag1.loc[mask].values).all(), \
    "Gold contract violated: prev_value is not exactly lag-1(value)."

df[["series_id", "value", "prev_value"]].head(), SERIES_ID, df.index.min(), df.index.max(), df.shape


(           series_id   value  prev_value
 obs_date                                
 2017-01-03  FXUSDCAD  1.3435         NaN
 2017-01-04  FXUSDCAD  1.3315      1.3435
 2017-01-05  FXUSDCAD  1.3244      1.3315
 2017-01-06  FXUSDCAD  1.3214      1.3244
 2017-01-09  FXUSDCAD  1.3240      1.3214,
 'FXUSDCAD',
 Timestamp('2017-01-03 00:00:00'),
 Timestamp('2025-12-22 00:00:00'),
 (2239, 38))

## Target: Direction over H business days

We define:

- forward_return_H = value[t+H] / value[t] - 1
- direction_H = 1 if forward_return_H > 0, else 0 (a zero return is labeled 0, i.e. DOWN)

Leakage handling:
- the target uses future prices by construction
- the final H rows, where value[t+H] does not exist, are dropped
- all features are computed using information available at time t or earlier
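
As a toy illustration of the definitions above, the sketch below builds the target on a made-up six-point price series with a shorter horizon. The prices, the horizon `H_TOY = 2`, and the name `toy` are hypothetical and chosen only to keep the example small; none of them come from the Gold data.

```python
import pandas as pd

H_TOY = 2  # illustrative horizon; the notebook itself uses H = 7

toy = pd.DataFrame(
    {"value": [1.00, 1.10, 1.00, 1.05, 1.20, 1.15]},
    index=pd.bdate_range("2024-01-01", periods=6),
)

# forward_return_H = value[t+H] / value[t] - 1
toy["fwd_return"] = toy["value"].shift(-H_TOY) / toy["value"] - 1.0

# NaN > 0 evaluates to False, so the last H rows would be labeled 0
# here; they carry no valid target and are dropped on the next line.
toy["direction"] = (toy["fwd_return"] > 0).astype(int)
toy = toy.iloc[:-H_TOY].copy()

print(toy)
```

The first row has a forward return of exactly zero and is therefore labeled 0, which shows how ties fall on the DOWN side under this definition.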


In [16]:
df_model = df.copy()

df_model[f"fwd_return_{H}d"] = df_model["value"].shift(-H) / df_model["value"] - 1.0
df_model[f"direction_{H}d"] = (df_model[f"fwd_return_{H}d"] > 0).astype(int)

# Drop last H rows where target is undefined
df_model = df_model.iloc[:-H].copy()

df_model[["value", f"fwd_return_{H}d", f"direction_{H}d"]].tail(10)


obs_date,value,fwd_return_7d,direction_7d
2025-11-28,1.3979,-0.009657,0
2025-12-01,1.3979,-0.010301,0
2025-12-02,1.3986,-0.015158,0
2025-12-03,1.3949,-0.012904,0
2025-12-04,1.3952,-0.013045,0
2025-12-05,1.386,-0.008081,0
2025-12-08,1.3837,-0.003975,0
2025-12-09,1.3844,-0.005056,0
2025-12-10,1.3835,-0.003903,0
2025-12-11,1.3774,-0.001888,0


## Feature Groups (minimal + defensible)

1) Lagged returns
- Hypothesis: weak momentum / mean reversion may exist across short horizons.

2) Rolling volatility
- Hypothesis: signal behavior changes across risk regimes.

3) Rolling momentum (smoothed returns)
- Hypothesis: persistent drift may exist over multi-day windows.

4) Standardized return (z-score)
- Hypothesis: move size relative to recent vol may be informative.

5) Simple volatility regime flags
- Hypothesis: model performance differs between low/high vol environments.

6) Calendar features
- Hypothesis: mild weekday/month effects may exist.


In [17]:
def pct_return_from_values(curr: pd.Series, lagged: pd.Series) -> pd.Series:
    return curr / lagged - 1.0

def pct_return(s: pd.Series, n: int) -> pd.Series:
    return s / s.shift(n) - 1.0

def rolling_std(s: pd.Series, w: int) -> pd.Series:
    return s.rolling(window=w, min_periods=w).std()

def rolling_mean(s: pd.Series, w: int) -> pd.Series:
    return s.rolling(window=w, min_periods=w).mean()

EPS = 1e-12


In [18]:
value = df_model["value"]
prev_value = df_model["prev_value"]  # contract: exact lag-1
ret_1d = pct_return_from_values(value, prev_value)

feat = pd.DataFrame(index=df_model.index)
feat["value"] = value

# 1) Lagged returns (multi-horizon)
for n in [1, 3, 5, 10, 21]:
    feat[f"ret_{n}d"] = pct_return(value, n)

# 2) Rolling volatility of 1d returns
for w in [5, 10, 21, 63]:
    feat[f"vol_{w}d"] = rolling_std(ret_1d, w)

# 3) Rolling momentum (mean of 1d returns)
for w in [5, 10, 21]:
    feat[f"mom_{w}d"] = rolling_mean(ret_1d, w)

# 4) Z-scores: standardized 1d return vs trailing vol
for w in [21, 63]:
    feat[f"zret_1d_{w}d"] = ret_1d / (feat[f"vol_{w}d"] + EPS)

# 5) Simple regime indicators
feat["vol_ratio_21_63"] = (feat["vol_21d"] + EPS) / (feat["vol_63d"] + EPS)

# High-vol regime: vol_21 above its trailing median over last ~252 business days
feat["vol_21_med_252"] = feat["vol_21d"].rolling(252, min_periods=252).median()
feat["is_high_vol"] = (feat["vol_21d"] > feat["vol_21_med_252"]).astype(int)

# 6) Calendar features
idx = feat.index
feat["day_of_week"] = idx.dayofweek.astype(int)   # 0=Mon
feat["month"] = idx.month.astype(int)
feat["is_month_end"] = idx.is_month_end.astype(int)

# Attach target
feat[f"direction_{H}d"] = df_model[f"direction_{H}d"].astype(int)
feat[f"fwd_return_{H}d"] = df_model[f"fwd_return_{H}d"].astype(float)

feat.head()


obs_date,value,ret_1d,ret_3d,ret_5d,ret_10d,ret_21d,vol_5d,vol_10d,vol_21d,vol_63d,mom_5d,mom_10d,mom_21d,zret_1d_21d,zret_1d_63d,vol_ratio_21_63,vol_21_med_252,is_high_vol,day_of_week,month,is_month_end,direction_7d,fwd_return_7d
2017-01-03,1.3435,,,,,,,,,,,,,,,,,0,1,1,0,0,-0.024488
2017-01-04,1.3315,-0.008932,,,,,,,,,,,,,,,,0,2,1,0,0,-0.013068
2017-01-05,1.3244,-0.005332,,,,,,,,,,,,,,,,0,3,1,0,0,-0.006947
2017-01-06,1.3214,-0.002265,-0.01645,,,,,,,,,,,,,,,0,4,1,0,0,-0.011881
2017-01-09,1.324,0.001968,-0.005633,,,,,,,,,,,,,,,0,0,1,0,0,-0.01065


## Leakage Safety

- All rolling statistics are trailing.
- No feature uses future values (no negative shifts).
- Features at time t use data up to and including t; the target compares value[t+H] with value[t].
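
These claims can be spot-checked mechanically: recompute the features on data truncated at a date t and confirm that the row at t matches the full-sample row. The sketch below does this on a synthetic random walk. `build_features` is a hypothetical, reduced stand-in for the feature cell above, and the series, seed, and cut-off positions are made up for illustration.

```python
import numpy as np
import pandas as pd

def build_features(value: pd.Series) -> pd.DataFrame:
    """Reduced, hypothetical feature builder using only trailing operations."""
    ret_1d = value / value.shift(1) - 1.0
    out = pd.DataFrame(index=value.index)
    out["ret_5d"] = value / value.shift(5) - 1.0
    out["vol_21d"] = ret_1d.rolling(21, min_periods=21).std()
    out["mom_10d"] = ret_1d.rolling(10, min_periods=10).mean()
    return out

# Synthetic price path (not real USD/CAD data).
rng = np.random.default_rng(7)
idx = pd.bdate_range("2020-01-01", periods=300)
steps = rng.normal(0.0, 0.004, len(idx))
value = pd.Series(1.30 * np.exp(np.cumsum(steps)), index=idx)

full = build_features(value)

# Features computed from data up to t only must equal the
# full-sample features at t, for every cut-off tested.
for pos in [50, 120, 299]:
    t = idx[pos]
    truncated = build_features(value.loc[:t])
    pd.testing.assert_series_equal(truncated.loc[t], full.loc[t])

# A deliberately leaky feature (negative shift) fails the same comparison.
cut = value.loc[: idx[50]]
leaky_full = value.shift(-1) / value - 1.0
leaky_trunc = cut.shift(-1) / cut - 1.0
leaky_matches = bool(
    np.isclose(leaky_full.iloc[50], leaky_trunc.iloc[50], equal_nan=True)
)
print("leaky feature matches at t:", leaky_matches)
```

The trailing features pass the assertion at every cut-off, while the forward-looking feature does not match because its value at t is undefined once the future row is removed.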


In [19]:
coverage = (1.0 - feat.isna().mean()).sort_values(ascending=True).to_frame("non_null_rate")
coverage.head(15), coverage.tail(15)

target_cols = [f"direction_{H}d", f"fwd_return_{H}d"]
feature_cols = [c for c in feat.columns if c not in target_cols]

# Strict contract for modeling notebooks: no missing feature values
df_feat = feat.dropna(subset=feature_cols + target_cols).copy()

df_feat.shape, df_feat.index.min(), df_feat.index.max()


((1960, 23),
 Timestamp('2018-02-02 00:00:00'),
 Timestamp('2025-12-11 00:00:00'))
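
The row count above can be reconciled with the warm-up windows, assuming the Gold file has no missing values other than the first prev_value. The slowest feature to become available is vol_21_med_252, since it stacks a 252-row median on top of a 21-row volatility of 1d returns. A short arithmetic sketch, using 0-based row positions:

```python
n_gold_rows = 2239   # rows in the Gold snapshot (from the load cell output)
H = 7                # target horizon; the last H rows are dropped

first_ret_1d = 1                          # ret_1d needs one prior value
first_vol_21d = first_ret_1d + 21 - 1     # 21 returns, first valid at row 21
first_med_252 = first_vol_21d + 252 - 1   # 252 vol values, first valid at row 272

expected_rows = n_gold_rows - H - first_med_252
print(expected_rows)  # 1960
```

This matches the 1960 rows reported by `df_feat.shape`, which suggests the warm-up windows and the dropped target rows account for all removed rows.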

In [20]:
df_feat[f"direction_{H}d"].value_counts(normalize=True)

corr = df_feat[feature_cols + [f"direction_{H}d"]].corr(numeric_only=True)
corr[f"direction_{H}d"].sort_values(ascending=False).head(10), \
corr[f"direction_{H}d"].sort_values().head(10)


(direction_7d      1.000000
 vol_21_med_252    0.038657
 zret_1d_63d       0.019777
 zret_1d_21d       0.019088
 is_month_end      0.015234
 ret_1d            0.015139
 month            -0.000355
 day_of_week      -0.010079
 ret_3d           -0.010889
 ret_10d          -0.020373
 Name: direction_7d, dtype: float64,
 value             -0.126236
 vol_5d            -0.114916
 is_high_vol       -0.106623
 vol_10d           -0.105013
 vol_21d           -0.095203
 mom_21d           -0.090714
 ret_21d           -0.090220
 vol_63d           -0.086256
 vol_ratio_21_63   -0.052777
 mom_5d            -0.021520
 Name: direction_7d, dtype: float64)

In [22]:
if SAVE_FEATURE_PARQUET:
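    # Create the directory the file is actually written to (OUT_DIR),
    # rather than an "outputs" folder relative to the current working directory.
    FEATURE_OUT_PATH.parent.mkdir(parents=True, exist_ok=True)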
    df_feat.to_parquet(FEATURE_OUT_PATH, index=True)
    print("| rows:", len(df_feat), "| cols:", df_feat.shape[1], "| series_id:", SERIES_ID)
else:
    print("SAVE_FEATURE_PARQUET=False — nothing written.")


| rows: 1960 | cols: 23 | series_id: FXUSDCAD
