# ML Feature Engineering & IC Analysis — BTC/USDT

**Goal:** Establish which features carry genuine predictive signal *before* building any model.
The key diagnostic is the **Information Coefficient (IC)** — the Spearman rank correlation
between each feature at time *t* and the actual forward return at *t+h*.

| \|IC\| range | Interpretation |
|---|---|
| < 0.02 | Negligible — likely noise |
| 0.02–0.05 | Weak but potentially useful with many features combined |
| 0.05–0.10 | Moderate — worth modelling |
| > 0.10 | Strong (rare in equity/crypto) |
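
As a minimal sketch of this definition, the snippet below computes an IC on synthetic data (not BTC/USDT). The helper name `information_coefficient` and the forward-return formula `log(close[t+h] / close[t])` are assumptions made for illustration; the project's own `forward_return` in `ml.labels` may differ in detail. Spearman correlation is computed here as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd


def information_coefficient(feature: pd.Series, close: pd.Series, horizon: int = 1) -> float:
    """Spearman IC between feature[t] and the log-return from t to t+horizon."""
    # Assumed label definition: the return is only known at t+horizon.
    fwd_ret = np.log(close.shift(-horizon) / close)
    pair = pd.concat([feature, fwd_ret], axis=1).dropna()
    # Spearman correlation = Pearson correlation of the ranks.
    return pair.iloc[:, 0].rank().corr(pair.iloc[:, 1].rank())


# Synthetic random-walk prices (hypothetical data)
rng = np.random.default_rng(0)
rets = rng.normal(0.0, 0.005, 2_000)
close = pd.Series(100 * np.exp(np.cumsum(rets)))

noise_feature = pd.Series(rng.normal(size=2_000))   # unrelated to returns
leaky_feature = np.log(close.shift(-1) / close)     # look-ahead: equals the label

ic_noise = information_coefficient(noise_feature, close)
ic_leak = information_coefficient(leaky_feature, close)
print(f"noise feature IC = {ic_noise:+.4f}")
print(f"leaky feature IC = {ic_leak:+.4f}")
```

A feature that leaks the label gives an IC of 1, which is why an unusually high IC is worth checking for look-ahead before it is trusted.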

## §1 — Config

In [1]:
import sys
from pathlib import Path

# Notebooks have no __file__; this assumes the notebook runs from a subdirectory of the repo root.
repo_root = Path.cwd().resolve().parent
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))

import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import scipy.stats as stats

plt.rcParams.update({
    "figure.dpi":        120,
    "axes.spines.top":   False,
    "axes.spines.right": False,
    "font.size":         9,
})

SINCE       = "2024-01-01"
UNTIL       = "2025-01-01"
LABEL_H     = 1          # primary label horizon (bars)
HORIZONS    = [1, 2, 5, 10, 20, 48]   # for IC-decay analysis
IC_ROLL_WIN = 30 * 24    # rolling IC window in bars (≈30 days at 1h)

print(f"Primary dataset: {SINCE} → {UNTIL} | label horizon: {LABEL_H} bar(s)")

Primary dataset: 2024-01-01 → 2025-01-01 | label horizon: 1 bar(s)


## §2 — Data

In [2]:
from data.fetch import fetch_ohlcv

df = fetch_ohlcv(since=SINCE, until=UNTIL)
print(f"{len(df):,} bars  |  {df.index[0]} → {df.index[-1]}")

log_ret = np.log(df["close"] / df["close"].shift(1)).dropna()

fig, axes = plt.subplots(1, 2, figsize=(14, 3))
df["close"].plot(ax=axes[0], color="steelblue", linewidth=0.7)
axes[0].set_title("BTC/USDT close (1h)", fontsize=10)
axes[0].set_ylabel("USDT")

log_ret.plot(ax=axes[1], color="gray", linewidth=0.5, alpha=0.6)
axes[1].set_title("1h log-returns", fontsize=10)
axes[1].set_ylabel("log-return")
axes[1].axhline(0, color="black", linewidth=0.5)

plt.tight_layout()
plt.show()

print(f"\nReturn stats: mean={log_ret.mean():.6f}  std={log_ret.std():.6f}  "
      f"skew={log_ret.skew():.3f}  kurt={log_ret.kurt():.2f}")

8,785 bars  |  2024-01-01 00:00:00+00:00 → 2025-01-01 00:00:00+00:00



Return stats: mean=0.000091  std=0.005619  skew=-0.210  kurt=7.63


## §3 — Feature matrix

In [3]:
from ml.features import build_feature_matrix
from ml.labels  import forward_return, direction_label

feats = build_feature_matrix(df)
fwd1  = forward_return(df, horizon=LABEL_H)

# Align and drop warm-up rows
combined = pd.concat([feats, fwd1], axis=1).dropna()
X = combined[feats.columns]
y = combined[fwd1.name]

print(f"Feature matrix: {X.shape}  ({X.shape[1]} features, {X.shape[0]} usable rows)")
print(f"Label: {y.name}  |  mean={y.mean():.6f}  std={y.std():.6f}")
# Group columns by naming convention for a quick overview
BAR_FEATS = ("bar_ret", "hl_range", "upper_wick", "lower_wick", "vol_log_chg")
lag_cols  = [c for c in X.columns if c.startswith("ret_") or c in BAR_FEATS]
time_cols = [c for c in X.columns if c.startswith(("hour", "dow"))]
tech_cols = [c for c in X.columns if c not in lag_cols and c not in time_cols]
print("\nFeature groups:")
print(f"  Technical  : {tech_cols}")
print(f"  Lag/rolling: {lag_cols}")
print(f"  Time       : {time_cols}")

Feature matrix: (8763, 34)  (34 features, 8763 usable rows)
Label: fwd_ret_1  |  mean=0.000088  std=0.005623

Feature groups:
  Technical  : ['rsi', 'bb_width', 'bb_pct_b', 'bb_zscore', 'atr_pct', 'macd_hist_norm', 'volume_ratio', 'stoch_k', 'adx', 'di_diff']
  Lag/rolling: ['ret_lag1', 'ret_lag2', 'ret_lag3', 'ret_lag5', 'ret_lag10', 'ret_lag20', 'ret_mean_5', 'ret_std_5', 'ret_skew_5', 'ret_mean_10', 'ret_std_10', 'ret_skew_10', 'ret_mean_20', 'ret_std_20', 'ret_skew_20', 'bar_ret', 'hl_range', 'upper_wick', 'lower_wick', 'vol_log_chg']
  Time       : ['hour_sin', 'hour_cos', 'dow_sin', 'dow_cos']


## §4 — IC analysis: all features vs 1h forward return

In [4]:
def compute_ic(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Spearman rank correlation of each feature column with y."""
    ic = {}
    for col in X.columns:
        valid = X[col].notna() & y.notna()
        if valid.sum() < 30:
            ic[col] = np.nan
        else:
            rho, _ = stats.spearmanr(X.loc[valid, col], y[valid])
            ic[col] = rho
    return pd.Series(ic, name="IC")


ic = compute_ic(X, y).sort_values(key=abs, ascending=False)

# ── Ranked bar chart ──────────────────────────────────────────────────────────
fig, ax = plt.subplots(figsize=(14, 5))
colors = ["#2ecc71" if v >= 0 else "#e74c3c" for v in ic.values]
bars = ax.bar(range(len(ic)), ic.values, color=colors, edgecolor="white", linewidth=0.3)
ax.set_xticks(range(len(ic)))
ax.set_xticklabels(ic.index, rotation=45, ha="right", fontsize=8)
ax.axhline(0,     color="black", linewidth=0.7)
ax.axhline( 0.02, color="gray",  linewidth=0.7, linestyle="--", label="±0.02 threshold")
ax.axhline(-0.02, color="gray",  linewidth=0.7, linestyle="--")
ax.set_title(f"Feature IC (Spearman) vs 1h forward log-return  [n={len(y):,}]", fontsize=11)
ax.set_ylabel("Information Coefficient (IC)")
ax.legend(fontsize=8)
plt.tight_layout()
plt.show()

# ── Summary table ─────────────────────────────────────────────────────────────
ic_df = pd.DataFrame({"IC": ic, "|IC|": ic.abs()}).sort_values("|IC|", ascending=False)
print("Top 15 features by |IC|:")
print(ic_df.head(15).to_string(float_format="{:.5f}".format))
print(f"\nFeatures with |IC| > 0.02: {(ic_df['|IC|'] > 0.02).sum()} / {len(ic_df)}")
print(f"Mean |IC| across all features: {ic_df['|IC|'].mean():.5f}")

Top 15 features by |IC|:
                     IC    |IC|
bar_ret        -0.08120 0.08120
ret_lag1       -0.08120 0.08120
stoch_k        -0.06316 0.06316
bb_pct_b       -0.04904 0.04904
bb_zscore      -0.04904 0.04904
rsi            -0.04470 0.04470
upper_wick      0.03746 0.03746
ret_std_5       0.03288 0.03288
ret_mean_5     -0.02997 0.02997
di_diff        -0.02733 0.02733
ret_lag3       -0.02592 0.02592
atr_pct         0.02574 0.02574
ret_std_10      0.02260 0.02260
bb_width        0.02253 0.02253
macd_hist_norm -0.02227 0.02227

Features with |IC| > 0.02: 15 / 34
Mean |IC| across all features: 0.02386


## §5 — IC stability: rolling IC for top features

In [5]:
TOP_N = 6
top_features = ic_df.head(TOP_N).index.tolist()

# Rolling IC: compute Spearman correlation in a rolling window
def rolling_ic(feat: pd.Series, label: pd.Series, window: int) -> pd.Series:
    aligned = pd.concat([feat, label], axis=1).dropna()
    f, l    = aligned.iloc[:, 0], aligned.iloc[:, 1]
    roll_ic = []
    idx     = []
    for i in range(window, len(f) + 1):
        rho, _ = stats.spearmanr(f.iloc[i-window:i], l.iloc[i-window:i])
        roll_ic.append(rho)
        idx.append(f.index[i-1])
    return pd.Series(roll_ic, index=idx)


fig, axes = plt.subplots(2, 3, figsize=(16, 6), sharey=True)
fig.suptitle(f"Rolling IC (window={IC_ROLL_WIN//24}d) — top {TOP_N} features", fontsize=11)

for ax, feat in zip(axes.flat, top_features):
    ric = rolling_ic(X[feat], y, IC_ROLL_WIN)
    ric.plot(ax=ax, color="steelblue", linewidth=1.0)
    ax.axhline(0,     color="black", linewidth=0.6)
    ax.axhline( 0.05, color="green", linewidth=0.6, linestyle="--", alpha=0.5)
    ax.axhline(-0.05, color="red",   linewidth=0.6, linestyle="--", alpha=0.5)
    mean_ic = ric.mean()
    ax.set_title(f"{feat}  (mean IC={mean_ic:+.4f})", fontsize=8)
    ax.tick_params(axis="x", labelsize=6, rotation=20)

plt.tight_layout()
plt.show()

## §6 — IC by label horizon (IC decay)

In [6]:
horizon_ics = {}
for h in HORIZONS:
    fwd_h   = forward_return(df, horizon=h)
    joined  = pd.concat([feats, fwd_h], axis=1).dropna()
    X_h     = joined[feats.columns]
    y_h     = joined[fwd_h.name]
    ic_h    = compute_ic(X_h, y_h)
    horizon_ics[h] = ic_h

ic_decay = pd.DataFrame(horizon_ics).T   # rows=horizon, cols=features
ic_decay.index.name = "horizon_bars"

# ── Mean |IC| across all features per horizon ─────────────────────────────────
mean_abs_ic = ic_decay.abs().mean(axis=1)

fig, axes = plt.subplots(1, 2, figsize=(14, 4))

mean_abs_ic.plot(ax=axes[0], marker="o", color="steelblue", linewidth=1.5)
axes[0].set_title("Mean |IC| across all features vs horizon", fontsize=10)
axes[0].set_xlabel("Horizon (bars)")
axes[0].set_ylabel("Mean |IC|")
axes[0].axhline(0.02, color="gray", linestyle="--", linewidth=0.8, label="0.02 threshold")
axes[0].legend(fontsize=8)

# Top 8 features IC decay curves
top8 = ic_df.head(8).index.tolist()
for feat in top8:
    ic_decay[feat].abs().plot(ax=axes[1], marker=".", linewidth=1.0, label=feat)
axes[1].set_title("|IC| decay for top 8 features", fontsize=10)
axes[1].set_xlabel("Horizon (bars)")
axes[1].set_ylabel("|IC|")
axes[1].legend(fontsize=7, ncol=2)

plt.tight_layout()
plt.show()

print("Mean |IC| by horizon:")
print(mean_abs_ic.to_frame("mean_abs_IC").to_string(float_format="{:.5f}".format))

Mean |IC| by horizon:
              mean_abs_IC
horizon_bars             
1                 0.02386
2                 0.02301
5                 0.02257
10                0.02207
20                0.02500
48                0.02451


## §7 — Intraday seasonality

In [7]:
hourly = pd.DataFrame({
    "log_return": np.log(df["close"] / df["close"].shift(1)),
    "hour":       df.index.hour,
    "dow":        df.index.dayofweek,
}).dropna()

hour_mean = hourly.groupby("hour")["log_return"].agg(["mean", "std", "count"])
hour_mean["se"] = hour_mean["std"] / np.sqrt(hour_mean["count"])
hour_mean["t_stat"] = hour_mean["mean"] / hour_mean["se"]

dow_mean  = hourly.groupby("dow")["log_return"].agg(["mean", "std", "count"])
dow_mean["se"] = dow_mean["std"] / np.sqrt(dow_mean["count"])

fig, axes = plt.subplots(1, 2, figsize=(14, 4))

colors_h = ["#2ecc71" if v > 0 else "#e74c3c" for v in hour_mean["mean"]]
axes[0].bar(hour_mean.index, hour_mean["mean"] * 1e4, color=colors_h, alpha=0.8)
axes[0].errorbar(hour_mean.index, hour_mean["mean"] * 1e4,
                 yerr=2 * hour_mean["se"] * 1e4,
                 fmt="none", color="black", linewidth=0.8, capsize=2)
axes[0].axhline(0, color="black", linewidth=0.5)
axes[0].set_title("Mean 1h return by hour of day (×10⁻⁴)", fontsize=10)
axes[0].set_xlabel("UTC hour")
axes[0].set_ylabel("Mean log-return (×10⁻⁴)")
axes[0].set_xticks(range(24))

dow_labels = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
colors_d   = ["#2ecc71" if v > 0 else "#e74c3c" for v in dow_mean["mean"]]
axes[1].bar(dow_labels, dow_mean["mean"] * 1e4, color=colors_d, alpha=0.8)
axes[1].errorbar(range(7), dow_mean["mean"] * 1e4,
                 yerr=2 * dow_mean["se"] * 1e4,
                 fmt="none", color="black", linewidth=0.8, capsize=2)
axes[1].axhline(0, color="black", linewidth=0.5)
axes[1].set_title("Mean 1h return by day of week (×10⁻⁴)", fontsize=10)
axes[1].set_ylabel("Mean log-return (×10⁻⁴)")

plt.tight_layout()
plt.show()

# Hours where |t-stat| > 2 (roughly the two-sided 5% level per hour, not adjusted for testing 24 hours)
sig_hours = hour_mean[hour_mean["t_stat"].abs() > 2]
if len(sig_hours):
    print(f"Statistically significant hours (|t|>2): {list(sig_hours.index)}")
    print(sig_hours[["mean", "t_stat"]].to_string(float_format="{:.5f}".format))
else:
    print("No hour with |t-stat| > 2 — no significant intraday seasonality detected.")

Statistically significant hours (|t|>2): [17, 19]
         mean   t_stat
hour                  
17    0.00064  2.07353
19   -0.00070 -2.00147


## §8 — IC by timeframe (5m, 15m, 1h, 4h, 1d)

In [8]:
def resample_ohlcv(df: pd.DataFrame, rule: str) -> pd.DataFrame:
    return df.resample(rule).agg({
        "open":   "first",
        "high":   "max",
        "low":    "min",
        "close":  "last",
        "volume": "sum",
    }).dropna()


def ic_for_df(df_tf: pd.DataFrame, horizon: int = 1) -> pd.Series:
    feats_tf = build_feature_matrix(df_tf)
    fwd_tf   = forward_return(df_tf, horizon=horizon)
    joined   = pd.concat([feats_tf, fwd_tf], axis=1).dropna()
    if len(joined) < 50:
        return pd.Series(dtype=float)
    return compute_ic(joined[feats_tf.columns], joined[fwd_tf.name])


# Build timeframes from 1h via resampling (no extra API calls)
tf_data = {
    "1h":  df,
    "4h":  resample_ohlcv(df, "4h"),
    "1d":  resample_ohlcv(df, "1d"),
}

# Fetch 5m separately (Jan–Mar 2024 only to keep it fast)
try:
    df_5m = fetch_ohlcv(timeframe="5m", since="2024-01-01", until="2024-04-01")
    print(f"5m data: {len(df_5m):,} bars")
    tf_data["5m"] = df_5m
    tf_data["15m"] = resample_ohlcv(df_5m, "15min")
    print(f"15m data (resampled): {len(tf_data['15m']):,} bars")
    TIMEFRAMES = ["5m", "15m", "1h", "4h", "1d"]
except Exception as e:
    print(f"5m fetch skipped ({e}); running 1h/4h/1d only")
    TIMEFRAMES = ["1h", "4h", "1d"]

for tf, d in tf_data.items():
    print(f"  {tf:4s}: {len(d):6,} bars")

5m data: 26,209 bars
15m data (resampled): 8,737 bars
  1h  :  8,785 bars
  4h  :  2,197 bars
  1d  :    367 bars
  5m  : 26,209 bars
  15m :  8,737 bars


In [9]:
tf_ic = {}
for tf in TIMEFRAMES:
    if tf in tf_data:
        ic_tf = ic_for_df(tf_data[tf], horizon=1)
        tf_ic[tf] = ic_tf
        print(f"{tf:4s}  bars={len(tf_data[tf]):6,}  mean|IC|={ic_tf.abs().mean():.5f}  max|IC|={ic_tf.abs().max():.5f}")

tf_ic_df = pd.DataFrame(tf_ic)   # rows=features, cols=timeframes

5m    bars=26,209  mean|IC|=0.01515  max|IC|=0.03676
15m   bars= 8,737  mean|IC|=0.02011  max|IC|=0.05636


1h    bars= 8,785  mean|IC|=0.02386  max|IC|=0.08120
4h    bars= 2,197  mean|IC|=0.02415  max|IC|=0.07408
1d    bars=   367  mean|IC|=0.04085  max|IC|=0.16470


In [10]:
# ── Heatmap: |IC| per feature per timeframe ────────────────────────────────────
tf_ic_abs = tf_ic_df.abs().reindex(ic_df.index)   # sort by 1h IC magnitude

fig, ax = plt.subplots(figsize=(max(6, len(TIMEFRAMES) * 2), 10))
cmap = plt.cm.YlOrRd
im = ax.imshow(tf_ic_abs.values, aspect="auto", cmap=cmap, vmin=0, vmax=0.08)
ax.set_xticks(range(len(TIMEFRAMES)))
ax.set_xticklabels(TIMEFRAMES, fontsize=9)
ax.set_yticks(range(len(tf_ic_abs)))
ax.set_yticklabels(tf_ic_abs.index, fontsize=7)
ax.set_title("|IC| heatmap — feature × timeframe (sorted by 1h IC)", fontsize=11)
plt.colorbar(im, ax=ax, label="|IC|")
plt.tight_layout()
plt.show()

# Summary: mean |IC| per timeframe
print("\nMean |IC| per timeframe:")
print(tf_ic_df.abs().mean().to_string(float_format="{:.5f}".format))


Mean |IC| per timeframe:
5m    0.01515
15m   0.02011
1h    0.02386
4h    0.02415
1d    0.04085


## §9 — Feature correlation heatmap

In [11]:
corr = X.corr(method="spearman")

fig, ax = plt.subplots(figsize=(13, 11))
im = ax.imshow(corr.values, cmap="RdBu_r", vmin=-1, vmax=1, aspect="auto")
ax.set_xticks(range(len(corr)))
ax.set_yticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=90, fontsize=6.5)
ax.set_yticklabels(corr.columns,              fontsize=6.5)
ax.set_title("Spearman feature correlation matrix", fontsize=11)
plt.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()

# Highly correlated pairs (|r| > 0.7, excluding diagonal)
corr_abs = corr.abs()
upper = corr_abs.where(np.triu(np.ones_like(corr_abs, dtype=bool), k=1))
high_corr = (
    upper.stack()
    .reset_index()
    .rename(columns={"level_0": "feat_a", "level_1": "feat_b", 0: "|r|"})
    .query("`|r|` > 0.7")
    .sort_values("|r|", ascending=False)
)
print(f"Highly correlated pairs (|r| > 0.7): {len(high_corr)}")
if len(high_corr):
    print(high_corr.to_string(index=False, float_format="{:.3f}".format))

Highly correlated pairs (|r| > 0.7): 27
        feat_a         feat_b   |r|
      bb_pct_b      bb_zscore 1.000
      ret_lag1        bar_ret 1.000
           rsi        di_diff 0.932
     bb_zscore        stoch_k 0.897
      bb_pct_b        stoch_k 0.897
           rsi       bb_pct_b 0.891
           rsi      bb_zscore 0.891
       atr_pct     ret_std_20 0.855
           rsi    ret_mean_20 0.803
       di_diff    ret_mean_20 0.802
      bb_pct_b        di_diff 0.792
     bb_zscore        di_diff 0.792
    ret_std_10     ret_std_20 0.789
       atr_pct     ret_std_10 0.789
      bb_width     ret_std_20 0.788
     ret_std_5     ret_std_10 0.785
macd_hist_norm    ret_mean_10 0.782
           rsi        stoch_k 0.772
macd_hist_norm        stoch_k 0.755
      bb_pct_b macd_hist_norm 0.753
     bb_zscore macd_hist_norm 0.753
     bb_zscore    ret_mean_10 0.753
      bb_pct_b    ret_mean_10 0.753
      bb_width        atr_pct 0.743
      bb_width     ret_std_10 0.731
           rsi    ret_me

## §10 — Return autocorrelation (PACF)

In [12]:
from statsmodels.graphics.tsaplots import plot_pacf, plot_acf
from statsmodels.stats.diagnostic import acorr_ljungbox

ret_series = np.log(df["close"] / df["close"].shift(1)).dropna()

fig, axes = plt.subplots(1, 2, figsize=(14, 4))
plot_acf(ret_series,  ax=axes[0], lags=48, alpha=0.05, title="ACF — 1h log-returns (48 lags)")
plot_pacf(ret_series, ax=axes[1], lags=48, alpha=0.05, title="PACF — 1h log-returns (48 lags)",
          method="ywm")
plt.tight_layout()
plt.show()

# Ljung-Box test: any autocorrelation up to lags 5, 10, and 20?
lb = acorr_ljungbox(ret_series, lags=[5, 10, 20], return_df=True)
print("Ljung-Box test for autocorrelation in 1h log-returns:")
print(lb.to_string(float_format="{:.4f}".format))
print("\n(p > 0.05 → fail to reject H₀: no autocorrelation)")

# Also check squared returns (volatility clustering)
lb_sq = acorr_ljungbox(ret_series**2, lags=[5, 10, 20], return_df=True)
print("\nLjung-Box test on SQUARED returns (volatility clustering):")
print(lb_sq.to_string(float_format="{:.4f}".format))

Ljung-Box test for autocorrelation in 1h log-returns:
    lb_stat  lb_pvalue
5    4.1149     0.5330
10  10.4837     0.3991
20  22.1987     0.3298

(p > 0.05 → fail to reject H₀: no autocorrelation)

Ljung-Box test on SQUARED returns (volatility clustering):
     lb_stat  lb_pvalue
5   921.7477     0.0000
10 1208.9330     0.0000
20 1580.7245     0.0000


## §11 — Conclusions (Finding F7)

### IC results summary

**At 1h, 1-bar-ahead (n = 8,763 bars):**

| Feature | IC | Interpretation |
|---|---|---|
| `bar_ret` / `ret_lag1` | **−0.081** | Strongest signal — mean reversion dominates |
| `stoch_k` | −0.063 | Overbought → next bar down |
| `bb_pct_b` / `bb_zscore` | −0.049 | Position in band → mean reversion |
| `rsi` | −0.045 | Classic overbought/oversold |
| `upper_wick` | +0.037 | Large upper shadow → next bar up (surprise) |
| `atr_pct`, `bb_width` | +0.026, +0.023 | Volatility level: weak positive rank relation with the next-bar return |
| `volume_ratio` | +0.001 | **No signal** — volume alone is noise |

**15 / 34 features have |IC| > 0.02.** Mean |IC| = 0.024 — weak but non-zero signal exists.

**The six highest-|IC| features are all negative-signed** (`upper_wick` at +0.037 is the first positive one) → the dominant edge at 1h is mean reversion,
consistent with the Bollinger strategy experiments (F1–F6).

### IC by timeframe

| Timeframe | Bars | Mean \|IC\| | Max \|IC\| |
|---|---|---|---|
| 5m | 26,209 | 0.015 | 0.037 |
| 15m | 8,737 | 0.020 | 0.056 |
| 1h | 8,785 | 0.024 | 0.081 |
| 4h | 2,197 | 0.024 | 0.074 |
| **1d** | **367** | **0.041** | **0.165** |

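**Mean |IC| at 1d is about 70% higher than at 1h** (0.041 vs 0.024). The `upper_wick` feature at 1d reaches |IC| = 0.165,
the best single-feature signal found, though it rests on only 367 daily bars and is therefore the noisiest estimate in the table. **Start modelling at the daily timeframe.**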
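
Because the timeframes differ greatly in sample size, it can help to put each IC on a common scale. The sketch below converts the maximum |IC| values from the table into approximate t-statistics with the usual correlation formula `t = IC * sqrt((n - 2) / (1 - IC**2))`. It uses the raw bar counts as `n` (slightly more than the usable rows after warm-up) and ignores that each value is the maximum over 34 features, so the numbers are rough upper bounds. `ic_t_stat` and `MAX_IC_BY_TF` are hypothetical names.

```python
import math


def ic_t_stat(ic: float, n: int) -> float:
    """Approximate t-statistic of a correlation coefficient under H0: IC = 0."""
    return ic * math.sqrt((n - 2) / (1.0 - ic * ic))


# (bars, max |IC|) copied from the timeframe table
MAX_IC_BY_TF = {
    "5m":  (26_209, 0.037),
    "15m": (8_737, 0.056),
    "1h":  (8_785, 0.081),
    "4h":  (2_197, 0.074),
    "1d":  (367, 0.165),
}

t_stats = {tf: ic_t_stat(ic, n) for tf, (n, ic) in MAX_IC_BY_TF.items()}

for tf, (n, ic) in MAX_IC_BY_TF.items():
    se = 1.0 / math.sqrt(n - 1)   # approximate standard error of IC under H0
    print(f"{tf:>3}  n={n:>6,}  max|IC|={ic:.3f}  SE~{se:.4f}  t~{t_stats[tf]:.2f}")
```

On these inputs the 1h value has a larger t-statistic than the 1d value despite its smaller IC, because the daily sample is roughly 24 times smaller.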

### IC by horizon: flat

Mean |IC| stays at 0.022–0.025 across horizons 1–48 bars. No meaningful
decay or improvement — the noise floor is hit immediately at 1h resolution.

### Feature redundancy (27 pairs with |r| > 0.7)

Key duplicates to remove before modelling:
- `ret_lag1` = `bar_ret` (|r| = 1.0 — identical)
- `bb_pct_b` = `bb_zscore` (|r| = 1.0 — identical after scaling)
- `rsi` ≈ `stoch_k` ≈ `bb_pct_b` (oscillator cluster, pick one)
- `atr_pct` ≈ `bb_width` ≈ `ret_std_20` (volatility cluster, pick one or two)
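
One possible way to turn the list above into a rule is greedy pruning: walk the features in descending |IC| order and keep a feature only if its |r| with every already-kept feature is at or below a threshold. The sketch below is an illustration, not the procedure used in this notebook. `prune_redundant` is a hypothetical helper, the inputs are a small subset of the values printed in §4 and §9, and any pair not listed is assumed to fall below the threshold.

```python
from typing import Dict, FrozenSet, List


def prune_redundant(abs_ic: Dict[str, float],
                    abs_corr: Dict[FrozenSet[str], float],
                    threshold: float = 0.7) -> List[str]:
    """Greedy selection: strongest |IC| first, skip anything too correlated with a kept feature."""
    kept: List[str] = []
    for feat in sorted(abs_ic, key=abs_ic.get, reverse=True):
        # Pairs missing from abs_corr are treated as uncorrelated (assumption).
        if all(abs_corr.get(frozenset((feat, k)), 0.0) <= threshold for k in kept):
            kept.append(feat)
    return kept


# Subset of the |IC| values from §4
abs_ic = {
    "bar_ret": 0.08120, "ret_lag1": 0.08120, "stoch_k": 0.06316,
    "bb_pct_b": 0.04904, "bb_zscore": 0.04904, "rsi": 0.04470,
}

# Subset of the |r| pairs from §9
abs_corr = {
    frozenset(("ret_lag1", "bar_ret")): 1.000,
    frozenset(("bb_pct_b", "bb_zscore")): 1.000,
    frozenset(("bb_zscore", "stoch_k")): 0.897,
    frozenset(("bb_pct_b", "stoch_k")): 0.897,
    frozenset(("rsi", "bb_pct_b")): 0.891,
    frozenset(("rsi", "bb_zscore")): 0.891,
    frozenset(("rsi", "stoch_k")): 0.772,
}

selected = prune_redundant(abs_ic, abs_corr)
print(selected)
```

On this subset the rule keeps one return feature and one oscillator. A threshold of 0.7 is strict; raising it keeps more of each cluster at the cost of more collinearity.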

### Autocorrelation

- **Raw returns:** Ljung-Box p > 0.05 at lags 5/10/20 → no detectable linear autocorrelation (consistent with weak-form efficiency, though a failure to reject is not proof of it)
- **Squared returns:** Ljung-Box p ≈ 0.000 → **strong volatility clustering** (GARCH effect)
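
For reference, the Ljung-Box statistic is `Q = n(n+2) * sum_{k=1..h} rho_k**2 / (n - k)`, compared against a chi-squared distribution with `h` degrees of freedom. The sketch below recomputes Q with NumPy on two synthetic series (white noise and an AR(1) process) purely to illustrate the formula. `ljung_box_q` is a hypothetical helper and not a replacement for `acorr_ljungbox`; the 5% critical values are standard chi-squared table entries.

```python
import numpy as np

# 5% critical values of the chi-squared distribution, keyed by degrees of freedom
CHI2_CRIT_5PCT = {5: 11.070, 10: 18.307, 20: 31.410}


def ljung_box_q(x, max_lag: int) -> float:
    """Ljung-Box Q statistic over lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    denom = np.dot(x, x)
    total = 0.0
    for k in range(1, max_lag + 1):
        rho_k = np.dot(x[k:], x[:-k]) / denom   # sample autocorrelation at lag k
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total


# Synthetic data: white noise vs. an AR(1) series with coefficient 0.5
rng = np.random.default_rng(1)
white = rng.normal(size=5_000)
ar1 = np.empty(5_000)
ar1[0] = white[0]
for t in range(1, 5_000):
    ar1[t] = 0.5 * ar1[t - 1] + white[t]

q_white = ljung_box_q(white, 10)
q_ar1 = ljung_box_q(ar1, 10)
print(f"white noise Q(10) = {q_white:.2f}   AR(1) Q(10) = {q_ar1:.2f}   "
      f"5% critical value = {CHI2_CRIT_5PCT[10]}")
```

Read against these critical values, the raw-return statistics in §10 (4.11, 10.48, 22.20) all fall below the threshold, while the squared-return statistics exceed it by a wide margin.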

### Intraday seasonality

Only UTC hours 17 (t=+2.07) and 19 (t=−2.00) are marginally significant.
Effect size is tiny (~0.06% per bar) — not tradeable as a standalone signal.
Drop `hour_sin/cos` from the first model iteration.
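
The "marginally significant" reading can be checked against a multiple-comparison correction: 24 hours are tested at once, so some |t| > 2 results are expected by chance. The sketch below uses a normal approximation to the t-distribution (reasonable with roughly 366 observations per hour) and a Bonferroni correction. The choice of Bonferroni is an assumption made here, not something the notebook applies.

```python
from statistics import NormalDist

N_TESTS = 24      # one test per UTC hour
ALPHA = 0.05      # family-wise significance level
norm = NormalDist()

# Probability that a single hour shows |t| > 2 under H0 (normal approximation)
p_single = 2 * (1 - norm.cdf(2.0))
expected_false_hits = N_TESTS * p_single

# Two-sided Bonferroni critical value for a family-wise 5% level
bonferroni_crit = norm.inv_cdf(1 - ALPHA / (2 * N_TESTS))

# t-statistics printed in §7
observed_t = {17: 2.07353, 19: -2.00147}
survivors = [hour for hour, t in observed_t.items() if abs(t) > bonferroni_crit]

print(f"expected hours with |t| > 2 by chance: {expected_false_hits:.2f}")
print(f"Bonferroni critical |t|: {bonferroni_crit:.2f}")
print(f"hours surviving correction: {survivors}")
```

About one hour is expected to exceed |t| > 2 by chance alone, and neither observed hour clears the corrected threshold.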

### Recommended feature set for LightGBM (12 features)

```
Momentum (mean-reversion):  bar_ret, bb_zscore, rsi, macd_hist_norm
Volatility:                 atr_pct, bb_width
Bar structure:              upper_wick, lower_wick, hl_range
Volume:                     vol_log_chg
Directional:                di_diff, adx
```

### Next step: P-ML2 — LightGBM baseline model

- **Target:** 1d forward log-return (mean |IC| = 0.041 at 1d, the highest of the five timeframes)
- **Model:** LightGBM regressor on the 12-feature set above
- **Validation:** purged walk-forward (5 folds), embargo = 1 day
- **Baseline to beat:** persistence model (forecast = current bar return); its IC equals the `bar_ret` IC at the target timeframe, so it is not zero by definition
- **Success criterion:** OOS IC > 0.03 consistently across folds
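
A minimal sketch of how the purged walk-forward split might be generated. The function name, the expanding-window layout, and the interpretation of "purge" (drop the last `horizon` training rows, whose labels reach into the test block) and "embargo" (a further gap, in bars) are assumptions for illustration; P-ML2 may define them differently.

```python
from typing import Iterator, Tuple


def purged_walk_forward(n_samples: int, n_folds: int = 5,
                        horizon: int = 1, embargo: int = 1
                        ) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges with an expanding training window."""
    block = n_samples // (n_folds + 1)        # the first block is train-only
    for fold in range(1, n_folds + 1):
        test_start = fold * block
        # The last fold absorbs any remainder rows.
        test_stop = n_samples if fold == n_folds else test_start + block
        # Purge `horizon` rows and skip `embargo` more before the test block.
        train_stop = test_start - horizon - embargo
        if train_stop <= 0:
            raise ValueError("not enough samples before the first test block")
        yield range(0, train_stop), range(test_start, test_stop)


# 367 daily bars, 1-bar label horizon, 1-bar (1 day) embargo
splits = list(purged_walk_forward(n_samples=367))
for i, (train, test) in enumerate(splits, start=1):
    print(f"fold {i}: train [0, {train.stop})  test [{test.start}, {test.stop})")
```

With only 367 daily bars each test block holds about 61 rows, so per-fold IC estimates will be noisy; the "consistently across folds" criterion should be read with that in mind.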
