# Episode Feature Ablation (PPO) — what really carries signal?

**Idea:** train the *same PPO agent* on the same episode-transition environment, but with different observation feature sets, and compare out-of-sample performance.

Features we ablate: `price/log_price`, `vol`, `dd`, `cumret`, `similarity`, and `suggestedAction`-derived signals.

**Artifacts**
- Ablation table: PPO performance for multiple observation sets
- Sanity: are we just learning to copy `suggestedAction`? (action match rate vs PnL)


## Setup
This notebook requires `pandas`, `numpy`, and `plotly`, the RL dependencies `gymnasium` and `stable-baselines3`, and the `aipricepatterns` client SDK (imported in the first cell).

If missing, install (in your venv):
```bash
cd python-sdk
# run with your virtualenv activated so `python` resolves to the venv interpreter
python -m pip install -e .
python -m pip install pandas numpy plotly gymnasium stable-baselines3
```


In [1]:
import os
import json
import gzip
import time
from dataclasses import dataclass
from pathlib import Path
from datetime import datetime, timezone

import numpy as np
import pandas as pd
import plotly.express as px

from aipricepatterns import Client

try:
    import gymnasium as gym
    from gymnasium import spaces
except ImportError as e:
    raise ImportError("gymnasium is required. Install: pip install gymnasium") from e

try:
    from stable_baselines3 import PPO
    from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
except ImportError as e:
    raise ImportError("stable-baselines3 is required. Install: pip install stable-baselines3") from e

pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 200)


## Parameters
Defaults align with the earlier research notebooks (BTCUSDT 1h, 300 anchors).


In [2]:
BASE_URL = os.getenv("AIPP_BASE_URL", "https://aipricepatterns.com/api/rust")
API_KEY = os.getenv("AIPP_API_KEY")

SYMBOL = os.getenv("AIPP_RL_SYMBOL", "BTCUSDT")
INTERVAL = os.getenv("AIPP_RL_INTERVAL", "1h")

ANCHOR_POINTS = int(os.getenv("AIPP_SWEEP_ANCHORS", "300"))
LOOKBACK_DAYS = int(os.getenv("AIPP_SWEEP_LOOKBACK_DAYS", "120"))
FORECAST_HORIZON = int(os.getenv("AIPP_RL_HORIZON", "24"))
EPISODES_PER_ANCHOR = int(os.getenv("AIPP_SWEEP_EPISODES_PER_ANCHOR", "30"))
MIN_SIMILARITY = float(os.getenv("AIPP_RL_MIN_SIMILARITY", "0.70"))
SAMPLING_STRATEGY = os.getenv("AIPP_RL_SAMPLING_STRATEGY", "uniform")

# Cost model (applied per step when position != 0)
FEE_PCT = float(os.getenv("AIPP_FEE_PCT", "0.00"))
SLIP_PCT = float(os.getenv("AIPP_SLIP_PCT", "0.00"))
ROUND_TRIP = bool(int(os.getenv("AIPP_ROUND_TRIP", "0")))

# RL training budget (keep small by default for notebook ergonomics)
TRAIN_TIMESTEPS = int(os.getenv("AIPP_PPO_TIMESTEPS", "20000"))
N_ENVS = int(os.getenv("AIPP_PPO_ENVS", "8"))
SEED = int(os.getenv("AIPP_SEED", "7"))

# Cache
NOTEBOOK_DIR = Path.cwd()
CACHE_DIR = Path(os.getenv("AIPP_RESEARCH_CACHE_DIR", str(NOTEBOOK_DIR / "_cache")))
CACHE_DIR.mkdir(parents=True, exist_ok=True)
CACHE_PATH = CACHE_DIR / f"08_ablation_eps_{SYMBOL}_{INTERVAL}_{ANCHOR_POINTS}.json.gz"

print("Base URL:", BASE_URL)
print(f"{SYMBOL} {INTERVAL} anchors={ANCHOR_POINTS} lookbackDays={LOOKBACK_DAYS}")
print(f"episodes/anchor={EPISODES_PER_ANCHOR} minSim={MIN_SIMILARITY} horizon={FORECAST_HORIZON}")
print(f"feePct={FEE_PCT} slipPct={SLIP_PCT} roundTrip={ROUND_TRIP}")
print(f"TRAIN_TIMESTEPS={TRAIN_TIMESTEPS} N_ENVS={N_ENVS} seed={SEED}")
print("cache:", str(CACHE_PATH))


Base URL: https://aipricepatterns.com/api/rust
BTCUSDT 1h anchors=300 lookbackDays=120
episodes/anchor=30 minSim=0.7 horizon=24
feePct=0.0 slipPct=0.0 roundTrip=False
TRAIN_TIMESTEPS=20000 N_ENVS=8 seed=7
cache: /Users/serg/projects/prod/ai_patterns/python-sdk/research/_cache/08_ablation_eps_BTCUSDT_1h_300.json.gz


## Fetch & cache episodes (train/test split by time)
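We collect episodes across many anchors, sort them by `anchorTs`, and use the first 80% of episodes as train and the last 20% as test. The cut is made on the episode list rather than on anchor boundaries, so episodes from the anchor at the cut can land on both sides of the split.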
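
To keep every anchor entirely on one side of the split, the cut can be made on the sorted list of unique anchors instead of on the episode list. The following is a minimal sketch, assuming episodes are dicts with an `anchorTs` key as in this notebook; `split_by_anchor` is a hypothetical helper, not part of the SDK.

```python
def split_by_anchor(episodes, train_frac=0.8):
    """Split episodes so that each anchor falls entirely in train or in test."""
    # Unique anchor timestamps in time order
    anchors = sorted({ep["anchorTs"] for ep in episodes})
    if len(anchors) < 2:
        raise ValueError("need at least two distinct anchors to split")
    # Cut the anchor list (not the episode list); keep at least one anchor per side
    cut = min(max(1, int(train_frac * len(anchors))), len(anchors) - 1)
    boundary = anchors[cut]  # first anchor assigned to test
    train = [ep for ep in episodes if ep["anchorTs"] < boundary]
    test = [ep for ep in episodes if ep["anchorTs"] >= boundary]
    return train, test


# Toy data: 5 anchors with 3 episodes each
toy = [{"anchorTs": 1000 * a, "id": (a, k)} for a in range(5) for k in range(3)]
train_toy, test_toy = split_by_anchor(toy)
print(len(train_toy), len(test_toy))
```

This only removes the shared boundary anchor; it does not address any other overlap between the windows behind neighbouring anchors.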


In [3]:
def _safe_float(x, default=np.nan):
    try:
        return float(x)
    except Exception:
        return float(default)

def _map_suggested_action_to_pos(x) -> int:
    if x is None:
        return 0
    if isinstance(x, (int, float)):
        if isinstance(x, float) and not np.isfinite(x):
            return 0  # NaN/inf cannot be cast to int; treat as HOLD
        v = int(x)
        if v in (-1, 0, 1):
            return v
        # 0/1/2 encoding (HOLD/LONG/SHORT): only 2 can still reach this point
        return -1 if v == 2 else 0
    s = str(x).strip().lower()
    if s in ("hold", "flat", "none", "neutral", "wait"):
        return 0
    if s in ("long", "buy", "bull", "up"):
        return 1
    if s in ("short", "sell", "bear", "down"):
        return -1
    return 0

def load_or_fetch_episodes() -> list[dict]:
    if CACHE_PATH.exists():
        with gzip.open(CACHE_PATH, "rt", encoding="utf-8") as f:
            data = json.load(f)
        print("loaded cache episodes:", len(data))
        return data

    client = Client(base_url=BASE_URL, api_key=API_KEY)
    now_ms = int(time.time() * 1000)
    start_ms = now_ms - LOOKBACK_DAYS * 24 * 60 * 60 * 1000
    anchors = np.linspace(start_ms, now_ms, num=ANCHOR_POINTS, dtype=np.int64).tolist()
    out = []
    for i, anchor_ts in enumerate(anchors, start=1):
        res = client.get_rl_episodes(
            symbol=SYMBOL,
            interval=INTERVAL,
            anchor_ts=int(anchor_ts),
            forecast_horizon=FORECAST_HORIZON,
            num_episodes=EPISODES_PER_ANCHOR,
            min_similarity=MIN_SIMILARITY,
            include_actions=True,
            reward_type="returns",
            sampling_strategy=SAMPLING_STRATEGY,
        )
        eps = res.get("episodes") if isinstance(res, dict) else None
        if isinstance(eps, list):
            for ep in eps:
                ts = ep.get("transitions")
                if not isinstance(ts, list) or len(ts) < 2:
                    continue
                out.append({
                    "anchorTs": int(anchor_ts),
                    "similarity": _safe_float(ep.get("similarity"), np.nan),
                    "transitions": ts,
                })
        if i % 25 == 0:
            print(f"{i}/{len(anchors)} anchors, episodes={len(out)}")
        time.sleep(0.02)

    out = [e for e in out if np.isfinite(e.get("similarity", np.nan))]
    out.sort(key=lambda e: e["anchorTs"])
    with gzip.open(CACHE_PATH, "wt", encoding="utf-8") as f:
        json.dump(out, f)
    print("wrote cache episodes:", len(out))
    return out

episodes = load_or_fetch_episodes()
len(episodes), episodes[0].keys()


25/300 anchors, episodes=750
50/300 anchors, episodes=1500
75/300 anchors, episodes=2250
100/300 anchors, episodes=3000
125/300 anchors, episodes=3750
150/300 anchors, episodes=4500
175/300 anchors, episodes=5250
200/300 anchors, episodes=6000
225/300 anchors, episodes=6750
250/300 anchors, episodes=7500
275/300 anchors, episodes=8250
300/300 anchors, episodes=8994
wrote cache episodes: 8994


(8994, dict_keys(['anchorTs', 'similarity', 'transitions']))

In [4]:
episodes = sorted(episodes, key=lambda e: e["anchorTs"])
cut = int(0.8 * len(episodes))
train_eps = episodes[:cut]
test_eps = episodes[cut:]
print("episodes:", len(episodes), "train:", len(train_eps), "test:", len(test_eps))
if episodes:
    print("train range:", datetime.fromtimestamp(train_eps[0]["anchorTs"]/1000, tz=timezone.utc), "→", datetime.fromtimestamp(train_eps[-1]["anchorTs"]/1000, tz=timezone.utc))
    print("test  range:", datetime.fromtimestamp(test_eps[0]["anchorTs"]/1000, tz=timezone.utc), "→", datetime.fromtimestamp(test_eps[-1]["anchorTs"]/1000, tz=timezone.utc))


episodes: 8994 train: 7195 test: 1799
train range: 2025-08-21 17:56:41.390000+00:00 → 2025-11-25 16:01:06.272000+00:00
test  range: 2025-11-25 16:01:06.272000+00:00 → 2025-12-19 17:56:41.390000+00:00


## Environment: transitions → Gymnasium
We treat each episode as a small trajectory of returns.
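- Action space: `Discrete(3)` with 0=HOLD (flat), 1=LONG, 2=SHORT
- Reward per step: `position * ret - |position| * cost`, where `cost = fee + slippage` (as a fraction), doubled when `ROUND_TRIP` is set. The cost is charged on every step spent in a position, not per trade.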
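
As a standalone illustration of the reward rule, the sketch below reproduces the per-step arithmetic outside the environment. `step_reward` is a hypothetical helper name; the percent-to-fraction conversion and the round-trip doubling follow the cost parameters defined earlier in this notebook.

```python
def step_reward(action: int, ret: float, fee_pct: float = 0.0,
                slip_pct: float = 0.0, round_trip: bool = False) -> float:
    """Per-step reward: position * ret minus a cost charged while in position."""
    # Same action coding as the environment: 0=HOLD, 1=LONG, 2=SHORT
    position = {0: 0, 1: 1, 2: -1}[action]
    # Fees and slippage are given in percent and converted to a fraction
    cost = (fee_pct + slip_pct) / 100.0
    if round_trip:
        cost *= 2.0
    return position * ret - abs(position) * cost


print(step_reward(1, 0.01))                # long, no costs
print(step_reward(2, -0.02, fee_pct=0.1))  # short on a falling step, 0.1% fee
print(step_reward(0, 0.05, fee_pct=0.1))   # flat: no PnL and no cost
```

Unlike the environment, which treats any action other than 0 or 1 as SHORT, this sketch raises a `KeyError` on an unknown action.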

Observation is feature-set dependent.
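
Because the observation at step `t` is read at index `t` of each feature array, any feature built from returns up to and including `ret[t]` exposes the return that the step reward is computed from. The sketch below builds lagged variants of `cumret`, `dd`, and `vol` that use only returns strictly before `t`. The name `lagged_features` and the choice to start the running peak at 0 are assumptions made for illustration; they are not part of the notebook.

```python
import math


def lagged_features(rets, window=10):
    """Features at index t built only from rets[:t], i.e. excluding rets[t]."""
    n = len(rets)
    cumret = [0.0] * n
    dd = [0.0] * n
    vol = [0.0] * n
    running = 0.0  # sum of returns strictly before t
    peak = 0.0     # running maximum of that sum (assumed to start at 0)
    for t in range(n):
        cumret[t] = running
        peak = max(peak, running)
        dd[t] = running - peak
        past = rets[max(0, t - window):t]  # up to `window` returns before t
        if past:
            mean = sum(past) / len(past)
            # population standard deviation, matching np.std's default
            vol[t] = math.sqrt(sum((x - mean) ** 2 for x in past) / len(past))
        running += rets[t]
    return {"cumret": cumret, "dd": dd, "vol": vol}


# Changing the last return must not change any feature value
a = lagged_features([0.1, -0.2, 0.05])
b = lagged_features([0.1, -0.2, 9.9])
print(a == b)
```

The final comparison is the property of interest: the features at the last index are identical even though the last return differs, so they carry no information about the reward of the current step.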


In [5]:
@dataclass
class ObsConfig:
    name: str
    features: list[str]

FEATURE_SETS: list[ObsConfig] = [
    ObsConfig("ret_only", ["ret"]),
    ObsConfig("price_log", ["price", "log_price"]),
    ObsConfig("price_log_vol", ["price", "log_price", "vol"]),
    ObsConfig("price_log_vol_dd_cumret", ["price", "log_price", "vol", "dd", "cumret"]),
    ObsConfig("similarity_only", ["similarity"]),
    ObsConfig("price_log_vol_dd_cumret_similarity", ["price", "log_price", "vol", "dd", "cumret", "similarity"]),
    ObsConfig("suggested_only", ["suggested_pos"]),
    ObsConfig("price_plus_suggested", ["price", "log_price", "vol", "dd", "cumret", "suggested_pos"]),
    ObsConfig("all", ["price", "log_price", "vol", "dd", "cumret", "similarity", "suggested_pos"]),
]

def build_episode_arrays(ep: dict):
    ts = ep["transitions"]
    rets = []
    sugg = []
    for t in ts:
        if not isinstance(t, dict):
            continue
        rets.append(_safe_float(t.get("ret", t.get("return", 0.0)), 0.0))
        sugg.append(_map_suggested_action_to_pos(t.get("suggestedAction")))
    rets = np.asarray(rets, dtype=np.float32)
    sugg = np.asarray(sugg, dtype=np.int8)
    n = len(rets)
    # synthetic price from returns if no explicit price in transitions
    price = np.ones(n, dtype=np.float32)
    for i in range(1, n):
        price[i] = max(1e-6, price[i-1] * (1.0 + rets[i-1]))
    log_price = np.log(price)
    cumret = np.cumsum(rets).astype(np.float32)
    peak = np.maximum.accumulate(cumret)
    dd = (cumret - peak).astype(np.float32)
    # rolling vol of returns
    w = 10
    vol = np.zeros(n, dtype=np.float32)
    for i in range(n):
        a = max(0, i - w + 1)
        vol[i] = float(np.std(rets[a:i+1]))
    sim = float(ep.get("similarity", 0.0))
    similarity = np.full(n, sim, dtype=np.float32)
    return {
        "ret": rets,
        "price": price,
        "log_price": log_price.astype(np.float32),
        "vol": vol,
        "dd": dd,
        "cumret": cumret,
        "similarity": similarity,
        "suggested_pos": sugg.astype(np.float32),
    }

class EpisodeEnv(gym.Env):
    metadata = {"render_modes": []}
    def __init__(self, episodes: list[dict], obs_features: list[str], fee_pct: float, slip_pct: float, round_trip: bool, seed: int = 0):
        super().__init__()
        self.episodes = episodes
        self.obs_features = list(obs_features)
        self.fee = float(fee_pct) / 100.0
        self.slip = float(slip_pct) / 100.0
        self.round_trip = bool(round_trip)
        self.rng = np.random.default_rng(seed)
        self.action_space = spaces.Discrete(3)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(len(self.obs_features),), dtype=np.float32)
        self._cur = None
        self._t = 0
        self._pos = 0
        self._arr = None

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._cur = self.episodes[int(self.rng.integers(0, len(self.episodes)))]
        self._arr = build_episode_arrays(self._cur)
        self._t = 0
        self._pos = 0
        return self._obs(), {}

    def _obs(self):
        o = []
        for k in self.obs_features:
            v = self._arr[k][self._t]
            o.append(float(v))
        return np.asarray(o, dtype=np.float32)

    def step(self, action):
        # map action to position
        if int(action) == 0:
            self._pos = 0
        elif int(action) == 1:
            self._pos = 1
        else:
            self._pos = -1
        r = float(self._arr["ret"][self._t])
        pnl = float(self._pos) * r
        # simple per-step cost when in position
        cost = (self.fee + self.slip)
        if self.round_trip:
            cost *= 2.0
        pnl -= abs(self._pos) * cost
        self._t += 1
        terminated = self._t >= (len(self._arr["ret"]) - 1)
        obs = self._obs() if not terminated else np.zeros((len(self.obs_features),), dtype=np.float32)
        info = {"pos": int(self._pos), "suggested_pos": int(self._arr["suggested_pos"][min(self._t, len(self._arr["suggested_pos"]) - 1)])}
        return obs, float(pnl), terminated, False, info


## Training + Evaluation
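We train PPO on the train episodes, then evaluate it on a random sample of up to 300 held-out test episodes.
For sanity, we also compute how often the agent’s action matches `suggestedAction`. The suggested-action baseline draws its own sample with a different RNG seed, so `deltaVsSuggested` compares averages over two different episode samples.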
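
The match rate itself is a simple per-episode fraction. A minimal sketch, assuming positions and suggestions are both coded as -1/0/1; `action_match_rate` is a hypothetical helper name:

```python
def action_match_rate(positions, suggested):
    """Fraction of steps where the agent's position equals the suggested one."""
    if len(positions) != len(suggested):
        raise ValueError("positions and suggested must have the same length")
    if not positions:
        return 0.0  # an empty episode counts as a zero match rate
    hits = sum(1 for p, s in zip(positions, suggested) if p == s)
    return hits / len(positions)


agent = [1, 1, 0, -1]
suggested = [1, 0, 0, 1]
print(action_match_rate(agent, suggested))  # 2 of 4 steps match
```

As a reference point, a policy that picked uniformly at random among the three positions, independently of the suggestion, would match one third of the time in expectation.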


In [6]:
def eval_policy_on_episodes(model, episodes: list[dict], obs_features: list[str], n_eval: int = 300) -> dict:
    rng = np.random.default_rng(SEED + 123)
    n_eval = min(n_eval, len(episodes))
    picks = rng.choice(len(episodes), size=n_eval, replace=False)
    pnls = []
    match = []
    for idx in picks:
        ep = episodes[int(idx)]
        arr = build_episode_arrays(ep)
        pos = 0
        pnl = 0.0
        m = 0
        steps = 0
        for t in range(len(arr["ret"]) - 1):
            obs = np.asarray([float(arr[k][t]) for k in obs_features], dtype=np.float32)
            action, _ = model.predict(obs, deterministic=True)
            # predict may return a 1-element array; extract the scalar explicitly
            a = int(np.asarray(action).reshape(-1)[0])
            pos = 0 if a == 0 else (1 if a == 1 else -1)
            sugg = int(arr["suggested_pos"][t])
            if pos == sugg:
                m += 1
            steps += 1
            r = float(arr["ret"][t])
            pnl += float(pos) * r
            cost = (float(FEE_PCT) + float(SLIP_PCT)) / 100.0
            if ROUND_TRIP:
                cost *= 2.0
            pnl -= abs(pos) * cost
        pnls.append(float(pnl))
        match.append(m / max(1, steps))
    pnls = np.asarray(pnls, dtype=float)
    match = np.asarray(match, dtype=float)
    return {
        "n": int(len(pnls)),
        "avgPnL": float(np.mean(pnls)),
        "medPnL": float(np.median(pnls)),
        "winrate": float(np.mean(pnls > 0)),
        "p05": float(np.quantile(pnls, 0.05)),
        "p95": float(np.quantile(pnls, 0.95)),
        "actionMatch": float(np.mean(match)),
    }

def eval_suggested_baseline(episodes: list[dict], n_eval: int = 300) -> dict:
    rng = np.random.default_rng(SEED + 456)
    n_eval = min(n_eval, len(episodes))
    picks = rng.choice(len(episodes), size=n_eval, replace=False)
    pnls = []
    for idx in picks:
        ep = episodes[int(idx)]
        arr = build_episode_arrays(ep)
        pnl = 0.0
        for t in range(len(arr["ret"]) - 1):
            pos = int(arr["suggested_pos"][t])
            pnl += float(pos) * float(arr["ret"][t])
            cost = (float(FEE_PCT) + float(SLIP_PCT)) / 100.0
            if ROUND_TRIP:
                cost *= 2.0
            pnl -= abs(pos) * cost
        pnls.append(float(pnl))
    pnls = np.asarray(pnls, dtype=float)
    return {
        "n": int(len(pnls)),
        "avgPnL": float(np.mean(pnls)),
        "medPnL": float(np.median(pnls)),
        "winrate": float(np.mean(pnls > 0)),
        "p05": float(np.quantile(pnls, 0.05)),
        "p95": float(np.quantile(pnls, 0.95)),
    }

baseline = eval_suggested_baseline(test_eps, n_eval=min(300, len(test_eps)))
baseline


{'n': 300,
 'avgPnL': 5.986620332598687,
 'medPnL': 4.473500072956085,
 'winrate': 0.91,
 'p05': 0.0,
 'p95': 19.111634740233427}

In [7]:
results = []
for cfg in FEATURE_SETS:
    print("\n===", cfg.name, cfg.features)
    def make_env(seed_offset=0):
        return EpisodeEnv(train_eps, cfg.features, fee_pct=FEE_PCT, slip_pct=SLIP_PCT, round_trip=ROUND_TRIP, seed=SEED + seed_offset)
    vec = DummyVecEnv([lambda i=i: make_env(i) for i in range(N_ENVS)])
    vec = VecNormalize(vec, norm_obs=True, norm_reward=False, clip_obs=10.0)
    model = PPO("MlpPolicy", vec, verbose=0, seed=SEED, n_steps=256, batch_size=256)
    model.learn(total_timesteps=TRAIN_TIMESTEPS)

    # evaluation uses raw observations; wrap with the same VecNormalize stats
    # create a dummy vec env for normalization during predict
    eval_env = DummyVecEnv([lambda: EpisodeEnv(test_eps, cfg.features, fee_pct=FEE_PCT, slip_pct=SLIP_PCT, round_trip=ROUND_TRIP, seed=SEED + 999)])
    eval_env = VecNormalize(eval_env, norm_obs=True, norm_reward=False, clip_obs=10.0)
    eval_env.obs_rms = vec.obs_rms
    # trick: use eval_env to normalize obs manually in eval loop
    def predict_with_norm(obs: np.ndarray):
        o = obs.reshape((1, -1))
        o = eval_env.normalize_obs(o)
        act, _ = model.predict(o, deterministic=True)
        return int(act[0])

    class _Wrapper:
        def predict(self, obs, deterministic=True):
            return np.asarray([predict_with_norm(obs)]), None
    wrapped = _Wrapper()

    metrics = eval_policy_on_episodes(wrapped, test_eps, cfg.features, n_eval=min(300, len(test_eps)))
    metrics.update({
        "featureSet": cfg.name,
        "nFeatures": len(cfg.features),
        "trainTimesteps": TRAIN_TIMESTEPS,
        "baselineSuggestedAvgPnL": baseline["avgPnL"],
        "deltaVsSuggested": metrics["avgPnL"] - baseline["avgPnL"],
    })
    results.append(metrics)

ablation = pd.DataFrame(results).sort_values(["avgPnL"], ascending=False).reset_index(drop=True)
ablation



=== ret_only ['ret']



=== price_log ['price', 'log_price']

=== price_log_vol ['price', 'log_price', 'vol']

=== price_log_vol_dd_cumret ['price', 'log_price', 'vol', 'dd', 'cumret']

=== similarity_only ['similarity']

=== price_log_vol_dd_cumret_similarity ['price', 'log_price', 'vol', 'dd', 'cumret', 'similarity']

=== suggested_only ['suggested_pos']

=== price_plus_suggested ['price', 'log_price', 'vol', 'dd', 'cumret', 'suggested_pos']

=== all ['price', 'log_price', 'vol', 'dd', 'cumret', 'similarity', 'suggested_pos']


| | n | avgPnL | medPnL | winrate | p05 | p95 | actionMatch | featureSet | nFeatures | trainTimesteps | baselineSuggestedAvgPnL | deltaVsSuggested |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300 | 9.530171 | 8.0429 | 1.0 | 3.151405 | 19.07553 | 0.249275 | ret_only | 1 | 20000 | 5.98662 | 3.543551 |
| 1 | 300 | 6.686934 | 4.8601 | 0.993333 | 0.620655 | 17.3858 | 0.333333 | price_plus_suggested | 6 | 20000 | 5.98662 | 0.700314 |
| 2 | 300 | 6.659272 | 5.16895 | 0.983333 | 0.51178 | 16.78618 | 0.261739 | all | 7 | 20000 | 5.98662 | 0.672651 |
| 3 | 300 | 6.217369 | 4.4637 | 0.95 | 0.023425 | 16.82542 | 0.249275 | suggested_only | 1 | 20000 | 5.98662 | 0.230749 |
| 4 | 300 | 5.127717 | 4.5887 | 1.0 | 1.21172 | 11.036045 | 0.208261 | price_log_vol_dd_cumret_similarity | 6 | 20000 | 5.98662 | -0.858904 |
| 5 | 300 | 4.767032 | 4.36625 | 0.983333 | 0.662625 | 10.650915 | 0.210725 | price_log_vol_dd_cumret | 5 | 20000 | 5.98662 | -1.219588 |
| 6 | 300 | 0.240605 | 0.46915 | 0.62 | -3.85449 | 3.424735 | 0.24913 | price_log_vol | 3 | 20000 | 5.98662 | -5.746015 |
| 7 | 300 | 0.140317 | 0.0 | 0.16 | -1.539205 | 2.694655 | 0.563623 | similarity_only | 1 | 20000 | 5.98662 | -5.846303 |
| 8 | 300 | 0.08917 | 0.1127 | 0.526667 | -4.68901 | 4.41437 | 0.127971 | price_log | 2 | 20000 | 5.98662 | -5.89745 |


In [8]:
fig = px.bar(ablation, x="featureSet", y="avgPnL", title="Ablation: PPO avgPnL on test episodes", text="nFeatures")
fig.update_layout(xaxis_tickangle=-35, height=520)
fig


In [9]:
fig = px.scatter(
    ablation,
    x="actionMatch",
    y="avgPnL",
    text="featureSet",
    title="Sanity: are we just copying suggestedAction? (match rate vs PnL)",
)
fig.update_traces(textposition="top center")
fig.update_layout(height=520)
fig


## Reading the results
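- If `suggested_only` wins *and* has a very high `actionMatch`, PPO is mostly learning to imitate `suggestedAction` rather than extracting independent signal.
- If `price_log_vol_dd_cumret_similarity` (or `all`) beats `suggested_only` with a *lower* `actionMatch`, that is evidence of incremental signal beyond imitation.
- If `similarity_only` is strong, the agent may be learning *gating* (when to trade) rather than directional trading.
- Treat the `ret_only` row with caution: the observation at step `t` contains `ret[t]`, the same return used to compute that step's reward, so its top rank (avgPnL 9.53, winrate 1.0) reflects lookahead and not predictive signal. `cumret`, `dd`, and `vol` are likewise computed from returns up to and including `ret[t]`.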

Next step (if you want): repeat with a small delay/cost stress (like notebook 07) and see which feature set remains robust.
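
One way such a stress could look is sketched below: the position chosen at step `t` only takes effect `delay` steps later, and a per-step cost is charged while in position. How notebook 07 implements its stress is not shown here, so the helper `pnl_with_delay` and its parameters are assumptions for illustration only.

```python
def pnl_with_delay(positions, rets, delay=1, cost_pct=0.0):
    """Episode PnL when the position decided at step t is executed at t + delay."""
    if delay < 0:
        raise ValueError("delay must be >= 0")
    cost = cost_pct / 100.0  # percent to fraction
    total = 0.0
    for t, r in enumerate(rets):
        # Stay flat until the first delayed decision becomes active
        pos = positions[t - delay] if t >= delay else 0
        total += pos * r - abs(pos) * cost
    return total


positions = [1, -1, 1]        # positions that happen to match the sign of each return
rets = [0.01, -0.02, 0.03]
print(pnl_with_delay(positions, rets, delay=0))
print(pnl_with_delay(positions, rets, delay=1))
```

In this toy case the same position sequence is profitable with no delay and loses money with a one-step delay, which is the kind of fragility the stress is meant to expose.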
