# Week 4: IPO Trading Policy Optimization — Implementation

This notebook documents **problem setup**, **implementation**, **validation**, and **next steps** for the IPO trading policy (REINFORCE + risk-adjusted objective).

## Problem Setup

- **Goal**: Maximize risk-adjusted fitness over IPO episodes via a policy \(\pi_\theta\) that chooses: participate/skip, entry day, hold days, position size.
- **Objective**: \(\mathrm{Score}_\theta = \mathbb{E}[R_\theta] - \lambda \cdot \mathrm{CVaR}_\alpha(R_\theta) - \kappa \cdot \mathbb{E}[C_\theta] - \mu \cdot \mathrm{MDD}_\theta\)
- **Data**: Synthetic episodes (or path to rich CSV / yfinance). Each episode has a price DataFrame with `date`, `close`.
- **Success metrics**: Objective score on train/val; CVaR and MDD; test cases passing.
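
To make the objective above concrete, here is a minimal NumPy sketch of the four terms on a toy return series. The helpers `cvar_loss`, `max_drawdown_frac`, and `toy_score` are hypothetical stand-ins written for illustration; the sign and tail conventions used by `src.metrics.cvar`, `src.metrics.max_drawdown`, and `src.objective.score` may differ.

```python
import numpy as np

def cvar_loss(returns, alpha=0.9):
    """Mean loss over the worst (1 - alpha) fraction of returns (positive = bad).

    Assumption: CVaR is expressed as a loss, so a larger value is penalized more.
    """
    losses = -np.asarray(returns, dtype=float)
    k = max(1, int(np.ceil((1.0 - alpha) * losses.size)))  # size of the tail
    return float(np.sort(losses)[-k:].mean())

def max_drawdown_frac(equity):
    """Largest peak-to-trough drop of an equity curve, as a fraction of the peak."""
    equity = np.asarray(equity, dtype=float)
    running_peak = np.maximum.accumulate(equity)
    return float(np.max(1.0 - equity / running_peak))

def toy_score(net_rets, costs, lam=1.0, alpha=0.9, kappa=1.0, mu=1.0):
    """Score = E[R] - lam * CVaR - kappa * E[C] - mu * MDD on per-episode values."""
    net_rets = np.asarray(net_rets, dtype=float)
    # Start the equity curve at 1.0 so a loss on the first episode counts as drawdown
    equity = np.concatenate([[1.0], np.cumprod(1.0 + net_rets)])
    return float(
        net_rets.mean()
        - lam * cvar_loss(net_rets, alpha)
        - kappa * float(np.mean(costs))
        - mu * max_drawdown_frac(equity)
    )

net_rets = [0.02, -0.01, 0.03, -0.04, 0.01, 0.00, 0.02, -0.02, 0.01, 0.01]
costs = [0.001] * len(net_rets)
print("toy score:", round(toy_score(net_rets, costs), 6))
```

With all weights at 1.0, the tail-loss and drawdown terms dominate the mean return on a series like this, which is why scores are often negative even when the average return is positive.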

---
## Implementation

This section covers the **required imports**, the **objective function**, the REINFORCE **optimization** loop, the key **parameters**, and basic **logging**.

In [None]:
# All required imports (run from project root)
import sys
from datetime import date, timedelta
from pathlib import Path

import numpy as np
import pandas as pd

# Make the project root importable so the `src` package resolves
root = Path(".").resolve()
if str(root) not in sys.path:
    sys.path.insert(0, str(root))

from src.data import Episode, generate_synthetic_prices
from src.backtest import backtest_all, backtest_all_with_decisions
from src.objective import score
from src.metrics import cvar, max_drawdown
from src.policy import PolicyParams, decide_trade
from src.features import episodes_to_tensor
from src.policy_network import IPOPolicyNetwork, sample_and_log_prob
from src.train_policy import train_reinforce

In [None]:
# Key parameters (course: hyperparameters, objective weights)
LAM = 1.0   # CVaR penalty
ALPHA = 0.9 # CVaR confidence level
KAPPA = 1.0 # Cost penalty
MU = 1.0    # MDD penalty
COST_BPS = 10.0
N_EPOCHS = 30
LR = 1e-3
BATCH_SIZE = 32
SEED = 0

In [None]:
# Objective function implementation (wraps src.objective.score)
def compute_score(results_df, equity, lam=LAM, alpha=ALPHA, kappa=KAPPA, mu=MU):
    """Score = E[R] - lam*CVaR - kappa*E[Cost] - mu*MDD."""
    sc, metrics = score(results_df, equity, lam=lam, alpha=alpha, kappa=kappa, mu=mu)
    return sc, metrics

In [None]:
# Build synthetic episodes for demonstration
def make_synthetic_episodes(n=80, N=10, seed=SEED):
    rng = np.random.default_rng(seed)
    base_date = date(2020, 1, 1)
    episodes = []
    for i in range(n):
        ticker = f"SYNTH{i:03d}"
        ipo_date = base_date + timedelta(days=i * 7)
        price_df = generate_synthetic_prices(
            ticker=ticker, ipo_date=ipo_date, N=N,
            initial_price=float(rng.uniform(10, 100)),
            volatility=float(rng.uniform(0.01, 0.05)), rng=rng,
        )
        ep = Episode(ticker=ticker, ipo_date=ipo_date, df=price_df, day0_index=0, N=N)
        episodes.append(ep)
    return episodes

episodes = make_synthetic_episodes(80, N=10)
print(f"Created {len(episodes)} synthetic episodes.")

In [None]:
# Rule-based baseline: fixed policy params (participate_threshold, entry_day, hold_k, raw_weight)
params_baseline = PolicyParams(participate_threshold=0.5, entry_day=0, hold_k=3, raw_weight=0.5)
results_df, equity = backtest_all(episodes, params_baseline, cost_bps=COST_BPS)
sc_baseline, metrics_baseline = compute_score(results_df, equity)
print("Baseline (rule) score:", round(sc_baseline, 6))
print("Metrics:", metrics_baseline)

In [None]:
# REINFORCE optimization (PyTorch)
n_val = max(1, int(len(episodes) * 0.2))
n_train = len(episodes) - n_val
perm = np.random.RandomState(SEED).permutation(len(episodes))
train_ep = [episodes[i] for i in perm[:n_train]]
val_ep = [episodes[i] for i in perm[n_train:]]

result = train_reinforce(
    train_ep, val_episodes=val_ep,
    n_epochs=N_EPOCHS, lr=LR, lr_schedule="constant",
    cost_bps=COST_BPS, lam=LAM, alpha=ALPHA, kappa=KAPPA, mu=MU,
    batch_size=min(BATCH_SIZE, n_train), seed=SEED,
    out_dir=Path("results"),
)
print("Final train score:", result["history"]["train_score"][-1])
if result["history"]["val_score"]:
    print("Final val score:", result["history"]["val_score"][-1])

---
## Validation

Test cases, edge cases, and a rough performance measurement. Resource monitoring here is limited to wall-clock timing.

In [None]:
# Test 1: Empty results -> score 0
empty_df = pd.DataFrame()
empty_equity = pd.Series(dtype=float)
sc_empty, m_empty = score(empty_df, empty_equity)
assert sc_empty == 0.0 and m_empty["score"] == 0.0, "Empty case should yield 0"
print("Test 1 (empty): PASS — score =", sc_empty)

In [None]:
# Test 2: CVaR and MDD — constant positive returns
constant_ret = 0.001
n_days = 20
equity_curve = np.cumprod([1.0] + [1 + constant_ret] * n_days)[1:]
mdd_val = max_drawdown(pd.Series(equity_curve))
cvar_val = cvar(np.full(10, constant_ret), alpha=0.9)
print("Test 2 (constant positive ret): MDD =", mdd_val, "CVaR(0.9) =", cvar_val)
assert abs(mdd_val) < 1e-12, "No drawdown for monotonically increasing equity"
print("Test 2: PASS")

In [None]:
# Test 3: Backtest "never participate" -> zero net_ret and cost (threshold=999 => no signal >= 999)
params_skip = PolicyParams(participate_threshold=999.0, entry_day=0, hold_k=1, raw_weight=0.0)
res_skip, eq_skip = backtest_all(episodes[:5], params_skip, cost_bps=COST_BPS)
assert (res_skip["net_ret"] == 0).all() and (res_skip["cost"] == 0).all()
print("Test 3 (never participate): PASS — net_ret and cost all zero")

In [None]:
# Performance: rough wall-clock timing for 2 epochs on 20 episodes (optional)
import time
tiny = make_synthetic_episodes(20, N=5)
t0 = time.perf_counter()
train_reinforce(tiny, val_episodes=tiny[:5], n_epochs=2, batch_size=10, seed=0)
elapsed = time.perf_counter() - t0
print(f"Rough timing: 2 epochs on 20 episodes ≈ {elapsed:.2f}s")

---
## Walk-Forward (Time-Based) Validation

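Split episodes by `ipo_date` (earliest 80% → train, latest 20% → test) to simulate out-of-sample evaluation. Because the policy never trains on an episode that starts after any test episode, this split avoids the look-ahead leakage that a random split allows, which makes it the more appropriate check for time-ordered data.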

In [None]:
# Walk-forward (time-based) validation
# Split by ipo_date: earliest 80% → train, latest 20% → out-of-sample test
# This avoids temporal leakage (no training on future episodes)

sorted_episodes = sorted(episodes, key=lambda ep: ep.ipo_date)
cutoff = int(len(sorted_episodes) * 0.8)
wf_train = sorted_episodes[:cutoff]
wf_test  = sorted_episodes[cutoff:]

print(f'Walk-forward: {len(wf_train)} train episodes, {len(wf_test)} test episodes')
print(f'  Train date range: {wf_train[0].ipo_date} → {wf_train[-1].ipo_date}')
print(f'  Test  date range: {wf_test[0].ipo_date}  → {wf_test[-1].ipo_date}')

# Train policy on the earlier cohort, evaluate on later cohort (OOS)
wf_result = train_reinforce(
    wf_train, val_episodes=wf_test,
    n_epochs=N_EPOCHS, lr=LR, lr_schedule='constant',
    cost_bps=COST_BPS, lam=LAM, alpha=ALPHA, kappa=KAPPA, mu=MU,
    batch_size=min(BATCH_SIZE, len(wf_train)), seed=SEED,
)

wf_train_score = wf_result['history']['train_score'][-1]
wf_oos_score   = wf_result['history']['val_score'][-1] if wf_result['history']['val_score'] else None

print(f'\nWalk-forward OOS score  : {wf_oos_score:.6f}' if wf_oos_score is not None else 'No OOS score')
print(f'Walk-forward Train score: {wf_train_score:.6f}')

# Compare vs. always-participate baseline on the test set
params_always = PolicyParams(participate_threshold=0.0, entry_day=0, hold_k=3, raw_weight=0.5)
res_always, eq_always = backtest_all(wf_test, params_always, cost_bps=COST_BPS)
sc_always, _ = compute_score(res_always, eq_always)  # same objective weights as the trained policy
print(f'Always-participate (OOS): {sc_always:.6f}')
if wf_oos_score is not None:
    print(f'OOS gain vs always-participate: {(wf_oos_score - sc_always):.6f}')

---
## Real-Data Pipeline

The full pipeline (`run_pytorch.py`) also runs on real market data. The notebook uses synthetic data for reproducibility and speed. To run on S&P 500 prices fetched from Yahoo Finance, use:

```bash
# From the project root:
python run_pytorch.py --data yfinance --max_tickers 50 --n_epochs 20 --N 10
```

This fetches up to 50 S&P 500 tickers, builds 10-day rolling episodes, and trains the policy with REINFORCE. Expected output format (exact numbers will vary from run to run):

```
Fetching S&P 500 constituent list...
Fetching prices from Yahoo Finance for 50 S&P 500 tickers...
Fetched 48 tickers
Train 384 episodes, val 96 episodes
Epoch 1/20  loss=0.002341  train_score=-0.003412  val_score=-0.002876
...
Epoch 20/20 loss=0.001644  train_score= 0.000218  val_score=-0.000891
Best epoch (by val score): 15  |  Best val score: -0.000712
```

Scores near zero are expected for short windows with 10 bps costs. A negative val score alongside a positive or improving train score suggests the policy is fitting the training episodes, but it does not by itself distinguish overfitting from noise in a small validation set.

---
## Documentation

**Key design decisions:**
- **Objective** (`src/objective.score`): `Score = E[R] - λ·CVaR - κ·Cost - μ·MDD`. No β·Sharpe in the current code (listed as next step). Default λ=κ=μ=1.0 puts risk and return on the same decimal scale; α=0.9 uses the worst 10% tail for CVaR.
- **REINFORCE** (`src/train_policy.py`): Non-differentiable backtest → policy gradient. Mean-reward baseline reduces variance; gradient clipping (max norm 1.0) prevents large steps. Adam optimizer chosen for adaptive learning rate.
- **Backtest** (`src/backtest.py`): `net_ret = weight × excess_ret − cost`; equity curve is cumulative product of (1 + net_ret). Backtest is pure NumPy/pandas, decoupled from PyTorch.
- **Features** (`src/features.py`): 20-dim vector: 8 price/volume base features + 12 meta features (offer price, shares, sector hash, CEO info). Missing meta filled with zeros.
- **Validation split**: Random 80/20 in default runs; time-based (ipo_date sort) in the walk-forward section above.
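
The backtest arithmetic listed above can be sketched in a few lines of NumPy. This is an illustration, not the code in `src/backtest.py`: the function name `toy_backtest` is hypothetical, and charging the cost once per trade in proportion to the absolute weight is an assumption about how `cost_bps` is applied.

```python
import numpy as np

def toy_backtest(excess_rets, weights, cost_bps=10.0):
    """Per-episode net returns and the resulting equity curve.

    net_ret = weight * excess_ret - cost, equity = cumprod(1 + net_ret).
    Assumption: cost = |weight| * cost_bps / 10_000, charged once per trade.
    """
    excess_rets = np.asarray(excess_rets, dtype=float)
    weights = np.asarray(weights, dtype=float)
    cost = np.abs(weights) * cost_bps / 1e4
    net_ret = weights * excess_rets - cost
    equity = np.cumprod(1.0 + net_ret)
    return net_ret, cost, equity

# Three episodes: a gain, a loss, and one that is skipped (weight 0)
net_ret, cost, equity = toy_backtest(
    excess_rets=[0.04, -0.02, 0.03],
    weights=[0.5, 0.5, 0.0],
)
print("net_ret:", net_ret)
print("cost:   ", cost)
print("equity: ", equity)
```

Under this cost assumption a skipped episode contributes zero return and zero cost, which matches the "never participate" test in the Validation section.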

**Known limitations:**
- Synthetic data only in this notebook; real data via `run_pytorch.py --data yfinance`.
- High REINFORCE variance with few episodes; entropy bonus (coef=0.01) encourages exploration.
- No walk-forward over multiple years yet (only one train/test split above).
- Sharpe not in objective; regime dependency not handled.
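
The variance issue and the mean-reward baseline mentioned above can be seen on a toy problem. The sketch below uses a single-logit Bernoulli participate/skip policy in NumPy rather than the `IPOPolicyNetwork` in PyTorch; the reward model (a noisy 5% return when participating, zero when skipping) and all function names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reinforce_grad_samples(theta, actions, rewards, use_baseline):
    """Per-sample REINFORCE terms (R - b) * d/dtheta log pi(a)."""
    p = sigmoid(theta)
    baseline = rewards.mean() if use_baseline else 0.0
    # For a Bernoulli policy with logit theta: d/dtheta log pi(a) = a - p
    return (rewards - baseline) * (actions - p)

def entropy_grad(theta):
    """d/dtheta of the Bernoulli entropy H(p) with p = sigmoid(theta)."""
    p = sigmoid(theta)
    return -theta * p * (1.0 - p)

rng = np.random.default_rng(0)
theta = 0.0  # participate with probability 0.5
n = 10_000
actions = (rng.random(n) < sigmoid(theta)).astype(float)
# Toy rewards: noisy positive return when participating, zero when skipping
rewards = actions * (0.05 + 0.02 * rng.standard_normal(n))

g_plain = reinforce_grad_samples(theta, actions, rewards, use_baseline=False)
g_base = reinforce_grad_samples(theta, actions, rewards, use_baseline=True)
print("variance without baseline:", g_plain.var())
print("variance with baseline:   ", g_base.var())

# One gradient-ascent step with an entropy bonus and clipping.
# For a single scalar parameter, max-norm clipping reduces to np.clip.
lr, entropy_coef = 1e-3, 0.01
grad = np.clip(g_base.mean() + entropy_coef * entropy_grad(theta), -1.0, 1.0)
theta_new = theta + lr * grad
```

Both estimators have nearly the same mean, but subtracting the batch-mean reward removes the part of the variance that comes from the reward offset, so fewer episodes are needed for a usable gradient estimate.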

**Debug / test strategies:**
- `pytest tests/test_basic.py` — objective, CVaR, MDD, backtest edge cases.
- Small `n_epochs=2, batch_size=10` for quick sanity checks.
- Check `result["history"]["train_score"]` and `val_score` lists for convergence.

**Next steps:**
- Walk-forward over real IPO cohorts by year (2021 train → 2022 test → …).
- Add β·Sharpe term to `src/objective.score()`.
- Connect notebook to `src/dailyhistorical_21-26.csv` for actual IPO episodes.
- Tune λ, κ, μ via grid search on held-out cohort.
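
As a starting point for the yearly walk-forward step, the sketch below generates expanding-window folds from episode dates using only the standard library. `ToyEpisode` is a hypothetical stand-in for `src.data.Episode` (only `ipo_date` is used), and training on all earlier years rather than only the previous year is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

@dataclass
class ToyEpisode:
    """Hypothetical stand-in for src.data.Episode; only ipo_date matters here."""
    ticker: str
    ipo_date: date

def yearly_walk_forward(episodes):
    """Yield (test_year, train, test), where train holds all earlier years."""
    by_year = defaultdict(list)
    for ep in episodes:
        by_year[ep.ipo_date.year].append(ep)
    years = sorted(by_year)
    for i in range(1, len(years)):
        train = [ep for y in years[:i] for ep in by_year[y]]
        test = by_year[years[i]]
        yield years[i], train, test

toy_episodes = [
    ToyEpisode("A", date(2021, 3, 1)),
    ToyEpisode("B", date(2021, 9, 15)),
    ToyEpisode("C", date(2022, 2, 10)),
    ToyEpisode("D", date(2022, 11, 1)),
    ToyEpisode("E", date(2023, 5, 20)),
]

for test_year, train, test in yearly_walk_forward(toy_episodes):
    print(f"test {test_year}: {len(train)} train episodes, {len(test)} test episodes")
```

Each `(train, test)` pair could then be passed to `train_reinforce` in the same way as the single split in the walk-forward section, with one OOS score recorded per test year.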