# Week 11 — Reinforcement Learning for Portfolio Management

**Key ideas:**
- Trading is a sequential decision problem — naturally fits the MDP framework
- RL agents learn policies that map market states to portfolio actions
- RL is hard in finance: non-stationarity, partial observability, delayed rewards
- Realistic expectation: RL learns risk management, not alpha generation

**Outline:**
1. Trading as an MDP
2. Key RL algorithms for finance
3. Why RL is hard in finance
4. FinRL architecture
5. Where RL works in industry
6. Demo: Custom gym environment for portfolio allocation

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gymnasium as gym
from gymnasium import spaces

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 5)

np.random.seed(42)

---
## 1. Trading as a Markov Decision Process (MDP)

An MDP is defined by $(S, A, P, R, \gamma)$:

| Component | Finance Interpretation |
|-----------|------------------------|
| **State** $s_t$ | Portfolio weights, asset prices, technical features, account balance |
| **Action** $a_t$ | Target portfolio weights (how much to allocate to each asset) |
| **Transition** $P(s_{t+1} \mid s_t, a_t)$ | Market dynamics (unknown, non-stationary) |
| **Reward** $r_t$ | Portfolio return, risk-adjusted return (Sharpe), or custom |
| **Discount** $\gamma$ | Time preference (typically close to 1 for daily trading) |

The agent's goal: find a policy $\pi(a_t \mid s_t)$ that maximizes cumulative discounted reward:

$$\max_\pi \; \mathbb{E} \left[ \sum_{t=0}^{T} \gamma^t \, r_t \right]$$
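
To make the objective concrete, here is a minimal sketch that evaluates the discounted sum for one short episode. The helper name `discounted_return` and the three reward values are made up for illustration.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over one episode (t starts at 0)."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# Hypothetical daily portfolio returns used as rewards
daily_rewards = [0.01, -0.02, 0.015]

print(discounted_return(daily_rewards, gamma=0.99))
print(discounted_return(daily_rewards, gamma=1.0))
```

With $\gamma = 0.99$ the weights $\gamma^t$ decay on a scale of roughly $1/(1-\gamma) = 100$ steps, which is why daily-frequency setups usually keep $\gamma$ close to 1.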

### State representation

A practical state vector for portfolio RL:

$$s_t = \big[ w_t^{(1)}, \ldots, w_t^{(n)}, \; \text{ret}_t^{(1)}, \ldots, \text{ret}_t^{(n)}, \; \text{features}_t \big]$$

Where:
- $w_t^{(i)}$ = current weight of asset $i$
- $\text{ret}_t^{(i)}$ = recent return(s) of asset $i$
- $\text{features}_t$ = MACD, RSI, volatility, etc.
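
A minimal sketch of how such a state vector could be assembled from a price table. The function `build_state`, the choice of a 20-day volatility and a price-scaled MACD line as the features, and the synthetic prices are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def build_state(weights, prices, lookback=20):
    """Concatenate weights, latest returns, and two simple features per asset.

    prices: DataFrame of shape (T, n_assets), one column per asset.
    """
    returns = prices.pct_change().dropna()
    last_ret = returns.iloc[-1].to_numpy()
    # Rolling volatility over the last `lookback` days
    vol = returns.iloc[-lookback:].std().to_numpy()
    # MACD line: 12-period EMA minus 26-period EMA, scaled by price level
    ema_fast = prices.ewm(span=12, adjust=False).mean()
    ema_slow = prices.ewm(span=26, adjust=False).mean()
    macd = ((ema_fast - ema_slow) / prices).iloc[-1].to_numpy()
    return np.concatenate([weights, last_ret, vol, macd]).astype(np.float32)

# Synthetic prices for three hypothetical assets
rng = np.random.default_rng(0)
demo_prices = pd.DataFrame(
    100 * np.cumprod(1 + rng.normal(0.0005, 0.01, size=(60, 3)), axis=0),
    columns=['A', 'B', 'C'],
)
demo_weights = np.ones(3) / 3
state = build_state(demo_weights, demo_prices)
print(state.shape)  # (12,) = 3 weights + 3 returns + 3 vols + 3 MACD values
```

Scaling the MACD line by price keeps all entries on a comparable order of magnitude, which tends to help neural-network policies.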

### Reward design choices

| Reward | Formula | Pros | Cons |
|--------|---------|------|------|
| Raw return | $r_t = r_{p,t} = \sum_i w_t^{(i)} \, \text{ret}_t^{(i)}$ | Simple | Ignores risk |
| Sharpe-based | $r_t = \bar{r}_p / \sigma_{r_p}$ (rolling window) | Risk-adjusted | Noisy estimate |
| Return - penalty | $r_t = r_{p,t} - \lambda \cdot \text{drawdown}_t$ | Controls drawdowns | Sensitive to $\lambda$ |
| Differential Sharpe | $r_t = \frac{\partial S_t}{\partial \eta}$, where $S_t$ is an exponentially weighted Sharpe ratio with adaptation rate $\eta$ | Incremental, computable online | Harder to implement |
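
The differential Sharpe ratio is the least obvious entry in the table, so here is a minimal sketch following the Moody and Saffell formulation. It keeps exponential moving averages $A_t$ of returns and $B_t$ of squared returns and emits $D_t = \frac{B_{t-1}\,\Delta A_t - \frac{1}{2} A_{t-1}\,\Delta B_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}$ at each step. The function name, the zero initialization, and the value of `eta` are assumptions for illustration.

```python
import numpy as np

def differential_sharpe(returns, eta=0.01):
    """Differential Sharpe ratio, one value per time step.

    A and B are exponential moving averages of returns and squared
    returns; eta is the adaptation rate. Steps where the variance
    estimate is still ~0 (e.g. the very first one) return 0.
    """
    A, B = 0.0, 0.0
    out = np.zeros(len(returns))
    for t, r in enumerate(returns):
        dA = r - A          # Delta A_t, uses A_{t-1}
        dB = r * r - B      # Delta B_t, uses B_{t-1}
        var = B - A * A     # variance estimate from the previous step
        if var > 1e-12:
            out[t] = (B * dA - 0.5 * A * dB) / var ** 1.5
        A += eta * dA
        B += eta * dB
    return out

rng = np.random.default_rng(0)
demo_returns = rng.normal(0.0005, 0.015, 252)
dsr = differential_sharpe(demo_returns)
print(dsr[:5])
```

Each step needs only the two running averages, which is why this reward is attractive for online training despite the extra bookkeeping.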

In [None]:
# Quick illustration: different reward signals for the same portfolio return stream
T = 252
portfolio_returns = np.random.normal(0.0005, 0.015, T)

# Raw return reward
reward_raw = portfolio_returns.copy()

# Sharpe-based (rolling 20-day)
window = 20
reward_sharpe = np.zeros(T)
for t in range(window, T):
    chunk = portfolio_returns[t - window:t]
    reward_sharpe[t] = chunk.mean() / (chunk.std() + 1e-8)

# Return - drawdown penalty
cumulative = np.cumprod(1 + portfolio_returns)
running_max = np.maximum.accumulate(cumulative)
drawdown = (running_max - cumulative) / running_max
lam = 2.0
reward_dd = portfolio_returns - lam * drawdown

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, rw, title in zip(axes, [reward_raw, reward_sharpe, reward_dd],
                          ['Raw Return', 'Rolling Sharpe', 'Return - 2*Drawdown']):
    ax.plot(rw, linewidth=0.8)
    ax.set_title(title)
    ax.set_xlabel('Day')
    ax.axhline(0, color='gray', linestyle='--', linewidth=0.5)
plt.tight_layout()
plt.show()

---
## 2. Key RL Algorithms for Finance

| Algorithm | Type | Action Space | Key Idea | Finance Fit |
|-----------|------|-------------|----------|-------------|
| **DQN** | Value-based | Discrete | Q-network with experience replay | Good for discrete actions (buy/hold/sell) |
| **PPO** | Policy gradient | Discrete or continuous | Clipped surrogate objective, stable training | Most popular for portfolios |
| **A2C** | Actor-critic | Discrete or continuous | Advantage function reduces variance | Simpler and often faster than PPO, but typically less stable |
| **SAC** | Actor-critic | Continuous | Maximum entropy — explores more | Good for complex action spaces |
| **DDPG** | Actor-critic | Continuous | Deterministic policy gradient | Off-policy, sample efficient |

### Why PPO is the default choice

PPO uses a clipped objective that prevents large policy updates:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$

- $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{\text{old}}}(a_t|s_t)$ — probability ratio
- $\hat{A}_t$ — estimated advantage
- $\epsilon \approx 0.2$ — clip range

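These properties make PPO a practical default for portfolio problems:
1. Stable training: clipping limits how far a single update can move the policy
2. Works with continuous action spaces (portfolio weights)
3. Relatively robust to hyperparameter choices
4. Easy to parallelize across environment copies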
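
The clipping behaviour of $L^{\text{CLIP}}$ is easiest to see on a few hand-picked numbers. This is a minimal NumPy sketch of the per-sample term, not the Stable-Baselines3 implementation; the function name `ppo_clip_terms` and the sample ratios are illustrative.

```python
import numpy as np

def ppo_clip_terms(ratio, advantage, eps=0.2):
    """Per-sample min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped)

ratios = np.array([0.5, 1.0, 1.5])

# Positive advantage: the gain from raising the ratio is capped at 1 + eps
print(ppo_clip_terms(ratios, advantage=1.0))   # [0.5 1.  1.2]

# Negative advantage: the clipped value applies only when the ratio is
# below 1 - eps; a ratio above 1 + eps is still penalized in full
print(ppo_clip_terms(ratios, advantage=-1.0))  # [-0.8 -1.  -1.5]
```

The objective is the mean of these terms over a batch. Because of the `min`, the policy gets no extra credit for moving the ratio outside $[1-\epsilon, 1+\epsilon]$ in the favourable direction.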

---
## 3. Why RL Is Hard in Finance

### Problem 1: Non-stationarity
Markets change regimes, so a policy trained mostly on a bull market can fail in a bear market.
Formally, the transition function $P(s_{t+1}|s_t, a_t)$ is not fixed over time.

### Problem 2: Partial observability
The agent sees prices and features but not:
- Order flow / institutional positions
- Macro policy announcements before they happen
- Other agents' strategies

This means the true state is partially hidden — technically a POMDP.

### Problem 3: Delayed and noisy rewards
- A good trade might look bad for weeks before paying off
- Daily returns are extremely noisy (signal-to-noise ratio, i.e. mean daily return divided by daily volatility, on the order of 0.05)
- Credit assignment is hard: which action caused the loss?
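
A back-of-the-envelope calculation shows what a daily signal-to-noise ratio of about 0.05 implies. Assuming i.i.d. daily returns, the t-statistic of the sample mean after $n$ days is $\text{SNR} \cdot \sqrt{n}$, so reaching $t = 2$ requires $n = (2 / \text{SNR})^2$ days. The helper name `days_to_detect` is made up for this sketch.

```python
import math

def days_to_detect(snr_daily, t_stat=2.0):
    """Days needed for the mean return to reach `t_stat`, assuming
    i.i.d. daily returns with mean/volatility ratio `snr_daily`."""
    return math.ceil((t_stat / snr_daily) ** 2)

n = days_to_detect(0.05)
print(n)                      # 1600
print(f"{n / 252:.1f} years")  # 6.3 years
```

So even a genuinely profitable policy can need several years of daily data before its edge is statistically distinguishable from zero, and real returns are not i.i.d., which makes the picture worse.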

### Problem 4: Sample efficiency
- We have only ONE realized market history, so we cannot sample fresh independent episodes the way a game simulator can
- Financial data is expensive and limited
- Simulated environments don't capture real market microstructure

### Problem 5: Overfitting
- RL agents are extremely good at memorizing training episodes
- Backtested RL performance rarely generalizes to live trading
- Unlike i.i.d. supervised ML, a random train/test split is not valid: splits must be chronological, and a single held-out period is only one out-of-sample path
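
One hedge against overfitting to a single period is to evaluate on several chronological folds. Below is a minimal sketch of a rolling walk-forward splitter. The function name `walk_forward_splits` and the window sizes are illustrative assumptions and are not part of FinRL or Stable-Baselines3.

```python
def walk_forward_splits(n_samples, train_size, test_size):
    """Yield (train_range, test_range) pairs that roll forward in time.

    Each test window starts right after its training window, so no
    future data leaks into training.
    """
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size

# Two years of daily data: train on one year, test on the next quarter
splits = list(walk_forward_splits(504, train_size=252, test_size=63))
for train, test in splits:
    print(f"train [{train.start}, {train.stop})  test [{test.start}, {test.stop})")
```

An agent would be retrained on each training range and scored only on the matching test range; reporting the spread across folds is more informative than one backtest number.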

---
## 4. FinRL Architecture

FinRL is one of the most widely used open-source libraries for financial RL. Its pipeline:

```
Data Layer         Environment Layer       Agent Layer         Backtest Layer
----------         -----------------       -----------         --------------
Yahoo Finance      StockTradingEnv         PPO / A2C /         Pyfolio
Alpaca API    -->  (gym.Env)          -->  SAC / DDPG    -->   Quantstats
WRDS               Custom rewards          (Stable-Baselines3) Custom metrics
```

### Data flow:
1. **Download** OHLCV data + technical indicators
2. **Preprocess** into a DataFrame with columns: date, tic, open, high, low, close, volume, features...
3. **Create environment** that reads this data and exposes gym interface
4. **Train agent** using Stable-Baselines3
5. **Backtest** on held-out period
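
To illustrate the long-format table from step 2, here is a small pandas-only sketch (it does not call FinRL). It builds a toy frame with `date`, `tic`, and `close` columns and pivots it into a wide returns matrix of shape (T, n_assets). The tickers and prices are made up.

```python
import pandas as pd

# Toy long-format frame: one row per (date, ticker)
dates = pd.date_range('2024-01-01', periods=4, freq='B')
long_df = pd.DataFrame({
    'date': dates.repeat(2),
    'tic': ['AAA', 'BBB'] * 4,
    'close': [100.0, 50.0, 101.0, 49.5, 102.0, 50.5, 101.5, 51.0],
})

# One column per ticker, then simple daily returns
close_wide = long_df.pivot(index='date', columns='tic', values='close')
returns_wide = close_wide.pct_change().dropna()
returns_matrix = returns_wide.to_numpy()

print(returns_wide.round(4))
print(returns_matrix.shape)  # (3, 2): 3 return days, 2 assets
```

A matrix in this layout is what an environment typically steps through, one row per trading day.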

### Key FinRL classes:
```python
from finrl.meta.env_stock_trading.env_stocktrading import StockTradingEnv
from stable_baselines3 import PPO, A2C, SAC, DDPG
```

---
## 5. Where RL Actually Works in Industry

| Application | Why RL Works Here | Companies |
|-------------|-------------------|----------|
| **Optimal execution** | Clear reward (minimize slippage), fast feedback | JPMorgan, Goldman |
| **Market making** | Stationary-ish dynamics, high-frequency feedback | Citadel, Two Sigma |
| **Options hedging** | Can simulate with known models, clear objective | Various sell-side |
| **Order routing** | Discrete actions, fast rewards | Most brokers |

### Where RL struggles:
- **Alpha generation**: too noisy, too non-stationary
- **Long-horizon portfolio management**: delayed rewards, regime changes
- **Low-frequency trading**: not enough data to train

### Realistic expectations
RL in portfolio management typically learns to:
- Reduce drawdowns (risk management)
- Smooth position transitions (reduce turnover)
- Adapt to volatility regimes

It does NOT typically:
- Beat simple momentum or mean-reversion strategies
- Generate consistent alpha
- Work out-of-the-box without careful reward engineering

---
## 6. Demo: Custom Gym Environment for Portfolio Allocation

We'll build a simple portfolio environment from scratch to understand the mechanics.
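
One design choice worth checking before reading the environment: it maps raw actions in $[-1, 1]$ to weights with a softmax. That keeps the portfolio long-only and fully invested, but it also bounds how concentrated the agent can get. A minimal sketch of the most concentrated allocation for 5 assets under that assumption:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Extreme action within the [-1, 1] bounds: all-in on the first asset
most_concentrated = np.array([1.0, -1.0, -1.0, -1.0, -1.0])
w = softmax(most_concentrated)
print(w.round(4))  # [0.6488 0.0878 0.0878 0.0878 0.0878]
```

So with these bounds no single asset can exceed about 65% of the portfolio, and none can drop to zero. Widening the action bounds or scaling the action before the softmax would relax that limit.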

In [None]:
class SimplePortfolioEnv(gym.Env):
    """A minimal portfolio allocation environment.
    
    State:  [current_weights (n), mean recent returns (n), volatility (n)]
    Action: raw scores (n) in [-1, 1], softmax-normalized into long-only weights
    Reward: one-step portfolio return net of transaction costs
    """
    
    def __init__(self, returns_data, lookback=20, transaction_cost=0.001):
        super().__init__()
        self.returns_data = returns_data  # (T, n_assets)
        self.n_assets = returns_data.shape[1]
        self.lookback = lookback
        self.tc = transaction_cost
        
        # Observation: weights + returns + volatility
        obs_dim = 3 * self.n_assets
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(obs_dim,), dtype=np.float32
        )
        # Action: target weights (before softmax)
        self.action_space = spaces.Box(
            low=-1, high=1, shape=(self.n_assets,), dtype=np.float32
        )
        
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = self.lookback
        self.weights = np.ones(self.n_assets) / self.n_assets  # equal weight
        self.portfolio_value = 1.0
        self.history = []
        return self._get_obs(), {}
    
    def _get_obs(self):
        recent = self.returns_data[self.t - self.lookback:self.t]
        mean_ret = recent.mean(axis=0)
        vol = recent.std(axis=0) + 1e-8
        obs = np.concatenate([self.weights, mean_ret, vol]).astype(np.float32)
        return obs
    
    def _softmax(self, x):
        e = np.exp(x - x.max())
        return e / e.sum()
    
    def step(self, action):
        # Convert action to portfolio weights
        new_weights = self._softmax(action)
        
        # Transaction costs
        turnover = np.abs(new_weights - self.weights).sum()
        tc_cost = self.tc * turnover
        
        # Portfolio return
        asset_returns = self.returns_data[self.t]
        port_return = np.dot(new_weights, asset_returns) - tc_cost
        
        # Update state
        self.portfolio_value *= (1 + port_return)
        self.weights = new_weights * (1 + asset_returns)
        self.weights /= self.weights.sum()  # renormalize after price changes
        self.t += 1
        
        self.history.append({
            'portfolio_value': self.portfolio_value,
            'return': port_return,
            'turnover': turnover,
        })
        
        # Episode ends when we run out of data
        terminated = self.t >= len(self.returns_data)
        
        return self._get_obs(), port_return, terminated, False, {}

print("SimplePortfolioEnv defined.")

In [None]:
# Generate synthetic stock returns for 5 assets
n_days = 504  # ~2 years
n_assets = 5
asset_names = ['Tech', 'Finance', 'Healthcare', 'Energy', 'Consumer']

# Different risk/return profiles
mus = np.array([0.0008, 0.0004, 0.0006, 0.0003, 0.0005])
sigmas = np.array([0.02, 0.015, 0.018, 0.025, 0.012])

# Correlated returns
corr = np.array([
    [1.0, 0.5, 0.3, 0.2, 0.4],
    [0.5, 1.0, 0.4, 0.3, 0.5],
    [0.3, 0.4, 1.0, 0.2, 0.3],
    [0.2, 0.3, 0.2, 1.0, 0.2],
    [0.4, 0.5, 0.3, 0.2, 1.0],
])
cov = np.outer(sigmas, sigmas) * corr

returns_data = np.random.multivariate_normal(mus, cov, n_days)

# Visualize cumulative returns
cumulative = np.cumprod(1 + returns_data, axis=0)
for i, name in enumerate(asset_names):
    plt.plot(cumulative[:, i], label=name)
plt.title('Synthetic Asset Cumulative Returns')
plt.xlabel('Day')
plt.ylabel('Cumulative Return')
plt.legend()
plt.show()

In [None]:
# Test the environment manually
env = SimplePortfolioEnv(returns_data)
obs, info = env.reset()
print(f"Observation shape: {obs.shape}")
print(f"Action space: {env.action_space}")
print(f"\nInitial obs (first 5 = weights):")
print(f"  Weights: {obs[:5]}")
print(f"  Mean returns: {obs[5:10]}")
print(f"  Volatilities: {obs[10:15]}")

In [None]:
# Run a random agent vs equal-weight baseline
def run_episode(env, policy='random'):
    obs, _ = env.reset()
    done = False
    while not done:
        if policy == 'random':
            action = env.action_space.sample()
        elif policy == 'equal_weight':
            action = np.zeros(env.n_assets)  # softmax(0,...,0) = equal weight
        else:
            action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
    return pd.DataFrame(env.history)

# Compare random vs equal weight
env = SimplePortfolioEnv(returns_data)

results_random = run_episode(env, 'random')
results_equal = run_episode(env, 'equal_weight')

plt.plot(results_random['portfolio_value'].values, label='Random Agent', alpha=0.7)
plt.plot(results_equal['portfolio_value'].values, label='Equal Weight', alpha=0.7)
plt.title('Random Agent vs Equal Weight Baseline')
plt.xlabel('Day')
plt.ylabel('Portfolio Value')
plt.legend()
plt.show()

print(f"Random agent final value:  {results_random['portfolio_value'].iloc[-1]:.4f}")
print(f"Equal weight final value:  {results_equal['portfolio_value'].iloc[-1]:.4f}")

In [None]:
# Train a simple PPO agent using Stable-Baselines3
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

# Split data: train on first 75%, test on last 25%
split = int(0.75 * n_days)
train_returns = returns_data[:split]
test_returns = returns_data[split:]

# Create training environment
train_env = DummyVecEnv([lambda: SimplePortfolioEnv(train_returns)])

# Train PPO
model = PPO(
    'MlpPolicy',
    train_env,
    learning_rate=3e-4,
    n_steps=128,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    verbose=0,
)

print("Training PPO agent...")
model.learn(total_timesteps=50_000)
print("Training complete.")

In [None]:
# Evaluate on test data
test_env = SimplePortfolioEnv(test_returns)

def ppo_policy(obs):
    action, _ = model.predict(obs, deterministic=True)
    return action

results_ppo = run_episode(test_env, ppo_policy)

test_env_eq = SimplePortfolioEnv(test_returns)
results_eq_test = run_episode(test_env_eq, 'equal_weight')

plt.plot(results_ppo['portfolio_value'].values, label='PPO Agent')
plt.plot(results_eq_test['portfolio_value'].values, label='Equal Weight')
plt.title('PPO Agent vs Equal Weight (Test Period)')
plt.xlabel('Day')
plt.ylabel('Portfolio Value')
plt.legend()
plt.show()

# Performance metrics
for name, res in [('PPO', results_ppo), ('Equal Weight', results_eq_test)]:
    rets = res['return'].values
    sharpe = np.sqrt(252) * rets.mean() / (rets.std() + 1e-8)
    cum_ret = res['portfolio_value'].iloc[-1] - 1
    # Max drawdown as a fraction of the running peak (not an absolute difference)
    max_dd = (1 - res['portfolio_value'] / res['portfolio_value'].cummax()).max()
    print(f"{name:15s} | Return: {cum_ret:+.2%} | Sharpe: {sharpe:.2f} | Max DD: {max_dd:.2%}")

---
## Key Takeaways

1. **Trading fits the MDP framework** — state/action/reward are well-defined
2. **PPO is the workhorse** — stable, works with continuous actions, easy to tune
3. **Reward design matters enormously** — raw returns lead to risk-seeking behavior
4. **RL in finance is hard** — non-stationarity, noise, overfitting
5. **Real success stories** are in execution, market making, hedging — not alpha
6. **Always compare to simple baselines** — equal weight is hard to beat

### Next: Seminar
We'll use FinRL to train multiple agents on real stock data and experiment with reward functions.