# Homework 11: RL Portfolio Manager

**Total: 100 points**

Build a reinforcement learning portfolio manager using 30 stocks. Train multiple agents, benchmark them against classical strategies, and analyze agent behavior.

| Part | Topic | Points |
|------|-------|--------|
| 1 | Set up FinRL with 30 stocks | 15 |
| 2 | Train PPO, A2C, SAC agents | 25 |
| 3 | Benchmark against equal-weight, min-var, best ML model | 25 |
| 4 | Reward shaping experiments | 20 |
| 5 | Analysis of agent behavior | 15 |

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 5)

---
## Part 1: Data Setup (15 pts)

**Requirements:**
- Download daily OHLCV data for 30 stocks (e.g., DJIA constituents) from 2018-01-01 to 2024-01-01
- Compute daily returns and basic technical indicators (MACD, RSI, or rolling volatility)
- Split: Train on 2018-2022, Test on 2023
- Build a `PortfolioEnv` (or use FinRL's `StockTradingEnv`) that:
  - Takes the preprocessed data
  - Exposes gym-compatible interface
  - Supports configurable reward functions
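As a rough illustration of the preprocessing and split requirements, the sketch below computes returns, a simple RSI, and rolling volatility, then splits by date. It runs on synthetic random-walk prices rather than downloaded data. The helper names (`add_features`, `temporal_split`), the window lengths, and the simple-moving-average form of RSI are assumptions made for this example, not a required design.

```python
import numpy as np
import pandas as pd

def add_features(close, rsi_window=14, vol_window=20):
    """Daily returns, a simple-moving-average RSI, and rolling volatility.

    `close` is a (dates x tickers) DataFrame of closing prices.
    """
    returns = close.pct_change()
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(rsi_window).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(rsi_window).mean()
    rsi = 100 - 100 / (1 + avg_gain / avg_loss)
    vol = returns.rolling(vol_window).std()
    # Stack the three blocks side by side and drop the warm-up rows with NaNs.
    return pd.concat({"ret": returns, "rsi": rsi, "vol": vol}, axis=1).dropna()

def temporal_split(frame, train_end="2022-12-31", test_start="2023-01-01"):
    """Split on dates only, so no test row ever precedes a training row."""
    return frame.loc[:train_end], frame.loc[test_start:]

# Synthetic random-walk prices stand in for the downloaded data.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2018-01-01", "2023-12-29")
steps = rng.normal(0.0, 0.01, size=(len(dates), 3))
close = pd.DataFrame(100 * np.exp(steps.cumsum(axis=0)), index=dates, columns=["A", "B", "C"])

features = add_features(close)
train, test = temporal_split(features)
```

Every feature here uses only trailing windows, so a row at date `t` contains no information from after `t`. If you add scaling or normalization, fit it on the training slice only.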

**Grading:**
- 5 pts: Correct data download and cleaning (no NaNs, proper alignment)
- 5 pts: Working gym environment with proper observation/action spaces
- 5 pts: Correct train/test split respecting temporal ordering
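To make the environment requirements concrete, here is a minimal sketch of the `reset`/`step` contract with a pluggable reward function. It is a plain Python class that only mirrors the Gymnasium return signatures; a real submission would subclass `gymnasium.Env` and declare `observation_space` and `action_space`. The class name `SimplePortfolioEnv`, the softmax mapping from actions to long-only weights, and the observation layout are all assumptions for illustration.

```python
import numpy as np

def softmax(x):
    """Map arbitrary real scores to non-negative weights that sum to 1."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

class SimplePortfolioEnv:
    """Gymnasium-style sketch; not a drop-in replacement for a real gym env."""

    def __init__(self, returns, reward_fn=None, window=5):
        self.returns = np.asarray(returns, dtype=float)  # shape (T, n_assets)
        self.window = window
        self.n_assets = self.returns.shape[1]
        # Default reward: the raw portfolio return for the step.
        self.reward_fn = reward_fn or (lambda port_ret, info: port_ret)

    def _obs(self):
        # Observation: the last `window` rows of asset returns, flattened.
        return self.returns[self.t - self.window:self.t].ravel()

    def reset(self, seed=None):
        self.t = self.window
        self.weights = np.full(self.n_assets, 1.0 / self.n_assets)
        return self._obs(), {}

    def step(self, action):
        new_weights = softmax(np.asarray(action, dtype=float))
        turnover = float(np.abs(new_weights - self.weights).sum())
        # Weights chosen from data before t earn the return realized at t.
        port_ret = float(new_weights @ self.returns[self.t])
        self.weights = new_weights
        self.t += 1
        info = {"weights": new_weights, "turnover": turnover,
                "portfolio_return": port_ret}
        reward = self.reward_fn(port_ret, info)
        terminated = self.t >= len(self.returns)
        return self._obs(), reward, terminated, False, info
```

A different reward can be swapped in without touching the environment, for example `SimplePortfolioEnv(returns, reward_fn=lambda r, info: r - 0.001 * info["turnover"])`, where the 0.001 cost coefficient is a made-up value.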

In [None]:
# Suggested tickers (DJIA-like universe)
tickers = [
    'AAPL', 'MSFT', 'GOOGL', 'AMZN', 'JPM', 'JNJ', 'V', 'PG', 'UNH', 'HD',
    'MA', 'DIS', 'NVDA', 'BAC', 'ADBE', 'CRM', 'CMCSA', 'XOM', 'CSCO', 'PFE',
    'NFLX', 'ABT', 'KO', 'PEP', 'TMO', 'AVGO', 'COST', 'WMT', 'MRK', 'CVX',
]

# YOUR CODE HERE
# 1. Download data using yfinance
# 2. Compute returns
# 3. Add technical indicators
# 4. Build or configure environment
# 5. Verify with env.reset() and env.step()

---
## Part 2: Train PPO, A2C, SAC Agents (25 pts)

**Requirements:**
- Train three agents using Stable-Baselines3: PPO, A2C, SAC
- Use the same environment and reward function for fair comparison
- Train each for at least 100K timesteps
- Plot training reward curves (use callbacks or training logs)
- Report the final training performance of each agent
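Raw per-episode rewards are usually too noisy to compare across agents, so a smoothed curve tends to be easier to read. The sketch below assumes you have already collected a list of episode rewards for each agent (for example from a `Monitor`-wrapped environment or a custom callback); the function name `moving_average` and the window length are illustrative choices.

```python
import numpy as np

def moving_average(values, window=10):
    """Trailing moving average; output has len(values) - window + 1 points."""
    values = np.asarray(values, dtype=float)
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")
```

Plotting `moving_average(rewards, 10)` for each agent on the same axes, with the same window, keeps the comparison fair.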

**Grading:**
- 5 pts: Correct PPO setup and training
- 5 pts: Correct A2C setup and training
- 5 pts: Correct SAC setup and training
- 5 pts: Training curves plotted
- 5 pts: Reasonable hyperparameter choices (documented)

In [None]:
# YOUR CODE HERE
# 1. Create training environment
# 2. Train PPO with documented hyperparameters
# 3. Train A2C with documented hyperparameters
# 4. Train SAC with documented hyperparameters
# 5. Plot learning curves

---
## Part 3: Benchmark Comparison (25 pts)

**Requirements:**
Compare all RL agents against these baselines on the TEST set:
1. **Equal-weight** portfolio (rebalanced daily)
2. **Minimum-variance** portfolio (using rolling covariance, rebalanced monthly)
3. **Best ML model from Week 5** (if available; otherwise use a simple momentum strategy)
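The first two baselines can be sketched with a few lines of NumPy. The version below uses the closed-form, unconstrained minimum-variance solution, which can produce negative (short) weights; a long-only variant would need a constrained optimizer. The function names, the 252-day lookback, and the 21-day rebalancing interval (roughly one month) are assumptions for illustration.

```python
import numpy as np

def equal_weights(n_assets):
    return np.full(n_assets, 1.0 / n_assets)

def min_variance_weights(cov):
    """Unconstrained minimum-variance weights: inv(C) 1 / (1' inv(C) 1)."""
    cov = np.asarray(cov, dtype=float)
    inv_ones = np.linalg.solve(cov, np.ones(cov.shape[0]))
    return inv_ones / inv_ones.sum()

def rolling_min_variance(returns, lookback=252, rebalance_every=21):
    """Weights per day, re-estimated every `rebalance_every` rows.

    The covariance at row t uses rows [t - lookback, t) only, so the
    weights never see the return they are applied to. Before enough
    history exists, the portfolio falls back to equal weights.
    """
    returns = np.asarray(returns, dtype=float)
    n_rows, n_assets = returns.shape
    weights = np.empty((n_rows, n_assets))
    current = equal_weights(n_assets)
    for t in range(n_rows):
        if t >= lookback and (t - lookback) % rebalance_every == 0:
            cov = np.cov(returns[t - lookback:t], rowvar=False)
            current = min_variance_weights(cov)
        weights[t] = current
    return weights
```

Daily portfolio returns then follow as `(weights * returns).sum(axis=1)`. Between rebalances this sketch holds the target weights fixed and ignores drift from price moves, which is a simplification.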

**Metrics to report:**
- Cumulative return
- Annualized Sharpe ratio
- Maximum drawdown
- Annualized volatility
- Average daily turnover
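The return-based metrics above can be computed from a single series of daily portfolio returns. The sketch below assumes 252 trading days per year and a zero risk-free rate for the Sharpe ratio; the function name `performance_metrics` is illustrative. Turnover is left out because it depends on the weight history rather than on returns.

```python
import numpy as np

TRADING_DAYS = 252  # assumed annualization factor for daily data

def performance_metrics(daily_returns, periods_per_year=TRADING_DAYS):
    r = np.asarray(daily_returns, dtype=float)
    wealth = np.cumprod(1.0 + r)  # value of 1 unit invested at the start
    ann_vol = r.std(ddof=1) * np.sqrt(periods_per_year)
    # Sharpe with a zero risk-free rate; undefined when volatility is zero.
    sharpe = np.nan if ann_vol == 0 else r.mean() * periods_per_year / ann_vol
    # Running peak includes the starting wealth of 1.0.
    peak = np.maximum.accumulate(np.concatenate(([1.0], wealth)))[1:]
    max_drawdown = float(np.max(1.0 - wealth / peak))
    return {
        "cumulative_return": float(wealth[-1] - 1.0),
        "annualized_sharpe": float(sharpe),
        "max_drawdown": max_drawdown,
        "annualized_volatility": float(ann_vol),
    }
```

Applying the same function to every strategy, RL agents and baselines alike, ensures the summary table uses one consistent definition per metric.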

**Deliverables:**
- Cumulative return plot (all strategies on one chart)
- Summary metrics table
- 2-3 sentence commentary on results

**Grading:**
- 5 pts: Correct equal-weight baseline
- 5 pts: Correct min-variance baseline
- 5 pts: ML/momentum baseline
- 5 pts: Clean comparison plot and table
- 5 pts: Thoughtful commentary

In [None]:
# YOUR CODE HERE
# 1. Implement equal-weight baseline
# 2. Implement min-variance baseline
# 3. Implement momentum or ML baseline
# 4. Evaluate all RL agents on test set
# 5. Create comparison table and plot

---
## Part 4: Reward Shaping Experiments (20 pts)

**Requirements:**
Using PPO, train agents with at least 3 different reward functions:
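1. Raw portfolio return: $r_t = r_p$, where $r_p$ is the portfolio return at step $t$
2. Sharpe-based: $r_t = \bar{r}_{20} / \sigma_{20}$, where $\bar{r}_{20}$ and $\sigma_{20}$ are the mean and standard deviation of portfolio returns over the trailing 20 days
3. Return minus drawdown penalty: $r_t = r_p - \lambda \cdot \text{drawdown}_t$, where $\text{drawdown}_t$ is the current decline from the running peak portfolio value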

For reward 3, try at least 2 values of $\lambda$ (e.g., 1.0 and 5.0).
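One possible way to code the three rewards is as functions of the portfolio return history up to the current step. This is a sketch under stated assumptions: drawdown is measured as the fractional decline from the running peak of cumulative wealth (starting at 1.0), the Sharpe reward uses the sample standard deviation with a small `eps` to avoid division by zero, and all function names are illustrative.

```python
import numpy as np

def raw_return_reward(portfolio_returns):
    """Reward 1: the most recent portfolio return."""
    return float(portfolio_returns[-1])

def rolling_sharpe_reward(portfolio_returns, window=20, eps=1e-8):
    """Reward 2: mean / std of the last `window` portfolio returns."""
    recent = np.asarray(portfolio_returns[-window:], dtype=float)
    if len(recent) < 2:
        return 0.0  # not enough history for a standard deviation
    return float(recent.mean() / (recent.std(ddof=1) + eps))

def drawdown_penalty_reward(portfolio_returns, lam=1.0):
    """Reward 3: latest return minus lam times the current drawdown."""
    r = np.asarray(portfolio_returns, dtype=float)
    wealth = np.cumprod(1.0 + r)
    peak = max(1.0, float(wealth.max()))
    drawdown = 1.0 - wealth[-1] / peak
    return float(r[-1] - lam * drawdown)
```

Because the three rewards live on different scales, compare the resulting agents on the shared test metrics rather than on their training reward values.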

**Deliverables:**
- Test performance comparison across reward functions
- How does the reward function affect turnover?
- How does it affect drawdown behavior?
- Which reward function would you use in practice? Why?

**Grading:**
- 5 pts: Correct implementation of 3+ reward functions
- 5 pts: Lambda sensitivity analysis
- 5 pts: Comparison plots and metrics
- 5 pts: Thoughtful analysis of results

In [None]:
# YOUR CODE HERE
# 1. Implement reward functions in environment
# 2. Train PPO with each reward
# 3. Compare on test set
# 4. Analyze turnover and drawdown behavior

---
## Part 5: Analysis of Agent Behavior (15 pts)

**Requirements:**
Pick your best-performing RL agent and analyze its behavior:
1. **Weight evolution**: Plot how portfolio weights change over time on the test set. Does the agent concentrate or diversify?
2. **Turnover analysis**: Plot daily turnover. Does the agent trade too much?
3. **Regime behavior**: How does the agent behave during volatile vs calm periods?
4. **Overfitting check**: Report train vs test metrics. Train with at least 3 different seeds and report the mean and standard deviation across seeds.
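The quantities behind these analyses are simple to compute once the weights at every test step have been recorded. The sketch below assumes a `(T, n_assets)` array of weights and a dictionary mapping each seed to one test metric. Turnover is defined here as the sum of absolute weight changes (some texts halve it), and the Herfindahl index is used as one possible concentration measure; all names are illustrative.

```python
import numpy as np

def daily_turnover(weights):
    """Sum of absolute weight changes between consecutive rows."""
    w = np.asarray(weights, dtype=float)
    return np.abs(np.diff(w, axis=0)).sum(axis=1)

def herfindahl(weights):
    """Concentration per row: 1/n when equal-weighted, 1.0 in a single asset."""
    w = np.asarray(weights, dtype=float)
    return (w ** 2).sum(axis=1)

def seed_summary(metric_by_seed):
    """Mean and sample standard deviation of one metric across seeds."""
    values = np.asarray(list(metric_by_seed.values()), dtype=float)
    return float(values.mean()), float(values.std(ddof=1))

# Toy weight history (not real results): two assets over three days.
toy_weights = np.array([[0.5, 0.5], [1.0, 0.0], [1.0, 0.0]])
toy_turnover = daily_turnover(toy_weights)
toy_concentration = herfindahl(toy_weights)
```

For the stacked area chart, the same weight array can be passed to `plt.stackplot(dates, weights.T)`.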

**Deliverables:**
- Weight evolution plot (stacked area chart)
- Turnover time series
- Train vs test comparison table
- Multi-seed results (mean ± std)
- 3-5 sentences: Does this agent behave like a reasonable portfolio manager?

**Grading:**
- 5 pts: Weight and turnover visualizations
- 5 pts: Overfitting analysis (seeds, train/test gap)
- 5 pts: Behavioral interpretation

In [None]:
# YOUR CODE HERE
# 1. Run best agent on test set, record weights at each step
# 2. Create stacked area chart of weights
# 3. Plot turnover over time
# 4. Train with 3+ seeds, report mean/std
# 5. Write analysis

---
## Submission

Submit this notebook with:
- All code cells executed
- All plots rendered
- All commentary/analysis written in markdown cells
- Random seeds set so that training is reproducible
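A minimal sketch of seeding, assuming only Python's `random` module and NumPy's global generator need to be fixed. The helper name `set_global_seeds` is illustrative. Stable-Baselines3 algorithms also accept a `seed` argument in their constructors, which should be set separately for each agent.

```python
import random
import numpy as np

def set_global_seeds(seed):
    """Seed Python's and NumPy's global random number generators."""
    random.seed(seed)
    np.random.seed(seed)
```

Calling this at the top of the notebook, and again before each training run with the seed you report, makes it easier to reproduce a specific result.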