# Training and Comparing RL Agents (PPO vs. DQN)

This notebook provides a complete, end-to-end example of using the `rl_trading_project` framework with a robust data pipeline. We will perform the following steps:

1.  **Data Ingestion:** Create a high-performance DuckDB database from raw, gzipped CSV files. This simulates a real-world scenario where you have large amounts of historical data.
2.  **Data Loading & Preparation:** Load the prepared data into a Pandas DataFrame, ready for our RL environments.
3.  **Train a PPO Agent:** Train a Proximal Policy Optimization (PPO) agent to manage a **multi-asset portfolio**. PPO is ideal for this task due to its ability to handle continuous, multi-dimensional action spaces.
4.  **Train a DQN Agent:** Train a Dueling Deep Q-Network (DQN) agent on a **single-asset** task. We do this to highlight the strengths of DQN in discrete action spaces and show how different agents are suited to different problems.
5.  **Backtesting & Comparison:** Evaluate both trained agents on out-of-sample data and compare their performance metrics and equity curves.

## Part 1: Data Ingestion with DuckDB

First, we need data. We use our `ingest_raw_data_to_duckdb` function to ingest the monthly `csv.gz` files for multiple assets found in `RAW_DIR` and build a persistent, columnar database named `market_data.duckdb`.

In [None]:
import os
import gzip
import pandas as pd
import numpy as np
import duckdb
import shutil

from rl_trading_project.data.duckdb_loader import ingest_raw_data_to_duckdb

RAW_DIR = '../AlphaVantage Data/raw'
DB_PATH = 'market_data.duckdb'

ingest_summary = ingest_raw_data_to_duckdb(raw_dir=RAW_DIR, db_path=DB_PATH, source_timezone='UTC')
print("\n--- Ingestion Summary ---")
print(ingest_summary)


## Part 2: Load Data and Set Up Environments

With the database created, we can query the data needed for the training and testing periods. We load all rows, then filter so that every remaining asset has enough history to form a portfolio. The first two-thirds of the timeline is used for training and the final third for backtesting.

In [None]:
import matplotlib.pyplot as plt

# Connect to the database and load all data into pandas
con = duckdb.connect(DB_PATH, read_only=True)
portfolio_df = con.execute("SELECT * FROM ohlcv ORDER BY timestamp, asset").fetchdf()
con.close()

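# Parse timestamps as timezone-aware UTC and set the multi-index required by PortfolioEnv.
# utc=True localizes tz-naive values and converts tz-aware ones, whereas .dt.tz_convert raises on tz-naive input.
portfolio_df['timestamp'] = pd.to_datetime(portfolio_df['timestamp'], utc=True)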
portfolio_df = portfolio_df.set_index(['timestamp', 'asset'])

print(f"Loaded {len(portfolio_df)} total rows for assets: {portfolio_df.index.get_level_values('asset').unique().tolist()}")
print("Data Head:")
print(portfolio_df.head())

# Visualize the data
fig, ax = plt.subplots(figsize=(15, 7))
for asset in portfolio_df.index.get_level_values('asset').unique():
    asset_prices = portfolio_df.xs(asset, level='asset')['close']
    ax.plot(asset_prices, label=asset)
ax.set_title('Loaded Asset Price History')
ax.set_xlabel('Time')
ax.set_ylabel('Price')
ax.legend()
ax.grid(True)
plt.show()

# --- Data Filtering and Splitting ---
# In a real-world scenario, you first filter for assets with sufficient history.
# We will demonstrate this by filtering for assets with at least 23 months of data.
MIN_MONTHS = 23
month_counts = portfolio_df.reset_index().groupby('asset')['timestamp'].apply(lambda x: (x.max() - x.min()).days / 30.44)
assets_with_enough_data = month_counts[month_counts >= MIN_MONTHS].index.tolist()

print(f"Assets with >= {MIN_MONTHS} months of data: {assets_with_enough_data}")

# assets_with_enough_data = ['AAPL', 'AMZN', 'MSFT', 'NVDA', 'QQQ']
# Filter the main DataFrame to only include these assets
portfolio_df = portfolio_df[portfolio_df.index.get_level_values('asset').isin(assets_with_enough_data)]

# CRITICAL: Check if any assets remain after filtering before proceeding.
if portfolio_df.empty:
    raise ValueError("No assets met the minimum data requirement. Cannot proceed with training.")

# Get unique timestamps for the filtered set of assets
unique_timestamps = portfolio_df.index.get_level_values('timestamp').unique().sort_values()

# Calculate the split point (2/3 for training, 1/3 for testing)
split_index = int(len(unique_timestamps) * (2/3))
split_date = unique_timestamps[split_index]

print(f"Total unique timestamps: {len(unique_timestamps)}")
print(f"Calculated split date for 2/3 train, 1/3 test: {split_date}")

# Split the DataFrame based on the calculated timestamp
train_df = portfolio_df[portfolio_df.index.get_level_values('timestamp') < split_date]
test_df = portfolio_df[portfolio_df.index.get_level_values('timestamp') >= split_date]

print(f"\nTraining data from {train_df.index.get_level_values('timestamp').min()} to {train_df.index.get_level_values('timestamp').max()}")
print(f"Testing data from {test_df.index.get_level_values('timestamp').min()} to {test_df.index.get_level_values('timestamp').max()}")

## Part 3: Train the PPO Agent for Multi-Asset Portfolio Management

PPO is well suited to our `PortfolioEnv` because the environment's action space is a continuous vector holding the target allocation for each asset. We will train it on our multi-asset portfolio.
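To make the "continuous action vector" concrete, here is a minimal sketch of how a diagonal Gaussian policy can produce one. It assumes a Gaussian policy head, which is a common choice for PPO but is not confirmed to be what `PPOAgent` uses internally; `gaussian_act`, `mean`, and `log_std` are hypothetical names used only for illustration.

```python
import math
import numpy as np

def gaussian_act(mean, log_std, rng, deterministic=False):
    """Sample (or take the mean of) a diagonal Gaussian policy.

    Returns the action vector and its log-probability under the policy.
    """
    mean = np.asarray(mean, dtype=float)
    log_std = np.asarray(log_std, dtype=float)
    std = np.exp(log_std)
    if deterministic:
        action = mean.copy()  # exploit: no sampling noise
    else:
        action = mean + std * rng.standard_normal(mean.shape)  # explore
    # Log-density of a diagonal Gaussian, summed over action dimensions.
    z = (action - mean) / std
    logp = -0.5 * np.sum(z ** 2 + 2.0 * log_std + math.log(2.0 * math.pi))
    return action, float(logp)

rng = np.random.default_rng(42)
mean = np.array([0.2, -0.1, 0.4])  # hypothetical policy output for 3 assets
log_std = np.zeros(3)              # unit standard deviation per dimension

greedy_action, greedy_logp = gaussian_act(mean, log_std, rng, deterministic=True)
sampled_action, sampled_logp = gaussian_act(mean, log_std, rng)
print(greedy_action, sampled_action)
```

The deterministic branch returns the mean, which is the highest-density action, so its log-probability is never lower than that of a sampled action.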

In [None]:
# --- NEW: Import the environment factory from our utility file ---
from ppo_training_utils import make_env_fn

# --- Import other necessary libraries ---
from rl_trading_project.agents import PPOAgent
from rl_trading_project.utils import ExperimentLogger
import torch
from tqdm.notebook import tqdm
import numpy as np
import os

# --- CRITICAL: Use the official Gymnasium Asynchronous Vectorized Environment ---
import gymnasium as gym

# --- High-Performance PPO Setup ---
SEED = 42
WINDOW_SIZE = 30
# --- NEW: Use all available CPU cores for true parallel execution ---
NUM_ENVS = os.cpu_count() or 4
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {DEVICE} with {NUM_ENVS} parallel environments.")


# --- Create the Asynchronous Vectorized Environment ---
# We pass the *imported* make_env_fn to the vectorizer.
env_fns = [make_env_fn(train_df, WINDOW_SIZE, seed=SEED + i) for i in range(NUM_ENVS)]
vec_env = gym.vector.AsyncVectorEnv(env_fns)


# --- Agent and Logger Setup ---
ppo_logger = ExperimentLogger(base_dir='runs/notebook_runs', exp_name='PPO_Parallel_RiskShaped_Notebook')
ppo_agent = PPOAgent(
    obs_dim=vec_env.observation_space.shape[1],  # Note the shape is (num_envs, obs_dim)
    action_dim=vec_env.action_space.shape[1], # Note the shape is (num_envs, action_dim)
    lr=3e-4,
    epochs=10,
    minibatch_size=128,
    entropy_coef=0.01,
    seed=SEED,
    device=DEVICE
)

# --- High-Performance PPO Training Loop with Reward Shaping ---
TRAIN_STEPS = 50000
ROLLOUT_LEN = 512
TERMINATION_PENALTY = 500.0
LEVERAGE_PENALTY_COEF = 0.1

print("Starting PARALLEL PPO training in Jupyter Notebook...")
obs, _ = vec_env.reset(seed=SEED)
trajectories = []
total_steps_done = 0

pbar = tqdm(total=TRAIN_STEPS, desc="Training PPO", unit="step")

while total_steps_done < TRAIN_STEPS:
    # Rollout Phase
    for _ in range(ROLLOUT_LEN):
        actions, logps, values = [], [], []
        for i in range(NUM_ENVS):
            action, logp, value = ppo_agent.act(obs[i], deterministic=False)
            actions.append(action)
            logps.append(logp)
            values.append(value)

        next_obs, rewards, terminated, truncated, infos = vec_env.step(actions)
        
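        # In Gymnasium releases before 1.0, "final_info" holds an entry only for environments
        # whose episode ended on this step (None otherwise), so both penalties below are
        # non-zero only on terminal steps.
        final_infos = infos.get("final_info", [None] * NUM_ENVS)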
        leverage_values = np.array([info.get('leverage', 0.0) if info else 0.0 for info in final_infos])
        margin_called_flags = np.array([info.get('margin_called', False) if info else False for info in final_infos])
        leverage_penalties = LEVERAGE_PENALTY_COEF * leverage_values
        termination_penalties = TERMINATION_PENALTY * margin_called_flags
        
        shaped_rewards = rewards - leverage_penalties - termination_penalties
        dones = np.logical_or(terminated, truncated)

        for i in range(NUM_ENVS):
            trajectories.append({
                'obs': obs[i], 'act': actions[i], 'rew': shaped_rewards[i],
                'done': dones[i], 'logp': logps[i], 'value': values[i]
            })

        obs = next_obs
        total_steps_done += NUM_ENVS
    
    steps_to_update = total_steps_done - pbar.n
    pbar.update(steps_to_update)

    # Update Phase
    stats = ppo_agent.update(trajectories)
    trajectories.clear()
    ppo_logger.log_metrics(stats, step=total_steps_done)
    
    pbar.set_postfix({
        "policy_loss": f"{stats.get('policy_loss', 0):.4f}",
        "value_loss": f"{stats.get('value_loss', 0):.4f}",
    })

pbar.close()
# --- NEW: It's critical to close the vectorized environment to terminate the child processes ---
vec_env.close()
print("\nParallel PPO Training finished!")
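
The reward shaping used in the training loop can be checked by hand on a few numbers. The sketch below reuses the two constants from the notebook; `shape_rewards` and the sample inputs are hypothetical and exist only to show the arithmetic.

```python
import numpy as np

LEVERAGE_PENALTY_COEF = 0.1   # same value as in the training cell
TERMINATION_PENALTY = 500.0   # same value as in the training cell

def shape_rewards(rewards, leverage, margin_called):
    """Subtract a leverage penalty and a margin-call penalty from raw rewards."""
    rewards = np.asarray(rewards, dtype=float)
    leverage = np.asarray(leverage, dtype=float)
    margin_called = np.asarray(margin_called, dtype=bool)
    # Boolean flags are promoted to 0.0 / 1.0 when multiplied by a float.
    return rewards - LEVERAGE_PENALTY_COEF * leverage - TERMINATION_PENALTY * margin_called

shaped = shape_rewards(
    rewards=[1.0, -2.0, 0.5],
    leverage=[0.0, 2.0, 3.0],
    margin_called=[False, False, True],
)
print(shaped)  # third entry is dominated by the margin-call penalty
```

With these constants a single margin call outweighs hundreds of ordinary step rewards of order one, so the penalty scale is worth revisiting if the environment's raw rewards are much larger or smaller.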

In [None]:
# --- NEW: Visualize Training Metrics ---
import pandas as pd
import matplotlib.pyplot as plt

metrics_df = pd.read_csv(ppo_logger.metrics_file)

fig, axes = plt.subplots(2, 1, figsize=(12, 10), sharex=True)

# Plot Losses
axes[0].plot(metrics_df['step'], metrics_df['policy_loss'], label='Policy Loss')
axes[0].plot(metrics_df['step'], metrics_df['value_loss'], label='Value Loss')
axes[0].set_ylabel('Loss')
axes[0].set_title('PPO Training Losses')
axes[0].legend()
axes[0].grid(True)

# Plot Gradient Norms
axes[1].plot(metrics_df['step'], metrics_df['policy_grad_norm'], label='Policy Grad Norm')
axes[1].plot(metrics_df['step'], metrics_df['value_grad_norm'], label='Value Grad Norm')
axes[1].set_xlabel('Training Steps')
axes[1].set_ylabel('Gradient Norm')
axes[1].set_title('PPO Gradient Norms')
axes[1].legend()
axes[1].grid(True)

plt.tight_layout()
plt.show()

## Part 4: Train the DQN Agent for Single-Asset Trading

Our `DuelingDQNAgent` is a value-based agent designed for problems with a **discrete action space**. It outputs a Q-value for each possible action (e.g., full short, half short, hold, half long, full long). This is fundamentally different from PPO, which can output a continuous action vector.
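As a rough illustration of the two ideas in this paragraph, the sketch below maps a discrete action index to a target position and combines a state value with per-action advantages, which is the standard dueling aggregation. It assumes evenly spaced bins on [-1, 1]; `bin_to_position` and `dueling_q_values` are hypothetical helpers, not functions from `rl_trading_project`.

```python
import numpy as np

def bin_to_position(action_index, action_bins=11):
    """Map a discrete action index to a target position in [-1, 1]."""
    grid = np.linspace(-1.0, 1.0, action_bins)
    return float(grid[action_index])

def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) using the mean-advantage baseline."""
    advantages = np.asarray(advantages, dtype=float)
    return state_value + advantages - advantages.mean()

# Index 0 is full short, the middle index is flat, the last index is full long.
positions = [bin_to_position(i) for i in (0, 5, 10)]

q = dueling_q_values(state_value=2.0, advantages=[1.0, 0.0, -1.0])
best = int(np.argmax(q))  # greedy action = index of the largest Q-value
print(positions, q, best)
```

Subtracting the mean advantage keeps the decomposition identifiable: adding a constant to every advantage leaves the resulting Q-values unchanged.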

Therefore, we cannot directly apply our DQN agent to the multi-asset `PortfolioEnv`. Instead, we will train it on a simplified, **single-asset** version of the environment using only the `QQQ` data. This is a common and powerful use case for DQN-style agents.

In [None]:
from rl_trading_project.agents import DuelingDQNAgent
from rl_trading_project.envs import TradingEnv, GymWrapper

# --- DQN Setup ---
# Filter the training data for our single asset
qqq_train_df = train_df.xs('QQQ', level='asset').reset_index()

# Create a single-asset environment
dqn_env = TradingEnv(
    df=qqq_train_df,
    window_size=WINDOW_SIZE,
    initial_balance=100_000,
    max_position=100.0, # Max position in units of the asset
    commission=0.0005
)
dqn_wrapped_env = GymWrapper(dqn_env)

# Instantiate the DQN agent
dqn_agent = DuelingDQNAgent(
    obs_dim=dqn_wrapped_env.observation_space.shape[0],
    action_bins=11, # Discretize action into 11 bins from -1 (full short) to +1 (full long)
    lr=5e-4,
    buffer_size=100_000,
    batch_size=128,
    seed=SEED
)

# --- DQN Training Loop ---
print("\nStarting DQN training...")
obs, _ = dqn_wrapped_env.reset(options={'start_index': WINDOW_SIZE})

for step in range(TRAIN_STEPS):
    action = dqn_agent.act(obs, deterministic=False)
    next_obs, reward, terminated, truncated, info = dqn_wrapped_env.step(action)
    done = terminated or truncated
    
    dqn_agent.add_experience(obs, action, reward, next_obs, done)
    stats = dqn_agent.update(sync_freq=200)
    
    obs = next_obs
    if done:
        obs, _ = dqn_wrapped_env.reset(options={'start_index': WINDOW_SIZE})
        
    if (step + 1) % 2000 == 0:
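        # Fall back to NaN if the agent has not reported a value yet (e.g. before its first update).
        loss = (stats or {}).get('loss', float('nan'))
        eps = (stats or {}).get('eps', float('nan'))
        print(f"Step {step+1}/{TRAIN_STEPS}, Loss={loss:.4f}, Epsilon={eps:.2f}")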

print("DQN Training finished!")

In [None]:
import os
import time

# --- Part 5a: Save Trained Agents to Disk ---
# We create a directory to store the model weights with descriptive names.
# This prevents overwriting previous results and helps with experiment tracking.

# --- NEW: Create a unique timestamped directory for this training run ---
run_timestamp = time.strftime("%Y%m%d-%H%M%S")
save_dir = f"saved_models/run_{run_timestamp}"
os.makedirs(save_dir, exist_ok=True)
print(f"Models for this run will be saved in: {save_dir}")

# --- NEW: More descriptive filenames ---
# For PPO, we can include key hyperparameters in the name.
ppo_run_name = f"PPO_MultiAsset_LR{ppo_agent.policy_optimizer.defaults['lr']}_Rollout{ROLLOUT_LEN}"
PPO_SAVE_PATH = os.path.join(save_dir, ppo_run_name)

# For DQN, the action space discretization is a key parameter.
dqn_run_name = f"DQN_SingleAsset_Bins{dqn_agent.action_bins}"
DQN_SAVE_PATH = os.path.join(save_dir, dqn_run_name)


print(f"\nSaving PPO agent to {PPO_SAVE_PATH}...")
# The .save() method is part of the base Agent class and handles creating the directory
# and saving the necessary model files (e.g., policy.pth, value.pth).
ppo_agent.save(PPO_SAVE_PATH)

print(f"Saving DQN agent to {DQN_SAVE_PATH}...")
# Similarly, the DQN agent saves its Q-network and target network weights.
dqn_agent.save(DQN_SAVE_PATH)

print("\nAgents saved successfully.")

# --- NEW: Save the paths for the next cell ---
# This makes loading much easier and less prone to typos.
# We'll store them in a dictionary for clarity.
saved_model_paths = {
    'ppo': PPO_SAVE_PATH,
    'dqn': DQN_SAVE_PATH
}

## Part 5b: Load Agents and Run Backtests

Now for the moment of truth. To simulate a real-world workflow where training and evaluation are separate, we first load the saved agents from disk. We then use the `Backtester` module to run both trained agents on the out-of-sample test data (the final third of the timeline). Evaluation always uses `deterministic=True` so that each agent exploits its learned policy without random exploration.
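For a value-based agent, the difference between exploratory and deterministic action selection can be sketched with epsilon-greedy. This is an assumption about how a DQN-style agent typically explores, not a description of `DuelingDQNAgent` internals; `select_action`, `linear_epsilon`, and the schedule parameters are hypothetical.

```python
import numpy as np

def select_action(q_values, epsilon, rng, deterministic=False):
    """Epsilon-greedy selection; deterministic=True always takes the argmax."""
    q_values = np.asarray(q_values, dtype=float)
    if not deterministic and rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: best known action

def linear_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

rng = np.random.default_rng(0)
q_values = [0.1, 0.9, 0.3]

# Even with epsilon = 1.0, the deterministic path ignores exploration.
eval_action = select_action(q_values, epsilon=1.0, rng=rng, deterministic=True)
train_action = select_action(q_values, epsilon=linear_epsilon(5_000), rng=rng)
print(eval_action, train_action)
```

Evaluating with the greedy path makes backtest results repeatable for a fixed set of weights, which is why exploration is switched off outside training.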

In [None]:
import os
import glob
from rl_trading_project.agents import PPOAgent, DuelingDQNAgent
from rl_trading_project.envs import PortfolioEnv, TradingEnv, GymWrapper
from rl_trading_project.trainers import Backtester, compare_strategies
import pandas as pd
import matplotlib.pyplot as plt

# --- 1. Locate or Discover Saved Model Paths ---
# This logic makes the cell runnable in a new session.

# Prefer the 'saved_model_paths' dict from the current session; otherwise fall back to the file system.
session_paths = globals().get('saved_model_paths')
if isinstance(session_paths, dict):
    print("Using model paths from the current training session.")
    PPO_SAVE_PATH = session_paths['ppo']
    DQN_SAVE_PATH = session_paths['dqn']
else:
    print("Could not find model paths in session. Searching the file system for the latest run...")
    base_dir = 'saved_models'

    # Find the most recent 'run_*' directory
    list_of_runs = glob.glob(os.path.join(base_dir, 'run_*'))
    if not list_of_runs:
        raise FileNotFoundError(f"No run directories found in '{base_dir}'. Please run the training and saving cell first.")

    latest_run_dir = max(list_of_runs, key=os.path.getctime)
    print(f"Found latest run directory: {latest_run_dir}")

    # Find the PPO and DQN model paths within the latest run directory
    ppo_paths = glob.glob(os.path.join(latest_run_dir, 'PPO_*'))
    dqn_paths = glob.glob(os.path.join(latest_run_dir, 'DQN_*'))

    if not ppo_paths or not dqn_paths:
        raise FileNotFoundError(f"Could not find both a PPO and a DQN model in '{latest_run_dir}'.")

    PPO_SAVE_PATH = ppo_paths[0]
    DQN_SAVE_PATH = dqn_paths[0]


# --- 2. Re-create Environments and Load Agents ---
print(f"\nLoading PPO model from: {PPO_SAVE_PATH}")
print(f"Loading DQN model from: {DQN_SAVE_PATH}")

SEED = 42
WINDOW_SIZE = 30

print("\nRe-creating environment templates to define agent architectures...")
ppo_template_env = GymWrapper(PortfolioEnv(df=test_df, window_size=WINDOW_SIZE))
qqq_test_df = test_df.xs('QQQ', level='asset').reset_index()
dqn_template_env = GymWrapper(TradingEnv(df=qqq_test_df, window_size=WINDOW_SIZE))

ppo_agent_loaded = PPOAgent(
    obs_dim=ppo_template_env.observation_space.shape[0],
    action_dim=ppo_template_env.action_space.shape[0],
    seed=SEED
)
ppo_agent_loaded.load(PPO_SAVE_PATH)

dqn_agent_loaded = DuelingDQNAgent(
    obs_dim=dqn_template_env.observation_space.shape[0],
    action_bins=11,
    seed=SEED
)
dqn_agent_loaded.load(DQN_SAVE_PATH)

print("\nAgents loaded and ready for backtesting.")


# --- 3. PPO Backtest (Multi-Asset) ---
def ppo_policy_fn(obs, t):
    action, _, _ = ppo_agent_loaded.act(obs, deterministic=True)
    return action

ppo_test_env_factory = lambda: GymWrapper(PortfolioEnv(df=test_df, window_size=WINDOW_SIZE, initial_balance=100_000))

print("\nRunning PPO backtest...")
ppo_backtester = Backtester(ppo_test_env_factory, start_index=WINDOW_SIZE)
max_ppo_steps = len(test_df.index.get_level_values('timestamp').unique()) - WINDOW_SIZE - 1
ppo_results = ppo_backtester.run(ppo_policy_fn, max_steps=max_ppo_steps)


# --- 4. DQN Backtest (Single-Asset on QQQ) ---
def dqn_policy_fn(obs, t):
    action = dqn_agent_loaded.act(obs, deterministic=True)
    return action

dqn_test_env_factory = lambda: GymWrapper(TradingEnv(df=qqq_test_df, window_size=WINDOW_SIZE, initial_balance=100_000))

print("Running DQN backtest...")
dqn_backtester = Backtester(dqn_test_env_factory, start_index=WINDOW_SIZE)
max_dqn_steps = len(qqq_test_df) - WINDOW_SIZE - 1
dqn_results = dqn_backtester.run(dqn_policy_fn, max_steps=max_dqn_steps)


# --- 5. Comparison and Plotting ---
comparison = compare_strategies({
    'PPO_MultiAsset': ppo_results,
    'DQN_SingleAsset': dqn_results
})

print("\n--- Backtest Comparison Summary ---")
summary_df = pd.DataFrame(comparison).T[['total_return', 'sharpe_ratio', 'max_drawdown', 'end_value']]
summary_df['total_return'] = summary_df['total_return'].apply(lambda x: f"{x:.2%}")
summary_df['max_drawdown'] = summary_df['max_drawdown'].apply(lambda x: f"{x:.2%}")
print(summary_df)

ppo_history_df = pd.DataFrame(ppo_results['history'])
dqn_history_df = pd.DataFrame(dqn_results['history'])

fig, ax = plt.subplots(figsize=(15, 8))
ax.plot(ppo_history_df['total_value'], label='PPO (Multi-Asset Portfolio)', lw=2)
ax.plot(dqn_history_df['total_value'], label='DQN (Single-Asset: QQQ)', lw=2)
ax.set_title('Agent Equity Curves (Out-of-Sample)')
ax.set_xlabel('Test Steps')
ax.set_ylabel('Portfolio Value ($)')
ax.legend()
ax.grid(True)
plt.show()
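
The summary columns printed above (`total_return`, `sharpe_ratio`, `max_drawdown`, `end_value`) can be understood through a small worked example on a toy equity curve. The definitions below are common conventions and only an assumption about what `compare_strategies` computes: the Sharpe ratio uses a zero risk-free rate and 252 periods per year, and drawdown is reported as a negative fraction. `backtest_metrics` is a hypothetical helper.

```python
import numpy as np

def backtest_metrics(equity, periods_per_year=252):
    """Compute basic performance metrics from a series of portfolio values."""
    equity = np.asarray(equity, dtype=float)
    returns = equity[1:] / equity[:-1] - 1.0           # simple per-step returns
    total_return = equity[-1] / equity[0] - 1.0
    std = returns.std(ddof=1)
    # Annualized Sharpe ratio with a zero risk-free rate (assumption).
    sharpe = float(np.sqrt(periods_per_year) * returns.mean() / std) if std > 0 else 0.0
    running_peak = np.maximum.accumulate(equity)       # highest value seen so far
    max_drawdown = float(np.min(equity / running_peak - 1.0))
    return {
        "total_return": float(total_return),
        "sharpe_ratio": sharpe,
        "max_drawdown": max_drawdown,
        "end_value": float(equity[-1]),
    }

# Toy curve: rises 10%, falls back to 10% below the peak, then partly recovers.
metrics = backtest_metrics([100_000, 110_000, 99_000, 104_500])
print(metrics)
```

Because the two agents trade different universes (a multi-asset portfolio versus QQQ alone), such metrics describe each strategy on its own task rather than ranking the algorithms on a like-for-like basis.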

## Conclusion

This notebook demonstrated the full workflow: from raw data ingestion to training multiple, distinct RL agents and comparing their performance. 

The **PPO agent** was trained to manage a portfolio of correlated assets, leveraging its ability to output continuous allocation vectors. The backtest summary and equity curve above show how well that policy holds up out of sample.

The **DQN agent**, while not suitable for the multi-asset task in its current form, fits a single-asset trading problem with a discretized action set (11 position levels from full short to full long).

This highlights a key principle in applied RL: choosing the right agent architecture for the specific problem and action space is critical for success.