Intermediate Deep Learning - Deep Reinforcement Learning
# Project: Deep Reinforcement Learning in Trading Environment

## Overview
In this project, you will implement and evaluate a deep reinforcement learning (DRL) agent in a trading environment. The goal is to train an agent that can make profitable trading decisions based on historical market data.

You are encouraged to experiment with different DRL algorithms, architectures, and hyperparameters to optimize the agent's performance.

### Deliverables

You are required to prepare:

- Code implementation of the DRL agent(s) and training process. Can be in the form of Jupyter Notebooks or Python scripts.
- A report (4 pages max) summarizing your approach, results, and insights gained from the project, including:  
    - Description of the DRL algorithm(s) used.
    - Training process and hyperparameter choices.
    - Challenges faced and how they were addressed.

    Justify your design choices. You are also encouraged to include visualizations of the agent's performance over time, or compare strategies developed with DRL against financial benchmarks.
- An evaluation results CSV file `evaluation_results.csv` generated by your agent after training using the provided evaluation function.
- A presentation (15 minutes) to showcase your work, findings, and any interesting observations.

*The documents are to be submitted in a zip file, before 16/11/2025 11:59PM, and the presentation is scheduled for next session.*

### Environment
We will use Gym Trading Env as our trading environment. (https://gym-trading-env.readthedocs.io/en/latest/)

- This is a gymnasium-compatible environment designed to simulate trading (stocks or crypto) from historical market data.
- Its goal is to provide a fast and customizable platform for training RL agents in a trading scenario.

We will use BTC/USDT hour step historical data from Binance for training and evaluation. The agent will be evaluated on the period from 2025-10-01 to 2025-11-01.

Following code blocks demonstrate how to set up the environment and evaluate your agent.

### Grading Criteria
- Implementation of the DRL agent and training process (40%)
    - DRL algorithm correctly implemented (15%)
    - Appropriate training procedure (15%)
    - Effective use of hyperparameters (10%)
    - 10 % bonus for innovative approaches or techniques
- Performance of the agent based on evaluation metrics (30%)
    - If the agent shows progress during training (10%)
    - If the agent outperforms a random strategy during evaluation (10%)
    - If the portfolio return exceeds market return (10%)
    - 30%, 20%, 10% bonus for the top 3 agents respectively
- Quality and clarity of the report (20%)
    - Clear explanation of methods and results (10%)
    - Justification of design choices (10%)
- Presentation (20%)





---

## Environment Setup

### Install Required Packages

In [None]:
import sys
print(sys.executable)

!"{sys.executable}" -m pip install gym-trading-env torch

/usr/local/bin/python3

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


In [184]:
import numpy as np
import pandas as pd
import gymnasium as gym
import gym_trading_env
from gym_trading_env.downloader import download
from pathlib import Path
import matplotlib.pyplot as plt
import time
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler


### Prepare Data Set

- Create a data folder to store historical data
- Download historical data for BTC/USDT from Binance using the provided utility function.
- Preprocess the data to create features. The features (plus two dynamic features: last position taken by the agent, and the current real position) are the state of the environment at each time step.  
    *(You can add more features if you want to experiment with different state representations.)*
- Select training and evaluation data based on the specified date ranges.  
    *(You can modify the training range if you want to experiment with different time periods, however, keep in mind the evaluation period should always be after the training period.)*

```python

In [185]:

# --- Setup Folders ---
data_folder = Path("data/")
data_folder.mkdir(parents=True, exist_ok=True)
eval_folder = Path("eval/")
eval_folder.mkdir(parents=True, exist_ok=True)

"""
download(exchange_names = ["binance"],
    symbols= ["BTC/USDT"],
    timeframe= "1h",
    dir = data_folder,
    since= datetime.datetime(year= 2020, month=10, day=1),
)"""

# Import your fresh data
# Assuming 'binance-BTCUSDT-1h.pkl' exists from the download step
df = pd.read_pickle(data_folder / "binance-BTCUSDT-1h.pkl")

""" Preprocess the data to create features """
# Create the feature : ( close[t] - close[t-1] )/ close[t-1]
df["feature_close"] = df["close"].pct_change()

# Create the feature : open[t] / close[t]
df["feature_open"] = df["open"]/df["close"]

# Create the feature : high[t] / close[t]
df["feature_high"] = df["high"]/df["close"]

# Create the feature : low[t] / close[t]
df["feature_low"] = df["low"]/df["close"]

 # Create the feature : volume[t] / max(*volume[t-7*24:t+1])
df["feature_volume"] = df["volume"] / df["volume"].rolling(7*24).max()

# --- New attributs ---
# Sharp ratio
ANNUALIZATION_FACTOR = 24 * 365 # 8760 hours in a year
ROLLING_WINDOW = 7 * 24          # 168 hours (7 days)
RISK_FREE_RATE_ANNUAL = 0.04     # Placeholder: 4.0% annual risk-free rate
# Convert annual R_f to hourly R_f: (1 + R_f^ann)^(1/T) - 1
RISK_FREE_RATE_HOURLY = (1 + RISK_FREE_RATE_ANNUAL)**(1/ANNUALIZATION_FACTOR) - 1
df['Excess_Return'] = df['feature_close'] - RISK_FREE_RATE_HOURLY

rolling_mean_excess = df['Excess_Return'].rolling(window=ROLLING_WINDOW).mean()
rolling_std_excess = df['Excess_Return'].rolling(window=ROLLING_WINDOW).std()
# Sharpe Ratio = (Rolling Mean / Rolling Std Dev) * sqrt(T)
df['Rolling_Sharpe_Ratio'] = ( rolling_mean_excess / rolling_std_excess) * np.sqrt(ANNUALIZATION_FACTOR)

# Moving Average Convergence Divergence (MACD)
df['EMA_12'] = df['close'].ewm(span=12, adjust=False).mean()
df['EMA_26'] = df['close'].ewm(span=26, adjust=False).mean()
df['MACD'] = df['EMA_12'] - df['EMA_26']
df['MACD_Signal'] = df['MACD'].ewm(span=9, adjust=False).mean()
# Create features normalized by close price
df['feature_MACD'] = df['MACD'] / df['close']
df['feature_MACD_Signal'] = df['MACD_Signal'] / df['close']


# Bollinger Bands

ROLLING_WINDOW_BB = 20
df['BB_Middle'] = df['close'].rolling(window=ROLLING_WINDOW_BB).mean()
df['BB_Std'] = df['close'].rolling(window=ROLLING_WINDOW_BB).std()
df['BB_Upper'] = df['BB_Middle'] + (df['BB_Std'] * 2)
df['BB_Lower'] = df['BB_Middle'] - (df['BB_Std'] * 2)
# Create features relative to the close price
df['feature_BB_Upper'] = df['BB_Upper'] / df['close']
df['feature_BB_Lower'] = df['BB_Lower'] / df['close']


# On-Balance Volume (OBV)
df['OBV'] = (np.sign(df['close'].diff()) * df['volume']).cumsum().fillna(0)
# Normalize OBV (e.g., divide by a rolling max, similar to volume)
df['feature_OBV'] = df['OBV'] / df['OBV'].rolling(7*24).max()


# --- Final Cleanup ---
# Add all intermediate calculation columns to this list to drop them
cols_to_drop = [
    "Excess_Return", "EMA_12", "EMA_26", "MACD", "MACD_Signal",
    "BB_Middle", "BB_Std", "BB_Upper", "BB_Lower", "OBV","date_close"
]
df = df.drop(columns=cols_to_drop)

df.dropna(inplace= True) # Clean again!

# Your final feature set now includes:
# "feature_close", "feature_open", "feature_high", "feature_low", "feature_volume",
# "Rolling_Sharpe_Ratio", "feature_MACD", "feature_MACD_Signal",
# "feature_BB_Upper", "feature_BB_Lower", "feature_OBV"

# --- Data Splitting ---
df_train = df.loc['2024-10-01':'2025-09-30'] # Training data
df_eval = df.loc['2025-10-01':'2025-11-01'] # Evaluation data

print("Data preprocessing complete.")
print(f"Training data shape: {df_train.shape}")
print(f"Evaluation data shape: {df_eval.shape}")
print("\nFinal DataFrame columns:")
print(df.columns.to_list())

Data preprocessing complete.
Training data shape: (8760, 16)
Evaluation data shape: (768, 16)

Final DataFrame columns:
['open', 'high', 'low', 'close', 'volume', 'feature_close', 'feature_open', 'feature_high', 'feature_low', 'feature_volume', 'Rolling_Sharpe_Ratio', 'feature_MACD', 'feature_MACD_Signal', 'feature_BB_Upper', 'feature_BB_Lower', 'feature_OBV']


In [186]:
print(df_train.shape)
df_train.head(5)

(8760, 16)


Unnamed: 0_level_0,open,high,low,close,volume,feature_close,feature_open,feature_high,feature_low,feature_volume,Rolling_Sharpe_Ratio,feature_MACD,feature_MACD_Signal,feature_BB_Upper,feature_BB_Lower,feature_OBV
date_open,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
2024-10-01 00:00:00,63327.6,63606.0,63006.7,63531.99,1336.93335,0.003228,0.996783,1.001165,0.991732,0.287876,1.640393,-0.006277,-0.006823,1.016025,0.992171,0.991183
2024-10-01 01:00:00,63532.0,63639.86,63370.01,63458.0,1004.08763,-0.001165,1.001166,1.002866,0.998613,0.216205,1.358962,-0.006059,-0.006677,1.015258,0.993596,0.990114
2024-10-01 02:00:00,63458.0,63458.0,63180.0,63443.76,716.11822,-0.000224,1.000224,1.000224,0.995843,0.154198,0.856832,-0.005833,-0.006509,1.013925,0.993942,0.989351
2024-10-01 03:00:00,63443.76,63744.0,63430.0,63723.48,822.21265,0.004409,0.99561,1.000322,0.995394,0.177043,1.563768,-0.005214,-0.006227,1.006489,0.991249,0.990227
2024-10-01 04:00:00,63723.47,63879.81,63652.06,63868.94,778.75286,0.002283,0.997722,1.00017,0.996604,0.167685,2.021614,-0.004497,-0.00587,1.00235,0.990145,0.991056


In [187]:
print(df_eval.shape)
df_eval.head(5)

(768, 16)


Unnamed: 0_level_0,open,high,low,close,volume,feature_close,feature_open,feature_high,feature_low,feature_volume,Rolling_Sharpe_Ratio,feature_MACD,feature_MACD_Signal,feature_BB_Upper,feature_BB_Lower,feature_OBV
date_open,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
2025-10-01 00:00:00,114048.94,114308.0,113966.67,114239.53,434.59016,0.001671,0.998332,1.000599,0.997612,0.149874,3.203921,0.002135,0.001749,1.003953,0.984844,0.996682
2025-10-01 01:00:00,114239.53,114550.0,114142.99,114549.99,597.2536,0.002718,0.99729,1.0,0.996447,0.205971,3.335956,0.002428,0.001881,1.002157,0.981814,0.997216
2025-10-01 02:00:00,114549.99,114551.76,114272.15,114272.15,508.42422,-0.002425,1.002431,1.002447,1.0,0.175337,3.446609,0.002447,0.001998,1.005022,0.984106,0.996761
2025-10-01 03:00:00,114272.16,114530.48,114096.58,114176.92,502.30318,-0.000833,1.000834,1.003097,0.999296,0.173226,4.003348,0.002365,0.002073,1.00627,0.984935,0.996312
2025-10-01 04:00:00,114176.93,114700.0,114151.0,114289.01,597.89328,0.000982,0.999019,1.003596,0.998792,0.206191,3.423206,0.002348,0.002126,1.005869,0.984117,0.996847


### Setting up the Trading Environment

We use the `df_train` DataFrame for training and `df_eval` DataFrame for evaluation.

The `positions` parameter defines the discrete positions the agent can take, it is a list containing possible position values. A position value corresponds to the ratio of the portfolio valuation engaged in the position ( > 0 to bet on the rise, < 0 to bet on the decrease)

- if `position < 0` : the agent is shorting the asset
- if `position = 0` : the agent is out of the market
- if `position > 0` : the agent is longing the asset
- if `position = 1` : the agent is fully invested in the asset
- if `position > 1` : the agent is using leverage to invest more than its portfolio valuation in the asset

You are free to modify the `positions` list to experiment with different position options for the agent.


In [188]:
POSITIONS = [-1, -0.75, -0.5, -0.25, 0, 0.25, 0.5, 0.75, 1]

env_train = gym.make("TradingEnv",
        name= "BTCUSD",
        df = df_train, # Your dataset with your custom features
        positions = POSITIONS,
        trading_fees = 0.01/100, # 0.01% per stock buy / sell (Binance fees)
        borrow_interest_rate= 0.0003/100, # 0.0003% per timestep (one timestep = 1h here)
    )

env_eval = gym.make("TradingEnv",
        name= "BTCUSD",
        df = df_eval, # Your dataset with your custom features
        positions = POSITIONS,
        trading_fees = 0.01/100, # 0.01% per stock buy / sell (Binance fees)
        borrow_interest_rate= 0.0003/100, # 0.0003% per timestep (one timestep = 1h here)
    )

In [189]:
def evaluate_agent(agent, env, num_episodes=20, max_steps=None, render=False, csv_path="evaluation_results.csv", renderer_logs_dir="render_logs"):
    """
    Evaluate the agent on the environment for a number of episodes.
    """
    results = []
    
    # Ensure render dir exists
    if render:
        Path(renderer_logs_dir).mkdir(parents=True, exist_ok=True)

    for ep in range(num_episodes):
        obs, info = env.reset()
        done = False
        truncated = False
        step = 0
        reward_total = 0.0
        while not done and not truncated:
            action = agent.choose_action_eval(obs)
            obs, reward, done, truncated, info = env.step(action)
            reward_total += reward
            step += 1
            if (max_steps is not None) and (step >= max_steps):
                break

        metrics = env.get_metrics()
        port_ret = float(metrics["Portfolio Return"].strip('%')) / 100.0
        market_ret = float(metrics["Market Return"].strip('%')) / 100.0

        results.append({
            "episode": ep + 1,
            "portfolio_return": port_ret,
            "market_return": market_ret,
            "excess_return": port_ret - market_ret,
            "steps": step,
            "total_reward": reward_total,
        })
        
        if render:
            print(f"Eval Episode {ep+1}: Total Reward: {reward_total:.2f}, Portfolio Return: {port_ret:.2%}, Market Return: {market_ret:.2%}, Excess Return: {(port_ret - market_ret):.2%}, Steps: {step}")
            time.sleep(1)
            env.save_for_render(dir=renderer_logs_dir)

    df_results = pd.DataFrame(results)
    
    # Ensure the directory for the CSV exists
    Path(csv_path).parent.mkdir(parents=True, exist_ok=True)
    df_results.to_csv(csv_path, index=False)
    print(f"Saved evaluation results to {csv_path}")

    return df_results

# Random agent
We are going to create and test a random agent. This agent will be a baseline

In [190]:
class RandomAgent:
    def __init__(self, action_space):
        self.action_space = action_space

    def act(self, observation):
        return self.action_space.sample()

    def choose_action_eval(self, state):
        return self.action_space.sample()

In [191]:
# Create a random agent for evaluation
agent = RandomAgent(env_eval.action_space)

# Evaluate the trained agent
df_results = evaluate_agent(agent, env_eval, num_episodes=20, render=True, csv_path=eval_folder / "evaluation_results.csv", renderer_logs_dir=eval_folder / "render_logs")

Market Return : -3.63%   |   Portfolio Return : -13.70%   |   
Eval Episode 1: Total Reward: -0.15, Portfolio Return: -13.70%, Market Return: -3.63%, Excess Return: -10.07%, Steps: 767
Market Return : -3.63%   |   Portfolio Return :  0.12%   |   
Eval Episode 2: Total Reward: 0.00, Portfolio Return: 0.12%, Market Return: -3.63%, Excess Return: 3.75%, Steps: 767
Market Return : -3.63%   |   Portfolio Return : -7.59%   |   
Eval Episode 3: Total Reward: -0.08, Portfolio Return: -7.59%, Market Return: -3.63%, Excess Return: -3.96%, Steps: 767
Market Return : -3.63%   |   Portfolio Return : -12.16%   |   
Eval Episode 4: Total Reward: -0.13, Portfolio Return: -12.16%, Market Return: -3.63%, Excess Return: -8.53%, Steps: 767
Market Return : -3.63%   |   Portfolio Return : -7.04%   |   
Eval Episode 5: Total Reward: -0.07, Portfolio Return: -7.04%, Market Return: -3.63%, Excess Return: -3.41%, Steps: 767
Market Return : -3.63%   |   Portfolio Return : -3.43%   |   
Eval Episode 6: Total Rewa

# PPO agent
We are going to use a PPO agent with the new features we added. 

In [192]:
class ActorCriticNetwork(nn.Module):
    def __init__(self, state_size, n_actions, hidden_size=128, n_layers=2): 
        super().__init__()
        
        layers = []
        
        # Input layer
        layers.append(nn.Linear(state_size, hidden_size))
        layers.append(nn.Tanh())
        
        # Add (n_layers - 1) hidden layers
        for _ in range(n_layers - 1):
            layers.append(nn.Linear(hidden_size, hidden_size))
            layers.append(nn.Tanh())
            
        # Create the sequential shared network
        self.shared = nn.Sequential(*layers)
        
        # Actor and Critic heads
        self.actor = nn.Linear(hidden_size, n_actions)
        self.critic = nn.Linear(hidden_size, 1)

    def forward(self, x):
        features = self.shared(x)
        logits = self.actor(features)
        action_probs = torch.softmax(logits, dim=-1)
        state_value = self.critic(features)
        return action_probs, state_value

class PPOAgent:
    def __init__(
        self, state_size, n_actions,
        lr=3e-4, gamma=0.99, gae_lambda=0.95,
        entropy_beta=0.01, clip_epsilon=0.2, ppo_epochs=10, batch_size=64,
        hidden_size=128,
        n_layers=2  # <<< --- 1. ADD THIS (with a default)
    ):
        
        # Hyperparameters
        self.lr = lr
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.entropy_beta = entropy_beta
        self.clip_epsilon = clip_epsilon
        self.ppo_epochs = ppo_epochs
        self.batch_size = batch_size

        # Environment parameters
        self.state_size = state_size
        self.n_actions = n_actions

        # Device configuration
        if torch.backends.mps.is_available():
            self.device = torch.device("mps")  
        else:
            self.device = torch.device("cpu")

        # Create policy network
        self.network = ActorCriticNetwork(
            state_size, 
            n_actions, 
            hidden_size, 
            n_layers  
        ).to(self.device)

        # Optimizer
        self.optimizer = optim.Adam(self.network.parameters(), lr=lr)

        # Memory buffers
        self.reset_memory()

    def reset_memory(self):
        """Clear rollout buffers."""
        self.states = []
        self.actions = []
        self.rewards = []
        self.values = []
        self.dones = []
        self.log_probs = []

    def get_action_value_logprob(self, state):
        """
        Samples an action for the training loop.
        Returns the action, its value, and log probability.
        """
        state_tensor = torch.tensor(state, dtype=torch.float32, device=self.device).unsqueeze(0)

        with torch.no_grad():
            probs, value = self.network(state_tensor)

        dist = torch.distributions.Categorical(probs=probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)

        return action.item(), value.item(), log_prob.item()

    def choose_action_eval(self, state):
        """
        Chooses the best action for evaluation (deterministic).
        Returns only the action index.
        """
        state_tensor = torch.tensor(state, dtype=torch.float32, device=self.device).unsqueeze(0)
        
        with torch.no_grad():
            probs, _ = self.network(state_tensor)
        
        action = torch.argmax(probs, dim=-1)
        
        return action.item()

    def store(self, state, action, reward, value, done, log_prob):
        """Store a single transition in memory."""
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
        self.values.append(value)
        self.dones.append(done)
        self.log_probs.append(log_prob)

    def compute_gae(self, next_value):
        """
        Compute returns and advantages using GAE (Generalized Advantage Estimation)
        """
        rewards = np.array(self.rewards, dtype=np.float32)
        values = np.array(self.values + [next_value], dtype=np.float32)
        dones = np.array(self.dones, dtype=np.float32)

        T = len(rewards)
        returns = np.zeros(T, dtype=np.float32)
        advantages = np.zeros(T, dtype=np.float32)

        gae = 0.0
        for t in reversed(range(T)):
            delta = rewards[t] + self.gamma * values[t + 1] * (1.0 - dones[t]) - values[t]
            gae = delta + self.gamma * self.gae_lambda * (1.0 - dones[t]) * gae
            advantages[t] = gae
            returns[t] = advantages[t] + values[t]

        return returns, advantages

    def update(self, next_value):
        """Perform one PPO update step."""
        if len(self.states) == 0:
            return {"actor_loss": 0.0, "critic_loss": 0.0}

        returns, advantages = self.compute_gae(next_value)

        # Convert to tensors
        states = torch.tensor(np.array(self.states), dtype=torch.float32, device=self.device)
        actions = torch.tensor(np.array(self.actions), dtype=torch.int64, device=self.device)
        returns = torch.tensor(returns, dtype=torch.float32, device=self.device)
        advantages = torch.tensor(advantages, dtype=torch.float32, device=self.device)
        old_log_probs = torch.tensor(np.array(self.log_probs), dtype=torch.float32, device=self.device)

        # Normalize advantages
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        total_actor_loss = 0
        total_critic_loss = 0
        updates = 0
        
        for _ in range(self.ppo_epochs):
            indices = torch.randperm(len(states))
            
            for start in range(0, len(states), self.batch_size):
                end = start + self.batch_size
                idx = indices[start:end]
                
                if len(idx) == 0:
                    continue

                batch_states = states[idx]
                batch_actions = actions[idx]
                batch_old_log_probs = old_log_probs[idx]
                batch_returns = returns[idx]
                batch_advantages = advantages[idx]
                
                # Forward pass
                action_probs, values = self.network(batch_states)
                dist = torch.distributions.Categorical(action_probs)
                log_probs = dist.log_prob(batch_actions)
                entropy = dist.entropy().mean()
                
                # PPO loss computation
                ratio = torch.exp(log_probs - batch_old_log_probs)
                surr1 = ratio * batch_advantages
                surr2 = torch.clamp(ratio, 1 - self.clip_epsilon, 1 + self.clip_epsilon) * batch_advantages
                actor_loss = -torch.min(surr1, surr2).mean()
                
                critic_loss = (batch_returns - values.squeeze()).pow(2).mean()
                
                loss = actor_loss + 0.5 * critic_loss - self.entropy_beta * entropy
                
                self.optimizer.zero_grad()
                loss.backward()
                torch.nn.utils.clip_grad_norm_(self.network.parameters(), 0.5)
                self.optimizer.step()
                
                total_actor_loss += actor_loss.item()
                total_critic_loss += critic_loss.item()
                updates += 1
        
        self.reset_memory()
        
        if updates == 0:
            return {"actor_loss": 0.0, "critic_loss": 0.0}

        return {
            'actor_loss': total_actor_loss / updates,
            'critic_loss': total_critic_loss / updates
        }

In [180]:
# Get state and action dimensions from the environment
state_size = env_eval.observation_space.shape[0]
n_actions = env_eval.action_space.n

print(f"State size: {state_size}")
print(f"Number of actions: {n_actions}")

# Create the agent with the correct dimensions
agent = PPOAgent(state_size=state_size, n_actions=n_actions)

# Evaluate the (untrained) agent
# This will now run without errors
df_results = evaluate_agent(agent, env_eval, num_episodes=20, render=True, csv_path=eval_folder / "evaluation_results.csv", renderer_logs_dir=eval_folder / "render_logs")

State size: 12
Number of actions: 9
Market Return : -3.63%   |   Portfolio Return :  3.61%   |   
Eval Episode 1: Total Reward: 0.04, Portfolio Return: 3.61%, Market Return: -3.63%, Excess Return: 7.24%, Steps: 767
Market Return : -3.63%   |   Portfolio Return :  3.62%   |   
Eval Episode 2: Total Reward: 0.04, Portfolio Return: 3.62%, Market Return: -3.63%, Excess Return: 7.25%, Steps: 767
Market Return : -3.63%   |   Portfolio Return :  3.61%   |   
Eval Episode 3: Total Reward: 0.04, Portfolio Return: 3.61%, Market Return: -3.63%, Excess Return: 7.24%, Steps: 767
Market Return : -3.63%   |   Portfolio Return :  3.62%   |   
Eval Episode 4: Total Reward: 0.04, Portfolio Return: 3.62%, Market Return: -3.63%, Excess Return: 7.25%, Steps: 767
Market Return : -3.63%   |   Portfolio Return :  3.62%   |   
Eval Episode 5: Total Reward: 0.04, Portfolio Return: 3.62%, Market Return: -3.63%, Excess Return: 7.25%, Steps: 767
Market Return : -3.63%   |   Portfolio Return :  3.60%   |   
Eval Ep

We observe that the PPO agent outperformed the market. Let's optimize the hyperparameters now. 

# Hyperparameter optimization with Optuna 

In [193]:

# --- Optuna Objective Function ---
def objective(trial):
    """
    Defines the objective for Optuna to optimize.
    A "trial" consists of training and evaluating an agent with a specific
    set of hyperparameters.
    """
    # --- 1. Suggest Hyperparameters ---
    # We define the search space for each hyperparameter.
    ppo_hps = {
        "lr": trial.suggest_float("lr", 1e-5, 1e-3, log=True),
        "gamma": trial.suggest_categorical("gamma", [0.98, 0.99, 0.995, 0.999]),
        "gae_lambda": trial.suggest_float("gae_lambda", 0.8, 0.999),
        "entropy_beta": trial.suggest_float("entropy_beta", 1e-5, 0.1, log=True),
        "clip_epsilon": trial.suggest_float("clip_epsilon", 0.1, 0.3),
        "ppo_epochs": trial.suggest_int("ppo_epochs", 2, 20),
        "batch_size": trial.suggest_categorical("batch_size", [32, 64, 128, 256, 512, 1024]),
        "hidden_size": trial.suggest_categorical("hidden_size", [64, 128, 256, 512]),
        "n_layers": trial.suggest_int("n_layers", 1, 3)
    }

    # --- 2. Training Hyperparameters ---
    TOTAL_TIMESTEPS = 200_000  # Reduced for faster trials, increase for better results
    ROLLOUT_STEPS = 2048
    
    # --- 3. Initialize Agent ---
    agent = PPOAgent(
        state_size=state_size,
        n_actions=n_actions,
        **ppo_hps
    )

    # --- 4. Training Loop ---
    obs, info = env_train.reset()
    for step in range(1, TOTAL_TIMESTEPS + 1):
        action, value, log_prob = agent.get_action_value_logprob(obs)
        next_obs, reward, done, truncated, info = env_train.step(action)
        agent.store(obs, action, reward, value, done, log_prob)
        obs = next_obs

        # Update if rollout buffer is full
        if step % ROLLOUT_STEPS == 0:
            next_value = 0.0
            if not done:
                with torch.no_grad():
                    _, next_value_tensor = agent.network(torch.tensor(obs, dtype=torch.float32, device=agent.device).unsqueeze(0))
                    next_value = next_value_tensor.item()
            
            agent.update(next_value)

        if done or truncated:
            obs, info = env_train.reset()

    # --- 5. Evaluate the Agent ---
    # Use a smaller number of episodes for faster evaluation during HPO
    eval_results = evaluate_agent(agent, env_eval, num_episodes=20, render=False)
    mean_portfolio_return = eval_results['portfolio_return'].mean()

    # --- 6. Report result to Optuna ---
    # Optuna will use this value to determine the best hyperparameters
    return mean_portfolio_return

In [None]:
sampler = TPESampler(seed=42)
pruner = MedianPruner()

# Create the study
study = optuna.create_study(
    study_name="ppo_trading_agent_optimization",
    direction="maximize",  # We want to maximize the portfolio return
    sampler=sampler,
    pruner=pruner
)

# Start the optimization
# n_trials is the number of different hyperparameter combinations to test.
# Increase this for a more thorough search.
try:
    study.optimize(objective, n_trials=10, timeout=1800) # 25 trials, 30min timeout
except KeyboardInterrupt:
    print("Optimization stopped manually.")

# --- Print Results ---
print("\n--- Optimization Finished ---")
print(f"Number of finished trials: {len(study.trials)}")

print("\nBest trial:")
best_trial = study.best_trial
print(f"Value (Mean Portfolio Return): {best_trial.value:.4f}")

print("Best Hyperparameters:")
for key, value in best_trial.params.items():
    print(f"{key}: {value}")

# You can now use these best hyperparameters to train your final agent
# for a longer duration (e.g., more TOTAL_TIMESTEPS).
best_hps = best_trial.params
print("\nBest hyperparameters dictionary:")
print(best_hps)

[I 2025-11-16 21:12:44,670] A new study created in memory with name: ppo_trading_agent_optimization


Market Return : 79.51%   |   Portfolio Return : -34.77%   |   
Market Return : 79.51%   |   Portfolio Return : -2.87%   |   
Market Return : 79.51%   |   Portfolio Return : -24.98%   |   
Market Return : 79.51%   |   Portfolio Return : -15.24%   |   
Market Return : 79.51%   |   Portfolio Return : -7.73%   |   
Market Return : 79.51%   |   Portfolio Return : -8.14%   |   
Market Return : 79.51%   |   Portfolio Return : -2.08%   |   
Market Return : 79.51%   |   Portfolio Return :  0.01%   |   
Market Return : 79.51%   |   Portfolio Return : -14.00%   |   
Market Return : 79.51%   |   Portfolio Return : -1.54%   |   
Market Return : 79.51%   |   Portfolio Return : -9.84%   |   
Market Return : 79.51%   |   Portfolio Return :  2.64%   |   
Market Return : 79.51%   |   Portfolio Return : -9.75%   |   
Market Return : 79.51%   |   Portfolio Return : -7.04%   |   
Market Return : 79.51%   |   Portfolio Return : 15.33%   |   
Market Return : 79.51%   |   Portfolio Return : -5.79%   |   
Mark

[I 2025-11-16 21:22:52,177] Trial 0 finished with value: -4e-05 and parameters: {'lr': 5.6115164153345e-05, 'gamma': 0.98, 'gae_lambda': 0.8310429095469044, 'entropy_beta': 1.7073967431528103e-05, 'clip_epsilon': 0.27323522915498705, 'ppo_epochs': 13, 'batch_size': 128, 'hidden_size': 256, 'n_layers': 1}. Best is trial 0 with value: -4e-05.


Market Return : -3.63%   |   Portfolio Return : -0.00%   |   
Saved evaluation results to evaluation_results.csv
Market Return : 79.51%   |   Portfolio Return : -25.40%   |   
Market Return : 79.51%   |   Portfolio Return : -65.92%   |   
Market Return : 79.51%   |   Portfolio Return : -43.32%   |   
Market Return : 79.51%   |   Portfolio Return : -54.76%   |   
Market Return : 79.51%   |   Portfolio Return : -34.06%   |   
Market Return : 79.51%   |   Portfolio Return : -40.51%   |   
Market Return : 79.51%   |   Portfolio Return : -25.16%   |   
Market Return : 79.51%   |   Portfolio Return : -28.99%   |   
Market Return : 79.51%   |   Portfolio Return : -4.52%   |   
Market Return : 79.51%   |   Portfolio Return :  2.05%   |   
Market Return : 79.51%   |   Portfolio Return : -6.52%   |   
Market Return : 79.51%   |   Portfolio Return :  0.36%   |   
Market Return : 79.51%   |   Portfolio Return :  6.40%   |   
Market Return : 79.51%   |   Portfolio Return :  9.17%   |   
Market Retu

[I 2025-11-16 21:31:40,322] Trial 1 finished with value: -0.018195 and parameters: {'lr': 0.00016738085788752134, 'gamma': 0.999, 'gae_lambda': 0.9562500163172097, 'entropy_beta': 6.290644294586152e-05, 'clip_epsilon': 0.20284688768272233, 'ppo_epochs': 13, 'batch_size': 1024, 'hidden_size': 64, 'n_layers': 2}. Best is trial 0 with value: -4e-05.


Market Return : -3.63%   |   Portfolio Return : -1.82%   |   
Saved evaluation results to evaluation_results.csv
Market Return : 79.51%   |   Portfolio Return : -46.98%   |   
Market Return : 79.51%   |   Portfolio Return : -24.01%   |   
Market Return : 79.51%   |   Portfolio Return : -59.91%   |   
Market Return : 79.51%   |   Portfolio Return : -59.37%   |   
Market Return : 79.51%   |   Portfolio Return : -17.04%   |   
Market Return : 79.51%   |   Portfolio Return : -42.74%   |   
Market Return : 79.51%   |   Portfolio Return : -34.35%   |   
Market Return : 79.51%   |   Portfolio Return : -49.80%   |   
Market Return : 79.51%   |   Portfolio Return : -31.85%   |   
Market Return : 79.51%   |   Portfolio Return : -21.90%   |   
Market Return : 79.51%   |   Portfolio Return : -5.43%   |   
Market Return : 79.51%   |   Portfolio Return : -18.11%   |   
Market Return : 79.51%   |   Portfolio Return : 47.59%   |   
Market Return : 79.51%   |   Portfolio Return : 16.17%   |   
Market R

Now that we have found optimimal hyperparameters. We need to train the agent with these parameters.

In [None]:
# --- 1. Get environment parameters ---
# Use the training env for setting up the agent
state_size = env_train.observation_space.shape[0]
n_actions = env_train.action_space.n

print(f"State size: {state_size}")
print(f"Number of actions: {n_actions}")

# --- 2. Training hyperparameters ---
# You will need to TUNE these. These are small values for a quick test.
TOTAL_TIMESTEPS = 1_000_000     # Total steps to train for
ROLLOUT_STEPS = 2048         # Steps to collect before each PPO update
EVAL_EVERY_N_UPDATES = 5     # How often to run evaluation
MODEL_SAVE_PATH = "models/ppo_trading_agent.pth"



# --- 3. Initialize agent with best hyperparameters ---
agent = PPOAgent(state_size=state_size, n_actions=n_actions, **best_hps)

# --- 4. Training & logging setup ---
all_episode_rewards = [] # Stores total reward for each completed episode
episode_rewards = []     # Stores rewards for the *current* episode
best_eval_return = -float('inf') # Track best performance
update_count = 0

print(f"Starting training for {TOTAL_TIMESTEPS} timesteps...")
print(f"Will update every {ROLLOUT_STEPS} steps.")
print(f"Evaluating every {EVAL_EVERY_N_UPDATES} updates.")

# --- 5. Main training loop ---
obs, info = env_train.reset()

for step in range(1, TOTAL_TIMESTEPS + 1):
    # 5a. Get action, value, and log_prob from the agent
    action, value, log_prob = agent.get_action_value_logprob(obs)
    
    # 5b. Take action in the environment
    next_obs, reward, done, truncated, info = env_train.step(action)
    
    # 5c. Store the transition
    # We store 'done' (terminal state like bankruptcy), not 'truncated'
    agent.store(obs, action, reward, value, done, log_prob)
    episode_rewards.append(reward)
    
    # 5d. Update the current observation
    obs = next_obs
    
    # 5e. Check if rollout is complete (time to update)
    if step % ROLLOUT_STEPS == 0:
        update_count += 1
        
        # 5f. Get the value of the *last* observation for GAE
        # This is the "next_value" for the last transition in the buffer.
        # We get this value UNLESS the last step was a *terminal* 'done'.
        # If it was 'truncated', we still bootstrap.
        next_value = 0.0
        if not done:
            with torch.no_grad():
                _, next_value_tensor = agent.network(torch.tensor(obs, dtype=torch.float32, device=agent.device).unsqueeze(0))
                next_value = next_value_tensor.item()
        
        # 5g. Perform PPO update
        losses = agent.update(next_value)
        
        # 5h. Log progress
        print(f"\nUpdate {update_count} (Step {step}/{TOTAL_TIMESTEPS})")
        print(f"  Actor Loss: {losses['actor_loss']:.4f}, Critic Loss: {losses['critic_loss']:.4f}")
        if len(all_episode_rewards) > 0:
            print(f"  Mean Reward (last 10 ep): {np.mean(all_episode_rewards[-10:]):.4f}")
        
        # 5i. Periodic evaluation
        if update_count % EVAL_EVERY_N_UPDATES == 0:
            print("--- Running Evaluation ---")
            eval_results = evaluate_agent(agent, env_eval, num_episodes=20, render=False) # 5 episodes, no render
            mean_eval_return = eval_results['portfolio_return'].mean()
            market_return = eval_results['market_return'].mean() # Market return is constant
            
            print(f"  Mean Eval Portfolio Return: {mean_eval_return:.2%}")
            print(f"  Market Return: {market_return:.2%}")
            
            if mean_eval_return > best_eval_return:
                best_eval_return = mean_eval_return
                torch.save(agent.network.state_dict(), MODEL_SAVE_PATH)
                print(f"  *** New best model saved with return {best_eval_return:.2%} ***")
            print("--------------------------")
            
    # 5j. Handle episode end (if 'done' or 'truncated')
    if done or truncated:
        all_episode_rewards.append(sum(episode_rewards))
        episode_rewards = []
        obs, info = env_train.reset()

print("\nTraining finished.")
print(f"Best model saved to {MODEL_SAVE_PATH} with return {best_eval_return:.2%}")

State size: 12
Number of actions: 9
Starting training for 10000 timesteps...
Will update every 2048 steps.
Evaluating every 5 updates.

Update 1 (Step 2048/10000)
  Actor Loss: -0.0045, Critic Loss: 0.0032

Update 2 (Step 4096/10000)
  Actor Loss: -0.0007, Critic Loss: 0.0007

Update 3 (Step 6144/10000)
  Actor Loss: -0.0016, Critic Loss: 0.0002

Update 4 (Step 8192/10000)
  Actor Loss: -0.0007, Critic Loss: 0.0001
Market Return : 79.51%   |   Portfolio Return : -41.41%   |   

Training finished.
Best model saved to models/ppo_trading_agent.pth with return -inf%


# Test best model on real 

In [None]:
# Create a new agent instance
trained_agent = PPOAgent(state_size=state_size, n_actions=n_actions)

# Load the saved model weights
trained_agent.network.load_state_dict(torch.load(MODEL_SAVE_PATH))

# Set the network to evaluation mode (e.g., for dropout, batchnorm)
trained_agent.network.eval() 

print("Evaluating trained agent...")

# Evaluate the trained agent
df_results = evaluate_agent(
    trained_agent, 
    env_eval, 
    num_episodes=20, 
    render=True, 
    csv_path=eval_folder / "evaluation_results.csv", 
    renderer_logs_dir=eval_folder / "render_logs"
)

print(df_results)

RuntimeError: Error(s) in loading state_dict for ActorCriticNetwork:
	size mismatch for shared.0.weight: copying a param with shape torch.Size([512, 12]) from checkpoint, the shape in current model is torch.Size([128, 12]).
	size mismatch for shared.0.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
	size mismatch for shared.2.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([128, 128]).
	size mismatch for shared.2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
	size mismatch for actor.weight: copying a param with shape torch.Size([7, 512]) from checkpoint, the shape in current model is torch.Size([7, 128]).
	size mismatch for critic.weight: copying a param with shape torch.Size([1, 512]) from checkpoint, the shape in current model is torch.Size([1, 128]).

In [None]:
df_results

Unnamed: 0,episode,portfolio_return,market_return,excess_return,steps,total_reward
0,1,0.0362,-0.0363,0.0725,767,0.035558
1,2,0.1379,-0.0363,0.1742,767,0.129163
2,3,0.1378,-0.0363,0.1741,767,0.129138
3,4,0.0362,-0.0363,0.0725,767,0.035558
4,5,0.1379,-0.0363,0.1742,767,0.129163
5,6,0.1379,-0.0363,0.1742,767,0.129188
6,7,0.0362,-0.0363,0.0725,767,0.035558
7,8,0.1378,-0.0363,0.1741,767,0.129138
8,9,0.1363,-0.0363,0.1726,767,0.127746
9,10,0.1379,-0.0363,0.1742,767,0.129188


In [None]:
from gym_trading_env.renderer import Renderer
renderer = Renderer(render_logs_dir=eval_folder/"render_logs")
renderer.run()

 * Serving Flask app 'gym_trading_env.renderer'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
[33mPress CTRL+C to quit[0m
127.0.0.1 - - [16/Nov/2025 16:09:09] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [16/Nov/2025 16:09:09] "GET /update_data/BTCUSD_2025-11-16_15-47-18.pkl HTTP/1.1" 200 -
127.0.0.1 - - [16/Nov/2025 16:09:09] "GET /metrics HTTP/1.1" 200 -
127.0.0.1 - - [16/Nov/2025 16:09:09] "[33mGET /favicon.ico HTTP/1.1[0m" 404 -
127.0.0.1 - - [16/Nov/2025 16:09:25] "GET /update_data/BTCUSD_2025-11-16_15-47-17.pkl HTTP/1.1" 200 -
127.0.0.1 - - [16/Nov/2025 16:09:25] "GET /metrics HTTP/1.1" 200 -
127.0.0.1 - - [16/Nov/2025 16:09:35] "GET /update_data/BTCUSD_2025-11-16_15-47-15.pkl HTTP/1.1" 200 -
127.0.0.1 - - [16/Nov/2025 16:09:35] "GET /metrics HTTP/1.1" 200 -
127.0.0.1 - - [16/Nov/2025 16:09:37] "GET /update_data/BTCUSD_2025-11-16_15-47-08.pkl HTTP/1.1" 200 -
127.0.0.1 - - [16/Nov/2025 16:09:37] "GET /metrics HTTP/1.1" 200 -
127.0.0.1 - - [16/Nov/2025 16:09:39] "GET /update_data/BTCUSD_2025-11-16_15-38-06.pkl HTTP/1.1" 200 -
127.0.0.1 - - [1