Intermediate Deep Learning - Deep Reinforcement Learning
# Project: Deep Reinforcement Learning in Trading Environment

## Overview
In this project, you will implement and evaluate a deep reinforcement learning (DRL) agent in a trading environment. The goal is to train an agent that can make profitable trading decisions based on historical market data.

You are encouraged to experiment with different DRL algorithms, architectures, and hyperparameters to optimize the agent's performance.

### Deliverables

You are required to prepare:

- Code implementation of the DRL agent(s) and training process. Can be in the form of Jupyter Notebooks or Python scripts.
- A report (4 pages max) summarizing your approach, results, and insights gained from the project, including:  
    - Description of the DRL algorithm(s) used.
    - Training process and hyperparameter choices.
    - Challenges faced and how they were addressed.

    Justify your design choices. You are also encouraged to include visualizations of the agent's performance over time, or compare strategies developed with DRL against financial benchmarks.
- An evaluation results CSV file `evaluation_results.csv` generated by your agent after training using the provided evaluation function.
- A presentation (15 minutes) to showcase your work, findings, and any interesting observations.

*The documents are to be submitted in a zip file, before 16/11/2025 11:59PM, and the presentation is scheduled for next session.*

### Environment
We will use Gym Trading Env as our trading environment. (https://gym-trading-env.readthedocs.io/en/latest/)

- This is a gymnasium-compatible environment designed to simulate trading (stocks or crypto) from historical market data.
- Its goal is to provide a fast and customizable platform for training RL agents in a trading scenario.

We will use BTC/USDT hour step historical data from Binance for training and evaluation. The agent will be evaluated on the period from 2025-10-01 to 2025-11-01.

Following code blocks demonstrate how to set up the environment and evaluate your agent.

### Grading Criteria
- Implementation of the DRL agent and training process (40%)
    - DRL algorithm correctly implemented (15%)
    - Appropriate training procedure (15%)
    - Effective use of hyperparameters (10%)
    - 10 % bonus for innovative approaches or techniques
- Performance of the agent based on evaluation metrics (30%)
    - If the agent shows progress during training (10%)
    - If the agent outperforms a random strategy during evaluation (10%)
    - If the portfolio return exceeds market return (10%)
    - 30%, 20%, 10% bonus for the top 3 agents respectively
- Quality and clarity of the report (20%)
    - Clear explanation of methods and results (10%)
    - Justification of design choices (10%)
- Presentation (20%)





---

## Environment Setup

### Install Required Packages

In [1]:
import sys
print(sys.executable)

!"{sys.executable}" -m pip install gym-trading-env

/usr/local/bin/python3

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


In [2]:
import numpy as np
import pandas as pd
import gymnasium as gym
import gym_trading_env
from gym_trading_env.downloader import download
import datetime
from pathlib import Path
import matplotlib.pyplot as plt
import time
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch

### Prepare Data Set

- Create a data folder to store historical data
- Download historical data for BTC/USDT from Binance using the provided utility function.
- Preprocess the data to create features. The features (plus two dynamic features: last position taken by the agent, and the current real position) are the state of the environment at each time step.  
    *(You can add more features if you want to experiment with different state representations.)*
- Select training and evaluation data based on the specified date ranges.  
    *(You can modify the training range if you want to experiment with different time periods, however, keep in mind the evaluation period should always be after the training period.)*

```python

In [3]:
data_folder = Path("data/")
data_folder.mkdir(parents=True, exist_ok=True)
eval_folder = Path("eval/")
eval_folder.mkdir(parents=True, exist_ok=True)


download(exchange_names = ["binance"],
    symbols= ["BTC/USDT"],
    timeframe= "1h",
    dir = data_folder,
    since= datetime.datetime(year= 2020, month=10, day=1),
)

df = pd.read_pickle(data_folder / "binance-BTCUSDT-1h.pkl")

# --- 1. Initial Data Shift and Cleanup ---
# Shift 'close' one step back to create 'prev_close'. This is the price
# available at the moment the new candle OPENS (i.e., Close at t-1).
df["prev_close"] = df["close"].shift(1)

# --- 2. Calculate Raw Technical Indicators on LAGGED DATA ---
# All indicators MUST use 'prev_close' for their core calculations.

# Sharpe Ratio
ANNUALIZATION_FACTOR = 24 * 365
ROLLING_WINDOW_SR = 7 * 24 
RISK_FREE_RATE_ANNUAL = 0.04
RISK_FREE_RATE_HOURLY = (1 + RISK_FREE_RATE_ANNUAL)**(1/ANNUALIZATION_FACTOR) - 1

# Base returns calculation uses 'prev_close' (i.e., Close at t-1)
df['return'] = df['prev_close'].pct_change()
df['excess_return'] = df['return'] - RISK_FREE_RATE_HOURLY
rolling_mean_excess = df['excess_return'].rolling(window=ROLLING_WINDOW_SR).mean()
rolling_std_excess = df['excess_return'].rolling(window=ROLLING_WINDOW_SR).std()
df['raw_sharpe'] = (rolling_mean_excess / (rolling_std_excess + 1e-9)) * np.sqrt(ANNUALIZATION_FACTOR)

# MACD uses 'prev_close' for EMAs
df['EMA_12'] = df['prev_close'].ewm(span=12, adjust=False).mean()
df['EMA_26'] = df['prev_close'].ewm(span=26, adjust=False).mean()
df['raw_macd'] = df['EMA_12'] - df['EMA_26']
df['raw_macd_signal'] = df['raw_macd'].ewm(span=9, adjust=False).mean()

# Bollinger Bands uses 'prev_close' for MA and StdDev
ROLLING_WINDOW_BB = 20
df['BB_Middle'] = df['prev_close'].rolling(window=ROLLING_WINDOW_BB).mean()
df['BB_Std'] = df['prev_close'].rolling(window=ROLLING_WINDOW_BB).std()
df['raw_bb_upper'] = df['BB_Middle'] + (df['BB_Std'] * 2)
df['raw_bb_lower'] = df['BB_Middle'] - (df['BB_Std'] * 2)

# OBV uses 'prev_close'
df['raw_obv'] = (np.sign(df['prev_close'].diff()) * df['volume'].shift(1)).cumsum().fillna(0)


# ATR (Average True Range) - NEW
df['high_t_minus_1'] = df['high'].shift(1)
df['low_t_minus_1'] = df['low'].shift(1)
df['prev_prev_close'] = df['prev_close'].shift(1)

df['tr_1'] = df['high_t_minus_1'] - df['low_t_minus_1'] # Range of candle t-1
df['tr_2'] = np.abs(df['high_t_minus_1'] - df['prev_prev_close']) # Distance from previous close to high
df['tr_3'] = np.abs(df['low_t_minus_1'] - df['prev_prev_close']) # Distance from previous close to low
df['true_range'] = df[['tr_1', 'tr_2', 'tr_3']].max(axis=1)
df['raw_atr'] = df['true_range'].rolling(window=14).mean()

# RSI (Relative Strength Index) - NEW
delta = df['prev_close'].diff()
gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
rs = gain / (loss + 1e-9)
df['raw_rsi'] = 100 - (100 / (1 + rs))

# --- 3. Add Cyclical Time Features (No Change, Already Safe) ---
df['hour'] = df.index.hour
df['day_of_week'] = df.index.dayofweek

df['feature_hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['feature_hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['feature_day_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
df['feature_day_cos'] = np.cos(2 * np.pi * df['day_of_week'] / 7)

# --- 4. Create Final, Normalized Features (Shift Applied When Necessary) ---

# We define a log-return feature based on data available in the previous completed candle.
df['feature_log_return_1h'] = np.log(df['prev_close'] / df['prev_close'].shift(1))


# Price Features (Normalized by prev_close, ready for t=0 observation)
df['feature_open'] = (df['open'] / df['prev_close']) - 1
df['feature_high'] = (df['high'].shift(1) / df['prev_close']) - 1
df['feature_low'] = (df['low'].shift(1) / df['prev_close']) - 1


# Volume Features (Z-Score of volume/OBV at t-1)
vol_mean_30d = df['volume'].shift(1).rolling(30*24).mean()
vol_std_30d = df['volume'].shift(1).rolling(30*24).std()
df['feature_volume_zscore'] = ((df['volume'].shift(1) - vol_mean_30d) / (vol_std_30d + 1e-9))
obv_mean_30d = df['raw_obv'].shift(1).rolling(30*24).mean()
obv_std_30d = df['raw_obv'].shift(1).rolling(30*24).std()
df['feature_obv_zscore'] = ((df['raw_obv'].shift(1) - obv_mean_30d) / (obv_std_30d + 1e-9))

# Indicator Features
df['feature_MACD'] = (df['raw_macd'].shift(1) / df['prev_close'])
df['feature_MACD_Signal'] = (df['raw_macd_signal'].shift(1) / df['prev_close'])
df['feature_BB_Upper'] = (df['raw_bb_upper'].shift(1) / df['prev_close']) - 1
df['feature_BB_Lower'] = (df['raw_bb_lower'].shift(1) / df['prev_close']) - 1
df['feature_atr'] = (df['raw_atr'].shift(1) / df['prev_close'])
df['feature_rsi'] = df['raw_rsi'].shift(1)
df['feature_sharpe_ratio'] = df['raw_sharpe'].shift(1)

# --- 5. Final Cleanup ---
final_features = [
    'feature_hour_sin', 'feature_hour_cos', 'feature_day_sin', 'feature_day_cos',
    'feature_open', 'feature_high', 'feature_low', 'feature_log_return_1h',
    'feature_volume_zscore', 'feature_obv_zscore',
    'feature_MACD', 'feature_MACD_Signal', 
    'feature_BB_Upper', 'feature_BB_Lower',
    'feature_atr', 'feature_rsi', 'feature_sharpe_ratio'
]

# Keep the current raw OHLCV for the Environment to calculate rewards/penalties,
# but the agent MUST only observe the 'feature_' columns.
all_cols_to_keep = ['close','open', 'high', 'low', 'prev_close', 'volume'] + final_features
df = df[all_cols_to_keep]

df.dropna(inplace=True)

# --- 6. Defining DataFrames ---
# (Environment setup remains correct as it depends on the index slicing)

df_train = df.loc['2024-10-01':'2025-09-30'] # Training data
df_eval = df.loc['2025-10-01':'2025-11-01'] # Evaluation data


BTC/USDT downloaded from binance and stored at data/binance-BTCUSDT-1h.pkl


In [4]:
df_train.head()

Unnamed: 0_level_0,close,open,high,low,prev_close,volume,feature_hour_sin,feature_hour_cos,feature_day_sin,feature_day_cos,...,feature_log_return_1h,feature_volume_zscore,feature_obv_zscore,feature_MACD,feature_MACD_Signal,feature_BB_Upper,feature_BB_Lower,feature_atr,feature_rsi,feature_sharpe_ratio
date_open,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2024-10-01 00:00:00,63531.99,63327.6,63606.0,63006.7,63327.59,1336.93335,0.0,1.0,0.781831,0.62349,...,-0.002435,0.577133,1.019271,-0.00659,-0.007075,0.02202,-0.00397,0.006462,37.120012,0.706674
2024-10-01 01:00:00,63458.0,63532.0,63639.86,63370.01,63531.99,1004.08763,0.258819,0.965926,0.781831,0.62349,...,0.003222,0.34503,0.919524,-0.006591,-0.00696,0.017406,-0.007778,0.006431,45.902577,0.009728
2024-10-01 02:00:00,63443.76,63458.0,63458.0,63180.0,63458.0,716.11822,0.5,0.866025,0.781831,0.62349,...,-0.001165,-0.019572,1.003197,-0.006284,-0.006831,0.01721,-0.006673,0.006645,47.891809,1.640393
2024-10-01 03:00:00,63723.48,63443.76,63744.0,63430.0,63443.76,822.21265,0.707107,0.707107,0.781831,0.62349,...,-0.000224,-0.333418,0.938037,-0.006061,-0.006678,0.015486,-0.006181,0.006556,43.060654,1.358962
2024-10-01 04:00:00,63868.94,63723.47,63879.81,63652.06,63723.48,778.75286,0.866025,0.5,0.781831,0.62349,...,0.004399,-0.218125,0.891219,-0.005808,-0.006481,0.009475,-0.010421,0.006423,37.358784,0.856831


In [20]:
df_eval.head()

Unnamed: 0_level_0,close,open,high,low,prev_close,volume,feature_hour_sin,feature_hour_cos,feature_day_sin,feature_day_cos,...,feature_log_return_1h,feature_volume_zscore,feature_obv_zscore,feature_MACD,feature_MACD_Signal,feature_BB_Upper,feature_BB_Lower,feature_atr,feature_rsi,feature_sharpe_ratio
date_open,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2025-10-01 00:00:00,114239.53,114048.94,114308.0,113966.67,114048.93,434.59016,0.0,1.0,0.974928,-0.222521,...,0.000952,-0.564066,-1.216313,0.001966,0.001569,0.005904,-0.013626,0.005157,55.073954,2.843994
2025-10-01 01:00:00,114549.99,114239.53,114550.0,114142.99,114239.53,597.2536,0.258819,0.965926,0.974928,-0.222521,...,0.00167,-0.190021,-1.176929,0.001997,0.001652,0.003745,-0.015077,0.004875,64.896929,3.274433
2025-10-01 02:00:00,114272.15,114549.99,114551.76,114272.15,114549.99,508.42422,0.5,0.866025,0.974928,-0.222521,...,0.002714,0.1744,-1.115457,0.002129,0.001744,0.001232,-0.017825,0.004807,65.297454,3.20392
2025-10-01 03:00:00,114176.92,114272.16,114530.48,114096.58,114272.15,502.30318,0.707107,0.707107,0.974928,-0.222521,...,-0.002428,-0.023705,-1.032361,0.002434,0.001885,0.004594,-0.015799,0.00487,66.683406,3.335955
2025-10-01 04:00:00,114289.01,114176.93,114700.0,114151.0,114176.92,597.89328,0.866025,0.5,0.974928,-0.222521,...,-0.000834,-0.036563,-1.099294,0.002449,0.001999,0.00586,-0.015073,0.004883,63.49629,3.446608


### Setting up the Trading Environment

We use the `df_train` DataFrame for training and `df_eval` DataFrame for evaluation.

The `positions` parameter defines the discrete positions the agent can take, it is a list containing possible position values. A position value corresponds to the ratio of the portfolio valuation engaged in the position ( > 0 to bet on the rise, < 0 to bet on the decrease)

- if `position < 0` : the agent is shorting the asset
- if `position = 0` : the agent is out of the market
- if `position > 0` : the agent is longing the asset
- if `position = 1` : the agent is fully invested in the asset
- if `position > 1` : the agent is using leverage to invest more than its portfolio valuation in the asset

You are free to modify the `positions` list to experiment with different position options for the agent.


In [10]:
POSITIONS = [-1, -0.75, -0.5, -0.25, 0, 0.25, 0.5, 0.75, 1]
WINDOW_SIZE = 48

env_train = gym.make("TradingEnv",
        name= "BTCUSD",
        df = df_train, # Your dataset with your custom features
        positions = POSITIONS,
        trading_fees = 0.01/100, # 0.01% per stock buy / sell (Binance fees)
        borrow_interest_rate= 0.0003/100, # 0.0003% per timestep (one timestep = 1h here)
    )

env_eval = gym.make("TradingEnv",
        name= "BTCUSD",
        df = df_eval, # Your dataset with your custom features
        positions = POSITIONS,
        trading_fees = 0.01/100, # 0.01% per stock buy / sell (Binance fees)
        borrow_interest_rate= 0.0003/100, # 0.0003% per timestep (one timestep = 1h here)
    )

### Example of interacting with the Environment

The interaction with the environment follows the standard Gymnasium API. At each time step, the agent selects an action (position index) based on the current observation, and the environment returns the next observation, reward, and done flag.

The following code block demonstrates a simple interaction loop where the agent randomly selects actions.

The episode is terminated when `done` or `truncated` is `True`:

- When environment reaches the end of the dataset, `truncated` is set to `True`.
- When agent's portfolio valuation drops below 0, `done` is set to `True`. (This means the agent has gone bankrupt, the situdation can happen when high leverage is used.)

The reward at each time step is calculated based on the change in portfolio valuation, taking into account trading fees and borrow interest rates: $r_{t} = ln(\frac{p_{t}}{p_{t-1}})\text{ with }p_{t}\text{ = portofolio valuation at timestep }t$. You can customize your own reward function if needed (see https://gym-trading-env.readthedocs.io/en/latest/customization.html#custom-reward-function)

In [11]:
done, truncated = False, False
observation, info = env_train.reset()
while not done and not truncated:
    # Pick a position by its index in your position list
    position_index = env_train.action_space.sample()
    observation, reward, done, truncated, info = env_train.step(position_index)

Market Return : 79.51%   |   Portfolio Return : -62.14%   |   


### DRL Agent Training and Evaluation

You will then implement your DRL agent, train it using the training environment, and evaluate its performance using the evaluation environment.

Your agent should include a method `choose_action_eval` for selecting actions based on state during evaluation. It is called in the evaluation function.

```python
    def choose_action_eval(self, state):
        # Implement action selection logic for evaluation
        ...
        return action_index
```

After training, you can evaluate your agent using the provided `evaluate_agent` function, which runs the agent in the evaluation environment for a specified number of episodes and records the results.

In [12]:
class RandomAgent:
    def __init__(self, action_space):
        self.action_space = action_space

    def act(self, observation):
        return self.action_space.sample()

    def choose_action_eval(self, state):
        return self.action_space.sample()

In [13]:
def evaluate_agent(agent, env, num_episodes=10, max_steps=None, render=False, csv_path="evaluation_results.csv", renderer_logs_dir="render_logs"):
    """
    Evaluate the agent on the environment for a number of episodes.
    """

    results = []
    for ep in range(num_episodes):
        obs, info = env.reset()
        done = False
        truncated = False
        step = 0
        reward_total = 0.0
        while not done and not truncated:
            action = agent.choose_action_eval(obs)
            obs, reward, done, truncated, info = env.step(action)
            reward_total += reward
            step += 1
            if (max_steps is not None) and (step >= max_steps):
                break

        # Get metrics from the environment
        metrics = env.get_metrics()  
        # Transform metrics
        # Assume metrics contain keys "Portfolio Return" and "Market Return" as strings like "45.24%"
        port_ret = float(metrics["Portfolio Return"].strip('%')) / 100.0
        market_ret = float(metrics["Market Return"].strip('%')) / 100.0

        results.append({
            "episode": ep+1,
            "portfolio_return": port_ret,
            "market_return": market_ret,
            "excess_return": port_ret - market_ret,
            "steps": step,
            "total_reward": reward_total,
        })
        if render:
            print(f"Eval Episode {ep+1}: Total Reward: {reward_total:.2f}, Portfolio Return: {port_ret:.2%}, Market Return: {market_ret:.2%}, Excess Return: {(port_ret - market_ret):.2%}, Steps: {step}")
            time.sleep(1)  # Pause between episodes in case the execution is too fast and files are not saved properly
            env.save_for_render(dir = renderer_logs_dir)

    df_results = pd.DataFrame(results)
    
    df_results.to_csv(csv_path, index=False)
    print(f"Saved submission to {csv_path}")

    return df_results


In [14]:
# Create a random agent for evaluation
agent = RandomAgent(env_eval.action_space)

# Evaluate the trained agent
df_results = evaluate_agent(agent, env_eval, num_episodes=10, render=True, csv_path=eval_folder / "evaluation_results.csv", renderer_logs_dir=eval_folder / "render_logs")

Market Return : -3.63%   |   Portfolio Return : -6.32%   |   
Eval Episode 1: Total Reward: -0.07, Portfolio Return: -6.32%, Market Return: -3.63%, Excess Return: -2.69%, Steps: 767
Market Return : -3.63%   |   Portfolio Return : -12.30%   |   
Eval Episode 2: Total Reward: -0.13, Portfolio Return: -12.30%, Market Return: -3.63%, Excess Return: -8.67%, Steps: 767
Market Return : -3.63%   |   Portfolio Return : -9.65%   |   
Eval Episode 3: Total Reward: -0.10, Portfolio Return: -9.65%, Market Return: -3.63%, Excess Return: -6.02%, Steps: 767
Market Return : -3.63%   |   Portfolio Return : -8.00%   |   
Eval Episode 4: Total Reward: -0.08, Portfolio Return: -8.00%, Market Return: -3.63%, Excess Return: -4.37%, Steps: 767
Market Return : -3.63%   |   Portfolio Return : -2.76%   |   
Eval Episode 5: Total Reward: -0.03, Portfolio Return: -2.76%, Market Return: -3.63%, Excess Return: 0.87%, Steps: 767
Market Return : -3.63%   |   Portfolio Return : -3.78%   |   
Eval Episode 6: Total Rewar

In [15]:

class ActorCriticNetwork(nn.Module):
    def __init__(self, state_shape, n_actions, hidden_size=128, n_layers=2):
        super().__init__()
        self.window_size = state_shape[0]
        self.n_features = state_shape[1]

        # --- 1. Feature Extractor ---
        # If we have a time sequence (Window > 1), use CNN
        if self.window_size > 1:
            self.feature_extractor = nn.Sequential(
                nn.Conv1d(in_channels=self.n_features, out_channels=32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv1d(in_channels=32, out_channels=64, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Flatten()
            )
            self.flattened_size = 64 * self.window_size
            
        # If we have a snapshot (Window == 1), use a Linear Layer (MLP)
        else:
            self.feature_extractor = nn.Sequential(
                nn.Flatten(),
                nn.Linear(self.n_features * self.window_size, 64),
                nn.ReLU()
            )
            self.flattened_size = 64

        # --- 2. Shared Linear Layers ---
        layers = []
        input_dim = self.flattened_size
        
        for _ in range(n_layers):
            layers.append(nn.Linear(input_dim, hidden_size))
            layers.append(nn.ReLU())
            input_dim = hidden_size
            
        self.shared_linear = nn.Sequential(*layers)

        # --- 3. Heads ---
        self.actor = nn.Linear(hidden_size, n_actions)
        self.critic = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x shape: (Batch, Window, Features)
        
        if self.window_size > 1:
            # Permute for Conv1d: (Batch, Features, Window)
            x = x.permute(0, 2, 1) 
        
        # Pass through specific extractor (CNN or MLP)
        features = self.feature_extractor(x)
        
        # Pass through shared layers
        shared_features = self.shared_linear(features)

        action_logits = self.actor(shared_features) 
        state_value = self.critic(shared_features)

        return action_logits, state_value
class PPOAgent:
    def __init__(
        self, state_size, n_actions,
        lr=3e-4, gamma=0.99, gae_lambda=0.95,
        entropy_beta=0.01, clip_epsilon=0.2, ppo_epochs=10, batch_size=64,
        hidden_size=128,
        n_layers=2  # <<< --- 1. ADD THIS (with a default)
    ):
        
        # Hyperparameters
        self.lr = lr
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.entropy_beta = entropy_beta
        self.clip_epsilon = clip_epsilon
        self.ppo_epochs = ppo_epochs
        self.batch_size = batch_size

        # Environment parameters
        self.state_size = state_size
        self.n_actions = n_actions

        # Device configuration
        if torch.backends.mps.is_available():
            self.device = torch.device("mps")  
        else:
            self.device = torch.device("cpu")

        # Create policy network
        self.network = ActorCriticNetwork(
            state_size, 
            n_actions, 
            hidden_size, 
            n_layers  
        ).to(self.device)

        # Optimizer
        self.optimizer = optim.Adam(self.network.parameters(), lr=lr)

        # Memory buffers
        self.reset_memory()

    def reset_memory(self):
        """Clear rollout buffers."""
        self.states = []
        self.actions = []
        self.rewards = []
        self.values = []
        self.dones = []
        self.log_probs = []

    def get_action_value_logprob(self, state):
        """
        Samples an action for the training loop.
        Returns the action, its value, and log probability.
        """
        state_tensor = torch.tensor(state, dtype=torch.float32, device=self.device).unsqueeze(0)

        with torch.no_grad():
            logits, value = self.network(state_tensor)

        probs = F.softmax(logits, dim=-1)
        dist = torch.distributions.Categorical(probs=probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)

        return action.item(), value.item(), log_prob.item()

    def choose_action_eval(self, state):
        """
        Chooses the best action for evaluation (deterministic).
        Returns only the action index.
        """
        state_tensor = torch.tensor(state, dtype=torch.float32, device=self.device).unsqueeze(0)
        
        with torch.no_grad():
            logits, _ = self.network(state_tensor)
        
        probs = F.softmax(logits, dim=-1)
        action = torch.argmax(probs, dim=-1)
        
        return action.item()

    def store(self, state, action, reward, value, done, log_prob):
        """Store a single transition in memory."""
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
        self.values.append(value)
        self.dones.append(done)
        self.log_probs.append(log_prob)

    def compute_gae(self, next_value):
        """
        Compute returns and advantages using GAE (Generalized Advantage Estimation)
        """
        rewards = np.array(self.rewards, dtype=np.float32)
        values = np.array(self.values + [next_value], dtype=np.float32)
        dones = np.array(self.dones, dtype=np.float32)

        T = len(rewards)
        returns = np.zeros(T, dtype=np.float32)
        advantages = np.zeros(T, dtype=np.float32)

        gae = 0.0
        for t in reversed(range(T)):
            delta = rewards[t] + self.gamma * values[t + 1] * (1.0 - dones[t]) - values[t]
            gae = delta + self.gamma * self.gae_lambda * (1.0 - dones[t]) * gae
            advantages[t] = gae
            returns[t] = advantages[t] + values[t]

        return returns, advantages

    def update(self, next_value):
        """Perform one PPO update step."""
        if len(self.states) == 0:
            return {"actor_loss": 0.0, "critic_loss": 0.0}

        returns, advantages = self.compute_gae(next_value)

        # Convert to tensors
        states = torch.tensor(np.array(self.states), dtype=torch.float32, device=self.device)
        actions = torch.tensor(np.array(self.actions), dtype=torch.int64, device=self.device)
        returns = torch.tensor(returns, dtype=torch.float32, device=self.device)
        advantages = torch.tensor(advantages, dtype=torch.float32, device=self.device)
        old_log_probs = torch.tensor(np.array(self.log_probs), dtype=torch.float32, device=self.device)

        # Normalize advantages
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        total_actor_loss = 0
        total_critic_loss = 0
        updates = 0
        
        for _ in range(self.ppo_epochs):
            indices = torch.randperm(len(states))
            
            for start in range(0, len(states), self.batch_size):
                end = start + self.batch_size
                idx = indices[start:end]
                
                if len(idx) == 0:
                    continue

                batch_states = states[idx]
                batch_actions = actions[idx]
                batch_old_log_probs = old_log_probs[idx]
                batch_returns = returns[idx]
                batch_advantages = advantages[idx]
                
                # Forward pass
                logits, values = self.network(batch_states)
                action_probs = F.softmax(logits, dim=-1)
                dist = torch.distributions.Categorical(action_probs)
                log_probs = dist.log_prob(batch_actions)
                entropy = dist.entropy().mean()
                
                # PPO loss computation
                ratio = torch.exp(log_probs - batch_old_log_probs)
                surr1 = ratio * batch_advantages
                surr2 = torch.clamp(ratio, 1 - self.clip_epsilon, 1 + self.clip_epsilon) * batch_advantages
                actor_loss = -torch.min(surr1, surr2).mean()
                
                critic_loss = (batch_returns - values.squeeze()).pow(2).mean()
                
                loss = actor_loss + 0.5 * critic_loss - self.entropy_beta * entropy
                
                self.optimizer.zero_grad()
                loss.backward()
                torch.nn.utils.clip_grad_norm_(self.network.parameters(), 0.5)
                self.optimizer.step()
                
                total_actor_loss += actor_loss.item()
                total_critic_loss += critic_loss.item()
                updates += 1
        
        self.reset_memory()
        
        if updates == 0:
            return {"actor_loss": 0.0, "critic_loss": 0.0}

        return {
            'actor_loss': total_actor_loss / updates,
            'critic_loss': total_critic_loss / updates
        }

In [17]:
# (best_hps should be the dictionary you got from your Optuna study)
best_hps = {'lr': 0.00038347659965876475, 'gamma': 0.98, 'gae_lambda': 0.889645947951959, 'entropy_beta': 0.001448579588369665, 'clip_epsilon': 0.12948752167104155, 'ppo_epochs': 9, 'batch_size': 64, 'hidden_size': 512, 'n_layers': 4}
state_shape = env_eval.observation_space.shape  # Use the new env
n_actions = env_eval.action_space.n
print(f"Loading best model with hyperparameters: {best_hps}")
trained_agent = PPOAgent(
    state_size=state_shape, 
    n_actions=n_actions,
    **best_hps  
)

# --- 2. Load the saved model weights ---
MODEL_SAVE_PATH = "models/ppo_trading_agent_v2.pth"
trained_agent.network.load_state_dict(torch.load(MODEL_SAVE_PATH))

# Set the network to evaluation mode (this is correct)
trained_agent.network.eval()

print("--- Evaluating final trained agent on TEST SET ---")

# --- 3. Evaluate the agent on the unseen TEST set ---
df_results = evaluate_agent(
    trained_agent,
    env_eval,  
    num_episodes=20,
    render=True,
    csv_path=eval_folder / "evaluation_results.csv",
    renderer_logs_dir=eval_folder / "render_logs"
)

print("--- Final Evaluation Complete ---")
print(df_results)

Loading best model with hyperparameters: {'lr': 0.00038347659965876475, 'gamma': 0.98, 'gae_lambda': 0.889645947951959, 'entropy_beta': 0.001448579588369665, 'clip_epsilon': 0.12948752167104155, 'ppo_epochs': 9, 'batch_size': 64, 'hidden_size': 512, 'n_layers': 4}


IndexError: tuple index out of range

### Environment rendering

Gym Trading Env supports rendering the environment to visualize the agent's trading actions over time. You can enable rendering during evaluation by setting the `render` parameter to `True` in the `evaluate_agent` function. The rendered logs will be saved in the specified directory for later review.

To visualize the rendered logs, you can use the built-in rendering tools provided by Gym Trading Env as shown below.

In [None]:
from gym_trading_env.renderer import Renderer
renderer = Renderer(render_logs_dir=eval_folder/"render_logs")
renderer.run()

Now you are ready to implement your DRL agent and start training! Good luck ðŸ’ª