Intermediate Deep Learning - Deep Reinforcement Learning
# Project: Deep Reinforcement Learning in Trading Environment

## Overview
In this project, you will implement and evaluate a deep reinforcement learning (DRL) agent in a trading environment. The goal is to train an agent that can make profitable trading decisions based on historical market data.

You are encouraged to experiment with different DRL algorithms, architectures, and hyperparameters to optimize the agent's performance.

### Deliverables

You are required to prepare:

- Code implementation of the DRL agent(s) and training process, as Jupyter Notebooks or Python scripts.
- A report (4 pages max) summarizing your approach, results, and insights gained from the project, including:  
    - Description of the DRL algorithm(s) used.
    - Training process and hyperparameter choices.
    - Challenges faced and how they were addressed.

    Justify your design choices. You are also encouraged to include visualizations of the agent's performance over time, or compare strategies developed with DRL against financial benchmarks.
- An evaluation results CSV file `evaluation_results.csv` generated by your agent after training using the provided evaluation function.
- A presentation (15 minutes) to showcase your work, findings, and any interesting observations.

*The documents are to be submitted in a zip file before 16/11/2025 11:59PM, and the presentation is scheduled for the next session.*

### Environment
We will use Gym Trading Env as our trading environment. (https://gym-trading-env.readthedocs.io/en/latest/)

- This is a gymnasium-compatible environment designed to simulate trading (stocks or crypto) from historical market data.
- Its goal is to provide a fast and customizable platform for training RL agents in a trading scenario.

We will use hourly BTC/USDT historical data from Binance for training and evaluation. The agent will be evaluated on the period from 2025-10-01 to 2025-11-01.

The following code blocks demonstrate how to set up the environment and evaluate your agent.

### Grading Criteria
- Implementation of the DRL agent and training process (40%)
    - DRL algorithm correctly implemented (15%)
    - Appropriate training procedure (15%)
    - Effective use of hyperparameters (10%)
    - 10% bonus for innovative approaches or techniques
- Performance of the agent based on evaluation metrics (30%)
    - If the agent shows progress during training (10%)
    - If the agent outperforms a random strategy during evaluation (10%)
    - If the portfolio return exceeds market return (10%)
    - 30%, 20%, 10% bonus for the top 3 agents respectively
- Quality and clarity of the report (20%)
    - Clear explanation of methods and results (10%)
    - Justification of design choices (10%)
- Presentation (20%)





---

## Environment Setup

### Install Required Packages

In [2]:
import sys
print(sys.executable)

!"{sys.executable}" -m pip install gym-trading-env

/usr/local/bin/python3

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


In [3]:
import numpy as np
import pandas as pd
import gymnasium as gym
import gym_trading_env
from gym_trading_env.downloader import download
import datetime
from pathlib import Path
import matplotlib.pyplot as plt
import time



### Prepare Data Set

- Create a data folder to store historical data
- Download historical data for BTC/USDT from Binance using the provided utility function.
- Preprocess the data to create features. The features (plus two dynamic features: last position taken by the agent, and the current real position) are the state of the environment at each time step.  
    *(You can add more features if you want to experiment with different state representations.)*
- Select training and evaluation data based on the specified date ranges.  
    *(You can modify the training range if you want to experiment with different time periods, however, keep in mind the evaluation period should always be after the training period.)*
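
As a rough illustration of adding features, the sketch below builds two extra columns on a synthetic price series. The data, the 24-hour window, and the column names `feature_return_24h` and `feature_volatility_24h` are hypothetical choices, not part of the assignment. It relies on the Gym Trading Env convention that columns whose name contains `feature` are used as observations.

```python
import numpy as np
import pandas as pd

# Synthetic hourly close prices (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
close = 100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.01, size=200)))
demo = pd.DataFrame({"close": close})

# Return over the last 24 hours: close[t] / close[t-24] - 1
demo["feature_return_24h"] = demo["close"].pct_change(24)

# Rolling standard deviation of hourly returns over the last 24 hours
demo["feature_volatility_24h"] = demo["close"].pct_change().rolling(24).std()

# Rolling windows leave NaN values at the start of the series; drop them.
demo = demo.dropna()
print(demo.shape)
```

Both features only look backwards in time, so they do not leak future prices into the state.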


In [4]:
data_folder = Path("data/")
data_folder.mkdir(parents=True, exist_ok=True)
eval_folder = Path("eval/")
eval_folder.mkdir(parents=True, exist_ok=True)


download(exchange_names = ["binance"],
    symbols= ["BTC/USDT"],
    timeframe= "1h",
    dir = data_folder,
    since= datetime.datetime(year= 2020, month=10, day=1),
)

# Import your fresh data
df = pd.read_pickle(data_folder / "binance-BTCUSDT-1h.pkl")

""" Preprocess the data to create features """
# Create the feature : ( close[t] - close[t-1] )/ close[t-1]
df["feature_close"] = df["close"].pct_change()

# Create the feature : open[t] / close[t]
df["feature_open"] = df["open"]/df["close"]

# Create the feature : high[t] / close[t]
df["feature_high"] = df["high"]/df["close"]

# Create the feature : low[t] / close[t]
df["feature_low"] = df["low"]/df["close"]

# Create the feature : volume[t] / max(volume over the last 7*24 hours, including t)
df["feature_volume"] = df["volume"] / df["volume"].rolling(7*24).max()

df.dropna(inplace=True)  # Drop the rows with NaN values left by pct_change and the rolling window
# At each step, the environment returns these 5 features: "feature_close", "feature_open", "feature_high", "feature_low", "feature_volume" (plus the two dynamic features described above)

df_train = df.loc['2024-10-01':'2025-09-30'] # Training data
df_eval = df.loc['2025-10-01':'2025-11-01'] # Evaluation data


BTC/USDT downloaded from binance and stored at data/binance-BTCUSDT-1h.pkl


In [5]:
df_train

| date_open | open | high | low | close | volume | date_close | feature_close | feature_open | feature_high | feature_low | feature_volume |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2024-10-01 00:00:00 | 63327.60 | 63606.00 | 63006.70 | 63531.99 | 1336.93335 | 2024-10-01 01:00:00 | 0.003228 | 0.996783 | 1.001165 | 0.991732 | 0.287876 |
| 2024-10-01 01:00:00 | 63532.00 | 63639.86 | 63370.01 | 63458.00 | 1004.08763 | 2024-10-01 02:00:00 | -0.001165 | 1.001166 | 1.002866 | 0.998613 | 0.216205 |
| 2024-10-01 02:00:00 | 63458.00 | 63458.00 | 63180.00 | 63443.76 | 716.11822 | 2024-10-01 03:00:00 | -0.000224 | 1.000224 | 1.000224 | 0.995843 | 0.154198 |
| 2024-10-01 03:00:00 | 63443.76 | 63744.00 | 63430.00 | 63723.48 | 822.21265 | 2024-10-01 04:00:00 | 0.004409 | 0.995610 | 1.000322 | 0.995394 | 0.177043 |
| 2024-10-01 04:00:00 | 63723.47 | 63879.81 | 63652.06 | 63868.94 | 778.75286 | 2024-10-01 05:00:00 | 0.002283 | 0.997722 | 1.000170 | 0.996604 | 0.167685 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2025-09-30 19:00:00 | 113714.55 | 114563.46 | 113702.01 | 114359.99 | 971.66803 | 2025-09-30 20:00:00 | 0.005676 | 0.994356 | 1.001779 | 0.994246 | 0.335092 |
| 2025-09-30 20:00:00 | 114359.99 | 114723.57 | 114110.09 | 114626.36 | 710.64838 | 2025-09-30 21:00:00 | 0.002329 | 0.997676 | 1.000848 | 0.995496 | 0.245076 |
| 2025-09-30 21:00:00 | 114626.36 | 114754.87 | 114136.50 | 114161.00 | 649.46746 | 2025-09-30 22:00:00 | -0.004060 | 1.004076 | 1.005202 | 0.999785 | 0.223977 |
| 2025-09-30 22:00:00 | 114161.00 | 114356.98 | 113883.01 | 113940.39 | 348.93460 | 2025-09-30 23:00:00 | -0.001932 | 1.001936 | 1.003656 | 0.999496 | 0.120335 |


In [6]:
df_eval

| date_open | open | high | low | close | volume | date_close | feature_close | feature_open | feature_high | feature_low | feature_volume |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2025-10-01 00:00:00 | 114048.94 | 114308.00 | 113966.67 | 114239.53 | 434.59016 | 2025-10-01 01:00:00 | 0.001671 | 0.998332 | 1.000599 | 0.997612 | 0.149874 |
| 2025-10-01 01:00:00 | 114239.53 | 114550.00 | 114142.99 | 114549.99 | 597.25360 | 2025-10-01 02:00:00 | 0.002718 | 0.997290 | 1.000000 | 0.996447 | 0.205971 |
| 2025-10-01 02:00:00 | 114549.99 | 114551.76 | 114272.15 | 114272.15 | 508.42422 | 2025-10-01 03:00:00 | -0.002425 | 1.002431 | 1.002447 | 1.000000 | 0.175337 |
| 2025-10-01 03:00:00 | 114272.16 | 114530.48 | 114096.58 | 114176.92 | 502.30318 | 2025-10-01 04:00:00 | -0.000833 | 1.000834 | 1.003097 | 0.999296 | 0.173226 |
| 2025-10-01 04:00:00 | 114176.93 | 114700.00 | 114151.00 | 114289.01 | 597.89328 | 2025-10-01 05:00:00 | 0.000982 | 0.999019 | 1.003596 | 0.998792 | 0.206191 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2025-11-01 19:00:00 | 110341.28 | 110341.28 | 110220.77 | 110299.99 | 109.98551 | 2025-11-01 20:00:00 | -0.000374 | 1.000374 | 1.000374 | 0.999282 | 0.025156 |
| 2025-11-01 20:00:00 | 110300.00 | 110516.17 | 110298.17 | 110406.20 | 211.21273 | 2025-11-01 21:00:00 | 0.000963 | 0.999038 | 1.000996 | 0.999022 | 0.048308 |
| 2025-11-01 21:00:00 | 110406.21 | 110426.00 | 109858.44 | 109862.85 | 273.04557 | 2025-11-01 22:00:00 | -0.004921 | 1.004946 | 1.005126 | 0.999960 | 0.062451 |
| 2025-11-01 22:00:00 | 109862.85 | 110098.59 | 109862.84 | 110092.78 | 259.45939 | 2025-11-01 23:00:00 | 0.002093 | 0.997911 | 1.000053 | 0.997911 | 0.059343 |


### Setting up the Trading Environment

We use the `df_train` DataFrame for training and `df_eval` DataFrame for evaluation.

The `positions` parameter is a list of the discrete positions the agent can take. Each position value is the fraction of the portfolio valuation engaged in the asset (positive to bet on a price rise, negative to bet on a fall):

- if `position < 0` : the agent is shorting the asset
- if `position = 0` : the agent is out of the market
- if `position > 0` : the agent is long the asset
- if `position = 1` : the agent is fully invested in the asset
- if `position > 1` : the agent is using leverage to invest more than its portfolio valuation in the asset

You are free to modify the `positions` list to experiment with different position options for the agent.
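
To make the position ratio concrete, here is a small worked example of how each position in the default list would change a portfolio over one step. It is a simplified sketch that ignores trading fees and borrow interest; the helper `one_step_value`, the starting value of 1000, and the 1% price move are hypothetical and not part of Gym Trading Env.

```python
def one_step_value(portfolio_value: float, position: float, price_change: float) -> float:
    """Portfolio value after one step, ignoring trading fees and borrow interest."""
    # Amount exposed to the asset (negative means a short position).
    asset_value = position * portfolio_value
    # Remainder held in the quote currency (negative means borrowed).
    cash_value = portfolio_value - asset_value
    return cash_value + asset_value * (1.0 + price_change)


positions = [-1, 0, 0.5, 1, 2]
# Hypothetical scenario: portfolio worth 1000, price rises by 1% in one step.
values = {p: one_step_value(1000.0, p, 0.01) for p in positions}
for position, value in values.items():
    print(f"position {position:>4}: {value:.2f}")
```

With a 1% price rise, the short position loses about 1%, the out-of-market position is unchanged, and the leveraged position gains about 2%. The same leverage doubles the loss when the price falls.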


In [7]:
POSITIONS = [-1, 0, 0.5, 1, 2] # -1 (=SHORT), 0 (=OUT), 0.5 (=HALF LONG), 1 (=LONG), 2 (=LONG with 2x leverage)

env_train = gym.make("TradingEnv",
        name= "BTCUSD",
        df = df_train, # Your dataset with your custom features
        positions = POSITIONS,
        trading_fees = 0.01/100, # 0.01% per stock buy / sell (Binance fees)
        borrow_interest_rate= 0.0003/100, # 0.0003% per timestep (one timestep = 1h here)
    )

env_eval = gym.make("TradingEnv",
        name= "BTCUSD",
        df = df_eval, # Your dataset with your custom features
        positions = POSITIONS,
        trading_fees = 0.01/100, # 0.01% per stock buy / sell (Binance fees)
        borrow_interest_rate= 0.0003/100, # 0.0003% per timestep (one timestep = 1h here)
    )

### Example of interacting with the Environment

The interaction with the environment follows the standard Gymnasium API. At each time step, the agent selects an action (an index into the `positions` list) based on the current observation, and the environment returns the next observation, the reward, the `done` and `truncated` flags, and an `info` dictionary.

The following code block demonstrates a simple interaction loop where the agent randomly selects actions.

The episode is terminated when `done` or `truncated` is `True`:

- When the environment reaches the end of the dataset, `truncated` is set to `True`.
- When the agent's portfolio valuation drops below 0, `done` is set to `True`. (This means the agent has gone bankrupt; this situation can happen when high leverage is used.)

The reward at each time step is calculated based on the change in portfolio valuation, taking into account trading fees and borrow interest rates: $r_{t} = \ln\left(\frac{p_{t}}{p_{t-1}}\right)$, where $p_{t}$ is the portfolio valuation at timestep $t$. You can customize your own reward function if needed (see https://gym-trading-env.readthedocs.io/en/latest/customization.html#custom-reward-function).
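
A useful property of this log-return reward is that the rewards of an episode add up to the logarithm of the total portfolio growth, so the sum of rewards can be converted back into a simple return. The short sketch below checks this on made-up portfolio valuations (the numbers are hypothetical, not taken from the environment).

```python
import math

# Hypothetical portfolio valuations p_0 .. p_3 (not real data).
valuations = [1000.0, 1010.0, 989.8, 1039.29]

# One reward per step: r_t = ln(p_t / p_{t-1})
rewards = [math.log(curr / prev) for prev, curr in zip(valuations, valuations[1:])]

# The sum of log returns telescopes to ln(p_T / p_0).
total_reward = sum(rewards)

# Convert the total reward back into a simple return: p_T / p_0 - 1
portfolio_return = math.exp(total_reward) - 1.0

print(f"total reward: {total_reward:.4f}, portfolio return: {portfolio_return:.2%}")
```

Maximizing the sum of these rewards is therefore equivalent to maximizing the final portfolio valuation.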

In [8]:
done, truncated = False, False
observation, info = env_train.reset()
while not done and not truncated:
    # Pick a position by its index in your position list
    position_index = env_train.action_space.sample()
    observation, reward, done, truncated, info = env_train.step(position_index)

Market Return : 79.51%   |   Portfolio Return : -65.56%   |   


### DRL Agent Training and Evaluation

You will then implement your DRL agent, train it using the training environment, and evaluate its performance using the evaluation environment.

Your agent should include a method `choose_action_eval` for selecting actions based on state during evaluation. It is called in the evaluation function.

```python
    def choose_action_eval(self, state):
        # Implement action selection logic for evaluation
        ...
        return action_index
```
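
As a hedged illustration of this interface, the sketch below uses a toy linear Q-function in place of a neural network. The class `LinearQAgent`, its `q_values` and `act` methods, and the 7-dimensional observation (5 features plus the 2 dynamic features) are assumptions made for the example; only `choose_action_eval` is required by the evaluation function. It shows one common pattern: explore with epsilon-greedy during training, and act greedily during evaluation.

```python
import numpy as np


class LinearQAgent:
    """Toy stand-in for a DRL agent: Q(s, a) = W[a] @ s + b[a]."""

    def __init__(self, obs_dim: int, n_actions: int, seed: int = 0):
        self.rng = np.random.default_rng(seed)
        self.n_actions = n_actions
        # Randomly initialized parameters; a real agent would learn these.
        self.W = self.rng.normal(scale=0.1, size=(n_actions, obs_dim))
        self.b = np.zeros(n_actions)

    def q_values(self, state) -> np.ndarray:
        state = np.asarray(state, dtype=np.float64).ravel()
        return self.W @ state + self.b

    def act(self, state, epsilon: float = 0.1) -> int:
        # Training-time policy: random action with probability epsilon.
        if self.rng.random() < epsilon:
            return int(self.rng.integers(self.n_actions))
        return self.choose_action_eval(state)

    def choose_action_eval(self, state) -> int:
        # Evaluation-time policy: always take the action with the highest Q-value.
        return int(np.argmax(self.q_values(state)))


# Assumed shapes: 7 observation values, 5 positions in POSITIONS.
agent_demo = LinearQAgent(obs_dim=7, n_actions=5)
state_demo = np.ones(7)
action_demo = agent_demo.choose_action_eval(state_demo)
print(action_demo)
```

Returning a plain Python `int` keeps the action usable as an index into the positions list.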

After training, you can evaluate your agent using the provided `evaluate_agent` function, which runs the agent in the evaluation environment for a specified number of episodes and records the results.

In [12]:
class RandomAgent:
    def __init__(self, action_space):
        self.action_space = action_space

    def act(self, observation):
        return self.action_space.sample()

    def choose_action_eval(self, state):
        return self.action_space.sample()

In [13]:
def evaluate_agent(agent, env, num_episodes=10, max_steps=None, render=False, csv_path="evaluation_results.csv", renderer_logs_dir="render_logs"):
    """
    Evaluate the agent on the environment for a number of episodes.
    """

    results = []
    for ep in range(num_episodes):
        obs, info = env.reset()
        done = False
        truncated = False
        step = 0
        reward_total = 0.0
        while not done and not truncated:
            action = agent.choose_action_eval(obs)
            obs, reward, done, truncated, info = env.step(action)
            reward_total += reward
            step += 1
            if (max_steps is not None) and (step >= max_steps):
                break

        # Get metrics from the environment
        metrics = env.get_metrics()  
        # Transform metrics
        # Assume metrics contain keys "Portfolio Return" and "Market Return" as strings like "45.24%"
        port_ret = float(metrics["Portfolio Return"].strip('%')) / 100.0
        market_ret = float(metrics["Market Return"].strip('%')) / 100.0

        results.append({
            "episode": ep+1,
            "portfolio_return": port_ret,
            "market_return": market_ret,
            "excess_return": port_ret - market_ret,
            "steps": step,
            "total_reward": reward_total,
        })
        if render:
            print(f"Eval Episode {ep+1}: Total Reward: {reward_total:.2f}, Portfolio Return: {port_ret:.2%}, Market Return: {market_ret:.2%}, Excess Return: {(port_ret - market_ret):.2%}, Steps: {step}")
            time.sleep(1)  # Pause between episodes in case the execution is too fast and files are not saved properly
            env.save_for_render(dir = renderer_logs_dir)

    df_results = pd.DataFrame(results)
    
    df_results.to_csv(csv_path, index=False)
    print(f"Saved submission to {csv_path}")

    return df_results


In [14]:
# Create a random agent for evaluation
agent = RandomAgent(env_eval.action_space)

# Evaluate the random agent (replace it with your trained agent)
df_results = evaluate_agent(agent, env_eval, num_episodes=10, render=True, csv_path=eval_folder / "evaluation_results.csv", renderer_logs_dir=eval_folder / "render_logs")

Market Return : -3.63%   |   Portfolio Return : -6.59%   |   
Eval Episode 1: Total Reward: -0.07, Portfolio Return: -6.59%, Market Return: -3.63%, Excess Return: -2.96%, Steps: 767
Market Return : -3.63%   |   Portfolio Return : -4.08%   |   
Eval Episode 2: Total Reward: -0.04, Portfolio Return: -4.08%, Market Return: -3.63%, Excess Return: -0.45%, Steps: 767
Market Return : -3.63%   |   Portfolio Return : 13.37%   |   
Eval Episode 3: Total Reward: 0.13, Portfolio Return: 13.37%, Market Return: -3.63%, Excess Return: 17.00%, Steps: 767
Market Return : -3.63%   |   Portfolio Return : -9.74%   |   
Eval Episode 4: Total Reward: -0.10, Portfolio Return: -9.74%, Market Return: -3.63%, Excess Return: -6.11%, Steps: 767
Market Return : -3.63%   |   Portfolio Return :  5.15%   |   
Eval Episode 5: Total Reward: 0.05, Portfolio Return: 5.15%, Market Return: -3.63%, Excess Return: 8.78%, Steps: 767
Market Return : -3.63%   |   Portfolio Return : -19.51%   |   
Eval Episode 6: Total Reward: -
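
The episodes printed above show how much a random agent's return varies from run to run, so a single episode says little about a stochastic policy. One possible way to summarize the results is sketched below, using only the five complete episodes from the log above. The variable names are hypothetical, and the same statistics could be computed from `evaluation_results.csv`. The sketch also checks that each total reward agrees with $\ln(1 + \text{portfolio return})$ to the two decimals printed.

```python
import math
import statistics

MARKET_RETURN = -0.0363  # market return reported in the log above

# (portfolio_return, total_reward) for episodes 1 to 5, copied from the log above.
episodes = [
    (-0.0659, -0.07),
    (-0.0408, -0.04),
    (0.1337, 0.13),
    (-0.0974, -0.10),
    (0.0515, 0.05),
]

portfolio_returns = [ret for ret, _ in episodes]
mean_return = statistics.mean(portfolio_returns)
std_return = statistics.stdev(portfolio_returns)
beat_market = sum(ret > MARKET_RETURN for ret in portfolio_returns)

# The total reward is a sum of log returns, so it should match ln(1 + return)
# up to the rounding of the printed values.
rewards_consistent = all(
    abs(math.log1p(ret) - reward) < 0.005 for ret, reward in episodes
)

print(f"mean return: {mean_return:.2%}, std: {std_return:.2%}")
print(f"episodes beating the market: {beat_market}/{len(episodes)}")
print(f"rewards consistent with returns: {rewards_consistent}")
```

Comparing a trained agent against the mean and spread of the random baseline gives a fairer picture than comparing against one lucky or unlucky episode.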

### Environment rendering

Gym Trading Env supports rendering the environment to visualize the agent's trading actions over time. You can enable rendering during evaluation by setting the `render` parameter to `True` in the `evaluate_agent` function. The rendered logs will be saved in the specified directory for later review.

To visualize the rendered logs, you can use the built-in rendering tools provided by Gym Trading Env as shown below.

In [None]:
from gym_trading_env.renderer import Renderer
renderer = Renderer(render_logs_dir=eval_folder/"render_logs")
renderer.run()

Now you are ready to implement your DRL agent and start training! Good luck 💪