# Reinforcement Learning for Trading - Deep Q-learning

To train a trading agent, we need to create a market environment that provides price and other information, offers trading-related actions, and tracks the portfolio so that the agent is rewarded according to the outcome of its actions.

## How to set up an OpenAI Gym environment for trading

OpenAI Gym allows the design, registration, and use of environments that adhere to its architecture, as described in its [documentation](https://github.com/openai/gym/tree/master/gym/envs#how-to-create-new-environments-for-gym). The file [trading_env.py](trading_env.py) gives an example that shows how to create a class implementing the required `step()` and `reset()` methods.

The trading environment consists of three classes that interact to facilitate the agent's activities:

 1. The `DataSource` class loads a time series, generates a few features, and provides the latest observation to the agent at each time step.

 2. The `TradingSimulator` tracks positions, trades, costs, and performance. It also implements and records the results of a buy-and-hold benchmark strategy.

 3. `TradingEnvironment` itself orchestrates the process.
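
To make the `reset()`/`step()` contract concrete, here is a minimal sketch of a toy environment. It is not the code in `trading_env.py`: the class `ToyTradingEnv`, its single-feature observation, and the action encoding (0 = short, 1 = flat, 2 = long) are assumptions made for illustration, and the class deliberately does not inherit from `gym.Env` so that it runs without Gym installed.

```python
class ToyTradingEnv:
    """Hypothetical stand-in that mimics the Gym reset()/step() contract."""

    def __init__(self, returns):
        self.returns = list(returns)   # one market return per time step
        self.t = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.t = 0
        return [self.returns[self.t]]

    def step(self, action):
        """Apply `action` (assumed: 0=short, 1=flat, 2=long), advance one step."""
        position = action - 1          # -1, 0 or +1
        self.t += 1
        reward = position * self.returns[self.t]
        done = self.t == len(self.returns) - 1
        observation = [self.returns[self.t]]
        return observation, reward, done, {}


env = ToyTradingEnv([0.0, 0.01, -0.02, 0.03])
obs = env.reset()
total, done = 0.0, False
while not done:
    obs, reward, done, info = env.step(2)   # always hold a long position
    total += reward
print(round(total, 4))
```

The loop at the end has the same shape as the training loop used later in this notebook: reset once per episode, then call `step()` until `done` is true.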

## A simple trading simulation

To train the agent, we need to set up a simple game with a limited set of options, a relatively low-dimensional state, and other parameters that can be easily modified and extended.

More specifically, the environment samples a stock price time series for a single ticker, using a random start date, to simulate a trading period that by default contains 252 trading days, or one year. The state contains the (scaled) price and volume, as well as technical indicators such as the percentile ranks of price and volume, a Relative Strength Index (RSI), and 5- and 21-day returns. In the run shown in this notebook, the loaded feature set consists of 1-, 2-, 5-, 10- and 21-day returns plus RSI, MACD, ATR, stochastic and ultimate oscillator values. The agent can choose between three actions:

- **Buy**: Invest capital for a long position in the stock
- **Flat**: Hold cash only
- **Sell short**: Take a short position equal to the amount of capital.
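
A discrete action space like this is usually encoded as integer indices that are then mapped to a signed position. The sketch below assumes the encoding 0 = short, 1 = flat, 2 = long; the notebook does not show which index `trading_env.py` assigns to which action, so treat both the encoding and the helper `action_to_position` as hypothetical.

```python
# Assumed encoding (not confirmed by the notebook): 0 = short, 1 = flat, 2 = long
ACTION_NAMES = {0: 'short', 1: 'flat', 2: 'long'}


def action_to_position(action):
    """Map a discrete action index to a signed position of -1, 0 or +1."""
    if action not in ACTION_NAMES:
        raise ValueError(f'unknown action: {action}')
    return action - 1


for action, name in ACTION_NAMES.items():
    print(f'{action} ({name}) -> position {action_to_position(action):+d}')
```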

The environment accounts for trading costs, set to 10 basis points by default. It also deducts a time cost of 1 basis point per period. It tracks the net asset value (NAV) of the agent's portfolio and compares it against the market portfolio (which trades without friction, to raise the bar for the agent).
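
The following sketch works through the cost arithmetic for three periods. The cost model is an assumption: the trading cost is charged on the absolute change in position and the time cost is charged every period, which may differ in detail from what `TradingSimulator` does. The function `step_return` is a hypothetical helper, not part of `trading_env.py`.

```python
def step_return(position, prev_position, market_return,
                trading_cost=1e-3, time_cost=1e-4):
    """One-period return net of costs, under the assumed cost model."""
    trade_cost = abs(position - prev_position) * trading_cost
    return position * market_return - trade_cost - time_cost


nav, market_nav = 1.0, 1.0
prev_position = 0
# (position, market return) per period: go long, stay long, switch to short
for position, market_return in [(1, 0.01), (1, -0.02), (-1, -0.01)]:
    nav *= 1 + step_return(position, prev_position, market_return)
    market_nav *= 1 + market_return    # the benchmark pays no costs
    prev_position = position

print(f'agent NAV: {nav:.4f} | market NAV: {market_nav:.4f}')
```

Switching from long to short changes the position by two units, so under this model the third period pays twice the trading cost.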

We use the same DDQN agent and neural network architecture that successfully learned to navigate the Lunar Lander environment. With the hyperparameters set in this notebook, ε decays linearly from 1.0 to 0.01 over the first 250 episodes (one episode is one 252-day trading period) and then exponentially by a factor of 0.99 per episode.

## Imports & Settings

### Imports

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
!pip install gym
%matplotlib inline
!pip uninstall -y tensorflow
!pip install -U tensorflow==2.3.0
from pathlib import Path
from time import time
from collections import deque
from random import sample

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import seaborn as sns

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2

import gym
from gym.envs.registration import register

Found existing installation: tensorflow 2.5.0
Not uninstalling tensorflow at /shared-libs/python3.6/py/lib/python3.6/site-packages, outside environment /root/venv
Can't uninstall 'tensorflow'. No files were found to uninstall.
Collecting tensorflow==2.3.0
  Using cached tensorflow-2.3.0-cp36-cp36m-manylinux2010_x86_64.whl (320.4 MB)
Installing collected packages: tensorflow
  Attempting uninstall: tensorflow
    Found existing installation: tensorflow 2.5.0
    Not uninstalling tensorflow at /shared-libs/python3.6/py/lib/python3.6/site-packages, outside environment /root/venv
    Can't uninstall 'tensorflow'. No files were found to uninstall.
Successfully installed tensorflow-2.3.0


### Settings

In [None]:
np.random.seed(42)
tf.random.set_seed(42)

In [None]:
sns.set_style('whitegrid')

In [None]:
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
print(gpu_devices)
if gpu_devices:
    print('Using GPU')
    tf.config.experimental.set_memory_growth(gpu_devices[0], True)
else:
    print('Using CPU')

[]
Using CPU


In [None]:
results_path = Path('results', 'trading_bot')
if not results_path.exists():
    results_path.mkdir(parents=True)

### Helper functions

In [None]:
def format_time(t):
    m_, s = divmod(t, 60)
    h, m = divmod(m_, 60)
    return '{:02.0f}:{:02.0f}:{:02.0f}'.format(h, m, s)

## Set up the environment

Before using our environment, we register it with Gym's `register` function.

In [None]:
trading_days = 252

In [None]:
register(
    id='trading-v0',
    entry_point='trading_env:TradingEnvironment',
    max_episode_steps=trading_days
)
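
The `entry_point` string follows the `'module:attribute'` convention: the part before the colon names an importable module and the part after it names the class inside that module. The sketch below shows how such a string can be resolved. It is a simplified illustration and not Gym's own loader; `load_entry_point` is a hypothetical helper, and it is demonstrated on a standard-library class because `trading_env` may not be importable outside this project.

```python
from importlib import import_module


def load_entry_point(entry_point):
    """Resolve a 'module:attribute' string to the object it names."""
    module_name, attr_name = entry_point.split(':')
    return getattr(import_module(module_name), attr_name)


# Stand-in target; 'trading_env:TradingEnvironment' would resolve the same way
# as long as trading_env.py is on the import path.
deque_cls = load_entry_point('collections:deque')
print(deque_cls([1, 2, 3], maxlen=2))
```

A practical consequence is that `trading_env.py` has to be importable from the notebook's working directory when `gym.make('trading-v0')` is called.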

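### Initialize the trading environment

We can instantiate the environment using the desired trading costs and ticker. Despite the `_bps` suffix, both cost variables hold decimal fractions: `1e-3` corresponds to 10 basis points and `1e-4` to 1 basis point.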

In [None]:
trading_cost_bps = 1e-3
time_cost_bps = 1e-4

In [None]:
f'Trading costs: {trading_cost_bps:.2%} | Time costs: {time_cost_bps:.2%}'

'Trading costs: 0.10% | Time costs: 0.01%'

In [None]:
!pip install -r requirements.txt 



ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'


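The TA-Lib Python wrapper requires the TA-Lib C library. The following shell commands build and install it from the `ta-lib-0.4.0-src.tar.gz` source archive, which must already be in the working directory:

pwd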

tar -xzf ta-lib-0.4.0-src.tar.gz

cd ta-lib/

pwd

./configure --prefix=/usr

make

sudo make install

In [None]:
!pip install ta-lib

Collecting ta-lib
  Using cached TA-Lib-0.4.20.tar.gz (266 kB)
Building wheels for collected packages: ta-lib
  Building wheel for ta-lib (setup.py) ... done
  Created wheel for ta-lib: filename=TA_Lib-0.4.20-cp36-cp36m-linux_x86_64.whl size=1438945 sha256=021925dc5593e6c9d14308c842703654d79cfb8ec72254b1e726a835d0624e53
  Stored in directory: /root/.cache/pip/wheels/4a/78/6a/bb3c86ccc471d0914efdc6ab3715f7d00c343392a316bf606a
Successfully built ta-lib
Installing collected packages: ta-lib
Successfully installed ta-lib-0.4.20


In [None]:
!pip install --upgrade tables



In [None]:
trading_environment = gym.make('trading-v0')
trading_environment.env.trading_days = trading_days
trading_environment.env.trading_cost_bps = trading_cost_bps
trading_environment.env.time_cost_bps = time_cost_bps
trading_environment.env.ticker = 'AAPL'
trading_environment.seed(42)

INFO:trading_env:loading data for AAPL...
INFO:trading_env:got data for AAPL...
INFO:trading_env:None
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 9367 entries, (Timestamp('1981-01-30 00:00:00'), 'AAPL') to (Timestamp('2018-03-27 00:00:00'), 'AAPL')
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   returns  9367 non-null   float64
 1   ret_2    9367 non-null   float64
 2   ret_5    9367 non-null   float64
 3   ret_10   9367 non-null   float64
 4   ret_21   9367 non-null   float64
 5   rsi      9367 non-null   float64
 6   macd     9367 non-null   float64
 7   atr      9367 non-null   float64
 8   stoch    9367 non-null   float64
 9   ultosc   9367 non-null   float64
dtypes: float64(10)
memory usage: 1.6+ MB


[42]
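
The `ret_2`, `ret_5`, `ret_10` and `ret_21` columns in the summary above are multi-day returns. As a sketch of what such a feature looks like, the function below computes a simple n-day percentage return from a list of prices. This is an assumption about the definition; the notebook does not show how `DataSource` computes or scales these columns, and `n_day_returns` is a hypothetical helper.

```python
def n_day_returns(prices, n):
    """Simple n-day return p[t] / p[t-n] - 1; None where history is too short."""
    return [None if t < n else prices[t] / prices[t - n] - 1
            for t in range(len(prices))]


prices = [100.0, 102.0, 101.0, 105.0, 110.0, 108.0]   # made-up prices
ret_2 = n_day_returns(prices, 2)
print([None if r is None else round(r, 4) for r in ret_2])
```

The first `n` entries have no value because there is not enough price history, which is one reason the loaded series can start later than the raw price data.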

### Get Environment Params

In [None]:
state_dim = trading_environment.observation_space.shape[0]
num_actions = trading_environment.action_space.n
max_episode_steps = trading_environment.spec.max_episode_steps

## Define the trading agent

In [None]:
class DDQNAgent:
    def __init__(self, state_dim,
                 num_actions,
                 learning_rate,
                 gamma,
                 epsilon_start,
                 epsilon_end,
                 epsilon_decay_steps,
                 epsilon_exponential_decay,
                 replay_capacity,
                 architecture,
                 l2_reg,
                 tau,
                 batch_size):

        self.state_dim = state_dim
        self.num_actions = num_actions
        self.experience = deque([], maxlen=replay_capacity)
        self.learning_rate = learning_rate
        self.gamma = gamma
        self.architecture = architecture
        self.l2_reg = l2_reg

        self.online_network = self.build_model()
        self.target_network = self.build_model(trainable=False)
        self.update_target()

        self.epsilon = epsilon_start
        self.epsilon_decay_steps = epsilon_decay_steps
        self.epsilon_decay = (epsilon_start - epsilon_end) / epsilon_decay_steps
        self.epsilon_exponential_decay = epsilon_exponential_decay
        self.epsilon_history = []

        self.total_steps = self.train_steps = 0
        self.episodes = self.episode_length = self.train_episodes = 0
        self.steps_per_episode = []
        self.episode_reward = 0
        self.rewards_history = []

        self.batch_size = batch_size
        self.tau = tau
        self.losses = []
        self.idx = tf.range(batch_size)
        self.train = True

    def build_model(self, trainable=True):
        layers = []
        n = len(self.architecture)
        for i, units in enumerate(self.architecture, 1):
            layers.append(Dense(units=units,
                                input_dim=self.state_dim if i == 1 else None,
                                activation='relu',
                                kernel_regularizer=l2(self.l2_reg),
                                name=f'Dense_{i}',
                                trainable=trainable))
        layers.append(Dropout(.1))
        layers.append(Dense(units=self.num_actions,
                            trainable=trainable,
                            name='Output'))
        model = Sequential(layers)
        model.compile(loss='mean_squared_error',
                      optimizer=Adam(learning_rate=self.learning_rate))
        return model

    def update_target(self):
        self.target_network.set_weights(self.online_network.get_weights())

    def epsilon_greedy_policy(self, state):
        self.total_steps += 1
        if np.random.rand() <= self.epsilon:
            return np.random.choice(self.num_actions)
        q = self.online_network.predict(state)
        return np.argmax(q, axis=1).squeeze()

    def memorize_transition(self, s, a, r, s_prime, not_done):
        if not_done:
            self.episode_reward += r
            self.episode_length += 1
        else:
            if self.train:
                if self.episodes < self.epsilon_decay_steps:
                    self.epsilon -= self.epsilon_decay
                else:
                    self.epsilon *= self.epsilon_exponential_decay

            self.episodes += 1
            self.rewards_history.append(self.episode_reward)
            self.steps_per_episode.append(self.episode_length)
            self.episode_reward, self.episode_length = 0, 0

        self.experience.append((s, a, r, s_prime, not_done))

    def experience_replay(self):
        if self.batch_size > len(self.experience):
            return
        minibatch = map(np.array, zip(*sample(self.experience, self.batch_size)))
        states, actions, rewards, next_states, not_done = minibatch

        next_q_values = self.online_network.predict_on_batch(next_states)
        best_actions = tf.argmax(next_q_values, axis=1)

        next_q_values_target = self.target_network.predict_on_batch(next_states)
        target_q_values = tf.gather_nd(next_q_values_target,
                                       tf.stack((self.idx, tf.cast(best_actions, tf.int32)), axis=1))

        targets = rewards + not_done * self.gamma * target_q_values

        q_values = self.online_network.predict_on_batch(states)
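        # index with separate row and column arrays; recent NumPy versions
        # no longer treat a list of arrays as a tuple index
        q_values[self.idx.numpy(), actions] = targets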

        loss = self.online_network.train_on_batch(x=states, y=q_values)
        self.losses.append(loss)

        if self.total_steps % self.tau == 0:
            self.update_target()
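
The core of `experience_replay()` is the Double DQN target: the online network chooses the best next action, and the target network supplies the value of that action. The sketch below reproduces this calculation on plain Python lists so that it can be checked by hand. The Q-values are made up, and `ddqn_targets` is a hypothetical helper rather than a method of the agent.

```python
def ddqn_targets(rewards, not_done, next_q_online, next_q_target, gamma):
    """Double DQN targets for a batch of transitions."""
    targets = []
    for r, nd, q_on, q_tg in zip(rewards, not_done,
                                 next_q_online, next_q_target):
        # action selection uses the online network's estimates ...
        best_action = max(range(len(q_on)), key=q_on.__getitem__)
        # ... while the value of that action comes from the target network
        targets.append(r + nd * gamma * q_tg[best_action])
    return targets


targets = ddqn_targets(
    rewards=[1.0, 0.5],
    not_done=[1.0, 0.0],                      # second transition is terminal
    next_q_online=[[0.2, 0.9, 0.1], [0.3, 0.1, 0.8]],
    next_q_target=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    gamma=0.99,
)
print(targets)
```

For the first transition the online network prefers action 1, so the target uses the target network's value for action 1 (2.0) and not its own maximum (3.0). Decoupling selection from evaluation in this way is what distinguishes Double DQN from standard DQN.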

## Define hyperparameters

In [None]:
gamma = .99  # discount factor
tau = 100  # target network update frequency

### NN Architecture

In [None]:
architecture = (256, 256)  # units per layer
learning_rate = 0.0001  # learning rate
l2_reg = 1e-6  # L2 regularization

### Experience replay

In [None]:
replay_capacity = int(1e6)
batch_size = 4096
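
The agent stores transitions in a `deque` with a fixed `maxlen` and draws minibatches with `random.sample`. The sketch below shows both behaviors with a deliberately tiny capacity and toy transitions. The seeded `Random` instance is only there to make the example repeatable and is not part of the agent.

```python
from collections import deque
from random import Random

buffer = deque(maxlen=3)                  # tiny capacity for illustration
for step in range(5):
    # (state, action, reward, next_state, not_done) with toy values
    buffer.append((step, 0, 0.0, step + 1, 1.0))

# once the buffer is full, each append drops the oldest transition
print([transition[0] for transition in buffer])

rng = Random(42)                          # seeded only for repeatability
minibatch = rng.sample(list(buffer), 2)   # sample without replacement
# transpose the list of transitions into per-field tuples
states, actions, rewards, next_states, not_done = zip(*minibatch)
```

With `batch_size = 4096` and 252 steps per episode, `experience_replay()` returns early until the buffer holds at least 4,096 transitions, so no training takes place until partway through the 17th episode.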

### $\epsilon$-greedy policy

In [None]:
epsilon_start = 1.0
epsilon_end = .01
epsilon_decay_steps = 250
epsilon_exponential_decay = .99
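
These four values define a two-phase schedule that the agent applies once per finished episode: a linear decay from `epsilon_start` to `epsilon_end` over `epsilon_decay_steps` episodes, followed by multiplication by `epsilon_exponential_decay`. The sketch below replays that schedule outside the agent; `epsilon_schedule` is a hypothetical helper that mirrors the update rule in `memorize_transition()`.

```python
def epsilon_schedule(episodes, start=1.0, end=0.01,
                     decay_steps=250, exponential_decay=0.99):
    """Return epsilon after each episode: linear decay, then exponential."""
    step = (start - end) / decay_steps
    epsilon, history = start, []
    for episode in range(episodes):
        if episode < decay_steps:
            epsilon -= step               # linear phase
        else:
            epsilon *= exponential_decay  # exponential phase
        history.append(epsilon)
    return history


eps = epsilon_schedule(310)
for episode in (10, 250, 260, 310):
    print(f'{episode:>4d}: {eps[episode - 1]:.3f}')
```

The values at episodes 10, 250, 260 and 310 can be compared with the `eps` column of the training log in this notebook.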

## Create our DDQN agent

We will use [TensorFlow](https://www.tensorflow.org/) to create our Double Deep Q-Network.

In [None]:
tf.keras.backend.clear_session()

In [None]:
ddqn = DDQNAgent(state_dim=state_dim,
                 num_actions=num_actions,
                 learning_rate=learning_rate,
                 gamma=gamma,
                 epsilon_start=epsilon_start,
                 epsilon_end=epsilon_end,
                 epsilon_decay_steps=epsilon_decay_steps,
                 epsilon_exponential_decay=epsilon_exponential_decay,
                 replay_capacity=replay_capacity,
                 architecture=architecture,
                 l2_reg=l2_reg,
                 tau=tau,
                 batch_size=batch_size)

In [None]:
ddqn.online_network.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Dense_1 (Dense)              (None, 256)               2816      
_________________________________________________________________
Dense_2 (Dense)              (None, 256)               65792     
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
Output (Dense)               (None, 3)                 771       
Total params: 69,379
Trainable params: 69,379
Non-trainable params: 0
_________________________________________________________________


## Run the experiment

### Set parameters


In [None]:
total_steps = 0
max_episodes = 1000

### Initialize variables

In [None]:
episode_time, navs, market_navs, diffs, episode_eps = [], [], [], [], []

## Visualization function

In [None]:
def track_results(episode, nav_ma_100, nav_ma_10,
                  market_nav_100, market_nav_10,
                  win_ratio, total, epsilon):
    """Print one progress line: episode, elapsed time, agent and market
    returns (100- and 10-episode means), win ratio and current epsilon."""
    template = '{:>4d} | {} | Agent: {:>6.1%} ({:>6.1%}) | '
    template += 'Market: {:>6.1%} ({:>6.1%}) | '
    template += 'Wins: {:>5.1%} | eps: {:>6.3f}'
    print(template.format(episode, format_time(total), 
                          nav_ma_100-1, nav_ma_10-1, 
                          market_nav_100-1, market_nav_10-1, 
                          win_ratio, epsilon))

## Train the agent

In [None]:
start = time()
results = []
for episode in range(1, max_episodes + 1):
    this_state = trading_environment.reset()
    for episode_step in range(max_episode_steps):
        action = ddqn.epsilon_greedy_policy(this_state.reshape(-1, state_dim))
        next_state, reward, done, _ = trading_environment.step(action)
    
        ddqn.memorize_transition(this_state, 
                                 action, 
                                 reward, 
                                 next_state, 
                                 0.0 if done else 1.0)
        if ddqn.train:
            ddqn.experience_replay()
        if done:
            break
        this_state = next_state

    # get DataFrame with sequence of actions, returns and nav values
    result = trading_environment.env.simulator.result()
    
    # get results of last step
    final = result.iloc[-1]

    # apply return (net of cost) of last action to last starting nav 
    nav = final.nav * (1 + final.strategy_return)
    navs.append(nav)

    # market nav 
    market_nav = final.market_nav
    market_navs.append(market_nav)

    # track difference between agent and market NAV results
    diff = nav - market_nav
    diffs.append(diff)
    
    if episode % 10 == 0:
        track_results(episode, 
                      # show mov. average results for 100 (10) periods
                      np.mean(navs[-100:]), 
                      np.mean(navs[-10:]), 
                      np.mean(market_navs[-100:]), 
                      np.mean(market_navs[-10:]), 
                      # share of agent wins, defined as higher ending nav
                      np.sum([s > 0 for s in diffs[-100:]])/min(len(diffs), 100), 
                      time() - start, ddqn.epsilon)
    if len(diffs) > 25 and all([r > 0 for r in diffs[-25:]]):
        print(result.tail())
        break

trading_environment.close()

  10 | 00:00:02 | Agent: -39.3% (-39.3%) | Market:   5.6% (  5.6%) | Wins: 20.0% | eps:  0.960


  90 | 01:03:10 | Agent: -20.3% (-21.6%) | Market:  24.4% ( -2.1%) | Wins: 23.3% | eps:  0.644


 130 | 01:39:42 | Agent: -11.4% (-27.1%) | Market:  36.4% ( 80.3%) | Wins: 26.0% | eps:  0.485


 150 | 01:58:21 | Agent:  -8.2% (  5.2%) | Market:  37.9% ( 41.0%) | Wins: 26.0% | eps:  0.406


 170 | 02:16:38 | Agent:  -4.0% (  6.7%) | Market:  32.7% ( 37.8%) | Wins: 28.0% | eps:  0.327


 220 | 03:07:23 | Agent:   2.4% ( 31.8%) | Market:  41.2% ( 41.3%) | Wins: 33.0% | eps:  0.129
 230 | 03:17:07 | Agent:   8.0% ( 29.2%) | Market:  35.5% ( 24.2%) | Wins: 37.0% | eps:  0.089
 240 | 03:27:13 | Agent:  11.8% ( 47.7%) | Market:  32.7% (-14.7%) | Wins: 41.0% | eps:  0.050
 250 | 03:42:22 | Agent:  11.5% (  1.9%) | Market:  31.5% ( 28.6%) | Wins: 42.0% | eps:  0.010
 260 | 03:56:34 | Agent:  15.7% ( 44.6%) | Market:  33.1% ( 31.6%) | Wins: 46.0% | eps:  0.009
 270 | 04:10:58 | Agent:  17.3% ( 22.6%) | Market:  31.5% ( 21.8%) | Wins: 46.0% | eps:  0.008
 280 | 04:25:28 | Agent:  20.6% ( 29.2%) | Market:  33.6% ( 84.3%) | Wins: 45.0% | eps:  0.007
 290 | 04:40:05 | Agent:  21.0% ( -1.6%) | Market:  28.4% (  9.6%) | Wins: 46.0% | eps:  0.007
 300 | 04:54:53 | Agent:  22.9% ( 30.2%) | Market:  31.0% ( 46.1%) | Wins: 45.0% | eps:  0.006
 310 | 05:09:40 | Agent:  26.4% ( 28.6%) | Market:  31.0% ( 36.7%) | Wins: 46.0% | eps:  0.005


KernelInterrupted: Execution interrupted by the Jupyter kernel.
