**STOCKS ROBOT**

* The purpose of this project is to maximize profit from trading in the stock market during 2005-2021.
* I chose only 3 popular tech stocks (GOOGL, AMZN, AAPL) for this project.

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# install libraries
!pip install yfinance
!pip install stable-baselines3
!pip install optuna
url = "https://launchpad.net/~mario-mariomedina/+archive/ubuntu/talib/+files"
!wget $url/libta-lib0_0.4.0-oneiric1_amd64.deb -qO libta.deb
!wget $url/ta-lib0-dev_0.4.0-oneiric1_amd64.deb -qO ta.deb
!dpkg -i libta.deb ta.deb
!pip install ta-lib


* I also downloaded indices that might correlate with these tech stocks, such as the NASDAQ index. They may help with the prediction.

In [2]:
# loading data
import talib
import yfinance as yf

stocks = ['GOOGL', 'AMZN', 'AAPL']

# S&P 500, NASDAQ Composite, 10-year Treasury yield
index = ['^GSPC', '^IXIC', '^TNX']
yf_interval = "1d"

df_o = yf.download(
    tickers=stocks + index,
    interval=yf_interval,
    start="2005-01-01",
    end="2021-12-31",      # end date is exclusive
    group_by='ticker',     # columns are grouped as (ticker, field)
    auto_adjust=True,      # prices adjusted for splits and dividends
    prepost=True,
    threads=True,
    proxy=None)

[*********************100%***********************]  6 of 6 completed
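
One way to check whether an index might help with the prediction is to look at the correlation of daily returns. The sketch below is only an illustration on synthetic data: the random prices and the `NOISE_Close` column are assumptions, not the downloaded data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_days = 500

# Synthetic daily returns: a shared "market" factor plus stock-specific noise.
market = rng.normal(0.0005, 0.01, n_days)
returns = pd.DataFrame({
    "NASDAQ_Close": market,
    "AAPL_Close": market + rng.normal(0.0, 0.01, n_days),
    "NOISE_Close": rng.normal(0.0, 0.01, n_days),  # unrelated series
})

# Turn returns into price levels, then recover returns with pct_change().
prices = 100.0 * (1.0 + returns).cumprod()
daily_ret = prices.pct_change().dropna()

# Pearson correlation of each column with the index.
corr_with_index = daily_ret.corr()["NASDAQ_Close"].drop("NASDAQ_Close")
print(corr_with_index)
```

Correlating returns rather than raw prices avoids the spurious correlation that two trending price series show almost by construction.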


In [3]:
# renaming columns
df = df_o.copy()
idx_name = {'^IXIC':'NASDAQ', '^TNX':'BOND', '^GSPC':'SP500'}
df.columns = [(i[0] if i[0] not in idx_name else idx_name[i[0]])+"_"+i[1] for i in df.columns]

In [4]:
# dropping column with wrong data
df = df.drop(['BOND_Volume'],axis=1)

# checking duplicates
df = df.reset_index()
df = df.drop_duplicates(subset=['Date'])
df = df.set_index('Date')
df.shape

(4279, 29)

* Here I added some common technical indicators that will help with the prediction

In [5]:
# adding extra features

# rsi
from talib import RSI

for i in stocks:
    df[i+'_rsi'] = RSI(df[i+'_Close'], timeperiod=14)

# cci
def CCI(df_main, i, ndays): 
    df = df_main.copy()
    # typical price
    df[i+'_TP'] = (df[i+'_High'] + df[i+'_Low'] + df[i+'_Close']) / 3 
    df[i+'_sma'] = df[i+'_TP'].rolling(ndays).mean()
    # mean absolute deviation, computed by hand because Series.mad() was removed in pandas 2.0
    df[i+'_mad'] = df[i+'_TP'].rolling(ndays).apply(lambda x: np.abs(x - x.mean()).mean(), raw=True)
    df[i+'_cci'] = (df[i+'_TP'] - df[i+'_sma']) / (0.015 * df[i+'_mad']) 
    return df[i+'_cci']

for i in stocks:
    df[i+'_cci'] = CCI(df, i, 20)

# macd
for i in stocks:
    short_ema =  df[i+'_Close'].ewm(span=12, adjust=False).mean()
    long_ema = df[i+'_Close'].ewm(span=26, adjust=False).mean()
#     signal = df[i+'_Close'].ewm(span=9, adjust=False).mean()
    df[i+'_macd'] =  short_ema - long_ema
#     df[i+'_short_ema'] = short_ema

# removing rows up to the needed history range of extra features
df = df.iloc[26:,:]
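
The notebook keeps only the MACD line as a feature. For reference, the conventional signal line is an EMA of the MACD line itself rather than of the closing price. The following is a minimal sketch on a synthetic price series; the function name `macd` and the input data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def macd(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9) -> pd.DataFrame:
    """Return the MACD line, its signal line, and the histogram."""
    fast_ema = close.ewm(span=fast, adjust=False).mean()
    slow_ema = close.ewm(span=slow, adjust=False).mean()
    macd_line = fast_ema - slow_ema
    # The signal line smooths the MACD line, not the price.
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    return pd.DataFrame({
        "macd": macd_line,
        "signal": signal_line,
        "hist": macd_line - signal_line,
    })

# Synthetic steady uptrend.
close = pd.Series(np.linspace(100.0, 150.0, 120))
out = macd(close)
print(out.tail(3))
```

With `adjust=False` both EMAs start at the first closing price, so the MACD line starts at zero and carries no NaN values.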

* I used the data from 2005 to 2019 to train the model.
* The year 2020 was used for validation, and 2021 for testing.
* 60 days of price history were used as features for the deep learning model in the next section.
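
A 60-day history can be turned into model inputs with a sliding window. The sketch below is an assumption about how such windows might be built (the helper `make_windows` and the next-day target are hypothetical); it also shows why the validation and test sets need the last 60 rows of the preceding period prepended.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(values: np.ndarray, length: int = 60):
    """values has shape (n_days, n_features).

    Returns X with shape (n_samples, length, n_features) and
    y with shape (n_samples, n_features), where y is the day after each window.
    """
    # sliding_window_view puts the window axis last: (n_windows, n_features, length)
    windows = sliding_window_view(values, window_shape=length, axis=0)
    windows = windows.transpose(0, 2, 1)
    X = windows[:-1]        # the last window has no following day to predict
    y = values[length:]
    return X, y

values = np.arange(200.0).reshape(100, 2)   # 100 days, 2 features
X, y = make_windows(values, length=60)
print(X.shape, y.shape)
```

The first target is day 60, so without the prepended rows the first 60 days of a validation or test period would have no complete window.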

In [6]:
############################################### MODELING  ##########################################

In [7]:
############## REINFORCEMENT LEARNING

* Next, I tried a reinforcement learning model using the PPO algorithm.
* The setting of the model is as below:

environment - stock market with 3 tech stocks and no transaction fee.
agent - the trading robot.
state - current cash, latest stock prices, technical indicators, and current shares held in the portfolio.
action - buy or sell, up to the maximum number of shares per trade.
reward - change in total asset value (cash plus holdings) from one day to the next.
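
The state described above is a flat vector: cash first, then the features of each stock, then the shares held. The sketch below mirrors the indexing used by the environment classes; the helper names and the toy numbers are assumptions for illustration.

```python
import numpy as np

STOCK_DIM = 3
FEATURES = 4            # Close, rsi, cci, macd
REWARD_SCALING = 1e-4

def price_index(i: int) -> int:
    """Position of stock i's Close price in the state vector."""
    return 1 + i * FEATURES

def share_index(i: int) -> int:
    """Position of stock i's share count in the state vector."""
    return 1 + STOCK_DIM * FEATURES + i

def total_asset(state) -> float:
    """Cash plus the market value of all holdings."""
    cash = state[0]
    prices = np.array([state[price_index(i)] for i in range(STOCK_DIM)])
    shares = np.array(state[1 + STOCK_DIM * FEATURES:])
    return cash + float(prices @ shares)

# Toy state: [cash] + 3 x [Close, rsi, cci, macd] + [shares per stock]
today = [1000.0] + [10.0, 50, 0, 0] + [20.0, 50, 0, 0] + [30.0, 50, 0, 0] + [5, 0, 2]
tomorrow = [1000.0] + [11.0, 50, 0, 0] + [20.0, 50, 0, 0] + [28.0, 50, 0, 0] + [5, 0, 2]

reward = (total_asset(tomorrow) - total_asset(today)) * REWARD_SCALING
print(len(today), total_asset(today), total_asset(tomorrow), reward)
```

The vector has 1 + 4*3 + 3 = 16 entries, which matches the shape of the observation space declared in the environment.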

* Note that I had to preprocess the data again, since the technical indicators are also used as features.

In [8]:
# adjusting data preprocessing
# train test split
df_train = df.loc[:'2019']
df_val = df.loc['2020']
df_test = df.loc['2021']

# adding prior sequences for val and test
length = 60
df_val = pd.concat([df_train.iloc[-length:,:],df_val])
df_test = pd.concat([df_val.iloc[-length:,:],df_test])

# feature selection
select_feat = ['Close',
               'rsi',
               'cci',
               'macd'
              ]

feat_ls = [s+'_'+feat for s in stocks for feat in select_feat]

out_col_ls = [s+'_Close' for s in stocks]

df_train = df_train[feat_ls]
df_val = df_val[feat_ls]
df_test = df_test[feat_ls]

def xy_split(df):
    return df[feat_ls], df[out_col_ls]

X_train, y_train = xy_split(df_train)
X_val, y_val = xy_split(df_val)
X_test, y_test = xy_split(df_test)

# scaling
from sklearn.preprocessing import MinMaxScaler
sc_X = MinMaxScaler()
sc_X.fit(X_train)
X_train = sc_X.transform(X_train)
X_val = sc_X.transform(X_val)
X_test = sc_X.transform(X_test)

sc_y = MinMaxScaler()
sc_y.fit(y_train)
y_train = sc_y.transform(y_train)
y_val = sc_y.transform(y_val)
y_test = sc_y.transform(y_test)

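# remove 0: drop training rows where any of the first three scaled columns
# (GOOGL_Close, GOOGL_rsi, GOOGL_cci) is 0, i.e. sits at its training minimum
selected_idx = [i for i in range(len(X_train)) if X_train[i,0]>0 and X_train[i,1]>0 and X_train[i,2]>0]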
X_train = X_train[selected_idx]
y_train = y_train[selected_idx]
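
Because the scalers are fitted on the training data only, later values can fall outside [0, 1]. The sketch below reproduces the default min-max formula in plain NumPy to show this; the `MinMax` class and the toy numbers are assumptions, not the sklearn implementation.

```python
import numpy as np

class MinMax:
    """Hand-rolled min-max scaling to [0, 1], fitted per column."""

    def fit(self, X: np.ndarray) -> "MinMax":
        self.min_ = X.min(axis=0)
        self.range_ = X.max(axis=0) - self.min_
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        return (X - self.min_) / self.range_

    def inverse_transform(self, Xs: np.ndarray) -> np.ndarray:
        return Xs * self.range_ + self.min_

train = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])
later = np.array([[40.0, 0.0]])      # outside the training range

sc = MinMax().fit(train)             # fit on training data only
later_sc = sc.transform(later)
print(later_sc)                      # first column above 1, second below 0
print(sc.inverse_transform(later_sc))
```

Fitting on the training split alone avoids leaking information from 2020 and 2021, and `inverse_transform` recovers the original prices exactly, which is how the environments get back unscaled prices for trading.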


In [9]:
# env train
from gym.utils import seeding
import gym
from gym import spaces
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import pickle
import sys

# shares normalization factor
# max shares per trade
HMAX_NORMALIZE = 100
# initial amount of money we have in our account
INITIAL_ACCOUNT_BALANCE=1000000
MAX_ACCOUNT_BALANCE = 100e6
MAX_SHARE = 1e6
# total number of stocks in our portfolio
STOCK_DIM = 3
TRANSACTION_FEE_PERCENT = 0
REWARD_SCALING = 1e-4
# number of features per stock
FEATURES = len(select_feat)

class StockEnvTrain(gym.Env):
    metadata = {'render.modes': ['human']}
    
    def __init__(self, df_sc):

        self.day = 1
        self.df_sc = df_sc
        self.df = sc_X.inverse_transform(df_sc)
        self.action_space = spaces.Box(low = -1, high = 1,shape = (STOCK_DIM,)) 
        # Shape = [Current Balance]+[4 features for 3 stocks]+[owned shares for 3 stocks] 
        self.observation_space = spaces.Box(low=0, high=np.inf, shape = (1+FEATURES*STOCK_DIM+STOCK_DIM,))
        self.terminal = False     
        # initialize state
        self.state = [INITIAL_ACCOUNT_BALANCE] +\
                      self.df[self.day-1].tolist() +\
                      [0]*STOCK_DIM
        
        self.state_sc = [INITIAL_ACCOUNT_BALANCE/MAX_ACCOUNT_BALANCE] +\
                      self.df_sc[self.day-1].tolist() +\
                      [0]*STOCK_DIM

        # initialize reward
        self.reward = 0
        # memorize all the total balance change
        self.asset_memory = [INITIAL_ACCOUNT_BALANCE]
        self._seed()

    def _sell_stock(self, index, action):
        # perform sell action based on the sign of the action
        # sell the amount suggested by action, but not more than the shares held in the portfolio
        price_idx = index*FEATURES+1
        share_idx = index+STOCK_DIM*FEATURES+1
        sell_amount = min(abs(action), self.state[share_idx])
        # update cash balance
        self.state[0] += self.state[price_idx]*sell_amount*(1- TRANSACTION_FEE_PERCENT)
        # update shares held in the portfolio
        self.state[share_idx] -= sell_amount

    def _buy_stock(self, index, action):
        # perform buy action based on the sign of the action
        available_amount = self.state[0] // self.state[index*FEATURES+1]
        # update cash. buy the amount suggested by action, limited by available cash
        self.state[0] -= self.state[index*FEATURES+1]*min(available_amount, action)* \
                          (1+ TRANSACTION_FEE_PERCENT)        
        # update stocks balance
        self.state[index+STOCK_DIM*FEATURES+1] += min(available_amount, action)
    
    def step(self, actions):

        self.terminal = self.day == self.df.shape[0]-1

        if self.terminal:
            return self.state_sc, self.reward, self.terminal,{}

        else:
            # normalize 
            actions = actions * HMAX_NORMALIZE   
            begin_total_asset = self.state[0]+ \
                sum( 
                np.array([self.state[i*FEATURES+1] for i in range(STOCK_DIM)]) *\
                np.array(self.state[FEATURES*STOCK_DIM+1:])
                )
            argsort_actions = np.argsort(actions)
            sell_index = argsort_actions[:np.where(actions < 0)[0].shape[0]]
            buy_index = argsort_actions[::-1][:np.where(actions > 0)[0].shape[0]]           
            
            for index in sell_index:
                self._sell_stock(index, actions[index])

            for index in buy_index:
                self._buy_stock(index, actions[index])

            self.day += 1

            #load next state
            self.state = [self.state[0]] +\
                          self.df[self.day-1].tolist() +\
                          self.state[FEATURES*STOCK_DIM+1:]
        
            self.state_sc = [self.state[0]/MAX_ACCOUNT_BALANCE] +\
                              self.df_sc[self.day-1].tolist() +\
                              [i/MAX_SHARE for i in self.state[FEATURES*STOCK_DIM+1:]]

            end_total_asset = self.state[0]+ \
                sum( 
                np.array([self.state[i*FEATURES+1] for i in range(STOCK_DIM)]) *\
                np.array(self.state[FEATURES*STOCK_DIM+1:])
                )
            self.asset_memory.append(end_total_asset)
            
            self.reward = end_total_asset - begin_total_asset            
            self.reward = self.reward*REWARD_SCALING
        
        return self.state_sc, self.reward, self.terminal, {}
        
    def reset(self):  
        self.asset_memory = [INITIAL_ACCOUNT_BALANCE]
        self.day = 1
        self.terminal = False 
        self.state = [INITIAL_ACCOUNT_BALANCE] +\
                      self.df[self.day-1].tolist() +\
                      [0]*STOCK_DIM
        self.state_sc = [INITIAL_ACCOUNT_BALANCE/MAX_ACCOUNT_BALANCE] +\
                      self.df_sc[self.day-1].tolist() +\
                      [0]*STOCK_DIM

        return self.state_sc
    
    def render(self, mode='human',close=False):
        return self.state
    
    def _seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]
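
The `step` method splits the action vector into sells and buys with `np.argsort`. The sketch below isolates that logic; the function name `split_actions` is an assumption, while the ordering rule is taken from the environment code.

```python
import numpy as np

HMAX_NORMALIZE = 100   # max shares per trade

def split_actions(raw_actions):
    """Scale actions from [-1, 1] to share counts and order the trades."""
    actions = np.asarray(raw_actions, dtype=float) * HMAX_NORMALIZE
    order = np.argsort(actions)                 # most negative first
    n_sell = int((actions < 0).sum())
    n_buy = int((actions > 0).sum())
    sell_index = order[:n_sell]                 # largest sell first
    buy_index = order[::-1][:n_buy]             # largest buy first
    return actions, sell_index, buy_index

actions, sell_index, buy_index = split_actions([0.5, -0.2, 0.1])
print(actions, sell_index, buy_index)
```

Sells are executed before buys, so the cash they free up is available for the purchases made in the same step.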
            

In [10]:
# env train 2
# StockEnvTrain2 is identical to StockEnvTrain (same imports, constants, and
# methods), so it is defined as a subclass instead of a second copy. A separate
# class is kept so that evaluation on the training data gets its own environment.
class StockEnvTrain2(StockEnvTrain):
    pass

In [11]:
# env val
# Same logic, imports, and constants as StockEnvTrain. The only difference is
# that an episode starts at day `length`, which skips the 60 rows of prior
# history that were prepended to the validation data.
class StockEnvVal(StockEnvTrain):

    def __init__(self, df_sc):
        super().__init__(df_sc)
        # re-initialize the state at the validation start day
        self.reset()

    def reset(self):  
        self.asset_memory = [INITIAL_ACCOUNT_BALANCE]
        self.day = length
        self.terminal = False 
        self.state = [INITIAL_ACCOUNT_BALANCE] +\
                      self.df[self.day-1].tolist() +\
                      [0]*STOCK_DIM
        self.state_sc = [INITIAL_ACCOUNT_BALANCE/MAX_ACCOUNT_BALANCE] +\
                      self.df_sc[self.day-1].tolist() +\
                      [0]*STOCK_DIM

        return self.state_sc

* A good practice is to use multiple environments in parallel, but due to some errors I could not, so I used only 1 environment for training.
* Random search is used for hyperparameter tuning here. To avoid repeating the search, I only show the best parameters.
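
Random search over a small discrete grid can also be written by sampling without replacement from all combinations, which guarantees unique configurations without a retry loop. This is a sketch of an alternative, not the notebook's `build_config`; the candidate values are the ones used in this notebook, and the function name `sample_configs` is an assumption.

```python
import itertools
import random

search_space = {
    "lr": [1e-3, 1e-4, 1e-5],
    "n_steps": [256, 512, 1024, 2048],
    "batch_size": [128, 256, 512, 1024],
    "ent_coef": [1e-6, 1e-7, 1e-8],
}

def sample_configs(space: dict, samples: int, seed=None) -> list:
    """Draw `samples` distinct parameter combinations from the grid."""
    keys = list(space)
    grid = [dict(zip(keys, combo)) for combo in itertools.product(*space.values())]
    rng = random.Random(seed)
    return rng.sample(grid, samples)   # raises ValueError if samples > len(grid)

configs = sample_configs(search_space, 5, seed=0)
for params in configs:
    print(params)
```

The grid here has 3 * 4 * 4 * 3 = 144 combinations, so enumerating it is cheap; for much larger spaces, drawing each parameter independently is the more practical choice.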

In [12]:
import matplotlib.pyplot as plt
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv
import time
from numpy.random import choice

# clearing gpu cache
import torch
torch.cuda.empty_cache()

# adding extra day 
df_train_rl = np.append(X_train, X_train[-1].reshape(1,-1), axis=0)
df_val_rl = np.append(X_val, X_val[-1].reshape(1,-1), axis=0)


# building environment 
# single environment
n_env = 1
env_train = DummyVecEnv([lambda: StockEnvTrain(df_train_rl)])
env_train2 = DummyVecEnv([lambda: StockEnvTrain2(df_train_rl)])
env_val = DummyVecEnv([lambda: StockEnvVal(df_val_rl)])

# env_train.reset()
# env_train2.reset()
# env_val.reset()

# model
from stable_baselines3.common.callbacks import EvalCallback

# random search params
def build_config(samples):
    
    lr = [1e-3, 1e-4, 1e-5]
    n_steps = [256,512,1024,2048]
    batch_size = [128,256,512,1024]
    ent_coef = [1e-6,1e-7,1e-8]
    
    config = []
    i = 0
    lim = 100
                  
    while len(config) < samples:
            
        params = {}
        params['lr'] = choice(lr)
        params['n_steps'] = choice(n_steps)
        params['batch_size'] = choice(batch_size)
        params['ent_coef'] = choice(ent_coef)

        if params not in config:
            config.append(params)
            
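        i += 1
        # give up if too many draws are needed to find unique configurations
        if i == lim:
            raise RuntimeError('could not sample enough unique configurations')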
        
    return config

def perf_plot(version):
    
    data = np.load(f"./logs/m{version}_train/evaluations.npz")
    train_result = [INITIAL_ACCOUNT_BALANCE + total_r/REWARD_SCALING for total_r in np.mean(data['results'], axis=1)]
    plt.figure(figsize=(12,8))
    plt.title('training performance')
    plt.xlabel('episode')
    plt.ylabel('asset')
    plt.plot(np.arange(len(train_result)), train_result)
    time.sleep(2)

    # validation
    data = np.load(f"./logs/m{version}_val/evaluations.npz")
    val_result = [INITIAL_ACCOUNT_BALANCE + total_r/REWARD_SCALING for total_r in np.mean(data['results'], axis=1)]

    plt.figure(figsize=(12,8))
    plt.title('val performance')
    plt.xlabel('episode')
    plt.ylabel('asset')
    plt.plot(np.arange(len(val_result)), val_result)

    best_val_id = np.argmax(val_result)
    print('train performance: {}, val performance: {}'.format(train_result[best_val_id], val_result[best_val_id]))

def train_PPO(env_train, env_train2, env_val, version, params, timesteps):
    start = time.time()
    eval_train = EvalCallback(env_train2, 
                             log_path=f"./logs/m{version}_train/", 
                             n_eval_episodes = 3,
                             eval_freq=3800//n_env,
                             deterministic=True, 
                             render=False)

    eval_val = EvalCallback(env_val, 
                             log_path=f"./logs/m{version}_val/", 
                             best_model_save_path =f"./mod/m{version}_val/",
                             n_eval_episodes = 3,
                             eval_freq=3800//n_env,
#                              eval_freq=(params['n_steps']+10)//n_env,
                             deterministic=True, 
                             render=False)

    model = PPO('MlpPolicy', 
                env=env_train,
                n_steps = params['n_steps'],
                learning_rate = params['lr'],
                ent_coef =  params['ent_coef'],
                batch_size = params['batch_size']
                )

    model.learn(total_timesteps=timesteps,
                callback=[eval_train, eval_val]
               )
    end = time.time()
    print('version: ', version, 'Training time (PPO): ', (end - start) / 60, ' minutes')
    perf_plot(version)
    
# start training
config = build_config(5)
# the random search loop is commented out to avoid repeating the search
# for version, params in enumerate(config):
#     print(f'========================training: {params}==========================')
#     train_PPO(env_train, env_train2, env_val, version, params, timesteps=760000)
# best parameters found by the search
params = {'lr': 0.001, 'n_steps': 256, 'batch_size': 128, 'ent_coef': 1e-06}


version = 0
print(f'========================training: {params}==========================')
train_PPO(env_train, env_train2, env_val, version, params, timesteps=760000)

Eval num_timesteps=3800, episode_reward=677.31 +/- 0.00
Episode length: 3746.00 +/- 0.00
New best mean reward!
Eval num_timesteps=3800, episode_reward=20.91 +/- 0.00
Episode length: 254.00 +/- 0.00
New best mean reward!
Eval num_timesteps=7600, episode_reward=1486.01 +/- 0.00
Episode length: 3746.00 +/- 0.00
New best mean reward!
Eval num_timesteps=7600, episode_reward=47.85 +/- 0.00
Episode length: 254.00 +/- 0.00
New best mean reward!
Eval num_timesteps=11400, episode_reward=2459.32 +/- 0.00
Episode length: 3746.00 +/- 0.00
New best mean reward!
Eval num_timesteps=11400, episode_reward=55.84 +/- 0.00
Episode length: 254.00 +/- 0.00
New best mean reward!
Eval num_timesteps=15200, episode_reward=3120.40 +/- 0.00
Episode length: 3746.00 +/- 0.00
New best mean reward!
Eval num_timesteps=15200, episode_reward=62.68 +/- 0.00
Episode length: 254.00 +/- 0.00
New best mean reward!
Eval num_timesteps=19000, episode_reward=4577.46 +/- 0.00
Episode length: 3746.00 +/- 0.00
New best mean reward!
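
The episode rewards in the log can be converted back to an approximate final asset value in the same way `perf_plot` does, assuming the episode reward is the sum of the scaled daily asset changes. Entries with episode length 3746 come from the training data and those with length 254 from the validation data.

```python
INITIAL_ACCOUNT_BALANCE = 1_000_000
REWARD_SCALING = 1e-4

def reward_to_asset(episode_reward: float) -> float:
    """Undo the reward scaling and add the starting balance."""
    return INITIAL_ACCOUNT_BALANCE + episode_reward / REWARD_SCALING

val_asset = reward_to_asset(62.68)        # validation, num_timesteps=15200
train_asset = reward_to_asset(4577.46)    # training, num_timesteps=19000
print(round(val_asset), round(train_asset))
```

The figures are approximate: on the terminal step the environment returns the previous step's reward once more, so the summed reward can slightly overstate the true asset change.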


* The result is just a little bit lower than the previous strategy.

reference

@article{finrl2020,
    author  = {Liu, Xiao-Yang and Yang, Hongyang and Chen, Qian and Zhang, Runjia and Yang, Liuqing and Xiao, Bowen and Wang, Christina Dan},
    title   = {{FinRL}: A deep reinforcement learning library for automated stock trading in quantitative finance},
    journal = {Deep RL Workshop, NeurIPS 2020},
    year    = {2020}
}
