https://github.com/hackthemarket/gym-trading/blob/master/gym_trading/envs/TradingEnv.ipynb

TODO:
- test data, val data
- multiple stocks?
- bitcoin data env?
    - quandl bter (200 results) bitfinex (26), BCHARTs
- financial metrics e.g.
    - Sharpe ratio http://www.cs.utexas.edu/~ai-lab/pubs/AMEC04-plat.pdf
    - return
    - dummy score
        - all buy, all hold, all sell
        - random etc
    - quantopians
- add more observational data
    - [x] the last few steps - add memory
    - [ ] sentiment? e.g. https://www.quandl.com/data/NS1-FinSentS-Web-News-Sentiment
    - [ ] overall stock market e.g. https://www.quandl.com/data/UMICH/SOC4-University-of-Michigan-Consumer-Survey-Index-of-Consumer-Sentiment-Within-Regions
- replay https://github.com/matthiasplappert/keras-rl/issues/40
- or try openai baseline with tensorflow
- model
    - cnn
    - lstm
- unit tests
    - env should give poor result with random steps, only buys, only holds
    - model should overfit on small amount of data
    
- [x] pretraining? Helps a lot: it lets keras-rl beat the market by a few percent initially
 bugs:
 - [x] seems to be discontinuities causing huge navs e.g. 1e51
 
 
 regression vs classification
 
 window length and memory
 
 experience replay
 
I relied heavily on the [arXiv:1612.01277](https://arxiv.org/abs/1706.10059) paper to understand the problem and for model design ideas.

In [1]:
# plotting
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

# numeric
import quandl
import numpy as np
from numpy import random
import pandas as pd

# utils
from tqdm import tqdm_notebook as tqdm
from collections import Counter
import pdb
import tempfile
import logging
import time

# logging
logger = log = logging.getLogger(__name__)
log.setLevel(logging.INFO)
logging.basicConfig()
log.info('%s logger started.', __name__)

INFO:__main__:__main__ logger started.


In [2]:
# reinforcement learning
import gym
from gym import error, spaces, utils
from gym.utils import seeding

from keras import regularizers
from keras.models import Sequential
from keras.layers import (Flatten, Dense, Activation, BatchNormalization,
                          Conv1D, Conv2D, InputLayer, Dropout, Reshape)
from keras.optimizers import Adam
from keras.layers.advanced_activations import LeakyReLU
from keras.activations import relu

Using TensorFlow backend.


In [3]:
import os
os.sys.path.append(os.path.abspath('.'))
%reload_ext autoreload
%autoreload 2

# from src.callbacks.rl_callbacks import ReduceLROnPlateau, TrainIntervalLoggerTQDMNotebook

# Environment

Day trading over 256 days. We scale and augment the training data.

You can see the base environment class [here](https://github.com/openai/gym/blob/master/gym/core.py#L13) and OpenAI's docs [here](https://gym.openai.com/docs).
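
To make the interface concrete, here is a minimal sketch of the `reset`/`step` contract that a gym-style environment follows. `ToyPortfolioEnv` is hypothetical and is not the `PortfolioEnv` used below: it assumes the action is a vector of portfolio weights and the reward is the log return of one step, with no trading costs.

```python
import numpy as np

class ToyPortfolioEnv:
    """Hypothetical stand-in for a gym-style portfolio environment."""

    def __init__(self, price_relatives, window=3):
        # price_relatives[t, i] = price of asset i at t divided by its price at t-1
        self.y = np.asarray(price_relatives, dtype=float)
        self.window = window
        self.t = None

    def reset(self):
        # start after the first full window and return it as the observation
        self.t = self.window
        return self.y[self.t - self.window:self.t]

    def step(self, action):
        # normalise the action so the weights are non-negative and sum to one
        w = np.clip(np.asarray(action, dtype=float), 0, None)
        w = w / w.sum() if w.sum() > 0 else np.full(len(w), 1.0 / len(w))
        reward = float(np.log(np.dot(w, self.y[self.t])))  # log return of this step
        self.t += 1
        done = self.t >= len(self.y)
        obs = self.y[self.t - self.window:self.t]
        return obs, reward, done, {"weights": w}

rng = np.random.RandomState(0)
env_toy = ToyPortfolioEnv(1 + 0.01 * rng.randn(20, 4))
obs = env_toy.reset()
done, total = False, 0.0
while not done:
    obs, reward, done, info = env_toy.step(np.ones(4))  # equal weights every step
    total += reward
print(obs.shape, total)
```

Summing log returns over an episode gives the log of the final portfolio value, which is why a log-return reward is convenient here.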


In [4]:
from src.environments.portfolio import PortfolioEnv

In [5]:
df_train = pd.read_hdf('./data/poliniex_30m.hf',key='train')
env = PortfolioEnv(
    df=df_train,
    steps=30, 
    scale=True, 
    augument=0.0005    
)
env.seed(0)  # call the seed() method; `env.seed = 0` would overwrite it with an int

df_test = pd.read_hdf('./data/poliniex_30m.hf',key='test')
env_test = PortfolioEnv(
    df=df_test,
    steps=30, 
    scale=True, 
    augument=0.00)
env_test.seed(0)

env.reset().shape

(5, 50, 3)

## SELU?

I tried SELU, but it didn't help. It's meant to replace batchnorm and ELU with fewer parameters.
Reports on how well it works are mixed; see this [reddit discussion](https://www.reddit.com/r/MachineLearning/comments/6g5tg1/r_selfnormalizing_neural_networks_improved_elu/).

In [6]:
 
# from keras import backend as K
# def selu(x):
#     """Scaled Exponential Linear Unit. (Klambauer et al., 2017)
#     # Arguments
#         x: A tensor or variable to compute the activation function for.
#     # References
#         - [Self-Normalizing Neural Networks](https://arxiv.org/abs/1706.02515)
#     """
#     alpha = 1.6732632423543772848170429916717
#     scale = 1.0507009873554804934193349852946
#     return scale * K.elu(x, alpha)
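
As a quick numerical check of what "self-normalizing" means, here is a NumPy sketch (not the Keras implementation) that applies SELU with the same constants to standard normal inputs. The output mean and variance should stay close to 0 and 1. The name `selu_np` and the sample size are illustrative choices.

```python
import numpy as np

ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu_np(x):
    """NumPy SELU: scale * x for x > 0, scale * alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    # np.minimum keeps expm1 from overflowing on the (unused) positive branch
    return SCALE * np.where(x > 0, x, ALPHA * np.expm1(np.minimum(x, 0)))

rng = np.random.RandomState(0)
z = rng.randn(1000000)  # inputs with mean 0 and variance 1
out = selu_np(z)
print('mean {:.3f}, variance {:.3f}'.format(out.mean(), out.var()))
```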

# Model

arXiv:1612.01277 indicated that CNNs are just as effective. That's great because I like them: they are fast, so I can try more things and see the results sooner. So we will be using a CNN model.


# Pretrain the Q model as a normal classification problem

We can pretrain on a regular (non-rl) classification problem. This might not be as elegant as end-to-end training but it helps with speed. 

It also helps me quickly test how a model fits the data (can it overfit? how well does it generalize?), so it's a good sanity check.
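
One way to frame the pretraining target is sketched below. This notebook does not show its labelling code, so the rule used here is an assumption: the label for each step is the asset with the highest return over the next period. The function name `make_classification_labels` and the toy prices are hypothetical.

```python
import numpy as np

def make_classification_labels(close):
    """close: array (n_steps, n_assets) of closing prices.

    Returns y of shape (n_steps - 1,), where y[t] is the index of the asset
    with the highest return between step t and step t + 1.
    """
    close = np.asarray(close, dtype=float)
    next_return = close[1:] / close[:-1] - 1.0
    return next_return.argmax(axis=1)

# toy prices for 3 assets over 4 steps
close = np.array([
    [1.0, 1.0, 1.0],
    [1.1, 1.0, 0.9],   # asset 0 rose the most
    [1.1, 1.2, 0.9],   # asset 1 rose the most
    [1.0, 1.1, 0.99],  # asset 2 rose the most
])
labels = make_classification_labels(close)
one_hot = np.eye(close.shape[1])[labels]  # targets for a softmax output
print(labels)  # [0 1 2]
```

A model trained on such labels with a softmax output and cross-entropy loss can then be checked for overfitting on a small slice of data before any RL training starts.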

In [7]:
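# augment the data to compensate for the low quantity
def random_shift(x, fraction):
    """Apply random multiplicative and additive jitter, clipped to the original range of x."""
    min_x, max_x = np.min(x), np.max(x)
    m = np.random.uniform(-fraction, fraction, size=x.shape) + 1  # scale factors near 1
    c = np.random.uniform(-fraction, fraction, size=x.shape) * x.std()  # offsets relative to the spread
    return np.clip(x * m + c, min_x, max_x)

def X_shift(X, fraction):
    """Jitter each X[:,:,i] slice independently and return a copy."""
    X = X.copy()
    for i in range(X.shape[2]):  # loop over the axis that is indexed below
        x = X[:,:,i]
        X[:,:,i] = random_shift(x, fraction)
    return X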
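
A quick check of what the augmentation does on a toy array. `random_shift` is repeated so the snippet runs on its own; the toy data and the 0.01 fraction are illustrative values, not the ones used for training.

```python
import numpy as np

def random_shift(x, fraction):
    min_x, max_x = np.min(x), np.max(x)
    m = np.random.uniform(-fraction, fraction, size=x.shape) + 1
    c = np.random.uniform(-fraction, fraction, size=x.shape) * x.std()
    return np.clip(x * m + c, min_x, max_x)

np.random.seed(0)
x = np.linspace(1.0, 2.0, 11)
x_aug = random_shift(x, fraction=0.01)

# largest absolute change introduced by the jitter
max_change = np.abs(x_aug - x).max()
print(x_aug.shape, max_change)
```

Because of the final `np.clip`, the augmented values never leave the original range, so the smallest and largest values can only move inward.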

In [8]:
# 50 times, 8 price values (open, close, volume...), 6 assets 42x6x8 BUT we want 50x6x8
# W, H, C 11x11x3
# Conv2D?
env.action_space.shape
env.src.asset_names

['BTCBTC', 'LTCBTC', 'DOGEBTC', 'DASHBTC', 'XMRBTC', 'XRPBTC']

In [9]:
env.reset().shape

(5, 50, 3)

In [10]:
env.observation_space

Box(5, 50, 3)

In [11]:
from keras.layers import Input, merge, Reshape
from keras.layers import concatenate, Conv2D
from keras.regularizers import l2, l1_l2
from keras.models import Model

window_length=50
nb_actions=env.action_space.shape[0]
reg=1e-8

# Next, we build a very simple model.
actor = Sequential()
actor.add(InputLayer(input_shape=(1,)+env.observation_space.shape))
actor.add(Reshape(env.observation_space.shape))
actor.add(Conv2D(
    filters=2,
    kernel_size=(1,3),
    kernel_regularizer=l2(reg),
    activation='relu'
))
actor.add(Conv2D(
    filters=20,
    kernel_size=(1,window_length-2),
    kernel_regularizer=l2(reg),
    activation='relu'
))
actor.add(Conv2D(
    filters=1,
    kernel_size=(1,1),
    kernel_regularizer=l2(reg),
    activation='relu'
))
actor.add(Flatten())
actor.add(Dense(nb_actions))
actor.add(Activation('softmax'))
actor.summary()  # summary() prints the table itself and returns None

action_input = Input(shape=(nb_actions,), name='action_input')
observation_input = Input(shape=(1,)+env.observation_space.shape, name='observation_input')
flattened_observation = Flatten()(observation_input)
x = concatenate([action_input, flattened_observation])
x = Dense(32)(x)
x = Activation('relu')(x)
x = Dense(32)(x)
# x = Activation('relu')(x)
# x = Dense(32)(x)
x = Activation('relu')(x)
x = Dense(1)(x)
x = Activation('linear')(x)
critic = Model(inputs=[action_input, observation_input], outputs=x)
critic.summary()  # summary() prints the table itself and returns None

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 1, 5, 50, 3)       0         
_________________________________________________________________
reshape_1 (Reshape)          (None, 5, 50, 3)          0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 5, 48, 2)          20        
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 5, 1, 20)          1940      
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 5, 1, 1)           21        
_________________________________________________________________
flatten_1 (Flatten)          (None, 5)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 12)                72        
__________
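
The output shapes and parameter counts in the actor summary can be reproduced by hand. The sketch below does the arithmetic for stride 1 and 'valid' padding, which are the Keras `Conv2D` defaults; the helper `conv2d_valid` is hypothetical and only mirrors the formula.

```python
def conv2d_valid(shape, kernel, filters):
    """Output shape and parameter count of a stride-1, 'valid' Conv2D layer."""
    h, w, c = shape
    kh, kw = kernel
    out_shape = (h - kh + 1, w - kw + 1, filters)
    params = kh * kw * c * filters + filters  # weights + biases
    return out_shape, params

shape = (5, 50, 3)  # observation shape reported by env.observation_space
layers = [((1, 3), 2), ((1, 48), 20), ((1, 1), 1)]  # (kernel_size, filters)
results = []
for kernel, filters in layers:
    shape, params = conv2d_valid(shape, kernel, filters)
    results.append((shape, params))
    print(shape, params)

flat = shape[0] * shape[1] * shape[2]
nb_actions_summary = 12  # taken from the dense_1 output shape in the summary
dense_params = flat * nb_actions_summary + nb_actions_summary
print(dense_params)
```

The second convolution uses a kernel as wide as the remaining time axis (`window_length-2`), so it collapses the 48 time steps into a single feature vector per asset.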

In [12]:
from rl.agents.ddpg import DDPGAgent
from rl.policy import BoltzmannQPolicy, EpsGreedyQPolicy, LinearAnnealedPolicy
from rl.memory import SequentialMemory
from rl.random import OrnsteinUhlenbeckProcess

# Seed NumPy's global random number generator for reproducibility.
np.random.seed(0)

# Finally, we configure and compile our agent. You can use every built-in Keras optimizer and
# even the metrics!

memory = SequentialMemory(limit=10000, window_length=1)
random_process = OrnsteinUhlenbeckProcess(
    size=nb_actions, theta=.15, mu=0., sigma=.3)
agent = DDPGAgent(
    nb_actions=nb_actions,
    actor=actor,
    critic=critic,
    critic_action_input=action_input,
    random_process=random_process,
    memory=memory,
    batch_size=50,
    nb_steps_warmup_critic=100,
    nb_steps_warmup_actor=100,    
    gamma=.00, # discount factor of zero, as per the paper
    target_model_update=1e-3
)
agent.compile(Adam(lr=3e-5), metrics=['mse'])
agent

<rl.agents.ddpg.DDPGAgent at 0x7fdc2932b630>
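
Two of the settings above are easier to see in isolation. The NumPy sketch below is not keras-rl code: `ou_noise` simulates an Ornstein-Uhlenbeck process with the same `theta`, `mu` and `sigma` (the step size `dt=1e-2` is an assumption), and `td_target` shows the usual one-step critic target, which reduces to the immediate reward when `gamma` is zero.

```python
import numpy as np

def ou_noise(n_steps, size, theta=0.15, mu=0.0, sigma=0.3, dt=1e-2, seed=0):
    """Simulate an Ornstein-Uhlenbeck process with an Euler-Maruyama step."""
    rng = np.random.RandomState(seed)
    x = np.zeros(size)
    path = np.empty((n_steps, size))
    for t in range(n_steps):
        # pull towards mu, plus Gaussian noise scaled by sqrt(dt)
        x = x + theta * (mu - x) * dt + sigma * np.sqrt(dt) * rng.randn(size)
        path[t] = x
    return path

def td_target(reward, next_q, done, gamma):
    """One-step critic target: r + gamma * Q'(s', a') for non-terminal steps."""
    return reward + gamma * (1.0 - done) * next_q

noise = ou_noise(n_steps=1000, size=12)
# consecutive samples are strongly correlated, unlike independent Gaussian noise
lag1 = np.corrcoef(noise[:-1, 0], noise[1:, 0])[0, 1]
print(noise.shape, lag1)

print(td_target(reward=0.002, next_q=5.0, done=0.0, gamma=0.0))  # 0.002
```

With `gamma=0` the critic only has to predict the reward of the current step, so the agent is optimising each step's return on its own rather than planning ahead.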

In [13]:
from src.callbacks.keras_rl_callbacks import TrainIntervalLoggerTQDMNotebook

In [19]:
# Okay, now it's time to learn something! Visualization is turned off here because it
# slows down training quite a lot. You can always safely abort the training prematurely using
# Ctrl + C.
history = agent.fit(env, 
                  nb_steps=2e6, 
                  visualize=False, 
                  verbose=1,
                  callbacks=[
#                       TrainIntervalLoggerTQDMNotebook(),
#                       ReduceLROnPlateau(monitor='episode_reward', patience = 150)
                    ]
                 )

# After training is done, we save the final weights.
agent.save_weights('outputs/agent_{}_weights.h5f'.format('portfolio-ddpg-keras-rl'), overwrite=True)

Training for 2000000.0 steps ...
Interval 1 (0 steps performed)
333 episodes - episode_reward: 0.000 [-0.005, 0.003] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 0.999 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.490

Interval 2 (10000 steps performed)
333 episodes - episode_reward: -0.000 [-0.003, 0.003] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: -0.000 - log_return: -0.000 - portfolio_value: 0.999 - returns: 1.000 - rate_of_return: -0.000 - cost: 0.000 - steps: 16.500

Interval 3 (20000 steps performed)
334 episodes - episode_reward: 0.000 [-0.003, 0.002] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: -0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.000 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.510

Interval 4 (30000 steps performed)
333 episodes - episode_reward: -0.000 [-0.004, 0.003] - loss: 0.000 - mean_squared_error: 0.00

Interval 24 (230000 steps performed)
334 episodes - episode_reward: 0.000 [-0.003, 0.006] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.001 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.510

Interval 25 (240000 steps performed)
333 episodes - episode_reward: 0.000 [-0.004, 0.004] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.002 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.490

Interval 26 (250000 steps performed)
333 episodes - episode_reward: 0.000 [-0.003, 0.003] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.001 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.500

Interval 27 (260000 steps performed)
334 episodes - episode_reward: 0.000 [-0.004, 0.007] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward:

333 episodes - episode_reward: 0.000 [-0.003, 0.005] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.001 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.490

Interval 47 (460000 steps performed)
333 episodes - episode_reward: 0.000 [-0.003, 0.003] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.001 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.500

Interval 48 (470000 steps performed)
334 episodes - episode_reward: -0.000 [-0.007, 0.003] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: -0.000 - log_return: -0.000 - portfolio_value: 0.998 - returns: 1.000 - rate_of_return: -0.000 - cost: 0.000 - steps: 16.510

Interval 49 (480000 steps performed)
333 episodes - episode_reward: 0.000 [-0.003, 0.004] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - port

Interval 69 (680000 steps performed)
334 episodes - episode_reward: 0.000 [-0.002, 0.006] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.001 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.510

Interval 70 (690000 steps performed)
333 episodes - episode_reward: -0.000 [-0.004, 0.006] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: -0.000 - log_return: -0.000 - portfolio_value: 0.997 - returns: 1.000 - rate_of_return: -0.000 - cost: 0.000 - steps: 16.490

Interval 71 (700000 steps performed)
333 episodes - episode_reward: 0.000 [-0.003, 0.003] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: -0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.002 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.500

Interval 72 (710000 steps performed)
334 episodes - episode_reward: 0.000 [-0.002, 0.004] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - re

334 episodes - episode_reward: 0.000 [-0.002, 0.006] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.000 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.510

Interval 91 (900000 steps performed)
333 episodes - episode_reward: 0.000 [-0.003, 0.005] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.000 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.490

Interval 92 (910000 steps performed)
333 episodes - episode_reward: 0.000 [-0.002, 0.005] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.001 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.500

Interval 93 (920000 steps performed)
334 episodes - episode_reward: 0.000 [-0.004, 0.004] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfoli

333 episodes - episode_reward: 0.000 [-0.002, 0.005] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.002 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.490

Interval 113 (1120000 steps performed)
333 episodes - episode_reward: 0.000 [-0.002, 0.003] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.002 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.500

Interval 114 (1130000 steps performed)
334 episodes - episode_reward: 0.000 [-0.002, 0.003] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.001 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.510

Interval 115 (1140000 steps performed)
333 episodes - episode_reward: 0.000 [-0.002, 0.003] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - po

333 episodes - episode_reward: 0.000 [-0.003, 0.005] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.000 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.500

Interval 135 (1340000 steps performed)
334 episodes - episode_reward: 0.000 [-0.004, 0.005] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.001 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.510

Interval 136 (1350000 steps performed)
333 episodes - episode_reward: 0.000 [-0.005, 0.004] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.000 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.490

Interval 137 (1360000 steps performed)
333 episodes - episode_reward: 0.000 [-0.002, 0.004] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - po

334 episodes - episode_reward: 0.000 [-0.004, 0.006] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.000 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.510

Interval 157 (1560000 steps performed)
333 episodes - episode_reward: 0.000 [-0.002, 0.004] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.001 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.490

Interval 158 (1570000 steps performed)
333 episodes - episode_reward: -0.000 [-0.003, 0.003] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: -0.000 - log_return: -0.000 - portfolio_value: 0.996 - returns: 1.000 - rate_of_return: -0.000 - cost: 0.000 - steps: 16.500

Interval 159 (1580000 steps performed)
334 episodes - episode_reward: -0.000 [-0.005, 0.003] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: -0.000 - reward: -0.000 - log_return: -0.

333 episodes - episode_reward: 0.000 [-0.004, 0.003] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.000 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.490

Interval 179 (1780000 steps performed)
333 episodes - episode_reward: 0.000 [-0.003, 0.006] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.002 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.500

Interval 180 (1790000 steps performed)
334 episodes - episode_reward: 0.000 [-0.003, 0.003] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - portfolio_value: 1.001 - returns: 1.000 - rate_of_return: 0.000 - cost: 0.000 - steps: 16.510

Interval 181 (1800000 steps performed)
333 episodes - episode_reward: 0.000 [-0.002, 0.004] - loss: 0.000 - mean_squared_error: 0.000 - mean_q: 0.000 - reward: 0.000 - log_return: 0.000 - po

done, took 32396.610 seconds


In [None]:
%debug

In [None]:
agent.save_weights('outputs/agent_{}_weights.h5f'.format('portfolio-ddpg-keras-rl'), overwrite=True)

In [None]:
# Finally, evaluate our algorithm for 10 episodes.
agent.test(env_test, nb_episodes=10, visualize=False)

In [None]:
# history
df_hist = pd.DataFrame(history.history)
df_hist
df_hist['episodes'] = df_hist.index

g = sns.jointplot(x="episodes", y="episode_reward", data=df_hist, kind="reg", size=10)
plt.show()


# g = sns.jointplot(x="episodes", y="rewards", data=history, kind="reg")

# visualise

Ideally, plot the price with the actions shown as colors, like https://hackernoon.com/the-self-learning-quant-d3329fcc9915

# dummy metrics
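
The dummy baselines and metrics from the TODO list (hold everything, a single asset, random weights, Sharpe ratio) can be sketched independently of the environment. Everything below is illustrative: the price relatives are synthetic, trading costs are ignored, and `portfolio_values` and `sharpe_ratio` are hypothetical helpers rather than part of `PortfolioEnv`.

```python
import numpy as np

def portfolio_values(price_relatives, weights):
    """Cumulative portfolio value, rebalancing to `weights` every step (no costs)."""
    step_returns = (price_relatives * weights).sum(axis=1)
    return np.cumprod(step_returns)

def sharpe_ratio(values, eps=1e-12):
    """Per-step Sharpe ratio of a value series, with a zero risk-free rate."""
    returns = values[1:] / values[:-1] - 1.0
    return returns.mean() / (returns.std() + eps)

rng = np.random.RandomState(0)
n_steps, n_assets = 500, 6
y = 1 + 0.01 * rng.randn(n_steps, n_assets)  # synthetic price relatives

uniform = np.full((n_steps, n_assets), 1.0 / n_assets)     # hold everything equally
single = np.zeros((n_steps, n_assets))
single[:, 0] = 1.0                                          # all-in on asset 0
random_w = rng.dirichlet(np.ones(n_assets), size=n_steps)   # random weights each step

for name, w in [('uniform', uniform), ('single asset', single), ('random', random_w)]:
    v = portfolio_values(y, w)
    print('{:12s} final value {:.3f}  sharpe {:.3f}'.format(name, v[-1], sharpe_ratio(v)))
```

A trained agent that cannot beat these baselines on held-out data has probably not learned anything useful.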

In [None]:
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.model_selection import train_test_split


X_flat = X_train.reshape((len(X_train),-1))
X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(X_flat, y_train, test_size=0.2, random_state=0)

def test_env(env, model, memory):
    obs = env.reset()
    state = memory.get_recent_state(obs)
    for t in range(env.days):
        x_batch = np.array([state])
        x_flat = x_batch.reshape((len(x_batch),-1))
        x_flat[np.isnan(x_flat)]=0
        y_pred = model.predict(x_flat)
        action = y_pred.argmax(1)
        obs, rew, done, info = env.step(action[0])
        state = memory.get_recent_state(obs)
    
    df_test = env.sim.to_df()
    end = df_test.iloc[-1]
    gain = end.bod_nav - end.mkt_nav    
    return gain

dummy_scores = []
for strategy in ['most_frequent', 'uniform', 'prior', 'stratified']:
    memory = Memory(window_length=window_length)
    clf = DummyClassifier(strategy=strategy)
    clf.fit(X_train, y_train)
    gain = test_env(env_test, clf, memory)
    dummy_scores.append((strategy, gain))
    df = env_test.sim.to_df()
    # bod_nav - mkt_nav has the same sign as `gain` in test_env: positive means beating the market
    print('{:20.20s}: {: 3.2%} /day NAV gain above market'.format(strategy, (df.bod_nav - df.mkt_nav).mean()))
    
    plot_env(env_test, title=strategy)  

for strategy in ['mean', 'median']:
    memory = Memory(window_length=window_length)
    clf = DummyRegressor(strategy=strategy)
    clf.fit(X_train, y_train)
    gain = test_env(env_test, clf, memory)
    dummy_scores.append((strategy, gain))
    df = env_test.sim.to_df()
    print('{:20.20s}: {: 3.2%} /day NAV gain above market'.format(strategy, (df.bod_nav - df.mkt_nav).mean()))
    
    plot_env(env_test, title=strategy)  

In [None]:
plot_env(env_test)