# Reinforcement learning for trading
This project aims to predict the best course of action for a trading decision. Initially, I will include only some basic statistics, such as the daily price of the US dollar in Argentine pesos, both the official rate and the unofficial ("blue") rate.


## Reminder

We are going to use *Deep Q-learning*.

There is an "agent" and an environment. The agent is essentially a neural network (an LSTM) that makes a decision. Initially, the decision will be one of: buy \$100 USD, sell the peso equivalent of \$100 USD, or hold. The environment then tells the agent whether the decision was good or bad by returning a reward. The environment is nothing but the data that we give the agent (the price of the dollar in pesos, etc.).
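
A minimal sketch of this agent/environment interaction, in plain Python rather than as a `gym` environment. The class name `ToyDollarEnv`, the starting balances, the absence of fees, and the reward (change in portfolio value in pesos from one day to the next) are all assumptions made for illustration; the project may end up defining them differently. "Sell" is read here as selling 100 USD for pesos.

```python
BUY, SELL, HOLD = 0, 1, 2
TRADE_USD = 100.0  # trade size taken from the text above


class ToyDollarEnv:
    """Hypothetical environment: walks through a list of daily USD prices in pesos."""

    def __init__(self, prices, usd=0.0, ars=0.0):
        self.prices = prices
        self.usd = usd  # dollars held
        self.ars = ars  # pesos held
        self.t = 0      # index of the current day

    def portfolio_value(self, price):
        """Total holdings valued in pesos at the given price."""
        return self.ars + self.usd * price

    def step(self, action):
        """Apply the action at today's price, move to the next day,
        and return (next_price, reward, done)."""
        price = self.prices[self.t]
        if action == BUY and self.ars >= TRADE_USD * price:
            self.ars -= TRADE_USD * price
            self.usd += TRADE_USD
        elif action == SELL and self.usd >= TRADE_USD:
            self.usd -= TRADE_USD
            self.ars += TRADE_USD * price
        # HOLD, or a trade the balances cannot cover, changes nothing

        value_before = self.portfolio_value(price)
        self.t += 1
        next_price = self.prices[self.t]
        # Assumed reward: change in portfolio value caused by the price move
        reward = self.portfolio_value(next_price) - value_before
        done = self.t == len(self.prices) - 1
        return next_price, reward, done


# Made-up prices, only to show the call pattern
env = ToyDollarEnv([100.0, 110.0, 105.0], usd=0.0, ars=20000.0)
for action in (BUY, SELL):
    print(env.step(action))
```

With this reward, holding dollars pays off when the peso price of the dollar rises and costs money when it falls, which is the feedback the agent learns from.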



### Key distinctions
A reward is an immediate signal received in a given state, while value is the sum of all rewards you can expect to collect from that state onward. Value is a long-term expectation, while reward is an immediate payoff.
The two can diverge: a state can have a low immediate reward but a high value, or the other way around.

### Objective function
$$
\sum_{t=0}^{\infty} \gamma^t r(x(t), a(t))
$$
$x(t)$ is the state at time step $t$, and $a(t)$ is the action taken in that state. $r$ is the reward, and $\gamma$ is the discount factor, a number between 0 and 1.

We are trying to maximize the discounted sum of rewards over an infinite number of time steps.
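
As a rough illustration of the sum above, the following sketch computes the discounted return for a finite list of rewards (a truncated version of the infinite sum). The function name `discounted_return` and the sample rewards are made up for this example.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite list of rewards."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Hypothetical rewards for three time steps
rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

The same reward counts for less the later it arrives, which is why a smaller discount factor makes the agent more short-sighted.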

$$
Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a')
$$

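This is another way to look at the objective function. The Q-function is recursive: for each step we take the immediate reward and add the discounted Q-value of the best action available in the next state $s'$.

$\gamma$ is the discount factor: the smaller it is, the more the immediate rewards matter relative to future ones.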

$$
Q(s, a) = r(s, a) + \gamma r(s', a') + \gamma^2 r(s'', a'') + \dots
$$
This is another way to look at it. It is essentially an expansion of the recursive function above, where $a'$, $a''$, ... are the best actions in the successive states $s'$, $s''$, ...

### Q-learning and Deep Q-learning

Q-learning by itself does not involve neural networks. In its basic (tabular) form, it assumes we can enumerate every possible state and every possible action, and store a Q-value for each pair.
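
A minimal sketch of one tabular Q-learning step, to make the contrast with the neural-network version concrete. The learning rate `alpha` is the usual Q-learning step size and does not appear in the equations above; it, the function name `q_update`, and the state labels are assumptions for this example.

```python
from collections import defaultdict

ACTIONS = ["buy", "sell", "hold"]


def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Move Q(s, a) toward r + gamma * max_a' Q(s', a') and return the new value."""
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


# Every unseen (state, action) pair starts at 0
q_table = defaultdict(float)
new_q = q_update(q_table, state="s0", action="buy", reward=1.0,
                 next_state="s1", alpha=0.5, gamma=0.9)
print(new_q)  # 0 + 0.5 * (1.0 + 0.9 * 0 - 0) = 0.5
```

The table needs one entry per state-action pair, so it stops being practical once states are windows of continuous prices.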

Deep learning comes in when there are too many states to enumerate. Essentially, instead of *calculating* the Q-function exactly, we *estimate* it with a neural network.

#### Loss function in Deep Q-learning

The loss function here is the mean squared error between the predicted Q-value and the target Q-value $Q^*$. This is basically a regression problem.
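
A small sketch of how the target and the loss fit together, in plain Python with made-up numbers. The target follows the same rule the training code in this notebook uses (reward only on a terminal step, reward plus discounted best future Q-value otherwise); the helper names `td_target` and `mse` are hypothetical.

```python
def td_target(reward, next_q_values, gamma, done):
    """Target Q-value: r if the episode ended, else r + gamma * max_a' Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def mse(predicted, targets):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(predicted)


# Two hypothetical transitions: one ongoing, one terminal
targets = [
    td_target(1.0, [0.0, 2.0, 1.0], gamma=0.5, done=False),  # 1 + 0.5 * 2 = 2.0
    td_target(3.0, [5.0, 5.0, 5.0], gamma=0.5, done=True),   # 3.0
]
predicted = [1.0, 2.0]
print(mse(predicted, targets))  # ((1 - 2)**2 + (2 - 3)**2) / 2 = 1.0
```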


## Custom reward function
I want to apply a reward function that punishes losses exponentially, but rewards wins linearly. This is to make the system more conservative when "gambling" money. 
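
One possible shape for such a function, as a sketch: the name `asymmetric_reward`, the `scale` parameter, and the exponent cap are assumptions, not something fixed by this project. Gains are returned unchanged, while a loss of size x is mapped to -(exp(x / scale) - 1), which with `scale=1` is close to -x for small losses and grows much faster for large ones.

```python
import math


def asymmetric_reward(pnl, scale=1.0, max_exponent=50.0):
    """Linear reward for gains, exponentially growing penalty for losses.

    pnl: profit (positive) or loss (negative) of a step, in any consistent unit.
    scale: hypothetical knob; losses are divided by it before exponentiation.
    max_exponent: hypothetical cap that keeps math.exp from overflowing.
    """
    if pnl >= 0:
        return pnl
    exponent = min(-pnl / scale, max_exponent)
    return -(math.exp(exponent) - 1.0)


# Made-up profit and loss values
for pnl in (2.0, -0.1, -2.0):
    print(pnl, asymmetric_reward(pnl))
```

Because the penalty depends on the unit of `pnl`, the values would likely need rescaling (for example, expressing profit and loss as a percentage) before the exponential behaves sensibly.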




In [2]:
import numpy as np
import gym
import tensorflow


In [3]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten, LSTM, Dropout
from tensorflow.keras.optimizers import Adam 




In [5]:
!pip install wandb



In [7]:
!pip install keras-utils



In [8]:
from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

ModuleNotFoundError: No module named 'keras.utils.generic_utils'

In [10]:
from collections import deque
import random

In [5]:
from tensorflow.keras.callbacks import Callback

#...

# Own TensorBoard-style logging callback
# Subclasses the plain Keras Callback and writes through the TF2 summary API
class ModifiedTensorBoard(Callback):

    # Set the initial step and a single writer (we want one log file for all .fit() calls)
    def __init__(self, log_dir):
        super().__init__()
        self.step = 1
        self.log_dir = log_dir
        self.writer = tensorflow.summary.create_file_writer(self.log_dir)

    # Saves logs with our step number
    # (otherwise every .fit() will start writing from 0th step)
    def on_epoch_end(self, epoch, logs=None):
        self.update_stats(**(logs or {}))

    # Custom method for saving own metrics
    # Writes each metric as a scalar at the current step, then flushes the writer
    def update_stats(self, **stats):
        with self.writer.as_default():
            for name, value in stats.items():
                tensorflow.summary.scalar(name, value, step=self.step)
        self.writer.flush()

In [11]:
# Keras feeds the LSTM batches shaped (number of samples, number of time steps, number of features);
# input_shape itself leaves out the samples dimension.
# input_shape = (30, 4) would be 30 time steps (about a month) with 4 features each.
import time
REPLAY_MEMORY_SIZE = 1500
MIN_REPLAY_MEMORY_SIZE = 100
MODEL_NAME = 'FIRST_MODEL'
MINIBATCH_SIZE = 32
DISCOUNT = 1 - (1/2**6)
UPDATE_TARGET_EVERY = 5
class DQNAgent:
    def __init__(self, input_shape_, layers, dropout):
        # Main model
        # gets trained every step
        self.model = self.create_model(input_shape_, layers, dropout)

        # Target network
        # Used for .predict every step; its weights are synced with the main model
        # only every few episodes (see UPDATE_TARGET_EVERY), which keeps the targets stable
        self.target_model = self.create_model(input_shape_, layers, dropout)
        self.target_model.set_weights(self.model.get_weights())

        # An array with last n steps for training
        self.replay_memory = deque(maxlen=REPLAY_MEMORY_SIZE)

        # Custom tensorboard object
        self.tensorboard = ModifiedTensorBoard(log_dir="logs/{}-{}".format(MODEL_NAME, int(time.time())))

        # Used to count when to update target network with main network's weights
        self.target_update_counter = 0 
        
    def create_model(self, input_shape_, layers, dropout):
        # layers[0] is the LSTM width; the remaining entries are Dense layer widths
        model = Sequential()
        model.add(LSTM(layers[0], input_shape=input_shape_))

        for i in range(1, len(layers)):
            if dropout:
                model.add(Dropout(0.2))
            model.add(Dense(layers[i], activation="relu"))
        # One linear output per action: buy, sell, hold
        model.add(Dense(3, activation='linear'))
        model.compile(loss="mse", optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])
        return model
   
    # Adds step's data to a memory replay array
    # (observation space, action, reward, new observation space, done)
    def update_replay_memory(self, transition):
        self.replay_memory.append(transition)
    
    # Queries main network for Q values given current observation space (environment state)
    def get_qs(self, state):
        # Accept lists as well as arrays, then add a leading batch dimension of size 1
        state = np.asarray(state)
        return self.model.predict(state.reshape(-1, *state.shape))[0]

    # Trains main network every step during episode
    def train(self, terminal_state, step):

        # Start training only if certain number of samples is already saved
        if len(self.replay_memory) < MIN_REPLAY_MEMORY_SIZE:
            return
        # Get a minibatch of random samples from memory replay table
        minibatch = random.sample(self.replay_memory, MINIBATCH_SIZE)

        # Get current states from minibatch, then query NN model for Q values
        current_states = np.array([transition[0] for transition in minibatch])
        current_qs_list = self.model.predict(current_states)

        # Get future states from minibatch, then query NN model for Q values
        # When using target network, query it, otherwise main network should be queried
        new_current_states = np.array([transition[3] for transition in minibatch])
        future_qs_list = self.target_model.predict(new_current_states)
        
        X = []
        y = []

        # Now we need to enumerate our batches
        for index, (current_state, action, reward, new_current_state, done) in enumerate(minibatch):

            # If not a terminal state, get new q from future states, otherwise set it to 0
            # almost like with Q Learning, but we use just part of equation here
            if not done:
                max_future_q = np.max(future_qs_list[index])
                new_q = reward + DISCOUNT * max_future_q
            else:
                new_q = reward

            # Update Q value for given state
            current_qs = current_qs_list[index]
            current_qs[action] = new_q

            # And append to our training data
            X.append(current_state)
            y.append(current_qs)

        # Fit on all samples as one batch, log only on terminal state
        self.model.fit(np.array(X), np.array(y), batch_size=MINIBATCH_SIZE, verbose=0, shuffle=False, callbacks=[self.tensorboard] if terminal_state else None)
        
        # Update target network counter every episode
        if terminal_state:
            self.target_update_counter += 1

        # If counter reaches set value, update target network with weights of main network
        if self.target_update_counter > UPDATE_TARGET_EVERY:
            self.target_model.set_weights(self.model.get_weights())
            self.target_update_counter = 0
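
The agent above only produces Q-values; picking an action from them is left to the training loop. A common choice is an epsilon-greedy policy, which is also what the `EpsGreedyQPolicy` import earlier points to. A minimal sketch in plain Python, where the function name `choose_action` and the sample Q-values are made up:

```python
import random


def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: a random action with probability epsilon,
    otherwise the index of the largest Q-value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


# Hypothetical Q-values for (buy, sell, hold)
q_values = [0.1, 0.9, 0.3]
print(choose_action(q_values, epsilon=0.0))  # always greedy: index 1
print(choose_action(q_values, epsilon=0.2))  # mostly 1, sometimes random
```

Epsilon is typically started high and decayed over episodes, so the agent explores early and exploits what it has learned later.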
            
