# Reinforcement learning for trading
This is a project to be able to predict what the best course of action is. Initially, I will only include some basic stats, such as the daily price of the dollar in Argentine pesos: both the official and the unofficial (blue). 


## Reminder

We are going to use *Deep Q learning* 

There is an "agent" and an environment. The agent will be essentially a neural network (LSTM), that makes a decision. Initially, our decision will be: buy $100 USD, sell -te equivalent in pesos of- $100 USD, or hold . Then, the environment will inform if the decision was correct or not. The environment is nothing but the data that we give the agent (the price of the dollar in pesos, etc.)



### Key distinctions
Rewars is an immediate signal that is received in a given state, while value is the sum of all rewards you might anticipate from that state. Value is a long-term expectation, while reward is an immediate pleasure. 
You can have states where value and reward diverge.  

### Objective function
$$
\sum_{t=0}^{\infty} \gamma^t r(x(t), a(t))
$$
$x$ is the state at a given time step, and $a$ is the action taken in that state. $r$ is the reward. 

We are trying to maximize the sum of $r$ along, let's say, infinite time steps or whatever...

$$
Q(s, a) = r(s, a) + \gamma \max_{a} Q(s', a)
$$

This is another way to look at the objective function. Q function is recursive: for each step we calculate the immediate reward, then we get the max final reward. 

$\gamma\$ makes the immediate rewards more important. 

$$
Q(s, a) \rightarrow \gamma Q(s', a) + \gamma^2 Q(s'', a) + \dots + \gamma^n Q(s^{''\dots n}, a)
$$
This is another way to look at this. It is essentially an expansion of the aboce recursive funciton. 

### Q-learning and Deep Q-learning

Q-learning does not involve neural networks. Initially, it just assumes we can calculate every possible decision, and every possible state. 

This is where deep learning comes in. Essentially, instead of *calculating* Q function, we *estimate* the Q-function through a neural network. 

#### Loss function in Deep-Q learning

The loss function here is mean squared error of the predicted Q-value and the target Q-value -Q*. This is basically a regression problem. 


## Custom reward function
I want to apply a reward function that punishes losses exponentially, but rewards wins linearly. This is to make the system more conservative when "gambling" money. 




In [32]:
import numpy as np
import gym
import tensorflow


In [33]:
import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten, LSTM, Dropout
from tensorflow.keras.optimizers import Adam 




In [34]:
from collections import deque
import random

In [35]:
from keras.callbacks import TensorBoard

#...

# Own Tensorboard class
class ModifiedTensorBoard(TensorBoard):

    # Overriding init to set initial step and writer (we want one log file for all .fit() calls)
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.step = 1
        self.writer = tensorflow.summary.create_file_writer("/tmp/tfvgg")

    # Overriding this method to stop creating default log writer
    def set_model(self, model):
        pass

    # Overrided, saves logs with our step number
    # (otherwise every .fit() will start writing from 0th step)
    def on_epoch_end(self, epoch, logs=None):
        self.update_stats(**logs)

    # Overrided
    # We train for one batch only, no need to save anything at epoch end
    def on_batch_end(self, batch, logs=None):
        pass

    # Overrided, so won't close writer
    def on_train_end(self, _):
        pass

    # Custom method for saving own metrics
    # Creates writer, writes custom metrics and closes writer
    def update_stats(self, **stats):
        self._write_logs(stats, self.step)

In [36]:
# Input_shape must be : Number of samples, number of time steps, and number of features.
# input_shape = 30, 4 : would be 30 times-steps (about a month, 4 features)
import time
REPLAY_MEMORY_SIZE = 1500
MIN_REPLAY_MEMORY_SIZE = 100
MODEL_NAME = 'FIRST_MODEL'
MINIBATCH_SIZE = 32
DISCOUNT = 1 - (1/2**6)
UPDATE_TARGET_EVERY = 5
TIME_STEP = 200
FEATURES = 3
INPUT_SHAPE = TIME_STEP, FEATURES 


class DQNAgent:
    def __init__(self, input_shape_, layers, dropout):
        # Main model
        # gets trained every step
        self.model = self.create_model(input_shape_, layers, dropout)

        # Target network
        # .predict every step
        # every n steps, we update the model that we've been fitting for every step, and I guess we discard the old one ... 
        self.target_model = self.create_model(input_shape_, layers, dropout)
        self.target_model.set_weights(self.model.get_weights())

        # An array with last n steps for training
        self.replay_memory = deque(maxlen=REPLAY_MEMORY_SIZE)

        # Custom tensorboard object
        self.tensorboard = ModifiedTensorBoard(log_dir="logs/{}-{}".format(MODEL_NAME, int(time.time())))

        # Used to count when to update target network with main network's weights
        self.target_update_counter = 0 
        
    def create_model(self, input_shape_, layers, dropout):
        model = Sequential()
        model.add(LSTM(layers[0],  input_shape=input_shape_))
        
        for i in range(1,len(layers)):
            if dropout:
                model.add(Dropout(dropout))
            model.add(Dense(layers[i], activation="relu"))
        model.add(Dense(3, activation='linear'))
        model.compile(loss="mse", optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])
        return model
   
    # Adds step's data to a memory replay array
    # (observation space, action, reward, new observation space, done)
    def update_replay_memory(self, transition):
        self.replay_memory.append(transition)
    
    # Queries main network for Q values given current observation space (environment state)
    def get_qs(self, state):
        return self.model.predict(np.array(state).reshape(-1, *state.shape))[0]

    # Trains main network every step during episode
    def train(self, terminal_state, step):

        # Start training only if certain number of samples is already saved
        if len(self.replay_memory) < MIN_REPLAY_MEMORY_SIZE:
            return
                # Get a minibatch of random samples from memory replay table
        minibatch = random.sample(self.replay_memory, MINIBATCH_SIZE)

        # Get current states from minibatch, then query NN model for Q values
        current_states = np.array([transition[0] for transition in minibatch])
        current_qs_list = self.model.predict(current_states)

        # Get future states from minibatch, then query NN model for Q values
        # When using target network, query it, otherwise main network should be queried
        new_current_states = np.array([transition[3] for transition in minibatch])
        future_qs_list = self.target_model.predict(new_current_states)
        
        X = []
        y = []

        # Now we need to enumerate our batches
        for index, (current_state, action, reward, new_current_state, done) in enumerate(minibatch):

            # If not a terminal state, get new q from future states, otherwise set it to 0
            # almost like with Q Learning, but we use just part of equation here
            if not done:
                max_future_q = np.max(future_qs_list[index])
                new_q = reward + DISCOUNT * max_future_q
            else:
                new_q = reward

            # Update Q value for given state
            current_qs = current_qs_list[index]
            current_qs[action] = new_q

            # And append to our training data
            X.append(current_state)
            y.append(current_qs)

        # Fit on all samples as one batch, log only on terminal state
        self.model.fit(np.array(X), np.array(y), batch_size=MINIBATCH_SIZE, verbose=0, shuffle=False, callbacks=[self.tensorboard] if terminal_state else None)
        
        # Update target network counter every episode
        if terminal_state:
            self.target_update_counter += 1

        # If counter reaches set value, update target network with weights of main network
        if self.target_update_counter > UPDATE_TARGET_EVERY:
            self.target_model.set_weights(self.model.get_weights())
            self.target_update_counter = 0
            


In [91]:
# self.info is the price of the dollar in pesos
CAUTION_FACTOR = 0.5 # multiplies the punishment for holding
GAMBLER_PUNISHER = 1.6 # scales the punishment for buying or selling and loosing
class BlobEnv:
    def __init__(self, original_info):
        self.original_info = original_info
        self.info = original_info
        self.total_episode_step = 0
    def reset(self):
        self.info = self.original_info[self.total_episode_step:]
        self.episode_step = 0
        self.negative_step = 0

        return self.info
    
    # action will be one of three values. buy, sell, hold. 
    # I don't think we have the need for observation, which in the example is the state of the game. 
    # In this case, the state is simply given by the information we already have ... 
    # So, I'm not sure what to do with observation, really ... 
    # So, state is what we actually feed the model, so state is nothing other than the array we have, except that it's moved
    # one bit over every time ...
    # This will be a problem, because the size of the array will change, decisions ... decision ... 
     
    def step(self, action):
        print('hello')
        self.episode_step += 1
        self.total_episode_step += 1
        diff = self.info[TIME_STEP][0] - self.info[TIME_STEP - 1][0]
        print(f"self.info[TIME_STEP] is {self.info[TIME_STEP][0]}")
        print(f"diff is {diff}")

        if action == 0: # hold 
            # I think this is fine, it's the immediate reward
            # I am getting
            # Holding will always give you some punishment, because unless the price is 
            # exactly the same, it means you could have benefitted from selling or buying 
            # But, we don't want to encourage the system to be wild, so we will multiply
            # this punishment by a caution factor, so the punishment will be less 
            # than if the system actually LOST the money
            reward = - abs(diff) * CAUTION_FACTOR
            self.negative_step += 1
        elif action == 1: # buy dollars
            if diff >= 0:
                reward = diff/self.info[TIME_STEP][0]
            else:
                reward = -(diff ** GAMBLER_PUNISHER)/self.info[TIME_STEP][0]
                self.negative_step += 1
                
        else: # sell dollars
            if diff >= 0:
                reward = -(diff ** GAMBLER_PUNISHER)/self.info[TIME_STEP][0]
                self.negative_step += 1
            else:
                reward = diff/self.info[TIME_STEP][0]
        self.info = self.original_info[self.total_episode_step:]
        done = False
        
        # If you've accumulated 200 days with losses, time to stop ... 
        if self.episode_step  >= 300 or self.negative_step >= 150:
            done = True
        
        return self.info, reward, done


In [92]:
import pandas as pd 

dolar_blue = pd.read_csv("./Dolar_blue.csv")

In [93]:
dolar_blue

Unnamed: 0,Fecha,Compra_bl,Venta_bl
0,22/01/2024,1185,1235
1,19/01/2024,1170,1220
2,18/01/2024,1190,1240
3,17/01/2024,1175,1225
4,16/01/2024,1130,1180
...,...,...,...
4923,16/01/2004,29,291
4924,15/01/2004,288,289
4925,14/01/2004,29,291
4926,13/01/2004,288,289


In [94]:
dolar_blue = dolar_blue.drop_duplicates(subset="Fecha")

In [95]:
dolar_blue

Unnamed: 0,Fecha,Compra_bl,Venta_bl
0,22/01/2024,1185,1235
1,19/01/2024,1170,1220
2,18/01/2024,1190,1240
3,17/01/2024,1175,1225
4,16/01/2024,1130,1180
...,...,...,...
4923,16/01/2004,29,291
4924,15/01/2004,288,289
4925,14/01/2004,29,291
4926,13/01/2004,288,289


In [42]:
dolar_oficial = pd.read_csv("./Dolar_Oficial.csv")

In [43]:
dolar_oficial = dolar_oficial.drop_duplicates(subset="Fecha")

In [44]:
dolar_oficial = dolar_oficial.rename(columns={"Compra":"Compra_of","Venta":"Venta_of"})
dolar_blue = dolar_blue.rename(columns={"Compra":"Compra_bl","Venta":"Venta_bl"})

In [45]:
dolar_blue

Unnamed: 0,Fecha,Compra_bl,Venta_bl
0,22/01/2024,1185,1235
1,19/01/2024,1170,1220
2,18/01/2024,1190,1240
3,17/01/2024,1175,1225
4,16/01/2024,1130,1180
...,...,...,...
4923,16/01/2004,29,291
4924,15/01/2004,288,289
4925,14/01/2004,29,291
4926,13/01/2004,288,289


In [46]:
dolar_oficial.to_csv("Dolar_Oficial.csv", index=False)
dolar_blue.to_csv("Dolar_blue.csv", index=False)


In [96]:
all_dollars = dolar_blue.merge(right=dolar_oficial, how="inner")
all_dollars.to_csv("./dolar_todos.csv",index=False)
all_dollars

Unnamed: 0,Fecha,Compra_bl,Venta_bl,Compra_of,Venta_of
0,22/01/2024,1185,1235,80808,8688
1,19/01/2024,1170,1220,80808,8688
2,18/01/2024,1190,1240,8072,86651
3,17/01/2024,1175,1225,80696,8662
4,16/01/2024,1130,1180,80538,86465
...,...,...,...,...,...
3090,19/01/2004,29,29,287,291
3091,16/01/2004,29,291,287,291
3092,15/01/2004,288,289,286,29
3093,14/01/2004,29,291,286,29


In [60]:
only_all_dollars = all_dollars.iloc[:,1:].apply(lambda x:x.str.replace(',','.').astype(float),axis=1)

In [54]:
all_dollars.iloc[:,1:]

Unnamed: 0,Compra_bl,Venta_bl,Compra_of,Venta_of
0,1185,1235,80808,8688
1,1170,1220,80808,8688
2,1190,1240,8072,86651
3,1175,1225,80696,8662
4,1130,1180,80538,86465
...,...,...,...,...
3090,29,29,287,291
3091,29,291,287,291
3092,288,289,286,29
3093,29,291,286,29


In [62]:
dollars = np.array(only_all_dollars.iloc[:,1:])
dollars

array([[1235.  ,  808.08,  868.8 ],
       [1220.  ,  808.08,  868.8 ],
       [1240.  ,  807.2 ,  866.51],
       ...,
       [   2.89,    2.86,    2.9 ],
       [   2.91,    2.86,    2.9 ],
       [   2.89,    2.85,    2.89]])

In [89]:

agent = DQNAgent(INPUT_SHAPE,[32,32,23],.2)
env = BlobEnv(dollars)

# For more repetitive results
random.seed(23)
np.random.seed(23)
tensorflow.random.set_seed(23)




In [75]:
import os
from tqdm import tqdm

if not os.path.isdir('models'):
    os.makedirs('models')

In [99]:
EPISODES = 4000
# Exploration settings
epsilon = 0.6  # not a constant, going to be decayed
EPSILON_DECAY = 0.99975
MIN_EPSILON = 0.001
ACTION_SPACE_SIZE = 3 # number of possible decisions ... 
AGGREGATE_STATS_EVERY = 50
ep_rewards = [0]
MIN_REWARD = -1
# Iterate over episodes

for episode in tqdm(range(1, EPISODES + 1), ascii=True, unit='episodes'):
    # Update tensorboard step every episode
    
    agent.tensorboard.step = episode # ????
    
    # Restarting episode = reset episode reward and step number
    episode_reward = 0
    step = 1
    
    # Reset environment and get initial state
    current_state = env.reset()
    
    # Reset flag and start iterating until episode ends
    done = False
    while not done:
        
        if np.random.random() > epsilon:
            # Get action from Q table
            action = np.argmax(agent.get_qs(current_state))
        
        else:
            # Get random action  ????? 
            action = np.random.randint(0, ACTION_SPACE_SIZE)
            
        new_state, reward, done = env.step(action)
        
        # Transform new continuous state to a new discrete state and count reward 
        episode_reward += reward
        
        # Every step we update replay memory and train main network
        agent.update_replay_memory((current_state, action, reward, new_state, done))
        agent.train(done, step)
        
        current_state = new_state 
        step += 1
        
        # Append episode reward to a list and log stats (every given number of episodes)
        ep_rewards.append(episode_reward)  
        print(f"ep_rewards is {ep_rewards}")
        
        if not episode % AGGREGATE_STATS_EVERY or episode == 1:
            print(f"ep_rewards is {ep_rewards}")
            average_reward = sum(ep_rewards[-AGGREGATE_STATS_EVERY:])/len(ep_rewards[-AGGREGATE_STATS_EVERY:])
            min_reward = min(ep_rewards[-AGGREGATE_STATS_EVERY:])
            max_reward = max(ep_rewards[-AGGREGATE_STATS_EVERY:])
            agent.tensorboard.update_stats(reward_avg=average_reward, reward_min=min_reward, reward_max=max_reward, epsilon=epsilon)
            
            # Save model, but only when min reward is greater or equal a set value
            if min_reward >= MIN_REWARD:
               agent.model.save(f'models/{MODEL_NAME}__{max_reward:_>7.2f}max_{average_reward:_>7.2f}avg_{min_reward:_>7.2f}min__{int(time.time())}.model')
        
        # Decay epsilon
        if epsilon > MIN_EPSILON:
            epsilon *= EPSILON_DECAY
            epsilon = max(MIN_EPSILON, epsilon)  


  0%|          | 0/4000 [00:00<?, ?episodes/s]

hello
self.info[TIME_STEP] is [288.  145.9 153.9]
[ 0.   -0.35 -0.35]





ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()