# Introduction
* Reinforcement learning works by **trial and error** - aiming to maxmize a reward function/high score
* It is **unsupervised**, and wanders around the problem space establishing structure to maximize its reward

### Definitions
* **Environment** - The problem space, like a video game or financial market
* **State** - All relevant parameters defining the problem space
* **Agent** - **All Elements** of the algorithm that interact with the state
* **Action** - Choose action ${A_{i}}$ from the set of the actions
* **Step** - Given the Agent's *action*, the *environment* is updated - this is one *step* 
* **Reward** - A reward or penalty is awarded based on the *Action* chosen by the agent (e.g. Points in a video game or P/L in finance)
* **Target** - What the *Agent* tries to maximize
* **Policy** - The *Deterministic* action the agent takes given the current state of the environment
* **Episode** - a set of *Steps* taken until success is achieved or failure is observed. In finance this could be something like **Profit over the course of one year or bankruptcy**

# Environment

* The **OpenAI Gym** environment allows for the training of RL agents
* The classic problem in Reinforcement Learning is **CartPole**, where an agent learns to move a cart left or right to balance a pole in the cart
* The state of the environment is describred in a four variable vector {Cart Position, Cart Velocity, Pole Angle, Pole Velocity}
* The following code pull in the Gym and inspects the observation source

In [130]:
import os
import math
import random
import numpy as np
import pandas as pd
from pylab import plt
import gym
import tensorflow.compat.v1 as tf
from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.optimizers import Adam
from sklearn.metrics import accuracy_score
from collections import deque
plt.style.use('seaborn')
np.set_printoptions(precision=4, suppress=True)
os.environ['PYTHONHASHSEED'] = '0'

In [2]:
env = gym.make('CartPole-v0') # Environment Object

In [6]:
env.observation_space # Cart Position, Cart Velocity, Pole Angle, Pole Velocity

Box(4,)

In [7]:
env.observation_space.low.astype(np.float16) # Minimum value in observation 4-space

array([-4.8  ,   -inf, -0.419,   -inf], dtype=float16)

In [8]:
env.observation_space.high.astype(np.float16) # Maximum value in observation 4-space

array([4.8  ,   inf, 0.419,   inf], dtype=float16)

In [9]:
state = env.reset() # Problem State

In [11]:
state # Ready to start!

array([ 0.0467, -0.0347,  0.0249, -0.0232])

In [13]:
# Action Space
env.action_space # Left, Right

Discrete(2)

In [14]:
env.action_space.n # Two possible actions

2

In [15]:
env.action_space.sample() # Right

1

In [17]:
a = env.action_space.sample()

In [18]:
# Take the action
state, reward, done, info = env.step(a) # Take a step with action a

In [20]:
state, reward, done, info # New state of environment, reward value, Are we finished boolean?, additional info

(array([ 0.046 ,  0.16  ,  0.0244, -0.3079]), 1.0, False, {})

# Continuation
* While `done == False`, we can continue the game
* Success condition is either a number of steps being reached or a particular reward being reached

array([ 0.0393, -0.0018,  0.0169,  0.0479])

In [30]:
def compete(max_steps, max_score):
    env.reset()
    total_reward = 0.
    for e in range(1, max_steps):
        a = env.action_space.sample()
        state, reward, done, info = env.step(a)
        total_reward += reward
        print(f'step={e:2d} | state={state} | action={a} | reward={total_reward}')
        if done and (e + 1) < 200:
            print('*** FAILED ***')
            break

In [34]:
compete(200, 2.0)

step= 1 | state=[ 0.0476 -0.216   0.0394  0.3122] | action=0 | reward=1.0
step= 2 | state=[ 0.0433 -0.0215  0.0457  0.0322] | action=1 | reward=2.0
step= 3 | state=[ 0.0429 -0.2172  0.0463  0.3389] | action=0 | reward=3.0
step= 4 | state=[ 0.0385 -0.413   0.0531  0.6458] | action=0 | reward=4.0
step= 5 | state=[ 0.0303 -0.6088  0.066   0.9548] | action=0 | reward=5.0
step= 6 | state=[ 0.0181 -0.8047  0.0851  1.2674] | action=0 | reward=6.0
step= 7 | state=[ 0.002  -0.6108  0.1105  1.0026] | action=1 | reward=7.0
step= 8 | state=[-0.0102 -0.4173  0.1305  0.7465] | action=1 | reward=8.0
step= 9 | state=[-0.0186 -0.614   0.1454  1.0773] | action=0 | reward=9.0
step=10 | state=[-0.0309 -0.421   0.167   0.8335] | action=1 | reward=10.0
step=11 | state=[-0.0393 -0.2285  0.1837  0.5977] | action=1 | reward=11.0
step=12 | state=[-0.0439 -0.0364  0.1956  0.368 ] | action=1 | reward=12.0
step=13 | state=[-0.0446 -0.2337  0.203   0.7154] | action=0 | reward=13.0
step=14 | state=[-0.0493 -0.0419  

# A Monte Carlo Approach
* The CartPole problem can be solved with a simpler Monte Carlo approach, using the standard normally distributed weights + dot product apporach
* Define which policy to adopt based on a partition of the Monte Carlo output
* Define a large number of weights based on Monte Carlo simulation and select optimal weights
* Define what counts as "solution" - e.g. mean score 195

In [36]:
weights = np.random.random(4) * 2 - 1
weights

array([-0.7302, -0.4214, -0.591 , -0.5365])

In [38]:
state = env.reset()
state

array([ 0.0325,  0.0424, -0.0248,  0.0036])

In [39]:
s = np.dot(weights, state)
s

-0.028903315517161035

In [40]:
# Define policy based on bounding dot product
if s < 0:
    a = 0
else:
    a = 1

In [41]:
a

0

In [76]:
def run_episode(env, weights):
    state = env.reset()
    treward = 0
    for _ in range(200):
        s = np.dot(state, weights)
        a=0 if s < 0 else 1
        state, reward, done, info = env.step(a)
        treward += reward
        if done:
            break
    return treward

In [77]:
run_episode(env, weights)

9.0

In [107]:
def set_seeds(seed=100):
    random.seed(seed)
    np.random.seed(seed)
    env.seed(seed)
    tf.random.set_random_seed(100)

In [108]:
set_seeds()
num_episodes = 1000

In [109]:
bestreward = 0
for e in range(1, num_episodes+1):
    weights = np.random.rand(4) * 2 - 1
    treward = run_episode(env, weights)
    if treward > bestreward:
        bestreward = treward
        bestweights = weights
        if treward == 200:
            print(f'SUCCESS | episode{e}')
            break
        print(f'UPDATE | episode={e}')

UPDATE | episode=1
UPDATE | episode=2
SUCCESS | episode13


In [110]:
weights

array([-0.4282,  0.7048,  0.95  ,  0.7697])

In [111]:
res = []
for _ in range(100):
    treward = run_episode(env, weights)
    res.append(treward)
res[:10]

[200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0]

In [112]:
sum(res) / len(res)

200.0

# A Neural Network Approach
* This problem can also be thought of as a classification problem: **what is the optimal action (label)** given the weights of the state?

In [125]:
class NNAgent:
    
    def __init__(self):
        self.max = 0
        self.scores = list()
        self.memory = list()
        self.model = self._build_model() # Private 
    
    def _build_model(self):
        model = Sequential()
        model.add(Dense(24, input_dim=4, activation='relu'))
        model.add(Dense(1, activation='sigmoid')) # Classification layer
        model.compile(loss='binary_crossentropy', optimizer=Adam(lr=0.001))
        return model
    
    def act(self, state):
        if random.random() <= 0.5:
            return env.action_space.sample()
        action = self.model.predict_classes(state, batch_size=None)[0, 0]
        return action
    
    def train_model(self, state, action):
        self.model.fit(state, np.array([action]), epochs=1, verbose=False) # Fit model from state space to action
        
    def learn(self, episodes):
        for e in range(1, episodes+1):
            state = env.reset()
            for _ in range(201):
                state = np.reshape(state, [1, 4])
                action = self.act(state)
                next_state, reward, done, info = env.step(action)
                if done:
                    score = _ + 1
                    self.scores.append(score)
                    self.max = max(score, self.max) # Was this run the best?
                    print('episode: {:4d}/{} | score: {:3d} | max: {:3d}'
                                   .format(e, episodes, score, self.max), end='\r')
                    break
                self.memory.append((state, action))
                self.train_model(state, action) # Update the policy (by training the neural net)
                state = next_state

In [126]:
set_seeds(100)
agent = NNAgent()

In [128]:
episodes = 500
agent.learn(episodes) # Performance of agent - doesn't even approach the reinforcement approach

episode:  500/500 | score:  11 | max:  57

# Q-Learning
* Q-Learning is a more sophisticated approach, taking into account **delayed rewards** rather than immediate rewards
* This works by computing an **action-value** policy ${Q}$ that assigns every combination of a state and action a value. The higher the value, the better the action.
* ${Q}$ is the **sum of the actions direct reward and the discounted value of the optimal action in the next state**, formally ${Q(S_{t}, A_{t}) = R_{t+1} + \gamma max_{a} Q(S_{t+1}, a)}$ with ${S_{t}}$ state at time t and ${A_{t}}$ the action taken at time t. ${R_{t+1}}$ is the reward of action ${A}$, ${\gamma \in }$ is a discounting factor and ${max_{a}Q(S_{t+1}, a}$ is the reward of the optimal action at the next step.
* ${Q}$ should generally be thought of as a function in continuous space - closed form solutions to an optimal ${Q}$ can likely not be derived, so we **approximate**
* This is where neural networks come into play - the appoximation is an optimization problem.
* Another critical element is **replay** - the ${QL}$-replays over experiences to update the policy action ${Q}$. An optimal ${QL}$ agent starts with pure exploration of the space, then decrease the exploration rate (${\epsilon}$) until it reaches a minimum level.

In [170]:
class DQLAgent:
    def __init__(self, finish=False):
        self.finish = finish
        self.epsilon = 1.0 # Initial Exploration %
        self.epsilon_min = 0.01 # Minimum Exploration %
        self.epsilon_decay = 0.995 # Decay rate of Epsilon
        self.gamma = 0.95 # Discount factor of t+1 optimal choice
        self.batch_size = 32
        self.max_treward = 0
        self.averages = list()
        self.memory = deque(maxlen=2000) # Deque for limited history
        self.osn = env.observation_space.shape[0] # NN input layer
        self.model = self._build_model()
        
    def _build_model(self):
        model = Sequential()
        model.add(Dense(24, input_dim=self.osn, activation='relu'))
        model.add(Dense(24, activation='relu')) # Hidden layer
        model.add(Dense(env.action_space.n, activation='linear'))
        model.compile(loss='mse', optimizer='adam')
        return model
    
    def act(self, state):
        if random.random() <= self.epsilon: # This means explore
            return env.action_space.sample()
        action = self.model.predict(state) # What's return value of this?
        return np.argmax(action)
    
    def replay(self):
        batch = random.sample(self.memory, self.batch_size) # Take a random batch of history to replay
        for state, action, reward, next_state, done in batch:
            if not done:
                reward += self.gamma * np.amax(self.model.predict(next_state)[0]) # Q-value for state-action pair
                target = self.model.predict(state)
                target[0, action] = reward
                self.model.fit(state, target, epochs=1, verbose=False) # Update neural net for action value pairs
        if self.epsilon > self.epsilon_min: # Decay exploration rate
            self.epsilon *= self.epsilon_decay
            
    def learn(self, episodes):
        trewards = []
        for e in range(1, episodes+1):
            state = env.reset()
            state = np.reshape(state, [1, self.osn]) # Reshape state into one row input dim columns
            for _ in range(5000):
                action = self.act(state)
                next_state, reward, done, info = env.step(action)
                next_state = np.reshape(next_state, [1, self.osn])
                self.memory.append([state, action, reward, next_state, done]) # Memory _ x 6
                state = next_state
                
                if done:
                    treward = _ + 1
                    trewards.append(treward)
                    av = sum(trewards[-25:]) / 25 # Average of everything so far
                    self.averages.append(av)
                    self.max_treward = max(self.max_treward, treward)
                    templ = 'episode: {:4d}/{} | treward: {:4d} |'
                    templ += 'av: {:5.1f} | max: {:4d}'
                    print(templ.format(e, episodes, treward, av, self.max_treward), end='\r')
                    break
                
                if av > 195 and self.finish:
                     break
                    
                if len(self.memory) > self.batch_size:
                    self.replay() # Replay once able to
                    
    def test(self, episodes):
        trewards = []
        for e in range(1, episodes+1):
            state = env.reset()
            for _ in range(1001):
                state = np.reshape(state, [1, self.osn])
                action = self.act(state)
                next_state, reward, done, info = env.step(action)
                state = next_state
                
                if done:
                    treward = _ + 1
                    trewards.append(treward)
                    print('episode: {:4d}/{} | treward: {:4d}'.format(e, episodes, treward), end='\r')
                    break
                    
            return trewards

# Performance of QL-Agent
* This agent performs extremely well, though the performance improvement is not monotonic.
* The below run extraordinarily slowly - I should ping Yves and understand this and what `av` should be defaulted to

In [171]:
set_seeds(100)
agent = DQLAgent(finish=True)

In [172]:
episodes = 1000
agent.learn(episodes)

UnboundLocalError: local variable 'av' referenced before assignment

# Finance Applications
* First, a simple finance gym working on time series
* Success is achieved when the agent successfully trades through all of the data set

In [165]:
class observation_space:
    def __init__(self, n):
        self.shape = (n, )

In [166]:
class action_space:
    def __init__(self, n):
        self.n = n
    
    def sample(self):
        return random.randint(0, self.n - 1)

In [169]:
class FinanceGym:
    
    url = '../../source/aiif_eikon_eod_data.csv'
    def __init__(self, symbol, features):
        self.symbol = symbol
        self.features = features
        self.observation_space = observation_space(4)
        self.osn = self.observation_space.shape[0]
        self.action_space = action_space(2) # Long, Short
        self.min_accuracy = .50
        self._get_data()
        self._prepare_data()
        
    def _get_data(self):
        self.raw = pd.read_csv(self.url, index_col=0, parse_dates=True).dropna()
        
    def _prepare_data(self):
        self.data = pd.DataFrame(self.raw[self.symbol])
        self.data['r'] = np.log(self.data / self.data.shift(1))
        self.data.dropna(inplace=True)
        self.data = (self.data - self.data.mean()) / self.data.std()
        self.data['d'] = np.where(self.data['r'] > 0, 1, 0)
        
    def _get_state(self): # Select data defining state of market
        return self.data[self.features].iloc[self.bar - self.osn: self.bar].values 
    
    def seed(self, seed=None):
        pass
    
    def reset(self):
        self.treward = 0
        self.accuracy = 0
        self.bar = self.osn
        state = self.data[self.features].iloc[self.bar - self.osn: self.bar]
        return state.values
    
    def step()