<img src="http://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# Python for Finance Basics

&copy; Dr. Yves J. Hilpisch | The Python Quants GmbH

http://tpq.io | [training@tpq.io](mailto:training@tpq.io) | [@dyjh](http://twitter.com/dyjh)

## Reinforcement Learning

In [None]:
!git clone https://github.com/tpq-classes/pff_basics.git
import sys
sys.path.append('pff_basics')


In [None]:
import numpy as np
import pandas as pd
from pylab import plt
np.set_printoptions(suppress=True)
plt.style.use('seaborn-v0_8')
%config InlineBackend.figure_format = 'svg'

## `CartPole` Game

**Environment & Agents**

Topics:

**Reinforcement Learning**

* environment
* state
* agent
* action
* step
* reward/penalty
* objective
* policy
* episode
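
To make these terms concrete, here is a minimal sketch in plain Python. It is not the `CartPole` environment used below: the `CoinEnv` class, its `reset()`/`step()` methods, the `run_episode()` helper and the reward of 1 per correct guess are all hypothetical, chosen only to mirror the vocabulary in the list above.

```python
import random


class CoinEnv:
    """Hypothetical toy environment: guess the next flip of a biased coin."""

    def __init__(self, p_heads=0.8, max_steps=10, seed=100):
        self.p_heads = p_heads
        self.max_steps = max_steps
        self.rng = random.Random(seed)

    def reset(self):
        self.steps = 0
        self.state = 0  # state: outcome of the previous flip (0 or 1)
        return self.state

    def step(self, action):
        flip = 1 if self.rng.random() < self.p_heads else 0
        reward = 1 if action == flip else 0  # reward for a correct guess
        self.steps += 1
        self.state = flip
        done = self.steps >= self.max_steps  # the episode ends after max_steps
        return self.state, reward, done


def run_episode(env, policy):
    """Play one episode and return the total reward (the objective)."""
    state = env.reset()
    treward, done = 0, False
    while not done:
        action = policy(state)  # the agent picks an action via its policy
        state, reward, done = env.step(action)
        treward += reward
    return treward


toy_env = CoinEnv()
always_heads = lambda state: 1  # policy: maps a state to an action
treward = run_episode(toy_env, always_heads)
print(treward)
```

The `always_heads` policy ignores the state. A learning agent would instead adjust its policy based on the rewards it observes.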


**Deep Q-Learning**

* reward function
* action policy
* representation
* deep neural network
* exploration/exploitation
* replay & policy update
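
As a rough illustration of the action policy and the update target, the sketch below uses plain lists of Q-values instead of the deep neural network used by the agent further down. The function names `epsilon_greedy()` and `td_target()` and the example numbers are assumptions made for illustration only.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # exploration
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploitation


def td_target(reward, next_q_values, gamma, done):
    """Reward plus discounted best next value; reward only if terminal."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


q_next = [0.5, 2.0]  # assumed Q-values of the next state
target = td_target(reward=1.0, next_q_values=q_next, gamma=0.95, done=False)
greedy_action = epsilon_greedy([0.1, 0.7], epsilon=0.0)
print(target, greedy_action)
```

With `epsilon=0.0` the policy is purely greedy. With `epsilon=1.0` it is purely random, which corresponds to the starting point of the agent defined later.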


## Gym Environment

In [None]:
import gymnasium as gym

In [None]:
env = gym.make('CartPole-v1')

## Memory-Prediction Framework

From Wikipedia (see [Memory-Prediction Framework](https://en.wikipedia.org/wiki/Memory-prediction_framework)):

> The memory-prediction framework is a theory of brain function created by Jeff Hawkins and described in his 2004 book On Intelligence. This theory concerns the role of the mammalian neocortex and its associations with the hippocampi and the thalamus in matching sensory inputs to stored memory patterns and how this process leads to predictions of what will happen in the future.

## DQL Agent

In [None]:
import os, warnings

In [None]:
warnings.simplefilter('ignore')

In [None]:
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'

In [None]:
from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()

In [None]:
import random
from tensorflow import keras
from tensorflow.keras.layers import Dense  # same Keras as the optimizer
from tensorflow.keras.models import Sequential

In [None]:
opt = keras.optimizers.Adam(learning_rate=0.001)

In [None]:
class DQLAgent:
    def __init__(self):
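        self.epsilon = 1.0  # initial exploration rate
        self.epsilon_decay = 0.995  # multiplicative decay per replay
        self.epsilon_min = 0.1  # decay stops once epsilon is at or below this
        self.memory = list()  # replay memory of past transitions
        self.batch_size = 32  # transitions sampled per replay
        self.gamma = 0.95  # discount factor for future rewards
        self.trewards = list()  # total reward per episode
        self.averages = list()  # moving average of total rewards
        self.max_treward = 0  # best total reward so far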
        self._create_model()
    def _create_model(self):
        self.model = Sequential()
        self.model.add(Dense(24, activation='relu', input_dim=4))
        self.model.add(Dense(24, activation='relu'))
        self.model.add(Dense(2, activation='linear'))
        self.model.compile(loss='mse', optimizer=opt)
    def act(self, state):
        if random.random() < self.epsilon:
            return env.action_space.sample()  # exploration
        return np.argmax(self.model.predict(state)[0])  # exploitation
    def replay(self):
        batch = random.sample(self.memory, self.batch_size)
        for state, action, next_state, reward, done in batch:
            if not done:
                reward += self.gamma * np.amax(self.model.predict(next_state)[0])
            target = self.model.predict(state)
            target[0, action] = reward
            self.model.fit(state, target, epochs=1, verbose=False)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
    def learn(self, episodes):
        for e in range(1, episodes + 1):
            state, _ = env.reset()
            state = np.reshape(state, [1, 4])
            for f in range(1, 301):
                action = self.act(state)
                next_state, reward, done, trunc, _ = env.step(action)
                next_state = np.reshape(next_state, [1, 4])
                self.memory.append([state, action, next_state, reward, done])
                state = next_state
                if done or trunc or f > 299:  # also record episodes hitting the step cap
                    self.trewards.append(f)
                    av = np.mean(self.trewards[-20:])  # mean of up to 20 latest episodes
                    self.averages.append(av)
                    self.max_treward = max(self.max_treward, f)
                    templ = f'episode={e:4d} | treward={f:3d} | '
                    templ += f'av={av:5.1f} | max={self.max_treward:3d}'
                    print(templ, end='\r')
                    break
            if len(self.memory) > self.batch_size:
                self.replay()
        print()
    def test(self, episodes):
        for e in range(1, episodes + 1):
            state, _ = env.reset()
            state = np.reshape(state, [1, 4])
            for f in range(1, 301):
                action = np.argmax(self.model.predict(state)[0])  # exploitation
                state, reward, done, trunc, _ = env.step(action)
                state = np.reshape(state, [1, 4])
                if done or f > 299:
                    print(f, end=' ')
                    break
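
The agent stores every transition in a plain list, so `self.memory` grows with each step. A common alternative, not used in the class above, is a bounded buffer that discards the oldest transitions. The sketch below shows the idea with `collections.deque`. The capacity of 2,000 and the placeholder transitions are assumptions for illustration.

```python
import random
from collections import deque

memory = deque(maxlen=2000)  # assumed capacity; oldest entries are dropped

for t in range(2500):
    # placeholder transition: [state, action, next_state, reward, done]
    memory.append([t, 0, t + 1, 1.0, False])

# sample a batch as in replay(); list() gives fast random access
batch = random.sample(list(memory), 32)
print(len(memory), len(batch))
```

A bounded memory keeps sampling focused on more recent experience, at the cost of forgetting early episodes.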

In [None]:
agent = DQLAgent()

In [None]:
%time agent.learn(1000)

In [None]:
sum(agent.trewards[:50]) / 50

In [None]:
sum(agent.trewards[-50:]) / 50

In [None]:
agent.epsilon
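
The value can be checked independently of training. The sketch below repeats the decay rule from `replay()` with the agent's parameters and counts how many replay calls it takes until the decay stops. The standalone variable names are chosen for illustration.

```python
epsilon, epsilon_decay, epsilon_min = 1.0, 0.995, 0.1

decays = 0
while epsilon > epsilon_min:  # same condition as in DQLAgent.replay()
    epsilon *= epsilon_decay
    decays += 1

print(decays, round(epsilon, 4))  # 460 0.0997
```

Since `replay()` runs at most once per episode, exploration reaches its lower bound roughly halfway through 1,000 training episodes. The final value ends up slightly below `epsilon_min`, because the check happens before the multiplication.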

In [None]:
agent.test(50)

In [None]:
plt.plot(agent.averages);
