<img src="http://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# Python for Finance Basics

&copy; Dr. Yves J. Hilpisch | The Python Quants GmbH

http://tpq.io | [training@tpq.io](mailto:training@tpq.io) | [@dyjh](http://twitter.com/dyjh)

## Reinforcement Learning

In [None]:
!git clone https://github.com/tpq-classes/pff_basics.git
import sys
sys.path.append('pff_basics')


In [None]:
import numpy as np
import pandas as pd
from pylab import plt
np.set_printoptions(suppress=True)
plt.style.use('seaborn-v0_8')
%config InlineBackend.figure_format = 'svg'

## `CartPole` Game

**Environment & Agents**

Topics:

**Reinforcement Learning**

* environment
* state
* agent
* action
* step
* reward/penalty
* objective
* policy
* episode
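
The terms above can be tied together in a few lines of code. The following is a minimal sketch only: `ToyWalkEnv` and `toy_policy` are hypothetical names made up for illustration, and the class merely imitates the `reset()`/`step()` interface of the Gym environment used later in this notebook.

```python
import random


class ToyWalkEnv:
    """Hypothetical toy environment: keep a position inside [-3, 3]."""

    def __init__(self, max_steps=20, seed=None):
        self.max_steps = max_steps
        self.rng = random.Random(seed)

    def reset(self):
        # Every episode starts from a fresh state.
        self.position = 0
        self.steps = 0
        return self.position, {}

    def step(self, action):
        # The action (0 = left, 1 = right) plus random drift changes the state.
        move = 1 if action == 1 else -1
        drift = self.rng.choice([-2, -1, 0, 1, 2])
        self.position += move + drift
        self.steps += 1
        terminated = abs(self.position) > 3       # failure: left the band
        truncated = self.steps >= self.max_steps  # time limit reached
        reward = 1.0                              # one unit per step taken
        return self.position, reward, terminated, truncated, {}


def toy_policy(state):
    # Policy: map a state to an action (push back towards zero).
    return 0 if state > 0 else 1


toy_env = ToyWalkEnv(seed=42)
toy_state, _ = toy_env.reset()
toy_total = 0.0
while True:
    action = toy_policy(toy_state)  # the agent chooses an action
    toy_state, reward, terminated, truncated, _ = toy_env.step(action)
    toy_total += reward             # objective: maximize the total reward
    if terminated or truncated:     # the episode ends
        break
print(toy_total)
```

One pass through the `while` loop is one step; the whole loop from `reset()` to the `break` is one episode.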


**Deep Q-Learning**

* reward function
* action policy
* representation
* deep neural network
* exploration/exploitation
* replay & policy update
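
As a rough illustration of how these pieces interact, the sketch below runs Q-learning on a hypothetical three-state toy problem. Note the swap: a plain table of action values stands in for the deep neural network, so this is tabular Q-learning with replay, not deep Q-learning proper. All names and parameter values (`q`, `memory`, `epsilon`, `gamma`, `alpha`) are assumptions chosen for illustration.

```python
import random
from collections import deque

toy_rng = random.Random(0)
n_states, n_actions = 3, 2
# Representation: a table of action values replaces the neural network here.
q = [[0.0] * n_actions for _ in range(n_states)]
memory = deque(maxlen=100)  # replay memory of past transitions
epsilon, gamma, alpha = 0.5, 0.9, 0.5


def act(state):
    # Action policy: explore with probability epsilon, exploit otherwise.
    if toy_rng.random() < epsilon:
        return toy_rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q[state][a])


def replay(batch_size=8):
    # Policy update: move Q(s, a) towards r + gamma * max Q(s', .).
    batch = toy_rng.sample(list(memory), min(batch_size, len(memory)))
    for s, a, r, s_next, done in batch:
        target = r if done else r + gamma * max(q[s_next])
        q[s][a] += alpha * (target - q[s][a])


def run_toy_episode():
    # Toy dynamics and reward function: action 1 moves one state to the
    # right and pays 1 on reaching the last state; action 0 ends the
    # episode immediately with zero reward.
    s, done = 0, False
    while not done:
        a = act(s)
        if a == 1:
            s_next = s + 1
            done = s_next == n_states - 1
            r = 1.0 if done else 0.0
        else:
            s_next, r, done = s, 0.0, True
        memory.append((s, a, r, s_next, done))
        s = s_next
    replay()


for _ in range(300):
    run_toy_episode()

# Greedy action per non-terminal state after training.
greedy = [max(range(n_actions), key=lambda a: q[s][a])
          for s in range(n_states - 1)]
print(greedy)
```

In a deep Q-learning agent, the table lookup `q[state]` would be replaced by a forward pass through the network and the in-place update by a gradient step on the same target.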


## Gym Environment

In [None]:
import gymnasium as gym

In [None]:
env = gym.make('CartPole-v1')

In [None]:
env.observation_space

In [None]:
env.action_space

In [None]:
env.action_space.n

In [None]:
env.action_space.contains(0)

In [None]:
env.action_space.contains(1)

In [None]:
for _ in range(10):
    print(env.action_space.sample(), end=' ')

In [None]:
a = env.action_space.sample()
a

In [None]:
env.reset()

In [None]:
env.step(a)

In [None]:
env.step(0)

## Random Agent

In [None]:
for _ in range(15):
    env.reset()  # start a new episode
    for e in range(1, 200):
        a = env.action_space.sample()  # random action (0 or 1)
        state, reward, done, trunc, _ = env.step(a)
        if done:  # the pole fell or the cart left the track
            print(f'*** FAILED [{e}]***')
            break

## Dimensionality Reduction

In [None]:
state, _ = env.reset()

In [None]:
state

In [None]:
from numpy.random import default_rng

In [None]:
rng = default_rng() 

In [None]:
weights = rng.random(4) * 2 - 1
weights

In [None]:
s = np.dot(state, weights)
s

In [None]:
if s > 0:
    a = 1
else:
    a = 0
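
The cells above reduce the four-dimensional state (cart position, cart velocity, pole angle, pole angular velocity) to a single number via a dot product, and the sign of that number selects the action. A small sketch with fixed numbers makes the mapping easy to check by hand; `example_state`, `example_weights` and `linear_policy` are hypothetical names and values, not output from the environment.

```python
import numpy as np

# Hypothetical example values, not taken from an actual environment run.
example_state = np.array([0.02, -0.01, 0.03, 0.04])
example_weights = np.array([0.1, 0.2, 1.0, 0.5])


def linear_policy(state, weights):
    # Project the 4-dimensional state onto a single number ...
    s = np.dot(state, weights)
    # ... and let its sign decide the action (1 if positive, else 0).
    return 1 if s > 0 else 0


# 0.002 - 0.002 + 0.03 + 0.02 = 0.05, so the action is 1.
print(np.dot(example_state, example_weights))
print(linear_policy(example_state, example_weights))
# Flipping the sign of the state flips the action to 0.
print(linear_policy(-example_state, example_weights))
```

With this rule, the whole policy is described by four numbers, which is what makes a simple random search over weight vectors feasible.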

In [None]:
treward = 0
state, _ = env.reset()
for e in range(1, 201):
    s = np.dot(state, weights)
    a = 1 if s > 0 else 0
    state, reward, done, trunc, _ = env.step(a)
    treward += reward
    if done:
        break
print(treward)

In [None]:
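def run_episode(weights):
    ''' Plays one episode (max. 200 steps) with a linear policy
    and returns the total reward. '''
    treward = 0
    state, _ = env.reset()
    for e in range(1, 201):
        s = np.dot(state, weights)  # project state onto a scalar
        a = 1 if s > 0 else 0  # sign decides the action
        state, reward, done, trunc, _ = env.step(a)
        treward += reward
        if done:  # episode failed before 200 steps
            break
    return treward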

In [None]:
weights = rng.random(4) * 2 - 1

In [None]:
run_episode(weights)

In [None]:
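besttreward = 0
bestweights = weights  # fallback if no candidate improves
for e in range(1, 101):
    candidate = rng.random(4) * 2 - 1
    treward = run_episode(candidate)
    if treward > besttreward:
        besttreward = treward
        bestweights = candidate  # remember the best weights so far
        if treward == 200.:
            print(f'*** SUCCESS [{e}] ***')
            break
        print(f'*** UPDATE [{e}] ***')
weights = bestweights  # keep the best weights, not the last sample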

In [None]:
weights

In [None]:
trewards = list()
for e in range(1000):
    treward = run_episode(weights)
    trewards.append(treward)

In [None]:
sum(trewards) / len(trewards)

In [None]:
trewards[:10]
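
Beyond the mean, a few more statistics help to judge how reliable a set of weights is. The sketch below uses a short, made-up list (`sample_rewards`) as a stand-in for `trewards`; the numbers are hypothetical and not results from the cells above.

```python
import numpy as np

# Hypothetical total rewards, standing in for the `trewards` list.
sample_rewards = [200.0, 200.0, 143.0, 200.0, 87.0, 200.0]

arr = np.array(sample_rewards)
summary = {
    'mean': arr.mean(),
    'std': arr.std(),
    'min': arr.min(),
    'max': arr.max(),
    # Fraction of episodes that reached the 200-step cap.
    'share_200': (arr == 200.0).mean(),
}
for key, value in summary.items():
    print(f'{key:>10}: {value:.3f}')
```

A high mean combined with a low minimum would suggest that the weights found in a single lucky episode do not balance the pole consistently.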
