<img src='https://certificate.tpq.io/quantsdev_banner_color.png' width="250px" align="right">

# Reinforcement Learning

**OpenAI Gym CartPole Environment**

&copy; Dr Yves J Hilpisch | The Python Quants GmbH

[quants@dev Discord Server](https://discord.gg/uJPtp9Awaj) | [@quants_dev](https://twitter.com/quants_dev) | <a href="mailto:qd@tpq.io">qd@tpq.io</a>

<img src="https://hilpisch.com/aiif_cover_shadow.png" width="300px" align="left">

## Reinforcement Learning

In [None]:
import math
import random
import numpy as np
import pandas as pd
np.set_printoptions(precision=4, suppress=True)

## `CartPole` Environment 

In [None]:
import gym

In [None]:
env = gym.make('CartPole-v0')

In [None]:
env.seed(100)
env.action_space.seed(100)

In [None]:
env.observation_space

In [None]:
env.observation_space.low.astype(np.float16)

In [None]:
env.observation_space.high.astype(np.float16)

In [None]:
state = env.reset()

In [None]:
state

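The four numbers in `state` are, in order, the cart position, the cart velocity, the pole angle (in radians) and the pole angular velocity. The following is a minimal sketch, independent of `gym`, that names these components and checks an observation against the termination thresholds of the classic CartPole task (cart position beyond ±2.4, pole angle beyond ±12 degrees). The thresholds are hard-coded assumptions rather than values read from `env`, and `CartPoleState` and `is_failed` are hypothetical helpers introduced only for illustration.

```python
import math
from typing import NamedTuple, Sequence

# Termination thresholds of the classic CartPole task (assumed here,
# not read from the `env` object used in the notebook).
X_LIMIT = 2.4                      # cart position limit
THETA_LIMIT = 12 * math.pi / 180   # pole angle limit in radians


class CartPoleState(NamedTuple):
    """Hypothetical container naming the four observation components."""
    cart_position: float
    cart_velocity: float
    pole_angle: float
    pole_angular_velocity: float


def is_failed(state: Sequence[float]) -> bool:
    """Return True if the observation lies outside the allowed region."""
    s = CartPoleState(*state)
    return abs(s.cart_position) > X_LIMIT or abs(s.pole_angle) > THETA_LIMIT


# Illustrative values only, not output from the environment.
example = CartPoleState(0.01, -0.02, 0.03, 0.04)
print(example.pole_angle, is_failed(example))
```

The bounds reported by `env.observation_space` are wider than these thresholds, so an observation can be inside the space and still end the episode.
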
In [None]:
env.action_space

In [None]:
env.action_space.n

In [None]:
env.action_space.sample()

In [None]:
env.action_space.sample() 

In [None]:
a = env.action_space.sample()
a

In [None]:
state, reward, done, info = env.step(a)
state, reward, done, info

In [None]:
env.reset()
for e in range(1, 200):
    a = env.action_space.sample()  # random action: 0 or 1
    state, reward, done, info = env.step(a)
    print(f'step={e:2d} | state={state} | action={a} | reward={reward}')
    # done before the step limit means the pole fell or the cart left the track
    if done and (e + 1) < 200:
        print('*** FAILED ***')
        break

In [None]:
done

## Dimensionality Reduction

A random weight vector projects the four-dimensional state onto a single number. See http://kvfrans.com/simple-algoritms-for-solving-cartpole/.

In [None]:
np.random.seed(100)

In [None]:
weights = np.random.random(4) * 2 - 1

In [None]:
weights

In [None]:
state = env.reset()

In [None]:
state

In [None]:
s = np.dot(state, weights)
s

## Action Rule

In [None]:
if s < 0:
    a = 0
else:
    a = 1

In [None]:
a

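Taken together, the projection and the action rule form a linear policy: the state is reduced to the scalar `s = state · weights`, and the sign of `s` selects the action (0 or 1). A minimal pure-Python sketch of that rule follows; `linear_action` is a hypothetical name, and the example state and weights are made-up values rather than output from the environment.

```python
def linear_action(state, weights):
    """Map a state to action 0 or 1 via the sign of a weighted sum."""
    s = sum(x * w for x, w in zip(state, weights))  # scalar projection
    return 0 if s < 0 else 1


# Illustrative values only.
state = [0.01, -0.02, 0.03, 0.04]
weights = [0.5, -0.5, 1.0, -1.0]
print(linear_action(state, weights))
```

A projection of exactly zero maps to action 1, matching the `else` branch of the rule above.
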
## Total Reward per Episode

In [None]:
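def run_episode(env, weights):
    state = env.reset()
    treward = 0
    for _ in range(200):  # CartPole-v0 caps an episode at 200 steps
        s = np.dot(state, weights)  # project the state onto a scalar
        a = 0 if s < 0 else 1  # the sign of the projection picks the action
        state, reward, done, info = env.step(a)
        treward += reward
        if done:
            break
    return treward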

In [None]:
run_episode(env, weights)

## Simple Learning 

In [None]:
def set_seeds(seed=100):
    random.seed(seed)
    np.random.seed(seed)
    env.seed(seed)

In [None]:
set_seeds()
num_episodes = 1000

In [None]:
besttreward = 0
for e in range(1, num_episodes + 1):
    weights = np.random.rand(4) * 2 - 1  # random weights in [-1, 1)
    treward = run_episode(env, weights)
    if treward > besttreward:
        # keep the best-performing weights seen so far
        besttreward = treward
        bestweights = weights
        if treward == 200:  # maximum total reward: stop searching
            print(f'SUCCESS | episode={e}')
            break
        print(f'UPDATE  | episode={e}')

In [None]:
bestweights

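The loop above is a plain random search: draw a weight vector uniformly from [-1, 1], evaluate it, keep it if it beats the best score so far, and stop once the target reward is reached. The sketch below isolates that pattern. Here `evaluate` is a callable standing in for `run_episode`, so the code runs without `gym`; `random_search` and `toy_evaluate` are hypothetical names, and the toy objective is an arbitrary stand-in, not the CartPole reward.

```python
import random


def random_search(evaluate, dim=4, num_trials=1000, target=200, seed=100):
    """Return (best_weights, best_score, trials_used) from random search."""
    rng = random.Random(seed)  # local generator for reproducibility
    best_score, best_weights = float('-inf'), None
    for trial in range(1, num_trials + 1):
        weights = [rng.uniform(-1, 1) for _ in range(dim)]
        score = evaluate(weights)
        if score > best_score:
            best_score, best_weights = score, weights
            if score >= target:  # good enough: stop early
                break
    return best_weights, best_score, trial


def toy_evaluate(weights):
    """Toy objective (NOT CartPole): peaks at 200 for a reference vector."""
    ref = [0.2, 0.4, 0.6, 0.8]
    return 200 - 100 * sum((w - r) ** 2 for w, r in zip(weights, ref))


best_w, best_s, trials = random_search(toy_evaluate, num_trials=500, target=199)
print(best_s, trials)
```

Returning the best weights explicitly avoids relying on whatever value the loop variable holds when the search ends without reaching the target.
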
## Testing the Results

In [None]:
res = []
for _ in range(100):
    treward = run_episode(env, bestweights)
    res.append(treward)
res[:10]

In [None]:
sum(res) / len(res)

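The average alone hides how much the episodes vary. As a minimal sketch, the results list could also be summarized by its minimum, its maximum and the share of episodes that reach the 200-step cap. `summarize_rewards` is a hypothetical helper, and the sample list contains made-up values, not results from this notebook.

```python
from statistics import mean


def summarize_rewards(rewards, max_reward=200):
    """Summarize a list of total rewards per episode."""
    return {
        'mean': mean(rewards),
        'min': min(rewards),
        'max': max(rewards),
        # share of episodes that reached the maximum total reward
        'success_rate': sum(r >= max_reward for r in rewards) / len(rewards),
    }


sample = [200.0, 200.0, 187.0, 200.0]  # illustrative values only
print(summarize_rewards(sample))
```

In the notebook, the same function could be called on `res` in place of `sample`.
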