# Reinforcement Learning with OpenAI Gym

In this notebook we explore how to train an agent to play the CartPole game: the agent controls a cart with a pole attached to it, and its task is to balance the pole for as long as possible. The code is largely inspired by [this](https://github.com/keon/deep-q-learning) repository. The accompanying [blog post](https://keon.io/deep-q-learning/) explains some of the mathematical background.

We build a *Deep Q-Learning agent*, i.e. an agent whose action values are estimated by a deep neural network. The network is built with Keras. For the training environment, we use OpenAI's [gym](https://github.com/openai/gym) library.
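
For reference, the quantity such a network is trained towards can be written as below. This is the standard one-step Q-learning target. The notation is introduced here and does not appear in the notebook: $Q_\theta$ is the network, $(s, a, r, s')$ is one remembered transition, and $\gamma$ is the discount rate.

$$
y =
\begin{cases}
r & \text{if the episode ended with this step,} \\
r + \gamma \max_{a'} Q_\theta(s', a') & \text{otherwise,}
\end{cases}
\qquad
L(\theta) \propto \bigl(y - Q_\theta(s, a)\bigr)^2
$$

Only the output for the action actually taken contributes to the loss, so the mean squared error over both outputs equals the squared difference above up to a constant factor.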

In [1]:
import random
import gym
import numpy as np
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

Using TensorFlow backend.


The environment used is `CartPole-v1`, a standard environment that ships with the `gym` library. Its state space has size 4 and its action space has size 2. That is, there are only ever two actions we can take: 0 pushes the cart to the left and 1 pushes it to the right.
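
To make the four state components concrete, here is a minimal sketch that labels an observation vector. The ordering (cart position, cart velocity, pole angle, pole angular velocity) follows the CartPole documentation. The helper `describe_observation` and the sample values are made up for illustration and are not part of `gym` or this notebook.

```python
# Hypothetical helper for illustration only; not part of gym or the notebook.
STATE_LABELS = (
    "cart position",
    "cart velocity",
    "pole angle",
    "pole angular velocity",
)


def describe_observation(observation):
    """Pair each of the four observation components with its label."""
    if len(observation) != len(STATE_LABELS):
        raise ValueError("expected an observation with 4 components")
    return dict(zip(STATE_LABELS, observation))


sample = [0.02, -0.01, 0.03, 0.04]  # made-up values, not real environment output
for name, value in describe_observation(sample).items():
    print(f"{name}: {value:+.2f}")
```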

In [2]:
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

In [7]:
# Model parameters
memory = deque(maxlen=2000)
gamma = 0.95    # discount rate
epsilon = 1.0  # exploration rate
epsilon_min = 0.05
epsilon_decay = 0.995
learning_rate = 0.001
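
To get a feel for how quickly exploration fades with these settings, the sketch below counts how many multiplicative decay steps it takes for the exploration rate to reach its floor. It assumes one decay per training update, as in the training loop of this notebook. The function name `decay_steps` is hypothetical.

```python
import math


def decay_steps(start=1.0, floor=0.05, decay=0.995):
    """Count decay steps until the exploration rate is no longer above the floor."""
    eps, steps = start, 0
    while eps > floor:
        eps *= decay
        steps += 1
    return steps, eps


steps, final_eps = decay_steps()

# Closed form for the same count: the smallest n with decay**n <= floor.
closed_form = math.ceil(math.log(0.05) / math.log(0.995))

print(steps, closed_form, final_eps)
```

With the values above, exploration reaches its floor after roughly 600 updates. Because the check happens before the multiplication, the final value ends up slightly below `epsilon_min`.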

In [4]:
# Let's build the DQN agent
model = Sequential()
model.add(Dense(24, input_dim=state_size, activation='relu'))
model.add(Dense(24, activation='relu'))
model.add(Dense(action_size, activation='linear'))
model.compile(loss='mse', optimizer=Adam(lr=learning_rate))
model.summary()





_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 24)                120       
_________________________________________________________________
dense_2 (Dense)              (None, 24)                600       
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 50        
=================================================================
Total params: 770
Trainable params: 770
Non-trainable params: 0
_________________________________________________________________
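
The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The short sketch below redoes that arithmetic for the layer sizes used here. The helper `dense_params` is hypothetical and not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Number of weights plus biases in a fully connected layer."""
    return n_inputs * n_units + n_units


# Input size 4, two hidden layers of 24 units, 2 outputs (one per action).
layer_sizes = [4, 24, 24, 2]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts, sum(counts))  # [120, 600, 50] 770
```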


In [5]:
# Helper functions
# "Remember" the transitions the agent experienced so they can be replayed for training
def remember(state, action, reward, next_state, done):
    memory.append((state, action, reward, next_state, done))

# Epsilon-greedy policy: with probability epsilon the agent acts randomly,
# otherwise it takes the action with the highest predicted Q-value
def act(state):
    if np.random.rand() <= epsilon:
        return random.randrange(action_size)
    act_values = model.predict(state)
    return np.argmax(act_values[0])  # returns action

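# "Replay" a random minibatch of remembered transitions and learn from them
def replay(batch_size):
    minibatch = random.sample(memory, batch_size)
    for state, action, reward, next_state, done in minibatch:
        # Terminal transitions have no future reward to bootstrap from
        target = reward
        if not done:
            # One-step Q-learning target: reward plus discounted best next Q-value
            target = reward + gamma * np.amax(model.predict(next_state)[0])
        # Only the Q-value of the action actually taken is moved towards the target
        target_f = model.predict(state)
        target_f[0][action] = target
        model.fit(state, target_f, epochs=1, verbose=0)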
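
The two ideas in the helpers above, epsilon-greedy action selection and the replay target, can be tried out without a network. The sketch below uses plain lists of made-up Q-values in place of `model.predict`. The function names `epsilon_greedy` and `td_target` are hypothetical, and the sketch uses a strict `<` comparison (unlike `act`) so that `epsilon = 0` is always greedy.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def td_target(reward, next_q_values, done, gamma=0.95):
    """Training target for the action taken in one remembered transition."""
    if done:
        # Episode over: there is no future reward to add.
        return reward
    # Otherwise bootstrap from the best predicted value of the next state.
    return reward + gamma * max(next_q_values)


# Made-up Q-values for a two-action problem.
print(epsilon_greedy([0.1, 0.7], epsilon=0.0))   # always the greedy action
print(td_target(1.0, [0.5, 2.0], done=False))    # 1.0 + 0.95 * 2.0
print(td_target(-10, [0.5, 2.0], done=True))     # terminal transition keeps the raw reward
```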

The actual training below takes a few minutes, but you should see good results after only a few episodes.

In [None]:
done = False
EPISODES = 20
batch_size = 32

for e in range(EPISODES):
    # Start a new episode; reshape the state into a batch of one for Keras
    state = env.reset()
    state = np.reshape(state, [1, state_size])
    for t in range(500):
        env.render()
        action = act(state)
        next_state, reward, done, _ = env.step(action)
        # Penalize the step that ended the episode
        reward = reward if not done else -10
        next_state = np.reshape(next_state, [1, state_size])
        remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            # The score is the number of steps the pole stayed up
            print("episode: {}/{}, score: {}, e: {:.2}"
                  .format(e, EPISODES, t, epsilon))
            break
        # Train once enough transitions have been collected
        if len(memory) > batch_size:
            replay(batch_size)
            # Explore less as training progresses
            if epsilon > epsilon_min:
                epsilon *= epsilon_decay