### Reinforcement learning

In this notebook, we apply reinforcement learning to train a model to play the 'CartPole' game. This is a classic control task in which the player is rewarded for keeping a pole balanced on a moving cart. The longer the pole stays upright, the bigger the reward.

In [1]:
import numpy as np
import pandas as pd
import random
import os

# Gym is a toolkit for developing and comparing reinforcement learning algorithms 
# !pip install gym 
import gym

from collections import deque
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam

Using TensorFlow backend.


In [2]:
# Initiate the environment
env = gym.make("CartPole-v1")

state_size = env.observation_space.shape[0]
action_size = env.action_space.n

# Batch_size is the number of states our agent replays, in order to learn
batch_size = 32

# We want the agent to play 51 episodes (numbered 0 to 50), so that the weights get saved at episodes 0 and 50
n_episodes = 51

# Make a folder to store output, if it is not there already
output_dir = "model_output/CartPole"
os.makedirs(output_dir, exist_ok=True)

### Define agent

Our agent is the one who takes actions in the environment. In the CartPole game there are two actions, 0 and 1 (push the cart to the left or to the right). The state of the environment is described by 4 variables:

0: Cart Position, (-4.8, 4.8)

1: Cart Velocity, (-Inf, Inf)

2: Pole Angle, (-24 deg, 24 deg)

3: Pole Velocity At Tip, (-Inf, Inf)
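
To make the layout of an observation concrete, the sketch below labels the four components and checks whether an observation would end the episode. It is only an illustration and not part of the original notebook: the names `Observation` and `is_terminal` are hypothetical, and the limits used (cart position beyond ±2.4, pole angle beyond ±12 degrees, with the angle stored in radians) are assumptions taken from the Gym CartPole documentation. They are tighter than the observation bounds listed above.

```python
import math
from collections import namedtuple

# Hypothetical helper type: labels the four values in a CartPole observation.
Observation = namedtuple(
    "Observation",
    ["cart_position", "cart_velocity", "pole_angle", "pole_tip_velocity"],
)

# Episode-ending limits as described in the Gym CartPole documentation (assumed here).
X_LIMIT = 2.4                    # cart position, in environment units
ANGLE_LIMIT = math.radians(12)   # pole angle, in radians


def is_terminal(obs):
    """Return True if the observation lies outside the episode-ending limits."""
    return abs(obs.cart_position) > X_LIMIT or abs(obs.pole_angle) > ANGLE_LIMIT


upright = Observation(0.0, 0.1, math.radians(2), -0.2)
tipped = Observation(0.5, 0.0, math.radians(15), 1.0)
print(is_terminal(upright), is_terminal(tipped))
```

In the notebook itself the environment performs this check and reports the result through the `done` flag.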

In [3]:
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size # number of variables that define a state
        self.action_size = action_size # number of actions the agent can take
        self.memory = deque(maxlen=2000) # replay memory: keeps at most the 2000 most recent transitions
        self.gamma = 0.99 # discount factor for future rewards
        self.epsilon = 1.0 # exploration rate: probability that the agent takes a random action
        self.epsilon_decay = 0.995 # factor by which epsilon shrinks after each replay
        self.epsilon_min = 0.01 # lower bound on epsilon
        self.learning_rate = 0.001 # step size for the Adam optimizer
        
        self.model = self._build_model()
        
    # This is the 'brain' of our agent: it estimates the value of each action based on the state values
    def _build_model(self):
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
        
        return model
    
    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
        
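    def act(self, state):
        # Explore: with probability epsilon, take a random action
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        
        # Exploit: otherwise take the action with the highest predicted value
        act_values = self.model.predict(state)
        
        return np.argmax(act_values[0])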
      
    
    def replay(self, batch_size):
        mini_batch = random.sample(self.memory, batch_size)
        
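        for state, action, reward, next_state, done in mini_batch:
            # If the episode ended on this step, there is no future reward to add
            target = reward
            if not done:
                # Bellman equation: reward plus the discounted value of the best next action
                target = (reward + self.gamma * np.amax(self.model.predict(next_state)[0]))
            # Keep the current predictions, overwriting only the action that was taken
            target_f = self.model.predict(state)
            target_f[0][action] = target
            
            self.model.fit(state, target_f, epochs=1, verbose=0)
            
        # Explore a little less after every replay
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay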
            
    def load(self, name):
        self.model.load_weights(name)
        
    def save(self, name):
        self.model.save_weights(name)
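
The target computed inside `replay` is the one-step Q-learning target: the reward alone if the episode ended, otherwise the reward plus `gamma` times the highest predicted value in the next state. The sketch below isolates that calculation in plain Python. The helper name `td_target` is hypothetical, and the list `next_q_values` stands in for the output of `self.model.predict(next_state)[0]`.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target, mirroring the logic in DQNAgent.replay."""
    if done:
        # Terminal transition: nothing follows, so the target is just the reward
        return reward
    # Non-terminal transition: add the discounted value of the best next action
    return reward + gamma * max(next_q_values)


# Ordinary step: reward 1, predicted values for the two actions in the next state
print(td_target(1.0, [0.5, 2.0], done=False))

# Final step of an episode, using the -10 penalty from the training loop
print(td_target(-10, [0.5, 2.0], done=True))
```

Only the entry for the action that was actually taken is overwritten with this target, so the network is not pushed to change its estimate for the other action.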

In [4]:
agent = DQNAgent(state_size, action_size)


### Interact with the environment

1. The agent takes an action (random at first) and steps through the environment. This yields a new state and a reward.
2. The agent remembers what happened.
3. If the agent has enough memories stored, it replays a batch of them and tries to 'learn' to do better.
4. Repeat 1-3 until the episode is over.
5. Repeat 1-4 for n_episodes.
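
Steps 2 and 3 can be sketched without Keras or Gym. The snippet below is a minimal illustration of the replay memory only: it uses the same `deque(maxlen=2000)` and `random.sample` calls as the agent, but the transitions are made-up placeholder tuples rather than real environment data.

```python
import random
from collections import deque

memory = deque(maxlen=2000)   # same capacity as DQNAgent.memory
batch_size = 32

# Store more transitions than the memory can hold; the oldest ones are dropped.
for step in range(2500):
    # Placeholder transition: (state, action, reward, next_state, done)
    transition = (step, 0, 1.0, step + 1, False)
    memory.append(transition)

print(len(memory))    # number of transitions kept
print(memory[0][0])   # the oldest state still in memory

# Replay only once there are more stored transitions than the batch size.
if len(memory) > batch_size:
    mini_batch = random.sample(memory, batch_size)
    print(len(mini_batch))
```

Sampling at random from the memory, instead of always learning from the latest steps, means each batch mixes transitions from different moments and different episodes.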

In [5]:
# done becomes True when the episode ends: the pole has tipped too far or the cart has moved out of bounds.
# The loop below additionally caps each episode at 200 time steps.
done = False

for episode in range(n_episodes):
    state = env.reset()
    state = np.reshape(state, [1, state_size])
    
    for time in range(200):

        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        
        reward = reward if not done else -10
        
        next_state = np.reshape(next_state, [1, state_size])
        
        agent.remember(state, action, reward, next_state, done)
        
        state = next_state
        
        if done:
            print("Episode {}/{}, score: {}, epsilon: {}".format(episode, n_episodes, time, agent.epsilon))
            break
        
        if len(agent.memory) > batch_size:
            agent.replay(batch_size)
            
    # Save the weights every 50 episodes (once per episode, not once per time step)
    if episode % 50 == 0:
        agent.save(os.path.join(output_dir, "weights_" + "{:04d}".format(episode) + ".hdf5"))

Episode 0/51, score: 25, epsilon: 1.0
Episode 1/51, score: 11, epsilon: 0.9752487531218751
Episode 2/51, score: 15, epsilon: 0.9046104802746175
Episode 3/51, score: 67, epsilon: 0.6465587967553006
Episode 4/51, score: 30, epsilon: 0.5562889678716474
Episode 5/51, score: 93, epsilon: 0.34901730169741024
Episode 6/51, score: 44, epsilon: 0.2799384215094006
Episode 7/51, score: 30, epsilon: 0.2408545925762412
Episode 8/51, score: 32, epsilon: 0.20516038984972615
Episode 9/51, score: 17, epsilon: 0.18840216465300522
Episode 10/51, score: 30, epsilon: 0.16209824418995536
Episode 11/51, score: 62, epsilon: 0.11879805134519765
Episode 12/51, score: 121, epsilon: 0.06477420436570952
Episode 13/51, score: 44, epsilon: 0.05195383849590569
Episode 14/51, score: 41, epsilon: 0.04230229704853423
Episode 15/51, score: 98, epsilon: 0.025883670561501974
Episode 16/51, score: 141, epsilon: 0.012766746905164949
Episode 17/51, score: 146, epsilon: 0.0099864
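
The epsilon values in this log follow from the decay rule in `replay`: epsilon is multiplied by 0.995 once per replay call for as long as it is above 0.01. The sketch below reproduces that rule in plain Python so the numbers can be checked by hand. The function names `epsilon_after` and `replays_until_floor` are hypothetical helpers, not part of the notebook.

```python
def epsilon_after(n_replays, epsilon=1.0, decay=0.995, epsilon_min=0.01):
    """Apply the decay rule from DQNAgent.replay n_replays times."""
    for _ in range(n_replays):
        if epsilon > epsilon_min:
            epsilon *= decay
    return epsilon


def replays_until_floor(epsilon=1.0, decay=0.995, epsilon_min=0.01):
    """Count the replay calls needed before epsilon stops decaying."""
    n = 0
    while epsilon > epsilon_min:
        epsilon *= decay
        n += 1
    return n


print(epsilon_after(5))         # compare with the epsilon logged for episode 1
print(replays_until_floor())    # replay calls until epsilon stops changing
```

Five replay calls are consistent with the log: episode 0 leaves 26 transitions in memory, so replay only starts once the memory passes 32 entries partway through episode 1.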

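Score is the number of time steps the agent survived in an episode, and it is our measure of performance. In the output shown above, the score rises from 25 in episode 0 to 146 in episode 17, so after fewer than 20 episodes the agent is already well over halfway to the 200-step cap set by the training loop.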