In [None]:
import tensorflow as tf
import numpy as np
import gym
import ppaquette_gym_doom
import matplotlib.pyplot as plt
from skimage import transform

In this environment, the Doom player is standing on top of acid water, and needs to learn how to navigate and collect health packs to stay alive.  

<img src="images/doom1.gif">


In [None]:
env = gym.make('ppaquette/DoomHealthGathering-v0')

One method of reinforcement learning we can use to solve this problem is REINFORCE with a baseline. REINFORCE is very simple: the only data it needs are the states, actions, and rewards from an environment episode. It is called a policy gradient method because it updates the agent's policy directly. The baseline adds a learned state-value estimate, which is used to judge whether an episode's return was better or worse than expected.

REINFORCE is considered a Monte Carlo method of learning. This means that the agent collects data from an entire episode and then performs its calculations at the end of that episode. In our case we will gather a batch of multiple episodes to train on.

In [None]:
# Environment Parameters
n_actions = 3
n_epochs = 5000
n = 0
average = []
step = 1
batch_size = 5000
render = False

# Define our three actions, in the order used when printing probabilities later:
# index 0 = turn right, index 1 = move forward, index 2 = turn left
choice = [[0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
          [0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
          [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]]

We next define some of the hyperparameters that our neural network will use.

Alpha is our usual learning rate and gamma is our rate of reward discounting.


In [None]:
# Hyper Parameters
alpha = 1e-4
gamma = 0.99
normalize_r = True
save_path='models/healthGather.ckpt'
value_scale = 0.5
entropy_scale = 0.00
gradient_clip = 40

Reward discounting is a way of evaluating potential future rewards given the reward history from an agent. As the discount rate approaches zero, the agent is only concerned with immediate rewards and does not consider potential future rewards. As it approaches one, distant future rewards count almost as much as immediate ones. We can write a simple function to evaluate a set of rewards from an episode, with the following:


In [None]:
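# Apply discount to episode rewards & normalize
def discount(r, gamma, normal):
    # Use a float array so integer rewards are not truncated
    discount = np.zeros(len(r), dtype=np.float64)
    G = 0.0
    for i in reversed(range(0, len(r))):
        G = G * gamma + r[i]
        discount[i] = G
    # Normalize to zero mean and unit variance
    if normal:
        mean = np.mean(discount)
        std = np.std(discount)
        # Small epsilon avoids division by zero when all returns are equal
        discount = (discount - mean) / (std + 1e-8)
    return discount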

Let’s evaluate the following sets of rewards:
    
<img src="images/discounting.png">
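
The same calculation can be checked numerically. The sketch below is a standalone re-implementation of the discounting loop (without normalization), applied to a made-up three-step reward sequence; the function name `discounted_returns` and the example rewards are assumptions for illustration only.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

example_rewards = [1.0, 1.0, 1.0]  # hypothetical episode rewards

# Low gamma: the first step's return is 1.75, so later rewards add little
near_sighted = discounted_returns(example_rewards, gamma=0.5)
# High gamma: the first step's return is about 2.97, close to the plain sum of 3
far_sighted = discounted_returns(example_rewards, gamma=0.99)

print(near_sighted)
print(far_sighted)
```

With a low discount rate the return at the first step is dominated by the immediate reward, while with a rate near one it approaches the undiscounted sum of the episode.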


Next we will build our convolutional neural network for taking in a state and outputting action probabilities and state values. We will have three actions to choose from: move forward, turn right, and turn left. The policy approximation is set up exactly the same as an image classifier, but instead of the outputs representing the confidence of a class, our outputs will represent our confidence in taking a certain action. Compared to large image classification models, reinforcement learning agents often do well with much simpler networks.

We will use a very popular convnet that was also used for the famous DQN algorithm. That network takes a processed, resized image of 84x84 pixels, applies 16 filters with an 8x8 kernel and a stride of 4, followed by 32 filters with a 4x4 kernel and a stride of 2, and finishes with a fully connected layer of 256 neurons. Note that the configuration cell below sets an 8x8 kernel and a stride of 4 for both convolutional layers, so its second layer differs from that description. For the convolutional layers we will use ‘VALID’ padding, which shrinks the image quite aggressively.

Both our policy approximation and our value approximation will share the same convolutional neural network to calculate their values.  For input we will feed in the resized pixel values from the environment.
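
To see how aggressively 'VALID' padding shrinks the image, the sketch below applies the standard output-size formula for an unpadded convolution, `(size - kernel) // stride + 1`, to both the DQN-style settings described above and the `kerns = [8,8]`, `strides = [4,4]` settings in the configuration cell. The helper name `valid_conv_output` is an assumption for illustration.

```python
def valid_conv_output(size, kernel, stride):
    """Spatial output size of a convolution with VALID (no) padding."""
    return (size - kernel) // stride + 1

# First layer is the same in both cases: 84x84 input, 8x8 kernel, stride 4
after_conv1 = valid_conv_output(84, kernel=8, stride=4)

# Second layer as described for the DQN network: 4x4 kernel, stride 2
dqn_side = valid_conv_output(after_conv1, kernel=4, stride=2)
# Second layer as set in the configuration cell: 8x8 kernel, stride 4
notebook_side = valid_conv_output(after_conv1, kernel=8, stride=4)

# Number of features fed to the fully connected layer (32 filters in conv2)
dqn_features = dqn_side * dqn_side * 32
notebook_features = notebook_side * notebook_side * 32

print(after_conv1, dqn_side, dqn_features)
print(after_conv1, notebook_side, notebook_features)
```

Under these assumptions the 84x84 frame becomes 20x20 after the first layer, then 9x9 with the DQN settings or 4x4 with the notebook settings, so the flattened input to the dense layer is considerably smaller in the second case.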


In [None]:
# Conv Layers
convs = [16,32]
kerns = [8,8]
strides = [4,4]
pads = 'valid'
fc = 256
activ = tf.nn.elu

In [None]:
# Function for resizing image
def resize(image):
    # Greyscale Image
    x = np.mean(image,-1)
    # Normalize Pixel Values
    x = x/255
    x = transform.resize(x, [84,84])
    return x

In [None]:
# Tensorflow Variables
X = tf.placeholder(tf.float32, (None,84,84,1), name='X')
Y = tf.placeholder(tf.int32, (None,), name='actions')
R = tf.placeholder(tf.float32, (None,), name='reward')
N = tf.placeholder(tf.float32, (None), name='episodes')
D_R = tf.placeholder(tf.float32, (None,), name='discounted_reward')

In [None]:
# Policy Network
conv1 = tf.layers.conv2d(
        inputs = X,
        filters = convs[0],
        kernel_size = kerns[0],
        strides = strides[0],
        padding = pads,
        activation = activ,
        name='conv1')

conv2 = tf.layers.conv2d(
        inputs=conv1,
        filters = convs[1],
        kernel_size = kerns[1],
        strides = strides[1],
        padding = pads,
        activation = activ,
        name='conv2')

flat = tf.layers.flatten(conv2)

dense = tf.layers.dense(
        inputs = flat, 
        units = fc, 
        activation = activ,
        name = 'fc')

logits = tf.layers.dense(
         inputs = dense, 
         units = n_actions, 
         name='logits')

value = tf.layers.dense(
        inputs=dense, 
        units = 1, 
        name='value')

calc_action = tf.multinomial(logits, 1)
aprob = tf.nn.softmax(logits)
action_logprob = tf.nn.log_softmax(logits)

In [None]:
tf.trainable_variables()

This function will gather a batch of training data from multiple episodes.

In [None]:
def rollout(batch_size, render):
    
    states, actions, rewards, rewardsFeed, discountedRewards = [], [], [], [], []
    state = resize(env.reset())
    episode_num = 0 
    action_repeat = 3
    reward = 0
    
    while True: 
        
        if render:
            env.render()
        
        # Run State Through Policy & Calculate Action
        feed = {X: state.reshape(1, 84, 84, 1)}
        action = sess.run(calc_action, feed_dict=feed)
        action = action[0][0]
        
        # Perform Action
        for i in range(action_repeat):
            state2, reward2, done, info = env.step(choice[action])
            reward += reward2
            if done:
                break
        
        # Store Results
        states.append(state)
        rewards.append(reward)
        actions.append(action)
        
        # Update Current State
        reward = 0
        state = resize(state2)
        
        if done:
            # Count the finished episode before the batch-size check so the
            # last episode of the batch is included in episode_num
            episode_num += 1
            
            # Track Discounted Rewards
            rewardsFeed.append(rewards)
            discountedRewards.append(discount(rewards, gamma, normalize_r))
            
            if len(np.concatenate(rewardsFeed)) > batch_size:
                break
                
            # Reset Environment
            rewards = []
            state = resize(env.reset())
                         
    return np.stack(states), np.stack(actions), np.concatenate(rewardsFeed), np.concatenate(discountedRewards), episode_num

So now that we have the model built, how are we going to have it learn? The solution is elegantly simple. We want to change the network's weights so that it increases its confidence in actions that turned out better than expected, and the size of the change is scaled by how far the observed discounted reward was from our value estimate (the baseline). Overall we need to minimize our total loss.

Implementing this in TensorFlow, we measure our loss by using the sparse_softmax_cross_entropy_with_logits function. The "sparse" means that our action labels are single integers, and the logits are our final policy output without an activation function. This function calculates the softmax and log loss for us. As confidence in a taken action approaches one, the loss approaches zero.

We then multiply the cross entropy by the difference between our discounted reward and our value approximation to get our total policy gradient loss. We calculate our value loss by using the common mean squared error loss. We then add our losses together to calculate our total loss.
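
The arithmetic behind these losses can be sketched in plain NumPy for a tiny batch. This is an illustration, not the TensorFlow graph used in this notebook: the logits, actions, returns, and value estimates are made-up numbers, and the 0.5 weight mirrors `value_scale`. All per-step arrays are kept one-dimensional so that the subtraction is element-wise.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Hypothetical batch of two steps with three possible actions
logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
actions = np.array([0, 2])       # integer labels of the actions taken
returns = np.array([1.0, -1.0])  # discounted rewards, shape (2,)
values = np.array([0.5, 0.0])    # value estimates, shape (2,)

# Sparse softmax cross entropy: negative log-probability of the taken action
cross_entropy = -log_softmax(logits)[np.arange(len(actions)), actions]

# Difference between discounted reward and value estimate, one number per step
advantage = returns - values

pg_loss = np.mean(advantage * cross_entropy)
value_loss = 0.5 * np.mean((returns - values) ** 2)
total_loss = pg_loss + value_loss

print(cross_entropy, advantage)
print(pg_loss, value_loss, total_loss)
```

A step with a positive difference pushes the probability of its action up when the loss is minimized, and a step with a negative difference pushes it down. Keeping `returns` and `values` the same shape matters: subtracting a column of shape `(batch, 1)` from a vector of shape `(batch,)` would broadcast to a `(batch, batch)` matrix.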

In [None]:
mean_reward = tf.divide(tf.reduce_sum(R), N)

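# Define Losses
# value has shape (batch, 1); squeeze it to match D_R's shape (batch,)
# so the subtraction does not broadcast to a (batch, batch) matrix
value_flat = tf.squeeze(value, axis=1)
# Treat the baseline as a constant inside the policy loss
advantage = tf.stop_gradient(D_R - value_flat)
pg_loss = tf.reduce_mean(advantage * tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=Y))
value_loss = value_scale * tf.reduce_mean(tf.square(D_R - value_flat))
# Entropy is -sum(p * log p), so entropy_loss is entropy_scale times the policy entropy
entropy_loss = -entropy_scale * tf.reduce_sum(aprob * action_logprob)
loss = pg_loss + value_loss - entropy_loss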

# Create Optimizer
optimizer = tf.train.AdamOptimizer(alpha)
grads = tf.gradients(loss, tf.trainable_variables())
grads, _ = tf.clip_by_global_norm(grads, gradient_clip) # gradient clipping
grads_and_vars = list(zip(grads, tf.trainable_variables()))
train_op = optimizer.apply_gradients(grads_and_vars)

# Initialize Session
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

In [None]:
# Setup TensorBoard Writer
writer = tf.summary.FileWriter("/tmp/dpg")
tf.summary.scalar('Total_Loss', loss)
tf.summary.scalar('PG_Loss', pg_loss)
tf.summary.scalar('Entropy_Loss', entropy_loss)
tf.summary.scalar('Value_Loss', value_loss)
tf.summary.scalar('Reward_Mean', mean_reward)
tf.summary.histogram('Conv1', tf.trainable_variables()[0])
tf.summary.histogram('Conv2', tf.trainable_variables()[2])
tf.summary.histogram('FC', tf.trainable_variables()[4])
tf.summary.histogram('Logits', tf.trainable_variables()[6])
tf.summary.histogram('Value', tf.trainable_variables()[8])
write_op = tf.summary.merge_all()

In [None]:
# Load model if exists
saver = tf.train.Saver(tf.global_variables())
load_was_success = True 
try:
    save_dir = '/'.join(save_path.split('/')[:-1])
    ckpt = tf.train.get_checkpoint_state(save_dir)
    load_path = ckpt.model_checkpoint_path
    saver.restore(sess, load_path)
except Exception:
    print("No saved model to load. Starting new session")
    writer.add_graph(sess.graph)
    load_was_success = False
else:
    print("Loaded Model: {}".format(load_path))
    saver = tf.train.Saver(tf.global_variables())
    step = int(load_path.split('-')[-1])+1

We are now ready to train the agent. We feed our current state into the network and get our action by sampling with the tf.multinomial function. We perform that action and store the state, action, and reward. We then store the new resized state2 as our current state and repeat this procedure until the end of the episode. At that point we append the episode's state, action, and reward data to the batch lists, which we feed into the network for training.
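
Sampling an action from the policy output can be sketched in NumPy as follows. This mimics what drawing one sample from a categorical distribution over the logits does; it is not the TensorFlow implementation, and the function name `sample_action` and the example logits are assumptions.

```python
import numpy as np

def sample_action(logits, rng):
    """Draw one action index with probability given by softmax(logits)."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.0, 0.0])  # hypothetical policy output for one state

# Sample many times to see that the frequencies follow the softmax probabilities
samples = [sample_action(logits, rng) for _ in range(5000)]
frequencies = np.bincount(samples, minlength=3) / len(samples)
print(frequencies)
```

Because actions are sampled rather than chosen greedily, the agent keeps trying lower-probability actions occasionally, which is where its exploration comes from.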

Depending on the random weight initialization, our agent should eventually solve the environment in roughly 1000 training batches. OpenAI’s standard for solving the environment is getting an average reward of 1,000 over 100 consecutive trials.
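
One way to check that criterion is to keep a list of per-episode total rewards and average the most recent 100. The helper below is a hypothetical sketch; the name `is_solved` and its defaults are assumptions taken from the numbers quoted above, not part of the Gym API.

```python
import numpy as np

def is_solved(episode_rewards, threshold=1000.0, window=100):
    """True when the mean of the last `window` episode rewards reaches `threshold`."""
    if len(episode_rewards) < window:
        return False  # not enough consecutive trials yet
    return float(np.mean(episode_rewards[-window:])) >= threshold

# Hypothetical reward histories
too_short = [2000.0] * 99
improving = [0.0] * 50 + [1000.0] * 100
print(is_solved(too_short), is_solved(improving))
```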

If your initial epoch gets a reward below 300, I recommend restarting the kernel to try a new random initialization of the weights.

In [None]:
while step < n_epochs+1:
    # Gather Training Data
    print('Epoch', step)
    s, a, r, d_r, n = rollout(batch_size,render)
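    # Use a separate name so the mean_reward tensor defined with the losses is not overwritten
    batch_mean_reward = np.sum(r)/n
    average.append(batch_mean_reward)
    print('Training Episodes: {}  Average Reward: {:4.2f}  Total Average: {:4.2f}'.format(n, batch_mean_reward, np.mean(average)))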
          
    # Update Network
    sess.run(train_op, feed_dict={X:s.reshape(len(s),84,84,1), Y:a, D_R: d_r})
          
    # Write TF Summaries
    summary = sess.run(write_op, feed_dict={X:s.reshape(len(s),84,84,1), Y:a, D_R: d_r, R: r, N:n})
    writer.add_summary(summary, step)
    writer.flush()
          
    # Save Model
    if step % 10 == 0:
        print("SAVED MODEL")
        saver.save(sess, save_path, global_step=step)
          
    step += 1


Here is my agent after 1000 batches:

<img src="images/doomFinal.gif">

If you want to test your agent’s confidence at any given frame, all you need to do is feed that state into the network and observe the output. In the first picture, while facing just the wall, the agent had 90% confidence that the best action was to turn right. In the picture on the right, the agent was only 61% confident that going forward was the best action, even though it seems to be the clear best choice.

<img src="images/compare.png">

In [None]:
state = resize(env.reset())
prob, val = sess.run([aprob, value], feed_dict={X: state.reshape(1, 84, 84, 1)})

print('Turn Right: {:4.2f}  Turn Left: {:4.2f}  Move Forward: {:4.2f}'.format(prob[0][0], prob[0][2], prob[0][1]))
print('Approximated State Value: {:4.4f}'.format(val[0][0]))

If you are keen, you may think to yourself that 61% confidence for what seems like a clearly good move is not that great, and you would be right! I suspect our agent has mainly learned to avoid walls extremely well, and since the agent only receives rewards for surviving, it’s not exclusively trying to pick up health packs. Picking up health packs is just correlated with surviving longer. It is possible that a more sophisticated algorithm like A3C, with its multiple agents simultaneously exploring the environment, may get better results.

In some ways I would not consider this agent fully intelligent. The agent also almost completely disregards turning left! The agent has a simple policy, but it has learned it on its own and it works!

<img src="images/tensorboard.png">