In [1]:
import gym
import ppaquette_gym_doom
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import scipy.ndimage
import scipy.misc  # resize() below calls scipy.misc.imresize (available in SciPy < 1.3)

In this notebook we will use Python, TensorFlow, and the reinforcement learning library Gym to solve the 3D Doom health-gathering environment.

In [2]:
env = gym.make('ppaquette/DoomHealthGathering-v0')

[2017-09-27 22:48:44,458] Making new env: ppaquette/DoomHealthGathering-v0


In this environment the Doom player is standing on top of acid water and needs to learn how to navigate and collect health packs to stay alive. 

<img src="images/doom1.gif">

One method of reinforcement learning we can use to solve this problem is the REINFORCE algorithm. REINFORCE is very simple: the only data it needs are the states, actions, and rewards from an environment episode.

REINFORCE is a Monte Carlo method: the agent collects data from an entire episode and only performs its update once that episode has ended. We will store the training data in empty lists and append to them at every step.

In [3]:
# Environment Parameters
n_actions = 3
n_episodes = 10000
render = False
train = True
actionRepeat = 3

# Environment Variables
states, actions, rewards = [], [], []
episode_num = 1
G = 0
reward = 0
average, avg_loss = [], []
counter = 1


Alpha is our learning rate and gamma is the discount factor applied to future rewards.
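For reference, this is how the two hyperparameters enter the textbook form of the REINFORCE update. It is a sketch only: the symbols $\theta$ (network weights), $\pi_\theta$ (the softmax policy), $r_t$ (reward at step $t$), and $T$ (episode length) are notation introduced here and are not names used in the notebook. The notebook also uses Adam and normalized returns, so its actual update differs from plain gradient ascent.

```latex
G_t = \sum_{k=0}^{T-t-1} \gamma^{k} \, r_{t+k},
\qquad
\theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```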

In [None]:
# Hyper Parameters
alpha = 1e-5
gamma = .99
save_path='models/healthGather.ckpt'

# Conv Layers
convs = [16, 32]
kerns = [8, 4]
strides = [4, 2]
pads = 'VALID'
fc = 256

# TF Placeholders & Variables
X = tf.placeholder(tf.float32, shape=(None, 96, 96, 1), name="X")
Y = tf.placeholder(dtype=tf.float32, shape=[None, n_actions],name="Y")
eps_rewards = tf.placeholder(dtype=tf.float32, shape=[None,1], name="Episode_Discounted_Rewards")
tf_g = tf.Variable(0.0)

We will use a popular convnet architecture that was also used for the famous DQN algorithm. The network takes a preprocessed 96x96-pixel greyscale image as input and applies 16 convolution filters with an 8x8 kernel and a stride of 4, followed by 32 filters with a 4x4 kernel and a stride of 2. It finishes with a fully connected layer of 256 neurons and a softmax output over the three actions. For the convolutional layers we use 'VALID' padding, which shrinks the image quite aggressively.
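To make "quite aggressively" concrete, the short sketch below works out the spatial size after each 'VALID' convolution using the kernel and stride values from the text. The helper name `valid_output_size` is made up for this illustration.

```python
def valid_output_size(size, kernel, stride):
    """Spatial output size of a 'VALID' convolution along one dimension."""
    return (size - kernel) // stride + 1

size = 96
for kernel, stride in zip([8, 4], [4, 2]):
    size = valid_output_size(size, kernel, stride)
    print(size)                      # prints 23, then 10

flat_features = size * size * 32     # 32 feature maps after the second conv
print(flat_features)                 # prints 3200
```

So the 96x96 input becomes 23x23 after the first layer and 10x10 after the second, which gives 3200 inputs to the fully connected layer.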


In [4]:
# CONVOLUTION 1 - 1
with tf.name_scope('conv1'):
    
    filter1 = tf.Variable(tf.truncated_normal([kerns[0], kerns[0], 1, convs[0]], dtype=tf.float32,
                            stddev=1/np.sqrt(96**2)), name='weights1')
    stride = [1,strides[0],strides[0],1]
    conv = tf.nn.conv2d(X, filter1, stride, padding=pads)
    biases = tf.Variable(tf.constant(0.0, shape=[convs[0]], dtype=tf.float32),
                         trainable=True, name='biases1')
    out = tf.nn.bias_add(conv, biases)
    conv1 = tf.nn.relu(out)
    
    
# CONVOLUTION 1 - 2
with tf.name_scope('conv2'):
    shape = int(np.prod(conv1.get_shape()[1:]))
    filter2 = tf.Variable(tf.truncated_normal([kerns[1], kerns[1], convs[0], convs[1]], dtype=tf.float32,
                                                stddev=1/np.sqrt(shape)), name='weights2')
    stride = [1,strides[1],strides[1],1]
    conv = tf.nn.conv2d(conv1, filter2, stride, padding=pads)
    biases = tf.Variable(tf.constant(0.0, shape=[convs[1]], dtype=tf.float32),
                         trainable=True, name='biases2')
    out = tf.nn.bias_add(conv, biases)
    conv2 = tf.nn.relu(out)
    

#FULLY CONNECTED 1
with tf.name_scope('fc1') as scope:
    shape = int(np.prod(conv2.get_shape()[1:]))
    fc1w = tf.Variable(tf.truncated_normal([shape, fc], dtype=tf.float32, stddev=1/np.sqrt(shape)), name='weights3')
    fc1b = tf.Variable(tf.constant(1.0, shape=[fc], dtype=tf.float32),
                       trainable=True, name='biases3')
    flat = tf.reshape(conv2, [-1, shape])
    out = tf.nn.bias_add(tf.matmul(flat, fc1w), fc1b)
    fc_1 = tf.nn.relu(out)
    

#FULLY CONNECTED 2 & SOFTMAX OUTPUT
with tf.name_scope('softmax') as scope:
    fc2w = tf.Variable(tf.truncated_normal([fc, n_actions], dtype=tf.float32,
                                           stddev=1/np.sqrt(fc)), name='weights4')
    fc2b = tf.Variable(tf.constant(1.0, shape=[n_actions], dtype=tf.float32),
                       trainable=True, name='biases4')
    Ylogits = tf.nn.bias_add(tf.matmul(fc_1, fc2w), fc2b)
    output = tf.nn.softmax(Ylogits)


In [5]:
# Function for resizing image
def resize(image):
    # Greyscale Image
    x = np.mean(image,-1)
    # Crop Image
    x = x[:400,100:540]
    # Resize first: imresize returns uint8 values rescaled to 0-255
    x = scipy.misc.imresize(x, [96,96])
    # Normalize Pixel Values to [0, 1]
    x = x/255.0
    return x
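`scipy.misc.imresize` was deprecated in SciPy 1.0 and removed in SciPy 1.3, so the cell above will not run on newer installations. The following is a minimal NumPy-only sketch of the same preprocessing, using nearest-neighbour sampling instead of interpolation. The function name `resize_nearest` and the 480x640 frame size are assumptions made for this illustration, and the result will not be pixel-identical to `imresize`.

```python
import numpy as np

def resize_nearest(image, out_size=96):
    """Greyscale, crop, and downsample a frame with nearest-neighbour sampling."""
    x = np.mean(image, axis=-1)          # greyscale
    x = x[:400, 100:540]                 # same crop as resize() above
    rows = np.linspace(0, x.shape[0] - 1, out_size).round().astype(int)
    cols = np.linspace(0, x.shape[1] - 1, out_size).round().astype(int)
    x = x[np.ix_(rows, cols)]            # pick out_size x out_size pixels
    return x / 255.0                     # scale to [0, 1]

# Assumed frame size; the crop above needs at least 400x540 pixels
frame = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
small = resize_nearest(frame)
print(small.shape)                       # (96, 96)
```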

In [6]:
# Apply discount to episode rewards & normalize
def discount_rewards(rewards, gamma):
    discount = np.zeros_like(rewards, dtype=np.float64)
    G = 0
    for i in reversed(range(0, len(rewards))):
        G = G * gamma + rewards[i]
        discount[i] = G
    # Normalize 
    mean = np.mean(discount)
    std = np.std(discount)
    discount = (discount - mean) / (std + 1e-8)  # epsilon avoids division by zero
    return discount
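As a small worked example of the discounting step, the sketch below applies the same backward recursion to three rewards of 1 with a deliberately small gamma of 0.5, so the numbers are easy to check by hand. The function name `discounted_returns` is made up here, and normalization is kept as a separate step for clarity.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step, without normalization."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = running * gamma + rewards[t]
        returns[t] = running
    return returns

returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)   # values 1.75, 1.5, 1.0

# Normalizing gives zero mean and unit standard deviation
normalized = (returns - returns.mean()) / returns.std()
```

The last step only sees its own reward, while earlier steps also receive credit for the discounted rewards that follow them.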


We measure the loss by taking the negative log of the network's action probabilities and multiplying it by a one-hot vector for the chosen action, so only the probability of the action actually taken contributes. As the probability of that action approaches one, the loss approaches zero; as it approaches zero, the loss approaches infinity. The gradient of this loss is weighted by the discounted episode rewards, which are passed in through `grad_loss`.

In [7]:
# Define loss
loss = -tf.log(output)*Y
loss_mean = tf.reduce_mean(loss)
optimizer = tf.train.AdamOptimizer(alpha)
grads = optimizer.compute_gradients(loss, var_list=tf.trainable_variables(), 
                                    grad_loss=eps_rewards)
# Note: this rebinds `train`, the boolean flag from the parameters cell, to the training op
train = optimizer.apply_gradients(grads)
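The masking effect of the one-hot vector is easy to see with plain NumPy. The sketch below uses made-up probabilities and returns for two steps. It multiplies the loss by the returns directly, which is intended to have the same effect as the `grad_loss` weighting above but is not the notebook's exact mechanism.

```python
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],     # softmax output for step 0
                  [0.1, 0.1, 0.8]])    # softmax output for step 1
one_hot = np.array([[1.0, 0.0, 0.0],   # action 0 was taken at step 0
                    [0.0, 1.0, 0.0]])  # action 1 was taken at step 1

loss = -np.log(probs) * one_hot        # non-zero only for the taken action
per_step = loss.sum(axis=1)            # -log(0.7) and -log(0.1)
print(per_step)                        # approximately 0.357 and 2.303

returns = np.array([[1.0], [-1.0]])    # hypothetical normalized returns
weighted = loss * returns              # per-step loss scaled by its return
```

A confident choice (0.7) yields a small loss, an unlikely choice (0.1) a large one, and the sign of the return decides whether the chosen action is made more or less likely.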

In [8]:
# Define Session and initialize variables
sess = tf.Session()
sess.run(tf.global_variables_initializer())

In [9]:
# Setup TensorBoard Writer
writer = tf.summary.FileWriter("/tmp/dpg")
writer.add_graph(sess.graph)
tf.summary.scalar('Loss', loss_mean)
tf.summary.scalar('Episode_Reward', tf_g)
tf.summary.histogram("Weights_1", filter1)
tf.summary.histogram("Weights_2", filter2)
write_op = tf.summary.merge_all()

In [10]:
# Load model if exists
saver = tf.train.Saver(tf.global_variables())
load_was_success = True 
try:
    save_dir = '/'.join(save_path.split('/')[:-1])
    ckpt = tf.train.get_checkpoint_state(save_dir)
    load_path = ckpt.model_checkpoint_path
    saver.restore(sess, load_path)
except:
    print("no saved model to load. starting new session")
    load_was_success = False
else:
    print("loaded model: {}".format(load_path))
    saver = tf.train.Saver(tf.global_variables())
    episode_num = int(load_path.split('-')[-1])+1

no saved model to load. starting new session


In [None]:
# Define our three actions of moving forward, turning left & turning right


choice = [[0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
          [0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
          [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]]

# If you want your agent to learn a good policy much faster 
# combine your turning actions with moving forward
# (the version below assumes index 13 is the move-forward button)

#choice = [[0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
#          [0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
#          [0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]]
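Long literal vectors like these are easy to get wrong by one position. The sketch below builds the same 43-element vectors from button indices. The helper `make_action` is hypothetical, and treating index 13 as "move forward" in the combined variant is an assumption that should be checked against the environment's documentation.

```python
N_BUTTONS = 43  # length of the action vectors used in `choice`

def make_action(*pressed):
    """Build a button vector with a 1 at each index in `pressed`."""
    action = [0] * N_BUTTONS
    for index in pressed:
        action[index] = 1
    return action

# Same three single-button actions as the active `choice` list
choice_sketch = [make_action(14), make_action(13), make_action(15)]

# Turning combined with the assumed move-forward button (index 13)
combined_sketch = [make_action(13, 14), make_action(13), make_action(13, 15)]
```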


In [None]:
state = resize(env.reset())

for i in range(n_episodes):
    
    while True:
        if render:
            env.render()
        
        # Get action probability from network
        feed = {X:state.reshape(1, 96, 96, 1)}
        aprob = sess.run([output], feed_dict=feed)
        action = np.random.choice(n_actions, p=aprob[0][0])
        #print(aprob)
         
        # Convert action to one hot vector
        oneHot = np.zeros(n_actions)
        oneHot[action] = 1

        # Perform action loop & store results
        for _ in range(actionRepeat):
            state2, reward2, done, info = env.step(choice[action])
            reward += reward2
            # Stop repeating the action once the episode has ended
            if done:
                break
        G += reward
        
        # Record history
        states.append(state)
        actions.append(oneHot)
        rewards.append(reward)
        reward = 0
        
        # Update current state
        state = resize(state2)
    
        if done:
            average.append(G)
            rewards = discount_rewards(rewards,gamma)
                
            # Define our network feed & measure average episode loss
            feed = {X: np.stack(states).reshape(len(states),96,96,1), 
                    eps_rewards: np.vstack(rewards), 
                    Y: np.vstack(actions)}
            losses = sess.run([loss_mean], feed_dict=feed)
            avg_loss.append(losses)
            
            if episode_num % 1 == 0:
                print('Episode: {}   G:{:4.0f}  Average: {:4.1f}  Avg. Eps. Loss: {:4.4f}'
                      .format(episode_num, G, np.mean(average), losses[0]))
                
            if episode_num % 5 == 0:
                # Write TF Summaries
                # Load G into the variable without adding new ops to the graph
                tf_g.load(float(G), sess)
                summary = sess.run(write_op, feed_dict=feed)
                writer.add_summary(summary, episode_num)
                writer.flush()
            
            if episode_num % 50 == 0:
                # Save TensorFlow Variables 
                saver.save(sess, save_path, global_step=episode_num)
                print("SAVED MODEL #{}".format(episode_num))
                if counter >=100:
                    print('Last 100 Average {}'
                          .format(np.mean(average[counter-100:counter])))
            
            if train:
                # If train == True we will update the network every episode
                _ = sess.run([train], feed_dict=feed)
            
            
            # Reset our variables for next episode
            states, actions, rewards = [], [], []
            G = 0
            state = resize(env.reset())
            episode_num += 1
            counter += 1
            break
    
    if len(average) >= 100:
        if np.mean(average[-100:]) >= 1000:
            print('Solved in {} Episodes'.format(episode_num))
            saver.save(sess, save_path, global_step=episode_num)
            print("SAVED MODEL #{}".format(episode_num))
            break
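The "solved" check at the end of the loop is a rolling average over the last 100 episode returns. A `collections.deque` with `maxlen=100` is one way to keep that window without index arithmetic. The sketch below uses made-up returns, and the names `recent` and `record_and_check` are hypothetical.

```python
from collections import deque

recent = deque(maxlen=100)  # keeps only the last 100 episode returns

def record_and_check(episode_return, threshold=1000):
    """Store a return; report True once the 100-episode mean reaches threshold."""
    recent.append(episode_return)
    window_full = len(recent) == recent.maxlen
    return window_full and sum(recent) / len(recent) >= threshold

solved = False
for g in [1200] * 100:      # hypothetical returns
    solved = record_and_check(g)
print(solved)               # True
```

The check stays `False` until 100 episodes have been recorded, so a few lucky early episodes cannot end training prematurely.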
            
            

Episode: 1   G: 171  Average: 171.0  Avg. Eps. Loss: 0.3676
Episode: 2   G: 171  Average: 171.0  Avg. Eps. Loss: 0.3492
Episode: 3   G: 267  Average: 203.0  Avg. Eps. Loss: 0.3485
Episode: 4   G: 171  Average: 195.0  Avg. Eps. Loss: 0.3458
Episode: 5   G: 171  Average: 190.2  Avg. Eps. Loss: 0.3620
Episode: 6   G: 171  Average: 187.0  Avg. Eps. Loss: 0.3507
Episode: 7   G: 171  Average: 184.7  Avg. Eps. Loss: 0.3619
Episode: 8   G: 431  Average: 215.5  Avg. Eps. Loss: 0.3641
Episode: 9   G: 459  Average: 242.6  Avg. Eps. Loss: 0.3607
Episode: 10   G: 171  Average: 235.4  Avg. Eps. Loss: 0.3565
Episode: 11   G: 171  Average: 229.5  Avg. Eps. Loss: 0.3519
Episode: 12   G: 527  Average: 254.3  Avg. Eps. Loss: 0.3520
Episode: 13   G: 267  Average: 255.3  Avg. Eps. Loss: 0.3469
Episode: 14   G: 171  Average: 249.3  Avg. Eps. Loss: 0.3510
Episode: 15   G: 363  Average: 256.9  Avg. Eps. Loss: 0.3595
Episode: 16   G: 171  Average: 251.5  Avg. Eps. Loss: 0.3624
Episode: 17   G: 171  ...

Episode: 134   G: 171  Average: 291.3  Avg. Eps. Loss: 0.3483
Episode: 135   G: 267  Average: 291.1  Avg. Eps. Loss: 0.3594
Episode: 136   G: 363  Average: 291.7  Avg. Eps. Loss: 0.3661
Episode: 137   G: 171  Average: 290.8  Avg. Eps. Loss: 0.3632
Episode: 138   G: 459  Average: 292.0  Avg. Eps. Loss: 0.3634
Episode: 139   G: 199  Average: 291.3  Avg. Eps. Loss: 0.3553
Episode: 140   G: 171  Average: 290.5  Avg. Eps. Loss: 0.3628
Episode: 141   G: 267  Average: 290.3  Avg. Eps. Loss: 0.3545
Episode: 142   G: 171  Average: 289.5  Avg. Eps. Loss: 0.3715
Episode: 143   G: 267  Average: 289.3  Avg. Eps. Loss: 0.3688
Episode: 144   G: 363  Average: 289.8  Avg. Eps. Loss: 0.3651
Episode: 145   G: 459  Average: 291.0  Avg. Eps. Loss: 0.3635
Episode: 146   G: 267  Average: 290.8  Avg. Eps. Loss: 0.3610
Episode: 147   G: 363  Average: 291.3  Avg. Eps. Loss: 0.3646
Episode: 148   G: 459  Average: 292.5  Avg. Eps. Loss: 0.3660
Episode: 149   G: 363  Average: 292.9  Avg. Eps. Loss: 0.3590
...

Episode: 265   G: 459  Average: 297.5  Avg. Eps. Loss: 0.3290
Episode: 266   G: 363  Average: 297.8  Avg. Eps. Loss: 0.3377
Episode: 267   G: 555  Average: 298.7  Avg. Eps. Loss: 0.3520
Episode: 268   G: 267  Average: 298.6  Avg. Eps. Loss: 0.3250
Episode: 269   G: 335  Average: 298.7  Avg. Eps. Loss: 0.3288
Episode: 270   G:1487  Average: 303.1  Avg. Eps. Loss: 0.3285
Episode: 271   G: 363  Average: 303.4  Avg. Eps. Loss: 0.3230
Episode: 272   G: 171  Average: 302.9  Avg. Eps. Loss: 0.3447
Episode: 273   G: 267  Average: 302.8  Avg. Eps. Loss: 0.3284
Episode: 274   G: 171  Average: 302.3  Avg. Eps. Loss: 0.3198
Episode: 275   G: 363  Average: 302.5  Avg. Eps. Loss: 0.3280
Episode: 276   G: 363  Average: 302.7  Avg. Eps. Loss: 0.3264
Episode: 277   G: 171  Average: 302.2  Avg. Eps. Loss: 0.3094
Episode: 278   G:2100  Average: 308.7  Avg. Eps. Loss: 0.3433
Episode: 279   G: 267  Average: 308.6  Avg. Eps. Loss: 0.3491
Episode: 280   G: 363  Average: 308.7  Avg. Eps. Loss: 0.3313
...

Episode: 396   G: 171  Average: 314.1  Avg. Eps. Loss: 0.3469
Episode: 397   G: 171  Average: 313.8  Avg. Eps. Loss: 0.3333
Episode: 398   G: 363  Average: 313.9  Avg. Eps. Loss: 0.3541
Episode: 399   G: 171  Average: 313.5  Avg. Eps. Loss: 0.3411
Episode: 400   G: 363  Average: 313.7  Avg. Eps. Loss: 0.3583
SAVED MODEL #400
Last 100 Average 326.52
Episode: 401   G: 459  Average: 314.0  Avg. Eps. Loss: 0.3360
Episode: 402   G: 267  Average: 313.9  Avg. Eps. Loss: 0.3534
Episode: 403   G: 267  Average: 313.8  Avg. Eps. Loss: 0.3466
Episode: 404   G: 719  Average: 314.8  Avg. Eps. Loss: 0.3508
Episode: 405   G: 363  Average: 314.9  Avg. Eps. Loss: 0.3505
Episode: 406   G: 267  Average: 314.8  Avg. Eps. Loss: 0.3334
Episode: 407   G: 171  Average: 314.5  Avg. Eps. Loss: 0.3206
Episode: 408   G: 171  Average: 314.1  Avg. Eps. Loss: 0.3454
Episode: 409   G: 719  Average: 315.1  Avg. Eps. Loss: 0.3585
Episode: 410   G: 171  Average: 314.7  Avg. Eps. Loss: 0.3508
Episode: 411   G: 267  ...

Episode: 527   G: 267  Average: 322.1  Avg. Eps. Loss: 0.3547
Episode: 528   G:1419  Average: 324.2  Avg. Eps. Loss: 0.3497
Episode: 529   G: 103  Average: 323.8  Avg. Eps. Loss: 0.3610
Episode: 530   G: 363  Average: 323.8  Avg. Eps. Loss: 0.3410
Episode: 531   G: 171  Average: 323.6  Avg. Eps. Loss: 0.3661
Episode: 532   G: 267  Average: 323.4  Avg. Eps. Loss: 0.3506
Episode: 533   G: 103  Average: 323.0  Avg. Eps. Loss: 0.3453
Episode: 534   G: 267  Average: 322.9  Avg. Eps. Loss: 0.3722
Episode: 535   G: 103  Average: 322.5  Avg. Eps. Loss: 0.3523
Episode: 536   G: 171  Average: 322.2  Avg. Eps. Loss: 0.3601
Episode: 537   G: 459  Average: 322.5  Avg. Eps. Loss: 0.3436
Episode: 538   G: 295  Average: 322.4  Avg. Eps. Loss: 0.3441
Episode: 539   G: 459  Average: 322.7  Avg. Eps. Loss: 0.3521
Episode: 540   G: 843  Average: 323.7  Avg. Eps. Loss: 0.3690
Episode: 541   G: 171  Average: 323.4  Avg. Eps. Loss: 0.3515
Episode: 542   G: 267  Average: 323.3  Avg. Eps. Loss: 0.3598
...

Episode: 658   G: 171  Average: 334.2  Avg. Eps. Loss: 0.3346
Episode: 659   G: 171  Average: 333.9  Avg. Eps. Loss: 0.3336
Episode: 660   G: 267  Average: 333.8  Avg. Eps. Loss: 0.3347
Episode: 661   G: 267  Average: 333.7  Avg. Eps. Loss: 0.3292
Episode: 662   G: 171  Average: 333.5  Avg. Eps. Loss: 0.3382
Episode: 663   G: 267  Average: 333.4  Avg. Eps. Loss: 0.3469
Episode: 664   G: 459  Average: 333.6  Avg. Eps. Loss: 0.3404
Episode: 665   G: 267  Average: 333.5  Avg. Eps. Loss: 0.3461
Episode: 666   G: 363  Average: 333.5  Avg. Eps. Loss: 0.3401
Episode: 667   G: 459  Average: 333.7  Avg. Eps. Loss: 0.3347
Episode: 668   G: 363  Average: 333.8  Avg. Eps. Loss: 0.3443
Episode: 669   G: 363  Average: 333.8  Avg. Eps. Loss: 0.3548
Episode: 670   G: 623  Average: 334.2  Avg. Eps. Loss: 0.3325
Episode: 671   G: 103  Average: 333.9  Avg. Eps. Loss: 0.3411
Episode: 672   G: 171  Average: 333.7  Avg. Eps. Loss: 0.3464
Episode: 673   G: 171  Average: 333.4  Avg. Eps. Loss: 0.3483
...

Episode: 789   G: 171  Average: 330.6  Avg. Eps. Loss: 0.3520
Episode: 790   G: 267  Average: 330.5  Avg. Eps. Loss: 0.3294
Episode: 791   G: 555  Average: 330.8  Avg. Eps. Loss: 0.3463
Episode: 792   G: 267  Average: 330.7  Avg. Eps. Loss: 0.3459
Episode: 793   G: 459  Average: 330.9  Avg. Eps. Loss: 0.3248
Episode: 794   G: 459  Average: 331.0  Avg. Eps. Loss: 0.3416
Episode: 795   G: 335  Average: 331.0  Avg. Eps. Loss: 0.3602
Episode: 796   G: 459  Average: 331.2  Avg. Eps. Loss: 0.3509
Episode: 797   G: 267  Average: 331.1  Avg. Eps. Loss: 0.3439
Episode: 798   G: 583  Average: 331.4  Avg. Eps. Loss: 0.3415
Episode: 799   G: 459  Average: 331.6  Avg. Eps. Loss: 0.3506
Episode: 800   G: 363  Average: 331.6  Avg. Eps. Loss: 0.3488
SAVED MODEL #800
Last 100 Average 318.6
Episode: 801   G: 679  Average: 332.1  Avg. Eps. Loss: 0.3473
Episode: 802   G: 335  Average: 332.1  Avg. Eps. Loss: 0.3399
Episode: 803   G: 267  Average: 332.0  Avg. Eps. Loss: 0.3398
Episode: 804   G: 679  ...