To run the code on Colab, start with the following cell. If path and file_name point to an existing checkpoint, training restores the parameters and continues from that checkpoint; otherwise it starts a new training session.

In [None]:
import tensorflow as tf  # needed here because tf.test is used below, before the main import cell
from google.colab import drive
from google.colab import files
drive.mount('/content/gdrive')

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

path = "/content/gdrive/My Drive/DQN_standard_deter"
file_name = "/dm.ckpt-555266"

In [None]:
import tensorflow as tf
import pickle
import time
import gym
import random
import numpy as np
import matplotlib.pyplot as plt
import cv2
import sys
import os
from skimage.transform import resize
import imageio

This is the most important part of the code: change the parameters here to train and test different models.

Env selects the training game; testing runs on the same game.

TEST_MODE switches between training and testing. When testing, make sure trained_path contains the corresponding trained checkpoint.

If NoFrameskip is True, testing runs on PongNoFrameskip-v4; otherwise it runs on the training game.

n_actions can be 6 or 3. As mentioned in the report, Pong returns an action space of 6 by default, but the minimal action set has 3 actions (up, down, stay).
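With n_actions set to 3, the network's output index has to be translated to a Gym action id and back. A minimal sketch of that translation, using the ids 0, 2 and 5 that appear in the Agent and Replay classes further down; the helper names to_gym_action and to_network_index are made up for this illustration:

```python
# Gym action ids used for the 3-action variant (same ids as in the classes below).
GYM_ACTIONS = [0, 2, 5]

# Inverse lookup: Gym action id -> network output index.
GYM_TO_INDEX = {gym_id: i for i, gym_id in enumerate(GYM_ACTIONS)}


def to_gym_action(network_index):
    """Translate a network output index (0..2) into the id passed to env.step."""
    return GYM_ACTIONS[network_index]


def to_network_index(gym_action):
    """Translate a Gym action id back into the index stored in replay memory."""
    return GYM_TO_INDEX[gym_action]


print(GYM_TO_INDEX)  # {0: 0, 2: 1, 5: 2}
```

The stored index, not the Gym id, is what the one-hot mask in the loss expects, which is why the replay memory converts actions on write.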

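state_shape is the input shape. Because the network uses valid padding, changing the shape does not necessarily cause an error, but it may lose information. Note that the 7x7 conv4 layer in the DUELINGawjuliani branch needs at least the default 84x84 input.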
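As a rough check of how state_shape interacts with valid padding, the sketch below computes the spatial size after conv1 to conv3, using the kernel sizes and strides from the Network class further down. The helper names conv_out and feature_map_size are made up for this illustration:

```python
def conv_out(size, kernel, stride):
    """Output length of a 'valid' convolution along one spatial dimension."""
    return (size - kernel) // stride + 1


def feature_map_size(size):
    """Spatial size after conv1, conv2 and conv3 for a square input."""
    # (kernel, stride) pairs copied from the three convolutional layers.
    for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
        size = conv_out(size, kernel, stride)
    return size


print(feature_map_size(84))  # 7
print(feature_map_size(80))  # 6
```

The floor division is where information can be lost: an 86-pixel input gives the same conv1 output as an 84-pixel one, so the extra border pixels are never seen. An 80-pixel input ends at 6x6, which is smaller than the 7x7 kernel of conv4.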

The network is updated every step_freq steps, that is, after every step_freq collected frames.

update_step_freq is for Double DQN: the main DQN is copied to the target DQN every update_step_freq steps.

Test_freq is how often, in episodes, a test game is run and its reward recorded during training.

REPLAY_MEMORY_START_SIZE is the number of frames played completely at random before training starts. Training does not start before that, and the epsilon schedule starts decreasing only afterwards.

Memory_size is the maximum number of frames stored in replay memory. It also sets the length of the first linear epsilon decay.

Max_episodes is the maximum number of training episodes; Colab usually kicks us out before that number is reached.

Max_frame is the maximum number of training frames; in the OpenAI schedule, epsilon keeps decreasing until this number.

DUELINGwang is a switch for the dueling architecture from the Wang, Z. (2015) paper.

DUELINGawjuliani is a switch for the dueling architecture from awjuliani (2017), https://github.com/awjuliani/DeepRL-Agents/blob/master/Double-Dueling-DQN.ipynb. The two switches should not both be True; if they are, the network uses the DUELINGwang branch.
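Both dueling variants end with the same aggregation step: the value stream and the mean-centred advantage stream are added to give the Q-values. A small NumPy sketch of that step, with made-up numbers and a hypothetical helper name dueling_q_values:

```python
import numpy as np


def dueling_q_values(value, advantage):
    """Combine V(s), shape (batch, 1), with A(s, a), shape (batch, n_actions)."""
    # Subtracting the mean advantage makes the split into V and A identifiable.
    return value + (advantage - advantage.mean(axis=1, keepdims=True))


value = np.array([[1.0]])                  # illustrative values only
advantage = np.array([[0.5, -0.5, 0.0]])
q = dueling_q_values(value, advantage)
print(q)
```

Because the centred advantages sum to zero, the mean of the Q-values over actions equals the value stream output.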

DOUBLE is a switch for Double DQN. For example, set both DUELINGwang and DOUBLE to True to train the Double Dueling DQN of Wang, Z. (2015).
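The Double DQN target can be sketched in NumPy as follows: the main network picks the next action and the target network evaluates it. This mirrors the learn method further down but is only an illustration; the function name double_dqn_target and the numbers are made up:

```python
import numpy as np


def double_dqn_target(rewards, q_main_next, q_target_next, terminal, gamma=0.99):
    """Bootstrapped target: the main net selects, the target net evaluates."""
    best = np.argmax(q_main_next, axis=1)                  # action selection
    evaluated = q_target_next[np.arange(len(best)), best]  # action evaluation
    not_done = 1.0 - terminal.astype(np.float32)           # no bootstrap on terminal
    return rewards + gamma * evaluated * not_done


rewards = np.array([0.0, 1.0])
q_main_next = np.array([[1.0, 2.0], [3.0, 0.0]])
q_target_next = np.array([[10.0, 20.0], [30.0, 40.0]])
terminal = np.array([False, True])
targets = double_dqn_target(rewards, q_main_next, q_target_next, terminal)
print(targets)
```

In the first transition the main network prefers action 1, so the target uses 20 from the target network; the second transition is terminal, so its target is just the reward.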

EPS takes a string. Only the DeepMind and OpenAI epsilon schedules are implemented; see section 2.3.1 of the report for more.
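The two schedules can be written as one pure function of the frame count. The sketch below follows the description in the configuration cell (DeepMind: 1 linearly to 0.1; OpenAI: 1 linearly to 0.1, then linearly to 0.01 at Max_frame). The function name epsilon_at and the parameter names start, anneal and max_frame are assumptions; their defaults are the values of REPLAY_MEMORY_START_SIZE, Memory_size and Max_frame used in this notebook:

```python
def epsilon_at(n_frames, schedule="OpenAI", start=50_000,
               anneal=1_000_000, max_frame=30_000_000):
    """Exploration rate after n_frames training frames."""
    if n_frames <= start:
        return 1.0                                   # pure random warm-up
    if n_frames <= start + anneal:
        return 1.0 - 0.9 * (n_frames - start) / anneal
    if schedule == "DeepMind":
        return 0.1                                   # held constant afterwards
    if n_frames <= max_frame:
        done = (n_frames - start - anneal) / (max_frame - start - anneal)
        return 0.1 - 0.09 * done                     # second, slower decay
    return 0.01


for n in (0, 550_000, 1_050_000, 30_000_000):
    print(n, epsilon_at(n))
```

Keeping the schedule separate from the agent makes it easy to plot or test before committing to a long training run.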

Terminal_every_point is True to send a terminal signal whenever a point ends, and False to send it only when the whole match ends.
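In Pong a nonzero reward marks the end of a point, so the flag stored in replay memory can be derived from the reward. A minimal sketch of that rule as a standalone function (the name replay_terminal is hypothetical; the Game class below implements the same logic inline):

```python
def replay_terminal(reward, done, terminal_every_point):
    """Terminal flag to store in replay memory for one transition."""
    if terminal_every_point and reward != 0:
        return True   # a point was scored, treat it as the end of an episode
    return done       # otherwise only the end of the match is terminal


print(replay_terminal(reward=1.0, done=False, terminal_every_point=True))   # True
print(replay_terminal(reward=1.0, done=False, terminal_every_point=False))  # False
```

The training loop still uses the environment's own done signal to end the episode; only the stored flag changes.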

init is the initializer used for the convolutional and dense layers of the network.

*Do not turn every switch to False; the code is not designed for training a standard model.

In [5]:
Env = 'PongDeterministic-v4' #'PongDeterministic-v4' and 'Pong-v0'

TEST_MODE = False

NoFrameskip = False

n_actions = 6 #6 or 3

state_shape = [84,84,4]

step_freq = 4

update_step_freq = 10000

Test_freq = 50

REPLAY_MEMORY_START_SIZE = 50000

Memory_size = 1000000

Max_episodes = 5001

Max_frame = 30000000

DUELINGwang = False

DUELINGawjuliani = True

DOUBLE = True

EPS = "OpenAI" #take 'DeepMind' for 1 linearly to 0.1, take 'OpenAI' for 1 linearly to 0.1 then linearly to 0.01 to MAX FRAME

Terminal_every_point = False

init = tf.variance_scaling_initializer(scale=2)


The boring part: the implementation of Dueling and Double DQN.

The structure follows code learnt in class: http://stat.columbia.edu/~cunningham/teaching/scratch_lec06.ipynb

Many useful tricks, such as storing frames as integers, were learnt from https://github.com/fg91/Deep-Q-Learning/blob/master/DQN.ipynb
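To see why storing frames as integers matters, the following back-of-the-envelope sketch compares the size of the frame buffer for uint8 and float32 storage, assuming the default Memory_size of 1000000 and 84x84 frames. The helper name replay_frame_bytes is made up:

```python
import numpy as np


def replay_frame_bytes(n_frames, height=84, width=84, dtype=np.uint8):
    """Bytes needed to keep n_frames single-channel frames in memory."""
    return n_frames * height * width * np.dtype(dtype).itemsize


GIB = 1024 ** 3
as_uint8 = replay_frame_bytes(1_000_000)
as_float32 = replay_frame_bytes(1_000_000, dtype=np.float32)
print(round(as_uint8 / GIB, 2), "GiB as uint8")
print(round(as_float32 / GIB, 2), "GiB as float32")
```

That is roughly 6.6 GiB versus 26.3 GiB for the frames alone, which is why the division by 255 is done inside the network rather than before writing to memory.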

In [6]:
class Preprocess:
    def __init__(self):

        self.frame = tf.placeholder(shape=[210, 160, 3], dtype=tf.uint8)
        self.processed = tf.image.rgb_to_grayscale(self.frame)
        self.processed = tf.image.crop_to_bounding_box(self.processed, 34, 0, 160, 160)
        self.processed = tf.image.resize_images(self.processed, [state_shape[0], state_shape[1]], 
                                                method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)

    
    def process(self, session, frame):
        return session.run(self.processed, feed_dict={self.frame:frame})

class Replay:
        
    def __init__(self, size=Memory_size, agent_history_length=4, batch_size=32):
        self.size = size
        self.agent_history_length = agent_history_length
        self.batch_size = batch_size
        self.count = 0
        self.current = 0

        self.actions = np.empty(self.size, dtype=np.int32)
        self.rewards = np.empty(self.size, dtype=np.float32)
        self.frames = np.empty((self.size, state_shape[0], state_shape[1]), dtype=np.uint8)
        # np.bool was removed in newer NumPy releases; the builtin bool is equivalent.
        self.terminal_flags = np.empty(self.size, dtype=bool)

        self.states = np.empty((self.batch_size, self.agent_history_length, 
                                state_shape[0], state_shape[1]), dtype=np.uint8)
        self.new_states = np.empty((self.batch_size, self.agent_history_length, 
                                    state_shape[0], state_shape[1]), dtype=np.uint8)
        self.indices = np.empty(self.batch_size, dtype=np.int32)
                
        self.action_space = {0:0,2:1,5:2}
         
    def write(self, action, frame, reward, terminal):
        if n_actions == 3:
            self.actions[self.current] = self.action_space[action]
        else:
            self.actions[self.current] = action
        self.frames[self.current, ...] = frame
        self.rewards[self.current] = reward
        self.terminal_flags[self.current] = terminal
        self.count = max(self.count, self.current+1)
        self.current = (self.current + 1) % self.size

    def _get_state(self, index):
        return self.frames[index-self.agent_history_length+1:index+1, ...]
        
    def _get_valid_indices(self):
        for i in range(self.batch_size):
            while True:
                index = random.randint(self.agent_history_length, self.count - 1)
                if index < self.agent_history_length:
                    continue
                if index >= self.current and index - self.agent_history_length <= self.current:
                    continue
                if self.terminal_flags[index - self.agent_history_length:index].any():
                    continue
                break
            self.indices[i] = index
    
    def read(self):
        
        self.result = []
                
        self._get_valid_indices()
            
        for i, idx in enumerate(self.indices):
            self.states[i] = self._get_state(idx - 1)
            self.new_states[i] = self._get_state(idx)
        
        return np.transpose(self.states, axes=(0, 2, 3, 1)), self.actions[self.indices], self.rewards[self.indices], np.transpose(self.new_states, axes=(0, 2, 3, 1)), self.terminal_flags[self.indices]    

class Network:
    
    def __init__(self, n_in , n_out):
        self.n_in = n_in
        self.n_actions = n_out
        self.hidden = 1024
        
        self.input = tf.placeholder(shape=[None]+self.n_in, 
                                    dtype=tf.float32)

        self.inputscaled = self.input/255
        
        # Convolutional layers
        self.conv1 = tf.layers.conv2d(
            inputs=self.inputscaled, filters=32, kernel_size=[8, 8], strides=4,
            kernel_initializer=init,
            padding="valid", activation=tf.nn.relu, use_bias=False, name='conv1')
        self.conv2 = tf.layers.conv2d(
            inputs=self.conv1, filters=64, kernel_size=[4, 4], strides=2, 
            kernel_initializer=init,
            padding="valid", activation=tf.nn.relu, use_bias=False, name='conv2')
        self.conv3 = tf.layers.conv2d(
            inputs=self.conv2, filters=64, kernel_size=[3, 3], strides=1, 
            kernel_initializer=init,
            padding="valid", activation=tf.nn.relu, use_bias=False, name='conv3')
        
        if DUELINGwang:
            self.flatten_layer = tf.layers.flatten(self.conv3)

            self.fully_A =  tf.layers.dense(
                inputs=self.flatten_layer, units=512, 
                kernel_initializer=init, name='fully_A')
              
            self.fully_V =  tf.layers.dense(
                inputs=self.flatten_layer, units=512, 
                kernel_initializer=init, name='fully_V')
            
            self.advantage = tf.layers.dense(
                inputs=self.fully_A, units=self.n_actions,
                kernel_initializer=init, name="advantage")
            
            self.value = tf.layers.dense(
                inputs=self.fully_V, units=1, 
                kernel_initializer=init, name='value')
            
            self.q_values = self.value + tf.subtract(self.advantage, tf.reduce_mean(self.advantage, axis=1, keepdims=True))

        elif DUELINGawjuliani:
            self.conv4 = tf.layers.conv2d(
                inputs=self.conv3, filters=self.hidden, kernel_size=[7, 7], strides=1, 
                kernel_initializer=init,
                padding="valid", activation=tf.nn.relu, use_bias=False, name='conv4')

            self.valuestream, self.advantagestream = tf.split(self.conv4, 2, 3)
            self.valuestream = tf.layers.flatten(self.valuestream)
            self.advantagestream = tf.layers.flatten(self.advantagestream)
            self.advantage = tf.layers.dense(
                inputs=self.advantagestream, units=self.n_actions,
                kernel_initializer=init, name="advantage")
            self.value = tf.layers.dense(
                inputs=self.valuestream, units=1, 
                kernel_initializer=init, name='value')

            self.q_values = self.value + tf.subtract(self.advantage, tf.reduce_mean(self.advantage, axis=1, keepdims=True))
           
        else:
            self.flatten_layer = tf.layers.flatten(self.conv3)
            
            self.fully_connected =  tf.layers.dense(
                inputs=self.flatten_layer, units=512, 
                kernel_initializer=init, name='fully')
            
            self.q_values = tf.layers.dense(
                inputs=self.fully_connected, units=self.n_actions,
                kernel_initializer=init)
            
        self.best_action = tf.argmax(self.q_values, 1)
        
        self.target_q = tf.placeholder(shape=[None], dtype=tf.float32)

        self.action = tf.placeholder(shape=[None], dtype=tf.int32)
        self.Q = tf.reduce_sum(tf.multiply(self.q_values, tf.one_hot(self.action, self.n_actions, dtype=tf.float32)), axis=1)
        
        self.loss = tf.reduce_mean(tf.losses.huber_loss(labels=self.target_q, predictions=self.Q))
        self.optimizer = tf.train.AdamOptimizer(0.00025)
        self.train_step = self.optimizer.minimize(self.loss)

class Agent:
    
    def __init__(self):
        self.n_in = state_shape
        self.n_out = 6
        self.total_reward = 0 
        self.gamma = 0.99
        self.epsilon = 1
        self.final_epsilon = 0.01
        self.n_frames = 0
        self.batch_size = 32
        self.q_value = 0
        self.actions = [0,2,5]
    
    def choose_action(self, tf_session, observation, main_dqn, testing=False):
        if EPS == "OpenAI":
            if testing:
                self.epsilon = 0
            elif self.n_frames <= REPLAY_MEMORY_START_SIZE:
                self.epsilon = 1.0
            elif REPLAY_MEMORY_START_SIZE < self.n_frames <= Memory_size + REPLAY_MEMORY_START_SIZE:
                self.epsilon = 1.0 - 0.9 * ((self.n_frames-REPLAY_MEMORY_START_SIZE)/Memory_size)
            elif Memory_size + REPLAY_MEMORY_START_SIZE < self.n_frames <= Max_frame:
                self.epsilon = 0.1 - 0.09 * ((self.n_frames - Memory_size - REPLAY_MEMORY_START_SIZE)/(Max_frame - Memory_size - REPLAY_MEMORY_START_SIZE))
            else:
                self.epsilon = self.final_epsilon
        elif EPS == "DeepMind":
            if testing:
                self.epsilon = 0
            elif self.n_frames <= REPLAY_MEMORY_START_SIZE:
                self.epsilon = 1.0
            elif REPLAY_MEMORY_START_SIZE < self.n_frames <= Memory_size + REPLAY_MEMORY_START_SIZE:
                self.epsilon = 1.0 - 0.9 * ((self.n_frames-REPLAY_MEMORY_START_SIZE)/Memory_size)
            else:
                # The DeepMind schedule holds epsilon at 0.1 once the linear decay has finished.
                self.epsilon = 0.1
        if np.random.rand() < self.epsilon:
            if n_actions == 3:
                return self.actions[np.random.randint(3)]
            else:
                return np.random.randint(n_actions)
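        # Fetch the greedy action and the Q-values in a single forward pass.
        choice, q_values = tf_session.run([main_dqn.best_action, main_dqn.q_values],
                                          feed_dict={main_dqn.input:[observation]})
        choice = choice[0]
        # Record the largest Q-value of the current state as a scalar.
        self.q_value = np.max(q_values)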
        if n_actions == 3:
            return self.actions[choice]
        else:
            return choice
    
    def learn(self, tf_session, replay, main_dqn, target_dqn):
        states, actions, rewards, new_states, terminal_flags = replay.read()    

        arg_q_max = tf_session.run(main_dqn.best_action, feed_dict={main_dqn.input:new_states})

        if DOUBLE:
            q_vals = tf_session.run(target_dqn.q_values, feed_dict={target_dqn.input:new_states})
            double_q = q_vals[range(self.batch_size), arg_q_max]

            target_q = rewards + (self.gamma*double_q * (1-terminal_flags))
        else:
            q_vals = tf_session.run(main_dqn.q_values, feed_dict={main_dqn.input:new_states})
            q = q_vals[range(self.batch_size), arg_q_max]

            target_q = rewards + (self.gamma*q * (1-terminal_flags))

        loss, train = tf_session.run([main_dqn.loss, main_dqn.train_step], 
                              feed_dict={main_dqn.input:states, 
                                         main_dqn.target_q:target_q, 
                                         main_dqn.action:actions})
            
    def add_frame(self):
        self.n_frames += 1    

    def gather_reward(self, reward):
        self.total_reward += reward
        
    def get_total_reward(self):
         return self.total_reward
        
    def set_total_reward(self, new_total):
         self.total_reward = new_total
        
    def set_n_frames(self,frames):
        self.n_frames = frames
        
class CopyMainNetwork:
    def __init__(self, main_dqn_vars, target_dqn_vars):
        self.main_dqn_vars = main_dqn_vars
        self.target_dqn_vars = target_dqn_vars
        # Build the assign ops once; creating them on every update would keep growing the graph.
        self.update_ops = self._update_target_vars()

    def _update_target_vars(self):
        update_ops = []
        for i, var in enumerate(self.main_dqn_vars):
            copy_op = self.target_dqn_vars[i].assign(var.value())
            update_ops.append(copy_op)
        return update_ops
            
    def update_networks(self, sess):
        # Copy every main-network variable into the target network.
        sess.run(self.update_ops)
            
class Game:
    def __init__(self, agent_history_length=4, env = Env):
        self.env = gym.make(env)
        self.frame_processor = Preprocess()
        self.state = None
        self.agent_history_length = agent_history_length
        self.states12 = []
        self.count = 1

    def reset(self, sess):
        frame = self.env.reset()
        processed_frame = self.frame_processor.process(sess, frame)
        self.state = np.repeat(processed_frame, self.agent_history_length, axis=2)
        
    def step(self, sess, action):
        new_frame, reward, terminal, info = self.env.step(action)
        
        if Terminal_every_point:
            if reward != 0:
                terminal_life_lost = True
            else:
                terminal_life_lost = terminal
                
        else:
            terminal_life_lost = terminal
        
        processed_new_frame = self.frame_processor.process(sess, new_frame)
        new_state = np.append(self.state[:, :, 1:], processed_new_frame, axis=2)  
        self.state = new_state
        
        return processed_new_frame, reward, terminal, terminal_life_lost, new_frame
    

Training part of the implementation. It restores the checkpoint if a half-trained model exists; otherwise it starts a new training session.

There is a saver, which saves the model to path every 50 episodes. It also saves a pickle named results, consisting of four lists: training rewards, testing Q-values, testing rewards, and number of frames trained.
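The results pickle can be read back outside the training loop, for example to plot progress or to find the frame count to resume from. A small round-trip sketch with dummy numbers; the helper names save_results and load_results are hypothetical, and a temporary directory stands in for the Drive path:

```python
import os
import pickle
import tempfile


def save_results(path, ep_rewards, ep_qs, test_rewards, frames):
    """Write the four result lists in the order used by the training cell."""
    with open(os.path.join(path, "results"), "wb") as fp:
        pickle.dump([ep_rewards, ep_qs, test_rewards, frames], fp)


def load_results(path):
    """Read the four result lists back as a tuple."""
    with open(os.path.join(path, "results"), "rb") as fp:
        ep_rewards, ep_qs, test_rewards, frames = pickle.load(fp)
    return ep_rewards, ep_qs, test_rewards, frames


with tempfile.TemporaryDirectory() as tmp:
    # Dummy values, not real training results.
    save_results(tmp, [-21.0, -19.0], [0.1], [-20.0], [800, 1700])
    ep_rewards, ep_qs, test_rewards, frames = load_results(tmp)
    resume_from = frames[-1]

print(resume_from)  # 1700
```

The last entry of frames is what the training cell passes to agent.set_n_frames when it resumes.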

There is a timer: it stops training after 40000 seconds and saves, so that Colab does not kick us out before the latest result is saved.

In [None]:
if not TEST_MODE:
    with tf.Graph().as_default():
        ep_rewards = []
        ep_qs = []
        test_rewards = []
        frames = []

        atari = Game()

        with tf.variable_scope('mainDQN'):
            MAIN_DQN = Network(state_shape, n_actions)
        with tf.variable_scope('targetDQN'):
            TARGET_DQN = Network(state_shape, n_actions)

        if DOUBLE:
            MAIN_DQN_VARS = tf.trainable_variables(scope='mainDQN')
            TARGET_DQN_VARS = tf.trainable_variables(scope='targetDQN')
            network_updater = CopyMainNetwork(MAIN_DQN_VARS, TARGET_DQN_VARS)

        my_replay_memory = Replay()
        agent = Agent()

        with tf.Session() as sess:

            saver = tf.train.Saver()

            try:
                saver.restore(sess, path + file_name)
                with open(path + '/results', 'rb') as fp:
                    file = pickle.load(fp)
                    ep_rewards = file[0] 
                    ep_qs = file[1]
                    test_rewards = file[2]
                    frames = file[3]
                    agent.set_n_frames(frames[-1])
                    print('set frames n =' , agent.n_frames)
            except Exception:
                # No usable checkpoint or results file: fall back to a fresh start.
                print('usual tf initialization: initializer is He et al. 2015 equation 10')
                sess.run(tf.global_variables_initializer())

            ####
            # Q-learn (train) DQN on Pong
            ####

            start = time.time()

            for ep in range(Max_episodes): 

                atari.reset(sess)

                agent.set_total_reward(0)
                qs = []

                while True:
                    action = agent.choose_action(sess, atari.state, MAIN_DQN)

                    processed_frame, reward, done, terminal_life_lost, _ = atari.step(sess,action)

                    agent.add_frame()

                    agent.gather_reward(reward)

                    my_replay_memory.write(action=action, 
                                           frame=processed_frame[:, :, 0],
                                           reward=reward, 
                                           terminal=terminal_life_lost) 

                    if my_replay_memory.count > REPLAY_MEMORY_START_SIZE and agent.n_frames % step_freq == 0:
                        agent.learn(sess, my_replay_memory, MAIN_DQN, TARGET_DQN)

                    if DOUBLE and my_replay_memory.count > REPLAY_MEMORY_START_SIZE and agent.n_frames % update_step_freq == 0:
                        network_updater.update_networks(sess)

                    if done:
                        ep_rewards.append(agent.get_total_reward())
                        frames.append(agent.n_frames)
                        break

                ####
                # Control Pong with greedy learned DQN (test)
                #### 

                if (ep+1) % Test_freq == 0:
                    atari.reset(sess)                

                    agent.set_total_reward(0)
                    while True:
                        action = agent.choose_action(sess, atari.state, MAIN_DQN,testing=True)

                        processed_frame, reward, done, terminal_life_lost, frame = atari.step(sess,action)

                        qs.append(agent.q_value)

                        agent.gather_reward(reward)
                        if done==True:
                            ep_qs.append(np.mean(qs))
                            test_rewards.append(agent.get_total_reward())
                            break

                end = time.time()
                time_elapsed = end - start

                if (ep+1) % 50 == 0:
                    print('After {} episodes, last 50 rewards averaged {}'.format(ep+1, np.mean(ep_rewards[-50:])))
                    print('After {} episodes, last Q {}'.format(ep+1, ep_qs[-1]))
                    print('After {} episodes, last test rewards {}'.format(ep+1, test_rewards[-1]))

                    print(frames[-1],'frames trained')

                    print(time_elapsed,'seconds passed')

                    save_path = saver.save(sess, path + "/dm.ckpt",global_step=frames[-1])

                    with open(path+'/results', 'wb') as fp:
                        pickle.dump([ep_rewards,ep_qs,test_rewards,frames], fp)

                    print('done saving at',save_path)
                
                #if colab is about to kick us out
                if time_elapsed > 40000:
                    save_path = saver.save(sess, path + "/dm.ckpt",global_step=frames[-1])

                    with open(path+'/results', 'wb') as fp:
                        pickle.dump([ep_rewards,ep_qs,test_rewards,frames], fp)

                    print('done saving at',save_path)

                    print('colab session time out')
                    break 

            plt.plot(ep_rewards, linewidth=2)
            plt.xlabel('episode')
            plt.ylabel('total reward per episode')
            plt.title('DQN q-learning (training)')
            plt.show()

Testing mode. Make sure trained_path, save_file and the file name in saver.restore are correct before testing. The model configuration in the first part above must also be the same as that of the trained model, or restoring will raise an error. A GIF is saved after testing.

In [4]:
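def generate_gif(frame_number, frames_for_gif, reward, path):
    # Write the collected frames as a GIF named after the frame number and total reward.
    gif_name = "ATARI_frame_{0}_reward_{1}.gif".format(frame_number, reward)
    imageio.mimsave(os.path.join(path, gif_name), frames_for_gif, fps=30)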
    
if TEST_MODE:
    # Start from an empty default graph so the cell can be re-run without variable name clashes.
    tf.reset_default_graph()
    
    if NoFrameskip:
        atari = Game(env = 'PongNoFrameskip-v4')
    else:
        atari = Game(env = Env)
    
    with tf.variable_scope('mainDQN'):
        MAIN_DQN = Network(state_shape, n_actions)  
    with tf.variable_scope('targetDQN'):
        TARGET_DQN = Network(state_shape, n_actions)             

    if DOUBLE:
        MAIN_DQN_VARS = tf.trainable_variables(scope='mainDQN')
        TARGET_DQN_VARS = tf.trainable_variables(scope='targetDQN')
        network_updater = CopyMainNetwork(MAIN_DQN_VARS, TARGET_DQN_VARS)

    agent = Agent()
    
    gif_path = "GIF/"
    os.makedirs(gif_path,exist_ok=True)
    
    trained_path = "content/gdrive/My Drive/DQN_wang_deter/"
    save_file = "dm.ckpt-8079637.meta"
    
    with tf.Session() as sess:
        saver = tf.train.Saver()

        saver.restore(sess,trained_path+"dm.ckpt-8079637")

        frames_for_gif = []
        atari.reset(sess)
        episode_reward_sum = 0
        action = np.random.randint(6)
        while True:
            atari.env.render()

            action = agent.choose_action(sess, atari.state, MAIN_DQN,testing=True)
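            if NoFrameskip:
                # Repeat the chosen action on three raw frames; atari.step below supplies the fourth.
                terminal = False
                for i in range(3):
                    atari.env.render()
                    new_frame, reward, terminal, info = atari.env.step(action) 
                    episode_reward_sum += reward
                    frames_for_gif.append(new_frame)
                    if terminal:
                        break
                if terminal:
                    # The episode ended inside the repeat loop; do not step a finished environment.
                    break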

            processed_new_frame, reward, terminal, terminal_live_lost, new_frame = atari.step(sess, action)
            episode_reward_sum += reward
            frames_for_gif.append(new_frame)

            if terminal == True:
                break

    atari.env.close()
    print("The total reward is {}".format(episode_reward_sum))
    print("Creating gif...")
    generate_gif(0, frames_for_gif, episode_reward_sum, gif_path)
    print("Gif created, check the folder {}".format(gif_path))

INFO:tensorflow:Restoring parameters from content/gdrive/My Drive/DQN_wang_deter/dm.ckpt-8079637
The total reward is 20.0
Creating gif...
Gif created, check the folder GIF/



Source: http://stat.columbia.edu/~cunningham/teaching/scratch_lec06.ipynb and https://github.com/fg91/Deep-Q-Learning/blob/master/DQN.ipynb