# Playing Atari games using Categorical DQN

Let's implement the categorical DQN algorithm for playing Atari games. The code used
in this section is adapted from the open-source categorical DQN implementation provided by Prince Wen:
https://github.com/princewen/tensorflow_practice/tree/master/RL/Basic-DisRLDemo



In [1]:
import tensorflow as tf
print(tf.__version__)

2.0.0


First, let's import the necessary libraries:

In [3]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import random
from collections import deque
import math

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

import gym
from tensorflow.python.framework import ops

## Defining the convolutional layer

In [4]:
def conv(inputs, kernel_shape, bias_shape, strides, weights, bias=None, activation=tf.nn.relu):

    weights = tf.get_variable('weights', shape=kernel_shape, initializer=weights)
    conv = tf.nn.conv2d(inputs, weights, strides=strides, padding='SAME')
    if bias_shape is not None:
        biases = tf.get_variable('biases', shape=bias_shape, initializer=bias)
        return activation(conv + biases) if activation is not None else conv + biases
    
    return activation(conv) if activation is not None else conv

## Defining the dense layer

In [5]:
def dense(inputs, units, bias_shape, weights, bias=None, activation=tf.nn.relu):
    
    if not isinstance(inputs, ops.Tensor):
        inputs = ops.convert_to_tensor(inputs, dtype='float')
    if len(inputs.shape) > 2:
        inputs = tf.layers.flatten(inputs)
    flatten_shape = inputs.shape[1]
    weights = tf.get_variable('weights', shape=[flatten_shape, units], initializer=weights)
    dense = tf.matmul(inputs, weights)
    if bias_shape is not None:
        assert bias_shape[0] == units
        biases = tf.get_variable('biases', shape=bias_shape, initializer=bias)
        return activation(dense + biases) if activation is not None else dense + biases
    
    return activation(dense) if activation is not None else dense


## Defining the variables

Now, let's define some of the important variables.

Initialize $V_{min}$ and $V_{max}$:

In [6]:
v_min = 0
v_max = 1000

Initialize the number of atoms (supports) $N$:

In [7]:
atoms = 51
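
As a quick check of what these settings imply, here is a minimal NumPy sketch (not part of the original notebook) that computes the spacing between atoms and the support values, and then a Q value as the expectation of a distribution over that support. The uniform distribution `p` is a made-up example, used only for illustration.

```python
import numpy as np

v_min, v_max, atoms = 0, 1000, 51

# spacing between neighbouring atoms
delta_z = (v_max - v_min) / (atoms - 1)

# the support: evenly spaced values from v_min to v_max
z = np.array([v_min + i * delta_z for i in range(atoms)])

# hypothetical distribution over the atoms (uniform, for illustration only)
p = np.full(atoms, 1.0 / atoms)

# the Q value is the expected value of the support under p
q = np.sum(p * z)

print(delta_z)         # spacing between atoms
print(z[:3], z[-1])    # first three atoms and the last one
print(q)               # approximately the midpoint of the support for a uniform p
```

With these values the atoms are 20 apart, so the support runs 0, 20, 40, and so on up to 1000.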

Define the discount factor, $\gamma$:

In [8]:
gamma = 0.99 

Define the batch size:

In [9]:
batch_size = 10

Define the number of time steps after which we update the target network:

In [10]:
update_target_net = 50 

Define the epsilon value which is used in the epsilon-greedy policy:

In [11]:
epsilon = 0.5
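
To make the role of epsilon concrete, here is a small standalone sketch of epsilon-greedy selection. The function name `epsilon_greedy` and the Q values are hypothetical and used only for illustration; the agent's own action selection is defined later in the class.

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon):
    # with probability epsilon, explore: pick a uniformly random action
    if random.random() <= epsilon:
        return random.randint(0, len(q_values) - 1)
    # otherwise exploit: pick the action with the highest Q value
    return int(np.argmax(q_values))

q_values = [1.0, 3.5, 2.0]   # made-up Q values for three actions

print(epsilon_greedy(q_values, epsilon=0.5))   # random action about half of the time
print(epsilon_greedy(q_values, epsilon=1.0))   # always a random action
```

With `epsilon = 0.5`, roughly half of the selected actions are exploratory, which is a fairly high rate of exploration.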

## Defining the replay buffer

First, let's define the buffer length:

In [12]:
buffer_length = 20000

Define the replay buffer as a deque structure:

In [13]:
replay_buffer = deque(maxlen=buffer_length)

We define a function called `sample_transitions`, which returns a randomly sampled
minibatch of transitions from the replay buffer:

In [14]:
def sample_transitions(batch_size):
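    #sample without replacement directly from the deque. Converting the buffer to a NumPy
    #array can fail because each transition mixes arrays, scalars, and lists
    return random.sample(replay_buffer, min(batch_size, len(replay_buffer)))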
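
The following sketch illustrates how a bounded deque behaves as a replay buffer, and one way to draw a uniform minibatch from it. The tiny buffer size and the tuple-shaped transitions are assumptions made for readability; they are not the values used in this notebook.

```python
import random
from collections import deque

# a tiny buffer so that the eviction is easy to see (the notebook uses 20000)
buffer = deque(maxlen=3)

for t in range(5):
    # hypothetical transition: (state, reward, action, next_state)
    buffer.append((t, 0.0, 0, t + 1))

# only the three most recent transitions remain; the oldest were dropped
print(list(buffer))

# draw two distinct transitions uniformly at random
minibatch = random.sample(list(buffer), 2)
print(minibatch)
```

Because `maxlen` is set, appending to a full deque silently discards the oldest transition, so no explicit eviction logic is needed.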

## Defining the Categorical DQN class

Let's define a class called `Categorical_DQN` in which we implement the categorical
DQN algorithm. For a clearer understanding, you can also check the detailed explanation of the code in the book.

In [15]:
class Categorical_DQN():
    
    #first, let's define the init method
    def __init__(self,env):
        
        #start the TensorFlow session
        self.sess = tf.InteractiveSession()
        
        #initialize v_min and v_max
        self.v_max = v_max
        self.v_min = v_min
        
        #initialize the number of atoms
        self.atoms = atoms 
        
        #initialize the epsilon value
        self.epsilon = epsilon
        
        #get the state shape of the environment
        self.state_shape = env.observation_space.shape
        
        #get the action shape of the environment
        self.action_shape = env.action_space.n

        #initialize the time step:
        self.time_step = 0
        
        #initialize the target state shape
        target_state_shape = [1]
        target_state_shape.extend(self.state_shape)

        #define the placeholder for the state
        self.state_ph = tf.placeholder(tf.float32,target_state_shape)
        
        #define the placeholder for the action
        self.action_ph = tf.placeholder(tf.int32,[1,1])
                                       
        #define the placeholder for the m value (distributed probability of target distribution)
        self.m_ph = tf.placeholder(tf.float32,[self.atoms])
    
        #compute delta z
        self.delta_z = (self.v_max - self.v_min) / (self.atoms - 1)
                                       
        #compute the support values
        self.z = [self.v_min + i * self.delta_z for i in range(self.atoms)]

        self.build_categorical_DQN()
                                       
        #initialize all the TensorFlow variables
        self.sess.run(tf.global_variables_initializer())

        
    #let's define a function called build_network for building a deep network. Since we are
    #dealing with the Atari games, we use the convolutional neural network
                                       
    def build_network(self, state, action, name, units_1, units_2, weights, bias, reg=None):
                                       
        #define the first convolutional layer
        with tf.variable_scope('conv1'):
            conv1 = conv(state, [5, 5, 3, 6], [6], [1, 2, 2, 1], weights, bias)
                                       
        #define the second convolutional layer
        with tf.variable_scope('conv2'):
            conv2 = conv(conv1, [3, 3, 6, 12], [12], [1, 2, 2, 1], weights, bias)
                                       
        #flatten the feature maps obtained as a result of the second convolutional layer
        with tf.variable_scope('flatten'):
            flatten = tf.layers.flatten(conv2)
    
        #define the first dense layer
        with tf.variable_scope('dense1'):
            dense1 = dense(flatten, units_1, [units_1], weights, bias)
                                       
        #define the second dense layer
        with tf.variable_scope('dense2'):
            dense2 = dense(dense1, units_2, [units_2], weights, bias)
                                       
        #concatenate the second dense layer with the action
        with tf.variable_scope('concat'):
            concatenated = tf.concat([dense2, tf.cast(action, tf.float32)], 1)
                                       
        #define the third layer and apply the softmax function to the result of the third layer and
        #obtain the probabilities for each of the atoms
        with tf.variable_scope('dense3'):
            dense3 = dense(concatenated, self.atoms, [self.atoms], weights, bias) 
        return tf.nn.softmax(dense3)

    #now, let's define a function called build_categorical_DQN for building the main and
    #target categorical deep Q networks
                                       
    def build_categorical_DQN(self):      
                                       
        #define the main categorical DQN and obtain the probabilities
        with tf.variable_scope('main_net'):
            name = ['main_net_params',tf.GraphKeys.GLOBAL_VARIABLES]
            weights = tf.random_uniform_initializer(-0.1,0.1)
            bias = tf.constant_initializer(0.1)

            self.main_p = self.build_network(self.state_ph,self.action_ph,name,24,24,weights,bias)
                                       
        #define the target categorical DQN and obtain the probabilities
        with tf.variable_scope('target_net'):
            name = ['target_net_params',tf.GraphKeys.GLOBAL_VARIABLES]

            weights = tf.random_uniform_initializer(-0.1,0.1)
            bias = tf.constant_initializer(0.1)

            self.target_p = self.build_network(self.state_ph,self.action_ph,name,24,24,weights,bias)

        #compute the main Q value with probabilities obtained from the main categorical DQN
        self.main_Q = tf.reduce_sum(self.main_p * self.z)
                                    
        #similarly, compute the target Q value with probabilities obtained from the target categorical DQN 
        self.target_Q = tf.reduce_sum(self.target_p * self.z)
        
        #define the cross entropy loss
        self.cross_entropy_loss = -tf.reduce_sum(self.m_ph * tf.log(self.main_p))
        
        #define the optimizer and minimize the cross entropy loss using Adam optimizer
        self.optimizer = tf.train.AdamOptimizer(0.01).minimize(self.cross_entropy_loss)
    
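        #get the main network parameters. The 'main_net_params' collection is never populated,
        #so we collect the trainable variables by variable scope instead
        main_net_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='main_net')
        
        #get the target network parameters
        target_net_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='target_net')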
        
        #define the update_target_net operation for updating the target network parameters by
        #copying the parameters of the main network
        self.update_target_net = [tf.assign(t, e) for t, e in zip(target_net_params, main_net_params)]

    #let's define a function called train to train the network
    def train(self,s,r,action,s_,gamma):
        
        #increment the time step
        self.time_step += 1
    
        #get the target Q values
        list_q_ = [self.sess.run(self.target_Q,feed_dict={self.state_ph:[s_],self.action_ph:[[a]]}) for a in range(self.action_shape)]
        
        #select the next state action a dash as the one which has the maximum Q value. We use
        #np.argmax so that no new op is added to the TensorFlow graph on every training step
        a_ = int(np.argmax(list_q_))
        
        #initialize an array m with zeros, with one entry per atom of the support. It denotes
        #the distributed probability of the target distribution after the projection step

        m = np.zeros(self.atoms)
        
        #get the probability for each atom using the target categorical DQN
        p = self.sess.run(self.target_p,feed_dict = {self.state_ph:[s_],self.action_ph:[[a_]]})[0]
        
        #perform the projection step
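        for j in range(self.atoms):
            Tz = min(self.v_max,max(self.v_min,r+gamma * self.z[j]))
            bj = (Tz - self.v_min) / self.delta_z 
            l,u = math.floor(bj),math.ceil(bj) 

            pj = p[j]

            #if Tz falls exactly on an atom then l == u and both weights below would be zero,
            #so assign the whole probability mass to that atom
            if l == u:
                m[int(l)] += pj
            else:
                m[int(l)] += pj * (u - bj)
                m[int(u)] += pj * (bj - l)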
    
        #train the network by minimizing the loss
        self.sess.run(self.optimizer,feed_dict={self.state_ph:[s] , self.action_ph:[action], self.m_ph: m })
        
        #update the target network parameters by copying the main network parameters
        if self.time_step % update_target_net == 0:
            self.sess.run(self.update_target_net)
    
    #let's define a function called select_action for selecting the action. We generate a
    #random number and if the number is less than or equal to epsilon we select a random
    #action, else we select the action which has the maximum Q value
    def select_action(self,s):
        if random.random() <= self.epsilon:
            return random.randint(0, self.action_shape - 1)
        else: 
            return np.argmax([self.sess.run(self.main_Q,feed_dict={self.state_ph:[s],self.action_ph:[[a]]}) for a in range(self.action_shape)])
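
The projection step is the least obvious part of the algorithm, so here is a standalone NumPy sketch of it. The function name `project` and the uniform next-state distribution are assumptions made for illustration. The sketch also handles the case where the shifted value lands exactly on an atom, in which case that atom receives the whole probability mass.

```python
import math
import numpy as np

def project(p, r, gamma, z, v_min, v_max):
    """Project the shifted distribution r + gamma * z back onto the support z."""
    atoms = len(z)
    delta_z = (v_max - v_min) / (atoms - 1)
    m = np.zeros(atoms)
    for j in range(atoms):
        # apply the Bellman update to atom j and clip it to [v_min, v_max]
        Tz = min(v_max, max(v_min, r + gamma * z[j]))
        # fractional index of Tz on the support
        bj = (Tz - v_min) / delta_z
        l, u = math.floor(bj), math.ceil(bj)
        if l == u:
            # Tz sits exactly on an atom, so that atom receives all the mass
            m[l] += p[j]
        else:
            # otherwise split the mass between the two neighbouring atoms
            m[l] += p[j] * (u - bj)
            m[u] += p[j] * (bj - l)
    return m

z = np.linspace(0, 1000, 51)
p = np.full(51, 1.0 / 51)   # made-up next-state distribution
m = project(p, r=1.0, gamma=0.99, z=z, v_min=0, v_max=1000)

# the projection should preserve the total probability (up to rounding)
print(m.sum())
```

Each atom's mass is divided between its two nearest neighbours in proportion to how close the shifted value is to each of them, so the result is again a valid distribution over the same support.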


## Training the network

Now, let's start training the network. First, we create the Atari game environment using
Gym. Let's create a Tennis game environment:

In [16]:
env = gym.make("Tennis-v0")

Create an instance of our `Categorical_DQN` class:

In [None]:
agent = Categorical_DQN(env)

Define the number of episodes:

In [18]:
num_episodes = 800

In [None]:
#for each episode
for i in range(num_episodes):
    
    #set done to False
    done = False
    
    #initialize the state by resetting the environment
    state = env.reset()
    
    #initialize the return
    Return = 0
    
    #while the episode is not over
    while not done:
        
        #render the environment
        env.render()
        
        #select an action
        action = agent.select_action(state)
        
        #perform the selected action
        next_state, reward, done, info = env.step(action)
        
        #update the return
        Return = Return + reward
        
        #store the transition information into the replay buffer
        replay_buffer.append([state, reward, [action], next_state])
        
        #if the length of the replay buffer is greater than or equal to the batch size then start
        #training the network by sampling transitions from the replay buffer
        if len(replay_buffer) >= batch_size:
            trans = sample_transitions(2)
            for item in trans:
                agent.train(item[0],item[1], item[2], item[3],gamma)
                
        #update the state to the next state
        state = next_state
    
    #print the return obtained in the episode
    print("Episode:{}, Return: {}".format(i,Return))

Now that we have learned how categorical DQN works and how to implement it, in the next
section, we will learn another interesting algorithm.