# Deep Q-Learning to Play Space Invaders using TensorFlow

In this notebook I try to build an agent that plays the Atari 2600 game Space Invaders.

This notebook is inspired by the tutorial below.

In [1]:
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/gCJyVX98KJ4?showinfo=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')



In [2]:
import tensorflow as tf # Deep Learning Library
import numpy as np # Handle Matrices
import retro # Retro Environment

from skimage import transform # helps to preprocess the frames
from skimage.color import rgb2gray # Helps to convert frames from color to grayscale

import matplotlib.pyplot as plt # Displays graphs

from collections import deque # Ordered Collection with ends

import random

import warnings # This ignores all the warning messages that are normally printed during training because of skimage

warnings.filterwarnings('ignore')

## Creating the Environment

This time we use **OpenAI Retro**, a wrapper for video game emulator cores that uses the Libretro API to turn them into Gym environments.

### Our Environment

We use the Atari Space Invaders environment

In [3]:
# Creating the environment
env = retro.make(game='SpaceInvaders-Atari2600')

print("The size of our frame is : ",env.observation_space)
print("The action size is : ",env.action_space.n)

# Here we create a one-hot encoded version of our actions
# possible actions = [[1, 0, 0, 0, 0, 0, 0, 0],[0, 1, 0, 0, 0, 0, 0, 0]...]
possible_actions = np.array(np.identity(env.action_space.n,dtype=int).tolist())

The size of our frame is :  Box(210, 160, 3)
The action size is :  8
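The one-hot action table can be checked on its own, without the emulator. This is a minimal sketch that assumes 8 actions, as printed above; `choice` is just an illustrative index.

```python
import numpy as np

n_actions = 8  # assumed to match env.action_space.n printed above

# Each row of the identity matrix is one one-hot encoded action.
possible_actions = np.identity(n_actions, dtype=int)

# Row k has a single 1 at index k, so argmax recovers the action index.
choice = 3
action = possible_actions[choice]
print(action)
print(int(np.argmax(action)))
```

Going from an index to a one-hot row and back like this is what lets us later pick an action with `np.argmax` on the Q values.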


## Defining the Preprocessing functions

Preprocessing is an important step because it reduces the complexity of our states, which in turn reduces the computation time needed for training.

Our steps:

1. Grayscale each frame (color doesn't add important information)
2. Crop the screen (in our case we remove the part below the player, because it doesn't add any important information)
3. Normalize the pixel values
4. Finally, resize the preprocessed frame
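To see what these steps do to the frame shape, here is a NumPy-only sketch. It is not the skimage pipeline used in this notebook: the luminance weights and the `resize_nearest` helper are stand-ins assumed for illustration, and the input is a random frame rather than a real game screen.

```python
import numpy as np

# Stand-in for one raw Atari frame: 210 rows x 160 columns x 3 color channels.
frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)

# 1. Grayscale with luminance weights, scaled so values land in [0, 1]
#    (this also covers step 3, the normalization).
gray = (frame @ np.array([0.2125, 0.7154, 0.0721])) / 255.0

# 2. Crop 12 rows from the bottom, 4 columns from the left and 12 from the right.
cropped = gray[0:-12, 4:-12]

# 4. Resize to 110x84 by nearest-neighbor index sampling (hypothetical helper).
def resize_nearest(img, out_h, out_w):
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows][:, cols]

small = resize_nearest(cropped, 110, 84)
print(gray.shape, cropped.shape, small.shape)
```

The crop turns the 210x160 frame into 198x144, and the resize brings it down to the 110x84 input the network expects.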

In [4]:
def preprocess_frame(frame) :
    
    # Grayscale the frame
    gray = rgb2gray(frame)
    
    # Crop the screen (remove the part below the player)
    # [Up: Down, Left: Right]
    cropped_frame = gray[0:-12,4:-12]
    
    # Normalize pixel values
    # rgb2gray already returns floats in [0, 1] for uint8 input, so no extra division by 255 is needed
    normalized_frame = cropped_frame
    
    # Resize
    preprocessed_frame = transform.resize(normalized_frame,[110,84])
    
    return preprocessed_frame # 110x84 frame

### stack_frames

Stacking frames is really important because it **gives our Neural Network a sense of motion.**

But we don't stack every frame: **we skip 4 frames at each time step**. This means that only every fourth frame is considered, and we use that frame to build the stacked state.

**The frame skipping method is already implemented in the library.**

* First we preprocess the frame
* Then we append the frame to the deque, which automatically **removes the oldest frame**
* Finally we build the stacked state

How the stack works:

* For the first frame of an episode, we need 4 frames, so we feed the same frame 4 times
* At each timestep, **we add the new frame to the deque and then stack the deque to form a new stacked state**
* And so on
* If we're done, **we create a new stack with 4 new frames (because we are in a new episode)**
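The stacking logic above can be sketched with constant-valued dummy frames, which makes it easy to see which frame sits where. `push` is a hypothetical helper written only for this illustration; the notebook's own function follows in the next cell.

```python
from collections import deque
import numpy as np

stack_size = 4
# A bounded deque: appending to a full deque drops the oldest frame.
frames = deque(maxlen=stack_size)

def push(frames, frame, is_new_episode):
    """Add a frame and return the stacked state of shape (110, 84, 4)."""
    if is_new_episode:
        # New episode: fill the stack with copies of the first frame.
        frames.clear()
        for _ in range(stack_size):
            frames.append(frame)
    else:
        frames.append(frame)
    # The last axis indexes the stacked frames, oldest first.
    return np.stack(frames, axis=2)

state = push(frames, np.full((110, 84), 1.0), True)
state2 = push(frames, np.full((110, 84), 2.0), False)
print(state.shape)
print(state2[0, 0])
```

After the second call, the pixel at `[0, 0]` holds three values from the first frame followed by one from the new frame, which is the "sense of motion" the network gets to see.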

In [5]:
stack_size = 4 # We stack 4 frames

# Initialize the deque with one zero image per stacked frame
stacked_frames = deque([np.zeros((110,84),dtype=int) for i in range(stack_size)],maxlen=stack_size)

def stack_frames(stacked_frames,state,is_new_episode) :
    
    # Preprocess frame
    frame = preprocess_frame(state)
    if is_new_episode :
        # Clear our stacked_frames
        stacked_frames = deque([np.zeros((110,84),dtype=int) for i in range(stack_size)],maxlen=stack_size)
        
        # Because we're in a new episode, copy the same frame 4x
        stacked_frames.append(frame)
        stacked_frames.append(frame)
        stacked_frames.append(frame)
        stacked_frames.append(frame)
        
        
        # Stack the frames
        stacked_state = np.stack(stacked_frames,axis=2)
        
    else :
        # Append the frame to deque, automatically removes the oldest frame
        stacked_frames.append(frame)
        
        # Build the stacked state (the last axis indexes the stacked frames)
        stacked_state = np.stack(stacked_frames,axis=2)
        
    return stacked_state, stacked_frames

### Setting up the hyper-parameters

In this part we'll set up our different hyperparameters. But when you implement a Neural Network yourself, **you will not define all the hyperparameters at once, but progressively**

* First, you define the neural network hyperparameters when you implement the model

* Then, you add the training hyperparameters when you implement the training algorithm

In [6]:
# Model Hyperparameters
state_size = [110,84,4]          # Our input is a stack of 4 frames hence 110x84x4 (height, width, channels)
action_size = env.action_space.n # 8 possible actions
learning_rate = 0.00025 # Alpha (aka learning rate)

# Training Hyperparameters
total_episodes = 5  # Total episodes for training
max_steps = 500    # Max Possible steps in an episode
batch_size = 64   # Batch Size


# Exploration parameters for epsilon greedy strategy
explore_start = 1.0        # exploration probability at start
explore_stop = 0.01        # minimum exploration probability
decay_rate = 0.00001      # exponential decay rate for exploration probability


# Q-Learning hyperparameters
gamma = 0.9          # Discounting Rate


# Memory Hyperparameters
pretrain_length = batch_size   # Number of experiences stored in the Memory when initialized for the first time

memory_size = 1000000         # Number of experiences the Memory can keep

# Preprocessing hyperparameters
stack_size = 4       # Number of frames stacked

# Modify if you want to see the trained agent
training = True

# Turn this to TRUE if you want to render the environment
episode_render = False

### Creating the Deep Q-Learning Neural Network Model

This is our Deep Q-Learning model:

* We take a stack of 4 frames as input
* It passes through 3 convnets
* Then it is flattened
* Finally it passes through 2 FC layers
* It outputs a Q value for each action
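To get a feel for how the three convnets shrink the input, we can work out the feature map sizes by hand. This sketch assumes the kernel sizes and strides used in the cell below (8x8 stride 4, 4x4 stride 2, 3x3 stride 2), no padding, and 64 filters in the last convnet; `conv_out` is a hypothetical helper.

```python
def conv_out(size, kernel, stride):
    """Output length of a VALID (no padding) convolution along one axis."""
    return (size - kernel) // stride + 1

h, w = 110, 84  # height and width of one preprocessed frame
for kernel, stride in [(8, 4), (4, 2), (3, 2)]:
    h, w = conv_out(h, kernel, stride), conv_out(w, kernel, stride)
    print(h, w)

# Number of units after flattening the last feature map (64 filters).
flat_units = h * w * 64
print(flat_units)
```

The feature maps go from 110x84 to 26x20, then 12x9, then 5x4, so the flatten layer feeds 1280 units into the first FC layer.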

In [7]:
class DQNetwork :
    def __init__(self,state_size,action_size,learning_rate,name='DQNetwork') :
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        
        with tf.variable_scope(name) :
            
            # We create the placeholders
            # *state_size unpacks the elements of state_size, so this is the same as writing
            # [None,110,84,4]
            self.inputs_ = tf.placeholder(tf.float32,[None,*state_size],name="inputs")
            self.actions_ = tf.placeholder(tf.float32,[None,self.action_size],name="actions_")
            
            # Remember that target_Q is R(s,a) + gamma*max Q_hat(s',a')
            self.target_Q = tf.placeholder(tf.float32,[None],name="target")
            
            
            '''
            First Convnet :
            CNN
            ELU
            '''
            self.conv1 = tf.layers.conv2d(inputs=self.inputs_,
                                          filters=32,
                                          kernel_size=[8,8],
                                          strides=[4,4],
                                          padding="VALID",
                                          kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),
                                          name="conv1")
            
            self.conv1_out = tf.nn.elu(self.conv1,name="conv1_out")
            
            
            '''
            Second convnet :
            CNN
            ELU
            '''
            
            self.conv2 = tf.layers.conv2d(inputs=self.conv1_out,
                                          filters=64,
                                          kernel_size=[4,4],
                                          strides=[2,2],
                                          padding="VALID",
                                          kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),
                                          name="conv2")
            
            self.conv2_out = tf.nn.elu(self.conv2,name="conv2_out")
            
            
            '''
            Third Convnet :
            CNN
            ELU
            '''
            
            self.conv3 = tf.layers.conv2d(inputs=self.conv2_out,
                                          filters=64,
                                          kernel_size=[3,3],
                                          strides=[2,2],
                                          padding="VALID",
                                          kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),
                                          name="conv3")
            
            self.conv3_out = tf.nn.elu(self.conv3,name="conv3_out")
            
            self.flatten = tf.contrib.layers.flatten(self.conv3_out)
            
            self.fc = tf.layers.dense(inputs=self.flatten,
                                      units=512,
                                      activation=tf.nn.elu,kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                      name="fc1")
            
            self.output = tf.layers.dense(inputs=self.fc,
                                          kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                          units=self.action_size,
                                          activation=None)
            
            
            
            # Q is our predicted Q value for the action taken in each sample
            # (axis=1 sums over the actions, leaving one value per sample)
            self.Q = tf.reduce_sum(tf.multiply(self.output,self.actions_),axis=1)
            
            
            # The loss is the mean squared difference between Q_target and our predicted Q values
            # mean((Q_target - Q_predicted)^2)
            self.loss = tf.reduce_mean(tf.square(self.target_Q - self.Q))
            
            self.optimizer = tf.train.AdamOptimizer(self.learning_rate).minimize(self.loss)
            
            

In [8]:
# Reset the graph
tf.reset_default_graph()

# Instantiate the DQNetwork
DQNetwork = DQNetwork(state_size,action_size,learning_rate)
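
The predicted Q value of a sample is meant to be the network output at the action that was actually taken. Here is a NumPy sketch of that masking and of the loss, with made-up numbers for a batch of 2 states and 3 actions (the real network has 8 actions).

```python
import numpy as np

# Hypothetical network output: one row per state, one column per action.
q_values = np.array([[1.0, 5.0, 2.0],
                     [4.0, 0.5, 3.0]])

# One-hot encoding of the action taken in each state.
actions = np.array([[0, 1, 0],
                    [0, 0, 1]])

# Multiplying by the one-hot mask zeroes every other action; summing over
# the action axis leaves one Q value per sample.
q_taken = np.sum(q_values * actions, axis=1)

# Mean squared error against made-up targets.
targets = np.array([4.0, 3.5])
loss = np.mean((targets - q_taken) ** 2)
print(q_taken)
print(loss)
```

Reducing over the action axis only matters here: without it the sum would collapse the whole batch into a single number and every sample would be compared against the same value.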

### Experience Replay

Now that we have created our Neural Network, **we need to implement the Experience Replay method**.

Here we'll create the Memory object, which wraps a deque. A deque (*double-ended queue*) created with a maximum length **removes the oldest element each time you add a new element once it is full**
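A small sketch of both behaviors the memory relies on: a bounded deque dropping its oldest entry, and uniform sampling without replacement. The string "experiences" are placeholders assumed for illustration.

```python
from collections import deque
import numpy as np

# A buffer that can hold at most 3 experiences.
buffer = deque(maxlen=3)
for experience in ["e0", "e1", "e2", "e3"]:
    buffer.append(experience)

# "e0" was dropped when "e3" arrived.
print(list(buffer))

# Draw 2 distinct positions uniformly at random, then look them up.
index = np.random.choice(np.arange(len(buffer)), size=2, replace=False)
batch = [buffer[i] for i in index]
print(batch)
```

Because sampling is done without replacement, the buffer must hold at least `batch_size` experiences before the first training step, which is why the memory is pre-populated.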

In [9]:
class Memory() :
    def __init__(self,max_size) :
        self.buffer = deque(maxlen=max_size)
    
    def add(self,experience) :
        self.buffer.append(experience)
        
    def sample(self,batch_size) :
        buffer_size = len(self.buffer)
        index = np.random.choice(np.arange(buffer_size),
                                 size=batch_size,
                                 replace=False)
        return [self.buffer[i] for i in index]

Here we'll **deal with the empty memory problem**: we pre-populate our memory by taking random actions and storing each experience (state, action, reward, next_state, done).

In [10]:
# Instantiate memory
memory = Memory(max_size=memory_size)
for i in range(pretrain_length) :
    
    # If it's the first step
    if i == 0 :
        state = env.reset()
        
        state, stacked_frames = stack_frames(stacked_frames,state,True)
        
    # Get the next_state, the rewards, done by taking a random action
    choice = random.randint(1,len(possible_actions)) - 1
    action = possible_actions[choice]
    next_state, reward, done, _ = env.step(action)
    
    
    # Stack the frames
    next_state, stacked_frames = stack_frames(stacked_frames,next_state,False)
    
    
    # If the episode is finished (we're dead 3x)
    if done :
        
        # We finished the episode
        next_state =np.zeros(state.shape)
        
        
        # Add experience to memory
        memory.add((state,action,reward,next_state,done))
        
        
        # Start a new episode
        state = env.reset()
        
        # Stack the frames
        state, stacked_frames = stack_frames(stacked_frames,state,True)
    
    else :
        
        # Add experience to memory
        memory.add((state,action,reward,next_state,done))
        
        
        # Our new state is now next_state
        state = next_state

### Set up Tensorboard

To launch tensorboard : ```tensorboard --logdir=/tensorboard/dqn/1```

In [11]:
# Setup TensorBoard Writer
writer = tf.summary.FileWriter("/tensorboard/dqn/1")

# Losses
tf.summary.scalar("Loss",DQNetwork.loss)

write_op = tf.summary.merge_all()

### Training our agent

Our algorithm:

* Initialize the weights
* Initialize the environment
* Initialize the decay step (used to reduce epsilon)

* **For** each episode up to max_episode **do**
    1. Make a new episode
    2. Set step to 0
    3. Observe the first state s0
    
    4. **While** step < max_steps **do**
        * Increase decay_step
        * With probability epsilon select a random action a(t), otherwise select a(t) = argmax Q(s(t),a)
        * Execute a(t) and observe the reward r(t+1) and the new state s(t+1)
        * Store transition <s(t), a(t), r(t+1), s(t+1)> in memory **D**
        * Sample a random mini-batch from **D**
        * Set Q_hat = r if the episode ends at s(t+1), otherwise set Q_hat = r + gamma*max(Q(s',a'))
        * Make a gradient descent step with loss (Q_hat - Q(s,a))^2
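
The target step of the algorithm can be written for a whole mini-batch at once. This is a NumPy sketch with made-up rewards and next-state Q values for 3 transitions and 2 actions; the training cell further down builds the same targets with a plain loop.

```python
import numpy as np

gamma = 0.9  # discounting rate, as in the hyperparameters

# Made-up mini-batch of 3 transitions.
rewards = np.array([1.0, 0.0, 5.0])
dones = np.array([False, False, True])

# Hypothetical Q values of the next states (one row per transition).
q_next = np.array([[0.5, 2.0],
                   [1.0, 0.0],
                   [3.0, 4.0]])

# Terminal transitions keep only the reward; the others add the discounted
# best Q value of the next state.
targets = np.where(dones, rewards, rewards + gamma * q_next.max(axis=1))
print(targets)
```

The third transition is terminal, so its target is just its reward of 5.0, no matter how large the next-state Q values are.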

In [12]:
'''
This function will do the part 
With epsilon select a random action a(t), otherwise select a(t) = argmax Q(s(t),a)
'''

def predict_action(explore_start,explore_stop,decay_rate,decay_step,state,actions) :
    
    # Epsilon Greedy Strategy
    # Choose action a, from state s using epsilon greedy
    # First we randomize a number
    exp_exp_tradeoff = np.random.rand()
    
    
    # Here we'll use an improved version of our epsilon greedy strategy used in Q-learning notebook 
    explore_probability = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*decay_step)
    
    
    if (explore_probability > exp_exp_tradeoff) :
        # Make a random action (exploration)
        choice = random.randint(1,len(possible_actions)) - 1
        action = possible_actions[choice]
        
    else :
        # Get action from Q-network
        # Estimate the Q values for this state
        Qs = sess.run(DQNetwork.output,feed_dict={DQNetwork.inputs_ : state.reshape((1,*state.shape))})
        
        # Take the biggest Q value (= best action)
        choice = np.argmax(Qs)
        action = possible_actions[choice]
        
    return action, explore_probability
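
To see how slowly epsilon decays with these settings, we can evaluate the schedule by itself. A minimal sketch using the exploration hyperparameters defined earlier; the function name and the step counts are chosen only for illustration.

```python
import numpy as np

explore_start, explore_stop, decay_rate = 1.0, 0.01, 0.00001

def explore_probability(decay_step):
    """Exponentially decay epsilon from explore_start towards explore_stop."""
    return explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * decay_step)

for step in [0, 2500, 100_000, 1_000_000]:
    print(step, round(float(explore_probability(step)), 4))
```

With `total_episodes = 5` and `max_steps = 500` the agent takes at most 2500 steps, where epsilon is still about 0.976, so almost every action in such a short run is random. Training for many more episodes is needed before the agent starts to exploit what it has learned.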

In [13]:
# Saver will help us to save the model
saver = tf.train.Saver()

if training == True :
    with tf.Session() as sess :
        
        # Initialize the variables
        sess.run(tf.global_variables_initializer())
        
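        # Initialize the decay step (used to reduce epsilon)
        decay_step = 0
        
        # Keep track of (episode, total_reward) pairs and of the latest training loss
        rewards_list = []
        loss = float('nan')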
        
        for episode in range(total_episodes) :
            
            # Set step to 0
            step = 0
            
            # Initialize the rewards of the episode
            episode_rewards = []
            
            # Make a new episode and observe the first state
            state = env.reset()
            
            # Remember that the stack_frames function also calls our preprocess function.
            state, stacked_frames = stack_frames(stacked_frames,state,True)
            
            while step < max_steps :
                
                step += 1
                
                # Increase decay_step
                decay_step += 1
                
                # Predict the action to take and take it
                action, explore_probability = predict_action(explore_start,explore_stop,decay_rate,decay_step,state,possible_actions)
                
                
                # Perform the action and get the next state, reward, and done information 
                next_state, reward, done, _ = env.step(action)
                
                if episode_render :
                    env.render()
                    
                # Add the reward to the total reward
                episode_rewards.append(reward)
                
                # If the game is finished
                if done :
                    
                    # The episode ends so there is no next state: use a blank raw frame,
                    # which stack_frames preprocesses like any other frame
                    next_state = np.zeros_like(next_state)
                    next_state, stacked_frames = stack_frames(stacked_frames,next_state,False)
                    
                    # Set step = max_steps to end the episode
                    step = max_steps
                    
                    # Get the total reward of the episode
                    total_reward = np.sum(episode_rewards)
                    
                    print('Episode: {}'.format(episode),'Total Reward : {}'.format(total_reward),'Explore P :{:.4f}'.format(explore_probability),'Training Loss {:.4f}'.format(loss))
                    
                    
                    
                    rewards_list.append((episode,total_reward))
                    
                    # Store transition <s(t),a(t),r(t+1),s(t+1)> in memory D
                    memory.add((state,action,reward,next_state,done))
                    
                else :
                    
                    # Stack the frame of the next_state
                    next_state, stacked_frames = stack_frames(stacked_frames,next_state,False)
                    
                    # Add experience to memory
                    memory.add((state,action,reward,next_state,done))
                    
                    # s(t+1) is now our current state
                    state = next_state
                    
                # Learning Part (runs at every step of the episode)
                # Obtain random mini batch from memory
                batch = memory.sample(batch_size)
                states_mb = np.array([each[0] for each in batch],ndmin=3)
                actions_mb = np.array([each[1] for each in batch])
                rewards_mb = np.array([each[2] for each in batch])
                next_states_mb = np.array([each[3] for each in batch],ndmin=3)
                dones_mb = np.array([each[4] for each in batch])
                
                target_Qs_batch = []
                
                
                # Get Q values for next_state
                Qs_next_state = sess.run(DQNetwork.output,feed_dict={DQNetwork.inputs_ : next_states_mb})
                
                
                # Set Q_target = r if the episode ends at s(t+1), otherwise set Q_target = r + gamma*max(Q(s',a'))
                
                for i in range(0,len(batch)) :
                    terminal = dones_mb[i]
                    
                    # If we are in a terminal state, the target is just the reward
                    if terminal :
                        target_Qs_batch.append(rewards_mb[i])
                        
                    else :
                        target = rewards_mb[i] + gamma*np.max(Qs_next_state[i])
                        target_Qs_batch.append(target)
                        
                # Build the targets once the loop has covered the whole batch
                targets_mb = np.array([each for each in target_Qs_batch])
                
                loss, _ = sess.run([DQNetwork.loss,DQNetwork.optimizer],feed_dict={DQNetwork.inputs_:states_mb,
                                                                                   DQNetwork.target_Q:targets_mb,
                                                                                   DQNetwork.actions_:actions_mb})
                
                # Write TF Summaries
                summary = sess.run(write_op,feed_dict={DQNetwork.inputs_:states_mb,
                                                       DQNetwork.target_Q : targets_mb,
                                                       DQNetwork.actions_:actions_mb})
                writer.add_summary(summary,episode)
                writer.flush()
                
            # Save model every 5 episodes
            if episode % 5 == 0 :
                save_path = saver.save(sess,"./models/model.ckpt")
                print("Model Saved")

Model Saved


### Testing and watching our agent play

Now that we have trained our agent, we can test it.

In [None]:
with tf.Session() as sess :
    total_test_rewards = []
    
    # load saved model
    saver.restore(sess,"./models/model.ckpt")
    for episode in range(1) :
        total_rewards = 0
        
        state = env.reset()
        state, stacked_frames = stack_frames(stacked_frames,state,True)
        
        print('^^^^^^^^^^^^^^^^^^^^^')
        print('EPISODE',episode)
        
        while True :
            
            # Reshape the state
            state = state.reshape((1,*state_size))
            
            # Get action from Q-network
            # Estimate the Q values for this state
            Qs = sess.run(DQNetwork.output,feed_dict={DQNetwork.inputs_:state})
            
            
            # Take the biggest Q value (= the best action)
            choice = np.argmax(Qs)
            action = possible_actions[choice]
            
            # Perform the action and get the next_state, reward, and done information
            next_state, reward, done, _ = env.step(action)
            env.render()
            
            total_rewards += reward
            
            if done :
                print('Score',total_rewards)
                total_test_rewards.append(total_rewards)
                break
                
            next_state, stacked_frames = stack_frames(stacked_frames,next_state,False)
            state = next_state
        
    env.close()

INFO:tensorflow:Restoring parameters from ./models/model.ckpt
^^^^^^^^^^^^^^^^^^^^^
EPISODE 0
