## Import Libraries

In [1]:
import numpy as np
import cv2
import random

## Data Preparation

In [2]:
def rgb_to_gray(im):
    """Convert an RGB image to grayscale using the luma formula
    Y = 0.2126 R + 0.7152 G + 0.0722 B,
    where R, G and B are the red, green and blue channels."""
    return np.dot(im, [0.2126, 0.7152, 0.0722])
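
As a quick sanity check of the formula, the sketch below applies the same weights to a tiny hand-made image. The helper name `luma_gray` and the sample pixels are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# the same luma coefficients as in the formula above
WEIGHTS = np.array([0.2126, 0.7152, 0.0722])

def luma_gray(im):
    """Hypothetical standalone helper: weighted sum over the last (RGB) axis."""
    return np.dot(im, WEIGHTS)

# a 1 x 3 RGB image: pure white, pure green, pure black
sample = np.array([[[255, 255, 255], [0, 255, 0], [0, 0, 0]]], dtype=np.float64)
gray = luma_gray(sample)
print(gray.shape)  # the channel axis is gone: (1, 3)
print(gray)
```

Because the three weights sum to 1, a white pixel keeps its full intensity, while a pure green pixel keeps about 72% of it.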

1. Resize the input image so that its width equals the resized width, 84, given by the `resized_shape` parameter, then crop or scale it to the target height.

In [3]:
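def cv2_resize_image(image, resized_shape=(84, 84), method='crop', crop_offset=8):
    """Return a resized uint8 image (84 x 84 by default), given a
    grayscale input image. method='crop' scales to the target width and
    then crops the height; method='scale' rescales both dimensions."""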
    height, width = image.shape    
    resized_height, resized_width = resized_shape
    if method == 'crop':
        h = int(round(float(height) * resized_width / width))
        resized = cv2.resize(image, 
                             (resized_width, h),
                             interpolation=cv2.INTER_LINEAR)
        crop_y_cutoff = h - crop_offset - resized_height
        cropped = resized[crop_y_cutoff:crop_y_cutoff+resized_height, :]
        return np.asarray(cropped, dtype=np.uint8)    
    elif method == 'scale':        
        return np.asarray(cv2.resize(image,
                                     (resized_width, resized_height),
                                     interpolation=cv2.INTER_LINEAR),
                                     dtype=np.uint8)
    else:
        raise ValueError('Unrecognized image resize method.')
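
To make the crop arithmetic concrete, here is a small sketch that reproduces only the index calculation of the `'crop'` branch, without calling OpenCV. The helper `crop_window` is hypothetical, and the 210 x 160 frame size is an assumption (the text above does not state the raw frame size).

```python
def crop_window(height, width, resized_shape=(84, 84), crop_offset=8):
    """Return the intermediate height and the row range kept by method='crop'."""
    resized_height, resized_width = resized_shape
    # height after scaling the frame so that its width equals resized_width
    h = int(round(float(height) * resized_width / width))
    # first row kept; crop_offset rows are dropped from the bottom
    top = h - crop_offset - resized_height
    return h, (top, top + resized_height)

# 210 x 160 is assumed here as the raw frame size
h, rows = crop_window(210, 160)
print(h, rows)  # 110 (18, 102)
```

Under that assumption the frame is first scaled to 110 x 84, and rows 18 to 101 are kept, which drops 18 rows from the top and 8 from the bottom.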

## Deep Q-Learning Theory
* The State Space defines all possible situations the agent can be in. In Atari, a state is a screen image or a set of screen images collected over a time interval.

* The Reward Function defines the goal to be achieved, mapping states and actions to a value that indicates how desirable it is to take that action in that state. In Atari, the reward is the score received after taking actions.

* The Policy Function is the behavior of the agent: it maps states to the actions to take in those states.

* The Value Function indicates which states or state-action pairs are good in the long term: it is the discounted amount of reward an agent can expect to collect over time.

* An episode consists of one complete pass from a start state, through intermediate states, until a terminal state is reached (the goal is reached or time is up).

* Exploration is trying something new to learn more about the environment, and exploitation is making the best decision based on all the information you have. Epsilon-greedy is the simplest way to make this trade-off.
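
The epsilon-greedy trade-off from the last bullet can be sketched as follows. This is a minimal illustration, not the notebook's own implementation; the function name, the Q-values and the seed are assumptions chosen for the example.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))  # exploit

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.5, 0.2])  # illustrative Q-values for 3 actions
picks = [epsilon_greedy(q_values, 0.2, rng) for _ in range(1000)]
greedy_share = picks.count(1) / len(picks)
print(greedy_share)  # roughly 1 - epsilon + epsilon / 3
```

With `epsilon = 0` the agent always exploits, and with `epsilon = 1` it acts uniformly at random; values in between mix the two.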

In [6]:
def Manual_Q_learning():
    """This simplest Q-learning algorithm can only handle
    discrete states and actions. With continuous states it
    fails, because convergence is not guaranteed when there
    are infinitely many states."""
    alpha = 1.0
    gamma = 0.8
    epsilon = 0.2
    num_episodes = 100
    R = np.array([
        [-1, 0,-1, -1, -1, -1],
        [0, -1, 0, -1,  0, -1],
        [-1, 0,-1,-50, -1, -1],
        [-1,-1, 0, -1, -1, -1],
        [-1, 0,-1, -1, -1,100],
        [-1,-1,-1, -1, -1, -1]])
    # initialise Q
    Q = np.zeros((6,6))
    #run each episode
    for _ in range(num_episodes):
        #randomly choose an initial (non-terminal) state
        s = np.random.choice(5)
        while s != 5:
            #get all possible actions
            actions = [a for a in range(6) if R[s][a] != -1]
            #epsilon-greedy
            if np.random.binomial(1, epsilon) == 1:
                a = random.choice(actions)
            else:
                a = actions[np.argmax(Q[s][actions])]
            next_state = a
            #update Q(s,a)
            Q[s][a] += alpha * (R[s][a] + gamma * np.max(Q[next_state])- Q[s][a])
            #go to next state
            s = next_state
    return Q

In [9]:
print("Q-Table:")
Manual_Q_learning()

Q-Table:


array([[  0.  ,  64.  ,   0.  ,   0.  ,   0.  ,   0.  ],
       [ 51.2 ,   0.  ,  51.2 ,   0.  ,  80.  ,   0.  ],
       [  0.  ,  64.  ,   0.  ,  -9.04,   0.  ,   0.  ],
       [  0.  ,   0.  ,  51.2 ,   0.  ,   0.  ,   0.  ],
       [  0.  ,  64.  ,   0.  ,   0.  ,   0.  , 100.  ],
       [  0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ]])
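
One way to read this table is to follow the greedy action from each state. The sketch below copies the Q-table from the output above and traces the greedy path to the goal state 5; the helper `greedy_path` is hypothetical and relies on the fact that, in this toy environment, the chosen action is also the next state.

```python
import numpy as np

# Q-table copied from the output above
Q = np.array([
    [0.0, 64.0, 0.0, 0.0, 0.0, 0.0],
    [51.2, 0.0, 51.2, 0.0, 80.0, 0.0],
    [0.0, 64.0, 0.0, -9.04, 0.0, 0.0],
    [0.0, 0.0, 51.2, 0.0, 0.0, 0.0],
    [0.0, 64.0, 0.0, 0.0, 0.0, 100.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])

def greedy_path(Q, start, goal=5, max_steps=10):
    """Follow argmax_a Q[s][a] from start until the goal state is reached."""
    path = [start]
    while path[-1] != goal and len(path) <= max_steps:
        path.append(int(np.argmax(Q[path[-1]])))  # action index == next state
    return path

print(greedy_path(Q, 3))  # [3, 2, 1, 4, 5]
```

The values also show the discounting at work: 100 for the final step, then 80, 64 and 51.2 as each extra step multiplies by gamma = 0.8.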

This simplest Q-learning algorithm can only handle discrete states and actions. With continuous states it fails, because convergence is not guaranteed when there are infinitely many states. The solution is to approximate the Q-function with a neural network.

The neural network takes the state as input and outputs one Q-value for each possible action. This makes it possible to compute the Q-values of all actions for a given state in a single forward pass, instead of paying for one pass per action. For some tasks, such as playing Breakout on Atari, the direction and velocity of the paddle and ball are important. This information is not available in a single 84 x 84 pixel image, so 4 consecutive images are stacked and used together as the state.
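
A minimal sketch of how four frames could be combined into one state is shown below. The `FrameStack` class is a hypothetical helper, and repeating the first frame at the start of an episode is an assumption about how the initial state is filled.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Hypothetical helper that keeps the most recent n preprocessed frames."""
    def __init__(self, n_frames=4):
        self.frames = deque(maxlen=n_frames)

    def reset(self, frame):
        # at episode start, repeat the first frame to fill the stack (assumption)
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return self.state()

    def push(self, frame):
        self.frames.append(frame)  # the oldest frame drops out automatically
        return self.state()

    def state(self):
        # stack along a new last axis: four (84, 84) frames -> (84, 84, 4)
        return np.stack(list(self.frames), axis=-1)

stack = FrameStack()
state = stack.reset(np.zeros((84, 84), dtype=np.uint8))
state = stack.push(np.ones((84, 84), dtype=np.uint8))
print(state.shape)  # (84, 84, 4)
```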

Each state is now an 84 x 84 x 4 array, i.e. 28,224 input values. Convolutional neural networks are useful here. The Q-network can contain one input layer, three convolutional layers and one fully connected layer.
The first convolutional layer has 64 8x8 filters with a stride of 4, followed by ReLU. The second has 64 4x4 filters with a stride of 2, followed by ReLU.
The third has 64 3x3 filters with a stride of 2, followed by ReLU. The first layer uses a large filter size so that it can pick up small objects in the game, such as the ball or paddle. The remaining layers are sufficient to extract useful features.
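
To see how these filter sizes and strides shrink the image, the sketch below computes the spatial size after each convolution. It assumes no padding ("valid" convolutions), which the text does not specify, so the numbers change if padding is used.

```python
def conv_output_size(size, kernel, stride):
    """Spatial output size of a convolution without padding."""
    return (size - kernel) // stride + 1

size = 84
sizes = []
# (kernel, stride) for the three convolutional layers described above
for kernel, stride in [(8, 4), (4, 2), (3, 2)]:
    size = conv_output_size(size, kernel, stride)
    sizes.append(size)

flattened = sizes[-1] * sizes[-1] * 64  # 64 filters in the last layer
print(sizes, flattened)  # [20, 9, 4] 1024
```

Under this assumption the fully connected layer would receive 1,024 features per state.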

Algorithms that use the Bellman equation as an iterative update are called value iteration algorithms.

$$Q(s,a) = R(s,a) + \gamma \max_{a'} Q(s',a') \qquad (1)$$

The formula above is only suitable for deterministic environments, where the next state $s'$ is fixed given the current state $s$ and action $a$.

In non-deterministic environments the Bellman equation needs to be modified to take an expectation over the next state $s'$:

$$Q(s,a) = \mathbb{E}_{s'}\left[R(s,a) + \gamma \max_{a'} Q(s',a') \,\middle|\, s,a\right] \qquad (2)$$

Taking the right-hand side of equation (2) as the value iteration target $y_i$, the Q-network can be trained by minimising the loss function at the $i$-th iteration:

$$L_i(\theta_i) = \mathbb{E}_{s,a \sim P(s,a)}\left[\left(y_i - Q(s,a;\theta_i)\right)^2\right] \qquad (3)$$

Here $Q(s,a;\theta_i)$ is the Q-network, whose parameters $\theta_i$ are updated towards the target defined by equation (2).
$P(s,a)$ is a probability distribution over sequences and actions. The parameters from the previous iteration, $\theta_{i-1}$, are used to compute the target $y_i$ and are held fixed when optimising the loss function $L_i(\theta_i)$ over $\theta_i$. We don't optimise the full expectation directly but instead minimise the empirical loss using stochastic gradient descent.
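
The empirical version of this loss can be sketched in NumPy for a small batch. This is an illustration under assumptions: the function names and the numbers are made up, and terminal transitions are given a target equal to the reward alone.

```python
import numpy as np

def td_targets(rewards, q_next, dones, gamma):
    """y = r for terminal transitions, else r + gamma * max_a' Q(s', a')."""
    return rewards + gamma * (1.0 - dones) * q_next.max(axis=1)

def empirical_loss(q_pred, actions, targets):
    """Mean squared error between targets and Q(s, a) for the taken actions."""
    chosen = q_pred[np.arange(len(actions)), actions]
    return np.mean((targets - chosen) ** 2)

# a batch of two transitions with two actions each (illustrative values)
rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])            # the second transition is terminal
q_next = np.array([[1.0, 2.0], [3.0, 0.0]])   # Q(s', .) from the fixed parameters
q_pred = np.array([[1.5, 0.0], [0.0, 1.0]])   # Q(s, .) from the current parameters
actions = np.array([0, 1])

y = td_targets(rewards, q_next, dones, gamma=0.5)
loss = empirical_loss(q_pred, actions, y)
print(y, loss)  # [2. 0.] 0.625
```

In a real implementation the gradient of this loss is taken only with respect to the parameters that produced `q_pred`, while the values in `q_next` are treated as constants.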

This algorithm doesn't need to know the internal workings of the Atari game; it only needs samples from the game. This is called model-free learning: the simulator is treated as a black box.

It is off-policy because it learns about the greedy policy $\arg\max_a Q(s, a; \theta)$ while following a behaviour distribution $P(s,a)$ that balances exploration with exploitation. $P(s,a)$ can be epsilon-greedy.

The brain of our AI player is the Q-network controller. At each time step $t$, she observes the screen image $s_t$ (recall that $s_t$ is an $84 \times 84 \times 4$ image that stacks the last four frames). Then, her brain analyzes this observation and comes up with an action, $a_t$. The Atari emulator receives this action and returns the next screen image, $s_{t+1}$, and the reward, $r_t$. The quadruplet $(s_t, a_t, r_t, s_{t+1})$ is stored in the memory and is taken as a sample for training the Q-network by minimizing the empirical loss function via stochastic gradient descent.

If we take consecutive samples directly from memory, the data is strongly correlated, which breaks the assumption that the samples for the empirical loss function are independent. This makes training unstable and leads to poor performance. With experience replay, we instead sample transitions randomly from memory for training.
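
A replay memory of this kind can be sketched with a bounded deque. The class name `ReplayMemory`, the capacity and the transition layout are assumptions made for illustration only.

```python
import random
from collections import deque

class ReplayMemory:
    """Hypothetical fixed-capacity buffer of (s, a, r, s_next, done) tuples."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # uniform sampling without replacement breaks temporal correlation
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=100)
for t in range(250):
    # integers stand in for real states here
    memory.add((t, 0, 0.0, t + 1, False))
batch = memory.sample(32)
print(len(memory), len(batch))  # 100 32
```

Only the 100 most recent transitions survive, and each sampled batch mixes transitions from different points in time.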

Algorithm for Q-Learning
```
Initialize replay memory D to capacity N;
Initialize the Q-network Q with random weights theta;
Repeat for each episode:
    Set time step t = 1;
    Receive an initial screen image x_1 and do preprocessing s_1 = phi(x_1);
    While the terminal state hasn't been reached:
        Select an action a_t via epsilon-greedy, i.e., select a random action with probability epsilon, otherwise select a_t = argmax_a Q(s_t, a; theta);
        Execute action a_t in the emulator and observe reward r_t and image x_{t+1};
        Set s_{t+1} = phi(x_{t+1}) and store transition (s_t, a_t, r_t, s_{t+1}) into replay memory D;
        Randomly sample a batch of transitions (s_j, a_j, r_j, s_{j+1}) from D;
        Set y_j = r_j if s_{j+1} is a terminal state or y_j = r_j + gamma * max_a' Q(s_{j+1}, a'; theta) if s_{j+1} is a non-terminal state;
        Perform a gradient descent step on (y_j - Q(s_j, a_j; theta))^2;
    End while
```

This algorithm works for some Atari games, such as Breakout, Seaquest, Pong and Qbert, but cannot reach human-level control. This is thought to be because the target is computed with the current estimate of the action-value function: an update that increases Q(s_t, a_t) often also increases Q(s_{t+1}, a), which in turn increases the target. This can lead to oscillation or divergence of the policy.
To solve this problem, a separate network is used to generate the targets in the Q-learning update. More precisely, after every M Q-learning updates, the network Q is cloned to obtain a target network, written Q_target here, which is used to generate the targets for the following M updates to Q. The new algorithm is below.

Q-learning algorithm:
```
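Initialize replay memory D to capacity N;
Initialize the Q-network Q with random weights theta;
Initialize the target network Q_target with weights theta_target = theta;
Repeat for each episode:
    Set time step t = 1;
    Receive an initial screen image x_1 and do preprocessing s_1 = phi(x_1);
    While the terminal state hasn't been reached:
        Select an action a_t via epsilon-greedy, i.e., select a random action with probability epsilon, otherwise select a_t = argmax_a Q(s_t, a; theta);
        Execute action a_t in the emulator and observe reward r_t and image x_{t+1};
        Set s_{t+1} = phi(x_{t+1}) and store transition (s_t, a_t, r_t, s_{t+1}) into replay memory D;
        Randomly sample a batch of transitions (s_j, a_j, r_j, s_{j+1}) from D;
        Set y_j = r_j if s_{j+1} is a terminal state or y_j = r_j + gamma * max_a' Q_target(s_{j+1}, a'; theta_target) if s_{j+1} is a non-terminal state;
        Perform a gradient descent step on (y_j - Q(s_j, a_j; theta))^2;
        Set theta_target = theta for every M steps;
    End while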
```
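
The periodic cloning step can be sketched in isolation. In the illustration below the "networks" are plain dictionaries of NumPy arrays and each gradient step is replaced by a dummy increment; `sync_target` and the value of `M` are assumptions made for the example.

```python
import numpy as np

def sync_target(online, target):
    """Copy every online parameter array into the target network."""
    for name, value in online.items():
        target[name] = value.copy()

M = 3  # sync interval; the value is purely illustrative
online = {"w": np.zeros(2)}
target = {}
sync_target(online, target)  # the target starts as a clone of the Q-network

history = []
for step in range(1, 8):
    online["w"] += 1.0  # stand-in for one gradient descent update
    if step % M == 0:
        sync_target(online, target)
    history.append(float(target["w"][0]))

print(history)  # [0.0, 0.0, 3.0, 3.0, 3.0, 6.0, 6.0]
```

Between syncs the target parameters stay frozen, so the targets do not move while the Q-network is being updated.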

This update allows the algorithm to solve Star Gunner, Atlantis, Assault, Space Invaders and 45 other games.

However, it has three drawbacks: 1) it is slow to converge (7 days on one GPU to reach human-level performance), 2) it fails with sparse rewards (for example, Montezuma's Revenge requires long-term planning), and 3) it requires a large amount of data. Double Q-learning, prioritized experience replay, bootstrapped DQN, and dueling network architectures have all been proposed to address these issues. In the next section we implement only DQN, not its variants.

## Deep Q-Learning Implementation

```
import tensorflow as tf  # TensorFlow 1.x graph API (tf.placeholder, tf.variable_scope)


class DQN:
    def __init__(self, 
                 input_shape=(84, 84, 4), 
                 n_outputs=4,
                 network_type='cnn',
                 scope='q_network'):
        self.width = input_shape[0]
        self.height = input_shape[1]
        self.channel = input_shape[2]
        self.n_outputs = n_outputs
        self.network_type = network_type
        self.scope = scope

        # frame images
        self.x = tf.placeholder(dtype=tf.float32,
                                shape=(None, 
                                       self.channel, 
                                       self.width, 
                                       self.height))
        # estimates of Q-value
        self.y = tf.placeholder(dtype=tf.float32, shape=(None, ))

        #selected actions
        self.a = tf.placeholder(dtype=tf.int32, shape=(None,))

        with tf.variable_scope(scope):
            self.build()
            self.build_loss()
```