# Flappy Bird Value Approximator: Deep Reinforcement Learning

## Value Function Approximation
- Function Approximators: Linear Combinations and Neural Networks
- Incremental Methods: Stochastic Gradient Descent Prediction and Control
- Batch Methods: Least Squares Prediction and Control, Experience Replay

## Types of Function Approximators
- There are many function approximators, all of them supervised ML algorithms: linear combinations of features, neural networks, decision trees, etc.
- RL benefits most from differentiable function approximators such as linear models and neural networks, because their weights can be updated by following a gradient.
- Incremental methods update the weights on each sample, while batch methods do one update per epoch (batch).
- Stochastic Gradient Descent (SGD) is an incremental, iterative optimization algorithm that finds the parameter values (weights) of a function that minimize a cost function.
- The least squares method is a form of mathematical regression analysis that finds the line of best fit for a dataset, which also gives a visual demonstration of the relationship between the data points.
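
The difference between an incremental and a batch method can be sketched on a linear value approximator. This is a minimal illustration, not part of the Flappy Bird implementation: the feature matrix `X`, the targets `y`, the weights `true_w` and the step size `alpha` are all made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 states described by 3 features, with noisy value targets.
true_w = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(200, 3))
y = X @ true_w + 0.01 * rng.normal(size=200)

# Incremental method: one SGD step per sample on the squared error.
w_sgd = np.zeros(3)
alpha = 0.05  # step size (assumed)
for epoch in range(20):
    for x_i, y_i in zip(X, y):
        error = y_i - x_i @ w_sgd
        w_sgd += alpha * error * x_i

# Batch method: closed-form least-squares fit over the whole dataset.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print("SGD weights:          ", w_sgd)
print("Least-squares weights:", w_ls)
```

Both approaches should land close to the same weights here; SGD gets there through many small updates, while least squares solves for them in one step.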

## Neural Network Approximating the Q Function

![title](images/va_nonlinear.png)
![title](images/va_nonlinear_z.png)


## Q-Learning with Non-Linear Approximation

- Step 1: Start with initial parameter values
- Step 2: Take action a according to an explore-or-exploit policy, transitioning from s to s'
- Step 3: Perform a TD update for each parameter
     \begin{equation}
\large
\theta_i \leftarrow \theta_i + \alpha \left[ R(s) + \gamma \max_{a'} \hat{Q}_\theta(s', a') - \hat{Q}_\theta(s, a) \right] \frac{\partial \hat{Q}_\theta(s,a)}{\partial \theta_i}
\end{equation}
- Step 4: Go to Step 2

Here α is the learning rate and γ is the discount factor. With a non-linear approximator the loss surface typically has many local minima, so convergence is no longer guaranteed, although the method often works well in practice.
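
The TD update in Step 3 can be sketched with a tiny non-linear approximator whose gradient is written out by hand. Everything here is an illustrative assumption: the approximator Q(s, a) = v · tanh(W x), the feature vectors `x_sa` and `x_next_all`, and the helper names `q_value` and `td_update` are not from the Flappy Bird code.

```python
import numpy as np

def q_value(params, x):
    """Q(s, a) = v . tanh(W x) for a feature vector x describing (s, a)."""
    W, v = params
    h = np.tanh(W @ x)
    return float(v @ h), h

def td_update(params, x_sa, reward, x_next_all, alpha=0.01, gamma=0.99, terminal=False):
    """One TD update of every parameter, following the Step 3 equation."""
    W, v = params
    q_sa, h = q_value(params, x_sa)
    max_next = 0.0 if terminal else max(q_value(params, x)[0] for x in x_next_all)
    delta = reward + gamma * max_next - q_sa        # TD error
    grad_v = h                                      # dQ/dv
    grad_W = np.outer(v * (1.0 - h ** 2), x_sa)     # dQ/dW via the chain rule
    return (W + alpha * delta * grad_W, v + alpha * delta * grad_v), delta

rng = np.random.default_rng(0)
params = (rng.normal(scale=0.1, size=(4, 3)), rng.normal(scale=0.1, size=4))
x_sa = np.array([1.0, 0.5, -0.5])                   # features of (s, a), made up
x_next_all = [np.array([0.0, 1.0, 0.0]),            # features of (s', a') for each a'
              np.array([1.0, 0.0, 1.0])]

q_before = q_value(params, x_sa)[0]
new_params, delta = td_update(params, x_sa, reward=1.0, x_next_all=x_next_all)
q_after = q_value(new_params, x_sa)[0]
print(delta, q_before, q_after)
```

With a positive TD error, the update nudges Q(s, a) upward toward the target; a deep network does the same thing, with backpropagation supplying the partial derivatives.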


# Neural Network Concepts

- The perceptron, the first-generation neural network, is a simple mathematical model (a function) that mimics a neuron, the basic unit of the brain.
- The sigmoid neuron improved learning by weighting its inputs and passing the result through a smooth sigmoid function instead of a hard threshold.
- A neural network is a directed graph organized in layers, where each layer consists of a number of interconnected neurons (nodes).
- A typical neural network contains three kinds of layers: input, hidden, and output. If there is more than one hidden layer, it is called a deep neural network.
- The actual processing happens in the hidden layers, where each neuron applies an activation function to the weighted input it receives from the previous layer.
- The performance of a neural network is measured with a cost (error) function, which depends on the weights.
- Forward-propagation and back-propagation are the two techniques a neural network uses repeatedly until the weights are calibrated well enough to predict accurate output.
- During forward-propagation, information moves forward through all the layers, with weights applied to the inputs at each layer. Back-propagation computes the gradient of the error with respect to the weights, and gradient descent uses that gradient to reduce the error at each iteration step.
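
A forward pass followed by back-propagation and a gradient descent step can be written out in a few lines of NumPy. This is a toy sketch under assumed settings (XOR data, one hidden layer of 4 sigmoid neurons, learning rate 0.5), not the network used for Flappy Bird.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # toy inputs (XOR)
y = np.array([[0], [1], [1], [0]], dtype=float)              # toy targets

W1 = rng.normal(scale=0.5, size=(2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(scale=0.5, size=(4, 1))
b2 = np.zeros((1, 1))
lr = 0.5  # learning rate (assumed)

losses = []
for step in range(2000):
    # Forward-propagation: input -> hidden -> output
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((out - y) ** 2)))

    # Back-propagation: gradient of the mean squared error w.r.t. each weight
    d_out = 2.0 * (out - y) / len(X) * out * (1.0 - out)
    d_W2 = h.T @ d_out
    d_b2 = d_out.sum(axis=0, keepdims=True)
    d_h = d_out @ W2.T * h * (1.0 - h)
    d_W1 = X.T @ d_h
    d_b1 = d_h.sum(axis=0, keepdims=True)

    # Gradient descent step
    W1 -= lr * d_W1
    b1 -= lr * d_b1
    W2 -= lr * d_W2
    b2 -= lr * d_b2

print("initial loss:", losses[0], "final loss:", losses[-1])
```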

![title](images/nn.png)


# Deep Neural Networks

- Deep learning uses neural networks with multiple hidden layers and can work with supervised or unsupervised datasets.
- Deep learning vectorizes the input and maps it to an output vector space by decomposing a complex transformation into a series of simple transformations. Each transformation passes through the neuron activation functions of one layer, parameterized by that layer's weights.
- A Convolutional Neural Network (CNN) consists of (1) convolutional layers, which identify features using weights and biases and are usually combined with nonlinear activations and sub-sampling (max-pooling) to improve performance and control overfitting, followed by (2) fully connected layers, where each neuron is connected to all the neurons of the previous layer. Examples include image and voice recognition.
- A Recurrent Neural Network (RNN) is another type of deep learning model. It applies the same shared weights at every step when processing sequential data, such as sensor readings or the spoken words processed in NLP, and can handle input and output sequences of arbitrary size. Long Short-Term Memory (LSTM) is an advanced RNN that learns and remembers longer sequences by chaining a series of repeated neural network modules.

![title](images/deep_nn.png)
![title](images/cnn.png)
![title](images/rnn.png)

# Weight Sharing and Experience Replay

- **Weight Sharing**: A Convolutional Neural Network shares weights between local regions, while a Recurrent Neural Network shares weights between time-steps.


- **Experience Replay**: Store experience (S, A, R, Snext) in a replay buffer and sample mini-batches from it to train the network. This de-correlates the data, leads to better data efficiency, and gives better convergence behavior when training a function approximator. In the beginning, the replay buffer is filled with random experience.
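
A replay buffer can be sketched with a bounded `deque`. The class name `ReplayBuffer` and its methods are hypothetical helpers for illustration; the notebook's training loop uses a plain `deque` directly.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once it is full.
        self.memory = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between
        # consecutive transitions.
        batch = random.sample(list(self.memory), batch_size)
        # Transpose a list of transitions into per-field tuples.
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=5)
for i in range(8):  # made-up transitions; the state is just an integer here
    buffer.add(state=i, action=i % 2, reward=0.1, next_state=i + 1, done=False)

states, actions, rewards, next_states, dones = buffer.sample(3)
print(len(buffer), states)
```

Only the five most recent transitions survive, so every sampled state comes from the range 3 to 7.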



# Deep Q-Learning Network (DQN)

- Step 1: Take action a_t according to an ε-greedy policy
- Step 2: Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D
- Step 3: Sample a random mini-batch of transitions (s, a, r, s') from D
- Step 4: Compute Q-learning targets w.r.t. old, fixed parameters w⁻
- Step 5: Optimize the MSE (mean squared error) between the Q-network and the Q-learning targets
    \begin{equation}
\Large
\mathcal{L}_i(w_i) = \mathbb{E}_{s,a,r,s' \sim D_i}\left[\left(r + \gamma \max_{a'} Q(s', a'; w_i^-) - Q(s, a; w_i)\right)^2\right]
\end{equation}
- Step 6: Minimize this loss using a variant of stochastic gradient descent
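
Steps 4 and 5 can be illustrated on a hand-made mini-batch. The helper names `dqn_targets` and `mse_loss` and all the numbers below are assumptions for illustration; `q_next_target` stands for the output of the network with the old, fixed parameters.

```python
import numpy as np

def dqn_targets(rewards, q_next_target, dones, gamma=0.99):
    """Q-learning targets r + gamma * max_a' Q(s', a'; w^-), with no bootstrap on terminal states."""
    rewards = np.asarray(rewards, dtype=float)
    not_done = 1.0 - np.asarray(dones, dtype=float)
    return rewards + gamma * np.max(q_next_target, axis=1) * not_done

def mse_loss(q_pred, actions, targets):
    """Mean squared error between Q(s, a; w) for the taken actions and the targets."""
    chosen = q_pred[np.arange(len(actions)), actions]
    return float(np.mean((targets - chosen) ** 2))

# Made-up mini-batch of three transitions with two actions each.
rewards = [1.0, 0.1, -1.0]
dones = [False, False, True]
actions = np.array([1, 0, 0])
q_next_target = np.array([[0.5, 2.0], [1.0, 0.0], [3.0, 4.0]])  # Q(s', .; w^-)
q_pred = np.array([[0.0, 2.0], [1.0, 0.5], [0.0, 0.0]])         # Q(s, .; w)

targets = dqn_targets(rewards, q_next_target, dones)
loss = mse_loss(q_pred, actions, targets)
print(targets)  # [ 2.98  1.09 -1.  ]
print(loss)
```

The third transition is terminal, so its target is just the reward.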


## Some key aspects of the implementation:

Libraries used: Keras with TensorFlow (**GPU version**). The model was trained for several hours in an Azure Windows environment.

To scale the implementation, we pre-process the frames by converting the color images to grayscale and resizing them to 80x80 pixels. We then stack 4 frames together so that the Flappy Bird velocity can be inferred properly.
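
The pre-processing and frame stacking can be sketched with NumPy alone. This is an approximation under stated assumptions: it uses BT.601 luminance weights and nearest-neighbour resizing instead of the scikit-image calls in the notebook, and the 512x288 frame size and the helper names are made up for illustration.

```python
import numpy as np

def to_gray(frame_rgb):
    """Convert an RGB frame to grayscale with ITU-R BT.601 luminance weights."""
    return frame_rgb[..., :3] @ np.array([0.299, 0.587, 0.114])

def resize_nearest(img, size=(80, 80)):
    """Nearest-neighbour resize by picking evenly spaced rows and columns."""
    rows = np.arange(size[0]) * img.shape[0] // size[0]
    cols = np.arange(size[1]) * img.shape[1] // size[1]
    return img[rows][:, cols]

def preprocess(frame_rgb):
    """Grayscale, resize to 80x80, and scale pixel values to [0, 1]."""
    gray = to_gray(frame_rgb.astype(np.float64))
    return resize_nearest(gray) / 255.0

def initial_state(frame):
    """Repeat the first frame four times to build the 80x80x4 input."""
    return np.stack([frame] * 4, axis=2)

def next_state(state, new_frame):
    """Push the newest frame to the front and drop the oldest one."""
    return np.concatenate([new_frame[:, :, None], state[:, :, :3]], axis=2)

rng = np.random.default_rng(0)
raw0 = rng.integers(0, 256, size=(512, 288, 3), dtype=np.uint8)  # assumed frame size
raw1 = rng.integers(0, 256, size=(512, 288, 3), dtype=np.uint8)

f0, f1 = preprocess(raw0), preprocess(raw1)
s0 = initial_state(f0)
s1 = next_state(s0, f1)
print(s0.shape, s1.shape)  # (80, 80, 4) (80, 80, 4)
```

Because consecutive frames sit side by side in the channel dimension, the network can compare them and infer how fast the bird is moving.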

- The input to the neural network is a stack of four 80x80 grayscale images (4x80x80, or 80x80x4 in TensorFlow ordering).
- The first hidden layer convolves 32 filters of 8 x 8 with stride 4 and applies a ReLU activation function.
- The 2nd layer convolves 64 filters of 4 x 4 with stride 2 and applies a ReLU activation function.
- The 3rd layer convolves 64 filters of 3 x 3 with stride 1 and applies a ReLU activation function.
- The final hidden layer is fully connected and consists of 512 rectifier units.
- The output layer is a fully connected linear layer with a single output for each valid action.
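
The layer sizes above determine the shape of every feature map. The following arithmetic sketch assumes 'same' padding (as `border_mode='same'` in the notebook's `buildmodel`) and two output actions; the helper `same_padding_output` is a hypothetical name.

```python
import math

def same_padding_output(size, stride):
    """Output width/height of a convolution with 'same' padding."""
    return math.ceil(size / stride)

# (filters, kernel, stride) for the three convolutional layers described above
conv_layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size, channels = 80, 4  # 80x80 input with 4 stacked frames
shapes, n_params = [], 0
for filters, kernel, stride in conv_layers:
    n_params += kernel * kernel * channels * filters + filters  # weights + biases
    size, channels = same_padding_output(size, stride), filters
    shapes.append((size, size, channels))

flat = size * size * channels   # number of inputs to the 512-unit layer
n_params += flat * 512 + 512    # fully connected hidden layer
n_params += 512 * 2 + 2         # linear output layer, one unit per action

print(shapes)    # [(20, 20, 32), (10, 10, 64), (10, 10, 64)]
print(flat)      # 6400
print(n_params)  # 3356322
```

Almost all of the parameters sit in the fully connected layer that follows the flattened 10x10x64 feature map.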

![title](images/flappy_dqn.png)
Image Source: https://github.com/yenchenlin/DeepLearningFlappyBird

**Convolution** helps the computer learn higher-level features such as edges and shapes. Edges stand out clearly after a convolution filter is applied to an image.
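
To make the edge effect concrete, here is a small sketch that slides a Sobel filter over a synthetic image. The 6x6 image and the helper `convolve2d_valid` are illustrative assumptions; like most deep learning libraries, it computes a cross-correlation (no kernel flip).

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Synthetic image: dark left half, bright right half (a vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Sobel filter that responds to horizontal changes in intensity.
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

response = convolve2d_valid(image, sobel_x)
print(response)
```

The response is zero in the flat regions and 4 in the two columns that straddle the edge, which is what "edges stand out" means in practice.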

**Keras** makes it very easy to build a convolutional neural network. However, there are a few things to keep track of:

- A) It is important to choose the right initialization method. I chose a normal distribution with σ = 0.01: init=lambda shape, name: normal(shape, scale=0.01, name=name)

- B) The ordering of the dimensions is important. The default setting is 4x80x80 (Theano setting), so if your input is 80x80x4 (TensorFlow setting), the dimensions will be wrong unless you set dim_ordering = tf (tf means TensorFlow, th means Theano).

- C) In Keras, subsample=(2,2) means you down-sample the image from (80x80) to (40x40). In the ML literature this is usually called "stride".

- D) We use an adaptive learning algorithm called Adam for the optimization. The learning rate is set by the LEARNING_RATE constant (1e-4 in the code below).
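
For point B, a stack of frames stored channels-first can be moved to channels-last ordering with a transpose. This is a minimal NumPy sketch with made-up array contents; it does not touch any Keras setting.

```python
import numpy as np

# Hypothetical stack of four 80x80 frames in channels-first (Theano-style) order.
frames_first = np.arange(4 * 80 * 80, dtype=np.float32).reshape(4, 80, 80)

# Move the frame axis to the end to get channels-last (TensorFlow-style) order.
frames_last = np.transpose(frames_first, (1, 2, 0))

# Add a leading batch dimension before feeding a single state to the network.
batch = frames_last[np.newaxis, ...]

print(frames_first.shape, frames_last.shape, batch.shape)
# (4, 80, 80) (80, 80, 4) (1, 80, 80, 4)
```

A plain `reshape` to (80, 80, 4) would give the right shape but scramble the pixels, which is why the transpose matters.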

**Experience Replay:**

It was found that approximating the Q-value with a non-linear function such as a neural network is not very stable. During game-play, all the transitions (s, a, r, s′) are stored in replay memory D. When training the network, random mini-batches from the replay memory are used instead of the most recent transition, which greatly improves stability.

## Policy Gradient (PG)

Policy Gradient algorithms optimize the parameters of a policy by following the gradients toward higher rewards. One popular class of PG algorithms, called REINFORCE algorithms, was introduced back in 1992 by Ronald Williams.
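
The idea of following the gradient toward higher reward can be sketched with REINFORCE on a two-armed bandit. This is a toy illustration under assumed settings (softmax policy, deterministic rewards of 0 and 1, step size 0.1), not part of the Flappy Bird agent.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
theta = np.zeros(2)             # one policy parameter per action
rewards = np.array([0.0, 1.0])  # hypothetical bandit: action 1 always pays 1
alpha = 0.1                     # step size (assumed)

for episode in range(500):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)
    R = rewards[action]
    # Gradient of log pi(action) for a softmax policy: one-hot(action) - probs
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    # REINFORCE update: step along the gradient, scaled by the reward received
    theta += alpha * R * grad_log_pi

final_probs = softmax(theta)
print(final_probs)
```

Each time the rewarded action is sampled, its probability goes up, so the policy drifts toward the better arm.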

# Deep Neural Network Implementation

Jupyter notebook function to disable cell-level scrolling

In [1]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


In [None]:
#!/usr/bin/env python
from __future__ import print_function

import argparse
import skimage
from skimage import transform, color, exposure
import sys
sys.path.append("game/")
#import wrapped_flappy_bird as game
from flappy_bird_env import * 
import random
import numpy as np
from collections import deque
import datetime
import json

from keras.initializers import normal, identity
from keras.models import model_from_json
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.optimizers import SGD , Adam
import tensorflow as tf


In [3]:
# OpenAI Gym-style clone of the Flappy Bird environment
%run flappy_bird_env_open_ai_gym.py

In [None]:
GAME = 'bird' # the name of the game being played for log files
CONFIG = 'nothreshold'
ACTIONS = 2 # number of valid actions
GAMMA = 0.99 # discount factor applied to future rewards
OBSERVATION = 1000. # timesteps to observe before training
EXPLORE = 1000000. # frames over which to anneal epsilon
FINAL_EPSILON = 0.0001 # final value of epsilon
INITIAL_EPSILON = 0.08 # starting value of epsilon
REPLAY_MEMORY = 50000 # number of previous transitions to remember
BATCH = 32 # size of minibatch
FRAME_PER_ACTION = 1
LEARNING_RATE = 1e-4

img_rows , img_cols = 80, 80
#Convert image into Black and white
img_channels = 4 #We stack 4 frames

In [1]:
def buildmodel():
    print("Now we build the model")
    model = Sequential()
    
    #80*80*4
    model.add(Convolution2D(32, 8, 8, subsample=(4, 4), border_mode='same',
                                input_shape=(img_rows,img_cols,img_channels))) 
    #hidden layers
    model.add(Activation('relu'))
    model.add(Convolution2D(64, 4, 4, subsample=(2, 2), border_mode='same'))
    model.add(Activation('relu'))
    model.add(Convolution2D(64, 3, 3, subsample=(1, 1), border_mode='same'))
    model.add(Activation('relu'))
    model.add(Flatten())
    model.add(Dense(512))
    model.add(Activation('relu'))
    
    model.add(Dense(ACTIONS))  # linear output: one Q-value per valid action
   
    adam = Adam(lr=LEARNING_RATE)
    model.compile(loss='mse',optimizer=adam)
    
    return model

In [5]:
def trainNetwork(model,args):
    # open up a game state to communicate with emulator
    max_episode=2000000
    env = FlappyBirdEnv()
    start = datetime.datetime.now()
    algorithm = 'DQN'
    data = []
    # store the previous observations in replay memory
    D = deque()

    # get the first state by doing nothing and preprocess the image to 80x80x4
    #do_nothing = np.zeros(ACTIONS)
    #do_nothing[0] = 1

    if args['mode'] == 'Run':
        OBSERVE = 999999999    #We keep observe, never train
        epsilon = FINAL_EPSILON
        print ("Now we load weight")
        model.load_weights("data/model.h5")
        adam = Adam(lr=LEARNING_RATE)
        model.compile(loss='mse',optimizer=adam)
        print ("Weight load successfully")    
    else:                       #We go to training mode
        model.load_weights("data/model.h5")
        adam = Adam(lr=LEARNING_RATE)
        model.compile(loss='mse',optimizer=adam)
        OBSERVE = OBSERVATION
        epsilon = INITIAL_EPSILON

    t = 0
    for t in range( max_episode):
        x_t = env.reset(return_type=3)
  
        x_t = skimage.color.rgb2gray(x_t)
        x_t = skimage.transform.resize(x_t,(80,80))
        x_t = skimage.exposure.rescale_intensity(x_t,out_range=(0,255))

        x_t = x_t / 255.0

        s_t = np.stack((x_t, x_t, x_t, x_t), axis=2)
        #print (s_t.shape)

        # In Keras, need to reshape
        s_t = s_t.reshape(1, s_t.shape[0], s_t.shape[1], s_t.shape[2])  #1*80*80*4
        
        #loss = 0
        #Q_sa = 0
        terminal=False
        while not terminal:
            loss = 0
            Q_sa = 0
            #action_index = 0
            r_t = 0
            a_t = 0
            # choose an action epsilon-greedily
            if t % FRAME_PER_ACTION == 0:
                if random.random() <= epsilon:
                    print("----------Random Action----------")
                    action = random.randrange(ACTIONS)
                    a_t = action
                    #print("a_t", a_t)
                else:
                    q = model.predict(s_t)       #input a stack of 4 images, get the prediction
                    max_Q = np.argmax(q)
                    action = max_Q
                    a_t = action
                    #print("a_t", a_t)

            # reduce epsilon gradually
            if epsilon > FINAL_EPSILON and t > OBSERVE:
                epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE

            # run the selected action and observe the next state and reward
            x_t1_colored, r_t, terminal,_ = env.step(a_t,return_type=3)
            env.render()
            x_t1 = skimage.color.rgb2gray(x_t1_colored)
            x_t1 = skimage.transform.resize(x_t1,(80,80))
            x_t1 = skimage.exposure.rescale_intensity(x_t1, out_range=(0, 255))


            x_t1 = x_t1 / 255.0


            x_t1 = x_t1.reshape(1, x_t1.shape[0], x_t1.shape[1], 1) #1x80x80x1
            s_t1 = np.append(x_t1, s_t[:, :, :, :3], axis=3)

            # store the transition in D
            D.append((s_t, action, r_t, s_t1, terminal))
            if len(D) > REPLAY_MEMORY:
                D.popleft()

            # only train if done observing
            if t > OBSERVE:
                # sample a minibatch to train on
                minibatch = random.sample(D, BATCH)

                # now we do the experience replay
                state_t, action_t, reward_t, state_t1, done = zip(*minibatch)
                state_t = np.concatenate(state_t)
                state_t1 = np.concatenate(state_t1)
                targets = model.predict(state_t)
                Q_sa = model.predict(state_t1)
                #print ("Q_sa", Q_sa)
                #print ("Max Q_sa",np.max(Q_sa, axis=1))
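                # terminal transitions get no bootstrapped future value
                not_done = 1.0 - np.array(done, dtype=float)
                targets[range(BATCH), action_t] = np.array(reward_t) + GAMMA*np.max(Q_sa, axis=1)*not_done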

                loss += model.train_on_batch(state_t, targets)
            
     
            s_t = s_t1
            #if terminal:
            #    break
            
        t = t + 1
        
        
        duration = datetime.datetime.now() - start 
        
        if (env.score >= 10):
            print("Duration: {} Episode {} Score: {}".format(duration, 
                                                                t, 
                                                                env.score))
        
        data.append(json.dumps({ "algorithm": algorithm, 
                    "duration":  "{}".format(duration), 
                    "episode":   t, 
                    "reward":    r_t, 
                    "score":     env.score}))
        
        if (len(data) == 500):
            file_name = 'data/stats_flappy_bird_{}.json'.format(algorithm)
            
            # delete the old file before saving data for this session
            #if t == 1 and os.path.exists(file_name): os.remove(file_name)
                
            # open the file in append mode to add more json data
            with open(file_name, 'a+') as stats_file:
                for item in data:
                    stats_file.write(item)
                    stats_file.write(",")
            
            data = []
            
        # save progress every 1000 episodes (same path the weights are loaded from)
        if t % 1000 == 0:
            print("Now we save model")
            model.save_weights("data/model.h5", overwrite=True)
            with open("data/model.json", "w") as outfile:
                json.dump(model.to_json(), outfile)

        # print info
        state = ""
        if t <= OBSERVE:
            state = "observe"
        elif t > OBSERVE and t <= OBSERVE + EXPLORE:
            state = "explore"
        else:
            state = "train"

        print("TIMESTEP", t, "/ STATE", state, \
            "/ EPSILON", epsilon, "/ ACTION", action, "/ REWARD", r_t, \
            "/ Q_MAX " , np.max(Q_sa), "/ Loss ", loss)

        print("Episode finished!")
        print("************************")

In [6]:
def playGame(args):
    model = buildmodel()
    trainNetwork(model,args)

def main():
    parser = argparse.ArgumentParser(description='Description of your program')
    parser.add_argument('-m','--mode', help='Train / Run', required=True)
    args = vars(parser.parse_args())
    playGame(args)


In [None]:
if __name__ == "__main__":
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)
    from keras import backend as K
    K.set_session(sess)
    main()