# Flappy Bird Value Approximator: Simple Neural Network

## Value Function Approximation
- Function Approximators: Linear Combinations and Neural Networks
- Incremental Methods: Stochastic Gradient Descent Prediction and Control
- Batch Methods: Least Squares Prediction and Control, Experience Replay

## Types of Function Approximators
- Many supervised ML algorithms can serve as function approximators: linear combinations of features, neural networks, decision trees, etc.
- RL benefits most from differentiable function approximators such as linear models and neural networks, because their weights can be updated with gradient methods.
- Incremental methods update the weights after each sample, while batch methods perform one update per batch of samples.
- Stochastic Gradient Descent (SGD) is an incremental, iterative optimization algorithm that finds the parameter values (weights) of a function that minimize a cost function.
- The least squares method is a form of regression analysis that finds the line of best fit for a dataset by minimizing the sum of squared errors, which shows the relationship between the data points.
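
The contrast between incremental and batch methods can be sketched on a linear approximator. The example below is a minimal illustration, not part of the Flappy Bird code: the synthetic data, the feature count, the step size `alpha`, and the number of passes are all assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): 200 states with 3 features each and noisy targets.
true_w = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(200, 3))
y = X @ true_w + 0.01 * rng.normal(size=200)

# Incremental method: one SGD step per sample.
w_sgd = np.zeros(3)
alpha = 0.05
for _ in range(20):                      # a few passes over the data
    for x_i, y_i in zip(X, y):
        error = y_i - x_i @ w_sgd        # prediction error for this sample
        w_sgd += alpha * error * x_i     # gradient step on the squared error

# Batch method: least-squares solution computed from all samples at once.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print("SGD weights:          ", np.round(w_sgd, 2))
print("Least-squares weights:", np.round(w_ls, 2))
```

Both approaches should land close to the weights that generated the data; SGD gets there through many small corrections, while least squares solves for the weights in a single step.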

## Neural Network Approximating the Q Function

![title](images/va_nonlinear.png)
![title](images/va_nonlinear_z.png)


## Q-Learning with Non-Linear Approximation

- Step 1: Start with initial parameter values
- Step 2: Take action a according to an explore or exploit policy, transitioning from s to s’
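- Step 3: Perform the TD update for each parameter $\theta_i$, where $\alpha$ is the learning rate and $\beta$ is the discount factor (`gamma` in the code below)
     \begin{equation}
\large
\theta_i \leftarrow \theta_i + \alpha \left[ R(s) + \beta \max_{a'} \hat{Q}_\theta(s', a') - \hat{Q}_\theta(s, a) \right] \frac{\partial \hat{Q}_\theta(s,a)}{\partial \theta_i}
\end{equation}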
- Step 4: Go to Step 2

The parameter space typically has many local minima, so convergence is no longer guaranteed, but the method often works well in practice.
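
The update in Step 3 can be sketched with a tiny network written in NumPy. This is an illustrative sketch only: the one-hidden-layer `tanh` architecture, the layer sizes, and the helper names `q_values` and `td_update` are assumptions made for the example and are not the Keras model used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_actions = 4, 8, 2   # illustrative sizes (assumed)

# Step 1: start with initial parameter values theta = (W1, W2).
theta = {
    "W1": rng.normal(scale=0.5, size=(n_hidden, n_features)),
    "W2": rng.normal(scale=0.5, size=(n_actions, n_hidden)),
}

def q_values(theta, s):
    """Return Q_theta(s, a) for every action a, plus the hidden activations."""
    h = np.tanh(theta["W1"] @ s)
    return theta["W2"] @ h, h

def td_update(theta, s, a, r, s_next, alpha=0.05, beta=0.95):
    """Step 3: apply the TD update to every parameter for one transition."""
    q, h = q_values(theta, s)
    q_next, _ = q_values(theta, s_next)
    td_error = r + beta * np.max(q_next) - q[a]

    # Partial derivatives of Q_theta(s, a) with respect to each parameter.
    grad_W2_a = h                                            # dQ/dW2[a]
    grad_W1 = np.outer(theta["W2"][a] * (1.0 - h ** 2), s)   # dQ/dW1

    theta["W2"][a] += alpha * td_error * grad_W2_a
    theta["W1"] += alpha * td_error * grad_W1
    return td_error

# One made-up transition (s, a, r, s') to exercise the update.
s = rng.normal(size=n_features)
s_next = rng.normal(size=n_features)
print("TD error:", td_update(theta, s, a=0, r=1.0, s_next=s_next))
```

Only the row of `W2` belonging to the chosen action receives a gradient, because the other outputs do not contribute to $\hat{Q}_\theta(s, a)$.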


# Neural Network Concepts

- The perceptron, the first-generation neural network, is a simple mathematical model (a function) that mimics a neuron, the basic unit of the brain.
- The sigmoid neuron improved learning by replacing the perceptron's hard threshold with a smooth activation, so that small changes in the weights produce small changes in the output.
- A neural network is a directed graph organized in layers, and each layer is made up of a number of interconnected neurons (nodes).
- A typical neural network contains three kinds of layers: input, hidden and output. If it has more than one hidden layer, it is called a deep neural network.
- The actual processing happens in the hidden layers, where each neuron applies an activation function to the weighted input it receives from the previous layer.
- The performance of a neural network is measured with a cost (error) function, which depends on the weights.
- Forward-propagation and back-propagation are the two techniques a neural network uses repeatedly until the weights are calibrated to predict accurate output.
- During forward-propagation, information moves in the forward direction and passes through all the layers, with weights applied to the inputs at each layer. Back-propagation computes the gradient of the error with respect to the weights, and gradient descent uses that gradient to reduce the error at each iteration step.
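
The forward and backward passes described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed settings (a toy XOR-style dataset, four sigmoid hidden units, a linear output, mean squared error, and a learning rate of 0.1), not the network used for Flappy Bird.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (assumed): 4 samples, 2 input features, 1 target each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Weights and biases for one hidden layer and one output layer.
W1 = rng.normal(size=(2, 4))
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1))
b2 = np.zeros(1)

learning_rate = 0.1
losses = []
for step in range(5000):
    # Forward-propagation: pass the inputs through each layer.
    h = sigmoid(X @ W1 + b1)
    y_hat = h @ W2 + b2
    losses.append(np.mean((y_hat - y) ** 2))   # cost function (MSE)

    # Back-propagation: gradient of the cost with respect to each weight.
    d_out = 2.0 * (y_hat - y) / len(X)
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * h * (1.0 - h)       # sigmoid derivative
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0)

    # Gradient descent: move each weight against its gradient.
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

print("first loss:", losses[0], "last loss:", losses[-1])
```

The printed losses should show the error shrinking as the two passes are repeated.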

![title](images/nn.png)


# Simple Neural Network Implementation

Jupyter notebook function to disable cell-level scrolling

In [1]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


In [7]:
import os
import datetime
import gym
import random
import json
import numpy as np
from collections      import deque
from keras.models     import Sequential
from keras.layers     import Dense
from keras.optimizers import Adam

Using TensorFlow backend.


In [15]:
# load FlappyBirdEnv, an OpenAI Gym-style clone of the game
%run util/flappy_bird_env_open_ai_gym.py

In [16]:
class ValueApproxSimpleNNModel(object):
    
    def __init__(self, state_size, action_size, algorithm):
        self.algorithm          = algorithm
        self.learning_rate      = 0.001        
        self.weight_backup      = "flappy_va_{}.h5".format(algorithm) 
        
        self.state_size         = state_size        
        self.action_size        = action_size    
        
        self.exploration_rate   = 1.0
        self.exploration_min    = 0.01        
        
        self.brain              = self._build_model()
    
    def _build_model(self):
        
        # Neural Net for Deep-Q learning Model
        model = Sequential()
        
        #first hidden layer; input_dim sets the size of the input (the state vector)
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        
        #second hidden layer
        model.add(Dense(24, activation='relu'))
        
        #output layer with one linear Q-value per action (two outputs - up or down)
        model.add(Dense(self.action_size, activation='linear'))
        
        #set the loss function and the optimizer
        model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))

        #check file exists to load the weights from
        if os.path.isfile(self.weight_backup):
            model.load_weights(self.weight_backup)
            self.exploration_rate = self.exploration_min
        return model

    def save_model(self):
        self.brain.save(self.weight_backup)
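
As a rough size check, the number of trainable parameters in a stack of `Dense` layers is the sum of weights plus biases per layer. The helper `dense_param_count` below is a hypothetical name introduced for this sketch, and `state_size = 4` is an assumed value for illustration only; the real value comes from `env.observation_space.shape[0]`.

```python
def dense_param_count(layer_sizes):
    """Count weights plus biases for consecutive fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Assumed state_size of 4, two hidden layers of 24 units, 2 actions.
print(dense_param_count([4, 24, 24, 2]))  # 120 + 600 + 50 = 770
```

With these assumed sizes, most of the parameters sit in the 24-by-24 connection between the two hidden layers.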

In [17]:
class ValueApproxAgent(object):
    def __init__(self, state_size, action_size, model):        
        self.state_size         = state_size
        self.action_size        = action_size
        self.memory             = deque(maxlen=2000)
        self.learning_rate      = 0.001
        self.gamma              = 0.95
        #reuse the model's rate so that loading saved weights lowers exploration
        self.exploration_rate   = model.exploration_rate
        self.exploration_min    = 0.01
        self.exploration_decay  = 0.995
        self.model              = model        
    
    def act(self, state):
        #*****************************************
        #ACT: agent will randomly select its action at first by a certain percentage, 
        #called ‘exploration rate’ (or ‘epsilon’). 
        #At the beginning, it is better for the DQN agent to try 
        #different things before it starts to search for a pattern
        #*****************************************
        
        if np.random.rand() <= self.exploration_rate:
            return random.randrange(self.action_size)
        
        act_values = self.model.brain.predict(state)
        return np.argmax(act_values[0])

   
    def remember(self, state, action, reward, next_state, done):
        #****************************************************
        #One of the most important steps in the learning process is to remember 
        #what we did in the past and how the reward was bound to that action
        #****************************************************
        self.memory.append((state, action, reward, next_state, done))

    
    def replay(self, sample_batch_size):
        #*****************************************
        #ONLINE learning from samples of the execution trace
        #REPLAY: Now that we have our past experiences in memory, 
        #we can train our neural network. 
        #Going through the whole memory would take too many resources, 
        #so we take only a few samples (sample_batch_size, set here to 32) 
        #and pick them randomly.
        #*****************************************
        #not enough data to train; play another episode
        if len(self.memory) < sample_batch_size:
            return
        
        sample_batch = random.sample(self.memory, sample_batch_size)
        for state, action, reward, next_state, done in sample_batch:
            target = reward
            
            if not done:
                #q-learning
                target = reward + self.gamma * np.amax(self.model.brain.predict(next_state)[0])
                
            target_f = self.model.brain.predict(state)
            target_f[0][action] = target
                                    
            #online learning: fit the model on this one sample, then discard it
            self.model.brain.fit(state, target_f, epochs=1, verbose=0)
            
        #adjust the exploration based on decay
        if self.exploration_rate > self.exploration_min:
            self.exploration_rate *= self.exploration_decay   
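
The decay schedule in `replay` can be checked with a short calculation. The function `decays_until_min` is a hypothetical helper written for this sketch; it repeats the same test-then-multiply step as the agent, using the agent's default values.

```python
def decays_until_min(rate=1.0, decay=0.995, minimum=0.01):
    """Count how many decay steps run before the rate stops changing."""
    steps = 0
    while rate > minimum:      # same condition as in replay()
        rate *= decay
        steps += 1
    return steps, rate

steps, final_rate = decays_until_min()
print(steps, final_rate)
```

Because `replay` applies at most one decay per call and is called once per episode, the agent keeps some exploration above the minimum for roughly the first 919 episodes that reach the decay step. The final rate ends slightly below `exploration_min`, since the check happens before the multiplication.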
                

In [18]:
class ValueApproxGame:
    
    def __init__(self, env, agent, max_iterations= 10000):
        
        self.sample_batch_size = 32 #only few samples
        self.episodes          = max_iterations
        self.agent             = agent
        self.env               = env #Open AI gym    
        self.state_size        = self.env.observation_space.shape[0]
        self.action_size       = self.env.action_space.n
        
        self.start             = datetime.datetime.now()      
        self.data              = []

    def run(self):
        
        try:
            for index_episode in range(self.episodes):
                state = self.env.reset(2)
                state = np.reshape(state, [1, self.state_size])

                done = False
                index = 0
                episode_reward = 0
                
                while not done:
                    self.env.render(close = True)

                    #take action
                    action = self.agent.act(state)                    
                    
                    next_state, reward, done, _ = self.env.step(action, 2)
                    next_state = np.reshape(next_state, [1, self.state_size])
                    
                    self.agent.remember(state, action, reward, next_state, done)
                    
                    state = next_state
                    episode_reward += reward
                    index += 1
                                
                self.save_stats(index_episode, episode_reward, self.env.score)
                
                self.agent.model.save_model()
                    
                self.agent.replay(self.sample_batch_size)
        finally:
            self.agent.model.save_model()

    #save_stats captures the output of every episode with the metrics: 
    #algorithm, duration, episode, reward and score.
    #It is used only for reporting purposes.
    def save_stats(self, episode, reward, score):
                
        duration = datetime.datetime.now() - self.start 
        
        if (score >= 50):
            print("Duration: {} Episode {} Score: {}".format(duration, 
                                                                episode, 
                                                                score))
        
        self.data.append(json.dumps({ "algorithm": self.agent.model.algorithm, 
                    "duration":  "{}".format(duration), 
                    "episode":   episode, 
                    "reward":    reward, 
                    "score":     score}))
        
        # flush the buffered stats to disk every 500 episodes
        if len(self.data) == 500:
            file_name = 'data/stats_flappy_bird_{}.json'.format(self.agent.model.algorithm)
            
            # on the first flush of a session (episodes 0-499), delete the old file
            if episode == len(self.data) - 1 and os.path.exists(file_name):
                os.remove(file_name)
                
            # open the file in append mode to add more json data
            with open(file_name, 'a+') as file:
                for item in self.data:
                    file.write(item)
                    file.write(",")
            
            self.data = []
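
Since `save_stats` writes JSON objects separated by commas with a trailing comma, the file is not a valid JSON document on its own. One possible way to read it back is sketched below; `load_stats` is a hypothetical helper, and the sample string stands in for the file contents.

```python
import json

def load_stats(raw_text):
    """Parse the comma-terminated JSON objects written by save_stats."""
    # Drop the trailing comma and wrap the objects in a JSON array.
    return json.loads("[" + raw_text.strip().rstrip(",") + "]")

# Made-up sample in the same layout as the stats file.
sample = '{"episode": 0, "score": 1},{"episode": 1, "score": 3},'
records = load_stats(sample)
print(records[1]["score"])  # 3
```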

In [None]:
if __name__ == "__main__":
    
    algorithm = "Simple_Neural_Network"
    max_episodes = 10000
    env = FlappyBirdEnv()
    state_size = env.observation_space.shape[0]
    action_size = env.action_space.n
    
    model = ValueApproxSimpleNNModel(state_size, action_size, algorithm)
    agent = ValueApproxAgent(state_size, action_size, model)
    
    flappy = ValueApproxGame(env, agent, max_episodes)
    
    flappy.run()