# Deep Q-Network

## From RL to Deep RL
So far, we have solved many of our reinforcement learning problems using solution methods that represent the action values in a small table. Earlier in the nanodegree, we referred to this table as a **Q-table**.

In the video below, **Kelvin Lwin** will introduce us to the idea of using neural networks to expand the size of the problems that we can solve with reinforcement learning. This context is useful preparation for exploring the details behind the Deep Q-Learning algorithm later in this lesson!

* Deep RL: using nonlinear function approximators to calculate the values of actions directly from observations of the environment. We represent the approximator as a deep neural network.

#### Stabilizing Deep Reinforcement Learning
---
As we'll learn in this lesson, the Deep Q-Learning algorithm represents the optimal action value function $q_*$ as a neural network (instead of a table).

* Unfortunately, reinforcement learning is __notoriously unstable__ when neural networks are used to represent the action values. In this lesson, we will learn all about the Deep Q-Learning algorithm, which addressed these instabilities by using __two key features:__
    * Experience Replay
    * Fixed Q-Targets
    
    
### Deep Q-Networks
* _HOW IT WORKS_
    * A deep neural network that acts as a function approximator.
    * Input: raw pixels. Output: a vector of action values, one per action.
    * Atari games are displayed at a resolution of 210 by 160 pixels, with 128 possible colors for each pixel. This is still technically a discrete state space, but it is far too large to process as is.
    * DeepMind reduced each image to 84 by 84 pixels and converted it to grayscale (the square shape suits GPU operations).
    * To give the network a sequence of frames, DeepMind stacked 4 frames at a time, resulting in a final state size of 84 by 84 by 4.
    * On the output side, unlike a traditional reinforcement learning setup where only one Q value is produced at a time, __the Q network is designed to produce a Q value for every possible action in a single forward pass.__
    * Without this, we would have to run the network individually for every action.
    * Instead, we can simply use this vector to take an action, either stochastically or by choosing the one with the maximum value.
    
* Training such a network requires a lot of data, but even then, it is not guaranteed to converge on the optimal value function. **In fact, there are situations where the network weights can oscillate or diverge, due to the high correlation between actions and states.**
* This can result in a very unstable and ineffective policy. Two techniques address this:
    * Experience Replay
    * Fixed Q-Targets
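
As a small illustration of how the output vector can be turned into an action, here is a minimal sketch of epsilon-greedy selection over a list of Q-values. The function name `select_action` and the example numbers are hypothetical; they are not taken from the course code.

```python
import random


def select_action(q_values, epsilon=0.0, rng=random):
    """Pick an action index from a vector of Q-values (one entry per action)."""
    if rng.random() < epsilon:
        # Explore: choose any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q-value (the first one wins on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical output of one forward pass for an environment with 4 actions.
q_values = [0.1, 1.7, -0.3, 0.9]
greedy_action = select_action(q_values, epsilon=0.0)
```

With `epsilon=0.0` the choice is purely greedy; a larger `epsilon` makes the choice stochastic.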


# Experience Replay
* The idea of experience replay and its application to training neural networks isn't new.
* It was originally proposed to make more efficient use of observed experiences.
* Consider the basic online Q-Learning algorithm, where we interact with the environment and at each time step obtain a state, action, reward, next state tuple $(S_t,A_t,R_{t+1},S_{t+1})$.
* We learn from it, __discard__ it, and move on to the next tuple in the following time step.
* We could possibly learn more from these experienced tuples if we stored them somewhere.
* Moreover, some states are pretty rare to come by and some actions can be pretty costly, so it would be nice to **recall** such experiences.
* That is exactly what a replay buffer allows us to do.

### Replay Buffer
* We store each experienced tuple in this buffer as we are interacting with the environment and then sample a small batch of tuples from it in order to learn.
* As a result, we are able to learn from individual tuples multiple times, recall rare occurrences, and in general make better use of our experience.
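
A replay buffer along these lines can be sketched with a fixed-size `deque`. This is a minimal illustration, not the course's `dqn_agent.py`: the class name, the `done` field and the capacity are assumptions made for the example.

```python
import random
from collections import deque, namedtuple

# One stored interaction; `done` (episode ended) is an added assumption.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer that stores experience tuples and samples them at random."""

    def __init__(self, capacity, seed=0):
        self.memory = deque(maxlen=capacity)  # oldest tuples are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement, ignoring the storage order.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=5)
for t in range(8):  # 8 steps into a buffer of size 5: only steps 3..7 remain
    buffer.add(state=t, action=t % 2, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(batch_size=3)
```

Because the batch is drawn at random, the same tuple can be reused in several learning steps until it is pushed out of the buffer.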


<br>**Another Problem that replay buffer solves.**
* This is what DQN takes advantage of.
* If we think about the experiences being obtained, we realize that every action $A_t$ affects the next state $S_{t+1}$ in some way, which means that a __sequence of experienced tuples can be highly correlated__.
* A naive Q-Learning approach that learns from each of these experiences in sequential order runs the risk of getting swayed by the effects of this correlation.
* With experience replay, we can sample from this buffer at random.
* It doesn't have to be in the same sequence as we stored the tuples.
* This helps break the correlation and ultimately prevents action values from oscillating or diverging catastrophically.

### Example: why we need to break the correlation between consecutive experience tuples

#### Tennis Example:
* I am learning to play tennis and practicing my forehand.
* I am more confident with my forehand shot than with my backhand.
* I hit the ball __straight__, so the ball comes straight back to my forehand.

* Now, if I were an online Q-learning agent learning to play, this is what I might pick up.
* When the ball comes to my right, I should hit it with my forehand: less certainly at first, but with increasing confidence as I repeatedly hit the ball.
* I'm learning to play the forehand pretty well, **but I'm not exploring the rest of the state space.**
* This could be addressed by using an _epsilon-greedy policy: acting randomly with a small probability._
* So I try different combinations of states and actions and sometimes I make mistakes, but I eventually figure out the best overall policy.
* Use a forehand shot when the ball comes to my right and a backhand when it comes to my left.

* __This works fine with a simplified state space with just two discrete states.__


#### Continuous state space -- Problem
* But when we consider a continuous state space, things can fall apart. Let's see how.
* First, the ball can actually come anywhere between the extreme left and the extreme right.
* If I discretize this range into buckets, I will have too many buckets (too many possibilities).
* I may also end up learning a policy with __holes__ in it.
* These are states or situations that I have not visited during practice.
* Instead, it makes more sense to use a function approximator, such as a linear combination of RBF kernels or a Q-network, that can generalize my learning across the space.
* Now, every time the ball comes to my right and I successfully hit a forehand shot, my value function changes slightly.

* What happens when I learn while playing (__processing each experience tuple in order__)?
* For instance, if my forehand shot is fairly straight, I likely get the ball back around the same spot.
* __This produces a state very similar to the previous one, so I use my forehand again, and if it is successful it reinforces my belief that the forehand is a good choice.__
* I can easily get trapped in this cycle.
* Ultimately, if I don't see many examples of the ball coming to my left for a while, the value of the forehand shot becomes greater than that of the backhand across the entire state space.
* __My policy would then be to choose the forehand regardless of where I see the ball coming.__


### Fix it
* The first thing I should do is stop learning while practicing.
* This time is best spent trying out different shots, playing a little randomly and thus exploring the state space.
* It then becomes important to remember my interactions: which shots worked well in which situations, and so on.
* When I take a break, or when I am back home and resting, that's a good time to recall these experiences and learn from them.

* The main advantage is that I now have a more comprehensive set of examples.
* I can recall random experience tuples from the buffer and learn different shots in different regions of the state space.
* After this, I play again with what I have __learned__, __collect more experience tuples__, and learn from them in batches.
* __Experience replay__ can help us learn a more robust policy, one that is not affected by the inherent correlation present in the sequence of observed experience tuples.


### Experience Replay
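* Reinforcement learning -> supervised learning: the replay buffer acts like a database of samples from which we learn a mapping, so experience replay brings the value-learning part of the problem close to a supervised learning scenario.
* Prioritized experience replay: an extension that replays rare or important tuples more often instead of sampling uniformly.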


### Summary
When the agent interacts with the environment, the sequence of experienced tuples can be highly correlated. The naive Q-learning algorithm that learns from each of these experience tuples in sequential order runs the risk of getting swayed by the effects of this correlation. By instead keeping track of a **replay buffer** and using **experience replay** to sample from the buffer at random, we can prevent action values from oscillating or diverging catastrophically.

The **replay buffer** contains a collection of experience tuples $(S, A, R, S')$. The tuples are gradually added to the buffer as we interact with the environment.

The act of sampling a small batch of tuples from the replay buffer in order to learn is known as __experience replay__. In addition to breaking harmful correlations, experience replay allows us to learn from individual tuples multiple times, recall rare occurrences, and in general make better use of our experience.



# Fixed Q-Targets
* Experience replay helps us address one type of correlation.
    * That is between consecutive experience tuples.
* There is another kind of correlation that Q-learning is susceptible to.

### Q-learning update
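$\Delta w = \alpha \left(R + \gamma \max_{a} \hat{q}(S', a, w^{-}) - \hat{q}(S, A, w)\right) \nabla_w \hat{q}(S, A, w)$

* TD error: $R + \gamma \max_{a} \hat{q}(S', a, w^{-}) - \hat{q}(S, A, w)$
* TD target: $R + \gamma \max_{a} \hat{q}(S', a, w^{-})$
* Old value: $\hat{q}(S, A, w)$

where $w^{-}$ are the weights of a separate target network that are not changed during the learning step, and $(S, A, R, S')$ is an experience tuple.

The update comes from gradient descent on the squared error between the true action values $q_\pi$ and the approximation $\hat{q}$. Because $q_\pi$ is unknown, the last step replaces it with the TD target, which at this point still uses the same weights $w$:

<br>$J(w) = E_{\pi}\left[(q_\pi(S,A) - \hat{q}(S,A,w))^2\right]$
<br>$\nabla_w J(w) = -2\,(q_\pi(S,A) - \hat{q}(S,A,w))\,\nabla_w \hat{q}(S,A,w)$
<br>$\Delta w = -\alpha \frac{1}{2} \nabla_w J(w)$
<br>$= \alpha\,(q_\pi(S,A) - \hat{q}(S,A,w))\,\nabla_w \hat{q}(S,A,w)$
<br>$\Delta w = \alpha \left(R + \gamma \max_{a} \hat{q}(S', a, w) - \hat{q}(S,A,w)\right) \nabla_w \hat{q}(S,A,w)$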

* The motivation for fixed Q-targets is that both the **labels** (the TD targets) and the **predicted values** are functions of the weights.

* All the Q values are intrinsically tied together through the function parameters.
* Doesn't experience replay take care of this problem?
    * Well, it addresses a similar but slightly different issue.
        * There, we broke the correlation effects between consecutive tuples by sampling them randomly, out of order.
* Here, the correlation is between the target and the parameters we are changing.
### Fixed Target
$\Delta w = \alpha \left(R + \gamma \max_{a} \hat{q}(S', a, w^{-}) - \hat{q}(S, A, w)\right) \nabla_w \hat{q}(S, A, w)$
* The fixed parameters, indicated by $w^{-}$, are basically a copy of $w$ that we don't change during the learning step.
* In practice, we copy $w$ into $w^{-}$ and use it to generate targets while changing $w$ for a certain number of learning steps.
* Then we update $w^{-}$ with the latest $w$, learn again for a number of steps, and so on.
* __This decouples the target from the parameters, makes the learning algorithm much more stable, and less likely to diverge or fall into oscillations.__
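
The copy-then-hold schedule described above can be sketched as a plain loop. This is only an illustration of the bookkeeping: `train_with_fixed_targets`, `toy_update` and the sync interval are hypothetical, and the toy update stands in for a real gradient step.

```python
import copy


def train_with_fixed_targets(w, num_steps, sync_every, update_fn):
    """Update w every step, but refresh the target copy only every `sync_every` steps."""
    w_target = copy.deepcopy(w)          # w^- starts as a copy of w
    sync_steps = []
    for step in range(1, num_steps + 1):
        w = update_fn(w, w_target)       # targets come from the frozen copy
        if step % sync_every == 0:
            w_target = copy.deepcopy(w)  # w^- <- w
            sync_steps.append(step)
    return w, w_target, sync_steps


def toy_update(w, w_target):
    """Stand-in for a learning step: move w halfway towards (target + 1)."""
    return [wi + 0.5 * ((ti + 1.0) - wi) for wi, ti in zip(w, w_target)]


w, w_target, sync_steps = train_with_fixed_targets(
    [0.0], num_steps=10, sync_every=4, update_fn=toy_update
)
# The target copy is refreshed at steps 4 and 8 and held fixed in between.
```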

## Summary 
In Q-Learning, we **update a guess with a guess**, and this can potentially lead to harmful correlations. To avoid this, we can update the parameters $w$ in the network $\hat{q}$ to better approximate the action value corresponding to state $S$ and action $A$ with the following update rule:
$\Delta w = \alpha \left(R + \gamma \max_{a} \hat{q}(S', a, w^{-}) - \hat{q}(S, A, w)\right) \nabla_w \hat{q}(S, A, w)$

where $w^-$ are the weights of a separate target network that are not changed during the learning step, and $(S, A, R, S')$ is an experience tuple.

## DQN Network Paper
* Neural fitted Q-iteration -- Search about it

### Methods
**Preprocessing**. Working directly with raw Atari 2600 frames, which are $210 \times 160$ pixel images with a 128-colour palette, can be demanding in terms of computation and memory requirements. We apply a basic preprocessing step aimed at reducing the input dimensionality and dealing with some artefacts of the Atari 2600 emulator. _First, to encode a single frame we take the maximum value for each pixel colour value over the frame being encoded and the previous frame. This was necessary to remove the flickering that is present in games where some objects appear only in even frames while other objects appear only in odd frames, an artefact caused by the limited number of *sprites* the Atari 2600 can display at once_. Second, we then extract the Y channel, also known as luminance, from the RGB frame and rescale it to $84 \times 84$. The function $\phi$ from algorithm 1 applies this preprocessing to the $m$ most recent frames and stacks them to produce the input to the Q-function, in which $m = 4$, although the algorithm is robust to different values of $m$ (for example, 3 or 5).
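
The preprocessing steps can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the paper's code: the luminance weights (ITU-R BT.601) and the nearest-neighbour resize are choices made for the example, the helper names are hypothetical, and random arrays stand in for emulator frames.

```python
import numpy as np


def to_luminance(rgb):
    """Y channel of an RGB frame (assumed BT.601 weights)."""
    return rgb @ np.array([0.299, 0.587, 0.114])


def resize_nearest(img, out_h=84, out_w=84):
    """Nearest-neighbour rescale (assumed; the resize method is not given above)."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows][:, cols]


def encode_frame(frame, prev_frame):
    """Pixel-wise max over two consecutive frames, then luminance and rescale."""
    merged = np.maximum(frame, prev_frame)  # removes sprite flicker
    return resize_nearest(to_luminance(merged.astype(np.float32)))


def phi(frames, m=4):
    """Stack the m most recent encoded frames; needs at least m + 1 raw frames."""
    recent = range(len(frames) - m, len(frames))
    encoded = [encode_frame(frames[i], frames[i - 1]) for i in recent]
    return np.stack(encoded, axis=-1)


rng = np.random.default_rng(0)
raw = [rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8) for _ in range(5)]
state = phi(raw)  # array of shape (84, 84, 4)
```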

**Model architecture**. There are several possible ways of parameterizing Q using a neural network. Because Q maps history-action pairs to scalar estimates of their Q-value, the history and the action have been used as inputs to the neural network by some previous approaches. **The main drawback of this type of architecture is that a separate forward pass is required to compute the Q-value of each action, resulting in a cost that scales linearly with the number of actions**.


We instead use an architecture in which there is a separate output unit for each possible action, and only the state representation is an input to the neural network. The outputs correspond to the predicted Q-values of the individual actions for the input state. __The main advantage of this type of architecture is the ability to compute Q-values for all possible actions in a given state with only a single forward pass through the network__.
The exact architecture (shown in the paper's figure) is as follows. The input to the neural network consists of an $84 \times 84 \times 4$ image produced by the preprocessing map $\phi$.
<br>**First hidden layer**: Convolves 32 filters of $8 \times 8$ with stride 4 over the input image, followed by a ReLU activation.
<br>**Second hidden layer**: Convolves 64 filters of $4 \times 4$ with stride 2, again followed by a ReLU activation.
<br>**Third hidden layer**: Convolves 64 filters of $3 \times 3$ with stride 1, followed by a ReLU activation.
<br>**Final hidden layer**: A fully connected layer of 512 nodes with ReLU activation.
<br>**Output layer**: A fully connected linear layer with a single output for each valid action. The number of valid actions varied between 4 and 18 on the games we considered.

**Training details**: We performed experiments on 49 Atari 2600 games where results were available for all other comparable methods. A different network was trained on each game: the same network architecture, learning algorithm and hyperparameter settings were used across all games, showing that our approach is robust enough to work on a variety of games while incorporating only minimal prior knowledge.

#### Reward clipping 
We made one change to the reward structure of the games, during training only. As the scale of scores varies greatly from game to game, we **clipped all positive rewards at 1 and all negative rewards at -1**, leaving 0 rewards unchanged. Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games. __At the same time, it could affect the performance of our agent since it cannot differentiate between rewards of different magnitude__. For games where there is a life counter, the Atari 2600 emulator also sends the number of lives left in the game, which is then used to mark the end of the episode.
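
A minimal sketch of the clipping rule, with a hypothetical helper name and made-up scores:

```python
def clip_reward(reward):
    """Clip a raw game reward into the range [-1, 1]; zero stays zero."""
    return max(-1.0, min(1.0, reward))


raw_rewards = [400, 10, 0, -25]                      # hypothetical game scores
clipped = [clip_reward(r) for r in raw_rewards]
# 400 and 10 both become 1.0, so their difference in magnitude is lost.
```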

#### Optimization
In these experiments, we used the RMSProp algorithm with minibatches of size 32. The behaviour policy during training was epsilon-greedy, with epsilon __annealed linearly__ (decayed) from 1.0 to 0.1 over the first million frames and fixed at 0.1 thereafter. We trained for a total of 50 million frames __(that is, around 38 days of game experience in total)__ and used a replay memory of the 1 million most recent frames.
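
The linear annealing schedule for epsilon can be written as a small function of the frame count. The name `epsilon_at` and its parameter names are hypothetical; the start, end and duration values are the ones quoted above.

```python
def epsilon_at(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Epsilon after `frame` frames: linear decay, then held constant at `end`."""
    fraction = min(frame / anneal_frames, 1.0)
    return start + fraction * (end - start)


schedule = [epsilon_at(f) for f in (0, 500_000, 1_000_000, 5_000_000)]
# Approximately 1.0, 0.55, 0.1, 0.1
```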

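Convolution output size (the arithmetic below takes the padding $P = 0$):

$L_{out} = \left\lfloor \frac{L_{in} - k + 2P}{s} \right\rfloor + 1$

* First layer: $(84 - 8)/4 + 1 = 20$, giving $20 \times 20 \times 32$
* Second layer: $(20 - 4)/2 + 1 = 9$, giving $9 \times 9 \times 64$
* Third layer: $(9 - 3)/1 + 1 = 7$, giving $7 \times 7 \times 64$
* Flatten: $7 \times 7 \times 64 = 3136$, so the hidden layer weights have dimensions $(3136, 512)$ and map a $(\text{batch size}, 3136)$ input to $(\text{batch size}, 512)$.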
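
The layer sizes quoted in this section can be checked with a few lines of Python. The helper `conv_out` is a hypothetical name, and zero padding is assumed throughout.

```python
def conv_out(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size - kernel + 2 * padding) // stride + 1


# (kernel, stride, output channels) for the three convolutional layers
layers = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]

size = 84
shapes = []
for kernel, stride, channels in layers:
    size = conv_out(size, kernel, stride)
    shapes.append((size, size, channels))

flat_features = size * size * layers[-1][2]  # input width of the 512-unit layer
```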

### DQN Network -- implementation

### Necessary Packages
```python
import gym
!pip3 install box2d
import random
import torch
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

!python -m pip install pyvirtualdisplay
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

is_ipython = 'inline' in plt.get_backend()
if is_ipython:
    from IPython import display

plt.ion()
```

## 2. Instantiate the Environment and Agent
```python
env = gym.make('LunarLander-v2')
env.seed(0)
print("State shape : ", env.observation_space.shape)
print("Number of actions : ", env.action_space.n)
```


Before running the next code cell, we should familiarize ourselves with the code in **Step 2** and **Step 3** of this notebook, along with the code in `dqn_agent.py` and `model.py`. Once we understand how the different files work together:
- Define a neural network architecture in `model.py` that maps states to action values. This file is mostly empty: it's up to us to define our own deep Q-network!
- Finish the `learn` method in the `Agent` class in `dqn_agent.py`. The sampled batch of experience tuples is already provided for us; we need only use the local and target Q-networks to compute the loss, before taking a step towards minimizing the loss.
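
As a rough guide to the arithmetic the `learn` method needs, here is a framework-agnostic NumPy sketch of the loss for one sampled batch. It is not the exercise solution: `dqn_loss` is a hypothetical name, the arrays stand in for the outputs of the local and target networks, and the `dones` mask is an assumption about how episode ends are handled.

```python
import numpy as np


def dqn_loss(q_local, q_target_next, actions, rewards, dones, gamma=0.99):
    """Mean squared TD error for a batch.

    q_local:       (B, n_actions) local-network values for the sampled states
    q_target_next: (B, n_actions) target-network values for the next states
    """
    max_next = q_target_next.max(axis=1)                     # best next value
    td_targets = rewards + gamma * max_next * (1.0 - dones)  # no bootstrap at episode end
    q_expected = q_local[np.arange(len(actions)), actions]   # value of the action taken
    return np.mean((td_targets - q_expected) ** 2), td_targets


q_local = np.array([[1.0, 2.0], [0.5, 0.0]])
q_target_next = np.array([[0.0, 1.0], [3.0, 2.0]])
actions = np.array([1, 0])
rewards = np.array([1.0, -1.0])
dones = np.array([0.0, 1.0])
loss, td_targets = dqn_loss(q_local, q_target_next, actions, rewards, dones, gamma=0.5)
# td_targets is [1.5, -1.0] and the loss is 1.25 for these made-up numbers.
```

In the notebook the same quantities would be computed on tensors so that the loss can be backpropagated through the local network only.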

Once we have completed the code in `dqn_agent.py` and `model.py`, run the code cell below. (_If we end up needing to make multiple changes and get unexpected behaviour, please restart the kernel and run the cells from the beginning of the notebook!_)

_(Note that there are many ways to train the agent and solve the exercise, and the "solution" is one way of approaching the problem, to yield a trained agent)_