# Intro to Deep Q-Learning

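Use neural networks (NNs) to solve reinforcement learning (RL) problems.

NNs as value functions.

Monte Carlo (MC) and temporal-difference (TD) learning with function approximation, leading up to Deep Q-Learning.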

## NNs as value functions

NNs are universal function approximators.

![learningWithNN.png](attachment:learningWithNN.png)

Learning with a NN

![actionValueFunctionNN.png](attachment:actionValueFunctionNN.png)
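
As a rough illustration of the idea in the figures, the sketch below uses a tiny one-hidden-layer network as an action-value function q̂(s, a, w). The names `init_params` and `q_hat`, the one-hot action encoding, the tanh hidden layer, and the layer sizes are all assumptions made for this example; they are not taken from the figures.

```python
import math
import random


def init_params(n_inputs, n_hidden, seed=0):
    """Create small random weights for a one-hidden-layer network (hypothetical sizes)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    return w1, w2


def q_hat(state, action, params, n_actions):
    """Estimate q(s, a): state features plus one-hot action -> tanh hidden layer -> scalar."""
    w1, w2 = params
    x = list(state) + [1.0 if a == action else 0.0 for a in range(n_actions)]
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return sum(w * h for w, h in zip(w2, hidden))


# Two state features and three actions, so the input has 2 + 3 entries.
params = init_params(n_inputs=2 + 3, n_hidden=4)
value = q_hat([0.5, -0.2], action=1, params=params, n_actions=3)
```

Learning then means adjusting the weights so that `q_hat` moves closer to a target value for each visited state-action pair.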

## Monte Carlo Learning

Recall MC learning: its update can be written as a gradient descent step, as below.

![MClearning.png](attachment:MClearning.png)

We can apply this with a function approximator as follows:

![MCfuncApprox.png](attachment:MCfuncApprox.png)
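
A minimal sketch of this update, assuming a linear approximator so that the gradient of q̂ with respect to w is simply the feature vector. The `features` map, the step size, and the toy episode are hypothetical. The update applied is the usual every-visit gradient MC step, w ← w + α (G_t − q̂(S_t, A_t, w)) ∇q̂(S_t, A_t, w).

```python
def features(state, action, n_actions):
    """Hypothetical feature map: copy the state into the block belonging to `action`."""
    x = [0.0] * (len(state) * n_actions)
    for i, s in enumerate(state):
        x[action * len(state) + i] = s
    return x


def q_hat(w, x):
    """Linear action-value estimate; its gradient with respect to w is x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def mc_update(w, episode, n_actions, alpha=0.1, gamma=1.0):
    """Apply one gradient MC update per step of a finished episode.

    `episode` is a list of (state, action, reward) tuples, where `reward`
    is the reward received after taking `action` in `state`.
    """
    g = 0.0
    # Walk backwards so the return G_t can be accumulated incrementally.
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        x = features(state, action, n_actions)
        error = g - q_hat(w, x)
        w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
    return w


w = mc_update([0.0, 0.0], [([1.0], 0, 0.0), ([1.0], 1, 1.0)], n_actions=2)
```

Note that the update can only run once the episode has finished, because the full return G_t is needed.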

## Temporal Difference Learning

TD learning updates at each time step instead of at the end of each episode.

![TDvsMC.png](attachment:TDvsMC.png)

The corresponding gradient update can be written as:

![TDgradient.png](attachment:TDgradient.png)

TD(0) Control with function approximation

![TDfuncApprox.png](attachment:TDfuncApprox.png)
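
The one-step update can be sketched as follows, again assuming a linear q̂ so that the gradient is the feature vector. The function names, step size, and discount are assumptions for illustration. This is the semi-gradient Sarsa form, in which the target R + γ q̂(S', A', w) is treated as a constant when taking the gradient.

```python
def q_hat(w, x):
    """Linear action-value estimate for a feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sarsa_update(w, x, reward, x_next, done, alpha=0.1, gamma=0.9):
    """One semi-gradient Sarsa step.

    x      : features of the current (state, action) pair
    x_next : features of the next (state, action) pair chosen by the same policy
    done   : True if the next state is terminal, so there is nothing to bootstrap from
    """
    target = reward if done else reward + gamma * q_hat(w, x_next)
    td_error = target - q_hat(w, x)
    return [wi + alpha * td_error * xi for wi, xi in zip(w, x)]


w_new = sarsa_update([1.0, 2.0], [1.0, 0.0], 1.0, [0.0, 1.0], done=False)
w_terminal = sarsa_update([1.0, 2.0], [1.0, 0.0], 1.0, [0.0, 1.0], done=True)
```

In the first call the target is 1 + 0.9 × 2 = 2.8 and the prediction is 1, so only the weight attached to the active feature moves.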

Sarsa for continuing tasks.

![TDcontinuous.png](attachment:TDcontinuous.png)


## Q-Learning

Q-Learning is off-policy: one policy π (for example ε-greedy) is used to take actions, while a second, greedy policy is used to perform the value updates. For episodic tasks:

![qLearnEpisodic.png](attachment:qLearnEpisodic.png)

For continuing tasks

![qLearnContinuous.png](attachment:qLearnContinuous.png)

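Sarsa compared to Q-Learning: Sarsa is on-policy and bootstraps from the action actually taken in the next state, whereas Q-Learning is off-policy and bootstraps from the greedy (maximum-value) action.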
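
The two algorithms differ only in how the TD target is formed. The sketch below makes that difference concrete; the function names and the example numbers are made up for illustration.

```python
def sarsa_target(reward, q_next, next_action, gamma=0.9):
    """On-policy target: bootstrap from the action the behaviour policy actually chose."""
    return reward + gamma * q_next[next_action]


def q_learning_target(reward, q_next, gamma=0.9):
    """Off-policy target: bootstrap from the greedy (maximum-value) action."""
    return reward + gamma * max(q_next)


# Hypothetical action values in the next state, one entry per action.
q_next = [1.0, 3.0, 2.0]
sarsa = sarsa_target(1.0, q_next, next_action=2)
q_learning = q_learning_target(1.0, q_next)
```

Here the behaviour policy happened to pick action 2, so Sarsa bootstraps from 2.0 while Q-Learning bootstraps from 3.0, whichever action is taken next.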

## Deep Q Network

Playing Atari games with a deep neural network trained by RL.

![atariInputSpace.png](attachment:atariInputSpace.png)

A Deep Q Network produces a Q value for every possible action in a single forward pass, all actions at once. The action can then be chosen stochastically, or by taking the one with the maximum value.

![forwardPassAllActions.png](attachment:forwardPassAllActions.png)
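
One common way to turn the vector of Q values into an action is ε-greedy selection, sketched below. The function name and the `rng` parameter are assumptions for this example; `q_values` stands in for the output of one forward pass.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest Q value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.9, 0.3]  # hypothetical network output for three actions
greedy_action = select_action(q_values, epsilon=0.0)
random_action = select_action(q_values, epsilon=1.0, rng=random.Random(0))
```

With `epsilon=0.0` the choice is always greedy, and with `epsilon=1.0` it is always uniformly random.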

A CNN helps capture spatial representations and, since 4 frames are stacked, some temporal information is also extracted between frames. The convolutional layers use ReLU activations and are followed by a fully connected hidden layer with ReLU and a fully connected output layer that produces the action values.

![atariNetworkArch.png](attachment:atariNetworkArch.png)
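
To get a feel for the sizes involved, the sketch below traces the feature-map shapes through the convolutional stack. The layer settings are the ones reported in Mnih et al., 2015 (an 84x84 input with 4 stacked frames, then three convolutions without padding); they are not stated in these notes, so treat them as an assumption about what the figure shows.

```python
def conv_out(size, kernel, stride):
    """Output width of an unpadded convolution along one spatial dimension."""
    return (size - kernel) // stride + 1


# (output channels, kernel size, stride), assumed from Mnih et al., 2015.
layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size, channels = 84, 4  # 84x84 frames, 4 stacked
shapes = []
for out_channels, kernel, stride in layers:
    size = conv_out(size, kernel, stride)
    channels = out_channels
    shapes.append((channels, size, size))

# Number of inputs to the first fully connected layer.
flat = channels * size * size
```

The flattened features then feed the fully connected hidden layer, and the output layer has one unit per action.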

## Experience Replay

A sequence of experience tuples can be highly correlated. Learning from them in order risks getting trapped (or at least biased) by the correlation within a run of highly similar experiences, which can cause the action values to oscillate or diverge. Sampling from stored experiences instead helps prevent this. This is related to the explore-exploit trade-off.

Experience replay frames reinforcement learning as a supervised learning problem, using a batch of experiences as training data. A replay buffer stores experiences so that we can sample from them later, with some randomness, to learn. We can therefore learn multiple times from the same experience, which is useful for rare and valuable ones.
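
A replay buffer can be sketched with a fixed-size `deque` and uniform random sampling. The class and field names here are assumptions for illustration, and uniform sampling is only the simplest option.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size store of experience tuples with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest item when full.
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return `batch_size` distinct stored experiences in random order."""
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
batch = buffer.sample(2)
```

Because sampling is random, consecutive (and therefore correlated) experiences are unlikely to land in the same batch, and an old experience can be drawn again as long as it is still in the buffer.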


## Fixed Q Targets

Q-Learning is a form of TD learning.

The goal in Q-Learning is to reduce the difference between the TD target and the currently predicted Q value.

![QLearningUpdate.png](attachment:QLearningUpdate.png)

The target and the prediction depend on the same parameters w, so the target moves every time the parameters are updated. This is not mathematically correct.

![movingTarget.png](attachment:movingTarget.png)

To address this, we take a copy of w and hold it fixed while generating the targets, which stabilizes the learning algorithm.

![fixedQ.png](attachment:fixedQ.png)
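
A minimal sketch of fixed targets, assuming a linear q̂ and a hypothetical sync interval `SYNC_EVERY`. The target is computed from the frozen copy `w_target`, while only `w` is updated; every few steps the copy is refreshed.

```python
def q_hat(w, x):
    """Linear action-value estimate for a feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def fixed_target_update(w, w_target, x, reward, next_xs, done, alpha, gamma):
    """One Q-Learning step whose TD target uses the frozen weights w_target.

    next_xs holds one feature vector per action available in the next state.
    """
    max_next = 0.0 if done else max(q_hat(w_target, nx) for nx in next_xs)
    td_error = reward + gamma * max_next - q_hat(w, x)
    return [wi + alpha * td_error * xi for wi, xi in zip(w, x)]


SYNC_EVERY = 2  # hypothetical number of steps between target refreshes
w = [0.0, 0.0]
w_target = list(w)
for step in range(1, 5):
    # The same toy transition is replayed on every step to keep the example small.
    w = fixed_target_update(
        w, w_target, x=[1.0, 0.0], reward=1.0,
        next_xs=[[0.0, 1.0], [1.0, 0.0]], done=False, alpha=0.5, gamma=0.9,
    )
    if step % SYNC_EVERY == 0:
        w_target = list(w)  # refresh the frozen copy
```

Between refreshes the target stays put even though `w` changes, so each update chases a stationary value rather than a moving one.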

## Deep Q-Learning Algorithm

The algorithm used for the Atari games can be summarised as follows:

![algoDeepQLearning.png](attachment:algoDeepQLearning.png)
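
The overall loop can be sketched on a toy problem. Everything specific below is an assumption made to keep the example small: a 5-state corridor stands in for the Atari emulator, a table stands in for the network weights, the replay memory is unbounded, and the hyperparameters are arbitrary. What it keeps from the algorithm is the structure: act ε-greedily, store the transition, learn from a random minibatch, and periodically refresh the fixed target.

```python
import random


def train_dqn_sketch(n_states=5, episodes=200, batch_size=8, sync_every=20,
                     alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    # Table standing in for the weights w; q_target is the frozen copy.
    q = [[0.0, 0.0] for _ in range(n_states)]  # actions: 0 = left, 1 = right
    q_target = [row[:] for row in q]
    replay = []  # unbounded for brevity; a real buffer has a fixed capacity
    steps = 0
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            # Choose an action with the epsilon-greedy behaviour policy.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            # Toy environment: reward 1 for reaching the right-most state.
            next_state = max(0, state - 1) if action == 0 else state + 1
            done = next_state == n_states - 1
            reward = 1.0 if done else 0.0
            replay.append((state, action, reward, next_state, done))
            # Learn from a random minibatch using the fixed target.
            for s, a, r, s2, d in rng.sample(replay, min(batch_size, len(replay))):
                target = r if d else r + gamma * max(q_target[s2])
                q[s][a] += alpha * (target - q[s][a])
            steps += 1
            if steps % sync_every == 0:
                q_target = [row[:] for row in q]
            if done:
                break
            state = next_state
    return q


q_table = train_dqn_sketch()
```

In the full algorithm the table update is replaced by a gradient step on the network, and the states are preprocessed stacks of frames rather than integers.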

Mnih et al., 2015. Human-level control through deep reinforcement learning. (DQN paper)
https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf

He et al., 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. (weight initialization)
https://arxiv.org/abs/1502.01852

## DQN Improvements

Readings

Thrun & Schwartz, 1993. Issues in Using Function Approximation for Reinforcement Learning. (overestimation of Q-values)
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.73.3097

van Hasselt et al., 2015. Deep Reinforcement Learning with Double Q-Learning.
https://arxiv.org/abs/1509.06461

Schaul et al., 2016. Prioritized Experience Replay.
https://arxiv.org/abs/1511.05952

Wang et al., 2015. Dueling Network Architectures for Deep Reinforcement Learning.
https://arxiv.org/abs/1511.06581

Hausknecht & Stone, 2015. Deep Recurrent Q-Learning for Partially Observable MDPs.
https://arxiv.org/abs/1507.06527


## Implementation

https://keon.io/deep-q-learning/
http://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
