# Double Deep Q-Learning

## Table of Contents
- [1 - Motivation](#1)
- [2 - Example](#2)
- [3 - Application to Deep Q-Learning](#3)

Full paper: [Deep Reinforcement Learning with Double Q-Learning (2015)](https://arxiv.org/pdf/1509.06461.pdf)

<a name='1'></a>
# 1 - Motivation

Conventional Q-Learning suffers from an overestimation bias caused by the maximization step in the bootstrap target: taking the max over noisy action-value estimates tends to select positive estimation errors. This can harm learning, as illustrated in Figure 1.

<img src="images/overestimation_dqn.png">
<caption><center><b>Figure 1</b>: Overestimation by DQN</center></caption>

<a name='2'></a>
# 2 - Example

**Problem understanding**

Let's say there are 100 people who all have the same true weight of 150 lbs. We have a weighing scale whose reading can be off by +/- 1 lb. We measure person 1's weight and store it in $X^1$, person 2's weight in $X^2$, and so on.
    
Let's calculate $Y=\max_{i} X^i$.
    
With zero noise, $Y$ equals 150 lbs, but as the measurement noise grows, $Y$ grows too. Under noise, the maximum is therefore biased upward: it overestimates the true maximum weight of 150 lbs.
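The bias can be checked with a small simulation. The sketch below is an illustration, not part of the original example: it assumes the scale error is uniform on `[-noise, +noise]` (the text only says "+/- 1 lb"), and the function name `max_of_noisy_measurements` is hypothetical.

```python
import numpy as np

TRUE_WEIGHT = 150.0  # every person's true weight in lbs
N_PEOPLE = 100


def max_of_noisy_measurements(noise, n_trials=2_000, seed=0):
    """Average of Y = max_i X^i over many repeated experiments.

    Assumption: the scale error is uniform on [-noise, +noise].
    """
    rng = np.random.default_rng(seed)
    # One row per repeated experiment, one column per person.
    errors = noise * rng.uniform(-1.0, 1.0, size=(n_trials, N_PEOPLE))
    measurements = TRUE_WEIGHT + errors
    return measurements.max(axis=1).mean()


for noise in (0.0, 1.0, 5.0):
    mean_y = max_of_noisy_measurements(noise)
    print(f"noise +/-{noise}: mean Y = {mean_y:.2f} lbs")
```

Under this uniform-noise assumption, the average of $Y$ moves away from 150 lbs and toward the upper edge of the noise range as the noise grows, even though every true weight is exactly 150 lbs.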

**Solution**

Let's focus on an idea to solve this problem.
    
Measure each person's weight twice (with independent noise): $X_{1}^i$ and $X_{2}^i$.

Then set 
$$n = \operatorname*{argmax}_{i} X_{1}^i$$
$$Y = X_{2}^n$$
    
where $n$ is the index of the person with the highest first measurement. To estimate the max, take that same person's second measurement. Because the noise in the second measurement is independent of the noise that drove the selection, this estimate is no longer biased upward.
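The two estimators can be compared side by side. This is a minimal sketch under the same assumption of uniform scale error; `compare_estimators` and its parameters are hypothetical names introduced for illustration.

```python
import numpy as np

TRUE_WEIGHT = 150.0  # every person's true weight in lbs
N_PEOPLE = 100


def compare_estimators(noise=1.0, n_trials=2_000, seed=0):
    """Return (mean single-measurement max, mean double estimate)."""
    rng = np.random.default_rng(seed)
    shape = (n_trials, N_PEOPLE)
    # Two independent measurements per person (assumed uniform error).
    x1 = TRUE_WEIGHT + noise * rng.uniform(-1.0, 1.0, size=shape)
    x2 = TRUE_WEIGHT + noise * rng.uniform(-1.0, 1.0, size=shape)

    single = x1.max(axis=1)               # Y = max_i X_1^i
    n = x1.argmax(axis=1)                 # n = argmax_i X_1^i
    double = x2[np.arange(n_trials), n]   # Y = X_2^n
    return single.mean(), double.mean()


single_mean, double_mean = compare_estimators()
print(f"single max:      {single_mean:.2f} lbs")
print(f"double estimate: {double_mean:.2f} lbs")
```

In this setup the single-measurement max lands well above 150 lbs, while the double estimate averages out close to the true value, because the second measurement's error is independent of which person was selected.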

<a name='3'></a>
# 3 - Application to Deep Q-Learning

Double Q-Learning addresses the overestimation problem by applying the idea above. The key idea is to decouple the selection of the action from its evaluation in the maximization performed for the bootstrap target. This change was shown to reduce the harmful overestimations present in DQN, thereby improving performance. In Double Q-Learning the loss is calculated using:

$$(R_{t+1} + \gamma_{t+1} q_{\theta'}(S_{t+1}, \operatorname*{argmax}_{a'} q_{\theta}(S_{t+1}, a')) - q_{\theta}(S_{t}, A_{t}))^2$$

instead of the standard DQN loss:

$$(R_{t+1} + \gamma_{t+1} \max_{a'} q_{\theta'}(S_{t+1}, a') - q_{\theta}(S_{t}, A_{t}))^2$$
    
In other words: we find the action with the highest Q-value under the online network $q_{\theta}$, then use the target network $q_{\theta'}$ to evaluate that action. Note that the decoupling idea itself is independent of the target network trick used to avoid shifting targets; here the target network simply serves as the second estimator.
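The difference between the two bootstrap targets can be sketched in a few lines of NumPy. The function names, the toy Q-values, and the convention that $\gamma_{t+1} = 0$ marks a terminal transition are assumptions made for illustration; in practice the Q-value arrays would come from the online and target networks evaluated on $S_{t+1}$.

```python
import numpy as np


def dqn_targets(rewards, gammas, q_target_next):
    """Standard DQN: the target network both selects and evaluates a'."""
    return rewards + gammas * q_target_next.max(axis=1)


def double_dqn_targets(rewards, gammas, q_online_next, q_target_next):
    """Double DQN: online network selects a', target network evaluates it."""
    # argmax_a' q_theta(S_{t+1}, a'), one action index per transition
    best_actions = q_online_next.argmax(axis=1)
    batch_idx = np.arange(len(rewards))
    # q_theta'(S_{t+1}, best action)
    return rewards + gammas * q_target_next[batch_idx, best_actions]


# Toy batch: two transitions, three actions each (made-up numbers).
rewards = np.array([1.0, 0.0])
gammas = np.array([0.9, 0.0])  # assumed: 0 marks a terminal transition
q_online_next = np.array([[1.0, 3.0, 2.0],
                          [0.5, 0.1, 0.2]])
q_target_next = np.array([[4.0, 2.0, 1.0],
                          [0.3, 0.6, 0.4]])

print("DQN targets:       ", dqn_targets(rewards, gammas, q_target_next))
print("Double DQN targets:", double_dqn_targets(
    rewards, gammas, q_online_next, q_target_next))
```

In the first transition the target network's own maximum is action 0, but the online network prefers action 1, so the Double DQN target uses the target network's value for action 1 and comes out lower than the standard DQN target. The loss is then the squared difference between these targets and $q_{\theta}(S_{t}, A_{t})$.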

