
section: Refresh
# Function Approximation

We will approximate the state-value function $v$ and/or the action-value
function $q$, choosing from a family of functions parametrized by a weight
vector $→w∈ℝ^d$.

We denote the approximations as
$$\begin{gathered}
  v̂(s, →w),\\
  q̂(s, a, →w).
\end{gathered}$$


We utilize the _Mean Squared Value Error_ objective, denoted $\overline{VE}$:
$$\overline{VE}(→w) ≝ ∑_{s∈𝓢} μ(s) \left[v_π(s) - v̂(s, →w)\right]^2,$$
where the state distribution $μ(s)$ is usually the on-policy distribution.
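
For a small tabular case, the objective can be evaluated directly. The following is a minimal sketch; the function name and the three-state example values are made up for illustration.

```python
import numpy as np

def mean_squared_value_error(mu, v_pi, v_hat):
    """Return sum_s mu(s) * (v_pi(s) - v_hat(s))^2 for arrays indexed by state."""
    mu, v_pi, v_hat = map(np.asarray, (mu, v_pi, v_hat))
    return float(np.sum(mu * (v_pi - v_hat) ** 2))

# Hypothetical 3-state example.
mu = np.array([0.5, 0.25, 0.25])     # state distribution
v_pi = np.array([1.0, 2.0, 0.0])     # true values
v_hat = np.array([1.0, 1.0, 2.0])    # approximated values
ve = mean_squared_value_error(mu, v_pi, v_hat)
```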

---
# Gradient and Semi-Gradient Methods

The function approximation (i.e., the weight vector $→w$) is usually optimized
using gradient methods, for example as
$$\begin{aligned}
  →w_{t+1} &← →w_t - \frac{1}{2} α ∇ \left[v_π(S_t) - v̂(S_t, →w_t)\right]^2\\
           &← →w_t + α\left[v_π(S_t) - v̂(S_t, →w_t)\right] ∇ v̂(S_t, →w_t).\\
\end{aligned}$$

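As usual, $v_π(S_t)$ is estimated by a suitable sample. For example, in Monte
Carlo methods, we use the episodic return $G_t$, and in temporal difference
methods, we employ bootstrapping and use $R_{t+1} + γv̂(S_{t+1}, →w)$. In the
latter case the target itself depends on $→w$, but we do not differentiate
through it, hence the name _semi-gradient_ methods.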
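
A minimal sketch of one semi-gradient TD(0) step, assuming a linear approximation $v̂(s, →w) = →w^T →x(s)$ with a feature vector $→x(s)$. The linear form, the function name and the `done` flag are assumptions for illustration only.

```python
import numpy as np

def semi_gradient_td0_update(w, x, reward, x_next, done, alpha=0.1, gamma=0.99):
    """One semi-gradient TD(0) step for a linear v_hat(s, w) = w @ x(s)."""
    # Bootstrapped target; treated as a constant, so no gradient flows through it.
    target = reward + (0.0 if done else gamma * (w @ x_next))
    td_error = target - w @ x
    # For a linear approximation, the gradient of v_hat with respect to w is x.
    return w + alpha * td_error * x

w = np.zeros(2)
w = semi_gradient_td0_update(
    w, x=np.array([1.0, 0.0]), reward=1.0, x_next=np.array([0.0, 1.0]), done=False)
```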

---
section: DQN
# Deep Q Network

An off-policy Q-learning algorithm with a convolutional neural network
approximating the action-value function.

Training can be extremely brittle (and can even diverge as shown earlier).

![w=65%,h=center](images/dqn_architecture.png)

---
# Deep Q Networks

- Preprocessing: $210×160$ 128-color images are converted to grayscale and
  then resized to $84×84$.

- A frame skipping technique is used, i.e., only every $4^\textrm{th}$ frame
  (out of 60 per second) is considered, and the selected action is repeated on
  the other frames.

- The input to the network consists of the last $4$ frames (considering only
  the frames kept by frame skipping), i.e., an image with $4$ channels.

- The network is fairly standard, consisting of
  - 32 filters of size $8×8$ with stride 4 and ReLU,
  - 64 filters of size $4×4$ with stride 2 and ReLU,
  - 64 filters of size $3×3$ with stride 1 and ReLU,
  - fully connected layer with 512 units and ReLU,
  - output layer with 18 output units (one for each action)
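
The spatial sizes after the three convolutions can be checked with a few lines of arithmetic. This sketch assumes convolutions without padding, which the slide does not state explicitly.

```python
def conv_output_size(size, kernel, stride):
    """Spatial size after a convolution without padding."""
    return (size - kernel) // stride + 1

size = 84
sizes = []
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel, stride)
    sizes.append(size)

# 64 channels after the last convolution feed the fully connected layer.
flattened = size * size * 64
```

Under this assumption, the fully connected layer receives a $7×7×64$ feature map.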

---
# Deep Q Networks

- The network is trained with RMSProp to minimize the following loss:
  $$𝓛 ≝ 𝔼_{(s, a, r, s')∼\mathit{data}}\left[(r + γ \max_{a'} Q(s', a'; θ̄) - Q(s, a; θ))^2\right].$$

- An $ε$-greedy behavior policy is utilized.


Important improvements:

- experience replay: the generated transitions are stored in a buffer as $(s, a, r,
  s')$ quadruples, and for training, transitions are sampled uniformly;

- separate target network $θ̄$: to prevent instabilities, a separate target
  network is used to estimate the value of the next state in the target. Its
  weights are not trained, but copied from the trained network once in a while;

- clipping of the TD error $(r + γ \max_{a'} Q(s', a'; θ̄) - Q(s, a; θ))$ to $[-1, 1]$.
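
A minimal NumPy sketch of the loss for one minibatch, without the clipping. The function name, the array layout, and the `dones` mask (which disables bootstrapping on terminal transitions) are assumptions for illustration, not details given on the slide.

```python
import numpy as np

def dqn_loss(q_online, q_target_next, actions, rewards, dones, gamma=0.99):
    """Mean squared TD error for a batch.

    q_online:      Q(s, .; theta), shape [batch, actions]
    q_target_next: Q(s', .; theta_bar), shape [batch, actions]
    """
    # The target uses the target network; terminal transitions do not bootstrap.
    targets = rewards + gamma * (1.0 - dones) * q_target_next.max(axis=1)
    predictions = q_online[np.arange(len(actions)), actions]
    return float(np.mean((targets - predictions) ** 2))

loss = dqn_loss(
    q_online=np.array([[1.0, 2.0], [0.0, 0.5]]),
    q_target_next=np.array([[0.0, 3.0], [1.0, 1.0]]),
    actions=np.array([1, 0]),
    rewards=np.array([1.0, 0.0]),
    dones=np.array([0.0, 1.0]),
    gamma=0.5,
)
```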

---
class: tablefull
# Deep Q Networks Hyperparameters

| Hyperparameter | Value |
|----------------|-------|
| minibatch size | 32 |
| replay buffer size | 1M |
| target network update frequency | 10k |
| discount factor | 0.99 |
| training frames | 50M |
| RMSProp learning rate and momentum | 0.00025, 0.95 |
| initial $ε$, final $ε$ and frame of final $ε$ | 1.0, 0.1, 1M |
| replay start size | 50k |
| no-op max | 30 |
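
The $ε$ row of the table can be turned into a schedule. The sketch below assumes linear annealing between the initial and the final value; the table itself gives only the endpoints, and the function name is hypothetical.

```python
def epsilon_at(frame, initial=1.0, final=0.1, final_frame=1_000_000):
    """Linearly anneal epsilon from `initial` to `final`, then keep it constant."""
    fraction = min(frame / final_frame, 1.0)
    return initial + fraction * (final - initial)
```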

---
# Rainbow

There have been many suggested improvements to the DQN architecture. At the end
of 2017, the _Rainbow: Combining Improvements in Deep Reinforcement Learning_
paper combined six of them into a single architecture called _Rainbow_.


![w=38%,h=center](images/rainbow_results.png)

---
section: DDQN
# Rainbow DQN Extensions

## Double Q-learning

Similarly to double Q-learning, instead of
$$r + γ \max_{a'} Q(s', a'; θ̄) - Q(s, a; θ),$$
we minimize
$$r + γ Q(s', \argmax_{a'}Q(s', a'; θ); θ̄) - Q(s, a; θ).$$
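
The difference between the two targets can be sketched as follows: the online network selects the action and the target network evaluates it. The function name and the toy values are illustrative assumptions.

```python
import numpy as np

def double_dqn_targets(q_online_next, q_target_next, rewards, gamma=0.99):
    """Select a' with the online network, evaluate it with the target network."""
    best_actions = q_online_next.argmax(axis=1)
    evaluated = q_target_next[np.arange(len(rewards)), best_actions]
    return rewards + gamma * evaluated

q_online_next = np.array([[1.0, 2.0]])   # Q(s', .; theta)
q_target_next = np.array([[5.0, 3.0]])   # Q(s', .; theta_bar)
rewards = np.array([1.0])

ddqn = double_dqn_targets(q_online_next, q_target_next, rewards, gamma=0.5)
dqn = rewards + 0.5 * q_target_next.max(axis=1)
```

In this toy example, the double Q-learning target is lower than the DQN target, because the maximizing action of the target network is not the one chosen by the online network.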


![w=30%,h=center](images/ddqn_errors.png)

---
# Rainbow DQN Extensions

## Double Q-learning

![w=100%,h=center](images/ddqn_errors_analysis.png)

---
# Rainbow DQN Extensions

## Double Q-learning

![w=60%,h=center](images/ddqn_analysis.png)

---
# Rainbow DQN Extensions

## Double Q-learning

![w=40%,h=center,mh=40%,v=middle](images/ddqn_results_5min.png)

![w=55%,h=center,mh=40%,v=middle](images/ddqn_results_30min.png)

---
# Rainbow DQN Extensions

## Prioritized Replay

Instead of sampling the transitions uniformly from the replay buffer,
we prefer those with a large TD error. Therefore, we sample transitions
with probability
$$p_t ∝ \Big|r + γ \max_{a'} Q(s', a'; θ̄) - Q(s, a; θ)\Big|^ω,$$
where $ω$ controls the shape of the distribution (which is uniform for $ω=0$
and proportional to the absolute TD error for $ω=1$).


New transitions are inserted into the replay buffer with the maximum
probability, so that every encountered transition is likely to be replayed.
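
A minimal sketch of computing the sampling probabilities from stored TD errors. The small constant `eps`, which keeps every priority positive, is an assumption of this sketch and does not appear in the formula above; the function name is also hypothetical.

```python
import numpy as np

def sampling_probabilities(td_errors, omega, eps=1e-6):
    """p_t proportional to |TD error|^omega."""
    priorities = (np.abs(td_errors) + eps) ** omega
    return priorities / priorities.sum()

td_errors = np.array([0.1, -0.4, 0.5])
uniform = sampling_probabilities(td_errors, omega=0.0)
proportional = sampling_probabilities(td_errors, omega=1.0)

# Draw a batch of transition indices according to the priorities.
rng = np.random.default_rng(0)
batch = rng.choice(len(td_errors), size=2, p=proportional)
```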

---
section: PriRep
# Rainbow DQN Extensions

## Prioritized Replay

Because we now sample transitions according to $p_t$ instead of uniformly,
the on-policy distribution and the sampling distribution differ. To compensate,
we utilize importance sampling with ratio
$$ρ_t = \left( \frac{1/N}{p_t} \right) ^β.$$


In fact, “for stability reasons”, the authors utilize the normalized ratio
$$ρ_t / \max_i ρ_i.$$
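
The normalized importance sampling weights can be sketched as below. Here the maximum is taken over the array passed in; whether it is computed over a batch or the whole buffer is left open in this sketch, and the function name is hypothetical.

```python
import numpy as np

def importance_weights(probabilities, beta):
    """rho_t = ((1/N) / p_t)^beta, divided by the largest rho."""
    n = len(probabilities)
    rho = (1.0 / (n * probabilities)) ** beta
    return rho / rho.max()

p = np.array([0.1, 0.4, 0.5])
weights = importance_weights(p, beta=1.0)
```

After normalization, all weights are at most $1$, so the correction only scales updates down.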

---
# Rainbow DQN Extensions

## Prioritized Replay

![w=75%,h=center](images/prioritized_dqn_algorithm.png)

---
section: Duelling
# Rainbow DQN Extensions

## Duelling Networks

Instead of computing $Q(s, a; θ)$ directly, we compose it from the following quantities:
- value function for a given state $s$,
- advantage function computing an _advantage_ of using action $a$ in state $s$.

$$Q(s, a) ≝ V(f(s; ζ); η) + A(f(s; ζ), a; ψ) - \frac{\sum_{a' ∈ 𝓐} A(f(s; ζ), a'; ψ)}{|𝓐|}$$

![w=25%,h=center](images/dqn_dueling_architecture.png)
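
A minimal sketch of the aggregation step, given the outputs of the value and advantage heads. The function name and the array shapes are assumptions for illustration.

```python
import numpy as np

def dueling_q_values(value, advantages):
    """Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a').

    value:      shape [batch, 1]
    advantages: shape [batch, actions]
    """
    return value + advantages - advantages.mean(axis=1, keepdims=True)

value = np.array([[2.0]])
advantages = np.array([[1.0, 3.0, 5.0]])
q = dueling_q_values(value, advantages)
```

Subtracting the mean advantage makes the average of $Q(s, a)$ over actions equal to $V$, which removes the ambiguity between the two heads.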

---
# Rainbow DQN Extensions

## Duelling Networks

![w=100%,h=center](images/dqn_dueling_corridor.png)

---
# Rainbow DQN Extensions

## Duelling Networks

![w=32%,h=center](images/dqn_dueling_visualization.png)

---
# Rainbow DQN Extensions

## Duelling Networks

![w=70%,h=center,mh=80%,v=middle](images/dqn_dueling_results.png)

---
# Rainbow DQN Extensions
## Multi-step Learning

Instead of Q-learning, we use an $n$-step variant of Q-learning (to be exact,
we use $n$-step Expected Sarsa) to minimize
$$∑_{i=1}^n γ^{i-1} r_i + γ^n \max_{a'} Q(s', a'; θ̄) - Q(s, a; θ).$$


This changes the off-policy algorithm to on-policy, because the intermediate
rewards are generated by the behavior policy, but it is not discussed in any
way by the authors.
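
The $n$-step target can be accumulated backwards from the bootstrapped value. A minimal sketch, where `bootstrap_value` stands for $\max_{a'} Q(s', a'; θ̄)$ and the function name is hypothetical:

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """sum_{i=1}^n gamma^(i-1) r_i + gamma^n * bootstrap_value."""
    target = bootstrap_value
    # Accumulate backwards: G = r_i + gamma * G.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target

target = n_step_target([1.0, 0.0, 2.0], bootstrap_value=4.0, gamma=0.5)
```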

---
section: NoisyNets
# Rainbow DQN Extensions

## Noisy Nets

Noisy Nets are neural networks whose weights and biases are perturbed by
a parametric function of noise.


The parameters $→θ$ are represented as
$$→θ ≝ →μ + →σ ⊙ →ε,$$
where $→ε$ is zero-mean noise with fixed statistics. We therefore learn the
parameters $→ζ ≝ (→μ, →σ)$.


Therefore, a fully connected layer
$$→y = →w →x + →b$$
is represented in the following way in Noisy Nets:
$$→y = (→μ_w + →σ_w ⊙ →ε_w) →x + (→μ_b + →σ_b ⊙ →ε_b).$$
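
A minimal sketch of the forward pass of such a layer, here with independent Gaussian noise. The function name and the argument layout are assumptions for illustration.

```python
import numpy as np

def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, rng):
    """y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b)."""
    eps_w = rng.standard_normal(mu_w.shape)
    eps_b = rng.standard_normal(mu_b.shape)
    weights = mu_w + sigma_w * eps_w   # element-wise perturbation
    bias = mu_b + sigma_b * eps_b
    return weights @ x + bias

rng = np.random.default_rng(0)
mu_w = np.array([[1.0, 2.0], [3.0, 4.0]])
mu_b = np.array([0.5, -0.5])
x = np.array([1.0, 1.0])

# With zero sigma, the layer reduces to an ordinary fully connected layer.
y_plain = noisy_linear(x, mu_w, np.zeros((2, 2)), mu_b, np.zeros(2), rng)
y_noisy = noisy_linear(x, mu_w, np.full((2, 2), 0.1), mu_b, np.full(2, 0.1), rng)
```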

---
# Rainbow DQN Extensions

## Noisy Nets

The noise $ε$ can be, for example, independent Gaussian noise. However, for
performance reasons, factorized Gaussian noise is used to generate a matrix of
noise. If $ε_{i, j}$ is the noise corresponding to the weight connecting input
$i$ and output $j$ of a layer, we generate independent noise $ε_i$ for every
input neuron, independent noise $ε_j$ for every output neuron, and set
$$ε_{i,j} = f(ε_i) f(ε_j)$$
for $f(x) = \operatorname{sign}(x) \sqrt{|x|}$.
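
A minimal sketch of generating the factorized noise matrix. The function names are hypothetical, and the matrix is indexed as `noise[i, j]` for input $i$ and output $j$.

```python
import numpy as np

def f(x):
    return np.sign(x) * np.sqrt(np.abs(x))

def factorized_noise(inputs, outputs, rng):
    """Build an [inputs, outputs] noise matrix from inputs + outputs samples."""
    eps_in = f(rng.standard_normal(inputs))
    eps_out = f(rng.standard_normal(outputs))
    # noise[i, j] = f(eps_i) * f(eps_j)
    return np.outer(eps_in, eps_out)

noise = factorized_noise(4, 3, np.random.default_rng(0))
```

Only $4 + 3$ random numbers are drawn here instead of $4 · 3$, which is where the performance benefit comes from.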


The authors generate noise samples for every batch, sharing the noise for all
batch instances.


### Deep Q Networks
When training a DQN, $ε$-greedy is no longer used and all policies are greedy,
and all fully connected layers are parametrized as noisy nets.

---
# Rainbow DQN Extensions

## Noisy Nets

![w=50%,h=center](images/dqn_noisynets_results.png)

![w=65%,h=center](images/dqn_noisynets_curves.png)

---
# Rainbow DQN Extensions

## Noisy Nets

![w=100%](images/dqn_noisynets_noise_study.png)

---
section: DistRL
# Rainbow DQN Extensions

## Distributional RL

Instead of an expected return $Q(s, a)$, we could estimate the distribution of
returns $Z(s, a)$.

These distributions satisfy a distributional Bellman equation:
$$Z(s, a) = R(s, a) + γ Z(s', a').$$


The authors of the paper prove properties of the distributional Bellman
operator similar to those of the regular Bellman operator, mainly that it is
a contraction under a suitable metric (the Wasserstein metric).

---
# Rainbow DQN Extensions

## Distributional RL

The distribution of returns is modeled as a discrete distribution parametrized
by the number of atoms $N ∈ ℕ$ and by $V_\textrm{MIN}, V_\textrm{MAX} ∈ ℝ$.
The support of the distribution consists of the atoms
$$\{z_i ≝ V_\textrm{MIN} + i Δz : 0 ≤ i < N\}\textrm{ for }Δz ≝ \frac{V_\textrm{MAX} - V_\textrm{MIN}}{N-1}.$$


The atom probabilities are predicted using a $\softmax$ distribution as
$$Z_→θ(s, a) = \left\{z_i\textrm{ with probability }p_i = \frac{e^{f_i(s, a)}}{∑_j e^{f_j(s, a)}}\right\}.$$
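
A minimal sketch of the support and of the atom probabilities. The function names and the example values $N=5$, $V_\textrm{MIN}=-2$, $V_\textrm{MAX}=2$ are illustrative assumptions.

```python
import numpy as np

def support(n_atoms, v_min, v_max):
    """Atoms z_i = v_min + i * delta_z for 0 <= i < n_atoms."""
    return np.linspace(v_min, v_max, n_atoms)

def atom_probabilities(logits):
    """Softmax over the last axis, shifted by the maximum for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

z = support(n_atoms=5, v_min=-2.0, v_max=2.0)
p = atom_probabilities(np.zeros(5))   # f_i(s, a) = 0 gives a uniform distribution
q_value = float(p @ z)                # the mean of Z(s, a) recovers Q(s, a)
```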

---
# Rainbow DQN Extensions

## Distributional RL

![w=30%,f=right](images/dqn_distributional_operator.png)

After the Bellman update, the support of the distribution $R(s, a) + γZ(s', a')$
is not the same as the original support. We therefore project it to the original
support by proportionally mapping each atom of the Bellman update to immediate
neighbors in the original support.


$$Φ\big(R(s, a) + γZ(s', a')\big)_i ≝
  ∑_{j=0}^{N-1} \left[ 1 - \frac{\left|[r + γz_j]_{V_\textrm{MIN}}^{V_\textrm{MAX}}-z_i\right|}{Δz} \right]_0^1 p_j(s', a').$$
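
The projection can be sketched in a vectorized way for a single transition. The function name is hypothetical and the atoms `z` are assumed to be equally spaced.

```python
import numpy as np

def project(reward, gamma, z, p_next):
    """Project the distribution of reward + gamma * Z(s', a') onto the atoms z."""
    delta_z = z[1] - z[0]
    # Shifted atoms [r + gamma * z_j] clipped to [V_MIN, V_MAX], shape [N].
    shifted = np.clip(reward + gamma * z, z[0], z[-1])
    # Weights [1 - |shifted_j - z_i| / delta_z] clipped to [0, 1], shape [N, N].
    weights = np.clip(1.0 - np.abs(shifted[None, :] - z[:, None]) / delta_z, 0.0, 1.0)
    return weights @ p_next

z = np.array([0.0, 1.0, 2.0])
p_next = np.array([0.0, 1.0, 0.0])    # all mass on the atom z = 1
projected = project(reward=0.5, gamma=1.0, z=z, p_next=p_next)
```

In this example the mass moves to $1.5$, which lies halfway between the atoms $1$ and $2$, so each of them receives one half.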


The network is trained to minimize the Kullback-Leibler divergence between the
current distribution and the (mapped) distribution of the one-step update
$$D_\textrm{KL}\big(Φ(R + γ \max_{a'} Z(s', a')) || Z(s, a)\big).$$

---
# Rainbow DQN Extensions

## Distributional RL

![w=50%,h=center](images/dqn_distributional_algorithm.png)


---
# Rainbow DQN Extensions

## Distributional RL

![w=40%,h=center](images/dqn_distributional_results.png)

![w=40%,h=center](images/dqn_distributional_example_distribution.png)

---
# Rainbow DQN Extensions

## Distributional RL

![w=100%](images/dqn_distributional_example_distributions.png)

---
# Rainbow DQN Extensions

## Distributional RL

![w=100%](images/dqn_distributional_atoms_ablation.png)

---
section: Rainbow
# Rainbow Architecture

Rainbow combines all described DQN extensions. Instead of $1$-step updates,
$n$-step updates are utilized, and KL divergence of the current and target
return distribution is minimized:
$$D_\textrm{KL}\big(Φ(G_{t:t+n} + γ^n \max_{a'} Z(s', a')) || Z(s, a)\big).$$


The prioritized replay chooses transitions according to the probability
$$p_t ∝ \Big(D_\textrm{KL}\big(Φ(G_{t:t+n} + γ^n \max_{a'} Z(s', a')) || Z(s, a)\big)\Big)^ω.$$


The network utilizes a duelling architecture, feeding the shared representation $f(s; ζ)$
into the value computation $V_i(f(s; ζ); η)$ and the advantage computation $A_i(f(s; ζ), a; ψ)$ for atom $z_i$,
and the final probability of atom $z_i$ in state $s$ and action $a$ is computed as
$$p_i(s, a) ≝
  \frac{e^{V_i(f(s; ζ); η) + A_i(f(s; ζ), a; ψ) - \sum_{a' ∈ 𝓐} A_i(f(s; ζ), a'; ψ)/|𝓐|}}
  {\sum_j e^{V_j(f(s; ζ); η) + A_j(f(s; ζ), a; ψ) - \sum_{a' ∈ 𝓐} A_j(f(s; ζ), a'; ψ)/|𝓐|}}.$$
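
A minimal sketch of this aggregation for a single state, assuming (as in the Rainbow paper) that the value head also produces one output per atom. The function name and the array shapes are illustrative assumptions.

```python
import numpy as np

def rainbow_atom_probabilities(value_logits, advantage_logits):
    """Combine the heads into p_i(s, a).

    value_logits:     shape [atoms]
    advantage_logits: shape [actions, atoms]
    returns:          shape [actions, atoms], every row sums to 1
    """
    mean_advantage = advantage_logits.mean(axis=0, keepdims=True)
    logits = value_logits[None, :] + advantage_logits - mean_advantage
    # Softmax over atoms, shifted by the maximum for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
probs = rainbow_atom_probabilities(rng.standard_normal(51), rng.standard_normal((18, 51)))
```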

---
# Rainbow Hyperparameters

![w=70%,h=center](images/rainbow_hyperparameters.png)

---
# Rainbow Results

![w=93%,mw=50%,h=center](images/rainbow_results.png)![w=50%](images/rainbow_table.png)

---
# Rainbow Results

![w=93%,mw=50%,h=center](images/rainbow_results.png)![w=93%,mw=50%,h=center](images/rainbow_results_ablations.png)

---
# Rainbow Ablations

![w=90%,h=center](images/rainbow_ablations.png)

---
# Rainbow Ablations

![w=84%,h=center](images/rainbow_ablations_per_game.png)
