title: NPFL122, Lecture 11
class: title, langtech, cc-by-nc-sa
# V-trace, PopArt Normalization, Partially Observable MDPs

## Milan Straka

### January 7, 2019

---
section: IMPALA
# IMPALA

IMPALA (**Imp**ortance Weighted **A**ctor-**L**earner **A**rchitecture) was
proposed in a Feb 2018 paper and allows a massively distributed implementation
of an actor-critic-like learning algorithm.


Compared to A3C-based agents, which communicate gradients with respect to the
parameters of the policy, IMPALA actors communicate trajectories to the
centralized learner.


![w=50%](impala_overview.pdf)![w=50%](impala_comparison.pdf)


If many actors are used, the policy used to generate a trajectory can lag behind
the latest policy. Therefore, a new **V-trace** off-policy actor-critic
algorithm is proposed.

---
# IMPALA – V-trace

Consider a trajectory $(S_t, A_t, R_{t+1})_{t=s}^{t=s+n}$ generated by
a behaviour policy $b$.


The $n$-step V-trace target for $S_s$ is defined as
$$v_s ≝ V(S_s) + ∑_{t=s}^{s+n-1} γ^{t-s} \left(∏\nolimits_{i=s}^{t-1} c_i\right) δ_t V,$$

where $δ_t V$ is the temporal difference for $V$
$$δ_t V ≝ ρ_t \big(R_{t+1} + γV(S_{t+1}) - V(S_t)\big),$$

and $ρ_t$ and $c_i$ are truncated importance sampling ratios with $ρ̄ ≥ c̄$:
$$ρ_t ≝ \min\left(ρ̄, \frac{π(A_t | S_t)}{b(A_t | S_t)}\right),~c_i ≝ \min\left(c̄, \frac{π(A_i | S_i)}{b(A_i | S_i)}\right).$$


Note that if $b=π$ and assuming $c̄ ≥ 1$ (and therefore also $ρ̄ ≥ 1$), all
$ρ_t$ and $c_i$ are equal to 1 and $v_s$ reduces to the $n$-step Bellman
target.
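
The target definition can be sketched in plain Python as follows. This is
a minimal illustration, not the implementation from the paper: the function
name `vtrace_targets` and its argument layout are assumptions made for this
example. It uses the recursive form
$v_t - V(S_t) = δ_t V + γ c_t \big(v_{t+1} - V(S_{t+1})\big)$, which follows
from the definition above when all targets run until the end of the trajectory
(so only the first returned value is a full $n$-step target).

```python
def vtrace_targets(rewards, values, bootstrap_value, pi_probs, b_probs,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace targets for all states of one trajectory (a sketch).

    rewards[t] is R_{t+1}, values[t] is V(S_t), bootstrap_value is the value
    of the state following the last transition, and pi_probs[t], b_probs[t]
    are pi(A_t|S_t) and b(A_t|S_t). All names are hypothetical.
    """
    n = len(rewards)
    next_values = list(values[1:]) + [bootstrap_value]
    targets = [0.0] * n
    acc = 0.0  # v_{t+1} - V(S_{t+1}); zero past the end of the trajectory
    for t in reversed(range(n)):
        ratio = pi_probs[t] / b_probs[t]
        rho = min(rho_bar, ratio)  # truncated IS ratio inside delta_t V
        c = min(c_bar, ratio)      # truncated IS ratio inside the product
        delta = rho * (rewards[t] + gamma * next_values[t] - values[t])
        acc = delta + gamma * c * acc
        targets[t] = values[t] + acc
    return targets
```

With `pi_probs == b_probs` and both thresholds at least 1, the first returned
target equals the usual $n$-step return bootstrapped from `bootstrap_value`.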

---
# IMPALA – V-trace

Note that the truncated IS weights $ρ_t$ and $c_i$ play different roles:


- The $ρ_t$ appears in the definition of $δ_t V$ and defines the fixed point
  of the update rule. For $ρ̄=∞$, the fixed point is the value function $v_π$;
  for $ρ̄<∞$, the fixed point is somewhere between $v_π$ and $v_b$. Notice that
  we do not compute a product of these $ρ_t$ coefficients.

- The $c_i$ impacts the speed of convergence (the contraction rate of the
  Bellman operator), not the sought policy. Because a product of the $c_i$
  ratios is computed, it plays an important role in variance reduction.


The paper utilizes $c̄=1$ and out of $ρ̄ ∈ \{1, 10, 100\}$, $ρ̄=1$ works
empirically the best.

---
# IMPALA – V-trace

Consider parametrized functions computing $v(s; →θ)$ and $π(a|s; →ω)$.
Assuming the defined $n$-step V-trace target
$$v_s ≝ V(S_s) + ∑_{t=s}^{s+n-1} γ^{t-s} \left(∏\nolimits_{i=s}^{t-1} c_i\right) δ_t V,$$


we update the critic in the direction of
$$\big(v_s - v(S_s; →θ)\big) ∇_→θ v(S_s; →θ)$$


and the actor in the direction of the policy gradient
$$ρ_s ∇_→ω \log π(A_s | S_s; →ω)\big(R_{s+1} + γ v_{s+1} - v(S_s; →θ)\big).$$


Finally, we again add the entropy regularization term $H(π(⋅ | S_s; →ω))$ to the
loss function.
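
As a rough illustration of the two update directions, the following sketch
computes the scalar coefficients that multiply $∇_→θ v(S_s; →θ)$ and
$∇_→ω \log π(A_s | S_s; →ω)$ for every step of a trajectory. The function
`vtrace_update_signals` and its arguments are hypothetical names chosen for
this example; the V-trace targets are assumed to be already computed, and the
target after the last step is taken to be the bootstrap value (the empty-sum
case of the target definition).

```python
def vtrace_update_signals(rewards, values, targets, bootstrap_value,
                          pi_probs, b_probs, gamma=0.99, rho_bar=1.0):
    """Return per-step critic errors and actor advantages (a sketch).

    rewards[s] is R_{s+1}, values[s] is v(S_s; theta), targets[s] is v_s.
    """
    n = len(rewards)
    # v_{s+1} for the last step is just the value of the final state.
    next_targets = list(targets[1:]) + [bootstrap_value]
    # Critic: coefficient of grad_theta v(S_s; theta).
    critic_errors = [targets[s] - values[s] for s in range(n)]
    # Actor: coefficient of grad_omega log pi(A_s | S_s; omega).
    actor_advantages = [
        min(rho_bar, pi_probs[s] / b_probs[s])
        * (rewards[s] + gamma * next_targets[s] - values[s])
        for s in range(n)
    ]
    return critic_errors, actor_advantages
```

In an automatic differentiation framework, these coefficients would be treated
as constants (no gradient flows through them).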

---
# IMPALA

![w=60%,h=center](impala_throughput.pdf)

---
# IMPALA – Population Based Training

For Atari experiments, population based training with a population of 24 agents
is used to adapt entropy regularization, learning rate, RMSProp $ε$ and the
global gradient norm clipping threshold.


![w=80%,h=center](pbt_overview.pdf)

---
# IMPALA – Population Based Training

In population based training, several agents are trained in parallel. When an
agent is _ready_ (after 5000 episodes), then:

- it may be overwritten by the parameters and hyperparameters of another agent,
  if the other agent is sufficiently better (its mean capped human normalized
  score over 5000 episodes is 5% better);

- and independently, the hyperparameters may undergo a change (multiplied by
  either 1.2 or 1/1.2 with 33% chance).
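
The two steps can be sketched as follows. This is only an illustration under
several assumptions that the text above does not settle: agents are
represented as plain dictionaries, "5% better" is read as a score ratio above
1.05 (which presumes positive scores), and the 33% chance is applied to each
hyperparameter independently. The name `pbt_step` is hypothetical.

```python
import random


def pbt_step(agent, other, rng=random, threshold=1.05, factor=1.2,
             change_prob=0.33):
    """One exploit/explore step for a ready agent (a sketch).

    agent and other are dicts with keys "params" (list of floats),
    "hypers" (dict name -> float) and "score" (float).
    """
    # Exploit: copy a sufficiently better agent (assumes positive scores).
    if other["score"] > threshold * agent["score"]:
        agent["params"] = list(other["params"])
        agent["hypers"] = dict(other["hypers"])
    # Explore: independently perturb the hyperparameters.
    for name in agent["hypers"]:
        if rng.random() < change_prob:
            agent["hypers"][name] *= rng.choice([factor, 1 / factor])
    return agent
```

Passing a seeded `random.Random` instance as `rng` makes the perturbations
reproducible.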


---
# IMPALA

![w=100%,v=middle](impala_results.pdf)

---
# IMPALA – Learning Curves

![w=32%,h=center](impala_curves.pdf)

---
# IMPALA – Atari Games

![w=60%,h=center,v=middle](impala_results_atari.pdf)

---
# IMPALA – Ablations

![w=60%,h=center](impala_ablations_table.pdf)

---
# IMPALA – Ablations

![w=50%,h=center](impala_ablations_graphs.pdf)

---
section: PopArt Normalization
# PopArt Normalization

An improvement of IMPALA from Sep 2018, which performs normalization of task
rewards instead of just reward clipping. PopArt stands for _Preserving Outputs
Precisely, while Adaptively Rescaling Targets_.


Assume the value estimate $v(s; →θ, σ, μ)$ is computed using a normalized value
predictor $n(s; →θ)$
$$v(s; →θ, σ, μ) ≝ σ n(s; →θ) + μ$$
and further assume that $n(s; →θ)$ is an output of a linear function
$$n(s; →θ) ≝ →ω^T f(s; →θ-\{→ω, b\}) + b.$$


We can update $σ$ and $μ$ using an exponential moving average with decay rate
$β$ (in the paper, the first moment $μ$ and the second moment $υ$ are tracked, and
the standard deviation is computed as $σ=\sqrt{υ-μ^2}$; decay rate $β=3 ⋅ 10^{-4}$ is employed).
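
A minimal sketch of this moving-average update for a single observed target
follows. The function name `update_statistics` is hypothetical, and the small
lower bound on the variance is an assumption added here for numerical safety,
not something stated above.

```python
import math


def update_statistics(mu, nu, target, beta=3e-4, min_variance=1e-8):
    """Update first and second moments with a new target (a sketch).

    Returns the new first moment, second moment and standard deviation.
    """
    mu_new = (1 - beta) * mu + beta * target
    nu_new = (1 - beta) * nu + beta * target ** 2
    # sigma = sqrt(nu - mu^2); the floor is an assumption of this sketch.
    sigma_new = math.sqrt(max(nu_new - mu_new ** 2, min_variance))
    return mu_new, nu_new, sigma_new
```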

---
# PopArt Normalization

Utilizing the parameters $μ$ and $σ$, we can normalize the observed (unnormalized) returns as
$(G - μ) / σ$ and use an actor-critic algorithm with advantage $(G - μ)/σ - n(S; →θ)$.


However, in order to make sure the unnormalized value estimate does not change
when the normalization parameters change, the parameters $→ω, b$ of the last
linear layer are updated under any change $μ → μ'$ and $σ → σ'$ as:
$$→ω' ≝ \frac{σ}{σ'}→ω,~b' ≝ \frac{σb + μ - μ'}{σ'}.$$
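
The rescaling can be checked numerically with a short sketch. The helper names
`preserve_outputs` and `unnormalized_value` are hypothetical, and the feature
vector stands in for the output of $f$.

```python
def preserve_outputs(w, b, mu, sigma, mu_new, sigma_new):
    """Rescale the last linear layer after a change of mu and sigma."""
    w_new = [sigma / sigma_new * w_i for w_i in w]
    b_new = (sigma * b + mu - mu_new) / sigma_new
    return w_new, b_new


def unnormalized_value(w, b, mu, sigma, features):
    """Compute sigma * (w^T f + b) + mu for a given feature vector f."""
    n = sum(w_i * f_i for w_i, f_i in zip(w, features)) + b
    return sigma * n + mu
```

Evaluating `unnormalized_value` before the change with `(w, b, mu, sigma)` and
after it with the rescaled parameters and `(mu_new, sigma_new)` gives the same
number up to rounding, because
$σ'\big(→ω'^T f + b'\big) + μ' = σ\big(→ω^T f + b\big) + μ$.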


In multi-task settings, we train a task-agnostic policy and task-specific value
functions (therefore, $→μ$, $→σ$ and $→n(s; →θ)$ are vectors).

---
# PopArt Results

![w=80%,h=center](popart_results.pdf)


![w=100%](popart_atari_curves.pdf)

---
# PopArt Results

![w=90%,h=center](popart_atari_statistics.pdf)

---
# PopArt Results

![w=100%,v=middle](popart_dmlab_curves.pdf)

---
section: POMDPs
# Partially Observable MDPs

Recall that a _Markov decision process_ (MDP) is a quadruple $(𝓢, 𝓐, p, γ)$,
where:
- $𝓢$ is a set of states,
- $𝓐$ is a set of actions,
- $p(S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a)$ is a probability that
  action $a ∈ 𝓐$ will lead from state $s ∈ 𝓢$ to $s' ∈ 𝓢$, producing a _reward_ $r ∈ ℝ$,
- $γ ∈ [0, 1]$ is a _discount factor_.


A _partially observable Markov decision process_ (POMDP) extends the Markov decision
process to a sextuple $(𝓢, 𝓐, p, γ, 𝓞, o)$, where in addition to an MDP
- $𝓞$ is a set of observations,
- $o(O_t | S_t, A_{t-1})$ is an observation model.


In robotics (outside the scope of this course), several approaches are used to
handle POMDPs, to model uncertainty, imprecise mechanisms and inaccurate
sensors.

---
# Partially Observable MDPs

In Deep RL, partially observable MDPs are usually handled using recurrent
networks. After a suitable encoding of the input observation $O_t$ and the previous
action $A_{t-1}$, an RNN (usually LSTM) unit is used to model the current state $S_t$
(or its suitable latent representation), which is in turn utilized to produce
$A_t$.

![w=50%,h=center](merlin_rl-lstm.pdf)

---
section: MERLIN
# MERLIN

However, keeping all information in the RNN state is substantially limiting.
Therefore, _memory-augmented_ networks can be used to store suitable information
in external memory (along the lines of NTM, DNC or MANN models).

We now describe an approach used by the MERLIN architecture (_Unsupervised
Predictive Memory in a Goal-Directed Agent_, a DeepMind paper from Mar 2018).

![w=50%,h=center](merlin_rl-mem.pdf)

---
# MERLIN – Memory Module

![w=30%,f=right](merlin_rl-mem.pdf)

Let $⇉M$ be a memory matrix of size $N_\textit{mem} × 2|z|$.


Assume we have already encoded the observations as $→e_t$ and the previous action
as $a_{t-1}$. We concatenate them with $K$ previously read vectors and process
them by a deep LSTM (two layers are used in the paper) to compute $→h_t$.


Then, we apply a linear layer to $→h_t$, computing $K$ key vectors
$→k_1, …, →k_K$ of length $2|z|$ and $K$ positive scalars $β_1, …, β_K$.


**Reading:** For each $i$, we compute cosine similarity of $→k_i$ and all memory
rows $⇉M_j$, multiply the similarities by $β_i$ and pass them through a $\softmax$
to obtain weights $→w_i$. The read vector is then computed as $⇉M^T →w_i$.


**Writing:** We find one-hot write index $→v_\textit{wr}$ to be the least used
memory row (we keep usage indicators and add read weights to them). We then
compute $→v_\textit{ret} ← γ →v_\textit{ret} + (1 - γ) →v_\textit{wr}$, and update
the memory matrix using $⇉M ← ⇉M + →v_\textit{wr}[→e_t, 0] + →v_\textit{ret}[0, →e_t]$.
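
The reading and writing steps can be sketched in plain Python, with the memory
stored as a list of rows. This is an illustration under stated assumptions,
not the implementation from the paper: the names `read_memory` and
`write_memory` are hypothetical, $→e_t$ is assumed to have length $|z|$, and
$→v_\textit{ret}$ is updated before the memory, following the order of the
text above.

```python
import math


def read_memory(memory, key, beta):
    """Return (weights, read_vector) for a single read key (a sketch)."""
    def cosine(u, v):
        norm = math.sqrt(sum(x * x for x in u) * sum(x * x for x in v))
        return sum(x * y for x, y in zip(u, v)) / norm if norm > 0 else 0.0

    logits = [beta * cosine(key, row) for row in memory]
    top = max(logits)  # subtracted for a numerically stable softmax
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    width = len(memory[0])
    read = [sum(w * row[j] for w, row in zip(weights, memory))
            for j in range(width)]
    return weights, read


def write_memory(memory, usage, v_ret, e_t, gamma):
    """Write e_t in place into the least used row; return the new v_ret."""
    least_used = min(range(len(memory)), key=lambda j: usage[j])
    v_wr = [1.0 if j == least_used else 0.0 for j in range(len(memory))]
    v_ret = [gamma * r + (1 - gamma) * w for r, w in zip(v_ret, v_wr)]
    d = len(e_t)
    for j, row in enumerate(memory):
        for k in range(d):
            row[k] += v_wr[j] * e_t[k]       # contribution of [e_t, 0]
            row[d + k] += v_ret[j] * e_t[k]  # contribution of [0, e_t]
    return v_ret
```

The usage indicators are not updated inside these functions; a caller would
add the returned read weights to `usage` after each read.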

---
# MERLIN – Prior and Posterior

However, updating the encoder and memory content purely using RL is inefficient.
Therefore, MERLIN includes a _memory-based predictor (MBP)_ in addition to the policy.
The goal of MBP is to compress observations into low-dimensional state
representations $z$ and store them in memory.


According to the paper, the idea of unsupervised and predictive modeling has
been entertained for decades, and recent discussions have proposed such modeling
to be connected to hippocampal memory.

We want the state variables not only to faithfully represent the data, but also
to emphasise rewarding elements of the environment above irrelevant ones. To
accomplish this, the authors follow the hippocampal representation theory of
Gluck and Myers, who proposed that hippocampal representations pass through
a compressive bottleneck and then reconstruct input stimuli together with task
reward.


In MERLIN, a _prior_ distribution over $z_t$ predicts the next state variable
conditioned on the history of state variables and actions $p(z_t | z_{t-1}, a_{t-1}, …, z_1, a_1)$,
and a _posterior_ corrects the prior using the new observation $o_t$, forming
a better estimate $q(z_t | o_t, z_{t-1}, a_{t-1}, …, z_1, a_1)$.

---
# MERLIN – Prior and Posterior

To achieve the mentioned goals, we add two terms to the loss.

- We try reconstructing input stimuli, action, reward and return using a sample from
  the state variable posterior, and add the difference of the reconstruction and
  ground truth to the loss.


- We also add KL divergence of the prior and posterior to the loss, to ensure
  consistency between the prior and posterior.
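
As an illustration of the second term, the following sketch computes the KL
divergence between two diagonal Gaussian distributions in closed form. Both
the Gaussian form of the prior and posterior and the direction
$D_\textrm{KL}(q \| p)$ (posterior relative to prior, as is usual in
variational autoencoders) are assumptions of this example, and the name
`kl_diag_gaussians` is hypothetical.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for diagonal Gaussians given by means and std devs."""
    return sum(
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p)
    )
```

The divergence is zero when the posterior equals the prior and grows as the
observation forces the posterior away from what the prior predicted.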


![w=85%,h=center](merlin_diagram.pdf)

---
# MERLIN – Algorithm

![w=37%,h=center](merlin_algorithm.pdf)

---
# MERLIN

![w=70%,h=center](merlin_tasks.pdf)

---
# MERLIN

![w=50%,h=center](merlin_analysis.pdf)

---
# MERLIN

![w=90%,h=center](merlin_predictive_model.pdf)

---
section: CTF-FTW
# For the Win agent for Capture The Flag

![w=100%](ctf_overview.pdf)

---
# For the Win agent for Capture The Flag

- Extension of the MERLIN architecture.


- Hierarchical RNN with two timescales.


- Population based training controlling KL divergence penalty weights,
  slow ticking RNN speed and gradient flow factor from fast to slow RNN.

---
# For the Win agent for Capture The Flag

![w=47%,h=center](ctf_architecture.pdf)

---
# For the Win agent for Capture The Flag

![w=80%,h=center](ctf_curves.pdf)

