## Introduction

In DQN, the agent's data is stored in an experience replay memory so that it can be batched or randomly sampled from different time-steps. This reduces non-stationarity and decorrelates the updates, but it also limits the approach to off-policy methods (learning algorithms that can update from data generated by an older policy) and costs more memory and computation. In the paper, instead of experience replay, the authors asynchronously execute multiple agents in parallel on multiple instances of the environment. This parallelism also decorrelates the agents' data into a more stationary process, since at any given time-step the parallel agents will be experiencing a variety of different states. This simple idea enables on-policy learning with Sarsa, n-step methods, and actor-critic methods, in addition to off-policy methods. At the same time, the algorithms do not rely on GPUs or massively distributed architectures.

## Related Work

The General Reinforcement Learning Architecture (Gorila) performs asynchronous training of reinforcement
learning agents in a distributed setting. In Gorila,
each process contains an actor that acts in its own copy
of the environment, a separate replay memory, and a learner
that samples data from the replay memory and computes
gradients of the DQN loss with respect
to the policy parameters. The gradients are asynchronously
sent to a central parameter server which updates a central
copy of the model. The updated policy parameters are sent
to the actor-learners at fixed intervals.

## Reinforcement Learning Background

Let $Q(s, a; \theta)$ be an approximate action-value function with parameters $\theta$. In one-step Q-learning, the parameters $\theta$ of the action-value function $Q(s, a; \theta)$ are learned by iteratively minimizing a sequence of loss functions (AVI), where the $i$th loss function is defined as

$L_i(\theta_i) = \mathbb{E}\left[\left(r + \gamma \max_{a^{\prime}} Q(s^{\prime}, a^{\prime}; \theta_{i-1}) - Q(s, a; \theta_i)\right)^2\right]$

One drawback of one-step methods is that obtaining a reward $r$ only directly affects the value of the state-action pair $(s, a)$ that led to the reward. The values of other state-action pairs are affected only indirectly through the updated value $Q(s, a)$. This can make learning slow, since many updates are required to propagate a reward to the relevant preceding states and actions. One way of propagating rewards faster is to use n-step returns, so that a single reward $r$ directly affects the values of the $n$ preceding state-action pairs. This makes the process of propagating rewards to the relevant state-action pairs potentially much more efficient (i.e., each reward is reused in several updates, and the targets depend less on bootstrapped estimates, so they carry less bias).
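
For reference, the n-step return mentioned above can be written out as follows. This is a sketch in the notation of this section; the superscript $(n)$ and the use of the older parameters $\theta_{i-1}$ for bootstrapping are notational assumptions chosen to match the one-step loss.

$G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} \max_{a^{\prime}} Q(s_{t+n}, a^{\prime}; \theta_{i-1})$

With $n = 1$ this reduces to the one-step target $r_t + \gamma \max_{a^{\prime}} Q(s_{t+1}, a^{\prime}; \theta_{i-1})$ used in the loss above.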

In contrast to value-based methods, policy-based model-free
methods directly parameterize the policy $\pi(a | s; \theta)$ and update the parameters $\theta$ by performing, typically approximate, gradient ascent on $E[G_t]$. A learned estimate of the value function is commonly used
as the baseline $b_t(s_t) \approx V^{\pi}(s_t)$ leading to a much lower variance estimate of the policy gradient. When an approximate value function is used as the baseline, the quantity $G_t - b_t$ used to scale the policy gradient can be seen as an estimate of the **advantage** of action $a_t$ in state $s_t$ or $A(a_t, s_t) = Q(a_t, s_t) - V(s_t)$. This approach can be viewed as an actor-critic architecture where the policy $\pi$ is the actor and the baseline $b_t$ is the critic.
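
Written out, the baseline-corrected update direction described above has the following form. This is a sketch in this section's notation of the standard REINFORCE-with-baseline estimator.

$\nabla_{\theta} \log \pi(a_t | s_t; \theta) \, (G_t - b_t(s_t))$

Because $b_t(s_t)$ does not depend on the action $a_t$, subtracting it leaves the expected gradient unchanged and only reduces its variance.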

## Asynchronous RL Framework

While the underlying RL methods are quite different,
with actor-critic being an on-policy policy search
method and Q-learning being an off-policy value-based
method, we use two main ideas to make all four algorithms
practical given our design goal.

First, we use asynchronous actor-learners on multiple CPU threads of a single machine. Keeping the learners on a single machine removes the communication costs of sending gradients and parameters and enables us to use Hogwild!-style updates for training.
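
The lock-free update pattern can be illustrated with a toy example. The sketch below is not the authors' implementation: the quadratic objective, `TARGET`, `actor_learner`, and `run` are all hypothetical names chosen for illustration, and several threads simply apply noisy gradient steps to one shared parameter list without any locking.

```python
import random
import threading

# Shared parameters, written by every thread without a lock (Hogwild!-style).
theta = [0.0, 0.0]
TARGET = [3.0, -2.0]  # hypothetical optimum of a toy quadratic objective


def actor_learner(seed, steps=2000, lr=0.01):
    """One worker: repeatedly apply noisy gradient steps to the shared theta."""
    rng = random.Random(seed)
    for _ in range(steps):
        for i in range(len(theta)):
            # Noisy gradient of 0.5 * (theta_i - target_i)^2.
            grad = (theta[i] - TARGET[i]) + rng.gauss(0.0, 0.1)
            theta[i] -= lr * grad  # unsynchronized write; races are tolerated


def run(num_threads=4):
    """Start all workers, wait for them, and return the shared parameters."""
    threads = [threading.Thread(target=actor_learner, args=(seed,))
               for seed in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return theta
```

Calling `run()` should leave `theta` close to `TARGET` even though no thread coordinates with the others. Note that in standard CPython the global interpreter lock serializes bytecode execution, so this only illustrates the update pattern, not the speedup.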

Second, we make the observation that multiple actor-learners running in parallel are likely to be exploring different parts of the environment. Moreover, one can explicitly use different exploration policies in each actor-learner to maximize this diversity. By running different exploration policies in different threads, the overall changes being
made to the parameters by multiple actor-learners applying
online updates in parallel are likely to be less correlated
in time than a single agent applying online updates. Hence, we do not use a replay memory and rely on parallel
actors employing different exploration policies to perform
the stabilizing role undertaken by experience replay in the
DQN training algorithm.

### Asynchronous one-step Q-learning (AVI, estimating $Q^{*}$)

Each thread interacts with its own copy of the environment and at each step computes a gradient of the Q-learning loss. As proposed in the DQN training method, the threads use a shared and slowly changing target network when computing the Q-learning loss. They also accumulate gradients over multiple timesteps before applying them, which is similar to using minibatches. This reduces the chances of multiple
actor-learners overwriting each other’s updates. Accumulating
updates over several steps also provides some ability to
trade off computational efficiency for data efficiency.
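
The interplay of accumulated gradients and a slowly changing target can be sketched for a single thread with a tabular Q function. This is a simplified illustration under stated assumptions: `run_actor_learner`, `async_update`, and `target_sync` are hypothetical names, the Q function is a list of lists rather than a neural network, and one step counter drives both the parameter update and the target refresh.

```python
import copy


def run_actor_learner(transitions, q, q_target, gamma=0.99, lr=0.1,
                      async_update=5, target_sync=10):
    """Process (s, a, r, s_next, done) transitions for one actor-learner."""
    accumulated = {}  # (state, action) -> summed gradient
    for step, (s, a, r, s_next, done) in enumerate(transitions, start=1):
        # The target is computed from the slowly changing target table.
        y = r if done else r + gamma * max(q_target[s_next])
        # Gradient of 0.5 * (y - Q(s, a))^2 with respect to Q(s, a).
        accumulated[(s, a)] = accumulated.get((s, a), 0.0) - (y - q[s][a])
        if step % async_update == 0 or done:
            # Apply everything accumulated so far in one update.
            for (si, ai), grad in accumulated.items():
                q[si][ai] -= lr * grad
            accumulated.clear()
        if step % target_sync == 0:
            # Refresh the target table from the current values.
            q_target[:] = copy.deepcopy(q)
    return q
```

A larger `async_update` means fewer, larger writes to the shared values, which is the minibatch-like effect described above.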

Finally, they found that giving each thread a different exploration
policy helps improve robustness. Adding diversity
to exploration in this manner also generally improves performance
through better exploration.
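
One way to give each thread its own exploration policy is to let every thread draw its own final $\epsilon$ for $\epsilon$-greedy action selection. The sketch below assumes this scheme; the candidate values, their probabilities, the annealing length, and all function names are illustrative assumptions rather than settings taken from the text above.

```python
import random

# Illustrative values only: candidate final epsilons and their probabilities.
FINAL_EPSILONS = [0.1, 0.01, 0.5]
PROBABILITIES = [0.4, 0.3, 0.3]


def sample_final_epsilon(rng):
    """Pick the final exploration rate for one thread."""
    return rng.choices(FINAL_EPSILONS, weights=PROBABILITIES, k=1)[0]


def epsilon_at(step, final_epsilon, anneal_steps=1000):
    """Linearly anneal epsilon from 1.0 down to final_epsilon."""
    fraction = min(step / anneal_steps, 1.0)
    return 1.0 + fraction * (final_epsilon - 1.0)


def epsilon_greedy(q_values, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Each thread would call `sample_final_epsilon` once with its own random generator and then use `epsilon_at` and `epsilon_greedy` at every step, so different threads end up exploring at different rates.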

<img src='pngs/algo_1.png'>

### Asynchronous one-step SARSA (GPI)

Most of the algorithm is the same as asynchronous one-step Q-learning, except that it uses a different target value, $r + \gamma Q(s^{\prime}, a^{\prime};\theta^-)$, where $a^{\prime}$ is the action taken in state $s^{\prime}$, and it follows the GPI process.
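
The difference between the two targets is small enough to show side by side. In this minimal sketch, `q_next` stands for the target-network values $Q(s^{\prime}, \cdot; \theta^-)$ as a plain list, and both function names are hypothetical.

```python
def q_learning_target(r, q_next, gamma=0.99, done=False):
    """Off-policy target: bootstrap from the greedy action in s'."""
    return r if done else r + gamma * max(q_next)


def sarsa_target(r, q_next, a_next, gamma=0.99, done=False):
    """On-policy target: bootstrap from the action a' actually taken in s'."""
    return r if done else r + gamma * q_next[a_next]
```

The two targets coincide whenever the action taken in $s^{\prime}$ happens to be the greedy one, and differ on exploratory steps.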

### Asynchronous n-step Q-learning

The algorithm is somewhat unusual because
it operates in the forward view by explicitly computing n-step
returns, as opposed to the more common backward
view used by techniques like eligibility traces. They found that using the forward view is easier
when training neural networks with momentum-based
methods and backpropagation through time.

<img src='pngs/algo_3.png'>

This process results in the agent
receiving up to $t_{max}$ rewards from the environment since its last update. The algorithm then computes gradients for n-step Q-learning updates for each of the state-action pairs encountered since the last update. Each n-step update uses the longest possible n-step return, resulting in a one-step update for the last state, a two-step update for the second-to-last state, and so on for a total of up to $t_{max}$ updates. The
accumulated updates are applied in a single gradient step.
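
The "longest possible n-step return" for every visited state can be computed in one backward pass over the collected rewards. The sketch below assumes that `bootstrap_value` is 0 for a terminal state and $\max_a Q(s, a; \theta^-)$ otherwise; the function name is hypothetical.

```python
def n_step_targets(rewards, bootstrap_value, gamma=0.99):
    """Return one target per visited step, each using the longest return."""
    targets = []
    ret = bootstrap_value  # value estimate of the state after the last reward
    for r in reversed(rewards):
        ret = r + gamma * ret  # extend the return by one more step
        targets.append(ret)
    targets.reverse()
    return targets
```

For example, `n_step_targets([1.0, 1.0, 1.0], 0.0, gamma=0.5)` gives a three-step target for the first state, a two-step target for the second, and a one-step target for the last.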
### Asynchronous advantage actor-critic

The A3C algorithm maintains a policy $\pi(a_t|s_t;\theta)$ and an estimate of the value function $V(s_t;\theta_v)$. Like the variant of n-step Q-learning, the variant of actor-critic also operates in the forward view and uses the same mix of n-step returns to update both the policy and the value-function.
The policy and the value function are updated after every $t_{max}$ actions or when a terminal state is reached. The update performed by the algorithm can be seen as $\nabla_{\theta^{\prime}} \log \pi(a_t | s_t;\theta^\prime) A(s_t, a_t;\theta, \theta_v)$, where $A(s_t, a_t;\theta, \theta_v) = \sum_{i=0}^{k-1} \gamma^{i} r_{t+i} + \gamma^k V(s_{t+k}; \theta_v) - V(s_t; \theta_v)$ is an estimate of the advantage function, and $k$ can vary from state to state and is upper-bounded by $t_{max}$.
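
The advantage estimate above can be turned into scalar losses for one rollout. This is a minimal sketch and not the authors' code: `a3c_losses` is a hypothetical name, the inputs are plain Python lists, and the advantage is treated as a constant in the policy term.

```python
def a3c_losses(rewards, values, log_probs, bootstrap_value, gamma=0.99):
    """Return (policy_loss, value_loss) for one rollout of up to t_max steps.

    values[i] is V(s_{t+i}; theta_v) and log_probs[i] is
    log pi(a_{t+i} | s_{t+i}; theta) for the action actually taken.
    """
    policy_loss, value_loss = 0.0, 0.0
    ret = bootstrap_value  # 0 if the rollout ended in a terminal state
    steps = zip(reversed(rewards), reversed(values), reversed(log_probs))
    for r, v, logp in steps:
        ret = r + gamma * ret          # n-step return for this state
        advantage = ret - v            # estimate of A(s, a)
        policy_loss += -logp * advantage  # minimizing this ascends log pi * A
        value_loss += advantage ** 2      # squared error of the critic
    return policy_loss, value_loss
```

Minimizing `policy_loss` with respect to the policy parameters corresponds to the gradient ascent step written above, while `value_loss` pulls $V(s_t; \theta_v)$ toward the n-step return.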

<img src='pngs/algo_4.png'>

As with the value-based methods we rely on parallel actor-learners
and accumulated updates for improving training
stability. Note that while the parameters $\theta$ of the policy and $\theta_{v}$ of the value function are shown as being separate for generality, they always share some of the parameters in practice. The authors use a CNN that has one softmax output for the policy and one linear output for the value function with all non-output layers shared.
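
The parameter sharing between actor and critic can be sketched with a tiny network in plain Python. This is only an illustration under stated assumptions: a single fully connected ReLU layer stands in for the shared convolutional layers, and `make_params`, `matvec`, and `forward` are hypothetical names.

```python
import math
import random


def make_params(n_in, n_hidden, n_actions, rng):
    """Create small random weights for the shared trunk and the two heads."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]
    return {
        "shared": matrix(n_hidden, n_in),       # used by both heads
        "policy": matrix(n_actions, n_hidden),  # softmax policy head
        "value": matrix(1, n_hidden),           # linear value head
    }


def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def forward(params, state):
    """Return (action probabilities, state value) from one shared trunk."""
    hidden = [max(0.0, h) for h in matvec(params["shared"], state)]  # ReLU
    logits = matvec(params["policy"], hidden)
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    value = matvec(params["value"], hidden)[0]
    return probs, value
```

Both outputs depend on the same `hidden` features, so a gradient step on either head also changes the shared layer that the other head relies on.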

