# A3C

A3C (Asynchronous Advantage Actor-Critic) is a parallel version of A2C. It introduces asynchronous updates of the policy from $N$ agents running on the same multicore machine.
Here is a brief explanation of the logic:

First, the global policy-value network parameters $\theta$ are randomly initialized. The network parameters are made available to all the processes.

Every agent runs in a separate process/thread and acts as an advantage actor-critic (A2C). It repeats the following steps until convergence:
1. Pull the latest parameters from the global network: $\theta_{\mathcal{A}_i}\leftarrow\theta$.
2. Run the policy in the local environment for a number of steps $T$.
3. Calculate the loss $\mathcal{L}_{\theta_{\mathcal{A}_i}}=\mathcal{L}_{\theta_{\mathcal{A}_i}}^{\text{POLICY}}+\mathcal{L}_{\theta_{\mathcal{A}_i}}^{\text{VALUE}}$ using A2C logic.
4. Calculate the gradient $\nabla_{\theta_{\mathcal{A}_i}} \mathcal{L}_{\theta_{\mathcal{A}_i}}$.
5. Update the global network with the gradient: $\theta \leftarrow \theta - \alpha\nabla_{\theta_{\mathcal{A}_i}} \mathcal{L}_{\theta_{\mathcal{A}_i}}$.
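
The pull / compute / push cycle above can be illustrated with a toy example. This is only a sketch under loud assumptions: threads stand in for the agent processes, a plain list stands in for the global network parameters, and a quadratic loss (`toy_gradient`, a hypothetical helper) replaces the real A2C loss and environment rollout. The lock is there only to keep the toy example simple to reason about.

```python
import threading


def toy_gradient(theta, target):
    """Gradient of the toy loss 0.5 * sum((theta_i - target_i)^2).

    Hypothetical stand-in for the A2C loss gradient (steps 2-4).
    """
    return [t - g for t, g in zip(theta, target)]


def agent_loop(global_theta, target, alpha, n_updates, lock):
    for _ in range(n_updates):
        # Step 1: pull the latest global parameters into a local copy.
        with lock:
            local_theta = list(global_theta)
        # Steps 2-4: compute the gradient on the local copy. Other agents may
        # update global_theta in the meantime, so this gradient can be stale.
        grad = toy_gradient(local_theta, target)
        # Step 5: apply the local gradient to the global parameters.
        with lock:
            for i, g in enumerate(grad):
                global_theta[i] -= alpha * g


def run_agents(n_agents=4, n_updates=200, alpha=0.05):
    global_theta = [5.0, -3.0]  # shared "global network" parameters
    target = [1.0, 2.0]         # minimizer of the toy loss
    lock = threading.Lock()
    threads = [
        threading.Thread(target=agent_loop,
                         args=(global_theta, target, alpha, n_updates, lock))
        for _ in range(n_agents)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return global_theta, target


theta, target = run_agents()
print(theta)  # close to [1.0, 2.0]; exact digits depend on thread scheduling
```

Each agent computes its gradient on a possibly outdated copy of the parameters, yet the shared parameters still move toward the minimizer because every update is small.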

At first glance, A3C should converge about $N$ times faster than A2C, since $N$ agents collect experience in parallel. In practice it turns out to converge even faster. It's not clear why; one possible explanation is that the agents produce decorrelated updates of the network parameters.

In [104]:
import ctypes
import importlib
import os

import gym
import torch
import torch.multiprocessing as mp
from gym.spaces import Box

import A3C

In [105]:
model = A3C.PolicyValueModel(2, 2, 4)
model.forward([1, 2.])

[INFO/MainProcess] Creating policy. is_continuous=False

(Categorical(probs: torch.Size([2]), logits: torch.Size([2])),
 array(0),
 tensor(-0.7704, grad_fn=<SqueezeBackward1>),
 tensor([-0.4709], grad_fn=<AddBackward0>))

In [106]:
optim = A3C.SharedAdam(model.parameters())
optim

SharedAdam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    eps: 1e-08
    lr: 0.001
    maximize: False
    weight_decay: 0
)

In [126]:
importlib.reload(A3C)


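def a3c(env_name, capacity=256, n_agents=1, early_stopping_reward=300, num_episodes=10000, train_every_step=16,
        max_episode_length=300, entropy_loss_weight=0.01, gamma=0.9):
    """Train one global policy-value model with `n_agents` asynchronous agents and return it."""
    with gym.make(env_name) as env:
        # Box action spaces are continuous (size from shape), all others are treated as discrete (size from n).
        is_continuous = isinstance(env.action_space, Box)
        n_actions = env.action_space.shape[0] if is_continuous else env.action_space.n
        global_model = A3C.PolicyValueModel(env.observation_space.shape[0], n_actions, capacity, is_continuous)
        optimizer = A3C.SharedAdam(global_model.parameters())
        # Assumption: the flag is meant to be shared by all agents, so that one agent
        # reaching early_stopping_reward can stop the others. The original created one flag per agent.
        terminate_flag = mp.Value(ctypes.c_int, 0)
        agents = []
        for a in range(n_agents):
            # Only the first agent logs verbosely.
            agents.append(A3C.A3CAgent(f'agent{a}', env_name, capacity, global_model, optimizer, num_episodes,
                                       train_every_step, early_stopping_reward, gamma, entropy_loss_weight,
                                       max_episode_length, verbose=a == 0, terminate_flag=terminate_flag))
            agents[-1].start()
        # Wait for every agent process to finish before returning the trained model.
        for agent in agents:
            agent.join()
        return global_model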
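
The `gamma` and `train_every_step` arguments suggest that each agent trains on short rollouts with discounted returns. Below is a minimal sketch of that computation, assuming the agent bootstraps n-step returns from the critic's value estimate, as A2C commonly does. `n_step_returns` and `advantages` are hypothetical helpers for illustration, not functions from the `A3C` module.

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.9):
    """Discounted returns R_t = r_t + gamma * R_{t+1}, seeded with V(s_T).

    `bootstrap_value` is the critic's estimate for the state after the last
    step of the rollout (use 0.0 if the episode terminated there).
    """
    returns = []
    running = bootstrap_value
    # Walk the rollout backwards so each return reuses the next one.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(returns, values):
    """Advantage A_t = R_t - V(s_t), used to weight the policy loss."""
    return [R - v for R, v in zip(returns, values)]


returns = n_step_returns([1.0, 2.0], bootstrap_value=10.0, gamma=0.5)
print(returns)                          # [4.5, 7.0]
print(advantages(returns, [4.0, 6.0]))  # [0.5, 1.0]
```

With a small `train_every_step`, the rollout is short and most of the return comes from the bootstrapped value, which trades variance for bias.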

## CartPole-v1

In [129]:
model = a3c('CartPole-v1',
            128,
            n_agents=os.cpu_count(),
            train_every_step=5,
            gamma=0.9,
            entropy_loss_weight=0.,
            early_stopping_reward=200)

[INFO/agent0] child process calling self.run()
[INFO/agent1] child process calling self.run()
[INFO/agent0] Agent agent0 starting
[INFO/agent1] Agent agent1 starting
[INFO/agent3] child process calling self.run()
[INFO/agent3] Agent agent3 starting
[INFO/agent4] child process calling self.run()
[INFO/agent4] Agent agent4 starting
[INFO/agent5] child process calling self.run()
[INFO/agent5] Agent agent5 starting
[INFO/agent6] child process calling self.run()
[INFO/agent6] Agent agent6 starting
[INF

In [130]:
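# Roll out one episode with the trained model and render it.
with gym.make('CartPole-v1') as env:
    state = env.reset()
    done = False
    while not done:
        # forward() returns a tuple; element 1 is the sampled action.
        action = model.forward(torch.tensor(state))[1]
        state, reward, done, _ = env.step(action)
        env.render()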

## LunarLanderContinuous-v2

In [124]:
model_ll = a3c('LunarLanderContinuous-v2',
            128,
            n_agents=os.cpu_count(),
            num_episodes=100,
            early_stopping_reward=0,
            train_every_step=8,
            max_episode_length=300,
            entropy_loss_weight=0.,
            gamma=0.9)

[INFO/agent1] child process calling self.run()
[INFO/agent0] child process calling self.run()
[INFO/agent2] child process calling self.run()
[INFO/agent1] Agent agent1 starting
[INFO/agent0] Agent agent0 starting
[INFO/agent2] Agent agent2 starting
[INFO/agent5] child process calling self.run()
[INFO/agent5] Agent agent5 starting
[INFO/agent7] child process calling self.run()
[INFO/agent7] Agent agent7 starting
[INFO/agent9] child process calling self.run()
[INFO/agent9] Agent agent9 starting
[INF

In [125]:
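# Roll out one episode with the trained continuous-action model and render it.
with gym.make('LunarLanderContinuous-v2') as env:
    state = env.reset()
    done = False
    while not done:
        # forward() returns a tuple; element 1 is the sampled action.
        action = model_ll.forward(torch.tensor(state))[1]
        state, reward, done, _ = env.step(action)
        env.render()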