# RL Exercise 4 - Asynchronous Advantage Actor Critic

**GOAL:** The goal of this exercise is to demonstrate how to use the asynchronous advantage actor critic (A3C) algorithm.

To understand how to use **Ray RLlib**, see the documentation at http://ray.readthedocs.io/en/latest/rllib.html.

A3C is described in detail in https://arxiv.org/abs/1602.01783.

In A3C, the driver maintains the most up-to-date policy. It creates a number of actors, which perform partial rollouts and compute gradient updates to the model. The driver runs in a loop in which it waits for a single actor task to finish, updates the model with the result of that task, and launches a new actor task with the updated model. Because the actor tasks may finish in any order, the algorithm is fundamentally asynchronous and non-deterministic.
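The driver loop described above can be sketched without Ray. The following is a minimal illustration, not RLlib's implementation: it uses a thread pool in place of actors, and the hypothetical `compute_gradient` function replaces a partial rollout with a noisy gradient of a toy quadratic objective. The names `compute_gradient`, `async_driver`, and the target values are assumptions made for this sketch.

```python
import concurrent.futures as cf
import random


def compute_gradient(weights, seed):
    """Stand-in for an actor task (hypothetical).

    Returns a noisy gradient of 0.5 * ||w - target||^2 instead of
    performing a real partial rollout.
    """
    rng = random.Random(seed)
    target = [1.0, -2.0]
    return [(w - t) + rng.gauss(0.0, 0.01) for w, t in zip(weights, target)]


def async_driver(num_workers=3, num_updates=200, lr=0.1):
    # The driver holds the most up-to-date parameters.
    weights = [0.0, 0.0]
    with cf.ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Launch one task per worker using the initial weights.
        pending = {
            pool.submit(compute_gradient, list(weights), seed)
            for seed in range(num_workers)
        }
        next_seed = num_workers
        updates = 0
        while updates < num_updates:
            # Block until at least one task has finished.
            done, pending = cf.wait(pending, return_when=cf.FIRST_COMPLETED)
            for future in done:
                # Apply the (possibly stale) gradient to the current weights.
                grad = future.result()
                weights = [w - lr * g for w, g in zip(weights, grad)]
                updates += 1
                # Relaunch a task with the updated weights.
                pending.add(
                    pool.submit(compute_gradient, list(weights), next_seed)
                )
                next_seed += 1
    return weights


if __name__ == "__main__":
    print(async_driver())
```

Each gradient may have been computed from weights that are a few updates old, and the order in which tasks complete can differ between runs. This is the source of the non-determinism mentioned above.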

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gym
import ray
from ray.rllib.a3c import A3CAgent, DEFAULT_CONFIG
from ray.rllib.a3c.shared_model import SharedModel

Start up Ray. This must be done before we instantiate any RL agents. We pass in `num_workers=0` because the agent's constructor creates the actors it needs.

In [None]:
ray.init(num_workers=0)

Instantiate an `A3CAgent` object. We pass in a config object that specifies how the network and training procedure should be configured. Two of the parameters are described below.

- `num_workers` is the number of actors that the agent will create. This determines the degree of parallelism that will be used.
- `batch_size` is the number of simulator steps that each actor will batch together.
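To make the role of `batch_size` concrete, here is a small illustration of how an episode could be cut into partial rollouts. This is a sketch, not RLlib code: it assumes that a partial rollout ends after `batch_size` steps or at the end of the episode, whichever comes first, and the helper `split_into_batches` is hypothetical.

```python
def split_into_batches(steps, batch_size):
    """Split a list of simulator steps into consecutive partial rollouts."""
    return [steps[i:i + batch_size] for i in range(0, len(steps), batch_size)]


# A hypothetical 25-step episode, with steps numbered 0..24.
episode = list(range(25))
batches = split_into_batches(episode, batch_size=10)
print([len(b) for b in batches])  # [10, 10, 5]
```

Under this assumption, a smaller `batch_size` yields more frequent but noisier gradient updates per episode.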

In [None]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 3
config['batch_size'] = 10

agent = A3CAgent(config, 'CartPole-v0')

**EXERCISE:** Train the agent for some number of steps on the CartPole environment. Compare the performance to PPO from the previous exercise.

In [None]:
raise Exception('Implement this.')

**EXERCISE:** Instantiate an A3CAgent object on the `MountainCar-v0` environment and train it for some number of steps. Compare the performance to PPO from the previous exercise.

In [None]:
raise Exception('Implement this.')