# RL Exercise 5 - Evolution Strategies

**GOAL:** The goal of this exercise is to demonstrate how to use the evolution strategies (ES) algorithm.

To understand how to use **Ray RLlib**, see the documentation at http://ray.readthedocs.io/en/latest/rllib.html.

ES is described in detail in https://arxiv.org/abs/1703.03864.

The ES algorithm works as follows.

- It maintains a distribution over policies. In this case, the distribution is a multivariate Gaussian over the weights of a neural network policy, and the mean of the Gaussian is denoted $\theta$.
- The mean of the distribution is updated at each iteration, from $\theta_0$ to $\theta_1$ to $\theta_2$ and so on.
- At each iteration, a large number of policies are sampled from the distribution over policies and rollouts are performed using these **perturbed policies**.
- The distribution over policies is updated by moving its mean in the direction of the perturbed policies that achieved higher reward.
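The steps above can be sketched in a few lines of NumPy. This is a minimal illustration on a toy objective, not RLlib's implementation: `es_step` and `toy_reward` are hypothetical names, the "policy" is just a weight vector, and the reward standardization is one common choice among several. The `noise_stdev` and `stepsize` arguments are chosen here to mirror the role of the config keys used later in this exercise.

```python
import numpy as np


def es_step(theta, reward_fn, rng, num_perturbations=50,
            noise_stdev=0.1, stepsize=0.01):
    """One ES iteration: sample perturbed policies, evaluate them, move the mean."""
    # Sample Gaussian perturbations around the current mean theta.
    noise = rng.standard_normal((num_perturbations, theta.size))
    # One "rollout" per perturbed policy; only a scalar reward comes back.
    rewards = np.array([reward_fn(theta + noise_stdev * eps) for eps in noise])
    # Standardize the rewards so the update does not depend on the reward scale.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Reward-weighted average of the perturbations: perturbations that did
    # better than average pull the mean toward them.
    grad_estimate = noise.T @ advantages / (num_perturbations * noise_stdev)
    return theta + stepsize * grad_estimate


def toy_reward(weights):
    """Hypothetical stand-in for a rollout: higher is better, maximum at `target`."""
    target = np.array([1.0, -2.0, 0.5])
    return -np.sum((weights - target) ** 2)


rng = np.random.default_rng(0)
theta = np.zeros(3)
initial_reward = toy_reward(theta)
for _ in range(200):
    theta = es_step(theta, toy_reward, rng)
final_reward = toy_reward(theta)
print(initial_reward, final_reward)
```

Note that `es_step` never computes a gradient of `reward_fn`. It only needs the scalar reward of each perturbed policy.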

Of the algorithms explored so far, this one is the closest to the Monte Carlo algorithm implemented in one of the earlier exercises.

**NOTE:** One interesting property of this algorithm is that it only cares about the rewards achieved in a given rollout. The algorithm does not need to know which states were visited, so much less data has to be communicated than in the algorithms from the earlier exercises.
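To make the communication point concrete, here is a rough sketch of the shared-random-seed idea described in the ES paper: a worker reports a single scalar reward per rollout, and the driver rebuilds the perturbation from the seed it handed out. The names `worker_rollout` and `driver_update` are hypothetical, everything runs in one process for simplicity, and this is not meant to describe how RLlib organizes its noise internally.

```python
import numpy as np


def worker_rollout(theta, seed, reward_fn, noise_stdev):
    """Worker side: perturb theta with noise derived from `seed`, return one scalar."""
    eps = np.random.default_rng(seed).standard_normal(theta.size)
    return reward_fn(theta + noise_stdev * eps)


def driver_update(theta, seeds, rewards, noise_stdev, stepsize):
    """Driver side: rebuild each perturbation from its seed, then move the mean."""
    rewards = np.asarray(rewards, dtype=float)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad_estimate = np.zeros_like(theta)
    for seed, advantage in zip(seeds, advantages):
        # Same seed, same generator, so this is exactly the worker's noise.
        eps = np.random.default_rng(seed).standard_normal(theta.size)
        grad_estimate += advantage * eps
    grad_estimate /= len(seeds) * noise_stdev
    return theta + stepsize * grad_estimate


def toy_reward(weights):
    """Hypothetical stand-in for the total reward of a rollout."""
    return -np.sum(weights ** 2)


theta = np.ones(10)
initial_reward = toy_reward(theta)
for iteration in range(100):
    # Fresh seeds every iteration; only (seed, reward) pairs cross the wire.
    seeds = [iteration * 100 + i for i in range(100)]
    rewards = [worker_rollout(theta, s, toy_reward, 0.1) for s in seeds]
    theta = driver_update(theta, seeds, rewards, noise_stdev=0.1, stepsize=0.01)
final_reward = toy_reward(theta)
print(initial_reward, final_reward)
```

Each rollout contributes one seed and one reward to the update, regardless of how many weights the policy has or how many states the rollout visited.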

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gym
import ray
from ray.rllib.es import ESAgent, DEFAULT_CONFIG

Start up Ray. This must be done before we instantiate any RL agents. We pass in `num_workers=0` because the training agent's constructor will create a number of actors.

In [None]:
ray.init(num_workers=0)

Instantiate an ESAgent object. We pass in a config object that specifies how the network and training procedure should be configured. Some of the parameters are the following.

- `num_workers` is the number of actors that the agent will create. This determines the degree of parallelism that will be used.
- `episodes_per_batch` is the minimum number of rollouts to perform at each iteration.
- `timesteps_per_batch` is the minimum number of steps of the environment to perform at each iteration.
- `noise_stdev` is the standard deviation of the multivariate Gaussian distribution over the neural net policy weights.
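- `stepsize` is the size of the update to the distribution over policies to take at each iteration.
- `eval_prob` is the probability that a given rollout is an evaluation rollout, which uses the current mean weights without any perturbation.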
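Since `episodes_per_batch` and `timesteps_per_batch` are both minimums, it can help to see how two such thresholds interact. The sketch below assumes that an iteration keeps collecting rollouts until both minimums are met; `collect_batch` and `run_episode` are hypothetical names, and the real collection loop inside RLlib may be organized differently.

```python
import random


def collect_batch(run_episode, episodes_per_batch, timesteps_per_batch):
    """Run episodes until BOTH the episode minimum and the timestep minimum are met."""
    episode_lengths = []
    while (len(episode_lengths) < episodes_per_batch
           or sum(episode_lengths) < timesteps_per_batch):
        # `run_episode` is a stand-in for a rollout; it returns the episode length.
        episode_lengths.append(run_episode())
    return episode_lengths


# Short episodes (5 to 15 steps): either minimum may end up being the binding one.
rng = random.Random(0)
short_episodes = collect_batch(lambda: rng.randint(5, 15), 100, 1000)

# Long episodes (200 steps each): the episode minimum is the binding one.
long_episodes = collect_batch(lambda: 200, 100, 1000)

print(len(short_episodes), sum(short_episodes))
print(len(long_episodes), sum(long_episodes))
```

With the values used in this exercise (100 episodes, 1000 timesteps), a policy that keeps the pole up for many steps will typically be limited by the episode count, while a policy that fails quickly may need extra episodes to reach the timestep minimum.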

In [None]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 3
config['episodes_per_batch'] = 100
config['timesteps_per_batch'] = 1000
config['noise_stdev'] = 0.02
config['stepsize'] = 0.01
config['eval_prob'] = 0.5

agent = ESAgent(config, 'CartPole-v0')

**EXERCISE:** Train the agent for some number of steps on the CartPole environment. Compare the performance to PPO from the previous exercise.

In [None]:
raise Exception('Implement this.')

**EXERCISE:** Instantiate an `ESAgent` object on the `MountainCar-v0` environment and train it for some number of steps. Compare the performance to PPO and A3C from the previous exercises.

In [None]:
raise Exception('Implement this.')