In [None]:
import gym
import numpy as np
import ray

ray.init()

## Natural Evolution Strategies

An improved approach to derivative-free optimization is to maintain a *population of policies*. This is sometimes done by maintaining a set of distinct policies. In this exercise, we will instead maintain a *distribution over policies*. The policies will use neural nets with a fixed architecture to choose actions. **The distribution over policies will be a multivariate Gaussian over the weights of the neural nets.** The Gaussian will be represented by its mean $\mu$; its covariance matrix is fixed (some multiple of the identity).

The mean of the Gaussian will be initialized to the vector $\mu_0$ of all zeros. At time $t$, we will generate an updated mean vector $\mu_t$ as follows.

In [None]:
# Initialize the mean policy to all zeros.
num_inputs = 4
num_hiddens = 10
num_outputs = 1

def initial_policy():
    mean_policy = {"weights1": np.zeros((num_hiddens, num_inputs)),
                   "biases1": np.zeros((num_hiddens)),
                   "weights2": np.zeros((num_outputs, num_hiddens)),
                   "biases2": np.zeros(num_outputs)}
    return mean_policy

# This is a helper function for computing an action given a policy and a state.
def compute_action(policy, state):
    hiddens = np.maximum(np.dot(policy["weights1"], state) + policy["biases1"], 0)
    output = np.dot(policy["weights2"], hiddens) + policy["biases2"]
    assert output.size == 1
    # Turn output into a probability using a sigmoid function.
    probability_of_0 = 1 / (1 + np.exp(-output[0]))
    return 0 if np.random.uniform(0, 1) < probability_of_0 else 1
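
As a quick check of the helper above: with the all-zero initial policy, the network output is 0 for every state, so the sigmoid gives a probability of exactly 0.5 and both actions are equally likely. The sketch below recomputes that probability directly. `probability_of_action_0` is a hypothetical helper that mirrors the deterministic part of `compute_action`, and the example state is made up.

```python
import numpy as np

num_inputs, num_hiddens, num_outputs = 4, 10, 1

def initial_policy():
    return {"weights1": np.zeros((num_hiddens, num_inputs)),
            "biases1": np.zeros(num_hiddens),
            "weights2": np.zeros((num_outputs, num_hiddens)),
            "biases2": np.zeros(num_outputs)}

def probability_of_action_0(policy, state):
    # Hypothetical helper: same forward pass as compute_action, without sampling.
    hiddens = np.maximum(np.dot(policy["weights1"], state) + policy["biases1"], 0)
    output = np.dot(policy["weights2"], hiddens) + policy["biases2"]
    return 1 / (1 + np.exp(-output[0]))

state = np.array([0.1, -0.2, 0.03, 0.4])  # Arbitrary example state.
p0 = probability_of_action_0(initial_policy(), state)
print(p0)
```

Starting from a policy that acts uniformly at random means the first updates are driven entirely by which random perturbations happen to earn more reward.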

We will generate $N$ samples from the distribution over policies.

\begin{equation}
\qquad \theta_n \sim \mathcal N(\mu_{t-1}, \sigma^2 I) \quad \text{for $1 \le n \le N$}
\end{equation}

where $\mathcal N(\mu,\Sigma)$ represents a multivariate Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$, and $I$ is the identity matrix. To reduce the variance of the update, we will generate perturbed policies in pairs with opposite perturbations, as shown in the next cell.

In [None]:
sigma = 1e-2

def generate_perturbed_policies(policy):
    new_policy1 = dict()
    new_policy2 = dict()
    for key, weights in policy.items():
        perturbation = sigma * np.random.normal(size=weights.shape)
        new_policy1[key] = weights + perturbation
        new_policy2[key] = weights - perturbation
    return new_policy1, new_policy2
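
A quick sanity check on the paired perturbations: because the two policies use opposite perturbations, their average should reproduce the original policy (up to floating-point rounding). The sketch below restates `generate_perturbed_policies` so that it runs on its own; the small two-entry policy is made up for illustration.

```python
import numpy as np

sigma = 1e-2

def generate_perturbed_policies(policy):
    new_policy1, new_policy2 = {}, {}
    for key, weights in policy.items():
        perturbation = sigma * np.random.normal(size=weights.shape)
        new_policy1[key] = weights + perturbation
        new_policy2[key] = weights - perturbation
    return new_policy1, new_policy2

# Illustrative policy, not the one used in the exercise.
policy = {"weights1": np.ones((3, 2)), "biases1": np.zeros(3)}
plus, minus = generate_perturbed_policies(policy)

for key in policy:
    # The midpoint of the pair is the unperturbed policy.
    midpoint = (plus[key] + minus[key]) / 2
    assert np.allclose(midpoint, policy[key])
    # The two policies differ by twice the perturbation.
    print(key, np.abs(plus[key] - minus[key]).max())
```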

For each policy $\theta_n$, we will perform a rollout to obtain cumulative reward $R_n$. These rewards will be used to update the mean policy via the formula

\begin{equation}
\mu_t = \mu_{t-1} + \frac{\alpha}{N \sigma} \sum_{n=1}^N R_n \, (\theta_n - \mu_{t-1}) ,
\end{equation}

where $\theta_n - \mu_{t-1}$ is the perturbation that produced $\theta_n$ and $\alpha$ is the learning rate.
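
To see the update rule in action without a simulator, here is a minimal sketch on a toy problem. It assumes a made-up reward $R(\theta) = -\|\theta - \theta^*\|^2$ with a hypothetical optimum `target`, represents the policy as a flat vector rather than a dictionary of weights, and weights each reward by the perturbation that produced it, as the `update_mean_policy` code does. The values of `alpha`, `sigma`, and `num_pairs` are chosen for this toy problem only.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])  # Hypothetical optimum of the toy reward.

def reward(theta):
    return -np.sum((theta - target) ** 2)

alpha, sigma, num_pairs = 1.0, 0.1, 50
mu = np.zeros(3)
initial_distance = np.linalg.norm(mu - target)

for _ in range(100):
    # Sample perturbations and use each one with both signs.
    eps = sigma * rng.standard_normal((num_pairs, 3))
    thetas = np.concatenate([mu + eps, mu - eps])
    rewards = np.array([reward(theta) for theta in thetas])
    # mu <- mu + alpha / (N * sigma) * sum_n R_n * (theta_n - mu)
    n = len(thetas)
    mu = mu + alpha / (n * sigma) * (rewards[:, None] * (thetas - mu)).sum(axis=0)

final_distance = np.linalg.norm(mu - target)
print(initial_distance, final_distance)
```

The mean moves toward `target` even though the code never computes a derivative of the reward; it only compares the rewards of perturbed copies.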

TODO(rkn): Finish the explanation.

In [None]:
alpha = 1e-2

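def update_mean_policy(mean_policy, perturbed_policies, rewards):
    batch_size = len(perturbed_policies)
    # Measure every perturbation against a fixed copy of the old mean, since
    # mean_policy is modified in place inside the loop.
    old_mean = {key: value.copy() for key, value in mean_policy.items()}
    for policy, reward in zip(perturbed_policies, rewards):
        for key in mean_policy:
            mean_policy[key] += (alpha / (batch_size * sigma)) * (policy[key] - old_mean[key]) * reward
    return mean_policy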

**Exercise:** Define a remote function that takes the "mean policy" $\mu$, generates the policies $\mu + \epsilon$ and $\mu - \epsilon$, where $\epsilon \sim \mathcal N(0, \sigma^2 I)$, and returns these two policies along with the average reward that each obtains over $N$ rollouts.

In [None]:
@ray.remote
def evaluate_perturbed_policies(mean_policy, N):
    # This function should do the following:
    # - perturb the mean policy to generate two new policies
    # - create a gym environment
    # - do rollouts with the two policies (see the rollout_policy function
    #   from the first notebook)
    # - return the two policies and the average rewards from the rollouts
    raise NotImplementedError
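
The comments above refer to a `rollout_policy` function from the first notebook, which is not reproduced here. As a rough sketch of what such a helper might look like, the code below assumes the classic Gym interface, in which `env.reset()` returns a state and `env.step(action)` returns `(state, reward, done, info)`; newer Gym and Gymnasium releases return extra values, so adapt it to the installed version. `CountdownEnv` is a made-up stand-in environment so that the example runs without Gym, and the signature of `rollout_policy` here is an assumption, not necessarily the one used in the first notebook.

```python
import numpy as np

def compute_action(policy, state):
    hiddens = np.maximum(np.dot(policy["weights1"], state) + policy["biases1"], 0)
    output = np.dot(policy["weights2"], hiddens) + policy["biases2"]
    probability_of_0 = 1 / (1 + np.exp(-output[0]))
    return 0 if np.random.uniform(0, 1) < probability_of_0 else 1

class CountdownEnv:
    """Made-up stand-in for a Gym environment: ends after `horizon` steps."""
    def __init__(self, horizon=5):
        self.horizon = horizon
        self.steps = 0

    def reset(self):
        self.steps = 0
        return np.zeros(4)

    def step(self, action):
        # Reward of 1 per step, regardless of the action taken.
        self.steps += 1
        done = self.steps >= self.horizon
        return np.zeros(4), 1.0, done, {}

def rollout_policy(env, policy):
    # Run one episode and return the cumulative reward.
    state = env.reset()
    done = False
    cumulative_reward = 0.0
    while not done:
        action = compute_action(policy, state)
        state, reward, done, _ = env.step(action)
        cumulative_reward += reward
    return cumulative_reward

zero_policy = {"weights1": np.zeros((10, 4)), "biases1": np.zeros(10),
               "weights2": np.zeros((1, 10)), "biases2": np.zeros(1)}
total = rollout_policy(CountdownEnv(horizon=5), zero_policy)
print(total)
```

Inside the remote function, a helper like this would be called `N` times for each of the two perturbed policies and the results averaged.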

**Exercise:** Using the `evaluate_perturbed_policies` remote function, implement the natural evolution strategies algorithm.

**Note:** If it doesn't appear to be learning, try the following.
- Debug it using the test environments created in an earlier notebook.
- Print the magnitudes of the weights and the gradients to see if anything is too big or too small.

In [None]:
mean_policy = initial_policy()

num_iters = 100
for _ in range(num_iters):
    # Run the remote function a bunch of times and get the results.
    raise NotImplementedError

    # Collect the results into a big list of perturbed policies and a
    # list of the corresponding rewards.
    raise NotImplementedError

    # Update the mean_policy.
    raise NotImplementedError
    
    # Print the current average reward.
    raise NotImplementedError

**Note about efficiency:** In the above example, we're returning the entire perturbed policies. In the cluster setting, this may require shipping fairly large parameter vectors across the network, which can be expensive. This turns out to be unnecessary: because the perturbations are generated randomly, it suffices to ship the seed that was used to generate them so that they can be regenerated on the other side. This strategy can eliminate nearly all of the communication required by this algorithm.
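
A minimal sketch of the seed idea follows. The function name `perturbation_from_seed` and the use of `np.random.RandomState` are choices made for this illustration; the point is only that the same seed and the same array shapes reproduce the same perturbation, so a worker could return `(seed, reward)` instead of full parameter arrays.

```python
import numpy as np

sigma = 1e-2

def perturbation_from_seed(seed, shapes):
    # Hypothetical helper: rebuild a perturbation for every parameter array
    # from a single integer seed.
    rng = np.random.RandomState(seed)
    return {key: sigma * rng.normal(size=shape) for key, shape in shapes.items()}

shapes = {"weights1": (10, 4), "biases1": (10,),
          "weights2": (1, 10), "biases2": (1,)}

# Worker side: draw the perturbation and report only the seed.
seed = 1234
worker_perturbation = perturbation_from_seed(seed, shapes)

# Driver side: regenerate the identical perturbation from the seed.
driver_perturbation = perturbation_from_seed(seed, shapes)

same = all(np.array_equal(worker_perturbation[k], driver_perturbation[k])
           for k in shapes)
print(same)
```

For this to work, both sides must draw the arrays in the same order and with the same shapes, which is why the sketch iterates over one shared `shapes` dictionary.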