In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gym
import numpy as np
import ray

ray.init()

## Derivative Free Optimization

The goal of reinforcement learning is to find a policy $\pi$ that solves the following optimization problem.

\begin{equation}
\max_{\pi} \mathbb{E}\left[\sum_{t=0}^T R_t\right]
\end{equation}

Here, $R_t$ is the reward received at time $t$ when acting according to the policy $\pi$, and $T$ is the time at which the episode ends. If the environment or the policy is stochastic, then each $R_t$ is a random variable, and so is $T$, which is why the objective is written as an expectation. Both $R_t$ and $T$ depend on $\pi$.
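Because each $R_t$ is random, a single episode only gives a noisy sample of $\sum_{t=0}^T R_t$. The sketch below illustrates one way to estimate the typical value of that sum by averaging over many episodes. The environment here is a hypothetical toy (not CartPole): each step yields a reward of 1, and a policy is summarized only by the assumed probability `continue_prob` that the episode survives each step.

```python
import numpy as np


def sample_return(continue_prob, rng, max_steps=200):
    """Sample the total reward of one episode in a toy environment.

    Assumption for illustration: every step yields reward 1, after which
    the episode continues with probability `continue_prob`.
    """
    total_reward = 0.0
    for _ in range(max_steps):
        total_reward += 1.0
        if rng.random() >= continue_prob:
            break  # The episode ended, so T is random too.
    return total_reward


def estimate_expected_return(continue_prob, num_episodes, seed=0):
    """Average the sampled returns over many independent episodes."""
    rng = np.random.default_rng(seed)
    returns = [sample_return(continue_prob, rng) for _ in range(num_episodes)]
    return float(np.mean(returns))


# Two hypothetical policies: one keeps the episode alive more often.
cautious = estimate_expected_return(0.9, num_episodes=5000)
reckless = estimate_expected_return(0.5, num_episodes=5000)
print(cautious, reckless)
```

The individual returns vary a lot from episode to episode, but the averages are stable enough to compare the two policies.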

The setup resembles supervised learning in that both settings minimize or maximize some objective function. In supervised learning, however, we often have an explicit formula for the objective function in terms of the parameters of interest, which lets us compute its gradient symbolically. So in supervised learning, we can often apply gradient descent directly to optimize the objective.
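As a concrete instance of the supervised case, here is a minimal sketch of gradient descent on a least-squares objective, where the gradient has a closed form. The data, the step size of 0.1, and the names `loss` and `grad` are all illustrative assumptions, not part of this notebook's exercises.

```python
import numpy as np

# Synthetic, noiseless regression data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w


def loss(w):
    """Mean squared error, an explicit formula in the parameters w."""
    return np.mean((X @ w - y) ** 2)


def grad(w):
    """Gradient of the mean squared error, derived symbolically."""
    return 2.0 * X.T @ (X @ w - y) / len(y)


# Plain gradient descent using the explicit gradient.
w = np.zeros(3)
for _ in range(500):
    w -= 0.1 * grad(w)

print(w)
print(loss(w))
```

Nothing comparable to `grad` is available when the objective is only observable by running episodes in an environment.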

In reinforcement learning, we often do not have an explicit formula for the reward function that we are trying to optimize, and so we can't easily compute gradients. For example, imagine an environment in which a robot walks until it falls over and the reward is the distance that the robot walked before it fell over. Computing the gradient of that reward with respect to the parameters of the robot's policy is not straightforward.

The difficulty of computing explicit gradients motivates the use of **derivative free optimization**. We will work through some examples below.
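One of the simplest derivative-free methods is random search: sample candidate parameters, evaluate each one, and keep the best. The sketch below assumes a hypothetical `black_box_reward` function standing in for "run the policy and measure the reward". The optimizer only ever calls it and never looks at its gradient.

```python
import numpy as np


def black_box_reward(params):
    """Hypothetical stand-in for a reward obtained by running a policy.

    The optimizer below treats this as opaque: it only sees the output.
    """
    target = np.array([0.5, -1.0])
    return -np.sum((params - target) ** 2)


def random_search(reward_fn, dim, num_candidates, seed=0):
    """Sample random parameter vectors and return the best one found."""
    rng = np.random.default_rng(seed)
    best_params, best_reward = None, -np.inf
    for _ in range(num_candidates):
        candidate = rng.normal(size=dim)
        reward = reward_fn(candidate)
        if reward > best_reward:
            best_params, best_reward = candidate, reward
    return best_params, best_reward


_, best_of_10 = random_search(black_box_reward, dim=2, num_candidates=10)
_, best_of_1000 = random_search(black_box_reward, dim=2, num_candidates=1000)
print(best_of_10, best_of_1000)
```

Evaluating more candidates can only improve the best reward found, at the cost of more evaluations. Since the evaluations are independent of one another, they are natural candidates for running in parallel.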

The class below is a policy that chooses an action using a randomly-generated two-layer neural net.

In [None]:
class TwoLayerPolicy(object):
    """A randomly initialized two-layer neural net that outputs action 0 or 1."""

    def __init__(self, num_inputs, num_hiddens, num_outputs=1):
        self.num_inputs = num_inputs
        self.num_hidden_units = num_hiddens
        self.num_outputs = num_outputs
        # Draw all weights and biases from a standard normal distribution.
        self.weights1 = np.random.normal(size=(num_hiddens, num_inputs))
        self.biases1 = np.random.normal(size=num_hiddens)
        self.weights2 = np.random.normal(size=(num_outputs, num_hiddens))
        self.biases2 = np.random.normal(size=num_outputs)

    def __call__(self, state):
        # Hidden layer with a ReLU nonlinearity.
        hiddens = np.maximum(np.dot(self.weights1, state) + self.biases1, 0)
        # Linear output layer.
        output = np.dot(self.weights2, hiddens) + self.biases2
        assert output.size == 1
        # Threshold the single output to choose one of the two actions.
        return 0 if output[0] < 0 else 1

policy = TwoLayerPolicy(4, 5)
# You can get an action by applying the policy to a state.
action = policy(np.random.normal(size=4))
print(action)

**Exercise:** Using Ray, define a remote function that generates a random `TwoLayerPolicy`, performs 10 rollouts using a CartPole environment, and returns the average reward over those rollouts along with the policy.

**Note:** You may want to copy over the function `rollout_policy` from an earlier notebook to use as a helper function.
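If you do not have that helper at hand, the following is a sketch of what such a function might look like. It is not necessarily the version from the earlier notebook. It assumes the classic gym interface, in which `reset()` returns a state and `step(action)` returns a `(state, reward, done, info)` tuple; newer gym releases return additional values from both calls. `ToyEnv` is a hypothetical stand-in used only so that the sketch runs without gym.

```python
import numpy as np


class ToyEnv(object):
    """Hypothetical environment mimicking the classic gym interface."""

    def __init__(self, max_steps=200):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0
        return np.zeros(4)

    def step(self, action):
        self.steps += 1
        # The episode ends on action 0 or when the step cap is reached.
        done = action == 0 or self.steps >= self.max_steps
        return np.zeros(4), 1.0, done, {}


def rollout_policy(env, policy):
    """Run one episode and return the cumulative reward."""
    state = env.reset()
    done = False
    cumulative_reward = 0
    while not done:
        action = policy(state)
        state, reward, done, _ = env.step(action)
        cumulative_reward += reward
    return cumulative_reward


def always_one(state):
    return 1


def always_zero(state):
    return 0


print(rollout_policy(ToyEnv(), always_one))
print(rollout_policy(ToyEnv(), always_zero))
```

Any callable that maps a state to an action can be passed as `policy`, including an instance of `TwoLayerPolicy`.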

In [None]:
@ray.remote
def evaluate_random_policy(num_rollouts):
    # This function should do the following.
    # - generate a TwoLayerPolicy
    # - create a CartPole environment with gym.make('CartPole-v0')
    # - do num_rollouts rollouts (perhaps using one of the functions
    #   defined in a previous notebook)
    # - return the average reward and the policy
    raise NotImplementedError


# The remote function returns the average reward first, then the policy.
average_reward, policy = ray.get(evaluate_random_policy.remote(10))
print(policy)
print(average_reward)

**Exercise:** Using the `evaluate_random_policy` remote function, evaluate 100 randomly generated policies. Keep the best policy and make a note of its score. Try taking the best of 1000.

**Note:** The best possible score is 200, because `CartPole-v0` gives a reward of 1 per time step and ends each episode after at most 200 steps.

In [None]:
# Evaluate 100 randomly generated policies.
raise NotImplementedError

# Print the best score obtained.
raise NotImplementedError