# RL Exercise 6 - Training with Ray and Serving with Clipper

**GOAL:** Show how to train a policy with Ray and deploy it with Clipper in a fun, interactive way.

We will train an agent to play Pong, and then we will play Pong against the policy that we trained.

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gym
import pong_py
import ray

from ray.rllib.ppo import PPOAgent, DEFAULT_CONFIG

Start Ray.

In [None]:
ray.init(num_workers=0)

The cell below is a hack. Inside the `PPOAgent` constructor, a number of actors are created, and these actors instantiate gym environments by calling `gym.make('PongJS-v0')`. `gym.make` knows how to instantiate the pre-defined environments that ship with the `gym` module. The `PongJS-v0` environment, however, is defined in the `pong_py` module and is registered with `gym` only when the `import pong_py` statement runs.

Therefore, for the actors to instantiate the gym environments successfully, the `pong_py` module must be imported on the actors. This is why we define a remote function `import_pong_py` that closes over the `pong_py` module. When the actors are created, that remote function is unpickled on them, which forces the `pong_py` module to be imported and in turn lets the `gym` module create the `PongJS-v0` environment.

In [None]:
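# This is a hack. Referencing pong_py in the function body makes the remote
# function close over the module, so unpickling it on a worker imports pong_py.
@ray.remote
def import_pong_py():
    pong_py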
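The import-on-unpickle behavior that the hack relies on can be illustrated with the standard library alone. The sketch below is an analogy, not a description of Ray's internals: it uses plain `pickle` rather than Ray's own serialization, and the stdlib module `colorsys` stands in for `pong_py`. Removing the module from `sys.modules` simulates a fresh worker process that has not imported it yet.

```python
import colorsys
import pickle
import sys


def unpickling_imports_module():
    """Return whether `colorsys` is loaded before and after unpickling."""
    # Module-level functions are pickled by reference: the payload holds the
    # module name and attribute name, not the function's code.
    payload = pickle.dumps(colorsys.rgb_to_hsv)

    # Simulate a fresh worker process that has not imported the module yet.
    original_module = sys.modules.pop("colorsys")
    loaded_before = "colorsys" in sys.modules

    # Resolving the reference forces the module to be imported.
    pickle.loads(payload)
    loaded_after = "colorsys" in sys.modules

    # Restore the original module object so the sketch has no side effects.
    sys.modules["colorsys"] = original_module
    return loaded_before, loaded_after


print(unpickling_imports_module())
```

In the notebook, the import of `pong_py` on the worker is what triggers the environment registration with `gym`.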

Instantiate an agent that can be trained using Proximal Policy Optimization (PPO).

In [None]:
config = DEFAULT_CONFIG.copy()
# Consider using more workers to speed up the rollouts.
config['num_workers'] = 3
config['gamma'] = 0.99
config['sgd_stepsize'] = 5e-3
config['kl_coeff'] = 0.1
config['num_sgd_iter'] = 20
config['sgd_batchsize'] = 8196
config['observation_filter'] = 'NoFilter'
config['model']['fcnet_hiddens'] = [32, 32]

agent = PPOAgent('PongJS-v0', config)
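Of the settings above, `gamma` is the discount factor: it controls how much rewards far in the future count relative to immediate ones. The sketch below shows the standard definition of a discounted return as a small worked example. It is not RLlib's code, and the helper name `discounted_return` is made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    total = 0.0
    # Iterate backwards so each step folds in the discounted future return.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A single reward of 1 that arrives 100 steps in the future is worth
# 0.99**100 of its face value when gamma is 0.99.
delayed_reward = [0.0] * 100 + [1.0]
print(discounted_return(delayed_reward, 0.99))
```

A `gamma` closer to 1 makes the agent value a point scored many steps later almost as much as one scored immediately, which matters in Pong because rallies are long.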

Train the `PPOAgent` for some number of iterations.

**EXERCISE:** You will need to experiment with the number of iterations as well as with the configuration to get the agent to learn something reasonable.

In [None]:
for i in range(2):
    result = agent.train()

Use the agent manually by calling `agent.compute_action` and check that the rewards you get are consistent with the rewards printed during training.

In [None]:
env = gym.make('PongJS-v0')

for _ in range(20):
    state = env.reset()
    done = False
    cumulative_reward = 0

    while not done:
        action = agent.compute_action(state)
        state, reward, done, _ = env.step(action)
        cumulative_reward += reward

    print(cumulative_reward)
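To compare against the training output, it can help to average the per-episode rewards rather than read them off one by one. The following is a minimal sketch of that idea. The helpers `run_episode` and `mean_reward` are hypothetical names, and `CountdownEnv` is a made-up stand-in environment so the block runs without `gym`; it only mimics the `reset`/`step` interface used in the cell above.

```python
def run_episode(env, policy):
    """Play one episode and return the cumulative reward."""
    state = env.reset()
    done = False
    cumulative_reward = 0.0
    while not done:
        state, reward, done, _ = env.step(policy(state))
        cumulative_reward += reward
    return cumulative_reward


def mean_reward(env, policy, num_episodes):
    """Average the cumulative reward over several episodes."""
    rewards = [run_episode(env, policy) for _ in range(num_episodes)]
    return sum(rewards) / len(rewards)


class CountdownEnv(object):
    """Hypothetical stand-in environment: pays +1 per step for 5 steps."""

    def reset(self):
        self.remaining = 5
        return self.remaining

    def step(self, action):
        self.remaining -= 1
        return self.remaining, 1.0, self.remaining == 0, {}


print(mean_reward(CountdownEnv(), lambda state: 0, 3))
```

With the notebook's objects, the equivalent call would be `mean_reward(env, agent.compute_action, 20)`.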

Checkpoint the agent so that the relevant model can be saved and deployed to Clipper. We save the name of the checkpoint file in `metadata.json` so the model container knows how to restore the policy checkpoint.

In [None]:
import os
import json
checkpoint_path = agent.save()
checkpoint_dir = os.path.dirname(checkpoint_path)
checkpoint_file = os.path.basename(checkpoint_path)
with open(os.path.join(checkpoint_dir, "metadata.json"), "w") as f:
    json.dump({"checkpoint": checkpoint_file}, f)
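The model container's code is not shown in this notebook, so the following is only a sketch of how a consumer of this directory might locate the checkpoint from `metadata.json`. The function name `resolve_checkpoint` and the file name `checkpoint-2` are assumptions made for illustration.

```python
import json
import os
import tempfile


def resolve_checkpoint(model_dir):
    """Return the full path of the checkpoint named in metadata.json."""
    with open(os.path.join(model_dir, "metadata.json")) as f:
        metadata = json.load(f)
    return os.path.join(model_dir, metadata["checkpoint"])


def demo():
    """Write a metadata file into a throwaway directory and resolve it."""
    model_dir = tempfile.mkdtemp()
    with open(os.path.join(model_dir, "metadata.json"), "w") as f:
        # "checkpoint-2" is a placeholder file name, not a real checkpoint.
        json.dump({"checkpoint": "checkpoint-2"}, f)
    return resolve_checkpoint(model_dir)


print(os.path.basename(demo()))
```

Storing only the file name, rather than the absolute path, keeps the metadata valid after the directory is copied into a container at a different location.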

## Play Against the Policy

In this section, we will play Pong against the policy that we just trained. The game will be played in your browser, and the policy that we trained will be served by Clipper.

**EXERCISE:** Deploy your policy using Clipper. Follow the instructions that get printed below to play Pong against the deployed policy. You'll need to deploy all of the data that is saved in the directory `os.path.dirname(checkpoint_path)`.

Start by importing the `clipper_admin` library and use that to create a new Clipper instance to serve the policy.

When you create your `ClipperConnection`, you need to tell it how to communicate with the Docker service and Clipper. You can use the following command to get the Docker IP address. Use that address when you create your `ClipperConnection` in the next step.

In [None]:
!./get_docker_ip.sh

In [None]:
# Set the value of this variable to the IP address output by the ./get_docker_ip.sh command.
docker_ip = ""


In [None]:
# Make logging work correctly in the Jupyter notebook
import logging
import sys
logger = logging.getLogger()
logger.setLevel(logging.INFO)

from clipper_admin import DockerContainerManager, ClipperConnection
clipper_conn = ClipperConnection(DockerContainerManager(docker_ip_address=docker_ip))
clipper_conn.start_clipper()

Next, deploy the saved policy checkpoint to Clipper. The policy will run in a Docker container we created for this exercise.

**TODO(crankshaw):** link to model container code once it's on github.

In [None]:
import os
model_name = "pong-policy"
app_name = "pong"
clipper_conn.build_and_deploy_model(
    name=model_name,
    version=1,
    input_type="doubles",
    model_data_path=os.path.dirname(checkpoint_path),
    base_image="clipper/risecamp-pong-container"
)

Finally, register a Clipper application and link it to the deployed policy model.

In [None]:
app_name = "pong"
clipper_conn.register_application(name=app_name, default_output="0", input_type="doubles", slo_micros=100000)
clipper_conn.link_model_to_app(app_name=app_name, model_name=model_name)

Now that you have deployed your policy to Clipper, you will start a Pong application that will let you play against your policy in the browser.

When you start the application, you need to tell it where Clipper is running so that the Pong application can request predictions from Clipper. `ClipperConnection` provides the `get_query_addr()` method, which returns the IP address and port on which Clipper listens for incoming prediction requests.

In [None]:
clipper_addr = clipper_conn.get_query_addr()
print("Clipper address: {}".format(clipper_addr))
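For orientation, a prediction request to Clipper is an HTTP POST to the application's `predict` endpoint with a JSON body containing an `input` field. The sketch below only builds the pieces of such a request and does not send anything. The helper name `build_predict_request`, the example address, and the state values are placeholders; the actual observation layout of `PongJS-v0` is not shown in this notebook.

```python
import json


def build_predict_request(clipper_addr, app_name, state):
    """Return the URL, headers, and JSON body for a Clipper prediction."""
    url = "http://{}/{}/predict".format(clipper_addr, app_name)
    headers = {"Content-type": "application/json"}
    # The application was registered with input_type="doubles", so the
    # input is sent as a flat list of floats.
    body = json.dumps({"input": [float(x) for x in state]})
    return url, headers, body


# Placeholder address and state, for illustration only.
url, headers, body = build_predict_request("localhost:1337", "pong", [0.1, 0.2])
print(url)
print(body)
```

With a running Clipper instance, the request could be sent with an HTTP client of your choice, for example `requests.post(url, headers=headers, data=body)`.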

In [None]:
import subprocess32 as subprocess
server_handle = subprocess.Popen(["./start_webserver.sh", clipper_addr])
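The webserver keeps running in the background until it is stopped. The following is a sketch of one way to shut down a process started with `Popen`. The helper names are hypothetical, and a sleeping Python process stands in for `./start_webserver.sh`. Note that terminating a shell script does not necessarily terminate child processes it has spawned, so this may not be sufficient for the real script.

```python
import subprocess
import sys


def stop_process(handle, timeout=5):
    """Ask a child process to exit, killing it if it ignores the request."""
    if handle.poll() is None:  # The process is still running.
        handle.terminate()
        try:
            handle.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            handle.kill()
            handle.wait()
    return handle.returncode


def start_placeholder_server():
    """Start a long-running stand-in for the webserver process."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"])


placeholder = start_placeholder_server()
stop_process(placeholder)
print(placeholder.poll() is not None)
```

In the notebook, the equivalent call would be `stop_process(server_handle)`.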

Now go to port 3000 on the host that is running this notebook. If you're running locally, that's <http://localhost:3000>. If you're running on EC2, that's `http://<EC2_HOSTNAME>:3000`.

**TODO:** Figure out what address they should actually go to.

**TODO:** Instructions on how to deploy a new version of the policy to Clipper.