# RL Exercise 6 - Training with Ray and Serving with Clipper

**GOAL:** The goal of this exercise is to show how to train a policy with Ray and to deploy it with Clipper in a fun, interactive way.

We will train an agent to play Pong, and then we will play Pong against the policy that we trained.

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gym
import pong_py
import ray

from ray.tune.registry import register_env
from ray.rllib.agents import ppo

Start Ray.

In [None]:
ray.init()

Instantiate an agent that can be trained using Proximal Policy Optimization (PPO).

In [None]:
def env_creator(env_config):
    return pong_py.PongJSEnv()

register_env("my_env", env_creator)
trainer = ppo.PPOAgent(env="my_env", config={
    "env_config": {},  # config to pass to env creator
})

Train the `PPOAgent` for some number of iterations.

**EXERCISE:** You will need to experiment with the number of iterations as well as with the configuration to get the agent to learn something reasonable.

In [None]:
for i in range(2):
    result = trainer.train()
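
One way to experiment with the number of iterations is to stop as soon as the mean episode reward crosses a threshold instead of fixing the count up front. The sketch below assumes that `train()` returns a dict with an `episode_reward_mean` entry, which may differ between RLlib releases. `train_until` and `FakeTrainer` are hypothetical names introduced here so that the example runs without Ray.

```python
def train_until(trainer, target_reward, max_iters):
    """Call trainer.train() until the mean episode reward reaches target_reward.

    Returns the list of mean rewards seen, one per training iteration.
    """
    history = []
    for _ in range(max_iters):
        result = trainer.train()
        # Assumption: the result is a dict with an "episode_reward_mean" key.
        mean_reward = result["episode_reward_mean"]
        history.append(mean_reward)
        if mean_reward >= target_reward:
            break
    return history


class FakeTrainer(object):
    """Hypothetical stand-in for the PPO trainer; replays canned rewards."""

    def __init__(self, rewards):
        self._rewards = iter(rewards)

    def train(self):
        return {"episode_reward_mean": next(self._rewards)}


history = train_until(FakeTrainer([-20.0, -12.5, -3.0, 4.0, 9.0]),
                      target_reward=0.0, max_iters=10)
print(history)
```

In the notebook, passing the real `trainer` in place of `FakeTrainer` should give the same early-stopping behavior, provided the result format matches the assumption above.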

Use the agent manually by calling `trainer.compute_action` and check that the rewards you get are consistent with the rewards reported during training.

In [None]:
env = pong_py.PongJSEnv()

for _ in range(20):
    state = env.reset()
    done = False
    cumulative_reward = 0

    while not done:
        action = trainer.compute_action(state)
        state, reward, done, _ = env.step(action)
        cumulative_reward += reward

    print(cumulative_reward)
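
To compare against a training-time average, it can help to reduce the per-episode totals to a single mean. The following is a minimal sketch of that idea. `run_episode`, `mean_reward`, and `CountdownEnv` are hypothetical helpers written for illustration; the toy environment only mimics the `reset`/`step` interface used in the cell above.

```python
def run_episode(env, compute_action):
    """Play one episode and return its cumulative reward."""
    state = env.reset()
    done = False
    total = 0.0
    while not done:
        state, reward, done, _ = env.step(compute_action(state))
        total += reward
    return total


def mean_reward(env, compute_action, num_episodes):
    """Average the cumulative reward over num_episodes episodes."""
    rewards = [run_episode(env, compute_action) for _ in range(num_episodes)]
    return sum(rewards) / float(len(rewards))


class CountdownEnv(object):
    """Hypothetical toy environment: reward 1 per step, done after 3 steps."""

    def reset(self):
        self.steps_left = 3
        return self.steps_left

    def step(self, action):
        self.steps_left -= 1
        return self.steps_left, 1.0, self.steps_left == 0, {}


average = mean_reward(CountdownEnv(), lambda state: 0, num_episodes=20)
print(average)
```

With the real objects, the call would look like `mean_reward(env, trainer.compute_action, 20)`.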

Checkpoint the agent so that the relevant model can be saved and deployed to Clipper. We save the name of the checkpoint file in `metadata.json` so the model container knows how to restore the policy checkpoint.

In [None]:
import os
import json
checkpoint_path = trainer.save()
checkpoint_dir = os.path.dirname(checkpoint_path)
checkpoint_file = os.path.basename(checkpoint_path)
with open(os.path.join(checkpoint_dir, "metadata.json"), "w") as f:
    json.dump({"checkpoint": checkpoint_file}, f)
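
For intuition, here is a sketch of how a consumer of the checkpoint directory could use `metadata.json` to locate the checkpoint file. It is not the actual model container code. `resolve_checkpoint` is a hypothetical helper, and the checkpoint name `checkpoint-2` is a made-up placeholder.

```python
import json
import os
import tempfile


def resolve_checkpoint(model_dir):
    """Return the full path of the checkpoint named in model_dir/metadata.json."""
    with open(os.path.join(model_dir, "metadata.json")) as f:
        metadata = json.load(f)
    return os.path.join(model_dir, metadata["checkpoint"])


# Demonstrate with a throwaway directory and a made-up checkpoint name.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "metadata.json"), "w") as f:
    json.dump({"checkpoint": "checkpoint-2"}, f)

resolved = resolve_checkpoint(demo_dir)
print(os.path.basename(resolved))
```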

Next, initialize WAVE so that we can use it to encrypt the model. The boilerplate lives in the WAVE setup notebook (`wave-setup.ipynb`) if you want to read it.

In [None]:
%run wave-setup.ipynb

We combine the checkpoint directory into a single in-memory tar archive that can be encrypted with WAVE.

In [None]:
import tarfile
import io

model = io.BytesIO()
with tarfile.open(fileobj=model, mode="w:gz") as tar:
    tar.add(checkpoint_dir, arcname=os.path.basename(checkpoint_dir))

Now we encrypt it:

In [None]:
encrypt_response = wave.EncryptMessage(
    wv.EncryptMessageParams(
        # the namespace is the organization
        namespace=orgNamespaceEntity.hash,
        resource="models/pong",
        content=model.getvalue()))

In practice, we would now place the encrypted model somewhere the end users of the model can obtain it. Here we simply keep the ciphertext in a variable.

In [None]:
ciphertext = encrypt_response.ciphertext

Although the recipient is in the same notebook here, they use a distinct WAVE entity and were granted permission to decrypt through a WAVE delegation.

In [None]:
decrypt_response = wave.DecryptMessage(
    wv.DecryptMessageParams(
        perspective=wv.Perspective(
            entitySecret=wv.EntitySecret(DER=recipientEntity.SecretDER)),
        ciphertext=ciphertext,
        resyncFirst=True))
# Fail loudly if WAVE could not decrypt the message
if decrypt_response.error.code != 0:
    raise Exception(decrypt_response.error.message)

In [None]:
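decryptedpath = "outputmodel"
# The archive stores the checkpoint under the checkpoint directory's name,
# so the extracted files land in a subdirectory of decryptedpath.
decrypted_file = os.path.join(
    decryptedpath, os.path.basename(checkpoint_dir), checkpoint_file)
decryptedmodel = io.BytesIO(decrypt_response.content)
with tarfile.open(fileobj=decryptedmodel, mode="r:gz") as tar:
    tar.extractall(path=decryptedpath)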

The decrypted model is now available at `decrypted_file`.
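
The archive round trip can be checked in isolation, without WAVE. The sketch below packs a stand-in directory into an in-memory `tar.gz` and unpacks it again. It illustrates that, because the archive is created with `arcname` set to the directory's base name, the files are restored under a subdirectory of that name inside the extraction path. The directory and file names are made up for the demonstration.

```python
import io
import os
import tarfile
import tempfile

# Build a stand-in checkpoint directory with one file in it.
src_root = tempfile.mkdtemp()
src_dir = os.path.join(src_root, "checkpoint_demo")
os.mkdir(src_dir)
with open(os.path.join(src_dir, "metadata.json"), "w") as f:
    f.write('{"checkpoint": "checkpoint-2"}')

# Pack the directory into an in-memory gzip-compressed tar archive.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    tar.add(src_dir, arcname=os.path.basename(src_dir))

# Unpack from the raw bytes, as the recipient would after decryption.
dest_root = tempfile.mkdtemp()
with tarfile.open(fileobj=io.BytesIO(buf.getvalue()), mode="r:gz") as tar:
    names = tar.getnames()
    tar.extractall(path=dest_root)

# The file reappears under dest_root/checkpoint_demo/, not directly in dest_root.
restored = os.path.join(dest_root, "checkpoint_demo", "metadata.json")
print(sorted(names))
print(os.path.exists(restored))
```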

## Play Against the Policy

In this section, we will play Pong against the policy that we just trained. The game will be played in your browser, and the policy that we trained will be served by Clipper.

**EXERCISE:** Deploy your policy using Clipper. Follow the instructions that get printed below to play Pong against the deployed policy. You'll need to deploy all of the data that is saved in the directory `os.path.dirname(checkpoint_path)`.

Start by importing the `clipper_admin` library and use that to create a new Clipper instance to serve the policy.

When you create your `ClipperConnection`, you need to tell it how to communicate with the Docker service and Clipper. The cell below runs `./get_docker_ip.sh` to get the Docker IP address and passes that address to the `ClipperConnection`.

In [None]:
# Make logging work correctly in the Jupyter notebook
import logging
import sys
import subprocess32 as subprocess
logger = logging.getLogger()
logger.setLevel(logging.INFO)

from clipper_admin import DockerContainerManager, ClipperConnection
docker_ip = subprocess.check_output("./get_docker_ip.sh").strip()
clipper_conn = ClipperConnection(DockerContainerManager(docker_ip_address=docker_ip))

# Stop any Clipper instance that is still running from the earlier exercises
clipper_conn.stop_all()
clipper_conn.start_clipper()

Next, deploy the saved policy checkpoint to Clipper using a Docker image we created for this exercise (similar to the TensorFlow model container in the Clipper tutorial). If you're curious, you can find the custom model container code on [GitHub](https://github.com/ucbrise/risecamp/blob/077aa51078e2043d4d3d2d539e256c30c259678e/rl_and_pong/pong_model_container.py).

In [None]:
import os
model_name = "pong-policy"
app_name = "pong"
clipper_conn.build_and_deploy_model(
    name=model_name,
    version=1,
    input_type="doubles",
    model_data_path=os.path.dirname(decrypted_file),
    base_image="clipper/risecamp-pong-container"
)

Finally, register a Clipper application and link it to the deployed policy model.

In [None]:
app_name = "pong"
clipper_conn.register_application(name=app_name, default_output="0", input_type="doubles", slo_micros=100000)
clipper_conn.link_model_to_app(app_name=app_name, model_name=model_name)

Now that you have deployed your policy to Clipper, you will start a Pong application that will let you play against your policy in the browser.

When you start the application, you need to tell it where Clipper is running in order for the Pong application to request predictions from Clipper. `ClipperConnection` provides the `get_query_addr()` method to get the IP address and port on which Clipper is listening for incoming prediction requests.

In [None]:
clipper_addr = clipper_conn.get_query_addr()
print("Clipper address: {}".format(clipper_addr))
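
For reference, Clipper applications are queried over REST by sending a JSON body with an `input` field to `http://<query address>/<app name>/predict`. The sketch below only builds such a request and does not send it. `build_predict_request` is a hypothetical helper, and the address and state values are made-up placeholders; the state encoding that the Pong application actually sends is not shown in this notebook.

```python
import json


def build_predict_request(clipper_addr, app_name, state):
    """Return (url, headers, body) for a Clipper REST prediction request."""
    url = "http://{}/{}/predict".format(clipper_addr, app_name)
    headers = {"Content-Type": "application/json"}
    # The application was registered with input_type="doubles",
    # so every element is sent as a float.
    body = json.dumps({"input": [float(x) for x in state]})
    return url, headers, body


# The address and state below are made-up placeholders.
url, headers, body = build_predict_request("localhost:1337", "pong", [0.5, 0.25, 1])
print(url)
print(body)
```

An HTTP client can then POST `body` to `url` with the given headers.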

Now you can start the Pong webserver. It will print out the URL it's running on after it starts. Copy and paste that URL into your browser and press "1" to play against your policy!

In [None]:
import subprocess32 as subprocess
server_handle = subprocess.Popen(["./start_webserver.sh", clipper_addr], stdout=subprocess.PIPE)
print(server_handle.stdout.readline().strip())

## Deploy a New Policy

The first policy that you deploy probably won't be a very strong competitor, especially if you only trained it for a few iterations. Try training it for more iterations and deploying the new policy to Clipper. Clipper will automatically switch the Pong application to query the new version of the policy. You don't need to reload the page or even restart the game.

For your convenience, we've copied the relevant cells from above to train the policy for more iterations and deploy it to Clipper. You can run this cell as many times as you want. Don't forget to increment the version number of the model each time you deploy to Clipper.

In [None]:
# Train for more iterations
for i in range(50):
    result = trainer.train()

# Save the new policy
checkpoint_path = trainer.save()
checkpoint_dir = os.path.dirname(checkpoint_path)
checkpoint_file = os.path.basename(checkpoint_path)
with open(os.path.join(checkpoint_dir, "metadata.json"), "w") as f:
    json.dump({"checkpoint": checkpoint_file}, f)
    
# Deploy the new policy to Clipper.
clipper_conn.build_and_deploy_model(
    name=model_name,
    version=2, # If you run this more than once, don't forget to keep updating the version.
    input_type="doubles",
    model_data_path=os.path.dirname(checkpoint_path),
    base_image="clipper/risecamp-pong-container"
)
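
If you rerun the cell often, a small counter can take care of the version bump. This is a minimal sketch; `VersionCounter` is a hypothetical helper and not part of `clipper_admin`.

```python
import itertools


class VersionCounter(object):
    """Hands out increasing version numbers, one per deployment."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)

    def take(self):
        """Return the next unused version number."""
        return next(self._counter)


versions = VersionCounter(start=2)  # version 1 was deployed earlier
first, second = versions.take(), versions.take()
print(first, second)
```

Create the counter once in its own cell, then pass `version=versions.take()` to `build_and_deploy_model` on each run.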