# RL Exercise 6 - Training with Ray and Serving with Clipper

**GOAL:** Show how to train a policy with Ray and deploy it with Clipper in a fun, interactive way.

We will train an agent to play Pong, and then we will play Pong against the policy that we trained.

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gym
import pong_py
import ray

from ray.rllib.ppo import PPOAgent, DEFAULT_CONFIG

Start Ray.

In [2]:
ray.init(num_workers=0)

Waiting for redis server at 127.0.0.1:49181 to respond...
Waiting for redis server at 127.0.0.1:55556 to respond...
Starting local scheduler with 4 CPUs, 0 GPUs
View the web UI at http://localhost:8889/notebooks/ray_ui25886.ipynb


{'local_scheduler_socket_names': ['/tmp/scheduler85656775'],
 'node_ip_address': '127.0.0.1',
 'object_store_addresses': [ObjectStoreAddress(name='/tmp/plasma_store7176638', manager_name='/tmp/plasma_manager26229005', manager_port=56579)],
 'redis_address': '127.0.0.1:49181'}

The cell below is a hack. The `PPOAgent` constructor creates a number of actors, and each actor instantiates a gym environment by calling `gym.make('PongJS-v0')`. `gym.make` can only instantiate environments that have been registered with the `gym` module. The pre-defined environments that ship with `gym` are registered automatically, but `PongJS-v0` is defined in the `pong_py` module and is registered only when the `import pong_py` statement runs.

Therefore, for the actors to instantiate the environment successfully, the `pong_py` module must be imported on the actors. This is why we define a remote function `import_pong_py` that closes over the `pong_py` module. When the actors are created, the remote function is unpickled on them, which forces `pong_py` to be imported and in turn lets `gym` create the `PongJS-v0` environment.

In [3]:
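# This is a hack: referencing pong_py inside a remote function forces the
# module to be imported wherever the function is unpickled.
@ray.remote
def import_pong_py():
    pong_py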
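To make the registration side effect concrete, here is a minimal sketch of a toy registry. It is not gym's implementation; `register`, `make`, `ToyPong`, and `import_toy_pong` are hypothetical names used only for illustration. It shows that `make` fails in a process that has not run the registering import and succeeds once it has.

```python
# Toy stand-in for an environment registry (hypothetical names throughout).
_REGISTRY = {}


def register(env_id, constructor):
    """Record how to build the environment called env_id."""
    _REGISTRY[env_id] = constructor


def make(env_id):
    """Build a registered environment, or fail if nobody registered it."""
    if env_id not in _REGISTRY:
        raise KeyError("No registered env with id: " + env_id)
    return _REGISTRY[env_id]()


class ToyPong(object):
    """Placeholder environment; the real one lives in pong_py."""


def import_toy_pong():
    """Plays the role of `import pong_py`: registration is a side effect."""
    register("ToyPong-v0", ToyPong)


def try_make(env_id):
    """Return True if make() succeeds and False if the id is unknown."""
    try:
        make(env_id)
        return True
    except KeyError:
        return False


before_import = try_make("ToyPong-v0")  # nothing has been registered yet
import_toy_pong()
after_import = try_make("ToyPong-v0")   # registration has now happened
print(before_import, after_import)
```

Each Ray actor is a separate process with its own registry, which is why the import has to happen on the actors and not only in the notebook.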

Instantiate an agent that can be trained using Proximal Policy Optimization (PPO).

In [4]:
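import copy

# Deep-copy the defaults: a shallow .copy() would share the nested 'model'
# dict, so editing it below would also modify DEFAULT_CONFIG.
config = copy.deepcopy(DEFAULT_CONFIG)
config['num_workers'] = 3
config['num_sgd_iter'] = 20
config['sgd_batchsize'] = 8196
config['model']['fcnet_hiddens'] = [32, 32]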

agent = PPOAgent('PongJS-v0', config)

[2017-08-31 02:26:00,350] PPOAgent algorithm created with logdir '/tmp/ray/PongJS-v0_PPOAgent_2017-08-31_02-26-00Q8I340'
[2017-08-31 02:26:00,352] Making new env: PongJS-v0


Non-atari env, not using any observation preprocessor.
Constructing fcnet [32, 32] <function tanh at 0x114b75398>
Constructing fcnet [32, 32] <function tanh at 0x114b75398>
Constructing fcnet [32, 32] <function tanh at 0x114b75398>
Constructing fcnet [32, 32] <function tanh at 0x114b75398>
Constructing fcnet [32, 32] <function tanh at 0x114b75398>
Constructing fcnet [32, 32] <function tanh at 0x114b75398>
Constructing fcnet [32, 32] <function tanh at 0x114b75398>
Constructing fcnet [32, 32] <function tanh at 0x114b75398>
Constructing fcnet [32, 32] <function tanh at 0x114b75398>
Constructing fcnet [32, 32] <function tanh at 0x114b75398>


Train the `PPOAgent` for some number of iterations.

**EXERCISE:** You will need to experiment with the number of iterations as well as with the configuration to get the agent to learn something reasonable.

In [5]:
for i in range(2):
    result = agent.train()

===> iteration 1
total reward is  153.595505618
trajectory length mean is  152.595505618
timesteps: 40743
Computing policy (iterations=20, stepsize=5e-05):
           iter     total loss    policy loss        vf loss             kl        entropy
              0    5.18152e+03   -1.92394e-03    5.18152e+03    4.12476e-08    1.09860e+00
              1    5.18144e+03   -1.93869e-03    5.18145e+03    3.17493e-07    1.09860e+00
              2    5.18136e+03   -1.95562e-03    5.18137e+03    1.01237e-06    1.09860e+00
              3    5.18128e+03   -1.96921e-03    5.18128e+03    2.17570e-06    1.09860e+00
              4    5.18120e+03   -1.98596e-03    5.18120e+03    3.76647e-06    1.09860e+00
              5    5.18112e+03   -2.00005e-03    5.18112e+03    5.78781e-06    1.09859e+00
              6    5.18104e+03   -2.01340e-03    5.18104e+03    8.29189e-06    1.09859e+00
              7    5.18095e+03   -2.02955e-03    5.18095e+03    1.12526e-05    1.09859e+00
              8    5.1808
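One way to approach the exercise is to train until the mean reward reaches a target instead of fixing the iteration count up front. The sketch below assumes a hypothetical `train_step` callable that runs one training iteration and returns the mean episode reward. How that number is extracted from the result of `agent.train()` depends on the Ray version, so it is left out here, and the reward curve is made up purely to exercise the loop.

```python
def train_until(train_step, target_reward, max_iterations):
    """Call train_step() until it returns at least target_reward.

    Returns the list of rewards seen, one per iteration.
    """
    rewards = []
    for _ in range(max_iterations):
        rewards.append(train_step())
        if rewards[-1] >= target_reward:
            break
    return rewards


# Stand-in for a real training step: replays a made-up reward curve.
fake_curve = iter([150.0, 180.0, 240.0, 310.0, 400.0])
history = train_until(lambda: next(fake_curve),
                      target_reward=300.0,
                      max_iterations=10)
print(len(history), history[-1])
```

The `max_iterations` cap keeps the loop from running forever if the configuration never reaches the target.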

Use the agent manually by calling `agent.compute_action`, and check that the rewards you get are consistent with the rewards printed during training.

In [None]:
env = gym.make('PongJS-v0')

for _ in range(20):
    state = env.reset()
    done = False
    cumulative_reward = 0

    while not done:
        action = agent.compute_action(state)
        state, reward, done, _ = env.step(action)
        cumulative_reward += reward

    print(cumulative_reward)
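To compare against the mean reward printed during training, it can help to average the per-episode totals instead of reading them one by one. The helper below is a sketch of the same rollout loop wrapped in a function. `evaluate` and the stub `CountdownEnv` are hypothetical names, and the stub exists only so that the example runs without `gym` or a trained agent.

```python
def evaluate(env, choose_action, num_episodes):
    """Return the mean cumulative reward over num_episodes episodes."""
    totals = []
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        total = 0.0
        while not done:
            state, reward, done, _ = env.step(choose_action(state))
            total += reward
        totals.append(total)
    return sum(totals) / len(totals)


class CountdownEnv(object):
    """Hypothetical stub: each episode lasts `length` steps, reward 1 per step."""

    def __init__(self, length):
        self.length = length
        self.steps_left = 0

    def reset(self):
        self.steps_left = self.length
        return self.steps_left

    def step(self, action):
        self.steps_left -= 1
        return self.steps_left, 1.0, self.steps_left == 0, {}


mean_reward = evaluate(CountdownEnv(5), lambda state: 0, num_episodes=3)
print(mean_reward)
```

With the objects from this notebook, the equivalent call would be `evaluate(env, agent.compute_action, 20)`.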

Checkpoint the agent so that the relevant model can be saved and deployed to Clipper.

In [6]:
checkpoint_path = agent.save()

INFO:tensorflow:/tmp/ray/PongJS-v0_PPOAgent_2017-08-31_02-26-00Q8I340/checkpoint-2 is not in all_model_checkpoint_paths. Manually adding it.


[2017-08-31 02:26:53,866] /tmp/ray/PongJS-v0_PPOAgent_2017-08-31_02-26-00Q8I340/checkpoint-2 is not in all_model_checkpoint_paths. Manually adding it.


## Play Against the Policy

In this section, we will play Pong against the policy that we just trained. The game will be played in your browser, and the policy that we trained will be served by Clipper.

**EXERCISE:** Deploy your policy using Clipper. Follow the instructions that get printed below to play Pong against the deployed policy. You'll need to deploy all of the data that is saved in the directory `os.path.dirname(checkpoint_path)`.

Start by importing the `clipper_admin` library and using it to create a new Clipper instance to serve the policy.

In [12]:
# Make logging work correctly in the Jupyter notebook
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

from clipper_admin import DockerContainerManager, ClipperConnection
clipper_conn = ClipperConnection(DockerContainerManager())
clipper_conn.start_clipper(query_frontend_image="clipper/query_frontend:cors")

[2017-08-31 02:32:48,941] Starting managed Redis instance in Docker
[2017-08-31 02:32:51,060] Clipper is running


Next, deploy the saved policy checkpoint to Clipper. The policy will run in a Docker container we created for this exercise.

**TODO(crankshaw):** link to model container code once it's on github.

In [13]:
import os
model_name = "pong-policy"
app_name = "pong"
clipper_conn.build_and_deploy_model(
    name=model_name,
    version=1,
    input_type="doubles",
    model_data_path=os.path.dirname(checkpoint_path),
    base_image="clipper/risecamp-pong-container",
    force=True)

[2017-08-31 02:32:57,841] Found existing Dockerfile in /tmp/ray/PongJS-v0_PPOAgent_2017-08-31_02-26-00Q8I340. This file will be overwritten
[2017-08-31 02:32:57,845] Building model Docker image with model data from /tmp/ray/PongJS-v0_PPOAgent_2017-08-31_02-26-00Q8I340
[2017-08-31 02:32:58,265] Pushing model Docker image to pong-policy:1
[2017-08-31 02:32:59,693] Found 0 replicas for pong-policy:1. Adding 1
[2017-08-31 02:33:00,370] Successfully registered model pong-policy:1
[2017-08-31 02:33:00,371] Done deploying model pong-policy:1.


Finally, register a Clipper application and link it to the deployed policy model.

In [14]:
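app_name = "pong"
# default_output is returned when a prediction misses the latency objective
# (slo_micros=100000 is 100 ms).
clipper_conn.register_application(name=app_name, default_output="0", input_type="doubles", slo_micros=100000)
clipper_conn.link_model_to_app(app_name=app_name, model_name=model_name)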

[2017-08-31 02:33:04,402] Application pong was successfully registered
[2017-08-31 02:33:04,417] Model pong-policy is now linked to application pong


Now you can play Pong against the policy in the browser.

**TODO(crankshaw):** Once the JavaScript Pong works, print out the address that users need to copy and paste to point the JS at the right Clipper instance.

In [15]:
clipper_conn.get_query_addr()

'localhost:1337'
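Before opening the browser, the deployed policy can also be queried directly over Clipper's REST interface, which accepts a JSON body with an `input` field at `/<app_name>/predict`. The following is a minimal sketch that assumes Python 3 and the address and application name used above. `build_predict_request` and `query_policy` are hypothetical helper names, and the state vector is a placeholder whose length must match what `PongJS-v0` actually emits.

```python
import json
from urllib.request import Request, urlopen


def build_predict_request(query_addr, app_name, state):
    """Build a POST request for Clipper's /<app_name>/predict endpoint."""
    url = "http://{}/{}/predict".format(query_addr, app_name)
    body = json.dumps({"input": [float(x) for x in state]}).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"})


def query_policy(query_addr, app_name, state):
    """Send one state to Clipper and return the decoded JSON response."""
    request = build_predict_request(query_addr, app_name, state)
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder state; only the request is built here, nothing is sent.
request = build_predict_request("localhost:1337", "pong", [0.0, 0.5, 1.0])
print(request.full_url)
```

Calling `query_policy(clipper_conn.get_query_addr(), app_name, state)` with a real state from `env.reset()` would send the request to the running Clipper instance.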