# RL Exercise 5 - Training and Serving a Policy with Ray

**GOAL:** Show how to train a policy with Ray and then serve it from an actor, so that you can play against it interactively.

We will train an agent to play Pong, and then we will play Pong against the policy that we trained.

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gym
import pong_py  # If this line fails, you need to do "pip install utilities/pong_py".
import ray

from ray.rllib.agents.ppo import PPOAgent, DEFAULT_CONFIG
from ray.tune.registry import register_env

Start Ray.

In [None]:
ray.init(num_cpus=36, ignore_reinit_error=True)

**EXERCISE:** Run the cell below to instantiate an agent that can be trained using Proximal Policy Optimization (PPO).

In [None]:
config = DEFAULT_CONFIG.copy()
# Consider using more workers to speed up the rollouts.
config['num_workers'] = 8
config['gamma'] = 0.99
config['sgd_stepsize'] = 5e-3
config['kl_coeff'] = 0.1
config['num_sgd_iter'] = 20
config['timesteps_per_batch'] = 8196 * 4
config['sgd_batchsize'] = 8196
config['observation_filter'] = 'NoFilter'
config['model']['fcnet_hiddens'] = [32, 32]

def env_creator(env_config):
    import pong_py
    return pong_py.PongJSEnv()

register_env('pong_env', env_creator)

agent = PPOAgent(env='pong_env', config=config)
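
As a rough sanity check on the batch settings above, the sketch below works out how many minibatch updates one training iteration implies. It assumes that each of the `num_sgd_iter` passes splits the collected batch into minibatches of `sgd_batchsize`; that is a simplification and may not match every RLlib version. The helper name `sgd_steps_per_iteration` is hypothetical.

```python
def sgd_steps_per_iteration(timesteps_per_batch, sgd_batchsize, num_sgd_iter):
    """Estimate minibatch updates per training iteration (assumed model)."""
    # Minibatches needed to cover the collected batch once (ceiling division).
    minibatches_per_pass = -(-timesteps_per_batch // sgd_batchsize)
    return minibatches_per_pass * num_sgd_iter


steps = sgd_steps_per_iteration(
    timesteps_per_batch=8196 * 4, sgd_batchsize=8196, num_sgd_iter=20)
print(steps)
```

Under that assumption, raising `timesteps_per_batch` or `num_sgd_iter` increases the optimization work per call to `agent.train()`, which is worth keeping in mind when comparing iteration counts across configurations.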

## Train the Policy

Train the `PPOAgent` for some number of iterations.

**EXERCISE:** Experiment with the number of iterations and with the configuration until the agent learns something reasonable. Training may take around 30 iterations.

**NOTE:** You may want to skip this cell for now and try playing against the untrained policy. After you've done that, you can return to this cell, train the policy more, and serve the updated policy.

In [None]:
for i in range(2):
    result = agent.train()
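
One way to experiment with the number of iterations is to stop once a reward target is reached instead of fixing the count up front. The sketch below is generic: `train_step` is a hypothetical callable that runs one training iteration and returns a mean episode reward (how you read that value from the result of `agent.train()` depends on your RLlib version), and a stub stands in for the agent so the example runs on its own.

```python
def train_until(train_step, target_reward, max_iterations):
    """Call train_step until the reward target or the iteration cap is hit."""
    history = []
    for _ in range(max_iterations):
        reward = train_step()
        history.append(reward)
        if reward >= target_reward:
            break
    return history


def make_stub_trainer(start=-2.0, gain=0.5):
    """Hypothetical stand-in whose mean reward improves by `gain` per call."""
    state = {'reward': start}

    def step():
        state['reward'] += gain
        return state['reward']

    return step


history = train_until(make_stub_trainer(), target_reward=0.0, max_iterations=30)
print(len(history), history[-1])
```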

Use the agent manually by calling `agent.compute_action`, and check that the rewards you get are consistent with the rewards reported during training.

In [None]:
env = env_creator({})

for _ in range(20):
    state = env.reset()
    done = False
    cumulative_reward = 0

    while not done:
        action = agent.compute_action(state)
        state, reward, done, _ = env.step(action)
        cumulative_reward += reward

    print(cumulative_reward)
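
The loop above prints one return per episode; averaging those returns makes the comparison with training statistics easier. A minimal sketch follows, in which a hypothetical `CountdownEnv` and a fixed policy stand in for `PongJSEnv` and `agent.compute_action`. Both are assumed to follow the same `reset`/`step` interface used above.

```python
class CountdownEnv:
    """Toy stand-in environment: an episode lasts `length` steps."""

    def __init__(self, length=3):
        self.length = length
        self.remaining = length

    def reset(self):
        self.remaining = self.length
        return self.remaining

    def step(self, action):
        self.remaining -= 1
        reward = 1.0 if action == 1 else 0.0
        done = self.remaining == 0
        return self.remaining, reward, done, {}


def mean_episode_reward(env, policy, num_episodes):
    """Roll out `policy` for `num_episodes` and average the episode returns."""
    totals = []
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        total = 0.0
        while not done:
            state, reward, done, _ = env.step(policy(state))
            total += reward
        totals.append(total)
    return sum(totals) / len(totals)


average = mean_episode_reward(CountdownEnv(), policy=lambda state: 1, num_episodes=20)
print(average)
```

With the real agent, `policy` would be `agent.compute_action` and `env` would come from `env_creator({})`.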

Checkpoint the agent so that the relevant model can be saved and served by the actor webserver.

In [None]:
checkpoint_path = agent.save()

## Play Against the Policy

Below, we've provided an actor that runs a very simple webserver to serve the policy. The webserver responds to POST requests from the JavaScript game and returns the actions that drive the AI opponent.

To play against the policy you trained, start the webserver by running the cell below. Instructions will be printed.

In [None]:
import cgi
from http.server import BaseHTTPRequestHandler, HTTPServer
import json
import requests
import socketserver
import subprocess
import threading


# Check that we haven't already started the server.
try:
    server
except NameError:
    pass
else:
    raise Exception('The policy server has already been created, so we are not creating another one.')


# Check that the required port isn't already in use.
try:
    requests.get('http://localhost:3000')
except requests.exceptions.RequestException:
    pass
else:
    raise Exception('Port 3000 is still in use (perhaps from a previous run of this notebook). '
                    'You will need to kill that process before proceeding, e.g., by running '
                    '"subprocess.call([\'ray\', \'stop\'])" in a new cell and restarting this notebook.')

def make_handler_class(agent):
    """This function is used to define a custom handler using the policy."""

    class PolicyHandler(BaseHTTPRequestHandler):
        def __init__(self, *args, **kwargs):
            BaseHTTPRequestHandler.__init__(self, *args, **kwargs)
            
        def end_headers(self):
            self.send_header('Access-Control-Allow-Origin', '*')
            self.send_header('Access-Control-Allow-Methods', '*')
            self.send_header('Access-Control-Allow-Headers', 'Content-Type')
            BaseHTTPRequestHandler.end_headers(self)
        
        def do_OPTIONS(self):
            self.send_response(200, 'ok')
            self.end_headers()

        def do_POST(self):
            """This method receives the state of the game and returns an action."""
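            length = int(self.headers.get_all('content-length')[0])
            # cgi.parse_qs was removed in Python 3.8; urllib.parse.parse_qs
            # is the function it used to wrap.
            from urllib.parse import parse_qs
            post_body = parse_qs(self.rfile.read(length), keep_blank_values=True)
            # The game state arrives as JSON stored in the first form key.
            features = json.loads(list(post_body.keys())[0])['input']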

            self.send_response(200)
            self.send_header('Content-type', 'application/json')
            self.end_headers()

            # NOTE: This line is where we compute the action using the policy.
            action = int(agent.compute_action(features))
        
            self.wfile.write(json.dumps({'output': action}).encode('ascii'))

    return PolicyHandler

@ray.remote
class PolicyServer(object):
    def __init__(self, port, config, checkpoint_path):
        self.port = port

        config['num_workers'] = 0
        self.agent = PPOAgent(env='pong_env', config=config)
        self.agent.restore(checkpoint_path)

        self.HandlerClass = make_handler_class(self.agent)

    def start_server(self):
        # Run the web server in a separate thread so that this actor method returns immediately.
        def run_server(port):
            httpd = HTTPServer(('', port), self.HandlerClass)
            httpd.serve_forever()

        self.server_thread = threading.Thread(target=run_server,
                                              args=(self.port,))

        print('Starting server.')
        self.server_thread.start()

    def update_policy(self, checkpoint_path):
        # Load the new checkpoint into the agent that the request handler uses.
        self.agent.restore(checkpoint_path)


print('Starting the server, this will take some time.')
server = PolicyServer.remote(3000, config, checkpoint_path)
ray.get(server.start_server.remote())
print('The server has started.')
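
The request format can be sketched with a small standalone client and stub server. This assumes the POST body is a form-encoded string whose first key is a JSON object with an `input` field, which is what the parsing in `PolicyHandler` implies; the actual game client may differ. `stub_policy`, `StubHandler`, and `request_action` are hypothetical names, and the stub uses only the standard library on a free local port instead of port 3000.

```python
import json
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def stub_policy(features):
    """Hypothetical policy: action 1 if the first feature is positive, else 0."""
    return 1 if features[0] > 0 else 0


class StubHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers['Content-Length'])
        body = urllib.parse.parse_qs(self.rfile.read(length).decode('ascii'),
                                     keep_blank_values=True)
        # The JSON document arrives as the first form key, as in PolicyHandler.
        features = json.loads(list(body.keys())[0])['input']
        payload = json.dumps({'output': stub_policy(features)}).encode('ascii')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # Keep the example quiet.


# Bypass any configured HTTP proxy, since the server is local.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def request_action(url, features):
    """POST a game state and return the action from the JSON reply."""
    data = urllib.parse.quote(json.dumps({'input': features})).encode('ascii')
    with opener.open(url, data=data, timeout=5) as response:
        return json.loads(response.read().decode('ascii'))['output']


httpd = HTTPServer(('127.0.0.1', 0), StubHandler)  # Port 0 picks a free port.
threading.Thread(target=httpd.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:{}'.format(httpd.server_port)
actions = [request_action(url, [0.5, 0.1]), request_action(url, [-0.5, 0.1])]
httpd.shutdown()
httpd.server_close()
print(actions)
```

Pointing `request_action` at `http://localhost:3000` would exercise the real `PolicyServer`, provided the feature list matches the observation the policy was trained on.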

**EXERCISE:** Run the cell below to print the URL of the Pong game.

In [None]:
from IPython.display import display, HTML
try:
    # Figure out the public IP address if we are on an EC2 instance.
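    public_ip = subprocess.check_output(['curl', '-m', '5', 'http://169.254.169.254/latest/meta-data/public-ipv4']).decode('ascii').strip()
    address = 'http://{}/pong'.format(public_ip)
except (subprocess.CalledProcessError, FileNotFoundError):
    # Not on EC2 (or curl is not installed), so fall back to the local notebook server.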
    address = 'http://localhost:8888/notebooks/utilities/javascript-pong/static/index.html'

display(HTML('To play against your policy, go to <a href="{0}" target="_blank">{0}</a> and PRESS "1".'.format(address)))

## Update the Policy

To update the policy, leave the webserver running (you do not need to start a new one) and run the cell below.

This will checkpoint the current policy and load the updated policy on the webserver.

In [None]:
# Checkpoint the current policy.
checkpoint_path = agent.save()

# Serve the updated policy.
ray.get(server.update_policy.remote(checkpoint_path))

print('The policy has been updated. You can continue playing at the same URL as before.')
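
Because the webserver thread keeps answering requests while a new checkpoint is loaded, one option is to guard the swap with a lock. This is not what `PolicyServer` above does. It is a sketch of the idea only, using a hypothetical `PolicyHolder` class and plain callables in place of the agent.

```python
import threading


class PolicyHolder:
    """Holds the current policy and lets it be swapped while requests are served."""

    def __init__(self, policy):
        self._policy = policy
        self._lock = threading.Lock()
        self.version = 0

    def compute_action(self, features):
        # Serving threads take the lock so they never run during an update.
        with self._lock:
            return self._policy(features)

    def update(self, new_policy):
        # Swap the policy and bump the version under the same lock.
        with self._lock:
            self._policy = new_policy
            self.version += 1


holder = PolicyHolder(lambda features: 0)
before = holder.compute_action([0.5])
holder.update(lambda features: 1)
after = holder.compute_action([0.5])
print(before, after, holder.version)
```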