# RL Exercise 3 - Proximal Policy Optimization

**GOAL:** Demonstrate how to use the proximal policy optimization (PPO) algorithm.

PPO is described in detail in https://arxiv.org/abs/1707.06347. It is a variant of Trust Region Policy Optimization (TRPO), which is described in https://arxiv.org/abs/1502.05477.

PPO alternates between two phases. In the first phase, a large number of rollouts are performed in parallel. The rollouts are then aggregated on the driver, and a surrogate optimization objective is defined based on them. In the second phase, we use SGD to find the policy that maximizes that objective, with a penalty term for diverging too much from the current policy.
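To make the optimization phase more concrete, here is a minimal sketch of a KL-penalized surrogate objective for a discrete action space, in the spirit of the PPO paper. It is not RLlib's implementation: the function name `kl_penalized_surrogate` and its arguments are hypothetical, and the advantages are assumed to have been computed from the rollouts already.

```python
import math

def kl_penalized_surrogate(old_probs, new_probs, actions, advantages, kl_coeff):
    """Return (objective, mean_kl) for a batch of rollout samples.

    old_probs / new_probs: per-sample action distributions (lists of floats)
    under the current policy and the candidate policy.
    """
    n = len(actions)
    surrogate = 0.0
    mean_kl = 0.0
    for p_old, p_new, a, adv in zip(old_probs, new_probs, actions, advantages):
        # Importance ratio of the action that was taken, weighted by its advantage.
        surrogate += (p_new[a] / p_old[a]) * adv
        # KL(old || new) measures how far the candidate policy has moved.
        mean_kl += sum(po * math.log(po / pn)
                       for po, pn in zip(p_old, p_new) if po > 0.0)
    surrogate /= n
    mean_kl /= n
    # SGD would maximize this quantity with respect to the candidate policy.
    return surrogate - kl_coeff * mean_kl, mean_kl


old = [[0.5, 0.5], [0.5, 0.5]]
new = [[0.6, 0.4], [0.4, 0.6]]
objective, kl = kl_penalized_surrogate(old, new, actions=[0, 1],
                                       advantages=[1.0, 1.0], kl_coeff=0.2)
print(objective, kl)
```

When the candidate policy equals the current one, the KL term is zero and the objective reduces to the mean advantage, which is why the penalty only matters once the policy starts to move.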

**NOTE:** The SGD optimization step is best performed in a data-parallel manner over multiple GPUs. This is exposed through the `devices` field of the `config` dictionary (for this to work, you must be using a machine that has GPUs).

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gym
import ray
from ray.rllib.ppo import PPOAgent, DEFAULT_CONFIG

Start up Ray. This must be done before we instantiate any RL agents. We pass in `num_workers=0` because the training agent's constructor will create a number of actors.

In [2]:
ray.init(num_workers=0)

Waiting for redis server at 127.0.0.1:24039 to respond...
Waiting for redis server at 127.0.0.1:10890 to respond...
Starting local scheduler with 40 CPUs, 0 GPUs


{'local_scheduler_socket_names': ['/tmp/scheduler76068844'],
 'node_ip_address': '127.0.0.1',
 'object_store_addresses': [ObjectStoreAddress(name='/tmp/plasma_store76850868', manager_name='/tmp/plasma_manager96617492', manager_port=54750)],
 'redis_address': '127.0.0.1:24039',
 'webui_url': ''}

Instantiate a PPOAgent object. We pass in a config object that specifies how the network and training procedure should be configured. Some of the parameters are the following.

- `num_workers` is the number of actors that the agent will create. This determines the degree of parallelism that will be used.
- `num_sgd_iter` is the number of epochs of SGD (passes through the data) that will be used to optimize the PPO surrogate objective at each iteration of PPO.
- `sgd_batchsize` is the SGD batch size that will be used to optimize the PPO surrogate objective.
- `model` contains a dictionary of parameters describing the neural net used to parameterize the policy. The `fcnet_hiddens` parameter is a list of the sizes of the hidden layers.
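To get a feel for how `num_sgd_iter` and `sgd_batchsize` interact, the following back-of-the-envelope sketch counts the minibatch updates in one PPO iteration. It assumes that each epoch makes one full pass over the collected timesteps in batches of `sgd_batchsize`. The helper `sgd_updates_per_iteration` is hypothetical, the timestep count is an illustrative figure, and the real implementation may round or shard batches differently.

```python
import math

def sgd_updates_per_iteration(timesteps, num_sgd_iter, sgd_batchsize):
    """Approximate number of minibatch updates in one PPO iteration."""
    batches_per_epoch = math.ceil(timesteps / sgd_batchsize)
    return num_sgd_iter * batches_per_epoch

# 40,000 timesteps per iteration is an assumed, illustrative figure.
heavy = sgd_updates_per_iteration(40000, num_sgd_iter=30, sgd_batchsize=128)
light = sgd_updates_per_iteration(40000, num_sgd_iter=10, sgd_batchsize=1024)
print(heavy, light)
```

Under these assumptions, fewer epochs and larger batches cut the number of updates per iteration by more than an order of magnitude.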

In [3]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 3
config['num_sgd_iter'] = 30
config['sgd_batchsize'] = 128
config['model']['fcnet_hiddens'] = [100, 100]

agent = PPOAgent('CartPole-v0', config)

[2017-09-18 23:10:44,302] PPOAgent algorithm created with logdir '/tmp/ray/CartPole-v0_PPOAgent_2017-09-18_23-10-44b_15knyr'
[2017-09-18 23:10:44,304] Making new env: CartPole-v0


Non-atari env, not using any observation preprocessor.
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>


Train the policy on the `CartPole-v0` environment for 2 training iterations. The CartPole problem is described at https://gym.openai.com/envs/CartPole-v0.

**EXERCISE:** Inspect how well the policy is doing by looking for the lines that say something like

```
total reward is  22.3215974777
trajectory length mean is  21.3215974777
```

These lines report the average reward the policy receives per episode and the average number of environment time steps per episode. The maximum possible reward for this problem is 200. The two numbers are very close because the agent receives a reward of one for every time step that it survives (this is specific to this environment).

In [4]:
for i in range(2):
    result = agent.train()
    print(result)

===> iteration 1
total reward is  21.9772138788
trajectory length mean is  20.9772138788
timesteps: 40507
Computing policy (iterations=30, stepsize=5e-05):
           iter     total loss    policy loss        vf loss             kl        entropy
              0    3.01077e+02   -2.79731e-02    3.01101e+02    1.70152e-02    6.77316e-01
              1    2.89601e+02   -4.36657e-02    2.89639e+02    2.85183e-02    6.66428e-01
              2    2.59408e+02   -4.77955e-02    2.59449e+02    3.22889e-02    6.62838e-01
              3    2.33537e+02   -4.93663e-02    2.33579e+02    3.36404e-02    6.61523e-01
              4    2.10215e+02   -5.02108e-02    2.10258e+02    3.44015e-02    6.60779e-01
              5    1.85847e+02   -5.08632e-02    1.85891e+02    3.48997e-02    6.60295e-01
              6    1.56811e+02   -5.13301e-02    1.56855e+02    3.53625e-02    6.59846e-01
              7    1.24922e+02   -5.17684e-02    1.24967e+02    3.58524e-02    6.59377e-01
              8    9.8307

**EXERCISE:** The current network and training configuration are too large and heavy-duty for a simple problem like CartPole. Modify the configuration to use a smaller network and to speed up the optimization of the surrogate objective (fewer SGD iterations and a larger batch size should help).
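One possible direction is sketched below on a plain dictionary so that it runs without Ray. The `default_config` dictionary is only a stand-in for `DEFAULT_CONFIG`, and the specific values are illustrative guesses rather than tuned settings. The sketch uses `copy.deepcopy` because, for a plain dictionary, a shallow `.copy()` shares the nested `model` dictionary with the original.

```python
import copy

# Stand-in for DEFAULT_CONFIG (hypothetical; only the keys used in this notebook).
default_config = {
    'num_workers': 3,
    'num_sgd_iter': 30,
    'sgd_batchsize': 128,
    'model': {'fcnet_hiddens': [100, 100]},
}

# deepcopy keeps the nested 'model' dict independent of the defaults.
light_config = copy.deepcopy(default_config)
light_config['num_sgd_iter'] = 10                  # fewer passes over the rollouts
light_config['sgd_batchsize'] = 1024               # larger batches, fewer updates per pass
light_config['model']['fcnet_hiddens'] = [32, 32]  # smaller policy network

print(light_config)
print(default_config['model']['fcnet_hiddens'])  # the defaults are left unchanged
```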

In [5]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 3
config['num_sgd_iter'] = 30
config['sgd_batchsize'] = 128
config['model']['fcnet_hiddens'] = [100, 100]

agent = PPOAgent('CartPole-v0', config)

[2017-09-18 23:12:09,520] PPOAgent algorithm created with logdir '/tmp/ray/CartPole-v0_PPOAgent_2017-09-18_23-12-09oy02h5hq'
[2017-09-18 23:12:09,521] Making new env: CartPole-v0


Non-atari env, not using any observation preprocessor.
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>


**EXERCISE:** Train the agent and try to get a reward of 200. If it's training too slowly, you may need to modify the config above to use fewer hidden units, a larger `sgd_batchsize`, a smaller `num_sgd_iter`, or a larger `num_workers`.

This should take around 20 or 30 training iterations.

In [None]:
# NOTE: Use a simpler network; 100x100 is overkill for this problem.

In [7]:
for i in range(30):
    result = agent.train()
    print(result)

===> iteration 3
total reward is  129.39184953
trajectory length mean is  128.39184953
timesteps: 40957
Computing policy (iterations=30, stepsize=5e-05):
           iter     total loss    policy loss        vf loss             kl        entropy
              0    1.03670e+03   -8.39487e-03    1.03671e+03    5.24551e-03    5.59758e-01
              1    8.94730e+02   -9.88502e-03    8.94737e+02    6.27735e-03    5.53654e-01
              2    8.15363e+02   -1.00116e-02    8.15370e+02    6.39050e-03    5.53428e-01
              3    7.60303e+02   -1.00249e-02    7.60310e+02    6.34543e-03    5.54137e-01
              4    7.20068e+02   -1.00716e-02    7.20075e+02    6.30446e-03    5.54078e-01
              5    6.87645e+02   -9.95674e-03    6.87652e+02    6.01853e-03    5.55839e-01
              6    6.59445e+02   -1.00523e-02    6.59452e+02    6.06512e-03    5.55622e-01
              7    6.34159e+02   -1.00701e-02    6.34166e+02    6.10951e-03    5.55423e-01
              8    6.11115e

              5    3.64003e+02   -2.36821e-04    3.64003e+02    4.29094e-04    5.49663e-01
              6    3.36252e+02   -3.32965e-04    3.36252e+02    5.41246e-04    5.48683e-01
              7    3.17249e+02   -3.63815e-04    3.17250e+02    5.21005e-04    5.45170e-01
              8    3.03187e+02   -4.30062e-04    3.03187e+02    5.80607e-04    5.45207e-01
              9    2.92003e+02   -4.35611e-04    2.92004e+02    6.94037e-04    5.49425e-01
             10    2.82733e+02   -4.49105e-04    2.82733e+02    7.46260e-04    5.49558e-01
             11    2.74851e+02   -4.79854e-04    2.74851e+02    8.13771e-04    5.48289e-01
             12    2.68060e+02   -4.90129e-04    2.68060e+02    7.86196e-04    5.49117e-01
             13    2.62150e+02   -5.38677e-04    2.62151e+02    9.02211e-04    5.46629e-01
             14    2.56928e+02   -5.87420e-04    2.56929e+02    9.57457e-04    5.49312e-01
             15    2.52393e+02   -5.78309e-04    2.52394e+02    1.00914e-03    5.46277e-01

             13    2.41558e+02   -7.77707e-04    2.41558e+02    3.05840e-03    5.38185e-01
             14    2.40667e+02   -7.61884e-04    2.40668e+02    3.39073e-03    5.35448e-01
             15    2.39934e+02   -8.40998e-04    2.39935e+02    3.10428e-03    5.34660e-01
             16    2.39271e+02   -8.45143e-04    2.39272e+02    2.86631e-03    5.39830e-01
             17    2.38714e+02   -8.58843e-04    2.38714e+02    3.21907e-03    5.37720e-01
             18    2.38212e+02   -9.09221e-04    2.38213e+02    3.27115e-03    5.41820e-01
             19    2.37756e+02   -9.02058e-04    2.37756e+02    3.07555e-03    5.30887e-01
             20    2.37329e+02   -9.34895e-04    2.37330e+02    3.37208e-03    5.39241e-01
             21    2.36948e+02   -9.28508e-04    2.36949e+02    3.60381e-03    5.33110e-01
             22    2.36593e+02   -8.85475e-04    2.36593e+02    2.81112e-03    5.29741e-01
             23    2.36266e+02   -9.51372e-04    2.36267e+02    2.90725e-03    5.32365e-01

             21    2.32830e+02   -1.02788e-03    2.32831e+02    5.69929e-03    5.29583e-01
             22    2.32411e+02   -1.02138e-03    2.32412e+02    4.95867e-03    5.25160e-01
             23    2.32026e+02   -9.84224e-04    2.32027e+02    4.58778e-03    5.20490e-01
             24    2.31655e+02   -1.00555e-03    2.31656e+02    4.73769e-03    5.21648e-01
             25    2.31356e+02   -1.08488e-03    2.31357e+02    4.84898e-03    5.29307e-01
             26    2.31065e+02   -1.09590e-03    2.31066e+02    5.31661e-03    5.22858e-01
             27    2.30767e+02   -1.06633e-03    2.30768e+02    4.90463e-03    5.24795e-01
             28    2.30523e+02   -1.15467e-03    2.30524e+02    4.89859e-03    5.25682e-01
             29    2.30221e+02   -1.06354e-03    2.30222e+02    4.68647e-03    5.15402e-01
kl div: 0.00468647
kl coeff: 0.007031250000000001
rollouts time: 8.52013087272644
shuffle time: 0.004523515701293945
load time: 0.001699209213256836
sgd time: 18.20466136932373
sgd 

             29    2.09413e+02   -7.72173e-04    2.09414e+02    4.46291e-03    5.14065e-01
kl div: 0.00446291
kl coeff: 0.0017578125000000003
rollouts time: 8.559034824371338
shuffle time: 0.004549980163574219
load time: 0.0013918876647949219
sgd time: 18.358441829681396
sgd examples/s: 2210.7540703362815
total time so far: 421.6331603527069
TrainingResult(experiment_id='aaf6071946ae492eb28236dfaaae1b3c', training_iteration=11, episode_reward_mean=199.95098039215685, episode_len_mean=198.95098039215685, info={'kl_divergence': 0.0044629145, 'kl_coefficient': 0.0017578125000000003, 'rollouts_time': 8.559034824371338, 'shuffle_time': 0.004549980163574219, 'load_time': 0.0013918876647949219, 'sgd_time': 18.358441829681396, 'sample_throughput': 2210.7540703362815}, timesteps_this_iter=40586, timesteps_total=446837, time_this_iter_s=26.928181409835815, time_total_s=298.0102126598358)
===> iteration 12
total reward is  200.0
trajectory length mean is  199.0
timesteps: 40596
Computing policy (

total reward is  198.004901961
trajectory length mean is  197.004901961
timesteps: 40189
Computing policy (iterations=30, stepsize=5e-05):
           iter     total loss    policy loss        vf loss             kl        entropy
              0    5.95610e+02   -8.30503e-04    5.95611e+02    7.49417e-04    5.12182e-01
              1    3.27245e+02   -1.11166e-03    3.27246e+02    1.57430e-03    5.02363e-01
              2    2.78624e+02   -1.33474e-03    2.78625e+02    2.85658e-03    4.95744e-01
              3    2.61087e+02   -1.47280e-03    2.61088e+02    4.10106e-03    4.92422e-01
              4    2.53010e+02   -1.56411e-03    2.53012e+02    4.20282e-03    4.92715e-01
              5    2.48571e+02   -1.61524e-03    2.48573e+02    3.88703e-03    5.01291e-01
              6    2.45924e+02   -1.71139e-03    2.45926e+02    4.58175e-03    4.94469e-01
              7    2.44157e+02   -1.75729e-03    2.44158e+02    4.92590e-03    4.96527e-01
              8    2.42899e+02   -1.76921e

              5    2.58458e+02   -2.17826e-03    2.58460e+02    6.86491e-03    4.97065e-01
              6    2.56513e+02   -2.25898e-03    2.56515e+02    6.82112e-03    4.98918e-01
              7    2.54964e+02   -2.24479e-03    2.54966e+02    6.72969e-03    5.01591e-01
              8    2.53627e+02   -2.31962e-03    2.53629e+02    6.88276e-03    4.98092e-01
              9    2.52378e+02   -2.44720e-03    2.52381e+02    6.72508e-03    4.96625e-01
             10    2.51195e+02   -2.47830e-03    2.51197e+02    7.64064e-03    4.96769e-01
             11    2.50083e+02   -2.41569e-03    2.50086e+02    6.86974e-03    4.95680e-01
             12    2.48984e+02   -2.51533e-03    2.48987e+02    7.38318e-03    4.95757e-01
             13    2.47886e+02   -2.55983e-03    2.47888e+02    7.29028e-03    4.98351e-01
             14    2.46816e+02   -2.68206e-03    2.46819e+02    7.47374e-03    4.98838e-01
             15    2.45812e+02   -2.62668e-03    2.45815e+02    7.43878e-03    4.96720e-01

             13    2.50829e+02   -1.12666e-03    2.50831e+02    3.10541e-03    5.04706e-01
             14    2.50215e+02   -1.14055e-03    2.50216e+02    3.13577e-03    5.01165e-01
             15    2.49672e+02   -1.20289e-03    2.49673e+02    3.26135e-03    5.02926e-01
             16    2.49239e+02   -1.27091e-03    2.49240e+02    3.28408e-03    5.03903e-01
             17    2.48837e+02   -1.23491e-03    2.48838e+02    3.49777e-03    5.02219e-01
             18    2.48524e+02   -1.31065e-03    2.48525e+02    3.37488e-03    5.04479e-01
             19    2.48189e+02   -1.28614e-03    2.48190e+02    3.23664e-03    4.99667e-01
             20    2.47960e+02   -1.34871e-03    2.47961e+02    3.43908e-03    5.01938e-01
             21    2.47721e+02   -1.43755e-03    2.47722e+02    3.63670e-03    5.04319e-01
             22    2.47525e+02   -1.40977e-03    2.47526e+02    3.61360e-03    5.04943e-01
             23    2.47343e+02   -1.51165e-03    2.47344e+02    3.74736e-03    5.04569e-01

             21    2.51314e+02   -1.07743e-03    2.51315e+02    3.95868e-03    4.85748e-01
             22    2.51201e+02   -1.12121e-03    2.51202e+02    3.82630e-03    4.83681e-01
             23    2.51052e+02   -1.06844e-03    2.51053e+02    3.65530e-03    4.78874e-01
             24    2.50914e+02   -1.16282e-03    2.50915e+02    3.91769e-03    4.79653e-01
             25    2.50836e+02   -1.17618e-03    2.50837e+02    3.93158e-03    4.86394e-01
             26    2.50719e+02   -1.18245e-03    2.50720e+02    3.84927e-03    4.78231e-01
             27    2.50635e+02   -1.16203e-03    2.50636e+02    3.51938e-03    4.84351e-01
             28    2.50557e+02   -1.25538e-03    2.50558e+02    4.03430e-03    4.83055e-01
             29    2.50457e+02   -1.28193e-03    2.50458e+02    4.00776e-03    4.81409e-01
kl div: 0.00400776
kl coeff: 1.3732910156250002e-05
rollouts time: 9.794663190841675
shuffle time: 0.005086660385131836
load time: 0.0016257762908935547
sgd time: 19.054980754852295

             29    2.39670e+02   -2.04095e-03    2.39672e+02    4.44236e-03    4.77602e-01
kl div: 0.00444236
kl coeff: 3.4332275390625005e-06
rollouts time: 8.55243468284607
shuffle time: 0.004193782806396484
load time: 0.0015778541564941406
sgd time: 17.775023221969604
sgd examples/s: 2281.122195659618
total time so far: 724.2080056667328
TrainingResult(experiment_id='aaf6071946ae492eb28236dfaaae1b3c', training_iteration=22, episode_reward_mean=199.75980392156862, episode_len_mean=198.75980392156862, info={'kl_divergence': 0.0044423635, 'kl_coefficient': 3.4332275390625005e-06, 'rollouts_time': 8.55243468284607, 'shuffle_time': 0.004193782806396484, 'load_time': 0.0015778541564941406, 'sgd_time': 17.775023221969604, 'sample_throughput': 2281.122195659618}, timesteps_this_iter=40547, timesteps_total=892128, time_this_iter_s=26.33870029449463, time_total_s=600.581650018692)
===> iteration 23
total reward is  199.549019608
trajectory length mean is  198.549019608
timesteps: 40504
Comput

total reward is  200.0
trajectory length mean is  199.0
timesteps: 40596
Computing policy (iterations=30, stepsize=5e-05):
           iter     total loss    policy loss        vf loss             kl        entropy
              0    6.45931e+02   -2.44956e-04    6.45931e+02    9.26459e-04    4.76205e-01
              1    2.71145e+02   -4.88182e-04    2.71145e+02    1.22332e-03    4.80724e-01
              2    2.64441e+02   -8.39268e-04    2.64442e+02    3.69750e-03    4.82266e-01
              3    2.61715e+02   -8.80819e-04    2.61716e+02    3.11283e-03    4.85109e-01
              4    2.59886e+02   -9.13358e-04    2.59887e+02    2.61473e-03    4.81559e-01
              5    2.58669e+02   -1.05123e-03    2.58670e+02    3.17944e-03    4.81710e-01
              6    2.57752e+02   -1.07925e-03    2.57753e+02    3.56603e-03    4.78330e-01
              7    2.56974e+02   -1.23097e-03    2.56976e+02    3.67543e-03    4.81806e-01
              8    2.56326e+02   -1.24567e-03    2.56327e+

              6    2.55119e+02   -1.12380e-03    2.55120e+02    3.19432e-03    4.79038e-01
              7    2.54458e+02   -1.12832e-03    2.54459e+02    3.22281e-03    4.78515e-01
              8    2.53846e+02   -1.21437e-03    2.53847e+02    3.15948e-03    4.78209e-01
              9    2.53410e+02   -1.23193e-03    2.53411e+02    3.35429e-03    4.79563e-01
             10    2.52990e+02   -1.36831e-03    2.52991e+02    3.72556e-03    4.80335e-01
             11    2.52624e+02   -1.22047e-03    2.52625e+02    3.07186e-03    4.74750e-01
             12    2.52315e+02   -1.34016e-03    2.52316e+02    3.44804e-03    4.76037e-01
             13    2.52044e+02   -1.39446e-03    2.52046e+02    3.85545e-03    4.78537e-01
             14    2.51792e+02   -1.54057e-03    2.51794e+02    3.77408e-03    4.80040e-01
             15    2.51565e+02   -1.39639e-03    2.51567e+02    3.44713e-03    4.76223e-01
             16    2.51306e+02   -1.57637e-03    2.51308e+02    3.52589e-03    4.78989e-01

             15    2.50532e+02   -1.35098e-03    2.50533e+02    4.24358e-03    4.46597e-01
             16    2.50350e+02   -1.45305e-03    2.50352e+02    4.20984e-03    4.43061e-01
             17    2.50172e+02   -1.60370e-03    2.50174e+02    4.76084e-03    4.43436e-01
             18    2.50044e+02   -1.39976e-03    2.50046e+02    4.03560e-03    4.46081e-01
             19    2.49845e+02   -1.50178e-03    2.49846e+02    4.57734e-03    4.45918e-01
             20    2.49733e+02   -1.57192e-03    2.49734e+02    4.58941e-03    4.45356e-01
             21    2.49590e+02   -1.59434e-03    2.49592e+02    4.58839e-03    4.43902e-01
             22    2.49484e+02   -1.56317e-03    2.49486e+02    4.68725e-03    4.44663e-01
             23    2.49349e+02   -1.65161e-03    2.49350e+02    4.56106e-03    4.44231e-01
             24    2.49261e+02   -1.63769e-03    2.49263e+02    4.68451e-03    4.46236e-01
             25    2.49122e+02   -1.67790e-03    2.49123e+02    4.61360e-03    4.43296e-01

             23    2.27900e+02   -1.49257e-03    2.27901e+02    3.62888e-03    4.31002e-01
             24    2.27759e+02   -1.57693e-03    2.27761e+02    3.64806e-03    4.35358e-01
             25    2.27599e+02   -1.56277e-03    2.27600e+02    3.41298e-03    4.30819e-01
             26    2.27437e+02   -1.55714e-03    2.27439e+02    3.94707e-03    4.33786e-01
             27    2.27337e+02   -1.60365e-03    2.27339e+02    3.49833e-03    4.32107e-01
             28    2.27179e+02   -1.62937e-03    2.27181e+02    3.68933e-03    4.32500e-01
             29    2.27055e+02   -1.67347e-03    2.27057e+02    3.73381e-03    4.30750e-01
kl div: 0.00373381
kl coeff: 1.3411045074462893e-08
rollouts time: 10.08564567565918
shuffle time: 0.0047719478607177734
load time: 0.001676321029663086
sgd time: 19.660324335098267
sgd examples/s: 2064.869292493139
total time so far: 976.2047762870789
TrainingResult(experiment_id='aaf6071946ae492eb28236dfaaae1b3c', training_iteration=31, episode_reward_mean=20

Checkpoint the current model. The call to `agent.save()` returns the path to the checkpointed model, which can be used later to restore it.

In [8]:
checkpoint_path = agent.save()

INFO:tensorflow:/tmp/ray/CartPole-v0_PPOAgent_2017-09-18_23-12-09oy02h5hq/checkpoint-32 is not in all_model_checkpoint_paths. Manually adding it.


[2017-09-18 23:28:54,456] /tmp/ray/CartPole-v0_PPOAgent_2017-09-18_23-12-09oy02h5hq/checkpoint-32 is not in all_model_checkpoint_paths. Manually adding it.


Now let's use the trained policy to make predictions.

**NOTE:** Here we are loading the trained policy in the same process, but in practice, this would often be done in a different process (probably on a different machine).

In [9]:
trained_config = config.copy()

test_agent = PPOAgent('CartPole-v0', trained_config)
test_agent.restore(checkpoint_path)

[2017-09-18 23:28:54,701] PPOAgent algorithm created with logdir '/tmp/ray/CartPole-v0_PPOAgent_2017-09-18_23-28-54xl1oynzn'
[2017-09-18 23:28:54,702] Making new env: CartPole-v0


Non-atari env, not using any observation preprocessor.
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
Constructing fcnet [100, 100] <function tanh at 0x7f2bf7bd9268>
INFO:tensorflow:Restoring parameters from /tmp/ray/CartPole-v0_PPOAgent_2017-09-18_23-12-09oy02h5hq/checkpoint-32


[2017-09-18 23:28:57,851] Restoring parameters from /tmp/ray/CartPole-v0_PPOAgent_2017-09-18_23-12-09oy02h5hq/checkpoint-32


Now use the trained policy to act in an environment. The key line is the call to `test_agent.compute_action(state)`, which uses the trained policy to choose an action.

**EXERCISE:** Verify that the reward received roughly matches up with the reward printed in the training logs.

In [10]:
env = gym.make('CartPole-v0')
state = env.reset()
done = False
cumulative_reward = 0

while not done:
    action = test_agent.compute_action(state)
    state, reward, done, _ = env.step(action)
    cumulative_reward += reward

print(cumulative_reward)

[2017-09-18 23:29:07,638] Making new env: CartPole-v0


200.0
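A single episode is a noisy estimate, so for the comparison with the training logs it can help to average over several episodes. The sketch below assumes the older Gym interface used in this notebook, where `env.step` returns a 4-tuple. `average_return`, `StubEnv`, and the constant policy are hypothetical stand-ins so that the block runs without Gym or Ray.

```python
class StubEnv(object):
    """Toy environment: reward 1 per step, episode ends after max_steps."""

    def __init__(self, max_steps=200):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # dummy observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.max_steps
        return 0.0, 1.0, done, {}


def average_return(env, policy, num_episodes=10):
    """Run num_episodes rollouts and return the mean cumulative reward."""
    total = 0.0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            state, reward, done, _ = env.step(policy(state))
            total += reward
    return total / num_episodes


print(average_return(StubEnv(), policy=lambda state: 0))
```

With the real objects from this notebook, the call would look like `average_return(env, test_agent.compute_action)`.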
