# Tensor2Tensor Reinforcement Learning

The `rl` package provides the ability to run model-free and model-based reinforcement learning algorithms.

Currently, we support Proximal Policy Optimization ([PPO](https://arxiv.org/abs/1707.06347)) and Simulated Policy Learning ([SimPLe](https://arxiv.org/abs/1903.00374)).

Below you will find examples of PPO training using `trainer_model_free.py` and SimPLe training using `trainer_model_based.py`.


In [0]:
#@title
# Copyright 2018 Google LLC.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# https://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

In [0]:
!pip install -q tensorflow==1.13.1
!pip install -q tensorflow_probability==0.6.0
!pip install -q tensor2tensor==1.13.1
!pip install -q gym[atari]


In [0]:
# Helper function for playing videos in Colab: converts the video to MP4
# with ffmpeg and embeds it in the notebook.
def play_video(path):
  from IPython.display import HTML
  display_path = "/nbextensions/vid.mp4"
  display_abs_path = "/usr/local/share/jupyter" + display_path
  !rm -f $display_abs_path
  !ffmpeg -loglevel error -i $path $display_abs_path
  return HTML("""
    <video width="640" height="480" controls>
      <source src="{}" type="video/mp4">
    </video>
  """.format(display_path))
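Outside a notebook, the `!` shell escapes used above are not available. As a rough sketch, the same conversion step can be written with `subprocess`. The names `build_ffmpeg_command` and `convert_to_mp4` are hypothetical and not part of Tensor2Tensor, and the `-y` flag (overwrite the output file) stands in for the `rm -f` step.

```python
import subprocess


def build_ffmpeg_command(src_path, dst_path):
  """Returns the ffmpeg argument list for converting src_path to dst_path."""
  # -y overwrites dst_path if it already exists, replacing the `rm -f` step.
  return ["ffmpeg", "-y", "-loglevel", "error", "-i", src_path, dst_path]


def convert_to_mp4(src_path, dst_path):
  """Runs ffmpeg; raises subprocess.CalledProcessError if conversion fails."""
  subprocess.run(build_ffmpeg_command(src_path, dst_path), check=True)
  return dst_path
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.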

# Play using a pre-trained policy

We provide pretrained policies for the following games from the Arcade Learning Environment ([ALE](https://github.com/mgbellemare/Arcade-Learning-Environment)): alien, amidar, assault, asterix, asteroids, atlantis, bank_heist, battle_zone, beam_rider, bowling, boxing, breakout, chopper_command, crazy_climber, demon_attack, fishing_derby, freeway, frostbite, gopher, gravitar, hero, ice_hockey, jamesbond, kangaroo, krull, kung_fu_master, ms_pacman, name_this_game, pong, private_eye, qbert, riverraid, road_runner, seaquest, up_n_down, yars_revenge.

We have 5 checkpoints for each game saved on Google Cloud Storage. Use the following function to get the storage path:

In [0]:
# experiment_id is an integer from [0, 4].
def get_run_dir(game, experiment_id):
  from tensor2tensor.data_generators.gym_env import ATARI_GAMES_WITH_HUMAN_SCORE_NICE
  EXPERIMENTS_PER_GAME = 5
  run_id = ATARI_GAMES_WITH_HUMAN_SCORE_NICE.index(game) * EXPERIMENTS_PER_GAME + experiment_id + 1
  return "gs://tensor2tensor-checkpoints/modelrl_experiments/train_sd/{}".format(run_id)

get_run_dir('pong', 2)



'gs://tensor2tensor-checkpoints/modelrl_experiments/train_sd/143'
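The run ID in this path is plain arithmetic: each game owns a block of 5 consecutive IDs, and numbering starts at 1. The sketch below reproduces the calculation without importing Tensor2Tensor. It assumes that `ATARI_GAMES_WITH_HUMAN_SCORE_NICE` lists the games in the same order as the list above, which is consistent with the `143` returned for Pong. `GAMES`, `run_id`, and `game_and_experiment` are illustrative names only.

```python
# Assumed to match the order of ATARI_GAMES_WITH_HUMAN_SCORE_NICE.
GAMES = [
    "alien", "amidar", "assault", "asterix", "asteroids", "atlantis",
    "bank_heist", "battle_zone", "beam_rider", "bowling", "boxing",
    "breakout", "chopper_command", "crazy_climber", "demon_attack",
    "fishing_derby", "freeway", "frostbite", "gopher", "gravitar", "hero",
    "ice_hockey", "jamesbond", "kangaroo", "krull", "kung_fu_master",
    "ms_pacman", "name_this_game", "pong", "private_eye", "qbert",
    "riverraid", "road_runner", "seaquest", "up_n_down", "yars_revenge",
]
EXPERIMENTS_PER_GAME = 5


def run_id(game, experiment_id):
  """Maps (game, experiment_id) to the 1-based run ID used in the path."""
  if not 0 <= experiment_id < EXPERIMENTS_PER_GAME:
    raise ValueError("experiment_id must be in [0, 4]")
  return GAMES.index(game) * EXPERIMENTS_PER_GAME + experiment_id + 1


def game_and_experiment(run_id_value):
  """Inverse of run_id: recovers (game, experiment_id) from a run ID."""
  index, experiment_id = divmod(run_id_value - 1, EXPERIMENTS_PER_GAME)
  return GAMES[index], experiment_id


print(run_id("pong", 2))         # 143, matching the path above
print(game_and_experiment(143))  # ('pong', 2)
```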

To evaluate and generate videos for a pretrained policy on Pong:

In [0]:
game = 'pong'
run_dir = get_run_dir(game, 1)
!python -m tensor2tensor.rl.evaluator \
  --loop_hparams_set=rlmb_long_stochastic_discrete \
  --loop_hparams=game=$game,eval_max_num_noops=8,eval_sampling_temps=[0.5] \
  --policy_dir=$run_dir/policy \
  --eval_metrics_dir=pong_pretrained \
  --debug_video_path=pong_pretrained \
  --num_debug_videos=4


INFO:tensorflow:Overriding hparams in rlmb_long_stochastic_discrete with game=pong,eval_max_num_noops=8,eval_sampling_temps=[0.5]
INFO:tensorflow:Evaluating metric mean_reward/eval/sampling_temp_0.5_max_noops_8_unclipped

The command above runs a single evaluation setting so that results come back quickly. We usually evaluate over a grid of settings (different sampling temperatures, with and without initial no-ops). To run the full grid, remove `eval_max_num_noops=8,eval_sampling_temps=[0.5]` from the command. You can also override the evaluation settings with other values, for example:

```
--loop_hparams=game=pong,eval_max_num_noops=0,eval_sampling_temps=[0.0]
```
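To evaluate several settings one at a time, the override strings can be generated programmatically. The following is a minimal sketch: `loop_hparams_grid` is a hypothetical helper, and the grid values are placeholders rather than the defaults used by the evaluator.

```python
import itertools


def loop_hparams_grid(game, max_noops_values, sampling_temps):
  """Yields one --loop_hparams value per (max_noops, temperature) pair."""
  for max_noops, temp in itertools.product(max_noops_values, sampling_temps):
    yield "game={},eval_max_num_noops={},eval_sampling_temps=[{}]".format(
        game, max_noops, temp)


# Placeholder grid values, chosen for illustration only.
for value in loop_hparams_grid("pong", [0, 8], [0.0, 0.5]):
  print("--loop_hparams=" + value)
```

Each printed line can be passed to `tensor2tensor.rl.evaluator` in place of the `--loop_hparams` flag shown earlier.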
 
The evaluator generates videos from the environment:

In [0]:
play_video('pong_pretrained/0.avi')

# Train your policy (model-free training)

To train a model-free policy on Pong (this takes a few hours):

In [0]:
!python -m tensor2tensor.rl.trainer_model_free \
  --hparams_set=rlmf_base \
  --hparams=game=pong \
  --output_dir=mf_pong



Hyperparameter sets are defined in `tensor2tensor/models/research/rl.py`. You can override individual hyperparameters using the `--hparams` flag, e.g.

```
--hparams=game=kung_fu_master,frame_stack_size=5
```
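Conceptually, each `name=value` pair replaces the default of the same name in the chosen hyperparameter set. The sketch below illustrates that idea with a plain dictionary. It is not the parser Tensor2Tensor uses: `apply_overrides` is a hypothetical helper, the default values are made up, and list-valued or boolean settings are not handled.

```python
def apply_overrides(defaults, spec):
  """Returns a copy of defaults updated from a "name=value,..." string."""
  result = dict(defaults)
  for item in spec.split(","):
    name, sep, raw = item.partition("=")
    if not sep or name not in result:
      raise ValueError("Unknown or malformed override: {!r}".format(item))
    # Convert the string to the type of the default being replaced.
    result[name] = type(result[name])(raw)
  return result


# Made-up defaults, for illustration only.
defaults = {"game": "pong", "frame_stack_size": 4}
print(apply_overrides(defaults, "game=kung_fu_master,frame_stack_size=5"))
```

Rejecting unknown names, as this sketch does, catches typos in an override string before a long training run starts.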

As in model-based training, the periodic evaluation during training runs with a timestep limit of 1000. To run a full evaluation after training:

In [0]:
!python -m tensor2tensor.rl.evaluator \
  --loop_hparams_set=rlmf_tiny \
  --loop_hparams=game=pong,eval_max_num_noops=0,eval_sampling_temps=[0.5] \
  --policy_dir=mf_pong \
  --debug_video_path=mf_pong \
  --num_debug_videos=4 \
  --eval_metrics_dir=mf_pong/full_eval_metrics


INFO:tensorflow:Overriding hparams in rlmf_tiny with game=pong,eval_max_num_noops=0,eval_sampling_temps=[0.5]
INFO:tensorflow:Evaluating metric mean_reward/eval/sampling_temp_0.5_max_noops_0_unclipped

In [0]:
play_video('mf_pong/0.avi')

# Model-based training

The `rl` package offers many more features, including model-based training. For instructions on how to use them, go to our [README](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/rl/README.md).