# Ray RLlib - Extra Application Example - Taxi-v3

© 2019-2021, Anyscale. All Rights Reserved

This example uses [RLlib](https://ray.readthedocs.io/en/latest/rllib.html) to train a policy with the `Taxi-v3` environment ([gym.openai.com/envs/Taxi-v3/](https://gym.openai.com/envs/Taxi-v3/)). The goal is to pick up passengers as fast as possible, negotiating the available paths. This is one of OpenAI Gym's ["toy text"](https://gym.openai.com/envs/#toy_text) problems.

For more background about this problem, see:

* ["Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition"](https://arxiv.org/abs/cs/9905014), [Thomas G. Dietterich](https://twitter.com/tdietterich)
* ["Reinforcement Learning: let’s teach a taxi-cab how to drive"](https://towardsdatascience.com/reinforcement-learning-lets-teach-a-taxi-cab-how-to-drive-4fd1a0d00529), [Valentina Alto](https://twitter.com/AltoValentina)

In [1]:
import pandas as pd
import json
import os
import shutil
import sys
import ray
import ray.rllib.agents.ppo as ppo
import ray.rllib.agents.sac as sac


from EnvWrapperRay import HungryGeeseKaggle
from ray_utils import CustomModel, CustomConvModel
from ray.rllib.models import ModelCatalog



In [2]:
# Shut down any Ray instance left over from a previous run.
if ray.is_initialized():
    ray.shutdown()
else:
    print('not running')

In [3]:
info = ray.init(ignore_reinit_error=False)

2021-06-08 14:34:17,097	INFO services.py:1267 -- View the Ray dashboard at http://127.0.0.1:8266


In [4]:
print("Dashboard URL: http://{}".format(info["webui_url"]))

Dashboard URL: http://127.0.0.1:8266


Set up the checkpoint location:

In [5]:
checkpoint_root = "tmp/ppo-lstm/geese"
shutil.rmtree(checkpoint_root, ignore_errors=True, onerror=None)   # clean up old runs

Next we'll train an RLlib policy.

Training runs for `N_ITER` iterations, which the configuration cell below sets to `1000`. Lower it for a quick test run, or raise it if you want to see the resulting rewards improve further.
Also note that *checkpoints* get saved after each iteration into the `checkpoint_root` directory defined above (`tmp/ppo-lstm/geese`).

> **Note:** If you prefer a different directory root than `tmp`, change `checkpoint_root` above **and** the checkpoint paths used in the restore cell and the `rllib rollout` command below.

In [6]:
ModelCatalog.register_custom_model("my_model", CustomConvModel)

In [10]:
N_ITER = 1000

config = ppo.DEFAULT_CONFIG.copy()
config["log_level"] = "WARN"  # reduce logging noise
config["model"] = {
    # "custom_model": "my_model",
    "conv_filters": [[32, 16, 2], [3, 4, 8]],
    "use_lstm": True,              # wrap the network with an LSTM
    "lstm_use_prev_action": True,  # also feed the previous action to the LSTM
    # "fcnet_hiddens": [256, 256],
    # "fcnet_activation": "relu",
}
# config["framework"] = "tfe"
# config["eager_tracing"] = True
# config["num_gpus"] = 0
# config["conv_filters"] = [[32, 16, 1]]

# config["Q_model"] = {
#     "custom_model": "my_model"
# }
config["no_done_at_end"] = True
config["normalize_actions"] = False
config["vf_clip_param"] = 100  # clip range for the value-function loss

agent = ppo.PPOTrainer(config, env=HungryGeeseKaggle)

(pid=38194) Instructions for updating:
(pid=38194) non-resource variables are not supported in the long term
(pid=38188) Instructions for updating:
(pid=38188) non-resource variables are not supported in the long term

(pid=38188) Loading environment football failed: No module named 'gfootball'
(pid=38194) Loading environment football failed: No module named 'gfootball'

(pid=38188) Instructions for updating:
(pid=38188) If using Keras pass *_constraint arguments to layers.
(pid=38194) Instructions for updating:
(pid=38194) If using Keras pass *_constraint arguments to layers.

(pid=38188) Model: "model_1"
(pid=38188) __________________________________________________________________________________________________
(pid=38188) Layer (type)                    Output Shape         Param #     Connected to
(pid=38188) seq_in (InputLayer)             [(None,)]            0
(pid=38188) __________________________________________________________________________________________________
(pid=38188) tf_op_layer_default_policy/Sequ [()]                 0           seq_in[0][0]
(pid=38188) __________________________________________________________________________________________________
(pid=38188) tf_op_layer_default_policy/Sequ [()]                 0           tf_op_layer_default_policy/Sequen
(pid=38188) _________________________________________________________________



In [None]:
results = []
episode_data = []
episode_json = []

for n in range(N_ITER):
    result = agent.train()
    results.append(result)
    
    episode = {'n': n, 
               'episode_reward_min': result['episode_reward_min'], 
               'episode_reward_mean': result['episode_reward_mean'], 
               'episode_reward_max': result['episode_reward_max'],  
               'episode_len_mean': result['episode_len_mean']
              }
    
    episode_data.append(episode)
    episode_json.append(json.dumps(episode))
    file_name = agent.save(checkpoint_root)
    
    print(f'{n+1:3d}: Min/Mean/Max reward: {result["episode_reward_min"]:8.4f}/{result["episode_reward_mean"]:8.4f}/{result["episode_reward_max"]:8.4f}, len mean: {result["episode_len_mean"]:8.4f}. Checkpoint saved to {file_name}')

  1: Min/Mean/Max reward: -187.9036/ 80.3153/627.9761, len mean:   5.1373. Checkpoint saved to tmp/ppo-lstm/geese/checkpoint_000001/checkpoint-1
  2: Min/Mean/Max reward: -349.3524/ 70.1586/634.2083, len mean:   5.0083. Checkpoint saved to tmp/ppo-lstm/geese/checkpoint_000002/checkpoint-2
  3: Min/Mean/Max reward: -240.6193/ 89.2747/575.2131, len mean:   5.2368. Checkpoint saved to tmp/ppo-lstm/geese/checkpoint_000003/checkpoint-3
  4: Min/Mean/Max reward: -256.5463/106.2915/732.7177, len mean:   5.0420. Checkpoint saved to tmp/ppo-lstm/geese/checkpoint_000004/checkpoint-4
  5: Min/Mean/Max reward: -310.5533/ 91.9799/812.8545, len mean:   5.1416. Checkpoint saved to tmp/ppo-lstm/geese/checkpoint_000005/checkpoint-5
  6: Min/Mean/Max reward: -413.9616/ 92.3674/569.5937, len mean:   5.2026. Checkpoint saved to tmp/ppo-lstm/geese/checkpoint_000006/checkpoint-6
  7: Min/Mean/Max reward: -191.6667/102.8590/736.8381, len mean:   5.0983. Checkpoint saved to tmp/ppo-lstm/geese/checkpoint_00000

Do the episode rewards increase after multiple iterations?
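One way to answer that question is to smooth the noisy per-iteration means. The following is a minimal sketch in plain Python, using the seven mean rewards printed above; the `moving_average` helper and the window size of 3 are illustrative choices, not part of RLlib.

```python
# Mean episode rewards copied from the seven iterations logged above.
mean_rewards = [80.3153, 70.1586, 89.2747, 106.2915, 91.9799, 92.3674, 102.8590]


def moving_average(values, window):
    """Trailing moving average; early entries average over the values seen so far."""
    averages = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages


smoothed = moving_average(mean_rewards, window=3)
for n, (raw, avg) in enumerate(zip(mean_rewards, smoothed), start=1):
    print(f"{n:3d}: mean reward {raw:8.4f}, 3-iteration average {avg:8.4f}")
```

In the notebook itself, the same list can be built from the training loop's bookkeeping with `[e['episode_reward_mean'] for e in episode_data]`.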

Also, print out the policy and model to see the results of training in detail…

In [None]:
import pprint

policy = agent.get_policy()
model = policy.model

print('variables')
pprint.pprint(model.variables())
pprint.pprint(model.value_function())
policy.export_model('./submission_ray/model')
print(model.base_model.summary())


## Rollout

Next we'll use the [`rollout` script](https://ray.readthedocs.io/en/latest/rllib-training.html#evaluating-trained-policies) to evaluate the trained policy.

The output from the following command visualizes the "taxi" agent operating within its simulation: picking up a passenger, driving, turning, dropping off a passenger ("put-down"), and so on. 

A 2-D map of the *observation space* is visualized as text, which needs some decoding instructions:

  * `R` -- R(ed) location in the Northwest corner
  * `G` -- G(reen) location in the Northeast corner
  * `Y` -- Y(ellow) location in the Southwest corner
  * `B` -- B(lue) location in the Southeast corner
  * `:` -- cells where the taxi can drive
  * `|` -- obstructions ("walls") which the taxi must avoid
  * blue letter represents the current passenger’s location for pick-up
  * purple letter represents the drop-off location
  * yellow rectangle is the current location of our taxi/agent

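The taxi can be in any of the 25 grid cells, the passenger can be at one of 5 locations (the four lettered locations or inside the taxi), and the destination is one of the 4 lettered locations. That allows for a total of 25 × 5 × 4 = 500 states, and these known states are numbered between 0 and 499.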
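The numbering can be sketched as a mixed-radix encoding of the taxi row, taxi column, passenger location, and destination. The sketch below follows the ordering we understand Gym's Taxi implementation to use; the function names `encode` and `decode` are our own, chosen for illustration.

```python
def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four state components into a single integer in [0, 499]."""
    # 5 rows x 5 columns x 5 passenger locations x 4 destinations
    state = taxi_row
    state = state * 5 + taxi_col
    state = state * 5 + passenger_loc   # 0-3: R, G, Y, B; 4: inside the taxi
    state = state * 4 + destination     # 0-3: R, G, Y, B
    return state


def decode(state):
    """Invert encode(): recover (taxi_row, taxi_col, passenger_loc, destination)."""
    state, destination = divmod(state, 4)
    state, passenger_loc = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_loc, destination


all_states = {
    encode(r, c, p, d)
    for r in range(5) for c in range(5) for p in range(5) for d in range(4)
}
print(len(all_states), min(all_states), max(all_states))  # 500 0 499
```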

The *action space* for the taxi/agent is defined as:

  * move the taxi one square North
  * move the taxi one square South
  * move the taxi one square East
  * move the taxi one square West
  * pick up the passenger
  * put down (drop off) the passenger

The *rewards* are structured as −1 for each step, unless one of the following applies instead:

 * +20 points when the taxi performs a correct drop-off for the passenger
 * -10 points when the taxi attempts illegal pick-up/drop-off actions
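To make the reward structure concrete, here is a small sketch that totals the reward for a hand-written sequence of events. It assumes the +20 and −10 replace the −1 step cost for that action, which is how we understand Gym's Taxi implementation to behave; the event labels and the `episode_return` helper are hypothetical.

```python
# Reward constants taken from the list above.
STEP_REWARD = -1
DROPOFF_REWARD = 20
ILLEGAL_REWARD = -10

# Assumption: the +20 and -10 replace the -1 step cost for that action.
REWARD_BY_EVENT = {
    "move": STEP_REWARD,
    "pickup": STEP_REWARD,      # a legal pick-up costs the usual step
    "dropoff": DROPOFF_REWARD,  # correct drop-off, which ends the episode
    "illegal": ILLEGAL_REWARD,  # pick-up/drop-off attempted in the wrong place
}


def episode_return(events):
    """Sum the rewards for a sequence of event labels."""
    return sum(REWARD_BY_EVENT[event] for event in events)


# Three moves to reach the passenger, a pick-up, four moves, then the drop-off.
events = ["move"] * 3 + ["pickup"] + ["move"] * 4 + ["dropoff"]
print(episode_return(events))  # 12
```

A single illegal pick-up or drop-off attempt along the way would lower this return by 10.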

Admittedly it'd be better if these state visualizations showed the *reward* along with observations.

In [None]:
path = os.path.join(checkpoint_root, 'checkpoint_000001', 'checkpoint-1')  # first checkpoint saved by the training loop
agent = ppo.PPOTrainer(config, env=HungryGeeseKaggle)
agent.restore(path)
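The checkpoint paths printed during training follow a regular pattern: a zero-padded directory name followed by a file named after the iteration. A hypothetical helper such as `checkpoint_path` below can build the path for any iteration; it assumes the layout shown in the training log, which may differ in other Ray versions.

```python
def checkpoint_path(root, iteration):
    """Build the checkpoint file path for a training iteration (hypothetical helper)."""
    # Layout inferred from the training log: <root>/checkpoint_000001/checkpoint-1
    return f"{root}/checkpoint_{iteration:06d}/checkpoint-{iteration}"


print(checkpoint_path("tmp/ppo-lstm/geese", 1))
# tmp/ppo-lstm/geese/checkpoint_000001/checkpoint-1
```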



In [None]:
# Fetch the policy and model from the restored agent before inspecting them.
policy = agent.get_policy()
model = policy.model

In [None]:
model.base_model.summary()




In [None]:
policy.compute_single_action()

In [None]:
!rllib rollout \
    tmp/ppo/taxi/checkpoint_000010/checkpoint-10 \
    --config "{\"env\": \"Taxi-v3\"}" \
    --run PPO \
    --steps 2000

In [None]:
ray.shutdown()  # "Undo ray.init()".

## Exercise ("Homework")

In addition to _Taxi_, there are other so-called ["toy text"](https://gym.openai.com/envs/#toy_text) problems you can try.