# Hands-on RL with Ray’s RLlib
## A beginner’s tutorial for working with multi-agent environments, models, and algorithms

<img src="images/pitfall.jpg" width=250> <img src="images/tesla.jpg" width=254> <img src="images/forklifts.jpg" width=169> <img src="images/robots.jpg" width=252> <img src="images/dota2.jpg" width=213>

### Overview
“Hands-on RL with Ray’s RLlib” is a beginners tutorial for working with reinforcement learning (RL) environments, models, and algorithms using Ray’s RLlib library. RLlib offers high scalability, a large list of algos to choose from (offline, model-based, model-free, etc..), support for TensorFlow and PyTorch, and a unified API for a variety of applications. This tutorial includes a brief introduction to provide an overview of concepts (e.g. why RL) before proceeding to RLlib (multi- and single-agent) environments, neural network models, hyperparameter tuning, debugging, student exercises, Q/A, and more. All code will be provided as .py files in a GitHub repo.

### Intended Audience
* Python programmers who want to get started with reinforcement learning and RLlib.

### Prerequisites
* Some Python programming experience.
* Some familiarity with machine learning.
* *Helpful, but not required:* Experience in reinforcement learning and Ray.
* *Helpful, but not required:* Experience with TensorFlow or PyTorch.

### Requirements/Dependencies

Install conda (https://www.anaconda.com/products/individual)

Then ...

#### Quick `conda` setup instructions (Linux):
```
$ conda create -n rllib python=3.8
$ conda activate rllib
$ pip install ray[rllib]
$ pip install tensorflow  # <- either one works!
$ pip install torch  # <- either one works!
$ pip install jupyterlab
```

#### Quick `conda` setup instructions (Mac):
```
$ conda create -n rllib python=3.8
$ conda activate rllib
$ pip install cmake "ray[rllib]"
$ pip install tensorflow  # <- either one works!
$ pip install torch  # <- either one works!
$ pip install jupyterlab
```

#### Quick `conda` setup instructions (Win10):
```
$ conda create -n rllib python=3.8
$ conda activate rllib
$ pip install ray[rllib]
$ pip install [tensorflow|torch]  # <- either one works!
$ pip install jupyterlab
$ conda install pywin32
```

Also, for Win10 Atari support, we have to install atari_py from a different source (gym does not support Atari envs on Windows).

```
$ pip install git+https://github.com/Kojoley/atari-py.git
```

### Opening these tutorial files:
```
$ git clone https://github.com/sven1977/rllib_tutorials
$ cd rllib_tutorials
$ jupyter-lab
```

### Key Takeaways
* What is reinforcement learning and why RLlib?
* Core concepts of RLlib: Environments, Trainers, Policies, and Models.
* How to configure, hyperparameter-tune, and parallelize RLlib.
* RLlib debugging best practices.

### Tutorial Outline
1. RL and RLlib in a nutshell.
1. Defining an RL-solvable problem: Our first environment.
1. **Exercise No.1**: Environment loop.

(15min break)

1. Picking an algorithm and training our first RLlib Trainer.
1. Configurations and hyperparameters - Easy tuning with Ray Tune.
1. Fixing our experiment's config - Going multi-agent.
1. The "infinite laptop": Quick intro into how to use RLlib with Anyscale's product.
1. **Exercise No.2**: Run your own Ray RLlib+Tune experiment)
1. Neural network models - Provide your custom models using tf.keras or torch.nn.

(15min break)

1. Deeper dive into RLlib's parallelization architecture.
1. Specifying different compute resources and parallelization options through our config.
1. "Hacking in": Using callbacks to customize the RL loop and generate our own metrics.
1. **Exercise No.3**: Write your own custom callback.
1. "Hacking in (part II)" - Debugging with RLlib and PyCharm.
1. Checking on the "infinite laptop" - Did RLlib learn to solve the problem?

### Other Recommended Readings
* [Attention Nets and More with RLlib's Trajectory View API](https://medium.com/distributed-computing-with-ray/attention-nets-and-more-with-rllibs-trajectory-view-api-d326339a6e65)
* [Intro to RLlib: Example Environments](https://medium.com/distributed-computing-with-ray/intro-to-rllib-example-environments-3a113f532c70)
* [Reinforcement Learning with RLlib in the Unity Game Engine](https://medium.com/distributed-computing-with-ray/reinforcement-learning-with-rllib-in-the-unity-game-engine-1a98080a7c0d)


<img src="images/rl-cycle.png" width=800>

### Coding/defining our "problem" via an RL environment.

We will use the following (adversarial) multi-agent environment
throughout this tutorial to demonstrate a large fraction of RLlib's
APIs, features, and customization options.

<img src="images/environment.png" width=800>

### A word or two on Spaces:

Spaces are used in ML to describe what possible/valid values inputs and outputs of a neural network can have.

RL environments also use them to describe what their valid observations and actions are.

Spaces are usually defined by their shape (e.g. 84x84x3 RGB images) and datatype (e.g. uint8 for RGB values between 0 and 255).
However, spaces could also be composed of other spaces (see Tuple or Dict spaces) or could be simply discrete with n fixed possible values
(represented by integers). For example, in our game, where each agent can only go up/down/left/right, the action space would be "Discrete(4)"
(no datatype, no shape needs to be defined here).

<img src="images/spaces.png" width=800>

In [1]:
# Let's code (parts of) our multi-agent environment.

import gym
from gym.spaces import Discrete, MultiDiscrete
import numpy as np
import random

from ray.rllib.env.multi_agent_env import MultiAgentEnv


class MultiAgentArena(MultiAgentEnv):
    def __init__(self, config=None):
        config = config or {}
        # Dimensions of the grid.
        self.width = config.get("width", 10)
        self.height = config.get("height", 10)

        # End an episode after this many timesteps.
        self.timestep_limit = config.get("ts", 100)

        self.observation_space = MultiDiscrete([self.width * self.height,
                                                self.width * self.height])
        # 0=up, 1=right, 2=down, 3=left.
        self.action_space = Discrete(4)

        # Reset env.
        self.reset()
        
    def reset(self):
        """Returns initial observation of next(!) episode."""
        # !LIVE CODING!
        #pass
        # Row-major coords!
        self.agent1_pos = [0, 0]
        self.agent2_pos = [self.height - 1, self.width - 1]
        # Reset agent1's visited fields.
        self.agent1_visited_fields = set()
        # How many timesteps have we done in this episode.
        self.timesteps = 0
        return self._get_obs()

    def step(self, action: dict):
        """Returns (next observation, rewards, dones, infos) after having taken the given action."""
        # !LIVE CODING!
        #pass
        # increase our time steps counter by 1.
        self.timesteps += 1
        # An episode is "done" when we reach the time step limit.
        is_done = self.timesteps >= self.timestep_limit

        # Determine, who is allowed to move first (50:50).
        if random.random() > 0.5:
            # events = [collision|new_field]
            events = self._move(self.agent1_pos, action["agent1"], is_agent1=True)
            events |= self._move(self.agent2_pos, action["agent2"], is_agent1=False)
        else:
            events = self._move(self.agent2_pos, action["agent2"], is_agent1=False)
            events |= self._move(self.agent1_pos, action["agent1"], is_agent1=True)

        # Get observations (based on new agent positions).
        obs = self._get_obs()

        # Determine rewards based on the collected events:
        rewards = {
            "agent1": -1.0 if "collision" in events else 1.0 if "new_field" in events else -0.5,
            "agent2": 1.0 if "collision" in events else -0.1,
        }

        # Generate a `done` dict (per-agent and total).
        dones = {
            "agent1": is_done,
            "agent2": is_done,
            # special `__all__` key indicates that the episode is done for all agents.
            "__all__": is_done,
        }

        return obs, rewards, dones, {}  # <- info dict (not needed here).

    def _get_obs(self):
        """
        Returns obs dict (agent name to discrete-pos tuple) using each
        agent's current x/y-positions.
        """
        ag1_discrete_pos = self.agent1_pos[0] * self.width + \
            (self.agent1_pos[1] % self.width)
        ag2_discrete_pos = self.agent2_pos[0] * self.width + \
            (self.agent2_pos[1] % self.width)
        return {
            "agent1": np.array([ag1_discrete_pos, ag2_discrete_pos]),
            "agent2": np.array([ag2_discrete_pos, ag1_discrete_pos]),
        }

    def _move(self, coords, action, is_agent1):
        """
        Moves an agent (agent1 iff is_agent1=True, else agent2) from `coords` (x/y) using the
        given action (0=up, 1=right, etc..) and returns a resulting events dict:
        Agent1: "new" when entering a new field. "bumped" when having been bumped into by agent2.
        Agent2: "bumped" when bumping into agent1 (agent1 then gets -1.0).
        """
        orig_coords = coords[:]
        # Change the row: 0=up (-1), 2=down (+1)
        coords[0] += -1 if action == 0 else 1 if action == 2 else 0
        # Change the column: 1=right (+1), 3=left (-1)
        coords[1] += 1 if action == 1 else -1 if action == 3 else 0

        # Solve collisions.
        # Make sure, we don't end up on the other agent's position.
        # If yes, don't move (we are blocked).
        if (is_agent1 and coords == self.agent2_pos) or (not is_agent1 and coords == self.agent1_pos):
            coords[0], coords[1] = orig_coords
            # Agent2 blocked agent1 (agent1 tried to run into agent2)
            # OR Agent2 bumped into agent1 (agent2 tried to run into agent1)
            return {"collision"}

        # No agent blocking -> check walls.
        if coords[0] < 0:
            coords[0] = 0
        elif coords[0] >= self.height:
            coords[0] = self.height - 1
        if coords[1] < 0:
            coords[1] = 0
        elif coords[1] >= self.width:
            coords[1] = self.width - 1

        # If agent1 -> "new" if new tile covered.
        if is_agent1 and not tuple(coords) in self.agent1_visited_fields:
            self.agent1_visited_fields.add(tuple(coords))
            return {"new_field"}
        # No new tile for agent1.
        return set()

    # Optionally: Add `render` method returning some img.
    def render(self, mode=None):
        field_size = 40

        if not hasattr(self, "viewer"):
            from gym.envs.classic_control import rendering
            self.viewer = rendering.Viewer(400, 400)
            self.fields = {}
            # Add our grid, and the two agents to the viewer.
            for i in range(self.width):
                l = i * field_size
                r = l + field_size
                for j in range(self.height):
                    b = 400 - j * field_size - field_size
                    t = b + field_size
                    field = rendering.PolyLine([(l, b), (l, t), (r, t), (r, b)], close=True)
                    field.set_color(.0, .0, .0)
                    field.set_linewidth(1.0)
                    self.fields[(j, i)] = field
                    self.viewer.add_geom(field)
            
            agent1 = rendering.make_circle(radius=field_size // 2 - 4)
            agent1.set_color(.0, 0.8, 0.1)
            self.agent1_trans = rendering.Transform()
            agent1.add_attr(self.agent1_trans)
            agent2 = rendering.make_circle(radius=field_size // 2 - 4)
            agent2.set_color(.5, 0.1, 0.1)
            self.agent2_trans = rendering.Transform()
            agent2.add_attr(self.agent2_trans)
            self.viewer.add_geom(agent1)
            self.viewer.add_geom(agent2)

        # Mark those fields green that have been covered by agent1,
        # all others black.
        for i in range(self.width):
            for j in range(self.height):
                self.fields[(j, i)].set_color(.0, .0, .0)
                self.fields[(j, i)].set_linewidth(1.0)
        for (j, i) in self.agent1_visited_fields:
            self.fields[(j, i)].set_color(.1, .5, .1)
            self.fields[(j, i)].set_linewidth(5.0)
        
        # Edit the pole polygon vertex
        self.agent1_trans.set_translation(self.agent1_pos[1] * field_size + field_size / 2, 400 - (self.agent1_pos[0] * field_size + field_size / 2))
        self.agent2_trans.set_translation(self.agent2_pos[1] * field_size + field_size / 2, 400 - (self.agent2_pos[0] * field_size + field_size / 2))

        return self.viewer.render(return_rgb_array=mode == 'rgb_array')

dummy_env = MultiAgentArena()

obs = dummy_env.reset()

# Agent1 will move down, Agent2 moves up.
obs, rewards, dones, infos = dummy_env.step(action={"agent1": 2, "agent2": 0})

print("Agent1's x/y position={}".format(dummy_env.agent1_pos))
print("Agent2's x/y position={}".format(dummy_env.agent2_pos))
print("Env timesteps={}".format(dummy_env.timesteps))
print("r1={} r2={}".format(rewards["agent1"], rewards["agent2"]))



Agent1's x/y position=[1, 0]
Agent2's x/y position=[8, 9]
Env timesteps=1
r1=1.0 r2=-0.1


## Exercise No 1

<hr />

<img src="images/exercise1.png" width=400>

In the cell above, we performed a `reset()` and a single `step()` call. To walk through an entire episode, one would normally call `step()` repeatedly (with different actions) until the returned `done` dict has the "agent1" or "agent2" (or "__all__") key set to True. Your task is to write an "environment loop" that runs for exactly one episode using our `MultiAgentArena` class.

Follow these instructions here to get this done.

1. Create an env object.
1. `reset` your environment to get the first (initial) observation.
1. Compute the actions for "agent1" and "agent2" calling `DummyTrainer.compute_action([obs])` twice and putting the results into an action dict to be passed into `step()`, just like it's done in the above cell (where we do a single `step()`).
1. Repeat this, `step`ing through an entire episode.
1. When an episode is done, `step()` will return a done dict with key `__all__` set to True.
1. If you feel, this is way too easy for you ;) , try to extract each agent's reward, sum it up over the episode and - at the end of the episode - print out each agent's accumulated reward (also called the "return" of an episode).

**Good luck! :)**


In [2]:
class DummyTrainer:
    """Dummy Trainer class used in Exercise #1.

    Use its `compute_action` method to get a new action for one of the agents,
    given the agent's observation (a single discrete value encoding the field
    the agent is currently in).
    """

    def compute_action(self, single_agent_obs=None):
        # Returns a random action for a single agent.
        return np.random.randint(4)  # Discrete(4) -> return rand int between 0 and 3 (incl. 3).

dummy_trainer = DummyTrainer()
# Check, whether it's working.
for _ in range(3):
    # Get action for agent1 (providing agent1's and agent2's positions).
    print("action_agent1={}".format(dummy_trainer.compute_action(np.array([0, 99]))))

    # Get action for agent2 (providing agent2's and agent1's positions).
    print("action_agent2={}".format(dummy_trainer.compute_action(np.array([99, 0]))))

    print()

action_agent1=2
action_agent2=1

action_agent1=1
action_agent2=1

action_agent1=2
action_agent2=3



Write your solution code into this cell here:

In [4]:
# Solution to Exercise #1
import time

env = MultiAgentArena()

obs = env.reset()

num_episodes = 0

r1 = r2 = 0.0

while num_episodes < 10:
    action1 = dummy_trainer.compute_action(single_agent_obs=obs["agent1"])
    action2 = dummy_trainer.compute_action(single_agent_obs=obs["agent2"])

    actions = {"agent1": action1, "agent2": action2}
    
    obs, reward, done, info = env.step(actions)
    r1 += reward["agent1"]
    r2 += reward["agent2"]

    if done["__all__"] is True:
        num_episodes += 1
        print(f"episode is done r1={r1} r2={r2}!")
        r1 = r2 = 0.0

    #env.render()
        
# !LIVE CODING!

episode is done r1=-7.0 r2=-8.89999999999998!
episode is done r1=-1.0 r2=1.0!
episode is done r1=1.0 r2=-0.1!
episode is done r1=-1.0 r2=1.0!
episode is done r1=1.0 r2=-0.1!
episode is done r1=1.0 r2=-0.1!
episode is done r1=1.0 r2=-0.1!
episode is done r1=-0.5 r2=-0.1!
episode is done r1=-0.5 r2=-0.1!
episode is done r1=1.0 r2=-0.1!


#### Exercise 1: Results video
If you added the env.render() call after each step, you should expect to see something like this:


In [5]:
%%HTML
<video width="320" height="240" controls>
  <source src="videos/random_walk.mov" type="video/mp4">
</video>

------------------
## 15 min break :)
------------------

### And now for something completely different:
#### Plugging in RLlib!

In [6]:
import numpy as np
import pprint
import ray

# Start a new instance of Ray (when running this tutorial locally) or
# connect to an already running one (when running this tutorial through Anyscale).

ray.init()  # Hear the engine humming? ;)

# In case you encounter the following error during our tutorial: `RuntimeError: Maybe you called ray.init twice by accident?`
# Try: `ray.shutdown() + ray.init()` or `ray.init(ignore_reinit_error=True)`

2021-06-18 19:01:54,546	INFO services.py:1272 -- View the Ray dashboard at [1m[32mhttp://127.0.0.1:8265[39m[22m


{'node_ip_address': '192.168.0.100',
 'raylet_ip_address': '192.168.0.100',
 'redis_address': '192.168.0.100:6379',
 'object_store_address': '/tmp/ray/session_2021-06-18_19-01-52_951008_15020/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-06-18_19-01-52_951008_15020/sockets/raylet',
 'webui_url': '127.0.0.1:8265',
 'session_dir': '/tmp/ray/session_2021-06-18_19-01-52_951008_15020',
 'metrics_export_port': 60637,
 'node_id': '952e326ec77df0bee4624ea5d808aac5a372bb18a68d2a067c74aee7'}

### Picking an RLlib algorithm - We'll use PPO throughout this tutorial (one-size-fits-all-kind-of-algo)

<img src="images/rllib_algos.png" width=800>

https://docs.ray.io/en/master/rllib-algorithms.html#available-algorithms-overview

In [7]:
# Import a Trainable (one of RLlib's built-in algorithms):
# We use the PPO algorithm here b/c its very flexible wrt its supported
# action spaces and model types and b/c it learns well almost any problem.
from ray.rllib.agents.ppo import PPOTrainer

# Specify a very simple config, defining our environment and some environment
# options (see environment.py).
config = {
    "env": MultiAgentArena,  # "my_env" <- if we previously have registered the env with `tune.register_env("[name]", lambda config: [returns env object])`.
    "env_config": {
        "config": {
            "width": 10,
            "height": 10,
            "ts": 100,
        },
    },

    # !PyTorch users!
    #"framework": "torch",  # If users have chosen to install torch instead of tf.

    "create_env_on_driver": True,
}
# Instantiate the Trainer object using above config.
rllib_trainer = PPOTrainer(config=config)
rllib_trainer

2021-06-18 19:04:56,856	INFO trainer.py:671 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
2021-06-18 19:04:56,858	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2021-06-18 19:05:07,149	INFO trainable.py:101 -- Trainable.setup took 10.295 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.


PPO

### Ready to train with RLlib's PPO algorithm

That's it, we are ready to train.
Calling `Trainer.train()` will execute a single "training iteration".

One iteration for most algos involves:

1) sampling from the environment(s)
2) using the sampled data (observations, actions taken, rewards) to update the policy model (neural network), such that it would pick better actions in the future, leading to higher rewards.

Let's try it out:

In [8]:
results = rllib_trainer.train()

# Delete the config from the results for clarity.
# Only the stats will remain, then.
del results["config"]
# Pretty print the stats.
pprint.pprint(results)

{'agent_timesteps_total': 4000,
 'custom_metrics': {},
 'date': '2021-06-18_19-06-06',
 'done': False,
 'episode_len_mean': 100.0,
 'episode_media': {},
 'episode_reward_max': 15.9,
 'episode_reward_mean': -5.399999999999995,
 'episode_reward_min': -33.00000000000003,
 'episodes_this_iter': 20,
 'episodes_total': 20,
 'experiment_id': 'd1c9826afaea4f9d8a902daa946ab921',
 'hist_stats': {'episode_lengths': [100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
            

In [10]:
# Run `train()` n times. Try to repeatedly call this to see rewards increase.
# Move on once you see episode rewards of 10.0 or more.
for _ in range(10):
    results = rllib_trainer.train()
    print(f"Iteration={rllib_trainer.iteration}: R(\"return\")={results['episode_reward_mean']}")

Iteration=12: R("return")=5.46
Iteration=13: R("return")=6.374999999999997
Iteration=14: R("return")=7.640999999999991
Iteration=15: R("return")=8.957999999999984
Iteration=16: R("return")=9.914999999999981
Iteration=17: R("return")=10.739999999999975
Iteration=18: R("return")=12.131999999999966
Iteration=19: R("return")=12.500999999999966
Iteration=20: R("return")=13.562999999999958
Iteration=21: R("return")=14.138999999999955


#### !OPTIONAL HACK! (<-- we will not do these during the tutorial, but feel free to try cells marked like this)

Use the above solution of Exercise #1 and replace our `dummy_trainer` in that solution
with the now trained `rllib_trainer`. You should see a better performance of the two agents.

However, keep in mind that we are mostly training agent1 as we only trian a single policy and agent1
is the "easier" one to collect high rewards with.

#### !OPTIONAL HACK!

Feel free to play around with the following code in order to learn how RLlib - under the hood - calculates actions from the environment's observations using Policies and their model(s) inside our Trainer object):

In [None]:
# Let's actually "look inside" our Trainer to see what's in there.
from ray.rllib.utils.numpy import softmax

# To get to one of the policies inside the Trainer, use `Trainer.get_policy()`:
policy = rllib_trainer.get_policy()
print(f"Our (only!) Policy right now is: {policy}")

# To get to the model inside any policy, do:
model = policy.model
#print(f"Our Policy's model is: {model}")

# Print out the policy's action and observation spaces.
print(f"Our Policy's observation space is: {policy.observation_space}")
print(f"Our Policy's action space is: {policy.action_space}")

# Produce a random obervation (B=1; batch of size 1).
obs = np.array([policy.observation_space.sample()])
# Alternatively for PyTorch:
#import torch
#obs = torch.from_numpy(obs)

# Get the action logits (as tf tensor).
# If you are using torch, you would get a torch tensor here.
logits, _ = model({"obs": obs})
logits

# Numpyize the tensor by running `logits` through the Policy's own tf.Session.
logits_np = policy.get_session().run(logits)
# For torch, you can simply do: `logits_np = logits.detach().cpu().numpy()`.

# Convert logits into action probabilities and remove the B=1.
action_probs = np.squeeze(softmax(logits_np))

# Sample an action, using the probabilities.
action = np.random.choice([0, 1, 2, 3], p=action_probs)

# Print out the action.
print(f"samped action={action}")

### Saving and restoring a trained Trainer.
Currently, `rllib_trainer` is in an already trained state.
It holds optimized weights in its Policy's model that allow it to act
already somewhat smart in our environment when given an observation.

However, if we closed this notebook right now, all the effort would have been for nothing.
Let's therefore save the state of our trainer to disk for later!

In [11]:
# We use the `Trainer.save()` method to create a checkpoint.
checkpoint_file = rllib_trainer.save()
print(f"Trainer (at iteration {rllib_trainer.iteration} was saved in '{checkpoint_file}'!")

# Here is what a checkpoint directory contains:
print("The checkpoint directory contains the following files:")
import os
os.listdir(os.path.dirname(checkpoint_file))

Trainer (at iteration 21 was saved in '/Users/sven/ray_results/PPO_MultiAgentArena_2021-06-18_19-04-5642yoh49e/checkpoint_000021/checkpoint-21'!
The checkpoint directory contains the following files:


['checkpoint-21', 'checkpoint-21.tune_metadata', '.is_checkpoint']

### Restoring and evaluating a Trainer
In the following cell, we'll learn how to restore a saved Trainer from a checkpoint file.

We'll also evaluate a completely new Trainer (should act more or less randomly) vs an already trained one (the one we just restored from the created checkpoint file).

In [12]:
# Pretend, we wanted to pick up training from a previous run:
new_trainer = PPOTrainer(config=config)
# Evaluate the new trainer (this should yield random results).
results = new_trainer.evaluate()
print(f"Evaluating new trainer: R={results['evaluation']['episode_reward_mean']}")

# Restoring the trained state into the `new_trainer` object.
print(f"Before restoring: Trainer is at iteration={new_trainer.iteration}")
new_trainer.restore(checkpoint_file)
print(f"After restoring: Trainer is at iteration={new_trainer.iteration}")

# Evaluate again (this should yield results we saw after having trained our saved agent).
results = new_trainer.evaluate()
print(f"Evaluating restored trainer: R={results['evaluation']['episode_reward_mean']}")

2021-06-18 19:14:22,166	INFO trainable.py:101 -- Trainable.setup took 10.086 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2021-06-18 19:14:23,964	INFO trainable.py:377 -- Restored on 192.168.0.100 from checkpoint: /Users/sven/ray_results/PPO_MultiAgentArena_2021-06-18_19-04-5642yoh49e/checkpoint_000021/checkpoint-21
2021-06-18 19:14:23,965	INFO trainable.py:385 -- Current state after restoring: {'_iteration': 21, '_timesteps_total': None, '_time_total': 70.34251928329468, '_episodes_total': 420}


Evaluating new trainer: R=-11.280000000000008
Before restoring: Trainer is at iteration=0
After restoring: Trainer is at iteration=21
Evaluating restored trainer: R=17.03999999999994


In order to release all resources from a Trainer, you can use a Trainer's `stop()` method.
You should definitley run this cell as it frees resources that we'll need later in this tutorial, when we'll do parallel hyperparameter sweeps.

In [13]:
rllib_trainer.stop()
new_trainer.stop()

### Moving stuff to the professional level: RLlib in connection w/ Ray Tune

Running any experiments through Ray Tune is the recommended way of doing things with RLlib. If you look at our
<a href="https://github.com/ray-project/ray/tree/master/rllib/examples">examples scripts folder</a>, you will see that almost all of the scripts use Ray Tune to run the particular RLlib workload demonstrated in each script.

<img src="images/rllib_and_tune.png" width=400>

When setting up hyperparameter sweeps for Tune, we'll do this in our already familiar config dict.

So let's take a quick look at our PPO algo's default config to understand, which hyperparameters we may want to play around with:

In [14]:
# Configuration dicts and Ray Tune.
# Where are the default configuration dicts stored?

# PPO algorithm:
from ray.rllib.agents.ppo import DEFAULT_CONFIG as PPO_DEFAULT_CONFIG
print(f"PPO's default config is:")
pprint.pprint(PPO_DEFAULT_CONFIG)

# DQN algorithm:
#from ray.rllib.agents.dqn import DEFAULT_CONFIG as DQN_DEFAULT_CONFIG
#print(f"DQN's default config is:")
#pprint.pprint(DQN_DEFAULT_CONFIG)

# Common (all algorithms).
#from ray.rllib.agents.trainer import COMMON_CONFIG
#print(f"RLlib Trainer's default config is:")
#pprint.pprint(COMMON_CONFIG)

PPO's default config is:
{'_fake_gpus': False,
 'batch_mode': 'truncate_episodes',
 'callbacks': <class 'ray.rllib.agents.callbacks.DefaultCallbacks'>,
 'clip_actions': True,
 'clip_param': 0.3,
 'clip_rewards': None,
 'collect_metrics_timeout': 180,
 'compress_observations': False,
 'create_env_on_driver': False,
 'custom_eval_function': None,
 'custom_resources_per_worker': {},
 'eager_tracing': False,
 'entropy_coeff': 0.0,
 'entropy_coeff_schedule': None,
 'env': None,
 'env_config': {},
 'env_task_fn': None,
 'evaluation_config': {},
 'evaluation_interval': None,
 'evaluation_num_episodes': 10,
 'evaluation_num_workers': 0,
 'evaluation_parallel_to_training': False,
 'exploration_config': {'type': 'StochasticSampling'},
 'explore': True,
 'extra_python_environs_for_driver': {},
 'extra_python_environs_for_worker': {},
 'fake_sampler': False,
 'framework': 'tf',
 'gamma': 0.99,
 'grad_clip': None,
 'horizon': None,
 'ignore_worker_failures': False,
 'in_evaluation': False,
 'input'

In [15]:
# Plugging in Ray Tune.
# Note that this is the recommended way to run any experiments with RLlib.
# Reasons:
# - Tune allows you to do hyperparameter tuning in a user-friendly way
#   and at large scale!
# - Tune automatically allocates needed resources for the different
#   hyperparam trials and experiment runs on a cluster.

from ray import tune

# Now that we will run things "automatically" through tune, we have to
# define one or more stopping criteria.
# Tune will stop the run, once any single one of the criteria is matched (not all of them!).
stop = {
    # Note that the keys used here can be anything present in the above `rllib_trainer.train()` output dict.
    "training_iteration": 2,
    "episode_reward_mean": 20.0,
}

# "PPO" is a registered name that points to RLlib's PPOTrainer.
# See `ray/rllib/agents/registry.py`

# Run a simple experiment until one of the stopping criteria is met.
tune.run("PPO", config=config, stop=stop)


Trial name,status,loc
PPO_MultiAgentArena_2ca9a_00000,PENDING,


[2m[36m(pid=15528)[0m 2021-06-18 19:18:23,928	INFO trainer.py:671 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=15528)[0m 2021-06-18 19:18:23,928	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


Result for PPO_MultiAgentArena_2ca9a_00000:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-06-18_19-18-36
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 13.79999999999996
  episode_reward_mean: -7.5299999999999985
  episode_reward_min: -33.900000000000034
  episodes_this_iter: 20
  episodes_total: 20
  experiment_id: 26c3946481d24f7d9b8cf823a462505c
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.369903326034546
          entropy_coeff: 0.0
          kl: 0.01663612760603428
          model: {}
          policy_loss: -0.049911659210920334
          total_loss: 21.39533805847168
          vf_explained_var: 0.10548873990774155
          vf_loss: 21.441926956176758
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_2ca9a_00000,RUNNING,192.168.0.100:15528,1,3.57105,4000,-7.53,13.8,-33.9,100


Result for PPO_MultiAgentArena_2ca9a_00000:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-06-18_19-18-40
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 17.699999999999974
  episode_reward_mean: -4.019999999999991
  episode_reward_min: -33.900000000000034
  episodes_this_iter: 20
  episodes_total: 40
  experiment_id: 26c3946481d24f7d9b8cf823a462505c
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.3351209163665771
          entropy_coeff: 0.0
          kl: 0.022036012262105942
          model: {}
          policy_loss: -0.06060592457652092
          total_loss: 13.240772247314453
          vf_explained_var: 0.2622607350349426
          vf_loss: 13.29697036743164
    num_agent_steps_sampled: 8000
    num_agent_steps_trained: 8000
    num_steps_sampled: 8000
    num_steps_trained: 8

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_2ca9a_00000,TERMINATED,,2,7.61075,8000,-4.02,17.7,-33.9,100


2021-06-18 19:18:40,464	INFO tune.py:549 -- Total run time: 21.98 seconds (21.75 seconds for the tuning loop).


<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7fbf1bf35340>

### Let's do a very simple grid-search over two learning rates with tune.run().

In particular, we will try the learning rates 0.0001 and 0.5 (ouch!) using `tune.grid_search([...])`
inside our config dict:

In [16]:
# Updating an algo's default config dict and adding hyperparameter tuning
# options to it.
# Note: Hyperparameter tuning options (e.g. grid_search) will only work,
# if we run these configs via `tune.run`.
config.update(
    {
        # Try 2 different learning rates.
        "lr": tune.grid_search([0.0001, 0.5, 0.00003]),

        # NN model config to tweak the default model
        # that'll be created by RLlib for the policy.
        "model": {
            # e.g. change the dense layer stack (default is 2 layers: [256, 256]).
            "fcnet_hiddens": [256, 256, 256],

            # Alternatively, you can specify a custom model here
            # (we'll cover that later).
            # "custom_model": ...
            # Pass kwargs to your custom model.
            # "custom_model_config": {}
        },
    }
)

# Repeat our experiment using tune's grid-search feature.
results = tune.run(
    "PPO",
    config=config,
    stop=stop,
    # Note that no trainers will be returned from this call here.
    # Tune will create n Trainers internally, run them in parallel and destroy them at the end.
    # However, you can
    checkpoint_at_end=True,  # create a checkpoint when done.
    checkpoint_freq=10,  # create a checkpoint every 10th iteration.
)
print(results)

Trial name,status,loc,lr
PPO_MultiAgentArena_af18f_00000,PENDING,,0.0001
PPO_MultiAgentArena_af18f_00001,PENDING,,0.5


[2m[36m(pid=15531)[0m 2021-06-18 19:22:02,947	INFO trainer.py:671 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=15531)[0m 2021-06-18 19:22:02,947	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=15530)[0m 2021-06-18 19:22:02,947	INFO trainer.py:671 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=15530)[0m 2021-06-18 19:22:02,947	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=15530)[0m 2021-06-18 19:22:13,167	INFO trainable.py:101 -- Trainable.setup took 10.221 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=15531)[0m 2021-06-18 19:22:13,188	INFO trainable.py:101 -- Trainable.setup took 10.243 sec

Result for PPO_MultiAgentArena_af18f_00000:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-06-18_19-22-19
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 14.69999999999993
  episode_reward_mean: -7.004999999999998
  episode_reward_min: -31.50000000000005
  episodes_this_iter: 20
  episodes_total: 20
  experiment_id: bd4e956183a249a399618e76a3fc6d40
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 1.3468706607818604
          entropy_coeff: 0.0
          kl: 0.04043907672166824
          model: {}
          policy_loss: -0.07337543368339539
          total_loss: 18.143760681152344
          vf_explained_var: 0.17624105513095856
          vf_loss: 18.20905113220215
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained: 40

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_af18f_00000,RUNNING,192.168.0.100:15530,0.0001,1.0,6.26666,4000.0,-7.005,14.7,-31.5,100.0
PPO_MultiAgentArena_af18f_00001,RUNNING,,0.5,,,,,,,


Result for PPO_MultiAgentArena_af18f_00001:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-06-18_19-22-19
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 14.999999999999986
  episode_reward_mean: -7.140000000000001
  episode_reward_min: -36.000000000000064
  episodes_this_iter: 20
  episodes_total: 20
  experiment_id: 5682a43e8f814e299c87ba817748e989
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.5
          entropy: 0.11902828514575958
          entropy_coeff: 0.0
          kl: 21.256975173950195
          model: {}
          policy_loss: 0.4013426601886749
          total_loss: 43.357704162597656
          vf_explained_var: -0.09640281647443771
          vf_loss: 38.704959869384766
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained: 4000
  iterations_

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_af18f_00000,RUNNING,192.168.0.100:15530,0.0001,2,12.7041,8000,-6.3525,14.7,-31.5,100
PPO_MultiAgentArena_af18f_00001,RUNNING,192.168.0.100:15531,0.5,1,6.25784,4000,-7.14,15.0,-36.0,100


Result for PPO_MultiAgentArena_af18f_00001:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-06-18_19-22-25
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 14.999999999999986
  episode_reward_mean: -26.857500000000034
  episode_reward_min: -54.000000000000085
  episodes_this_iter: 20
  episodes_total: 40
  experiment_id: 5682a43e8f814e299c87ba817748e989
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 0.5
          entropy: 0.00425629923120141
          entropy_coeff: 0.0
          kl: 0.4869486689567566
          model: {}
          policy_loss: 0.00938626192510128
          total_loss: 161.44923400878906
          vf_explained_var: 0.11229164153337479
          vf_loss: 161.29376220703125
    num_agent_steps_sampled: 8000
    num_agent_steps_trained: 8000
    num_steps_sampled: 8000
    num_steps_trained: 8000
  iterations_

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_af18f_00000,TERMINATED,,0.0001,2,12.7041,8000,-6.3525,14.7,-31.5,100
PPO_MultiAgentArena_af18f_00001,TERMINATED,,0.5,2,12.7229,8000,-26.8575,15.0,-54.0,100


2021-06-18 19:22:26,684	INFO tune.py:549 -- Total run time: 29.36 seconds (28.77 seconds for the tuning loop).


<ray.tune.analysis.experiment_analysis.ExperimentAnalysis object at 0x7fbf13eb40d0>


### Going multi-policy:

So far, our experiment has been ill-configured, because both
agents, which should behave differently due to their different
tasks and reward functions, learn the same policy: the "default_policy",
which RLlib always provides if you don't configure anything else.
Remember that RLlib does not know at Trainer setup time, how many and which agents
the environment will "produce". Agent control (adding agents, removing them, terminating
episodes for agents) is entirely in the Env's hands.
Let's fix our single policy problem and introduce the "multiagent" API.

<img src="images/from_single_agent_to_multi_agent.png" width=800>

In order to turn on RLlib's multi-agent functionality, we need two things:

1. A policy mapping function, mapping agent IDs (e.g. a string like "agent1", produced by the environment in the returned observation/rewards/dones-dicts) to a policy ID (another string, e.g. "pol1", which is under our control).
1. A policies definition dict, mapping policy IDs (e.g. "pol1") to 4-tuples consisting of 1) policy class (None for using the default class), 2) observation space, 3) action space, and 4) config overrides (empty dict for no overrides and using the Trainer's main config dict).

Let's take a closer look:

In [17]:
# Define an agent->policy mapping function.
# Which agents (defined by the environment) use which policies (defined by us)?
# The mapping here is M (agents) -> N (policies), where M >= N.
def policy_mapping_fn(agent: str):
    # Make sure agent ID is valid.
    assert agent in ["agent1", "agent2"], f"ERROR: invalid agent {agent}!"
    # Map agent1 to pol1, and agent2 to pol2.
    return "pol1" if agent == "agent1" else "pol2"

# Get the spaces for our two policies from our already existing Trainer object:
observation_space = dummy_env.observation_space
action_space = dummy_env.action_space

# Define the policies definition dict:
# Each policy in there is defined by its ID mapping to a 4-tuple:
# - Policy class (None for using the default class)
# - obs-space
# - act-space
# - config-overrides dict (leave empty for using the Trainer's config as-is)
policies = {
    "pol1": (None, observation_space, action_space, {"lr": 0.0002}),
    "pol2": (None, observation_space, action_space, {}),
}
# Note that now we won't have a "default_policy" anymore, just "pol1" and "pol2".

# We could - if we wanted - specify, which policies should be learnt (by default, RLlib learns all).
# Non-learnt policies will be frozen and not updated:
# policies_to_train = ["pol1", "pol2"]

# Adding the above to our config.
config.update({
    "multiagent": {
        "policies": policies,
        "policy_mapping_fn": policy_mapping_fn,
        # We'll leave this empty: Means, we train both pol1 and pol2.
        # "policies_to_train": policies_to_train,
    },
})

## Exercise No 2

<hr />

Using the `config` that we have built so far (the one we just updated to include a multi-agent setup),
try learning our environment using Ray tune.run and a simple hyperparameter grid_search over 2 different hyperparameters:
- 2 learning rates (pick your own values).
- AND 2 `train_batch_size`s: use 2000 and 3000.

Also set the `num_workers` config parameter to 1. This will make sure that 3 trials at a time will run in parallel.
Btw, how many trials in total are we expecting to be created by tune?

Background: PPO - by default - uses 2 workers (`num_workers=2`), which makes each trial use 3 CPUs (2 workers + 1 "driver" CPU).
Using only 1 worker would lower this to 2 CPUs per trial. We have 6 CPUs available in this tutorial (if you run this locally, you may have more), so 3 trials can run simultaneously. Tune will run trials in sequence in case it cannot allocate enough CPUs at once
(which is also fine, but then takes longer).

Try to reach a total reward (sum of agent1 and agent2) of 15.0.

**Good luck! :)**


In [21]:
# Solution to Exercise #2
del config["num_worker"]

config.update({
    "lr": tune.grid_search([0.0001, 0.00001]),
    "train_batch_size": tune.grid_search([2000, 3000]),
    "num_workers": 1, # 2 CPUs per trial (why? 1 "local worker" (learner), "rollout" 1 worker (num_workers=1))
})

analysis = tune.run("PPO", config=config, stop=stop, checkpoint_freq=3, checkpoint_at_end=True, verbose=2)

[2m[36m(pid=15770)[0m 2021-06-18 19:38:11,463	INFO trainer.py:671 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=15770)[0m 2021-06-18 19:38:11,463	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=15772)[0m 2021-06-18 19:38:11,463	INFO trainer.py:671 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=15772)[0m 2021-06-18 19:38:11,463	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=15779)[0m 2021-06-18 19:38:11,463	INFO trainer.py:671 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=15779)[0m 2021-06-18 19:38:11,463	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv

Trial PPO_MultiAgentArena_f007c_00001 reported episode_reward_max=10.500000000000025,episode_reward_min=-27.000000000000064,episode_reward_mean=-9.01499999999999,episode_len_mean=100.0,episode_media={},episodes_this_iter=20,policy_reward_min={'pol1': -17.0, 'pol2': -9.99999999999998},policy_reward_max={'pol1': 20.5, 'pol2': -4.500000000000001},policy_reward_mean={'pol1': 0.05, 'pol2': -9.06499999999998},custom_metrics={},sampler_perf={'mean_raw_obs_processing_ms': 0.20385634476157918, 'mean_inference_ms': 1.6513334280964371, 'mean_action_processing_ms': 0.09069843092064805, 'mean_env_wait_ms': 0.04387926543015115, 'mean_env_render_ms': 0.0},off_policy_estimator={},num_healthy_workers=1,agent_timesteps_total=4000,timers={'sample_time_ms': 4028.55, 'sample_throughput': 496.457, 'load_time_ms': 163.764, 'load_throughput': 12212.696, 'learn_time_ms': 6836.038, 'learn_throughput': 292.567, 'update_time_ms': 11.994},info={'learner': {'pol1': {'learner_stats': {'cur_kl_coeff': 0.2000000029802

Trial PPO_MultiAgentArena_f007c_00000 reported episode_reward_max=14.09999999999998,episode_reward_min=-21.000000000000007,episode_reward_mean=-1.6499999999999944,episode_len_mean=100.0,episode_media={},episodes_this_iter=20,policy_reward_min={'pol1': -11.0, 'pol2': -9.99999999999998},policy_reward_max={'pol1': 23.0, 'pol2': -1.1999999999999855},policy_reward_mean={'pol1': 6.975, 'pol2': -8.624999999999982},custom_metrics={},sampler_perf={'mean_raw_obs_processing_ms': 0.21032867641344125, 'mean_inference_ms': 1.662782762480759, 'mean_action_processing_ms': 0.09154630029040653, 'mean_env_wait_ms': 0.044714743229581494, 'mean_env_render_ms': 0.0},off_policy_estimator={},num_healthy_workers=1,agent_timesteps_total=4000,timers={'sample_time_ms': 4069.719, 'sample_throughput': 491.434, 'load_time_ms': 167.373, 'load_throughput': 11949.346, 'learn_time_ms': 6881.159, 'learn_throughput': 290.649, 'update_time_ms': 14.418},info={'learner': {'pol1': {'learner_stats': {'cur_kl_coeff': 0.20000000

Trial PPO_MultiAgentArena_f007c_00002 reported episode_reward_max=19.500000000000007,episode_reward_min=-33.00000000000006,episode_reward_mean=-4.319999999999999,episode_len_mean=100.0,episode_media={},episodes_this_iter=30,policy_reward_min={'pol1': -32.0, 'pol2': -9.99999999999998},policy_reward_max={'pol1': 29.5, 'pol2': -0.0999999999999869},policy_reward_mean={'pol1': 4.433333333333334, 'pol2': -8.753333333333314},custom_metrics={},sampler_perf={'mean_raw_obs_processing_ms': 0.24494120614681977, 'mean_inference_ms': 1.995622853524126, 'mean_action_processing_ms': 0.11405298130705294, 'mean_env_wait_ms': 0.05140069404152065, 'mean_env_render_ms': 0.0},off_policy_estimator={},num_healthy_workers=1,agent_timesteps_total=6000,timers={'sample_time_ms': 7297.943, 'sample_throughput': 411.075, 'load_time_ms': 162.855, 'load_throughput': 18421.278, 'learn_time_ms': 10036.045, 'learn_throughput': 298.923, 'update_time_ms': 7.767},info={'learner': {'pol1': {'learner_stats': {'cur_kl_coeff': 

Trial PPO_MultiAgentArena_f007c_00000 reported episode_reward_max=16.5,episode_reward_min=-21.000000000000007,episode_reward_mean=-3.1574999999999926,episode_len_mean=100.0,episode_media={},episodes_this_iter=20,policy_reward_min={'pol1': -11.0, 'pol2': -9.99999999999998},policy_reward_max={'pol1': 26.5, 'pol2': -1.1999999999999855},policy_reward_mean={'pol1': 5.4125, 'pol2': -8.569999999999983},custom_metrics={},sampler_perf={'mean_raw_obs_processing_ms': 0.2500196248580802, 'mean_inference_ms': 1.9589582637144787, 'mean_action_processing_ms': 0.1124810715725918, 'mean_env_wait_ms': 0.0524420600202498, 'mean_env_render_ms': 0.0},off_policy_estimator={},num_healthy_workers=1,agent_timesteps_total=8000,timers={'sample_time_ms': 5535.318, 'sample_throughput': 361.316, 'load_time_ms': 87.398, 'load_throughput': 22883.805, 'learn_time_ms': 6095.418, 'learn_throughput': 328.115, 'update_time_ms': 10.716},info={'learner': {'pol1': {'learner_stats': {'cur_kl_coeff': 0.30000001192092896, 'cur_

Trial PPO_MultiAgentArena_f007c_00002 reported episode_reward_max=19.500000000000007,episode_reward_min=-33.00000000000006,episode_reward_mean=-3.039999999999994,episode_len_mean=100.0,episode_media={},episodes_this_iter=30,policy_reward_min={'pol1': -32.0, 'pol2': -9.99999999999998},policy_reward_max={'pol1': 29.5, 'pol2': -0.0999999999999869},policy_reward_mean={'pol1': 5.566666666666666, 'pol2': -8.60666666666665},custom_metrics={},sampler_perf={'mean_raw_obs_processing_ms': 0.26534294653247337, 'mean_inference_ms': 2.111634136837117, 'mean_action_processing_ms': 0.12251351096840844, 'mean_env_wait_ms': 0.05533979705967596, 'mean_env_render_ms': 0.0},off_policy_estimator={},num_healthy_workers=1,agent_timesteps_total=12000,timers={'sample_time_ms': 8196.839, 'sample_throughput': 365.995, 'load_time_ms': 83.957, 'load_throughput': 35732.494, 'learn_time_ms': 8304.951, 'learn_throughput': 361.23, 'update_time_ms': 6.379},info={'learner': {'pol1': {'learner_stats': {'cur_kl_coeff': 0.3

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_f007c_00000,TERMINATED,,0.0001,2000,2,23.719,4000,-3.1575,16.5,-21.0,100
PPO_MultiAgentArena_f007c_00001,TERMINATED,,1e-05,2000,2,23.6519,4000,-7.1025,10.5,-27.0,100
PPO_MultiAgentArena_f007c_00002,TERMINATED,,0.0001,3000,2,33.3586,6000,-3.04,19.5,-33.0,100
PPO_MultiAgentArena_f007c_00003,TERMINATED,,1e-05,3000,2,32.9821,6000,-5.435,17.1,-40.5,100


2021-06-18 19:39:03,148	INFO tune.py:549 -- Total run time: 57.89 seconds (57.60 seconds for the tuning loop).


In [None]:
print(analysis)

best_checkpoint = analysis.get_best_checkpoint()

------------------
## 15 min break :)
------------------

## Let's talk about customization options

### Deep Dive: Providing your custom Models in tf or torch.

In [None]:
# Custom Neural Network Models.

import tensorflow as tf


class MyModel(tf.keras.Model):
    def __init__(self,
                input_space,
                action_space,
                num_outputs,
                name="",
                *,
                layers = (256, 256)):
        super().__init__(name=name)

        self.dense_layers = []
        for i, layer_size in enumerate(layers):
            self.dense_layers.append(tf.keras.layers.Dense(
                layer_size, activation=tf.nn.relu, name=f"dense_{i}"))

        self.logits = tf.keras.layers.Dense(
            num_outputs,
            activation=tf.keras.activations.linear,
            name="logits")
        self.values = tf.keras.layers.Dense(
            1, activation=None, name="values")

    def call(self, inputs, training=None, mask=None):
        # Standardized input args:
        # - input_dict (RLlib `SampleBatch` object, which is basically a dict with numpy arrays
        # in it)
        out = inputs["obs"]
        for l in self.dense_layers:
            out = l(out)
        logits = self.logits(out)
        values = self.values(out)

        # Standardized output:
        # - "normal" model output tensor (e.g. action logits).
        # - list of internal state outputs (only needed for RNN-/memory enhanced models).
        # - "extra outs", such as model's side branches, e.g. value function outputs.
        return logits, [], {"vf_preds": tf.reshape(values, [-1])}

In [None]:
# Do a quick test on the custom model class.
from gym.spaces import Box
test_model = MyModel(
    input_space=Box(-1.0, 1.0, (2, )),
    action_space=None,
    num_outputs=2,
)
test_model({"obs": np.array([[0.5, 0.5]])})

In [None]:
# Set up our custom model and re-run the experiment.
config.update({
    "model": {
        "custom_model": MyModel,
        "custom_model_config": {
            "layers": [128, 128],
        },
    },
    # Revert these to single trials (and use those hyperparams that performed well in our Exercise #2).
    "lr": 0.0005,
    "train_batch_size": 2000,
})

tune.run("PPO", config=config, stop=stop)

### Deep Dive: How do we customize RLlib's RL loop?

RLlib offers a callbacks API that allows you to add custom behavior to
all major events during the environment sampling- and learning process.

**Our problem:** So far, we can only see the total reward (sum for both agents).
This does not give us enough insights into the question of which agent
learns what (maybe agent2 doesn't learn anything and the reward we are observing
is mostly due to agent1's progress in covering the map!).

In the following cell, we will create some custom callback "hooks" that will allow us to
add each agents single reward to the returned metrics (which will then be displayed in tensorboard!).

For that we will override RLlib's DefaultCallbacks class and implement the
`on_episode_start`, `on_episode_step`, and `on_episode_end` methods therein:


In [None]:
# Override the DefaultCallbacks with your own and implement any methods (hooks)
# that you need.
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray.rllib.evaluation.episode import MultiAgentEpisode


class MyCallbacks(DefaultCallbacks):
    def on_episode_start(self,
                         *,
                         worker,
                         base_env,
                         policies,
                         episode: MultiAgentEpisode,
                         env_index,
                         **kwargs):
        # We will use the `MultiAgentEpisode` object being passed into
        # all episode-related callbacks. It comes with a user_data property (dict),
        # which we can write arbitrary data into.

        # At the end of an episode, we'll transfer that data into the `hist_data`, and `custom_metrics`
        # properties to make sure our custom data is displayed in TensorBoard.

        # The episode is starting:
        # Wipe out the rewards-lists for individual agents 1 and 2.
        episode.user_data["agent1_rewards"] = []
        episode.user_data["agent2_rewards"] = []

    def on_episode_step(self,
                        *,
                        worker,
                        base_env,
                        episode: MultiAgentEpisode,
                        env_index,
                        **kwargs):
        # Get the last rewards for individual agents 1 and 2
        # from the MultiAgentEpisode object (`episode`).
        ag1_r = episode.prev_reward_for("agent1")
        ag2_r = episode.prev_reward_for("agent2")
        #print("ag1_r={} ag2_r={}".format(ag1_r, ag2_r))

        # Add individual rewards to our lists.
        episode.user_data["agent1_rewards"].append(ag1_r)
        episode.user_data["agent2_rewards"].append(ag2_r)

    def on_episode_end(self,
                       *,
                       worker,
                       base_env,
                       policies,
                       episode: MultiAgentEpisode,
                       env_index,
                       **kwargs):
        # Episode is done:
        # Write scalar values (sum over rewards) to `custom_metrics` and
        # time-series data (rewards per time step) to `hist_data`.
        # Both will be visible then in TensorBoard.

        # Put scalar values (one per episode) under `custom_metrics`.
        episode.custom_metrics["ag1_R"] = sum(episode.user_data["agent1_rewards"])
        episode.custom_metrics["ag2_R"] = sum(episode.user_data["agent2_rewards"])
        # Time series data goes into `hist_data`.
        episode.hist_data["agent1_rewards"] = episode.user_data["agent1_rewards"]
        episode.hist_data["agent2_rewards"] = episode.user_data["agent2_rewards"]

In [None]:
# Setting up our config to point to our new custom callbacks class:
config.update({
    "env": MultiAgentArena,  # force "reload"
    "callbacks": MyCallbacks,  # by default, this would point to `rllib.agents.callbacks.DefaultCallbacks`, which does nothing.
})

results = tune.run(
    "PPO",
    config=config,
    stop={"training_iteration": 20},
    checkpoint_at_end=True,
    # If you'd like to restore the tune run from an existing checkpoint file, you can do the following:
    #restore="/Users/sven/ray_results/PPO/PPO_MultiAgentArena_fd451_00000_0_2021-05-25_15-13-26/checkpoint_000010/checkpoint-10",
)

### Let's check tensorboard for the new custom metrics!

1. Head over to ~/ray_results/PPO/PPO_MultiAgentArena_[some key]_00000_0_[date]_[time]/
1. In that directory, you should see a `event.out....` file.
1. Run `tensorboard --logdir .` and head to https://localhost:6006

<img src="images/tensorboard.png" width=800>


## !Optional Hack! (Exercise No 3)

<hr />

Assume we would like to know exactly how much (new) ground agent1 
covers on average in an episode.
Write your own custom callback class (sub-class
ray.rllib.agents.callback::DefaultCallbacks) and override one or more methods
therein to collect the following data:
- The number of (unique) fields agent1 has covered in an episode.
- The number of times agent2 has blocked agent1.

Remember that you can get the last rewards for both agents via the `MultiAgentEpisode` object (`episode`) and its
`prev_reward_for([agent name])` method. From these, you should be able to infer, whether a collision happened or whether agent1
discovered a new field in the grid.

Run a simple experiment using tune.run (and your custom callbacks class)
and confirm the new metric shows up in tensorboard.

**Good luck! :)**

In [None]:
# Solution Exercise #3

import ray
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray import tune


class MyCallback(DefaultCallbacks):
    def on_episode_start(self, *, worker, base_env,
                         policies, episode,
                         env_index, **kwargs):
        # Set per-episode object to capture, which states (observations)
        # have been visited by agent1.
        episode.user_data["new_fields_discovered"] = 0
        # Set per-episode agent2-blocks counter (how many times has agent2 blocked agent1?).
        episode.user_data["num_collisions"] = 0

    def on_episode_step(self, *, worker, base_env,
                        episode, env_index, **kwargs):
        # Get both rewards.
        ag1_r = episode.prev_reward_for("agent1")
        ag2_r = episode.prev_reward_for("agent2")

        # Agent1 discovered a new field.
        if ag1_r == 1.0:
            episode.user_data["new_fields_discovered"] += 1
        # Collision.
        elif ag2_r == 1.0:
            episode.user_data["num_collisions"] += 1

    def on_episode_end(self, *, worker, base_env,
                       policies, episode,
                       env_index, **kwargs):
        # Store everything in `episode.custom_metrics`:
        episode.custom_metrics["new_fields_discovered"] = episode.user_data["new_fields_discovered"]
        episode.custom_metrics["num_collisions"] = episode.user_data["num_collisions"]


stop = {"training_iteration": 10}
# Specify env and custom callbacks in our config (leave everything else
# as-is (defaults)).
config = {
    "env": MultiAgentArena,
    "callbacks": MyCallback,
}

# Run for a few iterations.
tune.run("PPO", stop=stop, config=config)

# Check tensorboard.



## A closer look at RLlib's APIs and structure
### (Depending on time left and amount of questions having been accumulated :)

We already took a quick look inside an RLlib Trainer object and extracted its Policy(ies) and the Policy's model (neural network). Here is a much more detailed overview of what's inside a Trainer object.

At the core is the so-called `WorkerSet` sitting under `Trainer.workers`. A WorkerSet is a group of `RolloutWorker` (`rllib.evaluation.rollout_worker.py`) objects that always consists of a "local worker" (`Trainer.workers.local_worker()`) and n "remote workers" (`Trainer.workers.remote_workers()`).



<img src="images/rllib_structure.png" width=1000>

### Scaling RLlib

Scaling RLlib works by parallelizing the "jobs" that the remote `RolloutWorkers` do. In a vanilla RL algorithm, like PPO, DQN, and many others, the `@ray.remote` labeled RolloutWorkers in the figure above are responsible for interacting with one or more environments and thereby collecting experiences. Observations are produced by the environment, actions are then computed by the Policy(ies) copy located on the remote worker and sent to the environment in order to produce yet another observation. This cycle is repeated endlessly and only sometimes interrupted to send experience batches ("train batches") of a certain size to the "local worker". There these batches are used to call `Policy.learn_on_batch()`, which performs a loss calculation, followed by a model weights update, and a subsequent weights broadcast back to all the remote workers.



## Thank you for listening and participating!

TODO: add links here!!

github, slack, documents(!)
- examples script folder
- blog posts! (Unity environments)

## Time for Q&A :)