# Hands-on RL with Ray’s RLlib
## A beginner’s tutorial for working with multi-agent environments, models, and algorithms

<img src="images/pitfall.jpg" width=250> <img src="images/tesla.jpg" width=254> <img src="images/forklifts.jpg" width=169> <img src="images/robots.jpg" width=252> <img src="images/dota2.jpg" width=213>

### Overview
“Hands-on RL with Ray’s RLlib” is a beginner’s tutorial for working with reinforcement learning (RL) environments, models, and algorithms using Ray’s RLlib library. RLlib offers high scalability, a large list of algorithms to choose from (offline, model-based, model-free, etc.), support for TensorFlow and PyTorch, and a unified API for a variety of applications. This tutorial includes a brief introduction to provide an overview of concepts (e.g. why RL) before proceeding to RLlib (multi- and single-agent) environments, neural network models, hyperparameter tuning, debugging, student exercises, Q/A, and more. All code will be provided as .py files in a GitHub repo.

### Intended Audience
* Python programmers who want to get started with reinforcement learning and RLlib.

### Prerequisites
* Some Python programming experience.
* Some familiarity with machine learning.
* *Helpful, but not required:* Experience in reinforcement learning and Ray.
* *Helpful, but not required:* Experience with TensorFlow or PyTorch.

### Requirements/Dependencies

Install conda (https://www.anaconda.com/products/individual), then follow the quick setup instructions for your operating system.

#### Quick `conda` setup instructions (Linux):
```
$ conda create -n rllib python=3.8
$ conda activate rllib
$ pip install "ray[rllib]"
$ pip install tensorflow  # <- either one works!
$ pip install torch  # <- either one works!
$ pip install jupyterlab
```

#### Quick `conda` setup instructions (Mac):
```
$ conda create -n rllib python=3.8
$ conda activate rllib
$ pip install cmake "ray[rllib]"
$ pip install tensorflow  # <- either one works!
$ pip install torch  # <- either one works!
$ pip install jupyterlab
```

#### Quick `conda` setup instructions (Win10):
```
$ conda create -n rllib python=3.8
$ conda activate rllib
$ pip install "ray[rllib]"
$ pip install tensorflow  # <- either one works!
$ pip install torch  # <- either one works!
$ pip install jupyterlab
$ conda install pywin32
```

Also, for Win10 Atari support, we have to install atari_py from a different source (gym does not support Atari envs on Windows).

```
$ pip install git+https://github.com/Kojoley/atari-py.git
```

### Opening these tutorial files:
```
$ git clone https://github.com/sven1977/rllib_tutorials
$ cd rllib_tutorials
$ jupyter-lab
```

### Key Takeaways
* What is reinforcement learning and why RLlib?
* Core concepts of RLlib: Environments, Trainers, Policies, and Models.
* How to configure, hyperparameter-tune, and parallelize RLlib.
* RLlib debugging best practices.

### Tutorial Outline
1. RL and RLlib in a nutshell.
1. Defining an RL-solvable problem: Our first environment.
1. **Exercise No.1**: Environment loop.

(15min break)

1. Picking an algorithm and training our first RLlib Trainer.
1. Configurations and hyperparameters - Easy tuning with Ray Tune.
1. Fixing our experiment's config - Going multi-agent.
1. The "infinite laptop": Quick intro into how to use RLlib with Anyscale's product.
1. **Exercise No.2**: Run your own Ray RLlib+Tune experiment.
1. Neural network models - Provide your custom models using tf.keras or torch.nn.

(15min break)

1. Deeper dive into RLlib's parallelization architecture.
1. Specifying different compute resources and parallelization options through our config.
1. "Hacking in": Using callbacks to customize the RL loop and generate our own metrics.
1. **Exercise No.3**: Write your own custom callback.
1. "Hacking in (part II)" - Debugging with RLlib and PyCharm.
1. Checking on the "infinite laptop" - Did RLlib learn to solve the problem?

### Other Recommended Readings
* [Attention Nets and More with RLlib's Trajectory View API](https://medium.com/distributed-computing-with-ray/attention-nets-and-more-with-rllibs-trajectory-view-api-d326339a6e65)
* [Intro to RLlib: Example Environments](https://medium.com/distributed-computing-with-ray/intro-to-rllib-example-environments-3a113f532c70)
* [Reinforcement Learning with RLlib in the Unity Game Engine](https://medium.com/distributed-computing-with-ray/reinforcement-learning-with-rllib-in-the-unity-game-engine-1a98080a7c0d)


<img src="images/rl-cycle.png" width=800>

### Coding/defining our "problem" via an RL environment.

We will use the following (adversarial) multi-agent environment
throughout this tutorial to demonstrate a large fraction of RLlib's
APIs, features, and customization options.

<img src="images/environment.png" width=800>

### A word or two on Spaces:

Spaces are used in ML to describe what possible/valid values inputs and outputs of a neural network can have.

RL environments also use them to describe what their valid observations and actions are.

Spaces are usually defined by their shape (e.g. 84x84x3 RGB images) and datatype (e.g. uint8 for RGB values between 0 and 255).
However, spaces could also be composed of other spaces (see Tuple or Dict spaces) or could be simply discrete with n fixed possible values
(represented by integers). For example, in our game, where each agent can only go up/down/left/right, the action space would be "Discrete(4)"
(no datatype, no shape needs to be defined here).
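To make "valid values" concrete, here is a minimal plain-Python sketch of what membership in a discrete space means. The helpers `discrete_contains` and `multi_discrete_contains` are hypothetical stand-ins written for this illustration; they are not gym's implementation, only an approximation of the idea behind `Discrete(n)` and `MultiDiscrete([...])`.

```python
# Plain-Python stand-ins for two space types (illustration only, NOT gym's code).

def discrete_contains(n, value):
    """True if `value` is valid for a Discrete(n)-like space: an int in [0, n)."""
    return isinstance(value, int) and 0 <= value < n


def multi_discrete_contains(nvec, values):
    """True if `values` holds one valid entry per sub-space of a MultiDiscrete(nvec)-like space."""
    return len(values) == len(nvec) and all(
        discrete_contains(n, v) for n, v in zip(nvec, values)
    )


# Action space of our game: 4 possible moves (0=up, 1=right, 2=down, 3=left).
print(discrete_contains(4, 3))  # 3 is a valid action
print(discrete_contains(4, 4))  # 4 is out of range

# Two field indices on a 10x10 grid, each in [0, 100).
print(multi_discrete_contains([100, 100], [0, 99]))
print(multi_discrete_contains([100, 100], [0, 100]))
```

A space, in this reading, is just a compact description that lets both the environment and the model agree on which values are legal.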

<img src="images/spaces.png" width=800>

In [5]:
# Let's code (parts of) our multi-agent environment.

import gym
from gym.spaces import Discrete, MultiDiscrete
import numpy as np
import random

from ray.rllib.env.multi_agent_env import MultiAgentEnv


class MultiAgentArena(MultiAgentEnv):
    def __init__(self, config=None):
        config = config or {}
        # Dimensions of the grid.
        self.width = config.get("width", 10)
        self.height = config.get("height", 10)

        # End an episode after this many timesteps.
        self.timestep_limit = config.get("ts", 100)

        self.observation_space = MultiDiscrete([self.width * self.height,
                                                self.width * self.height])
        # 0=up, 1=right, 2=down, 3=left.
        self.action_space = Discrete(4)

        # Reset env.
        self.reset()
        
    def reset(self):
        """Returns initial observation of next(!) episode."""
        # Row-major coords!
        self.agent1_pos = [0, 0]
        self.agent2_pos = [self.height - 1, self.width - 1]

        # Accumulated rewards in this episode.
        self.agent1_R = 0.0
        self.agent2_R = 0.0

        # Reset agent1's visited fields.
        self.agent1_visited_fields = set([tuple(self.agent1_pos)])

        # How many timesteps have we done in this episode.
        self.timesteps = 0

        # No collision has happened yet (read by `render()`).
        self.collision = False

        # Return the initial observation in the new episode.
        return self._get_obs()

    def step(self, action: dict):
        """Returns (next observation, rewards, dones, infos) after having taken the given action."""
        # increase our time steps counter by 1.
        self.timesteps += 1
        # An episode is "done" when we reach the time step limit.
        is_done = self.timesteps >= self.timestep_limit

        # Determine, who is allowed to move first (50:50).
        if random.random() > 0.5:
            # events = [collision|new_field]
            events = self._move(self.agent1_pos, action["agent1"], is_agent1=True)
            events |= self._move(self.agent2_pos, action["agent2"], is_agent1=False)
        else:
            events = self._move(self.agent2_pos, action["agent2"], is_agent1=False)
            events |= self._move(self.agent1_pos, action["agent1"], is_agent1=True)

        # Useful for rendering.
        self.collision = "collision" in events
            
        # Get observations (based on new agent positions).
        obs = self._get_obs()

        # Determine rewards based on the collected events:
        r1 = -1.0 if "collision" in events else 1.0 if "new_field" in events else -0.5
        r2 = 1.0 if "collision" in events else -0.1

        self.agent1_R += r1
        self.agent2_R += r2
        
        rewards = {
            "agent1": r1,
            "agent2": r2,
        }

        # Generate a `done` dict (per-agent and total).
        dones = {
            "agent1": is_done,
            "agent2": is_done,
            # special `__all__` key indicates that the episode is done for all agents.
            "__all__": is_done,
        }

        return obs, rewards, dones, {}  # <- info dict (not needed here).

    def _get_obs(self):
        """
        Returns obs dict (agent name to array of two discrete positions:
        the agent's own position first, then the other agent's) using each
        agent's current row/column position.
        """
        ag1_discrete_pos = self.agent1_pos[0] * self.width + \
            (self.agent1_pos[1] % self.width)
        ag2_discrete_pos = self.agent2_pos[0] * self.width + \
            (self.agent2_pos[1] % self.width)
        return {
            "agent1": np.array([ag1_discrete_pos, ag2_discrete_pos]),
            "agent2": np.array([ag2_discrete_pos, ag1_discrete_pos]),
        }

    def _move(self, coords, action, is_agent1):
        """
        Moves an agent (agent1 iff is_agent1=True, else agent2) from `coords` (row/column)
        using the given action (0=up, 1=right, 2=down, 3=left) and returns the resulting
        set of events:
        "collision": The move was blocked because the target field is occupied by the
            other agent (agent1 then gets -1.0, agent2 gets +1.0).
        "new_field": Agent1 entered a field it had not visited before in this episode.
        """
        orig_coords = coords[:]
        # Change the row: 0=up (-1), 2=down (+1)
        coords[0] += -1 if action == 0 else 1 if action == 2 else 0
        # Change the column: 1=right (+1), 3=left (-1)
        coords[1] += 1 if action == 1 else -1 if action == 3 else 0

        # Solve collisions.
        # Make sure, we don't end up on the other agent's position.
        # If yes, don't move (we are blocked).
        if (is_agent1 and coords == self.agent2_pos) or (not is_agent1 and coords == self.agent1_pos):
            coords[0], coords[1] = orig_coords
            # Agent2 blocked agent1 (agent1 tried to run into agent2)
            # OR Agent2 bumped into agent1 (agent2 tried to run into agent1)
            return {"collision"}

        # No agent blocking -> check walls.
        if coords[0] < 0:
            coords[0] = 0
        elif coords[0] >= self.height:
            coords[0] = self.height - 1
        if coords[1] < 0:
            coords[1] = 0
        elif coords[1] >= self.width:
            coords[1] = self.width - 1

        # If agent1 -> "new_field" if a new tile was covered.
        if is_agent1 and tuple(coords) not in self.agent1_visited_fields:
            self.agent1_visited_fields.add(tuple(coords))
            return {"new_field"}
        # No new tile for agent1.
        return set()

    def render(self, mode=None):
        print("_" * (self.width + 2))
        for r in range(self.height):
            print("|", end="")
            for c in range(self.width):
                if self.agent1_pos == [r, c]:
                    print("1", end="")
                elif self.agent2_pos == [r, c]:
                    print("2", end="")
                elif (r, c) in self.agent1_visited_fields:
                    print(".", end="")
                else:
                    print(" ", end="")
            print("|")
        print("‾" * (self.width + 2))
        print(f"{'!!Collision!!' if self.collision else ''}")
        print(f"R1={self.agent1_R}")
        print(f"R2={self.agent2_R}")
        print()


dummy_env = MultiAgentArena()

obs = dummy_env.reset()

# Agent1 will move down, Agent2 moves up.
obs, rewards, dones, infos = dummy_env.step(action={"agent1": 2, "agent2": 0})

dummy_env.render()

print("Agent1's x/y position={}".format(dummy_env.agent1_pos))
print("Agent2's x/y position={}".format(dummy_env.agent2_pos))
print("Env timesteps={}".format(dummy_env.timesteps))

#TODO: merge exercise 2 and long tune learning run somehow

____________
|.         |
|1         |
|          |
|          |
|          |
|          |
|          |
|          |
|         2|
|          |
‾‾‾‾‾‾‾‾‾‾‾‾

R1=1.0
R2=-0.1

Agent1's x/y position=[1, 0]
Agent2's x/y position=[8, 9]
Env timesteps=1
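The observation encoding used by `_get_obs` can be looked at in isolation. The sketch below assumes the default 10x10 grid; the helper names `to_field_index` and `from_field_index` are made up for this illustration and are not part of the environment class.

```python
WIDTH = 10  # default grid width used by MultiAgentArena


def to_field_index(row, col, width=WIDTH):
    """Row-major encoding: one integer per grid field."""
    return row * width + col


def from_field_index(index, width=WIDTH):
    """Inverse of `to_field_index`: returns (row, col)."""
    return divmod(index, width)


# Start positions after `reset()`: agent1 top-left, agent2 bottom-right.
print(to_field_index(0, 0))  # 0
print(to_field_index(9, 9))  # 99

# Positions after the single step shown above.
print(to_field_index(1, 0))  # 10
print(to_field_index(8, 9))  # 89
print(from_field_index(89))  # (8, 9)
```

With this encoding, each agent's observation is a pair of such indices: its own field first, then the other agent's field.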


## Exercise No 1

<hr />

<img src="images/exercise1.png" width=400>

In the cell above, we performed a `reset()` and a single `step()` call. To walk through an entire episode, one would normally call `step()` repeatedly (with different actions) until the returned `dones` dict has the "agent1" or "agent2" (or "__all__") key set to True. Your task is to write an "environment loop" that runs for exactly one episode using our `MultiAgentArena` class.

Follow these steps to get this done:

1. Create an env object.
1. `reset` your environment to get the first (initial) observation.
1. Compute the actions for "agent1" and "agent2" calling `DummyTrainer.compute_action([obs])` twice and putting the results into an action dict to be passed into `step()`, just like it's done in the above cell (where we do a single `step()`).
1. Repeat this, `step`ing through an entire episode.
1. When an episode is done, `step()` will return a done dict with key `__all__` set to True.
1. If you feel this is way too easy for you ;) , try to extract each agent's reward, sum it up over the episode and - at the end of the episode - print out each agent's accumulated reward (also called the "return" of an episode).

**Good luck! :)**


In [6]:
class DummyTrainer:
    """Dummy Trainer class used in Exercise #1.

    Use its `compute_action` method to get a new action for one of the agents,
    given the agent's observation (two discrete values encoding the fields
    that the agent itself and the other agent are currently in).
    """

    def compute_action(self, single_agent_obs=None):
        # Returns a random action for a single agent.
        return np.random.randint(4)  # Discrete(4) -> return rand int between 0 and 3 (incl. 3).

dummy_trainer = DummyTrainer()
# Check, whether it's working.
for _ in range(3):
    # Get action for agent1 (providing agent1's and agent2's positions).
    print("action_agent1={}".format(dummy_trainer.compute_action(np.array([0, 99]))))

    # Get action for agent2 (providing agent2's and agent1's positions).
    print("action_agent2={}".format(dummy_trainer.compute_action(np.array([99, 0]))))

    print()

action_agent1=2
action_agent2=3

action_agent1=2
action_agent2=0

action_agent1=1
action_agent2=3



Write your solution code into this cell here:

In [12]:
# !LIVE CODING!

# Leave the following as-is. It'll help us with rendering the env in this very cell's output.
import time
from ipywidgets import Output
from IPython import display
out = Output()
# Show the output widget, so that everything printed inside `with out:` becomes visible.
display.display(out)

with out:

    # Solution to Exercise #1:
    # Start coding here inside this `with`-block:
    # ...
    env = MultiAgentArena()
    obs = env.reset()

    while True:
        # Compute actions separately for each agent.
        a1 = dummy_trainer.compute_action(obs["agent1"])
        a2 = dummy_trainer.compute_action(obs["agent2"])

        # Send the action-dict to the env.
        obs, rewards, dones, _ = env.step({"agent1": a1, "agent2": a2})

        if dones["agent1"]:
            break

        # Get a rendered image from the env.
        time.sleep(0.1)
        display.clear_output(wait=True)
        env.render()


____________
|..        |
|...       |
|.....  2  |
|.....     |
|.....     |
|....      |
| ....     |
| ..1.     |
|   .      |
|          |
‾‾‾‾‾‾‾‾‾‾‾‾

R1=-2.5
R2=-7.6999999999999815
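For the optional last step of the exercise (summing up each agent's rewards into a per-episode "return"), one possible shape of the loop is sketched below. To keep the snippet runnable on its own, it uses a hypothetical `StubTwoAgentEnv` with fixed rewards in place of `MultiAgentArena`; only the `reset()`/`step()` interface is meant to match.

```python
import random


class StubTwoAgentEnv:
    """Hypothetical stand-in env with the same reset/step interface as MultiAgentArena."""

    def __init__(self, timestep_limit=5):
        self.timestep_limit = timestep_limit
        self.timesteps = 0

    def reset(self):
        self.timesteps = 0
        return {"agent1": 0, "agent2": 0}

    def step(self, action):
        self.timesteps += 1
        done = self.timesteps >= self.timestep_limit
        obs = {"agent1": self.timesteps, "agent2": self.timesteps}
        rewards = {"agent1": 1.0, "agent2": -0.1}  # fixed rewards, for illustration only
        dones = {"agent1": done, "agent2": done, "__all__": done}
        return obs, rewards, dones, {}


def run_episode(env, compute_action):
    """Steps through one episode and returns each agent's accumulated reward."""
    obs = env.reset()
    returns = {agent_id: 0.0 for agent_id in obs}
    while True:
        # One action per agent, computed from that agent's own observation.
        actions = {agent_id: compute_action(agent_obs) for agent_id, agent_obs in obs.items()}
        obs, rewards, dones, _ = env.step(actions)
        for agent_id, reward in rewards.items():
            returns[agent_id] += reward
        if dones["__all__"]:
            return returns


returns = run_episode(StubTwoAgentEnv(), lambda single_agent_obs: random.randrange(4))
print(returns)
```

Swapping `StubTwoAgentEnv()` for `MultiAgentArena()` and the lambda for `dummy_trainer.compute_action` should give the per-agent returns of a real episode.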



------------------
## 15 min break :)
------------------

### And now for something completely different:
#### Plugging in RLlib!

In [13]:
import numpy as np
import pprint
import ray

# Start a new instance of Ray (when running this tutorial locally) or
# connect to an already running one (when running this tutorial through Anyscale).

ray.init()  # Hear the engine humming? ;)

# In case you encounter the following error during our tutorial: `RuntimeError: Maybe you called ray.init twice by accident?`
# Try: `ray.shutdown() + ray.init()` or `ray.init(ignore_reinit_error=True)`

2021-06-22 11:04:09,211	INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265


{'node_ip_address': '192.168.0.179',
 'raylet_ip_address': '192.168.0.179',
 'redis_address': '192.168.0.179:6379',
 'object_store_address': '/tmp/ray/session_2021-06-22_11-04-07_432956_57764/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-06-22_11-04-07_432956_57764/sockets/raylet',
 'webui_url': '127.0.0.1:8265',
 'session_dir': '/tmp/ray/session_2021-06-22_11-04-07_432956_57764',
 'metrics_export_port': 63393,
 'node_id': '5c9016c977d25fe796393275228fddf56a0d2ccd63ce1aad66e683ea'}

### Picking an RLlib algorithm - We'll use PPO throughout this tutorial (one-size-fits-all-kind-of-algo)

<img src="images/rllib_algos.png" width=800>

https://docs.ray.io/en/master/rllib-algorithms.html#available-algorithms-overview

In [14]:
# Import a Trainable (one of RLlib's built-in algorithms):
# We use the PPO algorithm here because it is very flexible with respect to its
# supported action spaces and model types and because it learns well on almost any problem.
from ray.rllib.agents.ppo import PPOTrainer

# Specify a very simple config, defining our environment and some environment
# options (see environment.py).
config = {
    # Alternatively, use a string such as "my_env" here if the env was previously registered
    # with `tune.register_env("[name]", lambda config: [returns env object])`.
    "env": MultiAgentArena,
    # These keys are read directly by `MultiAgentArena.__init__` (via `config.get(...)`).
    "env_config": {
        "width": 10,
        "height": 10,
        "ts": 100,
    },

    # !PyTorch users!
    #"framework": "torch",  # If users have chosen to install torch instead of tf.

    "create_env_on_driver": True,
}
# Instantiate the Trainer object using above config.
rllib_trainer = PPOTrainer(config=config)
rllib_trainer

2021-06-22 11:04:11,526	INFO trainer.py:671 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
2021-06-22 11:04:11,529	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


PPO
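RLlib hands the `env_config` dict to the environment's constructor, and `MultiAgentArena.__init__` reads its options with `config.get(key, default)`. A misplaced or misspelled key therefore does not raise an error; the default is silently used instead. The sketch below mimics only that lookup in a hypothetical helper, `read_arena_options`, to show the effect.

```python
def read_arena_options(config=None):
    """Mimics how MultiAgentArena.__init__ reads its options (illustration only)."""
    config = config or {}
    return {
        "width": config.get("width", 10),
        "height": config.get("height", 10),
        "timestep_limit": config.get("ts", 100),
    }


# Keys at the top level are picked up.
flat = {"width": 5, "height": 4, "ts": 20}
print(read_arena_options(flat))

# Keys nested one level too deep are ignored: all defaults are used.
nested = {"config": {"width": 5, "height": 4, "ts": 20}}
print(read_arena_options(nested))
```

Printing the env's `width`, `height`, and `timestep_limit` once after construction is a cheap way to confirm that the intended settings actually arrived.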

### Ready to train with RLlib's PPO algorithm

That's it, we are ready to train.
Calling `Trainer.train()` will execute a single "training iteration".

One iteration for most algos involves:

1) sampling from the environment(s)
2) using the sampled data (observations, actions taken, rewards) to update the policy model (neural network), such that it would pick better actions in the future, leading to higher rewards.

Let's try it out:

In [15]:
results = rllib_trainer.train()

# Delete the config from the results for clarity.
# Only the stats will remain, then.
del results["config"]
# Pretty print the stats.
pprint.pprint(results)

{'agent_timesteps_total': 4000,
 'custom_metrics': {},
 'date': '2021-06-22_11-05-14',
 'done': False,
 'episode_len_mean': 100.0,
 'episode_media': {},
 'episode_reward_max': 11.100000000000016,
 'episode_reward_mean': -8.4,
 'episode_reward_min': -28.800000000000054,
 'episodes_this_iter': 20,
 'episodes_total': 20,
 'experiment_id': '349d430c5b2d411c9baec1a1d03ead03',
 'hist_stats': {'episode_lengths': [100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
           

### Going from single policy (RLlib's default) to multi-policy:

So far, our experiment has been ill-configured, because both
agents, which should behave differently due to their different
tasks and reward functions, learn the same policy: the "default_policy",
which RLlib always provides if you don't configure anything else.
Remember that RLlib does not know at Trainer setup time, how many and which agents
the environment will "produce". Agent control (adding agents, removing them, terminating
episodes for agents) is entirely in the Env's hands.
Let's fix our single policy problem and introduce the "multiagent" API.

<img src="images/from_single_agent_to_multi_agent.png" width=800>

In order to turn on RLlib's multi-agent functionality, we need two things:

1. A policy mapping function, mapping agent IDs (e.g. a string like "agent1", produced by the environment in the returned observation/rewards/dones-dicts) to a policy ID (another string, e.g. "policy1", which is under our control).
1. A policies definition dict, mapping policy IDs (e.g. "policy1") to 4-tuples consisting of 1) policy class (None for using the default class), 2) observation space, 3) action space, and 4) config overrides (empty dict for no overrides and using the Trainer's main config dict).

Let's take a closer look:

In [16]:
# Define an agent->policy mapping function.
# Which agents (defined by the environment) use which policies (defined by us)?
# The mapping here is M (agents) -> N (policies), where M >= N.
def policy_mapping_fn(agent: str):
    # Make sure agent ID is valid.
    assert agent in ["agent1", "agent2"], f"ERROR: invalid agent {agent}!"
    # Map agent1 to policy1, and agent2 to policy2.
    return "policy1" if agent == "agent1" else "policy2"

# Get the spaces for our two policies from our already existing env object:
observation_space = dummy_env.observation_space
action_space = dummy_env.action_space

# Define the policies definition dict:
# Each policy in there is defined by its ID (key) mapping to a 4-tuple (value):
# - Policy class (None for using the "default" class, e.g. PPOTFPolicy for PPO+tf or PPOTorchPolicy for PPO+torch).
# - obs-space (we get this directly from our already created env object).
# - act-space (we get this directly from our already created env object).
# - config-overrides dict (leave empty for using the Trainer's config as-is)
policies = {
    "policy1": (None, observation_space, action_space, {"lr": 0.0002}),
    "policy2": (None, observation_space, action_space, {}),
}
# Note that now we won't have a "default_policy" anymore, just "policy1" and "policy2".

# We could - if we wanted - specify, which policies should be learnt (by default, RLlib learns all).
# Non-learnt policies will be frozen and not updated:
# policies_to_train = ["policy1", "policy2"]

# Adding the above to our config.
config.update({
    "multiagent": {
        "policies": policies,
        "policy_mapping_fn": policy_mapping_fn,
        # We'll leave this empty: Means, we train both policy1 and policy2.
        # "policies_to_train": policies_to_train,
    },
})
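The "M agents -> N policies" remark in the cell above becomes clearer with more than two agents. The following sketch uses made-up agent IDs (`explorer_0`, `chaser_1`, ...) and made-up policy IDs; none of them exist in our `MultiAgentArena`, they only illustrate how several agents can share one policy.

```python
def shared_policy_mapping_fn(agent_id: str) -> str:
    """Maps hypothetical agent IDs to one of two shared policies, based on their prefix."""
    if agent_id.startswith("explorer_"):
        return "explorer_policy"
    return "chaser_policy"


# Five agents (M=5) ...
agent_ids = [f"explorer_{i}" for i in range(3)] + [f"chaser_{i}" for i in range(2)]
mapping = {agent_id: shared_policy_mapping_fn(agent_id) for agent_id in agent_ids}

# ... share two policies (N=2).
for agent_id, policy_id in mapping.items():
    print(f"{agent_id} -> {policy_id}")
print(f"M={len(mapping)} agents, N={len(set(mapping.values()))} policies")
```

Agents that map to the same policy ID contribute their experiences to the same neural network, which is often what you want for agents with identical roles.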

In [17]:
# Recreate our Trainer (we cannot just change the config on-the-fly).
rllib_trainer.stop()

# Using our updated (now multiagent!) config dict.
rllib_trainer = PPOTrainer(config=config)

2021-06-22 11:08:03,805	INFO trainable.py:101 -- Trainable.setup took 12.909 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.


Now that we are set up correctly with two policies as per our "multiagent" config, let's call `train()` on the new Trainer several times (how about 10 times?).

In [18]:
# Run `train()` n times. Repeatedly call `train()` now to see rewards increase.
# Move on once you see (agent1 + agent2) episode rewards of 10.0 or more.
for _ in range(10):
    results = rllib_trainer.train()
    print(f"Iteration={rllib_trainer.iteration}: R(\"return\")={results['episode_reward_mean']}")

Iteration=1: R("return")=-10.627500000000001
Iteration=2: R("return")=-7.829999999999996
Iteration=3: R("return")=-4.484999999999991
Iteration=4: R("return")=-1.3409999999999898
Iteration=5: R("return")=-0.47099999999998965
Iteration=6: R("return")=1.2780000000000067
Iteration=7: R("return")=1.914000000000008
Iteration=8: R("return")=2.1180000000000097
Iteration=9: R("return")=1.515000000000009
Iteration=10: R("return")=2.3670000000000093


In [19]:
# Do another loop, but this time, we will print out each policy's individual rewards.
for _ in range(10):
    results = rllib_trainer.train()
    print(f"Iteration={rllib_trainer.iteration}: R1(\"return\")={results['policy_reward_mean']['policy1']} R2(\"return\")={results['policy_reward_mean']['policy2']}")

Iteration=11: R1("return")=9.105 R2("return")=-7.766999999999984
Iteration=12: R1("return")=11.48 R2("return")=-7.6129999999999844
Iteration=13: R1("return")=12.625 R2("return")=-7.557999999999988
Iteration=14: R1("return")=13.985 R2("return")=-7.3159999999999865
Iteration=15: R1("return")=14.585 R2("return")=-7.249999999999987
Iteration=16: R1("return")=15.135 R2("return")=-7.106999999999987
Iteration=17: R1("return")=14.49 R2("return")=-6.479999999999987
Iteration=18: R1("return")=16.03 R2("return")=-6.468999999999989
Iteration=19: R1("return")=16.77 R2("return")=-6.182999999999988
Iteration=20: R1("return")=18.64 R2("return")=-6.435999999999989


#### !OPTIONAL HACK! (<-- we will not do these during the tutorial, but feel free to try these cells by yourself)

Use the above solution of Exercise #1 and replace our `dummy_trainer` in that solution
with the now trained `rllib_trainer`. You should see a better performance of the two agents.

However, keep in mind that we are mostly training agent1 as we only train a single policy and agent1
is the "easier" one to collect high rewards with.

#### !OPTIONAL HACK!

Feel free to play around with the following code in order to learn how RLlib - under the hood - calculates actions from the environment's observations using Policies and their model(s) inside our Trainer object:

In [23]:
# Let's actually "look inside" our Trainer to see what's in there.
from ray.rllib.utils.numpy import softmax

# To get to one of the policies inside the Trainer, use `Trainer.get_policy([policy ID])`:
policy = rllib_trainer.get_policy("policy1")
print(f"Our (only!) Policy right now is: {policy}")

# To get to the model inside any policy, do:
model = policy.model
#print(f"Our Policy's model is: {model}")

# Print out the policy's action and observation spaces.
print(f"Our Policy's observation space is: {policy.observation_space}")
print(f"Our Policy's action space is: {policy.action_space}")

# Produce a random observation (B=1; batch of size 1).
obs = np.array([policy.observation_space.sample()])
# Alternatively for PyTorch:
#import torch
#obs = torch.from_numpy(obs)

# Get the action logits (as tf tensor).
# If you are using torch, you would get a torch tensor here.
logits, _ = model({"obs": obs})
logits

# Numpyize the tensor by running `logits` through the Policy's own tf.Session.
logits_np = policy.get_session().run(logits)
# For torch, you can simply do: `logits_np = logits.detach().cpu().numpy()`.

# Convert logits into action probabilities and remove the B=1.
action_probs = np.squeeze(softmax(logits_np))

# Sample an action, using the probabilities.
action = np.random.choice([0, 1, 2, 3], p=action_probs)

# Print out the action.
print(f"sampled action={action}")

Our (only!) Policy right now is: <ray.rllib.policy.tf_policy_template.PPOTFPolicy object at 0x7f81ec1eda00>
Our Policy's observation space is: Box(-1.0, 1.0, (200,), float32)
Our Policy's action space is: Discrete(4)
sampled action=3
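The last two steps of the cell above (logits to probabilities, then sampling) do not depend on TensorFlow or PyTorch. Here is a framework-free sketch using only the standard library; the logit values are made up for illustration and do not come from our trained model.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for the 4 actions (0=up, 1=right, 2=down, 3=left).
logits = [0.5, -1.0, 2.0, 0.0]
action_probs = softmax(logits)
print(f"action probabilities={action_probs}")

# Sample one action according to these probabilities.
action = random.choices(range(4), weights=action_probs, k=1)[0]
print(f"sampled action={action}")
```

Sampling (instead of always picking the most likely action) is what keeps the policy exploring during training.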


### Saving and restoring a trained Trainer.
Currently, `rllib_trainer` is in an already trained state.
It holds optimized weights in its Policy's model that allow it to act
already somewhat smart in our environment when given an observation.

However, if we closed this notebook right now, all the effort would have been for nothing.
Let's therefore save the state of our trainer to disk for later!

In [25]:
# We use the `Trainer.save()` method to create a checkpoint.
checkpoint_file = rllib_trainer.save()
print(f"Trainer (at iteration {rllib_trainer.iteration}) was saved in '{checkpoint_file}'!")

# Here is what a checkpoint directory contains:
print("The checkpoint directory contains the following files:")
import os
os.listdir(os.path.dirname(checkpoint_file))

Trainer (at iteration 20) was saved in '/Users/sven/ray_results/PPO_MultiAgentArena_2021-06-22_11-07-50va2w_ji5/checkpoint_000020/checkpoint-20'!
The checkpoint directory contains the following files:


['checkpoint-20', 'checkpoint-20.tune_metadata', '.is_checkpoint']

### Restoring and evaluating a Trainer
In the following cell, we'll learn how to restore a saved Trainer from a checkpoint file.

We'll also evaluate a completely new Trainer (should act more or less randomly) vs an already trained one (the one we just restored from the created checkpoint file).

In [22]:
# Pretend we wanted to pick up training from a previous run:
new_trainer = PPOTrainer(config=config)
# Evaluate the new trainer (this should yield random results).
results = new_trainer.evaluate()
print(f"Evaluating new trainer: R={results['evaluation']['episode_reward_mean']}")

# Restoring the trained state into the `new_trainer` object.
print(f"Before restoring: Trainer is at iteration={new_trainer.iteration}")
new_trainer.restore(checkpoint_file)
print(f"After restoring: Trainer is at iteration={new_trainer.iteration}")

# Evaluate again (this should yield results we saw after having trained our saved agent).
results = new_trainer.evaluate()
print(f"Evaluating restored trainer: R={results['evaluation']['episode_reward_mean']}")

2021-06-22 11:10:29,827	INFO trainable.py:101 -- Trainable.setup took 12.382 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.


Evaluating new trainer: R=-10.605000000000011
Before restoring: Trainer is at iteration=0


2021-06-22 11:10:32,847	INFO trainable.py:377 -- Restored on 192.168.0.179 from checkpoint: /Users/sven/ray_results/PPO_MultiAgentArena_2021-06-22_11-07-50va2w_ji5/checkpoint_000020/checkpoint-20
2021-06-22 11:10:32,847	INFO trainable.py:385 -- Current state after restoring: {'_iteration': 20, '_timesteps_total': None, '_time_total': 132.84255385398865, '_episodes_total': 800}


After restoring: Trainer is at iteration=20
Evaluating restored trainer: R=11.759999999999962


In order to release all resources from a Trainer, you can use a Trainer's `stop()` method.
You should definitely run this cell as it frees resources that we'll need later in this tutorial, when we'll do parallel hyperparameter sweeps.

In [26]:
rllib_trainer.stop()
new_trainer.stop()

### Moving stuff to the professional level: RLlib in connection w/ Ray Tune

Running any experiments through Ray Tune is the recommended way of doing things with RLlib. If you look at our
<a href="https://github.com/ray-project/ray/tree/master/rllib/examples">examples scripts folder</a>, you will see that almost all of the scripts use Ray Tune to run the particular RLlib workload demonstrated in each script.

<img src="images/rllib_and_tune.png" width=400>

When setting up hyperparameter sweeps for Tune, we'll do this in our already familiar config dict.

So let's take a quick look at our PPO algo's default config to understand which hyperparameters we may want to play around with:

In [27]:
# Configuration dicts and Ray Tune.
# Where are the default configuration dicts stored?

# PPO algorithm:
from ray.rllib.agents.ppo import DEFAULT_CONFIG as PPO_DEFAULT_CONFIG
print(f"PPO's default config is:")
pprint.pprint(PPO_DEFAULT_CONFIG)

# DQN algorithm:
#from ray.rllib.agents.dqn import DEFAULT_CONFIG as DQN_DEFAULT_CONFIG
#print(f"DQN's default config is:")
#pprint.pprint(DQN_DEFAULT_CONFIG)

# Common (all algorithms).
#from ray.rllib.agents.trainer import COMMON_CONFIG
#print(f"RLlib Trainer's default config is:")
#pprint.pprint(COMMON_CONFIG)

PPO's default config is:
{'_fake_gpus': False,
 'batch_mode': 'truncate_episodes',
 'callbacks': <class 'ray.rllib.agents.callbacks.DefaultCallbacks'>,
 'clip_actions': True,
 'clip_param': 0.3,
 'clip_rewards': None,
 'collect_metrics_timeout': 180,
 'compress_observations': False,
 'create_env_on_driver': False,
 'custom_eval_function': None,
 'custom_resources_per_worker': {},
 'eager_tracing': False,
 'entropy_coeff': 0.0,
 'entropy_coeff_schedule': None,
 'env': None,
 'env_config': {},
 'env_task_fn': None,
 'evaluation_config': {},
 'evaluation_interval': None,
 'evaluation_num_episodes': 10,
 'evaluation_num_workers': 0,
 'evaluation_parallel_to_training': False,
 'exploration_config': {'type': 'StochasticSampling'},
 'explore': True,
 'extra_python_environs_for_driver': {},
 'extra_python_environs_for_worker': {},
 'fake_sampler': False,
 'framework': 'tf',
 'gamma': 0.99,
 'grad_clip': None,
 'horizon': None,
 'ignore_worker_failures': False,
 'in_evaluation': False,
 'input'
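
Since these configs are plain, nested Python dicts, it helps to keep in mind how copying them behaves before we start overriding values. The following is a minimal sketch using a made-up mini config (the keys mirror real ones, the values are purely illustrative): `dict.copy()` is shallow, so top-level overrides such as `lr` stay local to the copy, while nested dicts such as `model` remain shared unless `copy.deepcopy()` is used.

```python
import copy

# Hypothetical mini config; real RLlib configs have many more keys.
base_config = {
    "lr": 5e-5,
    "model": {"fcnet_hiddens": [256, 256]},
}

# Shallow copy: top-level keys are independent ...
shallow = base_config.copy()
shallow["lr"] = 0.5
# ... but nested dicts are still shared with the original.
shallow["model"]["fcnet_hiddens"] = [512, 512]

print(base_config["lr"])     # unchanged by the override on the copy
print(base_config["model"])  # changed through the shallow copy!

# Deep copy: nested dicts are duplicated as well.
deep = copy.deepcopy(base_config)
deep["model"]["fcnet_hiddens"] = [64, 64]
print(base_config["model"])  # not affected by the deep copy
```

For top-level overrides only (as with `lr`), a shallow copy is sufficient.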

### Let's do a very simple grid-search over two learning rates with tune.run().

In particular, we will try the learning rates 0.00005 and 0.5 using `tune.grid_search([...])`
inside our config dict:

In [31]:
# Plugging in Ray Tune.
# Note that this is the recommended way to run any experiments with RLlib.
# Reasons:
# - Tune allows you to do hyperparameter tuning in a user-friendly way
#   and at large scale!
# - Tune automatically allocates needed resources for the different
#   hyperparam trials and experiment runs on a cluster.

from ray import tune

# Running stuff with tune, we can re-use the exact
# same config that we used when working with RLlib directly!
tune_config = config.copy()

# Let's add our first hyperparameter search via our config.
# How about we try two different learning rates? Let's say 0.00005 and 0.5 (ouch!).
tune_config["lr"] = tune.grid_search([0.00005, 0.5])  # <- 0.5? again: ouch!

# Now that we will run things "automatically" through tune, we have to
# define one or more stopping criteria.
# Tune will stop the run, once any single one of the criteria is matched (not all of them!).
stop = {
    # Note that the keys used here can be anything present in the above `rllib_trainer.train()` output dict.
    "training_iteration": 5,
    "episode_reward_mean": 20.0,
}

# "PPO" is a registered name that points to RLlib's PPOTrainer.
# See `ray/rllib/agents/registry.py`

# Run a simple experiment until one of the stopping criteria is met.
tune.run(
    "PPO",
    config=tune_config,
    stop=stop,

    # Note that no trainers will be returned from this call here.
    # Tune will create n Trainers internally, run them in parallel and destroy them at the end.
    # However, you can ...
    checkpoint_at_end=True,  # ... create a checkpoint when done.
    checkpoint_freq=10,  # ... create a checkpoint every 10 training iterations.
)

| Trial name | status | loc | lr |
|---|---|---|---|
| PPO_MultiAgentArena_38c9e_00000 | PENDING | | 5e-05 |
| PPO_MultiAgentArena_38c9e_00001 | PENDING | | 0.5 |


[2m[36m(pid=58208)[0m 2021-06-22 11:43:02,337	INFO trainer.py:671 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=58208)[0m 2021-06-22 11:43:02,337	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=58216)[0m 2021-06-22 11:43:02,337	INFO trainer.py:671 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=58216)[0m 2021-06-22 11:43:02,337	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=58216)[0m 2021-06-22 11:43:14,173	INFO trainable.py:101 -- Trainable.setup took 11.837 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=58208)[0m 2021-06-22 11:43:14,891	INFO trainable.py:101 -- Trainable.setup took 12.554 sec

Result for PPO_MultiAgentArena_38c9e_00000:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-06-22_11-43-24
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 13.500000000000027
  episode_reward_mean: -7.229999999999997
  episode_reward_min: -37.50000000000007
  episodes_this_iter: 40
  episodes_total: 40
  experiment_id: e01e80a4879747e58fbf19a5912d5bf2
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.00019999999494757503
          entropy: 1.3465654850006104
          entropy_coeff: 0.0
          kl: 0.04035121574997902
          model: {}
          policy_loss: -0.07320234924554825
          total_loss: 38.04973602294922
          vf_explained_var: 0.13369648158550262
          vf_loss: 38.1148681640625
      policy2:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
      

| Trial name | status | loc | lr | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_38c9e_00000 | RUNNING | 192.168.0.179:58216 | 5e-05 | 1.0 | 10.8214 | 4000.0 | -7.23 | 13.5 | -37.5 | 100.0 |
| PPO_MultiAgentArena_38c9e_00001 | RUNNING | | 0.5 | | | | | | | |


Result for PPO_MultiAgentArena_38c9e_00001:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-06-22_11-43-25
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 7.500000000000023
  episode_reward_mean: -12.442500000000004
  episode_reward_min: -33.900000000000034
  episodes_this_iter: 40
  episodes_total: 40
  experiment_id: 35145892a4bf4cc4a4af79467da0fcb1
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.00019999999494757503
          entropy: 1.3476173877716064
          entropy_coeff: 0.0
          kl: 0.039893217384815216
          model: {}
          policy_loss: -0.07687855511903763
          total_loss: 35.35573196411133
          vf_explained_var: 0.13819590210914612
          vf_loss: 35.42463302612305
      policy2:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.5
          entropy: 0.

| Trial name | status | loc | lr | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_38c9e_00000 | RUNNING | 192.168.0.179:58216 | 5e-05 | 2 | 21.4628 | 8000 | -4.33125 | 18.0 | -37.5 | 100 |
| PPO_MultiAgentArena_38c9e_00001 | RUNNING | 192.168.0.179:58208 | 0.5 | 1 | 10.9737 | 4000 | -12.4425 | 7.5 | -33.9 | 100 |


Result for PPO_MultiAgentArena_38c9e_00001:
  agent_timesteps_total: 16000
  custom_metrics: {}
  date: 2021-06-22_11-43-36
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 7.500000000000023
  episode_reward_mean: -14.977500000000006
  episode_reward_min: -45.000000000000064
  episodes_this_iter: 40
  episodes_total: 80
  experiment_id: 35145892a4bf4cc4a4af79467da0fcb1
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 0.00019999999494757503
          entropy: 1.345901608467102
          entropy_coeff: 0.0
          kl: 0.029133351519703865
          model: {}
          policy_loss: -0.04670372232794762
          total_loss: 51.22841262817383
          vf_explained_var: 0.08547636866569519
          vf_loss: 51.26637268066406
      policy2:
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 0.5
          entropy: 0.

| Trial name | status | loc | lr | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_38c9e_00000 | RUNNING | 192.168.0.179:58216 | 5e-05 | 3 | 30.9242 | 12000 | -2.781 | 18.0 | -31.5 | 100 |
| PPO_MultiAgentArena_38c9e_00001 | RUNNING | 192.168.0.179:58208 | 0.5 | 2 | 21.5462 | 8000 | -14.9775 | 7.5 | -45.0 | 100 |


Result for PPO_MultiAgentArena_38c9e_00001:
  agent_timesteps_total: 24000
  custom_metrics: {}
  date: 2021-06-22_11-43-45
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 15.000000000000014
  episode_reward_mean: -12.723000000000003
  episode_reward_min: -45.000000000000064
  episodes_this_iter: 40
  episodes_total: 120
  experiment_id: 35145892a4bf4cc4a4af79467da0fcb1
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 0.00019999999494757503
          entropy: 1.311800479888916
          entropy_coeff: 0.0
          kl: 0.028345393016934395
          model: {}
          policy_loss: -0.052656546235084534
          total_loss: 38.49173355102539
          vf_explained_var: 0.04838701710104942
          vf_loss: 38.53163528442383
      policy2:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 0.5
          entropy:

| Trial name | status | loc | lr | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_38c9e_00000 | RUNNING | 192.168.0.179:58216 | 5e-05 | 4 | 41.3207 | 16000 | -0.108 | 23.7 | -19.5 | 100 |
| PPO_MultiAgentArena_38c9e_00001 | RUNNING | 192.168.0.179:58208 | 0.5 | 3 | 30.9031 | 12000 | -12.723 | 15.0 | -45.0 | 100 |


Result for PPO_MultiAgentArena_38c9e_00001:
  agent_timesteps_total: 32000
  custom_metrics: {}
  date: 2021-06-22_11-43-56
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 15.000000000000023
  episode_reward_mean: -6.482999999999993
  episode_reward_min: -45.000000000000064
  episodes_this_iter: 40
  episodes_total: 160
  experiment_id: 35145892a4bf4cc4a4af79467da0fcb1
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.675000011920929
          cur_lr: 0.00019999999494757503
          entropy: 1.283635139465332
          entropy_coeff: 0.0
          kl: 0.021360205486416817
          model: {}
          policy_loss: -0.04965157434344292
          total_loss: 32.91261291503906
          vf_explained_var: 0.16926634311676025
          vf_loss: 32.947845458984375
      policy2:
        learner_stats:
          cur_kl_coeff: 0.22499999403953552
          cur_lr: 0.5
          entropy: 6.

| Trial name | status | loc | lr | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_38c9e_00000 | RUNNING | 192.168.0.179:58216 | 5e-05 | 5 | 52.01 | 20000 | 0.951 | 23.7 | -21.3 | 100 |
| PPO_MultiAgentArena_38c9e_00001 | RUNNING | 192.168.0.179:58208 | 0.5 | 4 | 41.6801 | 16000 | -6.483 | 15.0 | -45.0 | 100 |


Result for PPO_MultiAgentArena_38c9e_00001:
  agent_timesteps_total: 40000
  custom_metrics: {}
  date: 2021-06-22_11-44-06
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 22.499999999999936
  episode_reward_mean: -0.16199999999999096
  episode_reward_min: -28.500000000000014
  episodes_this_iter: 40
  episodes_total: 200
  experiment_id: 35145892a4bf4cc4a4af79467da0fcb1
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
          entropy: 1.2669665813446045
          entropy_coeff: 0.0
          kl: 0.016159681603312492
          model: {}
          policy_loss: -0.04417208582162857
          total_loss: 32.857418060302734
          vf_explained_var: 0.28577640652656555
          vf_loss: 32.88522720336914
      policy2:
        learner_stats:
          cur_kl_coeff: 0.11249999701976776
          cur_lr: 0.5
          entropy:

| Trial name | status | loc | lr | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_38c9e_00000 | TERMINATED | | 5e-05 | 5 | 52.01 | 20000 | 0.951 | 23.7 | -21.3 | 100 |
| PPO_MultiAgentArena_38c9e_00001 | TERMINATED | | 0.5 | 5 | 51.8954 | 20000 | -0.162 | 22.5 | -28.5 | 100 |


2021-06-22 11:44:07,702	INFO tune.py:549 -- Total run time: 71.79 seconds (71.22 seconds for the tuning loop).


<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7f81ec46fa60>
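
To make the two mechanisms used in the run above more tangible, here is a minimal pure-Python sketch (not Tune's actual implementation). It shows how a grid search over `lr` expands into one concrete config per value, and how a `stop` dict ends a trial as soon as any single criterion reaches its threshold. The helper names `expand_grid` and `should_stop` are hypothetical and exist only for this illustration.

```python
import itertools


def expand_grid(config, grid_keys):
    """Return one concrete config per combination of grid values.

    `grid_keys` maps a config key to the list of values to try
    (a stand-in for `tune.grid_search([...])`).
    """
    keys = sorted(grid_keys)
    trials = []
    for values in itertools.product(*(grid_keys[k] for k in keys)):
        trial_config = dict(config)
        trial_config.update(zip(keys, values))
        trials.append(trial_config)
    return trials


def should_stop(result, stop):
    """True as soon as ANY stopping criterion is reached (value >= threshold)."""
    return any(result.get(key, float("-inf")) >= threshold
               for key, threshold in stop.items())


trials = expand_grid({"env": "MultiAgentArena"}, {"lr": [0.00005, 0.5]})
print([t["lr"] for t in trials])

stop = {"training_iteration": 5, "episode_reward_mean": 20.0}
print(should_stop({"training_iteration": 4, "episode_reward_mean": 0.9}, stop))
print(should_stop({"training_iteration": 5, "episode_reward_mean": 0.9}, stop))
```

This matches what we saw in the output: both trials terminated after 5 iterations, even though neither got anywhere near a mean reward of 20.0.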

### Why did we use 6 CPUs in the tune run above (3 CPUs per trial)?

PPO - by default - uses 2 ("rollout") workers (`num_workers=2`). These are Ray Actors that have their own environment copy(ies) and step through those in parallel. On top of these two "rollout" workers, every Trainer in RLlib always also has a "local" worker, which - in case of PPO - handles the learning updates. This gives us 3 workers (2 rollout + 1 local learner), which require 3 CPUs.
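
The arithmetic can be written down as a small sketch. The helper `cpus_per_trial` is hypothetical; its parameters are modeled on RLlib's `num_workers`, `num_cpus_per_worker`, and `num_cpus_for_driver` settings, assuming one CPU each for every rollout worker and for the local worker.

```python
def cpus_per_trial(num_workers, cpus_per_worker=1, cpus_for_driver=1):
    # Remote rollout workers plus the local worker (the "driver").
    return num_workers * cpus_per_worker + cpus_for_driver


num_trials = 2  # two learning rates in the grid search
per_trial = cpus_per_trial(num_workers=2)
total = num_trials * per_trial
print(per_trial, total)
```

With two trials running in parallel, this gives the 6 CPUs from the heading above.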

## Exercise No 2

<hr />

Use the `tune_config` that we have built so far to run another `tune.run()`, but apply the following changes to our setup this time:
- Set up only 1 learning rate under the "lr" config key: Choose the (seemingly) best value from the run in the previous cell (the one that yielded the highest avg. reward).
- Set up 2 train batch sizes using `tune.grid_search([batch size 1, batch size 2])` under the "train_batch_size" config key. Use the values 3000 and 4000.
- Set the `num_envs_per_worker` config parameter to 5. Each worker will then step through its 5 env copies sequentially, but compute the actions for all of them in a single batched forward pass through our neural network.

Other than that, use the exact same args as in our `tune.run()` call in the previous cell.

**Good luck! :)**
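
If the effect of `num_envs_per_worker` is not obvious yet, the following minimal sketch may help. It assumes a toy environment and a trivial threshold "policy" (both hypothetical stand-ins, not RLlib classes) and only counts how often the policy is called versus how often an env is stepped.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for one environment copy."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.obs = self.rng.random()

    def step(self, action):
        # The action moves the observation up or down a little.
        self.obs += 0.1 if action == 1 else -0.1
        return self.obs


def batched_policy(observations):
    """One 'forward pass' for all envs at once: act 1 if obs < 0.5, else 0."""
    return [1 if obs < 0.5 else 0 for obs in observations]


num_envs_per_worker = 5
envs = [ToyEnv(seed=i) for i in range(num_envs_per_worker)]
forward_passes = 0
env_steps = 0

for _ in range(3):
    observations = [env.obs for env in envs]
    actions = batched_policy(observations)  # one batched call ...
    forward_passes += 1
    for env, action in zip(envs, actions):  # ... then the envs step one by one.
        env.step(action)
        env_steps += 1

print(forward_passes, env_steps)
```

Three policy calls produce fifteen env steps here, which is why this setting tends to pay off when the forward pass is expensive compared to stepping the env.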


In [None]:
# !LIVE CODING!

# Solution to Exercise #2

# Run for longer this time (not just 5 iterations) and try to reach 40.0 reward (sum of both agents).
stop = {
    "training_iteration": 100,
    "episode_reward_mean": 40.0,
}

# tune_config.update({
# ???
# })

# analysis = tune.run(...)

tune_config["lr"] = 0.00005
tune_config["train_batch_size"] = tune.grid_search([3000, 4000])
tune_config["num_envs_per_worker"] = 10
tune_config["num_sgd_iter"] = 20
tune_config["model"] = {"fcnet_hiddens": [512, 512]}

analysis = tune.run("PPO", config=tune_config, stop=stop, checkpoint_at_end=True, checkpoint_freq=10)

| Trial name | status | loc | train_batch_size |
|---|---|---|---|
| PPO_MultiAgentArena_b1a2b_00000 | PENDING | | 3000 |
| PPO_MultiAgentArena_b1a2b_00001 | PENDING | | 4000 |


[2m[36m(pid=58399)[0m 2021-06-22 12:00:43,998	INFO trainer.py:671 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=58399)[0m 2021-06-22 12:00:43,998	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=58400)[0m 2021-06-22 12:00:43,998	INFO trainer.py:671 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=58400)[0m 2021-06-22 12:00:43,998	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=58399)[0m 2021-06-22 12:00:57,405	INFO trainable.py:101 -- Trainable.setup took 13.408 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[2m[36m(pid=58400)[0m 2021-06-22 12:00:57,465	INFO trainable.py:101 -- Trainable.setup took 13.468 sec

Result for PPO_MultiAgentArena_b1a2b_00000:
  agent_timesteps_total: 6000
  custom_metrics: {}
  date: 2021-06-22_12-01-04
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 10.500000000000028
  episode_reward_mean: -9.194999999999995
  episode_reward_min: -30.000000000000018
  episodes_this_iter: 20
  episodes_total: 20
  experiment_id: b67292c4160f4be099f3182821b66779
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.00019999999494757503
          entropy: 1.346530556678772
          entropy_coeff: 0.0
          kl: 0.041253022849559784
          model: {}
          policy_loss: -0.08638487756252289
          total_loss: 36.72517013549805
          vf_explained_var: 0.1265425831079483
          vf_loss: 36.80330276489258
      policy2:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
     

| Trial name | status | loc | train_batch_size | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_b1a2b_00000 | RUNNING | 192.168.0.179:58399 | 3000 | 1.0 | 7.52223 | 3000.0 | -9.195 | 10.5 | -30.0 | 100.0 |
| PPO_MultiAgentArena_b1a2b_00001 | RUNNING | | 4000 | | | | | | | |


Result for PPO_MultiAgentArena_b1a2b_00001:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-06-22_12-01-07
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 16.20000000000001
  episode_reward_mean: -7.394999999999999
  episode_reward_min: -36.90000000000006
  episodes_this_iter: 40
  episodes_total: 40
  experiment_id: 9d3a10e163f347328c1aa983ffea5cb5
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.00019999999494757503
          entropy: 1.349806547164917
          entropy_coeff: 0.0
          kl: 0.03762923553586006
          model: {}
          policy_loss: -0.07011692225933075
          total_loss: 42.256065368652344
          vf_explained_var: 0.11579272150993347
          vf_loss: 42.31865310668945
      policy2:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
      

| Trial name | status | loc | train_batch_size | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|---|
| PPO_MultiAgentArena_b1a2b_00000 | RUNNING | 192.168.0.179:58399 | 3000 | 2 | 14.4428 | 6000 | -7.69 | 10.5 | -30.0 | 100 |
| PPO_MultiAgentArena_b1a2b_00001 | RUNNING | 192.168.0.179:58400 | 4000 | 1 | 10.0252 | 4000 | -7.395 | 16.2 | -36.9 | 100 |


------------------
## 15 min break :)
------------------


(while the above experiment is running (and hopefully learning))


## How do we extract any checkpoint from some trial of a tune.run?

In [None]:
# The tune.run from the exercise above returned an ExperimentAnalysis object (`analysis`), from which we can access any checkpoint
# (given we set checkpoint_freq or checkpoint_at_end to reasonable values) like so:
print(analysis)
# Get all trials.
trials = analysis.trials
# Assuming the second trial was the best, we'd like to extract this trial's best checkpoint
# (judged by the mean episode reward):
best_checkpoint = analysis.get_best_checkpoint(trial=trials[1], metric="episode_reward_mean", mode="max")
print(f"Found best checkpoint for trial #2: {best_checkpoint}")

# Replace the grid-search entries, which RLlib doesn't understand, with fixed values.
rllib_config = tune_config.copy()
rllib_config["lr"] = 0.00005
rllib_config["train_batch_size"] = 3000

# Restore an RLlib Trainer from the checkpoint.
new_trainer = PPOTrainer(config=rllib_config)
new_trainer.restore(best_checkpoint)
# Evaluate to see how it's doing.
print(new_trainer.evaluate())
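
Conceptually, picking the "best" checkpoint of a trial boils down to comparing a metric across that trial's checkpoints. The following sketch illustrates the idea on hypothetical (path, mean reward) pairs; the helper `best_checkpoint_path` and the data are made up for this example and are not how Tune stores its results.

```python
# Hypothetical (checkpoint path, episode_reward_mean) pairs for one trial.
checkpoints = [
    ("checkpoint_000010/checkpoint-10", -4.2),
    ("checkpoint_000020/checkpoint-20", 11.8),
    ("checkpoint_000030/checkpoint-30", 9.5),
]


def best_checkpoint_path(checkpoints, mode="max"):
    """Return the path whose metric is highest (mode="max") or lowest ("min")."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    pick = max if mode == "max" else min
    path, _ = pick(checkpoints, key=lambda item: item[1])
    return path


print(best_checkpoint_path(checkpoints))
```

Note that the best checkpoint in this made-up data is not the last one, which is a good reason to checkpoint regularly rather than only at the end.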

In [None]:
%%HTML
<video width="400" height="300" controls>
  <source src="videos/learnt_2_policies_to_30_reward.mov" type="video/mp4">
</video>

## Let's talk about customization options

### Deep Dive: How do we customize RLlib's RL loop?

RLlib offers a callbacks API that allows you to add custom behavior to
all major events during the environment sampling and learning process.

**Our problem:** So far, we can only see the total reward (sum for both agents).
This does not give us enough insights into the question of which agent
learns what (maybe agent2 doesn't learn anything and the reward we are observing
is mostly due to agent1's progress in covering the map!).

In the following cell, we will create some custom callback "hooks" that will allow us to
add each agent's individual reward to the returned metrics (which will then be displayed in TensorBoard!).

For that we will override RLlib's DefaultCallbacks class and implement the
`on_episode_start`, `on_episode_step`, and `on_episode_end` methods therein:


In [None]:
# Override the DefaultCallbacks with your own and implement any methods (hooks)
# that you need.
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray.rllib.evaluation.episode import MultiAgentEpisode


class MyCallbacks(DefaultCallbacks):
    def on_episode_start(self,
                         *,
                         worker,
                         base_env,
                         policies,
                         episode: MultiAgentEpisode,
                         env_index,
                         **kwargs):
        # We will use the `MultiAgentEpisode` object being passed into
        # all episode-related callbacks. It comes with a user_data property (dict),
        # which we can write arbitrary data into.

        # At the end of an episode, we'll transfer that data into the `custom_metrics`
        # property to make sure our custom data is displayed in TensorBoard.

        # The episode is starting:
        # Reset the per-episode counter for the number of new fields
        # discovered by agent1.
        episode.user_data["new_fields_discovered"] = 0
        # Reset the per-episode collision counter (how many times has agent2 blocked agent1?).
        episode.user_data["num_collisions"] = 0

    def on_episode_step(self,
                        *,
                        worker,
                        base_env,
                        episode: MultiAgentEpisode,
                        env_index,
                        **kwargs):
        # Get both rewards.
        ag1_r = episode.prev_reward_for("agent1")
        ag2_r = episode.prev_reward_for("agent2")

        # Agent1 discovered a new field.
        if ag1_r == 1.0:
            episode.user_data["new_fields_discovered"] += 1
        # Collision.
        elif ag2_r == 1.0:
            episode.user_data["num_collisions"] += 1

    def on_episode_end(self,
                       *,
                       worker,
                       base_env,
                       policies,
                       episode: MultiAgentEpisode,
                       env_index,
                       **kwargs):
        # Episode is done:
        # Write the per-episode counters to `custom_metrics`, so that they
        # show up in the training results and in TensorBoard.
        episode.custom_metrics["new_fields_discovered"] = episode.user_data["new_fields_discovered"]
        episode.custom_metrics["num_collisions"] = episode.user_data["num_collisions"]
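
To see the order in which these hooks fire without starting a training run, here is a minimal sketch that drives them by hand. `FakeEpisode` and `CountingCallbacks` are hypothetical stand-ins (not RLlib classes), reduced to the `episode` argument, and the reward sequence is made up.

```python
class FakeEpisode:
    """Hypothetical stand-in for RLlib's episode object (illustration only)."""

    def __init__(self):
        self.user_data = {}       # scratch space that lives for one episode
        self.custom_metrics = {}  # what ends up in the reported results
        self.rewards = {}         # last reward per agent

    def prev_reward_for(self, agent_id):
        return self.rewards.get(agent_id, 0.0)


class CountingCallbacks:
    """Same three hook names as above, reduced to the episode argument."""

    def on_episode_start(self, *, episode, **kwargs):
        episode.user_data["new_fields_discovered"] = 0

    def on_episode_step(self, *, episode, **kwargs):
        if episode.prev_reward_for("agent1") == 1.0:
            episode.user_data["new_fields_discovered"] += 1

    def on_episode_end(self, *, episode, **kwargs):
        episode.custom_metrics["new_fields_discovered"] = \
            episode.user_data["new_fields_discovered"]


# Drive the hooks by hand with made-up agent1 rewards.
callbacks, episode = CountingCallbacks(), FakeEpisode()
callbacks.on_episode_start(episode=episode)
for reward in [1.0, -0.5, 1.0, 1.0]:
    episode.rewards["agent1"] = reward
    callbacks.on_episode_step(episode=episode)
callbacks.on_episode_end(episode=episode)
print(episode.custom_metrics)
```

Three of the four made-up steps carry a reward of 1.0, so the counter that reaches `custom_metrics` is 3.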


In [None]:
# Setting up our config to point to our new custom callbacks class:
config.update({
    "callbacks": MyCallbacks,  # by default, this would point to `rllib.agents.callbacks.DefaultCallbacks`, which does nothing.
})

analysis = tune.run(
    "PPO",
    config=config,
    stop={"training_iteration": 20},
    checkpoint_at_end=True,
    # If you'd like to restore the tune run from an existing checkpoint file, you can do the following:
    #restore="/Users/sven/ray_results/PPO/PPO_MultiAgentArena_fd451_00000_0_2021-05-25_15-13-26/checkpoint_000010/checkpoint-10",
)
print(analysis.get_last_checkpoint())

### Let's check TensorBoard for the new custom metrics!

1. Head over to ~/ray_results/PPO/PPO_MultiAgentArena_[some key]_00000_0_[date]_[time]/
1. In that directory, you should see an `events.out.tfevents....` file.
1. Run `tensorboard --logdir .` and head to http://localhost:6006

<img src="images/tensorboard.png" width=800>


### Deep Dive: Providing your custom Models in tf or torch.

In [None]:
# Custom Neural Network Models.

import tensorflow as tf


class MyModel(tf.keras.Model):
    def __init__(self,
                 input_space,
                 action_space,
                 num_outputs,
                 name="",
                 *,
                 layers=(256, 256)):
        super().__init__(name=name)

        self.dense_layers = []
        for i, layer_size in enumerate(layers):
            self.dense_layers.append(tf.keras.layers.Dense(
                layer_size, activation=tf.nn.relu, name=f"dense_{i}"))

        self.logits = tf.keras.layers.Dense(
            num_outputs,
            activation=tf.keras.activations.linear,
            name="logits")
        self.values = tf.keras.layers.Dense(
            1, activation=None, name="values")

    def call(self, inputs, training=None, mask=None):
        # Standardized input arg:
        # - `inputs`: an input dict (RLlib `SampleBatch` object, which is basically
        #   a dict with numpy arrays in it); the observations sit under the "obs" key.
        out = inputs["obs"]
        for layer in self.dense_layers:
            out = layer(out)
        logits = self.logits(out)
        values = self.values(out)

        # Standardized output:
        # - "normal" model output tensor (e.g. action logits).
        # - list of internal state outputs (only needed for RNN-/memory enhanced models).
        # - "extra outs", such as model's side branches, e.g. value function outputs.
        return logits, [], {"vf_preds": tf.reshape(values, [-1])}

In [None]:
# Do a quick test on the custom model class.
import numpy as np
from gym.spaces import Box
test_model = MyModel(
    input_space=Box(-1.0, 1.0, (2, )),
    action_space=None,
    num_outputs=2,
)
test_model({"obs": np.array([[0.5, 0.5]])})
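
The key convention in `MyModel.call()` is the returned triple: model outputs, a list of state outputs, and a dict of extra outputs. The following framework-free sketch mimics that structure for a single observation with a tiny, randomly initialized network in plain Python. The helpers `dense`, `init`, and `forward` and the layer sizes are assumptions for illustration only, not part of RLlib or Keras.

```python
import random


def dense(inputs, weights, biases, relu=False):
    """Fully connected layer for a single input vector (lists of floats)."""
    outputs = []
    for column, bias in zip(zip(*weights), biases):
        value = sum(x * w for x, w in zip(inputs, column)) + bias
        outputs.append(max(0.0, value) if relu else value)
    return outputs


def init(n_in, n_out, rng):
    """Random weight matrix (n_in x n_out) and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)]
               for _ in range(n_in)]
    return weights, [0.0] * n_out


rng = random.Random(0)
hidden = init(2, 4, rng)       # shared trunk
logits_head = init(4, 2, rng)  # one logit per action
value_head = init(4, 1, rng)   # single value estimate


def forward(obs):
    features = dense(obs, *hidden, relu=True)
    logits = dense(features, *logits_head)
    value = dense(features, *value_head)[0]
    # Same triple as in `MyModel.call()`: outputs, state list, extra outputs.
    # (Here for one observation; the Keras model works on whole batches.)
    return logits, [], {"vf_preds": value}


logits, state, extras = forward([0.5, 0.5])
print(len(logits), state, sorted(extras))
```

Both heads share the same hidden features, which mirrors how `MyModel` feeds `out` into its `logits` and `values` layers.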

In [None]:
# Set up our custom model and re-run the experiment.
config.update({
    "model": {
        "custom_model": MyModel,
        "custom_model_config": {
            "layers": [128, 128],
        },
    },
    # Revert these to single trials (and use those hyperparams that performed well in our Exercise #2).
    "lr": 0.0005,
    "train_batch_size": 2000,
})

tune.run("PPO", config=config, stop=stop)


## A closer look at RLlib's APIs and Components
### (Depending on the time left and the number of questions accumulated :)

We already took a quick look inside an RLlib Trainer object and extracted its Policy(ies) and the Policy's model (neural network). Here is a much more detailed overview of what's inside a Trainer object.

At the core is the so-called `WorkerSet` sitting under `Trainer.workers`. A WorkerSet is a group of `RolloutWorker` (`rllib.evaluation.rollout_worker.py`) objects that always consists of a "local worker" (`Trainer.workers.local_worker()`) and n "remote workers" (`Trainer.workers.remote_workers()`).



<img src="images/rllib_structure.png" width=1000>

### Scaling RLlib

Scaling RLlib works by parallelizing the "jobs" that the remote `RolloutWorkers` do. In a vanilla RL algorithm, like PPO, DQN, and many others, the `@ray.remote` labeled RolloutWorkers in the figure above are responsible for interacting with one or more environments and thereby collecting experiences. Observations are produced by the environment, actions are then computed by the Policy(ies) copy located on the remote worker and sent to the environment in order to produce yet another observation. This cycle is repeated endlessly and only sometimes interrupted to send experience batches ("train batches") of a certain size to the "local worker". There these batches are used to call `Policy.learn_on_batch()`, which performs a loss calculation, followed by a model weights update, and a subsequent weights broadcast back to all the remote workers.
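
The cycle described above (sample on the remote workers, learn on the local worker, broadcast the new weights) can be sketched in a few lines of plain Python. `ToyRolloutWorker` and `ToyLocalWorker` are hypothetical stand-ins, the "weights" are a single number, and everything runs sequentially here, whereas RLlib runs the remote workers in parallel as Ray actors.

```python
class ToyRolloutWorker:
    """Hypothetical stand-in for a RolloutWorker holding a policy copy."""

    def __init__(self):
        self.weights = 0.0

    def sample(self, num_steps):
        # Pretend each experience is just the weights version used to act.
        return [self.weights] * num_steps

    def set_weights(self, weights):
        self.weights = weights


class ToyLocalWorker(ToyRolloutWorker):
    def learn_on_batch(self, batch):
        # Stand-in for the loss computation and the model weights update.
        self.weights += 1.0
        return {"batch_size": len(batch)}


local_worker = ToyLocalWorker()
remote_workers = [ToyRolloutWorker() for _ in range(2)]

for iteration in range(3):
    # 1) Remote workers collect experiences.
    train_batch = []
    for worker in remote_workers:
        train_batch.extend(worker.sample(num_steps=200))
    # 2) The local worker learns on the concatenated train batch.
    stats = local_worker.learn_on_batch(train_batch)
    # 3) The updated weights are broadcast back to all remote workers.
    for worker in remote_workers:
        worker.set_weights(local_worker.weights)

print(stats["batch_size"], [w.weights for w in remote_workers])
```

After three iterations, every remote worker holds the same weights as the local worker, which is exactly what the broadcast step is for.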



## Thank you for listening and participating!

### Here are a couple of links that you may find useful

- The <a href="https://github.com/sven1977/rllib_tutorials.git">github repo of this tutorial</a>.
- <a href="https://docs.ray.io/en/master/rllib.html">RLlib's documentation main page</a>.
- <a href="http://discuss.ray.io">Our discourse forum</a> to ask questions on RLlib.
- Our <a href="https://forms.gle/9TSdDYUgxYs8SA9e8">Slack channel</a> for interacting with other Ray RLlib users.
- The <a href="https://github.com/ray-project/ray/blob/master/rllib/examples/">RLlib examples scripts folder</a> with tons of examples on how to do different stuff with RLlib.
- A <a href="https://medium.com/distributed-computing-with-ray/reinforcement-learning-with-rllib-in-the-unity-game-engine-1a98080a7c0d">blog post on training with RLlib inside a Unity3D environment</a>.

## Time for Q&A