# Hands-on RL with Ray’s RLlib
## A beginner’s tutorial for working with multi-agent environments, models, and algorithms

<img src="images/pitfall.jpg" width=250> <img src="images/tesla.jpg" width=254> <img src="images/forklifts.jpg" width=169> <img src="images/robots.jpg" width=252> <img src="images/dota2.jpg" width=213>

### Overview
“Hands-on RL with Ray’s RLlib” is a beginner’s tutorial for working with reinforcement learning (RL) environments, models, and algorithms using Ray’s RLlib library. RLlib offers high scalability, a large selection of algorithms to choose from (offline, model-based, model-free, etc.), support for TensorFlow and PyTorch, and a unified API for a variety of applications. The tutorial starts with a brief introduction to the core concepts (e.g. why RL) before moving on to RLlib environments (multi- and single-agent), neural network models, hyperparameter tuning, debugging, student exercises, Q&A, and more. All code will be provided as .py files in a GitHub repo.

### Intended Audience
* Python programmers who want to get started with reinforcement learning and RLlib.

### Prerequisites
* Some Python programming experience.
* Some familiarity with machine learning.
* *Helpful, but not required:* Experience in reinforcement learning and Ray.
* *Helpful, but not required:* Experience with TensorFlow or PyTorch.

### Requirements/Dependencies

First, install conda (https://www.anaconda.com/products/individual), then follow the setup instructions for your platform below.

#### Quick `conda` setup instructions (Mac and Linux):
```
$ conda create -n rllib python=3.8
$ conda activate rllib
$ pip install "ray[rllib]"
$ pip install tensorflow  # or: pip install torch (either one works!)
$ pip install jupyterlab
```

#### Quick `conda` setup instructions (Win10):
```
$ conda create -n rllib python=3.8
$ conda activate rllib
$ pip install "ray[rllib]"
$ pip install tensorflow  # or: pip install torch (either one works!)
$ pip install jupyterlab
$ conda install pywin32
```

Also, for Win10 Atari support, we have to install atari_py from a different source (gym does not support Atari envs on Windows).

```
$ pip install git+https://github.com/Kojoley/atari-py.git
```

### Opening these tutorial files:
```
$ git clone https://github.com/sven1977/rllib_tutorials
$ cd rllib_tutorials
$ jupyter-lab
```

### Key Takeaways
* What is reinforcement learning and why RLlib?
* Core concepts of RLlib: Environments, Trainers, Policies, and Models.
* How to configure, hyperparameter-tune, and parallelize RLlib.
* RLlib debugging best practices.

### Tutorial Outline
1. RL and RLlib in a nutshell.
1. Defining an RL-solvable problem: Our first environment.
1. **Exercise No.1**: Environment loop.

(15min break)

1. Picking an algorithm and training our first RLlib Trainer.
1. Configurations and hyperparameters - Easy tuning with Ray Tune.
1. Fixing our experiment's config - Going multi-agent.
1. The "infinite laptop": Quick intro into how to use RLlib with Anyscale's product.
1. **Exercise No.2**: Run your own Ray RLlib+Tune experiment.
1. Neural network models - Provide your custom models using tf.keras or torch.nn.

(15min break)

1. Deeper dive into RLlib's parallelization architecture.
1. Specifying different compute resources and parallelization options through our config.
1. "Hacking in": Using callbacks to customize the RL loop and generate our own metrics.
1. **Exercise No.3**: Write your own custom callback.
1. "Hacking in (part II)" - Debugging with RLlib and PyCharm.
1. Checking on the "infinite laptop" - Did RLlib learn to solve the problem?

### Other Recommended Readings
* [Attention Nets and More with RLlib's Trajectory View API](https://medium.com/distributed-computing-with-ray/attention-nets-and-more-with-rllibs-trajectory-view-api-d326339a6e65)
* [Intro to RLlib: Example Environments](https://medium.com/distributed-computing-with-ray/intro-to-rllib-example-environments-3a113f532c70)
* [Reinforcement Learning with RLlib in the Unity Game Engine](https://medium.com/distributed-computing-with-ray/reinforcement-learning-with-rllib-in-the-unity-game-engine-1a98080a7c0d)


<img src="images/rl-cycle.png" width=1200>

### Coding/defining our "problem" via an RL environment.

We will use the following (adversarial) multi-agent environment
throughout this tutorial to demonstrate a large fraction of RLlib's
APIs, features, and customization options.

<img src="images/environment.png" width=800>

### A word or two on Spaces:

Spaces are used in ML to describe the possible (valid) values that the inputs and outputs of a neural network can take.

RL environments also use them to describe what their valid observations and actions are.

Spaces are usually defined by their shape (e.g. 84x84x3 RGB images) and datatype (e.g. uint8 for RGB values between 0 and 255).
However, spaces could also be composed of other spaces (see Tuple or Dict spaces) or could be simply discrete with n fixed possible values
(represented by integers). For example, in our game, where each agent can only go up/down/left/right, the action space would be "Discrete(4)"
(no datatype, no shape needs to be defined here).

<img src="images/spaces.png" width=800>
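
To make the idea more concrete, here is a minimal, standard-library-only sketch of what a discrete space does. `DiscreteSketch` and `MultiDiscreteSketch` are hypothetical stand-ins written for this illustration; they are not gym's classes, which offer more functionality (dtypes, seeding, etc.).

```python
import random


class DiscreteSketch:
    """Illustrative stand-in for Discrete(n): valid values are the ints 0..n-1."""

    def __init__(self, n):
        self.n = n

    def contains(self, x):
        return isinstance(x, int) and 0 <= x < self.n

    def sample(self):
        return random.randrange(self.n)


class MultiDiscreteSketch:
    """Illustrative stand-in for MultiDiscrete(nvec): one discrete space per component."""

    def __init__(self, nvec):
        self.components = [DiscreteSketch(n) for n in nvec]

    def contains(self, x):
        return len(x) == len(self.components) and all(
            c.contains(v) for c, v in zip(self.components, x)
        )

    def sample(self):
        return [c.sample() for c in self.components]


# Our arena: 4 possible moves; an observation holds two positions on a 10x10 grid.
action_space = DiscreteSketch(4)
observation_space = MultiDiscreteSketch([100, 100])

print(action_space.contains(3))             # valid: 3 lies within 0..3
print(action_space.contains(4))             # invalid: out of range
print(observation_space.contains([0, 99]))  # valid: both positions lie within 0..99
```

Neither a shape nor a datatype has to be passed in for the discrete case; the number of possible values is all the space needs to know.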

In [2]:
# Let's code (parts of) our multi-agent environment.

import gym
from gym.spaces import Discrete, MultiDiscrete
import numpy as np
import random

from ray.rllib.env.multi_agent_env import MultiAgentEnv


class MultiAgentArena(MultiAgentEnv):
    def __init__(self, config=None):
        # !LIVE CODING!
        #from environment import _init
        #_init(self, config)
        
        config = config or {}
        self.height = config.get("height", 10)
        self.width = config.get("width", 10)

        self.observation_space = gym.spaces.MultiDiscrete([self.height*self.width, self.height*self.width])
        self.action_space = gym.spaces.Discrete(4)
        
        self.max_timesteps = config.get("max_timesteps", 100)
        
        self.reset()
        
    def reset(self):  # returns initial observation of next(!) episode
        # !LIVE CODING!
        #from environment import _reset
        #return _reset(self)
        self.ts = 0
        self.agent1_pos = [0, 0]
        # Row major.
        self.agent2_pos = [self.height - 1, self.width - 1]  # [9, 9]
        
        self.agent1_visited_states = set()

        return self._get_obs()

    def step(self, action: dict):  # returns obs, rewards, dones, infos.
        # !LIVE CODING!
        #from environment import _step
        #return _step(action)
        self.ts += 1
        
        # Determine, which agent moves first.
        if random.random() > 0.5:
            # Possible events: new_field|collision
            events = self._move(self.agent1_pos, action["agent1"], is_agent1=True)
            events |= self._move(self.agent2_pos, action["agent2"], is_agent1=False)
        else:
            events = self._move(self.agent2_pos, action["agent2"], is_agent1=False)
            events |= self._move(self.agent1_pos, action["agent1"], is_agent1=True)

        # Reward function:
        if "collision" in events:
            r1 = -1.0
            r2 = 1.0
        elif "new_field" in events:
            r1 = 1.0
            r2 = -0.1
        else:
            r1 = -0.5
            r2 = -0.1
            
        done = self.ts >= self.max_timesteps
        
        rewards = {"agent1": r1, "agent2": r2}
        dones = {"agent1": done, "agent2": done, "__all__": done}
        
        return self._get_obs(), rewards, dones, {}  # <- info dict (not used here)

    def _get_obs(self):
        """
        Returns obs dict (agent name to array of [own discrete pos, other
        agent's discrete pos]) using each agent's current row/column position.
        """
        ag1_discrete_pos = self.agent1_pos[0] * self.width + \
            (self.agent1_pos[1] % self.width)
        ag2_discrete_pos = self.agent2_pos[0] * self.width + \
            (self.agent2_pos[1] % self.width)
        return {
            "agent1": np.array([ag1_discrete_pos, ag2_discrete_pos]),
            "agent2": np.array([ag2_discrete_pos, ag1_discrete_pos]),
        }

    def _move(self, coords, action, is_agent1):
        """
        Moves an agent (agent1 iff is_agent1=True, else agent2) from `coords` (row/column)
        using the given action (0=up, 1=right, 2=down, 3=left) and returns the resulting
        set of events:
        "collision": the move was blocked because the target field is occupied by the other agent.
        "new_field": agent1 entered a field it had not visited before in this episode.
        An empty set is returned if neither event occurred.
        """
        orig_coords = coords[:]
        # Change the row: 0=up (-1), 2=down (+1)
        coords[0] += -1 if action == 0 else 1 if action == 2 else 0
        # Change the column: 1=right (+1), 3=left (-1)
        coords[1] += 1 if action == 1 else -1 if action == 3 else 0

        # Solve collisions.
        # Make sure, we don't end up on the other agent's position.
        # If yes, don't move (we are blocked).
        if (is_agent1 and coords == self.agent2_pos) or (not is_agent1 and coords == self.agent1_pos):
            coords[0], coords[1] = orig_coords
            # Agent2 blocked agent1 (agent1 tried to run into agent2)
            # OR Agent2 bumped into agent1 (agent2 tried to run into agent1)
            return {"collision"}

        # No agent blocking -> check walls.
        if coords[0] < 0:
            coords[0] = 0
        elif coords[0] >= self.height:
            coords[0] = self.height - 1
        if coords[1] < 0:
            coords[1] = 0
        elif coords[1] >= self.width:
            coords[1] = self.width - 1

        # If agent1 -> "new_field" if new tile covered.
        if is_agent1 and tuple(coords) not in self.agent1_visited_states:
            self.agent1_visited_states.add(tuple(coords))
            return {"new_field"}
        # No new tile for agent1.
        return set()

    # Optionally: Add `render` method returning some img.
    def render(self, mode=None):
        field_size = 40

        if not hasattr(self, "viewer"):
            from gym.envs.classic_control import rendering
            self.viewer = rendering.Viewer(400, 400)
            self.fields = {}
            # Add our grid, and the two agents to the viewer.
            for i in range(self.width):
                l = i * field_size
                r = l + field_size
                for j in range(self.height):
                    b = 400 - j * field_size - field_size
                    t = b + field_size
                    field = rendering.PolyLine([(l, b), (l, t), (r, t), (r, b)], close=True)
                    field.set_color(.0, .0, .0)
                    field.set_linewidth(1.0)
                    self.fields[(j, i)] = field
                    self.viewer.add_geom(field)
            
            agent1 = rendering.make_circle(radius=field_size // 2 - 4)
            agent1.set_color(.0, 0.8, 0.1)
            self.agent1_trans = rendering.Transform()
            agent1.add_attr(self.agent1_trans)
            agent2 = rendering.make_circle(radius=field_size // 2 - 4)
            agent2.set_color(.5, 0.1, 0.1)
            self.agent2_trans = rendering.Transform()
            agent2.add_attr(self.agent2_trans)
            self.viewer.add_geom(agent1)
            self.viewer.add_geom(agent2)

        # Mark those fields green that have been covered by agent1,
        # all others black.
        for i in range(self.width):
            for j in range(self.height):
                self.fields[(j, i)].set_color(.0, .0, .0)
                self.fields[(j, i)].set_linewidth(1.0)
        for (j, i) in self.agent1_visited_states:
            self.fields[(j, i)].set_color(.1, .5, .1)
            self.fields[(j, i)].set_linewidth(5.0)
        
        # Move both agents' circles to their current positions.
        self.agent1_trans.set_translation(self.agent1_pos[1] * field_size + field_size / 2, 400 - (self.agent1_pos[0] * field_size + field_size / 2))
        self.agent2_trans.set_translation(self.agent2_pos[1] * field_size + field_size / 2, 400 - (self.agent2_pos[0] * field_size + field_size / 2))

        return self.viewer.render(return_rgb_array=mode == 'rgb_array')

dummy_env = MultiAgentArena()
dummy_env



<__main__.MultiAgentArena at 0x7f9375787670>
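
The observations returned by `_get_obs` flatten each agent's (row, column) position into a single integer in row-major order. The short sketch below shows this encoding and its inverse; `encode_position` and `decode_position` are hypothetical helper names used only for this illustration.

```python
def encode_position(row, col, width):
    """Flattens a (row, col) grid position into one integer (row-major order)."""
    return row * width + col


def decode_position(index, width):
    """Inverse of encode_position: recovers (row, col) from the flat index."""
    return divmod(index, width)


width = 10
start_agent1 = encode_position(0, 0, width)  # top-left corner
start_agent2 = encode_position(9, 9, width)  # bottom-right corner
print(start_agent1, start_agent2)
print(decode_position(start_agent2, width))
```

On a 10x10 grid this yields 100 possible values per agent, which matches the `MultiDiscrete([100, 100])` observation space declared in `__init__`.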

## Exercise No 1

<hr />

Write an "environment loop" using our `MultiAgentArena` class.

1. Create an env object.
1. `reset` your environment to get the first (initial) observation.
1. `step` through the environment using a provided
   "DummyTrainer.compute_action([obs])" method to compute action dicts (see cell below, in which you can create a DummyTrainer object and query it for random actions).
1. When an episode is done, remember to `reset()` your environment before the next call to `step()`.
1. If this feels way too easy for you ;), try to extract each agent's reward, sum it up over one episode, and print each agent's accumulated reward (also called the "return") at the end of the episode (when done=True).

**Good luck! :)**


In [3]:
class DummyTrainer:
    """Dummy Trainer class used in Exercise #1.

    Use its `compute_action` method to get a new action, given some environment
    observation.
    """

    def compute_action(self, obs=None):
        # Returns a random action for a single agent.
        return np.random.randint(4)  # Discrete(4) -> return rand int between 0 and 3 (incl. 3).

dummy_trainer = DummyTrainer()
# Check, whether it's working.
for _ in range(5):
    print("action={}".format(dummy_trainer.compute_action(np.array([0, 10]))))

action=3
action=0
action=2
action=0
action=0


In [3]:
# Solution to Exercise #1
# !LIVE CODING!

import time

env = MultiAgentArena()
num_episodes = 0
episode_return = 0.0

obs = env.reset()

while num_episodes < 5:
    action1 = dummy_trainer.compute_action(obs["agent1"])
    action2 = dummy_trainer.compute_action(obs["agent2"])

    obs, reward, done, _ = env.step({"agent1": action1, "agent2": action2})
    episode_return += reward["agent1"] + reward["agent2"]

    if done["agent1"] and done["agent2"]:
        num_episodes += 1
        print(episode_return)
        episode_return = 0.0
        obs = env.reset()

    env.render()
    time.sleep(0.02)

env.viewer.close()

-5.999999999999975
4.500000000000026
-4.5
-22.500000000000025
-11.999999999999984


In [4]:
# Now for something completely different:
# Plugging in RLlib!

import numpy as np
import ray

# Start a new instance of Ray or connect to an already running one.
ray.init()  # Hear the engine humming? ;)

# In case you encounter the following error during our tutorial:
# RuntimeError: Maybe you called ray.init twice by accident?
# Try: ray.shutdown() or ray.init(ignore_reinit_error=True)

2021-05-25 15:01:13,930	INFO services.py:1262 -- View the Ray dashboard at http://127.0.0.1:8265


{'node_ip_address': '192.168.0.102',
 'raylet_ip_address': '192.168.0.102',
 'redis_address': '192.168.0.102:6379',
 'object_store_address': '/tmp/ray/session_2021-05-25_15-01-11_922474_33329/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-05-25_15-01-11_922474_33329/sockets/raylet',
 'webui_url': '127.0.0.1:8265',
 'session_dir': '/tmp/ray/session_2021-05-25_15-01-11_922474_33329',
 'metrics_export_port': 65503,
 'node_id': '15b74b842c7a56a3a4e799801a9ffd44b5bdf841fede0758bd57afdb'}

<img src="images/rllib_algos.png" width=800>

https://docs.ray.io/en/master/rllib-algorithms.html#available-algorithms-overview

In [5]:
# Import a Trainable (one of RLlib's built-in algorithms):
# We use the PPO algorithm here b/c it's very flexible wrt its supported
# action spaces and model types and b/c it learns well on almost any problem.
from ray.rllib.agents.ppo import PPOTrainer

# Specify a very simple config, defining our environment and some environment
# options (see environment.py).
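config = {
    "env": MultiAgentArena,  # "my_env" <- if we previously have registered the env with `tune.register_env("[name]", lambda config: [returns env object])`.
    # `env_config` is passed as-is to our env's constructor (as its `config`
    # arg), so the options must sit at the top level to be picked up.
    "env_config": {
        "width": 10,
        "height": 10,
    },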
    # "framework": "torch",  # If users have chosen to install torch instead of tf.
    "create_env_on_driver": True,
}
# Instantiate the Trainer object using above config.
rllib_trainer = PPOTrainer(config=config)

2021-05-25 15:01:15,742	INFO trainer.py:666 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
2021-05-25 15:01:15,744	INFO trainer.py:691 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


In [6]:
# That's it, we are ready to train.
# Calling `train` once runs a single "training iteration". One iteration
# for most algos contains a) sampling from the environment(s) + b) using the
# sampled data (observations, actions taken, rewards) to update the policy
# model (neural network), such that it would pick better actions in the future,
# leading to higher rewards.
# !LIVE CODING! (call and print out `trainer.train()`)
results = rllib_trainer.train()
print(results)

{'episode_reward_max': 14.100000000000016, 'episode_reward_min': -31.500000000000036, 'episode_reward_mean': -6.209999999999995, 'episode_len_mean': 100.0, 'episode_media': {}, 'episodes_this_iter': 20, 'policy_reward_min': {}, 'policy_reward_max': {}, 'policy_reward_mean': {}, 'custom_metrics': {}, 'hist_stats': {'episode_reward': [14.100000000000016, -19.500000000000004, 12.00000000000003, -31.500000000000036, -14.999999999999993, -24.0, 0.9000000000000111, 4.200000000000031, -4.499999999999989, -5.999999999999979, -5.999999999999993, 6.000000000000012, -31.50000000000003, 12.600000000000017, 3.00000000000001, 4.500000000000007, -5.999999999999979, 4.5000000000000036, -27.000000000000036, -14.99999999999999], 'episode_lengths': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]}, 'sampler_perf': {'mean_raw_obs_processing_ms': 0.13066910125397063, 'mean_inference_ms': 0.5442456646518157, 'mean_action_processing_ms': 0.04946363793981897

In [12]:
# Run `train()` n times. Try to repeatedly call this to see rewards increase.
# Move on once you see episode rewards of 15.0 or more.
for _ in range(15):
    results = rllib_trainer.train()
    print(f"Iteration={rllib_trainer.iteration}: R={results['episode_reward_mean']}")

Iteration=17: R=8.98199999999999
Iteration=18: R=9.434999999999985
Iteration=19: R=10.349999999999982
Iteration=20: R=10.742999999999984
Iteration=21: R=11.993999999999973
Iteration=22: R=12.644999999999971
Iteration=23: R=13.499999999999968
Iteration=24: R=13.454999999999966
Iteration=25: R=13.736999999999961
Iteration=26: R=14.516999999999964
Iteration=27: R=14.627999999999961
Iteration=28: R=14.906999999999956
Iteration=29: R=16.64999999999995
Iteration=30: R=18.350999999999946
Iteration=31: R=19.118999999999936


In [8]:
# !LIVE CODING!
# Use the above solution of Exercise #1 and replace our `dummy_trainer`
# with the already trained `rllib_trainer`.
rllib_trainer

PPO
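
One possible way to do this live-coding step is sketched below. It assumes the old-style gym API used throughout this tutorial (`reset()` returns an obs dict, `step()` returns a 4-tuple) and a trainer object that exposes a `compute_action(obs)` method, as our `DummyTrainer` does. `run_episode` is a hypothetical helper name, not part of RLlib.

```python
def run_episode(env, trainer):
    """Plays one episode and returns each agent's accumulated reward (return)."""
    obs = env.reset()
    returns = {}
    done = {"__all__": False}
    while not done["__all__"]:
        # Query the trainer once per agent, using that agent's own observation.
        actions = {
            agent_id: trainer.compute_action(agent_obs)
            for agent_id, agent_obs in obs.items()
        }
        obs, rewards, done, _ = env.step(actions)
        for agent_id, reward in rewards.items():
            returns[agent_id] = returns.get(agent_id, 0.0) + reward
    return returns
```

With the objects from the cells above, `run_episode(MultiAgentArena(), dummy_trainer)` gives a random baseline. Passing `rllib_trainer` instead should use the trained policy, assuming its `compute_action` accepts a single agent's observation under the default policy.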

In [9]:
# !LIVE CODING!
# Let's actually "look inside" our Trainer to see what's in there.
import numpy as np
policy = rllib_trainer.get_policy()
sess = policy.get_session()
print(f"policy={policy}")
print(f"observation-space={policy.observation_space}")
print(f"action-space={policy.action_space}")

model = policy.model
print(f"model={model}")

obs_sample = np.expand_dims(policy.observation_space.sample(), 0)
action_logits, _ = model({"obs": obs_sample})

print("logits={}".format(sess.run(action_logits)))

action_distribution = policy.dist_class(action_logits)
action = action_distribution.sample()
print("action sample={}".format(sess.run(action)))

policy=<ray.rllib.policy.tf_policy_template.PPOTFPolicy object at 0x7fabeeb2fac0>
observation-space=Box(-1.0, 1.0, (200,), float32)
action-space=Discrete(4)
model=<ray.rllib.models.tf.fcnet.FullyConnectedNetwork object at 0x7fabeeb2faf0>
logits=[[-4.9493103  7.095663   1.2463962 -5.00922  ]]
action sample=[1]
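
Note that the policy reports a `Box(-1.0, 1.0, (200,), float32)` observation space even though our environment declares `MultiDiscrete([100, 100])`. This is consistent with RLlib one-hot encoding each discrete component before it reaches the model (100 + 100 = 200 inputs). The standard-library sketch below illustrates that encoding; `one_hot_multi_discrete` is a hypothetical helper, not RLlib's actual preprocessor code.

```python
def one_hot_multi_discrete(values, nvec):
    """One-hot encodes each component and concatenates the results."""
    encoded = []
    for value, n in zip(values, nvec):
        component = [0.0] * n
        component[value] = 1.0
        encoded.extend(component)
    return encoded


# Agent1 at flat position 0, agent2 at flat position 99 (the start state).
vec = one_hot_multi_discrete([0, 99], nvec=[100, 100])
print(len(vec))                                    # total length: 100 + 100
print([i for i, v in enumerate(vec) if v == 1.0])  # indices of the "hot" entries
```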


In [13]:
# Currently, `rllib_trainer` is in an already trained state.
# It holds optimized weights in its Policy's model that allow it to act
# already somewhat smart in our environment when given an observation.

# If we closed this notebook, all the effort would have been for nothing.
# Let's save the state of our trainer to disk for later!
checkpoint_path = rllib_trainer.save()
print(f"Trainer (at iteration {rllib_trainer.iteration}) was saved in '{checkpoint_path}'!")

# Here is what a checkpoint directory contains:
print("The checkpoint directory contains the following files:")
import os
os.listdir(os.path.dirname(checkpoint_path))

Trainer (at iteration 31) was saved in '/Users/sven/ray_results/PPO_MultiAgentArena_2021-05-25_13-07-562s553d7d/checkpoint_000031/checkpoint-31'!
The checkpoint directory contains the following files:


['checkpoint-31.tune_metadata', '.is_checkpoint', 'checkpoint-31']

In [14]:
# Pretend, we wanted to pick up training from a previous run:
new_trainer = PPOTrainer(config=config)
# Evaluate the new trainer (this should yield random results).
results = new_trainer.evaluate()
print(f"Evaluating new trainer: R={results['evaluation']['episode_reward_mean']}")

# Restoring the trained state into the `new_trainer` object.
new_trainer.restore(checkpoint_path)

# Evaluate again (this should yield results we saw after having trained our saved agent).
results = new_trainer.evaluate()
print(f"Evaluating restored trainer: R={results['evaluation']['episode_reward_mean']}")

2021-05-25 13:10:39,086	INFO trainable.py:377 -- Restored on 192.168.0.102 from checkpoint: /Users/sven/ray_results/PPO_MultiAgentArena_2021-05-25_13-07-562s553d7d/checkpoint_000031/checkpoint-31
2021-05-25 13:10:39,086	INFO trainable.py:385 -- Current state after restoring: {'_iteration': 31, '_timesteps_total': None, '_time_total': 86.76738381385803, '_episodes_total': 620}


Evaluating new trainer: R=-10.934999999999999
Evaluating restored trainer: R=15.959999999999946


In [19]:
# 5) Configuration dicts and Ray Tune.
# Where are the default configuration dicts stored?
import pprint

# PPO algorithm:
from ray.rllib.agents.ppo import DEFAULT_CONFIG as PPO_DEFAULT_CONFIG
print(f"PPO's default config is:")
pprint.pprint(PPO_DEFAULT_CONFIG)

# DQN algorithm:
#from ray.rllib.agents.dqn import DEFAULT_CONFIG as DQN_DEFAULT_CONFIG
#print(f"DQN's default config is:")
#pprint.pprint(DQN_DEFAULT_CONFIG)

# Common (all algorithms).
#from ray.rllib.agents.trainer import COMMON_CONFIG
#print(f"RLlib Trainer's default config is:")
#pprint.pprint(COMMON_CONFIG)

PPO's default config is:
{'_fake_gpus': False,
 'batch_mode': 'truncate_episodes',
 'callbacks': <class 'ray.rllib.agents.callbacks.DefaultCallbacks'>,
 'clip_actions': True,
 'clip_param': 0.3,
 'clip_rewards': None,
 'collect_metrics_timeout': 180,
 'compress_observations': False,
 'create_env_on_driver': False,
 'custom_eval_function': None,
 'custom_resources_per_worker': {},
 'eager_tracing': False,
 'entropy_coeff': 0.0,
 'entropy_coeff_schedule': None,
 'env': None,
 'env_config': {},
 'env_task_fn': None,
 'evaluation_config': {},
 'evaluation_interval': None,
 'evaluation_num_episodes': 10,
 'evaluation_num_workers': 0,
 'evaluation_parallel_to_training': False,
 'exploration_config': {'type': 'StochasticSampling'},
 'explore': True,
 'extra_python_environs_for_driver': {},
 'extra_python_environs_for_worker': {},
 'fake_sampler': False,
 'framework': 'tf',
 'gamma': 0.99,
 'grad_clip': None,
 'horizon': None,
 'ignore_worker_failures': False,
 'in_evaluation': False,
 'input'
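
In practice we only pass the keys we want to change, and those are merged on top of the defaults. Below is a minimal sketch of such a recursive merge, assuming nested dicts should be merged key by key rather than replaced wholesale. `deep_merge` is a hypothetical helper for illustration; RLlib's own merging logic may apply additional rules.

```python
import copy


def deep_merge(defaults, overrides):
    """Returns a new dict: `defaults` updated recursively with `overrides`."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A few defaults copied from the printout above.
defaults = {
    "gamma": 0.99,
    "framework": "tf",
    "exploration_config": {"type": "StochasticSampling"},
}
user_config = {"framework": "torch", "env_config": {"width": 10}}

final = deep_merge(defaults, user_config)
print(final["framework"], final["gamma"])
```

Keys that the user does not mention (here `gamma`) keep their default values, and the defaults dict itself is left unmodified.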

In [6]:
# Plugging in Ray Tune.
# Note that this is the recommended way to run any experiments with RLlib.
# Reasons:
# - Tune allows you to do hyperparameter tuning in a user-friendly way
#   and at large scale!
# - Tune automatically allocates needed resources for the different
#   hyperparam trials and experiment runs.

from ray import tune

# Now that we will run things "automatically" through tune, we have to
# define one or more stopping criteria.
stop = {
    # Keys can be any metric found in the result dict returned by
    # `trainer.train()` (see the printout further above).
    "training_iteration": 3,
    "episode_reward_mean": 9999.9,
}

# "PPO" is a registered name that points to RLlib's PPOTrainer.
# See `ray/rllib/agents/registry.py`
# Run our simple experiment until one of the stop criteria is met.
tune.run("PPO", config=config, stop=stop)


Trial name,status,loc
PPO_MultiAgentArena_55818_00000,PENDING,


[2m[36m(pid=33378)[0m 2021-05-25 15:01:40,826	INFO trainer.py:666 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=33378)[0m 2021-05-25 15:01:40,826	INFO trainer.py:691 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


Result for PPO_MultiAgentArena_55818_00000:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-05-25_15-01-53
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 9.000000000000028
  episode_reward_mean: -7.664999999999994
  episode_reward_min: -33.00000000000007
  episodes_this_iter: 20
  episodes_total: 20
  experiment_id: 2163d328c3014ebd9bd67eb7a55dbbe4
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.3665188550949097
          entropy_coeff: 0.0
          kl: 0.020290959626436234
          model: {}
          policy_loss: -0.05183352530002594
          total_loss: 17.472394943237305
          vf_explained_var: 0.12051534652709961
          vf_loss: 17.52016830444336
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained: 4

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_55818_00000,RUNNING,192.168.0.102:33378,1,4.13073,4000,-7.665,9,-33,100


[2m[36m(pid=33378)[0m 2021-05-25 15:01:53,891	ERROR worker.py:396 -- SystemExit was raised from the worker
[2m[36m(pid=33378)[0m Traceback (most recent call last):
[2m[36m(pid=33378)[0m   File "python/ray/_raylet.pyx", line 594, in ray._raylet.task_execution_handler
[2m[36m(pid=33378)[0m   File "python/ray/_raylet.pyx", line 452, in ray._raylet.execute_task
[2m[36m(pid=33378)[0m   File "python/ray/_raylet.pyx", line 489, in ray._raylet.execute_task
[2m[36m(pid=33378)[0m   File "python/ray/_raylet.pyx", line 496, in ray._raylet.execute_task
[2m[36m(pid=33378)[0m   File "python/ray/_raylet.pyx", line 500, in ray._raylet.execute_task
[2m[36m(pid=33378)[0m   File "python/ray/_raylet.pyx", line 450, in ray._raylet.execute_task.function_executor
[2m[36m(pid=33378)[0m   File "/Users/sven/opt/anaconda3/envs/rllib_tutorial/lib/python3.8/site-packages/ray/_private/function_manager.py", line 566, in actor_method_executor
[2m[36m(pid=33378)[0m     return method(__ray_

<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7f9363f3d730>
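
The stopping logic for a dict of criteria can be pictured as follows: a trial ends as soon as any listed metric in the latest result reaches or exceeds its threshold. This is a plain-Python sketch of that rule, not Tune's implementation, and `should_stop` is a hypothetical helper.

```python
def should_stop(result, stop):
    """True if any stop criterion is met by the latest training result."""
    return any(
        key in result and result[key] >= threshold
        for key, threshold in stop.items()
    )


stop = {"training_iteration": 3, "episode_reward_mean": 9999.9}

# Illustrative result dicts showing only the two relevant keys; the reward
# at iteration 3 is a made-up value.
after_iter_1 = {"training_iteration": 1, "episode_reward_mean": -7.665}
after_iter_3 = {"training_iteration": 3, "episode_reward_mean": -5.0}

print(should_stop(after_iter_1, stop))
print(should_stop(after_iter_3, stop))
```

Since a mean reward of 9999.9 is out of reach in our environment, the `training_iteration` criterion is the one that ends this experiment.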

In [7]:
# Updating an algo's default config dict and adding hyperparameter tuning
# options to it.
# Note: Hyperparameter tuning options (e.g. grid_search) will only work,
# if we run these configs via `tune.run`.
config.update(
    {
        # Try 2 different learning rates.
        "lr": tune.grid_search([0.0001, 0.5]),
        # NN model config to tweak the default model
        # that'll be created by RLlib for the policy.
        "model": {
            # e.g. change the dense layer stack.
            "fcnet_hiddens": [256, 256, 256],
            # Alternatively, you can specify a custom model here
            # (we'll cover that later).
            # "custom_model": ...
            # Pass kwargs to your custom model.
            # "custom_model_config": {}
        },
    }
)
# Repeat our experiment using tune's grid-search feature.
results = tune.run(
    "PPO",
    config=config,
    stop=stop,
    checkpoint_at_end=True,  # create a checkpoint when done.
    checkpoint_freq=1,  # create a checkpoint on every iteration.
)
print(results)


Trial name,status,loc,lr
PPO_MultiAgentArena_60f15_00000,PENDING,,0.0001
PPO_MultiAgentArena_60f15_00001,PENDING,,0.5


[2m[36m(pid=33382)[0m 2021-05-25 15:02:01,498	INFO trainer.py:666 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=33382)[0m 2021-05-25 15:02:01,498	INFO trainer.py:691 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=33380)[0m 2021-05-25 15:02:01,496	INFO trainer.py:666 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=33380)[0m 2021-05-25 15:02:01,496	INFO trainer.py:691 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


Result for PPO_MultiAgentArena_60f15_00001:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-05-25_15-02-16
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 7.50000000000003
  episode_reward_mean: -8.594999999999988
  episode_reward_min: -28.50000000000005
  episodes_this_iter: 20
  episodes_total: 20
  experiment_id: bd6f019bf13048638270272320b576ee
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.5
          entropy: 0.08564744144678116
          entropy_coeff: 0.0
          kl: 29.168235778808594
          model: {}
          policy_loss: 0.4312191903591156
          total_loss: 41.855220794677734
          vf_explained_var: -0.00611991249024868
          vf_loss: 35.59035873413086
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained: 4000
  iterations_sinc

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_60f15_00000,RUNNING,,0.0001,,,,,,,
PPO_MultiAgentArena_60f15_00001,RUNNING,192.168.0.102:33380,0.5,1.0,5.40766,4000.0,-8.595,7.5,-28.5,100.0


Result for PPO_MultiAgentArena_60f15_00000:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-05-25_15-02-16
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 12.300000000000027
  episode_reward_mean: -14.580000000000009
  episode_reward_min: -36.000000000000064
  episodes_this_iter: 20
  episodes_total: 20
  experiment_id: 0ea7872cd21243c9b688a6f66fb28c88
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 1.3483853340148926
          entropy_coeff: 0.0
          kl: 0.03889012709259987
          model: {}
          policy_loss: -0.07277801632881165
          total_loss: 30.49079704284668
          vf_explained_var: 0.12279991805553436
          vf_loss: 30.555797576904297
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained:

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_60f15_00000,RUNNING,192.168.0.102:33382,0.0001,2,10.302,8000,-6.7425,21.6,-36.0,100
PPO_MultiAgentArena_60f15_00001,RUNNING,192.168.0.102:33380,0.5,2,10.3092,8000,-26.91,7.5,-46.5,100


Result for PPO_MultiAgentArena_60f15_00001:
  agent_timesteps_total: 12000
  custom_metrics: {}
  date: 2021-05-25_15-02-26
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 7.50000000000003
  episode_reward_mean: -33.44000000000003
  episode_reward_min: -46.500000000000064
  episodes_this_iter: 20
  episodes_total: 60
  experiment_id: bd6f019bf13048638270272320b576ee
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 0.5
          entropy: 0.0
          entropy_coeff: 0.0
          kl: 0.0
          model: {}
          policy_loss: 0.0016429759562015533
          total_loss: 111.94747161865234
          vf_explained_var: -0.0012123853666707873
          vf_loss: 111.94583129882812
    num_agent_steps_sampled: 12000
    num_agent_steps_trained: 12000
    num_steps_sampled: 12000
    num_steps_trained: 12000
  iterations_since_restore: 3
  node_

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_60f15_00000,TERMINATED,,0.0001,3,15.0006,12000,-4.43,21.6,-36.0,100
PPO_MultiAgentArena_60f15_00001,TERMINATED,,0.5,3,15.004,12000,-33.44,7.5,-46.5,100


2021-05-25 15:02:26,788	INFO tune.py:549 -- Total run time: 32.69 seconds (32.31 seconds for the tuning loop).


<ray.tune.analysis.experiment_analysis.ExperimentAnalysis object at 0x7f93755ab790>


In [8]:
# Going multi-policy:

# Our experiment is ill-configured because both agents, which should
# behave differently due to their different tasks and reward functions,
# learn the same policy: the "default_policy", which RLlib always provides
# if you don't configure anything else. Remember that RLlib does not know
# at Trainer setup time how many and which agents the environment will
# "produce".
# Let's fix this by introducing the "multiagent" API.

<img src="images/from_single_agent_to_multi_agent.png" width=800>

In [9]:
# Define an agent->policy mapping function.
# Which agents (defined by the environment) use which policies
# (defined by us)? Mapping is M (agents) -> N (policies), where M >= N.
def policy_mapping_fn(agent: str):
    assert agent in ["agent1", "agent2"], f"ERROR: invalid agent {agent}!"
    return "pol1" if agent == "agent1" else "pol2"
    
# Define details for our two policies.
#TODO: coding Sven: Make it possible to not need obs/action spaces
#  if they are the default anyways.
observation_space = rllib_trainer.workers.local_worker().env.observation_space
action_space = rllib_trainer.workers.local_worker().env.action_space
# Btw, the above is equivalent to reading the `observation_space` and
# `action_space` attributes of `rllib_trainer.get_policy("default_policy")`.
policies = {
    "pol1": (None, observation_space, action_space, {"lr": 0.0003}),
    "pol2": (None, observation_space, action_space, {"lr": 0.0004}),
}

# policies_to_train = ["pol1", "pol2"]

# Adding the above to our config.
config.update({
    "multiagent": {
        "policies": policies,
        "policy_mapping_fn": policy_mapping_fn,
        # "policies_to_train": policies_to_train,
    },
})
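
To make the M (agents) -> N (policies) mapping concrete, here is a minimal sketch in plain Python that does not need RLlib. It assumes a hypothetical third agent, `agent3`, which shares `pol1` with `agent1`; the real MultiAgentArena only has two agents, and the names `AGENT_TO_POLICY` and `sketch_policy_mapping_fn` are made up for this illustration.

```python
# Illustrative only: plain Python, no RLlib required.
# "agent3" is a hypothetical extra agent, not part of MultiAgentArena.
AGENT_TO_POLICY = {
    "agent1": "pol1",
    "agent2": "pol2",
    "agent3": "pol1",  # shares pol1 with agent1
}


def sketch_policy_mapping_fn(agent: str) -> str:
    """Return the ID of the policy that controls the given agent."""
    if agent not in AGENT_TO_POLICY:
        raise ValueError(f"invalid agent {agent!r}")
    return AGENT_TO_POLICY[agent]


agents = sorted(AGENT_TO_POLICY)
policies_in_use = {sketch_policy_mapping_fn(a) for a in agents}
# M agents map onto N policies, with M >= N.
print(len(agents), "agents ->", len(policies_in_use), "policies")
```

Several agents may share one policy, but each agent is always mapped to exactly one policy.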

## Exercise No 2

<hr />

Using the `config` that we have built so far (the one we just updated to include a multi-agent setup),
try learning our environment using Ray's `tune.run` and a simple hyperparameter `grid_search` over:
- 2 different learning rates (pick your own values).
- AND 2 different `train_batch_size` settings (use 2000 and 3000).

Also, make RLlib use a [128,128] dense layer stack as the NN model.

Also, use the config setting of `num_envs_per_worker=10` to increase the sampling throughput.

In case your local machine has fewer than 12 CPUs, try setting `num_workers=1` to make all tune trials run at the same time.
Background: PPO uses 2 workers by default, so each trial uses 3 CPUs (2 workers + the "driver", also called the "local worker"),
and the entire experiment of 4 trials uses 12 CPUs. Tune will run trials in sequence if it cannot allocate enough CPUs at once
(which is also fine, but takes longer).
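
The CPU budget described above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the variable names are made up for illustration and nothing here queries Ray or Tune.

```python
# Back-of-the-envelope CPU budget for the grid search described above.
num_learning_rates = 2
num_batch_sizes = 2
num_workers = 2                     # PPO's default, per the text above
cpus_per_trial = num_workers + 1    # rollout workers + 1 driver ("local worker")

num_trials = num_learning_rates * num_batch_sizes
total_cpus = num_trials * cpus_per_trial
print(num_trials, cpus_per_trial, total_cpus)  # 4 3 12

# With num_workers=1, each trial needs 2 CPUs, so all 4 trials need 8.
total_cpus_one_worker = num_trials * (1 + 1)
print(total_cpus_one_worker)  # 8
```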

Try to reach a total reward (sum of agent1 and agent2) of 15.0.

**Good luck! :)**


In [10]:
# Solution to Exercise #2
# !LIVE CODING!

# Update our config and set it up for 2x tune grid-searches (leading to 4 parallel trials in total).
config.update({
    "lr": tune.grid_search([0.0001, 0.0005]),
    "train_batch_size": tune.grid_search([2000, 3000]),
    "num_envs_per_worker": 10,
    # Change our model to be simpler.
    "model": {
        "fcnet_hiddens": [128, 128],
    },
})

# Run the experiment.
experiment_analysis = tune.run("PPO", config=config, stop={"episode_reward_mean": 15.0, "training_iteration": 100})


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_74737_00000,PENDING,,0.0001,2000
PPO_MultiAgentArena_74737_00001,PENDING,,0.0005,2000
PPO_MultiAgentArena_74737_00002,PENDING,,0.0001,3000
PPO_MultiAgentArena_74737_00003,PENDING,,0.0005,3000
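
The four pending trials listed above are the Cartesian product of the two grids. The sketch below reproduces that expansion with `itertools.product`; it only mirrors the ordering seen in the trial table and makes no claim about how Tune generates trials internally.

```python
from itertools import product

learning_rates = [0.0001, 0.0005]
train_batch_sizes = [2000, 3000]

# Cartesian product of both grids, ordered to mirror the trial table above
# (the learning rate varies fastest).
trials = [
    {"lr": lr, "train_batch_size": bs}
    for bs, lr in product(train_batch_sizes, learning_rates)
]
for i, trial in enumerate(trials):
    print(f"trial {i:05d}: {trial}")
```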


[2m[36m(pid=33375)[0m 2021-05-25 15:02:34,584	INFO trainer.py:666 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=33375)[0m 2021-05-25 15:02:34,584	INFO trainer.py:691 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=33379)[0m 2021-05-25 15:02:34,584	INFO trainer.py:666 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=33379)[0m 2021-05-25 15:02:34,584	INFO trainer.py:691 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=33377)[0m 2021-05-25 15:02:34,584	INFO trainer.py:666 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=33377)[0m 2021-05-25 15:02:34,584	INFO trainer.py:691 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv

Result for PPO_MultiAgentArena_74737_00000:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-05-25_15-02-52
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 16.499999999999986
  episode_reward_mean: -8.722500000000002
  episode_reward_min: -36.00000000000005
  episodes_this_iter: 40
  episodes_total: 40
  experiment_id: 2d48b8f2f76245db9aa0e0888af44ced
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.0003000000142492354
          entropy: 1.3459675312042236
          entropy_coeff: 0.0
          kl: 0.041345931589603424
          model: {}
          policy_loss: -0.08170092850923538
          total_loss: 45.23959732055664
          vf_explained_var: 0.1294073462486267
          vf_loss: 45.31303024291992
      pol2:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.00039999998989515007
          e

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_74737_00000,RUNNING,192.168.0.102:33377,0.0001,2000,1.0,6.48751,4000.0,-8.7225,16.5,-36.0,100.0
PPO_MultiAgentArena_74737_00001,RUNNING,,0.0005,2000,,,,,,,
PPO_MultiAgentArena_74737_00002,RUNNING,,0.0001,3000,,,,,,,
PPO_MultiAgentArena_74737_00003,RUNNING,,0.0005,3000,,,,,,,



[2m[36m(pid=33375)[0m 2021-05-25 15:02:52,750	ERROR worker.py:396 -- SystemExit was raised from the worker
[2m[36m(pid=33375)[0m Traceback (most recent call last):
[2m[36m(pid=33375)[0m   File "python/ray/_raylet.pyx", line 594, in ray._raylet.task_execution_handler
[2m[36m(pid=33375)[0m   File "python/ray/_raylet.pyx", line 452, in ray._raylet.execute_task
[2m[36m(pid=33375)[0m   File "python/ray/_raylet.pyx", line 489, in ray._raylet.execute_task
[2m[36m(pid=33375)[0m   File "python/ray/_raylet.pyx", line 496, in ray._raylet.execute_task
[2m[36m(pid=33375)[0m   File "python/ray/_raylet.pyx", line 500, in ray._raylet.execute_task
[2m[36m(pid=33375)[0m   File "python/ray/_raylet.pyx", line 450, in ray._raylet.execute_task.function_executor
[2m[36m(pid=33375)[0m   File "/Users/sven/opt/anaconda3/envs/rllib_tutorial/lib/python3.8/site-packages/ray/_private/function_manager.py", line 566, in actor_method_executor
[2m[36m(pid=33375)[0m     return method(__ray_

In [29]:
# Anyscale's Infinite laptop:

# NOTE: The following cell will only work if you are already on-boarded to our Anyscale Inc. "Infinite Laptop".
# To get more information, see https://www.anyscale.com/product

# Let's briefly step away from our MultiAgentArena and move to something much heavier in terms of environment/simulator complexity.
# We will now demonstrate how you can use Anyscale's Infinite Laptop to launch an RLlib experiment on a cloud machine with 4 GPUs and 32 CPUs,
# all from within this Jupyter cell.
# Start an experiment in the cloud using Anyscale's product, RLlib, and a more complex multi-agent env.

# NOTE 
import ray
ray.shutdown()
ray.client().connect()


2021-05-25 14:08:19,854	INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265


{'node_ip_address': '192.168.0.101',
 'raylet_ip_address': '192.168.0.101',
 'redis_address': '192.168.0.101:6379',
 'object_store_address': '/tmp/ray/session_2021-05-25_14-08-18_732159_31529/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-05-25_14-08-18_732159_31529/sockets/raylet',
 'webui_url': '127.0.0.1:8265',
 'session_dir': '/tmp/ray/session_2021-05-25_14-08-18_732159_31529',
 'metrics_export_port': 64172,
 'node_id': '9c5f474ac4dfdc490dd4df5141be1c58807009b827a263430032ccca'}

In [11]:
# Custom Neural Network Models.

import tensorflow as tf


class MyModel(tf.keras.Model):
    def __init__(self,
                 input_space,
                 action_space,
                 num_outputs,
                 name="",
                 *,
                 layers=(256, 256)):
        super().__init__(name=name)

        self.dense_layers = []
        for i, layer_size in enumerate(layers):
            self.dense_layers.append(tf.keras.layers.Dense(
                layer_size, activation=tf.nn.relu, name=f"dense_{i}"))

        self.logits = tf.keras.layers.Dense(
            num_outputs,
            activation=tf.keras.activations.linear,
            name="logits")
        self.values = tf.keras.layers.Dense(
            1, activation=None, name="values")

    def call(self, inputs, training=None, mask=None):
        # Standardized input arg:
        # - `inputs`: the input dict (an RLlib `SampleBatch` object, which is
        #   basically a dict with numpy arrays in it).
        out = inputs["obs"]
        for layer in self.dense_layers:
            out = layer(out)
        logits = self.logits(out)
        values = self.values(out)

        # Standardized output:
        # - "normal" model output tensor (e.g. action logits).
        # - list of internal state outputs (only needed for RNN-/memory enhanced models).
        # - "extra outs", such as model's side branches, e.g. value function outputs.
        return logits, [], {"vf_preds": tf.reshape(values, [-1])}

In [12]:
# Do a quick test on the custom model class.
import numpy as np
from gym.spaces import Box
test_model = MyModel(
    input_space=Box(-1.0, 1.0, (2, )),
    action_space=None,
    num_outputs=2,
)
test_model({"obs": np.array([[0.5, 0.5]])})

(<tf.Tensor 'my_model/logits/BiasAdd:0' shape=(1, 2) dtype=float64>,
 [],
 {'vf_preds': <tf.Tensor 'my_model/Reshape:0' shape=(1,) dtype=float64>})
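
The three-part return value (model output, state list, extra outputs) can also be illustrated without TensorFlow. The following is a minimal pure-Python sketch of a shared trunk with a logits head and a value head. It processes a single observation instead of a batch, uses random untrained weights, and all helper names (`dense`, `init_layer`, `forward`) are assumptions made for this example, not RLlib or Keras APIs.

```python
import random


def dense(x, weights, biases, relu=False):
    """Apply one fully connected layer to a single input vector."""
    out = [
        sum(xi * w for xi, w in zip(x, unit_weights)) + b
        for unit_weights, b in zip(weights, biases)
    ]
    return [max(0.0, v) for v in out] if relu else out


def init_layer(n_in, n_out, rng):
    """Random weights (one list per output unit) and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
hidden = init_layer(2, 4, rng)       # shared trunk
logits_head = init_layer(4, 2, rng)  # one logit per action
value_head = init_layer(4, 1, rng)   # single value estimate


def forward(obs):
    h = dense(obs, *hidden, relu=True)
    logits = dense(h, *logits_head)
    value = dense(h, *value_head)[0]
    # Same 3-tuple layout as MyModel.call: output, state list, extra outs.
    return logits, [], {"vf_preds": value}


logits, state_out, extra = forward([0.5, 0.5])
print(len(logits), state_out, sorted(extra))
```

Both heads read the same hidden activations, which is why the value branch can be returned as a side output of the same forward pass.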

In [13]:
# Set up our custom model and re-run the experiment.

config.update({
    "model": {
        "custom_model": MyModel,
        "custom_model_config": {
            "layers": [128, 128],
        },
    },
    # Revert these to single trials (and use those hyperparams that performed well in our Exercise #2).
    "lr": 0.0005,
    "train_batch_size": 2000,
})

tune.run("PPO", config=config, stop=stop)

Trial name,status,loc
PPO_MultiAgentArena_841c3_00000,PENDING,


[2m[36m(pid=33454)[0m 2021-05-25 15:02:58,026	INFO trainer.py:666 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=33454)[0m 2021-05-25 15:02:58,026	INFO trainer.py:691 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=33457)[0m Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
[2m[36m(pid=33457)[0m Cause: Unable to locate the source code of <bound method MyModel.call of <__main__.MyModel object at 0x7fecb7958b50>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
[2m

Result for PPO_MultiAgentArena_841c3_00000:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-05-25_15-03-09
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 9.600000000000035
  episode_reward_mean: -9.202499999999995
  episode_reward_min: -31.50000000000004
  episodes_this_iter: 40
  episodes_total: 40
  experiment_id: 7a06020468884649b96d5538baad0125
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.0003000000142492354
          entropy: 1.3631614446640015
          entropy_coeff: 0.0
          kl: 0.025275129824876785
          policy_loss: -0.03527597337961197
          total_loss: 26.51007843017578
          vf_explained_var: 0.27398768067359924
          vf_loss: 26.540298461914062
      pol2:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.00039999998989515007
          entropy: 1.370805740

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_841c3_00000,RUNNING,192.168.0.102:33454,1,4.40055,4000,-9.2025,9.6,-31.5,100




Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_841c3_00000,RUNNING,192.168.0.102:33454,2,8.3951,8000,-5.8725,17.1,-31.5,100


[2m[36m(pid=33454)[0m 2021-05-25 15:03:13,691	ERROR worker.py:396 -- SystemExit was raised from the worker
[2m[36m(pid=33454)[0m Traceback (most recent call last):
[2m[36m(pid=33454)[0m   File "python/ray/_raylet.pyx", line 594, in ray._raylet.task_execution_handler
[2m[36m(pid=33454)[0m   File "python/ray/_raylet.pyx", line 452, in ray._raylet.execute_task
[2m[36m(pid=33454)[0m   File "python/ray/_raylet.pyx", line 489, in ray._raylet.execute_task
[2m[36m(pid=33454)[0m   File "python/ray/_raylet.pyx", line 496, in ray._raylet.execute_task
[2m[36m(pid=33454)[0m   File "python/ray/_raylet.pyx", line 500, in ray._raylet.execute_task
[2m[36m(pid=33454)[0m   File "python/ray/_raylet.pyx", line 450, in ray._raylet.execute_task.function_executor
[2m[36m(pid=33454)[0m   File "/Users/sven/opt/anaconda3/envs/rllib_tutorial/lib/python3.8/site-packages/ray/_private/function_manager.py", line 566, in actor_method_executor
[2m[36m(pid=33454)[0m     return method(__ray_

<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7f93668d2220>

In [15]:
# "Hacking in": How do we customize our RL loop?
# RLlib offers a callbacks API that allows you to add custom behavior to
# all major events during the environment sampling and learning process.

# Our problem: So far, we can only see the total reward (sum for both agents).
# This does not tell us which agent learns what (maybe agent2 doesn't learn
# anything and the reward we are observing is mostly due to agent1's progress
# in covering the map!).

# The following custom callbacks class allows us to add each agent's individual reward to
# the returned metrics, which will then be displayed in TensorBoard.

# We will override RLlib's DefaultCallbacks class and implement the
# `on_episode_start`, `on_episode_step`, and `on_episode_end` methods therein.

from ray.rllib.agents.callbacks import DefaultCallbacks


class MyCallbacks(DefaultCallbacks):
    def on_episode_start(self, *, worker, base_env,
                         policies, episode,
                         env_index, **kwargs):
        # We will use the `MultiAgentEpisode` object being passed into
        # all episode-related callbacks. It comes with a user_data property (dict),
        # which we can write arbitrary data into.

        # At the end of an episode, we'll transfer that data into the `hist_data`, and `custom_metrics`
        # properties to make sure our custom data is displayed in TensorBoard.

        # The episode is starting:
        # Wipe out the rewards-lists for individual agents 1 and 2.
        episode.user_data["agent1_rewards"] = []
        episode.user_data["agent2_rewards"] = []

    def on_episode_step(self, *, worker, base_env,
                        episode, env_index, **kwargs):
        # Get the last rewards for individual agents 1 and 2
        # from the MultiAgentEpisode object (`episode`).
        ag1_r = episode.prev_reward_for("agent1")
        ag2_r = episode.prev_reward_for("agent2")
        #print("ag1_r={} ag2_r={}".format(ag1_r, ag2_r))

        # Add individual rewards to our lists.
        episode.user_data["agent1_rewards"].append(ag1_r)
        episode.user_data["agent2_rewards"].append(ag2_r)

    def on_episode_end(self, *, worker, base_env,
                       policies, episode,
                       env_index, **kwargs):
        # Episode is done:
        # Write scalar values (sum over rewards) to `custom_metrics` and
        # time-series data (rewards per time step) to `hist_data`.
        # Both will be visible then in TensorBoard.

        # Put scalar values (one per episode) under `custom_metrics`.
        episode.custom_metrics["ag1_R"] = sum(episode.user_data["agent1_rewards"])
        episode.custom_metrics["ag2_R"] = sum(episode.user_data["agent2_rewards"])
        # Time series data goes into `hist_data`.
        episode.hist_data["agent1_rewards"] = episode.user_data["agent1_rewards"]
        episode.hist_data["agent2_rewards"] = episode.user_data["agent2_rewards"]
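
The lifecycle of these three callbacks can be traced without RLlib. The sketch below uses a hypothetical `FakeEpisode` stand-in and made-up reward values to show how `user_data` accumulates over an episode and how one scalar per episode ends up in `custom_metrics`. The final max/mean/min summary only imitates the `ag1_R_max`, `ag1_R_mean`, and `ag1_R_min` keys that appear in the training results; how RLlib computes them internally is not shown here.

```python
from statistics import mean


class FakeEpisode:
    """Hypothetical stand-in for RLlib's episode object (illustration only)."""

    def __init__(self):
        self.user_data = {}
        self.custom_metrics = {}


def run_fake_episode(agent1_rewards):
    episode = FakeEpisode()
    # on_episode_start: reset the per-episode list.
    episode.user_data["agent1_rewards"] = []
    # on_episode_step: append the latest reward at every step.
    for reward in agent1_rewards:
        episode.user_data["agent1_rewards"].append(reward)
    # on_episode_end: store one scalar per episode.
    episode.custom_metrics["ag1_R"] = sum(episode.user_data["agent1_rewards"])
    return episode


# Made-up per-step rewards for three short episodes.
episodes = [
    run_fake_episode(rewards)
    for rewards in ([1.0, 1.0, -0.5], [1.0, 0.0], [-0.5, -0.5])
]
values = [ep.custom_metrics["ag1_R"] for ep in episodes]
summary = {
    "ag1_R_max": max(values),
    "ag1_R_mean": mean(values),
    "ag1_R_min": min(values),
}
print(summary)
```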



In [24]:
# Setting up our config to point to our new custom callbacks class:
config.update({
    "env": MultiAgentArena,  # force "reload"
    "callbacks": MyCallbacks,  # by default, this would point to `rllib.agents.callbacks.DefaultCallbacks`, which does nothing.
    #TODO: remove this once native keras models are supported!
    "model": {
        "custom_model": None,
    },
})

results = tune.run("PPO", config=config, stop={"training_iteration": 20}, checkpoint_at_end=True, restore="/Users/sven/ray_results/PPO/PPO_MultiAgentArena_fd451_00000_0_2021-05-25_15-13-26/checkpoint_000010/checkpoint-10")

Trial name,status,loc
PPO_MultiAgentArena_a19a9_00000,PENDING,


[2m[36m(pid=33644)[0m 2021-05-25 15:18:06,247	INFO trainer.py:666 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=33644)[0m 2021-05-25 15:18:06,247	INFO trainer.py:691 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


Trial name,status,loc
PPO_MultiAgentArena_a19a9_00000,RUNNING,


[2m[36m(pid=33644)[0m 2021-05-25 15:18:14,374	INFO trainable.py:377 -- Restored on 192.168.0.102 from checkpoint: /Users/sven/ray_results/PPO/PPO_MultiAgentArena_a19a9_00000_0_2021-05-25_15-18-01/tmpt2yot0_trestore_from_object/checkpoint-10
[2m[36m(pid=33644)[0m 2021-05-25 15:18:14,375	INFO trainable.py:385 -- Current state after restoring: {'_iteration': 10, '_timesteps_total': None, '_time_total': 44.477219343185425, '_episodes_total': 400}


Result for PPO_MultiAgentArena_a19a9_00000:
  agent_timesteps_total: 88000
  custom_metrics:
    ag1_R_max: 35.5
    ag1_R_mean: 19.5125
    ag1_R_min: -2.5
    ag2_R_max: -1.1000000000000039
    ag2_R_mean: -8.277499999999986
    ag2_R_min: -9.89999999999998
  date: 2021-05-25_15-18-18
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 26.099999999999923
  episode_reward_mean: 10.77749999999998
  episode_reward_min: -10.79999999999998
  episodes_this_iter: 40
  episodes_total: 440
  experiment_id: 88f4f9358b2746b4a433baa3528e946e
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.0003000000142492354
          entropy: 1.0878950357437134
          entropy_coeff: 0.0
          kl: 0.04845314100384712
          model: {}
          policy_loss: -0.07339268177747726
          total_loss: 23.10724639892578
          vf_explained_var: 0.5684372186660767
    

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_a19a9_00000,RUNNING,192.168.0.102:33644,12,53.0848,48000,11.6812,26.1,-10.8,100


Result for PPO_MultiAgentArena_a19a9_00000:
  agent_timesteps_total: 104000
  custom_metrics:
    ag1_R_max: 42.0
    ag1_R_mean: 19.335
    ag1_R_min: -7.5
    ag2_R_max: 4.400000000000013
    ag2_R_mean: -6.929999999999989
    ag2_R_min: -9.89999999999998
  date: 2021-05-25_15-18-27
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 31.499999999999908
  episode_reward_mean: 12.233999999999982
  episode_reward_min: -17.999999999999986
  episodes_this_iter: 40
  episodes_total: 520
  experiment_id: 88f4f9358b2746b4a433baa3528e946e
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 0.0003000000142492354
          entropy: 1.0211660861968994
          entropy_coeff: 0.0
          kl: 0.03176135569810867
          model: {}
          policy_loss: -0.06694947183132172
          total_loss: 33.059776306152344
          vf_explained_var: 0.48005741834640503
  

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_a19a9_00000,RUNNING,192.168.0.102:33644,14,61.1943,56000,13.512,31.5,-18,100


Result for PPO_MultiAgentArena_a19a9_00000:
  agent_timesteps_total: 120000
  custom_metrics:
    ag1_R_max: 39.0
    ag1_R_mean: 21.78
    ag1_R_min: -9.0
    ag2_R_max: 4.400000000000013
    ag2_R_mean: -7.127999999999989
    ag2_R_min: -9.89999999999998
  date: 2021-05-25_15-18-35
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 30.899999999999913
  episode_reward_mean: 14.477999999999968
  episode_reward_min: -17.999999999999986
  episodes_this_iter: 40
  episodes_total: 600
  experiment_id: 88f4f9358b2746b4a433baa3528e946e
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.9753550291061401
          entropy_coeff: 0.0
          kl: 0.01630435511469841
          model: {}
          policy_loss: -0.047961167991161346
          total_loss: 35.39598846435547
          vf_explained_var: 0.4754991829395294
     

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_a19a9_00000,RUNNING,192.168.0.102:33644,16,69.3193,64000,16.344,30.9,-10.5,100


Result for PPO_MultiAgentArena_a19a9_00000:
  agent_timesteps_total: 136000
  custom_metrics:
    ag1_R_max: 39.5
    ag1_R_mean: 25.98
    ag1_R_min: 0.5
    ag2_R_max: 2.200000000000002
    ag2_R_mean: -7.160999999999988
    ag2_R_min: -9.89999999999998
  date: 2021-05-25_15-18-43
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 32.699999999999946
  episode_reward_mean: 18.61799999999995
  episode_reward_min: -7.799999999999985
  episodes_this_iter: 40
  episodes_total: 680
  experiment_id: 88f4f9358b2746b4a433baa3528e946e
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.9469641447067261
          entropy_coeff: 0.0
          kl: 0.014672379940748215
          model: {}
          policy_loss: -0.047013480216264725
          total_loss: 24.736125946044922
          vf_explained_var: 0.6006038784980774
      

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_a19a9_00000,RUNNING,192.168.0.102:33644,18,77.5612,72000,19.494,35.1,-8.1,100


Result for PPO_MultiAgentArena_a19a9_00000:
  agent_timesteps_total: 152000
  custom_metrics:
    ag1_R_max: 45.0
    ag1_R_mean: 27.285
    ag1_R_min: -7.0
    ag2_R_max: 8.80000000000002
    ag2_R_mean: -6.665999999999989
    ag2_R_min: -9.89999999999998
  date: 2021-05-25_15-18-52
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 35.399999999999906
  episode_reward_mean: 20.483999999999945
  episode_reward_min: -8.10000000000001
  episodes_this_iter: 40
  episodes_total: 760
  experiment_id: 88f4f9358b2746b4a433baa3528e946e
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.9091278314590454
          entropy_coeff: 0.0
          kl: 0.016280846670269966
          model: {}
          policy_loss: -0.059767886996269226
          total_loss: 34.662628173828125
          vf_explained_var: 0.5539247989654541
     

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_a19a9_00000,RUNNING,192.168.0.102:33644,20,86.2304,80000,20.913,35.4,-8.1,100


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_a19a9_00000,TERMINATED,,20,86.2304,80000,20.913,35.4,-8.1,100


2021-05-25 15:18:57,233	INFO tune.py:549 -- Total run time: 55.65 seconds (55.46 seconds for the tuning loop).


### Let's check tensorboard for the new custom metrics!

1. Head over to ~/ray_results/PPO/PPO_MultiAgentArena_[some key]_00000_0_[date]_[time]/
1. In that directory, you should see an `events.out....` file.
1. Run `tensorboard --logdir .` and head to http://localhost:6006

<img src="images/tensorboard.png" width=800>
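
If you are unsure which trial directory is the newest one, a small helper can pick it by modification time. This is a minimal sketch: `latest_trial_dir` is a hypothetical helper written for this tutorial, and it assumes the default `~/ray_results/PPO` location mentioned in step 1.

```python
from pathlib import Path
from typing import Optional


def latest_trial_dir(results_root: Path) -> Optional[Path]:
    """Return the most recently modified PPO_MultiAgentArena_* directory, if any."""
    candidates = [
        p for p in results_root.glob("PPO_MultiAgentArena_*") if p.is_dir()
    ]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


trial_dir = latest_trial_dir(Path.home() / "ray_results" / "PPO")
if trial_dir is None:
    print("No trial directory found yet; run an experiment first.")
else:
    # Print the command to run in a terminal.
    print(f"tensorboard --logdir {trial_dir}")
```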


## Exercise No 3

<hr />

Assume we would like to know exactly how much (new) ground agent1
covers on average in an episode.
Write your own custom callbacks class (a subclass of
`ray.rllib.agents.callbacks.DefaultCallbacks`) and override one or more of its
methods to collect the following data:
- The number of (unique) fields agent1 has covered in an episode.
- The number of times agent2 has blocked agent1.

Remember that you can get the last rewards for both agents via the `MultiAgentEpisode` object (`episode`) and its
`prev_reward_for([agent name])` method. From these, you should be able to infer whether a collision happened or whether agent1
discovered a new field in the grid.

Run a simple experiment using `tune.run` (and your custom callbacks class)
and confirm that the new metrics show up in TensorBoard.

**Good luck! :)**

In [26]:
# Solution Exercise #3

import ray
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray import tune


class MyCallback(DefaultCallbacks):
    def on_episode_start(self, *, worker, base_env,
                         policies, episode,
                         env_index, **kwargs):
        # Set a per-episode counter for the number of new fields
        # discovered by agent1.
        episode.user_data["new_fields_discovered"] = 0
        # Set per-episode agent2-blocks counter (how many times has agent2 blocked agent1?).
        episode.user_data["num_collisions"] = 0

    def on_episode_step(self, *, worker, base_env,
                        episode, env_index, **kwargs):
        # Get both rewards.
        ag1_r = episode.prev_reward_for("agent1")
        ag2_r = episode.prev_reward_for("agent2")

        # Agent1 discovered a new field.
        if ag1_r == 1.0:
            episode.user_data["new_fields_discovered"] += 1
        # Collision.
        elif ag2_r == 1.0:
            episode.user_data["num_collisions"] += 1

    def on_episode_end(self, *, worker, base_env,
                       policies, episode,
                       env_index, **kwargs):
        # Store everything in `episode.custom_metrics`:
        episode.custom_metrics["new_fields_discovered"] = episode.user_data["new_fields_discovered"]
        episode.custom_metrics["num_collisions"] = episode.user_data["num_collisions"]


stop = {"training_iteration": 10}
# Specify env and custom callbacks in our config (leave everything else
# as-is (defaults)).
config = {
    "env": MultiAgentArena,
    "callbacks": MyCallback,
}

# Run for a few iterations.
tune.run("PPO", stop=stop, config=config)

# Check tensorboard.
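
The reward-based inference used in `on_episode_step` can be tested in isolation. The sketch below assumes, as the solution above does, that agent1 receives 1.0 for a newly discovered field and agent2 receives 1.0 for a collision. `classify_step` is a hypothetical helper, the reward pairs are made up, and `math.isclose` is swapped in for `==` to tolerate small floating-point deviations.

```python
import math


def classify_step(ag1_r: float, ag2_r: float):
    """Return (new_field, collision) flags inferred from the last rewards.

    Assumes, as the solution above does, that agent1 receives 1.0 for a
    newly discovered field and agent2 receives 1.0 for a collision.
    """
    new_field = math.isclose(ag1_r, 1.0)
    collision = math.isclose(ag2_r, 1.0)
    return new_field, collision


# Made-up (agent1, agent2) reward pairs for four steps.
steps = [(1.0, -0.1), (-0.5, 1.0), (0.0, -0.1), (1.0, -0.1)]
flags = [classify_step(a, b) for a, b in steps]
new_fields = sum(nf for nf, _ in flags)
collisions = sum(c for _, c in flags)
print(new_fields, collisions)  # 2 1
```

Unlike the `if`/`elif` in the solution, this helper evaluates both conditions independently, so a step could in principle count as both a discovery and a collision.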



Trial name,status,loc
PPO_MultiAgentArena_f5b12_00000,PENDING,


[2m[36m(pid=34739)[0m 2021-05-25 16:46:23,388	INFO trainer.py:666 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=34739)[0m 2021-05-25 16:46:23,388	INFO trainer.py:691 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


Result for PPO_MultiAgentArena_f5b12_00000:
  agent_timesteps_total: 4000
  custom_metrics:
    new_fields_discovered_max: 46
    new_fields_discovered_mean: 32.95
    new_fields_discovered_min: 21
    num_collisions_max: 3
    num_collisions_mean: 0.5
    num_collisions_min: 0
  date: 2021-05-25_16-46-35
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 9.000000000000023
  episode_reward_mean: -10.019999999999994
  episode_reward_min: -28.500000000000053
  episodes_this_iter: 20
  episodes_total: 20
  experiment_id: 379eb3f5e10b4832a53f910be6d7c8ec
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.3659894466400146
          entropy_coeff: 0.0
          kl: 0.020383195951581
          model: {}
          policy_loss: -0.057828232645988464
          total_loss: 18.440322875976562
          vf_explaine

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_f5b12_00000,RUNNING,192.168.0.102:34739,1,3.18659,4000,-10.02,9,-28.5,100


Result for PPO_MultiAgentArena_f5b12_00000:
  agent_timesteps_total: 12000
  custom_metrics:
    new_fields_discovered_max: 61
    new_fields_discovered_mean: 38.36666666666667
    new_fields_discovered_min: 21
    num_collisions_max: 7
    num_collisions_mean: 1.1333333333333333
    num_collisions_min: 0
  date: 2021-05-25_16-46-41
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 32.09999999999991
  episode_reward_mean: -1.3349999999999904
  episode_reward_min: -28.500000000000053
  episodes_this_iter: 20
  episodes_total: 60
  experiment_id: 379eb3f5e10b4832a53f910be6d7c8ec
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 4.999999873689376e-05
          entropy: 1.2982711791992188
          entropy_coeff: 0.0
          kl: 0.017935695126652718
          model: {}
          policy_loss: -0.05370116978883743
          total_loss: 14.2679519

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_f5b12_00000,RUNNING,192.168.0.102:34739,3,9.33737,12000,-1.335,32.1,-28.5,100


Result for PPO_MultiAgentArena_f5b12_00000:
  agent_timesteps_total: 20000
  custom_metrics:
    new_fields_discovered_max: 61
    new_fields_discovered_mean: 37.92
    new_fields_discovered_min: 21
    num_collisions_max: 11
    num_collisions_mean: 1.8
    num_collisions_min: 0
  date: 2021-05-25_16-46-47
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 32.09999999999991
  episode_reward_mean: -1.6679999999999888
  episode_reward_min: -28.500000000000053
  episodes_this_iter: 20
  episodes_total: 100
  experiment_id: 379eb3f5e10b4832a53f910be6d7c8ec
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 4.999999873689376e-05
          entropy: 1.2496124505996704
          entropy_coeff: 0.0
          kl: 0.02063153311610222
          model: {}
          policy_loss: -0.055548422038555145
          total_loss: 14.722307205200195
          vf_exp

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_f5b12_00000,RUNNING,192.168.0.102:34739,5,14.9618,20000,-1.668,32.1,-28.5,100


Result for PPO_MultiAgentArena_f5b12_00000:
  agent_timesteps_total: 28000
  custom_metrics:
    new_fields_discovered_max: 54
    new_fields_discovered_mean: 38.81
    new_fields_discovered_min: 27
    num_collisions_max: 11
    num_collisions_mean: 2.54
    num_collisions_min: 0
  date: 2021-05-25_16-46-54
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 23.99999999999995
  episode_reward_mean: 0.10500000000001433
  episode_reward_min: -19.500000000000014
  episodes_this_iter: 20
  episodes_total: 140
  experiment_id: 379eb3f5e10b4832a53f910be6d7c8ec
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 4.999999873689376e-05
          entropy: 1.1811387538909912
          entropy_coeff: 0.0
          kl: 0.019207734614610672
          model: {}
          policy_loss: -0.05731140077114105
          total_loss: 18.218515396118164
          vf_ex

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_f5b12_00000,RUNNING,192.168.0.102:34739,7,21.7398,28000,0.105,24,-19.5,100


Result for PPO_MultiAgentArena_f5b12_00000:
  agent_timesteps_total: 36000
  custom_metrics:
    new_fields_discovered_max: 52
    new_fields_discovered_mean: 39.05
    new_fields_discovered_min: 27
    num_collisions_max: 11
    num_collisions_mean: 2.37
    num_collisions_min: 0
  date: 2021-05-25_16-47-01
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 23.99999999999995
  episode_reward_mean: 0.3570000000000138
  episode_reward_min: -19.500000000000014
  episodes_this_iter: 20
  episodes_total: 180
  experiment_id: 379eb3f5e10b4832a53f910be6d7c8ec
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 4.999999873689376e-05
          entropy: 1.140802264213562
          entropy_coeff: 0.0
          kl: 0.018799131736159325
          model: {}
          policy_loss: -0.053859543055295944
          total_loss: 17.579652786254883
          vf_exp

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_f5b12_00000,RUNNING,192.168.0.102:34739,9,28.7998,36000,0.357,24,-19.5,100


Result for PPO_MultiAgentArena_f5b12_00000:
  agent_timesteps_total: 40000
  custom_metrics:
    new_fields_discovered_max: 52
    new_fields_discovered_mean: 39.68
    new_fields_discovered_min: 22
    num_collisions_max: 12
    num_collisions_mean: 2.76
    num_collisions_min: 0
  date: 2021-05-25_16-47-04
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 23.99999999999995
  episode_reward_mean: 1.5660000000000123
  episode_reward_min: -27.000000000000014
  episodes_this_iter: 20
  episodes_total: 200
  experiment_id: 379eb3f5e10b4832a53f910be6d7c8ec
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 4.999999873689376e-05
          entropy: 1.119084119796753
          entropy_coeff: 0.0
          kl: 0.018166445195674896
          model: {}
          policy_loss: -0.053664419800043106
          total_loss: 23.5438232421875
          vf_explai

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_f5b12_00000,TERMINATED,,10,31.7145,40000,1.566,24,-27,100


2021-05-25 16:47:04,553	INFO tune.py:549 -- Total run time: 47.96 seconds (47.51 seconds for the tuning loop).


<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7f93675849a0>

### A closer look at RLlib's APIs and structure

Earlier, we took a quick look inside an RLlib `Trainer` object and extracted its policy (or policies) and each policy's model (neural network). Here is a more detailed overview of what a `Trainer` object contains.

At the core is the `WorkerSet`, available as `Trainer.workers`. A `WorkerSet` is a group of `RolloutWorker` objects (defined in `rllib/evaluation/rollout_worker.py`) that always consists of one "local worker" (`Trainer.workers.local_worker()`) and n "remote workers" (`Trainer.workers.remote_workers()`).



<img src="images/rllib_structure.png" width=1000>

### Scaling RLlib

RLlib scales by parallelizing the "jobs" that the remote `RolloutWorkers` perform. In a vanilla RL algorithm such as PPO or DQN, the `@ray.remote`-labeled `RolloutWorkers` in the figure above are responsible for interacting with one or more environments and thereby collecting experiences. The environment produces an observation, the copy of the policy (or policies) located on the remote worker computes an action, and that action is sent to the environment to produce the next observation. This cycle repeats continuously and is interrupted only to send experience batches ("train batches") of a certain size to the "local worker". There, these batches are passed to `Policy.learn_on_batch()`, which computes the loss and updates the model weights. The updated weights are then broadcast back to all remote workers.
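
The sample, learn, and broadcast cycle described above can be sketched in plain Python. This is a toy illustration, not RLlib's implementation: `ToyPolicy`, `ToyRolloutWorker`, and `train_one_iteration` are hypothetical names, the "environment" is just a random number generator, the update rule is not a real RL loss, and the workers run sequentially here, whereas RLlib would run them in parallel as Ray actors.

```python
import random
from typing import Dict, List, Tuple

Experience = Tuple[float, int, float]  # (observation, action, reward)


class ToyPolicy:
    """Hypothetical stand-in for an RLlib Policy with a single weight."""

    def __init__(self) -> None:
        self.weights: Dict[str, float] = {"w": 0.0}

    def compute_action(self, obs: float) -> int:
        # Pick action 1 if the weighted observation is non-negative, else 0.
        return 1 if self.weights["w"] * obs >= 0.0 else 0

    def learn_on_batch(self, batch: List[Experience]) -> None:
        # Toy update (NOT a real RL loss): shift the weight by the
        # mean reward-weighted observation in the batch.
        grad = sum(reward * obs for obs, _, reward in batch) / len(batch)
        self.weights["w"] += 0.1 * grad

    def get_weights(self) -> Dict[str, float]:
        return dict(self.weights)

    def set_weights(self, weights: Dict[str, float]) -> None:
        self.weights = dict(weights)


class ToyRolloutWorker:
    """Hypothetical worker: owns a policy copy and a toy environment."""

    def __init__(self, seed: int) -> None:
        self.policy = ToyPolicy()
        self.rng = random.Random(seed)  # stands in for the environment

    def sample(self, num_steps: int) -> List[Experience]:
        batch = []
        for _ in range(num_steps):
            obs = self.rng.uniform(-1.0, 1.0)         # env produces an observation
            action = self.policy.compute_action(obs)  # worker's policy copy acts
            # Reward +1 if the action matches the sign of the observation.
            reward = 1.0 if (action == 1) == (obs > 0.0) else -1.0
            batch.append((obs, action, reward))
        return batch


def train_one_iteration(local_worker: ToyRolloutWorker,
                        remote_workers: List[ToyRolloutWorker],
                        steps_per_worker: int) -> int:
    # 1) Remote workers collect experiences (sequential in this sketch).
    train_batch: List[Experience] = []
    for worker in remote_workers:
        train_batch.extend(worker.sample(steps_per_worker))
    # 2) The local worker's policy learns on the combined train batch.
    local_worker.policy.learn_on_batch(train_batch)
    # 3) The updated weights are broadcast back to all remote workers.
    new_weights = local_worker.policy.get_weights()
    for worker in remote_workers:
        worker.policy.set_weights(new_weights)
    return len(train_batch)


local_worker = ToyRolloutWorker(seed=0)
remote_workers = [ToyRolloutWorker(seed=i + 1) for i in range(2)]
batch_size = train_one_iteration(local_worker, remote_workers, steps_per_worker=50)
print(batch_size, local_worker.policy.get_weights())
```

In this sketch, adding more remote workers grows the train batch collected per iteration, while learning still happens in one place. That is the property that lets experience collection scale out across processes or machines.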

