# Hands-on RL with Ray’s RLlib
## A beginner’s tutorial for working with multi-agent environments, models, and algorithms

<img src="images/pitfall.jpg" width=250> <img src="images/tesla.jpg" width=254> <img src="images/forklifts.jpg" width=169> <img src="images/robots.jpg" width=252> <img src="images/dota2.jpg" width=213>

### Overview
“Hands-on RL with Ray’s RLlib” is a beginner’s tutorial for working with reinforcement learning (RL) environments, models, and algorithms using Ray’s RLlib library. RLlib offers high scalability, a large list of algorithms to choose from (offline, model-based, model-free, etc.), support for TensorFlow and PyTorch, and a unified API for a variety of applications. This tutorial starts with a brief introduction to the core concepts (e.g. why RL) before proceeding to RLlib (multi- and single-agent) environments, neural network models, hyperparameter tuning, debugging, student exercises, Q/A, and more. All code will be provided as .py files in a GitHub repo.

### Intended Audience
* Python programmers who want to get started with reinforcement learning and RLlib.

### Prerequisites
* Some Python programming experience.
* Some familiarity with machine learning.
* *Helpful, but not required:* Experience in reinforcement learning and Ray.
* *Helpful, but not required:* Experience with TensorFlow or PyTorch.

### Requirements/Dependencies

Install conda (https://www.anaconda.com/products/individual)

Then ...

#### Quick `conda` setup instructions (Mac and Linux):
```
$ conda create -n rllib python=3.8
$ conda activate rllib
$ pip install ray[rllib]
$ pip install [tensorflow|torch]  # <- either one works!
$ pip install jupyterlab
```

#### Quick `conda` setup instructions (Win10):
```
$ conda create -n rllib python=3.8
$ conda activate rllib
$ pip install ray[rllib]
$ pip install [tensorflow|torch]  # <- either one works!
$ pip install jupyterlab
$ conda install pywin32
```

Also, for Win10 Atari support, we have to install atari_py from a different source (gym does not support Atari envs on Windows).

```
$ pip install git+https://github.com/Kojoley/atari-py.git
```

### Opening these tutorial files:
```
$ git clone https://github.com/sven1977/rllib_tutorials
$ cd rllib_tutorials
$ jupyter-lab
```

### Key Takeaways
* What is reinforcement learning and why RLlib?
* Core concepts of RLlib: Environments, Trainers, Policies, and Models.
* How to configure, hyperparameter-tune, and parallelize RLlib.
* RLlib debugging best practices.

### Tutorial Outline
1. RL and RLlib in a nutshell.
1. Defining an RL-solvable problem: Our first environment.
1. Exercise No.1 (env loop)
1. Picking an algorithm and training our first RLlib Trainer.
1. Configurations and hyperparameters - Easy tuning with Ray Tune.
1. Fixing our experiment's config - Going multi-agent.
1. The "infinite laptop": Quick intro into how to use RLlib with Anyscale's product.
1. Exercise No.2 (run your own Ray RLlib+Tune experiment)
1. Neural network models - Provide your custom models using tf.keras or torch.nn.
1. Deeper dive into RLlib's parallelization architecture.
1. Specifying different compute resources and parallelization options through our config.
1. "Hacking in": Using callbacks to customize the RL loop and generate our own metrics.
1. Exercise No.3 (write your own custom callback)
1. "Hacking in (part II)" - Debugging with RLlib and PyCharm.
1. Checking on the "infinite laptop" - Did RLlib learn to solve the problem?

### Other Recommended Readings
* [Attention Nets and More with RLlib's Trajectory View API](https://medium.com/distributed-computing-with-ray/attention-nets-and-more-with-rllibs-trajectory-view-api-d326339a6e65)
* [Intro to RLlib: Example Environments](https://medium.com/distributed-computing-with-ray/intro-to-rllib-example-environments-3a113f532c70)
* [Reinforcement Learning with RLlib in the Unity Game Engine](https://medium.com/distributed-computing-with-ray/reinforcement-learning-with-rllib-in-the-unity-game-engine-1a98080a7c0d)


<img src="images/rl-cycle.png" width=1200>

### Coding/defining our "problem" via an RL environment.

We will use the following (adversarial) multi-agent environment
throughout this tutorial to demonstrate a large fraction of RLlib's
APIs, features, and customization options.

<img src="images/environment.png" width=800>

### A word or two on Spaces:

Spaces are used in ML to describe what possible/valid values inputs and outputs of a neural network can have.

RL environments also use them to describe what their valid observations and actions are.

Spaces are usually defined by their shape (e.g. 84x84x3 RGB images) and datatype (e.g. uint8 for RGB values between 0 and 255).
However, spaces could also be composed of other spaces (see Tuple or Dict spaces) or could be simply discrete with n fixed possible values
(represented by integers). For example, in our game, where each agent can only go up/down/left/right, the action space would be "Discrete(4)"
(no datatype, no shape needs to be defined here).

<img src="images/spaces.png" width=800>
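The environment in this tutorial describes each agent's position as a single integer rather than a (row, column) pair, which is why its observation space is `MultiDiscrete([100, 100])` for a 10x10 grid. The following is a minimal sketch of that flattening and its inverse. The helper names `encode_pos` and `decode_pos` are made up for illustration and are not part of gym or RLlib.

```python
# Sketch: flatten a (row, col) grid position into one integer and back.
WIDTH, HEIGHT = 10, 10  # default arena size used in this tutorial


def encode_pos(row, col, width=WIDTH):
    """Map a (row, col) cell to a single index in [0, width * height)."""
    return row * width + col


def decode_pos(index, width=WIDTH):
    """Inverse of `encode_pos`: recover (row, col) from the flat index."""
    return divmod(index, width)


# Agent 1 starts in the top-left corner, agent 2 in the bottom-right one.
start1 = encode_pos(0, 0)
start2 = encode_pos(HEIGHT - 1, WIDTH - 1)
print(start1, start2)      # 0 99
print(decode_pos(start2))  # (9, 9)
```

With this encoding, one observation is simply two integers: the observing agent's own flat position followed by the other agent's.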

In [1]:
import gym
from gym.spaces import Discrete, MultiDiscrete
import numpy as np
import random

from ray.rllib.env.multi_agent_env import MultiAgentEnv


class MultiAgentArena(MultiAgentEnv):
    def __init__(self, config=None):
        # !LIVE CODING!
        #from environment import _init
        #_init(self, config)
        config = config or {}
        self.width = config.get("width", 10)
        self.height = config.get("height", 10)

        self.observation_space = MultiDiscrete([self.width * self.height,
                                                self.width * self.height])
        self.action_space = Discrete(4)
        
        self.timestep_limit = config.get("ts", 100)
        self.reset()

    def reset(self):
        # !LIVE CODING!
        #from environment import _reset
        #return _reset(self)
        self.agent1_pos = [0, 0]
        self.agent2_pos = [self.height - 1, self.width - 1]
        
        self.timesteps = 0
        
        self.agent1_visited_states = set()
        
        return self._get_obs()

    def step(self, action: dict):
        # !LIVE CODING!
        #from environment import _step
        #return _step(action)

        # increase our time steps counter by 1.
        self.timesteps += 1

        # Determine, who is allowed to move first (50:50).
        if random.random() > 0.5:
            events = self._move(self.agent1_pos, action["agent1"], is_agent1=True)
            events |= self._move(self.agent2_pos, action["agent2"], is_agent1=False)
        else:
            events = self._move(self.agent2_pos, action["agent2"], is_agent1=False)
            events |= self._move(self.agent1_pos, action["agent1"], is_agent1=True)

        # Determine rewards based on the collected events:
        rewards = {
            "agent1": -1.0 if "bumped" in events else 1.0 if "new" in events else -0.5,
            "agent2": 1.0 if "bumped" in events else -0.1,
        }
        # Get observations (based on new agent positions).
        obs = self._get_obs()

        # Generate a `done` dict (per-agent and total).
        # We are done only when we reach the time step limit.
        is_done = self.timesteps >= self.timestep_limit
        dones = {
            "agent1": is_done,
            "agent2": is_done,
            # special `__all__` key indicates that the episode is done for all agents.
            "__all__": is_done,
        }

        return obs, rewards, dones, {}

    def _get_obs(self):
        """
        Returns obs dict (agent name to discrete-pos tuple) using each
        agent's current x/y-positions.
        """
        ag1_discrete_pos = self.agent1_pos[0] * self.width + \
            (self.agent1_pos[1] % self.width)
        ag2_discrete_pos = self.agent2_pos[0] * self.width + \
            (self.agent2_pos[1] % self.width)
        return {
            "agent1": np.array([ag1_discrete_pos, ag2_discrete_pos]),
            "agent2": np.array([ag2_discrete_pos, ag1_discrete_pos]),
        }

    def _move(self, coords, action, is_agent1):
        """
        Moves an agent (agent1 iff is_agent1=True, else agent2) from `coords` (row/column) using the
        given action (0=up, 1=right, 2=down, 3=left) and returns the resulting set of events:
        Agent1: "new" when entering a new field. "bumped" when having been bumped into by agent2.
        Agent2: "bumped" when bumping into agent1 (agent1 then gets -1.0).
        """
        orig_coords = coords[:]
        # Change the row: 0=up (-1), 2=down (+1)
        coords[0] += -1 if action == 0 else 1 if action == 2 else 0
        # Change the column: 1=right (+1), 3=left (-1)
        coords[1] += 1 if action == 1 else -1 if action == 3 else 0

        # Solve collisions.
        # Make sure, we don't end up on the other agent's position.
        # If yes, don't move (we are blocked).
        if (is_agent1 and coords == self.agent2_pos) or (not is_agent1 and coords == self.agent1_pos):
            coords[0], coords[1] = orig_coords
            # Agent2 blocked agent1 (agent1 tried to run into agent2)
            # OR Agent2 bumped into agent1 (agent2 tried to run into agent1)
            return {"bumped"}

        # No agent blocking -> check walls.
        if coords[0] < 0:
            coords[0] = 0
        elif coords[0] >= self.height:
            coords[0] = self.height - 1
        if coords[1] < 0:
            coords[1] = 0
        elif coords[1] >= self.width:
            coords[1] = self.width - 1

        # If agent1 -> "new" if new tile covered.
        if is_agent1 and tuple(coords) not in self.agent1_visited_states:
            self.agent1_visited_states.add(tuple(coords))
            return {"new"}
        # No new tile for agent1.
        return set()

    # Optionally: Add `render` method returning some img.
    def render(self, mode=None):
        field_size = 40

        if not hasattr(self, "viewer"):
            from gym.envs.classic_control import rendering
            self.viewer = rendering.Viewer(400, 400)
            self.fields = {}
            # Add our grid, and the two agents to the viewer.
            for i in range(self.width):
                l = i * field_size
                r = l + field_size
                for j in range(self.height):
                    b = 400 - j * field_size - field_size
                    t = b + field_size
                    field = rendering.PolyLine([(l, b), (l, t), (r, t), (r, b)], close=True)
                    field.set_color(.0, .0, .0)
                    field.set_linewidth(1.0)
                    self.fields[(j, i)] = field
                    self.viewer.add_geom(field)
            
            agent1 = rendering.make_circle(radius=field_size // 2 - 4)
            agent1.set_color(.0, 0.8, 0.1)
            self.agent1_trans = rendering.Transform()
            agent1.add_attr(self.agent1_trans)
            agent2 = rendering.make_circle(radius=field_size // 2 - 4)
            agent2.set_color(.5, 0.1, 0.1)
            self.agent2_trans = rendering.Transform()
            agent2.add_attr(self.agent2_trans)
            self.viewer.add_geom(agent1)
            self.viewer.add_geom(agent2)

        # Mark those fields green that have been covered by agent1,
        # all others black.
        for i in range(self.width):
            for j in range(self.height):
                self.fields[(j, i)].set_color(.0, .0, .0)
                self.fields[(j, i)].set_linewidth(1.0)
        for (j, i) in self.agent1_visited_states:
            self.fields[(j, i)].set_color(.1, .5, .1)
            self.fields[(j, i)].set_linewidth(5.0)
        
        # Move both agents' circles to their current grid positions.
        self.agent1_trans.set_translation(self.agent1_pos[1] * field_size + field_size / 2, 400 - (self.agent1_pos[0] * field_size + field_size / 2))
        self.agent2_trans.set_translation(self.agent2_pos[1] * field_size + field_size / 2, 400 - (self.agent2_pos[0] * field_size + field_size / 2))

        return self.viewer.render(return_rgb_array=mode == 'rgb_array')




Instructions for updating:
non-resource variables are not supported in the long term
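The reward rule inside `step()` is compact but easy to misread because of the chained conditional expressions. As a sketch, the same rule can be written as a standalone function. The name `compute_rewards` is hypothetical; the reward values are copied from the class above.

```python
def compute_rewards(events):
    """Map the set of events collected in one step to per-agent rewards."""
    if "bumped" in events:
        r1 = -1.0   # agent1 collided with agent2
    elif "new" in events:
        r1 = 1.0    # agent1 entered a tile it had not visited yet
    else:
        r1 = -0.5   # agent1 stayed on already-visited ground
    # agent2 is only rewarded for collisions; every other step costs a bit.
    r2 = 1.0 if "bumped" in events else -0.1
    return {"agent1": r1, "agent2": r2}


print(compute_rewards({"new"}))            # {'agent1': 1.0, 'agent2': -0.1}
print(compute_rewards({"bumped", "new"}))  # {'agent1': -1.0, 'agent2': 1.0}
print(compute_rewards(set()))              # {'agent1': -0.5, 'agent2': -0.1}
```

Note that a collision takes priority over a newly covered tile. An agent2 that never collides collects 100 x -0.1 = -10 over a full 100-step episode.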


## Exercise No 1

<hr />

Write an "environment loop" using our `MultiAgentArena` class.

1. Create an env object.
1. `reset` your environment to get the first (initial) observation.
1. `step` through the environment using a provided
   "DummyTrainer.compute_action([obs])" method to compute action dicts (see cell below, in which you can create a DummyTrainer object and query it for random actions).
1. When an episode is done, remember to `reset()` your environment before the next call to `step()`.
1. If you feel this is way too easy for you ;) , try to extract each agent's reward, sum it up over one episode, and, at the end of an episode (when done=True), print out each agent's accumulated reward (also called "return").

**Good luck! :)**


In [2]:
class DummyTrainer:
    """Dummy Trainer class used in Exercise #1.

    Use its `compute_action` method to get a new action, given some environment
    observation.
    """

    def compute_action(self, obs=None):
        # Returns a random action.
        return {
            "agent1": np.random.randint(4),
            "agent2": np.random.randint(4)
        }

dummy_trainer = DummyTrainer()
# Check, whether it's working.
for _ in range(3):
    print(dummy_trainer.compute_action({"agent1": np.array([0, 10]), "agent2": np.array([10, 0])}))

{'agent1': 3, 'agent2': 2}
{'agent1': 2, 'agent2': 3}
{'agent1': 3, 'agent2': 1}


In [17]:
# Solution to Exercise #1
# !LIVE CODING!
# Note: Judging by its execution count, this cell was re-run later in the
# session with the trained `rllib_trainer` (defined further below).
# For the original exercise, compute `action = dummy_trainer.compute_action(obs)`
# and pass `action` directly to `env.step()`.
import time

env = MultiAgentArena(config={"width": 10, "height": 10})
obs = env.reset()
# Play through 10 episodes.
return_ag1 = return_ag2 = 0.0
num_episodes = 0
while num_episodes < 10:
    # Compute each agent's action separately from its own observation.
    action1 = rllib_trainer.compute_action(obs["agent1"])
    action2 = rllib_trainer.compute_action(obs["agent2"])

    obs, rewards, done, _ = env.step({"agent1": action1, "agent2": action2})
    return_ag1 += rewards["agent1"]
    return_ag2 += rewards["agent2"]
    if done["__all__"]:
        print(f"Episode done. R1={return_ag1} R2={return_ag2}")
        num_episodes += 1
        return_ag1 = return_ag2 = 0.0
        obs = env.reset()
    # Optional: render the arena and slow down so we can watch.
    env.render()
    time.sleep(0.05)

env.viewer.close()

Episode done. R1=34.5 R2=-4.499999999999997
Episode done. R1=35.5 R2=-9.99999999999998
Episode done. R1=33.5 R2=-5.599999999999999
Episode done. R1=27.5 R2=-8.89999999999998
Episode done. R1=21.5 R2=-2.3000000000000047
Episode done. R1=10.0 R2=-9.99999999999998
Episode done. R1=25.5 R2=-7.7999999999999865
Episode done. R1=31.5 R2=-7.799999999999981
Episode done. R1=29.5 R2=-9.99999999999998
Episode done. R1=37.0 R2=-6.699999999999985
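The long tails of digits in the returns above (e.g. `-9.99999999999998` instead of `-10.0`) are not a bug in the environment. They come from repeatedly adding `-0.1`, which has no exact binary floating-point representation. A small sketch, independent of RLlib, that reproduces the effect and shows one way to print a cleaner number:

```python
# Accumulate agent2's per-step penalty over a 100-step episode.
total = 0.0
for _ in range(100):
    total += -0.1

print(total == -10.0)   # False: rounding errors have accumulated
print(f"{total:.1f}")   # -10.0 (formatted for display only)
```

Formatting only changes what is printed; the stored value keeps its rounding error, which is harmless for training.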


In [6]:
# Now for something completely different:
# Plugging in RLlib!

import ray

# Start a new instance of Ray or connect to an already running one.
ray.init()  # Hear the engine humming? ;)

# In case you encounter the following error during our tutorial:
# RuntimeError: Maybe you called ray.init twice by accident?
# Try: ray.shutdown() or ray.init(ignore_reinit_error=True)

2021-05-05 19:20:39,642	INFO services.py:1262 -- View the Ray dashboard at http://127.0.0.1:8265


{'node_ip_address': '192.168.0.100',
 'raylet_ip_address': '192.168.0.100',
 'redis_address': '192.168.0.100:53537',
 'object_store_address': '/tmp/ray/session_2021-05-05_19-20-38_329023_80141/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-05-05_19-20-38_329023_80141/sockets/raylet',
 'webui_url': '127.0.0.1:8265',
 'session_dir': '/tmp/ray/session_2021-05-05_19-20-38_329023_80141',
 'metrics_export_port': 64927,
 'node_id': '1d45e6c38925e023e61aaeae0426d166a6188092f3f35dbeaa1c4572'}

<img src="images/rllib_algos.png" width=800>

https://docs.ray.io/en/master/rllib-algorithms.html#available-algorithms-overview

In [9]:
# Import a Trainable (one of RLlib's built-in algorithms):
# We use the PPO algorithm here because it is very flexible with respect to
# its supported action spaces and model types, and because it learns well on
# a wide range of problems.
from ray.rllib.agents.ppo import PPOTrainer

# Specify a very simple config, defining our environment and some environment
# options (see environment.py).
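config = {
    "env": MultiAgentArena,  # Or a string such as "my_env", if we previously registered the env with `tune.register_env("[name]", lambda config: [returns env object])`.
    # RLlib passes this dict as `config` to the environment's constructor,
    # so the keys must sit at the top level to be picked up by `config.get()`.
    "env_config": {
        "width": 10,
        "height": 10,
    },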
    # "framework": "torch",  # If users have chosen to install torch instead of tf.
    "create_env_on_driver": True,
}
# Instantiate the Trainer object using above config.
rllib_trainer = PPOTrainer(config=config)

(pid=80221) Instructions for updating:
(pid=80221) non-resource variables are not supported in the long term
(pid=80214) Instructions for updating:
(pid=80214) non-resource variables are not supported in the long term


In [10]:
# That's it, we are ready to train.
# Calling `train` once runs a single "training iteration". One iteration
# for most algos contains a) sampling from the environment(s) + b) using the
# sampled data (observations, actions taken, rewards) to update the policy
# model (neural network), such that it would pick better actions in the future,
# leading to higher rewards.
print(rllib_trainer.train())
# !LIVE CODING! (call and print out `trainer.train()`)

{'episode_reward_max': 17.09999999999993, 'episode_reward_min': -37.50000000000006, 'episode_reward_mean': -9.930000000000001, 'episode_len_mean': 100.0, 'episode_media': {}, 'episodes_this_iter': 20, 'policy_reward_min': {}, 'policy_reward_max': {}, 'policy_reward_mean': {}, 'custom_metrics': {}, 'hist_stats': {'episode_reward': [-22.500000000000014, -13.499999999999973, -13.499999999999991, 5.100000000000016, -21.600000000000012, 9.000000000000002, 17.09999999999993, -14.99999999999998, -37.50000000000006, 2.095545958979983e-14, 1.80000000000001, -31.20000000000003, -7.499999999999982, -20.69999999999999, -21.00000000000005, 6.9000000000000234, 10.50000000000003, -22.5, -1.4999999999999782, -20.999999999999996], 'episode_lengths': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]}, 'sampler_perf': {'mean_raw_obs_processing_ms': 0.13907651205758356, 'mean_inference_ms': 0.5487502514422833, 'mean_action_processing_ms': 0.05551127644328
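The dict returned by `train()` is large, so it helps to pull out just the entries of interest. The sketch below works on a hand-copied, rounded excerpt of the output above (not on a live trainer) and shows that, in this first iteration, the reported mean of about -9.93 is simply the average of the 20 episode returns listed under `hist_stats`.

```python
from statistics import mean

# Abbreviated copy of the result dict printed above (values rounded).
result = {
    "episode_reward_max": 17.1,
    "episode_reward_min": -37.5,
    "episode_len_mean": 100.0,
    "episodes_this_iter": 20,
    "hist_stats": {
        "episode_reward": [
            -22.5, -13.5, -13.5, 5.1, -21.6, 9.0, 17.1, -15.0, -37.5, 0.0,
            1.8, -31.2, -7.5, -20.7, -21.0, 6.9, 10.5, -22.5, -1.5, -21.0,
        ],
    },
}

rewards = result["hist_stats"]["episode_reward"]
print(f"episodes: {len(rewards)}")                      # episodes: 20
print(f"mean reward: {mean(rewards):.2f}")              # mean reward: -9.93
print(f"best / worst: {max(rewards)} / {min(rewards)}")  # best / worst: 17.1 / -37.5
```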

In [18]:
# Run `train()` n times. Try to repeatedly call this to see rewards increase.
# Move on once you see episode rewards of 15.0 or more.
for _ in range(10):
    results = rllib_trainer.train()
    print(f"Iteration={rllib_trainer.iteration}: R={results['episode_reward_mean']}")

Iteration=42: R=18.050999999999934
Iteration=43: R=17.570999999999934
Iteration=44: R=17.006999999999934
Iteration=45: R=16.880999999999936
Iteration=46: R=17.420999999999935
Iteration=47: R=17.79299999999993
Iteration=48: R=18.701999999999927
Iteration=49: R=20.165999999999926
Iteration=50: R=21.245999999999917
Iteration=51: R=21.614999999999913
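Instead of re-running the cell by hand until the reward crosses 15.0, the loop can check the threshold itself. This is a sketch only: `train_until` is a hypothetical helper, and `FakeTrainer` is a stand-in with made-up reward values so that the example runs without Ray. Any object whose `train()` returns a dict with an `episode_reward_mean` key would work in its place.

```python
def train_until(trainer, target_reward, max_iters=100):
    """Call `trainer.train()` until the mean episode reward reaches the target.

    Returns the last result dict. Gives up after `max_iters` calls.
    """
    result = {}
    for _ in range(max_iters):
        result = trainer.train()
        if result["episode_reward_mean"] >= target_reward:
            break
    return result


class FakeTrainer:
    """Hypothetical stand-in so the sketch runs without Ray installed."""

    def __init__(self):
        self.iteration = 0

    def train(self):
        self.iteration += 1
        # Pretend the reward improves by 2.0 per iteration, starting at -8.0.
        return {"episode_reward_mean": -10.0 + 2.0 * self.iteration}


fake = FakeTrainer()
final = train_until(fake, target_reward=15.0)
print(fake.iteration, final["episode_reward_mean"])  # 13 16.0
```

The `max_iters` cap matters in practice: a badly configured run may never reach the target.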


In [20]:
# !LIVE CODING!
# Use the above solution of Exercise #1 and replace our `dummy_trainer`
# with the already trained `rllib_trainer`.
# Note to self: Make sure you are computing actions for agent1 and agent2 separately!
rllib_trainer

<ray.rllib.agents.trainer_template.PPO at 0x7fd13898aa00>

In [21]:
# !LIVE CODING!
# Let's actually "look inside" our Trainer to see what's in there.

# We can get the policy inside the Trainer like so:
pol = rllib_trainer.get_policy()
print(f"Policy: {pol}")
# The Policy object has an observation space and an action space.
print(f"Observation-space: {pol.observation_space}")
print(f"Action-space: {pol.action_space}")
# It also comes with an action distribution class ...
print(f"Action distribution class: {pol.dist_class}")
# ... and a (neural network) model.
print(f"Model: {pol.model}")

# Create a fake numpy B=1 (single) observation consisting of both agents' positions ("one-hot'd" and "concat'd").
from ray.rllib.utils.numpy import one_hot
single_obs = np.concatenate([one_hot(0, depth=100), one_hot(99, depth=100)])
single_obs = np.array([single_obs])
#single_obs.shape

# Generate the Model's output.
out, state_out = pol.model({"obs": single_obs})

# tf1.x (static graph) -> Need to run this through a tf session.
numpy_out = pol.get_session().run(out)

# RLlib then passes the model's output to the policy's "action distribution" to sample an action.
action_dist = pol.dist_class(out)
action = action_dist.sample()

# Show us the actual action.
pol.get_session().run(action)

Policy: <ray.rllib.policy.tf_policy_template.PPOTFPolicy object at 0x7fd1389994c0>
Observation-space: Box(-1.0, 1.0, (200,), float32)
Action-space: Discrete(4)
Action distribution class: <class 'ray.rllib.models.tf.tf_action_dist.Categorical'>
Model: <ray.rllib.models.tf.fcnet.FullyConnectedNetwork object at 0x7fd1389994f0>


array([2])
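Two steps in the cell above are worth seeing without any framework: how a `MultiDiscrete([100, 100])` observation turns into the 200-dimensional vector the policy reports as its observation space, and how a categorical action distribution picks an action from the model's raw outputs. The following is a plain-Python sketch of both ideas. It is not RLlib's implementation, and the logits are made-up numbers.

```python
import math
import random


def one_hot(index, depth):
    """Return a list of `depth` zeros with a single 1.0 at `index`."""
    vec = [0.0] * depth
    vec[index] = 1.0
    return vec


def softmax(logits):
    """Turn raw model outputs (logits) into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_categorical(logits, rng=random):
    """Draw one action index with probability softmax(logits)[index]."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Observation [0, 99]: one one-hot vector per position, concatenated.
flat_obs = one_hot(0, 100) + one_hot(99, 100)
print(len(flat_obs), sum(flat_obs))  # 200 2.0

# Made-up logits for the 4 actions (a real model would compute these).
logits = [0.1, -0.3, 2.0, 0.5]
print(sample_categorical(logits))  # one of 0, 1, 2, 3 (most often 2)
```

Sampling, rather than always taking the highest-scoring action, is what lets the policy keep exploring during training.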

In [22]:
# Currently, `rllib_trainer` is in an already trained state.
# It holds optimized weights in its Policy's model that allow it to act
# already somewhat smart in our environment when given an observation.

# If we closed this notebook, all the effort would have been for nothing.
# Let's save the state of our trainer to disk for later!
checkpoint_path = rllib_trainer.save()
print(f"Trainer (at iteration {rllib_trainer.iteration}) was saved in '{checkpoint_path}'!")

# Here is what a checkpoint directory contains:
print("The checkpoint directory contains the following files:")
import os
os.listdir(os.path.dirname(checkpoint_path))

Trainer (at iteration 51) was saved in '/Users/sven/ray_results/PPO_MultiAgentArena_2021-05-05_19-28-48gg1cmrvr/checkpoint_000051/checkpoint-51'!
The checkpoint directory contains the following files:


['checkpoint-51.tune_metadata', 'checkpoint-51', '.is_checkpoint']

In [23]:
# Pretend we wanted to pick up training from a previous run.
# Note: `_evaluate()` relies on evaluation workers, which (as far as we can
# tell) RLlib only creates when `evaluation_interval` is set. Without it,
# the call fails with:
# AttributeError: 'PPO' object has no attribute 'evaluation_workers'
new_trainer = PPOTrainer(config=dict(config, evaluation_interval=1))
# Evaluate the new trainer (this should yield random results).
results = new_trainer._evaluate()
print(f"Evaluating new trainer: R={results['evaluation']['episode_reward_mean']}")

# Restoring the trained state into the `new_trainer` object.
new_trainer.restore(checkpoint_path)

# Evaluate again (this should yield results we saw after having trained our saved agent).
results = new_trainer._evaluate()
print(f"Evaluating restored trainer: R={results['evaluation']['episode_reward_mean']}")

(pid=80220) Instructions for updating:
(pid=80220) non-resource variables are not supported in the long term
(pid=80224) Instructions for updating:
(pid=80224) non-resource variables are not supported in the long term

In [25]:
# 5) Configuration dicts and Ray Tune.
# Where are the default configuration dicts stored?
import pprint

# PPO algorithm:
from ray.rllib.agents.ppo import DEFAULT_CONFIG as PPO_DEFAULT_CONFIG
print("PPO's default config is:")
pprint.pprint(PPO_DEFAULT_CONFIG)

# DQN algorithm:
#from ray.rllib.agents.dqn import DEFAULT_CONFIG as DQN_DEFAULT_CONFIG
#print(f"DQN's default config is:")
#pprint.pprint(DQN_DEFAULT_CONFIG)

# Common (all algorithms).
#from ray.rllib.agents.trainer import COMMON_CONFIG
#print(f"RLlib Trainer's default config is:")
#pprint.pprint(COMMON_CONFIG)

PPO's default config is:
{'_fake_gpus': False,
 'batch_mode': 'truncate_episodes',
 'callbacks': <class 'ray.rllib.agents.callbacks.DefaultCallbacks'>,
 'clip_actions': True,
 'clip_param': 0.3,
 'clip_rewards': None,
 'collect_metrics_timeout': 180,
 'compress_observations': False,
 'create_env_on_driver': False,
 'custom_eval_function': None,
 'custom_resources_per_worker': {},
 'eager_tracing': False,
 'entropy_coeff': 0.0,
 'entropy_coeff_schedule': None,
 'env': None,
 'env_config': {},
 'evaluation_config': {},
 'evaluation_interval': None,
 'evaluation_num_episodes': 10,
 'evaluation_num_workers': 0,
 'evaluation_parallel_to_training': False,
 'exploration_config': {'type': 'StochasticSampling'},
 'explore': True,
 'extra_python_environs_for_driver': {},
 'extra_python_environs_for_worker': {},
 'fake_sampler': False,
 'framework': 'tf',
 'gamma': 0.99,
 'grad_clip': None,
 'horizon': None,
 'ignore_worker_failures': False,
 'in_evaluation': False,
 'input': 'sampler',
 'input_e
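A user config only needs to list the keys that differ from these defaults; everything else is filled in. Conceptually this is a recursive dictionary merge. The sketch below illustrates the idea with a hypothetical `deep_merge` helper and a handful of entries copied from the output above. RLlib's own merge logic may differ in its details (for example in how it treats certain nested keys), so treat this as an illustration rather than a description of the library's internals.

```python
import copy


def deep_merge(defaults, overrides):
    """Return a copy of `defaults` with `overrides` applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Otherwise the user's value replaces the default.
            merged[key] = value
    return merged


# A few entries copied from the default config printed above.
defaults = {
    "gamma": 0.99,
    "clip_param": 0.3,
    "framework": "tf",
    "exploration_config": {"type": "StochasticSampling"},
    "env_config": {},
}
user_config = {"framework": "torch", "env_config": {"width": 10}}

effective = deep_merge(defaults, user_config)
print(effective["framework"], effective["gamma"], effective["env_config"])
# torch 0.99 {'width': 10}
```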

In [26]:
# Plugging in Ray Tune.
# Note that this is the recommended way to run any experiments with RLlib.
# Reasons:
# - Tune allows you to do hyperparameter tuning in a user-friendly way
#   and at large scale!
# - Tune automatically allocates needed resources for the different
#   hyperparam trials and experiment runs.

from ray import tune

# Now that we will run things "automatically" through tune, we have to
# define one or more stopping criteria.
stop = {
    # Keys can be any metric found in the result dict printed earlier
    # (the return value of `trainer.train()`).
    "training_iteration": 5,
    "episode_reward_mean": 9999.9,
}

# "PPO" is a registered name that points to RLlib's PPOTrainer.
# See `ray/rllib/agents/registry.py`
# Run our simple experiment until one of the stop criteria is met.
tune.run("PPO", config=config, stop=stop)


Trial name,status,loc
PPO_MultiAgentArena_8cbc4_00000,PENDING,


(pid=80219) Instructions for updating:
(pid=80219) non-resource variables are not supported in the long term
(pid=80219) 2021-05-05 19:51:47,841	INFO trainer.py:648 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
(pid=80219) 2021-05-05 19:51:47,841	INFO trainer.py:673 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=80216) Instructions for updating:

Result for PPO_MultiAgentArena_8cbc4_00000:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-05-05_19-51-58
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 25.499999999999932
  episode_reward_mean: -7.56
  episode_reward_min: -39.00000000000006
  episodes_this_iter: 20
  episodes_total: 20
  experiment_id: 4c18df4ded4147309ab290fd322e57b7
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.368713617324829
          entropy_coeff: 0.0
          kl: 0.0179552361369133
          model: {}
          policy_loss: -0.056025221943855286
          total_loss: 32.07113265991211
          vf_explained_var: 0.10238191485404968
          vf_loss: 32.123565673828125
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained: 4000
  iteratio

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_8cbc4_00000,RUNNING,192.168.0.100:80219,1,3.15855,4000,-7.56,25.5,-39,100


Result for PPO_MultiAgentArena_8cbc4_00000:
  agent_timesteps_total: 12000
  custom_metrics: {}
  date: 2021-05-05_19-52-03
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 25.499999999999932
  episode_reward_mean: -4.2649999999999935
  episode_reward_min: -39.00000000000006
  episodes_this_iter: 20
  episodes_total: 60
  experiment_id: 4c18df4ded4147309ab290fd322e57b7
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 4.999999873689376e-05
          entropy: 1.3033363819122314
          entropy_coeff: 0.0
          kl: 0.018584633246064186
          model: {}
          policy_loss: -0.055112652480602264
          total_loss: 12.647034645080566
          vf_explained_var: 0.3117234408855438
          vf_loss: 12.696571350097656
    num_agent_steps_sampled: 12000
    num_agent_steps_trained: 12000
    num_steps_sampled: 12000
    num_steps_tra

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_8cbc4_00000,RUNNING,192.168.0.100:80219,3,8.89383,12000,-4.265,25.5,-39,100


Result for PPO_MultiAgentArena_8cbc4_00000:
  agent_timesteps_total: 20000
  custom_metrics: {}
  date: 2021-05-05_19-52-10
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 25.499999999999932
  episode_reward_mean: -2.6009999999999946
  episode_reward_min: -39.00000000000006
  episodes_this_iter: 20
  episodes_total: 100
  experiment_id: 4c18df4ded4147309ab290fd322e57b7
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 4.999999873689376e-05
          entropy: 1.2405627965927124
          entropy_coeff: 0.0
          kl: 0.01944688893854618
          model: {}
          policy_loss: -0.05826034024357796
          total_loss: 21.072965621948242
          vf_explained_var: 0.38389959931373596
          vf_loss: 21.122478485107422
    num_agent_steps_sampled: 20000
    num_agent_steps_trained: 20000
    num_steps_sampled: 20000
    num_steps_trai

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_8cbc4_00000,TERMINATED,,5,15.0082,20000,-2.601,25.5,-39,100




2021-05-05 19:52:10,559	INFO tune.py:549 -- Total run time: 28.48 seconds (27.98 seconds for the tuning loop).


<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7fd1411b2fd0>
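The run above ended after 5 iterations because the `training_iteration` criterion was reached first; the reward threshold of 9999.9 was deliberately out of reach. A trial stops as soon as any one of the listed metrics reaches its threshold. The sketch below mimics that check with a hypothetical `should_stop` function, using reward values from the status tables above. It is an illustration of the rule, not Tune's actual code.

```python
def should_stop(result, stop):
    """True as soon as any metric listed in `stop` has reached its threshold."""
    return any(
        key in result and result[key] >= threshold
        for key, threshold in stop.items()
    )


stop = {"training_iteration": 5, "episode_reward_mean": 9999.9}

after_iter_3 = {"training_iteration": 3, "episode_reward_mean": -4.265}
after_iter_5 = {"training_iteration": 5, "episode_reward_mean": -2.601}
print(should_stop(after_iter_3, stop))  # False
print(should_stop(after_iter_5, stop))  # True
```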

In [27]:
# Updating an algo's default config dict and adding hyperparameter tuning
# options to it.
# Note: Hyperparameter tuning options (e.g. grid_search) will only work,
# if we run these configs via `tune.run`.
config.update(
    {
        # Try 2 different learning rates.
        "lr": tune.grid_search([0.0001, 0.5]),
        # NN model config to tweak the default model
        # that'll be created by RLlib for the policy.
        "model": {
            # e.g. change the dense layer stack.
            "fcnet_hiddens": [256, 256, 256],
            # Alternatively, you can specify a custom model here
            # (we'll cover that later).
            # "custom_model": ...
            # Pass kwargs to your custom model.
            # "custom_model_config": {}
        },
    }
)
# Repeat our experiment using tune's grid-search feature.
results = tune.run(
    "PPO",
    config=config,
    stop=stop,
    checkpoint_at_end=True,  # create a checkpoint when done.
    checkpoint_freq=1,  # create a checkpoint on every iteration.
)
print(results)


Trial name,status,loc,lr
PPO_MultiAgentArena_1664e_00000,PENDING,,0.0001
PPO_MultiAgentArena_1664e_00001,PENDING,,0.5


[2m[36m(pid=80217)[0m Instructions for updating:
[2m[36m(pid=80217)[0m non-resource variables are not supported in the long term
[2m[36m(pid=80217)[0m Instructions for updating:
[2m[36m(pid=80217)[0m non-resource variables are not supported in the long term
[2m[36m(pid=80213)[0m Instructions for updating:
[2m[36m(pid=80213)[0m non-resource variables are not supported in the long term
[2m[36m(pid=80213)[0m Instructions for updating:
[2m[36m(pid=80213)[0m non-resource variables are not supported in the long term
[2m[36m(pid=80217)[0m 2021-05-05 19:55:38,860	INFO trainer.py:648 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=80217)[0m 2021-05-05 19:55:38,861	INFO trainer.py:673 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=80217)[0m 2021-05-05 19:55:38,860	INFO trainer.py:648 -- Tip: set framework=tfe or the --eager flag to enable

Result for PPO_MultiAgentArena_1664e_00001:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-05-05_19-55-53
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 18.59999999999993
  episode_reward_mean: -10.185000000000004
  episode_reward_min: -31.50000000000003
  episodes_this_iter: 20
  episodes_total: 20
  experiment_id: b88594800f1f4574b1da6c71be580020
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.5
          entropy: 0.12736791372299194
          entropy_coeff: 0.0
          kl: 12.461297035217285
          model: {}
          policy_loss: 0.38490647077560425
          total_loss: 33.12248992919922
          vf_explained_var: -0.002359993988648057
          vf_loss: 30.24532699584961
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained: 4000
  iterations_s

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_1664e_00000,RUNNING,,0.0001,,,,,,,
PPO_MultiAgentArena_1664e_00001,RUNNING,192.168.0.100:80217,0.5,1.0,5.08742,4000.0,-10.185,18.6,-31.5,100.0


Result for PPO_MultiAgentArena_1664e_00000:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-05-05_19-55-53
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 8.40000000000002
  episode_reward_mean: -9.794999999999996
  episode_reward_min: -41.10000000000006
  episodes_this_iter: 20
  episodes_total: 20
  experiment_id: 34d9273bb9e2421bb7bdb8ccaef970c2
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 1.3498280048370361
          entropy_coeff: 0.0
          kl: 0.03733210638165474
          model: {}
          policy_loss: -0.0692160502076149
          total_loss: 25.37936782836914
          vf_explained_var: 0.1313427984714508
          vf_loss: 25.44111442565918
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_steps_sampled: 4000
    num_steps_trained: 4000
 

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_1664e_00000,RUNNING,192.168.0.100:80213,0.0001,2,10.012,8000,-7.83,9.6,-41.1,100
PPO_MultiAgentArena_1664e_00001,RUNNING,192.168.0.100:80217,0.5,2,10.0161,8000,-34.3425,18.6,-58.5,100


Result for PPO_MultiAgentArena_1664e_00001:
  agent_timesteps_total: 12000
  custom_metrics: {}
  date: 2021-05-05_19-56-03
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 18.59999999999993
  episode_reward_mean: -42.39500000000006
  episode_reward_min: -58.50000000000009
  episodes_this_iter: 20
  episodes_total: 60
  experiment_id: b88594800f1f4574b1da6c71be580020
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.15000000596046448
          cur_lr: 0.5
          entropy: 0.0006342960405163467
          entropy_coeff: 0.0
          kl: 8.398860518354923e-05
          model: {}
          policy_loss: 0.0012262860545888543
          total_loss: 104.77656555175781
          vf_explained_var: 0.010487637482583523
          vf_loss: 104.77530670166016
    num_agent_steps_sampled: 12000
    num_agent_steps_trained: 12000
    num_steps_sampled: 12000
    num_steps_trained: 12000
  

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_1664e_00000,RUNNING,192.168.0.100:80213,0.0001,2,10.012,8000,-7.83,9.6,-41.1,100
PPO_MultiAgentArena_1664e_00001,RUNNING,192.168.0.100:80217,0.5,3,15.0243,12000,-42.395,18.6,-58.5,100


Result for PPO_MultiAgentArena_1664e_00000:
  agent_timesteps_total: 12000
  custom_metrics: {}
  date: 2021-05-05_19-56-03
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 20.999999999999968
  episode_reward_mean: -5.309999999999993
  episode_reward_min: -41.10000000000006
  episodes_this_iter: 20
  episodes_total: 60
  experiment_id: 34d9273bb9e2421bb7bdb8ccaef970c2
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 9.999999747378752e-05
          entropy: 1.2728362083435059
          entropy_coeff: 0.0
          kl: 0.028103772550821304
          model: {}
          policy_loss: -0.06841445714235306
          total_loss: 15.778899192810059
          vf_explained_var: 0.2877601385116577
          vf_loss: 15.834668159484863
    num_agent_steps_sampled: 12000
    num_agent_steps_trained: 12000
    num_steps_sampled: 12000
    num_steps_train

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_1664e_00000,RUNNING,192.168.0.100:80213,0.0001,3,15.032,12000,-5.31,21.0,-41.1,100
PPO_MultiAgentArena_1664e_00001,RUNNING,192.168.0.100:80217,0.5,4,20.3794,16000,-46.4213,18.6,-58.5,100


Result for PPO_MultiAgentArena_1664e_00000:
  agent_timesteps_total: 16000
  custom_metrics: {}
  date: 2021-05-05_19-56-09
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 20.999999999999968
  episode_reward_mean: -3.573749999999991
  episode_reward_min: -41.10000000000006
  episodes_this_iter: 20
  episodes_total: 80
  experiment_id: 34d9273bb9e2421bb7bdb8ccaef970c2
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.675000011920929
          cur_lr: 9.999999747378752e-05
          entropy: 1.2485978603363037
          entropy_coeff: 0.0
          kl: 0.02224148064851761
          model: {}
          policy_loss: -0.06387993693351746
          total_loss: 14.692970275878906
          vf_explained_var: 0.3112991154193878
          vf_loss: 14.741838455200195
    num_agent_steps_sampled: 16000
    num_agent_steps_trained: 16000
    num_steps_sampled: 16000
    num_steps_trained:

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_1664e_00000,RUNNING,192.168.0.100:80213,0.0001,4,20.371,16000,-3.57375,21.0,-41.1,100
PPO_MultiAgentArena_1664e_00001,RUNNING,192.168.0.100:80217,0.5,5,25.7632,20000,-48.837,18.6,-58.5,100


Result for PPO_MultiAgentArena_1664e_00000:
  agent_timesteps_total: 20000
  custom_metrics: {}
  date: 2021-05-05_19-56-14
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 20.999999999999968
  episode_reward_mean: -2.1719999999999913
  episode_reward_min: -41.10000000000006
  episodes_this_iter: 20
  episodes_total: 100
  experiment_id: 34d9273bb9e2421bb7bdb8ccaef970c2
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 1.2218016386032104
          entropy_coeff: 0.0
          kl: 0.016137147322297096
          model: {}
          policy_loss: -0.057371288537979126
          total_loss: 13.115107536315918
          vf_explained_var: 0.3720450699329376
          vf_loss: 13.156139373779297
    num_agent_steps_sampled: 20000
    num_agent_steps_trained: 20000
    num_steps_sampled: 20000
    num_steps_trai

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_1664e_00000,TERMINATED,,0.0001,5,25.7507,20000,-2.172,21.0,-41.1,100
PPO_MultiAgentArena_1664e_00001,TERMINATED,,0.5,5,25.7632,20000,-48.837,18.6,-58.5,100


2021-05-05 19:56:15,324	INFO tune.py:549 -- Total run time: 42.27 seconds (41.66 seconds for the tuning loop).


<ray.tune.analysis.experiment_analysis.ExperimentAnalysis object at 0x7fd128ee4bb0>
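
The final trial table above already tells us which learning rate did better. As a minimal sketch, the snippet below parses that table (copied verbatim from the output) and picks the row with the highest `reward`, which corresponds to the last reported `episode_reward_mean`. The helper name `best_trial` is made up for this illustration. With the `ExperimentAnalysis` object returned by `tune.run`, a call such as `results.get_best_trial("episode_reward_mean", mode="max")` is likely the more direct route, assuming your installed Ray version provides it.

```python
import csv
import io

# Final trial table, copied from the grid-search output above.
FINAL_TABLE = """\
Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_1664e_00000,TERMINATED,,0.0001,5,25.7507,20000,-2.172,21.0,-41.1,100
PPO_MultiAgentArena_1664e_00001,TERMINATED,,0.5,5,25.7632,20000,-48.837,18.6,-58.5,100
"""


def best_trial(table_csv: str, metric: str = "reward") -> dict:
    """Return the table row with the highest value in the `metric` column."""
    rows = list(csv.DictReader(io.StringIO(table_csv)))
    return max(rows, key=lambda row: float(row[metric]))


best = best_trial(FINAL_TABLE)
print(best["Trial name"], "lr =", best["lr"], "reward =", best["reward"])
```

Here the trial with `lr=0.0001` wins by a wide margin; the `lr=0.5` trial's mean reward dropped from -10.2 to -48.8 over its five iterations.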


In [28]:
# Going multi-policy:

# Our experiment is ill-configured because both agents learn the same
# policy, even though they should behave differently due to their different
# tasks and reward functions. That shared policy is the "default_policy",
# which RLlib always provides if you don't configure anything else.
# (Remember that RLlib does not know at Trainer setup time how many and
# which agents the environment will "produce".)
# Let's fix this and introduce the "multiagent" API.

# 6.1.) Define an agent->policy mapping function.
# Which agents (defined by the environment) use which policies
# (defined by us)? Mapping is M (agents) -> N (policies), where M >= N.
def policy_mapping_fn(agent: str):
    assert agent in ["agent1", "agent2"], f"ERROR: invalid agent {agent}!"
    return "pol1" if agent == "agent1" else "pol2"
    
# 6.2.) Define details for our two policies.
#TODO: coding Sven: Make it possible to not need obs/action spaces
#  if they are the default anyways.
observation_space = rllib_trainer.workers.local_worker().env.observation_space
action_space = rllib_trainer.workers.local_worker().env.action_space
# Btw, the above is equivalent to reading the spaces off the default policy:
# >>> rllib_trainer.get_policy("default_policy").observation_space
# >>> rllib_trainer.get_policy("default_policy").action_space
policies = {
    "pol1": (None, observation_space, action_space, {"lr": 0.0003}),
    "pol2": (None, observation_space, action_space, {"lr": 0.0004}),
}

#policies_to_train = ["pol1", "pol2"]

# 6.3.) Add the above to our config.
config.update({
    "multiagent": {
        "policies": policies,
        "policy_mapping_fn": policy_mapping_fn,
        #"policies_to_train": policies_to_train,
    },
})
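
To make the agent-to-policy mapping and the per-policy overrides more concrete, here is a small, Ray-free sketch. It assumes that the fourth element of each `policies` tuple is merged on top of the trainer-level config, so that per-policy values win. `base_config`, `policy_overrides`, and `effective_config` are hypothetical names used only for this illustration and are not part of RLlib's API.

```python
# Trainer-level settings; only the keys relevant here are shown (example values).
base_config = {"lr": 0.0005, "train_batch_size": 3000}

# Per-policy overrides, mirroring the 4th tuple element of `policies` above.
policy_overrides = {
    "pol1": {"lr": 0.0003},
    "pol2": {"lr": 0.0004},
}


def policy_mapping_fn(agent: str) -> str:
    """Same mapping as above: agent1 -> pol1, agent2 -> pol2."""
    assert agent in ["agent1", "agent2"], f"ERROR: invalid agent {agent}!"
    return "pol1" if agent == "agent1" else "pol2"


def effective_config(agent: str) -> dict:
    """Assumed merge rule: per-policy overrides take precedence over base settings."""
    policy_id = policy_mapping_fn(agent)
    return {**base_config, **policy_overrides[policy_id]}


for agent in ["agent1", "agent2"]:
    print(agent, "->", policy_mapping_fn(agent), effective_config(agent))
```

The `cur_lr` values logged for `pol1` and `pol2` in the later runs (0.0003 and 0.0004) are consistent with this merge rule. If it holds, a top-level `lr` setting, including a `grid_search` over it, would not reach either policy unless the per-policy `lr` entries are removed.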


## Exercise No 2

<hr />

Try to learn our environment using Ray's `tune.run` and a simple hyperparameter `grid_search` over:
- 2 different learning rates (pick your own values).
- AND 2 different `train_batch_size` settings (use 2000 and 3000).

Also, make RLlib use a `[128, 128]` dense layer stack as the NN model.

In addition, use the config setting `num_envs_per_worker=10` to increase the sampling throughput.

If your local machine has fewer than 12 CPUs, try setting `num_workers=1` so that more (ideally all) tune trials can run at the same time.
Background: PPO uses 2 workers by default, so 1 trial uses 3 CPUs (2 workers + the "driver", also called the "local worker"),
and the entire experiment (4 trials) uses 12 CPUs. If Tune cannot allocate enough CPUs at once, it runs the trials in sequence
(which is also fine, but takes longer).
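
The trial count and CPU budget from the background note can be checked with a few lines of plain Python. This is only a sketch of the arithmetic, assuming one CPU per worker and one for the driver. The learning rates are placeholders, and `expand_grid` and `cpus_needed` are hypothetical helpers, not Tune functions.

```python
from itertools import product

# Grid-search axes from the exercise (learning rates are placeholders).
grid = {
    "lr": [0.0001, 0.0005],
    "train_batch_size": [2000, 3000],
}


def expand_grid(grid: dict) -> list:
    """Cartesian product of all axes: one config dict per trial."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]


def cpus_needed(num_trials: int, num_workers: int) -> int:
    """Each trial needs its rollout workers plus one driver ("local worker")."""
    return num_trials * (num_workers + 1)


trials = expand_grid(grid)
print(len(trials), "trials")                                 # 4 trials
print("num_workers=2:", cpus_needed(len(trials), 2), "CPUs")  # 12 CPUs
print("num_workers=1:", cpus_needed(len(trials), 1), "CPUs")  # 8 CPUs
```

So with `num_workers=1`, all four trials fit on a machine with 8 CPUs under these assumptions.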

Try to reach a total reward (sum of agent1 and agent2) of 15.0.

**Good luck! :)**


In [33]:
# Solution to Exercise #2
# !LIVE CODING!

# Update our config and set it up for two tune grid-searches
# (2 x 2 values, leading to 4 trials in total).
config.update({
    "lr": tune.grid_search([0.0001, 0.0005]),
    "train_batch_size": tune.grid_search([2000, 3000]),
    "num_envs_per_worker": 10,
    # Change our model to be simpler.
    "model": {
        "fcnet_hiddens": [128, 128],
    },
})

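# Run the experiment. Each trial stops as soon as one of the criteria is met:
# a mean episode reward of 15.0 or 100 training iterations.
tune.run(
    "PPO",
    config=config,
    stop={"episode_reward_mean": 15.0, "training_iteration": 100},
)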


Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_cb9b0_00000,PENDING,,0.0001,2000
PPO_MultiAgentArena_cb9b0_00001,PENDING,,0.0005,2000
PPO_MultiAgentArena_cb9b0_00002,PENDING,,0.0001,3000
PPO_MultiAgentArena_cb9b0_00003,PENDING,,0.0005,3000


[2m[36m(pid=80480)[0m Instructions for updating:
[2m[36m(pid=80480)[0m non-resource variables are not supported in the long term
[2m[36m(pid=80480)[0m Instructions for updating:
[2m[36m(pid=80480)[0m non-resource variables are not supported in the long term
[2m[36m(pid=80485)[0m Instructions for updating:
[2m[36m(pid=80485)[0m non-resource variables are not supported in the long term
[2m[36m(pid=80485)[0m Instructions for updating:
[2m[36m(pid=80485)[0m non-resource variables are not supported in the long term
[2m[36m(pid=80483)[0m Instructions for updating:
[2m[36m(pid=80483)[0m non-resource variables are not supported in the long term
[2m[36m(pid=80483)[0m Instructions for updating:
[2m[36m(pid=80483)[0m non-resource variables are not supported in the long term
[2m[36m(pid=80482)[0m Instructions for updating:
[2m[36m(pid=80482)[0m non-resource variables are not supported in the long term
[2m[36m(pid=80482)[0m Instructions for updating:
[2

Result for PPO_MultiAgentArena_cb9b0_00003:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-05-05_20-01-01
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 7.500000000000025
  episode_reward_mean: -9.637499999999998
  episode_reward_min: -43.500000000000064
  episodes_this_iter: 40
  episodes_total: 40
  experiment_id: 1af777aec03e4ac1950c31804df8af64
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.0003000000142492354
          entropy: 1.3486239910125732
          entropy_coeff: 0.0
          kl: 0.038723256438970566
          model: {}
          policy_loss: -0.06343455612659454
          total_loss: 35.706642150878906
          vf_explained_var: 0.12014700472354889
          vf_loss: 35.762332916259766
      pol2:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.00039999998989515007
        

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,,0.0001,2000,,,,,,,
PPO_MultiAgentArena_cb9b0_00001,RUNNING,,0.0005,2000,,,,,,,
PPO_MultiAgentArena_cb9b0_00002,RUNNING,,0.0001,3000,,,,,,,
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,1.0,7.40753,4000.0,-9.6375,7.5,-43.5,100.0


Result for PPO_MultiAgentArena_cb9b0_00002:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-05-05_20-01-01
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 8.400000000000032
  episode_reward_mean: -17.07750000000001
  episode_reward_min: -37.50000000000005
  episodes_this_iter: 40
  episodes_total: 40
  experiment_id: d7bdf14eedaa445a9267af681b70299f
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.0003000000142492354
          entropy: 1.3484574556350708
          entropy_coeff: 0.0
          kl: 0.03962294012308121
          model: {}
          policy_loss: -0.07537916302680969
          total_loss: 38.138641357421875
          vf_explained_var: 0.18260008096694946
          vf_loss: 38.20609664916992
      pol2:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.00039999998989515007
          e

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,1,7.48566,4000,-6.645,23.4,-27.9,100
PPO_MultiAgentArena_cb9b0_00001,RUNNING,192.168.0.100:80480,0.0005,2000,1,7.51113,4000,-6.6375,25.2,-31.5,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,1,7.46906,4000,-17.0775,8.4,-37.5,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,2,14.7351,8000,-5.205,16.5,-43.5,100


Result for PPO_MultiAgentArena_cb9b0_00001:
  agent_timesteps_total: 16000
  custom_metrics: {}
  date: 2021-05-05_20-01-08
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 25.199999999999932
  episode_reward_mean: -5.013749999999993
  episode_reward_min: -31.50000000000007
  episodes_this_iter: 40
  episodes_total: 80
  experiment_id: 3953b7f839aa4f4fb4db97e5dbf30e49
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 0.0003000000142492354
          entropy: 1.2987873554229736
          entropy_coeff: 0.0
          kl: 0.04003700241446495
          model: {}
          policy_loss: -0.08468735963106155
          total_loss: 28.290122985839844
          vf_explained_var: 0.18354330956935883
          vf_loss: 28.36280059814453
      pol2:
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 0.00039999998989515007
         

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,2,14.7719,8000,-4.24125,23.4,-30.0,100
PPO_MultiAgentArena_cb9b0_00001,RUNNING,192.168.0.100:80480,0.0005,2000,2,14.7471,8000,-5.01375,25.2,-31.5,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,2,14.9206,8000,-9.20625,21.6,-37.5,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,3,21.5962,12000,-1.815,16.5,-33.3,100


Result for PPO_MultiAgentArena_cb9b0_00000:
  agent_timesteps_total: 24000
  custom_metrics: {}
  date: 2021-05-05_20-01-15
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 12.899999999999974
  episode_reward_mean: -2.378999999999989
  episode_reward_min: -30.000000000000036
  episodes_this_iter: 40
  episodes_total: 120
  experiment_id: 476320c3f885462b9e54ae95d0c60e70
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 0.0003000000142492354
          entropy: 1.2661799192428589
          entropy_coeff: 0.0
          kl: 0.02801455371081829
          model: {}
          policy_loss: -0.06668344885110855
          total_loss: 22.77505874633789
          vf_explained_var: 0.30767640471458435
          vf_loss: 22.829133987426758
      pol2:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 0.00039999998989515007
       

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,3,21.5581,12000,-2.379,12.9,-30.0,100
PPO_MultiAgentArena_cb9b0_00001,RUNNING,192.168.0.100:80480,0.0005,2000,3,21.6369,12000,-3.354,19.2,-28.5,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,3,21.6351,12000,-4.65,21.6,-37.5,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,4,28.4017,16000,1.773,24.6,-22.5,100


Result for PPO_MultiAgentArena_cb9b0_00002:
  agent_timesteps_total: 32000
  custom_metrics: {}
  date: 2021-05-05_20-01-22
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 16.799999999999955
  episode_reward_mean: -0.34199999999999037
  episode_reward_min: -31.500000000000043
  episodes_this_iter: 40
  episodes_total: 160
  experiment_id: d7bdf14eedaa445a9267af681b70299f
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 0.675000011920929
          cur_lr: 0.0003000000142492354
          entropy: 1.2238572835922241
          entropy_coeff: 0.0
          kl: 0.025144267827272415
          model: {}
          policy_loss: -0.0658692792057991
          total_loss: 29.964096069335938
          vf_explained_var: 0.28109848499298096
          vf_loss: 30.01299285888672
      pol2:
        learner_stats:
          cur_kl_coeff: 0.675000011920929
          cur_lr: 0.00039999998989515007
         

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,4,28.3893,16000,-0.606,17.1,-19.5,100
PPO_MultiAgentArena_cb9b0_00001,RUNNING,192.168.0.100:80480,0.0005,2000,4,28.4304,16000,-1.281,19.2,-28.5,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,4,28.414,16000,-0.342,16.8,-31.5,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,5,35.5823,20000,2.514,24.6,-18.0,100


Result for PPO_MultiAgentArena_cb9b0_00001:
  agent_timesteps_total: 40000
  custom_metrics: {}
  date: 2021-05-05_20-01-29
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 18.299999999999933
  episode_reward_mean: 1.4880000000000104
  episode_reward_min: -18.90000000000002
  episodes_this_iter: 40
  episodes_total: 200
  experiment_id: 3953b7f839aa4f4fb4db97e5dbf30e49
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.2245558500289917
          entropy_coeff: 0.0
          kl: 0.018586529418826103
          model: {}
          policy_loss: -0.06037398427724838
          total_loss: 27.91238021850586
          vf_explained_var: 0.34945371747016907
          vf_loss: 27.953933715820312
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
         

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,5,35.5312,20000,0.219,21.0,-39.0,100
PPO_MultiAgentArena_cb9b0_00001,RUNNING,192.168.0.100:80480,0.0005,2000,5,35.4717,20000,1.488,18.3,-18.9,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,6,42.3994,24000,2.166,18.6,-21.0,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,5,35.5823,20000,2.514,24.6,-18.0,100


Result for PPO_MultiAgentArena_cb9b0_00003:
  agent_timesteps_total: 48000
  custom_metrics: {}
  date: 2021-05-05_20-01-36
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 24.599999999999945
  episode_reward_mean: 3.5850000000000124
  episode_reward_min: -17.99999999999999
  episodes_this_iter: 40
  episodes_total: 240
  experiment_id: 1af777aec03e4ac1950c31804df8af64
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.1979690790176392
          entropy_coeff: 0.0
          kl: 0.017934303730726242
          model: {}
          policy_loss: -0.059876009821891785
          total_loss: 27.669910430908203
          vf_explained_var: 0.35560882091522217
          vf_loss: 27.711627960205078
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
       

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,6,42.3557,24000,1.824,21.0,-39.0,100
PPO_MultiAgentArena_cb9b0_00001,RUNNING,192.168.0.100:80480,0.0005,2000,6,42.3881,24000,2.694,26.1,-18.9,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,7,49.1699,28000,2.667,24.0,-21.0,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,6,42.5366,24000,3.585,24.6,-18.0,100


Result for PPO_MultiAgentArena_cb9b0_00003:
  agent_timesteps_total: 56000
  custom_metrics: {}
  date: 2021-05-05_20-01-43
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 32.99999999999991
  episode_reward_mean: 3.903000000000008
  episode_reward_min: -16.499999999999986
  episodes_this_iter: 40
  episodes_total: 280
  experiment_id: 1af777aec03e4ac1950c31804df8af64
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.196864128112793
          entropy_coeff: 0.0
          kl: 0.01645570620894432
          model: {}
          policy_loss: -0.057790931314229965
          total_loss: 37.81082534790039
          vf_explained_var: 0.38023316860198975
          vf_loss: 37.851959228515625
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          e

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,7,49.2289,28000,2.307,24.6,-19.5,100
PPO_MultiAgentArena_cb9b0_00001,RUNNING,192.168.0.100:80480,0.0005,2000,7,49.2239,28000,1.965,26.1,-25.5,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,8,56.206,32000,3.03,29.4,-21.0,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,7,49.3821,28000,3.903,33.0,-16.5,100


Result for PPO_MultiAgentArena_cb9b0_00000:
  agent_timesteps_total: 64000
  custom_metrics: {}
  date: 2021-05-05_20-01-50
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 24.599999999999945
  episode_reward_mean: 2.9520000000000106
  episode_reward_min: -19.499999999999986
  episodes_this_iter: 40
  episodes_total: 320
  experiment_id: 476320c3f885462b9e54ae95d0c60e70
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.1686789989471436
          entropy_coeff: 0.0
          kl: 0.01762952283024788
          model: {}
          policy_loss: -0.06268365681171417
          total_loss: 22.67396354675293
          vf_explained_var: 0.44134843349456787
          vf_loss: 22.71879768371582
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,8,56.1212,32000,2.952,24.6,-19.5,100
PPO_MultiAgentArena_cb9b0_00001,RUNNING,192.168.0.100:80480,0.0005,2000,8,56.1268,32000,2.838,20.4,-25.5,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,9,62.8826,36000,2.952,29.4,-16.8,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,8,56.3408,32000,4.854,33.0,-16.5,100


Result for PPO_MultiAgentArena_cb9b0_00003:
  agent_timesteps_total: 72000
  custom_metrics: {}
  date: 2021-05-05_20-01-57
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 32.99999999999991
  episode_reward_mean: 5.612999999999999
  episode_reward_min: -21.0
  episodes_this_iter: 40
  episodes_total: 360
  experiment_id: 1af777aec03e4ac1950c31804df8af64
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.1554176807403564
          entropy_coeff: 0.0
          kl: 0.018537860363721848
          model: {}
          policy_loss: -0.06271504610776901
          total_loss: 37.18227767944336
          vf_explained_var: 0.30555063486099243
          vf_loss: 37.22622299194336
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          entropy: 1.1625

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,9,62.8391,36000,3.078,24.6,-19.5,100
PPO_MultiAgentArena_cb9b0_00001,RUNNING,192.168.0.100:80480,0.0005,2000,9,62.7801,36000,3.849,31.2,-18.9,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,9,62.8826,36000,2.952,29.4,-16.8,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,10,69.8523,40000,7.233,26.4,-21.0,100


Result for PPO_MultiAgentArena_cb9b0_00002:
  agent_timesteps_total: 80000
  custom_metrics: {}
  date: 2021-05-05_20-02-04
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 24.599999999999973
  episode_reward_mean: 4.1400000000000095
  episode_reward_min: -15.899999999999993
  episodes_this_iter: 40
  episodes_total: 400
  experiment_id: d7bdf14eedaa445a9267af681b70299f
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.1392654180526733
          entropy_coeff: 0.0
          kl: 0.01658860594034195
          model: {}
          policy_loss: -0.04870859533548355
          total_loss: 25.632568359375
          vf_explained_var: 0.38717877864837646
          vf_loss: 25.664478302001953
      pol2:
        learner_stats:
          cur_kl_coeff: 1.5187499523162842
          cur_lr: 0.00039999998989515007
          e

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,10,69.6861,40000,3.456,21.3,-19.5,100
PPO_MultiAgentArena_cb9b0_00001,RUNNING,192.168.0.100:80480,0.0005,2000,10,69.5841,40000,5.31,31.2,-18.9,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,11,78.2186,44000,3.642,23.1,-19.5,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,10,69.8523,40000,7.233,26.4,-21.0,100


Result for PPO_MultiAgentArena_cb9b0_00003:
  agent_timesteps_total: 88000
  custom_metrics: {}
  date: 2021-05-05_20-02-13
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 26.999999999999908
  episode_reward_mean: 8.669999999999993
  episode_reward_min: -20.999999999999993
  episodes_this_iter: 40
  episodes_total: 440
  experiment_id: 1af777aec03e4ac1950c31804df8af64
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.1069669723510742
          entropy_coeff: 0.0
          kl: 0.016530446708202362
          model: {}
          policy_loss: -0.05600178614258766
          total_loss: 35.22227478027344
          vf_explained_var: 0.30403196811676025
          vf_loss: 35.26153564453125
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,11,78.0819,44000,3.423,22.8,-20.4,100
PPO_MultiAgentArena_cb9b0_00001,RUNNING,192.168.0.100:80480,0.0005,2000,11,77.9821,44000,6.996,27.3,-18.9,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,12,85.8698,48000,4.977,23.1,-19.5,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,11,78.2993,44000,8.67,27.0,-21.0,100


Result for PPO_MultiAgentArena_cb9b0_00003:
  agent_timesteps_total: 96000
  custom_metrics: {}
  date: 2021-05-05_20-02-20
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 34.49999999999992
  episode_reward_mean: 9.494999999999989
  episode_reward_min: -8.999999999999975
  episodes_this_iter: 40
  episodes_total: 480
  experiment_id: 1af777aec03e4ac1950c31804df8af64
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.0975826978683472
          entropy_coeff: 0.0
          kl: 0.018019065260887146
          model: {}
          policy_loss: -0.05512842535972595
          total_loss: 40.95720672607422
          vf_explained_var: 0.33136841654777527
          vf_loss: 40.99409484863281
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          en

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,12,85.8323,48000,4.665,22.8,-20.4,100
PPO_MultiAgentArena_cb9b0_00001,RUNNING,192.168.0.100:80480,0.0005,2000,12,85.7676,48000,8.718,32.1,-14.7,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,12,85.8698,48000,4.977,23.1,-19.5,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,13,93.2849,52000,10.233,34.5,-9.0,100


Result for PPO_MultiAgentArena_cb9b0_00002:
  agent_timesteps_total: 104000
  custom_metrics: {}
  date: 2021-05-05_20-02-28
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 24.599999999999923
  episode_reward_mean: 6.033000000000002
  episode_reward_min: -11.999999999999979
  episodes_this_iter: 40
  episodes_total: 520
  experiment_id: d7bdf14eedaa445a9267af681b70299f
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.0590343475341797
          entropy_coeff: 0.0
          kl: 0.015824107453227043
          model: {}
          policy_loss: -0.05223042517900467
          total_loss: 32.82256317138672
          vf_explained_var: 0.35390713810920715
          vf_loss: 32.858768463134766
      pol2:
        learner_stats:
          cur_kl_coeff: 1.5187499523162842
          cur_lr: 0.00039999998989515007
        

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,13,93.0854,52000,5.382,30.6,-20.4,100
PPO_MultiAgentArena_cb9b0_00001,RUNNING,192.168.0.100:80480,0.0005,2000,13,93.111,52000,9.234,35.4,-15.0,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,13,93.2845,52000,6.033,24.6,-12.0,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,14,100.239,56000,11.208,37.5,-9.0,100


Result for PPO_MultiAgentArena_cb9b0_00002:
  agent_timesteps_total: 112000
  custom_metrics: {}
  date: 2021-05-05_20-02-35
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 31.499999999999922
  episode_reward_mean: 7.997999999999994
  episode_reward_min: -10.49999999999998
  episodes_this_iter: 40
  episodes_total: 560
  experiment_id: d7bdf14eedaa445a9267af681b70299f
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.0541672706604004
          entropy_coeff: 0.0
          kl: 0.016092058271169662
          model: {}
          policy_loss: -0.04986872151494026
          total_loss: 34.061737060546875
          vf_explained_var: 0.4095231890678406
          vf_loss: 34.095314025878906
      pol2:
        learner_stats:
          cur_kl_coeff: 1.5187499523162842
          cur_lr: 0.00039999998989515007
         

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,15,107.185,60000,7.605,28.5,-16.5,100
PPO_MultiAgentArena_cb9b0_00001,RUNNING,192.168.0.100:80480,0.0005,2000,14,100.071,56000,10.98,35.4,-15.6,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,14,100.223,56000,7.998,31.5,-10.5,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,14,100.239,56000,11.208,37.5,-9.0,100


Result for PPO_MultiAgentArena_cb9b0_00003:
  agent_timesteps_total: 120000
  custom_metrics: {}
  date: 2021-05-05_20-02-42
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 37.49999999999991
  episode_reward_mean: 13.067999999999966
  episode_reward_min: -10.199999999999983
  episodes_this_iter: 40
  episodes_total: 600
  experiment_id: 1af777aec03e4ac1950c31804df8af64
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.0266005992889404
          entropy_coeff: 0.0
          kl: 0.01582265831530094
          model: {}
          policy_loss: -0.051514022052288055
          total_loss: 39.734100341796875
          vf_explained_var: 0.38242244720458984
          vf_loss: 39.76959228515625
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
        

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,16,114.132,64000,8.802,29.7,-9.6,100
PPO_MultiAgentArena_cb9b0_00001,RUNNING,192.168.0.100:80480,0.0005,2000,15,107.289,60000,12.729,32.1,-15.6,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,15,107.485,60000,9.546,31.5,-9.0,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,15,107.566,60000,13.068,37.5,-10.2,100


Result for PPO_MultiAgentArena_cb9b0_00001:
  agent_timesteps_total: 128000
  custom_metrics: {}
  date: 2021-05-05_20-02-49
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 32.09999999999994
  episode_reward_mean: 13.802999999999967
  episode_reward_min: -15.599999999999993
  episodes_this_iter: 40
  episodes_total: 640
  experiment_id: 3953b7f839aa4f4fb4db97e5dbf30e49
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.9753525853157043
          entropy_coeff: 0.0
          kl: 0.016500258818268776
          model: {}
          policy_loss: -0.049142371863126755
          total_loss: 36.821136474609375
          vf_explained_var: 0.4418574571609497
          vf_loss: 36.85356903076172
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
        

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,17,121.331,68000,8.925,29.7,-10.8,100
PPO_MultiAgentArena_cb9b0_00001,RUNNING,192.168.0.100:80480,0.0005,2000,16,114.079,64000,13.803,32.1,-15.6,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,16,114.414,64000,10.074,35.1,-8.4,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,16,114.519,64000,13.86,40.2,-10.2,100


Result for PPO_MultiAgentArena_cb9b0_00001:
  agent_timesteps_total: 136000
  custom_metrics: {}
  date: 2021-05-05_20-02-56
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 31.799999999999912
  episode_reward_mean: 15.446999999999962
  episode_reward_min: -5.999999999999998
  episodes_this_iter: 40
  episodes_total: 680
  experiment_id: 3953b7f839aa4f4fb4db97e5dbf30e49
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.9616758823394775
          entropy_coeff: 0.0
          kl: 0.015210719779133797
          model: {}
          policy_loss: -0.04672204703092575
          total_loss: 39.0045051574707
          vf_explained_var: 0.4549851715564728
          vf_loss: 39.03582763671875
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          e

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,18,128.245,72000,9.99,29.7,-10.8,100
PPO_MultiAgentArena_cb9b0_00001,RUNNING,192.168.0.100:80480,0.0005,2000,17,121.262,68000,15.447,31.8,-6.0,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,17,121.678,68000,12.207,38.7,-6.6,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,17,121.758,68000,15.099,40.2,-13.2,100


Result for PPO_MultiAgentArena_cb9b0_00001:
  agent_timesteps_total: 144000
  custom_metrics: {}
  date: 2021-05-05_20-03-03
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 33.5999999999999
  episode_reward_mean: 16.54499999999996
  episode_reward_min: -6.599999999999982
  episodes_this_iter: 40
  episodes_total: 720
  experiment_id: 3953b7f839aa4f4fb4db97e5dbf30e49
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.9522113800048828
          entropy_coeff: 0.0
          kl: 0.01667182706296444
          model: {}
          policy_loss: -0.05492858961224556
          total_loss: 33.16084289550781
          vf_explained_var: 0.5019678473472595
          vf_loss: 33.198890686035156
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          ent

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,18,128.245,72000,9.99,29.7,-10.8,100
PPO_MultiAgentArena_cb9b0_00001,RUNNING,192.168.0.100:80480,0.0005,2000,18,128.117,72000,16.545,33.6,-6.6,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,19,135.342,76000,14.871,34.5,-9.0,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,18,128.641,72000,14.709,30.0,-13.2,100


Result for PPO_MultiAgentArena_cb9b0_00003:
  agent_timesteps_total: 152000
  custom_metrics: {}
  date: 2021-05-05_20-03-10
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 38.699999999999925
  episode_reward_mean: 16.283999999999946
  episode_reward_min: -13.199999999999976
  episodes_this_iter: 40
  episodes_total: 760
  experiment_id: 1af777aec03e4ac1950c31804df8af64
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.9326965808868408
          entropy_coeff: 0.0
          kl: 0.014160866849124432
          model: {}
          policy_loss: -0.04371863603591919
          total_loss: 44.559326171875
          vf_explained_var: 0.3980241119861603
          vf_loss: 44.588706970214844
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,19,135.252,76000,10.686,37.8,-10.8,100
PPO_MultiAgentArena_cb9b0_00001,RUNNING,192.168.0.100:80480,0.0005,2000,19,135.042,76000,17.166,38.1,-6.6,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,20,142.35,80000,15.135,33.0,-9.0,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,19,135.382,76000,16.284,38.7,-13.2,100


Result for PPO_MultiAgentArena_cb9b0_00003:
  agent_timesteps_total: 160000
  custom_metrics: {}
  date: 2021-05-05_20-03-17
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 38.699999999999925
  episode_reward_mean: 16.343999999999948
  episode_reward_min: -14.999999999999982
  episodes_this_iter: 40
  episodes_total: 800
  experiment_id: 1af777aec03e4ac1950c31804df8af64
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.945174515247345
          entropy_coeff: 0.0
          kl: 0.015016630291938782
          model: {}
          policy_loss: -0.05658692121505737
          total_loss: 43.63652420043945
          vf_explained_var: 0.38146549463272095
          vf_loss: 43.67790603637695
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
         

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,20,142.272,80000,11.361,37.8,-6.0,100
PPO_MultiAgentArena_cb9b0_00001,RUNNING,192.168.0.100:80480,0.0005,2000,20,141.975,80000,18.363,39.6,-4.8,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,21,149.686,84000,16.377,33.6,-7.5,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,20,142.397,80000,16.344,38.7,-15.0,100


Result for PPO_MultiAgentArena_cb9b0_00003:
  agent_timesteps_total: 168000
  custom_metrics: {}
  date: 2021-05-05_20-03-25
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 32.39999999999993
  episode_reward_mean: 16.88999999999994
  episode_reward_min: -14.999999999999982
  episodes_this_iter: 40
  episodes_total: 840
  experiment_id: 1af777aec03e4ac1950c31804df8af64
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.9267206192016602
          entropy_coeff: 0.0
          kl: 0.013508189469575882
          model: {}
          policy_loss: -0.036311075091362
          total_loss: 39.247928619384766
          vf_explained_var: 0.43722862005233765
          vf_loss: 39.27056121826172
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          e

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,21,149.682,84000,11.064,30.6,-7.5,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,22,155.986,88000,17.076,34.5,9.52016e-15,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,21,149.705,84000,16.89,32.4,-15.0,100
PPO_MultiAgentArena_cb9b0_00001,TERMINATED,,0.0005,2000,21,149.375,84000,20.856,39.6,0.6,100


Result for PPO_MultiAgentArena_cb9b0_00003:
  agent_timesteps_total: 176000
  custom_metrics: {}
  date: 2021-05-05_20-03-31
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 32.39999999999993
  episode_reward_mean: 16.772999999999943
  episode_reward_min: -19.499999999999993
  episodes_this_iter: 40
  episodes_total: 880
  experiment_id: 1af777aec03e4ac1950c31804df8af64
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.9183897972106934
          entropy_coeff: 0.0
          kl: 0.014347020536661148
          model: {}
          policy_loss: -0.041683379560709
          total_loss: 43.516143798828125
          vf_explained_var: 0.3513191044330597
          vf_loss: 43.543304443359375
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,22,155.869,88000,12.18,31.2,-7.5,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,23,161.846,92000,17.94,34.5,-4.5,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,22,155.946,88000,16.773,32.4,-19.5,100
PPO_MultiAgentArena_cb9b0_00001,TERMINATED,,0.0005,2000,21,149.375,84000,20.856,39.6,0.6,100


Result for PPO_MultiAgentArena_cb9b0_00000:
  agent_timesteps_total: 184000
  custom_metrics: {}
  date: 2021-05-05_20-03-37
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 31.199999999999903
  episode_reward_mean: 14.168999999999961
  episode_reward_min: -9.299999999999978
  episodes_this_iter: 40
  episodes_total: 920
  experiment_id: 476320c3f885462b9e54ae95d0c60e70
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.9342248439788818
          entropy_coeff: 0.0
          kl: 0.01682404801249504
          model: {}
          policy_loss: -0.05160912498831749
          total_loss: 39.154258728027344
          vf_explained_var: 0.4929683208465576
          vf_loss: 39.18883514404297
      pol2:
        learner_stats:
          cur_kl_coeff: 1.5187499523162842
          cur_lr: 0.00039999998989515007
          

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,23,161.731,92000,14.169,31.2,-9.3,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,24,167.696,96000,19.533,38.7,-4.5,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,23,161.816,92000,17.001,30.9,-19.5,100
PPO_MultiAgentArena_cb9b0_00001,TERMINATED,,0.0005,2000,21,149.375,84000,20.856,39.6,0.6,100


Result for PPO_MultiAgentArena_cb9b0_00000:
  agent_timesteps_total: 192000
  custom_metrics: {}
  date: 2021-05-05_20-03-43
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 32.99999999999994
  episode_reward_mean: 14.351999999999968
  episode_reward_min: -9.299999999999978
  episodes_this_iter: 40
  episodes_total: 960
  experiment_id: 476320c3f885462b9e54ae95d0c60e70
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.9320690631866455
          entropy_coeff: 0.0
          kl: 0.015933865681290627
          model: {}
          policy_loss: -0.05446206033229828
          total_loss: 39.71845245361328
          vf_explained_var: 0.39550215005874634
          vf_loss: 39.75678253173828
      pol2:
        learner_stats:
          cur_kl_coeff: 1.5187499523162842
          cur_lr: 0.00039999998989515007
          

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,25,172.89,100000,14.313,34.8,-3.9,100
PPO_MultiAgentArena_cb9b0_00002,RUNNING,192.168.0.100:80485,0.0001,3000,24,167.696,96000,19.533,38.7,-4.5,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,24,167.67,96000,17.922,32.7,-19.5,100
PPO_MultiAgentArena_cb9b0_00001,TERMINATED,,0.0005,2000,21,149.375,84000,20.856,39.6,0.6,100


Result for PPO_MultiAgentArena_cb9b0_00002:
  agent_timesteps_total: 200000
  custom_metrics: {}
  date: 2021-05-05_20-03-48
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 38.69999999999991
  episode_reward_mean: 20.396999999999938
  episode_reward_min: -4.499999999999977
  episodes_this_iter: 40
  episodes_total: 1000
  experiment_id: d7bdf14eedaa445a9267af681b70299f
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.8556892275810242
          entropy_coeff: 0.0
          kl: 0.014373543672263622
          model: {}
          policy_loss: -0.04934139922261238
          total_loss: 42.87417221069336
          vf_explained_var: 0.4056709408760071
          vf_loss: 42.90896224975586
      pol2:
        learner_stats:
          cur_kl_coeff: 1.5187499523162842
          cur_lr: 0.00039999998989515007
          e

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,27,181.948,108000,16.65,40.5,-11.7,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,26,177.532,104000,18.972,37.8,0.6,100
PPO_MultiAgentArena_cb9b0_00001,TERMINATED,,0.0005,2000,21,149.375,84000,20.856,39.6,0.6,100
PPO_MultiAgentArena_cb9b0_00002,TERMINATED,,0.0001,3000,25,173.138,100000,20.397,38.7,-4.5,100


Result for PPO_MultiAgentArena_cb9b0_00003:
  agent_timesteps_total: 216000
  custom_metrics: {}
  date: 2021-05-05_20-03-58
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 41.099999999999916
  episode_reward_mean: 19.226999999999936
  episode_reward_min: -5.699999999999984
  episodes_this_iter: 40
  episodes_total: 1080
  experiment_id: 1af777aec03e4ac1950c31804df8af64
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.8516989946365356
          entropy_coeff: 0.0
          kl: 0.013366274535655975
          model: {}
          policy_loss: -0.03211168944835663
          total_loss: 45.97438049316406
          vf_explained_var: 0.40258970856666565
          vf_loss: 45.992958068847656
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
       

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,29,191.942,116000,16.404,37.2,-11.7,100
PPO_MultiAgentArena_cb9b0_00003,RUNNING,192.168.0.100:80482,0.0005,3000,28,186.684,112000,19.755,41.1,-5.7,100
PPO_MultiAgentArena_cb9b0_00001,TERMINATED,,0.0005,2000,21,149.375,84000,20.856,39.6,0.6,100
PPO_MultiAgentArena_cb9b0_00002,TERMINATED,,0.0001,3000,25,173.138,100000,20.397,38.7,-4.5,100


Result for PPO_MultiAgentArena_cb9b0_00003:
  agent_timesteps_total: 232000
  custom_metrics: {}
  date: 2021-05-05_20-04-08
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 39.59999999999991
  episode_reward_mean: 20.477999999999934
  episode_reward_min: -5.699999999999984
  episodes_this_iter: 40
  episodes_total: 1160
  experiment_id: 1af777aec03e4ac1950c31804df8af64
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.8317999839782715
          entropy_coeff: 0.0
          kl: 0.013249723240733147
          model: {}
          policy_loss: -0.037187956273555756
          total_loss: 40.497474670410156
          vf_explained_var: 0.3932393789291382
          vf_loss: 40.52124786376953
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
         

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,RUNNING,192.168.0.100:80483,0.0001,2000,31,199.592,124000,17.604,38.4,-7.5,100
PPO_MultiAgentArena_cb9b0_00001,TERMINATED,,0.0005,2000,21,149.375,84000,20.856,39.6,0.6,100
PPO_MultiAgentArena_cb9b0_00002,TERMINATED,,0.0001,3000,25,173.138,100000,20.397,38.7,-4.5,100
PPO_MultiAgentArena_cb9b0_00003,TERMINATED,,0.0005,3000,29,192.106,116000,20.478,39.6,-5.7,100


Result for PPO_MultiAgentArena_cb9b0_00000:
  agent_timesteps_total: 264000
  custom_metrics: {}
  date: 2021-05-05_20-04-22
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 43.79999999999991
  episode_reward_mean: 21.317999999999934
  episode_reward_min: -7.79999999999999
  episodes_this_iter: 40
  episodes_total: 1320
  experiment_id: 476320c3f885462b9e54ae95d0c60e70
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.7630941271781921
          entropy_coeff: 0.0
          kl: 0.014042488299310207
          model: {}
          policy_loss: -0.04581364244222641
          total_loss: 56.02145004272461
          vf_explained_var: 0.38919752836227417
          vf_loss: 56.05304718017578
      pol2:
        learner_stats:
          cur_kl_coeff: 1.5187499523162842
          cur_lr: 0.00039999998989515007
          e

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_cb9b0_00000,TERMINATED,,0.0001,2000,33,206.231,132000,21.318,43.8,-7.8,100
PPO_MultiAgentArena_cb9b0_00001,TERMINATED,,0.0005,2000,21,149.375,84000,20.856,39.6,0.6,100
PPO_MultiAgentArena_cb9b0_00002,TERMINATED,,0.0001,3000,25,173.138,100000,20.397,38.7,-4.5,100
PPO_MultiAgentArena_cb9b0_00003,TERMINATED,,0.0005,3000,29,192.106,116000,20.478,39.6,-5.7,100


[2m[36m(pid=80483)[0m 2021-05-05 20:04:22,826	ERROR worker.py:395 -- SystemExit was raised from the worker
[2m[36m(pid=80483)[0m Traceback (most recent call last):
[2m[36m(pid=80483)[0m   File "python/ray/_raylet.pyx", line 488, in ray._raylet.execute_task
[2m[36m(pid=80483)[0m   File "python/ray/_raylet.pyx", line 495, in ray._raylet.execute_task
[2m[36m(pid=80483)[0m   File "python/ray/_raylet.pyx", line 505, in ray._raylet.execute_task
[2m[36m(pid=80483)[0m   File "python/ray/_raylet.pyx", line 449, in ray._raylet.execute_task.function_executor
[2m[36m(pid=80483)[0m   File "/Users/sven/opt/anaconda3/envs/ray_tutorial/lib/python3.8/site-packages/ray/_private/function_manager.py", line 566, in actor_method_executor
[2m[36m(pid=80483)[0m     return method(__ray_actor, *args, **kwargs)
[2m[36m(pid=80483)[0m   File "/Users/sven/opt/anaconda3/envs/ray_tutorial/lib/python3.8/site-packages/ray/actor.py", line 1001, in __ray_terminate__
[2m[36m(pid=80483)[0m    

<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7fd129c4a880>
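
The `cur_kl_coeff` values in the log above (1.0125 and 1.51875) can be reproduced with a short calculation. The sketch below is an illustration under stated assumptions, not RLlib's actual code: it assumes an initial KL coefficient of 0.2 and a KL target of 0.01, and it assumes an adaptive rule that multiplies the coefficient by 1.5 when the measured KL exceeds twice the target and halves it when the KL drops below half the target. The function name `update_kl_coeff` is hypothetical.

```python
def update_kl_coeff(kl_coeff, sampled_kl, kl_target=0.01):
    """One step of the assumed adaptive-KL rule (illustrative only)."""
    if sampled_kl > 2.0 * kl_target:
        return kl_coeff * 1.5  # policy moved too far: penalize KL more
    if sampled_kl < 0.5 * kl_target:
        return kl_coeff * 0.5  # policy barely moved: relax the penalty
    return kl_coeff            # KL is inside the tolerated band


# Five consecutive "KL too large" updates, starting from the assumed 0.2.
coeff = 0.2
history = [coeff]
for _ in range(5):
    coeff = update_kl_coeff(coeff, sampled_kl=0.03)
    history.append(coeff)
print([round(c, 5) for c in history])

# A KL of about 0.0165 (as logged for pol1) lies between 0.005 and 0.02,
# so the coefficient is left unchanged.
steady = update_kl_coeff(1.0125, sampled_kl=0.0165)
print(steady)
```

Under these assumptions, four increases take the coefficient from 0.2 to about 1.0125 and a fifth takes it to about 1.51875, which matches the two values seen for `pol1` and `pol2` across the trials.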

In [None]:
# Anyscale's Infinite Laptop:

# NOTE: The following cell only works if you have already been onboarded to Anyscale Inc.'s "Infinite Laptop".
# For more information, see https://www.anyscale.com/product

# Let's briefly step away from our MultiAgentArena and move to something much heavier in terms of environment/simulator complexity.
# We will now demonstrate how you can use Anyscale's Infinite Laptop to launch an RLlib experiment on a cloud machine
# with 4 GPUs and 32 CPUs, all from within this Jupyter cell:
# start an experiment in the cloud using Anyscale's product, RLlib, and a more complex multi-agent env.

import anyscale



In [29]:
# Custom Neural Network Models.

import tensorflow as tf


class MyModel(tf.keras.Model):
    def __init__(self,
                 input_space,
                 action_space,
                 num_outputs,
                 name="",
                 *,
                 layers=(256, 256)):
        super().__init__(name=name)

        self.dense_layers = []
        for i, layer_size in enumerate(layers):
            self.dense_layers.append(tf.keras.layers.Dense(
                layer_size, activation=tf.nn.relu, name=f"dense_{i}"))

        self.logits = tf.keras.layers.Dense(
            num_outputs,
            activation=tf.keras.activations.linear,
            name="logits")
        self.values = tf.keras.layers.Dense(
            1, activation=None, name="values")

    def call(self, inputs, training=None, mask=None):
        # Standardized input arg:
        # - `inputs`: the input dict (an RLlib `SampleBatch` object, which is
        #   basically a dict with numpy arrays in it); observations are
        #   stored under the "obs" key.
        out = inputs["obs"]
        for layer in self.dense_layers:
            out = layer(out)
        logits = self.logits(out)
        values = self.values(out)

        # Standardized output:
        # - "normal" model output tensor (e.g. action logits).
        # - list of internal state outputs (only needed for RNN-/memory enhanced models).
        # - "extra outs", such as model's side branches, e.g. value function outputs.
        return logits, [], {"vf_preds": tf.reshape(values, [-1])}

In [30]:
# Do a quick test on the custom model class.
import numpy as np
from gym.spaces import Box
test_model = MyModel(
    input_space=Box(-1.0, 1.0, (2, )),
    action_space=None,
    num_outputs=2,
)
test_model({"obs": np.array([[0.5, 0.5]])})

(<tf.Tensor 'my_model/logits/BiasAdd:0' shape=(1, 2) dtype=float64>,
 [],
 {'vf_preds': <tf.Tensor 'my_model/Reshape:0' shape=(1,) dtype=float64>})
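
To make the return contract of `MyModel.call()` easier to see, here is a framework-free sketch of the same data flow: a shared stack of ReLU layers feeds two linear heads, and the result is the triple (logits, state list, extra outputs). It is only an illustration; the helper names `dense`, `init`, and `forward` are hypothetical, the weights are random, and the layer sizes are chosen arbitrarily.

```python
import random


def dense(batch, weights, biases, relu=False):
    """Fully connected layer; `weights` holds one list of inputs-weights per output unit."""
    out = []
    for row in batch:
        units = [sum(x * w for x, w in zip(row, col)) + b
                 for col, b in zip(weights, biases)]
        out.append([max(0.0, u) for u in units] if relu else units)
    return out


def init(n_in, n_out, rng):
    """Random weights and zero biases for an n_in -> n_out layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(input_dict, trunk, logits_head, value_head):
    out = input_dict["obs"]
    for weights, biases in trunk:          # shared hidden layers
        out = dense(out, weights, biases, relu=True)
    logits = dense(out, *logits_head)      # one row of action logits per sample
    values = dense(out, *value_head)       # shape (batch, 1)
    # Flatten the value output to shape (batch,), like the tf.reshape above.
    return logits, [], {"vf_preds": [v[0] for v in values]}


rng = random.Random(0)
trunk = [init(2, 8, rng), init(8, 8, rng)]
logits, state, extra = forward(
    {"obs": [[0.5, 0.5], [-1.0, 1.0]]}, trunk, init(8, 2, rng), init(8, 1, rng))
print(len(logits), len(logits[0]), state, len(extra["vf_preds"]))
```

The empty list in the middle is the slot for internal state outputs, which stays empty for a feed-forward model like this one.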

In [None]:
# Set up our custom model and re-run the experiment.

config.update({
    "model": {
        "custom_model": MyModel,
        "custom_model_config": {
            "layers": [128, 128],
        },
    },
    # Revert these to single values, so that only one trial is run (using the hyperparams that performed well in our Exercise #2).
    "lr": 0.0005,
    "train_batch_size": 2000,
})

tune.run("PPO", config=config, stop=stop)
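
The values `lr=0.0005` and `train_batch_size=2000` used above correspond to trial `00001` in the final status table of the earlier run. As a rough sketch of how one might compare those trials, the snippet below copies the relevant columns of that table into a string (the name `FINAL_TABLE` and the shortened column names are ours) and ranks the trials. Which criterion counts as "performed well" is an assumption here; the sketch shows both the fewest iterations until termination and the highest final reward.

```python
import csv
import io

# Columns copied from the final trial status table printed by the earlier run.
FINAL_TABLE = """\
trial,lr,train_batch_size,iter,total_time_s,ts,reward
00000,0.0001,2000,33,206.231,132000,21.318
00001,0.0005,2000,21,149.375,84000,20.856
00002,0.0001,3000,25,173.138,100000,20.397
00003,0.0005,3000,29,192.106,116000,20.478
"""

rows = list(csv.DictReader(io.StringIO(FINAL_TABLE)))

# Trial that terminated after the fewest training iterations.
fastest = min(rows, key=lambda r: int(r["iter"]))
# Trial with the highest mean episode reward at termination.
highest = max(rows, key=lambda r: float(r["reward"]))

print("fewest iterations:", fastest["trial"], fastest["lr"], fastest["train_batch_size"])
print("highest final reward:", highest["trial"], highest["reward"])
```

The two criteria pick different trials, so it is worth deciding up front whether sample efficiency or final performance matters more for your experiment.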

In [None]:
# "Hacking in": How do we customize our RL loop?
# RLlib offers a callbacks API that allows you to add custom behavior at
# all major events during the environment sampling and learning process.

# Our problem: So far, we can only see the total reward (the sum over both agents).
# This does not tell us which agent learns what (maybe agent2 doesn't learn
# anything and the reward we are observing is mostly due to agent1's progress
# in covering the map!).
# The following custom callbacks class adds each agent's individual reward to
# the returned metrics, which will then be displayed in TensorBoard.

# We will subclass RLlib's DefaultCallbacks class and override its
# `on_episode_start`, `on_episode_step`, and `on_episode_end` methods.

from ray.rllib.agents.callbacks import DefaultCallbacks


class MyCallbacks(DefaultCallbacks):
    def on_episode_start(self, *, worker, base_env,
                         policies, episode,
                         env_index, **kwargs):
        episode.user_data["agent1_rewards"] = []
        episode.user_data["agent2_rewards"] = []

    def on_episode_step(self, *, worker, base_env,
                        episode, env_index, **kwargs):
        # Record each agent's most recent reward for this episode.
        ag1_r = episode.prev_reward_for("agent1")
        ag2_r = episode.prev_reward_for("agent2")
        episode.user_data["agent1_rewards"].append(ag1_r)
        episode.user_data["agent2_rewards"].append(ag2_r)

    def on_episode_end(self, *, worker, base_env,
                       policies, episode,
                       env_index, **kwargs):
        # Per-episode totals: reported as custom metrics in the results.
        episode.custom_metrics["ag1_R"] = sum(episode.user_data["agent1_rewards"])
        episode.custom_metrics["ag2_R"] = sum(episode.user_data["agent2_rewards"])
        # Per-step rewards: lists stored under `hist_data` are meant for histograms.
        episode.hist_data["agent1_rewards"] = episode.user_data["agent1_rewards"]
        episode.hist_data["agent2_rewards"] = episode.user_data["agent2_rewards"]



In [None]:
# Setting up our config to point to our new custom callbacks class:
config.update({
    "env": MultiAgentArena,  # force "reload"
    "callbacks": MyCallbacks,  # by default, this would point to `rllib.agents.callbacks.DefaultCallbacks`, which does nothing.
    #TODO: remove this once native keras models are supported!
    "model": {
        "custom_model": None,
    },
})

results = tune.run("PPO", config=config, stop={"training_iteration": 10})

### Let's check tensorboard for the new custom metrics!

1. Head over to `~/ray_results/PPO/PPO_MultiAgentArena_[some key]_00000_0_[date]_[time]/`.
1. In that directory, you should see an `events.out....` file.
1. Run `tensorboard --logdir .` and head to http://localhost:6006

<img src="images/tensorboard.png" width=800>
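
When looking for the new metrics, keep in mind that the per-episode values written to `episode.custom_metrics` are not plotted one by one. As far as we understand RLlib's behavior, each key is aggregated over the episodes of a training iteration and reported with `_mean`, `_min`, and `_max` suffixes (e.g. `ag1_R_mean`). The following is a plain-Python sketch of that aggregation, not RLlib's actual code; the function name `summarize_custom_metrics` and the reward numbers are made up for illustration.

```python
def summarize_custom_metrics(episodes):
    """Aggregate per-episode custom metrics into <key>_mean/_min/_max entries.

    `episodes` is a list of dicts, one per episode, mimicking what a callback
    writes into `episode.custom_metrics`. This is a toy re-implementation.
    """
    # Group all values by metric key.
    collected = {}
    for episode_metrics in episodes:
        for key, value in episode_metrics.items():
            collected.setdefault(key, []).append(value)

    # Reduce each group to its mean, min, and max.
    summary = {}
    for key, values in collected.items():
        summary[key + "_mean"] = sum(values) / len(values)
        summary[key + "_min"] = min(values)
        summary[key + "_max"] = max(values)
    return summary


# Hypothetical per-episode totals, as `MyCallbacks.on_episode_end` might produce.
episodes = [
    {"ag1_R": 10.0, "ag2_R": -2.0},
    {"ag1_R": 14.0, "ag2_R": -1.0},
    {"ag1_R": 12.0, "ag2_R": -3.0},
]
summary = summarize_custom_metrics(episodes)
print(sorted(summary.keys()))
```

So for the callbacks above, the charts to look for would be named along the lines of `ag1_R_mean` and `ag2_R_mean` rather than plain `ag1_R` and `ag2_R`.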


## Exercise No 3

<hr />

Assume we would like to know exactly how much (new) ground agent1 
covers on average in an episode.
Write your own custom callbacks class (a subclass of
`ray.rllib.agents.callbacks.DefaultCallbacks`) and override one or more of its
methods to collect the following data:
- The number of (unique) fields agent1 has covered in an episode.
- The number of times agent2 has blocked agent1.

Run a simple experiment using `tune.run` (and your custom callbacks class)
and confirm that the new metrics show up in TensorBoard.

Hints:

To get the last reward for an agent, use `episode.prev_reward_for([agent-name])`.

To get the current observation for an agent, use `episode.last_raw_obs_for([agent-name])`.
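
One more hint on counting unique observations: a Python `set` only accepts hashable items, and observations often arrive as lists or NumPy arrays, which are not hashable. The sketch below shows one possible way around this by converting each observation to a (nested) tuple first. The helper `to_hashable` and the sample observations are hypothetical and not part of RLlib.

```python
def to_hashable(obs):
    """Convert an observation (scalar, list, or array-like) into a hashable value."""
    # NumPy arrays and NumPy scalars offer `.tolist()`; plain lists do not need it.
    if hasattr(obs, "tolist"):
        obs = obs.tolist()
    if isinstance(obs, (list, tuple)):
        return tuple(to_hashable(item) for item in obs)
    return obs


visited = set()
# Hypothetical observations of one agent over five timesteps.
observations = [[0], [1], [1], [2], [0]]
for obs in observations:
    # `visited.add(obs)` would raise "TypeError: unhashable type: 'list'".
    visited.add(to_hashable(obs))

num_unique = len(visited)
print(num_unique)
```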

**Good luck! :)**

In [None]:
# Solution Exercise #3

import ray
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray import tune


class MyCallback(DefaultCallbacks):
    def on_episode_start(self, *, worker, base_env,
                         policies, episode,
                         env_index, **kwargs):
        # Create a per-episode set that captures which states (observations)
        # have been visited by agent1.
        episode.user_data["ground_covered"] = set()
        # Set per-episode agent2-blocks counter (how many times has agent2 blocked agent1?).
        episode.user_data["num_blocks"] = 0

    def on_episode_step(self, *, worker, base_env,
                        episode, env_index, **kwargs):
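        # Add agent1's observation to our set of unique observations.
        # Observations may be NumPy arrays, which are not hashable, so we convert
        # them to tuples first (`np` was imported earlier in this notebook).
        ag1_obs = episode.last_raw_obs_for("agent1")
        episode.user_data["ground_covered"].add(tuple(np.ravel(ag1_obs).tolist()))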
        # If agent2's reward > 0.0, it means she has blocked agent1.
        ag2_r = episode.prev_reward_for("agent2")
        if ag2_r > 0.0:
            episode.user_data["num_blocks"] += 1

    def on_episode_end(self, *, worker, base_env,
                       policies, episode,
                       env_index, **kwargs):
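        # Report the collected data as custom metrics, so that they end up in
        # the results (and thus in TensorBoard). No reset is needed here:
        # `on_episode_start` re-initializes `user_data` for every new episode.
        episode.custom_metrics["ground_covered"] = len(episode.user_data["ground_covered"])
        episode.custom_metrics["num_blocks"] = episode.user_data["num_blocks"]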



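# Ray may already be running from the earlier `tune.run()` calls in this notebook.
ray.init(ignore_reinit_error=True)

stop = {"training_iteration": 10}
# Specify env and custom callbacks in our config (leave everything else
# at its default value).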
config = {
    "env": MultiAgentArena,
    "callbacks": MyCallback,
}

# Run for a few iterations.
tune.run("PPO", stop=stop, config=config)

# Check tensorboard.



### A closer look at RLlib's APIs and structure

We already took a quick look inside an RLlib Trainer object and extracted its Policy(ies) and the Policy's model (neural network). Here is a much more detailed overview of what's inside a Trainer object.

At the core is the so-called `WorkerSet` sitting under `Trainer.workers`. A WorkerSet is a group of `RolloutWorker` (`rllib.evaluation.rollout_worker.py`) objects that always consists of a "local worker" (`Trainer.workers.local_worker()`) and n "remote workers" (`Trainer.workers.remote_workers()`).



<img src="images/rllib_structure.png" width=1000>

### Scaling RLlib

Scaling RLlib works by parallelizing the "jobs" that the remote `RolloutWorkers` do. In a vanilla RL algorithm, such as PPO, DQN, and many others, the `@ray.remote`-decorated RolloutWorkers in the figure above are responsible for interacting with one or more environments and thereby collecting experiences. The environment produces observations; the copy of the Policy(ies) located on the remote worker computes actions, which are sent to the environment to produce the next observation. This cycle repeats continuously and is interrupted only periodically to send experience batches ("train batches") of a certain size to the "local worker". There, these batches are used to call `Policy.learn_on_batch()`, which performs a loss calculation followed by a model weights update. The updated weights are then broadcast back to all remote workers.
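
The sample, learn, and broadcast cycle described above can be sketched in a few lines of plain Python. This is a toy illustration only: `ToyPolicy` and `ToyRolloutWorker` are made-up stand-ins (not RLlib classes), the "environment" just returns random numbers, the "weights" are a single float, and the workers run sequentially here, whereas RLlib runs the remote workers in parallel as Ray actors.

```python
import random


class ToyPolicy:
    """Stand-in for a Policy; its "weights" are just one float."""

    def __init__(self):
        self.weights = 0.0

    def compute_action(self, obs):
        return 1 if obs + self.weights > 0.5 else 0

    def learn_on_batch(self, batch):
        # Fake "loss + update": move the weights towards the batch's mean reward.
        mean_reward = sum(reward for _, _, reward in batch) / len(batch)
        self.weights += 0.1 * (mean_reward - self.weights)


class ToyRolloutWorker:
    """Stand-in for a RolloutWorker holding its own policy copy and toy env."""

    def __init__(self, seed):
        self.policy = ToyPolicy()
        self.rng = random.Random(seed)

    def sample(self, num_steps):
        batch = []
        for _ in range(num_steps):
            obs = self.rng.random()  # toy env: a random observation
            action = self.policy.compute_action(obs)
            reward = float(action)  # toy env: the reward equals the action
            batch.append((obs, action, reward))
        return batch

    def set_weights(self, weights):
        self.policy.weights = weights


local_worker = ToyRolloutWorker(seed=0)
remote_workers = [ToyRolloutWorker(seed=i + 1) for i in range(3)]

for iteration in range(5):
    # 1) Every remote worker collects experiences with its own policy copy.
    train_batch = []
    for worker in remote_workers:
        train_batch.extend(worker.sample(num_steps=10))
    # 2) The local worker learns on the concatenated train batch.
    local_worker.policy.learn_on_batch(train_batch)
    # 3) The updated weights are broadcast back to all remote workers.
    for worker in remote_workers:
        worker.set_weights(local_worker.policy.weights)
```

In this picture, adding more remote workers increases how many experiences are collected per iteration, while learning still happens in one place. In the RLlib version used in this tutorial, the number of remote workers is controlled through the `num_workers` config key.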

