# Hands-on RL with Ray’s RLlib
## A beginner’s tutorial for working with multi-agent environments, models, and algorithms

<img src="images/pitfall.jpg" width=250> <img src="images/tesla.jpg" width=254> <img src="images/forklifts.jpg" width=169> <img src="images/robots.jpg" width=252> <img src="images/dota2.jpg" width=213>

### Overview
“Hands-on RL with Ray’s RLlib” is a beginner’s tutorial for working with reinforcement learning (RL) environments, models, and algorithms using Ray’s RLlib library. RLlib offers high scalability, a large list of algorithms to choose from (offline, model-based, model-free, etc.), support for TensorFlow and PyTorch, and a unified API for a variety of applications. This tutorial starts with a brief introduction to the core concepts (e.g. why RL) before proceeding to RLlib (multi- and single-agent) environments, neural network models, hyperparameter tuning, debugging, student exercises, Q/A, and more. All code will be provided as .py files in a GitHub repo.

### Intended Audience
* Python programmers who want to get started with reinforcement learning and RLlib.

### Prerequisites
* Some Python programming experience.
* Some familiarity with machine learning.
* *Helpful, but not required:* Experience in reinforcement learning and Ray.
* *Helpful, but not required:* Experience with TensorFlow or PyTorch.

### Requirements/Dependencies

Install conda (https://www.anaconda.com/products/individual)

Then ...

#### Quick `conda` setup instructions (Linux):
```
$ conda create -n rllib python=3.8
$ conda activate rllib
$ pip install ray[rllib]
$ pip install tensorflow  # <- either one works!
$ pip install torch  # <- either one works!
$ pip install jupyterlab
```

#### Quick `conda` setup instructions (Mac):
```
$ conda create -n rllib python=3.8
$ conda activate rllib
$ pip install cmake "ray[rllib]"
$ pip install tensorflow  # <- either one works!
$ pip install torch  # <- either one works!
$ pip install jupyterlab
```

#### Quick `conda` setup instructions (Win10):
```
$ conda create -n rllib python=3.8
$ conda activate rllib
$ pip install ray[rllib]
$ pip install [tensorflow|torch]  # <- either one works!
$ pip install jupyterlab
$ conda install pywin32
```

Also, for Win10 Atari support, we have to install atari_py from a different source (gym does not support Atari envs on Windows).

```
$ pip install git+https://github.com/Kojoley/atari-py.git
```

### Opening these tutorial files:
```
$ git clone https://github.com/sven1977/rllib_tutorials
$ cd rllib_tutorials
$ jupyter-lab
```

### Key Takeaways
* What is reinforcement learning and why RLlib?
* Core concepts of RLlib: Environments, Trainers, Policies, and Models.
* How to configure, hyperparameter-tune, and parallelize RLlib.
* RLlib debugging best practices.

### Tutorial Outline
1. RL and RLlib in a nutshell.
1. Defining an RL-solvable problem: Our first environment.
1. **Exercise No.1**: Environment loop.

(15min break)

1. Picking an algorithm and training our first RLlib Trainer.
1. Configurations and hyperparameters - Easy tuning with Ray Tune.
1. Fixing our experiment's config - Going multi-agent.
1. The "infinite laptop": Quick intro into how to use RLlib with Anyscale's product.
1. **Exercise No.2**: Run your own Ray RLlib+Tune experiment.
1. Neural network models - Provide your custom models using tf.keras or torch.nn.

(15min break)

1. Deeper dive into RLlib's parallelization architecture.
1. Specifying different compute resources and parallelization options through our config.
1. "Hacking in": Using callbacks to customize the RL loop and generate our own metrics.
1. **Exercise No.3**: Write your own custom callback.
1. "Hacking in (part II)" - Debugging with RLlib and PyCharm.
1. Checking on the "infinite laptop" - Did RLlib learn to solve the problem?

### Other Recommended Readings
* [Attention Nets and More with RLlib's Trajectory View API](https://medium.com/distributed-computing-with-ray/attention-nets-and-more-with-rllibs-trajectory-view-api-d326339a6e65)
* [Intro to RLlib: Example Environments](https://medium.com/distributed-computing-with-ray/intro-to-rllib-example-environments-3a113f532c70)
* [Reinforcement Learning with RLlib in the Unity Game Engine](https://medium.com/distributed-computing-with-ray/reinforcement-learning-with-rllib-in-the-unity-game-engine-1a98080a7c0d)


<img src="images/rl-cycle.png" width=800>

### Coding/defining our "problem" via an RL environment.

We will use the following (adversarial) multi-agent environment
throughout this tutorial to demonstrate a large fraction of RLlib's
APIs, features, and customization options.

<img src="images/environment.png" width=800>

### A word or two on Spaces:

Spaces are used in ML to describe what possible/valid values inputs and outputs of a neural network can have.

RL environments also use them to describe what their valid observations and actions are.

Spaces are usually defined by their shape (e.g. 84x84x3 RGB images) and datatype (e.g. uint8 for RGB values between 0 and 255).
However, spaces could also be composed of other spaces (see Tuple or Dict spaces) or could be simply discrete with n fixed possible values
(represented by integers). For example, in our game, where each agent can only go up/down/left/right, the action space would be "Discrete(4)"
(no datatype, no shape needs to be defined here).
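
To make the idea concrete, here is a minimal plain-Python sketch of what a discrete space does: it can sample a valid value and check whether a given value is valid. `TinyDiscrete` and `TinyMultiDiscrete` are hypothetical stand-ins written only for this illustration; they are not gym's classes, which offer more functionality.

```python
import random


class TinyDiscrete:
    """Hypothetical stand-in for a discrete space with values 0..n-1."""

    def __init__(self, n):
        self.n = n

    def sample(self):
        # Draw one valid value uniformly at random.
        return random.randrange(self.n)

    def contains(self, x):
        # Valid values are the integers 0, 1, ..., n-1.
        return isinstance(x, int) and 0 <= x < self.n


class TinyMultiDiscrete:
    """Hypothetical stand-in: several discrete sub-spaces side by side."""

    def __init__(self, nvec):
        self.subspaces = [TinyDiscrete(n) for n in nvec]

    def sample(self):
        return [s.sample() for s in self.subspaces]

    def contains(self, xs):
        return len(xs) == len(self.subspaces) and all(
            s.contains(x) for s, x in zip(self.subspaces, xs))


# Four moves (up/right/down/left) -> 4 discrete actions.
tiny_action_space = TinyDiscrete(4)
# Two positions on a 10x10 grid -> two values, each in 0..99.
tiny_observation_space = TinyMultiDiscrete([100, 100])

print(tiny_action_space.contains(3), tiny_action_space.contains(4))
print(tiny_observation_space.contains([0, 99]))
```

The same two operations (sampling and membership checks) are what the real space classes are used for in the environment code of this tutorial.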

<img src="images/spaces.png" width=800>

In [1]:
# Let's code (parts of) our multi-agent environment.

import gym
from gym.spaces import Discrete, MultiDiscrete
import numpy as np
import random

from ray.rllib.env.multi_agent_env import MultiAgentEnv


class MultiAgentArena(MultiAgentEnv):
    def __init__(self, config=None):
        config = config or {}
        # Dimensions of the grid.
        self.width = config.get("width", 10)
        self.height = config.get("height", 10)

        # End an episode after this many timesteps.
        self.timestep_limit = config.get("ts", 100)

        self.observation_space = MultiDiscrete([self.width * self.height,
                                                self.width * self.height])
        # 0=up, 1=right, 2=down, 3=left.
        self.action_space = Discrete(4)

        # Reset env.
        self.reset()
        
    def reset(self):
        """Returns initial observation of next(!) episode."""
        # Row-major coords.
        self.agent1_pos = [0, 0]  # upper left corner
        self.agent2_pos = [self.height - 1, self.width - 1]  # lower right corner

        # Accumulated rewards in this episode.
        self.agent1_R = 0.0
        self.agent2_R = 0.0

        # Reset agent1's visited fields.
        self.agent1_visited_fields = set([tuple(self.agent1_pos)])

        # How many timesteps have we done in this episode.
        self.timesteps = 0
        # No collision so far; `render()` reads this flag.
        self.collision = False

        # Return the initial observation in the new episode.
        return self._get_obs()

    def step(self, action: dict):
        """
        Returns (next observation, rewards, dones, infos) after having taken the given actions.
        
        e.g.
        `action={"agent1": action_for_agent1, "agent2": action_for_agent2}`
        """
        
        # increase our time steps counter by 1.
        self.timesteps += 1
        # An episode is "done" when we reach the time step limit.
        is_done = self.timesteps >= self.timestep_limit

        # Agent2 always moves first.
        # events = [collision|agent1_new_field]
        events = self._move(self.agent2_pos, action["agent2"], is_agent1=False)
        events |= self._move(self.agent1_pos, action["agent1"], is_agent1=True)

        # Useful for rendering.
        self.collision = "collision" in events
            
        # Get observations (based on new agent positions).
        obs = self._get_obs()

        # Determine rewards based on the collected events:
        r1 = -1.0 if "collision" in events else 1.0 if "agent1_new_field" in events else -0.5
        r2 = 1.0 if "collision" in events else -0.1

        self.agent1_R += r1
        self.agent2_R += r2
        
        rewards = {
            "agent1": r1,
            "agent2": r2,
        }

        # Generate a `done` dict (per-agent and total).
        dones = {
            "agent1": is_done,
            "agent2": is_done,
            # special `__all__` key indicates that the episode is done for all agents.
            "__all__": is_done,
        }

        return obs, rewards, dones, {}  # <- info dict (not needed here).

    def _get_obs(self):
        """
        Returns obs dict (agent name to array of two discrete positions:
        the agent's own, then the other agent's) using each agent's
        current [row, col] position.
        """
        ag1_discrete_pos = self.agent1_pos[0] * self.width + \
            (self.agent1_pos[1] % self.width)
        ag2_discrete_pos = self.agent2_pos[0] * self.width + \
            (self.agent2_pos[1] % self.width)
        return {
            "agent1": np.array([ag1_discrete_pos, ag2_discrete_pos]),
            "agent2": np.array([ag2_discrete_pos, ag1_discrete_pos]),
        }

    def _move(self, coords, action, is_agent1):
        """
        Moves an agent (agent1 iff is_agent1=True, else agent2) from `coords` ([row, col]) using the
        given action (0=up, 1=right, etc.) and returns the resulting set of events:
        {"collision"} when the move was blocked by the other agent (agent1 then gets -1.0),
        {"agent1_new_field"} when agent1 enters a field it has not visited yet, else an empty set.
        """
        orig_coords = coords[:]
        # Change the row: 0=up (-1), 2=down (+1)
        coords[0] += -1 if action == 0 else 1 if action == 2 else 0
        # Change the column: 1=right (+1), 3=left (-1)
        coords[1] += 1 if action == 1 else -1 if action == 3 else 0

        # Solve collisions.
        # Make sure, we don't end up on the other agent's position.
        # If yes, don't move (we are blocked).
        if (is_agent1 and coords == self.agent2_pos) or (not is_agent1 and coords == self.agent1_pos):
            coords[0], coords[1] = orig_coords
            # Agent2 blocked agent1 (agent1 tried to run into agent2)
            # OR Agent2 bumped into agent1 (agent2 tried to run into agent1)
            return {"collision"}

        # No agent blocking -> check walls.
        if coords[0] < 0:
            coords[0] = 0
        elif coords[0] >= self.height:
            coords[0] = self.height - 1
        if coords[1] < 0:
            coords[1] = 0
        elif coords[1] >= self.width:
            coords[1] = self.width - 1

        # If agent1 -> "agent1_new_field" if new tile covered.
        if is_agent1 and tuple(coords) not in self.agent1_visited_fields:
            self.agent1_visited_fields.add(tuple(coords))
            return {"agent1_new_field"}
        # No new tile for agent1.
        return set()

    def render(self, mode=None):
        print("_" * (self.width + 2))
        for r in range(self.height):
            print("|", end="")
            for c in range(self.width):
                field = r * self.width + c % self.width
                if self.agent1_pos == [r, c]:
                    print("1", end="")
                elif self.agent2_pos == [r, c]:
                    print("2", end="")
                elif (r, c) in self.agent1_visited_fields:
                    print(".", end="")
                else:
                    print(" ", end="")
            print("|")
        print("‾" * (self.width + 2))
        print(f"{'!!Collision!!' if self.collision else ''}")
        print("R1={: .1f}".format(self.agent1_R))
        print("R2={: .1f}".format(self.agent2_R))
        print()


env = MultiAgentArena()

obs = env.reset()

# Agent1 will move down, Agent2 moves up.
obs, rewards, dones, infos = env.step(action={"agent1": 2, "agent2": 0})

env.render()

print("Agent1's x/y position={}".format(env.agent1_pos))
print("Agent2's x/y position={}".format(env.agent2_pos))
print("Env timesteps={}".format(env.timesteps))




____________
|.         |
|1         |
|          |
|          |
|          |
|          |
|          |
|          |
|         2|
|          |
‾‾‾‾‾‾‾‾‾‾‾‾

R1= 1.0
R2=-0.1

Agent1's x/y position=[1, 0]
Agent2's x/y position=[8, 9]
Env timesteps=1
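
Note that the observations returned by the env are not the `[row, col]` pairs printed above, but one integer per agent. A small sketch of that row-major encoding and its inverse, mirroring the arithmetic in `_get_obs`; the helper names `encode_pos` and `decode_pos` are made up for this illustration and are not part of the env class.

```python
def encode_pos(row, col, width):
    """Flatten a (row, col) grid position into one integer (row-major)."""
    return row * width + col


def decode_pos(index, width):
    """Inverse of `encode_pos`: recover (row, col) from the flat index."""
    return divmod(index, width)


grid_width = 10
# Positions after the single step shown above.
agent1_index = encode_pos(1, 0, grid_width)
agent2_index = encode_pos(8, 9, grid_width)
print(agent1_index, agent2_index)            # 10 89
print(decode_pos(agent2_index, grid_width))  # (8, 9)
```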


## Exercise No 1

<hr />

<img src="images/exercise1.png" width=400>

In the cell above, we performed a `reset()` and a single `step()` call. To walk through an entire episode, one would normally call `step()` repeatedly (with different actions) until the returned `done` dict has the "agent1" or "agent2" (or "__all__") key set to True. Your task is to write an "environment loop" that runs for exactly one episode using our `MultiAgentArena` class.

Follow these instructions here to get this done.

1. Create an env object.
1. `reset` your environment to get the first (initial) observation.
1. Compute the actions for "agent1" and "agent2" calling `DummyTrainer.compute_action([obs])` twice and putting the results into an action dict to be passed into `step()`, just like it's done in the above cell (where we do a single `step()`).
1. Repeat this, `step`ing through an entire episode.
1. When an episode is done, `step()` will return a done dict with key `__all__` set to True.
1. If you feel this is way too easy for you ;) , try to extract each agent's reward, sum it up over the episode, and at the end of the episode print out each agent's accumulated reward (also called the "return" of an episode).

**Good luck! :)**


In [22]:
class DummyTrainer:
    """Dummy Trainer class used in Exercise #1.

    Use its `compute_action` method to get a new action for one of the agents,
    given the agent's observation (two discrete values encoding the fields
    the agent itself and the other agent are currently in).
    """

    def compute_action(self, single_agent_obs=None):
        # Returns a random action for a single agent.
        return np.random.randint(4)  # Discrete(4) -> return rand int between 0 and 3 (incl. 3).

dummy_trainer = DummyTrainer()
# Check, whether it's working.
for _ in range(3):
    # Get action for agent1 (providing agent1's and agent2's positions).
    print("action_agent1={}".format(dummy_trainer.compute_action(np.array([0, 99]))))

    # Get action for agent2 (providing agent2's and agent1's positions).
    print("action_agent2={}".format(dummy_trainer.compute_action(np.array([99, 0]))))

    print()

action_agent1=3
action_agent2=1

action_agent1=0
action_agent2=1

action_agent1=0
action_agent2=2



Write your solution code into this cell here:

In [23]:
# !LIVE CODING!

# Leave the following as-is. It'll help us with rendering the env in this very cell's output.
import time
from ipywidgets import Output
from IPython import display

out = Output()
display.display(out)

with out:

    # Solution to Exercise #1:
    # Start coding here inside this `with`-block:
    # 1) Reset the env.
    obs = env.reset()  # start new episode

    # 2) Enter a while loop (to step through the episode).
    while env.timesteps < 100:
        # 3) Calculate both agents' actions individually, using dummy_trainer.compute_action([individual agent's obs])
        a1 = dummy_trainer.compute_action(obs["agent1"])
        a2 = dummy_trainer.compute_action(obs["agent2"])

        # 4) Compile the actions dict from both individual actions.
        actions = {
            "agent1": a1, "agent2": a2,
        }

        # 5) Send the actions dict to the env's `step()` method to receive: obs, rewards, dones, info dicts
        obs, rewards, dones, _ = env.step(actions)

        # 6) We'll do this together: Render the env.
        # Don't write any code here (skip directly to 7).
        out.clear_output(wait=True)
        time.sleep(0.05)
        env.render()

        # 7) Check whether the episode is done. If yes, break out of the while loop.
        if dones["__all__"]:
            break

# 8) Run it! :)

Output()
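
For the optional last step of the exercise (accumulating each agent's return), a minimal sketch could look like the following. To keep it runnable on its own, it uses `StubArena`, a hypothetical stand-in env with fixed rewards and 5-step episodes, and random actions in place of `dummy_trainer.compute_action(...)`; swap in `MultiAgentArena` and the dummy trainer to apply it to the real env.

```python
import random


class StubArena:
    """Hypothetical stand-in env: two agents, fixed rewards, 5-step episodes."""

    def reset(self):
        self.timesteps = 0
        return {"agent1": 0, "agent2": 0}

    def step(self, action):
        self.timesteps += 1
        done = self.timesteps >= 5
        obs = {"agent1": self.timesteps, "agent2": self.timesteps}
        rewards = {"agent1": 1.0, "agent2": -0.1}
        dones = {"agent1": done, "agent2": done, "__all__": done}
        return obs, rewards, dones, {}


stub_env = StubArena()
stub_obs = stub_env.reset()
returns = {"agent1": 0.0, "agent2": 0.0}

while True:
    # Random actions stand in for `dummy_trainer.compute_action(...)`.
    actions = {agent_id: random.randrange(4) for agent_id in stub_obs}
    stub_obs, stub_rewards, stub_dones, _ = stub_env.step(actions)
    # Add this step's reward to each agent's running total.
    for agent_id, r in stub_rewards.items():
        returns[agent_id] += r
    # The special `__all__` key signals the end of the episode for everyone.
    if stub_dones["__all__"]:
        break

print(returns)
```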

------------------
## 15 min break :)
------------------

### And now for something completely different:
#### Plugging in RLlib!

In [24]:
import numpy as np
import pprint
import ray

# Start a new instance of Ray (when running this tutorial locally) or
# connect to an already running one (when running this tutorial through Anyscale).

ray.init()  # Hear the engine humming? ;)

# In case you encounter the following error during our tutorial: `RuntimeError: Maybe you called ray.init twice by accident?`
# Try: `ray.shutdown() + ray.init()` or `ray.init(ignore_reinit_error=True)`

2021-06-24 10:42:20,283	INFO services.py:1272 -- View the Ray dashboard at http://127.0.0.1:8265


{'node_ip_address': '192.168.0.179',
 'raylet_ip_address': '192.168.0.179',
 'redis_address': '192.168.0.179:6379',
 'object_store_address': '/tmp/ray/session_2021-06-24_10-42-18_600631_2196/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-06-24_10-42-18_600631_2196/sockets/raylet',
 'webui_url': '127.0.0.1:8265',
 'session_dir': '/tmp/ray/session_2021-06-24_10-42-18_600631_2196',
 'metrics_export_port': 63988,
 'node_id': 'fb379f8d24eba01da8c8027447994ca7c9754701cb243183c55e28d7'}

### Picking an RLlib algorithm - We'll use PPO throughout this tutorial (one-size-fits-all-kind-of-algo)

<img src="images/rllib_algos.png" width=800>

https://docs.ray.io/en/master/rllib-algorithms.html#available-algorithms-overview

In [25]:
# Import a Trainable (one of RLlib's built-in algorithms):
# We use the PPO algorithm here b/c it's very flexible wrt its supported
# action spaces and model types and b/c it learns almost any problem well.
from ray.rllib.agents.ppo import PPOTrainer

# Specify a very simple config, defining our environment and some environment
# options (see environment.py).
config = {
    "env": MultiAgentArena,  # "my_env" <- if we previously have registered the env with `tune.register_env("[name]", lambda config: [returns env object])`.
    "env_config": {
        "config": {
            "width": 10,
            "height": 10,
            "ts": 100,
        },
    },

    # !PyTorch users!
    #"framework": "torch",  # If users have chosen to install torch instead of tf.

    "create_env_on_driver": True,
}
# Instantiate the Trainer object using above config.
rllib_trainer = PPOTrainer(config=config)
rllib_trainer

2021-06-24 10:42:23,897	INFO trainer.py:671 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
2021-06-24 10:42:23,898	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


PPO

### Ready to train with RLlib's PPO algorithm

That's it, we are ready to train.
Calling `Trainer.train()` will execute a single "training iteration".

One iteration for most algos involves:

1) sampling from the environment(s)
2) using the sampled data (observations, actions taken, rewards) to update the policy model (neural network), such that it would pick better actions in the future, leading to higher rewards.

Let's try it out:

In [26]:
results = rllib_trainer.train()

# Delete the config from the results for clarity.
# Only the stats will remain, then.
del results["config"]
# Pretty print the stats.
pprint.pprint(results)

{'agent_timesteps_total': 4000,
 'custom_metrics': {},
 'date': '2021-06-24_10-42-36',
 'done': False,
 'episode_len_mean': 100.0,
 'episode_media': {},
 'episode_reward_max': 14.999999999999998,
 'episode_reward_mean': -9.104999999999999,
 'episode_reward_min': -34.50000000000005,
 'episodes_this_iter': 20,
 'episodes_total': 20,
 'experiment_id': '74285469b30e44cdbdfb8f3ba86dfb6c',
 'hist_stats': {'episode_lengths': [100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100,
                                    100
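
The full result dict is long, so in practice one usually picks out a few headline numbers. A small sketch, using a hand-copied excerpt of the values printed above rather than a live `results` dict (the name `results_excerpt` is only for this illustration):

```python
# Values copied from the (truncated) output above.
results_excerpt = {
    "agent_timesteps_total": 4000,
    "episode_len_mean": 100.0,
    "episode_reward_max": 14.999999999999998,
    "episode_reward_mean": -9.104999999999999,
    "episode_reward_min": -34.50000000000005,
    "episodes_this_iter": 20,
}

# Keep only the reward statistics, rounded for readability.
reward_keys = ["episode_reward_min", "episode_reward_mean", "episode_reward_max"]
summary = {k: round(results_excerpt[k], 2) for k in reward_keys}
print(summary)

# Sanity check: 20 episodes * 100 steps * 2 agents.
num_agents = 2
agent_steps = (results_excerpt["episodes_this_iter"]
               * results_excerpt["episode_len_mean"] * num_agents)
print(agent_steps == results_excerpt["agent_timesteps_total"])
```

The last line shows that 20 episodes of 100 steps with two agents add up to the reported `agent_timesteps_total` of 4000.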

### Going from single policy (RLlib's default) to multi-policy:

So far, our experiment has been ill-configured, because both
agents, which should behave differently due to their different
tasks and reward functions, learn the same policy: the "default_policy",
which RLlib always provides if you don't configure anything else.
Remember that RLlib does not know at Trainer setup time how many and which agents
the environment will "produce". Agent control (adding agents, removing them, terminating
episodes for agents) is entirely in the Env's hands.
Let's fix our single policy problem and introduce the "multiagent" API.

<img src="images/from_single_agent_to_multi_agent.png" width=800>

In order to turn on RLlib's multi-agent functionality, we need two things:

1. A policy mapping function, mapping agent IDs (e.g. a string like "agent1", produced by the environment in the returned observation/rewards/dones-dicts) to a policy ID (another string, e.g. "policy1", which is under our control).
1. A policies definition dict, mapping policy IDs (e.g. "policy1") to 4-tuples consisting of 1) policy class (None for using the default class), 2) observation space, 3) action space, and 4) config overrides (empty dict for no overrides and using the Trainer's main config dict).

Let's take a closer look:

In [27]:
# Define the policies definition dict:
# Each policy in there is defined by its ID (key) mapping to a 4-tuple (value):
# - Policy class (None for using the "default" class, e.g. PPOTFPolicy for PPO+tf or PPOTorchPolicy for PPO+torch).
# - obs-space (we get this directly from our already created env object).
# - act-space (we get this directly from our already created env object).
# - config-overrides dict (leave empty for using the Trainer's config as-is)
policies = {
    "policy1": (None, env.observation_space, env.action_space, {}),
    "policy2": (None, env.observation_space, env.action_space, {"lr": 0.0002}),
}
# Note that now we won't have a "default_policy" anymore, just "policy1" and "policy2".

# Define an agent->policy mapping function.
# Which agents (defined by the environment) use which policies (defined by us)?
# The mapping here is M (agents) -> N (policies), where M >= N.
def policy_mapping_fn(agent_id: str):
    # Make sure agent ID is valid.
    assert agent_id in ["agent1", "agent2"], f"ERROR: invalid agent ID {agent_id}!"
    # Map agent1 to policy1, and agent2 to policy2.
    return "policy1" if agent_id == "agent1" else "policy2"

# We could - if we wanted - specify which policies should be learnt (by default, RLlib learns all).
# Non-learnt policies will be frozen and not updated:
# policies_to_train = ["policy1", "policy2"]

# Adding the above to our config.
config.update({
    "multiagent": {
        "policies": policies,
        "policy_mapping_fn": policy_mapping_fn,
        # We'll leave this empty: Means, we train both policy1 and policy2.
        # "policies_to_train": policies_to_train,
    },
})

pprint.pprint(config)
print()
print(f"agent1 is now mapped to {policy_mapping_fn('agent1')}")
print(f"agent2 is now mapped to {policy_mapping_fn('agent2')}")

{'create_env_on_driver': True,
 'env': <class '__main__.MultiAgentArena'>,
 'env_config': {'config': {'height': 10, 'ts': 100, 'width': 10}},
 'multiagent': {'policies': {'policy1': (None,
                                         MultiDiscrete([100 100]),
                                         Discrete(4),
                                         {}),
                             'policy2': (None,
                                         MultiDiscrete([100 100]),
                                         Discrete(4),
                                         {'lr': 0.0002})},
                'policy_mapping_fn': <function policy_mapping_fn at 0x7fbc3a8b2af0>}}

agent1 is now mapped to policy1
agent2 is now mapped to policy2
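
In our env the mapping is one-to-one, but the "M agents -> N policies" remark in the comment above also covers several agents sharing one policy. A sketch of such a mapping for a hypothetical env with four agents in two teams; the agent IDs, policy IDs, and the function name `shared_policy_mapping_fn` are made up for this illustration.

```python
def shared_policy_mapping_fn(agent_id: str) -> str:
    """Hypothetical mapping: all agents of one team share one policy."""
    # IDs like "explorer_0", "explorer_1" -> "explorer_policy".
    team = agent_id.split("_")[0]
    return f"{team}_policy"


example_agent_ids = ["explorer_0", "explorer_1", "chaser_0", "chaser_1"]
example_mapping = {a: shared_policy_mapping_fn(a) for a in example_agent_ids}
print(example_mapping)
print(len(example_agent_ids), "agents ->",
      len(set(example_mapping.values())), "policies")
```

Each policy ID returned by such a function would also need an entry in the policies definition dict.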


In [28]:
# Recreate our Trainer (we cannot just change the config on-the-fly).
rllib_trainer.stop()

# Using our updated (now multiagent!) config dict.
rllib_trainer = PPOTrainer(config=config)
rllib_trainer

2021-06-24 10:42:49,739	INFO trainable.py:101 -- Trainable.setup took 12.972 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.


PPO

Now that we are set up correctly with two policies as per our "multiagent" config, let's call `train()` on the new Trainer several times (how about 10 times?).

In [29]:
# Run `train()` n times. Repeatedly call `train()` now to see rewards increase.
# Move on once you see (agent1 + agent2) episode rewards of 10.0 or more.
for _ in range(10):
    results = rllib_trainer.train()
    print(f"Iteration={rllib_trainer.iteration}: R(\"return\")={results['episode_reward_mean']}")

Iteration=1: R("return")=-8.587500000000002
Iteration=2: R("return")=-5.167499999999999
Iteration=3: R("return")=-4.460999999999993
Iteration=4: R("return")=-2.3789999999999885
Iteration=5: R("return")=-1.7069999999999876
Iteration=6: R("return")=-0.26699999999998636
Iteration=7: R("return")=1.4850000000000128
Iteration=8: R("return")=2.2110000000000114
Iteration=9: R("return")=3.3690000000000078
Iteration=10: R("return")=3.6810000000000063


In [30]:
# Do another loop, but this time, we will print out each policy's individual rewards.
for _ in range(10):
    results = rllib_trainer.train()
    r1 = results['policy_reward_mean']['policy1']
    r2 = results['policy_reward_mean']['policy2']
    r = r1 + r2
    print(f"Iteration={rllib_trainer.iteration}: R(\"return\")={r} R1={r1} R2={r2}")

Iteration=11: R1=11.675 R2=-6.655999999999988
Iteration=12: R1=13.175 R2=-6.325999999999986
Iteration=13: R1=12.185 R2=-5.929999999999989
Iteration=14: R1=12.515 R2=-5.995999999999988
Iteration=15: R1=12.13 R2=-5.610999999999991
Iteration=16: R1=13.305 R2=-5.3249999999999895
Iteration=17: R1=14.36 R2=-5.2699999999999925
Iteration=18: R1=15.555 R2=-5.78699999999999
Iteration=19: R1=16.79 R2=-5.698999999999991
Iteration=20: R1=17.795 R2=-5.66599999999999


#### !OPTIONAL HACK! (<-- we will not do these during the tutorial, but feel free to try these cells by yourself)

Use the above solution of Exercise #1 and replace our `dummy_trainer` in that solution
with the now trained `rllib_trainer`. You should see a better performance of the two agents.

However, keep in mind that training has mostly benefited agent1 so far: agent1 is the "easier" one
to collect high rewards with, while agent2's returns (see R2 above) have improved only slightly.

#### !OPTIONAL HACK!

Feel free to play around with the following code to learn how RLlib, under the hood, calculates actions from the environment's observations using the Policies and their model(s) inside our Trainer object:

In [31]:
# Let's actually "look inside" our Trainer to see what's in there.
from ray.rllib.utils.numpy import softmax

# To get to one of the policies inside the Trainer, use `Trainer.get_policy([policy ID])`:
policy = rllib_trainer.get_policy("policy1")
print(f"Our (only!) Policy right now is: {policy}")

# To get to the model inside any policy, do:
model = policy.model
#print(f"Our Policy's model is: {model}")

# Print out the policy's action and observation spaces.
print(f"Our Policy's observation space is: {policy.observation_space}")
print(f"Our Policy's action space is: {policy.action_space}")

# Produce a random observation (B=1; batch of size 1).
obs = np.array([policy.observation_space.sample()])
# Alternatively for PyTorch:
#import torch
#obs = torch.from_numpy(obs)

# Get the action logits (as tf tensor).
# If you are using torch, you would get a torch tensor here.
logits, _ = model({"obs": obs})
logits

# Numpyize the tensor by running `logits` through the Policy's own tf.Session.
logits_np = policy.get_session().run(logits)
# For torch, you can simply do: `logits_np = logits.detach().cpu().numpy()`.

# Convert logits into action probabilities and remove the B=1.
action_probs = np.squeeze(softmax(logits_np))

# Sample an action, using the probabilities.
action = np.random.choice([0, 1, 2, 3], p=action_probs)

# Print out the action.
print(f"sampled action={action}")

Our (only!) Policy right now is: <ray.rllib.policy.tf_policy_template.PPOTFPolicy object at 0x7fbc409bd160>
Our Policy's observation space is: Box(-1.0, 1.0, (200,), float32)
Our Policy's action space is: Discrete(4)
sampled action=3
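
The last two steps of the cell above (logits to probabilities, then sampling) do not depend on RLlib at all. A plain-Python sketch of the same idea, with made-up logits; `softmax_py` is a hypothetical helper written for this illustration, not the `softmax` imported from RLlib above.

```python
import math
import random


def softmax_py(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for the 4 actions (0=up, 1=right, 2=down, 3=left).
example_logits = [0.5, 2.0, -1.0, 0.0]
example_probs = softmax_py(example_logits)

# Sample one action according to these probabilities.
sampled = random.choices(range(4), weights=example_probs, k=1)[0]
print([round(p, 3) for p in example_probs], sampled)
```

Actions with larger logits are sampled more often, but every action keeps a non-zero probability, which is what makes the policy explore.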


### Saving and restoring a trained Trainer.
Currently, `rllib_trainer` is in an already trained state.
It holds optimized weights in its Policies' models that already allow it to act
somewhat smartly in our environment when given an observation.

However, if we closed this notebook right now, all the effort would have been for nothing.
Let's therefore save the state of our trainer to disk for later!

In [32]:
# We use the `Trainer.save()` method to create a checkpoint.
checkpoint_file = rllib_trainer.save()
print(f"Trainer (at iteration {rllib_trainer.iteration} was saved in '{checkpoint_file}'!")

# Here is what a checkpoint directory contains:
print("The checkpoint directory contains the following files:")
import os
os.listdir(os.path.dirname(checkpoint_file))

Trainer (at iteration 20 was saved in '/Users/sven/ray_results/PPO_MultiAgentArena_2021-06-24_10-42-36tjwkrob9/checkpoint_000020/checkpoint-20'!
The checkpoint directory contains the following files:


['checkpoint-20', 'checkpoint-20.tune_metadata', '.is_checkpoint']

### Restoring and evaluating a Trainer
In the following cell, we'll learn how to restore a saved Trainer from a checkpoint file.

We'll also evaluate a completely new Trainer (should act more or less randomly) vs an already trained one (the one we just restored from the created checkpoint file).

In [33]:
# Pretend, we wanted to pick up training from a previous run:
new_trainer = PPOTrainer(config=config)
# Evaluate the new trainer (this should yield random results).
results = new_trainer.evaluate()
print(f"Evaluating new trainer: R={results['evaluation']['episode_reward_mean']}")

# Restoring the trained state into the `new_trainer` object.
print(f"Before restoring: Trainer is at iteration={new_trainer.iteration}")
new_trainer.restore(checkpoint_file)
print(f"After restoring: Trainer is at iteration={new_trainer.iteration}")

# Evaluate again (this should yield results we saw after having trained our saved agent).
results = new_trainer.evaluate()
print(f"Evaluating restored trainer: R={results['evaluation']['episode_reward_mean']}")

2021-06-24 10:45:16,711	INFO trainable.py:101 -- Trainable.setup took 13.248 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.


Evaluating new trainer: R=-7.379999999999993
Before restoring: Trainer is at iteration=0


2021-06-24 10:45:19,816	INFO trainable.py:377 -- Restored on 192.168.0.179 from checkpoint: /Users/sven/ray_results/PPO_MultiAgentArena_2021-06-24_10-42-36tjwkrob9/checkpoint_000020/checkpoint-20
2021-06-24 10:45:19,817	INFO trainable.py:385 -- Current state after restoring: {'_iteration': 20, '_timesteps_total': None, '_time_total': 132.94485211372375, '_episodes_total': 800}


After restoring: Trainer is at iteration=20
Evaluating restored trainer: R=15.284999999999972


In order to release all resources from a Trainer, you can use a Trainer's `stop()` method.
You should definitely run this cell, as it frees resources that we'll need later in this tutorial, when we'll do parallel hyperparameter sweeps.

In [34]:
rllib_trainer.stop()
new_trainer.stop()

### Moving stuff to the professional level: RLlib in connection w/ Ray Tune

Running experiments through Ray Tune is the recommended way of doing things with RLlib. If you look at our
<a href="https://github.com/ray-project/ray/tree/master/rllib/examples">example scripts folder</a>, you will see that almost all of the scripts use Ray Tune to run the particular RLlib workload demonstrated in each script.

<img src="images/rllib_and_tune.png" width=400>

When setting up hyperparameter sweeps for Tune, we'll do this in our already familiar config dict.

So let's take a quick look at our PPO algo's default config to understand which hyperparameters we may want to play around with:

In [35]:
# Configuration dicts and Ray Tune.
# Where are the default configuration dicts stored?

# PPO algorithm:
from ray.rllib.agents.ppo import DEFAULT_CONFIG as PPO_DEFAULT_CONFIG
print(f"PPO's default config is:")
pprint.pprint(PPO_DEFAULT_CONFIG)

# DQN algorithm:
#from ray.rllib.agents.dqn import DEFAULT_CONFIG as DQN_DEFAULT_CONFIG
#print(f"DQN's default config is:")
#pprint.pprint(DQN_DEFAULT_CONFIG)

# Common (all algorithms).
#from ray.rllib.agents.trainer import COMMON_CONFIG
#print(f"RLlib Trainer's default config is:")
#pprint.pprint(COMMON_CONFIG)

PPO's default config is:
{'_fake_gpus': False,
 'batch_mode': 'truncate_episodes',
 'callbacks': <class 'ray.rllib.agents.callbacks.DefaultCallbacks'>,
 'clip_actions': True,
 'clip_param': 0.3,
 'clip_rewards': None,
 'collect_metrics_timeout': 180,
 'compress_observations': False,
 'create_env_on_driver': False,
 'custom_eval_function': None,
 'custom_resources_per_worker': {},
 'eager_tracing': False,
 'entropy_coeff': 0.0,
 'entropy_coeff_schedule': None,
 'env': None,
 'env_config': {},
 'env_task_fn': None,
 'evaluation_config': {},
 'evaluation_interval': None,
 'evaluation_num_episodes': 10,
 'evaluation_num_workers': 0,
 'evaluation_parallel_to_training': False,
 'exploration_config': {'type': 'StochasticSampling'},
 'explore': True,
 'extra_python_environs_for_driver': {},
 'extra_python_environs_for_worker': {},
 'fake_sampler': False,
 'framework': 'tf',
 'gamma': 0.99,
 'grad_clip': None,
 'horizon': None,
 'ignore_worker_failures': False,
 'in_evaluation': False,
 'input'
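
The values printed above are defaults: any key we set in our own config dict takes precedence over them. Here is a minimal sketch of that idea using a plain (shallow) dict merge. The `defaults` dict holds a handful of entries copied from the printout, `my_overrides` is a hypothetical user config with illustrative values, and RLlib's own merging logic (which also has to deal with nested dicts) may differ in detail.

```python
# A few entries copied from the default-config printout above.
defaults = {
    "clip_param": 0.3,
    "entropy_coeff": 0.0,
    "framework": "tf",
    "gamma": 0.99,
}

# Hypothetical user overrides (illustrative values only).
my_overrides = {"gamma": 0.95, "framework": "torch"}

# Later entries win, so user settings override the defaults.
effective = {**defaults, **my_overrides}

# Which keys now differ from the defaults?
changed = {k: v for k, v in effective.items() if defaults.get(k) != v}
print(changed)
```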

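### Let's do a very simple grid-search over two learning rates and two train batch sizes with tune.run().

In particular, we will try the learning rates 0.00005 and 0.5 and the train batch sizes 3000 and 4000 using `tune.grid_search([...])`
inside our config dict. Tune runs one trial per combination of grid values, so the cell below launches four trials.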
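
As a side note before running the cell: every `tune.grid_search([...])` entry in a config multiplies the number of trials, because Tune runs one trial per combination of all grid values. The sketch below mimics that expansion with `itertools.product` in plain Python. `expand_grid` is a hypothetical helper written for illustration, not Tune's actual implementation.

```python
from itertools import product


def expand_grid(grid):
    """Return one config dict per combination of the listed values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


# Grid values from the tune config used in this section.
grid = {
    "train_batch_size": [3000, 4000],
    "lr": [0.00005, 0.5],
}

trials = expand_grid(grid)
for i, trial in enumerate(trials):
    print(i, trial)
```

Two values for each of two keys give 2 x 2 = 4 combinations, which is why four trials show up in the status tables of the run.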

In [36]:
# Plugging in Ray Tune.
# Note that this is the recommended way to run any experiments with RLlib.
# Reasons:
# - Tune allows you to do hyperparameter tuning in a user-friendly way
#   and at large scale!
# - Tune automatically allocates needed resources for the different
#   hyperparam trials and experiment runs on a cluster.

from ray import tune

# Running stuff with tune, we can re-use the exact
# same config that we used when working with RLlib directly!
tune_config = config.copy()

# Let's add our first hyperparameter search via our config.
# How about we try two different learning rates? Let's say 0.00005 and 0.5 (ouch!).
tune_config["lr"] = tune.grid_search([0.00005, 0.5])  # <- 0.5? again: ouch!
# Also grid-search over two train batch sizes (2 x 2 = 4 trials in total).
tune_config["train_batch_size"] = tune.grid_search([3000, 4000])

# Now that we will run things "automatically" through tune, we have to
# define one or more stopping criteria.
# Tune will stop a trial as soon as any single one of the criteria is met (not all of them!).
stop = {
    # Note that the keys used here can be anything present in the above `rllib_trainer.train()` output dict.
    "training_iteration": 5,
    "episode_reward_mean": 20.0,
}

# "PPO" is a registered name that points to RLlib's PPOTrainer.
# See `ray/rllib/agents/registry.py`

# Run a simple experiment until one of the stopping criteria is met.
tune.run(
    "PPO",
    config=tune_config,
    stop=stop,

    # Note that no trainers will be returned from this call here.
    # Tune will create n Trainers internally, run them in parallel and destroy them at the end.
    # However, you can ...
    checkpoint_at_end=True,  # ... create a checkpoint when done.
    checkpoint_freq=10,  # ... create a checkpoint every 10 training iterations.
)

Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_83919_00000,PENDING,,5e-05,3000
PPO_MultiAgentArena_83919_00001,PENDING,,0.5,3000
PPO_MultiAgentArena_83919_00002,PENDING,,5e-05,4000
PPO_MultiAgentArena_83919_00003,PENDING,,0.5,4000


[2m[36m(pid=2675)[0m 2021-06-24 10:45:29,348	INFO trainer.py:671 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=2675)[0m 2021-06-24 10:45:29,348	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=2677)[0m 2021-06-24 10:45:29,348	INFO trainer.py:671 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=2677)[0m 2021-06-24 10:45:29,348	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=2670)[0m 2021-06-24 10:45:29,348	INFO trainer.py:671 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=2670)[0m 2021-06-24 10:45:29,348	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags

Result for PPO_MultiAgentArena_83919_00000:
  agent_timesteps_total: 6000
  custom_metrics: {}
  date: 2021-06-24_10-45-54
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 15.000000000000014
  episode_reward_mean: -8.909999999999993
  episode_reward_min: -34.500000000000036
  episodes_this_iter: 30
  episodes_total: 30
  experiment_id: 43c5f1bf54a446d6ac106c951f38d022
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.3664278984069824
          entropy_coeff: 0.0
          kl: 0.02037949115037918
          model: {}
          policy_loss: -0.05178745463490486
          total_loss: 53.435001373291016
          vf_explained_var: 0.13389408588409424
          vf_loss: 53.48271560668945
      policy2:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.00019999999494757503
   

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_83919_00000,RUNNING,192.168.0.179:2675,5e-05,3000,1.0,10.0594,3000.0,-8.91,15.0,-34.5,100.0
PPO_MultiAgentArena_83919_00001,RUNNING,,0.5,3000,,,,,,,
PPO_MultiAgentArena_83919_00002,RUNNING,,5e-05,4000,,,,,,,
PPO_MultiAgentArena_83919_00003,RUNNING,,0.5,4000,,,,,,,


Result for PPO_MultiAgentArena_83919_00001:
  agent_timesteps_total: 6000
  custom_metrics: {}
  date: 2021-06-24_10-45-54
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 15.000000000000016
  episode_reward_mean: -5.909999999999995
  episode_reward_min: -34.50000000000003
  episodes_this_iter: 30
  episodes_total: 30
  experiment_id: 38bdf32c865b44ec9dfdfa051cfd3bca
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.5
          entropy: 0.078101746737957
          entropy_coeff: 0.0
          kl: 18.170473098754883
          model: {}
          policy_loss: 0.47826075553894043
          total_loss: 56.070186614990234
          vf_explained_var: 0.009776771068572998
          vf_loss: 51.957828521728516
      policy2:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.00019999999494757503
          entropy: 1.34

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_83919_00000,RUNNING,192.168.0.179:2675,5e-05,3000,1,10.0594,3000,-8.91,15.0,-34.5,100
PPO_MultiAgentArena_83919_00001,RUNNING,192.168.0.179:2670,0.5,3000,2,19.8436,6000,-25.16,15.0,-48.0,100
PPO_MultiAgentArena_83919_00002,RUNNING,192.168.0.179:2674,5e-05,4000,1,13.1298,4000,-11.865,5.7,-33.0,100
PPO_MultiAgentArena_83919_00003,RUNNING,192.168.0.179:2677,0.5,4000,1,13.1306,4000,-9.8175,12.0,-34.5,100


Result for PPO_MultiAgentArena_83919_00000:
  agent_timesteps_total: 12000
  custom_metrics: {}
  date: 2021-06-24_10-46-04
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 15.000000000000014
  episode_reward_mean: -5.68499999999999
  episode_reward_min: -34.500000000000036
  episodes_this_iter: 30
  episodes_total: 60
  experiment_id: 43c5f1bf54a446d6ac106c951f38d022
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 4.999999873689376e-05
          entropy: 1.3404359817504883
          entropy_coeff: 0.0
          kl: 0.019646519795060158
          model: {}
          policy_loss: -0.05948462709784508
          total_loss: 29.73061752319336
          vf_explained_var: 0.20065376162528992
          vf_loss: 29.78420639038086
      policy2:
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 0.00019999999494757503
   

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_83919_00000,RUNNING,192.168.0.179:2675,5e-05,3000,2,20.0011,6000,-5.685,15,-34.5,100
PPO_MultiAgentArena_83919_00001,RUNNING,192.168.0.179:2670,0.5,3000,2,19.8436,6000,-25.16,15,-48.0,100
PPO_MultiAgentArena_83919_00002,RUNNING,192.168.0.179:2674,5e-05,4000,2,26.6623,8000,-5.53875,21,-33.0,100
PPO_MultiAgentArena_83919_00003,RUNNING,192.168.0.179:2677,0.5,4000,1,13.1306,4000,-9.8175,12,-34.5,100


Result for PPO_MultiAgentArena_83919_00003:
  agent_timesteps_total: 16000
  custom_metrics: {}
  date: 2021-06-24_10-46-11
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 12.000000000000021
  episode_reward_mean: -27.71625000000003
  episode_reward_min: -46.500000000000064
  episodes_this_iter: 40
  episodes_total: 80
  experiment_id: 777bea0709a34c368f94f4b08cbe8c9b
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 0.5
          entropy: 0.011390337720513344
          entropy_coeff: 0.0
          kl: 1.7135323286056519
          model: {}
          policy_loss: 0.028235359117388725
          total_loss: 87.91030883789062
          vf_explained_var: 0.04217272251844406
          vf_loss: 87.36800384521484
      policy2:
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 0.00019999999494757503
          entropy: 1

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_83919_00000,RUNNING,192.168.0.179:2675,5e-05,3000,3,31.0786,9000,-3.73333,23.1,-34.5,100
PPO_MultiAgentArena_83919_00001,RUNNING,192.168.0.179:2670,0.5,3000,3,31.0258,9000,-31.26,15.0,-48.0,100
PPO_MultiAgentArena_83919_00002,RUNNING,192.168.0.179:2674,5e-05,4000,3,43.4001,12000,-2.088,26.7,-33.0,100
PPO_MultiAgentArena_83919_00003,RUNNING,192.168.0.179:2677,0.5,4000,2,26.8928,8000,-27.7163,12.0,-46.5,100


Result for PPO_MultiAgentArena_83919_00003:
  agent_timesteps_total: 24000
  custom_metrics: {}
  date: 2021-06-24_10-46-28
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 11.699999999999969
  episode_reward_mean: -37.944000000000045
  episode_reward_min: -46.500000000000064
  episodes_this_iter: 40
  episodes_total: 120
  experiment_id: 777bea0709a34c368f94f4b08cbe8c9b
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 0.5
          entropy: 0.0
          entropy_coeff: 0.0
          kl: 3.331962483699158e-09
          model: {}
          policy_loss: -0.0007407463272102177
          total_loss: 84.66165924072266
          vf_explained_var: 0.10813942551612854
          vf_loss: 84.66240692138672
      policy2:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 0.00019999999494757503
          entropy: 1.247179150

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_83919_00000,RUNNING,192.168.0.179:2675,5e-05,3000,5,54.9213,15000,1.089,23.1,-24.0,100
PPO_MultiAgentArena_83919_00001,RUNNING,192.168.0.179:2670,0.5,3000,4,43.4179,12000,-40.512,6.6,-48.0,100
PPO_MultiAgentArena_83919_00002,RUNNING,192.168.0.179:2674,5e-05,4000,3,43.4001,12000,-2.088,26.7,-33.0,100
PPO_MultiAgentArena_83919_00003,RUNNING,192.168.0.179:2677,0.5,4000,3,43.3918,12000,-37.944,11.7,-46.5,100


Result for PPO_MultiAgentArena_83919_00001:
  agent_timesteps_total: 30000
  custom_metrics: {}
  date: 2021-06-24_10-46-40
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -15.299999999999994
  episode_reward_mean: -43.872000000000064
  episode_reward_min: -48.00000000000008
  episodes_this_iter: 30
  episodes_total: 150
  experiment_id: 38bdf32c865b44ec9dfdfa051cfd3bca
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.5
          entropy: 0.004095461219549179
          entropy_coeff: 0.0
          kl: 0.16092397272586823
          model: {}
          policy_loss: 0.051739536225795746
          total_loss: 95.20844268798828
          vf_explained_var: 0.10867179930210114
          vf_loss: 94.99378204345703
      policy2:
        learner_stats:
          cur_kl_coeff: 0.675000011920929
          cur_lr: 0.00019999999494757503
          entropy: 1.

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_83919_00002,RUNNING,192.168.0.179:2674,5e-05,4000,5,68.7927,20000,1.743,27.6,-16.5,100
PPO_MultiAgentArena_83919_00003,RUNNING,192.168.0.179:2677,0.5,4000,4,58.8085,16000,-41.406,-25.5,-46.5,100
PPO_MultiAgentArena_83919_00000,TERMINATED,,5e-05,3000,5,54.9213,15000,1.089,23.1,-24.0,100
PPO_MultiAgentArena_83919_00001,TERMINATED,,0.5,3000,5,55.0482,15000,-43.872,-15.3,-48.0,100


Result for PPO_MultiAgentArena_83919_00003:
  agent_timesteps_total: 40000
  custom_metrics: {}
  date: 2021-06-24_10-46-53
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -21.900000000000006
  episode_reward_mean: -36.06600000000004
  episode_reward_min: -46.500000000000064
  episodes_this_iter: 40
  episodes_total: 200
  experiment_id: 777bea0709a34c368f94f4b08cbe8c9b
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.11249999701976776
          cur_lr: 0.5
          entropy: 0.0
          entropy_coeff: 0.0
          kl: 0.0
          model: {}
          policy_loss: -0.0009185558883473277
          total_loss: 146.60562133789062
          vf_explained_var: 0.07484202086925507
          vf_loss: 146.60653686523438
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
          entropy: 1.092926025390625
          e

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_83919_00000,TERMINATED,,5e-05,3000,5,54.9213,15000,1.089,23.1,-24.0,100
PPO_MultiAgentArena_83919_00001,TERMINATED,,0.5,3000,5,55.0482,15000,-43.872,-15.3,-48.0,100
PPO_MultiAgentArena_83919_00002,TERMINATED,,5e-05,4000,5,68.7927,20000,1.743,27.6,-16.5,100
PPO_MultiAgentArena_83919_00003,TERMINATED,,0.5,4000,5,68.7797,20000,-36.066,-21.9,-46.5,100


2021-06-24 10:46:54,581	INFO tune.py:549 -- Total run time: 91.50 seconds (90.98 seconds for the tuning loop).


<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7fbc14c9b970>
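
The final status table shows all four trials terminating at iteration 5, none of them anywhere near a mean reward of 20. That is the "any single criterion" stopping rule at work. Here is a minimal sketch of that rule, assuming a criterion counts as met once the reported value reaches or exceeds its threshold. `should_stop` is a hypothetical helper, not Tune's implementation.

```python
def should_stop(result, stop):
    """Return True as soon as any single stopping criterion is met."""
    return any(result[key] >= threshold for key, threshold in stop.items())


stop = {"training_iteration": 5, "episode_reward_mean": 20.0}

# Values taken from the status tables above (trial ..._00000).
after_iter_3 = {"training_iteration": 3, "episode_reward_mean": -3.73333}
after_iter_5 = {"training_iteration": 5, "episode_reward_mean": 1.089}

print(should_stop(after_iter_3, stop))
print(should_stop(after_iter_5, stop))
```

Under this rule, a trial that reached a mean reward of 20.0 before iteration 5 would have stopped early as well.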

### Why did we use 12 CPUs in the tune run above (3 CPUs per trial)?

PPO - by default - uses 2 "rollout" workers (`num_workers=2`). These are Ray Actors that have their own environment copy(ies) and step through those in parallel. On top of these two "rollout" workers, every Trainer in RLlib always also has a "local" worker, which - in the case of PPO - handles the learning updates. This gives us 3 workers (2 rollout + 1 local learner), which require 3 CPUs per trial. With the four trials of our grid search (2 learning rates x 2 train batch sizes) running in parallel, that adds up to 12 CPUs.
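
The CPU arithmetic above can be written down as a tiny helper. This is only a sketch under the assumption stated in the text, namely 1 CPU per worker; `cpus_per_trial` and `total_cpus` are hypothetical names, not RLlib or Tune functions. The total scales with the number of trials that run at the same time: two concurrent trials need 6 CPUs, and the four trials shown in the status tables above need 12 if they all run at once.

```python
def cpus_per_trial(num_workers):
    """Rollout workers plus the one local worker (assumes 1 CPU each)."""
    return num_workers + 1


def total_cpus(num_workers, concurrent_trials):
    """CPUs needed when `concurrent_trials` trials run at the same time."""
    return cpus_per_trial(num_workers) * concurrent_trials


print(cpus_per_trial(2))
print(total_cpus(2, concurrent_trials=2))
print(total_cpus(2, concurrent_trials=4))
```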

## Exercise No 2

<hr />

Using the `tune_config` that we have built so far, let's run another `tune.run()`, but apply the following changes to our setup this time:
- Set up only 1 learning rate under the "lr" config key. Choose the (seemingly) best value from the run in the previous cell (the one that yielded the highest avg. reward).
- Set up only 1 train batch size under the "train_batch_size" config key. Choose the (seemingly) best value from the run in the previous cell (the one that yielded the highest avg. reward).
- Set `num_workers` to 5, which will allow us to run more environment "rollouts" in parallel and to collect training batches more quickly.
- Set the `num_envs_per_worker` config parameter to 5. This will clone our env on each rollout worker, and thus batch the action-computing forward passes through our neural networks.

Other than that, use the exact same args as in our `tune.run()` call in the previous cell.
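
To get a feel for what these two settings do to sampling, here is a rough back-of-the-envelope sketch in plain Python. The train batch size of 4000 is an assumption (one of the two values tried earlier), the episode length of 100 is the `episode_len_mean` reported in the results, and the numbers are averages only. RLlib's actual sampling logic is more involved.

```python
num_workers = 5            # rollout workers, as set in the exercise
num_envs_per_worker = 5    # env copies per rollout worker
train_batch_size = 4000    # assumption: one of the two values tried earlier
episode_len = 100          # episode_len_mean reported in the results

# Total number of environment copies stepped in parallel.
total_envs = num_workers * num_envs_per_worker

# Average number of env steps each copy contributes to one train batch.
steps_per_env = train_batch_size / total_envs

# Rough number of episodes each copy completes per training iteration.
episodes_per_env = steps_per_env / episode_len

print(total_envs, steps_per_env, episodes_per_env)
```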

**Good luck! :)**


In [None]:
# !LIVE CODING!

# Solution to Exercise #2

# Run for longer this time (up to 200 iterations) and try to reach a reward of 60.0 (sum of both agents).
stop = {
    "training_iteration": 200,  # we have the 15min break now to run this many iterations
    "episode_reward_mean": 60.0,  # sum of both agents' rewards. Probably won't reach it, but we should try nevertheless :)
}

# tune_config.update({
# ???
# })

# analysis = tune.run(...)

Trial name,status,loc
PPO_MultiAgentArena_62cb1_00000,PENDING,


[2m[36m(pid=2801)[0m 2021-06-24 10:51:43,603	INFO trainer.py:671 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=2801)[0m 2021-06-24 10:51:43,604	INFO trainer.py:696 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=2801)[0m 2021-06-24 10:51:56,770	INFO trainable.py:101 -- Trainable.setup took 13.167 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-06-24_10-52-02
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 17.999999999999936
  episode_reward_mean: -8.483999999999995
  episode_reward_min: -34.500000000000036
  episodes_this_iter: 25
  episodes_total: 25
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 1.3588711023330688
          entropy_coeff: 0.0
          kl: 0.027996812015771866
          model: {}
          policy_loss: -0.05669461190700531
          total_loss: 44.654815673828125
          vf_explained_var: 0.09232431650161743
          vf_loss: 44.70591354370117
      policy2:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.00019999999494757503
  

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,1,5.33662,4000,-8.484,18,-34.5,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 24000
  custom_metrics: {}
  date: 2021-06-24_10-52-11
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 22.499999999999943
  episode_reward_mean: -5.075999999999994
  episode_reward_min: -37.50000000000004
  episodes_this_iter: 25
  episodes_total: 100
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 9.999999747378752e-05
          entropy: 1.2776226997375488
          entropy_coeff: 0.0
          kl: 0.026900572702288628
          model: {}
          policy_loss: -0.07335038483142853
          total_loss: 29.200061798095703
          vf_explained_var: 0.20351383090019226
          vf_loss: 29.261306762695312
      policy2:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 0.00019999999494757503


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,3,14.4419,12000,-5.076,22.5,-37.5,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 40000
  custom_metrics: {}
  date: 2021-06-24_10-52-20
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 21.599999999999973
  episode_reward_mean: -1.2689999999999875
  episode_reward_min: -27.00000000000003
  episodes_this_iter: 50
  episodes_total: 200
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 1.2463394403457642
          entropy_coeff: 0.0
          kl: 0.015143339522182941
          model: {}
          policy_loss: -0.05386582016944885
          total_loss: 24.41973876953125
          vf_explained_var: 0.2661151587963104
          vf_loss: 24.458274841308594
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
   

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,5,23.8853,20000,-1.269,21.6,-27,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 56000
  custom_metrics: {}
  date: 2021-06-24_10-52-30
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 23.099999999999948
  episode_reward_mean: 1.3830000000000056
  episode_reward_min: -19.499999999999993
  episodes_this_iter: 50
  episodes_total: 275
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 1.2091331481933594
          entropy_coeff: 0.0
          kl: 0.016251813620328903
          model: {}
          policy_loss: -0.058222174644470215
          total_loss: 26.836854934692383
          vf_explained_var: 0.4003700613975525
          vf_loss: 26.878620147705078
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,7,33.1465,28000,1.383,23.1,-19.5,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 72000
  custom_metrics: {}
  date: 2021-06-24_10-52-39
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 26.99999999999995
  episode_reward_mean: 3.495000000000006
  episode_reward_min: -12.599999999999982
  episodes_this_iter: 50
  episodes_total: 350
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 1.1770472526550293
          entropy_coeff: 0.0
          kl: 0.015331901609897614
          model: {}
          policy_loss: -0.05267131328582764
          total_loss: 34.16590118408203
          vf_explained_var: 0.384956032037735
          vf_loss: 34.20304870605469
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
       

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,9,42.7056,36000,3.495,27,-12.6,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 88000
  custom_metrics: {}
  date: 2021-06-24_10-52-49
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 28.799999999999955
  episode_reward_mean: 6.1229999999999976
  episode_reward_min: -21.00000000000001
  episodes_this_iter: 25
  episodes_total: 425
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 1.1281288862228394
          entropy_coeff: 0.0
          kl: 0.017284387722611427
          model: {}
          policy_loss: -0.06095132231712341
          total_loss: 31.74726104736328
          vf_explained_var: 0.42851197719573975
          vf_loss: 31.790712356567383
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
   

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,11,51.9055,44000,6.123,28.8,-21,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 104000
  custom_metrics: {}
  date: 2021-06-24_10-52-58
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 28.799999999999955
  episode_reward_mean: 8.879999999999994
  episode_reward_min: -11.099999999999996
  episodes_this_iter: 25
  episodes_total: 500
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 1.0732851028442383
          entropy_coeff: 0.0
          kl: 0.01739729940891266
          model: {}
          policy_loss: -0.05874611437320709
          total_loss: 33.32643127441406
          vf_explained_var: 0.38476911187171936
          vf_loss: 33.36756134033203
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
    

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,13,61.2109,52000,8.88,28.8,-11.1,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 120000
  custom_metrics: {}
  date: 2021-06-24_10-53-07
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 34.499999999999915
  episode_reward_mean: 12.90299999999998
  episode_reward_min: -15.899999999999984
  episodes_this_iter: 50
  episodes_total: 600
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 1.041711688041687
          entropy_coeff: 0.0
          kl: 0.01560800801962614
          model: {}
          policy_loss: -0.05298454314470291
          total_loss: 45.75339126586914
          vf_explained_var: 0.2644711434841156
          vf_loss: 45.79058074951172
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
      

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,15,70.4017,60000,12.903,34.5,-15.9,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 136000
  custom_metrics: {}
  date: 2021-06-24_10-53-16
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 37.49999999999991
  episode_reward_mean: 16.301999999999964
  episode_reward_min: -11.999999999999993
  episodes_this_iter: 50
  episodes_total: 675
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.9933565258979797
          entropy_coeff: 0.0
          kl: 0.015695845708251
          model: {}
          policy_loss: -0.05434282869100571
          total_loss: 37.19157791137695
          vf_explained_var: 0.33458060026168823
          vf_loss: 37.230037689208984
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
     

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,17,79.5874,68000,16.302,37.5,-12,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 152000
  custom_metrics: {}
  date: 2021-06-24_10-53-26
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 35.099999999999916
  episode_reward_mean: 18.67799999999995
  episode_reward_min: -6.899999999999993
  episodes_this_iter: 50
  episodes_total: 750
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.9335061311721802
          entropy_coeff: 0.0
          kl: 0.014588603749871254
          model: {}
          policy_loss: -0.05031830444931984
          total_loss: 39.424007415771484
          vf_explained_var: 0.3828079104423523
          vf_loss: 39.459564208984375
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
   

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,19,88.6267,76000,18.678,35.1,-6.9,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 168000
  custom_metrics: {}
  date: 2021-06-24_10-53-34
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 39.89999999999991
  episode_reward_mean: 21.638999999999932
  episode_reward_min: -6.899999999999993
  episodes_this_iter: 25
  episodes_total: 825
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.8775365948677063
          entropy_coeff: 0.0
          kl: 0.01357054989784956
          model: {}
          policy_loss: -0.04833472520112991
          total_loss: 49.6264533996582
          vf_explained_var: 0.36908942461013794
          vf_loss: 49.66104507446289
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
      

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,21,97.1875,84000,21.639,39.9,-6.9,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 184000
  custom_metrics: {}
  date: 2021-06-24_10-53-43
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 39.89999999999991
  episode_reward_mean: 24.497999999999923
  episode_reward_min: -8.399999999999977
  episodes_this_iter: 25
  episodes_total: 900
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.851176381111145
          entropy_coeff: 0.0
          kl: 0.012521426193416119
          model: {}
          policy_loss: -0.041859906166791916
          total_loss: 60.12490463256836
          vf_explained_var: 0.2846677005290985
          vf_loss: 60.15409469604492
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
     

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,23,105.738,92000,24.498,39.9,-8.4,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 200000
  custom_metrics: {}
  date: 2021-06-24_10-53-52
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 44.999999999999915
  episode_reward_mean: 25.328999999999922
  episode_reward_min: 4.199999999999992
  episodes_this_iter: 50
  episodes_total: 1000
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.8086280822753906
          entropy_coeff: 0.0
          kl: 0.013364194892346859
          model: {}
          policy_loss: -0.04560176655650139
          total_loss: 53.232032775878906
          vf_explained_var: 0.3239186406135559
          vf_loss: 53.26410675048828
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
   

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,25,114.389,100000,25.329,45,4.2,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 216000
  custom_metrics: {}
  date: 2021-06-24_10-54-00
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.799999999999926
  episode_reward_mean: 27.206999999999915
  episode_reward_min: 3.2999999999999816
  episodes_this_iter: 50
  episodes_total: 1075
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.7687411308288574
          entropy_coeff: 0.0
          kl: 0.011752902530133724
          model: {}
          policy_loss: -0.03996681049466133
          total_loss: 58.88492202758789
          vf_explained_var: 0.3884660005569458
          vf_loss: 58.912986755371094
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
  

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,27,123.24,108000,27.207,46.8,3.3,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 232000
  custom_metrics: {}
  date: 2021-06-24_10-54-10
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.799999999999926
  episode_reward_mean: 28.493999999999915
  episode_reward_min: 3.2999999999999816
  episodes_this_iter: 50
  episodes_total: 1150
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.7267895340919495
          entropy_coeff: 0.0
          kl: 0.012627107091248035
          model: {}
          policy_loss: -0.042782749980688095
          total_loss: 47.68418884277344
          vf_explained_var: 0.45486703515052795
          vf_loss: 47.71418762207031
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,29,132.337,116000,28.494,46.8,3.3,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 248000
  custom_metrics: {}
  date: 2021-06-24_10-54-18
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.2999999999999
  episode_reward_mean: 29.750999999999912
  episode_reward_min: 12.899999999999961
  episodes_this_iter: 25
  episodes_total: 1225
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.6840970516204834
          entropy_coeff: 0.0
          kl: 0.01279220636934042
          model: {}
          policy_loss: -0.039032094180583954
          total_loss: 65.87602233886719
          vf_explained_var: 0.3669774830341339
          vf_loss: 65.90210723876953
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
     

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,31,141.034,124000,29.751,48.3,12.9,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 264000
  custom_metrics: {}
  date: 2021-06-24_10-54-27
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.2999999999999
  episode_reward_mean: 29.228999999999917
  episode_reward_min: 10.800000000000022
  episodes_this_iter: 25
  episodes_total: 1300
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.677323579788208
          entropy_coeff: 0.0
          kl: 0.011933168396353722
          model: {}
          policy_loss: -0.03782849758863449
          total_loss: 71.7448501586914
          vf_explained_var: 0.32996243238449097
          vf_loss: 71.77059173583984
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
      

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,33,149.908,132000,29.229,48.3,10.8,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 280000
  custom_metrics: {}
  date: 2021-06-24_10-54-37
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.49999999999989
  episode_reward_mean: 29.945999999999913
  episode_reward_min: 10.499999999999964
  episodes_this_iter: 50
  episodes_total: 1400
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.6566652059555054
          entropy_coeff: 0.0
          kl: 0.012123221531510353
          model: {}
          policy_loss: -0.03603876009583473
          total_loss: 62.44966506958008
          vf_explained_var: 0.3267621695995331
          vf_loss: 62.473426818847656
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
   

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,35,159.129,140000,29.946,46.5,10.5,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 296000
  custom_metrics: {}
  date: 2021-06-24_10-54-46
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.49999999999989
  episode_reward_mean: 30.701999999999916
  episode_reward_min: 12.60000000000002
  episodes_this_iter: 50
  episodes_total: 1475
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.6108099818229675
          entropy_coeff: 0.0
          kl: 0.010560435242950916
          model: {}
          policy_loss: -0.03370382636785507
          total_loss: 62.26347351074219
          vf_explained_var: 0.4298911988735199
          vf_loss: 62.2864875793457
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
      

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,37,168.375,148000,30.702,46.5,12.6,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 312000
  custom_metrics: {}
  date: 2021-06-24_10-54-55
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.29999999999991
  episode_reward_mean: 31.331999999999912
  episode_reward_min: 14.999999999999996
  episodes_this_iter: 50
  episodes_total: 1550
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.5867441296577454
          entropy_coeff: 0.0
          kl: 0.011585080996155739
          model: {}
          policy_loss: -0.03752788156270981
          total_loss: 87.87799835205078
          vf_explained_var: 0.4259708523750305
          vf_loss: 87.90380859375
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
       

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,39,177.752,156000,31.332,48.3,15,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 328000
  custom_metrics: {}
  date: 2021-06-24_10-55-05
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 53.39999999999992
  episode_reward_mean: 30.854999999999922
  episode_reward_min: -1.1999999999999975
  episodes_this_iter: 25
  episodes_total: 1625
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.574537456035614
          entropy_coeff: 0.0
          kl: 0.009072775021195412
          model: {}
          policy_loss: -0.031153691932559013
          total_loss: 108.16924285888672
          vf_explained_var: 0.3942277729511261
          vf_loss: 108.19120025634766
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,41,186.895,164000,30.855,53.4,-1.2,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 344000
  custom_metrics: {}
  date: 2021-06-24_10-55-14
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.29999999999989
  episode_reward_mean: 29.840999999999926
  episode_reward_min: 13.49999999999992
  episodes_this_iter: 25
  episodes_total: 1700
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.5857378244400024
          entropy_coeff: 0.0
          kl: 0.010605060495436192
          model: {}
          policy_loss: -0.03383101895451546
          total_loss: 84.67359924316406
          vf_explained_var: 0.3525088131427765
          vf_loss: 84.69670104980469
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
     

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,43,196.254,172000,29.841,45.3,13.5,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 360000
  custom_metrics: {}
  date: 2021-06-24_10-55-23
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.39999999999989
  episode_reward_mean: 30.533999999999914
  episode_reward_min: 6.900000000000023
  episodes_this_iter: 50
  episodes_total: 1800
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.5727860331535339
          entropy_coeff: 0.0
          kl: 0.01140167098492384
          model: {}
          policy_loss: -0.033886298537254333
          total_loss: 76.70872497558594
          vf_explained_var: 0.3266090452671051
          vf_loss: 76.73106384277344
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
     

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,45,205.494,180000,30.534,47.4,6.9,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 376000
  custom_metrics: {}
  date: 2021-06-24_10-55-32
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.7999999999999
  episode_reward_mean: 29.687999999999917
  episode_reward_min: 14.099999999999909
  episodes_this_iter: 50
  episodes_total: 1875
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.5390828251838684
          entropy_coeff: 0.0
          kl: 0.009557900950312614
          model: {}
          policy_loss: -0.03181135281920433
          total_loss: 55.03185272216797
          vf_explained_var: 0.41093891859054565
          vf_loss: 55.05398178100586
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
    

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,47,214.497,188000,29.688,49.8,14.1,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 392000
  custom_metrics: {}
  date: 2021-06-24_10-55-42
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.7999999999999
  episode_reward_mean: 30.25499999999991
  episode_reward_min: 9.899999999999965
  episodes_this_iter: 50
  episodes_total: 1950
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.5224382281303406
          entropy_coeff: 0.0
          kl: 0.00951691810041666
          model: {}
          policy_loss: -0.031399473547935486
          total_loss: 48.411170959472656
          vf_explained_var: 0.5059530138969421
          vf_loss: 48.43292999267578
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
      

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,49,223.802,196000,30.255,49.8,9.9,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 408000
  custom_metrics: {}
  date: 2021-06-24_10-55-51
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 43.4999999999999
  episode_reward_mean: 30.74399999999991
  episode_reward_min: 7.499999999999952
  episodes_this_iter: 25
  episodes_total: 2025
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.4954160153865814
          entropy_coeff: 0.0
          kl: 0.009320042096078396
          model: {}
          policy_loss: -0.029491031542420387
          total_loss: 54.5516471862793
          vf_explained_var: 0.4740511476993561
          vf_loss: 54.57169723510742
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
       

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,51,233.284,204000,30.744,43.5,7.5,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 424000
  custom_metrics: {}
  date: 2021-06-24_10-56-01
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 45.899999999999906
  episode_reward_mean: 32.897999999999904
  episode_reward_min: 19.199999999999896
  episodes_this_iter: 25
  episodes_total: 2100
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.49919164180755615
          entropy_coeff: 0.0
          kl: 0.011284520849585533
          model: {}
          policy_loss: -0.03782995045185089
          total_loss: 71.1130599975586
          vf_explained_var: 0.34547683596611023
          vf_loss: 71.13946533203125
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
  

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,53,242.821,212000,32.898,45.9,19.2,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 440000
  custom_metrics: {}
  date: 2021-06-24_10-56-11
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.699999999999896
  episode_reward_mean: 31.58399999999991
  episode_reward_min: 12.299999999999967
  episodes_this_iter: 50
  episodes_total: 2200
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.46220383048057556
          entropy_coeff: 0.0
          kl: 0.008790775202214718
          model: {}
          policy_loss: -0.02926849201321602
          total_loss: 59.04888153076172
          vf_explained_var: 0.3530977964401245
          vf_loss: 59.06924057006836
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
   

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,55,252.517,220000,31.584,47.7,12.3,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 456000
  custom_metrics: {}
  date: 2021-06-24_10-56-20
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.699999999999896
  episode_reward_mean: 31.42799999999992
  episode_reward_min: 11.699999999999946
  episodes_this_iter: 50
  episodes_total: 2275
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.4620126485824585
          entropy_coeff: 0.0
          kl: 0.012693880125880241
          model: {}
          policy_loss: -0.03664546459913254
          total_loss: 53.39973068237305
          vf_explained_var: 0.49103453755378723
          vf_loss: 53.42353057861328
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
   

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,57,261.982,228000,31.428,47.7,11.7,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 472000
  custom_metrics: {}
  date: 2021-06-24_10-56-30
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.39999999999991
  episode_reward_mean: 31.550999999999913
  episode_reward_min: 15.29999999999998
  episodes_this_iter: 50
  episodes_total: 2350
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.42946454882621765
          entropy_coeff: 0.0
          kl: 0.00793248601257801
          model: {}
          policy_loss: -0.024854931980371475
          total_loss: 84.31121063232422
          vf_explained_var: 0.5293190479278564
          vf_loss: 84.32803344726562
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
    

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,59,271.252,236000,31.551,47.4,15.3,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 488000
  custom_metrics: {}
  date: 2021-06-24_10-56-38
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.39999999999991
  episode_reward_mean: 30.452999999999914
  episode_reward_min: 11.999999999999932
  episodes_this_iter: 25
  episodes_total: 2425
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.44497305154800415
          entropy_coeff: 0.0
          kl: 0.009061883203685284
          model: {}
          policy_loss: -0.027709050104022026
          total_loss: 59.62226104736328
          vf_explained_var: 0.46265682578086853
          vf_loss: 59.64079284667969
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,61,280.066,244000,30.453,47.4,12,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 504000
  custom_metrics: {}
  date: 2021-06-24_10-56-47
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.19999999999992
  episode_reward_mean: 31.199999999999914
  episode_reward_min: 15.599999999999918
  episodes_this_iter: 25
  episodes_total: 2500
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.44672054052352905
          entropy_coeff: 0.0
          kl: 0.009240994229912758
          model: {}
          policy_loss: -0.030118603259325027
          total_loss: 67.60299682617188
          vf_explained_var: 0.3419913053512573
          vf_loss: 67.62374877929688
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
  

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,63,288.884,252000,31.2,49.2,15.6,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 520000
  custom_metrics: {}
  date: 2021-06-24_10-56-56
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 50.3999999999999
  episode_reward_mean: 31.244999999999912
  episode_reward_min: 3.9000000000000035
  episodes_this_iter: 50
  episodes_total: 2600
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.44735029339790344
          entropy_coeff: 0.0
          kl: 0.009497132152318954
          model: {}
          policy_loss: -0.029139278456568718
          total_loss: 48.774818420410156
          vf_explained_var: 0.4081743061542511
          vf_loss: 48.79433822631836
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
  

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,65,297.842,260000,31.245,50.4,3.9,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 536000
  custom_metrics: {}
  date: 2021-06-24_10-57-05
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.9999999999999
  episode_reward_mean: 32.17799999999991
  episode_reward_min: 6.8999999999999275
  episodes_this_iter: 50
  episodes_total: 2675
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.4300624132156372
          entropy_coeff: 0.0
          kl: 0.009933280758559704
          model: {}
          policy_loss: -0.032282616943120956
          total_loss: 53.065765380859375
          vf_explained_var: 0.48740923404693604
          vf_loss: 53.08799362182617
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
   

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,67,306.851,268000,32.178,48,6.9,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 552000
  custom_metrics: {}
  date: 2021-06-24_10-57-15
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.3999999999999
  episode_reward_mean: 32.69999999999992
  episode_reward_min: 17.09999999999996
  episodes_this_iter: 50
  episodes_total: 2750
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.402817964553833
          entropy_coeff: 0.0
          kl: 0.008575302548706532
          model: {}
          policy_loss: -0.03071298636496067
          total_loss: 56.4021110534668
          vf_explained_var: 0.48138228058815
          vf_loss: 56.42414093017578
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,69,316.252,276000,32.7,47.4,17.1,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 568000
  custom_metrics: {}
  date: 2021-06-24_10-57-24
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.1999999999999
  episode_reward_mean: 33.91499999999991
  episode_reward_min: 18.59999999999991
  episodes_this_iter: 25
  episodes_total: 2825
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.41297051310539246
          entropy_coeff: 0.0
          kl: 0.009383268654346466
          model: {}
          policy_loss: -0.027733489871025085
          total_loss: 60.802494049072266
          vf_explained_var: 0.5030959844589233
          vf_loss: 60.82072448730469
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
    

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,71,325.557,284000,33.915,49.2,18.6,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 584000
  custom_metrics: {}
  date: 2021-06-24_10-57-33
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.1999999999999
  episode_reward_mean: 34.14899999999991
  episode_reward_min: 14.999999999999938
  episodes_this_iter: 25
  episodes_total: 2900
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.4195472002029419
          entropy_coeff: 0.0
          kl: 0.009106173180043697
          model: {}
          policy_loss: -0.03179044649004936
          total_loss: 57.41215133666992
          vf_explained_var: 0.36610138416290283
          vf_loss: 57.434722900390625
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
    

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,73,334.431,292000,34.149,49.2,15,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 600000
  custom_metrics: {}
  date: 2021-06-24_10-57-42
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.49999999999993
  episode_reward_mean: 31.904999999999912
  episode_reward_min: 9.899999999999986
  episodes_this_iter: 50
  episodes_total: 3000
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.4204123318195343
          entropy_coeff: 0.0
          kl: 0.008523418568074703
          model: {}
          policy_loss: -0.0244015883654356
          total_loss: 54.88983917236328
          vf_explained_var: 0.3839063048362732
          vf_loss: 54.90561294555664
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
      

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,75,343.435,300000,31.905,46.5,9.9,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 616000
  custom_metrics: {}
  date: 2021-06-24_10-57-52
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 46.1999999999999
  episode_reward_mean: 32.76899999999992
  episode_reward_min: 6.899999999999967
  episodes_this_iter: 50
  episodes_total: 3075
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.3990224599838257
          entropy_coeff: 0.0
          kl: 0.010732964612543583
          model: {}
          policy_loss: -0.03383614122867584
          total_loss: 50.62503433227539
          vf_explained_var: 0.5071938633918762
          vf_loss: 50.64799880981445
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
       

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,77,352.749,308000,32.769,46.2,6.9,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 632000
  custom_metrics: {}
  date: 2021-06-24_10-58-01
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 51.2999999999999
  episode_reward_mean: 31.00799999999992
  episode_reward_min: -6.300000000000022
  episodes_this_iter: 50
  episodes_total: 3150
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.3911486268043518
          entropy_coeff: 0.0
          kl: 0.013790293596684933
          model: {}
          policy_loss: -0.03714088723063469
          total_loss: 62.425235748291016
          vf_explained_var: 0.5381317734718323
          vf_loss: 62.44841003417969
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
     

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,79,362.2,316000,31.008,51.3,-6.3,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 648000
  custom_metrics: {}
  date: 2021-06-24_10-58-11
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.89999999999991
  episode_reward_mean: 32.60399999999992
  episode_reward_min: 10.499999999999938
  episodes_this_iter: 25
  episodes_total: 3225
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.398970365524292
          entropy_coeff: 0.0
          kl: 0.007967749610543251
          model: {}
          policy_loss: -0.0246298648416996
          total_loss: 43.50117111206055
          vf_explained_var: 0.5558068156242371
          vf_loss: 43.51772689819336
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
       

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,81,371.583,324000,32.604,48.9,10.5,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 664000
  custom_metrics: {}
  date: 2021-06-24_10-58-20
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.799999999999905
  episode_reward_mean: 32.879999999999924
  episode_reward_min: 14.399999999999912
  episodes_this_iter: 25
  episodes_total: 3300
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.4005577266216278
          entropy_coeff: 0.0
          kl: 0.007322691846638918
          model: {}
          policy_loss: -0.02241870015859604
          total_loss: 64.23497772216797
          vf_explained_var: 0.40708014369010925
          vf_loss: 64.2499771118164
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
   

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,83,380.695,332000,32.88,49.8,14.4,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 680000
  custom_metrics: {}
  date: 2021-06-24_10-58-29
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 51.599999999999916
  episode_reward_mean: 31.073999999999913
  episode_reward_min: 7.800000000000004
  episodes_this_iter: 50
  episodes_total: 3400
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.4029429256916046
          entropy_coeff: 0.0
          kl: 0.009470459073781967
          model: {}
          policy_loss: -0.030153706669807434
          total_loss: 45.164791107177734
          vf_explained_var: 0.4209146797657013
          vf_loss: 45.18535614013672
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
  

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,85,389.851,340000,31.074,51.6,7.8,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 696000
  custom_metrics: {}
  date: 2021-06-24_10-58-38
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.899999999999906
  episode_reward_mean: 31.28699999999991
  episode_reward_min: 7.800000000000004
  episodes_this_iter: 50
  episodes_total: 3475
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.3852783143520355
          entropy_coeff: 0.0
          kl: 0.007708707824349403
          model: {}
          policy_loss: -0.024134930223226547
          total_loss: 40.1035041809082
          vf_explained_var: 0.5831844806671143
          vf_loss: 40.11983871459961
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
     

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,87,398.882,348000,31.287,48.9,7.8,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 712000
  custom_metrics: {}
  date: 2021-06-24_10-58-47
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.199999999999896
  episode_reward_mean: 32.27399999999991
  episode_reward_min: 4.799999999999987
  episodes_this_iter: 50
  episodes_total: 3550
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.3577474355697632
          entropy_coeff: 0.0
          kl: 0.008355529978871346
          model: {}
          policy_loss: -0.024146242067217827
          total_loss: 52.31120681762695
          vf_explained_var: 0.5799316167831421
          vf_loss: 52.32689666748047
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
    

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,89,408.136,356000,32.274,49.2,4.8,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 728000
  custom_metrics: {}
  date: 2021-06-24_10-58-57
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.199999999999896
  episode_reward_mean: 31.472999999999914
  episode_reward_min: 2.0999999999999424
  episodes_this_iter: 25
  episodes_total: 3625
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.33933261036872864
          entropy_coeff: 0.0
          kl: 0.007521684747189283
          model: {}
          policy_loss: -0.02475820481777191
          total_loss: 61.76087188720703
          vf_explained_var: 0.4999489486217499
          vf_loss: 61.77800750732422
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
  

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,91,417.614,364000,31.473,49.2,2.1,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 744000
  custom_metrics: {}
  date: 2021-06-24_10-59-07
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 50.09999999999991
  episode_reward_mean: 34.27499999999992
  episode_reward_min: 7.7999999999999226
  episodes_this_iter: 25
  episodes_total: 3700
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.3324660658836365
          entropy_coeff: 0.0
          kl: 0.0072219776920974255
          model: {}
          policy_loss: -0.023308219388127327
          total_loss: 54.98113250732422
          vf_explained_var: 0.5101220011711121
          vf_loss: 54.99712371826172
      policy2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00019999999494757503
   

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,93,427.214,372000,34.275,50.1,7.8,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 760000
  custom_metrics: {}
  date: 2021-06-24_10-59-16
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 47.09999999999991
  episode_reward_mean: 33.47999999999991
  episode_reward_min: 14.39999999999999
  episodes_this_iter: 50
  episodes_total: 3800
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.3454279601573944
          entropy_coeff: 0.0
          kl: 0.0136724216863513
          model: {}
          policy_loss: -0.030320780351758003
          total_loss: 43.55768966674805
          vf_explained_var: 0.5542909502983093
          vf_loss: 43.57417678833008
      policy2:
        learner_stats:
          cur_kl_coeff: 0.5062500238418579
          cur_lr: 0.00019999999494757503
       

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,95,436.704,380000,33.48,47.1,14.4,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 776000
  custom_metrics: {}
  date: 2021-06-24_10-59-26
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.5999999999999
  episode_reward_mean: 34.073999999999906
  episode_reward_min: 1.4999999999999774
  episodes_this_iter: 50
  episodes_total: 3875
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.3241569995880127
          entropy_coeff: 0.0
          kl: 0.0070323823019862175
          model: {}
          policy_loss: -0.02379133738577366
          total_loss: 41.69140625
          vf_explained_var: 0.6324723362922668
          vf_loss: 41.70808792114258
      policy2:
        learner_stats:
          cur_kl_coeff: 0.5062500238418579
          cur_lr: 0.00019999999494757503
          

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,97,446.346,388000,34.074,48.6,1.5,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 792000
  custom_metrics: {}
  date: 2021-06-24_10-59-35
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.59999999999991
  episode_reward_mean: 35.078999999999915
  episode_reward_min: 1.4999999999999774
  episodes_this_iter: 50
  episodes_total: 3950
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.29948750138282776
          entropy_coeff: 0.0
          kl: 0.006608506198972464
          model: {}
          policy_loss: -0.01910887286067009
          total_loss: 37.096317291259766
          vf_explained_var: 0.6903815865516663
          vf_loss: 37.108734130859375
      policy2:
        learner_stats:
          cur_kl_coeff: 0.5062500238418579
          cur_lr: 0.00019999999494757503
 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,99,455.577,396000,35.079,48.6,1.5,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 808000
  custom_metrics: {}
  date: 2021-06-24_10-59-44
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 52.79999999999991
  episode_reward_mean: 34.18499999999992
  episode_reward_min: 15.899999999999894
  episodes_this_iter: 25
  episodes_total: 4025
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.3147125542163849
          entropy_coeff: 0.0
          kl: 0.008687067776918411
          model: {}
          policy_loss: -0.027713540941476822
          total_loss: 43.66475296020508
          vf_explained_var: 0.644365668296814
          vf_loss: 43.683677673339844
      policy2:
        learner_stats:
          cur_kl_coeff: 0.5062500238418579
          cur_lr: 0.00019999999494757503
    

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,101,464.716,404000,34.185,52.8,15.9,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 824000
  custom_metrics: {}
  date: 2021-06-24_10-59-54
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 52.79999999999991
  episode_reward_mean: 34.57199999999991
  episode_reward_min: 20.699999999999907
  episodes_this_iter: 25
  episodes_total: 4100
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.31303897500038147
          entropy_coeff: 0.0
          kl: 0.006495901849120855
          model: {}
          policy_loss: -0.0222522784024477
          total_loss: 43.90480422973633
          vf_explained_var: 0.6080071926116943
          vf_loss: 43.92047882080078
      policy2:
        learner_stats:
          cur_kl_coeff: 0.5062500238418579
          cur_lr: 0.00019999999494757503
     

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,103,473.809,412000,34.572,52.8,20.7,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 840000
  custom_metrics: {}
  date: 2021-06-24_11-00-03
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 51.29999999999991
  episode_reward_mean: 34.49099999999991
  episode_reward_min: -33.000000000000036
  episodes_this_iter: 50
  episodes_total: 4200
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.30664438009262085
          entropy_coeff: 0.0
          kl: 0.008208895102143288
          model: {}
          policy_loss: -0.024369115009903908
          total_loss: 42.06644058227539
          vf_explained_var: 0.5696028470993042
          vf_loss: 42.08250045776367
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
  

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,105,483.166,420000,34.491,51.3,-33,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 856000
  custom_metrics: {}
  date: 2021-06-24_11-00-12
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.89999999999991
  episode_reward_mean: 35.67899999999991
  episode_reward_min: 4.200000000000008
  episodes_this_iter: 50
  episodes_total: 4275
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.2875642478466034
          entropy_coeff: 0.0
          kl: 0.008471362292766571
          model: {}
          policy_loss: -0.025084299966692924
          total_loss: 41.3225212097168
          vf_explained_var: 0.6639360189437866
          vf_loss: 41.33902359008789
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
      

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,107,492.429,428000,35.679,48.9,4.2,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 872000
  custom_metrics: {}
  date: 2021-06-24_11-00-21
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 48.299999999999905
  episode_reward_mean: 34.34699999999991
  episode_reward_min: 4.200000000000008
  episodes_this_iter: 50
  episodes_total: 4350
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.27596229314804077
          entropy_coeff: 0.0
          kl: 0.007240485865622759
          model: {}
          policy_loss: -0.02397489733994007
          total_loss: 38.14629364013672
          vf_explained_var: 0.7113078832626343
          vf_loss: 38.162940979003906
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
   

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,109,501.481,436000,34.347,48.3,4.2,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 888000
  custom_metrics: {}
  date: 2021-06-24_11-00-30
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.1999999999999
  episode_reward_mean: 35.05499999999991
  episode_reward_min: 12.29999999999995
  episodes_this_iter: 25
  episodes_total: 4425
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.2788536250591278
          entropy_coeff: 0.0
          kl: 0.007655322086066008
          model: {}
          policy_loss: -0.023204611614346504
          total_loss: 33.8001708984375
          vf_explained_var: 0.7406017184257507
          vf_loss: 33.81562423706055
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
       

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,111,510.418,444000,35.055,49.2,12.3,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 904000
  custom_metrics: {}
  date: 2021-06-24_11-00-39
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.1999999999999
  episode_reward_mean: 35.94599999999991
  episode_reward_min: 11.999999999999922
  episodes_this_iter: 25
  episodes_total: 4500
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.2914312779903412
          entropy_coeff: 0.0
          kl: 0.006899161729961634
          model: {}
          policy_loss: -0.022130368277430534
          total_loss: 52.92919158935547
          vf_explained_var: 0.5927540063858032
          vf_loss: 52.9443359375
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
         

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,113,519.259,452000,35.946,49.2,12,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 920000
  custom_metrics: {}
  date: 2021-06-24_11-00-49
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 53.99999999999991
  episode_reward_mean: 37.07999999999991
  episode_reward_min: 8.69999999999994
  episodes_this_iter: 50
  episodes_total: 4600
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.275748074054718
          entropy_coeff: 0.0
          kl: 0.006656138692051172
          model: {}
          policy_loss: -0.023216459900140762
          total_loss: 37.434017181396484
          vf_explained_var: 0.6401604413986206
          vf_loss: 37.450496673583984
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
     

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,115,528.431,460000,37.08,54,8.7,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 936000
  custom_metrics: {}
  date: 2021-06-24_11-00-58
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 50.399999999999906
  episode_reward_mean: 35.49299999999991
  episode_reward_min: 5.999999999999941
  episodes_this_iter: 50
  episodes_total: 4675
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.2570823132991791
          entropy_coeff: 0.0
          kl: 0.006995275616645813
          model: {}
          policy_loss: -0.021396106109023094
          total_loss: 33.54477310180664
          vf_explained_var: 0.73509681224823
          vf_loss: 33.559085845947266
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
     

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,117,537.483,468000,35.493,50.4,6,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 952000
  custom_metrics: {}
  date: 2021-06-24_11-01-07
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 54.59999999999991
  episode_reward_mean: 37.48199999999991
  episode_reward_min: 12.599999999999913
  episodes_this_iter: 50
  episodes_total: 4750
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.24991481006145477
          entropy_coeff: 0.0
          kl: 0.00670434208586812
          model: {}
          policy_loss: -0.019589873030781746
          total_loss: 43.31212615966797
          vf_explained_var: 0.6904425024986267
          vf_loss: 43.32492446899414
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
    

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,119,546.484,476000,37.482,54.6,12.6,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 968000
  custom_metrics: {}
  date: 2021-06-24_11-01-16
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 55.19999999999991
  episode_reward_mean: 34.37099999999992
  episode_reward_min: 12.29999999999992
  episodes_this_iter: 25
  episodes_total: 4825
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.24578088521957397
          entropy_coeff: 0.0
          kl: 0.005851077847182751
          model: {}
          policy_loss: -0.016021618619561195
          total_loss: 43.45042037963867
          vf_explained_var: 0.6269910931587219
          vf_loss: 43.46051788330078
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
    

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,121,555.499,484000,34.371,55.2,12.3,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 984000
  custom_metrics: {}
  date: 2021-06-24_11-01-25
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 50.6999999999999
  episode_reward_mean: 33.48599999999991
  episode_reward_min: 11.99999999999994
  episodes_this_iter: 25
  episodes_total: 4900
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.2574528753757477
          entropy_coeff: 0.0
          kl: 0.007298425305634737
          model: {}
          policy_loss: -0.021373813971877098
          total_loss: 40.427467346191406
          vf_explained_var: 0.5343409776687622
          vf_loss: 40.44144821166992
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
     

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,123,564.467,492000,33.486,50.7,12,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 1000000
  custom_metrics: {}
  date: 2021-06-24_11-01-34
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 57.5999999999999
  episode_reward_mean: 34.322999999999915
  episode_reward_min: 4.199999999999958
  episodes_this_iter: 50
  episodes_total: 5000
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.27957624197006226
          entropy_coeff: 0.0
          kl: 0.006263553164899349
          model: {}
          policy_loss: -0.01936435140669346
          total_loss: 66.02657318115234
          vf_explained_var: 0.4892638027667999
          vf_loss: 66.03959655761719
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
    

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,125,573.437,500000,34.323,57.6,4.2,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 1016000
  custom_metrics: {}
  date: 2021-06-24_11-01-43
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 57.5999999999999
  episode_reward_mean: 35.429999999999914
  episode_reward_min: 8.099999999999937
  episodes_this_iter: 50
  episodes_total: 5075
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.2499667853116989
          entropy_coeff: 0.0
          kl: 0.006388661451637745
          model: {}
          policy_loss: -0.022287975996732712
          total_loss: 52.9800910949707
          vf_explained_var: 0.6502175331115723
          vf_loss: 52.995914459228516
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
    

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,127,582.255,508000,35.43,57.6,8.1,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 1032000
  custom_metrics: {}
  date: 2021-06-24_11-01-52
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 51.5999999999999
  episode_reward_mean: 35.06099999999991
  episode_reward_min: 8.099999999999937
  episodes_this_iter: 50
  episodes_total: 5150
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.24276036024093628
          entropy_coeff: 0.0
          kl: 0.005688406992703676
          model: {}
          policy_loss: -0.015478793531656265
          total_loss: 42.58594512939453
          vf_explained_var: 0.6779042482376099
          vf_loss: 42.59566116333008
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
    

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,129,591.18,516000,35.061,51.6,8.1,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 1048000
  custom_metrics: {}
  date: 2021-06-24_11-02-01
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 51.5999999999999
  episode_reward_mean: 35.74499999999991
  episode_reward_min: 3.9000000000000195
  episodes_this_iter: 25
  episodes_total: 5225
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.23224569857120514
          entropy_coeff: 0.0
          kl: 0.00638602813705802
          model: {}
          policy_loss: -0.017932292073965073
          total_loss: 48.474735260009766
          vf_explained_var: 0.6207736730575562
          vf_loss: 48.4862060546875
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
    

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,131,599.921,524000,35.745,51.6,3.9,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 1064000
  custom_metrics: {}
  date: 2021-06-24_11-02-09
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 51.5999999999999
  episode_reward_mean: 35.042999999999914
  episode_reward_min: 3.9000000000000195
  episodes_this_iter: 25
  episodes_total: 5300
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.23823557794094086
          entropy_coeff: 0.0
          kl: 0.005204926244914532
          model: {}
          policy_loss: -0.016981763765215874
          total_loss: 48.62496566772461
          vf_explained_var: 0.586117148399353
          vf_loss: 48.636680603027344
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
  

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,133,608.723,532000,35.043,51.6,3.9,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 1080000
  custom_metrics: {}
  date: 2021-06-24_11-02-19
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 49.49999999999991
  episode_reward_mean: 37.77899999999991
  episode_reward_min: 23.69999999999991
  episodes_this_iter: 50
  episodes_total: 5400
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 9.999999747378752e-05
          entropy: 0.25082454085350037
          entropy_coeff: 0.0
          kl: 0.005223363172262907
          model: {}
          policy_loss: -0.014521021395921707
          total_loss: 39.5062141418457
          vf_explained_var: 0.628329873085022
          vf_loss: 39.51543426513672
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
     

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,135,617.747,540000,37.779,49.5,23.7,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 1096000
  custom_metrics: {}
  date: 2021-06-24_11-02-28
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 51.29999999999991
  episode_reward_mean: 37.56599999999991
  episode_reward_min: 22.499999999999908
  episodes_this_iter: 50
  episodes_total: 5475
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.5062500238418579
          cur_lr: 9.999999747378752e-05
          entropy: 0.223600372672081
          entropy_coeff: 0.0
          kl: 0.011608666740357876
          model: {}
          policy_loss: -0.022319426760077477
          total_loss: 37.95277404785156
          vf_explained_var: 0.759234607219696
          vf_loss: 37.96921920776367
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
     

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,137,626.961,548000,37.566,51.3,22.5,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 1104000
  custom_metrics: {}
  date: 2021-06-24_11-02-34
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 51.29999999999991
  episode_reward_mean: 37.658999999999914
  episode_reward_min: 22.499999999999908
  episodes_this_iter: 25
  episodes_total: 5500
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.5062500238418579
          cur_lr: 9.999999747378752e-05
          entropy: 0.23901215195655823
          entropy_coeff: 0.0
          kl: 0.009055706672370434
          model: {}
          policy_loss: -0.020144682377576828
          total_loss: 51.1915397644043
          vf_explained_var: 0.5503359436988831
          vf_loss: 51.207096099853516
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,138,633.012,552000,37.659,51.3,22.5,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 1120000
  custom_metrics: {}
  date: 2021-06-24_11-02-44
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 50.99999999999991
  episode_reward_mean: 38.29799999999991
  episode_reward_min: 20.09999999999993
  episodes_this_iter: 50
  episodes_total: 5600
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.5062500238418579
          cur_lr: 9.999999747378752e-05
          entropy: 0.24074704945087433
          entropy_coeff: 0.0
          kl: 0.010952427051961422
          model: {}
          policy_loss: -0.022378819063305855
          total_loss: 40.17567443847656
          vf_explained_var: 0.6254825592041016
          vf_loss: 40.1925163269043
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
    

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,140,643.354,560000,38.298,51,20.1,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 1128000
  custom_metrics: {}
  date: 2021-06-24_11-02-50
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 50.99999999999991
  episode_reward_mean: 38.28899999999991
  episode_reward_min: 16.499999999999932
  episodes_this_iter: 25
  episodes_total: 5625
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.5062500238418579
          cur_lr: 9.999999747378752e-05
          entropy: 0.2217584103345871
          entropy_coeff: 0.0
          kl: 0.009751363657414913
          model: {}
          policy_loss: -0.01812124252319336
          total_loss: 32.837345123291016
          vf_explained_var: 0.7351718544960022
          vf_loss: 32.85053253173828
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
   

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,141,648.498,564000,38.289,51,16.5,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 1136000
  custom_metrics: {}
  date: 2021-06-24_11-02-56
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 50.699999999999896
  episode_reward_mean: 38.963999999999906
  episode_reward_min: 8.999999999999956
  episodes_this_iter: 50
  episodes_total: 5675
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.5062500238418579
          cur_lr: 9.999999747378752e-05
          entropy: 0.21710826456546783
          entropy_coeff: 0.0
          kl: 0.012960165739059448
          model: {}
          policy_loss: -0.026138702407479286
          total_loss: 35.3424186706543
          vf_explained_var: 0.7727813720703125
          vf_loss: 35.362003326416016
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,142,654.532,568000,38.964,50.7,9,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 1144000
  custom_metrics: {}
  date: 2021-06-24_11-03-01
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 51.59999999999991
  episode_reward_mean: 38.780999999999906
  episode_reward_min: 8.999999999999956
  episodes_this_iter: 25
  episodes_total: 5700
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.5062500238418579
          cur_lr: 9.999999747378752e-05
          entropy: 0.22592586278915405
          entropy_coeff: 0.0
          kl: 0.011361761949956417
          model: {}
          policy_loss: -0.023921573534607887
          total_loss: 44.94271469116211
          vf_explained_var: 0.5940679311752319
          vf_loss: 44.96088790893555
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
  

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,143,659.929,572000,38.781,51.6,9,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 1152000
  custom_metrics: {}
  date: 2021-06-24_11-03-06
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 52.79999999999992
  episode_reward_mean: 36.83399999999992
  episode_reward_min: 8.999999999999956
  episodes_this_iter: 50
  episodes_total: 5750
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.5062500238418579
          cur_lr: 9.999999747378752e-05
          entropy: 0.19761675596237183
          entropy_coeff: 0.0
          kl: 0.010123346000909805
          model: {}
          policy_loss: -0.020393038168549538
          total_loss: 56.9569091796875
          vf_explained_var: 0.6359074115753174
          vf_loss: 56.9721794128418
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
     

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,144,664.966,576000,36.834,52.8,9,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 1168000
  custom_metrics: {}
  date: 2021-06-24_11-03-16
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 51.2999999999999
  episode_reward_mean: 37.94099999999992
  episode_reward_min: 11.700000000000003
  episodes_this_iter: 25
  episodes_total: 5825
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.5062500238418579
          cur_lr: 9.999999747378752e-05
          entropy: 0.20331810414791107
          entropy_coeff: 0.0
          kl: 0.013396869413554668
          model: {}
          policy_loss: -0.024723723530769348
          total_loss: 52.11165237426758
          vf_explained_var: 0.6604058742523193
          vf_loss: 52.129600524902344
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
  

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,146,674.553,584000,37.941,51.3,11.7,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 1184000
  custom_metrics: {}
  date: 2021-06-24_11-03-25
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 52.19999999999991
  episode_reward_mean: 38.666999999999916
  episode_reward_min: 21.299999999999933
  episodes_this_iter: 25
  episodes_total: 5900
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.5062500238418579
          cur_lr: 9.999999747378752e-05
          entropy: 0.21368350088596344
          entropy_coeff: 0.0
          kl: 0.011372498236596584
          model: {}
          policy_loss: -0.019039282575249672
          total_loss: 60.702919006347656
          vf_explained_var: 0.5923455357551575
          vf_loss: 60.7161979675293
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,148,683.482,592000,38.667,52.2,21.3,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 1200000
  custom_metrics: {}
  date: 2021-06-24_11-03-34
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 53.99999999999991
  episode_reward_mean: 38.91299999999992
  episode_reward_min: 15.599999999999943
  episodes_this_iter: 50
  episodes_total: 6000
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.5062500238418579
          cur_lr: 9.999999747378752e-05
          entropy: 0.2088169902563095
          entropy_coeff: 0.0
          kl: 0.013482782989740372
          model: {}
          policy_loss: -0.02733447402715683
          total_loss: 46.111915588378906
          vf_explained_var: 0.656093180179596
          vf_loss: 46.132423400878906
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
   

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,150,692.304,600000,38.913,54,15.6,100


Result for PPO_MultiAgentArena_62cb1_00000:
  agent_timesteps_total: 1216000
  custom_metrics: {}
  date: 2021-06-24_11-03-43
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 54.299999999999905
  episode_reward_mean: 39.91199999999991
  episode_reward_min: 19.499999999999908
  episodes_this_iter: 50
  episodes_total: 6075
  experiment_id: 3d6fd9cc179a437092430c6672b67fdc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      policy1:
        learner_stats:
          cur_kl_coeff: 0.5062500238418579
          cur_lr: 9.999999747378752e-05
          entropy: 0.2036026567220688
          entropy_coeff: 0.0
          kl: 0.011194639839231968
          model: {}
          policy_loss: -0.021081706508994102
          total_loss: 42.769195556640625
          vf_explained_var: 0.7206318378448486
          vf_loss: 42.78461837768555
      policy2:
        learner_stats:
          cur_kl_coeff: 0.7593749761581421
          cur_lr: 0.00019999999494757503
 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_62cb1_00000,RUNNING,192.168.0.179:2801,152,701.189,608000,39.912,54.3,19.5,100


------------------
## 15 min break :)
------------------


(while the above experiment is running (and hopefully learning))


## How do we extract any checkpoint from a trial of a tune.run?

In [None]:
# The previous tune.run (the one we did before the exercise) returned an Analysis object, from which we can access any checkpoint
# (given we set checkpoint_freq or checkpoint_at_end to reasonable values) like so:
print(analysis)
# Get all trials.
trials = analysis.trials
# Assuming the first trial was the best, extract that trial's best checkpoint:
best_checkpoint = analysis.get_best_checkpoint(trial=trials[0], mode="max")
print(f"Found best checkpoint for the first trial: {best_checkpoint}")

# Undo the grid-search config, which RLlib doesn't understand.
rllib_config = tune_config.copy()
rllib_config["lr"] = 0.00005
rllib_config["train_batch_size"] = 4000

# Restore a RLlib Trainer from the checkpoint.
new_trainer = PPOTrainer(config=rllib_config)
new_trainer.restore(best_checkpoint)
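
To make the selection step less opaque, here is a minimal pure-Python sketch of what picking the "best" checkpoint with `mode="max"` amounts to conceptually. The `pick_best_checkpoint` helper, the paths, and the metric values are hypothetical and chosen for illustration only; this is not how Tune implements `get_best_checkpoint` internally.

```python
from typing import List, Tuple


def pick_best_checkpoint(checkpoints: List[Tuple[str, float]], mode: str = "max") -> str:
    """Return the path whose metric value is the largest ("max") or smallest ("min").

    `checkpoints` is a list of (path, metric_value) pairs.
    Hypothetical helper for illustration, not part of Ray Tune.
    """
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    if not checkpoints:
        raise ValueError("no checkpoints to choose from")
    chooser = max if mode == "max" else min
    best_path, _ = chooser(checkpoints, key=lambda pair: pair[1])
    return best_path


# Made-up example data: (checkpoint path, mean episode reward at that checkpoint).
example_checkpoints = [
    ("checkpoint_000010/checkpoint-10", 12.5),
    ("checkpoint_000020/checkpoint-20", 31.0),
    ("checkpoint_000030/checkpoint-30", 27.4),
]
best = pick_best_checkpoint(example_checkpoints, mode="max")
print(best)
```

With `mode="min"` the same helper would return the checkpoint with the lowest metric value, which is what you would want for a loss-like metric.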

In [None]:
out = Output()
display.display(out)

with out:
    obs = env.reset()
    while True:
        a1 = new_trainer.compute_action(obs["agent1"], policy_id="policy1")
        a2 = new_trainer.compute_action(obs["agent2"], policy_id="policy2")
        actions = {"agent1": a1, "agent2": a2}
        obs, rewards, dones, _ = env.step(actions)

        out.clear_output(wait=True)
        env.render()
        time.sleep(0.07)

        if dones["agent1"] is True:
            break
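
The loop above relies on the multi-agent convention of exchanging per-agent dicts for observations, actions, rewards, and done flags. The sketch below reproduces that pattern with a hypothetical `ToyTwoAgentEnv` (not part of RLlib, and much simpler than `MultiAgentArena`) and random actions instead of a trained policy. It also uses the special `"__all__"` key in the `dones` dict, which RLlib's multi-agent API uses to signal that the whole episode is over.

```python
import random
from typing import Dict


class ToyTwoAgentEnv:
    """Hypothetical two-agent env using per-agent dicts; episodes last `max_steps` steps."""

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps
        self.timestep = 0

    def reset(self) -> Dict[str, int]:
        self.timestep = 0
        return {"agent1": 0, "agent2": 0}

    def step(self, actions: Dict[str, int]):
        self.timestep += 1
        # Toy dynamics: the observation is just the current timestep.
        obs = {agent_id: self.timestep for agent_id in actions}
        # Toy rewards: each agent is rewarded with the value of its own action.
        rewards = {agent_id: float(action) for agent_id, action in actions.items()}
        done = self.timestep >= self.max_steps
        dones = {"agent1": done, "agent2": done, "__all__": done}
        return obs, rewards, dones, {}


rng = random.Random(0)
toy_env = ToyTwoAgentEnv(max_steps=5)
obs = toy_env.reset()
num_steps = 0
while True:
    # One (random) action per agent that currently has an observation.
    actions = {agent_id: rng.randrange(4) for agent_id in obs}
    obs, rewards, dones, _ = toy_env.step(actions)
    num_steps += 1
    if dones["__all__"]:
        break
print(num_steps)
```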


## Let's talk about customization options

### Deep Dive: How do we customize RLlib's RL loop?

RLlib offers a callbacks API that allows you to add custom behavior to
all major events during the environment sampling and learning process.

**Our problem:** So far, we can only see standard stats, such as rewards, episode lengths, etc.
These do not always give us enough insight into important questions, such as: How many times
have both agents collided? Or: How many times has agent1 discovered a new field?

In the following cell, we will create custom callback "hooks" that allow us to
add these stats to the returned metrics dict, so that they will also be displayed in TensorBoard!

For that we will override RLlib's DefaultCallbacks class and implement the
`on_episode_start`, `on_episode_step`, and `on_episode_end` methods therein:


In [None]:
# Override the DefaultCallbacks with your own and implement any methods (hooks)
# that you need.
from ray.rllib.agents.callbacks import DefaultCallbacks
from ray.rllib.evaluation.episode import MultiAgentEpisode


class MyCallbacks(DefaultCallbacks):
    def on_episode_start(self,
                         *,
                         worker,
                         base_env,
                         policies,
                         episode: MultiAgentEpisode,
                         env_index,
                         **kwargs):
        # We will use the `MultiAgentEpisode` object being passed into
        # all episode-related callbacks. It comes with a user_data property (dict),
        # which we can write arbitrary data into.

        # At the end of an episode, we'll transfer that data into the `custom_metrics`
        # property to make sure our custom data is displayed in TensorBoard.

        # The episode is starting:
        # Set per-episode counter for how many new fields agent1 has discovered.
        episode.user_data["new_fields_discovered"] = 0
        # Set per-episode collision counter (how many times has agent2 blocked agent1?).
        episode.user_data["num_collisions"] = 0

    def on_episode_step(self,
                        *,
                        worker,
                        base_env,
                        episode: MultiAgentEpisode,
                        env_index,
                        **kwargs):
        # Get both rewards.
        ag1_r = episode.prev_reward_for("agent1")
        ag2_r = episode.prev_reward_for("agent2")

        # Agent1 discovered a new field.
        if ag1_r == 1.0:
            episode.user_data["new_fields_discovered"] += 1
        # Collision.
        elif ag2_r == 1.0:
            episode.user_data["num_collisions"] += 1

    def on_episode_end(self,
                       *,
                       worker,
                       base_env,
                       policies,
                       episode: MultiAgentEpisode,
                       env_index,
                       **kwargs):
        # Episode is done:
        # Write the per-episode counters (scalar values) to `custom_metrics`,
        # so that they become visible in TensorBoard.
        episode.custom_metrics["new_fields_discovered"] = episode.user_data["new_fields_discovered"]
        episode.custom_metrics["num_collisions"] = episode.user_data["num_collisions"]
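
To the best of our knowledge, RLlib does not report these per-episode values one by one. It aggregates them over the episodes of a training iteration and reports keys such as `num_collisions_mean`, `num_collisions_min`, and `num_collisions_max`. The sketch below mimics that aggregation in plain Python; `summarize_custom_metrics` is a hypothetical helper and the episode numbers are made up.

```python
from statistics import mean
from typing import Dict, List


def summarize_custom_metrics(episodes: List[Dict[str, float]]) -> Dict[str, float]:
    """Collapse per-episode scalars into <key>_mean, <key>_min, and <key>_max entries."""
    summary: Dict[str, float] = {}
    keys = sorted({key for episode in episodes for key in episode})
    for key in keys:
        values = [episode[key] for episode in episodes if key in episode]
        summary[f"{key}_mean"] = mean(values)
        summary[f"{key}_min"] = min(values)
        summary[f"{key}_max"] = max(values)
    return summary


# Made-up per-episode counters, as `on_episode_end` might have written them.
episodes = [
    {"new_fields_discovered": 30, "num_collisions": 2},
    {"new_fields_discovered": 42, "num_collisions": 0},
    {"new_fields_discovered": 36, "num_collisions": 4},
]
summary = summarize_custom_metrics(episodes)
print(summary)
```

When reading the TensorBoard charts, keep in mind that each point is therefore a statistic over several episodes, not a single episode's counter.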


In [None]:
# Setting up our config to point to our new custom callbacks class:
config = {
    "env": MultiAgentArena,
    "callbacks": MyCallbacks,  # by default, this would point to `rllib.agents.callbacks.DefaultCallbacks`, which does nothing.
}

tune.run(
    "PPO",
    config=config,
    stop={"training_iteration": 20},
    checkpoint_at_end=True,
    # If you'd like to restore the tune run from an existing checkpoint file, you can do the following:
    #restore="/Users/sven/ray_results/PPO/PPO_MultiAgentArena_fd451_00000_0_2021-05-25_15-13-26/checkpoint_000010/checkpoint-10",
)

### Let's check TensorBoard for the new custom metrics!

1. Head over to the Anyscale project view and click on the "TensorBoard" button:

<img src="images/tensorboard_button.png" width=1000>

Alternatively - if you ran this locally on your own machine:

1. Head over to ~/ray_results/PPO/PPO_MultiAgentArena_[some key]_00000_0_[date]_[time]/
1. In that directory, you should see an `events.out....` file.
1. Run `tensorboard --logdir .` and head to http://localhost:6006
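
If you are unsure which trial directory holds the logs, a small script can list candidate files. This is a minimal sketch that assumes TensorBoard event files follow the usual `events.out.tfevents.*` naming and that results live under `~/ray_results/PPO`; `find_event_files` is a hypothetical helper, not part of Ray or TensorBoard.

```python
import glob
import os
from typing import List


def find_event_files(results_root: str) -> List[str]:
    """Return all TensorBoard event files found below `results_root`, sorted by path."""
    pattern = os.path.join(
        os.path.expanduser(results_root), "**", "events.out.tfevents.*")
    return sorted(glob.glob(pattern, recursive=True))


if __name__ == "__main__":
    # Prints nothing if the directory does not exist or holds no event files.
    for path in find_event_files("~/ray_results/PPO"):
        print(path)
```

The directory containing one of the printed files is the one to pass to `tensorboard --logdir`.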

<img src="images/tensorboard.png" width=800>


### Deep Dive: Providing your custom Models in tf or torch.

In [11]:
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.utils.framework import try_import_tf, try_import_torch

tf1, tf, tf_version = try_import_tf()
torch, nn = try_import_torch()


# Custom Neural Network Models.
class MyKerasModel(TFModelV2):
    """Custom model for policy gradient algorithms."""

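    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name):
        """Build a simple MLP with one hidden layer of 16 units (+ value branch)."""
        super(MyKerasModel, self).__init__(obs_space, action_space,
                                           num_outputs, model_config, name)
        # Set here so that `value_function()` can check whether `forward()` has run.
        self.cur_value = None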
        
        # Keras Input layer.
        self.inputs = tf.keras.layers.Input(
            shape=obs_space.shape, name="observations")

        # Hidden layer (shared by action logits outputs and value output).
        layer_1 = tf.keras.layers.Dense(
            16,
            name="layer1",
            activation=tf.nn.relu)(self.inputs)
        
        # Action logits output.
        logits = tf.keras.layers.Dense(
            num_outputs,
            name="out",
            activation=None)(layer_1)

        # "Value"-branch (single node output).
        # Used by several RLlib algorithms (e.g. PPO) to calculate an observation's value.
        value_out = tf.keras.layers.Dense(
            1,
            name="value",
            activation=None)(layer_1)

        # The actual Keras model:
        self.base_model = tf.keras.Model(self.inputs,
                                         [logits, value_out])

    def forward(self, input_dict, state, seq_lens):
        """Custom-define your forward pass logic here."""
        # Pass inputs through our 2 layers and calculate the "value"
        # of the observation and store it for when `value_function` is called.
        logits, self.cur_value = self.base_model(input_dict["obs"])
        return logits, state

    def value_function(self):
        """Implement the value branch forward pass logic here:
        
        We will just return the already calculated `self.cur_value`.
        """
        assert self.cur_value is not None, "Must call `forward()` first!"
        return tf.reshape(self.cur_value, [-1])


class MyTorchModel(TorchModelV2, nn.Module):
    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name):
        """Build a simple MLP with one hidden layer of 16 units (+ value branch)."""
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)

        self.device = torch.device("cuda"
                                   if torch.cuda.is_available() else "cpu")

        # Hidden layer (shared by action logits outputs and value output).
        self.layer_1 = nn.Linear(obs_space.shape[0], 16).to(self.device)

        # Action logits output.
        self.layer_out = nn.Linear(16, num_outputs).to(self.device)

        # "Value"-branch (single node output).
        # Used by several RLlib algorithms (e.g. PPO) to calculate an observation's value.
        self.value_branch = nn.Linear(16, 1).to(self.device)
        self.cur_value = None

    def forward(self, input_dict, state, seq_lens):
        """Custom-define your forward pass logic here."""
        # Pass inputs through our 2 layers.
        layer_1_out = self.layer_1(input_dict["obs"])
        logits = self.layer_out(layer_1_out)

        # Calculate the "value" of the observation and store it for
        # when `value_function` is called.
        self.cur_value = self.value_branch(layer_1_out).squeeze(1)

        return logits, state

    def value_function(self):
        """Implement the value branch forward pass logic here:
        
        We will just return the already calculated `self.cur_value`.
        """
        assert self.cur_value is not None, "Must call `forward()` first!"
        return self.cur_value
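
Both classes follow the same pattern: one shared hidden layer feeds two heads, `forward()` returns the action logits and caches the value estimates, and `value_function()` hands back the cached values. The framework-free sketch below illustrates only that pattern and the resulting shapes. `TinySharedTrunkModel` and its helpers are hypothetical, use plain Python lists instead of tensors, and are not meant as a drop-in RLlib model.

```python
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def make_layer(n_in: int, n_out: int, rng: random.Random) -> Tuple[Matrix, Vector]:
    """Create small random weights (n_out x n_in) and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x: Vector, layer: Tuple[Matrix, Vector], relu: bool = False) -> Vector:
    """Apply y = W x + b, optionally followed by a ReLU."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out


class TinySharedTrunkModel:
    """One shared hidden layer feeding a logits head and a value head."""

    def __init__(self, obs_dim: int, num_outputs: int, hidden: int = 16, seed: int = 0):
        rng = random.Random(seed)
        self.hidden_layer = make_layer(obs_dim, hidden, rng)
        self.logits_head = make_layer(hidden, num_outputs, rng)
        self.value_head = make_layer(hidden, 1, rng)
        self.cur_values = None

    def forward(self, obs_batch: List[Vector]) -> List[Vector]:
        hidden = [dense(obs, self.hidden_layer, relu=True) for obs in obs_batch]
        # Cache one scalar value per observation for `value_function()`.
        self.cur_values = [dense(h, self.value_head)[0] for h in hidden]
        return [dense(h, self.logits_head) for h in hidden]

    def value_function(self) -> Vector:
        assert self.cur_values is not None, "Must call `forward()` first!"
        return self.cur_values


model = TinySharedTrunkModel(obs_dim=2, num_outputs=2)
logits = model.forward([[0.5, 0.5], [-1.0, 1.0], [0.0, 0.25]])
values = model.value_function()
# Batch of 3 observations: 3 rows of logits (2 each) and 3 value estimates.
print(len(logits), len(logits[0]), len(values))
```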


In [20]:
# Do a quick test on the custom model classes.
test_model_tf = MyKerasModel(
    obs_space=gym.spaces.Box(-1.0, 1.0, (2, )),
    action_space=None,
    num_outputs=2,
    model_config={},
    name="MyModel",
)

print("TF-output={}".format(test_model_tf({"obs": np.array([[0.5, 0.5]])})))

# For PyTorch, you can do:
#test_model_torch = MyTorchModel(
#    obs_space=gym.spaces.Box(-1.0, 1.0, (2, )),
#    action_space=None,
#    num_outputs=2,
#    model_config={},
#    name="MyModel",
#)
#print("Torch-output={}".format(test_model_torch({"obs": torch.from_numpy(np.array([[0.5, 0.5]], dtype=np.float32))})))


TF-output=(<tf.Tensor 'model_6/out/BiasAdd:0' shape=(1, 2) dtype=float32>, [])


In [21]:
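# Set up our custom model and re-run the experiment.
# Note: this cell reuses the `config` dict from the callbacks cell above and expects a
# `stop` dict to exist as well; run those cells first, otherwise this cell raises a NameError.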
config.update({
    "model": {
        "custom_model": MyKerasModel,
        "custom_model_config": {
            #"layers": [128, 128],
        },
    },
    # Revert these to single trials (and use those hyperparams that performed well in our Exercise #2).
    "lr": 0.0005,
    "train_batch_size": 2000,
})

tune.run("PPO", config=config, stop=stop)



### Deep Dive: A closer look at RLlib's APIs and components
#### (Depending on the time left and the number of questions that have accumulated :)

We already took a quick look inside an RLlib Trainer object and extracted its Policy(ies) and the Policy's model (neural network). Here is a much more detailed overview of what's inside a Trainer object.

At the core is the so-called `WorkerSet` sitting under `Trainer.workers`. A WorkerSet is a group of `RolloutWorker` (`rllib.evaluation.rollout_worker.py`) objects that always consists of a "local worker" (`Trainer.workers.local_worker()`) and n "remote workers" (`Trainer.workers.remote_workers()`).



<img src="images/rllib_structure.png" width=1000>

### Scaling RLlib

Scaling RLlib works by parallelizing the "jobs" that the remote `RolloutWorkers` do. In a vanilla RL algorithm, like PPO, DQN, and many others, the `@ray.remote` labeled RolloutWorkers in the figure above are responsible for interacting with one or more environments and thereby collecting experiences. Observations are produced by the environment; actions are then computed by the copy of the Policy(ies) located on the remote worker and sent to the environment, which produces the next observation. This cycle repeats continuously and is only interrupted from time to time to send experience batches ("train batches") of a certain size to the "local worker". There, these batches are used to call `Policy.learn_on_batch()`, which performs a loss calculation followed by a model weights update. The updated weights are then broadcast back to all the remote workers.
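
The sample, learn, and broadcast cycle described above can be sketched in a few lines of plain Python. This is a toy illustration under loud assumptions: `ToyWorker`, `ToyLocalWorker`, and `train_one_iteration` are hypothetical names, threads stand in for Ray actors, the "weights" are a single float, and the "update" is a fixed step rather than a real loss calculation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


class ToyWorker:
    """Stand-in for a rollout worker holding a copy of the policy weights."""

    def __init__(self, worker_id: int):
        self.worker_id = worker_id
        self.weights = 0.0

    def sample(self, num_steps: int) -> List[float]:
        # Pretend each "experience" is one number that depends on the weights.
        return [self.weights + self.worker_id for _ in range(num_steps)]

    def set_weights(self, weights: float) -> None:
        self.weights = weights


class ToyLocalWorker(ToyWorker):
    def learn_on_batch(self, batch: List[float]) -> float:
        # Toy "update": nudge the weights by a fixed step per train batch.
        self.weights += 0.1
        return self.weights


def train_one_iteration(local: ToyLocalWorker, remotes: List[ToyWorker],
                        steps_per_worker: int) -> int:
    # 1) Collect experiences from all remote workers in parallel.
    with ThreadPoolExecutor(max_workers=len(remotes)) as pool:
        fragments = list(pool.map(lambda w: w.sample(steps_per_worker), remotes))
    # 2) Concatenate the fragments into one train batch and learn on it locally.
    train_batch = [x for fragment in fragments for x in fragment]
    new_weights = local.learn_on_batch(train_batch)
    # 3) Broadcast the updated weights back to all remote workers.
    for worker in remotes:
        worker.set_weights(new_weights)
    return len(train_batch)


local_worker = ToyLocalWorker(worker_id=0)
remote_workers = [ToyWorker(worker_id=i) for i in range(1, 4)]
batch_sizes = [
    train_one_iteration(local_worker, remote_workers, steps_per_worker=5)
    for _ in range(3)
]
print(batch_sizes)
```

Adding more remote workers in this sketch increases the amount of experience collected per iteration, while the learning step stays on the single local worker, which mirrors the division of labor described above.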



## Time for Q&A

...

## Thank you for listening and participating!

### Here are a couple of links that you may find useful.

- The <a href="https://github.com/sven1977/rllib_tutorials.git">GitHub repo of this tutorial</a>.
- <a href="https://docs.ray.io/en/master/rllib.html">RLlib's documentation main page</a>.
- <a href="http://discuss.ray.io">Our Discourse forum</a> to ask questions on RLlib.
- Our <a href="https://forms.gle/9TSdDYUgxYs8SA9e8">Slack channel</a> for interacting with other Ray RLlib users.
- The <a href="https://github.com/ray-project/ray/blob/master/rllib/examples/">RLlib example scripts folder</a> with tons of examples on how to do different things with RLlib.
- A <a href="https://medium.com/distributed-computing-with-ray/reinforcement-learning-with-rllib-in-the-unity-game-engine-1a98080a7c0d">blog post on training with RLlib inside a Unity3D environment</a>.
