# Hands-on RL with Ray’s RLlib
## A beginner’s tutorial for working with multi-agent environments, models, and algorithms

### Overview
“Hands-on RL with Ray’s RLlib” is a beginner’s tutorial for working with reinforcement learning (RL) environments, models, and algorithms using Ray’s RLlib library. RLlib offers high scalability, a large list of algorithms to choose from (offline, model-based, model-free, etc.), support for TensorFlow and PyTorch, and a unified API for a variety of applications. The tutorial starts with a brief introduction to the core concepts (e.g. why RL) before moving on to RLlib (multi- and single-agent) environments, neural network models, hyperparameter tuning, debugging, student exercises, Q/A, and more. All code will be provided as .py files in a GitHub repo.

### Intended Audience
* Python programmers who want to get started with reinforcement learning and RLlib.

### Prerequisites
* Some Python programming experience.
* Some familiarity with machine learning.
* *Helpful, but not required:* Experience in reinforcement learning and Ray.
* *Helpful, but not required:* Experience with TensorFlow or PyTorch.

### Requirements/Dependencies

First, install conda (https://www.anaconda.com/products/individual), then follow the setup instructions for your platform below.

#### Quick `conda` setup instructions (Mac and Linux):
```
$ conda create -n rllib python=3.8
$ conda activate rllib
$ pip install "ray[rllib]"  # quotes keep shells like zsh from expanding the brackets
$ pip install [tensorflow|torch]  # <- either one works!
$ pip install jupyterlab
```

#### Quick `conda` setup instructions (Win10):
```
$ conda create -n rllib python=3.8
$ conda activate rllib
$ pip install ray[rllib]
$ pip install [tensorflow|torch]  # <- either one works!
$ pip install jupyterlab
$ conda install pywin32
```

Also, for Win10 Atari support, we have to install atari_py from a different source (gym does not support Atari envs on Windows).

```
$ pip install git+https://github.com/Kojoley/atari-py.git
```

### Opening these tutorial files:
```
$ git clone https://github.com/sven1977/rllib_tutorials
$ cd rllib_tutorials
$ jupyter-lab
```

### Key Takeaways
* What is reinforcement learning and why RLlib?
* Core concepts of RLlib: Environments, Trainers, Policies, and Models.
* How to configure, hyperparameter-tune, and parallelize RLlib.
* RLlib debugging best practices.

### Tutorial Outline
1. RL and RLlib in a nutshell.
1. Defining an RL-solvable problem: Our first environment.
1. Exercise No.1 (env loop)
1. Picking an algorithm and training our first RLlib Trainer.
1. Configurations and hyperparameters - Easy tuning with Ray Tune.
1. Fixing our experiment's config - Going multi-agent.
1. The "infinite laptop": Quick intro into how to use RLlib with Anyscale's product.
1. Exercise No.2 (run your own Ray RLlib+Tune experiment)
1. Neural network models - Provide your custom models using tf.keras or torch.nn.
1. Deeper dive into RLlib's parallelization architecture.
1. Specifying different compute resources and parallelization options through our config.
1. "Hacking in": Using callbacks to customize the RL loop and generate our own metrics.
1. Exercise No.3 (write your own custom callback)
1. "Hacking in (part II)" - Debugging with RLlib and PyCharm.
1. Checking on the "infinite laptop" - Did RLlib learn to solve the problem?

### Other Recommended Readings
* [Attention Nets and More with RLlib's Trajectory View API](https://medium.com/distributed-computing-with-ray/attention-nets-and-more-with-rllibs-trajectory-view-api-d326339a6e65)
* [Intro to RLlib: Example Environments](https://medium.com/distributed-computing-with-ray/intro-to-rllib-example-environments-3a113f532c70)
* [Reinforcement Learning with RLlib in the Unity Game Engine](https://medium.com/distributed-computing-with-ray/reinforcement-learning-with-rllib-in-the-unity-game-engine-1a98080a7c0d)


In [5]:
import numpy as np

import ray

# Start a new instance of Ray or connect to an already running one.
ray.init()
# In case you encounter this error during our tutorial:
# RuntimeError: Maybe you called ray.init twice by accident?
# Try: ray.shutdown() or ray.init(ignore_reinit_error=True)

2021-04-29 15:30:53,640	INFO services.py:1267 -- View the Ray dashboard at http://127.0.0.1:8265


{'node_ip_address': '192.168.0.100',
 'raylet_ip_address': '192.168.0.100',
 'redis_address': '192.168.0.100:6379',
 'object_store_address': '/tmp/ray/session_2021-04-29_15-30-52_145625_27624/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-04-29_15-30-52_145625_27624/sockets/raylet',
 'webui_url': '127.0.0.1:8265',
 'session_dir': '/tmp/ray/session_2021-04-29_15-30-52_145625_27624',
 'metrics_export_port': 63009,
 'node_id': 'f9c56d30f3030d5011db17b9f583f5477ac374d682c8820c0477a48e'}

<img src="images/rl-cycle.png" width=1200>

In [6]:
# 2) Coding/defining our "problem" via an RL environment.
# We will use the following (adversarial) multi-agent environment
# throughout this tutorial to demonstrate a large fraction of RLlib's
# APIs, features, and customization options.



<img src="images/environment.png" width=800>

### A word or two on Spaces:

Spaces are used in ML to describe which values the inputs and outputs of a neural network can validly take.

RL environments also use them to describe their valid observations and actions.

Spaces are usually defined by their shape (e.g. 84x84x3 for RGB images) and datatype (e.g. uint8 for RGB values between 0 and 255).
However, spaces can also be composed of other spaces (see Tuple or Dict spaces) or can simply be discrete with n fixed possible values
(represented by integers). For example, in our game, where each agent can only go up/down/left/right, the action space is `Discrete(4)`
(no datatype or shape needs to be defined here).
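
To make the idea concrete, here is a minimal pure-Python sketch of what a discrete space does: it knows which values are valid and can sample one of them. `ToyDiscrete` is a hypothetical stand-in written for illustration only; it is not gym's actual `Discrete` class.

```python
import random


class ToyDiscrete:
    """Hypothetical, simplified stand-in for a discrete space with n values."""

    def __init__(self, n):
        self.n = n

    def contains(self, x):
        # Valid values are the integers 0 .. n-1.
        return isinstance(x, int) and 0 <= x < self.n

    def sample(self):
        # Pick one of the n valid values uniformly at random.
        return random.randrange(self.n)


# Our game's action space: 0=up, 1=right, 2=down, 3=left.
action_space = ToyDiscrete(4)
action = action_space.sample()
print(action, action_space.contains(action), action_space.contains(4))
```

Note that no shape or datatype appears anywhere: the single number `n` fully describes the space.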

<img src="images/spaces.png" width=800>

In [7]:
import gym
from gym.spaces import Discrete, MultiDiscrete
import random

from ray.rllib.env.multi_agent_env import MultiAgentEnv

class MultiAgentArena(MultiAgentEnv):
    def __init__(self, config=None):
        # !LIVE CODING!
        config = config or {}
        self.width = config.get("width", 10)
        self.height = config.get("height", 10)

        # 0=up, 1=right, 2=down, 3=left.
        self.action_space = Discrete(4)
        self.observation_space = MultiDiscrete([self.width * self.height,
                                                self.width * self.height])
        # End an episode after this many timesteps.
        self.timestep_limit = config.get("ts", 100)
        # Reset env.
        self.reset()

    def reset(self):
        # !LIVE CODING!
        # Row-major coords.
        self.agent1_pos = [0, 0]
        self.agent2_pos = [self.height - 1, self.width - 1]
        # Reset agent1's visited states.
        self.agent1_visited_states = set()
        # How many timesteps have we done in this episode.
        self.timesteps = 0

        return self.get_obs()

    def step(self, action: dict):
        # !LIVE CODING!
        self.timesteps += 1
        # Determine who is allowed to move first.
        agent1_first = random.random() > 0.5
        # Move first agent (could be agent 1 or 2).
        if agent1_first:
            r1, r2 = self.move(self.agent1_pos, action["agent1"], is_agent1=True)
            add = self.move(self.agent2_pos, action["agent2"], is_agent1=False)
        else:
            r1, r2 = self.move(self.agent2_pos, action["agent2"], is_agent1=False)
            add = self.move(self.agent1_pos, action["agent1"], is_agent1=True)
        r1 += add[0]
        r2 += add[1]

        obs = self.get_obs()

        reward = {"agent1": r1, "agent2": r2}

        done = self.timesteps >= self.timestep_limit
        done = {"agent1": done, "agent2": done, "__all__": done}

        return obs, reward, done, {}

    def get_obs(self):
        ag1_discrete_pos = self.agent1_pos[0] * self.width + \
            (self.agent1_pos[1] % self.width)
        ag2_discrete_pos = self.agent2_pos[0] * self.width + \
            (self.agent2_pos[1] % self.width)
        return {
            "agent1": np.array([ag1_discrete_pos, ag2_discrete_pos]),
            "agent2": np.array([ag2_discrete_pos, ag1_discrete_pos]),
        }

    def move(self, coords, action, is_agent1):
        orig_coords = coords[:]
        # Change the row: 0=up (-1), 2=down (+1)
        coords[0] += -1 if action == 0 else 1 if action == 2 else 0
        # Change the column: 1=right (+1), 3=left (-1)
        coords[1] += 1 if action == 1 else -1 if action == 3 else 0

        # Resolve collisions.
        # Make sure we don't end up on the other agent's position.
        # If we would, don't move (we are blocked).
        if (is_agent1 and coords == self.agent2_pos) or (not is_agent1 and coords == self.agent1_pos):
            coords[0], coords[1] = orig_coords
            # Agent2 blocked agent1 (agent1 tried to run into agent2)
            # OR Agent2 bumped into agent1 (agent2 tried to run into agent1)
            # -> +1 for agent2; -1 for agent1
            return -1.0, 1.0

        # No agent blocking -> check walls.
        if coords[0] < 0:
            coords[0] = 0
        elif coords[0] >= self.height:
            coords[0] = self.height - 1
        if coords[1] < 0:
            coords[1] = 0
        elif coords[1] >= self.width:
            coords[1] = self.width - 1

        # If agent1 -> +1.0 if new tile covered.
        if is_agent1 and tuple(coords) not in self.agent1_visited_states:
            self.agent1_visited_states.add(tuple(coords))
            return 1.0, -0.1
        # No new tile for agent1 -> Negative reward.
        return -0.5, -0.1

    # Optionally: Add `render` method returning some img.
    def render(self, mode=None):
        return np.random.randint(0, 256, (20, 20, 3), dtype=np.uint8)

Instructions for updating:
non-resource variables are not supported in the long term
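
The `get_obs` method above flattens each agent's row-major `[row, col]` position into a single integer. The sketch below shows that mapping and its inverse in isolation. `encode_pos` and `decode_pos` are hypothetical helper names used for illustration; they are not part of the tutorial's environment.

```python
def encode_pos(row, col, width):
    # Row-major flattening: every full row contributes `width` cells.
    return row * width + col


def decode_pos(index, width):
    # Inverse mapping: recover (row, col) from the flat index.
    return divmod(index, width)


width = 10
# agent2 starts in the bottom-right corner of a 10x10 grid.
idx = encode_pos(9, 9, width)
print(idx)                     # 99
print(decode_pos(idx, width))  # (9, 9)
```

With a 10x10 grid, each agent's position therefore takes one of 100 values, which matches the two entries of size `width * height` in the `MultiDiscrete` observation space.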


## Exercise No.1

Write an "environment loop" using our `MultiAgentArena` class.

1. Create an env object.
1. `reset` your environment to get the first (initial) observation.
1. `step` through the environment using a provided
   "DummyTrainer.compute_action([obs])" method to compute action dicts (see cell below, in which you can create a DummyTrainer object and query it for random actions).
1. When an episode is done, remember to `reset()` your environment before the next call to `step()`.
1. If this feels too easy ;) , try to extract each agent's reward, sum it up over one episode, and print each agent's accumulated reward (also called the "return") at the end of the episode (when done=True).

**Good luck! :)**


In [8]:
class DummyTrainer:
    """Dummy Trainer class used in Exercise #1.

    Use its `compute_action` method to get a new action, given some environment
    observation.
    """

    def compute_action(self, obs):
        # Returns a random action.
        return {
            "agent1": np.random.randint(4),
            "agent2": np.random.randint(4)
        }

dummy_trainer = DummyTrainer()
# Check whether it's working.
for _ in range(3):
    print(dummy_trainer.compute_action({"agent1": np.array([0, 10]), "agent2": np.array([10, 0])}))

{'agent1': 3, 'agent2': 1}
{'agent1': 1, 'agent2': 2}
{'agent1': 1, 'agent2': 0}


In [9]:
# Solution to Exercise #1:
#from gym.envs.classic_control.rendering import SimpleImageViewer
#simple_image_viewer = SimpleImageViewer()

# Solution:
env = MultiAgentArena(config={"width": 10, "height": 10})
obs = env.reset()
# Play through 10 episodes.
done = {"__all__": False}
return_ag1 = return_ag2 = 0.0
num_episodes = 0
while num_episodes < 10:
    action = dummy_trainer.compute_action(obs)
    obs, rewards, done, _ = env.step(action)
    return_ag1 += rewards["agent1"]
    return_ag2 += rewards["agent2"]    
    if done["__all__"]:
        print(f"Episode done. R1={return_ag1} R2={return_ag2}")
        num_episodes += 1
        return_ag1 = return_ag2 = 0.0
        obs = env.reset()
    # Optional:
    #img = env.render()
    #simple_image_viewer.imshow(img)


Episode done. R1=-60.0 R2=-18.899999999999963
Episode done. R1=-48.5 R2=-17.79999999999996
Episode done. R1=-50.5 R2=-16.699999999999964
Episode done. R1=-68.5 R2=-19.99999999999996
Episode done. R1=-80.0 R2=-11.19999999999997
Episode done. R1=-45.5 R2=-14.499999999999972
Episode done. R1=-53.5 R2=-19.99999999999996
Episode done. R1=-59.5 R2=-19.99999999999996
Episode done. R1=-51.0 R2=-15.599999999999962
Episode done. R1=-46.0 R2=-19.99999999999996


In [10]:
# 4) Plugging in RLlib.

# Import a Trainable (one of RLlib's built-in algorithms):
# We use the PPO algorithm here because it's very flexible with respect to its
# supported action spaces and model types and because it learns well on a wide
# range of problems.
from ray.rllib.agents.ppo import PPOTrainer

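# Specify a very simple config, defining our environment and some environment
# options (see environment.py).
config = {
    "env": MultiAgentArena,
    # RLlib passes this dict as the `config` argument to `MultiAgentArena.__init__`,
    # so the options must sit at the top level of `env_config`.
    "env_config": {
        "width": 10,
        "height": 10,
    },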
    # "framework": "torch",
    "create_env_on_driver": True,
}
# Instantiate the Trainer object using above config.
rllib_trainer = PPOTrainer(config=config)


2021-04-29 15:31:12,852	INFO trainer.py:669 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
2021-04-29 15:31:12,853	INFO trainer.py:694 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=27764) Instructions for updating:
(pid=27764) non-resource variables are not supported in the long term
(pid=27765) Instructions for updating:
(pid=27765) non-resource variables are not supported in the long term


In [11]:
# That's it, we are ready to train.
# Calling `train` once runs a single "training iteration". One iteration
# for most algos contains a) sampling from the environment(s) + b) using the
# sampled data (observations, actions taken, rewards) to update the policy
# model (neural network), such that it would pick better actions in the future,
# leading to higher rewards.
print(rllib_trainer.train())



{'episode_reward_max': -35.999999999999986, 'episode_reward_min': -88.50000000000016, 'episode_reward_mean': -65.5800000000001, 'episode_len_mean': 100.0, 'episode_media': {}, 'episodes_this_iter': 20, 'policy_reward_min': {}, 'policy_reward_max': {}, 'policy_reward_mean': {}, 'custom_metrics': {}, 'hist_stats': {'episode_reward': [-52.50000000000007, -60.00000000000004, -76.80000000000015, -78.60000000000014, -70.50000000000016, -52.50000000000004, -84.00000000000011, -58.50000000000006, -76.50000000000014, -35.999999999999986, -63.00000000000006, -72.00000000000011, -64.50000000000014, -70.50000000000011, -70.20000000000007, -78.00000000000011, -47.70000000000003, -88.50000000000016, -55.8000000000001, -55.50000000000007], 'episode_lengths': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]}, 'sampler_perf': {'mean_raw_obs_processing_ms': 0.1742773122720785, 'mean_inference_ms': 0.6968317689238253, 'mean_action_processing_ms': 0.0682

In [16]:
# Run `train()` n times. Try to repeatedly call this to see rewards increase.
# Move on once you see episode rewards > -55.0.
for i in range(10):
    results = rllib_trainer.train()
    print(f"iteration {i}: R={results['episode_reward_mean']}")

iteration 0: R=-56.406000000000056
iteration 1: R=-56.043000000000056
iteration 2: R=-54.92700000000004
iteration 3: R=-53.10300000000006
iteration 4: R=-52.950000000000045
iteration 5: R=-51.99300000000006
iteration 6: R=-51.91500000000005
iteration 7: R=-52.30500000000006
iteration 8: R=-53.940000000000055
iteration 9: R=-53.89200000000005
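
The "call `train()` until the reward is good enough" pattern can be wrapped in a small helper. This is a sketch only: `train_until` is a hypothetical function, and `FakeTrainer` is a stand-in that replays a few reward values copied from the output above so the snippet runs without Ray. With RLlib you would pass `rllib_trainer` instead.

```python
def train_until(trainer, threshold, max_iters=50):
    """Call `trainer.train()` until the mean episode reward exceeds `threshold`.

    Returns (index of the last iteration, its mean episode reward).
    """
    reward = float("-inf")
    i = -1
    for i in range(max_iters):
        results = trainer.train()
        reward = results["episode_reward_mean"]
        print(f"iteration {i}: R={reward}")
        if reward > threshold:
            break
    return i, reward


class FakeTrainer:
    """Stand-in for an RLlib Trainer; replays a fixed list of rewards."""

    def __init__(self, rewards):
        self._rewards = iter(rewards)

    def train(self):
        return {"episode_reward_mean": next(self._rewards)}


# Reward values copied (rounded) from the training output above.
fake = FakeTrainer([-56.406, -56.043, -54.927, -53.103])
last_iter, last_reward = train_until(fake, threshold=-55.0)
```

The `max_iters` cap is a deliberate safety net: if the threshold is never reached, the loop still terminates.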


In [17]:
# !LIVE CODING!
# Let's actually "look inside" our Trainer to see what's in there.
pol = rllib_trainer.get_policy()
print(f"Policy: {pol}; Observation-space: {pol.observation_space}; Action-space: {pol.action_space}")

print(f"Model: {pol.model}")

# Create a fake numpy B=1 (single) observation consisting of both agents' positions ("one-hot'd" and "concat'd").
from ray.rllib.utils.numpy import one_hot
single_obs = np.concatenate([one_hot(0, depth=100), one_hot(99, depth=100)])
single_obs = np.array([single_obs])
#single_obs.shape

# Generate the Model's output.
out, state_out = pol.model({"obs": single_obs})

# tf1.x (static graph) -> Need to run this through a tf session.
numpy_out = pol._sess.run(out)

# RLlib then passes the model's output to the policy's "action distribution" to sample an action.
action_dist = pol.dist_class(out)
action = action_dist.sample()

# Show us the actual action.
pol._sess.run(action)

Policy: <ray.rllib.policy.tf_policy_template.PPOTFPolicy object at 0x7ffc05b10670>; Observation-space: Box(-1.0, 1.0, (200,), float32); Action-space: Discrete(4)
Model: <ray.rllib.models.tf.fcnet.FullyConnectedNetwork object at 0x7ffc05b106d0>


array([1])
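
Conceptually, for a `Discrete(4)` action space the model emits one score (logit) per action, and the action distribution turns these scores into probabilities and samples from them. The following pure-Python sketch illustrates that idea. It is not RLlib's implementation, and the logit values are made up for illustration.

```python
import math
import random


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits):
    # Draw one action index with probability given by softmax(logits).
    probs = softmax(logits)
    return random.choices(range(len(logits)), weights=probs, k=1)[0]


# Made-up logits for the four actions (0=up, 1=right, 2=down, 3=left).
logits = [0.1, 1.5, -0.3, 0.0]
probs = softmax(logits)
action = sample_action(logits)
print(probs, action)
```

Because the action is sampled rather than chosen greedily, repeated calls with the same observation can return different actions, which is how the policy explores during training.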

In [18]:
# Save our trainer.
checkpoint_path = rllib_trainer.save()
print(f"Trainer was saved in '{checkpoint_path}'!")

import os
os.listdir(os.path.dirname(checkpoint_path))

Trainer was saved in '/Users/sven/ray_results/PPO_MultiAgentArena_2021-04-29_15-31-12gg_j3w3g/checkpoint_000041/checkpoint-41'!


['checkpoint-41.tune_metadata', 'checkpoint-41', '.is_checkpoint']

In [19]:
# Pretend we wanted to pick up training from a previous run:
new_trainer = PPOTrainer(config=config)
# Evaluate the new trainer (this should yield random results).
results = new_trainer._evaluate()
print(f"Evaluating new trainer: R={results['evaluation']['episode_reward_mean']}")

# Restoring the trained state into the `new_trainer` object.
new_trainer.restore(checkpoint_path)

# Evaluate again (this should yield results we saw after having trained our saved agent).
results = new_trainer._evaluate()
print(f"Evaluating restored trainer: R={results['evaluation']['episode_reward_mean']}")

(pid=27766) Instructions for updating:
(pid=27766) non-resource variables are not supported in the long term
(pid=27760) Instructions for updating:
(pid=27760) non-resource variables are not supported in the long term


AttributeError: 'PPO' object has no attribute 'evaluation_workers'

In [20]:
# 5) Configuration dicts and Ray Tune.
# Where are the default configuration dicts stored?
import pprint
from ray.rllib.agents.ppo import DEFAULT_CONFIG as PPO_DEFAULT_CONFIG
print("PPO's default config is:")
pprint.pprint(PPO_DEFAULT_CONFIG)

#from ray.rllib.agents.dqn import DEFAULT_CONFIG as DQN_DEFAULT_CONFIG
#print(f"DQN's default config is:")
#pprint.pprint(DQN_DEFAULT_CONFIG)

#from ray.rllib.agents.trainer import COMMON_CONFIG
#print(f"RLlib Trainer's default config is:")
#pprint.pprint(COMMON_CONFIG)

PPO's default config is:
{'_fake_gpus': False,
 'batch_mode': 'truncate_episodes',
 'callbacks': <class 'ray.rllib.agents.callbacks.DefaultCallbacks'>,
 'clip_actions': True,
 'clip_param': 0.3,
 'clip_rewards': None,
 'collect_metrics_timeout': 180,
 'compress_observations': False,
 'create_env_on_driver': False,
 'custom_eval_function': None,
 'custom_resources_per_worker': {},
 'eager_tracing': False,
 'entropy_coeff': 0.0,
 'entropy_coeff_schedule': None,
 'env': None,
 'env_config': {},
 'evaluation_config': {},
 'evaluation_interval': None,
 'evaluation_num_episodes': 10,
 'evaluation_num_workers': 0,
 'exploration_config': {'type': 'StochasticSampling'},
 'explore': True,
 'extra_python_environs_for_driver': {},
 'extra_python_environs_for_worker': {},
 'fake_sampler': False,
 'framework': 'tf',
 'gamma': 0.99,
 'grad_clip': None,
 'horizon': None,
 'ignore_worker_failures': False,
 'in_evaluation': False,
 'input': 'sampler',
 'input_evaluation': ['is', 'wis'],
 'kl_coeff': 0.2

In [21]:
# Plugging in Ray Tune.
# Note that this is the recommended way to run any experiments with RLlib.
# Reasons:
# - Tune allows you to do hyperparameter tuning in a user-friendly way
#   and at large scale!
# - Tune automatically allocates needed resources for the different
#   hyperparam trials and experiment runs.

from ray import tune

# Now that we will run things "automatically" through tune, we have to
# define one or more stopping criteria.
stop = {
    # Keys can be any metric found in the result dict returned by
    # `trainer.train()` (printed above).
    "training_iteration": 5,
    "episode_reward_mean": 9999.9,
}

# "PPO" is a registered name that points to RLlib's PPOTrainer.
# See `ray/rllib/agents/registry.py`
# Run our simple experiment until one of the stop criteria is met.
tune.run("PPO", config=config, stop=stop)


Trial name,status,loc
PPO_MultiAgentArena_4525b_00000,PENDING,


(pid=27758) Instructions for updating:
(pid=27758) non-resource variables are not supported in the long term
(pid=27758) 2021-04-29 15:39:12,833	INFO trainer.py:669 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
(pid=27758) 2021-04-29 15:39:12,833	INFO trainer.py:694 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=27759) Instructions for updating:

Result for PPO_MultiAgentArena_4525b_00000:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-04-29_15-39-22
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -43.800000000000026
  episode_reward_mean: -70.08000000000011
  episode_reward_min: -86.40000000000015
  episodes_this_iter: 20
  episodes_total: 20
  experiment_id: 8553c35b91814519ad2ca12c7fa11668
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.362125039100647
          entropy_coeff: 0.0
          kl: 0.02420918457210064
          model: {}
          policy_loss: -0.059532247483730316
          total_loss: 133.79750061035156
          vf_explained_var: 0.0809241235256195
          vf_loss: 133.8521728515625
    num_agent_steps_sampled: 4000
    num_steps_sampled: 4000
    num_steps_trained: 4000
  iterations_since_restore: 1


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_4525b_00000,RUNNING,192.168.0.100:27758,1,3.91841,4000,-70.08,-43.8,-86.4,100


Result for PPO_MultiAgentArena_4525b_00000:
  agent_timesteps_total: 12000
  custom_metrics: {}
  date: 2021-04-29_15-39-27
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -35.99999999999999
  episode_reward_mean: -62.04000000000009
  episode_reward_min: -88.50000000000014
  episodes_this_iter: 20
  episodes_total: 60
  experiment_id: 8553c35b91814519ad2ca12c7fa11668
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 4.999999873689376e-05
          entropy: 1.3045539855957031
          entropy_coeff: 0.0
          kl: 0.01824115589261055
          model: {}
          policy_loss: -0.05359369516372681
          total_loss: 66.01841735839844
          vf_explained_var: 0.2051105946302414
          vf_loss: 66.06654357910156
    num_agent_steps_sampled: 12000
    num_steps_sampled: 12000
    num_steps_trained: 12000
  iterations_since_restore: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_4525b_00000,RUNNING,192.168.0.100:27758,3,9.27795,12000,-62.04,-36,-88.5,100


Result for PPO_MultiAgentArena_4525b_00000:
  agent_timesteps_total: 20000
  custom_metrics: {}
  date: 2021-04-29_15-39-32
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -35.99999999999999
  episode_reward_mean: -61.71000000000009
  episode_reward_min: -88.50000000000016
  episodes_this_iter: 20
  episodes_total: 100
  experiment_id: 8553c35b91814519ad2ca12c7fa11668
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 4.999999873689376e-05
          entropy: 1.2382826805114746
          entropy_coeff: 0.0
          kl: 0.01850685104727745
          model: {}
          policy_loss: -0.04642891883850098
          total_loss: 71.22037506103516
          vf_explained_var: 0.23586240410804749
          vf_loss: 71.2612533569336
    num_agent_steps_sampled: 20000
    num_steps_sampled: 20000
    num_steps_trained: 20000
  iterations_since_restore: 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_4525b_00000,TERMINATED,,5,14.2496,20000,-61.71,-36,-88.5,100


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_4525b_00000,TERMINATED,,5,14.2496,20000,-61.71,-36,-88.5,100


2021-04-29 15:39:33,025	INFO tune.py:549 -- Total run time: 25.97 seconds (25.42 seconds for the tuning loop).


<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7ffc08b79370>

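With a dict of stopping criteria, a trial ends as soon as any one of the listed metrics reaches its threshold. The sketch below mimics that check in plain Python. `should_stop` is a hypothetical helper written for illustration, not Tune's internal code.

```python
def should_stop(result, stop):
    # Stop as soon as ANY metric in `stop` has reached its threshold.
    return any(result[key] >= value for key, value in stop.items())


stop = {"training_iteration": 5, "episode_reward_mean": 9999.9}

# Metric values taken from the trial output above.
after_iter_3 = {"training_iteration": 3, "episode_reward_mean": -62.04}
after_iter_5 = {"training_iteration": 5, "episode_reward_mean": -61.71}

print(should_stop(after_iter_3, stop))  # False
print(should_stop(after_iter_5, stop))  # True
```

In the run above the reward threshold of 9999.9 is out of reach, so the iteration count is what ends the trial after 5 iterations.
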
In [22]:
# Updating an algo's default config dict and adding hyperparameter tuning
# options to it.
# Note: Hyperparameter tuning options (e.g. grid_search) only work
# if we run these configs via `tune.run`.
config.update(
    {
        # Try 2 different learning rates.
        "lr": tune.grid_search([0.0001, 0.5]),
        # NN model config to tweak the default model
        # that'll be created by RLlib for the policy.
        "model": {
            # e.g. change the dense layer stack.
            "fcnet_hiddens": [256, 256, 256],
            # Alternatively, you can specify a custom model here
            # (we'll cover that later).
            # "custom_model": ...
            # Pass kwargs to your custom model.
            # "custom_model_config": {}
        },
    }
)
# Repeat our experiment using tune's grid-search feature.
results = tune.run(
    "PPO",
    config=config,
    stop=stop,
    checkpoint_at_end=True,  # create a checkpoint when done.
    checkpoint_freq=1,  # create a checkpoint on every iteration.
)
print(results)


Trial name,status,loc,lr
PPO_MultiAgentArena_54a34_00000,PENDING,,0.0001
PPO_MultiAgentArena_54a34_00001,PENDING,,0.5


(pid=27753) Instructions for updating:
(pid=27753) non-resource variables are not supported in the long term
(pid=27755) Instructions for updating:
(pid=27755) non-resource variables are not supported in the long term
(pid=27753) 2021-04-29 15:39:37,463	INFO trainer.py:669 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
(pid=27753) 2021-04-29 15:39:37,463	INFO trainer.py:694 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.

Result for PPO_MultiAgentArena_54a34_00001:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-04-29_15-39-47
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -48.30000000000009
  episode_reward_mean: -66.69000000000011
  episode_reward_min: -93.00000000000013
  episodes_this_iter: 20
  episodes_total: 20
  experiment_id: 3898debe393b45bba3ba5505420443c5
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.5
          entropy: 0.03141152858734131
          entropy_coeff: 0.0
          kl: 36.357173919677734
          model: {}
          policy_loss: 0.4721342921257019
          total_loss: 158.54925537109375
          vf_explained_var: -0.0006072241812944412
          vf_loss: 150.80569458007812
    num_agent_steps_sampled: 4000
    num_steps_sampled: 4000
    num_steps_trained: 4000
  iterations_since_restore: 1
  node_ip: 192.1

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_54a34_00000,RUNNING,,0.0001,,,,,,,
PPO_MultiAgentArena_54a34_00001,RUNNING,192.168.0.100:27755,0.5,1.0,4.33747,4000.0,-66.69,-48.3,-93.0,100.0


Result for PPO_MultiAgentArena_54a34_00000:
  agent_timesteps_total: 4000
  custom_metrics: {}
  date: 2021-04-29_15-39-47
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -56.4000000000001
  episode_reward_mean: -71.61000000000011
  episode_reward_min: -88.50000000000017
  episodes_this_iter: 20
  episodes_total: 20
  experiment_id: e48d71458a3945f4bcc1000b5dec669e
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 9.999999747378752e-05
          entropy: 1.345606803894043
          entropy_coeff: 0.0
          kl: 0.041982509195804596
          model: {}
          policy_loss: -0.07474327087402344
          total_loss: 138.66407775878906
          vf_explained_var: 0.1261296421289444
          vf_loss: 138.73043823242188
    num_agent_steps_sampled: 4000
    num_steps_sampled: 4000
    num_steps_trained: 4000
  iterations_since_restore: 1
 

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_54a34_00000,RUNNING,192.168.0.100:27753,0.0001,2,8.36267,8000,-70.2975,-47.7,-90.9,100
PPO_MultiAgentArena_54a34_00001,RUNNING,192.168.0.100:27755,0.5,3,12.5541,12000,-93.23,-48.3,-106.5,100


Result for PPO_MultiAgentArena_54a34_00000:
  agent_timesteps_total: 12000
  custom_metrics: {}
  date: 2021-04-29_15-39-56
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -47.70000000000005
  episode_reward_mean: -68.0600000000001
  episode_reward_min: -90.90000000000015
  episodes_this_iter: 20
  episodes_total: 60
  experiment_id: e48d71458a3945f4bcc1000b5dec669e
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 9.999999747378752e-05
          entropy: 1.2796211242675781
          entropy_coeff: 0.0
          kl: 0.02720813825726509
          model: {}
          policy_loss: -0.06847807765007019
          total_loss: 64.20436096191406
          vf_explained_var: 0.3745890259742737
          vf_loss: 64.26058959960938
    num_agent_steps_sampled: 12000
    num_steps_sampled: 12000
    num_steps_trained: 12000
  iterations_since_restore: 3

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_54a34_00000,RUNNING,192.168.0.100:27753,0.0001,5,20.5724,20000,-66.06,-44.7,-90.9,100
PPO_MultiAgentArena_54a34_00001,RUNNING,192.168.0.100:27755,0.5,4,16.5365,16000,-96.5475,-48.3,-106.5,100


Result for PPO_MultiAgentArena_54a34_00001:
  agent_timesteps_total: 20000
  custom_metrics: {}
  date: 2021-04-29_15-40-04
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -48.30000000000009
  episode_reward_mean: -98.53800000000018
  episode_reward_min: -106.50000000000017
  episodes_this_iter: 20
  episodes_total: 100
  experiment_id: 3898debe393b45bba3ba5505420443c5
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      default_policy:
        learner_stats:
          cur_kl_coeff: 0.03750000149011612
          cur_lr: 0.5
          entropy: 0.0
          entropy_coeff: 0.0
          kl: 0.0
          model: {}
          policy_loss: -0.004698052071034908
          total_loss: 424.4887390136719
          vf_explained_var: -1.6763806343078613e-08
          vf_loss: 424.4934387207031
    num_agent_steps_sampled: 20000
    num_steps_sampled: 20000
    num_steps_trained: 20000
  iterations_since_restore: 5
  node_ip: 192.168.0.100
  num_healthy_w

Trial name,status,loc,lr,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_54a34_00000,TERMINATED,,0.0001,5,20.5724,20000,-66.06,-44.7,-90.9,100
PPO_MultiAgentArena_54a34_00001,TERMINATED,,0.5,5,20.6172,20000,-98.538,-48.3,-106.5,100


[2m[36m(pid=27754)[0m 2021-04-29 15:40:04,397	ERROR worker.py:382 -- SystemExit was raised from the worker
[2m[36m(pid=27754)[0m Traceback (most recent call last):
[2m[36m(pid=27754)[0m   File "python/ray/_raylet.pyx", line 488, in ray._raylet.execute_task
[2m[36m(pid=27754)[0m   File "python/ray/_raylet.pyx", line 495, in ray._raylet.execute_task
[2m[36m(pid=27754)[0m   File "python/ray/_raylet.pyx", line 505, in ray._raylet.execute_task
[2m[36m(pid=27754)[0m   File "python/ray/_raylet.pyx", line 449, in ray._raylet.execute_task.function_executor
[2m[36m(pid=27754)[0m   File "/Users/sven/opt/anaconda3/envs/ray_tutorial/lib/python3.8/site-packages/ray/_private/function_manager.py", line 556, in actor_method_executor
[2m[36m(pid=27754)[0m     return method(__ray_actor, *args, **kwargs)
[2m[36m(pid=27754)[0m   File "/Users/sven/opt/anaconda3/envs/ray_tutorial/lib/python3.8/site-packages/ray/actor.py", line 1001, in __ray_terminate__
[2m[36m(pid=27754)[0m    

<ray.tune.analysis.experiment_analysis.ExperimentAnalysis object at 0x7ffc09734130>


In [23]:
# 6) Going multi-policy: Our experiment is ill-configured because both
# agents, which should behave differently due to their different
# tasks and reward functions, learn the same policy (the "default_policy",
# which RLlib always provides if you don't configure anything else). Remember
# that RLlib does not know at Trainer setup time how many and which agents
# the environment will "produce".
# Let's fix this by introducing the "multiagent" API.

# 6.1.) Define an agent->policy mapping function.
# Which agents (defined by the environment) use which policies
# (defined by us)? Mapping is M (agents) -> N (policies), where M >= N.
def policy_mapping_fn(agent: str):
    assert agent in ["agent1", "agent2"], f"ERROR: invalid agent {agent}!"
    return "pol1" if agent == "agent1" else "pol2"
    
# 6.2.) Define details for our two policies.
#TODO: coding Sven: Make it possible to not need obs/action spaces
#  if they are the default anyways.
observation_space = rllib_trainer.workers.local_worker().env.observation_space
action_space = rllib_trainer.workers.local_worker().env.action_space
# Btw, the above is equivalent to reading the spaces from the default policy:
# >>> rllib_trainer.get_policy("default_policy").observation_space
# >>> rllib_trainer.get_policy("default_policy").action_space
policies = {
    "pol1": (None, observation_space, action_space, {"lr": 0.0003}),
    "pol2": (None, observation_space, action_space, {"lr": 0.0004}),
}

#policies_to_train = ["pol1", "pol2"]

# 6.3) Adding the above to our config.
config.update({
    "multiagent": {
        "policies": policies,
        "policy_mapping_fn": policy_mapping_fn,
        #"policies_to_train": policies_to_train,
    },
})
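
The mapping from step 6.1 can be sanity-checked in plain Python, without starting a Trainer. The sketch below copies `policy_mapping_fn` from above and groups the two agent IDs by the policy they resolve to. The `agents_per_policy` dict is a hypothetical helper used only for illustration; it is not part of the RLlib config.

```python
from collections import defaultdict


def policy_mapping_fn(agent: str) -> str:
    # Same logic as in step 6.1: agent1 -> pol1, agent2 -> pol2.
    assert agent in ["agent1", "agent2"], f"ERROR: invalid agent {agent}!"
    return "pol1" if agent == "agent1" else "pol2"


# Hypothetical helper: collect which agents end up on which policy.
agents_per_policy = defaultdict(list)
for agent_id in ["agent1", "agent2"]:
    agents_per_policy[policy_mapping_fn(agent_id)].append(agent_id)

# Each policy should be used by exactly one agent here (M = N = 2).
print(dict(agents_per_policy))
```

If several agents were meant to share one policy (M > N), the same check would show more than one agent ID under a single policy key.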


## Exercise No 2

Try to learn our environment using Ray Tune's `tune.run` and a simple hyperparameter `grid_search` over:
- 2 different learning rates (pick your own values).
- AND 2 different `train_batch_size` settings (use 2000 and 3000).

Also, make RLlib use a [128, 128] dense layer stack as the NN model.

Also, use the config setting of `num_envs_per_worker=10` to increase the sampling throughput.

In case your local machine has fewer than 12 CPUs, try setting `num_workers=1` so that all Tune trials can run at the same time.
Background: PPO uses 2 workers by default, so 1 trial uses 3 CPUs (2 workers + the "driver", a.k.a. "local worker"),
and the entire experiment (4 trials) uses 12 CPUs. If Tune cannot allocate enough CPUs at once, it runs trials in sequence
(which is also fine, but takes longer).
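
The CPU arithmetic above can be reproduced with a few lines of plain Python. This is only a sketch: it assumes, as described above, that every worker and the driver take one CPU each. The learning rates are placeholder values (pick your own for the exercise), and `cpus_needed` is a hypothetical helper, not an RLlib or Tune function.

```python
from itertools import product

# Placeholder grid values; the exercise asks you to pick your own learning rates.
learning_rates = [0.0001, 0.0005]
train_batch_sizes = [2000, 3000]

# A grid search over two lists yields one trial per combination (2 x 2 = 4).
trials = list(product(learning_rates, train_batch_sizes))


def cpus_needed(num_trials: int, num_workers: int) -> int:
    # Hypothetical helper: each trial uses `num_workers` CPUs plus 1 for the driver.
    return num_trials * (num_workers + 1)


print(len(trials))                              # number of trials
print(cpus_needed(len(trials), num_workers=2))  # PPO default: 2 workers per trial
print(cpus_needed(len(trials), num_workers=1))  # reduced setting suggested above
```

With the default of 2 workers this gives the 12 CPUs mentioned above; with `num_workers=1` the same four trials need 8 CPUs to run concurrently.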

Try to reach a total reward (sum of agent1 and agent2) of -25.0.

**Good luck! :)**


In [24]:
# Solution to Exercise #2:

# Update our config and set it up for 2x tune grid-searches (leading to 4 parallel trials in total).
config.update({
    "lr": tune.grid_search([0.0001, 0.0005]),
    "train_batch_size": tune.grid_search([2000, 3000]),
    "num_envs_per_worker": 10,
    # Change our model to be simpler.
    "model": {
        "fcnet_hiddens": [128, 128],
    },
})

# Run the experiment.
tune.run("PPO", config=config, stop={"episode_reward_mean": -25.0, "training_iteration": 100})

Trial name,status,loc,lr,train_batch_size
PPO_MultiAgentArena_6768d_00000,PENDING,,0.0001,2000
PPO_MultiAgentArena_6768d_00001,PENDING,,0.0005,2000
PPO_MultiAgentArena_6768d_00002,PENDING,,0.0001,3000
PPO_MultiAgentArena_6768d_00003,PENDING,,0.0005,3000


[2m[36m(pid=27751)[0m Instructions for updating:
[2m[36m(pid=27751)[0m non-resource variables are not supported in the long term
[2m[36m(pid=27751)[0m Instructions for updating:
[2m[36m(pid=27751)[0m non-resource variables are not supported in the long term
[2m[36m(pid=27752)[0m Instructions for updating:
[2m[36m(pid=27752)[0m non-resource variables are not supported in the long term
[2m[36m(pid=27752)[0m Instructions for updating:
[2m[36m(pid=27752)[0m non-resource variables are not supported in the long term
[2m[36m(pid=27761)[0m Instructions for updating:
[2m[36m(pid=27761)[0m non-resource variables are not supported in the long term
[2m[36m(pid=27761)[0m Instructions for updating:
[2m[36m(pid=27761)[0m non-resource variables are not supported in the long term
[2m[36m(pid=27831)[0m Instructions for updating:
[2m[36m(pid=27831)[0m non-resource variables are not supported in the long term
[2m[36m(pid=27831)[0m Instructions for updating:
[2

[2m[36m(pid=27830)[0m Instructions for updating:
[2m[36m(pid=27830)[0m non-resource variables are not supported in the long term
[2m[36m(pid=27830)[0m Instructions for updating:
[2m[36m(pid=27830)[0m non-resource variables are not supported in the long term
[2m[36m(pid=27836)[0m Instructions for updating:
[2m[36m(pid=27836)[0m non-resource variables are not supported in the long term
[2m[36m(pid=27836)[0m Instructions for updating:
[2m[36m(pid=27836)[0m non-resource variables are not supported in the long term
[2m[36m(pid=27837)[0m Instructions for updating:
[2m[36m(pid=27837)[0m non-resource variables are not supported in the long term
[2m[36m(pid=27837)[0m Instructions for updating:
[2m[36m(pid=27837)[0m non-resource variables are not supported in the long term
[2m[36m(pid=27839)[0m Instructions for updating:
[2m[36m(pid=27839)[0m non-resource variables are not supported in the long term
[2m[36m(pid=27839)[0m Instructions for updating:
[2



Result for PPO_MultiAgentArena_6768d_00000:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-04-29_15-40-24
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -45.90000000000006
  episode_reward_mean: -69.2925000000001
  episode_reward_min: -91.20000000000014
  episodes_this_iter: 40
  episodes_total: 40
  experiment_id: ca7d5fe006ec4c509dd419cd353bfbf5
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.0003000000142492354
          entropy: 1.3394379615783691
          entropy_coeff: 0.0
          kl: 0.04784049838781357
          model: {}
          policy_loss: -0.07369829714298248
          total_loss: 226.8167724609375
          vf_explained_var: 0.05389287695288658
          vf_loss: 226.88088989257812
      pol2:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.00039999998989515007
          e

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,1.0,5.61759,4000.0,-69.2925,-45.9,-91.2,100.0
PPO_MultiAgentArena_6768d_00001,RUNNING,,0.0005,2000,,,,,,,
PPO_MultiAgentArena_6768d_00002,RUNNING,,0.0001,3000,,,,,,,
PPO_MultiAgentArena_6768d_00003,RUNNING,,0.0005,3000,,,,,,,


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 8000
  custom_metrics: {}
  date: 2021-04-29_15-40-25
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -46.5
  episode_reward_mean: -67.2675000000001
  episode_reward_min: -90.00000000000017
  episodes_this_iter: 40
  episodes_total: 40
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.0003000000142492354
          entropy: 1.3452237844467163
          entropy_coeff: 0.0
          kl: 0.0427456833422184
          model: {}
          policy_loss: -0.0751623585820198
          total_loss: 249.20498657226562
          vf_explained_var: 0.015425518155097961
          vf_loss: 249.2716064453125
      pol2:
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 0.00039999998989515007
          entropy: 1.3468

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,2,11.4469,8000,-65.9775,-32.4,-91.2,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,1,6.1069,4000,-69.645,-43.2,-99.0,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,1,6.08135,4000,-66.975,-46.2,-97.8,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,1,6.10309,4000,-67.2675,-46.5,-90.0,100


Result for PPO_MultiAgentArena_6768d_00002:
  agent_timesteps_total: 16000
  custom_metrics: {}
  date: 2021-04-29_15-40-31
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -37.5
  episode_reward_mean: -63.42750000000008
  episode_reward_min: -97.80000000000015
  episodes_this_iter: 40
  episodes_total: 80
  experiment_id: e1a3447d40ef4fcb931f851e938075bc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 0.0003000000142492354
          entropy: 1.3139889240264893
          entropy_coeff: 0.0
          kl: 0.03743421286344528
          model: {}
          policy_loss: -0.07891766726970673
          total_loss: 73.26046752929688
          vf_explained_var: 0.2710918188095093
          vf_loss: 73.3281478881836
      pol2:
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 0.00039999998989515007
          entropy: 1.3077

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,3,18.5226,12000,-61.206,-32.4,-87.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,2,11.6804,8000,-66.1875,-43.2,-99.0,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,2,11.6882,8000,-63.4275,-37.5,-97.8,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,2,11.7603,8000,-67.2938,-46.5,-93.0,100


Result for PPO_MultiAgentArena_6768d_00002:
  agent_timesteps_total: 24000
  custom_metrics: {}
  date: 2021-04-29_15-40-38
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -37.5
  episode_reward_mean: -62.36700000000009
  episode_reward_min: -97.80000000000015
  episodes_this_iter: 40
  episodes_total: 120
  experiment_id: e1a3447d40ef4fcb931f851e938075bc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 0.0003000000142492354
          entropy: 1.2709401845932007
          entropy_coeff: 0.0
          kl: 0.030720677226781845
          model: {}
          policy_loss: -0.07197196781635284
          total_loss: 67.42460632324219
          vf_explained_var: 0.2721448838710785
          vf_loss: 67.48274993896484
      pol2:
        learner_stats:
          cur_kl_coeff: 0.44999998807907104
          cur_lr: 0.00039999998989515007
          entropy: 1.2

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,4,26.77,16000,-59.508,-33.3,-78.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,3,18.795,12000,-63.633,-40.8,-99.0,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,3,18.772,12000,-62.367,-37.5,-97.8,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,3,19.2357,12000,-64.938,-46.5,-93.0,100


Result for PPO_MultiAgentArena_6768d_00002:
  agent_timesteps_total: 32000
  custom_metrics: {}
  date: 2021-04-29_15-40-46
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -35.99999999999998
  episode_reward_mean: -59.92200000000008
  episode_reward_min: -81.60000000000015
  episodes_this_iter: 40
  episodes_total: 160
  experiment_id: e1a3447d40ef4fcb931f851e938075bc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 0.675000011920929
          cur_lr: 0.0003000000142492354
          entropy: 1.2504839897155762
          entropy_coeff: 0.0
          kl: 0.02280413918197155
          model: {}
          policy_loss: -0.06321410834789276
          total_loss: 67.6986083984375
          vf_explained_var: 0.22486630082130432
          vf_loss: 67.74642944335938
      pol2:
        learner_stats:
          cur_kl_coeff: 0.675000011920929
          cur_lr: 0.00039999998989515007
          entr

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,5,32.6686,20000,-57.885,-33.3,-78.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,4,26.6756,16000,-59.754,-35.4,-80.4,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,4,26.7494,16000,-59.922,-36.0,-81.6,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,4,26.9667,16000,-61.224,-37.5,-90.9,100


Result for PPO_MultiAgentArena_6768d_00002:
  agent_timesteps_total: 40000
  custom_metrics: {}
  date: 2021-04-29_15-40-52
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -35.99999999999998
  episode_reward_mean: -58.89900000000008
  episode_reward_min: -81.60000000000015
  episodes_this_iter: 40
  episodes_total: 200
  experiment_id: e1a3447d40ef4fcb931f851e938075bc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.2248653173446655
          entropy_coeff: 0.0
          kl: 0.01879333332180977
          model: {}
          policy_loss: -0.06278271228075027
          total_loss: 83.66142272949219
          vf_explained_var: 0.16791298985481262
          vf_loss: 83.7051773071289
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          en

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,6,40.6666,24000,-56.874,-40.2,-81.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,5,32.7106,20000,-57.708,-35.4,-84.0,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,5,32.7878,20000,-58.899,-36.0,-81.6,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,5,33.1643,20000,-58.167,-37.5,-90.9,100


Result for PPO_MultiAgentArena_6768d_00002:
  agent_timesteps_total: 48000
  custom_metrics: {}
  date: 2021-04-29_15-41-00
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -33.29999999999998
  episode_reward_mean: -58.45200000000007
  episode_reward_min: -81.00000000000011
  episodes_this_iter: 40
  episodes_total: 240
  experiment_id: e1a3447d40ef4fcb931f851e938075bc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.205172061920166
          entropy_coeff: 0.0
          kl: 0.017608027905225754
          model: {}
          policy_loss: -0.061123598366975784
          total_loss: 58.42267608642578
          vf_explained_var: 0.23633694648742676
          vf_loss: 58.465972900390625
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
         

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,7,48.2252,28000,-55.671,-33.3,-81.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,6,40.8582,24000,-56.43,-35.4,-84.0,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,6,40.892,24000,-58.452,-33.3,-81.0,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,6,41.1709,24000,-58.032,-36.3,-90.9,100


Result for PPO_MultiAgentArena_6768d_00002:
  agent_timesteps_total: 56000
  custom_metrics: {}
  date: 2021-04-29_15-41-08
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -33.29999999999998
  episode_reward_mean: -58.53900000000006
  episode_reward_min: -81.00000000000011
  episodes_this_iter: 40
  episodes_total: 280
  experiment_id: e1a3447d40ef4fcb931f851e938075bc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.186113953590393
          entropy_coeff: 0.0
          kl: 0.018820296972990036
          model: {}
          policy_loss: -0.054979607462882996
          total_loss: 66.61347198486328
          vf_explained_var: 0.1913606822490692
          vf_loss: 66.64939880371094
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          e

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,8,54.5993,32000,-56.436,-33.3,-81.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,7,48.3417,28000,-56.358,-37.5,-87.9,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,7,48.4155,28000,-58.539,-33.3,-81.0,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,7,48.4941,28000,-56.703,-35.7,-81.0,100


Result for PPO_MultiAgentArena_6768d_00002:
  agent_timesteps_total: 64000
  custom_metrics: {}
  date: 2021-04-29_15-41-14
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -33.29999999999998
  episode_reward_mean: -57.990000000000066
  episode_reward_min: -78.90000000000012
  episodes_this_iter: 40
  episodes_total: 320
  experiment_id: e1a3447d40ef4fcb931f851e938075bc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.1739661693572998
          entropy_coeff: 0.0
          kl: 0.017135027796030045
          model: {}
          policy_loss: -0.059963442385196686
          total_loss: 63.47212600708008
          vf_explained_var: 0.1945434808731079
          vf_loss: 63.514739990234375
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
        

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,9,60.8852,36000,-56.289,-34.8,-81.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,8,54.6612,32000,-56.637,-38.4,-87.9,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,8,54.7616,32000,-57.99,-33.3,-78.9,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,8,54.8433,32000,-55.575,-35.7,-76.5,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 72000
  custom_metrics: {}
  date: 2021-04-29_15-41-20
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -37.500000000000014
  episode_reward_mean: -56.679000000000066
  episode_reward_min: -76.5000000000001
  episodes_this_iter: 40
  episodes_total: 360
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.1497597694396973
          entropy_coeff: 0.0
          kl: 0.017849018797278404
          model: {}
          policy_loss: -0.06419435888528824
          total_loss: 75.64063262939453
          vf_explained_var: 0.17273123562335968
          vf_loss: 75.68675231933594
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
         

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,10,66.9093,40000,-56.502,-34.8,-81.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,9,60.7697,36000,-56.049,-36.0,-76.8,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,9,60.8861,36000,-57.855,-36.0,-90.0,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,9,60.8802,36000,-56.679,-37.5,-76.5,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 80000
  custom_metrics: {}
  date: 2021-04-29_15-41-26
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -37.500000000000014
  episode_reward_mean: -56.679000000000066
  episode_reward_min: -76.5000000000001
  episodes_this_iter: 40
  episodes_total: 400
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.137955665588379
          entropy_coeff: 0.0
          kl: 0.017595399171113968
          model: {}
          policy_loss: -0.05801450461149216
          total_loss: 81.1055908203125
          vf_explained_var: 0.11618998646736145
          vf_loss: 81.14579010009766
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          e

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,11,72.8217,44000,-56.112,-37.5,-75.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,10,66.709,40000,-55.662,-34.2,-77.1,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,10,66.869,40000,-56.16,-36.0,-90.0,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,10,66.8705,40000,-56.679,-37.5,-76.5,100


Result for PPO_MultiAgentArena_6768d_00001:
  agent_timesteps_total: 88000
  custom_metrics: {}
  date: 2021-04-29_15-41-32
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -34.199999999999974
  episode_reward_mean: -53.79900000000005
  episode_reward_min: -77.1000000000001
  episodes_this_iter: 40
  episodes_total: 440
  experiment_id: 88ed0d232176421b9f10b011ac47f80b
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.1288137435913086
          entropy_coeff: 0.0
          kl: 0.017578192055225372
          model: {}
          policy_loss: -0.05381487309932709
          total_loss: 55.07358932495117
          vf_explained_var: 0.17492404580116272
          vf_loss: 55.10960388183594
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,11,72.8217,44000,-56.112,-37.5,-75.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,12,78.0068,48000,-52.161,-38.4,-72.3,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,11,72.6714,44000,-56.79,-40.8,-90.0,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,11,72.6825,44000,-56.283,-38.7,-75.3,100


Result for PPO_MultiAgentArena_6768d_00000:
  agent_timesteps_total: 96000
  custom_metrics: {}
  date: 2021-04-29_15-41-38
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -34.199999999999974
  episode_reward_mean: -55.39500000000005
  episode_reward_min: -81.0000000000001
  episodes_this_iter: 40
  episodes_total: 480
  experiment_id: ca7d5fe006ec4c509dd419cd353bfbf5
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.0768318176269531
          entropy_coeff: 0.0
          kl: 0.01763078197836876
          model: {}
          policy_loss: -0.05409485101699829
          total_loss: 68.98703002929688
          vf_explained_var: 0.18389922380447388
          vf_loss: 69.02326965332031
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          e

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,12,78.6584,48000,-55.395,-34.2,-81.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,13,85.1388,52000,-50.907,-27.3,-72.3,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,12,78.426,48000,-55.143,-30.0,-79.5,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,12,78.4573,48000,-56.43,-35.4,-78.0,100


Result for PPO_MultiAgentArena_6768d_00000:
  agent_timesteps_total: 104000
  custom_metrics: {}
  date: 2021-04-29_15-41-45
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -27.899999999999977
  episode_reward_mean: -54.71400000000005
  episode_reward_min: -81.0000000000001
  episodes_this_iter: 40
  episodes_total: 520
  experiment_id: ca7d5fe006ec4c509dd419cd353bfbf5
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.0471458435058594
          entropy_coeff: 0.0
          kl: 0.01586918532848358
          model: {}
          policy_loss: -0.0489836186170578
          total_loss: 81.99269104003906
          vf_explained_var: 0.16763535141944885
          vf_loss: 82.02560424804688
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          e

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,13,86.0087,52000,-54.714,-27.9,-81.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,14,91.4699,56000,-49.023,-27.3,-72.3,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,13,85.6649,52000,-54.594,-30.0,-82.5,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,13,85.734,52000,-55.155,-33.9,-88.8,100


Result for PPO_MultiAgentArena_6768d_00000:
  agent_timesteps_total: 112000
  custom_metrics: {}
  date: 2021-04-29_15-41-51
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -27.899999999999977
  episode_reward_mean: -55.10400000000005
  episode_reward_min: -82.50000000000011
  episodes_this_iter: 40
  episodes_total: 560
  experiment_id: ca7d5fe006ec4c509dd419cd353bfbf5
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.0497357845306396
          entropy_coeff: 0.0
          kl: 0.016114089637994766
          model: {}
          policy_loss: -0.06135372072458267
          total_loss: 71.85824584960938
          vf_explained_var: 0.1468425989151001
          vf_loss: 71.90328216552734
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
         

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,15,98.1505,60000,-54.51,-24.6,-90.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,14,91.4699,56000,-49.023,-27.3,-72.3,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,14,91.8187,56000,-54.936,-34.8,-82.5,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,14,91.9038,56000,-54.426,-33.9,-88.8,100


Result for PPO_MultiAgentArena_6768d_00001:
  agent_timesteps_total: 120000
  custom_metrics: {}
  date: 2021-04-29_15-41-57
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -28.19999999999997
  episode_reward_mean: -47.07000000000003
  episode_reward_min: -71.10000000000012
  episodes_this_iter: 40
  episodes_total: 600
  experiment_id: 88ed0d232176421b9f10b011ac47f80b
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.0645031929016113
          entropy_coeff: 0.0
          kl: 0.015343844890594482
          model: {}
          policy_loss: -0.056475937366485596
          total_loss: 62.323326110839844
          vf_explained_var: 0.08131785690784454
          vf_loss: 62.36427307128906
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
       

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,16,104.608,64000,-54.318,-24.6,-90.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,15,97.6404,60000,-47.07,-28.2,-71.1,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,15,97.8717,60000,-53.541,-33.6,-82.5,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,15,97.9148,60000,-54.039,-33.6,-88.8,100


Result for PPO_MultiAgentArena_6768d_00001:
  agent_timesteps_total: 128000
  custom_metrics: {}
  date: 2021-04-29_15-42-04
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -25.199999999999978
  episode_reward_mean: -46.87800000000003
  episode_reward_min: -68.70000000000007
  episodes_this_iter: 40
  episodes_total: 640
  experiment_id: 88ed0d232176421b9f10b011ac47f80b
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.0216400623321533
          entropy_coeff: 0.0
          kl: 0.0159014780074358
          model: {}
          policy_loss: -0.052468206733465195
          total_loss: 64.10145568847656
          vf_explained_var: 0.08231677114963531
          vf_loss: 64.13783264160156
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
         

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,16,104.608,64000,-54.318,-24.6,-90.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,17,110.045,68000,-45.189,-23.1,-65.4,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,16,104.385,64000,-52.842,-33.6,-75.0,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,16,104.711,64000,-53.058,-33.6,-73.5,100


Result for PPO_MultiAgentArena_6768d_00000:
  agent_timesteps_total: 136000
  custom_metrics: {}
  date: 2021-04-29_15-42-10
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -24.599999999999984
  episode_reward_mean: -52.91700000000005
  episode_reward_min: -87.00000000000013
  episodes_this_iter: 40
  episodes_total: 680
  experiment_id: ca7d5fe006ec4c509dd419cd353bfbf5
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 1.0037667751312256
          entropy_coeff: 0.0
          kl: 0.015906378626823425
          model: {}
          policy_loss: -0.053139783442020416
          total_loss: 54.807289123535156
          vf_explained_var: 0.12878882884979248
          vf_loss: 54.844322204589844
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
     

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,18,116.521,72000,-51.366,-33.3,-74.4,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,17,110.045,68000,-45.189,-23.1,-65.4,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,17,110.448,68000,-52.206,-33.6,-75.0,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,17,110.669,68000,-52.923,-33.6,-73.5,100


Result for PPO_MultiAgentArena_6768d_00001:
  agent_timesteps_total: 144000
  custom_metrics: {}
  date: 2021-04-29_15-42-16
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -23.09999999999996
  episode_reward_mean: -43.86000000000003
  episode_reward_min: -64.50000000000007
  episodes_this_iter: 40
  episodes_total: 720
  experiment_id: 88ed0d232176421b9f10b011ac47f80b
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.9882775545120239
          entropy_coeff: 0.0
          kl: 0.014036556705832481
          model: {}
          policy_loss: -0.04217071086168289
          total_loss: 68.61019897460938
          vf_explained_var: 0.06600187718868256
          vf_loss: 68.63815307617188
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
         

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,19,122.232,76000,-49.5,-33.3,-72.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,18,116.049,72000,-43.86,-23.1,-64.5,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,18,116.438,72000,-50.445,-31.2,-75.0,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,18,116.486,72000,-52.239,-32.1,-73.2,100


Result for PPO_MultiAgentArena_6768d_00001:
  agent_timesteps_total: 152000
  custom_metrics: {}
  date: 2021-04-29_15-42-21
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -18.899999999999974
  episode_reward_mean: -41.89500000000002
  episode_reward_min: -64.50000000000007
  episodes_this_iter: 40
  episodes_total: 760
  experiment_id: 88ed0d232176421b9f10b011ac47f80b
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.9670689702033997
          entropy_coeff: 0.0
          kl: 0.014392219483852386
          model: {}
          policy_loss: -0.04987628012895584
          total_loss: 48.62340545654297
          vf_explained_var: 0.08940410614013672
          vf_loss: 48.65870666503906
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
        

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,20,128.181,80000,-49.119,-30.0,-72.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,19,121.645,76000,-41.895,-18.9,-64.5,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,19,121.999,76000,-50.598,-30.0,-73.5,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,19,122.16,76000,-51.819,-32.1,-73.2,100


Result for PPO_MultiAgentArena_6768d_00001:
  agent_timesteps_total: 160000
  custom_metrics: {}
  date: 2021-04-29_15-42-27
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -18.899999999999974
  episode_reward_mean: -41.16600000000001
  episode_reward_min: -62.10000000000008
  episodes_this_iter: 40
  episodes_total: 800
  experiment_id: 88ed0d232176421b9f10b011ac47f80b
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.9520997405052185
          entropy_coeff: 0.0
          kl: 0.015064072795212269
          model: {}
          policy_loss: -0.05516812205314636
          total_loss: 50.42054748535156
          vf_explained_var: 0.1345233917236328
          vf_loss: 50.46046447753906
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
         

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,21,134.036,84000,-48.12,-30.0,-72.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,20,127.563,80000,-41.166,-18.9,-62.1,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,20,127.902,80000,-48.981,-19.2,-73.5,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,20,128.026,80000,-51.387,-32.1,-66.9,100


Result for PPO_MultiAgentArena_6768d_00001:
  agent_timesteps_total: 168000
  custom_metrics: {}
  date: 2021-04-29_15-42-33
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -18.899999999999974
  episode_reward_mean: -39.62700000000001
  episode_reward_min: -60.90000000000008
  episodes_this_iter: 40
  episodes_total: 840
  experiment_id: 88ed0d232176421b9f10b011ac47f80b
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.9443992972373962
          entropy_coeff: 0.0
          kl: 0.014932885766029358
          model: {}
          policy_loss: -0.05163469910621643
          total_loss: 33.023468017578125
          vf_explained_var: 0.13632924854755402
          vf_loss: 33.05998229980469
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
       

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,21,134.036,84000,-48.12,-30.0,-72.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,22,138.737,88000,-38.598,-22.5,-57.9,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,21,133.743,84000,-48.246,-19.2,-66.9,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,21,133.752,84000,-50.241,-32.1,-69.0,100


Result for PPO_MultiAgentArena_6768d_00000:
  agent_timesteps_total: 176000
  custom_metrics: {}
  date: 2021-04-29_15-42-39
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -26.69999999999997
  episode_reward_mean: -47.15700000000003
  episode_reward_min: -71.70000000000009
  episodes_this_iter: 40
  episodes_total: 880
  experiment_id: ca7d5fe006ec4c509dd419cd353bfbf5
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.9587459564208984
          entropy_coeff: 0.0
          kl: 0.01614401675760746
          model: {}
          policy_loss: -0.0503421388566494
          total_loss: 65.3270492553711
          vf_explained_var: 0.10377812385559082
          vf_loss: 65.36104583740234
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          en

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,23,144.828,92000,-48.108,-26.7,-81.6,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,22,138.737,88000,-38.598,-22.5,-57.9,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,22,139.285,88000,-48.435,-23.7,-72.9,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,22,139.394,88000,-49.632,-32.1,-69.0,100


Result for PPO_MultiAgentArena_6768d_00001:
  agent_timesteps_total: 184000
  custom_metrics: {}
  date: 2021-04-29_15-42-44
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -21.599999999999977
  episode_reward_mean: -38.175000000000004
  episode_reward_min: -57.90000000000005
  episodes_this_iter: 40
  episodes_total: 920
  experiment_id: 88ed0d232176421b9f10b011ac47f80b
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.9347695708274841
          entropy_coeff: 0.0
          kl: 0.013316778466105461
          model: {}
          policy_loss: -0.04786767065525055
          total_loss: 43.17835998535156
          vf_explained_var: 0.11864691972732544
          vf_loss: 43.212745666503906
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
      

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,24,152.456,96000,-47.526,-33.3,-81.6,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,23,144.149,92000,-38.175,-21.6,-57.9,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,23,144.667,92000,-47.772,-24.3,-72.9,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,23,144.989,92000,-47.88,-29.4,-67.5,100


Result for PPO_MultiAgentArena_6768d_00001:
  agent_timesteps_total: 192000
  custom_metrics: {}
  date: 2021-04-29_15-42-52
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -21.599999999999977
  episode_reward_mean: -37.641000000000005
  episode_reward_min: -57.600000000000115
  episodes_this_iter: 40
  episodes_total: 960
  experiment_id: 88ed0d232176421b9f10b011ac47f80b
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.9331375360488892
          entropy_coeff: 0.0
          kl: 0.014219926670193672
          model: {}
          policy_loss: -0.04750514775514603
          total_loss: 55.11891555786133
          vf_explained_var: 0.143527090549469
          vf_loss: 55.15201950073242
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
        

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,25,158.634,100000,-45.657,-28.5,-70.5,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,24,151.83,96000,-37.641,-21.6,-57.6,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,24,152.448,96000,-46.932,-22.5,-68.4,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,24,152.908,96000,-46.644,-27.0,-70.2,100


Result for PPO_MultiAgentArena_6768d_00001:
  agent_timesteps_total: 200000
  custom_metrics: {}
  date: 2021-04-29_15-42-58
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -20.999999999999964
  episode_reward_mean: -36.309
  episode_reward_min: -58.20000000000009
  episodes_this_iter: 40
  episodes_total: 1000
  experiment_id: 88ed0d232176421b9f10b011ac47f80b
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.9093656539916992
          entropy_coeff: 0.0
          kl: 0.015630370005965233
          model: {}
          policy_loss: -0.04836378991603851
          total_loss: 38.12422180175781
          vf_explained_var: 0.11646772176027298
          vf_loss: 38.15675354003906
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          entropy:

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,26,164.394,104000,-45.309,-26.4,-66.3,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,25,157.907,100000,-36.309,-21.0,-58.2,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,25,158.485,100000,-44.769,-22.5,-67.5,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,25,158.638,100000,-46.719,-27.0,-70.2,100


Result for PPO_MultiAgentArena_6768d_00002:
  agent_timesteps_total: 208000
  custom_metrics: {}
  date: 2021-04-29_15-43-04
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -23.999999999999993
  episode_reward_mean: -42.870000000000026
  episode_reward_min: -67.5000000000001
  episodes_this_iter: 40
  episodes_total: 1040
  experiment_id: e1a3447d40ef4fcb931f851e938075bc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.8917433023452759
          entropy_coeff: 0.0
          kl: 0.01605263352394104
          model: {}
          policy_loss: -0.04632389545440674
          total_loss: 43.24937438964844
          vf_explained_var: 0.17000192403793335
          vf_loss: 43.279441833496094
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
       

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,27,170.673,108000,-46.038,-26.4,-70.8,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,26,163.766,104000,-34.983,-21.0,-58.2,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,26,164.109,104000,-42.87,-24.0,-67.5,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,26,164.218,104000,-47.253,-27.0,-69.0,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 216000
  custom_metrics: {}
  date: 2021-04-29_15-43-11
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -22.799999999999994
  episode_reward_mean: -46.914000000000044
  episode_reward_min: -67.5000000000001
  episodes_this_iter: 40
  episodes_total: 1080
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.8817446231842041
          entropy_coeff: 0.0
          kl: 0.0162050724029541
          model: {}
          policy_loss: -0.05816159024834633
          total_loss: 72.2748794555664
          vf_explained_var: 0.15604007244110107
          vf_loss: 72.31663513183594
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,28,176.62,112000,-46.65,-26.1,-70.8,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,27,170.134,108000,-34.272,-18.0,-58.2,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,27,170.541,108000,-41.655,-20.4,-68.4,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,27,170.509,108000,-46.914,-22.8,-67.5,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 224000
  custom_metrics: {}
  date: 2021-04-29_15-43-17
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -22.799999999999994
  episode_reward_mean: -45.43800000000002
  episode_reward_min: -67.50000000000011
  episodes_this_iter: 40
  episodes_total: 1120
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.8651791214942932
          entropy_coeff: 0.0
          kl: 0.014631852507591248
          model: {}
          policy_loss: -0.04263238608837128
          total_loss: 70.45223999023438
          vf_explained_var: 0.10022768378257751
          vf_loss: 70.48005676269531
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
       

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,29,182.307,116000,-43.98,-17.4,-69.3,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,28,176.106,112000,-34.062,-16.5,-55.8,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,28,176.463,112000,-41.799,-20.1,-68.4,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,28,176.493,112000,-45.438,-22.8,-67.5,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 232000
  custom_metrics: {}
  date: 2021-04-29_15-43-22
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -23.099999999999973
  episode_reward_mean: -44.385000000000026
  episode_reward_min: -67.50000000000011
  episodes_this_iter: 40
  episodes_total: 1160
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.8432906866073608
          entropy_coeff: 0.0
          kl: 0.011795462109148502
          model: {}
          policy_loss: -0.03927554935216904
          total_loss: 79.24305725097656
          vf_explained_var: 0.0963183119893074
          vf_loss: 79.2703857421875
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
        

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,30,188.091,120000,-43.383,-17.4,-67.5,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,29,182.03,116000,-33.72,-16.5,-55.8,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,29,182.369,116000,-40.413,-20.1,-68.4,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,29,182.379,116000,-44.385,-23.1,-67.5,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 240000
  custom_metrics: {}
  date: 2021-04-29_15-43-28
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -26.399999999999995
  episode_reward_mean: -44.505000000000024
  episode_reward_min: -64.20000000000007
  episodes_this_iter: 40
  episodes_total: 1200
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.8319759368896484
          entropy_coeff: 0.0
          kl: 0.012361745350062847
          model: {}
          policy_loss: -0.036920759826898575
          total_loss: 54.914268493652344
          vf_explained_var: 0.1454734206199646
          vf_loss: 54.93867111206055
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
     

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,31,193.903,124000,-43.575,-20.4,-67.5,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,30,187.983,120000,-33.87,-16.5,-57.0,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,30,188.265,120000,-38.745,-9.0,-58.8,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,30,188.258,120000,-44.505,-26.4,-64.2,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 248000
  custom_metrics: {}
  date: 2021-04-29_15-43-34
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -26.69999999999999
  episode_reward_mean: -43.56900000000002
  episode_reward_min: -64.20000000000007
  episodes_this_iter: 40
  episodes_total: 1240
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.8291501998901367
          entropy_coeff: 0.0
          kl: 0.012866836041212082
          model: {}
          policy_loss: -0.04232071712613106
          total_loss: 65.07513427734375
          vf_explained_var: 0.13710403442382812
          vf_loss: 65.10442352294922
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
        

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,32,199.322,128000,-42.087,-12.3,-67.5,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,31,193.575,124000,-33.663,-18.6,-57.0,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,31,193.853,124000,-38.226,-9.0,-63.0,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,31,193.832,124000,-43.569,-26.7,-64.2,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 256000
  custom_metrics: {}
  date: 2021-04-29_15-43-40
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -22.199999999999964
  episode_reward_mean: -42.774000000000015
  episode_reward_min: -61.50000000000008
  episodes_this_iter: 40
  episodes_total: 1280
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.8071923851966858
          entropy_coeff: 0.0
          kl: 0.016047891229391098
          model: {}
          policy_loss: -0.05564795434474945
          total_loss: 47.105125427246094
          vf_explained_var: 0.1270139366388321
          vf_loss: 47.14452362060547
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
      

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,33,204.986,132000,-40.59,-12.3,-64.2,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,32,199.026,128000,-33.033,-17.7,-57.0,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,32,199.316,128000,-38.457,-13.8,-63.0,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,32,199.292,128000,-42.774,-22.2,-61.5,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 264000
  custom_metrics: {}
  date: 2021-04-29_15-43-45
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -19.799999999999972
  episode_reward_mean: -42.47400000000002
  episode_reward_min: -64.50000000000009
  episodes_this_iter: 40
  episodes_total: 1320
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.7916695475578308
          entropy_coeff: 0.0
          kl: 0.014242918230593204
          model: {}
          policy_loss: -0.041739337146282196
          total_loss: 71.66497802734375
          vf_explained_var: 0.14963850378990173
          vf_loss: 71.69229888916016
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
      

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,34,210.385,136000,-40.338,-17.1,-58.5,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,33,204.719,132000,-32.511,-16.8,-57.3,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,33,204.868,132000,-38.577,-15.9,-63.0,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,33,204.751,132000,-42.474,-19.8,-64.5,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 272000
  custom_metrics: {}
  date: 2021-04-29_15-43-51
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -18.299999999999994
  episode_reward_mean: -41.19600000000001
  episode_reward_min: -64.50000000000009
  episodes_this_iter: 40
  episodes_total: 1360
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.7697420716285706
          entropy_coeff: 0.0
          kl: 0.01435716450214386
          model: {}
          policy_loss: -0.04923272505402565
          total_loss: 46.393985748291016
          vf_explained_var: 0.11645837128162384
          vf_loss: 46.428680419921875
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
      

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,35,216.153,140000,-39.807,-17.1,-60.3,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,34,210.095,136000,-32.973,-12.9,-57.3,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,34,210.33,136000,-36.894,-18.6,-58.5,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,34,210.191,136000,-41.196,-18.3,-64.5,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 280000
  custom_metrics: {}
  date: 2021-04-29_15-43-56
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -18.299999999999994
  episode_reward_mean: -39.51900000000001
  episode_reward_min: -61.20000000000006
  episodes_this_iter: 40
  episodes_total: 1400
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.7619023323059082
          entropy_coeff: 0.0
          kl: 0.014338860288262367
          model: {}
          policy_loss: -0.044018156826496124
          total_loss: 62.76490783691406
          vf_explained_var: 0.2097339928150177
          vf_loss: 62.79440689086914
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
       

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,36,221.89,144000,-39.189,-20.1,-60.3,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,35,215.818,140000,-31.941,-9.0,-57.3,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,35,215.979,140000,-36.162,-17.7,-58.5,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,35,215.947,140000,-39.519,-18.3,-61.2,100


Result for PPO_MultiAgentArena_6768d_00002:
  agent_timesteps_total: 288000
  custom_metrics: {}
  date: 2021-04-29_15-44-02
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -17.699999999999992
  episode_reward_mean: -35.44499999999999
  episode_reward_min: -57.600000000000044
  episodes_this_iter: 40
  episodes_total: 1440
  experiment_id: e1a3447d40ef4fcb931f851e938075bc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.7972707748413086
          entropy_coeff: 0.0
          kl: 0.013321878388524055
          model: {}
          policy_loss: -0.045407045632600784
          total_loss: 44.729469299316406
          vf_explained_var: 0.13115015625953674
          vf_loss: 44.76138687133789
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
    

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,37,227.625,148000,-38.061,-20.1,-61.5,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,36,221.374,144000,-30.894,-9.0,-54.6,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,36,221.557,144000,-35.445,-17.7,-57.6,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,36,221.631,144000,-39.453,-21.0,-61.2,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 296000
  custom_metrics: {}
  date: 2021-04-29_15-44-08
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -20.99999999999997
  episode_reward_mean: -38.661000000000016
  episode_reward_min: -77.40000000000009
  episodes_this_iter: 40
  episodes_total: 1480
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.745100200176239
          entropy_coeff: 0.0
          kl: 0.01402284111827612
          model: {}
          policy_loss: -0.04022061824798584
          total_loss: 71.06468963623047
          vf_explained_var: 0.14604327082633972
          vf_loss: 71.09071350097656
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
         

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,37,227.625,148000,-38.061,-20.1,-61.5,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,38,232.223,152000,-30.516,-15.9,-46.2,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,37,227.394,148000,-35.784,-18.3,-60.3,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,37,227.318,148000,-38.661,-21.0,-77.4,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 304000
  custom_metrics: {}
  date: 2021-04-29_15-44-13
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -20.99999999999997
  episode_reward_mean: -39.09000000000001
  episode_reward_min: -77.40000000000009
  episodes_this_iter: 40
  episodes_total: 1520
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.7454019784927368
          entropy_coeff: 0.0
          kl: 0.014348085969686508
          model: {}
          policy_loss: -0.03885621950030327
          total_loss: 70.56571197509766
          vf_explained_var: 0.14796730875968933
          vf_loss: 70.59004211425781
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
        

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,38,233.149,152000,-37.884,-20.1,-61.5,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,39,237.397,156000,-31.275,-15.9,-46.2,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,38,232.737,152000,-34.224,-10.8,-60.3,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,38,232.691,152000,-39.09,-21.0,-77.4,100


Result for PPO_MultiAgentArena_6768d_00000:
  agent_timesteps_total: 312000
  custom_metrics: {}
  date: 2021-04-29_15-44-19
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -16.79999999999998
  episode_reward_mean: -34.848
  episode_reward_min: -57.00000000000007
  episodes_this_iter: 40
  episodes_total: 1560
  experiment_id: ca7d5fe006ec4c509dd419cd353bfbf5
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.7127864360809326
          entropy_coeff: 0.0
          kl: 0.011409412138164043
          model: {}
          policy_loss: -0.03785654157400131
          total_loss: 51.31982421875
          vf_explained_var: 0.14388220012187958
          vf_loss: 51.346126556396484
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          entropy: 0.

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,39,238.335,156000,-34.848,-16.8,-57.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,40,242.728,160000,-29.952,-15.9,-57.3,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,39,237.969,156000,-33.564,-8.1,-60.3,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,39,237.994,156000,-39.144,-21.3,-58.5,100


Result for PPO_MultiAgentArena_6768d_00002:
  agent_timesteps_total: 320000
  custom_metrics: {}
  date: 2021-04-29_15-44-24
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -8.100000000000005
  episode_reward_mean: -30.992999999999988
  episode_reward_min: -48.30000000000004
  episodes_this_iter: 40
  episodes_total: 1600
  experiment_id: e1a3447d40ef4fcb931f851e938075bc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.7552467584609985
          entropy_coeff: 0.0
          kl: 0.014542650431394577
          model: {}
          policy_loss: -0.05497484654188156
          total_loss: 33.39323425292969
          vf_explained_var: 0.1488848328590393
          vf_loss: 33.43348693847656
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
        

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,40,243.795,160000,-33.168,-16.8,-57.0,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,41,248.288,164000,-30.153,-16.8,-58.2,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,40,243.244,160000,-30.993,-8.1,-48.3,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,40,243.462,160000,-38.331,-21.9,-64.5,100


Result for PPO_MultiAgentArena_6768d_00002:
  agent_timesteps_total: 328000
  custom_metrics: {}
  date: 2021-04-29_15-44-29
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -8.100000000000005
  episode_reward_mean: -31.046999999999993
  episode_reward_min: -57.00000000000008
  episodes_this_iter: 40
  episodes_total: 1640
  experiment_id: e1a3447d40ef4fcb931f851e938075bc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.7417598962783813
          entropy_coeff: 0.0
          kl: 0.012348907068371773
          model: {}
          policy_loss: -0.04774465784430504
          total_loss: 52.86546325683594
          vf_explained_var: 0.15233348309993744
          vf_loss: 52.90070343017578
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
       

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,41,249.393,164000,-33.786,-2.1,-56.1,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,42,253.796,168000,-30.255,-14.4,-58.2,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,41,248.654,164000,-31.047,-8.1,-57.0,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,41,249.069,164000,-39.006,-9.3,-76.2,100


Result for PPO_MultiAgentArena_6768d_00002:
  agent_timesteps_total: 336000
  custom_metrics: {}
  date: 2021-04-29_15-44-35
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -8.700000000000003
  episode_reward_mean: -31.139999999999997
  episode_reward_min: -57.00000000000008
  episodes_this_iter: 40
  episodes_total: 1680
  experiment_id: e1a3447d40ef4fcb931f851e938075bc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.7316937446594238
          entropy_coeff: 0.0
          kl: 0.012332964688539505
          model: {}
          policy_loss: -0.03773343563079834
          total_loss: 63.31257247924805
          vf_explained_var: 0.15398120880126953
          vf_loss: 63.33781814575195
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
       

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,42,254.757,168000,-33.405,-2.1,-73.5,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,43,259.342,172000,-29.154,-14.4,-58.2,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,42,254.204,168000,-31.14,-8.7,-57.0,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,42,254.41,168000,-38.628,-9.3,-76.2,100


Result for PPO_MultiAgentArena_6768d_00002:
  agent_timesteps_total: 344000
  custom_metrics: {}
  date: 2021-04-29_15-44-41
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -8.69999999999999
  episode_reward_mean: -32.04899999999999
  episode_reward_min: -54.30000000000005
  episodes_this_iter: 40
  episodes_total: 1720
  experiment_id: e1a3447d40ef4fcb931f851e938075bc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.7055241465568542
          entropy_coeff: 0.0
          kl: 0.013003507629036903
          model: {}
          policy_loss: -0.04498252645134926
          total_loss: 71.33078002929688
          vf_explained_var: 0.20857921242713928
          vf_loss: 71.36259460449219
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
         

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,43,260.418,172000,-33.864,-2.1,-73.5,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,44,264.708,176000,-27.336,-11.1,-42.9,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,43,259.823,172000,-32.049,-8.7,-54.3,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,43,260.082,172000,-39.195,-9.3,-67.8,100


Result for PPO_MultiAgentArena_6768d_00002:
  agent_timesteps_total: 352000
  custom_metrics: {}
  date: 2021-04-29_15-44-46
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -8.69999999999999
  episode_reward_mean: -30.948
  episode_reward_min: -54.30000000000005
  episodes_this_iter: 40
  episodes_total: 1760
  experiment_id: e1a3447d40ef4fcb931f851e938075bc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.6968677043914795
          entropy_coeff: 0.0
          kl: 0.013342528603971004
          model: {}
          policy_loss: -0.04849805310368538
          total_loss: 43.174903869628906
          vf_explained_var: 0.2104724496603012
          vf_loss: 43.20989227294922
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          entropy: 0

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,44,265.712,176000,-33.495,-14.7,-54.6,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,45,270.928,180000,-26.625,-11.1,-44.7,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,44,265.122,176000,-30.948,-8.7,-54.3,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,44,265.373,176000,-37.926,-14.7,-67.8,100


Result for PPO_MultiAgentArena_6768d_00002:
  agent_timesteps_total: 360000
  custom_metrics: {}
  date: 2021-04-29_15-44-52
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -8.69999999999999
  episode_reward_mean: -28.934999999999995
  episode_reward_min: -48.90000000000005
  episodes_this_iter: 40
  episodes_total: 1800
  experiment_id: e1a3447d40ef4fcb931f851e938075bc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.6995198726654053
          entropy_coeff: 0.0
          kl: 0.011494915001094341
          model: {}
          policy_loss: -0.037927594035863876
          total_loss: 36.368751525878906
          vf_explained_var: 0.16457778215408325
          vf_loss: 36.39503479003906
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
      

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,45,271.818,180000,-32.73,-14.1,-54.6,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,46,276.337,184000,-26.889,-13.8,-48.0,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,45,271.282,180000,-28.935,-8.7,-48.9,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,45,271.45,180000,-37.95,-14.7,-63.0,100


Result for PPO_MultiAgentArena_6768d_00002:
  agent_timesteps_total: 368000
  custom_metrics: {}
  date: 2021-04-29_15-44-58
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -13.799999999999999
  episode_reward_mean: -28.241999999999997
  episode_reward_min: -55.20000000000004
  episodes_this_iter: 40
  episodes_total: 1840
  experiment_id: e1a3447d40ef4fcb931f851e938075bc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.672947883605957
          entropy_coeff: 0.0
          kl: 0.01259644515812397
          model: {}
          policy_loss: -0.04462946206331253
          total_loss: 39.78437423706055
          vf_explained_var: 0.18982207775115967
          vf_loss: 39.81624984741211
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
        

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,46,277.275,184000,-32.052,-10.8,-73.2,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,47,282.03,188000,-27.93,-13.8,-48.0,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,46,276.71,184000,-28.242,-13.8,-55.2,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,46,276.901,184000,-37.236,-16.2,-63.0,100


Result for PPO_MultiAgentArena_6768d_00002:
  agent_timesteps_total: 376000
  custom_metrics: {}
  date: 2021-04-29_15-45-03
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -13.799999999999999
  episode_reward_mean: -29.306999999999995
  episode_reward_min: -55.20000000000004
  episodes_this_iter: 40
  episodes_total: 1880
  experiment_id: e1a3447d40ef4fcb931f851e938075bc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.6598195433616638
          entropy_coeff: 0.0
          kl: 0.011603888124227524
          model: {}
          policy_loss: -0.037345319986343384
          total_loss: 37.23706817626953
          vf_explained_var: 0.2108994871377945
          vf_loss: 37.262664794921875
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
     

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,47,282.898,188000,-31.617,-10.8,-73.2,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,48,287.448,192000,-27.768,-9.3,-48.0,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,47,282.328,188000,-29.307,-13.8,-55.2,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,47,282.502,188000,-36.786,-16.2,-61.2,100


Result for PPO_MultiAgentArena_6768d_00002:
  agent_timesteps_total: 384000
  custom_metrics: {}
  date: 2021-04-29_15-45-09
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -12.90000000000001
  episode_reward_mean: -28.856999999999992
  episode_reward_min: -51.30000000000005
  episodes_this_iter: 40
  episodes_total: 1920
  experiment_id: e1a3447d40ef4fcb931f851e938075bc
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.6441656351089478
          entropy_coeff: 0.0
          kl: 0.013493036851286888
          model: {}
          policy_loss: -0.043827593326568604
          total_loss: 38.107139587402344
          vf_explained_var: 0.23160181939601898
          vf_loss: 38.137306213378906
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
    

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,48,288.357,192000,-30.654,-14.4,-51.9,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,48,287.448,192000,-27.768,-9.3,-48.0,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,49,293.547,196000,-27.903,-9.0,-51.9,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,48,288.039,192000,-37.305,-16.2,-68.1,100


Result for PPO_MultiAgentArena_6768d_00000:
  agent_timesteps_total: 392000
  custom_metrics: {}
  date: 2021-04-29_15-45-15
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -13.49999999999999
  episode_reward_mean: -29.543999999999983
  episode_reward_min: -51.90000000000011
  episodes_this_iter: 40
  episodes_total: 1960
  experiment_id: ca7d5fe006ec4c509dd419cd353bfbf5
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.6254383325576782
          entropy_coeff: 0.0
          kl: 0.01105943787842989
          model: {}
          policy_loss: -0.039970993995666504
          total_loss: 47.576210021972656
          vf_explained_var: 0.09534189105033875
          vf_loss: 47.604984283447266
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
     

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,50,300.465,200000,-30.054,-10.8,-55.8,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,49,293.319,196000,-27.468,-9.3,-60.0,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,49,293.547,196000,-27.903,-9.0,-51.9,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,49,293.744,196000,-37.434,-12.6,-68.1,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 400000
  custom_metrics: {}
  date: 2021-04-29_15-45-21
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -12.600000000000001
  episode_reward_mean: -36.62100000000001
  episode_reward_min: -66.00000000000007
  episodes_this_iter: 40
  episodes_total: 2000
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.6448127031326294
          entropy_coeff: 0.0
          kl: 0.009580668993294239
          model: {}
          policy_loss: -0.019873110577464104
          total_loss: 74.49268341064453
          vf_explained_var: 0.1424368917942047
          vf_loss: 74.50285339355469
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
       

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,50,300.465,200000,-30.054,-10.8,-55.8,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,50,299.686,200000,-27.465,-10.8,-60.0,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,50,299.996,200000,-27.393,-8.7,-51.9,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,51,305.556,204000,-37.05,-12.6,-66.0,100


Result for PPO_MultiAgentArena_6768d_00000:
  agent_timesteps_total: 408000
  custom_metrics: {}
  date: 2021-04-29_15-45-27
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -10.799999999999994
  episode_reward_mean: -29.74499999999999
  episode_reward_min: -55.80000000000005
  episodes_this_iter: 40
  episodes_total: 2040
  experiment_id: ca7d5fe006ec4c509dd419cd353bfbf5
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.6240818500518799
          entropy_coeff: 0.0
          kl: 0.011015006341040134
          model: {}
          policy_loss: -0.036762431263923645
          total_loss: 70.20338439941406
          vf_explained_var: 0.11177567392587662
          vf_loss: 70.22899627685547
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
      

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,51,305.992,204000,-29.745,-10.8,-55.8,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,51,305.148,204000,-27.216,-10.8,-54.6,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,51,305.509,204000,-26.811,-8.7,-51.9,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,52,312.915,208000,-36.462,-16.2,-60.0,100


Result for PPO_MultiAgentArena_6768d_00000:
  agent_timesteps_total: 416000
  custom_metrics: {}
  date: 2021-04-29_15-45-34
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -12.000000000000004
  episode_reward_mean: -29.318999999999996
  episode_reward_min: -58.20000000000006
  episodes_this_iter: 40
  episodes_total: 2080
  experiment_id: ca7d5fe006ec4c509dd419cd353bfbf5
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.6388430595397949
          entropy_coeff: 0.0
          kl: 0.009823620319366455
          model: {}
          policy_loss: -0.025797009468078613
          total_loss: 102.05728149414062
          vf_explained_var: 0.1473698914051056
          vf_loss: 102.07312774658203
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
    

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,53,319.887,212000,-29.259,-12.9,-58.2,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,52,312.549,208000,-28.383,-8.1,-54.6,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,52,312.874,208000,-26.889,-1.2,-50.7,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,52,312.915,208000,-36.462,-16.2,-60.0,100


Result for PPO_MultiAgentArena_6768d_00001:
  agent_timesteps_total: 424000
  custom_metrics: {}
  date: 2021-04-29_15-45-41
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -7.50000000000001
  episode_reward_mean: -27.317999999999987
  episode_reward_min: -49.50000000000003
  episodes_this_iter: 40
  episodes_total: 2120
  experiment_id: 88ed0d232176421b9f10b011ac47f80b
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.6354122757911682
          entropy_coeff: 0.0
          kl: 0.010645512491464615
          model: {}
          policy_loss: -0.039844147861003876
          total_loss: 46.61997985839844
          vf_explained_var: 0.1894303858280182
          vf_loss: 46.6490478515625
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
         

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,54,325.491,216000,-28.773,-6.9,-58.2,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,53,319.12,212000,-27.318,-7.5,-49.5,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,53,319.497,212000,-25.953,-1.2,-47.4,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,53,319.661,212000,-35.247,-12.6,-57.0,100


Result for PPO_MultiAgentArena_6768d_00001:
  agent_timesteps_total: 432000
  custom_metrics: {}
  date: 2021-04-29_15-45-47
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -7.50000000000001
  episode_reward_mean: -26.412
  episode_reward_min: -56.10000000000006
  episodes_this_iter: 40
  episodes_total: 2160
  experiment_id: 88ed0d232176421b9f10b011ac47f80b
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.6164888739585876
          entropy_coeff: 0.0
          kl: 0.011558332480490208
          model: {}
          policy_loss: -0.040928248316049576
          total_loss: 59.5357551574707
          vf_explained_var: 0.1788903772830963
          vf_loss: 59.564979553222656
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          entropy: 0

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,55,331.243,220000,-26.949,-6.9,-48.6,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,54,324.75,216000,-26.412,-7.5,-56.1,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,54,325.076,216000,-26.289,-1.2,-59.7,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,54,325.326,216000,-35.343,-12.6,-57.3,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 440000
  custom_metrics: {}
  date: 2021-04-29_15-45-52
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -12.600000000000005
  episode_reward_mean: -34.164
  episode_reward_min: -65.40000000000008
  episodes_this_iter: 40
  episodes_total: 2200
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.598090648651123
          entropy_coeff: 0.0
          kl: 0.012055113911628723
          model: {}
          policy_loss: -0.04696172848343849
          total_loss: 73.78448486328125
          vf_explained_var: 0.17110657691955566
          vf_loss: 73.81924438476562
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          entropy: 

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,56,337.003,224000,-26.418,-7.8,-51.6,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,55,330.674,220000,-25.41,-7.5,-56.1,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,55,330.944,220000,-26.178,-8.1,-59.7,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,55,331.071,220000,-34.164,-12.6,-65.4,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 448000
  custom_metrics: {}
  date: 2021-04-29_15-45-58
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -17.099999999999966
  episode_reward_mean: -34.782000000000004
  episode_reward_min: -65.40000000000008
  episodes_this_iter: 40
  episodes_total: 2240
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.5774433612823486
          entropy_coeff: 0.0
          kl: 0.01109391637146473
          model: {}
          policy_loss: -0.025817465037107468
          total_loss: 60.31378173828125
          vf_explained_var: 0.10948525369167328
          vf_loss: 60.32836151123047
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
      

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,57,342.549,228000,-27.33,-6.0,-51.6,100
PPO_MultiAgentArena_6768d_00001,RUNNING,192.168.0.100:27761,0.0005,2000,56,336.411,224000,-25.152,-6.9,-43.5,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,56,336.678,224000,-26.298,-8.1,-59.7,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,56,336.772,224000,-34.782,-17.1,-65.4,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 456000
  custom_metrics: {}
  date: 2021-04-29_15-46-04
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -15.299999999999983
  episode_reward_mean: -34.01699999999999
  episode_reward_min: -65.40000000000008
  episodes_this_iter: 40
  episodes_total: 2280
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.5810614228248596
          entropy_coeff: 0.0
          kl: 0.009506484493613243
          model: {}
          policy_loss: -0.02447396144270897
          total_loss: 74.84034729003906
          vf_explained_var: 0.2173941731452942
          vf_loss: 74.85519409179688
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
        

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,58,347.532,232000,-27.507,-6.0,-48.9,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,57,342.016,228000,-25.698,-8.1,-50.4,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,57,342.124,228000,-34.017,-15.3,-65.4,100
PPO_MultiAgentArena_6768d_00001,TERMINATED,,0.0005,2000,57,341.737,228000,-24.663,-6.9,-47.1,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 464000
  custom_metrics: {}
  date: 2021-04-29_15-46-09
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 5.699999999999996
  episode_reward_mean: -33.74699999999999
  episode_reward_min: -49.50000000000007
  episodes_this_iter: 40
  episodes_total: 2320
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.5718386173248291
          entropy_coeff: 0.0
          kl: 0.00854343269020319
          model: {}
          policy_loss: -0.02570328116416931
          total_loss: 82.67941284179688
          vf_explained_var: 0.2271953821182251
          vf_loss: 82.69647216796875
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          e

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,59,352.756,236000,-27.492,-9.6,-48.9,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,58,347.076,232000,-26.481,-8.1,-50.4,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,58,347.131,232000,-33.747,5.7,-49.5,100
PPO_MultiAgentArena_6768d_00001,TERMINATED,,0.0005,2000,57,341.737,228000,-24.663,-6.9,-47.1,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 472000
  custom_metrics: {}
  date: 2021-04-29_15-46-14
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 5.699999999999996
  episode_reward_mean: -33.974999999999994
  episode_reward_min: -63.000000000000135
  episodes_this_iter: 40
  episodes_total: 2360
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.5892821550369263
          entropy_coeff: 0.0
          kl: 0.010701831430196762
          model: {}
          policy_loss: -0.03261489421129227
          total_loss: 60.32833480834961
          vf_explained_var: 0.15444166958332062
          vf_loss: 60.35011672973633
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
       

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,60,358.138,240000,-27.12,-4.2,-48.9,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,59,352.315,236000,-26.205,-8.1,-42.9,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,59,352.434,236000,-33.975,5.7,-63.0,100
PPO_MultiAgentArena_6768d_00001,TERMINATED,,0.0005,2000,57,341.737,228000,-24.663,-6.9,-47.1,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 480000
  custom_metrics: {}
  date: 2021-04-29_15-46-20
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 5.699999999999996
  episode_reward_mean: -33.195
  episode_reward_min: -63.000000000000135
  episodes_this_iter: 40
  episodes_total: 2400
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.5813271999359131
          entropy_coeff: 0.0
          kl: 0.010782662779092789
          model: {}
          policy_loss: -0.03663226589560509
          total_loss: 56.77137756347656
          vf_explained_var: 0.15052932500839233
          vf_loss: 56.797088623046875
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          entropy:

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,62,367.505,248000,-27.666,-11.7,-51.6,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,61,362.439,244000,-26.562,-10.2,-45.3,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,61,362.536,244000,-33.246,-3.3,-63.0,100
PPO_MultiAgentArena_6768d_00001,TERMINATED,,0.0005,2000,57,341.737,228000,-24.663,-6.9,-47.1,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 496000
  custom_metrics: {}
  date: 2021-04-29_15-46-29
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -3.3
  episode_reward_mean: -33.00299999999999
  episode_reward_min: -57.00000000000006
  episodes_this_iter: 40
  episodes_total: 2480
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.5644306540489197
          entropy_coeff: 0.0
          kl: 0.010993688367307186
          model: {}
          policy_loss: -0.036069903522729874
          total_loss: 51.7763786315918
          vf_explained_var: 0.15618523955345154
          vf_loss: 51.80132293701172
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          entropy: 0.5

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,64,376.983,256000,-26.085,-9.9,-61.2,100
PPO_MultiAgentArena_6768d_00002,RUNNING,192.168.0.100:27752,0.0001,3000,63,371.652,252000,-25.809,-10.8,-57.9,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,63,371.803,252000,-31.482,1.5,-57.0,100
PPO_MultiAgentArena_6768d_00001,TERMINATED,,0.0005,2000,57,341.737,228000,-24.663,-6.9,-47.1,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 512000
  custom_metrics: {}
  date: 2021-04-29_15-46-38
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 1.499999999999996
  episode_reward_mean: -30.881999999999998
  episode_reward_min: -60.00000000000007
  episodes_this_iter: 40
  episodes_total: 2560
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.5531268119812012
          entropy_coeff: 0.0
          kl: 0.009923124685883522
          model: {}
          policy_loss: -0.028609829023480415
          total_loss: 54.256229400634766
          vf_explained_var: 0.13008983433246613
          vf_loss: 54.2747917175293
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
       

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,66,385.646,264000,-27.402,-8.7,-61.2,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,65,381.288,260000,-31.209,1.5,-60.0,100
PPO_MultiAgentArena_6768d_00001,TERMINATED,,0.0005,2000,57,341.737,228000,-24.663,-6.9,-47.1,100
PPO_MultiAgentArena_6768d_00002,TERMINATED,,0.0001,3000,65,381.123,260000,-23.769,-4.5,-42.0,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 528000
  custom_metrics: {}
  date: 2021-04-29_15-46-47
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -6.000000000000007
  episode_reward_mean: -30.94199999999999
  episode_reward_min: -60.00000000000007
  episodes_this_iter: 40
  episodes_total: 2640
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.5442802309989929
          entropy_coeff: 0.0
          kl: 0.009576226584613323
          model: {}
          policy_loss: -0.04068347066640854
          total_loss: 72.03518676757812
          vf_explained_var: 0.18012464046478271
          vf_loss: 72.06617736816406
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
        

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,68,393.709,272000,-26.082,-9.6,-56.1,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,67,388.96,268000,-31.242,-9.3,-64.5,100
PPO_MultiAgentArena_6768d_00001,TERMINATED,,0.0005,2000,57,341.737,228000,-24.663,-6.9,-47.1,100
PPO_MultiAgentArena_6768d_00002,TERMINATED,,0.0001,3000,65,381.123,260000,-23.769,-4.5,-42.0,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 544000
  custom_metrics: {}
  date: 2021-04-29_15-46-55
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -6.300000000000007
  episode_reward_mean: -32.523
  episode_reward_min: -64.50000000000007
  episodes_this_iter: 40
  episodes_total: 2720
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.5339787602424622
          entropy_coeff: 0.0
          kl: 0.0100470669567585
          model: {}
          policy_loss: -0.02859780751168728
          total_loss: 74.75786590576172
          vf_explained_var: 0.14006027579307556
          vf_loss: 74.77629089355469
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          entropy: 0.

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,69,398.646,276000,-25.575,-9.6,-56.1,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,69,398.253,276000,-31.248,-6.3,-64.5,100
PPO_MultiAgentArena_6768d_00001,TERMINATED,,0.0005,2000,57,341.737,228000,-24.663,-6.9,-47.1,100
PPO_MultiAgentArena_6768d_00002,TERMINATED,,0.0001,3000,65,381.123,260000,-23.769,-4.5,-42.0,100


Result for PPO_MultiAgentArena_6768d_00000:
  agent_timesteps_total: 560000
  custom_metrics: {}
  date: 2021-04-29_15-47-04
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -10.199999999999989
  episode_reward_mean: -26.249999999999986
  episode_reward_min: -45.00000000000004
  episodes_this_iter: 40
  episodes_total: 2800
  experiment_id: ca7d5fe006ec4c509dd419cd353bfbf5
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.5122224688529968
          entropy_coeff: 0.0
          kl: 0.00915178470313549
          model: {}
          policy_loss: -0.02914438769221306
          total_loss: 49.6755485534668
          vf_explained_var: 0.19187286496162415
          vf_loss: 49.69542694091797
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
        

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,71,406.252,284000,-27.183,-10.2,-51.6,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,70,402.014,280000,-31.068,-6.3,-59.7,100
PPO_MultiAgentArena_6768d_00001,TERMINATED,,0.0005,2000,57,341.737,228000,-24.663,-6.9,-47.1,100
PPO_MultiAgentArena_6768d_00002,TERMINATED,,0.0001,3000,65,381.123,260000,-23.769,-4.5,-42.0,100


Result for PPO_MultiAgentArena_6768d_00000:
  agent_timesteps_total: 576000
  custom_metrics: {}
  date: 2021-04-29_15-47-12
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -10.500000000000004
  episode_reward_mean: -26.132999999999985
  episode_reward_min: -51.60000000000004
  episodes_this_iter: 40
  episodes_total: 2880
  experiment_id: ca7d5fe006ec4c509dd419cd353bfbf5
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.5092606544494629
          entropy_coeff: 0.0
          kl: 0.010658515617251396
          model: {}
          policy_loss: -0.04323691874742508
          total_loss: 49.281822204589844
          vf_explained_var: 0.1927822232246399
          vf_loss: 49.31426239013672
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
      

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,RUNNING,192.168.0.100:27751,0.0001,2000,73,414.094,292000,-25.521,-10.2,-51.6,100
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,72,409.673,288000,-30.117,-10.8,-65.4,100
PPO_MultiAgentArena_6768d_00001,TERMINATED,,0.0005,2000,57,341.737,228000,-24.663,-6.9,-47.1,100
PPO_MultiAgentArena_6768d_00002,TERMINATED,,0.0001,3000,65,381.123,260000,-23.769,-4.5,-42.0,100


Result for PPO_MultiAgentArena_6768d_00000:
  agent_timesteps_total: 592000
  custom_metrics: {}
  date: 2021-04-29_15-47-20
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -8.400000000000002
  episode_reward_mean: -24.722999999999992
  episode_reward_min: -47.400000000000006
  episodes_this_iter: 40
  episodes_total: 2960
  experiment_id: ca7d5fe006ec4c509dd419cd353bfbf5
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.0003000000142492354
          entropy: 0.4993453323841095
          entropy_coeff: 0.0
          kl: 0.010554897598922253
          model: {}
          policy_loss: -0.03376566618680954
          total_loss: 45.99901580810547
          vf_explained_var: 0.15946485102176666
          vf_loss: 46.0220947265625
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
        

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,75,421.671,300000,-27.96,4.5,-48.0,100
PPO_MultiAgentArena_6768d_00000,TERMINATED,,0.0001,2000,74,418.431,296000,-24.723,-8.4,-47.4,100
PPO_MultiAgentArena_6768d_00001,TERMINATED,,0.0005,2000,57,341.737,228000,-24.663,-6.9,-47.1,100
PPO_MultiAgentArena_6768d_00002,TERMINATED,,0.0001,3000,65,381.123,260000,-23.769,-4.5,-42.0,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 608000
  custom_metrics: {}
  date: 2021-04-29_15-47-27
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -6.600000000000004
  episode_reward_mean: -27.716999999999988
  episode_reward_min: -48.00000000000014
  episodes_this_iter: 40
  episodes_total: 3040
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.5187499523162842
          cur_lr: 0.0003000000142492354
          entropy: 0.4986693561077118
          entropy_coeff: 0.0
          kl: 0.0081246979534626
          model: {}
          policy_loss: -0.04725240170955658
          total_loss: 45.27167510986328
          vf_explained_var: 0.2570512890815735
          vf_loss: 45.30658721923828
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
          

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,77,428.551,308000,-27.498,-6.6,-49.8,100
PPO_MultiAgentArena_6768d_00000,TERMINATED,,0.0001,2000,74,418.431,296000,-24.723,-8.4,-47.4,100
PPO_MultiAgentArena_6768d_00001,TERMINATED,,0.0005,2000,57,341.737,228000,-24.663,-6.9,-47.1,100
PPO_MultiAgentArena_6768d_00002,TERMINATED,,0.0001,3000,65,381.123,260000,-23.769,-4.5,-42.0,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 624000
  custom_metrics: {}
  date: 2021-04-29_15-47-34
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -10.799999999999994
  episode_reward_mean: -27.311999999999994
  episode_reward_min: -49.800000000000075
  episodes_this_iter: 40
  episodes_total: 3120
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.5187499523162842
          cur_lr: 0.0003000000142492354
          entropy: 0.4930303692817688
          entropy_coeff: 0.0
          kl: 0.007112347986549139
          model: {}
          policy_loss: -0.03969614952802658
          total_loss: 48.887725830078125
          vf_explained_var: 0.20681846141815186
          vf_loss: 48.91661834716797
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
    

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,79,434.823,316000,-26.469,-8.7,-49.8,100
PPO_MultiAgentArena_6768d_00000,TERMINATED,,0.0001,2000,74,418.431,296000,-24.723,-8.4,-47.4,100
PPO_MultiAgentArena_6768d_00001,TERMINATED,,0.0005,2000,57,341.737,228000,-24.663,-6.9,-47.1,100
PPO_MultiAgentArena_6768d_00002,TERMINATED,,0.0001,3000,65,381.123,260000,-23.769,-4.5,-42.0,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 640000
  custom_metrics: {}
  date: 2021-04-29_15-47-40
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -8.700000000000005
  episode_reward_mean: -25.40999999999999
  episode_reward_min: -45.60000000000002
  episodes_this_iter: 40
  episodes_total: 3200
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.5187499523162842
          cur_lr: 0.0003000000142492354
          entropy: 0.4889751076698303
          entropy_coeff: 0.0
          kl: 0.007557917386293411
          model: {}
          policy_loss: -0.04218336194753647
          total_loss: 49.497039794921875
          vf_explained_var: 0.2600170373916626
          vf_loss: 49.527748107910156
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
       

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,81,441.235,324000,-25.314,-8.4,-49.5,100
PPO_MultiAgentArena_6768d_00000,TERMINATED,,0.0001,2000,74,418.431,296000,-24.723,-8.4,-47.4,100
PPO_MultiAgentArena_6768d_00001,TERMINATED,,0.0005,2000,57,341.737,228000,-24.663,-6.9,-47.1,100
PPO_MultiAgentArena_6768d_00002,TERMINATED,,0.0001,3000,65,381.123,260000,-23.769,-4.5,-42.0,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 656000
  custom_metrics: {}
  date: 2021-04-29_15-47-47
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -8.399999999999997
  episode_reward_mean: -25.712999999999997
  episode_reward_min: -49.500000000000085
  episodes_this_iter: 40
  episodes_total: 3280
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.5187499523162842
          cur_lr: 0.0003000000142492354
          entropy: 0.4842988848686218
          entropy_coeff: 0.0
          kl: 0.0069571854546666145
          model: {}
          policy_loss: -0.033499155193567276
          total_loss: 45.17038345336914
          vf_explained_var: 0.20376478135585785
          vf_loss: 45.19331359863281
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
    

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,83,447.581,332000,-26.382,-8.4,-49.5,100
PPO_MultiAgentArena_6768d_00000,TERMINATED,,0.0001,2000,74,418.431,296000,-24.723,-8.4,-47.4,100
PPO_MultiAgentArena_6768d_00001,TERMINATED,,0.0005,2000,57,341.737,228000,-24.663,-6.9,-47.1,100
PPO_MultiAgentArena_6768d_00002,TERMINATED,,0.0001,3000,65,381.123,260000,-23.769,-4.5,-42.0,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 672000
  custom_metrics: {}
  date: 2021-04-29_15-47-53
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -9.300000000000004
  episode_reward_mean: -26.906999999999993
  episode_reward_min: -64.50000000000009
  episodes_this_iter: 40
  episodes_total: 3360
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.5187499523162842
          cur_lr: 0.0003000000142492354
          entropy: 0.47307562828063965
          entropy_coeff: 0.0
          kl: 0.006755154579877853
          model: {}
          policy_loss: -0.03495004028081894
          total_loss: 61.36048126220703
          vf_explained_var: 0.20576593279838562
          vf_loss: 61.38517379760742
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
      

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,85,453.952,340000,-25.599,-3.3,-64.5,100
PPO_MultiAgentArena_6768d_00000,TERMINATED,,0.0001,2000,74,418.431,296000,-24.723,-8.4,-47.4,100
PPO_MultiAgentArena_6768d_00001,TERMINATED,,0.0005,2000,57,341.737,228000,-24.663,-6.9,-47.1,100
PPO_MultiAgentArena_6768d_00002,TERMINATED,,0.0001,3000,65,381.123,260000,-23.769,-4.5,-42.0,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 688000
  custom_metrics: {}
  date: 2021-04-29_15-48-00
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -3.300000000000008
  episode_reward_mean: -25.967999999999996
  episode_reward_min: -64.50000000000009
  episodes_this_iter: 40
  episodes_total: 3440
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.5187499523162842
          cur_lr: 0.0003000000142492354
          entropy: 0.48155075311660767
          entropy_coeff: 0.0
          kl: 0.007136037107557058
          model: {}
          policy_loss: -0.03913995623588562
          total_loss: 64.23258209228516
          vf_explained_var: 0.20871126651763916
          vf_loss: 64.26087951660156
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
      

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00003,RUNNING,192.168.0.100:27831,0.0005,3000,87,460.319,348000,-26.067,-0.3,-59.4,100
PPO_MultiAgentArena_6768d_00000,TERMINATED,,0.0001,2000,74,418.431,296000,-24.723,-8.4,-47.4,100
PPO_MultiAgentArena_6768d_00001,TERMINATED,,0.0005,2000,57,341.737,228000,-24.663,-6.9,-47.1,100
PPO_MultiAgentArena_6768d_00002,TERMINATED,,0.0001,3000,65,381.123,260000,-23.769,-4.5,-42.0,100


Result for PPO_MultiAgentArena_6768d_00003:
  agent_timesteps_total: 704000
  custom_metrics: {}
  date: 2021-04-29_15-48-06
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -0.29999999999999544
  episode_reward_mean: -25.682999999999993
  episode_reward_min: -59.10000000000005
  episodes_this_iter: 40
  episodes_total: 3520
  experiment_id: 57a3943dc3314ef292b59a6778e27a63
  hostname: Svens-MacBook-Pro.local
  info:
    learner:
      pol1:
        learner_stats:
          cur_kl_coeff: 1.5187499523162842
          cur_lr: 0.0003000000142492354
          entropy: 0.4750180244445801
          entropy_coeff: 0.0
          kl: 0.007025967352092266
          model: {}
          policy_loss: -0.026259679347276688
          total_loss: 41.6136589050293
          vf_explained_var: 0.23172453045845032
          vf_loss: 41.629249572753906
      pol2:
        learner_stats:
          cur_kl_coeff: 1.0125000476837158
          cur_lr: 0.00039999998989515007
    

Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,TERMINATED,,0.0001,2000,74,418.431,296000,-24.723,-8.4,-47.4,100
PPO_MultiAgentArena_6768d_00001,TERMINATED,,0.0005,2000,57,341.737,228000,-24.663,-6.9,-47.1,100
PPO_MultiAgentArena_6768d_00002,TERMINATED,,0.0001,3000,65,381.123,260000,-23.769,-4.5,-42.0,100
PPO_MultiAgentArena_6768d_00003,TERMINATED,,0.0005,3000,89,466.944,356000,-24.78,-0.3,-55.2,100



[2m[36m(pid=27831)[0m Exception ignored in: <function Connection.__del__ at 0x7fe49c83e940>
[2m[36m(pid=27831)[0m Traceback (most recent call last):
[2m[36m(pid=27831)[0m   File "/Users/sven/opt/anaconda3/envs/ray_tutorial/lib/python3.8/site-packages/redis/connection.py", line 543, in __del__
[2m[36m(pid=27831)[0m     try:
[2m[36m(pid=27831)[0m   File "/Users/sven/opt/anaconda3/envs/ray_tutorial/lib/python3.8/site-packages/ray/worker.py", line 379, in sigterm_handler
[2m[36m(pid=27831)[0m     sys.exit(1)
[2m[36m(pid=27831)[0m SystemExit: 1
[2m[36m(pid=27831)[0m Traceback (most recent call last):
[2m[36m(pid=27831)[0m   File "python/ray/_raylet.pyx", line 488, in ray._raylet.execute_task
[2m[36m(pid=27831)[0m   File "python/ray/_raylet.pyx", line 495, in ray._raylet.execute_task
[2m[36m(pid=27831)[0m   File "python/ray/_raylet.pyx", line 505, in ray._raylet.execute_task
[2m[36m(pid=27831)[0m   File "python/ray/_raylet.pyx", line 449, in ray._raylet.ex

<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7ffc08b49280>
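The final status table above can also be compared programmatically. The following is a minimal, stdlib-only sketch that ranks the four trials by their mean episode reward. The table text is copied from the Tune output; the names `FINAL_TABLE`, `ranked`, and `best` are only illustrative and not part of Ray's API.

```python
import csv
import io

# Final status table, copied verbatim from the Tune output above.
FINAL_TABLE = """\
Trial name,status,loc,lr,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_MultiAgentArena_6768d_00000,TERMINATED,,0.0001,2000,74,418.431,296000,-24.723,-8.4,-47.4,100
PPO_MultiAgentArena_6768d_00001,TERMINATED,,0.0005,2000,57,341.737,228000,-24.663,-6.9,-47.1,100
PPO_MultiAgentArena_6768d_00002,TERMINATED,,0.0001,3000,65,381.123,260000,-23.769,-4.5,-42.0,100
PPO_MultiAgentArena_6768d_00003,TERMINATED,,0.0005,3000,89,466.944,356000,-24.78,-0.3,-55.2,100
"""

# Parse the comma-separated table into one dict per trial.
rows = list(csv.DictReader(io.StringIO(FINAL_TABLE)))

# Sort by mean episode reward, highest (least negative) first.
ranked = sorted(rows, key=lambda r: float(r["reward"]), reverse=True)
for row in ranked:
    print(row["Trial name"], "lr=" + row["lr"],
          "train_batch_size=" + row["train_batch_size"],
          "reward=" + row["reward"])

best = ranked[0]
```

In this run, all four trials end within roughly one reward point of each other, so the ranking is best read as a rough indication, not a clear winner among the `lr` and `train_batch_size` settings.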

In [None]:
# 8) Infinite laptop:

# NOTE: The following cell will only work if you are already on-boarded to our Anyscale Inc. "Infinite Laptop".
# To get more information, see https://www.anyscale.com/product

# Let's briefly step away from our MultiAgentArena and move to something much heavier in terms of environment/simulator complexity.
# We will now demonstrate how you can use Anyscale's Infinite Laptop to launch an RLlib experiment on a cloud machine
# with 4 GPUs and 32 CPUs, all from within this Jupyter cell: an experiment running in the cloud that uses
# Anyscale's product, RLlib, and a more complex multi-agent env.

# NOTE 
import anyscale



In [None]:
# 9) Custom Neural Network Models.

import numpy as np
import tensorflow as tf


class MyModel(tf.keras.Model):
    def __init__(self,
                 input_space,
                 action_space,
                 num_outputs,
                 name="",
                 *,
                 layers=(256, 256)):
        super().__init__(name=name)

        self.dense_layers = []
        for i, layer_size in enumerate(layers):
            self.dense_layers.append(tf.keras.layers.Dense(
                layer_size, activation=tf.nn.relu, name=f"dense_{i}"))

        self.logits = tf.keras.layers.Dense(
            num_outputs,
            activation=tf.keras.activations.linear,
            name="logits")
        self.values = tf.keras.layers.Dense(
            1, activation=None, name="values")

    def call(self, inputs, training=None, mask=None):
        # Standardized input args:
        # - `inputs`: the input dict (an RLlib `SampleBatch` object, which is basically
        #   a dict with numpy arrays in it). Observations are stored under "obs".
        out = inputs["obs"]
        for layer in self.dense_layers:
            out = layer(out)
        logits = self.logits(out)
        values = self.values(out)

        # Standardized output:
        # - "normal" model output tensor (e.g. action logits).
        # - list of internal state outputs (only needed for RNN/memory-enhanced models).
        # - "extra outs", such as the model's side branches, e.g. value function outputs
        #   (flattened here from shape [batch, 1] to [batch]).
        return logits, [], {"vf_preds": tf.reshape(values, [-1])}

# Do a quick test: Feed a batch containing a single 2D observation through the model.
from gym.spaces import Box
test_model = MyModel(
    input_space=Box(-1.0, 1.0, (2, )),
    action_space=None,
    num_outputs=2,
)
test_model({"obs": np.array([[0.5, 0.5]], dtype=np.float32)})
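
To train with such a model, RLlib has to be told about it through the `model` sub-dict of the trainer config, using the `custom_model` and `custom_model_config` keys. The following is a minimal sketch of building that config without modifying the original dict. The helper `with_custom_model` and the placeholder class are hypothetical and exist only for this example. Whether a plain `tf.keras.Model` subclass can be passed directly depends on the RLlib version (support for native Keras models was experimental), so treat that part as an assumption to check against your installation.

```python
import copy


def with_custom_model(config, model_cls, **model_kwargs):
    """Return a copy of `config` whose "model" entry points at `model_cls`.

    `model_kwargs` end up under "custom_model_config", the place where
    RLlib looks for extra constructor arguments of a custom model.
    """
    new_config = copy.deepcopy(config)
    model_config = new_config.setdefault("model", {})
    model_config["custom_model"] = model_cls
    model_config["custom_model_config"] = dict(model_kwargs)
    return new_config


class PlaceholderModel:
    """Stand-in for `MyModel`, so that this snippet runs without TensorFlow."""


base_config = {"framework": "tf", "lr": 0.0001}
model_config = with_custom_model(base_config, PlaceholderModel, layers=(128, 128))

print(model_config["model"])
print("model" in base_config)  # The original config is left untouched.
```

In the notebook, the call would look like `with_custom_model(config, MyModel, layers=(128, 128))`, where `layers` matches the keyword-only argument of `MyModel.__init__`.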

In [None]:
# "Hacking in": How do we customize our RL loop?
# RLlib offers a callbacks API that allows you to add custom behavior at
# all major events during the environment sampling and learning process.

# Our problem: So far, we can only see the total reward (the sum over both agents).
# This does not give us enough insight into which agent
# learns what (maybe agent2 doesn't learn anything and the reward we are observing
# is mostly due to agent1's progress in covering the map!).
# The following custom callbacks class allows us to add each agent's individual reward to
# the returned metrics, which will then be displayed in TensorBoard.

# We will subclass RLlib's DefaultCallbacks class and implement the
# `on_episode_start`, `on_episode_step`, and `on_episode_end` methods therein.

from ray.rllib.agents.callbacks import DefaultCallbacks


class MyCallbacks(DefaultCallbacks):
    def on_episode_start(self, *, worker, base_env,
                         policies, episode,
                         env_index, **kwargs):
        episode.user_data["agent1_rewards"] = []
        episode.user_data["agent2_rewards"] = []

    def on_episode_step(self, *, worker, base_env,
                        episode, env_index, **kwargs):
        # Make sure this episode is ongoing.
        assert episode.length > 0, \
            "ERROR: `on_episode_step()` callback should not be called right " \
            "after env reset!"
        ag1_r = episode.prev_reward_for("agent1")
        ag2_r = episode.prev_reward_for("agent2")
        #print("ag1_r={} ag2_r={}".format(ag1_r, ag2_r))
        episode.user_data["agent1_rewards"].append(ag1_r)
        episode.user_data["agent2_rewards"].append(ag2_r)

    def on_episode_end(self, *, worker, base_env,
                       policies, episode,
                       env_index, **kwargs):
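        # Report each agent's episode return (sum of its step rewards) as a scalar metric.
        episode.custom_metrics["ag1_R"] = sum(episode.user_data["agent1_rewards"])
        episode.custom_metrics["ag2_R"] = sum(episode.user_data["agent2_rewards"])
        # Also keep the per-step rewards, so they can be displayed as histograms.
        episode.hist_data["agent1_rewards"] = episode.user_data["agent1_rewards"]
        episode.hist_data["agent2_rewards"] = episode.user_data["agent2_rewards"]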
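
To see how the three hooks cooperate, it can help to dry-run the same bookkeeping outside of RLlib. The sketch below replays a short, made-up reward sequence through plain functions that mirror the callback bodies above. `StubEpisode` is a hypothetical stand-in that offers only the attributes used here (`user_data`, `custom_metrics`, `hist_data`, `length`, `prev_reward_for`); it is not RLlib's episode class, and the reward numbers are invented for illustration.

```python
class StubEpisode:
    """Hypothetical stand-in for the episode object passed to the callbacks."""

    def __init__(self):
        self.user_data = {}       # Scratch space that lives for one episode.
        self.custom_metrics = {}  # Scalars reported at episode end.
        self.hist_data = {}       # Lists reported as histograms.
        self.length = 0
        self._last_rewards = {}

    def step(self, rewards):
        """Record the rewards of one env step (done by RLlib in a real run)."""
        self._last_rewards = dict(rewards)
        self.length += 1

    def prev_reward_for(self, agent_id):
        return self._last_rewards[agent_id]


AGENTS = ("agent1", "agent2")


def on_episode_start(episode):
    for agent in AGENTS:
        episode.user_data[f"{agent}_rewards"] = []


def on_episode_step(episode):
    for agent in AGENTS:
        episode.user_data[f"{agent}_rewards"].append(
            episode.prev_reward_for(agent))


def on_episode_end(episode):
    for short, agent in (("ag1", "agent1"), ("ag2", "agent2")):
        rewards = episode.user_data[f"{agent}_rewards"]
        episode.custom_metrics[f"{short}_R"] = sum(rewards)
        episode.hist_data[f"{agent}_rewards"] = rewards


# Made-up per-step rewards for a three-step episode.
fake_steps = [
    {"agent1": 1.0, "agent2": -0.5},
    {"agent1": 0.0, "agent2": -0.5},
    {"agent1": 1.0, "agent2": 1.0},
]

episode = StubEpisode()
on_episode_start(episode)
for step_rewards in fake_steps:
    episode.step(step_rewards)
    on_episode_step(episode)
on_episode_end(episode)

print(episode.custom_metrics)
```

The key design point is that `user_data` is only a per-episode scratch pad: nothing stored there is reported unless `on_episode_end` copies it into `custom_metrics` or `hist_data`.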



In [None]:
# Set up our config to point to our new custom callbacks class:
config.update({
    "env": MultiAgentArena,  # force "reload"
    # By default, this points to `rllib.agents.callbacks.DefaultCallbacks`, which does nothing.
    "callbacks": MyCallbacks,
    # Revert these to single values (no hyperparameter search), so that only a single trial is run.
    "lr": 0.0001,
    "train_batch_size": 4000,
})

tune.run("PPO", config=config, stop={"training_iteration": 10})
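
Once the run starts, the values written to `episode.custom_metrics` do not show up under their raw names. RLlib aggregates them over the episodes collected in each training iteration and reports them with `_mean`, `_min`, and `_max` suffixes under `custom_metrics` (for example `custom_metrics/ag1_R_mean`). The snippet below is a simplified re-implementation of that aggregation for illustration only, not RLlib's actual code; the function name `summarize_custom_metrics` and the per-episode numbers are made up.

```python
from statistics import mean


def summarize_custom_metrics(episodes):
    """Aggregate per-episode custom metrics into <key>_mean/_min/_max."""
    collected = {}
    for episode_metrics in episodes:
        for key, value in episode_metrics.items():
            collected.setdefault(key, []).append(value)

    summary = {}
    for key, values in collected.items():
        summary[f"{key}_mean"] = mean(values)
        summary[f"{key}_min"] = min(values)
        summary[f"{key}_max"] = max(values)
    return summary


# Made-up custom metrics of three finished episodes.
episodes = [
    {"ag1_R": 4.0, "ag2_R": -10.0},
    {"ag1_R": 6.0, "ag2_R": -12.0},
    {"ag1_R": 8.0, "ag2_R": -14.0},
]

summary = summarize_custom_metrics(episodes)
print(summary["ag1_R_mean"], summary["ag2_R_min"])
```

When looking for the new stats in TensorBoard, search for the suffixed names rather than the bare `ag1_R` and `ag2_R` keys.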

In [None]:
# Exercise #3:
# ============
# The episode mean rewards reported to us thus far were always the sum
# of both agents' rewards, which doesn't make much sense given that
# the agents are adversarial.
# Instead, we would like to know what the individual agents' rewards are in
# our environment.
# Write your own custom callbacks class (sub-class
# ray.rllib.agents.callbacks.DefaultCallbacks) and override one or more methods
# therein to manipulate and collect the following data:

#TODO

# a) Extract each agent's individual rewards from ...
# b) Store each agent's reward under the new "reward_agent1" and
#    "reward_agent2" keys in the custom metrics.
# c) Run a simple experiment and confirm that you are seeing these two new stats
#    in the TensorBoard output.
# Good luck! :)

