# TensorFlow Implementation of MADDPG in Multiagent-Particle Environments

---

### 1. Install Gym


In [3]:
# Install gym
!apt-get update
!apt-get install xvfb ffmpeg python-opengl > /dev/null
!pip3 install gym
!pip3 install pyvirtualdisplay

Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1710/x86_64  InRelease
Get:2 http://security.ubuntu.com/ubuntu bionic-security InRelease [83.2 kB]
Hit:3 http://archive.ubuntu.com/ubuntu bionic InRelease
Ign:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64  InRelease
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1710/x86_64  Release
Hit:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64  Release
Hit:7 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Get:8 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:11 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:12 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [264 kB]
Get:13 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages [734 kB]
Get:14 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Package

### 2. Import the Necessary Packages

In [0]:
import gym
import numpy as np
import random
import tensorflow as tf
import time
import pickle
import matplotlib.pyplot as plt
%matplotlib inline

# Activate virtual display
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

In [0]:
import matplotlib.pyplot as plt
import matplotlib.animation
from IPython import display as ipythondisplay
from IPython.display import HTML

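def display_frames_as_gif(frames):
    """
    Displays a list of frames as a gif, with controls
    """
    patch = plt.imshow(frames[0])
    plt.axis('off')
    # FuncAnimation calls this once per frame index
    animate = lambda i: patch.set_data(frames[i])
    ani = matplotlib.animation.FuncAnimation(plt.gcf(), animate, frames=len(frames), interval=50)
    # An HTML object built inside a function is not rendered automatically, so display it explicitly
    ipythondisplay.display(HTML(ani.to_jshtml()))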

### 3. Install the Multiagent-Particle Environment (Optional)

In [6]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Change working directory
import os
os.chdir('/content/drive/My Drive/Colab Notebooks/move37-final/')
!ls -l

Mounted at /content/drive
total 118
drwx------ 2 root root  4096 Nov 24 10:05 bin
-rw------- 1 root root 11853 Nov 24 15:31 distributions.py
-rw------- 1 root root  1063 Nov 24 10:05 LICENSE.txt
-rw------- 1 root root 61602 Nov 25 08:34 maddpg_multiagent-particle.ipynb
-rw------- 1 root root  1916 Nov 24 10:05 make_env.py
drwx------ 2 root root  4096 Nov 25 06:22 models
drwx------ 2 root root  4096 Nov 24 10:05 multiagent
drwx------ 2 root root  4096 Nov 24 10:05 multiagent.egg-info
dr

In [0]:
# Git clone into the current directory
!git init .
!git remote add -t \* -f origin https://github.com/openai/multiagent-particle-envs.git
!git checkout master

Initialized empty Git repository in /content/drive/My Drive/Colab Notebooks/move37-final/.git/
Updating origin
remote: Enumerating objects: 227, done.
remote: Total 227 (delta 0), reused 0 (delta 0), pack-reused 227
Receiving objects: 100% (227/227), 99.67 KiB | 289.00 KiB/s, done.
Resolving deltas: 100% (123/123), done.
From https://github.com/openai/multiagent-particle-envs
 * [new branch]      master     -> origin/master
Branch 'master' set up to track remote branch 'master' from 'origin'.
Already on 'master'


In [0]:
!pip install -e .

### 4. Instantiate the Environment and Agent

In [7]:
# Allow relative imports to directories above the current directory
import os, sys
sys.path.append('.')

# Import modules
from importlib import reload
from replay_buffer import ReplayBuffer
from distributions import make_pdtype
import utils as U; reload(U)

<module 'utils' from '/content/drive/My Drive/Colab Notebooks/move37-final/utils.py'>
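
The `replay_buffer` module imported above is a local file that this notebook does not show. The trainer defined later relies on only four of its operations: `add`, `len()`, `make_index`, and `sample_index`. The sketch below is a hypothetical stand-in (`MinimalReplayBuffer` is not the notebook's class) that assumes a fixed-size ring buffer. It illustrates why one index list is drawn once and reused on every agent's buffer: the sampled transitions then come from the same time steps.

```python
import random


class MinimalReplayBuffer:
    """Hypothetical ring buffer exposing the four operations the trainer uses."""

    def __init__(self, size):
        self._storage = []
        self._maxsize = int(size)  # accept 1e6-style floats
        self._next_idx = 0

    def __len__(self):
        return len(self._storage)

    def add(self, obs, act, rew, new_obs, done):
        data = (obs, act, rew, new_obs, done)
        if self._next_idx >= len(self._storage):
            self._storage.append(data)
        else:
            # Buffer is full: overwrite the oldest transition
            self._storage[self._next_idx] = data
        self._next_idx = (self._next_idx + 1) % self._maxsize

    def make_index(self, batch_size):
        # Draw positions uniformly, with replacement
        return [random.randint(0, len(self._storage) - 1) for _ in range(batch_size)]

    def sample_index(self, idxes):
        batch = [self._storage[i] for i in idxes]
        obs, act, rew, new_obs, done = (list(col) for col in zip(*batch))
        return obs, act, rew, new_obs, done


# Two agents record the same 10 time steps in their own buffers
random.seed(0)
buffers = [MinimalReplayBuffer(100) for _ in range(2)]
for t in range(10):
    for agent_id, buf in enumerate(buffers):
        buf.add(obs=(agent_id, t), act=0, rew=0.0, new_obs=(agent_id, t + 1), done=0.0)

# One shared index list keeps the agents' samples aligned in time
index = buffers[0].make_index(4)
steps_per_agent = [[o[1] for o in buf.sample_index(index)[0]] for buf in buffers]
print(steps_per_agent[0] == steps_per_agent[1])
```

If each agent drew its own indices instead, the centralized critic would be fed observations and actions from unrelated time steps.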

In [0]:
# Training parameters
MAX_EPISODE_LEN = int(25)  # maximum episode length
NUM_EPISODES = int(25000)  # number of episodes
NUM_ADVERSARIES = int(0)  # number of adversaries
LR = 1e-2  # learning rate for Adam optimizer
GAMMA = 0.95  # discount factor
BATCH_SIZE = int(1024)  # number of transitions sampled from the replay buffer per update
NUM_UNITS = int(64)  # number of units in the mlp
GOOD_POLICY = 'maddpg'  # policy of good agents
ADV_POLICY = 'maddpg'  # policy of adversaries
# Checkpointing
SAVE_DIR = './models/'  # directory in which training state and model should be saved
SAVE_RATE = int(1000)  # save model once every time this many episodes are completed
LOAD_DIR = ''  # directory in which training state and model are loaded
# Evaluation
RESTORE = False
SAVE_GIFS = False
BENCHMARK = False
BENCHMARK_ITERS = int(100000)  # number of iterations run for benchmarking
BENCHMARK_DIR = './benchmark_files/'  # directory where benchmark data is saved
GIFS_DIR = './gifs/'  # directory where image data is saved

def discount_with_dones(rewards, dones, gamma):
    discounted = []
    r = 0
    for reward, done in zip(rewards[::-1], dones[::-1]):
        # Keep the terminal step's own reward, but do not carry return across an episode boundary
        r = reward + gamma*r*(1.-done)
        discounted.append(r)
    return discounted[::-1]

def make_update_exp(vals, target_vals):
    polyak = 1.0 - 1e-2
    expression = []
    for var, var_target in zip(sorted(vals, key=lambda v: v.name), sorted(target_vals, key=lambda v: v.name)):
        expression.append(var_target.assign(polyak * var_target + (1.0-polyak) * var))
    expression = tf.group(*expression)
    return U.function([], [], updates=[expression])

def p_train(make_obs_ph_n, act_space_n, p_index, p_func, q_func, optimizer, grad_norm_clipping=None, local_q_func=False, num_units=64, scope="trainer", reuse=None):
    with tf.variable_scope(scope, reuse=reuse):
        # create distributions
        act_pdtype_n = [make_pdtype(act_space) for act_space in act_space_n]

        # set up placeholders
        obs_ph_n = make_obs_ph_n
        act_ph_n = [act_pdtype_n[i].sample_placeholder([None], name="action"+str(i)) for i in range(len(act_space_n))]

        p_input = obs_ph_n[p_index]

        p = p_func(p_input, int(act_pdtype_n[p_index].param_shape()[0]), scope="p_func", num_units=num_units)
        p_func_vars = U.scope_vars(U.absolute_scope_name("p_func"))

        # wrap parameters in distribution
        act_pd = act_pdtype_n[p_index].pdfromflat(p)

        act_sample = act_pd.sample()
        p_reg = tf.reduce_mean(tf.square(act_pd.flatparam()))

        act_input_n = act_ph_n + []
        act_input_n[p_index] = act_pd.sample()
        q_input = tf.concat(obs_ph_n + act_input_n, 1)
        if local_q_func:
            q_input = tf.concat([obs_ph_n[p_index], act_input_n[p_index]], 1)
        q = q_func(q_input, 1, scope="q_func", reuse=True, num_units=num_units)[:,0]
        pg_loss = -tf.reduce_mean(q)

        loss = pg_loss + p_reg * 1e-3

        optimize_expr = U.minimize_and_clip(optimizer, loss, p_func_vars, grad_norm_clipping)

        # Create callable functions
        train = U.function(inputs=obs_ph_n + act_ph_n, outputs=loss, updates=[optimize_expr])
        act = U.function(inputs=[obs_ph_n[p_index]], outputs=act_sample)
        p_values = U.function([obs_ph_n[p_index]], p)

        # target network
        target_p = p_func(p_input, int(act_pdtype_n[p_index].param_shape()[0]), scope="target_p_func", num_units=num_units)
        target_p_func_vars = U.scope_vars(U.absolute_scope_name("target_p_func"))
        update_target_p = make_update_exp(p_func_vars, target_p_func_vars)

        target_act_sample = act_pdtype_n[p_index].pdfromflat(target_p).sample()
        target_act = U.function(inputs=[obs_ph_n[p_index]], outputs=target_act_sample)

        return act, train, update_target_p, {'p_values': p_values, 'target_act': target_act}

def q_train(make_obs_ph_n, act_space_n, q_index, q_func, optimizer, grad_norm_clipping=None, local_q_func=False, scope="trainer", reuse=None, num_units=64):
    with tf.variable_scope(scope, reuse=reuse):
        # create distributions
        act_pdtype_n = [make_pdtype(act_space) for act_space in act_space_n]

        # set up placeholders
        obs_ph_n = make_obs_ph_n
        act_ph_n = [act_pdtype_n[i].sample_placeholder([None], name="action"+str(i)) for i in range(len(act_space_n))]
        target_ph = tf.placeholder(tf.float32, [None], name="target")

        q_input = tf.concat(obs_ph_n + act_ph_n, 1)
        if local_q_func:
            q_input = tf.concat([obs_ph_n[q_index], act_ph_n[q_index]], 1)
        q = q_func(q_input, 1, scope="q_func", num_units=num_units)[:,0]
        q_func_vars = U.scope_vars(U.absolute_scope_name("q_func"))

        q_loss = tf.reduce_mean(tf.square(q - target_ph))

        # viscosity solution to Bellman differential equation in place of an initial condition
        q_reg = tf.reduce_mean(tf.square(q))
        loss = q_loss #+ 1e-3 * q_reg

        optimize_expr = U.minimize_and_clip(optimizer, loss, q_func_vars, grad_norm_clipping)

        # Create callable functions
        train = U.function(inputs=obs_ph_n + act_ph_n + [target_ph], outputs=loss, updates=[optimize_expr])
        q_values = U.function(obs_ph_n + act_ph_n, q)

        # target network
        target_q = q_func(q_input, 1, scope="target_q_func", num_units=num_units)[:,0]
        target_q_func_vars = U.scope_vars(U.absolute_scope_name("target_q_func"))
        update_target_q = make_update_exp(q_func_vars, target_q_func_vars)

        target_q_values = U.function(obs_ph_n + act_ph_n, target_q)

        return train, update_target_q, {'q_values': q_values, 'target_q_values': target_q_values}

class AgentTrainer(object):
    # Abstract interface; subclasses must override every method
    def __init__(self, name, model, obs_shape, act_space):
        raise NotImplementedError()

    def action(self, obs):
        raise NotImplementedError()

    def experience(self, obs, act, rew, new_obs, done, terminal):
        raise NotImplementedError()

    def preupdate(self):
        raise NotImplementedError()

    def update(self, agents, t):
        raise NotImplementedError()

class MADDPGAgentTrainer(AgentTrainer):
    def __init__(self, name, model, obs_shape_n, act_space_n, agent_index, local_q_func=False):
        self.name = name
        self.n = len(obs_shape_n)
        self.agent_index = agent_index
        obs_ph_n = []
        for i in range(self.n):
            obs_ph_n.append(U.BatchInput(obs_shape_n[i], name="observation"+str(i)).get())

        # Create all the functions necessary to train the model
        self.q_train, self.q_update, self.q_debug = q_train(
            scope=self.name,
            make_obs_ph_n=obs_ph_n,
            act_space_n=act_space_n,
            q_index=agent_index,
            q_func=model,
            optimizer=tf.train.AdamOptimizer(learning_rate=LR),
            grad_norm_clipping=0.5,
            local_q_func=local_q_func,
            num_units=NUM_UNITS
        )
        self.act, self.p_train, self.p_update, self.p_debug = p_train(
            scope=self.name,
            make_obs_ph_n=obs_ph_n,
            act_space_n=act_space_n,
            p_index=agent_index,
            p_func=model,
            q_func=model,
            optimizer=tf.train.AdamOptimizer(learning_rate=LR),
            grad_norm_clipping=0.5,
            local_q_func=local_q_func,
            num_units=NUM_UNITS
        )
        # Create experience buffer
        self.replay_buffer = ReplayBuffer(int(1e6))
        self.max_replay_buffer_len = BATCH_SIZE * MAX_EPISODE_LEN
        self.replay_sample_index = None

    def action(self, obs):
        return self.act(obs[None])[0]

    def experience(self, obs, act, rew, new_obs, done, terminal):
        # Store transition in the replay buffer.
        self.replay_buffer.add(obs, act, rew, new_obs, float(done))

    def preupdate(self):
        self.replay_sample_index = None

    def update(self, agents, t):
        if len(self.replay_buffer) < self.max_replay_buffer_len: # replay buffer is not large enough
            return
        if not t % 100 == 0:  # only update every 100 steps
            return

        self.replay_sample_index = self.replay_buffer.make_index(BATCH_SIZE)
        # collect replay sample from all agents
        obs_n = []
        obs_next_n = []
        act_n = []
        index = self.replay_sample_index
        for i in range(self.n):
            obs, act, rew, obs_next, done = agents[i].replay_buffer.sample_index(index)
            obs_n.append(obs)
            obs_next_n.append(obs_next)
            act_n.append(act)
        obs, act, rew, obs_next, done = self.replay_buffer.sample_index(index)

        # train q network
        num_sample = 1
        target_q = 0.0
        for i in range(num_sample):
            target_act_next_n = [agents[i].p_debug['target_act'](obs_next_n[i]) for i in range(self.n)]
            target_q_next = self.q_debug['target_q_values'](*(obs_next_n + target_act_next_n))
            target_q += rew + GAMMA * (1.0 - done) * target_q_next
        target_q /= num_sample
        q_loss = self.q_train(*(obs_n + act_n + [target_q]))

        # train p network
        p_loss = self.p_train(*(obs_n + act_n))

        self.p_update()
        self.q_update()

        return [q_loss, p_loss, np.mean(target_q), np.mean(rew), np.mean(target_q_next), np.std(target_q)]
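
The two numerical steps in the cell above are the critic target built in `update` (`rew + GAMMA * (1.0 - done) * target_q_next`) and the soft target-network update built in `make_update_exp`. The sketch below reproduces both with plain Python lists instead of TensorFlow variables. The function names `td_target` and `soft_update` are hypothetical and exist only for this illustration.

```python
def td_target(rew, done, target_q_next, gamma=0.95):
    """One-step target per transition; the (1 - done) factor drops the bootstrap term."""
    return [r + gamma * (1.0 - d) * q for r, d, q in zip(rew, done, target_q_next)]


def soft_update(target_params, params, polyak=1.0 - 1e-2):
    """Move each target parameter a small step towards its online counterpart."""
    return [polyak * t + (1.0 - polyak) * p for t, p in zip(target_params, params)]


# Second transition is terminal, so its target is just the reward
targets = td_target(rew=[1.0, 2.0], done=[0.0, 1.0], target_q_next=[10.0, 10.0])
print(targets)

# Repeated soft updates pull the target towards the online value
target, online = [0.0], [1.0]
for _ in range(100):
    target = soft_update(target, online)
print(target)
```

With `polyak = 0.99`, a target parameter covers only 1% of the remaining gap per call. Since `update` runs once every 100 environment steps, the target networks trail the online networks by a wide margin.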

In [0]:
import tensorflow.contrib.layers as layers

def mlp_model(input, num_outputs, scope, reuse=False, num_units=64, rnn_cell=None):
    # This model takes as input an observation and returns values of all actions
    with tf.variable_scope(scope, reuse=reuse):
        out = input
        out = layers.fully_connected(out, num_outputs=num_units, activation_fn=tf.nn.relu)
        out = layers.fully_connected(out, num_outputs=num_units, activation_fn=tf.nn.relu)
        out = layers.fully_connected(out, num_outputs=num_outputs, activation_fn=None)
        return out

def get_trainers(env, num_adversaries, obs_shape_n):
    trainers = []
    model = mlp_model
    trainer = MADDPGAgentTrainer
    for i in range(num_adversaries):
        trainers.append(trainer("agent_%d" % i, model, obs_shape_n, env.action_space, i, local_q_func=(ADV_POLICY=='ddpg')))
    for i in range(num_adversaries, env.n):
        trainers.append(trainer("agent_%d" % i, model, obs_shape_n, env.action_space, i, local_q_func=(GOOD_POLICY=='ddpg')))
    return trainers

def train(scenario='simple', name='training'):
    from make_env import make_env
 
    tf.reset_default_graph()

    with U.single_threaded_session():
        # Create environment
        env = make_env(scenario, BENCHMARK)
        # Create agent trainers
        obs_shape_n = [env.observation_space[i].shape for i in range(env.n)]
        num_adversaries = min(env.n, NUM_ADVERSARIES)
        trainers = get_trainers(env, num_adversaries, obs_shape_n)
        print('Using good policy {} and adv policy {}'.format(GOOD_POLICY, ADV_POLICY))

        # Initialize
        U.initialize()

        # Load previous results, if necessary
        load_dir = LOAD_DIR
        if load_dir == "":
            load_dir = SAVE_DIR
#         if DISPLAY or RESTORE or BENCHMARK:
        if RESTORE:
            print('Loading previous state...')
            U.load_state(load_dir + name)

        episode_rewards = [0.0]  # sum of rewards for all agents
        agent_rewards = [[0.0] for _ in range(env.n)]  # individual agent reward
        final_ep_rewards = []  # sum of rewards for training curve
        final_ep_ag_rewards = []  # agent rewards for training curve
        agent_info = [[[]]]  # placeholder for benchmarking info
        saver = tf.train.Saver()
        obs_n = env.reset()
        episode_step = 0
        train_step = 0
        t_start = time.time()
        
        if SAVE_GIFS:
            frames = []
            frames.append(env.render(mode = 'rgb_array')[0])
             
        print('Starting iterations...')
        while True:
            # get action
            action_n = [agent.action(obs) for agent, obs in zip(trainers,obs_n)]
            # environment step
            new_obs_n, rew_n, done_n, info_n = env.step(action_n)
            episode_step += 1
            done = all(done_n)
            terminal = (episode_step >= MAX_EPISODE_LEN)
            # collect experience
            for i, agent in enumerate(trainers):
                agent.experience(obs_n[i], action_n[i], rew_n[i], new_obs_n[i], done_n[i], terminal)
            obs_n = new_obs_n

            for i, rew in enumerate(rew_n):
                episode_rewards[-1] += rew
                agent_rewards[i][-1] += rew

            if done or terminal:
                obs_n = env.reset()
                episode_step = 0
                episode_rewards.append(0)
                for a in agent_rewards:
                    a.append(0)
                agent_info.append([[]])

            # increment global step counter
            train_step += 1

            # for benchmarking learned policies
            if BENCHMARK:
                for i, info in enumerate(info_n):
                    agent_info[-1][i].append(info_n['n'])
                if train_step > BENCHMARK_ITERS and (done or terminal):
                    file_name = BENCHMARK_DIR + name + '.pkl'
                    print('Finished benchmarking, now saving...')
                    with open(file_name, 'wb') as fp:
                        pickle.dump(agent_info[:-1], fp)
                    break
                continue
            
            # save frames for evaluation
            if SAVE_GIFS:
                time.sleep(0.1)
                frames.append(env.render(mode = 'rgb_array')[0])
                continue
                
            # update all trainers, if not in benchmark mode
            loss = None
            for agent in trainers:
                agent.preupdate()
            for agent in trainers:
                loss = agent.update(trainers, train_step)

            # save model, display training output
            if terminal and (len(episode_rewards) % SAVE_RATE == 0):
                U.save_state(SAVE_DIR + name, saver=saver)
                # print statement depends on whether or not there are adversaries
                if num_adversaries == 0:
                    print("steps: {}, episodes: {}, mean episode reward: {}, time: {}".format(
                        train_step, len(episode_rewards), np.mean(episode_rewards[-SAVE_RATE:]), round(time.time()-t_start, 3)))
                else:
                    print("steps: {}, episodes: {}, mean episode reward: {}, agent episode reward: {}, time: {}".format(
                        train_step, len(episode_rewards), np.mean(episode_rewards[-SAVE_RATE:]),
                        [np.mean(rew[-SAVE_RATE:]) for rew in agent_rewards], round(time.time()-t_start, 3)))
                t_start = time.time()
                # Keep track of final episode reward
                final_ep_rewards.append(np.mean(episode_rewards[-SAVE_RATE:]))
                for rew in agent_rewards:
                    final_ep_ag_rewards.append(np.mean(rew[-SAVE_RATE:]))
            
            # finish the training process
            if len(episode_rewards) > NUM_EPISODES:
                print('...Finished total of {} episodes.'.format(len(episode_rewards)))
                break

        env.close()
#         display_frames_as_gif(frames)
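
The logging condition in `train` explains why the training logs in the sections below report 24975 steps at "1000 episodes" rather than 25000. `episode_rewards` starts with one entry and gains another each time an episode ends, so its length counts the episode that has just begun. The sketch below replays only the counters from the loop. It assumes that no episode ends early (`done` is never true), and `log_points` is a hypothetical helper, not part of the notebook.

```python
MAX_EPISODE_LEN = 25
SAVE_RATE = 1000


def log_points(num_logs):
    """Replay the step and episode counters of the training loop."""
    episode_rewards = [0.0]
    episode_step = 0
    train_step = 0
    logs = []
    while len(logs) < num_logs:
        episode_step += 1
        terminal = episode_step >= MAX_EPISODE_LEN
        if terminal:
            episode_step = 0
            episode_rewards.append(0)  # slot for the next episode
        train_step += 1
        if terminal and len(episode_rewards) % SAVE_RATE == 0:
            logs.append((train_step, len(episode_rewards)))
    return logs


print(log_points(3))  # [(24975, 1000), (49975, 2000), (74975, 3000)]
```

At the first log line only 999 episodes have finished, which gives 999 × 25 = 24975 steps. The same bookkeeping means the window `episode_rewards[-SAVE_RATE:]` includes the fresh zero entry for the episode that is about to start.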

### 5. Cooperative Communication
One agent is the ‘speaker’ (gray), which does not move but observes the other agent's goal. The other agent is the ‘listener’, which cannot speak but must navigate to the correct landmark.

In [11]:
train('simple_speaker_listener', 'Cooperative Communication')

Using good policy maddpg and adv policy maddpg
Starting iterations...
steps: 24975, episodes: 1000, mean episode reward: -142.88330606485846, time: 55.184
steps: 49975, episodes: 2000, mean episode reward: -119.55574798871612, time: 66.547
steps: 74975, episodes: 3000, mean episode reward: -58.635686984176616, time: 67.272
steps: 99975, episodes: 4000, mean episode reward: -57.471882724832874, time: 66.496
steps: 124975, episodes: 5000, mean episode reward: -60.433609603389606, time: 66.012
steps: 149975, episodes: 6000, mean episode reward: -59.22826531946425, time: 67.83
steps: 174975, episodes: 7000, mean episode reward: -58.554753915049986, time: 66.907
steps: 199975, episodes: 8000, mean episode reward: -59.84911301995979, time: 66.648
steps: 224975, episodes: 9000, mean episode reward: -56.50665237094993, time: 66.48
steps: 249975, episodes: 10000, mean episode reward: -58.40336454298203, time: 67.581
steps: 274975, episodes: 11000, mean episode reward: -60.89297352407632, time: 

### 6. Predator-Prey
Good agents (green) are faster and want to avoid being hit by adversaries (red). Adversaries are slower and want to hit good agents. Obstacles (large black circles) block the way.



In [12]:
train('simple_tag', 'Predator-Prey')

Using good policy maddpg and adv policy maddpg
Starting iterations...
steps: 24975, episodes: 1000, mean episode reward: -0.7477632799887421, time: 115.819
steps: 49975, episodes: 2000, mean episode reward: 0.8306453678608287, time: 152.266
steps: 74975, episodes: 3000, mean episode reward: 8.071925852293045, time: 154.644
steps: 99975, episodes: 4000, mean episode reward: 11.215985832614242, time: 154.179
steps: 124975, episodes: 5000, mean episode reward: 15.816438693348195, time: 155.074
steps: 149975, episodes: 6000, mean episode reward: 31.043099904831497, time: 154.976
steps: 174975, episodes: 7000, mean episode reward: 45.941660431644095, time: 155.23
steps: 199975, episodes: 8000, mean episode reward: 15.255968870078606, time: 153.372
steps: 224975, episodes: 9000, mean episode reward: 9.737429120350772, time: 156.003
steps: 249975, episodes: 10000, mean episode reward: 11.372254000965379, time: 154.017
steps: 274975, episodes: 11000, mean episode reward: 11.863740383728349, ti

### 7. Cooperative Navigation
Agents are rewarded based on how far any agent is from each landmark. Agents are penalized if they collide with other agents. So, agents have to learn to cover all the landmarks while avoiding collisions.
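
The reward structure described above can be sketched in a few lines. This is an illustrative simplification, not the environment's own code. It assumes 2D positions, a shared team reward, a hypothetical agent radius of 0.15, and a penalty of 1 per colliding pair. `spread_reward` is a name made up for this example.

```python
import math


def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


def spread_reward(agent_positions, landmark_positions, agent_radius=0.15):
    reward = 0.0
    # Each landmark costs the distance to whichever agent is closest to it
    for landmark in landmark_positions:
        reward -= min(distance(agent, landmark) for agent in agent_positions)
    # Assumed penalty: 1 per pair of agents whose circles overlap
    for i, a in enumerate(agent_positions):
        for b in agent_positions[i + 1:]:
            if distance(a, b) < 2 * agent_radius:
                reward -= 1.0
    return reward


landmarks = [(0.0, 0.0), (1.0, 0.0)]
covered = spread_reward([(0.0, 0.0), (1.0, 0.0)], landmarks)
clumped = spread_reward([(0.0, 0.0), (0.1, 0.0)], landmarks)
print(covered, clumped)
```

Under these assumptions, one agent sitting on each landmark scores the maximum of zero, while two agents crowding the same landmark pay both the distance to the uncovered landmark and the collision penalty.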

In [13]:
train('simple_spread', 'Cooperative Navigation')

Using good policy maddpg and adv policy maddpg
Starting iterations...
steps: 24975, episodes: 1000, mean episode reward: -665.2543330896325, time: 90.716
steps: 49975, episodes: 2000, mean episode reward: -826.7683860285059, time: 113.599
steps: 74975, episodes: 3000, mean episode reward: -604.6051438575689, time: 114.696
steps: 99975, episodes: 4000, mean episode reward: -575.0436901392027, time: 114.855
steps: 124975, episodes: 5000, mean episode reward: -553.4622219828843, time: 113.471
steps: 149975, episodes: 6000, mean episode reward: -539.4998262682473, time: 115.385
steps: 174975, episodes: 7000, mean episode reward: -533.3811456742297, time: 112.966
steps: 199975, episodes: 8000, mean episode reward: -524.2825599889181, time: 112.772
steps: 224975, episodes: 9000, mean episode reward: -521.1189904402252, time: 113.504
steps: 249975, episodes: 10000, mean episode reward: -515.2612397304559, time: 113.224
steps: 274975, episodes: 11000, mean episode reward: -508.6429990212346, t

### 8. Physical Deception
All agents observe the positions of landmarks and other agents. One landmark is the ‘target landmark’ (colored green). Good agents are rewarded based on how close one of them is to the target landmark, but negatively rewarded if the adversary is close to the target landmark. The adversary is rewarded based on how close it is to the target, but it doesn’t know which landmark is the target landmark. So good agents have to learn to ‘split up’ and cover all landmarks to deceive the adversary.
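
A minimal sketch of the incentives described above, assuming 2D positions, a single adversary, and plain Euclidean distances with no extra shaping terms. The environment's actual reward function may differ, and `deception_rewards` is a hypothetical name used only here.

```python
import math


def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


def deception_rewards(good_positions, adversary_position, target):
    adversary_distance = distance(adversary_position, target)
    closest_good_distance = min(distance(g, target) for g in good_positions)
    # Good agents gain when one of them is near the target and the adversary is far from it
    good_reward = adversary_distance - closest_good_distance
    # The adversary gains only by getting close to the target
    adversary_reward = -adversary_distance
    return good_reward, adversary_reward


target = (0.0, 0.0)
decoy = (3.0, 4.0)
# One good agent covers each landmark; the adversary has gone to the decoy
good_reward, adversary_reward = deception_rewards([target, decoy], decoy, target)
print(good_reward, adversary_reward)
```

Because the adversary cannot tell which landmark is the target, a good agent parked on each landmark gives it nothing to copy. That is the ‘split up’ behavior the description refers to.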

In [14]:
train('simple_adversary', 'Physical Deception')

Using good policy maddpg and adv policy maddpg
Starting iterations...
steps: 24975, episodes: 1000, mean episode reward: -23.62287264995385, time: 80.808
steps: 49975, episodes: 2000, mean episode reward: -24.570445068769427, time: 104.695
steps: 74975, episodes: 3000, mean episode reward: 4.230458852692893, time: 103.531
steps: 99975, episodes: 4000, mean episode reward: 3.057586729468592, time: 104.73
steps: 124975, episodes: 5000, mean episode reward: 2.491181899654724, time: 105.074
steps: 149975, episodes: 6000, mean episode reward: 2.2235228714947364, time: 105.138
steps: 174975, episodes: 7000, mean episode reward: 1.2179255089826488, time: 104.638
steps: 199975, episodes: 8000, mean episode reward: -0.22520570266500975, time: 105.392
steps: 224975, episodes: 9000, mean episode reward: -0.6101349851035407, time: 103.833
steps: 249975, episodes: 10000, mean episode reward: -0.12715706710949184, time: 105.014
steps: 274975, episodes: 11000, mean episode reward: 0.10080468771879497