# mobile-env: An Open Environment for Autonomous Coordination in Mobile Networks

This Google Colaboratory notebook introduces *mobile-env*,
an open environment for autonomous coordination in mobile networks.
It demonstrates how to train and evaluate different coordination approaches
and how to use mobile-env with different reinforcement learning frameworks.

First, we train a multi-agent policy with **RLlib**.

Second, we train a central policy with **stable-baselines3**.

In [15]:
# first install and import relevant dependencies
!pip install mobile-env
!pip install tensorboard

import gym
import mobile_env

# Load the TensorBoard notebook extension
%load_ext tensorboard


The tensorboard extension is already loaded. To reload it, use:



## Multi-Agent Control with RLlib

In [16]:
# install RLlib and tensorflow (the extra is quoted so the shell does not expand the brackets)
!pip install "ray[rllib]"
!pip install tensorflow



In [17]:
import ray
from ray.rllib.agents import ppo

# stop after 50,000 environment timesteps (alternatively, stop after 2000 episodes)
stop = {
    # "episodes_total": 2000,
    "timesteps_total": 50000,
}

config = {
    # environment configuration:
    "env": "mobile-small-ma-v0",

    # agent configuration:
    "multiagent": {
        "policies": {"shared_policy"},
        "policy_mapping_fn": (
            lambda agent_id, **kwargs: "shared_policy"),
    },
}
save_dir = "."

# set available CPUs (and GPUs) and init ray
ray.init(
    num_cpus=3,
    include_dashboard=False,
    ignore_reinit_error=True,
    log_to_driver=False,
)

2021-11-24 12:32:13,571	INFO worker.py:832 -- Calling ray.init() again after it has already been called.
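
The `multiagent` entry of the configuration lets all agents share a single policy: the mapping function ignores the agent ID and always returns `"shared_policy"`. The plain-Python sketch below illustrates this lookup in isolation. The agent IDs 0 to 4 are hypothetical placeholders chosen for illustration, not values read from mobile-env.

```python
# Same mapping function as in the config: every agent uses one policy.
policy_mapping_fn = lambda agent_id, **kwargs: "shared_policy"

# Hypothetical agent IDs, e.g., one per user equipment (UE).
agent_ids = [0, 1, 2, 3, 4]

# Resolve the policy name for each agent ID.
assignment = {agent_id: policy_mapping_fn(agent_id) for agent_id in agent_ids}

# All agents resolve to the same policy name, so they share its weights.
distinct_policies = set(assignment.values())
print(assignment)
print(distinct_policies)
```

To train separate policies instead, the mapping function would have to return a different policy name per agent ID, and each name would have to be listed under `"policies"`.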


To use RLlib's multi-agent policies,
we must first register mobile-env as a custom OpenAI Gym environment.
The RLlibMAWrapper class wraps the default multi-agent simulation so that it conforms to RLlib's MultiAgentEnv interface. The wrapped environment defines an action and observation space for each user equipment (UE), attributes rewards per UE (i.e., per agent), and returns partial observations, so agents have no global knowledge.

In [18]:
from ray.tune.registry import register_env


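def register(config):
    # import inside the creator so that mobile-env's Gym environments
    # are also registered in RLlib's worker processes
    import mobile_env
    from mobile_env.wrappers.multi_agent import RLlibMAWrapper
    # create the multi-agent scenario and wrap it for RLlib
    env = gym.make("mobile-small-ma-v0")
    return RLlibMAWrapper(env)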

register_env("mobile-small-ma-v0", register)
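
The registration pattern used above can be illustrated with a small stand-alone sketch: a name is mapped to a creator function, and the environment is only built later, when the creator is called with an environment config. All names below (`register_env_sketch`, `make_env_sketch`, `DummyEnv`, `"dummy-ma-v0"`) are hypothetical and only mimic the idea; RLlib's actual registry is implemented differently.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in for a registry: maps an environment name to its creator.
_ENV_CREATORS: Dict[str, Callable[[dict], Any]] = {}


def register_env_sketch(name: str, creator: Callable[[dict], Any]) -> None:
    """Store the creator; nothing is instantiated at registration time."""
    _ENV_CREATORS[name] = creator


def make_env_sketch(name: str, env_config: dict) -> Any:
    """Look up the creator by name and call it with the env config."""
    return _ENV_CREATORS[name](env_config)


class DummyEnv:
    """Placeholder for the wrapped multi-agent environment."""

    def __init__(self, config: dict):
        self.config = config


register_env_sketch("dummy-ma-v0", lambda config: DummyEnv(config))
env = make_env_sketch("dummy-ma-v0", {"seed": 0})
print(type(env).__name__, env.config)
```

Deferring construction to a creator function means each process that needs the environment can build its own instance instead of sharing one object.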

Run RLlib (this can take a while):

In [19]:
analysis = ray.tune.run(ppo.PPOTrainer, config=config, local_dir=save_dir, stop=stop, checkpoint_at_end=True)

Trial name,status,loc
PPO_mobile-small-ma-v0_2b82d_00000,PENDING,


Trial name,status,loc
PPO_mobile-small-ma-v0_2b82d_00000,RUNNING,127.0.0.1:7956


Result for PPO_mobile-small-ma-v0_2b82d_00000:
  agent_timesteps_total: 20000
  custom_metrics: {}
  date: 2021-11-24_12-33-20
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: -82.58529022334926
  episode_reward_mean: -197.16972293866527
  episode_reward_min: -324.75028304970056
  episodes_this_iter: 40
  episodes_total: 40
  experiment_id: 4e540bfa436c40ce885eae70b7b6ac0e
  hostname: nb-stschn
  info:
    learner:
      shared_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.3763474225997925
          entropy_coeff: 0.0
          kl: 0.01018717885017395
          model: {}
          policy_loss: -0.0061570750549435616
          total_loss: 166.27658081054688
          vf_explained_var: 0.015568939968943596
          vf_loss: 166.2806854248047
    num_agent_steps_sampled: 20000
    num_agent_steps_trained: 20000
    num_steps_sampled: 4000
 

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_mobile-small-ma-v0_2b82d_00000,RUNNING,127.0.0.1:7956,1,36.9278,4000,-197.17,-82.5853,-324.75,100


Result for PPO_mobile-small-ma-v0_2b82d_00000:
  agent_timesteps_total: 40000
  custom_metrics: {}
  date: 2021-11-24_12-33-57
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 26.011023298586956
  episode_reward_mean: -156.0066308067777
  episode_reward_min: -324.75028304970056
  episodes_this_iter: 40
  episodes_total: 80
  experiment_id: 4e540bfa436c40ce885eae70b7b6ac0e
  hostname: nb-stschn
  info:
    learner:
      shared_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.351000428199768
          entropy_coeff: 0.0
          kl: 0.010089628398418427
          model: {}
          policy_loss: -0.005375070031732321
          total_loss: 83.70623779296875
          vf_explained_var: -0.1662183254957199
          vf_loss: 83.70960235595703
    num_agent_steps_sampled: 40000
    num_agent_steps_trained: 40000
    num_steps_sampled: 8000

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_mobile-small-ma-v0_2b82d_00000,RUNNING,127.0.0.1:7956,2,74.0645,8000,-156.007,26.011,-324.75,100


Result for PPO_mobile-small-ma-v0_2b82d_00000:
  agent_timesteps_total: 60000
  custom_metrics: {}
  date: 2021-11-24_12-34-37
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 26.011023298586956
  episode_reward_mean: -123.1233896979559
  episode_reward_min: -324.75028304970056
  episodes_this_iter: 40
  episodes_total: 120
  experiment_id: 4e540bfa436c40ce885eae70b7b6ac0e
  hostname: nb-stschn
  info:
    learner:
      shared_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.3054527044296265
          entropy_coeff: 0.0
          kl: 0.011907270178198814
          model: {}
          policy_loss: -0.005937955342233181
          total_loss: 80.03804016113281
          vf_explained_var: -0.3122124671936035
          vf_loss: 80.0416030883789
    num_agent_steps_sampled: 60000
    num_agent_steps_trained: 60000
    num_steps_sampled: 12000
   

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_mobile-small-ma-v0_2b82d_00000,RUNNING,127.0.0.1:7956,3,113.562,12000,-123.123,26.011,-324.75,100


Result for PPO_mobile-small-ma-v0_2b82d_00000:
  agent_timesteps_total: 80000
  custom_metrics: {}
  date: 2021-11-24_12-35-14
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 26.011023298586956
  episode_reward_mean: -87.9172788083747
  episode_reward_min: -308.37925235206995
  episodes_this_iter: 40
  episodes_total: 160
  experiment_id: 4e540bfa436c40ce885eae70b7b6ac0e
  hostname: nb-stschn
  info:
    learner:
      shared_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.272882342338562
          entropy_coeff: 0.0
          kl: 0.009863310493528843
          model: {}
          policy_loss: -0.006680862046778202
          total_loss: 65.5313949584961
          vf_explained_var: -0.3546501100063324
          vf_loss: 65.5361099243164
    num_agent_steps_sampled: 80000
    num_agent_steps_trained: 80000
    num_steps_sampled: 16000

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_mobile-small-ma-v0_2b82d_00000,RUNNING,127.0.0.1:7956,4,150.941,16000,-87.9173,26.011,-308.379,100


Result for PPO_mobile-small-ma-v0_2b82d_00000:
  agent_timesteps_total: 100000
  custom_metrics: {}
  date: 2021-11-24_12-35-54
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 27.92399798089297
  episode_reward_mean: -74.02029784688384
  episode_reward_min: -308.37925235206995
  episodes_this_iter: 40
  episodes_total: 200
  experiment_id: 4e540bfa436c40ce885eae70b7b6ac0e
  hostname: nb-stschn
  info:
    learner:
      shared_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.2363429069519043
          entropy_coeff: 0.0
          kl: 0.011199514381587505
          model: {}
          policy_loss: -0.007194239646196365
          total_loss: 75.12138366699219
          vf_explained_var: -0.16546496748924255
          vf_loss: 75.12632751464844
    num_agent_steps_sampled: 100000
    num_agent_steps_trained: 100000
    num_steps_sampled: 20000

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_mobile-small-ma-v0_2b82d_00000,RUNNING,127.0.0.1:7956,5,190.251,20000,-74.0203,27.924,-308.379,100


Result for PPO_mobile-small-ma-v0_2b82d_00000:
  agent_timesteps_total: 120000
  custom_metrics: {}
  date: 2021-11-24_12-36-38
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 32.00511647140962
  episode_reward_mean: -63.05783606921642
  episode_reward_min: -166.7868669415777
  episodes_this_iter: 40
  episodes_total: 240
  experiment_id: 4e540bfa436c40ce885eae70b7b6ac0e
  hostname: nb-stschn
  info:
    learner:
      shared_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.2105662822723389
          entropy_coeff: 0.0
          kl: 0.008601279929280281
          model: {}
          policy_loss: -0.005161717534065247
          total_loss: 66.13565063476562
          vf_explained_var: -0.18167948722839355
          vf_loss: 66.13909149169922
    num_agent_steps_sampled: 120000
    num_agent_steps_trained: 120000
    num_steps_sampled: 24000


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_mobile-small-ma-v0_2b82d_00000,RUNNING,127.0.0.1:7956,6,234.509,24000,-63.0578,32.0051,-166.787,100


Result for PPO_mobile-small-ma-v0_2b82d_00000:
  agent_timesteps_total: 140000
  custom_metrics: {}
  date: 2021-11-24_12-37-12
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 37.517350407532646
  episode_reward_mean: -56.8993966162799
  episode_reward_min: -159.39671116929244
  episodes_this_iter: 40
  episodes_total: 280
  experiment_id: 4e540bfa436c40ce885eae70b7b6ac0e
  hostname: nb-stschn
  info:
    learner:
      shared_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.1945053339004517
          entropy_coeff: 0.0
          kl: 0.008428691886365414
          model: {}
          policy_loss: -0.005386861972510815
          total_loss: 67.3160629272461
          vf_explained_var: -0.18060088157653809
          vf_loss: 67.31977081298828
    num_agent_steps_sampled: 140000
    num_agent_steps_trained: 140000
    num_steps_sampled: 28000


Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_mobile-small-ma-v0_2b82d_00000,RUNNING,127.0.0.1:7956,7,268.332,28000,-56.8994,37.5174,-159.397,100


Result for PPO_mobile-small-ma-v0_2b82d_00000:
  agent_timesteps_total: 160000
  custom_metrics: {}
  date: 2021-11-24_12-37-51
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 37.517350407532646
  episode_reward_mean: -44.28550081510158
  episode_reward_min: -307.40419640021577
  episodes_this_iter: 40
  episodes_total: 320
  experiment_id: 4e540bfa436c40ce885eae70b7b6ac0e
  hostname: nb-stschn
  info:
    learner:
      shared_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.1653368473052979
          entropy_coeff: 0.0
          kl: 0.006505147088319063
          model: {}
          policy_loss: -0.004781945608556271
          total_loss: 54.027320861816406
          vf_explained_var: -0.2776322066783905
          vf_loss: 54.03080368041992
    num_agent_steps_sampled: 160000
    num_agent_steps_trained: 160000
    num_steps_sampled: 32000

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_mobile-small-ma-v0_2b82d_00000,RUNNING,127.0.0.1:7956,8,306.968,32000,-44.2855,37.5174,-307.404,100


Result for PPO_mobile-small-ma-v0_2b82d_00000:
  agent_timesteps_total: 180000
  custom_metrics: {}
  date: 2021-11-24_12-38-30
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 28.014085671833037
  episode_reward_mean: -36.26332603764277
  episode_reward_min: -307.40419640021577
  episodes_this_iter: 40
  episodes_total: 360
  experiment_id: 4e540bfa436c40ce885eae70b7b6ac0e
  hostname: nb-stschn
  info:
    learner:
      shared_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.1505690813064575
          entropy_coeff: 0.0
          kl: 0.00772960064932704
          model: {}
          policy_loss: -0.005320595111697912
          total_loss: 52.63880157470703
          vf_explained_var: -0.18861554563045502
          vf_loss: 52.642581939697266
    num_agent_steps_sampled: 180000
    num_agent_steps_trained: 180000
    num_steps_sampled: 36000

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_mobile-small-ma-v0_2b82d_00000,RUNNING,127.0.0.1:7956,9,345.897,36000,-36.2633,28.0141,-307.404,100


Result for PPO_mobile-small-ma-v0_2b82d_00000:
  agent_timesteps_total: 200000
  custom_metrics: {}
  date: 2021-11-24_12-39-05
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 33.72066006899857
  episode_reward_mean: -32.099061836934574
  episode_reward_min: -301.0544155509032
  episodes_this_iter: 40
  episodes_total: 400
  experiment_id: 4e540bfa436c40ce885eae70b7b6ac0e
  hostname: nb-stschn
  info:
    learner:
      shared_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.1299378871917725
          entropy_coeff: 0.0
          kl: 0.006103357300162315
          model: {}
          policy_loss: -0.003299871226772666
          total_loss: 57.985862731933594
          vf_explained_var: -0.1961873322725296
          vf_loss: 57.98794174194336
    num_agent_steps_sampled: 200000
    num_agent_steps_trained: 200000
    num_steps_sampled: 40000

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_mobile-small-ma-v0_2b82d_00000,RUNNING,127.0.0.1:7956,10,381.275,40000,-32.0991,33.7207,-301.054,100


Result for PPO_mobile-small-ma-v0_2b82d_00000:
  agent_timesteps_total: 220000
  custom_metrics: {}
  date: 2021-11-24_12-39-42
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 76.60983866128029
  episode_reward_mean: -25.00437293296074
  episode_reward_min: -156.68041022451237
  episodes_this_iter: 40
  episodes_total: 440
  experiment_id: 4e540bfa436c40ce885eae70b7b6ac0e
  hostname: nb-stschn
  info:
    learner:
      shared_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.111107349395752
          entropy_coeff: 0.0
          kl: 0.008636608719825745
          model: {}
          policy_loss: -0.005253782961517572
          total_loss: 63.161590576171875
          vf_explained_var: -0.20226168632507324
          vf_loss: 63.165122985839844
    num_agent_steps_sampled: 220000
    num_agent_steps_trained: 220000
    num_steps_sampled: 44000

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_mobile-small-ma-v0_2b82d_00000,RUNNING,127.0.0.1:7956,11,418.423,44000,-25.0044,76.6098,-156.68,100


Result for PPO_mobile-small-ma-v0_2b82d_00000:
  agent_timesteps_total: 240000
  custom_metrics: {}
  date: 2021-11-24_12-40-26
  done: false
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 76.60983866128029
  episode_reward_mean: -25.139173555377642
  episode_reward_min: -158.71299775589793
  episodes_this_iter: 40
  episodes_total: 480
  experiment_id: 4e540bfa436c40ce885eae70b7b6ac0e
  hostname: nb-stschn
  info:
    learner:
      shared_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.100955843925476
          entropy_coeff: 0.0
          kl: 0.005644019227474928
          model: {}
          policy_loss: -0.0038178933318704367
          total_loss: 64.17167663574219
          vf_explained_var: -0.1336476355791092
          vf_loss: 64.17436218261719
    num_agent_steps_sampled: 240000
    num_agent_steps_trained: 240000
    num_steps_sampled: 48000

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_mobile-small-ma-v0_2b82d_00000,RUNNING,127.0.0.1:7956,12,461.525,48000,-25.1392,76.6098,-158.713,100


Result for PPO_mobile-small-ma-v0_2b82d_00000:
  agent_timesteps_total: 260000
  custom_metrics: {}
  date: 2021-11-24_12-41-23
  done: true
  episode_len_mean: 100.0
  episode_media: {}
  episode_reward_max: 76.60983866128029
  episode_reward_mean: -35.461493594127866
  episode_reward_min: -158.71299775589793
  episodes_this_iter: 40
  episodes_total: 520
  experiment_id: 4e540bfa436c40ce885eae70b7b6ac0e
  hostname: nb-stschn
  info:
    learner:
      shared_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 1.0943677425384521
          entropy_coeff: 0.0
          kl: 0.006255448330193758
          model: {}
          policy_loss: -0.0035820447374135256
          total_loss: 86.11984252929688
          vf_explained_var: -0.22441518306732178
          vf_loss: 86.12218475341797
    num_agent_steps_sampled: 260000
    num_agent_steps_trained: 260000
    num_steps_sampled: 52000

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_mobile-small-ma-v0_2b82d_00000,TERMINATED,127.0.0.1:7956,13,518.475,52000,-35.4615,76.6098,-158.713,100


2021-11-24 12:41:23,617	INFO tune.py:630 -- Total run time: 550.02 seconds (549.56 seconds for the tuning loop).
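
Note that the trial terminates at 52,000 timesteps although the stop criterion is 50,000. A plausible reading of the log is that the criterion is only checked after each completed training iteration, and each iteration here adds 4,000 timesteps (the `ts` column goes 44000, 48000, 52000). A minimal sketch of that arithmetic, with variable names chosen for illustration only:

```python
import math

# Values read from the log and the stop criterion above.
steps_per_iteration = 4000   # ts column: 44000 -> 48000 -> 52000
stop_timesteps = 50000       # "timesteps_total" in the stop dict

# Assumption: the stop condition is evaluated once per finished iteration,
# so training ends at the first iteration boundary at or above the limit.
iterations = math.ceil(stop_timesteps / steps_per_iteration)
final_timesteps = iterations * steps_per_iteration

print(iterations, final_timesteps)
```

This matches the final status row, which reports iteration 13 at 52,000 timesteps.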


Visualize the training with TensorBoard:

In [20]:
%tensorboard --logdir ray_results

Launching TensorBoard...

To visualize the final multi-agent policy, load the latest model checkpoint:

In [21]:
checkpoint = analysis.get_last_checkpoint()
model = ppo.PPOTrainer(config=config, env='mobile-small-ma-v0')
model.restore(checkpoint)

2021-11-24 12:41:23,766	INFO ppo.py:166 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
2021-11-24 12:41:23,766	INFO trainer.py:770 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


RayActorError: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=18304, ip=127.0.0.1)
  File "python\ray\_raylet.pyx", line 565, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 569, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 519, in ray._raylet.execute_task.function_executor
  File "c:\users\stefan\git-repos\work\mobile-env\venv\lib\site-packages\ray\_private\function_manager.py", line 576, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "c:\users\stefan\git-repos\work\mobile-env\venv\lib\site-packages\ray\util\tracing\tracing_helper.py", line 451, in _resume_span
    return method(self, *_args, **_kwargs)
  File "c:\users\stefan\git-repos\work\mobile-env\venv\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 584, in __init__
    self._build_policy_map(
  File "c:\users\stefan\git-repos\work\mobile-env\venv\lib\site-packages\ray\util\tracing\tracing_helper.py", line 451, in _resume_span
    return method(self, *_args, **_kwargs)
  File "c:\users\stefan\git-repos\work\mobile-env\venv\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 1384, in _build_policy_map
    self.policy_map.create_policy(name, orig_cls, obs_space, act_space,
  File "c:\users\stefan\git-repos\work\mobile-env\venv\lib\site-packages\ray\rllib\policy\policy_map.py", line 123, in create_policy
    sess = self.session_creator()
  File "c:\users\stefan\git-repos\work\mobile-env\venv\lib\site-packages\ray\rllib\evaluation\worker_set.py", line 323, in session_creator
    return tf1.Session(
AttributeError: 'NoneType' object has no attribute 'Session'
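
The `AttributeError` above indicates that RLlib's TensorFlow handle (`tf1`) was `None` inside the rollout worker. This most likely means TensorFlow could not be imported in the Python environment that runs the workers. The following sketch checks which deep learning frameworks are importable, using only the standard library. The helper name `framework_available` is hypothetical and not part of RLlib or mobile-env.

```python
import importlib.util


def framework_available(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Report which frameworks the current interpreter can import.
for name in ("tensorflow", "torch"):
    status = "found" if framework_available(name) else "missing"
    print(f"{name}: {status}")
```

If TensorFlow is reported as missing, installing it into the same environment that runs the Ray workers is one option. Another is to select PyTorch through RLlib's `framework` config key, provided PyTorch is installed.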

mobile-env provides a render() function to visualize the simulation.
In Google Colaboratory, the better-looking 'human' mode is unavailable (it only works locally). Still, we can visualize the final policy as RGB images:

In [None]:
import time
import matplotlib.pyplot as plt
from IPython import display
%matplotlib inline

env = register('mobile-small-ma-v0')
done = {'__all__': False}
obs = env.reset()

while not done['__all__']:
    # gather action from each actor (for each UE)
    action = {}
    for agent_id, agent_obs in obs.items():
        policy_id = config['multiagent']['policy_mapping_fn'](agent_id)
        action[agent_id] = model.compute_action(agent_obs, policy_id=policy_id)
    
    # perform step on simulation environment 
    obs, reward, done, info = env.step(action)

    # display environment as RGB
    plt.imshow(env.env.render(mode='rgb_array'))
    display.display(plt.gcf())
    display.clear_output(wait=True)
    time.sleep(0.025)

## Central Environment with Stable-Baselines3

In [None]:
!pip install stable-baselines3

In [None]:
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy

The decision-making policy can also be trained on scenarios that simulate centralized control over all user equipment (UEs), i.e., a single agent with global information decides which connections to establish for each UE. The centralized setting wraps the observations, rewards and actions of the multi-agent setting. Observations become single (concatenated) vectors that jointly represent up-to-date information on all UEs. The reward is the average utility of active UEs. Similarly, actions are vectors of discrete decisions, one per UE.
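
To make the wrapping concrete, here is a minimal sketch in plain Python. It is not mobile-env's implementation: the function names, the ordering of UEs by ID, and the reward of 0.0 when no UE is active are all assumptions made for illustration.

```python
from typing import Dict, List


def to_central_observation(per_ue_obs: Dict[int, List[float]]) -> List[float]:
    """Concatenate per-UE observations (ordered by UE id) into one vector."""
    central: List[float] = []
    for ue_id in sorted(per_ue_obs):
        central.extend(per_ue_obs[ue_id])
    return central


def to_central_reward(per_ue_utility: Dict[int, float]) -> float:
    """Average the utilities of the currently active UEs."""
    if not per_ue_utility:
        return 0.0  # assumption: neutral reward when no UE is active
    return sum(per_ue_utility.values()) / len(per_ue_utility)


def to_per_ue_actions(central_action: List[int], ue_ids: List[int]) -> Dict[int, int]:
    """Split the central action vector into one discrete decision per UE."""
    return dict(zip(sorted(ue_ids), central_action))


# Toy example with two UEs and two features per UE.
obs = {2: [0.5, 0.1], 1: [0.9, 0.0]}
print(to_central_observation(obs))
print(to_central_reward({1: 1.0, 2: -0.5}))
print(to_per_ue_actions([0, 2], [1, 2]))
```

The central agent therefore sees one fixed-length vector and emits one action vector per step, which is what lets a standard single-agent algorithm such as PPO be applied directly.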

Use *stable-baselines3*'s PPO agent on the (small) centralized control environment and train it for 50,000 steps:

In [None]:
# create the small central simulation
env = gym.make("mobile-small-central-v0")

# train PPO agent on environment
model = PPO(MlpPolicy, env, verbose=1, tensorboard_log='ppo_central_tensorboard')
model.learn(total_timesteps=50000)

Visualize the training results with TensorBoard:

In [None]:
%tensorboard --logdir ppo_central_tensorboard

Visualize what the central agent has learned. Note that the RGB visualization on Google Colaboratory does not render the environment as clearly as the 'human' mode, which is only available when running locally.

In [None]:
import time
import matplotlib.pyplot as plt
from IPython import display
%matplotlib inline

env = gym.make('mobile-small-central-v0')
done = False
obs = env.reset()

while not done:
    action, _ = model.predict(obs)

    # perform step on simulation environment 
    obs, reward, done, info = env.step(action)

    # display environment as RGB
    plt.imshow(env.render(mode='rgb_array'))
    display.display(plt.gcf())
    display.clear_output(wait=True)
    time.sleep(0.025)