In [None]:
!pip install mobile-env

# Load the TensorBoard notebook extension
%load_ext tensorboard

In [None]:
import gym
import mobile_env

This Google Colaboratory notebook introduces *mobile-env* and shows how to train and evaluate multi-agent and centralized decision-making policies for cell selection in mobile communication settings. First, we train a multi-agent policy with **RLlib**. Second, we train a central policy with **stable-baselines3**.

# Multi-Agent Control Setting with RLlib

In [None]:
!pip install -U "ray[rllib]"

In [None]:
import ray
from ray.rllib.agents import ppo

stop = {
    "episodes_total": 2000
}

config = {
        # environment configuration:
        "env": "mobile-small-ma-v0",

        # agent configuration:
        "multiagent": {
            "policies": {"shared_policy"},
            "policy_mapping_fn": (
                lambda agent_id, **kwargs: "shared_policy"),
        },
}
save_dir = "ray_results"  # must match the --logdir passed to TensorBoard later on

ray.shutdown()
ray.init(
  num_cpus=3,
  include_dashboard=False,
  ignore_reinit_error=True,
  log_to_driver=False,
)
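
The `multiagent` entry of the configuration above maps every agent to the same policy ID, so all UEs train and act with one shared set of weights. The following is a minimal sketch of that mapping in plain Python; the agent IDs are made up for illustration, and the real IDs are determined by the wrapped environment.

```python
# Illustrative only: the agent IDs below are hypothetical.
def policy_mapping_fn(agent_id, **kwargs):
    """Map every agent (UE) to the single shared policy."""
    return "shared_policy"


agent_ids = [0, 1, 2]
assignment = {agent_id: policy_mapping_fn(agent_id) for agent_id in agent_ids}
print(assignment)
```

Because the function ignores `agent_id`, adding more UEs does not add more policies.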

To use RLlib's multi-agent policies, we must first register our custom OpenAI Gym environment. The `RLlibMAWrapper` class wraps the default multi-agent simulation so that it conforms to RLlib's `MultiAgentEnv` interface. The wrapped environment defines an action and observation space for each user equipment (UE), assigns a reward to each UE (i.e., to each agent), and returns partial observations, so agents have no global knowledge.
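
To make the dict-based interface concrete, here is a rough sketch of what a multi-agent environment returns per step. `ToyMultiAgentEnv` is a hypothetical stand-in written for this illustration; it is not mobile-env's or RLlib's implementation, and its observations and rewards are placeholders.

```python
# Hypothetical stand-in that only mimics the dict-based interface:
# observations, rewards and done flags are keyed by agent (UE) ID.
class ToyMultiAgentEnv:
    def __init__(self, num_ues=3, episode_length=2):
        self.agent_ids = list(range(num_ues))
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        # one (partial) observation per UE
        return {agent_id: [0.0, 0.0] for agent_id in self.agent_ids}

    def step(self, actions):
        self.t += 1
        obs = {aid: [float(self.t), float(a)] for aid, a in actions.items()}
        rewards = {aid: 0.0 for aid in actions}  # placeholder per-UE rewards
        done = self.t >= self.episode_length
        dones = {aid: done for aid in actions}
        dones["__all__"] = done  # signals the end of the whole episode
        infos = {aid: {} for aid in actions}
        return obs, rewards, dones, infos


toy_env = ToyMultiAgentEnv()
toy_obs = toy_env.reset()
toy_obs, toy_rewards, toy_dones, toy_infos = toy_env.step(
    {agent_id: 0 for agent_id in toy_obs}
)
print(sorted(toy_obs), toy_rewards, toy_dones["__all__"])
```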

In [None]:
from ray.tune.registry import register_env


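def register(config):
    # import inside the function so that each RLlib worker process
    # registers the mobile-env environments with Gym
    import gym
    import mobile_env
    from mobile_env.wrappers.multi_agent import RLlibMAWrapper

    # `config` is passed in by RLlib's env creator call but is unused here
    env = gym.make("mobile-small-ma-v0")
    return RLlibMAWrapper(env)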

register_env("mobile-small-ma-v0", register)

Run RLlib (this can take a while):

In [None]:
analysis = ray.tune.run(
    ppo.PPOTrainer,
    config=config,
    local_dir=save_dir,
    stop=stop,
    checkpoint_at_end=True,
)

Visualize the training with TensorBoard:

In [None]:
%tensorboard --logdir ray_results

To visualize the final multi-agent policy, load the latest model checkpoint:

In [None]:
checkpoint = analysis.get_last_checkpoint()
model = ppo.PPOTrainer(config=config, env='mobile-small-ma-v0')
model.restore(checkpoint)

*mobile-env* provides a `render()` function to visualize the simulation. The better-looking 'human' mode only works locally and is unavailable in Google Colaboratory, but we can still visualize the final policy as RGB images:

In [None]:
import time
import matplotlib.pyplot as plt
from IPython import display
%matplotlib inline

env = register('mobile-small-ma-v0')
done = {'__all__': False}
obs = env.reset()

while not done['__all__']:
    # compute an action for each agent (one per UE) using its mapped policy
    action = {}
    for agent_id, agent_obs in obs.items():
        policy_id = config['multiagent']['policy_mapping_fn'](agent_id)
        action[agent_id] = model.compute_action(agent_obs, policy_id=policy_id)

    # perform a step on the simulation environment
    obs, reward, done, info = env.step(action)

    # display environment as RGB
    plt.imshow(env.env.render(mode='rgb_array'))
    display.display(plt.gcf())
    display.clear_output(wait=True)
    time.sleep(0.025)

# Central Environment with Stable-Baselines3

In [None]:
!pip install stable-baselines3

In [None]:
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy

The decision-making policy can also be trained on scenarios that simulate centralized control over all UEs: a single agent with global information decides which connections to establish for each UE. The centralized setting wraps the observations, rewards and actions of the multi-agent setting. Observations are single concatenated vectors that jointly represent up-to-date information on all UEs, the reward is the average utility of the active UEs, and actions are vectors of discrete decisions.
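
The aggregation described above can be sketched in plain Python. The helper names, the sorted UE ordering, the fallback reward of 0.0 when no UE is active, and the toy numbers are all assumptions made for this illustration; they are not taken from mobile-env's code.

```python
# Hypothetical helpers illustrating the centralized view of a multi-agent setting.
def concat_observations(per_ue_obs):
    """Concatenate per-UE observation vectors in a fixed (sorted) UE order."""
    return [x for ue_id in sorted(per_ue_obs) for x in per_ue_obs[ue_id]]


def mean_utility(utilities, active_ues):
    """Average utility over active UEs (assumed 0.0 if no UE is active)."""
    if not active_ues:
        return 0.0
    return sum(utilities[ue_id] for ue_id in active_ues) / len(active_ues)


def split_action(action_vector, ue_ids):
    """Assign one discrete decision from the central action vector to each UE."""
    return dict(zip(sorted(ue_ids), action_vector))


per_ue_obs = {0: [0.1, 0.9], 1: [0.4, 0.2], 2: [0.0, 0.0]}
utilities = {0: 0.5, 1: 1.0, 2: -1.0}

central_obs = concat_observations(per_ue_obs)
central_reward = mean_utility(utilities, active_ues=[0, 1])
per_ue_actions = split_action([2, 0, 1], per_ue_obs.keys())
print(central_obs, central_reward, per_ue_actions)
```

In this sketch the central agent sees six numbers instead of three separate two-element observations, and UE 2 does not contribute to the reward because it is not listed as active.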

Use *stable-baselines3*'s PPO agent on the (small) centralized control environment and train it for 500,000 steps:

In [None]:
# create the small central simulation
env = gym.make("mobile-small-central-v0")

# train PPO agent on environment
model = PPO(MlpPolicy, env, verbose=1, tensorboard_log='ppo_central_tensorboard')
model.learn(total_timesteps=500000)

Visualize the training results with TensorBoard:

In [None]:
%tensorboard --logdir ppo_central_tensorboard

Visualize what the central agent has learned. Note that the RGB visualization on Google Colaboratory does not render the environment as clearly as the 'human' mode, which is only available when running locally.

In [None]:
import time
import matplotlib.pyplot as plt
from IPython import display
%matplotlib inline

env = gym.make('mobile-small-central-v0')
done = False
obs = env.reset()

while not done:
    action, _ = model.predict(obs)

    # perform a step on the simulation environment
    obs, reward, done, info = env.step(action)

    # display environment as RGB
    plt.imshow(env.render(mode='rgb_array'))
    display.display(plt.gcf())
    display.clear_output(wait=True)
    time.sleep(0.025)
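
As a complement to the visual check, the same loop can report a number: the summed reward of an episode. The sketch below assumes the 4-tuple `step()` return used throughout this notebook; `run_episode` and `CountdownEnv` are hypothetical names introduced only to show the call pattern without requiring a trained model.

```python
def run_episode(env, choose_action):
    """Roll out one episode and return (summed reward, number of steps)."""
    obs = env.reset()
    done = False
    total_reward = 0.0
    steps = 0
    while not done:
        obs, reward, done, info = env.step(choose_action(obs))
        total_reward += reward
        steps += 1
    return total_reward, steps


# Hypothetical stand-in environment: reward 1.0 per step, ends after `horizon` steps.
class CountdownEnv:
    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.horizon, {}


episode_return, episode_length = run_episode(CountdownEnv(), lambda obs: 0)
print(episode_return, episode_length)
```

With the central agent trained above, the same helper could be called as `run_episode(env, lambda obs: model.predict(obs)[0])`.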