# Demonstrating `mobile-env`

`mobile-env` is a simple and open environment for training, testing, and evaluating autonomous coordination
approaches for wireless mobile networks.

* `mobile-env` is written in pure Python and can be installed easily via [PyPI](https://pypi.org/project/mobile-env/)
* It allows simulating various scenarios with moving users in a cellular network with multiple base stations
* `mobile-env` implements the standard [Gymnasium](https://gymnasium.farama.org/) (previously [OpenAI Gym](https://gym.openai.com/)) interface such that it can be used with all common frameworks for reinforcement learning
* `mobile-env` is not restricted to reinforcement learning approaches but can also be used with conventional control approaches or dummy benchmark algorithms
* It supports both centralized, single-agent control and distributed, multi-agent control
* It can be configured easily (e.g., adjusting number and movement of users, properties of cells, etc.)
* It is also easy to extend `mobile-env`, e.g., by implementing different observations, actions, or rewards

As such, `mobile-env` is a simple platform to evaluate and compare different coordination approaches in a meaningful way.



**Demonstration Steps:**

This demonstration consists of the following steps (split across separate notebooks):

1. [Intro notebook](examples/demo.ipynb): Installation and usage of `mobile-env` with dummy actions
2. [Intro notebook](examples/demo.ipynb): Configuration of `mobile-env` and adjustment of the observation space (optional)
3. [SB3 notebook](examples/sb3.ipynb): Training a single-agent reinforcement learning approach with [`stable-baselines3`](https://github.com/DLR-RM/stable-baselines3)
4. **This notebook:** Training a multi-agent reinforcement learning approach with [Ray RLlib](https://docs.ray.io/en/latest/rllib.html)

## Step 4: Multi-Agent RL with Ray RLlib

As an alternative to controlling cell selection centrally for all users with a single RL agent, we can also use multi-agent RL, i.e., delegate control to multiple agents that act in parallel.
As an example, we could have each RL agent responsible for the cell selection of a single user. Then we would need as many agents as we have users.
That's what happens in the predefined multi-agent scenarios, e.g., `mobile-small-ma-v0`.

Let's use RLlib to train a multi-agent policy on the `mobile-small-ma-v0` scenario, which has three base stations and five users.
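
To make the "one agent per user" idea concrete, the following minimal sketch builds a joint action as a dictionary keyed by agent ID, which is the shape of the actions passed to the environment later in this notebook. It uses only the standard library and does not call `mobile-env`. The helper `random_joint_action` and the action encoding (0 = do nothing, 1 to 3 = select that cell) are assumptions made for illustration only.

```python
import random

NUM_USERS = 5  # one agent per user, as in the small scenario
NUM_CELLS = 3  # base stations in the small scenario


def random_joint_action(agent_ids, num_cells, rng):
    """Draw one discrete action per agent and return them keyed by agent ID.

    Assumed encoding (illustrative only): 0 = do nothing,
    1..num_cells = select the corresponding cell.
    """
    return {agent_id: rng.randrange(num_cells + 1) for agent_id in agent_ids}


rng = random.Random(0)  # fixed seed for reproducibility
agent_ids = list(range(NUM_USERS))
joint_action = random_joint_action(agent_ids, NUM_CELLS, rng)

# One entry per agent: each agent only decides for "its" user.
print(joint_action)
```

A trained multi-agent policy replaces the random draw with one decision per agent, each based on that agent's own observation.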

### Set up Ray RLlib

To train a multi-agent approach, we can use Ray RLlib, which supports multi-agent RL out of the box. To register the predefined multi-agent scenario with RLlib, `mobile-env` provides a wrapper `RLlibMAWrapper`. But first we need to install `mobile-env` and `ray` with RLlib:

In [1]:
# install mobile-env
!pip install -U mobile-env
# install ray RLlib
!pip install "ray[rllib]==2.5.1" tensorboard



In [1]:
import gymnasium
from ray.tune.registry import register_env

# use the mobile-env RLlib wrapper for RLlib
def register(config):
    # importing mobile_env registers the included environments
    import mobile_env
    from mobile_env.wrappers.multi_agent import RLlibMAWrapper

    env = gymnasium.make("mobile-small-ma-v0")
    return RLlibMAWrapper(env)

# register the predefined scenario with RLlib
register_env("mobile-small-ma-v0", register)

### Train a PPO Multi-Agent Policy

Now that the predefined scenario is registered with RLlib, we can configure and train a multi-agent PPO approach on it.
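
The configuration in the following cells maps every agent to one shared policy. The sketch below illustrates, in plain Python and without RLlib, what such a mapping function does. `shared_mapping` mirrors the lambda used in this notebook, while `per_agent_mapping` and its `policy_<id>` naming are hypothetical and shown only for contrast.

```python
AGENT_IDS = [0, 1, 2, 3, 4]  # one agent per user in the small scenario


def shared_mapping(agent_id, episode=None, worker=None, **kwargs):
    """Map every agent to the same policy (parameter sharing)."""
    return "shared_policy"


def per_agent_mapping(agent_id, episode=None, worker=None, **kwargs):
    """Hypothetical alternative: each agent trains its own policy."""
    return f"policy_{agent_id}"


shared = {agent_id: shared_mapping(agent_id) for agent_id in AGENT_IDS}
separate = {agent_id: per_agent_mapping(agent_id) for agent_id in AGENT_IDS}

print(len(set(shared.values())))    # 1 distinct policy
print(len(set(separate.values())))  # 5 distinct policies
```

With a shared policy, experience from all five agents updates the same network. Separate policies would also need one entry each in the `policies` dictionary.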

In [2]:
import ray


# init ray with the available CPUs (and GPUs)
ray.init(
  num_cpus=2,   # change to your available number of CPUs
  include_dashboard=False,
  ignore_reinit_error=True,
  log_to_driver=False,
)

2023-07-20 19:24:07,872	INFO worker.py:1636 -- Started a local Ray instance.


Python version:,3.8.16
Ray version:,2.5.1


In [3]:
import ray.air
from ray.rllib.algorithms.ppo import PPOConfig

from ray.rllib.policy.policy import PolicySpec
from ray.tune.stopper import MaximumIterationStopper

# Create an RLlib config using multi-agent PPO on mobile-env's small scenario.
config = (
    PPOConfig()
    .environment(env="mobile-small-ma-v0")
    # Here, we configure all agents to share the same policy.
    .multi_agent(
        policies={"shared_policy": PolicySpec()},
        policy_mapping_fn=lambda agent_id, episode, worker, **kwargs: "shared_policy",
    )
    # RLlib needs one more CPU than configured below (presumably for the driver/trainer process)
    .resources(num_cpus_per_worker=1)
    .rollouts(num_rollout_workers=1)
)

# Create the Trainer/Tuner and define how long to train
tuner = ray.tune.Tuner(
    "PPO",
    run_config=ray.air.RunConfig(
        # Save the training progress and checkpoints locally under the specified subfolder.
        storage_path="./results_rllib",
        # Control training length by setting the number of iterations. 1 iter = 4000 time steps by default.
        stop=MaximumIterationStopper(max_iter=10),
        checkpoint_config=ray.air.CheckpointConfig(checkpoint_at_end=True),
    ),
    param_space=config,
)

# Run training and save the result
result_grid = tuner.fit()



Current time:,2023-07-20 19:27:03
Running for:,00:01:56.23
Memory:,7.5/7.9 GiB

Trial name,status,loc,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_mobile-small-ma-v0_5f1f2_00000,TERMINATED,127.0.0.1:8360,1,79.252,4000,-186.81,-19.9275,-322.086,100


Metric,Value
trial_id,5f1f2_00000
training_iteration,1
timesteps_total,4000
agent_timesteps_total,20000
episodes_this_iter,40
episode_len_mean,100
episode_reward_mean,-186.81
episode_reward_max,-19.9275
episode_reward_min,-322.086
policy_reward_mean (shared_policy),-37.361940196467856
policy_reward_max (shared_policy),4.675273777830706
policy_reward_min (shared_policy),-88.18277678648225
sample_time_ms,34188.102
learn_time_ms,45049.352
time_total_s,79.252


2023-07-20 19:27:03,902	INFO tune.py:1111 -- Total run time: 117.18 seconds (115.89 seconds for the tuning loop).
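
The step counts in the training output follow from a few numbers: 4000 environment steps per iteration, episodes of 100 steps, and five agents acting in every step. The short calculation below reproduces the reported values. The constant names are chosen here for illustration and are not RLlib settings.

```python
ENV_STEPS_PER_ITERATION = 4000  # "1 iter = 4000 time steps by default"
EPISODE_LENGTH = 100            # episode_len_mean in the output
NUM_AGENTS = 5                  # one agent per user
MAX_ITERATIONS = 10             # max_iter passed to the stopper

episodes_per_iteration = ENV_STEPS_PER_ITERATION // EPISODE_LENGTH
agent_steps_per_iteration = ENV_STEPS_PER_ITERATION * NUM_AGENTS
total_env_steps = ENV_STEPS_PER_ITERATION * MAX_ITERATIONS

print(episodes_per_iteration)     # 40, matches episodes_this_iter
print(agent_steps_per_iteration)  # 20000, matches agent_timesteps_total
print(total_env_steps)            # 40000 environment steps for 10 iterations
```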


We can check the learning curve on TensorBoard. The corresponding files are in the configured `storage_path="./results_rllib"`.

The `episode_reward_mean` should increase as training progresses, indicating that the agents are learning. RLlib also logs many other metrics by default, which can be useful for debugging.
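
Per-episode rewards are noisy, so a rising trend is easier to see after smoothing, similar to the smoothing slider in TensorBoard. The sketch below computes a simple trailing moving average in plain Python. The `rewards` list contains made-up illustrative values, not results from this run, and `moving_average` is a helper defined here rather than part of RLlib.

```python
def moving_average(values, window):
    """Trailing moving average; early entries average over fewer values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= values[i - window]
        averages.append(running_sum / min(i + 1, window))
    return averages


# Illustrative episode rewards (made up for this example).
rewards = [-190.0, -170.0, -180.0, -150.0, -160.0, -130.0]
smoothed = moving_average(rewards, window=3)

# A smoothed curve that ends higher than it starts suggests learning progress.
print(smoothed[-1] > smoothed[0])
```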

In [10]:
%load_ext tensorboard
%tensorboard --logdir results_rllib

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6006 (pid 17908), started 0:00:22 ago. (Use '!kill 17908' to kill it.)

### Load and Test the Trained Multi-Agent Policy

Let's load the trained multi-agent model and visualize the learned multi-agent policy:

In [6]:
from ray.rllib.algorithms.algorithm import Algorithm

# load the trained agent from the stored checkpoint
best_result = result_grid.get_best_result(metric="episode_reward_mean", mode="max")
ppo = Algorithm.from_checkpoint(best_result.checkpoint)

2023-07-20 19:29:26,871	INFO trainable.py:173 -- Trainable.setup took 14.657 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.


In [11]:
import mobile_env
import matplotlib.pyplot as plt
from IPython import display

# create the env for testing
# pass rgb_array as render mode so the env can be rendered inside the notebook
env = gymnasium.make("mobile-small-ma-v0", render_mode="rgb_array")
obs, info = env.reset()
done = False

# run one episode with the trained model
while not done:
    # gather an action from each agent (one per UE)
    action = {}
    for agent_id, agent_obs in obs.items():
        # compute the action for the given agent using the shared policy
        action[agent_id] = ppo.compute_single_action(agent_obs, policy_id="shared_policy")

    # apply actions and perform step on simulation environment
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

    # render environment as RGB
    plt.imshow(env.render())
    display.display(plt.gcf())
    display.clear_output(wait=True)

{0: -0.08314402193659076, 1: -0.06218562828460161, 2: -1.0, 3: -1.0, 4: -0.15216507928783843}


While the learned policy is not yet perfect (the reward is still increasing, i.e., the agents are still learning), the visualization shows that most users are connected to suitable cells.
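
The dictionary printed after the test episode appears to hold one reward per agent for the final step. Assuming that interpretation, the following sketch summarizes such a per-agent reward dictionary. The variable names are chosen here for illustration.

```python
# Per-agent rewards as printed after the test episode above.
step_rewards = {
    0: -0.08314402193659076,
    1: -0.06218562828460161,
    2: -1.0,
    3: -1.0,
    4: -0.15216507928783843,
}

# Aggregate over all agents to get a single number per step.
team_reward = sum(step_rewards.values())
mean_reward = team_reward / len(step_rewards)

# Identify the agents with the lowest reward in this step.
worst_reward = min(step_rewards.values())
worst_agents = sorted(a for a, r in step_rewards.items() if r == worst_reward)

print(f"team reward: {team_reward:.3f}, mean: {mean_reward:.3f}")
print(f"agents with the lowest reward: {worst_agents}")
```

Summing these per-step values over a whole episode gives a per-episode return that can be compared across policies.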

Feel free to experiment more with `mobile-env`, e.g., by training longer or customizing the environment.
The [documentation](https://mobile-env.readthedocs.io/en/latest/) provides further information about the API.
If you still have open questions or run into issues, you can [open an issue on GitHub](https://github.com/stefanbschneider/mobile-env/issues).

We hope `mobile-env` is useful for you. If you use `mobile-env`, please cite it and let us know - then we can list your work on `mobile-env`'s Readme.
We also very much appreciate contributions in the form of pull requests.

```
@inproceedings{schneider2022mobileenv,
  author = {Schneider, Stefan and Werner, Stefan and Khalili, Ramin and Hecker, Artur and Karl, Holger},
  title = {mobile-env: An Open Platform for Reinforcement Learning in Wireless Mobile Networks},
  booktitle={Network Operations and Management Symposium (NOMS)},
  year = {2022},
  publisher = {IEEE/IFIP},
}
```

Happy training/learning!