- This notebook benchmarks several RL algorithms from stable_baselines3 (PPO, DDPG, SAC, TD3, and A2C) on the custom EHREnv-v0 environment.

- Each algorithm logs to TensorBoard as its cells run, so the learning curves can be viewed and compared there.

- To view the learning curves, open a command prompt in the notebook's directory and run the following command:
  - tensorboard --logdir ./tensorboard/EHREnv-v0/
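Before launching TensorBoard, it can help to check which runs have been written. A minimal sketch, assuming each run lands in its own sub-directory of the log directory (the `Logging to ./tensorboard/EHREnv-v0/PPO_1` lines in this notebook suggest so). `list_runs` is a hypothetical helper, not part of stable_baselines3:

```python
from pathlib import Path


def list_runs(log_dir):
    """Return the sorted names of the run sub-directories under log_dir.

    Hypothetical helper for illustration; not part of stable_baselines3.
    """
    root = Path(log_dir)
    if not root.is_dir():
        return []  # nothing has been logged yet
    # Keep directories only; stray files in the log directory are ignored.
    return sorted(p.name for p in root.iterdir() if p.is_dir())


if __name__ == "__main__":
    # Same path as in the tensorboard command above.
    for run in list_runs("./tensorboard/EHREnv-v0/"):
        print(run)
```

An empty result most likely means that no `learn` call has run yet or that the notebook was started from a different working directory.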

In [1]:
import gymnasium as gym
from EHR_env import EHREnv

import stable_baselines3 as sb3

In [2]:
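# Register the custom environment so that gym.make can build it by id.
gym.register(
    id='EHREnv-v0',
    entry_point='EHR_env:EHREnv',
    max_episode_steps=100,  # episodes are truncated after 100 steps
    kwargs={'model': 'lstm', 'sofa': 5},  # default constructor arguments
)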

In [3]:
env = gym.make('EHREnv-v0', model='lstm', sofa=5)
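The `model` and `sofa` arguments appear both in `gym.register` and in `gym.make`. The sketch below illustrates how the two sets of keyword arguments combine, assuming gymnasium starts from the registered defaults and lets call-time values override them. `effective_kwargs` is a hypothetical stand-in written with plain dictionaries, not a gymnasium function, and `sofa=8` is an arbitrary illustrative value:

```python
# Defaults stored at registration time (same values as the gym.register call).
registered_kwargs = {"model": "lstm", "sofa": 5}


def effective_kwargs(registered, **overrides):
    """Illustrative stand-in: registered defaults first, call-time values win."""
    merged = dict(registered)  # copy so the registered defaults are not mutated
    merged.update(overrides)
    return merged


# Repeating the defaults, as the gym.make call does, changes nothing.
print(effective_kwargs(registered_kwargs, model="lstm", sofa=5))
# Overriding one value keeps the other default (sofa=8 is a made-up value).
print(effective_kwargs(registered_kwargs, sofa=8))
```

Under that assumption, `gym.make('EHREnv-v0')` without extra arguments would build the same environment as the call in this cell.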

- PPO

In [4]:
ppo = sb3.PPO("MlpPolicy", 
                env, 
                learning_rate=0.005,
                n_steps=100,
                batch_size=50,
                verbose=1, 
                tensorboard_log="./tensorboard/EHREnv-v0/",
                device='cpu')

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
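The PPO settings determine how the 10,000 training timesteps used in this notebook split into rollouts and gradient steps. A rough sketch of that arithmetic, assuming a single environment (the `DummyVecEnv` message above) and PPO's default of 10 optimisation epochs per rollout, since `n_epochs` is not set in this notebook. `ppo_schedule` is a hypothetical helper, not a library function:

```python
import math


def ppo_schedule(total_timesteps, n_steps, batch_size, n_envs=1, n_epochs=10):
    """Estimate rollout and gradient-step counts for an on-policy run.

    n_envs=1 and n_epochs=10 are assumptions (one environment, library default).
    """
    rollout_size = n_steps * n_envs  # transitions collected per iteration
    iterations = math.ceil(total_timesteps / rollout_size)
    minibatches = math.ceil(rollout_size / batch_size)  # per epoch
    steps_per_iteration = minibatches * n_epochs
    return {
        "rollout_size": rollout_size,
        "iterations": iterations,
        "minibatches_per_epoch": minibatches,
        "gradient_steps_per_iteration": steps_per_iteration,
        "total_gradient_steps": iterations * steps_per_iteration,
    }


# Values from the PPO cell: n_steps=100, batch_size=50, 10,000 timesteps.
for key, value in ppo_schedule(10000, 100, 50).items():
    print(f"{key}: {value}")
```

With these settings each rollout covers exactly one 100-step episode, which matches the training log reporting one iteration per 100 timesteps.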


In [5]:
ppo.learn(total_timesteps=10000, progress_bar=False, log_interval=1, tb_log_name="PPO")

Logging to ./tensorboard/EHREnv-v0/PPO_1
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 100      |
|    ep_rew_mean     | -4       |
| time/              |          |
|    fps             | 16       |
|    iterations      | 1        |
|    time_elapsed    | 5        |
|    total_timesteps | 100      |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 100         |
|    ep_rew_mean          | -3.5        |
| time/                   |             |
|    fps                  | 18          |
|    iterations           | 2           |
|    time_elapsed         | 10          |
|    total_timesteps      | 200         |
| train/                  |             |
|    approx_kl            | 0.043740924 |
|    clip_fraction        | 0.42        |
|    clip_range           | 0.2         |
|    entropy_loss         | -5.63       |
|    explained_variance   | 0.1

<stable_baselines3.ppo.ppo.PPO at 0x20cb5137f10>
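The `ep_rew_mean` column is a running average over recent episodes, not the return of the latest episode. The sketch below recovers approximate per-episode returns from consecutive logged means. It assumes that exactly one episode finishes between log entries and that the averaging window has not filled yet. `episode_rewards_from_running_means` is a hypothetical helper, and the logged values are rounded, so the results are approximate:

```python
def episode_rewards_from_running_means(means):
    """Recover per-episode rewards from cumulative means.

    means[k-1] is assumed to be the average reward of the first k episodes.
    """
    rewards = []
    total = 0.0
    for k, mean in enumerate(means, start=1):
        new_total = mean * k  # sum of the first k episode rewards
        rewards.append(new_total - total)
        total = new_total
    return rewards


# First two ep_rew_mean values from the PPO log above.
print(episode_rewards_from_running_means([-4, -3.5]))  # [-4.0, -3.0]
```

Read this way, the second PPO episode scored about -3, slightly better than the first.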

- DDPG

In [6]:
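# Rebuild the environment so DDPG starts from a fresh instance
# rather than the one PPO trained on.
del env
env = gym.make('EHREnv-v0', model='lstm', sofa=5)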

In [7]:
ddpg = sb3.DDPG("MlpPolicy", 
                env, 
                learning_rate=0.01,
                buffer_size=100,
                learning_starts=100,
                batch_size=64,
                train_freq=1,
                verbose=1, 
                tensorboard_log="./tensorboard/EHREnv-v0/",
                device='cpu')

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
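With `buffer_size=100` and 100-step episodes, the replay buffer holds about one episode of experience, so each batch of 64 is sampled from the most recent transitions only. The sketch below illustrates that capacity limit with a `deque` as a simplified stand-in for a fixed-capacity replay buffer. It is not the buffer class stable_baselines3 uses, and it stores step indices where a real buffer would store transitions:

```python
from collections import deque

buffer_size = 100      # value passed to DDPG, SAC and TD3 in this notebook
episode_length = 100   # max_episode_steps from the gym.register call

# Stand-in for a fixed-capacity replay buffer: the oldest entry is dropped
# as soon as a new one arrives at full capacity.
replay = deque(maxlen=buffer_size)
for step in range(3 * episode_length):
    replay.append(step)  # a real buffer would store (obs, action, reward, ...)

oldest, newest = replay[0], replay[-1]
print(len(replay), oldest, newest)  # 100 200 299
```

After three episodes only the steps from the third one remain. If the off-policy algorithms should learn from older experience as well, a larger `buffer_size` may be worth trying.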


In [8]:
ddpg.learn(total_timesteps=10000, progress_bar=False, log_interval=1, tb_log_name="DDPG")

Logging to ./tensorboard/EHREnv-v0/DDPG_1
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 100      |
|    ep_rew_mean     | -7       |
| time/              |          |
|    episodes        | 1        |
|    fps             | 22       |
|    time_elapsed    | 4        |
|    total_timesteps | 100      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 100      |
|    ep_rew_mean     | -5       |
| time/              |          |
|    episodes        | 2        |
|    fps             | 19       |
|    time_elapsed    | 10       |
|    total_timesteps | 200      |
| train/             |          |
|    actor_loss      | 1.19e+03 |
|    critic_loss     | 2.03e+03 |
|    learning_rate   | 0.01     |
|    n_updates       | 99       |
---------------------------------
...

<stable_baselines3.ddpg.ddpg.DDPG at 0x20e845621f0>

- SAC

In [9]:
# Rebuild the environment so SAC starts from a fresh instance.
del env
env = gym.make('EHREnv-v0', model='lstm', sofa=5)

In [10]:
sac = sb3.SAC("MlpPolicy", 
                env, 
                learning_rate=0.005,
                learning_starts=100,
                buffer_size=100,
                batch_size=64,
                train_freq=1,
                verbose=1, 
                tensorboard_log="./tensorboard/EHREnv-v0/",
                device='cpu')

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [11]:
sac.learn(total_timesteps=10000, progress_bar=False, log_interval=1, tb_log_name="SAC")

Logging to ./tensorboard/EHREnv-v0/SAC_1
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 100      |
|    ep_rew_mean     | -5       |
| time/              |          |
|    episodes        | 1        |
|    fps             | 23       |
|    time_elapsed    | 4        |
|    total_timesteps | 100      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 100      |
|    ep_rew_mean     | -1.5     |
| time/              |          |
|    episodes        | 2        |
|    fps             | 18       |
|    time_elapsed    | 10       |
|    total_timesteps | 200      |
| train/             |          |
|    actor_loss      | 429      |
|    critic_loss     | 13.7     |
|    ent_coef        | 1.6      |
|    ent_coef_loss   | -17.7    |
|    learning_rate   | 0.005    |
|    n_updates       | 99       |
---------------------------------
...

<stable_baselines3.sac.sac.SAC at 0x20e9b90a3d0>

- TD3

In [12]:
# Rebuild the environment so TD3 starts from a fresh instance.
del env
env = gym.make('EHREnv-v0', model='lstm', sofa=5)

In [13]:
td3 = sb3.TD3("MlpPolicy", 
                env, 
                learning_rate=0.01,
                learning_starts=100,
                buffer_size=100,
                batch_size=64,
                train_freq=1,
                verbose=1, 
                tensorboard_log="./tensorboard/EHREnv-v0/",
                device='cpu')

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [14]:
td3.learn(total_timesteps=10000, progress_bar=False, log_interval=1, tb_log_name="TD3")

Logging to ./tensorboard/EHREnv-v0/TD3_1
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 100      |
|    ep_rew_mean     | 1        |
| time/              |          |
|    episodes        | 1        |
|    fps             | 21       |
|    time_elapsed    | 4        |
|    total_timesteps | 100      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 100      |
|    ep_rew_mean     | 0.5      |
| time/              |          |
|    episodes        | 2        |
|    fps             | 19       |
|    time_elapsed    | 10       |
|    total_timesteps | 200      |
| train/             |          |
|    actor_loss      | 416      |
|    critic_loss     | 5.89e+04 |
|    learning_rate   | 0.01     |
|    n_updates       | 99       |
---------------------------------
...

<stable_baselines3.td3.td3.TD3 at 0x20ed44b6430>

- A2C

In [15]:
# Rebuild the environment so A2C starts from a fresh instance.
del env
env = gym.make('EHREnv-v0', model='lstm', sofa=5)

In [16]:
a2c = sb3.A2C("MlpPolicy", 
                env, 
                learning_rate=0.005,
                n_steps=100,
                verbose=1, 
                tensorboard_log="./tensorboard/EHREnv-v0/",
                device='cpu')

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [17]:
a2c.learn(total_timesteps=10000, progress_bar=False, log_interval=1, tb_log_name="A2C")

Logging to ./tensorboard/EHREnv-v0/A2C_1
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 100      |
|    ep_rew_mean     | -6       |
| time/              |          |
|    fps             | 18       |
|    iterations      | 1        |
|    time_elapsed    | 5        |
|    total_timesteps | 100      |
---------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 100      |
|    ep_rew_mean        | -7       |
| time/                 |          |
|    fps                | 18       |
|    iterations         | 2        |
|    time_elapsed       | 10       |
|    total_timesteps    | 200      |
| train/                |          |
|    entropy_loss       | -5.68    |
|    explained_variance | -0.45    |
|    learning_rate      | 0.005    |
|    n_updates          | 1        |
|    policy_loss        | -11.1    |
|    std                | 0.979    |
|    value_loss         | 5

<stable_baselines3.a2c.a2c.A2C at 0x20edd329400>
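The five algorithm sections repeat the same three steps: rebuild the environment, construct the model, call `learn`. One possible way to fold them into a single loop is sketched below, with the hyperparameters copied from the cells in this notebook. `run_benchmark`, `CONFIGS`, `COMMON` and `OFF_POLICY` are hypothetical names introduced for illustration. The algorithm classes and the environment factory are passed in as arguments, so the sketch itself does not import stable_baselines3:

```python
# Settings shared by every algorithm in this notebook.
COMMON = {"verbose": 1, "tensorboard_log": "./tensorboard/EHREnv-v0/", "device": "cpu"}
# Settings shared by the off-policy algorithms (DDPG, SAC, TD3).
OFF_POLICY = {"learning_starts": 100, "buffer_size": 100, "batch_size": 64, "train_freq": 1}
CONFIGS = {
    "PPO": {"learning_rate": 0.005, "n_steps": 100, "batch_size": 50},
    "DDPG": {"learning_rate": 0.01, **OFF_POLICY},
    "SAC": {"learning_rate": 0.005, **OFF_POLICY},
    "TD3": {"learning_rate": 0.01, **OFF_POLICY},
    "A2C": {"learning_rate": 0.005, "n_steps": 100},
}


def run_benchmark(algo_classes, make_env, total_timesteps=10000):
    """Train each algorithm on a fresh environment and return the models.

    algo_classes: maps a name in CONFIGS to a class with the same
        constructor and learn() interface as the models used above.
    make_env: zero-argument callable that builds a new environment.
    """
    models = {}
    for name, algo_cls in algo_classes.items():
        env = make_env()  # fresh environment per algorithm
        model = algo_cls("MlpPolicy", env, **CONFIGS[name], **COMMON)
        model.learn(total_timesteps=total_timesteps, progress_bar=False,
                    log_interval=1, tb_log_name=name)
        models[name] = model
    return models
```

In this notebook the call might look like `run_benchmark({name: getattr(sb3, name) for name in CONFIGS}, lambda: gym.make('EHREnv-v0'))`. Keeping the hyperparameters in one dictionary also makes it easier to see that the learning rates differ between algorithms, which matters when comparing their curves.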