# Vaccine Allotment Code Demonstration

Reinforcement learning (RL) is a natural model for problems involving real-time sequential decision making. In these models, a principal interacts with a system having stochastic transitions and rewards and aims to control the system online (by exploring available actions using real-time feedback) or offline (by exploiting known properties of the system).

This project aims to provide a unified framework for scaling reinforcement learning algorithms to operations research domains.

In this notebook we walk through applying the algorithms to the `vaccine allotment` problem and generating plots. The problem has a population of size $P$ split into four risk classes, a discrete state space $\mathcal{S} = \{0, 1, 2, \ldots, P\}^{11}$, and a discrete action space consisting of "priority orders" that determine how we allot vaccines to the four risk classes. A valid priority order is one of two options:
1. an empty list -- interpreted as no priority order, meaning we vaccinate the population randomly
2. a permutation of the numbers $\{1,2,3,4\}$ -- interpreted as the order in which we vaccinate the risk classes
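
As a rough illustration of the action space described above, the following sketch enumerates every valid priority order. The function names `valid_priority_orders` and `is_valid_priority_order` are hypothetical helpers written for this example; they are not part of `or_suite`, and the package may encode actions differently.

```python
from itertools import permutations

# Risk class labels as given in the text; the helper names are illustrative.
RISK_CLASSES = (1, 2, 3, 4)


def valid_priority_orders(classes=RISK_CLASSES):
    """Return every valid action: the empty list plus all permutations."""
    orders = [[]]  # empty list = no priority order (vaccinate randomly)
    orders.extend(list(p) for p in permutations(classes))
    return orders


def is_valid_priority_order(order, classes=RISK_CLASSES):
    """Check that `order` is empty or uses each risk class exactly once."""
    return len(order) == 0 or sorted(order) == sorted(classes)


actions = valid_priority_orders()
print(len(actions))  # 1 empty list + 4! permutations = 25
```

With four risk classes this gives 25 actions in total, which is small enough to enumerate explicitly.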

### Step 1: Import Required Packages

The main package for ORSuite is contained in `or_suite`.  However, some additional packages may be required for specific environments / algorithms.  Here, we include `stable baselines`, a package containing implementations of state-of-the-art deep RL algorithms, and `matplotlib` for the plotting.

In [3]:
import or_suite
import gym
import matplotlib.pyplot as plt
from stable_baselines3.common.monitor import Monitor
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy
import numpy as np

### Step 2: Pick problem parameters for the environment

Here we use the ambulance metric environment as outlined in `or_suite/envs/ambulance/ambulance_metric.py`.  The package has default specifications for all of the environments in the file `or_suite/envs/env_configs.py`, and so we use one of the defaults for the ambulance problem in a metric space.

In addition, we need to specify the number of episodes for learning, and the number of iterations (in order to plot average results with confidence intervals).
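
To make the role of the iterations concrete, here is a minimal sketch of how per-episode averages and confidence intervals could be computed from several independent iterations. It assumes a normal approximation with `z = 1.96` and uses synthetic rewards; `mean_with_ci` is a hypothetical helper, and the plotting code in `or_suite` may compute its intervals differently.

```python
import numpy as np


def mean_with_ci(rewards, z=1.96):
    """Per-episode mean and confidence half-width across iterations.

    rewards: array-like of shape (num_iters, n_eps).
    """
    rewards = np.asarray(rewards, dtype=float)
    num_iters = rewards.shape[0]
    mean = rewards.mean(axis=0)
    # Sample standard deviation across iterations (ddof=1)
    std = rewards.std(axis=0, ddof=1)
    half_width = z * std / np.sqrt(num_iters)
    return mean, half_width


# Synthetic data standing in for 5 iterations of 200 episodes each
rng = np.random.default_rng(1)
fake_rewards = rng.normal(loc=-1.0, scale=0.5, size=(5, 200))
mean, half_width = mean_with_ci(fake_rewards)
print(mean.shape, half_width.shape)
```

With only a handful of iterations the normal approximation is rough, so more iterations give tighter and more reliable intervals.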

In [4]:
DEFAULT_CONFIG = or_suite.envs.env_configs.vaccine_4groups_default_config
epLen = DEFAULT_CONFIG['epLen']
nEps = 200
numIters = 5

AttributeError: module 'or_suite' has no attribute 'envs'

### Step 3: Pick simulation parameters

Next we specify the simulation parameters: the seed, the frequency at which metrics are recorded, the directory path for saving the data files, a `deBug` mode that prints the trajectory, and so on.

In [3]:
DEFAULT_SETTINGS = {'seed': 1, 
                    'recFreq': 1, 
                    'dirPath': '../data/ambulance/', 
                    'deBug': False, 
                    'nEps': nEps, 
                    'numIters': numIters, 
                    'saveTrajectory': True, 
                    'epLen': epLen}

ambulance_env = gym.make('Ambulance-v0', config=DEFAULT_CONFIG)
mon_env = Monitor(ambulance_env)

### Step 4: Pick list of algorithms

We have several heuristics implemented for each of the environments defined, in addition to a `random` policy and some `RL discretization based` algorithms.  Here we pick a couple of the heuristics, along with the PPO algorithm from `stable baselines` as a test.

In [4]:
agents = {'SB PPO': PPO(MlpPolicy, mon_env, gamma=1, verbose=0, n_steps=epLen),
          'Random': or_suite.agents.rl.random.randomAgent(),
          'Stable': or_suite.agents.ambulance.stable.stableAgent(DEFAULT_CONFIG['epLen']),
          'Median': or_suite.agents.ambulance.median.medianAgent(DEFAULT_CONFIG['epLen'])
          }

We recommend using a `batch_size` that is a multiple of `n_steps * n_envs`.
Info: (n_steps=5 and n_envs=1)
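
The warning above arises because the PPO rollout buffer holds `n_steps * n_envs` transitions and is split into minibatches of size `batch_size`. The sketch below assumes the intent is for `batch_size` to divide the buffer size evenly, so that no minibatch is truncated; `largest_factor_at_most` is a hypothetical helper, and the cap of 64 is the default `batch_size` of PPO in Stable Baselines3.

```python
def largest_factor_at_most(buffer_size, cap=64):
    """Largest divisor of `buffer_size` that does not exceed `cap`."""
    if buffer_size < 1 or cap < 1:
        raise ValueError("buffer_size and cap must be positive integers")
    # Search downward from the largest admissible candidate; 1 always divides.
    for candidate in range(min(cap, buffer_size), 0, -1):
        if buffer_size % candidate == 0:
            return candidate


n_steps, n_envs = 5, 1  # values reported in the warning above
batch_size = largest_factor_at_most(n_steps * n_envs)
print(batch_size)  # 5
```

The resulting value could then be passed to the agent through the `batch_size` argument of `PPO`.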


### Step 5: Run simulations

In [5]:
for agent in agents:
    print(agent)
    # Save each agent's results in its own directory
    DEFAULT_SETTINGS['dirPath'] = '../data/ambulance_metric_test_'+str(agent)+'/'
    if agent == 'SB PPO':
        # Stable Baselines agents run on the Monitor-wrapped environment
        or_suite.utils.run_single_sb_algo(mon_env, agents[agent], DEFAULT_SETTINGS)
    else:
        or_suite.utils.run_single_algo(ambulance_env, agents[agent], DEFAULT_SETTINGS)

SB PPO
**************************************************
Running experiment
**************************************************
**************************************************
Experiment complete
**************************************************
**************************************************
Saving data
**************************************************
     episode  iteration  epReward      time    memory
0        0.0        0.0 -3.992157 -1.935468  528939.0
1        1.0        0.0 -3.293304 -2.687384  528939.0
2        2.0        0.0 -2.481742 -3.522500  528939.0
3        3.0        0.0 -2.659202 -3.456917  528939.0
4        4.0        0.0 -3.075968 -3.365957  528939.0
..       ...        ...       ...       ...       ...
995    195.0        4.0 -1.339041 -3.425642   53762.0
996    196.0        4.0 -1.726129 -3.505733   53762.0
997    197.0        4.0 -0.891938 -3.456910   53762.0
998    198.0        4.0 -0.825988 -3.489179   53762.0
999    199.0        4.0 -1.918325 -3.48915

  self.data[index, 4] = np.log(((end_time) - (start_time)))


**************************************************
Experiment complete
**************************************************
**************************************************
Saving data
**************************************************
[[ 0.00000000e+00  0.00000000e+00 -8.51879284e-01  3.58400000e+03
  -7.60097494e+00]
 [ 1.00000000e+00  0.00000000e+00 -1.27596690e+00  1.55099000e+05
  -7.59954535e+00]
 [ 2.00000000e+00  0.00000000e+00 -8.48246947e-01  2.57500000e+03
             -inf]
 ...
 [ 1.97000000e+02  4.00000000e+00 -7.72291988e-01  3.76400000e+03
             -inf]
 [ 1.98000000e+02  4.00000000e+00 -9.89294961e-01  1.27240000e+04
  -7.60097494e+00]
 [ 1.99000000e+02  4.00000000e+00 -7.74680257e-01  3.76400000e+03
             -inf]]
Writing to file data.csv
**************************************************
Data save complete
**************************************************
Median
**************************************************
Running experiment
*************************

### Step 6: Generate figures

In [6]:
path_list_line = []
path_list_radar = []
algo_list_line = []
algo_list_radar = []

for agent in agents:
    print(str(agent))
    path_list_line.append('../data/ambulance_metric_test_'+str(agent)+'/data.csv')
    algo_list_line.append(str(agent))
    # The radar plots exclude the Stable Baselines agent
    if agent != 'SB PPO':
        path_list_radar.append('../data/ambulance_metric_test_'+str(agent)+'/')
        algo_list_radar.append(str(agent))

fig_path = '../figures/'
fig_name = 'test_ambulance_metric.pdf'

or_suite.plots.plot_line_plots(path_list_line, algo_list_line, fig_path, fig_name, int(nEps / 40) + 1)

additional_metric = {'MRT': lambda traj : or_suite.utils.mean_response_time(traj, lambda x, y : np.abs(x-y))}


or_suite.plots.plot_radar_plots(path_list_radar, algo_list_radar, fig_path, fig_name, additional_metric)

SB PPO
Random
Stable
Median
  Algorithm    Reward      Time   Space       MRT
0    Random -1.291178 -7.380172  4578.8 -0.320382
1    Stable -0.814626      -inf  3764.0 -0.183517
2    Median -0.825112 -6.826338  4517.6 -0.132534
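
The `-inf` entry in the Time column is consistent with taking the logarithm of a zero elapsed time, as in the `np.log(end_time - start_time)` line that appears in the Step 5 output. A minimal sketch of one way to avoid this is to clip the elapsed time from below before taking the log. The helper `log_elapsed` and the `floor` value are assumptions made for this example and do not reflect how `or_suite` records time.

```python
import time

import numpy as np


def log_elapsed(start_time, end_time, floor=1e-9):
    """Log of elapsed seconds, with the elapsed time clipped below at `floor`."""
    elapsed = max(end_time - start_time, floor)
    return float(np.log(elapsed))


# perf_counter has finer resolution than time.time, so very short
# computations are less likely to report a zero duration.
start = time.perf_counter()
_ = sum(range(1000))
end = time.perf_counter()

print(np.isfinite(log_elapsed(start, end)))
print(np.isfinite(log_elapsed(1.0, 1.0)))  # zero elapsed time stays finite
```

Clipping changes the reported value for extremely fast agents, so the floor should be chosen well below any duration of practical interest.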
