In [1]:
import or_suite
import numpy as np
import copy
import os
import gym

Key references:

[Sutton and Barto](http://incompleteideas.net/book/the-book-2nd.html)

[Windy Grid World](https://github.com/ibrahim-elshar/gym-windy-gridworlds)

[Q Learning](https://arxiv.org/abs/1807.03765)

[UCBVI](https://proceedings.mlr.press/v70/azar17a.html)


##  Introduction: Online Tabular Algorithms

In this second code demo we will implement ```Q-Learning``` and ```UCBVI``` for tabular OpenAI Gym environments and incorporate them into the ```ORSuite``` package.  We will then use the same package to run simulations on the ```Windy Grid World``` and ```Ambulance Routing``` environments, comparing the performance of the two algorithms against a randomized baseline.

## Windy Grid World

Windy Grid World is a modification of the standard Grid World task.  There is a fixed grid of possible states, with dedicated start and goal states.  The agent begins in the start state, has four actions (up, right, down, left), and must learn where the goal state is by interacting with the environment.  There is one twist: a crosswind blows upward through the middle of the grid, so in the middle region the resulting next state is shifted upward by ```wind```, whose strength varies across regions either stochastically or deterministically.  Here we will be using:
- ```StochWindyGridWorld-v0```

as our environment to test the learning algorithms.

Note that the code for this environment is taken and modified from [here](https://github.com/ibrahim-elshar/gym-windy-gridworlds).  For convenience, it is not "registered" as part of the package, which keeps the environment code separate.
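To make the wind dynamics concrete, here is a minimal sketch of a single deterministic transition. It is not the code from ```stoch_windy_gridworld_env```: the function name `windy_step`, the `MOVES` table, and the per-column `wind` list are all hypothetical, and the action encoding is simply chosen to agree with the outputs shown later in this notebook (action 2 lowers the row index, action 1 raises the column index).

```python
# Hypothetical action encoding (an assumption, not taken from the environment code):
# 0 = up, 1 = right, 2 = down, 3 = left, with "up" meaning a larger row index.
MOVES = {0: (1, 0), 1: (0, 1), 2: (-1, 0), 3: (0, -1)}


def windy_step(state, action, wind, height, width):
    """Return the next (row, col) cell after one move plus the upward wind shift."""
    row, col = state
    d_row, d_col = MOVES[action]
    # The wind of the column the agent is leaving pushes it upward,
    # and the result is clipped so the agent stays on the grid.
    new_row = min(max(row + d_row + wind[col], 0), height - 1)
    new_col = min(max(col + d_col, 0), width - 1)
    return (new_row, new_col)


# No wind in column 0: moving down from (2, 0) lands on (1, 0).
print(windy_step((2, 0), 2, [0, 1, 0], 3, 3))
# Wind of strength 1 in column 1: moving right from (0, 1) is also pushed up a row.
print(windy_step((0, 1), 1, [0, 1, 0], 3, 3))
```

In the stochastic version the wind strength would be sampled at each step rather than read from a fixed list.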

In [2]:
import stoch_windy_gridworld_env

In [3]:
env = stoch_windy_gridworld_env.StochWindyGridWorldEnv(GRID_HEIGHT=3,  \
                                            GRID_WIDTH=3, \
                                            PROB = [.05, .15, .02, .1], \
                                            START_CELL = (2,0), \
                                            GOAL_CELL = (0,2), \
                                            REWARD=1,
                                            EPLEN = 10)

In [4]:
env.observation_space.contains((0,0))

True

In [5]:
env.reset()

(2, 0)

The starting state is $(2,0)$, which this environment treats as the "top left" square, so taking the action $a = 2$ (down) should move the agent down one row at a time, accruing a reward of zero, until it reaches the bottom of the grid.

In [6]:
env.step(2)

((1, 0), 0, False, {})

In [7]:
env.step(2)

((0, 0), 0, False, {})

In [8]:
env.step(1)

((0, 1), 0, False, {})

In [9]:
env.step(1)

((0, 2), 1, False, {})

## Ambulance Routing

See ```or_suite.envs.ambulance.ambulance_routing_readme.ipynb``` for a description.

One potential application of reinforcement learning involves positioning a server or servers (in this case an ambulance) in an optimal way geographically to respond to incoming calls while minimizing the distance traveled by the servers. This is closely related to the [k-server problem](https://en.wikipedia.org/wiki/K-server_problem), where there are $k$ servers stationed in a space that must respond to requests arriving in that space in such a way as to minimize the total distance traveled. 

The ambulance routing problem addresses this by modeling an environment where ambulances are stationed at locations and calls come in that one of the ambulances must be sent to respond to. The goal of the agent is to minimize both the distance traveled by the ambulances between calls and the distance traveled to respond to a call, by optimally choosing the locations at which to station the ambulances. The ambulance environment has been implemented in two different ways: as a 1-dimensional number line $[0,1]$ along which ambulances are stationed and calls arrive, and as a graph whose nodes are locations where ambulances can be stationed and calls can arrive, with edges between the nodes that ambulances travel along.

`ambulance_graph.py` is structured as a graph of nodes $V$ with edges between the nodes $E$. Each node represents a location where an ambulance could be stationed or a call could come in. The edges between nodes are undirected and have a weight representing the distance between those two nodes.

The nearest ambulance to a call is determined by computing the shortest path from each ambulance to the call, and choosing the ambulance with the minimum length path. Calls arrive according to a pre-specified probability distribution. By default, call arrivals are uniformly distributed over all the nodes; however, the user can also specify a different arrival probability for each node.

After each call comes in, the agent will choose where to move each ambulance in the graph. Every ambulance except the ambulance that moved to respond to the call will be at the same location where the agent moved it to on the previous iteration, and the ambulance that moved to respond to the call will be at the node where the call came in. 
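The dispatch rule and the state update described above can be sketched in a few lines. This is only an illustration under stated assumptions: the actual environment uses networkx, whereas this sketch uses a small hand-written Dijkstra over a plain adjacency dictionary so that it has no dependencies, and the names `shortest_path_lengths`, `nearest_ambulance`, and `next_locations` are hypothetical.

```python
import heapq


def shortest_path_lengths(graph, source):
    """Dijkstra over an undirected graph given as {node: {neighbor: weight}}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph[node].items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def nearest_ambulance(graph, ambulance_nodes, call_node):
    """Index of the ambulance with the shortest path to the call."""
    # Edges are undirected, so distances from the call equal distances to it.
    dist = shortest_path_lengths(graph, call_node)
    return min(range(len(ambulance_nodes)),
               key=lambda i: dist.get(ambulance_nodes[i], float("inf")))


def next_locations(chosen_nodes, call_node, responder):
    """Ambulances stay where the agent placed them, except the responder."""
    locations = list(chosen_nodes)
    locations[responder] = call_node
    return locations


# Toy graph (made up for illustration): 0 -1- 1 -2- 2 -1- 3, plus a long edge 0 -5- 2.
toy_graph = {
    0: {1: 1.0, 2: 5.0},
    1: {0: 1.0, 2: 2.0},
    2: {0: 5.0, 1: 2.0, 3: 1.0},
    3: {2: 1.0},
}
placement = [0, 3]  # the agent's chosen nodes for two ambulances
responder = nearest_ambulance(toy_graph, placement, call_node=2)
print(responder, next_locations(placement, 2, responder))
```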


![final_graph_ithaca.png](attachment:final_graph_ithaca.png)

The graph environment is currently implemented using the [networkx package](https://networkx.org/documentation/stable/index.html).


## What makes an algorithm?

As discussed in the background on MDPs earlier today, an algorithm is characterized by two main components:
- Policy (i.e. how it decides to take actions from a given state)
- Update Step (i.e. how it updates its policy based on observed values)

In ```orsuite.or_suite.agents.agent.py``` we have an API framework of how an algorithm should work, characterized by these methods:

In [10]:
'''
All agents should inherit from the Agent class.


class Agent(object):

    def __init__(self):
        pass

    def reset(self):
        pass

    def update_config(self, env, config):
         Update agent information based on the config__file
        self.config = config
        return
        
    def update_parameters(self, param):
        return

    def update_obs(self, obs, action, reward, newObs, timestep, info):
        Add observation to records

    def update_policy(self, h):
        Update internal policy based upon records

    def pick_action(self, obs, h):
        Select an action based upon the observation


'''


In this framework you will notice a couple key components.

- reset (resets the agent to "forget" what it has learned between experiments)
- update config (potentially updates internals of the algorithm based on the config of the environment)
- update parameters (potentially updates parameters, e.g. bonus confidence term) for hyperparameter tuning
- update obs (updates internal estimates based on one step reward)
- update policy (updates the internal policy based on records)
- pick action (actually picks the action based on the current state)
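To see how these methods fit together, the sketch below runs one episode against a toy environment. The order of the calls is an assumption for illustration and is not copied from ORSuite's experiment runner; `ToyEnv`, `ToyAgent`, and `run_episode` are hypothetical names, and the environment uses the classic Gym `reset`/`step` signatures that appear elsewhere in this notebook.

```python
class ToyEnv:
    """Hypothetical single-state environment: action 1 pays reward 1, action 0 pays 0."""

    def __init__(self, ep_len):
        self.ep_len = ep_len
        self.t = 0

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        return 0, reward, self.t >= self.ep_len, {}


class ToyAgent:
    """Hypothetical agent following the method names of the Agent API above."""

    def reset(self):
        self.records = []

    def update_obs(self, obs, action, reward, newObs, timestep, info):
        self.records.append((obs, action, reward, newObs, timestep))

    def update_policy(self, h):
        pass  # a learning agent would refresh its policy from self.records here

    def pick_action(self, obs, h):
        return 1


def run_episode(env, agent, ep_len, episode):
    """Assumed interaction loop: act, observe, record, then update the policy."""
    obs = env.reset()
    total_reward = 0.0
    for h in range(ep_len):
        action = agent.pick_action(obs, h)
        new_obs, reward, done, info = env.step(action)
        agent.update_obs(obs, action, reward, new_obs, h, info)
        total_reward += reward
        obs = new_obs
        if done:
            break
    agent.update_policy(episode)
    return total_reward


toy_agent = ToyAgent()
toy_agent.reset()  # forget anything learned in earlier experiments
print(run_episode(ToyEnv(ep_len=3), toy_agent, ep_len=3, episode=0))
```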

## Randomized Algorithm

For example, if we wanted to implement a random algorithm (included in ```or_suite.agents.rl.random.py```), we would simply do the following:

In [11]:
class randomAgent():
    """Randomized RL Algorithm

    Implements the randomized RL algorithm, selecting an action uniformly at random from the action space.  In particular,
    the algorithm stores a reference to the environment and samples uniformly at random from its action space.

    """

    def __init__(self):
        pass


    def reset(self):
        pass

    def update_config(self, env, config = None):
        """Updates configuration file for the agent

        Updates the stored environment to sample uniformly from.

        Args:
            env: an openAI gym environment
            config: an (optional) dictionary containing parameters for the environment
        """

        self.environment = env

    def update_obs(self, obs, action, reward, newObs, timestep, info):
        pass

    def update_policy(self, h):
        pass

    def pick_action(self, obs, h):
        """Selects an action for the algorithm.

        Args:
            obs: a state for the environment
            h: timestep

        Returns:
            An action sampled uniformly at random from the environment's action space.
        """
        return self.environment.action_space.sample()


Note that the only real component is ```pick_action``` which simply picks an action from the environment's defined action space.

## Your Turn

Your goal for this code demo is to implement [Q Learning](https://arxiv.org/abs/1807.03765) and [UCBVI](https://proceedings.mlr.press/v70/azar17a.html) and run an experiment.  Starter code is located in ```or_suite.agents.rl.discrete_mb``` and ```or_suite.agents.rl.discrete_mf```.
(Note that solutions are also located in a sub-folder, just for checking your work.)  The high-level architecture of the code, and an outline of what to implement, is included in the Python files.
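As a starting point for the model-free half, here is a minimal sketch of an optimistic tabular Q-learning update in the style of the Q Learning reference above. It is not the ORSuite solution: the class name `TabularQSketch` is hypothetical, states and actions are assumed to be integer indices, and the exploration bonus is simplified to `scaling / sqrt(t)` rather than the exact bonus from the paper. The learning rate $(H+1)/(H+t)$ follows the paper.

```python
import numpy as np


class TabularQSketch:
    """Hypothetical optimistic Q-learning agent for a finite-horizon tabular MDP."""

    def __init__(self, n_states, n_actions, ep_len, scaling):
        self.ep_len = ep_len
        self.scaling = scaling  # multiplier on the exploration bonus
        # Optimistic initialization: every estimate starts at the horizon H.
        self.q = np.full((ep_len, n_states, n_actions), float(ep_len))
        self.counts = np.zeros((ep_len, n_states, n_actions), dtype=int)

    def value(self, h, state):
        """V_h(s) = min(H, max_a Q_h(s, a)), and zero past the final step."""
        if h >= self.ep_len:
            return 0.0
        return min(float(self.ep_len), float(self.q[h, state].max()))

    def update_obs(self, state, action, reward, new_state, h):
        self.counts[h, state, action] += 1
        t = self.counts[h, state, action]
        lr = (self.ep_len + 1) / (self.ep_len + t)  # step size from the paper
        bonus = self.scaling / np.sqrt(t)           # simplified bonus (assumption)
        target = reward + self.value(h + 1, new_state) + bonus
        self.q[h, state, action] = (1 - lr) * self.q[h, state, action] + lr * target

    def pick_action(self, state, h):
        """Act greedily with respect to the optimistic estimates."""
        return int(np.argmax(self.q[h, state]))


sketch = TabularQSketch(n_states=2, n_actions=2, ep_len=3, scaling=0.0)
sketch.update_obs(state=0, action=1, reward=1.0, new_state=1, h=2)
print(sketch.q[2, 0], sketch.pick_action(0, 2))
```

With `scaling=0.0` the first visit overwrites the optimistic value with the observed reward, so the untried action (still at its initial value of 3) remains the greedy choice. This is how optimistic initialization drives exploration.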

## Running an experiment

Now that we have the algorithms up and running, our next step is to run an experiment to compare the algorithms' performance over time!  We will use the ```ORSuite``` package both to run the experiments (see [here](https://orsuite.readthedocs.io/en/latest/experiment_file.html) for documentation on running experiments) and to generate plots.

### Package Installation

First we import the necessary packages.

In [12]:
import or_suite
import numpy as np
import itertools as it

import copy

import os
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.monitor import Monitor
from stable_baselines3 import PPO
from stable_baselines3 import DQN
from stable_baselines3.ppo import MlpPolicy
from stable_baselines3.dqn import MlpPolicy as MlpPolicy_dqn
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
import pandas as pd


import gym

### Configure Environment and Pick Problem Parameters

In [13]:
epLen = 10 # Length or horizon of model
nEps = 5000 # Number of "episodes" K to run against
numIters = 20 # Number of iterations to average learning performance over

### Simulation Parameters

Next we need to specify parameters for the simulation. This includes setting a seed, the frequency to record the metrics, directory path for saving the data files, a deBug mode which prints the trajectory, etc.

In [14]:
DEFAULT_SETTINGS = {'seed': 1, # randomized seed for experiments
                    'recFreq': 1, # recording frequency of data
                    'dirPath': '../data/rideshare/',  # path to save info
                    'deBug': False, # debug mode, i.e. print out stuff over time
                    'nEps': nEps,  # number of episodes
                    'numIters': numIters,  # number of iterations
                    'saveTrajectory': True, # indicator to save trajectory level information
                    'epLen' : epLen, # episode length
                    'render': False, # render indicator
                    'pickle': False # indicator to pickle final model files
                    }


### List of Algorithms

Next we will pick a list of algorithms to test.

In [15]:
env = stoch_windy_gridworld_env.StochWindyGridWorldEnv(GRID_HEIGHT=3,  \
                                            GRID_WIDTH=3, \
                                            PROB = [.05, .15, .02, .1], \
                                            START_CELL = (2,0), \
                                            GOAL_CELL = (0,2), \
                                            REWARD=1,
                                            EPLEN = epLen)

scaling_list = [.01, .1, 1, 10, 100] # list of scaling parameters to evaluate

mon_env = Monitor(env) # the stable baselines deep RL algorithms require a "wrapper" around the environment
# to run experiments, which we set up here


action_space = env.action_space # records action space and state space to pass to our tabular algorithms
state_space = env.observation_space

agents = { # creates the list of algorithms to test against
    'Random': or_suite.agents.rl.random.randomAgent(),
    'UCBVI': or_suite.agents.rl.discrete_mb.DiscreteMB(action_space, state_space, epLen, 10, 0, False),
    'QLearning': or_suite.agents.rl.discrete_ql.DiscreteQl(action_space, state_space, epLen, 10)
}

### Running an Experiment

In [16]:
# List of paths to the data files for each of the algorithms we run
path_list_line = []
algo_list_line = []
path_list_radar = []
algo_list_radar= []


for agent in agents: # loops over each algorithm
    print(agent)
    
    
    DEFAULT_SETTINGS['dirPath'] = '../data/grid_world_'+str(agent)+'/' # updates directory to include agent name
    if agent == 'SB PPO': # separate experiment file to run the deep RL algorithms as it is just a wrapper
        # of the stable baselines package
        or_suite.utils.run_single_sb_algo(mon_env, agents[agent], DEFAULT_SETTINGS)
#     elif agent == 'UCBVI' or agent == 'QLearning':
#         # separate experiment file which runs hyperparameter tuning, essentially looping over list
#         # and picking best performance
#         # note - I tuned these separately before hand so you are just going to run on the tuned value
#         or_suite.utils.run_single_algo_tune(env,agents[agent], scaling_list, DEFAULT_SETTINGS)
    else:
        # runs a single algorithm, no hyperparameter tuning or anything fancy
        or_suite.utils.run_single_algo(env, agents[agent], DEFAULT_SETTINGS)
    # appends the directory to the lists
    path_list_line.append('../data/grid_world_'+str(agent))
    algo_list_line.append(str(agent))


fig_path = '../figures/' # path for the figure
fig_name = 'grid_world'+'_line_plot'+'.pdf' # name of the figure
or_suite.plots.plot_line_plots(path_list_line, algo_list_line, fig_path, fig_name, int(nEps / 40)+1)
    # creates the line plot figure, including list of algo, the path, name of figure,
    # and a plot frequency (here nEps / 40)

Random
Writing to file data.csv
UCBVI
Writing to file data.csv
QLearning
Writing to file data.csv


In [17]:
from IPython.display import IFrame
IFrame("../figures/grid_world_line_plot.pdf", width=600, height=500)

### What gives?

Well, we have spotted one of the largest pitfalls of online algorithms: exploration in goal-based settings.  Goal-based MDPs are typically thought of as worst-case problem instances, because the rewards are sparse and take a while to propagate back through the state space.  This is especially visible in the performance of $Q$-learning-based algorithms, which propagate information across steps much more slowly than algorithms that perform full value-iteration updates.  These issues have motivated a long line of research on algorithm design for these problem set-ups, known as "goal-based" RL (see [here](https://arxiv.org/abs/2002.12361) for a summary).
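The difference in propagation speed can be seen on a toy example. The sketch below is an idealized illustration, not a reproduction of the experiment above: it assumes a deterministic chain where the only reward is earned on the final step, zero-initialized estimates, and a learning rate of one. The function names are hypothetical.

```python
def episodes_until_start_learns(horizon):
    """One-step backups applied in trajectory order, as in online Q-learning.

    Returns the number of episodes before the estimate at step 0 becomes nonzero.
    """
    q = [0.0] * (horizon + 1)  # q[horizon] is the terminal value, fixed at zero
    episodes = 0
    while q[0] == 0.0:
        episodes += 1
        for h in range(horizon):  # steps are visited in forward order
            reward = 1.0 if h == horizon - 1 else 0.0
            # The update at step h sees the estimate at step h + 1
            # as it stood before this episode reached that step.
            q[h] = reward + q[h + 1]
    return episodes


def backward_induction(horizon):
    """A full value-iteration sweep from the last step back to the first."""
    v = [0.0] * (horizon + 1)
    for h in reversed(range(horizon)):
        reward = 1.0 if h == horizon - 1 else 0.0
        v[h] = reward + v[h + 1]
    return v


print(episodes_until_start_learns(10))  # episodes needed by one-step backups
print(backward_induction(10)[0])        # value at step 0 after a single sweep
```

In this idealized chain the one-step backups move the reward signal back by one step per episode, so the starting step needs as many episodes as the horizon, while a single backward sweep informs every step at once.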

### Ambulance Environment

Next we will run the experiments on the ambulance graph environment.  In this domain the rewards are not sparse and so we expect to witness learning in fewer iterations.

In [18]:
# Load the default configuration parameters for the ambulance graph environment
CONFIG =  or_suite.envs.env_configs.ambulance_graph_default_config


# Specify the episode length, number of episodes, and number of iterations
epLen = CONFIG['epLen']
nEps = 5000
numIters = 20


scaling_list = [0.01, .1, 1, 10]


# Configuration parameters for running the experiment
DEFAULT_SETTINGS = {'seed': 1, 
                    'recFreq': 1, 
                    'dirPath': '../data/ambulance/', 
                    'deBug': False, 
                    'nEps': nEps, 
                    'numIters': numIters, 
                    'saveTrajectory': True, # save trajectory for calculating additional metrics
                    'epLen' : epLen, # episode length, kept in sync with CONFIG['epLen']
                    'render': False,
                    'pickle': False # indicator for pickling final information
                    }


alpha = CONFIG['alpha']
num_ambulance = CONFIG['num_ambulance']

ambulance_env = gym.make('Ambulance-v1', config=CONFIG)
mon_env = Monitor(ambulance_env)

state_space = ambulance_env.observation_space
action_space = ambulance_env.action_space


agents = {
'Random': or_suite.agents.rl.random.randomAgent(),
'Stable': or_suite.agents.ambulance.stable.stableAgent(CONFIG['epLen']),
'UCBVI': or_suite.agents.rl.discrete_mb.DiscreteMB(action_space, state_space, epLen, 1, 0, False),
'QLearning' : or_suite.agents.rl.discrete_ql.DiscreteQl(action_space, state_space, epLen, .01)}

In [19]:
path_list_line = []
algo_list_line = []

for agent in agents:
    print(agent)
    DEFAULT_SETTINGS['dirPath'] = '../data/ambulance_graph_'+str(agent)+'_'+str(num_ambulance)+'_'+str(alpha)+'/'
    if agent == 'SB PPO':
        or_suite.utils.run_single_sb_algo(mon_env, agents[agent], DEFAULT_SETTINGS)
#     elif agent == 'UCBVI' or agent == 'QLearning':
#         or_suite.utils.run_single_algo_tune(ambulance_env,agents[agent], scaling_list, DEFAULT_SETTINGS)
    else:
        or_suite.utils.run_single_algo(ambulance_env, agents[agent], DEFAULT_SETTINGS)

    path_list_line.append('../data/ambulance_graph_'+str(agent)+'_'+str(num_ambulance)+'_'+str(alpha))
    algo_list_line.append(str(agent))



    
fig_path = '../figures/'
fig_name = 'ambulance_graph'+'_'+str(num_ambulance)+'_'+str(alpha)+'_line_plot'+'.pdf'
or_suite.plots.plot_line_plots(path_list_line, algo_list_line, fig_path, fig_name, int(nEps / 40)+1)

additional_metric = {}
fig_name = 'ambulance_graph'+'_'+str(num_ambulance)+'_'+str(alpha)+'_radar_plot'+'.pdf'
or_suite.plots.plot_radar_plots(path_list_line, algo_list_line,
fig_path, fig_name,
additional_metric
)


Random
Writing to file data.csv
Stable
Writing to file data.csv
UCBVI
Writing to file data.csv
QLearning
Writing to file data.csv
   Algorithm  Reward      Time   Space
0     Random -7.7125  7.253235 -4275.0
1     Stable -5.0625  7.384510 -3659.0
2      UCBVI -4.8750  4.520873 -4953.4
3  QLearning -4.7875  6.570152 -4947.0


In [20]:
from IPython.display import IFrame
IFrame('../figures/ambulance_graph'+'_'+str(num_ambulance)+'_'+str(alpha)+'_line_plot'+'.pdf', width=600, height=500)

In [21]:
from IPython.display import IFrame
IFrame('../figures/ambulance_graph'+'_'+str(num_ambulance)+'_'+str(alpha)+'_radar_plot'+'.pdf', width=600, height=500)