# Revenue Management Simulations

In this notebook we will run some simulations to compare the algorithms you developed with the Bayes Selector to standard Q learning.

Note: You may get some errors + warnings around calculating the time and space that the algorithms are using.  This might be due to your operating system, so please ignore those numbers.

### Package Installation

In [1]:
import or_suite
import numpy as np

import copy

import os
from stable_baselines3.common.monitor import Monitor
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
import pandas as pd


import gym

### Test 1

We start off by doing a small scale simulation on a problem with three resources, two customer types, and a horizon of four.  This allows us to test out Q-Learning based approaches without dealing with issues around scale.

In [2]:
CONFIG =  or_suite.envs.env_configs.airline_default_config

epLen = CONFIG['epLen']

In [3]:
print(CONFIG)

{'epLen': 4, 'f': array([1., 2.]), 'A': array([[2., 3.],
       [3., 0.],
       [2., 1.]]), 'starting_state': array([6.66666667, 4.        , 4.        ]), 'P': array([[0.33333333, 0.33333333],
       [0.33333333, 0.33333333],
       [0.33333333, 0.33333333],
       [0.33333333, 0.33333333],
       [0.33333333, 0.33333333]])}


Here we see the different rewards, resource consumption, starting state, and arrival distribution being uniform.

Next up - we generate a list of historical traces according to the distribution.

In [4]:
num_traces = 50
dataset = []
for _ in range(num_traces): # samples traces
    for timestep in range(CONFIG['epLen']): # each of length of the time horizon
        # samples a customer type according to that step's distribution
        pDist = np.append(np.copy(CONFIG['P'][timestep, :]), 1 - np.sum(CONFIG['P'][timestep, :]))

        dataset.append((timestep, np.random.choice(a = CONFIG['A'].shape[1]+1, p = pDist)))
print(dataset)

[(0, 2), (1, 0), (2, 1), (3, 1), (0, 1), (1, 2), (2, 2), (3, 1), (0, 2), (1, 0), (2, 2), (3, 2), (0, 0), (1, 0), (2, 0), (3, 1), (0, 2), (1, 1), (2, 1), (3, 2), (0, 2), (1, 1), (2, 1), (3, 1), (0, 1), (1, 2), (2, 2), (3, 1), (0, 1), (1, 2), (2, 1), (3, 0), (0, 1), (1, 1), (2, 0), (3, 1), (0, 2), (1, 2), (2, 1), (3, 2), (0, 1), (1, 2), (2, 2), (3, 0), (0, 0), (1, 0), (2, 2), (3, 0), (0, 0), (1, 2), (2, 2), (3, 1), (0, 2), (1, 1), (2, 1), (3, 2), (0, 0), (1, 2), (2, 1), (3, 2), (0, 2), (1, 1), (2, 1), (3, 2), (0, 1), (1, 1), (2, 2), (3, 1), (0, 2), (1, 2), (2, 0), (3, 0), (0, 1), (1, 1), (2, 2), (3, 0), (0, 0), (1, 1), (2, 1), (3, 2), (0, 0), (1, 0), (2, 0), (3, 2), (0, 0), (1, 1), (2, 2), (3, 2), (0, 0), (1, 0), (2, 0), (3, 0), (0, 2), (1, 0), (2, 0), (3, 2), (0, 0), (1, 2), (2, 2), (3, 2), (0, 2), (1, 1), (2, 0), (3, 1), (0, 2), (1, 0), (2, 0), (3, 0), (0, 1), (1, 0), (2, 2), (3, 2), (0, 1), (1, 2), (2, 2), (3, 2), (0, 2), (1, 0), (2, 2), (3, 1), (0, 2), (1, 0), (2, 0), (3, 2), (0, 1),

In [5]:
nEps = 1
numIters = 500

DEFAULT_SETTINGS = {'seed': 1, 
                    'recFreq': 1, 
                    'dirPath': '../data/', 
                    'deBug': False, 
                    'nEps': nEps, 
                    'numIters': numIters, 
                    'saveTrajectory': True, 
                    'epLen' : epLen,
                    'render': False,
                    'pickle': False
                    }


revenue_env = gym.make('Airline-v0', config=CONFIG)
mon_env = Monitor(revenue_env)

### Specifying Agent

We specify 3 agents to compare effectiveness of each:

* `Random`
* `BayesSelector`
* `BayesSelectorTraces`

In [6]:
agents = {
'Random': or_suite.agents.rl.random.randomAgent(),
'BayesSelector': or_suite.agents.airline_revenue_management.bayes_selector.bayes_selectorAgent(epLen, round_flag=True),
'BayesSelectorTraces': or_suite.agents.airline_revenue_management.bayes_selector_traces.bayes_selector_tracesAgent(epLen, round_flag=True, dataset = dataset),
}

Next we run up the experiments.

In [7]:
path_list_line = []
algo_list_line = []
path_list_radar = []
algo_list_radar= []
for agent in agents:
    print(agent)
    DEFAULT_SETTINGS['dirPath'] = '../data/airline_'+str(agent)
    if agent == 'SB PPO':
        or_suite.utils.run_single_sb_algo(mon_env, agents[agent], DEFAULT_SETTINGS)
    else:
        or_suite.utils.run_single_algo(revenue_env, agents[agent], DEFAULT_SETTINGS)

    path_list_line.append('../data/airline_'+str(agent))
    algo_list_line.append(str(agent))
    path_list_radar.append('../data/airline_'+str(agent))
    algo_list_radar.append(str(agent))

Random


  self.data[index, 4] = np.log(((end_time) - (start_time)))


Writing to file data.csv
BayesSelector
Writing to file data.csv
BayesSelectorTraces
Writing to file data.csv


In [8]:
fig_path = '../figures/'
fig_name = 'revenue'+'_line_plot'+'.pdf'
or_suite.plots.plot_radar_plots(path_list_line, algo_list_line, fig_path, fig_name, {})

             Algorithm  Reward      Time      Space
0               Random   1.676       inf  -5267.308
1        BayesSelector   2.760  2.639713 -29416.840
2  BayesSelectorTraces   2.706  2.583099 -28898.476


Wait - so what this is showing is that with 50 traces we see that the Bayes Selector with traces is almost able to compete just as much as the original Bayes Selector?  Hmm.  Let us try running it again with even fewer traces.

### Test 2

In [9]:
num_traces = 5
dataset = []
for _ in range(num_traces): # samples traces
    for timestep in range(CONFIG['epLen']): # each of length of the time horizon
        # samples a customer type according to that step's distribution
        pDist = np.append(np.copy(CONFIG['P'][timestep, :]), 1 - np.sum(CONFIG['P'][timestep, :]))

        dataset.append((timestep, np.random.choice(a = CONFIG['A'].shape[1]+1, p = pDist)))
# print(dataset)
nEps = 1
        
numIters = 500

DEFAULT_SETTINGS = {'seed': 1, 
                    'recFreq': 1, 
                    'dirPath': '../data/', 
                    'deBug': False, 
                    'nEps': nEps, 
                    'numIters': numIters, 
                    'saveTrajectory': True, 
                    'epLen' : epLen,
                    'render': False,
                    'pickle': False
                    }


revenue_env = gym.make('Airline-v0', config=CONFIG)
mon_env = Monitor(revenue_env)

agents = { # 'SB PPO': PPO(MlpPolicy, mon_env, gamma=1, verbose=0, n_steps=epLen),
'Random': or_suite.agents.rl.random.randomAgent(),
'BayesSelector': or_suite.agents.airline_revenue_management.bayes_selector.bayes_selectorAgent(epLen, round_flag=True),
'BayesSelectorTraces': or_suite.agents.airline_revenue_management.bayes_selector_traces.bayes_selector_tracesAgent(epLen, round_flag=True, dataset = dataset),
}

fig_path = '../figures/'
fig_name = 'revenue'+'_line_plot'+'.pdf'
or_suite.plots.plot_radar_plots(path_list_line, algo_list_line, fig_path, fig_name, {})

             Algorithm  Reward      Time      Space
0               Random   1.676       inf  -5267.308
1        BayesSelector   2.760  2.639713 -29416.840
2  BayesSelectorTraces   2.706  2.583099 -28898.476


Still just as good!  In fact - we see the same performance.  This is actually one of the advantages of the approach since it appeals to "actions which are good on average".  Recent work has actually analyzed that even a "single" trace is good enough!  This clearly outperforms existing Sim2Real RL algorithms.

### Comparing Against "Discrete Q Learning"

During the presentation we discussed how standard RL is a viable approach in this model.  Since we are focusing on a small-scale problem with a discrete tabular representation, as our version of Sim2Real RL we include standard $Q$ learning that you implemented yesterday as one comparison.  We have to run on a different configuration for the environment to ensure all values are integral.

This takes a little bit of extra work in order to include as a simulation, since the code is primarily set to work in the "online" set-up and not the "exogenous" set-up.  However, we can work our way around it by training the Sim2Real RL algorithm with episodes $K = $ the number of traces we fed into the Bayes Selector algorithm and one iteration.  Afterwards, we "copy" the agent's Q values, feed that back into an algorithm, and evaluate it on $K = 1$ episode to see its average performance.

Unfortunately the "traces" will then be sampled differently, but since the distribution is the same we can ignore that.  We also have to swap to different config for the environment so the parameters are all integer.

Also, we needed to create a copy of the standard $Q$ learning code to modify the reset function.  If you remember previously, the reset function gets called and sets all of the estimates back to what they were originally.  Here, we modify it to "reset" the estimates back to the final trained $Q$ values from the "warm start" step.

In [10]:
num_traces = 5

Training the $Q$ learning algorithm on that number of traces.

In [11]:
CONFIG = or_suite.envs.env_configs.airline_integer_config
epLen = CONFIG['epLen']


nEps = num_traces

numIters = 1

DEFAULT_SETTINGS = {'seed': 1, 
                    'recFreq': 1, 
                    'dirPath': '../data/', 
                    'deBug': False, 
                    'nEps': nEps, 
                    'numIters': numIters, 
                    'saveTrajectory': True, 
                    'epLen' : epLen,
                    'render': False,
                    'pickle': False
                    }


revenue_env = gym.make('Airline-v0', config=CONFIG)

# Here we pick out the discrete q learning agent that I have included modified for you to run experiments
# with where the reset function is adjusted.
print('Create agent')
q_l_agent = or_suite.agents.airline_revenue_management.discrete_ql_data.DiscreteQl_Data(revenue_env.action_space, revenue_env.observation_space, epLen, 1)
print('Run Exp')
or_suite.utils.run_single_algo(revenue_env, q_l_agent, DEFAULT_SETTINGS)


Create agent
Run Exp
Writing to file data.csv


  self.data[index, 4] = np.log(((end_time) - (start_time)))


We can print out the estimated $Q$ values and notice that a lot of them are constant, meaning that the traces aren't even enough to visit all the possible states! That makes sense, because the Sim2Real RL algorithms aren't using any additional problem structure in these models.

In [12]:
print(q_l_agent.qVals)

[[[[[[4.        4.       ]
     [4.        4.       ]]

    [[4.        4.       ]
     [4.        4.       ]]

    [[4.        4.       ]
     [4.        4.       ]]

    [[4.        4.       ]
     [4.        4.       ]]

    [[4.        4.       ]
     [4.        4.       ]]]


   [[[4.        4.       ]
     [4.        4.       ]]

    [[4.        4.       ]
     [4.        4.       ]]

    [[4.        4.       ]
     [4.        4.       ]]

    [[4.        4.       ]
     [4.        4.       ]]

    [[4.        4.       ]
     [4.        4.       ]]]


   [[[4.        4.       ]
     [4.        4.       ]]

    [[4.        4.       ]
     [4.        4.       ]]

    [[4.        4.       ]
     [4.        4.       ]]

    [[4.        4.       ]
     [4.        4.       ]]

    [[4.        4.       ]
     [4.        4.       ]]]


   [[[4.        4.       ]
     [4.        4.       ]]

    [[4.        4.       ]
     [4.        4.       ]]

    [[4.        4.       ]
     [4.       

In [13]:
print(q_l_agent.qVals.min(), q_l_agent.qVals.max())

1.0 5.365085


We can also run the $Q$ learning algorithm with "more" iterations, just as another comparison to highlight the bias-variance tradeoffs in these models.

In [14]:
CONFIG = or_suite.envs.env_configs.airline_integer_config
epLen = CONFIG['epLen']


nEps = 5000

numIters = 1

DEFAULT_SETTINGS = {'seed': 1, 
                    'recFreq': 1, 
                    'dirPath': '../data/', 
                    'deBug': False, 
                    'nEps': nEps, 
                    'numIters': numIters, 
                    'saveTrajectory': True, 
                    'epLen' : epLen,
                    'render': False,
                    'pickle': False
                    }


revenue_env = gym.make('Airline-v0', config=CONFIG)

# Here we pick out the discrete q learning agent that I have included modified for you to run experiments
# with where the reset function is adjusted.
print('Create agent')
q_l_extra_data_agent = or_suite.agents.airline_revenue_management.discrete_ql_data.DiscreteQl_Data(revenue_env.action_space, revenue_env.observation_space, epLen, 1)
print('Run Exp')
or_suite.utils.run_single_algo(revenue_env, q_l_extra_data_agent, DEFAULT_SETTINGS)


Create agent
Run Exp


  self.data[index, 4] = np.log(((end_time) - (start_time)))


Writing to file data.csv


Next we actually evaluate the rest of the algorithms.

In [15]:
dataset = []
for _ in range(num_traces): # samples traces
    for timestep in range(CONFIG['epLen']): # each of length of the time horizon
        # samples a customer type according to that step's distribution
        pDist = np.append(np.copy(CONFIG['P'][timestep, :]), 1 - np.sum(CONFIG['P'][timestep, :]))

        dataset.append((timestep, np.random.choice(a = CONFIG['A'].shape[1]+1, p = pDist)))
# print(dataset)
nEps = 1
        
numIters = 500

DEFAULT_SETTINGS = {'seed': 1, 
                    'recFreq': 1, 
                    'dirPath': '../data/', 
                    'deBug': False, 
                    'nEps': nEps, 
                    'numIters': numIters, 
                    'saveTrajectory': True, 
                    'epLen' : epLen,
                    'render': False,
                    'pickle': False
                    }


revenue_env = gym.make('Airline-v0', config=CONFIG)
mon_env = Monitor(revenue_env)

agents = { 'Q Learning with Data': q_l_agent,
'Q Learning Extra Data': q_l_extra_data_agent,
'Ignorant Q Learning': or_suite.agents.rl.discrete_ql.DiscreteQl(revenue_env.action_space, revenue_env.observation_space, epLen, 1),
'Random': or_suite.agents.rl.random.randomAgent(),
'BayesSelector': or_suite.agents.airline_revenue_management.bayes_selector.bayes_selectorAgent(epLen, round_flag=True),
'BayesSelectorTraces': or_suite.agents.airline_revenue_management.bayes_selector_traces.bayes_selector_tracesAgent(epLen, round_flag=True, dataset = dataset),
}

path_list_line = []
algo_list_line = []
path_list_radar = []
algo_list_radar= []
for agent in agents:
    print(agent)
    DEFAULT_SETTINGS['dirPath'] = '../data/airline_'+str(agent)
    if agent == 'SB PPO':
        or_suite.utils.run_single_sb_algo(mon_env, agents[agent], DEFAULT_SETTINGS)
    else:
        or_suite.utils.run_single_algo(revenue_env, agents[agent], DEFAULT_SETTINGS)

    path_list_line.append('../data/airline_'+str(agent))
    algo_list_line.append(str(agent))
    path_list_radar.append('../data/airline_'+str(agent))
    algo_list_radar.append(str(agent))
    
fig_path = '../figures/'
fig_name = 'revenue'+'_line_plot'+'.pdf'
or_suite.plots.plot_radar_plots(path_list_line, algo_list_line, fig_path, fig_name, {})

Q Learning with Data


  self.data[index, 4] = np.log(((end_time) - (start_time)))


Writing to file data.csv
Q Learning Extra Data


  self.data[index, 4] = np.log(((end_time) - (start_time)))


Writing to file data.csv
Ignorant Q Learning


  self.data[index, 4] = np.log(((end_time) - (start_time)))


Writing to file data.csv
Random


  self.data[index, 4] = np.log(((end_time) - (start_time)))


Writing to file data.csv
BayesSelector
Writing to file data.csv
BayesSelectorTraces
Writing to file data.csv
               Algorithm  Reward      Time      Space
0   Q Learning with Data   2.076       inf  -5531.130
1  Q Learning Extra Data   2.342       inf  -5176.010
2    Ignorant Q Learning   1.636       inf  -5056.884
3                 Random   1.566       inf  -5500.914
4          BayesSelector   2.760  2.597675 -28892.322
5    BayesSelectorTraces   2.706  2.578751 -49557.468


### Results

So - what do we see?

**Bayes vs Bayes with Traces** First off, we observe that even with very few traces, the Bayes Selector algorithm using the traces has competitive performance to the original Bayes Selector algorithm even knowing the full distribution on the exogenous inputs.

**Bayes vs Q Learning** Both $Q$ learning algorithms (with the same number, and with extra traces) are outperformed by Bayes Selector algorithm. This highlights the "bias variance" trade-off discussion from the presentation, where we see that the Bayes Selector algorithm to some extent suffers from constant bias, but will get improved performance in some "small data" regimes.

**Q Learning vs Q Learning with More Data** We see that the ranking of the $Q$ learning algorithm matches our intuition: worse is the "ignorant $Q$ learning", next is "$Q$ learning with small data", and the best performing is "$Q$ learning with extra data".