Reinforcement Learning with Algorithms from Probabilistic Structure Estimation (RLAPSE)

In many reinforcement learning settings, the choice of the underlying algorithm is not obvious. The RLAPSE framework decides whether to use a lightweight myopic algorithm (e.g., Q-learning with discount factor 0) or a more complicated reinforcement learning algorithm (e.g., Q-learning with discount factor close to 1). The framework utilizes (i) the likelihood ratio test, and (ii) a variant of Q-learning analyzed by Even-Dar et al. (2003). The selection algorithm, called the orchestrator, collects statistics about the environment and switches to the more appropriate algorithm if deemed necessary. More details about this work can be found in (Epperlein et al. 2021).
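
To give a feel for the switching logic, below is a conceptual sketch of a likelihood-ratio-based switch. It is not the package's implementation: the hypotheses (next state depending only on the action versus on both the state and the action), the chi-squared threshold, and the counting scheme are illustrative assumptions.

import numpy as np
from scipy.stats import chi2

def likelihood_ratio_statistic(counts):
    # counts[a, s, s1]: number of observed transitions s -> s1 under action a
    n_actions, n_states, _ = counts.shape
    eps = 1e-12
    # log-likelihood under the full model, where P(s1 | s, a) depends on s and a
    p_full = counts / np.maximum(counts.sum(axis=2, keepdims=True), eps)
    ll_full = np.sum(counts * np.log(np.maximum(p_full, eps)))
    # log-likelihood under the restricted model, where P(s1 | s, a) = P(s1 | a)
    pooled = counts.sum(axis=1, keepdims=True)
    p_restr = pooled / np.maximum(pooled.sum(axis=2, keepdims=True), eps)
    ll_restr = np.sum(counts * np.log(np.maximum(p_restr, eps)))
    return 2.0 * (ll_full - ll_restr)

def switch_to_farsighted(counts, significance=0.01):
    # reject the restricted model if the statistic exceeds a chi-squared quantile;
    # the degrees of freedom equal the difference in free parameters of the two models
    n_actions, n_states, _ = counts.shape
    df = n_actions * (n_states - 1) ** 2
    return likelihood_ratio_statistic(counts) > chi2.ppf(1.0 - significance, df)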

Motivation

Consider a simple illustrative example of restaurant recommendation. We have a good restaurant (GR) with very limited capacity, and a bad restaurant (BR) with near-infinite capacity. Every time a user asks for a recommendation, we can send them to either of the two restaurants, hence our action space is $\mathcal{A} = \{1, 2\}$, where

  • Action 1: Send user to GR;
  • Action 2: Send user to BR.

BR is always able to seat customers, whereas GR might not be, hence we need at least two states, so the state space is $\mathcal{S} = \{1, 2\}$, where

  • State 1: There is no wait in GR;
  • State 2: There is a wait in GR.

Even when they have to wait, customers still prefer GR. Customers' enjoyment of BR is always the same, since they never have to wait there. So we can say that we have rewards $r_1 > r_2 > r_3$, where

  • $r_1$: reward for sending a customer to GR while there is no wait;
  • $r_2$: reward for sending a customer to GR even if there is a wait;
  • $r_3$: reward for sending a customer to BR.

Sending a customer to GR while it is able to seat will, with high probability, say $1 - \epsilon_1$, lead to a crowded GR. If there was a wait already, then sending a customer there will also likely not change that, so say the probability of GR staying crowded is $1 - \epsilon_2$. Not sending a customer there will likely leave GR without a wait, with probability $1 - \epsilon_3$, or lead to GR being able to seat the next customer, with probability $1 - \epsilon_4$. For simplicity and illustrative purposes, we assume that all the small probabilities are equal, i.e., $\epsilon_1 = \epsilon_2 = \epsilon_3 = \epsilon_4 = \epsilon$.

This can be modeled as an MDP with transition probabilities $P$ and rewards $R$. The probabilities of moving from state $s$ to state $s'$, i.e., $P(s' \mid s, a)$, for each action $a$ are encoded in the transition probability matrices:

  • If recommending GR, i.e., $a = 1$, then the transition matrix is

$$P_1 = \begin{pmatrix} \epsilon & 1 - \epsilon \\ \epsilon & 1 - \epsilon \end{pmatrix}$$

  • If recommending BR, i.e., $a = 2$, then the transition matrix is

$$P_2 = \begin{pmatrix} 1 - \epsilon & \epsilon \\ 1 - \epsilon & \epsilon \end{pmatrix}$$

The rewards for each state-action pair are

$$R = \begin{pmatrix} r_1 & r_3 \\ r_2 & r_3 \end{pmatrix}$$
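
For concreteness, these quantities can be written down numerically, for example with $\epsilon = 0.2$. The following NumPy sketch is for illustration only: the array shapes and the reward values r1 = 1.0, r2 = 0.8, r3 = 0.1 are assumptions, not part of the package.

import numpy as np

epsilon = 0.2

# P[a, s, s1]: probability of moving from state s to s1 under action a
# (index 0 = "no wait in GR", index 1 = "wait in GR")
P = np.array([
    [[epsilon, 1 - epsilon],    # a = 1 (recommend GR), starting from "no wait"
     [epsilon, 1 - epsilon]],   # a = 1 (recommend GR), starting from "wait"
    [[1 - epsilon, epsilon],    # a = 2 (recommend BR), starting from "no wait"
     [1 - epsilon, epsilon]],   # a = 2 (recommend BR), starting from "wait"
])

# R[s, a]: immediate reward for taking action a in state s
r1, r2, r3 = 1.0, 0.8, 0.1      # illustrative values with r1 > r2 > r3
R = np.array([
    [r1, r3],                   # "no wait": send to GR, send to BR
    [r2, r3],                   # "wait":    send to GR, send to BR
])

assert np.allclose(P.sum(axis=2), 1.0)  # each row of each matrix sums to 1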

The analysis provided in (Epperlein et al. 2021) can be used to show that for certain values of $\epsilon$ the myopic policy is optimal, and thus a simple lightweight algorithm would be more suitable for this problem. However, the value of $\epsilon$ might be unknown from the outset, and sometimes more complicated algorithms are required. RLAPSE is a tool which automatically selects the appropriate algorithm in an unknown environment.

Note: even though the restaurant example is not practical, this toy example is intended to be a good illustration of the problem setting. A more realistic and complicated scenario, the broker example, is discussed in (Epperlein et al. 2021).

Getting Started

Please follow these instructions to install all the requirements and use the package correctly.

Requirements and Installation

Clone the repository:

git clone https://github.com/roman1e2f5p8s/rlapseingym

If using pip, use

pip3.6 install -r requirements.txt

and for conda, use

conda env create -f environment.yml  #  this will create the environment "rlapse"

to install the dependencies. Then run

python3.6 setup.py install

You may want to run the tests to make sure the software is fully compatible with your system:

python3.6 setup.py pytest

Usage

RLAPSE includes three packages:

  • mdps: provides wrappers for some MDP examples, with the ability to convert them into OpenAI Gym environments;
  • algorithms: implements reinforcement learning algorithms and the orchestrator;
  • utils: includes utilities to compute properties of infinite-horizon MDPs and a wrapper for NumPy distributions.

MDP examples

The following MDP examples are available in rlapse.mdps.mdp:

  • MDP: create MDPs from transition probability tensor and reward matrix;
  • RandomMDP: generate random MDPs using NumPy distributions;
  • RestaurantMDP: create MDPs for the restaurant example;
  • BrokerMDP: generate random MDPs for the broker example;
  • ToyBrokerMDP: create MDPs for a toy broker example.

For example, to create and validate the restaurant MDP:

from rlapse.mdps.mdp import RestaurantMDP

mdp = RestaurantMDP(epsilon=0.2)
mdp.validate()

Gym environments

Generated MDPs can be easily converted to OpenAI Gym environments:

env = mdp.to_env()
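
Since the result follows the OpenAI Gym interface, it can also be explored with the usual reset/step loop. This minimal sketch assumes the classic Gym API, where step returns the observation, reward, done flag, and info dictionary:

observation = env.reset()
for _ in range(10):
    action = env.action_space.sample()                  # pick a random action
    observation, reward, done, info = env.step(action)  # classic 4-tuple Gym step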

The environment can then be rendered in Jupyter notebooks:

env.render()

Figure 1: Restaurant example MDP rendered into a Jupyter notebook

Algorithms

The package includes implementations of the Q-learning and RLAPSE algorithms:

from rlapse.algorithms.qlearning import Qlearner
from rlapse.algorithms.rlapse import RLAPSE

a0 = Qlearner(env, gamma=0.0)          # Q-learning with discount factor 0.0
a1 = Qlearner(env, gamma=0.9)          # Q-learning with discount factor 0.9
rl = RLAPSE(env, a0, a1, t_start=100)  # RLAPSE

To learn a policy from the environment (where T is the desired total number of time steps, e.g. T = 10000):

a0.learn(total_timesteps=T)
a1.learn(total_timesteps=T)
rl.learn(total_timesteps=T)

To predict the action for a given observation (i.e., state):

for observation in env.mdp.states:
    action_a0, _ = a0.predict(observation)
    action_a1, _ = a1.predict(observation)
    action_rl, _ = rl.predict(observation)

Utilities

Use the value iteration algorithm to compute the optimal policy:

from rlapse.utils.infhmdp import ValueIteration, expected_reward

VI = ValueIteration(R=env.mdp.R, P=env.mdp.P)
OPTIMAL_POLICY = VI.policy
OPTIMAL_REWARD = expected_reward(R=env.mdp.R, P=env.mdp.P, policy=OPTIMAL_POLICY)
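
The same utilities can be used to compare a learned policy with the optimal one. The sketch below is illustrative and assumes that expected_reward accepts a policy given as a sequence of greedy actions indexed by state:

# greedy policy learned by the discounted Q-learner a1 (see the Algorithms section)
LEARNED_POLICY = [a1.predict(observation)[0] for observation in env.mdp.states]
LEARNED_REWARD = expected_reward(R=env.mdp.R, P=env.mdp.P, policy=LEARNED_POLICY)
print(OPTIMAL_REWARD, LEARNED_REWARD)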

Distributions for random MDPs

Use the Distribution wrapper to generate random MDPs:

import numpy as np
from rlapse.utils.distribution import Distribution
from rlapse.mdps.mdp import RandomMDP

P_distribution = Distribution(np.random.gamma, shape=1.0, scale=5.0)
R_distribution = Distribution(np.random.gamma, shape=0.1, scale=5.0)
mdp = RandomMDP(
        n_states=10, n_actions=3,
        controlled=False, rank1pages=True,
        P_distribution=P_distribution,
        R_distribution=R_distribution)
mdp.validate()
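
A generated random MDP can then be used like any other MDP in the package, for example by converting it to a Gym environment and running the orchestrator on it (combining the pieces shown above; the number of time steps is arbitrary here):

from rlapse.algorithms.qlearning import Qlearner
from rlapse.algorithms.rlapse import RLAPSE

env = mdp.to_env()
rl = RLAPSE(env, Qlearner(env, gamma=0.0), Qlearner(env, gamma=0.9), t_start=100)
rl.learn(total_timesteps=10000)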

For more examples, please see the examples/ folder and examples.ipynb notebook.


Authors

  • Roman Overko

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

None