In [1]:
import or_suite
import numpy as np
import copy
import os
import gym

# OR Suite 

Key references:

[OpenAI Gym Documentation](https://www.gymlibrary.ml/)

[Spaces Documentation](https://www.gymlibrary.ml/content/spaces/)

[ORSuite Contribution Guide](https://github.com/cornell-orie/ORSuite/blob/main/ORSuite_Contribution_Guide.md)


##  Introduction: Creating a Custom Simulator

In the first code demonstration we will work on creating a ``custom simulator`` using the OpenAI Gym API framework. 

As discussed earlier, the OpenAI Gym comes loaded with a lot of different simulators, ranging from classic control tasks (cartpole, mountaincar, etc.) to Atari games.  However, the package also provides an API and outline for creating custom environments which are not part of the Gym package.  Thankfully, the framework is well documented and outlines all of the steps required to create your own custom environment.

We will additionally incorporate the custom environment into the ORSuite package, an open-source package aimed at providing simulators for researchers in RL and Operations to test algorithms on common tasks in operations management.  It is developed by a team of undergraduate students at Cornell, and is slowly building up to contain simulators such as:
- inventory control
- ridesharing systems
- revenue management problems
- resource allocation
- vaccine allocation
and many more.  Maybe some of the simulators that we create today could be incorporated into the package as well!

Due to intricacies in developing the code demonstration, you will notice that the version of ``ORSuite`` contained here does not include all of the main components.  This was done so that we can isolate the key components needed for the code demonstration.

## What makes a simulator?

As discussed during the background on MDPs earlier today, an environment or simulator is specified by the following main components:
- action space
- state space (called observation space in the OpenAI Gym API)
- starting state distribution
- reward function
- transition kernel
- time horizon

The OpenAI Gym API provides an abstraction for each of these.  In essence, our goal will be to create a ``subclass`` of the ```gym.Env``` class provided by OpenAI Gym.  The high-level sketch of the code will look like this:

In [2]:
'''

import gym
from gym import spaces

class CustomEnv(gym.Env):
    """Custom Environment that follows gym interface"""
    metadata = {'render.modes': ['human']}

    def __init__(self, arg1, arg2, ...):
        super(CustomEnv, self).__init__()
        # Define action and observation space
        # They must be gym.spaces objects
        # Example when using discrete actions:
        self.action_space = spaces.Discrete(N_DISCRETE_ACTIONS)
        # Example for using image as input:
        self.observation_space = spaces.Box(low=0, high=255, shape=
                        (HEIGHT, WIDTH, N_CHANNELS), dtype=np.uint8)

    def step(self, action):
        # Execute one time step within the environment
        ...

    def reset(self):
        # Reset the state of the environment to an initial state
        ...

    def render(self, mode='human', close=False):
        # Render the environment to the screen
        ...
        
'''


In this framework you will notice a couple key components.

At the top we have the import statements:

```

import gym
from gym import spaces

```

```gym``` is the name of the OpenAI Gym package, and ```spaces``` is the part of the package that allows the user to specify the action and observation spaces.

```
    def __init__(self, arg1, arg2, ...):
        super(CustomEnv, self).__init__()
        # Define action and observation space
        # They must be gym.spaces objects
        # Example when using discrete actions:
        self.action_space = spaces.Discrete(N_DISCRETE_ACTIONS)
        # Example for using image as input:
        self.observation_space = spaces.Box(low=0, high=255, shape=
                        (HEIGHT, WIDTH, N_CHANNELS), dtype=np.uint8)

```

In this method we initialize the simulator.  Note that we are able to pass arguments (e.g. the number of ambulances in an ambulance model, price distributions, etc.).  Next we specify the action space and observation space.  These must be ```gym.spaces``` objects.

## Gym Spaces

```Space``` is the superclass used to define observation and action spaces, and spaces are crucially used in Gym to define the format of valid actions and states.  They serve a couple of different purposes in running an experiment:
- They define how to interact with the environment, i.e. they specify which actions are valid
- They format data in structured ways to feed into RL algorithms (e.g. numpy vectors)
- They provide a method to sample elements randomly (used for exploration, debugging, $\epsilon$-greedy, etc.)

There are several types of spaces, including:

- ```Box```: an n-dimensional continuous feature space with an upper and lower bound for each dimension

- ```Dict```: a dictionary of simpler spaces and labels for those spaces

- ```Discrete```: a discrete space over n integers { 0, 1, ..., n-1 }

- ```MultiBinary```: a binary space of size n

- ```MultiDiscrete```: allows for multiple discrete spaces with a different number of actions in each

- ```Tuple```: a tuple space is a tuple of simpler spaces

For example, to create a state space $[-1, 2]^{3 \times 4}$ we do:
```Box(low=-1.0, high=2.0, shape=(3, 4), dtype=np.float32)```
Unfortunately the last argument, ```dtype```, is annoyingly important.  You will need to make sure to typecast all variables appropriately when interacting with the environment.
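
To make the dtype point concrete, here is a minimal sketch using only NumPy.  NumPy creates ```float64``` arrays by default, so a value has to be cast explicitly before it matches a ```float32``` space.  The ```np.can_cast``` check at the end only illustrates why the mismatch matters; whether ```Box.contains``` performs a check like this depends on the installed Gym version, so treat that part as an assumption.

```python
import numpy as np

# NumPy defaults to float64, while the Box above was declared with float32.
raw_action = np.array([0.5])
print(raw_action.dtype)  # float64

# Explicitly cast before handing the value to the environment.
action = raw_action.astype(np.float32)
print(action.dtype)  # float32

# Safe-casting check: float64 cannot be safely narrowed to float32.
print(np.can_cast(raw_action.dtype, np.float32))  # False
print(np.can_cast(action.dtype, np.float32))      # True
```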

See the documentation [here](https://www.gymlibrary.ml/content/spaces/) to read more about the spaces.

## Trade Execution

We will be implementing the universal trade order execution environment from [here](https://arxiv.org/abs/2103.10860).  Unfortunately I am not a finance expert so you will have to deal with my imperfect interpretation of the environment, but see the paper for a better explanation.  Suppose you have access to a certain amount $Q$ of a fixed stock.

For the sake of discussion, let's say [ETD](https://finance.yahoo.com/quote/ETD/), the stock of the Ethan Allen furniture company, a Vermont-based furniture company named after [Ethan Allen](https://en.wikipedia.org/wiki/Ethan_Allen), who formed the famous [Green Mountain Boys](https://en.wikipedia.org/wiki/Green_Mountain_Boys), who fought back against Canada, New York, and New Hampshire to take control of the territory now known as Vermont.

Unfortunately, the stock is plummeting, so you want to sell your entire volume $Q$ of the stock into the market over a fixed time horizon $T$.  In each round $t = 0, 1, \ldots, T - 1$ you observe the price $p_t$ and select a volume $q_{t+1}$ of shares; the trading order is then actually executed at the next price $p_{t+1}$, due to the information structure in financial markets.  The goal is to maximize revenue subject to complete liquidation.  As such, given access to the full vector of execution prices $p_1, \ldots, p_T$, your goal is to pick:

$$
\max_{q_1, \ldots, q_T} \sum_{t=0}^{T-1} q_{t+1} p_{t+1} \text{ s.t. } \sum_{t=0}^{T-1} q_{t+1} = Q
$$

The objective is the total revenue, and the constraint enforces total liquidation.  Note that for a simpler implementation we can instead take the action to be $a_t \in [0,1]$ and set $q_{t} = a_t Q$.  As such, we can then specify the action space and state space via:

```
        self.action_space = spaces.Box(low=0, high=1, shape=(1,), dtype=np.float32)
        self.observation_space = spaces.Box(low=0, high=1, shape=(1,), dtype=np.float32)
```
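
As a quick sanity check of the liquidation objective, the following sketch evaluates the revenue of two schedules on a made-up price path.  The prices, the value of ```Q```, and the helper name ```revenue``` are illustrative assumptions, not values from the paper.

```python
import math

def revenue(fractions, prices, Q):
    """Total revenue: each round's volume a_t * Q times that round's execution price."""
    # The fractions a_t must sum to one so that the volumes sum to Q.
    assert math.isclose(sum(fractions), 1.0), "schedule must liquidate everything"
    return sum(a * Q * p for a, p in zip(fractions, prices))

Q = 100
prices = [0.52, 0.48, 0.55, 0.50, 0.45]  # made-up execution prices, one per round
T = len(prices)

# Sell the same fraction every round.
uniform = [1 / T] * T

# Sell everything at the best price, which needs the full price vector in advance.
best = max(range(T), key=lambda t: prices[t])
oracle = [1.0 if t == best else 0.0 for t in range(T)]

print(revenue(uniform, prices, Q))  # approximately 50.0
print(revenue(oracle, prices, Q))   # approximately 55.0
```

The gap between the two numbers is what a learning algorithm is trying to close without seeing future prices.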


In [3]:
import gym
from gym import spaces

In [4]:
action_space = spaces.Box(low=0, high=1, shape=[1], dtype=np.float32)

We can sample from the action space

In [5]:
action_space.sample()

array([0.7483489], dtype=float32)

and test membership

In [6]:
action_space.contains([.5])

True

In [7]:
action_space.contains([1.5])

False

In [8]:
action_space.contains([1.5,3])

False

Putting this into the framework from before gives us:

In [9]:
import gym
from gym import spaces

class Trading(gym.Env):
    """Custom Environment that follows gym interface"""
    metadata = {'render.modes': ['human']}

    def __init__(self, T, Q):
        super(Trading, self).__init__()
        # Define action and observation spaces; they must be gym.spaces objects
        self.action_space = spaces.Box(low=0, high=1, shape=[1], dtype=np.float32)
        self.observation_space = spaces.Box(low=0, high=1, shape=[1], dtype=np.float32)
        self.T = T
        self.Q = Q

## Reset Function

Next we will write the reset method, which is called any time a new environment is created or an existing environment's state needs to be reset.  This is where we set the initial information, the number of rounds in the experiment, etc.  In our problem, the starting state is a fixed price of $0.5$ with the time index set to zero.

In [10]:
import gym
from gym import spaces

class Trading(gym.Env):
    """Custom Environment that follows gym interface"""
    metadata = {'render.modes': ['human']}

    def __init__(self, T, Q):
        super(Trading, self).__init__()
        # Define action and observation spaces; they must be gym.spaces objects
        self.action_space = spaces.Box(low=0, high=1, shape=[1], dtype=np.float32)
        self.observation_space = spaces.Box(low=0, high=1, shape=[1], dtype=np.float32)
        self.T = T
        self.Q = Q
    
    def reset(self):
        self.state = [0.5]
        self.timestep = 0
        self.market_sold = 0 # amount of inventory sold
        return self.state

## Step Function

Next our environment needs to be able to take a step.  At each step we will take a specified action (chosen by the algorithm from within the action space), calculate the reward, return the next observation, and indicate whether or not the experiment is finished running.

In this model, we have to do two things:
- Calculate the next market price (to keep things simple we just add some random noise; the code below uses uniform noise)
- Determine if we are at the last round (where the action does not matter since we need to liquidate entire asset)

In [11]:
import gym
from gym import spaces

class Trading(gym.Env):
    """Custom Environment that follows gym interface"""
    metadata = {'render.modes': ['human']}

    def __init__(self, T, Q):
        super(Trading, self).__init__()
        # Define action and observation spaces; they must be gym.spaces objects
        self.action_space = spaces.Box(low=0, high=1, shape=[1], dtype=np.float32)
        self.observation_space = spaces.Box(low=0, high=1, shape=[1], dtype=np.float32)
        self.T = T
        self.Q = Q
    
    def reset(self):
        self.state = [0.5]
        self.timestep = 0
        self.market_sold = 0 # amount of inventory sold
        return self.state
    
    def step(self, action):
        assert self.action_space.contains(action)
        
        action = action[0] # unpack the scalar fraction so later arithmetic stays scalar
        if self.market_sold + action > 1: # checks if we are selling more than we have
            action = 1 - self.market_sold
    
        done = False
        if self.timestep == self.T - 1: # updates action in last timestep
            action = 1 - self.market_sold
            done = True
    
        self.market_sold += action
        
        self.timestep += 1
        
        # Calculates new price and updates new state
        
        price = self.state[0]
        price = np.clip(price + np.random.uniform(0, .1),0,1)
        self.state = [price]
        
        # Calculates reward
        reward = action * self.Q * price
        
        return self.state, reward, done, {}

## Render

The only thing left is to render.  For simplicity, we will just print out the current state and the shares remaining.

In [12]:
import gym
from gym import spaces

class Trading(gym.Env):
    """Custom Environment that follows gym interface"""
    metadata = {'render.modes': ['human']}

    def __init__(self, T, Q):
        super(Trading, self).__init__()
        # Define action and observation spaces; they must be gym.spaces objects
        self.action_space = spaces.Box(low=0, high=1, shape=[1], dtype=np.float32)
        self.observation_space = spaces.Box(low=0, high=1, shape=[1], dtype=np.float32)
        self.T = T
        self.Q = Q
    
    def reset(self):
        self.state = [0.5]
        self.timestep = 0
        self.market_sold = 0 # amount of inventory sold
        return self.state
    
    def step(self, action):
        assert self.action_space.contains(action)
        
        action = action[0]
        if self.market_sold + action > 1: # checks if we are selling more than we have
            action = 1 - self.market_sold
    
        done = False
        if self.timestep == self.T - 1: # updates action in last timestep
            action = 1 - self.market_sold
            done = True
    
        self.market_sold += action
        
        self.timestep += 1
        
        # Calculates new price and updates new state
        
        price = self.state[0]
        price = np.clip(price + np.random.uniform(0, .1),0,1)
        self.state = [price]
        
        # Calculates reward
        reward = action * self.Q * price
        
        return self.state, reward, done, {}
    
    def render(self, mode='human', close=False):
        print(f'Current state: {self.state}')
        print(f'Shares remaining: {self.Q * (1 - self.market_sold)}') # market_sold is a fraction of Q

## TaDa!

Our environment is now complete.  We can now instantiate an object and test it out.

In [13]:
env = Trading(5, 100)
env.reset()

[0.5]

Note that calling reset returns the initial state, $0.5$.

In [14]:
env = Trading(5, 100)
env.reset()

[0.5]

In [15]:
state, reward, done, _ = env.step([1/5])
print(f'New state: {state}, reward: {reward}, done: {done}')

New state: [0.5731734894850563], reward: 11.463469789701126, done: False


Run the step above 5 times to verify that the ```done``` flag switches over to ```True``` on the final step.
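
A minimal sketch of that loop is given here.  So that it runs on its own without Gym installed, it uses a stripped-down stand-in, ```TradingSketch``` (a hypothetical name), that mirrors the reset and step logic above but does not subclass ```gym.Env```.  With the real ```Trading``` class the loop itself would look the same.

```python
import random

class TradingSketch:
    """Gym-free stand-in that mirrors the Trading step logic (hypothetical name)."""

    def __init__(self, T, Q):
        self.T = T
        self.Q = Q

    def reset(self):
        self.state = [0.5]
        self.timestep = 0
        self.market_sold = 0.0  # fraction of inventory sold so far
        return self.state

    def step(self, action):
        action = action[0]
        # Never sell more than what is left.
        action = min(action, 1 - self.market_sold)
        done = False
        if self.timestep == self.T - 1:
            # Last round: liquidate whatever remains, regardless of the action.
            action = 1 - self.market_sold
            done = True
        self.market_sold += action
        self.timestep += 1
        # Price moves by uniform noise and is clipped to [0, 1].
        price = min(1.0, max(0.0, self.state[0] + random.uniform(0, 0.1)))
        self.state = [price]
        return self.state, action * self.Q * price, done, {}

T = 5
env = TradingSketch(T, 100)
env.reset()
done_flags = []
total_reward = 0.0
for _ in range(T):
    state, reward, done, _info = env.step([1 / T])
    done_flags.append(done)
    total_reward += reward

print(done_flags)       # [False, False, False, False, True]
print(env.market_sold)  # close to 1.0, i.e. fully liquidated
print(total_reward)     # random, depends on the sampled prices
```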

## Incorporating into the package and registering

Next up we will incorporate the environment into ORSuite and register it.  When adding an environment we require the following file structure:

```
or_suite/envs/new_env_name:
    -> __init__.py
    -> env_name.py
    -> env_name_readme.ipynb
```

The `__init__.py` file should simply import the environment class from `env_name.py`.  
Once the environment file structure has been made, including the environment in the package additionally requires us to:
- Specify default parameter values for the configuration dictionary of the environment, which is saved in `or_suite/envs/env_configs.py`
- Register the environment by modifying `or_suite/envs/__init__.py` to include a link to the class along with a name
- Modify `or_suite/envs/__init__.py` to import the new environment folder.

All of this has been done already as an example for the trading environment.  Note that a couple of modifications were made just to keep the package consistent, including:
- Setting up the config as a dictionary instead of passing individual arguments
- Adjusting the naming structure
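
As a rough illustration of the first modification, a constructor can read its parameters from a dictionary instead of taking them as separate arguments.  The names ```example_trading_config``` and ```ConfiguredTradingSketch``` and the keys ```T``` and ```Q``` are assumptions made for this sketch; the actual defaults live in `or_suite/envs/env_configs.py` and may be organized differently.

```python
# Hypothetical default configuration, for illustration only.
example_trading_config = {"T": 5, "Q": 100}

class ConfiguredTradingSketch:
    """Illustrative constructor that reads its parameters from a config dict."""

    def __init__(self, config=example_trading_config):
        self.config = config
        self.T = config["T"]  # time horizon
        self.Q = config["Q"]  # total volume to liquidate

# Override a single default without touching the shared dictionary.
custom_config = dict(example_trading_config, T=10)
env = ConfiguredTradingSketch(custom_config)
print(env.T, env.Q)  # 10 100
```

Keeping every parameter in one dictionary makes it easier to register an environment with sensible defaults while still letting an experiment change individual values.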


Now that the environment is registered, we can create one simply by specifying the name of the simulator, here taken to be ```StockTrading-v0```.

In [16]:
CONFIG = or_suite.envs.env_configs.trade_execution_default_config
stock_env = gym.make('StockTrading-v0')

In [17]:
stock_env.reset()

array([0.5])

In [18]:
stock_env.step([.5])

(array([0.53048302]), 13.262075458704226, False, {})

We can also verify that the environment is set up correctly using a handy checker built by the [Stable Baselines](https://stable-baselines.readthedocs.io/en/master/) package.  It might be a good idea to do the same for your own environment to double check your code.  Note that doing this for myself also reminded me of some things missed above!

- When resetting, the returned state needs to be a NumPy array
- In the step function, the returned state also needs to be a NumPy array

(Hence the earlier complaints on the OpenAI Gym framework being picky about datatypes)

In [19]:
from stable_baselines3.common.env_checker import check_env
check_env(stock_env)



It will give a generic warning about normalizing the action space or rewards, but that is only required for certain DeepRL implementations.

## Thoughts and comments

The OpenAI Gym API framework is great for models where you are investigating approximate dynamic programming in a known MDP, or the online learning setting where the data the algorithm has access to must come from on-policy trajectories.

However, another learning framework is the generative model setting, where you assume the algorithm is able to query $s' \sim T_h(\cdot \mid s,a)$ and $r \sim r_h(s,a)$ for a state-action pair $(s,a)$ of its choosing.  Unfortunately, the ```step``` function does not support this directly, since its output depends on internal variables of the environment (here the time index and the amount already sold) that are not part of the queried state and action.
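
One possible workaround, sketched here under strong assumptions, is to copy the environment, overwrite every internal variable that ```step``` reads, and then take a single step on the copy.  ```ToyEnv``` and ```query``` are hypothetical names used for illustration, and the approach only works when all of the hidden variables are known and can be set from outside.

```python
import copy
import random

class ToyEnv:
    """Hypothetical stand-in with hidden internal variables, like the trading env."""

    def __init__(self, T):
        self.T = T
        self.reset()

    def reset(self):
        self.state = [0.5]
        self.timestep = 0
        return self.state

    def step(self, action):
        self.timestep += 1
        price = min(1.0, self.state[0] + random.uniform(0, 0.1))
        self.state = [price]
        done = self.timestep >= self.T
        return self.state, action * price, done, {}

def query(env, internal, action):
    """Sample (s', r, done) from a chosen configuration without disturbing env."""
    sim = copy.deepcopy(env)       # leave the caller's environment untouched
    for name, value in internal.items():
        setattr(sim, name, value)  # must cover every variable that step() reads
    next_state, reward, done, _info = sim.step(action)
    return next_state, reward, done

env = ToyEnv(T=5)
s_next, r, done = query(env, {"state": [0.9], "timestep": 4}, action=1.0)
print(s_next, r, done)          # random price in [0.9, 1.0], matching reward, True
print(env.state, env.timestep)  # [0.5] 0, the original environment is unchanged
```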

## Your Turn

Your goal for this code demo is to pick ``any`` model, implement it, and incorporate it into the ORSuite package.  Note that if you pick a model with a finite set of states and actions, you will be able to test the value iteration based algorithm we develop in the next code demo on the simulator that you implement.  Also make sure to check the environment using the Stable Baselines package.

For some inspiration:
- [Windy Grid World](https://github.com/ibrahim-elshar/gym-windy-gridworlds)
- Stochastic Queueing Network
- More advanced financial models
- [Pandora's Box](https://en.wikipedia.org/wiki/Pandora%27s_box)
- [Online Bin Packing](https://en.wikipedia.org/wiki/Bin_packing_problem)
- [Scheduling Problems](https://en.wikipedia.org/wiki/Scheduling_(computing))