## Problem Statement 

The agent starts with an inventory of $N$ shares that it wishes to trade out. Its goal is to complete this task by placing market or limit orders in either direction while maximising the profit earned through the bid/ask spread.

We take into consideration the following market indicators: 
- $Time Remaining$: Time remaining after the time period $T_k$
- $Quantity Remaining$: Quantity of inventory remaining 
- $Volume Imbalance$: The difference between the existing order volume at the best bid and best ask price levels, indicating the liquidity available in each direction
- $Bid Ask Spread$: Difference between the lowest ask price and highest bid price
- $One Period Price Return$: The log-return of the stock price over two consecutive periods
- $T Period Price Return$: The log-return of stock price since the beginning

## Action Space

The agent can take one of the following actions at the beginning of a trade window $T_k$:

- $Market Order$: The agent can place market order in either direction with a given volume $Q$ to buy/sell the stock at the best current price

- $Limit Order$: The agent can place a limit order in either direction with a given volume $Q$ to buy/sell the stock at no worse than price $P$

- $Do Nothing$: Do not take any action at the current trade window
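
The three action types above can be sketched as a small data structure. This is only an illustration: the names `OrderType`, `Side` and `Action` are hypothetical and are not the action encoding used by `abides_gym`. The sketch assumes that a limit order needs a price $P$, that a market order does not carry one, and that $Q$ is a positive integer.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OrderType(Enum):
    MARKET = "market"
    LIMIT = "limit"
    DO_NOTHING = "do_nothing"


class Side(Enum):
    BUY = "buy"
    SELL = "sell"


@dataclass(frozen=True)
class Action:
    """Hypothetical container for one decision taken at the start of a trade window."""

    order_type: OrderType
    side: Optional[Side] = None          # direction of the order
    quantity: int = 0                    # volume Q
    limit_price: Optional[float] = None  # price P, limit orders only

    def __post_init__(self):
        if self.order_type is OrderType.DO_NOTHING:
            # Doing nothing must not carry any order details.
            if self.side is not None or self.quantity != 0 or self.limit_price is not None:
                raise ValueError("DO_NOTHING takes no side, quantity or price")
            return
        if self.side is None or self.quantity <= 0:
            raise ValueError("orders need a side and a positive quantity")
        if self.order_type is OrderType.LIMIT and self.limit_price is None:
            raise ValueError("limit orders need a limit price")
        if self.order_type is OrderType.MARKET and self.limit_price is not None:
            raise ValueError("market orders do not take a limit price")


# Example: sell 10 shares at no worse than 100.5, or sit out the window.
sell_limit = Action(OrderType.LIMIT, Side.SELL, quantity=10, limit_price=100.5)
wait = Action(OrderType.DO_NOTHING)
```

Validating in `__post_init__` keeps malformed actions from ever reaching the environment, which is a design choice of this sketch rather than something the problem statement requires.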

## State Space 

The RL agent observes the market through the state space representation $S_t$.

$$S_t = (timeRem_t, quantityRem_t, volImb_t, spread_t, onePeriodPriceReturn_t, tPeriodPriceReturn_t)$$
where: 
- $timeRem_t = 2* \frac{T-t}{T} - 1$: the time advancement
- $quantityRem_t = 2 * \frac{N - \sum_{i=0}^{t} n_i}{N} - 1$: the inventory advancement, where $n_i$ is the quantity executed in period $i$
- $volImb_t = \frac{Q_{best\_bid} - Q_{best\_ask}}{Q_{best\_bid} + Q_{best\_ask}}$
- $spread_t = P_{ask\_low} - P_{bid\_high}$: bid-ask spread
- $onePeriodPriceReturn_t = \log(\frac{P_t}{P_{t-1}})$
- $tPeriodPriceReturn_t = \log(\frac{P_t}{P_0})$
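
As a rough sketch, the state vector can be computed from raw market quantities as follows. The function name `build_state` and all parameter names are hypothetical. `executed` stands for the cumulative executed quantity up to time $t$, the spread is taken as the lowest ask minus the highest bid (as in the indicator list of the Problem Statement), and returning a zero imbalance for an empty book is an assumption of this sketch.

```python
import math


def build_state(t, T, executed, N,
                q_best_bid, q_best_ask,
                p_best_bid, p_best_ask,
                p_t, p_prev, p_0):
    """Return (timeRem, quantityRem, volImb, spread, onePeriodReturn, tPeriodReturn)."""
    # Time and inventory are rescaled from [0, 1] to [-1, 1].
    time_rem = 2.0 * (T - t) / T - 1.0
    quantity_rem = 2.0 * (N - executed) / N - 1.0

    # Normalised volume imbalance in [-1, 1]; 0.0 if both queues are empty (assumption).
    total_volume = q_best_bid + q_best_ask
    vol_imb = (q_best_bid - q_best_ask) / total_volume if total_volume > 0 else 0.0

    # Lowest ask minus highest bid.
    spread = p_best_ask - p_best_bid

    # Log-returns over one period and since the start of the episode.
    one_period_return = math.log(p_t / p_prev)
    t_period_return = math.log(p_t / p_0)

    return (time_rem, quantity_rem, vol_imb, spread,
            one_period_return, t_period_return)


# At t = 0 with nothing executed, both advancement features equal 1.0.
state = build_state(t=0, T=10, executed=0, N=100,
                    q_best_bid=300, q_best_ask=100,
                    p_best_bid=99.0, p_best_ask=101.0,
                    p_t=100.0, p_prev=100.0, p_0=100.0)
```

With this scaling, $timeRem_t$ and $quantityRem_t$ both move from $1$ at the start of the episode to $-1$ once time or inventory is exhausted.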

## Reward Function

The reward function $R_t$ measures the execution price slippage and quantity. Formally, $R_t$ is defined as: 

$$R_t = \left(1 - \frac{|P_{fill} - P_{arrival}|}{P_{arrival}}\right) \cdot \lambda \frac{N_t}{N}$$

where $\lambda$ is a constant that scales the effect of the quantity component.
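
A minimal sketch of this reward in Python is shown below. The function name `execution_reward` and its parameter names are hypothetical, and the sketch assumes that $N_t$ is the quantity filled at step $t$, which the text does not state explicitly.

```python
def execution_reward(p_fill, p_arrival, n_t, N, lam=1.0):
    """Reward = (1 - relative slippage) * lam * (n_t / N)."""
    # Relative distance between the fill price and the arrival price.
    slippage = abs(p_fill - p_arrival) / p_arrival
    # Quantity component, scaled by the constant lambda.
    quantity_term = lam * n_t / N
    return (1.0 - slippage) * quantity_term


# Filling half of the inventory 1% away from the arrival price:
# (1 - 0.01) * 1.0 * 50 / 100, i.e. approximately 0.495.
r = execution_reward(p_fill=101.0, p_arrival=100.0, n_t=50, N=100, lam=1.0)
```

Because the slippage enters through an absolute value, a fill that is better than the arrival price is penalised exactly as much as one that is worse by the same amount.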

In [3]:
import gym 
import abides_gym

In [4]:
# create environment and set global seed 
env = gym.make(
    "markets-execution-v0",
    background_config="rmsc04",
)

# set the seed 
env.seed(0)

[0]