# Stanford CME 241 (Winter 2021) - Assignment 13

In [1]:
import pandas as pd
import numpy as np

from gridWorldEnvironment import GridWorld
gw = GridWorld(gamma = .9, theta = .5)

### SARSA On-policy TD Control

In [14]:
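def state_action_value(env):
    # Initialize Q(s, a) randomly for every (state, action) pair in the transition table.
    q = {}
    for state, action, next_state, reward in env.transitions:
        q[(state, action)] = np.random.normal()
    # State 0 is terminal, so its value is 0 for every action.
    for action in env.actions:
        q[0, action] = 0.
    return q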

def generate_greedy_policy(env, Q):
    # Deterministic policy: probability 1 on the action with the largest Q-value.
    policy = {}
    for state in env.states:
        actions = list(env.actions)
        q_values = [Q[state, a] for a in actions]
        best = np.argmax(q_values)
        prob = [1 if i == best else 0 for i in range(len(actions))]
        policy[state] = (actions, prob)
    return policy

def e_greedy(env, e, q, state):
    # With probability e pick uniformly among all actions, otherwise pick the greedy action.
    actions = env.actions
    action_values = [q[(state, action)] for action in actions]
    best = np.argmax(action_values)
    prob = [e / len(actions)] * len(actions)
    prob[best] += 1 - e
    return np.random.choice(actions, p = prob)

def greedy(env, q, state):
    actions = env.actions
    action_values = []
    for action in actions:
        action_values.append(q[state, action])
    return actions[np.argmax(action_values)]
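
To make the probabilities used by `e_greedy` concrete, here is a minimal standalone sketch. The action names and Q-values are made up for illustration and do not come from `gridWorldEnvironment`; `e_greedy_probs` is a hypothetical helper that only returns the distribution instead of sampling from it.

```python
import numpy as np

def e_greedy_probs(q, state, actions, e):
    """Return the epsilon-greedy action probabilities for one state."""
    values = np.array([q[(state, a)] for a in actions])
    probs = np.full(len(actions), e / len(actions))
    probs[np.argmax(values)] += 1 - e
    return probs

# Hypothetical Q-values for a single state with four actions.
actions = ["U", "D", "L", "R"]
q = {(1, "U"): -1.0, (1, "D"): -2.7, (1, "L"): -1.9, (1, "R"): -3.4}
probs = e_greedy_probs(q, 1, actions, e=0.1)
print(dict(zip(actions, probs)))
```

With `e = 0.1` and four actions, the greedy action `"U"` gets probability 0.925 and each of the others gets 0.025.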

In [8]:
def sarsa(env, epsilon, alpha, num_iter):
    Q = state_action_value(env)
    for _ in range(num_iter):
        # Start each episode from a random state; state 0 is terminal.
        current_state = np.random.choice(env.states)
        current_action = e_greedy(env, epsilon, Q, current_state)
        while current_state != 0:
            next_state, reward = env.state_transition(current_state, current_action)
            # On-policy: the next action comes from the same epsilon-greedy policy.
            next_action = e_greedy(env, epsilon, Q, next_state)
            td_target = reward + env.gamma * Q[next_state, next_action]
            Q[current_state, current_action] += alpha * (td_target - Q[current_state, current_action])
            current_state, current_action = next_state, next_action
    return Q
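
For reference, the update inside the loop is the standard SARSA rule. Written with $\alpha$ for `alpha` and $\gamma$ for `env.gamma`:
$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right] $$
where $A_{t+1}$ is drawn from the same $\epsilon$-greedy policy that generated $A_t$. Because $Q(0, a) = 0$ for the terminal state, the target reduces to $R_{t+1}$ on the last step of an episode.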

In [13]:
values = sarsa(gw, 0.1, 0.5, 10000)
np.array(list(values.values()))

array([-1.90002805, -2.77441382, -3.25421038, -1.        , -3.94494846,
       -4.22697587, -3.64832604, -1.90000032, -3.93100271, -2.74871695,
       -3.72877216, -3.72013637, -1.        , -3.26589619, -2.74641095,
       -1.90160927, -1.90000001, -3.49188953, -4.01193861, -3.28579892,
       -2.71866513, -4.07809399, -3.72830488, -3.70822778, -3.81109512,
       -2.88221503, -3.90667851, -3.60862454, -1.91441847, -4.29019236,
       -3.84262207, -2.84192933, -3.67826806, -3.64673726, -2.89719224,
       -3.56404418, -3.66016655, -1.98192945, -2.83877892, -3.60843753,
       -3.52313767, -1.        , -2.99683135, -3.09334585, -3.79253324,
       -3.92130484, -4.14899919, -3.95698485, -3.96391854, -4.23036559,
       -1.91061882, -3.78570823, -2.81181946, -1.9       , -1.        ,
       -3.95300302,  0.        ,  0.        ,  0.        ,  0.        ])

### Tabular Q-Learning

In [15]:
def q_learning(env, epsilon, alpha, num_iter):
    Q = state_action_value(env)
    for _ in range(num_iter):
        # Start each episode from a random state; state 0 is terminal.
        current_state = np.random.choice(env.states)
        while current_state != 0:
            # Behave epsilon-greedily, but bootstrap from the greedy action (off-policy).
            current_action = e_greedy(env, epsilon, Q, current_state)
            next_state, reward = env.state_transition(current_state, current_action)
            best_action = greedy(env, Q, next_state)
            td_target = reward + env.gamma * Q[next_state, best_action]
            Q[current_state, current_action] += alpha * (td_target - Q[current_state, current_action])
            current_state = next_state
    return Q
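
Since `gridWorldEnvironment` is a local module, the same loop can be tried on a self-contained toy problem. The sketch below is an illustration only: `ChainEnv` is a hypothetical three-state stand-in for `GridWorld`, the reward of -1 per step is an assumption, and Q is initialized to zero rather than randomly.

```python
import numpy as np

class ChainEnv:
    """Hypothetical stand-in for GridWorld: states 1..3 on a line, 0 is terminal."""
    def __init__(self, gamma=0.9):
        self.gamma = gamma
        self.states = [1, 2, 3]
        self.actions = ["L", "R"]

    def state_transition(self, state, action):
        # "L" moves one state toward the terminal state 0;
        # "R" moves away from it and is blocked at state 3.
        next_state = state - 1 if action == "L" else min(state + 1, 3)
        return next_state, -1.0  # assumed cost of 1 per step

def q_learning_chain(env, epsilon, alpha, num_iter, seed=0):
    rng = np.random.default_rng(seed)
    Q = {(s, a): 0.0 for s in env.states + [0] for a in env.actions}
    for _ in range(num_iter):
        state = int(rng.choice(env.states))
        while state != 0:
            # Epsilon-greedy behavior policy.
            if rng.random() < epsilon:
                action = env.actions[rng.integers(len(env.actions))]
            else:
                action = max(env.actions, key=lambda a: Q[state, a])
            next_state, reward = env.state_transition(state, action)
            # Off-policy target: bootstrap from the best action in the next state.
            target = reward + env.gamma * max(Q[next_state, a] for a in env.actions)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q

Q = q_learning_chain(ChainEnv(), epsilon=0.2, alpha=1.0, num_iter=2000)
```

Because the transitions are deterministic, `alpha = 1.0` simply overwrites each estimate with its bootstrapped target, so the table settles on the exact discounted step counts (for example, `Q[1, "L"]` is -1 and `Q[2, "L"]` is -1.9).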

In [16]:
values = q_learning(gw, 0.2, 1.0, 10000)
np.array(list(values.values()))

array([-1.9  , -2.71 , -2.71 , -1.   , -2.71 , -3.439, -3.439, -1.9  ,
       -3.439, -2.71 , -3.439, -2.71 , -1.   , -2.71 , -2.71 , -1.9  ,
       -1.9  , -3.439, -3.439, -1.9  , -2.71 , -2.71 , -2.71 , -2.71 ,
       -3.439, -1.9  , -2.71 , -3.439, -1.9  , -3.439, -3.439, -2.71 ,
       -2.71 , -2.71 , -2.71 , -2.71 , -3.439, -1.9  , -1.9  , -3.439,
       -2.71 , -1.   , -1.9  , -2.71 , -2.71 , -3.439, -2.71 , -3.439,
       -3.439, -2.71 , -1.9  , -3.439, -2.71 , -1.9  , -1.   , -2.71 ,
        0.   ,  0.   ,  0.   ,  0.   ])
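
The Q-learning output contains only five distinct values. They are consistent with a reward of -1 per step and `gamma = .9`: a pair that is $k$ steps from the terminal state under optimal play has value $-(1 + 0.9 + \dots + 0.9^{k-1})$. The per-step reward of -1 is inferred from the printed values, not read from `gridWorldEnvironment`, so treat this as a quick sanity check rather than a statement about the environment:

```python
def k_step_return(k, gamma=0.9, step_reward=-1.0):
    """Discounted return of k consecutive steps that each earn step_reward."""
    return sum(step_reward * gamma**i for i in range(k))

for k in range(1, 5):
    print(k, round(k_step_return(k), 3))
# 1 -1.0
# 2 -1.9
# 3 -2.71
# 4 -3.439
```

The SARSA values are slightly lower because its targets include the occasional exploratory action taken by the $\epsilon$-greedy policy.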

### Markov Decision Process Modeling

Since there are three cases in the question, we need to consider them separately when modeling the Markov decision process.
1. In the case of selling the entire quantity of stock purchased on the previous day, the states can be written as $ (t, x_t, l_t, b_t, s_t) $
<br> where $ t \in \{1, 2, \ldots, T\}$ (date), $x_t \in \mathbb{R}_{\geq 0}$ (net cash), $l_t \in \mathbb{R}_{\geq 0}$ (liability), $b_t \in \mathbb{R}_{\geq 0}$ (unfulfilled withdrawal requests), $s_t \in \mathbb{R}_{\geq 0}$ (stock value)
    - The key consideration in the problem is that the cash $c_t$ remaining after the day's actions must satisfy $c_t \geq c_{\min}$, where $c_{\min}$ solves $c_{\min} = K \cot(\frac{\pi c_{\min}}{2C})$
2. In the case of deciding to increase/reduce the liability
<br> Let $y_t \in \mathbb{R}$ be the amount of change in liability. Then $y_t$ has the following constraints.
    - $y_t \geq - l_t$
    - $ y_t \geq c_{\min} - x_t$
<br> Thus, we can conclude that $ y_t \geq \max(-l_t, c_{\min} - x_t)$
3. In the case of deciding to purchase a certain quantity of stock
<br> Let $z_t \in \mathbb{R}_{\geq 0}$ be the number of stock shares to purchase. Then $z_t$ must satisfy
    - $ 0 \leq z_t \leq \frac{x_t + y_t - c_{\min}}{s_t} $
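
The minimum cash level is defined only implicitly, so it has to be found numerically. The sketch below solves the fixed-point equation by bisection and then evaluates the two action bounds from the list above. The function names and the numbers chosen for $K$, $C$ and the state are hypothetical and serve only as an illustration.

```python
import math

def min_cash(K, C, iters=200):
    """Solve c = K * cot(pi * c / (2 * C)) for c in (0, C) by bisection."""
    def gap(c):
        x = math.pi * c / (2 * C)
        return c - K * math.cos(x) / math.sin(x)
    # gap is increasing, very negative near 0 and equal to C > 0 at c = C.
    lo, hi = 1e-12 * C, C
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def liability_change_lower_bound(x, l, c_min):
    """Smallest admissible y_t: cannot repay more than l_t or drop cash below c_min."""
    return max(-l, c_min - x)

def max_shares(x, y, s, c_min):
    """Largest admissible z_t given cash x_t, liability change y_t and price s_t."""
    return (x + y - c_min) / s

# Made-up parameters and state, for illustration only.
c_min = min_cash(K=1.0, C=10.0)
y_min = liability_change_lower_bound(x=5.0, l=4.0, c_min=c_min)
z_max = max_shares(x=5.0, y=1.0, s=2.0, c_min=c_min)
print(c_min, y_min, z_max)
```

Bisection is used here because the left-hand side minus the right-hand side is monotone in $c$ on $(0, C)$, so the root is unique and bracketed.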

In this MDP, the action is defined as $ (y_t, z_t) $
<br> for $ t < T $, with $ y_T = z_T = 0$ on the final day.
<br> The state transitions can be defined as
$$ x_{t+1} = \max\left(x_t + y_t - z_t s_t - K \cot\left(\frac{\pi \min(x_t + y_t - z_t s_t, C)}{2C}\right) + d_{t+1} - f(t+1, b_t),\ 0\right) $$
$$ b_{t+1} = \max\left(-x_t - y_t + z_t s_t + K \cot\left(\frac{\pi \min(x_t + y_t - z_t s_t, C)}{2C}\right) - d_{t+1} + f(t+1, b_t),\ 0\right) $$
$$ s_{t+1} = g(t+1, s_t) $$
where $d_t \in \mathbb{R}_{\geq 0}$ denotes the deposits on day $t$, $w_t = f(t, b_{t-1})$ for $2 \leq t \leq T$ denotes the withdrawal requests on day $t$, and $s_t = g(t, s_{t-1})$ denotes the stock value on day $t$.
<br> The reward on day $t$, $1 \leq t \leq T-1$, is 0, while the reward on day $T$ is $U(x_T - l_T)$.
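
The cash and backlog transitions are the positive and negative parts of one net-cash quantity, which a short sketch makes explicit. Here `f` and `g` are passed in as callables, and the particular lambdas, parameter values and state used in the example are placeholders rather than anything specified in the problem. The liability update is not written out in the text above, so it is left out of the sketch.

```python
import math

def penalty(cash, K, C):
    """K * cot(pi * min(cash, C) / (2 * C)); requires cash > 0."""
    x = math.pi * min(cash, C) / (2 * C)
    return K * math.cos(x) / math.sin(x)

def step(t, x, b, s, y, z, d_next, f, g, K, C):
    """Return (x_{t+1}, b_{t+1}, s_{t+1}) for state (t, x, b, s) and action (y, z)."""
    cash = x + y - z * s  # cash left after changing liability and buying stock
    net = cash - penalty(cash, K, C) + d_next - f(t + 1, b)
    # A surplus becomes next-day cash; a shortfall becomes unfulfilled withdrawals.
    return max(net, 0.0), max(-net, 0.0), g(t + 1, s)

# Placeholder dynamics, for illustration only.
f = lambda t, b: b + 1.0   # withdrawal requests: backlog plus one new unit
g = lambda t, s: 1.01 * s  # stock value grows by 1% per day
x_next, b_next, s_next = step(t=1, x=5.0, b=0.0, s=2.0, y=1.0, z=1.0,
                              d_next=0.5, f=f, g=g, K=1.0, C=10.0)
print(x_next, b_next, s_next)
```

At most one of `x_next` and `b_next` is positive, which matches the pair of $\max(\cdot, 0)$ expressions in the transitions.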

In order to solve this problem with a reinforcement learning algorithm, here are some points we need to pay attention to:
* Historical data, which includes deposits, withdrawals, and daily stock price movements, can be used to build the above MDP model.
* The state transition probabilities are based on the statistically estimated probabilities of deposits $d_t$, withdrawals $f(t, b_{t-1})$ and stock moves $g(t, s_{t-1})$. We can also consider other explanatory factors for deposits, withdrawals and stock moves to make these estimates more comprehensive. We may need to predict the future movements of these explanatory factors and use them to predict the probabilities of future deposits, withdrawals, and stock moves.
* Then we need to use this MDP model to generate a set of simulation episodes with appropriate sampling from the estimated probability distributions.
* Because the action space $(y_t, z_t)$ is continuous and therefore large, a policy gradient algorithm can be used.
* The policy gradient needs an estimate of the Q-value function, so we may need an appropriate function approximation, for example an Actor-Critic algorithm in which the Critic approximates the Q-value function. Additionally, we need to pay attention to the features we choose for both the Actor and the Critic neural networks.

#####  Reference: CME 241 2020 Final Exam Solution