# CME 241 Assignment 4

## Shaan Patel

### Question 1

Given we know $v_0$, we can now find $v_1$ through the value iteration method.

$$ v_1(s) = B^*(v_0)(s) = \max_{a \in A} \Big[ R(s,a) + \gamma \sum_{s' \in N} P(s,a,s') \, v_0(s') \Big] $$

No discount factor appears in the sums below, i.e. $\gamma = 1$. For $s_1$ we get

$$ \max_{a \in A} [R(s_1,a) + \gamma \sum_{s' \in N} P(s_1,a,s') v_0(s')] $$
$$ = 8 + (0.2 * 10 + 0.6*1 + 0.2*0) = 10.6$$ 
for $a_1$ and
$$ = 10 + (0.1*10 + 0.2*1 + 0.7*0) = 11.2$$ 

for $a_2$. So $v_1(s_1) = 11.2$. For $s_2$ we have

$$ 1 + (0.3*10 + 0.3*1 + 0.4*0) = 4.3$$ 
for $a_1$ and
$$ -1 + (0.5*10 + 0.3*1 + 0.2*0) = 4.3$$ 
for $a_2$. So $v_1(s_2) = 4.3$. $v_1(s_3) = 0$ because it is the terminal state and has no rewards.

The greedy policy $\pi_1$ picks $a_2$ in $s_1$. In $s_2$ the two actions tie at 4.3, so either one is greedy, and we can take $\pi_1(s) = a_2$ for both states.

For $v_2$ we do a similar step. In the $s_1$ case,

$$v_2(s_1) = 8 + (0.2*11.2 + 0.6*4.3 + 0) = 12.82 $$
for $a_1$ and
$$ = 10 + (0.1*11.2 + 0.2*4.3 + 0) = 11.98 $$

for $a_2$. So $v_2(s_1) = 12.82$, attained by action $a_1$. In the $s_2$ case,

$$v_2(s_2) = 1 + (0.3*11.2 + 0.3*4.3 + 0) = 5.65$$
for $a_1$ and
$$ -1 + (0.5*11.2 + 0.3*4.3 + 0) = 5.89 $$
for $a_2$. So $v_2(s_2) = 5.89$, attained by action $a_2$.
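
The two backups above can be checked numerically. The following is a minimal sketch, not part of the original solution: the rewards, transition probabilities, and $v_0$ are read off the arithmetic above, $\gamma = 1$ is assumed because no discount factor appears in the sums, and the names `REWARD`, `TRANSITION`, and `backup` are hypothetical.

```python
from typing import Dict, Tuple

GAMMA = 1.0  # assumption: the hand calculations apply no discount
ACTIONS = ("a1", "a2")

# R(s, a) and P(s, a, s') as they appear in the worked sums; s3 is terminal.
REWARD: Dict[Tuple[str, str], float] = {
    ("s1", "a1"): 8.0, ("s1", "a2"): 10.0,
    ("s2", "a1"): 1.0, ("s2", "a2"): -1.0,
}
TRANSITION: Dict[Tuple[str, str], Dict[str, float]] = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.6, "s3": 0.2},
    ("s1", "a2"): {"s1": 0.1, "s2": 0.2, "s3": 0.7},
    ("s2", "a1"): {"s1": 0.3, "s2": 0.3, "s3": 0.4},
    ("s2", "a2"): {"s1": 0.5, "s2": 0.3, "s3": 0.2},
}


def backup(v: Dict[str, float]) -> Tuple[Dict[str, float], Dict[str, str]]:
    """Apply one Bellman optimality backup; return new values and greedy actions."""
    new_v: Dict[str, float] = {"s3": 0.0}  # terminal state keeps value 0
    greedy: Dict[str, str] = {}
    for s in ("s1", "s2"):
        q = {
            a: REWARD[(s, a)]
            + GAMMA * sum(p * v[nxt] for nxt, p in TRANSITION[(s, a)].items())
            for a in ACTIONS
        }
        greedy[s] = max(q, key=q.get)  # ties resolve to whichever max() finds first
        new_v[s] = q[greedy[s]]
    return new_v, greedy


v0 = {"s1": 10.0, "s2": 1.0, "s3": 0.0}
v1, pi1 = backup(v0)
v2, pi2 = backup(v1)
print({s: round(x, 2) for s, x in v1.items()})
print({s: round(x, 2) for s, x in v2.items()}, pi2)
```

Because the two actions tie in $s_2$ at the first iteration, the greedy action reported there depends on floating-point rounding and should not be read as meaningful.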

We can see that the value of each non-terminal state increases with each iteration. As such, the action that puts more probability on the high-value states, and on $s_1$ in particular, gains the most from each increase. In the case of $s_1$, action $a_1$ is more likely than $a_2$ to lead back to both $s_1$ (0.2 vs. 0.1) and $s_2$ (0.6 vs. 0.2). So once $a_1$ overtakes $a_2$ at the second iteration, it keeps giving the higher value in every later iteration.

For $s_2$, the probability of reaching $s_2$ is the same for both actions (0.3), but the chance of reaching $s_1$ is higher with $a_2$ (0.5 vs. 0.3). Thus, as $v_k(s_1)$ grows over the iterations, $a_2$ yields a larger sum than $a_1$.

As a result, $\pi_k(s)$ for $k > 2$ is equal to $\pi_2(s)$, which is $a_1$ for $s_1$ and $a_2$ for $s_2$.
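
This stability claim can be made explicit by writing out the gap between the two action values in each state. The following is a sketch, assuming $\gamma = 1$ as in the calculations and writing $q_{k+1}(s,a) = R(s,a) + \sum_{s'} P(s,a,s') v_k(s')$ (notation introduced here, not in the original solution):

$$ q_{k+1}(s_1,a_1) - q_{k+1}(s_1,a_2) = -2 + 0.1 \, v_k(s_1) + 0.4 \, v_k(s_2) $$

$$ q_{k+1}(s_2,a_2) - q_{k+1}(s_2,a_1) = -2 + 0.2 \, v_k(s_1) $$

At $k = 1$ these gaps are $-2 + 1.12 + 1.72 = 0.84$ and $-2 + 2.24 = 0.24$, matching $12.82 - 11.98$ and $5.89 - 5.65$. Since $v_1 \geq v_0$ in every state and the Bellman optimality operator is monotone, $v_{k+1} \geq v_k$ for all $k$, so both gaps remain positive for every $k \geq 1$.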

### Question 4

In [1]:
from dataclasses import dataclass
from typing import Tuple, Dict, Mapping
from rl.markov_decision_process import FiniteMarkovDecisionProcess
from rl.policy import FiniteDeterministicPolicy
from rl.markov_process import FiniteMarkovProcess, FiniteMarkovRewardProcess
from rl.distribution import Categorical
from scipy.stats import poisson

In [None]:
@dataclass(frozen=True)
class InventoryState:
    on_hand: int
    on_order: int

    def inventory_position(self) -> int:
        return self.on_hand + self.on_order


InvOrderMapping = Mapping[
    InventoryState,
    Mapping[int, Categorical[Tuple[InventoryState, float]]]
]


class SimpleInventoryMDPCap(FiniteMarkovDecisionProcess[InventoryState, int]):

    def __init__(
        self,
        capacity: int,
        poisson_lambda: float,
        holding_cost: float,
        stockout_cost: float
    ):
        self.capacity: int = capacity
        self.poisson_lambda: float = poisson_lambda
        self.holding_cost: float = holding_cost
        self.stockout_cost: float = stockout_cost

        self.poisson_distr = poisson(poisson_lambda)
        super().__init__(self.get_action_transition_reward_map())

    def get_action_transition_reward_map(self) -> InvOrderMapping:
        d: Dict[InventoryState, Dict[int, Categorical[Tuple[InventoryState,
                                                            float]]]] = {}

        for alpha in range(self.capacity + 1):
            for beta in range(self.capacity + 1 - alpha):
                state: InventoryState = InventoryState(alpha, beta)
                ip: int = state.inventory_position()
                base_reward: float = - self.holding_cost * alpha
                d1: Dict[int, Categorical[Tuple[InventoryState, float]]] = {}

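                # Orders are capped so that the inventory position never
                # exceeds capacity
                for order in range(self.capacity - ip + 1):
                    # Demand i < ip: no stockout, ip - i units stay on hand
                    sr_probs_dict: Dict[Tuple[InventoryState, float], float] =\
                        {(InventoryState(ip - i, order), base_reward):
                         self.poisson_distr.pmf(i) for i in range(ip)}

                    # Demand >= ip: all inventory is used up
                    probability: float = 1 - self.poisson_distr.cdf(ip - 1)
                    # The term multiplying stockout_cost equals
                    # E[max(demand - ip, 0)], the expected units short
                    reward: float = base_reward - self.stockout_cost *\
                        (probability * (self.poisson_lambda - ip) +
                         ip * self.poisson_distr.pmf(ip))
                    sr_probs_dict[(InventoryState(0, order), reward)] = \
                        probability
                    d1[order] = Categorical(sr_probs_dict)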

                d[state] = d1
        return d
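
The stockout term in `reward` is a closed form: `probability * (poisson_lambda - ip) + ip * pmf(ip)` equals the expected shortfall $E[\max(D - ip, 0)]$ for Poisson demand $D$. The sketch below checks this against a truncated brute-force sum using only the standard library, so it runs without `rl` or `scipy`. The helper names, the test values of $\lambda$, and the truncation point `cutoff` are assumptions made for illustration.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability mass P(D = k) for rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def closed_form_shortfall(ip: int, lam: float) -> float:
    """Same expression as the stockout term in the reward above."""
    # P(D >= ip), i.e. 1 - cdf(ip - 1)
    probability = 1.0 - sum(poisson_pmf(i, lam) for i in range(ip))
    return probability * (lam - ip) + ip * poisson_pmf(ip, lam)


def brute_force_shortfall(ip: int, lam: float, cutoff: int = 100) -> float:
    """E[max(D - ip, 0)] summed directly, truncated at `cutoff` (assumed large enough)."""
    return sum((d - ip) * poisson_pmf(d, lam) for d in range(ip, cutoff))


# Hypothetical test values; the tail beyond the cutoff is negligible for these rates.
for lam in (1.0, 2.5):
    for ip in range(6):
        closed = closed_form_shortfall(ip, lam)
        brute = brute_force_shortfall(ip, lam)
        assert math.isclose(closed, brute, rel_tol=1e-9, abs_tol=1e-12)
        print(f"lambda={lam}, ip={ip}: expected shortfall = {closed:.6f}")
```

The identity follows from $d \cdot P(D = d) = \lambda \cdot P(D = d - 1)$ for a Poisson variable, which turns the tail sum into the two-term expression used in the code.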