# CME 241 Assignment 4

## Shaan Patel

### Question 1

Given we know $v_0$, we can now find $v_1$ through the value iteration method.

$$ v_1(s) = B^*(v_0)(s) = \max_{a \in A} \left[ R(s,a) + \gamma \sum_{s' \in N} P(s,a,s') \, v_0(s') \right] $$

For $s_1$ we get

$$ v_1(s_1) = \max_{a \in A} [R(s_1,a) + \gamma \sum_{s' \in N} P(s_1,a,s')v_0(s')] $$
where the action value is
$$ 8 + (0.2 * 10 + 0.6*1 + 0.2*0) = 10.6$$
for $a_1$ and
$$ 10 + (0.1*10 + 0.2*1 + 0.7*0) = 11.2$$

for $a_2$. So $v_1(s_1) = 11.2$. For $s_2$ we have

$$ 1 + (0.3*10 + 0.3*1 + 0.4*0) = 4.3$$
for $a_1$ and
$$ -1 + (0.5*10 + 0.3*1 + 0.2*0) = 4.3$$
for $a_2$. So $v_1(s_2) = 4.3$. $v_1(s_3) = 0$ because $s_3$ is the terminal state and earns no rewards.

The greedy policy $\pi_1$ picks $a_2$ in $s_1$ and is indifferent between $a_1$ and $a_2$ in $s_2$, so we can take $\pi_1(s) = a_2$ for both states.
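The backup above can be checked numerically. The following is a minimal sketch, not part of the `rl` library: the rewards and transition probabilities are read off the arithmetic above, while $\gamma = 1$ and $v_0 = (10, 1, 0)$ are assumptions inferred from that arithmetic. The names `REWARD`, `PROB`, and `bellman_backup` are hypothetical.

```python
# Model data read off the hand calculation above.
# ASSUMPTION: gamma = 1 and v_0 = (10, 1, 0) are inferred from that arithmetic.
GAMMA = 1.0
V0 = {"s1": 10.0, "s2": 1.0, "s3": 0.0}

REWARD = {
    ("s1", "a1"): 8.0, ("s1", "a2"): 10.0,
    ("s2", "a1"): 1.0, ("s2", "a2"): -1.0,
}
PROB = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.6, "s3": 0.2},
    ("s1", "a2"): {"s1": 0.1, "s2": 0.2, "s3": 0.7},
    ("s2", "a1"): {"s1": 0.3, "s2": 0.3, "s3": 0.4},
    ("s2", "a2"): {"s1": 0.5, "s2": 0.3, "s3": 0.2},
}


def bellman_backup(v, gamma=GAMMA):
    """Apply one Bellman optimality backup; return (new values, action values)."""
    q = {
        (s, a): REWARD[(s, a)]
        + gamma * sum(p * v[nxt] for nxt, p in PROB[(s, a)].items())
        for (s, a) in REWARD
    }
    new_v = {"s3": 0.0}  # the terminal state keeps value 0
    for s in ("s1", "s2"):
        new_v[s] = max(q[(s, "a1")], q[(s, "a2")])
    return new_v, q


v1, q1 = bellman_backup(V0)
print(v1)  # values after one backup
print(q1)  # action values the maximum is taken over
```

The sketch returns the action values alongside the new value function because the two actions tie in $s_2$ at this step, so the maximizing action there is not unique.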

For $v_2$ we repeat the same step with $v_1$ in place of $v_0$. In the $s_1$ case, the action value is

$$ 8 + (0.2*11.2 + 0.6*4.3 + 0) = 12.82 $$
for $a_1$ and
$$ 10 + (0.1*11.2 + 0.2*4.3 + 0) = 11.98 $$

for $a_2$. So $v_2(s_1) = 12.82$, attained by action $a_1$. In the $s_2$ case, the action value is

$$ 1 + (0.3*11.2 + 0.3*4.3 + 0) = 5.65$$
for $a_1$ and
$$ -1 + (0.5*11.2 + 0.3*4.3 + 0) = 5.89 $$
for $a_2$. So $v_2(s_2) = 5.89$, attained by action $a_2$.

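We can see that the value function of each state increases with each iteration: $v_1 \geq v_0$ in every state and $B^*$ is monotone, so $v_{k+1} \geq v_k$ for all $k$. Which action wins in a state therefore depends on how much each action's transition probabilities weight the growing values. In the case of $s_1$, action $a_1$ has a lower reward than $a_2$ (8 versus 10) but a higher probability of reaching both $s_1$ (0.2 versus 0.1) and $s_2$ (0.6 versus 0.2). The action value of $a_1$ minus that of $a_2$ is $-2 + 0.1 \, v_k(s_1) + 0.4 \, v_k(s_2)$, which equals $0.84$ at $v_1$ and can only grow as the values increase. As such, action 1 gives a higher value than action 2 in every later iteration.

For $s_2$, the probability of reaching state 2 is the same for both actions (0.3), but the probability of reaching state 1 is higher with action 2 (0.5 versus 0.3), at a reward that is lower by 2. The action value of $a_2$ minus that of $a_1$ is $-2 + 0.2 \, v_k(s_1)$, which is zero at $v_0(s_1) = 10$ (the tie in the first iteration) and positive once $v_k(s_1) > 10$. Since $v_1(s_1) = 11.2$ and the values keep increasing, action 2 gives the higher value in every later iteration.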

As a result, $\pi_k(s)$ for $k > 2$ is equal to $\pi_2(s)$, which is $a_1$ for $s_1$ and $a_2$ for $s_2$.
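The claim that the greedy policy stops changing can also be checked by running more iterations. The sketch below is an illustration under the same assumptions as the hand calculation ($\gamma = 1$ and $v_0 = (10, 1, 0)$, both inferred from the arithmetic rather than stated). The helper names `backup_with_policy` and `run` are hypothetical and do not come from the `rl` library.

```python
# ASSUMPTION: gamma = 1 and v_0 = (10, 1, 0), inferred from the hand calculation.
GAMMA = 1.0
V0 = {"s1": 10.0, "s2": 1.0, "s3": 0.0}

REWARD = {
    ("s1", "a1"): 8.0, ("s1", "a2"): 10.0,
    ("s2", "a1"): 1.0, ("s2", "a2"): -1.0,
}
PROB = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.6, "s3": 0.2},
    ("s1", "a2"): {"s1": 0.1, "s2": 0.2, "s3": 0.7},
    ("s2", "a1"): {"s1": 0.3, "s2": 0.3, "s3": 0.4},
    ("s2", "a2"): {"s1": 0.5, "s2": 0.3, "s3": 0.2},
}


def backup_with_policy(v, gamma=GAMMA):
    """One Bellman optimality backup; return (new values, greedy action per state)."""
    new_v, greedy = {"s3": 0.0}, {}
    for s in ("s1", "s2"):
        q = {
            a: REWARD[(s, a)]
            + gamma * sum(p * v[nxt] for nxt, p in PROB[(s, a)].items())
            for a in ("a1", "a2")
        }
        greedy[s] = max(q, key=q.get)  # ties are broken arbitrarily
        new_v[s] = q[greedy[s]]
    return new_v, greedy


def run(num_iters):
    """Return [(v_1, pi_1), (v_2, pi_2), ...] starting from V0."""
    v, history = dict(V0), []
    for _ in range(num_iters):
        v, greedy = backup_with_policy(v)
        history.append((v, greedy))
    return history


history = run(50)
for k, (v, greedy) in enumerate(history[:5], start=1):
    print(k, v, greedy)
```

Only iterations $k \geq 2$ are meaningful for comparing policies here, since the two actions tie in $s_2$ at $k = 1$ and floating-point rounding decides which one `max` returns.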

### Question 4

In [16]:
from dataclasses import dataclass
from typing import Tuple, Dict, Mapping
from rl.markov_decision_process import FiniteMarkovDecisionProcess
from rl.policy import FiniteDeterministicPolicy
from rl.markov_process import FiniteMarkovProcess, FiniteMarkovRewardProcess
from rl.distribution import Categorical
from rl.dynamic_programming import policy_iteration_result, value_iteration_result
from scipy.stats import poisson

In [None]:
@dataclass(frozen=True)
class InventoryState:
    on_hand1: int
    on_order1: int
    on_hand2: int
    on_order2: int

    def inventory_position1(self) -> int:
        return self.on_hand1 + self.on_order1
    
    def inventory_position2(self) -> int:
        return self.on_hand2 + self.on_order2


# An action is a tuple (order for store 1, order for store 2, transfer between stores)
TwoStoreAct = Tuple[int, int, int]

InvOrderMapping = Mapping[
    InventoryState,
    Mapping[TwoStoreAct, Categorical[Tuple[InventoryState, float]]]
]

In [13]:
class ComplexInventoryMDPCap(FiniteMarkovDecisionProcess[InventoryState, TwoStoreAct]):

    def __init__(
        self,
        capacity1: int,
        capacity2: int,
        poisson_lambda1: float,
        poisson_lambda2: float,
        holding_cost1: float,
        holding_cost2: float,
        stockout_cost1: float,
        stockout_cost2: float,
        transport_cost: float,
        transfer_cost: float
    ):
        self.capacity1: int = capacity1
        self.capacity2: int = capacity2
        self.poisson_lambda1: float = poisson_lambda1
        self.poisson_lambda2: float = poisson_lambda2
        self.holding_cost1: float = holding_cost1
        self.holding_cost2: float = holding_cost2
        self.stockout_cost1: float = stockout_cost1
        self.stockout_cost2: float = stockout_cost2
        self.transport_cost: float = transport_cost
        self.transfer_cost: float = transfer_cost

        self.poisson_distr1 = poisson(poisson_lambda1)
        self.poisson_distr2 = poisson(poisson_lambda2)
        super().__init__(self.get_action_transition_reward_map())

    def get_action_transition_reward_map(self) -> InvOrderMapping:
        d: Dict[InventoryState, Dict[TwoStoreAct, Categorical[Tuple[InventoryState,
                                                            float]]]] = {}

        for alpha1 in range(self.capacity1 + 1):
            for beta1 in range(self.capacity1 + 1 - alpha1):
                for alpha2 in range(self.capacity2 + 1):
                    for beta2 in range(self.capacity2 - alpha2):
                        state: InventoryState = InventoryState(alpha1, beta1, alpha2, beta2)
                        ip1: int = state.inventory_position1()
                        ip2: int = state.inventory_position2()
                        base_reward: float = - self.holding_cost1 * alpha1 - self.holding_cost2 * alpha2
                        d1: Dict[TwoStoreAct, Categorical[Tuple[InventoryState, float]]] = {}

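                        # Enumerate actions: units ordered for each store, plus units
                        # transferred between stores (transfer > 0 moves stock from
                        # store 2 to store 1, transfer < 0 from store 1 to store 2)
                        for order1 in range(self.capacity1 - ip1 + 1):
                            for order2 in range(self.capacity2 - ip2 + 1):
                                for transfer in range(-alpha1, alpha2 + 1):
                                    order: TwoStoreAct = (order1, order2, transfer)
                                    added_reward = base_reward

                                    # Fixed transport cost when either store places an order
                                    if order1 > 0 or order2 > 0:
                                        added_reward = added_reward - self.transport_cost
                                    # Transfer cost is charged here only when transfer > 0
                                    if transfer > 0:
                                        added_reward = added_reward - self.transfer_cost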
                                    
                                    sr_probs_dict: Dict[Tuple[InventoryState, float], float] =\
                                        {(InventoryState(ip1 + transfer - i, order1, ip2 - transfer - j, order2), added_reward):
                                        self.poisson_distr1.pmf(i)*self.poisson_distr2.pmf(j) for i in range(ip1) for j in range(ip2)}
                                    
                                    for j in range(ip2):
                                        prob1: float = (1 - self.poisson_distr1.cdf(ip1 + transfer - 1))*self.poisson_distr2.pmf(j)
                                        reward1: float = added_reward - self.stockout_cost1 *\
                                            (prob1 * (self.poisson_lambda1 - ip1) +
                                            ip1 * self.poisson_distr1.pmf(ip1))
                                        sr_probs_dict[(InventoryState(0,order1, ip2 - transfer - j, order2)), reward1] = \
                                            prob1
                                    
                                    for i in range(ip1):
                                        prob2: float = (1 - self.poisson_distr2.cdf(ip2 - transfer - 1))*self.poisson_distr1.pmf(i)
                                        reward2: float = added_reward - self.stockout_cost2 *\
                                            (prob2 * (self.poisson_lambda2 - ip2) +
                                            ip2 * self.poisson_distr2.pmf(ip2))
                                        sr_probs_dict[(InventoryState(ip1 + transfer - i, order1, 0, order2)), reward2] = \
                                            prob2
                                    
                                    zeroprob: float = (1 - self.poisson_distr1.cdf(ip1 + transfer - 1)) *\
                                        (1 - self.poisson_distr2.cdf(ip2 - transfer - 1))
                                    zeroreward: float = added_reward - self.stockout_cost1 *\
                                        (zeroprob * (self.poisson_lambda1 - ip1) +
                                        ip1 * self.poisson_distr1.pmf(ip1)) - self.stockout_cost2 *\
                                        (zeroprob * (self.poisson_lambda2 - ip2) +
                                        ip2 * self.poisson_distr2.pmf(ip2))
                                    sr_probs_dict[(InventoryState(0, order1, 0, order2)), zeroreward] =\
                                        zeroprob

                                    d1[order] = Categorical(sr_probs_dict)

                        d[state] = d1
        return d

In [14]:
two_stores: FiniteMarkovDecisionProcess[InventoryState, TwoStoreAct] =\
    ComplexInventoryMDPCap(
        capacity1= 2,
        capacity2= 4,
        poisson_lambda1= 1,
        poisson_lambda2= 2,
        holding_cost1= 1,
        holding_cost2= 2,
        stockout_cost1= 10,
        stockout_cost2= 20,
        transport_cost= 5,
        transfer_cost= 2
    )

gamma = 1.0

result = policy_iteration_result(two_stores, gamma)

In [15]:
print(result)

({NonTerminal(state=InventoryState(on_hand1=0, on_order1=0, on_hand2=0, on_order2=0)): -55.0, NonTerminal(state=InventoryState(on_hand1=0, on_order1=0, on_hand2=0, on_order2=1)): -52.41986724186063, NonTerminal(state=InventoryState(on_hand1=0, on_order1=0, on_hand2=0, on_order2=2)): -36.69508941781737, NonTerminal(state=InventoryState(on_hand1=0, on_order1=0, on_hand2=0, on_order2=3)): -28.21999369744554, NonTerminal(state=InventoryState(on_hand1=0, on_order1=0, on_hand2=1, on_order2=0)): -49.46680354085582, NonTerminal(state=InventoryState(on_hand1=0, on_order1=0, on_hand2=1, on_order2=1)): -38.69508941781737, NonTerminal(state=InventoryState(on_hand1=0, on_order1=0, on_hand2=1, on_order2=2)): -30.21999369744554, NonTerminal(state=InventoryState(on_hand1=0, on_order1=0, on_hand2=2, on_order2=0)): -36.00829277047814, NonTerminal(state=InventoryState(on_hand1=0, on_order1=0, on_hand2=2, on_order2=1)): -32.21999369744554, NonTerminal(state=InventoryState(on_hand1=0, on_order1=0, on_hand2