# Assignment 3 - Problem 3

## Frog Escape Problem as MDP

For the *frog-escape problem*, the state is the lilypad the frog currently sits on, so the state space is $S = \{0, 1, \dots, n\}$, the terminal states are $T = \{0, n\}$, and the action space is $A = \{A, B\}$ (the two croaks).

We have the following transition probabilities $\mathcal{P}(s,a,s')$ for all non-terminal states $i \in N = \{1, \dots, n-1\}$:

$$\mathcal{P}(i,A,i-1) = \frac{i}{n}$$
$$\mathcal{P}(i,A,i+1) = \frac{n-i}{n}$$
$$\mathcal{P}(i,B, s') = \frac{1}{n} \text{ for all } s' \in \{0, \dots, i-1, i+1, \dots, n\}$$

All other transition probabilities are 0. Finally, we set the reward function $\mathcal{R_T}(s,a,s')$ to be:

$$\mathcal{R_T}(i,A,n) = 1 \text{ for all } i \in N $$
$$\mathcal{R_T}(i,B,n) = 1 \text{ for all } i \in N $$

Rewards for landing on all other states are 0. The only way to get any reward is therefore to escape (land on pad $n$), so with no discounting ($\gamma = 1$) the Optimal Value Function gives the maximum probability of escaping the pond.
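
The claim that the Optimal Value Function equals the best achievable escape probability can be illustrated without any RL library. The following is a minimal pure-Python sketch, assuming the transition probabilities and rewards defined above and $\gamma = 1$; the helper names `transitions` and `optimal_escape_probs` are hypothetical and chosen only for this illustration.

```python
from typing import Dict, List


def transitions(i: int, n: int) -> Dict[str, Dict[int, float]]:
    """Next-pad distributions for each croak from non-terminal pad i."""
    return {
        "A": {i - 1: i / n, i + 1: (n - i) / n},
        "B": {s: 1 / n for s in range(n + 1) if s != i},
    }


def optimal_escape_probs(n: int, tol: float = 1e-12) -> List[float]:
    """Value iteration with gamma = 1; v[i] approximates the best escape probability from pad i."""
    # Terminal pads 0 and n keep value 0 because the reward is paid on arrival.
    v = [0.0] * (n + 1)
    for _ in range(100_000):  # safety cap on the number of sweeps (an assumption)
        new_v = v[:]
        for i in range(1, n):
            # Bellman optimality backup: best croak by expected reward plus next value.
            new_v[i] = max(
                sum(p * ((1.0 if s == n else 0.0) + v[s]) for s, p in dist.items())
                for dist in transitions(i, n).values()
            )
        converged = max(abs(a - b) for a, b in zip(new_v, v)) < tol
        v = new_v
        if converged:
            break
    return v


probs = optimal_escape_probs(6)
print([round(p, 4) for p in probs[1:-1]])
```

Because the only nonzero reward is the 1 received on reaching pad $n$, the expected total reward under any policy is exactly the probability that the episode ends at pad $n$ rather than pad 0.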

## Modeling as Finite MDP

In [1]:
from dataclasses import dataclass
from typing import Tuple, Dict
from rl.markov_decision_process import FiniteMarkovDecisionProcess
from rl.markov_decision_process import FinitePolicy, StateActionMapping
from rl.markov_process import FiniteMarkovRewardProcess
from rl.distribution import Categorical, Constant
import itertools
from pprint import pprint


In [2]:
@dataclass(frozen=True)
class PadState:
    pad_num: int


# actions are the croak labels 'A' and 'B', so the action type is str
PadTransMapping = StateActionMapping[PadState, str]

In [22]:
class FrogEscMDP(FiniteMarkovDecisionProcess[PadState, str]):

    def __init__(self, n: int):
        self.n: int = n
        super().__init__(self.get_action_transition_reward_map())

    def get_action_transition_reward_map(self) -> PadTransMapping:
        n: int = self.n
        d: Dict[PadState, Dict[str, Categorical[Tuple[PadState, float]]]] = {}
        # reward for landing on each pad number: 1 for escaping at pad n, 0 otherwise
        rewards: Dict[int, float] = {s: 0.0 for s in range(n)}
        rewards[n] = 1.0

        for state in range(1, n):  # non-terminal states
            # action -> Categorical over (next state, reward) pairs
            d1: Dict[str, Categorical[Tuple[PadState, float]]] = {}

            # croak A: one pad back with probability state/n, one pad forward otherwise
            sr_probs_dict_A: Dict[Tuple[PadState, float], float] = {
                (PadState(state - 1), rewards[state - 1]): state / n,
                (PadState(state + 1), rewards[state + 1]): (n - state) / n
            }
            d1['A'] = Categorical(sr_probs_dict_A)

            # croak B: uniform over every pad except the current one
            sr_probs_dict_B: Dict[Tuple[PadState, float], float] = {
                (PadState(next_state), rewards[next_state]): 1 / n
                for next_state in range(n + 1) if next_state != state
            }
            d1['B'] = Categorical(sr_probs_dict_B)

            d[PadState(state)] = d1

        # terminal states have no actions
        d[PadState(0)] = None
        d[PadState(n)] = None
        return d

In [34]:
# test the model by printing out the transition map
n = 6
frog_mdp: FiniteMarkovDecisionProcess[PadState, str] =\
    FrogEscMDP(n=n)

print("MDP Transition Map")
print("------------------")
print(frog_mdp)

MDP Transition Map
------------------
From State PadState(pad_num=1):
  With Action A:
    To [State PadState(pad_num=0) and Reward 0.000] with Probability 0.167
    To [State PadState(pad_num=2) and Reward 0.000] with Probability 0.833
  With Action B:
    To [State PadState(pad_num=0) and Reward 0.000] with Probability 0.167
    To [State PadState(pad_num=2) and Reward 0.000] with Probability 0.167
    To [State PadState(pad_num=3) and Reward 0.000] with Probability 0.167
    To [State PadState(pad_num=4) and Reward 0.000] with Probability 0.167
    To [State PadState(pad_num=5) and Reward 0.000] with Probability 0.167
    To [State PadState(pad_num=6) and Reward 1.000] with Probability 0.167
From State PadState(pad_num=2):
  With Action A:
    To [State PadState(pad_num=1) and Reward 0.000] with Probability 0.333
    To [State PadState(pad_num=3) and Reward 0.000] with Probability 0.667
  With Action B:
    T
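
Given a transition map like the one above, fixing one croak per non-terminal pad turns the MDP into a Markov reward process whose value function $v$ satisfies the linear system $v = r + Pv$ (with $\gamma = 1$). The following is a minimal sketch of that evaluation using exact rational arithmetic from the standard library instead of the `rl` package; the helper names `next_pad_dist` and `escape_probs` are hypothetical, and the code assumes the transition probabilities defined in the first section.

```python
from fractions import Fraction
from typing import Dict, List, Sequence


def next_pad_dist(i: int, croak: str, n: int) -> Dict[int, Fraction]:
    """Next-pad distribution from non-terminal pad i under croak 'A' or 'B'."""
    if croak == "A":
        return {i - 1: Fraction(i, n), i + 1: Fraction(n - i, n)}
    return {s: Fraction(1, n) for s in range(n + 1) if s != i}


def escape_probs(policy: Sequence[str], n: int) -> List[Fraction]:
    """Solve v = r + P v exactly for pads 1..n-1; policy[i-1] is the croak on pad i."""
    m = n - 1
    # Augmented matrix [I - P | r]; r is the one-step probability of landing on pad n.
    aug = [[Fraction(0)] * (m + 1) for _ in range(m)]
    for i in range(1, n):
        aug[i - 1][i - 1] += 1
        for s, p in next_pad_dist(i, policy[i - 1], n).items():
            if s == n:
                aug[i - 1][m] += p
            elif s != 0:
                aug[i - 1][s - 1] -= p
    # Gauss-Jordan elimination over the rationals.
    for col in range(m):
        pivot = next(r for r in range(col, m) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [x / aug[col][col] for x in aug[col]]
        for r in range(m):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[m] for row in aug]


values = escape_probs(["A"] * 5, n=6)
print([float(v) for v in values])
```

For $n = 6$ with croak A on every pad, the exact escape probabilities from pads 1 to 5 are $5/13$, $6/13$, $1/2$, $7/13$, and $8/13$.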

In [35]:
# Enumerate all 2^(n-1) deterministic policies (one croak per non-terminal pad)
# as instances of the FinitePolicy class.
(max_value, max_policy) = (0, 0)
for i, policy in enumerate(itertools.product(['A', 'B'], repeat=n-1), start=1):
    # policy[pad_num-1] is the croak for pad_num; terminal pad 0 needs no action
    frog_fdp: FinitePolicy[PadState, str] = FinitePolicy(
        {PadState(pad_num): Constant(policy[pad_num-1])
         for pad_num in range(1, n)}
    )

    print(f"Policy Map for Policy {i}")
    print("----------")
    print(frog_fdp)

    # Build the policy-implied finite MRP (apply_finite_policy method of
    # FiniteMarkovDecisionProcess).
    implied_frog_mrp: FiniteMarkovRewardProcess[PadState] =\
        frog_mdp.apply_finite_policy(frog_fdp)

    # Evaluate its Value Function (get_value_function_vec method of
    # FiniteMarkovRewardProcess).
    print(f"Implied MRP Policy Evaluation Value Function for Policy {i}")
    print("--------------")
    value_func = implied_frog_mrp.get_value_function_vec(gamma=1.0)
    pprint(value_func)
    print()

    # Rank policies by the escape probability from pad 1 (first vector entry).
    if value_func[0] > max_value:
        (max_value, max_policy) = (value_func[0], i)
print()
print(f"Policy {max_policy} is the optimal deterministic policy")
    

Policy Map for Policy 1
----------
For State PadState(pad_num=1):
  Do Action A with Probability 1.000
For State PadState(pad_num=2):
  Do Action A with Probability 1.000
For State PadState(pad_num=3):
  Do Action A with Probability 1.000
For State PadState(pad_num=4):
  Do Action A with Probability 1.000
For State PadState(pad_num=5):
  Do Action A with Probability 1.000

Implied MRP Policy Evaluation Value Function for Policy 1
--------------
array([0.38461538, 0.46153846, 0.5       , 0.53846154, 0.61538462])

Policy Map for Policy 2
----------
For State PadState(pad_num=1):
  Do Action A with Probability 1.000
For State PadState(pad_num=2):
  Do Action A with Probability 1.000
For State PadState(pad_num=3):
  Do Action A with Probability 1.000
For State PadState(pad_num=4):
  Do Action A with Probability 1.000
For State PadState(pad_num=5):
  Do Action B with Probability 1.000

Implied MRP Policy Evaluation Value Function for Policy 2
--------------
array([0.21276596, 0.25531915, 0.