## Problem 3


Part 1: $\textbf{With proper mathematical notation, model this as a finite MDP specifying the states, actions, rewards, state-transition probabilities and discount factor.}$

First, each state consists of the dice on the table, listed in ascending order, together with a pair of values describing our hand: the number of 1's in the hand (denoted $I_{1}$) and the sum of the dice in the hand.

\begin{equation*}
\begin{gathered}
    \mathcal{S} = \{ [i_{1}, ..., i_{l}] , (I_{1}, \sum_{k=l+1}^{N} i_{k}) \} \\
    0 \leq l \leq N  \\
    1 \leq i_{p} \leq K \\
    i_{k-1} \leq i_{k} \leq i_{k+1}
\end{gathered}
\end{equation*}

The terminal states are those with no dice left on the table:

\begin{equation*}
    \mathcal{T} = \{ [], (I_{1}, \sum_{k=1}^{N} i_{k}) \}
\end{equation*}
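To make the state encoding concrete, here is a minimal sketch of how a raw table and hand could be collapsed into the pair described above. The helper name `encode_state` is hypothetical and only for illustration; it is not part of the implementation in Part 2.

```python
from typing import Tuple

State = Tuple[Tuple[int, ...], Tuple[int, int]]


def encode_state(table: Tuple[int, ...], hand: Tuple[int, ...]) -> State:
    """Return (sorted table dice, (number of 1's in hand, sum of hand))."""
    sorted_table = tuple(sorted(table))
    hand_summary = (sum(1 for d in hand if d == 1), sum(hand))
    return sorted_table, hand_summary


# Three dice on the table, three dice already in the hand
print(encode_state((3, 1, 2), (1, 4, 1)))  # ((1, 2, 3), (2, 6))

# A terminal state has an empty table
print(encode_state((), (1, 1, 4)))  # ((), (2, 6))
```

Two different hands with the same count of 1's and the same sum map to the same state, which is what keeps the state space small.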

An action selects which dice to move from the table (i.e. the first list in the state) to the hand, which fixes the values of those dice. We encode an action as a binary vector with one entry per die on the table, where $j_{k} = 1$ means that the $k$-th table die is moved to the hand. At least one die must be moved on every turn:

\begin{equation*}
\begin{gathered}
    \mathcal{A}(s) = \{ (j_{1}, ..., j_{l}) \}  \\
    j_{k} \in \{0,1\} \\
    \sum_{k=1}^{l} j_{k} \geq 1
\end{gathered}
\end{equation*}
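As a quick illustration of the size of this action set, the sketch below enumerates the binary selection vectors for a table of a given size, assuming (as in the Part 2 code) that every action must move at least one die. The name `legal_actions` is illustrative only.

```python
import itertools
from typing import List, Tuple


def legal_actions(num_table_dice: int) -> List[Tuple[int, ...]]:
    """All 0/1 selection vectors of the given length that select at least one die."""
    return [
        a
        for a in itertools.product([0, 1], repeat=num_table_dice)
        if sum(a) >= 1
    ]


actions = legal_actions(3)
print(len(actions))  # 7, i.e. 2**3 - 1
print(actions[0])    # (0, 0, 1)
```

A table with $l$ dice therefore has $2^{l} - 1$ available actions.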

The reward is zero for all nonterminal states, so we only specify the reward for the terminal states $t \in \mathcal{T}$:

\begin{equation*}
    R(t) = 
    \begin{cases}
        0, & \sum_{k=1}^{N} I(i_{k}) < C  \\
        \sum_{k=1}^{N} i_{k}, & \textrm{otherwise}
    \end{cases}
\end{equation*}

where $I(i_{k})$ is the indicator function that equals 1 if die $k$ shows a 1 and 0 otherwise.


The transition probabilities vary depending on the action taken by our agent, and are shown below:

\begin{equation*}
\begin{gathered}
    \mathcal{P}(s,a,s^{'}) = (\frac{1}{K})^{\pi(a)} \cdot c_{1} \\
    \pi(a) = \textrm{number of dice left on the table after action } a \textrm{, all of which are re-rolled} \\
    c_{1} = \textrm{number of ordered rolls of those dice that sort to the table of } s^{'}
\end{gathered}
\end{equation*}
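The count $c_{1}$ can be obtained without enumerating every ordered roll: for $n$ re-rolled dice it is the multinomial coefficient $n! / \prod_{v} m_{v}!$, where $m_{v}$ is the number of dice showing value $v$ in the sorted outcome. The sketch below computes the resulting probabilities under that assumption; the helper name `sorted_roll_probs` is hypothetical and is not part of the Part 2 implementation, which counts by enumeration instead.

```python
import itertools
from collections import Counter
from math import factorial
from typing import Dict, Tuple


def sorted_roll_probs(num_dice: int, sides: int) -> Dict[Tuple[int, ...], float]:
    """Probability of each sorted outcome when rolling `num_dice` fair dice."""
    probs: Dict[Tuple[int, ...], float] = {}
    faces = range(1, sides + 1)
    for combo in itertools.combinations_with_replacement(faces, num_dice):
        # c_1: number of ordered rolls that sort to `combo`
        ways = factorial(num_dice)
        for multiplicity in Counter(combo).values():
            ways //= factorial(multiplicity)
        probs[combo] = ways / sides ** num_dice
    return probs


probs = sorted_roll_probs(num_dice=2, sides=4)
print(probs[(1, 1)], probs[(1, 2)])  # 0.0625 0.125
```

Summing the probabilities over all sorted outcomes gives 1, which is a useful sanity check on any transition map built this way.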


Part 2: $\textbf{Implement this MDP in python}$

In [2]:
import numpy as np
import itertools
from dataclasses import dataclass
from typing import Optional, Mapping, Dict, Tuple, List
from rl import markov_process
from rl.distribution import Categorical, Constant, FiniteDistribution
from rl.markov_process import FiniteMarkovProcess, NonTerminal, MarkovRewardProcess, FiniteMarkovRewardProcess
from rl.markov_decision_process import FiniteMarkovDecisionProcess
from rl.policy import FiniteDeterministicPolicy
from rl.dynamic_programming import value_iteration_result

In [33]:
@dataclass(frozen=True)
class DieGame:
    table: Tuple
    hand: Tuple
                

ActionMapping = Mapping[DieGame, Mapping[List, Categorical[Tuple[DieGame, float]]]]
        
        
        
class die_MDP(FiniteMarkovDecisionProcess[DieGame, List]):
    sides: int
    die: int
    cutoff: int
    
    def __init__(self, K: int, N: int, C: int):
        self.sides = K
        self.die = N
        self.cutoff = C
        super().__init__(self.get_action_transition_reward_map())
        
    def get_hand_state(self, die_list: Tuple):
        ones = [x for x in die_list if x == 1]
        return (len(ones), sum(die_list))
        
    def get_unique_hand_states(self, hand_list: List[Tuple]):
        unique_states = dict()
        for hl in hand_list:
            hand_state = self.get_hand_state(hl)
            if hand_state in unique_states:
                unique_states[hand_state] += 1
            else:
                unique_states[hand_state] = 1
        return unique_states
    
    def get_unique_table_states(self, table_list: List[Tuple]):
        unique_states = dict()
        for tl in table_list:
            table_state = self.get_table_state(tl)
            if table_state in unique_states:
                unique_states[table_state] += 1
            else:
                unique_states[table_state] = 1
        return unique_states
    
    def get_table_state(self, die_list: Tuple):
        sorted_state = sorted(list(die_list))
        sorted_state = tuple(sorted_state)
        return sorted_state
            
        
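    def reward_function(self, die_list: Tuple):
        # die_list is the hand summary (number of 1's, sum of dice in hand);
        # the hand scores its sum only if it holds at least `cutoff` 1's
        if die_list[0] >= self.cutoff: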
            return die_list[1]
        else:
            return 0
        
    def delete_multiple_element(self, list_object, indices):
        indices = sorted(indices, reverse = True)
        for idx in indices:
            if idx < len(list_object):
                list_object.pop(idx)
        
        
    def get_action_transition_reward_map(self) -> ActionMapping:
        d: Dict[DieGame, Dict[List, Categorical[Tuple[DieGame, float]]]] = {}
        # Enumerate every (dice on table, dice in hand) split with a non-empty table
        splits = list(itertools.product(range(self.die+1), repeat = 2))
        splits = [x for x in splits if sum(x) == self.die and x[0] != 0]
        
        for split in splits:

            table_list = list(itertools.product(range(1,self.sides+1), repeat = split[0]))
            hand_list = list(itertools.product(range(1,self.sides+1), repeat = split[1]))
            
            # Idea: collapse to the unique table and hand states to shrink the loops below
            hand_state = self.get_unique_hand_states(hand_list)
            table_state = self.get_unique_table_states(table_list)
            print(split)
            
            for tl, _ in table_state.items():
                for hl, _ in hand_state.items():    
                        
                    state = DieGame(tl, hl)
            
                    d1: Dict[List, Categorical[Tuple[DieGame, float]]] = {}
                    
                    actions = list(itertools.product([0,1], repeat = len(tl)))
                    actions = [x for x in actions if sum(x) >= 1]
                    #need to remove action where we don't move any die
                    
                    for action in actions:
                        state_prob_map: Dict[Tuple[DieGame, float], float] = {}
                            
                        #For this action first swap die to hand, then throw die again
                        tl_a = list(tl)
                        hl_a = list(hl)
                        swaplist = list()
                        for idx, swap in enumerate(action):
                            if swap == 1:
                                val = tl_a[idx]
                                swaplist.append(idx)
                                hl_a[1] += val
                                if val == 1:
                                    hl_a[0] += 1
                                
                        self.delete_multiple_element(tl_a, swaplist)
                        hl_a = tuple(hl_a)
                        tl_a = self.get_table_state(tl_a)
                                
                        #First check that new table list is not empty, otherwise calc reward
                        if len(tl_a) == 0:
                            reward = self.reward_function(hl_a)
                            state_prob_map[(DieGame(tl_a,hl_a), reward)] = 1
                        else:                            
                            
                            # Re-roll the dice left on the table; `count` is the number of
                            # ordered rolls that sort to `new_state_table`
                            new_table_list = list(itertools.product(range(1,self.sides+1), repeat = len(tl_a)))
                            new_table_states = self.get_unique_table_states(new_table_list)
                            for new_state_table, count in new_table_states.items():
                                state_prob_map[(DieGame(new_state_table, hl_a), 0)] = \
                                                pow((1./self.sides), len(new_state_table))*count
                        
                        d1[action] = Categorical(state_prob_map)
                    
                    
                    d[state] = d1
        return d
                    
        

In [34]:
N = 6
K = 4
C = 2
user_gamma = 1.0

die_mdp = die_MDP(K = K, N = N, C = C)



(1, 5)
(2, 4)
(3, 3)
(4, 2)
(5, 1)
(6, 0)


In [35]:
opt_vf_vi, opt_policy_vi = value_iteration_result(die_mdp, gamma = user_gamma)

# Calculate Expected Score of the game playing optimally:
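This is equivalent to asking for the average of the optimal value function over the possible starting states, i.e. those with an empty hand. To do this, we sum, over all initial states, the probability of that state multiplied by its optimal value. The cell below weights each distinct sorted roll equally, by `1/len(table_states)`; weighting each one by its multiplicity divided by $K^{N}$ would instead account for some sorted rolls being more likely than others.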

In [36]:
# To get all of the initial starting states, enumerate every first roll with itertools and collapse to the unique sorted tables
table_list = list(itertools.product(range(1,K+1), repeat = N))
table_states = die_mdp.get_unique_table_states(table_list)


expected_score = 0
hl = (0,0)
for tl, _ in table_states.items():
    starting_state = NonTerminal(DieGame(tl, hl))
    expected_score += (1.0/len(table_states))*opt_vf_vi[starting_state]

print("Expected Score of game: ",expected_score)

Expected Score of game:  15.216421169450586
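A minimal sketch of the multiplicity-weighted average over starting rolls follows. The function name `expected_start_value` and the `value_of` callback are hypothetical; the stand-in value function in the example (the sum of the dice) is only there to make the sketch checkable and is not the optimal value function.

```python
import itertools
from collections import Counter
from typing import Callable, Tuple


def expected_start_value(
    num_dice: int,
    sides: int,
    value_of: Callable[[Tuple[int, ...]], float],
) -> float:
    """Average value_of(sorted roll), weighting each sorted roll by its probability."""
    # Count how many ordered rolls collapse to each sorted table
    counts = Counter(
        tuple(sorted(roll))
        for roll in itertools.product(range(1, sides + 1), repeat=num_dice)
    )
    total = sides ** num_dice
    return sum(count / total * value_of(table) for table, count in counts.items())


# Sanity check: the expected sum of two fair 4-sided dice is 2 * 2.5
print(expected_start_value(2, 4, lambda table: float(sum(table))))  # 5.0
```

With the objects defined in this notebook, `value_of` could be `lambda tl: opt_vf_vi[NonTerminal(DieGame(tl, (0, 0)))]`, with `num_dice=N` and `sides=K`. The resulting value is not reported here because it has not been re-run.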


# Calculate the optimal action when rolling {1,2,2,3,3,4} on the first roll



In [39]:
#print(opt_policy_vi)
# Printing the full policy would make the PDF unreadable; please run this cell to check the answer given below

Thus, the optimal policy for the state with (1,2,2,3,3,4) on the table and nothing in the hand is the action (1,0,0,0,0,0). In other words, we should transfer only the 1 to our hand. This makes sense: keeping the 1 moves us toward the required number of 1's, which is $C = 2$ in our setup. We do not pocket the 4, even though it is the highest value a die can show, because every extra die we keep leaves fewer dice to re-roll and so makes it harder to reach the required number of 1's.