# CS486 - Artificial Intelligence
## Lesson 14 - Markov Decision Processes

*Expectimax* is a way to search a tree for the best action when outcomes are uncertain. In practice, however, we can rarely search an expectimax tree all the way down to its leaves.

**Markov Decision Processes** (MDPs) are a way of formulating problems so that we can use an *expectimax* approach to establish a **policy**: a lookup of the optimal action for each state, so we don't have to perform a search every time.
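
In the usual textbook notation (assumed here, not notation this lesson defines), an MDP is a set of states, a set of actions, a transition model, a reward function, and a discount factor, and a policy maps each state to an action:

```latex
\text{MDP} = (S, A, P, R, \gamma), \qquad
P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a), \qquad
\pi : S \to A
```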

In [None]:
import helpers
from aima.mdp import *  # assumed module: MDP, value_iteration and best_policy live in aima's mdp module, not text
from aima.notebook import psource

## A *Draw HiLo* Policy

Last time we used `expectimax` to decide which action to choose for a given draw. Our implementation limited us to a depth of 5 draws because, in theory, the game tree is infinite.

Instead of running `expectimax` every time a new card is drawn, let's see if we can use an MDP to create a **policy** that lists the best action to take in any given state. Here's what the transition graph looks like for a draw of 5: <img src='images/hilo.svg'>

In [None]:
psource(MDP)

In [None]:
class MDP:

    """A Markov Decision Process, defined by an initial state, transition model,
    and reward function. We also keep track of a gamma value, for use by
    algorithms. The transition model is represented somewhat differently from
    the text. Instead of P(s' | s, a) being a probability number for each
    state/state/action triplet, we instead have T(s, a) return a
    list of (p, s') pairs. We also keep track of the possible states,
    terminal states, and actions for each state. [page 646]"""

    def __init__(self, init, actlist, terminals, transitions=None, reward=None, states=None, gamma=0.9):
        if not (0 < gamma <= 1):
            raise ValueError("An MDP must have 0 < gamma <= 1")

        # collect states from transitions table if not passed.
        self.states = states or self.get_states_from_transitions(transitions)
            
        self.init = init
        
        # actlist is either a list (every state shares the same actions)
        # or a dict mapping each state to its own list of actions.
        if not isinstance(actlist, (list, dict)):
            raise TypeError("actlist must be a list or a dict")
        self.actlist = actlist
        
        self.terminals = terminals
        self.transitions = transitions or {}
        if not self.transitions:
            print("Warning: Transition table is empty.")

        self.gamma = gamma

        self.reward = reward or {s: 0 for s in self.states}

        # self.check_consistency()

    def R(self, state):
        """Return a numeric reward for this state."""

        return self.reward[state]

    def T(self, state, action):
        """Transition model. From a state and an action, return a list
        of (probability, result-state) pairs."""
        if not self.transitions:
            raise ValueError("Transition model is missing")
        else:
            return self.transitions[state][action]

    def actions(self, state):
        """Return a list of actions that can be performed in this state. By default, a
        fixed list of actions, except for terminal states. Override this
        method if you need to specialize by state."""

        if state in self.terminals:
            return [None]
        else:
            return self.actlist

    def get_states_from_transitions(self, transitions):
        if isinstance(transitions, dict):
            s1 = set(transitions.keys())
            s2 = set(tr[1] for actions in transitions.values()
                     for effects in actions.values()
                     for tr in effects)
            return s1.union(s2)
        else:
            print('Could not retrieve states from transitions')
            return None

    def check_consistency(self):

        # check that all states in transitions are valid
        assert set(self.states) == self.get_states_from_transitions(self.transitions)

        # check that init is a valid state
        assert self.init in self.states

        # check reward for each state
        assert set(self.reward.keys()) == set(self.states)

        # check that all terminals are valid states
        assert all(t in self.states for t in self.terminals)

        # check that probability distributions for all actions sum to 1
        for s1, actions in self.transitions.items():
            for a in actions.keys():
                s = 0
                for o in actions[a]:
                    s += o[0]
                assert abs(s - 1) < 0.001
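
To make the `T(s, a)` convention from the docstring concrete, here is a minimal standalone sketch. The two-state table and the helper name `distributions_sum_to_one` are made up for illustration; only the nested `{state: {action: [(p, s'), ...]}}` layout and the sum-to-1 check are taken from the class above.

```python
# Hypothetical two-state example; the state and action names are made up.
transitions = {
    "sunny": {"wait": [(0.8, "sunny"), (0.2, "rainy")]},
    "rainy": {"wait": [(0.4, "sunny"), (0.6, "rainy")]},
}

def T(state, action):
    """Return the list of (probability, next_state) pairs for (state, action)."""
    return transitions[state][action]

def distributions_sum_to_one(table, tol=1e-3):
    """Mirror the last check in check_consistency: each distribution sums to 1."""
    return all(
        abs(sum(p for p, _ in outcomes) - 1) < tol
        for actions in table.values()
        for outcomes in actions.values()
    )

# Collect every state mentioned as a source or as a result, in the same
# spirit as get_states_from_transitions.
states = set(transitions) | {
    s1
    for actions in transitions.values()
    for outcomes in actions.values()
    for _, s1 in outcomes
}
```

Calling `T("sunny", "wait")` returns the two `(p, s')` pairs for that state and action, which is exactly the shape `value_iteration` iterates over.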

In [None]:
def value_iteration(mdp, epsilon=0.001):
    """Solving an MDP by value iteration. [Figure 17.4]"""

    U1 = {s: 0 for s in mdp.states}
    R, T, gamma = mdp.R, mdp.T, mdp.gamma
    while True:
        U = U1.copy()
        delta = 0
        for s in mdp.states:
            U1[s] = R(s) + gamma * max(sum(p*U[s1] for (p, s1) in T(s, a))
                                                   for a in mdp.actions(s))
            delta = max(delta, abs(U1[s] - U[s]))
        if delta <= epsilon*(1 - gamma)/gamma:
            return U
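
The assignment inside the `for` loop is the Bellman update. Written in the notation of the code (with $U_i$ for `U`, $U_{i+1}$ for `U1`, and $T(s, a)$ returning $(p, s')$ pairs), it reads:

```latex
U_{i+1}(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{(p,\, s') \in T(s, a)} p \, U_i(s')
```

The loop stops once the largest change in any state's utility is small enough:

```latex
\delta = \max_{s} \lvert U_{i+1}(s) - U_i(s) \rvert \le \frac{\epsilon \, (1 - \gamma)}{\gamma}
```

Note that for $\gamma = 1$ the right-hand side is 0, so the loop ends only when no utility changes at all between two sweeps.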

In [None]:
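# Reward is 1 in the "win" state, -1 in the "lose" state, and 0 for every card state.
rewards = {"win": 1, "lose": -1}
# After a win the only move is to draw again; "lose" is terminal.
actions = {"win": ["draw"], "lose": [None]}
transitions = {"win": {"draw": []}, "lose": {None: []}}

for card in range(1, 14):
    rewards[card] = 0
    actions[card] = ["higher", "lower"]
    # Drawing from "win" yields each of the 13 ranks with equal probability.
    transitions["win"]["draw"].append([1/13, card])
    # (13-card) ranks are higher and (card-1) are lower; a repeat of the
    # same rank (1/13) is not modeled, so each distribution sums to 12/13.
    transitions[card] = {
        "higher": [[(13-card)/13, "win"], [(card-1)/13, "lose"]],
        "lower":  [[(13-card)/13, "lose"], [(card-1)/13, "win"]]
    }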

class HiLo(MDP):
    def __init__(self):
        MDP.__init__(
            self,
            init="win", 
            actlist=actions,
            terminals=["lose"], 
            transitions=transitions, 
            reward=rewards, 
            states=None, 
            gamma=1)

    def actions(self,state):
        return self.actlist[state]
    
mdp = HiLo()
U = value_iteration(mdp)  # utility of every state
best_policy(mdp, U)       # best action per state (shown as the cell output)
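
`best_policy` is imported rather than defined in this notebook. As a rough sketch of what a policy-extraction step typically does, each state picks the action with the highest expected utility. The names `extract_policy` and `expected_utility` and the toy numbers below are illustrative assumptions, not the library's source.

```python
def expected_utility(action, state, U, T):
    """Expected utility of taking `action` in `state`, given utilities U."""
    return sum(p * U[s1] for p, s1 in T[state][action])

def extract_policy(T, U):
    """Map each state that has actions to its highest-expected-utility action."""
    return {
        state: max(actions, key=lambda a: expected_utility(a, state, U, T))
        for state, actions in T.items()
    }

# Toy inputs (made up for illustration): utilities and a transition table
# in the same {state: {action: [(p, s'), ...]}} layout used by the MDP class.
U = {"a": 0.0, "good": 1.0, "bad": -1.0}
T = {
    "a": {
        "safe": [(1.0, "a")],
        "risky": [(0.7, "good"), (0.3, "bad")],
    }
}

policy = extract_policy(T, U)
```

Here `"risky"` has expected utility 0.7 - 0.3 = 0.4 against 0 for `"safe"`, so the extracted policy chooses `"risky"` in state `"a"`.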

## Value Iteration

A depth 1 `expectimax` search from every state worked great to create a *Draw HiLo* policy, but when is that not sufficient? Consider the following MDP: 





Why exactly did we need `expectimax` to search to depth 5 in our tree? Do future or past draws affect which action is optimal for the current draw? They don't, just as the result of the last coin flip has no bearing on the next. This is the **Markov property**.
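
Stated formally (standard notation, assumed here rather than taken from the lesson), the Markov property says the next state depends only on the current state and action, not on the history that led there:

```latex
P(S_{t+1} = s' \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0) = P(S_{t+1} = s' \mid S_t, A_t)
```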

In the case of *Draw HiLo*, running an `expectimax` to depth 1 from every state will produce the optimal action for each state. Together, the set of optimal actions for all states is called a **policy**. Here is the policy for *Draw HiLo*:

| Draw | Action | Expected Return |
| - | - | - |
| 1 | H | 1.000 |
| 2 | H | 1.008 |
| 3 | H | 1.000 |
| 4 | L | 0.975 |
| 5 | L | 0.933 |
| 6 | H | 0.875 |
| 7 | - | 0.900 |
| 8 | L | 0.833 |
| 9 | H | 1.000 |
| 10 | H | 1.000 |
| J | L | 0.833 |
| Q | L | 1.000 |
| K | L | 1.000 |
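
The depth-1 lookahead described above can be sketched without any search library. This is a minimal sketch under stated assumptions: ranks run from 1 to 13 (J, Q, K as 11, 12, 13), a guess is scored by its win probability as in the `HiLo` transition table, and the helper names are hypothetical. Since those rules are assumptions, its output need not match every row of the table above.

```python
RANKS = range(1, 14)  # 1 through 13, with J, Q, K as 11, 12, 13

def win_probability(card, action):
    """Chance the next card makes the guess correct; a repeated rank wins neither."""
    higher = (13 - card) / 13  # ranks strictly above `card`
    lower = (card - 1) / 13    # ranks strictly below `card`
    return higher if action == "higher" else lower

def one_step_policy():
    """For each card, pick the guess with the larger win probability.

    Ties (only the middle card, 7) go to "higher" because max() keeps
    the first maximal element.
    """
    return {
        card: max(["higher", "lower"], key=lambda a: win_probability(card, a))
        for card in RANKS
    }

policy = one_step_policy()
```

Looking up `policy[card]` then replaces a fresh search each time a card is drawn, which is the point of building a policy in the first place.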