# Lab 10: Markov Decision Process

## Solving a robot navigation problem with a Markov Decision Process

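A Markov Decision Process (MDP) is a mathematical model of sequential decision making. It models the policies available to an agent, and the rewards the agent can collect, in an environment whose transitions are random and whose state has the Markov property: the next state depends only on the current state and the chosen action.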

### Initialization definition 

Transitions stores the state transition probabilities.

Reward stores the reward for reaching a given state.

Gamma is the discount factor. It keeps the return finite in looping or infinite-horizon Markov decision processes. We set it to 0.9 for this test.

Epsilon is the maximum error allowed in the utility of any state. We set it to 0.001 for this test.

In [1]:
import csv
Transitions = {}
Reward = {}
gamma = 0.9
epsilon = 0.001
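
To see why a discount factor below 1 keeps returns finite, the sketch below sums a long run of constant rewards and compares the result with the closed-form limit r_max / (1 - gamma). The names `discounted_return` and `r_max` are illustrative and are not part of the lab code.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

gamma = 0.9
r_max = 1.0  # assumed upper bound on the per-step reward (illustrative)

# A long run of maximal rewards approaches r_max / (1 - gamma)
# instead of growing without bound.
long_run = discounted_return([r_max] * 1000, gamma)
bound = r_max / (1 - gamma)
print(long_run, bound)
```

With gamma = 0.9 the limit is about 10 times the largest single-step reward; a gamma closer to 1 weights distant rewards more heavily and raises that limit.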

### Define file reading function 

Read transitions and rewards from files.

In [2]:
def read_file():
    # Read transitions from file. Each row is: state, action, next state, probability.
    with open('./data/transitions.csv', 'r') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        for row in reader:
            state, action, next_state, prob = row[0], row[1], row[2], float(row[3])
            # Store a list of (probability, next state) pairs per state and action.
            Transitions.setdefault(state, {}).setdefault(action, []).append((prob, next_state))

    #Read rewards file and save it to a variable.
    with open('./data/rewards.csv', 'r') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        for row in reader:
            Reward[row[0]] = float(row[1]) if row[1] != 'None' else None

read_file()
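
The shape of the resulting dictionary can be easier to see on a few rows. The sketch below parses made-up rows from an in-memory string, using the column order that the reader above indexes (state, action, next state, probability). The row values and the lowercase name `transitions` are hypothetical; the real contents of `transitions.csv` are not shown in this lab.

```python
import csv
import io

# Hypothetical rows: state, action, next state, probability.
sample = io.StringIO(
    "(0 0),U,(0 1),0.8\n"
    "(0 0),U,(1 0),0.1\n"
    "(0 0),U,(0 0),0.1\n"
    "(0 0),R,(1 0),0.8\n"
)

transitions = {}
for state, action, next_state, prob in csv.reader(sample):
    # setdefault creates the nested containers the first time a key is seen.
    transitions.setdefault(state, {}).setdefault(action, []).append(
        (float(prob), next_state)
    )

print(transitions["(0 0)"]["U"])
# [(0.8, '(0 1)'), (0.1, '(1 0)'), (0.1, '(0 0)')]
```

Each state maps to its actions, and each action maps to a list of (probability, next state) pairs, which is the form the value iteration code consumes.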

### Define Markov Decision Process

An MDP consists of states, actions, a transition model and a reward function.

States represent the set of all states of the robot.

Actions represent the set of actions that can be performed in a given state.

Transition represents the probability of moving from one state to another when an action is taken.

Reward represents the reward value for a given state.

In [3]:
class MarkovDecisionProcess:
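    def __init__(self, transition=None, reward=None, gamma=.9):
        # Avoid mutable default arguments, which would be shared across instances.
        transition = {} if transition is None else transition
        reward = {} if reward is None else reward
        # Collect all nodes from the transition model.
        self.states = transition.keys()
        # Initialize transition.
        self.transition = transition
        # Initialize reward.
        self.reward = reward
        # Initialize gamma.
        self.gamma = gamma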

    # Reward for this state.
    def R(self, state):
        return self.reward[state]

    # Set of actions that can be performed in this state.
    def actions(self, state):
        return self.transition[state].keys()

    # For a state and an action, return a list of (probability, result-state) pairs.
    def T(self, state, action):
        return self.transition[state][action]

#Initialize the MarkovDecisionProcess object.
mdp = MarkovDecisionProcess(transition=Transitions, reward=Reward)
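
Before solving, it can help to confirm that the transition model is well formed. The sketch below assumes every (state, action) entry is meant to be a complete probability distribution and reports the entries whose probabilities do not sum to 1. The helper `check_transitions` and the `toy` model are hypothetical and not part of the lab code.

```python
import math

def check_transitions(transition, tol=1e-9):
    """Return (state, action, total) for every entry not summing to 1."""
    problems = []
    for state, by_action in transition.items():
        for action, outcomes in by_action.items():
            total = sum(p for p, _ in outcomes)
            if not math.isclose(total, 1.0, abs_tol=tol):
                problems.append((state, action, total))
    return problems

# Toy model in the same nested shape as Transitions (made-up values).
toy = {
    "a": {"go": [(0.8, "b"), (0.2, "a")]},
    "b": {"go": [(0.5, "a"), (0.4, "b")]},  # deliberately sums to 0.9
}
print(check_transitions(toy))
```

Calling `check_transitions(Transitions)` after `read_file()` would apply the same check to the loaded data, provided the assumption about complete distributions holds for the file.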

### Value iteration

We solve the MDP by value iteration, which proceeds as follows:

(1) Initialize V(s) for each state s.

(2) For each state s, update $V(s) = R(s) + \gamma \max_{a \in A} \sum_{s'} P_{sa}(s') V(s')$.

(3) Repeat step (2) until convergence.
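
Step (2) can be worked through by hand for a single state. The sketch below uses made-up rewards, values and transition probabilities (none of them come from the lab's data files) and performs one update for a state with two actions.

```python
gamma = 0.9

# Made-up numbers for one state s with two actions.
reward_s = 0.0
V = {"s1": 1.0, "s2": 0.5}            # current value estimates of the successors
T_s = {
    "U": [(0.8, "s1"), (0.2, "s2")],  # (probability, next state)
    "R": [(0.1, "s1"), (0.9, "s2")],
}

# Expected successor value for each action, then the max over actions.
q = {a: sum(p * V[s1] for p, s1 in outcomes) for a, outcomes in T_s.items()}
new_value = reward_s + gamma * max(q.values())
print(q, new_value)
```

Here action U has an expected successor value of 0.9 and action R has 0.55, so the update gives roughly 0 + 0.9 * 0.9 = 0.81.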

In [4]:
def value_iteration():
    states = mdp.states
    actions = mdp.actions
    T = mdp.T
    R = mdp.R

    #Initialize value of all the states to 0 (this is k=0 case).
    V1 = {s: 0 for s in states}
    while True:
        V = V1.copy()
        delta = 0
        for s in states:
            #Bellman update, update the utility values.
            V1[s] = R(s) + gamma * max([ sum([p * V[s1] for (p, s1) in T(s, a)]) for a in actions(s)])
            #calculate maximum difference in value
            delta = max(delta, abs(V1[s] - V[s]))

        #Check for convergence, if values converged then return V.
        if delta < epsilon * (1 - gamma) / gamma:
            return V
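
The convergence test compares the largest change in a sweep against epsilon * (1 - gamma) / gamma rather than against epsilon itself. By the standard contraction argument for value iteration, a maximum change below this threshold implies that the updated values are within epsilon of the true utilities. The sketch below only evaluates that threshold for a few discount factors; the extra gamma values are illustrative.

```python
gamma = 0.9
epsilon = 0.001

# Largest per-sweep change that still counts as converged.
threshold = epsilon * (1 - gamma) / gamma
print(threshold)

# The same epsilon gives a much tighter threshold as gamma approaches 1.
for g in (0.5, 0.9, 0.99):
    print(g, epsilon * (1 - g) / g)
```

With the settings used in this lab the threshold is about 0.00011, roughly nine times stricter than epsilon.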


### Define calculation of the best policy

Given an MDP and utility values V, determine the best policy as a mapping from state to action.

For each state s, set $\pi(s) = \arg\max_{a \in A} \sum_{s'} P_{sa}(s') V(s')$.

In [5]:
def best_policy(V):
    states = mdp.states
    actions = mdp.actions
    pi = {}
    for s in states:
        pi[s] = max(actions(s), key=lambda a: expected_utility(a, s, V))
    return pi


Compute the expected utility of taking action a in state s, according to the MDP and V.

In [6]:
def expected_utility(a, s, V):
    T = mdp.T
    # Probability-weighted sum of the successor state values.
    return sum([p * V[s1] for (p, s1) in T(s, a)])
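
The policy step keeps the action that achieves the maximum rather than the maximum itself. The sketch below shows this on a single state with made-up values and transitions; `expected_value` is a standalone stand-in for `expected_utility` that does not depend on the global `mdp` object.

```python
# Made-up values and transitions for a single state, for illustration only.
V = {"s1": 1.0, "s2": -1.0}
T_s = {
    "safe":  [(1.0, "s1")],
    "risky": [(0.5, "s1"), (0.5, "s2")],
}

def expected_value(outcomes, V):
    """Probability-weighted value of the successor states."""
    return sum(p * V[s1] for p, s1 in outcomes)

# arg max over actions: max() with a key returns the action, not its value.
best_action = max(T_s, key=lambda a: expected_value(T_s[a], V))
print(best_action)  # safe
```

The "safe" action has expected value 1.0 and the "risky" one 0.0, so the greedy policy selects "safe".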

### Main function

In [7]:
if __name__ == '__main__':
    # Run value iteration and print the value of each state.
    V = value_iteration()
    print('State - Value')
    for s in V:
        print(s, ' - ', V[s])
    # Derive and print the greedy policy for those values.
    pi = best_policy(V)
    print('\nOptimal policy is \nState - Action')
    for s in pi:
        print(s, ' - ', pi[s])

State - Value
(3 0)  -  0.12987274656746342
(3 1)  -  -1.0
(1 0)  -  0.25386699846479516
(2 1)  -  0.48644001739269643
(1 2)  -  0.649585681261095
(2 0)  -  0.3447542300124158
(3 2)  -  1.0
(2 2)  -  0.7953620878466678
(0 1)  -  0.3984432178350045
(0 0)  -  0.2962883154554812
(0 2)  -  0.5093943765842497

Optimal policy is 
State - Action
(3 0)  -  L
(3 1)  -  EXIT
(1 0)  -  R
(2 1)  -  U
(1 2)  -  R
(2 0)  -  U
(3 2)  -  EXIT
(2 2)  -  R
(0 1)  -  U
(0 0)  -  U
(0 2)  -  R
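
The flat listing above can be easier to read when laid out as a grid. The sketch below copies the printed policy and renders it under two assumptions that the output itself does not confirm: the first number in each key is the column x, and the second is the row y, increasing upward. The helper `render` is hypothetical, and "#" marks a cell with no entry in the policy.

```python
# Policy copied from the output above; keys are "(x y)" strings.
policy = {
    "(3 0)": "L", "(3 1)": "EXIT", "(1 0)": "R", "(2 1)": "U",
    "(1 2)": "R", "(2 0)": "U", "(3 2)": "EXIT", "(2 2)": "R",
    "(0 1)": "U", "(0 0)": "U", "(0 2)": "R",
}

def render(policy, width=4, height=3):
    """Lay the policy out as rows of text, highest y first (assumed layout)."""
    rows = []
    for y in reversed(range(height)):
        cells = [policy.get(f"({x} {y})", "#") for x in range(width)]
        rows.append(" ".join(f"{c:>4}" for c in cells))
    return "\n".join(rows)

print(render(policy))
```

Under these assumptions the top row reads R, R, R, EXIT, which matches the values above: they increase toward state (3 2), whose value is 1.0.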
