In [1]:
from lib.util import *
from lib.mdp import *
from lib.policy import *

# Markov Decision Processes

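A Markov decision process (MDP) is a tuple $(S, A, P, R, \gamma)$: a set of states $S$, a finite set of actions $A$, state transition probabilities $P$ with $P(s, s', a) = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$, a reward function $R$ s.t. $R(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$, and a discount factor $\gamma \in [0, 1]$. An MDP is a Markov reward process (MRP) in which transitions and rewards also depend on the chosen action.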

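A policy $\pi$ is a probability distribution over actions given a state: $\pi(a\mid s) = \mathbb{P}[A_t = a \mid S_t = s]$.

A value function $v_\pi$ for a given policy $\pi$ is the expected return from a state $s$ obtained by following $\pi$: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$, where $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ is the discounted return. The action-value function $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$ is the expected return after taking action $a$ in $s$ and following $\pi$ afterwards.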

Bellman Expectation Equations: 
$$v_\pi(s) = \sum_{a\in A} \pi(a\mid s) q_\pi(s,a) $$
$$q_\pi(s,a) = R(s, a) + \gamma \sum_{s' \in S} P(s, s', a) v_\pi(s') $$
$$v_\pi(s) = \sum_{a\in A} \pi(a\mid s) [R(s, a) + \gamma \sum_{s' \in S} P(s, s', a) v_\pi(s')] $$
$$q_\pi(s,a) = R(s, a) + \gamma \sum_{s' \in S} P(s, s', a) [\sum_{a'\in A} \pi(a'\mid s') q_\pi(s',a')] $$
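The expectation equations can be applied repeatedly as an update rule to approximate $v_\pi$. The following is a minimal NumPy sketch, not the `lib.mdp` implementation: the array layout (`P[a, s, s2]` for $P(s, s', a)$, `R[s, a]`, `pi[s, a]`) and the function names are assumptions made for illustration, and convergence is only expected for $\gamma < 1$.

```python
import numpy as np

def q_from_v(P, R, gamma, v):
    """One-step lookahead: q(s, a) = R(s, a) + gamma * sum_s' P(s, s', a) v(s').

    Assumed layout: P has shape (A, S, S) with P[a, s, s2]; R has shape (S, A).
    """
    return R + gamma * np.einsum("ast,t->sa", P, v)

def v_from_q(pi, q):
    """Average q over the policy: v(s) = sum_a pi(a | s) q(s, a)."""
    return (pi * q).sum(axis=1)

def evaluate_policy(P, R, gamma, pi, tol=1e-10, max_iter=10_000):
    """Iterate the Bellman expectation backup until the update is below tol."""
    v = np.zeros(R.shape[0])
    for _ in range(max_iter):
        v_new = v_from_q(pi, q_from_v(P, R, gamma, v))
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v
```

Each pass substitutes the current estimate of $v_\pi$ into the right-hand side of the third equation above, so a fixed point of the loop satisfies the Bellman expectation equation.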

Matrix Form of the Bellman Expectation Equation:
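$$v_\pi = R^\pi + \gamma P^\pi v_\pi $$
where $R^\pi(s) = \sum_{a\in A} \pi(a\mid s) R(s, a)$ and $P^\pi(s, s') = \sum_{a\in A} \pi(a\mid s) P(s, s', a)$.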
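Because the matrix form is linear in $v_\pi$, it can be rearranged to $(I - \gamma P^\pi) v_\pi = R^\pi$, with $P^\pi$ the policy-averaged transition matrix, and solved directly when $\gamma < 1$. The sketch below assumes the same hypothetical array layout as a plain NumPy representation (`P[a, s, s2]`, `R[s, a]`, `pi[s, a]`); it is an illustration, not the notebook's `lib` code.

```python
import numpy as np

def solve_policy_value(P, R, gamma, pi):
    """Solve (I - gamma * P_pi) v = R_pi for v.

    Assumed layout: P is (A, S, S), R is (S, A), pi is (S, A).
    """
    # P_pi[s, s2] = sum_a pi(a | s) * P(s, s2, a)
    P_pi = np.einsum("sa,ast->st", pi, P)
    # R_pi[s] = sum_a pi(a | s) * R(s, a)
    R_pi = (pi * R).sum(axis=1)
    n = R_pi.shape[0]
    # For gamma = 1 the system can be singular, so gamma < 1 is assumed here.
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```

A direct solve costs roughly $O(|S|^3)$, so it suits small state spaces such as the 5-state example used later; iterative evaluation is the usual choice for larger ones.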

Bellman Optimality Equations: 
$$v_*(s) = \max_a{q_*(s,a)} $$
$$q_*(s,a) = R(s, a)  + \gamma \sum_{s' \in S} P(s, s', a) v_*(s') $$
$$v_*(s) = \max_a [R(s, a)  + \gamma \sum_{s' \in S} P(s, s', a) v_*(s')] $$
$$q_*(s,a) = R(s, a)  + \gamma \sum_{s' \in S} P(s, s', a) \max_{a'}{q_*(s',a')} $$
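Unlike the expectation equations, the optimality equations are nonlinear because of the $\max$, so they are usually solved iteratively. Value iteration, sketched below, applies the third optimality equation as an update and then reads off a greedy policy. The function name and array layout (`P[a, s, s2]`, `R[s, a]`) are assumptions for illustration and need not match `mdp.value_iteration()`.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10, max_iter=10_000):
    """Approximate v_* and a greedy policy by repeated optimality backups.

    Assumed layout: P is (A, S, S) with P[a, s, s2]; R is (S, A).
    Returns (v, greedy) where greedy[s] is an action index maximising q(s, a).
    """
    v = np.zeros(R.shape[0])
    for _ in range(max_iter):
        # q(s, a) = R(s, a) + gamma * sum_s' P(s, s', a) v(s')
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        v_new = q.max(axis=1)  # v(s) = max_a q(s, a)
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    # Recompute q from the final v to extract the greedy actions.
    q = R + gamma * np.einsum("ast,t->sa", P, v)
    return v, q.argmax(axis=1)
```

For $\gamma < 1$ the backup is a contraction, so the iterates approach $v_*$ regardless of the starting values; zeros are used here only for simplicity.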

In [2]:
n = 5
gamma = 0.95

In [3]:
P = generate_stochastic_matrix(n)
R = generate_reward_vector(n)
mrp = MRP(P, R, gamma)
mdp = MDP(gamma, [mrp]*n)
Q = generate_stochastic_matrix(n)
policy = Policy(Q)

print(mdp.policy_evaluation(policy))

defaultdict(<class 'float'>, {0: 6.347691649851708, 1: 6.191906306344334, 2: 7.016720981999342, 3: 6.295817241003564, 4: 6.416818280986636})


In [4]:
print(mdp.policy_iteration())

[[1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]] defaultdict(<class 'float'>, {0: 0.274677347427577, 1: 0.2080695222876411, 2: 0.853411241386409, 3: 0.4685763224779896, 4: 0.6952279338170133})
(<lib.policy.DeterministicPolicy object at 0x0000020E7F230358>, defaultdict(<class 'float'>, {0: 0.274677347427577, 1: 0.2080695222876411, 2: 0.853411241386409, 3: 0.4685763224779896, 4: 0.6952279338170133}))


In [5]:
print(mdp.value_iteration())

[5.30136106 5.25460945 5.51760721 5.39206045 5.33744083]
