# Markov Reward Process

![image.png](./image.png)

Here we assume the process starts from $s_1$ and that the terminal state is $s_6$.

In [5]:
import numpy as np

# P[i][j] is the probability of moving from state s_{i+1} to state s_{j+1}; each row sums to 1.
P = [
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.0, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.3, 0.7],
    [0.0, 0.2, 0.3, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]

P = np.array(P)
rewards = [-1, -2, -2, 10, 1, 0]  # reward of states s_1 ... s_6
gamma = 0.5  # discount factor

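def get_return(start_index, chain: list, gamma):
    """Discounted return of `chain`, counted from position `start_index` onward.

    States in `chain` are 1-based (s_1 ... s_6); rewards come from the global `rewards`.
    """
    G = 0
    # Accumulate backwards through the chain: G_t = r_t + gamma * G_{t+1}
    for i in reversed(range(start_index, len(chain))):
        G = gamma * G + rewards[chain[i] - 1]
    return G

# Sample trajectory s_1 -> s_2 -> s_3 -> s_6
chain = [1, 2, 3, 6]
start_index = 0
G = get_return(start_index, chain, gamma)
print("The return is %.2f" % G)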


The return is -2.50
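
As a sanity check on this output, the return of the chain $s_1 \to s_2 \to s_3 \to s_6$ can be expanded by hand. Here $r(s)$ denotes the reward of state $s$ (a notation introduced only for this check), with the rewards and $\gamma = 0.5$ taken from the code above:
$$
G = r(s_1) + \gamma\, r(s_2) + \gamma^2 r(s_3) + \gamma^3 r(s_6) = -1 + 0.5 \cdot (-2) + 0.25 \cdot (-2) + 0.125 \cdot 0 = -2.5
$$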


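For a Markov reward process, the state values satisfy the Bellman equation in matrix form
$$
V = R + \gamma P V
$$
Rearranging to $(I - \gamma P) V = R$ gives the closed-form solution
$$
V = (I-\gamma P)^{-1} R
$$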

In [7]:
def compute_V(P, rewards, gamma):
    """Closed-form state values: V = (I - gamma * P)^{-1} R."""
    rewards = np.array(rewards).reshape((-1, 1))  # column vector R
    state_num = P.shape[0]
    V = np.dot(np.linalg.inv(np.eye(state_num) - gamma * P), rewards)
    return V

V = compute_V(P, rewards, gamma)
print("The state value is\n", V)


The state value is
 [[-2.01950168]
 [-2.21451846]
 [ 1.16142785]
 [10.53809283]
 [ 3.58728554]
 [ 0.        ]]
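
The values above can be cross-checked in two ways: by substituting them back into $V = R + \gamma P V$, and by iterating that equation as a fixed-point update. The following is a minimal sketch of both, not part of the original notebook; the names `V_closed`, `V_iter`, and `bellman_residual` are illustrative, and it uses `np.linalg.solve` rather than an explicit inverse, which is assumed here to be an acceptable substitute.

```python
import numpy as np

# Same transition matrix, rewards, and discount factor as in the cells above.
P = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.0, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.3, 0.7],
    [0.0, 0.2, 0.3, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])
R = np.array([-1, -2, -2, 10, 1, 0], dtype=float)
gamma = 0.5

# Closed form: solve (I - gamma * P) V = R without forming the inverse.
V_closed = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

# Fixed-point iteration V <- R + gamma * P V, starting from zeros.
# With gamma = 0.5 the error at least halves on every sweep.
V_iter = np.zeros_like(R)
for _ in range(100):
    V_iter = R + gamma * P @ V_iter

# How far the closed-form solution is from satisfying the Bellman equation.
bellman_residual = np.max(np.abs(V_closed - (R + gamma * P @ V_closed)))
print("max Bellman residual:", bellman_residual)
print("iteration matches closed form:", np.allclose(V_iter, V_closed))
```

The iterative update avoids solving a linear system, which may matter when the number of states is too large for a direct solve.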
