### Gridworld

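1. Problem: a piece moves on a 5x5 grid under a few rules. Compute the value function for each position (state).
2. Setup
    - state: position of the piece
    - action: up, down, left, right
    - reward: bumping into a wall (-1), ordinary move (0), jumping to a destination (A to A': +10, B to B': +5)
3. Learning objectives
    - Policy evaluation
        - Bellman equation
        - Compute the value function by implementing the Bellman optimality equation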

In [14]:
from collections import defaultdict
import numpy as np

In [40]:
WORLD_SIZE = 5
A_POS = (0, 1)
A_PRIME_POS = [4, 1]
B_POS = (0, 3)
B_PRIME_POS = [2, 4]
UP, DOWN, LEFT, RIGHT = 0, 1, 2, 3 

class Gridworld:
    def __init__(self):
        self.nA = 4
        self.nS = WORLD_SIZE * WORLD_SIZE

        # P[s][a] = [probability, next_state, reward, done]; one slot per action.
        P = defaultdict(lambda: [[] for _ in range(self.nA)])
        for i in range(WORLD_SIZE):
            for j in range(WORLD_SIZE):
                s = (i, j)
                if i == 0:
                    P[s][UP] = [1.0, [i, j], -1.0, False]
                else:
                    P[s][UP] = [1.0, [i-1, j], 0.0, False]
                if i == WORLD_SIZE - 1:
                    P[s][DOWN] = [1.0, [i, j], -1.0, False]
                else:
                    P[s][DOWN] = [1.0, [i+1, j], 0.0, False]
                if j == 0:
                    P[s][LEFT] = [1.0, [i, j], -1.0, False]
                else:
                    P[s][LEFT] = [1.0, [i, j-1], 0.0, False]
                if j == WORLD_SIZE - 1:
                    P[s][RIGHT] = [1.0, [i, j], -1.0, False]
                else:
                    P[s][RIGHT] = [1.0, [i, j+1], 0.0, False]
                # From A (or B), every action jumps to A' (or B') with reward +10 (or +5).
                if s == A_POS:
                    P[s][UP] = P[s][DOWN] = P[s][LEFT] = P[s][RIGHT] = [1.0, A_PRIME_POS, 10, False]
                if s == B_POS:
                    P[s][UP] = P[s][DOWN] = P[s][LEFT] = P[s][RIGHT] = [1.0, B_PRIME_POS, 5, False]
        self.P = P

In [1]:
def random_policy():
    return np.array([0.25, 0.25, 0.25, 0.25])

### Bellman equation 
The equation below is the **Bellman equation**, which gives the value function for a given policy. Let's plug in the random policy and solve the problem.

$$v_{\pi}(s) = \displaystyle \sum_{a} \pi(a \vert s) \sum_{s', r} p(s',r \vert s, a) \big [r+\gamma v_{\pi}(s') \big], \text{ for all } s \in S : \text{ Bellman equation for }v_{\pi}$$
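
To make the equation concrete, here is a minimal sketch of a single backup for the corner state (0, 0) on the first sweep, when every $v_{\pi}(s')$ is still 0. It assumes the rewards from the problem statement and a discount of 0.9; `outcomes` and `v_corner` are illustrative names, not part of the notebook.

```python
# One Bellman backup for the corner state (0, 0) with all values still at 0.
# Assumption: wall bump = -1, ordinary move = 0, discount factor = 0.9.
gamma = 0.9
policy = [0.25, 0.25, 0.25, 0.25]  # uniform random policy over UP, DOWN, LEFT, RIGHT

# (reward, current value of next state) per action; UP and LEFT hit the wall.
outcomes = [(-1.0, 0.0), (0.0, 0.0), (-1.0, 0.0), (0.0, 0.0)]

# Transitions are deterministic, so the inner sum over (s', r) has one term.
v_corner = sum(pi_a * (r + gamma * v_next)
               for pi_a, (r, v_next) in zip(policy, outcomes))
print(v_corner)
```

Two of the four equally likely actions cost -1 and the other two cost nothing, so the first update for this state is -0.5. Later sweeps pull in the values of neighbouring states through the discounted term.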

In [44]:
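def bellman_equation(env, policy, discount_factor=0.9, theta=1e-4):
    """Iterative policy evaluation: sweep all states until the largest update is below theta."""
    V = np.zeros([WORLD_SIZE, WORLD_SIZE])
    while True:
        delta = 0
        for i in range(WORLD_SIZE):
            for j in range(WORLD_SIZE):
                s = (i, j)
                v = 0
                # Expected one-step return under the policy (each action has one outcome).
                for a in range(env.nA):
                    action_prob = policy[a]
                    prob, next_state, reward, done = env.P[s][a]
                    v += action_prob * prob * (reward + discount_factor*V[next_state[0]][next_state[1]])
                delta = max(delta, np.abs(v - V[s]))
                V[s] = v  # in-place update: later states in this sweep see the new value
        if delta < theta:
            break
    return V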

In [2]:
env = Gridworld()
policy = random_policy()
value = bellman_equation(env, policy)

In [46]:
value

array([[ 3.28592667,  8.7660323 ,  4.13828967,  4.51705437,  1.05916427],
       [ 1.4880908 ,  2.93075511,  2.08192546,  1.60127383,  0.29403413],
       [ 0.01982242,  0.68938662,  0.58255279,  0.22363303, -0.53666674],
       [-0.99833352, -0.46932881, -0.40594874, -0.65335652, -1.25519416],
       [-1.87832649, -1.37114695, -1.26420916, -1.46641581, -2.02242922]])
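
Because the policy is fixed, the Bellman equation is a linear system in the 25 state values, so the iterative result can be cross-checked with a direct solve of $(I - \gamma P_{\pi}) v = r_{\pi}$. The sketch below is an illustration under assumptions, not part of the original notebook: `step`, `P_pi`, `r_pi`, and `v_exact` are hypothetical names, and the transition rules are restated from the `Gridworld` class so the block runs on its own.

```python
import numpy as np

N, GAMMA = 5, 0.9
A, A_PRIME, B, B_PRIME = (0, 1), (4, 1), (0, 3), (2, 4)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # UP, DOWN, LEFT, RIGHT

def step(s, move):
    """Hypothetical helper mirroring the transition table: (next_state, reward)."""
    if s == A:
        return A_PRIME, 10.0
    if s == B:
        return B_PRIME, 5.0
    i, j = s[0] + move[0], s[1] + move[1]
    if 0 <= i < N and 0 <= j < N:
        return (i, j), 0.0
    return s, -1.0  # bumped into a wall: stay in place

# Build the policy-averaged transition matrix and reward vector (random policy).
P_pi = np.zeros((N * N, N * N))
r_pi = np.zeros(N * N)
for i in range(N):
    for j in range(N):
        for move in MOVES:
            (ni, nj), reward = step((i, j), move)
            P_pi[i * N + j, ni * N + nj] += 0.25
            r_pi[i * N + j] += 0.25 * reward

# Solve (I - gamma * P_pi) v = r_pi in one shot.
v_exact = np.linalg.solve(np.eye(N * N) - GAMMA * P_pi, r_pi).reshape(N, N)
print(np.round(v_exact, 2))
```

The solved values should agree with the iterative ones up to roughly the stopping tolerance `theta`. A direct solve is practical here only because the state space is tiny; the sweep-based version scales better to larger grids.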

### Bellman optimality equation
The equation below is the **Bellman optimality equation**, which finds the optimal value function by taking the maximum over actions instead of averaging under a policy.

$$ v_{\ast}(s) = \max_a \displaystyle \sum_{s',r} p(s',r \vert s, a) \big [r + \gamma v_{\ast}(s') \big ] : \text{ Bellman optimality equation for } v_{\ast}$$

In [47]:
def bellman_optimality_equation(env, discount_factor=0.9, theta=1e-4):
    """Value iteration: back up each state with the best action until updates fall below theta."""
    V = np.zeros([WORLD_SIZE, WORLD_SIZE])
    while True:
        delta = 0
        for i in range(WORLD_SIZE):
            for j in range(WORLD_SIZE):
                s = (i, j)
                # One-step lookahead value of every action from s.
                v_candidate = []
                for a in range(env.nA):
                    prob, next_state, reward, done = env.P[s][a]
                    v_candidate.append(prob * (reward + discount_factor*V[next_state[0]][next_state[1]]))
                v = max(v_candidate)  # keep the best action's value
                delta = max(delta, np.abs(v - V[s]))
                V[s] = v
        if delta < theta:
            break
    return V

In [48]:
env = Gridworld()
value = bellman_optimality_equation(env)

In [49]:
value

array([[ 21.9773651 ,  24.41934924,  21.97741432,  17.97741432,
         16.17967288],
       [ 19.77962859,  21.97741432,  19.77967288,  17.8017056 ,
         16.02153504],
       [ 17.80166573,  19.77967288,  17.8017056 ,  16.02153504,
         14.41938153],
       [ 16.02149916,  17.8017056 ,  16.02153504,  14.41938153,
         12.97744338],
       [ 14.41934924,  16.02153504,  14.41938153,  12.97744338,
         11.67969904]])
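
The optimal values can be turned into a policy by picking, in each state, the actions whose one-step lookahead equals the maximum. The sketch below is one way to do this under assumptions: it restates the transition rules in a hypothetical `step` helper so it runs on its own, and `backups`, `greedy`, and `NAMES` are illustrative names. It also checks the value at A against a closed form: from A the piece collects +10, lands on A', and can walk back up in four zero-reward moves, so $v_{\ast}(A) = 10 + \gamma^5 v_{\ast}(A)$.

```python
import numpy as np

N, GAMMA = 5, 0.9
A, A_PRIME, B, B_PRIME = (0, 1), (4, 1), (0, 3), (2, 4)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]
NAMES = ["UP", "DOWN", "LEFT", "RIGHT"]

def step(s, move):
    """Hypothetical helper mirroring the transition table: (next_state, reward)."""
    if s == A:
        return A_PRIME, 10.0
    if s == B:
        return B_PRIME, 5.0
    i, j = s[0] + move[0], s[1] + move[1]
    if 0 <= i < N and 0 <= j < N:
        return (i, j), 0.0
    return s, -1.0  # bumped into a wall: stay in place

def backups(V, s):
    """One-step lookahead value of each action from state s."""
    return [r + GAMMA * V[ns] for ns, r in (step(s, m) for m in MOVES)]

# Value iteration with in-place sweeps, same stopping rule as the notebook.
V = np.zeros((N, N))
while True:
    delta = 0.0
    for s in np.ndindex(N, N):
        best = max(backups(V, s))
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-4:
        break

# Greedy policy: every action within a small tolerance of the best one.
greedy = {}
for s in np.ndindex(N, N):
    q = backups(V, s)
    greedy[s] = [NAMES[a] for a in range(4) if np.isclose(q[a], max(q), atol=1e-3)]

print(greedy[A_PRIME])                 # best action(s) from A'
print(V[A], 10 / (1 - GAMMA ** 5))     # iterated value vs. closed form at A
```

Ties are kept rather than broken arbitrarily, since several states have more than one optimal direction. The tolerance `atol=1e-3` is a judgment call that absorbs the residual error left by the stopping rule.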