
The Bellman optimality equation for the state-value function is $v_*(s) = \max_a\sum_{s', r} p(s', r|s, a)[r+\gamma v_*(s')]$, where $s$ is any state, $s'$ is a successor state, $r$ is a reward, $a$ is an action, and $\gamma$ is the discount factor. The function $p(s', r|s, a) = \mathrm{Pr}(S_t=s', R_t=r| S_{t-1}=s,A_{t-1}=a)$ defines the dynamics of the Markov decision process.

These equations (one per state) can be used to find the optimal policy. One way to solve them is dynamic programming (see Chapters 3 and 4 in Sutton & Barto). In this notebook we use a dynamic programming method to solve the gridworld example (p. 60, Example 3.5).

In the gridworld example there are 25 states (a 5x5 grid), $\mathcal{S} = \{0, 1, \dots, 23, 24\}$. There are 4 actions for each cell (up, down, left, right, in the order used by the code below), $\mathcal{A}(s) = \{0, 1, 2, 3\}$ for all $s$. There are 4 possible rewards, $\mathcal{R} = \{-1, 0, 10, 5\}$.
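
The state numbering can be made concrete with a small sketch. It assumes row-major numbering (state index = 5 * row + column), which is what the `states` array in the code below produces; the helper names `to_coords` and `to_state` are hypothetical and are not used elsewhere in the notebook.

```python
side = 5  # grid size, as in the text


def to_coords(s, side=side):
    """Convert a row-major state index into (row, col)."""
    return divmod(s, side)


def to_state(row, col, side=side):
    """Convert (row, col) into a row-major state index."""
    return side * row + col


print(to_coords(0))    # upper left corner -> (0, 0)
print(to_coords(24))   # lower right corner -> (4, 4)
print(to_state(0, 1))  # special cell A (row 0, column 1) -> 1
print(to_state(4, 1))  # its target A' (row 4, column 1) -> 21
```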

Let's define all $p(s',r|s,a)$. 
For example, for state 0 (upper left corner) we get 
\begin{equation}
\begin{aligned}
p(s',r|0,\mathrm{left}) &= \begin{cases}1 & \mathrm{if}~s'=0~\mathrm{and}~r=-1 \\ 0 & \mathrm{otherwise} \end{cases}\\
p(s',r|0,\mathrm{up}) &= \begin{cases}1 & \mathrm{if}~s'=0~\mathrm{and}~r=-1 \\ 0 & \mathrm{otherwise} \end{cases}\\
p(s',r|0,\mathrm{right}) &= \begin{cases}1 & \mathrm{if}~s'=1~\mathrm{and}~r=0 \\ 0 & \mathrm{otherwise} \end{cases}\\
p(s',r|0,\mathrm{down}) &= \begin{cases}1 & \mathrm{if}~s'=5~\mathrm{and}~r=0 \\ 0 & \mathrm{otherwise} \end{cases}
\end{aligned}
\end{equation}
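
Because each pair $(s, a)$ in this gridworld leads to exactly one successor state and one reward, the sum in the Bellman optimality equation has a single nonzero term. Writing $s'(s,a)$ and $r(s,a)$ for that successor and reward (notation introduced here only for illustration, not taken from the book), the equation reduces to
\begin{equation}
v_*(s) = \max_a \left[ r(s,a) + \gamma v_*(s'(s,a)) \right].
\end{equation}
For state 0, using the four transitions listed above (left and up give the same term), this reads
\begin{equation}
v_*(0) = \max\left\{ -1 + \gamma v_*(0),~ \gamma v_*(1),~ \gamma v_*(5) \right\}.
\end{equation}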

In [19]:
import numpy as np

side = 5
states = np.array([[side * r + c for c in range(side)] for r in range(side)]) # state index at grid coordinates [r, c]
rewards = {-1: 0, 0: 1, 10: 2, 5: 3} # [-1, 0, 10, 5]
actions = [[-1, 0], [1, 0], [0, -1], [0, 1]] # [dr, dc] = [up, down, left, right]
actions_map = ["\u2191", "\u2193", "\u2190", "\u2192"]

# Dynamics table indexed as p[s_p, r_i, s, a] = p(s_p, r | s, a), where r_i = rewards[r]
p = np.zeros((states.size, len(rewards), states.size, len(actions)))

# Assign p(s_p, r| s, a) for all cells except A and B
for r in range(side):
    for c in range(side):
        for i_a, (dr, dc) in enumerate(actions):
            # Skip positions A and B (these are special cases - see below)
            if (r, c) in ((0, 1), (0, 3)):
                continue
            # New state
            r_p, c_p = r + dr, c + dc
            # Edge - stay in the current state and assign reward -1
            if r_p in (-1, side) or c_p in (-1, side):
                p[states[r][c]][rewards[-1]][states[r][c]][i_a] = 1
            # Non-edge (inside the grid) - reward 0
            else:
                p[states[r_p][c_p]][rewards[0]][states[r][c]][i_a] = 1
# A -> A' with reward +10
p[states[4][1], rewards[10], states[0][1], :] = np.ones(len(actions))
# B -> B' with reward +5
p[states[2][3], rewards[5], states[0][3], :] = np.ones(len(actions))

In [2]:
# Check 1: next state (0, 0), reward -1, current state (0, 0). For up and left we should get probability 1
p[states[0][0], rewards[-1], states[0][0], :]

array([1., 0., 1., 0.])
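
A second sanity check that may be useful: for every state and action, the probabilities over all (next state, reward) pairs should sum to 1. The sketch below is self-contained, so it rebuilds the dynamics in its own array; the names `p_check`, `moves`, `special` and `reward_index` are assumptions made for this example and mirror, but are not, the notebook's `p`, `actions` and `rewards`.

```python
import numpy as np

side = 5
n_states, n_actions = side * side, 4
reward_index = {-1: 0, 0: 1, 10: 2, 5: 3}
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
# Special cells: A (0, 1) -> A' (4, 1) with reward 10, B (0, 3) -> B' (2, 3) with reward 5
special = {(0, 1): ((4, 1), 10), (0, 3): ((2, 3), 5)}

# Same index layout as in the notebook: [next state, reward index, state, action]
p_check = np.zeros((n_states, len(reward_index), n_states, n_actions))
for r in range(side):
    for c in range(side):
        s = side * r + c
        for a, (dr, dc) in enumerate(moves):
            if (r, c) in special:
                (r_p, c_p), reward = special[(r, c)]
            else:
                r_p, c_p, reward = r + dr, c + dc, 0
                # Moving off the grid: stay in place with reward -1
                if not (0 <= r_p < side and 0 <= c_p < side):
                    r_p, c_p, reward = r, c, -1
            p_check[side * r_p + c_p, reward_index[reward], s, a] = 1

# Sum over next states (axis 0) and rewards (axis 1)
totals = p_check.sum(axis=(0, 1))
print(totals.shape)               # (25, 4)
print(np.allclose(totals, 1.0))   # True
```

With the notebook's own array, the same check should reduce to `np.allclose(p.sum(axis=(0, 1)), 1)`.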

In [20]:
def print_mtr(A):
    """
    Utility to print a flat sequence of side * side items as a (side, side) grid.
    """
    for r in range(side):
        s = ""
        t = isinstance(A, (np.ndarray))
        for c in range(side):
            item = A[r*side+c]
            if t:
                item = round(item, 1)
            s += f" {item}"
        print(s)

def iterative_policy_evaluation(gamma, epsilon):
    """
    Value iteration written as an in-place sweep over all states.

    The sweep structure follows iterative policy evaluation (p. 75), but the
    update takes the max over actions (the Bellman optimality backup) instead
    of averaging under a fixed policy. The function name is kept for the
    calls below.
    """

    # Value function, randomly initialized
    V = np.random.uniform(low=-1, high=1, size=states.size)
    # V = np.zeros(states.size)

    steps, max_steps = 0, 20
    # Initial delta (any value >= epsilon, so that the loop starts)
    delta = epsilon
    # Cap the number of sweeps at max_steps in case it does not converge
    while delta >= epsilon and steps < max_steps:
        # Track the largest value change in this sweep
        delta = 0
        for s in range(states.size):
            v = V[s]
            # Start from -inf so that negative action values are handled correctly
            v_new = -np.inf
            for a_i in range(len(actions)):
                tmp = 0
                for s_p in range(states.size):
                    for r, r_i in rewards.items():
                        tmp += p[s_p][r_i][s][a_i] * (r + gamma * V[s_p])
                v_new = max(v_new, tmp)
            delta = max(delta, abs(v - v_new))
            V[s] = v_new
        print(f"delta: {delta}")
        steps += 1
    print("Value function")
    print_mtr(V)

    # Print greedy optimal policy
    print("Optimal action")
    optimal_policy = greedy_optimal_policy(V, gamma)
    print_mtr(optimal_policy)
    return V, optimal_policy

def greedy_optimal_policy(value_func, gamma):
    """
    Find a greedy (optimal) action for each state.
    We keep ONE optimal action per state for simplicity
    (even though there could be several): exact ties go to the
    first action in the order up, down, left, right.
    """
    n = len(value_func)
    optimal_actions = [0] * n
    for s in range(n):
        # Start from -inf so that negative action values are handled correctly
        max_v = -np.inf
        for a_i in range(len(actions)):
            tmp = 0
            for s_p in range(states.size):
                for r, r_i in rewards.items():
                    tmp += p[s_p][r_i][s][a_i] * (r + gamma * value_func[s_p])
            if tmp > max_v:
                max_v = tmp
                optimal_actions[s] = actions_map[a_i]
    return optimal_actions

In [21]:
gamma = 0.5
threshold = 0.01
print(f"gamma: {gamma}, epsilon: {threshold}")
vv, aa = iterative_policy_evaluation(gamma, threshold)

print("=====================")
gamma = 0.9
print(f"gamma: {gamma}, epsilon: {threshold}")
_, _ = iterative_policy_evaluation(gamma, threshold)

gamma: 0.5, epsilon: 0.01
delta: 9.747184066025968
delta: 4.873592033012984
delta: 0.18455835642333174
delta: 0.011936055982569194
delta: 0.0014920069978208161
Value function
 5.2 10.3 5.2 5.7 2.9
 2.6 5.2 2.6 2.9 1.4
 1.3 2.6 1.3 1.4 0.7
 0.6 1.3 0.6 0.7 0.4
 0.3 0.6 0.3 0.4 0.2
Optimal action
 → ↑ ← ↑ ←
 → ↑ ↑ ↑ ↑
 → ↑ ↑ ↑ ↑
 → ↑ ↑ ↑ ↑
 → ↑ ↑ ↑ ↑
gamma: 0.9, epsilon: 0.01
delta: 9.760345224343439
delta: 8.784310701909096
delta: 5.387548205166979
delta: 3.1812933396690486
delta: 1.8785219041411771
delta: 1.1092483991763267
delta: 0.6550000872296273
delta: 0.3867710015082224
delta: 0.2283844086805935
delta: 0.1348587094818008
delta: 0.07963271936190708
delta: 0.04702232445601595
delta: 0.027766212368032228
delta: 0.016395670741196966
delta: 0.009681479615966992
Value function
 22.0 24.4 22.0 19.4 17.5
 19.8 22.0 19.8 17.8 16.0
 17.8 19.8 17.8 16.0 14.4
 16.0 17.8 16.0 14.4 13.0
 14.4 16.0 14.4 13.0 11.7
Optimal action
 → ↑ ← ↑ ←
 → ↑ ↑ ← ←
 → ↑ ↑ ↑ ↑
 → ↑ ↑ ↑ ↑
 → ↑ ↑ ↑ ↑
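
The nested loops above can also be written with array operations. The following is a minimal sketch, not the notebook's method: it uses synchronous updates (every state is updated from the previous sweep's values) rather than the in-place sweep above, so the per-sweep deltas will generally differ even though both should approach the same fixed point. The function name `value_iteration_vectorized`, its parameters, and the two-state toy problem are assumptions made for illustration; the array layout `[next state, reward index, state, action]` matches `p`.

```python
import numpy as np


def value_iteration_vectorized(p, reward_values, gamma, epsilon=1e-8, max_sweeps=10_000):
    """Synchronous value iteration for dynamics p[s_next, r_i, s, a].

    reward_values[r_i] is the numeric reward that belongs to reward index r_i.
    Returns the value function and one greedy action index per state.
    """
    n_states = p.shape[2]
    # Expected immediate reward for each (s, a): sum over s_next and r of p * r
    expected_r = np.einsum("prsa,r->sa", p, reward_values)
    # State transition probabilities trans[s_next, s, a]: sum over rewards
    trans = p.sum(axis=1)
    V = np.zeros(n_states)
    for _ in range(max_sweeps):
        # Action values: q(s, a) = E[r] + gamma * sum over s_next of trans * V
        Q = expected_r + gamma * np.einsum("psa,p->sa", trans, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < epsilon:
            return V_new, Q.argmax(axis=1)
        V = V_new
    return V, Q.argmax(axis=1)


# Toy problem (hypothetical): two states, actions 0 = stay and 1 = switch.
# Arriving in state 1 pays reward 1, arriving in state 0 pays reward 0.
toy_p = np.zeros((2, 2, 2, 2))
toy_p[0, 0, 0, 0] = 1  # state 0, stay   -> state 0, reward 0
toy_p[1, 1, 0, 1] = 1  # state 0, switch -> state 1, reward 1
toy_p[1, 1, 1, 0] = 1  # state 1, stay   -> state 1, reward 1
toy_p[0, 0, 1, 1] = 1  # state 1, switch -> state 0, reward 0
toy_rewards = np.array([0.0, 1.0])

toy_V, toy_policy = value_iteration_vectorized(toy_p, toy_rewards, gamma=0.5)
print(np.round(toy_V, 3))  # both values approach 1 / (1 - 0.5) = 2
print(toy_policy)          # switch in state 0, stay in state 1
```

For the gridworld, the call would presumably be `value_iteration_vectorized(p, np.array([-1, 0, 10, 5]), gamma, threshold)`, with the reward array ordered to match the indices in `rewards`.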


In [22]:
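# Let's check that there are some states with multiple optimal actions.
# Note: gamma is 0.9 here (its last assigned value) while vv was computed with gamma = 0.5;
# the four values are equal either way, because every action in state 1 (A) leads to A'.
s = 1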
for a_i in range(len(actions)):
    tmp = 0
    for s_p in range(states.size):
        for r, r_i in rewards.items():
            tmp += p[s_p][r_i][s][a_i] * (r + gamma * vv[s_p])
    print(tmp)

10.580645140850619
10.580645140850619
10.580645140850619
10.580645140850619
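
Since all four action values printed above are equal, every action is optimal in that state. A small helper can list all tied actions instead of keeping only the first one. This is a sketch under assumptions: the name `all_greedy_actions` and the tolerance `tol` are hypothetical, and the tolerance is there because values that are equal in theory may differ slightly in floating point.

```python
import numpy as np


def all_greedy_actions(q_values, labels, tol=1e-9):
    """Return the labels of all actions whose value is within tol of the best one."""
    best = np.max(q_values)
    return [lab for q, lab in zip(q_values, labels) if np.isclose(q, best, atol=tol)]


arrows = ["\u2191", "\u2193", "\u2190", "\u2192"]  # up, down, left, right

# The four equal values printed above: all actions are tied
tied = all_greedy_actions([10.580645140850619] * 4, arrows)
print(tied)

# A made-up example in which only down and left are tied for the best value
partial = all_greedy_actions([1.0, 2.0, 2.0, 0.5], arrows)
print(partial)
```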
