In [1]:
import numpy as np

State space $S = \{\text{hungry}, \text{full}\}$.

Action space $A = \{\text{ignore}, \text{feed}\}$.

Initial state distribution $p_{S_0} = \begin{bmatrix}1/2 & 1/2\end{bmatrix}^T$.

Reward space $R = \{-3, -2, 1, 2\}$.

In [2]:
p_S0 = np.array([1/2, 1/2])

## Dynamics
```math
p(s',r\mid s, a) = \mathrm{Pr}[S_{t+1} = s', R_{t+1} = r\mid S_t = s, A_t = a]
```

In [3]:
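# Dynamics as a matrix: entry [i, j] is p(s', r | s, a).
# Rows index (s', r); columns index (s, a) in the order
# (hungry, ignore), (hungry, feed), (full, ignore), (full, feed).
dynamics_matrix = [[0, 1/3,   0,   0], # (hungry, -3)
                   [1,   0, 3/4,   0], # (hungry, -2)
                   [0, 2/3,   0,   1], # (full, 1)
                   [0,   0, 1/4,   0], # (full, 2)
                   ]
dynamics_matrix = np.array(dynamics_matrix)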

Transition Probability Matrix
```math
\begin{align*}
p(s'\mid s,a) &= \mathrm{Pr}\left[S_{t+1} = s' \mid S_t = s, A_t = a\right]\\
&= \sum_{r \in R} p(s', r \mid s, a)
\end{align*}
```

In [4]:
def get_p_sp_given_s_a(dynamics_matrix):
    p_sp_given_s_a = np.array([[1,1,0,0], [0,0,1,1]]) @ dynamics_matrix
    return p_sp_given_s_a

In [5]:
p_sp_given_s_a = get_p_sp_given_s_a(dynamics_matrix)
p_sp_given_s_a

array([[1.        , 0.33333333, 0.75      , 0.        ],
       [0.        , 0.66666667, 0.25      , 1.        ]])
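
As a quick sanity check, each column of the dynamics matrix is a distribution over $(s', r)$ and each column of $p(s'\mid s,a)$ is a distribution over $s'$, so both should sum to 1. The sketch below rebuilds the matrices so it runs on its own; the names `agg`, `dyn_col_sums` and `trans_col_sums` are illustrative and not part of the notebook.

```python
import numpy as np

# Same dynamics as above: rows index (s', r), columns index (s, a).
dynamics_matrix = np.array([
    [0, 1/3, 0,   0],   # (hungry, -3)
    [1, 0,   3/4, 0],   # (hungry, -2)
    [0, 2/3, 0,   1],   # (full, 1)
    [0, 0,   1/4, 0],   # (full, 2)
])

# Hypothetical helper matrix: sums the two reward rows belonging to each next state.
agg = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])
p_sp_given_s_a = agg @ dynamics_matrix

dyn_col_sums = dynamics_matrix.sum(axis=0)    # total probability per (s, a)
trans_col_sums = p_sp_given_s_a.sum(axis=0)   # same after marginalising r
print(dyn_col_sums, trans_col_sums)
```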

Expected reward given State-Action Pair

```math
\begin{align*}
r(s,a) &= \mathbb{E}[R_{t + 1} \mid S_t = s, A_t = a]\\
&= \sum_{s' \in S, r \in R} r \cdot p(s', r \mid s, a)
\end{align*}
```

In [6]:
def get_r(dynamics_matrix):
    r = np.sum(dynamics_matrix * np.array([[-3], [-2], [1], [2]]), axis=0) # Keep in vector form just in case.
    return r

In [7]:
r = get_r(dynamics_matrix)
r

array([-2.        , -0.33333333, -1.        ,  1.        ])

Expected reward given state, action and next state.
```math
\begin{align*}
r(s,a,s') &= \mathbb{E}[R_{t + 1} \mid S_t = s, A_t = a, S_{t + 1} = s']\\
&= \frac{\sum_{r \in R}r \cdot p(s', r \mid s, a)}{p(s'\mid s, a)}
\end{align*}
```

In [8]:
def get_r_sp(dynamics_matrix, p_sp_given_s_a):
    r_sp = np.array([[1,1,0,0],[0,0,1,1]]) @ (dynamics_matrix * np.array([[-3], [-2], [1], [2]]))
    r_sp = np.nan_to_num(r_sp / p_sp_given_s_a)
    return r_sp

In [9]:
r_sp = get_r_sp(dynamics_matrix, p_sp_given_s_a)
r_sp

array([[-2., -3., -2.,  0.],
       [ 0.,  1.,  2.,  1.]])
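
Dividing by $p(s'\mid s,a)$ gives $0/0$ for transitions that cannot occur, which NumPy reports with a `RuntimeWarning` before `np.nan_to_num` replaces the `nan` by 0. One possible alternative, sketched here under the assumption that 0 is the intended value for unreachable transitions, is to divide only where the denominator is positive. The names `numerator` and `r_sp_safe` are illustrative.

```python
import numpy as np

rewards = np.array([[-3], [-2], [1], [2]])        # reward attached to each row
agg = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])      # sums the rows of each next state
dynamics_matrix = np.array([
    [0, 1/3, 0,   0],
    [1, 0,   3/4, 0],
    [0, 2/3, 0,   1],
    [0, 0,   1/4, 0],
])

p_sp_given_s_a = agg @ dynamics_matrix
numerator = agg @ (dynamics_matrix * rewards)     # sum_r r * p(s', r | s, a)

# Divide only where p(s' | s, a) > 0; the remaining entries keep the 0 from `out`.
r_sp_safe = np.divide(
    numerator,
    p_sp_given_s_a,
    out=np.zeros_like(numerator),
    where=p_sp_given_s_a > 0,
)
print(r_sp_safe)
```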

Policy $\pi$

In [10]:
pi = np.array([1/4, 3/4, 5/6, 1/6])
pi

array([0.25      , 0.75      , 0.83333333, 0.16666667])

Initial State-Action Distribution
```math
\begin{align*}
p_{S_0, A_0, \pi}(s, a) &= \mathrm{Pr}_\pi[S_0 = s, A_0 = a]\\
&= p_{S_0}(s)\pi(a\mid s)
\end{align*}
```

In [11]:
def get_p_S0_A0_pi(pi, p_S0):
    pi_matrix = pi.reshape(2,2)
    p_S0_A0_pi = p_S0.reshape(-1,1) * pi_matrix
    return p_S0_A0_pi

In [12]:
p_S0_A0_pi = get_p_S0_A0_pi(pi, p_S0)
p_S0_A0_pi

array([[0.125     , 0.375     ],
       [0.41666667, 0.08333333]])

Transition Probability between States
```math
\begin{align*}
p_\pi(s'\mid s) &= \mathrm{Pr}_\pi[S_{t + 1} = s'\mid S_t = s]\\
&= \sum_{a \in A}p(s'\mid s, a)\pi(a\mid s)
\end{align*}
```

In [13]:
def get_p_pi_sp_given_s(pi, p_sp_given_s_a):
    p_pi_sp_given_s = (p_sp_given_s_a * pi) @ np.array([[1,1,0,0], [0,0,1,1]]).T
    return p_pi_sp_given_s

In [14]:
p_pi_sp_given_s = get_p_pi_sp_given_s(pi, p_sp_given_s_a)
p_pi_sp_given_s

array([[0.5  , 0.625],
       [0.5  , 0.375]])

Transition Probability between State-Action pairs.
```math
\begin{align*}
p_\pi(s', a'\mid s, a) &= \mathrm{Pr}_\pi[S_{t + 1} = s', A_{t + 1} = a' \mid S_t = s, A_t = a]\\
&= \pi(a' \mid s')p(s'\mid s, a)
\end{align*}
```

In [15]:
def get_p_pi_sp_given_s_a(pi, p_sp_given_s_a):
    p_pi_sp_ap_given_s_a = pi.reshape(-1,1) * np.repeat(p_sp_given_s_a,2, axis=0)
    return p_pi_sp_ap_given_s_a

In [16]:
p_pi_sp_ap_given_s_a = get_p_pi_sp_given_s_a(pi, p_sp_given_s_a)
p_pi_sp_ap_given_s_a

array([[0.25      , 0.08333333, 0.1875    , 0.        ],
       [0.75      , 0.25      , 0.5625    , 0.        ],
       [0.        , 0.55555556, 0.20833333, 0.83333333],
       [0.        , 0.11111111, 0.04166667, 0.16666667]])

Expected State Reward
```math
\begin{align*}
r_\pi(s) &= \mathbb{E}_\pi[R_{t+1}\mid S_t = s]\\
&= \sum_{a \in A}r(s,a)\pi(a\mid s)
\end{align*}
```

In [17]:
def get_r_pi(pi, r):
    pi_matrix = pi.reshape(2,2)
    # Weight r(s, a) by pi(a | s) and sum over the actions of each state.
    r_pi = np.sum(r.reshape(2,2) * pi_matrix, axis=1)
    return r_pi

In [18]:
r_pi = get_r_pi(pi, r)
r_pi

array([-0.75      , -0.66666667])

## Discounted Return
Episodic task without discount.
```math
G_t = \sum_{\tau = t+1}^TR_{\tau}
```
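Task with discount $\gamma \in [0,1]$ (for an episodic task, rewards after the terminal step $T$ are taken to be zero; for a continuing task, $\gamma < 1$ is needed to guarantee that the sum is finite):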
```math
G_t = \sum_{\tau = 0}^\infty \gamma^\tau R_{t+\tau+1}
```
Recursive Relationship:
```math
G_t = R_{t+1} + \gamma G_{t + 1}
```
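
The recursive relationship suggests computing returns backwards from the end of an episode. The following is a minimal sketch; the function name `discounted_returns`, the reward sequence and the choice $\gamma = 0.5$ are made up for illustration.

```python
def discounted_returns(rewards, gamma):
    """rewards[t] holds R_{t+1}; the result holds G_t for t = 0, ..., T-1."""
    returns = [0.0] * len(rewards)
    g = 0.0  # return after the terminal step is zero
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

# Hypothetical episode with rewards R_1, ..., R_4.
rewards = [-2, 1, 1, 2]
returns = discounted_returns(rewards, gamma=0.5)
print(returns)
```

Here $G_0 = -2 + 0.5 \cdot 1 + 0.25 \cdot 1 + 0.125 \cdot 2 = -1$, which agrees with the direct sum.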

## Value
State Value - expected return starting from $s$, then following policy $\pi$.
```math
v_\pi(s) = \mathbb{E}_\pi[G_t\mid S_t = s]
```
Action Value - expected return starting from $(s,a)$, then following $\pi$.
```math
q_\pi(s,a) = \mathbb{E}_\pi[G_t\mid S_t = s, A_t = a]
```
### Properties of Value
Using action values to back up state values:
```math
v_\pi(s) = \sum_{a \in A}\pi(a\mid s)q_\pi(s,a)
```
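In other words: $v_\pi(s) = \mathbb{E}_\pi[q_\pi(S_t,A_t) \mid S_t = s]$.

Vector form:
```math
\vec{v}_\pi = (I_{|S|} \otimes \vec{1}_{|A|}^\top)(\vec{\pi} \odot \vec{q}_\pi)
```
where $I_{|S|} \otimes \vec{1}_{|A|}^\top \in \{0,1\}^{|S| \times |S||A|}$ sums the entries of each state over its actions (the matrix `[[1,1,0,0],[0,0,1,1]]` in the code).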

Using state values to back up action values:
```math
\begin{align*}
q_\pi(s,a) &= r(s,a) + \gamma\sum_{s' \in S}p(s'\mid s,a)v_\pi(s')\\
&= \sum_{s' \in S, r \in R}p(s',r\mid s,a)[r + \gamma v_\pi(s')]
\end{align*}
```
In other words, $q_\pi(s,a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$.

Vector form:
```math
\vec{q}_\pi = \vec{r} + \gamma P^\top_{S_{t+1}\mid S_t,A_t}\vec{v}_\pi
```
- $\vec{v}_\pi = (v_\pi(s): s \in S)^\top \in \mathbb{R}^{|S|}$
- $\vec{q}_\pi = (q_\pi(s,a):(s,a) \in S \times A)^\top \in \mathbb{R}^{|S||A|}$
- $\vec{\pi} = (\pi(a\mid s): (s,a) \in S \times A)^\top \in \mathbb{R}^{|S||A|}$
- $\vec{r} = (r(s,a) : (s,a) \in S \times A)^\top \in \mathbb{R}^{|S||A|}$
- $P_{S_{t+1}\mid S_t, A_t} = (p(s'\mid s,a): s' \in S, (s,a) \in S \times A) \in [0,1]^{|S| \times |S||A|}$

In [19]:
p_sp_given_s_a.shape, r.shape

((2, 4), (4,))

## Bellman Expectation Equations
Substituting each of the two backup relations above into the other gives the two Bellman expectation equations below.

Use state values at time $t+1$ to back up the state values at time $t$:
```math
v_\pi(s) = r_\pi(s) + \gamma \sum_{s' \in S}p_\pi(s'\mid s)v_\pi(s')
```
In other words: $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$.

Vector form:
```math
\vec{v}_\pi = \vec{r}_\pi + \gamma P^\top_{S_{t+1}\mid S_t;\pi}\vec{v}_\pi
```
where $\vec{r}_\pi = (r_\pi(s): s\in S)^\top \in \mathbb{R}^{|S|}$.

Use action values at time $t+1$ to back up the action values at time $t$:
```math
\begin{align*}
q_\pi(s,a) &= r(s,a) + \gamma \sum_{s'\in S, a'\in A}p_\pi(s', a'\mid s, a)q_\pi(s',a')\\
&= \sum_{s'\in S, r\in R}p(s', r\mid s, a)\left[r + \gamma\sum_{a'\in A}\pi(a'\mid s')q_\pi(s', a')\right]
\end{align*}
```
In other words, $q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$.

Vector form:
```math
\vec{q}_\pi = \vec{r} + \gamma P^\top_{S_{t+1}, A_{t+1}\mid S_t, A_t;\pi} \vec{q}_\pi
```

Rearranged relations:
```math
\begin{align*}
\vec{v}_\pi &= (I - \gamma P^\top_{S_{t+1}\mid S_t;\pi})^{-1}\vec{r}_\pi\\
\vec{q}_\pi &= (I - \gamma P^\top_{S_{t+1}, A_{t+1}\mid S_t, A_t;\pi})^{-1}\vec{r}
\end{align*}
```
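
The closed form can be cross-checked against repeated application of the Bellman expectation equation, which converges to the same fixed point when $\gamma < 1$. This is a sketch only: the matrix and reward vector are copied from the example outputs for $\pi$, the iteration count of 200 is an arbitrary choice, and `v_direct` and `v_iter` are illustrative names. It also uses `np.linalg.solve` in place of an explicit inverse.

```python
import numpy as np

gamma = 4/5
# p_pi(s' | s) with rows s' and columns s, and r_pi(s), for the example policy pi.
p_pi_sp_given_s = np.array([[0.5, 0.625],
                            [0.5, 0.375]])
r_pi = np.array([-0.75, -2/3])

# Direct solution of (I - gamma * P^T) v = r_pi.
v_direct = np.linalg.solve(np.eye(2) - gamma * p_pi_sp_given_s.T, r_pi)

# Fixed-point iteration v <- r_pi + gamma * P^T v, starting from zero.
v_iter = np.zeros(2)
for _ in range(200):
    v_iter = r_pi + gamma * p_pi_sp_given_s.T @ v_iter

print(v_direct, v_iter)
```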

### Approach 1
1. Use $\vec{v}_\pi = (I - \gamma P^\top_{S_{t+1}\mid S_t;\pi})^{-1}\vec{r}_\pi$ to obtain $\vec{v}_\pi$.
2. Use $\vec{q}_\pi = \vec{r} + \gamma P^\top_{S_{t+1}\mid S_t, A_t}\vec{v}_\pi$ to obtain $\vec{q}_\pi$.

In [20]:
# Feed and Full Example
def get_vq1(p_pi_sp_given_s, p_sp_given_s_a, gamma, r_pi, r):
    v_pi = np.linalg.inv(np.eye(2) - gamma * p_pi_sp_given_s.T) @ r_pi
    q_pi = r + (gamma * p_sp_given_s_a.T) @ v_pi
    return v_pi, q_pi

In [21]:
gamma = 4/5
v_pi1, q_pi1 = get_vq1(p_pi_sp_given_s, p_sp_given_s_a, gamma, r_pi, r)
v_pi1, q_pi1

(array([-3.59848485, -3.52272727]),
 array([-4.87878788, -3.17171717, -3.86363636, -1.81818182]))

### Approach 2
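1. Use $\vec{q}_\pi = (I - \gamma P^\top_{S_{t+1}, A_{t+1}\mid S_t, A_t;\pi})^{-1}\vec{r}$ to obtain $\vec{q}_\pi$.
2. Use $\vec{v}_\pi = (I_{|S|} \otimes \vec{1}_{|A|}^\top)(\vec{\pi} \odot \vec{q}_\pi)$, i.e. sum $\pi(a\mid s)q_\pi(s,a)$ over the actions of each state, to obtain $\vec{v}_\pi$.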

In [22]:
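def get_vq2(p_pi_sp_given_s, p_sp_given_s_a, gamma, pi, r):
    # Build p_pi(s', a' | s, a) from the arguments instead of relying on a notebook global.
    p_pi_sp_ap_given_s_a = pi.reshape(-1,1) * np.repeat(p_sp_given_s_a, 2, axis=0)
    q_pi = np.linalg.inv(np.eye(4) - gamma * p_pi_sp_ap_given_s_a.T) @ r
    # Sum pi(a | s) * q_pi(s, a) over the actions of each state to get v_pi(s).
    v_pi = np.array([[1,1,0,0],[0,0,1,1]]) @ (pi * q_pi)
    return v_pi, q_pi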

In [23]:
v_pi2, q_pi2 = get_vq2(p_pi_sp_given_s, p_sp_given_s_a, gamma, pi, r)
v_pi2, q_pi2

(array([-3.59848485, -3.52272727]),
 array([-4.87878788, -3.17171717, -3.86363636, -1.81818182]))

Both approaches give the same values.

## Initial Expected Returns using Values
Expected return at $t = 0$:
```math
\begin{align*}
g_\pi &= \mathbb{E}_{S_0 \sim p_{S_0}}[v_\pi(S_0)]\\
&= \vec{p}_{S_0}^\top \vec{v}_\pi
\end{align*}
```

In [24]:
def get_g_pi(p_S0, v_pi):
    g_pi = p_S0 @ v_pi.reshape(-1,1)
    return g_pi[0]

In [25]:
g_pi = get_g_pi(p_S0, v_pi1)
g_pi

np.float64(-3.5606060606060614)
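
Because $v_\pi(s) = \sum_a \pi(a\mid s)q_\pi(s,a)$, the same quantity can also be written with the initial state-action distribution: $g_\pi = \sum_{s,a} p_{S_0,A_0,\pi}(s,a)\,q_\pi(s,a)$. The sketch below checks this numerically; the value vectors are copied (rounded) from the outputs above, and `g_from_v` and `g_from_q` are illustrative names.

```python
import numpy as np

p_S0 = np.array([1/2, 1/2])
pi = np.array([1/4, 3/4, 5/6, 1/6])
# Rounded values taken from the printed results for the example policy.
v_pi = np.array([-3.59848485, -3.52272727])
q_pi = np.array([-4.87878788, -3.17171717, -3.86363636, -1.81818182])

# p_{S0,A0,pi}(s, a) = p_S0(s) * pi(a | s), flattened in (s, a) order.
p_S0_A0_pi = (p_S0.reshape(-1, 1) * pi.reshape(2, 2)).ravel()

g_from_v = p_S0 @ v_pi        # expectation of v_pi over S_0
g_from_q = p_S0_A0_pi @ q_pi  # expectation of q_pi over (S_0, A_0)
print(g_from_v, g_from_q)
```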

## Policy Improvement Theorem
If for all $s \in S$,
```math
v_\pi(s) \leq \sum_a \pi'(a\mid s)q_\pi(s, a) = \mathbb{E}_{A\sim \pi'(\cdot\mid s)}[q_\pi(s,A)]
```
then $\pi \preccurlyeq \pi'$, i.e. $v_\pi(s) \leq v_{\pi'}(s)$ for all $s \in S$.

Additionally, if there is a state $s \in S$ for which the former inequality is strict, then there is a state $s \in S$ for which the latter inequality is also strict.

In [26]:
def policy_le(v_p1, q_p1, p2):
    return np.all(v_p1 <= (np.array([[1,1,0,0], [0,0,1,1]]) @ (p2 * q_p1)))

In [27]:
pi2 = np.array([0,1,0,1])

In [28]:
policy_le(v_pi1, q_pi1, pi2)

np.True_

### Check if a Policy is Optimal
- If there is a state $s \in S$ and an action $a \in A(s)$ with $\pi(a\mid s) > 0$ and $q_\pi(s, a) < \max_{a' \in A(s)}q_\pi(s, a')$, then the policy is not optimal.
- Otherwise, policy $\pi$ is optimal and cannot be further improved.

In [33]:
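def is_policy_optimal(pi, q_pi):
    # Assumes |S| == |A| (true for this 2x2 example), so the flat vectors reshape to m x m.
    m = int(round(len(q_pi) ** (1/2)))
    q_matrix = q_pi.reshape(m,m)
    # A policy is suboptimal if it puts probability on an action that is not greedy w.r.t. q_pi.
    mask = (pi.reshape(m,m) > 0) & (q_matrix < np.max(q_matrix, axis=1).reshape(-1,1))
    return not np.any(mask)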

In [30]:
is_policy_optimal(pi, q_pi1)

False

In [31]:
p_pi2_sp_given_s = get_p_pi_sp_given_s(pi2,p_sp_given_s_a)
r_pi2 = get_r_pi(pi2, r)
# Evaluate pi2 with its own state-transition matrix, not the one computed for pi.
v_pi_det, q_pi_det = get_vq1(p_pi2_sp_given_s, p_sp_given_s_a, gamma, r_pi2, r)
v_pi_det, q_pi_det

(array([3.18181818, 5.        ]),
 array([0.54545455, 3.18181818, 1.90909091, 5.        ]))

In [32]:
is_policy_optimal(pi2, q_pi_det)

True

### Policy Improvement Algorithm
- For each state $s \in S$, set $\pi'(s) \leftarrow \arg\max_{a \in A}q_\pi(s,a)$.
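
Alternating this greedy step with policy evaluation gives a simple policy iteration loop. The sketch below is one possible arrangement for the hungry/full example, not the notebook's own implementation: the helpers `evaluate` and `improve`, the constants `AGG` and `REWARDS`, and the cap of 10 iterations are all assumptions made for illustration.

```python
import numpy as np

AGG = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])   # sums the two entries of each state
REWARDS = np.array([[-3], [-2], [1], [2]])     # reward attached to each (s', r) row
dynamics_matrix = np.array([
    [0, 1/3, 0,   0],
    [1, 0,   3/4, 0],
    [0, 2/3, 0,   1],
    [0, 0,   1/4, 0],
])
gamma = 4/5

p_sp_given_s_a = AGG @ dynamics_matrix            # p(s' | s, a)
r = np.sum(dynamics_matrix * REWARDS, axis=0)     # r(s, a)

def evaluate(pi):
    """Return (v_pi, q_pi) for a flat policy vector by solving the linear system for q."""
    p_sa = pi.reshape(-1, 1) * np.repeat(p_sp_given_s_a, 2, axis=0)  # p_pi(s', a' | s, a)
    q = np.linalg.solve(np.eye(4) - gamma * p_sa.T, r)
    return AGG @ (pi * q), q

def improve(q):
    """Greedy policy: all probability on argmax_a q(s, a) in each state."""
    q_matrix = q.reshape(2, 2)
    greedy = np.zeros_like(q_matrix)
    greedy[np.arange(2), q_matrix.argmax(axis=1)] = 1.0
    return greedy.ravel()

pi = np.array([1/4, 3/4, 5/6, 1/6])
for _ in range(10):                # hypothetical iteration cap
    v, q = evaluate(pi)
    new_pi = improve(q)
    if np.array_equal(new_pi, pi): # greedy policy unchanged: stop
        break
    pi = new_pi

print(pi, v, q)
```

Starting from the stochastic policy used above, the loop settles on the always-feed policy `[0, 1, 0, 1]` after one improvement step.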