### Reinforcement Learning: Theory and Algorithms
第１章内のいくつかの事柄をpythonを使って実装します。
* ベルマン方程式のベクトル化
* 動的計画法
の実装をメインで行っていきます。


### ベクトル形式でベルマン方程式を書こう

$$Q^\pi=r+\gamma P V^\pi$$


In [46]:
import numpy as np 

S = 3
A = 2


r = np.zeros(S*A)
P = np.zeros((S*A, S))
gamma = 0.9
pi = np.zeros(S*A)
V_pi = np.zeros(S)


r = np.random.rand(S*A)
P = np.random.rand(S*A, S)
P = P / P.sum(axis=1)[:, None]
pi = np.random.rand(S*A)

#ベルマン方程式

Q_pi = r + gamma * np.dot(P, V_pi)

print(Q_pi)

[0.69882194 0.05180708 0.91843705 0.53129233 0.36030357 0.58303794]


$$
\begin{aligned}
Q^\pi=r+\gamma P^\pi Q^\pi .
\end{aligned}
$$

$$
P_{(s, a),\left(s^{\prime}, a^{\prime}\right)}^\pi:=P\left(s^{\prime} \mid s, a\right) \pi\left(a^{\prime} \mid s^{\prime}\right) .
$$
$P^\pi$は($S\times A$,$S\times A$)の行列です。

In [48]:
Q_pi = np.zeros(S*A)

P_pi = np.zeros((S*A,S*A))
for s in range(S):
    for a in range(A):
        for s_prime in range(S):
            for a_prime in range(A):
                P_pi[s*A+a, s_prime*A+a_prime] = P[s*A + a, s_prime] * pi[s_prime*A + a_prime]

for _ in range(100):  
    Q_pi = r + gamma * np.dot(P_pi, Q_pi)

print(Q_pi)
print(P_pi.shape)
print(np.dot(P_pi, Q_pi).shape)

[3.92621042e+11 2.87654080e+11 3.81218096e+11 3.48802854e+11
 2.48957676e+11 3.20759749e+11]
(6, 6)
(6,)



**価値関数・行動価値関数関連**

* **$Q^\pi$の解析解 (Analytical Solution for Q):** $Q^\pi=\left(I-\gamma P^\pi\right)^{-1} r$
* 最適価値 (Optimal Value):
    * $V^{\star}(s)  :=\sup _{\pi \in \Pi} V^\pi(s)$
    * $Q^{\star}(s, a)  :=\sup _{\pi \in \Pi} Q^\pi(s, a)$
* グリーディ方策 (Greedy Policy): $\pi_Q(s):=\operatorname{argmax}_{a \in \mathcal{A}} Q(s, a)$
* **ベルマン最適作用素 (Bellman Optimality Operator):** $\mathcal{T} Q:=r+\gamma P V_Q$
* アドバンテージ関数 (Advantage Function): $A^\pi(s, a):=Q^\pi(s, a)-V^\pi(s)$
* 状態訪問分布 (State Visitation Distribution): $d_{s_0}^\pi(s)=(1-\gamma) \sum_{t=0}^{\infty} \gamma^t \operatorname{Pr}^\pi\left(s_t=s \mid s_0\right)$ 
* $V_Q(s):=\max _{a \in \mathcal{A}} Q(s, a)$

**動的計画法**

* **価値反復法 (Value Iteration):** $Q \leftarrow \mathcal{T} Q$
* **方策反復法 (Policy Iteration):**
    * 1. **方策評価 (Policy Evaluation):**  $Q^{\pi_k}$ を計算
    * 2. **方策改善 (Policy Improvement):** $\pi_{k+1}=\pi_{Q^{\pi_k}}$ 


### 解析解を実装してみましょう



In [43]:
def compute_analytical_Q_pi():
    return np.linalg.inv(np.eye(S*A)-gamma * P_pi) @ r

Q_pi = compute_analytical_Q_pi()
print(Q_pi)
print(np.linalg.inv(np.eye(S*A)-gamma * P_pi).shape)
print(r.shape)

[-18.86074403 -15.98676975 -13.79090972  -9.82614407]
(4, 4)
(4,)


### ベルマン最適作用素を実装してみましょう

In [54]:
def V_Q(Q, s, A):
    start_index = s * A
    end_index = start_index + A
    return np.max(Q[start_index:end_index])

def bellman_optimality_operator():
    V = np.array([V_Q(Q_pi, s, A) for s in range(S)]) 
    return r + gamma * np.dot(P, V)



[3.23548049e+11 3.33957445e+11 3.33338632e+11 3.16347306e+11
 3.20109464e+11 3.08516471e+11]
