# Linear Programming for Reinforcement Learning

---

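To set up the primal problem of the linear program, we first state an important proposition.

**Proposition 1**

If a bounded function $v:\mathcal{S} \rightarrow \mathbb{R}$ satisfies the following inequality, then $v$ is an upper bound on the optimal value function $V^*$.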
$$
v(s) \geq\left(B_* v\right)(s), \quad \forall s \in \mathcal{S}
$$

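What this proposition tells us is the following. Let $\mathbb{V}$ be the set of value functions. $V^*$ itself satisfies the inequality (with equality, since $V^* = B_* V^*$), and every other $v$ that satisfies it lies above $V^*$. So if we can find the smallest $v$ that satisfies the inequality, it is exactly $V^*$:

$$
\min_{v \in \mathbb{V},\; v \geq B_* v} v(s) = V^*(s) = \left(B_* V^*\right)(s), \quad \forall s \in \mathcal{S}
$$

This is a minimization under inequality constraints, so it looks like something we can solve.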
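
Proposition 1 can be checked numerically on a small random MDP. The following is a minimal sketch, not the author's code: the seed, the names `bellman_opt` and `V_star`, and the choice of the constant candidate $r_{\max}/(1-\gamma)$ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)                # seed is an assumption, for reproducibility
S, A, gamma = 10, 3, 0.8

rew = rng.random((S, A))                      # g(s, a)
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)            # p_T(s' | s, a)

def bellman_opt(v):
    """(B_* v)(s) = max_a [ g(s,a) + gamma * sum_s' p_T(s'|s,a) v(s') ]."""
    return (rew + gamma * P @ v).max(axis=1)

# V^* as the fixed point of B_*, obtained by value iteration
V_star = np.zeros(S)
for _ in range(500):
    V_star = bellman_opt(V_star)

# A candidate that satisfies v >= B_* v: the constant r_max / (1 - gamma)
v = np.full(S, rew.max() / (1.0 - gamma))

satisfies_condition = bool(np.all(v >= bellman_opt(v) - 1e-12))
is_upper_bound = bool(np.all(v >= V_star - 1e-12))
print(satisfies_condition, is_upper_bound)
```

Any other $v$ that passes the first check should pass the second one as well; the constant function is just the easiest candidate to construct.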

---

With this in mind, we can now define the primal problem.

$$
\begin{cases}\text { Minimize } & \sum_{s \in \mathcal{S}} w(s) v(s) \\ \text { subject to } & v(s) \geq g(s, a)+\gamma \sum_{s^{\prime} \in \mathcal{S}} p_{\mathrm{T}}\left(s^{\prime} \mid s, a\right) v\left(s^{\prime}\right), \forall(s, a) \in \mathcal{S} \times \mathcal{A}\end{cases}
$$

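The constraints come from applying the Bellman optimality operator $B_*$ to $v$: requiring the inequality for every action $a$ is the same as requiring $v(s) \geq \max_{a}\left[g(s, a)+\gamma \sum_{s^{\prime}} p_{\mathrm{T}}\left(s^{\prime} \mid s, a\right) v\left(s^{\prime}\right)\right]=\left(B_* v\right)(s)$. The weights $w(s)$ in the objective are assumed to be strictly positive, so that minimizing the weighted sum pushes every $v(s)$ down to $V^*(s)$.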
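
The constraint set can also be written in matrix form, with one row per pair $(s,a)$. The sketch below builds that matrix with NumPy and checks that $V^*$ is feasible with at least one tight constraint per state. The names `E`, `M` and `slack`, and the seed, are assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(1)                # seed is an assumption
S, A, gamma = 10, 3, 0.8

rew = rng.random((S, A))
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)

# Row (s, a) of the system:  v(s) - gamma * sum_s' p_T(s'|s,a) v(s') >= g(s,a)
E = np.repeat(np.eye(S), A, axis=0)           # (S*A, S): picks out v(s) for row (s, a)
M = E - gamma * P.reshape(S * A, S)           # constraint matrix
g = rew.reshape(S * A)                        # right-hand side

# V^* by value iteration, used here only as a reference point
V_star = np.zeros(S)
for _ in range(500):
    V_star = (rew + gamma * P @ V_star).max(axis=1)

slack = (M @ V_star - g).reshape(S, A)        # slack of every constraint at V^*
feasible = bool(np.all(slack >= -1e-10))
tight_per_state = np.isclose(slack.min(axis=1), 0.0, atol=1e-10)
print(feasible, bool(tight_per_state.all()))
```

The tight constraint in each state corresponds to a greedy action, which is why the LP solution encodes an optimal policy as well as $V^*$.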


---

Below, we compare dynamic programming with linear programming in code, to check that the LP really gives the correct answer.

In [50]:
import numpy as np
from typing import NamedTuple

S = 10 # number of states
A = 3 # number of actions
S_set = np.arange(S) # set of states
A_set = np.arange(A) # set of actions 
gamma = 0.8 # discount factor

rew = np.random.rand(S,A)

P = np.random.rand(S*A,S)
P = P/np.sum(P,axis=-1,keepdims=True)
P = P.reshape(S,A,S)

np.testing.assert_almost_equal(np.sum(P,axis=-1),1) # check if P is a valid probability matrix

class MDP(NamedTuple):
    S_set: np.ndarray
    A_set: np.ndarray
    P: np.ndarray
    rew: np.ndarray
    gamma: float
    horizon: int

    @property
    def S(self):
        return len(self.S_set)

    @property
    def A(self):
        return len(self.A_set)


horizon = int(1/(1-gamma))
mdp = MDP(S_set,A_set,P,rew,gamma,horizon)




In [53]:
import jax
import jax.numpy as jnp


# calculate the optimal V function with DP (value iteration on the Q function)
def _compute_optimal_V(mdp: MDP, S: int, A: int):

    def backup(optimal_Q):
        max_Q = optimal_Q.max(axis=1)
        next_v = mdp.P @ max_Q
        return mdp.rew + mdp.gamma * next_v
    optimal_Q = jnp.zeros((S, A))
    body_fn = lambda i,Q: backup(Q)
    Q = jax.lax.fori_loop(0,mdp.horizon + 1000,body_fn,optimal_Q)
    return Q.max(axis=-1)

compute_optimal_V = lambda mdp: _compute_optimal_V(mdp,mdp.S,mdp.A)
optimal_V_by_DP = compute_optimal_V(mdp)
print(optimal_V_by_DP)

[2.5476143 2.7145994 2.505139  2.6437848 2.8991792 2.4747338 2.6252167
 2.6725256 2.858446  2.7959125]
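
The cell above runs a fixed number of backups. As a hedged alternative, here is a plain NumPy sketch that stops once successive iterates are close and reports the standard contraction bound on the remaining error. The function name `value_iteration`, the tolerance and the seed are assumptions, and the random MDP here is not the one used above.

```python
import numpy as np

def value_iteration(P, rew, gamma, tol=1e-10, max_iter=10_000):
    """Iterate V <- B_* V until the sup-norm change falls below tol."""
    V = np.zeros(P.shape[0])
    for k in range(max_iter):
        V_new = (rew + gamma * P @ V).max(axis=1)
        residual = np.max(np.abs(V_new - V))
        V = V_new
        if residual < tol:
            break
    return V, residual, k + 1

rng = np.random.default_rng(2)                # seed is an assumption
S, A, gamma = 10, 3, 0.8
rew = rng.random((S, A))
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)

V, residual, n_iter = value_iteration(P, rew, gamma)
# Contraction bound: ||V - V^*||_inf <= gamma / (1 - gamma) * residual
error_bound = gamma / (1.0 - gamma) * residual
print(n_iter, error_bound)
```

With $\gamma = 0.8$ the residual shrinks by at least a factor of 0.8 per backup, so far fewer than the `horizon + 1000` iterations used above are needed to reach this tolerance.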


**Code for the primal problem**

In [54]:
import pulp
w = np.random.rand(S)

problem = pulp.LpProblem("LP_RL", pulp.LpMinimize)

v = [pulp.LpVariable(f'v_{i}') for i in range(mdp.S)]

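# Objective: sum_s w(s) v(s).
# Build it as a single expression: in PuLP, each `problem += <expression>` replaces the objective.
problem += pulp.lpSum(w[s] * v[s] for s in range(mdp.S))

# Constraints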
for s in range(mdp.S):
    for a in range(mdp.A):
        problem += v[s] >= mdp.rew[s,a] + mdp.gamma * pulp.lpSum([mdp.P[s,a,s_prime]*v[s_prime] for s_prime in range(mdp.S)])

status = problem.solve()
print(pulp.LpStatus[status])


Welcome to the CBC MILP Solver 
Version: 2.10.3 
Build Date: Dec 15 2019 

command line - /Users/ichiharayuuseimare/opt/anaconda3/envs/syumi-note/lib/python3.9/site-packages/pulp/solverdir/cbc/osx/64/cbc /var/folders/rn/8ylp503d60g0xr_qm3ghjknr0000gn/T/8677c9b40d1b4f808d07dbcabcb48d74-pulp.mps timeMode elapsed branch printingOptions all solution /var/folders/rn/8ylp503d60g0xr_qm3ghjknr0000gn/T/8677c9b40d1b4f808d07dbcabcb48d74-pulp.sol (default strategy 1)
At line 2 NAME          MODEL
At line 3 ROWS
At line 35 COLUMNS
At line 337 RHS
At line 368 BOUNDS
At line 379 ENDATA
Problem MODEL has 30 rows, 10 columns and 300 elements
Coin0008I MODEL read with 0 errors
Option for timeMode changed from cpu to elapsed
Presolve 30 (0) rows, 10 (0) columns and 300 (0) elements
Perturbing problem by 0.001% of 1.0424583 - largest nonzero change 0 ( 0%) - largest zero change 0
0  Obj 0 Primal inf 127.92729 (30) Dual inf 0.010424483 (1) w.o. free dual inf (0)
16  Obj 1.9248724
Optimal - objective value 

In [55]:
DP_fn = optimal_V_by_DP @ w
LP_V = np.array([pulp.value(v[s]) for s in range(mdp.S)])
LP_w_V = LP_V @ w
w_DP_LP = DP_fn - LP_w_V
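print(f'The difference between DP and LP in the weighted objective is {w_DP_LP}.')
DP_LP_dif = optimal_V_by_DP - LP_V
print(f'The largest difference between the DP and LP solutions is {np.max(np.abs(DP_LP_dif))}')

The difference between DP and LP in the weighted objective is -1.9073486328125e-06.
The largest difference between the DP and LP solutions is 2.384185791015625e-07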


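There are tiny differences, but they are on the order of single-precision rounding error (JAX computes in float32 by default), so we consider the two solutions to agree.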

---

Next, we define the dual problem.

$$
\left\{\begin{aligned}
\text { Maximize } & \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} x(s, a) g(s, a) \\
\text { subject to } & \sum_{a^{\prime} \in \mathcal{A}} x\left(s^{\prime}, a^{\prime}\right)-\gamma \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} p_{\mathrm{T}}\left(s^{\prime} \mid s, a\right) x(s, a)=w\left(s^{\prime}\right), \quad \forall s^{\prime} \in \mathcal{S} \\
& x(s, a) \geq 0, \quad \forall(s, a) \in \mathcal{S} \times \mathcal{A}
\end{aligned}\right.
$$

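Notice that a new variable $x(s,a)$ appears in the objective.

Let us briefly explain what properties $x(s,a)$ has.

To do so, we introduce a new function, the visitation frequency function $\Phi_w^\pi: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ (often called the discounted occupancy measure).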

$$
\begin{aligned}
\Phi_w^\pi(s, a) & \triangleq \sum_{s_0 \in \mathcal{S}} w\left(s_0\right) \mathbb{E}^\pi\left[\sum_{t=0}^{\infty} \gamma^t \mathbb{I}_{\left\{S_t=s\right\}} \mathbb{I}_{\left\{A_t=a\right\}} \mid S_0=s_0\right] \\
& =\sum_{s_0 \in \mathcal{S}} w\left(s_0\right) \sum_{t=0}^{\infty} \gamma^t \operatorname{Pr}\left(S_t=s, A_t=a \mid S_0=s_0, \mathrm{M}(\pi)\right) \\
& =\pi(a \mid s) \sum_{s_0 \in \mathcal{S}} w\left(s_0\right) \sum_{t=0}^{\infty} \gamma^t \operatorname{Pr}\left(S_t=s \mid S_0=s_0, \mathrm{M}(\pi)\right)
\end{aligned}
$$


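This function measures how often the pair $(s,a)$ is visited when following policy $\pi$ from an initial state weighted by $w$: it sums, over all time steps, the probability of being at $(s,a)$.

The visit at time step $t$ is discounted by $\gamma^t$, so earlier visits count more than later ones.



The reason for introducing it is that, as shown on p. 70 of the "blue book" on reinforcement learning (強化学習の青本), the $x(s,a)$ defined above coincides with the visitation frequency function. (The proof uses the constraints of the dual problem.)

In other words, to check whether the optimal $x(s,a)$ we obtain is correct, we can compare it with the visitation frequency function.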
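
For a fixed policy, the last line of the definition of $\Phi_w^\pi$ can be evaluated in closed form by solving one linear system. The following sketch does this for a randomly chosen policy and checks that the result satisfies the dual constraints. The names `pi`, `P_pi`, `d` and `Phi`, and the seed, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)                # seed is an assumption
S, A, gamma = 10, 3, 0.8
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)
w = rng.random(S)                             # initial-state weights w(s_0)

# An arbitrary stochastic policy pi(a | s), drawn at random for illustration
pi = rng.random((S, A))
pi /= pi.sum(axis=1, keepdims=True)

# State-to-state transitions under pi: P_pi[s, s'] = sum_a pi(a|s) p_T(s'|s,a)
P_pi = np.einsum("sa,sat->st", pi, P)

# Discounted state visitation: d = sum_t gamma^t (P_pi^T)^t w = (I - gamma P_pi^T)^{-1} w
d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, w)
Phi = pi * d[:, None]                         # Phi(s, a) = pi(a | s) d(s)

# Left-hand side of the dual equality constraint, evaluated at x = Phi
lhs = Phi.sum(axis=1) - gamma * np.einsum("sat,sa->t", P, Phi)
print(np.allclose(lhs, w), bool(np.all(Phi >= 0)))
print(np.isclose(Phi.sum(), w.sum() / (1.0 - gamma)))
```

So the visitation frequencies of any policy form a feasible point of the dual, and their total mass is always $\sum_s w(s) / (1-\gamma)$.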



---


**Code for the dual problem**



In [56]:
import pulp

dual_problem = pulp.LpProblem("Dual_RL", pulp.LpMaximize)

x = pulp.LpVariable.dicts("x", [(s, a) for s in range(S) for a in range(A)])

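# Objective: sum_{s,a} x(s,a) g(s,a).
# Build it as a single expression: in PuLP, each `dual_problem += <expression>` replaces the objective.
dual_problem += pulp.lpSum(mdp.rew[s,a] * x[s,a] for s in range(mdp.S) for a in range(mdp.A))

# Constraints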
for s_prime in range(mdp.S):
    dual_problem += pulp.lpSum([x[s_prime,a_prime] for a_prime in range(mdp.A)]) - mdp.gamma  * pulp.lpSum([mdp.P[s,a,s_prime]*x[s,a] for s in range(mdp.S) for a in range(mdp.A)]) == w[s_prime]

for s in range(mdp.S):
    for a in range(mdp.A):
        dual_problem += x[s,a] >= 0

dual_status = dual_problem.solve()
print(pulp.LpStatus[dual_status])


Welcome to the CBC MILP Solver 
Version: 2.10.3 
Build Date: Dec 15 2019 

command line - /Users/ichiharayuuseimare/opt/anaconda3/envs/syumi-note/lib/python3.9/site-packages/pulp/solverdir/cbc/osx/64/cbc /var/folders/rn/8ylp503d60g0xr_qm3ghjknr0000gn/T/44b760207ed14174bd247e6f1ef4b3c7-pulp.mps max timeMode elapsed branch printingOptions all solution /var/folders/rn/8ylp503d60g0xr_qm3ghjknr0000gn/T/44b760207ed14174bd247e6f1ef4b3c7-pulp.sol (default strategy 1)
At line 2 NAME          MODEL
At line 3 ROWS
At line 45 COLUMNS
At line 377 RHS
At line 418 BOUNDS
At line 449 ENDATA
Problem MODEL has 40 rows, 30 columns and 330 elements
Coin0008I MODEL read with 0 errors
Option for timeMode changed from cpu to elapsed
Presolve 10 (-30) rows, 30 (0) columns and 300 (-30) elements
Perturbing problem by 0.001% of 1.2525913 - largest nonzero change 9.1491621e-05 ( 0.0073041879%) - largest zero change 9.0752983e-05
0  Obj -0 Primal inf 85.025053 (10) Dual inf 1.2524988 (1)
0  Obj -0 Primal inf 85.0

In [59]:
dif = problem.objective.value() - dual_problem.objective.value()

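print(f'The difference between the primal and dual objective values is {dif}.')

The difference between the primal and dual objective values is -0.779596072516902.

By strong duality, the optimal values of the primal and the dual coincide, so this difference should be zero up to solver tolerance. A gap of this size indicates that the run that produced it did not optimize the full sums $\sum_s w(s) v(s)$ and $\sum_{s,a} x(s,a) g(s,a)$: in PuLP, `problem += expression` replaces the objective rather than adding to it, so an objective accumulated term by term in a loop keeps only its last term.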
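
Strong duality can also be checked without an LP solver. The sketch below is an illustration under stated assumptions, not a rerun of the PuLP cells: it uses its own random MDP and seed, takes $V^*$ from value iteration for the primal value, and uses the visitation frequencies of the greedy policy as the dual point `x`.

```python
import numpy as np

rng = np.random.default_rng(4)                # seed is an assumption
S, A, gamma = 10, 3, 0.8
rew = rng.random((S, A))
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)
w = rng.random(S)

# Primal side: V^* by value iteration, optimal value sum_s w(s) V^*(s)
V_star = np.zeros(S)
for _ in range(500):
    V_star = (rew + gamma * P @ V_star).max(axis=1)
primal_value = w @ V_star

# Dual side: visitation frequencies of the greedy deterministic policy
greedy = (rew + gamma * P @ V_star).argmax(axis=1)
pi = np.eye(A)[greedy]                        # one-hot pi(a | s)
P_pi = np.einsum("sa,sat->st", pi, P)
d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, w)
x = pi * d[:, None]
dual_value = np.sum(x * rew)

# A policy can be read back from x by normalizing over actions
recovered = (x / x.sum(axis=1, keepdims=True)).argmax(axis=1)
print(abs(primal_value - dual_value), np.array_equal(recovered, greedy))
```

Because `x` is dual feasible and attains the primal optimal value, it is an optimal dual solution, and normalizing it over actions gives back the greedy policy.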
