# Derivation of the Policy Gradient

- <font color=brown>Notation: $a_t$ and $u_t$ are used interchangeably in this document; both denote the action at time t.</font>

## I. Original objective: maximize the expected utility of states

### I.1 Two common expansions of the objective

1. Expected return over trajectories:
$$U(\theta)  = E_{\tau\sim \pi_{\theta}}[R(\tau)] = \sum_{\tau }P(\tau ;\theta )R(\tau )$$

2. Average of the state values over a state distribution:
$$U(\theta) = E_{s\sim d(s)}[V^{\pi_{\theta }}(s)]=\sum_{s\in \mathcal{S} }d(s)V^{\pi_{\theta }}(s)$$

### I.2 Assumptions implicit in these expansions

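- An expectation is only meaningful if the distribution of the random variable is stable. The trajectory expansion therefore assumes a stable trajectory distribution, and the state-value expansion assumes a stable distribution over states s. <font color=red>In realistic settings, however, neither assumption holds.</font>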

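- For the state-value expansion, in theory $d(s)$ can be taken to be the stationary distribution induced by the policy, provided the agent interacts with the environment a very large number of times. In real environments this condition is also usually hard to satisfy.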

## II. Gradient

### II.1 Derivation from the expected trajectory return

#### II.1.1 Basic gradient formula

$$\begin{align}
\nabla _{\theta }U(\theta ) 
& = \nabla _{\theta } \sum_{\tau }P(\tau ;\theta )R(\tau ) \\
& =  \sum_{\tau }\nabla _{\theta }P(\tau ;\theta )R(\tau ), {\color{green}{\text{because all the randomness that } \theta \text{ induces in } \tau \text{ is captured by } P(\tau;\theta)}} \\
& =  \sum_{\tau }P(\tau ;\theta )\frac{\nabla _{\theta }P(\tau ;\theta )}{P(\tau ;\theta )}R(\tau ) \\
& =  E_{\tau }\nabla _{\theta }logP(\tau ;\theta )R(\tau ) \\
\\
\nabla _{\theta }logP(\tau ;\theta ) 
& = \nabla _{\theta }log\left[ d(s_0)\prod_{t=0}^{H-1} P(s_{t+1}|s_t, u_t)*\pi_{\theta }(u_t|s_t) \right ]\\
& = \nabla _{\theta }\left[logd(s_0) + \sum_{t=0}^{H-1}logP(s_{t+1}|s_t, u_t) + \sum_{t=0}^{H-1}log\pi_{\theta }(u_t|s_t)\right ]\\
& = \underset{0}{ \underbrace{\nabla _{\theta }logd(s_0)}}  + \underset{0}{\underbrace{\nabla _{\theta }\sum_{t=0}^{H-1}logP(s_{t+1}|s_t, u_t)}} + \nabla _{\theta }\sum_{t=0}^{H-1}log\pi_{\theta }(u_t|s_t)\\
& = \nabla _{\theta }\sum_{t=0}^{H-1}log\pi_{\theta }(u_t|s_t) \\
\\
\text{Gradient: } g & = \nabla _{\theta }U(\theta )\\
& =  E_{\tau }\nabla _{\theta }logP(\tau ;\theta )R(\tau ) \\
& = E_{\tau }\left[\nabla _{\theta }\sum_{t=0}^{H-1}log\pi_{\theta }(u_t|s_t)R(\tau )\right ]\\
& = E_{\tau }\left[ \sum_{t=0}^{H-1}\nabla _{\theta }log\pi_{\theta }(u_t|s_t)\sum_{t'=0}^{H-1}r(s_{t'}, u_{t'})\right ]
\end{align}$$
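The identity $\nabla_\theta U(\theta) = E_\tau[\nabla_\theta \log P(\tau;\theta) R(\tau)]$ can be checked numerically on a toy problem. The sketch below is only an illustration under assumptions not in the derivation: a stateless environment, a softmax policy over a few actions, and a reward that depends only on the action. All function names are hypothetical. It enumerates every trajectory, so both sides are computed exactly rather than sampled.

```python
import itertools
import math


def softmax(theta):
    # Numerically stable softmax over a list of logits.
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_pi(theta, a):
    # For a softmax policy: d/d theta_k log pi(a) = 1[k == a] - pi(k).
    pi = softmax(theta)
    return [(1.0 if k == a else 0.0) - pi[k] for k in range(len(theta))]


def expected_return(theta, rewards, horizon):
    # U(theta) = sum_tau P(tau; theta) R(tau), summing over all action sequences.
    pi = softmax(theta)
    total = 0.0
    for tau in itertools.product(range(len(theta)), repeat=horizon):
        p = math.prod(pi[a] for a in tau)
        total += p * sum(rewards[a] for a in tau)
    return total


def score_function_gradient(theta, rewards, horizon):
    # g = sum_tau P(tau; theta) * (sum_t grad log pi(u_t)) * R(tau)
    pi = softmax(theta)
    g = [0.0] * len(theta)
    for tau in itertools.product(range(len(theta)), repeat=horizon):
        p = math.prod(pi[a] for a in tau)
        ret = sum(rewards[a] for a in tau)
        score = [sum(col) for col in zip(*(grad_log_pi(theta, a) for a in tau))]
        for k in range(len(theta)):
            g[k] += p * score[k] * ret
    return g


def finite_difference_gradient(theta, rewards, horizon, eps=1e-6):
    # Central differences on U(theta), used only as an independent check.
    g = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        diff = expected_return(up, rewards, horizon) - expected_return(down, rewards, horizon)
        g.append(diff / (2 * eps))
    return g
```

For example, `score_function_gradient([0.3, -0.2], [1.0, 0.0], 3)` and `finite_difference_gradient([0.3, -0.2], [1.0, 0.0], 3)` should agree up to finite-difference error.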

#### II.1.2 rewards to go

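- Rewards received before time $t$ do not depend on $u_t$, so their product with $\nabla _{\theta }log\pi_{\theta }(u_t|s_t)$ has zero expectation. Dropping them leaves only the rewards from $t$ onward:
$$g = E_{\tau }\left[ \sum_{t=0}^{H-1}\nabla _{\theta }log\pi_{\theta }(u_t|s_t)\sum_{t'=t}^{H-1}r(s_{t'}, u_{t'})\right ]$$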
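The inner sum $\sum_{t'=t}^{H-1} r(s_{t'}, u_{t'})$ can be computed for every $t$ in one backward pass. A minimal sketch, assuming the rewards of one trajectory are given as a plain list (the function name is hypothetical):

```python
def rewards_to_go(rewards):
    # out[t] = rewards[t] + rewards[t + 1] + ... + rewards[H - 1]
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out
```

For `rewards_to_go([1.0, 2.0, 3.0])` this gives `[6.0, 5.0, 3.0]`: the first entry is the full return $R(\tau)$ and the last entry is the final reward alone.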

#### II.1.3 baseline

- As long as b does not depend on $\theta$, the following holds:
$$\begin{align}
E_{\tau }\nabla _{\theta }logP(\tau ;\theta )b& = \nabla _{\theta } \sum_{\tau }P(\tau ;\theta )b \\
& = b\nabla _{\theta } \sum_{\tau }P(\tau ;\theta )\\
& = b\nabla _{\theta }1 = 0
\end{align}$$

- Subtracting b inside the policy gradient therefore gives:
$$\begin{align}
g & =  E_{\tau }\nabla _{\theta }logP(\tau ;\theta )[R(\tau ) - b]\\
&= E_{\tau }\left[ \sum_{t=0}^{H-1}\nabla _{\theta }log\pi_{\theta }(u_t|s_t)\left( \sum_{t'=t}^{H-1}r(s_{t'}, u_{t'})-b(s_t)\right)\right ]
\end{align}$$
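  - The equality still holds when b is a function of the state, $b(s_t)$, provided it does not depend on the action $u_t$ and is treated as a constant when differentiating with respect to $\theta$.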
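The effect of a baseline can be seen exactly in a one-step toy problem. The sketch below assumes a single state, a softmax policy, and a reward that depends only on the action; these assumptions and the function names are illustrative. It computes the exact mean and variance of the single-sample estimator $\nabla_\theta \log\pi_\theta(u)(r(u) - b)$ by summing over actions.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]


def estimator_moments(theta, rewards, b):
    # Exact mean vector and total variance (summed over components) of
    # grad log pi(u) * (r(u) - b) with u ~ pi_theta. b is treated as a constant.
    pi = softmax(theta)
    n = len(theta)
    mean = [0.0] * n
    second_moment = 0.0
    for a in range(n):
        score = [(1.0 if k == a else 0.0) - pi[k] for k in range(n)]
        sample = [s * (rewards[a] - b) for s in score]
        for k in range(n):
            mean[k] += pi[a] * sample[k]
        second_moment += pi[a] * sum(x * x for x in sample)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance


def state_value(theta, rewards):
    # V = sum_u pi(u) r(u) in this one-step setting.
    return sum(p * r for p, r in zip(softmax(theta), rewards))
```

With `theta = [0.0, 0.0]` and `rewards = [10.0, 11.0]`, calling `estimator_moments` with `b = 0.0` and with `b = state_value(theta, rewards)` returns the same mean, while the variance is much smaller in the second case.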

- To reduce the variance of the estimator, a near-optimal (though not exactly variance-minimizing) choice is $b(s)=V^{\pi}(s)$. Denote the action advantage by $A_t=Q^{\pi}(s_t,u_t)-V^{\pi}(s_t)$:
$$\begin{align}
g & =  E_{\tau }\nabla _{\theta }logP(\tau ;\theta )[R(\tau ) - b]\\
&= E_{\tau }\left[ \sum_{t=0}^{H-1}\nabla _{\theta }log\pi_{\theta }(u_t|s_t)\left( \sum_{t'=t}^{H-1}r(s_{t'}, u_{t'})-b(s_t)\right)\right ]\\
&= E_{\tau }\left[ \sum_{t=0}^{H-1}\nabla _{\theta }log\pi_{\theta }(u_t|s_t)A_t\right ]
\end{align}$$

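- <font color=red>With Monte Carlo rewards-to-go, subtracting $b(s_t)=V^{\pi}(s_t)$ lowers the variance without introducing bias, even though $V^{\pi}$ depends on the policy, because $b(s_t)$ does not depend on $u_t$. Bias enters only when the reward-to-go itself is replaced by a bootstrapped estimate built from an approximate value function.</font>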
$$\begin{align}
\hat g & = \frac{1}{N} \sum _{i=1 }^N\nabla _{\theta }logP(\tau_i;\theta )[R(\tau_i) - b]\\
&= \frac{1}{N} \sum _{i=1 }^N\left[ \sum_{t=0}^{H-1}\nabla _{\theta }log\pi_{\theta }(u_t^{(i)}|s_t^{(i)})\left( \sum_{t'=t}^{H-1}r(s_{t'}^{(i)}, u_{t'}^{(i)})-V^{\pi}(s_t^{(i)})\right)\right ]\\
&= \frac{1}{N} \sum _{i=1 }^N\left[ \sum_{t=0}^{H-1}\nabla _{\theta }log\pi_{\theta }(u_t^{(i)}|s_t^{(i)})\hat A_t^{(i)}\right ]
\end{align}$$
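The sample estimate $\hat g$ can be sketched for a tabular softmax policy. The following is an illustration under assumptions not in the text: `theta[s][a]` holds one logit per state-action pair, each trajectory is a list of `(state, action, reward)` tuples, and `values[s]` is a given table standing in for $V^{\pi}(s)$. All names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def policy_gradient_estimate(theta, trajectories, values):
    # Returns g_hat with the same shape as theta:
    # g_hat = (1/N) sum_i sum_t grad log pi(u_t | s_t) * (reward_to_go_t - V(s_t))
    n_states = len(theta)
    n_actions = len(theta[0])
    g = [[0.0] * n_actions for _ in range(n_states)]
    for traj in trajectories:
        # Rewards-to-go, computed backward over the trajectory.
        running = 0.0
        rtgs = []
        for _, _, r in reversed(traj):
            running += r
            rtgs.append(running)
        rtgs.reverse()
        for (s, a, _), rtg in zip(traj, rtgs):
            advantage = rtg - values[s]
            pi = softmax(theta[s])
            # Only row s of theta affects log pi(a | s) for a tabular softmax.
            for k in range(n_actions):
                g[s][k] += ((1.0 if k == a else 0.0) - pi[k]) * advantage
    n = len(trajectories)
    return [[x / n for x in row] for row in g]
```

In practice the value table would itself be estimated, which is where the distinction between $A_t$ and $\hat A_t$ comes from.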

### II.2 Derivation from the expected state-value function of the initial state

- See math of RL, page 210.

## III. Optimization objectives

### III.1 Objective in importance-sampling form

#### III.1.1 Derivation from the original objective

$$\begin{align}
J (\theta )& = \underset{\tau\sim \pi _\theta}{E} [R(\tau)] = \underset{\tau\sim \pi_{\bar \theta}}{E} \ \  [\frac{P_{\theta}(\tau)}{P_{\bar \theta}(\tau)}R(\tau)] \\
\text{where:}\\
\frac{P_{\theta}(\tau)}{P_{\bar \theta}(\tau)} & = \frac{P(s_0)\prod_{t=0}^{H-1}\pi_{\theta }(a_t|s_t)P(s_{t+1}|s_t, a_t)}{P(s_0)\prod_{t=0}^{H-1}\pi_{\bar \theta }(a_t|s_t)P(s_{t+1}|s_t, a_t)} 
= \frac{\prod_{t=0}^{H-1}\pi_{\theta }(a_t|s_t)}{\prod_{t=0}^{H-1}\pi_{\bar \theta }(a_t|s_t)} 
= \prod_{t=0}^{H-1}\frac{\pi_{\theta }(a_t|s_t)}{\pi_{\bar \theta }(a_t|s_t)}  \\
\text{Substituting into the expression above:}\\
J(\theta) &= \underset{\tau\sim \pi_{\bar \theta}}{E} \ \  [\frac{P_{\theta}(\tau)}{P_{\bar \theta}(\tau)}R(\tau)]= \underset{\tau\sim \pi_{\bar \theta}}{E} \ \  [ \prod_{t=0}^{H-1}\frac{\pi_{\theta }(a_t|s_t)}{\pi_{\bar \theta }(a_t|s_t)} R(\tau)]
\end{align}$$
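Because the dynamics terms cancel, the trajectory ratio only needs the per-step action probabilities under the two policies. A minimal sketch, assuming those probabilities have already been recorded for each sampled trajectory (function names and the data layout are hypothetical); the product is accumulated in log space to avoid underflow on long trajectories.

```python
import math


def trajectory_ratio(new_probs, old_probs):
    # prod_t pi_theta(a_t | s_t) / pi_theta_bar(a_t | s_t), via a sum of logs.
    log_ratio = sum(math.log(p) - math.log(q) for p, q in zip(new_probs, old_probs))
    return math.exp(log_ratio)


def importance_weighted_return(samples):
    # samples: list of (new_probs, old_probs, R) for trajectories drawn from
    # the behavior policy pi_theta_bar. Returns a Monte Carlo estimate of J(theta).
    total = 0.0
    for new_probs, old_probs, ret in samples:
        total += trajectory_ratio(new_probs, old_probs) * ret
    return total / len(samples)
```

For instance, `trajectory_ratio([0.5, 0.2], [0.25, 0.4])` multiplies a ratio of 2 by a ratio of 0.5.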

$$\begin{align}
\nabla _{\theta }J(\theta) & =  \underset{\tau\sim \pi_{\theta}}{E} \ \  \nabla _{\theta }logP(\tau ;\theta )R(\tau ) \\
& = \underset{\tau\sim \pi_{\bar \theta}}{E} \ \  \frac{P_{\theta}(\tau)}{P_{\bar \theta}(\tau)}\nabla _{\theta }logP_{\theta}(\tau)R(\tau )  \\
&= \underset{\tau\sim \pi_{\bar \theta}}{E} \ \  \left [\left( \prod_{t=0}^{H-1}\frac{\pi_{\theta,t }}{\pi_{\bar \theta, t }} \right)\left(\sum_{t=0}^{H-1}\nabla _{\theta }log\pi_{\theta,t }\right)\left(\sum_{t=0}^{H-1}r_t\right)\right]\\
&= \underset{\tau\sim \pi_{\bar \theta}}{E} \ \  \left [\sum_{t=0}^{H-1}\nabla _{\theta }log\pi_{\theta,t }\left( \prod_{t^{''}=0}^{t}\frac{\pi_{\theta,t^{''} }}{\pi_{\bar \theta,t^{''} }} \right)\sum_{t'=t}^{H-1}\left( \prod_{t^{'''}=t+1}^{t'}\frac{\pi_{\theta, t^{'''} }}{\pi_{\bar \theta, t^{'''} }} \right)r_{t'}\right]\\
&\approx \underset{\tau\sim \pi_{\bar \theta}}{E} \ \  \left [\sum_{t=0}^{H-1}\nabla _{\theta }log\pi_{\theta,t }\left(\prod_{t^{''}=0}^{t}\frac{\pi_{\theta ,t^{''} }}{\pi_{\bar \theta ,t^{''}}}\right)\left(\sum_{t'=t}^{H-1}r_{t'}\right)\right]
\end{align}$$

- Here $\pi_{\theta,t}=\pi_{\theta}(a_t|s_t)$ and $r_t=r(s_t,a_t)$. The fourth line uses causality: rewards before step $t$ drop out, and ratios after step $t'$ have expectation 1 given the trajectory up to $t'$. The last line drops the ratios for steps $t+1,\dots,t'$, so it is an approximation rather than an equality.

- In practice this expression is not used, because in it:
$$\prod_{t^{''}=0}^{t}\frac{\pi_{\theta ,t^{''} }}{\pi_{\bar \theta ,t^{''}}} \rightarrow \text{exponential in } H \text{, so the variance is very large}$$
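The exponential growth can be made concrete in a simplified setting. The sketch below assumes stateless policies, so the per-step ratios are independent and identically distributed under $\pi_{\bar\theta}$; this assumption and the function names are only for illustration. Each ratio has mean 1, so the variance of a product of $n$ ratios is $m^n - 1$, where $m = \sum_a \pi_\theta(a)^2 / \pi_{\bar\theta}(a)$ is the per-step second moment.

```python
def ratio_second_moment(pi, pi_bar):
    # E_{a ~ pi_bar}[(pi(a) / pi_bar(a))^2] = sum_a pi(a)^2 / pi_bar(a)
    return sum(p * p / q for p, q in zip(pi, pi_bar))


def product_ratio_variance(pi, pi_bar, n_factors):
    # Variance of a product of n_factors independent ratios, each with mean 1:
    # E[prod rho^2] - (E[prod rho])^2 = m^n - 1.
    m = ratio_second_moment(pi, pi_bar)
    return m ** n_factors - 1.0
```

With `pi = [0.6, 0.4]` and `pi_bar = [0.5, 0.5]` the per-step second moment is 1.04, so the variance is 0.04 for one factor but grows geometrically as more factors are multiplied in.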

#### III.1.2 First-order approximation of the IS objective

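- An alternative derivation: let $P_{\theta}(s_t, a_t)$ denote the state-action marginal distribution at time t, so that $(s_t, a_t)\sim P_{\theta}(s_t, a_t)$.
- The objective can then be rewritten as: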
$$J(\theta) = \sum_{t=0}^{H-1}\underset{(s_t, a_t)\sim P_{\theta}(s_t, a_t)}{E} \ \ [r(s_t, a_t)]$$

### III.2 The 'pseudo-loss' objective that corresponds directly to the gradient expression

$$\begin{align}
g = E_{\tau }\left[ \sum_{t=0}^{H-1}\nabla _{\theta }log\pi_{\theta }(u_t|s_t)A_t\right ] 
& \Rightarrow 
J(\theta ) = E_{\tau }\left[ \sum_{t=0}^{H-1}log\pi_{\theta }(u_t|s_t)A_t\right ]\\
\\
\hat g= \frac{1}{N} \sum _{i=1 }^N\left[ \sum_{t=0}^{H-1}\nabla _{\theta }log\pi_{\theta }(u_t^{(i)}|s_t^{(i)})A_t^{(i)}\right ] 
 & \Rightarrow 
\hat J(\theta ) = \frac{1}{N} \sum _{i=1 }^N\left[ \sum_{t=0}^{H-1}log\pi_{\theta }(u_t^{(i)}|s_t^{(i)})A_t^{(i)}\right ]
\end{align}$$
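The correspondence between $\hat g$ and $\hat J(\theta)$ holds when the sampled actions and the advantages $A_t^{(i)}$ are treated as constants. The sketch below checks this numerically, assuming a single-state softmax policy and a flat batch of `(action, advantage)` pairs; these assumptions and all function names are illustrative. It compares the analytic $\hat g$ with a finite-difference gradient of $\hat J$.

```python
import math


def log_softmax(theta):
    m = max(theta)
    lse = m + math.log(sum(math.exp(x - m) for x in theta))
    return [x - lse for x in theta]


def pseudo_loss(theta, actions, advantages):
    # J_hat(theta) = (1/N) sum_i log pi_theta(u_i) * A_i, with A_i held fixed.
    logp = log_softmax(theta)
    total = sum(logp[a] * adv for a, adv in zip(actions, advantages))
    return total / len(actions)


def analytic_gradient(theta, actions, advantages):
    # g_hat = (1/N) sum_i grad log pi_theta(u_i) * A_i
    pi = [math.exp(lp) for lp in log_softmax(theta)]
    g = [0.0] * len(theta)
    for a, adv in zip(actions, advantages):
        for k in range(len(theta)):
            g[k] += ((1.0 if k == a else 0.0) - pi[k]) * adv
    return [x / len(actions) for x in g]


def numeric_gradient(theta, actions, advantages, eps=1e-6):
    # Central differences on the pseudo-loss, used only as a check.
    g = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        diff = pseudo_loss(up, actions, advantages) - pseudo_loss(down, actions, advantages)
        g.append(diff / (2 * eps))
    return g
```

In an autodiff framework the same idea is usually written as minimizing $-\hat J(\theta)$, with the advantages excluded from differentiation.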

### III.3 The surrogate loss used in PPO