<h1><a href="http://papers.nips.cc/paper/4013-policy-gradients-in-linearly-solvable-mdps">
Policy Gradients in Linearly-Solvable MDPs</a></h1>
by Emanuel Todorov et al.


Summary
=======
* Policy gradient within the framework of linearly-solvable MDPs for both discrete and continuous stochastic systems.

Linearly-solvable MDP (LMDP)
============

Define a `state cost` $q(x)$ over a `state space` $X$, assumed here to be discrete.

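Define the `passive dynamics` as $p(x'|x)$ and the transition probability under an `admissible action` as $\pi(x'|x)$. An action is admissible only if it never enables a transition that the passive dynamics forbid: $$p(x'|x)=0\implies \pi(x'|x)=0$$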

The `cost function` is $l(x, \pi(\cdot|x))=q(x)+D_{KL}(\pi(\cdot|x)||p(\cdot|x))=q(x)+\sum_{x'}\pi(x'|x)\log\frac{\pi(x'|x)}{p(x'|x)}$

The `average cost` $c$ and `differential cost-to-go` $v(x)$ for $\pi$ satisfy
$$c+v(x)=q(x)+\sum_{x'}\pi(x'|x)(log\frac{\pi(x'|x)}{p(x'|x)} + v(x'))$$
>Note: if $v':X\rightarrow R$ is a solution for $v(x)$, then $v'+C$ is also a solution where $C$ is any constant.

Let $v^*$ be the optimal differential cost-to-go, $\pi^*$ the corresponding action-induced transition probability, and $c$ the optimal average cost. Then
$$c+v^*(x)=q(x)+\sum_{x'}\pi^*(x'|x)(\log\frac{\pi^*(x'|x)}{p(x'|x)} + v^*(x'))$$
$$\pi^*(x'|x)=\frac{p(x'|x)exp(-v^*(x'))}{\sum_y p(y|x)exp(-v^*(y))}$$

><font color='green'>
<a href="http://localhost:8889/notebooks/Efficient%20Computation%20of%20Optimal%20Actions.ipynb">**Proof**</a> can be found in <a href="https://www.ncbi.nlm.nih.gov/pubmed/19574462">Efficient Computation of Optimal Actions</a> by Emanuel Todorov
</font>
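
The optimal policy formula above can be checked numerically on a small problem. The sketch below is not taken from the paper's text: it assumes the standard substitution $z(x)=exp(-v^*(x))$ from the cited "Efficient Computation of Optimal Actions" paper, under which the optimality equation becomes the linear eigenproblem $exp(-c)z=diag(exp(-q))Pz$. The function name `solve_lmdp` and the 3-state example are hypothetical, and strictly positive passive dynamics are assumed so that every logarithm is finite.

```python
import numpy as np

def solve_lmdp(P, q):
    """Solve a small average-cost LMDP exactly (strictly positive P assumed).

    With z = exp(-v), the optimality equation reads
    exp(-c) z = diag(exp(-q)) P z, a principal-eigenvector problem.
    """
    M = np.exp(-q)[:, None] * P           # diag(exp(-q)) @ P
    vals, vecs = np.linalg.eig(M)
    i = np.argmax(vals.real)              # Perron (largest) eigenvalue
    z = np.abs(vecs[:, i].real)           # positive eigenvector (fixes the sign)
    c = -np.log(vals[i].real)             # optimal average cost
    v = -np.log(z)                        # cost-to-go, defined up to a constant
    pi_star = P * z[None, :] / (P @ z)[:, None]
    return c, v, pi_star

# Hypothetical 3-state example
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
q = np.array([0.0, 1.0, 2.0])

c, v, pi_star = solve_lmdp(P, q)
kl = np.sum(pi_star * np.log(pi_star / P), axis=1)
residual = c + v - (q + kl + pi_star @ v)   # Bellman residual, should be ~0
print(c, np.max(np.abs(residual)))
```

The printed residual is the left side minus the right side of the optimality equation evaluated at $\pi^*$, so a value near machine precision indicates that the Gibbs-form policy satisfies it.
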

 
Discrete Problems: Policy Gradient Method 
===================

<h3>Policy gradient for general parameterization</h3>

Consider parameterized transition probability $\pi(x'|x, w)$. 

Let $\mu(x,w)$ be the stationary probability of state $x$ under the policy, and let $\mu(x, x', w)=\mu(x,w)\pi(x'|x,w)$ be the joint probability of visiting $x$ and then moving to $x'$. **Note: this is the average-cost setting, so no discount factor is used.**

**`Policy gradient`**: find the parameter $w$ that minimizes the average cost via gradient descent.

$$\nabla_w c=\sum_x\mu(x)\sum_{x'}\nabla_w \pi(x'|x)(log\frac{\pi(x'|x)}{p(x'|x)}+v(x'))$$

> <font color='green'>
**Proof Sketch**
(refer to <a href="https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf">policy gradient template</a>)<br>
\begin{align}
c &=q(x)+\sum_{x'}\pi(x'|x)(log\frac{\pi(x'|x)}{p(x'|x)} + v(x'))-v(x)\\
\nabla_w c &=\nabla_w[q(x)+\sum_{x'}\pi(x'|x)(log\frac{\pi(x'|x)}{p(x'|x)} + v(x'))-v(x)]\\
&=\nabla_w [\sum_{x'}\pi(x'|x)(log\frac{\pi(x'|x)}{p(x'|x)} + v(x'))-v(x)]\\ 
&=\sum_{x'}\nabla_w \pi(x'|x)(log\frac{\pi(x'|x)}{p(x'|x)} + v(x'))+\sum_{x'}\pi(x'|x)\nabla_w (log\frac{\pi(x'|x)}{p(x'|x)} + v(x'))-\nabla_w v(x)\\
\nabla_w c &=\nabla_w c\sum_x\mu(x)=\sum_x\mu(x)\nabla_w c\\
&=\sum_x\mu(x)\sum_{x'}\nabla_w \pi(x'|x)(log\frac{\pi(x'|x)}{p(x'|x)} + v(x'))+\sum_x\mu(x)\sum_{x'}\pi(x'|x)\nabla_w (log\frac{\pi(x'|x)}{p(x'|x)} + v(x'))-\sum_x\mu(x)\nabla_w v(x)\\
&=\sum_x\mu(x)\sum_{x'}\nabla_w \pi(x'|x)(log\frac{\pi(x'|x)}{p(x'|x)} + v(x'))+\sum_{x}\mu(x)\sum_{x'}\pi(x'|x)\nabla_w log\frac{\pi(x'|x)}{p(x'|x)} + \sum_{x}\mu(x)\sum_{x'}\pi(x'|x)\nabla_w v(x')-\sum_x\mu(x)\nabla_w v(x)\\
&=\sum_x\mu(x)\sum_{x'}\nabla_w \pi(x'|x)(log\frac{\pi(x'|x)}{p(x'|x)} + v(x')) + \sum_{x}\mu(x)\sum_{x'}\nabla_w \pi(x'|x)+ \sum_{x'}\mu(x')\nabla_w v(x')-\sum_x\mu(x)\nabla_w v(x)\\
&=\sum_x\mu(x)\sum_{x'}\nabla_w \pi(x'|x)(log\frac{\pi(x'|x)}{p(x'|x)} + v(x')) + \sum_{x}\mu(x)\sum_{x'}\nabla_w \pi(x'|x)\\
&=\sum_x\mu(x)\sum_{x'}\nabla_w \pi(x'|x)(log\frac{\pi(x'|x)}{p(x'|x)} + v(x')) + \sum_{x}\mu(x)\nabla_w \sum_{x'}\pi(x'|x)\\
&=\sum_x\mu(x)\sum_{x'}\nabla_w \pi(x'|x)(log\frac{\pi(x'|x)}{p(x'|x)} + v(x'))
\end{align}
</font>


<h3>Parameterize the policy</h3>

Use Gibbs distribution for parameterized transition probability $\pi(x'|x, w)\triangleq \frac{p(x'|x)exp(-w^T f(x'))}{\sum_y p(y|x)exp(-w^T f(y))}$.

Use operator $\Pi[f](x)\triangleq \sum_y \pi(y|x)f(y)$ and rewrite $\nabla_w c=\sum_{x,x'}\mu(x, x')(\Pi[f](x)-f(x'))(v(x')-w^T f(x'))$
><font color='green'>
**Proof Sketch**
<br>
\begin{align}
\nabla_w c &=\sum_x\mu(x)\sum_{x'}\nabla_w \pi(x'|x)(\log\frac{\pi(x'|x)}{p(x'|x)} + v(x'))\\
&= \sum_x\mu(x)\sum_{x'}\nabla_w \pi(x'|x)(\log\pi(x'|x) - \log p(x'|x) + v(x'))\\
&= \sum_x\mu(x)\sum_{x'}\nabla_w \pi(x'|x)(\log p(x'|x) - w^Tf(x')-\log\sum_yp(y|x)exp(-w^T f(y)) - \log p(x'|x) + v(x'))\\
&= \sum_x\mu(x)\sum_{x'}\nabla_w \pi(x'|x)(v(x')- w^Tf(x')-\log\sum_yp(y|x)exp(-w^T f(y)) )\\
&= \sum_x\mu(x)\sum_{x'}\nabla_w \pi(x'|x)(v(x')- w^Tf(x'))- \sum_x\mu(x)\sum_{x'}\nabla_w \pi(x'|x)\log\sum_yp(y|x)exp(-w^T f(y))\\
&= \sum_x\mu(x)\sum_{x'}\nabla_w \pi(x'|x)(v(x')- w^Tf(x'))- \sum_x\mu(x)(\log\sum_yp(y|x)exp(-w^T f(y)))\nabla_w\sum_{x'} \pi(x'|x)\\
&= \sum_x\mu(x)\sum_{x'}\nabla_w \frac{p(x'|x)exp(-w^T f(x'))}{\sum_y p(y|x)exp(-w^T f(y))}(v(x')- w^Tf(x'))\\
&= \sum_x\mu(x)\sum_{x'}\frac{p(x'|x)exp(-w^T f(x'))}{\sum_y p(y|x)exp(-w^T f(y))}(-f(x')+\frac{\sum_y p(y|x)exp(-w^T f(y))f(y)}{\sum_y p(y|x)exp(-w^T f(y))})(v(x')- w^Tf(x'))\\
&= \sum_x\mu(x)\sum_{x'}\pi(x'|x)(-f(x')+\sum_y\pi(y|x)f(y))(v(x')- w^Tf(x'))\\
&= \sum_{x,x'}\mu(x, x')(-f(x')+\sum_y\pi(y|x)f(y))(v(x')- w^Tf(x'))\\
&= \sum_{x,x'}\mu(x, x')(\Pi[f](x)-f(x'))(v(x')-w^T f(x'))
\end{align}
</font>
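
The feature-based gradient formula can be compared against a finite-difference derivative of the average cost. The following is a minimal sketch under stated assumptions: a small, fully known discrete problem with strictly positive passive dynamics, so that $\mu$, $c$ and $v$ can be computed exactly by linear algebra instead of being estimated from samples. All function names and the 4-state example are hypothetical.

```python
import numpy as np

def gibbs_policy(P, F, w):
    """pi(x'|x,w) proportional to p(x'|x) exp(-w^T f(x'))."""
    u = F @ w
    U = P * np.exp(-(u - u.min()))[None, :]   # shift for numerical stability
    return U / U.sum(axis=1, keepdims=True)

def evaluate(Pi, P, q):
    """Stationary distribution mu, average cost c, differential cost-to-go v."""
    n = len(q)
    # mu solves mu^T (I - Pi) = 0 together with sum(mu) = 1
    A = np.vstack([(np.eye(n) - Pi).T, np.ones(n)])
    mu = np.linalg.lstsq(A, np.append(np.zeros(n), 1.0), rcond=None)[0]
    l = q + np.sum(Pi * np.log(Pi / P), axis=1)   # state cost plus KL cost
    c = mu @ l
    # v solves (I - Pi) v = l - c, normalized so that mu^T v = 0
    B = np.vstack([np.eye(n) - Pi, mu])
    v = np.linalg.lstsq(B, np.append(l - c, 0.0), rcond=None)[0]
    return mu, c, v

def policy_gradient(P, q, F, w):
    """sum_{x,x'} mu(x,x') (Pi[f](x) - f(x')) (v(x') - w^T f(x'))."""
    Pi = gibbs_policy(P, F, w)
    mu, c, v = evaluate(Pi, P, q)
    joint = mu[:, None] * Pi          # mu(x, x')
    PiF = Pi @ F                      # Pi[f](x), one row per state
    weight = v - F @ w                # v(x') - w^T f(x')
    return (joint @ weight) @ PiF - (joint * weight[None, :]).sum(axis=0) @ F

def average_cost(P, q, F, w):
    return evaluate(gibbs_policy(P, F, w), P, q)[1]

# Hypothetical 4-state example with 2 features per state
P = np.array([[0.4, 0.3, 0.2, 0.1],
              [0.1, 0.4, 0.3, 0.2],
              [0.2, 0.1, 0.4, 0.3],
              [0.3, 0.2, 0.1, 0.4]])
q = np.array([0.0, 1.0, 2.0, 0.5])
F = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w = np.array([0.3, -0.2])

grad = policy_gradient(P, q, F, w)
h = 1e-6
fd = np.array([(average_cost(P, q, F, w + h * e) - average_cost(P, q, F, w - h * e)) / (2 * h)
               for e in np.eye(len(w))])
print(grad, fd)
```

The two printed vectors should agree up to finite-difference error; the exact-evaluation step is only feasible for small state spaces and stands in for the sampled estimates a practical method would use.
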

<h3>Compatible cost-to-go function approximation</h3>

Approximate the cost-to-go function $v(x)$ with a compatible function which
* has the same gradient as Q-function
* is orthogonal to the remaining terms in the policy gradient

Try $v(x)\approx\hat{v}(x)=r^Tf(x)$ and define $\epsilon_r(x)\triangleq v(x)-\hat{v}(x)$ and $d(r)\triangleq \sum_{x,x'}\mu(x,x')(\Pi[f](x)-f(x'))\epsilon_r(x')$.
It is provable that
$$d(r)= \sum_{x}\mu(x)(\Pi[f](x)\Pi[\epsilon_r](x)-f(x)\epsilon_r(x))$$
><font color='green'>
**Proof Sketch**
<br>
\begin{align}
d(r) &=\sum_{x,x'}\mu(x,x')(\Pi[f](x)-f(x'))\epsilon_r(x')\\
&= \sum_x\mu(x)\sum_{x'}\pi(x'|x)(\sum_{y}\pi(y|x)f(y)-f(x'))(v(x')-r^Tf(x'))\\ 
&= \sum_x\mu(x)\sum_{x'}\pi(x'|x)\sum_{y}\pi(y|x)f(y)v(x')-\sum_x\mu(x)\sum_{x'}\pi(x'|x)f(x')v(x')-\sum_x\mu(x)\sum_{x'}\pi(x'|x)\sum_{y}\pi(y|x)f(y)r^Tf(x')+\sum_x\mu(x)\sum_{x'}\pi(x'|x)f(x')r^Tf(x')\\
&= \sum_x\mu(x)\sum_{x'}\pi(x'|x)f(x')\sum_{y}\pi(y|x)v(y)-\sum_x\mu(x)\sum_{x'}\pi(x'|x)f(x')\sum_{y}\pi(y|x)r^Tf(y)-\sum_x\mu(x)\sum_{x'}\pi(x'|x)f(x')v(x')+\sum_x\mu(x)\sum_{x'}\pi(x'|x)f(x')r^Tf(x')\\
&= \sum_{x}\mu(x)\sum_{x'}\pi(x'|x)f(x')\sum_{y}\pi(y|x)(v(y)-r^Tf(y))- \sum_{x}\mu(x)\sum_{x'}\pi(x'|x)f(x')(v(x')-r^Tf(x'))\\
&= \sum_{x}\mu(x)\sum_{x'}\pi(x'|x)f(x')\sum_{y}\pi(y|x)(v(y)-r^Tf(y))- \sum_{x}\mu(x)f(x)(v(x)-r^Tf(x))\\
&= \sum_{x}\mu(x)\sum_{x'}\pi(x'|x)f(x')\sum_{y}\pi(y|x)\epsilon_r(y)- \sum_{x}\mu(x)f(x)\epsilon_r(x)\\  
&= \sum_{x}\mu(x)\Pi[f](x)\Pi[\epsilon_r](x)-\sum_{x}\mu(x)f(x)\epsilon_r(x)\\
&= \sum_{x}\mu(x)(\Pi[f](x)\Pi[\epsilon_r](x)-f(x)\epsilon_r(x))
\end{align}
</font>

Fit $r$ by least squares, weighting each state by $\mu(x)$: $r_{LS}\triangleq argmin_r\ \sum_x\mu(x)(v(x)-r^Tf(x))^2$. At the minimum the gradient w.r.t. $r$ is $0$, thus
$$\nabla_r\sum_x\mu(x)(v(x)-r^Tf(x))^2=0\implies\sum_x\mu(x)f(x)\epsilon_r(x)=0$$
i.e. $f(x)$ and $\epsilon_r(x)=v(x)-r^Tf(x)$ are orthogonal under the weighting $\mu$.

Rewrite $d(r)$ as an affine function of $r$
$$d(r)=Ar-k$$
$$A\triangleq \sum_x\mu(x)(f(x)f(x)^T-\Pi[f](x)\Pi[f](x)^T)$$
$$k\triangleq \sum_x\mu(x)(f(x)v(x)-\Pi[f](x)\Pi[v](x))$$

* **$A$ does not depend on $v(x)$ or $r$, but $k$ depends on $v(x)$. Now eliminate the influence of $v(x)$ on $k$**.

Recall that $c+v(x)=l(x) + \sum_{x'} \pi(x'|x)v(x')$, i.e. $\Pi[v](x)=c+v(x)-l(x)$. Substituting this into $k$ gives
$$k= \sum_x\mu(x)(g(x)v(x)+\Pi[f](x)(l(x)-c))$$
$$\text{where }g(x)\triangleq f(x)-\Pi[f](x)$$

Use least squares to fit $\tilde{v}(x)=s^Tg(x)$ to $v(x)$, as was done with $f(x)$; then **$g(x)$ and $\tilde{\epsilon}_s(x)=v(x)-\tilde{v}(x)$ are orthogonal.** Then
\begin{align}
k&= \sum_x\mu(x)(g(x)v(x)+\Pi[f](x)(l(x)-c))\\
&= \sum_x\mu(x)(g(x)(v(x)-\tilde{\epsilon}_s(x))+\Pi[f](x)(l(x)-c))\\
&= \sum_x\mu(x)(g(x)\tilde{v}(x)+\Pi[f](x)(l(x)-c))
\end{align}

**With the fit on $g(x)$, the approximation error in $v(x)$ no longer affects $k$.**

<h3>Policy gradient procedure for LMDP</h3><ol>
    <li> Fit $\tilde{v}(x)$ to $v(x)$ from samples via least squares</li>
    <li> Use $\tilde{v}(x)$ to calculate the average cost $c$</li>
    <li> Calculate $A, k$ with $\tilde{v}(x)$ and $c$</li>
    <li> Fit $\hat{v}(x)=r^Tf(x)$ by solving $Ar=k$</li>
    <li> Calculate the policy gradient $\nabla_w c=\sum_{x,x'}\mu(x, x')(\Pi[f](x)-f(x'))(\hat{v}(x')-w^T f(x'))=\sum_{x,x'}\mu(x, x')(f(x')-\Pi[f](x))f(x')^T(w-r)$</li>
    </ol>
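
The last three steps of the procedure can be sketched for a small, fully known problem. This is an illustration under assumptions rather than the sampled algorithm: the exact $v(x)$, obtained by linear algebra, replaces the sample-based fit $\tilde{v}(x)$, and $k$ is computed directly from its definition $\sum_x\mu(x)(f(x)v(x)-\Pi[f](x)\Pi[v](x))$. The final sum in step 5 collapses to $A(w-r)$ once stationarity of $\mu$ is used, which is what the code evaluates. Function names and the example are hypothetical.

```python
import numpy as np

def gibbs_policy(P, F, w):
    """pi(x'|x,w) proportional to p(x'|x) exp(-w^T f(x'))."""
    u = F @ w
    U = P * np.exp(-(u - u.min()))[None, :]
    return U / U.sum(axis=1, keepdims=True)

def evaluate(Pi, P, q):
    """Exact mu, per-state cost l, average cost c, cost-to-go v (small problems only)."""
    n = len(q)
    A = np.vstack([(np.eye(n) - Pi).T, np.ones(n)])
    mu = np.linalg.lstsq(A, np.append(np.zeros(n), 1.0), rcond=None)[0]
    l = q + np.sum(Pi * np.log(Pi / P), axis=1)
    c = mu @ l
    B = np.vstack([np.eye(n) - Pi, mu])
    v = np.linalg.lstsq(B, np.append(l - c, 0.0), rcond=None)[0]
    return mu, l, c, v

def compatible_fit(P, q, F, w):
    Pi = gibbs_policy(P, F, w)
    mu, l, c, v = evaluate(Pi, P, q)
    PiF = Pi @ F                                            # Pi[f](x)
    # A = sum_x mu(x) (f f^T - Pi[f] Pi[f]^T)
    A = F.T @ (mu[:, None] * F) - PiF.T @ (mu[:, None] * PiF)
    # k = sum_x mu(x) (f v - Pi[f] Pi[v]), using the exact v (assumption)
    k = F.T @ (mu * v) - PiF.T @ (mu * (Pi @ v))
    r = np.linalg.solve(A, k)                               # compatible weights, A r = k
    grad = A @ (w - r)                                      # policy gradient from step 5
    return A, k, r, grad

# Hypothetical 4-state example with 2 features per state
P = np.array([[0.4, 0.3, 0.2, 0.1],
              [0.1, 0.4, 0.3, 0.2],
              [0.2, 0.1, 0.4, 0.3],
              [0.3, 0.2, 0.1, 0.4]])
q = np.array([0.0, 1.0, 2.0, 0.5])
F = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w = np.array([0.3, -0.2])

A, k, r, grad = compatible_fit(P, q, F, w)
print(r, grad)
```

Because $Ar=k$ makes $d(r)=0$, the gradient computed from $\hat{v}(x)=r^Tf(x)$ should coincide with the one computed from the exact $v(x)$, even though $\hat{v}$ itself is only an approximation.
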

<h3>Natural policy gradient</h3>

Introduce the natural metric $G(w)$ (the Fisher information matrix of the policy) for the gradient of the objective function.
$$G(w)\triangleq\sum_{x,x'}\mu(x,x')\nabla_w log\pi(x'|x)\nabla_wlog\pi(x'|x)^T$$
Then replace $\nabla_w c$ with the natural policy gradient
$$G(w)^{-1}\nabla_w c=w-r$$

> <font color='green'>
**Proof Sketch**
\begin{align}
\nabla_w\pi(x'|x)&= \pi(x'|x)[\sum_y \pi(y|x)f(y)-f(x')]\\
\nabla_w \log{\pi(x'|x)}&= \frac{\nabla_w\pi(x'|x)}{\pi(x'|x)}\\
&= \sum_y \pi(y|x)f(y)-f(x')
\end{align}
\begin{align}
\nabla_w c &= \sum_{x,x'}\mu(x, x')(f(x')-\Pi[f](x))f(x')^T(w-r)\\
    &= \sum_{x,x'}\mu(x, x')(-\nabla_w \log\pi(x'|x))(\Pi[f](x)-\nabla_w \log\pi(x'|x))^T(w-r)\\
    &= \left[\sum_{x,x'}\mu(x, x')\nabla_w \log\pi(x'|x)\nabla_w \log\pi(x'|x)^T -\sum_{x}\mu(x)\sum_{x'}\pi(x'|x)\nabla_w \log\pi(x'|x)\Pi[f](x)^T\right](w-r)\\
    &= \left[G(w) -\sum_{x}\mu(x)\left(\nabla_w\sum_{x'}\pi(x'|x)\right)\Pi[f](x)^T\right](w-r)\\
    &= G(w)(w-r)\\
G(w)^{-1}\nabla_w c &= w-r
\end{align}
</font>
**When the gradient is 0, $w=r$.**
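
The identity $G(w)^{-1}\nabla_w c=w-r$ can be checked numerically by building $G(w)$ from the score vectors $\nabla_w\log\pi(x'|x)=\Pi[f](x)-f(x')$. The sketch below assumes a small, fully known problem where $\mu$ and $v$ are available exactly, and features for which $G(w)$ is invertible; all names and the example are hypothetical.

```python
import numpy as np

def gibbs_policy(P, F, w):
    """pi(x'|x,w) proportional to p(x'|x) exp(-w^T f(x'))."""
    u = F @ w
    U = P * np.exp(-(u - u.min()))[None, :]
    return U / U.sum(axis=1, keepdims=True)

def evaluate(Pi, P, q):
    """Exact stationary distribution mu, average cost c, cost-to-go v."""
    n = len(q)
    A = np.vstack([(np.eye(n) - Pi).T, np.ones(n)])
    mu = np.linalg.lstsq(A, np.append(np.zeros(n), 1.0), rcond=None)[0]
    l = q + np.sum(Pi * np.log(Pi / P), axis=1)
    c = mu @ l
    B = np.vstack([np.eye(n) - Pi, mu])
    v = np.linalg.lstsq(B, np.append(l - c, 0.0), rcond=None)[0]
    return mu, c, v

def natural_gradient(P, q, F, w):
    Pi = gibbs_policy(P, F, w)
    mu, c, v = evaluate(Pi, P, q)
    joint = mu[:, None] * Pi                               # mu(x, x')
    score = (Pi @ F)[:, None, :] - F[None, :, :]           # grad_w log pi(x'|x), shape (n, n, k)
    G = np.einsum('xy,xyi,xyj->ij', joint, score, score)   # natural metric G(w)
    grad = np.einsum('xy,xyi,y->i', joint, score, v - F @ w)
    return np.linalg.solve(G, grad), G, grad

# Hypothetical 4-state example with 2 features per state
P = np.array([[0.4, 0.3, 0.2, 0.1],
              [0.1, 0.4, 0.3, 0.2],
              [0.2, 0.1, 0.4, 0.3],
              [0.3, 0.2, 0.1, 0.4]])
q = np.array([0.0, 1.0, 2.0, 0.5])
F = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w = np.array([0.3, -0.2])

nat_grad, G, grad = natural_gradient(P, q, F, w)
r = w - nat_grad          # compatible weights implied by the identity
w_next = w - 0.5 * nat_grad   # one natural-gradient step; step size 0.5 is an arbitrary choice
print(nat_grad, r, w_next)
```

A step size of 1 would set `w_next` equal to `r`; the smaller step here is only an illustrative choice.
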


<h3>Opt: Gauss-Newton Method</h3>

Besides policy gradient, there are other ways to solve for the optimal $v^*(x)$.

**Option 1: `approximate policy iteration`**:<ol>
<li> Given the policy parameter $w^{(i)}$ at iteration i, solve an approximated feature weight $r^{(i)}$.</li>
<li> Use approximation $w^{(i+1)}\approx r^{(i)}$</li>
</ol>
>Equivalent to natural gradient method with step size 1 $$w^{(i+1)}=w^{(i)}+1\cdot(r^{(i)}-w^{(i)})$$

**Option 2: `approximate value iteration`** solves a fixed-point problem by setting $v^*(x)=w^Tf(x)$. 

Define $e(x,w)\triangleq w^Tf(x)-q(x)+\log{\sum_y p(y|x)exp(-w^Tf(y))}$

$$\nabla_w e(x, w)=f(x)-\sum_y\frac{p(y|x)exp(-w^Tf(y))}{\sum_s p(s|x)exp(-w^Tf(s))}f(y)=f(x)-\Pi[f](x)=g(x)$$

Then $e(x, w+\delta w)\approx e(x,w)+\delta w^T \nabla_w e(x,w)=e(x,w)+\delta w^T g(x)$.

Adding the average cost $c$, the loss minimized in each iteration is 
$$min_{c,\delta w} \sum_x\bar{\mu}(x)(c+e(x,w)+\delta w^T g(x))^2$$

where $\bar{\mu}(x)$ can be fixed or be the on-policy sample distribution $\mu(x, w)$
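
One iteration of this scheme can be written as a weighted linear least-squares solve in the unknowns $(c,\delta w)$. The sketch below treats the objective as a sum of squared linearized residuals, in line with the Gauss-Newton heading, and makes several assumptions of its own: tabular features (`F` is the identity), a fixed uniform weighting for $\bar{\mu}$, strictly positive passive dynamics, and a fixed number of iterations. The function name `gauss_newton_step` and the example are hypothetical.

```python
import numpy as np

def gauss_newton_step(P, q, F, w, mu_bar):
    """Solve min over (c, dw) of sum_x mu_bar(x) (c + e(x,w) + dw^T g(x))^2."""
    u = F @ w                                     # w^T f(y) for every state y
    shift = u.min()                               # stabilizes the log-sum-exp
    Z = P @ np.exp(-(u - shift))
    log_Z = np.log(Z) - shift                     # log sum_y p(y|x) exp(-w^T f(y))
    Pi = P * np.exp(-(u - shift))[None, :] / Z[:, None]
    e = u - q + log_Z                             # residual e(x, w)
    G = F - Pi @ F                                # rows are g(x) = f(x) - Pi[f](x)
    X = np.hstack([np.ones((len(q), 1)), G])      # columns for c and dw
    s = np.sqrt(mu_bar)                           # weights enter as square roots
    theta = np.linalg.lstsq(X * s[:, None], -e * s, rcond=None)[0]
    return theta[0], w + theta[1:]

# Hypothetical 3-state example with tabular features
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
q = np.array([0.0, 1.0, 2.0])
F = np.eye(3)
w = np.zeros(3)
mu_bar = np.full(3, 1.0 / 3.0)

for _ in range(50):
    c, w = gauss_newton_step(P, q, F, w, mu_bar)

u = F @ w
e = u - q + np.log(P @ np.exp(-u))                # residual at the final w
print(c, e)
```

At a fixed point $c+e(x,w)=0$ for every state, so the entries of `e` should all be close to `-c`. With tabular features each step amounts to evaluating the current policy exactly; with a smaller feature set the residual generally cannot be driven to zero.
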



Continuous Problems: Policy Gradient Method
==================

<h3>Policy gradient for general parameterization</h3>

**`Controlled Ito diffusion`**:
$$dx=b(x,u)dt+C(x)dw$$
where $w(t)$ is a standard multidimensional Brownian motion process, and $u$ is a control input. 

**`Hamilton-Jacobi-Bellman (HJB) equation`** for average cost $c$ and cost-to-go $v(x)$:
$$c=l(x,\pi(x))+L[v](x)$$
$$L[v](x)\triangleq b(x,\pi(x))^T\nabla_x v(x)+\frac{1}{2}trace(C(x)C(x)^T\nabla_{xx}v(x))$$

Then, for the stationary density $\mu(x)$ of the controlled diffusion and any twice-differentiable $f$, $\int \mu(x)L[f](x)dx=0$

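> <font color='green'>
**Proof Sketch**
By Ito's lemma, $\frac{d}{dt}E[f(x_t)]=E[L[f](x_t)]$ for any twice-differentiable $f$. If $x_t$ is distributed according to the stationary density $\mu$, then $E[f(x_t)]$ does not change over time, so $\int \mu(x)L[f](x)dx=0$.
</font>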
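
The identity $\int \mu(x)L[f](x)dx=0$ can be illustrated on a one-dimensional case where the stationary density is known in closed form. The example is an assumption made for illustration only: a fixed policy giving drift $b(x,\pi(x))=-x$ and constant noise $C(x)=\sigma$ (an Ornstein-Uhlenbeck process), whose stationary density is Gaussian with variance $\sigma^2/2$. The integral is approximated by a Riemann sum on a wide grid.

```python
import numpy as np

sigma = 0.8                               # hypothetical noise magnitude C(x) = sigma
x = np.linspace(-8.0, 8.0, 20001)         # grid wide enough that the tails are negligible
dx = x[1] - x[0]

var = sigma**2 / 2.0                      # stationary variance of dx = -x dt + sigma dw
mu = np.exp(-x**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def generator(f_x, f_xx):
    """L[f](x) = b(x) f'(x) + 0.5 sigma^2 f''(x) with drift b(x) = -x."""
    return -x * f_x + 0.5 * sigma**2 * f_xx

# Test function f(x) = x^2: f' = 2x, f'' = 2
integral_sq = np.sum(mu * generator(2.0 * x, 2.0 * np.ones_like(x))) * dx

# Test function f(x) = cos(x): f' = -sin(x), f'' = -cos(x)
integral_cos = np.sum(mu * generator(-np.sin(x), -np.cos(x))) * dx

print(integral_sq, integral_cos)
```

Both sums should be close to zero, up to discretization and truncation error.
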

Consider the policy parameterization $u=\pi(x,w)$. Writing $l(x)$ and $b(x)$ as shorthand for $l(x,\pi(x,w))$ and $b(x,\pi(x,w))$, $\nabla_w c=\nabla_w l(x)+\nabla_w b(x)^T \nabla_x v(x) + L[\nabla_w v](x)$
> <font color='green'>
**Proof Sketch**
\begin{align}
\nabla_w c &=\nabla_w [l(x,\pi(x))+L[v](x)]\\
    &= \nabla_w l(x) + \nabla_w [b(x,\pi(x))^T\nabla_x v(x)+\frac{1}{2}trace(C(x)C(x)^T\nabla_{xx}v(x))]\\
    &= \nabla_w l(x) + (\nabla_w b(x, \pi(x)))^T\nabla_x v(x) + [b(x,\pi(x))^T\nabla_{x}\nabla_w v(x)+ \frac{1}{2}trace(C(x)C(x)^T\nabla_{xx}\nabla_w v(x))]\\
    &= \nabla_w l(x)+\nabla_w b(x)^T \nabla_x v(x) + L[\nabla_w v](x)
\end{align}
</font>

