<div align='center'>
    <h1> 
        <a href='https://arxiv.org/abs/1502.05477'> Trust Region Policy Optimization (TRPO) </a> 
    </h1>
</div>

# Intro

TRPO is an `Actor-Critic` (AC), `online` (learns from trajectories collected during run time), and `on-policy` algorithm (uses only trajectories collected with the latest policy).

- TRPO optimizes a `stochastic policy` and is suitable for both `continuous and discrete action spaces`.

To ensure that the updated policy does not move too far from the current policy, TRPO adds a KL-divergence constraint to the performance objective. Note that the gradient of the TRPO surrogate objective, evaluated at $\theta = \theta_{\text{old}}$, is identical to the vanilla policy gradient (VPG). The TRPO objective is:

$$ \underset{\theta}{\operatorname{maximize}} \ \hat{\mathbb{E}}_t\Bigg[\frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}\hat{A}_t \Bigg],$$ 

subject to $$\hat{\mathbb{E}}_t[\text{KL}[\pi_{\theta_{\text{old}}}(\cdot | s_t), \pi_{\theta}(\cdot | s_t)]]\leq \delta.$$
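
To make the two quantities above concrete, here is a minimal sketch that evaluates the sample surrogate objective and the sample mean KL-divergence for a discrete (categorical) policy. The function name `surrogate_and_kl` and its argument layout are assumptions made for illustration, not part of the original paper or of any library.

```python
import math


def surrogate_and_kl(old_probs, new_probs, actions, advantages):
    """Return (surrogate objective, mean KL(old || new)) over a batch of time steps.

    old_probs, new_probs: one action distribution (list of probabilities) per step.
    actions: index of the action actually taken at each step.
    advantages: advantage estimate for each step.
    """
    n = len(actions)
    surrogate = 0.0
    mean_kl = 0.0
    for p_old, p_new, a, adv in zip(old_probs, new_probs, actions, advantages):
        # Importance-weighted advantage: pi_theta(a|s) / pi_theta_old(a|s) * A_hat
        surrogate += (p_new[a] / p_old[a]) * adv
        # KL[pi_old(.|s), pi_new(.|s)] for a categorical distribution
        mean_kl += sum(
            po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0.0
        )
    return surrogate / n, mean_kl / n


# Hypothetical two-step batch with two actions.
old = [[0.5, 0.5], [0.5, 0.5]]
new = [[0.6, 0.4], [0.5, 0.5]]
objective, kl = surrogate_and_kl(old, new, actions=[0, 1], advantages=[1.0, -1.0])
delta = 0.05  # example KL limit
print(objective, kl, kl <= delta)
```

A candidate policy is acceptable only if the second returned value stays at or below $\delta$.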

TRPO performs well in simulated robotic locomotion and in playing Atari games using pixels from the screen as input [[John Schulman, PhD Thesis, Pg.6]](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-217.html).

# Algorithm

---
**Algorithm (Pseudocode): Trust Region Policy Optimization (adapted from OpenAI Spinning Up [2])**

---

- Input: initial policy parameters $\theta_0$ and initial value function parameters $\phi_0$.

- Hyperparameters: KL-divergence limit $\delta$, backtracking coefficient $\alpha$, maximum number of backtracking steps $K$.

- **for** episode $k= 0, 1, 2, \dots$ do:

    - Collect a set of trajectories $\mathcal{D}_k\doteq\{\tau_i\}$, where each $\tau_i = (s_0, a_0, r_0, \cdots , s_T, a_{T}, r_T)$, by executing the current policy $\pi_k = \pi(\theta_k)$ in the environment, and store them in the buffer $\mathcal{D}_k$.
      
    - **for** each trajectory $\tau \in \mathcal{D}_k$, compute: 
        - the reward-to-go $\hat{\mathcal{R}}_t = \sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1})$, and
        - the advantage estimate $\hat{A}_t$ (using any advantage estimation method, such as `Temporal Difference`), with the current on-policy state value function $V_{\phi_k}$ serving as the baseline that reduces sample variance in the gradient estimate. With the one-step TD residual, this is: $$\hat{A}_t = r_t + \gamma V_{\phi_k}(s_{t+1}) - V_{\phi_k}(s_t)$$
    - **end for**.
    - Estimate the policy gradient as: $$\hat{g}_k = \frac{1}{|\mathcal{D}_k|}\sum_{\tau \in \mathcal{D}_k}\sum_{t=0}^T\nabla_{\theta}\log\pi_{\theta}(a_t| s_t)|_{\theta_k}\hat{A}_t .$$
    - Use the conjugate gradient algorithm to compute: $$ \hat{x}_k \approx \hat{H}_k^{-1} \hat{g}_k,$$ where $\hat{H}_k$ is the Hessian of the sample average KL-divergence.
    - Update the policy by backtracking line search with:
$$\theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2\delta}{\hat{x}_k^T\hat{H}_k\hat{x}_k}}\hat{x}_k,$$ where $j \in \{0, 1, 2, \cdots, K\}$ is the smallest value that improves the sample loss and satisfies the sample KL-divergence constraint.
    - Fit the value function by regression on mean-squared error, via some gradient descent algorithm, minimizing $(V_\phi(s_t) - \hat{\mathcal{R}}_t)^2$ summed over all trajectories and time steps: $$\phi_{k+1} = \underset{\phi}{\operatorname{arg\,min}} \ \frac{1}{|\mathcal{D}_k|T}\sum_{\tau \in \mathcal{D}_k} \sum_{t=0}^T \left(V_\phi(s_t) - \hat{\mathcal{R}}_t \right)^2 .$$ 
    
- **end for**.

---
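
The per-trajectory quantities in the pseudocode above (the reward-to-go and the one-step TD advantage) can be sketched in a few lines of plain Python. The function names and the convention that `values` carries one extra bootstrap entry for the state after the last step are assumptions chosen for this illustration.

```python
def rewards_to_go(rewards):
    """Undiscounted reward-to-go: sum of rewards from step t to the end of the trajectory."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out


def td_advantages(rewards, values, gamma=0.99):
    """One-step TD residuals r_t + gamma * V(s_{t+1}) - V(s_t).

    Assumption: len(values) == len(rewards) + 1, where the last entry is the
    value of the state reached after the final step (0.0 if it is terminal).
    """
    return [
        rewards[t] + gamma * values[t + 1] - values[t]
        for t in range(len(rewards))
    ]


# Hypothetical three-step trajectory.
rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 2.0, 0.0]  # V(s_0), V(s_1), V(s_2), V(terminal)
print(rewards_to_go(rewards))                   # [3.0, 2.0, 2.0]
print(td_advantages(rewards, values, gamma=0.5))  # [1.0, 0.0, 0.0]
```

The rewards-to-go serve as regression targets for the value function, while the TD residuals play the role of $\hat{A}_t$ in the policy gradient estimate.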

# Implementation
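
As a starting point, the following is a minimal sketch of the two numerical pieces of the update: a conjugate gradient solver for $\hat{H}\hat{x} = \hat{g}$ and the backtracking line search. It is written in plain Python so that it is self-contained; a practical implementation would likely compute Hessian-vector products with an automatic differentiation framework. The callables `hvp`, `surrogate`, and `kl`, as well as both function names, are hypothetical and introduced only for this illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products hvp(v) = H v."""
    x = [0.0] * len(g)
    r = list(g)  # residual g - H x, with x = 0 initially
    p = list(g)  # current search direction
    rs_old = dot(r, r)
    for _ in range(iters):
        if rs_old < tol:
            break
        hp = hvp(p)
        step = rs_old / dot(p, hp)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * hi for ri, hi in zip(r, hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def backtracking_step(theta, x, hvp, surrogate, kl, delta, alpha=0.8, max_backtracks=10):
    """Try theta + alpha^j * sqrt(2 * delta / (x^T H x)) * x for j = 0, 1, ..., K.

    Returns the first candidate that improves the surrogate and keeps the
    KL-divergence within delta; falls back to theta if none qualifies.
    """
    full_scale = math.sqrt(2.0 * delta / dot(x, hvp(x)))
    baseline = surrogate(theta)
    for j in range(max_backtracks + 1):
        scale = (alpha ** j) * full_scale
        candidate = [t + scale * xi for t, xi in zip(theta, x)]
        if surrogate(candidate) > baseline and kl(theta, candidate) <= delta:
            return candidate
    return theta
```

Here `surrogate(theta)` is assumed to return the sample surrogate objective at `theta`, and `kl(theta_old, theta_new)` the sample mean KL-divergence between the two policies. Returning the old parameters when every candidate fails keeps the update inside the trust region by construction.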

# References

[1] [Trust Region Policy Optimization, Schulman et al. 2015.](https://arxiv.org/abs/1502.05477)

[2] [Trust Region Policy Optimization, OpenAI Spinning Up documentation.](https://spinningup.openai.com/en/latest/algorithms/trpo.html)