# Trust Region Policy Optimization

## Introduction

Most algorithms for policy optimization can be classified
into three broad categories:

1. Policy iteration methods
2. Policy gradient methods
3. Derivative-free optimization methods, such as the cross-entropy method (CEM) and covariance matrix adaptation (CMA)

General derivative-free stochastic optimization methods
such as CEM and CMA are preferred on many problems,
because they achieve good results while being simple
to understand and implement.

The authors first prove that minimizing a certain surrogate objective function guarantees policy improvement with non-trivial step sizes. They then make a series of approximations to the theoretically justified algorithm, yielding a practical algorithm, which they call Trust Region Policy Optimization (TRPO). They describe two variants of this algorithm:

1. the single-path method, which can be applied in a model-free setting.
2. the vine method, which is typically only possible in simulation.

## Preliminaries

$\eta (\pi) = E_{\tau \sim \pi} [\sum_{t=0}^{\infty} \gamma^t r(s_t)]$

Where $\tau \sim \pi$ means $a_t \sim \pi(a_t | s_t), s_{t+1} \sim P(s_{t+1} | s_t, a_t), s_0 \sim \rho_0$

Then we can express the expected return
of another policy $\tilde{\pi}$ in terms of the advantage over $\pi$ accumulated over timesteps:

$\eta (\tilde\pi) = \eta (\pi) + E_{\tau \sim \tilde\pi} [\sum_{t=0}^{\infty} \gamma^t A_{\pi} (s_t, a_t)]$

Where $A_\pi (s_t, a_t) = Q^{\pi} (s_t, a_t) - V^{\pi} (s_t)$

Let $\rho_{\pi} (s | s_0) = \sum_{t=0}^{\infty} \gamma^{t} P(S_t = s | S_0=s_0, A_0, \dots, A_{t-1} \sim \pi)$ (the unnormalized discounted future state distribution)

Then $\rho_{\pi} (s) = \int_{s_0} \rho_{\pi} (s | s_0) d\rho_{0} (s_0)$

We could rewrite:

$\eta (\tilde\pi) = \eta (\pi) + E_{\tau \sim \tilde\pi} [\sum_{t=0}^{\infty} \gamma^t A_{\pi} (s_t, a_t)]$

$= \eta (\pi) + \int_{s_0} E_{\tau \sim \tilde\pi}[\sum_{t=0}^{\infty} \gamma^t A_{\pi} (s_t, a_t) | s_0]d\rho_0 (s_0)$

$= \eta (\pi) + \int_{s_0} \sum_{t=0}^{\infty} \gamma^t E_{\tau \sim \tilde\pi | s_0}[A_{\pi} (s_t, a_t)] d\rho_0 (s_0)$

$= \eta (\pi) +  \int_{s_t}\int_{a_t}\int_{s_0} \sum_{t=0}^{\infty} \gamma^t P^{\tilde\pi}(S_t=s_t| S_0=s_0, A_0, \dots, A_{t-1} \sim \tilde\pi) \tilde\pi(A_t=a_t | S_t=s_t) A_{\pi} (s_t, a_t)  d\rho_0 (s_0) da_t ds_t$

$= \eta (\pi) +  \int_{s_t}\rho_{\tilde\pi} (s_t)\int_{a_t} \tilde\pi(A_t=a_t | S_t=s_t) A_{\pi} (s_t, a_t) da_t ds_t$
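The final identity can be checked numerically on a small tabular MDP. The sketch below is illustrative only: the random MDP, the helper names `random_policy` and `evaluate`, and the sizes are assumptions made for this example, with sums over finite state and action sets standing in for the integrals.

```python
import numpy as np

def random_policy(rng, n_states, n_actions):
    """Row-stochastic matrix pi[s, a] = pi(a | s)."""
    return rng.dirichlet(np.ones(n_actions), size=n_states)

def evaluate(P, r, rho0, pi, gamma):
    """Exact eta, advantages A[s, a], and unnormalized discounted visitation rho[s]."""
    n_states = r.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)        # state-to-state transitions under pi
    M = np.linalg.inv(np.eye(n_states) - gamma * P_pi)
    V = M @ r                                    # V = (I - gamma P_pi)^{-1} r
    Q = r[:, None] + gamma * P @ V               # Q[s, a] = r(s) + gamma E[V(s')]
    A = Q - V[:, None]
    rho = rho0 @ M                               # rho(s) = sum_t gamma^t P(S_t = s)
    eta = rho0 @ V
    return eta, A, rho

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.normal(size=n_states)                                     # reward depends on state only
rho0 = rng.dirichlet(np.ones(n_states))                           # initial state distribution

pi = random_policy(rng, n_states, n_actions)
pi_new = random_policy(rng, n_states, n_actions)

eta_old, A_old, _ = evaluate(P, r, rho0, pi, gamma)
eta_new, _, rho_new = evaluate(P, r, rho0, pi_new, gamma)

# Right-hand side: eta(pi) + sum_s rho_new(s) sum_a pi_new(a|s) A_pi(s, a)
rhs = eta_old + np.sum(rho_new[:, None] * pi_new * A_old)
print(eta_new, rhs)
```

Both printed values should agree up to floating-point error, since the identity is exact when the advantages and visitation frequencies are computed exactly.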

This implies that any policy update $\pi \rightarrow \tilde\pi$ that has a nonnegative expected advantage at every state $s$ (i.e., $\int_{a_t} \tilde\pi(A_t=a_t | S_t=s_t) A_{\pi} (s_t, a_t) da_t \geq 0, \forall s_t \in S$) is guaranteed to increase the policy performance $\eta$, or leave it constant in the case that the expected advantage is zero for all states. This implies
the classic result that the update performed by exact
policy iteration, which uses the deterministic policy $\tilde \pi (s) = argmax_{a} A_{\pi} (s, a)$, improves the policy if there is at least one state-action pair with a positive advantage value and non-zero discounted future state distribution; otherwise the algorithm has converged to the optimal policy. However, in the
approximate setting, it will typically be unavoidable, due
to estimation and approximation error, that there will be some states $s$ for which the expected advantage is negative. Also, the complex dependency
of $\rho_{\tilde\pi} (s)$ on $\tilde \pi$ makes the expression above difficult to optimize directly. Instead, we could use:

$L_{\pi} (\tilde \pi) = \eta (\pi) + \sum_{s} \rho_{\pi} (s) \sum_{a} \tilde\pi(a|s) A_\pi (s, a)$

Notice that $\rho_{\tilde\pi} (s)$ is replaced by $\rho_{\pi} (s)$, so the change in the state distribution due to the change in the policy is ignored.
However, if we have a differentiable parameterized policy $\pi_{\theta} (a | s)$, then $L_{\pi}$ matches $\eta$ to first order:

$L_{\pi_{\theta_0}} (\pi_{\theta_0}) = \eta (\pi_{\theta_0})$

$\nabla_{\theta} L_{\pi_{\theta_0}} (\pi_{\theta}) |_{\theta = \theta_0} = \nabla_{\theta} \eta (\pi_{\theta}) |_{\theta = \theta_0}$

This indicates that a sufficiently small step $\pi_{\theta_0} \rightarrow \tilde\pi$ that improves $L_{\pi_{\theta_0}}$ will also improve $\eta$, but it does not give any guidance on how big a step to take.
One possible update rule is $\pi_{new} (a | s) = (1 - \alpha) \pi_{old} (a | s) + \alpha \pi^{\prime} (a | s)$

Where $\pi^{\prime} = argmax_{\pi^{\prime}} L_{\pi_{old}}(\pi^{\prime})$

<img src='pngs/TRPO_1.png'>

From the lower bound (which applies only to mixture policies generated by the update rule above), we can see that a policy update that improves the right-hand side of the bound is guaranteed to improve the true performance $\eta$.

We can extend the above result from mixture policies to general stochastic policies by replacing $\alpha$ with a distance measure between $\pi$ and $\tilde\pi$ and changing the constant $\epsilon$ appropriately. Here, one choice of distance is the total variation divergence:

$D_{TV} (p || q) = \frac{1}{2} \sum_i |p_{i} - q_{i}|$ for discrete probability distributions $p, q$, or $\frac{1}{2}\int |p(x) - q(x)| dx$ for continuous $p, q$. Define $D_{TV}^{max} (\pi, \tilde \pi)$ as:

$D_{TV}^{max} (\pi, \tilde\pi) = max_{s} D_{TV} (\pi(\cdot| s) || \tilde \pi(\cdot | s))$ (i.e., the largest total variation divergence between the two action distributions over all states)
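For tabular policies with discrete actions, these two quantities are straightforward to compute. The following is a minimal sketch; the function names `tv_distance` and `max_tv` and the two example policies are made up for illustration.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation divergence between discrete distributions along the last axis."""
    return 0.5 * np.abs(p - q).sum(axis=-1)

def max_tv(pi, pi_new):
    """Largest per-state TV divergence; pi[s, a] = pi(a | s)."""
    return tv_distance(pi, pi_new).max()

# Two policies over 2 states and 2 actions (rows are states).
pi = np.array([[0.5, 0.5],
               [0.9, 0.1]])
pi_new = np.array([[0.6, 0.4],
                   [0.5, 0.5]])

per_state = tv_distance(pi, pi_new)   # one divergence value per state
worst_case = max_tv(pi, pi_new)       # the state where the policies differ most
print(per_state, worst_case)
```

Here the second state dominates the maximum because the two policies disagree most there.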

<img src='pngs/TRPO_2.png'>

Using the relationship between the total variation divergence and the KL divergence, $D_{TV} (p || q)^2 \leq D_{KL} (p || q)$, we have:

$\eta(\tilde\pi) \geq L_{\pi} (\tilde\pi) - CD_{KL}^{max} (\pi, \tilde\pi)$

where $C = \frac{4\epsilon\gamma}{(1 - \gamma)^2}$ and $D_{KL}^{max} (\pi, \tilde\pi) = max_{s} D_{KL} (\pi(\cdot| s) || \tilde \pi(\cdot | s))$. This yields a policy iteration scheme (the population version) based on the policy improvement bound above.
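The bound can be sanity-checked on a small random tabular MDP. This is a sketch under stated assumptions, not part of the notes above: it takes $\epsilon = max_{s,a} |A_\pi (s, a)|$ as in the paper's theorem, and the helper `evaluate`, the MDP sizes, and the random seeds are invented for the example.

```python
import numpy as np

def evaluate(P, r, rho0, pi, gamma):
    """Return exact eta, advantages A[s, a], and unnormalized visitation rho[s]."""
    n = len(r)
    P_pi = np.einsum("sa,sat->st", pi, P)            # transitions under pi
    M = np.linalg.inv(np.eye(n) - gamma * P_pi)      # sum_t (gamma P_pi)^t
    V = M @ r
    A = r[:, None] + gamma * P @ V - V[:, None]
    return rho0 @ V, A, rho0 @ M

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 3, 0.8
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.normal(size=n_states)
rho0 = rng.dirichlet(np.ones(n_states))
pi = rng.dirichlet(np.ones(n_actions), size=n_states)
pi_new = rng.dirichlet(np.ones(n_actions), size=n_states)

eta_old, adv, rho_old = evaluate(P, r, rho0, pi, gamma)
eta_new, _, _ = evaluate(P, r, rho0, pi_new, gamma)

# Surrogate L_pi(pi_new): uses the OLD visitation frequencies rho_old.
surrogate = eta_old + np.sum(rho_old[:, None] * pi_new * adv)

tv_max = (0.5 * np.abs(pi - pi_new).sum(axis=1)).max()
kl_max = np.sum(pi * np.log(pi / pi_new), axis=1).max()

eps = np.abs(adv).max()                    # assumed definition of epsilon
C = 4 * eps * gamma / (1 - gamma) ** 2
lower_bound = surrogate - C * kl_max
print(eta_new, lower_bound, tv_max ** 2, kl_max)
```

Because $C$ grows like $1/(1-\gamma)^2$, the lower bound is typically very loose for two unrelated policies; it only becomes informative when the KL divergence is small.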

It follows that the algorithm below is guaranteed to generate a monotonically improving sequence of policies $\eta(\pi_0) \leq \eta(\pi_1) \leq \dots$. To see this, let $M_i(\pi) = L_{\pi_i} (\pi) - CD_{KL}^{max} (\pi_i, \pi)$, then:

<img src='pngs/TRPO_4.png'>

Because $\pi_{i+1}$ is selected as the maximizer of $M_i(\pi)$, we are sure that $M_i(\pi_{i+1}) - M_i(\pi_i) \geq 0$, and therefore $\eta(\pi_{i+1}) \geq \eta(\pi_i)$.

<img src='pngs/TRPO_3.png'>

This is a type of minorization-maximization (MM) algorithm, a class of methods that also includes expectation maximization. In the terminology of MM algorithms, $M_i$ is the surrogate function that minorizes $\eta$ with equality at $\pi_i$. TRPO, which is proposed in this paper, is an approximation to the population version of the algorithm above; it uses a constraint on the KL divergence rather than a penalty to robustly allow large updates.

## Optimization of Parameterized Policies

We now describe how to derive a practical
algorithm from these theoretical foundations, under finite
sample counts and arbitrary parameterizations. We introduce the following notation:

$\eta(\theta) := \eta(\pi_\theta), L_{\theta} (\tilde\theta) := L_{\pi_\theta}(\pi_{\tilde\theta}), D_{KL} (\theta ||\tilde\theta) := D_{KL} (\pi_\theta || \pi_{\tilde\theta})$, $\theta_{old} = \text{previous policy parameters that we want to improve upon}$

Reformulating the lower bound in this notation, we have:

$\eta(\theta) \geq L_{\theta_{old}} (\theta) - CD_{KL}^{max} (\theta_{old}, \theta)$, with equality at $\theta = \theta_{old}$. Then by (10) and Algorithm 1, we are guaranteed to obtain an improved policy $\pi_{\theta}$ by solving:

$argmax_{\theta} [L_{\theta_{old}}(\theta) - CD_{KL}^{max} (\theta_{old}, \theta)]$

In practice, if we used the penalty coefficient $C$ recommended
by the theory above, the step sizes would be very
small (i.e., with a large $C$ the penalty term dominates the objective, so the maximizer stays very close to $\theta_{old}$). One way to take larger steps in a robust way is to use
a constraint on the KL divergence between the new policy
and the old policy, i.e., a trust region constraint that allows larger steps while preventing steps that are too large:

<img src='pngs/TRPO_5.png'>

This problem imposes a constraint that the KL divergence
is bounded at every point in the state space. While it is
motivated by the theory, this problem is impractical to solve
due to the large number of constraints. Instead, we can use
a heuristic approximation which considers the average KL
divergence:

<img src='pngs/TRPO_6.png'>
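To illustrate the difference between the two constraints, the sketch below computes both the maximum and the average KL divergence over a batch of states for categorical policies. The logits, the perturbation scale, and the threshold `delta` are hypothetical values chosen for the example, not settings taken from the text.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kl_per_state(p_old, p_new):
    """D_KL(p_old(.|s) || p_new(.|s)) for each row (state)."""
    return np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=1)

rng = np.random.default_rng(0)
n_states, n_actions = 1000, 4

# Action logits of the old policy at sampled states, and a slightly perturbed new policy.
logits_old = rng.normal(size=(n_states, n_actions))
logits_new = logits_old + 0.1 * rng.normal(size=(n_states, n_actions))
p_old, p_new = softmax(logits_old), softmax(logits_new)

kl = kl_per_state(p_old, p_new)
max_kl = kl.max()     # the quantity bounded by the theory: one constraint per state
mean_kl = kl.mean()   # the heuristic: a single constraint on the average

delta = 0.01          # hypothetical trust region size
print(max_kl, mean_kl, mean_kl <= delta)
```

The average is never larger than the maximum, so the average-KL constraint is the weaker of the two, but it reduces the problem to a single constraint that can be estimated from sampled states.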

## Sample-Based Estimation of the Objective and Constraint

The previous section proposed a constrained optimization
problem on the policy parameters, which
optimizes an estimate of the expected total reward subject
to a constraint on the change in the policy at each update.
This section describes how the objective and constraint
functions can be approximated using Monte Carlo
simulation.

We are trying to solve:

<img src='pngs/TRPO_7.png'>

1. We first replace the sum over states in $L_{\theta_{old}} (\theta)$ by the expectation $\frac{1}{1 - \gamma} E_{s\sim \rho_{\theta_{old}}}[\sum_a \pi_{\theta}(a | s)A_{\theta_{old}} (s, a)]$.
2. Then we replace $A_{\theta_{old}} (s, a)$ with $Q_{\theta_{old}} (s, a)$, which only changes the objective by a constant.
3. Lastly, we replace the sum over the actions by an importance sampling estimator. Using $q$ to denote the sampling distribution, the contribution of a single state $s_n$ to the loss function is $E_{a \sim q} [\frac{\pi_{\theta} (a | s_n)}{q(a | s_n)} Q_{\theta_{old}} (s_n, a)]$.

<img src='pngs/TRPO_8.png'>
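The importance sampling step for a single state can be illustrated with a small example. The probabilities and Q-values below are made-up numbers; the point is only that sampling actions from $q$ and reweighting by $\pi_\theta / q$ recovers the exact sum over actions in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4

# Hypothetical quantities for one state s_n.
pi_theta = np.array([0.1, 0.2, 0.3, 0.4])   # pi_theta(a | s_n)
q = np.full(n_actions, 0.25)                # sampling distribution q(a | s_n)
Q_values = np.array([1.0, -0.5, 2.0, 0.3])  # Q_{theta_old}(s_n, a)

# Exact contribution: sum_a pi_theta(a | s_n) Q(s_n, a)
exact = pi_theta @ Q_values

# Importance sampling estimate: actions drawn from q, reweighted by pi_theta / q.
actions = rng.choice(n_actions, size=200_000, p=q)
weights = pi_theta[actions] / q[actions]
estimate = np.mean(weights * Q_values[actions])

print(exact, estimate)
```

With enough samples the estimate concentrates around the exact value; its variance depends on how far $q$ is from $\pi_\theta$.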

All that remains is to replace the expectations by sample
averages and replace the Q value by an empirical estimate.
The following sections describe two different schemes for
performing this estimation.

### Single Path

In this estimation procedure, we collect a sequence of states by sampling $s_0 \sim \rho_0$ and then simulating the policy $\pi_{\theta_{old}}$ for some number of timesteps to generate a trajectory $s_0, a_0, s_1, a_1, \dots, s_T$. Hence, $q(a | s) = \pi_{\theta_{old}} (a | s)$. $Q_{\theta_{old}} (s, a)$ is computed at each state-action pair $(s_t, a_t)$ by taking the discounted sum of future rewards along the trajectory.
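The discounted sum of future rewards along one trajectory can be computed with a single backward pass. This is a minimal sketch; the function name `discounted_returns` and the toy reward sequence are illustrative.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """returns[t] = sum_{l >= 0} gamma^l * rewards[t + l], computed backwards in time."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Toy trajectory with three rewards.
q_hat = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
print(q_hat)
```

Each entry `q_hat[t]` serves as the Monte Carlo estimate of $Q_{\theta_{old}} (s_t, a_t)$ for the state-action pair visited at time $t$.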

### Vine

In this estimation procedure, we first sample $s_0 \sim \rho_0$ and simulate the policy $\pi_{\theta_i}$ to generate a number of trajectories. We then choose a subset of $N$ states along these trajectories, denoted $s_1, s_2, \dots, s_N$, which we call the **rollout set**. For each state $s_n$ in the rollout set, we sample $K$ actions according to $a_{n, k} \sim q(\cdot | s_n)$. Any choice of $q(\cdot | s_n)$ with a support that includes the support of $\pi_{\theta_i} (\cdot | s_n)$ will produce a consistent estimator. In practice, $q(\cdot | s_n) = \pi_{\theta_i} (\cdot | s_n)$ works well on continuous problems, while the uniform distribution works well on discrete tasks, where it can sometimes achieve better exploration.

For each action $a_{n, k}$ sampled at each state $s_n$ from the rollout set, we estimate $\hat{Q}_{\theta_i} (s_n, a_{n, k})$ by performing a rollout starting with state $s_n$ and action $a_{n, k}$.

In small, finite action spaces, we can generate a rollout for every possible action from a given state. The contribution to $L_{\theta_{old}}$ from a single state $s_n$ is:

$L_n (\theta) = \sum_{k=1}^{K} \pi_{\theta} (a_k | s_n) \hat{Q} (s_n, a_k)$ (every action is enumerated, which matches the importance-weighted expectation under a uniform $q$)

where the action space is $A = \{a_1, \dots, a_K\}$. In large
or continuous state spaces, we can construct an estimator
of the surrogate objective using importance sampling (here, weighted importance sampling):

<img src='pngs/TRPO_9.png'>

This self-normalized estimator removes the need to use a baseline for the Q-values. Averaging over $s_n \sim \rho_{\pi_{\theta_i}}$, we obtain an estimator for $L_{\theta_{old}}$ as well as its gradient.
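A self-normalized (weighted) importance sampling estimate for one rollout state can be sketched as follows. The form used here, a weighted average of the $\hat{Q}$ values with weights $\pi_\theta / q$, is the standard weighted importance sampling estimator and is assumed to match the figure; the function name and all numbers are hypothetical.

```python
import numpy as np

def vine_state_objective(pi_new_probs, q_probs, q_hat):
    """Self-normalized importance sampling estimate for a single rollout state.

    pi_new_probs[k] = pi_theta(a_k | s_n), q_probs[k] = q(a_k | s_n),
    q_hat[k] = rollout estimate of Q(s_n, a_k) for the K sampled actions.
    """
    w = pi_new_probs / q_probs
    return np.sum(w * q_hat) / np.sum(w)

# Hypothetical values for K = 4 sampled actions at one state.
pi_new_probs = np.array([0.30, 0.10, 0.25, 0.05])
q_probs = np.array([0.20, 0.20, 0.30, 0.10])
q_hat = np.array([1.0, -2.0, 0.5, 3.0])

base = vine_state_objective(pi_new_probs, q_probs, q_hat)

# Adding a constant baseline b to every Q estimate shifts the result by exactly b,
# so it does not affect the gradient with respect to theta.
b = 10.0
shifted = vine_state_objective(pi_new_probs, q_probs, q_hat + b)
print(base, shifted - b)
```

The shift invariance shown at the end is why no separate baseline is needed: the normalization by the sum of weights absorbs any constant offset in the Q-value estimates.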

The benefit of the vine method over the single path method
is that our local estimate of the objective has much lower
variance given the same number of Q-value samples in the
surrogate objective. That is, the vine method gives much
better estimates of the advantage values. The downside of
the vine method is that we must perform far more calls to
the simulator for each of these advantage estimates. Furthermore,
the vine method requires us to generate multiple
trajectories from each state in the rollout set, which limits
this algorithm to settings where the system can be reset to
an arbitrary state. In contrast, the single path algorithm requires
no state resets and can be directly implemented on a
physical system.

### Practical Algorithm
1. Use the single path or vine procedure to collect a set of state-action pairs along with Monte Carlo estimates of their Q-values.
2. By averaging over samples, construct the estimated objective and constraint.
3. Approximately solve this constrained optimization problem to update the policy's parameter vector $\theta$.

<img src='pngs/TRPO_10.png'>