# Proximal Policy Optimization Algorithms

## Introduction

Existing approaches each have drawbacks: Q-learning (with function approximation) fails on
many simple problems and is poorly understood; vanilla policy gradient methods have poor data
efficiency and robustness; and TRPO is relatively complicated and is not compatible with architectures that include noise or parameter sharing.

This paper seeks to improve the current state of affairs by introducing an algorithm that attains
the data efficiency and reliable performance of TRPO, while using only first-order optimization. We propose a novel objective with clipped probability ratios, which forms a pessimistic estimate
(i.e., lower bound) of the performance of the policy. To optimize policies, we alternate between
sampling data from the policy and performing several epochs of optimization on the sampled data.

## Background

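Policy gradient methods commonly obtain their gradient estimator by differentiating the objective:

$L^{PG} (\theta) = \hat{E_t} [\log \pi_{\theta} (a_t | s_t) \hat{A_t}]$

where $\pi_{\theta}$ is a stochastic policy and $\hat{A_t}$ is an estimator of the advantage function at timestep $t$.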

While it is appealing to perform multiple steps of optimization on this loss using the same
trajectory, doing so is not well-justified, and empirically it often leads to destructively large policy
updates.

In TRPO, an objective (the surrogate objective) is maximized subject to a constraint on the size of the policy update
(the constant term $\eta (\pi)$ is dropped because it does not contribute to the gradient):

<img src='pngs/PPO_1.png'>

Here, $\theta_{old}$ is the vector of policy parameters before the update. This problem can efficiently be
approximately solved using the conjugate gradient algorithm, after making a linear approximation
to the objective and a quadratic approximation to the constraint.

The theory justifying TRPO actually suggests using a penalty instead of a constraint:

<img src='pngs/PPO_2.png'>

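for some coefficient $\beta$ (i.e., the $C$ in the TRPO paper). This follows from the fact that a certain surrogate objective, which computes the max KL over states instead of the mean, forms a lower bound on the performance of the policy $\pi$. TRPO uses a hard constraint rather than a penalty because it is hard to choose a single value of $\beta$ that works well, and the penalized objective with the theoretically suggested coefficient tends to produce very small steps. Simply fixing $\beta$ and optimizing the penalized objective is therefore not enough; some modifications are required.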

## Clipped Surrogate Objective

Let $r_t (\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}$ denote the probability ratio, so $r_t(\theta_{old}) = \frac{\pi_{\theta_{old}}(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)} = 1$. TRPO maximizes a surrogate objective:

$L^{CPI} (\theta) =  \hat{E_t} [r_t (\theta) \hat{A_t}]$

The superscript CPI refers to conservative policy iteration, where this objective was proposed. Without a constraint, maximizing $L^{CPI}$ leads to an excessively large policy update (i.e., the change in distribution is not penalized by the KL divergence). Hence we need to modify the objective to penalize changes to the policy that move $r_{t} (\theta)$ away from 1.

The main objective is:

$L^{CLIP} (\theta) = \hat{E_t} [\min(r_t (\theta) \hat{A_t}, \text{clip}(r_t (\theta), 1 - \epsilon, 1 + \epsilon)\hat{A_t})]$

where $\epsilon$ is a hyperparameter. The motivation for this objective is as follows. The first term inside the min is $L^{CPI}$. The second term, $\text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A_t}$, modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving $r_t$ outside the interval $[1 - \epsilon, 1 + \epsilon]$. Finally, we take the minimum of the clipped and unclipped objectives, so the final objective is a lower bound on the unclipped objective. With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse.

1. If the advantage is positive, the term becomes $\min(r_t(\theta), 1 + \epsilon) \hat{A_t}$: increasing $r_t$ beyond $1 + \epsilon$ gives no additional gain, which limits the step size.
2. If the advantage is negative, the term becomes $\max(r_t(\theta), 1 - \epsilon) \hat{A_t}$: it is not clipped when $r_t(\theta)$ is large, because a large ratio makes the objective worse and is therefore already discouraged by the algorithm.

This clipping discourages updates that would produce a large KL divergence between the old and new policies.
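The two cases above can be checked numerically. The following is a minimal NumPy sketch of the clipped objective, not a reference implementation: the function names and the default `epsilon=0.2` are assumptions chosen for illustration, and the ratios and advantages are passed in as plain arrays rather than computed from a policy.

```python
import numpy as np

def clipped_surrogate_terms(ratio, advantage, epsilon=0.2):
    """Per-sample values of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    `ratio` and `advantage` are hypothetical inputs: arrays of probability
    ratios r_t and advantage estimates A_t for the same timesteps.
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Taking the elementwise minimum makes the result a lower bound
    # on the unclipped term.
    return np.minimum(unclipped, clipped)

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Empirical mean of the per-sample terms (the quantity to maximize)."""
    return clipped_surrogate_terms(ratio, advantage, epsilon).mean()

# Four illustrative samples: large/small ratio with positive/negative advantage.
example_ratio = [1.5, 0.5, 1.5, 0.5]
example_advantage = [1.0, 1.0, -1.0, -1.0]
example_terms = clipped_surrogate_terms(example_ratio, example_advantage)
```

With these inputs, only the first sample (positive advantage, ratio above $1 + \epsilon$) and the last sample (negative advantage, ratio below $1 - \epsilon$) are clipped; the other two keep their unclipped values because clipping would have made the objective look better than it is.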

## Adaptive KL Penalty Coefficient

Another approach, which can be used as an alternative to the clipped surrogate objective, or in addition to it, is to use a penalty on KL divergence, and to adapt the penalty coefficient so that we achieve some target value of the KL divergence $d_{targ}$ each policy update. In practice, this KL penalty performed worse than the clipped surrogate objective.

<img src='pngs/PPO_3.png'>
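The coefficient adaptation can be sketched as a small function. This is an illustrative sketch only: the function name is hypothetical, and the threshold factor 1.5 and scaling factor 2 are the heuristic values reported in the PPO paper, exposed here as parameters and best treated as assumptions rather than tuned constants.

```python
def update_kl_coefficient(beta, kl, kl_target, threshold=1.5, scale=2.0):
    """Return the penalty coefficient to use for the next policy update.

    beta:      current penalty coefficient
    kl:        measured mean KL divergence d after the update
    kl_target: target KL divergence d_targ
    threshold, scale: heuristic constants (assumed defaults, see lead-in)
    """
    if kl < kl_target / threshold:
        # The policy moved less than intended: relax the penalty.
        return beta / scale
    if kl > kl_target * threshold:
        # The policy moved more than intended: strengthen the penalty.
        return beta * scale
    # KL is close enough to the target: keep the coefficient unchanged.
    return beta
```

The updated coefficient is used for the next policy update, so an occasional update with a KL far from the target is corrected on later iterations.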

## Algorithm

If using a neural network architecture that shares parameters
between the policy and value function, we must use a loss function that combines the policy
surrogate and a value function error term. This objective can further be augmented by adding
an entropy bonus to ensure sufficient exploration. Combining these terms, we obtain the following objective, which is (approximately) maximized
each iteration:

<img src='pngs/PPO_4.png'>

where $c_1, c_2$ are coefficients, $S$ denotes an entropy bonus, and $L_t^{VF}$ is a squared-error loss $(V_{\theta} (s_t) - V_t^{targ})^2$.

One style of policy gradient implementation, well-suited for use with recurrent neural networks, runs the policy for $T$ timesteps (where $T$ is much less than the episode length) and uses the collected samples for an update. This style requires an advantage estimator that does not look beyond timestep $T$ (a forward n-step TD estimate):

<img src='pngs/PPO_5.png'>

Generalizing this choice, we can use a truncated version of generalized advantage estimation, which reduces to the estimator above when $\lambda = 1$:

<img src='pngs/PPO_6.png'>
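A truncated generalized advantage estimator can be computed with a single backward pass over a length-$T$ segment. The sketch below assumes the usual definition $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ and $\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \dots$; the function name, the defaults `gamma=0.99` and `lam=0.95`, and the convention that `values` carries one extra bootstrap entry $V(s_T)$ are assumptions made for illustration. It does not handle episode boundaries inside the segment.

```python
import numpy as np

def truncated_gae(rewards, values, gamma=0.99, lam=0.95):
    """Truncated generalized advantage estimates for one segment.

    rewards: length-T array of rewards r_0 .. r_{T-1}
    values:  length-(T+1) array of value estimates V(s_0) .. V(s_T);
             the last entry is the bootstrap value at the truncation point.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    horizon = len(rewards)
    # One-step TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(horizon)
    running = 0.0
    # Accumulate backwards: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(horizon)):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=1.0` recovers the n-step estimator that bootstraps from $V(s_T)$, while `lam=0.0` reduces to the one-step TD residuals.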

<img src='pngs/PPO_7.png'>
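To make the combined objective from this section concrete, the sketch below evaluates the clipped surrogate minus $c_1$ times the squared value error plus $c_2$ times the entropy bonus on a batch of samples. It is a minimal NumPy illustration under stated assumptions: the function name, the argument layout, and the defaults `epsilon=0.2`, `c1=0.5`, `c2=0.01` are hypothetical choices, and a real implementation would compute these quantities with an automatic differentiation library so the objective can be maximized by gradient ascent.

```python
import numpy as np

def ppo_objective(ratio, advantage, value_pred, value_target, entropy,
                  epsilon=0.2, c1=0.5, c2=0.01):
    """Batch mean of L^CLIP - c1 * L^VF + c2 * S (to be maximized).

    All array arguments are hypothetical per-timestep inputs of equal length.
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    value_pred = np.asarray(value_pred, dtype=float)
    value_target = np.asarray(value_target, dtype=float)
    entropy = np.asarray(entropy, dtype=float)

    # Clipped policy surrogate, per sample.
    clip_term = np.minimum(
        ratio * advantage,
        np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage,
    )
    # Squared-error value loss, per sample.
    value_term = (value_pred - value_target) ** 2
    # The value loss is subtracted and the entropy bonus is added.
    return np.mean(clip_term - c1 * value_term + c2 * entropy)
```

Because the value loss enters with a negative sign, maximizing this quantity simultaneously improves the policy surrogate, reduces value prediction error, and rewards higher policy entropy.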