<div align='center'>
    <h1> 
        <a href='https://arxiv.org/abs/1707.06347'> Proximal Policy Optimization (PPO) </a> 
    </h1>
</div>

# Intro

Proximal Policy Optimization (PPO) is an `Actor-Critic` (AC), `online` (learns from trajectories collected during run time), and `on-policy` algorithm (uses only trajectories collected with the latest policy). 

- PPO has two neural networks:
    - The Actor Network $\pi_{\theta} (s)$.
    - The Critic Network $V_{\phi} (s)$.

$\ $
- PPO has two objective loss functions: one for the Actor network (`policy surrogate objective loss function`) $L_\text{actor}^{\text{CLIP}}$ and another for the Critic network (`value function error term`) $L_{critic}^{\text{VF}}$.

- Unlike VPG and A2C/A3C, PPO maximizes the policy performance objective $J(\pi_{\theta})$ only indirectly: instead of ascending $J(\pi_{\theta})$ directly, it maximizes the surrogate objective $L_\text{actor}^{\text{CLIP}}$ using `stochastic gradient ascent` to optimize a `stochastic policy` $a_t \sim \pi_{\theta}(a_t | s_t)$.

- Unlike second-order methods (e.g., TRPO), it does not compute the Hessian matrix (second-order derivatives). PPO is a family of first-order methods, meaning that only first-order derivatives (gradients) are considered for optimization. PPO is designed to keep the updated policy close to the old policy.

- PPO is suitable for both `continuous and discrete action spaces`.

PPO has two main variants:

1) PPO-Penalty: adds the KL-divergence between the new and the old policy to the objective function as a penalty term, instead of enforcing it as a hard constraint. The penalty coefficient is automatically adjusted during training.

2) PPO-Clip: improves upon the TRPO algorithm. In TRPO, the KL-divergence constraint, which prevents the updated (new) policy $\pi_{\theta}$ from moving far from the old policy $\pi_{\theta_{\text{old}}}$, introduces a computational overhead. PPO-Clip circumvents this overhead by dropping the KL-divergence term altogether and relying on a clipped surrogate objective.
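The automatic adjustment of the penalty coefficient in PPO-Penalty can be sketched as follows. The rule (halve or double the coefficient when the measured KL-divergence leaves a band around a target value) is the heuristic reported in [1]; the function and argument names are illustrative assumptions, not part of any library.

```python
def update_kl_penalty(beta, kl, kl_target):
    """Adapt the KL penalty coefficient after a policy update (sketch).

    beta      -- current penalty coefficient
    kl        -- measured mean KL-divergence between old and new policy
    kl_target -- desired KL-divergence per update
    """
    if kl < kl_target / 1.5:
        # The policy moved less than desired: relax the penalty.
        return beta / 2.0
    if kl > kl_target * 1.5:
        # The policy moved more than desired: strengthen the penalty.
        return beta * 2.0
    # Within the tolerated band: keep the coefficient unchanged.
    return beta
```

The updated coefficient is then used for the next policy update.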

## Is PPO a Policy Gradient Algorithm?

["PPO is often referred to as a policy gradient algorithm, though this is slightly inaccurate." —Open AI.](https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html)

["Open AI implementation of PPO makes use of Generalized Advantage Estimation for computing the policy gradient." —Open AI.](https://spinningup.openai.com/en/latest/algorithms/ppo.html)

PPO does use policy gradients in its optimization process. However, it introduces additional mechanisms that set it apart from "pure" policy gradient methods:

1) Clipping: PPO uses a surrogate objective loss function with a clipped probability ratio, which serves as a soft trust region constraint. This helps to limit the size of policy updates, improving stability. This addresses a key issue with standard policy gradient methods, where large policy updates can destabilize the learning process.

2) Multiple epochs of minibatch updates: PPO performs several epochs of minibatch gradient updates on the same batch of collected data, which improves sample efficiency. This differs from standard policy gradient methods, which perform one gradient update per data sample: "Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates." [1].

It is more precise to describe PPO as an advanced policy optimization algorithm that builds upon and extends policy gradient techniques.

# The RL Goal in PPO

The goal of reinforcement learning is to maximize the expected cumulative reward (a.k.a expected return) $J(\pi)$ under a policy $\pi$:

$$\text{max } J(\pi) = \text{max } \mathbb{E}_{\tau \sim \pi}[\mathcal{R}(\tau) | \pi].$$

In PPO, the RL goal is translated to learning a parameterized stochastic policy $\pi_{\theta}(s)$, represented by the Actor network, whose parameters $\theta$ maximize the clipped surrogate objective loss function $L_\text{actor}^{\text{CLIP}}$:

$$
\underset{\theta}{\operatorname{max}} \ L_\text{actor}^{\text{CLIP}}(s, a, \theta_{old}, \theta)
= \underset{\theta}{\operatorname{max}} \
\hat{\mathbb{E}}_{s, a \sim \pi_{\theta_{old}}} \Bigg[\text{min}\Bigg(r(\theta) A^{\pi_{\theta_{old}}}(s,a), \text{clip}(r(\theta), 1-\epsilon, 1+\epsilon)A^{\pi_{\theta_{old}}}(s,a) \Bigg) \Bigg]
$$

In PPO, the parameters of the `Actor Network` are updated by `stochastic gradient ascent` on the clipped surrogate objective loss:

\begin{eqnarray}
\theta_{k+1} &=& \theta_k + \alpha \nabla_{\theta} L_\text{actor}^{\text{CLIP}} \\
&=& \theta_k + \alpha  \hat{\mathbb{E}}_{s, a \sim \pi_{\theta_{old}}} \Bigg[\nabla_{\theta} \ \text{min}\Bigg(r(\theta) A^{\pi_{\theta_{old}}}(s,a), \text{clip}(r(\theta), 1-\epsilon, 1+\epsilon)A^{\pi_{\theta_{old}}}(s,a) \Bigg) \Bigg]
\end{eqnarray}

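The gradient of the min function splits into two cases, depending on whether the unclipped term $r(\theta) A^{\pi_{\theta_{old}}}(s,a)$ or the clipped term is smaller. When the clipped term is smaller and $r(\theta)$ lies outside $[1-\epsilon, 1+\epsilon]$, the term is constant in $\theta$ and its gradient is zero.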
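The two cases can be checked numerically for a single sample. The sketch below treats the ratio $r$ as a plain number and differentiates the per-sample term with respect to $r$; by the chain rule, the gradient with respect to $\theta$ is this value times $\nabla_{\theta} r(\theta)$. The function names and the default $\epsilon = 0.2$ are illustrative assumptions.

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(x, hi))


def surrogate(r, adv, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    return min(r * adv, clip(r, 1.0 - eps, 1.0 + eps) * adv)


def surrogate_grad_wrt_ratio(r, adv, eps=0.2):
    """Derivative of the per-sample surrogate w.r.t. r (away from the kinks)."""
    clipped_active = (adv > 0 and r > 1.0 + eps) or (adv < 0 and r < 1.0 - eps)
    # Clipped branch is constant in r, so it contributes no gradient.
    return 0.0 if clipped_active else adv


def finite_difference(r, adv, eps=0.2, h=1e-6):
    """Central finite-difference estimate of the same derivative."""
    return (surrogate(r + h, adv, eps) - surrogate(r - h, adv, eps)) / (2.0 * h)
```

Comparing `surrogate_grad_wrt_ratio` with `finite_difference` at points away from $r = 1 \pm \epsilon$ is a quick way to convince oneself that the gradient vanishes exactly when the clip is active.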

Repeating the gradient ascent update above for several epochs on the same batch approximately solves:

\begin{eqnarray}
\theta_{k+1} = \underset{\theta}{\operatorname{arg\,max}} \ L_\text{actor}^{\text{CLIP}}.
\end{eqnarray}

In practice, the expectation is estimated by the empirical average (sample mean) over the collected trajectories:

$$
L_\text{actor}^{\text{CLIP}} = \frac{1}{|\mathcal{D}_k|T}\sum_{\tau \in \mathcal{D}_k} \sum_{t=0}^T \text{min } \Bigg(\frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_k}(a_t | s_t)}A^{\pi_{\theta_k}} (s_t, a_t), g(\epsilon, A^{\pi_{\theta_k}} (s_t, a_t)) \Bigg).
$$

The parameters of the `Critic Network`, in turn, are updated by `gradient descent` on the mean-squared error loss:

$$(V_{\phi_k}(s_t) - \hat{\mathcal{R}}_t)^2.$$
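As a minimal sketch of this regression step, the code below fits a critic by plain gradient descent on the mean-squared error. To stay self-contained it assumes a linear critic $V(s) = w x + b$ on a scalar state feature $x$ rather than a neural network; the function names, learning rate, and step count are illustrative assumptions.

```python
def mse_and_grads(w, b, xs, returns):
    """Mean-squared error between V(s) = w*x + b and the return targets, with gradients."""
    n = len(xs)
    errors = [w * x + b - g for x, g in zip(xs, returns)]
    loss = sum(e * e for e in errors) / n
    grad_w = sum(2.0 * e * x for e, x in zip(errors, xs)) / n
    grad_b = sum(2.0 * e for e in errors) / n
    return loss, grad_w, grad_b


def fit_critic(xs, returns, lr=0.1, steps=200):
    """Run `steps` gradient descent updates starting from w = b = 0."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        _, grad_w, grad_b = mse_and_grads(w, b, xs, returns)
        w -= lr * grad_w  # descent: move against the gradient
        b -= lr * grad_b
    return w, b
```

For example, `fit_critic([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])` drives the loss towards zero because the targets lie exactly on a line.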

# Objective Loss Functions in PPO

## Actor loss

The Actor network loss function is the clipped surrogate objective, defined as the minimum of two terms: the unclipped objective (no clipping and no penalty), which is the default in policy gradient methods, and its clipped version:

$$L_\text{actor}^{\text{CLIP}}(s, a, \theta_{old}, \theta) \equiv \hat{\mathbb{E}}_{s, a \sim \pi_{\theta_{old}}} \Bigg[\text{min}\Bigg(r(\theta) A^{\pi_{\theta_{old}}}(s,a), \text{clip}(r(\theta), 1-\epsilon, 1+\epsilon)A^{\pi_{\theta_{old}}}(s,a) \Bigg) \Bigg].$$

Where $$r(\theta) \equiv \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)},$$ 

is the probability ratio of the new policy $\pi_{\theta}$ to the old policy $\pi_{\theta_{old}}$ at timestep $t$. If $r(\theta) > 1$, the action $a_t$ is more probable under the new policy than under the old one; if $0 < r(\theta) < 1$, it is less probable.
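In code, the ratio is commonly computed from log-probabilities, which avoids dividing two very small numbers. A minimal sketch (the function name is an illustrative assumption):

```python
import math


def probability_ratio(logp_new, logp_old):
    """r(theta) = pi_new(a|s) / pi_old(a|s), computed as exp(log pi_new - log pi_old)."""
    return math.exp(logp_new - logp_old)
```

When the new and old policies assign the same probability to the action, the ratio is exactly 1, which is the case at the start of every update.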

The clip function constrains the value of the probability ratio $r(\theta)$ to the range $1-\epsilon$ to $1+\epsilon$, where $\epsilon$ is a hyperparameter. With $\epsilon = 0.2$, the clipped ratio lies in $[0.8, 1.2]$. This removes the incentive for large policy updates, keeping the new policy close to the old one.
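As a worked example, take $\epsilon = 0.2$ and evaluate the term inside the expectation, $\text{min}(r A, \text{clip}(r, 0.8, 1.2) A)$, for a few values of $r$ and $A$ (chosen here for arithmetic convenience, not taken from any experiment):

- $A = +2$, $r = 1.5$: $\text{min}(3.0, 2.4) = 2.4$. The clipped term is active, so pushing $r$ higher brings no further gain.
- $A = +2$, $r = 0.5$: $\text{min}(1.0, 1.6) = 1.0$. The unclipped term is active, and the gradient still pushes $r$ up.
- $A = -2$, $r = 0.5$: $\text{min}(-1.0, -1.6) = -1.6$. The clipped term is active, so pushing $r$ lower brings no further gain.
- $A = -2$, $r = 1.5$: $\text{min}(-3.0, -2.4) = -3.0$. The unclipped term is active, so a move in the wrong direction is penalized in full.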

The advantage function, denoted by $$A(s_t,a_t) = Q(s_t, a_t) - V(s_t),$$

is the difference between the expected return of taking action $a_t$ in state $s_t$ (Action-Value function) and the baseline estimate. The baseline is the On-Policy State Value function, which in practice is a noisy estimate of the expected return. If the advantage is positive, the gradient increases the probability of the selected action; if it is negative, the gradient decreases that probability and the action is discouraged.

In practice, the advantage is estimated with a truncated form of Generalized Advantage Estimation (GAE):

$$A(s_t,a_t) = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\delta_{T-1}.$$

Legend: 

- $\delta_t = r_t + \gamma V(s_{t+1})-V(s_t)$.
- $\gamma$ is the discount factor (~0.99).
- $\lambda$ is the GAE parameter that trades off bias against variance in the advantage estimate (~0.95).
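The sum above is usually evaluated backwards in time with the recursion $A_t = \delta_t + \gamma\lambda A_{t+1}$. The sketch below assumes a single trajectory with rewards $r_0, \dots, r_{T-1}$ and value estimates $V(s_0), \dots, V(s_T)$ (one extra value for bootstrapping); the function name and argument layout are illustrative assumptions.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory (sketch).

    rewards -- [r_0, ..., r_{T-1}]
    values  -- [V(s_0), ..., V(s_T)], i.e. len(values) == len(rewards) + 1
    """
    horizon = len(rewards)
    advantages = [0.0] * horizon
    next_advantage = 0.0  # A_T is taken to be zero
    for t in reversed(range(horizon)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Backward recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
    return advantages
```

With $\lambda = 0$ the estimate reduces to the one-step TD residual $\delta_t$, and with $\lambda = 1$ it becomes the discounted sum of all remaining residuals.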

A simplified version of the PPO-Clip objective (Actor loss) is [3]:

\begin{eqnarray}
L_\text{actor}^{\text{CLIP}}(s, a, \theta_{old}, \theta) =  \hat{\mathbb{E}}_{s, a \sim \pi_{\theta_{old}}} \Bigg[\text{min}\Bigg(r(\theta)A^{\pi_{\theta_{old}}}(s,a), g(\epsilon, A^{\pi_{\theta_{old}}}(s,a)) \Bigg)\Bigg],
\end{eqnarray}

where

\begin{eqnarray}
g(\epsilon, A) = 
\begin{cases}
(1+\epsilon)A, & A \geq 0, \\
(1-\epsilon)A, & A < 0.
\end{cases}
\end{eqnarray}
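That the simplified form agrees with the original clipped form can be verified numerically on a grid of ratios and advantages. The sketch below is only such a check for single samples; the function names and the grid are illustrative assumptions.

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(x, hi))


def clipped_form(r, adv, eps):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A)"""
    return min(r * adv, clip(r, 1.0 - eps, 1.0 + eps) * adv)


def g(eps, adv):
    """(1 + eps) * A if A >= 0, else (1 - eps) * A"""
    return (1.0 + eps) * adv if adv >= 0 else (1.0 - eps) * adv


def simplified_form(r, adv, eps):
    """min(r * A, g(eps, A))"""
    return min(r * adv, g(eps, adv))


def max_abs_difference(eps=0.2):
    """Largest gap between the two forms over a small grid of (r, A) pairs."""
    ratios = [0.1 * i for i in range(1, 31)]  # 0.1, 0.2, ..., 3.0
    advantages = [-2.0, -1.0, 0.0, 1.0, 2.0]
    return max(
        abs(clipped_form(r, a, eps) - simplified_form(r, a, eps))
        for r in ratios
        for a in advantages
    )
```

The intuition is that only one side of the clip can ever be the active minimum: the upper bound when $A \geq 0$ and the lower bound when $A < 0$.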

## Critic loss

The Critic network objective loss function (a.k.a value function error term) is:

$$L_{critic}^{\text{VF}}(\phi) = (V_{\phi}(s_t) - V_t^{targ})^2.$$

## Final loss (shared parameters)

When implementing a neural network that shares parameters between the Policy (Actor network) and the Value function (Critic network), the final objective, which is maximized, combines the Actor objective, the Critic loss (subtracted, since it is to be minimized), and an entropy bonus:

$$L_t^{\text{CLIP+VF+S}}(s, a, \theta_{old}, \theta) =: L^{\text{PPO}}(s, a, \theta_{old}, \theta) = \hat{\mathbb{E}} \Bigg[L_\text{actor}^{\text{CLIP}}(s, a, \theta_{old}, \theta) -c_1 L_{critic}^{VF}(\theta)+c_2 S[\pi_{\theta}](s) \Bigg],$$

where $c_1$ and $c_2$ are scalar coefficients, and $S$ is an entropy term (bonus) used to ensure enough exploration.
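The combination can be sketched for a discrete action distribution as follows. The default values of `c1` and `c2` and the function names are illustrative assumptions, and the actor objective and critic loss are passed in as already-computed scalars.

```python
import math


def entropy(probs):
    """Shannon entropy of a discrete action distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ppo_objective(actor_objective, critic_loss, probs, c1=0.5, c2=0.01):
    """Combined objective to maximize: L_CLIP - c1 * L_VF + c2 * S."""
    return actor_objective - c1 * critic_loss + c2 * entropy(probs)
```

Since most optimizers minimize, an implementation along these lines would typically pass the negative of this value to the optimizer.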

# Algorithm

The following cell presents pseudocode for the PPO-Clip variant of the PPO algorithm with Generalized Advantage Estimation (GAE).

---
**Algorithm (Pseudocode): Proximal Policy Optimization - Clip version with GAE (adapted from [Open AI](https://spinningup.openai.com/en/latest/algorithms/ppo.html#:~:text=PPO%20is%20an%20on%2Dpolicy,discrete%20or%20continuous%20action%20spaces.))**

---

- Input: initial policy parameters $\theta_0$ and initial value function parameters $\phi_0$.

- **for** iteration $k = 0, 1, 2, \dots$ do:

    - **for** parallel actor $= 1, 2, \cdots, N$ do:

        - Collect a trajectory $\tau_i = (s_0, a_0, r_0, \cdots , s_T, a_{T}, r_T)$ by executing the current policy $\pi_k = \pi(\theta_k)$ in the environment for $T$ time steps, and add it to the set of trajectories $\mathcal{D}_k\doteq\{\tau_i\}$.
          
        - **for** each trajectory, compute: 
            - the reward-to-go $\hat{\mathcal{R}}_t = \sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1})$, and
            - the advantage estimate $A^{\text{GAE}}_t$, here with Generalized Advantage Estimation built from `Temporal Difference (TD)` residuals $\delta_t$, using the current on-policy state value function $V_{\phi_k}$ as the baseline to reduce sample variance in the gradient estimate: \begin{eqnarray}A_t^{\text{GAE}}(s_t, a_t) &=& \sum_{i=0}^{T-t-1} (\gamma \lambda)^i \delta_{t+i} = \delta_t + (\gamma \lambda) \delta_{t+1} + \cdots + (\gamma \lambda)^{T-t-1} \delta_{T-1} = \delta_t + (\gamma \lambda) A^{\text{GAE}}_{t+1}, \\ \delta_t &=& r_t + \gamma V_{\phi_k}(s_{t+1}) - V_{\phi_k}(s_t).\end{eqnarray}
        - **end for**.
    - **end for**.

    - Update the Policy by maximizing the PPO-Clip surrogate objective w.r.t. $\theta$, typically via stochastic `gradient ascent` with Adam, for $K$ epochs with minibatch size $M \leq N T$: $$\theta_{k+1} = \underset{\theta}{\operatorname{arg\,max}} \ \frac{1}{|\mathcal{D}_k|T}\sum_{\tau \in \mathcal{D}_k} \sum_{t=0}^T \text{min } \Bigg(\frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_k}(a_t | s_t)}A^{\pi_{\theta_k}} (s_t, a_t), g(\epsilon, A^{\pi_{\theta_k}} (s_t, a_t)) \Bigg).$$

    - Re-fit (learn) the on-policy state value function $V_{\phi}$ by regression on mean-squared error, via some `gradient descent` algorithm, minimizing $(V_{\phi}(s_t) - \hat{\mathcal{R}}_t)^2$ averaged over all trajectories and time steps: $$\phi_{k+1} = \underset{\phi}{\operatorname{arg\,min}} \ \frac{1}{|\mathcal{D}_k|T}\sum_{\tau \in \mathcal{D}_k} \sum_{t=0}^T \left(V_{\phi}(s_t) - \hat{\mathcal{R}}_t \right)^2.$$

    
- **end for**.

---
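The policy-update part of the pseudocode can be illustrated on a deliberately tiny problem. The sketch below is not the algorithm above in full: it assumes a hypothetical two-armed bandit (arm 1 pays 1, arm 0 pays 0), a softmax policy with two logits, the batch mean reward as baseline in place of a critic and GAE, and full-batch updates instead of minibatches. All names and hyperparameter values are illustrative assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def ppo_clip_bandit(iterations=20, batch=64, epochs=4, lr=0.5, eps=0.2, seed=0):
    """PPO-Clip policy updates on a toy two-armed bandit (sketch)."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # logits of the softmax policy
    for _ in range(iterations):
        old_probs = softmax(theta)  # pi_{theta_k}, frozen during the update
        # Collect a batch of actions by sampling from the current policy.
        actions = [0 if rng.random() < old_probs[0] else 1 for _ in range(batch)]
        rewards = [float(a) for a in actions]  # hypothetical reward: arm index
        baseline = sum(rewards) / batch
        advantages = [r - baseline for r in rewards]
        for _ in range(epochs):
            probs = softmax(theta)
            grad = [0.0, 0.0]
            for a, adv in zip(actions, advantages):
                ratio = probs[a] / old_probs[a]
                clipped = (adv > 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps)
                if clipped:
                    continue  # clipped branch is constant: zero gradient
                # d(ratio * adv)/d(theta_i) = adv * ratio * (1[i == a] - probs[i])
                for i in range(2):
                    indicator = 1.0 if i == a else 0.0
                    grad[i] += adv * ratio * (indicator - probs[i]) / batch
            # Gradient ascent step on the clipped surrogate objective.
            theta = [t + lr * g for t, g in zip(theta, grad)]
    return softmax(theta)
```

Calling `ppo_clip_bandit()` returns the final action probabilities; within each iteration the clip stops the update once the ratios of all samples have left $[1-\epsilon, 1+\epsilon]$ in the favourable direction, so the policy moves towards the better arm in bounded steps.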

# Implementation

Following the paper's implementation, the Policy network is an MLP with two hidden layers of 64 units each and tanh activation functions. The coefficient $c_1$ in the final objective function (critic loss + actor loss) is discarded because the policy and the value networks do not share parameters. The entropy term, used to ensure enough exploration, is also omitted.

When the stochastic gradient ascent step is computed with automatic differentiation software, the implementation of PPO without shared parameters between the Actor and Critic networks differentiates the $L_\text{actor}^{\text{CLIP}}(s, a, \theta_{old}, \theta)$ objective instead of the traditional objective $L^{PG} = \mathbb{E}_{\tau \sim \pi_{\theta}} [\log \left( \pi_{\theta} (a_t | s_t) \right) \Phi_t]$ from vanilla policy gradient.
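One way to relate the two objectives is to compare their gradients at the start of an update, where $\theta = \theta_{old}$ and hence $r(\theta) = 1$, so the clip is inactive. The short derivation below is a sketch under the assumption $\Phi_t = A^{\pi_{\theta_{old}}}(s_t, a_t)$, written $A$ for brevity:

$$\nabla_{\theta} \big( r(\theta) A \big) \Big|_{\theta = \theta_{old}} = \frac{\nabla_{\theta} \pi_{\theta}(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)} \Bigg|_{\theta = \theta_{old}} A = \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) \Big|_{\theta = \theta_{old}} A.$$

The first gradient step on $L_\text{actor}^{\text{CLIP}}$ therefore coincides with a vanilla policy gradient step; the two differ only in later epochs, once $r(\theta)$ has moved away from 1.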

# References

[1] [Proximal Policy Optimization Algorithms, Schulman et al. 2017.](https://arxiv.org/abs/1707.06347)

[2] https://spinningup.openai.com/en/latest/algorithms/ppo.html

[3] https://drive.google.com/file/d/1PDzn9RPvaXjJFZkGeapMHbHGiWWW20Ey/view