<div align='center'>
    <h1> 
        <a href='https://arxiv.org/abs/1707.06347'> Proximal Policy Optimization (PPO) </a> 
    </h1>
</div>

# Intro

Proximal Policy Optimization (PPO) is an `Actor-Critic` (AC), `online` (learns from trajectories collected at run time), and `on-policy` (uses only trajectories collected with the latest policy) algorithm. Unlike VPG and A2C/A3C, PPO does not directly optimize the policy performance objective $J(\pi_{\theta})$; instead, it maximizes a surrogate objective function designed to keep the updated policy close to the old policy. Unlike second-order methods (e.g., TRPO), it does not compute the Hessian matrix (second-order derivatives): PPO is a family of first-order methods, meaning that only first-order derivatives (gradients) are used for optimization.

- PPO optimizes a `stochastic policy` and is suitable for both `continuous and discrete action spaces`.
  
PPO has two main variants:

1) PPO-Penalty: adds a KL-divergence penalty term to the objective function. The penalty coefficient is adjusted automatically during training.

2) PPO-Clip: improves upon the TRPO algorithm. In TRPO, the KL-divergence constraint, which prevents the updated (new) policy $\pi_{\theta}$ from moving far from the old policy $\pi_{\theta_{\text{old}}}$, introduces a computational overhead. PPO-Clip circumvents this overhead with a clipped surrogate objective, which eliminates the KL-divergence term.
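
The automatic adjustment of the penalty coefficient in PPO-Penalty can be sketched as below. The rule follows the adaptive scheme described in the PPO paper (halve the coefficient when the measured KL falls below the target divided by 1.5, double it when it exceeds 1.5 times the target). The function name `update_kl_coef`, the target value, and the sample KL values are illustrative assumptions, not part of any particular implementation.

```python
def update_kl_coef(beta: float, kl: float, kl_target: float) -> float:
    """Return the adjusted KL penalty coefficient after one policy update."""
    if kl < kl_target / 1.5:
        return beta / 2.0   # policy moved very little: weaken the penalty
    if kl > kl_target * 1.5:
        return beta * 2.0   # policy moved too far: strengthen the penalty
    return beta             # KL is near the target: keep the coefficient


beta = 1.0
for measured_kl in (0.001, 0.05, 0.012):  # made-up KL values for illustration
    beta = update_kl_coef(beta, measured_kl, kl_target=0.01)
```

The adjusted coefficient is then used for the next policy update, so the penalty strength tracks how far the policy actually moved.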

### Is PPO a Policy Gradient Algorithm?

["PPO is often referred to as a policy gradient algorithm, though this is slightly inaccurate." —Open AI.](https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html)

["Open AI implementation of PPO makes use of Generalized Advantage Estimation for computing the policy gradient." —Open AI.](https://spinningup.openai.com/en/latest/algorithms/ppo.html)

PPO does use policy gradients in its optimization process; however, it introduces additional mechanisms that set it apart from "pure" policy gradient methods:

1) Clipping: PPO uses a surrogate objective loss function with a clipped probability ratio, which serves as a soft trust region constraint. This helps to limit the size of policy updates, improving stability. This addresses a key issue with standard policy gradient methods, where large policy updates can destabilize the learning process.

2) Batch update: PPO performs multiple optimization steps on the same batch of data to improve sample efficiency. Traditional policy gradient methods, by contrast, usually use each batch of data for a single update.

It is more precise to describe PPO as an advanced policy optimization algorithm that builds upon and extends policy gradient techniques.

## Actor loss

For the Actor network, the clipped surrogate objective is defined as the minimum of two terms: the objective without clipping or penalty, which is the default in policy gradient methods, and its clipped version:

$$L_\text{actor}^{\text{CLIP}}(s, a, \theta_k, \theta) \equiv \hat{\mathbb{E}}_t \Bigg[\text{min}\Bigg(r_t(\theta)A^{\pi_{\theta_k}}(s,a), \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)A^{\pi_{\theta_k}}(s,a) \Bigg) \Bigg],$$

where $\epsilon$ is a hyperparameter used to clip (constrain) the value of the probability ratio $$r_t(\theta) = \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_k}(a_t | s_t)},$$ to the range $[1-\epsilon, 1+\epsilon]$.
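
As a minimal sketch of the formula above, the clipped objective can be evaluated on a batch of samples in plain Python. The function name `clipped_surrogate`, the use of log-probabilities as inputs, and the default `eps=0.2` are assumptions made for illustration; a real implementation would operate on tensors and backpropagate through the new log-probabilities.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean PPO-Clip objective over a batch of (state, action) samples."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                # r_t(theta)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)  # clip(r_t, 1-eps, 1+eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# A ratio of 1.5 with a positive advantage is capped at (1 + eps) * A.
capped = clipped_surrogate([math.log(1.5)], [0.0], [1.0])
# A ratio of 0.5 with a positive advantage is not clipped: the min keeps r * A.
uncapped = clipped_surrogate([math.log(0.5)], [0.0], [1.0])
```

Working with the difference of log-probabilities rather than the ratio of probabilities is a common numerical-stability choice; it yields the same ratio.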

The advantage function, denoted by $$A^{\pi_{\theta_k}}(s_t,a_t) = Q^{\pi_{\theta_k}}(s_t, a_t) - V^{\pi_{\theta_k}}(s_t),$$

is the difference between the expected return of taking action $a_t$ in state $s_t$ (the Action-Value function) and a baseline. The baseline is the State Value function, which in practice is only a noisy estimate of the return. If the advantage is positive, the gradient is positive and the probability of the selected action increases; otherwise, the gradient is negative and the action is discouraged.

This advantage is estimated with a truncated form of Generalized Advantage Estimation (GAE):

$$A^{\pi_{\theta_k}}(s_t,a_t) = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\delta_{T-1},$$

where 

- $\delta_t = r_t + \gamma V(s_{t+1})-V(s_t)$.
- $\gamma$ is the discount factor (~0.99).
- $\lambda$ is the GAE parameter that trades bias against variance in the advantage estimate (~0.95).
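
The sum over $\delta$ terms can be computed in one backward pass with the recursion $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$. The sketch below assumes a single trajectory segment stored in Python lists; the function name `compute_gae` and the `last_value` bootstrap argument are illustrative assumptions.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Advantage estimates for one trajectory segment of length T.

    rewards[t] and values[t] cover t = 0..T-1; last_value is V(s_T),
    used for bootstrapping (pass 0.0 if the episode ended at step T).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # backward recursion
        advantages[t] = running
        next_value = values[t]
    return advantages


adv = compute_gae([1.0, 2.0], [0.5, 1.0], last_value=2.0, gamma=0.5, lam=0.5)
```

With `lam=0.0` the estimate reduces to the one-step TD residual $\delta_t$, and with `lam=1.0` it becomes the discounted return minus the value baseline.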

A simplified version of the PPO-Clip objective, i.e., the Actor loss is [1]:

\begin{eqnarray}
L_\text{actor}^{\text{CLIP}}(s, a, \theta_k, \theta) =  \hat{\mathbb{E}}_t \Bigg[\text{min}\Bigg(r_t(\theta)A^{\pi_{\theta_k}}(s,a), g(\epsilon, A^{\pi_{\theta_k}}(s,a)) \Bigg)\Bigg],
\end{eqnarray}

where

\begin{eqnarray}
g(\epsilon, A) = 
\begin{cases}
(1+\epsilon)A, & A \geq 0, \\
(1-\epsilon)A, & A < 0.
\end{cases}
\end{eqnarray}
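
To see that the simplified form agrees with the original clipped objective, both can be evaluated on a grid of ratios and advantages. This is a small numerical check rather than a proof; the helper names and the grid values are assumptions chosen for illustration.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def objective_clip(ratio, adv, eps):
    """Original form: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    return min(ratio * adv, clip(ratio, 1 - eps, 1 + eps) * adv)


def g(eps, adv):
    """Piecewise bound used by the simplified form."""
    return (1 + eps) * adv if adv >= 0 else (1 - eps) * adv


def objective_g(ratio, adv, eps):
    """Simplified form: min(r * A, g(eps, A))."""
    return min(ratio * adv, g(eps, adv))


eps = 0.2
ratios = [0.1 * i for i in range(1, 31)]   # ratios from 0.1 to 3.0
advs = [-2.0, -0.5, 0.0, 0.5, 2.0]
max_gap = max(
    abs(objective_clip(r, a, eps) - objective_g(r, a, eps))
    for r in ratios
    for a in advs
)
```

The gap stays at zero because, for a given sign of the advantage, only one side of the clip range can ever be the smaller term inside the `min`.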

## Critic loss

The Critic network loss function is:

$$L_{critic}^{\text{VF}}(\theta) = (V_{\theta}(s_t) - V_t^{targ})^2.$$

## Final loss

The final objective combines the Actor objective, the Critic loss (subtracted, because the objective is maximized), and an entropy bonus:

$$L_t^{\text{CLIP+VF+S}}(s, a, \theta_k, \theta) =: L_t^{\text{PPO}}(s, a, \theta_k, \theta) = \hat{\mathbb{E}}_t \Bigg[L_\text{actor}^{\text{CLIP}}(s, a, \theta_k, \theta) -c_1 L_{critic}^{VF}(\theta)+c_2 S[\pi_{\theta}](s_t) \Bigg],$$

where $c_1$ and $c_2$ are weighting coefficients and $S$ is an entropy term used to ensure enough exploration.
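
A minimal sketch of how the three terms combine is given below, assuming a discrete (categorical) policy for the entropy term. The function names and the default coefficients `c1=0.5` and `c2=0.01` are assumptions for illustration, not values prescribed by this document.

```python
import math


def categorical_entropy(probs):
    """Entropy S of a discrete action distribution given as probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ppo_objective(actor_obj, critic_loss, entropy, c1=0.5, c2=0.01):
    """Combined objective to maximize: L_CLIP - c1 * L_VF + c2 * S."""
    return actor_obj - c1 * critic_loss + c2 * entropy


uniform_entropy = categorical_entropy([0.25, 0.25, 0.25, 0.25])  # equals log(4)
total = ppo_objective(actor_obj=1.0, critic_loss=2.0, entropy=uniform_entropy)
# Most optimizers minimize, so the quantity passed to them would be -total.
```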

# Algorithm

The following cell presents pseudocode for the PPO-Clip variant of the PPO algorithm.

---
**Algorithm (Pseudocode): Proximal Policy Optimization - Clip version (adapted from [Open AI](https://spinningup.openai.com/en/latest/algorithms/ppo.html))**

---

- Input: initial policy parameters $\theta_0$ and initial value function parameters $\phi_0$.

- for iteration $= 0, 1, 2, \dots$ do:

    - for actor=$1,2,\cdots, N$ do:

        - Collect a set of trajectories $\mathcal{H}_t \doteq \mathcal{D}_k\doteq\{\tau_i\}$, where each trajectory is $\tau_i = (s_0, a_0, r_0, \cdots , s_T, a_{T}, r_T)$, by executing the current policy $\pi_k = \pi_{\theta_k}=\pi_{\theta_{\text{old}}}$ in the environment for $T$ time steps.
        - For each trajectory, compute: 
            - the reward-to-go $\hat{\mathcal{R}}_t = \sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1})$, and
            - the advantage estimates $\hat{A}_1, \dots, \hat{A}_T$ (using any advantage estimation method) based on the current on-policy state value function $V_{\phi_k}$ used as the baseline: $$\hat{A}_t = Q^{\pi_{\theta_k}}(s_t, a_t) - V^{\pi_{\theta_k}}(s_t) \in \mathbb{R}.$$ 
    - end for.
    - Update the Policy by maximizing the PPO-Clip objective, typically via stochastic `gradient ascent` with Adam: $$\theta_{k+1} =  \text{arg max}_{\theta } \hat{\mathbb{E}}_{s,a \sim \pi_{\theta_k}} [L_\text{actor}^{\text{CLIP}}(s, a, \theta_k, \theta)] = \text{arg max}_{\theta}\frac{1}{|\mathcal{D}_k|T}\sum_{\tau \in \mathcal{D}_k} \sum_{t=0}^T \text{min } \Bigg(\frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_k}(a_t | s_t)}A^{\pi_{\theta_k}} (s_t, a_t), g(\epsilon, A^{\pi_{\theta_k}} (s_t, a_t)) \Bigg).$$
    - Fit the value function by regression on mean-squared error, via some `gradient descent` algorithm, minimizing $(V_\phi(s_t) - \hat{\mathcal{R}}_t)^2$ summed over all trajectories and time steps: 
    
$$\phi_{k+1} = \underset{\phi}{\operatorname{arg\,min}} \ \frac{1}{|\mathcal{D}_k|T}\sum_{\tau \in \mathcal{D}_k} \sum_{t=0}^T \left(V_\phi(s_t) - \hat{\mathcal{R}}_t \right)^2.$$ 

    
- end for.

---
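
The reward-to-go used as the regression target in the pseudocode can be computed with one backward pass over a trajectory's rewards. The sketch below is an assumption-laden illustration: the function name `rewards_to_go` is hypothetical, and the optional `gamma` argument is an addition (the pseudocode uses the undiscounted sum, which corresponds to `gamma=1.0`).

```python
def rewards_to_go(rewards, gamma=1.0):
    """Sum of rewards from each step t to the end of the trajectory."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # R_t = r_t + gamma * R_{t+1}
        out[t] = running
    return out


undiscounted = rewards_to_go([1.0, 2.0, 3.0])
discounted = rewards_to_go([1.0, 2.0, 3.0], gamma=0.5)
```

The backward pass avoids recomputing the tail sum for every time step, so the cost is linear in the trajectory length.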

# Implementation

Following the paper's implementation, the Policy network is an MLP with two hidden layers of 64 neurons each and tanh activation functions. The $c_1$ coefficient in the final objective function (critic loss + actor loss) is discarded because the policy and value networks do not share parameters. The entropy term used to ensure enough exploration is also omitted.

Here, we define:

$$L_t^{\text{VF}}(\theta) =: L_{critic} \equiv \text{MSE}\big(\hat{A}_t(s_t,a_t) + \text{CriticValue}_{\text{mem}} - \text{CriticValue}_{\text{net}}\big).$$
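
One way to read this definition is that the advantage estimate plus the Critic value stored in memory acts as the return target, which is compared with the value currently predicted by the network. The sketch below follows that reading; the interpretation of the `mem` and `net` subscripts, the function name `critic_loss`, and the sample numbers are assumptions.

```python
def critic_loss(advantages, values_mem, values_net):
    """Mean squared error of (advantage + stored value) - current prediction."""
    errors = [
        (adv + v_mem) - v_net
        for adv, v_mem, v_net in zip(advantages, values_mem, values_net)
    ]
    return sum(e * e for e in errors) / len(errors)


# Targets are 1.5 and 1.0; predictions are 1.5 and 0.0, so the errors are 0 and 1.
loss = critic_loss([1.0, -1.0], [0.5, 2.0], [1.5, 0.0])
```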

# References

[1] https://spinningup.openai.com/en/latest/algorithms/ppo.html

[2] https://drive.google.com/file/d/1PDzn9RPvaXjJFZkGeapMHbHGiWWW20Ey/view