# PPO

In short: PPO (Proximal Policy Optimization) is a widely used policy gradient method for reinforcement learning. It is a simpler successor to TRPO (Trust Region Policy Optimization), which itself improves on the original policy gradient method. PPO was introduced in 2017 by OpenAI in the paper [Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347).

It is popular because it typically performs much better than the vanilla policy gradient, while still being relatively simple to implement and tune.

In this article, we'll attempt to:
1. Explain the flaws of REINFORCE, which TRPO and PPO attempt to solve.
2. Explain the intuition behind TRPO and PPO.
3. Explain how PPO improves upon TRPO.
4. Explain the PPO loss function.

## Flaws of REINFORCE

We covered REINFORCE in [a previous entry](../PolicyGradient/policygradient.ipynb), so we won't go into too much detail here. However, let's briefly review how it works.

#### REINFORCE Algorithm
1. We initialize the policy parameters $\theta$ randomly.
2. We collect a batch of trajectories $D = \{\tau_1, \tau_2, \ldots, \tau_N\}$ by running the policy in the environment.
3. We compute an estimate of the policy gradient, $\hat{g}$:
    * Recall that the policy gradient is given by:
        $$
        \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \hat{R}_t(\tau) \right]
        $$
    * REINFORCE estimates the policy gradient using Monte Carlo sampling:
        $$
        \hat{g} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) \hat{R}_t(\tau_i)
        $$
4. We then update the policy using gradient ascent:
    $$
    \theta \leftarrow \theta + \alpha \hat{g}
    $$
5. Repeat steps 2-4 until convergence.
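
As a rough illustration of steps 3 and 4, here is a minimal sketch of the estimator in plain Python. It assumes the per-step gradients $\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})$ have already been computed (in practice an autodiff library does this). The function names and the trajectory layout are hypothetical and are not taken from the reference implementations.

```python
def reward_to_go(rewards):
    """Return the sum of rewards from step t to the end, for every t."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out


def reinforce_gradient(trajectories):
    """Monte Carlo estimate of the policy gradient (g_hat).

    Each trajectory is a list of (grad_log_prob, reward) pairs, where
    grad_log_prob is the gradient of log pi(a_t|s_t) as a list of floats.
    """
    dim = len(trajectories[0][0][0])
    g_hat = [0.0] * dim
    for traj in trajectories:
        rtg = reward_to_go([reward for _, reward in traj])
        for (grad_log_prob, _), r_hat in zip(traj, rtg):
            for j in range(dim):
                g_hat[j] += grad_log_prob[j] * r_hat
    # Average over the N trajectories.
    return [g / len(trajectories) for g in g_hat]


def gradient_ascent_step(theta, g_hat, alpha):
    """theta <- theta + alpha * g_hat"""
    return [t + alpha * g for t, g in zip(theta, g_hat)]


# Two made-up trajectories with two steps each and a 2-dimensional theta.
trajectories = [
    [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)],
    [([1.0, 1.0], 0.0), ([2.0, 0.0], 1.0)],
]
g_hat = reinforce_gradient(trajectories)
theta = gradient_ascent_step([0.0, 0.0], g_hat, alpha=0.1)
```

Here the first trajectory contributes $[3, 2]$ and the second contributes $[3, 1]$, so the averaged estimate is $[3.0, 1.5]$.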

Reference implementations:
* [Discrete](../PolicyGradient/policygradient_discrete_solution.ipynb)
* [Continuous](../PolicyGradient/policygradient_continuous_solution.ipynb)

#### Problems with REINFORCE
As you may have observed when implementing REINFORCE on Metadrive, it is very slow to train, and the final policies we ended up with were often quite poor.

This is because REINFORCE has two major flaws:
1. The estimate of the policy gradient has high variance.
2. We can't use old trajectories to improve the policy, so we need to collect new trajectories every time we want to update the policy.



### Problem 1: Noisy Policy Gradient Estimates
The first problem is that the estimate of the policy gradient has high variance (a high-variance estimate is often described as "noisy"). In practice, this means that the estimate computed from a batch of trajectories can differ substantially from the true policy gradient, so we may end up updating the policy in the wrong direction.

Since we compute the policy gradient estimate using Monte Carlo sampling, we can mitigate this by increasing the number of samples we use. However, this comes at the cost of increased computation time.
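
To make the trade-off concrete, the sketch below uses a stand-in for one noisy component of the gradient: a scalar with true mean 1 and standard deviation 5 (both numbers are arbitrary assumptions). It measures how much the Monte Carlo average fluctuates across repeated experiments for two sample sizes.

```python
import random
import statistics


def spread_of_estimates(num_samples, num_repeats=200, seed=0):
    """Standard deviation of a Monte Carlo mean across repeated experiments."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(num_repeats):
        samples = [rng.gauss(1.0, 5.0) for _ in range(num_samples)]
        estimates.append(sum(samples) / num_samples)
    return statistics.pstdev(estimates)


spread_small = spread_of_estimates(num_samples=10)
spread_large = spread_of_estimates(num_samples=1000)
```

The noise of a Monte Carlo average shrinks in proportion to $1/\sqrt{N}$, so using 100 times more samples only buys roughly a 10 times less noisy estimate. This is why simply collecting more trajectories is an expensive fix.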

#### One Solution: Replacing $\hat{R}_t(\tau_i)$ with $A^{\pi_\theta}(s_t, a_t)$
When we [first derived the policy gradient](../PolicyGradient/policygradient.ipynb) it had the form:
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau) \right]
$$
The return $R(\tau)$ is the sum of the rewards in the trajectory $\tau$:
$$
R(\tau) = \sum_{t=0}^T r_t
$$
However, we could replace it with the reward-to-go $\hat{R}_t(\tau)$, which only sums the rewards from time step $t$ onwards:
$$
\hat{R}_t(\tau) = \sum_{t'=t}^T r_{t'}
$$
It turns out that there are other valid replacements for $R(\tau)$ that can reduce the variance of the policy gradient estimate. One such replacement is the advantage function $A^{\pi_\theta}(s_t, a_t)$, which is defined as:
$$
A^{\pi_\theta}(s_t, a_t) = Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t)
$$
where $Q^{\pi_\theta}(s_t, a_t)$ is the state-action value function and $V^{\pi_\theta}(s_t)$ is the state value function. 

Recall that:
* $V^{\pi_\theta}(s_t) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) | s_0=s_t \right]$
    * Called the "Value Function".
    * Measures how good it is to be in state $s_t$, assuming we follow policy $\pi_\theta$.
* $Q^{\pi_\theta}(s_t, a_t) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) | s_0=s_t, a_0=a_t \right]$
    * Called the "Action-Value Function", or "Q-Function".
    * Measures how good it is to take action $a_t$ in state $s_t$, and then follow policy $\pi_\theta$ afterwards.

So, the advantage function measures how much better (or worse) it is to take action $a_t$ in state $s_t$ than to act according to our policy on average. Let's take a look at an example to see why this is useful.

#### Example: Advantage Function
Assume that we are playing in the gridworld below, and that the game ends in exactly one step.

![gridworld](./gridworld_example.png)

If we choose to move left, we get a return of $+2$. If we choose to move right, we get a return of $-2$.

Our policy is:
$$
\pi_{\text{rand}}(a_t|s_t) = \begin{cases}
    0.5 & \text{if } a_t = \text{left} \\
    0.5 & \text{if } a_t = \text{right}
\end{cases}
$$

* What is $A^{\pi_{\text{rand}}}(s_0, \text{left})$?
    * $V^{\pi_{\text{rand}}}(s_0) = 0.5(2) + 0.5(-2) = 0$
    * $Q^{\pi_{\text{rand}}}(s_0, \text{left}) = 2$
    * $A^{\pi_{\text{rand}}}(s_0, \text{left}) = Q^{\pi_{\text{rand}}}(s_0, \text{left}) - V^{\pi_{\text{rand}}}(s_0)$
    * Therefore, $A^{\pi_{\text{rand}}}(s_0, \text{left}) = 2 - 0 = 2$
* What is $A^{\pi_{\text{rand}}}(s_0, \text{right})$?
    * $V^{\pi_{\text{rand}}}(s_0) = 0.5(2) + 0.5(-2) = 0$
    * $Q^{\pi_{\text{rand}}}(s_0, \text{right}) = -2$
    * $A^{\pi_{\text{rand}}}(s_0, \text{right}) = Q^{\pi_{\text{rand}}}(s_0, \text{right}) - V^{\pi_{\text{rand}}}(s_0)$
    * Therefore, $A^{\pi_{\text{rand}}}(s_0, \text{right}) = -2 - 0 = -2$
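
The same calculation can be written as a few lines of Python. This is only a sketch of the one-step example above; the dictionary names are illustrative.

```python
# Policy and returns for the one-step gridworld example.
policy = {"left": 0.5, "right": 0.5}
q_values = {"left": 2.0, "right": -2.0}  # the game ends after one step, so Q is just the return

# V(s_0) is the expected Q-value under the policy.
value = sum(policy[action] * q_values[action] for action in policy)

# A(s_0, a) = Q(s_0, a) - V(s_0)
advantages = {action: q_values[action] - value for action in policy}

# The advantage always averages to zero under the policy that defines it.
expected_advantage = sum(policy[action] * advantages[action] for action in policy)
```

Note that the expected advantage under $\pi_{\text{rand}}$ is zero. This holds in general, since $V^{\pi}$ is the average of $Q^{\pi}$ over the policy's own actions.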

#### Why is the Advantage Function Useful?
The advantage function is useful because it is a lower variance replacement for reward-to-go in the policy gradient algorithm. Let's take a moment to discuss a few reasons why this is the case.
1. The Q-Function already takes into account the variance in the rewards we get from following our policy. This is because the Q-Function is the **expected** return we get from taking action $a_t$ in state $s_t$, and then following our policy afterwards.
    * To illustrate this with an example, imagine that in our environment, there's a low probability chance that we can suffer a large negative reward at any point. If we take a good action, but then suffer a large negative reward, the return we get will be low, even though it was a good action. However, the Q-Function will be high, because it takes into account the fact that we will usually get a high reward for taking that action.
2. By subtracting the value function from the Q-Function, we center the advantage function around zero. This means that the advantage function will be positive when the Q-Function is higher than the value function, and negative when the Q-Function is lower than the value function. This makes sense, because the model should reinforce actions that are better than expected, and discourage actions that are worse than expected. For actions that are as good as expected, the advantage function will be zero, and the policy gradient will not be affected.

#### Calculating Advantage

In practice, we cannot calculate the advantage function directly, because we do not know the true Q-Function or value function. However, we can estimate the advantage function using the following formula, as described by Equation 18 in [Schulman et al. (2015)](https://arxiv.org/abs/1506.02438):
$$
A^{\pi_\theta}(s_t, a_t) \approx \sum_{t'=t}^T \gamma^{t'-t} r_{t'} - V^{\pi_\theta}(s_t)
$$
Here, $\gamma \in [0, 1]$ is the discount factor, which weights rewards received further in the future less heavily, and $V^{\pi_\theta}(s_t)$ is the value function, which we can estimate using a neural network. This is called the **critic** network. The neural network takes in a state $s_t$ as input, and outputs an estimate of the value function $V^{\pi_\theta}(s_t)$.
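
A minimal sketch of this estimate for a single trajectory is shown below. It assumes the critic's predictions are already available as a plain list `values`; the function names are hypothetical.

```python
def discounted_rewards_to_go(rewards, gamma):
    """Compute sum_{t' >= t} gamma^(t' - t) * r_{t'} for every t, in one backward pass."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def estimate_advantages(rewards, values, gamma):
    """Advantage estimate: discounted reward-to-go minus the critic's value prediction."""
    returns = discounted_rewards_to_go(rewards, gamma)
    return [ret - v for ret, v in zip(returns, values)]


rewards = [1.0, 1.0, 1.0]
values = [1.0, 1.0, 1.0]   # made-up critic outputs V(s_t)
advantages = estimate_advantages(rewards, values, gamma=0.5)
```

With these numbers the discounted rewards-to-go are $[1.75, 1.5, 1.0]$, so the estimated advantages are $[0.75, 0.5, 0.0]$.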

### Problem 2: Need to collect new trajectories after each policy update
In RL, we often categorize algorithms as "on-policy" or "off-policy":
* **On-policy**: The algorithm can only train on data that was collected using the current policy.
* **Off-policy**: The algorithm can train on data that was collected using any policy.

REINFORCE is an on-policy algorithm. Whenever we update the model, we can't use our trajectories from before the update, and we need to gather new ones. This is because we compute the policy gradient using the expected value over trajectories sampled from the current policy ($\mathbb{E}_{\tau \sim \pi_{\theta_\text{current}}}$). If we used trajectories from before the update, we would be computing the policy gradient using the expected value over trajectories sampled from the old policy ($\mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}}$), which is not what we want.

This makes REINFORCE sample-inefficient compared to other algorithms, like Deep Q-Learning, which we discussed in the previous notebook.

If our environment is expensive to run, this can be a problem. For example, if we are training a robot to walk, we might need to run the robot in the real world, which is time consuming and costly. With REINFORCE, the only way to get more learning out of each batch of trajectories is to increase the learning rate. This would make the model learn faster, but it would also make the model more unstable.

#### Why does a high learning rate make the model unstable?
Recall the model update rule:
$$
\theta_{\text{new}} = \theta_{\text{old}} + \alpha \nabla_\theta J(\theta)
$$
If we increase the learning rate $\alpha$, then the model parameters $\theta$ will change more with each update. The risk is that we might "overshoot" the optimal parameters, and end up with a worse model than we started with. We can visualize this below:

![high learning rate](./high_learning_rate.png)
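
The overshooting effect is easy to reproduce on a toy objective. The sketch below assumes $J(\theta) = -\theta^2$ (chosen only for illustration, with its maximum at $\theta = 0$) and applies the update rule above with a small and a large learning rate.

```python
def run_gradient_ascent(theta, alpha, steps):
    """Maximize J(theta) = -theta**2, whose gradient is -2 * theta."""
    history = [theta]
    for _ in range(steps):
        grad = -2.0 * theta
        theta = theta + alpha * grad
        history.append(theta)
    return history


stable = run_gradient_ascent(theta=1.0, alpha=0.1, steps=20)    # shrinks towards 0
unstable = run_gradient_ascent(theta=1.0, alpha=1.1, steps=20)  # overshoots and diverges
```

With $\alpha = 0.1$ each update multiplies $\theta$ by $0.8$, so it converges smoothly. With $\alpha = 1.1$ each update multiplies $\theta$ by $-1.2$, so the parameter jumps over the optimum to the other side and ends up further away after every step.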

The problem of over-updating is especially bad in RL, for two reasons:
1. Like all deep neural networks, the model is nonlinear. Even if our new parameters after the update are close to our old ones in parameter space, the behavior of the model can be altered dramatically.
2. Because the entire future trajectory can be affected by a single decision made at the beginning, even a minor change in behavior can result in large changes to the rewards we get.  

TRPO and PPO try to solve the problem of choosing the correct learning rate. They attempt to find a way to increase the "effective learning rate" without making the model unstable, by limiting how much the model's behavior can change between updates.

## TRPO: Trust Region Policy Optimization

*Note: This is a shallow overview of TRPO. Check out the following sources if you want to learn more:*
1. *https://www.depthfirstlearning.com/2018/TRPO*
2. *https://www.andrew.cmu.edu/course/10-703/slides/Lecture_NaturalPolicyGradientsTRPOPPO.pdf*


We start off with TRPO, both because it came first chronologically, and because it has stronger conceptual foundations. TRPO is based on the idea of a trust region. A **trust region** is a region around the current model parameters within which we trust our local approximation of the objective (for example, the one given by the gradient at the current point) to be accurate.

#### Trust Regions
Trust regions are used in more areas than just RL. They are used in many optimization algorithms, and are a way to dynamically adjust the learning rate based on the local properties of the loss function. The idea is that we can use a larger learning rate if we are in a region where the loss function is relatively flat, and a smaller learning rate if we are in a region where the loss function is steep. This lets us learn faster without making the model unstable.

Illustration of the concept of a trust region:

![trust region](./trust_region.png)

In this made up example, the loss function $f(x)$ is steep in some parts, and flat in others. We want to be able to take large steps along the direction of the gradient $\frac{df}{dx}$ when we are in a flat region, and small steps when we are in a steep region. Let's assume that we have a way to compute the trust region bounds, which are shown by the dotted lines. 
* In this simple case, we could use the **curvature** of the function, which is defined by its second derivative, to compute the trust region bounds. This makes sense, because as the second derivative, the curvature represents the rate of change of the derivative.
    * When the curvature is high, our trust region is small, since the gradient is changing quickly.
    * When the curvature is low, our trust region is large, since the gradient is changing slowly.

Using this information, our trust region optimization procedure is as follows:
1. Compute the gradient of the loss function at the current model parameters.
2. Compute the curvature of the loss function at the current model parameters.
3. Use the curvature to compute the trust region bounds.
4. Change the model parameters in the direction of the gradient, but only until we hit the trust region bounds.
5. Repeat until convergence.
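
The five steps above can be sketched on a one-dimensional toy problem. Everything here is an assumption made for illustration: the loss $f(x) = x^4$, the rule `radius = scale / |curvature|` for the trust region bounds, and the constants `lr` and `scale`. Since we are minimizing a loss, the step is taken against the gradient.

```python
def loss(x):
    return x ** 4


def grad(x):
    return 4.0 * x ** 3


def curvature(x):
    return 12.0 * x ** 2


def trust_region_step(x, lr=0.5, scale=2.0):
    g = grad(x)                                  # step 1: gradient
    c = curvature(x)                             # step 2: curvature
    radius = scale / (abs(c) + 1e-8)             # step 3: high curvature -> small region
    proposed = -lr * g                           # descent step suggested by the gradient
    step = max(-radius, min(proposed, radius))   # step 4: stop at the trust region bounds
    return x + step


# With the trust region, the iterate moves steadily towards the minimum at 0.
x_tr = 2.0
for _ in range(50):
    x_tr = trust_region_step(x_tr)

# Plain gradient descent with the same learning rate overshoots immediately.
x_plain = 2.0
for _ in range(3):
    x_plain = x_plain - 0.5 * grad(x_plain)
```

Starting from $x = 2$, the very first plain gradient step lands at $x = -14$, while the trust region version only moves as far as the local curvature allows.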

#### Limiting Change in Behavior

How can we extend the notion of trust regions to RL? Trying to use the local curvature (which generalizes to the [Hessian matrix](https://en.wikipedia.org/wiki/Hessian_matrix) in higher dimensions) of our objective function quickly fails because of the high dimensionality of the problem. The matrix would simply be too large to compute.

Alternatively, we could define our trust region by the distance between the old policy and the new policy. This is the approach that TRPO takes.
What this means is that we want to limit how much the new policy's behavior can differ from the old policy's behavior.
Limiting the change in behavior mitigates the two problems we discussed earlier: small changes in parameters causing large changes in behavior, and small changes in behavior causing large changes in the rewards we get.

#### KL Divergence

But how do we define how much one policy's behavior differs from another? Recall that the output of a policy is a probability distribution over actions.
For discrete action spaces, this is a categorical distribution, and for continuous action spaces, this is typically a multivariate Gaussian distribution.

This means that a natural way to measure the difference between two policies is to measure the difference between their output distributions. This is where the **KL divergence** comes in.
The KL divergence is a measure of how different two probability distributions are.

It is defined as:
$$
D_{KL}(P||Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}
$$
where $P$ and $Q$ are two probability distributions over some set $X$. (For continuous distributions, the sum becomes an integral.)

We won't go into too much depth here since the exact details of KL divergence are not relevant to our understanding, but some of the more important points to note are:
1. KL divergence is a measure of relative entropy. It measures how "surprised" we are if we use distribution $Q$ to model distribution $P$.
2. KL divergence is not symmetric, so in general $D_{KL}(P||Q) \neq D_{KL}(Q||P)$.

If you want to know more, you can read the [Wikipedia page](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence).
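
For discrete distributions, the definition above translates directly into code. The sketch below represents each distribution as a list of probabilities and assumes $Q(x) > 0$ wherever $P(x) > 0$; the example distributions are made up.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions given as lists of probabilities."""
    # Terms with P(x) = 0 contribute nothing to the sum.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


old_policy = [0.5, 0.5]   # e.g. action probabilities in some state
new_policy = [0.9, 0.1]

kl_old_new = kl_divergence(old_policy, new_policy)
kl_new_old = kl_divergence(new_policy, old_policy)
kl_same = kl_divergence(old_policy, old_policy)
```

Here `kl_old_new` is roughly 0.51 while `kl_new_old` is roughly 0.37, which illustrates the asymmetry, and `kl_same` is exactly zero because a distribution does not differ from itself.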


#### TRPO Update Rule

*Note: There are two popular versions of TRPO. The first is the one described in the original paper, [Schulman et al. (2015)](https://arxiv.org/abs/1502.05477). 
The second is the one described in the [OpenAI Spinning Up documentation](https://spinningup.openai.com/en/latest/algorithms/trpo.html). The two versions are similar, but the OpenAI version is a bit simpler, so we will use that one here.*

*Note: The content in this section is based on [OpenAI Spinning Up documentation](https://spinningup.openai.com/en/latest/algorithms/trpo.html).*

Now that we have the background, we can finally look at the TRPO algorithm. The key idea is that we want to maximize the expected reward, but we want to do so while limiting the change in behavior between the old policy and the new policy. TRPO does this with a hard limit on the KL divergence between the old policy and the new policy.

The constraint on the KL divergence is a **hard constraint**, meaning that every update must satisfy it: the new policy is never allowed to exceed the limit $\delta$. This is in contrast to a **soft constraint**, which would allow us to violate the constraint, but would penalize us for doing so.

So, the update rule is:
$$
\begin{align*}
\theta_{\text{new}} = \arg \max_\theta \; & \mathcal{L}(\theta, \theta_{\text{old}}) \\
\text{s.t. } & \bar{D}_{KL}( \pi_{\theta_{\text{old}}} || \pi_\theta) \leq \delta
\end{align*}
$$
where:
* $\mathcal{L}(\theta, \theta_{\text{old}})$ is the objective function. In TRPO, this is the surrogate advantage:
  * $ \mathcal{L}(\theta, \theta_{\text{old}}) = \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}, a \sim \pi_{\theta_{\text{old}}}} \left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A^{\pi_{\theta_{\text{old}}}}(s, a) \right]$
  * Here $\rho_{\theta_{\text{old}}}$ denotes the distribution of states visited when running the old policy.
  * The surrogate advantage measures how the new policy $\pi_\theta$ performs relative to the old policy, using data collected with the old policy. Its gradient at $\theta = \theta_{\text{old}}$ equals the policy gradient, so it is a good local stand-in for the expected return near $\theta_{\text{old}}$.
* $\bar{D}_{KL}$ is the average KL divergence between the old policy and the new policy, over the states visited by the old policy:
  * $\bar{D}_{KL}( \pi_{\theta_{\text{old}}} || \pi_\theta) = \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}} \left[ D_{KL}(\pi_{\theta_{\text{old}}}(\cdot|s) || \pi_\theta(\cdot|s)) \right]$

Let's take a closer look at the surrogate advantage objective function:
$$
\mathcal{L}(\theta, \theta_{\text{old}}) = \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}, a \sim \pi_{\theta_{\text{old}}}} \left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A^{\pi_{\theta_{\text{old}}}}(s, a) \right]
$$
* Its value increases when we choose a value of $\theta$ that increases the probability of actions with positive advantage, and decreases the probability of actions with negative advantage.
  * This makes sense: it's the same thing that we were trying to achieve in REINFORCE.
* The subexpression $\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}$ is the ratio of the new policy to the old policy, often called the **importance sampling ratio** or **likelihood ratio**.
  * It is the ratio of the probability of taking action $a$ in state $s$ under the new policy, to the probability of taking action $a$ in state $s$ under the old policy.
  * It corrects for the fact that the data was collected under the old policy $\pi_{\theta_{\text{old}}}$, while the policy we are optimizing is $\pi_\theta$.
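
In code, the surrogate advantage is usually estimated from sampled state-action pairs. The sketch below assumes we stored the log-probability of each sampled action under the old policy and can recompute it under the new one. The function name and the numbers are illustrative.

```python
import math


def surrogate_advantage(logp_new, logp_old, advantages):
    """Sample average of (pi_new(a|s) / pi_old(a|s)) * A(s, a)."""
    # Working with log-probabilities: ratio = exp(log pi_new - log pi_old).
    ratios = [math.exp(new - old) for new, old in zip(logp_new, logp_old)]
    return sum(r * a for r, a in zip(ratios, advantages)) / len(advantages)


# Two sampled actions: one with positive and one with negative advantage.
logp_old = [math.log(0.5), math.log(0.5)]
advantages = [2.0, -2.0]

# A new policy that makes the good action more likely and the bad one less likely.
logp_new = [math.log(0.6), math.log(0.4)]

unchanged = surrogate_advantage(logp_old, logp_old, advantages)
improved = surrogate_advantage(logp_new, logp_old, advantages)
```

When the policy is unchanged, every ratio is 1 and the surrogate is just the average advantage (here 0). Shifting probability towards the action with positive advantage gives ratios of 1.2 and 0.8, and raises the surrogate to about 0.4.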

#### TRPO in practice

What we've described above is merely the *theoretical TRPO update*. TRPO does not solve this constrained problem directly, because doing so would be too expensive. Instead, it uses a first-order (linear) approximation of the objective function and a second-order (quadratic) approximation of the KL constraint, and then solves the resulting problem using a [conjugate gradient](https://en.wikipedia.org/wiki/Conjugate_gradient_method) algorithm, which avoids explicitly forming and inverting the Hessian of the KL divergence.

We won't delve too much into the details of this, but if you want to know more, you can read the [OpenAI Spinning Up documentation](https://spinningup.openai.com/en/latest/algorithms/trpo.html).
The gist of it is that the algorithm used by TRPO is as follows:

1. Collect a set of trajectories $\mathcal{D}$ using the old policy $\pi_{\theta_{k}}$.
2. Compute the policy gradient $\hat{g}$ using advantage:
$$
\hat{g}_k = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \left. \nabla_\theta \log \pi_\theta(a_t|s_t) \right|_{\theta_k} A^{\pi_{\theta_{k}}}(s_t, a_t)
$$
3. Compute the step direction $\hat{x}_k \approx H_k^{-1} \hat{g}_k$ using the conjugate gradient algorithm.
    * Where $H_k$ is the Hessian of the average KL divergence between the old policy and the new policy.
    * Recall that the Hessian is the matrix of second derivatives:
        * Let $f$ be a function of $n$ variables, and $\vec{x} = [x_1, x_2, \ldots, x_n]$:
        $$
        H = \begin{bmatrix}
        \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
        \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
        \vdots & \vdots & \ddots & \vdots \\
        \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
        \end{bmatrix}
        $$
    * In this case $f$ is $\bar{D}_{KL}( \pi_{\theta_{k}} || \pi_\theta)$, the average KL divergence between the old policy and the new policy, and $\vec{x}$ is $\theta$.
4. Update the policy using backtracking line search with:
    $$
    \theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2 \delta}{\hat{x}_k^T H_k \hat{x}_k}} \hat{x}_k
    $$
    * Where $\alpha \in (0, 1)$ is the backtracking coefficient (not the learning rate from earlier), and $j$ is the smallest non-negative integer such that:
        * $\bar{D}_{KL}( \pi_{\theta_{k}} || \pi_{\theta_{k+1}}) \leq \delta$
            * The KL divergence constraint is satisfied.
        * And $\mathcal{L}(\theta_{k+1}, \theta_{k}) \geq \mathcal{L}(\theta_{k}, \theta_{k})$
            * The surrogate objective has not decreased.
5. Repeat steps 1-4 until convergence.
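
To give a feel for steps 3 and 4, here is a small self-contained sketch. It is not a TRPO implementation: the matrix `H`, the vector `g`, and the quadratic and linear models used for the KL divergence and the surrogate are made-up stand-ins, and the helper names are hypothetical. A real implementation would obtain Hessian-vector products from automatic differentiation rather than from an explicit matrix.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products hvp(v) = H v."""
    x = [0.0] * len(g)
    r = list(g)  # residual g - H x (x starts at zero)
    p = list(g)  # current search direction
    rs_old = dot(r, r)
    for _ in range(iters):
        if rs_old < tol:
            break
        hp = hvp(p)
        step = rs_old / dot(p, hp)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * hi for ri, hi in zip(r, hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def backtracking_line_search(theta, full_step, kl_fn, surrogate_fn, delta,
                             backtrack_coeff=0.8, max_backtracks=10):
    """Try step fractions backtrack_coeff**j for j = 0, 1, ... until both checks pass."""
    for j in range(max_backtracks):
        frac = backtrack_coeff ** j
        candidate = [t + frac * s for t, s in zip(theta, full_step)]
        if kl_fn(candidate) <= delta and surrogate_fn(candidate) >= 0.0:
            return candidate, j
    return list(theta), None  # no acceptable step: keep the old parameters


# Toy stand-ins (all numbers are made up).
H = [[2.0, 1.0], [1.0, 3.0]]   # plays the role of the KL Hessian H_k
g = [1.0, 2.0]                 # plays the role of the policy gradient g_k
delta = 0.01
theta_k = [0.0, 0.0]


def hvp(v):
    return [dot(row, v) for row in H]


def kl_fn(theta):
    d = [a - b for a, b in zip(theta, theta_k)]
    return 0.5 * dot(d, hvp(d))          # quadratic model of the KL divergence


def surrogate_fn(theta):
    d = [a - b for a, b in zip(theta, theta_k)]
    return dot(g, d)                     # linear model of the surrogate (0 at theta_k)


x = conjugate_gradient(hvp, g)                         # x is approximately H^{-1} g
scale = math.sqrt(2.0 * delta / dot(x, hvp(x)))        # largest step the KL model allows
full_step = [scale * xi for xi in x]
theta_next, j = backtracking_line_search(theta_k, full_step, kl_fn, surrogate_fn, delta)
```

For this toy problem the exact solution of $Hx = g$ is $x = [0.2, 0.6]$, which conjugate gradient recovers in two iterations without ever inverting `H`.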

#### Cons of TRPO

TRPO is a very powerful algorithm, but it has a large drawback: it's too complicated. In RL, bugs are often invisible, only showing up as a performance decrease. This makes it very difficult to debug complicated algorithms like TRPO. In addition, if you have hyperparameters, it's often unclear whether performance problems are due to bugs or hyperparameters. We want a method that is simple enough that we can be confident that it is implemented correctly.

## PPO: Proximal Policy Optimization

PPO is a family of algorithms that use a similar idea to TRPO, but they are much simpler to implement, and they often obtain similar performance. Thanks to this, PPO is one of the most widely used RL algorithms today. In this tutorial, we mostly discuss PPO-Clip, which is the simplest version of PPO.

Recall the main problem that TRPO aimed to solve: we want to prevent the new policy from being too different from the old policy. TRPO accomplished this by adding an explicit constraint on the KL divergence to the optimization problem. PPO takes a different approach: it changes the objective function so that there's no incentive to change the policy too much in one update.

#### PPO-Clip objective function

The PPO-Clip objective function is:
$$
\mathcal{L}_{\text{CLIP}}(\theta, \theta_{\text{old}}) = \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}, a \sim \pi_{\theta_{\text{old}}}} \left[ \min \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A^{\pi_{\theta_{\text{old}}}}(s, a), \text{clip} \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}, 1 - \epsilon, 1 + \epsilon \right) A^{\pi_{\theta_{\text{old}}}}(s, a) \right) \right]
$$
Where $\epsilon$ is a hyperparameter, usually set to a small value like 0.1 or 0.2.

Let's break this down into two cases: one where $A^{\pi_{\theta_{\text{old}}}}(s, a) \geq 0$, and one where $A^{\pi_{\theta_{\text{old}}}}(s, a) < 0$.

* When $A^{\pi_{\theta_{\text{old}}}}(s, a) \geq 0$:
    * We can factor the advantage out of the min operation:
    $$
    \begin{align*}
    & \min \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A^{\pi_{\theta_{\text{old}}}}(s, a), \text{clip} \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}, 1 - \epsilon, 1 + \epsilon \right) A^{\pi_{\theta_{\text{old}}}}(s, a) \right)\\
    =& \min \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}, \text{clip} \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}, 1 - \epsilon, 1 + \epsilon \right) \right) A^{\pi_{\theta_{\text{old}}}}(s, a)
    \end{align*}
    $$
    * We can eliminate the clip by considering three cases for the ratio:
        * if $\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} < 1 - \epsilon$, then:
        $$
        \begin{align*}
        =& \min \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}, 1 - \epsilon \right) A^{\pi_{\theta_{\text{old}}}}(s, a) & \text{clip because } \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} < 1 - \epsilon\\
        =& \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A^{\pi_{\theta_{\text{old}}}}(s, a) & \text{because } \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} < 1 - \epsilon\\
        \end{align*}
        $$
        * if $1 - \epsilon \leq \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \leq 1 + \epsilon$, then:
        $$
        \begin{align*}
        =& \min \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}, \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \right) A^{\pi_{\theta_{\text{old}}}}(s, a) & \text{no clip because } 1 - \epsilon \leq \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \leq 1 + \epsilon\\
        =& \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A^{\pi_{\theta_{\text{old}}}}(s, a)\\
        \end{align*}
        $$
        * if $\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} > 1 + \epsilon$, then:
        $$
        \begin{align*}
        =& \min \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}, 1 + \epsilon \right) A^{\pi_{\theta_{\text{old}}}}(s, a) & \text{clip because } \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} > 1 + \epsilon\\
        =& \left( 1 + \epsilon \right) A^{\pi_{\theta_{\text{old}}}}(s, a) & \text{because } \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} > 1 + \epsilon\\
        \end{align*}
        $$
        * We can combine these three cases into one expression:
        $$
        \min \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}, 1 + \epsilon \right) A^{\pi_{\theta_{\text{old}}}}(s, a)
        $$

    

* When $A^{\pi_{\theta_{\text{old}}}}(s, a) < 0$:
    * We can factor the advantage out of the min operation, recalling that since $A^{\pi_{\theta_{\text{old}}}}(s, a) < 0$, the min operation will be turned into a max operation:
    $$
    \begin{align*}
    & \min \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A^{\pi_{\theta_{\text{old}}}}(s, a), \text{clip} \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}, 1 - \epsilon, 1 + \epsilon \right) A^{\pi_{\theta_{\text{old}}}}(s, a) \right)\\
    =& \max \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}, \text{clip} \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}, 1 - \epsilon, 1 + \epsilon \right) \right) A^{\pi_{\theta_{\text{old}}}}(s, a)
    \end{align*}
    $$
    * We can eliminate the clip by considering three cases for the ratio:
        * if $\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} < 1 - \epsilon$, then:
        $$
        \begin{align*}
        =& \max \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}, 1 - \epsilon \right) A^{\pi_{\theta_{\text{old}}}}(s, a) & \text{clip because } \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} < 1 - \epsilon\\
        =& \left( 1 - \epsilon \right) A^{\pi_{\theta_{\text{old}}}}(s, a) & \text{because } \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} < 1 - \epsilon\\
        \end{align*}
        $$
        * if $1 - \epsilon \leq \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \leq 1 + \epsilon$, then:
        $$
        \begin{align*}
        =& \max \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}, \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \right) A^{\pi_{\theta_{\text{old}}}}(s, a) & \text{no clip because } 1 - \epsilon \leq \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \leq 1 + \epsilon\\
        =& \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A^{\pi_{\theta_{\text{old}}}}(s, a)\\
        \end{align*}
        $$
        * if $\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} > 1 + \epsilon$, then:
        $$
        \begin{align*}
        =& \max \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}, 1 + \epsilon \right) A^{\pi_{\theta_{\text{old}}}}(s, a) & \text{clip because } \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} > 1 + \epsilon\\
        =& \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A^{\pi_{\theta_{\text{old}}}}(s, a) & \text{because } \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} > 1 + \epsilon\\
        \end{align*}
        $$
        * We can combine these three cases into one expression:
        $$
        \max \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}, 1 - \epsilon \right) A^{\pi_{\theta_{\text{old}}}}(s, a)
        $$

Putting it all together, we can rewrite the clipped surrogate objective as:
$$
\mathcal{L}_{\text{CLIP}}(\theta, \theta_{\text{old}}) = \mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}} \left[l(s, a)\right]
$$
Where:
$$
l(s, a) = 
\begin{cases}
\min \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}, 1 + \epsilon \right) A^{\pi_{\theta_{\text{old}}}}(s, a) & \text{if } A^{\pi_{\theta_{\text{old}}}}(s, a) \geq 0\\
\max \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}, 1 - \epsilon \right) A^{\pi_{\theta_{\text{old}}}}(s, a) & \text{if } A^{\pi_{\theta_{\text{old}}}}(s, a) < 0\\
\end{cases}
$$
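
As a sanity check on the case analysis, the sketch below implements both the original min/clip form and the simplified piecewise form for a single state-action pair, and compares them on a grid of ratios and advantages. The function names are illustrative, and `eps=0.2` is just one of the typical values mentioned above.

```python
def clip(x, low, high):
    return max(low, min(x, high))


def ppo_clip_term(ratio, advantage, eps=0.2):
    """Original form: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    return min(ratio * advantage, clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)


def simplified_term(ratio, advantage, eps=0.2):
    """Piecewise form l(s, a) derived above."""
    if advantage >= 0:
        return min(ratio, 1.0 + eps) * advantage
    return max(ratio, 1.0 - eps) * advantage


ratios = [0.5, 0.8, 0.9, 1.0, 1.1, 1.2, 1.5]
advantage_values = [-2.0, -1.0, 0.0, 1.0, 2.0]
max_difference = max(
    abs(ppo_clip_term(r, a) - simplified_term(r, a))
    for r in ratios
    for a in advantage_values
)
```

The two forms agree everywhere on the grid. Two cases are worth checking by hand: with a ratio of 1.5 and an advantage of $+2$ the term is capped at $1.2 \times 2 = 2.4$, but with a ratio of 1.5 and an advantage of $-2$ the term is $1.5 \times (-2) = -3$, so the objective is not clipped when the policy moves in the harmful direction.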

Let's try to understand what optimizing this loss would mean by looking at a single state-action pair $(s, a)$.
* If the advantage of that particular action is positive, then the way to increase $l(s, a)$ is to increase the probability of taking that action, $\pi_\theta(a|s)$.
    * However, there's no incentive to increase the likelihood of taking that action beyond $1 + \epsilon$ times its likelihood under the old policy. This is because once the term $\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}$ is greater than $1 + \epsilon$, the min operation will just return $1 + \epsilon$.
* If the advantage of that particular action is negative, then the way to increase $l(s, a)$ is to decrease the probability of taking that action, $\pi_\theta(a|s)$.
    * However, there's no incentive to decrease the likelihood of taking that action below $1 - \epsilon$ times its likelihood under the old policy. This is because once the term $\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}$ is less than $1 - \epsilon$, the max operation will just return $1 - \epsilon$.

However, there's an important caveat. Unlike in TRPO, there's no guarantee that the KL divergence between the old and new policies will stay small. Even though there's no incentive to move far from the old policy for any single state-action pair, the cumulative effect of optimizing over many state-action pairs can still change the policy significantly. In practice, this is usually manageable: we can use a smaller value of $\epsilon$, or use early stopping (ending the gradient steps on the current batch once the policy has moved too far).
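
One common way to implement early stopping is to monitor a cheap sample-based estimate of the KL divergence during the gradient steps. The sketch below is an assumption about how this might look, not part of the PPO-Clip objective itself: the estimator (the mean difference of log-probabilities on the sampled actions), the `target_kl` value, and the factor of 1.5 are all illustrative choices.

```python
import math


def approx_kl(logp_old, logp_new):
    """Sample estimate of KL(old || new): mean of log pi_old(a|s) - log pi_new(a|s)."""
    return sum(old - new for old, new in zip(logp_old, logp_new)) / len(logp_old)


def should_stop_early(logp_old, logp_new, target_kl=0.01):
    """Stop taking gradient steps on this batch once the policy has drifted too far."""
    return approx_kl(logp_old, logp_new) > 1.5 * target_kl


logp_old = [math.log(0.5), math.log(0.5)]
logp_slightly_moved = [math.log(0.505), math.log(0.495)]
logp_far_moved = [math.log(0.1), math.log(0.1)]

keep_going = should_stop_early(logp_old, logp_slightly_moved)
stop_now = should_stop_early(logp_old, logp_far_moved)
```

The check would run after each gradient step on the batch, with `logp_old` fixed and `logp_new` recomputed from the current parameters.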

#### The PPO Algorithm
1. Collect a set of trajectories $\mathcal{D}$ using the current policy $\pi_{\theta_{\text{k}}}$.
2. Compute the advantages $A^{\pi_{\theta_{\text{k}}}}(s, a)$ for each state-action pair $(s, a)$ in the trajectories, using the current value function $V_{\phi_{\text{k}}}$ (where $\phi$ denotes the parameters of the value network).
3. Update the policy by maximizing the clipped surrogate objective:
    $$
    \theta_{\text{k+1}} = \arg \max_\theta \mathcal{L}_{\text{CLIP}}(\theta, \theta_{\text{k}})
    $$
    where:
    $$
    \mathcal{L}_{\text{CLIP}}(\theta, \theta_{\text{k}}) \approx \frac{1}{|\mathcal{D}|T} \sum_{\tau \in \mathcal{D}} \sum_{t=1}^T \min \left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{k}}}(a_t|s_t)} A^{\pi_{\theta_{\text{k}}}}(s_t, a_t), \text{clip} \left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{k}}}(a_t|s_t)}, 1 - \epsilon, 1 + \epsilon \right) A^{\pi_{\theta_{\text{k}}}}(s_t, a_t) \right)
    $$
    * Observe that unlike REINFORCE, we don't merely take a single gradient step: we set $\theta_{\text{k+1}}$ to (approximately) the value of $\theta$ that maximizes the objective.
        * In practice, we do this with several epochs of minibatch gradient ascent on the same batch of data, changing $\theta$ at every step but leaving $\theta_{\text{k}}$ fixed.
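4. Update the value network by minimizing the MSE between the predicted value and the empirical reward-to-go. Writing $\phi$ for the parameters of the value network (to distinguish them from the policy parameters $\theta$):
    $$
    \phi_{\text{k+1}} = \arg \min_\phi \frac{1}{|\mathcal{D}|T} \sum_{\tau \in \mathcal{D}} \sum_{t=1}^T \left( V_\phi(s_t) - \hat{R}_t \right)^2
    $$
    * As with the policy, we typically approximate this minimization with a few steps of gradient descent.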
5. Repeat steps 1-4 until convergence.
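
To see the whole loop in one place, here is a toy sketch of PPO-Clip on a multi-armed bandit (a single state, one-step episodes). It is a simplification under several assumptions: the policy is a softmax over logits with a hand-written gradient, the critic is a single scalar, the advantage is simply reward minus value (there is only one step, so the reward-to-go is the reward), and all hyperparameters are arbitrary. It is meant to show the structure of the algorithm, not to replace the exercise notebooks.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ppo_bandit(mean_rewards, iterations=50, batch_size=64, epochs=5,
               eps=0.2, lr=0.5, value_lr=0.1, seed=0):
    rng = random.Random(seed)
    n = len(mean_rewards)
    theta = [0.0] * n   # policy parameters (softmax logits)
    value = 0.0         # critic: one scalar, because there is only one state
    for _ in range(iterations):
        # 1. Collect data with the current policy pi_{theta_k}.
        old_probs = softmax(theta)
        actions = rng.choices(range(n), weights=old_probs, k=batch_size)
        rewards = [mean_rewards[a] + rng.gauss(0.0, 0.1) for a in actions]
        # 2. Estimate advantages using the current value estimate.
        advantages = [r - value for r in rewards]
        # 3. Several gradient ascent steps on the clipped objective (theta_k stays fixed).
        for _ in range(epochs):
            probs = softmax(theta)
            grad = [0.0] * n
            for a, adv in zip(actions, advantages):
                ratio = probs[a] / old_probs[a]
                clipped = (adv >= 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps)
                if clipped:
                    continue  # the objective is flat here, so the gradient is zero
                # For a softmax policy: d(ratio)/d(theta_j) = ratio * (1[j == a] - probs[j])
                for j in range(n):
                    indicator = 1.0 if j == a else 0.0
                    grad[j] += adv * ratio * (indicator - probs[j])
            theta = [t + lr * g / batch_size for t, g in zip(theta, grad)]
        # 4. One gradient descent step on the critic's mean squared error.
        value_grad = sum(2.0 * (value - r) for r in rewards) / batch_size
        value -= value_lr * value_grad
    return softmax(theta), value


final_probs, final_value = ppo_bandit(mean_rewards=[0.0, 1.0, 0.2])
```

After training, the policy should place most of its probability on the second arm, which has the highest mean reward. The `clipped` check is where PPO differs from a plain policy gradient: samples whose ratio has already moved past the clip range in the beneficial direction stop contributing to the gradient.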


# Now You Try!

Now that we've covered the theoretical foundations of PPO, it's time for you to implement it yourself!

We have two notebooks for you to try. The first one has missing parts that you need to fill in, and the second one is a fully worked solution with explanations. We recommend that you try the first notebook first, and then use the second notebook as a reference if you get stuck.
* [PPO Exercise](ppo_exercise.ipynb)
* [PPO Solution](ppo_solution.ipynb)