# PPO

In short: PPO (Proximal Policy Optimization) is a widely used policy gradient method for reinforcement learning. PPO is an improved variant of TRPO (Trust Region Policy Optimization), which is itself an improved variant of the original policy gradient method. PPO was released in 2017 by OpenAI in the paper [Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347).

The reason it's so popular is that it offers much better performance than the vanilla policy gradient, while still being relatively simple to implement and tune.

In this article, we'll attempt to:
1. Explain the flaws of REINFORCE, which TRPO and PPO attempt to solve.
2. Explain the intuition behind TRPO and PPO.
3. Explain how PPO improves upon TRPO.
4. Explain the PPO loss function.

## Flaws of REINFORCE

We covered REINFORCE in [a previous entry](../PolicyGradient/policygradient.ipynb), so we won't go into too much detail here. However, let's briefly review how it works.

#### REINFORCE Algorithm
1. We initialize the policy parameters $\theta$ randomly.
2. We collect a batch of trajectories $D = \{\tau_1, \tau_2, \ldots, \tau_N\}$ by running the policy in the environment.
3. We compute an estimate of the policy gradient, $\hat{g}$:
    * Recall that the policy gradient is given by:
        $$
        \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \hat{R}_t(\tau) \right]
        $$
    * REINFORCE estimates the policy gradient using Monte Carlo sampling:
        $$
        \hat{g} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) \hat{R}_t(\tau_i)
        $$
4. We then update the policy using gradient ascent:
    $$
    \theta \leftarrow \theta + \alpha \hat{g}
    $$
5. Repeat steps 2-4 until convergence.
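
The five steps above can be sketched in plain Python. This is only a minimal illustration, not the reference implementation: it assumes a hypothetical one-step environment with two actions (reward $+2$ for action 0, $-2$ for action 1) and a softmax policy whose parameters $\theta$ are the two logits. The names `REWARDS`, `softmax`, `grad_log_prob`, and `reinforce` are made up for this sketch.

```python
import math
import random

# Hypothetical one-step environment: action 0 pays +2, action 1 pays -2.
REWARDS = {0: 2.0, 1: -2.0}

def softmax(logits):
    """Turn logits into action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def reinforce(num_updates=200, batch_size=32, alpha=0.1, seed=0):
    rng = random.Random(seed)
    theta = [rng.uniform(-0.1, 0.1) for _ in range(2)]  # step 1: random init
    for _ in range(num_updates):                        # step 5: repeat
        g_hat = [0.0, 0.0]
        for _ in range(batch_size):                     # step 2: collect data
            action = rng.choices([0, 1], weights=softmax(theta))[0]
            # Episodes last one step, so the reward-to-go is just the reward.
            reward_to_go = REWARDS[action]
            score = grad_log_prob(theta, action)
            for k in range(2):                          # step 3: average the estimate
                g_hat[k] += score[k] * reward_to_go / batch_size
        # Step 4: gradient ascent on the policy parameters.
        theta = [t + alpha * g for t, g in zip(theta, g_hat)]
    return theta

final_probs = softmax(reinforce())
print(final_probs)  # probability of the +2 action should end up close to 1
```

Note that every batch is sampled from the current `theta` and thrown away after one update, which is exactly the behaviour discussed later in this article.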

Reference implementations:
* [Discrete](../PolicyGradient/policygradient_discrete_solution.ipynb)
* [Continuous](../PolicyGradient/policygradient_continuous_solution.ipynb)

#### Problems with REINFORCE
As you may have observed when implementing REINFORCE on MetaDrive, it is very slow to train, and the final policies we ended up with were often quite poor.

This is because REINFORCE has two major flaws:
1. The estimate of the policy gradient has high variance.
2. We can't use old trajectories to improve the policy, so we need to collect new trajectories every time we want to update the policy.



### Problem 1: Noisy Policy Gradient Estimates
The first problem is that the estimate of the policy gradient has high variance (a high-variance estimate is often called "noisy"). In practice, this means that the estimate computed from one batch of trajectories can be very different from the true policy gradient, so we might update the policy in the wrong direction and hurt performance.

Since we compute the policy gradient estimate using Monte Carlo sampling, we can mitigate this by increasing the number of samples we use. However, this comes at the cost of increased computation time.
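
To see the trade-off between sample count and noise, here is a small simulation. It does not use a policy at all: as an assumption for illustration, it estimates the mean of a hypothetical noisy quantity (true mean 1.0, standard deviation 5.0) and measures how much the estimate fluctuates across repeated experiments for two batch sizes.

```python
import random

def spread_of_estimates(num_samples, num_repeats=200, seed=0):
    """Standard deviation of a Monte Carlo mean across repeated experiments."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(num_repeats):
        # Hypothetical noisy quantity: true mean 1.0, standard deviation 5.0.
        samples = [rng.gauss(1.0, 5.0) for _ in range(num_samples)]
        estimates.append(sum(samples) / num_samples)
    center = sum(estimates) / num_repeats
    variance = sum((e - center) ** 2 for e in estimates) / num_repeats
    return variance ** 0.5

small_batch_spread = spread_of_estimates(num_samples=10)
large_batch_spread = spread_of_estimates(num_samples=1000)
print(small_batch_spread, large_batch_spread)
```

The larger batch gives a much tighter estimate, but it needs 100 times as many samples, which in RL means 100 times as much environment interaction.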

#### One Solution: Replacing $\hat{R}_t(\tau_i)$ with $A^{\pi_\theta}(s_t, a_t)$
When we [first derived the policy gradient](../PolicyGradient/policygradient.ipynb) it had the form:
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau) \right]
$$
The return $R(\tau)$ is the sum of the rewards in the trajectory $\tau$:
$$
R(\tau) = \sum_{t=0}^T r_t
$$
However, we can replace it with the reward-to-go $\hat{R}_t(\tau)$, which only counts the rewards from time step $t$ onwards (for a sampled trajectory $\tau_i$, this is the $\hat{R}_t(\tau_i)$ used in the REINFORCE estimate above):
$$
\hat{R}_t(\tau) = \sum_{t'=t}^T r_{t'}
$$
It turns out that there are other valid replacements for $R(\tau)$ that can reduce the variance of the policy gradient estimate. One such replacement is the advantage function $A^{\pi_\theta}(s_t, a_t)$, which is defined as:
$$
A^{\pi_\theta}(s_t, a_t) = Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t)
$$
where $Q^{\pi_\theta}(s_t, a_t)$ is the state-action value function and $V^{\pi_\theta}(s_t)$ is the state value function. 

Recall that:
* $V^{\pi_\theta}(s_t) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) | s_0=s_t \right]$
    * Called the "Value Function".
    * Measures how good it is to be in state $s_t$, assuming we follow policy $\pi_\theta$.
* $Q^{\pi_\theta}(s_t, a_t) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) | s_0=s_t, a_0=a_t \right]$
    * Called the "Action-Value Function", or "Q-Function".
    * Measures how good it is to take action $a_t$ in state $s_t$, and then follow policy $\pi_\theta$ afterwards.

So, the advantage function measures how much better (or worse) it is to take the specific action $a_t$ in state $s_t$ than to act according to our policy $\pi_\theta$ on average. Let's take a look at an example to see why this is useful.

#### Example: Advantage Function
Assume that we are playing in the gridworld below, and that the game ends in exactly one step.

![gridworld](./gridworld_example.png)

If we choose to move left, we get a return of $+2$. If we choose to move right, we get a return of $-2$.

Our policy is:
$$
\pi_{\text{rand}}(a_t|s_t) = \begin{cases}
    0.5 & \text{if } a_t = \text{left} \\
    0.5 & \text{if } a_t = \text{right}
\end{cases}
$$

* What is $A^{\pi_{\text{rand}}}(s_0, \text{left})$?
    * $V^{\pi_{\text{rand}}}(s_0) = 0.5(2) + 0.5(-2) = 0$
    * $Q^{\pi_{\text{rand}}}(s_0, \text{left}) = 2$
    * $A^{\pi_{\text{rand}}}(s_0, \text{left}) = Q^{\pi_{\text{rand}}}(s_0, \text{left}) - V^{\pi_{\text{rand}}}(s_0)$
    * Therefore, $A^{\pi_{\text{rand}}}(s_0, \text{left}) = 2 - 0 = 2$
* What is $A^{\pi_{\text{rand}}}(s_0, \text{right})$?
    * $V^{\pi_{\text{rand}}}(s_0) = 0.5(2) + 0.5(-2) = 0$
    * $Q^{\pi_{\text{rand}}}(s_0, \text{right}) = -2$
    * $A^{\pi_{\text{rand}}}(s_0, \text{right}) = Q^{\pi_{\text{rand}}}(s_0, \text{right}) - V^{\pi_{\text{rand}}}(s_0)$
    * Therefore, $A^{\pi_{\text{rand}}}(s_0, \text{right}) = -2 - 0 = -2$
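
The same arithmetic can be checked in a few lines of Python. The dictionaries `policy` and `returns` simply encode the numbers from the example; the variable names are chosen for this sketch.

```python
# Numbers from the one-step gridworld example above.
policy = {"left": 0.5, "right": 0.5}
returns = {"left": 2.0, "right": -2.0}

# The game ends after one step, so Q(s_0, a) is just the return of action a.
q_values = dict(returns)

# V(s_0) is the policy-weighted average of the Q-values.
value = sum(policy[a] * q_values[a] for a in policy)

# A(s_0, a) = Q(s_0, a) - V(s_0)
advantages = {a: q_values[a] - value for a in policy}

print(value)       # 0.0
print(advantages)  # {'left': 2.0, 'right': -2.0}
```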

#### Why is the Advantage Function Useful?
The advantage function is useful because it is a lower variance replacement for reward-to-go in the policy gradient algorithm. Let's take a moment to discuss a few reasons why this is the case.
1. The Q-Function averages out the randomness in the rewards we get from following our policy. This is because the Q-Function is the **expected** return we get from taking action $a_t$ in state $s_t$, and then following our policy afterwards.
    * To illustrate this with an example, imagine that in our environment there is a small chance of suffering a large negative reward at any point. If we take a good action but then suffer that large negative reward, the return we get will be low, even though the action was good. However, the Q-Function will still be high, because it accounts for the fact that we usually get a high return after taking that action.
2. By subtracting the value function from the Q-Function, we center the advantage function around zero. This means that the advantage function will be positive when the Q-Function is higher than the value function, and negative when the Q-Function is lower than the value function. This makes sense, because the model should reinforce actions that are better than expected, and discourage actions that are worse than expected. For actions that are as good as expected, the advantage function will be zero, and the policy gradient will not be affected.
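
The effect of centering (point 2) can be made concrete with a tiny exact calculation. The numbers here are assumptions chosen for illustration, not taken from the gridworld above: a one-step game where moving left returns $+10$ and moving right returns $+6$, a uniform softmax policy, and the gradient taken with respect to the logit of the left action. Because there are only two actions, we can enumerate them instead of sampling.

```python
# Hypothetical one-step game in which both actions have positive returns.
policy = {"left": 0.5, "right": 0.5}
returns = {"left": 10.0, "right": 6.0}

# Derivative of log pi(a) with respect to the "left" logit of a softmax policy.
score = {"left": 1.0 - policy["left"], "right": -policy["left"]}

def mean_and_variance(weights):
    """Exact mean and variance of score * weight when actions follow the policy."""
    samples = {a: score[a] * weights[a] for a in policy}
    mean = sum(policy[a] * samples[a] for a in policy)
    variance = sum(policy[a] * (samples[a] - mean) ** 2 for a in policy)
    return mean, variance

value = sum(policy[a] * returns[a] for a in policy)   # V(s_0) = 8.0
advantages = {a: returns[a] - value for a in policy}  # +2.0 and -2.0

return_mean, return_variance = mean_and_variance(returns)
advantage_mean, advantage_variance = mean_and_variance(advantages)
print(return_mean, return_variance)
print(advantage_mean, advantage_variance)
```

Both weightings give the same expected gradient, so the update direction is unchanged on average, but the per-sample terms weighted by the advantage vary far less in this toy setting.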

#### Calculating Advantage

In practice, we cannot calculate the advantage function directly, because we do not know the true Q-Function or value function. However, we can estimate the advantage function using the following formula, as described by Equation 18 in [Schulman et al. (2015)](https://arxiv.org/abs/1506.02438):
$$
A^{\pi_\theta}(s_t, a_t) \approx \sum_{t'=t}^T \gamma^{t'-t} r_{t'} - V^{\pi_\theta}(s_t)
$$
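Here, $\gamma \in [0, 1]$ is the discount factor (with $\gamma = 1$, the sum is exactly the reward-to-go $\hat{R}_t$), and $V^{\pi_\theta}(s_t)$ is the value function, which we do not know exactly but can estimate using a neural network. This is called the **critic** network. The neural network takes in a state $s_t$ as input, and outputs an estimate of the value function $V^{\pi_\theta}(s_t)$.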
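
As a rough sketch of how this estimate might be computed for one trajectory, the function below takes the list of rewards and a list of value estimates (one per time step) and returns the advantage estimates. The function name `estimate_advantages` and the hard-coded `value_estimates` are assumptions for illustration; in a real implementation the values would come from the critic network.

```python
def estimate_advantages(rewards, value_estimates, gamma=0.99):
    """Discounted reward-to-go minus the value estimate, for every time step."""
    advantages = [0.0] * len(rewards)
    reward_to_go = 0.0
    # Walk backwards so each step reuses the discounted sum of the later steps.
    for t in reversed(range(len(rewards))):
        reward_to_go = rewards[t] + gamma * reward_to_go
        advantages[t] = reward_to_go - value_estimates[t]
    return advantages

rewards = [1.0, 1.0, 1.0]
value_estimates = [0.5, 0.5, 0.5]  # stand-in for critic outputs V(s_t)
example_advantages = estimate_advantages(rewards, value_estimates, gamma=0.5)
print(example_advantages)  # [1.25, 1.0, 0.5]
```

Iterating backwards keeps the computation linear in the trajectory length, instead of recomputing the inner sum for every $t$.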

### Problem 2: Need to collect new trajectories after each policy update
In RL, we often categorize algorithms as "on-policy" or "off-policy":
* **On-policy**: The algorithm can only train on data that was collected using the current policy.
* **Off-policy**: The algorithm can train on data that was collected using any policy.

REINFORCE is an on-policy algorithm. Whenever we update the model, we can't use our trajectories from before the update, and we need to gather new ones. This is because we compute the policy gradient using the expected value over trajectories sampled from the current policy ($\mathbb{E}_{\tau \sim \pi_{\theta_\text{current}}}$). If we used trajectories from before the update, we would be computing the policy gradient using the expected value over trajectories sampled from the old policy ($\mathbb{E}_{\tau \sim \pi_{\theta_\text{old}}}$), which is not what we want.
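
A small exact calculation shows what goes wrong. The setup is an assumption for illustration: a one-step game with returns $+2$ (left) and $-2$ (right), an old policy that was uniform, and a current softmax policy that already prefers left with probability $0.75$. We compute the expected value of "score times return" for the current policy's left logit, once with actions drawn from the current policy and once with actions drawn from the old one.

```python
returns = {"left": 2.0, "right": -2.0}
pi_current = {"left": 0.75, "right": 0.25}
pi_old = {"left": 0.5, "right": 0.5}

def score_left_logit(policy, action):
    """Derivative of log pi(action) w.r.t. the "left" logit of a softmax policy."""
    return (1.0 if action == "left" else 0.0) - policy["left"]

def expected_gradient(sampling_policy):
    """Exact expectation of score * return when actions come from sampling_policy."""
    return sum(
        sampling_policy[a] * score_left_logit(pi_current, a) * returns[a]
        for a in returns
    )

on_policy_gradient = expected_gradient(pi_current)  # what REINFORCE needs
stale_gradient = expected_gradient(pi_old)          # what old data would give
print(on_policy_gradient, stale_gradient)
```

The two expectations differ, so averaging over old trajectories would not estimate the gradient of the current policy, no matter how many old trajectories we have.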

This makes REINFORCE sample-inefficient compared to other algorithms, like Deep Q-Learning, which we discussed in the previous notebook.

If our environment is expensive to run, this can be a problem. For example, if we are training a robot to walk, we might need to run the robot in the real world, which is time-consuming and costly. With REINFORCE, each batch of trajectories supports only a single gradient step, so the only way to learn more from a fixed amount of data is to make that step bigger by increasing the learning rate. This would make the model learn faster, but it would also make training more unstable.

#### Why does a high learning rate make the model unstable?
Recall the model update rule:
$$
\theta_{\text{new}} = \theta_{\text{old}} + \alpha \nabla_\theta J(\theta) \big|_{\theta = \theta_{\text{old}}}
$$
If we increase the learning rate $\alpha$, then the model parameters $\theta$ will change more with each update. The risk is that we might "overshoot" the optimal parameters, and end up with a worse model than we started with. We can visualize this below:

![high learning rate](./high_learning_rate.png)
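
The same effect shows up on a toy problem. As an assumption for illustration, the sketch below replaces the RL objective with $J(\theta) = -\theta^2$, whose maximum is at $\theta = 0$, and uses the exact gradient instead of a noisy estimate. The function name `gradient_ascent` is made up for this example.

```python
def gradient_ascent(alpha, theta=1.0, num_steps=10):
    """Maximize the toy objective J(theta) = -theta**2 (optimum at theta = 0)."""
    for _ in range(num_steps):
        gradient = -2.0 * theta           # dJ/dtheta
        theta = theta + alpha * gradient  # same update rule as above
    return theta

careful = gradient_ascent(alpha=0.1)     # shrinks towards the optimum
aggressive = gradient_ascent(alpha=1.1)  # jumps past the optimum every step
print(careful, aggressive)
```

With the small learning rate, $\theta$ moves steadily towards 0. With the large one, every step lands further from 0 than the previous one, so the parameters end up worse than where they started.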
