<div align='center'>
    <h1> 
        <a href='https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf'> Vanilla Policy Gradient (VPG) </a> 
    </h1>
</div>

# Intro

Vanilla Policy Gradient (VPG), a.k.a. REINFORCE, is a Rosetta stone of modern reinforcement learning. It provides the theoretical basis for many model-free RL algorithms based on policy gradient methods, such as DDPG, TRPO, and PPO.

- VPG is not an actor-critic algorithm. Instead, it is a pure policy-based reinforcement learning algorithm that **uses only one neural network** to estimate the policy function $\pi_{\theta}$. If a baseline function $b(s_t)$ is used to reduce the variance of the policy gradient estimates, then **a second neural network can be introduced** to estimate the baseline function.

- VPG has only one objective loss function: the policy gradient loss $L^{PG} = J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^T \log \pi_{\theta} (a_t | s_t) \, \Phi_t \right]$.

- VPG optimizes a `stochastic policy` $a_t \sim \pi_{\theta}(a_t | s_t)$ directly using `gradient ascent` on the loss function (performance objective) $L^{PG}$, whereas off-policy algorithms, such as DDPG and Q-learning, rely on the Bellman optimality equation.
  
- VPG is suitable for both `continuous and discrete action spaces`.

# The RL Goal in VPG

The goal of reinforcement learning is to maximize the expected cumulative reward (a.k.a expected return) $J(\pi_{\theta})$ under a parameterized policy $\pi_{\theta}$:

$$\text{max } J(\pi_{\theta}) = \text{max } \mathbb{E}_{\tau \sim \pi_{\theta}}[\mathcal{R}(\tau) | \pi_{\theta_t}].$$

In Policy Gradient algorithms, such as VPG, this is achieved by directly optimizing the parameters $\theta_t$ of the parameterized Policy $\pi_{\theta}$ via `gradient ascent` on the performance objective $J(\pi_{\theta})$ with respect to the Policy parameters:

\begin{eqnarray}
\theta_{t+1} &=& \theta_t + \alpha \nabla_{\theta} J(\pi_{\theta})\\
&=& \theta_t + \alpha \nabla_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}}[\mathcal{R}(\tau) | \pi_{\theta_t}].
\end{eqnarray}
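
As a minimal illustration of the update rule above, the sketch below runs gradient ascent on a hypothetical one-dimensional objective $J(\theta) = -(\theta - 3)^2$, which stands in for the (generally unknown) policy performance. The objective, its gradient, and the function names `grad_J` and `gradient_ascent` are assumptions made for this example only.

```python
def grad_J(theta):
    """Gradient of the toy objective J(theta) = -(theta - 3)^2 (an assumption)."""
    return -2.0 * (theta - 3.0)


def gradient_ascent(theta0, alpha=0.1, steps=100):
    """Repeat the update theta <- theta + alpha * grad_J(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta + alpha * grad_J(theta)
    return theta


# Starting from theta = 0, the iterates move towards the maximizer theta = 3.
theta_star = gradient_ascent(theta0=0.0)
```

In VPG the exact gradient is not available, so `grad_J` is replaced by a sample estimate built from trajectories.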

The general form of the Policy Gradient, which follows from the **Policy Gradient Theorem** (Sutton et al., 1999), is: 

$$ 
\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^T \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) \, \Phi_t \right],
$$

where $\Phi_t$ is any of the following functions:

- $\Phi_t = \mathcal{R}(\tau)$.
- $\Phi_t = Q^{\pi_{\theta}} (s_t, a_t)$.
- $\Phi_t = A^{\pi_{\theta}} (s_t, a_t)$.
- $\Phi_t = \sum_{t'=t}^T \mathcal{R} (s_{t'}, a_{t'}, s_{t'+1})$.
- $\Phi_t = \sum_{t'=t}^T \mathcal{R} (s_{t'}, a_{t'}, s_{t'+1}) - b(s_t)$.

Legend:

-  $\pi_{\theta} (\cdot)$ is the parameterized Policy function represented by a neural network as an expressive nonlinear function approximation.
  
- $\nabla_{\theta} J(\pi_{\theta})$ denotes the gradient of `policy performance` a.k.a `policy gradient`.

- $\mathcal{R}(\tau) \doteq \sum_{t=0}^T r_t \in {\rm I\!R} \text{ (Finite-horizon undiscounted return)}$ is the sum of rewards over a fixed window of time steps.

- $\mathcal{R}(\tau) \doteq \sum_{t=0}^{\infty} \gamma^t r_{t} \in {\rm I\!R} \text{ (Infinite-horizon discounted return)}$ is the discounted sum of all rewards ever obtained.

- $Q^{\pi_{\theta}}$ denotes the On-Policy Action-Value function.

- $A^{\pi_{\theta}}$ denotes the Advantage function.

- $b (s_t)$ denotes the baseline.
  
- $\alpha$ is the learning rate hyperparameter.

- $\gamma \in [0,1]$ is the discount factor hyperparameter.
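
The return definitions in the legend above can be sketched in a few lines of NumPy. The helper names `discounted_return` and `reward_to_go` and the reward sequence are hypothetical and chosen only for illustration. With `gamma=1.0` the first helper gives the finite-horizon undiscounted return.

```python
import numpy as np


def discounted_return(rewards, gamma=1.0):
    """R(tau) = sum_t gamma^t * r_t over one finite trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))


def reward_to_go(rewards):
    """Undiscounted reward-to-go: entry t holds sum_{t'=t}^{T} r_{t'}."""
    rewards = np.asarray(rewards, dtype=float)
    return np.cumsum(rewards[::-1])[::-1]


rewards = [1.0, 0.0, 2.0]                      # r_0, r_1, r_2 (made-up values)
full_return = discounted_return(rewards)       # undiscounted return
disc_return = discounted_return(rewards, 0.5)  # 1 + 0.5 * 0 + 0.25 * 2
rtg = reward_to_go(rewards)                    # one entry per time step
```

Subtracting a baseline value $b(s_t)$ from each entry of `rtg` gives the last choice of $\Phi_t$ in the list.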

# Deriving the Policy Gradient

Considering the infinite-horizon discounted return function $\mathcal{R}(\tau)$, the performance objective $J(\pi_{\theta})$ can be defined as the expected cumulative reward under a parameterized policy $\pi_{\theta}$, which is mathematically represented as:

$$J(\pi_{\theta}) =  \mathbb{E}_{\tau \sim \pi_{\theta}}[\mathcal{R}(\tau)] = \int_{\tau} \underbrace{\mathbb{P}(\tau | \theta)}_{\text{Trajectory prob.}}  \underbrace{\mathcal{R}(\tau)}_{\text{Return}}.$$
 
Where the probability of trajectory $\tau$ under the policy $\pi_{\theta}$ is given by

$$\mathbb{P}(\tau | \theta) = \underbrace{\rho_0(s_0)}_{\text{start-state dist.}} \prod_{t=0}^{T} \underbrace{\underbrace{\mathbb{P}(s_{t+1}|s_t,a_t)}_{\text{State transition prob.}}}_{\text{Env. model.}} \cdot \underbrace{\underbrace{\pi_{\theta}(a_t | s_t)}_{\text{Action prob.}}}_{\text{Control function.}}.$$

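Substituting the trajectory probability into the integral, and grouping the environment terms separately from the policy terms, one has:

$$
J(\pi_{\theta}) = 
\int_{\tau} \Bigg( \rho_0(s_0) \prod_{t=0}^T \mathbb{P}(s_{t+1}|s_t,a_t) \Bigg)
\Bigg( \prod_{t=0}^T \pi_{\theta}(a_t | s_t) \Bigg) \mathcal{R}(\tau).
$$

The two groups cannot be integrated independently of each other, because each action depends on the current state and each next state depends on the chosen action.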

This expression includes the state transition probability $\mathbb{P}(s_{t+1}|s_t, a_t)$, which represents the dynamics of the environment. In `model-based` reinforcement learning, these dynamics are **explicitly** modeled. In `model-free` reinforcement learning there is no model of the environment, i.e., no state transition probability for planning (lookahead). Instead, algorithms of this kind (such as VPG, DDPG, TRPO, PPO, etc.) rely on sampling trajectories from the unknown environment to approximate the gradient of the policy. These trajectories are collected via trial-and-error interactions.

**Why is the state transition probability used in the derivation of the policy gradient for model-free RL?** Even though the transition probability appears in the theoretical definition of the probability of a trajectory and of the performance objective $J(\pi_{\theta})$, the policy gradient theorem reformulates the gradient as an expectation over sampled trajectories, which eliminates the need for explicit knowledge or modeling of the transition probabilities. These transition probabilities are implicitly accounted for when sampling trajectories from the environment.

Legend:

- $\tau = (s_0, a_0, s_1, a_1, \dots)$ represents a trajectory (a sequence of states and actions).

- $\pi_{\theta}$ is the parameterized Policy represented by a neural network as the expressive nonlinear function approximator to estimate/represent the Policy function $\pi$, i.e., the stochastic action probability distribution a.k.a control function.

- $\mathbb{P}(\tau | \theta)$ is the probability of getting a trajectory $\tau$ with $T$ time steps acting according to policy $\pi_{\theta}$.

- $\mathbb{P}(s_{t+1}|s_t,a_t)$ denotes the state transition probability of ending up in the next state $s_{t+1}$ given the current state $s_t$ and action $a_t$.

- $\rho_0$ denotes the initial (start)-state probability distribution. The initial state $s_0$ is sampled from the probability distribution: $s_0 \sim \rho_0(\cdot).$

- $\mathcal{R}(\tau) = \sum_{t=0}^{T} \gamma^t r_t$ is the cumulative (discounted) reward for the trajectory over an episode.

- $\gamma \in [0, 1]$ is the discount factor.
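
To make the trajectory probability concrete, the sketch below evaluates $\log \mathbb{P}(\tau | \theta)$ for a short trajectory in a hypothetical tabular MDP with two states and two actions. The arrays `rho0`, `P`, and `theta`, the softmax parameterization of the policy, and the sample trajectory are all assumptions made for this example.

```python
import numpy as np

rho0 = np.array([0.8, 0.2])                   # start-state distribution rho_0(s)
# P[s, a, s_next] = state transition probability P(s_next | s, a)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
theta = np.array([[0.0, 1.0], [0.5, -0.5]])   # policy logits theta[s, a]


def policy(theta, s):
    """Softmax policy pi_theta(. | s) built from the logits of state s."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()


def trajectory_log_prob(states, actions, theta):
    """log rho_0(s_0) + sum_t [log P(s_{t+1} | s_t, a_t) + log pi_theta(a_t | s_t)]."""
    logp = np.log(rho0[states[0]])
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        logp += np.log(P[s, a, s_next]) + np.log(policy(theta, s)[a])
    return logp


states, actions = [0, 1, 1], [1, 0]           # tau = (s_0, a_0, s_1, a_1, s_2)
logp = trajectory_log_prob(states, actions, theta)
```

Only the `policy(theta, s)` terms depend on `theta`. The `rho0` and `P` terms are constants with respect to the policy parameters, which is why a model-free method never needs to evaluate them.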

To compute the policy gradient numerically, one must first derive an analytical expression for the policy gradient in terms of the expected value and then sample trajectories through agent-environment interaction steps. The policy gradient reads:

\begin{eqnarray}
\nabla_{\theta} J (\pi_{\theta}) &=& \nabla_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}}[\mathcal{R}(\tau)] \quad \text{(Eq. 1.1)} \\
&=& \nabla_{\theta} \int_{\tau} \mathbb{P}(\tau | \theta) \, \mathcal{R}(\tau) \quad \text{(Eq. 1.2)} \\
&=& \int_{\tau} \nabla_{\theta} \mathbb{P}(\tau | \theta) \, \mathcal{R}(\tau) \quad \text{(Eq. 1.3)} \\
&=& \int_{\tau} \mathbb{P}(\tau | \theta) \Bigg[ \nabla_{\theta} \log \mathbb{P}(\tau | \theta) \, \mathcal{R}(\tau) \Bigg] \quad \text{(Eq. 1.4)} \\
&=& \mathbb{E}_{\tau \sim \pi_{\theta}} \Bigg[ \nabla_{\theta} \log \mathbb{P}(\tau | \theta) \, \mathcal{R}(\tau) \Bigg] \quad \text{(Eq. 1.5)} \\
&=& \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^T \nabla_{\theta} \log \pi_{\theta}(a_t| s_t) \, \mathcal{R}(\tau) \right] \quad \text{(Eq. 1.6)} \\
&\approx& \frac{1}{|\mathcal{D}_k|}\sum_{\tau \in \mathcal{D}_k}\sum_{t=0}^T \nabla_{\theta} \log \pi_{\theta}(a_t| s_t)|_{\theta_k} \, \mathcal{R}(\tau) \quad \text{(Eq. 1.7)}.
\end{eqnarray}

$\quad$

- Eq. 1.2 uses the definition of the expected return, noting that $\mathbb{E} [\cdot] = \int \mathbb{P}(\cdot) [\cdot]$.

<div style="background-color: green; padding: 10px; border-radius: 5px;">

**Expected value (a.k.a population mean or weighted average)  of a distribution:**

- Expected value for **discrete variables**:
    $$\mu \doteq  \mathbb{E}[X]  \doteq \langle X\rangle = \sum_{j=1}^{\dim \Omega = d} x_jP(x_j).$$
    The expected value is not the most likely value of $X$ and may not even be a possible value of $X$, but it is bound by
    $$X_{\min} \leq\langle X\rangle\leq X_{\max}.$$

- Expected value for **continuous variables**:
    $$\langle X \rangle = \int_{-\infty}^{+\infty} x \rho(x) dx.$$
  
</div>
  
- Eq. 1.3 uses the [Leibniz integral rule](https://en.wikipedia.org/wiki/Leibniz_integral_rule) to bring the gradient symbol under the integral sign. This is possible because the integration domain (over $\tau$) does not depend on $\theta$. 
  
- Eq. 1.4 uses the log-derivative trick:

  \begin{eqnarray}
  \frac{d}{dx}\log(f(x)) &=& \frac{1}{f(x)}\frac{d}{dx}f(x) \\
  &\rightarrow& \nabla_{\theta} \log \mathbb{P}(\tau | \theta) = \frac{1}{\mathbb{P}(\tau | \theta)} \nabla_{\theta} \mathbb{P}(\tau | \theta) \\
  &\rightarrow& \nabla_{\theta} \mathbb{P}(\tau | \theta) = \mathbb{P}(\tau | \theta) \nabla_{\theta} \log \mathbb{P}(\tau | \theta).
  \end{eqnarray}

- Eq. 1.5 is the expectation form of Eq. 1.4 using $\mathbb{E} [\cdot] = \int \mathbb{P}(\cdot) [\cdot]$.

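- Eq. 1.6, known as the `policy gradient theorem`, is the expected value of the **gradient of the log-probability of a trajectory**, weighted by the return $\mathcal{R}$, over all trajectories sampled i.i.d. from the policy. The gradient of the trajectory log-probability reduces to a sum over the policy's log-probabilities, because the start-state distribution and the transition probabilities do not depend on $\theta$.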

<div style="background-color: green; padding: 10px; border-radius: 5px;">

**Derivation:**

  \begin{eqnarray}
   \nabla_{\theta} \log \mathbb{P}(\tau | \theta) &=& \nabla_{\theta} \log \Bigg( \rho_0(s_0) \prod_{t=0}^{T} \mathbb{P}(s_{t+1}|s_t,a_t) \cdot \pi_{\theta}(a_t | s_t) \Bigg) \\
   &=& \nabla_{\theta} \Bigg[\log \rho_0(s_0) + \log \left( \prod_{t=0}^{T} \mathbb{P}(s_{t+1}|s_t,a_t) \cdot \pi_{\theta}(a_t | s_t) \right) \Bigg]\\
   &=& \nabla_{\theta} \Bigg[\log \rho_0(s_0) + \sum_{t=0}^{T} \log \Big( \mathbb{P}(s_{t+1}|s_t,a_t) \cdot \pi_{\theta}(a_t | s_t) \Big) \Bigg]\\
   &=& \nabla_{\theta} \Bigg[\log \rho_0(s_0) + \sum_{t=0}^{T} \Big( \log \mathbb{P}(s_{t+1}|s_t,a_t) + \log \pi_{\theta}(a_t | s_t) \Big) \Bigg]\\
   &=& \cancel{\nabla_{\theta} \log \rho_0(s_0)} + \sum_{t=0}^{T} \Big( \cancel{\nabla_{\theta} \log \mathbb{P}(s_{t+1}|s_t,a_t)} + \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) \Big) \\
   &=& \sum_{t=0}^T\nabla_{\theta}\log \pi_{\theta}(a_t| s_t).
  \end{eqnarray}
</div>

- Eq. 1.7 computes the empirical average (approximation) of Eq. 1.6. It is an unbiased estimator (sample mean) of the expectation $\mathbb{E}_{\tau \sim \pi_{\theta_k}} [\cdot]$ since trajectories are sampled i.i.d from the policy. As the number of trajectories $|\mathcal{D}_k| \rightarrow \infty$, Eq. 1.7 converges to Eq. 1.6 due to the Law of Large Numbers.
 
Legend:

- $\mathcal{D}_k$ denotes the set with a number $|\mathcal{D}_k|$ of trajectories sampled i.i.d from the Policy $\pi_k$ in the $k$-th iteration.
- i.i.d means independent and identically distributed.
    - Independent: the outcome of one random variable does not affect the outcomes of the others.
    - Identically Distributed: all random variables in the collection follow the same probability distribution (e.g., same mean and variance). 
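
Eq. 1.6 and Eq. 1.7 can be checked numerically in the simplest possible setting: a one-step problem ($T = 0$) in which a trajectory is a single action and the return is that action's reward. The sketch below assumes a softmax policy over three actions with made-up logits and rewards, and compares the sample estimate with the exact gradient of $J(\theta) = \sum_a \pi_{\theta}(a) r(a)$. All names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1, 0.4])   # logits of a softmax policy (assumed)
r = np.array([1.0, 0.0, 2.0])        # deterministic reward of each action (assumed)


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


pi = softmax(theta)

# Exact gradient of J(theta) = sum_a pi(a) r(a): dJ/dtheta_k = pi_k * (r_k - J).
J = pi @ r
exact_grad = pi * (r - J)

# Sample estimate in the spirit of Eq. 1.7: average of grad log pi(a) * R(tau).
# For a softmax policy, grad_theta log pi(a) = onehot(a) - pi.
n = 200_000
actions = rng.choice(len(pi), size=n, p=pi)
score = np.eye(len(pi))[actions] - pi            # shape (n, 3)
estimate = (score * r[actions, None]).mean(axis=0)
```

With this many samples, `estimate` should agree with `exact_grad` up to Monte Carlo noise, and the gap shrinks as `n` grows.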

# Reducing Variance

One way to reduce the variance of the policy gradient estimates is to subtract a baseline $b$ (for example, the average return) from the return in the policy gradient formula.
 
For any choice of baseline $b(s_t)$ that depends only on the state $s_t$, the gradient estimator is unbiased, i.e., introducing a baseline into the policy gradient formula does not change the expected value of the policy gradient sample estimate. This is a consequence of the Expected Grad-Log-Prob (EGLP) lemma:

$$ \mathbb{E}_{x \sim P_{\theta}}[\nabla_{\theta} \log P_{\theta} (x)] = 0,$$

such that

$$ \mathbb{E}_{a_t \sim \pi_{\theta}}[\nabla_{\theta} \log \pi_{\theta}(a_t|s_t) \, b(s_t)] = 0,$$

and 

$$ \nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}} \Bigg[ \sum_{t=0}^T \nabla_{\theta} \log \pi_{\theta}(a_t|s_t) \bigg( \sum_{t'=t}^T \mathcal{R}(s_{t'}, a_{t'}, s_{t'+1}) - b(s_t) \bigg) \Bigg].$$

A near-optimal choice for the baseline in VPG is the `on-policy State Value` function $b (s_t) = V^{\pi_{\theta}}(s_t)$. The difference between the `on-policy Action-Value` function $Q(\cdot)$ and the baseline is the `Advantage function`:

$$A^{\pi_{\theta}}(s_t, a_t) = Q^{\pi_{\theta}}(s_t, a_t) - V^{\pi_{\theta}}(s_t) \in {\rm I\!R}.$$

The baseline $V^{\pi_{\theta}}(s_t)$ is often approximated by a neural network $V_{\phi}(s_t)$ since it cannot be computed exactly. 
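
The effect of a baseline can be verified exactly on a small example. The sketch below assumes a single-state problem with a softmax policy over three actions and made-up rewards, so that $b(s_t)$ is a constant and the value of the state is the expected reward. It computes the mean and the total variance of the per-sample gradient in closed form, once without a baseline and once with the baseline set to that value. All names and numbers are hypothetical.

```python
import numpy as np

theta = np.array([0.2, -0.1, 0.4])   # logits of a softmax policy (assumed)
r = np.array([10.0, 11.0, 12.0])     # reward of each action (assumed)

z = np.exp(theta - theta.max())
pi = z / z.sum()
score = np.eye(3) - pi               # row a holds grad_theta log pi(a) = onehot(a) - pi

# EGLP lemma: E_{a ~ pi}[grad_theta log pi(a)] = 0.
eglp = pi @ score


def grad_mean_and_variance(b):
    """Exact mean and total variance of the sample grad log pi(a) * (r(a) - b)."""
    samples = score * (r - b)[:, None]               # one gradient sample per action
    mean = pi @ samples
    second_moment = pi @ (samples ** 2).sum(axis=1)
    return mean, second_moment - mean @ mean


mean_no_b, var_no_b = grad_mean_and_variance(0.0)    # no baseline
mean_b, var_b = grad_mean_and_variance(pi @ r)       # baseline = value of the state
```

Both means coincide, which is the unbiasedness statement, while `var_b` is far smaller than `var_no_b` because the rewards share a large common offset that the baseline removes.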

Discounting reduces variance even further, because it down-weights rewards that arrive far in the future, at the cost of introducing some bias. Without it, if the time horizon is too long, i.e., if there are too many time steps in one episode, VPG will likely not work well. Therefore, the discount factor should be thought of as a variance reduction parameter.

Recall the definitions of the value functions as the expected cumulative future reward, a.k.a. expected (average) return, that an agent obtains when starting from state $s_0 = s$ and then acting according to a parameterized policy $\pi_{\theta}$ (**stochastic** in VPG):

- Parameterized On-Policy Action-Value Function:

$$Q^{\pi_{\theta}}(s, a): S \times A \rightarrow {\rm I\!R}.$$
$$Q^{\pi_{\theta}}(s, a) = \mathbb{E}_{\tau \sim \pi_{\theta}}[\mathcal{R}(\tau) | s_0 = s, a_0 = a] \approx \frac{1}{N}\sum_{n=1}^N \mathcal{R}_t^n.$$

- Parameterized On-Policy State-Value Function:

$$V^{\pi}(s): S \rightarrow {\rm I\!R}.$$
$$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}[\mathcal{R}(\tau) | s_0 = s].$$
$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi}[Q^{\pi}(s, a)].$$
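
The relation between the two value functions and the Advantage function can be sketched with small tables. The values in `Q` and `pi` below are hypothetical (two states, three actions) and serve only to illustrate the identities.

```python
import numpy as np

# Hypothetical tabular quantities for 2 states and 3 actions.
Q = np.array([[1.0, 2.0, 4.0],
              [0.0, -1.0, 3.0]])        # Q[s, a]
pi = np.array([[0.5, 0.25, 0.25],
               [0.2, 0.3, 0.5]])        # pi[s, a] = pi(a | s), rows sum to 1

V = (pi * Q).sum(axis=1)                # V(s) = E_{a ~ pi}[Q(s, a)]
A = Q - V[:, None]                      # A(s, a) = Q(s, a) - V(s)
mean_advantage = (pi * A).sum(axis=1)   # E_{a ~ pi}[A(s, a)] for each state
```

The expected advantage under the policy is zero in every state, so the advantage measures how much better or worse an action is than the policy's average behavior in that state.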

# Algorithm

The following algorithm uses the advantage-based expression for the policy gradient, instead of the finite-horizon undiscounted return.

---
**Algorithm (Pseudocode): Vanilla Policy Gradient (adapted from OpenAI Spinning Up)**

---

- Input: initialize policy parameters $\theta_0$, and baseline parameters $\phi_0$ of the state value function $V_{\phi}$.

- **for** iteration $k = 0, 1, 2, \dots$ do:
    - Collect a set of trajectories $\mathcal{D}_k\doteq\{\tau_i\}$, with $\tau_i = (s_0, a_0, r_0, \cdots , s_T, a_{T}, r_T)$, by executing the current policy $\pi_k = \pi(\theta_k)$ in the environment for $T$ time steps.
    - **for** each trajectory, compute: 
        - the reward-to-go $\hat{\mathcal{R}}_t = \sum_{t'=t}^T \mathcal{R}(s_{t'}, a_{t'}, s_{t'+1})$, and
        - the advantage estimate $\hat{A}_t$ (using any advantage estimation method) based on the current on-policy state value function $V_{\phi_k}$ as the baseline to reduce sample variance in the gradient estimate: $$\hat{A}_t \approx Q^{\pi_{\theta_k}}(s_t, a_t) - V_{\phi_k}(s_t) \in {\rm I\!R}.$$ The simplest choice is $\hat{A}_t = \hat{\mathcal{R}}_t - V_{\phi_k}(s_t)$, where the reward-to-go serves as a sample estimate of $Q^{\pi_{\theta_k}}(s_t, a_t)$.
    - **end for**.
    - Estimate the Policy gradient as: $$\nabla_{\theta} J (\pi_{\theta_k}) \approx \hat{g}_k = \frac{1}{|\mathcal{D}_k|}\sum_{\tau \in \mathcal{D}_k}\sum_{t=0}^T\nabla_{\theta}\log\pi_{\theta}(a_t| s_t)|_{\theta_k}\hat{A}_t.$$
    - Update the Policy either using standard gradient ascent or via another `gradient ascent` algorithm (such as Adam): $$\theta_{k+1}= \theta_k + \alpha_k \hat{g}_k,$$  where $\alpha_k$ is the learning rate in the $k$-th iteration.
    - Re-fit (learn) the baseline (state value function) by regression on mean-squared error, via some `gradient descent` algorithm, by minimizing $(V_{\phi}(s_t)-\hat{\mathcal{R}}_t)^2$ averaged over all trajectories and time steps: $$\phi_{k+1} = \underset{\phi}{\operatorname{arg\,min}} \ \frac{1}{|\mathcal{D}_k|T}\sum_{\tau \in \mathcal{D}_k} \sum_{t=0}^T \left(V_{\phi}(s_t) - \hat{\mathcal{R}}_t \right)^2 .$$
- **end for.**

---
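
The pseudocode above can be sketched end to end on a toy problem. The example below is a minimal NumPy sketch under several assumptions that are not part of the original algorithm description: a hypothetical four-state corridor environment with a reward of -1 per step, a tabular softmax policy instead of a neural network, a tabular baseline, plain gradient ascent instead of Adam, and $\hat{A}_t = \hat{\mathcal{R}}_t - V_{\phi_k}(s_t)$ as the advantage estimate. All names and hyperparameters are illustrative.

```python
import numpy as np

N_STATES, GOAL, HORIZON = 4, 3, 30   # corridor 0..3; an episode ends at state 3
rng = np.random.default_rng(0)


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def sample_trajectory(theta):
    """Roll out pi_theta from state 0; action 0 moves left, action 1 moves right."""
    s, states, actions, rewards = 0, [], [], []
    for _ in range(HORIZON):
        a = rng.choice(2, p=softmax(theta[s]))
        states.append(s)
        actions.append(a)
        rewards.append(-1.0)                      # every step costs -1
        s = min(max(s + (1 if a == 1 else -1), 0), GOAL)
        if s == GOAL:
            break
    return np.array(states), np.array(actions), np.array(rewards)


def train(iterations=200, batch_size=32, alpha=0.1, value_lr=0.5, value_steps=20):
    theta = np.zeros((N_STATES, 2))   # policy logits theta[s, a]
    V = np.zeros(N_STATES)            # tabular baseline V_phi(s)
    for _ in range(iterations):
        grad = np.zeros_like(theta)
        all_states, all_rtg = [], []
        for _ in range(batch_size):
            states, actions, rewards = sample_trajectory(theta)
            rtg = np.cumsum(rewards[::-1])[::-1]  # reward-to-go
            adv = rtg - V[states]                 # advantage estimate
            for s, a, adv_t in zip(states, actions, adv):
                score = -softmax(theta[s])        # grad of log pi(a|s) w.r.t. theta[s]
                score[a] += 1.0
                grad[s] += score * adv_t
            all_states.append(states)
            all_rtg.append(rtg)
        theta += alpha * grad / batch_size        # gradient ascent on the policy
        S, R = np.concatenate(all_states), np.concatenate(all_rtg)
        for _ in range(value_steps):              # re-fit the baseline by MSE regression
            v_grad = np.zeros_like(V)
            np.add.at(v_grad, S, 2.0 * (V[S] - R) / len(S))
            V -= value_lr * v_grad
    return theta, V


theta, V = train()
p_right = np.array([softmax(theta[s])[1] for s in range(GOAL)])
```

After training, `p_right` holds the probability of moving right in each non-terminal state. Since moving right is the shortest way to the goal, these probabilities should end up close to 1, and the learned baseline should be higher for states nearer the goal.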

# Implementation

# References

[1] [Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton et al. 2000.](https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf)

[2] [Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs, Schulman 2016(a)](http://joschu.net/docs/thesis.pdf)

[3] [Benchmarking Deep Reinforcement Learning for Continuous Control, Duan et al. 2016](https://arxiv.org/abs/1604.06778)

[4] [High Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al. 2016(b)](https://arxiv.org/abs/1506.02438)

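[5] [Part 3: Intro to Policy Optimization, OpenAI Spinning Up](https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html)

[6] [Vanilla Policy Gradient, OpenAI Spinning Up](https://spinningup.openai.com/en/latest/algorithms/vpg.html)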