<div align='center'>
    <h1> 
        <a href='#'> Vanilla Policy Gradient (VPG) </a> 
    </h1>
</div>

# Intro

Vanilla Policy Gradient (VPG), a.k.a. REINFORCE, is a Rosetta Stone for policy gradient methods and for more advanced `on-policy` algorithms such as TRPO and PPO. VPG learns the policy function directly, whereas off-policy algorithms such as DDPG and Q-learning, which tend to be less stable, rely on the Bellman optimality equation. VPG optimizes a `stochastic policy` and is suitable for both `continuous and discrete action spaces`.

## Deriving the Policy Gradient

The goal of reinforcement learning is to maximize the expected cumulative reward (a.k.a expected return) under policy $\pi_{\theta}$:

$$\max_{\theta} J(\pi_{\theta}) = \max_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}}[\mathcal{R}(\tau) | \pi_{\theta_t}].$$

In policy gradient algorithms, this is achieved by directly optimizing the parameters $\theta$ of the parameterized policy $\pi_{\theta}$ via `gradient ascent` on the performance objective $J(\pi_{\theta})$, where $\theta_t$ denotes the parameters at iteration $t$:


\begin{eqnarray}
\theta_{t+1} &=& \theta_t + \alpha \nabla_{\theta} J(\pi_{\theta})|_{\theta_t}\\
&=& \theta_t + \alpha \nabla_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}}[\mathcal{R}(\tau) | \pi_{\theta_t}].
\end{eqnarray}


Where:

-  $\pi_{\theta}$ is the parameterized Policy function represented by a neural network as an expressive nonlinear function approximation.

- $\alpha$ is the learning rate.
  
- $\nabla_{\theta} J(\pi_{\theta})$ denotes the gradient of `policy performance` a.k.a `policy gradient`.

- $\mathcal{R}(\tau) \doteq \sum_{t=0}^T r_t \in {\rm I\!R} \text{ (Finite-horizon undiscounted return)}$ is the sum of rewards over a fixed window of time steps.

- $\mathcal{R}(\tau) \doteq \sum_{t=0}^{\infty} \gamma^t r_{t} \in {\rm I\!R} \text{ (Infinite-horizon discounted return)}$ is the sum of all rewards ever obtained, each discounted by how far in the future it is received.

- $\gamma \in [0,1]$ is the discount factor; with bounded rewards, the infinite-horizon sum is guaranteed to be finite only for $\gamma < 1$.
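
The two return definitions above can be illustrated with a short sketch. The function names and the reward sequence are hypothetical, chosen only for this example, and the discounted sum is truncated at the length of the recorded episode.

```python
def finite_horizon_return(rewards):
    """Undiscounted return: the plain sum of the rewards in one episode."""
    return sum(rewards)


def discounted_return(rewards, gamma):
    """Discounted return: sum over t of gamma**t * r_t for the recorded rewards."""
    total = 0.0
    # Accumulate backwards using the recursion G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]                      # made-up rewards r_0, r_1, r_2
undiscounted = finite_horizon_return(rewards)  # 1 + 1 + 1
discounted = discounted_return(rewards, 0.5)   # 1 + 0.5 + 0.25
```

With $\gamma = 0.5$ the later rewards contribute less, so the discounted return (1.75) is smaller than the undiscounted one (3.0).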

Considering the infinite-horizon discounted return function, the performance objective $J(\pi_{\theta})$ is defined as the expected cumulative reward under a parameterized policy $\pi_{\theta}$, which is mathematically represented as:

$$J(\pi_{\theta}) =  \mathbb{E}_{\tau \sim \pi_{\theta}}[\mathcal{R}(\tau)] = \int_{\tau} \underbrace{\mathbb{P}(\tau | \pi_{\theta})}_{\text{Trajectory prob.}}  \underbrace{\mathcal{R}(\tau)}_{\text{Return}},$$

with the probability of trajectory $\tau$ under the policy $\pi_{\theta}$ given by

$$\mathbb{P}(\tau | \pi_{\theta}) = \underbrace{\rho_0(s_0)}_{\text{start-state dist.}} \prod_{t=0}^{T} \underbrace{\underbrace{\mathbb{P}(s_{t+1}|s_t,a_t)}_{\text{State transition prob.}}}_{\text{Env. model.}} \cdot \underbrace{\underbrace{\pi_{\theta}(a_t | s_t)}_{\text{Action prob.}}}_{\text{Control function.}}.$$


Legend:

- $\tau = (s_0, a_0, s_1, a_1, \dots)$ represents a trajectory (a sequence of states and actions).

-  $\pi_{\theta}$ is the parameterized Policy function represented by a neural network as an expressive nonlinear function approximation.

- $\mathbb{P}(\tau | \pi_{\theta})$ is the probability of getting a trajectory $\tau$ with $T$ time steps acting according to policy $\pi_{\theta}$.

- $\mathbb{P}(s_{t+1}|s_t,a_t)$ denotes the state transition probability of ending up in the next state $s_{t+1}$ given the current state $s_t$ and action $a_t$.

- $\rho_0$ denotes the initial (start)-state probability distribution. The initial state $s_0$ is sampled from the probability distribution: $s_0 \sim \rho_0(\cdot).$

- $\mathcal{R}(\tau) = \sum_{t=0}^{T} \gamma^t r_t$ is the cumulative (discounted) reward for the trajectory over an episode.

- $\gamma \in [0, 1]$ is the discount factor.
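
To make the factorization of the trajectory probability concrete, here is a minimal sketch for a tabular setting. The two-state, two-action MDP, its probabilities, and the names `rho0`, `transition`, `policy`, and `trajectory_log_prob` are all hypothetical and exist only for illustration.

```python
import math

# Hypothetical two-state, two-action MDP; all numbers are made up.
rho0 = {0: 0.8, 1: 0.2}                      # start-state distribution rho_0(s)
transition = {                               # P(s' | s, a)
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {0: 0.1, 1: 0.9},
}
policy = {0: {0: 0.7, 1: 0.3},               # pi(a | s)
          1: {0: 0.4, 1: 0.6}}


def trajectory_log_prob(states, actions):
    """log P(tau | pi) for states s_0..s_{T+1} and actions a_0..a_T."""
    assert len(states) == len(actions) + 1
    log_p = math.log(rho0[states[0]])
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        log_p += math.log(transition[(s, a)][s_next])  # environment term
        log_p += math.log(policy[s][a])                # policy term
    return log_p


log_p = trajectory_log_prob(states=[0, 1, 1], actions=[1, 0])
prob = math.exp(log_p)   # 0.8 * (0.8 * 0.3) * (0.5 * 0.4) = 0.0384
```

Working in log space turns the product into a sum, and only the policy terms would change if the policy parameters changed.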

To compute the policy gradient numerically, one must first derive an analytical expression for the policy gradient in terms of the expected value and then sample trajectories through agent-environment interaction steps. The policy gradient reads:

\begin{eqnarray}
\nabla_{\theta} J (\pi_{\theta}) &=& \nabla_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}}[\mathcal{R}(\tau)] \quad \text{(Eq. 1.1)} \\
&=& \nabla_{\theta} \int_{\tau} \mathbb{P}(\tau | \pi_{\theta}) [\mathcal{R}(\tau)] \quad \text{(Eq. 1.2)} \\
&=& \int_{\tau} \nabla_{\theta} \mathbb{P}(\tau | \pi_{\theta})  [\mathcal{R}(\tau)] \quad \text{(Eq. 1.3)} \\
&=& \int_{\tau}  \mathbb{P}(\tau | \pi_{\theta}) \Bigg[ \nabla_{\theta} \log(\mathbb{P}(\tau | \pi_{\theta}))  \mathcal{R}(\tau) \Bigg] \quad \text{(Eq. 1.4)} \\
&=& \mathbb{E}_{\tau \sim \pi_{\theta}} \Bigg[ \nabla_{\theta} \log(\mathbb{P}(\tau | \pi_{\theta}))  \mathcal{R}(\tau) \Bigg] \quad \text{(Eq. 1.5)} \\
&=& \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^T\nabla_{\theta}\log(\pi_{\theta}(a_t| s_t))\mathcal{R}(\tau) \right] \quad \text{(Eq. 1.6)}\\
&\approx& \frac{1}{|\mathcal{D}_k|}\sum_{\tau \in \mathcal{D}_k}\sum_{t=0}^T\nabla_{\theta}\log(\pi_{\theta}(a_t| s_t))|_{\theta_k}\mathcal{R}(\tau) \quad \text{(Eq. 1.7)}.
\end{eqnarray}

$\quad$

- Eq. 1.2 uses the definition of the expected return, noting that $\mathbb{E} [\cdot] = \int \mathbb{P}(\cdot) [\cdot]$.

<div style="background-color: yellow; padding: 10px; border-radius: 5px;">

**Expected value (a.k.a population mean or weighted average)  of a distribution:**

- Expected value for **discrete variables**:
    $$\mu \doteq  \mathbb{E}[X]  \doteq \langle X\rangle = \sum_{j=1}^{\dim \Omega = d} x_jP(x_j).$$
    The expected value is not necessarily the most likely value of $X$ and may not even be a possible value of $X$, but it is bounded by
    $$X_{\min} \leq\langle X\rangle\leq X_{\max}.$$

- Expected value for **continuous variables**:
    $$\langle X \rangle = \int_{-\infty}^{+\infty} x \rho(x) dx.$$
  
</div>
  
- Eq. 1.3 uses the [Leibniz integral rule](https://en.wikipedia.org/wiki/Leibniz_integral_rule) to bring the gradient symbol under the integral sign. This is possible because the integration domain (over $\tau$) does not depend on $\theta$. 
  
- Eq. 1.4 uses the log-derivative trick:

  \begin{eqnarray}
  \frac{d}{dx}\ln(f(x)) &=& \frac{1}{f(x)}\frac{d}{dx}f(x) \\
                         &\rightarrow&  \nabla_{\theta} \log(\mathbb{P}(\tau | \pi_{\theta})) = \frac{1}{\mathbb{P}(\tau | \pi_{\theta})}  \nabla_{\theta} \mathbb{P}(\tau | \pi_{\theta})\\
                       &\rightarrow&  \nabla_{\theta} \mathbb{P}(\tau | \pi_{\theta}) =  \mathbb{P}(\tau | \pi_{\theta}) \nabla_{\theta} \log(\mathbb{P}(\tau | \pi_{\theta})).
  \end{eqnarray}

- Eq. 1.5 is the expectation form of Eq. 1.4 using $\mathbb{E} [\cdot] = \int \mathbb{P}(\cdot) [\cdot]$.

- Eq. 1.6 computes the expected value of the **gradient of the log-probability of a trajectory**, weighted by the return $\mathcal{R}$, over all trajectories sampled i.i.d from the policy.

<div style="background-color: yellow; padding: 10px; border-radius: 5px;">

**Derivation:**

  \begin{eqnarray}
   \nabla_{\theta} \log(\mathbb{P}(\tau | \pi_{\theta})) &=& \nabla_{\theta} \log \Bigg( \rho_0(s_0) \prod_{t=0}^{T} \mathbb{P}(s_{t+1}|s_t,a_t) \cdot \pi_{\theta}(a_t | s_t) \Bigg) \\
   &=& \nabla_{\theta} \Bigg[\log (\rho_0(s_0)) + \log \left( \prod_{t=0}^{T} \mathbb{P}(s_{t+1}|s_t,a_t) \cdot \pi_{\theta}(a_t | s_t) \right) \Bigg]\\
   &=& \nabla_{\theta} \Bigg[\log (\rho_0(s_0)) + \sum_{t=0}^{T} \log \Big( \mathbb{P}(s_{t+1}|s_t,a_t) \cdot \pi_{\theta}(a_t | s_t) \Big) \Bigg]\\
   &=& \nabla_{\theta} \Bigg[\log (\rho_0(s_0)) + \sum_{t=0}^{T} \Big( \log \big(\mathbb{P}(s_{t+1}|s_t,a_t)\big) + \log \big(\pi_{\theta}(a_t | s_t)\big) \Big) \Bigg]\\
   &=& \Bigg[\cancel{\nabla_{\theta} \log (\rho_0(s_0))} + \sum_{t=0}^{T} \Big( \cancel{\nabla_{\theta} \log \big(\mathbb{P}(s_{t+1}|s_t,a_t)\big)} + \nabla_{\theta} \log \big(\pi_{\theta}(a_t | s_t)\big) \Big) \Bigg]\\
   &=& \sum_{t=0}^T\nabla_{\theta}\log(\pi_{\theta}(a_t| s_t)).
  \end{eqnarray}

The cancelled terms vanish because neither the start-state distribution $\rho_0$ nor the environment dynamics $\mathbb{P}(s_{t+1}|s_t,a_t)$ depends on $\theta$.
</div>

- Eq. 1.7 computes the empirical average (approximation) of Eq. 1.6. It is an unbiased estimator (sample mean) of the expectation $\mathbb{E}_{\tau \sim \pi_{\theta_k}} [\cdot]$ since trajectories are sampled i.i.d from the policy. As the number of trajectories $|\mathcal{D}_k| \rightarrow \infty$, Eq. 1.7 converges to Eq. 1.6 due to the Law of Large Numbers.
 
Legend:

- $\mathcal{D}_k$ denotes the set with a number $|\mathcal{D}_k|$ of trajectories sampled i.i.d from the Policy $\pi_k$ in the $k$-th iteration.
- i.i.d means independent and identically distributed.
    - Independent: the outcome of one random variable does not affect the outcomes of the others.
    - Identically Distributed: all random variables in the collection follow the same probability distribution (e.g., same mean and variance). 
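
As a sanity check of Eq. 1.6 and Eq. 1.7, the sketch below compares the sample-mean estimator with the exact gradient on a toy problem. It assumes a single-step episode ($T = 0$) with three actions, a softmax policy whose logits are the parameters $\theta$, and made-up deterministic rewards; all function names are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def exact_gradient(theta, rewards):
    """Analytic gradient of J(theta) = sum_a pi(a) * r_a for a softmax policy."""
    pi = softmax(theta)
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - expected_reward) for p, r in zip(pi, rewards)]


def sampled_gradient(theta, rewards, num_samples, rng):
    """Sample mean of grad log pi(a) * R over i.i.d. sampled actions (Eq. 1.7)."""
    pi = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for _ in range(num_samples):
        a = rng.choices(range(n), weights=pi)[0]
        for j in range(n):
            score = (1.0 if j == a else 0.0) - pi[j]  # d/dtheta_j log pi(a)
            grad[j] += score * rewards[a]
    return [g / num_samples for g in grad]


rng = random.Random(0)
theta = [0.0, 0.0, 0.0]        # uniform initial policy
rewards = [1.0, 2.0, 3.0]      # made-up reward for each action
g_exact = exact_gradient(theta, rewards)   # [-1/3, 0, 1/3]
g_mc = sampled_gradient(theta, rewards, num_samples=20000, rng=rng)
```

With more samples, `g_mc` should approach `g_exact`, which is the Law of Large Numbers statement above in miniature.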

## Reducing Variance

Recall the definitions of the value functions: 

- On-Policy Value Function:

$$V^{\pi_{\theta}}(s_t) = \mathbb{E}_{\tau \sim \pi_{\theta}}[\mathcal{R}(\tau) | s_0 = s_t].$$
$$V^{\pi_{\theta}}(s_t) = \mathbb{E}_{a_t \sim \pi_{\theta}(\cdot | s_t)}[Q^{\pi_{\theta}}(s_t, a_t)].$$

- On-Policy Action-Value Function:

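$$Q^{\pi_{\theta}}: S \times A \rightarrow {\rm I\!R}.$$
$$Q^{\pi_{\theta}}(s_t, a_t) = \mathbb{E}_{\tau \sim \pi_{\theta}}[\mathcal{R}(\tau) | s_0 = s_t, a_0 = a_t] \approx \frac{1}{N}\sum_{n=1}^N \mathcal{R}_t^n.$$

The last expression is a Monte Carlo estimate: the sample mean over $N$ sampled returns $\mathcal{R}_t^n$.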

One way to reduce variance is to subtract a baseline from the return in the VPG estimator. A near-optimal choice of baseline is the expected return given by the state-value function $V^{\pi_{\theta}}(s_t)$. The gradient estimator remains unbiased for any baseline that depends only on the state and not on the action. The difference between the action-value function and this baseline is the advantage function:

$$A^{\pi_{\theta}}(s_t, a_t) = Q^{\pi_{\theta}}(s_t, a_t) - V^{\pi_{\theta}}(s_t) \in {\rm I\!R}.$$
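
A state-dependent baseline leaves the expected gradient unchanged because $\mathbb{E}_{a \sim \pi_{\theta}}[\nabla_{\theta} \log \pi_{\theta}(a|s)] = \nabla_{\theta} \sum_a \pi_{\theta}(a|s) = 0$. The sketch below checks this by exact enumeration on a toy single-step problem and compares the variance of the one-sample estimator with and without the baseline. The softmax policy, the rewards, and all names are assumptions made for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def estimator_mean_and_variance(theta, rewards, baseline):
    """Exact mean and total variance of g(a) = grad log pi(a) * (r_a - baseline),
    computed by enumerating every action instead of sampling."""
    pi = softmax(theta)
    n = len(theta)
    mean = [0.0] * n
    second_moment = [0.0] * n
    for a in range(n):
        for j in range(n):
            score = (1.0 if j == a else 0.0) - pi[j]  # d/dtheta_j log pi(a)
            g = score * (rewards[a] - baseline)
            mean[j] += pi[a] * g
            second_moment[j] += pi[a] * g * g
    variance = sum(m2 - m * m for m2, m in zip(second_moment, mean))
    return mean, variance


theta = [0.0, 0.0, 0.0]      # uniform policy over three actions
rewards = [1.0, 2.0, 3.0]    # made-up deterministic rewards
value = sum(p * r for p, r in zip(softmax(theta), rewards))  # expected reward, 2.0

mean_plain, var_plain = estimator_mean_and_variance(theta, rewards, baseline=0.0)
mean_base, var_base = estimator_mean_and_variance(theta, rewards, baseline=value)
```

In this example the mean is identical in both cases, while the total variance drops from about 2.89 to about 0.22 when the expected reward is used as the baseline.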

Variance can be reduced further by introducing a discount factor. If the time horizon is too long, i.e., if there are too many time steps in one episode, the return sums many noisy reward terms and VPG will likely not work well; discounting down-weights rewards that lie far in the future. The discount factor should therefore be thought of as a variance-reduction parameter.
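
The following sketch shows how a discount factor shrinks the contribution of distant rewards in the return accumulated from each time step onward. The function name and the reward sequence are hypothetical.

```python
def discounted_rewards_to_go(rewards, gamma):
    """For each t, return sum over t' >= t of gamma**(t' - t) * r_{t'}."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


rewards = [1.0, 1.0, 1.0, 1.0]                            # made-up rewards
rtg_discounted = discounted_rewards_to_go(rewards, 0.5)   # [1.875, 1.75, 1.5, 1.0]
rtg_plain = discounted_rewards_to_go(rewards, 1.0)        # [4.0, 3.0, 2.0, 1.0]
```

With $\gamma = 0.5$ a reward $k$ steps ahead is weighted by $0.5^k$, so far-future rewards (and their noise) barely affect the early entries.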

# Algorithm

---
**Algorithm (Pseudocode): Vanilla Policy Gradient (adapted from OpenAI Spinning Up [2])**

---

- Input: initialize policy parameters $\theta_0$, and baseline parameters $\phi_0$ of the state value function $V_{\phi}$.

- for $k= 0, 1, 2, \dots$ do:
    - Collect a set of trajectories $\mathcal{D}_k \doteq \{\tau_i\}$, where each trajectory is $\tau_i = (s_0, a_0, r_0, \cdots , s_T, a_T, r_T)$, by executing the current policy $\pi_k = \pi_{\theta_k}$ in the environment.
    - For each trajectory, compute: 
        - The reward-to-go $\hat{\mathcal{R}}_t = \sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1})$;
        - The advantage estimates $\hat{A}_0, \dots, \hat{A}_T$ (using any advantage estimation method) based on the current value function $V_{\phi_k}$ used as the baseline. They approximate $$A^{\pi_{\theta_k}}(s_t, a_t) = Q^{\pi_{\theta_k}}(s_t, a_t) - V^{\pi_{\theta_k}}(s_t) \in {\rm I\!R};$$ a simple choice is $\hat{A}_t = \hat{\mathcal{R}}_t - V_{\phi_k}(s_t)$.
    - Estimate the policy gradient as:
$$\nabla_{\theta} J (\pi_{\theta})|_{\theta_k} \approx \hat{g}_k = \frac{1}{|\mathcal{D}_k|}\sum_{\tau \in \mathcal{D}_k}\sum_{t=0}^T\nabla_{\theta}\log\pi_{\theta}(a_t| s_t)|_{\theta_k}\hat{A}_t.$$
    - Update the policy using either standard gradient ascent or another `gradient ascent` algorithm (such as Adam): $$\theta_{k+1}= \theta_k + \alpha_k \hat{g}_k,$$  where $\alpha_k$ is the learning rate at the $k$-th iteration.
    - Re-fit the baseline (value function) by regression on mean-squared error, via some `gradient descent` algorithm, by minimizing $(V_{\phi}(s_t)-\hat{\mathcal{R}}_t)^2$ summed over all trajectories and time steps: $$\phi_{k+1} = \underset{\phi}{\operatorname{arg\,min}} \ \frac{1}{|\mathcal{D}_k|T}\sum_{\tau \in \mathcal{D}_k} \sum_{t=0}^T \left(V_{\phi}(s_t) - \hat{\mathcal{R}}_t \right)^2 .$$ 

    
- end for.

---

# Implementation
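
What follows is a minimal, dependency-free sketch of the pseudocode above on a toy problem, not a reference implementation. Everything specific here is an assumption made for illustration: the four-state corridor environment with a reward of $-1$ per step, the tabular softmax policy, the tabular baseline, the simple advantage estimate $\hat{A}_t = \hat{\mathcal{R}}_t - V_{\phi_k}(s_t)$, the hyperparameters, and the use of a single gradient step per iteration in place of the full $\operatorname{arg\,min}$ when re-fitting the baseline.

```python
import math
import random

N_STATES, GOAL, MAX_STEPS = 4, 3, 20   # hypothetical corridor: states 0..3, goal at 3
LEFT, RIGHT = 0, 1


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def run_episode(theta, rng):
    """Roll out one trajectory; returns lists of states, actions, rewards."""
    s, states, actions, rewards = 0, [], [], []
    for _ in range(MAX_STEPS):
        a = rng.choices([LEFT, RIGHT], weights=softmax(theta[s]))[0]
        states.append(s)
        actions.append(a)
        rewards.append(-1.0)                       # every step costs 1
        s = min(s + 1, GOAL) if a == RIGHT else max(s - 1, 0)
        if s == GOAL:
            break
    return states, actions, rewards


def rewards_to_go(rewards):
    out, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out


def train(iterations=300, batch_size=50, lr_pi=0.05, lr_v=0.1, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]  # policy logits theta[s][a]
    value = [0.0] * N_STATES                       # tabular baseline V_phi(s)
    history = []                                   # mean return per iteration
    for _ in range(iterations):
        batch = [run_episode(theta, rng) for _ in range(batch_size)]
        grad = [[0.0, 0.0] for _ in range(N_STATES)]
        v_err, v_count = [0.0] * N_STATES, [0] * N_STATES
        for states, actions, rewards in batch:
            for s, a, rtg in zip(states, actions, rewards_to_go(rewards)):
                adv = rtg - value[s]               # advantage estimate
                pi = softmax(theta[s])
                for j in (LEFT, RIGHT):
                    # d/dtheta[s][j] log pi(a | s) = 1[j == a] - pi[j]
                    grad[s][j] += ((1.0 if j == a else 0.0) - pi[j]) * adv
                v_err[s] += rtg - value[s]
                v_count[s] += 1
        for s in range(N_STATES):
            for j in (LEFT, RIGHT):
                theta[s][j] += lr_pi * grad[s][j] / batch_size  # gradient ascent
            if v_count[s]:
                # One gradient-descent step on the mean squared error for state s.
                value[s] += lr_v * v_err[s] / v_count[s]
        history.append(sum(sum(r) for _, _, r in batch) / batch_size)
    return theta, value, history


theta, value, history = train()
p_right_start = softmax(theta[0])[RIGHT]   # probability of moving right from state 0
```

On this toy task the mean return in `history` should rise toward the best achievable value of $-3$ as the policy learns to move right. A practical implementation would typically replace the tables with neural networks and use an automatic-differentiation library for the gradients.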

# References

[1] https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html 

[2] https://spinningup.openai.com/en/latest/algorithms/vpg.html