# Implementing the Continuous Policy Gradient Algorithm

In the [previous notebook](policygradient_discrete_solution.ipynb), we implemented the policy gradient algorithm for a discrete action space. In this notebook, we will implement the policy gradient algorithm for a continuous action space.

First, let's review a couple of key concepts from the previous notebooks.
1. Policy gradient methods learn a stochastic policy. This means that the policy outputs a probability distribution over actions rather than a single action.
2. When we collect a trajectory, we sample actions from the probability distribution output by the policy.
3. When we compute the loss function, we use the log probability of the actions that were sampled.

In general, the policy gradient algorithm for a continuous action space is very similar to the policy gradient algorithm for a discrete action space. The main difference is that we need a different way to represent the probability distribution, which changes how we do steps 2 and 3 above.

In the discrete case, we used a softmax function to convert the output of the neural network into a probability distribution over the actions. In the continuous case, we will instead treat the output of the neural network as the parameters of a Gaussian distribution over the actions.

#### Why Gaussians?
The reason we're talking about Gaussians in the first place is that we want a distribution we can sample from to generate a real number.
The choice of distribution is a design decision rather than a requirement. We could choose some other probability distribution to sample from, and it might work.
However, Gaussians are a very common choice, for the following reasons:
1. Gaussians have nice mathematical properties. For example, the sum of two independent Gaussian random variables is also a Gaussian random variable.
2. Gaussians are very common in nature, due to the [central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem).
3. Gaussians are unimodal, so the policy concentrates probability around a single preferred action, which tends to help the network learn stable policies.

## Gaussians and their Properties

Before we dive too deep into coding the implementation, let's review some properties of Gaussians, starting with 1D Gaussians.

#### Univariate Gaussians
The formula for a 1D Gaussian distribution is:
$$
\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
$$
Here's what it would look like if we plotted it:

![1D Gaussian Distribution with labeled mean and standard deviation](./normdist01_big.JPG)

*(Image source: https://www.nohsteachers.info/rlinden/statistics/Sect5/section5_4.htm )*

A 1D Gaussian is completely described by two parameters: the mean and the standard deviation. The mean ($\mu$) is the center of the distribution. The standard deviation ($\sigma$) is a measure of how spread out the distribution is.

Since the Gaussian is a probability distribution, we are able to sample from it.
We denote a random variable sampled from a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$ as:
$$
X \sim \mathcal{N}(\mu, \sigma^2)
$$

The **empirical rule** states that for a 1D Gaussian distribution, approximately:
- 68% of the samples will be within 1 standard deviation of the mean
- 95% of the samples will be within 2 standard deviations of the mean
- 99.7% of the samples will be within 3 standard deviations of the mean
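
The density formula and the empirical rule can both be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `gaussian_pdf` and `fraction_within`, and the values chosen for `mu` and `sigma`, are illustrative assumptions rather than part of this notebook's code.

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def fraction_within(samples, mu, sigma, k):
    """Fraction of samples lying within k standard deviations of the mean."""
    return sum(abs(s - mu) <= k * sigma for s in samples) / len(samples)

rng = random.Random(0)  # fixed seed so the run is repeatable
mu, sigma = 2.0, 0.5    # arbitrary illustrative values
samples = [rng.gauss(mu, sigma) for _ in range(100_000)]

# Peak height at x = mu is 1 / sqrt(2 * pi * sigma^2), about 0.798 for sigma = 0.5.
print(gaussian_pdf(mu, mu, sigma))

# The three fractions should land near 0.68, 0.95 and 0.997.
for k in (1, 2, 3):
    print(k, fraction_within(samples, mu, sigma, k))
```

Because the samples are random, the printed fractions only approximate the rule; more samples bring them closer.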

#### Multivariate Gaussians
We can have more than one dimension in a Gaussian distribution. These types of distributions are called multivariate Gaussian distributions. The formula for a multivariate Gaussian distribution is:
$$
\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \Sigma) = \frac{1}{\sqrt{(2\pi)^n|\Sigma|}}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)
$$

There are a couple of differences between the formulas for univariate and multivariate Gaussian distributions:
1. $\mathbf{x}$ is now a vector instead of a scalar. This means that the Gaussian is now a distribution over n-dimensional space instead of a distribution over a line.
    * In the context of reinforcement learning, each element of $\mathbf{x}$ corresponds to an output of the policy.
    * In a policy controlling a robot arm, $\mathbf{x}$ would be a vector of joint angles, one for each joint.
2. The mean $\boldsymbol{\mu}$ is now a vector instead of a scalar. This means that the mean is now a point in n-dimensional space instead of a point on a line.
3. The variance $\sigma^2$ is replaced by a covariance matrix, denoted $\Sigma$. Briefly, the covariance matrix describes the covariance between each pair of elements in $\mathbf{x}$.
    
#### Covariance Matrix

A covariance matrix is constructed as:
$$
\Sigma = \begin{bmatrix}
\sigma_{1}^2 & \sigma_{12} & \sigma_{13} & \dots  & \sigma_{1n} \\
\sigma_{21} & \sigma_{2}^2 & \sigma_{23} & \dots  & \sigma_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\sigma_{n1} & \sigma_{n2} & \sigma_{n3} & \dots  & \sigma_{n}^2
\end{bmatrix}
$$

Here $\sigma_i^2$ is the variance of $x_i$ and $\sigma_{ij} = \operatorname{Cov}(x_i, x_j)$ is the covariance between $x_i$ and $x_j$. The diagonal elements of the covariance matrix are the variances of each element of $\mathbf{x}$. The off-diagonal elements are the covariances between each pair of elements of $\mathbf{x}$. Since $\sigma_{ij} = \sigma_{ji}$, the matrix is symmetric.

The covariance between two elements of $\mathbf{x}$ is a measure of how much they vary together. If the covariance is positive, the two elements tend to increase and decrease together. If the covariance is negative, one tends to increase when the other decreases. If the covariance is zero, the two elements are uncorrelated; for a multivariate Gaussian, this also means they are independent of each other.
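
To build intuition for the sign of a covariance, here is a small sketch that estimates it from samples using only the standard library. The helper `covariance` and the three constructed variables are hypothetical examples. The last case is a reminder that, outside the jointly Gaussian setting, a covariance near zero does not by itself rule out dependence.

```python
import random

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 50_000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
noise = [rng.gauss(0.0, 0.5) for _ in range(n)]

moves_with = [xi + ei for xi, ei in zip(x, noise)]      # tends to rise when x rises
moves_against = [-xi + ei for xi, ei in zip(x, noise)]  # tends to fall when x rises
squared = [xi ** 2 for xi in x]                         # fully determined by x, but not jointly Gaussian with it

print(covariance(x, moves_with))     # positive, close to +1
print(covariance(x, moves_against))  # negative, close to -1
print(covariance(x, squared))        # close to 0 despite the dependence
```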

#### Diagonal Gaussian Policies

In this notebook, we will use a diagonal Gaussian policy. A diagonal Gaussian policy is a Gaussian distribution whose covariance matrix is diagonal, so all the off-diagonal elements are zero. The covariance between any two action dimensions is therefore zero, and because the distribution is Gaussian, the action dimensions are independent of each other.

This is a simplifying assumption that makes the math easier: the policy only needs to describe a mean and a standard deviation for each action dimension, rather than a full $n \times n$ covariance matrix.
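
As a short worked derivation of why this helps, assume $\Sigma = \operatorname{diag}(\sigma_1^2, \dots, \sigma_n^2)$. Then $|\Sigma| = \prod_{i=1}^{n} \sigma_i^2$ and $\Sigma^{-1} = \operatorname{diag}(1/\sigma_1^2, \dots, 1/\sigma_n^2)$, so the multivariate density factorizes into a product of 1D Gaussians and its logarithm becomes a sum:
$$
\begin{aligned}
\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \Sigma) &= \prod_{i=1}^{n} \mathcal{N}(x_i; \mu_i, \sigma_i^2) \\
\log \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \Sigma) &= -\sum_{i=1}^{n} \left( \frac{(x_i - \mu_i)^2}{2\sigma_i^2} + \log \sigma_i + \frac{1}{2}\log 2\pi \right)
\end{aligned}
$$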

## Implementation

Now that we have the requisite background, let's implement the policy gradient algorithm for a continuous action space.

#### Brief Summary of Algorithm
1. Our neural network predicts the parameters for a distribution given the observation.
2. We sample a value from the distribution. This is our action.
3. Repeat steps 1 and 2 to collect a trajectory.
4. Repeat step 3 to collect multiple trajectories.
5. Compute the gradient using the log probability of the actions that were sampled.

There are two key operations that we need to implement differently from the discrete case:
1. Sampling actions from the probability distribution output by the policy.
2. Computing the log-likelihood of a particular action.
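
For a diagonal Gaussian, these two operations might look roughly like the sketch below. It is written in plain Python so that it stands alone; the function names `sample_action` and `log_prob`, and the example values of `mu` and `sigma`, are assumptions made for illustration and not necessarily what this notebook's implementation uses. In practice the means and standard deviations would come from the policy network.

```python
import math
import random

def sample_action(mu, sigma, rng=random):
    """Draw one action by sampling each dimension independently from N(mu_i, sigma_i^2)."""
    return [rng.gauss(m, s) for m, s in zip(mu, sigma)]

def log_prob(action, mu, sigma):
    """Log-density of `action` under a diagonal Gaussian.

    Because the dimensions are independent, this is the sum of the
    per-dimension 1D Gaussian log-densities.
    """
    return sum(
        -((a - m) ** 2) / (2 * s ** 2) - math.log(s) - 0.5 * math.log(2 * math.pi)
        for a, m, s in zip(action, mu, sigma)
    )

# Hypothetical policy output for a 2-dimensional action space.
mu = [0.0, 1.0]
sigma = [1.0, 0.5]

rng = random.Random(0)  # fixed seed so the run is repeatable
action = sample_action(mu, sigma, rng)
print(action, log_prob(action, mu, sigma))
```

A real training loop would compute the log-probability with a framework that tracks gradients, so that the loss can be backpropagated through the predicted means and standard deviations.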

### Sampling Actions

Sampling actions from a diagonal Gaussian distribution is very simple. We just sample each element of the action vector independently from a 1D Gaussian distribution. So:
$$
a_i \sim \mathcal{N}(\mu_i, \sigma_i^2)