# Policy Gradient in MetaDrive

The goals of this notebook are the following:
* Provide a brief theoretical background on the
    * discrete policy gradient
    * continuous policy gradient
* Introduce MetaDrive, an RL environment for training self-driving cars.
* Implement a simple policy gradient algorithm (REINFORCE) for the problem of self-driving cars.

### Why Policy Gradient?
There is a plethora of RL algorithms out there. Why do we focus on the policy gradient?
1. **Simplicity:** The policy gradient is one of the simplest RL algorithms out there. It's easy to understand, and it's easy to implement.
2. **Direct Optimization:** The policy gradient directly optimizes the objective that we care about: the expected return. (Read on if you want to understand what this means.) Other algorithms, such as Q-learning, optimize a proxy objective (how accurately a value function predicts returns) and derive a policy from it, so they improve the expected return only indirectly. This can lead to suboptimal performance when the proxy and the true objective disagree.
3. **Both Continuous and Discrete Actions:** The policy gradient can be used to optimize policies with both continuous and discrete action spaces. Other algorithms, such as Q-learning in its standard form, can only be used to optimize policies with discrete action spaces.

## Prerequisites

In order to keep these tutorials short and focused, we assume you have some prerequisite knowledge.

Here's a list of the topics you should be familiar with, alongside some sources that you can use to review them:
* Partial Derivatives and Gradients
    * https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/partial-derivative-and-gradient-articles/a/introduction-to-partial-derivatives
    * https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/partial-derivative-and-gradient-articles/a/the-gradient
* Probability Distributions
    * https://www.khanacademy.org/math/statistics-probability/random-variables-stats-library/random-variables-discrete/v/discrete-and-continuous-random-variables
    * https://www.khanacademy.org/math/statistics-probability/random-variables-stats-library/random-variables-continuous/v/random-variables
* Backprop
    * 3Blue1Brown's Backprop Video: https://www.youtube.com/watch?v=tIeHLnjs5U8
    * Andrej Karpathy's Micrograd Video: https://www.youtube.com/watch?v=VMj-3S1tku0
* PyTorch
    * https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html
    * https://pytorch.org/tutorials/beginner/pytorch_with_examples.html

## RL Background

*Note: Parts of these notes are adapted from the OpenAI Spinning Up [lecture notes](https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html).*

Before we start our derivation of the discrete policy gradient, we need to define our problem as well as some key terminology that we'll be using throughout the article. 


### Problem Definition
Any problem that we solve using RL is defined as a repeated interaction between an agent and an environment. At each time step $t$, the agent receives an observation $o_t$ from the environment, and then selects an action $a_t$ to perform. After performing the action, the agent receives a reward $r_t$ from the environment, and the environment transitions to a new state $s_{t+1}$. The goal of the agent is to maximize the sum of rewards it receives over the course of the interaction.

This is illustrated in the figure below:

![rl_diagram](./agentenv_whitebg.png)

*(Image Source: Sutton & Barto, 2018)*

### Markov Decision Process
Mathematically, we can define the problem as a Markov Decision Process (MDP). An MDP is a tuple $(S, A, P, R, \gamma)$, where:
* $S$ is the set of possible states of the environment.
* $A$ is the set of possible actions of the agent.
* $P$ is the transition probability matrix, where $P_a(s, s') = \Pr(s_{t+1} = s' | s_t = s, a_t = a)$.
    * It represents the probability that state $s$ transitions to state $s'$ when the agent performs action $a$. 
* $R$ is the reward function, where $R_a(s, s') = \mathbb{E}[r_t | s_t = s, a_t = a, s_{t+1} = s']$.
    * It represents the expected reward that the agent receives when it transitions from state $s$ to state $s'$ after performing action $a$.
* $\gamma$ is the discount factor, where $\gamma \in [0, 1]$.
    * It represents the agent's preference for immediate rewards over future rewards.
    * The reward received $n$ timesteps into the future is multiplied by $\gamma^n$. When $\gamma < 1$, this shrinks the value of rewards at an exponential rate the farther away they are from the present.

Let's look at a toy example to examine how we can break down problems into MDPs.

#### Gridworld as an MDP

**Problem Description**: The agent is placed in a $4 \times 4$ gridworld, and it can move up, down, left, or right. The agent receives a reward of $+1$ for reaching the goal state, and a reward of $-1$ for falling into the pit. The agent receives a reward of $-0.1$ for every other state. The agent's goal is to reach the goal state as quickly as possible.

Here's how we would define this problem as an MDP:
* $S = \{s_{1,1}, s_{1,2}, \dots, s_{4,4}\}$
* $A = \{up, down, left, right\}$
* $P_a(s, s') = \begin{cases} 1 & \text{if } s' = \text{next\_state}(s, a) \\ 0 & \text{otherwise} \end{cases}$
    * Where $\text{next\_state}(s, a)$ is defined as: 
    
        $
            \text{next\_state}(s_{x,y}, a) =
            \begin{cases}
                s_{x-1,y} & \text{if } a = \text{up} \text{ and } x > 1 \\
                s_{x+1,y} & \text{if } a = \text{down} \text{ and } x < 4 \\
                s_{x,y-1} & \text{if } a = \text{left} \text{ and } y > 1 \\
                s_{x,y+1} & \text{if } a = \text{right} \text{ and } y < 4 \\
                s_{x,y} & \text{otherwise}
            \end{cases}
        $

* $R_a(s, s') = \begin{cases} +1 & \text{if } s' = \text{goal} \\ -1 & \text{if } s' = \text{pit} \\ -0.1 & \text{otherwise} \end{cases}$
* $\gamma = 1$
    * The agent does not have a preference for immediate rewards over future rewards.
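To make the definition above concrete, here is a minimal Python sketch of the gridworld's transition and reward functions. The positions of the goal and the pit are not specified in the problem description, so `GOAL` and `PIT` below are hypothetical choices made only for illustration, as are the function names.

```python
# Sketch of the 4x4 gridworld MDP described above.
# GOAL and PIT are assumed positions; the text does not say where they are.
GRID_SIZE = 4
GOAL = (4, 4)  # hypothetical goal cell
PIT = (2, 3)   # hypothetical pit cell
ACTIONS = ("up", "down", "left", "right")


def next_state(state, action):
    """Deterministic transition: move one cell, or stay put at a wall."""
    x, y = state
    if action == "up" and x > 1:
        return (x - 1, y)
    if action == "down" and x < GRID_SIZE:
        return (x + 1, y)
    if action == "left" and y > 1:
        return (x, y - 1)
    if action == "right" and y < GRID_SIZE:
        return (x, y + 1)
    return (x, y)


def transition_prob(state, action, new_state):
    """P_a(s, s'): 1 for the deterministic successor, 0 for everything else."""
    return 1.0 if new_state == next_state(state, action) else 0.0


def reward(state, action, new_state):
    """R_a(s, s'): depends only on the state we land in."""
    if new_state == GOAL:
        return 1.0
    if new_state == PIT:
        return -1.0
    return -0.1
```

For example, `next_state((1, 1), "up")` returns `(1, 1)` because the agent is already in the top row, and `reward((4, 3), "right", (4, 4))` returns `1.0` under the assumed goal position.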

### Policies, Trajectories, and Returns

**Policy**: The agent is fully described by a policy. A policy is a function (usually denoted $\pi$) that takes in a state $s$ and returns a probability distribution over actions $a$.
* $\pi(a | s) = \Pr(a_t = a | s_t = s)$

**Trajectory**: A trajectory, usually denoted $\tau$, is the sequence of states, actions, and rewards that the agent experiences over the course of an interaction with the environment.
* $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)$, where $T$ is the number of actions taken.

**Return**: A return, usually denoted $R(\tau)$, is the sum of rewards that the agent receives over the course of a trajectory. This is what we want to optimize.
* $R(\tau) = \sum_{t=0}^{T-1} r_t$

When we take into account the discount factor $\gamma$, we get the discounted return:
* $R_{\gamma}(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t$
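As a quick illustration of the two definitions of return, the sketch below computes both from a plain list of rewards. The function names and the example reward sequence are made up for this illustration.

```python
def total_return(rewards):
    """Undiscounted return: the plain sum of the rewards in a trajectory."""
    return sum(rewards)


def discounted_return(rewards, gamma):
    """Discounted return: the reward at timestep t is weighted by gamma**t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Hypothetical trajectory: three ordinary steps followed by reaching a goal.
example_rewards = [-0.1, -0.1, -0.1, 1.0]
undiscounted = total_return(example_rewards)
discounted = discounted_return(example_rewards, gamma=0.9)
```

With $\gamma = 0.9$, the final reward of $1.0$ is only worth $0.9^3 = 0.729$, so the discounted return is smaller than the undiscounted one. Setting `gamma=1.0` makes the two functions agree.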

## Derivation of the Discrete Policy Gradient

With the above background in mind, we are now ready to directly tackle the derivation of the policy gradient.

For this derivation, we're assuming that $\gamma = 1$, so the return is undiscounted.

#### The Policy
Our policy is denoted $\pi_{\theta}$. The subscript $\theta$ indicates that our policy is parametrized by a set of neural network parameters, denoted $\theta$. We're trying to find a value for $\theta$ that makes the policy get a high reward. 

#### Objective Function
Let's start by writing down our **objective function**.
This function, typically denoted $J$, describes how good a particular policy $\pi_{\theta}$ is.
The objective function is what we want to maximize.

For the policy gradient, $J(\pi_{\theta})$ is just the expected value of the return over the distribution of trajectories that would be visited by the policy. Mathematically:
$$
    J(\pi_{\theta}) = \mathop{\mathbb{E}}_{\tau \sim \pi_{\theta}}\left[R(\tau)\right]
$$

Let's take a minute to expand this out so we understand it better.

Let:
$$
    \mathcal{T} = \text{the set of all possible trajectories that could be produced by following $\pi_{\theta}$}
$$
Then:
$$
    J(\pi_{\theta}) = \sum_{\tau \in \mathcal{T}} R(\tau) \Pr(\tau|\pi_{\theta})
$$

Remember that both the policy and environment can be stochastic, so even if the starting condition is the same, there may be many possible trajectories.
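To see the sum over trajectories in action, here is a small sketch for a hypothetical one-step problem where both the policy and the environment are stochastic. Every trajectory is just an (action, reward) pair, so we can enumerate all of them. The probabilities and rewards below are invented for illustration.

```python
# Hypothetical one-step problem: two actions, each with a random reward.
policy = {"left": 0.3, "right": 0.7}  # pi(a), assumed values
reward_dist = {                        # Pr(r | a), assumed values
    "left": {1.0: 0.5, 0.0: 0.5},
    "right": {2.0: 0.25, 0.0: 0.75},
}


def expected_return(policy, reward_dist):
    """J = sum over all trajectories of R(tau) * Pr(tau | pi)."""
    j = 0.0
    for action, p_action in policy.items():
        for r, p_reward in reward_dist[action].items():
            prob_tau = p_action * p_reward  # policy choice times environment randomness
            j += r * prob_tau
    return j


j_value = expected_return(policy, reward_dist)
```

Here $J = 0.3 \cdot 0.5 \cdot 1 + 0.7 \cdot 0.25 \cdot 2 = 0.5$. In realistic environments the set of trajectories is far too large to enumerate like this, which is why we will need sampling later on.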

Ok, now that we know what the objective function is, how do we optimize $\theta$ to maximize $J$? Since most problems have absurdly complex environments, there's no straightforward equation we can write and solve analytically. In the absence of an analytical solution, a good choice is gradient ascent. *(Note: since we want to maximize $J$, we're using gradient ascent. If we wanted to minimize a loss, we would use gradient descent.)*

If you're confused about how gradient ascent (or descent) works, check out the linked resources in [Prerequisites](#prerequisites).

When we're doing gradient ascent, our update rule for $\theta$ is:
$$
\theta_{k+1} = \theta_{k} + \alpha \nabla_{\theta} J(\pi_{\theta_k})
$$
Where $\theta_k$ is the current set of parameters, $\theta_{k+1}$ is the next set of parameters, and $\alpha$ is the learning rate (step size).

The quantity $\nabla_{\theta} J(\pi_{\theta_k})$ is called the **policy gradient** and lends its name to this algorithm.
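The update rule is easier to trust once you have watched it work on something simple. The sketch below applies it to a toy one-dimensional objective $J(\theta) = -(\theta - 3)^2$, which is not an RL problem and was chosen only because its maximum (at $\theta = 3$) and its gradient are known in closed form.

```python
def gradient_ascent(grad_fn, theta, alpha, num_steps):
    """Repeatedly apply theta <- theta + alpha * grad J(theta)."""
    for _ in range(num_steps):
        theta = theta + alpha * grad_fn(theta)
    return theta


def toy_gradient(theta):
    """Gradient of the toy objective J(theta) = -(theta - 3)^2."""
    return -2.0 * (theta - 3.0)


# Start far from the optimum and take 100 small steps uphill.
theta_final = gradient_ascent(toy_gradient, theta=0.0, alpha=0.1, num_steps=100)
```

After 100 steps `theta_final` is very close to 3. For the policy gradient, the only thing that changes is that $\theta$ is a vector of network parameters and the gradient has to be estimated from sampled trajectories rather than written down exactly.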

### Solving the Policy Gradient

Expand definition of J:
$$
\nabla_{\theta} J(\pi_{\theta}) = \nabla_{\theta} \mathop{\mathbb{E}}_{\tau \sim \pi_{\theta}}\left[R(\tau)\right]
$$
Expand definition of expectation:
$$
\nabla_{\theta} J(\pi_{\theta}) = \nabla_{\theta} \sum_{\tau \in \mathcal{T}} R(\tau) \Pr(\tau|\pi_{\theta})
$$
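Move the gradient inside the sum (the gradient of a sum is the sum of the gradients, and $R(\tau)$ does not depend on $\theta$):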
$$
\nabla_{\theta} J(\pi_{\theta}) = \sum_{\tau \in \mathcal{T}} R(\tau) \nabla_{\theta} \Pr(\tau|\pi_{\theta})
$$

#### Detour: The Log Trick
We're going to take a quick detour to talk about the log trick. This is a useful trick that we'll use to simplify our derivation.

We start with the following basic identity:
$$
\frac{d}{dx} \ln f(x) = \frac{1}{f(x)} \left(\frac{d}{dx} f(x)\right)
$$
Rearrange the terms to isolate $\frac{d}{dx} f(x)$:
$$
\frac{d}{dx} f(x) = f(x) \frac{d}{dx} \ln f(x)
$$
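If you'd like to convince yourself that the identity holds, the sketch below checks it numerically with finite differences. The function `f` is an arbitrary positive function picked for this check, and `numerical_derivative` is a helper defined here, not part of any library.

```python
import math


def numerical_derivative(fn, x, eps=1e-6):
    """Central finite-difference approximation of d/dx fn(x)."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)


def f(x):
    """An arbitrary positive function, so that ln f(x) is defined."""
    return x ** 2 + 1.0


x = 1.5
lhs = numerical_derivative(f, x)                                  # d/dx f(x)
rhs = f(x) * numerical_derivative(lambda z: math.log(f(z)), x)    # f(x) * d/dx ln f(x)
```

Both `lhs` and `rhs` come out approximately equal to $2x = 3$, as the identity predicts.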


#### Back to the Derivation
Applying this identity to our policy gradient, we get:
$$
\begin{align*}
\nabla_{\theta} J(\pi_{\theta}) &= \sum_{\tau \in \mathcal{T}} R(\tau) \left(\nabla_{\theta} \Pr(\tau|\pi_{\theta})\right)\\
&= \sum_{\tau \in \mathcal{T}} R(\tau) \left(\Pr(\tau|\pi_{\theta}) \nabla_{\theta} \ln \Pr(\tau|\pi_{\theta})\right)\\
&= \sum_{\tau \in \mathcal{T}} R(\tau) \nabla_{\theta} \ln \Pr(\tau|\pi_{\theta}) \Pr(\tau|\pi_{\theta}) 
\end{align*}
$$
Apply definition of expectation:
$$
\nabla_{\theta} J(\pi_{\theta}) = \mathop{\mathbb{E}}_{\tau \sim \pi_{\theta}}\left[R(\tau) \nabla_{\theta} \ln \Pr(\tau|\pi_{\theta})\right]
$$

So, what we've discovered is that the policy gradient is equal to the expected value of the return of a trajectory times the gradient of the log-probability of that same trajectory.

#### Solving the Log Probability of a Trajectory

Let's focus on the term $\nabla_{\theta}\ln \Pr(\tau|\pi_{\theta})$ from the previous equation. This is the gradient of the log probability of a trajectory.

The probability of a given trajectory $\tau$ given $\pi_{\theta}$ is:
$$
\Pr(\tau|\pi_{\theta}) = \Pr(s_0) \prod_{t=0}^{T-1} \Pr(s_{t+1}|s_t, a_t) \pi_{\theta}(a_t|s_t)
$$
Where $T$ is the length of the trajectory.


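Substituting this in and using the fact that the log of a product is the sum of the logs, we can expand this out as follows:
$$
\begin{align*}
\nabla_{\theta} \ln \Pr(\tau|\pi_{\theta}) &= \nabla_{\theta} \ln \left(\Pr(s_0) \prod_{t=0}^{T-1} \Pr(s_{t+1}|s_t, a_t) \pi_{\theta}(a_t|s_t)\right)\\
&= \nabla_{\theta} \left(\ln \Pr(s_0) + \sum_{t=0}^{T-1} \left(\ln \Pr(s_{t+1}|s_t, a_t) + \ln \pi_{\theta}(a_t|s_t)\right)\right)\\
&= \nabla_{\theta} \sum_{t=0}^{T-1} \ln \pi_{\theta}(a_t|s_t)\\
&= \sum_{t=0}^{T-1} \nabla_{\theta} \ln \pi_{\theta}(a_t|s_t)
\end{align*}
$$
The third line follows because neither the initial state distribution $\Pr(s_0)$ nor the environment dynamics $\Pr(s_{t+1}|s_t, a_t)$ depend on $\theta$, so their gradients are zero.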
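As a sanity check on this result, the sketch below uses a hypothetical single-state problem with a softmax policy whose parameters $\theta$ are the logits. It compares a finite-difference gradient of the full trajectory log-probability (including the environment terms) with the sum of the analytic gradients of $\ln \pi_{\theta}(a_t|s_t)$ alone. The parameter values, the action sequence, and the environment probabilities are all made up.

```python
import math


def softmax(logits):
    """Turn a list of logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_prob_trajectory(theta, actions, env_log_prob):
    """ln Pr(tau | pi_theta). env_log_prob bundles ln Pr(s_0) and every
    ln Pr(s_{t+1} | s_t, a_t) term; it does not depend on theta."""
    probs = softmax(theta)
    return env_log_prob + sum(math.log(probs[a]) for a in actions)


def grad_log_policy(theta, action):
    """Analytic gradient of ln softmax(theta)[action]: one_hot(action) - probs."""
    probs = softmax(theta)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def numerical_gradient(fn, theta, eps=1e-6):
    """Central finite-difference gradient of a scalar function of theta."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((fn(up) - fn(down)) / (2 * eps))
    return grad


theta = [0.2, -0.5, 1.0]                           # hypothetical logits
actions = [2, 0, 2]                                # hypothetical actions taken
env_log_prob = math.log(0.5) + 3 * math.log(0.8)   # hypothetical environment terms

full = numerical_gradient(
    lambda th: log_prob_trajectory(th, actions, env_log_prob), theta
)
policy_only = [
    sum(grad_log_policy(theta, a)[i] for a in actions) for i in range(len(theta))
]
```

The two vectors `full` and `policy_only` agree up to finite-difference error, which reflects that we never need to know the environment's dynamics to compute this gradient.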

#### Finishing off the Policy Gradient

Let's substitute the above result back into our equation for the policy gradient.
$$
\begin{align*}
\nabla_{\theta} J(\pi_{\theta}) &= \mathop{\mathbb{E}}_{\tau \sim \pi_{\theta}}\left[R(\tau) \nabla_{\theta} \ln \Pr(\tau|\pi_{\theta})\right]\\
&= \mathop{\mathbb{E}}_{\tau \sim \pi_{\theta}}\left[R(\tau)\sum_{t=0}^{T-1} \nabla_{\theta} \ln \pi_{\theta}(a_t|s_t)\right]
\end{align*}
$$
Move the $R(\tau)$ inside the sum.
$$
\nabla_{\theta} J(\pi_{\theta}) = \mathop{\mathbb{E}}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T-1} R(\tau) \nabla_{\theta} \ln \pi_{\theta}(a_t|s_t)\right]
$$

#### Analysis of Result

And that's it! We're done! Let's now take a minute to analyze our result.

The policy gradient takes the form of an expectation over a distribution.
This means that we can estimate it using Monte-Carlo sampling.

To do this, we'll collect dozens of trajectories, calculate the value inside the expectation for each one, and then average them together to get an estimate of the gradient. The more samples (trajectories in this case) we collect, the closer our estimate will tend to be to the true gradient.
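Here is a minimal sketch of that Monte-Carlo estimate on a hypothetical one-step problem, where a trajectory is a single action and the exact gradient can be computed by enumeration for comparison. The softmax policy, the logits in `theta`, and the per-action returns are assumptions made for this example. It uses far more samples than the "dozens" mentioned above so that the comparison is tight.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(probs, action):
    """Gradient of ln pi(action) w.r.t. the logits: one_hot(action) - probs."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def exact_policy_gradient(theta, returns):
    """Sum over every one-step trajectory: R(tau) * Pr(tau) * grad ln Pr(tau)."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for action, p in enumerate(probs):
        g = grad_log_prob(probs, action)
        for i in range(len(theta)):
            grad[i] += returns[action] * p * g[i]
    return grad


def monte_carlo_policy_gradient(theta, returns, num_samples, rng):
    """Average R(tau) * grad ln pi(a) over trajectories sampled from the policy."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    actions = rng.choices(range(len(theta)), weights=probs, k=num_samples)
    for action in actions:
        g = grad_log_prob(probs, action)
        for i in range(len(theta)):
            grad[i] += returns[action] * g[i] / num_samples
    return grad


theta = [0.0, 0.5, -0.5]   # hypothetical policy parameters (logits)
returns = [1.0, 0.0, 2.0]  # hypothetical return for each action
rng = random.Random(0)     # seeded so the run is repeatable
exact = exact_policy_gradient(theta, returns)
estimate = monte_carlo_policy_gradient(theta, returns, 20000, rng)
```

With 20,000 samples, `estimate` lands close to `exact`. With only a few dozen samples the estimate is much noisier, but it still points in roughly the right direction on average.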

**But what does the expression mean though?**

There is an intuition behind each of the operations in the equation. Let's go through each of the operations in the equation from the inside-out.
1. $\nabla_{\theta} \ln \pi_{\theta}(a_t|s_t)$
    * This term "selects" the part of the network that represents a particular action we took in a particular state.
        * Remember that the gradient of a scalar (in this case $\ln \pi_{\theta}(a_t|s_t)$) with respect to $\theta$ is a vector with the same dimensions as $\theta$.
        * You can conceptually think of this part of the network returning a vector with positive values for neurons that positively influenced the network's output $\pi_{\theta}(a_t|s_t)$, and negative values for neurons that negatively influenced the network's output.
        * Remember that changing any parameter in $\theta$ affects $\pi_{\theta}(a_t|s_t)$ for many different states. This is good, as it allows the network to generalize to states it hasn't seen before.
2. $R(\tau)\nabla_{\theta} \ln \pi_{\theta}(a_t|s_t)$
    * $R(\tau)$ is a scalar that represents how much we want to encourage or discourage this type of action.
        * In the future we'll look at different things we could put here instead of $R(\tau)$ that improve performance. However, all of these replacements must have the property that good actions should have positive values, and bad actions have negative (or less positive) values.
    * We multiply $R(\tau)$ by the gradient of the log probability of the action we took in the state we were in. This produces a vector that represents an update in a direction that would have increased the network's reward when making the choice $\pi_{\theta}(a_t|s_t)$.
3. $\sum_{t=0}^{T-1} R(\tau)\nabla_{\theta} \ln \pi_{\theta}(a_t|s_t)$
    * We sum the update vectors from each choice over all the choices that were made in this rollout. This produces a single update vector that represents the sum of all the updates that would have increased the network's reward when making the choices $\pi_{\theta}(a_t|s_t)$.
    * If the model made multiple good choices, then we can reward them all. If the model made multiple bad choices, then we can punish them all.
4. $\mathop{\mathbb{E}}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T-1} R(\tau)\nabla_{\theta} \ln \pi_{\theta}(a_t|s_t)\right]$
    * We take the expectation of the update vector over all the trajectories we collected. In practice, we use Monte-Carlo sampling to estimate this expectation.
    * We do this in order to reduce the effect of random chance on our gradient.
    * Since the environment is stochastic, a bad choice may have a good return for a particular trajectory. This is bad, because it will encourage the network to make that bad choice again. However, if we collect enough trajectories, then the network will be able to clearly see that making that choice will result in a bad return on average, and the network will be discouraged from making that choice again.

### Reward-to-Go

In the above derivation, we used the total return $R(\tau)$ to decide which actions to reinforce.

Recall that $R(\tau)$ is the sum of all the rewards we received during the trajectory:
$$
R(\tau) = \sum_{t=0}^{T-1} r_t
$$

Actions that resulted in a higher total return were encouraged, and actions that resulted in a lower total return were discouraged. This is good, but it ignores that an action can only influence the future, not the past. To put it another way, we only care about reinforcing actions that have good consequences. If we received a high reward *before* we took a particular action $a_t$, then we shouldn't reinforce action $a_t$ based on that.

To fix this, we can use the **reward-to-go** $\hat{R}_t(\tau)$ instead of the total return $R(\tau)$. The reward-to-go is the sum of all the rewards we received from timestep $t$ onward, that is, the reward for taking action $a_t$ and every reward after it. Mathematically:
$$
\hat{R}_t(\tau) = \sum_{t'=t}^{T-1} r_{t'}
$$

Our overall policy gradient then becomes:
$$
\nabla_{\theta} J(\pi_{\theta}) = \mathop{\mathbb{E}}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T-1} \hat{R}_t(\tau) \nabla_{\theta} \ln \pi_{\theta}(a_t|s_t)\right]
$$

Empirically, using the reward-to-go instead of the total return improves performance. Rewards collected before an action cannot have been caused by it. In expectation they contribute nothing to the gradient, and only add noise (variance) to our estimate of it.
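Computing the reward-to-go for every timestep is cheap if we walk backwards through the trajectory and keep a running sum. The sketch below shows one way to do it. The function name and the example rewards are illustrative.

```python
def reward_to_go(rewards):
    """For each timestep t, sum the rewards from t to the end of the trajectory."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry reuses the sum already computed for t + 1.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg


example_rtg = reward_to_go([1.0, 0.0, 2.0, -1.0])
```

For the rewards `[1.0, 0.0, 2.0, -1.0]` this gives `[2.0, 1.0, 1.0, -1.0]`. Note that the first entry is the total return, and the last action is only credited with the final reward of `-1.0`.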

### Bringing Back Gamma

Thus far, we have been ignoring gamma in order to simplify the math. Recall that gamma represents our discount factor, and is used to reduce the effect of rewards that are far in the future. If we receive a reward $r$ $n$ timesteps into the future and our discount factor is $\gamma$, then we only count that reward as $\gamma^n r$.

We can adjust the reward-to-go to account for gamma:
$$
\hat{R}_t(\tau) = \sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'}
$$
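The discounted reward-to-go satisfies the recurrence $\hat{R}_t = r_t + \gamma \hat{R}_{t+1}$, which follows directly from the sum above. A minimal sketch that uses this recurrence, with an illustrative function name and made-up rewards:

```python
def discounted_reward_to_go(rewards, gamma):
    """For each t, compute sum over t' >= t of gamma**(t' - t) * rewards[t']."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Backward pass: R_hat[t] = r[t] + gamma * R_hat[t + 1].
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


example_discounted_rtg = discounted_reward_to_go([1.0, 1.0, 1.0], gamma=0.5)
```

With three rewards of `1.0` and $\gamma = 0.5$, the result is `[1.75, 1.5, 1.0]`. Setting `gamma=1.0` recovers the undiscounted reward-to-go.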


### Difference between Discrete and Continuous Problems

In **discrete** action spaces, the agent chooses between a finite number of actions. Examples of discrete action spaces are:
1. Gridworld.
2. Choosing a move when playing chess.

In **continuous** action spaces, the agent must generate a value for each dimension of the action space. Examples of continuous action spaces are:
1. Controlling the angle and thrust of a rocket.
2. Choosing how much torque to apply to the joint of a robot. 

In some cases, the action space may be a mix of discrete and continuous actions. For example, in a self-driving car, the agent may choose between discrete actions such as "activate left blinker", "activate right blinker", and "do not activate blinker", and continuous actions such as how much to turn the steering wheel.

Thus far, the derivation of the policy gradient has been identical for both discrete and continuous action spaces. However, in the following sections, we will see that there are some differences in how we implement the policy gradient for discrete and continuous action spaces.

This particular notebook focuses on the discrete case, and the next notebook will focus on the continuous case.
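To give a feel for the difference, the sketch below samples one action from a discrete policy and one from a continuous policy, and returns the log-probability term that the policy gradient needs in each case. Modelling the continuous policy as a Gaussian is an assumption made here for illustration (it is a common choice, but this notebook has not committed to it), and the function names are hypothetical.

```python
import math
import random


def sample_discrete(probs, rng):
    """Discrete policy: the network outputs one probability per action."""
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])


def sample_continuous(mean, std, rng):
    """Continuous policy (assumed Gaussian): the network outputs a mean and
    a standard deviation, and we sample a real-valued action from them."""
    action = rng.gauss(mean, std)
    log_density = (
        -0.5 * ((action - mean) / std) ** 2
        - math.log(std)
        - 0.5 * math.log(2 * math.pi)
    )
    return action, log_density


rng = random.Random(0)
discrete_action, discrete_log_prob = sample_discrete([0.2, 0.3, 0.5], rng)
continuous_action, continuous_log_density = sample_continuous(0.0, 1.0, rng)
```

In the discrete case the action is an index and $\ln \pi_{\theta}(a_t|s_t)$ is the log of a probability. In the continuous case the action is a real number and the same term is the log of a probability density.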

# MetaDrive

We now introduce the environment we will be using for this notebook: [MetaDrive](https://github.com/metadriverse/metadrive)

Check out the notebook [here](../quickstart.ipynb) for a quick introduction to MetaDrive as well as how to install it.

Let's test out the environment by creating an instance of it and taking a random action at each timestep.

In [4]:
# Importing metadrive registers its environments with Gymnasium
import metadrive
import gymnasium as gym

env = gym.make("MetaDrive-validation-v0", config={"use_render": True})
env.reset()
for i in range(100):
    # Take a random action and print the reward we receive for it
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    print(reward)
    # Stop early if the episode ended (terminated) or ran out of time (truncated)
    if terminated or truncated:
        break
env.close()

Known pipe types:
  glxGraphicsPipe
(1 aux display modules not yet loaded.)


8.31221473623458e-05
0.009233961947637252
0.029430664503088177
0.0385080765555831
0.051269874340499226
0.024825500350000494
-0.004269349232754249
-0.012220657157477671
-0.011318983514367316
0.010826522401143868
0.019600871726443807
0.00518795331222229
0.0002647879602980098
-0.00020965787401731797
0.0027548640214803766
0.0036451430530871957
0.00203804888209766
4.7831459837470954e-05
-0.0008139261675983708
0.0002649450169180858
-0.004138602420487799
-0.003833691689167148
0.005658243306209003
0.0019439392483036298
0.0017501587413670697
0.0013271993844638941
-0.0012417981516064886
0.00047796214394067146
0.005576002975745207
0.011957905409772306
0.019367010686041318
0.03269041932155331
0.02264725586738955
0.010750797150814074
0.023785212217481203
0.019831233304526522
0.015112038625447117
0.017808277012386936
0.017960457098097456
0.020937374686892773
-0.0005746566562039816
-0.001424047920491952
0.0013622548321156132
-0.0023082070131272346
-0.0007034053034793181
0.0011865168268845503
0.000501

MetaDrive uses the [Farama Gymnasium](https://gymnasium.farama.org/) API, a standard interface for interacting with environments. There are a couple of functions and properties that are good to know about:
1. `reset()`: Resets the environment to its initial state and returns the initial observation along with an info dictionary.
    * Documentation: https://gymnasium.farama.org/api/env/#gymnasium.Env.reset
2. `step(action)`: Takes an action and returns the next observation, the reward for taking the action, whether the episode is terminated, whether the episode is truncated (ran out of time), and any additional information.
    * Documentation: https://gymnasium.farama.org/api/env/#gymnasium.Env.step
3. `close()`: Closes the environment.
    * Documentation: https://gymnasium.farama.org/api/env/#gymnasium.Env.close
4. `action_space`: The action space of the environment, which tells us the shape and bounds of the action space.
    * Documentation: https://gymnasium.farama.org/api/env/#gymnasium.Env.action_space
5. `observation_space`: The observation space of the environment, which tells us the shape and bounds of the observation space.
    * Documentation: https://gymnasium.farama.org/api/env/#gymnasium.Env.observation_space
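The functions above are typically combined into an episode loop. The sketch below shows the pattern with a tiny stand-in environment so that it runs without MetaDrive installed. `CountdownEnv` and `run_episode` are hypothetical names invented for this example. The stand-in only mimics the `reset`/`step`/`close` signatures and is not a real Gymnasium environment.

```python
class CountdownEnv:
    """Hypothetical stand-in that mimics the Gymnasium reset/step/close API."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.steps = 0

    def reset(self, seed=None, options=None):
        self.steps = 0
        return [0.0], {}  # (observation, info)

    def step(self, action):
        self.steps += 1
        obs = [self.steps / self.horizon]
        reward = 1.0
        terminated = False                     # this toy task never "fails"
        truncated = self.steps >= self.horizon  # but it does run out of time
        return obs, reward, terminated, truncated, {}

    def close(self):
        pass


def run_episode(env, choose_action, max_steps=1000):
    """Run one episode and return the list of rewards that were received."""
    obs, info = env.reset()
    rewards = []
    for _ in range(max_steps):
        obs, reward, terminated, truncated, info = env.step(choose_action(obs))
        rewards.append(reward)
        if terminated or truncated:
            break
    return rewards


env = CountdownEnv(horizon=5)
episode_rewards = run_episode(env, lambda obs: 0)
env.close()
```

With a real Gymnasium environment, the same loop could be driven by random actions via `run_episode(env, lambda obs: env.action_space.sample())`.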

Let's take a closer look at what our observation and action spaces are:

In [11]:
import metadrive
import gymnasium as gym 

env = gym.make("MetaDrive-validation-v0", config={"use_render": False})
print("Observation Space:", env.observation_space)
print("Action Space:", env.action_space)


Observation Space: Box(-0.0, 1.0, (259,), float32)
Action Space: Box(-1.0, 1.0, (2,), float32)


For the observation space, we see that it is a `Box` observation space, with a low of 0 and a high of 1. It has a shape of (259,). This means that our observation is a vector of 259 numbers, each of which is between 0 and 1.

For the action space, we see that it is a `Box` action space, with a low of -1 and a high of 1. It has a shape of (2,). This means that each action is a vector of 2 numbers, each of which is between -1 and 1.