import numpy

# 1. Notations

### **1.1 Trajectory**

$\tau$ (trajectory): a sequence of states and actions in the world

$\tau=\left(s_0, a_0, s_1, a_1, \ldots\right)$

### **1.2 Policy**
- $a_t=\mu\left(s_t\right)$: deterministic policy
- $a_t \sim \pi\left(\cdot \mid s_t\right)$: stochastic policy
- $a_t \sim \pi_\theta\left(\cdot \mid s_t\right)$: policy parametrized by $\theta$
- $\pi^*$: optimal policy

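The two most common kinds of stochastic policies in RL are:
- Categorical policy: used in discrete action spaces; actions are sampled from a categorical distribution over the available actions (`torch.distributions.Categorical`)
- Diagonal Gaussian policy: used in continuous action spaces; actions are sampled from a Gaussian with a diagonal covariance matrix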
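
As a rough illustration of the two policy types, the sketch below samples an action and computes its log-probability using NumPy only. The function names, the example `logits`, `mean`, and `log_std` values are all hypothetical stand-ins for what a policy network would output; this is not the `torch.distributions` implementation.

```python
import numpy as np

def sample_categorical(logits, rng):
    """Sample one discrete action from unnormalized log-probabilities (logits)."""
    z = logits - logits.max()              # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()    # softmax turns logits into probabilities
    action = rng.choice(len(probs), p=probs)
    return action, np.log(probs[action])

def diag_gaussian_log_prob(action, mean, log_std):
    """Log-density of a Gaussian with diagonal covariance, summed over dimensions."""
    std = np.exp(log_std)
    return -0.5 * np.sum(((action - mean) / std) ** 2 + 2 * log_std + np.log(2 * np.pi))

def sample_diag_gaussian(mean, log_std, rng):
    """Sample a continuous action with independent noise in each dimension."""
    action = mean + np.exp(log_std) * rng.standard_normal(mean.shape)
    return action, diag_gaussian_log_prob(action, mean, log_std)

rng = np.random.default_rng(0)

logits = np.array([2.0, 0.5, -1.0])        # hypothetical output for one state, 3 actions
a_disc, logp_disc = sample_categorical(logits, rng)

mean = np.array([0.1, -0.3])               # hypothetical mean action, 2 dimensions
log_std = np.array([-0.5, -0.5])           # hypothetical log standard deviations
a_cont, logp_cont = sample_diag_gaussian(mean, log_std, rng)
```

Parametrizing the Gaussian by `log_std` rather than `std` is a common choice because the log can take any real value while the standard deviation must stay positive.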

# 2. Key Equations

The goal in RL is to select a policy that maximizes the expected return when the agent acts according to it.

$\pi^*=\arg \max _\pi J(\pi)$

### 2.1 Discounted Return

**Undiscounted return**: the sum of the rewards an agent collects while interacting with the environment over one episode

$R(\tau)=\sum_{t=0}^T r_t$

**Discounted return**: the sum of all rewards ever obtained by the agent, each discounted by a factor $\gamma \in(0,1)$ for every step into the future at which it is obtained

$R(\tau)=\sum_{t=0}^{\infty} \gamma^t r_t$

**Expected return**: the return averaged over all trajectories, weighted by the probability of each trajectory under policy $\pi$

$J(\pi)=\int_\tau P(\tau \mid \pi) R(\tau)=\underset{\tau \sim \pi}{\mathrm{E}}[R(\tau)]$
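
The three quantities above can be sketched in a few lines of NumPy. The helper names and the reward sequences are made up for illustration, and the infinite discounted sum is truncated at the end of each recorded episode. The expected return is approximated by a sample mean over episodes, which is only an estimate of $J(\pi)$.

```python
import numpy as np

def undiscounted_return(rewards):
    """R(tau) = sum_t r_t over one finite episode."""
    return float(np.sum(rewards))

def discounted_return(rewards, gamma):
    """R(tau) = sum_t gamma^t r_t, truncated at the episode length."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))   # [1, gamma, gamma^2, ...]
    return float(np.sum(discounts * rewards))

def expected_return_estimate(reward_sequences, gamma):
    """Monte Carlo estimate of J(pi): mean discounted return over sampled episodes."""
    return float(np.mean([discounted_return(r, gamma) for r in reward_sequences]))

# Hypothetical reward sequences from two episodes collected with the same policy.
episodes = [[1.0, 1.0, 1.0], [0.0, 2.0]]

print(undiscounted_return(episodes[0]))          # 1 + 1 + 1 = 3.0
print(discounted_return(episodes[0], 0.5))       # 1 + 0.5 + 0.25 = 1.75
print(expected_return_estimate(episodes, 0.5))   # (1.75 + 1.0) / 2 = 1.375
```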

### 2.2 Policy Gradient

##### Derivation for Basic Policy Gradient

Expression in expectation form: $\nabla_\theta J\left(\pi_\theta\right)=\nabla_\theta \underset{\tau \sim \pi_\theta}{\mathrm{E}}[R(\tau)]=\underset{\tau \sim \pi_\theta}{\mathrm{E}}\left[\nabla_\theta \log P(\tau \mid \theta) R(\tau)\right]$
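
The expectation form follows from the log-derivative trick, $\nabla_\theta P(\tau \mid \theta)=P(\tau \mid \theta) \nabla_\theta \log P(\tau \mid \theta)$. A sketch of the steps, assuming the gradient and the integral can be exchanged:

$\nabla_\theta J\left(\pi_\theta\right)=\nabla_\theta \int_\tau P(\tau \mid \theta) R(\tau)=\int_\tau \nabla_\theta P(\tau \mid \theta) R(\tau)=\int_\tau P(\tau \mid \theta) \nabla_\theta \log P(\tau \mid \theta) R(\tau)$

The last integral is an expectation over $\tau \sim \pi_\theta$. To reach the grad-log-prob form, write the trajectory probability as a product of environment and policy terms, where $\rho_0$ denotes the start-state distribution and $P\left(s_{t+1} \mid s_t, a_t\right)$ the transition probabilities (both symbols introduced here for the derivation):

$P(\tau \mid \theta)=\rho_0\left(s_0\right) \prod_{t=0}^T P\left(s_{t+1} \mid s_t, a_t\right) \pi_\theta\left(a_t \mid s_t\right)$

Only the policy terms depend on $\theta$, so the environment terms vanish under the gradient:

$\nabla_\theta \log P(\tau \mid \theta)=\sum_{t=0}^T \nabla_\theta \log \pi_\theta\left(a_t \mid s_t\right)$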

Expression in grad-log-prob: $\nabla_\theta J\left(\pi_\theta\right)=\underset{\tau \sim \pi_\theta}{\mathrm{E}}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta\left(a_t \mid s_t\right) R(\tau)\right]$
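
The grad-log-prob expression can be estimated with a sample mean over collected trajectories. The sketch below assumes a tabular softmax policy with one logit per state-action pair, `theta[s, a]`, for which $\nabla_\theta \log \pi_\theta(a \mid s)$ has a closed form. The function names and the toy trajectories are hypothetical; a practical implementation would typically rely on automatic differentiation instead.

```python
import numpy as np

def softmax(x):
    z = x - x.max()                # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a | s) w.r.t. theta for a tabular softmax policy."""
    grad = np.zeros_like(theta)
    grad[s] = -softmax(theta[s])   # d/d theta[s, b] of log-softmax is -pi(b | s) ...
    grad[s, a] += 1.0              # ... plus 1 for the action that was taken
    return grad

def policy_gradient_estimate(theta, trajectories):
    """Sample mean of sum_t grad log pi(a_t | s_t) * R(tau) over trajectories."""
    total = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        ret = float(np.sum(rewards))           # undiscounted R(tau)
        for s, a in zip(states, actions):
            total += grad_log_pi(theta, s, a) * ret
    return total / len(trajectories)

# Toy problem (assumed for illustration): 1 state, 2 actions, uniform initial policy.
theta = np.zeros((1, 2))
trajectories = [
    ([0], [0], [1.0]),   # action 0 earned reward 1
    ([0], [1], [0.0]),   # action 1 earned reward 0
]
g = policy_gradient_estimate(theta, trajectories)
print(g)   # [[ 0.25 -0.25]]: the logit of the rewarded action is pushed up
```

Taking a gradient ascent step `theta += lr * g` (with a hypothetical step size `lr`) would raise the probability of action 0 in this toy problem.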

### 2.3 Value Functions

Value function: the expected return if you start in a given state, or state-action pair, and then act according to a particular policy forever after

Four main functions:

**1. On-Policy Value Function**: the average return if you start in state $s$ and always act according to policy $\pi$

$V^\pi(s)=\underset{\tau \sim \pi}{\mathrm{E}}\left[R(\tau) \mid s_0=s\right]$


**2. On-Policy Action-Value Function**: the average return if you start in state $s$, take an arbitrary action $a$, and then follow policy $\pi$ forever after

$Q^\pi(s, a)=\underset{\tau \sim \pi}{\mathrm{E}}\left[R(\tau) \mid s_0=s, a_0=a\right]$

**3. Optimal Value Function**

$V^*(s)=\max _\pi \underset{\tau \sim \pi}{\mathrm{E}}\left[R(\tau) \mid s_0=s\right]$

**4. Optimal Action-Value Function**

$Q^*(s, a)=\max _\pi \underset{\tau \sim \pi}{\mathrm{E}}\left[R(\tau) \mid s_0=s, a_0=a\right]$

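Key connections between the value function and the action-value function:

$V^\pi(s)=\underset{a \sim \pi}{\mathrm{E}}\left[Q^\pi(s, a)\right]$

$V^*(s)=\max _a Q^*(s, a)$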

### 2.4 Bellman Equations

The Bellman equations for the on-policy value functions are

$V^\pi(s)=\underset{\substack{a \sim \pi \\ s^{\prime} \sim P}}{\mathrm{E}}\left[r(s, a)+\gamma V^\pi\left(s^{\prime}\right)\right]$
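
For a small tabular problem, the Bellman equation for $V^\pi$ can be applied repeatedly as an update until the values stop changing (iterative policy evaluation). The sketch below assumes the transition probabilities `P[s, a, s']`, rewards `r[s, a]`, and policy `pi[s, a]` are known arrays; the function name and the two-state example are made up for illustration.

```python
import numpy as np

def policy_evaluation(P, r, pi, gamma, tol=1e-10, max_iters=10_000):
    """Iterate V(s) <- E_{a~pi, s'~P}[ r(s, a) + gamma * V(s') ] to a fixed point."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iters):
        Q = r + gamma * (P @ V)            # shape (S, A): expectation over s'
        V_new = np.sum(pi * Q, axis=1)     # shape (S,): expectation over a
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

# Hypothetical MDP: from either state, action 0 leads to state 0 and action 1 to state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
r = np.array([[0.0, 1.0],
              [0.0, 1.0]])                 # taking action 1 always yields reward 1
pi = np.full((2, 2), 0.5)                  # uniform random policy
gamma = 0.9

V = policy_evaluation(P, r, pi, gamma)
print(V)   # both entries close to 0.5 / (1 - 0.9) = 5
```

By symmetry both states have the same value $V$, and the Bellman equation reduces to $V=0.5+\gamma V$, which gives $V=5$ for $\gamma=0.9$.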




### 2.5 Advantage Functions

$A^\pi(s, a)=Q^\pi(s, a)-V^\pi(s)$
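
The advantage measures how much better an action is than the policy's average behavior in that state. The sketch below computes it from a hypothetical table of $Q^\pi$ values, using the identity $V^\pi(s)=\mathrm{E}_{a \sim \pi}\left[Q^\pi(s, a)\right]$ to obtain $V^\pi$; the numbers are invented for illustration.

```python
import numpy as np

# Hypothetical Q^pi(s, a) for 2 states and 2 actions, and the policy pi(a | s).
Q = np.array([[1.0, 3.0],
              [2.0, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.9, 0.1]])

V = np.sum(pi * Q, axis=1)     # V^pi(s) = sum_a pi(a | s) Q^pi(s, a)
A = Q - V[:, None]             # A^pi(s, a) = Q^pi(s, a) - V^pi(s)

print(A)                       # state 0: action 1 is better than average, action 0 worse
print(np.sum(pi * A, axis=1))  # the advantage averages to zero under pi in every state
```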