# Notations

**State**

$\rho_0$: the starting state distribution

$s^{\prime}$: the next state of state $s$

**Trajectory**

$\tau$ (trajectory): a sequence of states and actions in the world

$\tau=\left(s_0, a_0, s_1, a_1, \ldots\right)$

**Policy**
- $a_t=\mu\left(s_t\right)$: deterministic policy
- $a_t \sim \pi\left(\cdot \mid s_t\right)$: stochastic policy
- $a_t \sim \pi_\theta\left(\cdot \mid s_t\right)$: policy parametrized by $\theta$
- $\pi^*$: optimal policy

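The two most common kinds of stochastic policies in deep RL are:
- Categorical policy: used for discrete action spaces; the policy assigns a probability to each action and an action is sampled from that distribution (e.g. `torch.distributions.Categorical`)
- Diagonal Gaussian policy: used for continuous action spaces; the policy outputs a mean and a standard deviation for each action dimension, and actions are sampled from the resulting Gaussian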
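
As a rough illustration of the two policy types, here is a minimal sketch that samples from each using only the Python standard library (rather than `torch.distributions`, to keep it dependency-free). The function names, logits, means, and log standard deviations are all made up for the example.

```python
import math
import random

def softmax(logits):
    """Convert unnormalized logits into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_categorical(logits, rng):
    """Sample a discrete action index from a categorical policy."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_diag_gaussian(mean, log_std, rng):
    """Sample a continuous action a = mean + std * z, with z ~ N(0, I)."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mean, log_std)]

def diag_gaussian_log_prob(action, mean, log_std):
    """Log-likelihood of an action under a diagonal Gaussian policy."""
    total = 0.0
    for a, m, ls in zip(action, mean, log_std):
        z = (a - m) / math.exp(ls)
        total += -0.5 * (z ** 2 + 2.0 * ls + math.log(2.0 * math.pi))
    return total

rng = random.Random(0)  # fixed seed so the sketch is reproducible
discrete_action = sample_categorical([2.0, 0.5, -1.0], rng)
continuous_action = sample_diag_gaussian([0.0, 1.0], [-0.5, -0.5], rng)
```

In a neural-network policy, the logits (or the mean vector) would be the network's output for state $s_t$; here they are hard-coded.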

# Key Equations

The goal in RL is to select a policy that maximizes the expected return when the agent acts according to it:

$\pi^*=\arg \max _\pi J(\pi)$

### 1. Discounted Return

**Undiscounted return** (finite horizon): the sum of the rewards the agent collects while interacting with the environment over one episode

$R(\tau)=\sum_{t=0}^T r_t$

**Discounted return** (infinite horizon): the sum of all rewards ever obtained by the agent, but discounted by how far off in the future they’re obtained, using a discount factor $\gamma \in(0,1)$

$R(\tau)=\sum_{t=0}^{\infty} \gamma^t r_t$
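
A small sketch of how both returns might be computed for a finite list of rewards. The helper name `discounted_return` and the reward values are illustrative; setting `gamma=1.0` recovers the undiscounted sum.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * r_t for a finite list of rewards."""
    total = 0.0
    # Iterate backwards so each step is one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

rewards = [1.0, 1.0, 1.0]
undiscounted = discounted_return(rewards, gamma=1.0)  # plain sum of rewards
discounted = discounted_return(rewards, gamma=0.5)    # 1 + 0.5 + 0.25
```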

**Expected return**: the return averaged over all trajectories, weighted by how likely each trajectory is under the policy

$J(\pi)=\int_\tau P(\tau \mid \pi) R(\tau)=\underset{\tau \sim \pi}{\mathrm{E}}[R(\tau)]$
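
In practice the expectation is usually approximated by sampling. The sketch below estimates $J(\pi)$ by averaging returns over sampled episodes in a toy setup invented for this example: at each step the policy picks action 1 with probability `p_act`, and the reward equals the chosen action.

```python
import random

def sample_return(p_act, horizon, rng):
    """Roll out one episode of the toy environment and return R(tau)."""
    return sum(1.0 if rng.random() < p_act else 0.0 for _ in range(horizon))

def estimate_expected_return(p_act, horizon, num_episodes, seed=0):
    """Monte Carlo estimate of J(pi): average R(tau) over sampled episodes."""
    rng = random.Random(seed)
    returns = [sample_return(p_act, horizon, rng) for _ in range(num_episodes)]
    return sum(returns) / len(returns)

# The exact undiscounted expected return in this toy setup is horizon * p_act = 1.5.
j_hat = estimate_expected_return(p_act=0.5, horizon=3, num_episodes=10_000)
```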

### 2. Value Functions

Value function: the expected return if you start in a given state (or state-action pair) and then act according to a particular policy forever after

Four main functions:

[1]. **On-Policy Value Function** - $V^\pi(s)$: the average return if you start in state $s$ and always act according to policy $\pi$

$V^\pi(s)=\underset{\tau \sim \pi}{\mathrm{E}}\left[R(\tau) \mid s_0=s\right]$

[2]. **On-Policy Action-Value Function** - $Q^\pi(s, a)$: the average return if you start in state $s$, take an arbitrary action $a$, and then act according to policy $\pi$ forever after

$Q^\pi(s, a)=\underset{\tau \sim \pi}{\mathrm{E}}\left[R(\tau) \mid s_0=s, a_0=a\right]$

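[3]. **Optimal Value Function** - $V^*(s)$: the average return if you start in state $s$ and always act according to the optimal policy

$V^*(s)=\max _\pi \underset{\tau \sim \pi}{\mathrm{E}}\left[R(\tau) \mid s_0=s\right]$

[4]. **Optimal Action-Value Function** - $Q^*(s, a)$: the average return if you start in state $s$, take an arbitrary action $a$, and then act according to the optimal policy forever after

$Q^*(s, a)=\max _\pi \underset{\tau \sim \pi}{\mathrm{E}}\left[R(\tau) \mid s_0=s, a_0=a\right]$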

### 3. Bellman Equations

The Bellman equations for the on-policy value functions are

$V^\pi(s)=\underset{\substack{a \sim \pi \\ s^{\prime} \sim P}}{\mathrm{E}}\left[r(s, a)+\gamma V^\pi\left(s^{\prime}\right)\right]$

$Q^\pi(s, a)=\underset{s^{\prime} \sim P}{\mathrm{E}}\left[r(s, a)+\gamma \underset{a^{\prime} \sim \pi}{\mathrm{E}}\left[Q^\pi\left(s^{\prime}, a^{\prime}\right)\right]\right]$
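
One way to see these equations at work is iterative policy evaluation: repeatedly apply the right-hand side for $V^\pi$ until the values stop changing. The sketch below does this on a 2-state, 2-action MDP whose transitions, rewards, and policy are made up for illustration. It uses the identity $V^\pi(s^{\prime})=\mathrm{E}_{a^{\prime} \sim \pi}\left[Q^\pi\left(s^{\prime}, a^{\prime}\right)\right]$ to write the $Q^\pi$ backup in terms of $V^\pi$.

```python
GAMMA = 0.9

# Hypothetical 2-state, 2-action MDP, made up for illustration.
# P[s][a] is a list of (probability, next_state) pairs; R[s][a] is r(s, a).
P = {
    0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
    1: {0: [(1.0, 0)], 1: [(1.0, 1)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}
# pi[s][a] is the probability of taking action a in state s.
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}

def q_from_v(V, s, a):
    """Bellman backup for Q: r(s, a) + gamma * E_{s'}[V(s')]."""
    return R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])

def evaluate_policy(tol=1e-10):
    """Iterate the Bellman equation for V until it stops changing."""
    V = {s: 0.0 for s in P}
    while True:
        new_V = {s: sum(pi[s][a] * q_from_v(V, s, a) for a in P[s]) for s in P}
        if max(abs(new_V[s] - V[s]) for s in P) < tol:
            return new_V
        V = new_V

V = evaluate_policy()
Q = {s: {a: q_from_v(V, s, a) for a in P[s]} for s in P}
```

At convergence, `V[s]` matches the policy-weighted average of `Q[s][a]` up to the tolerance, which is the Bellman equation for $V^\pi$ restated.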

### 4. Advantage Functions

The advantage function measures how much better it is to take action $a$ in state $s$ than to act according to the policy, i.e. relative to the average return from state $s$

$A^\pi(s, a)=Q^\pi(s, a)-V^\pi(s)$

#### Examples 
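
A minimal numeric sketch of the advantage for a single state, assuming made-up action values and policy probabilities. It relies on $V^\pi(s)$ being the policy-weighted average of $Q^\pi(s, a)$.

```python
# Hypothetical action values and policy probabilities for a single state s.
q_values = {"left": 1.0, "right": 3.0}
policy = {"left": 0.25, "right": 0.75}

# V(s) is the policy-weighted average of Q(s, a).
v = sum(policy[a] * q_values[a] for a in q_values)

# A(s, a) = Q(s, a) - V(s): positive means better than the policy's average.
advantages = {a: q_values[a] - v for a in q_values}

# Under the policy, the advantages average out to zero.
mean_advantage = sum(policy[a] * advantages[a] for a in q_values)
```

Here `right` has a positive advantage and `left` a negative one, so a policy update guided by the advantage would shift probability toward `right`.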

### 5. Policy Gradient

Goal: a policy gradient algorithm maximizes the expected return $J\left(\pi_\theta\right)$ by gradient ascent on the policy parameters $\theta$

##### Probability of a Trajectory

$P(\tau \mid \theta)=\rho_0\left(s_0\right) \prod_{t=0}^T P\left(s_{t+1} \mid s_t, a_t\right) \pi_\theta\left(a_t \mid s_t\right)$

##### **Policy Gradient**

$\nabla_\theta J\left(\pi_\theta\right) = \underset{\tau \sim \pi_\theta}{\mathrm{E}}\left[\nabla_\theta \log P(\tau \mid \theta) R(\tau)\right]$
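
To see why this expectation can be estimated from sampled trajectories alone, it may help to expand the log-probability using the trajectory probability defined above. The following is a sketch of the standard argument, assuming that $\rho_0$ and the environment dynamics $P$ do not depend on $\theta$:

$\log P(\tau \mid \theta)=\log \rho_0\left(s_0\right)+\sum_{t=0}^T\left(\log P\left(s_{t+1} \mid s_t, a_t\right)+\log \pi_\theta\left(a_t \mid s_t\right)\right)$

Taking the gradient with respect to $\theta$ removes the first two kinds of terms, leaving only the policy terms:

$\nabla_\theta \log P(\tau \mid \theta)=\sum_{t=0}^T \nabla_\theta \log \pi_\theta\left(a_t \mid s_t\right)$

Substituting this into the policy gradient gives an expression that needs only the policy's log-probabilities and the observed returns:

$\nabla_\theta J\left(\pi_\theta\right)=\underset{\tau \sim \pi_\theta}{\mathrm{E}}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta\left(a_t \mid s_t\right) R(\tau)\right]$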