# Reinforcement Learning

### Overview

#### What is the defining trait of Reinforcement Learning?
- There exists a loop in which the **agent** will receive feedback from the **environment**

#### How do we formulate this feedback loop?
We define $x_t$ as the state of the environment at time $t$ (e.g. $x_t$ can be a vector in which each entry describes a quantity such as *temperature* or *luminosity*).

However, an agent may not observe the entire $x_t$ (a temperature gauge cannot take a picture), so we define $o_t$ as the observation the agent makes.

$u_t$ is the action the agent takes at time $t$.

Our goal is to learn a **policy**, $\pi_{\theta}(u_t | o_t)$, which defines which action the agent should take given an observation.

![](terms.png)
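The feedback loop can be sketched in a few lines of Python. This is only an illustration: the dynamics in `step`, the partial observation in `observe`, and the thermostat-like `policy` are all hypothetical stand-ins, not part of the notes.

```python
import random

def step(x, u):
    """Hypothetical dynamics: the state x_t is a (temperature, luminosity) pair."""
    temperature, luminosity = x
    return (temperature + u, luminosity)  # the action only heats or cools

def observe(x):
    """o_t: the agent sees the temperature only, not the full state x_t."""
    return x[0]

def policy(o, theta):
    """Toy stochastic policy pi_theta(u | o): mostly heat when below the target theta."""
    p_heat = 0.9 if o < theta else 0.1
    return 1.0 if random.random() < p_heat else -1.0

random.seed(0)
x = (15.0, 300.0)           # initial state x_1
history = []
for t in range(5):
    o = observe(x)          # environment -> agent
    u = policy(o, theta=20.0)
    history.append((o, u))
    x = step(x, u)          # agent -> environment
```

The agent never touches `x` directly; it only acts on what `observe` returns.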


**Transition Distribution (Transition Function, Dynamics):**
**Markov Property:**

1. States satisfy the Markov property (the future is conditionally independent of the past given the current state); observations, in general, do not.

## Reward Functions

**Reward Function**: defines which future outcomes are desirable rather than specifying exactly what to do.  So the neural network needs to reason about how the current action leads to future reward.

## Goal of Reinforcement Learning

![](rl_goal1.png)

We see that $$p((s_{t+1}, a_{t+1}) | (s_t, a_t)) = p(s_{t+1} | s_t, a_t) \pi_{\theta}(a_{t+1} | s_{t+1})$$.

What probabilistic property does $\pi_{\theta}$ have?
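As a small check of this factorization, the sketch below builds the transition matrix of the chain over $(s, a)$ pairs for a made-up two-state, two-action problem. The numbers in `p` and `pi` are arbitrary assumptions chosen only for illustration.

```python
# p[s][a][s2] = p(s2 | s, a); pi[s][a] = pi_theta(a | s). Values are made up.
p = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
pi = [[0.7, 0.3],
      [0.4, 0.6]]

n_s, n_a = 2, 2
pairs = [(s, a) for s in range(n_s) for a in range(n_a)]

# T[(s, a)][(s2, a2)] = p(s2 | s, a) * pi_theta(a2 | s2)
T = [[p[s][a][s2] * pi[s2][a2] for (s2, a2) in pairs] for (s, a) in pairs]

# Each row is a probability distribution over the next (s, a) pair.
row_sums = [sum(row) for row in T]
```

Every row of `T` sums to one, so the augmented pairs $(s_t, a_t)$ form a Markov chain of their own.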

## Optimization Problem


$$\theta^* = \arg \max_{\theta} \mathbb{E}_{\tau \sim p_{\theta}(\tau)}[\sum_{t}r(s_t, a_t)]$$

### Infinite Horizon Case

$$\theta^* = \arg \max_{\theta} \mathbb{E}_{(s, a) \sim p_{\theta}(s, a)}[r(s, a)]$$

### Finite Horizon Case

$$\theta^* = \arg \max_{\theta} \sum_{t=1}^T \mathbb{E}_{(s_t, a_t) \sim p_{\theta}(s_t, a_t)}[r(s_t, a_t)]$$

## Problems with Training Policy 

In many cases the *reward function* $r(s_t, a_t)$ is a function of the action $a_t$, and $a_t$ may be discrete for many problems, so the reward cannot be differentiated with respect to the action.

Or we don't know the *reward function*.

**Temporal Credit Assignment Problem:** how to attribute a reward received at the current timestep to the earlier actions that contributed to it.

### Policy Gradient

Instead of differentiating the loss through a "rewards network", we estimate the gradient update by sampling trajectories and computing the gradient along them.

Given the optimization problem:

$$\theta^* = \arg \max_{\theta} \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\bigg[\sum_t r(s_t, a_t)\bigg]$$

We can evaluate the expectation as follows:

$$J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \bigg[\sum_t r(s_t, a_t) \bigg] \approx \frac{1}{N} \sum_{i} \sum_{t} r(s_{i,t}, a_{i,t})$$
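A minimal sketch of this sample average, assuming a toy two-state problem invented for illustration: the reward is 1 when the action matches the state, the policy matches the state with probability `q`, and the dynamics are arbitrary. Under these assumptions the exact value is $J = T q$, which gives something to compare the estimate against.

```python
import random

def sample_trajectory(q, T, rng):
    """Roll out T steps and return a list of (s, a, r) tuples."""
    s = rng.randrange(2)                         # p(s_1): uniform over {0, 1}
    traj = []
    for _ in range(T):
        a = s if rng.random() < q else 1 - s     # pi(a | s): match the state w.p. q
        r = 1.0 if a == s else 0.0               # r(s, a)
        traj.append((s, a, r))
        s = a if rng.random() < 0.8 else 1 - a   # p(s' | s, a), made up
    return traj

def estimate_J(q, T=5, N=2000, seed=0):
    """(1/N) * sum_i sum_t r(s_{i,t}, a_{i,t}) over N sampled trajectories."""
    rng = random.Random(seed)
    returns = [sum(r for _, _, r in sample_trajectory(q, T, rng)) for _ in range(N)]
    return sum(returns) / N

J_hat = estimate_J(q=0.7)   # exact value for this toy problem is T * q = 3.5
```

The estimate only needs samples; it never evaluates $p(s_{t+1} | s_t, a_t)$ explicitly.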

### Direct Policy Differentiation

In training, a trajectory $\tau = (s_1, a_1, \cdots, s_T, a_T)$ is sampled from the current policy, with trajectory distribution $\pi_{\theta}(\tau)$.  Writing $r(\tau) = \sum_t r(s_t, a_t)$, the objective is defined as

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}[r(\tau)] = \int \pi_{\theta}(\tau) r(\tau) d \tau$$

$$\nabla_{\theta} J(\theta) = \int \nabla_{\theta}\pi_{\theta}(\tau)r(\tau)d\tau = \int \pi_{\theta}(\tau) \nabla_{\theta} \log \pi_{\theta}(\tau)r(\tau)d\tau = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \bigg[\nabla_{\theta} \log \pi_{\theta}(\tau) r(\tau) \bigg]$$

since

$$\pi_{\theta}(\tau)\nabla_{\theta} \log \pi_{\theta}(\tau) = \pi_{\theta}(\tau) \frac{\nabla_{\theta}\pi_{\theta}(\tau)}{\pi_{\theta}(\tau)} = \nabla_{\theta} \pi_{\theta}(\tau)$$
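The identity can be checked numerically on a tiny example. The sketch below assumes a "trajectory" is just one of three discrete outcomes drawn from a softmax distribution, so the integral becomes a finite sum; the values of `theta` and `r` are arbitrary.

```python
import math

def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def J(theta, r):
    """J(theta) = sum_k pi_theta(k) * r(k)."""
    return sum(p * rk for p, rk in zip(softmax(theta), r))

theta = [0.2, -0.5, 1.0]
r = [1.0, 3.0, -2.0]
pi = softmax(theta)

# Right-hand side: E[grad log pi * r], using d/dtheta_j log pi_k = 1[j == k] - pi_j
rhs = [sum(pi[k] * ((1.0 if j == k else 0.0) - pi[j]) * r[k] for k in range(3))
       for j in range(3)]

# Left-hand side: grad J by central finite differences
eps = 1e-6
lhs = []
for j in range(3):
    up = list(theta); up[j] += eps
    dn = list(theta); dn[j] -= eps
    lhs.append((J(up, r) - J(dn, r)) / (2 * eps))
```

`lhs` and `rhs` agree up to finite-difference error, which is the log-derivative identity evaluated exactly rather than by sampling.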

#### Calculating $\log \pi_{\theta}(\tau)$

Since $$\pi_{\theta}(s_1, a_1, \cdots, s_T, a_T) = p(s_1) \prod_{t=1}^T \pi_{\theta}(a_t | s_t)p(s_{t+1} | s_t, a_t)$$,

$$\log \pi_{\theta}(\tau) = \log p(s_1) + \sum_{t=1}^T \big[\log \pi_{\theta}(a_t | s_t) + \log p(s_{t+1} | s_t, a_t)\big]$$

Since the first and last terms do not depend on $\theta$, their gradients are zero and we can drop them.

As a result,

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \bigg[ \bigg( \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)\bigg)\bigg(\sum_{t=1}^T r(s_t, a_t)\bigg)\bigg]$$

### Optimizing the Model

Combining the above expression for the gradient with the sample-based approximation of $J$, we obtain the following gradient estimate

$$\nabla_{\theta}J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \bigg(\sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}(a_{i,t} | s_{i, t})\bigg)\bigg(\sum_{t=1}^T r(s_{i, t}, a_{i,t})\bigg)$$

This is the **policy gradient**.

#### Reminder:

Update equation:
$$\theta \leftarrow \theta + \alpha \nabla_{\theta}J(\theta)$$

#### Algorithm:

1. sample $\{\tau^i\}$ by running the policy $\pi_{\theta}(a_t | s_t)$
2. calculate $\nabla_{\theta}J(\theta)$
3. update via update equation
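The three steps above can be sketched as follows. This is a toy illustration under stated assumptions, not a reference implementation: the two-state problem, the reward (1 when the action matches the state), the uniform dynamics, and the Bernoulli policy $\pi_{\theta}(a = 1 | s) = \sigma(\theta_s)$ are all made up for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rollout(theta, T, rng):
    """Sample one trajectory; return (sum_t grad log pi, sum_t r)."""
    s = rng.randrange(2)
    grad = [0.0, 0.0]
    total_reward = 0.0
    for _ in range(T):
        p1 = sigmoid(theta[s])                    # pi_theta(a = 1 | s)
        a = 1 if rng.random() < p1 else 0
        grad[s] += a - p1                         # d/dtheta[s] of log pi_theta(a | s)
        total_reward += 1.0 if a == s else 0.0    # r(s, a)
        s = rng.randrange(2)                      # hypothetical dynamics: uniform next state
    return grad, total_reward

def reinforce(iters=200, N=50, T=5, alpha=0.05, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    avg_returns = []
    for _ in range(iters):
        # 1. sample N trajectories from the current policy
        samples = [rollout(theta, T, rng) for _ in range(N)]
        # 2. estimate grad J as the average of (sum grad log pi) * (sum r)
        g = [sum(gr[j] * R for gr, R in samples) / N for j in range(2)]
        # 3. gradient ascent step
        theta = [theta[j] + alpha * g[j] for j in range(2)]
        avg_returns.append(sum(R for _, R in samples) / N)
    return theta, avg_returns

theta, avg_returns = reinforce()
```

In this toy problem the best policy picks action 0 in state 0 and action 1 in state 1, so `theta[0]` should drift negative and `theta[1]` positive while the average return rises.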

### Discrete Action Spaces

If the action $a_t$ is discrete, the term

$$\frac{1}{N}\sum_{i=1}^N \bigg(\sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}(a_{i,t} | s_{i, t})\bigg)$$

is the maximum-likelihood gradient, i.e. the negative gradient of the cross-entropy action prediction loss used in supervised learning.  (How different?)

So,
1. the policy gradient is the same quantity, with each trajectory's term weighted by that trajectory's reward
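A small sketch of the comparison, assuming a softmax policy over three actions and taking gradients with respect to the logits rather than $\theta$. The function names and numbers are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    e = [math.exp(z - m) for z in logits]
    total = sum(e)
    return [x / total for x in e]

def cross_entropy_grad(logits, action):
    """Gradient of -log pi(action) with respect to the logits: probs - onehot."""
    probs = softmax(logits)
    return [p - (1.0 if k == action else 0.0) for k, p in enumerate(probs)]

def weighted_log_prob_grad(logits, action, weight):
    """Gradient of weight * log pi(action) with respect to the logits."""
    return [-weight * g for g in cross_entropy_grad(logits, action)]

logits = [1.0, 0.0, -1.0]
ce = cross_entropy_grad(logits, action=0)
ml = weighted_log_prob_grad(logits, action=0, weight=1.0)    # maximum likelihood
pg = weighted_log_prob_grad(logits, action=0, weight=-2.0)   # trajectory reward of -2
```

With weight 1 the ascent direction always raises the probability of the sampled action, exactly as in supervised learning. With a negative reward the sign flips and the sampled action becomes less likely.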

### Continuous Actions with Gaussian Policies

In this case, the action is sampled from a Gaussian distribution whose mean is the output of a neural network:

$$\pi_{\theta}(a_t | s_t) = \mathcal{N}\big(f_{\text{neural network}}(s_t); \Sigma\big)$$

$$\log \pi_{\theta}(a_t | s_t) = - \frac{1}{2}\left\lVert f(s_t) - a_t \right\rVert_{\Sigma}^2 + \text{const}$$

$$\nabla_{\theta}\log \pi_{\theta}(a_t | s_t) = - \Sigma^{-1}(f(s_t) - a_t)\frac{df}{d\theta}$$
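Differentiating the quadratic form cancels its factor of $\frac{1}{2}$, which a finite-difference check makes easy to confirm. The sketch below assumes a scalar action, a scalar variance $\sigma^2$ in place of $\Sigma$, and a hypothetical linear mean $f(s) = \theta s$ standing in for the neural network.

```python
import math

def log_prob(theta, s, a, sigma):
    """log N(a; f(s), sigma^2) with a hypothetical linear mean f(s) = theta * s."""
    mean = theta * s
    return -0.5 * ((mean - a) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def grad_log_prob(theta, s, a, sigma):
    """Analytic gradient: -(f(s) - a) / sigma^2 * df/dtheta, with df/dtheta = s."""
    return -(theta * s - a) / sigma ** 2 * s

theta, s, a, sigma = 0.5, 2.0, 1.7, 0.8
eps = 1e-6
numeric = (log_prob(theta + eps, s, a, sigma)
           - log_prob(theta - eps, s, a, sigma)) / (2 * eps)
analytic = grad_log_prob(theta, s, a, sigma)
```

The gradient points in the direction that moves the mean $f(s_t)$ toward the sampled action $a_t$, scaled by the inverse variance.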

### Reducing Variance

Note that the policy at time $t'$ cannot affect the reward at time $t$ when $t < t'$.  So the reward term of the policy gradient can be replaced by

$$\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})$$

Note that we are only summing over future rewards.  This is known as the **reward-to-go**.
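The reward-to-go for every timestep of one trajectory can be computed in a single backward pass. A minimal sketch, with `rewards_to_go` as a hypothetical helper name:

```python
def rewards_to_go(rewards):
    """Return [sum_{t' >= t} rewards[t'] for each t], computed back to front."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out

rtg = rewards_to_go([1.0, 0.0, 2.0, 3.0])
# rtg == [6.0, 5.0, 5.0, 3.0]
```

Each $\nabla_{\theta} \log \pi_{\theta}(a_{i,t} | s_{i,t})$ is then multiplied by `rtg[t]` instead of the full trajectory reward.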

### Baselines

We subtract a baseline $b$ from the reward $r(\tau)$. (Why do we do that?)
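A short derivation sketch, assuming $b$ is a constant that does not depend on $\tau$ (the notes do not specify $b$; the average reward over sampled trajectories is one common choice). The extra term introduced by the baseline has zero expectation:

$$\mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\big[\nabla_{\theta} \log \pi_{\theta}(\tau)\, b\big] = \int \pi_{\theta}(\tau) \nabla_{\theta} \log \pi_{\theta}(\tau)\, b \, d\tau = b \int \nabla_{\theta} \pi_{\theta}(\tau) d\tau = b \nabla_{\theta} \int \pi_{\theta}(\tau) d\tau = b \nabla_{\theta} 1 = 0$$

So replacing $r(\tau)$ with $r(\tau) - b$ leaves the expected gradient unchanged, while a well-chosen $b$ can reduce the variance of the sample estimate.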

## Off-policy policy gradient with Importance Sampling

## Trust Region Policy Optimization

## Proximal Policy Optimization