# Learn RL Notes Part 2

# 4 Model-Free Prediction

## 4.1 Introduction

**Planning by dynamic programming**

* Solve a known MDP

**Model-free prediction**

* Estimate the value function of an unknown MDP
    * MC: Monte-Carlo
    * TD: Temporal-Difference
    
**Model-free control**

* Optimise the value function of an unknown MDP

## 4.2 Monte-Carlo Learning
* MC methods learn directly from **episodes** of experience
* MC is **model-free**: no knowledge of MDP transitions / rewards
* MC learns from complete episodes: **no bootstrapping**
* MC uses the simplest possible idea: **value = mean return**
* Caveat: can only apply MC to episodic MDPs
* **All episodes must terminate**

### 4.2.1 Monte-Carlo Policy Evaluation
* Goal: learn $v_π$ from episodes of experience under policy π
$$ S_1, A_1, R_2, ..., S_k \sim \pi $$
* Recall that the return is the total discounted reward:
$$ G_t = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{T-t-1}R_T$$
* Recall that the value function is the **expected return**:
$$ v_{\pi}(s) = \mathbb{E}_{\pi}[G_t|S_t = s]$$
* Monte-Carlo policy evaluation uses **empirical mean return** instead of expected return

**First-Visit Monte-Carlo Policy Evaluation**
* To evaluate state s
* The first time-step t that state s is visited in an episode,
* Increment counter $N(s) ← N(s) + 1$
* Increment total return $S(s) ← S(s) + G_t$
* Value is estimated by mean return $V(s) = S(s)/N(s)$
* By law of large numbers, $V (s) → v_{\pi}(s)$ as $N(s) → \infty$

**Every-Visit Monte-Carlo Policy Evaluation**
* To evaluate state s
* Every time-step t that state s is visited in an episode,
* Increment counter $N(s) ← N(s) + 1$
* Increment total return $S(s) ← S(s) + G_t$
* Value is estimated by mean return $V (s) = S(s)/N(s)$
* Again, $V (s) → v_{\pi} (s)$ as $N(s) → \infty$
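
The two procedures above differ only in which visits contribute a return. A minimal sketch of both, assuming each episode is stored as a list of `(state, reward)` pairs where `reward` is the $R_{t+1}$ received after leaving that state (the function name and data layout are illustrative, not from the lecture):

```python
from collections import defaultdict


def mc_policy_evaluation(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) as the empirical mean return observed from s."""
    n = defaultdict(int)        # N(s): visit counter
    total = defaultdict(float)  # S(s): accumulated return

    for episode in episodes:
        # Scan backwards so that G_t = R_{t+1} + gamma * G_{t+1}.
        g = 0.0
        returns = []
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns.append((state, g))
        returns.reverse()  # back to time order

        seen = set()
        for state, g in returns:
            if first_visit and state in seen:
                continue  # first-visit MC ignores later visits
            seen.add(state)
            n[state] += 1
            total[state] += g

    return {s: total[s] / n[s] for s in n}


# One episode that visits A twice: A -(1)-> B -(0)-> A -(2)-> end
episodes = [[("A", 1.0), ("B", 0.0), ("A", 2.0)]]
print(mc_policy_evaluation(episodes, first_visit=True))   # A uses G = 3 only
print(mc_policy_evaluation(episodes, first_visit=False))  # A averages 3 and 2
```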

**Blackjack Example**
* States (200 of them):
    * Current sum (12-21)
    * Dealer’s showing card (ace-10)
    * Do I have a “useable” ace? (yes-no)
* Action stick: Stop receiving cards (and terminate)
* Action twist: Take another card (no replacement)
* Reward for stick:
    * +1 if sum of cards > sum of dealer cards
    * 0 if sum of cards = sum of dealer cards
    * -1 if sum of cards < sum of dealer cards
* Reward for twist:
    * -1 if sum of cards > 21 (and terminate)
    * 0 otherwise
* Transitions: automatically twist if sum of cards < 12

**Incremental Mean**

The mean $μ_1 , μ_2 , ...$ of a sequence $x_1 , x_2 , ...$ can be computed incrementally,
$$ \begin{align*} \\
μ_k &= \frac{1}{k} \sum^k_{j=1} x_j \\
&= \frac{1}{k} (x_k + \sum^{k-1}_{j=1} x_j) \\
&= \frac{1}{k} (x_k + (k-1)μ_{k-1}) \\
&= μ_{k-1} + \frac{1}{k} (x_k - μ_{k-1})
\end{align*}
$$
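
The last line of the derivation can be checked numerically. A small sketch (the helper name `incremental_means` is an assumption for illustration):

```python
def incremental_means(xs):
    """Return [mu_1, mu_2, ...] using mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k."""
    mu = 0.0
    means = []
    for k, x in enumerate(xs, start=1):
        mu += (x - mu) / k
        means.append(mu)
    return means


xs = [2.0, 4.0, 9.0]
print(incremental_means(xs))  # running means
print(sum(xs) / len(xs))      # batch mean, should match the last entry
```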

**Incremental Monte-Carlo Updates**
* Update V(s) incrementally after episode $S_1 , A_1 , R_2 , ..., S_T$
* For each state $S_t$ with return $G_t$
$$ N(S_t) \gets N(S_t) + 1 \\
V(S_t) \gets V(S_t) + \frac{1}{N(S_t)}(G_t - V(S_t))$$
* In **non-stationary problems**, it can be useful to track a **running mean**, i.e. forget old episodes.
$$V(S_t) \gets V(S_t) + \alpha (G_t - V(S_t))$$
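
Both update rules share one shape and differ only in the step size. A hedged sketch, assuming `V` and `N` are plain dictionaries and that passing `alpha=None` selects the exact $1/N(S_t)$ mean (this switch is an illustrative convention, not part of the lecture):

```python
def mc_update(V, N, state, g, alpha=None):
    """Move V[state] towards the observed return g.

    alpha=None  -> step size 1/N(state), the exact running mean.
    alpha=const -> exponentially forgets old returns (non-stationary case).
    """
    N[state] = N.get(state, 0) + 1
    step = alpha if alpha is not None else 1.0 / N[state]
    v = V.get(state, 0.0)
    V[state] = v + step * (g - v)
    return V[state]


V, N = {}, {}
mc_update(V, N, "s", 4.0)  # V["s"] becomes 4.0
mc_update(V, N, "s", 2.0)  # V["s"] becomes the mean of 4 and 2
print(V["s"])
```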


## 4.3 Temporal-Difference Learning
* TD methods learn directly from episodes of experience
* TD is model-free: no knowledge of MDP transitions / rewards
* TD learns from **incomplete episodes, by bootstrapping**
* **TD updates a guess towards a guess**

**MC and TD**

* Goal: learn $v_π$ online from experience under policy π
* Incremental every-visit Monte-Carlo
    * Update value $V (S_t )$ toward actual return $G_t$
$$ V (S_t ) \gets V (S_t ) + α (G_t − V (S_t ))$$
* Simplest temporal-difference learning algorithm: **TD(0)**
    * Update value $V (S_t )$ toward estimated return $R_{t+1} + \gamma V (S_{t+1} )$
$$ V (S_t ) \gets V (S_t ) + \alpha (R_{t+1} + \gamma V (S_{t+1} ) − V (S_t ))$$
    * $R_{t+1} + \gamma V (S_{t+1} )$ is called the **TD target**
    * $\delta_t = R_{t+1} + \gamma V (S_{t+1} ) − V (S_t )$ is called the **TD error**
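
The TD(0) rule above fits in a few lines. A minimal sketch, assuming a tabular `V` stored in a dictionary and a `terminal` flag that forces the successor value to zero (both are assumptions of this example):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """Apply one TD(0) update to V[s] and return the TD error delta_t."""
    v_next = 0.0 if terminal else V.get(s_next, 0.0)
    td_target = r + gamma * v_next        # R_{t+1} + gamma * V(S_{t+1})
    td_error = td_target - V.get(s, 0.0)  # delta_t
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error


V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", r=0.5, s_next="B", alpha=0.5, gamma=0.9)
print(delta, V["A"])
```

Unlike the Monte-Carlo update, this can run after every transition, because the target only needs the next reward and the current estimate of the next state.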
    
**Advantages and Disadvantages of MC vs. TD**

* **TD can learn before knowing the final outcome**
    * TD can learn online after every step
    * MC must wait until end of episode before return is known
* **TD can learn without the final outcome**
    * TD can learn from incomplete sequences
    * MC can only learn from complete sequences
    * **TD works in continuing (non-terminating) environments**
    * MC only works for episodic (terminating) environments

**Bias/Variance Trade-Off**
* Return $G_t = R_{t+1} + γR_{t+2} + ... + γ^{T-t-1} R_T$ is an unbiased estimate of $v_π (S_t )$
* True TD target $R_{t+1} + γv_π (S_{t+1} )$ is an unbiased estimate of $v_π (S_t )$
* TD target $R_{t+1} + γV (S_{t+1} )$ is a **biased estimate** of $v_π (S_t )$
* TD target is much lower variance than the return:
    * Return depends on many random actions, transitions, rewards
    * TD target depends on one random action, transition, reward
    
**Advantages and Disadvantages of MC vs. TD (2)**
* **MC has high variance, zero bias**
    * Good convergence properties
    * (even with function approximation)
    * Not very sensitive to initial value
    * Very simple to understand and use
* **TD has low variance, some bias**
    * **Usually more efficient than MC**
    * TD(0) converges to $v_π (s)$
    * **(but not always with function approximation)**
    * More sensitive to initial value
    
**Advantages and Disadvantages of MC vs. TD (3)**
* TD exploits Markov property
    * Usually **more efficient in Markov environments**
* MC does not exploit Markov property
    * Usually **more effective in non-Markov environments**
    
**Certainty Equivalence**
<img src="images/rl-mc-td-certainty-equivalence.png" width=600 />

**MC vs. TD vs. DP Backup**
<img src="images/rl-mc-backup.png" width=600 />
<img src="images/rl-td-backup.png" width=600 />
<img src="images/rl-dp-backup.png" width=600 />

**Bootstrapping and Sampling**

* Bootstrapping: update involves an estimate
    * MC does not bootstrap
    * DP bootstraps
    * TD bootstraps

* Sampling: update samples an expectation
    * MC samples
    * DP does not sample
    * TD samples

## 4.4 TD(λ)
### 4.4.1 Unified View of Reinforcement Learning
<img src="images/rl-unified-view.png" width=600 />

### 4.4.2 n-step prediction and return 
<img src="images/rl-td-nstep-prediction.png" width=600 />
<img src="images/rl-td-nstep-return.png" width=600 />

### 4.4.3 Average and $TD(\lambda)$ 
<img src="images/rl-td-nstep-return-average.png" width=600 />
<img src="images/rl-td-lambda-return.png" width=600 />
<img src="images/rl-td-lambda-weighting.png" width=600 />

### 4.4.4 forward view and backward view
<img src="images/rl-td-lambda-forward-view.png" width=600 />

* **Forward view provides theory**
* **Backward view provides mechanism**
* Update online, every step, from incomplete sequences

<img src="images/rl-td-eligibility-trace.png" width=600 />
<img src="images/rl-td-lambda-backward-view2.png" width=600 />

#### Backward View $TD(\lambda)$ Formula
$$ E_0(s) = 0 $$
$$ \tag{Eligibility Trace} E_t(s) = \gamma \lambda E_{t-1}(s) + 1(S_t = s) $$
$$ \tag{TD error} \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) $$
$$ \tag{Update} V(s) \gets V(s) + \alpha \delta_t E_t(s)$$
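
The three formulas above translate directly into a per-step loop. A sketch under stated assumptions: values are tabular, traces are accumulating, and an episode is given as a list of `(s, r, s_next, done)` tuples (this layout and the function name are illustrative):

```python
def td_lambda_episode(V, transitions, alpha=0.1, gamma=1.0, lam=0.9):
    """Run backward-view TD(lambda) over one episode, updating V in place."""
    E = {}  # eligibility traces; missing keys mean E(s) = 0
    for s, r, s_next, done in transitions:
        v_next = 0.0 if done else V.get(s_next, 0.0)
        delta = r + gamma * v_next - V.get(s, 0.0)  # TD error

        # Decay every trace, then bump the trace of the current state.
        for state in E:
            E[state] *= gamma * lam
        E[s] = E.get(s, 0.0) + 1.0

        # Every state is updated in proportion to its trace.
        for state, e in E.items():
            V[state] = V.get(state, 0.0) + alpha * delta * e
    return V


episode = [("A", 0.0, "B", False), ("B", 1.0, None, True)]
print(td_lambda_episode({}, episode, alpha=0.5, lam=1.0))  # A also gets credit
print(td_lambda_episode({}, episode, alpha=0.5, lam=0.0))  # only B is updated
```

With `lam=0.0` the trace of every state other than the current one is zero, so the loop reduces to the TD(0) update.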

### 4.4.5 TD(λ) and TD(0)
* When λ = 0, only current state is updated
$$E_t (s) = 1(S_t = s) \\
V(s) \gets V(s) + \alpha \delta_t E_t (s)$$
* This is exactly equivalent to TD(0) update
$$ V (S_t ) \gets V (S_t ) + \alpha \delta_t$$

### 4.4.6 TD(λ) and MC
* When λ = 1, credit is deferred until end of episode
* Consider episodic environments with offline updates
* Over the course of an episode, total update for TD(1) is the same as total update for MC

**Theorem**
The sum of offline updates is identical for forward-view and backward-view TD(λ)
$$ \sum_{t=1}^T \alpha \delta_t E_t(s) = \sum_{t=1}^T \alpha (G^{\lambda}_t - V(S_t)) 1(S_t=s)$$
<img src="images/rl-mc-td1.png" width=600 />
<img src="images/rl-mc-td2.png" width=600 />

* TD(1) is roughly equivalent to every-visit Monte-Carlo
* Error is accumulated online, step-by-step
* If value function is only updated offline at end of episode
* Then total update is exactly the same as MC

<img src="images/rl-td-telescoping.png" width=600 />
<img src="images/rl-td-error.png" width=600 />

**Offline updates**
* Updates are accumulated within episode
* but applied in batch at the end of episode

**Online updates**
* TD(λ) updates are applied online at each step within episode
* Forward and backward-view TD(λ) are slightly different
* NEW: **Exact online TD(λ)** achieves perfect equivalence
* By using a slightly different form of eligibility trace
* van Seijen and Sutton, ICML 2014

### 4.4.7 Summary
<img src="images/rl-td-summary.png" width=600 />


# 5 Model-Free Control
> Optimise the value function of an unknown MDP

## 5.1 Introduction
### Uses of Model-Free Control
Some example problems that can be modelled as MDPs
* Elevator 
* Robocup Soccer
* Parallel Parking 
* Quake
* Ship Steering 
* Portfolio management
* Bioreactor 
* Protein Folding
* Helicopter 
* Robot walking
* Aeroplane Logistics 
* Game of Go

For most of these problems, either:
* **MDP model is unknown, but experience can be sampled**
* **MDP model is known, but is too big to use, except by samples**

<font color=red> Model-free control can solve these problems </font>

### On and Off-Policy Learning
* On-policy learning
    * “Learn on the job”
    * Learn about policy π from experience sampled from π

* Off-policy learning
    * “Look over someone’s shoulder”
    * Learn about policy π from experience sampled from μ

## 5.2 On-Policy Monte-Carlo Control

### Monte-Carlo Policy Iteration
<img src="images/rl-ctrl-mc-policy-iteration.png" width=600 />

**Model-Free Policy Iteration Using Action-Value Function**
* Greedy policy improvement over $V(s)$ requires model of MDP
$$ \pi'(s) = \underset{a \in A}{argmax}\ R^a_s + P^a_{ss'}V(s')$$
* Greedy policy improvement over Q(s, a) is model-free
$$ \pi'(s) = \underset{a \in A}{argmax}\ Q(s,a)$$

<img src="images/rl-ctrl-mc-evaluation.png" width=600 />

**ε-Greedy Policy Improvement**

<img src="images/rl-ctrl-epsilon-greedy.png" width=600 />

<img src="images/rl-ctrl-mc-epsilon-greedy.png" width=600 />

### Monte-Carlo Control

<img src="images/rl-ctrl-mc-control.png" width=600 />

**GLIE**

Greedy in the Limit with Infinite Exploration (GLIE)
* All state-action pairs are explored infinitely many times,
$$ \lim_{k \to \infty} N_k(s,a)=\infty $$
* The policy converges on a greedy policy,
$$ \lim_{k \to \infty} \pi_k (a|s) = 1(a=\underset{a' \in A}{argmax}\ Q_k(s,a′)) $$

* For example, ε-greedy is GLIE if ε decays to zero as $ε_k = \frac{1}{k}$

**GLIE Monte-Carlo Control**
* Sample kth episode using π: $\{S_1, A_1, R_2, ..., S_T\} ∼ \pi$
* For each state $S_t$ and action $A_t$ in the episode,
$$  N (S_t , A_t ) \gets N (S_t , A_t ) + 1 $$
$$ Q(S_t,A_t) \gets Q(S_t,A_t)+ \frac{1}{N(S_t,A_t)} (G_t −Q(S_t,A_t))$$
* Improve policy based on new action-value function
$$ \epsilon \gets 1/k $$
$$ \pi \gets \epsilon\text{-greedy}(Q)$$

**Theorem**
GLIE Monte-Carlo control converges to the optimal action-value function, $Q(s, a) \to q_∗(s, a)$
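
A rough sketch of the two pieces of GLIE Monte-Carlo control: ε-greedy action selection and the per-episode update of $Q$. It assumes tabular `Q` and `N` keyed by `(state, action)`, every-visit updates, and episodes stored as `(state, action, reward)` triples; none of these choices are fixed by the notes.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def glie_mc_update(Q, N, episode, gamma=1.0):
    """Update Q towards the returns of one episode with step size 1/N."""
    g = 0.0
    for state, action, reward in reversed(episode):
        g = reward + gamma * g  # return G_t from this time-step
        key = (state, action)
        N[key] += 1
        Q[key] += (g - Q[key]) / N[key]


Q, N = defaultdict(float), defaultdict(int)
glie_mc_update(Q, N, [("s", "a", 1.0), ("s", "b", 2.0)])
k = 2  # index of the next episode, so epsilon = 1/k
print(epsilon_greedy(Q, "s", ["a", "b"], epsilon=1.0 / k))
```

In a full loop, the k-th episode would be generated by calling `epsilon_greedy` with `epsilon = 1 / k`, which is what makes the schedule GLIE.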
  
## 5.3 On-Policy Temporal-Difference Learning
### 5.3.1 MC vs. TD Control
* Temporal-difference (TD) learning has several advantages over Monte-Carlo (MC)
    * Lower variance 
    * Online
    * Incomplete sequences
* Natural idea: use TD instead of MC in our control loop 
    * Apply TD to Q(S, A)
    * Use ε-greedy policy improvement
    * Update every time-step
    
### 5.3.2 Sarsa
<img width=600 src="images/rl-ctrl-sarsa.png" />
<img width=600 src="images/rl-ctrl-sarsa-policy-iteration.png" />
<img width=600 src="images/rl-ctrl-sarsa-algorithm.png" />
<img width=600 src="images/rl-ctrl-sarsa-convergence.png" />
<img width=600 src="images/rl-ctrl-sarsa-nstep.png" />
<img width=600 src="images/rl-ctrl-sarsa-lambda-forward.png" />
<img width=600 src="images/rl-ctrl-sarsa-lambda-backward.png" />
<img width=600 src="images/rl-ctrl-sarsa-lambda-algorithm.png" />
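
A sketch of the one-step Sarsa update, assuming the slides above show the standard rule $Q(S,A) \gets Q(S,A) + \alpha (R + \gamma Q(S',A') - Q(S,A))$. The dictionary-based `Q` and the `done` flag are assumptions of this example.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0, done=False):
    """Apply one on-policy Sarsa update and return the TD error."""
    # a_next is the action actually taken next by the behaviour policy.
    q_next = 0.0 if done else Q.get((s_next, a_next), 0.0)
    td_error = r + gamma * q_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return td_error


Q = {("s1", "right"): 2.0}
sarsa_update(Q, "s0", "right", 1.0, "s1", "right", alpha=0.5, gamma=0.5)
print(Q[("s0", "right")])
```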


## 5.4 Off-Policy Learning
* Evaluate target policy π(a|s) to compute $v_π(s)$ or $q_π(s,a)$ 
* While following behaviour policy μ(a|s)
$$\{S_1,A_1,R_2,...,S_T\} ∼ μ $$
* Why is this important?
    * Learn from observing humans or other agents
    * Re-use experience generated from old policies $π_1, π_2, ..., π_{t−1}$ 
    * Learn about **optimal policy** while following **exploratory policy** 
    * Learn about multiple policies while following one policy
    
### Importance Sampling
Estimate the expectation of a different distribution    
$$ \begin{align*} \\
\mathbb{E}_{X \sim P}[f(X)] &= \sum P(X)f(X) \\
&= \sum Q(X) \frac{P(X)}{Q(X)} f(X)\\
&= \mathbb{E}_{X \sim Q} [\frac{P(X)}{Q(X)} f(X) ] \\
\end{align*} \\
$$    
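
The identity can be checked on a small discrete example. In this sketch `p` and `q` are dictionaries mapping outcomes to probabilities, and the samples are assumed to come from `q`; the names are illustrative.

```python
def expectation(dist, f):
    """Exact E_{X ~ dist}[f(X)] for a discrete distribution."""
    return sum(prob * f(x) for x, prob in dist.items())


def importance_sampling_estimate(samples, p, q, f):
    """Estimate E_{X ~ p}[f(X)] from samples drawn from q."""
    weighted = [(p[x] / q[x]) * f(x) for x in samples]
    return sum(weighted) / len(weighted)


p = {0: 0.5, 1: 0.5}
q = {0: 0.8, 1: 0.2}
f = lambda x: 10.0 * x

# Samples whose frequencies match q exactly (4 of 5 are 0, 1 of 5 is 1).
samples_from_q = [0, 0, 0, 0, 1]
print(expectation(p, f))
print(importance_sampling_estimate(samples_from_q, p, q, f))
```

The rare outcome under `q` receives a weight of 2.5, which is also why the estimator's variance grows when the two distributions differ a lot.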

### Importance Sampling for Off-Policy Monte-Carlo
* Use returns generated from μ to evaluate π
* Weight return $G_t$ according to similarity between policies 
* Multiply importance sampling corrections along whole episode
$$ G^{\pi/\mu}_t = \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)}...\frac{\pi(A_T|S_T)}{\mu(A_T|S_T)} G_t $$
* Update value towards corrected return
$$ V(S_t) \gets V(S_t) + \alpha (G^{\pi/\mu}_t - V(S_t)) $$
* Cannot use if μ is zero when π is non-zero
* Importance sampling can dramatically increase variance
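
A sketch of the corrected return, assuming the tail of an episode from time t is given as `(pi_prob, mu_prob, reward)` triples, where the probabilities are $π(A_k|S_k)$ and $μ(A_k|S_k)$ for the action actually taken (this layout is an assumption of the example):

```python
def importance_sampled_return(steps, gamma=1.0):
    """Return G_t scaled by the product of per-step policy ratios."""
    rho = 1.0       # product of pi / mu along the episode
    g = 0.0         # discounted return G_t
    discount = 1.0
    for pi_prob, mu_prob, reward in steps:
        if mu_prob == 0.0:
            raise ValueError("mu assigns zero probability to a taken action")
        rho *= pi_prob / mu_prob
        g += discount * reward
        discount *= gamma
    return rho * g


steps = [(1.0, 0.5, 1.0), (0.5, 0.5, 2.0)]
print(importance_sampled_return(steps))
```

Because the ratios multiply, a long episode can produce a very large or very small weight, which is the variance problem noted above.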

### Importance Sampling for Off-Policy TD
* Use TD targets generated from μ to evaluate π 
* Weight TD target R + γV (S′) by importance sampling 
* Only need a single importance sampling correction
$$ V(S_t) \gets V(S_t) + \alpha (\frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} (R_{t+1} + \gamma V(S_{t+1})) - V(S_t))$$ 
* Much lower variance than Monte-Carlo importance sampling 
* Policies only need to be similar over a single step
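
The single-step correction looks like this in code. A minimal sketch, assuming a tabular `V` in a dictionary and that the caller supplies the two action probabilities for the step (the function name and arguments are illustrative):

```python
def off_policy_td_update(V, s, r, s_next, pi_prob, mu_prob,
                         alpha=0.1, gamma=1.0, done=False):
    """TD(0) update with the TD target weighted by pi(A|S) / mu(A|S)."""
    rho = pi_prob / mu_prob  # one importance sampling ratio
    v_next = 0.0 if done else V.get(s_next, 0.0)
    target = rho * (r + gamma * v_next)
    V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
    return V[s]


V = {"A": 1.0, "B": 2.0}
print(off_policy_td_update(V, "A", 1.0, "B", pi_prob=1.0, mu_prob=0.5,
                           alpha=0.5, gamma=0.5))
```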

### Q-Learning
* We now consider off-policy learning of action-values Q(s,a) 
* No importance sampling is required
* Next action is chosen using behaviour policy $A_{t+1} \sim μ(· \mid S_{t+1})$
* But we consider alternative successor action $A′ \sim π(· \mid S_{t+1})$
* And update $Q(S_t,A_t)$ towards value of alternative action
$$ Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha (R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t)) $$

### Off-Policy Control with Q-Learning
* We now allow both **behaviour** and **target** policies to improve
* The target policy π is greedy w.r.t. Q(s,a)
$$ \pi(S_{t+1}) = \underset{a'}{argmax}\ Q(S_{t+1}, a') $$
* The behaviour policy μ is e.g. ε-greedy w.r.t. Q(s,a)
* The Q-learning target then simplifies:
$$ \begin{align*}
& R_{t+1} + \gamma Q(S_{t+1}, A') \\
&= R_{t+1} + \gamma Q(S_{t+1}, \underset{a'}{argmax}\ Q(S_{t+1}, a')) \\
&= R_{t+1} + \underset{a'}{max}\ \gamma Q(S_{t+1}, a')
\end{align*}
$$

<img width=600 src="images/rl-ctrl-q-learning.png" />

<img width=600 src="images/rl-ctrl-q-learning-algorithm.png" />
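
A sketch of the resulting update, using the simplified target $R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')$ derived above. The dictionary-based `Q`, the explicit `actions` list, and the `done` flag are assumptions of this example.

```python
def q_learning_update(Q, s, a, r, s_next, actions,
                      alpha=0.1, gamma=1.0, done=False):
    """Apply one Q-learning update and return the TD error."""
    # The target uses the greedy successor value, regardless of which
    # action the behaviour policy actually takes next.
    best_next = 0.0 if done else max(Q.get((s_next, b), 0.0) for b in actions)
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return td_error


Q = {("s1", "left"): 1.0, ("s1", "right"): 3.0}
q_learning_update(Q, "s0", "left", 1.0, "s1", ["left", "right"], alpha=0.5)
print(Q[("s0", "left")])
```

Compared with the Sarsa update, the only change is that the successor value is a max over actions instead of the value of the action actually taken.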

## 5.5 Summary

<img width=600 src="images/rl-ctrl-summary1.png" />
<img width=600 src="images/rl-ctrl-summary2.png" />


# 6 Value Function Approximation
# 7 Policy Gradient Methods
# 8 Integrating Learning and Planning
# 9 Exploration and Exploitation
# 10 Case study - RL in games
