# Lecture 4: Model-Free Prediction

### Monte-Carlo Reinforcement Learning

* MC methods learn directly from episodes of experience
* MC is *model-free*: no knowledge of MDP transitions/rewards
* MC learns from complete episodes
* Value function = mean return
* Caveat: can only apply MC to *episodic* MDPs
    * All episodes must terminate
   
#### Monte-Carlo Policy Evaluation

* Goal: learn $v_{\pi}$ from episodes of experience under policy $\pi$
$$S_{1}, A_{1}, R_{2}, ..., S_{k} \sim \pi$$
* Monte-Carlo policy evaluation uses *empirical mean* return instead of *expected* return

#### First-Visit Monte-Carlo Policy Evaluation

* To evaluate state $s$
* The first time-step $t$ that state $s$ is visited in an episode,
* Increment counter $N(s) \leftarrow N(s) + 1$
* Increment total return $S(s) \leftarrow S(s) + G_{t}$
* Value is estimated by mean return $V(s) = S(s)/N(s)$
* By law of large numbers, $V(s) \rightarrow v_{\pi}(s)$ as $N(s) \rightarrow \infty$
> Note: MC evaluation does not sweep the whole state space. It only updates states that are actually visited in episodes sampled under the given policy, so the estimate is built from samples rather than from an exhaustive search over all states.
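
The first-visit procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the episode format (a list of `(state, reward)` pairs, where `reward` stands for $R_{t+1}$), the function name `first_visit_mc`, and the default `gamma=1.0` are all assumptions made for the example.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) with first-visit Monte-Carlo policy evaluation.

    Assumed format: each episode is a list of (state, reward) pairs,
    where reward is R_{t+1}, received after leaving the state at step t.
    """
    N = defaultdict(int)    # visit counter N(s)
    S = defaultdict(float)  # total return S(s)
    for episode in episodes:
        # Compute the return G_t for every time-step, working backwards.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            _, reward = episode[t]
            G = reward + gamma * G
            returns[t] = G
        # Only the first visit to each state in this episode counts.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state in seen:
                continue
            seen.add(state)
            N[state] += 1
            S[state] += returns[t]
    # V(s) = S(s) / N(s)
    return {s: S[s] / N[s] for s in N}
```

States that never appear in any sampled episode get no estimate at all, which is the point made in the note above.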

#### Every-Visit Monte-Carlo Policy Evaluation

* To evaluate state $s$
* Every time-step $t$ that state $s$ is visited in an episode,
* Increment counter $N(s) \leftarrow N(s) + 1$
* Increment total return $S(s) \leftarrow S(s) + G_{t}$
* Value is estimated by mean return $V(s) = S(s)/N(s)$
* By law of large numbers, $V(s) \rightarrow v_{\pi}(s)$ as $N(s) \rightarrow \infty$
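
For contrast, here is a minimal every-visit sketch under the same assumptions as before (episodes as lists of `(state, reward)` pairs with `reward` meaning $R_{t+1}$; the name `every_visit_mc` is hypothetical). The only difference from first-visit is that every occurrence of a state contributes its return.

```python
from collections import defaultdict


def every_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) with every-visit Monte-Carlo policy evaluation.

    Assumed format: each episode is a list of (state, reward) pairs,
    where reward is R_{t+1}, received after leaving the state at step t.
    """
    N = defaultdict(int)    # visit counter N(s)
    S = defaultdict(float)  # total return S(s)
    for episode in episodes:
        G = 0.0
        # Walk backwards so G holds the discounted return from step t.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1   # counted on every visit, not just the first
            S[state] += G
    return {s: S[s] / N[s] for s in N}
```

On an episode that visits the same state twice, the two methods generally give different estimates after a finite number of episodes, because every-visit averages in the return from the later visit as well.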

<img src="Figures/04-incremental-mean.png" style="width: 550px;"/>

<img src="Figures/04-incremental-mc.png" style="width: 550px;"/>
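
Judging by their file names, the two figures above cover the incremental mean and the incremental Monte-Carlo update; that reading is an assumption, since the figures themselves are not described in the text. The standard identity is $\mu_{k} = \mu_{k-1} + \frac{1}{k}(x_{k} - \mu_{k-1})$, which lets $V(s)$ be updated after each return without storing $S(s)$. A minimal sketch, with hypothetical function names:

```python
def incremental_mean_update(value, count, g):
    """Fold one new return g into a running mean.

    Returns the updated (value, count), using
    V <- V + (1 / N) * (G - V) after N <- N + 1.
    """
    count += 1
    value += (g - value) / count
    return value, count


def constant_alpha_update(value, g, alpha):
    """Running-mean variant with a fixed step size alpha.

    V <- V + alpha * (G - V); older returns are gradually forgotten,
    which can help when the problem is non-stationary.
    """
    return value + alpha * (g - value)
```

Feeding the returns 3, 2, 4 through `incremental_mean_update` one at a time, starting from `value=0.0, count=0`, ends at their ordinary mean.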

### Temporal-Difference Learning

* TD methods learn directly from episodes of experience
* TD is *model-free*: no knowledge of MDP transitions/rewards
* TD learns from *incomplete* episodes, by *bootstrapping*
    * We do not need to wait until the episode ends (until we "hit the wall") and compute the return from the complete trajectory
    * Instead, we can take a partial trajectory and use an estimate of the return from the current state onwards in place of the actual return
    * Bootstrapping means estimating what will happen over the remainder of the trajectory and updating our current guess of the value towards that estimate. This is the fundamental idea behind TD learning
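
The simplest instance of this idea is the TD(0) update, $V(S_{t}) \leftarrow V(S_{t}) + \alpha(R_{t+1} + \gamma V(S_{t+1}) - V(S_{t}))$, which replaces the full return with a one-step estimate. The sketch below is illustrative only: the function name `td0_update`, the `done` flag, and the default step size are assumptions, not something these notes specify.

```python
from collections import defaultdict


def td0_update(V, state, reward, next_state, done, alpha=0.1, gamma=1.0):
    """Apply one TD(0) update to V in place and return the TD error.

    V is a mapping from state to its current value estimate.
    The value of a terminal next state is taken to be 0.
    """
    bootstrap = 0.0 if done else V[next_state]
    td_target = reward + gamma * bootstrap   # estimated return
    td_error = td_target - V[state]
    V[state] += alpha * td_error             # move the guess towards the target
    return td_error


# Usage: update after a single observed transition, mid-episode.
V = defaultdict(float)
td0_update(V, state="A", reward=1.0, next_state="B", done=False)
```

Unlike the Monte-Carlo updates, this can be applied after every step, before the episode has finished.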

#### n-Step Prediction
* Let TD target look *n* steps into the future
* Define the *n*-step return
$$ G_{t}^{(n)} = R_{t+1} + \gamma R_{t+2} + ... + \gamma ^{n-1}R_{t+n} + \gamma ^{n}V(S_{t+n})$$
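
The n-step return can be computed directly from a recorded episode. In this sketch the indexing convention is an assumption: `rewards[k]` holds $R_{k+1}$, `values[k]` holds the current estimate $V(S_{k})$, and the episode terminates after `len(rewards)` steps, with the terminal state worth 0.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """Compute the n-step return G_t^(n) for a finite episode.

    rewards[k] is R_{k+1}; values[k] is V(S_k).
    If the episode ends within n steps, there is nothing left to
    bootstrap from and the result is the plain Monte-Carlo return.
    """
    T = len(rewards)
    steps = min(n, T - t)
    G = 0.0
    for k in range(steps):
        G += gamma ** k * rewards[t + k]
    if t + n < T:
        # Bootstrap from the value estimate n steps ahead.
        G += gamma ** n * values[t + n]
    return G
```

With `n=1` this is the TD(0) target, and with `n` large enough to reach the end of the episode it is the Monte-Carlo return.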

#### Averaging n-Step Returns
* We can average n-step returns over different *n*
* We can efficiently combine information from all time-steps

#### $\lambda$-return
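* The $\lambda$-return $G_{t}^{\lambda}$ combines all n-step returns $G_{t}^{(n)}$, using weight $(1-\lambda)\lambda ^{n-1}$ for the n-step return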
$$G_{t}^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_{t}^{(n)} $$
* Forward-view TD($\lambda$)
$$V(S_{t}) \leftarrow V(S_{t}) + \alpha(G_{t}^{\lambda} - V(S_{t})) $$
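
For an episode that terminates at step $T$, every n-step return with $n \geq T - t$ equals the full Monte-Carlo return, so the infinite sum collapses to a finite one with the leftover weight $\lambda^{T-t-1}$ placed on that return. The sketch below uses this finite form. The indexing (`rewards[k]` is $R_{k+1}$, `values[k]` is $V(S_{k})$) and all function names are assumptions made for the example.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """n-step return G_t^(n); rewards[k] is R_{k+1}, values[k] is V(S_k)."""
    T = len(rewards)
    G = sum(gamma ** k * rewards[t + k] for k in range(min(n, T - t)))
    if t + n < T:
        G += gamma ** n * values[t + n]  # bootstrap if not yet terminal
    return G


def lambda_return(rewards, values, t, lam, gamma=1.0):
    """lambda-return G_t^lambda for an episode of length T = len(rewards)."""
    T = len(rewards)
    G_lam = 0.0
    # Weighted sum over the n-step returns that still bootstrap.
    for n in range(1, T - t):
        weight = (1 - lam) * lam ** (n - 1)
        G_lam += weight * n_step_return(rewards, values, t, n, gamma)
    # All remaining weight goes to the full Monte-Carlo return.
    G_lam += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return G_lam


def forward_view_update(rewards, values, t, lam, alpha, gamma=1.0):
    """Return the new V(S_t) after one forward-view TD(lambda) update."""
    G_lam = lambda_return(rewards, values, t, lam, gamma)
    return values[t] + alpha * (G_lam - values[t])
```

Setting `lam=0` recovers the TD(0) target and `lam=1` recovers the Monte-Carlo return. Like MC, the forward view needs the complete episode before any update can be computed.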

<img src="Figures/04-eligibility-trace.png" style="width: 550px;"/>

<img src="Figures/04-backward-view.png" style="width: 550px;"/>