# n-step Bootstrapping

In this chapter, we unify the Monte Carlo (MC) methods and the one-step TD methods presented in the previous two chapters. Neither MC methods nor one-step TD methods are always the best.
We present **n-step** TD methods that generalize both, so that one can shift smoothly from one to the other as needed to meet the demands of a particular task.
The best methods are often intermediate between the two extremes.

Another way of looking at the benefits of n-step methods is that they free you from the tyranny of the time step. With one-step TD methods, the same time step determines both how often the action can be changed and the time interval over which bootstrapping is done.
In many applications one wants to be able to update the action very quickly to take into account anything that has changed, but bootstrapping works best over a length of time in which a significant and recognizable state change has occurred. With one-step TD methods,
these time intervals are the same, and so a compromise must be made. In contrast, n-step methods enable bootstrapping to occur over multiple steps, freeing us from the tyranny of the single time step.

## n-step TD Prediction

What is the space of methods lying between Monte Carlo and TD methods? Consider
estimating $v_{\pi}$ from sample episodes generated using $\pi$. Monte Carlo methods perform
an update for each state based on the entire sequence of observed rewards from that
state until the end of the episode. The update of one-step TD methods, on the other
hand, is based on just the one next reward, bootstrapping from the value of the state
one step later as a proxy for the remaining rewards. *One kind of intermediate method,
then, would perform an update based on an intermediate number of rewards: more than
one, but less than all of them until termination*. For example, a two-step update would
be based on the first two rewards, and the estimated value of the state two steps later.
Similarly, we could have three-step updates, four-step updates, and so on.

The methods that use n-step updates are still TD methods because they still change an earlier estimate based on how it differs from a later estimate. Now the later estimate is not one step later, but n steps later. Methods in which
the temporal difference extends over n steps are called n-step TD methods. The TD methods introduced in the previous chapter all used one-step updates, which is why we called them one-step TD methods.

More formally, consider the update of the estimated value of state $S_{t}$ as a result of the state-reward sequence $S_{t}, R_{t+1}, S_{t+1}, \dots, R_{T}, S_{T}$. We know that in MC updates the estimate of $v_{\pi} (S_{t})$ is updated in the direction of the complete return:

$G_{t} = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$

where $T$ is the last time step of the episode. Let us call this quantity the *target* of the update. Whereas in MC updates the target is the return, in one-step updates the target is the first reward plus the discounted estimated value of the next state ($\hat{T^{\pi}} V$). We call this the
*one-step return (empirical Bellman operator)*:

$\hat{T^{\pi}} V = G_{t:t+1} = R_{t+1} + \gamma V_{t} (S_{t+1})$

where $V_{t}$ is the estimate at time $t$ of $v_{\pi}$. The subscripts on $G_{t:t+1}$ indicate that it is a truncated return for time $t$ using rewards up until time $t+1$, with the discounted estimate $\gamma V_{t} (S_{t+1})$ taking the place of the remaining terms of the full return ($\gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_{T}$). The point is that this idea makes just as much sense after two steps as it does after one.
The target for a two-step update is the two-step return:

$G_{t:t+2} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{t+1} (S_{t+2})$

where now $\gamma^2 V_{t+1} (S_{t+2})$ corrects for the absence of the remaining terms of the return ($\gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_{T}$). Similarly, the target for an arbitrary n-step update is the n-step return:

$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})$

for all $n, t$ such that $n \geq 1$ and $0 \leq t < T-n$. All n-step returns can be considered approximations to the full return, truncated after n steps and then corrected for the remaining missing terms by $V_{t+n-1} (S_{t+n})$. If $t+n \geq T$, then all the missing terms are taken as zero, and the n-step return is defined to be equal to the ordinary full return: $G_{t:t+n} = G_{t}$.
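The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `n_step_return` and the list layout (`rewards[k]` holding $R_{k+1}$, `values[k]` holding the current estimate for $S_k$) are assumptions made for this example, and a single fixed value table stands in for the time-indexed estimate $V_{t+n-1}$.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Return G_{t:t+n} for one recorded episode.

    Assumed layout (illustrative, not from the text):
      rewards[k] is R_{k+1}, the reward received on leaving S_k,
                 so len(rewards) == T.
      values[k]  is the current estimate V(S_k) for k = 0..T,
                 with values[T] == 0 for the terminal state.
    """
    T = len(rewards)
    end = min(t + n, T)  # never read rewards past the end of the episode

    # Discounted sum of the rewards R_{t+1}, ..., R_{end}.
    g = 0.0
    for k in range(t, end):
        g += gamma ** (k - t) * rewards[k]

    # Bootstrap from the value estimate only if the episode has not ended
    # within n steps; otherwise the n-step return equals the full return.
    if t + n < T:
        g += gamma ** n * values[t + n]
    return g
```

For example, with `rewards = [1.0, 2.0, 3.0]`, `values = [0.5, 1.0, 1.5, 0.0]` and `gamma = 0.5`, the one-step return from $t = 0$ is $1 + 0.5 \cdot 1.0 = 1.5$, the two-step return is $1 + 0.5 \cdot 2 + 0.25 \cdot 1.5 = 2.375$, and any $n \geq 3$ gives the full return $1 + 1 + 0.75 = 2.75$.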

Note that n-step returns for $n > 1$ involve future rewards and states that are not available at the time of the transition from $t$ to $t+1$. No real algorithm can use the n-step return until after it has seen $R_{t+n}$ and computed $V_{t+n-1}$. The first time these are available is $t+n$. The natural state-value learning algorithm for using n-step returns is thus

$V_{t+n} (S_t) = V_{t+n-1} (S_{t}) + \alpha [G_{t:t+n} - V_{t+n-1} (S_{t})]$ for $0 \leq t < T$

while the values of all other states remain unchanged: $V_{t+n} (s) = V_{t+n-1} (s), \forall s \neq S_t$. We call this algorithm n-step TD. Note that no changes at all are made during the first n - 1 steps of each episode. To make up for that, an equal number of additional updates are made at the end of the episode, after termination and before starting the next episode.

<img src='pngs/n-step-td.png'>
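To make the timing of the updates concrete, here is a hedged Python sketch of n-step TD prediction. The environment is an assumption chosen for illustration (a small random walk with a reward of 1 on exiting to the right and 0 elsewhere, under a uniformly random policy), and the names `n_step_td`, `num_states` and `tau` are not from the text. Here `tau` plays the role of the time whose state estimate is being updated, so it lags the current time `t` by `n - 1` steps.

```python
import random


def n_step_td(n, alpha, gamma, num_episodes, num_states=5, seed=0):
    """Estimate state values on a hypothetical random walk with n-step TD."""
    rng = random.Random(seed)
    # States 1..num_states are non-terminal; 0 and num_states + 1 are
    # terminal, and their value stays at 0 because they are never updated.
    V = [0.0] * (num_states + 2)

    for _ in range(num_episodes):
        states = [(num_states + 1) // 2]  # S_0: start in the middle
        rewards = [0.0]                   # rewards[k] is R_k; index 0 unused
        T = float("inf")                  # episode length, unknown at first
        t = 0
        while True:
            if t < T:
                # Take one step under the random policy and record it.
                s_next = states[t] + rng.choice((-1, 1))
                states.append(s_next)
                rewards.append(1.0 if s_next == num_states + 1 else 0.0)
                if s_next == 0 or s_next == num_states + 1:
                    T = t + 1

            tau = t - n + 1  # time whose state estimate is updated now
            if tau >= 0:
                # n-step return G_{tau:tau+n}, truncated at termination.
                G = 0.0
                for i in range(tau + 1, min(tau + n, T) + 1):
                    G += gamma ** (i - tau - 1) * rewards[i]
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])

            if tau == T - 1:
                break  # every visited state of the episode has been updated
            t += 1
    return V
```

While `tau` is negative (the first `n - 1` steps) no update happens, and after termination the loop keeps running without taking new actions until `tau` reaches `T - 1`, which is where the additional end-of-episode updates come from. Calling, say, `n_step_td(n=2, alpha=0.05, gamma=1.0, num_episodes=1000)` returns the value table including the two terminal entries.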