# Temporal-Difference Learning

## TD Prediction

Both TD and Monte Carlo (MC) methods use experience to solve the prediction problem. Given some experience following a policy $\pi$, both methods update their estimate $V$ of $v_{\pi}$ for the non-terminal states $S_t$ occurring in that experience. Roughly speaking, MC methods wait until the return following the visit is known, then use that return as
a target for $V(S_t)$. A simple every-visit MC method suitable for non-stationary environments is:

$V(S_t) = V(S_t) + \alpha (G_t - V(S_t))$

where $G_t$ is the actual return following time $t$, and $\alpha$ is a constant step-size parameter. Let's call this method constant-$\alpha$ MC. MC methods must wait until the end of the
episode to determine the increment to $V(S_t)$, because only then is $G_t = R_{t+1} + \gamma G_{t+1}$ known. In contrast, TD methods need to wait only until the next time step. At time $t+1$ they
immediately form a target and make a useful update using the observed reward $R_{t+1}$ and the estimate $V(S_{t+1})$.
The simplest TD method makes the update:

$V(S_t) = V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$

In other words, the target for the MC update is $G_{t}$, whereas the target for the TD update is $R_{t+1} + \gamma V(S_{t+1})$. This TD method is called
**TD(0)**, or one-step TD, because it is a special case of the **TD($\lambda$)** and n-step TD methods.

<img src="tabular-TD0.png">
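
As a rough illustration of the two updates, the sketch below applies constant-$\alpha$ MC and TD(0) to a single recorded episode. The episode format (a list of `(state, reward, next_state, done)` tuples), the function names, and the two-state example are assumptions made for this sketch and are not taken from the source.

```python
from collections import defaultdict


def td0_episode(V, episode, alpha, gamma):
    """Apply one TD(0) update per transition, in the order experienced."""
    for state, reward, next_state, done in episode:
        # The value of a terminal state is taken to be 0.
        next_value = 0.0 if done else V[next_state]
        V[state] += alpha * (reward + gamma * next_value - V[state])
    return V


def constant_alpha_mc_episode(V, episode, alpha, gamma):
    """Apply every-visit constant-alpha MC updates once the episode has ended."""
    G = 0.0
    returns = []
    # Work backwards so that G_t = R_{t+1} + gamma * G_{t+1}.
    for state, reward, _next_state, _done in reversed(episode):
        G = reward + gamma * G
        returns.append((state, G))
    # Apply the updates in the order the states were visited.
    for state, G in reversed(returns):
        V[state] += alpha * (G - V[state])
    return V


# Hypothetical episode: A -> B with reward 0, then B -> terminal with reward 1.
episode = [("A", 0.0, "B", False), ("B", 1.0, "terminal", True)]
V_td = td0_episode(defaultdict(float), episode, alpha=0.5, gamma=1.0)
V_mc = constant_alpha_mc_episode(defaultdict(float), episode, alpha=0.5, gamma=1.0)
print("TD(0):", V_td["A"], V_td["B"])
print("MC:   ", V_mc["A"], V_mc["B"])
```

In this example TD(0) leaves $V(A)$ at 0 after the first episode, because its target uses the still-zero estimate $V(B)$, whereas the MC update moves both states toward the observed return of 1.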

Because TD(0) bases its update in part on an existing estimate, we say that it is a bootstrapping method, like DP. We know that:

$v_{\pi}(s) = E_{\pi} [G_t | S_t = s]$

$= E_{\pi} [R_{t+1} + \gamma G_{t+1} | S_{t} = s]$

$= E_{\pi} [R_{t+1} | S_{t} = s] + \gamma E_{\pi} [G_{t+1} | S_{t} = s]$

$= E_{\pi} [R_{t+1} | S_{t} = s] + \gamma E_{\pi} [E_{\pi} [G_{t+1} | S_{t+1}] | S_{t} = s]$

$= E_{\pi} [R_{t+1} + \gamma v_{\pi} (S_{t+1}) | S_{t} = s]$

MC methods use an estimate of $E_{\pi} [G_t | S_t = s]$ as a target, whereas DP methods use an estimate of $E_{\pi} [R_{t+1} + \gamma v_{\pi} (S_{t+1}) | S_{t} = s]$ as a target. The MC target is an estimate because the expected value is not known; a sample return is used in place of the real expected return.
The DP target is an estimate not because of the expected values, which are assumed to be completely provided by a model of the environment, but because $v_{\pi} (S_{t+1})$ is unknown and the current estimate $V(S_{t+1})$ is used instead. The TD target is an estimate for both reasons: it samples the expected value $E_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_{t} = s]$ and it uses the current estimate $V(S_{t+1})$ instead of
the true value $v_{\pi} (S_{t+1})$. Thus, TD methods combine the sampling of MC with the bootstrapping of DP. As we shall see, with care and imagination this can take us a long way toward obtaining the advantages of both MC and DP methods.

For TD(0), the value estimate for the current state is updated on the basis of the one sample transition from it to the immediately following state. We refer to TD and MC updates as **sample updates** because they involve looking ahead to a sample successor state (or state-action pair), using the value of the successor and the reward along the way to compute a backed-up value, and then updating the value of the original state (or state-action pair) accordingly. Sample
updates differ from the expected updates of DP methods in that they are based on a single sample successor rather than on a complete distribution of all possible successors (i.e., DP assumes that $p(s^\prime, r | s, a)$ is known).
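
The difference between an expected update and a sample update can be sketched in a few lines. The dictionary layout for the policy and the model, the state and action names, and the numbers are hypothetical choices for this illustration only.

```python
def expected_update(V, state, policy, model, gamma):
    """DP-style expected update: average over every action and successor.

    policy[state] maps each action to pi(a | s); model[(state, action)] is a
    list of (probability, reward, next_state) triples standing in for
    p(s', r | s, a).
    """
    return sum(
        action_prob * prob * (reward + gamma * V[next_state])
        for action, action_prob in policy[state].items()
        for prob, reward, next_state in model[(state, action)]
    )


def sample_update(V, state, reward, next_state, alpha, gamma):
    """TD(0)-style sample update: move part of the way toward one sampled target."""
    return V[state] + alpha * (reward + gamma * V[next_state] - V[state])


V = {"s": 0.0, "x": 1.0, "y": 3.0}
policy = {"s": {"go": 1.0}}
model = {("s", "go"): [(0.5, 0.0, "x"), (0.5, 2.0, "y")]}

# Uses the full distribution over successors; needs the model.
new_expected = expected_update(V, "s", policy, model, gamma=1.0)
# Uses one observed transition (s -> y with reward 2); needs no model.
new_sample = sample_update(V, "s", reward=2.0, next_state="y", alpha=0.5, gamma=1.0)
print(new_expected, new_sample)
```

Here the expected update averages over both successors, while the sample update only sees the single transition that happened to occur and moves a fraction $\alpha$ of the way toward its target.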

### TD error

Finally, note that the quantity in brackets in the TD(0) update is a sort of error, measuring the difference between the estimated value of $S_t$ and the better estimate $R_{t+1} + \gamma V(S_{t+1})$. This quantity, called the **TD error**, arises in various forms throughout reinforcement learning:

$\delta_{t} = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$

Notice that the TD error at each time is the error in the estimate made at that time. Because the TD error depends on the next state and next reward, it is not actually available until one time step later. That is, $\delta_{t}$ is the error in $V(S_{t})$, available at time $t+1$. Also note that if the array $V$ does not change during the episode (as it does not in MC methods), then the MC error can be written as a sum of TD errors:

$G_t - V(S_t) = R_{t+1} + \gamma G_{t+1} - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1})$

$= \delta_t + \gamma (G_{t+1} - V(S_{t+1}))$

$= \delta_t + \gamma \delta_{t+1} + \gamma^2 (G_{t+2} - V(S_{t+2}))$

$= \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \cdots + \gamma^{T-t-1} \delta_{T-1} + \gamma^{T-t} (G_{T} - V(S_T))$

$= \delta_t + \gamma \delta_{t+1} + \gamma^2 \delta_{t+2} + \cdots + \gamma^{T-t-1} \delta_{T-1} + \gamma^{T-t}(0 - 0)$ (i.e., the return from the terminal state is 0, the same as its value)

$= \sum^{T-1}_{k=t} \gamma^{k-t} \delta_{k}$

This identity is not exact if $V$ is updated during the episode, as it is in TD(0), but if the step size $\alpha$ is small then it may still hold approximately. Generalizations of this identity play an important role in the theory and algorithms of TD learning.
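
The identity can be checked numerically for a value table that is held fixed during the episode. The sketch below is only an illustration: the value table, the trajectory, and the helper names are made up, and `rewards[t]` is assumed to hold $R_{t+1}$.

```python
import math


def td_errors(V, states, rewards, gamma):
    """delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t), with V(terminal) = 0."""
    T = len(rewards)
    deltas = []
    for t in range(T):
        next_value = V[states[t + 1]] if t + 1 < T else 0.0
        deltas.append(rewards[t] + gamma * next_value - V[states[t]])
    return deltas


def mc_errors(V, states, rewards, gamma):
    """G_t - V(S_t) for every time step, with G_t computed backwards."""
    G = 0.0
    errors = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        errors[t] = G - V[states[t]]
    return errors


def discounted_td_sums(deltas, gamma):
    """sum_{k=t}^{T-1} gamma^(k-t) * delta_k for every t."""
    total = 0.0
    sums = [0.0] * len(deltas)
    for t in reversed(range(len(deltas))):
        total = deltas[t] + gamma * total
        sums[t] = total
    return sums


# Arbitrary fixed value table and one hypothetical episode.
V = {"A": 0.3, "B": -0.2, "C": 0.7}
states = ["A", "B", "A", "C"]     # S_0 .. S_3, followed by the terminal state
rewards = [1.0, 0.0, -1.0, 2.0]   # R_1 .. R_4
gamma = 0.9

lhs = mc_errors(V, states, rewards, gamma)
rhs = discounted_td_sums(td_errors(V, states, rewards, gamma), gamma)
identity_holds = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(lhs, rhs))
print(identity_holds)
```

If `V` were modified between time steps, as TD(0) does, the two lists would generally no longer agree exactly.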

## Advantages of TD Prediction Methods

TD methods update their estimates based in part on other estimates. They learn a guess from a guess; that is, they bootstrap. Is this a good thing to do? What advantages do TD methods have over MC and DP methods? In this section we briefly anticipate some of the answers.

Obviously, TD methods have an advantage over DP methods in that they do not require a model of the environment, that is, of its reward and next-state probability distributions. The next most obvious advantage of TD methods over MC methods is that they are naturally implemented in an online, fully incremental fashion: the update requires only $V(S_{t+1})$, not the return $G_{t+1}$. With MC methods, one must wait until the end of an episode, because only then is the return known, whereas with TD methods one need wait only one time step. Surprisingly often, this turns out to be a critical
consideration. Some applications have very long episodes, so that delaying all learning until the end of the episode is too slow. Other applications are continuing tasks and have no episodes at all. Finally, as we noted in the previous chapter, some MC methods must ignore or discount episodes on which experimental actions are taken (for example, actions for which $\pi(A_t | S_t) = 0$), which can greatly slow learning.

But what are the disadvantages of TD methods? Certainly it is convenient to learn one guess from the next, without waiting for an actual outcome, but can we still guarantee convergence to $v_{\pi}$? Happily, the answer is yes. For any fixed policy $\pi$, TD(0) has been proved to converge to $v_{\pi}$: in the mean for a constant step-size parameter if it is sufficiently small, and with probability 1 if the step-size parameter decreases according to the usual stochastic approximation conditions ($\sum_{n=1}^{\infty} \alpha_n(a) = \infty$ and $\sum_{n=1}^{\infty} \alpha_n^2(a) < \infty$).
Most convergence proofs apply only to the table-based case of the algorithm, but some also apply to the case of general linear function approximation.
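
As a quick worked check of the stochastic approximation conditions (an illustrative choice of step sizes, not taken from the text above), compare $\alpha_n = 1/n$ with a constant step size $\alpha_n = \alpha > 0$:

$\alpha_n = \frac{1}{n}: \quad \sum_{n=1}^{\infty} \frac{1}{n} = \infty, \quad \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty$

$\alpha_n = \alpha: \quad \sum_{n=1}^{\infty} \alpha = \infty, \quad \sum_{n=1}^{\infty} \alpha^2 = \infty$

The $1/n$ schedule satisfies both conditions, while a constant step size violates the second one, which is consistent with the weaker convergence-in-the-mean guarantee stated for constant step sizes.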

If both TD and MC methods converge asymptotically to the correct predictions, then a natural next question is "which gets there first?" In other words, which method learns faster? Which makes the more efficient use of limited data? At the current time, this is an open question in the sense that no one has been able to prove mathematically that one method converges faster than the other. In fact, it is not even clear what is the most appropriate formal way to phrase this question. In practice, however, TD methods have usually been found to converge faster than constant-$\alpha$ MC methods on stochastic tasks.

## Optimality of TD(0)

Suppose there is available only a finite amount of experience, say 10 episodes or 100 time steps. In this case, a common approach with incremental learning methods is to present the experience repeatedly until the method converges upon an answer. Given an approximate value function, $V$, the increments specified above are computed for every time step $t$ at which a non-terminal state is visited, but the value function is changed only once, by the sum of all the increments (that is, after every episode in the batch has been processed). Then all the available experience is processed again with the new value function to produce a new overall increment, and so on, until the value function converges. We call this **batch updating** because updates are made only after processing each complete batch of training data.

Under batch updating, TD(0) converges deterministically to a single answer independent of the step-size parameter, $\alpha$, as long as $\alpha$ is chosen to be sufficiently small. The constant-$\alpha$ MC method also converges deterministically under the same conditions, but to a different answer. Understanding these two answers will help us understand the difference between the two methods. Under normal updating the methods do not move all the way to their respective batch answers, but in some sense they take steps in these directions.
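
Batch updating as described above can be sketched as follows. This is a minimal illustration under stated assumptions: the episode format (a list of `(state, reward)` pairs, where the reward is the one received on leaving that state), the helper names, the stopping tolerance, and the small eight-episode data set are all chosen for this example.

```python
def td_target(V, episode, t, gamma):
    """One-step target R_{t+1} + gamma * V(S_{t+1}); V(terminal) = 0."""
    _, reward = episode[t]
    if t + 1 < len(episode):
        return reward + gamma * V[episode[t + 1][0]]
    return reward


def mc_target(V, episode, t, gamma):
    """Full return G_t from time t to the end of the episode (ignores V)."""
    G = 0.0
    for _, reward in reversed(episode[t:]):
        G = reward + gamma * G
    return G


def batch_update(episodes, target_fn, alpha=0.05, gamma=1.0,
                 tol=1e-10, max_sweeps=10_000):
    """Repeatedly sweep the whole batch, changing V once per sweep."""
    states = {state for episode in episodes for state, _ in episode}
    V = {state: 0.0 for state in states}
    for _ in range(max_sweeps):
        increments = {state: 0.0 for state in states}
        for episode in episodes:
            for t, (state, _) in enumerate(episode):
                target = target_fn(V, episode, t, gamma)
                increments[state] += alpha * (target - V[state])
        # Apply the summed increments only after the full batch is processed.
        for state in states:
            V[state] += increments[state]
        if max(abs(inc) for inc in increments.values()) < tol:
            break
    return V


# Hypothetical batch: one episode A -> B -> end with rewards 0, 0;
# six episodes B -> end with reward 1; one episode B -> end with reward 0.
episodes = [[("A", 0.0), ("B", 0.0)]] + [[("B", 1.0)]] * 6 + [[("B", 0.0)]]

V_batch_td = batch_update(episodes, td_target)
V_batch_mc = batch_update(episodes, mc_target)
print("batch TD(0):", V_batch_td)
print("batch MC:   ", V_batch_mc)
```

On this data both methods settle on $V(B) \approx 0.75$, but they disagree about $A$: batch MC gives $V(A) = 0$, the only return ever observed from $A$, while batch TD(0) gives $V(A) \approx 0.75$, because $A$ was always followed by $B$ with reward 0.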

