# Temporal Difference (TD) Learning
RL algorithm family that is model-free, like MC learning. They can learn from samples without an environment model.

TD learning uses bootstrapping: it uses existing value estimates to update value estimates.

TD learning can therefore update value estimates without requiring that an episode finishes. It can be used for episodic or sequential tasks.

## TD Return
MC learning performed the following to estimate values:
1. Sample trajectory starting from $s \in S$ or $(s, a) \in S \times A$ to the end of episode to obtain return sample $G_t$.
2. Estimate values according to $v_\pi(s) = \mathbb{E}_\pi[G_t\mid S_t = s]$ or $q_\pi(s, a) = \mathbb{E}_\pi[G_t\mid S_t = s, A_t = a]$.

We introduce the new statistical value $U_t$ - the TD return. No need to sample to end of episode to get TD return sample. We also have
\begin{align*}
v_\pi(s) &= \mathbb{E}_\pi[U_t\mid S_t = s]\\
q_\pi(s, a) &= \mathbb{E}_\pi[U_t \mid S_t = s, A_t = a]
\end{align*}
Given $n \in \mathbb{N}$, the n-step TD return is
- $n$-step TD return bootstrapped from state value:
  $$U_{t:t+n}^{(v)} = \begin{cases}\sum_{i = 1}^n\gamma^{i-1}R_{t+i} + \gamma^nv(S_{t+n}) & t + n < T\\ \sum_{i=1}^{T - t}\gamma^{i - 1}R_{t+i} & t + n \geq T\end{cases}$$
- $n$-step TD return bootstrapped from action value:
  $$U_{t:t+n}^{(q)} = \begin{cases}\sum_{i = 1}^n\gamma^{i-1}R_{t+i} + \gamma^nq(S_{t+n}, A_{t+n}) & t + n < T\\ \sum_{i=1}^{T - t}\gamma^{i - 1}R_{t+i} & t + n \geq T\end{cases}$$

When there is no confusion, we can write $U_{t: t+n}^{(q)}$ as $U_t$. From here it is possible to prove that
\begin{align*}
v_\pi(s) &= \mathbb{E}_\pi[U_t\mid S_t=s]\\
q_\pi(s, a) &= \mathbb{E}_\pi[U_t\mid S_t=s, A_t=a]
\end{align*}
Some environments provide an indicator $D_t$ to show whether state $S_t$ is a terminal state:
$$D_t = \begin{cases}
1 & S_t = s_{\text{end}}\\
0 & \text{otherwise}
\end{cases}$$
This allows the 1-step TD return to be simplified as follows:
\begin{align*}
    U_{t:t+1}^{(v)} &= R_{t+1} + \gamma(1 - D_{t+1})v(S_{t+1})\\
    U_{t:t+1}^{(q)} &= R_{t+1} + \gamma(1 - D_{t+1})q(S_{t+1}, A_{t+1})
\end{align*}
We can extend this to the next $n$ states as follows:
\begin{align*}
U_{t:t+n}^{(v)} &= \left(\sum_{i = 1}^n\gamma^{i - 1}(1 - D_{t+i-1})^{\mathbb{1}[i \geq 2]}R_{t+i}\right) + \gamma^n(1 - D_{t+n})v(S_{t+n})\\
U_{t:t+n}^{(q)} &= \left(\sum_{i = 1}^n\gamma^{i - 1}(1 - D_{t+i-1})^{\mathbb{1}[i \geq 2]}R_{t+i}\right) + \gamma^n(1 - D_{t+n})q(S_{t+n}, A_{t+n})
\end{align*}
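
The $n$-step return with indicators can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `n_step_td_return` and its argument layout are assumptions made here, with `rewards[i-1]` holding $R_{t+i}$ and `dones[i-1]` holding $D_{t+i}$.

```python
def n_step_td_return(rewards, dones, bootstrap_value, gamma):
    """n-step TD return with episode-end indicators.

    rewards[i-1] holds R_{t+i} and dones[i-1] holds D_{t+i}, for i = 1..n.
    bootstrap_value is v(S_{t+n}) or q(S_{t+n}, A_{t+n}).
    """
    n = len(rewards)
    total = 0.0
    for i in range(1, n + 1):
        # R_{t+i} is masked by (1 - D_{t+i-1}) for i >= 2; R_{t+1} is never masked.
        mask = 1.0 if i == 1 else 1.0 - dones[i - 2]
        total += gamma ** (i - 1) * mask * rewards[i - 1]
    # Bootstrap from the value estimate only if S_{t+n} is not terminal.
    total += gamma ** n * (1.0 - dones[n - 1]) * bootstrap_value
    return total
```

When $D_{t+n} = 1$ the bootstrap term vanishes, so the same expression covers both cases of the piecewise definition, provided rewards after the terminal state are padded with 0.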

Greatest advantage of TD return over MC return is that we can obtain the return without needing to complete the episode.

## On-Policy TD Learning
From previous chapter, we updated $q(S_t, A_t)$ to reduce $[G_t - q(S_t, A_t)]^2$. In MC return with incremental implementation, we use the following update rule:
$$q(S_t, A_t) \leftarrow q(S_t, A_t) + \alpha[G_t - q(S_t, A_t)]$$
In TD learning, we use $U_t$ instead of $G_t$. **Remember that $U_t$ is shorthand for $U_{t: t+n}^{(q)}$!**
$$q(S_t, A_t) \leftarrow q(S_t, A_t) + \alpha[U_t - q(S_t, A_t)]$$
Where $\alpha \in (0, 1]$.

We will also define the TD error as $\Delta_t = U_t - q(S_t, A_t)$.

In MC learning, the step size was $1/c(S_t, A_t)$ rather than a constant $\alpha$, and it decreases monotonically as the count $c(S_t, A_t)$ grows.

However, since TD uses bootstrapping, the return samples become increasingly trustworthy as the value estimates improve. We can therefore put more weight on recent samples than on historical samples.

### Algorithm 5.1: 1-step TD policy evaluation to estimate action values.
1. (Initialise) Initialise $q$ arbitrarily.
2. (TD Update) For each episode:
    1. (Initialise state-action pair) Select initial state $S$, use $\pi(\cdot\mid S)$ to determine action $A$.
    2. Loop until episode ends (e.g: reach max steps or $S$ is terminal)
        1. (Sample) Execute the action $A$, observe reward $R$ and next state $S'$.
        2. (Decide) If $S'$ not terminal:
            1. Use policy $\pi(\cdot \mid S')$ to determine action $A'$.
            2. (Calculate TD return) $U \leftarrow R + \gamma q(S', A')$.
        3. (Decide) If $S'$ is terminal: $U \leftarrow R$.
        4. (Update value estimate) Update $q(S, A)$ to reduce $[U - q(S, A)]^2$, e.g: $q(S, A) \leftarrow q(S, A) + \alpha[U - q(S, A)]$.
        5. $S \leftarrow S'$, $A \leftarrow A'$.

### Algorithm 5.2: 1-step TD policy evaluation to estimate action values with indicator of episode end.
1. (Initialise) Initialise $q$ arbitrarily.
2. (TD Update) For each episode:
    1. (Initialise state-action pair) Select initial state $S$, use $\pi(\cdot\mid S)$ to determine action $A$.
    2. Loop until episode ends (e.g: reach max steps or $S$ is terminal)
        1. (Sample) Execute the action $A$, observe reward $R$, next state $S'$, and indicator of episode end $D'$.
        2. (Decide) Use policy $\pi(\cdot \mid S')$ to determine action $A'$ (can be arbitrarily chosen if $D' = 1$).
        3. (Calculate TD return) $U \leftarrow R + \gamma (1 - D')q(S', A')$. No separate terminal case is needed, because $D' = 1$ removes the bootstrap term.
        4. (Update value estimate) Update $q(S, A)$ to reduce $[U - q(S, A)]^2$, e.g: $q(S, A) \leftarrow q(S, A) + \alpha[U - q(S, A)]$.
        5. $S \leftarrow S'$, $A \leftarrow A'$.
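
The indicator-based evaluation loop can be sketched in Python as below. The environment interface is an assumption made for illustration only: `reset()` returns an initial state and `step(s, a)` returns `(reward, next_state, done)`. The three-state chain and the names `chain_reset`, `chain_step` and `always_zero` are likewise hypothetical.

```python
from collections import defaultdict

def td_policy_evaluation(reset, step, policy, episodes, alpha=0.1, gamma=0.9):
    """1-step TD evaluation of q for a fixed policy (hypothetical env interface)."""
    q = defaultdict(float)  # arbitrary initialisation: every q(s, a) starts at 0
    for _ in range(episodes):
        s = reset()
        a = policy(s)
        done = False
        while not done:
            r, s_next, done = step(s, a)   # sample R, S', D'
            a_next = policy(s_next)        # arbitrary when D' = 1
            # (1 - done) removes the bootstrap term at the end of the episode.
            u = r + gamma * (1 - done) * q[(s_next, a_next)]
            q[(s, a)] += alpha * (u - q[(s, a)])
            s, a = s_next, a_next
    return q

# Toy chain (assumed for illustration): 0 -> 1 -> 2 (terminal), reward 1 per step.
def chain_reset():
    return 0

def chain_step(s, a):
    s_next = s + 1
    return 1.0, s_next, s_next == 2

def always_zero(s):
    return 0  # single-action policy

q = td_policy_evaluation(chain_reset, chain_step, always_zero, episodes=200, alpha=0.5)
```

On this chain the true values under the single-action policy are $q(1, 0) = 1$ and $q(0, 0) = 1 + 0.9 \cdot 1 = 1.9$, and the estimates should approach them.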

The same steps can be used when estimating state values $v(S)$ (Algorithm 5.3).

### Algorithm 5.3: $n$-step TD policy evaluation to estimate action values.
1. (Initialise) Initialise the action value estimates $q$ arbitrarily.
2. (TD update) For each episode:
    1. (Sample first n steps) Use $\pi$ to generate first $n$ steps of trajectory $S_0, A_0, R_1, \dots, R_n, S_n$. If terminal state is encountered, set subsequent rewards to 0 and subsequent states to $s_\text{end}$. Each $S_t$ is accompanied by episode end indicator $D_t$.
    2. For $t = 0, 1, 2, \dots$ until $S_t = s_\text{end}$:
        1. (Decide) Use $\pi(\cdot\mid S_{t+n})$ to determine the action $A_{t+n}$. Can be arbitrarily chosen if $D_{t+n} = 1$.
        2. (Calculate TD return) $U \leftarrow R_{t+1} + \gamma(1 - D_{t+1})R_{t+2} + \dots + \gamma^{n-1}(1 - D_{t+n-1})R_{t+n} + \gamma^n(1-D_{t+n})q(S_{t+n}, A_{t+n})$.
        3. (Update value estimate) Update $q(S_t, A_t)$ to reduce $[U - q(S_t, A_t)]^2$. For example:
            1. $q(S_t, A_t) \leftarrow q(S_t, A_t) + \alpha[U - q(S_t, A_t)]$.
        4. (Sample) If $S_{t+n}$ is not the terminal state, execute $A_{t+n}$ and observe the reward $R_{t+n+1}$, next state $S_{t+n+1}$ and indicator of episode end $D_{t+n+1}$. If $S_{t+n}$ is the terminal state, set $R_{t+n+1} \leftarrow 0$, $S_{t+n+1} \leftarrow s_\text{end}$ and $D_{t+n+1} \leftarrow 1$.


## SARSA
State-Action-Reward-State-Action method based on random variable $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$. This algorithm uses
$$U_t = R_{t+1} + \gamma (1 - D_{t+1})q(S_{t+1}, A_{t+1})$$
to get the one-step return $U_t$ and then update $q(S_t, A_t)$ using
$$q(S_t, A_t) \leftarrow q(S_t, A_t) + \alpha[U_t - q(S_t, A_t)]$$
SARSA is an on-policy TD **policy optimisation algorithm, not a policy evaluation algorithm.**
### Algorithm 5.6: SARSA
1. (Initialise) Initialise $q$ arbitrarily.
2. (TD Update) For each episode:
    1. (Initialise state-action pair) Select initial state $S$, use $\pi(\cdot\mid S)$ to determine action $A$.
    2. Repeatedly perform the following until the episode ends:
        1. (Sample) Execute action $A$, then observe reward $R$, next state $S'$ and indicator $D'$.
        2. (Decide) Use the input policy $\pi(\cdot \mid S')$ to determine the action $A'$ (if $D' = 1$, $A'$ can be chosen arbitrarily).
        3. (Calculate TD return) $U \leftarrow R + \gamma (1 - D')q(S', A')$.
        4. (Update value estimate) Update $q(S, A)$ to reduce $[U - q(S, A)]^2$. For example: $q(S, A) \leftarrow q(S, A) + \alpha[U - q(S, A)]$.
        5. (Improve policy) Use $q(S, \cdot)$ to modify $\pi(\cdot \mid S)$ (e.g: via $\epsilon$-greedy policy).
        6. $S \leftarrow S'$, $A \leftarrow A'$.
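
A minimal Python sketch of the SARSA loop follows, with the policy maintained implicitly as $\epsilon$-greedy over $q$. The environment interface (`reset()`, `step(s, a)` returning `(reward, next_state, done)`), the corridor task, and all function names are assumptions made for this illustration.

```python
import random
from collections import defaultdict

def epsilon_greedy(q, s, actions, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(s, a)])

def sarsa(reset, step, actions, episodes, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # arbitrary initialisation
    for _ in range(episodes):
        s = reset()
        a = epsilon_greedy(q, s, actions, epsilon, rng)
        done = False
        while not done:
            r, s_next, done = step(s, a)
            # A' comes from the same (implicit) policy, so the method is on-policy.
            a_next = epsilon_greedy(q, s_next, actions, epsilon, rng)
            u = r + gamma * (1 - done) * q[(s_next, a_next)]
            q[(s, a)] += alpha * (u - q[(s, a)])
            # Policy improvement is implicit: the next epsilon_greedy call reads the new q.
            s, a = s_next, a_next
    return q

# Hypothetical corridor: states 0..3, action 0 moves left, action 1 moves right,
# reward -1 per step, episode ends on reaching state 3.
def corridor_reset():
    return 0

def corridor_step(s, a):
    s_next = max(0, s - 1) if a == 0 else s + 1
    return -1.0, s_next, s_next == 3

q = sarsa(corridor_reset, corridor_step, [0, 1], episodes=500)
```

Every episode ends with the pair $(2, \text{right})$, whose TD return is always $-1$, so `q[(2, 1)]` should settle at $-1$.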

### Algorithm 5.8: $n$-step SARSA
1. (Initialise) Initialise $q$ arbitrarily.
2. (TD update) For each episode:
    1. (Sample first $n$ steps) Use policy derived from action values $q$ (e.g: via $\epsilon$-greedy) to generate first $n$ steps of trajectory $S_0, A_0, R_1, \dots, R_n, S_n$. If terminal state encountered, set subsequent rewards to 0 and subsequent states to $s_\text{end}$.
    2. For $t = 0, 1, 2, \dots$ until $S_t = s_\text{end}$:
        1. (Decide) Use policy derived from action values $q(S_{t+n}, \cdot)$ to determine action $A_{t+n}$.
        2. (Calculate TD return) $U \leftarrow R_{t+1} + \gamma (1 - D_{t+1})R_{t+2} + \dots + \gamma^{n-1}(1 - D_{t+n-1})R_{t+n} + \gamma^n(1 - D_{t+n})q(S_{t+n}, A_{t+n})$.
        3. (Update value estimate) Update $q(S_t, A_t)$ to reduce $[U - q(S_t, A_t)]^2$. For example: $q(S_t, A_t) \leftarrow q(S_t, A_t) + \alpha[U - q(S_t, A_t)]$.
        4. (Sample) If $S_{t + n} \neq s_\text{end}$, execute $A_{t+n}$ and observe reward $R_{t+n+1}$, next state $S_{t+n+1}$ and indicator $D_{t+n+1}$.
        5. If $S_{t+n} = s_\text{end}$, then set $R_{t+n+1} \leftarrow 0$, $S_{t+n+1} \leftarrow s_\text{end}$, and $D_{t+n+1} \leftarrow 1$.

## Expected SARSA
Uses $U_{t:t+1}^{(v)} = R_{t+1} + \gamma (1-D_{t+1})v(S_{t+1})$ instead of $U_{t:t+1}^{(q)}$.

According to relationship between $v$ and $q$, state-value 1-step TD return can be written as:
$$U_t = R_{t+1} + \gamma(1 - D_{t+1})\sum_{a\in A(S_{t+1})}\pi(a\mid S_{t+1})q(S_{t+1}, a)$$
This makes expected SARSA more computationally expensive, because it has an extra summation to calculate. However, taking the expectation over actions reduces the negative impact of occasionally sampling a bad action in the later stages of learning. Expected SARSA is therefore more stable than SARSA, and can generally use a larger $\alpha$.

### Algorithm 5.9: Expected SARSA
1. (Initialise) Initialise $q$ arbitrarily. If policy explicitly maintained, use $q$ to determine $\pi$ (e.g: through $\epsilon$-greedy).
2. (TD update) For each episode:
    1. (Initialise state) Select initial state $S$.
    2. Loop until episode ends:
        1. (Decide) Use $\pi(\cdot \mid S)$ to determine action $A$.
        2. (Sample) Execute the action $A$, then observe $R, S', D'$.
        3. (Calculate TD return) $U \leftarrow R + \gamma(1 - D')\sum_{a\in A(S')}\pi(a\mid S')q(S', a)$.
        4. (Update value estimate) Update $q(S, A)$ to reduce $[U - q(S, A)]^2$. E.g: $q(S, A) \leftarrow q(S, A) + \alpha[U - q(S, A)]$.
        5. (Improve policy) For situation that policy is maintained explicitly, use action values $q(S, \cdot)$ to modify $\pi(\cdot\mid S)$.
        6. $S\leftarrow S'$.
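
The TD return in the "Calculate TD return" step can be sketched as below, assuming the policy is $\epsilon$-greedy with respect to $q(S', \cdot)$. The helper names `epsilon_greedy_probs` and `expected_sarsa_target` are hypothetical, and `q_next` is assumed to be a list holding $q(S', a)$ for each action index $a$.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities pi(a | s) of an epsilon-greedy policy."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs

def expected_sarsa_target(r, done, q_next, epsilon, gamma):
    """U = R + gamma * (1 - D') * sum_a pi(a | S') q(S', a)."""
    probs = epsilon_greedy_probs(q_next, epsilon)
    v_next = sum(p * qv for p, qv in zip(probs, q_next))
    return r + gamma * (1 - done) * v_next
```

Unlike SARSA, no next action $A'$ is sampled for the target, which is why the algorithm only carries $S$ forward between steps.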

### Algorithm 5.10: $n$-step Expected SARSA
1. (Initialise) Initialise $q$ arbitrarily. If $\pi$ maintained explicitly, use $q$ to determine $\pi$ (e.g: through $\epsilon$-greedy policy).
2. (TD update) For each episode:
    1. (Sample first $n$ steps) Use the policy $\pi$ to generate first $n$ steps of trajectory $S_0, A_0, R_1, \dots, R_n, S_n$. If policy implicitly maintained, use policy derived from $q$. If terminal state encountered, set subsequent rewards to 0 and subsequent states to $s_\text{end}$.
    2. For $t = 0, 1, 2, \dots$ until $S_t = s_\text{end}$:
        1. (Calculate TD return) $U \leftarrow R_{t+1} + \sum_{k=1}^{n-1}\gamma^k(1 - D_{t+k})R_{t+k+1} + \gamma^n(1-D_{t+n})\sum_{a \in A(S_{t+n})}\pi(a\mid S_{t+n})q(S_{t+n}, a)$.
        2. (Update action-value estimates) Update $q(S_t, A_t)$ to reduce $[U - q(S_t, A_t)]^2$ (e.g: $q(S_t, A_t) \leftarrow q(S_t, A_t) + \alpha[U - q(S_t, A_t)]$).
        3. (Improve policy) If policy is maintained explicitly, use $q(S_t, \cdot)$ to update $\pi(\cdot\mid S_t)$.
        4. (Decide and sample) If $S_{t+n}\neq s_\text{end}$:
            1. Use $\pi(\cdot\mid S_{t+n})$ or policy derived from $q(S_{t+n}, \cdot)$ to determine action $A_{t+n}$.
            2. Execute $A_{t+n}$ to observe $R_{t+n+1}, S_{t+n+1}, D_{t+n+1}$.
        5. If $S_{t+n} = s_\text{end}$, set $(R_{t+n+1}, S_{t+n+1}, D_{t+n+1}) \leftarrow (0, s_\text{end}, 1)$.

## Off-Policy TD Learning
Generally more popular than on-policy.
### Importance Sampling
$$\rho_{t+1:t+n-1} = \frac{\text{Pr}_\pi[R_{t+1}, S_{t+1}, A_{t+1}, \dots, S_{t+n}\mid S_t, A_t]}{\text{Pr}_b[R_{t+1}, S_{t+1}, A_{t+1}, \dots, S_{t+n}\mid S_t, A_t]} = \prod_{\tau = t+1}^{t+n-1}\frac{\pi(A_\tau\mid S_\tau)}{b(A_\tau\mid S_\tau)}$$
### Algorithm 5.11: $n$-step TD policy evaluation of SARSA with importance sampling.
1. (Initialise) Initialise $q$ arbitrarily.
2. (TD update) For each episode:
    1. (Designate behaviour policy) Designate $b$ such that $\pi \ll b$.
    2. (Sample first $n$ steps) Use the behaviour policy $b$ to generate first $n$ steps of trajectory $S_0, A_0, R_1, \dots, R_n, S_n$. If terminal state is encountered, set subsequent rewards to 0 and states to $s_\text{end}$. Each $S_t$ has an associated $D_t$.
    3. For $t = 0, 1, 2, \dots$ until $S_t = s_\text{end}$:
        1. (Decide) Use the behaviour policy $b(\cdot\mid S_{t+n})$ to determine the action $A_{t+n}$.
        2. (Calculate TD return) $U \leftarrow \left(R_{t+1} + \sum_{\tau=1}^{n-1}\gamma^{\tau}(1-D_{t+\tau})R_{t+\tau+1}\right) + \gamma^n(1-D_{t+n})q(S_{t+n}, A_{t+n})$.
        3. (Calculate imp. samp. ratio) $\rho \leftarrow \prod_{\tau = t + 1}^{\min\{t+n, T\} - 1}\frac{\pi(A_\tau\mid S_\tau)}{b(A_\tau\mid S_\tau)}$.
        4. (Update value estimate) Update $q(S_t, A_t)$ to reduce $\rho[U - q(S_t, A_t)]^2$. For example: $q(S_t, A_t) \leftarrow q(S_t, A_t) + \alpha\rho[U - q(S_t, A_t)]$.
        5. (Improve policy) For policy optimisation where the policy is maintained explicitly, use $q(S_t, \cdot)$ to update $\pi(\cdot\mid S_t)$.
        6. (Sample) If $S_{t+n}\neq s_\text{end}$ then execute $A_{t+n}$. Observe $R_{t+n+1}$, $S_{t+n+1}$, $D_{t+n+1}$.
        7. If $S_{t+n}$ terminal, then $R_{t+n+1} \leftarrow 0$, $S_{t+n+1} \leftarrow s_\text{end}$ and $D_{t+n+1} \leftarrow 1$.

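A similar method can be used for estimating state values and for the expected SARSA algorithm. For state values, the trajectory probabilities are conditioned on $S_t$ only, so they also include the choice of $A_t$: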
\begin{align*}
\text{Pr}_\pi[A_t, R_{t+1}, S_{t+1}, A_{t+1}, \dots, S_{t+n}\mid S_t] &= \prod_{\tau = t}^{t+n-1}\pi(A_\tau\mid S_\tau)\prod_{\tau=t}^{t+n-1}p(S_{\tau+1}, R_{\tau+1}\mid S_\tau, A_\tau)\\
\text{Pr}_b[A_t, R_{t+1}, S_{t+1}, A_{t+1}, \dots, S_{t+n}\mid S_t] &= \prod_{\tau = t}^{t+n-1}b(A_\tau\mid S_\tau)\prod_{\tau=t}^{t+n-1}p(S_{\tau+1}, R_{\tau+1}\mid S_\tau, A_\tau)
\end{align*}
The ratio between the two is as follows:
$$
\rho_{t:t+n-1} = \frac{\text{Pr}_\pi[\dots]}{\text{Pr}_b[\dots]} = \prod_{\tau = t}^{t + n - 1}\frac{\pi(A_\tau\mid S_\tau)}{b(A_\tau \mid S_\tau)}
$$
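
Because the transition probabilities $p$ cancel, the ratio depends only on the two policies. A minimal sketch, assuming the caller has already collected $\pi(A_\tau\mid S_\tau)$ and $b(A_\tau\mid S_\tau)$ along the trajectory into two lists (the function name and argument layout are hypothetical):

```python
def importance_ratio(pi_probs, b_probs):
    """Product of pi(A_tau | S_tau) / b(A_tau | S_tau) over the supplied steps.

    For action values pass tau = t+1 .. t+n-1; for state values also include tau = t.
    Assumes b(A_tau | S_tau) > 0 wherever pi(A_tau | S_tau) > 0.
    """
    rho = 1.0
    for p, bp in zip(pi_probs, b_probs):
        rho *= p / bp
    return rho
```

For the 1-step action-value case the product is empty, so the ratio is 1 and no correction is applied.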

## Q Learning
Recall that SARSA uses
$$U_t = R_{t+1} + \gamma (1 - D_{t + 1})q(S_{t+1}, A_{t+1})$$
to generate return samples. Expected SARSA algorithm uses
$$U_t = R_{t+1} + \gamma (1 - D_{t+1})v(S_{t+1})$$

In both algorithms, return samples are generated using policy that is currently being maintained.

After action values are updated, the policy is updated either explicitly or implicitly.

Q learning takes this a step further: it uses the improved deterministic policy to generate the action sample at $S_{t+1}$, and calculates the TD return from that sample.
$$U_t = R_{t+1} + \gamma (1 - D_{t+1})\max_{a \in A(S_{t+1})}q(S_{t+1}, a)$$
Rationale:
- When using $S_{t+1}$ to back up $U_t$, we can use the improved policy according to $q(S_{t+1}, \cdot)$ rather than the original $q(S_{t+1}, A_{t+1})$ or $v(S_{t+1})$, so that the value is closer to the optimal value.
- However, the deterministic policy used to generate this sample is not the maintained policy, which is usually $\epsilon$-soft, so Q learning is off-policy.

### Algorithm 5.12: Q Learning
1. (Initialise) Initialise $q$ arbitrarily.
2. (TD Update) For each episode:
    1. (Initialise state) Select initial state $S$.
    2. Loop until episode ends:
        1. (Decide) Use policy derived from $q$ (say, $\epsilon$-greedy) to determine $A$.
        2. (Sample) Execute $A$, observe $R$, $S'$, $D'$.
        3. (Calculate TD return) $U \leftarrow R + \gamma (1 - D')\max_{a\in A(S')}q(S', a)$.
        4. (Update value estimate) Update $q(S,A)$ to reduce $[U - q(S, A)]^2$.
        5. (Improve policy) Use $q(S, \cdot)$ to modify $\pi(\cdot\mid S)$ (e.g: via $\epsilon$-greedy).
        6. $S\leftarrow S'$.
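
A minimal Python sketch of the Q learning loop follows. As before, the environment interface (`reset()`, `step(s, a)` returning `(reward, next_state, done)`), the corridor task and the function names are assumptions made for illustration.

```python
import random
from collections import defaultdict

def q_learning(reset, step, actions, episodes, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # arbitrary initialisation
    for _ in range(episodes):
        s = reset()
        done = False
        while not done:
            # Behaviour: epsilon-greedy with respect to the current q.
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda x: q[(s, x)])
            r, s_next, done = step(s, a)
            # Target: greedy value at S', regardless of what is executed next.
            u = r + gamma * (1 - done) * max(q[(s_next, x)] for x in actions)
            q[(s, a)] += alpha * (u - q[(s, a)])
            s = s_next
    return q

# Hypothetical corridor: states 0..3, action 0 moves left, action 1 moves right,
# reward -1 per step, episode ends on reaching state 3.
def corridor_reset():
    return 0

def corridor_step(s, a):
    s_next = max(0, s - 1) if a == 0 else s + 1
    return -1.0, s_next, s_next == 3

q = q_learning(corridor_reset, corridor_step, [0, 1], episodes=500)
```

On this deterministic corridor the optimal value of moving right from state 0 is $-1 + 0.9(-1 + 0.9 \cdot (-1)) = -2.71$. Since $q$ starts at 0, above the optimal values, the estimates approach them from above.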

Multi-step Q learning uses following TD return:
$$U_t = R_{t+1} + \gamma(1 - D_{t+1})R_{t+2} + \dots + \gamma^{n-1}(1 - D_{t+n-1})R_{t+n} + \gamma^n(1-D_{t+n})\max_{a\in A(S_{t+n})}q(S_{t+n}, a)$$
## Double Q Learning
The Q learning algorithm bootstraps on $\max_a q(S_{t+1}, a)$ to obtain the TD return for updating. This may cause the optimal action value estimate to be overestimated, which is known as "Maximisation Bias".

### Double Q Learning
Reduces impact of maximisation bias by using two independent estimates of action value $q^{(0)}$ and $q^{(1)}$. It replaces $\max_aq(S_{t+1}, a)$ with:
$$q^{(0)}\left(S_{t+1}, \arg\max_aq^{(1)}(S_{t+1}, a)\right)$$
or
$$q^{(1)}\left(S_{t+1}, \arg\max_aq^{(0)}(S_{t+1}, a)\right)$$
If the following are true:
- $q^{(0)}$ and $q^{(1)}$ are completely independent estimates,
- $\mathbb{E}[q^{(0)}(S_{t+1}, A^*)] = q(S_{t+1}, A^*)$ where $A^* = \arg\max_a q^{(1)}(S_{t+1}, a)$,

then the bias is completely eliminated.

Though there are 2 different $q$ data structures to maintain and update, and they may not be completely independent at any given time, the maximisation bias is much less than using a single action value estimate.

In Double-Q Learning, each learning step updates the action value estimates in one of the two following ways, where $k \in \{0,1\}$ can be arbitrarily chosen.
1. Calculate TD return $U^{(k)}_t = R_{t+1} + \gamma (1 - D_{t+1})q^{(1 - k)}\left(S_{t+1}, \arg\max_aq^{(k)}(S_{t+1}, a)\right)$,
2. Update $q^{(k)}(S_t, A_t)$ to minimise $[U_t^{(k)} - q^{(k)}(S_t, A_t)]^2$ via $q^{(k)}(S_t, A_t) \leftarrow q^{(k)}(S_t, A_t) + \alpha[U_t^{(k)} - q^{(k)}(S_t, A_t)]$.

### Algorithm 5.13: Double Q Learning
1. (Initialise) Initialise $q^{(i)}$ arbitrarily for $i \in \{0, 1\}$.
2. Loop until episode ends:
    1. (Decide) Use the policy derived from $q^{(0)} + q^{(1)}$ to determine $A$.
    2. (Sample) Execute action $A$, then observe $R$, $S'$, $D'$.
    3. (Choose between two action value estimates) Choose a value $I \in \{0, 1\}$ with equal probability. Then $q^{(I)}$ will be the action value estimate to be updated.
    4. (Calculate TD return) $U \leftarrow R + \gamma (1 - D') q^{(1 - I)}(S', \arg\max_a q^{(I)}(S', a))$.
    5. (Update value estimate) Update $q^{(I)}(S, A)$ to reduce $[U - q^{(I)}(S, A)]^2$ via $q^{(I)}(S, A) \leftarrow q^{(I)}(S, A) + \alpha [U - q^{(I)}(S, A)]$.
    6. $S \leftarrow S'$.
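
The TD return and update steps can be sketched as a single function. This is an illustration under assumptions: `q` is taken to be a list of two `defaultdict` tables, and the caller is expected to draw `i` from $\{0, 1\}$ with equal probability (for example with `random.Random().randrange(2)`). The function name is hypothetical.

```python
from collections import defaultdict

def double_q_update(q, i, s, a, r, s_next, done, actions, alpha, gamma):
    """Update q[i]: q[i] selects the action at S', q[1 - i] evaluates it."""
    a_star = max(actions, key=lambda x: q[i][(s_next, x)])
    u = r + gamma * (1 - done) * q[1 - i][(s_next, a_star)]
    q[i][(s, a)] += alpha * (u - q[i][(s, a)])
    return u

# Example tables: the two estimates disagree about the best action at "s1".
q = [defaultdict(float), defaultdict(float)]
q[0][("s1", 0)], q[0][("s1", 1)] = 1.0, 2.0
q[1][("s1", 0)], q[1][("s1", 1)] = 5.0, 0.5
u = double_q_update(q, 0, "s0", 0, 1.0, "s1", False, [0, 1], alpha=0.5, gamma=0.5)
```

Here $q^{(0)}$ prefers action 1 at `"s1"`, and that action is evaluated with $q^{(1)}$, so the return uses $0.5$ rather than either table's own maximum.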

Double Q Learning can be extended to maintain $n \geq 2$ estimates $q^{(i)}$, and then use $\min_{0 \leq i < n}q^{(i)}$ to calculate the TD return sample.

## Eligibility Trace
Mechanism that trades off between MC learning and TD learning. Can improve learning performance with simple implementation.
### $\lambda$ Return
Given $\lambda \in [0,1]$, $\lambda$-return is the weighted average of TD returns $\{U_{t:t+n}\}_{n=1}^\infty$ with weights $\{(1 - \lambda)\lambda^{n-1}\}_{n = 1}^\infty$.

Episodic Task:
$$U_t^{\langle \lambda\rangle} = (1 - \lambda)\sum_{n = 1}^{T - t - 1}\lambda^{n-1}U_{t: t+n} + \lambda^{T - t - 1}G_t$$
Sequential Task:
$$U_t^{\langle \lambda\rangle} = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}U_{t:t+n}$$
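
For the episodic case, the $\lambda$-return can be computed from one recorded episode suffix. A minimal sketch, assuming `rewards[k]` holds $R_{t+k+1}$ for $k = 0, \dots, T-t-1$ and `values[k]` holds the bootstrap value at $S_{t+k+1}$ for $k = 0, \dots, T-t-2$ (the function name and this layout are assumptions made here):

```python
def lambda_return(rewards, values, gamma, lam):
    """Episodic lambda-return U_t^<lambda> from rewards and bootstrap values."""
    horizon = len(rewards)  # T - t
    total = 0.0
    discounted_sum = 0.0    # running sum of gamma^(i-1) * R_{t+i}
    for n in range(1, horizon):
        discounted_sum += gamma ** (n - 1) * rewards[n - 1]
        u_n = discounted_sum + gamma ** n * values[n - 1]  # n-step return U_{t:t+n}
        total += (1 - lam) * lam ** (n - 1) * u_n
    # The remaining weight lam^(T-t-1) goes to the full MC return G_t.
    g = discounted_sum + gamma ** (horizon - 1) * rewards[horizon - 1]
    total += lam ** (horizon - 1) * g
    return total
```

Setting $\lambda = 0$ recovers the 1-step TD return $U_{t:t+1}$ (when $T - t > 1$), and $\lambda = 1$ recovers the MC return $G_t$.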
### Offline $\lambda$ Return
Uses $\lambda$ return to update the value estimates, either action-value or state value estimates.

Compared with the MC update, it only changes the return samples from $G_t$ to $U_t^{\langle \lambda\rangle}$. For episodic tasks, the offline $\lambda$ return algorithm calculates $U_t^{\langle \lambda \rangle}$ for each $t$ and updates all action value estimates.

Sequential tasks cannot use offline $\lambda$ return algorithm, since $U_t^{\langle \lambda\rangle}$ cannot be computed.

"Offline" included because it only updates at the end of episodes, like MC algorithms. Cannot be used for other tasks.

Since $\lambda$-return trades off between $G_t$ and $U_{t:t+1}$, offline $\lambda$-return algorithm may perform better than both MC and TD learning in some tasks. However, there are some shortcomings:
- Can only be used in episodic tasks, not sequential.
- At the end of each episode, it needs to calculate $U_t^{\langle \lambda\rangle}$, which requires a lot of computation.

Eligibility trace algorithms in the next section will deal with these shortcomings.

## $\text{TD}(\lambda)$
Improved algorithm based on offline $\lambda$ return algorithm.

When the offline $\lambda$ return algorithm updates the value estimate $q(S_{t-n}, A_{t-n})$ or $v(S_{t-n})$, the weight assigned to the $n$-step TD return $U_{t-n:t}$ is $(1-\lambda)\lambda^{n-1}$. Although we can calculate $U_{t-n}^{\langle\lambda\rangle}$ only at the end of the episode, we can calculate $U_{t-n:t}$ as soon as we obtain $(S_t, A_t)$, so we can partially update $q(S_{t-n}, A_{t-n})$ at that point.

Eligibility trace $e_t(s,a)$ represents the weight of using 1-step TD return. It is defined as follows:
\begin{align*}
e_0(s,a) &= 0\\
e_t(s,a) &= \begin{cases}
    1 + \beta \gamma \lambda e_{t-1}(s,a) & S_t=s, A_t=a\\
    \gamma \lambda e_{t-1}(s,a) & \text{otherwise.}
    \end{cases}
\end{align*}
Where $\beta \in [0,1]$.
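
The recursion can be sketched as a single update applied at each time step. This is a minimal illustration in which the traces are stored in a `defaultdict` keyed by `(state, action)`; the function name `update_traces` is hypothetical.

```python
from collections import defaultdict

def update_traces(e, s, a, gamma, lam, beta):
    """Apply one step of the trace recursion for the visited pair (s, a)."""
    prev = e[(s, a)]               # e_{t-1}(s, a); 0 if the pair was never visited
    for key in list(e):
        e[key] *= gamma * lam      # every pair decays by gamma * lambda
    # The visited pair is reinforced: 1 + beta * gamma * lambda * e_{t-1}(s, a).
    e[(s, a)] = 1.0 + beta * gamma * lam * prev
    return e

e = defaultdict(float)             # e_0(s, a) = 0 for all pairs
update_traces(e, 0, 0, gamma=0.5, lam=0.5, beta=1.0)
update_traces(e, 1, 0, gamma=0.5, lam=0.5, beta=1.0)
update_traces(e, 0, 0, gamma=0.5, lam=0.5, beta=1.0)
```

With $\beta = 1$ the trace of a revisited pair accumulates on top of its decayed value, while with $\beta = 0$ it is reset to exactly 1 on every visit.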

State-action pair $(S_\tau, A_\tau)$ is $t - \tau$ steps away from time $t$, so the weight of $U_{\tau:t}$ in the $\lambda$ return $U_\tau^{\langle\lambda\rangle}$ is $(1-\lambda)\lambda^{t - \tau - 1}$.

Notice that
$$U_{\tau: t} = R_{\tau + 1} + \dots + \gamma^{t - \tau - 1}U_{t-1: t}$$
and $U_{t-1: t}$ should be added up to $U_\tau^{\langle\lambda\rangle}$ with the discount $(1-\lambda)(\lambda\gamma)^{t - \tau - 1}$.