# TD($λ$) learning

TD(0) has the same objective as Monte Carlo: on-policy estimation of the state-value function V(s).

The TD(0) strategy can be loosely described as: "I don't want to wait until the cumulative reward can be calculated. Instead, I will estimate V(s) from the immediate reward plus my current estimate of the next state in the training episode: if the next state is a good one, my current state must be good too."

TD($λ$), where $λ∈[0,1]$, goes one step further. When a big reward is received, TD(0) gives credit only to the state immediately before it, whereas TD(1) behaves like Monte Carlo and passes credit all the way back to the first state of the episode. *In-between* values give progressively less credit to states further from the big reward.
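The credit-assignment picture above can be made concrete with a small sketch. It assumes the accumulating-trace form of TD(λ), in which a state visited once, k steps before a reward, has its share of that reward's TD error scaled by $(γλ)^k$. The helper name `credit_weights` is hypothetical and only meant for illustration.

```python
def credit_weights(num_steps, gamma, lambda_):
    """Trace weight of a state visited k steps before the reward, for k = 0..num_steps-1."""
    decay = gamma * lambda_
    return [decay ** k for k in range(num_steps)]


# lambda_ = 0: only the state right before the reward is credited (TD(0)).
print(credit_weights(4, gamma=1.0, lambda_=0.0))
# lambda_ = 1: credit reaches the start of the episode, discounted only by gamma.
print(credit_weights(4, gamma=0.9, lambda_=1.0))
# In-between values shrink the credit geometrically with distance.
print(credit_weights(4, gamma=0.9, lambda_=0.5))
```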

TD(λ) (Temporal-Difference learning with λ-returns and eligibility traces) is a method in reinforcement learning for estimating the value function of a given policy.

It combines ideas from:

- TD(0) — bootstrapping with one-step lookahead

- Monte Carlo — using full-episode returns

The parameter $λ∈[0,1]$ controls the trade-off between these two.
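One way to see how λ interpolates between the two is the forward view, where the update target is the λ-return: a weighted average of n-step returns with weights $(1-λ)λ^{n-1}$, with the remaining weight placed on the full return. The sketch below assumes a finished episode stored as plain lists (`rewards[k]` is received on leaving the state at step k, `values[k]` is the current estimate for that state, and the terminal state has value zero). The function names are illustrative, not taken from any library.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from step t; bootstraps from values[t + n] unless the episode has ended."""
    T = len(rewards)
    steps = min(n, T - t)
    g = sum(gamma ** i * rewards[t + i] for i in range(steps))
    if t + n < T:
        g += gamma ** n * values[t + n]
    return g


def lambda_return(rewards, values, t, gamma, lambda_):
    """Weighted average of n-step returns from step t, as in the forward view of TD(lambda)."""
    horizon = len(rewards) - t
    total = 0.0
    for n in range(1, horizon):
        weight = (1 - lambda_) * lambda_ ** (n - 1)
        total += weight * n_step_return(rewards, values, t, n, gamma)
    # The remaining weight goes to the full (Monte Carlo) return.
    total += lambda_ ** (horizon - 1) * n_step_return(rewards, values, t, horizon, gamma)
    return total


rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.2, 0.7]
print(lambda_return(rewards, values, t=0, gamma=0.9, lambda_=0.0))  # one-step TD target
print(lambda_return(rewards, values, t=0, gamma=0.9, lambda_=1.0))  # full discounted return
```

With `lambda_=0.0` only the one-step target survives, and with `lambda_=1.0` only the full return does, which is the trade-off the parameter controls.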

#### Use cases

- TD(0): When you want stable, incremental learning

- TD(λ): When you want faster convergence and more credit assignment flexibility

- TD(1) or Monte Carlo: When full returns are available and sample efficiency is less of a concern



#### Monte Carlo vs TD(0): Summary and Trade-offs

Monte Carlo:

- Updates occur only after the episode ends.

- Uses the full return from the current state to the end of the episode as the target.

- Does not use bootstrapping — it relies purely on actual returns.

- Provides unbiased estimates of the value function.

- Has high variance, especially in long or stochastic episodes.

- Slower learning because it updates each state only once per episode.

- Not suitable for online learning (can't learn during an episode).

- Less sample-efficient.

- Can be unstable when combined with function approximation due to noisy targets.
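As a minimal sketch of the Monte Carlo side, the following computes the full discounted return for every step of a finished episode and then moves each visited state's value towards it. It assumes the every-visit variant and a plain dict for V; `discounted_returns` and `mc_update` are hypothetical helper names.

```python
def discounted_returns(rewards, gamma):
    """Return G_t for every step t, computed backwards from the end of the episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def mc_update(V, states, rewards, gamma, alpha):
    """Every-visit Monte Carlo: move V[s_t] towards G_t for each step of a finished episode."""
    for s, g in zip(states, discounted_returns(rewards, gamma)):
        old = V.get(s, 0.0)
        V[s] = old + alpha * (g - old)
    return V


# Nothing can be updated until the whole reward sequence is known.
V = mc_update({}, states=["A", "B"], rewards=[0.0, 1.0], gamma=0.9, alpha=1.0)
print(V)
```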

TD(0):

- Updates occur after every step.

- Uses the immediate reward plus the current estimate of the next state's value as the target.

- Uses bootstrapping — updates are based partly on existing value estimates.

- Introduces bias due to using its own estimates in the update.

- Lower variance, resulting in more stable learning.

- Learns faster because it updates continuously throughout the episode.

- Suitable for online learning and real-time applications.

- More sample-efficient.

- Targets are less noisy under function approximation, although bootstrapping can diverge when combined with off-policy data.
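The TD(0) points above come down to a single update per transition. Here is a minimal sketch, assuming a plain dict for V and treating the value of a terminal state as zero; the name `td0_update` and the `terminal` flag are illustrative choices, not part of the notebook's code.

```python
def td0_update(V, s, reward, s_next, gamma, alpha, terminal=False):
    """Apply one TD(0) update to V[s] in place and return the TD error."""
    bootstrap = 0.0 if terminal else V.get(s_next, 0.0)
    old = V.get(s, 0.0)
    delta = reward + gamma * bootstrap - old  # target minus current estimate
    V[s] = old + alpha * delta
    return delta


# The update can run as soon as one transition (s, reward, s_next) is observed.
V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", reward=0.0, s_next="B", gamma=0.9, alpha=0.5)
print(delta, V["A"])
```

Because the target depends on `V[s_next]`, any error in that estimate leaks into the update, which is the source of the bias mentioned above.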

Trade-offs:
Monte Carlo is simple and unbiased but has high variance and slower learning. It's suitable for environments with short episodes or dense rewards. TD(0) is biased but more efficient, stable, and responsive, making it a better fit for long or continuing tasks, online learning, and function approximation scenarios.

Key takeaway:
Use TD(0) when you need faster, online, or stable learning, especially in sparse-reward or long-horizon tasks. Use Monte Carlo when you can afford to wait for full returns and want unbiased estimates.

In [1]:
from utils import compress_state, generate_extreme_value_state_image
import numpy as np

import minari
from collections import defaultdict
from IPython.display import HTML
import uuid


def td_lambda_evaluation(dataset_id, gamma=0.90, lambda_=0.0, alpha=1.0):
    """
    Evaluates the state-value function V(s) for the implicit policy in a Minari dataset
    using TD(lambda) learning with eligibility traces.

    Parameters
    ----------
    dataset_id : str
        The Minari dataset ID (assumed to contain episodes generated by a fixed policy).
    
    gamma : float
        Discount factor.
    
    lambda_ : float
        Trace decay parameter (controls bias-variance trade-off).
    
    alpha : float
        Learning rate for TD updates.

    Returns
    -------
    V : dict
        A dictionary mapping each state (as a hashable key) to its estimated value.

    state_locations : dict
        A dictionary mapping state keys to (episode_index, timestep) of first occurrence.
    """
    dataset = minari.load_dataset(dataset_id)
    V = defaultdict(float)
    state_locations = {}

    for episode_idx, episode in enumerate(dataset.iterate_episodes()):
        observations = episode.observations
        rewards = episode.rewards
        actions = episode.actions

        E = defaultdict(float)  # eligibility traces, reset at the start of each episode

        for t in range(len(rewards)):
            obs_t = {k: v[t] for k, v in observations.items()}
            obs_tp1 = {k: v[t + 1] for k, v in observations.items()}
            reward = rewards[t]

            s_t = compress_state(obs_t)
            s_tp1 = compress_state(obs_tp1)

            # Record first occurrence of the state
            if s_t not in state_locations:
                state_locations[s_t] = (episode_idx, t)

            # TD error: one-step bootstrapped target minus the current estimate
            delta = reward + gamma * V[s_tp1] - V[s_t]

            # Accumulating trace: bump the trace of the state just visited
            E[s_t] += 1

            # Apply the TD error to every state visited so far, weighted by its trace
            for s in E:
                V[s] += alpha * delta * E[s]
                E[s] *= gamma * lambda_  # decay so that earlier states receive less credit
        
        # Track missing state at the end of the episode
        if s_tp1 not in state_locations:
            state_locations[s_tp1] = (episode_idx, len(actions))

    values = np.array(list(V.values()))
    print("Value function statistics:")
    print(f"  Count:       {len(values)}")
    print(f"  Min value:   {np.min(values):.4f}")
    print(f"  Max value:   {np.max(values):.4f}")
    print(f"  Mean value:  {np.mean(values):.4f}")
    print(f"  Std dev:     {np.std(values):.4f}")

    return V, state_locations
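To check the trace logic without loading a dataset, the same update can be run on a toy problem. This is a sketch under stated assumptions: episodes are given as lists of `(state, reward, next_state)` tuples, the three-state chain is made up for illustration, and `td_lambda_on_transitions` is a hypothetical stand-in that mirrors the inner loop above rather than the Minari-based function itself.

```python
from collections import defaultdict


def td_lambda_on_transitions(episodes, gamma=0.9, lambda_=0.0, alpha=1.0):
    """Tabular TD(lambda) with accumulating traces over lists of (s, r, s_next)."""
    V = defaultdict(float)
    for episode in episodes:
        E = defaultdict(float)  # traces start from zero in every episode
        for s, r, s_next in episode:
            delta = r + gamma * V[s_next] - V[s]
            E[s] += 1.0
            for state in E:
                V[state] += alpha * delta * E[state]
                E[state] *= gamma * lambda_
    return V


# Hypothetical chain: A -> B -> C, with reward 1 on reaching the terminal state C.
chain = [[("A", 0.0, "B"), ("B", 1.0, "C")]]

v_td0 = td_lambda_on_transitions(chain, lambda_=0.0)
v_td1 = td_lambda_on_transitions(chain, lambda_=1.0)
print(dict(v_td0))
print(dict(v_td1))
```

After a single pass, `lambda_=0.0` leaves A at zero because only B sits directly before the reward, while `lambda_=1.0` also credits A with the discounted return.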

#### Highest value image

In [2]:
dataset_id = "minigrid/BabyAI-Pickup/optimal-fullobs-v0"
output_path = "./minigrid/BabyAI-Pickup/optimal-fullobs-v0/td_lambda/highest_value_function.png"

generate_extreme_value_state_image(
    dataset_id=dataset_id,
    output_path=output_path,
    value_fn_generator=td_lambda_evaluation,
    highest=True
)

# === Display in notebook ===
cache_buster = uuid.uuid4().hex
HTML(f'<img src="{output_path}?v={cache_buster}" width="400">')

Value function statistics:
  Count:       52916
  Min value:   0.0000
  Max value:   0.9969
  Mean value:  0.0173
  Std dev:     0.1253




#### Lowest value image

In [3]:
dataset_id = "minigrid/BabyAI-Pickup/optimal-fullobs-v0"
output_path = "./minigrid/BabyAI-Pickup/optimal-fullobs-v0/td_lambda/lowest_value_function.png"

generate_extreme_value_state_image(
    dataset_id=dataset_id,
    output_path=output_path,
    value_fn_generator=td_lambda_evaluation,
    highest=False
)

# === Display in notebook ===
cache_buster = uuid.uuid4().hex
HTML(f'<img src="{output_path}?v={cache_buster}" width="400">')

Value function statistics:
  Count:       52916
  Min value:   0.0000
  Max value:   0.9969
  Mean value:  0.0173
  Std dev:     0.1253
Sampling rejected: unreachable object at (15, 5)


As seen before, the highest-value states are those close to the goal, while the lowest-value states are those far from it.