Temporal Difference (TD) learning is a reinforcement learning method that combines ideas from Monte Carlo and Dynamic Programming methods. It allows learning to occur from raw experience without a model of the environment's dynamics. TD learning updates estimates based on other learned estimates, without waiting for a final outcome, which is known as bootstrapping¹.

In tabular TD learning, and specifically in the TD(0) algorithm, the value estimate for a given policy is updated after every step of experience. Monte Carlo methods, by contrast, must wait until the end of an episode, because their update target is the full observed return.
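To make the contrast with Monte Carlo concrete, the sketch below applies both update styles to one hypothetical three-step episode. The states `A`, `B`, `C`, the rewards, and the step size are illustrative assumptions and do not come from any environment used in this notebook.

```python
# Hypothetical episode: A -> B -> C -> terminal, with rewards 0, 0, 1 (assumed values).
episode = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, None)]  # (state, reward, next_state)
alpha, gamma = 0.5, 1.0

def td0_pass(episode, alpha, gamma):
    """Update V after every step, bootstrapping from the current V[next_state]."""
    V = {"A": 0.0, "B": 0.0, "C": 0.0}
    for s, r, s_next in episode:
        v_next = 0.0 if s_next is None else V[s_next]  # a terminal state has value 0
        V[s] += alpha * (r + gamma * v_next - V[s])
    return V

def mc_pass(episode, alpha, gamma):
    """Wait for the episode to end, then move each V[s] toward the full return G."""
    V = {"A": 0.0, "B": 0.0, "C": 0.0}
    G = 0.0
    for s, r, _ in reversed(episode):
        G = r + gamma * G
        V[s] += alpha * (G - V[s])
    return V

td_values = td0_pass(episode, alpha, gamma)
mc_values = mc_pass(episode, alpha, gamma)
print(td_values)
print(mc_values)
```

After this single episode, TD(0) has changed only the value of `C`, because the reward information moves back one state per pass, while the Monte Carlo pass has updated all three states from the complete return.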

Pseudocode for the tabular TD(0) algorithm:

```latex
# Tabular Temporal Difference (TD) Learning - TD(0) Pseudocode

1. Initialize \( V(s) \) arbitrarily for all \( s \in S \), with \( V(\text{terminal}) = 0 \)
2. Repeat (for each episode):
    2.1 Initialize \( s \)
    2.2 Repeat (for each step of episode), until \( s \) is terminal:
        2.2.1 Choose \( a \) for \( s \) according to the policy \( \pi \) being evaluated
        2.2.2 Take action \( a \), observe reward \( r \), and next state \( s' \)
        2.2.3 \( V(s) \leftarrow V(s) + \alpha [r + \gamma V(s') - V(s)] \)
        2.2.4 \( s \leftarrow s' \)
```

In the pseudocode:
- \( V(s) \) represents the value function for state \( s \).
- \( \alpha \) is the learning rate.
- \( \gamma \) is the discount factor.
- \( r \) is the reward received after taking action \( a \).
- \( s' \) is the new state after taking action \( a \).

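The update rule \( V(s) \leftarrow V(s) + \alpha [r + \gamma V(s') - V(s)] \) is the core of the TD(0) algorithm: \( V(s) \) is moved a fraction \( \alpha \) of the way towards the TD target \( r + \gamma V(s') \), which is a one-step estimate of the return. The bracketed difference \( r + \gamma V(s') - V(s) \) is called the TD error.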
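As a worked instance of this rule, the sketch below splits a single update into the quantity \( r + \gamma V(s') \) (often called the TD target) and the bracketed difference (the TD error). The helper name `td0_update` and all numeric inputs are assumptions chosen for illustration.

```python
def td0_update(v_s, reward, v_next, alpha, gamma):
    """Return the new V(s) after one TD(0) update."""
    td_target = reward + gamma * v_next   # one-step estimate of the return
    td_error = td_target - v_s            # how far V(s) is from that estimate
    return v_s + alpha * td_error         # move a fraction alpha toward the target

# Assumed example values: V(s) = 0.5, r = 1.0, V(s') = 0.8.
new_v = td0_update(v_s=0.5, reward=1.0, v_next=0.8, alpha=0.1, gamma=0.9)
print(new_v)
```

With these inputs the target is 1.72 and the error is 1.22, so the estimate moves from 0.5 to roughly 0.622.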

Source: Conversation with Copilot, 6/11/2024
(1) Chapter 6: Temporal Difference Learning - Stanford University. https://web.stanford.edu/class/cme241/lecture_slides/rich_sutton_slides/11-12-TD.pdf.
(2) Temporal difference learning - Wikipedia. https://en.wikipedia.org/wiki/Temporal_difference_learning.
(3) Analysis of Temporal Difference Learning: Linear System Approach. https://arxiv.org/abs/2204.10479v5.
(4) Temporal difference learning (TD Learning) | Engati. https://www.engati.com/glossary/temporal-difference-learning.
(5) Temporal Difference Learning | SpringerLink. https://link.springer.com/chapter/10.1007/978-1-4842-9606-6_5.
(6) Temporal-Difference Learning | Littleroot - Trung's Place. https://trunghng.github.io/posts/reinforcement-learning/td-learning/.
(7) Temporal Difference Methods for Control - GitHub Pages. https://armahmood.github.io/rlcourse/lectures_w2020/week_9_mar_2.pdf.
(8) A Meta-learning Method Based on Temporal Difference Error. https://link.springer.com/chapter/10.1007/978-3-642-10677-4_60.

In [12]:
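import gym
import numpy as np
import pandas as pd
from collections import defaultdict

env = gym.make("FrozenLake-v1")

# FrozenLake actions: 0 = left, 1 = down, 2 = right, 3 = up.
def policy(state):
    """Fixed deterministic policy: map each state to one action."""
    if state in [1, 4]:
        return 0
    elif state in [2, 9, 14]:
        return 1
    elif state in [0, 6, 13]:
        return 2
    else:
        return 3

def st_policy(state):
    """Stochastic policy: follow `policy` with probability 0.8, otherwise pick a uniformly random action."""
    if int(np.random.choice([0, 1], p=[0.2, 0.8])) == 1:
        return policy(state)
    else:
        return int(np.random.choice([0, 1, 2, 3]))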
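Because the random branch can also land on the action that `policy` would have chosen, the deterministic action is taken somewhat more often than 80% of the time. A small sketch of that arithmetic, assuming four actions and the 0.8/0.2 split used in `st_policy`; the helper name `action_probabilities` is hypothetical:

```python
import numpy as np

def action_probabilities(base_action, n_actions=4, follow_prob=0.8):
    """Per-action probabilities for a policy that takes `base_action` with
    probability `follow_prob` and otherwise picks uniformly at random."""
    probs = np.full(n_actions, (1.0 - follow_prob) / n_actions)
    probs[base_action] += follow_prob
    return probs

probs = action_probabilities(base_action=2)
print(probs)
```

Under these assumptions the base action has probability 0.8 + 0.2/4 = 0.85, and each other action has probability 0.05.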
    


In [13]:
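def TD_learning(policy=st_policy, eps=10000, alfa=0.2, gamma=1):
    """Tabular TD(0): estimate V for `policy` from `eps` episodes of experience."""
    # Terminal states are never updated, so their value stays at the default 0.
    V = defaultdict(float)
    for _ in range(eps):
        st, _ = env.reset()
        while True:
            # Use the policy passed in (the stochastic one by default).
            At = policy(st)
            st1, rt, terminated, truncated, _ = env.step(At)
            V[st] = V[st] + alfa * (rt + gamma * V[st1] - V[st])
            st = st1
            # Stop on a terminal state or when the time limit cuts the episode off.
            if terminated or truncated:
                break
    return V


V = TD_learning()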




In [14]:
V_df= pd.DataFrame(V.items(),columns=["state","value"])
V_df.head(16)

   state     value
0      0  0.061435
1      4  0.096136
2      5  0.000000
3      8  0.159050
4      9  0.336053
5     10  0.063388
6      6  0.074195
7      2  0.057226
8      3  0.051443
9      1  0.026116
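FrozenLake-v1 uses a 4x4 map by default, with states numbered row by row, so the estimates are easier to read as a grid. A minimal sketch, assuming that default layout; the helper `values_to_grid` is hypothetical, and states that do not appear in the table are shown as 0.

```python
import numpy as np

def values_to_grid(values, n_rows=4, n_cols=4):
    """Place state values into an n_rows x n_cols array; missing states stay 0."""
    grid = np.zeros(n_rows * n_cols)
    for state, value in values.items():
        grid[state] = value
    return grid.reshape(n_rows, n_cols)

# Values copied from the table above.
V_example = {0: 0.061435, 4: 0.096136, 5: 0.0, 8: 0.15905, 9: 0.336053,
             10: 0.063388, 6: 0.074195, 2: 0.057226, 3: 0.051443, 1: 0.026116}
grid = values_to_grid(V_example)
print(grid)
```

In the notebook itself, the same helper could be called as `values_to_grid(V)` on the dictionary returned by the TD(0) run.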


In [15]:

for _ in range(10):
    print(int(np.random.choice([0, 1], p=[0.2, 0.8])))
    

1
1
1
1
1
1
1
1
1
1
