# Temporal Difference (TD) Learning in Reinforcement Learning

Temporal Difference (TD) learning is a model-free approach in reinforcement learning that combines ideas from Monte Carlo methods and dynamic programming. It allows agents to learn directly from raw experience without needing a model of the environment's dynamics.

## TD Prediction

TD prediction methods are used to estimate the value function $V(s)$ for a given policy $\pi$. The key feature of TD methods is that they update the value of a state based on the estimated value of the next state, rather than waiting for the final outcome.

### TD(0) Update Rule:
---


>$$V(s) \leftarrow V(s) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(s)]$$

Where:
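- $V(s)$ is the current value estimate of state $s$, the state visited at time $t$
- $\alpha$ is the learning rate
- $R_{t+1}$ is the reward received after transitioning from state $s$ to state $S_{t+1}$
- $S_{t+1}$ is the next state, and $V(S_{t+1})$ is its current value estimate
- $\gamma$ is the discount factor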


---
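
The update rule can be tried on a small example. The following is a minimal sketch, assuming a hypothetical five-state random walk (not part of the original text): the agent starts in the middle state, steps left or right with equal probability, and receives a reward of 1 only when it exits on the right. The names `td0_update` and `random_walk_td0` are illustrative. With $\gamma = 1$, the true value of state $i$ (counting from 1) is $i/6$, which gives something to compare the estimates against.

```python
import random

def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """Apply one TD(0) update to V[s] in place and return the TD error."""
    # A terminal next state contributes no future value.
    target = r + (0.0 if terminal else gamma * V[s_next])
    td_error = target - V[s]
    V[s] += alpha * td_error
    return td_error

def random_walk_td0(episodes=5000, alpha=0.02, seed=0):
    """Estimate state values of the assumed random walk with TD(0)."""
    rng = random.Random(seed)
    n = 5
    V = [0.5] * n  # arbitrary initial estimates
    for _ in range(episodes):
        s = n // 2  # start in the middle state
        while True:
            s_next = s + rng.choice((-1, 1))
            if s_next < 0:       # exit on the left, reward 0
                td0_update(V, s, 0.0, None, alpha, terminal=True)
                break
            if s_next >= n:      # exit on the right, reward 1
                td0_update(V, s, 1.0, None, alpha, terminal=True)
                break
            td0_update(V, s, 0.0, s_next, alpha)  # update mid-episode
            s = s_next
    return V

V = random_walk_td0()
print([round(v, 2) for v in V])  # estimates should be close to 1/6, 2/6, ..., 5/6
```

Each call to `td0_update` happens immediately after a single transition, so the estimates change before the episode has finished.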

### Advantages of TD:
- TD methods are model-free, meaning they do not require knowledge of the transition and reward functions of the MDP.
- TD methods can learn from incomplete episodes, unlike Monte Carlo methods, which require complete episodes.
- TD methods bootstrap: each update builds on the current estimate of the next state's value, so learning can proceed before the final outcome is known.


## TD Control

TD control methods extend TD prediction to learn optimal policies. They use value functions to improve the policy iteratively.

### Update Policy and Behavior Policy:
- **Update Policy**: The policy whose value function is being learned and improved (often called the target policy).
- **Behavior Policy**: The policy the agent actually follows to select actions and generate experience.

### On-policy vs. Off-policy:
- **On-policy**: The agent follows a single policy for both updating the value function and generating behavior.
- **Off-policy**: The agent uses different policies for updating the value function and generating behavior.

### On-policy Example: SARSA
SARSA (State-Action-Reward-State-Action) is an on-policy TD control algorithm. It updates the Q-value based on the action actually taken by the current policy.

### Off-policy Example: Q-Learning
Q-Learning is an off-policy TD control algorithm. It updates the Q-value based on the maximum Q-value of the next state, regardless of which action the behavior policy actually takes next.
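
The difference between the two algorithms comes down to the target used for a single transition. The sketch below illustrates this with a made-up Q-table; the function names `sarsa_target` and `q_learning_target`, the state name `"s1"`, and all numbers are assumptions for illustration only.

```python
def sarsa_target(Q, reward, next_state, next_action, gamma):
    """On-policy target: bootstrap from the action actually chosen next."""
    return reward + gamma * Q[next_state][next_action]

def q_learning_target(Q, reward, next_state, gamma):
    """Off-policy target: bootstrap from the best action in the next state."""
    return reward + gamma * max(Q[next_state].values())

# Hypothetical Q-table for a single next state "s1" with two actions.
Q = {"s1": {"a": 1.0, "b": 3.0}}

# Suppose the behavior policy explores and picks the worse action "a" next.
print(sarsa_target(Q, reward=0.0, next_state="s1", next_action="a", gamma=0.5))  # 0.5
print(q_learning_target(Q, reward=0.0, next_state="s1", gamma=0.5))  # 1.5
```

When the next action happens to be the greedy one, the two targets coincide; they differ only on exploratory steps.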


## SARSA

SARSA is an on-policy TD control algorithm that updates the action-value function based on the current policy.

### SARSA Update Rule:
---


>$$Q(s, a) \leftarrow Q(s, a) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(s, a)]$$

Where:
- $Q(s, a)$ is the current value of taking action $a$ in state $s$
- $\alpha$ is the learning rate
- $R_{t+1}$ is the reward received after taking action $a$ in state $s$
- $\gamma$ is the discount factor
- $S_{t+1}$ is the next state
- $A_{t+1}$ is the next action taken by the current policy
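
As a worked illustration with made-up numbers (all values are assumptions chosen for easy arithmetic), let $\alpha = 0.1$, $\gamma = 0.9$, $Q(s, a) = 2.0$, $R_{t+1} = 1.0$, and $Q(S_{t+1}, A_{t+1}) = 3.0$. One SARSA update then gives:

>$$Q(s, a) \leftarrow 2.0 + 0.1 \, [1.0 + 0.9 \times 3.0 - 2.0] = 2.0 + 0.1 \times 1.7 = 2.17$$

The estimate moves one tenth of the way from its old value of 2.0 toward the target of 3.7.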


## Q-Learning

Q-Learning is an off-policy TD control algorithm that updates the action-value function based on the maximum possible Q-value of the next state.

### Q-Learning Update Rule:
---


>$$Q(s, a) \leftarrow Q(s, a) + \alpha [R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(s, a)]$$

Where:
- $Q(s, a)$ is the current value of taking action $a$ in state $s$
- $\alpha$ is the learning rate
- $R_{t+1}$ is the reward received after taking action $a$ in state $s$
- $\gamma$ is the discount factor
- $S_{t+1}$ is the next state
- $\max_{a'} Q(S_{t+1}, a')$ is the maximum Q-value of the next state over all possible actions
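
As a worked illustration with made-up numbers (all values are assumptions chosen for easy arithmetic), let $\alpha = 0.1$, $\gamma = 0.9$, $Q(s, a) = 2.0$, and $R_{t+1} = 1.0$, and suppose the next state has two actions with $Q(S_{t+1}, a_1) = 3.0$ and $Q(S_{t+1}, a_2) = 5.0$. One Q-Learning update then gives:

>$$Q(s, a) \leftarrow 2.0 + 0.1 \, [1.0 + 0.9 \times \max(3.0, 5.0) - 2.0] = 2.0 + 0.1 \times 3.5 = 2.35$$

The result is the same whether the agent goes on to take $a_1$ or $a_2$, because only the larger of the two values enters the update.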


## Example Implementation of SARSA and Q-Learning in Python

Below is an example implementation of the SARSA and Q-Learning algorithms in Python.


In [1]:
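import numpy as np
from collections import defaultdict

# Define the environment: five states (4 is terminal) and two actions
states = [0, 1, 2, 3, 4]
actions = ['a', 'b']

# Simulate an environment with random transitions and rewards.
# `policy` maps each state to an action. Note that the chosen action does not
# influence the next state or the reward. This helper is not called by the
# algorithms below, which sample transitions inline in the same way.
def generate_episode(policy):
    episode = []
    state = np.random.choice(states)
    while state != 4:  # Terminal state
        action = policy[state]
        next_state = np.random.choice(states)
        reward = np.random.randn()  # Random reward
        episode.append((state, action, reward))
        state = next_state
    return episode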

# Epsilon-greedy policy
def epsilon_greedy_policy(Q, state, epsilon=0.1):
    if np.random.rand() < epsilon:
        return np.random.choice(actions)
    else:
        return actions[np.argmax(Q[state])]

# SARSA algorithm (on-policy TD control)
def sarsa(episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(len(actions)))  # unseen states start at zero
    for _ in range(episodes):
        state = np.random.choice(states)
        action = epsilon_greedy_policy(Q, state, epsilon)
        while state != 4:  # Terminal state
            next_state = np.random.choice(states)
            reward = np.random.randn()  # Random reward
            # Choose the next action with the same policy that is being learned
            next_action = epsilon_greedy_policy(Q, next_state, epsilon)
            a, next_a = actions.index(action), actions.index(next_action)
            # Q[4] is never updated, so the terminal state bootstraps from zero
            td_target = reward + gamma * Q[next_state][next_a]
            Q[state][a] += alpha * (td_target - Q[state][a])
            state, action = next_state, next_action
    return Q

# Run SARSA
Q_sarsa = sarsa(episodes=1000)

print("Estimated Q-Values (SARSA):")
for state, values in Q_sarsa.items():
    for action, value in zip(actions, values):
        print(f"Q({state}, {action}) = {value:.2f}")


Estimated Q-Values (SARSA):
Q(4, a) = 0.00
Q(4, b) = 0.00
Q(3, a) = -0.05
Q(3, b) = -0.35
Q(0, a) = -0.26
Q(0, b) = 0.17
Q(1, a) = -0.08
Q(1, b) = -0.04
Q(2, a) = -0.15
Q(2, b) = -0.27


In [2]:
# Q-Learning algorithm (off-policy TD control)
def q_learning(episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(len(actions)))  # unseen states start at zero
    for _ in range(episodes):
        state = np.random.choice(states)
        while state != 4:  # Terminal state
            # Behavior policy: epsilon-greedy with respect to the current Q
            action = epsilon_greedy_policy(Q, state, epsilon)
            next_state = np.random.choice(states)
            reward = np.random.randn()  # Random reward
            a = actions.index(action)
            # Target policy: greedy, so bootstrap from the largest next Q-value
            td_target = reward + gamma * np.max(Q[next_state])
            Q[state][a] += alpha * (td_target - Q[state][a])
            state = next_state
    return Q

# Run Q-Learning
Q_q_learning = q_learning(episodes=1000)

print("Estimated Q-Values (Q-Learning):")
for state, values in Q_q_learning.items():
    for action, value in zip(actions, values):
        print(f"Q({state}, {action}) = {value:.2f}")


Estimated Q-Values (Q-Learning):
Q(0, a) = 0.05
Q(0, b) = 0.19
Q(4, a) = 0.00
Q(4, b) = 0.00
Q(2, a) = -0.12
Q(2, b) = 0.12
Q(3, a) = -0.03
Q(3, b) = -0.14
Q(1, a) = -0.19
Q(1, b) = 0.23


## Explanation of the Temporal Difference Learning Implementation

### SARSA
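- **Initialization**: Q-values start at zero for every state-action pair; the `defaultdict` creates a state's entry the first time it is accessed.
- **Generate Episodes**: The agent selects actions with an epsilon-greedy policy until it reaches the terminal state 4.
- **Update Q-Values**: After each step, the Q-value is updated using the Q-value of the next action the epsilon-greedy policy actually selects.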

### Q-Learning
- **Initialization**: Q-values start at zero for every state-action pair; the `defaultdict` creates a state's entry the first time it is accessed.
- **Generate Episodes**: The agent selects actions with an epsilon-greedy policy (the behavior policy) until it reaches the terminal state 4.
- **Update Q-Values**: After each step, the Q-value is updated using the maximum Q-value of the next state, independent of the action the behavior policy takes next.


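The estimated Q-values approximate the expected return for each state-action pair: under the epsilon-greedy policy being followed in the case of SARSA, and under the greedy policy in the case of Q-Learning. SARSA updates the Q-values based on the action actually taken by the policy, while Q-Learning updates them based on the maximum Q-value of the next state. In this toy environment, transitions and rewards are random and do not depend on the chosen action, so the true Q-values are all zero and the printed estimates differ from zero only because of sampling noise.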
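
To see the Q-Learning update produce a meaningful policy, the following minimal sketch swaps in a hypothetical deterministic chain in place of the random environment. Everything about this chain is an assumption made for illustration: action `'a'` moves left, action `'b'` moves right, and a reward of 1 is given only on reaching state 4. The behavior policy is uniformly random, so the example also shows the off-policy property: the greedy policy is learned without ever being followed.

```python
import random
from collections import defaultdict

N_STATES = 5          # states 0..4; state 4 is terminal
ACTIONS = ("a", "b")  # assumed meaning: 'a' moves left, 'b' moves right

def step(state, action):
    """Deterministic chain: reward 1.0 only on reaching the terminal state."""
    if action == "b":
        next_state = state + 1
    else:
        next_state = max(state - 1, 0)  # moving left from state 0 stays put
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def q_learning_chain(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(lambda: [0.0, 0.0])
    for _ in range(episodes):
        state = rng.randrange(N_STATES - 1)  # start in a non-terminal state
        while state != N_STATES - 1:
            a = rng.randrange(len(ACTIONS))  # behavior policy: uniformly random
            next_state, reward = step(state, ACTIONS[a])
            # Target policy is greedy: bootstrap from the best next action.
            target = reward + gamma * max(Q[next_state])
            Q[state][a] += alpha * (target - Q[state][a])
            state = next_state
    return Q

Q = q_learning_chain()
for s in range(N_STATES - 1):
    greedy = ACTIONS[max(range(len(ACTIONS)), key=lambda i: Q[s][i])]
    values = [round(q, 3) for q in Q[s]]
    print(f"state {s}: Q = {values}, greedy action = {greedy}")
```

Under these assumptions the optimal values for moving right are $0.9^3$, $0.9^2$, $0.9$, and $1$ in states 0 through 3, so the learned greedy action should be `'b'` in every non-terminal state.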