# Temporal Differences

Temporal difference (TD) learning is an unsupervised technique in which the learning agent learns to predict the expected value of a variable occurring at the end of a sequence of states. Both temporal difference algorithms covered here (SARSA and Q-learning) are **model-free**: the agent does not build a model of the environment's transitions or rewards. Instead, it learns a value for each state-action pair directly from experience and acts on those values. When exploiting, the agent chooses the action with the maximum value in the current state.
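
To make the idea of predicting an end-of-sequence value from intermediate states concrete, here is a minimal sketch of a single TD(0) update on a state-value table. The function name `td0_update`, the three-state table, and the numbers are illustrative assumptions, not part of this notebook's environment.

```python
def td0_update(V, state, reward, next_state, alpha, gamma):
    """Move V[state] toward the bootstrapped target reward + gamma * V[next_state]."""
    td_target = reward + gamma * V[next_state]
    td_error = td_target - V[state]
    V[state] += alpha * td_error
    return td_error

# Hypothetical value estimates for three states
V = [0.0, 0.5, 1.0]

# One observed transition: state 0 -> state 1 with reward 0
td_error = td0_update(V, state=0, reward=0.0, next_state=1, alpha=0.5, gamma=0.5)
print("TD error:", td_error)
print("Updated V:", V)
```

The estimate for state 0 moves part of the way toward the value already predicted for state 1, which is the "temporal difference" the method is named after.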


## SARSA (aka. State-Action-Reward-State-Action)

SARSA is an on-policy algorithm: the agent updates its value estimates using the action that its current policy actually selects in the next state, so the policy being learned is the same policy that is used to act.

Let us begin implementing the algorithm in the pre-made OpenAI gym environment, FrozenLake-v1. We start by importing the necessary packages and rendering the environment.

In [1]:
import gym
import numpy as np
import matplotlib.pyplot as plt

In [5]:
env = gym.make("FrozenLake-v1", is_slippery=False, render_mode='human')
env.reset()
env.render()
n_states = env.observation_space.n
n_actions = env.action_space.n

As in Q-learning, we set up a Q-table that holds one value for every state-action pair, initialized to zero.

In [6]:
#Set up the Q-table
Q = np.zeros((n_states, n_actions))


In [7]:
# Set the hyperparameters
alpha = 0.6  # learning rate
gamma = 0.5  # discount factor
epsilon = 0.5  # exploration rate


We now begin the training portion of the algorithm. This is where the agent explores the environment and gathers data on the Q-values of the various states and actions. It is almost identical to the training in Q-learning, except that the update rule derived from the Bellman equation is slightly modified. It becomes:

$\text{New } Q(s,a) = Q(s,a) + \alpha \left[ R(s,a) + \gamma Q(s',a') - Q(s,a) \right]$

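The only difference is that instead of $\max_{a'} Q(s', a')$ we use $Q(s', a')$, where $a'$ is the action the agent actually selects in the next state $s'$ under its epsilon-greedy policy. The update depends only on the current transition $(s, a, r, s', a')$; earlier states and actions are not considered.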
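
The difference between the two targets can be seen on a single transition. The sketch below uses made-up Q-values for one next state (the array `Q_next`, the reward, and the chosen `next_action` are assumptions for illustration) and computes the bootstrapped target each algorithm would use.

```python
import numpy as np

gamma = 0.5  # discount factor, same value as in this notebook

# Hypothetical Q-values of the next state s' for its four actions
Q_next = np.array([0.2, 0.8, 0.1, 0.4])
reward = 0.0

# Action a' picked in s' by the behaviour policy (assumed here to be an exploratory pick)
next_action = 3

# SARSA (on-policy): bootstrap from the action that will actually be taken
sarsa_target = reward + gamma * Q_next[next_action]

# Q-learning (off-policy): bootstrap from the best action, whatever is taken next
q_learning_target = reward + gamma * Q_next.max()

print("SARSA target:     ", sarsa_target)
print("Q-learning target:", q_learning_target)
```

Whenever the selected action is the greedy one, the two targets coincide; they differ only on exploratory steps.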

In [None]:
# Training
for episode in range(100):
    # Reset the environment; reset() returns (observation, info)
    state, info = env.reset()

    # Choose the initial action using the epsilon-greedy policy
    if np.random.rand() < epsilon:
        action = env.action_space.sample()  # explore
    else:
        action = np.argmax(Q[state])  # exploit

    # Loop over steps in the episode
    while True:
        # Take the chosen action and observe the next state and reward
        next_state, reward, done, truncated, info = env.step(action)

        # Choose the next action using the epsilon-greedy policy
        if np.random.rand() < epsilon:
            next_action = env.action_space.sample()  # explore
        else:
            next_action = np.argmax(Q[next_state])  # exploit

        # SARSA update of the current state-action pair
        Q[state, action] += alpha * (reward + gamma * Q[next_state, next_action] - Q[state, action])

        # Move on to the next state and action
        state = next_state
        action = next_action

        # Stop when the episode terminates or reaches the step limit
        if done or truncated:
            break
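
The epsilon-greedy choice appears twice in the training loop, so it can be factored into a helper. This is a sketch under assumptions: the name `epsilon_greedy` is hypothetical, and it draws random actions from a NumPy generator instead of `env.action_space.sample()` so that it runs without an environment.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Return a random action with probability epsilon, else the greedy action."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore
    return int(np.argmax(Q[state]))          # exploit

rng = np.random.default_rng(0)

# Hypothetical Q-table with the FrozenLake shape (16 states, 4 actions)
Q_demo = np.zeros((16, 4))
Q_demo[0, 2] = 1.0  # make action 2 the best one in state 0

greedy_action = epsilon_greedy(Q_demo, state=0, epsilon=0.0, rng=rng)  # never explores
random_action = epsilon_greedy(Q_demo, state=0, epsilon=1.0, rng=rng)  # always explores
print(greedy_action, random_action)
```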

Now that training is done, we can test the agent by letting it always take the greedy action.

In [None]:
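# Test the agent by following the greedy policy
n_success = 0
for episode in range(100):
    # reset() returns (observation, info)
    state, info = env.reset()
    done = False
    while not done:
        action = np.argmax(Q[state])
        next_state, reward, terminated, truncated, info = env.step(action)
        state = next_state  # move to the next state
        done = terminated or truncated
    # In FrozenLake the reward is 1 only when the goal is reached
    n_success += reward
print('Success rate:', n_success / 100)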

# Close the environment
env.close()
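
Besides the success rate, it can help to look at the greedy policy stored in the Q-table. The sketch below assumes the default 4x4 FrozenLake map and its action encoding (0 = left, 1 = down, 2 = right, 3 = up); the helper `greedy_policy_grid` and the table `Q_demo` are hypothetical stand-ins, so pass the trained `Q` instead to inspect the real result.

```python
import numpy as np

# FrozenLake action indices: 0 = left, 1 = down, 2 = right, 3 = up
ARROWS = np.array(["<", "v", ">", "^"])

def greedy_policy_grid(Q, n_rows=4, n_cols=4):
    """Return the greedy action of every state as a grid of arrow characters."""
    best_actions = np.argmax(Q, axis=1)
    return ARROWS[best_actions].reshape(n_rows, n_cols)

# Hypothetical Q-table standing in for the trained one
Q_demo = np.zeros((16, 4))
Q_demo[0, 2] = 1.0  # state 0 prefers "right"
Q_demo[1, 1] = 1.0  # state 1 prefers "down"

grid = greedy_policy_grid(Q_demo)
print(grid)
```

States whose Q-values are all equal (for example, states never visited during training) show the first action, because `np.argmax` breaks ties by taking the lowest index.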