# Lab04 - Temporal-Difference Learning

### Learning Goals:
- Getting familiar with the cliff walking environment
- Understanding Temporal-Difference Learning, in particular Q-Learning
- Understanding how Q-Learning differs from SARSA
- Visualizing the training process

In [None]:
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
from collections import defaultdict
from toolbox.cliff_walking import CliffWalkingEnv
import sys
import itertools
import pandas as pd

## 4.1 Cliff Walking Environment
Consider the gridworld shown below. This is a standard undiscounted, episodic task, with start and goal states, and the usual actions causing movement up, down, right, and left. Reward is -1 on all transitions except those into the region marked "The Cliff". Stepping into this region incurs a reward of -100 and sends the agent instantly back to the start.

<div>
<img src="images/Ex4.1_cliff_walking.png" width="700"/>
</div>
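
The dynamics described above can be sketched in a few lines of plain Python. This is only an illustration, not the implementation in `toolbox.cliff_walking`: the 4 x 12 grid size, the action encoding, and the names `cliff_step` and `is_cliff` are assumptions made for this sketch.

```python
# Assumed layout (not stated in the text): 4 rows x 12 columns,
# start at bottom-left, goal at bottom-right, cliff in between.
N_ROWS, N_COLS = 4, 12
START = (3, 0)
GOAL = (3, 11)
# Assumed action encoding: 0=up, 1=right, 2=down, 3=left.
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}


def is_cliff(pos):
    """True for the bottom-row cells strictly between start and goal."""
    row, col = pos
    return row == N_ROWS - 1 and 0 < col < N_COLS - 1


def cliff_step(pos, action):
    """Return (next_pos, reward, done) for one move in the assumed grid."""
    d_row, d_col = MOVES[action]
    # Moves that would leave the grid keep the agent at the border.
    row = min(max(pos[0] + d_row, 0), N_ROWS - 1)
    col = min(max(pos[1] + d_col, 0), N_COLS - 1)
    next_pos = (row, col)
    if is_cliff(next_pos):
        # Stepping into the cliff: large penalty and back to the start.
        return START, -100, False
    return next_pos, -1, next_pos == GOAL
```

In this sketch, stepping into the cliff does not end the episode, which follows the description in the text. The provided environment may handle this case differently, so check its behavior with `step()`.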


**TODO:** Take a few steps inside the cliff walking environment and get familiar with it. Try the functions `reset()`, `step()` and `render()`.

In [25]:
env = CliffWalkingEnv()

## 4.2 Q-Learning (Off-policy TD Control)

Temporal-difference (TD) learning is a central idea in reinforcement learning. It combines ideas from Monte Carlo methods and dynamic programming (DP). Like Monte Carlo methods, TD methods learn directly from experience without a model of the environment. Like DP, they update estimates based on other learned estimates (bootstrapping).

**Advantages:** Unlike DP methods, TD methods do not require a model of the environment. Unlike Monte Carlo methods, they can be implemented incrementally, updating after every step instead of waiting for the end of an episode. Although TD methods learn one guess from another without waiting for the final outcome, tabular TD prediction has been proven to converge for a fixed policy under suitable step-size conditions.

General Update Rule: `Q[s,a] += learning_rate * (td_target - Q[s,a])`. 

TD Error: `td_target - Q[s,a]`

TD Target for Q-Learning: `R[t+1] + discount_factor * max(Q[next_state])`
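
As a worked example, the three formulas above can be applied to a single transition. The state labels `36` and `24`, the reward, and the pre-filled action values are made-up numbers chosen for illustration; the table is indexed as `Q[state][action]`.

```python
from collections import defaultdict

# Four actions per state; all values start at 0.0.
Q = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])

# One hypothetical transition (state, action, reward, next_state).
state, action, reward, next_state = 36, 0, -1.0, 24
learning_rate, discount_factor = 0.5, 1.0

# Pretend the next state already has some learned values.
Q[next_state] = [-3.0, -2.0, -5.0, -4.0]

# TD target: reward plus the discounted best value of the next state.
td_target = reward + discount_factor * max(Q[next_state])
# TD error: how far the current estimate is from the target.
td_error = td_target - Q[state][action]
# Move the estimate a fraction `learning_rate` toward the target.
Q[state][action] += learning_rate * td_error
```

Here the target is `-1.0 + 1.0 * (-2.0) = -3.0`, so the estimate moves from `0.0` halfway toward it and ends at `-1.5`.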

**TODO:** Implement Q-learning (off-policy TD control) as described in Sutton & Barto, Section 6.5.

In [None]:
def epsilon_greedy_action(Q, observation, nA, epsilon):
    """
    Chooses an epsilon-greedy action based on the action-value function, the current observation, and the current epsilon value.
    Args:
        Q: Dictionary mapping state -> action values.
        observation: Current state, used as the key into Q.
        nA: Number of possible actions in the given environment.
        epsilon: Current epsilon value.
        
    Returns:
        action: The chosen action.
    """
    
    return action

In [None]:
def calculate_new_Q(Q, reward, discount_factor, state, next_state, current_action):
    """
    Calculates the new Q value based on the given parameters.
    
    Args:
        Q: Dictionary mapping state -> action values.
        reward: Current reward achieved from the last step.
        discount_factor: float representing the discount factor to be used in the Q-learning algorithm.
        state: Current state the agent is in.
        next_state: The state after the step taken.
        current_action: Action agent has taken.
    """
    

In [None]:
def q_learning(env, num_episodes, discount_factor=1.0, alpha=0.5, epsilon=0.1):
    """
    Implementation of the Q-learning algorithm. 
    
    Args:
        env: OpenAI Gym-style environment.
        num_episodes: Integer number of episodes to run the Q-learning algorithm for.
        discount_factor: Float discount factor used in the TD target.
        alpha: Float learning rate used in the Q-value update.
        epsilon: Float exploration rate of the epsilon-greedy policy.
    
    Returns:
        A tuple (Q, episode_lengths, episode_rewards).
        Q is a dictionary mapping state -> action values.
        episode_lengths is an array holding the length of each episode (number of steps taken).
        episode_rewards is an array holding the total reward achieved in each episode.
    """
    
    # The final action-value function.
    # A nested dictionary that maps state -> (action -> action-value).
    Q = ...

    # Keeps track of useful statistics
    episode_lengths = ...
    episode_rewards = ...
        
    # Loop through episodes
    for ...:
        # Print out which episode we're on, useful for debugging.
        if (i_episode + 1) % 100 == 0:
            print("\rEpisode {}/{}.".format(i_episode + 1, num_episodes), end="")
            sys.stdout.flush()
        
        # Reset the environment
        state = ...
        
        # Take steps until termination state
        while ...:
            
            # Take a step using your `epsilon_greedy_action()` function
            action = ...
            next_state, reward, done, _ = ...
            
            # Update your Q table according to the current knowledge
            calculate_new_Q(...)
            
            # Update your statistics 
            episode_lengths = ...
            episode_rewards = ...
            
            # Break the loop if the episode is done
            if ...:
                break
                
            state = next_state
    
    return Q, episode_lengths, episode_rewards

## 4.3 Visualizing the training process
Use the statistics collected in `episode_lengths` and `episode_rewards` to visualize the training process of your agent.

**TODO:**
- The first visualization should be a plot of the episode lengths over episodes. The x-axis displays the current episode, while the y-axis shows the length (number of steps the agent has taken during the episode).
- The second visualization should be a plot of the achieved reward over episodes. The x-axis displays the current episode, while the y-axis shows the reward achieved during that episode.
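
Per-episode curves from an epsilon-greedy agent tend to be noisy, so it can help to smooth them before plotting. The helper below is one possible way to do that with a trailing moving average. The name `moving_average` and the `window` parameter are assumptions for this sketch and not part of the lab's toolbox.

```python
def moving_average(values, window):
    """Average each value with up to `window - 1` preceding values."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= values[i - window]
        # The first few entries average over fewer than `window` values.
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed
```

The result has the same length as the input, so it can be plotted against the same episode index, for example `plt.plot(moving_average(list(episode_rewards), 10))`.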

In [None]:
def plot_episode_stats(episode_lengths, episode_rewards):
    """
    Plotting the statistics of the training process.
    Args:
        episode_lengths is an array holding the lengths of each episode (how many steps have been taken)
        episode_rewards is an array holding the achieved reward of each episode
    """

In [None]:
Q, episode_lengths, episode_rewards = q_learning(env, 500)

In [None]:
plot_episode_stats(episode_lengths, episode_rewards)

## 4.4 SARSA (On-policy TD-Control)
The basic idea behind SARSA is to learn the action-value function $Q(s,a)$, which estimates the expected long-term reward for taking action $a$ in state $s$. The algorithm learns by trial and error: it tries different actions in different states and updates its estimates based on the rewards it receives.

SARSA faces an exploration-exploitation trade-off: it must balance exploring new actions and states against exploiting its current action-value estimates by taking the best known action. This is typically handled with an epsilon-greedy strategy, which chooses a random action with probability epsilon and the best known action otherwise.
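
A minimal sketch of such an epsilon-greedy choice is shown below. It works on a plain list of action values for one state. The function name `epsilon_greedy` and its parameters are assumptions for this illustration and differ from the `epsilon_greedy_action()` stub above.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    n_actions = len(action_values)
    if rng.random() < epsilon:
        # Explore: every action is equally likely.
        return rng.randrange(n_actions)
    # Exploit: ties are broken in favor of the lowest action index.
    return max(range(n_actions), key=lambda a: action_values[a])
```

With `epsilon=0.0` the choice is always greedy, and with `epsilon=1.0` it is always uniformly random. Note that in this variant the random branch can also return the greedy action.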

Overall, SARSA is a simple and effective method for learning action values in reinforcement learning problems. It has been applied to a wide range of problems, including control and game playing.

TD Target for SARSA: `R[t+1] + discount_factor * Q[next_state][next_action]`

<div class="alert alert-block alert-info">
    Q-learning and SARSA differ in their TD target. SARSA is on-policy: it bootstraps from the Q-value of the action that the current (e.g. epsilon-greedy) policy actually selects in the next state. Q-learning is off-policy: it bootstraps from the maximum Q-value in the next state, that is, it evaluates the greedy policy regardless of which action the behavior policy takes next.
</div>
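
The difference shows up when the two TD targets are computed for the same transition. The action values and the explored `next_action` below are made-up numbers for illustration.

```python
reward, discount_factor = -1.0, 1.0
# Hypothetical action values for the next state (4 actions).
q_next = [-3.0, -2.0, -5.0, -4.0]
# Suppose the epsilon-greedy policy explored and picked action 2.
next_action = 2

# Q-learning bootstraps from the best action in the next state.
q_learning_target = reward + discount_factor * max(q_next)
# SARSA bootstraps from the action that was actually selected.
sarsa_target = reward + discount_factor * q_next[next_action]
```

Here the Q-learning target is `-3.0` while the SARSA target is `-6.0`. Whenever the selected next action is the greedy one, the two targets coincide.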

**TODO:** Implement the SARSA (on-policy TD control) algorithm from Sutton & Barto, Section 6.4. Hint: Only minor changes to your `q_learning()` function are needed to get SARSA working.

In [26]:
def sarsa(env, num_episodes, discount_factor=1.0, alpha=0.5, epsilon=0.1):
    """
    Implementation of the SARSA algorithm. 
    
    Args:
        env: OpenAI Gym-style environment.
        num_episodes: Integer number of episodes to run the SARSA algorithm for.
        discount_factor: Float discount factor used in the TD target.
        alpha: Float learning rate used in the Q-value update.
        epsilon: Float exploration rate of the epsilon-greedy policy.
    
    Returns:
        A tuple (Q, episode_lengths, episode_rewards).
        Q is a dictionary mapping state -> action values.
        episode_lengths is an array holding the length of each episode (number of steps taken).
        episode_rewards is an array holding the total reward achieved in each episode.
    """
    
    return Q, episode_lengths, episode_rewards

## 4.5 Comparison of Results
Compare the results of the two algorithms and explain the differences between them.

**TODO:** Plot the statistics of both algorithms in the same figure.