# Linewalk Q-learning experiments

---

## Introduction

In this notebook we use a simple 1D linewalk to experiment with the properties of tabular Q-learning. Our aim is to understand the dynamics of Q-learning when the current value estimates are wrong in different ways. The three types of wrong-ness we want to consider are when
- Q-learning iterates in the wrong direction because of bootstrapping.
- Q-learning iterates in the wrong direction because of an incorrect action choice.
- The value estimates are on a larger scale than the immediate rewards, causing the bootstrap contributions to be more important than the rewards when calculating the (possibly multi-step) error.

---

## Overview

In Q-learning we have two primary components: bootstrapping, and a "max-over-action" operation to implement a form of off-policy learning. We do not use function approximation here, because we want to avoid conflating typical Q-learning behaviours with those of function approximation.

The learning objective is to minimise $\mathbb{E}_{s,a \sim \mu}\left[~|r_{s,a} + \gamma \cdot \max_{a'}q(s',a') - q(s,a)|^2~\right]$, where the expectation is over all state-action pairs selected according to the exploration policy $\mu$ (for us this will be uniform over all pairs). We may also extend the empirical returns to multiple steps, but the greedy action selection must still be performed using the estimated value function. This means that, whilst multi-step returns reduce the contribution of the possibly-mis-modelled bootstrap target in favour of empirical returns, they do not address the impact of _selecting the wrong action_ because of the bad value function. This bad action-selection process therefore contributes regardless of whether we bootstrap after one step or many. We therefore have two different drivers of possibly-pathological learning dynamics: bootstrapping with a mis-modelled function, and action-selection with a mis-modelled function. I expect there to be an interplay between these two effects.
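
As a rough illustration of this objective, the sketch below builds an n-step Q-learning target on a hypothetical 4-state chain. The names `td_target`, `step`, `is_terminal` and `alpha` are assumptions made for this example only and are not the functions defined later in the notebook. Note that ties are broken by `np.argmax` (lowest index), which is a simplification.

```python
import numpy as np

def td_target(q, s, a, step, is_terminal, gamma=1.0, n_steps=1):
    """n-step Q-learning target: empirical rewards for n steps, then bootstrap.

    After the first action, every later action is the greedy one under q, so a
    bad q can still select the wrong action however large n_steps is.
    """
    g, discount = 0.0, 1.0
    for _ in range(n_steps):
        r, s = step(s, a)
        g += discount * r
        discount *= gamma
        if is_terminal(s):
            return g                      # no bootstrap from a terminal state
        a = int(np.argmax(q[s]))          # greedy action under the estimate
    return g + discount * q[s, a]

# Hypothetical chain: state 3 is terminal, actions 0/1 move left/right,
# every move costs -1 and the agent cannot move past state 0.
def step(s, a):
    return -1.0, min(max(s + (1 if a == 1 else -1), 0), 3)

def is_terminal(s):
    return s == 3

q = np.zeros((4, 2))
alpha = 0.5                                # illustrative learning rate
target = td_target(q, s=2, a=1, step=step, is_terminal=is_terminal)
q[2, 1] += alpha * (target - q[2, 1])      # move the estimate towards the target
```

Increasing `n_steps` replaces more of the bootstrap term with observed rewards, but the intermediate actions are still chosen from `q`, which is the second driver described above.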

In particular, I want to initialise the value function in different ways (e.g. with correct/incorrect gradients, or correct/incorrect action-choices, or all-zeros), to see how these lead to different training trajectories.

In this simple example, it is relatively easy to construct the optimal policy along with its true value. We may therefore use three metrics to track our distance from our goal:
1. the MSE comparing our value function with the optimal one for all state-action pairs
2. the MSE comparing our value function with the optimal one using only the optimal state-action pairs
3. the "accuracy" of the greedy policy defined by our value function, i.e. the fraction of states in which the greedy action matches the optimal one

We can also plot the estimated and true value functions as training progresses.
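
The three metrics can be sketched on a toy table as follows. This is only an illustration: the function names are hypothetical, terminal states are assumed to be marked with NaN in the true table, and NaN entries are excluded from the means, which is a choice made for this sketch.

```python
import numpy as np

def mse_all_pairs(q_est, q_true):
    """Metric 1: MSE over every state-action pair with a finite true value."""
    mask = np.isfinite(q_true)
    return float(np.mean((q_est[mask] - q_true[mask]) ** 2))

def mse_optimal_pairs(q_est, q_true):
    """Metric 2: MSE using only the optimal action in each non-terminal state."""
    valid = np.isfinite(q_true).all(axis=1)
    best = np.argmax(q_true[valid], axis=1)
    rows = np.arange(valid.sum())
    residual = q_est[valid][rows, best] - q_true[valid][rows, best]
    return float(np.mean(residual ** 2))

def greedy_accuracy(q_est, q_true):
    """Metric 3: fraction of non-terminal states whose greedy action is optimal."""
    valid = np.isfinite(q_true).all(axis=1)
    est_a = np.argmax(q_est[valid], axis=1)
    true_a = np.argmax(q_true[valid], axis=1)
    return float(np.mean(est_a == true_a))

# Toy example: two ordinary states and one terminal state (NaN row).
q_true = np.array([[-3.0, -2.0], [-2.0, -1.0], [np.nan, np.nan]])
# An over-scaled estimate: correct ordering of actions, wrong magnitude.
q_est = np.where(np.isfinite(q_true), 2.0 * q_true, 0.0)

m_all = mse_all_pairs(q_est, q_true)
m_opt = mse_optimal_pairs(q_est, q_true)
acc = greedy_accuracy(q_est, q_true)
```

The over-scaled table has a perfect greedy policy but non-zero error on both MSE metrics, which is why the accuracy and the MSEs are tracked separately.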

Furthermore, we will track the MSE loss, the MSE loss for winning state-action pairs, and the maximum $|q(s,a)|$ over every epoch. These are the metrics we will actually have access to when training without knowledge of the true optimal value function.

We may consider how these experiments are affected when we introduce the following
- multi-step returns
- large learning rate to reduce absolute variance on gradient updates 
- double Q-learning using cloning and alternative models
- a tailored reward signal (introduces expert knowledge and biases solution towards a given form, so does not necessarily generalise well, but forces good behaviour)
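
To make the double Q-learning item concrete, here is a minimal sketch in which one table selects the greedy action and a second table (for example a periodically refreshed clone) supplies its value. The function name and the toy numbers are hypothetical and chosen only to show the difference from a plain max-over-actions target.

```python
import numpy as np

def double_q_target(q_select, q_value, r, s_next, terminal, gamma=1.0):
    """One-step target: q_select picks the action, q_value evaluates it."""
    if terminal:
        return r
    a_star = int(np.argmax(q_select[s_next]))   # action chosen by one table
    return r + gamma * q_value[s_next, a_star]  # value read from the other

# Two tables that disagree about state 1.
q_online = np.array([[0.0, 0.0], [1.0, 3.0]])
q_clone = np.array([[0.0, 0.0], [2.0, -1.0]])

# Double target: action 1 is selected by q_online, valued at -1 by q_clone.
double_target = double_q_target(q_online, q_clone, r=-1.0, s_next=1, terminal=False)

# Plain target from the clone alone takes its own maximum, 2.
single_target = -1.0 + np.max(q_clone[1])
```

Decoupling selection from evaluation means an over-estimated entry in one table is not automatically the value that gets bootstrapped.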

When using function approximation, we may add a term to the loss function which anchors the value estimates to priority state-action pairs. This helps to enforce the absolute scale of the value function, because bootstrapping only requires the value target to satisfy a relative relationship to nearby state-action pairs, specifically those chosen by the current possibly-mis-modelled greedy policy. This approach is not needed in a tabular setting, since there is no function to anchor (rather, the value of every state-action pair is updated in isolation from every other).

N.B. this approach would require expert knowledge to define the priority state-action pairs. The impact of anchoring becomes smaller the further we are from the anchor points, since all mis-modelling in between may compound.
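
A minimal sketch of what such an anchored loss could look like is given below. A small table stands in for the function approximator, and `anchor_pairs`, `anchor_values` and `anchor_weight` are hypothetical names introduced for this example; the notebook itself does not implement anchoring.

```python
import numpy as np

def anchored_loss(q, q_target, anchor_pairs, anchor_values, anchor_weight=1.0):
    """Bootstrapped MSE plus a penalty tying chosen entries to known values."""
    td_loss = np.nanmean((q - q_target) ** 2)
    s_idx, a_idx = np.array(anchor_pairs).T
    anchor_loss = np.mean((q[s_idx, a_idx] - np.asarray(anchor_values)) ** 2)
    return float(td_loss + anchor_weight * anchor_loss)

# A table that exactly matches its own targets, so the bootstrapped term is 0,
# but whose absolute scale is shifted relative to the expert-supplied anchor.
q = np.array([[-5.0, -4.0], [-4.0, -3.0]])
q_target = q.copy()

loss = anchored_loss(q, q_target, anchor_pairs=[(1, 1)], anchor_values=[-1.0])
```

Here the bootstrapped term alone cannot detect the shift, whereas the anchor term contributes the full loss.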

In [1]:
###
###  Import packages
###

import os, sys, time

import numpy as np

from matplotlib import pyplot as plt

import imageio


In [2]:
###
###  Configure environment variables
###

left_pad   = 25                          # Amount of line to the left of the terminal state
right_pad  = 4                           # Amount of line to the right of the terminal state

reward_per_turn    = -1.                 # Reward obtained per movement
reward_per_dx      = 0.                  # Reward multiplier for change in distance to goal (+ve def)
reward_at_boundary = -1.                 # Reward obtained when encountering the line boundary
discount_factor    = 1.                  # Discount factor


###
###  Non-configurable calculations
###

x_min      = 0                           # Minimum x index is always 0  <<<  do not change
x_terminal = left_pad                    # Terminal state location
x_max      = left_pad + right_pad        # Max x index, = num padding +1 (terminal state) -1 (idcs start @ 0)
x_range    = x_max - x_min               # Range of x indices, equal to x_max when x_min = 0
state_list = np.arange(x_range + 1)
num_states = len(state_list)

action_list = [-1, 1]                    # Allowed actions in units of dx
num_actions = len(action_list)


###
###  Print environment summary
###

print(f"left_pad           = {left_pad}")
print(f"right_pad          = {right_pad}")
print(f"reward_per_turn    = {reward_per_turn}")
print(f"reward_per_dx      = {reward_per_dx}")
print(f"reward_at_boundary = {reward_at_boundary}")

print(f"x_min              = {x_min}")
print(f"x_max              = {x_max}")
print(f"x_terminal         = {x_terminal}")
print(f"x_range            = {x_range}")
print(f"state_list         = {state_list}")
print(f"num_states         = {num_states}")
print(f"action_list        = {action_list}")
print(f"num_actions        = {num_actions}")


left_pad           = 25
right_pad          = 4
reward_per_turn    = -1.0
reward_per_dx      = 0.0
reward_at_boundary = -1.0
x_min              = 0
x_max              = 29
x_terminal         = 25
x_range            = 29
state_list         = [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29]
num_states         = 30
action_list        = [-1, 1]
num_actions        = 2


In [3]:
###
###  Define environment methods
###


def is_terminal(state) :
    '''
    Return True if state is the terminal state and False otherwise.
    Inputs:
      > state, int [x_min, x_max]
        x position to evaluate
    Returns:
      > bool
        whether the state is in the terminal state
    '''
    if state == x_terminal :
        return True
    return False


def is_out_of_bounds(state) :
    '''
    Return True if state is out of bounds and False otherwise.
    Inputs:
      > state, int [x_min, x_max]
        x position to evaluate
    Returns:
      > bool
        whether the state is out of bounds
    '''
    if state < x_min : return True
    if state > x_max : return True
    return False


def is_valid_agent_state(state) :
    '''
    Return True if state is within bounds and not terminal, and False otherwise.
    Inputs:
      > state, int [x_min, x_max]
        x position to evaluate
    Returns:
      > bool
        whether an agent may exist in this state
    '''
    if is_terminal     (state) : return False
    if is_out_of_bounds(state) : return False
    return True


def perform_action(state, action, base_reward=None, boundary_reward=None, dx_reward=None) :
    '''
    Given the current agent state, perform the specified action and return the reward obtained along with 
    the new agent state.
    Inputs:
      > state, int [x_min, x_max]
        x position of agent at initial timestep
      > action, int in [0, num_actions-1]
        index of action_list to dereference
      > base_reward, float, default=reward_per_turn
        basic reward returned every turn (expected -ve)
      > boundary_reward, float, default=reward_at_boundary
        reward received when encountering the edge of the game board (expected -ve)
      > dx_reward, float, default=reward_per_dx
        factor multiplied by change-in-distance to calculate movement reward (expected +ve def)
    Returns:
      > float
        reward obtained by performing action
      > int [x_min, x_max]
        x position of agent at iterated timestep
    '''
    ##
    if base_reward     is None : base_reward     = reward_per_turn
    if boundary_reward is None : boundary_reward = reward_at_boundary
    if dx_reward       is None : dx_reward       = reward_per_dx
    ##  Make sure initial state is valid to protect against unexpected behaviour
    if not is_valid_agent_state(state) :
        raise RuntimeError(f"Agent position ({state}) is not a valid agent state")
    ##  Make sure action is valid to protect against unexpected behaviour
    if action < 0 or action >= num_actions :
        raise RuntimeError(f"Action index ({action}) not found in allowed range [0, {num_actions})")
    ##  Get initial distance of agent from the end
    dx_agent = np.fabs(state - x_terminal)
    ##  Iterate agent state (if agent hits boundary then add penalty and return to original position)
    state_p   = state + action_list[action]
    reward_b  = 0
    if is_out_of_bounds(state_p) :
        reward_b = boundary_reward
        state_p  = state
    ##  Get distance-based reward
    dx_agent_p = np.fabs(state_p - x_terminal)
    reward_dx  = dx_reward * (dx_agent - dx_agent_p)
    ##  Calculate total reward by summing the base, boundary and distance rewards
    reward = base_reward + reward_b + reward_dx
    ##  Return reward and new agent state
    return reward, state_p
    

def get_greedy_action(state, *q_models) :
    '''
    Sample a greedy action from the q-value models provided. If multiple models provided then use their mean.
    Inputs:
      > state, int [x_min, x_max]
        x position of agent at initial timestep
      > q_models, list of np.ndarray shape=(num_states,num_actions)
        list of estimated value functions
    Returns:
      > int in [0, num_actions-1]
        index into action_list of the greedy action at this agent position, with ties broken uniformly at random
    '''
    action_values = 0
    for q_model in q_models : action_values += q_model[state,:]
    action_values /= len(q_models)
    action_max, best_actions = -np.inf, []
    for a, q in enumerate(action_values) :
        if q < action_max : continue
        if q == action_max :
            best_actions.append(a)
            continue
        action_max, best_actions = q, [a]
    return np.random.choice(best_actions)


def get_true_action_values() :
    '''
    Create the true action value function for the environment configured
    '''
    action_values = np.zeros(shape=(num_states,num_actions))
    for s in range(x_min, 1+x_max) :
        for a in range(num_actions) :
            if is_terminal(s) :
                action_values[s,a] = np.nan
                continue
            g, sp = perform_action(s, a)
            step_discount = discount_factor
            while not is_terminal(sp) :
                if sp > x_terminal : ap = 0
                else               : ap = 1
                r, sp          = perform_action(sp, ap)
                g             += step_discount*r
                step_discount *= discount_factor
            action_values[s,a] = g
    return action_values


def print_action_values(action_values) :
    '''
    Print a text version of the action values
    '''
    print("       | State")
    print("Action | " + " | ".join([f'{s}'.ljust(6) for s in state_list])) 
    print("-------+" + "-"*(9*num_states-1))
    for a_idx, a in enumerate(action_list) :
        print(f"{a}".ljust(6) + " | " +  " | ".join([f'{action_values[s,a_idx]:.2f}'.ljust(6) for s in state_list])) 
        print("-------+" + "-"*(9*num_states-1))

        
def evaluate_model_accuracy(q_model_true, q_model_eval) :
    '''
    Return the accuracy of the greedy policy obtained using the action value estimates.
    '''
    num_correct, num_total = 0, 0
    for state in state_list :
        if is_terminal(state) : continue
        true_a = get_greedy_action(state, q_model_true)
        eval_a = get_greedy_action(state, q_model_eval)
        num_total += 1
        if true_a == eval_a :
            num_correct += 1
    return num_correct / num_total
    
    
def get_mean_abs_error_between_models(q_model_1, q_model_2, squared=True) :
    '''
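    Return the mean squared error between the two models if squared=True (the default), or the mean absolute
    error if squared=False. Non-finite residuals (e.g. the NaN terminal-state entries) are set to zero but
    still counted in the mean, which is taken over all state-action pairs.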
    '''
    q_residual = (q_model_1 - q_model_2).flatten()
    q_residual = np.where(np.isfinite(q_residual), q_residual, 0.)
    q_residual = np.fabs(q_residual)
    if squared :
        q_residual = q_residual**2
    return q_residual.sum() / len(q_residual)


def get_all_state_action_pairs() :
    '''
    Create new numpy array of all state-action pairs
    '''
    state_action_pairs = []
    for state in range(num_states) : 
        if not is_valid_agent_state(state) : continue
        for action in range(num_actions) : 
            state_action_pairs.append((state, action))
    state_action_pairs = np.array(state_action_pairs)
    return state_action_pairs


def get_q_target(state, action, q_model_eval, q_model_bs, num_emp_steps=1, gamma=None) :
    '''
    Return multi-step target for given state-action pair.
    '''
    state_p, action_p, emp_return, step_discount = state, action, 0., 1.
    if gamma is None : gamma = discount_factor
    for step_idx in range(num_emp_steps) :
        if is_terminal(state_p) : break
        r, state_p     = perform_action(state_p, action_p)
        action_p       = get_greedy_action(state_p, q_model_eval)
        emp_return    += step_discount * r
        step_discount *= gamma
    bootstrap_return = 0. if is_terminal(state_p) else q_model_bs[state_p, action_p]
    q_target = emp_return + step_discount * bootstrap_return
    return q_target


def get_q_target_function(q_model_eval, q_model_bs, num_emp_steps=1, gamma=None) :
    '''
    Return multi-step target for all state-action pairs.
    '''
    q_target = np.full(shape=(num_states, num_actions), fill_value=np.nan)
    for s in range(num_states) :
        if is_terminal(s) : continue
        for a in range(num_actions) :
            q_target[s,a] = get_q_target(s, a, q_model_eval, q_model_bs, num_emp_steps=num_emp_steps, gamma=gamma)
    return q_target


def get_empirical_error(state, action, q_model_eval, q_model_bs, num_emp_steps=1, gamma=None) :
    '''
    Return multi-step error for given state-action pair.
    '''
    q_target = get_q_target(state, action, q_model_eval, q_model_bs, num_emp_steps=num_emp_steps, gamma=gamma)
    return q_target - q_model_eval[state, action]


def get_empirical_error_function(q_model_eval, q_model_bs, num_emp_steps=1, gamma=None) :
    '''
    Return multi-step error for all state-action pairs.
    '''
    q_target = get_q_target_function(q_model_eval, q_model_bs, num_emp_steps=num_emp_steps, gamma=gamma)
    return q_target - q_model_eval


In [4]:
###
###  Evaluate and print true action-values for this environment configuration
###

true_action_values, true_max_abs_q = None, np.nan

def update_true_action_values() :
    global true_action_values, true_max_abs_q
    true_action_values = get_true_action_values()
    true_max_abs_q     = np.nanmax(np.fabs(true_action_values))
    print("Action values of optimal policy are:\n")
    print_action_values(true_action_values)
    print(f"\n with a maximum |q(s,a)| of {true_max_abs_q}")

update_true_action_values()


Action values of optimal policy are:

       | State
Action | 0      | 1      | 2      | 3      | 4      | 5      | 6      | 7      | 8      | 9      | 10     | 11     | 12     | 13     | 14     | 15     | 16     | 17     | 18     | 19     | 20     | 21     | 22     | 23     | 24     | 25     | 26     | 27     | 28     | 29    
-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-1     | -27.00 | -26.00 | -25.00 | -24.00 | -23.00 | -22.00 | -21.00 | -20.00 | -19.00 | -18.00 | -17.00 | -16.00 | -15.00 | -14.00 | -13.00 | -12.00 | -11.00 | -10.00 | -9.00  | -8.00  | -7.00  | -6.00  | -5.00  | -4.00  | -3.00  | nan    | -1.00  | -2.00  | -3.00  | -4.00 
-------+-----------------------------------------------------------------------------------------------------------

In [5]:
###
###  Find greedy action for every state
###

greedy_state_action_pairs = []
for s in range(num_states) :
    if is_terminal(s) : continue
    if s > x_terminal : a = 0
    else              : a = 1
    greedy_state_action_pairs.append((s,a))
    
print(f"Greedy state-action pairs are: {greedy_state_action_pairs}")
    

Greedy state-action pairs are: [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (26, 0), (27, 0), (28, 0), (29, 0)]


In [6]:
###
###  Configure initial action value estimates
###
#
# Value function format: np array of shape (num_states,num_actions)
# For convenience this includes the terminal state, for which the contained value will be ignored
#

# Create list to store initial action value estimates for different experiments
# List items are pairs of the form (description of experiment, initial action value estimates)
experiment_configs = []

# First experiment: all zeros
experiment_configs.append(("All_zeros", np.zeros(shape=(num_states,num_actions))))

# Second experiment: inverted (so wrong action choice and dependence at every step)
experiment_configs.append(("Wrong_action_and_dependence", -.7*true_action_values))

# Third experiment: inverted and scaled
experiment_configs.append(("Wrong_action_and_dependence_scaled_up", -5.*true_action_values))

# Fourth experiment: correct dependence and action choice, but scaled up
experiment_configs.append(("Over_scaled", 5.*true_action_values))

# Fifth experiment: correct dependence and action choice, but scaled down
experiment_configs.append(("Under_scaled", .2*true_action_values))

# Sixth experiment: correct dependence and action choice, but shifted up
experiment_configs.append(("Up_shifted", true_action_values + true_max_abs_q + 3.))

# Seventh experiment: correct dependence and action choice, but shifted down
experiment_configs.append(("Down_shifted", true_action_values - true_max_abs_q - 3.))

# Eighth experiment: incorrect dependence but correct action choice
experiment_configs.append(("Wrong_dependence", .7*np.array([-true_action_values[:,1], -true_action_values[:,0]]).transpose()))

# Ninth experiment: incorrect dependence but correct action choice, scaled up
experiment_configs.append(("Wrong_dependence_scaled_up", 5.*np.array([-true_action_values[:,1], -true_action_values[:,0]]).transpose()))

# Tenth experiment: correct dependence but incorrect action choice
experiment_configs.append(("Wrong_actions", .7*np.array([true_action_values[:,1], true_action_values[:,0]]).transpose()))

# Eleventh experiment: correct dependence but incorrect action choice, scaled up
experiment_configs.append(("Wrong_actions_scaled_up", 5.*np.array([true_action_values[:,1], true_action_values[:,0]]).transpose()))
              
# Twelfth experiment onwards: random values with different magnitudes
experiment_configs.append(("Random_values_small_v1", np.random.normal(size=(num_states,num_actions), scale=0.1)))
experiment_configs.append(("Random_values_small_v2", np.random.normal(size=(num_states,num_actions), scale=0.1)))

experiment_configs.append(("Random_values_medium_v1", np.random.normal(size=(num_states,num_actions), scale=1)))
experiment_configs.append(("Random_values_medium_v2", np.random.normal(size=(num_states,num_actions), scale=1)))

experiment_configs.append(("Random_values_large_v1", np.random.normal(size=(num_states,num_actions), scale=5)))
experiment_configs.append(("Random_values_large_v2", np.random.normal(size=(num_states,num_actions), scale=5)))

experiment_configs.append(("Random_values_very_large_v1", np.random.normal(size=(num_states,num_actions), scale=20)))
experiment_configs.append(("Random_values_very_large_v2", np.random.normal(size=(num_states,num_actions), scale=20)))


In [7]:
def generate_directory_for_file_path(fname, print_msg_on_dir_creation=True) :
    """
    Create the directory structure needed to place file fname. Call this before fig.savefig(fname, ...) to 
    make sure fname can be created without a FileNotFoundError
    Input:
       - fname: str
                name of file you want to create a tree of directories to enclose
                also create directory at this path if fname ends in '/'
       - print_msg_on_dir_creation: bool, default = True
                                    if True then print a message whenever a new directory is created
    """
    while "//" in fname :
        fname = fname.replace("//", "/")
    dir_tree = fname.split("/")
    dir_tree = ["/".join(dir_tree[:i]) for i in range(1,len(dir_tree))]
    dir_path = ""
    for dir_path in dir_tree :
        if len(dir_path) == 0 : continue
        if not os.path.exists(dir_path) :
            os.mkdir(dir_path)
            if print_msg_on_dir_creation :
                print(f"Directory {dir_path} created")
            continue
        if os.path.isdir(dir_path) : 
            continue
        raise RuntimeError(f"Cannot create directory {dir_path} because it already exists and is not a directory")
    

def create_config(config_fname, to_stdout=True) :
    '''
    Print environment and training configurations to file config_fname. Also repeat them to sys.stdout if
    requested.
    Inputs:
      > config_fname, str
        name of config file to create
      > to_stdout, bool, default=True
        if True then repeat environment and training configurations to sys.stdout
    Returns:
      > None
    '''
    # Create message as list of strings
    config_message = []
    config_message.append(f"="*114 + "\n")
    config_message.append(f"Environment config:\n")
    config_message.append(f"> left_pad: {left_pad}\n")
    config_message.append(f"> right_pad: {right_pad}\n")
    config_message.append(f"> reward_per_turn: {reward_per_turn}\n")
    config_message.append(f"> reward_per_dx: {reward_per_dx}\n")
    config_message.append(f"> reward_at_boundary: {reward_at_boundary}\n")
    config_message.append(f"> discount_factor: {discount_factor}\n")
    config_message.append(f"> x_min: {x_min}\n")
    config_message.append(f"> x_max: {x_max}\n")
    config_message.append(f"> x_terminal: {x_terminal}\n")
    config_message.append(f"> x_range: {x_range}\n")
    config_message.append(f"> state_list: {state_list}\n")
    config_message.append(f"> num_states: {num_states}\n")
    config_message.append(f"> action_list: {action_list}\n")
    config_message.append(f"> num_actions: {num_actions}\n")
    config_message.append(f"="*114 + "\n")
    config_message.append(f"Training config:\n")
    config_message.append(f"> max_epochs: {max_epochs}\n")
    config_message.append(f"> end_at_MSE_true: {end_at_MSE_true}\n")
    config_message.append(f"> learning_rate: {learning_rate}\n")
    config_message.append(f"> plot_estimate_after_epochs: {plot_estimate_after_epochs}\n")
    config_message.append(f"> plot_monitors_after_epochs: {plot_monitors_after_epochs}\n")
    config_message.append(f"> switch_after_epochs: {switch_after_epochs}\n")
    config_message.append(f"> clone_after_epochs: {clone_after_epochs}\n")
    config_message.append(f"> bootstrap_method: {bootstrap_method}\n")
    config_message.append(f"> num_step_returns: {num_step_returns}\n")
    config_message.append(f"="*114 + "\n")
    # Make sure directory exists for file
    generate_directory_for_file_path(config_fname, print_msg_on_dir_creation=True)
    # Open file and write each line of the message, also echoing to stdout if configured
    with open(config_fname, "w") as config_file :
        for line in config_message :
            config_file.write(line)
            if not to_stdout : continue
            sys.stdout.write(line)
        
        
def create_value_estimate_plot(true_q_model, q_model, q_model_bs, q_target, epoch_idx=-1, 
                               show=False, close=False, save="", dpi=200) :
    '''
    Create a plt.Figure instance comparing the estimated, bootstrap, target and true q-values for each 
    action, annotated with the bootstrap and true greedy policies. Allows for plot to be shown, saved and/or 
    closed using plt interface. Returns the plot figure and axis objects so they can continue to be 
    manipulated, but note that the figure will no longer be usable if we have called plt.close(fig).
    Inputs:
      > true_q_model, np.ndarray shape (num_states, num_actions)
        true q-values
      > q_model, np.ndarray shape (num_states, num_actions)
        current q-value estimates
      > q_model_bs, np.ndarray shape (num_states, num_actions)
        current bootstrap q-values
      > q_target, np.ndarray shape (num_states, num_actions)
        current target for q-values
      > epoch_idx, int, default=-1
        if positive then draw a text box displaying how many epochs have been performed
      > show, bool, default=False
        if True then call plt.show()
      > close, bool, default=False
        if True then call plt.close(fig)
      > save, str, default=""
        if string provided then call fig.savefig(save, ...), creating any required subdirectories if needed
      > dpi, int, default=200
        resolution passed to fig.savefig
    Returns:
      > plt.Figure instance
        Figure object
      > plt.Axes instance
        Left-hand axis object
      > plt.Axes instance
        Right-hand axis object
    '''
     
    #  Keep track of how long plotting takes, to help inform how often to call this function    
    start_time = time.time()

    #  Make plot
    fig = plt.figure(figsize=(14, 6))
    fig.set_facecolor("white")
    fig.set_alpha(1)
    
    ax1 = fig.add_subplot(1, 2, 1)
    ax1.tick_params(axis="both", which="both", right=True, top=True, direction="in", labelsize=12)
    ax1.plot(state_list, q_model     [:,0], "o-" , c="r"         , ms=5, lw=3, alpha=0.5, label="Estimated $q(s,a)$")
    ax1.plot(state_list, q_model_bs  [:,0], "x-" , c="b"         , ms=5, lw=3, alpha=0.5, label="Bootstrap")
    ax1.plot(state_list, q_target    [:,0], "x-" , c="darkorange", ms=5, lw=3, alpha=0.5, label="Target")
    ax1.plot(state_list, true_q_model[:,0], ".--", c="gray"      , ms=5, lw=3, alpha=0.5, label="True")
    ax1.grid(True, which='both')
    ax1.set_xlabel("$x$", labelpad=15, fontsize=14)
    
    ax2 = fig.add_subplot(1, 2, 2)
    ax2.tick_params(axis="both", which="both", right=True, top=True, direction="in", labelsize=12)
    ax2.plot(state_list, q_model     [:,1], "o-" , c="r"         , ms=5, lw=3, alpha=0.5, label="Estimated $q(s,a)$")
    ax2.plot(state_list, q_model_bs  [:,1], "x-" , c="b"         , ms=5, lw=3, alpha=0.5, label="Bootstrap")
    ax2.plot(state_list, q_target    [:,1], "x-" , c="darkorange", ms=5, lw=3, alpha=0.5, label="Target")
    ax2.plot(state_list, true_q_model[:,1], ".--", c="gray"      , ms=5, lw=3, alpha=0.5, label="True")
    ax2.grid(True, which='both')
    ax2.set_xlabel("$x$", labelpad=15, fontsize=14)
             
    #  Draw accompanying plot objects
    ax1.legend(loc=(0.7,1.06), ncol=5, fontsize=14, frameon=False)
    ax1.axhline(0, lw=1, c="k", ls="-")
    ax2.axhline(0, lw=1, c="k", ls="-")
    ax1.text(0.01, 1.01, f"Action: left" , ha="left", va="bottom", weight="bold", transform=ax1.transAxes, 
             alpha=0.8, fontsize=12, c="k")
    ax2.text(0.01, 1.01, f"Action: right", ha="left", va="bottom", weight="bold", transform=ax2.transAxes, 
             alpha=0.8, fontsize=12, c="k")
    
    #  Draw greedy policies
    model_greedy_policy_str, bs_greedy_policy_str, true_greedy_policy_str = "", "", ""
    for state in state_list :
        if is_terminal(state) : continue
        model_greedy_policy_str += "L  " if get_greedy_action(state, q_model     ) == 0 else "R  "
        bs_greedy_policy_str    += "L  " if get_greedy_action(state, q_model_bs  ) == 0 else "R  "
        true_greedy_policy_str  += "L  " if get_greedy_action(state, true_q_model) == 0 else "R  "
    #ax2.text(1, -0.20, f"Model policy:  {model_greedy_policy_str}"    , ha="right", va="top", weight="bold", fontsize=16, transform=ax2.transAxes)
    ax2.text(1, -0.20, f"Bootstrap policy:  {bs_greedy_policy_str}", ha="right", va="top", weight="bold", fontsize=16, transform=ax2.transAxes)
    ax2.text(1, -0.30, f"True policy:  {true_greedy_policy_str}"     , ha="right", va="top", weight="bold", fontsize=16, transform=ax2.transAxes)
        
    #  Figure out and set y-axis ranges
    y_min   = np.min([0, np.nanmin(q_model), np.nanmin(q_model_bs), np.nanmin(true_q_model)])
    y_max   = np.max([0, np.nanmax(q_model), np.nanmax(q_model_bs), np.nanmax(true_q_model)])
    y_range = y_max - y_min
    y_pad   = 0.1
    y_lim   = [y_min - y_pad*y_range, y_max + y_pad*y_range]
    ax1.set_ylim(y_lim)
    ax2.set_ylim(y_lim)
    
    #  Draw text boxes displaying title and num. epochs
    if epoch_idx >= 0 :
        ax1.text(0., 1.08, f"After {epoch_idx} epochs", ha="left", va="bottom", weight="bold", 
                 transform=ax1.transAxes, fontsize=14)
       
    #  Save / show / close
    if len(save) > 0 :
        generate_directory_for_file_path(save)
        fig.savefig(save, bbox_inches="tight", dpi=dpi, transparent=False)
    if show :
        plt.show()
    if close :
        plt.close(fig)
        
    #  Return figure and axis
    return fig, ax1, ax2

        
def create_training_curves_plot(epochs_record, MSE_record, ref_MSE_record, true_MSE_record, accuracy_record, max_abs_q_record, 
                                true_max_abs_q=np.nan, show=False, close=False, save="", dpi=300) :
    '''
    Create a plt.Figure instance visualising the training curves. Allows for plot to be shown, saved and/or 
    closed using plt interface. Returns the plot figure and axis objects so they can continue to be 
    manipulated, but note that objects will no longer be in scope if we have called plt.close(fig).
    Inputs:
      > epochs_record, sequence of int
        epoch indices at which the monitors were recorded
      > MSE_record, sequence of float
        MSE between q-value estimates and targets over all state-action pairs
      > ref_MSE_record, sequence of float
        MSE between q-value estimates and targets over the true greedy state-action pairs
      > true_MSE_record, sequence of float
        MSE between q-value estimates and true q-values over all state-action pairs
      > accuracy_record, sequence of float
        accuracy of the greedy policy
      > max_abs_q_record, sequence of float
        maximum |q(s,a)| of the estimates
      > true_max_abs_q, float, default=np.nan
        if finite then draw a reference line at the true maximum |q(s,a)|
      > show, bool, default=False
        if True then call plt.show(fig)
      > close, bool, default=False
        if True then call plt.close(fig)
      > save, str, default=""
        if string provided then call fig.savefig(save, ...), creating any required subdirectories if needed
    Returns:
      > plt.Figure instance
      > plt.Axes instance (axis corresponding to MSE curves)
      > plt.Axes instance (axis corresponding to ref_MSE curves)
      > plt.Axes instance (axis corresponding to true_MSE curves)
      > plt.Axes instance (axis corresponding to accuracy curves)
      > plt.Axes instance (axis corresponding to max_abs_q curves)
    '''
            
    fig = plt.figure(figsize=(30,20))
    fig.set_facecolor("white")
    fig.set_alpha(1)
    
    ax1 = fig.add_subplot(5, 1, 1)
    ax1.grid(True, which='both')
    ax1.tick_params(axis="both", which="both", right=True, top=True, direction="in", labelsize=30)
    ax1.set_title(r"$\mathbb{E}_{s,a}\left[ | q(s,a) - q_{target}(s,a) |^2 \right]$", fontsize=30)
    ax1.xaxis.set_ticklabels([])
    ax1.plot(epochs_record, MSE_record, "o-", c="r", ms=5, lw=3)
    ax1.set_yscale("log")
    
    ax2 = fig.add_subplot(5, 1, 2)
    ax2.grid(True, which='both')
    ax2.tick_params(axis="both", which="both", right=True, top=True, direction="in", labelsize=30)
    ax2.set_title(r"$\mathbb{E}_{\mathrm{true\ greedy}~s,a}\left[ | q(s,a) - q_{target}(s,a) |^2 \right]$", fontsize=30)
    ax2.xaxis.set_ticklabels([])
    ax2.plot(epochs_record, ref_MSE_record, "o-", c="r", ms=5, lw=3)
    ax2.set_yscale("log")
    
    ax3 = fig.add_subplot(5, 1, 3)
    ax3.grid(True, which='both')
    ax3.tick_params(axis="both", which="both", right=True, top=True, direction="in", labelsize=30)
    ax3.set_title(r"$\mathbb{E}_{s,a}\left[ | q(s,a) - q_{true}(s,a) |^2 \right]$", fontsize=30)
    ax3.xaxis.set_ticklabels([])
    ax3.plot(epochs_record, true_MSE_record, "o-", c="r", ms=5, lw=3)
    ax3.set_yscale("log")
    
    ax4 = fig.add_subplot(5, 1, 4)
    ax4.grid(True, which='both')
    ax4.tick_params(axis="both", which="both", right=True, top=True, direction="in", labelsize=30)
    ax4.set_title(r"Accuracy of greedy policy", fontsize=30)
    ax4.xaxis.set_ticklabels([])
    ax4.plot(epochs_record, accuracy_record, "o-", c="r", ms=5, lw=3)
    ax4.axhline(0, ls="--", lw=2, c="gray")
    ax4.axhline(1, ls="--", lw=2, c="gray")
    
    ax5 = fig.add_subplot(5, 1, 5)
    ax5.grid(True, which='both')
    ax5.tick_params(axis="both", which="both", right=True, top=True, direction="in", labelsize=30)
    ax5.set_title(r"Max $|q(s,a)|$", fontsize=30)
    ax5.plot(epochs_record, max_abs_q_record, "o-", c="r", ms=5, lw=3)
    ax5.set_xlabel(r"Epoch", labelpad=15, fontsize=30)
    ax5.axhline(0, ls="--", lw=2, c="gray")
    if np.isfinite(true_max_abs_q) :
        ax5.axhline(true_max_abs_q, ls="--", lw=2, c="gray")
        ax5.text(0, true_max_abs_q, "True maximum", fontsize=20, ha="left", va="top", c="k")
    
    fig.subplots_adjust(hspace=0.2)
    
    if len(save) > 0 :
        generate_directory_for_file_path(save)
        fig.savefig(save, bbox_inches="tight", dpi=dpi, transparent=False)
    if show :
        plt.show()
    if close :
        plt.close(fig)
        
    return fig, ax1, ax2, ax3, ax4, ax5
    

In [8]:
## Configure

max_epochs                  = np.inf
end_at_MSE_true             = 0.01
learning_rate               = 0.1
plot_estimate_after_epochs  = 5
plot_monitors_after_epochs  = -1
switch_after_epochs         = -1
clone_after_epochs          = -1
bootstrap_method            = "self"       # ["clone", "self", "other"]
num_step_returns            = 1

%matplotlib inline

if bootstrap_method not in ["clone", "self", "other"] :
    raise NotImplementedError(f"Bootstrap method {bootstrap_method} not implemented")


In [9]:

def run_experiment(run_tag, initial_q_model, show_curves=True) :
    
    ## Set up models
    q_model  = initial_q_model.copy()
    for a in range(num_actions) : q_model[x_terminal,a] = np.nan
    bs_model = q_model.copy()

    ## Write config to file only (to_stdout=False because the model summaries are verbose)
    create_config(f"figures/Q_learning_1D_linewalk/{run_tag}/config.txt", to_stdout=False)

    ## Create containers and method to monitor progress
    
    MSE_record, ref_MSE_record, true_MSE_record, accuracy_record, max_abs_q_record = [], [], [], [], []
    epochs_record, q_record = [], []
    
    def calculate_monitors() :
        empirical_errors = np.zeros(shape=(num_states, num_actions))
        for s in range(num_states) :
            if is_terminal(s) : continue
            for a in range(num_actions) :
                empirical_errors[s,a] = get_empirical_error(s, a, q_model, bs_model, num_emp_steps=num_step_returns)
        empirical_errors_sq = empirical_errors ** 2
        MSE       = np.mean(empirical_errors_sq)
        ref_MSE   = np.mean([empirical_errors_sq[s,a] for s,a in greedy_state_action_pairs])
        true_MSE  = get_mean_abs_error_between_models(q_model, true_action_values, squared=True)
        max_abs_q = np.nanmax(np.fabs(q_model))
        accuracy  = evaluate_model_accuracy(true_action_values, q_model)
        return MSE, ref_MSE, true_MSE, accuracy, max_abs_q
        
    def record_progress(epoch_idx, q_model, MSE, ref_MSE, true_MSE, accuracy, max_abs_q) :
        epochs_record   .append(epoch_idx)
        q_record        .append(q_model.copy())
        MSE_record      .append(MSE)
        ref_MSE_record  .append(ref_MSE)
        true_MSE_record .append(true_MSE)
        accuracy_record .append(accuracy)
        max_abs_q_record.append(max_abs_q)
        
    MSE, ref_MSE, true_MSE, accuracy, max_abs_q = calculate_monitors()
    record_progress(0, q_model, MSE, ref_MSE, true_MSE, accuracy, max_abs_q)
    
    ## Start training

    sys.stdout.write("Starting learning")
    epoch_idx, start_time = 0, time.time()
    value_function_fignames = []
    while (epoch_idx < max_epochs or max_epochs < 0) and true_MSE_record[-1] > end_at_MSE_true :

        # Determine whether to plot training curves
        if plot_monitors_after_epochs > 0 and epoch_idx > 0 and epoch_idx % plot_monitors_after_epochs == 0 :
            create_training_curves_plot(epochs_record, MSE_record, ref_MSE_record, true_MSE_record, accuracy_record, max_abs_q_record, true_max_abs_q,
                                        show=False, close=True, save=f"figures/Q_learning_1D_linewalk/{run_tag}/training_curves.pdf")
            
        # Determine whether to bootstrap from self
        if bootstrap_method == "self" :
            bs_model = q_model.copy()
        
        # Determine whether to swap q_model and bs_model
        if bootstrap_method == "other" and switch_after_epochs > 0 and epoch_idx > 0 and epoch_idx % switch_after_epochs == 0 :
            q_model, bs_model = bs_model, q_model

        # Determine whether to copy q_model into bs_model
        if bootstrap_method == "clone" and clone_after_epochs > 0 and epoch_idx % clone_after_epochs == 0 :
            bs_model = q_model.copy()
            
        # For each state/action pair, find the empirical error
        q_target            = get_q_target_function(q_model, bs_model, num_emp_steps=num_step_returns)
        empirical_errors    = q_target - q_model
        empirical_errors_sq = empirical_errors ** 2
        
        # Determine whether to plot value function estimates (do now because we have the q_target values)
        if plot_estimate_after_epochs > 0 and epoch_idx % plot_estimate_after_epochs == 0 :
            figname = f"figures/Q_learning_1D_linewalk/{run_tag}/value_estimates_epoch{epoch_idx}.png"
            create_value_estimate_plot(true_action_values, q_model, bs_model, q_target, epoch_idx=epoch_idx,
                                       show=False, close=True, save=figname)
            value_function_fignames.append(figname)
                
        # Update the action-values
        q_model = q_model + learning_rate * empirical_errors
        
        # Get monitor values
        MSE, ref_MSE, true_MSE, accuracy, max_abs_q = calculate_monitors()
        sys.stdout.write(f"\rEpoch {epoch_idx+1} / {max_epochs} [t={time.time()-start_time:.2f}s] <MSE: {MSE:.4f}, ref_MSE: {ref_MSE:.4f}, true_MSE: {true_MSE:.4f}, accuracy: {accuracy:.2f}, max_abs_q: {max_abs_q:.1f}>".ljust(110))

        # Manually iterate epoch index, record monitors
        epoch_idx += 1
        record_progress(epoch_idx, q_model, MSE, ref_MSE, true_MSE, accuracy, max_abs_q)

        
    ## Terminate this line of stdout
    sys.stdout.write("\n")
    sys.stdout.flush()
       
    ## Plot final training curves
    create_training_curves_plot(epochs_record, MSE_record, ref_MSE_record, true_MSE_record, accuracy_record, max_abs_q_record, true_max_abs_q,
                                show=show_curves, close=True, save=f"figures/Q_learning_1D_linewalk/{run_tag}/training_curves.pdf")

    ## Plot final value functions
    q_target = get_q_target_function(q_model, bs_model, num_emp_steps=num_step_returns)
    figname  = f"figures/Q_learning_1D_linewalk/{run_tag}/value_estimates_epoch{epoch_idx}.png"
    create_value_estimate_plot(true_action_values, q_model, bs_model, q_target, epoch_idx=epoch_idx,
                                show=show_curves, close=True, save=figname)
    value_function_fignames.append(figname)
    imageio.mimsave(f"figures/Q_learning_1D_linewalk/{run_tag}/value_estimates_animated.gif", 
                [imageio.v2.imread(fname) for fname in value_function_fignames], fps=2)
    

In [10]:
'''for exp_tag, initial_q_values in experiment_configs :
    print(f"\nRUNNING EXPERIMENT TAG: {exp_tag} {initial_q_values.shape}\n")
    run_experiment(exp_tag, initial_q_values, show_curves=False)'''


# Conclusions

1. All of these experiments converge nicely! I conclude that the diverging / slowly-converging behaviour only arises when function approximation is used (to be demonstrated by running NB0 with a similar environment config).

2. Convergence occurs by first getting the 'near to end' states into position, since these are not biased by the bootstrapping and move directly towards their correct values. This information then gradually propagates outwards from the next-to-terminal states to all others. At any given time close to convergence, the accuracy decreases as we move away from the near-terminal states, because the bootstrapping bias compounds with every step for which $q_\mathrm{model} \neq r + \gamma \cdot q_\mathrm{bs}$.

3. Using the MSE as a metric is flawed because it only measures the distance between the current value estimates and the target ones. However, the target is biased by bootstrapping with incorrect functional dependence and/or incorrect action choice. Even with a good target, the MSE is bounded by the scale of expected immediate rewards. This metric really tells us about the speed of learning, which can fluctuate up and down, and does not necessarily tell us about convergence. If it falls to very small values then we have probably reached convergence, but this may not be observable in a stochastic setting.
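
The outward propagation described in point 2 can be reproduced in a few lines. The block below is a minimal sketch on a hypothetical 8-state chain, not the notebook's environment: the chain layout, the reward of -1 per step, and the names `step`, `q_target` and `train` are all assumptions made for illustration. It starts from a sign-flipped, scaled-up estimate and compares the error of the "right" action at each state after a short and a long run.

```python
import numpy as np

# Hypothetical toy chain (not the notebook's environment): states 0..7 with the
# terminal state at the right-hand end, actions 0 = left / 1 = right, reward -1 per step.
n_states, gamma, lr = 8, 0.9, 0.1
terminal = n_states - 1

def step(s, a):
    """Deterministic move; the left wall keeps the walker in state 0."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), terminal)
    return s_next, -1.0

def q_target(q):
    """One-step target r + gamma * max_a' q(s', a'), with no bootstrap from the terminal state."""
    target = np.zeros_like(q)
    for s in range(terminal):
        for a in range(2):
            s_next, r = step(s, a)
            bootstrap = 0.0 if s_next == terminal else np.max(q[s_next])
            target[s, a] = r + gamma * bootstrap
    return target

def train(q_init, num_epochs):
    """Synchronous update of every non-terminal pair, bootstrapping from the current estimate."""
    q = q_init.copy()
    for _ in range(num_epochs):
        target = q_target(q)
        q[:terminal] += lr * (target[:terminal] - q[:terminal])
    return q

# Reference values: iterate the target map until it reaches its fixed point.
q_true = np.zeros((n_states, 2))
for _ in range(500):
    q_true = q_target(q_true)

# Badly wrong starting point: wrong sign and five times too large.
q_init = -5.0 * q_true
q_mid, q_end = train(q_init, 50), train(q_init, 2000)

err_mid = np.abs(q_mid - q_true)[:terminal, 1]   # 'right' action, states 0..6
err_end = np.abs(q_end - q_true)[:terminal]
print("error of the 'right' action by state after 50 epochs:", np.round(err_mid, 3))
print("largest error after 2000 epochs:", err_end.max())
```

In this sketch the pair next to the terminal state has a target that never depends on the estimate, so its error shrinks by a fixed factor every epoch, whereas pairs far from the terminal state keep bootstrapping from values that are still wrong. Printing the mean squared difference between `q_target(q)` and `q` alongside the error against `q_true` would give the comparison discussed in point 3.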

In [11]:
'''## Configure multi-step experiments

num_step_returns = 3

experiment_configs = []
experiment_configs.append(("Wrong_action_and_dependence_multistep", -.7*true_action_values))
experiment_configs.append(("Wrong_action_and_dependence_multistep_scaled_up", -5.*true_action_values))
experiment_configs.append(("Wrong_dependence_multistep", .7*np.array([-true_action_values[:,1], -true_action_values[:,0]]).transpose()))
experiment_configs.append(("Wrong_dependence_multistep_scaled_up", 5.*np.array([-true_action_values[:,1], -true_action_values[:,0]]).transpose()))
experiment_configs.append(("Wrong_actions_multistep", .7*np.array([true_action_values[:,1], true_action_values[:,0]]).transpose()))
experiment_configs.append(("Wrong_actions_multistep_scaled_up", 5.*np.array([true_action_values[:,1], true_action_values[:,0]]).transpose()))
experiment_configs.append(("Random_values_multistep_small_v1", np.random.normal(size=(num_states,num_actions), scale=0.1)))
experiment_configs.append(("Random_values_multistep_medium_v1", np.random.normal(size=(num_states,num_actions), scale=1)))
experiment_configs.append(("Random_values_multistep_large_v1", np.random.normal(size=(num_states,num_actions), scale=5)))

for exp_tag, initial_q_values in experiment_configs :
    print(f"\nRUNNING EXPERIMENT TAG: {exp_tag} {initial_q_values.shape}\n")
    run_experiment(exp_tag, initial_q_values, show_curves=False)
    
## Return num-steps back to 1
    
num_step_returns = 1
'''


In [12]:
'''
reward_per_dx_backup = reward_per_dx
reward_per_dx        = 1.
exp_tag              = "Random_values_tailored_reward_v1"
update_true_action_values()
run_experiment(exp_tag, np.random.normal(size=(num_states,num_actions)), show_curves=False)
reward_per_dx        = reward_per_dx_backup

discount_factor_backup = discount_factor
discount_factor        = .9
exp_tag                = "Random_values_gamma_0.9"
update_true_action_values()
run_experiment(exp_tag, np.random.normal(size=(num_states,num_actions)), show_curves=False)
discount_factor        = discount_factor_backup

bootstrap_method_backup   = bootstrap_method
clone_after_epochs_backup = clone_after_epochs
bootstrap_method          = "clone"
clone_after_epochs        = 5
exp_tag                   = "Random_values_clone_5"
update_true_action_values()
run_experiment(exp_tag, np.random.normal(size=(num_states,num_actions)), show_curves=False)
bootstrap_method          = bootstrap_method_backup
clone_after_epochs        = clone_after_epochs_backup
               
update_true_action_values()
'''


In [13]:

discount_factor_backup = discount_factor
discount_factor        = .9
exp_tag                = "Random_values_gamma_0.9"
update_true_action_values()
run_experiment(exp_tag, np.random.normal(size=(num_states,num_actions)), show_curves=False)
discount_factor        = discount_factor_backup
update_true_action_values()


Action values of optimal policy are:

       | State
Action | 0      | 1      | 2      | 3      | 4      | 5      | 6      | 7      | 8      | 9      | 10     | 11     | 12     | 13     | 14     | 15     | 16     | 17     | 18     | 19     | 20     | 21     | 22     | 23     | 24     | 25     | 26     | 27     | 28     | 29    
-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-1     | -10.35 | -9.35  | -9.28  | -9.20  | -9.11  | -9.02  | -8.91  | -8.78  | -8.65  | -8.50  | -8.33  | -8.15  | -7.94  | -7.71  | -7.46  | -7.18  | -6.86  | -6.51  | -6.13  | -5.70  | -5.22  | -4.69  | -4.10  | -3.44  | -2.71  | nan    | -1.00  | -1.90  | -2.71  | -3.44 
-------+-----------------------------------------------------------------------------------------------------------