# Introduction
- In this kernel, we will be implementing an example environment.
- We will be deploying SARSA, Q-Learning and Expected SARSA to try and find the optimal agent's policy and the optimal value functions, in order to maximize the rewards.

# Importing Packages & Boilerplate Stuff

1. jdc: Jupyter magic that allows defining classes over multiple jupyter notebook cells.
2. numpy: the fundamental package for scientific computing with Python.
3. matplotlib: the library for plotting graphs in Python.
4. RL-Glue: the library for reinforcement learning experiments.
5. BaseEnvironment, BaseAgent: the base classes from which we will inherit when creating the environment and agent classes in order for them to support the RL-Glue framework.
6. itertools.product: the function that can be used easily to compute permutations.
7. tqdm.tqdm: Provides progress bars for visualizing the status of loops.

# Based on Version_1 (Changes)
- Some small changes in the value iteration section. Printing some extra values, nothing special.
- Removed the hyper-tuning cells from the end of the notebook.
- The key change in this version is that the step-size decreases over the iterations, in Q-Learning.

In [1]:
import jdc
import copy
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
from itertools import product
from tqdm import tqdm

In [2]:
### DEBUG CODE
# Setting the seed for reproducible results
# np.random.seed(0)

# 1. Environment
- The below code cell provides the backbone of the `ExampleEnvironment` class.

In [3]:
class ExampleEnvironment():
    def __init__(self, env_info={}):
        # These are the different possible states
        self.grid = [0, 1, 2, 3]
        
        # The rewards produced by the environment in response to the different ...
        # ... actions of the agent in different states
        self.rewards = [
            [0, 0, 2],
            [0, 1, 0],
            [1, 1, 0],
            [2, 1.5, 3]
        ]

        # The environment is governed by the following dynamics
        # In mathematical notation, this is nothing but p(s'|s,a)
        # But in this example, we are assuming to be independent of actions, i.e., ...
        # p(s'|s, a) is equal for all actions in state s
        self.tran_matrix = np.array([
            [1/2, 1/2, 0, 0],    # State 0
            [1/4, 1/4, 1/2, 0],  # State 1
            [0, 1/4, 1/4, 1/2],  # State 2
            [0, 0, 1/4, 3/4]     # State 3
        ])
        
        # Defining a random generator
        self.rand_generator = np.random.RandomState(env_info.get("seed", 0))
        
        # Defines the current location
        self.cur_loc = None
        
    def start(self):
        self.cur_loc = self.rand_generator.choice(self.grid)
        return self.cur_loc
    
    def step(self, action):
        next_reward = self.rewards[self.cur_loc][action]
        next_state = self.rand_generator.choice(self.grid, 
            p = self.tran_matrix[self.cur_loc])
        self.cur_loc = next_state
        return next_state, next_reward

# 2. Value Iteration

In [4]:
def value_iteration(theta = 0.01, discount = 0.9):
    # Creating an instance for the environment
    env = ExampleEnvironment()

    # Defining the paramters for the simulation
    delta = theta * 10

    # Initializing the state values and the different possible actions
    s_vals = np.zeros(4)
    actions = list(np.arange(3))

    while delta > theta:
        delta = 0
        for s in env.grid:
            cur_val = copy.copy(s_vals[s])
            vals = []
            for a in actions:
                sum_rhs = env.tran_matrix[s] * (env.rewards[s][a] + discount * s_vals)
                vals.append(np.sum(sum_rhs))
            s_vals[s] = np.max(vals)
            delta = max(delta, abs(cur_val - s_vals[s]))
            
    return s_vals

In [5]:
s_vals = value_iteration(theta = 0.001, discount = 0.9)
print("Post Convergence of Value Iteration Algorithm")
print("State Values: ", s_vals)

s_vals = value_iteration(theta = 0.001, discount = 0.8)
print("\nPost Convergence of Value Iteration Algorithm")
print("State Values: ", s_vals)

s_vals = value_iteration(theta = 0.001, discount = 0.7)
print("\nPost Convergence of Value Iteration Algorithm")
print("State Values: ", s_vals)

Post Convergence of Value Iteration Algorithm
State Values:  [18.61753617 18.31203114 20.0076184  23.08053263]

Post Convergence of Value Iteration Algorithm
State Values:  [ 8.74730387  8.12262406  9.37273269 12.18526228]

Post Convergence of Value Iteration Algorithm
State Values:  [5.71678855 4.90395976 5.84467657 8.46846691]


In [6]:
def value_iteration_using_q(theta = 0.01, discount = 0.9):
    # Creating an instance for the environment
    env = ExampleEnvironment()

    # Defining the paramters for the simulation
    delta = theta * 10

    # Initializing the action values and the different possible actions
    rand_generator = np.random.RandomState(0)
    q_vals = rand_generator.uniform(0, 0.1, (4, 3))
    # q_vals = np.zeros((4, 3))
    actions = list(np.arange(3))

    while delta > theta:
        delta = 0
        for s in env.grid:
            for a in actions:
                cur_val = copy.copy(q_vals[s][a])
                sum_rhs = 0
                for next_s in env.grid:
                    sum_rhs += env.tran_matrix[s][next_s] * (
                        env.rewards[s][a] + discount * max(q_vals[next_s]) 
                    )
                q_vals[s][a] = sum_rhs
                delta = max(delta, abs(cur_val - q_vals[s][a]))

    return q_vals

In [7]:
q_vals = value_iteration_using_q(theta = 0.001, discount = 0.9)
print("Post Convergence of Value Iteration Algorithm")
print("Action Values: ", q_vals)
print("State Values: ", np.max(q_vals, axis = 1))

q_vals = value_iteration_using_q(theta = 0.001, discount = 0.8)
print("\nPost Convergence of Value Iteration Algorithm")
print("Action Values: ", q_vals)
print("State Values: ", np.max(q_vals, axis = 1))

q_vals = value_iteration_using_q(theta = 0.001, discount = 0.7)
print("\nPost Convergence of Value Iteration Algorithm")
print("Action Values: ", q_vals)
print("State Values: ", np.max(q_vals, axis = 1))

Post Convergence of Value Iteration Algorithm
Action Values:  [[16.61795239 16.61795239 18.61795239]
 [17.31252016 18.31252016 17.31270256]
 [20.0080671  20.00820908 19.00824102]
 [22.08096246 21.58096246 23.08096246]]
State Values:  [18.61795239 18.31252016 20.00820908 23.08096246]

Post Convergence of Value Iteration Algorithm
Action Values:  [[ 6.74775008  6.74775008  8.74775008]
 [ 7.12308557  8.12308557  7.12321984]
 [ 9.37314418  9.37324666  8.37326715]
 [11.1856504  10.6856504  12.1856504 ]]
State Values:  [ 8.74775008  8.12308557  9.37324666 12.1856504 ]

Post Convergence of Value Iteration Algorithm
Action Values:  [[3.71658373 3.71658373 5.71658373]
 [3.90385135 4.90385135 3.90397635]
 [5.84454004 5.84463445 4.84465097]
 [7.46830996 6.96830996 8.46830996]]
State Values:  [5.71658373 4.90385135 5.84463445 8.46830996]


In [8]:
optimal_q_vals = value_iteration_using_q(theta = 0.01, discount = 0.9)

# 3. Q Learning Agent

In [9]:
class QLearningAgent():
    def __init__(self, agent_info={}):
        # Defining the #actions and #states 
        self.num_actions = 3
        self.num_states = 4
        
        # Discount factor (gamma) to use in the updates.
        self.discount = agent_info.get("discount", 0.9)

        # The learning rate or step size parameter (alpha) to use in updates.
        self.step_size = agent_info.get("step_size", 0.1)

        # To control the exploration-exploitation trade-off
        self.epsilon = agent_info.get("epsilon", 0.1)
        
        # To determine if the Q-function is converged or not
        self.delta = agent_info.get("delta", 0.01)
        
        # Defining a random generator
        self.rand_generator = np.random.RandomState(agent_info.get("seed", 0))
        
        # Definining the Optimal Q-Values to which the algorithm should converge to
        self.optimal_q = agent_info.get("optimal_q", None)
        
        # Defining the initial action values
        # self.q = self.rand_generator.randn(self.num_states, self.num_actions)
        self.q = self.rand_generator.uniform(0, 0.1, (self.num_states, self.num_actions))
        
        # Initializing the variables for the previous state and action
        self.prev_state  = None
        self.prev_action = None
        
    def start(self, state, decrease_step_size = False):
        # Choose action using epsilon greedy.
        current_q = self.q[state][:]
        if self.rand_generator.rand() < self.epsilon:
            action = self.rand_generator.randint(self.num_actions)
        else:
            action = self.argmax(current_q)
            
        # Reducing the step-size
        if decrease_step_size:
            self.step_size *= 0.8
            
        self.prev_state = state
        self.prev_action = action
        return action
    
    def step(self, state, reward, decrease_step_size = False):
        # Choose action using epsilon greedy.
        current_q = self.q[state][:]
        if self.rand_generator.rand() < self.epsilon:
            action = self.rand_generator.randint(self.num_actions)
        else:
            action = self.argmax(current_q)
            
        # Reducing the step-size
        if decrease_step_size:
            self.step_size *= 0.8
        
        # Determining the new Q-Value
        new_val = -1e8
        cur_val = copy.copy(self.q[self.prev_state, self.prev_action])
        for act in range(self.num_actions):
            val = cur_val + self.step_size * (
                reward + self.discount * self.q[state, act] - cur_val
            )
            new_val = max(new_val, val)
        self.q[self.prev_state, self.prev_action] = new_val
            
        # Determining if the Q-function has converged or not
        if np.max(np.abs(self.optimal_q - self.q)) < self.delta:
            return (action, True)
        else:
            return (action, False)
            
    def argmax(self, q_values):
        top = float("-inf")
        ties = []

        for i in range(len(q_values)):
            if q_values[i] > top:
                top = q_values[i]
                ties = []

            if q_values[i] == top:
                ties.append(i)

        return self.rand_generator.choice(ties)

# 4. Running Experiments

In [10]:
def run_experiment(
        env_info = {}, agent_info = {}, max_iter = 1000, 
        re_init = 100, dec_step = 100, print_vals = True
    ):
    env = ExampleEnvironment(env_info) 
    agent = QLearningAgent(agent_info)
    has_converged = False
    num_iter = 0
    
    init_state  = env.start()                             # STARTING STATE
    init_action = agent.start(init_state)                 # STARTING ACTION
    next_state, next_reward = env.step(init_action)       # STARTING REWARD
    num_iter = 1
    
    while not has_converged and num_iter < max_iter:
        # After every `dec_step` steps, decrease the step-size
        if num_iter % dec_step == 0:
            decrease_step_size = True
        else:
            decrease_step_size = False
        
        # After every `re_init` steps, re-initialize with a random state
        if num_iter % re_init == 0:
            init_state  = env.start()                             
            init_action = agent.start(init_state, decrease_step_size)                 
            next_state, next_reward = env.step(init_action)
        else:
            next_action, has_converged = agent.step(next_state, next_reward, decrease_step_size)
            next_state, next_reward = env.step(next_action)
        
        if print_vals and num_iter % (max_iter / 5) == 0:
            print(f"Time Steps Elapsed | {num_iter}")
            print("Q-Values:", agent.q)
            print()
        
        num_iter += 1
        
    print("POST CONVERGENCE\n")
    print("Optimal Action Values:")
    print(agent.q)
    
    print("\nOptimal State Values:")
    print(np.max(agent.q, axis = -1))
    
    print("\nOptimal Policy:")
    print(np.argmax(agent.q, axis = -1))
    
    return agent.q

## 4.1.

In [11]:
# Defining the characteristics for the environment
env_info = {
    "seed": 0
}

# Defining the characteristics for the agent
agent_info = {
    "discount": 0.9,       
    "step_size": 0.5,
    "epsilon": 0.2,
    "delta": 1e-2,
    "optimal_q": optimal_q_vals,
    "seed": 0
}

max_iter = 200000
re_init = 1000
dec_step = 10000

# q_vals = run_experiment(
#     env_info, agent_info, max_iter = max_iter, re_init = re_init,
#     dec_step = dec_step
# )

Time Steps Elapsed | 40000 \
Q-Values: [[ 0.05488135 10.71070626  0.06027634]
 [11.4383117   0.04236548 12.61760995]
 [ 0.04375872 16.03734232  7.96197394]
 [ 0.03834415 10.4194207  14.9468654 ]]

Time Steps Elapsed | 80000 \
Q-Values: [[14.75741998 13.64244354 16.52596118]
 [13.30152604 16.23325656 16.39342828]
 [ 0.04375872 13.23629037 17.91907658]
 [ 0.03834415 10.4194207  18.12209091]]

Time Steps Elapsed | 120000 \
Q-Values: [[15.93430712 13.64244354 16.3076523 ]
 [13.30152604 16.17647881 15.78585586]
 [ 0.04375872 15.71974405 16.12896159]
 [ 0.03834415 15.6238155  15.47263081]]

Time Steps Elapsed | 160000 \
Q-Values: [[17.74973483 13.64244354 15.441694  ]
 [13.30152604 17.49562155 14.75383977]
 [ 0.04375872 17.50109765 16.46833595]
 [15.29431932 15.33640462 17.19304134]]

POST CONVERGENCE

Optimal Action Values: \
[[16.86760649 16.91116993 15.441694  ]
 [13.30152604 16.71770773 16.72325651]
 [16.95859037 16.50102899 16.46833595]
 [15.29431932 17.52810718 17.50259365]]

Optimal State Values: \
[16.91116993 16.72325651 16.95859037 17.52810718]

Optimal Policy: \
[1 2 0 1]

## 4.2.

In [12]:
# Defining the characteristics for the environment
env_info = {
    "seed": 0
}

# Defining the characteristics for the agent
agent_info = {
    "discount": 0.9,       
    "step_size": 0.5,
    "epsilon": 0.2,
    "delta": 1e-2,
    "optimal_q": optimal_q_vals,
    "seed": 0
}

max_iter = 200000
re_init = 1000
dec_step = 5000

# q_vals = run_experiment(
#     env_info, agent_info, max_iter = max_iter, re_init = re_init,
#     dec_step = dec_step
# )

Time Steps Elapsed | 40000 \
Q-Values: [[ 0.05488135 12.53721612  0.06027634]
 [11.42090221  0.04236548 12.2878522 ]
 [ 0.04375872 14.79477624  8.68752418]
 [ 0.03834415 10.5803314  14.29430343]]

Time Steps Elapsed | 80000 \
Q-Values: [[15.17714497 13.92537242 16.94036383]
 [14.2135258  16.2963234  16.23866067]
 [ 0.04375872 14.23049254 17.47847253]
 [ 0.03834415 10.5803314  17.63244245]]

Time Steps Elapsed | 120000 \
Q-Values: [[17.11828702 13.92537242 16.49655773]
 [14.2135258  16.60029523 16.51277718]
 [ 0.04375872 16.53074493 16.72385805]
 [ 0.03834415 16.48081953 16.47461074]]

Time Steps Elapsed | 160000 \
Q-Values: [[17.62497176 13.92537242 16.49655773]
 [14.2135258  17.69672259 16.51277718]
 [ 0.04375872 17.6962355  16.92261715]
 [15.20888001 17.09001532 17.54421258]]

POST CONVERGENCE

Optimal Action Values: \
[[17.73784069 14.31553089 16.49655773]
 [14.2135258  17.76716223 16.61942389]
 [ 2.68376526 17.81182415 16.92261715]
 [15.20888001 17.35728038 17.71094537]]

Optimal State Values: \
[17.73784069 17.76716223 17.81182415 17.71094537]

Optimal Policy: \
[0 1 1 2]

## 4.3.

In [13]:
# Defining the characteristics for the environment
env_info = {
    "seed": 0
}

# Defining the characteristics for the agent
agent_info = {
    "discount": 0.9,       
    "step_size": 0.5,
    "epsilon": 0.2,
    "delta": 1e-2,
    "optimal_q": optimal_q_vals,
    "seed": 0
}

max_iter = 2000000
re_init = max_iter + 1
dec_step = 5000

# q_vals = run_experiment(
#     env_info, agent_info, max_iter = max_iter, re_init = re_init,
#     dec_step = dec_step
# )

Time Steps Elapsed | 400000 \
Q-Values: [[0.05488135 1.05832501 0.06027634]
 [0.05448832 0.04236548 0.06458941]
 [0.04375872 0.0891773  0.09636628]
 [0.03834415 0.0791725  0.05288949]]

Time Steps Elapsed | 800000 \
Q-Values: [[0.05488135 1.05832418 0.06027634]
 [0.05448832 0.04236548 0.06458941]
 [0.04375872 0.0891773  0.09636628]
 [0.03834415 0.0791725  0.05288949]]

Time Steps Elapsed | 1200000 \
Q-Values: [[0.05488135 1.05832418 0.06027634]
 [0.05448832 0.04236548 0.06458941]
 [0.04375872 0.0891773  0.09636628]
 [0.03834415 0.0791725  0.05288949]]

Time Steps Elapsed | 1600000 \
Q-Values: [[0.05488135 1.05832418 0.06027634]
 [0.05448832 0.04236548 0.06458941]
 [0.04375872 0.0891773  0.09636628]
 [0.03834415 0.0791725  0.05288949]]

POST CONVERGENCE

Optimal Action Values: \
[[0.05488135 1.05832418 0.06027634]
 [0.05448832 0.04236548 0.06458941]
 [0.04375872 0.0891773  0.09636628]
 [0.03834415 0.0791725  0.05288949]]

Optimal State Values: \
[1.05832418 0.06458941 0.09636628 0.0791725 ]

Optimal Policy: \
[1 2 2 1]

## 4.4.

In [14]:
# Defining the characteristics for the environment
env_info = {
    "seed": 0
}

# Defining the characteristics for the agent
agent_info = {
    "discount": 0.9,       
    "step_size": 0.5,
    "epsilon": 0.2,
    "delta": 1e-2,
    "optimal_q": optimal_q_vals,
    "seed": 0
}

max_iter = 200000
re_init = max_iter / 10
dec_step = max_iter / 20

# q_vals = run_experiment(
#     env_info, agent_info, max_iter = max_iter, re_init = re_init,
#     dec_step = dec_step
# )

Time Steps Elapsed | 40000 \
Q-Values: [[0.05488135 1.57188752 0.06027634]
 [0.05448832 0.04236548 0.06458941]
 [0.04375872 0.0891773  0.09636628]
 [0.03834415 0.0791725  0.05288949]]

Time Steps Elapsed | 80000 \
Q-Values: [[0.05488135 2.44240942 0.06027634]
 [0.05448832 0.04236548 0.06458941]
 [0.04375872 0.0891773  0.09636628]
 [0.03834415 4.10542425 0.05288949]]

Time Steps Elapsed | 120000 \
Q-Values: [[0.05488135 4.25162326 0.06027634]
 [0.05448832 0.04236548 0.06458941]
 [0.04375872 0.0891773  3.18243108]
 [0.03834415 4.10542425 0.05288949]]

Time Steps Elapsed | 160000 \
Q-Values: [[0.05488135 4.25162326 0.06027634]
 [0.05448832 0.04236548 4.42800232]
 [0.04375872 0.0891773  3.18243108]
 [0.03834415 4.10542425 0.05288949]]

POST CONVERGENCE

Optimal Action Values: \
[[0.05488135 4.25162326 0.06027634]
 [0.05448832 0.04236548 4.42800232]
 [0.04375872 5.09049364 3.18243108]
 [0.03834415 4.10542425 0.05288949]]

Optimal State Values: \
[4.25162326 4.42800232 5.09049364 4.10542425]

Optimal Policy:
[1 2 1 1]