I decided to write this code to check how exactly an agent behaves when the rewards for transitions in a maze are equal to $0$ and the reward for escaping the maze is $+1$.

This is to answer the following question from Sutton and Barto's book:

Exercise 3.7 Imagine that you are designing a robot to run a maze. You decide to give it a
reward of +1 for escaping from the maze and a reward of zero at all other times. The task
seems to break down naturally into episodes—the successive runs through the maze—so
you decide to treat it as an episodic task, where the goal is to maximize expected total
reward (3.7). After running the learning agent for a while, you find that it is showing
no improvement in escaping from the maze. What is going wrong? Have you effectively
communicated to the agent what you want it to achieve?

First let's create a matrix which will represent our maze. The 2 corners at the top left and bottom right are the exits of the maze. If the agent steps into the exit, then the episode terminates.

In [None]:
import numpy as np

def maze(s):
  maze = np.zeros((s, s))
  maze[0,0], maze[-1,-1] = 1, 1
  return maze

Here is what a $4 \times 4$ maze looks like:

In [110]:
print(maze(4))

[[1. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 1.]]


Now let's calculate the state values for the uniform policy. Note that if you're on the edge and you move into the wall, you stay in the same state.

Under the uniform policy each of the four actions is taken with probability $\frac{1}{4}$. We calculate the state values under the uniform policy because it's interesting to see what the estimates look like with such rewards. The algorithm for evaluating the state values is:

First, initialise $v_0=0$, then iterate $$\forall s: v_{k+1}(s)\leftarrow \mathbb{E}_\pi\left[R_{t+1}+\gamma v_k(S_{t+1}) \mid S_t=s\right]$$
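
To make the update concrete, here is a small hand-computed backup for cell $(0, 1)$ of the $4 \times 4$ maze with $\gamma = 0.9$. It is only a sketch and assumes the convention used in the code of this notebook: transition rewards are $0$ and the $+1$ for escaping is stored as the value of the exit cell, so it enters through the $\gamma v_k(S_{t+1})$ term. The names `gamma`, `successors` and `v_new` are made up for this illustration.

```python
gamma = 0.9

# Values of the four successor states of cell (0, 1) before the first sweep,
# in the order left, up, right, down:
#   left  -> exit cell (0, 0), value 1
#   up    -> wall, so the agent stays in (0, 1), value 0
#   right -> cell (0, 2), value 0
#   down  -> cell (1, 1), value 0
successors = [1.0, 0.0, 0.0, 0.0]

# Uniform policy: every action has probability 1/4
v_new = gamma * sum(0.25 * v for v in successors)
print(v_new)  # approximately 0.225
```

After one sweep the cell next to the exit already has a positive value, while cells further away need more sweeps before the exit's value reaches them.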



In [111]:
def findCells(index, grid):
  """
  Get the positions of the cells reached from a specific cell by moving left, up, right and down.
  A move into a wall leaves the agent in the same cell.
  input:
      index: array of 2 entries i and j which indicate the cell's position. i is the row and j is the column
      grid: 2D array representing the maze
  output:
      cells: 2 arrays. The first array has the row values of the neighbouring cells and the second array the column values.
             (None, None) is returned for the terminal states (the two exits).
  Example:
   input: [1,1]
   output: [1, 0, 1, 2], [0, 1, 2, 1]
  """
  i, j = index[0], index[1]
  last_row, last_col = grid.shape[0] - 1, grid.shape[1] - 1
  # Terminal state (top-left or bottom-right exit).
  # Positions are compared by index, not by the values stored in the grid.
  if (i == 0 and j == 0) or (i == last_row and j == last_col):
    return None, None
  # Order: left, up, right, down. Clamping to the grid bounds keeps the agent
  # in place when it moves into a wall.
  rows = [i, max(i-1, 0), i, min(i+1, last_row)]
  cols = [max(j-1, 0), j, min(j+1, last_col), j]
  return rows, cols
  
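def robot_experiment(iterations, discount, maze_size):
  """
  Iterative policy evaluation for the uniform policy on a maze_size x maze_size maze.
  Prints the state values after the final iteration.
  """
  grid = maze(maze_size)
  # gridCollection[k] holds the state values after k sweeps
  gridCollection = [maze(maze_size)]
  # Uniform policy: each of the 4 moves has probability 1/4
  probs = np.array([0.25, 0.25, 0.25, 0.25])
  for k in range(iterations):
    last_grid = gridCollection[k]
    for row in range(grid.shape[0]):
      for col in range(grid.shape[1]):
        rowID, colID = findCells([row, col], grid)
        # Terminal cells (the exits) keep their value of 1
        if rowID is None or colID is None:
          continue
        # Expected discounted value of the next state. The +1 for escaping is
        # stored as the value of the exit cell, so it is discounted once.
        grid[row, col] = discount*last_grid[rowID, colID] @ probs
    gridCollection.append(grid.copy())
    # Report only once, after the final sweep
    if k + 1 == iterations:
      print(f"Iteration: {k+1}, grid:")
      print(np.round(grid, 4))
      print("\n")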

I am now going to see what the state values converge to. I will try values with a discount $\gamma \in \{0.1,0.9,1\}$. I will do 200 iterations as it's enough for convergence.

In [109]:
robot_experiment(200, 0.1, 4)
robot_experiment(200, 0.9, 4)
robot_experiment(200, 1, 4)

Iteration: 200, grid:
[[1.00e+00 2.57e-02 7.00e-04 0.00e+00]
 [2.57e-02 1.30e-03 1.00e-04 7.00e-04]
 [7.00e-04 1.00e-04 1.30e-03 2.57e-02]
 [0.00e+00 7.00e-04 2.57e-02 1.00e+00]]


Iteration: 200, grid:
[[1.     0.4722 0.2872 0.2349]
 [0.4722 0.3394 0.2819 0.2872]
 [0.2872 0.2819 0.3394 0.4722]
 [0.2349 0.2872 0.4722 1.    ]]


Iteration: 200, grid:
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]




We can see that with a discount ($\gamma < 1$), states closer to an exit get higher values, whereas without a discount every state ends up with the same value. The closer a cell is to an exit, the sooner the agent can collect the reward of $+1$, so that reward is discounted less than for cells that are further away. In the scenario with no discount, all state values converge to $1$. This makes sense: under the uniform policy the robot, whichever cell (state) it starts in, eventually reaches an exit and gets a reward of $+1$, whether that happens after $2$ steps or after $1000$ steps.
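
As a cross-check on the iterated values, the fixed point can also be obtained by solving the linear system directly instead of sweeping. The sketch below is not part of the original experiment; it assumes the same conventions as the code above (exits pinned at value $1$, moves into a wall leave the agent in place, uniform policy), and `solve_values` is a hypothetical helper name.

```python
import numpy as np

def solve_values(size, discount):
    """Solve v(s) = discount * 1/4 * sum of successor values, with both exits held at 1."""
    n = size * size
    exits = {0, n - 1}  # flattened indices of the top-left and bottom-right cells
    A = np.eye(n)
    b = np.zeros(n)
    for s in range(n):
        if s in exits:
            b[s] = 1.0  # row reads v(s) = 1
            continue
        i, j = divmod(s, size)
        # left, up, right, down; clamping keeps the agent in place at a wall
        moves = [(i, max(j - 1, 0)), (max(i - 1, 0), j),
                 (i, min(j + 1, size - 1)), (min(i + 1, size - 1), j)]
        for r, c in moves:
            A[s, r * size + c] -= discount * 0.25
    return np.linalg.solve(A, b).reshape(size, size)

print(np.round(solve_values(4, 0.9), 4))
print(np.round(solve_values(4, 1.0), 4))
```

For $\gamma = 0.9$ this should agree with the printed grid to the displayed precision, and for $\gamma = 1$ every entry should come out as $1$.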

A useful insight I gained from this experiment is that with a discount factor, the state values converge to values that make it clear which path to take: a greedy policy with respect to these values knows immediately which action to take in each state. In the scenario without a discount all state values are equal, so a greedy policy has nothing to distinguish a step towards the exit from a step away from it. Hence, the discount serves as a "punishment" for not escaping the maze quickly.
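
The difference can be seen by reading greedy actions off the value grids. The following is a minimal sketch: `values` is copied from the $\gamma = 0.9$ output above, while `MOVES` and `greedy_actions` are hypothetical names introduced here. Since all transition rewards are $0$, picking the successor with the highest value is used as the greedy choice.

```python
import numpy as np

# State values copied from the gamma = 0.9 output above
values = np.array([[1.0,    0.4722, 0.2872, 0.2349],
                   [0.4722, 0.3394, 0.2819, 0.2872],
                   [0.2872, 0.2819, 0.3394, 0.4722],
                   [0.2349, 0.2872, 0.4722, 1.0]])

MOVES = {"left": (0, -1), "up": (-1, 0), "right": (0, 1), "down": (1, 0)}

def greedy_actions(values, row, col):
    """Return every action whose successor state has the highest value."""
    size = values.shape[0]
    successor_values = {}
    for name, (dr, dc) in MOVES.items():
        # Clamp to the grid: a move into a wall leaves the agent in place
        r = min(max(row + dr, 0), size - 1)
        c = min(max(col + dc, 0), size - 1)
        successor_values[name] = values[r, c]
    best = max(successor_values.values())
    return [name for name, v in successor_values.items() if np.isclose(v, best)]

print(greedy_actions(values, 0, 1))            # ['left']
print(greedy_actions(values, 1, 1))            # ['left', 'up']
print(greedy_actions(np.ones((4, 4)), 0, 1))   # ['left', 'up', 'right', 'down']
```

With the discounted values the greedy choice points towards the nearest exit, whereas with the undiscounted values (all ones) every action ties.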

An alternative way to encourage the agent to escape the maze as quickly as possible is to give a reward of $-1$ for every transition. This way the robot maximizes its return by escaping the maze in as few steps as possible. Future work: write an algorithm which calculates $V^*$ and $\pi^*$.
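
As a quick look at the $-1$ per transition idea, here is a sketch of the same uniform-policy evaluation with that reward and no discount. It is an assumption-laden illustration rather than the experiment above: the exits are given value $0$, the sweep count of 1000 is an arbitrary choice, and `evaluate_step_penalty` is a hypothetical name.

```python
import numpy as np

def evaluate_step_penalty(size, sweeps=1000):
    """Uniform-policy evaluation with reward -1 per transition and gamma = 1."""
    v = np.zeros((size, size))
    for _ in range(sweeps):
        # Edge padding repeats the border values, so a move into a wall
        # looks up the value of the cell the agent is already in
        padded = np.pad(v, 1, mode="edge")
        neighbours = (padded[1:-1, :-2]     # left
                      + padded[:-2, 1:-1]   # up
                      + padded[1:-1, 2:]    # right
                      + padded[2:, 1:-1])   # down
        v = -1.0 + 0.25 * neighbours
        v[0, 0] = v[-1, -1] = 0.0           # exits are terminal, value 0
    return v

print(np.round(evaluate_step_penalty(4), 2))
```

Even without a discount, the values now differ between cells: each one is minus the expected number of steps to an exit under the uniform policy, so cells near an exit are less negative and a greedy policy has a direction to follow.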