# A7 Reinforcement Learning

By typing my name, I confirm that the code, experiments, results, and discussions are all written by me, except for the code provided by the instructor.  

*Ryan Blocker*

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, clear_output

# <font color="red">**70 points**</font>: Code Requirements


In the following code cell, define the function

    def run_maze_experiment(n_steps, learning_rate, steps_between_goal_changes=0):
        ...
    
that combines all of the code from Lecture Notes [21 More Reinforcement Learning Fun](https://nbviewer.org/url/www.cs.colostate.edu/~anderson/cs345/notebooks/21%20More%20Reinforcement%20Learning%20Fun.ipynb).  You can simply copy the functions in Lecture Notes 21 and paste them into the body of your `run_maze_experiment` function.  No global variables are allowed; all variables and functions defined globally in Lecture Notes 21 must be defined locally inside `run_maze_experiment`, including the goal of `[6, 6]`.  Assign a default value of `0` for the third argument, `steps_between_goal_changes`. 

Test your function by running it as

    run_maze_experiment(100_000, 0.2)
    
You should see results very similar to the results shown in Lecture Notes 21. 

Do not include these results in your submitted notebook.
 

In [None]:
def run_maze_experiment(n_steps, learning_rate, steps_between_goal_changes=0):
    
    figure = plt.figure(figsize=(10, 12)) #figure size
    size = 16 #size of maze
    
    n = size - 1
    walls = [[0, (0, n)],  # bottom wall
        [n, (0, n)],       # top wall
        [(0, n), 0],       # left wall
        [(0, n), n],       # right wall

        [(3, 9), 9],       # box right wall
        [(3, 9), 2],       # box left wall
        [9, (4, 8)],       # box top wall
        [3, (3, 8)],       # box bottom wall
        [(4, 12), 12]]     # additional vertical wall

    goal = np.array([6, 6]) #place goal inside the box
    
    actions = [(1, 0),  (-1, 0), (0, -1), (0, 1)] #directional movement actions
    
    def hit_walls(position, walls):
        r, c = position    # r is position row, c is position column
        for wall in walls:
            if isinstance(wall[0], int):
                # horizontal
                row = wall[0]
                cols = wall[1]
                if r == row and cols[0] <= c <= cols[1]:
                    return True
            else:
                # vertical wall
                rows = wall[0]
                col = wall[1]
                if c == col and rows[0] <= r <= rows[1]:
                    return True
        return False

    print(hit_walls((1, 1), walls))
    for p in [(1, 1), (5, 5), (0, 0), (9, 5), (8, 5)]:
        print('Position', p, 'hit_walls=', hit_walls(p, walls))
        
    def take_action(position, actioni):
        action = actions[actioni]
        next_position = [position[0] + action[0],
                        position[1] + action[1]]
        return np.clip(next_position, 0, size - 1)

    p = [3, 2]  # starting position for the take_action demonstration
    for actioni in range(4):
        print('Position', p, end='')
        print(' action', actioni, actions[actioni], end='')
        p = take_action(p, actioni)
        print(' takes us to position', p)
                
    def pick_random_position():
        # Sample interior cells until one is found that is not on a wall.
        while True:
            position = np.random.randint(1, size - 2, 2)
            if not hit_walls(position, walls):
                break
        return position

    for i in range(10):
        print(pick_random_position())
        
    # Q[row, col, goal_row, goal_col, action] estimates the steps needed to reach the goal.
    Q = np.zeros((size, size, size, size, 4))
    
    def pick_action(Q, position, goal):
        row, col = position
        goal_row, goal_col = goal
        Qs = Q[row, col, goal_row, goal_col, :]
        return np.argmin(Qs)  # greedy: fewest estimated steps to this goal

    actioni = pick_action(Q, [1, 1], goal)
    print('Picked action', actioni, actions[actioni])
    
    position = pick_random_position()
    actioni = pick_action(Q, position, goal)

    n_goals = 0
    steps_to_goal = []
    last_path = []
    starting_step = 0
    goal_found = False

    for step in range(n_steps):

        # NEW CODE for multiple goals
        # The > 0 guard avoids a modulo-by-zero when the goal is fixed.
        if steps_between_goal_changes > 0 and step % steps_between_goal_changes == 0:
            goal = pick_random_position()
            position = pick_random_position()
            actioni = pick_action(Q, position, goal)
            last_path = [position]
            
        if goal_found:
            last_path = []
            goal_found = False
        
        next_position = take_action(position, actioni)
        last_path.append(next_position)
        
        if hit_walls(next_position, walls):
            row, col = position
            Q[row, col, goal[0], goal[1], actioni] = 500  # Make Q so high this action never selected again.
            last_path.append(position)
            actioni = pick_action(Q, position, goal)
            
        elif np.all(next_position == goal):
            # Found goal
            goal_found = True
            n_goals += 1
            r = 1
            row, col = position
            Q[row, col, goal[0], goal[1], actioni] = r  # No future. Just found the goal.
            # Start at new random position
            position = pick_random_position()
            actioni = pick_action(Q, position, goal)
            steps_to_goal.append(step - starting_step)
            starting_step = step

        else:
            # Take one step to get next_Q at next position to make TD error
            r = 1
            next_actioni = pick_action(Q, next_position, goal)
            Q_value = Q[position[0], position[1], goal[0], goal[1], actioni]
            next_Q_value = Q[next_position[0], next_position[1], goal[0], goal[1], next_actioni]
            TD_error = r + next_Q_value - Q_value
            Q[position[0], position[1], goal[0], goal[1], actioni] += learning_rate * TD_error

            position = next_position.copy()
            actioni = next_actioni
            

        if goal_found and (n_goals < 100 or n_goals % 100 == 0):

            figure.clf()
            
            # Draw Q function for the current goal as image with walls
            image = np.min(Q[:, :, goal[0], goal[1], :], axis=-1)
            imagemax = np.max(image)
            vmax = imagemax if imagemax > 5 else 5
            plt.figure(figure.number)  # draw on this experiment's figure (already cleared above)
            plt.subplot(2, 1, 1)
            plt.imshow(image, origin='lower', cmap='binary',
                    interpolation='nearest')
            plt.colorbar()
            
            # Draw walls
            for row in range(size):
                for col in range(size):
                    if hit_walls([row, col], walls):
                        plt.plot(col, row, 'rs', ms=10)

            # Draw last path
            last_path_array = np.array(last_path)
            plt.plot(last_path_array[:, 1], last_path_array[:, 0], 'o-')

            # Draw goal
            plt.plot(goal[1], goal[0], 'mD', ms=10)

            # Plot steps to goal for each path tried
            plt.subplot(2, 1, 2)
            plt.plot(steps_to_goal)
            plt.xlabel('Goals Found')
            plt.ylabel('Steps to Goal')
            
            clear_output(wait=True)
            display(figure);


    clear_output(wait=True)
    return Q  # Return Q table!

After you see that your code runs correctly, make changes to your `run_maze_experiment` to change the goal while training.  Required changes will include at least the following steps:

* Change the dimensionality of the `Q` table to include dimensions for the goal's row and column.
* The first code immediately following the start of the `step` loop must look like this:

```
    for step in range(n_steps):

        # NEW CODE for multiple goals
        if steps_between_goal_changes > 0 and step % steps_between_goal_changes == 0:
            goal = pick_random_position()
            position = pick_random_position()
            actioni = pick_action(Q, position, goal)
            last_path = [position]
```
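
One way to read the first bullet is to index the table as `Q[row, col, goal_row, goal_col, action]`. The following is a minimal sketch under that assumption; the axis order and the stand-alone `pick_action` shown here are illustrative choices, not something the assignment fixes.

```python
import numpy as np

size = 16       # maze is size x size, as in run_maze_experiment
n_actions = 4   # the four directional moves

# Assumed layout: agent row, agent col, goal row, goal col, action.
Q = np.zeros((size, size, size, size, n_actions))

def pick_action(Q, position, goal):
    """Greedy action for this position and goal: smallest estimated steps-to-goal."""
    row, col = position
    goal_row, goal_col = goal
    return int(np.argmin(Q[row, col, goal_row, goal_col, :]))

# Pretend that, for goal (6, 6), action 2 from cell (1, 1) has the lowest estimate.
Q[1, 1, 6, 6, :] = [3.0, 4.0, 1.0, 5.0]
best_for_66 = pick_action(Q, (1, 1), (6, 6))

# A different goal reads its own, still untrained, slice of the table.
best_for_23 = pick_action(Q, (1, 1), (2, 3))
```

Because each goal owns a separate slice, what is learned for one goal does not overwrite what was learned for another.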


# <font color="red">**20 points**</font>: Show Results


After debugging your function, test that it still runs correctly with `steps_between_goal_changes=0`, which should produce results similar to what is shown here.

In [None]:
Q = run_maze_experiment(100_000, 0.2)

Now run it with the following arguments to test your changes.  You should see displays like the ones above, but for varying goal positions.

In [None]:
Q = run_maze_experiment(1_000_000, 0.2, steps_between_goal_changes=100)

# <font color="red">**20 points**</font>: Visualize Result of Reinforcement Learning

Your function returns the trained `Q` table.  Let's investigate what has been learned. A well-trained `Q` table should have values close to zero near the goal, wherever it is placed!  So, let's display images of the trained `Q` table for different goals.
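
The expectation of small values near the goal follows from the training update. As a sketch, assuming the update used in `run_maze_experiment` (reinforcement $r = 1$ on every step, no discounting) and writing the learning rate as $\rho$, a symbol chosen here for illustration, each step that neither hits a wall nor reaches the goal applies

$$
Q(s, g, a) \leftarrow Q(s, g, a) + \rho \, \bigl[\, r + Q(s', g, a') - Q(s, g, a) \,\bigr]
$$

where $s$ is the agent's position, $g$ the goal, $a$ the action taken, $s'$ the next position, and $a'$ the action picked there. With $r = 1$ per step, $Q(s, g, a)$ tends toward the expected number of steps from $s$ to $g$, so its minimum over actions is small beside the goal and grows with distance from it.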

Complete the following code cell.  A list of nine goals is provided, where each goal is specified by its row and column. Make a figure of three columns and three rows of images (using `plt.subplot`, `plt.imshow`, and `plt.colorbar` as shown in Lecture Notes 21) of the `Q` table values for these nine goal positions.

Do not include the walls in the displays, but do plot the goal position with a magenta diamond.

In [None]:
goals = [[12, 3], [12, 7], [12, 13],
         [7, 3], [7, 7], [7, 13],
         [1, 3], [1, 7], [1, 13]]

plt.figure(figsize=(12, 12))


# ....
# ....


# <font color="red">**10 points**</font>: Discussion

Discuss these questions in the following markdown cell.

What do your images of the `Q` table values show?  Do you think your reinforcement learning agent has learned to solve this maze problem with varying goal?

Why are the walls shown as white patches?

...type your answers here...

#  Extra Credit 1

Design a larger maze of size 32 or larger.  Create different walls that define an "interesting" maze for our reinforcement learning bug.  Do this in a new version of your function named `run_big_maze_experiment` and run it with `steps_between_goal_changes` having a value greater than 0.
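
As a starting point, a larger maze can reuse the wall format from `run_maze_experiment`. The sketch below is one hypothetical 32 x 32 layout, invented here for illustration: three vertical dividers with alternating gaps force a winding route. The name `big_walls` is likewise only a placeholder.

```python
size = 32
n = size - 1

# Same format as the 16 x 16 maze: [row, (col_start, col_end)] is a horizontal
# wall and [(row_start, row_end), col] is a vertical wall. Hypothetical layout.
big_walls = [[0, (0, n)],       # bottom wall
             [n, (0, n)],       # top wall
             [(0, n), 0],       # left wall
             [(0, n), n],       # right wall
             [(1, 24), 8],      # divider, open in rows 25..30
             [(8, 30), 16],     # divider, open in rows 1..7
             [(1, 24), 24],     # divider, open in rows 25..30
             [12, (17, 21)]]    # short horizontal spur between two dividers

def hit_walls(position, walls):
    """Return True if position (row, col) lies on any wall segment."""
    r, c = position
    for first, second in walls:
        if isinstance(first, int):
            # Horizontal wall: fixed row, range of columns.
            if r == first and second[0] <= c <= second[1]:
                return True
        elif c == second and first[0] <= r <= first[1]:
            # Vertical wall: range of rows, fixed column.
            return True
    return False

# Count wall cells as a quick sanity check on the layout.
n_wall_cells = sum(hit_walls((r, c), big_walls)
                   for r in range(size) for c in range(size))
```

A goal-conditioned table for this size has 32 x 32 x 32 x 32 x 4 entries, so training will likely need many more steps than the 16 x 16 maze.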

#  Extra Credit 2

Write code that uses your trained `Q` function to demonstrate your trained reinforcement learning bug following a goal that you place interactively by capturing `matplotlib` mouse click events on the display of the maze.
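
The click-capturing part could be sketched as follows, using matplotlib's `mpl_connect` with the `'button_press_event'` event. The helper names `event_to_goal` and `on_click` and the list `clicked_goals` are hypothetical, and moving the agent toward the clicked goal is deliberately left out.

```python
import numpy as np
import matplotlib.pyplot as plt

size = 16
clicked_goals = []   # goals chosen so far, as [row, col] lists

def event_to_goal(event, size):
    """Convert a mouse event to a [row, col] cell, or None if outside the axes."""
    if event.inaxes is None or event.xdata is None or event.ydata is None:
        return None
    # imshow centers cell (row, col) at y=row, x=col, so round to the nearest cell.
    row = int(np.clip(round(event.ydata), 0, size - 1))
    col = int(np.clip(round(event.xdata), 0, size - 1))
    return [row, col]

def on_click(event):
    goal = event_to_goal(event, size)
    if goal is not None:
        clicked_goals.append(goal)
        # A full demonstration would now step the agent with the trained Q
        # table toward this goal and redraw its path; omitted in this sketch.

fig, ax = plt.subplots()
ax.imshow(np.zeros((size, size)), origin='lower', cmap='binary')
connection_id = fig.canvas.mpl_connect('button_press_event', on_click)
```

In a notebook, click events only arrive with an interactive backend (for example `%matplotlib widget`, which requires the `ipympl` package); the default inline backend produces static images.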

#  Extra Credit 3

This will take a considerable amount of effort.  Attempt this only if you are not already overwhelmed with requirements in other courses.

Create a new version of your main function, called `run_neural_net_maze_experiment` as follows.

Replace the `Q` table with your neural network from A6.  Give your neural network five inputs, being the agent's row and column, the goal's row and column, and the action. Your neural network will have just one output, the Q value for that position, goal, and action.  Train it with `X` being a sample of these five values, and a target `T` of `r + Qnew`.
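
The construction of one training sample described above might look like the sketch below. The A6 network's interface is not shown in this notebook, so `q_stand_in` is a hypothetical placeholder that always predicts zero; in a real version it would be replaced by the network's prediction call.

```python
import numpy as np

def q_stand_in(X):
    """Hypothetical placeholder for the A6 network: one Q value per row of X."""
    return np.zeros((X.shape[0], 1))

def make_sample(position, goal, actioni, r, next_position, next_actioni):
    """Build one training pair (X, T) from a single step in the maze."""
    # Five inputs: agent row, agent col, goal row, goal col, action index.
    X = np.array([[position[0], position[1], goal[0], goal[1], actioni]],
                 dtype=float)
    X_next = np.array([[next_position[0], next_position[1],
                        goal[0], goal[1], next_actioni]], dtype=float)
    Q_new = q_stand_in(X_next)   # Q value at the next position and action
    T = r + Q_new                # target for the network's single output
    return X, T

# One step to the right from (1, 1) while seeking the goal at (6, 6).
X, T = make_sample(position=[1, 1], goal=[6, 6], actioni=3, r=1,
                   next_position=[1, 2], next_actioni=3)
```

Feeding the action in as a raw index is the simplest choice; a one-hot encoding of the action is a common alternative, though it would change the input count from the five specified here.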

