# Introduction to Reinforcement Learning with NVIDIA Jetson TX2

In this session, you will use a branch of machine learning called **reinforcement learning** to teach a robot to play a game.

Feel free to use the links below to navigate the notebook:
- Step 1: [The Game](#thegame)
- Step 2: [Motivating Reinforcement Learning](#motivatingrl)
- Step 3: [Investigate Random Behavior](#random)
- Step 4: [States, Actions, and Optimal Policies](#policies)
- Step 5: [Monte Carlo with Exploring Starts](#mces)

<a id='thegame'></a>
## The Game

The setup has four LEDs.  We enumerate the LEDs starting with zero, so that the yellow LED is at position `0`, the red LED is at position `1`, and so on.  Each LED is connected to a button that is used to turn it off.  

![LEDs](images/LEDs.png)

The game round begins when **one** of the four LEDs is turned on, and the robotic arm starts at a position hovering over one of the four buttons.  The score always starts at zero.

At each point in the game, the robot has three potential movements (or **actions**) at its disposal.  It can:
- `0` - move **L**eft one position,
- `1` - stay at the current position and **P**ush, or
- `2` - move **R**ight one position.

After the robot chooses an action, one point is deducted from its score.  While most actions have intuitive effects, there are a few special cases that are worth mentioning:
- The lit LED is turned off when the robot pushes its corresponding button, but pushing buttons connected to unlit LEDs has no effect (i.e., they are never turned on by the robot).  
- If the robotic arm is hovering over the leftmost button at location `0` and decides to move left, we imagine that the arm hits an imaginary wall, and the arm stays where it is.  Likewise, if the arm is hovering over the button at position `3` and decides to move right, then at the next point in the game, the arm will have maintained its position at location `3` (the rightmost location in the line).

The round ends either (1) when the robot pushes the correct button to turn off the lit LED, or (2) when the robot has played **15** moves in the round, _whichever happens first_.  Your goal in this notebook is to implement an algorithm that learns, from gameplay alone, the strategy that attains the highest (or least negative) score in each round.  To accomplish this, your robot will need to learn how to turn off the lit LED in as few moves as possible.  
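
The rules above can be sketched in a few lines of Python.  This is only an illustrative stand-in, not the code in **jetson_env.py**: the names `step`, `play`, `N_LEDS`, and `MAX_MOVES` are assumptions made for this sketch.

```python
N_LEDS = 4        # positions 0..3
MAX_MOVES = 15    # a round is cut off after this many moves

def step(led, arm, action):
    """Apply one action and return (new_arm, points, solved)."""
    if action == 0:                          # L: move left, blocked by the wall at 0
        arm = max(arm - 1, 0)
    elif action == 2:                        # R: move right, blocked by the wall at 3
        arm = min(arm + 1, N_LEDS - 1)
    solved = (action == 1 and arm == led)    # P: only the lit LED's button matters
    return arm, -1, solved                   # every action costs one point

def play(led, arm, actions):
    """Play a fixed sequence of actions and return the final score."""
    score = 0
    for n_moves, action in enumerate(actions, start=1):
        arm, points, solved = step(led, arm, action)
        score += points
        if solved or n_moves == MAX_MOVES:   # whichever happens first
            break
    return score

# LED 0 is lit and the arm starts at position 1; the robot plays P, L, P
print(play(led=0, arm=1, actions=[1, 0, 1]))   # -3
```

The first push is wasted (the arm is over an unlit LED), so the round costs three points instead of the best possible two.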

<a id='motivatingrl'></a>
## Motivating Reinforcement Learning

You may have noticed that the winning game strategy is very straightforward to hard code.  With this in mind, if you are new to reinforcement learning (RL), you might wonder why an RL technique would not be overkill.

And the fact is that in the field of RL, it is very common to use simple games with well-defined rules **to build intuition** for how to design algorithms to best accomplish more complex tasks.  

In the RL setting, we assume that the robot does not have the domain knowledge to know what "left", "push", or "right" means.  The robot only knows that it has three possible actions, and it needs to figure out how to select from these actions to consistently attain the highest possible score.  

You can think of the robot as a computer player with full access to a keyboard that only contains three keys, where you further assume the player does not know any of the game controls.  Despite this missing information, the player must nonetheless learn how to beat the game.  

And, similar to how you can imagine the human player would learn, the robot will closely watch its score in the game to gauge how well it is doing.  If it is not performing well, it will amend its strategy to do better.  

The algorithm that you will use to determine the best gameplay strategy is remarkably intuitive.  The robot will learn primarily through trial-and-error, while using its score as a feedback mechanism to hone its strategy.  Its initial behavior will be incredibly random, as it tries out many different moves (or actions) in the game, to see how its score is affected.  The main idea behind the algorithm is to train the robot to leverage its gameplay experience to gradually obtain a well-informed strategy that attains a relatively high score.  

For now, in the next section, we will examine how the robot performs, if it chooses random actions at each time step.

<a id='random'></a>
## Investigate Random Behavior

We have written a simple simulator that you can use to see how the robot performs if it selects random actions for the entirety of the game.  Of course, your robot will learn to perform much better!

Run the code cell below to have the robot play the game for 2 separate rounds.  When parsing the output, remember that the starting conditions of the game are random!

Each round has corresponding output that looks somewhat like the snippet below:
```
Starting Round 1 ...
 LED: 0    | Arm: 1 | Score:    0 
 Action: P | Arm: 1 | Score:   -1 
 Action: L | Arm: 0 | Score:   -2 
 Action: P | Arm: 0 | Score:   -3 
FINAL SCORE:  -3
-----------------------------------
```

In the sample snippet above:
- When the game round was initiated, the lit LED was at position `0`, and the arm was at position `1`.  The game score at the beginning of a round is always zero. 
- The robot's first choice of action was to **P**ush in the current location; as a result, one point was deducted from its score, and the LEDs were unaffected (i.e., the LED at position `0` remained lit, while all of the other LEDs stayed unlit).  
- The robot's next choice was to move **L**eft, so it moved to position `0` and another point was deducted.
- Then, the robot decided to **P**ush in location `0`, and the game score decreased to `-3`.  At this point, the game ended, because the final action turned off the lit LED. 
- In this case, the final score received at the end of the game was `-1` + `-1` + `-1` = `-3`.

Take the time now to understand the `JetsonEnv` class in **jetson_env.py**.  Note that the simulation encodes each of the possible actions as an integer (one of `0`, `1`, or `2`), and to get the corresponding more interpretable action label (`L`, `P`, or `R`), we use the `decipher_action` function below. 

Later in this notebook, you will use the `JetsonEnv` class to simulate games to teach your robot!

In [1]:
from jetson_env import JetsonEnv

# use a Python dictionary to decode the actions
action_dict = {0: 'L', 1: 'P', 2: 'R'}
def decipher_action(a):
    return action_dict[a]

# create a new environment
env = JetsonEnv()

# interact with the environment
for i_round in range(1, 3):
    print('Starting Round %d ...' % i_round)
    led, arm = env.reset() # reset the lit LED and arm position
    score = 0              # reset the score
    print(' LED: %d    | Arm: %d | Score: %4d ' % (led, arm, score))
    while True:
        action = env.get_random_action()      # select a random move (action)
        arm, points, done = env.step(action)  # perform the action, get new arm position and points
        score += points                       # update score
        print(' Action: %s | Arm: %d | Score: %4d ' % (decipher_action(action), arm, score))
        if done:
            print('FINAL SCORE: ', score)
            print('-'*35)
            break

Starting Round 1 ...
 LED: 0    | Arm: 2 | Score:    0 
 Action: L | Arm: 1 | Score:   -1 
 Action: P | Arm: 1 | Score:   -2 
 Action: P | Arm: 1 | Score:   -3 
 Action: L | Arm: 0 | Score:   -4 
 Action: L | Arm: 0 | Score:   -5 
 Action: P | Arm: 0 | Score:   -6 
FINAL SCORE:  -6
-----------------------------------
Starting Round 2 ...
 LED: 3    | Arm: 3 | Score:    0 
 Action: L | Arm: 2 | Score:   -1 
 Action: P | Arm: 2 | Score:   -2 
 Action: L | Arm: 1 | Score:   -3 
 Action: P | Arm: 1 | Score:   -4 
 Action: L | Arm: 0 | Score:   -5 
 Action: L | Arm: 0 | Score:   -6 
 Action: L | Arm: 0 | Score:   -7 
 Action: R | Arm: 1 | Score:   -8 
 Action: L | Arm: 0 | Score:   -9 
 Action: R | Arm: 1 | Score:  -10 
 Action: P | Arm: 1 | Score:  -11 
 Action: P | Arm: 1 | Score:  -12 
 Action: L | Arm: 0 | Score:  -13 
 Action: R | Arm: 1 | Score:  -14 
 Action: L | Arm: 0 | Score:  -15 
FINAL SCORE:  -15
-----------------------------------


<a id='policies'></a>
## States, Actions, and Optimal Policies

As discovered above, there are **three possible actions**, corresponding to:
- `0` - moving **L**eft, 
- `1` - staying and **P**ushing in the current position, and
- `2` - moving **R**ight.

The **total number of possible game states is $4^2 = 16$**, where there is a state for each possible combination of arm position and lit LED position. 

To avoid having to deal with two different numbers when referencing the state, we define the `get_state` function below that maps each possible combination of arm position (`arm`) and lit LED position (`led`) to an integer from `0` to `15`, which we refer to as the corresponding state (`state`).  

In your upcoming implementation, the state should always be encoded as a number from `0` to `15`, but you can get the corresponding arm position (`arm`) and lit LED position (`led`) by passing the state (`state`) into the `get_led_and_arm` function.

In [2]:
def get_led_and_arm(state, nLED=4):
    arm = state % nLED
    led = state // nLED   # integer division recovers the LED index
    return led, arm

def get_state(led, arm, nLED=4):
    state = led*nLED + arm
    return state

for state in range(16):
    led, arm = get_led_and_arm(state)
    print('LED: %d | Arm: %d | State: %2d ' % (led, arm, state))

LED: 0 | Arm: 0 | State:  0 
LED: 0 | Arm: 1 | State:  1 
LED: 0 | Arm: 2 | State:  2 
LED: 0 | Arm: 3 | State:  3 
LED: 1 | Arm: 0 | State:  4 
LED: 1 | Arm: 1 | State:  5 
LED: 1 | Arm: 2 | State:  6 
LED: 1 | Arm: 3 | State:  7 
LED: 2 | Arm: 0 | State:  8 
LED: 2 | Arm: 1 | State:  9 
LED: 2 | Arm: 2 | State: 10 
LED: 2 | Arm: 3 | State: 11 
LED: 3 | Arm: 0 | State: 12 
LED: 3 | Arm: 1 | State: 13 
LED: 3 | Arm: 2 | State: 14 
LED: 3 | Arm: 3 | State: 15 


The goal of your robot is to find - for each possible game state - the best action that the robot should take from that state, towards its goal of maximizing the game score.  We will think of this as a lookup table that the robot can consult when selecting actions, and we refer to it as an **optimal policy**.

For instance, consider the case that the game starts in state `0`.  This state corresponds to arm position `0` and lit LED position `0`.  In this case, the robot should select action **P**ush, to end the game with a final score of `-1` immediately after.

Likewise, state `1` corresponds to arm position `1` and lit LED position `0`.  In this case, the robot should decide to move **L**eft as the best initial move.  (This way, the robot can select to **P**ush at the next step and end the game with a best final score of `-2`.)

Take the time now to look at the printed optimal policy below.  Check to make sure that you can see why these actions are optimal, in the context of their corresponding game states!

```
LED: 0 | Arm: 0 | State:  0 | Best Action: P
LED: 0 | Arm: 1 | State:  1 | Best Action: L
LED: 0 | Arm: 2 | State:  2 | Best Action: L
LED: 0 | Arm: 3 | State:  3 | Best Action: L
LED: 1 | Arm: 0 | State:  4 | Best Action: R
LED: 1 | Arm: 1 | State:  5 | Best Action: P
LED: 1 | Arm: 2 | State:  6 | Best Action: L
LED: 1 | Arm: 3 | State:  7 | Best Action: L
LED: 2 | Arm: 0 | State:  8 | Best Action: R
LED: 2 | Arm: 1 | State:  9 | Best Action: R
LED: 2 | Arm: 2 | State: 10 | Best Action: P
LED: 2 | Arm: 3 | State: 11 | Best Action: L
LED: 3 | Arm: 0 | State: 12 | Best Action: R
LED: 3 | Arm: 1 | State: 13 | Best Action: R
LED: 3 | Arm: 2 | State: 14 | Best Action: R
LED: 3 | Arm: 3 | State: 15 | Best Action: P
```

Once the robot has determined the optimal policy, it can reference it when playing the game, in order to consistently attain the highest possible score.  For instance, if the robot is presented with a situation in the game where the lit LED is at position `2`, and its arm is at position `1`, it need only find the line corresponding to state `9` in the table, where it will see that the best action is to go right.
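
Because the winning strategy is easy to hard code, the table above can be reproduced with a simple rule: head toward the lit LED, then push.  The sketch below is illustrative only; `best_action` is a hypothetical helper, and it assumes the state encoding `state = led*4 + arm` used in this notebook.

```python
def best_action(led, arm):
    """Hard-coded optimal move: head for the lit LED, then push."""
    if arm > led:
        return 0   # L
    if arm < led:
        return 2   # R
    return 1       # P

n_led = 4
# state // n_led is the lit LED, state % n_led is the arm position
policy = [best_action(state // n_led, state % n_led) for state in range(n_led ** 2)]
print(policy)
# [1, 0, 0, 0, 2, 1, 0, 0, 2, 2, 1, 0, 2, 2, 2, 1]
```

The point of the rest of the notebook is that the robot arrives at this same table without ever being told the rule.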

Next, you will implement a method known as **Monte Carlo with Exploring Starts** to guide your robot to obtain this optimal policy.

<a id='mces'></a>
## Monte Carlo with Exploring Starts

### A Central Idea

As part of this method, the robot will maintain a numpy array `Q` with `16` rows and `3` columns.  
> The entry in the `s`-th row and `a`-th column (`Q[s,a]`) contains the robot's _**estimate**_ for the highest score it can possibly obtain in the game, if the game started in state `s`, and the robot selected action `a` for its first move.  

For the first several games, the estimates in the array will be wildly inaccurate; but, the more experience the robot gets with gameplay, the more it is able to refine these estimates.

It is relatively straightforward to determine what these estimates _should_ be, _if_ they are completely accurate.  For instance, consider `Q[3,1]`.  This value corresponds to state `3`, which corresponds to the case where the game starts with the lit LED at position `0`, and the robotic arm is at position `3`.  If the robot selects action `1` as its first move in the game, then this corresponds to pushing in the current location, which results in a score of `-1`.  Then, at this point, if the robot plays optimally, it can obtain a final score of **at most** `-5`, obtained by selecting actions `0`, `0`, `0`, and `1` to end the game in four more moves.  The robot will try to estimate this value (`Q[3,1]=-5`), along with all other values in the array, from gameplay.  And - amazingly - it will refine these estimates while simultaneously trying to determine the optimal policy.  That is, it does not need to know the optimal policy *a priori* before forming these estimates!
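
For a game this small, the completely accurate values can be written down directly, which is handy for checking the robot's estimates later.  The sketch below is one way to do that under the rules described earlier; `exact_q` is a hypothetical helper, and it uses plain nested lists rather than the numpy array that the robot maintains.  The 15-move limit never matters here, because optimal play needs at most five moves.

```python
N_LED = 4   # number of LEDs / arm positions

def exact_q(n_led=N_LED):
    """Best achievable final score for every (state, first action) pair."""
    q = [[0, 0, 0] for _ in range(n_led * n_led)]
    for state in range(n_led * n_led):
        led, arm = divmod(state, n_led)            # state = led*n_led + arm
        for action in range(3):
            if action == 1 and arm == led:
                q[state][action] = -1              # pushing the lit button ends the round
                continue
            new_arm = arm
            if action == 0:
                new_arm = max(arm - 1, 0)          # left, blocked at position 0
            elif action == 2:
                new_arm = min(arm + 1, n_led - 1)  # right, blocked at the last position
            # one point for the first move, one per step to reach the LED, one for the push
            q[state][action] = -(1 + abs(new_arm - led) + 1)
    return q

q_exact = exact_q()
print(q_exact[3][1])   # -5
```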

This array is deeply connected to the optimal policy.  
> What's particularly worth noting is that in the event that we have a perfect estimate of all of the values in `Q`, we can quickly use it to obtain the optimal policy.  

To see this, suppose for now that the robot has a perfect estimate of all of the entries in the `Q` array.  Then, it can get the optimal action for any state `s` simply by looking at the `s`-th row in `Q`, or `Q[s]`.  As an example, if `s=3`, then `Q[3]=[-4, -5, -5]`.  
> Then, to get the optimal action corresponding to state `s`, we need only select the action corresponding to the index that maximizes `Q[s]`.  

In the case that `s=3`, we see that `-4` is the largest entry in `Q[s]` (which appears at index `0`), and so the optimal action is action `0` (which corresponds to action **L**eft).  This lines up with the optimal policy above!  

```
LED: 0 | Arm: 3 | State:  3 | Best Action: L
```
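
In code, this lookup is a single "index of the largest entry" operation.  The sketch below uses a plain Python list so that it runs on its own; `greedy_policy_action` is a hypothetical helper.  On a numpy array, `np.argmax(Q[s])` does the same thing and, like this helper, returns the first index when several entries tie.

```python
ACTION_LABELS = {0: 'L', 1: 'P', 2: 'R'}

def greedy_policy_action(q_row):
    """Return the first index holding the largest value in q_row."""
    return max(range(len(q_row)), key=lambda a: q_row[a])

q_row = [-4, -5, -5]                    # the perfect estimates for state 3
action = greedy_policy_action(q_row)
print(action, ACTION_LABELS[action])    # 0 L
```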

To understand more generally **_why_** this is the case, note that (_... still assuming that the robot has a perfect estimate for `Q` ..._) each entry in `Q[s]` contains the final score if the robot follows the optimal policy, for all moves except (potentially) the first.  Then, one of the three available first moves must be optimal, and it makes sense that the optimal move will correspond to the action (or index) that yields the largest value in `Q[s]`.  This fact will come in very useful in the next section, when we talk about how to implement the method in detail.

### Playing the Game

As we saw in the previous section, once the robot has perfected the estimates in the numpy array `Q`, it can quickly obtain the optimal policy.  In particular, for any game state `s` that the robot encounters, the best possible game move is identified as the index of the maximal entry in `Q[s]`.

> **_But now the question is_** ... how should the robot behave, if it is unsure about its estimates in `Q`?

To answer this question, we'll describe how a full game round should evolve.  In the **Monte Carlo with Exploring Starts** method, the robot always follows this procedure:
1. At the beginning of any game round, the robot selects an action completely at random.  
2. For all later moves in the round, the robot uses the entries in `Q` to decide actions.  In particular, for each move, the robot evaluates the current game state `s`, and then chooses an action by selecting the index that maximizes `Q[s]`.  (In the event that there are multiple possible actions that meet this criterion, ties can be broken arbitrarily.)

With the exception of the initial random move, the above strategy for selecting actions ensures that the robot does the best it can with the estimates in `Q` that it has available.  Of course, as `Q` gets more and more accurate, the robot will select better and better actions, and its strategy for selecting actions will gradually start to resemble the optimal policy.

Precisely because `Q` is not yet perfect, it also makes sense for the robot to attempt a random move, to see if it can - by chance - find a play that exceeds its current expectations. 
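
How ties are broken matters more than it might seem.  If every estimate in a row starts at the same value, always taking the first maximal index would make the robot choose action `0` every time.  The sketch below shows one way to break ties at random; `greedy_with_random_ties` is a hypothetical helper, and the starting value of `-5` is just an illustrative choice.

```python
import random

def greedy_with_random_ties(q_row, rng=random):
    """Pick uniformly at random among the actions that share the best estimate."""
    best_value = max(q_row)
    best_actions = [a for a, q in enumerate(q_row) if q == best_value]
    return rng.choice(best_actions)

untrained_row = [-5.0, -5.0, -5.0]   # every action looks equally good at first
picks = [greedy_with_random_ties(untrained_row) for _ in range(300)]
print(sorted(set(picks)))            # the distinct actions that were chosen

print(greedy_with_random_ties([-4.0, -5.0, -5.0]))   # no tie, so always 0
```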

Your task is to complete the `play_round` function in the next code cell.  The function uses the estimates in `Q` to play a single game round, and it returns a list of `(state, action)` tuples, detailing how the game evolved.  Your implementation should replace these lines of code in the cell below:
```python
# your code here
action = ...                                # select action from set of best actions
```

Once you have completed your implementation, execute the code cell to see how the game evolved for three different rounds.  This will prove useful for debugging!  Each round will have corresponding output that looks somewhat like the snippet below:
```
Round 1: [(9, 2), (10, 1)] | Score: -2 
```
In the sample round above, the game began in state `9`, and the robot's initial action was `2`.  This changed the game to state `10`, and the robot selected action `1` to end the game.  The robot's final score was `(-1)+(-1)=-2`.

Note that since one point is deducted after every action, the robot's game score is simply the negative of the length of the list (`game_round`) that is returned by the `play_round` function.

In [3]:
import numpy as np
import random

def play_round(env, Q=None):
    if Q is None:                               # default value for Q if not provided
        Q = -5*np.ones((env.nS, env.nA), dtype=float)
    # start the round
    game_round = []
    led, arm = env.reset()                      # reset the game
    state = get_state(led, arm, env.nLED)       # initial state

    # select a random action
    action = env.get_random_action()            # select initial random action 
    arm, points, done = env.step(action)        # take initial action
    game_round.append((state, action))          # save initial state and action

    # finish the round
    while not done:
        state = get_state(led, arm, env.nLED)   # get next game state
        best_actions = np.argwhere(Q[state] == np.amax(Q[state])).flatten()  # all actions tied for the best estimate
        action = random.choice(best_actions)    # select action from set of best actions
        arm, points, done = env.step(action)    # take action
        game_round.append((state, action))      # save state and action
    
    return game_round

# play the game for three separate rounds
for i in range(1,4):
    game_round = play_round(env)
    print('Round %d: %s | Score: %d' % (i, game_round, -len(game_round)))

Round 1: [(7, 1), (7, 0), (6, 0), (5, 0), (4, 0), (4, 2), (5, 2), (6, 1), (6, 1), (6, 1), (6, 2), (7, 0), (6, 1), (6, 0), (5, 0)] | Score: -15
Round 2: [(15, 0), (14, 2), (15, 0), (14, 0), (13, 0), (12, 1), (12, 0), (12, 0), (12, 1), (12, 1), (12, 2), (13, 0), (12, 0), (12, 2), (13, 1)] | Score: -15
Round 3: [(7, 2), (7, 0), (6, 1), (6, 2), (7, 0), (6, 1), (6, 2), (7, 0), (6, 1), (6, 1), (6, 2), (7, 2), (7, 0), (6, 0), (5, 2)] | Score: -15


### Estimating `Q`

So far, we have mentioned that the estimates in `Q` will be gradually improved with gameplay, but we haven't learned how **_exactly_** the robot will accomplish this.  This is what you will learn now, and it's the last piece of the puzzle before you're ready to complete your implementation of the **Monte Carlo with Exploring Starts** method!

We'll begin by considering how the robot can estimate an entry corresponding to a particular state-action pair `(s, a)`.  For instance, consider `s=15` and `a=0`, or `Q[15,0]`.  Remember that `Q[15,0]` contains the robot's **estimate** for the **highest** score it can possibly obtain in the game, if the game starts in state `15`, and the robot selects action `0` for its first move.  

> The main idea behind updating the values in `Q` is that the robot will play many game rounds that begin in state `s=15`, where the robot selects action `a=0` as its first move, and it will get a useful estimate for `Q[15,0]` simply by averaging the final scores obtained from these rounds.  

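For instance, suppose by chance that the first three game rounds begin with state `15` and action `0`.  (As listed, rounds `2` and `3` do not end with a winning push, so treat them as simplified illustrations of the bookkeeping rather than as complete games.)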

```
Round 1: [(15, 0), (14, 1), (14, 2), (15, 1)] | Score: -4
Round 2: [(15, 0), (14, 0), (13, 0), (12, 2), (13, 2), (14, 0), (13, 1), (13, 0)] | Score: -8
Round 3: [(15, 0), (14, 0), (13, 1), (13, 2), (14, 2), (15, 2)] | Score: -6
```
Then: 
- After round `1`, the method will set the value of `Q[15,0]` as the observed score that was obtained, or `-4`.  
- After round `2`, `Q[15,0]` is set to **the average of `-4` and `-8`**, or `-6`.  
- After round `3`, `Q[15,0]` is **the average of `-4`, `-8`, and `-6`**, or `-6`.
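
The averaging in the three steps above can be checked with a few lines of Python.  The sketch also shows an equivalent incremental form of the mean, which avoids storing every score; this incremental variant is an aside for illustration and is not what the notebook's implementation uses.

```python
final_scores = [-4, -8, -6]        # scores of the three rounds that began with (15, 0)

# Store every score and recompute the mean after each round
seen = []
estimates = []
for score in final_scores:
    seen.append(score)
    estimates.append(sum(seen) / len(seen))
print(estimates)                   # [-4.0, -6.0, -6.0]

# Equivalent incremental update: nudge the estimate toward each new score
estimate, n = 0.0, 0
for score in final_scores:
    n += 1
    estimate += (score - estimate) / n
print(estimate)                    # -6.0
```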

And note that these games can also be used to estimate the entries in `Q` corresponding to other state-action pairs.  For instance, consider `Q[14,1]`.  Even though round `1` did not technically start with state `14` and action `1`, the mini-round obtained by removing `(15, 0)` proves particularly useful for estimating `Q[14,1]`.  In particular, the mini-round `[(14, 1), (14, 2), (15, 1)]` shows that after the robot selected action `1` when the game was in state `14`, the robot effectively received a final score of `-3`.

Before describing the algorithm in full, it is necessary to talk about what should happen in the event that the same state-action pair appears in a round multiple times.  For instance, consider `(13, 0)` in round `2`.  Should we use a "final score" of `-6` (taken from the first occurrence of `(13, 0)` in the round), or `-1` (derived from the final occurrence of `(13, 0)`)?  Or should we use both?
> For this session, if the same state-action pair appears multiple times in a round, you are encouraged to consider only the mini-round that begins at the first occurrence.  This yields a method known as **First-Visit** Monte Carlo with Exploring Starts.  (_Another option, that you are encouraged to explore later, is **Every-Visit** Monte Carlo with Exploring Starts._)  

So, in this case, the robot should set `Q[13,0]` to `-6` after round `2`.
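
To make the difference between the two options concrete, the sketch below computes the effective final scores for `(13, 0)` in round `2` both ways.  `visit_scores` is a hypothetical helper written for this comparison only.

```python
def visit_scores(game_round, pair, first_visit=True):
    """Effective final scores for `pair`: minus the number of moves
    from each visit to the end of the round."""
    scores = [-(len(game_round) - idx)
              for idx, visited in enumerate(game_round) if visited == pair]
    return scores[:1] if first_visit else scores

round_2 = [(15, 0), (14, 0), (13, 0), (12, 2), (13, 2), (14, 0), (13, 1), (13, 0)]

print(visit_scores(round_2, (13, 0)))                      # [-6]
print(visit_scores(round_2, (13, 0), first_visit=False))   # [-6, -1]
```

First-Visit keeps only the `-6`, while Every-Visit would add both `-6` and `-1` to the running average.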

Now that we have discussed an example, you will use these ideas to update the entries in `Q` in the `monte_carlo` function in the code cell below.

### Implement the Method

You will implement the **Monte Carlo with Exploring Starts** method in the `monte_carlo` function in the code cell below.  

The method begins with an initial guess for the estimates in `Q`:
```python
Q = -5*np.ones((nS, nA), dtype=float)   # initial guess: every entry starts at -5
```

Of course, this initial estimate isn't so great, but the robot will learn quickly how to improve it!

The robot plays the game for `num_rounds` game rounds.  The results of each game round are saved in the `game_round` list:
```python
game_round = play_round(env, Q)         # play game round
```

After each game round, you will record the effective scores obtained after visiting each state-action pair.  (In the case that the same state-action pair appears multiple times in a round, you should consider only the first occurrence.)  You will record this information in the dictionary `scores`, which is initialized towards the beginning of the function:  
```python
scores = defaultdict(lambda: [])        # initialize dictionary of empty lists
```
In particular, for each state-action pair `(s, a)` that appears in the round, append the effective final score to the list in `scores[s,a]`.  Then, you can use this list to update `Q[s,a]` to the average of the values in `scores[s,a]`.  

In case an example helps, we'll work with the example from the previous section, with three game rounds that began with the state-action pair `(15, 0)`.  In this case, we should have:
- after round `1`, `scores[15,0]=[-4]`  
- after round `2`, `scores[15,0]=[-4, -8]`, and 
- after round `3`, `scores[15,0]=[-4, -8, -6]`.  

Likewise, 
- after round `1`, `scores[14,1]=[-3]`, and  
- after round `2`, `scores[13,0]=[-6]`. 
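
The bookkeeping in this example can be replayed without the simulator.  The sketch below feeds the three example rounds through a `scores` dictionary; as a simplification, it keeps the averages in a plain dictionary `q` (an assumption of this sketch) instead of the numpy array `Q`.

```python
from collections import defaultdict

rounds = [
    [(15, 0), (14, 1), (14, 2), (15, 1)],
    [(15, 0), (14, 0), (13, 0), (12, 2), (13, 2), (14, 0), (13, 1), (13, 0)],
    [(15, 0), (14, 0), (13, 1), (13, 2), (14, 2), (15, 2)],
]

scores = defaultdict(list)   # (state, action) -> list of effective final scores
q = {}                       # (state, action) -> current average

for game_round in rounds:
    for pair in set(game_round):                      # each pair once per round
        idx = game_round.index(pair)                  # first occurrence only
        scores[pair].append(-len(game_round[idx:]))   # moves left, counted from idx
        q[pair] = sum(scores[pair]) / len(scores[pair])

print(scores[15, 0], q[15, 0])        # [-4, -8, -6] -6.0
print(scores[14, 1], scores[13, 0])   # [-3] [-6]
```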

Your implementation should appear inside the loop below:
```python
# use game round to update Q
for s, a in set(game_round):                                    # loop over state-action pairs
    idx = min([i for i,x in enumerate(game_round) if x==(s,a)]) # obtain first index where pair appears
    # your code here                                            # append effective final score to scores[s,a]
    # your code here                                            # set Q[s,a] to the mean of scores[s,a]
```

You need only complete the two lines in the code snippet above to complete the implementation.  Once you have finished, run the code cell below to print your robot's estimated optimal policy after playing the game for `100` rounds.

Next, you'll learn how to run the code in real time on your robot, to visualize how it learns!

In [4]:
import sys
from collections import defaultdict

def monte_carlo(env, num_rounds=100):
    nS = env.nS        # number of states
    nA = env.nA        # number of actions
    
    Q = -5*np.ones((nS, nA), dtype=float)   # initial guess: every entry starts at -5
    scores = defaultdict(lambda: [])        # initialize dictionary of empty lists
    
    # loop over game rounds
    for i_round in range(1, num_rounds+1):
        
        # monitor progress
        print("\rGame Round {}/{}.".format(i_round, num_rounds), end="")
        sys.stdout.flush()
        
        game_round = play_round(env, Q)     # play game round
        
        # use game round to update Q
        for s, a in set(game_round):                                    # loop over state-action pairs
            idx = min([i for i,x in enumerate(game_round) if x==(s,a)]) # obtain first index where pair appears
            scores[s,a].append(-len(game_round[idx:]))                  # append effective final score to scores[s,a]
            Q[s,a] = np.mean(scores[s,a])                               # set Q[s,a] to the mean of scores[s,a]
    return Q

# run the algorithm
Q = monte_carlo(env)

# print the estimated optimal policy
print('\n\nEstimated Optimal Policy:')
correct_policy = [1, 0, 0, 0, 2, 1, 0, 0, 2, 2, 1, 0, 2, 2, 2, 1]
for state in range(Q.shape[0]):
    led, arm = get_led_and_arm(state, env.nLED)
    to_print = (led, arm, state, decipher_action(np.argmax(Q[state])))
    if correct_policy[state] == np.argmax(Q[state]):
        print('LED: %d | Arm: %d | State: %2d | Best Action: %s' % to_print)
    else:
        print('LED: %d | Arm: %d | State: %2d | Best Action: %s (INCORRECT)' % to_print)

# print the number of states whose estimated best action is correct
print('\nCorrect: %d/16' % sum(np.argmax(Q, axis=1) == correct_policy))

Game Round 100/100.

Estimated Optimal Policy:
LED: 0 | Arm: 0 | State:  0 | Best Action: P
LED: 0 | Arm: 1 | State:  1 | Best Action: L
LED: 0 | Arm: 2 | State:  2 | Best Action: L
LED: 0 | Arm: 3 | State:  3 | Best Action: L
LED: 1 | Arm: 0 | State:  4 | Best Action: R
LED: 1 | Arm: 1 | State:  5 | Best Action: P
LED: 1 | Arm: 2 | State:  6 | Best Action: L
LED: 1 | Arm: 3 | State:  7 | Best Action: L
LED: 2 | Arm: 0 | State:  8 | Best Action: R
LED: 2 | Arm: 1 | State:  9 | Best Action: R
LED: 2 | Arm: 2 | State: 10 | Best Action: P
LED: 2 | Arm: 3 | State: 11 | Best Action: L
LED: 3 | Arm: 0 | State: 12 | Best Action: R
LED: 3 | Arm: 1 | State: 13 | Best Action: R
LED: 3 | Arm: 2 | State: 14 | Best Action: R
LED: 3 | Arm: 3 | State: 15 | Best Action: P

Correct: 16/16
