## Introduction to Reinforcement Learning with NVIDIA Jetson TX2

In this session, you will use a branch of machine learning called **reinforcement learning** (RL) to teach a robot to play a game.

## The Game

The setup has four LEDs.  We number the LEDs starting from zero, so that the yellow LED is at position `0`, the red LED is at position `1`, and so on.  Each LED is connected to a button that is used to turn it off.  

![LEDs](images/LEDs.png)

The game round begins when **one** of the four LEDs is turned on, and the robotic arm starts at a position hovering over one of the four buttons.  The score always starts at zero.

At each point in the game, the robot has three potential movements (or **actions**) at its disposal.  It can:
- `0` - move **L**eft one position,
- `1` - stay at the current position and **P**ush, or
- `2` - move **R**ight one position.

Every action costs the robot one point.  Most actions have intuitive effects, but a few special cases are worth mentioning:
- Pushing the button under the lit LED turns that LED off.  Pushing a button connected to an unlit LED has no effect (i.e., the robot can never turn an LED on).  
- If the arm is hovering over the leftmost button at position `0` and moves left, it hits an imaginary wall and stays where it is.  Likewise, if the arm is hovering over the rightmost button at position `3` and moves right, it stays at position `3`.

The round ends when the robot pushes the correct button to turn off the lit LED.  Your goal in this notebook is to implement an agent that can learn from gameplay the optimal strategy to attain the highest (or least negative) score at each round.  To accomplish this, your agent will need to learn how to turn off the lit LED in as few movements as possible.  
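
The rules above can be sketched in a few lines of Python.  This is only an illustration: `toy_step` and the constants are hypothetical names invented here, and the sketch is not the simulator used later in this notebook.

```python
N_LEDS = 4                   # positions 0..3
LEFT, PUSH, RIGHT = 0, 1, 2  # action encoding from the list above

def toy_step(led, arm, action):
    """Apply one action and return (new_arm, reward, done)."""
    if action == LEFT:
        arm = max(arm - 1, 0)            # imaginary wall at position 0
    elif action == RIGHT:
        arm = min(arm + 1, N_LEDS - 1)   # imaginary wall at position 3
    # the round ends only when the arm pushes the button under the lit LED
    done = (action == PUSH and arm == led)
    return arm, -1, done                 # every action costs one point

# Replay a three-move round: lit LED at 0, arm starts at 1, actions P, L, P
arm, score = 1, 0
for action in (PUSH, LEFT, PUSH):
    arm, reward, done = toy_step(0, arm, action)
    score += reward
print(arm, score, done)  # 0 -3 True
```

The first push does nothing because the arm is over an unlit LED, so the round costs three points in total.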

## Do we really need RL?

You may have noticed that the winning strategy for this game is very straightforward to hard code.  With this in mind, if you are new to RL, you might wonder whether an RL technique is overkill.

In the field of RL, it is very common to use simple games with well-defined rules **to build intuition** for how to design algorithms that accomplish more complex tasks.  

In the RL setting, we assume that the robot does not have the domain knowledge to know what "left" or "right" means.  The robot only knows that it has three possible actions, and it needs to figure out how to select from these actions to consistently attain the highest possible score.  

You can think of the robot as a player sitting at a keyboard with only three keys, who has not been told what any of the keys do.  Despite this missing information, the player must nonetheless learn how to beat the game.  

Much as a human player would, the robot will closely watch its score in the game to gauge how well it is doing.  If it is not performing well, it will amend its strategy to do better.  

## Motivating RL (still need to write)

Introduce terminology: "agent", "reward", "episode", "optimal policy".

We want to teach an artificially intelligent agent to learn from interactions with its environment.  This robot will learn much as humans do: through trial and error.  

- Initially, the robot doesn't know much about its environment.  It knows that there are 16 possible states and that it can take 3 possible actions, but that's it.  
- Actions influence states.
- The robot doesn't know how each action influences the state, or which states are most desirable.  It only wants to win the game and get as many points as possible.
- We teach the robot by rewarding it when it arrives at the desired outcome and punishing it otherwise, which reinforces the best behavior.
- Given well-defined input and a goal, the agent can learn by interacting with the environment.
- On the policy: the goal of the agent is to find a way to select the best action in response to each state, where how good an action is depends on the reward that the agent collects.

At every time step, the agent receives a reward and state from the environment and chooses an action to perform in response.  In this way, the interaction evolves as a sequence of states, actions, and rewards.  We are working with an episodic task, where the interaction stops at some time step $T$ when the agent encounters a terminal state.  We refer to this sequence as an episode.
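
As a rough illustration, an episode can be stored as a list of (state, action, reward) records.  The representation below is an assumption made for this sketch: a state is written as a (lit LED, arm position) pair, and the actions use the `0`/`1`/`2` encoding from the game description.

```python
# One record per time step: (state, action, reward).
# A state here is a (lit LED, arm position) pair, chosen for illustration.
episode = [
    ((0, 1), 1, -1),  # push over an unlit LED: nothing happens
    ((0, 1), 0, -1),  # move left to position 0
    ((0, 0), 1, -1),  # push the correct button: the round ends
]

T = len(episode)                                 # number of time steps
score = sum(reward for _, _, reward in episode)  # total reward collected
print(T, score)  # 3 -3
```

The final score of a round is simply the sum of the rewards collected over the episode.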

## Investigate Random Behavior

We have written a simple simulator that you can use to see how the agent performs when it selects random actions for the entire game.  Of course, your agent will learn to perform much better!

Run the code cell below to have the agent play the game for 2 separate episodes.  When parsing the output, remember that the starting conditions of the game are random!

Each episode has corresponding output that looks somewhat like the snippet below:
```
Starting Episode 1 ...
 LED: 0    | Arm: 1 |
 Action: P | Arm: 1 | Reward: -1 
 Action: L | Arm: 0 | Reward: -1 
 Action: P | Arm: 0 | Reward: -1 
Final Score: -3.0
```

In the sample snippet above:
- When the game round was initiated, the lit LED was at position `0`, and the arm was at position `1`.  
- The agent's first choice of action was to **P**ush in the current location; as a result, it received a reward of `-1`, and the LEDs were unaffected (i.e., the LED at position `0` remains lit, while all of the other LEDs are unlit).  
- The agent's next choice was to move **L**eft, so it moved to position `0` and received another reward of `-1`. 
- Then, the agent decided to **P**ush in location `0` and received another reward of `-1`.  At this point, the game ended, because the final action turned off the lit LED. 
- In this case, the final score received at the end of the game is `-1` + `-1` + `-1` = `-3`.

Take the time now to understand the `JetsonEnv` class in **jetson_env.py**.  Note that the simulation encodes each possible action as an integer (`0`, `1`, or `2`); to get the more readable action label (`L`, `P`, or `R`), we use the `decipher_action` function below. 

Later in this notebook, you will use the `JetsonEnv` class to simulate games to teach your own agent!

In [1]:
from jetson_env import JetsonEnv

# use a Python dictionary to decode the actions
action_dict = {0: 'L', 1: 'P', 2: 'R'}
def decipher_action(a):
    return action_dict[a]

# create a new environment
env = JetsonEnv()

# interact with the environment
for i_episode in range(1, 3):
    print('Starting Episode %d ...' % i_episode)
    # reset the lit LED, arm position, and score
    board, pos = env.reset()
    score = 0
    print(' LED: %d    | Arm: %d |' % (board, pos))
    while True:
        action = env.get_random_action() # select a random action
        pos, reward, done = env.step(action)
        score += reward
        print(' Action: %s | Arm: %d | Reward: %d ' % (decipher_action(action), pos, reward))
        if done:
            print('Final Score:', score)
            break

Starting Episode 1 ...
 LED: 2    | Arm: 2 |
 Action: P | Arm: 2 | Reward: -1 
Final Score: -1.0
Starting Episode 2 ...
 LED: 3    | Arm: 1 |
 Action: P | Arm: 1 | Reward: -1 
 Action: L | Arm: 0 | Reward: -1 
 Action: R | Arm: 1 | Reward: -1 
 Action: P | Arm: 1 | Reward: -1 
 Action: R | Arm: 2 | Reward: -1 
 Action: P | Arm: 2 | Reward: -1 
 Action: P | Arm: 2 | Reward: -1 
 Action: P | Arm: 2 | Reward: -1 
 Action: R | Arm: 3 | Reward: -1 
 Action: L | Arm: 2 | Reward: -1 
 Action: P | Arm: 2 | Reward: -1 
 Action: L | Arm: 1 | Reward: -1 
 Action: R | Arm: 2 | Reward: -1 
 Action: P | Arm: 2 | Reward: -1 
 Action: L | Arm: 1 | Reward: -1 
 Action: R | Arm: 2 | Reward: -1 
 Action: L | Arm: 1 | Reward: -1 
 Action: R | Arm: 2 | Reward: -1 
 Action: R | Arm: 3 | Reward: -1 
 Action: P | Arm: 3 | Reward: -1 
Final Score: -20.0


## Implementing Monte Carlo ES

As discovered above, there are **three possible actions**, corresponding to:
- `0` - moving **L**eft, 
- `1` - staying and **P**ushing in the current position, and
- `2` - moving **R**ight.

The **total number of possible game states is $4^2 = 16$**, where there is a state for each possible combination of arm position and lit LED position. 

To avoid having to deal with two different numbers when referencing the state, we define the `get_state` function below that maps each possible combination of arm position (`pos`) and lit LED position (`board`) to an integer from `0` to `15`, which we refer to as the corresponding state (`state`).  

In your upcoming implementation, the state should always be encoded as a number from `0` to `15`, but you can get the corresponding arm position (`pos`) and lit LED position (`board`) by passing the state (`state`) into the `get_board_and_pos` function.

In [2]:
def get_board_and_pos(state, nLED=4):
    pos = state % nLED               # arm position is the remainder
    board = (state - pos) // nLED    # lit LED position is the quotient
    return board, pos

def get_state(board, pos, nLED=4):
    state = board*nLED + pos
    return state

for state in range(16):
    board, pos = get_board_and_pos(state)
    print('Board: %d | Pos: %d | State: %d ' % (board, pos, state))

Board: 0 | Pos: 0 | State: 0 
Board: 0 | Pos: 1 | State: 1 
Board: 0 | Pos: 2 | State: 2 
Board: 0 | Pos: 3 | State: 3 
Board: 1 | Pos: 0 | State: 4 
Board: 1 | Pos: 1 | State: 5 
Board: 1 | Pos: 2 | State: 6 
Board: 1 | Pos: 3 | State: 7 
Board: 2 | Pos: 0 | State: 8 
Board: 2 | Pos: 1 | State: 9 
Board: 2 | Pos: 2 | State: 10 
Board: 2 | Pos: 3 | State: 11 
Board: 3 | Pos: 0 | State: 12 
Board: 3 | Pos: 1 | State: 13 
Board: 3 | Pos: 2 | State: 14 
Board: 3 | Pos: 3 | State: 15 
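
The two mappings are inverses of each other, which can be checked with a short round-trip test.  The sketch below uses hypothetical names (`encode_state`, `decode_state`) so that it does not shadow the notebook's own functions, and it relies on Python's built-in `divmod` instead of separate `%` and division steps.

```python
def encode_state(board, pos, nLED=4):
    # same formula as get_state: one block of nLED states per lit LED
    return board * nLED + pos

def decode_state(state, nLED=4):
    # divmod returns (quotient, remainder), i.e. (board, pos)
    return divmod(state, nLED)

# every (board, pos) pair should survive encoding followed by decoding
round_trip_ok = all(
    decode_state(encode_state(board, pos)) == (board, pos)
    for board in range(4)
    for pos in range(4)
)
print(round_trip_ok)     # True
print(decode_state(13))  # (3, 1)
```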


The goal of your agent is to find the **optimal policy**.  For each possible starting game state, the optimal policy specifies the best initial action that the agent should take from that state to maximize the game score.

For instance, consider the case that the game starts in state `0`.  This state corresponds to arm position `0` and lit LED position `0`.  In this case, the robot should select action **P**ush, to obtain a reward of `-1` and end the game with a final score of `-1` immediately after.

Likewise, state `1` corresponds to arm position `1` and lit LED position `0`.  In this case, the best initial move is to move **L**eft.  The robot can then choose to **P**ush at the next step and end the game with a best final score of `-2`. 

Take the time now to look at the printed optimal policy below.  Check to make sure that you can see why these actions are optimal, in the context of their corresponding game states!

```
Board: 0 | Pos: 0 | State: 0 | Best Action: P
Board: 0 | Pos: 1 | State: 1 | Best Action: L
Board: 0 | Pos: 2 | State: 2 | Best Action: L
Board: 0 | Pos: 3 | State: 3 | Best Action: L
Board: 1 | Pos: 0 | State: 4 | Best Action: R
Board: 1 | Pos: 1 | State: 5 | Best Action: P
Board: 1 | Pos: 2 | State: 6 | Best Action: L
Board: 1 | Pos: 3 | State: 7 | Best Action: L
Board: 2 | Pos: 0 | State: 8 | Best Action: R
Board: 2 | Pos: 1 | State: 9 | Best Action: R
Board: 2 | Pos: 2 | State: 10 | Best Action: P
Board: 2 | Pos: 3 | State: 11 | Best Action: L
Board: 3 | Pos: 0 | State: 12 | Best Action: R
Board: 3 | Pos: 1 | State: 13 | Best Action: R
Board: 3 | Pos: 2 | State: 14 | Best Action: R
Board: 3 | Pos: 3 | State: 15 | Best Action: P
```
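
For comparison, the table above can be reproduced with a hard-coded rule: move toward the lit LED, and push once the arm is over it.  The helper name `best_action` is hypothetical, and this sketch uses exactly the domain knowledge that the RL agent is assumed not to have.

```python
def best_action(board, pos):
    """Hand-coded rule: head for the lit LED, then push."""
    if pos > board:
        return 'L'   # arm is to the right of the lit LED
    if pos < board:
        return 'R'   # arm is to the left of the lit LED
    return 'P'       # arm is over the lit LED

# state = board * 4 + pos, so divmod(state, 4) recovers (board, pos)
policy = [best_action(*divmod(state, 4)) for state in range(16)]
print(''.join(policy))  # PLLLRPLLRRPLRRRP
```

Reading the printed string four characters at a time gives the best actions for lit LED positions `0`, `1`, `2`, and `3` in turn.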

Now, you will implement a method known as **Monte Carlo with Exploring Starts** to guide your agent to obtain this optimal policy (printed above).

As part of this algorithm, the agent will maintain a numpy array $Q$ with 16 rows and 3 columns, containing ... //

*This description of Monte Carlo Exploring Starts still needs to be fleshed out.  The plan is that the attendee will code everything from scratch.*
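
Until that description is written, the following sketch may help build intuition.  It assumes, in line with the code cell further down, that each entry of $Q$ holds the average score observed after the first time a state-action pair was visited in an episode.  The sample episode and the variable names are chosen for illustration only.

```python
import numpy as np

nS, nA = 16, 3
Q = np.zeros((nS, nA))
returns_sum = np.zeros((nS, nA))
returns_count = np.zeros((nS, nA))

# Sample episode as (state, action, reward): lit LED 0, arm starts at 1,
# so the states visited are 1, 1, 0 and the actions are P, L, P.
episode = [(1, 1, -1), (1, 0, -1), (0, 1, -1)]

visited = set()
for t, (state, action, _) in enumerate(episode):
    if (state, action) in visited:
        continue                           # only the first visit counts
    visited.add((state, action))
    G = sum(r for _, _, r in episode[t:])  # score collected from step t on
    returns_sum[state, action] += G
    returns_count[state, action] += 1
    Q[state, action] = returns_sum[state, action] / returns_count[state, action]

print(Q[1, 1], Q[1, 0], Q[0, 1])  # -3.0 -2.0 -1.0
```

After this single episode, pushing in state `1` looks worse than moving left, because the wasted push made the round one step longer.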

In [3]:
import sys
import numpy as np

def monte_carlo(env, num_episodes):
    nLED = env.nLED
    nS = nLED ** 2                    # number of states (lit LED x arm position)
    nA = env.nA                       # number of actions
    
    # initialize empty arrays
    Q = np.zeros((nS, nA), dtype=float)
    
    ##### CODE ABOVE THIS LINE PROVIDED TO ATTENDEES.  THEY HAVE TO WRITE EVERYTHING BELOW. #####
    
    returns_sum = np.zeros((nS, nA), dtype=float)
    returns_count = np.zeros((nS, nA), dtype=float)
    
    # loop over episodes
    for i_episode in range(1, num_episodes+1):
        
        print("\rEpisode {}/{}.".format(i_episode, num_episodes), end="")
        sys.stdout.flush()
        
        # start the interaction
        episode = []
        board, pos = env.reset()
        state = get_state(board, pos, nLED)
        
        # select a random action
        action = env.get_random_action()
        pos, reward, done = env.step(action)
        episode.append((state, action, reward))
        
        # finish the episode greedily, capped at 100 further steps
        for _ in range(100):
            if done:
                break
            # get state index (the lit LED does not change during an episode)
            state = get_state(board, pos, nLED)
            # select the action with the highest estimated return
            action = np.argmax(Q[state])
            pos, reward, done = env.step(action)
            episode.append((state, action, reward))
                
        # first-visit update: for each (state, action) pair in the episode,
        # average the return that follows its first occurrence
        sa_set = set((x[0], x[1]) for x in episode)
        for state, action in sa_set:
            first_idx = min(i for i, x in enumerate(episode) if x[0] == state and x[1] == action)
            returns_sum[state][action] += sum(x[2] for x in episode[first_idx:])
            returns_count[state][action] += 1
            Q[state][action] = returns_sum[state][action]/returns_count[state][action]
            
    return Q

In [4]:
Q = monte_carlo(env, 200)

Episode 200/200.

In [5]:
# print your agent's learned policy
for state in range(Q.shape[0]):
    board, pos = get_board_and_pos(state, env.nLED)
    print('Board: %d | Pos: %d | State: %d | Best Action: %s' % (board, pos, state, decipher_action(np.argmax(Q[state]))))

Board: 0 | Pos: 0 | State: 0 | Best Action: P
Board: 0 | Pos: 1 | State: 1 | Best Action: L
Board: 0 | Pos: 2 | State: 2 | Best Action: L
Board: 0 | Pos: 3 | State: 3 | Best Action: L
Board: 1 | Pos: 0 | State: 4 | Best Action: R
Board: 1 | Pos: 1 | State: 5 | Best Action: P
Board: 1 | Pos: 2 | State: 6 | Best Action: L
Board: 1 | Pos: 3 | State: 7 | Best Action: L
Board: 2 | Pos: 0 | State: 8 | Best Action: R
Board: 2 | Pos: 1 | State: 9 | Best Action: R
Board: 2 | Pos: 2 | State: 10 | Best Action: P
Board: 2 | Pos: 3 | State: 11 | Best Action: L
Board: 3 | Pos: 0 | State: 12 | Best Action: R
Board: 3 | Pos: 1 | State: 13 | Best Action: R
Board: 3 | Pos: 2 | State: 14 | Best Action: R
Board: 3 | Pos: 3 | State: 15 | Best Action: P


In [6]:
# cleaner printing of learned policy
np.argmax(Q, axis=1)

array([1, 0, 0, 0, 2, 1, 0, 0, 2, 2, 1, 0, 2, 2, 2, 1])
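
One way to confirm that the learned policy matches the optimal one is to compare it against the "move toward the lit LED" rule.  The sketch below copies the array printed above and builds the expected actions with NumPy; the names `learned` and `expected` are chosen here for illustration.

```python
import numpy as np

# learned policy, copied from the output above (0 = L, 1 = P, 2 = R)
learned = np.array([1, 0, 0, 0, 2, 1, 0, 0, 2, 2, 1, 0, 2, 2, 2, 1])

# recover (board, pos) for every state, as get_board_and_pos does
boards, positions = np.divmod(np.arange(16), 4)

# sign(board - pos) is -1, 0, or +1; adding 1 maps it to L, P, or R
expected = np.sign(boards - positions) + 1
print(np.array_equal(learned, expected))  # True
```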