## Reinforcement learning
<div class="alert alert-block alert-warning">
    <b>Learning outcomes:</b>
    <br>
    <ul>
        <li> Introduction to the theory of reinforcement learning.</li>
        <li> Exploration of how Q-learning works.</li>
        <li> What the next step after Q-learning is.</li>
    </ul>
</div>

Most of us have probably heard of AI learning to play computer games on its own. A well-known example comes from DeepMind, which took the world by surprise when its AlphaGo program defeated the Go world champion. More recently, AI has also been able to defeat human players in strategy games. One such example is DeepMind's AlphaStar, which plays StarCraft. Here, the difficulty is compounded because such games require long-term strategic planning.

Dario "TLO" Wünsch, a professional StarCraft player, remarked: "I’ve found AlphaStar’s gameplay incredibly impressive – the system is very skilled at assessing its strategic position, and knows exactly when to engage or disengage with its opponent. And while AlphaStar has excellent and precise control, it doesn’t feel superhuman – certainly not on a level that a human couldn’t theoretically achieve. Overall, it feels very fair – like it is playing a ‘real’ game of StarCraft."

### Reinforcement learning analogy

Consider the scenario of teaching a dog new tricks. The dog doesn't understand human language, so we can't tell it what to do. Instead, we create a situation or a cue, and the dog tries responding in different ways. If the dog's response is the desired one, we reward it with its favourite snack. The next time the dog is exposed to the same situation, it executes a similar action with even more enthusiasm, in expectation of more food. That is like learning "what to do" from positive experiences. Similarly, dogs tend to learn what not to do when faced with negative experiences. For example, whenever the dog behaves undesirably, we admonish it. This helps the dog to understand and reinforce desirable behavior while avoiding undesirable behavior.

That's exactly how reinforcement learning works in a broader sense:

- Your dog is an "agent" that is exposed to the environment. The environment could be your house, with you in it.
- The situations the agent encounters are analogous to a state. An example of a state could be your dog standing in your living room while you say a specific word in a certain tone.
- The agent reacts by performing an action that takes it from one state to another; your dog goes from standing to sitting, for example.
- After the transition, the agent may receive a reward or a penalty in return. You give the dog a treat, or a "No" as a penalty.
- The policy is the strategy for choosing an action given a state, in expectation of better outcomes.
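The agent, state, action, reward, and policy described above can be sketched as a small interaction loop. This is a toy illustration only: `DogTrainingEnv`, `random_policy`, and `run_episode` are hypothetical names invented for this example, and the rewards (+1 for a treat, -1 for a "No") are assumed values.

```python
import random


class DogTrainingEnv:
    """Hypothetical environment: the dog is either standing or sitting."""
    ACTIONS = ("sit", "bark")

    def __init__(self):
        self.state = "standing"

    def reset(self):
        self.state = "standing"
        return self.state

    def step(self, action):
        # Sitting on cue earns a treat (+1) and ends the episode;
        # barking earns a "No" (-1) and the dog stays standing.
        if action == "sit":
            self.state = "sitting"
            return self.state, 1, True    # next_state, reward, done
        return self.state, -1, False


def random_policy(state):
    # A policy maps a state to an action; this one ignores the state.
    return random.choice(DogTrainingEnv.ACTIONS)


def run_episode(env, policy, max_steps=20):
    state = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = policy(state)                   # agent picks an action
        state, reward, done = env.step(action)   # environment responds
        total_reward += reward
        if done:
            break
    return total_reward


print(run_episode(DogTrainingEnv(), random_policy))
```

A policy that always answers `"sit"` collects the treat immediately, while the random policy usually pays a few penalties first. Learning is the process of moving from the second kind of policy towards the first.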

Here are some points to take note of:

- Being greedy (pursuing immediate rewards) is not always good.
    - Some things are easy to do for instant gratification, while others provide long-term rewards. The goal is not to chase quick, immediate rewards but to maximize the total reward accumulated over time.

- Sequence matters in reinforcement learning.
    - The reward an agent receives does not depend on the current state alone, but on the history of states and actions that led to it. Unlike supervised learning, the time step and the order of the state-action-reward sequence are important here.
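One common way to make the trade-off between immediate and long-term rewards concrete is a discounted sum of rewards. The sketch below is illustrative: the two reward sequences are made up, and `discounted_return` is a hypothetical helper. The name `gamma` matches the hyperparameter used in the training code later in this section.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


greedy_path = [5, 0, 0, 0]        # small reward now, nothing later
patient_path = [-1, -1, -1, 20]   # small costs now, large reward at the end

for gamma in (0.2, 0.9):
    print(gamma,
          discounted_return(greedy_path, gamma),
          discounted_return(patient_path, gamma))
```

With a small `gamma` the distant +20 is discounted so heavily that the greedy path scores higher. With a `gamma` close to 1 the patient path wins.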

### Q-table

In the example below, we use OpenAI Gym's Taxi environment.

In [2]:
import sys
sys.tracebacklimit = 0
import gym
import numpy as np
import random
from IPython.display import clear_output
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))

# Initialise the Taxi-v3 environment
env = gym.make("Taxi-v3").env

# Initialise the Q-table with zeros: one row per state, one column per action
q_table = np.zeros([env.observation_space.n, env.action_space.n])

# Hyperparameters
alpha = 0.7 # learning rate: weight of the new estimate (the old value keeps 0.3)
gamma = 0.2 # discount factor: how much future rewards count relative to immediate ones
epsilon = 0.4 # exploration rate: explore 40% of the time, exploit 60%


all_epochs = []
all_penalties = []
training_memory = []

for i in range(1, 50000):
    state = env.reset()

    # Init Vars
    epochs, penalties, reward = 0, 0, 0
    done = False

    #training
    while not done:
        if random.uniform(0, 1) < epsilon: 
            # Check the action space
            action = env.action_space.sample() # for explore
        else:
            # Check the learned values
            action = np.argmax(q_table[state]) # for exploit

        next_state, reward, done, info = env.step(action) # Gym applies the action and returns the outcome

        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state]) # best Q-value available from the next state

        # Update the new value
        new_value = (1 - alpha) * old_value + alpha * \
            (reward + gamma * next_max)
        q_table[state, action] = new_value        
        
        # penalty for performance evaluation
        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1
    

    if i % 100 == 0:
        training_memory.append(q_table.copy())
        clear_output(wait=True)
        print("Episode:", i)
        print("Saved q_table during training:", i)

print("Training finished.")
print(q_table)


Episode: 49900
Saved q_table during training: 49900
Training finished.
[[  0.           0.           0.           0.           0.
    0.        ]
 [ -1.24999956  -1.24999782  -1.24999956  -1.24999782  -1.24998912
  -10.24999782]
 [ -1.249728    -1.24864     -1.249728    -1.24864     -1.2432
  -10.24864   ]
 ...
 [ -1.2432      -1.216       -1.2432      -1.24864    -10.2432
  -10.2432    ]
 [ -1.24998912  -1.2499456   -1.24998912  -1.2499456  -10.24998912
  -10.24998912]
 [ -0.4         -1.08        -0.4          3.          -9.4
   -9.4       ]]
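The two decisions inside the training loop above (how to pick an action, and how to update the table) can be isolated into small functions. This is a minimal sketch rather than a replacement for the cell above: `choose_action` and `q_update` are hypothetical names, and the two-state table is made up for the demonstration. The update is written as `old + alpha * (target - old)`, which is algebraically the same as `(1 - alpha) * old + alpha * target`.

```python
import random

import numpy as np


def choose_action(q_table, state, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    n_actions = q_table.shape[1]
    if random.uniform(0, 1) < epsilon:
        return random.randrange(n_actions)    # explore: uniform random action
    return int(np.argmax(q_table[state]))     # exploit: best known action


def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """One tabular Q-learning update, applied in place."""
    target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (target - q_table[state, action])
    return q_table[state, action]


# Toy table with 2 states and 2 actions, all values starting at zero.
q = np.zeros((2, 2))
q_update(q, state=0, action=1, reward=-1, next_state=1, alpha=0.7, gamma=0.2)
print(q)
print(choose_action(q, state=0, epsilon=0.0))
```

After a single update with a reward of -1, the entry for state 0 and action 1 moves 70% of the way from 0 towards the target of -1, so a purely greedy choice in state 0 now prefers action 0.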


There are four designated locations in the grid world, indicated by R(ed), B(lue), G(reen), and Y(ellow). When the episode starts, the taxi starts off at a random square and the passenger is at a random location. The taxi drives to the passenger's location, picks up the passenger, drives to the passenger's destination (another one of the four specified locations), and then drops off the passenger. Once the passenger is dropped off, the episode ends. There are 500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger (including the case when the passenger is in the taxi), and 4 destination locations.

**Actions:** There are 6 discrete deterministic actions:

    0: move south
    1: move north
    2: move east
    3: move west
    4: pickup passenger
    5: dropoff passenger

**Rewards:** There is a reward of -1 for each action and an additional reward of +20 for delivering the passenger. There is a reward of -10 for executing the actions "pickup" and "dropoff" illegally.

**Rendering:**

    blue: passenger
    magenta: destination
    yellow: empty taxi
    green: full taxi
    other letters: locations

The state space is represented by:

    (taxi_row, taxi_col, passenger_location, destination)
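The four components above are packed into a single integer between 0 and 499. The sketch below shows one way to do this, assuming the mixed-radix layout (5 rows, 5 columns, 5 passenger locations, 4 destinations) and the location indices 0 = R, 1 = G, 2 = Y, 3 = B, 4 = in the taxi. The standalone `encode` and `decode` functions are illustrative helpers written for this example.

```python
def encode(taxi_row, taxi_col, passenger_location, destination):
    # Assumed layout: 5 rows x 5 columns x 5 passenger locations x 4 destinations.
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination


def decode(state):
    # Undo the encoding from the least significant component upwards.
    state, destination = divmod(state, 4)
    state, passenger_location = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_location, destination


print(decode(499))  # (4, 4, 4, 3)
print(decode(77))   # (0, 3, 4, 1)
```

Under these assumptions, state 499 is the taxi in the bottom-right corner carrying the passenger to B, and state 77 is the taxi in the top row carrying the passenger to G.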

Here, the index of the highest value in each row of the Q-table is the action that the taxi agent would take in that state.
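Reading the greedy action off a row is a single `np.argmax` call. The short sketch below uses two rows copied from the final Q-table values printed in this section (states 499 and 77). The `action_names` list and the `rows` dictionary are introduced only for this illustration.

```python
import numpy as np

action_names = ["south", "north", "east", "west", "pickup", "dropoff"]

# Converged Q-values for two states, copied from the printed output.
rows = {
    499: np.array([-0.4, -1.08, -0.4, 3.0, -9.4, -9.4]),
    77: np.array([-1.08, -0.4, 3.0, -1.08, -9.4, -9.4]),
}

# The greedy policy picks the column with the largest value in each row.
greedy = {state: action_names[int(np.argmax(row))] for state, row in rows.items()}
print(greedy)  # {499: 'west', 77: 'east'}
```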

In [3]:
# At state 499 the agent will definitely move west
state = 499
print(training_memory[0][state])
print(training_memory[20][state])
print(training_memory[50][state])
print(training_memory[200][state])

[-1.008     -1.0682761 -1.1004     2.72055   -9.2274    -9.1      ]
[-0.40000039 -1.07648283 -0.40000128  3.         -9.39958914 -9.39998055]
[-0.4  -1.08 -0.4   3.   -9.4  -9.4 ]
[-0.4  -1.08 -0.4   3.   -9.4  -9.4 ]
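The converged row for state 499 can be checked by hand from the reward scheme and `gamma = 0.2`. The sketch below assumes that state 499 places the taxi one square east of the destination with the passenger on board, and that a legal dropoff is worth exactly 20 because the episode ends there. The variable names are chosen for this illustration.

```python
gamma = 0.2            # discount factor used in the training cell
step_reward = -1       # cost of every action
illegal_reward = -10   # illegal pickup or dropoff
dropoff_value = 20     # assumed value of a legal dropoff (episode ends)

# West: one step onto the destination, where a dropoff is available.
q_west = step_reward + gamma * dropoff_value
# South or east: the taxi hits a wall and stays in state 499.
q_wall = step_reward + gamma * q_west
# North: lands in a state whose best action is itself worth q_wall.
q_north = step_reward + gamma * q_wall
# Pickup or dropoff here is illegal and leaves the state unchanged.
q_illegal = illegal_reward + gamma * q_west

print(q_west, q_wall, q_north, q_illegal)
```

These four values agree with the last two snapshots printed above (3, -0.4, -1.08, and -9.4), which suggests the table has converged for this state.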


In [4]:
# At state 77 the agent will definitely move east
state = 77
print(training_memory[0][state])
print(training_memory[20][state])
print(training_memory[50][state])
print(training_memory[200][state])

[-1.07999095 -1.008       3.         -1.08309178 -9.1        -9.18424273]
[-1.08 -0.4   3.   -1.08 -9.4  -9.4 ]
[-1.08 -0.4   3.   -1.08 -9.4  -9.4 ]
[-1.08 -0.4   3.   -1.08 -9.4  -9.4 ]


In [5]:
# Helper for rendering Markdown output, used below to show how the chosen move for a state evolved
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))

In [6]:
action_dict = {
    0: "move south",
    1: "move north",
    2: "move east",
    3: "move west",
    4: "pickup passenger",
    5: "dropoff passenger",
}

ENV_STATE = env.reset()
print(env.render(mode='ansi'))
# Q-values of this one state across all saved snapshots
state_memory = [snapshot[ENV_STATE] for snapshot in training_memory]
printmd("For state **{}**".format(ENV_STATE))
for step, q_values in enumerate(state_memory):
    # Snapshot `step` was saved after episode (step + 1) * 100; show every 200th snapshot
    if step % 200 == 0:
        choice = np.argmax(q_values)
        printmd("for episode in {}, q table action is {} and it will ... **{}**".format(step*100, choice, action_dict[choice]))
        print(q_values)
        print()

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+




For state **369**

for episode in 0, q table action is 0 and it will ... **move south**

[ -1.24999822  -1.24999945  -1.24999867  -1.24999849 -10.22355492
 -10.24977275]



for episode in 20000, q table action is 1 and it will ... **move north**

[ -1.25  -1.25  -1.25  -1.25 -10.25 -10.25]



for episode in 40000, q table action is 1 and it will ... **move north**

[ -1.25  -1.25  -1.25  -1.25 -10.25 -10.25]



### Running a trained taxi

This gives a clearer view of the transitions between states and the reward received at each step.
Notice that the trained agent avoids penalties and finishes each episode in few steps, so its total reward is consistently high.

In [7]:
import time
def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Episode: {frame['episode']}")
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        time.sleep(0.8)

total_epochs, total_penalties = 0, 0
episodes = 10 # Try 10 rounds
frames = []

for ep in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    
    done = False
    
    while not done:
        action = np.argmax(q_table[state]) # always exploit: exploration is only used during training
        state, reward, done, info = env.step(action) # Gym applies the action and returns the outcome

        if reward == -10:
            penalties += 1
        
        # Put each rendered frame into dict for animation, gym generated
        frames.append({
            'frame': env.render(mode='ansi'),
            'episode': ep, 
            'state': state,
            'action': action,
            'reward': reward
            }
        )
        epochs += 1

    total_penalties += penalties
    total_epochs += epochs

print_frames(frames)

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Episode: 9
Timestep: 123
State: 475
Action: 5
Reward: 20
Results after 10 episodes:
Average timesteps per episode: 12.3
Average penalties per episode: 0.0


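Here, we looked at how a Q-table is used in Q-learning. However, it is a primitive example because we are dealing with a finite number of states. For very large or infinite state spaces, we would have to rely on a model that approximates the Q-values instead of a table. This is called deep Q-learning when the model is a neural network, and it is not covered here.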
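To give a rough feel for what replacing the table with a model means, the sketch below swaps the table for a linear function of a feature vector. It is a deliberately simplified stand-in, not a neural network: `LinearQ` is a hypothetical class written for this illustration, and the feature vectors are made-up numbers.

```python
import numpy as np


class LinearQ:
    """Hypothetical Q-function: one weight vector per action, Q(s, a) = w_a . s."""

    def __init__(self, n_features, n_actions):
        self.weights = np.zeros((n_actions, n_features))

    def values(self, state):
        # `state` is a feature vector, so it can come from a continuous space.
        return self.weights @ state

    def update(self, state, action, reward, next_state, alpha, gamma):
        # Same target as the tabular rule, but the error adjusts the weights
        # of the chosen action instead of a single table cell.
        target = reward + gamma * np.max(self.values(next_state))
        error = target - self.values(state)[action]
        self.weights[action] += alpha * error * state
        return error


q = LinearQ(n_features=2, n_actions=3)
s = np.array([0.5, -1.0])
s_next = np.array([0.4, -0.9])
q.update(s, action=1, reward=-1.0, next_state=s_next, alpha=0.1, gamma=0.2)
print(q.values(s))
```

Because the weights are shared across states, one update also changes the estimates for nearby states that were never visited, which is what makes this approach usable when the states cannot be enumerated.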