# Q-Learning

Q-learning is a reinforcement learning algorithm that learns which action is best to take in a given state, with the aim of maximizing the total reward collected over time. \
It is a model-free, off-policy algorithm: the agent needs no model of the environment's dynamics, and it learns the value of the greedy (best-known) policy even while it explores with random actions. The agent learns by trial and error, using the rewards it receives to refine its estimate of how valuable each action is. Q-learning is a type of **Temporal Difference** learning, in which the agent updates its prediction of the expected future return after every step, using the estimate for the next state rather than waiting until the end of a sequence of states. \
Let us begin implementing the algorithm in the pre-made OpenAI Gym environment, FrozenLake-v1. We start by importing the necessary packages and rendering the environment.


In [5]:
import gym
import random
import numpy as np

In [6]:
environment = gym.make("FrozenLake-v1", is_slippery=False, render_mode='human')
environment.reset()
environment.render()

## **Q-Table** 
A Q-table gives the value of each action given the state that the agent is in. Each cell corresponds to one state-action pair value. In the example below, rows are states (A to D) and columns are actions (L, R, U, D).

| State |  L  |  R  |  U  |  D  |
|-----|-----|-----|-----|-----|
|  A  | 0.1 | 0.3 | 0   | 0.4 |
|  B  | 0.2 | 0.7 | 0.5 | 0   |
|  C  | 0.6 | 0   | 0.7 | 0.9 |
|  D  | 0   | 0.2 | 0.1 | 0   |

The Q-table is typically initialised as a zero matrix before the agent begins exploring. Its dimensions are (states x actions); in our case it is 16 x 4, since FrozenLake-v1 uses a 4 x 4 grid (16 states) with 4 possible moves.
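
To make the lookup concrete, the sketch below rebuilds the illustrative table above as a NumPy array and reads off the highest-valued action for each state. The names `states`, `actions`, `example_q`, and `best` are made up for this illustration, and the numbers are the ones from the example table, not learned Q-values.

```python
import numpy as np

# Hypothetical labels, matching the illustrative table above
states = ["A", "B", "C", "D"]
actions = ["L", "R", "U", "D"]

example_q = np.array([
    [0.1, 0.3, 0.0, 0.4],  # state A
    [0.2, 0.7, 0.5, 0.0],  # state B
    [0.6, 0.0, 0.7, 0.9],  # state C
    [0.0, 0.2, 0.1, 0.0],  # state D
])

# For each row (state), find the column (action) with the largest value
best = {s: actions[int(np.argmax(example_q[i]))] for i, s in enumerate(states)}
print(best)  # {'A': 'D', 'B': 'R', 'C': 'D', 'D': 'R'}
```

Reading a row and taking its `argmax` is the same operation the agent performs on the real 16 x 4 table once it holds learned values.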

In [7]:
# Our table has the following dimensions:
# (rows x columns) = (states x actions) = (16 x 4)
qtable = np.zeros((16, 4))

In [9]:
# Alternatively, the gym library can directly give us
# the number of states and actions using
# "environment.observation_space.n" and "environment.action_space.n"
nb_states = environment.observation_space.n  # = 16
nb_actions = environment.action_space.n      # = 4
qtable = np.zeros((nb_states, nb_actions))
print(qtable)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


In [5]:
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 300
plt.rcParams.update({'font.size': 17})

In [10]:
# We re-initialize the Q-table
qtable = np.zeros((environment.observation_space.n, environment.action_space.n))

# Hyperparameters
episodes = 200       # Total number of episodes
alpha = 0.5            # Learning rate
gamma = 0.9            # Discount factor

# List of outcomes to plot
outcomes = []

print('Q-table before training:')
print(qtable)

Q-table before training:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


We now begin the training portion of the algorithm. This is where the agent explores the environment and builds up estimates of the Q-values of the various state-action pairs.

In [None]:
# Training
for _ in range(episodes):
    # reset() returns (observation, info); the observation is the state index
    state, info = environment.reset()
    done = False
    # By default, we consider our outcome to be a failure
    outcomes.append("Failure")

    # Until the agent falls in a hole, reaches the goal, or runs out of time, keep training it
    while not done:
        # Choose the action with the highest value in the current state
        if np.max(qtable[state]) > 0:
            action = np.argmax(qtable[state])

        # If there's no best action (only zeros), take a random one
        else:
            action = environment.action_space.sample()

        # Implement this action and move the agent in the desired direction
        new_state, reward, terminated, truncated, info = environment.step(action)
        done = terminated or truncated

        # Update Q(s,a)
        qtable[state, action] = qtable[state, action] + alpha * (
            reward + gamma * np.max(qtable[new_state]) - qtable[state, action]
        )

        # Move to the new state
        state = new_state

        # If we have a reward, it means that our outcome is a success
        if reward:
            outcomes[-1] = "Success"


## **The Bellman Equation**
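The Bellman equation is used to determine the value of taking a particular action in a particular state, and so to deduce how good it is to be in that state. $Q(s,a)$ is the current estimate of the value of taking action $a$ in state $s$, $R(s,a)$ is the reward received, $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $\max_{a'} Q(s',a')$ is the maximum expected future reward from the next state $s'$.

$Q_{\text{new}}(s,a) = Q(s,a) + \alpha \left[ R(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$

This update is what fills in the Q-table as the agent explores. It is greedy with respect to the next state: only the highest Q-value available in $s'$ is used, whichever action the agent goes on to take. Each update depends only on the current state, the action taken, the reward, and the next state; earlier experience influences the choice of action only through the values already stored in the Q-table.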
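
As a worked example of the update rule, the sketch below applies it twice with the hyperparameters used in this notebook ($\alpha = 0.5$, $\gamma = 0.9$). The function name `q_update` and the two scenarios are illustrative assumptions, not part of the training code.

```python
def q_update(q_sa, reward, max_q_next, alpha=0.5, gamma=0.9):
    """Return the updated Q(s, a) after one temporal-difference step."""
    td_target = reward + gamma * max_q_next
    return q_sa + alpha * (td_target - q_sa)

# Stepping into the goal: reward 1, and the terminal state has no future value
q_goal_move = q_update(q_sa=0.0, reward=1.0, max_q_next=0.0)
print(q_goal_move)  # 0.5

# One state earlier: no immediate reward, but the next state is now worth 0.5
q_previous_move = q_update(q_sa=0.0, reward=0.0, max_q_next=q_goal_move)
print(q_previous_move)  # 0.225
```

The second call shows how value propagates backwards from the goal: a move that earns no reward itself still gains value because it leads to a state with a positive Q-value.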


In [None]:
print('Q-table after training:')
print(qtable)

# Plot outcomes
plt.figure(figsize=(5, 5))
plt.xlabel("Run number")
plt.ylabel("Outcome")
ax = plt.gca()
ax.set_facecolor('#efeeea')
plt.bar(range(len(outcomes)), outcomes, color="#0A047A", width=1.0)
plt.show()

nb_success = 0

The Q-table after training should contain non-zero elements, each corresponding to the learned value of a state-action pair. The plot shows the outcome (success or failure) of each training episode.

The algorithm can now be evaluated to see how it performs after training. In this case, the question is whether the agent can reach the goal without falling into any terminal states (holes). During evaluation, the agent selects the action with the highest Q-value in each state and no longer updates the Q-table. If a state has no positive Q-value, it simply chooses a random action and goes from there.

In [None]:
#Evaluation
for _ in range(100):
    # reset() returns (observation, info); the observation is the state index
    state, info = environment.reset()
    done = False

    # Until the agent gets stuck or reaches the goal, keep stepping (no learning here)
    while not done:
        # Choose the action with the highest value in the current state
        if np.max(qtable[state]) > 0:
            action = np.argmax(qtable[state])

        # If there's no best action (only zeros), take a random one
        else:
            action = environment.action_space.sample()

        # Implement this action and move the agent in the desired direction
        new_state, reward, terminated, truncated, info = environment.step(action)
        done = terminated or truncated

        # Update our current state
        state = new_state

        # When we get a reward, it means we solved the game
        nb_success += reward

We now check the success rate over the evaluation runs to determine how successful the method is. Note that the agent does not follow an explicitly stored policy; instead, it acts according to the learned values (**value-based methods**).
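
Because the agent acts on values, a policy can still be read off the Q-table by taking the best action in each state. The following is a minimal sketch, assuming FrozenLake's action encoding (0 = left, 1 = down, 2 = right, 3 = up). `ACTION_NAMES`, `greedy_policy`, and `toy_qtable` are hypothetical names, and the values in `toy_qtable` are made up for illustration.

```python
import numpy as np

# FrozenLake action encoding: 0 = left, 1 = down, 2 = right, 3 = up
ACTION_NAMES = ["left", "down", "right", "up"]

def greedy_policy(qtable):
    """Return the highest-valued action per state, or None if the row has no positive value."""
    policy = []
    for row in qtable:
        if np.max(row) > 0:
            policy.append(ACTION_NAMES[int(np.argmax(row))])
        else:
            policy.append(None)
    return policy

# Tiny made-up Q-table with three states, for illustration only
toy_qtable = np.array([
    [0.0, 0.59, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.0],
])
print(greedy_policy(toy_qtable))  # ['down', None, 'right']
```

States mapped to `None` are the ones where the agent would fall back to a random action.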

In [None]:
# Let's check our success rate!
# nb_success was counted over the 100 evaluation runs, not the 200 training episodes
print(f"Success rate = {nb_success / 100 * 100}%")

from IPython.display import clear_output
import time 

# reset() returns (observation, info); the observation is the state index
state, info = environment.reset()
done = False
sequence = []

In [None]:
while not done:
    # Choose the action with the highest value in the current state
    if np.max(qtable[state]) > 0:
        action = np.argmax(qtable[state])

    # If there's no best action (only zeros), take a random one
    else:
        action = environment.action_space.sample()

    # Add the action to the sequence (as a plain int for readable printing)
    sequence.append(int(action))

    # Implement this action and move the agent in the desired direction
    new_state, reward, terminated, truncated, info = environment.step(action)
    done = terminated or truncated

    # Update our current state
    state = new_state

    # Update the render
    clear_output(wait=True)
    environment.render()
    time.sleep(1)

print(f"Sequence = {sequence}")