# Q-Learning in Reinforcement Learning 

Q-Learning is a Reinforcement learning policy that will find the next best action, given a current state. It chooses this action at random and aims to maximize the reward.

Q-learning is a model-free, off-policy reinforcement learning that will find the best course of action, given the current state of the agent. Depending on where the agent is in the environment, it will decide the next action to be taken. 

Important Terms in Q-Learning

1. States: The State, S, represents the current position of an agent in an environment. 
2. Action: The Action, A, is the step taken by the agent when it is in a particular state.
3. Rewards: For every action, the agent will get a positive or negative reward.
4. Episodes: When an agent ends up in a terminating state and can’t take a new action.
5. Q-Values: Used to determine how good an Action, A, taken at a particular state, S, is. Q (A, S).
6. Temporal Difference: A formula used to find the Q-Value by using the value of current state and action and previous state and action.

In Q-Learning, the agent has an initial Q-Table with arbitary fixed values (could be zero). 

Initally, the agent will be in start position in the environment denoted as ($s_{t=0}$)


At each time($t$), the agent 
- Performs an Action ($a_t$)
- Observe the reward ($r_t$)
- Enter to a new state ($s_{t+1}$)

Then, based on operation the Q-Table updates.


For updating the Q-table the algorithm uses the **Bellman Equation** which states

$$Q(s_t, a_t) = R(s_t, a_t) + \gamma * Max(Q(s_{t+1}, A)) $$

where, 
- $s_t$ is current state
- $a_t$ is current action 
- $R(s_t, a_t)$ is reward for current action and state
- $\gamma$ is discount factor
- $s_{t+1}$ is the next state
- $A$ is all actions
- $Q(s_{t+1}, A)$ is the q-value for all action in next state

In [1]:
import gym
import numpy as np
import random
from IPython.display import clear_output
from environment import Environment
from generic_agent import GenericAgent

For understanding the Q-Learning technique, we are using Frozen Lake environment from OpenAI Gym. <br>
To know more about [FrozenLake-v1](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)


![Image](https://www.gymlibrary.dev/_images/frozen_lake.gif "Frozen Lake Without Slippery")

In [2]:
# Defining the Environment Name
env_name = 'FrozenLake-v1'

In [4]:
env = Environment(env_name, render_mode="rgb_array", is_slippery=False, map_name="4x4")

Environment Name:  FrozenLake
Action Space Type:  DISCRETE
Observation Space Type:  DISCRETE
Observation Space:  Discrete(16)


## Q-Agent 

I am defining the Q-Agent, that will find the next best action based on current observations. 

### Algorithm 

1. Define the Q-Table with action and observation space. Initialize it to zeros.
2. Select the best action for the observation from Q-Table
    $$a = Max(Q(O_t, A)) $$
    where, 
    - $O_t$ is the current observation
    - $A$ is the all action
    
3. Update the Q-Table based on selection
    $$Q(s_t, a_t) = R(s_t, a_t) + \gamma * Max(Q(s_{t+1}, A)) $$


In [6]:
class QAgent(GenericAgent):
    
    def __init__(self, env: gym.wrappers.time_limit.TimeLimit, learning_rate = 0.57, discount_rate = 0.97, epsilon=1):
        
        super(QAgent, self).__init__(env)
        self.env = env
        self.initalize_qtable()
        self.learning_rate = learning_rate
        self.discount_rate = discount_rate
        self.epsilon = epsilon
        self.explore, self.exploit = 0, 0
    
    def initalize_qtable(self):
        # Creating the Q Table with almost zero values
        self.q_table = 1e-4 * np.random.random(size=[self.observation_size, self.action_size])
#         self.q_table = np.zeros([self.observation_size, self.action_size])
    
    def get_action(self, observation):
        
        if random.random() < self.epsilon:
            self.explore += 1
            return self.get_random_action(observation)
        state_table = self.q_table[observation]
        action = np.argmax(state_table) 
        self.exploit += 1
        return action

    def update_qtable(self, experience, action):
        
        observation, reward, terminate, truncate, info = experience
        q_next = self.q_table[observation]
        q_next = np.zeros([self.action_size]) if terminate else q_next
        q_target = reward + self.discount_rate * np.max(q_next)
        q_update = q_target - self.q_table[observation, action]
        self.q_table[observation, action] += self.learning_rate * q_update
        
        
        if terminate:
            self.epsilon *= 0.99

In [7]:
agent = QAgent(env)

# Training for 100 episodes
prev_obs = env.observation_space.sample()
total_reward = 0

In [8]:
from time import sleep

In [11]:
agent.epsilon = .5

In [12]:

for episode in range(1000): 
    env.reset()
    agent.explore = 0
    agent.exploit = 0
    terminate = False
    sleep(0.1)
    while not terminate:
        env.render()
        action = agent.get_action(prev_obs)
        
        (observe, reward, terminate, truncate, info) = env.step(action)
        total_reward += reward
        agent.update_qtable((observe, reward, terminate, truncate, info), action)
        print("Episode: {}, Reward: {}".format(episode, total_reward))
        print("Previous Observation: {}, Observation: {}".format(prev_obs, observe))
        print("Action: {} Epsilon: {}".format(action, agent.epsilon))
        print("Exploration Step: {} Exploitation Steps: {}\n\n".format(agent.explore, agent.exploit))
        if terminate:
            env.reset()
        
        prev_obs = observe
        print(agent.q_table)
        clear_output(wait=True)

Episode: 999, Reward: 2.0
Previous Observation: 1, Observation: 5
Action: 1 Epsilon: 2.158562370532892e-05
Exploration Step: 0 Exploitation Steps: 3


[[9.45390334e-05 3.22905005e-05 9.74629211e-05 9.45390334e-05]
 [6.11335998e-05 6.33467581e-05 6.14463553e-05 6.14463553e-05]
 [8.20463740e-05 9.05931915e-05 8.78753957e-05 8.78753947e-05]
 [2.40144135e-05 4.25632189e-05 4.21758328e-05 2.93294315e-05]
 [6.39103060e-05 6.24836989e-05 6.44161843e-05 6.33987855e-05]
 [1.95174236e-18 0.00000000e+00 6.86415424e-16 1.55275973e-05]
 [7.91983556e-05 7.68224049e-05 7.39393840e-05 6.75280735e-05]
 [8.51983656e-05 4.42706545e-07 1.91936674e-06 7.96973046e-05]
 [9.23424475e-05 9.23451698e-05 7.15561013e-05 9.52012069e-05]
 [5.77591793e-05 4.64412436e-05 5.59538801e-05 5.30965776e-05]
 [5.21426284e-06 7.56921208e-05 6.45047619e-05 8.00714931e-05]
 [8.78075703e-05 1.72337399e-05 6.34275676e-06 6.72513690e-05]
 [9.97570651e-05 8.66987652e-07 3.02476199e-05 8.84291606e-05]
 [9.32850607e-05 8.88722556e-0

In [None]:
env.close()

In [16]:
experience

(5, 0.0, True, False, {'prob': 0.3333333333333333})

In [17]:
experience[2]

True