# Q-learning

Q-learning is one of the best-known reinforcement learning (RL) algorithms. It learns the optimal policy of a Markov decision process by estimating the optimal Q-value of every state-action pair, that is, the expected discounted return of taking that action in that state and acting optimally afterwards. The estimates are stored in a Q-table with one entry per state-action pair, and each entry is updated iteratively with a rule derived from the Bellman optimality equation until the table approximately converges to the optimal Q-function.

An episode is a sequence of states, actions, and rewards that ends in a terminal state. Every entry of the Q-table is initialized to zero, because the agent knows nothing about the environment before training starts. The Q-values are then updated after every step of every episode, so the table gradually approaches the optimal Q-values over many episodes. As the estimates improve, the agent makes better decisions by choosing, in each state, the action with the highest Q-value for that state.
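
The update described above can be written out explicitly. What follows is the standard tabular Q-learning rule; the symbols $\alpha$ (learning rate) and $\gamma$ (discount rate) are notation chosen here for illustration and correspond to `learning_rate` and `discount_rate` in the agent code:

$$
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') \right]
$$

Here $s$ is the current state, $a$ the action taken, $r$ the reward received, and $s'$ the next state. As a worked example with assumed values $\alpha = 0.1$, $\gamma = 0.99$, $Q(s, a) = 0$, $r = 1$, and $\max_{a'} Q(s', a') = 0.5$, the new estimate is $0.9 \times 0 + 0.1 \times (1 + 0.99 \times 0.5) = 0.1495$.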


## Import Libraries

In [1]:
import gym
import random
import numpy as np

## Reinforcement Learning Agent

In [2]:
class Agent:
    def __init__(self, observation_space, action_space, learning_rate=0.1, discount_rate=0.99,
                 exploration_rate=1.0, max_exploration_rate=1.0, min_exploration_rate=0.01,
                 exploration_decay_rate=0.001):
        
        self.observation_space = observation_space
        self.action_space = action_space
        
        self.learning_rate = learning_rate
        self.discount_rate = discount_rate

        self.exploration_rate = exploration_rate
        self.max_exploration_rate = max_exploration_rate
        self.min_exploration_rate = min_exploration_rate
        self.exploration_decay_rate = exploration_decay_rate
        
        self.episode = 0
        self.Q = np.zeros((self.observation_space, self.action_space))
        
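    def choice_action(self, state):
        # Epsilon-greedy: explore with probability exploration_rate, otherwise exploit
        if np.random.random() > self.exploration_rate:
            # Exploit: take the action with the highest Q-value in this state
            action = np.argmax(self.Q[state, :])
        else:
            # Explore: take a uniformly random action
            action = np.random.choice(self.action_space)

        return action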
        
    def decrease_exploration_rate(self):
        # Called once per episode: decay exponentially from the max rate toward the min rate
        self.episode += 1
        self.exploration_rate = self.min_exploration_rate + \
            (self.max_exploration_rate - self.min_exploration_rate) * np.exp(-self.exploration_decay_rate * self.episode)

    def learn(self, state, action, reward, next_state):
        # Blend the old estimate with the target: reward + discounted best value of next_state
        self.Q[state, action] = self.Q[state, action] * (1 - self.learning_rate) + \
            self.learning_rate * (reward + self.discount_rate * np.max(self.Q[next_state, :]))
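
To make the `learn` step concrete, here is a minimal standalone sketch of a single update. The helper `q_update` and the toy 3-state, 2-action table are hypothetical and exist only for illustration; the arithmetic mirrors the update in `Agent.learn`.

```python
import numpy as np

def q_update(Q, state, action, reward, next_state, learning_rate=0.1, discount_rate=0.99):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Target: immediate reward plus the discounted best value available in next_state
    target = reward + discount_rate * np.max(Q[next_state, :])
    # Move the old estimate a fraction learning_rate of the way toward the target
    Q[state, action] = (1 - learning_rate) * Q[state, action] + learning_rate * target
    return Q[state, action]

# Hypothetical toy table: 3 states, 2 actions (values chosen for illustration only)
Q = np.zeros((3, 2))
Q[1] = [0.5, 0.2]

new_value = q_update(Q, state=0, action=1, reward=1.0, next_state=1)
print(new_value)  # computed as 0.9 * 0 + 0.1 * (1.0 + 0.99 * 0.5)
```

Only the entry for the visited state-action pair changes; every other entry of the table is left untouched, which is why many episodes are needed before the whole table becomes informative.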

## Main Program

In [3]:
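episodes = 10000
# Written for the classic Gym API. Newer Gym/Gymnasium releases register this
# environment as "FrozenLake-v1" and return extra values from reset() and step().
env = gym.make("FrozenLake-v0")
agent = Agent(env.observation_space.n, env.action_space.n)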

scores = []

for _ in range(episodes):
    state = env.reset()
    done = False
    score = 0

    while not done:
        # Pick an action, step the environment, and update the Q-table
        action = agent.choice_action(state)
        next_state, reward, done, info = env.step(action)
        agent.learn(state, action, reward, next_state)

        state = next_state
        score += reward

    # Record the episode return and decay exploration once per episode
    scores.append(score)
    agent.decrease_exploration_rate()

# Calculate and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(scores), episodes // 1000)
count = 1000
print("********Average reward per thousand episodes********\n")
for r in rewards_per_thousand_episodes:
    avg_reward = sum(r / 1000)
    print(count, ": ", str(avg_reward))
    count += 1000

********Average reward per thousand episodes********

1000 :  0.04200000000000003
2000 :  0.18900000000000014
3000 :  0.3950000000000003
4000 :  0.5530000000000004
5000 :  0.6190000000000004
6000 :  0.6750000000000005
7000 :  0.6720000000000005
8000 :  0.6770000000000005
9000 :  0.7100000000000005
10000 :  0.6960000000000005
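
When reading these averages, it may help to keep the exploration schedule in mind, since the early rows include many random actions. The sketch below recomputes the schedule from `decrease_exploration_rate` using the agent's default parameters; the function name `exploration_rate_at` is hypothetical and introduced only for this illustration.

```python
import numpy as np

def exploration_rate_at(episode, max_rate=1.0, min_rate=0.01, decay_rate=0.001):
    """Exponential decay from max_rate toward min_rate after a given number of episodes."""
    return min_rate + (max_rate - min_rate) * np.exp(-decay_rate * episode)

# Print the exploration rate at a few points during the 10,000-episode run
for episode in (0, 1000, 3000, 5000, 10000):
    print(episode, round(float(exploration_rate_at(episode)), 4))
```

With these defaults the rate starts at 1.0, is still roughly 0.37 after 1,000 episodes, and falls below 0.02 by episode 5,000, so the later rows mostly reflect the greedy policy rather than random exploration.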
