
# Reinforcement Learning

**Frozen Lake** 

I'm following a video series from DeepLizard. The code is based on a tutorial by user simoninithomas. The purpose of this study is to get hands-on, practical experience with RL. This example shows how to move from the mathematical concepts of MDPs (Markov Decision Processes) and Q-learning to a working agent whose results we can inspect.

- DeepLizard, RL Playlist - https://www.youtube.com/playlist?list=PLZbbT5o_s2xoWNVdDudn51XM8lOuZ_Njv
- simoninithomas, Deep RL Frozen Lake Example - https://github.com/simoninithomas/Deep_reinforcement_learning_Course/tree/master/Q%20learning/FrozenLake

In [1]:
import numpy as np
import gym
import random

## Set Up OpenAI Gym

In [2]:
env = gym.make("FrozenLake-v0")

In [35]:
action_size = env.action_space.n
state_size = env.observation_space.n

action_size, state_size

(4, 16)

Now we set up the Q-table, which stores one value for every state-action pair: 16 rows (states) by 4 columns (actions), all starting at zero.

In [36]:
q_table = np.zeros((state_size, action_size))
print(q_table)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
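
To make the row indices of the Q-table less abstract, here is a small sketch that maps a state index back to its tile on the lake. It assumes the default 4x4 map (the same `SFFF / FHFH / FFFH / HFFG` layout that `env.render()` prints later in this notebook) and row-major state numbering. The names `LAKE_MAP` and `state_to_cell` are my own helpers, not part of Gym.

```python
# Default 4x4 layout: S = start, F = frozen, H = hole, G = goal.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = 4

def state_to_cell(state):
    """Return (row, col, tile) for a state index, assuming row-major numbering."""
    row, col = divmod(state, N_COLS)
    return row, col, LAKE_MAP[row][col]

# State 0 is the start, state 5 is a hole, state 15 is the goal.
for s in (0, 5, 15):
    print(s, state_to_cell(s))
```

So row 5 of `q_table` holds the four action values for a hole tile, and row 15 holds those for the goal.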


## Hyperparameters

In [37]:
num_episodes = 15000              # Total episodes
learning_rate = 0.7               # Learning Rate
max_steps_per_episode = 99        # Max steps per episode
discount_rate = 0.65              # Discounting rate

# Exploration parameters
exploration_rate = 1.0            # Exploration rate
max_exploration_rate = 1.0        # Exploration probability at start
min_exploration_rate = 0.01       # Minimum exploration probability
exploration_decay_rate = 0.005    # Exponential decay
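
To get a feel for how quickly exploration fades with these settings, the sketch below evaluates the same exponential decay formula that the training loop applies after each episode. The function name `exploration_rate_at` is a hypothetical helper, and the defaults simply mirror the values above.

```python
import math

def exploration_rate_at(episode, min_rate=0.01, max_rate=1.0, decay_rate=0.005):
    """Exploration rate after `episode` episodes of exponential decay."""
    return min_rate + (max_rate - min_rate) * math.exp(-decay_rate * episode)

# Print the schedule at a few checkpoints.
for ep in (0, 100, 500, 1000, 5000):
    print(ep, round(exploration_rate_at(ep), 4))
```

The rate starts at 1.0, is already below 0.1 by episode 500, and sits essentially at the 0.01 floor for most of the 15000 episodes, so the agent spends the bulk of training acting greedily.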

## Q-Learning

1. Initialize the Q-values (one per state-action pair) arbitrarily.
2. For each episode, repeat at every step:
   1. Choose an action (a) in the current state (s), based on the current Q-value estimates.
   2. Take the action and observe the outcome state and reward (s' and r).
   3. Update Q(s, a) using the Bellman optimality equation.
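
The "choose an action" step is usually done epsilon-greedily, and that is what the training loop further down does. Here is a minimal standalone sketch using only the standard library. `choose_action` and its `rng` parameter are names I made up for illustration.

```python
import random

def choose_action(q_row, exploration_rate, rng=random):
    """Epsilon-greedy: explore with probability exploration_rate, else exploit."""
    if rng.random() < exploration_rate:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_row))
    # Exploit: pick the action with the highest current Q-value.
    return max(range(len(q_row)), key=lambda a: q_row[a])

q_row = [0.1, 0.7, 0.2, 0.0]
print(choose_action(q_row, exploration_rate=0.0))  # never explores, so prints 1
```

With `exploration_rate=1.0` the Q-values are ignored entirely, which is the situation at the very start of training.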

**Bellman Optimality Equation**

$$q_*(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} q_*(s', a') \right]$$

where

- $R_{t+1}$ is the reward
- $\gamma$ is the discount rate
- $s'$ is the next state
- $a'$ is the next action
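
In practice the expectation is not computed directly. The training loop below nudges the old estimate toward the sampled target (reward plus discounted best next value), weighted by the learning rate. The sketch below works through two such updates by hand with this notebook's learning rate (0.7) and discount rate (0.65). `q_update` is a hypothetical helper, not something Gym provides.

```python
def q_update(q_sa, reward, max_q_next, learning_rate=0.7, discount_rate=0.65):
    """One tabular Q-learning update for a single (state, action) entry."""
    target = reward + discount_rate * max_q_next
    return (1 - learning_rate) * q_sa + learning_rate * target

# Stepping onto the goal (reward 1) from an entry that is still 0.
q_goal_step = q_update(q_sa=0.0, reward=1.0, max_q_next=0.0)
print(round(q_goal_step, 4))  # 0.7

# One tile earlier: no reward yet, but the next state is now worth 0.7.
q_earlier = q_update(q_sa=0.0, reward=0.0, max_q_next=q_goal_step)
print(round(q_earlier, 4))  # 0.3185
```

This is how value propagates backwards from the goal: each step further away is scaled down again by the discount rate, which is one reason the entries for states near the start stay small.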

In [24]:
rewards_all_episodes = []

# Q-Learning algorithm
for episode in range(num_episodes):
  state = env.reset() # resets state back to default each episode

  done = False
  rewards_current_episode = 0

  for step in range(max_steps_per_episode):

    # Exploration-exploitation trade-off
    exploration_rate_threshold = random.uniform(0, 1)
    if exploration_rate_threshold > exploration_rate:
      action = np.argmax(q_table[state,:])
    else:
      action = env.action_space.sample()

    new_state, reward, done, info = env.step(action)

    # Update Q-table for Q(s,a)
    q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
      learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))

    state = new_state
    rewards_current_episode += reward

    if done:
      break

  # Exploration rate decay
  exploration_rate = min_exploration_rate + \
    (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate*episode)

  rewards_all_episodes.append(rewards_current_episode)

# After all episodes are finished,
# Calculate and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 1000)  # integer number of sections
count = 1000
print("******Average reward per thousand episodes******\n")
for r in rewards_per_thousand_episodes:
  print(count, ":, ", str(sum(r/1000)))
  count += 1000

# Print updated Q-table
print("\n\n*********Q-table***********\n")
print(q_table)


******Average reward per thousand episodes******

1000 :,  0.06700000000000005
2000 :,  0.25400000000000017
3000 :,  0.22100000000000017
4000 :,  0.21800000000000017
5000 :,  0.19500000000000015
6000 :,  0.21000000000000016
7000 :,  0.17800000000000013
8000 :,  0.2810000000000002
9000 :,  0.23900000000000018
10000 :,  0.20800000000000016
11000 :,  0.2480000000000002
12000 :,  0.23200000000000018
13000 :,  0.23200000000000018
14000 :,  0.21500000000000016
15000 :,  0.18100000000000013


*********Q-table***********

[[1.70364197e-07 7.94478596e-05 1.55358703e-05 1.75294310e-05]
 [5.04838558e-06 4.08294329e-06 6.25912311e-08 1.08134303e-04]
 [6.46750077e-04 1.99965905e-06 1.03089789e-06 1.61780972e-06]
 [1.42444400e-07 2.04799899e-07 7.57253141e-08 2.26747543e-07]
 [2.74669837e-05 2.99196350e-05 2.98042624e-05 2.91717735e-09]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.67702020e-02 1.35000909e-07 1.00062214e-06 3.90226047e-07]
 [0.00000000e+00 0.00000000e+00 0.00000

## Play Frozen Lake

Now that our q_table is trained, we can use it to play Frozen Lake.

In [34]:
env.reset()

for episode in range(5):
  state = env.reset()
  step = 0
  done = False
  print("==================")
  print("EPISODE ", episode)

  for step in range(max_steps_per_episode):

    # Take the greedy action (the index with the maximum expected future reward in this state)
    action = np.argmax(q_table[state,:])

    new_state, reward, done, info = env.step(action)

    if done:
      # Only render the final state, to see whether the agent reached the goal or fell into a hole
      env.render()

      # We print the number of steps it took
      print("Number of steps", step)
      break

    state = new_state

# Close the environment once, after all episodes have finished
env.close()

EPISODE  0
  (Down)
SFFF
FHFH
FFFH
HFFG
Number of steps 2
EPISODE  1
  (Up)
SFFF
FHFH
FFFH
HFFG
Number of steps 4
EPISODE  2
  (Left)
SFFF
FHFH
FFFH
HFFG
Number of steps 24
EPISODE  3
  (Down)
SFFF
FHFH
FFFH
HFFG
Number of steps 4
EPISODE  4
  (Down)
SFFF
FHFH
FFFH
HFFG
Number of steps 3
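
Watching single episodes only shows where the agent ended up. To see what the table has learned overall, it can help to print the greedy action for every tile at once. The sketch below does that for the default 4x4 map. `policy_grid`, `ACTION_SYMBOLS`, and the demo table are hypothetical names and data made up for illustration; the action order (0 = Left, 1 = Down, 2 = Right, 3 = Up) is FrozenLake's. It indexes rows and entries the same way for nested lists and NumPy arrays, so passing the trained `q_table` should work too.

```python
ACTION_SYMBOLS = ["<", "v", ">", "^"]  # 0=Left, 1=Down, 2=Right, 3=Up
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]

def policy_grid(q_table, lake_map=LAKE_MAP):
    """Return one string per map row showing the greedy action on each tile."""
    n_cols = len(lake_map[0])
    rows = []
    for r, line in enumerate(lake_map):
        cells = []
        for c, tile in enumerate(line):
            if tile in "HG":
                # Terminal tiles: no action is ever taken from them.
                cells.append(tile)
            else:
                q_row = q_table[r * n_cols + c]
                best = max(range(len(q_row)), key=lambda a: q_row[a])
                cells.append(ACTION_SYMBOLS[best])
        rows.append("".join(cells))
    return rows

# Made-up Q-table in which every state prefers Down (action 1).
demo_q = [[0.0, 1.0, 0.0, 0.0] for _ in range(16)]
print("\n".join(policy_grid(demo_q)))
```

Keep in mind that a tile whose Q-values are all equal (for example, still all zero) will show the first action, Left, which is simply how ties are broken here.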
