# Frozen Lake Game

We'll use Gymnasium, the maintained fork of OpenAI Gym (imported below as `gym`), to provide an environment for a simple game called Frozen Lake. We'll then train an agent to play the game using Q-learning, and watch a playback of how the agent does after being trained.

So, let's jump into the details for Frozen Lake!

I've grabbed the description of the game directly from Gym's website. Let's read through it together.

SFFF

FHFH

FFFH

HFFG

This grid is our environment where S is the agent's starting point, and it's safe. F represents the frozen surface and is also safe. H represents a hole, and if our agent steps in a hole in the middle of a frozen lake, well, that's not good. Finally, G represents the goal, which is the space on the grid where the prized frisbee is located.

The agent can navigate left, right, up, and down, and the episode ends when the agent reaches the goal or falls in a hole. It receives a reward of one if it reaches the goal, and zero otherwise.

State | Description | Reward
------|-------------|-------
S | Agent's starting point - safe | 0
F | Frozen surface - safe | 0
H | Hole - game over | 0
G | Goal - game over | 1
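
The environment reports the agent's position as a single integer rather than a letter or a coordinate. As far as I understand FrozenLake, tiles are numbered row by row starting from the top-left, so here's a small sketch of that mapping. The helper names `state_to_cell` and `tile_at` are made up for this illustration and aren't part of Gym.

```python
# The 4x4 map from the description above, rows listed top to bottom.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]


def state_to_cell(state, n_cols=4):
    """Convert a flat state index to (row, col), assuming row-by-row numbering."""
    return divmod(state, n_cols)


def tile_at(state, lake_map=LAKE_MAP):
    """Return the map letter (S, F, H or G) for a flat state index."""
    row, col = state_to_cell(state, len(lake_map[0]))
    return lake_map[row][col]


n_states = len(LAKE_MAP) * len(LAKE_MAP[0])
holes = [s for s in range(n_states) if tile_at(s) == "H"]
goal = [s for s in range(n_states) if tile_at(s) == "G"]

print("holes:", holes)  # holes: [5, 7, 11, 12]
print("goal:", goal)    # goal: [15]
```

Under that assumption, the start tile is state 0 and the frisbee sits at state 15.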

In [5]:
pip install gymnasium

Collecting gymnasium
  Downloading gymnasium-1.0.0-py3-none-any.whl.metadata (9.5 kB)
Collecting farama-notifications>=0.0.1 (from gymnasium)
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl.metadata (558 bytes)
Downloading gymnasium-1.0.0-py3-none-any.whl (958 kB)
   958.1/958.1 kB 11.5 MB/s eta 0:00:00
Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Installing collected packages: farama-notifications, gymnasium
Successfully installed farama-notifications-0.0.4 gymnasium-1.0.0


In [6]:
import numpy as np
import gymnasium as gym
import random
import time
from IPython.display import clear_output

## Creating the environment
Next, to create our environment, we call `gym.make()` and pass the name of the environment we want as a string. We'll be using `FrozenLake-v1`, with `render_mode='ansi'` so the grid can be rendered as text.

In [7]:
env = gym.make('FrozenLake-v1', render_mode='ansi')

With this env object, we're able to query for information about the environment, sample states and actions, retrieve rewards, and have our agent navigate the frozen lake. That's all made available to us conveniently with Gym.

## Creating the Q-table

We're now going to construct our Q-table, and initialize all the Q-values to zero for each state-action pair.

Remember, the number of rows in the table is equivalent to the size of the state space in the environment, and the number of columns is equivalent to the size of the action space. We can get this information using `env.observation_space.n` and `env.action_space.n`, as shown below. We can then use this information to build the Q-table and fill it with zeros.

In [8]:
action_space_size = env.action_space.n
state_space_size = env.observation_space.n

q_table = np.zeros((state_space_size, action_space_size))

Here's our Q-table!

In [9]:
print(q_table)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


## Initializing Q-learning parameters

Now, we're going to create and initialize all the parameters needed to implement the Q-learning algorithm.

First, with `num_episodes`, we define the total number of episodes we want the agent to play during training. Then, with `max_steps_per_episode`, we define the maximum number of steps the agent is allowed to take within a single episode. If the agent hasn't reached the frisbee or fallen through a hole by the one-hundredth step, the episode terminates with the agent receiving zero points. The remaining values set the learning rate, the discount rate applied to future rewards, and the starting, maximum, minimum, and decay values of the exploration rate.

In [10]:
num_episodes = 10000
max_steps_per_episode = 100

learning_rate = 0.1
discount_rate = 0.99

exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001
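
To get a feel for what the exploration parameters do, here's a small sketch that evaluates the exponential decay formula used later in the training loop at a few episode numbers. The function name `exploration_rate_at` is invented for this example. The constants are copied from the cell above.

```python
import math

# Values copied from the parameter cell above.
max_exploration_rate = 1.0
min_exploration_rate = 0.01
exploration_decay_rate = 0.001


def exploration_rate_at(episode):
    """Exploration rate after `episode` episodes of exponential decay."""
    span = max_exploration_rate - min_exploration_rate
    return min_exploration_rate + span * math.exp(-exploration_decay_rate * episode)


for episode in (0, 1000, 5000, 9999):
    print(episode, round(exploration_rate_at(episode), 4))
```

The rate starts at 1 (every action is random) and drifts toward `min_exploration_rate`, so by the end of training the agent is almost always exploiting what it has learned.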

## Coding the Q-learning algorithm training loop



First, we create a list to hold the total reward from each episode, so we can see how our game score changes over time.

In [11]:
rewards_all_episodes = []

In [20]:
# Q-learning algorithm
for episode in range(num_episodes):
    # Initialize new episode params; reset() returns (observation, info)
    state, info = env.reset()

    done = False
    rewards_current_episode = 0

    for step in range(max_steps_per_episode):
        # Exploration-exploitation trade-off
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:
            # Exploit: take the action with the highest Q-value for this state
            action = np.argmax(q_table[state, :])
        else:
            # Explore: take a random action
            action = env.action_space.sample()

        # Take new action; step() reports termination and truncation separately
        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        # Update Q-table for Q(s,a)
        q_table[state, action] = (
            q_table[state, action] * (1 - learning_rate)
            + learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))
        )

        # Set new state
        state = new_state

        # Add new reward
        rewards_current_episode += reward

        if done:
            break

    # Exploration rate decay
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)

    # Add current episode reward to total rewards list
    rewards_all_episodes.append(rewards_current_episode)
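
If the loop feels dense, it may help to look at its two core steps in isolation. Below is a dependency-free sketch of the epsilon-greedy choice and of a single Q-value update, using plain Python lists instead of NumPy. The function names `choose_action` and `q_update` are hypothetical helpers written for this example, not part of Gym.

```python
import random


def choose_action(q_row, exploration_rate, rng=random):
    """Epsilon-greedy: exploit when the random draw exceeds the exploration rate."""
    if rng.uniform(0, 1) > exploration_rate:
        # Exploit: index of the largest Q-value (first one on ties, like np.argmax)
        return max(range(len(q_row)), key=lambda a: q_row[a])
    # Explore: pick any action uniformly at random
    return rng.randrange(len(q_row))


def q_update(old_value, reward, best_next_value, learning_rate=0.1, discount_rate=0.99):
    """Blend the old Q-value with the newly observed target."""
    target = reward + discount_rate * best_next_value
    return (1 - learning_rate) * old_value + learning_rate * target


# Stepping onto the goal from a state-action pair that has never been updated:
print(q_update(old_value=0.0, reward=1, best_next_value=0.0))

# An ordinary move with no reward, where the next state already looks promising:
print(q_update(old_value=0.5, reward=0, best_next_value=0.8))
```

With a learning rate of 0.1, each update moves the stored value only a tenth of the way toward the new target, which is why it takes many episodes for the reward at the goal to propagate back to the start tile.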




## After all episodes complete

After all episodes are finished, we calculate the average reward per thousand episodes from the list of per-episode rewards and print it out, so we can see how the rewards changed over time.

In [21]:
# Calculate and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 1000)
count = 1000

print("********Average reward per thousand episodes********\n")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r/1000)))
    count += 1000

********Average reward per thousand episodes********

1000 :  0.20500000000000015
2000 :  0.6970000000000005
3000 :  0.6850000000000005
4000 :  0.6820000000000005
5000 :  0.6860000000000005
6000 :  0.6680000000000005
7000 :  0.6720000000000005
8000 :  0.6840000000000005
9000 :  0.6780000000000005
10000 :  0.6920000000000005


## Interpreting the training results

Let's take a second to understand how we can interpret these results. Our agent played 10,000 episodes. At each time step within an episode, the agent received a reward of 1 if it reached the frisbee and a reward of 0 otherwise. If the agent did reach the frisbee, the episode finished at that time step.

That means the total reward for each episode is either 1 or 0. So, for the first thousand episodes, we can interpret the score of 0.205 as meaning that about **20%** of the time, the agent received a reward of 1 and won the episode. By the last thousand episodes of the 10,000, the agent was winning about **69%** of the time.

By analyzing the grid of the game, we can see it's a lot more likely that an untrained agent would fall in a hole or perhaps reach the max time steps than reach the frisbee, so reaching the frisbee roughly **70%** of the time isn't too shabby, especially since the agent had no explicit instructions to reach the frisbee. It learned that this is the correct thing to do.
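
Because every episode scores either 0 or 1, the average over a block of episodes is simply the fraction of wins in that block. Here's a tiny sketch of that arithmetic on a made-up reward list (the numbers in `toy_rewards` are invented for illustration and are not training results). `average_per_block` is a hypothetical helper name.

```python
def average_per_block(rewards, block_size):
    """Average reward for each consecutive block of `block_size` episodes."""
    if len(rewards) % block_size != 0:
        raise ValueError("number of episodes must be a multiple of block_size")
    return [
        sum(rewards[start:start + block_size]) / block_size
        for start in range(0, len(rewards), block_size)
    ]


# Invented example: 12 episodes, grouped into blocks of 4.
toy_rewards = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0]
print(average_per_block(toy_rewards, 4))  # [0.25, 0.75, 0.75]
```

In the toy data, the first block has one win out of four episodes, so its average of 0.25 reads as a 25% win rate.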

## Updated Q-table

In [22]:
print("\n\n********Q-table********\n")
print(q_table)



********Q-table********

[[0.54765181 0.50081325 0.4985004  0.51667671]
 [0.17923753 0.1252865  0.18409783 0.45999579]
 [0.4088647  0.16823971 0.16274984 0.19609939]
 [0.11039722 0.         0.         0.        ]
 [0.56265466 0.2877087  0.28737066 0.34555136]
 [0.         0.         0.         0.        ]
 [0.15074044 0.07191357 0.43576494 0.03462862]
 [0.         0.         0.         0.        ]
 [0.27141721 0.44754787 0.42663396 0.59721602]
 [0.50169291 0.64967969 0.4925115  0.53327344]
 [0.67070253 0.34581851 0.30710393 0.40116643]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.45914988 0.65523179 0.70612883 0.50028482]
 [0.7012334  0.83958548 0.72021309 0.7409227 ]
 [0.         0.         0.         0.        ]]


## Watch the agent play the game

In [35]:
# Watch our agent play Frozen Lake by playing the best action
# from each state according to the Q-table

for episode in range(3):
    # Initialize new episode params; reset() returns (observation, info)
    state, info = env.reset()
    done = False

    clear_output(wait=True)
    print("**** EPISODE ", episode + 1, " ****\n\n\n")
    time.sleep(1)

    for step in range(max_steps_per_episode):
        clear_output(wait=True)
        # Show current state of environment on screen
        print(env.render())
        time.sleep(0.3)

        # Choose action with highest Q-value for current state
        action = np.argmax(q_table[state, :])
        # Take new action
        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        if done:
            clear_output(wait=True)
            print(env.render())
            if reward == 1:
                # Agent reached the goal and won episode
                print("*** You reached the goal! ***")
            elif terminated:
                # Agent stepped in a hole and lost episode
                print("*** You fell through a hole! ***")
            else:
                # Episode was cut off by the environment's step limit
                print("*** You ran out of steps! ***")
            time.sleep(2)
            break

        # Set new state
        state = new_state

env.close()

  (Down)
SFFF
FHFH
FFFH
HFFG

*** You reached the goal! ***
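
A single playback doesn't tell us much about how reliable the agent is. The sketch below replays the greedy policy many times on a hand-written imitation of the lake. Everything about the dynamics here is my assumption about how `FrozenLake-v1` behaves with its default slippery ice: the intended move happens a third of the time, and otherwise the agent slides in one of the two perpendicular directions. It is not Gymnasium's code. `GREEDY_ACTION` lists the highest-valued action for each safe tile in the Q-table printed earlier, with 0 = Left, 1 = Down, 2 = Right, 3 = Up.

```python
import random

LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]

# Best action per non-terminal state, read off the Q-table printed earlier.
GREEDY_ACTION = {0: 0, 1: 3, 2: 0, 3: 0, 4: 0, 6: 2, 8: 3, 9: 1, 10: 0, 13: 2, 14: 1}


def move(state, action, size=4):
    """Apply one move on the grid; moving into a wall leaves the agent in place."""
    row, col = divmod(state, size)
    if action == 0:      # Left
        col = max(col - 1, 0)
    elif action == 1:    # Down
        row = min(row + 1, size - 1)
    elif action == 2:    # Right
        col = min(col + 1, size - 1)
    else:                # Up
        row = max(row - 1, 0)
    return row * size + col


def play_episode(rng, max_steps=100):
    """Return 1 if the greedy policy reaches the goal within max_steps, else 0."""
    state = 0
    for _ in range(max_steps):
        intended = GREEDY_ACTION[state]
        # Assumed slip model: intended direction or one of its two neighbours
        actual = rng.choice([(intended - 1) % 4, intended, (intended + 1) % 4])
        state = move(state, actual)
        tile = LAKE_MAP[state // 4][state % 4]
        if tile == "G":
            return 1
        if tile == "H":
            return 0
    return 0


def estimate_win_rate(n_episodes=2000, seed=0):
    """Fraction of simulated episodes that end on the goal tile."""
    rng = random.Random(seed)
    return sum(play_episode(rng) for _ in range(n_episodes)) / n_episodes


print("estimated win rate:", estimate_win_rate())
```

If the slip model above is right, it also explains why even a well-trained agent can't win every time: from some tiles, any action carries a chance of sliding into a hole.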
