# Getting Familiar with FrozenLake Environment

 Gymnasium is a toolkit for developing and comparing reinforcement learning algorithms. It provides a wide variety of environments (including games, tasks, and simulations) that can be used to test and benchmark different algorithms.<br>
 
 In this notebook, we will explore the FrozenLake environment using the Gymnasium library. We will implement a simple Q-learning algorithm to train an agent to navigate the FrozenLake environment.




## The Environment

In Gymnasium, an environment is a class that simulates a problem to be solved. The environment provides feedback to the agent in the form of observations and rewards, based on the actions taken by the agent.

### Main Functions of an Environment

1. **`reset()`**: Resets the environment to an initial state and returns the initial observation together with an info dictionary.

2. **`step(action)`**: Takes an action and returns the next observation, the reward, the `terminated` and `truncated` flags, and an info dictionary.

3. **`render()`**: Renders the current state of the environment (optional).

4. **`close()`**: Closes the environment and releases any resources.

## What is FrozenLake?

FrozenLake is a classic reinforcement learning environment in which the agent must cross a frozen lake to reach a goal without falling into holes. The environment is a grid where each cell is either a safe frozen surface or a dangerous hole. The agent starts in the top-left corner of the grid and must reach the goal in the bottom-right corner. Each successful episode yields a reward of 1. This is an example of a sparse reward function, which is common in reinforcement learning: the agent is rewarded only for reaching the goal and is not penalized for falling into holes or for taking more steps.

## States and Actions

- **States**: The state represents the current situation of the agent in the environment. In FrozenLake, each state corresponds to a cell in the grid.

- **Actions**: Actions are the possible moves the agent can take. In FrozenLake, the agent can move left, down, right, or up.

### Observation Space and Action Space

- **Observation Space**: The set of all possible states. For FrozenLake, it is a discrete space with one integer per grid cell (16 states for the default 4x4 map), numbered row by row starting from the top-left corner.

- **Action Space**: The set of all possible actions the agent can take. For FrozenLake, this includes four actions: left, down, right, and up. The action space is discrete and has a mapping from action index to action name: `0: left`, `1: down`, `2: right`, `3: up`
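
The integer encoding of states and actions can be sketched in plain Python. The helpers `state_to_cell` and `cell_to_state` are hypothetical names introduced for illustration; they assume the row-by-row numbering that FrozenLake uses and a 4-column grid by default.

```python
# Action indices used by FrozenLake (from the Gymnasium documentation).
ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}

def state_to_cell(state, ncols=4):
    """Convert a state index into (row, col), assuming row-major numbering."""
    return divmod(state, ncols)

def cell_to_state(row, col, ncols=4):
    """Inverse of state_to_cell."""
    return row * ncols + col

print(state_to_cell(0))     # (0, 0): the start cell, top left
print(state_to_cell(15))    # (3, 3): the goal cell on the 4x4 map
print(cell_to_state(1, 1))  # 5: a hole on the default 4x4 map
print(ACTION_NAMES[2])      # right
```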

## Implementing Q-learning in FrozenLake

Now, let's implement a simple Q-learning algorithm to train an agent to navigate the FrozenLake environment.<br>

Let's install some of the dependencies required to run the code.

In [None]:
! pip install gymnasium # Install the gymnasium package
! pip install numpy # Install the numpy package to handle arrays

Let's import the necessary libraries and define the environment.

In [6]:
import random # For generating random numbers
import gymnasium as gym # For creating the environment
import numpy as np # For working with arrays


env = gym.make("FrozenLake-v1", is_slippery=True)

The `gym.make` function is used to create the environment. It accepts arguments such as:

- `is_slippery`: Whether the lake is slippery. It is set to `True` here, which is also the default. With the default settings, the agent moves in the intended direction with probability 1/3 and otherwise slips to one of the two perpendicular directions, which makes the environment stochastic.

- `render_mode`: The rendering mode. It is left at its default of `None` here, which disables rendering. Other modes include `human`, `ansi`, and `rgb_array`.

- `desc`: A custom map of the environment, given as a list of strings or a 2D array. It is left at its default of `None` here, so the map is chosen by `map_name`.

- `map_name`: The name of a predefined map, which determines the size of the grid. The predefined maps are `4x4` (the default) and `8x8`.

In [7]:
alpha = 0.1 # Learning rate
gamma = 0.99 # Discount factor
epsilon = 0.1 # Exploration rate
num_episodes = 10000 # Number of episodes to run

The variables above define the hyperparameters of the Q-learning algorithm.<br>

- `alpha`: The learning rate controls how far each update moves a Q-value toward its new estimate. It is set to `0.1` here.

- `gamma`: The discount factor controls how much future rewards count relative to immediate ones. It is set to `0.99` here.

- `epsilon`: The exploration rate is the probability that the agent takes a random action instead of the greedy one. It is set to `0.1` here.

- `num_episodes`: The number of training episodes to run. It is set to `10000` here.
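
To make the roles of `alpha` and `gamma` concrete, here is a sketch of a single Q-value update with made-up numbers. The helper `q_update` is hypothetical and not part of the notebook's code; it mirrors the temporal difference target and error used later in `run_env`.

```python
alpha = 0.1   # learning rate
gamma = 0.99  # discount factor

def q_update(q_sa, reward, max_q_next):
    """Return the updated Q(s, a) after one Q-learning step."""
    td_target = reward + gamma * max_q_next  # reward plus discounted future value
    td_error = td_target - q_sa              # how far the current estimate is off
    return q_sa + alpha * td_error           # move a fraction alpha toward the target

# Stepping into the goal: reward 1, and the terminal state's Q-values are 0.
print(q_update(q_sa=0.0, reward=1.0, max_q_next=0.0))  # 0.1

# One step earlier: no reward, but the next state now has value 0.1.
print(q_update(q_sa=0.0, reward=0.0, max_q_next=0.1))  # approximately 0.0099
```

The second call illustrates how the goal's reward propagates backwards through the table, scaled by `gamma` at each step.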

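Let's define an epsilon-greedy action-selection function.<br>

- `greedy_epsilon`: This function takes in the Q-table and the current state. With probability `epsilon` it returns a random action (exploration); otherwise it returns the action with the highest Q-value for the given state (exploitation), falling back to a random action when no action stands out yet, for example when the Q-values are still all zero.<br>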


In [8]:
def greedy_epsilon(qtable, state):
  if random.uniform(0, 1) < epsilon: # Choose a random action with probability epsilon (Exploring)
    action = env.action_space.sample() 
  else: # Choose the action with the highest q-value (Exploiting)
    if np.all(qtable[state, :] == qtable[state, 0]): # If all q-values are equal (e.g. still all 0), choose a random action
      action = env.action_space.sample()
    else: # Choose the action with the highest q-value
      action = np.argmax(qtable[state, :])
  return action

Now, let's implement the Q-learning algorithm by writing a function `run_env`.<br>

- `run_env`: This function runs the Q-learning algorithm for `num_episodes` episodes, prints the mean training reward and the maximum episode length, and returns the Q-table.<br>
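
The update that `run_env` applies at each step can be written as the standard Q-learning rule, where $s'$ is the next state and $r$ is the reward:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

Here $r + \gamma \max_{a'} Q(s', a')$ is the temporal difference target, and the whole bracketed term is the temporal difference error.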


In [9]:
def run_env():
  qtable = np.zeros((env.observation_space.n, env.action_space.n)) # Initialize the qtable with zeros
  rewards = [] # List to store rewards per episode
  steps = [] # List to store number of steps per episode
  for _ in range(num_episodes): 
    state, _ = env.reset() # Reset the environment
    done = False
    episode_rewards = 0
    episode_steps = 0

    while not done:
      action = greedy_epsilon(qtable, state) # Choose an action using epsilon-greedy
      new_state, reward, terminated, truncated, info = env.step(action) # Take the action (a) and observe the outcome state(s') and reward (r)
      next_best_action = np.argmax(qtable[new_state, :]) # Find the action with the highest q-value for the next state
      td_target = reward + gamma * qtable[new_state, next_best_action] # Calculate the temporal difference target
      td_error = td_target - qtable[state, action] # Calculate the temporal difference error
      qtable[state, action] += alpha * td_error # Update the qtable using the Bellman equation
      state = new_state # Set the current state as the next state
      episode_rewards += reward
      episode_steps += 1
      done = terminated or truncated # Check if the episode is done

    rewards.append(episode_rewards)
    steps.append(episode_steps)

  print("Episodes complete")
  print(f"Mean reward: {np.mean(rewards)}")
  print(f"Max steps: {np.max(steps)}")
  return qtable

Now, let's evaluate the performance of the Q-learning algorithm by writing a function `evaluation`.<br>

- `evaluation`: This function takes in the Q-table and the number of trials, runs the greedy policy for that many episodes, and prints the mean reward, the maximum number of steps taken, and the success rate.<br>

In [10]:
def evaluation(qtable, num_trials): 
  num_successes = 0 # Number of successful episodes
  rewards = [] # List to store rewards per episode
  steps = [] # List to store number of steps per episode
  for _ in range(num_trials):
    state, _ = env.reset()
    done = False
    episode_reward = 0
    episode_steps = 0
    while not done:
      action = np.argmax(qtable[state, :]) # Choose the action with the highest q-value
      new_state, reward, terminated, truncated, info = env.step(action)
      state = new_state
      episode_reward += reward
      episode_steps += 1
      done = terminated or truncated # Check if the episode is done

    rewards.append(episode_reward)
    steps.append(episode_steps)
    if episode_reward == 1: # Check if the episode was successful
      num_successes += 1

  print("Evaluation complete")
  print(f"Mean reward: {np.mean(rewards)}")
  print(f"Max steps: {np.max(steps)}")
  print(f"Success rate: {num_successes / num_trials}")

With this implementation in place, let's train the agent with Q-learning and evaluate the learned policy.<br>

In [None]:
qtable = run_env()
evaluation(qtable, 10)
# Also make sure to close the environment when we are done.
env.close()

NOTE: The success rate measured during evaluation will fluctuate from run to run. The environment is stochastic because `is_slippery` is set to `True`, and 10 evaluation trials is a small sample.
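
As a rough illustration of how much of that fluctuation comes from the number of trials, the sketch below treats each evaluation episode as an independent success or failure and computes the standard error of the measured success rate. The value `p = 0.7` is a hypothetical success rate chosen only for illustration, and `success_rate_std_error` is a helper introduced here.

```python
import math

def success_rate_std_error(p, num_trials):
    """Standard error of a success rate estimated from independent episodes."""
    return math.sqrt(p * (1 - p) / num_trials)

p = 0.7  # hypothetical true success rate, chosen only for illustration
for num_trials in (10, 100, 1000):
    print(num_trials, round(success_rate_std_error(p, num_trials), 3))
```

Under these assumptions, raising the second argument of `evaluation` from 10 to a few hundred should give a noticeably steadier estimate.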