<a href="https://colab.research.google.com/github/vijaygwu/classideas/blob/main/QLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**FrozenLake-v1 environment**

This is a grid world where the agent must navigate from the start to the goal without falling into holes.

**FrozenLake environment details:**

The agent moves through a 4x4 grid world.

States: One for each of the 16 grid cells.
Actions: {Left, Right, Up, Down}.
Rewards: 1 for reaching the goal and 0 for every other transition, including falling into a hole.
The episode terminates when the agent reaches the goal or falls into a hole.
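
As a rough illustration of the layout described above, the sketch below hard-codes the default 4x4 map documented for Gym's FrozenLake (S = start, F = frozen, H = hole, G = goal) and converts a state index into grid coordinates. The helper names `state_to_cell` and `tile_at` are made up for this example and are not part of Gym. The action numbering shown (0 = Left, 1 = Down, 2 = Right, 3 = Up) follows Gym's FrozenLake documentation and differs from the order in which the actions are listed above.

```python
# Default 4x4 FrozenLake map as documented by Gym:
# S = start, F = frozen (safe), H = hole, G = goal.
DEFAULT_MAP = [
    "SFFF",
    "FHFH",
    "FFFH",
    "HFFG",
]
N_COLS = len(DEFAULT_MAP[0])

# Gym's integer encoding of the four actions.
ACTIONS = {0: "Left", 1: "Down", 2: "Right", 3: "Up"}


def state_to_cell(state):
    """Convert a state index (0..15) into a (row, col) pair (hypothetical helper)."""
    return divmod(state, N_COLS)


def tile_at(state):
    """Return the map letter for a state index (hypothetical helper)."""
    row, col = state_to_cell(state)
    return DEFAULT_MAP[row][col]


# Collect the terminal states by scanning all 16 cells.
holes = [s for s in range(16) if tile_at(s) == "H"]
goal = [s for s in range(16) if tile_at(s) == "G"]
print("holes:", holes)
print("goal:", goal)
```

States are numbered row by row from the top-left corner, so state 0 is the start cell and state 15 is the goal.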

This script sets up the FrozenLake-v1 environment, initializes the Q-table, and then iterates through episodes, updating the Q-values as it learns. After training, the agent's greedy policy (derived from the Q-table) is tested over three episodes.
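
The update applied at each training step has the form `Q(s, a) <- Q(s, a) + alpha * (r + gamma * max Q(s', .) - Q(s, a))`. The sketch below applies it twice by hand, using the learning rate (0.1) and discount factor (0.99) from the script. The function name `q_update` and the example Q-values are illustrative assumptions, not taken from the notebook.

```python
def q_update(q_sa, reward, next_q_values, learning_rate=0.1, discount_factor=0.99):
    """One tabular Q-learning update for a single (state, action) pair."""
    # TD target: immediate reward plus the discounted best value of the next state.
    td_target = reward + discount_factor * max(next_q_values)
    # Move the old estimate a fraction of the way toward the target.
    return q_sa + learning_rate * (td_target - q_sa)


# Step into the goal: reward 1, and the terminal state's Q-values are all zero.
q_goal_step = q_update(0.0, 1.0, [0.0, 0.0, 0.0, 0.0])

# One step earlier: reward 0, but the next state now has a non-zero best value.
q_prev_step = q_update(0.0, 0.0, [0.0, q_goal_step, 0.0, 0.0])

print(q_goal_step, q_prev_step)
```

The second call shows how the goal reward propagates backward: a state one step from the goal gains value only after the goal-adjacent entry has been updated.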

In [2]:
import numpy as np
import gym  # written against the pre-0.26 Gym API: reset() returns the state, step() returns 4 values

# Initialize the "FrozenLake" environment
env = gym.make('FrozenLake-v1')

# Q-learning parameters
learning_rate = 0.1
discount_factor = 0.99
exploration_rate = 1.0
max_exploration_rate = 1.0
min_exploration_rate = 0.01
exploration_decay_rate = 0.001
num_episodes = 10000

# Initialize the Q-table: state_size x action_size
q_table = np.zeros((env.observation_space.n, env.action_space.n))

for episode in range(num_episodes):
    state = env.reset()
    done = False

    while not done:
        # Choose action using epsilon-greedy policy
        if np.random.uniform(0, 1) < exploration_rate:
            action = env.action_space.sample()  # Explore: choose random action
        else:
            action = np.argmax(q_table[state, :])  # Exploit: choose best action from Q-table

        new_state, reward, done, _ = env.step(action)

        # Q-learning update: move Q(s, a) toward the TD target r + gamma * max_a' Q(s', a')
        q_table[state, action] = q_table[state, action] + learning_rate * \
                                (reward + discount_factor * np.max(q_table[new_state, :]) - q_table[state, action])

        state = new_state

    # Decay exploration rate to reduce exploration over time
    exploration_rate = min_exploration_rate + \
                       (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)

# Test the trained agent
total_test_rewards = []
for episode in range(3):
    state = env.reset()
    done = False
    total_rewards = 0
    print("EPISODE ", episode+1, "\n\n\n\n")
    env.render()

    while not done:
        action = np.argmax(q_table[state, :])
        new_state, reward, done, _ = env.step(action)
        total_rewards += reward
        env.render()
        state = new_state

    total_test_rewards.append(total_rewards)

print("Average test reward: ", sum(total_test_rewards) / len(total_test_rewards))


EPISODE  1 

If you want to render in human mode, initialize the environment in this way: gym.make('EnvName', render_mode='human') and don't call the render method.
See here for more information: https://www.gymlibrary.ml/content/api/

EPISODE  2 

EPISODE  3 

Average test reward:  1.0
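
For reference, the training loop decays the exploration rate exponentially from 1.0 toward 0.01. The sketch below tabulates that schedule at a few episode indices using the same constants as the script. The function name `epsilon_at` and the chosen episode indices are assumptions made for illustration.

```python
import math

# Same constants as the training script.
MAX_EPS = 1.0
MIN_EPS = 0.01
DECAY = 0.001


def epsilon_at(episode):
    """Exploration rate computed after the given episode index (hypothetical helper)."""
    return MIN_EPS + (MAX_EPS - MIN_EPS) * math.exp(-DECAY * episode)


# Print the schedule at the start, early, middle, and end of training.
for episode in (0, 1000, 5000, 9999):
    print(episode, round(epsilon_at(episode), 4))
```

With a decay rate of 0.001, the rate falls to roughly a third of its starting value after 1,000 episodes and sits close to the 0.01 floor by the end of the 10,000 training episodes.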
