### Run in Colab
<a href="https://colab.research.google.com/github/racousin/data_science_practice/blob/master/website/public/modules/data-science-practice/module9/exercise/module9_exercise2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
!pip install swig==4.2.1
!pip install gymnasium==1.2.0

In [2]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt

# module9_exercise2: ML-Arena <a href="https://ml-arena.com/viewcompetition/5" target="_blank">FrozenLake Competition</a>

### Objective
Get at least one agent running on the ML-Arena <a href="https://ml-arena.com/viewcompetition/5" target="_blank">FrozenLake Competition</a> with a mean reward above 0.35 (i.e. 35%).


You should submit an agent file named `agent.py` with a class `Agent` that includes at least the following attributes:

In [3]:
TRAINED_Q_TABLE = None

class Agent:
    def __init__(self, env, q_table=None):
        self.env = env
        if q_table is not None:
            self.q_table = np.array(q_table, dtype=np.float64)
        else:
            if TRAINED_Q_TABLE is None:
                raise RuntimeError(
                    "Q-table not trained yet. Execute the training cell before instantiating Agent."
                )
            self.q_table = TRAINED_Q_TABLE.copy()
        # Greedy policy: the best action for each state.
        self.policy = np.argmax(self.q_table, axis=1)

    def choose_action(self, observation, reward=0.0, terminated=False, truncated=False, info=None):
        # Fall back to a random action when the observation is missing or out of range.
        if observation is None:
            return self.env.action_space.sample()
        state = int(observation)
        if state >= self.q_table.shape[0]:
            return self.env.action_space.sample()
        q_values = self.q_table[state]
        # A state whose Q-values are all zero carries no preference, so act randomly.
        if np.allclose(q_values, 0.0):
            return self.env.action_space.sample()
        return int(self.policy[state])


### Description

The game starts with the player at location [0,0] of the frozen lake grid world, with the goal located at the far extent of the world, [7,7].

Holes in the ice are distributed in set locations.

The player makes moves until they reach the goal or fall in a hole.

Each run will consist of 10 attempts to cross the ice. The reward will be the total amount accumulated during those trips. For example, if your agent reaches the goal 3 times out of 10, its reward will be 3.
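
The scoring rule above can be sketched in a few lines. This is only an illustration: the helper names `run_reward` and `mean_reward` and the example outcomes are hypothetical, and the platform's actual scoring code is not shown in this notebook.

```python
def run_reward(episode_rewards):
    """Total reward for one run: the sum of its per-attempt rewards (0 or 1 each)."""
    return sum(episode_rewards)


def mean_reward(episode_rewards):
    """Average reward per attempt, the quantity compared against the 0.35 target."""
    return sum(episode_rewards) / len(episode_rewards)


# Hypothetical run: the goal is reached on 3 of 10 attempts.
example_run = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
print(run_reward(example_run))   # 3
print(mean_reward(example_run))  # 0.3
```

Read this way, a mean reward of 0.35 corresponds to reaching the goal on 3.5 attempts out of 10 on average, so the hypothetical run above would fall just short of the objective.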

The environment is based on:

In [4]:
env = gym.make('FrozenLake-v1', map_name="8x8")

In [5]:
training_env = gym.make('FrozenLake-v1', map_name="8x8")
num_states = training_env.observation_space.n
num_actions = training_env.action_space.n
transition_model = training_env.unwrapped.P

# Value iteration hyperparameters
max_iterations = 10_000
threshold = 1e-9
gamma = 0.99

value_function = np.zeros(num_states, dtype=np.float64)

for iteration in range(max_iterations):
    delta = 0.0
    for state in range(num_states):
        current_value = value_function[state]
        action_values = np.zeros(num_actions, dtype=np.float64)
        for action in range(num_actions):
            for prob, next_state, reward, terminated in transition_model[state][action]:
                continuation = 0.0 if terminated else value_function[next_state]
                action_values[action] += prob * (reward + gamma * continuation)
        value_function[state] = np.max(action_values)
        delta = max(delta, abs(current_value - value_function[state]))
    if delta < threshold:
        print(f"Value iteration converged after {iteration + 1} iterations with delta={delta:.2e}.")
        break
else:
    print("Warning: value iteration reached the iteration limit before convergence.")

q_table = np.zeros((num_states, num_actions), dtype=np.float64)
for state in range(num_states):
    for action in range(num_actions):
        for prob, next_state, reward, terminated in transition_model[state][action]:
            continuation = 0.0 if terminated else value_function[next_state]
            q_table[state, action] += prob * (reward + gamma * continuation)

training_env.close()


Value iteration converged after 394 iterations with delta=9.62e-10.
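
For reference, the update applied in the training cell above can be written out as follows. The symbol $d(s, a, s')$ is shorthand introduced here (it is not a name from the notebook) for the `terminated` flag of a transition, equal to 1 when the transition ends the episode and 0 otherwise; $P$, $r$ and $\gamma$ correspond to `prob`, `reward` and `gamma` in the code.

$$
V(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\,\bigl(1 - d(s, a, s')\bigr)\,V(s') \,\bigr]
$$

The sweep repeats until the largest change in $V$ over all states falls below `threshold`. The Q-table is then read off from the converged values with the same expression, without the maximum:

$$
Q(s, a) = \sum_{s'} P(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\,\bigl(1 - d(s, a, s')\bigr)\,V(s') \,\bigr]
$$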


In [6]:
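def evaluate_policy(q_values, episodes=5_000):
    """Play `episodes` greedy episodes (ties broken at random) and return (mean reward, success rate)."""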
    eval_env = gym.make('FrozenLake-v1', map_name="8x8")
    rng = np.random.default_rng(2025)
    total_reward = 0.0
    successes = 0

    for _ in range(episodes):
        state, _ = eval_env.reset(seed=int(rng.integers(0, 1_000_000)))
        done = False

        while not done:
            action_values = q_values[state]
            if np.allclose(action_values, action_values[0]):
                action = eval_env.action_space.sample()
            else:
                best_value = np.max(action_values)
                best_actions = np.flatnonzero(np.isclose(action_values, best_value))
                action = int(best_actions[rng.integers(0, len(best_actions))]) if best_actions.size else eval_env.action_space.sample()

            next_state, reward, terminated, truncated, _ = eval_env.step(action)
            done = terminated or truncated
            state = next_state

            if done:
                total_reward += reward
                if reward > 0:
                    successes += 1

    eval_env.close()
    mean_reward = total_reward / episodes
    success_rate = successes / episodes
    return mean_reward, success_rate

mean_reward, success_rate = evaluate_policy(q_table)
print(f"Mean reward over evaluation episodes: {mean_reward:.3f}")
print(f"Success rate: {success_rate:.2%}")

if mean_reward >= 0.35:
    print("Objective satisfied: mean reward is above 0.35.")
else:
    print("Warning: mean reward is below the 0.35 target, consider tuning hyperparameters.")

# Publish the table for the Agent class (no `global` statement is needed at module level).
TRAINED_Q_TABLE = q_table.copy()


Mean reward over evaluation episodes: 0.637
Success rate: 63.70%
Objective satisfied: mean reward is above 0.35.
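
To judge how far the measured success rate sits from the 0.35 target, a rough uncertainty estimate can help. The sketch below uses a normal-approximation interval, which is an assumption made here rather than something the notebook computes. The helper name `success_rate_interval` is hypothetical, and the count of 3185 successes is derived from the reported 63.70% over 5,000 episodes.

```python
import math


def success_rate_interval(successes, episodes, z=1.96):
    """Return (rate, low, high) using a normal approximation to the binomial."""
    rate = successes / episodes
    standard_error = math.sqrt(rate * (1.0 - rate) / episodes)
    return rate, rate - z * standard_error, rate + z * standard_error


# 63.70% of 5,000 evaluation episodes is 3185 successes.
rate, low, high = success_rate_interval(successes=3185, episodes=5_000)
print(f"rate={rate:.3f}, approx. 95% interval=({low:.3f}, {high:.3f})")
```

Under these assumptions the lower bound stays well above 0.35. A single competition run of 10 attempts is far noisier than a 5,000-episode estimate, though.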


### Before submitting
Test that your agent has the right attributes:

In [8]:
env = gym.make('FrozenLake-v1', map_name="8x8")
agent = Agent(env)

observation, _ = env.reset()
reward, terminated, truncated, info = None, False, False, None
rewards = []
while not (terminated or truncated):
    action = agent.choose_action(observation, reward=reward, terminated=terminated, truncated=truncated, info=info)
    observation, reward, terminated, truncated, info = env.step(action)
    rewards.append(reward)
print(f'Cumulative Reward: {sum(rewards)}')

Cumulative Reward: 1.0
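
The `Agent` class above reads `TRAINED_Q_TABLE`, which exists only inside this notebook session, so a standalone `agent.py` needs the learned policy embedded in the file. One possible way to do that is sketched below. It assumes the platform imports `Agent` from `agent.py` and calls `choose_action` with the signature used above, which this notebook does not confirm. The names `AGENT_TEMPLATE`, `POLICY` and `render_agent_source` are hypothetical.

```python
# Source text for a self-contained agent.py; {policy} is filled in at export time.
AGENT_TEMPLATE = '''\
import random

# Greedy action for each state, embedded when the file was generated.
POLICY = {policy}


class Agent:
    def __init__(self, env=None):
        self.env = env

    def choose_action(self, observation, reward=0.0, terminated=False, truncated=False, info=None):
        state = int(observation)
        if 0 <= state < len(POLICY):
            return POLICY[state]
        # Fallback for unexpected observations: FrozenLake has 4 actions.
        return random.randrange(4)
'''


def render_agent_source(q_table):
    """Build agent.py source with the greedy policy of `q_table` embedded as a list."""
    policy = [
        max(range(len(row)), key=lambda action: row[action])  # first maximum, like argmax
        for row in q_table
    ]
    return AGENT_TEMPLATE.format(policy=policy)


# Tiny hypothetical 2-state table, used only to show the generated file.
print(render_agent_source([[0.0, 1.0, 0.0, 0.0], [0.5, 0.1, 0.2, 0.9]]))
```

In the notebook, calling `render_agent_source(TRAINED_Q_TABLE)` and writing the returned string to `agent.py` would produce a file that depends on neither NumPy nor the training cell.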
