# RFP: Maze Solvers

## Project Overview
You are invited to submit a proposal that answers the following question:

### What path will your elf take?

*Please submit your proposal by **2/11/25 at 11:59 PM**.*

## Required Proposal Components

### 1. Data Description
In the code cell below, use [Gymnasium](https://gymnasium.farama.org/) to set up a [Frozen Lake maze](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) for your project. When you are done with the set up, describe the reward system you plan on using.

*Note, a level 5 maze is at least 10 x 10 cells large and contains at least five lake cells.*

In [3]:
import gymnasium as gym
import random
import pandas as pd
from gymnasium.envs.toy_text.frozen_lake import generate_random_map


In [4]:
# Make maze
cells =["S", "F", "F", "H","F","G"]
maze=["SFFHFFFFFH", "FHFFFHHFFF", "FFFHFHFFFH", "FHFFFFFFFF", "FFFFHFFHFH","FHFHFHFHFH","FFFFFFFFFF", "FHFFFFFFFF","HHFFFHFFFF","FFFFFFFFGF"]
env = gym.make('FrozenLake-v1', desc=maze)

initial_state = env.reset()

env.render()

# Take a step (0: LEFT, 1: RIGHT, 2: UP, 3: DOWN)
action = 2
new_state, reward, terminated, truncated, info = env.step(action)

env.render()

  gym.logger.warn(


In [5]:
env.close()

#### Reward System:
## Empty Space = -1 point
## Present = 100 points
## Hole = -100 points

In [7]:
def updateQ(q, alpha, gamma, step, cell, reward):
    row = [q[3][cell], q[1][cell], q[0][cell], q[2][cell]]
    bell = (1-alpha)*(q[step][cell]) + alpha*(reward + (gamma*max(row)))
    q[step][cell] = bell

### 2. Training Your Model
In the cell seen below, write the code you need to train a Q-Learning model. Display your final Q-table once you are done training your model.

*Note, level 5 work uses only the standard Python library and Pandas to train your Q-Learning model. A level 4 uses external libraries like Baseline3.*

In [9]:
# Learning rate - 0.5, discount - 0.5
# Bellman Equation: (1-alpha)q(s , a) + alpha(R + gamma(max(q(s' , a'))))
# Q Table Diagram

In [10]:
# Environment parameters
num_states = 100  # Total number of states (e.g., 10x10 grid)
num_actions = 4   # 0: Left, 1: Right, 2: Up, 3: Down
grid_size = 10    # Size of the grid (10x10)

# Initialize Q-table
q = {state: [0] * num_actions for state in range(num_states)}

In [11]:
# Create my own reward system
cell_types = ['S', 'F', 'F', 'H', 'F', 'F', 'F', 'F', 'F', 'H', 'F', 'H', 'F', 'F', 'F', 'H', 'H', 'F', 'F', 'F', 'F', 'F', 'F', 'H', 'F', 'H', 'F', 'F', 'F', 'H', 'F', 'H', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'H', 'F', 'F', 'H', 'F', 'H', 'F', 'H', 'F', 'H', 'F', 'H', 'F', 'H', 'F', 'H', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'H', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'H', 'H', 'F', 'F', 'F', 'H', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'G', 'F']

def getReward(state):
    if cell_types[state] == "G":
        return 100
    elif cell_types[state] == "H":
        return -100
    else:
        return -1

In [12]:
for episode in range(1000):
    while not terminated:
        # Take a step (0: LEFT, 1: DOWN, 2: RIGHT, 3: UP)
        action = random.randint(0, 3)
        new_state, reward, terminated, truncated, info = env.step(action)
        idx = new_state % len(q[3])
        if cell_types[new_state] == "F" or cell_types[new_state] == "S":
            reward = -1
        elif cell_types[new_state] == "H":
            reward = -100
        else:
            reward = 100
        updateQ(q, 0.5, 0.5, action, idx, reward)
    iniital = env.reset()
    terminated = False

In [13]:
def get_valid_actions(current_state, grid_size):
    valid_actions = []
    row, col = current_state // grid_size, current_state % grid_size

    if col > 0:  # Can move left
        valid_actions.append(0)
    if col < grid_size - 1:  # Can move right
        valid_actions.append(1)
    if row > 0:  # Can move up
        valid_actions.append(2)
    if row < grid_size - 1:  # Can move down
        valid_actions.append(3)

    return valid_actions

In [14]:
def updateQTable(q, alpha, gamma, current_state, new_state, action):
    current_q = q[current_state][action]
    reward = getReward(new_state)
    next_state_row = [q[new_state][0], q[new_state][1]] 
    next_max_q = max(next_state_row)
    new_q = ((1 - alpha) * current_q) + (alpha * (reward + (gamma * next_max_q)))
    q[current_state][action] = new_q

In [15]:
# Simulate the environment
def step(action, current_state, grid_size):
    row, col = current_state // grid_size, current_state % grid_size

    if action == 0:  # Move left
        new_state = current_state - 1
    elif action == 1:  # Move right
        new_state = current_state + 1
    elif action == 2:  # Move up
        new_state = current_state - grid_size
    elif action == 3:  # Move down
        new_state = current_state + grid_size
    else:
        raise ValueError("Invalid action")

    # Ensure the new state is within bounds
    new_state = max(0, min(new_state, num_states - 1))

    # Calculate reward and termination
    reward = getReward(new_state)
    terminated = (new_state == 'G' or new_state == 'H')
    truncated = False
    info = {}

    return new_state, reward, terminated, truncated, info


In [16]:
q

{0: [-1.9999999999999996,
  -1.999999999999997,
  -1.9999999999999996,
  -199.9999834270789],
 1: [-1.9999999999999996,
  -2.0000000000000204,
  -1.9999999999999996,
  -199.99998506715968],
 2: [-2.0, -2.1933593749999982, -1.9999999999999996, -199.99998839014603],
 3: [-2.0, -1.9999999999999982, -1.9999999999999996, -199.9999900052803],
 4: [0, 0, 0, 0],
 5: [0, 0, 0, 0],
 6: [0, 0, 0, 0],
 7: [0, 0, 0, 0],
 8: [0, 0, 0, 0],
 9: [0, 0, 0, 0],
 10: [0, 0, 0, 0],
 11: [0, 0, 0, 0],
 12: [0, 0, 0, 0],
 13: [0, 0, 0, 0],
 14: [0, 0, 0, 0],
 15: [0, 0, 0, 0],
 16: [0, 0, 0, 0],
 17: [0, 0, 0, 0],
 18: [0, 0, 0, 0],
 19: [0, 0, 0, 0],
 20: [0, 0, 0, 0],
 21: [0, 0, 0, 0],
 22: [0, 0, 0, 0],
 23: [0, 0, 0, 0],
 24: [0, 0, 0, 0],
 25: [0, 0, 0, 0],
 26: [0, 0, 0, 0],
 27: [0, 0, 0, 0],
 28: [0, 0, 0, 0],
 29: [0, 0, 0, 0],
 30: [0, 0, 0, 0],
 31: [0, 0, 0, 0],
 32: [0, 0, 0, 0],
 33: [0, 0, 0, 0],
 34: [0, 0, 0, 0],
 35: [0, 0, 0, 0],
 36: [0, 0, 0, 0],
 37: [0, 0, 0, 0],
 38: [0, 0, 0, 0],
 3

In [17]:
df = pd.DataFrame(q)
df #for some reason the states are up top

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,-2.0,-2.0,-2.0,-2.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,-2.0,-2.0,-2.193359,-2.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,-2.0,-2.0,-2.0,-2.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,-199.999983,-199.999985,-199.999988,-199.99999,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 3. Testing Your Model
In the cell seen below, write the code you need to test your Q-Learning model for **1000 episodes**. It is important to test your model for 1000 episodes so that we are all able to compare our results.

*Note, level 5 testing uses both a success rate and an average steps taken metric to evaluate your model. Level 4 uses one or the other.*

### 4. Final Answer
In the first cell below, describe the path your elf takes to get to the gift. *Note, a level 5 answer includes a gif of the path your elf takes in order to reach the gift.*

In the second cell seen below, describe how well your Q-Learning model performed. Make sure that you explicitly name the **learning rate**, **the discount factor**, and the **reward system** that you used when training your final model. *Note, a level 5 description describes the model's performance using two types of quantitative evidence.*

![example image](https://gymnasium.farama.org/_images/frozen_lake.gif)

#### Describe the path your elf takes here.

#### Describe how well your Q-Learning model performed here.