# RFP: Maze Solvers

## Project Overview
You are invited to submit a proposal that answers the following question:

### What path will your elf take?

*Please submit your proposal by **2/11/25 at 11:59 PM**.*

## Required Proposal Components

### 1. Data Description
In the code cell below, use [Gymnasium](https://gymnasium.farama.org/) to set up a [Frozen Lake maze](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) for your project. When you are done with the set up, describe the reward system you plan on using.

*Note, a level 5 maze is at least 10 x 10 cells large and contains at least five lake cells.*

In [3]:
import gymnasium as gym
import random
import pandas as pd
from gymnasium.envs.toy_text.frozen_lake import generate_random_map


In [43]:
maze=["SFFHFFFFFH", "FHFFFHHFFF", "FFFHFHFFFH", "FHFFFFFFFF", "FFFFHFFHFH","FHFHFHFHFH","FFFFFFFFFF", "FHFFFFFFFF","HHFFFHFFFF","FFFFFFFFGF"]
env = gym.make('FrozenLake-v1', desc=maze, render_mode = "human")
initial_state = env.reset()
env.render()

#### Reward System:
## Empty Space = -1 point
## Present = 100 points
## Hole = -100 points

### 2. Training Your Model
In the cell seen below, write the code you need to train a Q-Learning model. Display your final Q-table once you are done training your model.

*Note, level 5 work uses only the standard Python library and Pandas to train your Q-Learning model. A level 4 uses external libraries like Baseline3.*

In [8]:
# Learning rate - 0.5, discount - 0.5
# Bellman Equation: (1-alpha)q(s , a) + alpha(R + gamma(max(q(s' , a'))))
# Q Table Diagram

In [9]:
num_states = 100  # Total number of states (e.g., 10x10 grid)
num_actions = 4   # 0: Left, 1: Right, 2: Up, 3: Down
grid_size = 10    # Size of the grid (10x10)

# Initialize Q-table
q = {state: [0] * num_actions for state in range(num_states)}

In [10]:
def updateQ(q, alpha, gamma, step, cell, reward):
    row = [q[3][cell], q[1][cell], q[0][cell], q[2][cell]]
    bell = (1-alpha)*(q[step][cell]) + alpha*(reward + (gamma*max(row)))
    q[step][cell] = bell

In [11]:
# Create my own reward system
cell_types = ['S', 'F', 'F', 'H', 'F', 'F', 'F', 'F', 'F', 'H', 'F', 'H', 'F', 'F', 'F', 'H', 'H', 'F', 'F', 'F', 'F', 'F', 'F', 'H', 'F', 'H', 'F', 'F', 'F', 'H', 'F', 'H', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'H', 'F', 'F', 'H', 'F', 'H', 'F', 'H', 'F', 'H', 'F', 'H', 'F', 'H', 'F', 'H', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'H', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'H', 'H', 'F', 'F', 'F', 'H', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'G', 'F']

def getReward(state):
    if cell_types[state] == "G":
        return 100
    elif cell_types[state] == "H":
        return -100
    else:
        return -1

In [12]:
def updateQTable(q, alpha, gamma, current_state, new_state, action):
    current_q = q[current_state][action]
    reward = getReward(new_state)
    next_state_row = [q[new_state][0], q[new_state][1]] 
    next_max_q = max(next_state_row)
    new_q = ((1 - alpha) * current_q) + (alpha * (reward + (gamma * next_max_q)))
    q[current_state][action] = new_q

In [26]:
current_state = 0
alpha = 0.2
gamma = 0.8
terminated = False
for episode in range(35000):
    while not terminated:
        action = random.randint(0, 3)
        new_state, reward, terminated, truncated, info = env.step(action)
        updateQTable(q, alpha, gamma, current_state, new_state, action)
        current_state = new_state
    inital = env.reset()
    terminated = False

In [28]:
q

{0: [-5.316640778060231,
  -12.118262895027398,
  -14.422521994302528,
  -14.533142982904069],
 1: [-23.394936677698094,
  -30.16809326274336,
  -54.427073245466964,
  -20.72621968807935],
 2: [-36.71288867596839,
  -56.68879089751239,
  -66.83177242683767,
  -41.309895144564344],
 3: [-5.264449894732392,
  -8.657131635668707,
  -18.09724791871852,
  -22.926484259578515],
 4: [-61.44187218389393,
  -62.275735836167655,
  -42.594124146861404,
  -75.6392062774033],
 5: [-62.67304602359023,
  -58.1727447073832,
  -64.40701717830305,
  -40.5177724548531],
 6: [-38.69921273293152,
  -39.053563586452064,
  -54.67934923400566,
  -19.90966378794788],
 7: [-0.2, -6.3504724992, -0.7839360000000002, -4.5940503552],
 8: [-0.36000000000000004, -0.52, -0.2, -0.36000000000000004],
 9: [-0.3384308409421005, 0, 0, 0],
 10: [-5.484573823929537,
  -15.433016358374212,
  -43.87525085346342,
  -51.944969561857306],
 11: [-5.352459108570603,
  -9.116845986092882,
  -9.671473454023818,
  -8.820273702751264],

In [32]:
df = pd.DataFrame(q)
df = df.T
df

Unnamed: 0,0,1,2,3
0,-5.316641,-12.118263,-14.422522,-14.533143
1,-23.394937,-30.168093,-54.427073,-20.726220
2,-36.712889,-56.688791,-66.831772,-41.309895
3,-5.264450,-8.657132,-18.097248,-22.926484
4,-61.441872,-62.275736,-42.594124,-75.639206
...,...,...,...,...
95,-32.985600,-0.440087,0.000000,-16.232000
96,-0.200000,-0.616000,-0.488000,0.000000
97,0.000000,-0.360000,-0.200000,-0.392000
98,-1.050484,0.000000,-7.110763,0.000000


In [39]:
env.close()
# Save final q table as a csv file
df.to_csv('final_q_values.csv', index=False)

### 3. Testing Your Model
In the cell seen below, write the code you need to test your Q-Learning model for **1000 episodes**. It is important to test your model for 1000 episodes so that we are all able to compare our results.

*Note, level 5 testing uses both a success rate and an average steps taken metric to evaluate your model. Level 4 uses one or the other.*

In [47]:
maze=["SFFHFFFFFH", "FHFFFHHFFF", "FFFHFHFFFH", "FHFFFFFFFF", "FFFFHFFHFH","FHFHFHFHFH","FFFFFFFFFF", "FHFFFFFFFF","HHFFFHFFFF","FFFFFFFFGF"]
env = gym.make('FrozenLake-v1', desc=maze, render_mode = "human")
initial_state = env.reset()
env.render()

In [None]:
env.close()

In [None]:
current_state = 0
terminated = False
success_rate = 0
for i in range(1000):
    while not terminated:
        max_q = df.iloc[current_state].max()
        column_name = df.columns[(df == max_q).any()].tolist()
        print("Max: ", max_q, "Action", column_name, "Current State", current_state)
        current_state, terminated = takeAction(column_name[0])
    cell_types = ['S', 'F', 'F', 'H', 'F', 'F', 'F', 'F', 'F', 'H', 'F', 'H', 'F', 'F', 'F', 'H', 'H', 'F', 'F', 'F', 'F', 'F', 'F', 'H', 'F', 'H', 'F', 'F', 'F', 'H', 'F', 'H', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'H', 'F', 'F', 'H', 'F', 'H', 'F', 'H', 'F', 'H', 'F', 'H', 'F', 'H', 'F', 'H', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'H', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'H', 'H', 'F', 'F', 'F', 'H', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'G', 'F']
    terminated = False
    initial_state = env.reset()
    if cell_types[current_state] == "G":
        success_rate += 1
print((success_rate/1000) * 100)

### 4. Final Answer
In the first cell below, describe the path your elf takes to get to the gift. *Note, a level 5 answer includes a gif of the path your elf takes in order to reach the gift.*

In the second cell seen below, describe how well your Q-Learning model performed. Make sure that you explicitly name the **learning rate**, **the discount factor**, and the **reward system** that you used when training your final model. *Note, a level 5 description describes the model's performance using two types of quantitative evidence.*

![example image](https://gymnasium.farama.org/_images/frozen_lake.gif)

#### Describe the path your elf takes here.

#### Describe how well your Q-Learning model performed here.