# FrozenLake Experiments (from `main.py`)

- The assignment does not explicitly specify whether the environment transitions are deterministic or stochastic.

- Because the Frozen Lake problem can also be interpreted as a slippery environment, experiments were conducted under both settings:
- Deterministic transition environment: `frozen_lake_env.py`
- Stochastic transition (slippery) environment: `environment_probablistic.py`

- In the probabilistic Frozen Lake environment, slipping is modeled through stochastic action execution. The intended action is executed with probability 1-p, while with probability p, one of the two perpendicular actions is executed with equal probability. Rewards remain deterministic.

For example, with p = 0.2, if the agent chooses RIGHT:
- RIGHT is executed with probability 0.8
- UP is executed with probability 0.1
- DOWN is executed with probability 0.1
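
The slip rule described above can be sketched as follows. This is a minimal illustration and not the code in `environment_probablistic.py`: the action indexing (LEFT=0, DOWN=1, RIGHT=2, UP=3) and the helper name `sample_executed_action` are assumptions made for the example.

```python
import random

# Assumed action indexing, for illustration only.
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3

# The two perpendicular actions for each intended action.
PERPENDICULAR = {
    LEFT: (UP, DOWN),
    RIGHT: (UP, DOWN),
    UP: (LEFT, RIGHT),
    DOWN: (LEFT, RIGHT),
}

def sample_executed_action(intended, slip_prob, rng):
    """Return the action actually executed under the slip model."""
    if rng.random() < slip_prob:
        # Slip: pick one of the two perpendicular actions uniformly.
        return rng.choice(PERPENDICULAR[intended])
    return intended

# Empirical check: intend RIGHT many times with slip_prob = 0.2.
rng = random.Random(0)
n_samples = 100_000
counts = {LEFT: 0, DOWN: 0, RIGHT: 0, UP: 0}
for _ in range(n_samples):
    counts[sample_executed_action(RIGHT, 0.2, rng)] += 1
freqs = {action: count / n_samples for action, count in counts.items()}
print(freqs)
```

The empirical frequencies should land near 0.8 for RIGHT and 0.1 each for UP and DOWN, and the opposite action (LEFT) is never executed.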




### Task 1: Training using Monte Carlo Reinforcement Technique

- Experiment 1: Deterministic environment - 4×4 grid - Epsilon is fixed - Tie-breaking is biased towards the lowest-indexed action (executed from `monte_carlo.py`) - Training for up to 30k episodes ---- Outcome: Success rate reaches 1 between 9,000 and 12,000 episodes

- Experiment 2: Extended version of Experiment 1 for a 10×10 grid - Training for up to 80k episodes ---- Outcome: Success rate remains 0 throughout training; the robot oscillates between 2 cells in order to avoid falling into a hole, which produces the negative reward

- Experiment 3: Probabilistic environment - 4×4 grid - Same assumptions as Experiment 1 ---- Outcome: Success rate fluctuates between roughly 0.7 and 0.8 (at 0.8, approximately 160 of the 200 evaluation episodes reach the goal)

- Experiment 4: Extended version of Experiment 3 on the 4×4 grid with longer training, up to 80k episodes ---- Outcome: Observations are similar to Experiment 3

- Experiment 5: Extended version of Experiment 4 for a 10×10 grid ---- Outcome: Observations are similar to Experiment 2 - Success rate remains 0 throughout training

-------------------------------------------------------------------------------------------------------

- Experiment 6: Deterministic environment - 4×4 grid - An exponentially decaying ε-greedy strategy was adopted - Random tie-breaking was used for greedy action selection to avoid systematic bias toward the lowest-indexed action (LEFT) ---- Outcome: Success rate reaches 1 very early in training (within 3,000 episodes)

- Experiment 7: Extended version of Experiment 6 on a 10×10 grid (deterministic transitions, random tie-breaking, and a decaying ε-greedy strategy) ---- Outcome: Despite these algorithmic improvements, the method failed to converge; the greedy policy success rate remained zero even after 80,000 training episodes


In [22]:
%matplotlib tk


In [23]:
# --- General imports ---
import sys
import time
import matplotlib.pyplot as plt

In [24]:
# Deterministic environment
from frozen_lake_env import FrozenLakeEnv as FrozenLakeEnvDet, generate_random_solvable_holes

# Probabilistic environment
from environment_probablistic import FrozenLakeEnv as FrozenLakeEnvProb

# Renderer for Frozen Lake GUI
from frozen_lake_render import FrozenLakeMatplotlibRenderer

In [25]:
# --- Original Monte-Carlo imports (Task 1) ---
from monte_carlo import (
    mc_control_first_visit_no_exploring_starts as mc_control_base,
    greedy_action as greedy_action_base,
    print_greedy_policy_grid as print_policy_base,
    evaluate_greedy_policy as eval_greedy_base,
)

In [26]:
# --- Updated Monte-Carlo (random tie-break + epsilon decay) imports ---
from monte_carlo_update import (
    mc_control_first_visit_no_exploring_starts as mc_control_up,
    greedy_action as greedy_action_up,
    print_greedy_policy_grid as print_policy_up,
    evaluate_greedy_policy as eval_greedy_up,
)

In [27]:
# --- SARSA (Task 2) imports ---
from sarsa import (
    sarsa_control_epsilon_greedy as sarsa_control_up,
    greedy_action as greedy_action_sarsa,
    print_greedy_policy_grid as print_policy_sarsa,
    evaluate_greedy_policy as eval_greedy_sarsa,
)

### Visualization of the single episode after Training

In [28]:
def run_episode_with_render(env, Q, renderer, greedy_action_fn, max_steps=200, pause=0.30):
    """
    Run ONE episode using the greedy policy from Q.
    Updates matplotlib renderer each step.
    """
    s = env.reset()

    print("Initial state_id:", s, "pos:", env.s)
    env.render()        # ASCII
    renderer.draw()     # GUI initial draw

    for t in range(max_steps):
        a = greedy_action_fn(Q[s])
        ns, r, done, info = env.step(a)

        # If probabilistic env, show executed action info too
        extra = ""
        if isinstance(info, dict) and "executed_action" in info:
            extra = f" | executed={env.ACTION_NAMES[info['executed_action']]} slipped={info.get('slipped', False)}"

        print(
            f"[t={t}] a={env.ACTION_NAMES[a]} ({a})  ns={ns}  r={r}  done={done}  pos={env.s}{extra}"
        )

        env.render()
        renderer.draw(action=a, reward=r, done=done)

        s = ns

        if pause is not None and pause > 0:
            time.sleep(pause)

        if done:
            print("Episode finished.")
            break

    print("Close plot window to exit.")
    plt.ioff()
    plt.show()

### Reusable pipeline for Monte-Carlo Experiments

- Environment setup (deterministic or probabilistic)
- Train a policy using Monte-Carlo control
- Evaluate the learned greedy policy
- Optionally visualize one greedy episode

In [29]:
def run_mc_experiment(
    env_class,
    n,
    holes,
    env_seed,
    mc_control_fn,
    greedy_action_fn,
    print_policy_fn,
    eval_fn,
    mc_kwargs=None,
    # Common experiment settings:
    train_episodes=30000,
    gamma=0.99,
    max_steps_train=100,
    eval_episodes=2000,
    max_steps_eval=100,
    verbose_every=3000,
    render_one_episode=True,
    render_pause=0.30,
    env_kwargs=None,
):
    """
    Runs MC control on either deterministic or probabilistic environment,
    depending on env_class passed in.
    """
    env_kwargs = env_kwargs or {}
    mc_kwargs = mc_kwargs or {}

    print("\n" + "=" * 80)
    print(f"Running MC experiment: grid={n}x{n} | holes={len(holes)} | hole_ratio={len(holes)/(n*n):.2%}")
    print(f"Environment: {env_class.__name__} | extra_args={env_kwargs}")
    print(f"MC implementation: {mc_control_fn.__module__}.{mc_control_fn.__name__} | mc_kwargs={mc_kwargs}")
    print("=" * 80)

    env = env_class(n=n, holes=holes, seed=env_seed, **env_kwargs)

    # --- Train ---
    Q, pi = mc_control_fn(
        env,
        num_episodes=train_episodes,
        gamma=gamma,
        max_steps_per_episode=max_steps_train,
        seed=0,
        verbose_every=verbose_every,
        **mc_kwargs
    )

    # --- Print policy ---
    print("\nFinal greedy policy (grid):")
    print_policy_fn(env, Q)

    # --- Evaluate ---
    sr = eval_fn(env, Q, episodes=eval_episodes, max_steps=max_steps_eval, seed=999)
    print(f"\nFinal greedy success rate over {eval_episodes} episodes: {sr:.3f}")

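    if render_one_episode:
        # Both environment classes are named FrozenLakeEnv, so the class name
        # cannot distinguish them; check the module name and slip_prob instead.
        is_slippery = "prob" in env_class.__module__.lower() or env_kwargs.get("slip_prob", 0) > 0
        env_label = "Slippery" if is_slippery else "Deterministic"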

        renderer = FrozenLakeMatplotlibRenderer(
            env,
            bg_image_path=None,
            pause=render_pause,
            title=f"Greedy after MC {n}x{n} - {env_label}"
        )
        run_episode_with_render(
            env, Q, renderer, greedy_action_fn,
            max_steps=max_steps_eval, pause=render_pause
        )

### Reusable pipeline for SARSA Experiments

In [30]:
def run_sarsa_experiment(
    env_class,
    n,
    holes,
    env_seed,
    sarsa_control_fn,
    greedy_action_fn,
    print_policy_fn,
    eval_fn,
    sarsa_kwargs=None,
    # Common experiment settings:
    train_episodes=30000,
    gamma=0.99,
    alpha=0.10,
    max_steps_train=100,
    eval_episodes=2000,
    max_steps_eval=100,
    verbose_every=3000,
    render_one_episode=True,
    render_pause=0.30,
    env_kwargs=None,
):
    """
    Runs SARSA control on either deterministic or probabilistic environment,
    depending on env_class passed in.
    """
    env_kwargs = env_kwargs or {}
    sarsa_kwargs = sarsa_kwargs or {}

    print("\n" + "=" * 80)
    print(f"Running SARSA experiment: grid={n}x{n} | holes={len(holes)} | hole_ratio={len(holes)/(n*n):.2%}")
    print(f"Environment: {env_class.__name__} | extra_args={env_kwargs}")
    print(f"SARSA implementation: {sarsa_control_fn.__module__}.{sarsa_control_fn.__name__} | sarsa_kwargs={sarsa_kwargs}")
    print("=" * 80)

    env = env_class(n=n, holes=holes, seed=env_seed, **env_kwargs)

    # --- Train ---
    Q, pi = sarsa_control_fn(
        env,
        num_episodes=train_episodes,
        gamma=gamma,
        alpha=alpha,
        max_steps_per_episode=max_steps_train,
        seed=0,
        verbose_every=verbose_every,
        **sarsa_kwargs
    )

    # --- Print policy ---
    print("\nFinal greedy policy (grid):")
    print_policy_fn(env, Q)

    # --- Evaluate ---
    sr = eval_fn(env, Q, episodes=eval_episodes, max_steps=max_steps_eval, seed=999)
    print(f"\nFinal greedy success rate over {eval_episodes} episodes: {sr:.3f}")

    # --- Render one episode (optional) ---
    if render_one_episode:
        renderer = FrozenLakeMatplotlibRenderer(
            env,
            bg_image_path=None,
            pause=render_pause,
            title=f"FrozenLake {n}x{n} (Greedy after SARSA)"
        )
        run_episode_with_render(env, Q, renderer, greedy_action_fn, max_steps=max_steps_eval, pause=render_pause)

### Environment Holes setup 

In [31]:
# --- Common setup: holes ---
# Task 1: 4x4 fixed holes
holes_4x4 = {(1, 1), (1, 3), (2, 3), (3, 0)}

# Task 2: 10x10 random solvable holes
holes_10x10 = generate_random_solvable_holes(
    n=10,
    hole_ratio=0.25,
    seed=123,
    start=(0, 0),
    goal=(9, 9),
    max_tries=5000
)

print("holes_4x4 =", holes_4x4)
print("holes_10x10 count =", len(holes_10x10))
print("holes_10x10 =", holes_10x10)

holes_4x4 = {(2, 3), (1, 1), (1, 3), (3, 0)}
holes_10x10 count = 25
holes_10x10 = {(4, 0), (3, 4), (8, 6), (0, 5), (1, 9), (7, 4), (6, 2), (6, 8), (5, 6), (9, 4), (0, 1), (1, 2), (2, 7), (6, 1), (7, 3), (6, 7), (4, 1), (4, 4), (3, 8), (8, 4), (5, 8), (0, 3), (1, 4), (0, 6), (1, 7)}


###  Experiment 1: Deterministic Transition, Monte Carlo (task 1), 4×4 - (detMC4)

- When multiple actions had identical Q-values during the argmax operation, ties were resolved deterministically by always selecting the lowest-indexed action (LEFT); no randomness was introduced in tie-breaking.

- The epsilon value is fixed (ε = 0.10).

- The action-value updates used the first-visit Monte Carlo method: only the return from the first occurrence of each state–action pair in an episode was used. Every-visit updates were not applied.

- Training ran for up to 30,000 episodes, with the greedy policy evaluated every 3,000 episodes.
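
The first-visit rule can be illustrated with a short sketch. This is not the implementation in `monte_carlo.py`; the function name `first_visit_returns` and the toy episode are hypothetical, and `gamma=0.5` is chosen only to keep the arithmetic easy to follow.

```python
def first_visit_returns(episode, gamma):
    """Map each (state, action) pair to the return following its first visit.

    `episode` is a list of (state, action, reward) tuples in time order.
    """
    G = 0.0
    returns = {}
    # Walk backwards so that G is the discounted return from each step onward.
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        # Overwriting while moving backwards leaves the value that belongs
        # to the earliest (first) occurrence of each pair.
        returns[(state, action)] = G
    return returns

# Toy episode: the pair (state 0, action 1) occurs at t=0 and again at t=2.
episode = [(0, 1, 0.0), (4, 2, 0.0), (0, 1, 0.0), (4, 1, 1.0)]
first_visit = first_visit_returns(episode, gamma=0.5)
print(first_visit)
```

For the repeated pair, only the return from t=0 is kept; an every-visit method would also average in the return from t=2.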



In [32]:
run_mc_experiment(
    env_class=FrozenLakeEnvDet,
    n=4,
    holes=holes_4x4,
    env_seed=40,
    mc_control_fn=mc_control_base,
    greedy_action_fn=greedy_action_base,
    print_policy_fn=print_policy_base,
    eval_fn=eval_greedy_base,
    mc_kwargs={"epsilon": 0.10},
    train_episodes=30000,
    gamma=0.99,
    max_steps_train=100,
    eval_episodes=2000,
    max_steps_eval=100,
    verbose_every=3000,
    render_one_episode=True,
    render_pause=0.30,
    env_kwargs={}
)


Running MC experiment: grid=4x4 | holes=4 | hole_ratio=25.00%
Environment: FrozenLakeEnv | extra_args={}
MC implementation: monte_carlo.mc_control_first_visit_no_exploring_starts | mc_kwargs={'epsilon': 0.1}
[MC FV] episode=3000 | greedy success_rate=0.000
[MC FV] episode=6000 | greedy success_rate=0.000
[MC FV] episode=9000 | greedy success_rate=0.000
[MC FV] episode=12000 | greedy success_rate=1.000
[MC FV] episode=15000 | greedy success_rate=1.000
[MC FV] episode=18000 | greedy success_rate=1.000
[MC FV] episode=21000 | greedy success_rate=1.000
[MC FV] episode=24000 | greedy success_rate=1.000
[MC FV] episode=27000 | greedy success_rate=1.000
[MC FV] episode=30000 | greedy success_rate=1.000

Final greedy policy (grid):
S L L L
D H U H
R D L H
H R R G

Final greedy success rate over 2000 episodes: 1.000
[Renderer] Image not found: bg_images/robot.png (will fallback to markers if needed)
Initial state_id: 0 pos: (0, 0)
Agent at: (row=0, col=0) | state_id=0
---------
|S F F F|
|F H 

Explanation for Experiment 1:

Based on the success rate (training ran for 30,000 episodes and the success rate was evaluated every 3,000 episodes):
- For the first ~9k episodes, the learned Q-table produces a greedy policy that never reaches the goal.
- Between 9k and 12k episodes, the greedy policy first contains a complete path to the goal.
- The sharp jump to a 100% success rate occurs because the transitions are deterministic: once the greedy policy contains one path to the goal, every evaluation episode follows it.
- Learning is delayed by sparse terminal rewards and the lack of exploring starts, which limits early state–action exploration.

Based on the final greedy policy printed:
- At each state, the indicated action is the one with the highest learned Q-value.
- Some states show LEFT as the greedy action, which may appear unintuitive. These states were rarely or never visited during successful episodes, so their Q-values remained at zero. When all Q-values are equal, the `greedy_action()` function resolves ties by selecting the first maximum action, which is LEFT (Assumption 1).
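
The tie-breaking effect can be reproduced in isolation. The sketch below is not the `greedy_action()` from `monte_carlo.py`; both helper names are hypothetical, and it assumes only that action index 0 corresponds to LEFT, as stated above.

```python
import random

def greedy_first_max(q_values):
    """Deterministic tie-breaking: return the lowest index among the maxima."""
    best_value = max(q_values)
    return q_values.index(best_value)  # list.index returns the first match

def greedy_random_tie(q_values, rng):
    """Random tie-breaking: choose uniformly among all maximal actions."""
    best_value = max(q_values)
    best_actions = [a for a, q in enumerate(q_values) if q == best_value]
    return rng.choice(best_actions)

unvisited = [0.0, 0.0, 0.0, 0.0]  # Q-values of a state that was never updated
rng = random.Random(0)

first = greedy_first_max(unvisited)
picks = {greedy_random_tie(unvisited, rng) for _ in range(200)}
print(first, sorted(picks))
```

With all-zero Q-values the first-maximum rule always returns index 0, while the random rule spreads its choices over all four actions.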

###  Experiment 2: Deterministic Transition, Monte Carlo (task 1), 10×10 - (detMC10)

- Extended version of Experiment 1 for a 10×10 grid
- Training ran for up to 80,000 episodes, with the greedy policy evaluated every 10,000 episodes

In [33]:
run_mc_experiment(
    env_class=FrozenLakeEnvDet,
    n=10,
    holes=holes_10x10,
    env_seed=123,
    mc_control_fn=mc_control_base,
    greedy_action_fn=greedy_action_base,
    print_policy_fn=print_policy_base,
    eval_fn=eval_greedy_base,
    mc_kwargs={"epsilon": 0.10},
    train_episodes=80000,
    gamma=0.99,
    max_steps_train=300,
    eval_episodes=2000,
    max_steps_eval=300,
    verbose_every=10000,
    render_one_episode=True,
    render_pause=0.10,
    env_kwargs={}
)


Running MC experiment: grid=10x10 | holes=25 | hole_ratio=25.00%
Environment: FrozenLakeEnv | extra_args={}
MC implementation: monte_carlo.mc_control_first_visit_no_exploring_starts | mc_kwargs={'epsilon': 0.1}
[MC FV] episode=10000 | greedy success_rate=0.000
[MC FV] episode=20000 | greedy success_rate=0.000
[MC FV] episode=30000 | greedy success_rate=0.000
[MC FV] episode=40000 | greedy success_rate=0.000
[MC FV] episode=50000 | greedy success_rate=0.000
[MC FV] episode=60000 | greedy success_rate=0.000
[MC FV] episode=70000 | greedy success_rate=0.000
[MC FV] episode=80000 | greedy success_rate=0.000

Final greedy policy (grid):
S H L H L H H L L L
D D H D H L L H L H
R L L L L R L H L L
U U U U H D L L H L
H H U U H L L L L L
D D R D L L H L H L
L H H L L L L H H L
L L L H H L L L L L
L L L L H L H L L L
L L L L H L L L L G

Final greedy success rate over 2000 episodes: 0.000
[Renderer] Image not found: bg_images/robot.png (will fallback to markers if needed)
Initial state_id: 0 p

Explanations of Experiment 2:

- The greedy success rate remains 0.00 throughout training, even after 80,000 episodes.
- MC control fails to scale to the larger 10×10 grid under sparse rewards and a fixed start state.
- A likely cause is the extreme sparsity of positive reward, combined with the fact that MC control needs complete episodes before it can update Q-values.

From the final Greedy policy:
- There is no single continuous directed path from Start to Goal. 

During episode execution, the robot was observed to oscillate between two grid cells, which avoids falling into a hole and therefore avoids the negative reward.
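
This kind of oscillation is what a deterministic greedy rollout produces whenever two neighbouring cells point at each other; in the printed policy, for instance, cell (2, 0) shows R and cell (2, 1) shows L. The sketch below traces a greedy policy and reports such loops. The function name `trace_greedy_path`, the 2×3 policy, and the rule that a move off the grid leaves the agent in place are all assumptions made for illustration.

```python
MOVES = {"L": (0, -1), "R": (0, 1), "U": (-1, 0), "D": (1, 0)}

def trace_greedy_path(policy, start, goal, holes, max_steps=50):
    """Follow a deterministic greedy policy and report how the rollout ends.

    Returns (outcome, path), where outcome is "goal", "hole", "loop" or "timeout".
    """
    n_rows, n_cols = len(policy), len(policy[0])
    pos, path, seen = start, [start], {start}
    for _ in range(max_steps):
        dr, dc = MOVES[policy[pos[0]][pos[1]]]
        # Assumption: a move that would leave the grid keeps the agent in place.
        row = min(max(pos[0] + dr, 0), n_rows - 1)
        col = min(max(pos[1] + dc, 0), n_cols - 1)
        pos = (row, col)
        path.append(pos)
        if pos == goal:
            return "goal", path
        if pos in holes:
            return "hole", path
        if pos in seen:
            # Deterministic policy + deterministic transitions: a revisited
            # cell means the rollout will cycle forever.
            return "loop", path
        seen.add(pos)
    return "timeout", path

# Hypothetical 2x3 policy in which (0, 0) and (0, 1) point at each other.
policy = [
    ["R", "L", "D"],
    ["R", "R", "R"],
]
outcome, path = trace_greedy_path(policy, start=(0, 0), goal=(1, 2), holes={(1, 0)})
print(outcome, path)
```

A rollout that ends in "loop" never reaches the goal and never falls into a hole, which matches a success rate of zero without any hole penalty.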

###  Experiment 3: Probabilistic Transition - Slippery Environment, Monte Carlo (task 1), 4×4 - (probMC4)

- Same assumptions as Experiment 1: epsilon is fixed and tie-breaking is biased towards the lowest-indexed action
- Training ran for up to 30,000 episodes, with the greedy policy evaluated every 3,000 episodes

In [34]:
run_mc_experiment(
    env_class=FrozenLakeEnvProb,
    n=4,
    holes=holes_4x4,
    env_seed=40,
    mc_control_fn=mc_control_base,
    greedy_action_fn=greedy_action_base,
    print_policy_fn=print_policy_base,
    eval_fn=eval_greedy_base,
    mc_kwargs={"epsilon": 0.10},
    train_episodes=30000,
    gamma=0.99,
    max_steps_train=100,
    eval_episodes=2000,
    max_steps_eval=100,
    verbose_every=3000,
    render_one_episode=True,
    render_pause=0.30,
    env_kwargs={"slip_prob": 0.2}
)


Running MC experiment: grid=4x4 | holes=4 | hole_ratio=25.00%
Environment: FrozenLakeEnv | extra_args={'slip_prob': 0.2}
MC implementation: monte_carlo.mc_control_first_visit_no_exploring_starts | mc_kwargs={'epsilon': 0.1}
[MC FV] episode=3000 | greedy success_rate=0.755
[MC FV] episode=6000 | greedy success_rate=0.735
[MC FV] episode=9000 | greedy success_rate=0.750
[MC FV] episode=12000 | greedy success_rate=0.685
[MC FV] episode=15000 | greedy success_rate=0.770
[MC FV] episode=18000 | greedy success_rate=0.730
[MC FV] episode=21000 | greedy success_rate=0.685
[MC FV] episode=24000 | greedy success_rate=0.715
[MC FV] episode=27000 | greedy success_rate=0.765
[MC FV] episode=30000 | greedy success_rate=0.800

Final greedy policy (grid):
S L U U
D H U H
R D D H
H R R G

Final greedy success rate over 2000 episodes: 0.760
[Renderer] Image not found: bg_images/robot.png (will fallback to markers if needed)
Initial state_id: 0 pos: (0, 0)
Agent at: (row=0, col=0) | state_id=0
---------

Explanations for Experiment 3:

- For the probabilistic Frozen Lake, the greedy success rate during Monte Carlo training does not increase monotonically but exhibits oscillations.
- This behavior is expected due to the high variance of Monte Carlo return estimates under stochastic transitions and the continued exploration induced by the ε-greedy behavior policy.
- Despite these fluctuations, the success rate settles at roughly 0.7–0.8 (0.800 at the last checkpoint and 0.760 in the final 2,000-episode evaluation), which appears to be close to the best achievable performance in the slippery environment.
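
Part of the checkpoint-to-checkpoint fluctuation may also be plain evaluation noise. As a rough check, the sketch below computes the binomial standard error of a success-rate estimate, assuming independent evaluation episodes, a true success rate of about 0.75, and the 200-episode periodic evaluation mentioned in the summary (the final evaluation uses 2,000 episodes). The function name is hypothetical.

```python
import math

def success_rate_std_error(p, n_episodes):
    """Standard error of a success-rate estimate from n independent episodes."""
    return math.sqrt(p * (1.0 - p) / n_episodes)

se_200 = success_rate_std_error(0.75, 200)    # periodic checkpoints (assumed size)
se_2000 = success_rate_std_error(0.75, 2000)  # final evaluation
print(f"200 episodes:  +/- {se_200:.4f}")
print(f"2000 episodes: +/- {se_2000:.4f}")
```

Under these assumptions, a swing of a few percentage points between 200-episode checkpoints is within about two standard errors, so it does not by itself indicate a change in policy quality.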

###  Experiment 4: Probabilistic Transition - Slippery Environment, Monte Carlo (task 1), 4×4 - 80k Episodes ----> (probMC4-moreTrain)

- This experiment extends Experiment 3 by increasing the number of training episodes to 80,000, with success evaluated every 10,000 episodes. 
- The objective is to examine whether a larger training budget improves the success rate of Monte Carlo methods in a slippery environment.

In [35]:
run_mc_experiment(
    env_class=FrozenLakeEnvProb,
    n=4,
    holes=holes_4x4,
    env_seed=40,
    mc_control_fn=mc_control_base,
    greedy_action_fn=greedy_action_base,
    print_policy_fn=print_policy_base,
    eval_fn=eval_greedy_base,
    mc_kwargs={"epsilon": 0.10},
    train_episodes=80000,
    gamma=0.99,
    max_steps_train=100,
    eval_episodes=2000,
    max_steps_eval=100,
    verbose_every=10000,
    render_one_episode=True,
    render_pause=0.30,
    env_kwargs={"slip_prob": 0.2}
)


Running MC experiment: grid=4x4 | holes=4 | hole_ratio=25.00%
Environment: FrozenLakeEnv | extra_args={'slip_prob': 0.2}
MC implementation: monte_carlo.mc_control_first_visit_no_exploring_starts | mc_kwargs={'epsilon': 0.1}
[MC FV] episode=10000 | greedy success_rate=0.755
[MC FV] episode=20000 | greedy success_rate=0.755
[MC FV] episode=30000 | greedy success_rate=0.735
[MC FV] episode=40000 | greedy success_rate=0.730
[MC FV] episode=50000 | greedy success_rate=0.770
[MC FV] episode=60000 | greedy success_rate=0.755
[MC FV] episode=70000 | greedy success_rate=0.700
[MC FV] episode=80000 | greedy success_rate=0.795

Final greedy policy (grid):
S L D U
D H D H
R D D H
H R R G

Final greedy success rate over 2000 episodes: 0.742
[Renderer] Image not found: bg_images/robot.png (will fallback to markers if needed)
Initial state_id: 0 pos: (0, 0)
Agent at: (row=0, col=0) | state_id=0
---------
|S F F F|
|F H F H|
|F F F H|
|H F F G|
---------
[t=0] a=D (1)  ns=4  r=0.0  done=False  pos=(1

Explanation for Experiment 4:

- Increasing the number of Monte Carlo training episodes from 30,000 to 80,000 in the probabilistic environment did not result in monotonic performance improvement.
- Instead, the greedy success rate continued to oscillate within a stable band (roughly 0.70–0.80 at the checkpoints, 0.742 in the final evaluation).
- This indicates early saturation of policy quality and highlights a key limitation of Monte Carlo methods in stochastic environments.

When comparing the final Greedy policy of Experiment 1 (deterministic Transition) and Experiment 4 (Slippery Environment):
- Once the agent is close enough to the goal and away from holes, both environments agree on the optimal behavior.
- The difference appears only in risky regions (near holes), where the policy learned under stochastic transitions optimizes for survival under action noise.
- In the deterministic environment, the learned greedy policy follows a direct shortest path to the goal. In contrast, under stochastic transitions, the policy becomes noticeably more conservative, favoring actions that reduce the probability of catastrophic slips into holes.
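
To make the comparison concrete, the two printed 4×4 policy grids can be compared cell by cell. The sketch below copies the grids from the outputs of Experiment 1 and Experiment 4; the helper names `parse_policy` and `policy_differences` are hypothetical.

```python
def parse_policy(text):
    """Turn a printed policy grid into a list of rows of symbols."""
    return [line.split() for line in text.strip().splitlines()]

def policy_differences(a, b):
    """Return ((row, col), symbol_a, symbol_b) for every cell that differs."""
    return [
        ((r, c), x, y)
        for r, (row_a, row_b) in enumerate(zip(a, b))
        for c, (x, y) in enumerate(zip(row_a, row_b))
        if x != y
    ]

deterministic = parse_policy("""
S L L L
D H U H
R D L H
H R R G
""")
slippery = parse_policy("""
S L D U
D H D H
R D D H
H R R G
""")

diffs = policy_differences(deterministic, slippery)
for cell, det_action, slip_action in diffs:
    print(cell, det_action, "->", slip_action)
```

The cells (1, 0), (2, 0), (2, 1), (3, 1) and (3, 2) carry the same action in both grids; the differences are confined to columns 2 and 3 of the upper rows.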

###  Experiment 5: Probabilistic Transition - Slippery Environment, Monte Carlo (task 1), 10×10 - 80k Episodes ----> (probMC10)

- This experiment extends Experiment 4 to a 10×10 grid

In [37]:
run_mc_experiment(
    env_class=FrozenLakeEnvProb,
    n=10,
    holes=holes_10x10,
    env_seed=123,
    mc_control_fn=mc_control_base,
    greedy_action_fn=greedy_action_base,
    print_policy_fn=print_policy_base,
    eval_fn=eval_greedy_base,
    mc_kwargs={"epsilon": 0.10},
    train_episodes=80000,
    gamma=0.99,
    max_steps_train=300,
    eval_episodes=2000,
    max_steps_eval=300,
    verbose_every=10000,
    render_one_episode=True,
    render_pause=0.10,
    env_kwargs={"slip_prob": 0.2}
)


Running MC experiment: grid=10x10 | holes=25 | hole_ratio=25.00%
Environment: FrozenLakeEnv | extra_args={'slip_prob': 0.2}
MC implementation: monte_carlo.mc_control_first_visit_no_exploring_starts | mc_kwargs={'epsilon': 0.1}
[MC FV] episode=10000 | greedy success_rate=0.000
[MC FV] episode=20000 | greedy success_rate=0.000
[MC FV] episode=30000 | greedy success_rate=0.000
[MC FV] episode=40000 | greedy success_rate=0.000
[MC FV] episode=50000 | greedy success_rate=0.000
[MC FV] episode=60000 | greedy success_rate=0.000
[MC FV] episode=70000 | greedy success_rate=0.000
[MC FV] episode=80000 | greedy success_rate=0.000

Final greedy policy (grid):
S H L H L H H L L L
D D H D H R R H L H
L L D D L L D H L L
U U L L H R L D H L
H H U L H D L D L L
D L L U L L H D H L
L H H R L U L H H L
R D L H H R L L L L
D L L L H D H L L L
D L D L H L L L L G

Final greedy success rate over 2000 episodes: 0.000
[Renderer] Image not found: bg_images/robot.png (will fallback to markers if needed)
Initi

Explanation for Experiment 5:
- Similar to the deterministic transition environment, this experiment fails to converge to a valid start-to-goal trajectory for the 10×10 grid. 
- The observations are consistent with those from Experiment 2 (10×10 deterministic case)
- The greedy success rate remains 0.00 throughout training, even after 80,000 episodes.
- Monte Carlo control fails to scale to larger grids (10×10) under sparse rewards and fixed start states.

###  Experiment 6: Deterministic Transition, Monte Carlo (task 1), 4×4 - (detMCup4)

- Random tie-breaking was used for greedy action selection to avoid systematic bias toward lower-indexed actions (Left side) during early learning when multiple actions have identical value estimates.

- An exponentially decaying ε-greedy strategy was adopted, starting with a high exploration rate and gradually reducing it toward a fixed minimum, with `epsilon_start` = 0.40, `epsilon_min` = 0.15, and `epsilon_decay_rate` = 1.5.



In [None]:
run_mc_experiment(
    env_class=FrozenLakeEnvDet,
    n=4,
    holes=holes_4x4,
    env_seed=40,
    mc_control_fn=mc_control_up,
    greedy_action_fn=greedy_action_up,
    print_policy_fn=print_policy_up,
    eval_fn=eval_greedy_up,
    mc_kwargs={
        "epsilon_start": 0.40,
        "epsilon_min": 0.15,
        "epsilon_decay_type": "exp",   # "exp" or "linear"
        "epsilon_decay_rate": 1.5,     # higher = faster decay
        "epsilon_decay_fraction": 0.8  # only used if decay_type == "linear"
    },
    train_episodes=30000,
    gamma=0.99,
    max_steps_train=100,
    eval_episodes=2000,
    max_steps_eval=100,
    verbose_every=3000,
    render_one_episode=True,
    render_pause=0.30,
    env_kwargs={}
)


Running MC experiment: grid=4x4 | holes=4 | hole_ratio=25.00%
Environment: FrozenLakeEnv | extra_args={}
MC implementation: monte_carlo_update.mc_control_first_visit_no_exploring_starts | mc_kwargs={'epsilon_start': 0.4, 'epsilon_min': 0.15, 'epsilon_decay_type': 'exp', 'epsilon_decay_rate': 1.5, 'epsilon_decay_fraction': 0.8}
[MC FV] episode=3000 | eps=0.3652 | greedy success_rate=1.000
[MC FV] episode=6000 | eps=0.3352 | greedy success_rate=1.000
[MC FV] episode=9000 | eps=0.3094 | greedy success_rate=1.000
[MC FV] episode=12000 | eps=0.2872 | greedy success_rate=1.000
[MC FV] episode=15000 | eps=0.2681 | greedy success_rate=1.000
[MC FV] episode=18000 | eps=0.2516 | greedy success_rate=1.000
[MC FV] episode=21000 | eps=0.2375 | greedy success_rate=1.000
[MC FV] episode=24000 | eps=0.2253 | greedy success_rate=1.000
[MC FV] episode=27000 | eps=0.2148 | greedy success_rate=1.000
[MC FV] episode=30000 | eps=0.2058 | greedy success_rate=1.000

Final greedy policy (grid):
S L L L
D H D 

Explanation for Experiment 6:

- ε decays smoothly, from 0.3652 at episode 3,000 to 0.2058 at episode 30,000; it does not reach `epsilon_min` = 0.15 within the training budget.
- The greedy success rate is exactly 1.000 from the first checkpoint onward (this is possible because the environment is deterministic).
- While the optimal start-to-goal trajectory is identical to that of Experiment 1, the resulting greedy policy is more realistic in sparsely visited regions: the stronger early exploration improves action selection beyond the main optimal path.
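
The ε values in the training log above are consistent with an exponential schedule of the form ε = ε_min + (ε_start − ε_min) · exp(−rate · episode / total_episodes). This form is an assumption inferred from the logged numbers, not taken from `monte_carlo_update.py`, and the function name is hypothetical.

```python
import math

def epsilon_exp_decay(episode, total_episodes, eps_start, eps_min, decay_rate):
    """Exponentially decay epsilon from eps_start toward eps_min."""
    progress = episode / total_episodes  # fraction of training completed
    return eps_min + (eps_start - eps_min) * math.exp(-decay_rate * progress)

# Settings of Experiment 6: 30,000 episodes, start 0.40, min 0.15, rate 1.5.
schedule = {
    ep: epsilon_exp_decay(ep, 30000, 0.40, 0.15, 1.5)
    for ep in (3000, 15000, 30000)
}
for ep, eps in schedule.items():
    print(f"episode={ep} | eps={eps:.4f}")
```

Under this form, ε only approaches `epsilon_min` asymptotically, which would explain why the logged value at the end of training is still above 0.15.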

###  Experiment 7: Deterministic Transition, Monte Carlo (task 1), 10×10 - (detMCup10)

- This experiment is an extended version of Experiment 6 on a 10×10 grid with deterministic transitions, random tie-breaking, and a decaying ε-greedy strategy. 

In [38]:
run_mc_experiment(
    env_class=FrozenLakeEnvDet,
    n=10,
    holes=holes_10x10,
    env_seed=123,
    mc_control_fn=mc_control_up,
    greedy_action_fn=greedy_action_up,
    print_policy_fn=print_policy_up,
    eval_fn=eval_greedy_up,
    mc_kwargs={
        "epsilon_start": 0.6,
        "epsilon_min": 0.2,
        "epsilon_decay_type": "exp",
        "epsilon_decay_rate": 1.0,
        "epsilon_decay_fraction": 0.8
    },
    train_episodes=80000,
    gamma=0.99,
    max_steps_train=300,
    eval_episodes=2000,
    max_steps_eval=300,
    verbose_every=10000,
    render_one_episode=True,
    render_pause=0.10,
    env_kwargs={}
)


Running MC experiment: grid=10x10 | holes=25 | hole_ratio=25.00%
Environment: FrozenLakeEnv | extra_args={}
MC implementation: monte_carlo_update.mc_control_first_visit_no_exploring_starts | mc_kwargs={'epsilon_start': 0.6, 'epsilon_min': 0.2, 'epsilon_decay_type': 'exp', 'epsilon_decay_rate': 1.0, 'epsilon_decay_fraction': 0.8}
[MC FV] episode=10000 | eps=0.5530 | greedy success_rate=0.000
[MC FV] episode=20000 | eps=0.5115 | greedy success_rate=0.000
[MC FV] episode=30000 | eps=0.4749 | greedy success_rate=0.000
[MC FV] episode=40000 | eps=0.4426 | greedy success_rate=0.000
[MC FV] episode=50000 | eps=0.4141 | greedy success_rate=0.000
[MC FV] episode=60000 | eps=0.3889 | greedy success_rate=0.000
[MC FV] episode=70000 | eps=0.3667 | greedy success_rate=0.000
[MC FV] episode=80000 | eps=0.3472 | greedy success_rate=0.000

Final greedy policy (grid):
S H U H U H H L R U
D L H D H D D H D H
L L L L L L L H L R
U U U L H R U U H U
H H U L H D U L U U
R R U U L L H D H D
L H H U U L L H

Explanations for Experiment 7:

- Despite these algorithmic improvements, the method failed to converge on the 10×10 grid: the greedy policy success rate remained zero even after 80,000 training episodes.

## probMCup4 (Slippery, MC updated, 4×4)

In [None]:
run_mc_experiment(
    env_class=FrozenLakeEnvProb,
    n=4,
    holes=holes_4x4,
    env_seed=40,
    mc_control_fn=mc_control_up,
    greedy_action_fn=greedy_action_up,
    print_policy_fn=print_policy_up,
    eval_fn=eval_greedy_up,
    mc_kwargs={
        "epsilon_start": 0.40,
        "epsilon_min": 0.15,
        "epsilon_decay_type": "exp",
        "epsilon_decay_rate": 1.5,
        "epsilon_decay_fraction": 0.8
    },
    train_episodes=30000,
    gamma=0.99,
    max_steps_train=100,
    eval_episodes=2000,
    max_steps_eval=100,
    verbose_every=3000,
    render_one_episode=True,
    render_pause=0.30,
    env_kwargs={"slip_prob": 0.2}
)


Running MC experiment: grid=4x4 | holes=4 | hole_ratio=25.00%
Environment: FrozenLakeEnv | extra_args={'slip_prob': 0.2}
MC implementation: monte_carlo_update.mc_control_first_visit_no_exploring_starts | mc_kwargs={'epsilon_start': 0.4, 'epsilon_min': 0.15, 'epsilon_decay_type': 'exp', 'epsilon_decay_rate': 1.5, 'epsilon_decay_fraction': 0.8}
[MC FV] episode=3000 | eps=0.3652 | greedy success_rate=0.780
[MC FV] episode=6000 | eps=0.3352 | greedy success_rate=0.755
[MC FV] episode=9000 | eps=0.3094 | greedy success_rate=0.735
[MC FV] episode=12000 | eps=0.2872 | greedy success_rate=0.705
[MC FV] episode=15000 | eps=0.2681 | greedy success_rate=0.690
[MC FV] episode=18000 | eps=0.2516 | greedy success_rate=0.730
[MC FV] episode=21000 | eps=0.2375 | greedy success_rate=0.730
[MC FV] episode=24000 | eps=0.2253 | greedy success_rate=0.710
[MC FV] episode=27000 | eps=0.2148 | greedy success_rate=0.715
[MC FV] episode=30000 | eps=0.2058 | greedy success_rate=0.770

Final greedy policy (grid)

## probMCup10 (Slippery, MC updated, 10×10)

In [None]:
run_mc_experiment(
    env_class=FrozenLakeEnvProb,
    n=10,
    holes=holes_10x10,
    env_seed=123,
    mc_control_fn=mc_control_up,
    greedy_action_fn=greedy_action_up,
    print_policy_fn=print_policy_up,
    eval_fn=eval_greedy_up,
    mc_kwargs={
        "epsilon_start": 0.6,
        "epsilon_min": 0.2,
        "epsilon_decay_type": "exp",
        "epsilon_decay_rate": 1.0,
        "epsilon_decay_fraction": 0.8
    },
    train_episodes=80000,
    gamma=0.99,
    max_steps_train=300,
    eval_episodes=2000,
    max_steps_eval=300,
    verbose_every=10000,
    render_one_episode=True,
    render_pause=0.10,
    env_kwargs={"slip_prob": 0.2}
)


Running MC experiment: grid=10x10 | holes=25 | hole_ratio=25.00%
Environment: FrozenLakeEnv | extra_args={'slip_prob': 0.2}
MC implementation: monte_carlo_update.mc_control_first_visit_no_exploring_starts | mc_kwargs={'epsilon_start': 0.6, 'epsilon_min': 0.2, 'epsilon_decay_type': 'exp', 'epsilon_decay_rate': 1.0, 'epsilon_decay_fraction': 0.8}
[MC FV] episode=10000 | eps=0.5530 | greedy success_rate=0.000
[MC FV] episode=20000 | eps=0.5115 | greedy success_rate=0.000
[MC FV] episode=30000 | eps=0.4749 | greedy success_rate=0.000
[MC FV] episode=40000 | eps=0.4426 | greedy success_rate=0.000
[MC FV] episode=50000 | eps=0.4141 | greedy success_rate=0.000
[MC FV] episode=60000 | eps=0.3889 | greedy success_rate=0.000
[MC FV] episode=70000 | eps=0.3667 | greedy success_rate=0.000
[MC FV] episode=80000 | eps=0.3472 | greedy success_rate=0.000

Final greedy policy (grid):
S H U H U H H L R U
D L H D H R L H D H
L L L L L U L H L R
U U L L H R D R H U
H H U L H U R U U U
D L U U R D H L H D

## detSARSA4 (Deterministic, SARSA, 4×4)

In [None]:
run_sarsa_experiment(
    env_class=FrozenLakeEnvDet,
    n=4,
    holes=holes_4x4,
    env_seed=40,
    sarsa_control_fn=sarsa_control_up,
    greedy_action_fn=greedy_action_sarsa,
    print_policy_fn=print_policy_sarsa,
    eval_fn=eval_greedy_sarsa,
    sarsa_kwargs={
        "epsilon": 0.10,
        "use_epsilon_decay": False,
    },
    train_episodes=30000,
    gamma=0.99,
    alpha=0.10,
    max_steps_train=100,
    eval_episodes=2000,
    max_steps_eval=100,
    verbose_every=3000,
    render_one_episode=True,
    render_pause=0.30,
    env_kwargs={}
)


Running SARSA experiment: grid=4x4 | holes=4 | hole_ratio=25.00%
Environment: FrozenLakeEnv | extra_args={}
SARSA implementation: sarsa.sarsa_control_epsilon_greedy | sarsa_kwargs={'epsilon': 0.1, 'use_epsilon_decay': False}
[SARSA] episode=3000 | eps=0.1000 | greedy success_rate=1.000
[SARSA] episode=6000 | eps=0.1000 | greedy success_rate=1.000
[SARSA] episode=9000 | eps=0.1000 | greedy success_rate=1.000
[SARSA] episode=12000 | eps=0.1000 | greedy success_rate=1.000
[SARSA] episode=15000 | eps=0.1000 | greedy success_rate=1.000
[SARSA] episode=18000 | eps=0.1000 | greedy success_rate=1.000
[SARSA] episode=21000 | eps=0.1000 | greedy success_rate=1.000
[SARSA] episode=24000 | eps=0.1000 | greedy success_rate=0.000
[SARSA] episode=27000 | eps=0.1000 | greedy success_rate=1.000
[SARSA] episode=30000 | eps=0.1000 | greedy success_rate=1.000

Final greedy policy (grid):
S L R L
D H D H
R R D H
H R R G

Final greedy success rate over 2000 episodes: 1.000
[Renderer] Image not found: bg_im

## detSARSA10 (Deterministic, SARSA, 10×10)

In [None]:
run_sarsa_experiment(
    env_class=FrozenLakeEnvDet,
    n=10,
    holes=holes_10x10,
    env_seed=123,
    sarsa_control_fn=sarsa_control_up,
    greedy_action_fn=greedy_action_sarsa,
    print_policy_fn=print_policy_sarsa,
    eval_fn=eval_greedy_sarsa,
    sarsa_kwargs={
        "use_epsilon_decay": True,
        "epsilon_start": 0.6,
        "epsilon_min": 0.2,
        "epsilon_decay_type": "exp",
        "epsilon_decay_rate": 1.0,
        "epsilon_decay_fraction": 0.8
    },
    train_episodes=80000,
    gamma=0.99,
    alpha=0.10,
    max_steps_train=300,
    eval_episodes=2000,
    max_steps_eval=300,
    verbose_every=10000,
    render_one_episode=True,
    render_pause=0.10,
    env_kwargs={}
)


Running SARSA experiment: grid=10x10 | holes=25 | hole_ratio=25.00%
Environment: FrozenLakeEnv | extra_args={}
SARSA implementation: sarsa.sarsa_control_epsilon_greedy | sarsa_kwargs={'use_epsilon_decay': True, 'epsilon_start': 0.6, 'epsilon_min': 0.2, 'epsilon_decay_type': 'exp', 'epsilon_decay_rate': 1.0, 'epsilon_decay_fraction': 0.8}
[SARSA] episode=10000 | eps=0.5530 | greedy success_rate=0.000
[SARSA] episode=20000 | eps=0.5115 | greedy success_rate=0.000
[SARSA] episode=30000 | eps=0.4749 | greedy success_rate=0.000
[SARSA] episode=40000 | eps=0.4426 | greedy success_rate=0.000
[SARSA] episode=50000 | eps=0.4141 | greedy success_rate=0.000
[SARSA] episode=60000 | eps=0.3889 | greedy success_rate=0.000
[SARSA] episode=70000 | eps=0.3667 | greedy success_rate=0.000
[SARSA] episode=80000 | eps=0.3472 | greedy success_rate=1.000

Final greedy policy (grid):
S H U H U H H R U R
D D H D H D D H U H
R R D L R D D H U L
U R D L H R R D H D
H H D D H D R R R D
D R R D D D H U H D
D H H 