# FIT5226 Assignment 2 - Deep Q Learning

Team Name: Simple <br>

---


Team Members: Satoshi Kashima, Shosuke Asano, Tanul Gupta, Felix Tay Shi Hong

**<p>Action:</p>**

We have 5 actions: moving left, right, down, or up, and collecting the item. The movement actions are available in every state except where they would leave the grid (e.g. at the top-right corner, only left and down are available). The collect action is available only when the agent is at the item location and has not yet collected the item.

In [1]:
from enum import Enum


class Action(Enum):
    # NOTE: QValue matrix used these action values as their indices
    LEFT = 0
    RIGHT = 1
    DOWN = 2
    UP = 3

    # actions when agent just got the item and is moving to item_reached state
    COLLECT = 4  # goes to item reached state
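
As a minimal sketch of how the enum values can serve as Q-value indices, the snippet below picks the greedy action among the available ones only. The `greedy_action` helper and the Q-values are hypothetical and purely illustrative; they are not part of the assignment code.

```python
from enum import Enum


class Action(Enum):
    # Mirrors the enum above so that this sketch runs on its own
    LEFT = 0
    RIGHT = 1
    DOWN = 2
    UP = 3
    COLLECT = 4


def greedy_action(q_values: list[float], available: list[Action]) -> Action:
    """Hypothetical helper: return the available action with the largest Q-value."""
    # Each action's enum value is its index into the Q-value list
    return max(available, key=lambda action: q_values[action.value])


# Top-right corner, not on the item: only LEFT and DOWN are available
q_values = [0.2, 0.9, 0.5, 0.1, 0.7]  # made-up values, indexed by Action.value
best = greedy_action(q_values, [Action.LEFT, Action.DOWN])
print(best)
```

RIGHT has the largest Q-value overall here, but it is never considered because it is not in the available list.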

**<p>Environment Setting</p>**


Assignment2State extends the State class with the attributes goal_location, goal_direction, and item_direction. These give the agent more context about its environment, such as the direction to the goal and to the item. Together with the inherited attributes, this brings the state space to 11 inputs to our DQN.

In [2]:
from dataclasses import dataclass

@dataclass(kw_only=True)
class State:
    # it does not hold AgentObject / ItemObject because I want State to be immutable
    # but in the future, we might want to add more attributes to State
    # in that case we need to make a copy of the AgentObject / ItemObject
    agent_location: tuple[int, int]
    item_location: tuple[int, int]
    has_item: bool = False


@dataclass(kw_only=True)
class Assignment2State(State):
    goal_location: tuple[int, int]

    # https://edstem.org/au/courses/17085/discussion/2192014
    # these two attributes should be vectors (does not need to be unit vectors as we use cos distance)
    # but these should be in terms of coordinates, not indices so be careful
    goal_direction: tuple[float, float]
    item_direction: tuple[float, float]

**<p>Environment Setup</p>**

<p>The Assignment2Environment class is a wrapper that manages multiple sub-environments. Each sub-environment has a different goal and item location, providing more varied scenarios for training the agent. The class handles the initialization of these environments and their state transitions, using the extended state information provided by Assignment2State.
It also defines a direction-based reward using cosine similarity (get_direction_reward), intended to give the agent directional cues towards the goal and the item. In the code below, the active get_reward returns only the sub-environment reward; the variant that adds the direction reward is commented out.</p>
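
To illustrate the direction-based reward, here is a minimal sketch of a cosine-similarity score between the direction of a move and the direction of the target. The `direction_reward` function is a hypothetical standalone helper written with NumPy; it is not the class method itself. The default multiplier of 10 matches the default of `direction_reward_multiplier` in the environment code.

```python
import numpy as np


def direction_reward(action_direction, target_direction, multiplier=10.0):
    """Cosine similarity between the two vectors, scaled by a multiplier."""
    a = np.asarray(action_direction, dtype=float)
    t = np.asarray(target_direction, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(t)
    if norm_product == 0:
        return 0.0  # a zero vector has no direction, so no reward
    return float(multiplier * np.dot(a, t) / norm_product)


# Moving RIGHT (1, 0) while the target lies at offset (3, 4):
# cosine similarity = 3 / 5 = 0.6, scaled by 10
reward = direction_reward((1, 0), (3, 4))
print(reward)
```

Moving directly away from the target gives a negative score, and an action with no direction (such as collecting) gives zero.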

<p>The Environment class itself has been enhanced to support more flexible initializations and interactions. It includes additional parameters for penalties and rewards and improved handling of animations for visualizing agent movements and decisions.</p>

**<p>Visualization</p>**

<p>The visualization in the environment is set up to represent the agent's behavior within an n x n grid world. The agent (A) moves around the grid to pick up an item (I) and reach a goal (G), while the entire process is animated using Matplotlib. The visualizations are designed to provide intuitive feedback about the agent's decision-making process and its learning progress.</p>

**<p>Initialization:</p>**

<ol>
<li>The environment is initialized with the parameters n (the grid is n x n) and with_animation (a boolean indicating whether to show animations).</li>
<li>At the start of each episode, the initialize_for_new_episode function creates the plot using Matplotlib and calls the animate function to draw the elements inside the plot.</li>
</ol>


animate() Function:

Setup:
*   Sets up a grid of size n x n and displays it using Matplotlib.

Icons:
<ol>
<li>A (Agent): Represents the "person" or the agent navigating through the grid.</li>
<li>I (Item): Represents the "item" that the agent needs to collect.</li>
<li>G (Goal): Represents the "goal" or the final destination that the agent needs to reach after collecting the item.</li>
<li>Legend: A legend is displayed to help identify the icons (A, I, G) on the grid.</li>
</ol>

Positions:
*   The goal (G) is now positioned randomly in the grid.


Agent State Change:
*   When the agent picks up the item (I), its color changes, indicating that it is now carrying the item. The legend is updated to reflect this change.

Visual Updates:
*   A small delay (0.7 seconds) is added to allow visualization of each movement or state change on the grid, providing a smooth animation experience.


step() Function:


*   This function moves the agent within the grid.
*   The agent selects an action (moving left, right, up, or down, or collecting the item).
*   The action is performed if it is valid (e.g., staying within grid boundaries), and the agent's state is updated accordingly.
*   After the action is performed, the animate function is called again to display the new state of the environment.


**<p>Reward Structure</p>**

For the reward structure, please see the `get_reward` function of the `Environment` class. We give an item collection reward of `self.item_state_reward` when the agent collects the item for the first time at the item location, and a penalty of `self.item_revisit_penalty` for revisiting the item position. We assign a large goal reward of `self.goal_state_reward` for reaching the goal with the item. The goal state is only valid once the item has been collected, so reaching the goal without the item incurs a large penalty of `self.goal_no_item_penalty` to discourage this. In addition, a time penalty of `self.time_penalty` is applied on every step to encourage completing the task efficiently.
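
As a small worked example of the arithmetic, the sketch below covers only the time penalty and the two goal cases, using the default constants from the code cell that follows. The `goal_step_reward` function is a hypothetical simplification for illustration and does not replace `get_reward`.

```python
TIME_PENALTY = -1
GOAL_STATE_REWARD = 200
GOAL_NO_ITEM_PENALTY = -300


def goal_step_reward(at_goal: bool, has_item: bool) -> int:
    """Hypothetical simplification: time penalty plus the goal-related terms only."""
    reward = TIME_PENALTY  # applied on every step
    if at_goal and not has_item:
        return reward + GOAL_NO_ITEM_PENALTY
    if at_goal and has_item:
        reward += GOAL_STATE_REWARD
    return reward


ordinary_step = goal_step_reward(at_goal=False, has_item=False)   # -1
goal_with_item = goal_step_reward(at_goal=True, has_item=True)    # -1 + 200
goal_without_item = goal_step_reward(at_goal=True, has_item=False)  # -1 - 300
print(ordinary_step, goal_with_item, goal_without_item)
```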


In [3]:
from __future__ import annotations
from abc import ABC
from random import randint, choice
from typing import cast

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cosine


DEFAULT_TIME_PENALTY = -1
GOAL_STATE_REWARD = 200
DEFAULT_ITEM_REWARD = 300
DEFAULT_ITEM_REVISIT_PENALTY = -200
DEFAULT_GOAL_NO_ITEM_PENALTY = -300


class Environment:
    def __init__(
        self,
        n: int = 5,
        item: ItemObject | None = None,
        goal_location: tuple[int, int] = (4, 0),
        time_penalty: int | float = DEFAULT_TIME_PENALTY,
        item_state_reward: int | float = DEFAULT_ITEM_REWARD,
        goal_state_reward: int | float = GOAL_STATE_REWARD,
        item_revisit_penalty: int | float = DEFAULT_ITEM_REVISIT_PENALTY,
        goal_no_item_penalty: int | float = DEFAULT_GOAL_NO_ITEM_PENALTY,
        with_animation: bool = True,
    ) -> None:
        self.n = n
        self.goal_location = goal_location
        self.time_penalty = time_penalty
        self.item_state_reward = item_state_reward
        self.goal_state_reward = goal_state_reward
        self.item_revisit_penalty = item_revisit_penalty
        self.goal_no_item_penalty = goal_no_item_penalty

        self.item = ItemObject() if item is None else item
        self.agent = AgentObject()

        if self.item.location is None:
            self.item.set_location_randomly(self.n, self.n)

        self.state: State
        # TODO: possibly implement this if there are multiple GridObjects to check for
        # initialize grid and put grid objects on the grid
        # x_agent, y_agent = self.agent.location
        # x_item, y_item = self.item.location
        # self.grid = np.zeros((self.n, self.n))
        # self.grid[x_agent, y_agent] = self.agent
        # self.grid[x_item, y_item] = self.item

        # Setup for animation
        self.with_animation = with_animation

    def initialize_for_new_episode(self, agent_location: tuple[int, int] | None = None) -> None:
        if agent_location is None:
            self.agent.set_location_randomly(self.n, self.n,)
        else:
            self.agent.location = agent_location
        self.agent.has_item = False if randint(0, 1) == 0 else True
        self.state = State(
            agent_location=self.agent.get_location(),
            item_location=self.item.get_location(),
            has_item=self.agent.has_item,
        )

        # ensure that no multiple matplotlib windows open
        if hasattr(self, "fig"):
            plt.close(self.fig)  # type: ignore
        self.fig, self.ax = plt.subplots(figsize=(8, 8)) if self.with_animation else (None, None)
        self.animate()  # Initial drawing of the grid

        # Reset the last action and reward
        self.last_action = None
        self.last_reward = None


    def get_state(self) -> State:
        return self.state

    def set_with_animation(self, with_animation: bool) -> None:
        self.with_animation = with_animation

    def get_available_actions(self, state: State | None = None) -> list[Action]:
        """
        Assumes that the current state is not the goal state
        """
        # logic to determine available actions
        actions = []
        current_state = state if state is not None else self.state
        x, y = current_state.agent_location

        if current_state.agent_location == current_state.item_location and not current_state.has_item:
            actions.append(Action.COLLECT)

        # note: technically speaking we know that whenever agent is at the item location, the only available (or, the most optimal) action is to collect the item
        # however, according to the CE, we must ensure that
        # "the agent is supposed to learn (rather than being told) that
        # once it has picked up the load it needs to move to the delivery point to complete its mission. ",
        # implying that agent must be able to learn to "collect" instead of being told to collect (so add all possible actions)
        if x > 0:
            actions.append(Action.LEFT)  # left
        if x < self.n - 1:
            actions.append(Action.RIGHT)  # right
        if y > 0:
            actions.append(Action.DOWN)  # down
        if y < self.n - 1:
            actions.append(Action.UP)  # up

        return actions

    def get_reward(self, prev_state: State, current_state: State, action: Action) -> float:
        """
        Calculate the reward based on the agent's actions and state transitions.
        """
        reward = self.time_penalty

        # Large penalty for reaching the goal without the item
        if current_state.agent_location == self.goal_location and not current_state.has_item:
            reward += self.goal_no_item_penalty
            return reward

        # Large reward for reaching the goal with the item
        if self.is_goal_state(current_state):
            reward += self.goal_state_reward

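        # Reward for being on the item location after this step while not having held the item before it
        if not prev_state.has_item and current_state.agent_location == current_state.item_location:
            reward += self.item_state_reward
        # Reward for taking COLLECT on the item location without already holding the item
        # NOTE: both conditions hold for a valid COLLECT, so the item reward is added twice in that case
        if action == Action.COLLECT and prev_state.agent_location == current_state.item_location and not prev_state.has_item:
            reward += self.item_state_reward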

        # Penalty for revisiting item location
        if action == Action.COLLECT and (prev_state.has_item or prev_state.agent_location != current_state.item_location):
            reward += self.item_revisit_penalty
        if prev_state.has_item and current_state.agent_location == current_state.item_location:
            reward += self.item_revisit_penalty

        return reward

    def update_state(self, action: Action) -> None:
        """
        Be careful: this method updates the state of the environment
        """
        self.agent.move(action, self.n)
        self.state = State(
            agent_location=self.agent.get_location(),
            item_location=self.item.get_location(),
            has_item=self.agent.has_item,
        )

    def is_goal_state(self, state: State) -> bool:
        # Use the state passed in (not self.state) so the check is correct for any state
        return state.has_item and self.goal_location == state.agent_location

    def animate(self, state: Assignment2State | None = None, prev_state: Assignment2State | None = None, is_greedy: bool | None = None, all_qvals: np.ndarray | None = None, chosen_action: Action | None = None) -> None:
        """
        Animates the action
        (basically just prints out the new state, but because it seems like the agent is "moving" because it's updated in the same figure)
        """
        if not self.with_animation:
            return
        self.ax.clear()
        self.ax.set_xlim(0, self.n)
        self.ax.set_ylim(0, self.n)
        self.ax.set_xticks(np.arange(0, self.n + 1, 1))
        self.ax.set_yticks(np.arange(0, self.n + 1, 1))
        self.ax.grid(True)

        # Plotting the agent, item, and goal
        self.ax.text(
            self.agent.location[0] + 0.5,
            self.agent.location[1] + 0.5,
            "A",
            ha="center",
            va="center",
            fontsize=16,
            color="blue" if not self.agent.has_item else "purple",
        )
        self.ax.text(
            self.item.location[0] + 0.5,
            self.item.location[1] + 0.5,
            "I",
            ha="center",
            va="center",
            fontsize=16,
            color="green",
        )
        self.ax.text(
            self.goal_location[0] + 0.5,
            self.goal_location[1] + 0.5,
            "G",
            ha="center",
            va="center",
            fontsize=16,
            color="red",
        )

        # FIXME: this doesn't work for State (only works for Assignment2State)
        state_str = str(self.state) if state is None else str(state)
        state_text = "".join(state_str.split(',')[:5]) + '\n' + "".join(state_str.split(',')[5:])

        # show state info
        self.ax.text(
            2,
            2,
            state_text,
            ha="center",
            va="center",
            fontsize=10,
            color="orange",
        )

        # prints: if the action selected was greedy or random
        if is_greedy is not None:
            self.ax.text(
                self.n,
                self.n,
                "Action is greedy" if is_greedy else "Action is random",
                ha="center",
                va="center",
                fontsize=10,
                color="black",
            )

        # prints the q values for all possible actions in the previous state
        # note: this is only printed if the action was greedy (because if random, the q values did not matter for action selection)
        # note2: only "possible" actions are printed i.e. (if agent is not at the item position, it does not print the collect q value)
        if all_qvals is not None and prev_state is not None and is_greedy:
            left_q, right_q, down_q, up_q, collect_q = all_qvals
            possible_actions = self.get_available_actions(prev_state)
            # show left q value
            prev_agent_location_on_plot_x = prev_state.agent_location[0] + 0.5
            prev_agent_location_on_plot_y = prev_state.agent_location[1] + 0.5
            box_center_to_val_location = 0.3
            if Action.LEFT in possible_actions:
                self.ax.text(
                    prev_agent_location_on_plot_x - box_center_to_val_location,
                    prev_agent_location_on_plot_y,
                    f'{left_q:.2f}',
                    ha="center",
                    va="center",
                    fontsize=13,
                    color="red" if chosen_action == Action.LEFT else "black",
                )
            if Action.RIGHT in possible_actions:
                self.ax.text(
                    prev_agent_location_on_plot_x + box_center_to_val_location,
                    prev_agent_location_on_plot_y,
                    f'{right_q:.2f}',
                    ha="center",
                    va="center",
                    fontsize=13,
                    color="red" if chosen_action == Action.RIGHT else "black",
                )
            if Action.DOWN in possible_actions:
                self.ax.text(
                    prev_agent_location_on_plot_x,
                    prev_agent_location_on_plot_y - box_center_to_val_location,
                    f'{down_q:.2f}',
                    ha="center",
                    va="center",
                    fontsize=13,
                    color="red" if chosen_action == Action.DOWN else "black",
                )
            if Action.UP in possible_actions:
                self.ax.text(
                    prev_agent_location_on_plot_x,
                    prev_agent_location_on_plot_y + box_center_to_val_location,
                    f'{up_q:.2f}',
                    ha="center",
                    va="center",
                    fontsize=13,
                    color="red" if chosen_action == Action.UP else "black",
                )
            if Action.COLLECT in possible_actions:
                self.ax.text(
                    prev_agent_location_on_plot_x,
                    prev_agent_location_on_plot_y,
                    f'{collect_q:.2f}',
                    ha="center",
                    va="center",
                    fontsize=13,
                    color="red" if chosen_action == Action.COLLECT else "black",
                )



        # TODO: add a message saying "item collected" if the agent has collected the item
        # or else there is a single frame where the agent is at the same location twice,
        # so it looks like the agent is not moving
        handles = [
            plt.Line2D([0], [0], marker="o", color="w", markerfacecolor="blue", markersize=8, label="Agent (A)")
            if not self.agent.has_item
            else plt.Line2D(
                [0], [0], marker="o", color="w", markerfacecolor="purple", markersize=8, label="Agent (A) with item"
            ),
            plt.Line2D([0], [0], marker="o", color="w", markerfacecolor="green", markersize=8, label="Item (I)"),
            plt.Line2D([0], [0], marker="o", color="w", markerfacecolor="red", markersize=8, label="Goal (G)"),
        ]
        self.ax.legend(handles=handles, loc="center left", bbox_to_anchor=(1, 0.5))

        plt.subplots_adjust(right=0.75, left=0.1)
        self.fig.canvas.draw_idle()
        plt.pause(0.7)  # Pause to allow visualization of the movement

    def step(self, action: Action) -> tuple[float, State]:
        prev_state = self.get_state()
        self.update_state(action)
        next_state = self.get_state()
        self.animate()
        reward = self.get_reward(prev_state, next_state, action)
        return reward, next_state


class Assignment2Environment:
    """
    A wrapper class for multiple environments for Assignment 2
    This environment consists of multiple "sub-environments" where each sub-environment has a different goal and item location
    """
    def __init__(
        self,
        n: int = 5,
        time_penalty: int | float = DEFAULT_TIME_PENALTY,
        item_state_reward: int | float = DEFAULT_ITEM_REWARD,
        goal_state_reward: int | float = GOAL_STATE_REWARD,
        direction_reward_multiplier: int | float = 10,
        with_animation: bool = True,
    ) -> None:
        self.n = n
        # initialize a list of environments for all possible goal and item positions
        self.environments = []

        for goal_x in range(self.n):
            for goal_y in range(self.n):
                for item_x in range(self.n):
                    for item_y in range(self.n):
                        if (goal_x, goal_y) == (item_x, item_y):
                            continue
                        environment = Environment(
                            n=self.n,
                            goal_location=(goal_x, goal_y),
                            item=ItemObject(location=(item_x, item_y)),
                            with_animation=with_animation,
                            time_penalty=time_penalty,
                            item_state_reward=item_state_reward,
                            goal_state_reward=goal_state_reward,
                        )
                        self.environments.append(environment)

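        # NOTE: this keeps only the sub-environment at index 10, so every episode uses the same
        # goal/item pair; remove this line to sample from all goal/item combinations
        self.environments = [self.environments[10]]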


        self.direction_reward_multiplier = direction_reward_multiplier

        self.current_sub_environment: Environment
        self.state: Assignment2State

    def get_random_sub_environment(self) -> Environment:
        return choice(self.environments)

    def initialize_for_new_episode(self, agent_location: tuple[int, int] | None = None, index: int | None = None) -> None:
        self.current_sub_environment = self.get_random_sub_environment() if index is None else self.environments[index]
        self.current_sub_environment.initialize_for_new_episode(agent_location)

        self.state = Assignment2State(
            agent_location=self.current_sub_environment.agent.get_location(),
            item_location=self.current_sub_environment.item.get_location(),
            has_item=self.current_sub_environment.agent.has_item,
            goal_location=self.current_sub_environment.goal_location,
            goal_direction=self.get_goal_direction(),
            item_direction=self.get_item_direction(),
        )
        # NOTE: animation should be handled by individual sub-environments

    def get_available_actions(self, state:Assignment2State) -> list[Action]:
        return self.current_sub_environment.get_available_actions(state)

    def set_with_animation(self, with_animation: bool) -> None:
        for environment in self.environments:
            environment.set_with_animation(with_animation)

    def get_direction_reward(self, action: Action) -> float:
        """
        Use cosine similarity to calculate the reward based on the direction of the action.
        """
        has_collected_item = self.state.has_item

        # Define action direction vectors
        if action == Action.LEFT:
            action_direction = (-1, 0)
        elif action == Action.RIGHT:
            action_direction = (1, 0)
        elif action == Action.DOWN:
            action_direction = (0, -1)
        elif action == Action.UP:
            action_direction = (0, 1)
        else:
            action_direction = (0, 0)  # COLLECT has no direction, so it earns no direction reward

        # Calculate the direction reward based on the goal or item direction
        if has_collected_item:
            target_direction = self.state.goal_direction
        else:
            target_direction = self.state.item_direction

        # Check if either vector is zero to avoid division by zero
        if np.linalg.norm(action_direction) == 0 or np.linalg.norm(target_direction) == 0:
            return 0.0  # No direction reward if either vector is zero

        # Calculate the cosine similarity (1 - cosine distance)
        try:
            reward = 1 - cosine(action_direction, target_direction)
        except ValueError:
            # Handle any errors from invalid vectors
            reward = 0.0

        return reward * self.direction_reward_multiplier


    # def get_reward(self, prev_state: Assignment2State, current_state: Assignment2State, action: Action) -> float:
    #     state_reward = self.current_sub_environment.get_reward(prev_state, current_state, action)
    #     action_reward = self.get_direction_reward(action)
    #     return state_reward + action_reward
    def get_reward(self, prev_state: Assignment2State, current_state: Assignment2State, action: Action) -> float:
        state_reward = self.current_sub_environment.get_reward(prev_state, current_state, action)
        return state_reward


    def get_state(self) -> Assignment2State:
        return self.state

    def get_goal_direction(self) -> tuple[float, float]:
        return (
            self.current_sub_environment.goal_location[0] - self.current_sub_environment.agent.get_location()[0],
            self.current_sub_environment.goal_location[1] - self.current_sub_environment.agent.get_location()[1],
        )

    def get_item_direction(self) -> tuple[float, float]:
        return (
            self.current_sub_environment.item.get_location()[0] - self.current_sub_environment.agent.get_location()[0],
            self.current_sub_environment.item.get_location()[1] - self.current_sub_environment.agent.get_location()[1],
        )

    def is_goal_state(self, state: State) -> bool:
        # Use the state passed in so the check is correct for any state
        return state.has_item and self.current_sub_environment.goal_location == state.agent_location

    def update_state(self, action: Action) -> None:
        """
        Be careful: this method updates the state of the environment
        """
        self.current_sub_environment.update_state(action)
        self.state = Assignment2State(
            agent_location=self.current_sub_environment.agent.get_location(),
            item_location=self.current_sub_environment.item.get_location(),
            has_item=self.current_sub_environment.agent.has_item,
            goal_location=self.current_sub_environment.goal_location,
            goal_direction=self.get_goal_direction(),
            item_direction=self.get_item_direction(),
        )

    def step(self, action: Action, is_greedy: bool, all_qvals: np.ndarray) -> tuple[float, Assignment2State]:
        prev_state = self.get_state()
        self.update_state(action)
        next_state = self.get_state()
        self.current_sub_environment.animate(self.get_state(), prev_state, is_greedy, all_qvals, action)
        reward = self.get_reward(prev_state, next_state, action)
        return reward, next_state


class GridObject(ABC):
    def __init__(self, location: tuple[int, int] | None = None) -> None:
        self.icon: str
        self.location = (
            location  # NOTE: location is a tuple of (x, y) where x and y are coordinates on the grid (not indices)
        )

    def set_location_randomly(
        self, max_x: int, max_y: int, disallowed_locations: list[tuple[int, int]] | None = None
    ) -> tuple[int, int]:
        """
        Note: max_x and max_y are exclusive

        disallowed_locations: list of locations that are not allowed to be placed
        (e.g. agent and item location should not be initialized to the same place)
        """
        # Avoid a mutable default argument; None means no location is disallowed
        disallowed = disallowed_locations or []
        # The start, item, goal location must be different position
        location = None
        while location is None or location in disallowed:
            location = (randint(0, max_x - 1), randint(0, max_y - 1))

        self.location = location
        return location

    def get_location(self) -> tuple[int, int]:
        if self.location is None:
            raise ValueError("Location is not set")
        return self.location


class AgentObject(GridObject):
    def __init__(self, location: tuple[int, int] | None = None) -> None:
        super().__init__(location)
        self.icon = "A"
        self.has_item = False  # TODO: has_item of AgentObject and State must be synched somehow

    def move(self, action: Action, grid_size: int) -> None:
        """
        Move the agent based on the given action while ensuring it doesn't leave the bounds of the grid.
        """
        if self.location is None:
            raise ValueError("Agent location is not set")

        x, y = self.location

        # Check each action and ensure it stays within the bounds
        if action == Action.LEFT:
            if x > 0:  # Ensure not moving out of bounds on the left
                self.location = (x - 1, y)
        elif action == Action.RIGHT:
            if x < grid_size - 1:  # Ensure not moving out of bounds on the right
                self.location = (x + 1, y)
        elif action == Action.DOWN:
            if y > 0:  # Ensure not moving out of bounds downwards
                self.location = (x, y - 1)
        elif action == Action.UP:
            if y < grid_size - 1:  # Ensure not moving out of bounds upwards
                self.location = (x, y + 1)
        elif action == Action.COLLECT:
            self.has_item = True  # Action to collect the item (no bounds check needed)



class ItemObject(GridObject):
    def __init__(self, location: tuple[int, int] | None = None):
        super().__init__(location)
        self.icon = "I"

**<p>Replay Buffer</p>**

<p>BaseReplayBuffer is a base class that provides a common interface and basic functionality: a deque that stores experiences and the maximum number of experiences the buffer can hold. The remember function adds a new experience to the buffer and removes the oldest one once the buffer has reached its maximum size. The abstract method sample_batch is implemented by subclasses to sample a batch of experiences from the buffer, so that multiple experiences can be retrieved at once for training the DQN agent.</p>

<p>The ReplayBuffer class samples a random subset of experiences from the buffer without replacement and unpacks them into separate sequences of states, actions, rewards, next states and done flags using zip(*batch).</p>
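
The sampling and unpacking step can be sketched in a few lines. This is an illustration with made-up experiences, not the class itself; it also uses the standard library's `deque(maxlen=...)`, which discards the oldest entry automatically, as an alternative to the manual `popleft()` in the code below.

```python
import random
from collections import deque

# Each experience is (state, action, reward, next_state, done); values are made up
buffer = deque(maxlen=5)
for i in range(8):
    buffer.append(([i], i % 5, -1.0, [i + 1], False))

# Only the 5 most recent experiences remain in the buffer
batch = random.sample(list(buffer), 3)  # uniform sampling without replacement

# zip(*batch) turns a list of experience tuples into one tuple per field
states, actions, rewards, next_states, dones = zip(*batch)
print(len(states), len(actions))
```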

<p>The PrioritizedExperienceBuffer class extends BaseReplayBuffer to implement prioritized experience replay. This technique assigns each experience a priority so that the agent samples the more significant experiences more frequently. The hyperparameter alpha controls the degree to which priorities influence the sampling probabilities: each priority is raised to the power alpha and the results are normalized to sum to 1.</p>
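
The effect of alpha can be seen in a minimal sketch of the probability computation. The `sampling_probabilities` helper and the priority values are hypothetical; only the formula (priority to the power alpha, then normalized) follows the buffer code.

```python
import numpy as np


def sampling_probabilities(priorities: np.ndarray, alpha: float) -> np.ndarray:
    """P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = priorities ** alpha
    return scaled / scaled.sum()


priorities = np.array([1.0, 2.0, 4.0])  # made-up priorities
uniform = sampling_probabilities(priorities, alpha=0.0)       # priorities are ignored
proportional = sampling_probabilities(priorities, alpha=1.0)  # p_i / sum(p)
default = sampling_probabilities(priorities, alpha=0.6)       # in between the two
print(uniform, proportional, default)
```

With alpha = 0 every experience is equally likely, and with alpha = 1 the probabilities are directly proportional to the priorities; the default of 0.6 sits between the two.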

In [4]:
from abc import ABC, abstractmethod
from collections import deque
import random

import numpy as np

Experience = tuple[np.ndarray, int, float, np.ndarray, bool]


class BaseReplayBuffer(ABC):
    def __init__(self, max_size: int):
        self.max_size = max_size
        self.buffer: deque[Experience] = deque([])

    def remember(self, experience: Experience) -> None:
        if len(self.buffer) >= self.max_size:
            self.buffer.popleft()  # Remove the oldest experience
        self.buffer.append(experience)

    @abstractmethod
    def sample_batch(self, batch_size: int) -> tuple[list[np.ndarray], list[int], list[float], list[np.ndarray], list[bool]]:
        raise NotImplementedError


class ReplayBuffer(BaseReplayBuffer):
    def __init__(self, max_size: int):
        # Call the base initializer so that self.buffer is created
        super().__init__(max_size)

    def sample_batch(self, batch_size: int) -> tuple[list[np.ndarray], list[int], list[float], list[np.ndarray], list[bool]]:
        """
        Sample a batch of experiences from the replay memory.
        """
        batch = random.sample(self.buffer, batch_size)
        state_batch, action_batch, reward_batch, next_state_batch, done_batch = zip(*batch)
        return (
            state_batch,
            action_batch,
            reward_batch,
            next_state_batch,
            done_batch,
        )


class PrioritizedExperienceBuffer(BaseReplayBuffer):
    def __init__(self, max_size: int, alpha: float = 0.6):
        """
        alpha for prioritization
        """
        super().__init__(max_size)
        self.alpha = alpha
        self.priorities = np.ones(max_size, dtype=np.float32)

        self.sampled_indices: np.ndarray[int] | None = None

    def remember(self, experience: Experience) -> None:
        if len(self.buffer) >= self.max_size:
            self.buffer.popleft()
            self.priorities[:-1] = self.priorities[1:]

        self.buffer.append(experience)
        max_priority = self.priorities.max() if self.buffer else 1.0  # max to ensure newly added experience has the highest priority
        self.priorities[len(self.buffer) - 1] = max_priority

    def sample_batch(self, batch_size: int) -> tuple[list[np.ndarray], list[int], list[float], list[np.ndarray], list[bool]]:
        """
        Sample a batch of experiences from the replay memory.
        """
        priorities = self.priorities[: len(self.buffer)]
        probabilities = priorities ** self.alpha
        probabilities /= probabilities.sum()

        self.sampled_indices = np.random.choice(len(self.buffer), batch_size, p=probabilities)
        samples = [self.buffer[idx] for idx in self.sampled_indices]
        state_batch, action_batch, reward_batch, next_state_batch, done_batch = zip(*samples)
        return (
            state_batch,
            action_batch,
            reward_batch,
            next_state_batch,
            done_batch,
        )

    def update_priorities(self, priorities: np.ndarray) -> None:
        if self.sampled_indices is None:
            raise ValueError("You need to sample a batch before updating priorities")
        for idx, priority in zip(self.sampled_indices, priorities):
            self.priorities[idx] = priority
        self.sampled_indices = None
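
As a rough illustration of how `PrioritizedExperienceBuffer.sample_batch` turns priorities into sampling probabilities, the sketch below applies the same `priorities ** alpha` normalisation to a toy priority vector. The priority values and the helper name `sampling_probabilities` are made up for this example and are not part of the assignment code.

```python
import numpy as np


def sampling_probabilities(priorities: np.ndarray, alpha: float = 0.6) -> np.ndarray:
    """Return P(i) = p_i^alpha / sum_k p_k^alpha (hypothetical helper)."""
    scaled = priorities ** alpha
    return scaled / scaled.sum()


# Toy priorities: the last experience had a much larger loss than the others.
toy_priorities = np.array([1.0, 1.0, 1.0, 16.0], dtype=np.float32)

uniform = sampling_probabilities(toy_priorities, alpha=0.0)       # alpha = 0 ignores priorities
prioritized = sampling_probabilities(toy_priorities, alpha=0.6)   # default alpha of the buffer above
proportional = sampling_probabilities(toy_priorities, alpha=1.0)  # proportional to the raw priority

print(uniform, prioritized, proportional)
```

Larger `alpha` concentrates sampling on high-loss experiences, while `alpha = 0` falls back to the uniform sampling of `ReplayBuffer`.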

**<p>DQN agent that utilises a Deep Q-Network</p>**

In [5]:
import random
import numpy as np
import torch
import copy
from typing import List, Tuple
import sys
from copy import deepcopy

class DQNAgent:
    def __init__(
        self,
        statespace_size: int = 11,
        action_space_size: int = len(Action),
        alpha: float = 0.0005,
        discount_rate: float = 0.999,
        epsilon: float = 1,
        epsilon_decay: float = 0.999,
        epsilon_min: float = 0.007,
        replay_memory_size: int = 10000,
        batch_size: int = 128,
        min_replay_memory_size: int = 1000,
        tau: float = 0.05,
        with_log: bool = False,
        loss_log_interval: int = 100,
    ) -> None:
        """
        Initialize the DQN Agent
        """
        self.alpha = alpha  # learning rate for optimizer
        self.discount_rate = discount_rate
        self.epsilon = epsilon  # exploration rate
        self.epsilon_decay = epsilon_decay  # rate at which exploration rate decays
        self.epsilon_min = epsilon_min

        self.batch_size = batch_size
        self.action_space_size = action_space_size
        self.min_replay_memory_size = min_replay_memory_size

        self.replay_buffer = PrioritizedExperienceBuffer(max_size=replay_memory_size)

        # Initialize DQN models
        self.model = self.prepare_torch(statespace_size)  # prediction model
        self.target_model = deepcopy(self.model)  # target model

        self.optimizer = torch.optim.AdamW(self.model.parameters(), lr=self.alpha, amsgrad=True)
        self.scheduler = torch.optim.lr_scheduler.StepLR(self.optimizer, step_size=1000, gamma=0.9)
        self.loss_fn = torch.nn.MSELoss(reduction='none')
        self.steps = 0

        self.tau = tau  # for soft update of target parameters

        self.with_log = with_log
        self.loss_log_interval = loss_log_interval

    def prepare_torch(self, statespace_size: int):
        """
        Prepare the PyTorch model for DQL.
        """
        l1 = statespace_size
        l2 = 150
        l3 = 100
        l4 = self.action_space_size
        model = torch.nn.Sequential(
            torch.nn.Linear(l1, l2),
            torch.nn.ReLU(),
            torch.nn.Linear(l2, l3),
            torch.nn.ReLU(),
            torch.nn.Linear(l3, l4)
        )
        return model

    def update_target_network(self) -> None:
        """
        Soft-update the target network towards the prediction network:
        target = tau * prediction + (1 - tau) * target.
        """
        # self.target_model = deepcopy(self.model)
        target_net_state_dict = self.target_model.state_dict()
        policy_net_state_dict = self.model.state_dict()
        for key in policy_net_state_dict:
            target_net_state_dict[key] = policy_net_state_dict[key]*self.tau + target_net_state_dict[key]*(1-self.tau)
        self.target_model.load_state_dict(target_net_state_dict)

    def select_action(self, state: np.ndarray, available_actions: List[Action], is_test: bool = False) -> tuple[Action, bool, np.ndarray]:
        """
        Select an action using an ε-greedy policy.

        Returns the chosen action, whether it was chosen greedily, and the
        Q-values for all actions.
        """
        qvals = self.get_qvals(state)
        if not is_test and random.random() < self.epsilon:
            return random.choice(available_actions), False, qvals
        else:
            # Filter Q-values to only consider available actions
            valid_qvals = [qvals[action.value] for action in available_actions]
            return available_actions[np.argmax(valid_qvals)], True, qvals

    def get_qvals(self, state: np.ndarray) -> np.ndarray:
        """
        Get Q-values for a given state from the prediction network.
        """
        state_tensor = torch.from_numpy(state).float().unsqueeze(0)  # Convert to tensor
        with torch.no_grad():
            qvals_tensor = self.model(state_tensor)
        return qvals_tensor.detach().numpy()[0]

    def get_maxQ(self, state: np.ndarray) -> float:
        """
        Get the maximum Q-value for a given state from the target network.
        """
        state_tensor = torch.from_numpy(state).float().unsqueeze(0)  # Convert to tensor
        with torch.no_grad():
            max_qval_tensor = torch.max(self.target_model(state_tensor))
        return max_qval_tensor.item()

    def get_double_q(self, state: np.ndarray) -> float:
        """
        Calculate the Double DQN target value for a given state.
        """
        state_tensor = torch.from_numpy(state).float().unsqueeze(0)
        with torch.no_grad():
            best_action_index = torch.argmax(self.model(state_tensor)[0]).item()
            max_qval = self.target_model(state_tensor)[0][best_action_index].item()

        return max_qval

    def train_one_step(self, states: List[np.ndarray], actions: List[int], targets: List[float]) -> float:
        """
        Perform a single training step on the prediction network.
        """
        # Convert states, actions, and targets to tensors
        state_tensors = torch.cat([torch.from_numpy(s).float().unsqueeze(0) for s in states])
        action_tensors = torch.tensor(actions, dtype=torch.long).unsqueeze(1)
        target_tensors = torch.tensor(targets, dtype=torch.float)

        self.optimizer.zero_grad()
        qvals = self.model(state_tensors).gather(1, action_tensors).squeeze(1)  # squeeze only the action dim so a batch of 1 keeps shape (1,)
        losses = self.loss_fn(qvals, target_tensors)
        loss = losses.mean()
        loss.backward()
        self.optimizer.step()
        self.scheduler.step()

        if self.with_log and self.steps % self.loss_log_interval == 0:
            with torch.no_grad():
                mlflow_manager.log_avg_predicted_qval(qvals.mean().item(), step=self.steps)
                mlflow_manager.log_avg_target_qval(target_tensors.mean().item(), step=self.steps)
                mlflow_manager.log_max_predicted_qval(qvals.max().item(), step=self.steps)
                mlflow_manager.log_max_target_qval(target_tensors.max().item(), step=self.steps)

        with torch.no_grad():
            self.replay_buffer.update_priorities(losses.detach().numpy())  # TODO: maybe using l1 better (at least original paper uses l1)

        return loss.item()

    def replay(self) -> None:
        """
        Train the model using experience replay.
        """
        if len(self.replay_buffer.buffer) < self.min_replay_memory_size:
            return

        states, actions, rewards, next_states, dones = self.replay_buffer.sample_batch(self.batch_size)

        # Compute targets
        targets = []
        for i in range(self.batch_size):
            if dones[i]:
                targets.append(rewards[i])
            else:
                # max_future_q = self.get_maxQ(next_states[i])
                max_future_q = self.get_double_q(next_states[i])
                targets.append(rewards[i] + self.discount_rate * max_future_q)

        self.steps += 1
        # Train the model
        loss = self.train_one_step(states, actions, targets)

        if self.with_log and self.steps % self.loss_log_interval == 0:
            mlflow_manager.log_loss(loss, step=self.steps)

        # TODO: plot loss

    def save_state(self, filepath):
        """Save the entire agent state, including model weights and hyperparameters."""
        torch.save({
            'model_state_dict': self.model.state_dict(),  # Model weights
            'target_model_state_dict': self.target_model.state_dict(),  # Target model weights
            'optimizer_state_dict': self.optimizer.state_dict(),  # Optimizer state
            'epsilon': self.epsilon,  # Epsilon value
            'epsilon_decay': self.epsilon_decay,  # Epsilon decay rate
            'epsilon_min': self.epsilon_min,  # Minimum epsilon
            'discount_rate': self.discount_rate,  # Discount factor
            'replay_buffer': self.replay_buffer,  # replay_buffer
            'steps': self.steps,  # Number of training steps performed so far
            'random_state': random.getstate(),  # Python random state
            'numpy_random_state': np.random.get_state(),  # Numpy random state
        }, filepath)

    def load_state(self, filepath):
        """Load the entire agent state, including model weights and hyperparameters."""
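        # The checkpoint holds the replay buffer and RNG states, not just tensors, so it needs
        # full unpickling (PyTorch >= 2.6 defaults to weights_only=True).
        checkpoint = torch.load(filepath, weights_only=False)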
        self.model.load_state_dict(checkpoint['model_state_dict'])  # Load model weights
        self.target_model.load_state_dict(checkpoint['target_model_state_dict'])  # Load target model weights
        self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])  # Restore optimizer state
        self.epsilon = checkpoint['epsilon']  # Restore epsilon value
        self.epsilon_decay = checkpoint['epsilon_decay']  # Restore epsilon decay rate
        self.epsilon_min = checkpoint['epsilon_min']  # Restore minimum epsilon
        self.discount_rate = checkpoint['discount_rate']  # Restore discount factor
        self.replay_buffer = checkpoint['replay_buffer']  # Restore replay_buffer
        self.steps = checkpoint['steps']  # Restore steps
        random.setstate(checkpoint['random_state'])  # Restore Python random state
        np.random.set_state(checkpoint['numpy_random_state'])  # Restore Numpy random state
        # If using a learning rate scheduler:
        # scheduler.load_state_dict(checkpoint['scheduler_state_dict'])  # Restore scheduler state
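
The target that `replay` builds through `get_double_q` can be sketched without PyTorch. In the toy example below, the Q-value arrays and the rewards are invented numbers standing in for network outputs and environment rewards; the helper name `double_dqn_target` is hypothetical. The point is only the split between action selection (prediction network) and action evaluation (target network).

```python
import numpy as np


def double_dqn_target(reward, done, online_q_next, target_q_next, discount_rate=0.999):
    """Double DQN target: the online network picks the action, the target network scores it."""
    if done:
        return float(reward)  # terminal transitions have no bootstrapped term
    best_action = int(np.argmax(online_q_next))  # selection by the prediction network
    return float(reward + discount_rate * target_q_next[best_action])  # evaluation by the target network


# Made-up Q-values for one next state, ordered LEFT, RIGHT, DOWN, UP, COLLECT.
online_q_next = np.array([0.2, 1.5, 0.1, 0.9, -0.3])
target_q_next = np.array([0.4, 1.0, 0.0, 2.0, -0.1])

double_q = double_dqn_target(-1.0, False, online_q_next, target_q_next)
vanilla_q = -1.0 + 0.999 * target_q_next.max()  # what bootstrapping from get_maxQ would give
terminal = double_dqn_target(10.0, True, online_q_next, target_q_next)

print(double_q, vanilla_q, terminal)
```

Because the evaluated action is not necessarily the target network's own maximum, the Double DQN target is never larger than the plain max-based target for the same transition.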

**<p>Trainer that trains the model</p>**

Learning Visualization:

*   Early Stages: In the early stages of training, the agent performs random actions to explore the environment, which can result in the agent moving further away from the goal positions.
*   Later Stages: After training for many episodes, the agent learns to make optimal decisions. The visualization shows the agent making quick and direct movements towards the item, collecting it, and then moving efficiently to the final goal position (G).

<p>plot_rewards(): Visualizes total rewards per episode to show learning progress.</p>
<p>plot_epsilon_decay(): Shows how the exploration rate (epsilon) decreases over episodes, indicating the shift from exploration to exploitation.</p>
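
For reference, the schedule drawn by `plot_epsilon_decay()` is a per-episode multiplicative decay clipped at `epsilon_min`. A minimal sketch, assuming the `DQNAgent` defaults (`epsilon=1`, `epsilon_decay=0.999`, `epsilon_min=0.007`); the helper name `epsilon_schedule` is hypothetical:

```python
def epsilon_schedule(num_episodes, epsilon_start=1.0, epsilon_decay=0.999, epsilon_min=0.007):
    """Epsilon at the start of each episode: max(epsilon_min, epsilon_start * decay**i)."""
    return [max(epsilon_min, epsilon_start * epsilon_decay ** i) for i in range(num_episodes)]


schedule = epsilon_schedule(6000)

# First episode index at which epsilon has reached the floor (None if it never does).
first_floor_episode = next((i for i, eps in enumerate(schedule) if eps == 0.007), None)
print(schedule[0], schedule[110], first_floor_episode)
```

With a decay this slow, a short training run spends almost all of its episodes in the high-exploration regime, which is worth keeping in mind when reading the reward plot.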






In [6]:
import matplotlib.pyplot as plt
import numpy as np
import mlflow
import time

class Trainer:
    def __init__(self, agent: DQNAgent, environment: Assignment2Environment, with_log: bool = False, log_step: int = 100, update_target_episodes: int = 20, num_validation_episodes: int = 30) -> None:
        """
        Initialize the Trainer with the DQN agent and environment.
        """
        self.agent = agent
        self.environment = environment

        self.update_target_episodes = update_target_episodes

        self.episode_rewards: list[float] = []

        self.with_log = with_log
        self.global_step = 0
        self.log_step = log_step
        self.num_validation_episodes = num_validation_episodes


    def train_one_episode(self, epoch_idx: int) -> None:
        """
        Conducts training for a single episode.
        """
        self.environment.initialize_for_new_episode()

        current_state = self.environment.get_state()
        done = False
        total_reward = 0.0
        step_count = 0
        current_log_cycle_reward_list = []

        while not done:
            state_array = self.state_to_array(current_state)
            available_actions = self.environment.get_available_actions(current_state)
            action, is_greedy, all_qvals = self.agent.select_action(state_array, available_actions)
            reward, next_state = self.environment.step(action=action, is_greedy=is_greedy, all_qvals=all_qvals)
            current_log_cycle_reward_list.append(reward)
            # print(f"S_t={current_state}, A={action.name}, R={reward}, S_t+1={next_state}")
            if self.with_log and self.global_step % self.log_step == 0:
                # print(f"R={reward}")
                # print("========================")
                running_reward = sum(current_log_cycle_reward_list) / len(current_log_cycle_reward_list)
                # mlflow.log_metric("reward", running_reward, step=self.global_step)
                mlflow_manager.log_reward(running_reward, step=self.global_step)
                current_log_cycle_reward_list.clear()
            next_state_array = self.state_to_array(next_state)
            done = self.environment.is_goal_state(next_state)
            total_reward += reward

            self.agent.replay_buffer.remember((state_array, action.value, reward, next_state_array, done))
            self.agent.replay()  # maybe train inside

            current_state = next_state
            step_count += 1
            self.global_step += 1

        # decrease exploration over time
        self.agent.epsilon = max(self.agent.epsilon_min, self.agent.epsilon * self.agent.epsilon_decay)
        self.episode_rewards.append(total_reward)
        if self.with_log:
            mlflow_manager.log_episode_wise_reward(total_reward / step_count, episode_idx=epoch_idx)

    def state_to_array(self, state: Assignment2State) -> np.ndarray:
        """
        Converts a State object into a numpy array suitable for input to the DQN.
        """
        # Convert Assignment2State to array
        return np.array([
            *state.agent_location,  # Agent's (x, y) location
            *state.item_location,   # Item's (x, y) location
            float(state.has_item),  # 1 if agent has item, 0 otherwise
            *state.goal_location,   # Goal's (x, y) location
            *state.goal_direction,  # Direction to goal (dx, dy)
            *state.item_direction   # Direction to item (dx, dy)
        ])

    def train(self, num_episodes: int) -> None:
        """
        Train the agent across multiple episodes.
        """
        num_nn_passes = 0
        for episode in range(1, num_episodes+1):
            print(f"Starting Episode {episode}")
            self.train_one_episode(episode)
            if episode % self.update_target_episodes == 0:
                self.agent.update_target_network()
                if self.with_log:
                    print("Target network updated")
            print(f"Episode {episode} completed. Epsilon: {self.agent.epsilon:.4f}")
            if self.agent.steps != num_nn_passes:
                self.validate(episode)
                self.visualize_sample_episode()
                num_nn_passes = self.agent.steps
        # mlflow.end_run()
        # Plot and save the rewards and epsilon decay after training is complete
        self.plot_rewards(save=True, filename='reward_plot.png')
        self.plot_epsilon_decay(num_episodes, save=True, filename='epsilon_decay_plot.png')

    def visualize_sample_episode(self) -> None:
        sample_env = Assignment2Environment(n=4, with_animation=True)
        sample_env.initialize_for_new_episode()
        current_state = sample_env.get_state()
        start_time = time.time()
        done = False

        prev_state = None

        while not done and time.time() - start_time < 20:  # give up after 20 seconds
            state_array = self.state_to_array(current_state)
            available_actions = sample_env.get_available_actions(current_state)
            action, is_greedy, all_qvals = self.agent.select_action(state_array, available_actions, is_test=True)
            reward, next_state = sample_env.step(action=action, is_greedy=is_greedy, all_qvals=all_qvals)
            done = sample_env.is_goal_state(next_state)

            # detect a two-step oscillation (agent returns to the state it just left) and stop early
            if next_state == prev_state:
                print("cycle detected... breaking")
                break
            prev_state = current_state
            current_state = next_state

    def validate(self, current_episode_index: int):
        calculated_scores = []
        for _ in range(self.num_validation_episodes):
            sample_env = Assignment2Environment(n=4, with_animation=False)
            sample_env.initialize_for_new_episode()
            sample_env.current_sub_environment.agent.has_item = False  # metric assumes that agent starts without item
            current_state = sample_env.get_state()
            start_time = time.time()
            done = False
            start_location = sample_env.current_sub_environment.agent.get_location()
            item_location = sample_env.current_sub_environment.item.get_location()
            goal_location = sample_env.current_sub_environment.goal_location

            prev_state = None
            predicted_steps = 0
            while not done:
                if time.time() - start_time > 20:
                    predicted_steps = 0  # timed out: score this episode as 0
                    break
                state_array = self.state_to_array(current_state)
                available_actions = sample_env.get_available_actions(current_state)
                action, is_greedy, all_qvals = self.agent.select_action(state_array, available_actions, is_test=True)
                reward, next_state = sample_env.step(action=action, is_greedy=is_greedy, all_qvals=all_qvals)
                done = sample_env.is_goal_state(next_state)

                # detect a two-step oscillation (agent returns to the state it just left) and stop early
                if next_state == prev_state:
                    predicted_steps = 0  # stuck in a loop: score this episode as 0
                    break
                prev_state = current_state
                current_state = next_state
                predicted_steps += 1
            calculated_scores.append(calculate_metrics_score(predicted_steps, start_location, item_location, goal_location))

        result = sum(calculated_scores) / self.num_validation_episodes
        if self.with_log:
            mlflow_manager.log_validation_score(result, step=current_episode_index)
        return result


    def plot_rewards(self, save: bool = False, filename: str | None = None) -> None:
        """
        Plot the total reward earned per episode.
        """
        plt.figure(figsize=(10, 5))
        plt.plot(self.episode_rewards, label='Total Reward per Episode')
        plt.xlabel('Episode')
        plt.ylabel('Total Reward')
        plt.title('Reward Earned per Episode')
        plt.legend()
        if save and filename:
            plt.savefig(filename)
            print(f"Reward plot saved to {filename}")
        else:
            plt.show()

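    def plot_epsilon_decay(self, num_episodes: int, save: bool = False, filename: str | None = None, initial_epsilon: float = 1.0) -> None:
        """
        Plot the epsilon decay over episodes.

        initial_epsilon is the exploration rate at the start of training (1.0 matches the
        DQNAgent default). It is passed in explicitly because self.agent.epsilon has already
        decayed by the time training ends.
        """
        epsilons = [max(self.agent.epsilon_min, initial_epsilon * (self.agent.epsilon_decay ** i)) for i in range(num_episodes)]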

        plt.figure(figsize=(10, 5))
        plt.plot(range(num_episodes), epsilons, label='Epsilon Decay')
        plt.xlabel('Episodes')
        plt.ylabel('Epsilon')
        plt.title('Epsilon Decay over Episodes')
        plt.legend()
        if save and filename:
            plt.savefig(filename)
            print(f"Epsilon decay plot saved to {filename}")
        else:
            plt.show()


### Metric:
$$ \frac{1}{\text{number of episodes}}\sum_{\text{item location}}\sum_{\text{agent location}}\sum_{\text{goal location}}\frac{M(\text{agent location}, \text{item location}) + 1 + M(\text{item location}, \text{goal location})}{\text{number of steps taken}} $$
where $M$ represents the Manhattan distance.

We began by training our DQN on all possible item and goal locations. After training, we evaluated the agent by computing, for each episode, the ratio of the shortest possible path length (Manhattan distance from the start to the item, plus one step to collect the item, plus Manhattan distance from the item to the goal) to the number of steps the agent actually took. This calculation was done for all possible agent, item and goal locations within the environment. A ratio of 1 indicates that the agent followed a shortest path, while a ratio below 1 indicates that it took a longer path. Averaging these ratios across all scenarios gives the metric.

We obtained an average value greater than xxx, indicating that our DQN effectively learns the optimal paths in the environment. Note that the 1 in the metric represents the additional step needed to pick up the item.
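
As a worked instance of the metric, consider a single hypothetical episode on the 4x4 grid. The locations and step counts below are invented for illustration, and the helper names `manhattan` and `path_ratio` are not part of the assignment code.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) grid locations."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def path_ratio(start, item, goal, steps_taken):
    """Shortest possible number of steps (including the collect step) over steps actually taken."""
    shortest = manhattan(start, item) + 1 + manhattan(item, goal)
    return shortest / steps_taken if steps_taken else 0.0


start, item, goal = (0, 0), (2, 1), (3, 3)  # invented locations
optimal = path_ratio(start, item, goal, steps_taken=7)  # 3 moves + 1 collect + 3 moves
detour = path_ratio(start, item, goal, steps_taken=9)   # two wasted steps
print(optimal, round(detour, 3))
```

The final metric is the mean of such per-episode ratios over every evaluated start, item and goal combination.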

In [None]:
def calculate_manhattan_distance(start_location: tuple[int, int], goal_location: tuple[int, int]) -> int:
    """
    Calculates the Manhattan distance between two points.
    """
    start_x, start_y = start_location
    goal_x, goal_y = goal_location
    return abs(start_x - goal_x) + abs(start_y - goal_y)

def calculate_metrics_score(predicted_distance: int, start_location: tuple[int, int], item_location: tuple[int, int], goal_location: tuple[int, int]) -> float:
    """
    Calculates the ratio of the shortest possible distance to the distance actually
    travelled (returns 0 when no steps were recorded).
    """
    # Calculate shortest distance from start to item to goal
    shortest_distance = (
        calculate_manhattan_distance(start_location, item_location)
        + 1
        + calculate_manhattan_distance(item_location, goal_location)
    )
    return (shortest_distance / predicted_distance) if predicted_distance != 0 else 0

In [None]:
import random
import numpy as np
from tqdm import tqdm
import time
import os

class Evaluation:
    def __init__(self, n=4) -> None:
        self.n = n
        self.dqn_envs = Assignment2Environment(n=self.n, with_animation=False)
        self.dqn_agent = DQNAgent(with_log=True)

    def run_dqn_train(self):
        """
        Trains DQN agent in the environment and save the states.
        """
        trainer = Trainer(self.dqn_agent, self.dqn_envs, with_log=True)
        trainer.train(num_episodes=110)
        self.dqn_agent.save_state("trained_dqn.pth")

    def load_trained_dqn(self, path: str):
        """
        Load the saved DQN
        """
        self.dqn_agent.load_state(path)

    @staticmethod
    def generate_grid_location_list(max_x: int, max_y: int) -> list[tuple[int, int]]:
        """
        Generate the grid location list for all possible cases
        """
        return [(i, j) for i in range(max_x) for j in range(max_y)]

    def state_to_array(self, state: State) -> np.ndarray:
        """
        Converts a State object into a numpy array suitable for input to the DQN.
        """
        # Check if the state is an instance of Assignment2State
        if isinstance(state, Assignment2State):
            # Convert Assignment2State to array
            state_array = np.array([
                *state.agent_location,  # Agent's (x, y) location
                *state.item_location,   # Item's (x, y) location
                float(state.has_item),  # 1 if agent has item, 0 otherwise
                *state.goal_location,   # Goal's (x, y) location
                *state.goal_direction,  # Direction to goal (dx, dy)
                *state.item_direction   # Direction to item (dx, dy)
            ])
        else:
            # Convert basic State to array (without goal-related information)
            state_array = np.array([
                *state.agent_location,  # Agent's (x, y) location
                *state.item_location,   # Item's (x, y) location
                float(state.has_item)   # 1 if agent has item, 0 otherwise
            ])

        # Ensure the state array matches the input size of the neural network
        if len(state_array) != 11:
            print(f"Warning: State array length mismatch. Expected 11, got {len(state_array)}. Padding with zeros.")
            state_array = np.pad(state_array, (0, 11 - len(state_array)), 'constant')
        return state_array

    def dqn_performance_test(self):
        """
        Conducts a performance test for DQN. The maximum of the score is 1.
        """
        num_episodes = 0
        total_score = 0

        # Loop over all environments in DQN environment
        for i, _ in tqdm(enumerate(self.dqn_envs.environments)):
            for agent_location in tqdm(self.generate_grid_location_list(self.n, self.n)):
                # Initialize episode with a given agent location
                self.dqn_envs.initialize_for_new_episode(agent_location=agent_location, env_index=i)

                # Skip cases where the agent would start on the item or goal location
                if agent_location == self.dqn_envs.current_sub_environment.item.get_location() or agent_location == self.dqn_envs.current_sub_environment.goal_location:
                    continue

                # Metric assumes that agent starts without item
                self.dqn_envs.current_sub_environment.agent.has_item = False

                # Get start, item, and goal locations to calculate the shortest distance
                start_location = self.dqn_envs.current_sub_environment.agent.get_location()
                item_location = self.dqn_envs.current_sub_environment.item.get_location()
                goal_location = self.dqn_envs.current_sub_environment.goal_location

                current_state = self.dqn_envs.get_state()  # get current state
                start_time = time.time()  # used to enforce the 20-second limit
                predicted_steps = 0  # number of steps the agent actually takes
                done = False  # whether the goal has been reached in this episode
                is_break = False  # whether the episode was aborted by the time limit

                while not done:
                    # Break if it takes more than 20 seconds
                    if time.time() - start_time > 20:
                        is_break = True
                        break
                    state_array = self.state_to_array(current_state) # get the states in array format
                    available_actions = self.dqn_envs.get_available_actions(current_state) # get available actions
                    action, is_greedy, all_qvals = self.dqn_agent.select_action(state_array, available_actions, is_test=True)
                    reward, next_state = self.dqn_envs.step(action=action, is_greedy=is_greedy, all_qvals=all_qvals) # get next state
                    done = self.dqn_envs.is_goal_state(next_state) # Check if it is goal position
                    current_state = next_state # update current state
                    predicted_steps += 1

                if not is_break:
                    # calculate the metrics score
                    total_score += calculate_metrics_score(predicted_steps, start_location, item_location, goal_location)
                    num_episodes += 1 # increase the episode

        # Return the average score across all possible tests (after every environment has been evaluated)
        return (total_score / num_episodes) if num_episodes != 0 else 0

    def visualize_dqn(self, num_of_vis: int = 5) -> None:
        """
        Visualize the trained agent's path for the given number of episodes
        """
        for _ in range(num_of_vis):
            self.dqn_envs.set_with_animation(True) # Set the animation True
            self.dqn_envs.initialize_for_new_episode()
            self.dqn_envs.current_sub_environment.agent.has_item = False # Assumes that agent starts without item

            current_state = self.dqn_envs.get_state()
            start_time = time.time() # Keep track time
            done = False

            while not done:
                # If it takes more than 20 seconds to reach the goal, break the loop
                if time.time() - start_time > 20:
                    break
                state_array = self.state_to_array(current_state) # get the states in array format
                available_actions = self.dqn_envs.get_available_actions(current_state) # get available actions
                action, is_greedy, all_qvals = self.dqn_agent.select_action(state_array, available_actions, is_test=True)
                reward, next_state = self.dqn_envs.step(action=action, is_greedy=is_greedy, all_qvals=all_qvals) # get next state
                done = self.dqn_envs.is_goal_state(next_state) # Check if it is goal position
                current_state = next_state # update current state

if __name__ == "__main__":
    # DQN
    evl = Evaluation()

    # Training DQN
    # evl.run_dqn_train()

    # Load DQN model
    current_path = os.getcwd() # get current path
    saved_path = os.path.join(current_path, 'trained_dqn_agent_2.pth')
    evl.load_trained_dqn(saved_path)

    # Conduct the performance test
    average_score = evl.dqn_performance_test()
    print(f"Average performance score (1 is the best): {average_score:.4f}")

    # Visualize randomly the environments and show the steps of the agent
    evl.visualize_dqn()

**<p>Hyperparameter Tuning</p>**

<p>We performed hyperparameter tuning for the DQN agent using the Optuna library. An objective function evaluates each set of hyperparameters, and Optuna searches for the set that maximizes the DQN agent's total reward.</p>

<p>While running the optimisation, we capped each training episode within a trial at 10 seconds.</p>

<p>Below are the sets of hyperparameters used in tuning:</p>

<ol>
<li>Optimizer learning rate <i>(alpha)</i> in the range 0.995 to 0.999</li>
<li>Discount factor for future rewards <i>(discount_rate)</i> in the range 0.95 to 0.975</li>
<li>Exploration rate <i>(epsilon)</i> from the set 0.2, 0.35, 0.45 and 0.6</li>
<li>Maximum size of replay memory <i>(replay_memory_size)</i> in the range 1000 to 5000</li>
<li>Batch size for experience replay <i>(batch_size)</i> from the set 16, 32, 64, 128 and 256</li>
<li>Time penalty <i>(time_penalty)</i> in the range -10 to -1</li>
<li>Penalty for reaching the goal without the item <i>(goal_no_item_penalty)</i> in the range -500 to -100</li>
<li>Penalty for revisiting the item location <i>(item_revisit_penalty)</i> in the range -200 to -50</li>
<li>Reward for collecting the item <i>(item_state_reward)</i> in the range 100 to 200</li>
<li>Reward for reaching the goal with the item collected <i>(goal_state_reward)</i> in the range 300 to 600</li>
<li>Number of episodes to train the model <i>(num_episodes)</i> in the range 100 to 600</li>
</ol>


<p>We then initialised the environment and agent with the sampled hyperparameters and started the training loop.</p>

<p>We call Optuna to start the optimisation process with 50 trials. The best hyperparameters and total reward are printed and saved to a YAML file. The optimisation history is plotted to visualize how the total reward improved over trials.</p>


In [None]:
import numpy as np
import optuna
import time
import yaml
from tqdm import tqdm

TIME_LIMIT = 10


class Tuning:
    def __init__(self, time_limit: int = TIME_LIMIT) -> None:
        self.study = optuna.create_study(direction='maximize') # Create an Optuna study
        self.time_limit = time_limit

    def objective(self, trial: optuna.Trial) -> float:
        # Hyperparameters to optimize
        alpha = trial.suggest_float('alpha', 0.995, 0.999)
        discount_rate = trial.suggest_float('discount_rate', 0.95, 0.975)
        epsilon = trial.suggest_categorical('epsilon', [0.2, 0.35, 0.45, 0.6])
        replay_memory_size = trial.suggest_int('replay_memory_size', 1000, 5000)
        batch_size = trial.suggest_categorical('batch_size', [16, 32, 64, 128, 256])
        time_penalty = trial.suggest_int('time_penalty', -10, -1)
        goal_no_item_penalty = trial.suggest_int('goal_no_item_penalty', -500, -100)
        item_revisit_penalty = trial.suggest_int('item_revisit_penalty', -200, -50)
        item_state_reward = trial.suggest_int('item_state_reward', 100, 200)
        goal_state_reward = trial.suggest_int('goal_state_reward', 300, 600)
        num_episodes = trial.suggest_int('num_episodes', 100, 600)

        # Initialize Assignment2Environment
        tune_env = Assignment2Environment(
            n=4,  # Grid size
            time_penalty=time_penalty,
            goal_no_item_penalty=goal_no_item_penalty,
            item_revisit_penalty=item_revisit_penalty,
            item_state_reward=item_state_reward,
            goal_state_reward=goal_state_reward,
            direction_reward_multiplier=1,
            with_animation=False,
        )

        # Initialize DQNAgent
        tune_agent = DQNAgent(
            alpha=alpha,
            discount_rate=discount_rate,
            epsilon=epsilon,
            replay_memory_size=replay_memory_size,
            batch_size=batch_size,
        )

        # num_episodes = 200 # define episodes
        tune_trainer = Trainer(tune_agent, tune_env)

        for _ in range(num_episodes):
            tune_env.initialize_for_new_episode() # Initialize the environment for a new episode
            current_state = tune_env.get_state()  # Get state from current sub-environment
            total_reward = 0  # Track total reward for the episode
            start_time = time.time()  # Record the start time of the episode

            while not tune_env.is_goal_state(current_state):
                # Check if time limit is exceeded
                elapsed_time = time.time() - start_time
                if elapsed_time > self.time_limit:
                    break

                # Convert the current state to a numpy array for input to the neural network
                state_array = tune_trainer.state_to_array(current_state)

                # Retrieve available actions from the current sub-environment
                available_actions = tune_env.get_available_actions(current_state)

                # Select an action using the agent's ε-greedy policy
                action, is_greedy, all_qvals = tune_agent.select_action(state_array, available_actions)

                # Execute the action in the current sub-environment, receive reward and next state
                reward, next_state = tune_env.step(action=action, is_greedy=is_greedy, all_qvals=all_qvals)


                # Add the reward to the total reward for this episode
                total_reward += reward

                # Convert the next state to a numpy array
                next_state_array = tune_trainer.state_to_array(next_state)

                # Store experience in the agent's replay memory
                tune_agent.replay_buffer.remember((state_array, action.value, reward, next_state_array, tune_env.is_goal_state(next_state)))

                # Learn from experiences using experience replay
                tune_agent.replay()

                # Move to the next state
                current_state = next_state

            # decrease exploration over time
            tune_agent.epsilon = max(tune_agent.epsilon_min, tune_agent.epsilon * tune_agent.epsilon_decay)
            # Store total reward of the episode
            tune_trainer.episode_rewards.append(total_reward)

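            # Report this episode's reward so the pruner has intermediate values to compare;
            # without a report, should_prune() never returns True
            trial.report(total_reward, step=len(tune_trainer.episode_rewards) - 1)

            # Prune the trial early if it's performing poorly
            if trial.should_prune():
                raise optuna.exceptions.TrialPruned()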

        # Objective value: the total reward of the final training episode
        return total_reward


    def run_hyperparameter_tuning(self):
        '''
        Run hyperparameter tuning and save the best parameters
        '''
        # Define the number of trials
        num_trials = 50

        # Optimize the objective function
        self.study.optimize(self.objective, n_trials=num_trials, show_progress_bar=True)

        # Print the best hyperparameters and the best value
        print("Best Hyperparameters: ", self.study.best_params)
        print("Best Value: ", self.study.best_value)

        # Save the best hyperparameters to a YAML file
        with open("config.yml", "w") as file:
            yaml.dump(self.study.best_params, file, default_flow_style=False)

    def hyperparameter_tuning_visualization(self):
        '''
        Visualize the optimization history
        '''
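        # plot_optimization_history returns a Plotly figure; show it explicitly,
        # because a value created inside a method is not auto-displayed by the notebook
        fig = optuna.visualization.plot_optimization_history(self.study)
        fig.show()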

if __name__ == "__main__":
    tuning = Tuning()

    # Conduct hyperparameter tuning
    tuning.run_hyperparameter_tuning()

    # Visualize the hyperparameter tuning
    tuning.hyperparameter_tuning_visualization()
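
<p>The objective above asks Optuna whether a trial should be pruned. Because we create the study without specifying a pruner, Optuna falls back to its median pruner. The sketch below is a simplified, pure-Python illustration of the median rule, not Optuna's implementation: it ignores Optuna's start-up and warm-up settings, the function name <code>should_prune_median</code> is hypothetical, and the reward numbers are made up for the example.</p>

```python
from statistics import median


def should_prune_median(history, current, step):
    """Return True if `current` is below the median of earlier trials at `step`.

    history: per-trial lists of intermediate rewards from earlier trials
    current: the running trial's reward at `step` (higher is better)
    """
    peers = [rewards[step] for rewards in history if len(rewards) > step]
    if not peers:  # nothing to compare against yet, so keep training
        return False
    return current < median(peers)


# Made-up episode rewards from three earlier trials (illustrative only)
history = [[-120, -40, 150], [-200, -90, 60], [-80, 10, 300]]

# At step 1 the earlier trials scored -40, -90 and 10, so the median is -40
print(should_prune_median(history, current=-150, step=1))  # True
print(should_prune_median(history, current=20, step=1))    # False
```

<p>A trial that lags behind the median of its predecessors at the same episode is stopped early, which frees the remaining budget for more promising hyperparameter sets.</p>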