\begin{center}

{\huge 1. Front Matter}

\end{center}

---

<p style="text-align: center;">
    This project explores the performance of agents implemented with dynamic programming versus statistical machine learning. These agents will autonomously perform intelligent tasks in a dynamic video game environment.
</p>

Contents:
    
* Abstract
* Introduction
* Background
* Requirements Capture
* Analysis and Design
* Implementation
* Testing
* Results
* Evaluation
* Conclusion and Further Work
* User Guide
* Bibliography
* Appendices

\pagebreak

<div style="page-break-after: always;"></div>

#### _Plagiarism Statement_

This project was written by me and in my own words, except for quotations from published and unpublished sources which are clearly indicated and acknowledged as such. I am conscious that the incorporation of material from other works or a paraphrase of such material without acknowledgement will be treated as plagiarism, subject to the custom and usage of the subject, according to the University Regulations. The source of any picture, map or other illustration is also indicated, as is the source, published or unpublished, of any material not resulting from my own experimentation, observation or specimen-collecting.

\pagebreak

<div style="page-break-after: always;"> </div>

\begin{center}
{\huge 2. Acknowledgements}
\end{center}

---

(section under construction)

\pagebreak

\begin{center}
{\huge 3. Abstract}
\end{center}

---

Prediction and classification are semi-autonomous tasks commonly performed with deep learning algorithms. These tasks, such as recognizing an object, can be automated easily because of their limited decision space [1]. However, what would happen if we included the dimension of time and required the algorithm to act, as well as predict or classify?

An algorithm which is able to make decisions on control tasks would be valuable in automation. A popular approach to autonomous, intelligent decision-making is reinforcement learning.

In artificial intelligence, anything which perceives its environment, takes actions [autonomously](https://en.wikipedia.org/wiki/Autonomous "Autonomous") in order to achieve goals, and may improve its performance with [learning](https://en.wikipedia.org/wiki/Machine_learning "Machine learning") or may use [knowledge](https://en.wikipedia.org/wiki/Knowledge_representation "Knowledge representation") is known as an Intelligent Agent (IA) [2].

Russell & Norvig [3] classify IAs into five classes:

* simple reflex agents

* model-based reflex agents

* goal-based agents

* utility-based agents

* learning agents


\pagebreak

\begin{center}
{\huge 4. Introduction}
\end{center}

---


\underline {\large 4.1 From automation to autonomy.}

Automation preceded artificial intelligence in replacing human labour in repetitive tasks. An example of automation is the automated storage and retrieval systems common across the large warehouses of logistics companies.

Advancements in systems such as Warehouse Management Systems can be classified into different trends. On the one hand, some advancements aim to enhance specific processes of the warehouses, improving the effectiveness and efficiency of the whole activity. On the other hand, other advances focus on improving more general processes, using Machine Learning methods such as **Reinforcement Learning** (Cestero et al., 2022).

Large robot arms retrieve packages before placing each on long conveyor belts with precise mechanical motion. A package one size too large or incorrectly oriented will cause failure in the system. The package will drop or never even reach its destination.

**Automated systems are inflexible.**

Automated systems make decisions by calculation, searching or lookup. Each robot arm and conveyor belt was programmed to do a task in a specific way.

The automated system needs to be able to tolerate inconsistencies. This is where autonomous A.I. can make an impact, in an environment that streams unstructured, dynamic data. Unlike conventional data stream methods built upon a shallow network structure, Deep Neural Networks potentially offer a significant improvement in accuracy and in the aptitude to handle unstructured data streams (Ashfahani et al., 2020).

\underline {\large 4.2 Adapting automation to a changing environment.}

Globally, companies are facing difficult situations with ever changing business environments (markets, customers, workers, equipment). Often, automated systems built around repeatable and predictable processes are unable to change programmed behaviour in response to the changing environment.

The [AI Index Report](https://aiindex.stanford.edu/report/) cites that over 120,000 AI related peer-reviewed academic papers were published in 2019. Many of these advances are heavily publicized, which can make each one look like the Holy Grail of A.I. The challenge is to adapt algorithms that were successful in the lab to real-life problems.

\underline {\large 4.3 Developing autonomous algorithms in a controlled lab environment.}

Lab based experiments conducted with simulated environments are invaluable for developing autonomous A.Is, which will be referred to as Intelligent Agents (IA). 

Simulations in the lab do not replicate real-world situations perfectly, but there are advantages:

1. low costs
2. absence of safety issues
3. unlimited iterations of simulations

Video-game playing IAs have gained enormous publicity in recent years, with IAs beating humans at complex games such as Dota 2 and Starcraft 2.

Video games are a well-suited environment in which to develop and test IAs:

1. video game simulations can be built from scratch,
2. researchers can define the rules in video games, such as physics and rewards.


\pagebreak

\begin{center}
{\huge 5. Literature Review}
\end{center}


---

\underline {\large Case Study 1: Design and Implementation of an Intelligent Gaming Agent with FSM}

\normalsize Study URL: [link](https://www.researchgate.net/publication/346319155_Design_and_Implementation_of_an_Intelligent_Gaming_Agent_Using_A_Algorithm_and_Finite_State_Machines)

\normalsize From the definition of Intelligent Agents in Artificial Intelligence, an intelligent agent is anything that can perceive its immediate environment and take action with respect to its observations. An intelligent gaming agent is therefore capable of observing what goes on in the gaming environment and acting on those observations (Adegun et al., 2020).

An agent cycles through three key steps while the FSM is active. The steps are commonly known as the sense-think-act cycle. In addition to these three steps, there is an optional learning or remembering step that may also take place during this loop.

The game agent _must_ have access to the environment's state to generate a decision and an action _following_ the decision. The agent pursues a pre-determined goal or destination, by navigating the environment with a localization and navigation algorithm, which for this paper, was A* search.

The agent navigates a 3D environment generated with the Unity3D engine. The agent encounters NPCs which attack the agent while it navigates the 3D environment towards its goal.

\underline {Critical evaluation}

\normalsize Several aspects of this research paper were unclear.

1. Navigating a 3D environment with A* Search without generating a navigation mesh is computationally expensive. 3D environments have an almost _infinite_ number of points which the agent can move to, so a navigation mesh of the map is normally generated before A* Search is used.

\begin{figure}
        \centering
        \includegraphics[scale=0.7]{images/waypoints_amitp.jpg}
        \par
        Figure 5.5: Navigation Mesh, retrieved from https://web.archive.org/web/20110716210236/http://www.ai-blog.net/archives/000152.html, accessed in August 2022\par
\end{figure}

2. The paths found by A* on eight-neighbor square grids can be approximately 8% longer than the shortest paths in the continuous environment (Nash et al., 2010).
3. Properties of the environment, such as player weapon ammo and enemy health, were not described.
4. How does the agent perceive state? How does the agent 'know' when an enemy NPC is dead, or how much ammunition it has remaining?

\bigskip

\underline {\large Case Study 2: OpenAI Five}

\normalsize Study URL: [https://arxiv.org/pdf/1912.06680.pdf](https://arxiv.org/pdf/1912.06680.pdf)

In 2019, an A.I. system by OpenAI, dubbed ‘Five’, defeated a team of professional players at the game Dota 2. Victory requires teamwork and collaboration, which makes this result a milestone for A.I.

Link: [https://openai.com/five/](https://openai.com/five/)

The agents were trained through self-play over millions of games, using a restricted pool of 17 of the 110+ ‘heroes’ available to players. Training ran from June 30th, 2018 to April 22nd, 2019, roughly 10 months, and consumed about 770 PFlops/s·days of compute.

The agents defeated the human Dota 2 world champions in a best-of-three match. The agents were able to solve the following:

* Long time horizons
* Partially observed state
* High dimensional action and observation spaces

Drawbacks:

* The complex environment of a game such as Dota 2 requires processing power beyond the reach of unaffiliated researchers
* The agents were specialised and are unable to play another game, for example, Starcraft II.

\underline {Critical evaluation}

\normalsize The authors of this study made it clear that the A.I. was highly specialised; it could only perform the tasks in the environment where its training took place. Moving the A.I. to a similar game with slightly different mechanics and environment, such as League of Legends, would give poor results.

These limitations underline the bottom line of employing A.I. at the time of writing: algorithms and models for generalized tasks are still _a way off_.

\bigskip

\underline {\large Case Study 3: Ubisoft LaForge - Deep Reinforcement Learning for Navigation.}

\normalsize Study URL: [https://arxiv.org/pdf/2011.04764.pdf](https://arxiv.org/pdf/2011.04764.pdf)

Ubisoft Montreal has an A.I. research unit conducting experiments on A.I. systems for games. This particular study highlights real-world challenges. Its strength is navigation with uncommon techniques such as grappling hooks and jumping; its weakness is poor scaling.

Link: https://montreal.ubisoft.com/en/deep-reinforcement-learning-for-navigation-in-aaa-video-games/

The agent was able to navigate the 3D environment built with UnityML with ray-casting, which is done by bouncing ‘light rays’ at the environment and sensing the feedback. This is known as ‘local perception’.

The agent navigating a 3D environment had local perception in the form of:

1. 3D occupancy map
2. 2D depth map
3. Scalar information about its own physical attributes (e.g. weight, height)

Running computations in a game engine is costly, so the agents are trained with an off-policy RL algorithm named Soft Actor-Critic (SAC). Local perception drastically improved the sample efficiency of the RL algorithm.

\underline {Critical evaluation}

\normalsize Off-policy algorithms such as SAC are difficult to optimize. 

**Off-policy algorithms** reuse data from old policies to update the current policy. On-policy reinforcement learning (RL) algorithms have high sample complexity while off-policy algorithms are difficult to tune. Merging the two holds the promise to develop efficient algorithms that generalize across diverse environments (Fakoor et. al. 2019).

\bigskip

\underline {\large Case Study 4: Proximal Policy Optimization (PPO)}

\normalsize Study URL: [https://arxiv.org/pdf/1707.06347.pdf](https://arxiv.org/pdf/1707.06347.pdf)

Proximal Policy Optimization is a reinforcement learning framework developed at OpenAI for easy implementation coupled with good baseline performance. It is a policy gradient method and can be used for environments with either discrete or continuous action spaces.

Proximal Policy Optimization strikes a balance between ease of implementation, sample complexity, and ease of tuning, trying to compute an update at each step that minimizes the cost function while ensuring the deviation from the previous policy is relatively small [5].

\underline {\large How it works}

\normalsize PPO utilizes the [Actor-Critic](https://keras.io/examples/rl/actor_critic_cartpole/) method and trains a stochastic policy in an on-policy way.

As an agent takes actions and moves through an environment, it learns to map the observed state of the environment to two possible outputs (Zhao, et al., 2017).

* Recommended action: A probability value for each action in the action space. The part of the agent responsible for this output is called the actor.
* Estimated rewards in the future: Sum of all rewards it expects to receive in the future. The part of the agent responsible for this output is the critic.

The policy is updated via a stochastic gradient ascent optimizer, while the value function is fitted via some gradient descent algorithm. This procedure is applied for many epochs until the environment is solved.

\centering Image credit: https://keras.io/examples/rl/ppo_cartpole/
\begin{figure}
        \centering
        \includegraphics{images/ppo.png}
        \par
        Figure 5.6: Pseudocode for the PPO algorithm. 
\end{figure}

\raggedright 
\normalsize PPO attains the data efficiency and reliable performance of [Trust Region Policy Optimization](https://arxiv.org/abs/1502.05477) (TRPO), while using only first-order optimization. 

TRPO is:

* complicated
* incompatible with architectures which include noise (such as dropout) or parameter sharing

PPO is:

* simpler to implement
* applicable in general settings
* a better performer overall

\underline {Critical evaluation}

\normalsize PPO strikes a balance between ease of implementation, sample complexity, and ease of tuning, trying to compute an update at each step that minimizes the cost function, while ensuring the deviation from the previous policy is relatively small (Schulman et. al., 2017).

This algorithm is tried and tested, and should be amongst the algorithms evaluated for implementing the agent Abel.

\pagebreak

\begin{center}
{\huge 3. Design}
\end{center}


---

\underline {\large 3.1 Kane and Abel: AIs that play games}

\normalsize Amongst the earliest work on vision-based reinforcement learning for autonomous navigation and action is that of Asada et al. (1996), who trained robots to learn soccer-playing skills with reinforcement learning.

Based on the IA classes defined by Russell & Norvig, Kane is classified as a **simple reflex agent**, while Abel is an **intelligent agent** of the learning-agent class. An intelligent agent perceives its environment via sensors and acts rationally upon that environment with its effectors. 

This project will implement IAs interacting with the **VizDoom** environment, which is a first-person shooter game based on the classic video game from the 90s, Doom. The VizDoom research tool was chosen for its well-documented API and, more importantly, because its environment provides only *visual feedback*, which rules out handicaps for implementers of learning algorithms.

Two types of agent interact with the environment: a simple reflex agent operating on the Finite State Machine model, and an intelligent agent built with the reinforcement learning algorithms Proximal Policy Optimization (PPO) and Advantage Actor-Critic (A2C).

\underline {\large 3.2 Finite State Machines}

A finite state machine is an abstract machine that can be in exactly one of a finite number of states at any given time (source: Wikipedia). A Finite State Machine (FSM) is also a control system design methodology that describes the behavior or the working principle of the system by using the following three points: 

1. state, 
2. event, 
3. action

The FSM system transitions from one state to another based on events, by taking action. Actions can be simple, or could involve a complex set of processes.
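The state, event and action triple can be sketched as a lookup table. The example below is a minimal illustration only; the state names, event names and actions are hypothetical and are not taken from Kane's implementation.

```python
from enum import Enum, auto


class State(Enum):
    NAVIGATE = auto()
    ATTACK = auto()
    DONE = auto()


# (current state, event) -> (next state, action).
# Every name in this table is a hypothetical placeholder.
TRANSITIONS = {
    (State.NAVIGATE, "enemy_seen"): (State.ATTACK, "shoot"),
    (State.NAVIGATE, "path_clear"): (State.NAVIGATE, "move_forward"),
    (State.NAVIGATE, "goal_reached"): (State.DONE, "stop"),
    (State.ATTACK, "enemy_seen"): (State.ATTACK, "shoot"),
    (State.ATTACK, "path_clear"): (State.NAVIGATE, "move_forward"),
}


def step(state, event):
    """Return (next_state, action); unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), (state, "idle"))


def run(events, state=State.NAVIGATE):
    """Feed a sequence of events through the machine and collect the actions."""
    actions = []
    for event in events:
        state, action = step(state, event)
        actions.append(action)
    return state, actions
```

Keeping the transitions in a table makes the machine's behaviour easy to inspect, at the cost of having to enumerate every state and event pair by hand.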

\underline {\large 3.3 The Building Blocks of Deep Reinforcement Learning}

\underline {Markov Decision Process}

\normalsize Markov Decision Process (MDP) is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDP is used in many disciplines, including robotics, automatic control, economics and manufacturing.

A Markov Decision Process is a 4-tuple $(S, A, P_a, R_a)$, where $S$ is the state space, $A$ is the action space, $P_a(s, s')$ is the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$, and $R_a(s, s')$ is the reward received for the transition from $s$ to $s'$.

In reinforcement learning, the MDP framework is used when the transition probabilities or rewards are unknown (Shoham et al., 2003). Reinforcement learning can solve Markov Decision Processes without an explicit specification of the transition probabilities, whereas value iteration and policy iteration need those probabilities as input. 
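To make the role of the transition probabilities concrete, the following is a minimal sketch of value iteration on a toy three-state MDP. The states, probabilities, rewards and the discount factor `gamma` are invented for illustration and have no connection to the ViZDoom environment.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """P[s][a] is a list of (probability, next_state); R(s, a, s2) is the reward."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: best expected return over all actions.
            best = max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for p, s2 in P[s][a])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


# Toy MDP (assumed values): advancing succeeds with probability 0.8.
states = ["start", "mid", "goal"]
actions = ["advance", "wait"]
P = {
    "start": {"advance": [(0.8, "mid"), (0.2, "start")], "wait": [(1.0, "start")]},
    "mid": {"advance": [(0.8, "goal"), (0.2, "mid")], "wait": [(1.0, "mid")]},
    "goal": {"advance": [(1.0, "goal")], "wait": [(1.0, "goal")]},
}


def R(s, a, s2):
    """Reward of 1 for entering the goal, 0 otherwise."""
    return 1.0 if s != "goal" and s2 == "goal" else 0.0


V = value_iteration(states, actions, P, R)
```

The function cannot run without `P`, which is exactly the information a reinforcement learning agent does not have and must instead sample from the environment.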


\underline {Deep Reinforcement Learning}

In reinforcement learning, instead of explicit specification of the transition probabilities, the transition probabilities are accessed through a simulator that is typically restarted many times from a uniformly random initial state. Reinforcement learning can also be combined with function approximation to address problems with a very large number of states.

**Reinforcement learning** is a **generic framework** for representing and solving control tasks. There are a number of algorithms within the framework, and below are some of them:

* Associative reinforcement learning
* Deep reinforcement learning
* Fuzzy reinforcement learning
* Inverse reinforcement learning
* Safe reinforcement learning
* Partially supervised reinforcement learning

Implementers are free to choose which algorithms to apply to a particular control task. Deep learning algorithms are a natural choice as they are able to process complex data efficiently. An RL agent learns a good policy (or controller) by trial and error, based on its observations and on numeric reward feedback for the previously performed action, without necessarily knowing the dynamics of the environment (Buffet et al., 2020).
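Trial-and-error learning without a model of the dynamics can be illustrated with tabular Q-learning. This is a swapped-in textbook algorithm used only as a sketch; it is not one of the algorithms used for Abel, and the five-cell corridor, the reward and every parameter value are assumptions.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn Q[state][action] on a toy corridor; action 0 = left, 1 = right."""
    rng = random.Random(seed)
    goal = n_states - 1
    Q = [[0.0, 0.0] for _ in range(n_states)]

    for _ in range(episodes):
        s = 0
        for _ in range(200):  # step cap so an episode always ends
            if rng.random() < epsilon:
                a = rng.randrange(2)  # explore
            else:
                best = max(Q[s])
                a = rng.choice([i for i in (0, 1) if Q[s][i] == best])  # exploit
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == goal else 0.0
            # The update uses only the sampled transition, never P(s' | s, a).
            target = r if s2 == goal else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
            if s == goal:
                break
    return Q
```

After training, the greedy action in every non-terminal cell should be "right", even though the agent was never told how the corridor works.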

\underline {\large 3.4 Reinforcement learning and video games.}

\normalsize Artificial intelligence in video games is a longstanding research area. Video games are excellent media for testing A.I. techniques, and their environments introduce new ways of understanding how an A.I. can be constructed. Various games provide interesting and complex problems for agents to solve, which makes video games well suited to A.I. research. These virtual environments are safe and controllable. In addition, game environments provide an effectively unlimited supply of useful data for machine learning algorithms, and they can run much faster than real time (Zhao et al., 2019).

An agent learning in a dynamic environment without supervision would be useful in a number of fields, such as:

* image labeling
* autonomous navigation (e.g. self-driving cars)
* autonomous control tasks (e.g. traffic control)
* automation in healthcare (e.g. diagnosis and surgery)
* robotics (e.g. inventory management tasks such as labeling and warehousing)

\pagebreak


\underline {\large Algorithms and Tooling}

\underline {\large Video game engine}

\normalsize [VizDoom](https://github.com/mwydmuch/ViZDoom) is a software platform for research into autonomous control based on raw visual information. The environment is based on the classic first-person shooter from the 90s, Doom. The environment allows the development of game-playing bots which interpret information based only on the screen buffer.  VizDoom also provides:

* a realistic physics model,
* customization of environmental parameters, maps, non-player characters, rewards, goals and actions
* visual and audio feedback for perception and interpretation by the game-playing agent.

First-person shooter (FPS) games such as Unreal Tournament, Quake III Arena and Counter-Strike have been used for A.I. research. The drawback of using these games is the availability of high-level information such as the position of walls and enemies. VizDoom's feedback can be accessed only as _raw visual information_, requiring the agent to respond to purely visual feedback.

\underline {\large Building the simulator}

\normalsize With a game engine, a 2D environment can be created and simulated. This environment can then be used to synthesize data for the agent to 'consume'.

The game engine **environmental properties** would be represented as numerical data, which is used to produce rewards, navigate and shoot.

**OpenAI's Gym** wrappers are a convenient way to modify an existing environment without having to alter the underlying code directly. With OpenAI's Gym wrapper, the code can be modular and the same wrapper and environment can be utilised by both Kane and Abel.

Gym implements the classic “agent-environment loop”; action spaces and rewards are generated from the game engine's API.
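The agent-environment loop can be sketched without the game engine. The toy class below only mirrors the shape of the classic Gym interface, where `reset()` returns an observation and `step(action)` returns an observation, a reward, a done flag and an info dictionary (newer Gym and Gymnasium releases split the done flag into two). The class, its reward values and the policy are hypothetical stand-ins, not the project's wrapper.

```python
class ToyCorridorEnv:
    """Hypothetical stand-in for a wrapped game: walk forward along a corridor."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action (0 = stay, 1 = move forward) and report the outcome."""
        if action == 1:
            self.position += 1
        done = self.position >= self.length
        reward = 1.0 if done else -0.01  # small time penalty on every step
        return self.position, reward, done, {}


def run_episode(env, policy, max_steps=100):
    """Run one episode: observe, act, receive a reward, repeat until done."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done, _info = env.step(policy(obs))
        total += reward
        if done:
            break
    return total
```

Because `run_episode` depends only on `reset` and `step`, the same loop could drive either a scripted agent or a learned policy, which is the modularity the wrapper is meant to provide.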

_Please refer to Appendix B: Code, Block 3.1._

\underline {\large Image Processing}

**Raw image data is noisy**. Image processing should be included in the pipeline to remove unnecessary data which should speed up the models and algorithms.

Removal of unnecessary data also frees up computing resources from tasks which do not contribute to the IA's feedback and decision making.
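As a rough illustration of this kind of preprocessing, the sketch below converts an RGB frame to grayscale with the standard luma weights and then keeps every second pixel. It works on plain nested lists so that it runs without an imaging library; the function names and the downsampling factor are assumptions, and a real pipeline would more likely use an array library.

```python
def to_grayscale(frame):
    """Convert rows of (r, g, b) tuples with 0-255 channels to rows of gray levels."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in frame
    ]


def downsample(gray, factor=2):
    """Keep every `factor`-th pixel in both directions."""
    return [row[::factor] for row in gray[::factor]]


# A 4x4 all-white frame shrinks to a 2x2 grayscale frame.
white_frame = [[(255, 255, 255)] * 4 for _ in range(4)]
small = downsample(to_grayscale(white_frame))
```

Dropping two colour channels and three quarters of the pixels reduces the input to one twelfth of its original size before any model sees it.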

\underline {\large Object Recognition With Deep Learning}

Object recognition is a common task and, at the time of writing, machine learning practitioners are spoilt for choice of frameworks. The following machine learning architectures and models can accelerate the development of machine learning algorithms for both Kane _and_ Abel.

1. [Detectron2](https://github.com/facebookresearch/detectron2)
2. [YOLOv5](https://github.com/ultralytics/yolov5)
3. [Cascade Classifier](https://docs.opencv.org/3.4/db/d28/tutorial_cascade_classifier.html)

\underline {\large Reinforcement Learning with Machine Learning frameworks}

The machine learning ecosystem is dominated by TensorFlow and PyTorch. Both frameworks work well for most machine learning tasks.

PyTorch will be used to implement the reinforcement learning agent due to the following reasons:

1. tooling support,
2. available code examples,
3. extensive documentation.

\underline {\large Dynamic Programming vs Reinforcement Learning}

\underline {Kane}
\normalsize With dynamic programming, the agent will offer predictable results. However, *consistency* of results will be difficult to maintain across variations of the same environment. For example, when the agent is transferred from a small map to a larger map with slight variations in enemies the algorithms will require extensive tweaking and testing.

\underline {Abel}
The route to success in reinforcement learning isn’t as obvious — hyperparameter tuning within a large search space with the PPO and A2C algorithm is required, followed by **reward shaping** and **curriculum learning**.

\underline {\large Plan of Work}

_Please refer to Appendix A: Images, Figure 3.8._

\underline {\large Testing and Evaluation}

The performance of both agents will be tested on the ViZDoom scenario 'Deadly Corridor'. Deadly Corridor is representative of the challenges players face in a first-person shooter: collecting power-ups and health items, killing enemies and moving towards a goal. Both agents must navigate a narrow corridor with enemies to reach the end without being killed.

\underline {Testing Kane}

\normalsize Testing Kane in the ViZDoom environment is a straightforward approach. **Will the FSM's algorithm enable the bot to kill all enemies and reach the end of the corridor to complete the level?**

\underline {Testing Abel - Welch-T Test}
The performances of Reinforcement Learning algorithms have specific characteristics: they are independent of each other and they are not paired between algorithms. Statistical difference testing offers a principled way to compare the central performances of two algorithms (Colas et al., 2019).

Testing the effectiveness of PPO vs. A2C will be done with **Welch's t-test**. This test compares the means of two independent samples of scores, under the null hypothesis that the two samples have identical average (expected) values. Unlike Student's t-test, Welch's test does not assume that the populations have identical variances.

Testing will be done with the SciPy library's `ttest_ind` function, called with `equal_var=False`; the function's default assumes equal variances and performs Student's t-test instead.
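For reference, the statistic behind the test can be computed by hand. The sketch below uses only the standard library and returns the t statistic together with the Welch-Satterthwaite degrees of freedom; the two score lists are made-up numbers, not results from this project. With SciPy available, `scipy.stats.ttest_ind(a, b, equal_var=False)` should give the same statistic along with a p-value.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)  # squared standard error of the mean of b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical episode scores for two algorithms.
scores_a = [10, 12, 14, 16, 18]
scores_b = [9, 10, 11, 12, 13]
t_stat, dof = welch_t(scores_a, scores_b)
```

The degrees of freedom are not an integer in general, which is the visible difference from Student's t-test with pooled variance.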

\pagebreak


\underline{\huge 4. Implementation}

\par

\underline{\large Environment creation with OpenAI Gym and ViZDoom.}

The ViZDoom game engine is 'wrapped' with OpenAI's Gym library, which will produce feedback in the form of rewards, action spaces and episodic levels. The environment's visual buffer will be used to train Abel and provide Kane with visual feedback. Each frame of the buffer will be 320 x 240 pixels, and each frame is converted to grayscale. The scenario used for both Kane and Abel is 'Deadly Corridor', a long hallway with enemies and an armour power-up at the end of the level. The goal of both Kane and Abel is to reach the end without being killed by the enemies.

\begin{figure}
    \centering
    \includegraphics[scale=0.4]{images/deadly_corridor.png}
    \par
    \centering {Figure 4.1: Deadly Corridor level.\par}
\end{figure}

The image illustrating the level above shows the following; a **green dot** representing the **agent**, **red dots** representing **enemies** and a single **orange dot** on the far right which is an **armour power-up** and the end of the level.

\underline{\large Implementing Kane}

\normalsize Implementing an A.I. for Kane required algorithms to solve the following:

1. Localization
2. Object Detection
3. Navigation

Kane is created with dynamic programming techniques implemented within the paradigm of a Finite State Machine.

\underline{\large Navigation}

The dynamic programming technique **A\* Search** is an improvement of Dijkstra's algorithm, which is a greedy algorithm (Rachmawati et. al., 2019).

Dijkstra and A* perform equally well for solving small scale pathfinding (e.g. town or regional scale maps), but A* performs better with large scale pathfinding.

A* search would be run *exactly once* during each episode to solve the map, and the calculated path saved. The agent would move along the calculated path and fire on any enemies it detects.
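A minimal sketch of A* on a small occupancy grid is given below, assuming 4-connected movement, unit step costs and a Manhattan-distance heuristic. The grid representation and function name are illustrative assumptions; how the Deadly Corridor map is actually converted into a graph is not shown here.

```python
import heapq


def a_star(grid, start, goal):
    """Shortest 4-connected path on a grid of 0 (free) and 1 (wall), or None."""

    def h(cell):
        # Manhattan distance never overestimates on a 4-connected unit-cost grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]  # entries are (f = g + h, g, cell)
    came_from = {start: None}
    best_cost = {start: 0}
    while open_heap:
        _, cost, current = heapq.heappop(open_heap)
        if current == goal:
            path = []
            while current is not None:  # walk back through the parents
                path.append(current)
                current = came_from[current]
            return path[::-1]
        if cost > best_cost[current]:
            continue  # stale heap entry, a cheaper route was already found
        r, c = current
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    came_from[(nr, nc)] = current
                    heapq.heappush(open_heap, (new_cost + h((nr, nc)), new_cost, (nr, nc)))
    return None


grid = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
path = a_star(grid, (0, 0), (2, 0))
```

Since the path is computed once and stored, the per-step cost during an episode reduces to reading the next waypoint from the list.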


\underline{\large Object Detection}

\normalsize Object detection is crucial for implementing Kane's algorithm. The object detection model had to be able to predict the presence of enemies from the game engine's screen buffer.

\underline{\large Image labeling and annotation.}

\normalsize All training images for the custom object detection model were labelled with LabelImg, which is a GUI image annotation tool. The code for this tool can be found [here](https://github.com/heartexlabs/labelImg).

The object detection model was attempted with 3 different machine learning architectures, which are listed below in order of attempts:

1. Haar Cascades
2. Detectron 2
3. YOLOv5

\normalsize YOLOv5 is a family of object detection architectures and models pretrained on the COCO dataset [15]. YOLO is one of the most famous object detection algorithms due to its speed and accuracy. YOLOv5 is a collection of 5 models (nano, small, medium, large and Xlarge), named for the size of the model's layers.

Training a custom model for YOLOv5 was similar to training a model for Detectron 2:

1. generate and annotate a collection of training and validation images of the objects
2. train the custom model
3. run real time detection with the custom model.

_Please refer to Appendix B: Code, Block 4.1: Training a YOLOv5 model for object detection._

\underline{\huge Implementing Abel}

\underline {\large Deep Reinforcement Learning for VizDoom}

On-policy reinforcement learning is useful when you want to optimize the value of an agent that is exploring. For offline learning, where the agent does not explore much, off-policy RL may be more appropriate. For instance, off-policy classification is good at predicting movement in robotics. 

The idea behind Reinforcement Learning is that an agent (an AI) will learn from the environment by interacting with it (through trial and error) and receiving rewards (negative or positive) as feedback for performing actions (Simonini et. al, 2019).

A breakdown of possible Rewards for our Agent in the VizDoom Environment is as follows:

* Our Agent receives state $S_0$ from the Environment — we receive the first frame of our game (Environment).
* Based on that state $S_0$, the Agent takes action $A_0$ — our Agent will shoot its weapon
* Environment goes to a new state $S_1$ — new frame.
* The environment gives some reward $R_1$ to the Agent — if the Agent hits the demon (Positive Reward +1). Otherwise, the agent takes a new action, $A_1$, which might be: 
 
Abel is implemented with a class of Reinforcement Learning algorithm named PPO. The main idea behind PPO is that after an update, the new policy should be not too far from the old policy. For that, PPO uses clipping to avoid too large an update.
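The clipping idea can be written down for a single sample. The sketch below evaluates the clipped surrogate term from the PPO paper for one action, given the log-probabilities under the new and old policy and an advantage estimate; the function name and the default `clip_range` of 0.2 are illustrative choices, and a full implementation would average this term over a batch.

```python
import math


def clipped_surrogate(log_prob_new, log_prob_old, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum removes any incentive to push the ratio outside the interval.
    return min(ratio * advantage, clipped_ratio * advantage)
```

For example, with a ratio of 1.5 and a positive advantage of 1 the term is capped at 1.2, so making the action even more likely yields no further gain in the objective.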

\underline {Reward Shaping}
Reward shaping is a technique inspired by animal training where supplemental rewards are provided to make a problem easier to learn. There is usually an obvious natural reward for any problem. For games, this is usually a win or loss. Reward shaping is a method for engineering a reward function in order to provide more frequent feedback on appropriate behavior (Wiewora 2017).
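A shaped reward can be sketched as the environment's own reward plus weighted changes in a few game variables. The variable names (`damage_dealt`, `health`, `ammo`) and the weights below are hypothetical; they only show the form such a function might take, not the values used in this project.

```python
def shaped_reward(base_reward, prev, curr, w_damage=1.0, w_health=0.5, w_ammo=0.25):
    """Add feedback for damage dealt, health lost and ammunition spent.

    `prev` and `curr` are dicts of game variables before and after the step.
    """
    damage_term = w_damage * (curr["damage_dealt"] - prev["damage_dealt"])
    health_term = w_health * (curr["health"] - prev["health"])  # negative when hurt
    ammo_term = w_ammo * (curr["ammo"] - prev["ammo"])          # negative per shot
    return base_reward + damage_term + health_term + ammo_term


before = {"damage_dealt": 0, "health": 100, "ammo": 50}
after = {"damage_dealt": 20, "health": 90, "ammo": 48}
reward = shaped_reward(1.0, before, after)
```

The weights decide which behaviour the agent is nudged towards, so they become additional values to tune alongside the algorithm's own hyperparameters.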

\underline {Curriculum Learning}
Tasks or data samples can be sequenced into a curriculum for the purpose of learning a problem that may otherwise be too difficult to learn from scratch (Narvekar et al., 2020). The idea of using such curricula to train artificial agents dates back to the early 1990s, where the first known applications were to grammar learning, robotics control problems and classification problems. As the agent 'learns', it is trained on increasingly difficult curricula.
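One simple way to sequence a curriculum is to promote the agent to the next difficulty level once its recent mean reward clears a threshold. The rule, the threshold and the notion of numbered levels in the sketch below are assumptions made for illustration.

```python
def next_level(level, recent_rewards, threshold, max_level):
    """Advance one difficulty level when the recent mean reward reaches the threshold."""
    if recent_rewards and sum(recent_rewards) / len(recent_rewards) >= threshold:
        return min(level + 1, max_level)
    return level
```

Called after each evaluation window, `next_level(1, [10, 20, 30], 15, 5)` would move the agent from level 1 to level 2, while a window of low scores leaves the level unchanged.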

\underline {\large Hyperparameters and Automated Hyperparameter Tuning}

Hyperparameters are essential components of a machine learning model. Their values determine the model's behaviour, and it is vital that optimized hyperparameter values are used for each task. Because the search space of hyperparameter values is large and high-dimensional, this process is best automated.

\underline {Hyperparameter tuning and Optuna}

Optuna is a library for implementing automated hyperparameter tuning for machine learning models. Optuna boasts an efficient sampling and pruning mechanism, easy scalability and ease of setup (Takuya et. al. 2019). 

Hyperparameters

* Gamma - quantifies how much importance is given to future rewards.
* NStep - the number of steps to run for each environment per update.
* Learning Rate - this value controls the rate or speed at which the model learns.
* Entropy Coefficient - the weight of the entropy bonus in the loss, which regularizes the policy by encouraging exploration.
* Clip Range - the bound on how far the probability ratio between the new and old policy may move away from 1 in a single PPO update.
* GAE Lambda - a smoothing parameter for Generalized Advantage Estimation that trades bias against variance, which produces more stable training.
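The hyperparameters above could be exposed to a tuner through a sampling function such as the sketch below. It relies only on the `suggest_float` and `suggest_int` methods that an Optuna trial object provides, and the keys match the corresponding `stable_baselines3` PPO arguments. The search ranges are illustrative assumptions, and `MidpointTrial` is a hypothetical stub included only so that the sketch runs without Optuna installed.

```python
def sample_ppo_params(trial):
    """Draw one PPO configuration; `trial` must offer Optuna-style suggest_* methods."""
    return {
        "gamma": trial.suggest_float("gamma", 0.8, 0.9999),
        "n_steps": trial.suggest_int("n_steps", 2048, 8192),
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-4, log=True),
        "ent_coef": trial.suggest_float("ent_coef", 1e-8, 1e-2, log=True),
        "clip_range": trial.suggest_float("clip_range", 0.1, 0.4),
        "gae_lambda": trial.suggest_float("gae_lambda", 0.8, 0.99),
    }


class MidpointTrial:
    """Hypothetical stand-in for a trial object: always returns the range midpoint."""

    def suggest_float(self, name, low, high, log=False):
        # Geometric midpoint for log-scaled ranges, arithmetic midpoint otherwise.
        return (low * high) ** 0.5 if log else (low + high) / 2

    def suggest_int(self, name, low, high):
        return (low + high) // 2


params = sample_ppo_params(MidpointTrial())
```

Inside a real study, the returned dictionary would typically be unpacked into the model constructor and the objective function would return the mean evaluation reward.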

\underline {\large Training models with the PPO and A2C algorithms implemented with the Stable Baselines3 library.}

PPO and A2C are imported from the `stable_baselines3` library with one line:

```
from stable_baselines3 import PPO, A2C
```

Implementation of both algorithms will be done in two stages:

* Stage One - each algorithm is tuned over 100 trials of 100,000 timesteps each. The optimal hyperparameters are recorded in a SQLite database. At the end of the 100 trials, the hyperparameters which produced the best score are used for Stage Two.
* Stage Two - PPO and A2C are trained for 500,000 timesteps with the optimized hyperparameters. The training pipeline will include reward shaping and curriculum learning techniques. At the end of Stage Two, both models are evaluated with Welch's t-test.
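
Welch's t-test compares the means of two samples without assuming equal variances. The sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom using only the standard library; the two lists of episode rewards are invented placeholders, not results from this project.

```python
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's unequal-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (variance() is the sample variance).
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / (se_a + se_b) ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof


# Placeholder episode rewards for two trained models.
rewards_model_a = [2.0, 4.0, 6.0, 8.0]
rewards_model_b = [1.0, 2.0, 3.0, 4.0]
t_stat, dof = welch_t(rewards_model_a, rewards_model_b)
```

The p-value then follows from the Student's t distribution with `dof` degrees of freedom; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the whole test in one call.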


_Please refer to Appendix B: Code, Block 4.2: Training the PPO algorithm with the ViZDoom environment._

\pagebreak

\begin{center}
\huge 9. Testing
\end{center}

---


\underline{\huge Testing Kane's Algorithms}


\underline{\large Localization and Navigation - A* Search}

A* is a [graph traversal](https://en.wikipedia.org/wiki/Graph_traversal "Graph traversal") and [path search](https://en.wikipedia.org/wiki/Pathfinding "Pathfinding") [algorithm](https://en.wikipedia.org/wiki/Algorithm "Algorithm"), which is used in many fields of computer science due to its completeness, optimality, and optimal efficiency.

[Peter Hart](https://en.wikipedia.org/wiki/Peter_E._Hart "Peter E. Hart"), [Nils Nilsson](https://en.wikipedia.org/wiki/Nils_Nilsson_(researcher) "Nils Nilsson (researcher)") and [Bertram Raphael](https://en.wikipedia.org/wiki/Bertram_Raphael "Bertram Raphael") of Stanford Research Institute (now [SRI International](https://en.wikipedia.org/wiki/SRI_International "SRI International")) first published the algorithm in 1968 [7]. It can be seen as an extension of [Dijkstra's algorithm](https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm "Dijkstra's algorithm"). A* achieves better performance by using [heuristics](https://en.wikipedia.org/wiki/Heuristic_(computer_science) "Heuristic (computer science)") to guide its search.

A* search is an iterative, ordered search which maintains a set of open board states to explore in an attempt to reach its final goal.
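
A minimal sketch of A* on a 4-connected grid is shown below, using the Manhattan distance as the heuristic. The grid, the `astar` function and the unit step cost are illustrative assumptions and do not reproduce Kane's navigation code.

```python
import heapq


def astar(grid, start, goal):
    """Return the shortest path from start to goal as a list of (row, col) cells.

    grid is a list of strings where '#' marks a wall; returns None if no path exists.
    """
    def heuristic(cell):
        # Manhattan distance never overestimates on a 4-connected grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]    # entries are (f = g + h, g, cell)
    came_from = {start: None}
    best_cost = {start: 0}

    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            # Walk the parent links back to the start to rebuild the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if cost > best_cost[cell]:
            continue                               # stale heap entry
        row, col = cell
        for nxt in ((row + 1, col), (row - 1, col), (row, col + 1), (row, col - 1)):
            in_bounds = 0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
            if not in_bounds or grid[nxt[0]][nxt[1]] == "#":
                continue
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                came_from[nxt] = cell
                heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None


grid = [
    "..#.",
    "..#.",
    "....",
]
path = astar(grid, start=(0, 0), goal=(0, 3))
```

The open set is kept in a priority queue ordered by `f = g + h`, so the cell that looks closest to the goal is always expanded first; this is the ordered search over open states described above.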

\underline{\large Object Detection - Haar Cascades}

\normalsize Creating the samples for training took **16 hours** for 50 images which were mostly 640x480 in resolution.

```sh
./opencv_createsamples -vec positives.vec -info positives.txt -bg ./negative/negative.txt

# OUTPUT
Info file name: positives.txt
Img file name: (NULL)
Vec file name: positives.vec
BG  file name: ./negative/negative.txt
Num: 1000
BG color: 0
BG threshold: 80
Invert: FALSE
Max intensity deviation: 40
Max x angle: 1.1
Max y angle: 1.1
Max z angle: 0.5
Show samples: FALSE
Width: 64
Height: 64
Max Scale: -1
RNG Seed: 12345
Create training samples from images collection...
```

Attempts to train the model afterwards threw errors which were difficult to debug.

```sh
./opencv_traincascade -data data -vec positives.vec -bg ./negative/negative.txt

# OUTPUT
OpenCV Error: Parsing error (wrong file format for positives.vec
) in create, file /projects/opencv-3.4.0/apps/traincascade/imagestorage.cpp, line 141
terminate called after throwing an instance of 'cv::Exception'
  what():  /projects/opencv-3.4.0/apps/traincascade/imagestorage.cpp:141: error: (-212) wrong file format for positives.vec
 in function create

Aborted (core dumped)
```

After an abortive attempt to debug the issue with training a Haar Cascade algorithm, a new attempt with Detectron 2 was started.

\pagebreak

\underline{\large Object Detection - Detectron 2}

Training a model for Detectron 2 took significantly less time compared to training Haar Cascades on the same training dataset.

```sh
# OUTPUT
[07/23 16:24:46 d2.data.common]: Serializing 1 elements to byte tensors and concatenating them all ...
[07/23 16:24:46 d2.data.common]: Serialized dataset takes 0.00 MiB
WARNING [07/23 16:24:46 d2.solver.build]: SOLVER.STEPS contains values larger than SOLVER.MAX_ITER. These values will be ignored.
The checkpoint state_dict contains keys that are not used by the model:
  pixel_mean
  pixel_std
[07/23 16:24:46 d2.engine.train_loop]: Starting training from iteration 0
[07/23 16:24:58 d2.utils.events]:  eta: 0:10:52  iter: 19  total_loss: 0.864  loss_cls:
0.6261  loss_box_reg: 0.2209  time: 0.5857  data_time: 0.0056  lr: 4.9953e-06  max_mem:
2163M
[07/23 16:25:10 d2.utils.events]:  eta: 0:10:54  iter: 39  total_loss: 1.466  loss_cls:
0.9281  loss_box_reg: 0.5527  time: 0.5958  data_time: 0.0013  lr: 9.9902e-06  max_mem:
2163M
[07/23 16:25:22 d2.utils.events]:  eta: 0:10:52  iter: 59  total_loss: 1.084  loss_cls:
0.5975  loss_box_reg: 0.5109  time: 0.5997  data_time: 0.0013  lr: 1.4985e-05  max_mem:
2163M
```

Training for 10000 iterations takes about 6 hours. Testing results were **negative**.

```python
import os

import cv2
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor
from detectron2.utils.visualizer import Visualizer

# Load the weights produced by training into a predictor.
cfg = get_cfg()
cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, "model_final.pth")
predictor = DefaultPredictor(cfg)

# Run inference on a single test frame (OpenCV loads images as BGR).
img = cv2.imread('./test.png')
outputs = predictor(img)

# The Visualizer expects RGB input, so the channel order is reversed.
visualizer = Visualizer(img[:, :, ::-1], scale=1)
drawn = visualizer.draw_instance_predictions(outputs["instances"].to("cpu"))

# Convert back to BGR before displaying the result with OpenCV.
cv2.imshow('', drawn.get_image()[:, :, ::-1])
cv2.waitKey(0)
```

The trained model failed to detect the enemies on screen.

\begin{figure}
        \centering
        \includegraphics{images/detectron2_test.png}
        \caption{Detectron 2 test.}
\end{figure}

\pagebreak

\underline{\large Object Detection - YOLOv5}

In contrast with Haar Cascades and Detectron 2, training YOLOv5's small model over 550 epochs took only **23 minutes** to complete.

Testing code:

```python
import torch
import cv2

# load trained model
model = torch.hub.load('ultralytics/yolov5', 'custom', path = './best.pt') 
img = './test.png'

results = model(img)
results.show()
```

Running inference with YOLOv5

```sh
python3 test.py

# OUTPUT
Using cache found in /.cache/torch/hub/ultralytics_yolov5_master
YOLOv5 🚀 2022-8-6 Python-3.8.10 torch-1.12.1+cu102 CUDA:0 (NVIDIA GeForce GTX 970, 4035MiB)

Fusing layers... 
Model summary: 213 layers, 7058671 parameters, 0 gradients, 15.9 GFLOPs
Adding AutoShape... 
image 1/1: 405x640 1 demon, 1 demon_02, 1 demon_03
Speed: 18.5ms pre-process, 10.4ms inference, 21.3ms NMS per image at shape (1, 3, 416, 640)
```

\begin{figure}
        \centering
        \includegraphics{images/yolov5_test.png}
        \caption{YOLOv5 test.}
\end{figure}

\pagebreak

\underline{\huge Testing Abel's Algorithms}

\begin{figure}
        \centering
        \includegraphics{images/PPO_firing.png}
        \caption{Agent trained with the PPO algorithm.}
\end{figure}

The agent was able to find the enemy and kill it after training for over 1,000,000 timesteps.

\underline{\large Drawbacks of PPO}

PPO training has a couple of notable drawbacks:

1. Missing support for GPU training.
2. Lack of pre-trained models for transfer learning.

\underline{\large Reward Shaping}

* HITCOUNT
* SELECTED_WEAPON_AMMO
* DAMAGE_TAKEN
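
One way these three game variables could be folded into the reward is sketched below: the change in each variable since the previous step is weighted and added to the reward returned by the game. The weights and the `shaped_reward` helper are hypothetical and are not the values used in training.

```python
# Hypothetical weights: reward hits, penalise spent ammunition and damage taken.
WEIGHTS = {
    "HITCOUNT": 200.0,
    "SELECTED_WEAPON_AMMO": 5.0,
    "DAMAGE_TAKEN": -10.0,
}


def shaped_reward(base_reward, previous, current, weights=WEIGHTS):
    """Add weighted per-step changes in the game variables to the base reward.

    previous and current map variable names to their values at the last
    and the current step.
    """
    bonus = 0.0
    for name, weight in weights.items():
        delta = current[name] - previous[name]
        bonus += weight * delta
    return base_reward + bonus


previous = {"HITCOUNT": 0, "SELECTED_WEAPON_AMMO": 50, "DAMAGE_TAKEN": 0}
current = {"HITCOUNT": 1, "SELECTED_WEAPON_AMMO": 49, "DAMAGE_TAKEN": 4}
reward = shaped_reward(-1.0, previous, current)
```

Because ammunition only decreases when a shot is fired, a positive weight on its change acts as a small penalty per shot, which discourages firing without hitting.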

\underline{\large Curriculum Learning}


\pagebreak

\begin{center}
\huge 10. Results
\end{center}


---

(section under construction)

\pagebreak


\begin{center}
\huge 11. Evaluation
\end{center}

---

\underline {\large Multi-player battle between bots}

Kane will not be playing.

The battle is a deathmatch between Abel and Abel.



\pagebreak


\begin{center}
\huge 12. Conclusion and Further Work
\end{center}

---

(section under construction)

\pagebreak

\begin{center}
\huge 13. User Guide
\end{center}

---

(section under construction)

\pagebreak


\begin{center}
\huge 14. Bibliography
\end{center}

---

1. Iversen, A. & Taylor, Nick & Brown, Keith. (2005). Classification and verification through the combination of the multi-layer perceptron and auto-association neural networks. Proceedings of the International Joint Conference on Neural Networks. 2. 1166 - 1171 vol. 2. 10.1109/IJCNN.2005.1556018. 
2. [Intelligent agent - Wikipedia](https://en.wikipedia.org/wiki/Intelligent_agent)
3. Russell, Stuart J.; Norvig, Peter (2003). [*Artificial Intelligence: A Modern Approach*](http://aima.cs.berkeley.edu/) (2nd ed.). Upper Saddle River, New Jersey: Prentice Hall. Chapter 2. ISBN 0-13-790395-2.
4. Minoru Asada, Shoichi Noda, Sukoya Tawaratsumida, and Koh Hosoda.
   Purposive behavior acquisition for a real robot by vision-based reinforcement learning. In Recent Advances in Robot Learning, pages 163–187.
   Springer, 1996.
5. Schulman et al., Proximal Policy Optimization Algorithms, arXiv, 2017.
6. OpenCV: Cascade Classifier
7. Hart, P. E.; Nilsson, N. J.; Raphael, B. (1968). "A Formal Basis for the Heuristic Determination of Minimum Cost Paths". *IEEE Transactions on Systems Science and Cybernetics*. **4** (2): 100–107. doi:[10.1109/TSSC.1968.300136](https://doi.org/10.1109%2FTSSC.1968.300136).
8. Hans-Georg Beyer, (2007), Scholarpedia. [Evolution strategies - Scholarpedia](http://www.scholarpedia.org/article/Evolution_strategies)
9. She, Jennifer, 2018, Combining PPO and Evolutionary Strategies
   for Better Policy Search. https://cs229.stanford.edu/proj2018/report/251.pdf
10. [Proximal Policy Optimization](https://keras.io/examples/rl/ppo_cartpole/)
11. Rabin, S. 2015. JPS+ now with Goal Bounding: Over 1000 × Faster than A*, GDC 2015.
12. Hamalainen, Perttu, Babadi, Amin, Ma, Xiaoxiao, and Lehtinen, Jaakko. PPO-CMA: proximal policy optimization with covariance matrix adaptation.
13. https://docs.opencv.org/3.4/db/d28/tutorial_cascade_classifier.html
14. https://github.com/facebookresearch/detectron2
15. https://github.com/ultralytics/yolov5
16. https://huggingface.co/blog/deep-rl-intro
17. https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html
18. https://keras.io/examples/rl/actor_critic_cartpole/
19. Zhong-Qiu, Zhao, et al. 2017 https://arxiv.org/pdf/1807.05511.pdf
20. Dian Rachmawati, Lysander Gustin, IOP Science, 2019, https://iopscience.iop.org/article/10.1088/1742-6596/1566/1/012061/

(section under construction)

\pagebreak

\begin{center}
\huge 15. Appendices
\end{center}

---

\underline {\huge APPENDIX A: Images}


\underline {\large 3. Design}

\begin{figure}
        \centering
        \includegraphics{images/simple_reflex_agent.png}
        \par
        Figure 3.1: Simple Reflex Agent.\par
\end{figure}


\begin{figure}
        \centering
        \includegraphics{images/intelligent_agent.png}
        \par
        Figure 3.2: Intelligent Agent.\par
\end{figure}


\begin{figure}
    \centering
    \includegraphics[scale=0.7]{images/vizdoom.png}
    \par
    \centering {Figure 3.3: ViZDoom game engine.\par}
\end{figure}

\begin{figure}
    \centering
    \includegraphics[scale=0.45]{images/AE_loop_dark.png}
    \par
    \centering {Figure 3.4: OpenAI Gym Loop.\par}
\end{figure}

\begin{figure}
        \centering
        \includegraphics{images/cnn_rl.png}
        \par
        Figure 3.5: A simplified representation of CNNs in Reinforcement Learning.\par
\end{figure}

\begin{figure}
    \centering
    \includegraphics[scale=0.8]{images/FSM_kane.png}
    \par
    \centering {Figure 3.6: FSM state diagram for Kane.\par}
\end{figure}


\begin{figure}
        \centering
        \includegraphics{images/RL_process.jpg}
        \par
        Figure 3.7: Reinforcement learning process for Abel. \par
\end{figure}

\begin{figure}
        \centering
        \includegraphics{images/plan_of_work.png}
        \par
        Figure 3.8: Plan of work. \par
\end{figure}

\underline {\large 4. Implementation}

\begin{figure}
        \centering
        \includegraphics{images/labelImg.png}
        \par
        Figure 4.1: LabelImg in action.\par
\end{figure}

\pagebreak


\underline {\huge APPENDIX B: Code}


\underline {\large 3. Design}


```python
# source: https://www.gymlibrary.ml/content/wrappers/
import gym
from gym.wrappers import RescaleAction

base_env = gym.make("BipedalWalker-v3")
base_env.action_space
# OUTPUT: Box([-1. -1. -1. -1.], [1. 1. 1. 1.], (4,), float32)

wrapped_env = RescaleAction(base_env, min_action=0, max_action=1)
wrapped_env.action_space
# OUTPUT: Box([0. 0. 0. 0.], [1. 1. 1. 1.], (4,), float32)
```
Block 3.1: OpenAI Gym Wrapper


\underline {\large 4. Implementation}


```bash
# First step - clone the GitHub repository of YOLOv5
git clone https://github.com/ultralytics/yolov5
```

```yaml
# Dataset configuration file (doom.yaml)
path: ./demon
train: train
val: val

nc: 18
names: [
  'dog','person','cat','tv','car','meatballs','marinara sauce',
  'tomato soup','chicken noodle soup',
  'french onion soup','chicken breast','ribs','pulled pork',
  'hamburger','cavity','demon_01', 'demon_02', 'demon_03'
]
```

```bash
# Train the small model (yolov5s) on 50 training images and 10 validation images from the command line.
python3 train.py --img 320 --batch 16 --epochs 500 --data doom.yaml \
    --weights yolov5s.pt
```
Block 4.1: Training a YOLOv5 model for object detection.


```python
# 1. import necessary libraries
import cv2
import numpy as np
from gym import Env
from gym.spaces import Discrete, Box
from stable_baselines3 import PPO
from vizdoom import DoomGame

# 2. class for creating instance of training environment
class VizDoomGym(Env):
    def __init__(self, render = False):
        super().__init__()

        self.game = DoomGame()
        self.action_space_dimensions = 3
        self.game.load_config("./config/basic.cfg") # load basic map
        self.game.set_window_visible(render)
        self.game.init()

        # observations are the resized grayscale frames returned by grayscale()
        self.observation_space = Box(
            low = 0, high = 255, shape = (100, 160, 1), dtype = np.uint8
        )
        self.action_space = Discrete(self.action_space_dimensions)

    def step(self, action):
        # one-hot encode the chosen action and repeat it for 4 frames
        actions = np.identity(self.action_space_dimensions, dtype = np.uint8)
        reward = self.game.make_action(actions[action], 4)

        game_state = self.game.get_state()
        if game_state:
            state = self.grayscale(game_state.screen_buffer)
            info = { "info": game_state.game_variables[0] }
        else:
            # get_state() returns None once the episode has finished
            state = np.zeros(self.observation_space.shape, dtype = np.uint8)
            info = { "info": 0 }

        done = self.game.is_episode_finished()

        return state, reward, done, info

    def close(self):
        self.game.close()

    def grayscale(self, observation):
        # move channels last, convert to grayscale and downscale to 160x100
        gray = cv2.cvtColor(np.moveaxis(observation, 0, -1), cv2.COLOR_BGR2GRAY)
        resize = cv2.resize(gray, (160, 100), interpolation = cv2.INTER_CUBIC)
        state = np.reshape(resize, (100, 160, 1))
        return state

    def reset(self):
        self.game.new_episode()
        state = self.game.get_state().screen_buffer
        return self.grayscale(state)

# create instance of environment
env = VizDoomGym()

model = PPO(
    'CnnPolicy',
    env,
    tensorboard_log = './train/train_basic',
    verbose = 1,
    learning_rate = 0.0001,
    n_steps = 128
)
# train agent
model.learn(total_timesteps = 100000)
```
Block 4.2: Training the PPO algorithm with the ViZDoom environment.

\underline {\huge APPENDIX C: Unsuccessful Attempts}


\underline {\large 4. Implementation}

\underline{\large Haar Cascades}

\normalsize Object Detection using Haar feature-based cascade classifiers is an effective object detection method proposed by Paul Viola and Michael Jones in their paper, "Rapid Object Detection using a Boosted Cascade of Simple Features" in 2001 [13]. 

A Haar Cascades classifier is an algorithm with two stages:

1. training stage
2. detection stage

\underline{Training Stage}

\normalsize A boosted cascade of weak classifiers (e.g. AdaBoost) is trained with a set of _positive samples_ and a set of _negative samples_. Positive samples contain the objects which the classifier is supposed to detect, and negative samples contain everything else which should not be detected.

_Note: training and detection were done with OpenCV 3.4 instead of OpenCV 4.x, as `opencv_createsamples` and `opencv_traincascade` are disabled in OpenCV 4.x._

Generate a list of positive and negative files:

```python
import os

# Write the path of every file in the "positive" directory to positive.txt, one per line.
directory = "positive"
with open("./positive.txt", "w") as t:
    for filename in os.listdir(directory):
        f = os.path.join(directory, filename)

        if os.path.isfile(f):
            t.write(f + "\n")
```

Positive samples which were annotated are created with the `opencv_createsamples` binary included with the OpenCV package.

```sh
opencv_createsamples \
-info positives/positives.txt \
-bg negatives/negatives.txt \
-vec cropped.vec \
-num 1000 -w 64 -h 64
```

Negative samples are created manually from any arbitrary collection of images.

Training is done with the `opencv_traincascade` binary, by running:

```sh
opencv_traincascade -data ./output -vec positive.vec -bg \
./negatives/negatives.txt -numPos 900 -numNeg 500 -numStages 20 \
-featureType HAAR -w 64 -h 64
```

The command above trains a Haar Cascade classifier with the positive _and_ negative images and produces a `.xml` file for use with object detection.

For results, please see Testing\\Haar Cascades.

\pagebreak

\underline{\large Detectron 2}

\normalsize Detectron2 is Facebook AI Research's next generation library that provides state-of-the-art detection and segmentation algorithms. It is the successor of Detectron and maskrcnn-benchmark. It supports a number of computer vision research projects and production applications in Facebook [14].

Detectron 2 is built on PyTorch, while the original, Detectron, was built with Caffe2. Detectron 2 allows users to utilize pre-trained models for inference or train custom models.

Training a custom model can be done with the following steps:

1. generate and annotate a collection of training and validation images of the objects,
2. train the custom model,
3. run real-time detection with the custom model.

50 training and 10 validation images were annotated with [LabelImg](https://github.com/heartexlabs/labelImg) to generate a `.json` file with metadata. 


```python
# Generating metadata
import json
import os

import cv2
from detectron2.data import DatasetCatalog, MetadataCatalog
from detectron2.structures import BoxMode

def get_enemy_dicts(img_dir):
    json_file = os.path.join(img_dir, "instances.json")
    with open(json_file) as f:
        imgs_anns = json.load(f)

    # group the annotations by image
    # (assumes COCO-style "id" / "image_id" keys in instances.json)
    anns_by_image = {}
    for ann in imgs_anns['annotations']:
        anns_by_image.setdefault(ann['image_id'], []).append(ann)

    dataset_dicts = []
    for idx, image in enumerate(imgs_anns['images']):
        filename = os.path.join(img_dir, image["file_name"])
        height, width = cv2.imread(filename).shape[:2]

        # one record per image, holding all of its bounding boxes
        record = {
            "file_name": filename,
            "image_id": idx,
            "height": height,
            "width": width,
            "annotations": [
                {
                    "bbox": ann['bbox'],  # [x, y, width, height]
                    "bbox_mode": BoxMode.XYWH_ABS,
                    "category_id": 0,
                    "iscrowd": 0
                }
                for ann in anns_by_image.get(image['id'], [])
            ]
        }
        dataset_dicts.append(record)
    return dataset_dicts

for d in ["train", "val"]:
    print(("enemy/" + d))
    DatasetCatalog.register("enemy/" + d, lambda d=d: get_enemy_dicts("enemy/" + d))
    MetadataCatalog.get("enemy/" + d).set(thing_classes=["enemy"])
enemy_metadata = MetadataCatalog.get("enemy/train")
```

The images and metadata were then used to train the custom model.

```python
# Train a custom Detectron 2 model
import os

from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(
    "./detectron2/configs/COCO-Detection/retinanet_R_50_FPN_3x.yaml"
)
cfg.DATASETS.TRAIN = ("enemy/train",)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = "detectron2://COCO-Detection/retinanet_R_50_FPN_3x/190397829/model_final_5bd44e.pkl"
cfg.SOLVER.IMS_PER_BATCH = 1
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 1000
# RetinaNet reads its class count from MODEL.RETINANET; the ROI_HEADS
# settings only affect R-CNN style models.
cfg.MODEL.RETINANET.NUM_CLASSES = 1

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```

For results, please see Testing\\Detectron 2.

\pagebreak

## Discard


Intelligent object recognition is a required component of intelligent agents within an environment with visual data. Object detection methods will be used for the following tasks:

1. detecting enemies,
2. detecting power-ups and health items.

The following methods will be evaluated and tested:

1. Haar Cascades - a machine learning based approach where a cascade function is trained from many positive and negative images.
2. Deep Learning - a type of machine learning and artificial intelligence (AI) that imitates the way humans gain certain types of knowledge [19].

\underline {Convolutional Neural Networks}

Convolutional Neural Networks (CNNs) are composed of a large number of interconnected computational nodes (referred to as neurons), which work together in a distributed fashion to collectively learn from the input in order to optimise the final output. In **Reinforcement Learning**, CNNs are combined with a reinforcement learning framework to help an intelligent agent learn a task.