The GitHub repository for this project can be found at https://github.com/seel6470/CSPB-3202-Final-Project

# Super Mario Bros Reflexive Agent
<ol type="I">
<li><a href='#short-overview'>Short Overview</a></li>
<li><a href='#approach'>Approach</a>
  <ol type="A">
    <li><a href='#random-agent'>Random Agent</a></li>
    <li><a href='#heuristic-agent'>Heuristic Agent</a></li>
    <li><a href='#basic-ppo-agent'>Basic PPO Agent</a></li>
    <li><a href='#q-learning-agent'>Q-Learning Agent</a></li>
  </ol>
</li>
<li><a href='#results'>Results</a></li>
<li><a href='#conclusion'>Conclusion</a></li>
</ol>

## Short Overview

*short overview of what your project is about (e.g. you're building /testing certain RL models in certain environments; yes you can test your algorithm in more than 1 environment if your goal is to test an algorithm(s) performances in different settings)*

1. Does it include the clear overview on what the project is about? (4)

2. Does it explain how the environment works and what the game rules are? (4)

For my project, I trained learning agents to play the original Super Mario Bros. game for the NES. I used gym-super-mario-bros, a library created by Christian Kauten that provides an OpenAI Gym environment built on the nes-py emulator (Kauten, 2018). The challenge is to beat as many levels as possible under the rules of the original game, summarized below.

The goal of the game is to avoid enemies and pits to reach the end of each level. One hit and Mario loses a life, starting over from the nearest checkpoint. Power-ups provide Mario an additional hit. The following page from the original game manual outlines the inputs Mario receives for the game:

![image](images/controls.jpg)

Nintendo. (1985). Super Mario Bros. Instruction Manual. Nintendo of America Inc. Retrieved from https://www.nintendo.co.jp/clv/manuals/en/pdf/CLV-P-NAAAE.pdf

The game environment takes these controls and creates the following action lists that can be used within the environment wrapper:

```python
# actions for the simple run right environment
RIGHT_ONLY = [
    ['NOOP'],
    ['right'],
    ['right', 'A'],
    ['right', 'B'],
    ['right', 'A', 'B'],
]


# actions for very simple movement
SIMPLE_MOVEMENT = [
    ['NOOP'],
    ['right'],
    ['right', 'A'],
    ['right', 'B'],
    ['right', 'A', 'B'],
    ['A'],
    ['left'],
]


# actions for more complex movement
COMPLEX_MOVEMENT = [
    ['NOOP'],
    ['right'],
    ['right', 'A'],
    ['right', 'B'],
    ['right', 'A', 'B'],
    ['A'],
    ['left'],
    ['left', 'A'],
    ['left', 'B'],
    ['left', 'A', 'B'],
    ['down'],
    ['up'],
]
```

Each step of the environment also returns an `info` dictionary with the following keys describing the game state:

| Key       | Type | Description                                |
|-----------|------|--------------------------------------------|
| coins     | int  | The number of collected coins              |
| flag_get  | bool | True if Mario reached a flag or ax         |
| life      | int  | The number of lives left, i.e., {3, 2, 1}  |
| score     | int  | The cumulative in-game score               |
| stage     | int  | The current stage, i.e., {1, ..., 4}       |
| status    | str  | Mario's status, i.e., {'small', 'tall', 'fireball'} |
| time      | int  | The time left on the clock                 |
| world     | int  | The current world, i.e., {1, ..., 8}       |
| x_pos     | int  | Mario's x position in the stage (from the left) |
| y_pos     | int  | Mario's y position in the stage (from the bottom) |

Additionally, the environment utilizes the following parameters for the reward function:

v: the difference in agent x values between states

c: the difference in the game clock between frames

d: a death penalty that penalizes the agent for dying in a state
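
As a rough illustration of how these three terms combine, here is a minimal sketch. It assumes, based on my reading of the gym-super-mario-bros README, that the reward is the sum `v + c + d`, that the death penalty is -15, and that the result is clipped to the range [-15, 15]. The function name `mario_reward` and its arguments are hypothetical and are not part of the library.

```python
def mario_reward(x_before, x_after, clock_before, clock_after, died):
    """Sketch of the per-step reward (assumed form: v + c + d, clipped)."""
    v = x_after - x_before          # positive when Mario moves right
    c = clock_after - clock_before  # the clock counts down, so this is <= 0
    d = -15 if died else 0          # assumed death penalty
    # Assumed clipping range of [-15, 15]
    return max(-15, min(15, v + c + d))


print(mario_reward(100, 103, 400, 400, died=False))  # moved right 3 pixels
print(mario_reward(100, 100, 400, 399, died=False))  # stood still for a clock tick
print(mario_reward(100, 100, 400, 400, died=True))   # died
```

Under these assumptions, the agent is rewarded for moving right quickly and punished for standing still or dying, which is why every agent below is judged mainly on its `x_pos`.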

<a id='approach'></a>

## Approach

*explain your environment, your choice of model(s), the methods and purpose of testing and experiments, explain any trouble shooting required.*

3. Does it explain clearly the model(s) of choices, the methods and purpose of tests and experiments? (7)

4. Does it show problem solving procedure- e.g. how the author solved and improved when an algorithm doesn't work well. Note that it's not about debugging or programming/implementation, but about when a correctly implemented algorithm wasn't enough for the problem and the author had to modify/add some features or techniques, or compare with another model, etc. (7)

The initial setup for the environment was a bit tricky because of incompatibilities between the JoypadSpace wrapper used with gym-super-mario-bros and the current version of OpenAI's gym framework, specifically in the `reset` method. Huge thanks to NathanGavenski, who supplied [a workaround](https://github.com/Kautenja/gym-super-mario-bros/issues/128#issuecomment-1954019091) in the gym-super-mario-bros issue tracker (NathanGavenski, 2023).

The following code utilizes this fix along with the suggested boilerplate setup from the gym-super-mario-bros documentation:

```python
import gym
import gym_super_mario_bros
import time
from nes_py.wrappers import JoypadSpace
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from gymnasium.wrappers import StepAPICompatibility, TimeLimit

# Create the Super Mario Bros. environment
env = gym.make('SuperMarioBros-v0')
steps = env._max_episode_steps  # get the original max_episode_steps count

# Set the Joypad wrapper
env = JoypadSpace(env.env, SIMPLE_MOVEMENT)

# Define a new reset function that accepts `seed` and `options` args for compatibility
def gymnasium_reset(self, **kwargs):
    return self.env.reset(), {}

# Overwrite the old reset so it accepts `seed` and `options` args
env.reset = gymnasium_reset.__get__(env, JoypadSpace)

# Set TimeLimit back
env = TimeLimit(StepAPICompatibility(env, output_truncation_bool=True), max_episode_steps=steps)
```


## Random Agent

Let's create an agent that makes random movements just to make sure our environment is working:

```python
done = True
max_x = 0
max_world = 0
max_stage = 0
# Run the environment for 1000 steps
for step in range(1000):
    if done:
        state, info = env.reset()
    state, reward, done, truncated, info = env.step(env.action_space.sample())
    if info['world'] > max_world:
        max_world = info['world']
        max_x = 0
        max_stage = 1
    elif info['stage'] > max_stage:
        max_stage = info['stage']
        max_x = 0
    elif info['x_pos'] > max_x:
        max_x = info['x_pos']
    done = done or truncated
    env.render()
    time.sleep(0.01)  # Add a delay of 0.01 seconds between frames

# Close the environment
env.close()
print(f"Max X: {max_x}")
print(f"Max World: {max_world}")
print(f"Max Stage: {max_stage}")
```

```
Max X: 596
Max World: 1
Max Stage: 1
```


<video width="320" height="240" controls>
  <source src="./images/random_agent.mp4" type="video/mp4">
</video>

You can see that the random agent never gets past the second pipe, which sits at an X value of about 595. Clearing it requires holding A for many consecutive steps to sustain a full-height jump, and uniformly random inputs are very unlikely to produce such a streak. Let's see if there are any other agents that can get farther.
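
To put a rough number on that intuition, the sketch below estimates the chance that uniformly random sampling from `SIMPLE_MOVEMENT` picks a jump action for a given run of consecutive steps. The value of `HOLD_STEPS` is a hypothetical placeholder; I have not measured how many steps the jump button actually needs to be held to clear the pipe. The action list is copied inline so the example runs without the emulator.

```python
# Copied from gym_super_mario_bros.actions.SIMPLE_MOVEMENT
SIMPLE_MOVEMENT = [
    ['NOOP'],
    ['right'],
    ['right', 'A'],
    ['right', 'B'],
    ['right', 'A', 'B'],
    ['A'],
    ['left'],
]

HOLD_STEPS = 20  # hypothetical number of consecutive steps A must be held

# Actions that press the jump button
jump_actions = [a for a in SIMPLE_MOVEMENT if 'A' in a]
p_single = len(jump_actions) / len(SIMPLE_MOVEMENT)

# Probability that one specific window of HOLD_STEPS steps is all jump actions
p_streak = p_single ** HOLD_STEPS

print(f"P(jump action on one step) = {p_single:.3f}")
print(f"P(jump held for {HOLD_STEPS} steps) = {p_streak:.2e}")
```

Even though three of the seven actions include A, the chance of an unbroken streak shrinks geometrically with its length, which is consistent with the random agent stalling at the pipe.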

## Heuristic Agent

To get a baseline, I decided to implement a basic heuristic model that uses a simple algorithm to try to beat a level of Super Mario Bros.

```python
# Create the Super Mario Bros. environment
env = gym.make('SuperMarioBros-v0')

# Set the Joypad wrapper
env = JoypadSpace(env.env, SIMPLE_MOVEMENT)

# Define a new reset function that accepts `seed` and `options` args for compatibility
def gymnasium_reset(self, **kwargs):
    return self.env.reset(), {}

# Overwrite the old reset so it accepts `seed` and `options` args
env.reset = gymnasium_reset.__get__(env, JoypadSpace)

# Set TimeLimit back (`steps` was captured in the setup cell above)
env = TimeLimit(StepAPICompatibility(env, output_truncation_bool=True), max_episode_steps=steps)

# State used to choose the next input
done = True
going_up = False
prev_y = None

max_x = 0
max_world = 0
max_stage = 0

for step in range(1700):
    if done:
        state, info = env.reset()
        prev_y = None

    # If Mario is on flat ground or still rising from a previous jump,
    # keep holding A to perform the maximum jump; otherwise just run right
    action = SIMPLE_MOVEMENT.index(['right', 'A', 'B']) if going_up else SIMPLE_MOVEMENT.index(['right', 'B'])
    state, reward, done, truncated, info = env.step(action)

    # Set going_up to True if Mario is not descending
    if prev_y is not None:
        going_up = info['y_pos'] >= prev_y

    # Capture the current y position to compare against in the next step
    prev_y = info['y_pos']

    if info['world'] > max_world:
        max_world = info['world']
        max_x = 0
        max_stage = 1
    elif info['stage'] > max_stage:
        max_stage = info['stage']
        max_x = 0
    elif info['x_pos'] > max_x:
        max_x = info['x_pos']

    done = done or truncated
    env.render()
    time.sleep(0.01)  # Add a delay of 0.01 seconds between frames

# Close the environment
env.close()
print(f"Max X: {max_x}")
print(f"Max World: {max_world}")
print(f"Max Stage: {max_stage}")
```

```
Max X: 320
Max World: 1
Max Stage: 2
```


<video width="320" height="240" controls>
  <source src="./images/heuristic_agent.mp4" type="video/mp4">
</video>

This heuristic is not an actual learning model, but it is surprisingly effective. The strategy is similar to what a new player might try on the first level: jump as high and as often as possible to clear most obstacles and enemies. Whether it avoids any particular enemy or pit comes down to luck, and it does die along the way. Even so, this agent is able to clear the first level, although it does not get very far in the second.

Let's try to create a learning model and see if the agent can beat the level without relying on luck.

## Basic PPO Agent


Let's build an agent that actually learns, rather than one that jumps blindly. We'll start with a basic Proximal Policy Optimization (PPO) agent. PPO keeps training stable by limiting how far each update can move the policy.

```python
import gym
import gym_super_mario_bros
from nes_py.wrappers import JoypadSpace
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from gym.wrappers import StepAPICompatibility
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.callbacks import EvalCallback, CheckpointCallback

# Create the Super Mario Bros. environment
env = gym.make('SuperMarioBros-v0')
steps = env._max_episode_steps

# Set the Joypad wrapper
env = JoypadSpace(env.env, SIMPLE_MOVEMENT)
# Overwrite the old reset so it accepts `seed` and `options` args
# (gymnasium_reset is defined in the setup cell above)
env.reset = gymnasium_reset.__get__(env, JoypadSpace)
env = StepAPICompatibility(env, output_truncation_bool=True)

model = PPO(
    'CnnPolicy',         # use a convolutional neural network policy
    env,                 # environment
    verbose=1,           # print diagnostics
    learning_rate=1e-4,  # step size for each gradient update
    n_steps=128,         # environment steps collected before each update
    batch_size=64,       # number of samples per gradient update
    n_epochs=4,          # optimization passes over each batch of collected steps
    clip_range=0.2,      # limits the size of policy updates for stable training
    ent_coef=0.03        # entropy bonus to encourage exploration
)

# Define evaluation and checkpoint callbacks
# (defined here but not passed to learn() in this short run)
eval_callback = EvalCallback(env, best_model_save_path='./logs/', log_path='./logs/', eval_freq=500, deterministic=True, render=False)
checkpoint_callback = CheckpointCallback(save_freq=1000, save_path='./models/', name_prefix='ppo_mario')

model.learn(total_timesteps=1500)

# model = PPO.load("ppo_mario.zip")

model.save("./trained_agents/ppo_mario")

# Evaluate the agent
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"Mean reward: {mean_reward} ± {std_reward}")
```

```
Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Wrapping the env in a VecTransposeImage.
----------------------------
| time/              |     |
|    fps             | 103 |
|    iterations      | 1   |
|    time_elapsed    | 1   |
|    total_timesteps | 128 |
----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 27          |
|    iterations           | 2           |
|    time_elapsed         | 9           |
|    total_timesteps      | 256         |
| train/                  |             |
|    approx_kl            | 0.033081587 |
|    clip_fraction        | 0.223       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.93       |
|    explained_variance   | -0.00124    |
|    learning_rate        | 0.0001      |
|    loss                 | 192         |
|    n_updates            | 4           |
|    policy_gradient_loss | -0.00565    |
|    value_loss           | 425         |
-----------------------------------------
```
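
The `clip_range=0.2` argument passed to `PPO` above is what limits drastic policy changes. As a minimal sketch of the idea (not the Stable-Baselines3 implementation), the standard PPO clipped surrogate objective can be written in NumPy as follows. The function name `clipped_surrogate` is hypothetical; `ratio` stands for the new policy's probability of an action divided by the old policy's probability, and `advantage` for the estimated advantage of that action.

```python
import numpy as np


def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO clipped objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantage
    # Taking the minimum removes any incentive to push the ratio outside the clip range
    return np.minimum(unclipped, clipped)


# A good action (positive advantage) whose probability already rose by 50%:
# the objective is capped as if the ratio were only 1.2
print(clipped_surrogate([1.5], [1.0]))

# A bad action (negative advantage) whose probability already fell by 50%:
# the objective is capped as if the ratio were only 0.8
print(clipped_surrogate([0.5], [-1.0]))
```

Once the ratio leaves the interval [0.8, 1.2] in the direction the advantage favors, the objective stops improving, so a single batch cannot pull the policy arbitrarily far from the one that collected the data.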

```python
import gym
import gym_super_mario_bros
from nes_py.wrappers import JoypadSpace
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from gym.wrappers import StepAPICompatibility, TimeLimit

# Create the Super Mario Bros. environment
env = gym.make('SuperMarioBros-v0')
steps = env._max_episode_steps  # get the original max_episode_steps count

# Set the Joypad wrapper
env = JoypadSpace(env.env, SIMPLE_MOVEMENT)
# Overwrite the old reset so it accepts `seed` and `options` args
env.reset = gymnasium_reset.__get__(env, JoypadSpace)

# Set TimeLimit back
env = TimeLimit(StepAPICompatibility(env, output_truncation_bool=True), max_episode_steps=steps)

# model = PPO.load("ppo_mario.zip")

max_x = 0
max_world = 0
max_stage = 0

obs, info = env.reset()
for step in range(1500):
    # Ask the trained policy (from the previous cell) for an action
    action, _states = model.predict(obs.copy())
    action = action.item()
    obs, reward, done, truncated, info = env.step(action)
    if info['world'] > max_world:
        max_world = info['world']
        max_x = 0
        max_stage = 1
    elif info['stage'] > max_stage:
        max_stage = info['stage']
        max_x = 0
    elif info['x_pos'] > max_x:
        max_x = info['x_pos']
    env.render()
    # Start a new episode on termination or when the time limit is hit
    if done or truncated:
        obs, info = env.reset()

# Close the environment
env.close()
print(f"Max X: {max_x}")
print(f"Max World: {max_world}")
print(f"Max Stage: {max_stage}")
```

```
Max X: 595
Max World: 1
Max Stage: 1
```


<video width="320" height="240" controls>
  <source src="./images/PPO_agent.mp4" type="video/mp4">
</video>

Unfortunately, this PPO agent is unable to clear the pipe at the 595 X value. It fares about as well as the random agent and worse than the heuristic agent, which is not surprising after only 1,500 training timesteps. Given more timesteps, the agent could likely learn to get farther. Instead, let us try to create a better reflexive learning agent using the Q-learning algorithm.

## Q-Learning Agent

Now, let us create our own Q-Learning agent. This is the meat of this project, so strap in. A huge thank you to Sourish Kundu, who made a video on how to do exactly this with some great explanations (Kundu, 2023). If you have a chance, I highly recommend giving it a watch, even if you are not working with this gym environment, as it explains many of the general concepts we have been discussing in class.

https://www.youtube.com/watch?v=_gmQZToTMac

We can begin by creating our own environment wrapper to skip four frames. Additionally, we will reduce the frame resolution, convert the frames to grayscale, and stack four frames. This will assist in reducing the complexity of the environment.

```python
from gym import Wrapper
from gym.wrappers import GrayScaleObservation, ResizeObservation, FrameStack

class SkipFrame(Wrapper):
    def __init__(self, env, skip):
        super().__init__(env)
        self.skip = skip

    def step(self, action):
        # repeat the same action for `skip` frames and sum the rewards
        reward_accum = 0.0
        done = False
        for _ in range(self.skip):
            next_state, reward, done, info = self.env.step(action)
            # add reward to accumulator
            reward_accum += reward
            if done:
                break
        return next_state, reward_accum, done, info

def apply_wrappers(env):
    env = SkipFrame(env, skip=4)  # repeat each action for four frames
    env = ResizeObservation(env, shape=84)  # reduce size of frame image
    env = GrayScaleObservation(env)  # convert frames to grayscale
    env = FrameStack(env, num_stack=4, lz4_compress=True)  # stack the last four observations
    return env
```
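
To make the effect of this preprocessing concrete, here is a small NumPy-only sketch of what happens to the observation shape. It does not use the gym wrappers themselves: the helper `preprocess` is hypothetical, it averages the color channels rather than using whatever grayscale weighting the library applies, and it downsamples by picking evenly spaced pixels. The raw frame size of 240x256x3 is the NES screen size produced by nes-py.

```python
from collections import deque

import numpy as np


def preprocess(frame, size=84):
    """Grayscale and shrink one RGB frame to (size, size). Illustrative only."""
    gray = frame.mean(axis=2)  # simple channel average
    rows = np.linspace(0, gray.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, size).astype(int)
    return gray[np.ix_(rows, cols)].astype(np.uint8)


# Keep only the four most recent processed frames, like a frame stack
frames = deque(maxlen=4)
for i in range(6):
    fake_frame = np.full((240, 256, 3), i, dtype=np.uint8)  # stand-in for an emulator frame
    frames.append(preprocess(fake_frame))

stacked = np.stack(frames)
print(stacked.shape)  # (4, 84, 84)
```

Each raw frame holds 240 * 256 * 3 = 184,320 values, while the stacked observation holds 4 * 84 * 84 = 28,224, and the stack still lets the network infer motion from the differences between frames.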

Next, we will create our own neural network using convolutional layers and linear layers. We are using Conv2d layers with ReLU activation functions, as these are typically appropriate for images (as I learned from HW5). The MaxPool2d layers reduce the spatial dimensions and lower the computational cost, which will hopefully lower the processing time for our algorithm.

Additional methods compute the input size of the first linear layer dynamically (`_get_conv_out`), prevent PyTorch from updating the parameters of a frozen network (`_freeze`), and define how each input tensor moves through the network on the forward pass (`forward`).

```python
import torch
from torch import nn
import numpy as np

class AgentNN(nn.Module):
    def __init__(self, input_shape, n_actions, freeze=False):
        super().__init__()
        # Convolutional layers
        self.conv_layers = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, kernel_size=5, stride=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(32, 64, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

        # use the helper method to get the input size for the first linear layer
        conv_out_size = self._get_conv_out(input_shape)

        # Full network: convolutional layers followed by fully connected linear layers
        self.network = nn.Sequential(
            self.conv_layers,
            nn.Flatten(),
            nn.Linear(conv_out_size, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions)  # one Q-value per action
        )

        # call the freeze method if frozen
        # to make sure no parameters are updated
        if freeze:
            self._freeze()

        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'  # try to use the GPU if possible
        self.to(self.device)

    # method to handle the forward pass
    def forward(self, x):
        # pass the input tensor through the neural network layers
        return self.network(x)

    # get the number of features produced by the convolutional layers
    def _get_conv_out(self, shape):
        o = self.conv_layers(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    # method to make sure gradients are not calculated if frozen
    def _freeze(self):
        for p in self.network.parameters():
            p.requires_grad = False
```
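
The value that `_get_conv_out` finds by pushing a dummy tensor through the convolutional layers can also be worked out by hand. The sketch below assumes an 84x84 input (the size set by `ResizeObservation`) and the PyTorch defaults of no padding and no dilation, where each layer maps a side length `n` to `floor((n - kernel) / stride) + 1`. The helper names are hypothetical.

```python
def conv_out(size, kernel, stride):
    """Output side length of a conv or pooling layer with no padding."""
    return (size - kernel) // stride + 1


def flattened_features(side=84, out_channels=64):
    side = conv_out(side, kernel=5, stride=2)  # Conv2d(k=5, s=2):    84 -> 40
    side = conv_out(side, kernel=2, stride=2)  # MaxPool2d(k=2, s=2): 40 -> 20
    side = conv_out(side, kernel=3, stride=1)  # Conv2d(k=3, s=1):    20 -> 18
    side = conv_out(side, kernel=2, stride=2)  # MaxPool2d(k=2, s=2): 18 -> 9
    return out_channels * side * side


print(flattened_features())  # 64 * 9 * 9 = 5184
```

Computing the size dynamically, as the network class does, has the advantage that changing the input resolution or a kernel size does not require redoing this arithmetic.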

Next, we will create our agent class to use the neural network created previously.

```python
import torch
import numpy as np
from tensordict import TensorDict  # dictionary-like container of tensors
from torchrl.data import TensorDictReplayBuffer, LazyMemmapStorage
import psutil
import os

class Agent:
    def __init__(self,
                 input_dims,
                 num_actions,
                 lr=0.00025,
                 gamma=0.9,
                 epsilon=1.0,
                 eps_decay=0.99999975,
                 eps_min=0.1,
                 replay_buffer_capacity=75000,
                 batch_size=32,
                 sync_network_rate=10000
                 ):

        self.num_actions = num_actions  # size of the action list in use (RIGHT_ONLY has 5, SIMPLE_MOVEMENT has 7)
        self.learn_step_counter = 0

        # Hyperparameters
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon
        self.eps_decay = eps_decay
        self.eps_min = eps_min
        self.batch_size = batch_size
        self.sync_network_rate = sync_network_rate

        # Networks
        self.online_network = AgentNN(input_dims, num_actions)
        self.target_network = AgentNN(input_dims, num_actions, freeze=True)

        # Optimizer and loss
        self.optimizer = torch.optim.Adam(self.online_network.parameters(), lr=self.lr)
        self.loss = torch.nn.MSELoss()  # loss function

        # Replay buffer
        storage = LazyMemmapStorage(replay_buffer_capacity)
        self.replay_buffer = TensorDictReplayBuffer(storage=storage)
        self.log_memory_usage()

    def log_memory_usage(self):
        process = psutil.Process(os.getpid())
        mem_info = process.memory_info()
        print(f"Memory Usage: {mem_info.rss / 1024 ** 2:.2f} MB")

    def choose_action(self, observation):
        # with probability epsilon, choose a random action
        # this randomness encourages exploration
        if np.random.random() < self.epsilon:
            return np.random.randint(self.num_actions)

        observation = (
            torch.tensor(np.array(observation), dtype=torch.float32)  # convert the stacked frames to a float tensor
            .unsqueeze(0)  # add a batch dimension as the first index of the tensor
            .to(self.online_network.device)  # move to the correct device (GPU or CPU)
        )
        # return the action with the highest Q-value
        return self.online_network(observation).argmax().item()

    # shrink epsilon so the agent explores less as training progresses
    def decay_epsilon(self):
        self.epsilon = max(self.epsilon * self.eps_decay, self.eps_min)

    # put tensors in a dict and add to buffer
    def store_in_memory(self, state, action, reward, next_state, done):
        # Create TensorDict with correct shapes and types
        data = TensorDict({
            "state": torch.tensor(np.array(state), dtype=torch.float32),
            "action": torch.tensor(action),
            "reward": torch.tensor(reward),
            "next_state": torch.tensor(np.array(next_state), dtype=torch.float32),
            "done": torch.tensor(done)
        }, batch_size=[])
        self.replay_buffer.add(data)

    # copy weights of online network to target network if enough steps have passed
    def sync_networks(self):
        if self.learn_step_counter % self.sync_network_rate == 0 and self.learn_step_counter > 0:
            self.target_network.load_state_dict(self.online_network.state_dict())

    # save current model (in case something goes wrong)
    def save_model(self, path):
        torch.save(self.online_network.state_dict(), path)

    # load model
    def load_model(self, path):
        self.online_network.load_state_dict(torch.load(path))
        self.target_network.load_state_dict(torch.load(path))

    def learn(self):
        # if not enough experiences, return and keep going
        if len(self.replay_buffer) < self.batch_size:
            return

        # copy weights to target network
        self.sync_networks()

        # clear gradients
        self.optimizer.zero_grad()

        # sample the replay buffer and store the results
        samples = self.replay_buffer.sample(self.batch_size).to(self.online_network.device)
        states = samples['state']
        actions = samples['action']
        rewards = samples['reward']
        next_states = samples['next_state']
        dones = samples['done']

        # get the predicted Q-values for the actions that were actually taken
        predicted_q_values = self.online_network(states)
        predicted_q_values = predicted_q_values[np.arange(self.batch_size), actions.squeeze()]

        # max returns two tensors: the maximum values and the indices of the maximum values
        target_q_values = self.target_network(next_states).max(dim=1)[0]
        # The rewards of any future states don't matter if the current state is a terminal state
        # If done is true, then 1 - done is 0, so the part after the plus sign (representing the future rewards) is 0
        target_q_values = rewards + self.gamma * target_q_values * (1 - dones.float())

        loss = self.loss(predicted_q_values, target_q_values)
        loss.backward()
        self.optimizer.step()

        self.learn_step_counter += 1
        self.decay_epsilon()
```
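
The target computed in `learn` is the usual Q-learning target, `reward + gamma * max Q(next_state) * (1 - done)`. Here is a small NumPy sketch of that one line on a hand-made batch so the numbers can be checked by eye. The function name `td_targets` and the sample values are hypothetical; `gamma=0.9` matches the default used by the agent.

```python
import numpy as np


def td_targets(rewards, next_q_values, dones, gamma=0.9):
    """Q-learning targets for a batch; next_q_values has shape (batch, n_actions)."""
    rewards = np.asarray(rewards, dtype=float)
    next_q = np.asarray(next_q_values, dtype=float)
    dones = np.asarray(dones, dtype=float)
    # Terminal transitions (done = 1) keep only the immediate reward
    return rewards + gamma * next_q.max(axis=1) * (1.0 - dones)


targets = td_targets(
    rewards=[1.0, -15.0],
    next_q_values=[[0.5, 2.0], [3.0, 4.0]],
    dones=[False, True],
)
print(targets)  # first entry: 1 + 0.9 * 2 = 2.8, second entry: -15 (episode ended)
```

The second transition ends the episode, so the next-state values of 3.0 and 4.0 are ignored and the target is just the death penalty.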

We now have everything we need to create the environment and run our reinforcement learning architecture.

```python
import gym_super_mario_bros
from gym_super_mario_bros.actions import RIGHT_ONLY
from nes_py.wrappers import JoypadSpace

ENV_NAME = 'SuperMarioBros-1-1-v0'
NUM_OF_EPISODES = 5

env = gym_super_mario_bros.make(ENV_NAME)
env = JoypadSpace(env, RIGHT_ONLY)

env = apply_wrappers(env)

agent = Agent(input_dims=env.observation_space.shape, num_actions=env.action_space.n)

# print a simple text progress bar that overwrites itself
def print_progress(cur, end, bar_length=40):
    progress = cur / end
    block = int(round(bar_length * progress))
    text = f"\rCurrently processing episode {cur}/{end} [{'#' * block + '-' * (bar_length - block)}] {progress * 100:.2f}%"
    print(text, end='', flush=True)

for i in range(NUM_OF_EPISODES):
    print_progress(i + 1, NUM_OF_EPISODES)
    done = False
    state = env.reset()
    while not done:
        # choose an action, take it, store the transition, and learn from a sampled batch
        a = agent.choose_action(state)
        new_state, reward, done, info = env.step(a)

        agent.store_in_memory(state, a, reward, new_state, done)
        agent.learn()

        state = new_state

env.close()
```

```
Memory Usage: 4956.05 MB
Currently processing episode 5/5 [########################################] 100.00%
```
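
It is worth checking how far epsilon can actually decay in a run this short. The sketch below uses the agent's default values (`epsilon=1.0`, `eps_decay=0.99999975`, `eps_min=0.1`) and the fact that epsilon is multiplied by `eps_decay` once per learning step. The helper names and the 10,000-step figure are hypothetical and only there for illustration.

```python
import math


def steps_to_reach_min(eps_start=1.0, eps_min=0.1, eps_decay=0.99999975):
    """Number of learning steps before epsilon hits its floor."""
    return math.ceil(math.log(eps_min / eps_start) / math.log(eps_decay))


def epsilon_after(steps, eps_start=1.0, eps_min=0.1, eps_decay=0.99999975):
    """Epsilon after a given number of learning steps."""
    return max(eps_start * eps_decay ** steps, eps_min)


print(f"Steps until epsilon reaches its minimum: {steps_to_reach_min():,}")
print(f"Epsilon after 10,000 learning steps: {epsilon_after(10_000):.4f}")
```

With these defaults epsilon needs roughly nine million learning steps to reach its floor, so after only a handful of episodes the agent is still choosing almost every action at random. Meaningful Q-learning results would require far more episodes or a faster decay rate.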

```python
import time
import gym_super_mario_bros
from gym_super_mario_bros.actions import RIGHT_ONLY
from nes_py.wrappers import JoypadSpace

ENV_NAME = 'SuperMarioBros-1-1-v0'

env = gym_super_mario_bros.make(ENV_NAME)
env = JoypadSpace(env, RIGHT_ONLY)

env = apply_wrappers(env)

# Note: this constructs a new Agent, so the weights trained in the previous cell
# are not reused and epsilon starts at 1.0 (every action is chosen at random)
agent = Agent(input_dims=env.observation_space.shape, num_actions=env.action_space.n)

# Note: max_x, max_world, and max_stage are not reset in this cell,
# so they carry over from whichever cell set them last
state = env.reset()
for step in range(1500):
    action = agent.choose_action(state)
    state, reward, done, info = env.step(action)
    if info['world'] > max_world:
        max_world = info['world']
        max_x = 0
        max_stage = 1
    elif info['stage'] > max_stage:
        max_stage = info['stage']
        max_x = 0
    elif info['x_pos'] > max_x:
        max_x = info['x_pos']
    env.render()
    time.sleep(0.02)
    if done:
        state = env.reset()

print(f"Max X: {max_x}")
print(f"Max World: {max_world}")
print(f"Max Stage: {max_stage}")

env.close()
```

```
Memory Usage: 5874.73 MB
Max X: 898
Max World: 1
Max Stage: 2
```


## Results


*show the result and interpretation of your experiment. Any iterative improvements summary.*

5. Does it include the results summary, interpretation of experiments and visualization (e.g. performance comparison table, graphs etc)? (7)


| Name of Agent    | Max World | Max Stage | Max X Position |
|------------------|-----------|-----------|----------------|
| Random Agent     | 1         | 1         | 596            |
| Heuristic Agent  | 1         | 2         | 320            |
| Basic PPO Agent  | 1         | 1         | 595            |
| Q Learning Agent | ???       | ???       | ???            |


## Conclusion


*Conclusion, discussion, reflection, or suggestions for future improvements or future ideas.*

6. Does it include discussion (what went well or not and why), and suggestions for improvements or future work? (5)

## References

*Reference: Please include all relevant links (git, video, etc)*

7. Does it include all deliverables (3)
	- git with codes or notebooks
	- writeup (you can consider notebook as a writeup if the notebook contains all needed contents and explanation)
	- demo clips
	- proper quote or reference
    
Kauten, C. (2018). Super Mario Bros for OpenAI Gym. GitHub. Retrieved from https://github.com/Kautenja/gym-super-mario-bros

Nintendo. (1985). Super Mario Bros. Instruction Manual. Nintendo of America Inc. Retrieved from https://www.nintendo.co.jp/clv/manuals/en/pdf/CLV-P-NAAAE.pdf

NathanGavenski. (2023). Comment on issue #128 in Kautenja/gym-super-mario-bros repository. GitHub. Retrieved from https://github.com/Kautenja/gym-super-mario-bros/issues/128#issuecomment-1954019091

Kundu, S. (2023, October 2). Train AI to beat Super Mario Bros! || Reinforcement learning completely from scratch [Video]. YouTube. https://www.youtube.com/watch?v=_gmQZToTMac

