The GitHub repository for this project is at https://github.com/seel6470/CSPB-3202-Final-Project

## Video Clip of Finished Project

### Short Overview

*short overview of what your project is about (e.g. you're building /testing certain RL models in certain environments; yes you can test your algorithm in more than 1 environment if your goal is to test an algorithm(s) performances in different settings)*

1. Does it include the clear overview on what the project is about? (4)

2. Does it explain how the environment works and what the game rules are? (4)

For my project, I chose to train a learning model to play the original Super Mario Bros. game for the NES. I used gym-super-mario-bros, a library created by Christian Kauten that provides an OpenAI Gym environment built on the nes-py emulator (Kauten, 2018). The challenge is to beat as many levels as possible under the game rules described below.

The goal of the game is to reach the end of each level while avoiding enemies and pits. A single hit costs Mario a life and sends him back to the nearest checkpoint; a power-up lets him absorb one additional hit. The following page from the original game manual shows the controller inputs:

![image](images/controls.jpg)

Nintendo. (1985). Super Mario Bros. Instruction Manual. Nintendo of America Inc. Retrieved from https://www.nintendo.co.jp/clv/manuals/en/pdf/CLV-P-NAAAE.pdf

The library combines these controls into the following action lists, which can be passed to the `JoypadSpace` wrapper around the environment:

```python
# actions for the simple run right environment
RIGHT_ONLY = [
    ['NOOP'],
    ['right'],
    ['right', 'A'],
    ['right', 'B'],
    ['right', 'A', 'B'],
]


# actions for very simple movement
SIMPLE_MOVEMENT = [
    ['NOOP'],
    ['right'],
    ['right', 'A'],
    ['right', 'B'],
    ['right', 'A', 'B'],
    ['A'],
    ['left'],
]


# actions for more complex movement
COMPLEX_MOVEMENT = [
    ['NOOP'],
    ['right'],
    ['right', 'A'],
    ['right', 'B'],
    ['right', 'A', 'B'],
    ['A'],
    ['left'],
    ['left', 'A'],
    ['left', 'B'],
    ['left', 'A', 'B'],
    ['down'],
    ['up'],
]
```
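
Each entry in these lists is a set of buttons pressed at the same time, and the position of an entry in the list is the discrete action number the agent chooses. The sketch below illustrates that idea without the library by packing each button combination into a single byte. The bit values in `BUTTON_BITS` and the helper `encode_buttons` are assumptions made for illustration; the real mapping lives inside nes-py.

```python
# Assumed one-bit-per-button layout for an 8-button NES controller
# (illustrative values; check nes-py for the real mapping).
BUTTON_BITS = {
    'NOOP': 0b00000000,
    'A': 0b00000001,
    'B': 0b00000010,
    'select': 0b00000100,
    'start': 0b00001000,
    'up': 0b00010000,
    'down': 0b00100000,
    'left': 0b01000000,
    'right': 0b10000000,
}

SIMPLE_MOVEMENT = [
    ['NOOP'],
    ['right'],
    ['right', 'A'],
    ['right', 'B'],
    ['right', 'A', 'B'],
    ['A'],
    ['left'],
]


def encode_buttons(buttons):
    """OR together the bits of every button in one action entry."""
    byte = 0
    for button in buttons:
        byte |= BUTTON_BITS[button]
    return byte


# One byte per discrete action, in the same order as the action list
ACTION_BYTES = [encode_buttons(buttons) for buttons in SIMPLE_MOVEMENT]

jump_right = SIMPLE_MOVEMENT.index(['right', 'A'])
print(jump_right, format(ACTION_BYTES[jump_right], '08b'))
```

Under this assumed layout, the agent only ever outputs a small integer, and the wrapper is responsible for turning it into the button state sent to the emulator.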

The environment also reports the following keys in the `info` dictionary returned by each call to `step`:

| Key       | Type | Description                                |
|-----------|------|--------------------------------------------|
| coins     | int  | The number of collected coins              |
| flag_get  | bool | True if Mario reached a flag or ax         |
| life      | int  | The number of lives left, i.e., {3, 2, 1}  |
| score     | int  | The cumulative in-game score               |
| stage     | int  | The current stage, i.e., {1, ..., 4}       |
| status    | str  | Mario's status, i.e., {'small', 'tall', 'fireball'} |
| time      | int  | The time left on the clock                 |
| world     | int  | The current world, i.e., {1, ..., 8}       |
| x_pos     | int  | Mario's x position in the stage (from the left) |
| y_pos     | int  | Mario's y position in the stage (from the bottom) |
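
As an illustration of how these keys might be used, the sketch below turns one `info` dictionary into a progress line. The `describe_progress` helper and the sample values are hypothetical; only the key names come from the table above.

```python
def describe_progress(info):
    """Format a one-line summary from the keys listed in the table."""
    level = f"{info['world']}-{info['stage']}"
    outcome = "cleared" if info['flag_get'] else "in progress"
    return (
        f"World {level} ({outcome}): x={info['x_pos']}, y={info['y_pos']}, "
        f"time={info['time']}, lives={info['life']}, status={info['status']}"
    )


# Hand-written sample values, not real emulator output
sample_info = {
    'coins': 0, 'flag_get': False, 'life': 2, 'score': 0, 'stage': 1,
    'status': 'small', 'time': 400, 'world': 1, 'x_pos': 40, 'y_pos': 79,
}

print(describe_progress(sample_info))
```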

The environment's reward function is built from the following terms:

- `v`: the difference in agent x values between states
- `c`: the difference in the game clock between frames
- `d`: a death penalty that penalizes the agent for dying in a state
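
As I understand the gym-super-mario-bros documentation, these three terms are summed and the result is clipped to the range [-15, 15], with a death penalty of -15. The sketch below restates the reward under those assumptions; `shaped_reward` and its arguments are hypothetical names, not part of the library.

```python
DEATH_PENALTY = -15        # assumed value of d when Mario dies
REWARD_RANGE = (-15, 15)   # assumed clipping range


def shaped_reward(x_before, x_after, clock_before, clock_after, died):
    """Combine velocity, clock, and death terms into one clipped reward."""
    v = x_after - x_before           # positive when Mario moves right
    c = clock_after - clock_before   # zero or negative: the clock only counts down
    d = DEATH_PENALTY if died else 0
    low, high = REWARD_RANGE
    return max(low, min(high, v + c + d))


# Moving 3 pixels right with no clock tick
print(shaped_reward(40, 43, 400, 400, died=False))
# Standing still while the clock ticks down by one
print(shaped_reward(40, 40, 400, 399, died=False))
```

Read this way, the agent is paid for moving right, charged a small amount for every clock tick, and charged heavily for dying.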

### Approach

*explain your environment, your choice of model(s), the methods and purpose of testing and experiments, explain any trouble shooting required.*

3. Does it explain clearly the model(s) of choices, the methods and purpose of tests and experiments? (7)

4. Does it show problem solving procedure- e.g. how the author solved and improved when an algorithm doesn't work well. Note that it's not about debugging or programming/implementation, but about when a correctly implemented algorithm wasn't enough for the problem and the author had to modify/add some features or techniques, or compare with another model, etc. (7)

The initial setup for the environment was a bit tricky due to incompatibilities between the `JoypadSpace` wrapper used with gym-super-mario-bros and the current version of OpenAI's gym framework, specifically in the `reset` method. Huge thanks to NathanGavenski, who supplied [a workaround](https://github.com/Kautenja/gym-super-mario-bros/issues/128#issuecomment-1954019091) in the gym-super-mario-bros issue tracker on GitHub (NathanGavenski, 2023).

The following code utilizes this fix along with the suggested boilerplate setup from the gym-super-mario-bros documentation:

```python
import gym
import gym_super_mario_bros
import time
from nes_py.wrappers import JoypadSpace
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from gymnasium.wrappers import StepAPICompatibility, TimeLimit

# Create the Super Mario Bros. environment
env = gym.make('SuperMarioBros-v0')
steps = env._max_episode_steps  # get the original max_episode_steps count

# Set the Joypad wrapper
env = JoypadSpace(env.env, SIMPLE_MOVEMENT)

# Define a new reset function that accepts (and ignores) the `seed` and
# `options` keyword arguments and returns an (observation, info) pair
def gymnasium_reset(self, **kwargs):
    return self.env.reset(), {}

# Bind the new reset to this JoypadSpace instance, replacing the old one
env.reset = gymnasium_reset.__get__(env, JoypadSpace)

# Set TimeLimit back
env = TimeLimit(StepAPICompatibility(env, output_truncation_bool=True), max_episode_steps=steps)
```
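
The `gymnasium_reset.__get__(env, JoypadSpace)` call relies on Python's descriptor protocol to turn a plain function into a method bound to one specific object, so that calling `env.reset(...)` passes `env` as `self`. The standalone sketch below shows the same technique on a stand-in class; `LegacyEnv` and `new_style_reset` are hypothetical names used only for illustration.

```python
class LegacyEnv:
    """Stand-in for an environment whose reset() takes no keyword arguments."""

    def reset(self):
        return "first observation"


def new_style_reset(self, **kwargs):
    # Ignore `seed` and `options`, call the class's original reset, and
    # return an (observation, info) pair.
    return LegacyEnv.reset(self), {}


legacy = LegacyEnv()

# Functions are descriptors: __get__ returns a method bound to `legacy`.
# Assigning it on the instance shadows the class-level reset for this object only.
legacy.reset = new_style_reset.__get__(legacy, LegacyEnv)

obs, info = legacy.reset(seed=0, options=None)
print(obs, info)
```

Because the replacement is stored on the instance, other objects of the same class keep the original `reset`, which is why the workaround has to be repeated for every new `JoypadSpace` that is created.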

To get a baseline, I decided to implement a basic heuristic model that uses a simple algorithm to try to beat a level of Super Mario Bros.

```python
# Heuristic state, reset at the start of every episode
done = True
going_up = False
prev_y = None

for step in range(1700):
    if done:
        state, info = env.reset()
        prev_y = None
        going_up = False

    # If Mario is on flat ground or still rising from the previous jump,
    # keep holding A to perform the maximum jump; otherwise just run right
    action = SIMPLE_MOVEMENT.index(['right', 'A', 'B']) if going_up else SIMPLE_MOVEMENT.index(['right', 'B'])
    state, reward, done, truncated, info = env.step(action)

    # Set going_up to True if Mario is not descending
    if prev_y is not None:
        going_up = info['y_pos'] >= prev_y

    # Capture the current y position to compare against the next state
    prev_y = info['y_pos']

    # Treat a truncated episode (time limit reached) as finished
    done = done or truncated
    env.render()
    time.sleep(0.01)  # Add a delay of 0.01 seconds between frames

# Close the environment
env.close()
```
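
The decision rule in this loop can be examined apart from the emulator. The sketch below replays it on a hand-written list of y positions; `choose_buttons`, `replay`, and the sample trace are hypothetical and only illustrate when the heuristic holds A.

```python
def choose_buttons(going_up):
    """Hold A while Mario is level or rising, otherwise just run right."""
    return ['right', 'A', 'B'] if going_up else ['right', 'B']


def replay(y_trace):
    """Return the buttons chosen before each observed y position."""
    going_up = False
    prev_y = None
    pressed = []
    for y in y_trace:
        # The action is chosen from what was seen before this step
        pressed.append(choose_buttons(going_up))
        if prev_y is not None:
            going_up = y >= prev_y
        prev_y = y
    return pressed


# Hand-written heights: level ground, a jump, a fall, level ground again
y_trace = [79, 79, 90, 100, 95, 79, 79]

for y, buttons in zip(y_trace, replay(y_trace)):
    print(y, buttons)
```

The replay makes the one-step lag visible: A is released only on the step after Mario is first seen descending, and pressed again only after he has been seen level for one step.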


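The following wrapper adds a bonus of 0.2 times the change in Mario's y position to the environment's reward:

```python
class CustomRewardWrapper(gym.RewardWrapper):
    def __init__(self, env):
        super(CustomRewardWrapper, self).__init__(env)
        self.previous_y_pos = 0  # Initialize the previous y position

    def reset(self, **kwargs):
        result = self.env.reset(**kwargs)
        # Re-sync with Mario's starting height so the first step of an
        # episode is not rewarded or penalized for a stale y position
        self.previous_y_pos = self.unwrapped._y_position
        return result

    def reward(self, reward):
        # Access the private attribute _y_position
        current_y_pos = self.unwrapped._y_position

        # Calculate the change in y position
        delta_y = current_y_pos - self.previous_y_pos

        # Update the previous y position
        self.previous_y_pos = current_y_pos

        reward += 0.2 * delta_y

        return reward
```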
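
One property of this shaping term is worth noting: because each step adds 0.2 times the change in height, the bonuses over an episode sum to 0.2 times the net change between the stored starting value and the final height. The stored starting value therefore matters. The sketch below checks this on a hand-written trace; `shaping_bonuses` and the y values are hypothetical.

```python
def shaping_bonuses(y_trace, start_y, scale=0.2):
    """Per-step bonus of scale * (change in y), starting from start_y."""
    bonuses = []
    previous_y = start_y
    for y in y_trace:
        bonuses.append(scale * (y - previous_y))
        previous_y = y
    return bonuses


# Hand-written heights for one jump that ends back on the ground
y_trace = [79, 90, 100, 95, 79]

stale = shaping_bonuses(y_trace, start_y=0)     # stored value left at 0
synced = shaping_bonuses(y_trace, start_y=79)   # stored value matches the start

print(round(sum(stale), 6), round(sum(synced), 6))
```

With the stored value left at 0, the first step alone earns a bonus for the jump from 0 to 79; with it matched to the starting height, a jump that lands at the same height nets no bonus at all.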

With the baseline in place, I trained a PPO agent from Stable-Baselines3 using a CNN policy, a reduced two-action set, and the custom reward wrapper:

```python
import gym
import gym_super_mario_bros
from nes_py.wrappers import JoypadSpace
from gym.wrappers import StepAPICompatibility, TimeLimit
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Create the Super Mario Bros. environment
env = gym.make('SuperMarioBros-v0')
steps = env._max_episode_steps  # get the original max_episode_steps count

# Restrict the agent to running right, with or without jumping
CUSTOM_ACTIONS = [
    ['right', 'B'],
    ['right', 'A', 'B'],
]

# Set the Joypad wrapper
env = JoypadSpace(env.env, CUSTOM_ACTIONS)
# Overwrite the old reset to accept `seed` and `options` args
# (gymnasium_reset is defined in the setup code above)
env.reset = gymnasium_reset.__get__(env, JoypadSpace)

# Set TimeLimit back, then add the custom reward shaping
# (CustomRewardWrapper is defined above)
env = TimeLimit(StepAPICompatibility(env, output_truncation_bool=True), max_episode_steps=steps)
env = CustomRewardWrapper(env)

# Create and train the PPO agent
model = PPO('CnnPolicy', env, verbose=1, learning_rate=1e-4, n_steps=128, batch_size=64, n_epochs=4, clip_range=0.2)

model.learn(total_timesteps=1000)

# model.save("ppo_mario")

# Evaluate the agent
# mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
# print(f"Mean reward: {mean_reward} ± {std_reward}")
```

Training output (truncated):

```
Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Wrapping the env in a VecTransposeImage.
----------------------------
| time/              |     |
|    fps             | 120 |
|    iterations      | 1   |
|    time_elapsed    | 1   |
|    total_timesteps | 128 |
----------------------------
----------------------------------------
| time/                   |            |
|    fps                  | 30         |
|    iterations           | 2          |
|    time_elapsed         | 8          |
|    total_timesteps      | 256        |
| train/                  |            |
|    approx_kl            | 0.03974242 |
|    clip_fraction        | 0.375      |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.677     |
|    explained_variance   | 0.00109    |
|    learning_rate        | 0.0001     |
|    loss                 | 120        |
|    n_updates            | 4          |
|    policy_gradient_loss | -0.00242   |
|    v
```
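
To put the hyperparameters passed to `PPO` in perspective, the sketch below works out how much training a 1000-timestep budget buys. It assumes a single environment, that Stable-Baselines3 collects whole rollouts of `n_steps` until the budget is reached, and that it runs `n_epochs` passes of minibatches over each rollout; `training_budget` is a hypothetical helper, not part of the library.

```python
import math


def training_budget(total_timesteps, n_steps, batch_size, n_epochs, n_envs=1):
    """Estimate rollouts, frames, and gradient steps for a PPO run."""
    rollout_size = n_steps * n_envs
    # Whole rollouts are collected until the budget is met or exceeded
    rollouts = math.ceil(total_timesteps / rollout_size)
    minibatches_per_epoch = rollout_size // batch_size
    return {
        'rollouts': rollouts,
        'timesteps_collected': rollouts * rollout_size,
        'gradient_steps': rollouts * n_epochs * minibatches_per_epoch,
    }


budget = training_budget(total_timesteps=1000, n_steps=128, batch_size=64, n_epochs=4)
print(budget)
```

Under these assumptions the run amounts to 8 rollouts, 1024 frames, and 64 gradient steps, which is likely far too little for a CNN policy to learn from pixels. It is better read as a smoke test of the pipeline than as a trained player.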

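The trained model can then be run in the environment:

```python
obs, info = env.reset()
for step in range(1500):
    # Copy the frame before passing it to the model
    action, _states = model.predict(obs.copy())
    action = action.item()  # convert the NumPy value to a plain int
    obs, reward, done, truncated, info = env.step(action)
    env.render()
    # Start a new episode when Mario dies, finishes, or hits the time limit
    if done or truncated:
        obs, info = env.reset()

env.close()
```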

### Result

*show the result and interpretation of your experiment. Any iterative improvements summary.*

5. Does it include the results summary, interpretation of experiments and visualization (e.g. performance comparison table, graphs etc)? (7)

### Conclusion

*Conclusion, discussion, reflection, or suggestions for future improvements or future ideas.*

6. Does it include discussion (what went well or not and why), and suggestions for improvements or future work? (5)

### References

*Reference: Please include all relevant links (git, video, etc)*

7. Does it include all deliverables (3)
	- git with codes or notebooks
	- writeup (you can consider notebook as a writeup if the notebook contains all needed contents and explanation)
	- demo clips
	- proper quote or reference
    
Kauten, C. (2018). Super Mario Bros for OpenAI Gym. GitHub. Retrieved from https://github.com/Kautenja/gym-super-mario-bros

Nintendo. (1985). Super Mario Bros. Instruction Manual. Nintendo of America Inc. Retrieved from https://www.nintendo.co.jp/clv/manuals/en/pdf/CLV-P-NAAAE.pdf

NathanGavenski. (2023). Comment on issue #128 in Kautenja/gym-super-mario-bros repository. GitHub. Retrieved from https://github.com/Kautenja/gym-super-mario-bros/issues/128#issuecomment-1954019091

