# Training an AI to Play Mario

In [14]:
%pip install gym-super-mario-bros==7.4.0
%pip install tensordict==0.2.0
%pip install torchrl==0.2.0


Note: you may need to restart the kernel to use updated packages.


In [1]:
import torch
from torch import nn
from torchvision import transforms as T
from PIL import Image
import numpy as np
from pathlib import Path
from collections import deque
import random, datetime, os

# Gym is an OpenAI toolkit for RL
import gym
from gym.spaces import Box
from gym.wrappers import FrameStack

# NES Emulator for OpenAI Gym
from nes_py.wrappers import JoypadSpace

# Super Mario environment for OpenAI Gym
import gym_super_mario_bros


## Definitions of RL

**Environment**: The world that an agent interacts with and learns from.

**Action** $a$: How our agent responds to the environment. The set of all possible actions is called the *action space*.

**State** $s$: The current characteristics of the environment. The set of all possible states the environment can be in is called the *state space*.

**Reward** $r$: The key feedback from the environment to the agent. It is what drives our agent to learn and to change its future actions. An aggregation of rewards over multiple time steps is called the **Return**.

**Action value function** $Q^*(s,a)$: The expected return when the agent starts in state $s$, takes an arbitrary action $a$, and then at each future time step takes the action that maximises the return. $Q$ stands for the "quality" of an action in a state.
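
The definitions of return and action value can be made concrete with a small sketch. It assumes a discount factor `gamma`, which the definitions above do not mention, and uses a made-up list of two Q-values; `discounted_return` and `greedy_action` are illustrative names, not part of any library.

```python
def discounted_return(rewards, gamma=0.9):
    """Aggregate rewards over time steps, weighting step t by gamma**t (gamma is an assumption)."""
    total = 0.0
    # Work backwards so each step is one multiply-add: G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def greedy_action(q_values):
    """Return the index of the action with the highest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5

q_for_state = [0.2, 0.7]  # hypothetical values for two actions in one state
print(greedy_action(q_for_state))  # 1
```

With `gamma=1.0` the return is simply the sum of the rewards; smaller values make the agent care more about near-term rewards.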


## Environment

### Initialisation of the environment

In Mario, the environment consists of pipes, mushrooms, enemies and other components.

When Mario takes an action, the environment responds with the changed (next) state, the reward and other info.

Here we initialise the Super Mario environment (in gym v0.26 and later, change `render_mode` to `'human'` to see the results on the screen):

In [2]:
if gym.__version__ < '0.26':
    env = gym_super_mario_bros.make('SuperMarioBros-1-1-v0', new_step_api=True)
else:
    env = gym_super_mario_bros.make('SuperMarioBros-1-1-v0', render_mode='rgb', apply_api_compatibility=True)



Limit the action space to:

0. walk right
1. jump right


In [3]:
env = JoypadSpace(env, [['right'], ['right', 'A']])

env.reset()
next_state, reward, done, trunc, info = env.step(action=0)
print(f"{next_state.shape}, {reward}, {done}, {trunc}, {info}")

(240, 256, 3), 0.0, False, False, {'coins': 0, 'flag_get': False, 'life': 2, 'score': 0, 'stage': 1, 'status': 'small', 'time': 400, 'world': 1, 'x_pos': 40, 'y_pos': 79}




### Preprocessing of the environment

Environment data is returned to the agent in `next_state`. As shown above, each state is represented by a `[240, 256, 3]` array (height, width, RGB channels). Often that is more information than our agent needs; for instance, Mario's actions do not depend on the color of the pipes or the sky!

We use **Wrappers** to preprocess environment data before sending it to the agent.

`SkipFrame` modifies the stepping behavior of an OpenAI Gym environment. It builds on `gym.Wrapper`, a base class provided by the OpenAI Gym library that lets us add functionality to an environment by wrapping it with additional behavior.

In [4]:
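class SkipFrame(gym.Wrapper):
    def __init__(self, env, skip):
        """Return only every `skip`-th frame"""
        super().__init__(env)
        self._skip = skip

    def step(self, action):
        """Repeat the action for `skip` frames and sum the rewards"""
        total_reward = 0.0
        for i in range(self._skip):
            # The environment above returns five values per step
            obs, reward, done, trunc, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        # Return after the loop so that every skipped frame is accumulated
        return obs, total_reward, done, trunc, info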

`GrayScaleObservation` is a common wrapper that transforms an RGB image to grayscale; doing so reduces the size of the state representation without losing useful information. The size of each state is now `[1, 240, 256]`.

In [5]:
class GrayScaleObservation(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
        obs_shape = self.observation_space.shape[:2]
        self.observation_space = Box(low=0, high=255, shape=obs_shape, dtype=np.uint8)

    def permute_orientation(self, observation):
        # Permute the [H, W, C] array to a [C, H, W] tensor
        observation = np.transpose(observation, (2, 0, 1))
        observation = torch.tensor(observation.copy(), dtype=torch.float)
        return observation

    def observation(self, observation):
        observation = self.permute_orientation(observation)
        # Collapse the three color channels into a single grayscale channel
        observation = T.Grayscale()(observation)
        return observation
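
To show what a grayscale conversion does numerically, here is a minimal sketch in plain Python on a tiny 2x2 image. It assumes the common ITU-R BT.601 luma weights (about 0.299, 0.587, 0.114); the exact weights a particular library applies may differ slightly, and `to_grayscale` is an illustrative name only.

```python
def to_grayscale(rgb_image):
    """Convert an H x W grid of (R, G, B) tuples into an H x W grid of luma values."""
    return [
        [0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
        for row in rgb_image
    ]


image = [
    [(255, 255, 255), (0, 0, 0)],  # white, black
    [(255, 0, 0), (0, 0, 255)],    # pure red, pure blue
]
gray = to_grayscale(image)

# Three values per pixel become one, so the state shrinks by a factor of three
print(len(gray), len(gray[0]))  # 2 2
```

Green contributes most to the result and blue least, which roughly matches how bright each color appears to the eye.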

`ResizeObservation` downsamples each observation into a square image. The new size is `[1, 84, 84]`.

In [6]:
class ResizeObservation(gym.ObservationWrapper):
    def __init__(self, env, shape):
        super().__init__(env)
        if isinstance(shape, int):
            self.shape = (shape, shape)
        else:
            self.shape = tuple(shape)
        
        obs_shape = self.shape + self.observation_space.shape[2:]
        self.observation_space = Box(low=0, high=255, shape=obs_shape, dtype=np.uint8)

    def observation(self, observation):
        # Resize to the target shape, then scale pixel values from [0, 255] to [0, 1]
        transforms = T.Compose([T.Resize(self.shape), T.Normalize(0, 255)])
        observation = transforms(observation).squeeze(0)
        return observation
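
The `T.Normalize(0, 255)` step in the cell above subtracts a mean of 0 and divides by a standard deviation of 255, which maps pixel values from the range 0 to 255 onto the range 0 to 1. The arithmetic can be sketched in plain Python; `normalize` here is an illustrative stand-in, not the torchvision implementation.

```python
def normalize(pixels, mean=0.0, std=255.0):
    """Apply (pixel - mean) / std to every value."""
    return [(p - mean) / std for p in pixels]


print(normalize([0.0, 127.5, 255.0]))  # [0.0, 0.5, 1.0]
```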

#### Applying Wrappers to the environment

`SkipFrame` is a custom wrapper that inherits from `gym.Wrapper` and implements the `step()` function. Because consecutive frames don’t vary much, we can skip n intermediate frames without losing much information. The n-th frame aggregates the rewards accumulated over each skipped frame.
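
The reward aggregation described above can be tried out without Gym. The sketch below uses a hypothetical `CountingEnv` that pays a reward of 1.0 per frame and ends after ten frames; `step_with_skip` is likewise an illustrative name, and the five-value step result mirrors the tuple unpacked earlier in this notebook.

```python
class CountingEnv:
    """Hypothetical stand-in environment: reward 1.0 per frame, done after `length` frames."""

    def __init__(self, length=10):
        self.length = length
        self.frame = 0

    def step(self, action):
        self.frame += 1
        done = self.frame >= self.length
        # (observation, reward, done, truncated, info)
        return self.frame, 1.0, done, False, {}


def step_with_skip(env, action, skip):
    """Repeat `action` for up to `skip` frames and sum the rewards."""
    total_reward = 0.0
    for _ in range(skip):
        obs, reward, done, trunc, info = env.step(action)
        total_reward += reward
        if done:
            break  # stop early if the episode ends in the middle of a skip
    return obs, total_reward, done, trunc, info


toy_env = CountingEnv(length=10)
print(step_with_skip(toy_env, action=0, skip=4))  # (4, 4.0, False, False, {})
print(step_with_skip(toy_env, action=0, skip=4))  # (8, 4.0, False, False, {})
print(step_with_skip(toy_env, action=0, skip=4))  # (10, 2.0, True, False, {})
```

The last call returns a smaller reward because the episode ends after two of the four repeated frames.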

`FrameStack` is a wrapper that squashes consecutive frames of the environment into a single observation to feed to our learning model. This way, we can tell whether Mario is landing or jumping from the direction of his movement over the previous several frames.
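
The idea behind frame stacking can be sketched with a fixed-length `deque`. `FrameBuffer` is a hypothetical helper, not Gym's `FrameStack`, and the "frames" are made-up height values rather than images; filling the buffer with the first frame on reset is an assumption of this sketch.

```python
from collections import deque


class FrameBuffer:
    """Hypothetical helper that keeps the most recent `num_stack` frames."""

    def __init__(self, num_stack):
        self.frames = deque(maxlen=num_stack)

    def reset(self, first_frame):
        # Assumption: repeat the first frame so the stack is full from the start
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically
        self.frames.append(frame)
        return list(self.frames)


buffer = FrameBuffer(num_stack=4)
print(buffer.reset(79))  # [79, 79, 79, 79]
for height in (85, 92, 96):
    stack = buffer.push(height)
print(stack)  # [79, 85, 92, 96]

# Comparing the oldest and newest frame reveals the direction of movement
print("rising" if stack[-1] > stack[0] else "not rising")  # rising
```

A single frame could not distinguish rising from falling; the stack of four makes the direction visible to the model.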

In [7]:
env = SkipFrame(env, skip=4)
env = GrayScaleObservation(env)
env = ResizeObservation(env, shape=84)
if gym.__version__ < '0.26':
    env = FrameStack(env, num_stack=4, new_step_api=True)
else:
    env = FrameStack(env, num_stack=4)

After applying the above wrappers to the environment, the final wrapped state consists of 4 consecutive grayscale frames. Each time Mario takes an action, the environment responds with a state of this structure, represented by a 3D array of size `[4, 84, 84]`.
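
A quick calculation shows how much the preprocessing shrinks the input. It uses only the shapes stated in this notebook: `(240, 256, 3)` for a raw frame and `[4, 84, 84]` for the final state.

```python
from math import prod

raw_frame_shape = (240, 256, 3)  # one RGB frame from the emulator
stacked_shape = (4, 84, 84)      # four resized grayscale frames

raw_values = prod(raw_frame_shape)
stacked_values = prod(stacked_shape)
print(raw_values)      # 184320
print(stacked_values)  # 28224

# Four raw frames compared with one stacked state
print(round(4 * raw_values / stacked_values, 1))  # 26.1
```

The stacked state covers four time steps yet holds roughly 26 times fewer values than four raw frames would.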