This notebook follows the official PyTorch tutorial:
[Mario-Playing RL Agent](https://pytorch.org/tutorials/intermediate/mario_rl_tutorial.html#train-a-mario-playing-rl-agent)

Useful resources:
- Model paper: [Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/pdf/1509.06461.pdf)
- Official repo of the code: [MadMario](https://github.com/yfeng997/MadMario)
- Reinforcement Learning cheatsheet: [RL Cheatsheet](https://colab.research.google.com/drive/1eN33dPVtdPViiS1njTW_-r-IYCDTFU7N)

In [7]:
import torch
from torch import nn
from torchvision import transforms as T

import numpy as np
import random, datetime, os, copy
from collections import deque
from pathlib import Path
from PIL import Image

import gym
from gym.spaces import Box
from gym.wrappers import FrameStack
import gym_super_mario_bros

# NES Emulator for Gym
from nes_py.wrappers import JoypadSpace

# show each distinct warning only once instead of on every occurrence
import warnings
warnings.filterwarnings(action="once")

**Optimal Action-Value function** $Q^\star(s, a)$: Gives the expected return if you start in state $s$, take an arbitrary action $a$, and then at every future time step take the action that maximizes the return. $Q$ can be said to stand for the "quality" of an action in a given state. This is the function we try to approximate.
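
To make the definition concrete, here is a minimal sketch that computes $Q^\star$ exactly for a tiny, made-up deterministic problem by repeatedly applying the Bellman optimality backup $Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$. The three-state chain, the `step` function, and the discount `GAMMA = 0.9` are assumptions chosen for illustration. They are not part of the Mario environment or the tutorial.

```python
GAMMA = 0.9        # assumed discount factor
N_STATES = 3       # states 0 and 1 are regular, state 2 is terminal
ACTIONS = (0, 1)   # 0 = stay, 1 = move right


def step(state, action):
    """Toy deterministic transition: returns (next_state, reward, done)."""
    if action == 1:
        next_state = state + 1
        done = next_state == N_STATES - 1
        # reward 1.0 only for the move that reaches the terminal state
        return next_state, (1.0 if done else 0.0), done
    return state, 0.0, False


def q_value_iteration(n_sweeps=50):
    """Apply the Bellman optimality backup to every (state, action) pair."""
    q = {(s, a): 0.0 for s in range(N_STATES - 1) for a in ACTIONS}
    for _ in range(n_sweeps):
        new_q = {}
        for (s, a) in q:
            s2, r, done = step(s, a)
            # no future return once the episode has ended
            future = 0.0 if done else max(q[(s2, b)] for b in ACTIONS)
            new_q[(s, a)] = r + GAMMA * future
        q = new_q
    return q


q_star = q_value_iteration()
for key in sorted(q_star):
    print(key, round(q_star[key], 4))
```

In this toy problem, moving right is always at least as good as staying, and the values shrink by a factor of `GAMMA` for every extra step needed to reach the goal. Mario's state space is far too large for such a table, which is why the function has to be approximated.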

In [5]:
env = gym_super_mario_bros.make("SuperMarioBros-1-1-v0")
# limit the action space to
#    0. walk right
#    1. jump right
env = JoypadSpace(env, [["right"], ["right", "A"]])

env.reset()
next_state, reward, done, info = env.step(action=0)

print(f"{next_state.shape},\n {reward},\n {done},\n {info}")

(240, 256, 3),
 0,
 False,
 {'coins': 0, 'flag_get': False, 'life': 2, 'score': 0, 'stage': 1, 'status': 'small', 'time': 400, 'world': 1, 'x_pos': 40, 'y_pos': 79}


Preprocess the environment with several **wrappers**: `GrayScaleObservation`, `ResizeObservation`, `SkipFrame`, and `FrameStack`.

In [8]:
class SkipFrame(gym.Wrapper):
    def __init__(self, env, skip):
        """Return only every `skip`-th frame"""
        super().__init__(env)
        self._skip = skip
        
    def step(self, action):
        """Repeat action, and sum reward"""
        total_reward = 0.0
        done = False
        for i in range(self._skip):
            # accumulate reward and repeat the same action
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
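
The frame-skipping logic can be tried out in isolation, without the emulator. The sketch below repeats the same loop as a plain function and runs it against a hypothetical `CountingEnv` stub (not part of gym or the tutorial) that pays a reward of 1.0 per frame and ends after a fixed number of frames.

```python
class CountingEnv:
    """Hypothetical stand-in env: reward 1.0 per frame, done after `horizon` frames."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        # the observation is simply the current frame index
        return self.t, 1.0, done, {}


def skip_step(env, action, skip):
    """Repeat `action` up to `skip` times, summing rewards; stop early when done."""
    total_reward = 0.0
    for _ in range(skip):
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return obs, total_reward, done, info


env = CountingEnv(horizon=6)
first = skip_step(env, action=0, skip=4)   # consumes frames 1 to 4
second = skip_step(env, action=0, skip=4)  # episode ends at frame 6, so only 2 frames
print(first)
print(second)
```

The second call returns a smaller summed reward because the loop breaks as soon as the episode ends, so the agent never steps a finished environment.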

In [9]:
class GrayScaleObservation(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
        obs_shape = self.observation_space.shape[:2]
        self.observation_space = Box(low=0, high=255, shape=obs_shape, dtype=np.uint8)
        
    def permute_orientation(self, observation):
        # permute [H, W, C] array to [C, H, W] tensor
        observation = np.transpose(observation, (2, 0, 1))
        observation = torch.tensor(observation.copy(), dtype=torch.float)
        return observation
    
    def observation(self, observation):
        observation = self.permute_orientation(observation)
        transform = T.Grayscale()
        observation = transform(observation)
        return observation
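
As a rough check of what this wrapper does to the array shapes, the following sketch reproduces the two steps with NumPy only: the [H, W, C] to [C, H, W] transpose, followed by a weighted sum over the color channels. The `LUMA` weights are the commonly used ITU-R 601 values and are an assumption here; the coefficients torchvision applies internally may differ slightly. The dummy frame only borrows the (240, 256, 3) shape printed earlier.

```python
import numpy as np

# Assumed luma weights for R, G, B (ITU-R 601); for illustration only
LUMA = np.array([0.299, 0.587, 0.114])

# dummy [H, W, C] frame with the same shape as the raw Mario observation
frame = np.zeros((240, 256, 3), dtype=np.uint8)
frame[..., 0] = 255  # pure red image

# [H, W, C] -> [C, H, W], as in permute_orientation
chw = np.transpose(frame, (2, 0, 1)).astype(np.float32)

# weighted sum over the channel axis, keeping a leading channel dimension
gray = np.tensordot(LUMA, chw, axes=([0], [0]))[None, ...]

print(chw.shape)   # (3, 240, 256)
print(gray.shape)  # (1, 240, 256)
```

The three color channels collapse into one, which cuts the amount of data per frame to a third before resizing and stacking.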

In [None]:
class ResizeObservation(gym.ObservationWrapper):
    def __init__(self, env, shape):
        super().__init__(env)
        if isinstance(shape, int):
            self.shape = (shape, shape)
        else:
            self.shape = tuple(shape)