<a href="https://colab.research.google.com/github/YuansongFeng/MadMario/blob/master/tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Pre-MVP tutorial that walks users through building a Mario agent that learns to play. Guidelines for creating this notebook (feel free to add/edit):
1. Explain extensively (link to the AI cheatsheet where necessary)
2. Only ask users for the core logic
3. Parse errors extensively


In [0]:
# !git clone https://user:pwd@github.com/YuansongFeng/MadMario
# %cd MadMario/
# !pip install gym-super-mario-bros

# Section 0.0
In the section below, you will pre-process the environment by converting the RGB images it returns into gray-scale images. This lets the model be smaller, because the number of input channels drops from 3 to 1. With fewer model parameters to learn, training is faster.
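
The channel reduction can be illustrated on a dummy frame. This is only a sketch: the helper `rgb_to_gray` and the luma weights are assumptions made for illustration, not necessarily what `GrayScaleObservation` does internally. The frame size of 240 x 256 matches the observation shape checked later in this notebook.

```python
import numpy as np

def rgb_to_gray(frame):
    """Convert an (H, W, 3) uint8 RGB frame to an (H, W) uint8 gray-scale frame."""
    # ITU-R BT.601 luma weights; an assumption for this sketch.
    weights = np.array([0.299, 0.587, 0.114])
    gray = frame.astype(np.float64) @ weights  # weighted sum over the channel axis
    return gray.round().astype(np.uint8)

rgb_frame = np.zeros((240, 256, 3), dtype=np.uint8)  # dummy frame, Mario-sized
gray_frame = rgb_to_gray(rgb_frame)
print(rgb_frame.shape, "->", gray_frame.shape)  # (240, 256, 3) -> (240, 256)
```

The gray-scale frame holds one third of the values of the RGB frame, which is where the smaller model input comes from.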

To visualize what your pre-processing logic will do, here is the environment's feedback to Mario before and after pre-processing:

**before pre-processing**

![picture](https://drive.google.com/uc?id=1c9-tUWFyk4u_vNNrkZo1Rg0e2FUcbF3N)
![picture](https://drive.google.com/uc?id=1s7UewXkmF4g_gZfD7vloH7n1Cr-D3YYX)
![picture](https://drive.google.com/uc?id=1mXDt8rFLKT9a-YvhGOgGZT4bq0T2y7iw)

**after pre-processing**

![picture](https://drive.google.com/uc?id=1ED9brgnbPmUZL43Bl_x2FDmXd-hsHBQt)
![picture](https://drive.google.com/uc?id=1PB1hHSPk6jIhSxVok2u2ntHjvE3zrk7W)
![picture](https://drive.google.com/uc?id=1CYm5q71f_OlY_mqvZADuMMjPmcMgbjVW)

To pre-process the environment, we use the idea of a *wrapper*. By wrapping the environment, we can apply a pre-processing step to the environment's output, specifically the observation.

Example of applying an environment wrapper:
```
env = ResizeObservation(env, shape=84)
```
In this case, the environment's observation is resized to 84 x 84.
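
To make the wrapper idea concrete, here is a toy sketch of the pattern in plain Python. `ToyEnv` and `ScaleObservation` are hypothetical names invented for this illustration; they are not part of gym, and the real gym wrappers handle more methods (such as `step`) than this sketch does.

```python
import numpy as np

class ToyEnv:
    """Hypothetical stand-in for an environment; reset() returns an RGB frame."""
    def reset(self):
        return np.full((4, 4, 3), 255, dtype=np.uint8)

class ScaleObservation:
    """Toy wrapper: holds the wrapped env and rescales its observations to [0, 1]."""
    def __init__(self, env):
        self.env = env

    def observation(self, obs):
        # The pre-processing step applied to every observation.
        return obs.astype(np.float32) / 255.0

    def reset(self):
        # Delegate to the wrapped env, then transform what it returns.
        return self.observation(self.env.reset())

env = ScaleObservation(ToyEnv())
obs = env.reset()
print(obs.dtype, obs.shape)
```

The caller keeps using `env` as before; only the observations it receives have changed.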

# Instruction

Apply `GrayScaleObservation` to the given `env`. 

In [0]:
from gym.wrappers import FrameStack, GrayScaleObservation, ResizeObservation

def wrapper(env):
    # TODO wrap the given env with GrayScaleObservation and return result
    return None

In [0]:
# This should be imported from a standalone Python file dedicated to error
# checking and feedback. For now, it is defined here as an example.
import gym_super_mario_bros

def feedback_section_0_0(wrapper):
  if not wrapper:
    return "Did you forget to define the wrapper() function?"
  env = gym_super_mario_bros.make('SuperMarioBros-1-1-v0')
  env = wrapper(env)
  # wrapper() returns None until the TODO is filled in
  if env is None:
    return "Did you remember to return the wrapped env?"
  # GrayScaleObservation drops the channel axis: (240, 256, 3) -> (240, 256)
  if env.observation_space.shape != (240, 256):
    return "Did you remember to call GrayScaleObservation on env?"
  # More detailed tests here... 
  return None

In [0]:
error = feedback_section_0_0(wrapper)
if error:
  print(error)

Did you remember to return the wrapped env?


In [0]:
def sample_batch(memory, batch_size):
  """
  Input
    memory (FIFO queue)
      a queue where each entry has five elements as below
      state: LazyFrame of dimension (state_dim)
      next_state: LazyFrame of dimension (state_dim)
      action: integer, representing the action taken
      reward: float, the reward from state to next_state with action
      done: boolean, whether state is a terminal state
    batch_size (int)
      size of the batch to return 

  Output
    state, next_state, action, reward, done (tuple)
      a tuple of five elements: state, next_state, action, reward, done
      state: numpy array of dimension (batch_size x state_dim)
      next_state: numpy array of dimension (batch_size x state_dim)
      action: numpy array of dimension (batch_size)
      reward: numpy array of dimension (batch_size)
      done: numpy array of dimension (batch_size)
  """
  return (None, None, None, None, None)
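
One possible shape for this function, as a hedged sketch rather than the reference solution: it assumes `memory` is a `collections.deque` of 5-tuples in the order listed in the docstring, and that each state can be converted with `np.asarray`. The name `sample_batch_sketch` and the toy `state_dim` of (4, 84, 84) are made up for this example.

```python
import random
from collections import deque

import numpy as np

def sample_batch_sketch(memory, batch_size):
    # Draw batch_size distinct transitions uniformly at random.
    batch = random.sample(list(memory), batch_size)
    # Regroup: list of 5-tuples -> five tuples, each of length batch_size.
    state, next_state, action, reward, done = zip(*batch)
    return (
        np.stack([np.asarray(s) for s in state]),
        np.stack([np.asarray(s) for s in next_state]),
        np.asarray(action),
        np.asarray(reward, dtype=np.float32),
        np.asarray(done),
    )

# Toy memory: 10 transitions with a hypothetical state_dim of (4, 84, 84).
memory = deque(maxlen=100)
for i in range(10):
    frame = np.zeros((4, 84, 84), dtype=np.uint8)
    memory.append((frame, frame, i % 3, 1.0, False))

state, next_state, action, reward, done = sample_batch_sketch(memory, batch_size=4)
print(state.shape, action.shape)
```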

In [0]:
def calculate_prediction_q(state, action):
  """
  Input
    state (np.array)
      dimension is (batch_size x state_dim), each item is an observation 
      for the current state 
    action (np.array)
      dimension is (batch_size), each item is an integer representing the 
      action taken for current state 

  Output
    pred_q (torch.tensor)
      dimension of (batch_size), each item is a predicted Q value of the 
      current state-action pair 
  """
  return None
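
The core step here is selecting, for each sample, the Q value of the action that was actually taken. The sketch below shows that selection in NumPy on made-up numbers; `q_values` stands in for the output of a Q network, which this notebook has not defined yet. Index-based selection of this kind is also available on torch tensors.

```python
import numpy as np

# Hypothetical network output: one row per sample, one column per action.
q_values = np.array([[0.1, 0.5, 0.2],
                     [0.7, 0.3, 0.9]])
action = np.array([1, 2])  # action taken in each sample

# For each row i, pick the column given by action[i].
pred_q = q_values[np.arange(len(action)), action]
print(pred_q)  # [0.5 0.9]
```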

In [0]:
def calculate_target_q(next_state, reward):
  """
  Input
    next_state (np.array)
      dimension is (batch_size x state_dim), each item is an observation 
      for the next state 
    reward (np.array)
      dimension is (batch_size), each item is a float representing the 
      reward collected from (state -> next state) transition 

  Output
    target_q (torch.tensor)
      dimension of (batch_size), each item is a target Q value of the current
      state-action pair, calculated based on reward collected and 
      estimated Q value for next state
  """
  return None
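
A common way to combine the reward with the estimated Q value of the next state is the one-step Q-learning target, reward plus the discounted maximum Q value over next actions. The sketch below works through it in NumPy; the discount factor `gamma` and the `next_q_values` array are assumptions for illustration and do not come from this notebook.

```python
import numpy as np

gamma = 0.9  # discount factor; hypothetical value for this sketch

reward = np.array([1.0, 0.0])
# Hypothetical Q-value estimates for the next state, shape (batch_size, n_actions).
next_q_values = np.array([[0.2, 0.4, 0.1],
                          [1.0, 3.0, 2.0]])

# One-step target: reward plus discounted best estimated value of the next state.
target_q = reward + gamma * next_q_values.max(axis=1)
print(target_q)
```

Many implementations also multiply the bootstrap term by `(1 - done)` so that terminal transitions use the reward alone. The function above does not receive `done`, so the sketch leaves it out.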

In [0]:
def calculate_huber_loss(pred_q, target_q):
  """
  Input
    pred_q (torch.tensor)
      dimension is (batch_size), each item is a predicted Q value of the 
      current state-action pair 
    target_q (torch.tensor)
      dimension is (batch_size), each item is a target Q value of the 
      current state-action pair 

  Output
    loss (torch.tensor)
      a single value representing the Huber loss of pred_q and target_q
  """
  return None
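
For reference, the Huber loss is quadratic for small errors and linear for large ones, which makes it less sensitive to outliers than squared error. Below is a NumPy sketch of the definition, assuming a threshold `delta` of 1.0 and averaging over the batch; in the tutorial itself you would work with torch tensors, where `torch.nn.functional.smooth_l1_loss` provides a comparable built-in.

```python
import numpy as np

def huber_loss(pred, target, delta=1.0):
    err = np.abs(pred - target)
    quadratic = 0.5 * err ** 2            # used where |error| <= delta
    linear = delta * (err - 0.5 * delta)  # used where |error| > delta
    return np.where(err <= delta, quadratic, linear).mean()

pred_q = np.array([1.0, 2.0, 5.0])
target_q = np.array([1.5, 2.0, 2.0])
loss = huber_loss(pred_q, target_q)
print(loss)  # per-item losses are 0.125, 0.0 and 2.5, so the mean is 0.875
```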

In [0]:
def back_propagate_loss(optimizer, loss):
  """
  Input
    optimizer (torch optimizer)
      optimizer with already defined learning rate and learning parameters
    loss (torch.tensor)
      a single value that the optimizer should optimize on  

  """
  return