<a href="https://colab.research.google.com/github/YuansongFeng/MadMario/blob/master/tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Pre-MVP tutorial that walks users through building a Mario agent that learns to play. Guidelines for creating this notebook (feel free to add/edit):
1. Explain extensively (link to the AI cheatsheet where necessary)
2. Ask users to implement only the core logic
3. Parse errors extensively and give feedback


In [0]:
# !git clone https://user:pwd@github.com/YuansongFeng/MadMario
# %cd MadMario/
# !pip install gym-super-mario-bros

# Section 0.0
In this section, you will pre-process the environment by turning the perceived RGB images into gray-scale images. The advantage is that the model can be significantly smaller, because the number of input channels drops from 3 to 1. With fewer model parameters to learn, training will be faster.

To visualize what your pre-processing logic will do, here is what the environment shows Mario before and after the pre-processing:

**before pre-processing**

![picture](https://drive.google.com/uc?id=1c9-tUWFyk4u_vNNrkZo1Rg0e2FUcbF3N)
![picture](https://drive.google.com/uc?id=1s7UewXkmF4g_gZfD7vloH7n1Cr-D3YYX)
![picture](https://drive.google.com/uc?id=1mXDt8rFLKT9a-YvhGOgGZT4bq0T2y7iw)

**after pre-processing**

![picture](https://drive.google.com/uc?id=1ED9brgnbPmUZL43Bl_x2FDmXd-hsHBQt)
![picture](https://drive.google.com/uc?id=1PB1hHSPk6jIhSxVok2u2ntHjvE3zrk7W)
![picture](https://drive.google.com/uc?id=1CYm5q71f_OlY_mqvZADuMMjPmcMgbjVW)

To pre-process the environment, we use the idea of a *wrapper*. By wrapping the environment, we can apply a desired pre-processing step to the environment's output, specifically the observation.

Example of applying an environment wrapper:
```
env = ResizeObservation(env, shape=84)
```
In this case, the environment observation output is resized to a dimension of 84 x 84. 

# Instruction

We want to apply 3 built-in wrappers to the given `env`: `GrayScaleObservation`, `ResizeObservation`, and `FrameStack`.

`FrameStack` is a wrapper that lets us stack consecutive frames of the environment into a single observation to feed to our learning model. This way, the model can tell whether Mario is jumping or landing from his direction of movement over the previous several frames.
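To make the shapes concrete, here is a minimal NumPy sketch of gray-scaling followed by frame stacking. It is not the gym implementation: the helper names `to_gray` and `stack_frames`, the luminance weights, and the random frames are assumptions for illustration only, and resizing is left out.

```python
from collections import deque

import numpy as np


def to_gray(rgb_frame):
    """Collapse an (H, W, 3) RGB frame to (H, W) with common luminance weights."""
    weights = np.array([0.299, 0.587, 0.114])  # assumed weights, for illustration
    return rgb_frame @ weights


def stack_frames(frames, num_stack=4):
    """Return the newest `num_stack` frames as one (num_stack, H, W) array."""
    window = deque(frames, maxlen=num_stack)  # keeps only the newest frames
    return np.stack(list(window))


# Six fake 240 x 256 RGB frames standing in for environment observations.
rng = np.random.default_rng(0)
rgb_frames = [rng.random((240, 256, 3)) for _ in range(6)]

gray_frames = [to_gray(frame) for frame in rgb_frames]
observation = stack_frames(gray_frames, num_stack=4)

print(gray_frames[0].shape)  # (240, 256)
print(observation.shape)     # (4, 240, 256)
```

Because the stacked observation holds several consecutive frames, motion between them becomes visible to the model even though each single frame is a still image.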

We can start with the following arguments:
- `GrayScaleObservation`: keep_dim=False
- `ResizeObservation`: shape=84
- `FrameStack`: num_stack=4



In [0]:
from gym.wrappers import FrameStack, GrayScaleObservation, ResizeObservation

def wrapper(env):
    # TODO wrap the given env with GrayScaleObservation, ResizeObservation and FrameStack and return result
    return None

We would also like you to get a taste of implementing an environment wrapper yourself, instead of calling off-the-shelf packages. Here is an idea:
To make training faster, we can skip frames: the wrapped environment repeats each action for n consecutive frames and outputs only every n-th frame, so the model processes far fewer observations. Below is a skeleton of the class `SkipFrame`, which inherits from `gym.Wrapper`. Notice that in the `__init__` function, the `_skip` field is set from the input parameter `skip`, which defaults to 4.
However, the rewards from the skipped frames must not be lost, because reward is the signal the model learns from. While we drop the intermediate frames, we keep adding their rewards to a running total and return that total with the frame we do output. Implement this reward accumulation inside the `for` loop.

In [0]:
import gym

class SkipFrame(gym.Wrapper):
    def __init__(self, env, skip=4):
        """Return only every `skip`-th frame"""
        super().__init__(env)
        self._skip = skip

    def step(self, action):
        """Repeat action, and sum reward"""
        total_reward = 0.0
        done = None
        for _ in range(self._skip):
            # TODO step the wrapped env with `action` (replace None below)
            obs, reward, done, info = None
            # TODO add `reward` to `total_reward`
            if done:
                break

        return obs, total_reward, done, info
       


After you have finished the `SkipFrame` class, you can apply it to your preprocessed `env`.
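If you want to check your understanding of reward accumulation without installing gym, the following sketch may help. `CountingEnv` and `ToySkipFrame` are hypothetical stand-ins invented for this example; `ToySkipFrame` deliberately does not inherit from `gym.Wrapper`, so treat it as an illustration of the loop rather than the reference solution.

```python
class CountingEnv:
    """Hypothetical toy env: reward 1.0 per step, done after `horizon` steps."""

    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 1.0, done, {}  # obs, reward, done, info


class ToySkipFrame:
    """Repeat each action `skip` times and sum the rewards (no gym needed)."""

    def __init__(self, env, skip=4):
        self.env = env
        self._skip = skip

    def step(self, action):
        total_reward = 0.0
        for _ in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward  # keep rewards from the skipped frames
            if done:
                break  # stop repeating once the episode has ended
        return obs, total_reward, done, info


env = ToySkipFrame(CountingEnv(horizon=10), skip=4)
print(env.step(0))  # (4, 4.0, False, {})
print(env.step(0))  # (8, 4.0, False, {})
print(env.step(0))  # (10, 2.0, True, {})
```

The last call returns a reward of only 2.0 because the episode ends after two of the four repeated steps.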

In [0]:
# This should be imported from a standalone Python file dedicated to error
# checking and feedback. For now, define it here as an example.
import gym_super_mario_bros

def feedback_section_0_0(wrapper):
  if not wrapper:
    return "Did you forget to define the wrapper() function?"
  env = gym_super_mario_bros.make('SuperMarioBros-1-1-v0')
  env = wrapper(env)
  if env is None:
    return "Do you remember to return the wrapped env?"
  # With the suggested arguments, the three wrappers should yield
  # 4 stacked gray-scale frames of 84 x 84 each.
  if env.observation_space.shape != (4, 84, 84):
    return "Do you remember to apply GrayScaleObservation, ResizeObservation and FrameStack to env?"
  # More detailed tests here... 
  return None

In [0]:
error = feedback_section_0_0(wrapper)
if error:
  print(error)

Do you remember to return the wrapped env?


# Experience Sampling 
Mario learns by drawing past experiences from its memory. The memory is a queue data structure that stores each individual experience in the format of 

> state, next_state, action, reward, done

Examples of some experiences in Mario's memory:


- state: ![pic](https://drive.google.com/uc?id=1D34QpsmJSwHrdzszROt405ZwNY9LkTej)  next_state: ![pic](https://drive.google.com/uc?id=13j2TzRd1SGmFru9KJImZsY9DMCdqcr_J) action: jump reward: 0.0 done: False


- state: ![pic](https://drive.google.com/uc?id=1ByUKXf967Z6C9FBVtsn_QRnJTr9w-18v) next_state: ![pic](https://drive.google.com/uc?id=1hmGGVO1cS7N7kdcM99-K3Y2sxrAFd0Oh) action: right reward: -10.0 done: True


- state: ![pic](https://drive.google.com/uc?id=10MHERSI6lap79VcZfHtIzCS9qT45ksk-) next_state: ![pic](https://drive.google.com/uc?id=1VFNOwQHGAf9pH_56_w0uRO4WUJTIXG90) action: right reward: -10.0 done: True


- state: ![pic](https://drive.google.com/uc?id=1T6CAIMzNxeZlBTUdz3sB8t_GhDFbNdUO) next_state: ![pic](https://drive.google.com/uc?id=1aZlA0EnspQdcSQcVxuVmaqPW_7jT3lfW) action: jump_right reward: 0.0 done: False


- state: ![pic](https://drive.google.com/uc?id=1bPRnGRx2c1HJ_0y_EEOFL5GOG8sUBdIo) next_state: ![pic](https://drive.google.com/uc?id=1qtR4qCURBq57UCrmObM6A5-CH26NYaHv) action: right reward: 10.0 done: False

State and next_state are the observations at timesteps *t* and *t+1*, respectively. Both are of type `LazyFrames`, which helps reduce memory usage. To convert a `LazyFrames` object to a numpy array, do

```
state_np_array = np.array(state_lazy_frame)
```

Action is the move Mario makes that triggers the state transition.

Reward is the feedback from the environment after the transition happens.

Done indicates whether next_state is a terminal state, i.e. the episode has ended (for example, Mario is dead). By definition, a terminal state has a return value of 0.
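As a rough illustration of the memory described above, the sketch below stores experiences in a bounded `collections.deque` and regroups a random sample by field. The placeholder strings, the capacity of 3, and the fixed seed are assumptions made to keep the example small; converting each group to a numpy array is left out.

```python
import random
from collections import deque

# Bounded FIFO memory: once full, appending drops the oldest experience.
memory = deque(maxlen=3)

# Placeholder strings stand in for the image observations.
# Each entry is (state, next_state, action, reward, done).
memory.append(("s0", "s1", 2, 0.0, False))
memory.append(("s1", "s2", 1, 10.0, False))
memory.append(("s2", "s3", 2, 0.0, False))
memory.append(("s3", "s4", 1, -10.0, True))  # evicts the ("s0", ...) entry

random.seed(0)  # fixed seed so the sketch is repeatable
batch = random.sample(list(memory), 2)

# zip(*batch) regroups a list of experiences into one tuple per field.
states, next_states, actions, rewards, dones = zip(*batch)

print(len(memory), len(states), len(dones))  # 3 2 2
```

Sampling at random, rather than taking the most recent entries, means consecutive and highly correlated frames are less likely to end up in the same batch.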

## Instruction

Return a batch of experiences in which the states, next_states, actions, rewards, and done flags are each grouped into their own array. Standardize all of them to numpy arrays.

 

In [0]:
def sample_batch(memory, batch_size):
  """
  Input
    memory (FIFO queue)
      a queue where each entry has five elements as below
      state: LazyFrames of dimension (state_dim)
      next_state: LazyFrames of dimension (state_dim)
      action: integer, representing the action taken
      reward: float, the reward from state to next_state with action
      done: boolean, whether next_state is a terminal state
    batch_size (int)
      size of the batch to return 

  Output
    state, next_state, action, reward, done (tuple)
      a tuple of five elements: state, next_state, action, reward, done
      state: numpy array of dimension (batch_size x state_dim)
      next_state: numpy array of dimension (batch_size x state_dim)
      action: numpy array of dimension (batch_size)
      reward: numpy array of dimension (batch_size)
      done: numpy array of dimension (batch_size)
  """
  return (None, None, None, None, None)

## Predicted Q Value

The learning process relies on the Q-learning algorithm (refer to Q-learning in cheatsheet):

> Q_p(s, a) <- Q_p(s, a) + α(r + γ max Q_t(s', a') - Q_p(s,a))

where Q_p is the prediction value function, Q_t is the target value function, s and a are the current state and action, s' is the next state, and a' is the best next action at s' as judged by Q_p. We use two separate neural networks to represent Q_p and Q_t. Over the course of learning, the networks get better at estimating the state-action value (Q value). All s, a and s' are retrieved from memory.

The reason to have two value functions is to prevent divergence during optimization. Q_p predicts the current state-action value, while Q_t is used together with r to determine the target state-action value (refer to Temporal Difference Learning in the cheatsheet). In this section we make value predictions using Q_p.
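To get a feel for the update rule above, here is a small worked example with plain numbers instead of neural networks. The function name `q_update` and the values chosen for α, γ, r and the two Q values are assumptions for illustration.

```python
def q_update(q_pred, reward, q_target_next, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to a single predicted value.

    q_pred        -- current prediction Q_p(s, a)
    q_target_next -- Q_t(s', a') for the chosen next action a'
    """
    td_target = reward + gamma * q_target_next  # r + γ Q_t(s', a')
    td_error = td_target - q_pred               # how far off the prediction is
    return q_pred + alpha * td_error            # move a fraction α towards the target


new_q = q_update(q_pred=2.0, reward=1.0, q_target_next=3.0)
print(round(new_q, 3))  # 2.17
```

Here the target is 1.0 + 0.9 × 3.0 = 3.7, the prediction of 2.0 is 1.7 too low, and the update closes 10% of that gap.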

Conceptually, we would pass both s and a to the Q_p function, which would output the predicted value for that state-action pair. With 5 possible actions, however, this means running the Q_p neural network 5 times per state, which is very costly. To improve efficiency, we pass only the state to Q_p, which outputs the predicted Q values for all possible actions at once. For example:

Input

state (s): ![pic](https://drive.google.com/uc?id=1ByUKXf967Z6C9FBVtsn_QRnJTr9w-18v)

Output
- moving right (a_1): -10
- jumping up (a_2): 10
- jumping right (a_3): 0

This gives us 

```
Q_p(s, a_1) = -10
Q_p(s, a_2) = 10
Q_p(s, a_3) = 0
```

In our scenario, since the action is given (e.g. a_2), we can directly return the associated Q value, i.e. Q_p(s, a_2). 
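For a whole batch, picking the Q value of the action that was actually taken can be sketched with NumPy indexing. The `q_values` array below is a made-up stand-in for the network output, and actions are 0-based indices here (so index 1 corresponds to a_2 in the example above). The same indexing pattern is expected to work on PyTorch tensors, but that is not shown.

```python
import numpy as np

# Hypothetical network output for a batch of 3 states and 3 actions:
# row i holds Q_p(s_i, a) for every action a.
q_values = np.array([
    [-10.0, 10.0, 0.0],
    [1.0, 2.0, 3.0],
    [0.5, -0.5, 4.0],
])
actions = np.array([1, 0, 2])  # 0-based action index taken in each experience

# Pair each row index with its action index to select one Q value per row.
row_index = np.arange(len(actions))
pred_q = q_values[row_index, actions]

print(pred_q.tolist())  # [10.0, 1.0, 4.0]
```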

## Instruction

For a batch of experiences consisting of states (s) and actions (a), calculate the estimated value for each state-action pair Q(s, a). Return the results in `torch.tensor` format. 


In [0]:
def calculate_prediction_q(state, action):
  """
  Input
    state (np.array)
      dimension is (batch_size x state_dim), each item is an observation 
      for the current state 
    action (np.array)
      dimension is (batch_size), each item is an integer representing the 
      action taken for current state 

  Output
    pred_q (torch.tensor)
      dimension of (batch_size), each item is a predicted Q value of the 
      current state-action pair 
  """
  return None

## Target Q Value

In this section we calculate the target Q value, in the form of

> r + γ max Q_t(s', a')

where r is the reward at transition from s to s',  γ  is the discounting factor, and s' is the next state. Because a' is not part of the actual experience (it is the predicted best action to take at next state), we will estimate it using our prediction value function Q_p by taking the argmax of Q_p(s', a') with respect to a'. 

> a' = argmax_a Q_p(s', a)

The target Q value gives a better estimate of the current state-action value than the predicted Q value Q_p(s, a), because it incorporates the reward r that was actually observed. We want to move the predicted Q value Q_p(s, a) towards the target Q value, r + γ max Q_t(s', a').

In this section we calculate the target Q value for the current state-action pair.
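The two-step recipe above (choose a' with Q_p, then evaluate it with Q_t) can be sketched for a single experience using plain Python lists. The function name `target_q_single`, the example numbers, and the optional `done` argument are assumptions; in particular, `done` is not part of the `calculate_target_q` signature used in this notebook and is included only to show how a terminal next state would contribute no future value.

```python
def target_q_single(reward, q_p_next, q_t_next, gamma=0.9, done=False):
    """Target Q value for one experience.

    q_p_next -- Q_p(s', a) for every action a (used to choose a')
    q_t_next -- Q_t(s', a) for every action a (used to evaluate a')
    """
    if done:
        return reward  # terminal next state: no future value to add
    # a' = argmax_a Q_p(s', a)
    best_action = max(range(len(q_p_next)), key=lambda a: q_p_next[a])
    # r + γ Q_t(s', a')
    return reward + gamma * q_t_next[best_action]


target = target_q_single(
    reward=1.0,
    q_p_next=[0.2, 1.5, 0.7],  # Q_p prefers action index 1
    q_t_next=[0.0, 2.0, 5.0],  # Q_t's value for index 1 is 2.0
)
print(round(target, 3))  # 2.8
```

Note that Q_t's own maximum (5.0 at index 2) is not used: the action is chosen by Q_p and only evaluated by Q_t.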

## Instruction

For a batch of experiences consisting of next_states (s') and rewards (r), calculate the target Q value using the equation above. Note that a' is not explicitly given, so we first need to obtain it using the prediction value function Q_p.

Return the results in `torch.tensor` format. 

In [0]:
def calculate_target_q(next_state, reward):
  """
  Input
    next_state (np.array)
      dimension is (batch_size x state_dim), each item is an observation 
      for the next state 
    reward (np.array)
      dimension is (batch_size), each item is a float representing the 
      reward collected from (state -> next state) transition 

  Output
    target_q (torch.tensor)
      dimension of (batch_size), each item is a target Q value of the current
      state-action pair, calculated based on reward collected and 
      estimated Q value for next state
  """
  return None

## Loss between Prediction and Target Q Value

To improve our value estimation, we would like our predicted Q value to be as close to the target Q value as possible. In other words, we want to minimize the distance between Q_p(s, a) and r + γ max Q_t(s', a'). To do this, we calculate the *Huber loss* between the two values, and use this loss to update Q_p, the prediction value function.


![pic](https://drive.google.com/uc?id=1FZM7sBnMgY5GQNTx-o3LtLRLQQM0mwat)


Huber loss is a smoothed version of L1 loss. The graph above gives some intuition behind L1 vs. L2 vs. Huber loss. L1 loss is harsh near the origin: it gives a relatively high loss even when the predicted and target values differ only slightly. L2 loss, on the other hand, explodes quickly when the difference is large. Huber loss avoids both issues by behaving like L2 loss near the origin and like L1 loss far from it.
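The comparison can also be checked numerically. Below is a minimal pure-Python sketch of the standard piecewise Huber definition for a single error value; the function name `huber` and the threshold `delta=1.0` are choices made for this example.

```python
def huber(error, delta=1.0):
    """Huber loss of one error value: quadratic near 0, linear beyond `delta`."""
    if abs(error) <= delta:
        return 0.5 * error ** 2
    return delta * (abs(error) - 0.5 * delta)


# Compare the three losses for a small, a medium and a large error.
for error in (0.1, 1.0, 10.0):
    l1 = abs(error)
    l2 = error ** 2
    print(f"error={error:5.1f}  L1={l1:6.2f}  L2={l2:7.2f}  Huber={huber(error):6.3f}")
```

For the small error, Huber loss stays well below L1 loss, and for the large error it grows linearly (9.5) instead of quadratically (100).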

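Hint: Huber loss can be computed as shown below. By default the call returns the mean over the batch; pass `reduction='sum'` to get the sum instead.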
```
loss = nn.functional.smooth_l1_loss(input, target)
```

## Instruction

Given the predicted and target Q values for the batch of experiences, return the Huber loss summed over the batch.


In [0]:
def calculate_huber_loss(pred_q, target_q):
  """
  Input
    pred_q (torch.tensor)
      dimension is (batch_size), each item is the predicted Q value of a 
      state-action pair 
    target_q (torch.tensor)
      dimension is (batch_size), each item is the target Q value of the 
      same state-action pair 

  Output
    loss (torch.tensor)
      a single value representing the Huber loss of pred_q and target_q
  """
  return None

## Learning

In [0]:
def learn(self):
    """Update prediction action value (Q) function with a batch of experiences
    """
    # sync target network
    if self.step % self.copy_every == 0:
        self.sync_target_q()
    # checkpoint model
    if self.step % self.save_every == 0:
        self.save_model()
    # break if burn-in
    if self.step < self.burnin:
        return
    # break if no training
    if self.step % self.learn_every != 0:
        return
    # sample a batch of experiences from self.memory
    state, next_state, action, reward, done = sample_batch(self.memory, self.batch_size)

    # calculate prediction Q values for the batch
    pred_q = calculate_prediction_q(state, action)

    # calculate target Q values for the batch
    target_q = calculate_target_q(next_state, reward)

    # calculate huber loss of target and prediction values
    loss = calculate_huber_loss(pred_q, target_q)
    
    # update prediction network by backpropagating the loss
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()
    return loss