<a href="https://colab.research.google.com/github/YuansongFeng/MadMario/blob/master/tutorial_agent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Agent [Yuansong]

Agent and environment are two core concepts in reinforcement learning. The agent continuously interacts with the environment, collects rewards, and learns to maximize its overall return in the long term. In our scenario, Mario is the agent and the other game components (blocks, tubes, mushrooms, etc.) are the environment. 

The agent class, `DQNAgent`, captures Mario's behavior in the game environment. The agent should be able to 

- Decide which action to take next. This requires Mario to process the environment state and find the optimal action, the one that yields the highest return value. Refer to Optimal Action in the Cheatsheet. 

- Remember past experiences. Mario should be able to add the current experience to its memory. Later, it uses its previous experiences to learn to act smarter. 

- Learn to improve its actions over time. The decisions made by Mario should yield higher and higher returns as training proceeds. This requires Mario to update its decision process based on previous experiences. Refer to Q-learning in the RL Cheatsheet. 


In [0]:
class DQNAgent:
    def __init__(self, state_dim, action_dim, max_memory, double_q):
        pass

    def predict(self, state, model):
        """Given a state, predict Q values of all actions
        model is either 'online' or 'target'
        """
        pass

    def act(self, state):
        """Given a state, choose an epsilon-greedy action
        """
        pass

    def remember(self, experience):
        """Add the observation to memory
        """
        pass

    def learn(self):
        """Update online action value (Q) function with a batch of experiences
        """
        pass


Along the way we will create some helper methods that make the code more modular. Let's look at these functions individually.  

## Initialize Instance Variables [Steven]

Before implementing the core functions, we first need to declare some attributes (variables) that the agent will use to regulate its behavior. Examples include epsilon (the random exploration rate), the epsilon decay rate, and the future reward discount rate. Please refer to the cheatsheet for more details.
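To get a feel for how slowly the exploration rate decays, here is a rough back-of-the-envelope sketch. It assumes the values suggested in this tutorial (`eps` starting at 1.0 and `eps_decay` of 0.99999975) together with the floor `eps_min = 0.1` used in the completed agent at the end. `steps_until_floor` is a hypothetical helper for illustration, not part of the agent.

```python
import math

def steps_until_floor(eps_start, eps_decay, eps_min):
    """Number of act() calls before eps_start * eps_decay**n drops to eps_min.

    Hypothetical helper, for illustration only.
    """
    # Solve eps_start * eps_decay**n <= eps_min for n
    return math.ceil(math.log(eps_min / eps_start) / math.log(eps_decay))

n_steps = steps_until_floor(eps_start=1.0, eps_decay=0.99999975, eps_min=0.1)
print(f"eps reaches its floor after about {n_steps / 1e6:.1f} million steps")
```

With these settings the agent keeps a noticeable amount of random exploration for millions of steps, so a larger decay (a smaller `eps_decay`) may be worth trying for short experiments.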

In [0]:
from collections import deque

class DQNAgent:
    """
    eps (real number)
      Random exploration probability. With probability eps, the agent does not follow its policy (perform the best
      action); instead, it randomly chooses an action to explore the state space. This is very important in the
      early stage of learning: the agent does not have a good policy in the beginning, so it needs to try different
      actions to see which ones lead to better rewards. Random exploration also helps the agent avoid getting stuck
      in local optima. eps gradually decreases as the agent's policy becomes better and better.
      Please initialize it to 1.0.
    eps_decay (real number)
      Decay rate of eps. The agent explores the space vigorously in the early stage, but gradually reduces its
      exploration rate to maintain action quality. In the later stage, the agent has already learned a fairly good
      policy, so we want it to follow its policy more frequently. Multiply eps by eps_decay each time the agent acts.
      Please initialize it to 0.99999975
    gamma (real number)
      Future reward discount rate. gamma makes the agent give higher weight to short-term rewards than to future rewards.
      Please initialize it to 0.9
    batch_size (integer)
      Number of experiences used to update the neural network each time.
      Please initialize it to 32
    state_dim (tuple)
      The state is the observation of the current environment, which includes locations of obstacles, opponents, etc.
      The agent chooses the best action based entirely on the state. state_dim is the dimension of the state. In the
      Mario example, it is 4 consecutive snapshots of the environment stacked together, and each snapshot is an
      84*84 gray-scale picture, so state_dim = (4, 84, 84)
    Note: You can always try other combinations of parameters and test how the agent behaves.
    """
    def __init__(self, state_dim, action_dim, max_memory):
        # state space dimension
        self.state_dim = state_dim
        # action space dimension
        self.action_dim = action_dim
        # replay buffer
        self.memory = deque(maxlen=max_memory)
        # current step, updated every time the agent acts
        self.step = 0

        # TODO: Please declare other variables as described above

        pass
    

## Predict Q value [Yuansong]

The key function we are trying to learn here is the Q function, which is parameterized by a neural network. 

Instruction:
Implement the prediction function for both the online and the target Q function. 

Syntax: To call the forward function of a PyTorch model, we use the syntax `model(input)`, e.g. `pred_q_values = self.online_q(state_float)`.
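As a minimal sketch of that call syntax, the snippet below runs a batch of fake frames through a small network. `TinyQNet` is a hypothetical stand-in for the tutorial's `ConvNet` (whose architecture is not shown here), and the random `uint8` frames are made up; only the normalization by 255 and the `model(input)` call mirror the tutorial.

```python
import numpy as np
import torch
import torch.nn as nn

class TinyQNet(nn.Module):
    """Hypothetical stand-in for ConvNet: flattens the state, one linear layer."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        in_features = int(np.prod(state_dim))
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_features, action_dim))

    def forward(self, x):
        return self.net(x)

state_dim, action_dim = (4, 84, 84), 5
online_q = TinyQNet(state_dim, action_dim)

# Fake batch of 2 states with pixel values in [0, 255]
states = np.random.randint(0, 256, size=(2, *state_dim), dtype=np.uint8)

# np array -> float tensor, scaled to [0, 1]
state_float = torch.tensor(states).float() / 255.

# Calling the model invokes its forward function
pred_q_values = online_q(state_float)
print(pred_q_values.shape)
```

The output has one row per state in the batch and one column per action, which is the shape `predict` is expected to return.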

In [0]:
class DQNAgent:
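    def __init__(self, state_dim, action_dim, max_memory, double_q):
      # ... other attributes as declared in the previous section ...
      # assumed here: run on the GPU when one is available, otherwise on the CPU
      self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
      self.online_q = ConvNet(input_dim=state_dim, output_dim=action_dim).to(self.device)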

    def predict(self, state, model):
        """Given a state, predict Q values of all possible actions using specified model (either online or target)
        Input:
          state
           dimension of (batch_size * state_dim)
          model
           either 'online' or 'target'
        Output
          pred_q_values (torch.tensor)
            dimension of (batch_size * action_dim), predicted Q values for all possible actions given the 
            state
        """
        # LazyFrame -> np array -> torch tensor
        state_float = torch.FloatTensor(np.array(state)).to(self.device)
        # normalize
        state_float = state_float / 255.
        
        # TODO return the predicted Q values for online/target function

## Act [Steven]

The *act* function defines how Mario reacts to the current state (its observation of the environment). Given a state, the agent either chooses the optimal action according to its policy (the Q function) or, with probability epsilon, acts randomly regardless of the policy in order to explore the state space. Please refer to the Optimal Action section in the cheatsheet. To choose the policy action, we predict the Q values for all possible actions and pick the one with the highest Q value.

Small examples on NumPy indexing:
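A minimal sketch of the indexing patterns that are handy for this section and for the learning step later on. The Q values below are made up for illustration.

```python
import numpy as np

# Made-up Q values for a batch of 2 states and 3 actions
q_values = np.array([[-10.0, 10.0, 0.0],
                     [  1.0,  2.0, 3.0]])

# Index of the best action in each row
best_actions = np.argmax(q_values, axis=1)

# Q value of one specific action per row (row i, column actions[i])
actions = np.array([2, 0])
chosen_q = q_values[np.arange(len(actions)), actions]

# Add a leading batch dimension to a single state of shape state_dim
single_state = np.zeros((4, 84, 84))
batched_state = np.expand_dims(single_state, 0)

print(best_actions, chosen_q, batched_state.shape)
```

`np.expand_dims` is useful in `act`, where a single state has to be turned into a batch of size one before it is passed to `predict`.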
 

In [0]:
class DQNAgent:
    # def predict()
    
    def act(self, state):
        """Given a state, choose an epsilon-greedy action and update value of step
        Input
          state(np.array) 
            A single observation of the current state, dimension is (state_dim)
        Output
          action
            An integer representing which action agent will perform
        """
        # TODO choose action with epsilon-greedy policy
        if np.random.rand() < self.eps:
          # random action
          pass
        else:
          # policy action
          pass
          
        # decrease eps
        self.eps *= self.eps_decay
        self.eps = max(self.eps_min, self.eps)
        # increment step
        self.step += 1
        return action

## Remember [Steven]

Mario has a memory that stores many prior experiences, which it later uses to learn how to update its value predictions. In this function, add the new experience to the memory. 
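The memory is a `deque` with a maximum length, so old experiences are dropped automatically once it is full. A small sketch of that behavior, using a capacity of 3 and placeholder strings instead of real observations (both are assumptions for illustration):

```python
from collections import deque

# Tiny replay memory for illustration; the agent uses max_memory instead of 3
memory = deque(maxlen=3)

# Placeholder experiences: (state, next_state, action, reward, done)
for t in range(5):
    memory.append((f"s{t}", f"s{t + 1}", 0, 0.0, False))

# Only the 3 most recent experiences remain
oldest_state = memory[0][0]
print(len(memory), oldest_state)
```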

In [0]:
class DQNAgent:
    def remember(self, experience):
        """Add the observation to memory (deque)
        Input
          experience =  (state, next_state, action, reward, done) tuple
        Output
            None
        """
        # TODO Add the observation to memory

## Learn [Yuansong]

The learning process relies on the Q-learning algorithm in the RL Cheatsheet. Specifically, we observe that an estimate built from the observed reward plus the value of the next state-action pair is more accurate than a direct prediction of the current state-action value, so we move the prediction towards that estimate. 

There are some key steps to perform:
- experience sampling
- predicting online q values
- predicting target q values
- calculate loss
- update online q function

In [0]:
class DQNAgent:  
    def learn(self):
        """Update prediction action value (Q) function with a batch of experiences
        """
        # set up and check learning criterion 

        # sample a batch of experiences from self.memory
        state, next_state, action, reward, done = sample_batch(self.memory, self.batch_size)

        # calculate prediction Q values for the batch
        pred_q = calculate_prediction_q(state, action)

        # calculate target Q values for the batch
        target_q = calculate_target_q(next_state, reward)

        # calculate huber loss of target and prediction values
        loss = calculate_huber_loss(pred_q, target_q)
        
        # update the prediction (online) network
        update_prediction_q(loss, optimizer)


## Experience Sampling [Howard]

Mario learns by drawing past experiences from its memory. The memory is a queue data structure that stores each individual experience in the format of 

> state, next_state, action, reward, done

Examples of some experiences in Mario's memory:


- state: ![pic](https://drive.google.com/uc?id=1D34QpsmJSwHrdzszROt405ZwNY9LkTej)  next_state: ![pic](https://drive.google.com/uc?id=13j2TzRd1SGmFru9KJImZsY9DMCdqcr_J) action: jump reward: 0.0 done: False


- state: ![pic](https://drive.google.com/uc?id=1ByUKXf967Z6C9FBVtsn_QRnJTr9w-18v) next_state: ![pic](https://drive.google.com/uc?id=1hmGGVO1cS7N7kdcM99-K3Y2sxrAFd0Oh) action: right reward: -10.0 done: True


- state: ![pic](https://drive.google.com/uc?id=10MHERSI6lap79VcZfHtIzCS9qT45ksk-) next_state: ![pic](https://drive.google.com/uc?id=1VFNOwQHGAf9pH_56_w0uRO4WUJTIXG90) action: right reward: -10.0 done: True


- state: ![pic](https://drive.google.com/uc?id=1T6CAIMzNxeZlBTUdz3sB8t_GhDFbNdUO) next_state: ![pic](https://drive.google.com/uc?id=1aZlA0EnspQdcSQcVxuVmaqPW_7jT3lfW) action: jump_right reward: 0.0 done: False


- state: ![pic](https://drive.google.com/uc?id=1bPRnGRx2c1HJ_0y_EEOFL5GOG8sUBdIo) next_state: ![pic](https://drive.google.com/uc?id=1qtR4qCURBq57UCrmObM6A5-CH26NYaHv) action: right reward: 10.0 done: False

State and next_state are observations at timesteps *t* and *t+1* respectively. They are both of type `LazyFrame`, which allows us to optimize memory usage. To convert a `LazyFrame` to a NumPy array, use

```
state_np_array = np.array(state_lazy_frame)
```

Action is the move Mario makes that triggers the state transition. 

Reward is the feedback from the environment after the transition happens. 

Done indicates whether next_state is a terminal state, which means Mario is dead. A terminal state by definition has a return value of 0.


One might ask why we sample data points from all past experiences rather than only the most recent ones (for example, from the previous episode), which were generated by a more recently trained, more accurate policy. 

The intuition lies in the tradeoff between these two approaches:

Do we want to train on a small dataset of relatively high quality, or on a huge dataset of relatively lower quality? 

The answer is the latter: the more data we have, the more holistic and comprehensive our view of the overall behavior of the system, in our case the Mario game. A dataset of limited size risks overfitting and overlooking the bigger picture of the entire action/state space. 


Remember, reinforcement learning is all about exploring different scenarios (states) and continually improving through trial and error, generated from the interactions between the agent (action) and the environmental feedback (reward). 

## Instruction

Return a batch of experiences, grouped into separate (state, next_state, action, reward, done) collections. Convert each of them to a NumPy array. 
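One possible way to do this is sketched below. `sample_batch_sketch` is a hypothetical stand-alone function, and constant arrays stand in for the `LazyFrame` observations; treat it as an illustration of the regrouping idea rather than the required solution.

```python
import random
from collections import deque

import numpy as np

def sample_batch_sketch(memory, batch_size):
    """Hypothetical stand-alone version of sample_batch, for illustration."""
    # Draw batch_size experiences uniformly at random, without replacement
    batch = random.sample(list(memory), batch_size)
    # zip(*batch) regroups [(s, s', a, r, d), ...] into (all s, all s', ...)
    state, next_state, action, reward, done = map(np.array, zip(*batch))
    return state, next_state, action, reward, done

# Toy memory: constant arrays stand in for LazyFrame observations
state_dim = (4, 84, 84)
memory = deque(maxlen=100)
for _ in range(10):
    memory.append((np.zeros(state_dim), np.ones(state_dim), 1, 0.0, False))

state, next_state, action, reward, done = sample_batch_sketch(memory, batch_size=4)
print(state.shape, action.shape, done.dtype)
```

Each returned array has the batch as its first dimension, which matches the shapes described in the docstring of `sample_batch`.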

 

In [0]:
class DQNAgent:
  def sample_batch(memory, batch_size):
    """
    Input
      memory (FIFO queue)
        a queue where each entry has five elements as below
        state: LazyFrame of dimension (state_dim)
        next_state: LazyFrame of dimension (state_dim)
        action: integer, representing the action taken
        reward: float, the reward from state to next_state with action
        done: boolean, whether next_state is a terminal state
      batch_size (int)
        size of the batch to return 

    Output
      state, next_state, action, reward, done (tuple)
        a tuple of five elements: state, next_state, action, reward, done
        state: numpy array of dimension (batch_size x state_dim)
        next_state: numpy array of dimension (batch_size x state_dim)
        action: numpy array of dimension (batch_size)
        reward: numpy array of dimension (batch_size)
        done: numpy array of dimension (batch_size)
    """
    return (None, None, None, None, None)

## Predicted Q Value

The learning process relies on the Q-learning algorithm (refer to Q-learning in cheatsheet):

> Q_p(s, a) <- Q_p(s, a) + α(r + γ max Q_t(s', a') - Q_p(s,a))

where Q_p is the prediction value function, Q_t is the target value function, s and a are the current state and action, s' is the next state, a' is the best next action decided by Q_p and s' collectively. We use two separate neural networks to represent Q_p and Q_t. The neural networks learn to estimate state-action value (Q value) better over the learning process. All s, a and s' are retrieved from memory. 

The reason to have 2 value functions is to prevent divergence during the optimization. Q_p is used to make actual prediction of the current state-action value, while Q_t is used in conjunction with r to determine the target state-action value (refer to Temporal Difference Learning in Cheatsheet). In this section we make value prediction using Q_p. 
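The relationship between the two networks can be sketched with two tiny stand-in models. This is only an illustration, assuming `nn.Linear` layers in place of the tutorial's `ConvNet`; the copy step mirrors `sync_target_q` in the completed agent, and the loss is a dummy one.

```python
import torch
import torch.nn as nn

# Hypothetical tiny networks standing in for the online and target ConvNets
online_q = nn.Linear(4, 2)
target_q = nn.Linear(4, 2)

# Copy the online weights into the target network
target_q.load_state_dict(online_q.state_dict())

x = torch.rand(3, 4)
same_output = torch.equal(online_q(x), target_q(x))

# A training step on online_q leaves target_q unchanged until the next sync
optimizer = torch.optim.Adam(online_q.parameters(), lr=0.00025)
loss = online_q(x).sum()   # dummy loss, only to produce gradients
optimizer.zero_grad()
loss.backward()
optimizer.step()
diverged = not torch.equal(online_q(x), target_q(x))
print(same_output, diverged)
```

Because Q_t only changes when it is synced, the target that Q_p is trained towards stays fixed between syncs instead of moving with every update.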

Ideally we would pass both s and a to the Q_p function, which outputs the predicted value for the state-action pair. With 5 possible actions, this would mean passing a state-action pair through the Q_p neural network 5 times, which is very costly. To improve efficiency, we pass only the state to Q_p, which outputs the predicted Q values for all possible actions at once. For example:

Input

state (s): ![pic](https://drive.google.com/uc?id=1ByUKXf967Z6C9FBVtsn_QRnJTr9w-18v)

Output
- moving right (a_1): -10
- jumping up (a_2): 10
- jumping right (a_3): 0

This gives us 

```
Q_p(s, a_1) = -10
Q_p(s, a_2) = 10
Q_p(s, a_3) = 0
```

In our scenario, since the action is given (e.g. a_2), we can directly return the associated Q value, i.e. Q_p(s, a_2). 
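A minimal sketch of that selection step, using the three Q values from the example above. The zero-based action index (a_2 mapped to 1) is an assumption about how actions are encoded.

```python
import torch

# Q values from the example above: one state, three actions (a_1, a_2, a_3)
q_all = torch.tensor([[-10.0, 10.0, 0.0]])

# The action actually taken, as a zero-based index: a_2 -> 1
action = torch.tensor([1])

# gather picks q_all[i, action[i]] for every row i of the batch
pred_q = q_all.gather(1, action.unsqueeze(1)).squeeze(1)
print(pred_q)
```

The same pattern works unchanged for a whole batch: `q_all` gets one row per experience and `action` one entry per row.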

## Instruction

For a batch of experiences consisting of states (s) and actions (a), calculate the estimated value for each state-action pair Q(s, a). Return the results in `torch.tensor` format. 


In [0]:
class DQNAgent:
  # def predict()
  
  def calculate_prediction_q(state, action):
    """
    Input
      state (np.array)
        dimension is (batch_size x state_dim), each item is an observation 
        for the current state 
      action (np.array)
        dimension is (batch_size), each item is an integer representing the 
        action taken for current state 

    Output
      pred_q (torch.tensor)
        dimension of (batch_size), each item is a predicted Q value of the 
        current state-action pair 
    """
    return None

## Target Q Value

In this section we calculate the target Q value, in the form of

> r + γ max Q_t(s', a')

where r is the reward at transition from s to s',  γ  is the discounting factor, and s' is the next state. Because a' is not part of the actual experience (it is the predicted best action to take at next state), we will estimate it using our prediction value function Q_p by taking the argmax of Q_p(s', a') with respect to a'. 

> a' = argmax_a Q_p(s', a)

Target Q value, in comparison to prediction Q value Q_p(s, a), gives a better estimate of the current state-action value. We want to update the predicted Q value Q_p(s, a) towards target Q value, r + γ max Q_t(s', a'). 

In this section we calculate the target Q value of current state-action. 

## Instruction

For a batch of experiences consisting of next_states (s') and rewards (r), calculate the target Q value using above mentioned equation. Note that a' is not explicitly given, so we will need to first obtain that using prediction value function Q_p. 

Return the results in `torch.tensor` format. 
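The arithmetic can be sketched with plain NumPy arrays before wiring in the networks. All Q values, rewards, and flags below are made up; the `(1 - done)` mask is taken from the completed agent at the end of this tutorial and reflects that a terminal state has a return value of 0.

```python
import numpy as np

gamma = 0.9

# Made-up network outputs for a batch of 2 next states and 3 actions
q_p_next = np.array([[1.0, 5.0, 2.0],    # Q_p(s', .) from the prediction network
                     [0.5, 0.1, 0.3]])
q_t_next = np.array([[2.0, 4.0, 6.0],    # Q_t(s', .) from the target network
                     [1.0, 3.0, 2.0]])
reward = np.array([1.0, -10.0])
done = np.array([False, True])

# a' = argmax_a Q_p(s', a)
best_next_action = np.argmax(q_p_next, axis=1)

# Q_t(s', a') for each row
next_value = q_t_next[np.arange(len(reward)), best_next_action]

# r + gamma * Q_t(s', a'); terminal transitions keep only the reward
target_q = reward + (1.0 - done) * gamma * next_value
print(target_q)
```

In the agent itself the two Q arrays would come from `predict` with the online and target models, and the result would be a `torch.tensor` rather than a NumPy array.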

In [0]:
class DQNAgent:
  # def predict()
  
  def calculate_target_q(next_state, reward):
    """
    Input
      next_state (np.array)
        dimension is (batch_size x state_dim), each item is an observation 
        for the next state 
      reward (np.array)
        dimension is (batch_size), each item is a float representing the 
        reward collected from (state -> next state) transition 

    Output
      target_q (torch.tensor)
        dimension of (batch_size), each item is a target Q value of the current
        state-action pair, calculated based on reward collected and 
        estimated Q value for next state
    """
    return None

## Loss between Prediction and Target Q Value

To improve our value estimation, we would like our predicted Q value to be as close to the target Q value as possible. In other words, we want to minimize the distance between Q_p(s, a) and r + γ max Q_t(s', a'). To do this, we calculate the *Huber loss* between the two values and use this loss to update Q_p, the prediction value function. 


![pic](https://drive.google.com/uc?id=1FZM7sBnMgY5GQNTx-o3LtLRLQQM0mwat)


Huber loss is a smoothed version of L1 loss. The graph above gives some intuition behind L1 vs. L2 vs. Huber loss. L1 loss has a kink at the origin and penalizes even small differences between predicted and target values at a constant rate. L2 loss, on the other hand, explodes quickly when the difference between predicted and target values is large. Huber loss conveniently avoids both issues: it is quadratic for small differences and linear for large ones. 
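To make the comparison concrete, here is a small NumPy sketch that evaluates the three losses element-wise on a few made-up differences. The `huber` function is a hand-written illustration with threshold `delta = 1.0`, not the PyTorch implementation.

```python
import numpy as np

def huber(diff, delta=1.0):
    """Huber loss per element: quadratic for |diff| <= delta, linear beyond."""
    abs_diff = np.abs(diff)
    quadratic = 0.5 * diff ** 2
    linear = delta * (abs_diff - 0.5 * delta)
    return np.where(abs_diff <= delta, quadratic, linear)

diff = np.array([0.1, 1.0, 10.0])   # predicted minus target
l1 = np.abs(diff)
l2 = diff ** 2
hub = huber(diff)
for d, a, b, c in zip(diff, l1, l2, hub):
    print(f"diff={d:5.1f}  L1={a:6.2f}  L2={b:7.2f}  Huber={c:6.3f}")
```

For the large difference, L2 grows to 100 while Huber stays close to L1; for the small difference, Huber is much smaller than L1.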

Hint: the Huber loss can be computed like this
```
loss = nn.functional.smooth_l1_loss(input, target)
```

## Instruction

Given predicted and target Q values for the batch of experiences, return the sum of huber loss. 


In [0]:
class DQNAgent:
  def calculate_huber_loss(pred_q, target_q):
    """
    Input
      pred_q (torch.tensor)
        dimension is (batch_size), each item is a predicted Q value of the 
        current state-action pair 
      target_q (torch.tensor)
        dimension is (batch_size), each item is a target Q value of the 
        current state-action pair 

    Output
      loss (torch.tensor)
        a single value representing the Huber loss of pred_q and target_q
    """
    return None

## Learn

With all the helper methods implemented, let's revisit our learn function. 

We also need to set up the learning process and check a few criteria before each update. This logic has been added for you. 
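To see what these checks do in practice, the sketch below replays them for a range of step counts. The settings are copied from the completed agent (`burnin = 1e2`, `learn_every = 3`, `copy_every = 1e4`); `should_learn` is a hypothetical helper that only mirrors the early-return logic.

```python
# Settings copied from the completed agent
burnin, learn_every, copy_every = 1e2, 3, 1e4

def should_learn(step):
    """Hypothetical helper mirroring the early-return checks in learn()."""
    if step < burnin:
        return False          # still collecting experiences
    return step % learn_every == 0

# Steps at which an update of the online network would happen
learn_steps = [step for step in range(1, 121) if should_learn(step)]
# Steps at which the target network would be synced
sync_steps = [step for step in range(1, 30001) if step % copy_every == 0]
print(learn_steps[:3], sync_steps)
```

No update happens during the first 100 steps; after that the online network is updated on every third step, while the target network is synced only every 10,000 steps.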

In [0]:
class DQNAgent:
    def learn(self):
        """Update prediction action value (Q) function with a batch of experiences
        """
        # sync target network
        if self.step % self.copy_every == 0:
            self.sync_target_q()
        # checkpoint model
        if self.step % self.save_every == 0:
            self.save_model()
        # break if burn-in
        if self.step < self.burnin:
            return
        # break if no training
        if self.step % self.learn_every != 0:
            return

        # sample a batch of experiences from self.memory
        state, next_state, action, reward, done = sample_batch(self.memory, self.batch_size)

        # calculate prediction Q values for the batch
        pred_q = calculate_prediction_q(state, action)

        # calculate target Q values for the batch
        target_q = calculate_target_q(next_state, reward)

        # calculate huber loss of target and prediction values
        loss = calculate_huber_loss(pred_q, target_q)
        
        # update the prediction (online) network
        update_prediction_q(loss, optimizer)


## Completed agent

We now have all the key functionalities implemented. There are some additional helper methods that would be useful. For example:
- to be able to save the agent

We implemented these methods for you, and here is your completed agent file. 

In [0]:
from collections import deque
import torch
import torch.nn as nn
import numpy as np
import random
from neural import ConvNet
import pdb

class DQNAgent:
    def __init__(self, state_dim, action_dim, max_memory, double_q):
        # state space dimension
        self.state_dim = state_dim
        # action space dimension
        self.action_dim = action_dim
        # replay buffer
        self.memory = deque(maxlen=max_memory)
        # if double_q, use best action from online_q for next state q value
        self.double_q = double_q
        # future reward discount rate
        self.gamma = 0.9
        # initial epsilon(random exploration rate)
        self.eps = 1
        # final epsilon
        self.eps_min = 0.1
        # epsilon decay rate
        self.eps_decay = 0.99999975
        # current step, updated everytime the agent acts
        self.step = 0
        # number of experiences between updating online q
        self.learn_every = 3
        # number of experiences to collect before training
        # self.burnin = 1e5
        self.burnin = 1e2
        # number of experiences between updating target q with online q
        self.copy_every = 1e4
        # number of experiences between saving the current agent
        self.save_every = 5e5

        # batch size used to update online q
        self.batch_size = 32
        # online action value function, Q(s, a)
        self.online_q = ConvNet(input_dim=state_dim, output_dim=action_dim)
        # target action value function, Q'(s, a)
        self.target_q = ConvNet(input_dim=state_dim, output_dim=action_dim)
        # optimizer
        self.optimizer = torch.optim.Adam(self.online_q.parameters(), lr=0.00025)

    def predict(self, state, model):
        """Given a state, predict Q values of all actions
        model is either 'online' or 'target'
        """
        state_float = torch.tensor(np.array(state)).float() / 255.
        if model == 'online':
            return self.online_q(state_float)
        if model == 'target':
            return self.target_q(state_float)

    def act(self, state):
        """Given a state, choose an epsilon-greedy action
        """
        if np.random.rand() < self.eps:
            # random action
            action = np.random.randint(low=0, high=self.action_dim)
        else:
            # policy action
            q = self.predict(np.expand_dims(state, 0), model='online')
            action = torch.max(q, axis=1)[1].item()
        # decrease eps
        self.eps *= self.eps_decay
        self.eps = max(self.eps_min, self.eps)
        # increment step
        self.step += 1
        return action

    def remember(self, experience):
        """Add the observation to memory
        """
        self.memory.append(experience)

    def learn(self):
        """Update online action value (Q) function with a batch of experiences
        """
        # sync target network
        if self.step % self.copy_every == 0:
            self.sync_target_q()
        # checkpoint model
        if self.step % self.save_every == 0:
            self.save_model()
        # break if burn-in
        if self.step < self.burnin:
            return
        # break if no training
        if self.step % self.learn_every != 0:
            return
        # sample batch
        batch = random.sample(self.memory, self.batch_size)
        state, next_state, action, reward, done = map(np.array, zip(*batch))
        # reward and (1 - done) mask as float32 tensors, matching the network output dtype
        reward = torch.tensor(reward).float()
        not_done = torch.tensor(1. - done).float()
        # get next q values from target_q; detach so the target is treated as a constant
        next_q = self.predict(next_state, 'target').detach()
        # calculate discounted future reward
        if self.double_q:
            q = self.predict(next_state, 'online')
            q_idx = torch.max(q, axis=1)[1]
            target_q = reward + not_done * self.gamma * next_q[np.arange(0, self.batch_size), q_idx]
        else:
            target_q = reward + not_done * self.gamma * torch.max(next_q, axis=1)[0]
        # get predicted q values from online_q and actions taken
        curr_q = self.predict(state, 'online')
        pred_q = curr_q[np.arange(0, self.batch_size), action]
        # huber loss
        loss = nn.functional.smooth_l1_loss(pred_q, target_q)
        # update online_q
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        # TODO: log training metrics


    def save_model(self):
        """Save the current agent
        """
        return

    def sync_target_q(self):
        """Update target action value (Q) function with online action value (Q) function
        """
        self.target_q.load_state_dict(self.online_q.state_dict())