<!--
authors: Matthew Wilson, Daniele Reda
created: 2020/01/14
last_updated: 2023/02/08
-->


## CPSC 533V: Assignment 3 - Tabular Q Learning and DQN (Due Thu Feb 22)

---

#  Part 1 [54 pts] Tabular Q-Learning 

Tabular Q-learning is an RL algorithm for problems with discrete states and discrete actions. The algorithm is described in the class notes, which borrows the summary description from [Section 6.5](http://incompleteideas.net/book/RLbook2018.pdf#page=153) of Richard Sutton's RL book. In the tabular approach, the Q-value is represented as a lookup table. As discussed in class, Q-learning can further be extended to continuous states and discrete actions, leading to the [Atari DQN](https://arxiv.org/abs/1312.5602) / Deep Q-learning algorithm.  However, it is important and informative to first fully understand tabular Q-learning.

Informally, Q-learning works as follows: The goal is to learn the optimal Q-function: 
`Q(s,a)`, which is the *value* of being at state `s` and taking action `a`.  Q tells you how well you expect to do, on average, from here on out, given that you act optimally.  Once the Q function is learned, choosing an optimal action is as simple as looping over all possible actions and choosing the one with the highest Q (optimal action $a^* = \text{argmax}_a Q(s,a)$).  To learn Q, we initialize it arbitrarily and then iteratively refine it using the Bellman backup equation for Q functions, namely: 
$Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \text{max}_a Q(s', a) - Q(s,a)]$.
Here, $r$ is the reward associated with with the transition from state s to s', and $\alpha$ is a learning rate.

In the first part of assignment you will implement tabular Q-learning and apply it to CartPole -- an environment with a **continuous** state space.  To apply the tabular method, you will need to discretize the CartPole state space by dividing the state-space into bins.


**Goals:**
- to become familiar with python/numpy, as well as using an OpenAI Gym environment
- to understand tabular Q-learning, by implementing tabular Q-Learning for 
  a discretized version of a continuous-state environment, and experimenting with the implementation
- (optional) to develop further intuition regarding possible variations of the algorithm

## Introduction
Deep reinforcement learning has generated impressive results for board games ([Go][go], [Chess/Shogi][chess]), video games ([Atari][atari], [DOTA2][dota], [StarCraft II][scii]), [and][baoding] [robotic][rubix] [control][anymal] ([of][cassie] [course][mimic] ;)).  RL is beginning to work for an increasing range of tasks and capabilities.  At the same time, there are many [gaping holes][irpan] and [difficulties][amid] in applying these methods. Understanding deep RL is important if you wish to have a good grasp of the modern landscape of control methods.

These next several assignments are designed to get you started with deep reinforcement learning, to give you a more close and personal understanding of the methods, and to provide you with a good starting point from which you can branch out into topics of interest. You will implement basic versions of some of the important fundamental algorithms in this space, including Q-learning and policy gradient/search methods.

We will only have time to cover a subset of methods and ideas in this space.
If you want to dig deeper, we suggest following the links given on the course webpage.  Additionally we draw special attention to the [Sutton book](http://incompleteideas.net/book/RLbook2018.pdf) for RL fundamentals and in depth coverage, and OpenAI's [Spinning Up resources](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html) for a concise intro to RL and deep RL concepts, as well as good comparisons and implementations of modern deep RL algorithms.


[atari]: https://arxiv.org/abs/1312.5602
[go]: https://deepmind.com/research/case-studies/alphago-the-story-so-far
[chess]:https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go 
[dota]: https://openai.com/blog/openai-five/
[scii]: https://deepmind.com/blog/article/AlphaStar-Grandmaster-level-in-StarCraft-II-using-multi-agent-reinforcement-learning
[baoding]: https://bair.berkeley.edu/blog/2019/09/30/deep-dynamics/
[rubix]: https://openai.com/blog/solving-rubiks-cube/
[cassie]: https://www.cs.ubc.ca/~van/papers/2019-CORL-cassie/index.html
[mimic]: https://www.cs.ubc.ca/~van/papers/2018-TOG-deepMimic/index.html
[anymal]: https://arxiv.org/abs/1901.08652


[irpan]: https://www.alexirpan.com/2018/02/14/rl-hard.html
[amid]: http://amid.fish/reproducing-deep-rl



In [1]:
# # uncomment if necesary
# !pip install numpy
# !pip install gym
# # OR:
# !pip install gymnasium
import time
import itertools
import numpy as np
# import gym
import gymnasium as gym
import tqdm

---

## [12 pts] Explore the CartPole environment 

Your first task is to familiarize yourself with the OpenAI gym interface and the [CartPole environment](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/cartpole.py)
by writing a simple hand-coded policy to try to solve it.  
To begin understanding OpenAI Gym environments, [read this first](https://gymnasium.farama.org/api/env/).) 
The gym interface is very popular and you will see many algorithm implementations and 
custom environments that support it.  You may even want to use the API in your course projects, 
to define a custom environment for a task you want to solve.

Note that there were several breaking changes introduced in the past few years to the gym API. Some reference algorithm implementations online might be using the old version:
- `obs = env.reset()` ->  `obs, info = env.reset()`
- `obs, reward, done, info = env.step(action)` to `obs, reward, terminated, truncated, info = env.step(action)`
- `env.render()` no longer accepts the `render_mode` parameter (e.g. human mode where the environment is rendered in a pop-up window, or rgb_array which allows headless conversion to images or videos)


Below is some example code that runs a simple random policy.  You are to:
- **run the code to see what it does**
- **write code that chooses an action based on the observation**.  You will need to learn about the gym API and to read the CartPole documentation to figure out what the `action` and `obs` vectors mean for this environment. 
Your hand-coded policy can be arbitrary, and it should ideally do better than the random policy.  There is no single correct answer. The goal is to become familiar with `env`s.
- **write code to print out the total reward gained by your policy in a single episode run**
- **answer the short-response questions below** (see the TODOs for all of this)

In [2]:
env = gym.make('CartPole-v1', render_mode="human")  # you can also try LunarLander-v2, but make sure to change it back
print('observation space:', env.observation_space)
print('action space:', env.action_space)

# To find out what the observations mean, read the CartPole documentation.
# Uncomment the lines below, or visit the source file: 
# https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/cartpole.py

#cartpole = env.unwrapped
#cartpole?

observation space: Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
action space: Discrete(2)


In [3]:
# 1.1 [10pts]

# runs a single episode and render it.  try running this before editing anything
obs, info = env.reset()  # get first obs/state
rewards = 0
while True:
    # TODO: replace this `action` with something that depends on `obs` 
    if obs[2] > 0:
        action = 1
    else:
        action = 0
    # action = env.action_space.sample()  # random action
    
    obs, reward, terminated, truncated, info = env.step(action)
    rewards += reward
    env.render()
    time.sleep(0.1)  # so it doesn't render too quickly
    if terminated | truncated: break
env.close()

# TODO: print out your total sum of rewards here
print(rewards)

51.0


To answer the questions below, look at the full [source code here](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/cartpole.py) if you haven't already.

**1.2. [2pts] Briefly describe your policy.  What observation information does it use?  What score did you achieve (rough maximum and average)?  And how does it compare to the random policy?**

I used the pole angle. If pole angle > 0, means pole leaning right, so push right to let it maintain balance. Vice versa. Maximum reward is 53, average 44. It's higher than the random policy (reward 9).

---

##  [12 pts] Discretize the env

Next, we need to discretize CartPole's continuous state space to work for tabular Q-learning.  While this is in part  a contrived usage of tabular methods, given the existence of other approaches that are designed to cope with continuous state-spaces, it is also interesting to consider whether tabular methods can be adapted more directly via discretization of the state into bins. Furthermore, tabular methods are simple, interpretabile, and can be proved to converge, and thus they still remain relevant.

Your task is to discretize the state/observation space so that it is compatible with tabular Q-learning.  To do this:
- **implement `obs_normalizer` to pass its test**
- **implement `get_bins` to pass its test**
- **then answer question 2.3**

[map]: https://arxiv.org/abs/1504.04909
[qd]: https://quality-diversity.github.io/

In [4]:
env = gym.make('CartPole-v1')

In [5]:
# 2.1 [5 pts for passing test_normed]
def obs_normalizer(obs):
    """Normalize the observations between 0 and 1
    
    If the observation has extremely large bounds, then clip to a reasonable range before normalizing; 
    (-2,2) should work.  (It is ok if the solution is specific to CartPole)
    
    Args:
        obs (np.ndarray): shape (4,) containing an observation from CartPole using the bound of the env
    Returns:
        normed (np.ndarray): shape (4,) where all elements are roughly uniformly mapped to the range [0, 1]
    
    """
    # HINT: check out env.observation_space.high, env.observation_space.low
    
    # TODO: implement this function
    result = np.zeros_like(obs)
    result[0] = (obs[0] / (2*4.8000002)) + 0.5
    if obs[1] > 2:
        result[1] = 2
    elif obs[1] < -2:
        result[1] = -2
    else:
        result[1] = obs[1]
    result[1] = result[1] / 4 + 0.5
    result[2] = (obs[1] / (2*4.1887903)) + 0.5
    if obs[3] > 2:
        result[3] = 2
    elif obs[3] < -2:
        result[3] = -2
    else:
        result[3] = obs[3]
    result[3] = result[3] / 4 + 0.5
    return result

In [6]:
### TEST 2.1
def test_normed():
    obs, info = env.reset()
    while True:
        obs, _, terminated, truncated, _ =  env.step(env.action_space.sample())
        normed = obs_normalizer(obs) 
        assert np.all(normed >= 0.0) and np.all(normed <= 1.0), '{} are outside of (0,1)'.format(normed)
        if terminated | truncated: break
    env.close()
    print('Passed!')
test_normed()

Passed!


In [7]:
# 2.2 [5 pts for passing test_binned]
def get_bins(normed, num_bins):
    """Map normalized observations (0,1) to bin index values (0,num_bins-1)
    
    Args:
        normed (np.ndarray): shape (4,) output from obs_normalizer
        num_bins (int): how many bins to use
    Returns:
        binned (np.ndarray of type np.int32): shape (4,) where all elements are values in range [0,num_bins-1]
    
    """
    bin_size = 1 / num_bins

    # handle corner case
    normed[normed == 1] -= (0.1 * bin_size)
    return (normed // bin_size).astype(np.int32)

In [8]:
### TEST 2.2
obs, info = env.reset()

def test_binned(num_bins):
    normed = np.array([0.0, 0.2, 0.8, 1.0])
    binned = get_bins(normed, num_bins)
    assert np.all(binned >= 0) and np.all(binned < num_bins), '{} supposed to be between (0, {})'.format(binned, num_bins-1)
    assert binned.dtype == np.int32, "You should also make sure to cast your answer to int using arr.astype(np.int32)" 
    
test_binned(5)
test_binned(10)
test_binned(50)
print('Passed!')

Passed!


**2.3. [2 pts] If your state has 4 values and each is binned into N possible bins, how many bins are needed to represent all unique possible states?**



N<sup>4</sup>

---

## [20 pts] Solve the env 

Using the pseudocode below and the functions you implemented above, implement tabular Q-learning and use it to solve CartPole.

We provide setup code to initialize the Q-table and give examples of interfacing with it. Write the inner and outer loops to train your algorithm.  These training loops will be similar to those deep RL approaches, so get used to writing them!

The algorithm (excerpted from Section 6.5 of [Sutton's book](http://incompleteideas.net/book/RLbook2018.pdf)) is given below:

![Sutton RL](https://i.imgur.com/mdcWVRL.png)

in summary:
- **implement Q-learning using this pseudocode and the helper code**
- **answer the questions below**
- **run the suggested experiments and otherwise experiment with whatever interests you**

In [11]:
env = gym.make('CartPole-v1')
# setup (see last few lines for how to use the Q-table)

# hyper parameters. feel free to change these as desired and experiment with different values
num_bins = 30
alpha = 0.5
gamma = 0.99
log_n = 1000
# epsilon greedy
eps = 0.05  #usage: action = optimal if np.random.rand() > eps else random

obs, info = env.reset()

# Q-table initialized to zeros.  first 4 dims are state, last dim is for action (0,1) for left,right.
Q = np.zeros([num_bins]*len(obs)+[env.action_space.n])

# helper function to convert observation into a binned state so we can index into our Q-table
obs2bin = lambda obs: tuple(get_bins(obs_normalizer(obs), num_bins=num_bins))

s = obs2bin(obs)

print('Shape of Q Table: ', Q.shape) # you can imagine why tabular learning does not scale very well
print('Original obs {} --> binned {}'.format(obs, s))
print('Value of Q Table at that obs/state value', Q[s])

Shape of Q Table:  (30, 30, 30, 30, 2)
Original obs [-0.025777    0.00829503  0.02939578 -0.02648045] --> binned (14, 15, 15, 14)
Value of Q Table at that obs/state value [0. 0.]


In [12]:
# 3.1 [20 pts]

# TODO: implement Q learning, following the pseudo-code above. 
#     - you can follow it almost exactly, but translating things for the gym api and our code used above
#     - make sure to use e-greedy, where e = random about 0.05 percent of the time
#     - make sure to do the S <-- S' step because it can be easy to forget
#     - every log_n steps, you should render your environment and
#       print out the average total episode rewards of the past log_n runs to monitor how your agent trains
#      (your implementation should be able to break at least +150 average reward value, and you can use that 
#       as a breaking condition.  It make take several minutes to run depending on your computer.)

def get_action(state: tuple[int, int, int, int]):
    """Get action based on epsilon greedy"""
    if np.random.random() < eps:
        return env.action_space.sample()
    else:
        return int(np.argmax(Q[state]))
    
def update(
        state: tuple[int, int, int, int],
        action: int,
        reward: float,
        terminated: bool,
        next_state: tuple[int, int, int, int],
    ):
        """Updates the Q-value of an action."""
        future_q_value = (not terminated) * np.max(Q[next_state])
        temporal_difference = (
            reward + gamma * future_q_value - Q[state][action]
        )

        Q[state][action] = (
            Q[state][action] + alpha * temporal_difference
        )


rewards = 0
num_episodes = 0
while True:
    obs, info = env.reset()
    done = False
    num_episodes += 1

    while not done:
        s = obs2bin(obs)
        action = get_action(s)
        next_obs, reward, terminated, truncated, info = env.step(action)
        next_s = obs2bin(next_obs)

        update(s, action, reward, terminated, next_s)
        rewards += reward
        done = terminated or truncated
        obs = next_obs

    if num_episodes % log_n == 0:
        print("The current reward is {}.".format(rewards/num_episodes))
        # env.render()

    if rewards / num_episodes > 150.5:
        break

env.close()

The current reward is 41.445.
The current reward is 57.4995.
The current reward is 68.727.
The current reward is 72.3615.
The current reward is 76.0974.
The current reward is 78.33683333333333.
The current reward is 81.145.
The current reward is 84.83275.
The current reward is 86.13066666666667.
The current reward is 87.7766.
The current reward is 90.81.
The current reward is 91.914.
The current reward is 93.97061538461539.
The current reward is 96.88914285714286.
The current reward is 99.11893333333333.
The current reward is 100.1260625.
The current reward is 101.66235294117647.
The current reward is 103.5195.
The current reward is 104.56005263157894.
The current reward is 105.53015.
The current reward is 106.46419047619048.
The current reward is 107.71972727272727.
The current reward is 108.79086956521739.
The current reward is 109.67795833333334.
The current reward is 110.04208.
The current reward is 111.09026923076924.
The current reward is 111.77588888888889.
The current reward is

## [10 pts] Experiments

Given a working algorithm, you will run a few experiments.  Either make a copy of your code above to modify, or make the modifications in a way that they can be commented out or switched between (with boolean flag if statements).

**4.2. [5 pts] $\epsilon$-greedy.**  How sensitive are the results to the value of $\epsilon$?   First, write down your prediction of what would happen if $\epsilon$ is set to various values, including for example [0, 0.05, 0.25, 0.5].

For $\epsilon=0$, there will be no increasing of reward as there is no exploration and Q is set to all 0s. The policy will choose a fixed action every time, which is suboptimal.

For $\epsilon=0.05$, there will be some exploration. The training will proceed. Reward will increase.

As $\epsilon$ increases, the exploration increases during training. The training speed at first might be slower because of the exploration, but later when it is closer to converge, the policy will provide more accurate actions. However, too high of $\epsilon$ can result in instability of training.

Now run the experiment and observe the impact on the algorithm.  Report the results below.

As expected, no exploration will result in a suboptimal policy. The average reward will stuck at 9.4.

For $\epsilon=0.05$ the rewards accumulate a lot faster at the beginning compared to higher $\epsilon$. This is expected as well since higher $\epsilon$ is doing more exploration than exploitation.

On a 30mins training, 0.05 session reaches avg reward of 150. 0.25 session only reaches 102.

When $\epsilon$ is too high, i.e. 0.25 or 0.5, it's really hard for the policy to converge.

0.01 - 0.1 should be a reasonable range.

**4.3. [5 pts] Design your own experiment.** Design a modification that you think would either increase or reduce performance.  A simple example (which you can use) is initializing the Q-table differently, and thinking about how this might alter performance. Write down your idea, what you think might happen, and why.

My modification is to increase the number of bins. The discretization could impact the result quite a lot. If the number of bin is too low, there might not be enough discretized states to represent the continuous states, i.e. the accuracy of the discretization is low. If the number of bins are too high, the training cost (memory and speed) will increase tremendously due to the exponential increasing of the Q table size. This will also harm since many of the states will result in similar actions so too many states will not increase the performance.

Personally, I think 10 bins didn't perform well on my machine so I decided to try out 20, 30, 40.

Run the experiment and report the results.

It turns out #bins = 30 gives the best performance (speed and avg total reward).

10 somehow stuck at around 135 and increases very slowly after that.

40 trains too slow for the entire time.

---

## A. Extensions (fully optional, will not be graded, if you have time after Part 2)

- plots your learning curve, using e.g., matplotlib 
- visualize the Q-table to see which values are being updated and not
- design a better binning strategy that uses fewer bins for a better-performing policy
- extend this approach to work on different environments (e.g., LunarLander-v2)
- extend this approach to work on environments with continuous actions, by using a fixed set of discrete samples of the action space.  e.g., for Pendulum-v0
- implement a simple deep learning version of this.  we will see next part that DQN uses some tricks to make the neural network training more stable.  Experiment directly with simply replacing the Q-table with a Q-Network and train the Q-Network using gradient descent with `loss = (targets - Q(s,a))**2`, where `targets = stop_grad(R + gamma * maxa(Q(s,a))`).

# Part 2 [60 pts] Behavioral Cloning and Deep Q Learning

---
The second part of assignment will help you transition from tabular approaches to deep neural network approaches. You will implement the [Atari DQN / Deep Q-Learning](https://arxiv.org/abs/1312.5602) algorithm, which arguably kicked off the modern Deep Reinforcement Learning craze.

In this part we will use PyTorch as our deep learning framework.  To familiarize yourself with PyTorch, your first task is to use a behavior cloning (BC) approach to learn a policy.  Behavior cloning is a supervised learning method in which there exists a dataset of expert demonstrations (state-action pairs) and the goal is to learn a policy $\pi$ that mimics this expert.  At any given state, your policy should choose the same action the export would.

Since BC avoids the need to collect data from the policy you are trying to learn, it is relatively simple. 
This makes it a nice stepping stone for implementing DQN. Furthermore, BC is relevant to modern approaches---for example its use as an initialization for systems like [AlphaGo][go] and [AlphaStar][star], which then use RL to further adapte the BC result.  

<!--

I feel like this might be better suited to going lower in the document:

Unfortunately, in many tasks it is impossible to collect good expert demonstrations, making

it's not always possible to have good expert demonstrations for a task in an environemnt and this is where reinforcement learning comes handy. Through the reward signal retrieved by interacting with the environment, the agent learns by itself what is a good policy and can learn to outperform the experts.

-->

Goals:
- Famliarize yourself with PyTorch and its API including models, datasets, dataloaders
- Implement a supervised learning approach (behavioral cloning) to learn a policy.
- Implement the DQN objective and learn a policy through environment interaction.

[go]:  https://deepmind.com/research/case-studies/alphago-the-story-so-far
[star]: https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii

## Submission information

- Complete by editing and executing the associated Python files.
- Copy and paste the code and the terminal output requested in the predefined cells on this Jupyter notebook.
- When done, upload the completed Jupyter notebook (ipynb file) on canvas.

## Preliminaries

### PyTorch

If you have never used PyTorch before, we recommend you follow this [60 Minutes Blitz][blitz] tutorial from the official website. It should give you enough context to be able to complete the assignment.


**If you have issues, post questions to Piazza**

### Installation

To install all required python packages:

```
python3 -m pip install -r requirements.txt
```

### Debugging


You can include:  `import ipdb; ipdb.set_trace()` in your code and it will drop you to that point in the code, where you can interact with variables and test out expressions.  We recommend this as an effective method to debug the algorithms.


[blitz]: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html

## 1. [36 pts] Behavioral Cloning

Behavioral Cloning is a type of supervised learning in which you are given a dataset of expert demonstrations tuple $(s, a)$ and the goal is to learn a policy function $\hat a = \pi(s)$, such that $\hat a = a$.

The optimization objective is $\min_\theta D(\pi(s), a)$ where $\theta$ are the parameters the policy $\pi$, in our case the weights of a neural network, and where $D$ represents some difference between the actions.

---

Before starting, we suggest reading through the provided files.

For Behavioral Cloning, the important files to understand are: `model.py`, `dataset.py` and `bc.py`.

- The file `model.py` has the skeleton for the model (which you will have to complete in the following questions),

- The file `dataset.py` has the skeleton for the dataset the model is being trained with,

- and, `bc.py` will have all the structure for training the model with the dataset.


### [10 pts] 1.1 Dataset

We provide a pickle file with pre-collected expert demonstrations on CartPole from which to learn the policy $\pi$. The data has been collected from an expert policy on the environment, with the addition of a small amount of gaussian noise to the actions.

The pickle file contains a list of tuples of states and actions in `numpy` in the following way:

```
[(state s, action a), (state s, action a), (state s, action a), ...]
```

In the `dataset.py` file, we provide skeleton code for creating a custom dataset. The provided code shows how to load the file.

Your goal is to overwrite the `__getitem__` function in order to return a dictionary of tensors of the correct type.

Hint: Look in the `bc.py` file to understand how the dataset is used.

Answer the following questions:

**[6 pts]** Insert your code in the placeholder below.

In [None]:
# PLACEHOLDER TO INSERT YOUR __getitem__ method here
def __getitem__(self, index):
    item = self.data[index]
    return {"state": item[0], "action": item[1]}

In [1]:
from dataset import Dataset

myDataset = Dataset("./CartPole-v1_dataset.pkl")


# def getDimensions(self):
#         states = []
#         actions = []
#         for d in self.data:
#             states.append(d[0])
#             actions.append(d[1])
        
#         states = np.array(states)
#         actions = np.array(actions)
#         return len(states), states.shape, actions.shape, np.max(states, axis=0, keepdims=True), np.min(states, axis=0, keepdims=True), np.max(actions,axis=0,keepdims=True), np.min(actions,axis=0,keepdims=True)


print(myDataset.getDimensions())

(99660, (99660, 4), (99660,), array([[2.39948596, 1.84697975, 0.14641718, 0.47143314]]), array([[-0.72267057, -0.43303689, -0.05007198, -0.38122098]]), array([1], dtype=int64), array([0], dtype=int64))


**[2 pt]** How big is the dataset provided?

99660

**[2 pts]** What is the dimensionality of $s$ and what range does each dimension of $s$ span?  I.e., how much of the state space does the expert data cover? What are the dimensionalities and ranges of the action $a$ in the dataset (how much of the action space does the expert data cover)?

From the previous cell, we can see for the states

- 4 dimension
- each value in the state spans [-0.72267057, -0.43303689, -0.05007198, -0.38122098] to [2.39948596, 1.84697975, 0.14641718, 0.47143314]

for the actions

- 1 dimension
- spans 0 to 1 (both actions)




### [5 pts] 1.2 Environment

Recall the state and action space of CartPole, from the previous assignment.

Considering the full state and action spaces, do you think the provided expert dataset has good coverage?  Why or why not? How might this impact the performance of our cloned policy?

The full state space from the definition is -4.8 to 4.9 for positions, -inf to inf for velocities and -0.418 to 0.418 for angles. However, the termination condition is for those are different. For positions, they are (-2.4, 2.4). For angles, they are (-0.2095,0.2095). In the expert data, the positions are from -0.7 to 2.4, which spans most of the state space (before termination) but not very much the left side to the limit. Similar for the angles. Both actions are coverd. I think the expert policy can handle the situations when the pole are leaning right well but not leaning left. If the expert data lack data for part of the state space, the cloned policy will do even worse because the error accumulates during time.

### [14 pts] 1.3 Model

The file `model.py` provides skeleton code for the model. Your goal is to create the architecture of the network by adding layers that map the input to output.

You will need to update the `__init__` method and the `forward` method.

The `select_action` method has already been written for you.  This should be used when running the policy in the environment, while the `forward` function should be used at training time.

- **[10 pts]** Insert your code in the placeholder below.

In [None]:
# PLACEHOLDER TO INSERT YOUR MyModel class here

class MyModel(nn.Module):
    def __init__(self, state_size, action_size):
        super().__init__()
        self.state_size = state_size
        # TODO YOUR CODE HERE FOR INITIALIZING THE MODEL
        # Guidelines for network size: start with 2 hidden layers and maximum 32 neurons per layer
        # feel free to explore different sizes
        self.layers = nn.Sequential(
            nn.Linear(state_size, 32, True),
            nn.ReLU(),
            nn.Linear(32, 32, True),
            nn.ReLU(),
            nn.Linear(32, action_size, True),
            )

    def forward(self, x):
        return self.layers(x)
        
    def select_action(self, state):
        self.eval()
        x = self.forward(state)
        self.train()
        return x.max(1)[1].view(1, 1).to(torch.long)


Answer the following questions:

- **[2 pts]** What is the dimension and meaning of the input of the network

Input size is 4, this is the state size.

- **[2 pts]** Similarly, describe the output.

There are 2 hidden layers with 32 neurons each and the final layer just maps the output to the logits (probability value) of the two actions.


### [7 pts] 1.4 Training

The file `bc.py` is the entry point for training your behavioral cloning model. The skeleton and the main components are already there.

The missing parts for you to do are:

- Initializing the model
- Choosing a loss function
- Choosing an optimizer
- Playing with hyperparameters to train your model.

- **[5 pts]** Insert your code in the placeholder below.

In [None]:
# PLACEHOLDER FOR YOUR CODE HER
# HOW DID YOU INITIALIZE YOUR MODEL, OPTIMIZER AND LOSS FUNCTIONS? PASTE HERE YOUR FINAL CODE
# NOTE: YOU CAN KEEP THE FOLLOWING LINES COMMENTED OUT, AS RUNNING THIS CELL WILL PROBABLY RESULT IN ERRORS

model = MyModel(state_size=4, action_size=2).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
loss_function = torch.nn.CrossEntropyLoss()

You can run your code by doing:

```
python3 bc.py
```

**During all of this assignment, the code in `eval_policy.py` will be your best friend.** At any time, you can test your model by giving as argument the path to the model weights and the environment name using the following command:

```
python3 eval_policy.py --model-path /path/to/model/weights --env ENV_NAME
````

In [None]:
## PASTE YOUR TERMINAL OUTPUT HERE
# NOTE: TO HAVE LESS LINES PRINTED, YOU CAN SET THE VARIABLE PRINT_INTERVAL TO 5 or 10
[epoch    1/100] [iter       0] [loss 0.69105]
[epoch    1/100] [iter     500] [loss 0.18857]
[epoch    1/100] [iter    1000] [loss 0.16961]
[epoch    1/100] [iter    1500] [loss 0.09920]
[epoch    2/100] [iter    2000] [loss 0.03406]
[epoch    2/100] [iter    2500] [loss 0.02636]
[epoch    2/100] [iter    3000] [loss 0.01991]
[Test on environment] [epoch 2/100] [score 259.10]
[epoch    3/100] [iter    3500] [loss 0.01336]
[epoch    3/100] [iter    4000] [loss 0.02488]
[epoch    3/100] [iter    4500] [loss 0.01080]
[epoch    4/100] [iter    5000] [loss 0.00624]
[epoch    4/100] [iter    5500] [loss 0.01536]
[epoch    4/100] [iter    6000] [loss 0.03493]
[Test on environment] [epoch 4/100] [score 229.70]
[epoch    5/100] [iter    6500] [loss 0.01465]
[epoch    5/100] [iter    7000] [loss 0.00266]
[epoch    5/100] [iter    7500] [loss 0.02348]
[epoch    6/100] [iter    8000] [loss 0.01031]
[epoch    6/100] [iter    8500] [loss 0.00957]
[epoch    6/100] [iter    9000] [loss 0.02683]
[Test on environment] [epoch 6/100] [score 277.50]
[epoch    7/100] [iter    9500] [loss 0.02431]
[epoch    7/100] [iter   10000] [loss 0.01535]
[epoch    7/100] [iter   10500] [loss 0.00287]
[epoch    8/100] [iter   11000] [loss 0.00734]
[epoch    8/100] [iter   11500] [loss 0.01632]
[epoch    8/100] [iter   12000] [loss 0.00213]
[Test on environment] [epoch 8/100] [score 263.10]
[epoch    9/100] [iter   12500] [loss 0.02587]
[epoch    9/100] [iter   13000] [loss 0.00176]
[epoch    9/100] [iter   13500] [loss 0.01909]
[epoch    9/100] [iter   14000] [loss 0.00887]
[epoch   10/100] [iter   14500] [loss 0.02162]
[epoch   10/100] [iter   15000] [loss 0.01047]
[epoch   10/100] [iter   15500] [loss 0.01503]
[Test on environment] [epoch 10/100] [score 250.00]
[epoch   11/100] [iter   16000] [loss 0.01479]
[epoch   11/100] [iter   16500] [loss 0.01943]
[epoch   11/100] [iter   17000] [loss 0.00436]
[epoch   12/100] [iter   17500] [loss 0.01987]
[epoch   12/100] [iter   18000] [loss 0.00986]
[epoch   12/100] [iter   18500] [loss 0.00328]
[Test on environment] [epoch 12/100] [score 242.80]
[epoch   13/100] [iter   19000] [loss 0.00734]
[epoch   13/100] [iter   19500] [loss 0.01310]
[epoch   13/100] [iter   20000] [loss 0.00357]
[epoch   14/100] [iter   20500] [loss 0.00119]
[epoch   14/100] [iter   21000] [loss 0.01521]
[epoch   14/100] [iter   21500] [loss 0.03005]
[Test on environment] [epoch 14/100] [score 245.80]
[epoch   15/100] [iter   22000] [loss 0.00909]
[epoch   15/100] [iter   22500] [loss 0.00366]
[epoch   15/100] [iter   23000] [loss 0.00156]
[epoch   16/100] [iter   23500] [loss 0.02178]
[epoch   16/100] [iter   24000] [loss 0.00838]
[epoch   16/100] [iter   24500] [loss 0.00649]
[Test on environment] [epoch 16/100] [score 271.10]
[epoch   17/100] [iter   25000] [loss 0.00111]
[epoch   17/100] [iter   25500] [loss 0.00373]
[epoch   17/100] [iter   26000] [loss 0.05091]
[epoch   18/100] [iter   26500] [loss 0.00956]
[epoch   18/100] [iter   27000] [loss 0.01292]
[epoch   18/100] [iter   27500] [loss 0.00161]
[epoch   18/100] [iter   28000] [loss 0.02314]
[Test on environment] [epoch 18/100] [score 233.90]
[epoch   19/100] [iter   28500] [loss 0.00003]
[epoch   19/100] [iter   29000] [loss 0.00177]
[epoch   19/100] [iter   29500] [loss 0.00043]
[epoch   20/100] [iter   30000] [loss 0.00263]
[epoch   20/100] [iter   30500] [loss 0.00129]
[epoch   20/100] [iter   31000] [loss 0.04000]
[Test on environment] [epoch 20/100] [score 278.70]
[epoch   21/100] [iter   31500] [loss 0.01452]
[epoch   21/100] [iter   32000] [loss 0.02344]
[epoch   21/100] [iter   32500] [loss 0.02737]
[epoch   22/100] [iter   33000] [loss 0.00004]
[epoch   22/100] [iter   33500] [loss 0.00185]
[epoch   22/100] [iter   34000] [loss 0.01779]
[Test on environment] [epoch 22/100] [score 328.20]
[epoch   23/100] [iter   34500] [loss 0.00324]
[epoch   23/100] [iter   35000] [loss 0.00838]
[epoch   23/100] [iter   35500] [loss 0.03275]
[epoch   24/100] [iter   36000] [loss 0.00425]
[epoch   24/100] [iter   36500] [loss 0.00366]
[epoch   24/100] [iter   37000] [loss 0.00510]
[Test on environment] [epoch 24/100] [score 283.20]
[epoch   25/100] [iter   37500] [loss 0.06638]
[epoch   25/100] [iter   38000] [loss 0.09780]
[epoch   25/100] [iter   38500] [loss 0.00005]
[epoch   26/100] [iter   39000] [loss 0.00058]
[epoch   26/100] [iter   39500] [loss 0.03468]
[epoch   26/100] [iter   40000] [loss 0.00328]
[epoch   26/100] [iter   40500] [loss 0.02745]
[Test on environment] [epoch 26/100] [score 268.60]
[epoch   27/100] [iter   41000] [loss 0.00519]
[epoch   27/100] [iter   41500] [loss 0.01604]
[epoch   27/100] [iter   42000] [loss 0.03140]
[epoch   28/100] [iter   42500] [loss 0.00212]
[epoch   28/100] [iter   43000] [loss 0.00597]
[epoch   28/100] [iter   43500] [loss 0.02455]
[Test on environment] [epoch 28/100] [score 297.80]
[epoch   29/100] [iter   44000] [loss 0.00371]
[epoch   29/100] [iter   44500] [loss 0.00321]
[epoch   29/100] [iter   45000] [loss 0.00053]
[epoch   30/100] [iter   45500] [loss 0.00200]
[epoch   30/100] [iter   46000] [loss 0.00791]
[epoch   30/100] [iter   46500] [loss 0.00832]
[Test on environment] [epoch 30/100] [score 254.60]
[epoch   31/100] [iter   47000] [loss 0.00413]
[epoch   31/100] [iter   47500] [loss 0.00304]
[epoch   31/100] [iter   48000] [loss 0.00016]
[epoch   32/100] [iter   48500] [loss 0.00182]
[epoch   32/100] [iter   49000] [loss 0.00153]
[epoch   32/100] [iter   49500] [loss 0.00012]
[Test on environment] [epoch 32/100] [score 265.90]
[epoch   33/100] [iter   50000] [loss 0.03355]
[epoch   33/100] [iter   50500] [loss 0.00664]
[epoch   33/100] [iter   51000] [loss 0.05109]
[epoch   34/100] [iter   51500] [loss 0.03792]
[epoch   34/100] [iter   52000] [loss 0.00024]
[epoch   34/100] [iter   52500] [loss 0.00022]
[Test on environment] [epoch 34/100] [score 238.10]
[epoch   35/100] [iter   53000] [loss 0.00096]
[epoch   35/100] [iter   53500] [loss 0.00598]
[epoch   35/100] [iter   54000] [loss 0.00143]
[epoch   35/100] [iter   54500] [loss 0.02797]
[epoch   36/100] [iter   55000] [loss 0.02483]
[epoch   36/100] [iter   55500] [loss 0.00307]
[epoch   36/100] [iter   56000] [loss 0.00001]
[Test on environment] [epoch 36/100] [score 267.20]
[epoch   37/100] [iter   56500] [loss 0.00221]
[epoch   37/100] [iter   57000] [loss 0.00037]
[epoch   37/100] [iter   57500] [loss 0.00153]
[epoch   38/100] [iter   58000] [loss 0.00011]
[epoch   38/100] [iter   58500] [loss 0.01667]
[epoch   38/100] [iter   59000] [loss 0.00138]
[Test on environment] [epoch 38/100] [score 252.40]
[epoch   39/100] [iter   59500] [loss 0.00502]
[epoch   39/100] [iter   60000] [loss 0.00561]
[epoch   39/100] [iter   60500] [loss 0.01452]
[epoch   40/100] [iter   61000] [loss 0.00083]
[epoch   40/100] [iter   61500] [loss 0.00311]
[epoch   40/100] [iter   62000] [loss 0.02843]
[Test on environment] [epoch 40/100] [score 244.60]
[epoch   41/100] [iter   62500] [loss 0.03630]
[epoch   41/100] [iter   63000] [loss 0.02196]
[epoch   41/100] [iter   63500] [loss 0.00000]
[epoch   42/100] [iter   64000] [loss 0.00191]
[epoch   42/100] [iter   64500] [loss 0.00601]
[epoch   42/100] [iter   65000] [loss 0.00127]
[Test on environment] [epoch 42/100] [score 284.30]
[epoch   43/100] [iter   65500] [loss 0.00049]
[epoch   43/100] [iter   66000] [loss 0.00935]
[epoch   43/100] [iter   66500] [loss 0.00112]
[epoch   44/100] [iter   67000] [loss 0.00059]
[epoch   44/100] [iter   67500] [loss 0.01081]
[epoch   44/100] [iter   68000] [loss 0.00000]
[epoch   44/100] [iter   68500] [loss 0.00009]
[Test on environment] [epoch 44/100] [score 260.30]
[epoch   45/100] [iter   69000] [loss 0.00123]
[epoch   45/100] [iter   69500] [loss 0.00200]
[epoch   45/100] [iter   70000] [loss 0.00713]
[epoch   46/100] [iter   70500] [loss 0.00148]
[epoch   46/100] [iter   71000] [loss 0.00299]
[epoch   46/100] [iter   71500] [loss 0.00004]
[Test on environment] [epoch 46/100] [score 253.50]
[epoch   47/100] [iter   72000] [loss 0.00475]
[epoch   47/100] [iter   72500] [loss 0.01660]
[epoch   47/100] [iter   73000] [loss 0.00070]
[epoch   48/100] [iter   73500] [loss 0.01289]
[epoch   48/100] [iter   74000] [loss 0.00734]
[epoch   48/100] [iter   74500] [loss 0.00172]
[Test on environment] [epoch 48/100] [score 247.80]
[epoch   49/100] [iter   75000] [loss 0.00036]
[epoch   49/100] [iter   75500] [loss 0.00200]
[epoch   49/100] [iter   76000] [loss 0.00869]
[epoch   50/100] [iter   76500] [loss 0.00101]
[epoch   50/100] [iter   77000] [loss 0.00489]
[epoch   50/100] [iter   77500] [loss 0.05687]
[Test on environment] [epoch 50/100] [score 284.30]
[epoch   51/100] [iter   78000] [loss 0.00033]
[epoch   51/100] [iter   78500] [loss 0.00041]
[epoch   51/100] [iter   79000] [loss 0.01757]
[epoch   52/100] [iter   79500] [loss 0.00010]
[epoch   52/100] [iter   80000] [loss 0.00001]
[epoch   52/100] [iter   80500] [loss 0.00000]
[epoch   52/100] [iter   81000] [loss 0.00009]
[Test on environment] [epoch 52/100] [score 268.50]
[epoch   53/100] [iter   81500] [loss 0.00110]
[epoch   53/100] [iter   82000] [loss 0.00282]
[epoch   53/100] [iter   82500] [loss 0.00602]
[epoch   54/100] [iter   83000] [loss 0.01967]
[epoch   54/100] [iter   83500] [loss 0.00005]
[epoch   54/100] [iter   84000] [loss 0.00615]
[Test on environment] [epoch 54/100] [score 315.20]
[epoch   55/100] [iter   84500] [loss 0.00095]
[epoch   55/100] [iter   85000] [loss 0.00006]
[epoch   55/100] [iter   85500] [loss 0.00098]
[epoch   56/100] [iter   86000] [loss 0.00949]
[epoch   56/100] [iter   86500] [loss 0.00239]
[epoch   56/100] [iter   87000] [loss 0.00217]
[Test on environment] [epoch 56/100] [score 233.70]
[epoch   57/100] [iter   87500] [loss 0.00525]
[epoch   57/100] [iter   88000] [loss 0.00006]
[epoch   57/100] [iter   88500] [loss 0.00016]
[epoch   58/100] [iter   89000] [loss 0.00858]
[epoch   58/100] [iter   89500] [loss 0.00043]
[epoch   58/100] [iter   90000] [loss 0.00386]
[Test on environment] [epoch 58/100] [score 287.60]
[epoch   59/100] [iter   90500] [loss 0.00585]
[epoch   59/100] [iter   91000] [loss 0.00047]
[epoch   59/100] [iter   91500] [loss 0.00132]
[epoch   60/100] [iter   92000] [loss 0.00002]
[epoch   60/100] [iter   92500] [loss 0.00786]
[epoch   60/100] [iter   93000] [loss 0.00009]
[Test on environment] [epoch 60/100] [score 255.00]
[epoch   61/100] [iter   93500] [loss 0.00008]
[epoch   61/100] [iter   94000] [loss 0.01252]
[epoch   61/100] [iter   94500] [loss 0.00009]
[epoch   61/100] [iter   95000] [loss 0.00063]
[epoch   62/100] [iter   95500] [loss 0.00037]
[epoch   62/100] [iter   96000] [loss 0.00472]
[epoch   62/100] [iter   96500] [loss 0.00194]
[Test on environment] [epoch 62/100] [score 275.10]
[epoch   63/100] [iter   97000] [loss 0.00310]
[epoch   63/100] [iter   97500] [loss 0.00206]
[epoch   63/100] [iter   98000] [loss 0.00000]
[epoch   64/100] [iter   98500] [loss 0.00001]
[epoch   64/100] [iter   99000] [loss 0.00308]
[epoch   64/100] [iter   99500] [loss 0.00378]
[Test on environment] [epoch 64/100] [score 270.40]
[epoch   65/100] [iter  100000] [loss 0.00001]
[epoch   65/100] [iter  100500] [loss 0.01823]
[epoch   65/100] [iter  101000] [loss 0.00714]
[epoch   66/100] [iter  101500] [loss 0.00958]
[epoch   66/100] [iter  102000] [loss 0.01940]
[epoch   66/100] [iter  102500] [loss 0.00020]
[Test on environment] [epoch 66/100] [score 267.70]
[epoch   67/100] [iter  103000] [loss 0.00230]
[epoch   67/100] [iter  103500] [loss 0.00061]
[epoch   67/100] [iter  104000] [loss 0.00393]
[epoch   68/100] [iter  104500] [loss 0.00000]
[epoch   68/100] [iter  105000] [loss 0.00541]
[epoch   68/100] [iter  105500] [loss 0.04640]
[Test on environment] [epoch 68/100] [score 254.50]
[epoch   69/100] [iter  106000] [loss 0.00146]
[epoch   69/100] [iter  106500] [loss 0.00012]
[epoch   69/100] [iter  107000] [loss 0.01226]
[epoch   69/100] [iter  107500] [loss 0.00028]
[epoch   70/100] [iter  108000] [loss 0.02177]
[epoch   70/100] [iter  108500] [loss 0.00255]
[epoch   70/100] [iter  109000] [loss 0.01709]
[Test on environment] [epoch 70/100] [score 254.60]
[epoch   71/100] [iter  109500] [loss 0.02117]
[epoch   71/100] [iter  110000] [loss 0.00191]
[epoch   71/100] [iter  110500] [loss 0.03279]
[epoch   72/100] [iter  111000] [loss 0.00015]
[epoch   72/100] [iter  111500] [loss 0.00199]
[epoch   72/100] [iter  112000] [loss 0.01155]
[Test on environment] [epoch 72/100] [score 264.30]
[epoch   73/100] [iter  112500] [loss 0.00624]
[epoch   73/100] [iter  113000] [loss 0.01151]
[epoch   73/100] [iter  113500] [loss 0.06255]
[epoch   74/100] [iter  114000] [loss 0.00001]
[epoch   74/100] [iter  114500] [loss 0.00197]
[epoch   74/100] [iter  115000] [loss 0.00629]
[Test on environment] [epoch 74/100] [score 313.60]
[epoch   75/100] [iter  115500] [loss 0.00000]
[epoch   75/100] [iter  116000] [loss 0.02030]
[epoch   75/100] [iter  116500] [loss 0.00013]
[epoch   76/100] [iter  117000] [loss 0.00000]
[epoch   76/100] [iter  117500] [loss 0.00676]
[epoch   76/100] [iter  118000] [loss 0.00009]
[Test on environment] [epoch 76/100] [score 250.10]
[epoch   77/100] [iter  118500] [loss 0.00356]
[epoch   77/100] [iter  119000] [loss 0.00898]
[epoch   77/100] [iter  119500] [loss 0.00033]
[epoch   78/100] [iter  120000] [loss 0.00016]
[epoch   78/100] [iter  120500] [loss 0.01634]
[epoch   78/100] [iter  121000] [loss 0.00051]
[epoch   78/100] [iter  121500] [loss 0.00555]
[Test on environment] [epoch 78/100] [score 258.80]
[epoch   79/100] [iter  122000] [loss 0.00396]
[epoch   79/100] [iter  122500] [loss 0.02056]
[epoch   79/100] [iter  123000] [loss 0.01746]
[epoch   80/100] [iter  123500] [loss 0.00611]
[epoch   80/100] [iter  124000] [loss 0.00000]
[epoch   80/100] [iter  124500] [loss 0.00023]
[Test on environment] [epoch 80/100] [score 256.60]
[epoch   81/100] [iter  125000] [loss 0.01374]
[epoch   81/100] [iter  125500] [loss 0.00000]
[epoch   81/100] [iter  126000] [loss 0.01504]
[epoch   82/100] [iter  126500] [loss 0.00056]
[epoch   82/100] [iter  127000] [loss 0.01825]
[epoch   82/100] [iter  127500] [loss 0.00551]
[Test on environment] [epoch 82/100] [score 300.80]
[epoch   83/100] [iter  128000] [loss 0.00011]
[epoch   83/100] [iter  128500] [loss 0.00005]
[epoch   83/100] [iter  129000] [loss 0.00025]
[epoch   84/100] [iter  129500] [loss 0.00106]
[epoch   84/100] [iter  130000] [loss 0.03765]
[epoch   84/100] [iter  130500] [loss 0.00036]
[Test on environment] [epoch 84/100] [score 226.00]
[epoch   85/100] [iter  131000] [loss 0.00004]
[epoch   85/100] [iter  131500] [loss 0.00166]
[epoch   85/100] [iter  132000] [loss 0.02647]
[epoch   86/100] [iter  132500] [loss 0.00005]
[epoch   86/100] [iter  133000] [loss 0.00054]
[epoch   86/100] [iter  133500] [loss 0.00081]
[Test on environment] [epoch 86/100] [score 268.00]
[epoch   87/100] [iter  134000] [loss 0.00087]
[epoch   87/100] [iter  134500] [loss 0.03919]
[epoch   87/100] [iter  135000] [loss 0.00292]
[epoch   87/100] [iter  135500] [loss 0.00129]
[epoch   88/100] [iter  136000] [loss 0.00281]
[epoch   88/100] [iter  136500] [loss 0.00001]
[epoch   88/100] [iter  137000] [loss 0.00001]
[Test on environment] [epoch 88/100] [score 267.20]
[epoch   89/100] [iter  137500] [loss 0.00001]
[epoch   89/100] [iter  138000] [loss 0.00202]
[epoch   89/100] [iter  138500] [loss 0.00039]
[epoch   90/100] [iter  139000] [loss 0.01026]
[epoch   90/100] [iter  139500] [loss 0.00936]
[epoch   90/100] [iter  140000] [loss 0.08809]
[Test on environment] [epoch 90/100] [score 258.60]
[epoch   91/100] [iter  140500] [loss 0.00234]
[epoch   91/100] [iter  141000] [loss 0.00144]
[epoch   91/100] [iter  141500] [loss 0.01822]
[epoch   92/100] [iter  142000] [loss 0.00014]
[epoch   92/100] [iter  142500] [loss 0.00052]
[epoch   92/100] [iter  143000] [loss 0.00150]
[Test on environment] [epoch 92/100] [score 253.90]
[epoch   93/100] [iter  143500] [loss 0.00124]
[epoch   93/100] [iter  144000] [loss 0.00232]
[epoch   93/100] [iter  144500] [loss 0.02956]
[epoch   94/100] [iter  145000] [loss 0.00007]
[epoch   94/100] [iter  145500] [loss 0.00000]
[epoch   94/100] [iter  146000] [loss 0.00027]
[Test on environment] [epoch 94/100] [score 286.40]
[epoch   95/100] [iter  146500] [loss 0.00023]
[epoch   95/100] [iter  147000] [loss 0.00001]
[epoch   95/100] [iter  147500] [loss 0.00002]
[epoch   95/100] [iter  148000] [loss 0.00032]
[epoch   96/100] [iter  148500] [loss 0.00001]
[epoch   96/100] [iter  149000] [loss 0.01595]
[epoch   96/100] [iter  149500] [loss 0.03903]
[Test on environment] [epoch 96/100] [score 259.90]
[epoch   97/100] [iter  150000] [loss 0.00001]
[epoch   97/100] [iter  150500] [loss 0.00001]
[epoch   97/100] [iter  151000] [loss 0.00093]
[epoch   98/100] [iter  151500] [loss 0.00678]
[epoch   98/100] [iter  152000] [loss 0.00094]
[epoch   98/100] [iter  152500] [loss 0.00006]
[Test on environment] [epoch 98/100] [score 240.90]
[epoch   99/100] [iter  153000] [loss 0.02876]
[epoch   99/100] [iter  153500] [loss 0.00045]
[epoch   99/100] [iter  154000] [loss 0.04840]
[epoch  100/100] [iter  154500] [loss 0.00000]
[epoch  100/100] [iter  155000] [loss 0.00308]
[epoch  100/100] [iter  155500] [loss 0.00566]
[Test on environment] [epoch 100/100] [score 279.30]
Saving model as behavioral_cloning_CartPole-v1.pt

**[2 pts]** Did you manage to learn a good policy? How consistent is the reward you are getting?

The learned policy is better than the tabular Q learning and faster as well. For consistency, the average reward is around 260. It has a 50ish variance. 

## 2. [24 pts] Deep Q Learning

There are two main issues with the behavior cloning approach.

- First, we are not always lucky enough to have access to a dataset of expert demonstrations.
- Second, replicating an expert policy suffers from compounding error. The policy $\pi$ only sees these "perfect" examples and has no knowledge on how to recover from states not visited by the expert. For this reason, as soon as it is presented with a state that is off the expert trajectory, it will perform poorly and will continue to deviate from a good trajectory without the possibility of recovering from errors.

---
The second task consists in solving the environment from scratch, using RL, and most specifically the DQN algorithm, to learn a policy $\pi$.

For this task, familiarize yourself with the file `dqn.py`. We are going to re-use the file `model.py` for the model you created in the previous task.

Your task is very similar to the one in the previous assignment, to implement the Q-learning algorithm, but in this version, our Q-function is approximated with a neural network.

The algorithm (excerpted from [Atari DQN paper](https://arxiv.org/abs/1312.5602)) is given below:

![DQN algorithm](https://i.imgur.com/Mh4Uxta.png)

### 2.0 [2 pts] Think about your model...



In DQN, we are using the same model as in task 1 for behavioral cloning. In both tasks the model receives as input the state and in both tasks the model outputs something that has the same dimensionality as the number of actions. These two outputs, though, represent very different things. What is each one representing?

In BC, the output is logits/probabilities of each action but in DQN, the ouput is the Q value.

### 2.1 [10 pts] Update your Q-function

Complete the `optimize_model` function. This function receives as input a `state`, an `action`, the `next_state`, the `reward` and `done` representing the tuple $(s_t, a_t, s_{t+1}, r_t, done_t)$. Your task is to update your Q-function as shown in the [Atari DQN paper](https://arxiv.org/abs/1312.5602) environment. For now don't be concerned with the experience replay buffer. We'll get to that later.

![Loss function](https://i.imgur.com/tpTsV8m.png)

Insert your code in the placeholder below.

In [None]:
## PLACEHOLDER TO INSERT YOUR optimize_model function here:

# Without replay buffer
def optimize_model(state, action, next_state, reward, done):
    # TODO given a tuple (s_t, a_t, s_{t+1}, r_t, done_t) update your model weights
    state_torch = torch.tensor(state, device=device, dtype=torch.float32).unsqueeze(0)
    targetQ = torch.tensor(reward, device=device, dtype=torch.float32)
    next_state_torch = torch.tensor(next_state, device=device, dtype=torch.float32).unsqueeze(0)
    action = action.squeeze()
    if not done:
        targetQ += GAMMA * torch.max(target(next_state_torch))
    
    loss = nn.functional.mse_loss(targetQ, model(state_torch).squeeze()[action])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


# With replay buffer
def optimize_model(state, action, next_state, reward, done):
    # TODO given a tuple (s_t, a_t, s_{t+1}, r_t, done_t) update your model weights
    not_done = torch.logical_not(done).to(torch.float32)
    
    targetQ = reward + GAMMA * not_done * torch.max(target(next_state), dim=1)[0]

    forward_pass = model(state)
    m, _ = forward_pass.shape
    loss = nn.functional.mse_loss(targetQ, forward_pass[torch.arange(m), action.squeeze()])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

### 2.2 [5 pts] $\epsilon$-greedy strategy

You will need a strategy to explore your environment. The standard strategy is to use $\epsilon$-greedy. Implement it in the `choose_action` function template.

Insert your code in the placeholder below.

In [None]:
## PLACEHOLDER TO INSERT YOUR choose_action function here:

def choose_action(state, test_mode=False):
    # TODO implement an epsilon-greedy strategy
    if np.random.random() < EPS_EXPLORATION:
        return torch.from_numpy(np.array(env.action_space.sample()).reshape((1,1))).to(device)
    else:
        return model.select_action(torch.tensor(state, device=device, dtype=torch.float32).unsqueeze(0))


### 2.3 [2 pts] Train your model

Try to train a model in this way.

You can run your code by doing:

```
python3 dqn.py
```

How many episodes does it take to learn (ie. reach a good reward)?

It starts to produce useful actions from around 2000th episodes.

In [None]:
## PASTE YOUR TERMINAL OUTPUT HERE
# NOTE: TO HAVE LESS LINES PRINTED, YOU CAN SET THE VARIABLE PRINT_INTERVAL TO 5 or 10

[Episode   50/4000] [Steps   12] [reward 13.0]
----------
saving model.
[TEST Episode 50] [Average Reward 9.4]
----------
[Episode  100/4000] [Steps    9] [reward 10.0]
----------
[TEST Episode 100] [Average Reward 9.3]
----------
[Episode  150/4000] [Steps   12] [reward 13.0]
----------
saving model.
[TEST Episode 150] [Average Reward 10.5]
----------
[Episode  200/4000] [Steps    8] [reward 9.0]
----------
[TEST Episode 200] [Average Reward 10.4]
----------
[Episode  250/4000] [Steps   11] [reward 12.0]
----------
[TEST Episode 250] [Average Reward 9.5]
----------
[Episode  300/4000] [Steps   27] [reward 28.0]
----------
saving model.
[TEST Episode 300] [Average Reward 10.6]
----------
[Episode  350/4000] [Steps   10] [reward 11.0]
----------
[TEST Episode 350] [Average Reward 9.4]
----------
[Episode  400/4000] [Steps   13] [reward 14.0]
----------
[TEST Episode 400] [Average Reward 9.1]
----------
[Episode  450/4000] [Steps   10] [reward 11.0]
----------
[TEST Episode 450] [Average Reward 9.5]
----------
[Episode  500/4000] [Steps   27] [reward 28.0]
----------
saving model.
[TEST Episode 500] [Average Reward 57.9]
----------
[Episode  550/4000] [Steps    9] [reward 10.0]
----------
[TEST Episode 550] [Average Reward 9.0]
----------
[Episode  600/4000] [Steps    9] [reward 10.0]
----------
[TEST Episode 600] [Average Reward 15.3]
----------
[Episode  650/4000] [Steps    9] [reward 10.0]
----------
saving model.
[TEST Episode 650] [Average Reward 401.0]
----------
[Episode  700/4000] [Steps    9] [reward 10.0]
----------
[TEST Episode 700] [Average Reward 9.5]
----------
[Episode  750/4000] [Steps   19] [reward 20.0]
----------
[TEST Episode 750] [Average Reward 16.9]
----------
[Episode  800/4000] [Steps    7] [reward 8.0]
----------
[TEST Episode 800] [Average Reward 9.3]
----------
[Episode  850/4000] [Steps   13] [reward 14.0]
----------
[TEST Episode 850] [Average Reward 115.2]
----------
[Episode  900/4000] [Steps    8] [reward 9.0]
----------
[TEST Episode 900] [Average Reward 9.3]
----------
[Episode  950/4000] [Steps   14] [reward 15.0]
----------
[TEST Episode 950] [Average Reward 9.1]
----------
[Episode 1000/4000] [Steps   57] [reward 58.0]
----------
[TEST Episode 1000] [Average Reward 16.8]
----------
[Episode 1050/4000] [Steps   22] [reward 23.0]
----------
[TEST Episode 1050] [Average Reward 46.0]
----------
[Episode 1100/4000] [Steps   10] [reward 11.0]
----------
[TEST Episode 1100] [Average Reward 20.1]
----------
[Episode 1150/4000] [Steps    8] [reward 9.0]
----------
[TEST Episode 1150] [Average Reward 18.1]
----------
[Episode 1200/4000] [Steps   28] [reward 29.0]
----------
[TEST Episode 1200] [Average Reward 100.6]
----------
[Episode 1250/4000] [Steps   66] [reward 67.0]
----------
[TEST Episode 1250] [Average Reward 11.4]
----------
[Episode 1300/4000] [Steps   24] [reward 25.0]
----------
[TEST Episode 1300] [Average Reward 50.9]
----------
[Episode 1350/4000] [Steps   12] [reward 13.0]
----------
[TEST Episode 1350] [Average Reward 21.9]
----------
[Episode 1400/4000] [Steps   10] [reward 11.0]
----------
[TEST Episode 1400] [Average Reward 9.3]
----------
[Episode 1450/4000] [Steps  300] [reward 301.0]
----------
[TEST Episode 1450] [Average Reward 166.1]
----------
[Episode 1500/4000] [Steps   11] [reward 12.0]
----------
[TEST Episode 1500] [Average Reward 11.3]
----------
[Episode 1550/4000] [Steps  139] [reward 140.0]
----------
[TEST Episode 1550] [Average Reward 124.6]
----------
[Episode 1600/4000] [Steps    8] [reward 9.0]
----------
[TEST Episode 1600] [Average Reward 10.4]
----------
[Episode 1650/4000] [Steps  354] [reward 355.0]
----------
[TEST Episode 1650] [Average Reward 162.0]
----------
[Episode 1700/4000] [Steps   16] [reward 17.0]
----------
[TEST Episode 1700] [Average Reward 15.4]
----------
[Episode 1750/4000] [Steps  123] [reward 124.0]
----------
[TEST Episode 1750] [Average Reward 108.0]
----------
[Episode 1800/4000] [Steps  110] [reward 111.0]
----------
[TEST Episode 1800] [Average Reward 128.3]
----------
[Episode 1850/4000] [Steps   29] [reward 30.0]
----------
[TEST Episode 1850] [Average Reward 151.7]
----------
[Episode 1900/4000] [Steps   18] [reward 19.0]
----------
[TEST Episode 1900] [Average Reward 151.0]
----------
[Episode 1950/4000] [Steps  195] [reward 196.0]
----------
[TEST Episode 1950] [Average Reward 55.9]
----------
[Episode 2000/4000] [Steps   79] [reward 80.0]
----------
[TEST Episode 2000] [Average Reward 69.2]
----------
[Episode 2050/4000] [Steps   87] [reward 88.0]
----------
[TEST Episode 2050] [Average Reward 120.1]
----------
[Episode 2100/4000] [Steps  196] [reward 197.0]
----------
[TEST Episode 2100] [Average Reward 135.8]
----------
[Episode 2150/4000] [Steps  150] [reward 151.0]
----------
[TEST Episode 2150] [Average Reward 259.9]
----------
[Episode 2200/4000] [Steps   16] [reward 17.0]
----------
saving model.
[TEST Episode 2200] [Average Reward 500.0]
----------
[Episode 2250/4000] [Steps  121] [reward 122.0]
----------
[TEST Episode 2250] [Average Reward 116.6]
----------
[Episode 2300/4000] [Steps  124] [reward 125.0]
----------
[TEST Episode 2300] [Average Reward 128.4]
----------
[Episode 2350/4000] [Steps  156] [reward 157.0]
----------
[TEST Episode 2350] [Average Reward 108.2]
----------
[Episode 2400/4000] [Steps   79] [reward 80.0]
----------
[TEST Episode 2400] [Average Reward 80.7]
----------
[Episode 2450/4000] [Steps  198] [reward 199.0]
----------
[TEST Episode 2450] [Average Reward 154.3]
----------
[Episode 2500/4000] [Steps   22] [reward 23.0]
----------
[TEST Episode 2500] [Average Reward 109.3]
----------
[Episode 2550/4000] [Steps  134] [reward 135.0]
----------
[TEST Episode 2550] [Average Reward 416.8]
----------
[Episode 2600/4000] [Steps   98] [reward 99.0]
----------
[TEST Episode 2600] [Average Reward 96.3]
----------
[Episode 2650/4000] [Steps   48] [reward 49.0]
----------
[TEST Episode 2650] [Average Reward 104.4]
----------
[Episode 2700/4000] [Steps  104] [reward 105.0]
----------
[TEST Episode 2700] [Average Reward 170.9]
----------
[Episode 2750/4000] [Steps  110] [reward 111.0]
----------
[TEST Episode 2750] [Average Reward 113.8]
----------
[Episode 2800/4000] [Steps  105] [reward 106.0]
----------
[TEST Episode 2800] [Average Reward 47.9]
----------
[Episode 2850/4000] [Steps   74] [reward 75.0]
----------
[TEST Episode 2850] [Average Reward 47.9]
----------
[Episode 2900/4000] [Steps  146] [reward 147.0]
----------
[TEST Episode 2900] [Average Reward 63.2]
----------
[Episode 2950/4000] [Steps   10] [reward 11.0]
----------
[TEST Episode 2950] [Average Reward 59.4]
----------
[Episode 3000/4000] [Steps   42] [reward 43.0]
----------
[TEST Episode 3000] [Average Reward 114.0]
----------
[Episode 3050/4000] [Steps  100] [reward 101.0]
----------
[TEST Episode 3050] [Average Reward 98.0]
----------
[Episode 3100/4000] [Steps  320] [reward 321.0]
----------
[TEST Episode 3100] [Average Reward 53.5]
----------
[Episode 3150/4000] [Steps   91] [reward 92.0]
----------
[TEST Episode 3150] [Average Reward 107.2]
----------
[Episode 3200/4000] [Steps   91] [reward 92.0]
----------
[TEST Episode 3200] [Average Reward 97.8]
----------
[Episode 3250/4000] [Steps  211] [reward 212.0]
----------
[TEST Episode 3250] [Average Reward 114.4]
----------
[Episode 3300/4000] [Steps  102] [reward 103.0]
----------
[TEST Episode 3300] [Average Reward 500.0]
----------
[Episode 3350/4000] [Steps  134] [reward 135.0]
----------
[TEST Episode 3350] [Average Reward 118.7]
----------
[Episode 3400/4000] [Steps  176] [reward 177.0]
----------
[TEST Episode 3400] [Average Reward 137.4]
----------
[Episode 3450/4000] [Steps  107] [reward 108.0]
----------
[TEST Episode 3450] [Average Reward 198.5]
----------
[Episode 3500/4000] [Steps  199] [reward 200.0]
----------
[TEST Episode 3500] [Average Reward 134.7]
----------
[Episode 3550/4000] [Steps  205] [reward 206.0]
----------
[TEST Episode 3550] [Average Reward 112.1]
----------
[Episode 3600/4000] [Steps  140] [reward 141.0]
----------
[TEST Episode 3600] [Average Reward 107.3]
----------
[Episode 3650/4000] [Steps   55] [reward 56.0]
----------
[TEST Episode 3650] [Average Reward 122.7]
----------
[Episode 3700/4000] [Steps   12] [reward 13.0]
----------
[TEST Episode 3700] [Average Reward 125.4]
----------
[Episode 3750/4000] [Steps   64] [reward 65.0]
----------
[TEST Episode 3750] [Average Reward 106.5]
----------
[Episode 3800/4000] [Steps   70] [reward 71.0]
----------
[TEST Episode 3800] [Average Reward 79.3]
----------
[Episode 3850/4000] [Steps  117] [reward 118.0]
----------
[TEST Episode 3850] [Average Reward 500.0]
----------
[Episode 3900/4000] [Steps  164] [reward 165.0]
----------
[TEST Episode 3900] [Average Reward 68.2]
----------
[Episode 3950/4000] [Steps  112] [reward 113.0]
----------
[TEST Episode 3950] [Average Reward 202.7]
----------
[Episode 4000/4000] [Steps   45] [reward 46.0]
----------
[TEST Episode 4000] [Average Reward 43.3]
----------

### 2.4 [5 pts] Add the Experience Replay Buffer

If you read the DQN paper (and as you can see from the algorithm picture above), the authors make use of an experience replay buffer to learn faster. We provide an implementation in the file `replay_buffer.py`. Update the `train_reinforcement_learning` code to push a tuple to the replay buffer and to sample a batch for the `optimize_model` function.

In [None]:
## PASTE YOUR TERMINAL OUTPUT HERE
# NOTE: TO HAVE LESS LINES PRINTED, YOU CAN SET THE VARIABLE PRINT_INTERVAL TO 5 or 10

[Episode   50/4000] [Steps    8] [reward 9.0]
----------
saving model.
[TEST Episode 50] [Average Reward 9.2]
----------
[Episode  100/4000] [Steps    9] [reward 10.0]
----------
saving model.
[TEST Episode 100] [Average Reward 9.4]
----------
[Episode  150/4000] [Steps   15] [reward 16.0]
----------
saving model.
[TEST Episode 150] [Average Reward 14.2]
----------
[Episode  200/4000] [Steps   15] [reward 16.0]
----------
saving model.
[TEST Episode 200] [Average Reward 18.5]
----------
[Episode  250/4000] [Steps   44] [reward 45.0]
----------
saving model.
[TEST Episode 250] [Average Reward 56.9]
----------
[Episode  300/4000] [Steps   29] [reward 30.0]
----------
[TEST Episode 300] [Average Reward 36.8]
----------
[Episode  350/4000] [Steps   60] [reward 61.0]
----------
[TEST Episode 350] [Average Reward 52.2]
----------
[Episode  400/4000] [Steps  365] [reward 366.0]
----------
saving model.
[TEST Episode 400] [Average Reward 500.0]
----------
[Episode  450/4000] [Steps  170] [reward 171.0]
----------
[TEST Episode 450] [Average Reward 175.7]
----------
[Episode  500/4000] [Steps  130] [reward 131.0]
----------
[TEST Episode 500] [Average Reward 117.9]
----------
[Episode  550/4000] [Steps  120] [reward 121.0]
----------
[TEST Episode 550] [Average Reward 124.6]
----------
[Episode  600/4000] [Steps  125] [reward 126.0]
----------
[TEST Episode 600] [Average Reward 146.7]
----------
[Episode  650/4000] [Steps  329] [reward 330.0]
----------
[TEST Episode 650] [Average Reward 427.4]
----------
[Episode  700/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 700] [Average Reward 497.8]
----------
[Episode  750/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 750] [Average Reward 500.0]
----------
[Episode  800/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 800] [Average Reward 500.0]
----------
[Episode  850/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 850] [Average Reward 500.0]
----------
[Episode  900/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 900] [Average Reward 500.0]
----------
[Episode  950/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 950] [Average Reward 491.5]
----------
[Episode 1000/4000] [Steps  486] [reward 487.0]
----------
[TEST Episode 1000] [Average Reward 490.0]
----------
[Episode 1050/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 1050] [Average Reward 500.0]
----------
[Episode 1100/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 1100] [Average Reward 500.0]
----------
[Episode 1150/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 1150] [Average Reward 500.0]
----------
[Episode 1200/4000] [Steps  490] [reward 491.0]
----------
[TEST Episode 1200] [Average Reward 423.1]
----------
[Episode 1250/4000] [Steps  124] [reward 125.0]
----------
[TEST Episode 1250] [Average Reward 258.5]
----------
[Episode 1300/4000] [Steps  146] [reward 147.0]
----------
[TEST Episode 1300] [Average Reward 166.9]
----------
[Episode 1350/4000] [Steps  251] [reward 252.0]
----------
[TEST Episode 1350] [Average Reward 279.5]
----------
[Episode 1400/4000] [Steps  286] [reward 287.0]
----------
[TEST Episode 1400] [Average Reward 191.9]
----------
[Episode 1450/4000] [Steps  152] [reward 153.0]
----------
[TEST Episode 1450] [Average Reward 192.9]
----------
[Episode 1500/4000] [Steps  269] [reward 270.0]
----------
[TEST Episode 1500] [Average Reward 235.3]
----------
[Episode 1550/4000] [Steps  354] [reward 355.0]
----------
[TEST Episode 1550] [Average Reward 293.2]
----------
[Episode 1600/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 1600] [Average Reward 500.0]
----------
[Episode 1650/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 1650] [Average Reward 500.0]
----------
[Episode 1700/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 1700] [Average Reward 500.0]
----------
[Episode 1750/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 1750] [Average Reward 500.0]
----------
[Episode 1800/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 1800] [Average Reward 500.0]
----------
[Episode 1850/4000] [Steps  198] [reward 199.0]
----------
[TEST Episode 1850] [Average Reward 158.2]
----------
[Episode 1900/4000] [Steps  443] [reward 444.0]
----------
[TEST Episode 1900] [Average Reward 451.0]
----------
[Episode 1950/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 1950] [Average Reward 500.0]
----------
[Episode 2000/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 2000] [Average Reward 500.0]
----------
[Episode 2050/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 2050] [Average Reward 500.0]
----------
[Episode 2100/4000] [Steps  240] [reward 241.0]
----------
[TEST Episode 2100] [Average Reward 212.6]
----------
[Episode 2150/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 2150] [Average Reward 500.0]
----------
[Episode 2200/4000] [Steps  270] [reward 271.0]
----------
[TEST Episode 2200] [Average Reward 258.0]
----------
[Episode 2250/4000] [Steps  259] [reward 260.0]
----------
[TEST Episode 2250] [Average Reward 265.2]
----------
[Episode 2300/4000] [Steps  275] [reward 276.0]
----------
[TEST Episode 2300] [Average Reward 261.9]
----------
[Episode 2350/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 2350] [Average Reward 500.0]
----------
[Episode 2400/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 2400] [Average Reward 500.0]
----------
[Episode 2450/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 2450] [Average Reward 500.0]
----------
[Episode 2500/4000] [Steps  219] [reward 220.0]
----------
[TEST Episode 2500] [Average Reward 500.0]
----------
[Episode 2550/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 2550] [Average Reward 500.0]
----------
[Episode 2600/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 2600] [Average Reward 500.0]
----------
[Episode 2650/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 2650] [Average Reward 500.0]
----------
[Episode 2700/4000] [Steps  119] [reward 120.0]
----------
[TEST Episode 2700] [Average Reward 500.0]
----------
[Episode 2750/4000] [Steps  266] [reward 267.0]
----------
[TEST Episode 2750] [Average Reward 252.2]
----------
[Episode 2800/4000] [Steps  237] [reward 238.0]
----------
[TEST Episode 2800] [Average Reward 241.3]
----------
[Episode 2850/4000] [Steps  362] [reward 363.0]
----------
[TEST Episode 2850] [Average Reward 500.0]
----------
[Episode 2900/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 2900] [Average Reward 500.0]
----------
[Episode 2950/4000] [Steps  184] [reward 185.0]
----------
[TEST Episode 2950] [Average Reward 437.3]
----------
[Episode 3000/4000] [Steps   21] [reward 22.0]
----------
[TEST Episode 3000] [Average Reward 500.0]
----------
[Episode 3050/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 3050] [Average Reward 500.0]
----------
[Episode 3100/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 3100] [Average Reward 500.0]
----------
[Episode 3150/4000] [Steps   25] [reward 26.0]
----------
[TEST Episode 3150] [Average Reward 500.0]
----------
[Episode 3200/4000] [Steps  232] [reward 233.0]
----------
[TEST Episode 3200] [Average Reward 500.0]
----------
[Episode 3250/4000] [Steps  487] [reward 488.0]
----------
[TEST Episode 3250] [Average Reward 500.0]
----------
[Episode 3300/4000] [Steps  307] [reward 308.0]
----------
[TEST Episode 3300] [Average Reward 500.0]
----------
[Episode 3350/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 3350] [Average Reward 500.0]
----------
[Episode 3400/4000] [Steps  153] [reward 154.0]
----------
[TEST Episode 3400] [Average Reward 103.6]
----------
[Episode 3450/4000] [Steps   75] [reward 76.0]
----------
[TEST Episode 3450] [Average Reward 500.0]
----------
[Episode 3500/4000] [Steps   53] [reward 54.0]
----------
[TEST Episode 3500] [Average Reward 500.0]
----------
[Episode 3550/4000] [Steps   12] [reward 13.0]
----------
[TEST Episode 3550] [Average Reward 500.0]
----------
[Episode 3600/4000] [Steps  126] [reward 127.0]
----------
[TEST Episode 3600] [Average Reward 500.0]
----------
[Episode 3650/4000] [Steps  465] [reward 466.0]
----------
[TEST Episode 3650] [Average Reward 220.6]
----------
[Episode 3700/4000] [Steps  362] [reward 363.0]
----------
[TEST Episode 3700] [Average Reward 500.0]
----------
[Episode 3750/4000] [Steps  123] [reward 124.0]
----------
[TEST Episode 3750] [Average Reward 145.6]
----------
[Episode 3800/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 3800] [Average Reward 500.0]
----------
[Episode 3850/4000] [Steps   20] [reward 21.0]
----------
[TEST Episode 3850] [Average Reward 500.0]
----------
[Episode 3900/4000] [Steps   95] [reward 96.0]
----------
[TEST Episode 3900] [Average Reward 419.3]
----------
[Episode 3950/4000] [Steps   30] [reward 31.0]
----------
[TEST Episode 3950] [Average Reward 334.2]
----------
[Episode 4000/4000] [Steps  499] [reward 500.0]
----------
[TEST Episode 4000] [Average Reward 500.0]
----------

How does the replay buffer improve performance?

Firstly, the policy converges quickly (around 700 to 1000 episodes). Secondly, the replay buffer provides a more i.i.d like dataset so the policy not only explores very well, but also stablizes in the late training phase. For example, at around 3500 episodes, the policy can perform consistently with 500 average rewards compared to without replay buffer.

## 3. Extra (fully optional)

Ideas to experiment with:

- Is $\epsilon$-greedy strategy the best strategy available? Experiment with other strategies.
- Make use of the model you have trained in the behavioral cloning part and fine-tune it with RL. How does that affect performance?
- You are perhaps bored with `CartPole-v1` by now. Another environment we suggest trying is `LunarLander-v2`. It will be harder to learn but with experimentation, you will find the correct optimizations for success. Piazza is also your friend :)
- What about learning from images? This requires more work because you have to extract the image from the environment. How much more challenging might you expect the learning to be in this case?
- An improvement over DQN is DoubleDQN. Experiment with this to see how much of an impact it makes.



In [None]:
# YOU CAN USE THIS CODEBLOCK AND ADD ANY BLOCK BELOW AS YOU NEED
# TO SHOW US THE IDEAS AND EXTRA EXPERIMENTS YOU RUN.
# HAVE FUN!