# CPSC 533V: Assignment 3 - Behavioral Cloning and Deep Q Learning

## 48 points total (9% of final grade)

---
This assignment will help you transition from tabular approaches, the topic of HW 2, to deep neural network approaches. You will implement the [Atari DQN / Deep Q-Learning](https://arxiv.org/abs/1312.5602) algorithm, which arguably kicked off the modern Deep Reinforcement Learning craze.

In this assignment we will use PyTorch as our deep learning framework. To familiarize yourself with PyTorch, your first task is to use a behavior cloning (BC) approach to learn a policy. Behavior cloning is a supervised learning method in which there exists a dataset of expert demonstrations (state-action pairs) and the goal is to learn a policy $\pi$ that mimics this expert. At any given state, your policy should choose the same action the expert would.

Since BC avoids the need to collect data from the policy you are trying to learn, it is relatively simple. This makes it a nice stepping stone for implementing DQN. Furthermore, BC is relevant to modern approaches: for example, it is used to initialize systems like [AlphaGo][go] and [AlphaStar][star], which then use RL to further adapt the BC result.

<!--

I feel like this might be better suited to going lower in the document:

Unfortunately, in many tasks it is impossible to collect good expert demonstrations, making

it's not always possible to have good expert demonstrations for a task in an environemnt and this is where reinforcement learning comes handy. Through the reward signal retrieved by interacting with the environment, the agent learns by itself what is a good policy and can learn to outperform the experts.

-->

Goals:
- Familiarize yourself with PyTorch and its API, including models, datasets, and dataloaders.
- Implement a supervised learning approach (behavioral cloning) to learn a policy.
- Implement the DQN objective and learn a policy through environment interaction.

[go]:  https://deepmind.com/research/case-studies/alphago-the-story-so-far
[star]: https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii

## Submission information

- Complete the assignment by editing and executing the associated Python files.
- Copy and paste the code and the terminal output requested into the predefined cells in this Jupyter notebook.
- When done, upload the completed Jupyter notebook (ipynb file) to Canvas.

## Task 0: Preliminaries

### PyTorch

If you have never used PyTorch before, we recommend you follow the [60 Minute Blitz][blitz] tutorial from the official website. It should give you enough context to complete the assignment.


**If you have issues, post questions to Piazza**

### Installation

To install all required python packages:

```
python3 -m pip install -r requirements.txt
```

### Debugging


You can insert `import ipdb; ipdb.set_trace()` in your code. Execution will pause at that line and drop you into an interactive debugger, where you can inspect variables and test out expressions. We recommend this as an effective method to debug the algorithms.


[blitz]: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html

## Task 1: Behavioral Cloning

Behavioral Cloning is a type of supervised learning in which you are given a dataset of expert demonstration tuples $(s, a)$ and the goal is to learn a policy function $\hat a = \pi(s)$, such that $\hat a = a$.

The optimization objective is $\min_\theta D(\pi(s), a)$, where $\theta$ are the parameters of the policy $\pi$ (in our case the weights of a neural network) and $D$ represents some measure of difference between the actions.
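
To make the objective concrete, here is a minimal sketch of one possible choice of $D$ for discrete actions: the cross-entropy between the policy's action scores and the expert's action. The helper names `softmax` and `cross_entropy` are made up for this illustration, and cross-entropy is an assumption here, not necessarily the only loss that fits the assignment.

```python
import math


def softmax(logits):
    """Turn raw action scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, expert_action):
    """D(pi(s), a): negative log-probability the policy assigns to the expert's action."""
    return -math.log(softmax(logits)[expert_action])


# Policy scores for the two CartPole actions in some state s.
logits = [2.0, 0.0]

loss_agree = cross_entropy(logits, expert_action=0)     # policy already prefers action 0
loss_disagree = cross_entropy(logits, expert_action=1)  # policy prefers the wrong action

print(f"loss when agreeing with the expert:    {loss_agree:.4f}")
print(f"loss when disagreeing with the expert: {loss_disagree:.4f}")
```

The loss is small when the policy puts most of its probability on the expert's action and large otherwise, which is the behavior the objective asks for.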

---

Before starting, we suggest reading through the provided files.

For Behavioral Cloning, the important files to understand are: `model.py`, `dataset.py` and `bc.py`.

- `model.py` contains the skeleton for the model (which you will complete in the following questions).

- `dataset.py` contains the skeleton for the dataset the model is trained on.

- `bc.py` contains the structure for training the model on the dataset.


### 1.1 Dataset

We provide a pickle file with pre-collected expert demonstrations on CartPole from which to learn the policy $\pi$. The data has been collected from an expert policy on the environment, with the addition of a small amount of Gaussian noise to the actions.

The pickle file contains a list of (state, action) tuples stored as `numpy` values, in the following format:

```
[(state s, action a), (state s, action a), (state s, action a), ...]
```

In the `dataset.py` file, we provide skeleton code for creating a custom dataset. The provided code shows how to load the file.

Your goal is to override the `__getitem__` function so that it returns a dictionary of tensors of the correct type.

Hint: Look in the `bc.py` file to understand how the dataset is used.
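
As a rough illustration of how a custom dataset and a `DataLoader` fit together, the sketch below builds a small in-memory dataset of fake demonstrations. `ToyDemoDataset` and the random data are hypothetical stand-ins (the real class lives in `dataset.py` and loads the pickle file), and the dictionary keys `state` and `action` are an assumption that should be checked against `bc.py`.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDemoDataset(Dataset):
    """Hypothetical stand-in for the dataset class in dataset.py."""

    def __init__(self, data):
        self.data = data  # list of (state, action) tuples

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        state, action = self.data[index]
        return {
            "state": torch.as_tensor(state, dtype=torch.float32),
            "action": torch.as_tensor(action, dtype=torch.long),
        }


# Eight fake CartPole-like samples: 4-dimensional states, actions in {0, 1}.
rng = np.random.default_rng(0)
demos = [(rng.normal(size=4), int(rng.integers(0, 2))) for _ in range(8)]

loader = DataLoader(ToyDemoDataset(demos), batch_size=4, shuffle=False)
batch = next(iter(loader))

print(batch["state"].shape, batch["state"].dtype)
print(batch["action"].shape, batch["action"].dtype)
```

The default collate function stacks the per-sample dictionaries into one dictionary of batched tensors, so returning correctly typed tensors from `__getitem__` is enough for the training loop to receive ready-to-use batches.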

Answer the following questions:

- **[QUESTION 2 points]** Insert your code in the placeholder below.

In [5]:
# PLACEHOLDER TO INSERT YOUR __getitem__ method here

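def __getitem__(self, index):
    state, action = self.data[index]  # (state, action) tuple of numpy values

    # Return tensors: float32 states for the network, int64 actions for the loss.
    return {'state': torch.from_numpy(state).float(), 'action': torch.tensor(action, dtype=torch.long)}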
    

### 1.2 Environment

Recall the state and action spaces of CartPole from the previous assignment.

- **[QUESTION 2 points]** Considering the full state and action spaces, do you think the provided expert dataset has good coverage?  Why or why not? How might this impact the performance of our cloned policy?

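The dataset does not have good coverage. Because the demonstrations come from an expert policy with only a small amount of noise, they concentrate on the states the expert visits. This might lead to significant errors if the agent enters out-of-distribution states (or state-action pairs).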

### 1.3 Model

The file `model.py` provides skeleton code for the model. Your goal is to create the architecture of the network by adding layers that map the input to output.

You will need to update the `__init__` method and the `forward` method.

The `select_action` method has already been written for you.  This should be used when running the policy in the environment, while the `forward` function should be used at training time.

- **[QUESTION 5 points]** Insert your code in the placeholder below.

In [4]:
# PLACEHOLDER TO INSERT YOUR MyModel class here

class MyModel(nn.Module):
    def __init__(self, state_size, action_size):
        super(MyModel, self).__init__()
        self.fc1 = nn.Linear(state_size, 120)
        self.fc2 = nn.Linear(120, action_size)
        self.rl1 = nn.ReLU(inplace=True)

    def forward(self, x):
        x = x.to(torch.float32)
        x = self.fc1(x)
        x = self.rl1(x)
        x = self.fc2(x)
        x = self.rl1(x)
        return x

    def select_action(self, state):
        self.eval()
        x = self.forward(state)
        self.train()
        return x.max(1)[1].view(1, 1).to(torch.long)

Answer the following questions:

- **[QUESTION 2 points]** What is the input of the network?

A tensor (or a batch of tensors) representing the state(s). For CartPole, each state is a vector of size `state_size` = 4.

- **[QUESTION 2 points]** What is the output?

A tensor of size `action_size` (2 for CartPole), with one score per action.


### 1.4 Training

The file `bc.py` is the entry point for training your behavioral cloning model. The skeleton and the main components are already there.

The missing parts for you to do are:

- Initializing the model
- Choosing a loss function
- Choosing an optimizer
- Playing with hyperparameters to train your model.
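
The steps above can be sketched as a single supervised training loop. This is only an illustration under stated assumptions: the small `nn.Sequential` network, the Adam optimizer, the learning rate, and the toy labels are all placeholders chosen for this example, not the choices the assignment requires.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# 1. Initialize a model: 4 state features in, 2 action scores out (CartPole sizes).
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))

# 2. Choose a loss: cross-entropy treats BC as classification over discrete actions.
loss_function = nn.CrossEntropyLoss()

# 3. Choose an optimizer (Adam and lr=1e-2 are arbitrary choices for this sketch).
optimizer = optim.Adam(policy.parameters(), lr=1e-2)

# Fake batch: random states, labelled by a toy rule (NOT the real expert).
states = torch.randn(64, 4)
actions = (states[:, 2] > 0).long()

losses = []
for _ in range(100):
    logits = policy(states)                # forward pass: one score per action
    loss = loss_function(logits, actions)  # compare against the "expert" actions
    optimizer.zero_grad()                  # clear gradients from the previous step
    loss.backward()                        # backpropagate
    optimizer.step()                       # update the weights
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

Swapping the optimizer, learning rate, or hidden size in a loop like this is one simple way to explore hyperparameters before running the full training script.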

- **[QUESTION 5 points]** Insert your code in the placeholder below.

In [5]:
# PLACEHOLDER FOR YOUR CODE HERE
# HOW DID YOU INITIALIZE YOUR MODEL, OPTIMIZER AND LOSS FUNCTIONS? PASTE HERE YOUR FINAL CODE
# NOTE: YOU CAN KEEP THE FOLLOWING LINES COMMENTED OUT, AS RUNNING THIS CELL WILL PROBABLY RESULT IN ERRORS

model = MyModel(4, 2)
optimizer = optim.RMSprop(model.parameters(), lr=1e-2, momentum=0.9)
loss_function = nn.CrossEntropyLoss()

You can run your code by doing:

```
python3 bc.py
```

**Throughout this assignment, the code in `eval_policy.py` will be your best friend.** At any time, you can test your model by passing the path to the model weights and the environment name to the following command:

```
python3 eval_policy.py --model-path /path/to/model/weights --env ENV_NAME
```

In [4]:
## PASTE YOUR TERMINAL OUTPUT HERE
[Episode    0/10] [reward 200.0]
[Episode    1/10] [reward 196.0]
[Episode    2/10] [reward 189.0]
[Episode    3/10] [reward 200.0]
[Episode    4/10] [reward 188.0]
[Episode    5/10] [reward 200.0]
[Episode    6/10] [reward 200.0]
[Episode    7/10] [reward 200.0]
[Episode    8/10] [reward 200.0]
[Episode    9/10] [reward 200.0]

[epoch    1/100] [iter       0] [loss 0.70160]
[epoch    1/100] [iter     500] [loss 0.36842]
[epoch    1/100] [iter    1000] [loss 0.37924]
[epoch    1/100] [iter    1500] [loss 0.13513]
[epoch    2/100] [iter    2000] [loss 0.01085]
[epoch    2/100] [iter    2500] [loss 0.15105]
[epoch    2/100] [iter    3000] [loss 0.02183]
[Test on environment] [epoch 2/100] [score 198.10]
[epoch    3/100] [iter    3500] [loss 0.02203]
[epoch    3/100] [iter    4000] [loss 0.09750]
[epoch    3/100] [iter    4500] [loss 0.02166]
[epoch    4/100] [iter    5000] [loss 0.05470]
[epoch    4/100] [iter    5500] [loss 0.01086]
[epoch    4/100] [iter    6000] [loss 0.03260]
[Test on environment] [epoch 4/100] [score 196.50]
[epoch    5/100] [iter    6500] [loss 0.05430]
[epoch    5/100] [iter    7000] [loss 0.03250]
[epoch    5/100] [iter    7500] [loss 0.05836]
[epoch    6/100] [iter    8000] [loss 0.09748]
[epoch    6/100] [iter    8500] [loss 0.02263]
[epoch    6/100] [iter    9000] [loss 0.03423]
[Test on environment] [epoch 6/100] [score 197.30]
[epoch    7/100] [iter    9500] [loss 0.10399]
[epoch    7/100] [iter   10000] [loss 0.05735]
[epoch    7/100] [iter   10500] [loss 0.04610]
[epoch    8/100] [iter   11000] [loss 0.01263]
[epoch    8/100] [iter   11500] [loss 0.06111]
[epoch    8/100] [iter   12000] [loss 0.07581]
[Test on environment] [epoch 8/100] [score 195.30]
[epoch    9/100] [iter   12500] [loss 0.04340]
[epoch    9/100] [iter   13000] [loss 0.02166]
[epoch    9/100] [iter   13500] [loss 0.03370]
[epoch    9/100] [iter   14000] [loss 0.05505]
[epoch   10/100] [iter   14500] [loss 0.13011]
[epoch   10/100] [iter   15000] [loss 0.04332]
[epoch   10/100] [iter   15500] [loss 0.04332]
[Test on environment] [epoch 10/100] [score 197.40]
[epoch   11/100] [iter   16000] [loss 0.04333]
[epoch   11/100] [iter   16500] [loss 0.05415]
[epoch   11/100] [iter   17000] [loss 0.02166]
[epoch   12/100] [iter   17500] [loss 0.02509]
[epoch   12/100] [iter   18000] [loss 0.03322]
[epoch   12/100] [iter   18500] [loss 0.02362]
[Test on environment] [epoch 12/100] [score 200.00]
[epoch   13/100] [iter   19000] [loss 0.08366]
[epoch   13/100] [iter   19500] [loss 0.04332]
[epoch   13/100] [iter   20000] [loss 0.01022]
[epoch   14/100] [iter   20500] [loss 0.00006]
[epoch   14/100] [iter   21000] [loss 0.02325]
[epoch   14/100] [iter   21500] [loss 0.05425]
[Test on environment] [epoch 14/100] [score 198.10]
[epoch   15/100] [iter   22000] [loss 0.01085]
[epoch   15/100] [iter   22500] [loss 0.01083]
[epoch   15/100] [iter   23000] [loss 0.01084]
[epoch   16/100] [iter   23500] [loss 0.09022]
[epoch   16/100] [iter   24000] [loss 0.02166]
[epoch   16/100] [iter   24500] [loss 0.03408]
[Test on environment] [epoch 16/100] [score 200.00]
[epoch   17/100] [iter   25000] [loss 0.00105]
[epoch   17/100] [iter   25500] [loss 0.01595]
[epoch   17/100] [iter   26000] [loss 0.02166]
[epoch   18/100] [iter   26500] [loss 0.04375]
[epoch   18/100] [iter   27000] [loss 0.02185]
[epoch   18/100] [iter   27500] [loss 0.03249]
[epoch   18/100] [iter   28000] [loss 0.09273]
[Test on environment] [epoch 18/100] [score 198.90]
[epoch   19/100] [iter   28500] [loss 0.02929]
[epoch   19/100] [iter   29000] [loss 0.03340]
[epoch   19/100] [iter   29500] [loss 0.02236]
[epoch   20/100] [iter   30000] [loss 0.03249]
[epoch   20/100] [iter   30500] [loss 0.02166]
[epoch   20/100] [iter   31000] [loss 0.01083]
[Test on environment] [epoch 20/100] [score 200.00]
[epoch   21/100] [iter   31500] [loss 0.01089]
[epoch   21/100] [iter   32000] [loss 0.02166]
[epoch   21/100] [iter   32500] [loss 0.02166]
[epoch   22/100] [iter   33000] [loss 0.00000]
[epoch   22/100] [iter   33500] [loss 0.01083]
[epoch   22/100] [iter   34000] [loss 0.03249]
[Test on environment] [epoch 22/100] [score 197.70]
[epoch   23/100] [iter   34500] [loss 0.01167]
[epoch   23/100] [iter   35000] [loss 0.01083]
[epoch   23/100] [iter   35500] [loss 0.05202]
[epoch   24/100] [iter   36000] [loss 0.02166]
[epoch   24/100] [iter   36500] [loss 0.01083]
[epoch   24/100] [iter   37000] [loss 0.04332]
[Test on environment] [epoch 24/100] [score 198.80]
[epoch   25/100] [iter   37500] [loss 0.04332]
[epoch   25/100] [iter   38000] [loss 0.05438]
[epoch   25/100] [iter   38500] [loss 0.00000]
[epoch   26/100] [iter   39000] [loss 0.01767]
[epoch   26/100] [iter   39500] [loss 0.04333]
[epoch   26/100] [iter   40000] [loss 0.02166]
[epoch   26/100] [iter   40500] [loss 0.04332]
[Test on environment] [epoch 26/100] [score 199.40]
[epoch   27/100] [iter   41000] [loss 0.02166]
[epoch   27/100] [iter   41500] [loss 0.03260]
[epoch   27/100] [iter   42000] [loss 0.01083]
[epoch   28/100] [iter   42500] [loss 0.03250]
[epoch   28/100] [iter   43000] [loss 0.01453]
[epoch   28/100] [iter   43500] [loss 0.07608]
[Test on environment] [epoch 28/100] [score 198.80]
[epoch   29/100] [iter   44000] [loss 0.02180]
[epoch   29/100] [iter   44500] [loss 0.00039]
[epoch   29/100] [iter   45000] [loss 0.04332]
[epoch   30/100] [iter   45500] [loss 0.03249]
[epoch   30/100] [iter   46000] [loss 0.00000]
[epoch   30/100] [iter   46500] [loss 0.02178]
[Test on environment] [epoch 30/100] [score 198.70]
[epoch   31/100] [iter   47000] [loss 0.00086]
[epoch   31/100] [iter   47500] [loss 0.01166]
[epoch   31/100] [iter   48000] [loss 0.02345]
[epoch   32/100] [iter   48500] [loss 0.02167]
[epoch   32/100] [iter   49000] [loss 0.02167]
[epoch   32/100] [iter   49500] [loss 0.00000]
[Test on environment] [epoch 32/100] [score 198.90]
[epoch   33/100] [iter   50000] [loss 0.02874]
[epoch   33/100] [iter   50500] [loss 0.01083]
[epoch   33/100] [iter   51000] [loss 0.03249]
[epoch   34/100] [iter   51500] [loss 0.35175]
[epoch   34/100] [iter   52000] [loss 0.04339]
[epoch   34/100] [iter   52500] [loss 0.01094]
[Test on environment] [epoch 34/100] [score 198.60]
[epoch   35/100] [iter   53000] [loss 0.01083]
[epoch   35/100] [iter   53500] [loss 0.03901]
[epoch   35/100] [iter   54000] [loss 0.01083]
[epoch   35/100] [iter   54500] [loss 0.08763]
[epoch   36/100] [iter   55000] [loss 0.01482]
[epoch   36/100] [iter   55500] [loss 0.02740]
[epoch   36/100] [iter   56000] [loss 0.00000]
[Test on environment] [epoch 36/100] [score 197.90]
[epoch   37/100] [iter   56500] [loss 0.03303]
[epoch   37/100] [iter   57000] [loss 0.01901]
[epoch   37/100] [iter   57500] [loss 0.01083]
[epoch   38/100] [iter   58000] [loss 0.03249]
[epoch   38/100] [iter   58500] [loss 0.04332]
[epoch   38/100] [iter   59000] [loss 0.01094]
[Test on environment] [epoch 38/100] [score 195.10]
[epoch   39/100] [iter   59500] [loss 0.02166]
[epoch   39/100] [iter   60000] [loss 0.08373]
[epoch   39/100] [iter   60500] [loss 0.02166]
[epoch   40/100] [iter   61000] [loss 0.05415]
[epoch   40/100] [iter   61500] [loss 0.03249]
[epoch   40/100] [iter   62000] [loss 0.03249]
[Test on environment] [epoch 40/100] [score 198.60]
[epoch   41/100] [iter   62500] [loss 0.07582]
[epoch   41/100] [iter   63000] [loss 0.02167]
[epoch   41/100] [iter   63500] [loss 0.00000]
[epoch   42/100] [iter   64000] [loss 0.01083]
[epoch   42/100] [iter   64500] [loss 0.03249]
[epoch   42/100] [iter   65000] [loss 0.02249]
[Test on environment] [epoch 42/100] [score 200.00]
[epoch   43/100] [iter   65500] [loss 0.01083]
[epoch   43/100] [iter   66000] [loss 0.04332]
[epoch   43/100] [iter   66500] [loss 0.02329]
[epoch   44/100] [iter   67000] [loss 0.02195]
[epoch   44/100] [iter   67500] [loss 0.05416]
[epoch   44/100] [iter   68000] [loss 0.00000]
[epoch   44/100] [iter   68500] [loss 0.03249]
[Test on environment] [epoch 44/100] [score 199.70]
[epoch   45/100] [iter   69000] [loss 0.00000]
[epoch   45/100] [iter   69500] [loss 0.01286]
[epoch   45/100] [iter   70000] [loss 0.02166]
[epoch   46/100] [iter   70500] [loss 0.01084]
[epoch   46/100] [iter   71000] [loss 0.05654]
[epoch   46/100] [iter   71500] [loss 0.00044]
[Test on environment] [epoch 46/100] [score 198.90]
[epoch   47/100] [iter   72000] [loss 0.02352]
[epoch   47/100] [iter   72500] [loss 0.02166]
[epoch   47/100] [iter   73000] [loss 0.04332]
[epoch   48/100] [iter   73500] [loss 0.01083]
[epoch   48/100] [iter   74000] [loss 0.03384]
[epoch   48/100] [iter   74500] [loss 0.01528]
[Test on environment] [epoch 48/100] [score 198.40]
[epoch   49/100] [iter   75000] [loss 0.01086]
[epoch   49/100] [iter   75500] [loss 0.01083]
[epoch   49/100] [iter   76000] [loss 0.03268]
[epoch   50/100] [iter   76500] [loss 0.02166]
[epoch   50/100] [iter   77000] [loss 0.03249]
[epoch   50/100] [iter   77500] [loss 0.02166]
[Test on environment] [epoch 50/100] [score 199.10]
[epoch   51/100] [iter   78000] [loss 0.01083]
[epoch   51/100] [iter   78500] [loss 0.01083]
[epoch   51/100] [iter   79000] [loss 0.04406]
[epoch   52/100] [iter   79500] [loss 0.00000]
[epoch   52/100] [iter   80000] [loss 0.00024]
[epoch   52/100] [iter   80500] [loss 0.01083]
[epoch   52/100] [iter   81000] [loss 0.01083]
[Test on environment] [epoch 52/100] [score 200.00]
[epoch   53/100] [iter   81500] [loss 0.01083]
[epoch   53/100] [iter   82000] [loss 0.03249]
[epoch   53/100] [iter   82500] [loss 0.05814]
[epoch   54/100] [iter   83000] [loss 0.34556]
[epoch   54/100] [iter   83500] [loss 0.01083]
[epoch   54/100] [iter   84000] [loss 0.04164]
[Test on environment] [epoch 54/100] [score 198.80]
[epoch   55/100] [iter   84500] [loss 0.03249]
[epoch   55/100] [iter   85000] [loss 0.01083]
[epoch   55/100] [iter   85500] [loss 0.02166]
[epoch   56/100] [iter   86000] [loss 1.75419]
[epoch   56/100] [iter   86500] [loss 0.15672]
[epoch   56/100] [iter   87000] [loss 0.01083]
[Test on environment] [epoch 56/100] [score 199.10]
[epoch   57/100] [iter   87500] [loss 0.04333]
[epoch   57/100] [iter   88000] [loss 0.01083]
[epoch   57/100] [iter   88500] [loss 0.02167]
[epoch   58/100] [iter   89000] [loss 0.05749]
[epoch   58/100] [iter   89500] [loss 0.01083]
[epoch   58/100] [iter   90000] [loss 0.05415]
[Test on environment] [epoch 58/100] [score 199.40]
[epoch   59/100] [iter   90500] [loss 0.03250]
[epoch   59/100] [iter   91000] [loss 0.05417]
[epoch   59/100] [iter   91500] [loss 0.01083]
[epoch   60/100] [iter   92000] [loss 0.00000]
[epoch   60/100] [iter   92500] [loss 0.01530]
[epoch   60/100] [iter   93000] [loss 0.01083]
[Test on environment] [epoch 60/100] [score 198.90]
[epoch   61/100] [iter   93500] [loss 0.00013]
[epoch   61/100] [iter   94000] [loss 0.01083]
[epoch   61/100] [iter   94500] [loss 0.00000]
[epoch   61/100] [iter   95000] [loss 0.02166]
[epoch   62/100] [iter   95500] [loss 0.02166]
[epoch   62/100] [iter   96000] [loss 0.01083]
[epoch   62/100] [iter   96500] [loss 0.00045]
[Test on environment] [epoch 62/100] [score 199.20]
[epoch   63/100] [iter   97000] [loss 0.00359]
[epoch   63/100] [iter   97500] [loss 0.02166]
[epoch   63/100] [iter   98000] [loss 0.02166]
[epoch   64/100] [iter   98500] [loss 0.01086]
[epoch   64/100] [iter   99000] [loss 0.01083]
[epoch   64/100] [iter   99500] [loss 0.01102]
[Test on environment] [epoch 64/100] [score 200.00]
[epoch   65/100] [iter  100000] [loss 0.02166]
[epoch   65/100] [iter  100500] [loss 0.03287]
[epoch   65/100] [iter  101000] [loss 0.03867]
[epoch   66/100] [iter  101500] [loss 0.02166]
[epoch   66/100] [iter  102000] [loss 0.05415]
[epoch   66/100] [iter  102500] [loss 0.03102]
[Test on environment] [epoch 66/100] [score 200.00]
[epoch   67/100] [iter  103000] [loss 0.04181]
[epoch   67/100] [iter  103500] [loss 0.02182]
[epoch   67/100] [iter  104000] [loss 0.03249]
[epoch   68/100] [iter  104500] [loss 0.02166]
[epoch   68/100] [iter  105000] [loss 0.05415]
[epoch   68/100] [iter  105500] [loss 0.02166]
[Test on environment] [epoch 68/100] [score 198.50]
[epoch   69/100] [iter  106000] [loss 0.02168]
[epoch   69/100] [iter  106500] [loss 0.00009]
[epoch   69/100] [iter  107000] [loss 0.02166]
[epoch   69/100] [iter  107500] [loss 0.02166]
[epoch   70/100] [iter  108000] [loss 0.02207]
[epoch   70/100] [iter  108500] [loss 0.03250]
[epoch   70/100] [iter  109000] [loss 0.01370]
[Test on environment] [epoch 70/100] [score 200.00]
[epoch   71/100] [iter  109500] [loss 0.01083]
[epoch   71/100] [iter  110000] [loss 0.02342]
[epoch   71/100] [iter  110500] [loss 0.02166]
[epoch   72/100] [iter  111000] [loss 0.02166]
[epoch   72/100] [iter  111500] [loss 0.04422]
[epoch   72/100] [iter  112000] [loss 0.55392]
[Test on environment] [epoch 72/100] [score 200.00]
[epoch   73/100] [iter  112500] [loss 0.01792]
[epoch   73/100] [iter  113000] [loss 0.03249]
[epoch   73/100] [iter  113500] [loss 0.01083]
[epoch   74/100] [iter  114000] [loss 0.01084]
[epoch   74/100] [iter  114500] [loss 0.11848]
[epoch   74/100] [iter  115000] [loss 0.00010]
[Test on environment] [epoch 74/100] [score 200.00]
[epoch   75/100] [iter  115500] [loss 0.06498]
[epoch   75/100] [iter  116000] [loss 0.02166]
[epoch   75/100] [iter  116500] [loss 0.01083]
[epoch   76/100] [iter  117000] [loss 0.01083]
[epoch   76/100] [iter  117500] [loss 0.02166]
[epoch   76/100] [iter  118000] [loss 0.00000]
[Test on environment] [epoch 76/100] [score 200.00]
[epoch   77/100] [iter  118500] [loss 0.01085]
[epoch   77/100] [iter  119000] [loss 0.01740]
[epoch   77/100] [iter  119500] [loss 0.00000]
[epoch   78/100] [iter  120000] [loss 0.01090]
[epoch   78/100] [iter  120500] [loss 0.02166]
[epoch   78/100] [iter  121000] [loss 0.01089]
[epoch   78/100] [iter  121500] [loss 0.04468]
[Test on environment] [epoch 78/100] [score 200.00]
[epoch   79/100] [iter  122000] [loss 0.02534]
[epoch   79/100] [iter  122500] [loss 0.06498]
[epoch   79/100] [iter  123000] [loss 0.05853]
[epoch   80/100] [iter  123500] [loss 0.01083]
[epoch   80/100] [iter  124000] [loss 0.00000]
[epoch   80/100] [iter  124500] [loss 0.00003]
[Test on environment] [epoch 80/100] [score 199.90]
[epoch   81/100] [iter  125000] [loss 0.04332]
[epoch   81/100] [iter  125500] [loss 0.00000]
[epoch   81/100] [iter  126000] [loss 0.01083]
[epoch   82/100] [iter  126500] [loss 0.00635]
[epoch   82/100] [iter  127000] [loss 0.03249]
[epoch   82/100] [iter  127500] [loss 0.01090]
[Test on environment] [epoch 82/100] [score 199.40]
[epoch   83/100] [iter  128000] [loss 0.02166]
[epoch   83/100] [iter  128500] [loss 0.01086]
[epoch   83/100] [iter  129000] [loss 0.01083]
[epoch   84/100] [iter  129500] [loss 0.01979]
[epoch   84/100] [iter  130000] [loss 0.07419]
[epoch   84/100] [iter  130500] [loss 0.00029]
[Test on environment] [epoch 84/100] [score 199.20]
[epoch   85/100] [iter  131000] [loss 0.01083]
[epoch   85/100] [iter  131500] [loss 0.01106]
[epoch   85/100] [iter  132000] [loss 0.09314]
[epoch   86/100] [iter  132500] [loss 0.00001]
[epoch   86/100] [iter  133000] [loss 0.02166]
[epoch   86/100] [iter  133500] [loss 0.09368]
[Test on environment] [epoch 86/100] [score 200.00]
[epoch   87/100] [iter  134000] [loss 0.03249]
[epoch   87/100] [iter  134500] [loss 0.03272]
[epoch   87/100] [iter  135000] [loss 0.02166]
[epoch   87/100] [iter  135500] [loss 0.00069]
[epoch   88/100] [iter  136000] [loss 0.01084]
[epoch   88/100] [iter  136500] [loss 0.01127]
[epoch   88/100] [iter  137000] [loss 0.01511]
[Test on environment] [epoch 88/100] [score 199.90]
[epoch   89/100] [iter  137500] [loss 0.01088]
[epoch   89/100] [iter  138000] [loss 0.22703]
[epoch   89/100] [iter  138500] [loss 0.01090]
[epoch   90/100] [iter  139000] [loss 0.03252]
[epoch   90/100] [iter  139500] [loss 0.01089]
[epoch   90/100] [iter  140000] [loss 0.03465]
[Test on environment] [epoch 90/100] [score 199.40]
[epoch   91/100] [iter  140500] [loss 0.01083]
[epoch   91/100] [iter  141000] [loss 0.05416]
[epoch   91/100] [iter  141500] [loss 0.02166]
[epoch   92/100] [iter  142000] [loss 0.01083]
[epoch   92/100] [iter  142500] [loss 0.00000]
[epoch   92/100] [iter  143000] [loss 0.00332]
[Test on environment] [epoch 92/100] [score 200.00]
[epoch   93/100] [iter  143500] [loss 0.02209]
[epoch   93/100] [iter  144000] [loss 0.01735]
[epoch   93/100] [iter  144500] [loss 0.03249]
[epoch   94/100] [iter  145000] [loss 0.02166]
[epoch   94/100] [iter  145500] [loss 0.01354]
[epoch   94/100] [iter  146000] [loss 0.00000]
[Test on environment] [epoch 94/100] [score 196.60]
[epoch   95/100] [iter  146500] [loss 0.03249]
[epoch   95/100] [iter  147000] [loss 0.00016]
[epoch   95/100] [iter  147500] [loss 0.03271]
[epoch   95/100] [iter  148000] [loss 0.00000]
[epoch   96/100] [iter  148500] [loss 0.02166]
[epoch   96/100] [iter  149000] [loss 0.01083]
[epoch   96/100] [iter  149500] [loss 0.03167]
[Test on environment] [epoch 96/100] [score 199.60]
[epoch   97/100] [iter  150000] [loss 0.00287]
[epoch   97/100] [iter  150500] [loss 0.00000]
[epoch   97/100] [iter  151000] [loss 0.00001]
[epoch   98/100] [iter  151500] [loss 0.02185]
[epoch   98/100] [iter  152000] [loss 0.02167]
[epoch   98/100] [iter  152500] [loss 0.00000]
[Test on environment] [epoch 98/100] [score 198.10]
[epoch   99/100] [iter  153000] [loss 0.02166]
[epoch   99/100] [iter  153500] [loss 0.00017]
[epoch   99/100] [iter  154000] [loss 0.08183]
[epoch  100/100] [iter  154500] [loss 0.00000]
[epoch  100/100] [iter  155000] [loss 0.01083]
[epoch  100/100] [iter  155500] [loss 0.01083]
[Test on environment] [epoch 100/100] [score 200.00]


**[QUESTION 2 points]** Did you manage to learn a good policy? How consistent is the reward you are getting?

The learned policy is not really good: although the reward is fairly consistent (the test scores above stay between roughly 195 and 200), the cart sometimes still (almost) moves off the screen.

## Task 2: Deep Q Learning

There are two main issues with the behavior cloning approach.

- First, we are not always lucky enough to have access to a dataset of expert demonstrations.
- Second, replicating an expert policy suffers from compounding error. The policy $\pi$ only sees these "perfect" examples and has no knowledge of how to recover from states not visited by the expert. For this reason, as soon as it is presented with a state that is off the expert trajectory, it will perform poorly and will continue to deviate from a good trajectory without the possibility of recovering from errors.

---
The second task consists of solving the environment from scratch using RL, and more specifically the DQN algorithm, to learn a policy $\pi$.

For this task, familiarize yourself with the file `dqn.py`. We are going to re-use the file `model.py` for the model you created in the previous task.

Your task is very similar to the one in the previous assignment: implement the Q-learning algorithm, but this time with the Q-function approximated by a neural network.

The algorithm (excerpted from Section 6.5 of [Sutton and Barto's book](http://incompleteideas.net/book/RLbook2018.pdf)) is given below:

![DQN algorithm](https://i.imgur.com/Mh4Uxta.png)

### 2.0 Think about your model...



**[QUESTION 2 points]** In DQN, we are using the same model as in task 1 for behavioral cloning. In both tasks the model receives as input the state and in both tasks the model outputs something that has the same dimensionality as the number of actions. These two outputs, though, represent very different things. What is each one representing?

For DQN, the output represents the estimated value of each action in the input state, while for behavioral cloning the output represents how likely it is that each action is taken in the given input state.

### 2.1 Update your Q-function

Complete the `optimize_model` function. This function receives as input a `state`, an `action`, the `next_state`, the `reward` and `done`, representing the tuple $(s_t, a_t, s_{t+1}, r_t, done_t)$. Your task is to update your Q-function as shown in the [Atari DQN paper](https://arxiv.org/abs/1312.5602). For now, don't be concerned with the experience replay buffer. We'll get to that later.

![Loss function](https://i.imgur.com/tpTsV8m.png)
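
As a worked example of the update for a single transition, the sketch below computes the bootstrapped target and the squared error in plain Python. The function names `td_target` and `squared_td_error` and the numbers are made up for illustration; a real implementation would operate on batched tensors and backpropagate through the network.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """Target y for one transition.

    If the episode ended, there is no future value to bootstrap from,
    so the target is just the reward. Otherwise, add the discounted
    value of the best action in the next state.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_error(q_value, target):
    """Squared difference between the current estimate Q(s, a) and the target y."""
    return (target - q_value) ** 2


# One non-terminal transition: reward 1.0, next-state action values [0.5, 2.0].
y = td_target(reward=1.0, next_q_values=[0.5, 2.0], done=False)  # 1.0 + 0.99 * 2.0
loss = squared_td_error(q_value=1.5, target=y)

# The same transition, but terminal: the target collapses to the reward.
y_terminal = td_target(reward=1.0, next_q_values=[0.5, 2.0], done=True)

print(y, loss, y_terminal)
```

Note that the target is treated as a fixed number when computing the loss: only the current estimate $Q(s_t, a_t)$ is adjusted toward it.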

- **[QUESTION 8 points]** Insert your code in the placeholder below.

In [7]:
## PLACEHOLDER TO INSERT YOUR optimize_model function here:

def optimize_model(state, action, next_state, reward, done):
    # Evaluate each network once per batch instead of once per sample.
    with torch.no_grad():  # the TD target is treated as a constant
        next_q_values = target(next_state)
    q_all_values = model(state)

    # TD target: r if the episode ended, otherwise r + GAMMA * max_a' Q_target(s', a')
    y = torch.zeros([1, len(done)], dtype=torch.double).to(device)
    for idx in range(len(done)):
        if done[idx]:
            y[0][idx] = reward[idx]
        else:
            y[0][idx] = reward[idx] + GAMMA * next_q_values[idx].max()

    # Q(s, a) for the actions that were actually taken
    q_values_training = torch.zeros([1, len(done)], dtype=torch.double).to(device)
    for i in range(len(done)):
        q_values_training[0][i] = q_all_values[i][action[i].item()]

    loss = nn.MSELoss()(q_values_training, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

### 2.2 $\epsilon$-greedy strategy

You will need a strategy to explore your environment. The standard strategy is to use $\epsilon$-greedy. Implement it in the `choose_action` function template.
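
The idea behind $\epsilon$-greedy can be sketched without any deep learning machinery: with probability $\epsilon$ pick a uniformly random action, otherwise pick the action with the highest estimated value. The function `epsilon_greedy` below is a hypothetical, framework-free illustration working on a plain list of action values; it is not the `choose_action` template from `dqn.py`.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


# Estimated values for three actions; action 1 is the greedy choice.
q_values = [0.1, 0.9, 0.3]

rng = random.Random(0)  # seeded generator so the run is repeatable
picks = [epsilon_greedy(q_values, epsilon=0.1, rng=rng) for _ in range(1000)]
greedy_fraction = picks.count(1) / len(picks)

print(f"fraction of greedy picks with epsilon=0.1: {greedy_fraction:.3f}")
```

Setting `epsilon=0.0` recovers a purely greedy policy, which is typically what you want when evaluating a trained agent rather than training it.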

- **[QUESTION 5 points]** Insert your code in the placeholder below.

In [9]:
## PLACEHOLDER TO INSERT YOUR choose_action function here:

def choose_action(state, test_mode=False):
    if not torch.is_tensor(state):
        state = torch.tensor([state], device=device, dtype=torch.float32)
    # Act greedily at test time; otherwise explore with probability EPS_EXPLORATION.
    if test_mode or np.random.rand() > EPS_EXPLORATION:
        action = model.select_action(state).to(device)
    else:
        action = torch.tensor([[random.randrange(n_actions)]], device=device, dtype=torch.long)
    return action


### 2.3 Train your model

Try to train a model in this way.

You can run your code by doing:

```
python3 dqn.py
```

**[QUESTION 2 points]** How many episodes does it take to learn (i.e., reach a good reward)?



[TEST Episode 25] [Average Reward 9.8]
----------
----------
[TEST Episode 50] [Average Reward 9.2]
----------
----------
[TEST Episode 75] [Average Reward 9.1]
----------
----------
[TEST Episode 100] [Average Reward 9.2]
----------
----------
saving model.
[TEST Episode 125] [Average Reward 14.0]
----------
----------
[TEST Episode 150] [Average Reward 9.9]
----------
----------
[TEST Episode 175] [Average Reward 10.7]
----------
----------
[TEST Episode 200] [Average Reward 10.2]
----------
----------
saving model.
[TEST Episode 225] [Average Reward 17.0]
----------
----------
[TEST Episode 250] [Average Reward 10.3]
----------
----------
[TEST Episode 275] [Average Reward 12.7]
----------
----------
[TEST Episode 300] [Average Reward 10.5]
----------
----------
[TEST Episode 325] [Average Reward 10.3]
----------
----------
[TEST Episode 350] [Average Reward 14.6]
----------
----------
saving model.
[TEST Episode 375] [Average Reward 144.4]
----------
----------
[TEST Episode 400] [Average Reward 28.5]
----------
----------
saving model.
[TEST Episode 425] [Average Reward 156.5]
----------
----------
[TEST Episode 450] [Average Reward 91.0]
----------
----------
[TEST Episode 475] [Average Reward 34.3]
----------
[Episode  500/4000] [Steps   79] [reward 80.0]
----------
[TEST Episode 500] [Average Reward 109.2]
----------
----------
[TEST Episode 525] [Average Reward 39.7]
----------
----------
[TEST Episode 550] [Average Reward 81.8]
----------
----------
saving model.
[TEST Episode 575] [Average Reward 172.4]
----------
----------
[TEST Episode 600] [Average Reward 87.0]
----------
----------
[TEST Episode 625] [Average Reward 21.7]
----------
----------
[TEST Episode 650] [Average Reward 38.0]
----------
----------
saving model.
[TEST Episode 675] [Average Reward 200.0]
----------
----------
[TEST Episode 700] [Average Reward 110.3]
----------
----------
[TEST Episode 725] [Average Reward 200.0]
----------
----------
[TEST Episode 750] [Average Reward 100.8]
----------
----------
[TEST Episode 775] [Average Reward 9.1]
----------
----------
[TEST Episode 800] [Average Reward 163.1]
----------
----------
[TEST Episode 825] [Average Reward 200.0]
----------
----------
[TEST Episode 850] [Average Reward 198.8]
----------
----------
[TEST Episode 875] [Average Reward 29.0]
----------
----------
[TEST Episode 900] [Average Reward 200.0]
----------
----------
[TEST Episode 925] [Average Reward 121.1]
----------
----------
[TEST Episode 950] [Average Reward 47.8]
----------
----------
[TEST Episode 975] [Average Reward 193.0]
----------
[Episode 1000/4000] [Steps  116] [reward 117.0]
----------
[TEST Episode 1000] [Average Reward 127.6]
----------
----------
[TEST Episode 1025] [Average Reward 117.0]
----------
----------
[TEST Episode 1050] [Average Reward 14.9]
----------
----------
[TEST Episode 1075] [Average Reward 18.3]
----------
----------
[TEST Episode 1100] [Average Reward 47.6]
----------
----------
[TEST Episode 1125] [Average Reward 11.0]
----------
----------
[TEST Episode 1150] [Average Reward 49.3]
----------
----------
[TEST Episode 1175] [Average Reward 189.4]
----------
----------
[TEST Episode 1200] [Average Reward 94.1]
----------
----------
[TEST Episode 1225] [Average Reward 9.6]
----------
----------
[TEST Episode 1250] [Average Reward 140.9]
----------
----------
[TEST Episode 1275] [Average Reward 149.8]
----------
----------
[TEST Episode 1300] [Average Reward 71.7]
----------
----------
[TEST Episode 1325] [Average Reward 200.0]
----------
----------
[TEST Episode 1350] [Average Reward 168.9]
----------
----------
[TEST Episode 1375] [Average Reward 109.2]
----------
----------
[TEST Episode 1400] [Average Reward 45.8]
----------
----------
[TEST Episode 1425] [Average Reward 200.0]
----------
----------
[TEST Episode 1450] [Average Reward 20.6]
----------
----------
[TEST Episode 1475] [Average Reward 59.8]
----------
[Episode 1500/4000] [Steps  125] [reward 126.0]
----------
[TEST Episode 1500] [Average Reward 128.3]
----------
----------
[TEST Episode 1525] [Average Reward 200.0]
----------
----------
[TEST Episode 1550] [Average Reward 200.0]
----------
----------
[TEST Episode 1575] [Average Reward 200.0]
----------
----------
[TEST Episode 1600] [Average Reward 200.0]
----------
----------
[TEST Episode 1625] [Average Reward 109.2]
----------
----------
[TEST Episode 1650] [Average Reward 138.2]
----------
----------
[TEST Episode 1675] [Average Reward 162.7]
----------
----------
[TEST Episode 1700] [Average Reward 97.6]
----------
----------
[TEST Episode 1725] [Average Reward 179.7]
----------
----------
[TEST Episode 1750] [Average Reward 125.6]
----------
----------
[TEST Episode 1775] [Average Reward 174.8]
----------
----------
[TEST Episode 1800] [Average Reward 127.8]
----------
----------
[TEST Episode 1825] [Average Reward 60.4]
----------
----------
[TEST Episode 1850] [Average Reward 13.5]
----------
----------
[TEST Episode 1875] [Average Reward 200.0]
----------
----------
[TEST Episode 1900] [Average Reward 47.7]
----------
----------
[TEST Episode 1925] [Average Reward 101.7]
----------
----------
[TEST Episode 1950] [Average Reward 200.0]
----------
----------
[TEST Episode 1975] [Average Reward 164.3]
----------
[Episode 2000/4000] [Steps  127] [reward 128.0]
----------
[TEST Episode 2000] [Average Reward 187.5]
----------
----------
[TEST Episode 2025] [Average Reward 127.9]
----------
----------
[TEST Episode 2050] [Average Reward 153.3]
----------
----------
[TEST Episode 2075] [Average Reward 191.0]
----------
----------
[TEST Episode 2100] [Average Reward 176.6]
----------
----------
[TEST Episode 2125] [Average Reward 165.8]
----------
----------
[TEST Episode 2150] [Average Reward 172.9]
----------
----------
[TEST Episode 2175] [Average Reward 194.2]
----------
----------
[TEST Episode 2200] [Average Reward 141.3]
----------
----------
[TEST Episode 2225] [Average Reward 183.1]
----------
----------
[TEST Episode 2250] [Average Reward 10.9]
----------
----------
[TEST Episode 2275] [Average Reward 166.9]
----------
----------
[TEST Episode 2300] [Average Reward 10.8]
----------
----------
[TEST Episode 2325] [Average Reward 134.9]
----------
----------
[TEST Episode 2350] [Average Reward 99.1]
----------
----------
[TEST Episode 2375] [Average Reward 154.6]
----------
----------
[TEST Episode 2400] [Average Reward 200.0]
----------
----------
[TEST Episode 2425] [Average Reward 119.5]
----------
----------
[TEST Episode 2450] [Average Reward 197.7]
----------
----------
[TEST Episode 2475] [Average Reward 193.5]
----------
[Episode 2500/4000] [Steps  127] [reward 128.0]
----------
[TEST Episode 2500] [Average Reward 199.2]
----------
----------
[TEST Episode 2525] [Average Reward 132.1]
----------
----------
[TEST Episode 2550] [Average Reward 157.4]
----------
----------
[TEST Episode 2575] [Average Reward 135.3]
----------
----------
[TEST Episode 2600] [Average Reward 136.3]
----------
----------
[TEST Episode 2625] [Average Reward 144.9]
----------
----------
[TEST Episode 2650] [Average Reward 185.0]
----------
----------
[TEST Episode 2675] [Average Reward 11.4]
----------
----------
[TEST Episode 2700] [Average Reward 200.0]
----------
----------
[TEST Episode 2725] [Average Reward 131.0]
----------
----------
[TEST Episode 2750] [Average Reward 200.0]
----------
----------
[TEST Episode 2775] [Average Reward 10.5]
----------
----------
[TEST Episode 2800] [Average Reward 200.0]
----------
----------
[TEST Episode 2825] [Average Reward 25.4]
----------
----------
[TEST Episode 2850] [Average Reward 39.0]
----------
----------
[TEST Episode 2875] [Average Reward 11.7]
----------
----------
[TEST Episode 2900] [Average Reward 49.8]
----------
----------
[TEST Episode 2925] [Average Reward 113.0]
----------
----------
[TEST Episode 2950] [Average Reward 111.0]
----------
----------
[TEST Episode 2975] [Average Reward 128.0]
----------
[Episode 3000/4000] [Steps  199] [reward 200.0]
----------
[TEST Episode 3000] [Average Reward 146.8]
----------
----------
[TEST Episode 3025] [Average Reward 13.5]
----------
----------
[TEST Episode 3050] [Average Reward 117.8]
----------
----------
[TEST Episode 3075] [Average Reward 199.8]
----------
----------
[TEST Episode 3100] [Average Reward 9.7]
----------
----------
[TEST Episode 3125] [Average Reward 61.4]
----------
----------
[TEST Episode 3150] [Average Reward 192.1]
----------
----------
[TEST Episode 3175] [Average Reward 9.5]
----------
----------
[TEST Episode 3200] [Average Reward 200.0]
----------
----------
[TEST Episode 3225] [Average Reward 160.5]
----------
----------
[TEST Episode 3250] [Average Reward 77.3]
----------
----------
[TEST Episode 3275] [Average Reward 101.6]
----------
----------
[TEST Episode 3300] [Average Reward 88.7]
----------
----------
[TEST Episode 3325] [Average Reward 185.7]
----------
----------
[TEST Episode 3350] [Average Reward 76.3]
----------
----------
[TEST Episode 3375] [Average Reward 200.0]
----------
----------
[TEST Episode 3400] [Average Reward 114.2]
----------
----------
[TEST Episode 3425] [Average Reward 132.0]
----------
----------
[TEST Episode 3450] [Average Reward 105.0]
----------
----------
[TEST Episode 3475] [Average Reward 10.4]
----------
[Episode 3500/4000] [Steps    8] [reward 9.0]
----------
[TEST Episode 3500] [Average Reward 86.5]
----------
----------
[TEST Episode 3525] [Average Reward 33.8]
----------
----------
[TEST Episode 3550] [Average Reward 130.7]
----------
----------
[TEST Episode 3575] [Average Reward 54.0]
----------
----------
[TEST Episode 3600] [Average Reward 15.6]
----------
----------
[TEST Episode 3625] [Average Reward 10.6]
----------
----------
[TEST Episode 3650] [Average Reward 16.9]
----------
----------
[TEST Episode 3675] [Average Reward 115.0]
----------
----------
[TEST Episode 3700] [Average Reward 200.0]
----------
----------
[TEST Episode 3725] [Average Reward 125.1]
----------
----------
[TEST Episode 3750] [Average Reward 9.3]
----------
----------
[TEST Episode 3775] [Average Reward 148.7]
----------
----------
[TEST Episode 3800] [Average Reward 141.0]
----------
----------
[TEST Episode 3825] [Average Reward 82.7]
----------
----------
[TEST Episode 3850] [Average Reward 200.0]
----------
----------
[TEST Episode 3875] [Average Reward 103.2]
----------
----------
[TEST Episode 3900] [Average Reward 92.8]
----------
----------
[TEST Episode 3925] [Average Reward 127.6]
----------
----------
[TEST Episode 3950] [Average Reward 11.8]
----------
----------
[TEST Episode 3975] [Average Reward 75.0]
----------
[Episode 4000/4000] [Steps  142] [reward 143.0]
----------
[TEST Episode 4000] [Average Reward 102.6]

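The agent reached good performance after approximately 2000 episodes. It first hits the maximum average test reward of 200.0 at episode 675, but the policy is unstable until much later. Even then, the agent sometimes gets stuck: the average test reward occasionally collapses back to around 10 late in training (for example at episodes 2250 and 3100). This is likely due to on-policy bias, since each update uses only the most recent, highly correlated transitions, combined with random initialization.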

### 2.4 Add the Experience Replay Buffer

If you read the DQN paper (and as you can see from the algorithm picture above), the authors make use of an experience replay buffer to learn faster. We provide an implementation in the file `replay_buffer.py`. Update the `train_reinforcement_learning` code to push a tuple to the replay buffer and to sample a batch for the `optimize_model` function.
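As a rough illustration of the push/sample pattern described above, here is a minimal sketch of a replay buffer. It is not the implementation shipped in `replay_buffer.py`: the class name `SimpleReplayBuffer`, the `Transition` field layout, and the toy transitions are all assumptions made for this example, so check the provided file for the actual interface.

```python
import random
from collections import deque, namedtuple

# Hypothetical transition layout; the provided replay_buffer.py may differ.
Transition = namedtuple(
    "Transition", ["state", "action", "next_state", "reward", "done"]
)


class SimpleReplayBuffer:
    """Fixed-capacity FIFO buffer (illustrative, not the course's class)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition when full.
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        """Store one (state, action, next_state, reward, done) tuple."""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        """Draw a uniform random minibatch without replacement."""
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)


# Toy usage: fill the buffer with fake transitions, then sample a batch.
buffer = SimpleReplayBuffer(capacity=1000)
for t in range(50):
    buffer.push([float(t)], t % 2, [float(t + 1)], 1.0, t == 49)

batch = buffer.sample(8)
# Transpose a list of transitions into one Transition of per-field tuples,
# which is the shape a batched optimization step typically wants.
batched = Transition(*zip(*batch))
```

In a training loop, one would typically push after every environment step and only start sampling once `len(buffer)` is at least the batch size.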

**[QUESTION 5 points]** How does the replay buffer improve performance?

[Episode   10/4000] [Steps   90] [reward 91.0]
[Episode   20/4000] [Steps  143] [reward 144.0]
----------
saving model.
[TEST Episode 25] [Average Reward 149.2]
----------
[Episode   30/4000] [Steps   85] [reward 86.0]
[Episode   40/4000] [Steps   71] [reward 72.0]
[Episode   50/4000] [Steps   64] [reward 65.0]
----------
[TEST Episode 50] [Average Reward 92.9]
----------
[Episode   60/4000] [Steps   41] [reward 42.0]
[Episode   70/4000] [Steps   38] [reward 39.0]
----------
[TEST Episode 75] [Average Reward 11.9]
----------
[Episode   80/4000] [Steps   23] [reward 24.0]
[Episode   90/4000] [Steps   19] [reward 20.0]
[Episode  100/4000] [Steps   21] [reward 22.0]
----------
[TEST Episode 100] [Average Reward 21.7]
----------
[Episode  110/4000] [Steps    9] [reward 10.0]
[Episode  120/4000] [Steps   12] [reward 13.0]
----------
[TEST Episode 125] [Average Reward 13.3]
----------
[Episode  130/4000] [Steps   12] [reward 13.0]
[Episode  140/4000] [Steps    8] [reward 9.0]
[Episode  150/4000] [Steps   16] [reward 17.0]
----------
[TEST Episode 150] [Average Reward 15.6]
----------
[Episode  160/4000] [Steps    8] [reward 9.0]
[Episode  170/4000] [Steps   11] [reward 12.0]
----------
[TEST Episode 175] [Average Reward 11.1]
----------
[Episode  180/4000] [Steps   22] [reward 23.0]
[Episode  190/4000] [Steps   28] [reward 29.0]
[Episode  200/4000] [Steps    8] [reward 9.0]
----------
[TEST Episode 200] [Average Reward 9.4]
----------
[Episode  210/4000] [Steps    8] [reward 9.0]
[Episode  220/4000] [Steps    9] [reward 10.0]
----------
[TEST Episode 225] [Average Reward 44.7]
----------
[Episode  230/4000] [Steps   15] [reward 16.0]
[Episode  240/4000] [Steps   12] [reward 13.0]
[Episode  250/4000] [Steps    9] [reward 10.0]
----------
[TEST Episode 250] [Average Reward 10.8]
----------
[Episode  260/4000] [Steps    9] [reward 10.0]
[Episode  270/4000] [Steps   24] [reward 25.0]
----------
[TEST Episode 275] [Average Reward 16.7]
----------
[Episode  280/4000] [Steps   12] [reward 13.0]
[Episode  290/4000] [Steps   23] [reward 24.0]
[Episode  300/4000] [Steps    8] [reward 9.0]
----------
[TEST Episode 300] [Average Reward 9.9]
----------
[Episode  310/4000] [Steps   13] [reward 14.0]
[Episode  320/4000] [Steps   11] [reward 12.0]
----------
[TEST Episode 325] [Average Reward 12.5]
----------
[Episode  330/4000] [Steps    9] [reward 10.0]
[Episode  340/4000] [Steps   10] [reward 11.0]
[Episode  350/4000] [Steps    9] [reward 10.0]
----------
[TEST Episode 350] [Average Reward 9.5]
----------
[Episode  360/4000] [Steps   20] [reward 21.0]
[Episode  370/4000] [Steps   28] [reward 29.0]
----------
[TEST Episode 375] [Average Reward 9.7]
----------
[Episode  380/4000] [Steps   35] [reward 36.0]
[Episode  390/4000] [Steps   16] [reward 17.0]
[Episode  400/4000] [Steps   12] [reward 13.0]
----------
[TEST Episode 400] [Average Reward 12.1]
----------
[Episode  410/4000] [Steps   20] [reward 21.0]
[Episode  420/4000] [Steps   18] [reward 19.0]
----------
[TEST Episode 425] [Average Reward 21.0]
----------
[Episode  430/4000] [Steps   44] [reward 45.0]
[Episode  440/4000] [Steps   17] [reward 18.0]
[Episode  450/4000] [Steps   93] [reward 94.0]
----------
[TEST Episode 450] [Average Reward 73.2]
----------
[Episode  460/4000] [Steps   33] [reward 34.0]
[Episode  470/4000] [Steps   11] [reward 12.0]
----------
[TEST Episode 475] [Average Reward 25.8]
----------
[Episode  480/4000] [Steps   27] [reward 28.0]
[Episode  490/4000] [Steps   26] [reward 27.0]
[Episode  500/4000] [Steps   54] [reward 55.0]
----------
[TEST Episode 500] [Average Reward 86.8]
----------
[Episode  510/4000] [Steps   42] [reward 43.0]
[Episode  520/4000] [Steps   84] [reward 85.0]
----------
[TEST Episode 525] [Average Reward 68.7]
----------
[Episode  530/4000] [Steps   72] [reward 73.0]
[Episode  540/4000] [Steps  166] [reward 167.0]
[Episode  550/4000] [Steps  182] [reward 183.0]
----------
saving model.
[TEST Episode 550] [Average Reward 166.5]
----------
[Episode  560/4000] [Steps  199] [reward 200.0]
[Episode  570/4000] [Steps  199] [reward 200.0]
----------
saving model.
[TEST Episode 575] [Average Reward 200.0]
----------
[Episode  580/4000] [Steps  199] [reward 200.0]
[Episode  590/4000] [Steps  199] [reward 200.0]
[Episode  600/4000] [Steps  199] [reward 200.0]
----------
[TEST Episode 600] [Average Reward 200.0]
----------
[Episode  610/4000] [Steps  199] [reward 200.0]
[Episode  620/4000] [Steps  129] [reward 130.0]
----------
[TEST Episode 625] [Average Reward 200.0]
----------
[Episode  630/4000] [Steps  199] [reward 200.0]
[Episode  640/4000] [Steps  165] [reward 166.0]
[Episode  650/4000] [Steps  164] [reward 165.0]
----------
[TEST Episode 650] [Average Reward 187.4]
----------
[Episode  660/4000] [Steps  199] [reward 200.0]
[Episode  670/4000] [Steps  199] [reward 200.0]
----------
[TEST Episode 675] [Average Reward 200.0]
----------
[Episode  680/4000] [Steps  199] [reward 200.0]
[Episode  690/4000] [Steps  199] [reward 200.0]
[Episode  700/4000] [Steps  199] [reward 200.0]
----------
[TEST Episode 700] [Average Reward 186.7]
----------
[Episode  710/4000] [Steps   59] [reward 60.0]
[Episode  720/4000] [Steps  102] [reward 103.0]
----------
[TEST Episode 725] [Average Reward 200.0]
----------
[Episode  730/4000] [Steps  199] [reward 200.0]
[Episode  740/4000] [Steps  199] [reward 200.0]
[Episode  750/4000] [Steps  199] [reward 200.0]
----------
[TEST Episode 750] [Average Reward 197.8]

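The replay buffer makes learning faster and more stable: the agent reaches a good policy after roughly 500 to 550 episodes (average test reward 166.5 at episode 550 and 200.0 at episode 575) and then stays between about 187 and 200 through episode 750. Sampling random minibatches from past experience breaks the correlation between consecutive transitions and lets each transition be reused for several updates.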


## Task 3: Extra

Ideas to experiment with:

- Is the $\epsilon$-greedy strategy the best strategy available? Why not try something different?
- Why not make use of the model you have trained in the behavioral cloning part and fine-tune it with RL? How does that affect performance?
- You are perhaps bored with `CartPole-v0` by now. Another environment we suggest trying is `LunarLander-v2`. It will be harder to learn but with experimentation, you will find the correct optimizations for success. Piazza is also your friend :)
- What about learning from images? This requires more work because you have to extract the image from the environment. However, would it be possible? How much more challenging might you expect the learning to be in this case?
- The ReplayBuffer implementation provided is very simple. In class we have briefly mentioned Prioritized Experience Replay; how would the learning process change?
- An improvement over DQN is DoubleDQN, which is a very simple addition to the current code.
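
As one possible starting point for the first idea above, the sketch below shows Boltzmann (softmax) exploration, which samples actions in proportion to $\exp(Q(s,a)/\tau)$ instead of picking uniformly at random with probability $\epsilon$. The function name `boltzmann_action` and the `temperature` parameter are assumptions for this example, and the Q-values are passed as a plain Python list rather than a tensor.

```python
import math
import random


def boltzmann_action(q_values, temperature=1.0):
    """Sample an action index with probability proportional to exp(Q / T).

    q_values: list of Q-value estimates, one per action (hypothetical input).
    temperature: high values approach uniform sampling, low values approach
    greedy action selection.
    """
    # Subtract the max before exponentiating for numerical stability.
    max_q = max(q_values)
    weights = [math.exp((q - max_q) / temperature) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights, k=1)[0]


# With a very low temperature the choice is effectively greedy.
greedy_like = boltzmann_action([0.0, 10.0], temperature=0.01)
# With equal Q-values both actions are equally likely.
either = boltzmann_action([1.0, 1.0], temperature=1.0)
```

Unlike $\epsilon$-greedy, this scheme explores clearly bad actions less often than nearly optimal ones, at the cost of one more hyperparameter to tune or anneal.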



[Episode  476/4000] [Steps   39] [reward 40.0]
[Episode  477/4000] [Steps   39] [reward 40.0]
[Episode  478/4000] [Steps   91] [reward 92.0]
[Episode  479/4000] [Steps   52] [reward 53.0]
[Episode  480/4000] [Steps   48] [reward 49.0]
[Episode  481/4000] [Steps   52] [reward 53.0]
[Episode  482/4000] [Steps   38] [reward 39.0]
[Episode  483/4000] [Steps   52] [reward 53.0]
[Episode  484/4000] [Steps   18] [reward 19.0]
[Episode  485/4000] [Steps   23] [reward 24.0]
[Episode  486/4000] [Steps   15] [reward 16.0]
[Episode  487/4000] [Steps   26] [reward 27.0]
[Episode  488/4000] [Steps   12] [reward 13.0]
[Episode  489/4000] [Steps   20] [reward 21.0]
[Episode  490/4000] [Steps   19] [reward 20.0]
[Episode  491/4000] [Steps   32] [reward 33.0]
[Episode  492/4000] [Steps   29] [reward 30.0]
[Episode  493/4000] [Steps   54] [reward 55.0]
[Episode  494/4000] [Steps   23] [reward 24.0]
[Episode  495/4000] [Steps  104] [reward 105.0]
[Episode  496/4000] [Steps   39] [reward 40.0]
[Episode  497/4000] [Steps   65] [reward 66.0]
[Episode  498/4000] [Steps   49] [reward 50.0]
[Episode  499/4000] [Steps   64] [reward 65.0]
[Episode  500/4000] [Steps   40] [reward 41.0]
----------
saving model.
[TEST Episode 500] [Average Reward 45.8]
----------
[Episode  501/4000] [Steps   46] [reward 47.0]
[Episode  502/4000] [Steps   34] [reward 35.0]
[Episode  503/4000] [Steps   32] [reward 33.0]
[Episode  504/4000] [Steps   57] [reward 58.0]
[Episode  505/4000] [Steps   76] [reward 77.0]
[Episode  506/4000] [Steps   39] [reward 40.0]
[Episode  507/4000] [Steps   42] [reward 43.0]
[Episode  508/4000] [Steps   28] [reward 29.0]
[Episode  509/4000] [Steps   88] [reward 89.0]
[Episode  510/4000] [Steps   37] [reward 38.0]
[Episode  511/4000] [Steps   32] [reward 33.0]
[Episode  512/4000] [Steps   66] [reward 67.0]
[Episode  513/4000] [Steps   30] [reward 31.0]
[Episode  514/4000] [Steps   32] [reward 33.0]
[Episode  515/4000] [Steps   84] [reward 85.0]
[Episode  516/4000] [Steps  123] [reward 124.0]
[Episode  517/4000] [Steps   61] [reward 62.0]
[Episode  518/4000] [Steps   27] [reward 28.0]
[Episode  519/4000] [Steps   66] [reward 67.0]
[Episode  520/4000] [Steps   23] [reward 24.0]
[Episode  521/4000] [Steps   44] [reward 45.0]
[Episode  522/4000] [Steps   15] [reward 16.0]
[Episode  523/4000] [Steps   62] [reward 63.0]
[Episode  524/4000] [Steps   50] [reward 51.0]
[Episode  525/4000] [Steps   23] [reward 24.0]
----------
[TEST Episode 525] [Average Reward 39.2]
----------
[Episode  526/4000] [Steps   25] [reward 26.0]
[Episode  527/4000] [Steps   17] [reward 18.0]
[Episode  528/4000] [Steps   64] [reward 65.0]
[Episode  529/4000] [Steps  164] [reward 165.0]
[Episode  530/4000] [Steps   71] [reward 72.0]
[Episode  531/4000] [Steps   44] [reward 45.0]
[Episode  532/4000] [Steps   81] [reward 82.0]
[Episode  533/4000] [Steps  161] [reward 162.0]
[Episode  534/4000] [Steps   59] [reward 60.0]
[Episode  535/4000] [Steps   70] [reward 71.0]
[Episode  536/4000] [Steps   63] [reward 64.0]
[Episode  537/4000] [Steps   83] [reward 84.0]
[Episode  538/4000] [Steps  148] [reward 149.0]
[Episode  539/4000] [Steps   34] [reward 35.0]
[Episode  540/4000] [Steps   65] [reward 66.0]
[Episode  541/4000] [Steps   41] [reward 42.0]
[Episode  542/4000] [Steps   33] [reward 34.0]
[Episode  543/4000] [Steps   47] [reward 48.0]
[Episode  544/4000] [Steps   24] [reward 25.0]
[Episode  545/4000] [Steps   77] [reward 78.0]
[Episode  546/4000] [Steps   47] [reward 48.0]
[Episode  547/4000] [Steps   71] [reward 72.0]
[Episode  548/4000] [Steps   93] [reward 94.0]
[Episode  549/4000] [Steps   47] [reward 48.0]
[Episode  550/4000] [Steps   63] [reward 64.0]
----------
saving model.
[TEST Episode 550] [Average Reward 87.0]
----------
[Episode  551/4000] [Steps  115] [reward 116.0]
[Episode  552/4000] [Steps   70] [reward 71.0]
[Episode  553/4000] [Steps   42] [reward 43.0]
[Episode  554/4000] [Steps   53] [reward 54.0]
[Episode  555/4000] [Steps   88] [reward 89.0]
[Episode  556/4000] [Steps   80] [reward 81.0]
[Episode  557/4000] [Steps   62] [reward 63.0]
[Episode  558/4000] [Steps   55] [reward 56.0]
[Episode  559/4000] [Steps   85] [reward 86.0]
[Episode  560/4000] [Steps   50] [reward 51.0]
[Episode  561/4000] [Steps  136] [reward 137.0]
[Episode  562/4000] [Steps   59] [reward 60.0]
[Episode  563/4000] [Steps   56] [reward 57.0]
[Episode  564/4000] [Steps   61] [reward 62.0]
[Episode  565/4000] [Steps   45] [reward 46.0]
[Episode  566/4000] [Steps   50] [reward 51.0]
[Episode  567/4000] [Steps   72] [reward 73.0]
[Episode  568/4000] [Steps   92] [reward 93.0]
[Episode  569/4000] [Steps  141] [reward 142.0]
[Episode  570/4000] [Steps  128] [reward 129.0]
[Episode  571/4000] [Steps   41] [reward 42.0]
[Episode  572/4000] [Steps   58] [reward 59.0]
[Episode  573/4000] [Steps   82] [reward 83.0]
[Episode  574/4000] [Steps  153] [reward 154.0]
[Episode  575/4000] [Steps   52] [reward 53.0]
----------
saving model.
[TEST Episode 575] [Average Reward 120.1]
----------
[Episode  576/4000] [Steps  176] [reward 177.0]
[Episode  577/4000] [Steps   97] [reward 98.0]
[Episode  578/4000] [Steps   93] [reward 94.0]
[Episode  579/4000] [Steps   89] [reward 90.0]
[Episode  580/4000] [Steps   79] [reward 80.0]
[Episode  581/4000] [Steps  120] [reward 121.0]
[Episode  582/4000] [Steps  135] [reward 136.0]
[Episode  583/4000] [Steps  199] [reward 200.0]
[Episode  584/4000] [Steps  161] [reward 162.0]
[Episode  585/4000] [Steps   86] [reward 87.0]
[Episode  586/4000] [Steps   74] [reward 75.0]
[Episode  587/4000] [Steps  123] [reward 124.0]
[Episode  588/4000] [Steps  134] [reward 135.0]
[Episode  589/4000] [Steps  115] [reward 116.0]
[Episode  590/4000] [Steps   41] [reward 42.0]
[Episode  591/4000] [Steps   68] [reward 69.0]
[Episode  592/4000] [Steps  199] [reward 200.0]
[Episode  593/4000] [Steps  194] [reward 195.0]
[Episode  594/4000] [Steps  199] [reward 200.0]
[Episode  595/4000] [Steps  153] [reward 154.0]
[Episode  596/4000] [Steps  193] [reward 194.0]
[Episode  597/4000] [Steps   69] [reward 70.0]
[Episode  598/4000] [Steps  137] [reward 138.0]
[Episode  599/4000] [Steps  199] [reward 200.0]
[Episode  600/4000] [Steps  199] [reward 200.0]

Results of using Double DQN (DDQN) instead of DQN: training is more stable in later episodes. In particular, although both variants start to converge around episodes 500-600, DDQN maintains a high score more consistently afterwards.
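
For reference, the difference between the two targets can be sketched in a few lines. This is a framework-free illustration for a single transition, not the code used to produce the logs above. It assumes a separate target network exists, and the function names and list-based Q-values are hypothetical. In PyTorch the same selection and evaluation steps would typically use `argmax` and `gather` on batched tensors.

```python
def dqn_target(reward, done, q_next_target, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates the action."""
    if done:
        return reward
    return reward + gamma * max(q_next_target)


def double_dqn_target(reward, done, q_next_online, q_next_target, gamma=0.99):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    # Action chosen by the online network's estimates for the next state.
    best_action = max(range(len(q_next_online)), key=q_next_online.__getitem__)
    # Value of that action according to the target network.
    return reward + gamma * q_next_target[best_action]


# Toy next-state Q-values where the two networks disagree on the best action.
q_online = [1.0, 2.0]  # online network prefers action 1
q_target = [5.0, 3.0]  # target network overestimates action 0

y_dqn = dqn_target(1.0, False, q_target)
y_ddqn = double_dqn_target(1.0, False, q_online, q_target)
```

In this toy case the Double DQN target is smaller than the DQN target, because the maximization and the evaluation no longer share the same noisy estimates. That reduced overestimation is the usual explanation for more stable late-stage training.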