<h1 style=
  "margin-top: 0px;
  margin-bottom: 10px;
  font-family: sans-serif;
  font-size: 8rem;">
<span style="color:#808080">D</span><span style="color:#808080">Q</span><span style="color:#808080">N</span>
</h1>

In this homework, the Jupyter Notebook is mainly used for visualizing and reporting the results. We start by implementing a vanilla DQN agent and then continue with a Rainbow agent. In general, three scripts are needed to run a training experiment with the DQN agent on an environment. The first is the model, where we implement the policy and the loss function. The second is the Trainer class, which handles all of the training and evaluation: it is responsible for parameter updates, running the environment, keeping track of the necessary statistics, and saving the model (agent and optimizer). The third script instantiates the agent, trainer, and environment, and starts the training with the given arguments.

- DQN
    - model
    - trainer
    - box2d (experiment script)

We will follow a very similar structure for the Rainbow agent.

#### Running

We will train each experiment with 5 different seeds to get a good understanding of the stochasticity involved in the training process. You can run your experiments from the command-line interface within the notebook.

Run the cell below to see the command-line (CL) arguments.

In [None]:
!python dqn/dqn/box2d.py --help

An example DQN run is given below. (You need to fill in the missing parts before running the command.)

In [None]:
!python dqn/dqn/box2d.py --log_dir logs/vanilla-dqn --gamma 0.9 --n-iterations 40000 --seed 5555

After you run the training script (box2d.py), a log file named "progress.csv" is saved to the directory given by the ```log_dir``` argument. You can load the CSV file into a Pandas dataframe and visualize the training.

#### Plotting

When you are done with the experiments, you can plot the statistics. We want to see how much variation exists in the training, so run and plot at least 5 different seeds. Plotter handles the multi-seed plotting and comparisons.

Below is an example plot of two experiments, each containing 3 different ```progress.csv``` files, to demonstrate Plotter.

You can switch axes using the dropdowns.

In [3]:
from typing import Dict, List
import os
import pandas as pd

from dqn.visualize import Plotter


def collect_training_logs(log_dir: str) -> List[pd.DataFrame]:
    """
        Obtain pandas frames from the progress.csv files found in the
        subfolders of the given directory (one frame per run)
    """
    return [pd.read_csv(os.path.join(log_dir, folder, "progress.csv"))
            for folder in os.listdir(log_dir)
            if os.path.exists(os.path.join(log_dir, folder, "progress.csv"))]

In [None]:
df_dict = {"gamma-0.90": collect_training_logs(os.path.join("logs", "vanilla-dqn-gamma-0.90")),
           "gamma-0.99": collect_training_logs(os.path.join("logs", "vanilla-dqn-gamma-0.99"))}

plotter = Plotter(df_dict)
plotter()

### Implementation

We start filling in the source code with ``` dqn/base_dqn.py ```. This class serves as a base class for the DQN agents (Vanilla DQN and Rainbow DQN).
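
The base class revolves around two small ideas: picking the action with the highest Q-value, and periodically copying the online parameters into the target network. The following is a minimal, framework-free sketch of both. The function names, the plain-list Q-values, and the dictionary of parameters are illustrative assumptions and do not mirror the actual signatures in ``` dqn/base_dqn.py ```.

```python
import copy
import random
from typing import Dict, List, Sequence


def greedy_action(q_values: Sequence[float]) -> int:
    """Return the index of the largest Q-value (the first one on ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def e_greedy_action(q_values: Sequence[float], epsilon: float,
                    rng: random.Random) -> int:
    """With probability epsilon act uniformly at random, otherwise greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


def hard_update(online: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Hard target update: return an independent copy of the online parameters."""
    return copy.deepcopy(online)


q = [0.1, 1.5, -0.3, 0.7]
print(greedy_action(q))  # 1
```

If the networks are PyTorch modules (an assumption; the framework is not named at this point), the same hard update is commonly written as `target.load_state_dict(online.state_dict())`.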

> Complete ``` greedy_policy ``` in ``` dqn/base_dqn.py ``` script

> Complete ``` update_target ``` in ``` dqn/base_dqn.py ``` script


> Complete ``` evaluate ``` in ``` dqn/base_dqn.py ``` script

As you can see, the target network is already initialized in the constructor of the base class, but we also need a replay buffer. The next part to complete is ``` dqn/replaybuffer/uniform.py ```. When we initialize the buffer, we allocate all of its memory up front and then gradually push transitions. The capacity is fixed, while the size of the buffer grows as we push transitions until it reaches the capacity.

> Complete ``` push ``` in ``` dqn/replaybuffer/uniform.py ``` script

Remember, our replay buffer is a queue with FIFO behavior.
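
As a rough illustration of a fixed-capacity FIFO buffer, the sketch below preallocates its storage, overwrites the oldest entry once it is full, and samples uniformly. The class name `RingBuffer`, the plain Python list storage, and sampling without replacement are assumptions made for this example; the buffer in ``` dqn/replaybuffer/uniform.py ``` may store transitions differently.

```python
import random
from typing import Any, List, Optional


class RingBuffer:
    """Fixed-capacity FIFO buffer: storage is allocated once, oldest items are overwritten."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.storage: List[Optional[Any]] = [None] * capacity  # allocated up front
        self.write_index = 0  # slot that the next push writes to
        self.size = 0         # number of valid transitions stored so far

    def push(self, transition: Any) -> None:
        self.storage[self.write_index] = transition
        self.write_index = (self.write_index + 1) % self.capacity  # wrap around
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size: int, rng: random.Random) -> List[Any]:
        """Uniformly sample distinct stored transitions."""
        if batch_size > self.size:
            raise ValueError("not enough transitions in the buffer")
        indices = rng.sample(range(self.size), batch_size)
        return [self.storage[i] for i in indices]


buffer = RingBuffer(capacity=3)
for step in range(5):
    buffer.push(step)
print(buffer.storage, buffer.size)  # [3, 4, 2] 3
```

After five pushes into a buffer of capacity 3, the two oldest entries (0 and 1) have been overwritten, which is the FIFO behavior described above.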

> Complete ``` sample ``` in ``` dqn/replaybuffer/uniform.py ``` script

Now we can complete the DQN agent.
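
The DQN loss is built on the one-step temporal-difference target $y = r + \gamma \max_{a'} Q_{\text{target}}(s', a')$, where the bootstrap term is dropped at terminal states. The worked example below computes this target and the TD error for a single transition. It uses plain floats instead of batched tensors, and the helper names are hypothetical; whether the assignment expects a squared or a Huber loss on top of the TD error is not stated here.

```python
from typing import Sequence


def td_target(reward: float, done: bool, gamma: float,
              next_q_target: Sequence[float]) -> float:
    """One-step target: r + gamma * max_a' Q_target(s', a'), no bootstrap at terminal states."""
    bootstrap = 0.0 if done else gamma * max(next_q_target)
    return reward + bootstrap


def td_error(q_sa: float, target: float) -> float:
    """Temporal-difference error for the action that was taken."""
    return target - q_sa


# Q_target(s', .) = [0.5, 2.0]; the online network predicted Q(s, a) = 2.0
target = td_target(reward=1.0, done=False, gamma=0.99, next_q_target=[0.5, 2.0])
print(target)                      # roughly 1.0 + 0.99 * 2.0
print(td_error(2.0, target) ** 2)  # squared TD error for this sample
```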

> Complete ``` loss ``` in ``` dqn/dqn/model.py ```

Once the DQN agent and the replay buffer are done, we can start implementing the trainer class. This class takes care of all the training.

> Complete ``` update ``` in ``` dqn/dqn/train.py ```

The update function updates the parameters (value and target networks). Also, append the TD error to the ```td_loss``` list.

Now we can complete the ``` __iter__ ``` method. This Python special method returns a generator that yields a transition at every step for "n_iterations" steps (from ```args```). This is the method where we gather experience from the environment by following ```e_greedy_policy```. To see how we use the ```__iter__``` method, please check the ```__call__``` method in the ```Trainer``` class. Additionally, append the episodic training reward when the environment terminates. Step the epsilon here (come back to this point after the next implementation).
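
The generator pattern described above can be sketched without the real environment. In the example below, `ToyEnv`, the `experience` function, and the tuple layout are all hypothetical stand-ins; the actual ``` __iter__ ``` method works on the trainer's own environment, agent, and ```args```.

```python
import itertools
import random
from typing import Iterator, List, Tuple

Transition = Tuple[int, int, float, int, bool]  # (state, action, reward, next_state, done)


class ToyEnv:
    """Hypothetical 1-D corridor: start at 0, reaching state 3 ends the episode with reward 1."""

    def reset(self) -> int:
        self.state = 0
        return self.state

    def step(self, action: int) -> Tuple[int, float, bool]:
        # action 1 moves right, action 0 stays in place
        self.state = min(self.state + action, 3)
        done = self.state == 3
        return self.state, (1.0 if done else 0.0), done


def experience(env: ToyEnv, n_iterations: int, epsilon: Iterator[float],
               episode_rewards: List[float], rng: random.Random) -> Iterator[Transition]:
    """Yield exactly n_iterations transitions, resetting the environment between episodes."""
    state, episode_reward = env.reset(), 0.0
    for _ in range(n_iterations):
        eps = next(epsilon)  # step the epsilon schedule once per transition
        # Stand-in for e_greedy_policy: the "greedy" action here is always 1.
        action = rng.randrange(2) if rng.random() < eps else 1
        next_state, reward, done = env.step(action)
        episode_reward += reward
        transition = (state, action, reward, next_state, done)
        if done:
            episode_rewards.append(episode_reward)  # log the finished episode
            state, episode_reward = env.reset(), 0.0
        else:
            state = next_state
        yield transition


rewards: List[float] = []
steps = list(experience(ToyEnv(), 7, itertools.repeat(0.0), rewards, random.Random(0)))
print(len(steps), rewards)  # 7 [1.0, 1.0]
```

With epsilon fixed at 0 the agent always moves right, so every episode lasts three steps; seven iterations therefore contain two finished episodes and one step of a third.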

> Complete ``` __iter__ ``` in ``` dqn/dqn/train.py ```

All the components necessary for training are brought together in the ``` __call__ ``` method, which is already completed.

Before starting the experiments, we need to implement the annealing functions located in ``` dqn/common.py ```. Remember that epsilon is a Python **generator**.
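
To make the generator idea concrete, here is one possible shape for the two schedules. The parameter names (`init_value`, `min_value`, `decay_range`, `decay_ratio`) are assumptions for illustration and may not match the signatures expected in ``` dqn/common.py ```.

```python
from typing import Iterator


def linear_annealing(init_value: float, min_value: float,
                     decay_range: int) -> Iterator[float]:
    """Move linearly from init_value to min_value over decay_range steps, then stay there."""
    step = 0
    while True:
        fraction = min(step / decay_range, 1.0)
        yield init_value + fraction * (min_value - init_value)
        step += 1


def exponential_annealing(init_value: float, min_value: float,
                          decay_ratio: float) -> Iterator[float]:
    """Multiply by decay_ratio after every step, never going below min_value."""
    value = init_value
    while True:
        yield max(value, min_value)
        value *= decay_ratio


epsilon = linear_annealing(1.0, 0.0, 4)
print([next(epsilon) for _ in range(6)])  # [1.0, 0.75, 0.5, 0.25, 0.0, 0.0]
```

Both generators are infinite, so the trainer can call `next(epsilon)` at every step without checking for exhaustion.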

> Complete ``` linear_annealing ``` in ``` dqn/common.py ```

> Complete ``` exponential_annealing ``` in ``` dqn/common.py ```

When the trainer is initialized, it selects the epsilon annealing strategy based on the given ```args```. For example, if ```epsilon-decay``` is given, we use the exponential decay strategy, but if ```epsilon-range``` is given, we use linear decay.

Finally, we need to implement the Q-value network.

> Complete ``` ValueNet ``` in ``` dqn/dqn/box2d.py ```

### Experiments

We run our experiments in the "Lunar Lander" environment. Let's see if two of the innovations introduced in the DQN paper (the replay buffer and the target network) make a difference. You can render evaluation episodes using the ```--render``` CL argument.

> Remember ```n-iterations``` and ```write-period``` must be fixed within each experiment for plotting purposes!
In total, there must be 15 runs (5 for each).

- Experiment **(run training 5 times)** with DQN in the Lunar Lander (default) environment using a very small replay buffer and very frequent target updates (a small value for ```target-update-period```).



In [None]:
!python dqn/dqn/box2d.py --log_dir logs/vanilla-dqn-exp1 --target-update-period 50 --buffer-capacity 10000

In [27]:
log_dir = os.path.join("logs", "vanilla-dqn-exp1")
os.makedirs(log_dir, exist_ok=True)  # also creates "logs" if it is missing
exp_1_dataframes = collect_training_logs(log_dir)

- Experiment with DQN in Lunar Lander using a large replay buffer and a long target update period.



In [None]:
!python dqn/dqn/box2d.py --log_dir logs/vanilla-dqn-exp2 --target-update-period 300 --buffer-capacity 50000

In [28]:
log_dir = os.path.join("logs", "vanilla-dqn-exp2")
os.makedirs(log_dir, exist_ok=True)  # also creates "logs" if it is missing
exp_2_dataframes = collect_training_logs(log_dir)

- Experiment with DQN using an exponentially decaying epsilon strategy.



In [None]:
!python dqn/dqn/box2d.py --log_dir logs/vanilla-dqn-exp3 --target-update-period 250 --buffer-capacity 40000 --epsilon-decay 0.99

In [29]:
log_dir = os.path.join("logs", "vanilla-dqn-exp3")
os.makedirs(log_dir, exist_ok=True)  # also creates "logs" if it is missing
exp_3_dataframes = collect_training_logs(log_dir)

The remaining hyperparameters must be tuned and then kept fixed. The first two experiments may use linear decay.

Obtain Pandas dataframes (15 in total) and plot the results using the given Plotter class.

In [32]:
Plotter({"primitive_dqn": exp_1_dataframes, "stable_dqn": exp_2_dataframes, "exp_decay_dqn": exp_3_dataframes})()


### Train All DQN and Rainbow Experiments at Once

In [None]:
for i in range(5):

    !python dqn/dqn/box2d.py --log_dir logs/vanilla-dqn-exp1 --target-update-period 50 --buffer-capacity 10000

    !python dqn/dqn/box2d.py --log_dir logs/vanilla-dqn-exp2 --target-update-period 300 --buffer-capacity 50000

    !python dqn/dqn/box2d.py --log_dir logs/vanilla-dqn-exp3 --target-update-period 250 --buffer-capacity 40000 --epsilon-decay 0.99

    !python dqn/rainbow/box2d.py --n-iterations 5000 --no-double --no-dueling --no-noisy --no-prioritized --n-steps 1 --no-dist

    !python dqn/rainbow/box2d.py --log_dir logs/prioritized --no-dist --no-dueling --n-step 1 --no-double --no-noisy

    !python dqn/rainbow/box2d.py --log_dir logs/distributional --no-prioritized --no-dueling --n-step 1 --no-double --no-noisy

    !python dqn/rainbow/box2d.py --log_dir logs/nsteps --no-prioritized --no-dist --no-dueling --n-step 5 --no-double --no-noisy

    !python dqn/rainbow/box2d.py --log_dir logs/double --no-prioritized --no-dist --no-dueling --n-step 1 --no-noisy

    !python dqn/rainbow/box2d.py --log_dir logs/dueling --no-prioritized --no-dist --n-step 1 --no-double --no-noisy

    !python dqn/rainbow/box2d.py --log_dir logs/noisy --no-prioritized --no-dist --no-dueling --n-step 1 --no-double

<h1 style="margin-top: 0px;
  margin-bottom: 10px;
  font-family: sans-serif;
  font-size: 8rem;">
<span style="color:#FF0000">R</span><span style="color:#FFDB00">a</span><span style="color:#49FF00">i</span><span style="color:#00FF92">n</span><span style="color:#0092FF">b</span><span style="color:#4900FF">o</span><span style="color:#FF00DB">w</span>
</h1>

We use DQN as a base class for our implementation. Rainbow introduces a few extensions over vanilla DQN. Each of these extensions can be disabled in our implementation. We will test the Rainbow agent in both Lunar Lander and Pong.

> **Read** the related paper or the book section before moving to implementation.

### Implementation

Before implementing the extensions, we need a bare minimum DQN so that you can test each extension independently. We pass the ```extensions``` dictionary to the Rainbow agent. The dictionary contains information about the extensions that we want to use in the Rainbow agent. You can see the definition of the ```extensions``` dictionary in ```dqn/rainbow/box2d.py```.

Luckily, we already have a "vanilla" DQN to start with. We only need to complete a few parts to run vanilla DQN (one that has no extensions) in the Rainbow agent.

> Complete ```ValueNet``` in ```dqn/rainbow/box2d.py```. Ignore the ```extensions``` dictionary for now.

Most of the methods are inherited from the DQN section. However, as you implement the extensions, you will need to replace them with their extension-based versions.


In [None]:
!python dqn/rainbow/box2d.py --n-iterations 5000 --no-double --no-dueling --no-noisy --no-prioritized --n-steps 1 --no-dist



#### Prioritized Buffer
Let's start with the Prioritized Replay Buffer. To implement this buffer we need weighted sampling. NumPy's ```np.random.choice``` function can be used for prioritized sampling.


> Complete ``` PriorityBuffer ``` in ``` dqn/replaybuffer/prioritized.py ```.
> - ```push```
> - ```sample```
> - ```update_priority```

**Prioritized Buffer** causes a few changes in the code.
- ```update``` function in the ```Trainer``` class located in ```dqn/rainbow/train.py```
- ``` loss ``` functions (two of them) in the ```Rainbow``` class. The loss tensor must not be averaged over the batch axis! In the ```update``` function we use the weighted average of the loss, where the weights are the importance sampling weights obtained from the Prioritized Buffer sample (see the paper for further details). Also, update the TD errors of the sampled transitions.

> Modify ```update``` method in  ```dqn/rainbow/train.py```.

> Modify ```vanilla_loss``` in ```dqn/rainbow/model.py```.

> Modify ```_next_action_network``` in ```dqn/rainbow/model.py```. (We will come back to this one in the Double & Noisy extensions.)

Use ```_next_action_network``` to obtain target actions so that the loss functions become compatible with double Q-learning.

Remember this while implementing the ```update``` function!

You can run the Prioritized Buffer experiment under the experiments section to test your implementation.
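
The weighted sampling and the importance sampling weights mentioned in this section can be illustrated on a toy priority array. The sketch follows the formulas of the prioritized replay paper, $P(i) = p_i^\alpha / \sum_k p_k^\alpha$ and $w_i = (N \cdot P(i))^{-\beta}$; the concrete values of `alpha` and `beta`, the normalization by the batch maximum, and all variable names are assumptions for this example only.

```python
import numpy as np

np.random.seed(0)  # only to make the example repeatable

priorities = np.array([1.0, 2.0, 0.5, 4.0])  # e.g. |TD error| plus a small constant
alpha, beta = 0.6, 0.4                        # hypothetical exponent values

# P(i) = p_i^alpha / sum_k p_k^alpha
probabilities = priorities ** alpha
probabilities /= probabilities.sum()

# Weighted sampling of a batch of indices (with replacement by default).
batch_indices = np.random.choice(priorities.size, size=3, p=probabilities)

# Importance sampling weights w_i = (N * P(i))^(-beta), scaled so the largest is 1.
weights = (priorities.size * probabilities[batch_indices]) ** (-beta)
weights /= weights.max()

# The loss stays per sample so that it can be weighted before averaging.
per_sample_loss = np.array([0.3, 1.2, 0.8])
weighted_loss = (weights * per_sample_loss).mean()
print(batch_indices, weights, weighted_loss)
```

This is why the loss functions must return one value per sample: averaging inside the loss would make it impossible to apply `weights` afterwards.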

- - -

#### Distributional RL

This extension changes the representation of the Q-value, so the loss function and the policy need modifications. The greedy policy needs the expected value of the Q distribution, so we need to implement an additional method that the greedy policy can use. Moreover, the Q network needs more outputs: ```(act_size * n_atoms)``` instead of ```act_size```.
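
For a categorical value distribution, the expected Q-value of an action is the probability-weighted sum of the atom values, $Q(s, a) = \sum_i z_i \, p_i(s, a)$. The small example below computes it with plain lists. The names `make_atoms`, `v_min`, and `v_max` are hypothetical, and the real ```expected_value``` method operates on batched network outputs rather than nested lists.

```python
from typing import List, Sequence


def make_atoms(v_min: float, v_max: float, n_atoms: int) -> List[float]:
    """Evenly spaced support z_0..z_{n-1} between v_min and v_max."""
    delta = (v_max - v_min) / (n_atoms - 1)
    return [v_min + i * delta for i in range(n_atoms)]


def expected_value(probs: Sequence[Sequence[float]],
                   atoms: Sequence[float]) -> List[float]:
    """Q(s, a) = sum_i z_i * p_i(s, a) for every action a."""
    return [sum(z * p for z, p in zip(atoms, action_probs)) for action_probs in probs]


atoms = make_atoms(v_min=-1.0, v_max=1.0, n_atoms=3)  # [-1.0, 0.0, 1.0]
probs = [[0.5, 0.25, 0.25],  # distribution over atoms for action 0
         [0.0, 0.5, 0.5]]    # distribution over atoms for action 1
print(expected_value(probs, atoms))  # [-0.25, 0.5]
```

The greedy policy would then pick action 1, the action with the largest expected value.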

> Complete ```distributional_loss ``` in ```dqn/rainbow/model.py```.

> Complete ```expected_value ``` in ```dqn/rainbow/model.py```.

> Modify ```HeadLayer ``` in ```dqn/rainbow/layers.py```.
- - -

#### N-step Learning

There are many ways of using n-step learning, so we will pick the simplest one. Ignore importance sampling ratios. Yield a transition whose reward equals the sum of $n$ consecutive rewards (discounted by gamma) and whose next_state is the state $n$ steps ahead. You can find this way of using n-step learning in Chapter 7 of the textbook (without Importance Sampling or Tree Backup, similar to n-step Sarsa). You can use ```deque```s to delay yielding transitions.

$(s_t, a_t, \sum_{j=t}^{t+n-1} \gamma^{j-t} r_j, \text{done}, s_{t+n})$

> Complete ``` __iter__``` in ``` dqn/rainbow/train.py ```

We set n to 1 to deactivate this extension.
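
One way to delay transitions with a ```deque``` is sketched below. The function name, the tuple layout `(state, action, reward, done, next_state)`, and the choice to flush pending transitions with truncated returns when an episode ends are assumptions of this example, not requirements of the assignment.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

Transition = Tuple[int, int, float, bool, int]  # (state, action, reward, done, next_state)


def n_step_transitions(transitions: Iterable[Transition], n: int,
                       gamma: float) -> Iterator[Transition]:
    """Delay one-step transitions in a deque and yield them with n-step discounted rewards."""
    window: Deque[Transition] = deque()

    def merge() -> Transition:
        # State and action come from the oldest entry, done and next_state from the newest.
        state, action = window[0][0], window[0][1]
        reward = sum(gamma ** k * t[2] for k, t in enumerate(window))
        done, next_state = window[-1][3], window[-1][4]
        return state, action, reward, done, next_state

    for transition in transitions:
        window.append(transition)
        if transition[3]:
            # Episode ended: flush every pending transition with a truncated return.
            while window:
                yield merge()
                window.popleft()
        elif len(window) == n:
            yield merge()
            window.popleft()
    # Transitions still pending when the stream stops mid-episode are not yielded.


one_step = [(0, 1, 1.0, False, 1), (1, 1, 2.0, False, 2), (2, 1, 4.0, True, 3)]
print(list(n_step_transitions(one_step, n=2, gamma=0.5)))
# [(0, 1, 2.0, False, 2), (1, 1, 4.0, True, 3), (2, 1, 4.0, True, 3)]
```

With `n=1` the generator returns the one-step transitions unchanged. When bootstrapping from an n-step transition, the value of the next state has to be discounted by $\gamma^n$ rather than $\gamma$.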

- - -
#### Double Q-learning

In double Q-learning, the target value is calculated using the actions selected by the online network (```valuenet```). Since we already use the ```_next_action_network``` function to find the action that yields the maximum value at the next state, we only need to implement the ```_next_action_network``` method in ```dqn/rainbow/model.py```.
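
The difference between the vanilla and the double Q-learning target comes down to which network selects the next action. The following sketch shows this on plain lists of Q-values; the function name `next_state_value` and its `double` flag are hypothetical and only loosely correspond to what ```_next_action_network``` decides in the real code.

```python
from typing import Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda a: values[a])


def next_state_value(next_q_online: Sequence[float],
                     next_q_target: Sequence[float], double: bool) -> float:
    """Value of the next state that enters the TD target."""
    # Double Q-learning: the online network selects, the target network evaluates.
    selector = next_q_online if double else next_q_target
    return next_q_target[argmax(selector)]


online = [1.0, 3.0, 2.0]  # Q_online(s', .)
target = [5.0, 0.5, 4.0]  # Q_target(s', .)
print(next_state_value(online, target, double=False))  # 5.0
print(next_state_value(online, target, double=True))   # 0.5
```

In the vanilla case the target network both selects and evaluates the action, which is what tends to overestimate values; with the double variant the selected action (index 1) is evaluated by the target network instead.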

> Modify ```_next_action_network``` in ```dqn/rainbow/model.py```.

- - -
#### Noisy Net

In this part, we need to complete ```NoisyLinear``` in ```dqn/rainbow/layers.py```. Moreover, when we use a noisy network we can act greedily, since the stochasticity is built into the network. In the ```__iter__``` method in ``` dqn/rainbow/train.py```, use ```greedy_policy``` if noisy-net is active.

> Complete ```NoisyLinear``` in ```dqn/rainbow/layers.py```.
> - ```__init__```
> - ```reset_noise```
> - ```forward```

> Modify ```update``` in  ```dqn/rainbow/train.py``` to reset noise if noisy network is active. Reset both the target and the online networks separately.

> Modify ```__iter__``` in  ```dqn/rainbow/train.py``` to use greedy (but noisy) policy for exploration.

> Modify ```ValueNet``` in ```dqn/rainbow/box2d.py```.

> Modify ```ValueNet``` in ```dqn/rainbow/pong.py``` when you start working with Pong.

> Modify ```HeadLayer``` in ```dqn/rainbow/layers.py```.

In eval mode, use parameter means. Do not forget to use eval mode for target value calculations.
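
To illustrate the train/eval distinction, the sketch below implements a tiny noisy linear layer with plain Python lists. It uses independent Gaussian noise for every weight, fixed `sigma` values, and a hand-written `training` flag; these are simplifying assumptions. In the actual ```NoisyLinear``` layer the means and sigmas are learnable parameters, and the paper also describes a cheaper factorized noise variant.

```python
import random
from typing import List


class NoisyLinearSketch:
    """y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b); eps is redrawn by reset_noise."""

    def __init__(self, in_size: int, out_size: int, sigma_init: float,
                 rng: random.Random) -> None:
        self.rng = rng
        self.training = True
        bound = in_size ** -0.5
        self.mu_w = [[rng.uniform(-bound, bound) for _ in range(in_size)]
                     for _ in range(out_size)]
        self.mu_b = [rng.uniform(-bound, bound) for _ in range(out_size)]
        self.sigma_w = [[sigma_init] * in_size for _ in range(out_size)]
        self.sigma_b = [sigma_init] * out_size
        self.reset_noise()

    def reset_noise(self) -> None:
        """Draw fresh independent Gaussian noise for every weight and bias."""
        self.eps_w = [[self.rng.gauss(0.0, 1.0) for _ in row] for row in self.mu_w]
        self.eps_b = [self.rng.gauss(0.0, 1.0) for _ in self.mu_b]

    def forward(self, x: List[float]) -> List[float]:
        scale = 1.0 if self.training else 0.0  # eval mode uses the parameter means only
        out = []
        for j in range(len(self.mu_b)):
            w = [m + scale * s * e
                 for m, s, e in zip(self.mu_w[j], self.sigma_w[j], self.eps_w[j])]
            b = self.mu_b[j] + scale * self.sigma_b[j] * self.eps_b[j]
            out.append(sum(wi * xi for wi, xi in zip(w, x)) + b)
        return out


layer = NoisyLinearSketch(in_size=3, out_size=2, sigma_init=0.5, rng=random.Random(0))
x = [1.0, -2.0, 0.5]
print(layer.forward(x))  # noisy output, changes after layer.reset_noise()
layer.training = False
print(layer.forward(x))  # deterministic output based on the means
```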


- - - 
#### Dueling Architecture

You can implement the Dueling architecture by filling in the ```HeadLayer``` class in ```dqn/rainbow/layers.py```. Remember, the structure of this class depends on the Dueling, Distributional, and Noisy Net extensions.
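
The dueling head combines a state value and per-action advantages as $Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')$. A minimal numeric example of this aggregation is given below; the function name is hypothetical, and in ```HeadLayer``` the same step is applied to batched network outputs (and, with the distributional extension, presumably per atom).

```python
from typing import List, Sequence


def dueling_q_values(state_value: float, advantages: Sequence[float]) -> List[float]:
    """Q(s, a) = V(s) + A(s, a) - mean over actions of A(s, .)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


print(dueling_q_values(2.0, [1.0, 0.0, -1.0]))  # [3.0, 2.0, 1.0]
```

Subtracting the mean advantage keeps the decomposition identifiable: adding a constant to all advantages leaves the resulting Q-values unchanged.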

> Modify ```HeadLayer``` in ```dqn/rainbow/layers.py```.

- - - 
#### Rainbow

Once you have completed all the extensions, you can combine them. Complete the implementation by filling in:

- In box2D, initialize a fully connected network.
> Complete ```ValueNet``` in ```dqn/rainbow/box2d.py```. Use ```HeadLayer``` and, if noisy is activated, ```NoisyLinear``` layers.
- In pong, initialize a convolutional network that reduces the spatial size to 5 by 5 (or any other value that you prefer).
> Complete ```ValueNet``` in ```dqn/rainbow/pong.py```. Use ```HeadLayer``` and, if noisy is activated, ```NoisyLinear``` layers.



### Experiments

We will test each extension on its own. Run ```box2d.py``` by enabling one extension at a time and store the results (5 runs per experiment). An example run is given below for the prioritized-only experiment.

> Remember ```n-iterations``` and ```write-period``` must be fixed within each experiment for plotting purposes!
In total, there must be 30 runs (5 for each).

#### DQN with Prioritized Buffer

In [None]:
!python dqn/rainbow/box2d.py --log_dir logs/prioritized --no-dist --no-dueling --n-step 1 --no-double --no-noisy

#### DQN with Distributional Values

In [None]:
!python dqn/rainbow/box2d.py --log_dir logs/distributional --no-prioritized --no-dueling --n-step 1 --no-double --no-noisy

#### DQN with N-Step

In [None]:
!python dqn/rainbow/box2d.py --log_dir logs/nsteps --no-prioritized --no-dist --no-dueling --n-step 5 --no-double --no-noisy

#### DQN with Double Q-learning

In [None]:
!python dqn/rainbow/box2d.py --log_dir logs/double --no-prioritized --no-dist --no-dueling --n-step 1 --no-noisy

#### DQN with Dueling Architecture

In [None]:
!python dqn/rainbow/box2d.py --log_dir logs/dueling --no-prioritized --no-dist --n-step 1 --no-double --no-noisy

#### DQN with Noisy Networks

In [None]:
!python dqn/rainbow/box2d.py --log_dir logs/noisy --no-prioritized --no-dist --no-dueling --n-step 1 --no-double

### Gather dataframes

In [None]:
dataframes_vanilla = collect_training_logs(os.path.join("logs", "vanilla-dqn-exp2"))
dataframes_prioritized = collect_training_logs(os.path.join("logs", "prioritized"))
dataframes_distributional = collect_training_logs(os.path.join("logs", "distributional"))
dataframes_nsteps = collect_training_logs(os.path.join("logs", "nsteps"))
dataframes_double = collect_training_logs(os.path.join("logs", "double"))
dataframes_dueling = collect_training_logs(os.path.join("logs", "dueling"))
dataframes_noisy = collect_training_logs(os.path.join("logs", "noisy"))

Plot the results using the provided Plotter.

In [None]:
Plotter(
    {"dqn": dataframes_vanilla,
     "dqn+prioritized_buffer": dataframes_prioritized,
     "dqn+distributional": dataframes_distributional,
     "dqn+n_step": dataframes_nsteps,
     "dqn+double_q": dataframes_double,
     "dqn+dueling": dataframes_dueling,
     "dqn+noisy_nets": dataframes_noisy,
    }
)()

> You can remove the ones that you did not implement from the plots.

> Feel free to experiment with hyperparameters. You can plot their scores and compare them.

**ATARI**

The next step is to train **Pong** with the Rainbow agent. This time, please enable model saving with ```--save-model``` and upload the model parameters that return the highest evaluation score to Google Drive. Put the link at the end of the notebook. You can use any combination of extensions.

> **Note**: No need to run Pong for more than 1 run!

> **Note**: You need GPU for this experiment! You can use [Colab](https://colab.research.google.com/) if you do not have access to a GPU machine.

Before starting a long training run, make sure that pong.py terminates successfully.

In [None]:
!python dqn/rainbow/pong.py --log_dir logs/pong --no-prioritized --no-dist --no-dueling --n-step 1 --no-noisy

In [None]:
# Plot the training
Plotter({"dqn": collect_training_logs(os.path.join("logs", "pong"))})()

Put the Google Drive [link](?) for the model parameters here.