# Deep Q-Learning (DQN)

In this exercise, you will implement and evaluate deep Q-learning (DQN) with convolutional neural networks for playing Atari games. You will implement the DQN algorithm covered in the lecture on top of the provided starter code.

As a quick reminder, DQN trains a parameterized Q-network $Q(\cdot, \cdot; \phi_k)$ by minimizing the empirical Bellman error:

$\mathcal{E}(\mathcal{D}, \phi_{k,g}) = \mathbb{E}_{(s,a,r,s')\sim \mathcal{D}} \left[\left(Q(s,a;\phi_{k,g}) - \left(\underbrace{r + \gamma \max_{a'} Q(s',a';\phi_k)}_{\text{fixed parameter}~ \phi_{k}}\right)\right)^2\right],$

where $\mathcal{D}$, $k$, and $g$ denote the replay buffer, the iteration number, and the gradient step, respectively. After the final iteration $K$, the learned Q-network is used to select the best action in any given state $s$:

$a = \text{argmax}_a Q(s, a; \phi_K)$.

## Implementation

The default code will run the `LunarLander-v3` game with reasonable hyperparameter settings. You may want to look inside `rlkit/infrastructure/dqn_utils.py` to understand how the replay buffer works, but you do not need to modify it.
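
For intuition about what a replay buffer does, here is a minimal sketch in plain Python. It is not the implementation in `rlkit/infrastructure/dqn_utils.py`: the class name `TinyReplayBuffer`, its methods, and the `(s, a, r, s_next, done)` tuple layout are assumptions made for illustration, and the real buffer may store and sample transitions differently.

```python
import random
from collections import deque


class TinyReplayBuffer:
    """Hypothetical fixed-capacity buffer of (s, a, r, s_next, done) tuples."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen drops the oldest transition once it is full.
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = TinyReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(s=t, a=0, r=1.0, s_next=t + 1, done=False)

# Only the 3 most recent transitions (t = 2, 3, 4) are still stored.
batch = buffer.sample(batch_size=2)
```

Sampling uniformly from past transitions, rather than training only on the latest one, is what makes the minibatch $B$ used in the critic update a random subset of $\mathcal{D}$.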

To implement DQN, you will write new code in the following files:

> * `rlkit/agents/dqn_agent.py`
> * `rlkit/critics/dqn_critic.py`
> * `rlkit/policies/argmax_policy.py`

In `rlkit/agents/dqn_agent.py`, you will implement some core parts of the DQN agent, including the epsilon-greedy exploration strategy, interaction with the environment, and storing transitions in and sampling them from the replay buffer.

> * epsilon-greedy exploration: $\pi_{k+1}(a | s) \leftarrow \epsilon \mathcal{U}(a) + (1-\epsilon) \delta \left(a = \text{argmax}_{a'} Q(s, a'; \phi_{k+1})\right)$
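
The mixture above can be sketched in a few lines of plain Python. This is only an illustration under assumed names: `epsilon_greedy_action` and its list-of-Q-values input are hypothetical and do not reflect the interface expected in `dqn_agent.py`.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick an action index given the Q-values of a single state."""
    if rng.random() < epsilon:
        # Explore: uniform over all actions (the greedy one included).
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q-value (the first one on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.7, -0.3, 0.2]

# With epsilon = 0 the policy is purely greedy.
greedy = epsilon_greedy_action(q_values, epsilon=0.0)

# With epsilon = 1 every action is drawn uniformly at random.
rng = random.Random(0)
samples = [epsilon_greedy_action(q_values, epsilon=1.0, rng=rng) for _ in range(1000)]
```

Note that the uniform branch can also return the greedy action, which matches the formula: the greedy action receives total probability $(1-\epsilon) + \epsilon \, \mathcal{U}(a)$.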

In `rlkit/critics/dqn_critic.py`, you will implement some core parts of the DQN Critic (Q-network), including Q-value estimation and Bellman error calculation.

> * Bellman error estimate: $\mathcal{E}(B, \phi_{k,g}) = \sum_{i\in \mathcal{I}} \left[\left(Q(s_i,a_i;\phi_{k,g}) - \left(r_i + \gamma \max_{a'} Q(s'_i,a';\phi_k)\right)\right)^2\right]$,

where $B = \{(s_i, a_i, r_i, s'_i)\}_{i\in\mathcal{I}}$ is a random subset (minibatch) of the replay buffer $\mathcal{D}$.
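
As a worked example of this sum, the sketch below evaluates it on a two-transition minibatch using lookup tables in place of neural networks. The names `bellman_error`, `q_net`, and `target_q_net` are hypothetical stand-ins for $Q(\cdot;\phi_{k,g})$ and $Q(\cdot;\phi_k)$, and terminal-state handling is left out because the formula above does not include it; the starter code may treat these details differently.

```python
def bellman_error(q_net, target_q_net, batch, gamma):
    """Sum of squared Bellman errors over a minibatch.

    q_net / target_q_net: hypothetical callables mapping a state to a list
    of Q-values, one per action.
    batch: iterable of (s, a, r, s_next) tuples.
    """
    total = 0.0
    for s, a, r, s_next in batch:
        # The target uses the fixed parameters phi_k, so no gradient
        # would flow through it in an actual network implementation.
        target = r + gamma * max(target_q_net(s_next))
        total += (q_net(s)[a] - target) ** 2
    return total


# Toy problem with two states and two actions.
q_table = {0: [1.0, 2.0], 1: [0.5, 0.0]}        # plays the role of phi_{k,g}
target_table = {0: [0.0, 0.0], 1: [3.0, 1.0]}   # plays the role of phi_k

batch = [(0, 1, 1.0, 1), (1, 0, 0.0, 0)]
error = bellman_error(
    lambda s: q_table[s], lambda s: target_table[s], batch, gamma=0.5
)
# Transition 1: target = 1.0 + 0.5 * 3.0 = 2.5, squared error (2.0 - 2.5)^2 = 0.25
# Transition 2: target = 0.0 + 0.5 * 0.0 = 0.0, squared error (0.5 - 0.0)^2 = 0.25
```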

In `rlkit/policies/argmax_policy.py`, you will implement the argmax policy, which selects the action that maximizes the estimated Q-value.
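
A batched version of this selection rule might look like the following sketch. The function name `argmax_actions` and the list-of-rows input are assumptions for illustration only; the actual policy class in the starter code operates on the critic's outputs and has its own interface.

```python
def argmax_actions(q_values_batch):
    """Return the greedy action index for each row of Q-values."""
    return [
        max(range(len(row)), key=row.__getitem__)  # first index on ties
        for row in q_values_batch
    ]


# Three states with two actions each; the last row is a tie.
actions = argmax_actions([[0.1, 0.9], [2.0, -1.0], [0.0, 0.0]])
```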



## Setup

In [None]:
#@title 1. Mount your Google Drive

from google.colab import drive

drive.mount('/content/drive', force_remount=True)

# Enter the name of the folder in your Drive where you saved the unzipped 'rlkit' folder
FOLDERNAME = 'hw9'

assert FOLDERNAME is not None, "[!] Enter the foldername."

%cd /content/drive/MyDrive/$FOLDERNAME

In [None]:
#@title 2. Install packages

#@markdown Please run the following script to install external Linux and Python packages.

#@markdown This may take a few minutes.

!apt update 
!apt install xvfb ffmpeg

!pip install tensorboard tensorboardX pyvirtualdisplay selenium swig pyglet
!pip install Box2D
!pip install gym==0.22.0

## Run DQN

In [4]:
#@title 1. Import packages

import os
from pyvirtualdisplay import Display
import time

from rlkit.infrastructure.rl_trainer import OffPolicyRLTrainer
from rlkit.agents.dqn_agent import DQNAgent
from rlkit.infrastructure.dqn_utils import get_env_kwargs

%load_ext autoreload
%autoreload 2

In [6]:
#@title 3. Runtime arguments

class Args:

  def __getitem__(self, key):
    return getattr(self, key)

  def __setitem__(self, key, val):
    setattr(self, key, val)

  def __contains__(self, key):
    return hasattr(self, key)

  env_name = 'LunarLander-v3' #@param ['LunarLander-v3']
  exp_name = 'dqn' #@param

  ## The PDF tells you how to set ep_len
  ## and discount for each environment
  ep_len = 200 #@param {type: "integer"}

  #@markdown batches and steps
  batch_size = 32 #@param {type: "integer"}
  eval_batch_size = 1000 #@param {type: "integer"}

  num_agent_train_steps_per_iter = 1 #@param {type: "integer"}

  num_critic_updates_per_agent_update = 1 #@param {type: "integer"}
  
  #@markdown Q-learning parameters
  double_q = True #@param {type: "boolean"}

  #@markdown system
  save_params = False #@param {type: "boolean"}
  no_gpu = False #@param {type: "boolean"}
  which_gpu = 0 #@param {type: "integer"}
  seed = 1337 #@param {type: "integer"}

  #@markdown logging
  ## default is to not log video so
  ## that logs are small enough to be
  ## uploaded to Gradescope
  video_log_freq = -1 #@param {type: "integer"}
  scalar_log_freq = 1000 #@param {type: "integer"}


args = Args()

args['train_batch_size'] = args['batch_size']

In [7]:
#@title 4. Create directory for logging

base_logdir = "logs"
exp_name = args["exp_name"] + '_' + args["env_name"]
logdir = os.path.join(base_logdir, exp_name)
os.makedirs(logdir, exist_ok=True)
args["logdir"] = logdir


In [8]:
#@title 5. Define Q-function trainer

class Q_Trainer(object):

    def __init__(self, params):
        self.params = params

        # Note: train_args is built here but not referenced again in this cell.
        train_args = {
            'num_agent_train_steps_per_iter': params['num_agent_train_steps_per_iter'],
            'num_critic_updates_per_agent_update': params['num_critic_updates_per_agent_update'],
            'train_batch_size': params['batch_size'],
            'double_q': params['double_q'],
        }

        env_args = get_env_kwargs(params['env_name'])

        # Copy the environment-specific settings into the shared params
        for k, v in env_args.items():
            params[k] = v

        self.params['agent_class'] = DQNAgent
        self.params['agent_params'] = params
        self.params['train_batch_size'] = params['batch_size']
        self.params['env_wrappers'] = env_args['env_wrappers']

        self.rl_trainer = OffPolicyRLTrainer(self.params)

    def run_training_loop(self):
        self.rl_trainer.run_training_loop(
            self.params['num_timesteps'],
            collect_policy = self.rl_trainer.agent.actor,
            eval_policy = self.rl_trainer.agent.actor,
            )

In [None]:
#@title 6. Run training

#@markdown If your implementation is correct, the average return will be close to or above 50.

#@markdown This may take about 30 minutes.

trainer = Q_Trainer(args)
trainer.run_training_loop()

In [None]:
#@title 7. Run Tensorboard

%load_ext tensorboard
%tensorboard --logdir /content/drive/MyDrive/$FOLDERNAME/logs