# RL Exercise 3 - Deep Q Networks

**GOAL:** The goal of this exercise is to demonstrate how to use the deep Q networks (DQN) algorithm and compare its efficiency with PPO. You will also learn how to use RLlib's command-line benchmark API and visualize results with TensorBoard.

To understand how to use **RLlib**, see the documentation at http://rllib.io.

DQN was one of the first deep RL algorithms to learn control policies directly from high-dimensional sensory input, and is described in detail in https://arxiv.org/abs/1312.5602.

In DQN, instead of training a policy network to emit actions directly from the observation, we learn a Q function that estimates the expected cumulative reward of taking each available action in the current state. The agent then uses this estimate to choose its action at each step, typically the one with the highest Q value.
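As a rough illustration of how a Q function is used, the sketch below picks the greedy action from a list of Q values and computes the standard one-step Q-learning target. It is a minimal sketch, not RLlib's implementation: the function names `greedy_action` and `td_target`, the discount factor, and the Q values themselves are made up for illustration.

```python
def greedy_action(q_values):
    """Return the index of the action with the highest estimated Q value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a').

    When the episode has ended there is no next state, so the target
    is just the immediate reward.
    """
    if done:
        return float(reward)
    return float(reward + gamma * max(next_q_values))


# Hypothetical Q values for a two-action problem such as CartPole.
q_values = [0.2, 1.5]
action = greedy_action(q_values)

# Target used to regress Q(s, a) after observing reward 1.0.
target = td_target(reward=1.0, next_q_values=[0.5, 2.0], done=False, gamma=0.9)
```

In a full DQN, the Q values come from a neural network, and the network is trained to move its prediction for the chosen action toward this target.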

Unlike policy gradient algorithms such as PPO, DQN can learn from past experiences through *experience replay*. This allows DQN to use experiences multiple times over the course of training, improving its sample efficiency. In this exercise we are going to use a single-process configuration for DQN, but RLlib does provide a distributed variant of DQN: https://ray.readthedocs.io/en/latest/rllib-algorithms.html#distributed-prioritized-experience-replay-ape-x
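To make experience replay concrete, here is a minimal sketch of a fixed-size replay buffer. It assumes uniform sampling and a simple tuple layout for transitions; the class name `ReplayBuffer` and its methods are hypothetical and do not correspond to RLlib's internal classes.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self._storage = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done):
        self._storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # Sampling without replacement within a batch; the same transition
        # can still be drawn again in later batches, which is the reuse
        # that improves sample efficiency.
        return random.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    # Toy transitions: the observation is just the step index.
    buffer.add(obs=t, action=0, reward=1.0, next_obs=t + 1, done=False)

batch = buffer.sample(batch_size=2)
```

With a capacity of 3, only the three most recent transitions remain after five insertions, so every sampled batch is drawn from those.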

## Running DQN with the command-line API

This time, we won't use the Python API. For well-known benchmark environments such as CartPole-v0, it is more convenient to launch training from the command line.

**EXERCISE**: Open a new terminal in JupyterLab using the "+" button, and run:

`$ rllib train --run=DQN --env=CartPole-v0`

## Comparison with PPO

**EXERCISE**: Compare the performance of DQN with PPO. How many timesteps does it take to reach a reward of 150?
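One way to answer this question from the reported training results is sketched below. It assumes each training iteration is available as a dictionary with the keys `timesteps_total` and `episode_reward_mean`; check these names against the output of your RLlib version. The helper `timesteps_to_reach` is hypothetical, and the numbers are placeholder values, not real results.

```python
def timesteps_to_reach(results, threshold=150.0):
    """Return the timestep count of the first iteration whose mean
    episode reward meets the threshold, or None if it never does."""
    for row in results:
        if row["episode_reward_mean"] >= threshold:
            return row["timesteps_total"]
    return None


# Placeholder values for illustration only; substitute your own results.
toy_results = [
    {"timesteps_total": 1000, "episode_reward_mean": 22.0},
    {"timesteps_total": 2000, "episode_reward_mean": 95.5},
    {"timesteps_total": 3000, "episode_reward_mean": 161.2},
    {"timesteps_total": 4000, "episode_reward_mean": 140.0},
]

first_hit = timesteps_to_reach(toy_results)
never_hit = timesteps_to_reach(toy_results, threshold=200.0)
```

Running the same helper on the results of both algorithms gives two timestep counts that can be compared directly.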

Note that you can run PPO from the command line as well. Configuration can be passed as a JSON string via the `--config` flag.

`$ rllib train --run=PPO --env=CartPole-v0 --config='{"num_sgd_iter": 30}'`

## Visualize results with TensorBoard

**EXERCISE**: Finally, you can visualize your training results using TensorBoard. To do this, run:

`$ tensorboard --logdir=~/ray_results`

Then open your browser to the address printed. Compare the learning curves of PPO and DQN. Toggle the horizontal axis between the "STEPS" and "RELATIVE" views to compare efficiency in number of timesteps versus wall-clock time.