# Deep Q-Learning Network (DQN)

In this exercise, you will implement a Deep Q-Learning Network (DQN) agent that uses a neural network (NN) to estimate the action value (Q-value) of each action in a given state. DQN was introduced by DeepMind in 2013, and it became famous when the 2015 follow-up paper reported human-level performance (at least 75% of a professional human tester's score) on 29 of the 49 *Atari 2600* games it was evaluated on.

## Prerequisites

You are expected to have basic familiarity with machine learning, deep learning, and reinforcement learning, especially the following topics:

 * Neural Network
 * Q-Learning
 * (Optional) Convolutional Neural Network

You are also expected to have basic familiarity with PyTorch, especially the following modules:

 * `torch.nn`
 * `torch.optim`

## Papers

 * [Playing Atari with Deep Reinforcement Learning (2013)](https://arxiv.org/pdf/1312.5602.pdf)
 * [Human-level control through deep reinforcement learning (2015)](https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf)

## Environment: CartPole

First, let's get familiar with the environment we will use to train and validate our DQN agent. We will use the **CartPole** environment from [OpenAI Gym](https://gym.openai.com). The CartPole system has two components: a cart and a pole attached to it.

The agent's goal is to keep the pole upright. To do so, the agent must push the cart left or right at every time step.

![Cartpole](cartpole.gif)

**State Space Dimensions**: 4
 * $x$: Position of the cart
 * $x'$: Velocity of the cart
 * $\theta$: Angle of the pole
 * $\theta'$: Angular velocity of the pole

**Action Space Size**: 2
 * Left
 * Right

**Reward**: +1 for every time step until the episode ends

To understand the environment a bit better, let's render the environment with a random-action agent.

In [1]:
import gym

To create an environment from OpenAI Gym, we call `gym.make()` with the appropriate environment name. To initialize the environment, we call `env.reset()`, which returns the observation of the first state $s_0$.

In [2]:
env = gym.make('CartPole-v0')
obs = env.reset()
print(obs)

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
[ 0.03656076  0.03471627 -0.00079671  0.04015512]


Every OpenAI Gym environment has an `action_space` attribute with a `sample()` method. `env.action_space.sample()` returns a randomly selected action.

In [3]:
action = env.action_space.sample()

To perform an action, we use `env.step(action)`. The environment then returns `(obs, rew, done, info)`, where
 * `obs` is the observation of the state after the action
 * `rew` is the reward after the action
 * `done` is a boolean indicating if the episode has terminated
 * `info` is a dictionary containing other useful information about the environment

In [4]:
obs, rew, done, info = env.step(action)
print('Observation: ', obs)
print('Reward: ', rew)
print('Done: ', done)
print('Info: ', info)

Observation:  [ 3.72550881e-02 -1.60394250e-01  6.39132317e-06  3.32586567e-01]
Reward:  1.0
Done:  False
Info:  {}


To render the environment, we use `env.render()`, and to close the rendered window, we use `env.close()`.

In [5]:
obs = env.reset()  # start a fresh episode before rolling out
done = False
while not done:
    action = env.action_space.sample()  # pick a random action
    obs, rew, done, info = env.step(action)
    env.render()
env.close()
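
Because the reward is +1 for every step, the total reward of an episode is simply its length. The following is a minimal sketch of the same interaction loop that also accumulates the episode return. So that it runs without Gym installed, it uses `ToyEnv`, a hypothetical stand-in that only mimics the `reset()` / `step()` interface described above; it is not CartPole, and `run_episode` and `random_policy` are illustrative names, not part of Gym.

```python
import random


class ToyEnv:
    """Hypothetical stand-in that mimics the Gym `reset()` / `step()` interface.

    It is NOT CartPole: an episode simply ends at random, which is enough
    to exercise the interaction loop without Gym.
    """

    def __init__(self, max_steps=200, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0]  # dummy 4-dimensional observation

    def step(self, action):
        self.t += 1
        obs = [self.rng.uniform(-1, 1) for _ in range(4)]
        # Terminate at random (10% per step) or when the step limit is hit.
        done = self.rng.random() < 0.1 or self.t >= self.max_steps
        return obs, 1.0, done, {}


def run_episode(env, policy):
    """Run one episode and return the undiscounted sum of rewards."""
    obs = env.reset()
    done = False
    episode_return = 0.0
    while not done:
        action = policy(obs)
        obs, rew, done, info = env.step(action)
        episode_return += rew
    return episode_return


def random_policy(obs):
    return random.choice([0, 1])  # 0 = left, 1 = right


env = ToyEnv()
returns = [run_episode(env, random_policy) for _ in range(5)]
print(returns)
```

With the real environment, passing the `env` created by `gym.make('CartPole-v0')` to `run_episode` should work the same way, assuming the four-value `step()` return shown earlier.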

## Naive DQN

Now, let us implement a neural network with PyTorch. We will use a simple fully connected network with a few layers.

In [6]:
import torch
import torch.nn as nn

We create a `NaiveDQN` class that inherits from `torch.nn.Module`, define the layers in `NaiveDQN.__init__()`, and define the forward pass in `NaiveDQN.forward()`.

In [7]:
class NaiveDQN(nn.Module):

    def __init__(self, input_dims, output_dims):
        super(NaiveDQN, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dims, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, output_dims)
        )

    def forward(self, x):
        return self.layers(x)

    def act(self, state, epsilon):
        pass
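
The `act` method is left for you to implement. A common choice for its `epsilon` parameter is epsilon-greedy action selection; the following is a minimal, framework-free sketch of that rule, not the intended solution. The function name `epsilon_greedy` and the example Q-values are assumptions made for illustration. Inside `NaiveDQN.act`, the Q-values would presumably come from `self.forward(state)` instead of a plain list.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a sequence of Q-values.

    With probability `epsilon` a uniformly random action is returned
    (exploration); otherwise the action with the largest Q-value is
    returned (exploitation).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


q = [0.1, 0.7]  # made-up Q-values for (left, right)
print(epsilon_greedy(q, epsilon=0.0))  # always greedy
print(epsilon_greedy(q, epsilon=1.0))  # always random
```

Starting with a large `epsilon` and decreasing it during training is a typical way to shift from exploration to exploitation.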

## Replay Memory
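
As a starting point for this section, here is a minimal sketch of a replay memory: a fixed-capacity buffer that stores past transitions and returns uniformly sampled mini-batches. The class name `ReplayMemory`, its method names, and the transition layout `(state, action, reward, next_state, done)` are assumptions for illustration, not a prescribed interface.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity buffer of transitions (hypothetical helper)."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest item once it is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement.
        batch = random.sample(list(self.buffer), batch_size)
        # Transpose the list of transitions into one tuple per field.
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.push([float(t)] * 4, t % 2, 1.0, [float(t + 1)] * 4, False)

print(len(memory))  # only the 3 most recent transitions are kept
states, actions, rewards, next_states, dones = memory.sample(2)
print(actions, rewards)
```

Sampling at random, rather than training on consecutive steps, reduces the correlation between the transitions in a mini-batch.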

## Target Network

## Huber Loss
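
As a starting point for this section, here is a minimal sketch of the Huber loss for a single error value: quadratic for small errors and linear for large ones, which makes it less sensitive to outliers than squared error. The function name `huber` and the default threshold `delta=1.0` are assumptions for illustration.

```python
def huber(error, delta=1.0):
    """Huber loss of a single error value."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error ** 2  # quadratic region
    return delta * (abs_error - 0.5 * delta)  # linear region


for e in (0.5, 2.0, -3.0):
    print(e, huber(e))
```

In PyTorch, `torch.nn.SmoothL1Loss` with its default settings computes the same function with a threshold of 1, averaged over a batch.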

## DQN

## Optional Environment: Pong

In the DQN paper by Google DeepMind, the agents were evaluated on *Atari 2600* games. We can also train and evaluate our DQN agent on *Atari 2600* games. Specifically, we will use *Pong* as our environment.

![](pong.jpg)

**State Space Dimensions**: (210, 160, 3)
 * Height of 210 pixels
 * Width of 160 pixels
 * Pixels represented by RGB values: 3 integers ranging from 0 to 255

**Action Space Size**: 6

In [8]:
env = gym.make('PongNoFrameskip-v4')
obs = env.reset()
print(obs.shape)

(210, 160, 3)


In [9]:
print(env.action_space.n)

6


## Optional Agent: CNN DQN

Unlike in CartPole, the observation we receive from the Pong environment is an image of shape (210, 160, 3). Thus, we should use a Convolutional Neural Network (CNN) in the DQN.

In [10]:
class CNNDQN(nn.Module):
    def __init__(self, input_dims, output_dims):
        pass

    def forward(self, x):
        pass

    def act(self, state, epsilon):
        pass
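
One detail you will need when filling in `CNNDQN` is the size of the flattened feature map that feeds the first fully connected layer. The sketch below computes it for an unpadded convolution stack. The three layers (8x8 kernel with stride 4 and 32 filters, 4x4 with stride 2 and 64 filters, 3x3 with stride 1 and 64 filters) are an assumption borrowed from the 2015 paper, which applies them to preprocessed 84x84 frames rather than the raw 210x160 frame used here. `conv_out` is a hypothetical helper.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1


# Assumed conv stack: (kernel, stride, out_channels) per layer.
layers = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]

h, w = 210, 160  # raw Pong frame height and width
channels = 3     # RGB
for kernel, stride, out_channels in layers:
    h, w = conv_out(h, kernel, stride), conv_out(w, kernel, stride)
    channels = out_channels

# Final feature map shape and the flattened size for the first linear layer.
print(h, w, channels, h * w * channels)
```

Note that PyTorch convolution layers expect channels-first input of shape (N, C, H, W), so an observation of shape (210, 160, 3) has to be transposed before it is passed to the network.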

## References

 * [PyTorch: Reinforcement Learning (DQN) Tutorial](https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html)
 * [RL Adventure by higgsfield](https://github.com/higgsfield/RL-Adventure)