# Tutorial for DQN (Deep Q-Network)

This is a Jupyter tutorial for DQN (Deep Q-Network), a model-free, value-based algorithm for reinforcement learning.

## Requirements
- Python 3.6 or higher (the code uses the class-based `typing.NamedTuple` syntax)
- Pip for Python3
- GPU environment

## Step 1. Set up environment
In this step, we install the required modules. We train the model in a GPU environment (called "remote" below) and need the same environment (i.e. installed modules) on both local and remote. For this purpose, we use `pipenv`, a tool for package management and virtual environments.

### Step 1.1. Set up pipenv for local
Run the commands below in your local environment.

```
$ pip3 install pipenv --user
$ cd {project-directory}
$ pipenv --python 3
```

Install required modules.

```
$ pipenv install numpy
$ pipenv install torch torchvision
$ pipenv install gym gym[atari]
$ pipenv install python-dotenv
```
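
To check that the installation worked, a small script along these lines can be run inside the virtual environment (for example with `pipenv run python`). This is only a sketch: the `REQUIRED` list and the `missing_modules` helper are illustrative names, and the import names are assumed to match the packages installed above (`python-dotenv` is imported as `dotenv`).

```python
import importlib.util

# Import names assumed to correspond to the packages installed above.
REQUIRED = ["numpy", "torch", "torchvision", "gym", "dotenv"]


def missing_modules(names):
    # Return the names that cannot be found in the current environment.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing:", missing if missing else "none")
```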

## Step 2. Implementation
Now let's start the implementation.

Note: The code below is adapted for Jupyter notebook (module imports and usage). The complete version can be found at https://github.com/nosukeru/DQN_tutorial

### Step 2.1. Configuration
First, we set up some configuration for switching between the local and remote environments.

In [1]:
# config.py

import os
from dotenv import load_dotenv

load_dotenv('.env')
isLocal = (os.environ.get('ENV') == 'local')


Then edit the `.env` file.

```
# .env
ENV=local
```
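
One consequence of this check is worth spelling out: any value other than `local`, including a missing `.env` file, is treated as remote, and the later code then calls `.cuda()`. The sketch below mirrors the comparison from `config.py` on a plain dictionary; `is_local` and `device_name` are hypothetical helpers used only for illustration.

```python
def is_local(environ):
    # Mirrors the check in config.py: only the exact value 'local' counts.
    return environ.get('ENV') == 'local'


def device_name(environ):
    # Hypothetical helper: the device the rest of the code would end up using.
    return 'cpu' if is_local(environ) else 'cuda'


print(device_name({'ENV': 'local'}))  # cpu
print(device_name({}))                # cuda (ENV unset, e.g. .env missing)
```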

### Step 2.2. Implement Q-Network
Q-Network is a deep neural network that approximates the Q-function, which represents the expected total (discounted) reward for taking action $a$ in state $s$ and following policy $\pi(a|s)$ afterwards:

$$ Q^\pi(s, a) = r(s, a) + \gamma E_{s' \sim P(s'|s, a)} [V^\pi(s')] $$
$$ V^\pi(s) = E_{a \sim \pi(a|s)} [Q^\pi(s, a)] $$

where $\gamma \in [0, 1]$ is the discount factor. When policy $\pi$ is optimal, the following Bellman equation holds:

$$ Q^\ast(s, a) = r(s, a) + \gamma E_{s' \sim P(s'|s, a)} [\max_{a'} Q^\ast(s', a')] $$
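
To make the Bellman equation concrete, here is a minimal tabular sketch on a made-up two-action problem. It is not part of the tutorial code: the `MDP` table, the helper `q_iteration` and the discount factor of 0.9 applied to future rewards are all assumptions chosen for illustration. Repeatedly applying the right-hand side of the equation as an update converges to the optimal Q-values.

```python
GAMMA = 0.9  # assumed discount factor for this toy example

# Hypothetical deterministic MDP: (state, action) -> (reward, next_state).
# next_state None marks the end of the episode.
MDP = {
    (0, 'stay'): (0.0, 0),
    (0, 'go'): (1.0, None),
}


def q_iteration(mdp, gamma, sweeps=100):
    # Start from Q = 0 and apply the Bellman update to every (s, a) pair.
    q = {key: 0.0 for key in mdp}
    for _ in range(sweeps):
        for (s, a), (r, s_next) in mdp.items():
            if s_next is None:
                best_next = 0.0  # no future reward after the episode ends
            else:
                best_next = max(v for (s2, _), v in q.items() if s2 == s_next)
            q[(s, a)] = r + gamma * best_next
    return q


Q = q_iteration(MDP, GAMMA)
print(Q)
```

Here `go` is worth 1.0 (the immediate reward), while `stay` is worth 0.9: no reward now, followed by the best action one step later.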

Our objective is to approximate the optimal Q-function with a deep neural network.
Here, we design the Q-Network to take $s$ as input and output a vector of Q-values, one for each $a$, for computational efficiency. Note that this is possible only when the action space is finite and does not depend on the state.

Below is an implementation of the Q-Network using PyTorch. The input is a Torch tensor of stacked preprocessed frames of size (batch, 4, 84, 84) (detailed later), and the output is a Torch tensor of Q-values of size (batch, nAction), one per action. The architecture follows that of the original paper.

In [2]:
# models.py

import torch.nn as nn
import torch.nn.functional as F


# Q-Network model
class QNet(nn.Module):
    def __init__(self, nAction):
        # nAction: number of action (depends on gym environment)

        super(QNet, self).__init__()
        self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4)  # (4, 84, 84) -> (32, 20, 20)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)  # (32, 20, 20) -> (64, 9, 9)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)  # (64, 9, 9) -> (64, 7, 7)
        self.fc1 = nn.Linear(64 * 7 * 7, 512)
        self.fc2 = nn.Linear(512, nAction)

    def forward(self, x):
        # run forward propagation
        # x: state (4 stacked grayscale frames of size 84x84)

        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.relu(self.fc1(x.view(x.size(0), -1)))  # flatten
        return self.fc2(x)
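
The shape comments on the convolution layers above can be checked by hand. The sketch below uses the standard output-size formula for a convolution without padding or dilation; `conv_out` is a helper introduced here only for illustration.

```python
def conv_out(size, kernel, stride):
    # Output length of a convolution without padding or dilation.
    return (size - kernel) // stride + 1


size = 84
sizes = []
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:  # conv1, conv2, conv3
    size = conv_out(size, kernel, stride)
    sizes.append(size)

flat_features = 64 * size * size  # input size of fc1
print(sizes, flat_features)       # [20, 9, 7] 3136
```

The last value is where `64 * 7 * 7` in `fc1` comes from.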


### Step 2.3. Implement Policy

When the Q-function is optimal, the best policy for a given state $s$ is to select the action $a$ that maximizes the Q-value:

$$ \pi(s) = \underset{a}{\operatorname{argmax}} Q^\ast(s, a) $$

To facilitate exploration, we use the epsilon-greedy strategy, in which the agent takes a random action with probability $\epsilon$ and the greedy action otherwise.

In [3]:
# agents.py

import numpy as np
from torch.autograd import Variable


# agent (or policy) model
class Agent(object):
    def __init__(self, nAction, Q):
        # nAction: number of action (depends on gym environment)
        # Q: policy network

        self.nAction = nAction
        self.Q = Q

    def getAction(self, state, eps):
        # calc best action for given state
        # state: state (4 stacked grayscale frames of size 84x84)
        # eps: value for epsilon greedy

        var = Variable(state)
        if not isLocal:
            var = var.cuda()

        # action with max Q value (q for value and argq for index)
        q, argq = self.Q(var).max(1) 

        # epsilon greedy
        probs = np.full(self.nAction, eps / self.nAction, np.float32)
        probs[argq[0]] += 1 - eps
        return np.random.choice(np.arange(self.nAction), p=probs)
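
The probability vector built in `getAction` can be inspected in isolation. The following is a plain-Python sketch of the same construction, without NumPy or PyTorch; `epsilon_greedy_probs` and `sample_action` are illustrative names, not part of the tutorial code.

```python
import random


def epsilon_greedy_probs(n_action, greedy_action, eps):
    # Every action gets eps / n_action; the greedy one gets the rest on top.
    probs = [eps / n_action] * n_action
    probs[greedy_action] += 1.0 - eps
    return probs


def sample_action(probs, rng=random):
    # Draw one action index according to the given probabilities.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = epsilon_greedy_probs(n_action=4, greedy_action=2, eps=0.2)
print(probs)                 # roughly [0.05, 0.05, 0.85, 0.05]
print(sample_action(probs))
```

Note that the greedy action can also be picked by the random branch, so its total probability is $1 - \epsilon + \epsilon / n$ rather than $1 - \epsilon$.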


### Step 2.4. Implement Trainer
When it comes to training the network, there are some techniques that improve stability and efficiency:

#### ReplayBuffer
Saving trajectories to a memory and training the network on samples drawn randomly from this buffer has several benefits:

- enables batch training
- reduces correlation between experiences
- avoids forgetting previous experiences

#### Frozen target network
The training target is computed from the output of the Q-Network itself, so every update of the Q-Network also shifts the target, which makes learning unstable. To improve stability, we can compute the target with a copy of the Q-Network whose weights are kept fixed for a while and only periodically synchronized.
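
The update pattern can be sketched without any neural network. In this toy example a dictionary stands in for the network weights, and `SYNC_EVERY` is a hypothetical interval; in the training loop later in this tutorial the same role is played by `QTarget.load_state_dict(Q.state_dict())`.

```python
import copy

online = {'w': 0.0}             # stand-in for the Q-Network weights
target = copy.deepcopy(online)  # frozen copy
SYNC_EVERY = 3                  # hypothetical sync interval

history = []
for t in range(1, 8):
    online['w'] += 1.0          # pretend one training step changed the weights
    if t % SYNC_EVERY == 0:
        target = copy.deepcopy(online)  # periodic synchronization
    history.append((t, online['w'], target['w']))

for row in history:
    print(row)
```

The third column changes only at steps 3 and 6, while the second changes at every step.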

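With these techniques in mind, we can use the following squared loss as the learning objective (up to a constant factor, this is what `nn.MSELoss` computes in the code below):

$$ \sum_{i \in batch} \frac{1}{2} \{ Q(s_i, a_i) - (r_i + \gamma \max_{a'} Q_{target}(s'_i, a')) \} ^2 $$

where $\gamma$ is the discount factor and each element $(s, a, r, s')$ = (state, action, reward, nextState) of the batch is sampled from the ReplayBuffer.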
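
As a worked example of this objective, the sketch below computes the targets and the mean squared error for a batch of two transitions in plain Python. All numbers are made up, and the discount factor of 0.99 is an assumption that matches the default of `--gamma` used later.

```python
GAMMA = 0.99  # assumed discount factor

# Hypothetical batch of two transitions with made-up network outputs.
q_sa = [1.0, 0.5]                         # Q(s_i, a_i) from the Q-Network
rewards = [1.0, 0.0]                      # r_i
q_target_next = [[0.0, 2.0], [1.0, 0.5]]  # Q_target(s'_i, a') for each a'

# target_i = r_i + gamma * max_a' Q_target(s'_i, a')
targets = [r + GAMMA * max(q_next) for r, q_next in zip(rewards, q_target_next)]
errors = [q - y for q, y in zip(q_sa, targets)]
mse = sum(e * e for e in errors) / len(errors)

print(targets)  # roughly [2.98, 0.99]
print(mse)      # roughly 2.08025
```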

In [4]:
# trainers.py

import torch
import torch.optim as optim
import torch.nn as nn
from torch.autograd import Variable


# utility class for training Q-Network
class Trainer(object):
    def __init__(self, Q, QTarget, opt, args):
        # Q: Q-Network
        # QTarget: Target Q-Network
        # opt: optimizer

        self.Q = Q
        self.QTarget = QTarget
        self.opt = opt

        self.gamma = args.gamma
        self.lossFunc = nn.MSELoss()

    def update(self, batch):
        # update model for given batch
        # batch: training batch of (state, action, reward, nextState)

        # extract training batch
        stateBatch = Variable(torch.cat([step.state for step in batch], 0))
        actionBatch = torch.LongTensor([step.action for step in batch])
        rewardBatch = torch.Tensor([step.reward for step in batch])
        nextStateBatch = Variable(torch.cat([step.nextState for step in batch], 0))

        if not isLocal:
            stateBatch = stateBatch.cuda()
            actionBatch = actionBatch.cuda()
            rewardBatch = rewardBatch.cuda()
            nextStateBatch = nextStateBatch.cuda()

        # calc values for update model
        qValue = self.Q(stateBatch).gather(1, actionBatch.unsqueeze(1)).squeeze(1)  # Q(s, a)
        qTarget = rewardBatch + self.QTarget(nextStateBatch).detach().max(1)[0] * self.gamma  # r + γmaxQ(s', a')

        L = self.lossFunc(qValue, qTarget)  # loss to equalize Q(s, a) and r + γmaxQ(s', a')
        self.opt.zero_grad()
        L.backward()
        self.opt.step()  # train for one batch step


### Step 2.5. Implement ReplayBuffer
Below is a naive implementation of ReplayBuffer.

In [5]:
# utils.py

import random
from typing import NamedTuple

import torch
import torchvision.transforms as T


# one step of interaction with environment
class Step(NamedTuple):
    state: torch.Tensor
    action: int
    reward: float
    nextState: torch.Tensor


# replay buffer
class ReplayBuffer(object):
    def __init__(self, capacity):
        # capacity: max size of replay buffer

        self.capacity = capacity
        self.memory = []
        self.index = 0

    def push(self, step):
        # add a step to buffer
        # step: one step of interaction

        if len(self.memory) < self.capacity:
            self.memory.append(step)
        else:
            self.memory[self.index] = step

        self.index = (self.index + 1) % self.capacity

    def sample(self, size):
        # collect batch of given size
        # size: batch size

        return random.sample(self.memory, size)
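
The same ring-buffer behaviour can be obtained from the standard library. The following is an alternative sketch, not the implementation used in this tutorial: `DequeReplayBuffer` is a hypothetical class that relies on `collections.deque` with `maxlen` to drop the oldest step automatically.

```python
import random
from collections import deque


class DequeReplayBuffer:
    def __init__(self, capacity):
        # A deque with maxlen discards the oldest item when it is full.
        self.memory = deque(maxlen=capacity)

    def push(self, step):
        self.memory.append(step)

    def sample(self, size):
        # Copy to a list so that random.sample gets a plain sequence.
        return random.sample(list(self.memory), size)

    def __len__(self):
        return len(self.memory)


buffer = DequeReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(step)

print(list(buffer.memory))  # [2, 3, 4]
print(buffer.sample(2))
```

In both versions `random.sample` raises `ValueError` when the requested batch is larger than the number of stored steps, so the buffer has to hold at least one batch before the first update.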


### Step 2.6. Implement preprocessing of frames
We also need some preprocessing of raw frames:

- trimming and resizing the original frame of size (210, 160) to (84, 84) to reduce memory usage
- converting to grayscale for simplicity

In [6]:
# utils.py

def preprocess(x):
    # preprocess frame
    # x: a frame of size 210x160
    
    # resize, grayscale and convert to tensor
    transform = T.Compose([
        T.ToPILImage(),
        T.Resize(84),
        T.Grayscale(),
        T.ToTensor()
    ])

    return transform(x[50:, :, :]).unsqueeze(0)
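
The crop `x[50:, :, :]` is what makes the result square. The sketch below reproduces the shape arithmetic only, assuming the documented behaviour of `T.Resize` with an integer size (the shorter edge is matched to that size and the aspect ratio is kept); `resized_shape` is an illustrative helper, not a torchvision function.

```python
def resized_shape(height, width, size):
    # Shape after matching the shorter edge to `size`, keeping the aspect ratio.
    if height <= width:
        return size, int(size * width / height)
    return int(size * height / width), size


cropped = (210 - 50, 160)            # x[50:, :, :] drops the top 50 rows
print(resized_shape(*cropped, 84))   # (84, 84)
print(resized_shape(210, 160, 84))   # (110, 84)
```

Without the crop, the frame would come out as 110x84 and no longer match the (4, 84, 84) input expected by the Q-Network.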


### Step 2.7. Implement training loop
Finally, we're ready to implement the main training loop, but there are still some points to consider.

#### State as stacked frames
To capture the movement between frames, it is beneficial to use a stack of several frames as a state, instead of a single frame.
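
The stacking logic can be sketched with strings standing in for frames. This is an illustration only: `initial_state` and `next_state` are hypothetical helpers, and a stack of 4 frames is assumed, matching the input shape of the Q-Network.

```python
from collections import deque

N_STACK = 4  # number of frames per state


def initial_state(first_frame):
    # At the start of an episode the first frame is repeated.
    return deque([first_frame] * N_STACK, maxlen=N_STACK)


def next_state(state, new_frame):
    # Copy the stack, then append; maxlen drops the oldest frame.
    stacked = deque(state, maxlen=N_STACK)
    stacked.append(new_frame)
    return stacked


s0 = initial_state('f0')
s1 = next_state(s0, 'f1')
s2 = next_state(s1, 'f2')
print(list(s0))  # ['f0', 'f0', 'f0', 'f0']
print(list(s1))  # ['f0', 'f0', 'f0', 'f1']
print(list(s2))  # ['f0', 'f0', 'f1', 'f2']
```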

#### Frame skip
Deciding an action at every frame may be too frequent, because even an expert (human player) can't decide actions so quickly. In addition, the state scarcely changes between adjacent frames, and too many decisions can make training unstable, so it can be helpful to skip some frames and repeat the previous action during them.
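
A minimal sketch of this action-repeat scheme, with a dummy policy instead of the agent: `rollout_actions` and `counting_policy` are names invented for this example, and a skip of 4 frames is assumed.

```python
FRAME_SKIP = 4  # assumed number of frames per decision


def rollout_actions(policy, n_frames, frame_skip=FRAME_SKIP):
    # Ask the policy only every `frame_skip` frames and repeat in between.
    actions = []
    action = None
    for t in range(n_frames):
        if t % frame_skip == 0:
            action = policy(t)
        actions.append(action)
    return actions


decisions = []  # frames at which the policy was actually queried


def counting_policy(t):
    # Dummy policy: returns 0, 1, 2, ... on successive calls.
    decisions.append(t)
    return len(decisions) - 1


actions = rollout_actions(counting_policy, n_frames=10)
print(decisions)  # [0, 4, 8]
print(actions)    # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
```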

#### Training skip
Similar to frame skip, training at every frame can be too frequent, so we train the model only once every few frames.

#### Initial waiting
It is better not to start training until enough experiences are saved in the ReplayBuffer, to prevent overfitting.
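
Training skip and initial waiting together determine at which steps an update happens. The sketch below lists those steps for small, made-up values; `training_steps` is a hypothetical helper, and the step counter is assumed to start at 1 and to be checked after each environment step.

```python
def training_steps(total_steps, initial_wait, train_freq):
    # Steps at which one model update would be performed.
    steps = []
    for t in range(1, total_steps + 1):
        if t < initial_wait:
            continue              # initial waiting: only collect experience
        if t % train_freq == 0:
            steps.append(t)       # training skip: update every train_freq steps
    return steps


print(training_steps(total_steps=20, initial_wait=10, train_freq=4))  # [12, 16, 20]
```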

The code below is a full implementation of the training process.

In [None]:
import argparse

import gym
import torch
import torch.optim as optim


def run(args):
    # setup
    env = gym.make('Breakout-v0')

    nAction = env.action_space.n
    buffer = ReplayBuffer(args.buffer_size)

    Q = QNet(nAction)
    QTarget = QNet(nAction)

    if args.model_path is not None:
        state_dict = None
        if isLocal:
            state_dict = torch.load(args.model_path, map_location='cpu')
        else:
            state_dict = torch.load(args.model_path)

        Q.load_state_dict(state_dict)
        QTarget.load_state_dict(state_dict)

    Q.train()
    QTarget.eval()

    if not isLocal:
        Q = Q.cuda()
        QTarget = QTarget.cuda()

    opt = optim.Adam(Q.parameters(), lr=args.lr)

    agent = Agent(nAction, Q)
    trainer = Trainer(Q, QTarget, opt, args)

    t = 0
    action = env.action_space.sample()

    for episode in range(args.episode):
        print("episode: %d\n" % (episode + 1))

        observation = env.reset()
        state = torch.cat([preprocess(observation)] * 4, 1)  # initial state
        sum_reward = 0

        # Exploration loop
        done = False
        while not done:
            if isLocal:
                env.render()

            # frame skip
            if t % args.frame_skip == 0:
                action = agent.getAction(state, args.eps)

            # take action and calc next state
            observation, reward, done, _ = env.step(action)
            nextState = torch.cat([state.narrow(1, 1, 3), preprocess(observation)], 1)

            buffer.push(Step(state, action, reward, nextState))
            state = nextState
            sum_reward += reward
            t += 1

            # initial waiting
            if t < args.initial_wait:
                continue

            # update model
            if t % args.train_freq == 0:
                batch = buffer.sample(args.batch)
                trainer.update(batch)

            # update target
            if t % args.target_update_freq == 0:
                QTarget.load_state_dict(Q.state_dict())

        print("  reward %f\n" % sum_reward)

        if episode % args.snapshot_freq == 0:
            torch.save(Q.state_dict(), "results/%d.pth" % (episode + 1))
            print("  model saved")

    torch.save(Q.state_dict(), "results/model.pth")
    env.close()


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--episode', type=int, default=12000)
    parser.add_argument('--buffer_size', type=int, default=400000)
    parser.add_argument('--train_freq', type=int, default=4)
    parser.add_argument('--initial_wait', type=int, default=20000)
    parser.add_argument('--batch', type=int, default=32)
    parser.add_argument('--target_update_freq', type=int, default=10000)
    parser.add_argument('--lr', type=float, default=0.0003)
    parser.add_argument('--frame_skip', type=int, default=4)
    parser.add_argument('--snapshot_freq', type=int, default=1000)
    parser.add_argument('--eps', type=float, default=0.05)
    parser.add_argument('--gamma', type=float, default=0.99)
    parser.add_argument('--model_path', type=str)

    args = parser.parse_args(args=[])
    run(args)


episode: 1

  reward 1.000000

  model saved
episode: 2

  reward 2.000000

episode: 3

  reward 0.000000

episode: 4

  reward 2.000000

episode: 5

  reward 0.000000

episode: 6

  reward 0.000000

episode: 7

  reward 0.000000

episode: 8

  reward 0.000000

episode: 9



RecursionError: maximum recursion depth exceeded while calling a Python object

## Step 3. Training in remote environment
Once you have confirmed that the training loop runs correctly in the local environment, it's time to train the model in the remote environment with a GPU.

### Step 3.1. Bring code to remote
You can use your own source repository, or run the following command:

```
$ 
```