# DQN Implementation

### Overview
The paper introduces the Deep Q-Network (DQN) algorithm, which combines Q-learning with deep neural networks. The key contributions of the paper are:

1. Q-Learning with Function Approximation: Using a convolutional neural network (CNN) to approximate the Q-function, i.e. the expected discounted future reward of taking a given action in a given state.
2. Experience Replay: Storing the agent's experiences in a replay buffer and sampling random mini-batches from it to break the correlation between consecutive samples.
3. Fixed Q-Target: Using a separate target network, which is updated less frequently than the main Q-network, to generate stable target Q-values.

In [None]:
import numpy as np
import gymnasium as gym
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.animation as animation
from IPython.display import HTML, clear_output
import cv2
import random
import string

### Preprocess Function

From the DQN paper:
*The raw frames are preprocessed by first converting their RGB representation to gray-scale
…
The final input representation is obtained by cropping an 84 × 84 region of the image that roughly captures the playing area.*
* preprocess: Crops the frame to 84 × 84, converts it to grayscale, and scales the pixel values to [0, 1]. This reduces the complexity of the input and speeds up training.
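
The same three steps can be sketched without OpenCV. This is only an illustration: it assumes a 96 × 96 RGB frame stored as nested lists, uses the standard luma weights (0.299, 0.587, 0.114) for the RGB-to-gray conversion, and `preprocess_py` is a hypothetical helper name, not part of this notebook.

```python
def preprocess_py(frame):
    """Crop rows 0..83 and columns 6..89, convert RGB to gray, scale to [0, 1]."""
    out = []
    for row in frame[:84]:
        out.append([(0.299 * r + 0.587 * g + 0.114 * b) / 255.0
                    for (r, g, b) in row[6:90]])
    return out

# Assumed input: a 96 x 96 RGB frame, here filled with one mid-gray pixel value.
frame = [[(128, 128, 128)] * 96 for _ in range(96)]
gray = preprocess_py(frame)
print(len(gray), len(gray[0]))   # 84 84
```

Every output value lies in [0, 1], which keeps the network input on a small, consistent scale.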

In [None]:
def preprocess(img):
    img = img[:84, 6:90]
    img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY) / 255.0
    return img

### Animate Function
* animate: Creates a video from a sequence of images and saves it. If _return is True, it returns an IPython Video object for display.


In [None]:
def animate(imgs, video_name=None, _return=True):
    # Fall back to a random file name when none is given.
    if video_name is None:
        video_name = ''.join(random.choice(string.ascii_letters) for _ in range(18)) + '.webm'
    height, width, layers = imgs[0].shape
    fourcc = cv2.VideoWriter_fourcc(*'VP90')
    video = cv2.VideoWriter(video_name, fourcc, 10, (width, height))
    for img in imgs:
        # Rendered frames are RGB, while OpenCV writes BGR.
        img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
        video.write(img)
    video.release()
    if _return:
        from IPython.display import Video
        return Video(video_name)

### Evaluate Function
* evaluate: Evaluates the agent's performance over multiple episodes and returns the average score. This helps in tracking the learning progress.

In [None]:
def evaluate(agent, n_evals=5):
    eval_env = gym.make('CarRacing-v2', continuous=False)
    eval_env = ImageEnv(eval_env)
    scores = 0
    for _ in range(n_evals):
        (s, _), done, ret = eval_env.reset(), False, 0
        while not done:
            a = agent.act(s, training=False)
            s_prime, r, terminated, truncated, _ = eval_env.step(a)
            s = s_prime
            ret += r
            done = terminated or truncated
        scores += ret
    return np.round(scores / n_evals, 4)

# Image Environment Wrapper
A wrapper around the environment that:
* reset: Resets the environment, performs no-ops, preprocesses the state, and stacks frames.
* step: Executes the action, skips frames, preprocesses the state, and stacks frames. <br>

*For the experiments in this paper, the function $\phi$ from algorithm 1 applies this preprocessing to the last 4 frames of a history and stacks them to produce the input to the Q-function <br>
… <br>
we also use a simple frame-skipping technique [3]. More precisely, the agent sees and selects actions on every kth frame instead of every frame, and its last action is repeated on skipped frames.* <br>

#### Key Points
* Skipping Frames: The chosen action is repeated for skip_frames environment steps and the rewards are summed, so the agent observes and acts only on every skip_frames-th frame. This reduces the computational load and makes training more efficient.
* Stacking Frames: The last stack_frames preprocessed frames are stacked to form a single state. This lets the agent infer temporal information, such as motion, that a single frame cannot show.
* Initial No-op Actions: A fixed number of no-op actions (initial_no_op, 50 by default) is performed after each reset, so the agent starts from a later frame rather than the very first one.
* Preprocessing: Each observation is cropped, converted to grayscale, and scaled to [0, 1] before it is added to the stack.
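
The interplay of frame skipping and frame stacking can be sketched with a toy environment. `CountingEnv` and `skip_and_stack` are hypothetical names used only for this illustration; the observation is simply the step index, so the contents of the stack are easy to read.

```python
from collections import deque

class CountingEnv:
    """Hypothetical stand-in environment: the observation is the step index."""
    def __init__(self, episode_len=10):
        self.t = 0
        self.episode_len = episode_len

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.episode_len
        return self.t, 1.0, terminated, False, {}

def skip_and_stack(env, stack, action, skip_frames=4):
    """Repeat `action` for up to `skip_frames` steps, sum the rewards,
    and push only the last observation onto the frame stack."""
    total_reward = 0.0
    for _ in range(skip_frames):
        obs, r, terminated, truncated, info = env.step(action)
        total_reward += r
        if terminated or truncated:
            break
    stack.append(obs)  # deque(maxlen=k) drops the oldest frame automatically
    return list(stack), total_reward, terminated, truncated

env = CountingEnv()
stack = deque([0, 0, 0, 0], maxlen=4)   # initial state: first frame repeated
state, reward, terminated, truncated = skip_and_stack(env, stack, action=0)
print(state, reward)   # [0, 0, 0, 4] 4.0
```

One agent step therefore consumes four environment frames but adds only one frame to the state.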

In [None]:
# ImageEnv Wrapper
class ImageEnv(gym.Wrapper):
    def __init__(self, env, skip_frames=4, stack_frames=4, initial_no_op=50, **kwargs):
        super(ImageEnv, self).__init__(env, **kwargs)
        self.initial_no_op = initial_no_op
        self.skip_frames = skip_frames
        self.stack_frames = stack_frames

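    def reset(self, **kwargs):
        # Forward seed/options so the wrapper matches the Gymnasium reset API.
        s, info = self.env.reset(**kwargs)
        # Take no-op actions (action 0) before handing control to the agent.
        for _ in range(self.initial_no_op):
            s, _, terminated, truncated, _ = self.env.step(0)
            if terminated or truncated:
                s, info = self.env.reset()
        s = preprocess(s)
        # The initial state repeats the first frame: shape (stack_frames, 84, 84).
        self.stacked_state = np.tile(s, (self.stack_frames, 1, 1))
        return self.stacked_state, info

    def step(self, action):
        reward = 0
        # Repeat the action for skip_frames steps and accumulate the reward.
        for _ in range(self.skip_frames):
            s, r, terminated, truncated, info = self.env.step(action)
            reward += r
            if terminated or truncated:
                break
        s = preprocess(s)
        # Drop the oldest frame and append the newest one.
        self.stacked_state = np.concatenate((self.stacked_state[1:], s[np.newaxis]), axis=0)
        return self.stacked_state, reward, terminated, truncated, info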

# The Q-Network
From the DQN paper:
*The input to the neural network consists of an 84 × 84 × 4 image produced by $\phi$. The first hidden layer convolves 32 8 × 8 filters with stride 4 with the input image and applies a rectifier nonlinearity [10, 18]. The second hidden layer convolves 64 4 × 4 filters with stride 2, and the third hidden layer consists of 64 3 × 3 filters with stride 1 again followed by a rectifier nonlinearity. The final hidden layer is fully-connected and consists of 512 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action.*
* CNNActionValue: A convolutional neural network for approximating the Q-function.
    * forward: Defines the forward pass of the network.
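
The input size of the first fully-connected layer, `64 * 7 * 7`, follows from the three convolutions quoted above. A small sketch of the arithmetic, assuming no padding (the default for `nn.Conv2d`); `conv_out` is a hypothetical helper name:

```python
def conv_out(size, kernel, stride):
    """Output width of a convolution with no padding: floor((size - kernel) / stride) + 1."""
    return (size - kernel) // stride + 1

size = 84
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)
    print(size)            # 20, then 9, then 7

flat_features = 64 * size * size
print(flat_features)       # 3136
```

If the input resolution or a kernel size changes, this number changes too and the first linear layer has to be resized accordingly.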


In [None]:
# CNNActionValue class
class CNNActionValue(nn.Module):
    def __init__(self, state_dim, action_dim, activation=F.relu):
        super(CNNActionValue, self).__init__()
        self.conv1 = nn.Conv2d(state_dim, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.fc1 = nn.Linear(64 * 7 * 7, 512)
        self.fc2 = nn.Linear(512, action_dim)
        self.activation = activation

    def forward(self, x):
        x = self.activation(self.conv1(x))
        x = self.activation(self.conv2(x))
        x = self.activation(self.conv3(x))
        x = x.view(x.size(0), -1)
        x = self.activation(self.fc1(x))
        x = self.fc2(x)
        return x

# Replay Buffer
This code defines a ReplayBuffer class, which is used in reinforcement learning (RL) to store and sample experiences. 

### Class Initialization (__init__ method)
##### Parameters:

* state_dim: The dimensions of the state space.
* action_dim: The dimensions of the action space.
* max_size: The maximum size of the replay buffer, default is 100,000.

##### Attributes:

* self.s: A numpy array to store states with shape (max_size, *state_dim).
* self.a: A numpy array to store actions with shape (max_size, *action_dim).
* self.r: A numpy array to store rewards with shape (max_size, 1).
* self.s_prime: A numpy array to store next states with shape (max_size, *state_dim).
* self.terminated: A numpy array to store terminal flags (indicating episode termination) with shape (max_size, 1).
* self.ptr: A pointer to indicate the current position to insert the new experience.
* self.size: The current size of the buffer (number of stored experiences).
* self.max_size: The maximum capacity of the buffer.

### Updating the Buffer (update method)

##### Parameters:

* s: The current state.
* a: The action taken.
* r: The reward received.
* s_prime: The next state.
* terminated: A flag indicating whether the episode has terminated.

##### Functionality:

* Store the experience (s, a, r, s_prime, terminated) at the current position indicated by self.ptr.
* Increment the pointer (self.ptr) and wrap around if it exceeds max_size (circular buffer).
* Update the size of the buffer (self.size) ensuring it doesn't exceed max_size.

### Sampling from the Buffer (sample method)
##### Parameter:

* batch_size: The number of experiences to sample.

##### Functionality:

* Randomly sample batch_size indices (with replacement) from the range [0, self.size).
* Retrieve the experiences (states, actions, rewards, next states, and terminal flags) corresponding to these indices.
* Convert the sampled experiences to PyTorch tensors for use in training neural networks.

Replay Buffer stores the agent's experiences and allows for efficient sampling of random mini-batches, which is essential for breaking the correlation between consecutive experiences and stabilizing training. <br>
Reference:  https://github.com/sfujim/TD3/blob/master/utils.py#L5
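
The pointer and size bookkeeping described above can be seen in isolation with a tiny list-based buffer. `TinyRingBuffer` is a hypothetical simplification for illustration only; it stores single items instead of NumPy arrays of transitions.

```python
import random

class TinyRingBuffer:
    """Hypothetical list-based buffer showing only the pointer and size logic."""
    def __init__(self, max_size):
        self.data = [None] * max_size
        self.ptr = 0
        self.size = 0
        self.max_size = max_size

    def update(self, item):
        self.data[self.ptr] = item                      # overwrites the oldest slot once full
        self.ptr = (self.ptr + 1) % self.max_size       # wrap around
        self.size = min(self.size + 1, self.max_size)   # cap at capacity

    def sample(self, batch_size):
        # Indices are drawn uniformly, with replacement, from [0, size).
        return [self.data[random.randrange(self.size)] for _ in range(batch_size)]

buf = TinyRingBuffer(max_size=3)
for t in range(5):
    buf.update(t)
print(buf.data, buf.ptr, buf.size)   # [3, 4, 2] 2 3
```

After five insertions into a buffer of capacity three, the two oldest items (0 and 1) have been overwritten, and sampling only ever returns items that are still stored.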

In [1]:
# ReplayBuffer class
class ReplayBuffer:
    def __init__(self, state_dim, action_dim, max_size=int(1e5)):
        self.s = np.zeros((max_size, *state_dim), dtype=np.float32)
        self.a = np.zeros((max_size, *action_dim), dtype=np.int64)
        self.r = np.zeros((max_size, 1), dtype=np.float32)
        self.s_prime = np.zeros((max_size, *state_dim), dtype=np.float32)
        self.terminated = np.zeros((max_size, 1), dtype=np.float32)
        self.ptr = 0
        self.size = 0
        self.max_size = max_size

    def update(self, s, a, r, s_prime, terminated):
        self.s[self.ptr] = s
        self.a[self.ptr] = a
        self.r[self.ptr] = r
        self.s_prime[self.ptr] = s_prime
        self.terminated[self.ptr] = terminated
        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

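    def sample(self, batch_size):
        # Sample indices uniformly at random (with replacement) from the filled part of the buffer.
        ind = np.random.randint(0, self.size, batch_size)
        # Actions are integer indices, so they are returned as a LongTensor.
        return (torch.FloatTensor(self.s[ind]), torch.LongTensor(self.a[ind]),
                torch.FloatTensor(self.r[ind]), torch.FloatTensor(self.s_prime[ind]),
                torch.FloatTensor(self.terminated[ind]))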

# DQN Class

The DQN class encapsulates the components and functionalities required for a DQN agent. <br>

### Initialization (__init__ method) 
The constructor initializes various parameters and objects needed for the DQN agent.

##### Parameters:
* state_dim: Dimensions of the state space.
* action_dim: Number of possible actions.
* lr: Learning rate for the optimizer. Default is 0.00025.
* epsilon: Initial exploration rate. Default is 1.0.
* epsilon_min: Minimum exploration rate. Default is 0.1.
* gamma: Discount factor for future rewards. Default is 0.99.
* batch_size: Number of experiences to sample for each learning step. Default is 32.
* warmup_steps: Number of steps to collect experiences before starting the learning process. Default is 5000.
* buffer_size: Maximum size of the replay buffer. Default is 100000.
* target_update_interval: Number of steps between updates of the target network. Default is 10000.

##### Neural Networks:
* self.network: Main Q-network used for action selection.
* self.target_network: Target Q-network used for stable Q-value estimation.<br>
Both networks are instances of CNNActionValue and have the same architecture.

##### Action Selection (act method)
* Selects an action based on the current policy.
* Uses an ε-greedy strategy during training: with probability epsilon (and always during the first warmup_steps steps) it selects a random action; otherwise it selects the action with the highest Q-value predicted by the network.
* For the greedy action, converts the state to a PyTorch tensor, adds a batch dimension, and passes it through the network.
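
A minimal sketch of the ε-greedy rule on a plain list of Q-values, leaving out the warmup phase and the network. `epsilon_greedy` is a hypothetical helper and the Q-values are made up for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, otherwise the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

q_values = [0.1, 0.7, 0.3, -0.2, 0.0]         # made-up Q-values for five actions
print(epsilon_greedy(q_values, epsilon=0.0))  # 1 (always greedy)
```

With epsilon=1.0 the same call ignores the Q-values entirely and returns a uniformly random action index.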

##### Learning (learn method)
* Samples a batch of experiences from the replay buffer.
* Computes the target Q-values using the target network.
* Computes the loss between the predicted Q-values (from the main network) and the target Q-values.
* Performs a gradient descent step to minimize the loss.
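
Written out, the update that these steps describe takes the following form for a sampled mini-batch $B$. The notation is introduced here for illustration: $\theta$ denotes the main network's parameters, $\theta^-$ the target network's parameters, and $d \in \{0, 1\}$ the terminated flag.

$$
y = r + \gamma\,(1 - d)\,\max_{a'} Q_{\theta^-}(s', a'), \qquad
L(\theta) = \frac{1}{|B|} \sum_{(s, a, r, s', d) \in B} \bigl(Q_{\theta}(s, a) - y\bigr)^2
$$

The target $y$ is treated as a constant, so gradients flow only through $Q_{\theta}(s, a)$.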

##### Processing Transitions (process method)
* Increments the total number of steps.
* Adds the new transition to the replay buffer.
* If the total steps exceed the warmup steps, performs a learning step.
* Copies the main network's weights into the target network every target_update_interval steps.
* Decays epsilon linearly by (epsilon - epsilon_min) / 1e6 per step until it reaches epsilon_min.


Initialization: Sets up the main and target networks, optimizer, replay buffer, and various hyperparameters.<br>
Action Selection: Uses ε-greedy strategy to balance exploration and exploitation.<br>
Learning: Samples experiences, computes loss, and updates the main network.<br>
Processing Transitions: Manages transitions, triggers learning, updates target network, and decays epsilon.<br><br>
This structure helps the DQN agent learn effective policies by leveraging experience replay and target networks to stabilize training.
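
The two schedules managed by the process method can be sketched in closed form. `epsilon_at` and `syncs_target` are hypothetical helpers; the defaults mirror the values used in this notebook (decay over 1e6 steps, target update every 10000 steps).

```python
def epsilon_at(step, epsilon=1.0, epsilon_min=0.1, decay_steps=1_000_000):
    """Epsilon after `step` calls to process(): linear decay, then constant."""
    decay = (epsilon - epsilon_min) / decay_steps
    return max(epsilon_min, epsilon - decay * step)

def syncs_target(step, target_update_interval=10_000):
    """True on the steps where the target network copies the main network."""
    return step % target_update_interval == 0

for step in (0, 500_000, 1_000_000, 2_000_000):
    print(step, round(epsilon_at(step), 3))
# 0 1.0
# 500000 0.55
# 1000000 0.1
# 2000000 0.1
```

Note that epsilon also decays during the warmup phase, since the decay is applied on every processed transition.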

In [None]:
# DQN class
class DQN:
    def __init__(self, state_dim, action_dim, lr=0.00025, epsilon=1.0, epsilon_min=0.1, gamma=0.99, batch_size=32,
                 warmup_steps=5000, buffer_size=int(1e5), target_update_interval=10000):
        self.action_dim = action_dim
        self.epsilon = epsilon
        self.epsilon_min = epsilon_min
        self.epsilon_decay = (epsilon - epsilon_min) / 1e6
        self.gamma = gamma
        self.batch_size = batch_size
        self.warmup_steps = warmup_steps
        self.target_update_interval = target_update_interval
        self.total_steps = 0

        self.network = CNNActionValue(state_dim[0], action_dim)
        self.target_network = CNNActionValue(state_dim[0], action_dim)
        self.target_network.load_state_dict(self.network.state_dict())
        self.optimizer = torch.optim.RMSprop(self.network.parameters(), lr)
        self.buffer = ReplayBuffer(state_dim, (1,), buffer_size)
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.network.to(self.device)
        self.target_network.to(self.device)

    @torch.no_grad()
    def act(self, x, training=True):
        self.network.train(training)
        if training and (np.random.rand() < self.epsilon or self.total_steps < self.warmup_steps):
            return np.random.randint(0, self.action_dim)
        x = torch.from_numpy(x).float().unsqueeze(0).to(self.device)
        return torch.argmax(self.network(x)).item()

    def learn(self):
        s, a, r, s_prime, terminated = map(lambda x: x.to(self.device), self.buffer.sample(self.batch_size))
        next_q = self.target_network(s_prime).detach()
        td_target = r + (1. - terminated) * self.gamma * next_q.max(dim=1, keepdim=True).values
        loss = F.mse_loss(self.network(s).gather(1, a.long()), td_target)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return {'total_steps': self.total_steps, 'value_loss': loss.item()}

    def process(self, transition):
        result = {}
        self.total_steps += 1
        self.buffer.update(*transition)
        # Start learning only after the warmup phase has collected some experience.
        if self.total_steps > self.warmup_steps:
            result = self.learn()
        # Periodically copy the main network's weights into the target network.
        if self.total_steps % self.target_update_interval == 0:
            self.target_network.load_state_dict(self.network.state_dict())
        # Linear epsilon decay, floored at epsilon_min.
        self.epsilon = max(self.epsilon_min, self.epsilon - self.epsilon_decay)
        return result

# Main Training Loop
### Initialization:
* Creates the environment and wraps it with ImageEnv.
* Defines the state and action dimensions.
* Initializes the DQN agent.
* Sets the maximum steps and evaluation interval.
* Initializes a dictionary to store the training history.
* Resets the environment and gets the initial state.

In [None]:
# Initialize environment and agent
env = gym.make('CarRacing-v2', continuous=False)
env = ImageEnv(env)

state_dim = (4, 84, 84)
action_dim = env.action_space.n
agent = DQN(state_dim, action_dim)

max_steps = 1e6        # Maximum number of training steps.
eval_interval = 10000  # Number of steps between evaluations.

history = {'Step': [], 'AvgReturn': []}

(s, _) = env.reset()   # Reset the environment and store the initial state in s.

### Training Loop:
* The loop runs until the total steps exceed max_steps.
* The agent takes an action based on the current state.
* Steps the environment with the action, receiving the next state, reward, and done flags.
* Processes the transition.
* Updates the current state.
* If the episode ends (terminated or truncated), resets the environment.
* Every eval_interval steps, evaluates the agent, logs the results, plots the training progress, and saves the model.
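
One detail of this loop is easy to miss: the transition passed to the agent carries only the terminated flag, not truncated. A truncated episode (for example, one cut off by a time limit) therefore still bootstraps from the next state. A small numeric sketch, where `td_target` is a hypothetical helper and the numbers are made up:

```python
def td_target(r, next_q_max, terminated, gamma=0.99):
    """One-step target: bootstrap from the next state unless the episode terminated."""
    return r + (1.0 - float(terminated)) * gamma * next_q_max

# Made-up numbers: reward 1.0 and a next-state value estimate of 10.0.
print(td_target(1.0, 10.0, terminated=False))  # 10.9 (also used for truncated episodes)
print(td_target(1.0, 10.0, terminated=True))   # 1.0
```

Both flags still end the episode in the loop and trigger a reset of the environment.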

In [None]:
# Training
while agent.total_steps <= max_steps:
    a = agent.act(s)  # ε-greedy action for the current state
    s_prime, r, terminated, truncated, _ = env.step(a)
    # Store the transition and, after warmup, run one learning step.
    # Only `terminated` is stored, so truncated episodes still bootstrap from s_prime.
    result = agent.process((s, a, r, s_prime, terminated))
    s = s_prime
    if terminated or truncated:
        s, _ = env.reset()  # Start a new episode.
    # Every eval_interval steps: evaluate, record and plot the average return,
    # and save the network weights to dqn-CarRacing.pt.
    if agent.total_steps % eval_interval == 0:
        ret = evaluate(agent)
        history['Step'].append(agent.total_steps)
        history['AvgReturn'].append(ret)
        clear_output()
        plt.figure(figsize=(8, 5))
        plt.plot(history['Step'], history['AvgReturn'], 'b-')
        plt.xlabel('Step', fontsize=16)
        plt.ylabel('AvgReturn', fontsize=16)
        plt.xticks(fontsize=14)
        plt.yticks(fontsize=14)
        plt.grid(axis='y')
        plt.show()
        torch.save(agent.network.state_dict(), 'dqn-CarRacing.pt')

### Final Evaluation and Animation
* Creates a new evaluation environment with rendering enabled.
* Runs an evaluation episode, rendering each frame.
* Stores each rendered frame.
* Uses the agent to take actions without exploration.
* Animates the frames and saves the video.

In [None]:
# Final evaluation and animation
# render_mode='rgb_array' makes render() return frames that can be saved as a video.
eval_env = gym.make('CarRacing-v2', continuous=False, render_mode='rgb_array')
eval_env = ImageEnv(eval_env)

frames = []  # Rendered frames for the animation.
(s, _), done, ret = eval_env.reset(), False, 0
while not done:
    frames.append(eval_env.render())
    a = agent.act(s, training=False)  # Greedy action, no exploration.
    s_prime, r, terminated, truncated, _ = eval_env.step(a)
    s = s_prime
    ret += r  # Accumulate the episode return.
    done = terminated or truncated

animate(frames, 'animation.webm')

##### Load Model, Continue Training or Test

In [None]:
# Load model function
def load_model(agent, model_path):
    """Load saved weights into the agent's main and target networks."""
    # map_location lets weights saved on a GPU load on a CPU-only machine.
    agent.network.load_state_dict(torch.load(model_path, map_location=agent.device))
    # Keep the target network in sync with the loaded weights.
    agent.target_network.load_state_dict(agent.network.state_dict())

In [None]:
# Initialize environment and agent
env = gym.make('CarRacing-v2', continuous=False)
env = ImageEnv(env)

state_dim = (4, 84, 84)
action_dim = env.action_space.n
agent = DQN(state_dim, action_dim)

# Load the model if it exists
model_path = 'dqn-CarRacing.pt'
try:
    load_model(agent, model_path)
    print("Model loaded successfully.")
except FileNotFoundError:
    print("Model file not found. Training from scratch.")

In [None]:
# continue training
max_steps = 1e6
eval_interval = 10000

history = {'Step': [], 'AvgReturn': []}

(s, _) = env.reset()
while True:
    a = agent.act(s)
    s_prime, r, terminated, truncated, _ = env.step(a)
    result = agent.process((s, a, r, s_prime, terminated))
    s = s_prime
    if terminated or truncated:
        s, _ = env.reset()

    if agent.total_steps % eval_interval == 0:
        ret = evaluate(agent)
        history['Step'].append(agent.total_steps)
        history['AvgReturn'].append(ret)
        clear_output(wait=True)
        plt.figure(figsize=(8, 5))
        plt.plot(history['Step'], history['AvgReturn'], 'b-')
        plt.xlabel('Step', fontsize=16)
        plt.ylabel('AvgReturn', fontsize=16)
        plt.xticks(fontsize=14)
        plt.yticks(fontsize=14)
        plt.grid(axis='y')
        plt.show()
        torch.save(agent.network.state_dict(), model_path)

    if agent.total_steps > max_steps:
        break


In [None]:
# Final evaluation and animation
eval_env = gym.make('CarRacing-v2', continuous=False, render_mode='rgb_array')
eval_env = ImageEnv(eval_env)

frames = []
(s, _), done, ret = eval_env.reset(), False, 0
while not done:
    frames.append(eval_env.render())
    a = agent.act(s, training=False)
    s_prime, r, terminated, truncated, _ = eval_env.step(a)
    s = s_prime
    ret += r
    done = terminated or truncated

animate(frames, 'animation.webm')