# **Deep Reinforcement Learning**

# M3-2 Deep Q-Networks

## Example of DQN implementation on Pong environment

Below we will see a simple example that will allow us to understand the concepts introduced in this module.

## Pong environment

The [Pong](https://gymnasium.farama.org/environments/atari/pong/) environment is part of the [Atari environments](https://gymnasium.farama.org/environments/atari/). Please read that page first for general information.

You control the right paddle and compete against the left paddle, which is controlled by the computer. Each player tries to deflect the ball away from their own goal and into the opponent's goal.

<center><img src="https://ale.farama.org/_images/pong.gif"/></center>

For more detailed documentation, see the [AtariAge page](https://atariage.com/manual_html_page.php?SoftwareLabelID=587).

First of all, we will load the environment. Note that in this case we load the older version of the environment, which belongs to the [Gym](https://github.com/openai/gym) framework (instead of [Gymnasium](https://gymnasium.farama.org/index.html)).

To install this environment, we need to execute the following command:
> pip install gym==0.25.0

The cell below installs it together with the Atari extras and the game ROMs.

In [30]:
import warnings
warnings.filterwarnings('ignore')

!pip install gym[atari]==0.25.0
!pip install autorom[accept-rom-license]



Once the dependencies are installed, we load them and initialize the `PongNoFrameskip-v4` environment.

There are several Pong environments, with minor differences among them. See the [Pong](https://gymnasium.farama.org/environments/atari/pong/) page for further details.

In [31]:
import gym

# version
print("Using Gym version {}".format(gym.__version__))

ENV_NAME = "PongNoFrameskip-v4" 
test_env = gym.make(ENV_NAME)

Using Gym version 0.25.0


Check the GPU model:

In [32]:
!nvidia-smi 

Thu Oct 24 11:09:31 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   74C    P0             29W /   70W |     271MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00

## Data preprocessing (Wrappers)

The original observations, provided by the environment, are:
- images of $210 \times 160$ in RGB colour
- Thus, we represent each observation using a numpy array of shape `(210, 160, 3)` and `dtype=uint8`

However, we have some problems:
1. The observations include **parts of the screen** that are not relevant.
2. The number of possible states is $256^{(210 \times 160 \times 3)} = 256^{100800}$! So **reducing the image size** could help.
3. The images of the environment are in **color (RGB)**, but does color really provide any information?
4. In a single image it is not possible to know the **dynamics of the game** (i.e. direction and speed of the ball). Therefore, we must consider a sequence of several consecutive images to understand what is happening in the game.
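
To get a feeling for the second problem, the sketch below compares the number of values per observation before and after preprocessing. It assumes that the preprocessed observation is a stack of four $84 \times 84$ grayscale frames (the shape produced by the wrappers in this notebook) and, purely for a like-for-like comparison, that each value still takes one of 256 levels. The helper name `log10_num_states` is hypothetical.

```python
import math

def log10_num_states(num_values, levels=256):
    """Return log10(levels ** num_values), the order of magnitude of the state count."""
    return num_values * math.log10(levels)

raw_values = 210 * 160 * 3      # one RGB frame from the environment
wrapped_values = 84 * 84 * 4    # four stacked 84x84 grayscale frames (assumed)

print(f"Raw:     {raw_values} values, about 10^{int(log10_num_states(raw_values))} states")
print(f"Wrapped: {wrapped_values} values, about 10^{int(log10_num_states(wrapped_values))} states")
print(f"Values per observation reduced by a factor of {raw_values / wrapped_values:.2f}")
```

Even after the reduction the state space is far too large for a table of Q-values, which is why a neural network is used to approximate them.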

We will use several wrappers to transform the observations. Specifically, we want to get (from the environment) observations with the following characteristics:
- Grayscale images
- Resolution $84 \times 84$
- Float images $\in [0,1]$
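
The three target characteristics can be illustrated on a single pixel without any image library. The sketch below uses the luminance weights (0.299, 0.587, 0.114) and the crop rows (18 to 102 of a 110-row resized image) that appear in the `ProcessFrame84` wrapper of this notebook. The function names are hypothetical, and the sample colour `(109, 118, 43)` is one of the RGB triplets printed for the standard environment further down.

```python
def to_gray(rgb):
    """Weighted sum of the R, G and B channels (luminance)."""
    r, g, b = rgb
    return r * 0.299 + g * 0.587 + b * 0.114

def to_unit_float(gray):
    """Truncate to an 8-bit integer level, then scale to [0, 1]."""
    return int(gray) / 255.0

pixel = (109, 118, 43)
gray = to_gray(pixel)
print(f"gray level: {gray:.3f}, scaled: {to_unit_float(gray):.4f}")

# The frame is resized to 110 rows x 84 columns, then rows 18..101 are kept.
resized_rows, crop_top, crop_bottom = 110, 18, 102
kept_rows = len(range(resized_rows)[crop_top:crop_bottom])
print(f"rows kept after cropping: {kept_rows}")
```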

<center><img src="./figs/preprocessing-1.jpg"/></center>

<center><img src="./figs/preprocessing-2.png"/></center>

In [33]:
# OpenAI Gym Wrappers
# Taken from 
# https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On/blob/master/Chapter06/lib/wrappers.py
import cv2
import numpy as np
import collections
import gym.spaces

class FireResetEnv(gym.Wrapper):
    def __init__(self, env=None):
        super(FireResetEnv, self).__init__(env)
        assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
        assert len(env.unwrapped.get_action_meanings()) >= 3

    def step(self, action):
        return self.env.step(action)

    def reset(self):
        self.env.reset()
        obs, _, done, _ = self.env.step(1)
        if done:
            self.env.reset()
        obs, _, done, _ = self.env.step(2)
        if done:
            self.env.reset()
        return obs

class MaxAndSkipEnv(gym.Wrapper):
    def __init__(self, env=None, skip=4):
        super(MaxAndSkipEnv, self).__init__(env)
        self._obs_buffer = collections.deque(maxlen=2)
        self._skip = skip

    def step(self, action):
        total_reward = 0.0
        done = None
        for _ in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            self._obs_buffer.append(obs)
            total_reward += reward
            if done:
                break
        max_frame = np.max(np.stack(self._obs_buffer), axis=0)
        return max_frame, total_reward, done, info

    def reset(self):
        self._obs_buffer.clear()
        obs = self.env.reset()
        self._obs_buffer.append(obs)
        return obs


class ProcessFrame84(gym.ObservationWrapper):
    def __init__(self, env=None):
        super(ProcessFrame84, self).__init__(env)
        self.observation_space = gym.spaces.Box(low=0, high=255, shape=(84, 84, 1), dtype=np.uint8)

    def observation(self, obs):
        return ProcessFrame84.process(obs)

    @staticmethod
    def process(frame):
        if frame.size == 210 * 160 * 3:
            img = np.reshape(frame, [210, 160, 3]).astype(np.float32)
        elif frame.size == 250 * 160 * 3:
            img = np.reshape(frame, [250, 160, 3]).astype(np.float32)
        else:
            assert False, "Unknown resolution."
        img = img[:, :, 0] * 0.299 + img[:, :, 1] * 0.587 + img[:, :, 2] * 0.114
        resized_screen = cv2.resize(img, (84, 110), interpolation=cv2.INTER_AREA)
        x_t = resized_screen[18:102, :]
        x_t = np.reshape(x_t, [84, 84, 1])
        return x_t.astype(np.uint8)


class BufferWrapper(gym.ObservationWrapper):
    def __init__(self, env, n_steps, dtype=np.float32):
        super(BufferWrapper, self).__init__(env)
        self.dtype = dtype
        old_space = env.observation_space
        self.observation_space = gym.spaces.Box(old_space.low.repeat(n_steps, axis=0),
                                                old_space.high.repeat(n_steps, axis=0), dtype=dtype)

    def reset(self):
        self.buffer = np.zeros_like(self.observation_space.low, dtype=self.dtype)
        return self.observation(self.env.reset())

    def observation(self, observation):
        self.buffer[:-1] = self.buffer[1:]
        self.buffer[-1] = observation
        return self.buffer


class ImageToPyTorch(gym.ObservationWrapper):
    def __init__(self, env):
        super(ImageToPyTorch, self).__init__(env)
        old_shape = self.observation_space.shape
        self.observation_space = gym.spaces.Box(low=0.0, high=1.0, shape=(old_shape[-1], 
                                old_shape[0], old_shape[1]), dtype=np.float32)

    def observation(self, observation):
        return np.moveaxis(observation, 2, 0)


class ScaledFloatFrame(gym.ObservationWrapper):
    def observation(self, obs):
        return np.array(obs).astype(np.float32) / 255.0


def make_env(env_name):
    env = gym.make(env_name)
    print("Standard Env.        : {}".format(env.observation_space.shape))
    env = MaxAndSkipEnv(env)
    print("MaxAndSkipEnv        : {}".format(env.observation_space.shape))
    env = FireResetEnv(env)
    print("FireResetEnv         : {}".format(env.observation_space.shape))
    env = ProcessFrame84(env)
    print("ProcessFrame84       : {}".format(env.observation_space.shape))
    env = ImageToPyTorch(env)
    print("ImageToPyTorch       : {}".format(env.observation_space.shape))
    env = BufferWrapper(env, 4)
    print("BufferWrapper        : {}".format(env.observation_space.shape))
    env = ScaledFloatFrame(env)
    print("ScaledFloatFrame     : {}".format(env.observation_space.shape))
    
    return env


def print_env_info(name, env):
    obs = env.reset()
    print("*** {} Environment ***".format(name))
    print("Observation shape: {}, type: {} and range [{},{}]".format(obs.shape, obs.dtype, np.min(obs), np.max(obs)))
    print("Observation sample:\n{}".format(obs))

The `make_env` function applies all the wrappers to the environment.
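
The frame stacking performed by `BufferWrapper` can be pictured as a fixed-length queue: each new frame pushes out the oldest one, so the agent always sees the most recent frames and can infer the ball's direction and speed. The following is a minimal sketch of that idea using `collections.deque` and string placeholders instead of images. It is not the wrapper's actual implementation (which shifts a NumPy array in place), and the class name `FrameStack` is hypothetical.

```python
import collections

class FrameStack:
    """Keep the last n_steps frames, oldest first."""

    def __init__(self, n_steps=4, empty_frame=0):
        self.n_steps = n_steps
        self.empty_frame = empty_frame
        self.frames = collections.deque(maxlen=n_steps)
        self.reset()

    def reset(self):
        # Start every episode with a buffer full of empty frames.
        self.frames.clear()
        self.frames.extend([self.empty_frame] * self.n_steps)

    def push(self, frame):
        # Appending to a full deque discards the oldest element.
        self.frames.append(frame)
        return list(self.frames)

stack = FrameStack(n_steps=4)
for t in range(1, 7):
    print(t, stack.push(f"frame{t}"))
```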

Let's compare the observations from the **standard** and **wrapped** environments.

In [34]:
# standard Env
env = gym.make(ENV_NAME)
print_env_info("Standard", env)

*** Standard Environment ***
Observation shape: (210, 160, 3), type: uint8 and range [0,228]
Observation sample:
[[[  0   0   0]
  [  0   0   0]
  [  0   0   0]
  ...
  [109 118  43]
  [109 118  43]
  [109 118  43]]

 [[109 118  43]
  [109 118  43]
  [109 118  43]
  ...
  [109 118  43]
  [109 118  43]
  [109 118  43]]

 [[109 118  43]
  [109 118  43]
  [109 118  43]
  ...
  [109 118  43]
  [109 118  43]
  [109 118  43]]

 ...

 [[ 53  95  24]
  [ 53  95  24]
  [ 53  95  24]
  ...
  [ 53  95  24]
  [ 53  95  24]
  [ 53  95  24]]

 [[ 53  95  24]
  [ 53  95  24]
  [ 53  95  24]
  ...
  [ 53  95  24]
  [ 53  95  24]
  [ 53  95  24]]

 [[ 53  95  24]
  [ 53  95  24]
  [ 53  95  24]
  ...
  [ 53  95  24]
  [ 53  95  24]
  [ 53  95  24]]]


In [35]:
# wrapped Env
env = make_env(ENV_NAME)
print_env_info("Wrapped", env)

Standard Env.        : (210, 160, 3)
MaxAndSkipEnv        : (210, 160, 3)
FireResetEnv         : (210, 160, 3)
ProcessFrame84       : (84, 84, 1)
ImageToPyTorch       : (1, 84, 84)
BufferWrapper        : (4, 84, 84)
ScaledFloatFrame     : (4, 84, 84)
*** Wrapped Environment ***
Observation shape: (4, 84, 84), type: float32 and range [0.0,0.6352941393852234]
Observation sample:
[[[0.         0.         0.         ... 0.         0.         0.        ]
  [0.         0.         0.         ... 0.         0.         0.        ]
  [0.         0.         0.         ... 0.         0.         0.        ]
  ...
  [0.         0.         0.         ... 0.         0.         0.        ]
  [0.         0.         0.         ... 0.         0.         0.        ]
  [0.         0.         0.         ... 0.         0.         0.        ]]

 [[0.         0.         0.         ... 0.         0.         0.        ]
  [0.         0.         0.         ... 0.         0.         0.        ]
  [0.         0.    

## Neural network architecture

The following code will implement the NN:

In [36]:
import torch
import torch.nn as nn        
import torch.optim as optim 

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

print("Using device: {}".format(device))

Using device: cuda


In [37]:
def make_DQN(input_shape, output_shape):
    net = nn.Sequential(
        nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
        nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2),
        nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64*7*7, 512),
        nn.ReLU(),
        nn.Linear(512, output_shape)
    )
    return net

test_env = make_env(ENV_NAME)
test_net = make_DQN(test_env.observation_space.shape, test_env.action_space.n).to(device)     

Standard Env.        : (210, 160, 3)
MaxAndSkipEnv        : (210, 160, 3)
FireResetEnv         : (210, 160, 3)
ProcessFrame84       : (84, 84, 1)
ImageToPyTorch       : (1, 84, 84)
BufferWrapper        : (4, 84, 84)
ScaledFloatFrame     : (4, 84, 84)
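
The input size `64*7*7` of the first linear layer in `make_DQN` follows from the spatial size of the feature maps after the three convolutions. The sketch below recomputes it with the usual output-size formula for a convolution without padding or dilation, which is what the layers above use; the helper name `conv_out` is hypothetical.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

size = 84  # height and width of each preprocessed frame
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)
    print(f"kernel={kernel}, stride={stride} -> {size}x{size}")

flattened = 64 * size * size  # 64 channels in the last convolution
print(f"flattened features: {flattened}")
```

If the input resolution or any kernel or stride changes, the first `nn.Linear` layer has to be resized accordingly.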


## Experience Replay and Target Network

First, we design a class to implement the **experience replay buffer**.

In [38]:
Experience = collections.namedtuple('Experience', field_names=['state', 'action', 'reward', 'done', 'new_state'])

class ExperienceReplay:
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def append(self, experience):
        self.buffer.append(experience)

    def sample(self, BATCH_SIZE):
        indices = np.random.choice(len(self.buffer), BATCH_SIZE, replace=False)
        states, actions, rewards, dones, next_states = zip(*[self.buffer[idx] for idx in indices])
        return np.array(states), np.array(actions), np.array(rewards, dtype=np.float32), \
               np.array(dones, dtype=np.uint8), np.array(next_states)
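
Two properties of this buffer matter for training: once it is full, the oldest transitions are discarded, and each batch is drawn uniformly without replacement. The sketch below reproduces both behaviours with the standard library only, using `random.sample` in place of `np.random.choice` and integers in place of real transitions. `TinyReplay` is a hypothetical name used only for this illustration.

```python
import collections
import random

class TinyReplay:
    def __init__(self, capacity):
        # A bounded deque drops the oldest item when a new one is appended.
        self.buffer = collections.deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def append(self, item):
        self.buffer.append(item)

    def sample(self, batch_size):
        # Uniform sampling without replacement.
        return random.sample(list(self.buffer), batch_size)

replay = TinyReplay(capacity=5)
for i in range(8):
    replay.append(i)

print(list(replay.buffer))  # only the five most recent items remain
batch = replay.sample(3)
print(batch)
```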

## Deep Q-Learning algorithm

Define the hyperparameters:

In [39]:
import time
import numpy as np
import collections


MEAN_REWARD_BOUND = 19.0 
NUMBER_OF_REWARDS_TO_AVERAGE = 10          

GAMMA = 0.99       

BATCH_SIZE = 32  
LEARNING_RATE = 1e-4           

EXPERIENCE_REPLAY_SIZE = 10000            
SYNC_TARGET_NETWORK = 1000     

EPS_START = 1.0
EPS_DECAY = 0.999985
EPS_MIN = 0.02
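
With a multiplicative decay applied once per frame, it is useful to know how long exploration lasts. The sketch below, which assumes exactly the update `epsilon = max(epsilon * EPS_DECAY, EPS_MIN)` used in the training loop, estimates the number of frames needed to reach `EPS_MIN`; it comes out at roughly 260,000 frames. The function name `epsilon_at` is hypothetical.

```python
import math

EPS_START = 1.0
EPS_DECAY = 0.999985
EPS_MIN = 0.02

def epsilon_at(frame):
    """Epsilon after `frame` multiplicative decay steps, floored at EPS_MIN."""
    return max(EPS_START * EPS_DECAY ** frame, EPS_MIN)

# Solve EPS_START * EPS_DECAY**n = EPS_MIN for n.
frames_to_min = math.ceil(math.log(EPS_MIN / EPS_START) / math.log(EPS_DECAY))
print(f"epsilon reaches {EPS_MIN} after about {frames_to_min} frames")

for frame in (1_000, 10_000, 100_000, frames_to_min):
    print(frame, round(epsilon_at(frame), 3))
```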

Define the agent:

In [40]:
class Agent:
    def __init__(self, env, exp_replay_buffer):
        self.env = env
        self.exp_replay_buffer = exp_replay_buffer
        self._reset()

    def _reset(self):
        self.current_state = self.env.reset()  # use the agent's own env, not a global
        self.total_reward = 0.0

    def step(self, net, epsilon=0.0, device="cpu"):
        done_reward = None
        if np.random.random() < epsilon:
            action = self.env.action_space.sample()  # random exploratory action
        else:
            state_ = np.array([self.current_state])
            state = torch.tensor(state_).to(device)
            q_vals = net(state)
            _, act_ = torch.max(q_vals, dim=1)
            action = int(act_.item())

        new_state, reward, is_done, _ = self.env.step(action)
        self.total_reward += reward

        exp = Experience(self.current_state, action, reward, is_done, new_state)
        self.exp_replay_buffer.append(exp)
        self.current_state = new_state
        if is_done:
            done_reward = self.total_reward
            self._reset()
        
        return done_reward
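
The action selection inside `Agent.step` is an epsilon-greedy policy: with probability `epsilon` the agent explores with a random action, otherwise it takes the action with the highest Q-value. The sketch below isolates that rule in plain Python, with a list of numbers standing in for the network output. The function name `epsilon_greedy` and the sample Q-values are hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.1, 0.7, 0.2]

# epsilon = 0: always the action with the largest Q-value (index 1 here).
print(epsilon_greedy(q_values, epsilon=0.0))

# epsilon = 1: purely random actions, so every index shows up.
rng = random.Random(0)
picks = [epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(1000)]
print({a: picks.count(a) for a in range(len(q_values))})
```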

## Training

Train a DQN model on the Pong environment using the parameters and architecture we have previously defined.

In [41]:
import wandb

# start a new wandb run to track this script
wandb.init(project="M3-2_Example_2a")

[34m[1mwandb[0m: Currently logged in as: [33mjcasasr[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [42]:
import datetime
print(">>> Training starts at ",datetime.datetime.now())

>>> Training starts at  2024-10-24 11:09:33.486951


Main loop:

In [43]:
# create Env
env = make_env(ENV_NAME)

# create Agent
net = make_DQN(env.observation_space.shape, env.action_space.n).to(device)
target_net = make_DQN(env.observation_space.shape, env.action_space.n).to(device)
 
buffer = ExperienceReplay(EXPERIENCE_REPLAY_SIZE)
agent = Agent(env, buffer)

epsilon = EPS_START
optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)
total_rewards = []
frame_number = 0  

while True:
    frame_number += 1
    epsilon = max(epsilon*EPS_DECAY, EPS_MIN)

    reward = agent.step(net, epsilon, device=device)
    if reward is not None:
        total_rewards.append(reward)

        mean_reward = np.mean(total_rewards[-NUMBER_OF_REWARDS_TO_AVERAGE:])
        print(f"Frame:{frame_number} | Total games:{len(total_rewards)} | Mean reward: {mean_reward:.3f}  (epsilon used: {epsilon:.2f})")
        # "reward_100" holds the mean over the last NUMBER_OF_REWARDS_TO_AVERAGE episodes
        wandb.log({"epsilon": epsilon, "reward_100": mean_reward, "reward": reward}, step=frame_number)

        if mean_reward > MEAN_REWARD_BOUND:
            print(f"SOLVED in {frame_number} frames and {len(total_rewards)} games")
            break

    if len(buffer) < EXPERIENCE_REPLAY_SIZE:
        continue

    batch = buffer.sample(BATCH_SIZE)
    states_, actions_, rewards_, dones_, next_states_ = batch

    states = torch.tensor(states_).to(device)
    next_states = torch.tensor(next_states_).to(device)
    actions = torch.tensor(actions_).to(device)
    rewards = torch.tensor(rewards_).to(device)
    dones = torch.BoolTensor(dones_).to(device)

    Q_values = net(states).gather(1, actions.unsqueeze(-1)).squeeze(-1)

    next_state_values = target_net(next_states).max(1)[0]
    next_state_values[dones] = 0
    next_state_values = next_state_values.detach()

    expected_Q_values = next_state_values * GAMMA + rewards
    loss = nn.MSELoss()(Q_values, expected_Q_values)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if frame_number % SYNC_TARGET_NETWORK == 0:
        target_net.load_state_dict(net.state_dict())

Standard Env.        : (210, 160, 3)
MaxAndSkipEnv        : (210, 160, 3)
FireResetEnv         : (210, 160, 3)
ProcessFrame84       : (84, 84, 1)
ImageToPyTorch       : (1, 84, 84)
BufferWrapper        : (4, 84, 84)
ScaledFloatFrame     : (4, 84, 84)
Frame:847 | Total games:1 | Mean reward: -21.000  (epsilon used: 0.99)
Frame:1690 | Total games:2 | Mean reward: -21.000  (epsilon used: 0.97)
Frame:2453 | Total games:3 | Mean reward: -21.000  (epsilon used: 0.96)
Frame:3216 | Total games:4 | Mean reward: -21.000  (epsilon used: 0.95)
Frame:4025 | Total games:5 | Mean reward: -21.000  (epsilon used: 0.94)
Frame:4816 | Total games:6 | Mean reward: -21.000  (epsilon used: 0.93)
Frame:5699 | Total games:7 | Mean reward: -21.000  (epsilon used: 0.92)
Frame:6462 | Total games:8 | Mean reward: -21.000  (epsilon used: 0.91)
Frame:7285 | Total games:9 | Mean reward: -21.000  (epsilon used: 0.90)
Frame:8076 | Total games:10 | Mean reward: -21.000  (epsilon used: 0.89)
Frame:8917 | Total games:11 |
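
The update inside the main loop regresses `Q(s, a)` towards the one-step target `r + GAMMA * max Q_target(s', a')`, dropping the bootstrap term for terminal transitions. The sketch below computes that target and the mean squared error for a hand-made batch in plain Python. All numbers are invented for illustration, and the function names `td_targets` and `mse` are hypothetical.

```python
GAMMA = 0.99

def td_targets(rewards, next_q_values, dones, gamma=GAMMA):
    """One-step targets: r + gamma * max_a' Q_target(s', a'), or just r if done."""
    targets = []
    for reward, next_q, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else max(next_q)
        targets.append(reward + gamma * bootstrap)
    return targets

def mse(predictions, targets):
    """Mean squared error between predicted Q-values and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

rewards = [0.0, 1.0, -1.0]
next_q_values = [[0.5, 1.0], [2.0, 0.0], [3.0, 4.0]]  # target network outputs
dones = [False, False, True]                          # last transition is terminal
q_taken = [1.0, 3.0, -1.0]                            # Q(s, a) for the actions taken

targets = td_targets(rewards, next_q_values, dones)
print(targets)
print(mse(q_taken, targets))
```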

Save the model:

In [44]:
torch.save(net.state_dict(), ENV_NAME + ".dat")

In [45]:
print(">>> Training ends at ",datetime.datetime.now())

>>> Training ends at  2024-10-24 12:15:13.886531


In [46]:
# Finish the wandb run, necessary in notebooks
wandb.finish()


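| Metric | History |
|---|---|
| epsilon | ██▇▆▆▅▅▅▅▅▄▄▄▄▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| reward | ▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▂▁▂▃▁▁▂▂▂▄█▇█▆█▇▇▇▇█▇▇█ |
| reward_100 | ▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▄▄▆▆▆▇▇██▇▇▇▇███████ |

| Metric | Last logged value |
|---|---|
| epsilon | 0.02 |
| reward | 19.0 |
| reward_100 | 19.4 |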
