# **Deep Reinforcement Learning**

# M3-2 Deep Q-Networks

## Example of DQN implementation on Pong environment (Part 2, testing)

The example below tests a trained DQN agent on Pong and illustrates the concepts introduced in this module.

### Pong environment

The [Pong](https://ale.farama.org/environments/pong/) environment is part of the [Arcade Learning Environment](https://ale.farama.org/environments/) (ALE). The ALE, commonly referred to as Atari, is a framework that allows researchers and hobbyists to develop AI agents for Atari 2600 ROMs.

Please read the Pong page first for general information.

You control the right paddle and compete against the left paddle, which is controlled by the computer. Each player tries to deflect the ball away from their own goal and into the opponent’s goal.

<center><img src="https://ale.farama.org/_images/pong.gif"/></center>

For more detailed documentation, see the [AtariAge page](https://atariage.com/manual_html_page.php?SoftwareLabelID=587).

First, we will load the environment. It's important to note that we are specifically using **version 1.0.0** of the **Gymnasium** library.

To install this version of the environment, run the following command:
> pip install gymnasium==1.0.0

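This installs Gymnasium and its core dependencies. The Atari environments additionally require the `ale-py` package, which is imported below as `ale_py`.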

In [1]:
import warnings
warnings.filterwarnings('ignore')

Once the dependencies are installed, we load them and initialize the `PongNoFrameskip-v4` environment.

There are several Pong environments, with minor differences among them. See the [Pong](https://ale.farama.org/environments/pong/) page for further details.

In [2]:
import gymnasium as gym
import ale_py

# version
print("Using Gymnasium version {}".format(gym.__version__))

gym.register_envs(ale_py)

ENV_NAME = "PongNoFrameskip-v4"
test_env = gym.make(ENV_NAME)

Using Gymnasium version 1.0.0


A.L.E: Arcade Learning Environment (version 0.10.1+6a7e0ae)
[Powered by Stella]


### Data preprocessing (Wrappers)

We need to apply the **same set of wrappers** used during the model's training phase to ensure that the inputs to the model are consistent in shape, format, and meaning.
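To make the idea of "meaning" concrete, here is a minimal plain-Python sketch of the frame-skipping idea behind `MaxAndSkipObservation`: the same action is repeated for several raw frames, the rewards are summed, and the returned observation is the pixel-wise maximum of the last two frames. The names `max_and_skip` and `toy_step` are hypothetical, and frames are represented as flat lists of pixel values rather than images; this is an illustration, not the library's implementation.

```python
def max_and_skip(step_fn, action, skip=4):
    """Repeat `action` for `skip` raw frames.

    `step_fn` is a hypothetical stand-in for the environment step and must
    return (frame, reward, done). Returns the pixel-wise max of the last two
    frames, the summed reward, and the final `done` flag.
    """
    last_two = []
    total_reward = 0.0
    done = False
    for _ in range(skip):
        frame, reward, done = step_fn(action)
        last_two = (last_two + [frame])[-2:]  # keep only the two newest frames
        total_reward += reward
        if done:
            break
    if len(last_two) == 2:
        max_frame = [max(a, b) for a, b in zip(*last_two)]
    else:
        max_frame = last_two[0]
    return max_frame, total_reward, done


# Toy "environment": four raw frames of two pixels each, reward 1.0 per frame
frames = iter([[0, 5], [1, 4], [9, 0], [2, 7]])

def toy_step(action):
    return next(frames), 1.0, False

obs, reward, done = max_and_skip(toy_step, action=0)
print(obs, reward, done)  # [9, 7] 4.0 False
```

An agent trained on observations produced this way sees one decision step per four raw frames, so a model evaluated without this wrapper would receive inputs with a different temporal meaning.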

In [3]:
import numpy as np
import gymnasium as gym
from gymnasium.wrappers import (
    MaxAndSkipObservation,
    ResizeObservation,
    GrayscaleObservation,
    FrameStackObservation,
    ReshapeObservation,
)


class ImageToPyTorch(gym.ObservationWrapper):
    """Move the channel axis first: (H, W, C) -> (C, H, W), as PyTorch expects."""

    def __init__(self, env):
        super().__init__(env)
        old_shape = self.observation_space.shape
        self.observation_space = gym.spaces.Box(low=0.0, high=1.0, shape=(old_shape[-1], old_shape[0], old_shape[1]), dtype=np.float32)

    def observation(self, observation):
        return np.moveaxis(observation, 2, 0)


class ScaledFloatFrame(gym.ObservationWrapper):
    """Convert pixel values to float32 and scale them from [0, 255] to [0.0, 1.0]."""

    def observation(self, obs):
        return np.array(obs).astype(np.float32) / 255.0


def make_env(env_name, render_mode=None):
    env = gym.make(env_name, render_mode=render_mode)
    print("Standard Env.        : {}".format(env.observation_space.shape))
    env = MaxAndSkipObservation(env, skip=4)
    print("MaxAndSkipObservation: {}".format(env.observation_space.shape))
    #env = FireResetEnv(env)
    env = ResizeObservation(env, (84, 84))
    print("ResizeObservation    : {}".format(env.observation_space.shape))
    env = GrayscaleObservation(env, keep_dim=True)
    print("GrayscaleObservation : {}".format(env.observation_space.shape))
    env = ImageToPyTorch(env)
    print("ImageToPyTorch       : {}".format(env.observation_space.shape))
    env = ReshapeObservation(env, (84, 84))
    print("ReshapeObservation   : {}".format(env.observation_space.shape))
    env = FrameStackObservation(env, stack_size=4)
    print("FrameStackObservation: {}".format(env.observation_space.shape))
    env = ScaledFloatFrame(env)
    print("ScaledFloatFrame     : {}".format(env.observation_space.shape))
    
    return env


env = make_env(ENV_NAME)

Standard Env.        : (210, 160, 3)
MaxAndSkipObservation: (210, 160, 3)
ResizeObservation    : (84, 84, 3)
GrayscaleObservation : (84, 84, 1)
ImageToPyTorch       : (1, 84, 84)
ReshapeObservation   : (84, 84)
FrameStackObservation: (4, 84, 84)
ScaledFloatFrame     : (4, 84, 84)
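The step from `(84, 84)` to `(4, 84, 84)` comes from frame stacking: the agent receives the four most recent frames so that it can infer the ball's direction and speed. The following is a minimal sketch of that idea in plain Python, assuming the stack is filled with the first frame on reset (the real wrapper's padding behavior is configurable). The `FrameStack` class here is a hypothetical illustration, not the Gymnasium implementation, and strings stand in for image frames.

```python
from collections import deque


class FrameStack:
    """Keep the last `size` frames; on reset, fill the stack with the first frame."""

    def __init__(self, size=4):
        self.frames = deque(maxlen=size)

    def reset(self, first_frame):
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Appending to a full deque drops the oldest frame automatically
        self.frames.append(new_frame)
        return list(self.frames)


stack = FrameStack(size=4)
after_reset = stack.reset("f0")
after_step = stack.step("f1")
print(after_reset)  # ['f0', 'f0', 'f0', 'f0']
print(after_step)   # ['f0', 'f0', 'f0', 'f1']
```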


In [4]:
def print_env_info(name, env):
    obs, _ = env.reset()
    print("*** {} Environment ***".format(name))
    print("Environment obs. : {}".format(env.observation_space.shape))
    print("Observation shape: {}, type: {} and range [{},{}]".format(obs.shape, obs.dtype, np.min(obs), np.max(obs)))
    print("Observation sample:\n{}".format(obs))

print_env_info("Wrapped", env)

*** Wrapped Environment ***
Environment obs. : (4, 84, 84)
Observation shape: (4, 84, 84), type: float32 and range [0.25882354378700256,0.7098039388656616]
Observation sample:
[[[0.25882354 0.25882354 0.25882354 ... 0.43137255 0.43137255 0.43137255]
  [0.43137255 0.43137255 0.43137255 ... 0.43137255 0.43137255 0.43137255]
  [0.43137255 0.43137255 0.43137255 ... 0.43137255 0.43137255 0.43137255]
  ...
  [0.3137255  0.3137255  0.3137255  ... 0.3137255  0.3137255  0.3137255 ]
  [0.3137255  0.3137255  0.3137255  ... 0.3137255  0.3137255  0.3137255 ]
  [0.3137255  0.3137255  0.3137255  ... 0.3137255  0.3137255  0.3137255 ]]

 [[0.25882354 0.25882354 0.25882354 ... 0.43137255 0.43137255 0.43137255]
  [0.43137255 0.43137255 0.43137255 ... 0.43137255 0.43137255 0.43137255]
  [0.43137255 0.43137255 0.43137255 ... 0.43137255 0.43137255 0.43137255]
  ...
  [0.3137255  0.3137255  0.3137255  ... 0.3137255  0.3137255  0.3137255 ]
  [0.3137255  0.3137255  0.3137255  ... 0.3137255  0.3137255  0.313725
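The float values in the sample appear to correspond to 8-bit grayscale intensities divided by 255, which is what `ScaledFloatFrame` computes. As a quick arithmetic check (the integer intensities below are inferred from the printed values, not read from the environment):

```python
def to_unit_interval(pixel):
    """Scale an 8-bit intensity (0..255) to the range [0.0, 1.0]."""
    return pixel / 255.0


# Candidate intensities inferred from the printed observation sample
for pixel in (66, 80, 110, 181):
    print(pixel, round(to_unit_interval(pixel), 8))
```

Under this assumption, the reported range `[0.2588..., 0.7098...]` corresponds to intensities 66 and 181.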

### Neural network architecture

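The following code selects the device and defines the network: three convolutional layers followed by two fully connected layers that output one Q-value per action.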

In [5]:
import torch
import torch.nn as nn
import torch.optim as optim

# Use the GPU when available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

Using device: cpu


In [6]:
def make_DQN(input_shape, output_shape):
    net = nn.Sequential(
        nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
        nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2),
        nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64*7*7, 512),
        nn.ReLU(),
        nn.Linear(512, output_shape)
    )
    return net
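The input size of the first linear layer, `64*7*7`, follows from the spatial size of the feature maps after the three convolutions. A short worked calculation, using the standard output-size formula for a convolution without padding (the helper name `conv_out` is hypothetical):

```python
def conv_out(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


size = 84  # height and width of the preprocessed frames
sizes = []
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)
    sizes.append(size)

print(sizes)             # [20, 9, 7]
print(64 * size * size)  # 3136 features after flattening
```

If the input resolution or any kernel or stride changes, this number changes too, and the `nn.Linear` input size has to be updated accordingly.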

Load the trained model saved in the previous notebook (`M3-2_Example_1a (DQN on Pong, train)`).

In [7]:
# params
model = ENV_NAME + ".dat"

env = make_env(ENV_NAME, render_mode="rgb_array")
net = make_DQN(env.observation_space.shape, env.action_space.n)
net.load_state_dict(torch.load(model, map_location=torch.device(device)))

Standard Env.        : (210, 160, 3)
MaxAndSkipObservation: (210, 160, 3)
ResizeObservation    : (84, 84, 3)
GrayscaleObservation : (84, 84, 1)
ImageToPyTorch       : (1, 84, 84)
ReshapeObservation   : (84, 84)
FrameStackObservation: (4, 84, 84)
ScaledFloatFrame     : (4, 84, 84)


<All keys matched successfully>

### Test

We play one episode and record its frames so that it can be saved as a video (.gif or .mp4).

In [8]:
from PIL import Image

# params
visualize = True
images = []

state, _ = env.reset()
total_reward = 0.0

while True:
    if visualize:
        # Store the rendered RGB frame so the episode can be exported later
        img = env.render()
        images.append(Image.fromarray(img))

    # Add a batch dimension: (4, 84, 84) -> (1, 4, 84, 84)
    state_ = torch.as_tensor(np.asarray(state)).unsqueeze(0)
    # Greedy policy: pick the action with the highest Q-value (no exploration at test time)
    with torch.no_grad():
        q_vals = net(state_).numpy()[0]
    action = int(np.argmax(q_vals))

    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    total_reward += reward
    if done:
        break

print("Total reward: %.2f" % total_reward)

Total reward: 21.00
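During testing the agent acts greedily: it always takes the action with the highest predicted Q-value, with no random exploration. A minimal sketch of that selection rule in plain Python (the function name `greedy_action` and the Q-values below are made up for illustration):

```python
def greedy_action(q_values):
    """Return the index of the largest Q-value (first index on ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q-values for the six Pong actions
q_values = [0.1, -0.3, 1.2, 0.7, 0.0, 0.4]
best = greedy_action(q_values)
print(best)  # 2
```

In Pong the agent receives +1 for each point it scores and -1 for each point it concedes, and the game ends when one side reaches 21 points, so a total reward of 21.00 corresponds to winning every point.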


Export the episode to a GIF file:

In [9]:
# params
gif_file = "video.gif"

# duration is the number of milliseconds each frame is displayed; 60 ms is roughly 17 frames per second
images[0].save(gif_file, save_all=True, append_images=images[1:], duration=60, loop=0)

print("Episode exported to '{}'".format(gif_file))

Episode exported to 'video.gif'
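Because `duration` is the display time of each frame in milliseconds, the playback frame rate is `1000 / duration`. A small sketch for converting between the two (the helper names are hypothetical):

```python
def fps_from_duration(duration_ms):
    """Frames per second for a given per-frame duration in milliseconds."""
    return 1000.0 / duration_ms


def duration_from_fps(fps):
    """Per-frame duration in whole milliseconds for a target frame rate."""
    return round(1000.0 / fps)


print(round(fps_from_duration(60), 1))  # 16.7
print(duration_from_fps(40))            # 25
```

So `duration=60` plays back at about 17 frames per second, while a target of 40 frames per second would need `duration=25`.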
