# **Deep Reinforcement Learning**

# M3-2 Deep Q-Networks

## Example of DQN implementation on Pong environment (Part 1, training)

Below we will see a simple example that will allow us to understand the concepts introduced in this module.

## Pong environment

The [Pong](https://ale.farama.org/environments/pong/) environment is one of the environments provided by the [Arcade Learning Environment](https://ale.farama.org/environments/) (ALE). ALE, commonly referred to as Atari, is a framework that allows researchers and hobbyists to develop AI agents for Atari 2600 ROMs.

Please read that page first for general information.

You control the right paddle and compete against the left paddle, which is controlled by the computer. Each player tries to deflect the ball away from their own goal and into the opponent's goal.

<center><img src="https://ale.farama.org/_images/pong.gif"/></center>

For more detailed documentation, see the [AtariAge page](https://atariage.com/manual_html_page.php?SoftwareLabelID=587).

First, we will load the environment. It's important to note that we are specifically using **version 1.0.0** of the **Gymnasium** library.

To install this version of the environment, run the following command:
> pip install gymnasium==1.0.0

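This installs Gymnasium together with its core dependencies. The cells below also import `ale_py`, `torch`, `torchsummary` and `wandb`, so these packages must be available in your environment as well.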

In [1]:
import warnings
warnings.filterwarnings('ignore')

Once the dependencies are installed, we load them and initialize the `PongNoFrameskip-v4` environment.

There are several Pong environments, with minor differences among them. See the [Pong](https://ale.farama.org/environments/pong/) page for further details.

In [2]:
import gymnasium as gym
import ale_py

# version
print("Using Gymnasium version {}".format(gym.__version__))

gym.register_envs(ale_py)

ENV_NAME = "PongNoFrameskip-v4"
test_env = gym.make(ENV_NAME)

Using Gymnasium version 1.0.0


A.L.E: Arcade Learning Environment (version 0.10.1+6a7e0ae)
[Powered by Stella]


In [3]:
print(test_env.unwrapped.get_action_meanings())

['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']


In [4]:
print(test_env.observation_space.shape)

(210, 160, 3)


## Data preprocessing (Wrappers)

The original observations, provided by the environment, are:
- images of $210 \times 160$ in RGB colour
- Thus, each observation is represented as a NumPy array of shape `(210, 160, 3)` and `dtype=uint8`

However, we have some problems:
1. The observations include **parts of the screen** that are not relevant.
2. The number of possible observations is $256^{(210 \times 160 \times 3)} = 256^{100800}$! So, **reducing the image** size could help!
3. The images of the environment are in **color (RGB)**, but does color really provide any information?
4. In a single image it is not possible to know the **dynamics of the game** (i.e. direction and speed of the ball). Therefore, we must consider a sequence of several consecutive images to understand what is happening in the game.
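
To get a feel for the size of this number, the short sketch below counts its decimal digits using logarithms instead of building the integer itself. The helper name `decimal_digits` is ours, introduced only for illustration. For comparison, it also evaluates a single $84 \times 84$ grayscale frame, which is still an astronomically large space, just a far smaller one.

```python
import math

def decimal_digits(num_values: int, levels: int = 256) -> int:
    """Number of decimal digits of levels ** num_values, computed via log10."""
    return math.floor(num_values * math.log10(levels)) + 1

raw_values = 210 * 160 * 3   # values in one RGB frame provided by the environment
small_values = 84 * 84       # values in one 84x84 grayscale frame

print(raw_values)                    # 100800
print(decimal_digits(raw_values))    # number of digits of 256 ** 100800
print(decimal_digits(small_values))  # number of digits of 256 ** 7056
```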

We will use several wrappers to transform the observations. Specifically, we want to get (from the environment) observations with the following characteristics:
- Grayscale images
- Resolution $84 \times 84$
- Float images $\in [0,1]$

<center><img src="./figs/preprocessing-1.jpg"/></center>

<center><img src="./figs/preprocessing-2.png"/></center>
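
As a rough illustration of these three transformations, here is a minimal NumPy-only sketch. It is not how the Gymnasium wrappers are implemented: the helper `preprocess_frame`, the luminance weights and the nearest-neighbour downsampling are assumptions chosen to keep the example self-contained, and the input frames are random noise rather than real Pong screens.

```python
import numpy as np

def preprocess_frame(frame: np.ndarray, size: int = 84) -> np.ndarray:
    """Hypothetical helper: RGB uint8 (H, W, 3) -> float32 (size, size) in [0, 1]."""
    # Grayscale as a weighted sum of the colour channels (ITU-R BT.601 weights).
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = frame.astype(np.float32) @ weights
    # Nearest-neighbour downsampling: pick `size` evenly spaced rows and columns.
    rows = np.linspace(0, gray.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, size).astype(int)
    small = gray[np.ix_(rows, cols)]
    # Scale pixel intensities from [0, 255] to [0, 1].
    return small / 255.0

rng = np.random.default_rng(0)
frames = [rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8) for _ in range(4)]

# Stack four consecutive preprocessed frames along a new leading axis.
stacked = np.stack([preprocess_frame(f) for f in frames])
print(stacked.shape, stacked.dtype)   # (4, 84, 84) float32
```

Stacking several frames is what lets the network infer the direction and speed of the ball, which a single image cannot convey.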

In [5]:
import numpy as np
import gymnasium
from gymnasium.wrappers import MaxAndSkipObservation, ResizeObservation, GrayscaleObservation, FrameStackObservation, ReshapeObservation


class ImageToPyTorch(gymnasium.ObservationWrapper):
    """Move the channel axis first: (H, W, C) -> (C, H, W), the layout PyTorch expects."""
    def __init__(self, env):
        super().__init__(env)
        old_shape = self.observation_space.shape
        self.observation_space = gym.spaces.Box(low=0.0, high=1.0, shape=(old_shape[-1], old_shape[0], old_shape[1]), dtype=np.float32)

    def observation(self, observation):
        return np.moveaxis(observation, 2, 0)


class ScaledFloatFrame(gym.ObservationWrapper):
    """Convert pixel values to float32 and scale them from [0, 255] to [0, 1]."""
    def observation(self, obs):
        return np.array(obs).astype(np.float32) / 255.0


def make_env(env_name):
    env = gym.make(env_name)
    print("Standard Env.        : {}".format(env.observation_space.shape))
    env = MaxAndSkipObservation(env, skip=4)
    print("MaxAndSkipObservation: {}".format(env.observation_space.shape))
    #env = FireResetEnv(env)
    env = ResizeObservation(env, (84, 84))
    print("ResizeObservation    : {}".format(env.observation_space.shape))
    env = GrayscaleObservation(env, keep_dim=True)
    print("GrayscaleObservation : {}".format(env.observation_space.shape))
    env = ImageToPyTorch(env)
    print("ImageToPyTorch       : {}".format(env.observation_space.shape))
    env = ReshapeObservation(env, (84, 84))
    print("ReshapeObservation   : {}".format(env.observation_space.shape))
    env = FrameStackObservation(env, stack_size=4)
    print("FrameStackObservation: {}".format(env.observation_space.shape))
    env = ScaledFloatFrame(env)
    print("ScaledFloatFrame     : {}".format(env.observation_space.shape))
    
    return env


env = make_env(ENV_NAME)

Standard Env.        : (210, 160, 3)
MaxAndSkipObservation: (210, 160, 3)
ResizeObservation    : (84, 84, 3)
GrayscaleObservation : (84, 84, 1)
ImageToPyTorch       : (1, 84, 84)
ReshapeObservation   : (84, 84)
FrameStackObservation: (4, 84, 84)
ScaledFloatFrame     : (4, 84, 84)


The `make_env` function applies all the wrappers to the environment.
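
The first wrapper in the chain, `MaxAndSkipObservation`, repeats each action for `skip` frames and, according to the Gymnasium documentation, returns the maximum over the last two observations. The sketch below illustrates that idea on tiny hand-made arrays. The helper `max_and_skip` is hypothetical and is not the wrapper's actual implementation; the summing of rewards over the repeated frames is our assumption about the intended behaviour.

```python
import numpy as np

def max_and_skip(frames, rewards):
    """Hypothetical helper: collapse the frames produced while one action is repeated."""
    # Pixel-wise maximum of the two most recent frames, so that objects drawn
    # on only one of them still appear in the observation.
    observation = np.maximum(frames[-2], frames[-1])
    # Rewards collected on every repeated frame are summed.
    return observation, float(sum(rewards))

# Four tiny 2x2 "frames": each lit pixel is visible on a single frame only.
frames = [np.zeros((2, 2), dtype=np.uint8) for _ in range(4)]
frames[2][0, 0] = 255   # visible on the third frame only
frames[3][1, 1] = 255   # visible on the fourth frame only

observation, reward = max_and_skip(frames, rewards=[0.0, 0.0, 1.0, 0.0])
print(observation)      # both lit pixels are present
print(reward)           # 1.0
```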

In [6]:
def print_env_info(name, env):
    obs, _ = env.reset()
    print("*** {} Environment ***".format(name))
    print("Environment obs. : {}".format(env.observation_space.shape))
    print("Observation shape: {}, type: {} and range [{},{}]".format(obs.shape, obs.dtype, np.min(obs), np.max(obs)))
    print("Observation sample:\n{}".format(obs))

print_env_info("Wrapped", env)

*** Wrapped Environment ***
Environment obs. : (4, 84, 84)
Observation shape: (4, 84, 84), type: float32 and range [0.25882354378700256,0.7098039388656616]
Observation sample:
[[[0.25882354 0.25882354 0.25882354 ... 0.43137255 0.43137255 0.43137255]
  [0.43137255 0.43137255 0.43137255 ... 0.43137255 0.43137255 0.43137255]
  [0.43137255 0.43137255 0.43137255 ... 0.43137255 0.43137255 0.43137255]
  ...
  [0.3137255  0.3137255  0.3137255  ... 0.3137255  0.3137255  0.3137255 ]
  [0.3137255  0.3137255  0.3137255  ... 0.3137255  0.3137255  0.3137255 ]
  [0.3137255  0.3137255  0.3137255  ... 0.3137255  0.3137255  0.3137255 ]]

 [[0.25882354 0.25882354 0.25882354 ... 0.43137255 0.43137255 0.43137255]
  [0.43137255 0.43137255 0.43137255 ... 0.43137255 0.43137255 0.43137255]
  [0.43137255 0.43137255 0.43137255 ... 0.43137255 0.43137255 0.43137255]
  ...
  [0.3137255  0.3137255  0.3137255  ... 0.3137255  0.3137255  0.3137255 ]
  [0.3137255  0.3137255  0.3137255  ... 0.3137255  0.3137255  0.313725

## Neural network architecture

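The following code implements the neural network: three convolutional layers followed by two fully connected layers, with one output (Q-value) per action.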

In [7]:
import torch
import torch.nn as nn        
import torch.optim as optim 
from torchsummary import summary

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

In [8]:
def make_DQN(input_shape, output_shape):
    net = nn.Sequential(
        nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
        nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2),
        nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64*7*7, 512),
        nn.ReLU(),
        nn.Linear(512, output_shape)
    )
    return net

test_env = make_env(ENV_NAME)
test_net = make_DQN(test_env.observation_space.shape, test_env.action_space.n).to(device)     

Standard Env.        : (210, 160, 3)
MaxAndSkipObservation: (210, 160, 3)
ResizeObservation    : (84, 84, 3)
GrayscaleObservation : (84, 84, 1)
ImageToPyTorch       : (1, 84, 84)
ReshapeObservation   : (84, 84)
FrameStackObservation: (4, 84, 84)
ScaledFloatFrame     : (4, 84, 84)
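
The `64*7*7` input size of the first linear layer follows from the shapes of the convolutional layers. As a quick check, the sketch below applies the standard output-size formula for a convolution without padding or dilation to an $84 \times 84$ input. The helper `conv_output_size` is ours and is not part of PyTorch.

```python
def conv_output_size(size: int, kernel_size: int, stride: int, padding: int = 0) -> int:
    """Spatial size after a convolution: floor((size + 2*padding - kernel_size) / stride) + 1."""
    return (size + 2 * padding - kernel_size) // stride + 1

size = 84
# (kernel_size, stride) of the three convolutional layers in make_DQN
for kernel_size, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel_size, stride)
    print(size)           # 20, then 9, then 7

flattened = 64 * size * size
print(flattened)          # 3136, the input size of the first nn.Linear layer
```

If the input resolution or any kernel or stride changes, this number changes too, and the first `nn.Linear` layer has to be adjusted accordingly.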


## Experience Replay and Target Network

First, we design a class to implement the **experience replay buffer**.

In [9]:
import collections

Experience = collections.namedtuple('Experience', field_names=['state', 'action', 'reward', 'done', 'new_state'])

class ExperienceReplay:
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def append(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Uniformly sample `batch_size` distinct transitions from the buffer
        indices = np.random.choice(len(self.buffer), batch_size, replace=False)
        states, actions, rewards, dones, next_states = zip(*[self.buffer[idx] for idx in indices])

        return np.array(states), np.array(actions), np.array(rewards, dtype=np.float32), \
               np.array(dones, dtype=np.uint8), np.array(next_states)
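
To make the behaviour of the buffer concrete, here is a small standalone sketch using only the standard library. It assumes a tiny capacity of 3 and uses integers as stand-ins for states; the name `Transition` is ours, chosen to avoid clashing with the `Experience` tuple above.

```python
import collections
import random

Transition = collections.namedtuple("Transition", ["state", "action", "reward", "done", "new_state"])

buffer = collections.deque(maxlen=3)   # tiny capacity, for illustration only
for t in range(5):
    buffer.append(Transition(state=t, action=0, reward=0.0, done=False, new_state=t + 1))

# Once the capacity is reached, appending discards the oldest transition.
print([tr.state for tr in buffer])     # [2, 3, 4]

# Uniform sampling without replacement, then regrouping field by field.
batch = random.sample(list(buffer), 2)
states, actions, rewards, dones, new_states = zip(*batch)
print(states, new_states)
```

Sampling at random, instead of training on consecutive transitions, reduces the correlation between the samples in a batch.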

## Deep Q-Learning algorithm

Define the hyperparameters:

In [10]:
import time
import numpy as np
import collections


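MEAN_REWARD_BOUND = 19.0                   # training stops once the mean reward exceeds this value
NUMBER_OF_REWARDS_TO_AVERAGE = 10          # number of most recent episodes used for the mean reward

GAMMA = 0.99                               # discount factor

BATCH_SIZE = 32                            # transitions sampled from the buffer per training step
LEARNING_RATE = 1e-4                       # learning rate of the Adam optimizer

EXPERIENCE_REPLAY_SIZE = 10000             # buffer capacity; training starts once the buffer is full
SYNC_TARGET_NETWORK = 1000                 # frames between two synchronisations of the target network

EPS_START = 1.0                            # initial exploration rate
EPS_DECAY = 0.999985                       # multiplicative decay applied to epsilon at every frame
EPS_MIN = 0.02                             # lower bound of the exploration rate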
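
With a multiplicative decay, epsilon after $n$ frames is $\max(\text{EPS\_START} \cdot \text{EPS\_DECAY}^n, \text{EPS\_MIN})$. The sketch below uses the values defined above to estimate how many frames it takes to reach the minimum. The helper `epsilon_at` is ours, added only for illustration.

```python
import math

EPS_START, EPS_DECAY, EPS_MIN = 1.0, 0.999985, 0.02   # values from the cell above

def epsilon_at(frame: int) -> float:
    """Epsilon after `frame` multiplicative decay steps, clipped at EPS_MIN."""
    return max(EPS_START * EPS_DECAY ** frame, EPS_MIN)

# Smallest number of frames n with EPS_START * EPS_DECAY ** n <= EPS_MIN
frames_to_min = math.ceil(math.log(EPS_MIN / EPS_START) / math.log(EPS_DECAY))

print(frames_to_min)                # roughly 2.6e5 frames
print(round(epsilon_at(842), 2))    # 0.99
```

In other words, the agent explores heavily during roughly the first 260,000 frames and acts almost greedily afterwards.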

Define the agent:

In [11]:
class Agent:
    def __init__(self, env, exp_replay_buffer):
        self.env = env
        self.exp_replay_buffer = exp_replay_buffer
        self._reset()

    def _reset(self):
        self.current_state = self.env.reset()[0]
        self.total_reward = 0.0

    def step(self, net, epsilon=0.0, device="cpu"):
        done_reward = None
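        if np.random.random() < epsilon:
            # Explore: sample a random action
            action = self.env.action_space.sample()
        else:
            # Exploit: choose the action with the highest predicted Q-value
            state_ = np.array([self.current_state])
            state = torch.tensor(state_).to(device)
            with torch.no_grad():
                q_vals = net(state)
            _, act_ = torch.max(q_vals, dim=1)
            action = int(act_.item())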

        new_state, reward, terminated, truncated, _ = self.env.step(action)
        is_done = terminated or truncated
        self.total_reward += reward

        exp = Experience(self.current_state, action, reward, is_done, new_state)
        self.exp_replay_buffer.append(exp)
        self.current_state = new_state
        
        if is_done:
            done_reward = self.total_reward
            self._reset()
        
        return done_reward
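
The action selection inside `step` is an epsilon-greedy policy. Stripped of the environment and the network, the rule can be sketched as follows. The function `epsilon_greedy` and the Q-values are made up for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit (argmax)

q_values = [0.1, -0.3, 0.7, 0.2, 0.0, 0.5]    # made-up Q-values for Pong's six actions

print(epsilon_greedy(q_values, epsilon=0.0))  # 2, always the greedy action
print(epsilon_greedy(q_values, epsilon=1.0))  # a uniformly random action index
```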

## Training

In [12]:
import wandb

# login
wandb.login()

# start a new wandb run to track this script
wandb.init(project="M3-2_Example_1a")

[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mjcasasr[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [13]:
import datetime
print(">>> Training starts at ",datetime.datetime.now())

>>> Training starts at  2024-10-24 12:50:20.446644


Main loop:

In [14]:
env = make_env(ENV_NAME)

net = make_DQN(env.observation_space.shape, env.action_space.n).to(device)
target_net = make_DQN(env.observation_space.shape, env.action_space.n).to(device)
 
buffer = ExperienceReplay(EXPERIENCE_REPLAY_SIZE)
agent = Agent(env, buffer)

epsilon = EPS_START
optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)
total_rewards = []
frame_number = 0  

while True:
    frame_number += 1
    epsilon = max(epsilon * EPS_DECAY, EPS_MIN)

    reward = agent.step(net, epsilon, device=device)
    if reward is not None:
        total_rewards.append(reward)

        mean_reward = np.mean(total_rewards[-NUMBER_OF_REWARDS_TO_AVERAGE:])
        print(f"Frame:{frame_number} | Total games:{len(total_rewards)} | Mean reward: {mean_reward:.3f}  (epsilon used: {epsilon:.2f})")
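        # Despite its name, "reward_100" is the mean over the last NUMBER_OF_REWARDS_TO_AVERAGE (10) episodes
        wandb.log({"epsilon": epsilon, "reward_100": mean_reward, "reward": reward}, step=frame_number)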

        if mean_reward > MEAN_REWARD_BOUND:
            print(f"SOLVED in {frame_number} frames and {len(total_rewards)} games")
            break

    if len(buffer) < EXPERIENCE_REPLAY_SIZE:
        continue

    batch = buffer.sample(BATCH_SIZE)
    states_, actions_, rewards_, dones_, next_states_ = batch

    states = torch.tensor(states_).to(device)
    next_states = torch.tensor(next_states_).to(device)
    actions = torch.tensor(actions_).to(device)
    rewards = torch.tensor(rewards_).to(device)
    dones = torch.BoolTensor(dones_).to(device)

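    # Q(s, a) predicted by the online network for the actions actually taken
    Q_values = net(states).gather(1, actions.unsqueeze(-1)).squeeze(-1)

    # max_a' Q_target(s', a'), computed without gradients and set to 0 for terminal states
    with torch.no_grad():
        next_state_values = target_net(next_states).max(1)[0]
        next_state_values[dones] = 0.0

    # Bellman target: r + gamma * max_a' Q_target(s', a')
    expected_Q_values = next_state_values * GAMMA + rewards
    loss = nn.MSELoss()(Q_values, expected_Q_values)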

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if frame_number % SYNC_TARGET_NETWORK == 0:
        target_net.load_state_dict(net.state_dict())

Standard Env.        : (210, 160, 3)
MaxAndSkipObservation: (210, 160, 3)
ResizeObservation    : (84, 84, 3)
GrayscaleObservation : (84, 84, 1)
ImageToPyTorch       : (1, 84, 84)
ReshapeObservation   : (84, 84)
FrameStackObservation: (4, 84, 84)
ScaledFloatFrame     : (4, 84, 84)
Frame:842 | Total games:1 | Mean reward: -20.000  (epsilon used: 0.99)
Frame:1687 | Total games:2 | Mean reward: -20.500  (epsilon used: 0.98)
Frame:2599 | Total games:3 | Mean reward: -20.667  (epsilon used: 0.96)
Frame:3532 | Total games:4 | Mean reward: -20.750  (epsilon used: 0.95)
Frame:4404 | Total games:5 | Mean reward: -20.800  (epsilon used: 0.94)
Frame:5275 | Total games:6 | Mean reward: -20.667  (epsilon used: 0.92)
Frame:6247 | Total games:7 | Mean reward: -20.714  (epsilon used: 0.91)
Frame:7152 | Total games:8 | Mean reward: -20.625  (epsilon used: 0.90)
Frame:7976 | Total games:9 | Mean reward: -20.667  (epsilon used: 0.89)
Frame:8796 | Total games:10 | Mean reward: -20.700  (epsilon used: 0.88)
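
The core of each training step is the target $y = r + \gamma \max_{a'} Q_{\text{target}}(s', a')$, with $y = r$ when the episode ends at $s'$. The following NumPy sketch reproduces that computation on three made-up transitions. The helper `td_targets` and all the numbers are illustrative assumptions, not values taken from the training run.

```python
import numpy as np

GAMMA = 0.99

def td_targets(rewards, next_q_values, dones, gamma=GAMMA):
    """Hypothetical helper mirroring the target computation in the training loop."""
    next_state_values = next_q_values.max(axis=1)   # best Q-value of each next state
    next_state_values[dones] = 0.0                  # terminal states have no future value
    return rewards + gamma * next_state_values

rewards = np.array([1.0, 0.0, -1.0], dtype=np.float32)
# Made-up target-network outputs for three next states and two actions
next_q_values = np.array([[0.5, 2.0], [3.0, 1.0], [5.0, 4.0]], dtype=np.float32)
dones = np.array([False, False, True])

targets = td_targets(rewards, next_q_values, dones)
print(targets)   # approximately [2.98, 2.97, -1.0]
```

The third transition is terminal, so its target is just the reward, regardless of what the target network predicts for the next state.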

Save the model:

In [15]:
torch.save(net.state_dict(), ENV_NAME + ".dat")

In [16]:
print(">>> Training ends at ",datetime.datetime.now())

>>> Training ends at  2024-10-24 21:55:47.891256


In [17]:
# Finish the wandb run, necessary in notebooks
wandb.finish()

Run history:
epsilon     █▇▇▇▇▆▅▅▅▄▄▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
reward      ▁▁▁▁▁▁▁▁▂▁▃▁▂▃▃▁▂▃▅▇▇███▄█▇██▇███▆█▇█▇██
reward_100  ▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▃▃▆███████████████

Run summary:
epsilon     0.02
reward      20.0
reward_100  19.1
