<a href="https://colab.research.google.com/github/tcharos/AIDL_B02-Advanced-Topics-in-Deep-Learning/blob/main/AIDL_B02_AdvancedTopicsInDeepLearning_SpaceInvaders.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚀 Project Base: DQN Variants for ALE/SpaceInvaders-v5

This notebook implements the project's requirements for the **`ALE/SpaceInvaders-v5`** environment with 4-frame stacking and a CNN architecture.

**Key Requirements Met:**
* **Environment:** `ALE/SpaceInvaders-v5` [cite: 11]
* **Action Space:** 6 actions [cite: 13, 21]
* **State:** 4 stacked input frames [cite: 19]

**To run an implementation:**
1.  Change the `CONFIG['MODE']` variable below to one of: **`SimpleDQN`**, **`DoubleDQN`**, or **`DuelingDQN`**.
2.  Adjust hyperparameters (`LR`, `EPS_DECAY`, etc.) in the `CONFIG` dictionary if needed.
3.  Run all cells.

## 1. Setup and Configuration

In [1]:
!pip install "gymnasium[atari,accept-rom-license,other]" ale-py
!pip install pyvirtualdisplay
!apt-get install -y xvfb x11-utils

Collecting pyvirtualdisplay
  Downloading PyVirtualDisplay-3.0-py3-none-any.whl.metadata (943 bytes)
Downloading PyVirtualDisplay-3.0-py3-none-any.whl (15 kB)
Installing collected packages: pyvirtualdisplay
Successfully installed pyvirtualdisplay-3.0
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
xvfb is already the newest version (2:21.1.4-2ubuntu1.7~22.04.16).
The following additional packages will be installed:
  libxcomposite1 libxtst6 libxxf86dga1
Suggested packages:
  mesa-utils
The following NEW packages will be installed:
  libxcomposite1 libxtst6 libxxf86dga1 x11-utils
0 upgraded, 4 newly installed, 0 to remove and 41 not upgraded.
Need to get 239 kB of archives.
After this operation, 852 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 libxcomposite1 amd64 1:0.4.5-1build2 [7,192 B]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 libxtst6 amd64 2:1.2.3-1build4 [13.4 kB]

In [4]:
import gymnasium as gym
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
from collections import deque, namedtuple
import matplotlib.pyplot as plt
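from gymnasium.wrappers import AtariPreprocessing
import ale_py  # Importing ale_py registers the ALE/* environments (needed with recent ale-py releases)

try:
    # Gymnasium >= 1.0 replaced FrameStack with FrameStackObservation (stack_size argument)
    from gymnasium.wrappers import FrameStackObservation

    def FrameStack(env, num_stack):
        """Thin adapter so the rest of the notebook can keep calling FrameStack(env, num_stack=4)."""
        return FrameStackObservation(env, stack_size=num_stack)
except ImportError:
    # Gymnasium < 1.0 still ships FrameStack
    from gymnasium.wrappers import FrameStack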

# Tools for video recording and display
import imageio  # used by run_and_record() to write the MP4
from pyvirtualdisplay import Display
from IPython import display as ipythondisplay
from base64 import b64encode

# ----------------- GLOBAL CONFIGURATION -----------------
CONFIG = {
    "ENV_ID": 'ALE/SpaceInvaders-v5',
    "SEED": 7,
    "MODE": "SimpleDQN", # Choice --> 'SimpleDQN', 'DoubleDQN', 'DuelingDQN'
    "INPUT_SHAPE": (4, 84, 84), # 4 stacked frames, resized to 84x84
    "BUFFER_SIZE": int(1e5),
    "BATCH_SIZE": 32, # Reduced batch size (common practice for Atari, hinted in PDF [cite: 37])
    "GAMMA": 0.99, # Prioritizing long-term cumulative reward
    "TAU": 1e-3, # Soft Update Rate
    "LR": 1e-4, # Lower learning rate --> stable convergence
    "UPDATE_EVERY": 4, # Learn frequency (standard for Atari DQN)
    "TARGET_UPDATE_FREQ": 1000,
    "N_EPISODES": 5000,
    "EPS_START": 1.0, # Initial probability of choosing a random action (exploration) --> fully exploring the environment to gather initial experiences
    "EPS_END": 0.01, # Minimum probability of choosing a random action.
    "EPS_DECAY": 0.999 # Exploration rate decays very slowly, allowing the agent to explore over a large number of episodes
}
# --------------------------------------------------------

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

random.seed(CONFIG['SEED'])
np.random.seed(CONFIG['SEED'])
torch.manual_seed(CONFIG['SEED'])
if torch.cuda.is_available():
    torch.cuda.manual_seed(CONFIG['SEED'])

print(f"Using device: {device}")
print(f"Current DQN Mode: {CONFIG['MODE']}")


## 2. Environment Initialization
We use **`AtariPreprocessing`** to handle resizing/cropping to 84x84 and grayscale conversion. **`FrameStack`** then stacks 4 consecutive frames, fulfilling the requirements for the state space[cite: 19, 20].

In [None]:
def make_atari_env(env_id, seed):
    """Creates and wraps the Atari environment with standard preprocessing and 4-frame stacking."""
    # 1. Base Environment (Using the required ID [cite: 11])
    env = gym.make(env_id)

    # 2. Atari Preprocessing: resizes to 84x84 and converts to grayscale.
    # ALE/SpaceInvaders-v5 already skips frames internally (frameskip=4 by default),
    # so frame_skip=1 here avoids skipping a second time inside the wrapper.
    env = AtariPreprocessing(env, grayscale_obs=True, terminal_on_life_loss=True, frame_skip=1, screen_size=84)

    # 3. Frame Stacking (Creates the (4, 84, 84) state [cite: 19])
    env = FrameStack(env, num_stack=4)

    # Set seed on the final environment
    if seed is not None:
        env.action_space.seed(seed)
        env.observation_space.seed(seed)

    return env

env = make_atari_env(CONFIG['ENV_ID'], CONFIG['SEED'])
action_size = env.action_space.n
state_shape = env.observation_space.shape

print(f'Final State shape (Stacked Frames): {state_shape}')
print(f'Number of available actions (SpaceInvaders): {action_size}') # Confirms 6 actions [cite: 13, 21]
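
To make the stacking step concrete, here is a minimal standard-library sketch of what a frame-stacking wrapper does. The `FrameStacker` class and the string "frames" are hypothetical stand-ins for illustration only; the real wrapper works on 84x84 arrays, and how it pads the stack at reset may differ between Gymnasium versions.

```python
from collections import deque


class FrameStacker:
    """Hypothetical illustration of 4-frame stacking (not the Gymnasium wrapper)."""

    def __init__(self, num_stack=4):
        self.frames = deque(maxlen=num_stack)

    def reset(self, first_frame):
        # Assumption: at reset the stack is filled with copies of the first frame.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return tuple(self.frames)

    def step(self, new_frame):
        # Appending to a full deque silently drops the oldest frame.
        self.frames.append(new_frame)
        return tuple(self.frames)


stacker = FrameStacker(num_stack=4)
state = stacker.reset("f0")
state = stacker.step("f1")
state = stacker.step("f2")
print(state)  # ('f0', 'f0', 'f1', 'f2')
```

Because each state carries the last four frames, the network can infer motion (for example, the direction of a projectile) that a single frame cannot show.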

## 3. Q-Network Architecture
The network uses a CNN architecture to process the high-dimensional image input, supporting Dueling components via a flag.

In [None]:
class QNetwork(nn.Module):
    """CNN-based Q-Network Model supporting Standard and Dueling structures."""

    def __init__(self, state_shape, action_size, seed, dueling=False):
        """Initializes the shared CNN layers and splits into Value/Advantage streams if Dueling is enabled.
        """
        super(QNetwork, self).__init__()
        self.seed = torch.manual_seed(seed)
        self.dueling = dueling
        in_channels = state_shape[0] # 4 stacked frames

        # --- Shared CNN Layers (Original DQN architecture) ---
        # Layers extract features from the 4 stacked 84x84 input images.
        self.conv1 = nn.Conv2d(in_channels, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)

        # --- Dynamic Calculation of fc_input_size ---
        # This prevents the network from breaking if the input image size changes.
        # 1. Create a dummy input tensor based on state_shape (e.g., (1, 4, 84, 84))
        dummy_input = torch.zeros(1, *state_shape)

        # 2. Pass the dummy input through the convolutional layers
        x = self._forward_conv(dummy_input)

        # 3. Calculate the flattened feature size (e.g., 7*7*64 = 3136)
        self.fc_input_size = x.view(1, -1).size(1)

        # --- Fully Connected Layers ---
        if self.dueling:
            # Dueling Architecture: Split into Value (V) and Advantage (A) streams
            self.fc_v1 = nn.Linear(self.fc_input_size, 512)
            self.fc_a1 = nn.Linear(self.fc_input_size, 512)

            self.fc_v2 = nn.Linear(512, 1) # Output V(s)
            self.fc_a2 = nn.Linear(512, action_size) # Output A(s, a)
        else:
            # Standard DQN Architecture (Single Q-stream)
            self.fc1 = nn.Linear(self.fc_input_size, 512)
            self.fc2 = nn.Linear(512, action_size)

    def forward(self, state):
        """Maps state (4, 84, 84) to action values (6)."""
        # CNN forward pass
        x = F.relu(self.conv1(state))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = x.view(x.size(0), -1) # Flatten

        if self.dueling:
            # Dueling Combination: Q(s,a) = V(s) + [A(s,a) - mean(A(s,a))]
            v = F.relu(self.fc_v1(x))
            a = F.relu(self.fc_a1(x))
            v = self.fc_v2(v)
            a = self.fc_a2(a)
            return v + a - a.mean(1).unsqueeze(1)
        else:
            # Standard Q-stream
            x = F.relu(self.fc1(x))
            return self.fc2(x)
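
As a sanity check on the two ideas in this network, the sketch below reproduces the flattened feature size by hand and applies the dueling combination to made-up numbers. It uses plain Python rather than PyTorch; `conv_out` and `dueling_q` are illustrative helper names, and the value/advantage figures are invented.

```python
def conv_out(size, kernel, stride):
    """Output side length of a square convolution without padding."""
    return (size - kernel) // stride + 1


# Follow an 84x84 input through conv1, conv2 and conv3.
side = 84
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    side = conv_out(side, kernel, stride)

fc_input_size = side * side * 64  # conv3 has 64 output channels
print(side, fc_input_size)  # 7 3136


def dueling_q(value, advantages):
    """Q(s,a) = V(s) + A(s,a) - mean(A(s,.)) for a single state."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


# Invented numbers: V(s) = 2.0 and three action advantages.
q_values = dueling_q(2.0, [1.0, 0.0, -1.0])
print(q_values)  # [3.0, 2.0, 1.0]
```

Subtracting the mean advantage keeps the decomposition identifiable: adding a constant to every advantage would otherwise be indistinguishable from adding it to V(s).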

## 4. Replay Buffer and Agent Implementations
The **Replay Buffer** (PER is an optional extension [cite: 27]) is crucial for breaking correlation in experience samples. The **AgentBase** handles common functions; specialized classes implement the specific Q-learning update rule.

In [None]:
class ReplayBuffer:
    """Fixed-size buffer to store experience tuples, essential for DQN."""

    def __init__(self, action_size, buffer_size, batch_size, seed):
        """Initializes the ReplayBuffer."""
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        random.seed(seed)  # random.seed() returns None, so keep the seed value itself
        self.seed = seed

    def add(self, state, action, reward, next_state, done):
        """Adds a new experience (s, a, r, s', done) to memory."""
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)

    def sample(self):
        """Randomly samples a batch of experiences from memory."""
        experiences = random.sample(self.memory, k=self.batch_size)

        # Convert to tensors. States are stacked to (B, C, H, W). Actions, rewards and dones are
        # reshaped to (B, 1) so they line up with the (B, 1) Q-values in learn(), whether they
        # were stored as scalars or as 1-element arrays (an extra axis would broadcast to (B, B, 1)).
        states = torch.from_numpy(np.stack([e.state for e in experiences])).float().to(device)
        actions = torch.from_numpy(np.array([e.action for e in experiences]).reshape(-1, 1)).long().to(device)
        rewards = torch.from_numpy(np.array([e.reward for e in experiences], dtype=np.float32).reshape(-1, 1)).to(device)
        next_states = torch.from_numpy(np.stack([e.next_state for e in experiences])).float().to(device)
        dones = torch.from_numpy(np.array([e.done for e in experiences], dtype=np.float32).reshape(-1, 1)).to(device)

        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.memory)
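
A rough estimate of the buffer's memory footprint helps when choosing `BUFFER_SIZE`. The arithmetic below is a back-of-the-envelope sketch that assumes observations are stored as `uint8` arrays of shape (4, 84, 84); if they were converted to `float32` before storage, every figure would be four times larger.

```python
# Assumption: each stored state is a uint8 array of shape (4, 84, 84).
state_bytes = 4 * 84 * 84  # 28,224 bytes per stacked state
buffer_size = int(1e5)

# Worst case: every transition holds its own copies of state and next_state.
separate_copies = 2 * state_bytes * buffer_size

# Best case: the next_state of one transition is the same array object as the
# state of the following transition, so each array is stored only once.
shared_arrays = state_bytes * buffer_size

gib = 1024 ** 3
print(f"{shared_arrays / gib:.2f} GiB to {separate_copies / gib:.2f} GiB")
# 2.63 GiB to 5.26 GiB
```

Either figure is a significant share of the RAM on a typical Colab instance, so a smaller buffer may be needed if the runtime runs out of memory.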

In [None]:
class AgentBase:
    """Base class for all DQN agents, handling shared components and target network logic."""

    def __init__(self, state_shape, action_size, seed, mode, dueling):
        self.state_shape = state_shape
        self.action_size = action_size
        self.mode = mode

        # Initialize Q-Networks
        self.qnetwork_local = QNetwork(state_shape, action_size, seed, dueling=dueling).to(device)
        self.qnetwork_target = QNetwork(state_shape, action_size, seed, dueling=dueling).to(device)
        self.qnetwork_target.load_state_dict(self.qnetwork_local.state_dict())

        self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=CONFIG['LR'])
        self.memory = ReplayBuffer(action_size, CONFIG['BUFFER_SIZE'], CONFIG['BATCH_SIZE'], seed)

        self.t_step = 0

    def step(self, state, action, reward, next_state, done):
        # Save experience
        self.memory.add(state, action, reward, next_state, done)

        # Count every environment step. The counter is never wrapped, so the learning
        # and target-update schedules below stay independent of each other.
        self.t_step += 1

        # Learn every UPDATE_EVERY steps
        if self.t_step % CONFIG['UPDATE_EVERY'] == 0:
            if len(self.memory) > CONFIG['BATCH_SIZE']:
                experiences = self.memory.sample()
                self.learn(experiences, CONFIG['GAMMA'])

        # Hard update the target network every TARGET_UPDATE_FREQ steps (standard for Atari)
        if self.t_step % CONFIG['TARGET_UPDATE_FREQ'] == 0:
            self.qnetwork_target.load_state_dict(self.qnetwork_local.state_dict())

    def act(self, state, eps=0.):
        """Returns action based on epsilon-greedy policy."""
        state = torch.from_numpy(np.copy(state)).float().unsqueeze(0).to(device)

        self.qnetwork_local.eval()
        with torch.no_grad():
            action_values = self.qnetwork_local(state)
        self.qnetwork_local.train()

        if random.random() > eps:
            return np.argmax(action_values.cpu().data.numpy())
        else:
            return random.choice(np.arange(self.action_size))

    def learn(self, experiences, gamma):
        # Placeholder, implemented by child classes
        pass


# ----------------------------------------------------------------
# --- 💥 DQN Variant 1: Simple DQN (Original Target Calculation) ---
# ----------------------------------------------------------------
class SimpleDQNAgent(AgentBase):
    """Implements the original DQN learning step: Target Q = R + gamma * max_a Q_target(s', a)."""
    def __init__(self, state_shape, action_size, seed):
        # Initialize with Standard QNetwork (dueling=False)
        super().__init__(state_shape, action_size, seed, mode='SimpleDQN', dueling=False)

    def learn(self, experiences, gamma):
        states, actions, rewards, next_states, dones = experiences

        # Target Q calculation uses the max Q-value from the target network directly.
        Q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)

        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))

        Q_expected = self.qnetwork_local(states).gather(1, actions)

        loss = F.mse_loss(Q_expected, Q_targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()


# ----------------------------------------------------------------
# --- 💥 DQN Variant 2: Double DQN (Decoupled Target Calculation) ---
# ----------------------------------------------------------------
class DoubleDQNAgent(AgentBase):
    """Implements the Double DQN learning step: Target Q = R + gamma * Q_target(s', argmax_a Q_local(s', a))."""
    def __init__(self, state_shape, action_size, seed):
        # Initialize with Standard QNetwork (dueling=False)
        super().__init__(state_shape, action_size, seed, mode='DoubleDQN', dueling=False)

    def learn(self, experiences, gamma):
        states, actions, rewards, next_states, dones = experiences

        # 1. Action selection from LOCAL network (argmax_a Q_local(s', a))
        Q_local_next = self.qnetwork_local(next_states).detach()
        best_actions = Q_local_next.max(1)[1].unsqueeze(1)

        # 2. Value estimation from TARGET network (Q_target(s', best_actions))
        Q_targets_next = self.qnetwork_target(next_states).gather(1, best_actions).detach()

        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))

        Q_expected = self.qnetwork_local(states).gather(1, actions)

        loss = F.mse_loss(Q_expected, Q_targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()


# ----------------------------------------------------------------
# --- 💥 DQN Variant 3: Dueling DQN (Dueling Architecture + Double Learning Rule) ---
# ----------------------------------------------------------------
class DuelingDQNAgent(DoubleDQNAgent):
    """Dueling DQN uses the Dueling architecture and the Double DQN learning rule for stability."""
    def __init__(self, state_shape, action_size, seed):
        # Initialize with Dueling QNetwork (dueling=True)
        AgentBase.__init__(self, state_shape, action_size, seed, mode='DuelingDQN', dueling=True)

    # Inherits the Double DQN learn() method for stability


# --- Agent Initialization based on global CONFIG['MODE'] ---
if CONFIG['MODE'] == "SimpleDQN":
    agent = SimpleDQNAgent(state_shape=state_shape, action_size=action_size, seed=CONFIG['SEED'])
elif CONFIG['MODE'] == "DoubleDQN":
    agent = DoubleDQNAgent(state_shape=state_shape, action_size=action_size, seed=CONFIG['SEED'])
elif CONFIG['MODE'] == "DuelingDQN":
    agent = DuelingDQNAgent(state_shape=state_shape, action_size=action_size, seed=CONFIG['SEED'])
else:
    raise ValueError("Invalid MODE specified in CONFIG.")

print(f"Initialized agent: {type(agent).__name__} with learning mode: {agent.mode}")
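
The difference between the Simple DQN and Double DQN targets is easiest to see on a single transition. The sketch below uses invented Q-values for one next state and plain Python instead of tensors; the variable names are illustrative and not part of the agent classes.

```python
gamma = 0.99
reward, done = 1.0, 0.0

# Invented Q-values for the next state s' over three actions.
q_local_next = [1.0, 3.0, 2.0]   # online (local) network
q_target_next = [2.5, 1.5, 2.0]  # target network

# Simple DQN: the target network both selects and evaluates the action.
simple_target = reward + gamma * max(q_target_next) * (1 - done)

# Double DQN: the local network selects the action, the target network evaluates it.
best_action = max(range(len(q_local_next)), key=q_local_next.__getitem__)
double_target = reward + gamma * q_target_next[best_action] * (1 - done)

print(best_action, simple_target, double_target)
```

With these numbers the simple target is about 3.475, while the double target is about 2.485 because the locally preferred action (index 1) is valued lower by the target network. Decoupling selection from evaluation in this way is what reduces the overestimation bias of the max operator.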

## 5. Training and Evaluation Functions

In [None]:
def dqn(n_episodes=CONFIG['N_EPISODES'], max_t=10000, eps_start=CONFIG['EPS_START'], eps_end=CONFIG['EPS_END'], eps_decay=CONFIG['EPS_DECAY']):
    """Deep Q-Learning training function, modified to save scores."""

    scores = []
    scores_window = deque(maxlen=100)
    eps = eps_start

    GOAL_SCORE = 400.0

    # ALL_SCORES is a module-level dict that collects the scores of every mode trained in this session
    global ALL_SCORES

    print(f"\nStarting training for {agent.mode}...")

    for i_episode in range(1, n_episodes + 1):
        state, info = env.reset(seed=CONFIG['SEED'] if i_episode == 1 else None)
        state = np.array(state)
        score = 0

        for t in range(max_t):
            action = agent.act(state, eps)

            next_state_raw, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated

            next_state = np.array(next_state_raw)
            reward_np = np.array([reward]).astype(np.float32)
            done_np = np.array([done]).astype(np.uint8)

            agent.step(state, action, reward_np, next_state, done_np)
            state = next_state
            score += reward

            if done:
                break

        scores_window.append(score)
        scores.append(score)
        eps = max(eps_end, eps_decay * eps)

        avg_score = np.mean(scores_window)

        print(f'\rEpisode {i_episode}\tAverage Score: {avg_score:.2f}\tEpsilon: {eps:.4f}', end="")

        if i_episode % 100 == 0:
            print(f'\rEpisode {i_episode}\tAverage Score: {avg_score:.2f}\tEpsilon: {eps:.4f}')

        if avg_score >= GOAL_SCORE:
            print(f'\n{agent.mode} Goal Reached in {i_episode-100} episodes!\tAverage Score: {avg_score:.2f}')
            torch.save(agent.qnetwork_local.state_dict(), f'{agent.mode}_{i_episode}.pth')
            break

    # Save the scores of the completed run
    ALL_SCORES[agent.mode] = scores
    np.save('dqn_project_scores.npy', ALL_SCORES)

    return scores
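
Since epsilon is multiplied by `EPS_DECAY` once per episode, it is worth checking how long the schedule takes to reach `EPS_END`. The sketch below replays the decay with the values from `CONFIG` and compares the result with the closed form ln(EPS_END) / ln(EPS_DECAY); it is a standalone calculation, not part of the training loop.

```python
import math

eps, eps_end, eps_decay = 1.0, 0.01, 0.999

# Replay the per-episode decay until epsilon reaches the floor.
episodes = 0
while eps > eps_end:
    eps *= eps_decay
    episodes += 1

# Closed form: smallest n with eps_decay ** n <= eps_end.
closed_form = math.ceil(math.log(eps_end) / math.log(eps_decay))

print(episodes, closed_form)
```

Both figures come out at roughly 4,600 episodes, so with `N_EPISODES` set to 5000 the agent only acts near-greedily for the last few hundred episodes. A smaller `EPS_DECAY` (for example 0.995) would shorten the exploration phase considerably.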

In [None]:
# Initialize an empty dictionary to hold scores from all runs
ALL_SCORES = {}

In [None]:
# Run training
scores = dqn()

In [None]:
def plot_all_dqn_scores(all_scores_dict, window=100):
    """
    Loads scores for all DQN variants and plots their moving average on a single graph.

    Args:
        all_scores_dict (dict): Dictionary mapping mode names ('SimpleDQN', etc.) to lists of episode scores.
        window (int): The window size for the moving average.
    """
    if not all_scores_dict:
        print("No scores available to plot. Please run training for at least one agent.")
        return

    plt.figure(figsize=(12, 6))

    for mode, scores in all_scores_dict.items():
        if len(scores) >= window:
            # Calculate 100-episode moving average
            moving_avg = np.convolve(scores, np.ones(window)/window, mode='valid')

            # The x-axis should start at the window size, as the moving average starts there
            x_axis = np.arange(len(moving_avg)) + window

            plt.plot(x_axis, moving_avg, label=f'{mode} (Avg={moving_avg[-1]:.2f})')
        else:
            print(f"Not enough data to calculate moving average for {mode}.")

    # Add score targets as horizontal lines, similar to the presentation graph
    plt.axhline(y=400, color='r', linestyle='--', linewidth=1, label='Goal: 400')
    plt.axhline(y=500, color='g', linestyle='--', linewidth=1, label='Goal: 500')

    plt.title('Consolidated DQN Training Progress (100-Episode Moving Average)')
    plt.ylabel('Average Score (100-Game Window)')
    plt.xlabel('Episode #')
    plt.grid(True)
    plt.legend()
    plt.savefig('all_dqn_scores.png')
    plt.show()
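
The x-axis offset in the plot follows from how a "valid" moving average works: the first average only exists once a full window of scores is available. Below is a pure-Python sketch of the same computation on toy data; `moving_average` is an illustrative helper, not the `np.convolve` call used above.

```python
def moving_average(scores, window):
    """Mean over each full window; returns len(scores) - window + 1 values."""
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


scores = [10, 20, 30, 40, 50]
averages = moving_average(scores, window=3)
print(averages)  # [20.0, 30.0, 40.0]

# The first average summarises episodes 1..3, so it is plotted at x = window.
x_axis = [i + 3 for i in range(len(averages))]
print(x_axis)  # [3, 4, 5]
```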

In [None]:
# Run this cell AFTER you have completed at least one training run
try:
    # Load saved scores (if they exist from previous runs)
    loaded_scores = np.load('dqn_project_scores.npy', allow_pickle=True).item()
    plot_all_dqn_scores(loaded_scores)
except FileNotFoundError:
    print("Scores file not found. Run training first.")

## 6. Video Visualization Utility
This function is provided for optional video recording using a trained model's weights.

In [None]:
def render_mp4(videopath: str) -> str:
    """Returns an HTML <video> tag that embeds the MP4 at videopath as base64 data."""
    import os
    if not os.path.exists(videopath):
        return f"<p>Video file not found at {videopath}. Run a test episode first.</p>"

    # Read inside a context manager so the file handle is always closed
    with open(videopath, 'rb') as f:
        mp4 = f.read()
    base64_encoded_mp4 = b64encode(mp4).decode()
    return f'<video width=400 controls><source src="data:video/mp4;base64,{base64_encoded_mp4}" type="video/mp4"></video>'

def run_and_record(env_id, weights_path, mode, seed=CONFIG['SEED'], num_episodes=1):
    """Runs a specified agent on the environment and records the interaction."""

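    # 1. Setup Environment
    # Use the same wrapper stack as training for consistent state representation,
    # but create the base env with render_mode="rgb_array" so render() returns frames.
    env_render = gym.make(env_id, render_mode="rgb_array")
    env_render = AtariPreprocessing(env_render, grayscale_obs=True, terminal_on_life_loss=True, frame_skip=1, screen_size=84)
    env_render = FrameStack(env_render, num_stack=4)
    env_render.action_space.seed(seed)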

    # 2. Setup Agent
    action_size = env_render.action_space.n
    state_shape = env_render.observation_space.shape

    # Use the appropriate Agent class
    if mode == "SimpleDQN":
        test_agent = SimpleDQNAgent(state_shape, action_size, seed)
    elif mode == "DoubleDQN":
        test_agent = DoubleDQNAgent(state_shape, action_size, seed)
    elif mode == "DuelingDQN":
        test_agent = DuelingDQNAgent(state_shape, action_size, seed)
    else:
        return f"<p>Invalid MODE specified for testing: {mode}</p>"

    # 3. Load Weights
    try:
        test_agent.qnetwork_local.load_state_dict(torch.load(weights_path, map_location=device))
        test_agent.qnetwork_local.eval()
        print(f"Successfully loaded {mode} weights from {weights_path}")
    except FileNotFoundError:
        print(f"Checkpoint file {weights_path} not found. Nothing to record.")
        return

    # 4. Record Episodes
    video_path = f'{mode}_{env_id.split("/")[-1]}_test.mp4'
    frames = []

    for episode in range(num_episodes):
        state, info = env_render.reset(seed=seed)
        score = 0
        done = False

        while not done:
            # The frame-stacking wrapper may return a lazy frame object; convert to a NumPy array
            state_np = np.array(state)
            action = test_agent.act(state_np, eps=0.0)

            # Capture the current RGB frame for the video
            frames.append(env_render.render())

            next_state, reward, terminated, truncated, info = env_render.step(action)
            done = terminated or truncated
            state = next_state
            score += reward

        print(f"Test Episode {episode+1} score: {score:.2f}")

    env_render.close()

    # Save video
    imageio.mimsave(video_path, frames, fps=30)

    # Display video
    html = render_mp4(video_path)
    ipythondisplay.display(ipythondisplay.HTML(html))


In [None]:
# Example usage (Uncomment and update weights_path after training):
# run_and_record(CONFIG['ENV_ID'], 'SimpleDQN_5000.pth', 'SimpleDQN', num_episodes=1)