<a href="https://colab.research.google.com/github/tcharos/AIDL_B02-Advanced-Topics-in-Deep-Learning/blob/main/AIDL_B02_AdvancedTopicsInDeepLearning_SpaceInvaders.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚀 Project Base: DQN Variants for ALE/SpaceInvaders-v5

This notebook strictly implements the project's requirements for the **`ALE/SpaceInvaders-v5`** environment with 4-frame stacking and CNN architecture.

**Key Requirements Met:**
* **Environment:** `ALE/SpaceInvaders-v5` [cite: 11]
* **Action Space:** 6 actions [cite: 13, 21]
* **State:** 4 stacked input frames [cite: 19]

**To run an implementation:**
1.  Change the `CONFIG['MODE']` variable below to one of: **`SimpleDQN`**, **`DoubleDQN`**, or **`DuelingDQN`**.
2.  Adjust hyperparameters (`LR`, `EPS_DECAY`, etc.) in the `CONFIG` dictionary if needed.
3.  Run all cells.

## 1. Setup and Configuration

In [1]:
!pip install "gymnasium[atari,accept-rom-license,other]" ale-py
!pip install pyvirtualdisplay
!apt-get install -y xvfb x11-utils
!pip install shimmy imageio-ffmpeg

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
x11-utils is already the newest version (7.7+5build2).
xvfb is already the newest version (2:21.1.4-2ubuntu1.7~22.04.16).
0 upgraded, 0 newly installed, 0 to remove and 41 not upgraded.


In [2]:
import gymnasium as gym
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
from collections import deque, namedtuple
import matplotlib.pyplot as plt
from gymnasium.wrappers import AtariPreprocessing
from gymnasium.wrappers import FrameStackObservation
import ale_py

import gc
import psutil
import os
import glob

# Tools for video display
from pyvirtualdisplay import Display
from IPython import display as ipythondisplay
from base64 import b64encode

In [3]:
# ----------------- GLOBAL CONFIGURATION -----------------
CONFIG = {
    "ENV_ID": 'ALE/SpaceInvaders-v5',
    "SEED": 7,
    "MODE": "SimpleDQN", # Choice --> 'SimpleDQN', 'DoubleDQN', 'DuelingDQN'
    "INPUT_SHAPE": (4, 84, 84), # 4 stacked frames, resized to 84x84
    "BUFFER_SIZE": int(5e3), # from 1e5
    "BATCH_SIZE": 32, # Reduced batch size (common practice for Atari, hinted in PDF [cite: 37])
    "GAMMA": 0.99, # Prioritizing long-term cumulative reward
    "TAU": 1e-3, # Soft Update Rate
    "LR": 1e-4, # Lower learning rate --> stable convergence
    "UPDATE_EVERY": 4, # Learn frequency (standard for Atari DQN)
    "TARGET_UPDATE_FREQ": 1000,
    "N_EPISODES": 4000,
    "EPS_START": 1.0, # Initial probability of choosing a random action (exploration) --> fully exploring the environment to gather initial experiences
    "EPS_END": 0.01, # Minimum probability of choosing a random action.
    "EPS_DECAY": 0.999, # Exploration rate decays very slowly, allowing the agent to explore over a large number of episodes
    "USE_GOOGLE_DRIVE": True,
    "CHECKPOINT_FREQ": 400,
    "GOAL_SCORE": 400.0
}
# --------------------------------------------------------

gym.register_envs(ale_py)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

random.seed(CONFIG['SEED'])
np.random.seed(CONFIG['SEED'])
torch.manual_seed(CONFIG['SEED'])
if torch.cuda.is_available():
    torch.cuda.manual_seed(CONFIG['SEED'])

print(f"Using device: {device}")
print(f"Current DQN Mode: {CONFIG['MODE']}")

Using device: cuda:0
Current DQN Mode: SimpleDQN
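
The exploration schedule in `CONFIG` multiplies epsilon by `EPS_DECAY` once per episode and clips it at `EPS_END`. The following sketch (the helper names are illustrative and not part of the notebook) estimates how many episodes that schedule needs to reach the floor. With the values above it comes out at roughly 4,600 episodes, so a 4,000-episode run would end with epsilon still a little under 0.02.

```python
import math

def episodes_until_floor(eps_start: float, eps_end: float, eps_decay: float) -> int:
    """Smallest n with eps_start * eps_decay**n <= eps_end (multiplicative decay)."""
    return math.ceil(math.log(eps_end / eps_start) / math.log(eps_decay))

def epsilon_after(n_episodes: int, eps_start: float, eps_end: float, eps_decay: float) -> float:
    """Epsilon after n_episodes updates of eps = max(eps_end, eps_decay * eps)."""
    return max(eps_end, eps_start * eps_decay ** n_episodes)

# Values copied from CONFIG: EPS_START=1.0, EPS_END=0.01, EPS_DECAY=0.999, N_EPISODES=4000
n_floor = episodes_until_floor(1.0, 0.01, 0.999)
eps_final = epsilon_after(4000, 1.0, 0.01, 0.999)
print(f"episodes to reach EPS_END: {n_floor}")
print(f"epsilon after 4000 episodes: {eps_final:.4f}")
```

If the floor should be reached earlier, a smaller `EPS_DECAY` (for example 0.998) shortens the schedule; whether that helps learning here is an empirical question.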


In [4]:
if CONFIG['USE_GOOGLE_DRIVE']:
    from google.colab import drive
    drive.mount('/content/drive')
    checkpoint_dir = '/content/drive/MyDrive/DQN_Checkpoints'
    print("✓ Google Drive mounted - checkpoints will be saved to Drive")
else:
    checkpoint_dir = './checkpoints'  # Local directory
    print("✓ Using local storage - checkpoints will be saved locally")

os.makedirs(checkpoint_dir, exist_ok=True)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✓ Google Drive mounted - checkpoints will be saved to Drive


## 2. Environment Initialization
We use **`AtariPreprocessing`** to resize each frame to 84x84 and convert it to grayscale. **`FrameStackObservation`** then stacks 4 consecutive frames, fulfilling the requirements for the state space [cite: 19, 20].
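
To make the stacking step concrete, here is a minimal sketch of a rolling 4-frame window built on `collections.deque`. `FrameStacker` is a hypothetical stand-in rather than the Gymnasium wrapper, strings stand in for 84x84 frames, and padding the initial stack by repeating the first frame is an assumption of this sketch (zero-padding is another common choice).

```python
from collections import deque

class FrameStacker:
    """Keeps the last `k` frames; a hypothetical stand-in for a frame-stacking wrapper."""

    def __init__(self, k: int = 4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Assumption: pad the stack by repeating the first frame k times.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(new_frame)
        return list(self.frames)

stacker = FrameStacker(k=4)
state = stacker.reset("f0")        # strings stand in for 84x84 frames
next_state = stacker.step("f1")
print(state)
print(next_state)
```

Consecutive states share three of their four frames, which is what lets the network infer motion from a single stacked observation.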

In [5]:
def make_atari_env(env_id, seed):
    """Creates and wraps the Atari environment with standard preprocessing and 4-frame stacking."""
    # 1. Base Environment (Using the required ID [cite: 11])
    env = gym.make(env_id)

    # 2. Atari Preprocessing: Resizes to 84x84 and converts to grayscale.
    # frame_skip is set to 1 here because ALE v5 environments already repeat each action
    # (frameskip=4 by default), so the wrapper must not skip frames a second time.
    env = AtariPreprocessing(env, grayscale_obs=True, terminal_on_life_loss=True, frame_skip=1, screen_size=84)

    # 3. Frame Stacking (Creates the (4, 84, 84) state [cite: 19])
    env = FrameStackObservation(env, stack_size=4)

    # Set seed on the final environment
    if seed is not None:
        env.action_space.seed(seed)
        env.observation_space.seed(seed)

    return env

env = make_atari_env(CONFIG['ENV_ID'], CONFIG['SEED'])
action_size = env.action_space.n
state_shape = env.observation_space.shape

print(f'Final State shape (Stacked Frames): {state_shape}')
print(f'Number of available actions (SpaceInvaders): {action_size}') # Confirms 6 actions [cite: 13, 21]

Final State shape (Stacked Frames): (4, 84, 84)
Number of available actions (SpaceInvaders): 6


## 3. Q-Network Architecture
The network uses a CNN architecture to process the high-dimensional image input. A `dueling` flag switches the head between a single Q-stream and separate Value/Advantage streams.
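
As a quick check on the feature-map sizes, the sketch below applies the standard output-size formula for an unpadded convolution to the three layers used in `QNetwork` (kernel/stride of 8/4, 4/2 and 3/1). The helper `conv_out` is illustrative; the notebook itself obtains the same number by passing a dummy tensor through the layers.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length of one spatial dimension after a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 84
sizes = [size]
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)
    sizes.append(size)

# The last conv layer has 64 output channels.
flat_features = 64 * size * size
print(sizes, flat_features)
```

The spatial size shrinks 84 → 20 → 9 → 7, so the flattened input to the fully connected layers has 64 × 7 × 7 = 3136 features.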

In [6]:
class QNetwork(nn.Module):
    """CNN-based Q-Network Model supporting Standard and Dueling structures."""

    def __init__(self, state_shape, action_size, seed, dueling=False):
        """Initializes the shared CNN layers and splits into Value/Advantage streams if Dueling is enabled.
        """
        super(QNetwork, self).__init__()
        self.seed = torch.manual_seed(seed)
        self.dueling = dueling
        in_channels = state_shape[0] # 4 stacked frames

        # --- Shared CNN Layers (Original DQN architecture) ---
        # Layers extract features from the 4 stacked 84x84 input images.
        self.conv1 = nn.Conv2d(in_channels, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)

        # --- Dynamic Calculation of fc_input_size ---
        # This prevents the network from breaking if the input image size changes.
        # 1. Create a dummy input tensor based on state_shape (e.g., (1, 4, 84, 84))
        dummy_input = torch.zeros(1, *state_shape)

        # 2. Pass the dummy input through the convolutional layers
        x = self._forward_conv(dummy_input)

        # 3. Calculate the flattened feature size (e.g., 7*7*64 = 3136)
        self.fc_input_size = x.view(1, -1).size(1)

        # --- Fully Connected Layers ---
        if self.dueling:
            # Dueling Architecture: Split into Value (V) and Advantage (A) streams
            self.fc_v1 = nn.Linear(self.fc_input_size, 512)
            self.fc_a1 = nn.Linear(self.fc_input_size, 512)

            self.fc_v2 = nn.Linear(512, 1) # Output V(s)
            self.fc_a2 = nn.Linear(512, action_size) # Output A(s, a)
        else:
            # Standard DQN Architecture (Single Q-stream)
            self.fc1 = nn.Linear(self.fc_input_size, 512)
            self.fc2 = nn.Linear(512, action_size)

    def _forward_conv(self, x):
        """Forward pass through convolutional layers only."""
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        return x.view(x.size(0), -1)  # Flatten

    def forward(self, state):
        """Maps state (4, 84, 84) to action values (6)."""
        # CNN forward pass (shared feature extractor, already flattened)
        x = self._forward_conv(state)

        if self.dueling:
            # Dueling combination: Q(s,a) = V(s) + [A(s,a) - mean_a A(s,a)]
            v = F.relu(self.fc_v1(x))
            a = F.relu(self.fc_a1(x))
            v = self.fc_v2(v)
            a = self.fc_a2(a)
            return v + a - a.mean(1).unsqueeze(1)
        else:
            # Standard Q-stream
            x = F.relu(self.fc1(x))
            return self.fc2(x)
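
The dueling combination in `forward` can be checked by hand with plain Python lists. The function below is an illustrative re-implementation of the same formula with made-up numbers. It shows two properties of the mean-subtracted form: adding a constant to every advantage leaves the Q-values unchanged, and the mean of the Q-values equals V(s).

```python
def dueling_q(value: float, advantages: list[float]) -> list[float]:
    """Combine V(s) and A(s, a) as Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]

q = dueling_q(2.0, [1.0, 3.0, 5.0])
q_shifted = dueling_q(2.0, [11.0, 13.0, 15.0])  # same advantages plus a constant 10
print(q)
print(q_shifted)
```

Subtracting the mean is what makes the split identifiable: without it, any constant could be moved between the value and advantage streams without changing Q.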

## 4. Replay Buffer and Agent Implementations
The **Replay Buffer** samples uniformly (Prioritized Experience Replay, PER, is an optional extension [cite: 27]) and is crucial for breaking the correlation between consecutive experience samples. **`AgentBase`** handles the shared functionality; the specialized subclasses implement each variant's Q-learning update rule.

In [7]:
class ReplayBuffer:
    """Memory-efficient replay buffer using uint8 for frames."""

    def __init__(self, action_size, buffer_size, batch_size, seed):
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        random.seed(seed)  # random.seed() returns None, so store the seed value itself
        self.seed = seed

    def add(self, state, action, reward, next_state, done):
        """Store frames as uint8 to save memory (4x compression)."""
        # Convert float32 [0,1] to uint8 [0,255]
        if state.dtype == np.float32 or state.dtype == np.float64:
            state = (state * 255).astype(np.uint8)
        if next_state.dtype == np.float32 or next_state.dtype == np.float64:
            next_state = (next_state * 255).astype(np.uint8)

        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)

    def sample(self):
        """Sample and convert back to float32 [0,1]."""
        experiences = random.sample(self.memory, k=self.batch_size)

        states = torch.from_numpy(np.stack([e.state for e in experiences])).float().to(device) / 255.0
        actions = torch.from_numpy(np.vstack([e.action for e in experiences])).long().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences])).float().to(device)
        next_states = torch.from_numpy(np.stack([e.next_state for e in experiences])).float().to(device) / 255.0
        dones = torch.from_numpy(np.vstack([e.done for e in experiences]).astype(np.uint8)).float().to(device)

        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.memory)
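
The buffer's float-to-uint8 round trip trades a small quantization error for a 4x smaller footprint (1 byte per pixel instead of 4). The sketch below mimics that round trip on a few scalar pixel values in plain Python. `to_uint8` truncates, which matches how `astype(np.uint8)` treats non-negative floats; the helper names are illustrative.

```python
def to_uint8(pixel: float) -> int:
    """Quantize a pixel in [0, 1] to 0..255 by truncation."""
    return int(pixel * 255)

def to_float(stored: int) -> float:
    """Map a stored 0..255 value back to [0, 1]."""
    return stored / 255.0

pixels = [0.0, 0.1, 0.5, 0.999, 1.0]
restored = [to_float(to_uint8(p)) for p in pixels]
max_error = max(abs(p - r) for p, r in zip(pixels, restored))
print(f"max round-trip error: {max_error:.5f} (bound: {1 / 255:.5f})")
```

With truncation the error stays below 1/255 per pixel; rounding before the cast would halve that bound. When the environment already returns uint8 frames, as the recorded training output indicates here, `add()` skips the conversion and no precision is lost.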

In [8]:
class AgentBase:
    """Base class for all DQN agents, handling shared components and target network logic."""

    def __init__(self, state_shape, action_size, seed, mode, dueling):
        self.state_shape = state_shape
        self.action_size = action_size
        self.mode = mode

        # Initialize Q-Networks
        self.qnetwork_local = QNetwork(state_shape, action_size, seed, dueling=dueling).to(device)
        self.qnetwork_target = QNetwork(state_shape, action_size, seed, dueling=dueling).to(device)
        self.qnetwork_target.load_state_dict(self.qnetwork_local.state_dict())

        self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=CONFIG['LR'])
        self.memory = ReplayBuffer(action_size, CONFIG['BUFFER_SIZE'], CONFIG['BATCH_SIZE'], seed)

        self.t_step = 0       # steps modulo UPDATE_EVERY (learning schedule)
        self.total_steps = 0  # all environment steps (target-update schedule)

    def step(self, state, action, reward, next_state, done):
      # Convert array-like observations to numpy if needed
      if hasattr(state, '__array__'):
          state = np.array(state)
      if hasattr(next_state, '__array__'):
          next_state = np.array(next_state)

      # Save experience
      self.memory.add(state, action, reward, next_state, done)

      # Learn every UPDATE_EVERY steps
      self.t_step = (self.t_step + 1) % CONFIG['UPDATE_EVERY']
      if self.t_step == 0:
          if len(self.memory) > CONFIG['BATCH_SIZE']:
              experiences = self.memory.sample()
              self.learn(experiences, CONFIG['GAMMA'])
              # Drop the reference to the sampled batch so its tensors can be freed
              del experiences

      # Hard update the target network every TARGET_UPDATE_FREQ environment steps.
      # t_step wraps around at UPDATE_EVERY, so it can never reach TARGET_UPDATE_FREQ;
      # a separate running counter drives this schedule instead.
      self.total_steps += 1
      if self.total_steps % CONFIG['TARGET_UPDATE_FREQ'] == 0:
          self.qnetwork_target.load_state_dict(self.qnetwork_local.state_dict())

    def act(self, state, eps=0.):
      """Returns action based on epsilon-greedy policy."""
      # Convert LazyFrames to numpy if needed
      if hasattr(state, '__array__'):
          state = np.array(state)

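      # Scale uint8 frames to [0, 1] so acting sees the same input range as ReplayBuffer.sample()
      scale = 255.0 if state.dtype == np.uint8 else 1.0
      state = torch.from_numpy(state).float().unsqueeze(0).to(device) / scale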

      self.qnetwork_local.eval()
      with torch.no_grad():
          action_values = self.qnetwork_local(state)
      self.qnetwork_local.train()

      if random.random() > eps:
          return np.argmax(action_values.cpu().data.numpy())
      else:
          return random.choice(np.arange(self.action_size))

    def learn(self, experiences, gamma):
        # Placeholder, implemented by child classes
        pass


# ----------------------------------------------------------------
# --- 💥 DQN Variant 1: Simple DQN (Original Target Calculation) ---
# ----------------------------------------------------------------
class SimpleDQNAgent(AgentBase):
    """Implements the original DQN learning step: Target Q = R + gamma * max_a Q_target(s', a)."""
    def __init__(self, state_shape, action_size, seed):
        # Initialize with Standard QNetwork (dueling=False)
        super().__init__(state_shape, action_size, seed, mode='SimpleDQN', dueling=False)

    def learn(self, experiences, gamma):
        states, actions, rewards, next_states, dones = experiences

        # Ensure states have the right shape: (batch, 4, 84, 84)
        if states.dim() == 5:  # If shape is (batch, 4, 1, 84, 84)
            states = states.squeeze(2)
        if next_states.dim() == 5:
            next_states = next_states.squeeze(2)

        # Target Q calculation uses the max Q-value from the target network directly.
        with torch.no_grad():  # ← CRITICAL: Prevent gradient tracking for target
            Q_targets_next = self.qnetwork_target(next_states).max(1)[0].unsqueeze(1)
            Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))

        Q_expected = self.qnetwork_local(states).gather(1, actions)

        loss = F.mse_loss(Q_expected, Q_targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # CRITICAL: Clean up to prevent memory leak
        del states, actions, rewards, next_states, dones, Q_targets, Q_expected, loss


# ----------------------------------------------------------------
# --- 💥 DQN Variant 2: Double DQN (Decoupled Target Calculation) ---
# ----------------------------------------------------------------
class DoubleDQNAgent(AgentBase):
    """Implements the Double DQN learning step."""
    def __init__(self, state_shape, action_size, seed):
        super().__init__(state_shape, action_size, seed, mode='DoubleDQN', dueling=False)

    def learn(self, experiences, gamma):
        states, actions, rewards, next_states, dones = experiences

        # Ensure states have the right shape
        if states.dim() == 5:
            states = states.squeeze(2)
        if next_states.dim() == 5:
            next_states = next_states.squeeze(2)

        # Double DQN: Select actions with local, evaluate with target
        with torch.no_grad():  # ← CRITICAL: No gradient tracking
            # 1. Action selection from LOCAL network
            Q_local_next = self.qnetwork_local(next_states)
            best_actions = Q_local_next.max(1)[1].unsqueeze(1)

            # 2. Value estimation from TARGET network
            Q_targets_next = self.qnetwork_target(next_states).gather(1, best_actions)
            Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))

        Q_expected = self.qnetwork_local(states).gather(1, actions)

        loss = F.mse_loss(Q_expected, Q_targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Clean up
        del states, actions, rewards, next_states, dones, Q_targets, Q_expected, loss


# ----------------------------------------------------------------
# --- 💥 DQN Variant 3: Dueling DQN (Dueling Architecture + Double Learning Rule) ---
# ----------------------------------------------------------------
class DuelingDQNAgent(DoubleDQNAgent):
    """Dueling DQN uses the Dueling architecture and the Double DQN learning rule for stability."""
    def __init__(self, state_shape, action_size, seed):
        # Initialize with Dueling QNetwork (dueling=True)
        AgentBase.__init__(self, state_shape, action_size, seed, mode='DuelingDQN', dueling=True)

    # Inherits the Double DQN learn() method for stability


# --- Agent Initialization based on global CONFIG['MODE'] ---
if CONFIG['MODE'] == "SimpleDQN":
    agent = SimpleDQNAgent(state_shape=state_shape, action_size=action_size, seed=CONFIG['SEED'])
elif CONFIG['MODE'] == "DoubleDQN":
    agent = DoubleDQNAgent(state_shape=state_shape, action_size=action_size, seed=CONFIG['SEED'])
elif CONFIG['MODE'] == "DuelingDQN":
    agent = DuelingDQNAgent(state_shape=state_shape, action_size=action_size, seed=CONFIG['SEED'])
else:
    raise ValueError("Invalid MODE specified in CONFIG.")

print(f"Initialized agent: {type(agent).__name__} with learning mode: {agent.mode}")

Initialized agent: SimpleDQNAgent with learning mode: SimpleDQN
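
The difference between the two `learn()` rules is easiest to see on a single transition. The sketch below recomputes both targets in plain Python with made-up Q-values for the next state. The function names are illustrative, and the formulas follow the docstrings and comments of `SimpleDQNAgent` and `DoubleDQNAgent`.

```python
def dqn_target(reward, gamma, q_target_next, done):
    """Simple DQN: bootstrap from the target network's own maximum."""
    return reward + gamma * max(q_target_next) * (1 - done)

def double_dqn_target(reward, gamma, q_local_next, q_target_next, done):
    """Double DQN: the local network picks the action, the target network scores it."""
    best_action = max(range(len(q_local_next)), key=lambda a: q_local_next[a])
    return reward + gamma * q_target_next[best_action] * (1 - done)

q_local_next = [1.0, 2.0, 0.5]    # made-up local-network estimates for s'
q_target_next = [1.5, 1.0, 3.0]   # made-up target-network estimates for s'

simple = dqn_target(1.0, 0.5, q_target_next, done=0)
double = double_dqn_target(1.0, 0.5, q_local_next, q_target_next, done=0)
terminal = double_dqn_target(1.0, 0.5, q_local_next, q_target_next, done=1)
print(simple, double, terminal)
```

Here the simple rule bootstraps from the target network's largest estimate (3.0), while the double rule uses the target network's value for the action the local network prefers (1.0). Decoupling selection from evaluation is how Double DQN reduces overestimation. On terminal transitions both rules reduce to the reward alone.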


## 5. Training and Evaluation Functions

In [9]:
def dqn_train(n_episodes=CONFIG['N_EPISODES'], max_t=10000, eps_start=CONFIG['EPS_START'], eps_end=CONFIG['EPS_END'], eps_decay=CONFIG['EPS_DECAY'], checkpoint_dir=checkpoint_dir):
    """Deep Q-Learning with crash debugging."""

    import traceback

    scores = []
    scores_window = deque(maxlen=100)
    eps = eps_start

    global ALL_SCORES

    print(f"\nStarting training for {agent.mode}...")
    print(f"Checkpoints will be saved to: {checkpoint_dir}")

    try:
        for i_episode in range(1, n_episodes + 1):
            try:
                # Monitor memory at start of episode
                if i_episode % 10 == 0:
                    mem = psutil.virtual_memory()
                    gpu_mem = torch.cuda.memory_allocated() / (1024**3) if torch.cuda.is_available() else 0
                    print(f'\n[Ep {i_episode}] RAM: {mem.percent:.1f}% | GPU: {gpu_mem:.2f}GB | Buffer: {len(agent.memory)}/{CONFIG["BUFFER_SIZE"]}')

                state, info = env.reset(seed=CONFIG['SEED'] if i_episode == 1 else None)

                # Check state validity
                if state is None:
                    print(f"ERROR: Episode {i_episode} - env.reset() returned None!")
                    continue

                state = np.array(state)

                # Verify state shape
                if state.shape != (4, 84, 84):
                    print(f"ERROR: Episode {i_episode} - Invalid state shape: {state.shape}")
                    continue

                score = 0

                for t in range(max_t):
                    action = agent.act(state, eps)

                    next_state_raw, reward, terminated, truncated, info = env.step(action)
                    done = terminated or truncated

                    # Check for NaN or invalid values
                    if np.isnan(reward) or np.isinf(reward):
                        print(f"WARNING: Episode {i_episode}, step {t} - Invalid reward: {reward}")
                        reward = 0.0

                    next_state = np.array(next_state_raw)

                    # Verify next_state
                    if next_state.shape != (4, 84, 84):
                        print(f"ERROR: Episode {i_episode}, step {t} - Invalid next_state shape: {next_state.shape}")
                        break

                    reward_np = np.array([reward]).astype(np.float32)
                    done_np = np.array([done]).astype(np.uint8)

                    agent.step(state, action, reward_np, next_state, done_np)
                    state = next_state
                    score += reward

                    if done:
                        break

                scores_window.append(score)
                scores.append(score)
                eps = max(eps_end, eps_decay * eps)

                avg_score = np.mean(scores_window)

                # Print progress
                if i_episode % 1 == 0:  # Print every episode for debugging
                    print(f'\rEpisode {i_episode}\tScore: {score:.1f}\tAvg: {avg_score:.2f}\tEps: {eps:.3f}\tSteps: {t+1}', end="")

                # Frequent checkpoints while debugging
                if i_episode % CONFIG['CHECKPOINT_FREQ'] == 0:
                    print(f'\n[CHECKPOINT] Episode {i_episode}\tAverage Score: {avg_score:.2f}')

                    checkpoint = {
                        'episode': i_episode,
                        'model_state_dict': agent.qnetwork_local.state_dict(),
                        'target_state_dict': agent.qnetwork_target.state_dict(),
                        'optimizer_state_dict': agent.optimizer.state_dict(),
                        'scores': scores,
                        'eps': eps,
                        'mode': agent.mode
                    }
                    checkpoint_path = os.path.join(checkpoint_dir, f'checkpoint_{agent.mode}_ep{i_episode}.pth')
                    torch.save(checkpoint, checkpoint_path)
                    print("✓ Checkpoint saved")

                    # Aggressive memory cleanup
                    gc.collect()
                    if torch.cuda.is_available():
                        torch.cuda.empty_cache()

                    mem = psutil.virtual_memory()
                    print(f"RAM: {mem.percent}% ({mem.available / (1024**3):.1f}GB free)")

                # One-off memory diagnostic after episode 10:
                if i_episode == 10:
                    print("\n" + "="*60)
                    print("MEMORY LEAK INVESTIGATION")
                    print("="*60)

                    # Check what's in the buffer
                    if len(agent.memory) > 0:
                        sample_exp = list(agent.memory.memory)[0]
                        print(f"State dtype in buffer: {sample_exp.state.dtype}")
                        print(f"State shape: {sample_exp.state.shape}")
                        print(f"State memory: {sample_exp.state.nbytes / 1024:.1f} KB")

                        # Calculate expected buffer size
                        bytes_per_exp = (sample_exp.state.nbytes + sample_exp.next_state.nbytes + 20)
                        expected_mb = (len(agent.memory) * bytes_per_exp) / (1024**2)

                        print(f"\nBuffer has {len(agent.memory)} experiences")
                        print(f"Expected buffer size: {expected_mb:.1f} MB")

                        # Check actual RAM
                        mem = psutil.virtual_memory()
                        print(f"Actual RAM used: {mem.used / (1024**3):.2f} GB ({mem.percent:.1f}%)")

                        print("\n⚠️ If RAM is much higher than expected buffer size,")
                        print("   there's a memory leak outside the buffer!")

                    print("="*60)

                # Check for goal
                if avg_score >= CONFIG['GOAL_SCORE']:
                    print(f'\n🎉 Goal Reached in {i_episode} episodes!')
                    torch.save(agent.qnetwork_local.state_dict(), os.path.join(checkpoint_dir, f'{agent.mode}_solved_{i_episode}.pth'))
                    break

            except Exception as e:
                print(f"\n❌ ERROR in Episode {i_episode}:")
                print(f"Exception: {type(e).__name__}: {e}")
                traceback.print_exc()

                # Save emergency checkpoint
                print("Saving emergency checkpoint...")
                torch.save({
                    'episode': i_episode,
                    'model_state_dict': agent.qnetwork_local.state_dict(),
                    'scores': scores,
                    'eps': eps,
                }, os.path.join(checkpoint_dir, f'emergency_ep{i_episode}.pth'))

                # Try to continue or break
                user_input = input("Continue training? (y/n): ")
                if user_input.lower() != 'y':
                    break

    except KeyboardInterrupt:
        print("\n\n⚠️  Training interrupted by user")

    except Exception as e:
        print(f"\n\n❌ FATAL ERROR:")
        print(f"Exception: {type(e).__name__}: {e}")
        traceback.print_exc()

    finally:
        # Always save what we have
        print("\n\nSaving final results...")
        ALL_SCORES[agent.mode] = scores
        np.save(os.path.join(checkpoint_dir, 'dqn_project_scores.npy'), ALL_SCORES)
        print(f"✓ Saved {len(scores)} episodes")

    return scores

In [10]:
# Initialize an empty dictionary to hold scores from all runs
ALL_SCORES = {}

In [11]:
# Auto-resume from the latest checkpoint, if one exists

checkpoint_files = glob.glob(os.path.join(checkpoint_dir, f'checkpoint_{CONFIG["MODE"]}_ep*.pth'))

if checkpoint_files:
    # Sort to get the latest checkpoint
    checkpoint_files.sort(key=lambda x: int(x.split('_ep')[-1].split('.')[0]))
    latest_checkpoint = checkpoint_files[-1]

    print(f"\n🔄 Found existing checkpoint: {os.path.basename(latest_checkpoint)}")
    print(f"📁 Location: {checkpoint_dir}")

    # Option to resume or start fresh
    resume = input("Resume from checkpoint? (y/n): ").lower() == 'y'

    if resume:
        print(f"Loading checkpoint...")
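        # map_location loads the tensors onto the current device, even if they were saved from another one
        checkpoint = torch.load(latest_checkpoint, map_location=device)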

        agent.qnetwork_local.load_state_dict(checkpoint['model_state_dict'])
        agent.qnetwork_target.load_state_dict(checkpoint['target_state_dict'])
        agent.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

        # Load previous scores
        scores_path = os.path.join(checkpoint_dir, 'dqn_project_scores.npy')
        if os.path.exists(scores_path):
            ALL_SCORES = np.load(scores_path, allow_pickle=True).item()

        start_episode = checkpoint['episode']
        start_eps = checkpoint['eps']

        print(f"✓ Resumed from episode {start_episode}")
        print(f"✓ Previous scores loaded: {len(checkpoint['scores'])} episodes")
        print(f"✓ Epsilon: {start_eps:.4f}")
        if len(checkpoint['scores']) >= 100:
            print(f"✓ Last 100 episodes avg: {np.mean(checkpoint['scores'][-100:]):.2f}")

        # Update CONFIG to continue from where we left off. dqn_train() bound its default
        # eps_start when it was defined, so pass CONFIG['EPS_START'] explicitly when calling it.
        CONFIG['EPS_START'] = start_eps
    else:
        print("Starting fresh training...")
else:
    print("No checkpoint found. Starting fresh training...")


🔄 Found existing checkpoint: checkpoint_SimpleDQN_ep25.pth
📁 Location: /content/drive/MyDrive/DQN_Checkpoints
Resume from checkpoint? (y/n): n
Starting fresh training...


In [12]:
# ============================================================
# PRE-TRAINING VERIFICATION
# ============================================================

print("=" * 60)
print("SYSTEM CHECK BEFORE TRAINING")
print("=" * 60)

# 1. Check RAM
mem = psutil.virtual_memory()
print(f"\n📊 RAM Status:")
print(f"   Total: {mem.total / (1024**3):.2f} GB")
print(f"   Available: {mem.available / (1024**3):.2f} GB")
print(f"   Used: {mem.used / (1024**3):.2f} GB ({mem.percent}%)")

if mem.available / (1024**3) < 2.0:
    print("   ⚠️  WARNING: Less than 2GB RAM available!")
else:
    print("   ✓ RAM looks good")

# 2. Check GPU
if torch.cuda.is_available():
    print(f"\n🎮 GPU Status:")
    print(f"   Device: {torch.cuda.get_device_name(0)}")
    print(f"   Total Memory: {torch.cuda.get_device_properties(0).total_memory / (1024**3):.2f} GB")
    print(f"   Allocated: {torch.cuda.memory_allocated() / (1024**3):.2f} GB")
    print(f"   Cached: {torch.cuda.memory_reserved() / (1024**3):.2f} GB")
else:
    print("\n⚠️  No GPU detected - training will be VERY slow!")

# 3. Verify networks are on GPU
print(f"\n🧠 Network Status:")
print(f"   Local network device: {next(agent.qnetwork_local.parameters()).device}")
print(f"   Target network device: {next(agent.qnetwork_target.parameters()).device}")

# 4. Check buffer configuration
print(f"\n💾 Replay Buffer:")
print(f"   Max size: {CONFIG['BUFFER_SIZE']:,}")
print(f"   Current size: {len(agent.memory)}")
print(f"   Batch size: {CONFIG['BATCH_SIZE']}")

# Estimate memory usage
frames_per_exp = 2 * 4 * 84 * 84  # state + next_state, 4 frames each
if hasattr(agent.memory.memory, 'maxlen') and len(agent.memory) > 0:
    # Check if using uint8 (1 byte) or float32 (4 bytes)
    sample_exp = list(agent.memory.memory)[0]
    dtype_size = 1 if sample_exp.state.dtype == np.uint8 else 4
    estimated_mb = (CONFIG['BUFFER_SIZE'] * frames_per_exp * dtype_size) / (1024**2)
    print(f"   Data type: {sample_exp.state.dtype}")
    print(f"   Estimated buffer memory: {estimated_mb:.0f} MB ({estimated_mb/1024:.2f} GB)")

    if dtype_size == 4:
        print("   ⚠️  WARNING: Using float32! Switch to uint8 to save 75% memory")
    else:
        print("   ✓ Using uint8 compression")
else:
    # Empty buffer estimation
    estimated_mb_uint8 = (CONFIG['BUFFER_SIZE'] * frames_per_exp * 1) / (1024**2)
    estimated_mb_float32 = (CONFIG['BUFFER_SIZE'] * frames_per_exp * 4) / (1024**2)
    print(f"   Estimated memory (uint8): {estimated_mb_uint8:.0f} MB ({estimated_mb_uint8/1024:.2f} GB)")
    print(f"   Estimated memory (float32): {estimated_mb_float32:.0f} MB ({estimated_mb_float32/1024:.2f} GB)")

# 5. Config summary
print(f"\n⚙️  Training Config:")
print(f"   Mode: {CONFIG['MODE']}")
print(f"   Episodes: {CONFIG['N_EPISODES']}")
print(f"   Buffer size: {CONFIG['BUFFER_SIZE']:,}")
print(f"   Epsilon decay: {CONFIG['EPS_DECAY']}")
print(f"   Checkpoint dir: {checkpoint_dir}")

# 6. Final recommendation
print("\n" + "=" * 60)
total_estimated_gb = estimated_mb_uint8 / 1024 if 'estimated_mb_uint8' in locals() else 1.5
ram_available = mem.available / (1024**3)

if ram_available > total_estimated_gb + 2:  # +2GB safety margin
    print("✅ READY TO TRAIN - Sufficient resources available")
else:
    print("⚠️  RISK OF CRASH - Consider:")
    print(f"   - Reduce BUFFER_SIZE to {int(CONFIG['BUFFER_SIZE'] * 0.5):,}")
    print("   - Make sure uint8 ReplayBuffer is being used")
    print("   - Close other applications")

print("=" * 60)

# Force garbage collection before starting
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

SYSTEM CHECK BEFORE TRAINING

📊 RAM Status:
   Total: 12.67 GB
   Available: 10.87 GB
   Used: 1.48 GB (14.2%)
   ✓ RAM looks good

🎮 GPU Status:
   Device: Tesla T4
   Total Memory: 14.74 GB
   Allocated: 0.01 GB
   Cached: 0.02 GB

🧠 Network Status:
   Local network device: cuda:0
   Target network device: cuda:0

💾 Replay Buffer:
   Max size: 5,000
   Current size: 0
   Batch size: 32
   Estimated memory (uint8): 269 MB (0.26 GB)
   Estimated memory (float32): 1077 MB (1.05 GB)

⚙️  Training Config:
   Mode: SimpleDQN
   Episodes: 4000
   Buffer size: 5,000
   Epsilon decay: 0.999
   Checkpoint dir: /content/drive/MyDrive/DQN_Checkpoints

✅ READY TO TRAIN - Sufficient resources available


In [13]:
# Sanity check: confirm the ReplayBuffer stores frames as uint8
print("\n🔍 Testing ReplayBuffer compression:")

# Create a test state
test_state = np.random.rand(4, 84, 84).astype(np.float32)
test_action = 0
test_reward = 1.0
test_next_state = np.random.rand(4, 84, 84).astype(np.float32)
test_done = False

# Add to buffer
agent.memory.add(test_state, test_action, test_reward, test_next_state, test_done)

# Check what was stored
if len(agent.memory) > 0:
    stored_exp = list(agent.memory.memory)[0]
    print(f"Stored state dtype: {stored_exp.state.dtype}")
    print(f"Stored state shape: {stored_exp.state.shape}")

    if stored_exp.state.dtype == np.uint8:
        print("✅ ReplayBuffer IS using uint8 compression")
    else:
        print("❌ ReplayBuffer NOT using uint8! This will cause crashes!")
        print("You need to use the uint8 ReplayBuffer implementation!")


🔍 Testing ReplayBuffer compression:
Stored state dtype: uint8
Stored state shape: (4, 84, 84)
✅ ReplayBuffer IS using uint8 compression
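
One common way to get this storage behaviour is to scale float frames in [0, 1] to the range 0 to 255 before storing them, and to scale back when sampling. The notebook's actual `ReplayBuffer` is defined earlier and may work differently; the function names below are hypothetical and serve only as a minimal sketch of the idea.

```python
import numpy as np

def compress_state(state):
    """Scale a float state in [0, 1] to uint8 (0..255) for storage."""
    return np.clip(np.rint(state * 255.0), 0, 255).astype(np.uint8)

def decompress_state(stored):
    """Restore a float32 state in [0, 1] from its uint8 form."""
    return stored.astype(np.float32) / 255.0

state = np.random.rand(4, 84, 84).astype(np.float32)
stored = compress_state(state)
restored = decompress_state(stored)

print(stored.dtype, stored.nbytes)      # uint8 storage, 4 * 84 * 84 bytes
print(state.nbytes // stored.nbytes)    # how many times larger float32 is
print(np.abs(restored - state).max())   # quantization error introduced by rounding
```

The price of the smaller footprint is a small rounding error per pixel, bounded by about half a quantization step (0.5 / 255).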


In [14]:
# Run training
scores = dqn_train()


Starting training for SimpleDQN...
Checkpoints will be saved to: /content/drive/MyDrive/DQN_Checkpoints
Episode 9	Score: 5.0	Avg: 21.67	Eps: 0.991	Steps: 139
[Ep 10] RAM: 22.1% | GPU: 0.05GB | Buffer: 1182/5000
Episode 10	Score: 125.0	Avg: 32.00	Eps: 0.990	Steps: 370
MEMORY LEAK INVESTIGATION
State dtype in buffer: uint8
State shape: (4, 84, 84)
State memory: 27.6 KB

Buffer has 1552 experiences
Expected buffer size: 83.6 MB
Actual RAM used: 2.77 GB (24.4%)

⚠️ If RAM is much higher than expected buffer size,
   there's a memory leak outside the buffer!
Episode 19	Score: 25.0	Avg: 56.05	Eps: 0.981	Steps: 121
[Ep 20] RAM: 38.4% | GPU: 0.05GB | Buffer: 3736/5000
Episode 25	Score: 0.0	Avg: 60.80	Eps: 0.975	Steps: 68
[CHECKPOINT] Episode 25	Average Score: 60.80
✓ Checkpoint saved
RAM: 45.5% (6.9GB free)
Episode 29	Score: 160.0	Avg: 61.38	Eps: 0.971	Steps: 466
[Ep 30] RAM: 46.7% | GPU: 0.05GB | Buffer: 5000/5000
Episode 39	Score: 0.0	Avg: 66.92	Eps: 0.962	Steps: 118
[Ep 40] RAM: 47.2
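
The `Eps` column in this log is consistent with a per-episode multiplicative decay, `eps = EPS_DECAY ** episode` with `EPS_DECAY = 0.999` (for example, `0.999 ** 10` is about 0.990). The sketch below shows what that schedule implies over the 4000 configured episodes. It assumes a starting epsilon of 1.0 and ignores any lower bound the agent may apply; `EPS_START` and `epsilon_at` are illustrative names.

```python
import math

EPS_START = 1.0    # assumed starting value
EPS_DECAY = 0.999  # from the config summary
N_EPISODES = 4000  # from the config summary

def epsilon_at(episode):
    """Epsilon after `episode` multiplicative decay steps."""
    return EPS_START * EPS_DECAY ** episode

for ep in (10, 25, 39, N_EPISODES):
    print(f"episode {ep:>4}: eps = {epsilon_at(ep):.3f}")

# Number of episodes needed before epsilon drops to a given target
target = 0.1
episodes_needed = math.ceil(math.log(target / EPS_START) / math.log(EPS_DECAY))
print(f"eps <= {target} after {episodes_needed} episodes")
```

Under these assumptions the agent still explores with probability around 0.018 at episode 4000, and it takes about 2300 episodes for epsilon to fall below 0.1, which is worth keeping in mind when reading early average scores.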

In [None]:
def plot_all_dqn_scores(all_scores_dict, window=100):
    """
    Loads scores for all DQN variants and plots their moving average on a single graph.

    Args:
        all_scores_dict (dict): Dictionary mapping mode names ('SimpleDQN', etc.) to lists of episode scores.
        window (int): The window size for the moving average.
    """
    if not all_scores_dict:
        print("No scores available to plot. Please run training for at least one agent.")
        return

    plt.figure(figsize=(12, 6))

    for mode, scores in all_scores_dict.items():
        if len(scores) >= window:
            # Calculate the moving average over `window` episodes
            moving_avg = np.convolve(scores, np.ones(window)/window, mode='valid')

            # The x-axis should start at the window size, as the moving average starts there
            x_axis = np.arange(len(moving_avg)) + window

            plt.plot(x_axis, moving_avg, label=f'{mode} (Avg={moving_avg[-1]:.2f})')
        else:
            print(f"Not enough data to calculate moving average for {mode}.")

    # Add score targets as horizontal lines, similar to the presentation graph
    plt.axhline(y=400, color='r', linestyle='--', linewidth=1, label='Goal: 400')
    plt.axhline(y=500, color='g', linestyle='--', linewidth=1, label='Goal: 500')

    plt.title(f'Consolidated DQN Training Progress ({window}-Episode Moving Average)')
    plt.ylabel(f'Average Score ({window}-Game Window)')
    plt.xlabel('Episode #')
    plt.grid(True)
    plt.legend()
    plt.savefig('all_dqn_scores.png')
    plt.show()
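
To see what the `np.convolve(..., mode='valid')` call and the x-axis offset in `plot_all_dqn_scores` produce, here is a small worked example with made-up scores and a window of 3:

```python
import numpy as np

scores = [10, 20, 30, 40, 50]  # made-up scores for illustration
window = 3

# 'valid' keeps only positions where the window fully overlaps the data,
# so the result has len(scores) - window + 1 entries.
moving_avg = np.convolve(scores, np.ones(window) / window, mode='valid')

# The first average covers episodes 1..window, so it is plotted at x = window.
x_axis = np.arange(len(moving_avg)) + window

print(moving_avg)  # averages of (10,20,30), (20,30,40), (30,40,50)
print(x_axis)      # episodes 3, 4, 5
```

This is also why a run with fewer than `window` episodes produces no curve: there is no position where the window fits entirely inside the data.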

In [None]:
# Run this cell AFTER you have completed at least one training run
try:
    # Load saved scores (if they exist from previous runs)
    loaded_scores = np.load('dqn_project_scores.npy', allow_pickle=True).item()
    plot_all_dqn_scores(loaded_scores)
except FileNotFoundError:
    print("Scores file not found. Run training first.")
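
The loading cell expects `dqn_project_scores.npy` to contain a pickled dictionary that maps mode names to score lists. The code that writes this file is not shown in this section. A minimal round trip, assuming the dictionary is saved with `np.save`, could look like the following (it writes to a temporary directory so it does not touch real results, and the score values are placeholders):

```python
import os
import tempfile
import numpy as np

scores_by_mode = {'SimpleDQN': [5.0, 125.0, 25.0]}  # placeholder values only

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'dqn_project_scores.npy')
    # A dict is wrapped in a 0-d object array and pickled by np.save
    np.save(path, scores_by_mode)
    # allow_pickle=True is required to read it back; .item() unwraps the dict
    loaded = np.load(path, allow_pickle=True).item()

print(loaded['SimpleDQN'])
```

Because `allow_pickle=True` can execute arbitrary code during loading, it should only be used on score files you created yourself.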

## 6. Video Visualization Utility
These functions are provided for optional video recording using a trained model's weights.

In [None]:
def render_mp4(videopath: str) -> str:
    """Returns an HTML <video> tag with the MP4 at `videopath` embedded as base64."""
    import os
    from base64 import b64encode

    if not os.path.exists(videopath):
        return f"<p>Video file not found at {videopath}. Run a test episode first.</p>"

    # Read the file in binary mode; the context manager closes it afterwards
    with open(videopath, 'rb') as f:
        mp4 = f.read()
    base64_encoded_mp4 = b64encode(mp4).decode()
    return f'<video width=400 controls><source src="data:video/mp4;base64,{base64_encoded_mp4}" type="video/mp4"></video>'

def run_and_record(env_id, weights_path, mode, seed=CONFIG['SEED'], num_episodes=1):
    """Runs a specified agent on the environment and records the interaction."""

    # 1. Setup Environment
    # Use the same wrapper stack as training for consistent state representation
    env_render = make_atari_env(env_id, seed=seed)

    # 2. Setup Agent
    action_size = env_render.action_space.n
    state_shape = env_render.observation_space.shape

    # Use the appropriate Agent class
    if mode == "SimpleDQN":
        test_agent = SimpleDQNAgent(state_shape, action_size, seed)
    elif mode == "DoubleDQN":
        test_agent = DoubleDQNAgent(state_shape, action_size, seed)
    elif mode == "DuelingDQN":
        test_agent = DuelingDQNAgent(state_shape, action_size, seed)
    else:
        return f"<p>Invalid MODE specified for testing: {mode}</p>"

    # 3. Load Weights
    try:
        test_agent.qnetwork_local.load_state_dict(torch.load(weights_path, map_location=device))
        test_agent.qnetwork_local.eval()
        print(f"Successfully loaded {mode} weights from {weights_path}")
    except FileNotFoundError:
        print(f"Checkpoint file {weights_path} not found. Skipping recording.")
        return

    # 4. Record Episodes
    video_path = f'{mode}_{env_id.split("/")[-1]}_test.mp4'
    frames = []

    for episode in range(num_episodes):
        state, info = env_render.reset(seed=seed)
        score = 0
        done = False

        while not done:
            # Ensure the stacked observation is a NumPy array (also covers lazy-frame objects from older wrappers)
            state_np = np.array(state)
            action = test_agent.act(state_np, eps=0.0)

            # Capture the current frame. render() only returns an RGB array if
            # make_atari_env created the env with render_mode='rgb_array'.
            frames.append(env_render.render())

            next_state, reward, terminated, truncated, info = env_render.step(action)
            done = terminated or truncated
            state = next_state
            score += reward

        print(f"Test Episode {episode+1} score: {score:.2f}")

    env_render.close()

    # Save video
    imageio.mimsave(video_path, frames, fps=30)

    # Display video
    html = render_mp4(video_path)
    ipythondisplay.display(ipythondisplay.HTML(html))


In [None]:
# Example usage (Uncomment and update weights_path after training):
# run_and_record(CONFIG['ENV_ID'], 'SimpleDQN_5000.pth', 'SimpleDQN', num_episodes=1)