# 🧠 Deep Reinforcement Learning — Doom Agent (SS2025)

Welcome to the last assignment for the **Deep Reinforcement Learning** course (SS2025). In this notebook, you'll implement and train a reinforcement learning agent to play **Doom**.

You will:
- Set up a custom VizDoom environment with shaped rewards
- Train an agent using an approach of your choice
- Track reward components across episodes
- Evaluate the best model
- Visualize performance with replays and GIFs
- Export the trained agent to ONNX to submit to the evaluation server

In [1]:
# Install the dependencies
!python -m pip install --upgrade pip
!pip install --upgrade notebook ipywidgets ipykernel -q
!pip install torch numpy matplotlib vizdoom portpicker gym onnx wandb "stable-baselines3[extra]" Shimmy einops torchvision -q


In [2]:
import os
import subprocess

base_dir = os.path.abspath(os.getcwd())
dir_path = os.path.join(base_dir, "jku.wad")

if os.path.isdir(dir_path):
    os.chdir(dir_path)
    subprocess.run(["git", "pull", "origin", "main"])
else:
    subprocess.run(["git", "clone", "https://github.com/syseitz/jku.wad.git", dir_path])
    os.chdir(dir_path)

Already up to date.


From https://github.com/syseitz/jku.wad
 * branch            main       -> FETCH_HEAD


## Environment configuration

ViZDoom supports multiple visual buffers that can be used as input for training agents. Each buffer provides different information about the game environment; the animation below shows them from left to right:


Screen
- The default first-person RGB view seen by the agent.

Labels
- A semantic map where each pixel is tagged with an object ID (e.g., enemy, item, wall).

Depth
- A grayscale map showing the distance from the agent to surfaces in the scene.

Automap
- A top-down schematic view of the map, useful for global navigation tasks.

![buffers gif](https://vizdoom.farama.org/_images/vizdoom-demo.gif)
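
To make the buffer layout concrete, here is a minimal sketch of how a stacked observation could be split back into its individual buffers. It assumes a channels-last array with five channels ordered as RGB screen (3), depth (1), labels (1), which is the same assumption `CustomCNN` makes further down. The 128×128 resolution and the helper name `split_buffers` are hypothetical; check `env.observation_space.shape` for the real layout.

```python
import numpy as np

# Hypothetical observation: 128x128 pixels, 5 channels, channels-last.
# Assumed channel order: RGB screen (3), depth (1), labels (1).
obs = np.zeros((128, 128, 5), dtype=np.float32)


def split_buffers(obs: np.ndarray) -> dict:
    """Split a stacked (H, W, 5) observation into per-buffer views."""
    if obs.ndim != 3 or obs.shape[-1] != 5:
        raise ValueError(f"expected shape (H, W, 5), got {obs.shape}")
    return {
        "screen": obs[..., 0:3],  # first-person RGB view
        "depth": obs[..., 3:4],   # distance to surfaces
        "labels": obs[..., 4:5],  # per-pixel object IDs
    }


buffers = split_buffers(obs)
shapes = {name: buf.shape for name, buf in buffers.items()}
print(shapes)
```

Slicing with `3:4` rather than `3` keeps the channel axis, so each buffer can be fed to its own convolutional branch without reshaping.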

In [3]:
import wandb
from typing import Dict, Sequence, Tuple

import torch
from collections import deque, OrderedDict
from copy import deepcopy
import random
import numpy as np
import torch.nn.functional as F
import torch.optim as optim
import pandas as pd
from matplotlib import pyplot as plt
from PIL import Image

from gym import Env
import gymnasium as gym
from torch import nn
from einops import rearrange

from doom_arena import VizdoomMPEnv
from doom_arena.reward import VizDoomReward
from doom_arena.render import render_episode
from IPython.display import HTML

from vizdoom import ScreenFormat
from stable_baselines3 import DQN
from stable_baselines3.common.callbacks import BaseCallback
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

In [4]:
USE_GRAYSCALE = False  # ← set to True for grayscale; False keeps RGB

PLAYER_CONFIG = {
    "n_stack_frames": 1,
    "extra_state": ["depth", "labels"],
    "hud": "none",
    "crosshair": True,
    "screen_format": ScreenFormat.GRAY8 if USE_GRAYSCALE else ScreenFormat.CRCGCB,
}

## Reward function
In this task, you will define a reward function to guide the agent's learning. The function is called at every step and receives the current and previous game variables (e.g., number of frags, hits taken, health).

Your goal is to combine these into a meaningful reward, encouraging desirable behavior, such as:

- Rewarding frags (enemy kills)

- Rewarding accuracy (hitting enemies)

- Penalizing damage taken

- (Optional) Encouraging survival, ammo efficiency, etc.

You can return multiple reward components, which are summed during training. Consider the class below as a great starting point!
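
As a rough illustration of the idea, the sketch below computes three reward components from the change in game variables between two steps and sums them. The helper names `delta` and `example_components` are hypothetical, the weights are only illustrative, and the variable dictionaries are made-up sample values; only the keys `FRAGCOUNT`, `HITCOUNT` and `HITS_TAKEN` come from this notebook.

```python
from typing import Dict, Tuple


def delta(new: Dict[str, float], old: Dict[str, float], key: str) -> float:
    """Change of one game variable between the previous and current step."""
    return new[key] - old[key]


def example_components(
    game_var: Dict[str, float], game_var_old: Dict[str, float]
) -> Tuple[float, float, float]:
    """Return (frag, hit, hit_taken) reward components for a single step."""
    rwd_frag = 100.0 * delta(game_var, game_var_old, "FRAGCOUNT")
    rwd_hit = 5.0 * delta(game_var, game_var_old, "HITCOUNT")
    rwd_hit_taken = -1.0 * delta(game_var, game_var_old, "HITS_TAKEN")
    return rwd_frag, rwd_hit, rwd_hit_taken


# Made-up sample step: one new frag, one new hit, two hits taken.
old_vars = {"FRAGCOUNT": 0.0, "HITCOUNT": 2.0, "HITS_TAKEN": 1.0}
new_vars = {"FRAGCOUNT": 1.0, "HITCOUNT": 3.0, "HITS_TAKEN": 3.0}

components = example_components(new_vars, old_vars)
total = sum(components)  # the components are summed into one scalar reward
print(components, total)
```

Returning the components separately, rather than a single number, makes it possible to track later which term dominates the learning signal.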

In [5]:
# TODO: environment training parameters
N_STACK_FRAMES = 1
NUM_BOTS = 4
EPISODE_TIMEOUT = 1000
# TODO: model hyperparams
GAMMA = 0.95
EPISODES = 1000 
BATCH_SIZE = 256
REPLAY_BUFFER_SIZE = 20000
LEARNING_RATE = 1e-4
EPSILON_START = 1.0
EPSILON_END = 0.1
EPSILON_DECAY = 0.995
FEATURES_DIM = 1024
#N_EPOCHS = 50 # Not used with stable-baselines3
TOTAL_TIMESTEPS = 1000000
EXPLORATION_FRACTION = 0.05


In [6]:
class YourReward(VizDoomReward):
    def __init__(self, num_players: int):
        super().__init__(num_players)

    def __call__(
        self,
        vizdoom_reward: float,
        game_var: Dict[str, float],
        game_var_old: Dict[str, float],
        player_id: int,
    ) -> Tuple[float, ...]:
        """
        Custom reward function for training and evaluation. Returns one value per component:
        * +5.0   per hit landed (rwd_hit)
        * -1.0   per hit taken (rwd_hit_taken)
        * +100.0 per new frag (rwd_frag)
        * -0.5   per missed shot, i.e. ammo spent without a registered hit (rwd_missed)
        * +0.01  per step survived (rwd_survival)
        * -10.0  on every step where the player is dead (rwd_dead)
        * -0.05  per step in which ammo was spent but no damage was dealt (rwd_spam_penalty)
        * +0.1   per health point gained (rwd_health)
        """
        self._step += 1
        _ = vizdoom_reward, player_id

        # Rewards
        rwd_hit = 5.0 * (game_var["HITCOUNT"] - game_var_old["HITCOUNT"])
        rwd_hit_taken = -1.0 * (game_var["HITS_TAKEN"] - game_var_old["HITS_TAKEN"])
        rwd_frag = 100.0 * (game_var["FRAGCOUNT"] - game_var_old["FRAGCOUNT"])

        # Shots and missed shots
        ammo_delta = game_var_old["SELECTED_WEAPON_AMMO"] - game_var["SELECTED_WEAPON_AMMO"]
        if ammo_delta > 0:
            shots_fired = ammo_delta
            hits = game_var["HITCOUNT"] - game_var_old["HITCOUNT"]
            missed_shots = max(0, shots_fired - hits)
            rwd_missed = -0.5 * missed_shots
        else:
            rwd_missed = 0

        # Survival and death
        rwd_survival = 0.01
        rwd_dead = -10.0 if game_var["DEAD"] == 1 else 0.0

        # Penalty for spamming (shooting without damage)
        damage_done = game_var["DAMAGECOUNT"] - game_var_old["DAMAGECOUNT"]
        rwd_spam_penalty = -0.05 if ammo_delta > 0 and damage_done <= 0 else 0.0

        # Health reward
        health_delta = game_var["HEALTH"] - game_var_old["HEALTH"]
        rwd_health = 0.1 * health_delta if health_delta > 0 else 0.0

        return (rwd_hit, rwd_hit_taken, rwd_frag, rwd_missed, rwd_survival, rwd_dead, rwd_spam_penalty, rwd_health)

In [7]:
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present
DTYPE = torch.float32

reward_fn = YourReward(num_players=1)

env = VizdoomMPEnv(
    num_players=1,
    num_bots=NUM_BOTS,
    bot_skill=0,
    doom_map="ROOM",  # NOTE simple, small map; other options: TRNM, TRNMBIG
    extra_state=PLAYER_CONFIG["extra_state"], # see info about states at the beginning of 'Environment configuration' above
    episode_timeout=EPISODE_TIMEOUT,
    n_stack_frames=PLAYER_CONFIG["n_stack_frames"],
    crosshair=PLAYER_CONFIG["crosshair"],
    hud=PLAYER_CONFIG["hud"],
    reward_fn=reward_fn,
)

Host 39625
Player 39625


## Agent

Implement **your own agent** in the code cell that follows.

* In `agents/dqn.py` and `agents/ppo.py` you’ll find very small **skeletons**—they compile but are meant only as reference or quick tests.  
  Feel free to open them, borrow ideas, extend them, or ignore them entirely.
* The notebook does **not** import those files automatically; whatever class you define in the next cell is the one that will be trained.
* You may keep the DQN interface, switch to PPO, or try something else.
* Tweak any hyper-parameters (`PLAYER_CONFIG`, ε-schedule, optimiser, etc.) and document what you tried.
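
When the ε-schedule is left to Stable-Baselines3's `DQN`, the exploration rate is annealed linearly from `exploration_initial_eps` to `exploration_final_eps` over the first `exploration_fraction` of training and then held constant. The sketch below re-implements that schedule in plain Python so the effect of the hyper-parameters can be checked before a long run. The function name `linear_epsilon` is hypothetical, and this is an illustration of the schedule, not the library's own code.

```python
def linear_epsilon(
    step: int,
    total_timesteps: int,
    exploration_fraction: float,
    eps_start: float,
    eps_end: float,
) -> float:
    """Linearly anneal epsilon, then hold it at eps_end."""
    decay_steps = exploration_fraction * total_timesteps
    if decay_steps <= 0 or step >= decay_steps:
        return eps_end
    return eps_start + (eps_end - eps_start) * (step / decay_steps)


# Values taken from the hyper-parameter cell of this notebook.
TOTAL_TIMESTEPS = 1_000_000
EXPLORATION_FRACTION = 0.05
EPSILON_START = 1.0
EPSILON_END = 0.1

for step in (0, 4_000, 25_000, 50_000, 500_000):
    eps = linear_epsilon(
        step, TOTAL_TIMESTEPS, EXPLORATION_FRACTION, EPSILON_START, EPSILON_END
    )
    print(f"step {step:>7}: epsilon = {eps:.3f}")
```

With these settings ε reaches its floor of 0.1 after 50,000 steps, so 95% of training runs at the final exploration rate. At step 4,000 the sketch gives ε ≈ 0.928, which agrees with the `exploration_rate` reported in the training log further down. `EPSILON_DECAY` is not passed to `DQN` in this notebook, so it does not influence this schedule.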


In [8]:
class CustomCNN(BaseFeaturesExtractor):
    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 256):
        super(CustomCNN, self).__init__(observation_space, features_dim)
        assert isinstance(observation_space, gym.spaces.Box), "Observation space must be a Box"

        # Assumption: channels are ordered as screen (3), depth (1), labels (1)
        c = observation_space.shape[2]
        screen_ch = 3
        depth_ch = 1
        labels_ch = 1
        assert screen_ch + depth_ch + labels_ch == c, "Channel mismatch"

        # Define a CNN for each part
        self.cnn_screen = nn.Sequential(
            nn.Conv2d(screen_ch, 16, kernel_size=8, stride=4, padding=0),  
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=0),         
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=0),   
            nn.ReLU(),
            nn.Flatten(),
        )

        self.cnn_depth = nn.Sequential(
            nn.Conv2d(depth_ch, 16, kernel_size=4, stride=2, padding=0),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=0),
            nn.ReLU(),
            nn.Flatten(),
        )

        self.cnn_labels = nn.Sequential(
            nn.Conv2d(labels_ch, 16, kernel_size=4, stride=2, padding=0),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=0),
            nn.ReLU(),
            nn.Flatten(),
        )

        # Compute the flattened feature sizes
        with torch.no_grad():
            sample = torch.as_tensor(observation_space.sample()[None]).float()
            sample = rearrange(sample, 'n h w c -> n c h w')
            n_flatten_screen = self.cnn_screen(sample[:, :screen_ch, :, :]).shape[1]
            n_flatten_depth = self.cnn_depth(sample[:, screen_ch:screen_ch+depth_ch, :, :]).shape[1]
            n_flatten_labels = self.cnn_labels(sample[:, screen_ch+depth_ch:, :, :]).shape[1]

        total_flatten = n_flatten_screen + n_flatten_depth + n_flatten_labels
        self.linear = nn.Sequential(nn.Linear(total_flatten, features_dim), nn.ReLU())

    def forward(self, observations: torch.Tensor) -> torch.Tensor:
        # observations shape: (N, H, W, C)
        observations = rearrange(observations, 'n h w c -> n c h w')
        screen = observations[:, :3, :, :]
        depth = observations[:, 3:4, :, :]
        labels = observations[:, 4:5, :, :]
        features_screen = self.cnn_screen(screen)
        features_depth = self.cnn_depth(depth)
        features_labels = self.cnn_labels(labels)
        combined = torch.cat((features_screen, features_depth, features_labels), dim=1)
        return self.linear(combined)

In [9]:
# ================================================================
# Initialise your networks and training utilities
# ================================================================

# main Q-network
in_channels = env.observation_space.shape[0]   # 1 if grayscale, else 3/4
#model = DQN(
#    input_dim    = in_channels,
#    action_space = env.action_space.n,
#    hidden       = 64,   # change or ignore
#).to(device, dtype=DTYPE)

policy_kwargs = dict(
    features_extractor_class=CustomCNN,
    features_extractor_kwargs=dict(features_dim=FEATURES_DIM),
)

model = DQN(
    "CnnPolicy",
    env,
    learning_rate=LEARNING_RATE,
    buffer_size=REPLAY_BUFFER_SIZE,
    batch_size=BATCH_SIZE,
    gamma=GAMMA,
    exploration_fraction=EXPLORATION_FRACTION,
    exploration_initial_eps=EPSILON_START,
    exploration_final_eps=EPSILON_END,
    verbose=1,
    policy_kwargs=policy_kwargs,
)

# TODO ------------------------------------------------------------
# 1. Create a target network (hard-copy or EMA)
# 2. Choose an optimiser + learning-rate schedule
# 3. Instantiate a replay buffer and set the initial epsilon value
#
# Hints:
#   model_tgt  = deepcopy(model).to(device)
#   optimiser  = torch.optim.Adam(...)
#   scheduler  = torch.optim.lr_scheduler.ExponentialLR(...)
#   replay_buf = collections.deque(maxlen=...)
# ---------------------------------------------------------------

#model_tgt = deepcopy(model).to(device, dtype=DTYPE)
#optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
#scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)
#replay_buffer = deque(maxlen=REPLAY_BUFFER_SIZE)
#epsilon = EPSILON_START


Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.




In [10]:
class EpisodeCallback(BaseCallback):
    def __init__(self, verbose=0):
        super(EpisodeCallback, self).__init__(verbose)
        self.episode_num = 0
        self.episode_reward = 0

    def _on_step(self) -> bool:
        self.episode_reward += self.locals['rewards'][0]
        if self.locals['dones'][0]:
            self.episode_num += 1
            wandb.log({
                "episode": self.episode_num,
                "return": self.episode_reward,
            })
            self.episode_reward = 0
        return True

## Training loop

In [None]:
# ---------------------  TRAINING LOOP  ----------------------
# Feel free to change EVERYTHING below:
#   • choose your own reward function
#   • track different episode statistics in `ep_metrics`
#   • switch optimiser, scheduler, update rules, etc.
run = wandb.init(project="doom-rl", entity="soerenseitz-university-of-vienna", config={
    "gamma": GAMMA,
    "episodes": EPISODES,
    "batch_size": BATCH_SIZE,
    "replay_buffer_size": REPLAY_BUFFER_SIZE,
    "learning_rate": LEARNING_RATE,
    "epsilon_start": EPSILON_START,
    "epsilon_end": EPSILON_END,
    "epsilon_decay": EPSILON_DECAY,
    "num_bots": NUM_BOTS,
    "episode_timeout": EPISODE_TIMEOUT,
    "use_grayscale": USE_GRAYSCALE,
    "extra_state": PLAYER_CONFIG["extra_state"],
    "hud": PLAYER_CONFIG["hud"],
    "crosshair": PLAYER_CONFIG["crosshair"],
    "screen_format": PLAYER_CONFIG["screen_format"].name,
    "doom_map": "ROOM",
})
callback = EpisodeCallback()

model.learn(total_timesteps=TOTAL_TIMESTEPS, callback=callback, progress_bar=True)
final_model = model


[34m[1mwandb[0m: Currently logged in as: [33msoerenseitz[0m ([33msoerenseitz-university-of-vienna[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin



----------------------------------
| rollout/            |          |
|    ep_len_mean      | 1e+03    |
|    ep_rew_mean      | -96.6    |
|    exploration_rate | 0.928    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 48       |
|    time_elapsed     | 81       |
|    total_timesteps  | 4000     |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.0467   |
|    n_updates        | 974      |
----------------------------------


## Dump to ONNX

In [None]:
import onnx
import json


def onnx_dump(env, model, config, filename: str):
    init_state = env.reset()[0].unsqueeze(0)
    policy_net = model.policy.q_net

    torch.onnx.export(
        policy_net.cpu(),
        args=init_state,
        f=filename,
        export_params=True,
        opset_version=11,
        do_constant_folding=True,
        input_names=["input"],
        output_names=["output"],
        dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}},
    )
    onnx_model = onnx.load(filename)

    meta = onnx_model.metadata_props.add()
    meta.key = "config"
    meta.value = json.dumps(config)

    onnx.save(onnx_model, filename)

export_config = {
    **PLAYER_CONFIG,
    "algo_type": "Q",
}
onnx_dump(env, final_model, export_config, filename="model.onnx")
print("Final network exported to model.onnx")

# Upload to wandb
artifact = wandb.Artifact('model', type='model')
artifact.add_file('model.onnx')
run.log_artifact(artifact)
artifact.wait()
run.finish()

## Evaluation and Visualization

In this final section, you can evaluate your trained agent, inspect its performance visually, and analyze reward components over time.
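
To analyze reward components over time, the per-step component tuples first have to be aggregated into one dictionary per episode. The sketch below shows one possible way to do that. The helper `summarize_episode` and the short component names are hypothetical, the order is assumed to match the tuple returned by `YourReward`, and the sample step values are made up. How the per-step tuples are collected from the environment is left open here.

```python
from typing import Dict, List, Sequence

# Assumed to follow the order of the tuple returned by YourReward.__call__.
COMPONENT_NAMES = (
    "hit", "hit_taken", "frag", "missed",
    "survival", "dead", "spam_penalty", "health",
)


def summarize_episode(
    step_rewards: Sequence[Sequence[float]],
    names: Sequence[str] = COMPONENT_NAMES,
) -> Dict[str, float]:
    """Sum every reward component over all steps of one episode."""
    totals = {name: 0.0 for name in names}
    for components in step_rewards:
        if len(components) != len(names):
            raise ValueError("component tuple does not match the name list")
        for name, value in zip(names, components):
            totals[name] += value
    return totals


# Made-up two-step episode: one hit landed, then one hit taken plus a miss.
episode_steps = [
    (5.0, 0.0, 0.0, 0.0, 0.01, 0.0, 0.0, 0.0),
    (0.0, -1.0, 0.0, -0.5, 0.01, 0.0, -0.05, 0.0),
]

reward_log: List[Dict[str, float]] = []
reward_log.append(summarize_episode(episode_steps))
print(reward_log[0])
```

A list built this way has one dict per episode, which is the `list[dict]` shape that `plot_reward_components` expects.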


In [None]:
# ---------------------------------------------------------------
# 📈  Reward-plot helper  (feel free to edit / extend)
# ---------------------------------------------------------------
import pandas as pd
import matplotlib.pyplot as plt

def plot_reward_components(reward_log, smooth_window: int = 5):
    """
    Plot raw and smoothed episode-level reward components.

    Parameters
    ----------
    reward_log : list[dict]
        Append a dict for each episode, e.g. {"frag": …, "hit": …, "hittaken": …}
    smooth_window : int
        Rolling-mean window size for the smoothed curve.
    """
    if not reward_log:
        print("reward_log is empty – nothing to plot.")
        return

    df = pd.DataFrame(reward_log)
    df_smooth = df.rolling(window=smooth_window, min_periods=1).mean()

    # raw
    plt.figure(figsize=(12, 5))
    for col in df.columns:
        plt.plot(df.index, df[col], label=col)
    plt.title("Raw episode reward components")
    plt.legend(); plt.grid(True); plt.tight_layout()
    plt.show()

    # smoothed
    plt.figure(figsize=(12, 5))
    for col in df.columns:
        plt.plot(df.index, df_smooth[col], label=f"{col} (avg)")
    plt.title(f"Smoothed (window={smooth_window})")
    plt.legend(); plt.grid(True); plt.tight_layout()
    plt.show()


# ----------------------------------------------------------------
# 🔍  Hint for replay visualisation:
# ----------------------------------------------------------------
# env.enable_replay()
# ... run an evaluation episode ...
# env.disable_replay()
# replays = env.get_player_replays()
#
# from doom_arena.render import render_episode
# from IPython.display import HTML
# HTML(render_episode(replays, subsample=5).to_html5_video())
#
# Feel free to adapt or write your own GIF/MP4 export.
