# 🧠 Deep Reinforcement Learning — Doom Agent (SS2025)

Welcome to the last assignment for the **Deep Reinforcement Learning** course (SS2025). In this notebook, you'll implement and train a reinforcement learning agent to play **Doom**.

You will:
- Set up a custom VizDoom environment with shaped rewards
- Train an agent using an approach of your choice
- Track reward components across episodes
- Evaluate the best model
- Visualize performance with replays and GIFs
- Export the trained agent to ONNX to submit to the evaluation server

In [None]:
# Install the dependencies
!python -m pip install --upgrade pip
!pip install --upgrade notebook ipywidgets ipykernel -q
!pip install torch numpy matplotlib vizdoom portpicker gym onnx wandb "stable-baselines3[extra]" Shimmy einops torchvision -q

In [None]:
import os
import subprocess

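base_dir = os.path.abspath(os.getcwd())
# If this cell is re-run after the chdir below, we are already inside the repo.
if os.path.basename(base_dir) == "jku.wad":
    dir_path = base_dir
else:
    dir_path = os.path.join(base_dir, "jku.wad")

if os.path.isdir(dir_path):
    # Repository already cloned: update it in place.
    os.chdir(dir_path)
    subprocess.run(["git", "pull", "origin", "main"])
else:
    # First run: clone the repository, then work from inside it.
    subprocess.run(["git", "clone", "https://github.com/syseitz/jku.wad.git", dir_path])
    os.chdir(dir_path)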

## Environment configuration

ViZDoom supports multiple visual buffers that can be used as input for training agents. Each buffer provides different information about the game environment. The animation below shows them from left to right:

- **Screen**: the default first-person RGB view seen by the agent.
- **Labels**: a semantic map where each pixel is tagged with an object ID (e.g., enemy, item, wall).
- **Depth**: a grayscale map showing the distance from the agent to surfaces in the scene.
- **Automap**: a top-down schematic view of the map, useful for global navigation tasks.

![buffers gif](https://vizdoom.farama.org/_images/vizdoom-demo.gif)

In [None]:
import wandb
from typing import Dict, Sequence, Tuple

import torch
from collections import deque, OrderedDict
from copy import deepcopy
import random
import numpy as np
import torch.nn.functional as F
import torch.optim as optim
import pandas as pd
from matplotlib import pyplot as plt
from PIL import Image

from gym import Env
import gymnasium as gym
from torch import nn
from einops import rearrange

from doom_arena import VizdoomMPEnv
from doom_arena.reward import VizDoomReward
from doom_arena.render import render_episode
from IPython.display import HTML

from vizdoom import ScreenFormat
from stable_baselines3 import DQN
from stable_baselines3.common.callbacks import BaseCallback
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

USE_GRAYSCALE = False  # False = RGB (CRCGCB); set to True for grayscale (GRAY8)

PLAYER_CONFIG = {
    "n_stack_frames": 1,
    "extra_state": ["depth", "labels"],
    "hud": "none",
    "crosshair": True,
    "screen_format": ScreenFormat.GRAY8 if USE_GRAYSCALE else ScreenFormat.CRCGCB,
}

## Reward function

In this task, you will define a reward function to guide the agent's learning. The function is called at every step and receives the current and previous game variables (e.g., number of frags, hits taken, health).

Your goal is to combine these into a meaningful reward that encourages desirable behavior, such as:

- Rewarding frags (enemy kills)
- Rewarding accuracy (hitting enemies)
- Penalizing damage taken
- (Optional) Encouraging survival, ammo efficiency, etc.

The function can return several reward components, which are summed into a single scalar during training. The class below is a reasonable starting point.
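The idea of turning game-variable deltas into summed components can be sketched without the environment. The snippet below is a minimal illustration, assuming plain dictionaries stand in for `game_var` and `game_var_old`; the helper name `shaped_components` and its weights are hypothetical and not part of `doom_arena`.

```python
from typing import Dict, Tuple


def shaped_components(
    game_var: Dict[str, float], game_var_old: Dict[str, float]
) -> Tuple[float, float, float]:
    """Return (frag, hit, damage_taken) reward components for one step."""
    # Reward each new frag since the previous step.
    rwd_frag = 1.0 * (game_var["FRAGCOUNT"] - game_var_old["FRAGCOUNT"])
    # Reward each new hit landed on an enemy.
    rwd_hit = 0.1 * (game_var["HITCOUNT"] - game_var_old["HITCOUNT"])
    # Penalize health lost; health gained is ignored in this sketch.
    health_lost = max(0.0, game_var_old["HEALTH"] - game_var["HEALTH"])
    rwd_damage_taken = -0.01 * health_lost
    return rwd_frag, rwd_hit, rwd_damage_taken


# Illustrative values: one frag, one extra hit, 20 health lost.
old = {"FRAGCOUNT": 0, "HITCOUNT": 2, "HEALTH": 100}
new = {"FRAGCOUNT": 1, "HITCOUNT": 3, "HEALTH": 80}
components = shaped_components(new, old)
total = sum(components)  # the scalar reward is the sum of the components
print(components, total)
```

Keeping the components separate until the final sum makes it possible to log each one individually and see which term dominates the return.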

In [None]:
# TODO: environment training parameters
EPISODE_TIMEOUT = 1100
NUM_BOTS = 4
# TODO: model hyperparameters
GAMMA = 0.95
BATCH_SIZE = 32
REPLAY_BUFFER_SIZE = 10000
LEARNING_RATE = 1e-4
EPSILON_START = 1.0
EPSILON_END = 0.5
EXPLORATION_FRACTION = 1.0
FEATURES_DIM = 512
TOTAL_TIMESTEPS = 1000000

class YourReward(VizDoomReward):
    def __init__(self, num_players: int):
        super().__init__(num_players)
        self.last_rewards = [None] * num_players

    def __call__(
        self,
        vizdoom_reward: float,
        game_var: Dict[str, float],
        game_var_old: Dict[str, float],
        player_id: int,
    ) -> Tuple[float, ...]:
        self._step += 1
        _ = vizdoom_reward  # unused here; player_id is used below

        # Damage dealt to enemies since the previous step.
        damage_done = game_var["DAMAGECOUNT"] - game_var_old["DAMAGECOUNT"]
        rwd_damage = 0.01 * damage_done

        # New frags since the previous step.
        rwd_frag = 1.0 * (game_var["FRAGCOUNT"] - game_var_old["FRAGCOUNT"])

        # Ammo spent this step; shots that did not register a hit are penalized.
        ammo_delta = game_var_old["SELECTED_WEAPON_AMMO"] - game_var["SELECTED_WEAPON_AMMO"]
        if ammo_delta > 0:
            shots_fired = ammo_delta
            hits = game_var["HITCOUNT"] - game_var_old["HITCOUNT"]
            missed_shots = max(0, shots_fired - hits)
            rwd_missed = -0.1 * missed_shots
        else:
            rwd_missed = 0.0

        # Small bonus per step survived, and a penalty on every step while dead.
        rwd_survival = 0.001
        rwd_dead = -0.5 if game_var["DEAD"] == 1 else 0.0

        # Extra penalty for firing without dealing any damage.
        rwd_spam_penalty = -0.01 if ammo_delta > 0 and damage_done <= 0 else 0.0

        # Reward health gained (e.g. from pickups); health lost is not penalized here.
        health_delta = game_var["HEALTH"] - game_var_old["HEALTH"]
        health_gained = max(0, health_delta)
        rwd_health_pickup = 0.02 * health_gained

        # Tiny bonus for moving, larger penalty for standing still.
        position_changed = (game_var["POSITION_X"] != game_var_old["POSITION_X"]) or (game_var["POSITION_Y"] != game_var_old["POSITION_Y"])
        rwd_movement = 0.00005 if position_changed else -0.0025

        rewards = (rwd_damage, rwd_frag, rwd_missed, rwd_survival, rwd_dead, rwd_spam_penalty, rwd_health_pickup, rwd_movement)
        self.last_rewards[player_id] = rewards
        
        return rewards
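To get a feel for how `GAMMA` interacts with the reward magnitudes above, the following sketch computes a discounted return and the common rule-of-thumb horizon 1 / (1 - gamma). The helper `discounted_return` is illustrative only and is not used by the training code.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over a reward sequence."""
    total = 0.0
    # Iterate backwards so each step applies one multiplication by gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


GAMMA = 0.95
# A frag (reward 1.0) that happens 20 steps from now, with nothing before it.
rewards = [0.0] * 20 + [1.0]
value_now = discounted_return(rewards, GAMMA)  # equals 0.95 ** 20
horizon = 1.0 / (1.0 - GAMMA)  # rule-of-thumb effective horizon in steps
print(f"value now: {value_now:.3f}, effective horizon: {horizon:.0f} steps")
```

With a horizon of roughly 20 steps against an episode timeout of 1100, rewards far in the future contribute little, which is one reason dense shaping terms such as the damage reward can matter.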

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present
DTYPE = torch.float32

reward_fn = YourReward(num_players=1)

env = VizdoomMPEnv(
    num_players=1,
    num_bots=NUM_BOTS,
    bot_skill=0,
    doom_map="ROOM",
    extra_state=PLAYER_CONFIG["extra_state"],
    episode_timeout=EPISODE_TIMEOUT,
    n_stack_frames=PLAYER_CONFIG["n_stack_frames"],
    crosshair=PLAYER_CONFIG["crosshair"],
    hud=PLAYER_CONFIG["hud"],
    reward_fn=reward_fn,
)

## Agent

The agent below uses Stable Baselines3's `DQN` with its default `CnnPolicy`; the hyperparameters come from the constants defined above.

In [None]:
model = DQN(
    "CnnPolicy",
    env,
    learning_rate=LEARNING_RATE,
    buffer_size=REPLAY_BUFFER_SIZE,
    batch_size=BATCH_SIZE,
    gamma=GAMMA,
    exploration_fraction=EXPLORATION_FRACTION,
    exploration_initial_eps=EPSILON_START,
    exploration_final_eps=EPSILON_END,
    verbose=1,
)
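The three exploration arguments passed to `DQN` describe a linear epsilon schedule. The sketch below re-implements that interpolation in plain Python to show what the chosen values imply. It assumes that epsilon moves linearly from the initial to the final value over the first `exploration_fraction` of training and stays constant afterwards; `linear_epsilon` is a hypothetical stand-in, not an SB3 function.

```python
def linear_epsilon(progress: float, start: float, end: float, fraction: float) -> float:
    """Epsilon after `progress` (0..1) of training under a linear schedule."""
    # Once the exploration phase is over, epsilon stays at its final value.
    if progress >= fraction:
        return end
    # Otherwise interpolate linearly between start and end.
    return start + (progress / fraction) * (end - start)


EPSILON_START, EPSILON_END, EXPLORATION_FRACTION = 1.0, 0.5, 1.0
for progress in (0.0, 0.5, 1.0):
    eps = linear_epsilon(progress, EPSILON_START, EPSILON_END, EXPLORATION_FRACTION)
    print(f"progress={progress:.1f} -> epsilon={eps:.2f}")
```

Under this assumption, the settings in this notebook keep at least half of all training actions random until the very last step, so a lower `EPSILON_END` or a shorter `EXPLORATION_FRACTION` may be worth trying.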

In [None]:
class EpisodeCallback(BaseCallback):
    def __init__(self):
        super().__init__()
        self.episode_reward = 0
        self.episode_num = 0
        self.episode_rwd_components = {
            "damage": 0.0,
            "frag": 0.0,
            "missed": 0.0,
            "survival": 0.0,
            "dead": 0.0,
            "spam_penalty": 0.0,
            "health": 0.0,
            "movement": 0.0
        }

    def _on_step(self) -> bool:
        self.episode_reward += self.locals['rewards'][0]

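        # Unwrap: vectorized env -> Monitor wrapper -> underlying environment,
        # then read the reward components stored by YourReward for player 0.
        monitor_env = self.locals['env'].envs[0]
        actual_env = monitor_env.env
        last_rewards = actual_env.reward_fn.last_rewards[0]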

        if last_rewards is not None:
            self.episode_rwd_components["damage"] += last_rewards[0]
            self.episode_rwd_components["frag"] += last_rewards[1]
            self.episode_rwd_components["missed"] += last_rewards[2]
            self.episode_rwd_components["survival"] += last_rewards[3]
            self.episode_rwd_components["dead"] += last_rewards[4]
            self.episode_rwd_components["spam_penalty"] += last_rewards[5]
            self.episode_rwd_components["health"] += last_rewards[6]
            self.episode_rwd_components["movement"] += last_rewards[7]

        if self.locals['dones'][0]:
            self.episode_num += 1
            wandb.log({
                "episode": self.episode_num,
                "return": self.episode_reward,
                "rwd_damage": self.episode_rwd_components["damage"],
                "rwd_frag": self.episode_rwd_components["frag"],
                "rwd_missed": self.episode_rwd_components["missed"],
                "rwd_survival": self.episode_rwd_components["survival"],
                "rwd_dead": self.episode_rwd_components["dead"],
                "rwd_spam_penalty": self.episode_rwd_components["spam_penalty"],
                "rwd_health": self.episode_rwd_components["health"],
                "rwd_movement": self.episode_rwd_components["movement"],
            })
            self.episode_reward = 0
            for key in self.episode_rwd_components:
                self.episode_rwd_components[key] = 0.0

        return True

## Training loop

In [None]:
run = wandb.init(project="doom-rl", entity="soerenseitz-university-of-vienna", config={
    "gamma": GAMMA,
    "batch_size": BATCH_SIZE,
    "replay_buffer_size": REPLAY_BUFFER_SIZE,
    "learning_rate": LEARNING_RATE,
    "epsilon_start": EPSILON_START,
    "epsilon_end": EPSILON_END,
    "exploration_fraction": EXPLORATION_FRACTION,
    "num_bots": NUM_BOTS,
    "episode_timeout": EPISODE_TIMEOUT,
    "use_grayscale": USE_GRAYSCALE,
    "extra_state": PLAYER_CONFIG["extra_state"],
    "hud": PLAYER_CONFIG["hud"],
    "crosshair": PLAYER_CONFIG["crosshair"],
    "screen_format": PLAYER_CONFIG["screen_format"].name,
    "doom_map": "ROOM",
})
callback = EpisodeCallback()

model.learn(total_timesteps=TOTAL_TIMESTEPS, callback=callback, progress_bar=True)
final_model = model

## Dump to ONNX

In [None]:
import onnx
import json
import torch

def onnx_dump(env, model, config, run, filename_prefix="model"):
    dummy_input = torch.randn(1, *env.observation_space.shape).float().to('cpu')
    print("Dummy input shape:", dummy_input.shape)
    
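    # Note: DQNPolicy.forward returns greedy action indices, not Q-values.
    # If the evaluation server expects Q-values, export model.policy.q_net instead.
    policy_net = model.policy.to('cpu')
    policy_net.eval()  # inference mode for tracing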
    
    run_id = run.id
    filename = f"{filename_prefix}_{run_id}.onnx"
    
    torch.onnx.export(
        policy_net,
        args=dummy_input,
        f=filename,
        export_params=True,
        opset_version=11,
        do_constant_folding=True,
        input_names=["input"],
        output_names=["output"],
        dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}},
    )
    
    onnx_model = onnx.load(filename)
    meta = onnx_model.metadata_props.add()
    meta.key = "config"
    meta.value = json.dumps(config)
    onnx.save(onnx_model, filename)
    
    return filename

export_config = {
    **{k: str(v) if isinstance(v, ScreenFormat) else v for k, v in PLAYER_CONFIG.items()},
    "algo_type": "Q",
}

filename = onnx_dump(env, final_model, export_config, run, filename_prefix="model")
print(f"Final network exported to {filename}")

artifact = wandb.Artifact('model', type='model')
artifact.add_file(filename)
run.log_artifact(artifact)
artifact.wait()
run.finish()
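The config dictionary has to be JSON-serializable before it can be stored as ONNX metadata, which is why the screen format is converted to a string. Below is a minimal sketch of that round trip. It uses a hypothetical `FakeScreenFormat` enum in place of `vizdoom.ScreenFormat` so that it runs without ViZDoom installed, and it stores the member's `.name`, whereas the export cell above uses `str(v)`; which of the two strings the evaluation server expects is not stated in this notebook.

```python
import json
from enum import Enum


class FakeScreenFormat(Enum):
    """Hypothetical stand-in for vizdoom.ScreenFormat."""
    GRAY8 = 0
    CRCGCB = 1


player_config = {
    "n_stack_frames": 1,
    "extra_state": ["depth", "labels"],
    "hud": "none",
    "crosshair": True,
    "screen_format": FakeScreenFormat.CRCGCB,
}

# Replace enum members by their names; leave everything else untouched.
export_config = {
    key: value.name if isinstance(value, Enum) else value
    for key, value in player_config.items()
}
export_config["algo_type"] = "Q"

encoded = json.dumps(export_config)  # string stored in the metadata entry
decoded = json.loads(encoded)        # what a consumer would read back
print(decoded["screen_format"], decoded["extra_state"])
```

Passing the unconverted `player_config` to `json.dumps` would raise a `TypeError`, because enum members are not JSON-serializable by default.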

## Evaluation and Visualization

In this final section, you can evaluate your trained agent, inspect its performance visually, and analyze reward components over time.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

def plot_reward_components(reward_log, smooth_window: int = 5):
    if not reward_log:
        print("reward_log is empty – nothing to plot.")
        return

    df = pd.DataFrame(reward_log)
    df_smooth = df.rolling(window=smooth_window, min_periods=1).mean()

    plt.figure(figsize=(12, 5))
    for col in df.columns:
        plt.plot(df.index, df[col], label=col)
    plt.title("Raw episode reward components")
    plt.legend(); plt.grid(True); plt.tight_layout()
    plt.show()

    plt.figure(figsize=(12, 5))
    for col in df.columns:
        plt.plot(df.index, df_smooth[col], label=f"{col} (avg)")
    plt.title(f"Smoothed (window={smooth_window})")
    plt.legend(); plt.grid(True); plt.tight_layout()
    plt.show()

# Hint for replay visualisation:
# env.enable_replay()
# ... run an evaluation episode ...
# env.disable_replay()
# replays = env.get_player_replays()
# from doom_arena.render import render_episode
# from IPython.display import HTML
# HTML(render_episode(replays, subsample=5).to_html5_video())
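`plot_reward_components` expects `reward_log` to be a list with one dictionary of component totals per episode. The notebook logs these totals to Weights & Biases rather than keeping them locally, so the following is a sketch of one way to collect them; the class `RewardLog` and its method names are hypothetical.

```python
from typing import Dict, List, Sequence

# Same order as the tuple returned by YourReward.__call__.
COMPONENT_NAMES = (
    "damage", "frag", "missed", "survival",
    "dead", "spam_penalty", "health", "movement",
)


class RewardLog:
    """Accumulates per-step reward tuples into per-episode totals."""

    def __init__(self, names: Sequence[str] = COMPONENT_NAMES):
        self.names = tuple(names)
        self.episodes: List[Dict[str, float]] = []
        self._current = dict.fromkeys(self.names, 0.0)

    def add_step(self, components: Sequence[float]) -> None:
        # Components must arrive in the same order as `self.names`.
        for name, value in zip(self.names, components):
            self._current[name] += value

    def end_episode(self) -> None:
        # Store a copy of the totals and reset for the next episode.
        self.episodes.append(dict(self._current))
        self._current = dict.fromkeys(self.names, 0.0)


# Illustrative values for a two-step episode.
log = RewardLog()
log.add_step((0.1, 0.0, -0.1, 0.001, 0.0, 0.0, 0.0, 0.00005))
log.add_step((0.2, 1.0, 0.0, 0.001, 0.0, 0.0, 0.0, 0.00005))
log.end_episode()
print(log.episodes[0]["frag"])
```

`log.episodes` could then be passed as `reward_log`. Calling `add_step` and `end_episode` from `EpisodeCallback._on_step` is one possible place to hook this in.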