VideoGameBench is an evaluation benchmark for evaluating Vision-Language Models (VLMs) multi-modal understanding + reasoning on well-known video games. It currently supports Game Boy (through PyBoy), MS-DOS (through JS-DOS), and browser games, providing a standardized way to evaluate LLM performance in game understanding and interaction.
For mouse + keyboard games, we also provide a simple interface for helping the model generically / properly click on positions on the screen. We provide some example trajectories below of the model playing these games real-time, i.e. the model calls are async with respect to the game.
conda create -n videogamebench python=3.10
conda activate videogamebench
pip install -r requirements.txt
pip install -e .
playwright install # Install playwright for DOS games
Once you've downloaded and placed the appropriate ROMs into the roms/
folder (for expected names, see src/consts.py
). We provide a simple VideoGameAgent for Game Boy games which you can run below:
# Run with a Game Boy ROM file
python main.py --game pokemon_red --model gpt-4o
DOS games are loaded with js-dos
and do not require downloading games. We provide a simple VideoGameAgent for DOS games which you can run below:
# Run Doom2 with Gemini 2.5 Pro
python main.py --game doom2 --model gemini/gemini-2.5-pro-preview-03-25
# Run CIV with claude 3-7 and the agent state UI side by side
python main.py --game civ --model anthropic/claude-3-7-sonnet-20250219 --enable-ui
# Run Warcraft 2 in lite mode, so game pauses while model thinks
python main.py --game warcraft2 --model together_ai/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --enable-ui
# Run website-only mode to play the game yourself
python main.py --game quake --website-only
Running games on VideoGameBench Lite is the same as running on the regular benchmark, just with the --lite
flag. For example,
python main.py --game doom2 --model gemini/gemini-2.5-pro-preview-03-25 --lite
You can specify a model name with --model
(according to LiteLLM naming, which is very intuitive -- e.g. Gemini 2.0 Flash is gemini/gemini-2.0-flash
) which will draw from your environment variables and/or .env
, but you can specify particular keys with --api-key
. Each game will also have an associated config with defaults in the configs/
folder.
For local models with Ollama: You can also specify an api_base
with --api-base
for models like Ollama, which works with LiteLLM. See more details at https://docs.litellm.ai/docs/providers/ollama.
# First start ollama server
python main.py --game kirby --model ollama/llama2 --api-base http://localhost:11434 --enable-ui
We also provide a GUI with tkinter
to view the agent's actions, thoughts, memory, etc. with --enable-ui
:
We've run each of the following models:
gemini/gemini-2-5.pro-preview-03-25
(we also use--max-tokens 2048
)gemini/gemini-2.0-flash
(we also use--max-tokens 2048
)llama4
(actual model isLlama-4-Maverick-17B-128E-Instruct-FP8
, we also usemax-context-size 8
so it does not go over the context window limit)gpt-4o
(actual model isgpt-4o-2024-08-06
)claude-3.7
(actual model isclaude-3-7-sonnet-20250219
)
We provide a list of scripts run in all_experiments.sh
.
We want to be clear on what we built and what we build on top of! Our benchmark relies heavily on PyBoy for a Gym-based Game Boy emulator, JS-DOS for MS-DOS games, and Playwright for interacting with browser-based games. We also use LiteLLM for handling models, so you can use run almost any frontier VLM API! You will need to provide your own API keys.
The codebase easily supports more MS-DOS and GameBoy games, but we fix a diverse set of games for evaluation purposes. The games relevant to VideoGameBench are below:
- [3D] [shooter] Doom II
- [2D] [grid-world] [turn-based] Pokemon Crystal (GBC)
- [2D] [open-world] Legend of Zelda: Link's Awakening (DX for GBC)
- [2D] [strategy] [turn-based] Sid Meier's Civilization 1
- [3D] [racer] The Need for Speed
- [2D] [platformer] Kirby's Dream Land (DX for GBC)
- [2D] [puzzle] The Incredible Machine (1993)
Note: There are three secret games in the test set which we do not release in this repo. We will run them on our side for new submissions.
DOS Games:
- [3D] [shooter] Doom
- [3D] [shooter] Quake
- [2.5D] [strategy] Warcraft II: Tides of Darkness (Orc Campaign)
- [2D] [strategy] [turn-based] Oregon Trail Deluxe (1992)
- [2D] [strategy] X-COM UFO Defense
- [2D] [platformer] Prince of Persia
- [2D] [strategy] Age of Empires (1997)
Game Boy Games:
- [2D] [grid-world] [turn-based] Pokemon Red (GB)
- [2D] [platformer] Super Mario Land
- [2D] [platformer] Mega Man: Dr. Wily's Revenge
- [2D] [platformer] Donkey Kong Land 2
- [2D] [platformer] Castlevania Adventure
- [2D] [detective] Scooby-Doo! - Classic Creep Capers
Each game has its own associated config in the configs/
folder, which controls a few settings about each game (currently very few). Each game has a folder under its name, e.g. configs/game/
with a config.yaml
and some optional files that the code looks for:
- The most important other file is the
prompt.txt
, which is the game-specific prompt that gets fed to the model at every step. You can update this with whatever game specific information you want to provide to help the agent. - For tracking "end-states", you can add to a folder called
checkpoints/
inside theconfigs/{game}
folder. Here, you can add as many checkpoints as you want, named by number (e.g.1.png
). The last checkpoint in numerical order is considered theend state
, and signals a complete game. See - For DOS games there is the ability to use a custom HTML template for your own JS-DOS games. This allows modifying the website with your own tools and frameworks for your agents to aid them in playing the game. If you specify a
game.html
and ados
game in the config, it will override the defaultJS-DOS
html. - For DOS games, if you want the game to load a set of actions beforehand, you can fill a
preload.txt
out with actions and delays. Seesrc/browser_controller.py
for how this is controlled. This is useful because a lot of DOS games have difficulty selection menus which you may want to fix.
To add new games, in addition to make a config above, you also have to edit the src/consts.py
file. For GB games edit the ROM_FILE_MAP
to map a game name for the --game
flag to the name of the ROM that you put in the roms/
folder, and for DOS games update the JSDOS file link in the GAME_URL_MAP
.
We mostly separate out the codebase by the evaluators (which run the episode) and the agent logic. It's relatively simple to edit, and we are working on making it more robust.
src/
├── evaluator.py # Main evaluation loop and episode management
├── run_dos.py # DOS-specific game runner
├── run_gb.py # Game Boy-specific emulator runner
├── emulators/
│ ├── dos/ # DOS-specific emulator logic
│ ├── gba/ # Game Boy-specific emulator logic
│ ├── interface_base.py # Base interface class for emulators
├── llm/ # Folder for LLM and Agent logic.
├──── llm_client.py # LLMLite logic
├──── vgagent.py # VG-Agent logic for both GB and DOS
main.py # Entrypoint for launching VideoGameBench
We provide a shorter list of the relevant command line arguments for replicating experiments. The full set of arguments (provided for added flexibility) are found in main.py
.
Common options:
--emulator {dos,gba} Which emulator to use (default: none)
--game GAME Picks a game from the configs list
--lite Enable lite mode for the game, meaning pause enabled while model is thinking
--model MODEL The VLM model to use (using LiteLLM syntax). Defaults to gpt-4o
--enable-ui Run the game with the agent UI enabled
--max-steps MAX_STEPS Maximum number of steps to run. Defaults to 15000
DOS-specific options:
--website-only Just open the website without agent interaction
The system creates comprehensive logs for each session including:
- Screenshot images at each step (found in
game_screen/
orlite_screenshots
for lite games.) - Screenshot images for agent UI (found in
monitor/
) - LLM API interactions, button presses, game state information
- Model reflections / internal memory
Logs are stored in the logs/{game}/{model}/{run}
directory with timestamps.
There are two main ways to generate videos / clips from VideoGameBench. The first is to just
record the screen as it plays (best for streaming). The second is to use the frames provided to
the model to generate a video using OpenCV, which we used to generate on our website. We provide
a Python notebook to generate these videos from saved frames for both MS-DOS and Game Boy games in
the notebooks/generate_clips.ipynb
file. For example, we generate a video of Gemini 2.5 Pro playing Kirby using only frames given to the agent:
gemini-2-5-pro_kirby.mp4
Disclaimer: While all parts of this codebase fall under the MIT license (i.e. free use for both personal and commercial purposes), the games themselves do not. You must legally own the games to play them.