SCOPE is an interactive world model for FPS games with 10-DoF action control, trained on 69K clips across 7 games.
We are excited to introduce SCOPE, an open-source, action-conditioned interactive world model for first-person shooter (FPS) games. It offers the following features:
- Hybrid Action Space: Jointly processes continuous (4D dual-joystick) and discrete (6 binary buttons) control signals within a unified framework, the first FPS world model to do so.
- Dense Per-Frame Conditioning: Resolves overlapping actions at every single frame, enabling simultaneous multi-action composition (e.g., moving + aiming + firing) that reflects real gameplay complexity.
- Cross-Game Generalization: Trained on 7 diverse FPS titles, a single model generalizes zero-shot to entirely unseen game environments without fine-tuning.
- In-Scope / Out-of-Scope Decoupling: Spatially selective conditioning that separates localized in-scope effects (weapon recoil, HUD) from stable out-of-scope world generation, without any segmentation labels.
Generated at 480×832 resolution, 81 frames @ 20 FPS (~4 seconds), conditioned on 10-DoF action inputs.
- May 2026: We release the model weights, inference code, and CrossFPS dataset.
This codebase is built upon Wan2.2. Please refer to their documentation for environment setup.
Clone the repo:
git clone https://github.com/z2tong/SCOPE.git
cd SCOPE
Install dependencies (we recommend uv for fast, reproducible environment setup):
# Using uv (recommended)
uv venv --python 3.10
source .venv/bin/activate
uv pip install -r requirements.txt
uv pip install -e .
# Or using conda + pip
conda create -n scope python=3.10 && conda activate scope
pip install -r requirements.txt
pip install -e .
Install flash_attn (recommended for faster inference):
pip install ninja
pip install flash-attn --no-build-isolation
All required weights (DiT + Text Encoder + VAE + Tokenizer) are packaged in a single HuggingFace repo:
pip install "huggingface_hub[cli]"
huggingface-cli download zizhaotong/SCOPE --local-dir ./SCOPE
Directory layout after download:
SCOPE/
├── model-00001-of-00003.safetensors   # DiT shard 1 (~5.0 GB)
├── model-00002-of-00003.safetensors   # DiT shard 2 (~5.0 GB)
├── model-00003-of-00003.safetensors   # DiT shard 3 (~4.6 GB)
├── model.safetensors.index.json       # Shard index
├── models_t5_umt5-xxl-enc-bf16.pth    # Text Encoder (UMT5-XXL, ~20 GB)
├── Wan2.2_VAE.pth                     # Video VAE (~700 MB)
├── google/umt5-xxl/                   # Tokenizer
│   ├── config.json
│   ├── spiece.model
│   └── tokenizer_config.json
└── config.json                        # Model config
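Before launching inference, it can help to confirm that the download is complete. The following is a minimal sketch using only the Python standard library. The helper name `missing_weight_files` is hypothetical and not part of the SCOPE codebase; the file list is copied from the layout above.

```python
from pathlib import Path

# File names copied from the directory layout above.
EXPECTED_FILES = [
    "model-00001-of-00003.safetensors",
    "model-00002-of-00003.safetensors",
    "model-00003-of-00003.safetensors",
    "model.safetensors.index.json",
    "models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2_VAE.pth",
    "google/umt5-xxl/config.json",
    "google/umt5-xxl/spiece.model",
    "google/umt5-xxl/tokenizer_config.json",
    "config.json",
]


def missing_weight_files(model_dir):
    """Return the expected files that are absent from model_dir (hypothetical helper)."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_weight_files("./SCOPE")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected files are present.")
```

An empty result only means the files exist; it does not verify their size or integrity.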
Single image + action:
python inference.py \
--model_dir ./SCOPE \
--input_image examples/example_0/image.png \
--action_path examples/example_0/action.parquet \
--prompt "In a whimsical, toy-inspired garden, the first-person view reveals a tactical weapon aimed forward along a sunlit path." \
--output_dir ./outputs
Batch processing (directory of images):
python inference.py \
--model_dir ./SCOPE \
--input_image_dir ./my_images \
--action_path examples/example_1/action.parquet \
--prompt "A breathtaking view of a vibrant fantasy world, seen through an FPS perspective." \
--output_dir ./outputs
We provide 3 ready-to-run examples in examples/:
| Example | Scene | Command |
|---|---|---|
| example_0 | It Takes Two | python inference.py --model_dir ./SCOPE --input_image examples/example_0/image.png --action_path examples/example_0/action.parquet --prompt "$(cat examples/example_0/prompt.txt)" |
| example_1 | Genshin Impact | python inference.py --model_dir ./SCOPE --input_image examples/example_1/image.png --action_path examples/example_1/action.parquet --prompt "$(cat examples/example_1/prompt.txt)" |
| example_2 | Black Myth: Wukong | python inference.py --model_dir ./SCOPE --input_image examples/example_2/image.png --action_path examples/example_2/action.parquet --prompt "$(cat examples/example_2/prompt.txt)" |
Tips: If you have sufficient CUDA memory, you may increase the --max_frames parameter (e.g., to 161) to generate longer videos. The frame count must satisfy n % 4 == 1.
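The constraint n % 4 == 1 means that valid frame counts are 1, 5, 9, and so on, including the default 81 and the suggested 161. A small sketch for snapping a requested length down to a valid value and converting it to clip duration at the stated 20 FPS follows. Both function names are hypothetical helpers, not part of inference.py.

```python
def snap_frame_count(requested):
    """Largest n <= requested that satisfies n % 4 == 1 (hypothetical helper)."""
    if requested < 1:
        raise ValueError("frame count must be at least 1")
    return requested - ((requested - 1) % 4)


def clip_seconds(frames, fps=20):
    """Clip duration in seconds at the given frame rate (20 FPS per this README)."""
    return frames / fps


if __name__ == "__main__":
    for wanted in (81, 100, 161):
        n = snap_frame_count(wanted)
        print(f"requested {wanted} -> use --max_frames {n} ({clip_seconds(n):.2f} s)")
```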
| Argument | Default | Description |
|---|---|---|
| --model_dir | (required) | Path to model directory containing all weights |
| --input_image | None | Single input image (first frame) |
| --input_image_dir | None | Directory of images for batch mode |
| --action_path | (required) | Action signal file (.parquet) |
| --prompt | "" | Text prompt describing the scene |
| --output_dir | ./outputs | Output directory |
| --height | 480 | Video height (pixels) |
| --width | 832 | Video width (pixels) |
| --max_frames | 81 | Max frames (must satisfy n % 4 == 1) |
| --num_inference_steps | 30 | Diffusion denoising steps |
| --seed | 0 | Random seed for reproducibility |
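When scripting many runs, the documented flags can be assembled programmatically. The sketch below uses only the flags and defaults listed in the table above. The function names are hypothetical, and the check that exactly one of `--input_image` or `--input_image_dir` is given is an assumption about how the two modes are selected, not confirmed behavior of inference.py.

```python
import subprocess
import sys


def build_inference_command(model_dir, action_path, input_image=None,
                            input_image_dir=None, prompt="",
                            output_dir="./outputs", height=480, width=832,
                            max_frames=81, num_inference_steps=30, seed=0):
    """Assemble an argv list for inference.py from the documented flags."""
    # Assumption: single-image and batch mode are mutually exclusive.
    if (input_image is None) == (input_image_dir is None):
        raise ValueError("pass exactly one of input_image or input_image_dir")
    if max_frames % 4 != 1:
        raise ValueError("max_frames must satisfy n % 4 == 1")
    if input_image is not None:
        source = ["--input_image", str(input_image)]
    else:
        source = ["--input_image_dir", str(input_image_dir)]
    return [
        sys.executable, "inference.py",
        "--model_dir", str(model_dir),
        *source,
        "--action_path", str(action_path),
        "--prompt", prompt,
        "--output_dir", str(output_dir),
        "--height", str(height),
        "--width", str(width),
        "--max_frames", str(max_frames),
        "--num_inference_steps", str(num_inference_steps),
        "--seed", str(seed),
    ]


def run_inference(**kwargs):
    """Run inference.py from the repository root; raises if it exits non-zero."""
    subprocess.run(build_inference_command(**kwargs), check=True)
```

For example, `run_inference(model_dir="./SCOPE", input_image="examples/example_0/image.png", action_path="examples/example_0/action.parquet", seed=1)` would launch one run with a different seed.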
The .parquet file defines per-frame player inputs. Each row corresponds to one raw video frame (81 rows = 81 frames).
Discrete buttons (6 binary inputs):

| Column | Game Action | Controller Button |
|---|---|---|
| right_trigger | Fire | RT |
| left_trigger | Aim Down Sights | LT |
| south | Jump | A |
| right_thumb | Melee | R3 |
| west | Reload | X |
| north | Weapon Switch | Y |

Continuous joysticks (4D in total):

| Column | Axes | Function |
|---|---|---|
| j_left | [x, y] | Character movement (left stick) |
| j_right | [x, y] | Camera/aim rotation (right stick) |
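To make the per-frame layout concrete, here is a hedged sketch that builds an 81-row action sequence with the column names listed above. Several details are assumptions rather than documented facts: that buttons are stored as 0/1 integers, that joystick values lie in [-1, 1], that a positive left-stick y means "forward", and that each joystick column holds a two-element list. The function names are hypothetical. Inspecting a shipped file such as examples/example_0/action.parquet (for instance with `pandas.read_parquet`) is the reliable way to confirm the real schema.

```python
BUTTON_COLUMNS = ["right_trigger", "left_trigger", "south",
                  "right_thumb", "west", "north"]


def make_action_rows(num_frames=81, fire_start=20, fire_end=40):
    """Build one dict per raw frame: move forward, pan right, fire for a while.

    Value ranges and sign conventions here are assumptions for illustration.
    """
    rows = []
    for t in range(num_frames):
        row = {name: 0 for name in BUTTON_COLUMNS}
        row["right_trigger"] = 1 if fire_start <= t < fire_end else 0
        row["j_left"] = [0.0, 1.0]    # assumed: full forward movement
        row["j_right"] = [0.3, 0.0]   # assumed: slow camera pan to the right
        rows.append(row)
    return rows


def write_action_parquet(rows, path):
    """Write the rows to a .parquet file (requires pandas plus pyarrow)."""
    import pandas as pd  # imported lazily so the row builder has no dependencies
    pd.DataFrame(rows).to_parquet(path, index=False)


if __name__ == "__main__":
    rows = make_action_rows()
    print(len(rows), "rows; columns:", list(rows[0]))
```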
Note: The model handles 4× temporal compression internally via the VAE. Action sequences should have one entry per raw frame (not per latent frame).
Built on Wan2.2-TI2V-5B (~5B parameters, BFloat16), the model inserts an ActionModule into each of the 30 DiT transformer blocks, enabling per-pixel action-conditioned video generation at 480×832 resolution, 81 frames @ 20 FPS (~4 seconds).
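The relationship between raw frames and latent frames can be sketched as follows. This assumes the common causal video VAE convention in which the first raw frame gets its own latent frame and every following group of 4 raw frames shares one. That convention is consistent with the n % 4 == 1 constraint, but it is an assumption here and not a statement about SCOPE's internals. Both function names are hypothetical.

```python
def num_latent_frames(num_raw_frames, stride=4):
    """Latent frame count under the assumed convention: (n - 1) / stride + 1."""
    if num_raw_frames % stride != 1:
        raise ValueError("raw frame count must satisfy n % stride == 1")
    return (num_raw_frames - 1) // stride + 1


def raw_frames_for_latent(k, stride=4):
    """Indices of the raw frames (and action rows) covered by latent frame k."""
    if k == 0:
        return [0]  # assumed: the first frame is encoded on its own
    start = (k - 1) * stride + 1
    return list(range(start, start + stride))


if __name__ == "__main__":
    n = 81
    latents = num_latent_frames(n)
    print(f"{n} raw frames -> {latents} latent frames")
    print("last latent frame covers raw frames", raw_frames_for_latent(latents - 1))
```

Under this assumption, an 81-row action file is still what the user supplies; any grouping into latent frames happens inside the model.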
Each ActionModule contains two conditioning paths:
- Mouse/Joystick Path: Sliding-window temporal features → MLP fusion → pixel-wise temporal self-attention with RoPE
- Keyboard/Button Path: Button embedding → temporal windowing → cross-attention (video queries, keyboard keys/values)
Both output projections are zero-initialized for stable residual training on top of frozen pretrained weights. For detailed architecture specifications, see the model card on HuggingFace.
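The effect of zero-initializing an output projection can be shown with a toy residual branch: at initialization the branch contributes exactly nothing, so the block reproduces the pretrained features, and the action signal only takes effect as the projection is trained. The code below is a dependency-free illustration of that idea, not SCOPE's ActionModule. The class name, layer sizes, and the way the action is mixed in are all invented for the example.

```python
import math
import random


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


class ToyActionBranch:
    """Residual branch: out = x + W_out @ tanh(W_in @ (x + action))."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        # Input projection gets small random weights.
        self.w_in = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(hidden)]
        # Output projection starts at zero, so the branch is a no-op at init.
        self.w_out = [[0.0] * hidden for _ in range(dim)]

    def __call__(self, x, action):
        mixed = [xi + ai for xi, ai in zip(x, action)]
        hidden = [math.tanh(h) for h in matvec(self.w_in, mixed)]
        delta = matvec(self.w_out, hidden)
        return [xi + di for xi, di in zip(x, delta)]


if __name__ == "__main__":
    branch = ToyActionBranch(dim=4, hidden=8)
    x = [0.5, -1.0, 2.0, 0.25]
    action = [1.0, 0.0, 0.0, 1.0]
    print("identity at init:", branch(x, action) == x)
```

Once any entry of `w_out` moves away from zero during training, the output starts to depend on the action input.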
SCOPE/
├── inference.py                 # Inference entry point
├── pyproject.toml               # Package configuration
├── README.md
├── assets/                      # Documentation assets
├── examples/                    # Ready-to-run inference examples
│   ├── example_0/               # It Takes Two
│   ├── example_1/               # Genshin Impact
│   └── example_2/               # Black Myth: Wukong
└── diffsynth/                   # Core inference library
    ├── models/
    │   └── scope_dit.py         # DiT with ActionModule
    ├── pipelines/
    │   └── scope_pipeline.py    # Video generation pipeline
    ├── diffusion/               # Flow-matching scheduler
    ├── core/                    # Model loading & VRAM management
    └── utils/                   # Video I/O utilities
| Resource | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 24 GB (with CPU offload) | 80 GB (A100/H200) |
| System RAM | 32 GB | 64 GB |
| Disk Space | ~45 GB | N/A |
The model is trained on CrossFPS, the first multi-game FPS dataset with frame-aligned action telemetry:
| Property | Value |
|---|---|
| Games | 7 diverse FPS titles |
| Total Clips | 69,000+ |
| Action Dimensions | 10-DoF (6 buttons + 4D joystick) |
| Annotation | Frame-aligned action telemetry |
| Curation | Gameplay-bias removal for general visual-to-action mapping |
This project is licensed under the Apache License 2.0.
We would like to express our gratitude to the following open-source projects for their invaluable contributions:
- DiffSynth-Studio: Base diffusion framework
- Wan2.2: Pretrained video generation model
- UMT5-XXL: Multilingual text encoder
If you find this work useful for your research, please cite our paper:
@article{scope2026,
title={SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models},
author={Zizhao Tong and Hongfeng Lai and Zeqing Wang and Zhaohu Xing and Kexu Cheng and Haoran Xu and Zhao Pu and Shangwen Zhu and Ruili Feng and Jian Zhao and Yan Zhang and Hao Tang and Yeying Jin and Ling Shao},
year={2026}
}


