KIVI: Knowledge-Intensive Video Generation

KIVI evaluates text-to-video models on factuality and helpfulness — shifting the question from "Does the video look good?" to "Does the video communicate correct and useful information?"

Given a short instructional prompt (e.g., "How to set up cellular service on a Google Pixel 10"), models must generate videos that are factually accurate and practically useful. KIVI-Bench includes 1,080 prompts across 18 knowledge-intensive categories, with automatic metrics that achieve ~70% agreement with human evaluation.

Installation

git clone <repo-url>
cd KIVI
conda create -n kivi python=3.10 -y && conda activate kivi
pip install -r requirements.txt
conda install ffmpeg

API keys. Set your LLM API key for script generation and evaluation:

export OPENAI_API_KEY={your_key}

The pipeline uses the OpenAI-compatible API format. By default it calls Gemini 3.1 Pro, but any provider that exposes an OpenAI-compatible endpoint can be used (OpenAI, Gemini API, etc.).

For API-based video generation models, set the provider-specific key:

Model	Variable
Seedance 2.0	`ARK_API_KEY`
HappyHorse 1.0	`DASHSCOPE_API_KEY`
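
Before launching a long run, it can help to confirm that the relevant keys are set. The following is a minimal sketch, not part of KIVI: the variable names come from the table above, while `PROVIDER_KEYS` and `missing_keys` are hypothetical names introduced for illustration.

```python
import os

# Variable names are taken from the table above. The dictionary layout and
# the function name are illustrative and do not exist in the KIVI codebase.
PROVIDER_KEYS = {
    "Seedance 2.0": "ARK_API_KEY",
    "HappyHorse 1.0": "DASHSCOPE_API_KEY",
}
LLM_KEY = "OPENAI_API_KEY"  # used for script generation and evaluation


def missing_keys(model, env=None):
    """Return the required environment variables that are unset or empty."""
    env = os.environ if env is None else env
    required = [LLM_KEY]
    if model in PROVIDER_KEYS:
        required.append(PROVIDER_KEYS[model])
    return [name for name in required if not env.get(name)]
```

Calling `missing_keys("Seedance 2.0")` returns an empty list when both `OPENAI_API_KEY` and `ARK_API_KEY` are exported in the current shell.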

Model Repositories

Clone the code for the model(s) you need into video_generation_models/, then download their weights to models_cache/ following each model's official documentation.

Model	Code Repository	Checkpoint (HuggingFace)
Wan 2.2	`git clone https://github.com/Wan-Video/Wan2.2 video_generation_models/Wan2.2`	`Wan-AI/Wan2.2-T2V-A14B`, `Wan-AI/Wan2.2-I2V-A14B`
HunyuanVideo 1.5	`git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5 video_generation_models/HunyuanVideo-1.5`	`Tencent-Hunyuan/HunyuanVideo-1.5`
Helios-Base	`git clone https://github.com/PKU-YuanGroup/Helios video_generation_models/Helios`	`PKU-YuanGroup/Helios-Base`
LongCat-Video	`git clone https://github.com/meituan-longcat/LongCat-Video video_generation_models/LongCat-Video`	`meituan-longcat/LongCat-Video` (foundational)
LongLive 1.0	`git clone https://github.com/NVlabs/LongLive video_generation_models/LongLive`	`NVlabs/LongLive` (base model + LoRA)

Seedance 2.0 and HappyHorse 1.0 are API-based and require neither code repositories nor local weights.

Usage

List available models:

python run_evaluation.py --list-models

Main Arguments

Argument	Description	Default
`--model {model}`	Model name to evaluate (see `--list-models`)	required
`--step {stage}`	Pipeline stage: `all`, `script`, `generate`, `extract`, `verify`, `score`	`all`
`--gpu {ids}`	GPU device IDs (e.g., `0` or `0,1` for multi-GPU)	`0`
`--category {name}`	Filter by category (e.g., `"Cars_Other_Vehicles"`). Runs all categories if omitted.	`None`
`--prompt-index {n}`	Filter to a specific prompt within a category (1-based). Requires `--category`.	`None`
`--prompts-json {path}`	Path to prompts JSON file	`experiment_prompts.json`
`--video-path {path}`	Evaluate a user-provided video. Requires `--prompt`. Skips script/generate.	`None`
`--prompt {text}`	Prompt text when using `--video-path`	`None`
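
The table implies two dependencies between flags (`--prompt-index` needs `--category`, and `--video-path` needs `--prompt`). As a rough illustration of how such an interface could be expressed with `argparse`, here is a hypothetical reconstruction. It is not the parser in `run_evaluation.py`, and `build_parser` and `check_dependencies` are names made up for this sketch.

```python
import argparse

STAGES = ["all", "script", "generate", "extract", "verify", "score"]


def build_parser():
    # Flags and defaults are copied from the table above; everything else
    # is an assumption about how the real script might be organized.
    p = argparse.ArgumentParser(prog="run_evaluation.py")
    # The table marks --model as required, but the custom-video example
    # omits it, so this sketch leaves it unenforced.
    p.add_argument("--model")
    p.add_argument("--step", choices=STAGES, default="all")
    p.add_argument("--gpu", default="0")
    p.add_argument("--category", default=None)
    p.add_argument("--prompt-index", type=int, default=None)
    p.add_argument("--prompts-json", default="experiment_prompts.json")
    p.add_argument("--video-path", default=None)
    p.add_argument("--prompt", default=None)
    return p


def check_dependencies(args):
    """Return the constraints from the Description column that are violated."""
    problems = []
    if args.prompt_index is not None and args.category is None:
        problems.append("--prompt-index requires --category")
    if args.prompt_index is not None and args.prompt_index < 1:
        problems.append("--prompt-index is 1-based")
    if args.video_path is not None and args.prompt is None:
        problems.append("--video-path requires --prompt")
    return problems
```

Collecting the problems in a list, rather than exiting on the first one, lets a caller report every conflicting flag at once.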

Full Pipeline

python run_evaluation.py --model {model} --gpu 0

This runs all five stages in order: outline + script → video generation → claim extraction → claim verification → scoring.

Individual Stages

Each stage reads and writes cached intermediates on disk, so any stage can be run or resumed independently:

python run_evaluation.py --model {model} --step script              # outline + script (LLM only)
python run_evaluation.py --model {model} --step generate --gpu 0     # video generation
python run_evaluation.py --model {model} --step extract             # claim extraction
python run_evaluation.py --model {model} --step verify              # claim verification
python run_evaluation.py --model {model} --step score               # compute final scores
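
To resume from a particular stage without retyping each command, the invocations above can be generated programmatically. This is a small sketch under the assumption that the stages run in the order listed; `stage_commands` is a hypothetical helper, not something shipped with KIVI.

```python
import sys

# Stage names and their order follow the commands listed above.
STAGES = ["script", "generate", "extract", "verify", "score"]


def stage_commands(model, gpu="0", start="script"):
    """Build one argv list per stage, beginning at `start`."""
    if start not in STAGES:
        raise ValueError(f"unknown stage: {start}")
    commands = []
    for stage in STAGES[STAGES.index(start):]:
        argv = [sys.executable, "run_evaluation.py", "--model", model, "--step", stage]
        if stage == "generate":
            # Only the generate example above passes --gpu.
            argv += ["--gpu", gpu]
        commands.append(argv)
    return commands
```

Each returned list can be passed to `subprocess.run(argv, check=True)` so that a failing stage stops the sequence.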

Custom Video Evaluation

Evaluate your own video without running model generation:

python run_evaluation.py --video-path {path/to/video.mp4} --prompt "{your_video_prompt}"

Results are saved to evaluation/ next to the video file.

Output Structure

outputs/{model}/{category}/Q{idx}_{prompt}/
├── final_video.mp4
├── outline.json
├── segment_prompts.json
└── evaluation/
    ├── extracted_claims.json
    ├── verification_results.json
    ├── helpfulness_score.json
    └── score.json
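
Because every stage leaves files in this layout, a quick existence check can suggest where an interrupted run stopped. The sketch below uses the file names from the tree above, but the mapping from stage to file is an assumption inferred from those names, and `first_incomplete_stage` is a hypothetical helper.

```python
from pathlib import Path

# Relative paths come from the tree above. Which stage writes which file is
# an assumption inferred from the file names, not confirmed by the source.
EXPECTED = {
    "script": ["outline.json", "segment_prompts.json"],
    "generate": ["final_video.mp4"],
    "extract": ["evaluation/extracted_claims.json"],
    "verify": ["evaluation/verification_results.json"],
    "score": ["evaluation/helpfulness_score.json", "evaluation/score.json"],
}


def first_incomplete_stage(prompt_dir):
    """Return the first stage with a missing file, or None if all files exist."""
    root = Path(prompt_dir)
    for stage, files in EXPECTED.items():
        if not all((root / name).is_file() for name in files):
            return stage
    return None
```

The result can be passed to `--step` to rerun just that stage for the affected prompt directory.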

Prompt Sets

Two prompt sets are provided in the repository root:

File	Description
`experiment_prompts.json`	54 prompts (3 per category), used in the paper experiments. Default.
`kivi_bench_prompts.json`	Full KIVI-Bench: 1,080 prompts across 18 categories.

Switch with --prompts-json kivi_bench_prompts.json.

Evaluation Metrics

Factual Precision (FactP)

FactP = correct claims / total claims × 100%

Claim Extraction: An LLM reviews the generated video and extracts atomic, externally verifiable statements about what the video depicts.
Claim Verification: Each claim is verified against world knowledge and classified as Correct, Incorrect, or Uncertain.
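
As a worked example of the formula, the sketch below computes FactP from a list of verification labels. It assumes that Uncertain claims count toward the total but not toward the correct claims, which is a literal reading of "total claims"; the actual treatment in KIVI may differ. `fact_precision` is a hypothetical name.

```python
def fact_precision(labels):
    """FactP in percent: correct claims / total claims * 100.

    Assumption: "Uncertain" claims stay in the denominator.
    """
    if not labels:
        raise ValueError("no claims to score")
    correct = sum(1 for label in labels if label == "Correct")
    return 100.0 * correct / len(labels)


# 7 correct, 2 incorrect, 1 uncertain out of 10 claims.
labels = ["Correct"] * 7 + ["Incorrect"] * 2 + ["Uncertain"]
print(fact_precision(labels))  # 70.0
```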

Helpfulness Score (HelpS)

HelpS = (Relevance + Completeness + Clarity) / 3 × 100%

Dimension	Description
Relevance	Does the video directly address the prompt and its explicit constraints?
Completeness	Are the key steps and required information covered?
Clarity	Is the procedure easy to follow, with logical sequence and clear pacing?
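
The HelpS formula can be illustrated the same way. This sketch assumes each dimension is already normalized to the range 0 to 1 before averaging; the source does not state the raw rating scale, so that normalization and the name `helpfulness_score` are assumptions.

```python
def helpfulness_score(relevance, completeness, clarity):
    """HelpS in percent: mean of the three dimensions * 100.

    Assumption: each dimension is a value in [0, 1].
    """
    scores = {"relevance": relevance, "completeness": completeness, "clarity": clarity}
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return 100.0 * sum(scores.values()) / 3


print(helpfulness_score(1.0, 0.5, 0.75))  # 75.0
```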

Results

Results on the 54-prompt subset using Gemini 3.1 Pro Preview as the evaluator:

Model	FactP (%)	HelpS (%)
Human (reference)	97.8	81.9
Seedance 2.0	81.6	66.6
HappyHorse 1.0	83.2	61.6
Wan 2.2	73.1	48.4
HunyuanVideo 1.5	63.2	32.9
Helios-Base	64.2	27.0
LongCat-Video	50.8	15.3
LongLive 1.0	46.5	15.9

Adding a New Video Generation Model

Four generation modes are supported: segment (T2V+I2V), interactive (prompt switching), single_prompt, and api (REST).

cp configs/_template_segment.yaml configs/{my-model}.yaml
# edit the YAML — fill in name, mode, code_dir, model_path, and generation commands
python run_evaluation.py --model {my-model} --gpu 0
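
A new config can be sanity-checked before the first run. The sketch below validates an already-parsed config dictionary using the field names and modes mentioned above. Two things are assumptions: that `api` configs need neither `code_dir` nor `model_path` (inferred from the note that API-based models require no code or weights), and the helper name `config_problems`. The key used for generation commands is not given in the source, so it is not checked.

```python
# Mode names are taken from the list of supported generation modes above.
MODES = {"segment", "interactive", "single_prompt", "api"}


def config_problems(cfg):
    """Return a list of problems found in a parsed model config dict."""
    required = ["name", "mode"]
    mode = cfg.get("mode")
    if mode != "api":
        # Assumption: only locally run models need code and weights paths.
        required += ["code_dir", "model_path"]
    problems = [f"missing field: {key}" for key in required if not cfg.get(key)]
    if mode and mode not in MODES:
        problems.append(f"unknown mode: {mode}")
    return problems
```

The dictionary would typically come from loading the YAML file, for example with `yaml.safe_load` if PyYAML is installed.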

Repository Structure

├── run_evaluation.py            # Main entry point
├── compare_scores.py            # Cross-model score comparison
├── kivi/                        # Core framework
│   ├── llm_client.py            # LLM API client
│   ├── generation/              # Video generation (outline, script, models)
│   └── evaluation/              # Claim extraction, verification, helpfulness
├── configs/                     # Per-model YAML configurations
├── prompts/                     # LLM prompt templates
├── video_generation_models/     # Model code (clone manually)
├── outputs/                     # Evaluation outputs
└── models_cache/                # Downloaded model weights (gitignored)

Citation

@article{wang2026kivi,
  title={KIVI: Knowledge-Intensive Video Generation},
  author={Wang, Chenxu and Chen, Mingda},
  journal={arXiv preprint},
  year={2026}
}
