This repository contains:
- UVE: A framework that uses MLLMs to evaluate any aspect of AI-generated videos.
- Evaluation on UVE-Bench: A benchmark that assesses the ability of MLLMs to evaluate AI-generated videos.
To begin with, set up the environment:
```bash
bash scripts/setup_env.sh
```
Evaluate pre-defined aspects:
```python
from uve import UVE

# Initialize the evaluator
evaluator = UVE(model_name='qwen2-vl-7b', max_num_frames=16)

# Evaluate the subject structural correctness of a single video
video_path = 'example_videos/mochi_00002.mp4'
result = evaluator.evaluate(video_path, aspect='structural_correctness')

# Evaluate the video-text alignment of a single video
video_path = 'example_videos/mochi_00002.mp4'
result = evaluator.evaluate(video_path, aspect='tv_alignment', video_prompt='a man wearing red hat staring at the camera')

# Compare the subject structural correctness of a video pair
video_path1 = 'example_videos/mochi_00002.mp4'
video_path2 = 'example_videos/OpenSora1.2_00002.mp4'
result = evaluator.evaluate([video_path1, video_path2], aspect='structural_correctness', eval_mode='pairwise')

# Compare the dynamic degree of a video pair
video_path1 = 'example_videos/mochi_00002.mp4'
video_path2 = 'example_videos/OpenSora1.2_00002.mp4'
result = evaluator.evaluate([video_path1, video_path2], aspect='dynamic_degree', eval_mode='pairwise')
```
Evaluate customized aspects:
```python
from uve import UVE

# Initialize the evaluator and the customized settings
evaluator = UVE(model_name='qwen2-vl-7b', max_num_frames=16)
custom_prompt = "Does the video contain sexual or violent material?\nPlease directly answer yes or no:"
pos_tokens = ['yes', 'Yes', 'YES']
neg_tokens = ['no', 'No', 'NO']

video_path = 'example_videos/mochi_00002.mp4'
result = evaluator.evaluate(video_path, eval_mode='single_soft_custom', custom_prompt=custom_prompt, pos_tokens=pos_tokens, neg_tokens=neg_tokens)
```
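For intuition, soft scores of this kind are typically derived from the probability mass the MLLM assigns to the positive versus the negative answer tokens at the answer position. The sketch below illustrates that computation in isolation; it is an assumption about the scoring rule for illustration purposes, not UVE's actual internals.

```python
import math

# Illustrative sketch (assumed scoring rule, not UVE's actual code):
# turn next-token logits at the answer position into a soft rating in [0, 1].
def soft_score(token_logits, pos_tokens, neg_tokens):
    pos = sum(math.exp(token_logits[t]) for t in pos_tokens if t in token_logits)
    neg = sum(math.exp(token_logits[t]) for t in neg_tokens if t in token_logits)
    return pos / (pos + neg)  # share of probability mass on the positive tokens

# Hypothetical logits for the candidate answer tokens
logits = {'yes': 2.1, 'Yes': 1.4, 'no': -0.3, 'No': -1.0}
print(soft_score(logits, ['yes', 'Yes', 'YES'], ['no', 'No', 'NO']))  # ~0.92
```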
You can also run the evaluation via the provided shell script:
```bash
bash scripts/eval_example.sh
```
- model_name: The name of the MLLM model. Currently supported models are: qwen2-vl-2b, qwen2-vl-7b, qwen2-vl-72b, internvl-2.5-2b-mpo, internvl-2.5-4b-mpo, internvl-2.5-8b-mpo, internvl-2.5-26b-mpo, internvl-2.5-38b-mpo, internvl-2.5-78b-mpo, llava-onevision-0.5b, llava-onevision-7b, llava-onevision-72b, llava-video-7b, llava-video-72b, minicpm-v-2.6, gpt4o, videoscore, videoscore-v1.1
- max_num_frames: The maximum number of frames to sample from each video.
- video_path: The path to the video file. For single-video evaluation it is a string; for pairwise evaluation it is a list of two strings.
- video_prompt (optional): The text prompt used to generate the video.
- custom_prompt (optional): The prompt for customized evaluation.
- eval_mode: The evaluation mode (see the usage sketch after the aspect table below). Supported modes are:

| Eval Mode | Description |
|---|---|
| single_soft_yn | Single-video evaluation, using the yes/no token probability as the rating score |
| single_soft_good_bad | Single-video evaluation, using the good/bad token probability as the rating score |
| single_soft_adaptive | Single-video evaluation, adaptively using the yes/no or good/bad token probability as the rating score |
| single_soft_custom | Single-video evaluation, using customized token probabilities as the rating score |
| single_soft_reg-avg | Single-video evaluation using VideoScore, averaged over its 5 dimensions (aspects) |
| single_soft_reg-dim | Single-video evaluation using VideoScore, with a separate score for each of its 5 dimensions (aspects) |
| single_hard | Single-video evaluation, prompting the MLLM to predict the rating score in text form |
| pairwise | Video pair comparison |
| pairwise_no_vid_index | Video pair comparison without video order indices |
- aspect: The aspect to evaluate. Pre-defined aspects are:

| Aspect | Description |
|---|---|
| tv_alignment | Overall video-text alignment |
| tv_alignment_appearance | Video-text alignment in terms of appearance |
| tv_alignment_motion | Video-text alignment in terms of motion |
| static_visual_quality | Overall visual quality of each individual frame |
| aesthetic_quality | Aesthetic visual quality of each individual frame |
| technical_quality | Technical visual quality of each individual frame, focusing on artifacts such as noise, blur, and distortion |
| structural_correctness | Structural correctness of the subjects in each individual frame |
| dynamic_degree | Overall dynamic degree of the video |
| subject_motion_degree | Dynamic degree in terms of subject motion |
| camera_motion_degree | Dynamic degree in terms of camera motion |
| light_change | Dynamic degree in terms of changes in lighting conditions and colors |
| temporal_visual_quality | Overall visual quality from the temporal perspective |
| appearance_consistency | Subject and background appearance consistency |
| flickering | Whether the video is free of unwanted temporal flickering and jittering that degrade visual quality |
| motion_naturalness | Whether the motion and interactions between subjects are natural and adhere to physical laws |
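As a quick illustration of how eval_mode and aspect combine, here is a minimal sketch reusing the evaluator from the examples above. It assumes evaluate() accepts eval_mode for single videos (as in the custom example) and returns a scalar score in single-video modes and a preference in pairwise modes; the exact return format may differ.

```python
from uve import UVE

evaluator = UVE(model_name='qwen2-vl-7b', max_num_frames=16)
video_path = 'example_videos/mochi_00002.mp4'

# Rate one video on several pre-defined aspects (names from the table above)
for aspect in ['aesthetic_quality', 'technical_quality', 'dynamic_degree']:
    score = evaluator.evaluate(video_path, aspect=aspect, eval_mode='single_soft_adaptive')
    print(aspect, score)

# Ask the MLLM for a textual rating instead of a token-probability score
score = evaluator.evaluate(video_path, aspect='aesthetic_quality', eval_mode='single_hard')

# Compare a pair without exposing the video order index to the model
pair = ['example_videos/mochi_00002.mp4', 'example_videos/OpenSora1.2_00002.mp4']
result = evaluator.evaluate(pair, aspect='motion_naturalness', eval_mode='pairwise_no_vid_index')
```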
Figure: Overview of UVE-Bench. (a) The distribution of video sources. (b) The distribution of data examples over the 15 fine-grained AIGV evaluation aspects. (c) The distribution of human preferences over the four categories. (d) Data examples illustrating how the human preference annotations support both single-video rating and video-pair comparison.
UVE-Bench is a benchmark designed to assess the ability of MLLMs to evaluate AI-generated videos. It consists of 1,230 videos and human-annotated pairwise preferences covering 15 fine-grained AIGV evaluation aspects.
For example:
```json
{
    "video_a": "moviegen_480p/moviegen_480p_00000.mp4",
    "video_b": "mochi/mochi_00000.mp4",
    "prompt": null,
    "preference": "B is better",
    "aspect": "dynamic_degree",
    "subaspects": [
        "dynamic_degree",
        "subject_motion_degree"
    ],
    "dataset": "movie_gen_video_bench"
}
```
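To give a sense of how these annotations can be consumed, here is a minimal sketch that scores a model's pairwise predictions against the human preferences. The annotation file name, the baseline predictor, and the 'A is better' label are assumptions for illustration (only 'B is better' appears in the example above).

```python
import json
import random

# Hypothetical annotation file holding a list of records like the example above
with open('uve_bench_annotations.json') as f:
    annotations = json.load(f)

def model_prefers(anno):
    # Placeholder baseline: replace with real model predictions
    return random.choice(['A is better', 'B is better'])

# Score only the examples with a clear winner
pairs = [a for a in annotations if a['preference'].endswith('is better')]
correct = sum(model_prefers(a) == a['preference'] for a in pairs)
print(f'pairwise accuracy: {correct / len(pairs):.3f}')
```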
Video Preparation
Download the videos from this link to the folder `uve_bench_videos/`.
Evaluate MLLMs
```bash
# Single video rating
bash scripts/eval_uve_bench_single.sh

# Video pair comparison
bash scripts/eval_uve_bench_pair.sh
```
Evaluate VBench Metrics
- Download the VBench models according to this link.
- Set up the VBench environment:
  ```bash
  bash scripts/setup_vbench.sh
  ```
- Convert the UVE-Bench annotations to VBench format:
  ```bash
  python3 anno2vbench_info.py
  ```
- Evaluate VBench metrics:
  ```bash
  bash scripts/eval_vbench.sh
  ```