Accepted by ACM Multimedia 2025
🥈 2nd Place Winner in the Identity-Preserving Video Generation Challenge!
Our spatial-temporal decoupled approach was validated through rigorous competition evaluation, placing 2nd in the identity-preserving video generation challenge.
This repository contains the official implementation of our paper "Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations". Our method addresses the challenge of maintaining identity consistency in text-to-video generation through a novel spatial-temporal decoupled approach.
Our implementation is built upon two main repositories:
- Wan2.2 - For video generation
- ComfyUI-HyperLoRA - For identity-preserving image generation
1. Install Wan2.2

```bash
git clone https://github.com/Wan-Video/Wan2.2.git
cd Wan2.2
# Follow the installation instructions in the Wan2.2 repository
```
2. Install ComfyUI-HyperLoRA

```bash
git clone https://github.com/bytedance/ComfyUI-HyperLoRA.git
cd ComfyUI-HyperLoRA
# Follow the installation instructions in the ComfyUI-HyperLoRA repository
```
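With both upstream repositories set up, this project's own Python dependencies can be installed as well; a minimal sketch, assuming a single shared environment (the packages themselves are listed in `requirements.txt`, see the project structure below):

```bash
# Install this repository's Python dependencies
pip install -r requirements.txt
```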
Download the required model checkpoints:
- Wan2.2 Models: Download from Wan2.2 model hub
- HyperLoRA Models: Download from HyperLoRA releases
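If the checkpoints are hosted on Hugging Face, the hub CLI is one convenient way to fetch them; a sketch where the repo id `Wan-AI/Wan2.2-TI2V-5B` is one published Wan2.2 variant and the local directory is a placeholder (check the model hub pages for the exact repo ids, especially for the HyperLoRA weights):

```bash
# Example: download the Wan2.2 TI2V-5B checkpoint from Hugging Face
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir ./Wan2.2-TI2V-5B
```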
Our work also supports a ComfyUI workflow for a visual interface experience:

1. Set up the ComfyUI environment (same prerequisites as above)

2. Install the custom node (a minimal sketch of the node-file shape follows this list):

```bash
# Copy our custom node to ComfyUI
cp ./ComfyUI/qwen3_prompt_enhancer.py ./ComfyUI/custom_nodes/
```

3. Load the workflow
   - Open the ComfyUI interface
   - Load the provided workflow file: `HyperLoRA-WAN2.2.json`
   - Experience the complete pipeline through the visual interface
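For reference, ComfyUI discovers custom nodes by importing modules from `custom_nodes/` and reading their `NODE_CLASS_MAPPINGS` dictionary. The sketch below shows the general shape such a node file takes; it is illustrative only, not the actual contents of `qwen3_prompt_enhancer.py`:

```python
# Illustrative shape of a ComfyUI custom node (not the actual
# qwen3_prompt_enhancer.py). ComfyUI registers whatever appears in
# NODE_CLASS_MAPPINGS when it imports this module from custom_nodes/.
class Qwen3PromptEnhancer:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"prompt": ("STRING", {"multiline": True})}}

    RETURN_TYPES = ("STRING",)
    FUNCTION = "enhance"
    CATEGORY = "conditioning"

    def enhance(self, prompt):
        # The real node would run Qwen3 here; this sketch just passes through.
        return (prompt,)

NODE_CLASS_MAPPINGS = {"Qwen3PromptEnhancer": Qwen3PromptEnhancer}
```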
You can use our method in two ways: through the command-line interface or the ComfyUI visual interface.
Our pipeline consists of three main stages:
Generate optimized T2I and I2V prompts using spatial-temporal decoupled representations:
```bash
python prompt_decoupled.py \
    --input_dir path/to/vip200k/raw/ \
    --output_dir path/to/vip200k \
    --model_name "Qwen/Qwen3-8B" \
    --cache_dir path/to/model
```
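For intuition, the decoupling step can be sketched with plain Hugging Face `transformers`: the LLM rewrites one video prompt into a spatial description (first-frame appearance, consumed by the T2I stage) and a temporal description (motion, consumed by the I2V stage). This is a minimal illustration, not the actual `prompt_decoupled.py`; the system prompt and output format are assumptions:

```python
# Illustrative sketch of spatial-temporal prompt decoupling with Qwen3-8B.
# Not the repository's implementation; the instruction text is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "system", "content": (
        "Rewrite the user's video prompt as two lines:\n"
        "SPATIAL: <static appearance of the first frame>\n"
        "TEMPORAL: <motion and camera behaviour>")},
    {"role": "user", "content": "A woman in a red coat walks through falling snow."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens (the SPATIAL/TEMPORAL lines)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```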
Generate the first frame using HyperLoRA with identity preservation:

```bash
python hyperlora_t2i.py \
    --input_dir path/to/vip200k \
    --output_dir path/to/vip200k
```
Generate the final videos using Wan2.2:

```bash
python infer_Wan2.2.py \
    --csv_file /path/to/i2v_prompts.csv \
    --output_dir /path/to/videos \
    --task ti2v-5B
```

Supported tasks:
- `ti2v-5B`: Text+Image to Video (5B model)
- `i2v-A14B`: Image to Video (14B model)
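To use the larger image-to-video model instead, only the `--task` flag changes; a sketch, assuming both tasks accept the same CSV format:

```bash
python infer_Wan2.2.py \
    --csv_file /path/to/i2v_prompts.csv \
    --output_dir /path/to/videos \
    --task i2v-A14B
```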
Use our data collection script to prepare your dataset:
```bash
python collect_data.py
```

Here's a complete example workflow:
```bash
# 1. Generate optimized prompts
python prompt_decoupled.py \
    --input_dir path/to/vip200k/raw/ \
    --output_dir path/to/vip200k

# 2. Generate first frames
python hyperlora_t2i.py \
    --input_dir path/to/vip200k \
    --output_dir path/to/vip200k

# 3. Prepare data for video generation
python collect_data.py

# 4. Generate videos
python infer_Wan2.2.py \
    --csv_file ./vip200k/data_refined.csv \
    --output_dir ./final_videos
```

For users who prefer a visual workflow:
- **Load the Custom Node**: Ensure `qwen3_prompt_enhancer.py` is in your ComfyUI `custom_nodes` directory
- **Open ComfyUI**: Start your ComfyUI interface
- **Load Workflow**: Import the provided `HyperLoRA-WAN2.2.json` workflow file
- **Configure Parameters**: Set your input images and prompts in the visual interface
- **Execute**: Run the complete pipeline through the visual workflow
```
MM-IPVG/
├── README.md                    # This file
├── requirements.txt             # Python dependencies
├── prompt_decoupled.py          # Stage 1: Prompt optimization
├── hyperlora_t2i.py             # Stage 2: First frame generation
├── infer_Wan2.2.py              # Stage 3: Video generation
├── collect_data.py              # Data preparation utility
├── prompt_optimizer.py          # Prompt optimization utilities
├── ComfyUI/
│   ├── qwen3_prompt_enhancer.py # ComfyUI custom node
│   └── HyperLoRA-WAN2.2.json    # ComfyUI workflow file
├── utils/                       # Utility functions
└── vip200k/                     # Example data and configs
```
If you find this work useful, please cite our paper:
```bibtex
@article{wang2025identity,
  title={Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations},
  author={Wang, Yuji and Li, Moran and Hu, Xiaobin and Yi, Ran and Zhang, Jiangning and Feng, Han and Cao, Weijian and Wang, Yabiao and Wang, Chengjie and Ma, Lizhuang},
  journal={arXiv e-prints},
  pages={arXiv--2507},
  year={2025}
}
```

This work builds upon:
- Wan2.2 for video generation capabilities
- ComfyUI-HyperLoRA for identity-preserving image generation
- The open-source community for various tools and libraries
We are grateful to the organizers of the Identity-Preserving Video Generation Challenge for providing a platform to evaluate and showcase our method.
This project is licensed under the MIT License - see the LICENSE file for details.