Accepted by ACM Multimedia 2025
🥈 2nd Place Winner in the Identity-Preserving Video Generation Challenge!
Our spatial-temporal decoupled approach was validated through rigorous competition evaluation, placing 2nd in the identity-preserving video generation challenge.
This repository contains the official implementation of our paper "Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations". Our method addresses the challenge of maintaining identity consistency in text-to-video generation through a novel spatial-temporal decoupled approach.
Our implementation is built upon two main repositories:
- Wan2.2 - For video generation
- ComfyUI-HyperLoRA - For identity-preserving image generation
1. Install Wan2.2

```bash
git clone https://github.com/Wan-Video/Wan2.2.git
cd Wan2.2
# Follow the installation instructions in the Wan2.2 repository
```
2. Install ComfyUI-HyperLoRA

```bash
git clone https://github.com/bytedance/ComfyUI-HyperLoRA.git
cd ComfyUI-HyperLoRA
# Follow the installation instructions in the ComfyUI-HyperLoRA repository
```
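With both upstream repositories set up, this project's own Python dependencies can be installed as well; a minimal sketch, assuming a single shared environment (the packages themselves are listed in `requirements.txt`, see the project structure below):

```bash
# Install this repository's Python dependencies
pip install -r requirements.txt
```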
Download the required model checkpoints:
- Wan2.2 Models: Download from Wan2.2 model hub
- HyperLoRA Models: Download from HyperLoRA releases
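If the checkpoints are hosted on Hugging Face, the hub CLI is one convenient way to fetch them; a sketch where the repo id `Wan-AI/Wan2.2-TI2V-5B` is one published Wan2.2 variant and the local directory is a placeholder (check the model hub pages for the exact repo ids, especially for the HyperLoRA weights):

```bash
# Example: download the Wan2.2 TI2V-5B checkpoint from Hugging Face
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir ./Wan2.2-TI2V-5B
```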
Our work also supports a ComfyUI workflow for a visual interface experience:

1. Set up the ComfyUI environment (same prerequisites as above)

2. Install the custom node (a minimal sketch of the node-file shape follows this list):

```bash
# Copy our custom node to ComfyUI
cp ./ComfyUI/qwen3_prompt_enhancer.py ./ComfyUI/custom_nodes/
```

3. Load the workflow
   - Open the ComfyUI interface
   - Load the provided workflow file: `HyperLoRA-WAN2.2.json`
   - Experience the complete pipeline through the visual interface
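For reference, ComfyUI discovers custom nodes by importing modules from `custom_nodes/` and reading their `NODE_CLASS_MAPPINGS` dictionary. The sketch below shows the general shape such a node file takes; it is illustrative only, not the actual contents of `qwen3_prompt_enhancer.py`:

```python
# Illustrative shape of a ComfyUI custom node (not the actual
# qwen3_prompt_enhancer.py). ComfyUI registers whatever appears in
# NODE_CLASS_MAPPINGS when it imports this module from custom_nodes/.
class Qwen3PromptEnhancer:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"prompt": ("STRING", {"multiline": True})}}

    RETURN_TYPES = ("STRING",)
    FUNCTION = "enhance"
    CATEGORY = "conditioning"

    def enhance(self, prompt):
        # The real node would run Qwen3 here; this sketch just passes through.
        return (prompt,)

NODE_CLASS_MAPPINGS = {"Qwen3PromptEnhancer": Qwen3PromptEnhancer}
```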
You can use our method in two ways: through the command-line interface or the ComfyUI visual interface.
Our pipeline consists of three main stages:
Generate optimized T2I and I2V prompts using spatial-temporal decoupled representations:
```bash
python prompt_decoupled.py \
    --input_dir path/to/vip200k/raw/ \
    --output_dir path/to/vip200k \
    --model_name "Qwen/Qwen3-8B" \
    --cache_dir path/to/model
```
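For intuition, the decoupling step can be sketched with plain Hugging Face `transformers`: the LLM rewrites one video prompt into a spatial description (first-frame appearance, consumed by the T2I stage) and a temporal description (motion, consumed by the I2V stage). This is a minimal illustration, not the actual `prompt_decoupled.py`; the system prompt and output format are assumptions:

```python
# Illustrative sketch of spatial-temporal prompt decoupling with Qwen3-8B.
# Not the repository's implementation; the instruction text is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "system", "content": (
        "Rewrite the user's video prompt as two lines:\n"
        "SPATIAL: <static appearance of the first frame>\n"
        "TEMPORAL: <motion and camera behaviour>")},
    {"role": "user", "content": "A woman in a red coat walks through falling snow."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens (the SPATIAL/TEMPORAL lines)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```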
Generate the first frame using HyperLoRA with identity preservation:

```bash
python hyperlora_t2i.py \
    --input_dir path/to/vip200k \
    --output_dir path/to/vip200k
```
Generate the final videos using Wan2.2:

```bash
python infer_Wan2.2.py \
    --csv_file /path/to/i2v_prompts.csv \
    --output_dir /path/to/videos \
    --task ti2v-5B
```

Supported tasks:
- `ti2v-5B`: Text+Image to Video (5B model)
- `i2v-A14B`: Image to Video (14B model)
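To use the larger image-to-video model instead, only the `--task` flag changes; a sketch, assuming both tasks accept the same CSV format:

```bash
python infer_Wan2.2.py \
    --csv_file /path/to/i2v_prompts.csv \
    --output_dir /path/to/videos \
    --task i2v-A14B
```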
Use our data collection script to prepare your dataset:
```bash
python collect_data.py
```

Here's a complete example workflow:
```bash
# 1. Generate optimized prompts
python prompt_decoupled.py \
    --input_dir path/to/vip200k/raw/ \
    --output_dir path/to/vip200k

# 2. Generate first frames
python hyperlora_t2i.py \
    --input_dir path/to/vip200k \
    --output_dir path/to/vip200k

# 3. Prepare data for video generation
python collect_data.py

# 4. Generate videos
python infer_Wan2.2.py \
    --csv_file ./vip200k/data_refined.csv \
    --output_dir ./final_videos
```

For users who prefer a visual workflow:
- **Load the Custom Node**: Ensure `qwen3_prompt_enhancer.py` is in your ComfyUI `custom_nodes` directory
- **Open ComfyUI**: Start your ComfyUI interface
- **Load Workflow**: Import the provided `HyperLoRA-WAN2.2.json` workflow file
- **Configure Parameters**: Set your input images and prompts in the visual interface
- **Execute**: Run the complete pipeline through the visual workflow
```
MM-IPVG/
├── README.md                    # This file
├── requirements.txt             # Python dependencies
├── prompt_decoupled.py          # Stage 1: Prompt optimization
├── hyperlora_t2i.py             # Stage 2: First frame generation
├── infer_Wan2.2.py              # Stage 3: Video generation
├── collect_data.py              # Data preparation utility
├── prompt_optimizer.py          # Prompt optimization utilities
├── ComfyUI/
│   ├── qwen3_prompt_enhancer.py # ComfyUI custom node
│   └── HyperLoRA-WAN2.2.json    # ComfyUI workflow file
├── utils/                       # Utility functions
└── vip200k/                     # Example data and configs
```
If you find this work useful, please cite our paper:
```bibtex
@article{wang2025identity,
  title={Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations},
  author={Wang, Yuji and Li, Moran and Hu, Xiaobin and Yi, Ran and Zhang, Jiangning and Feng, Han and Cao, Weijian and Wang, Yabiao and Wang, Chengjie and Ma, Lizhuang},
  journal={arXiv e-prints},
  pages={arXiv--2507},
  year={2025}
}
```

This work builds upon:
- Wan2.2 for video generation capabilities
- ComfyUI-HyperLoRA for identity-preserving image generation
- The open-source community for various tools and libraries
We are grateful to the organizers of the Identity-Preserving Video Generation Challenge for providing a platform to evaluate and showcase our method.
This project is licensed under the MIT License - see the LICENSE file for details.