FlashVSR-Pro is an enhanced, production-ready re-implementation of the real-time diffusion-based video super-resolution algorithm introduced in the paper "FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution".
Original Paper: Zhuang, J., Guo, S., Cai, X., Li, X., Liu, Y., Yuan, C., & Xue, T. (2025). FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution. arXiv preprint arXiv:2510.12747.
Paper Link: https://arxiv.org/abs/2510.12747
This project is not the official code release but an independent, refactored implementation focused on improved usability, additional features, and better compatibility for real-world deployment.
This project builds upon the core FlashVSR algorithm and introduces several key improvements:
- 🧩 Unified Inference Script: A single, parameterized `infer.py` script replaces the multiple original scripts (`full`, `tiny`, `tiny-long`), simplifying the user interface.
- 🎵 Audio Track Preservation: Automatically detects and preserves the audio track from the input video in the super-resolved output. (Inspiration for this feature was drawn from the FlashVSR_plus project.)
- 💾 Tiled Inference for Reduced VRAM: Implements a tiling mechanism for the DiT model, significantly lowering GPU memory requirements and enabling processing of higher-resolution videos or operation on GPUs with limited VRAM; a short sketch of the idea follows the feature list below. (The concept for tiled DiT inference was inspired by the FlashVSR_plus project.)
- 🐳 Optimized Docker Container: A fully configured Dockerfile that automatically sets up the complete environment, including Conda environment activation upon container startup.
- 🔧 Automated Block-Sparse-Attention Installation: Optimizes and automates the installation of the Block-Sparse-Attention backend within the Docker build process. This eliminates the manual compilation complexity encountered in the original implementation, ensuring a seamless setup experience. My specific improvements to Block-Sparse-Attention are documented in this PR: mit-han-lab/Block-Sparse-Attention#16.
- 🎨 Configurable VAE Decoders: Introduces a unified `VAEManager` supporting five different VAE decoder options (Wan2.1, Wan2.2, LightVAE, TAE_W2.2, LightTAE_HY1.5). This allows users to dynamically trade off between output quality, processing speed, and GPU memory (VRAM) usage based on their hardware and needs. See the detailed VAE Model Selection guide below.
- ⚡ Performance Optimizations: Comprehensive speed enhancements including optimized device transfers, suppressed verbose outputs, streamlined memory management, and reduced redundant operations. Achieves a 20-30% performance improvement for high-throughput video processing scenarios.
- Real-Time Performance: Achieves ~17 FPS for 768 × 1408 videos on a single A100 GPU.
- One-Step Diffusion: Efficient streaming framework based on a distilled one-step diffusion model.
- State-of-the-Art Quality: Combines Locality-Constrained Sparse Attention and a Tiny Conditional Decoder for high-fidelity results.
- Scalability: Reliably scales to ultra-high resolutions.
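For context on the tiled DiT inference mentioned above, the general recipe (common to most tiled-inference implementations) is to split each frame into overlapping tiles, run the model on each tile, and average the overlapping regions to hide seams. The sketch below only illustrates that idea; `tiled_apply` and `process_tile` are placeholder names, not the project's actual API:

```python
import torch

def tiled_apply(frame: torch.Tensor, process_tile, tile_size: int = 256, overlap: int = 24) -> torch.Tensor:
    """Apply `process_tile` to overlapping tiles of `frame` (C, H, W) and
    average the results where tiles overlap. Assumes `process_tile` keeps
    the spatial size unchanged (e.g., it operates in latent space)."""
    _, height, width = frame.shape
    out = torch.zeros_like(frame)
    weight = torch.zeros(1, height, width, device=frame.device, dtype=frame.dtype)
    step = tile_size - overlap
    for top in range(0, height, step):
        for left in range(0, width, step):
            bottom, right = min(top + tile_size, height), min(left + tile_size, width)
            tile = frame[:, top:bottom, left:right]
            out[:, top:bottom, left:right] += process_tile(tile)
            weight[:, top:bottom, left:right] += 1.0
    return out / weight  # every pixel is covered by at least one tile
```

Because only one tile is resident on the GPU at a time, peak VRAM scales with the tile size rather than with the full frame resolution, which is why shrinking `--tile-size` helps on smaller GPUs.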
FlashVSR-Pro supports multiple VAE decoders to optimize for your specific hardware and quality requirements.
| VAE Type | VRAM Usage | Speed | Best For |
|---|---|---|---|
| Wan2.1 | 8-12 GB | Baseline | High quality, moderate VRAM |
| Wan2.2 | 8-12 GB | Baseline | Best quality, highest VRAM (H100 recommended) |
| LightVAE_W2.1 | 4-5 GB | 2-3x faster | 8-16 GB VRAM, speed priority |
| TAE_W2.2 | 6-8 GB | 1.5x faster | Temporal consistency priority |
| LightTAE_HY1.5 | 3-4 GB | 3x faster | HunyuanVideo compatible, minimum VRAM |
FlashVSR-Pro has three inference modes, each compatible with specific VAE types:
| Mode | Compatible VAEs | Default VAE | Description |
|---|---|---|---|
| `full` | `wan2.1`, `wan2.2`, `light` | `wan2.1` | Full diffusion pipeline with VAE decoding for highest quality |
| `tiny` | `tcd`, `tae-hv`, `tae-w2.2` | `tcd` | Fast inference using the Tiny Conditional Decoder |
| `tiny-long` | `tcd`, `tae-hv`, `tae-w2.2` | `tcd` | Optimized for long videos with the Tiny Conditional Decoder |
Important: Each mode is strictly compatible with its designated VAE types. If you specify an incompatible VAE, the program will display a warning and automatically switch to the default VAE for that mode.
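In code terms, the fallback behaviour described above amounts to a simple lookup against the table. The snippet below is a minimal sketch of that logic; `MODE_VAES` and `resolve_vae` are illustrative names, not the actual identifiers in `infer.py`:

```python
import warnings

# Mode -> (compatible VAE types, default VAE type), per the table above.
MODE_VAES = {
    "full": ({"wan2.1", "wan2.2", "light"}, "wan2.1"),
    "tiny": ({"tcd", "tae-hv", "tae-w2.2"}, "tcd"),
    "tiny-long": ({"tcd", "tae-hv", "tae-w2.2"}, "tcd"),
}

def resolve_vae(mode: str, vae_type: str) -> str:
    """Return the requested VAE if it is compatible with `mode`, else warn and fall back."""
    compatible, default = MODE_VAES[mode]
    if vae_type not in compatible:
        warnings.warn(
            f"VAE '{vae_type}' is not compatible with mode '{mode}'; falling back to '{default}'."
        )
        return default
    return vae_type
```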
| Your VRAM | Recommended VAE | Additional Settings |
|---|---|---|
| 8 GB | LightTAE_HY1.5 or LightVAE_W2.1 | `--tile-vae`, `--tile-dit`, `--tile-size 128` |
| 12 GB | LightVAE_W2.1 or Wan2.1 | `--tile-vae` |
| 16 GB | Any VAE | Optional tiling for long videos |
| 24 GB+ | Wan2.2 (preferred) or Wan2.1 | Maximum quality, no restrictions |
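If you prefer to derive a starting point from the GPU you actually have, a rough helper based on the table above could look like the following. The thresholds simply mirror the table, and the suggested flag strings are assumptions for illustration, not output of the project itself (assumes a CUDA-capable PyTorch install):

```python
import torch

def recommend_flags(device: int = 0) -> str:
    """Suggest starting infer.py flags based on total GPU memory, per the table above."""
    vram_gb = torch.cuda.get_device_properties(device).total_memory / 1024**3
    if vram_gb <= 8:
        return "--vae-type lighttae-hy1.5 --tile-vae --tile-dit --tile-size 128"
    if vram_gb <= 12:
        return "--vae-type light --tile-vae"
    if vram_gb <= 16:
        return "--vae-type wan2.1"  # any VAE works; tiling is optional for long videos
    return "--vae-type wan2.2"      # maximum quality, no restrictions

print("Suggested flags:", recommend_flags())
```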
All VAE models are expected to be in the ./models/VAEs/ directory. If not found, you will need to download them manually:
| VAE Selection | File | Direct Download Link |
|---|---|---|
| Wan2.1 | `Wan2.1_VAE.pth` | Download |
| Wan2.2 | `Wan2.2_VAE.pth` | Download |
| LightVAE_W2.1 | `lightvaew2_1.pth` | Download |
| TAE_W2.2 | `taew2_2.safetensors` | Download |
| LightTAE_HY1.5 | `lighttaehy1_5.pth` | Download |
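A quick sanity check that the weights landed where the scripts expect them can be handy; the filenames and the `./models/VAEs/` location below are taken from the table and text above, and you only need the files for the VAE types you actually use:

```python
from pathlib import Path

# Filenames from the download table above.
EXPECTED = [
    "Wan2.1_VAE.pth",
    "Wan2.2_VAE.pth",
    "lightvaew2_1.pth",
    "taew2_2.safetensors",
    "lighttaehy1_5.pth",
]

vae_dir = Path("./models/VAEs")
for name in EXPECTED:
    status = "ok" if (vae_dir / name).exists() else "MISSING"
    print(f"{status:>8}  {vae_dir / name}")
```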
Usage Examples:

# High quality (full mode with Wan2.2)
python infer.py -i input.mp4 -o results --mode full --vae-type wan2.2
# Balanced quality/VRAM (full mode with Light VAE)
python infer.py -i input.mp4 -o results --mode full --vae-type light
# Fast inference (tiny mode with TCDecoder)
python infer.py -i input.mp4 -o results --mode tiny --vae-type tcd
# Custom VAE weights
python infer.py -i input.mp4 -o results --mode full --vae-type wan2.1 --vae-path ./custom/path/Wan2.1_VAE.pth

FlashVSR-Pro includes comprehensive performance enhancements designed for high-throughput production environments and real-time processing requirements.
- Optimized Device Transfers: Merged redundant tensor `.to()` operations into single calls, reducing device-transfer overhead by ~50% during data preprocessing and model loading (see the short illustration after this list).
- Suppressed Verbose Outputs: Automatically redirects diffsynth library output and removes detailed progress prints during tiled inference, eliminating I/O bottlenecks in high-performance scenarios.
- Streamlined Memory Management: Removed frequent GPU cache-clearing operations and optimized VAE memory usage by eliminating unnecessary encoder components, reducing memory fragmentation.
- Code Structure Improvements: Merged redundant validation checks, cached registry accesses, and optimized pipeline initialization for faster startup times.
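The device-transfer optimization is the standard PyTorch pattern of fusing a device move and a dtype cast into one call, which avoids materializing an intermediate full-precision copy on the GPU. A generic before/after illustration (not the project's exact code, and assuming a CUDA device is available):

```python
import torch

x = torch.randn(4, 3, 768, 1408)

# Before: two transfers -> an intermediate fp32 copy is created on the GPU first.
y = x.to("cuda").to(torch.float16)

# After: one fused call moves and casts in a single step.
y = x.to(device="cuda", dtype=torch.float16)
```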
| Optimization Area | Performance Impact | Use Case |
|---|---|---|
| Device Transfers | ~50% reduction in tensor movement overhead | Large video processing, batch operations |
| Output Suppression | Eliminates I/O blocking during inference | Real-time streaming, production deployment |
| Memory Management | Reduced GPU memory fragmentation | Long-running processes, high-resolution videos |
| Code Optimization | Faster initialization and validation | Frequent script execution, automated workflows |
# High-performance configuration for production use
python infer.py -i input.mp4 -o results \
--mode tiny \
--vae-type lighttae-hy1.5 \
--tile-dit \
--tile-vae \
--tile-size 256 \
--dtype fp16 \
--device cuda

Note: These optimizations are particularly beneficial for:
- Large-scale video processing pipelines
- Real-time streaming applications
- Production environments with high throughput requirements
- Systems with limited I/O bandwidth
The easiest way to run FlashVSR-Pro is using the provided Docker container, which includes automated setup for the Block-Sparse-Attention backend.
- Install Docker
- Install NVIDIA Container Toolkit for GPU support.
- Ensure `git-lfs` is installed on your host system to clone model weights.
git clone https://github.com/LujiaJin/FlashVSR-Pro.git
cd FlashVSR-Pro
docker build -t flashvsr-pro:latest .

Note: The Dockerfile automatically handles the compilation and installation of the optimized Block-Sparse-Attention backend, eliminating manual configuration.
Before running the container, download the required model weights.
# Download the main FlashVSR model weights (v1.1 is recommended)
git lfs clone https://huggingface.co/JunhaoZhuang/FlashVSR-v1.1 ./models/FlashVSR-v1.1
# or for v1
# git lfs clone https://huggingface.co/JunhaoZhuang/FlashVSR ./models/FlashVSR
# (Optional) Download VAE models to ./models/VAEs/ as per the table above.

The container is configured to automatically activate the flashvsr Conda environment upon startup. Make sure the host's models/ directory already contains the necessary model weight files, and mount it into the container at startup so the models are available inside.
# Basic run with interactive shell
docker run --gpus all -it --rm \
-v $(pwd):/workspace/FlashVSR-Pro \
flashvsr-pro:latest
# You will be dropped into a shell with the `(flashvsr)` environment active.
# Verify by running: `which python`

The main interface is the unified `infer.py` script.
# Basic upscaling (Tiny mode - balanced quality/speed)
python infer.py -i ./inputs/example0.mp4 -o ./results --mode tiny
# Full mode (Highest quality, requires more VRAM)
python infer.py -i ./inputs/example0.mp4 -o ./results --mode full
# Tiny-long mode for long videos
python infer.py -i ./inputs/example4.mp4 -o ./results --mode tiny-long

# 1. Preserve the audio track from the input video
python infer.py -i input_with_audio.mp4 -o ./results --mode tiny --keep-audio
# 2. Use tiled DiT inference to reduce VRAM usage (enables running on smaller GPUs)
python infer.py -i large_input.mp4 -o ./results --mode tiny --tile-dit --tile-size 256 --overlap 24
# 3. Use a specific VAE for optimal VRAM/quality trade-off
python infer.py -i input.mp4 -o ./results --mode full --vae-type light --tile-vae
# 4. Combine multiple enhancements
python infer.py -i large_input_with_audio.mp4 -o ./results --mode full --vae-type wan2.2 --tile-dit --keep-audio

| Argument | Description | Default |
|---|---|---|
| `-i`, `--input` | Path to input video or image folder. | Required |
| `-o`, `--output` | Output directory or file path. | `./results` |
| `--mode` | Inference mode: `full`, `tiny`, `tiny-long`. | `tiny` |
| `--vae-type` | VAE decoder type: `wan2.1`, `wan2.2`, `light`, `tcd`, `tae-hv`, `tae-w2.2`. | `wan2.1` (full), `tcd` (tiny/tiny-long) |
| `--vae-path` | Custom path to VAE weights file. | None |
| `--keep-audio` | Preserve audio from the input video (if present). | False |
| `--tile-dit` | Enable memory-efficient tiled DiT inference. | False |
| `--tile-vae` | Enable tiled decoding for the VAE. | False |
| `--tile-size` | Size of each tile when using tiling. | 256 |
| `--overlap` | Overlap between tiles to reduce seams. | 24 |
| `--scale` | Super-resolution scale factor. | 2.0 |
| `--seed` | Random seed for reproducible results. | 0 |
Note: The original FlashVSR is primarily designed and tested for 4x super-resolution. While other scales are supported, for optimal quality and stability, using --scale 4.0 is recommended.
For a full list of arguments, run python infer.py --help.
- Use `--tile-dit` and `--tile-vae` to enable tiled inference.
- Decrease `--tile-size` (e.g., from 256 to 128).
- Use `--dtype fp16` to reduce memory usage.
- Select a lighter `--vae-type` (e.g., `light` or `lighttae-hy1.5`).
- Ensure ffmpeg is installed on the host system.
- Check whether the input video contains an audio stream: `ffprobe -i input.mp4`
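If the audio still needs to be re-attached manually (for example, for an output produced without `--keep-audio`), a plain ffmpeg stream-copy remux is usually enough. A minimal sketch, assuming ffmpeg is on PATH and the output container accepts the source audio codec (file paths are placeholders):

```python
import subprocess

def remux_audio(upscaled: str, original: str, output: str) -> None:
    """Copy video from `upscaled` and audio from `original` without re-encoding."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", upscaled,      # video source (super-resolved output)
            "-i", original,      # audio source (original input)
            "-map", "0:v:0",     # first video stream from input 0
            "-map", "1:a:0?",    # first audio stream from input 1, if present
            "-c", "copy",        # stream copy, no re-encoding
            output,
        ],
        check=True,
    )

remux_audio("results/output.mp4", "inputs/input.mp4", "results/output_with_audio.mp4")
```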
- Make sure model weights are downloaded with git-lfs: `git lfs pull`
- Verify model file integrity.
- Check that VAE weights are in the correct directory: `./models/VAEs/`
- The Docker build process should handle this automatically. If building manually, ensure you have CUDA 12.1+ and the correct PyTorch version installed.
- See the fix I released in mit-han-lab/Block-Sparse-Attention#16.
FlashVSR-Pro/
├── .gitmodules                 # Git submodule configuration
├── Block-Sparse-Attention/     # Git submodule: Sparse attention backend (with automated build)
├── models/                     # Model weights directory
│   ├── FlashVSR/               # Model weights V1
│   ├── FlashVSR-v1.1/          # Model weights V1.1
│   ├── prompt_tensor/          # Pre-computed text prompt embeddings
│   └── VAEs/                   # VAE model weights (wan2.1, light, tae-hv, etc.)
├── diffsynth/                  # Core library (ModelManager, Pipelines)
├── inputs/                     # Default directory for input videos/images
├── results/                    # Default directory for output videos
├── utils/                      # Enhanced utilities module
│   ├── __init__.py
│   ├── utils.py                # Core utilities (Causal_LQ4x_Proj, etc.)
│   ├── TCDecoder.py            # Tiny Conditional Decoder for 'tiny' mode
│   ├── audio_utils.py          # Audio preservation functions
│   ├── tile_utils.py           # Tiled inference for low VRAM
│   └── vae_manager.py          # VAE Manager for multiple VAE support
├── infer.py                    # Main unified inference script
├── Dockerfile                  # Container definition with auto-activation
├── entrypoint.sh               # Container entry script
├── config.yaml                 # Configuration file with VAE defaults
├── requirements.txt            # Python dependencies
├── setup.py                    # Package setup for the `diffsynth` module
├── LICENSE                     # Project license file
└── README.md                   # This file
This project is released under the same license (Apache Software license) as the original FlashVSR implementation. Please see the LICENSE file in the original repository for details.
If you use the FlashVSR algorithm in your research, please cite the original FlashVSR paper:
@article{zhuang2025flashvsr,
title={FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution},
author={Zhuang, Junhao and Guo, Shi and Cai, Xin and Li, Xiaohui and Liu, Yihao and Yuan, Chun and Xue, Tianfan},
journal={arXiv preprint arXiv:2510.12747},
year={2025}
}

If you use this implementation (FlashVSR-Pro) in your work, please cite:
- This repository: https://github.com/LujiaJin/FlashVSR-Pro
- The optimized Block-Sparse-Attention backend: https://github.com/LujiaJin/Block-Sparse-Attention
- The core algorithm is from the original FlashVSR authors.
- Inspiration for the audio preservation and tiled DiT inference features came from the community project FlashVSR_plus.
- The idea and implementation for supporting multiple VAE decoders were inspired by ComfyUI-FlashVSR_Stable.
- VAE models are from the open-source community, particularly lightx2v/Autoencoders.
- The automated build and optimization of the Block-Sparse-Attention backend is a contribution of this project, with improvements documented in mit-han-lab/Block-Sparse-Attention#16.
Contributions, issues, and feature requests are welcome. Feel free to check the issues page if you want to contribute.
Happy Super-Resolution!