CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal
Qingdong He1* · Chaoyi Wang2* · Peng Tang3 · Yifan Yang4 · Xiaobin Hu5†
1University of Electronic Science and Technology of China · 2University of Chinese Academy of Sciences · 3Technical University of Munich · 4Shanghai Jiao Tong University · 5National University of Singapore
* Equal contribution · † Corresponding author
QQ Group: 1082514558
CLEAR is a mask-free video subtitle removal framework that achieves end-to-end inference through context-aware adaptive learning. By decoupling prior extraction from generative refinement in a two-stage design, CLEAR requires only 0.77% of the base diffusion model's parameters for training while outperforming mask-dependent baselines by a large margin.
- End-to-End Mask-Free Inference: No external text detection or segmentation modules needed at inference time
- Parameter Efficient: Only 0.77% trainable parameters via LoRA adaptation
- Zero-Shot Cross-Lingual Generalization: Trained on Chinese subtitles, generalizes to English, Korean, French, Japanese, Russian, and German
- State-of-the-Art Performance: +6.77 dB PSNR and 74.7% lower VFID than the best baselines
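The 0.77% figure follows from standard LoRA parameter accounting: a rank-r adapter on a d_out × d_in weight matrix adds r·(d_in + d_out) trainable parameters on top of the frozen base. A back-of-the-envelope sketch, where the matrix count and hidden width are hypothetical stand-ins (this README does not state Wan2.1's exact shapes):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters a rank-`rank` LoRA adapter adds to one weight matrix."""
    return rank * (d_in + d_out)

# Hypothetical accounting: 50 square 1536-dim projections adapted at rank 64
# inside a ~1.3B-parameter base model (illustrative numbers, not Wan2.1's real shapes).
adapted = 50 * lora_params(1536, 1536, rank=64)
fraction = adapted / 1.3e9
print(f"{adapted:,} trainable params ({fraction:.2%} of base)")  # 9,830,400 (0.76% of base)
```

With dimensions in this ballpark, the trainable fraction lands near the quoted 0.77% while the 1.3B-parameter base model stays frozen.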
| Language | Demo |
|---|---|
| English | english1_demo.mp4 |
| English | english2_demo.mp4 |
| English | english3_demo.mp4 |
| Japanese | japanese_demo.mp4 |
| Arabic | arabic_demo.mp4 |
All demos show zero-shot cross-lingual generalization: the model was trained only on Chinese subtitle data.
Quantitative results on the Chinese subtitle test set (default configuration: rank=64, steps=5, cfg=1.0, lora_scale=1.0):
| Metric | Category | CLEAR (Ours) |
|---|---|---|
| PSNR ↑ | Reconstruction | 26.80 |
| SSIM ↑ | Reconstruction | 0.894 |
| LPIPS ↓ | Reconstruction | 0.101 |
| DISTS ↓ | Perceptual | 0.075 |
| VFID ↓ | Perceptual | 20.37 |
| TWE ↓ | Temporal | 1.227 |
| TC ↓ | Temporal | 1.049 |
| Flow Mean ↓ | Motion | 0.209 |
| Flow Var ↓ | Motion | 0.029 |
| Time (s/frame) | Efficiency | 4.86 |
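For reference, the PSNR entries above follow the standard definition over 8-bit frames. A minimal numpy sketch (the repository's evaluation script may average per-frame scores or use a different data range):

```python
import numpy as np

def psnr(reference: np.ndarray, restored: np.ndarray, data_range: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two frames, in dB."""
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(data_range ** 2 / mse)

# Toy example: a uniform error of 10 gray levels -> ~28.13 dB
ref = np.zeros((64, 64), dtype=np.uint8)
out = np.full((64, 64), 10, dtype=np.uint8)
print(round(psnr(ref, out), 2))
```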
| Configuration | PSNR ↑ | SSIM ↑ | LPIPS ↓ | VFID ↓ |
|---|---|---|---|---|
| Baseline (LoRA-only) | 21.62 | 0.855 | 0.131 | 34.74 |
| + M1: Stage I Prior with Focal Weighting | 23.11 | 0.868 | 0.130 | 38.21 |
| + M2: Context Distillation | 24.72 | 0.890 | 0.110 | 31.73 |
| + M3: Context-Aware Adaptation | 25.09 | 0.891 | 0.109 | 31.56 |
| + M4: Context Consistency (CLEAR) | 26.80 | 0.894 | 0.101 | 20.37 |
- Wan2.1 Environment: CLEAR is built on top of Wan2.1. Please follow the Wan2.1 installation guide first:
```bash
# Clone Wan2.1
git clone https://github.com/Wan-Video/Wan2.1.git
cd Wan2.1

# Install Wan2.1 dependencies (requires torch >= 2.4.0)
pip install -r requirements.txt
```

- DiffSynth-Studio: CLEAR uses DiffSynth-Studio for the video diffusion pipeline:
```bash
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```

- Download Wan2.1 Model Weights:
```bash
# Download the Wan2.1 base weights
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./models/Wan2.1-T2V-1.3B
```

- Download CLEAR Model Weights:
```bash
huggingface-cli download charlesw09/CLEAR-mask-free-video-subtitle-removal \
    CLEAR-mask-free-subtitle-removal.pt \
    --local-dir ./checkpoints
```

- Install CLEAR:

```bash
git clone https://github.com/your-repo/CLEAR.git
cd CLEAR
pip install -r requirements.txt
```

After installation, your `checkpoints/` folder should contain `CLEAR-mask-free-subtitle-removal.pt`.
```
CLEAR/
├── README.md                      # This file
├── requirements.txt               # Python dependencies
├── .gitignore
│
├── configs/
│   └── stage1_config.yaml         # Stage I training configuration
│
├── models/
│   ├── __init__.py
│   ├── dual_encoder.py            # Dual ResNet-50 encoders with FPN fusion
│   ├── disentangled_modules.py    # Disentangled feature learning modules
│   └── occlusion_head.py          # Context-dependent occlusion head (Stage II)
│
├── utils/
│   ├── __init__.py
│   ├── video_utils.py             # Video loading and processing
│   ├── mask_utils.py              # Mask generation and processing
│   ├── contrastive_loss.py        # Contrastive learning utilities
│   └── vae_temporal_alignment.py  # VAE temporal alignment
│
├── scripts/
│   ├── train_stage1.sh            # Stage I training launcher
│   ├── train_stage2.sh            # Stage II training launcher
│   └── inference.sh               # Inference launcher
│
├── train_stage1.py                # Stage I: Self-Supervised Prior Learning
├── train_stage2.py                # Stage II: Adaptive Weighting Learning
├── inference.py                   # End-to-End Mask-Free Inference
│
├── checkpoints/                   # Model checkpoints (download separately)
│   └── CLEAR-mask-free-subtitle-removal.pt  # Pre-trained CLEAR LoRA weights
│
└── assets/
    └── demo_videos/               # Demo comparison videos
        ├── english1_demo.mp4
        ├── english2_demo.mp4
        ├── english3_demo.mp4
        ├── japanese_demo.mp4
        └── arabic_demo.mp4
```
The simplest way to use CLEAR is to provide a video with subtitles:
```bash
# Set model paths
export MODEL_BASE_PATH=/path/to/Wan2.1-Fun-V1.1-1.3B-Control
export LORA_CHECKPOINT=./checkpoints/CLEAR-mask-free-subtitle-removal.pt

# Run inference
bash scripts/inference.sh input_video.mp4 ./results
```

Or use Python directly:
```bash
python inference.py \
    --model_base_path /path/to/Wan2.1-Fun-V1.1-1.3B-Control \
    --lora_checkpoint ./checkpoints/CLEAR-mask-free-subtitle-removal.pt \
    --lora_rank 64 \
    --lora_scale 1.0 \
    --input_video input_video.mp4 \
    --output_dir ./results \
    --num_steps 5 \
    --cfg_scale 1.0 \
    --use_sliding_window \
    --create_comparison
```

Stage I training learns dual encoders that extract occlusion guidance from paired videos:
```bash
# Edit configs/stage1_config.yaml to set your data paths
bash scripts/train_stage1.sh
```

Stage II training fine-tunes the LoRA-adapted diffusion model with a context-dependent occlusion head:
```bash
# Set environment variables
export MODEL_BASE_PATH=/path/to/Wan2.1-Fun-V1.1-1.3B-Control
export ADAPTER_CHECKPOINT=./checkpoints/stage1/checkpoint_best.pt
export CLEAN_DIRS=/path/to/clean_videos
export SUBTITLE_DIRS=/path/to/subtitle_videos

bash scripts/train_stage2.sh
```

| Parameter | Default | Description |
|---|---|---|
| `num_steps` | 5 | Denoising steps (5 recommended for speed-quality balance) |
| `cfg_scale` | 1.0 | Classifier-free guidance scale |
| `lora_scale` | 1.0 | LoRA strength (0.0-2.0) |
| `chunk_size` | 81 | Sliding window size (frames) |
| `chunk_overlap` | 16 | Overlap between chunks |
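With the defaults above, sliding-window inference moves an 81-frame window with stride `chunk_size - chunk_overlap` = 65 frames. A minimal sketch of the window arithmetic, assuming the last window is shifted back to cover the tail of the video (the actual `inference.py` may blend overlapping regions differently):

```python
def chunk_ranges(n_frames: int, chunk_size: int = 81, overlap: int = 16):
    """Start/end frame indices of each sliding window over a video."""
    stride = chunk_size - overlap
    starts = list(range(0, max(n_frames - chunk_size, 0) + 1, stride))
    # Shift in a final window flush with the end if frames remain uncovered.
    if starts[-1] + chunk_size < n_frames:
        starts.append(n_frames - chunk_size)
    return [(s, min(s + chunk_size, n_frames)) for s in starts]

print(chunk_ranges(200))  # [(0, 81), (65, 146), (119, 200)]
print(chunk_ranges(50))   # [(0, 50)] - clips shorter than one chunk need one window
```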
CLEAR consists of two training stages and a mask-free inference pipeline:
Stage I (Self-Supervised Prior Learning):
- Dual ResNet-50 encoders extract disentangled features from video pairs
- Orthogonality constraints ensure feature independence
- Pseudo-labels from pixel differences (no manual annotation needed)
- Output: Coarse occlusion prior M^prior
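The pixel-difference pseudo-labels can be sketched as follows; a minimal numpy version, where the 5% threshold and per-pixel channel averaging are assumptions rather than the repository's exact recipe:

```python
import numpy as np

def pseudo_occlusion_mask(subtitled: np.ndarray, clean: np.ndarray,
                          thresh: float = 0.05) -> np.ndarray:
    """Binary pseudo-label for subtitle pixels from a paired (subtitled, clean) frame.

    Frames are HxWx3 uint8; no manual annotation is involved.
    """
    diff = np.abs(subtitled.astype(np.float32) - clean.astype(np.float32)).mean(axis=-1)
    return (diff > thresh * 255.0).astype(np.float32)

clean = np.zeros((8, 8, 3), dtype=np.uint8)
subtitled = clean.copy()
subtitled[2:4, 1:7] = 255            # burn in a fake 2x6 subtitle bar
mask = pseudo_occlusion_mask(subtitled, clean)
print(int(mask.sum()))               # 12 pixels flagged as subtitle
```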
Stage II (Adaptive Weighting Learning):
- LoRA adaptation (rank=64) on frozen Wan2.1 diffusion model
- Lightweight occlusion head (~2.1M params) predicts adaptive weights
- Joint optimization: L_distill + L_gen + 0.1 × L_sparse
- Dynamic alpha scheduling prevents over-reliance on noisy priors
Mask-Free Inference:
- Only requires subtitled video input
- No Stage I dependency, no external modules
- Adaptive weighting internalized into LoRA-augmented attention
- Single-pass generation via DDIM sampling
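A hedged sketch of how the Stage II objective and the dynamic alpha schedule could fit together; the cosine decay below is an assumption, since the text above states only the loss L_distill + L_gen + 0.1 × L_sparse and a dynamic alpha that prevents over-reliance on noisy priors:

```python
import math

def alpha_schedule(step: int, total_steps: int, alpha_max: float = 1.0) -> float:
    """Cosine decay of the prior-distillation weight over training (assumed form)."""
    progress = min(step / total_steps, 1.0)
    return alpha_max * 0.5 * (1.0 + math.cos(math.pi * progress))

def total_loss(l_distill: float, l_gen: float, l_sparse: float,
               step: int, total_steps: int) -> float:
    """Stage II objective with the distillation term annealed by alpha."""
    return alpha_schedule(step, total_steps) * l_distill + l_gen + 0.1 * l_sparse

print(alpha_schedule(0, 1000))     # 1.0 - trust the coarse prior early on
print(alpha_schedule(1000, 1000))  # 0.0 - rely fully on the generative model at the end
```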
If you find this work helpful, please consider citing:
```bibtex
@misc{he2026clearcontextawarelearningendtoend,
  title={CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal},
  author={Qingdong He and Chaoyi Wang and Peng Tang and Yifan Yang and Xiaobin Hu},
  year={2026},
  eprint={2603.21901},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.21901},
}
```

- Qingdong He · heqingdong@alu.uestc.edu.cn · Google Scholar
- Chaoyi Wang · chaoyiwang@mail.sim.ac.cn · Google Scholar
This project is built upon the following excellent works:
- Wan2.1: Open and advanced large-scale video generative models
- DiffSynth-Studio: Diffusion model training and inference framework
- PEFT: Parameter-Efficient Fine-Tuning
This project is released under the Apache 2.0 License. We claim no rights over your generated contents. Please use responsibly and ensure compliance with applicable laws.
We acknowledge the potential for misuse, particularly for generating misinformation. This code is released for research purposes with explicit stipulations against malicious use.