This repository contains the official implementation of SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning. It provides the inference code for the SCAIL-2 model, an open-source model for end-to-end character animation.
SCAIL-1 identified the key bottlenecks that keep character animation from reaching production level: how to represent the pose and how to inject it. However, its reliance on an intermediate pose representation still limits the model on complex motion and on generalizing across identities. We call this issue over-reliance on intermediates.
As intermediates, skeleton maps suffer from inherent ambiguity in complex scenarios. They also restrict the driving source to exocentric human movements, so they cannot handle driving sources such as animals. Character replacement and multi-character animation suffer from similar issues: state-of-the-art methods use inpainting masks, but such masks are still a form of intermediate, which limits their applicability and bounds their performance.
To bypass intermediate pose representations, we use several off-the-shelf models, including SCAIL-Preview, Wan-Animate, and MoCha, to synthesize 60K motion pairs. A Unified Motion Transfer Interface, containing two types of masking channels and a dedicated RoPE design, lets us train on all of this data. We use reserve driving so that the model can learn capabilities beyond those of the source models. With this data composition and training recipe, the final model yields emergent capabilities: for example, it supports cross-identity replacement and animal-driven scenarios, and it supports more advanced control intermediates such as SAM3D-Body's mesh rendering in a zero-shot manner.
| ckpts | Download Link | Notes |
|---|---|---|
| SCAIL-2 | Hugging Face, ModelScope | Trained with mixed resolutions and frame rates. End-to-end driving supports both 512p and 704p. Pose-driven mode performs better at 704p. When using other resolutions, H and W should both be divisible by 32 (e.g. 704*1280). |
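The divisibility rule in the notes column can be checked before launching a job. The sketch below is illustrative only: is_valid_resolution and snap_to_multiple are hypothetical helper names and are not part of this repository.

```python
def is_valid_resolution(height: int, width: int, multiple: int = 32) -> bool:
    """Return True when both sides are positive multiples of `multiple`."""
    return (
        height > 0
        and width > 0
        and height % multiple == 0
        and width % multiple == 0
    )


def snap_to_multiple(value: int, multiple: int = 32) -> int:
    """Round `value` to the nearest multiple of `multiple` (ties round up).

    The result is never smaller than `multiple` itself.
    """
    if value <= 0 or multiple <= 0:
        raise ValueError("value and multiple must be positive")
    snapped = (value + multiple // 2) // multiple * multiple
    return max(multiple, snapped)


# 704*1280 from the table passes; 720*1280 does not, because 720 % 32 == 16.
print(is_valid_resolution(704, 1280))
print(is_valid_resolution(720, 1280))
print(snap_to_multiple(720))
```

Snapping changes the aspect ratio slightly, so it may be preferable to pick a resolution by hand and use the check only as a guard.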
Use the following command to download the model weights. For convenience, the Wan VAE and T5 modules are integrated into this checkpoint.
hf download zai-org/SCAIL-2
The files should be organized like:
SCAIL-2/
├── Wan2.1_VAE.pth
├── model
│   ├── 1
│   │   └── fsdp2_rank_0000_checkpoint.pt
│   └── latest
├── umt5-xxl
└── ...
The model weights are intended for the sat branch. To use them in the wan branch, convert them to safetensors format:
python convert.py --scail-dir /path/to/SCAIL-2 --save-path /path/to/SCAIL-2.safetensors
Please make sure your Python version is between 3.10 and 3.12, inclusive.
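The version requirement can be verified from the interpreter itself before installing dependencies. This is a minimal sketch; python_version_supported is a hypothetical helper name, not something the repository ships.

```python
import sys


def python_version_supported(version_info=sys.version_info) -> bool:
    """Return True for Python 3.10, 3.11, and 3.12 (any patch release)."""
    major_minor = tuple(version_info[:2])
    return (3, 10) <= major_minor <= (3, 12)


# Report the status of the interpreter that runs this snippet.
current = ".".join(str(part) for part in sys.version_info[:3])
status = "supported" if python_version_supported() else "outside the 3.10-3.12 range"
print(f"Python {current} is {status}")
```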
pip install -r requirements.txt
SCAIL-Pose contains the preprocessing code used to prepare SCAIL-2 inputs, including pose extraction, pose rendering, reference masks, and driving-video masks. It can prepare both animation inputs and character replacement inputs. The submodule should live under the project root:
SCAIL-2/
├── generate.py
├── examples/
├── SCAIL-Pose/
└── ...
After cloning this repository, initialize the submodule:
git submodule update --init --recursive
Enter the submodule and follow its environment setup. SCAIL-Pose recommends an OpenMMLab/MMPose environment, followed by installing its own requirements:
cd SCAIL-Pose
pip install -r requirements.txt
Download the pose-preprocessing weights into SCAIL-Pose/pretrained_weights. The required layout is:
pretrained_weights/
├── nlf_l_multi_0.3.2.torchscript
└── DWPose/
    ├── dw-ll_ucoco_384.onnx
    └── yolox_l.onnx
For SCAIL-2 animation, SCAIL-Pose provides an all-in-one preprocessing entrypoint:
# Recommended end-to-end mode: rendered_v2.mp4 is the driving video copy,
# and the mask video is generated from SAM3 masks.
python NLFPoseExtract/process_animation_aio.py --subdir /path/to/input --e2e_mode
# Pose-driven mode: runs NLF + DWPose and writes a skeleton render.
python NLFPoseExtract/process_animation_aio.py --subdir /path/to/input
For character replacement, use:
python NLFPoseExtract/process_replacement.py --subdir /path/to/input
# If the driving video has multiple people and only one should be replaced:
python NLFPoseExtract/process_replacement.py --subdir /path/to/input --matchnearest
The preprocessing outputs are written back to the example folder and can be passed to generate.py as --image, --mask_image, --pose, and --mask_video.
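When several sample folders need preprocessing, the commands above can be assembled programmatically. The sketch below only builds argument lists from the script names and flags shown above; preprocess_command and its parameters are hypothetical, and running from inside the SCAIL-Pose directory is an assumption based on the setup steps.

```python
from typing import List


def preprocess_command(
    subdir: str,
    task: str = "animation",
    e2e: bool = True,
    match_nearest: bool = False,
) -> List[str]:
    """Build the argv list for one SCAIL-Pose preprocessing run."""
    if task == "animation":
        cmd = ["python", "NLFPoseExtract/process_animation_aio.py", "--subdir", subdir]
        if e2e:
            # End-to-end mode; without this flag the pose-driven path is used.
            cmd.append("--e2e_mode")
    elif task == "replacement":
        cmd = ["python", "NLFPoseExtract/process_replacement.py", "--subdir", subdir]
        if match_nearest:
            # Only one person in a multi-person driving video is replaced.
            cmd.append("--matchnearest")
    else:
        raise ValueError(f"unknown task: {task!r}")
    return cmd


# To execute (assumed to be run with SCAIL-Pose as the working directory):
#   import subprocess
#   subprocess.run(preprocess_command("/path/to/input"), cwd="SCAIL-Pose", check=True)
print(" ".join(preprocess_command("/path/to/input")))
```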
generate.py runs one SCAIL-2 inference job from four local input files:
examples/001/
├── ref.jpg               # reference character image
├── ref_mask.jpg          # foreground mask of the reference image
├── rendered_v2.mp4       # driving / pose video consumed by --pose
└── rendered_mask_v2.mp4  # per-frame driving mask consumed by --mask_video
The paths passed to --image, --mask_image, --pose, and --mask_video must exist. The script checks them before loading the image/video data.
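A similar existence check can be run ahead of time, for example before queuing many jobs. The following is a sketch of such a pre-flight check and not the code used inside generate.py; find_missing_inputs and the exists parameter are hypothetical names introduced for illustration.

```python
import os
from typing import Callable, Dict, List


def find_missing_inputs(
    inputs: Dict[str, str],
    exists: Callable[[str], bool] = os.path.isfile,
) -> List[str]:
    """Return the flags whose paths do not point at an existing file."""
    return [flag for flag, path in inputs.items() if not exists(path)]


# The four inputs of the example folder shown above.
example_inputs = {
    "--image": "examples/001/ref.jpg",
    "--mask_image": "examples/001/ref_mask.jpg",
    "--pose": "examples/001/rendered_v2.mp4",
    "--mask_video": "examples/001/rendered_mask_v2.mp4",
}

missing = find_missing_inputs(example_inputs)
if missing:
    print("Missing input files for: " + ", ".join(missing))
else:
    print("All four inputs found.")
```

The exists argument is there only so the check can be exercised without touching the filesystem; in normal use the default is sufficient.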
For animation mode, --pose can be an end-to-end driving video or a pose-rendered video, depending on how the sample was prepared. --mask_video should be the corresponding per-frame foreground/control mask. For replacement mode, pass --replace_flag and provide the replacement-region mask through --mask_video.
For both animation and character replacement, --prompt should describe the generated video itself. It should not be an instruction to the model.
For replacement tasks, the prompt should describe the video after replacement has already happened. For better results, describe the replacement character's visible clothing and appearance, and include objects the character interacts with or stays close to in the video, such as tools, instruments, chairs, tables, vehicles, doors, or handheld items.
We provide an optional Gemini-based helper, prompt_enhancer.py, to turn a short replacement instruction into a positive prompt for generate.py. The helper samples frames from the source video, reads the replacement reference image, uses few-shot examples from prompt_examples.txt, and outputs a long English description of the replaced video.
google-genai is not installed by default in requirements.txt. Install it before using the enhancer:
pip install google-genai
Set a Gemini API key before running:
export GEMINI_API_KEY=your_api_key
Example:
python prompt_enhancer.py \
--video /path/to/driving.mp4 \
--image /path/to/ref.png \
--instruction "replace the man in the blue jacket in the video with the person in the image" \
--examples prompt_examples.txt \
--num_frames 8 \
--output enhanced_prompt.txt \
--caption_out source_caption.txt
The --instruction argument is only for Gemini, so it can say who should be replaced by whom. The file written to --output is the positive generated-video description that should be passed to generate.py --prompt; the enhancer is instructed to include useful SCAIL-2 prompt details such as the replacement character's clothing and objects the character interacts with.
Use the enhanced prompt for replacement inference:
python generate.py \
--model SCAIL-14B \
--ckpt_dir /path/to/SCAIL-2 \
--scail_path /path/to/SCAIL-2.safetensors \
--replace_flag \
--target_w 896 --target_h 512 \
--image /path/to/ref.png \
--mask_image /path/to/ref_mask.png \
--pose /path/to/driving.mp4 \
--mask_video /path/to/replace_mask.mp4 \
--prompt "$(cat enhanced_prompt.txt)" \
--save_file replacement_output.mp4
prompt_examples.txt is used as few-shot style guidance. Add more examples there if you want the enhanced prompts to follow a different level of detail or wording.
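The replacement invocation can also be driven from Python, which avoids shell quoting of a long prompt. This is a minimal sketch that uses only the flags shown in the command above; load_prompt and build_replacement_command are hypothetical helper names, and the default target size simply mirrors the example.

```python
from pathlib import Path
from typing import List


def load_prompt(path: str) -> str:
    """Read the enhanced prompt file and trim surrounding whitespace."""
    return Path(path).read_text(encoding="utf-8").strip()


def build_replacement_command(
    prompt: str,
    image: str,
    mask_image: str,
    pose: str,
    mask_video: str,
    ckpt_dir: str,
    scail_path: str,
    save_file: str,
    target_w: int = 896,
    target_h: int = 512,
) -> List[str]:
    """Build the argv list for one generate.py replacement job."""
    return [
        "python", "generate.py",
        "--model", "SCAIL-14B",
        "--ckpt_dir", ckpt_dir,
        "--scail_path", scail_path,
        "--replace_flag",
        "--target_w", str(target_w),
        "--target_h", str(target_h),
        "--image", image,
        "--mask_image", mask_image,
        "--pose", pose,
        "--mask_video", mask_video,
        "--prompt", prompt,  # passed as one argv entry, so no quoting is needed
        "--save_file", save_file,
    ]


# Typical use, assuming enhanced_prompt.txt was written by prompt_enhancer.py:
#   import subprocess
#   cmd = build_replacement_command(load_prompt("enhanced_prompt.txt"), ...)
#   subprocess.run(cmd, check=True)
```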
Run inference directly with generate.py:
Example for animation:
python generate.py \
--model SCAIL-14B \
--ckpt_dir /path/to/SCAIL-2 \
--scail_path /path/to/SCAIL-2.safetensors \
--target_w 896 --target_h 512 \
--image examples/001/ref.jpg \
--mask_image examples/001/ref_mask.jpg \
--pose examples/001/rendered_v2.mp4 \
--mask_video examples/001/rendered_mask_v2.mp4 \
--prompt "The girl is dancing" \
--save_file output.mp4
Example for replacement:
python generate.py \
--model SCAIL-14B \
--ckpt_dir /path/to/SCAIL-2 \
--scail_path /path/to/SCAIL-2.safetensors \
--target_w 896 --target_h 512 \
--image examples/replace_001/ref.png \
--mask_image examples/replace_001/ref_mask.png \
--pose examples/replace_001/rendered_v2.mp4 \
--mask_video examples/replace_001/replace_mask.mp4 \
--prompt "A blond white male wearing a black suit, trousers, and leather shoes is playing the violin on the street while pedestrians walk past him." \
--save_file output.mp4 \
--replace_flag
Useful sampling options:
--sample_steps: number of denoising steps. Defaults to 40.
--sample_shift: flow-matching scheduler shift. Defaults to 3.0 if not specified.
--sample_guide_scale: classifier-free guidance scale. Defaults to 5.0.
--sample_solver: unipc or dpm++. Defaults to unipc.
--offload_model: whether to offload model components between stages. For single-process inference, the default is True.
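To give some intuition for --sample_shift, the sketch below applies the time-shift mapping commonly used by flow-matching samplers, sigma' = shift * sigma / (1 + (shift - 1) * sigma), to a linear noise schedule. Both the mapping and the linear base schedule are assumptions made for illustration; the exact scheduler behind generate.py is not shown in this document, and the function names are hypothetical.

```python
from typing import List


def linear_sigmas(num_steps: int) -> List[float]:
    """Evenly spaced noise levels from 1.0 down to 1/num_steps."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return [1.0 - i / num_steps for i in range(num_steps)]


def apply_shift(sigma: float, shift: float) -> float:
    """Common flow-matching time shift; shift == 1 leaves sigma unchanged."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def shifted_sigmas(num_steps: int, shift: float) -> List[float]:
    """Linear schedule with the time shift applied to every noise level."""
    return [apply_shift(s, shift) for s in linear_sigmas(num_steps)]


# Compare a short schedule with and without shifting.
for shift in (1.0, 3.0):
    levels = ", ".join(f"{s:.3f}" for s in shifted_sigmas(8, shift))
    print(f"shift={shift}: {levels}")
```

Under this mapping, a shift above 1 raises every intermediate noise level, so more of the sampling steps are spent at high noise, while a shift of 1 keeps the schedule as is.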
If you use a Lightx2v LoRA checkpoint, pass it with --lora_path and set its strength with --lora_alpha:
python generate.py \
--model SCAIL-14B \
--ckpt_dir /path/to/SCAIL-2 \
--scail_path /path/to/SCAIL-2.safetensors \
--lora_path Lightx2v/lightx2v_I2V_14B_480p_cfg_step_distill_rank128_bf16.safetensors \
--lora_alpha 1.0 \
--sample_steps 8 \
--sample_shift 1 \
--sample_guide_scale 1.0 \
--target_w 896 --target_h 512 \
--image examples/001/ref.jpg \
--mask_image examples/001/ref_mask.jpg \
--pose examples/001/rendered_v2.mp4 \
--mask_video examples/001/rendered_mask_v2.mp4 \
--prompt "The girl is dancing" \
--save_file output.mp4
Note that SCAIL-2 is trained with long, detailed prompts. Short prompts or an empty prompt can run, but detailed descriptions of the reference subject and motion usually produce better results.
Our implementation is built on Wan 2.1, and the overall project architecture is inherited from SCAIL. We especially thank Wan-Animate and MoCha, which served as supplementary data generators alongside SCAIL, and the HuMo Dataset, which provided high-quality source videos.
If you find this work useful in your research, please cite:
@article{yan2025scail,
title={SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations},
author={Yan, Wenhao and Ye, Sheng and Yang, Zhuoyi and Teng, Jiayan and Dong, ZhenHui and Wen, Kairui and Gu, Xiaotao and Liu, Yong-Jin and Tang, Jie},
journal={arXiv preprint arXiv:2512.05905},
year={2025}
}
This project is licensed under the Apache License 2.0; see the LICENSE file for details.



