Skip to content

showlab/Sparkle

Repository files navigation

Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

Ziyun Zeng, Yiqi Lin, Guoqiang Liang, and Mike Zheng Shou

arXiv Project Page Code Dataset Benchmark Model

πŸ“¦ Dataset

Sparkle is a large-scale video background replacement dataset comprising ~140K high-quality source–edited video pairs. It is fully open-sourced at πŸ€—stdKonjac/Sparkle. For full methodology and dataset details, please refer to our paper.

The dataset is organized into five themes along different background-change axes:

Theme Description
location Background replaced with a different physical environment (rural, nature, landmark, ...).
season Background changed across seasons (spring, summer, autumn, winter).
time Background changed across times of day (dawn, dusk, night, ...).
style Background restyled (era, mood, cinematic, ...).
openve3m A re-creation of the OpenVE-3M background-replacement subset using our pipeline, retained for direct comparison with prior work.

πŸ—‚οΈ Repository Structure

Sparkle/
β”œβ”€β”€ README.md
β”œβ”€β”€ prompts/                                        # training annotations + dataset-viewer source
β”‚   β”œβ”€β”€ location_train.csv                          # 4 columns: prompt, src_video, tgt_video, task
β”‚   β”œβ”€β”€ location_train_metadata.jsonl               # per-task metadata (edit_type, subtheme, original scene)
β”‚   β”œβ”€β”€ season_train.csv
β”‚   β”œβ”€β”€ season_train_metadata.jsonl
β”‚   β”œβ”€β”€ time_train.csv
β”‚   β”œβ”€β”€ time_train_metadata.jsonl
β”‚   β”œβ”€β”€ style_train.csv
β”‚   β”œβ”€β”€ style_train_metadata.jsonl
β”‚   β”œβ”€β”€ openve3m_train.csv
β”‚   └── openve3m_train_metadata.jsonl
β”‚
β”œβ”€β”€ location/                                       # online preview: first 100 samples
β”‚   β”œβ”€β”€ source_video/
β”‚   β”‚   β”œβ”€β”€ Sparkle_location_000000.mp4
β”‚   β”‚   └── ... (100 files)
β”‚   └── edited_video/
β”‚       β”œβ”€β”€ Sparkle_location_000000.mp4
β”‚       └── ... (100 files)
β”œβ”€β”€ season/                                         # same structure as location/
β”œβ”€β”€ time/
β”œβ”€β”€ style/
β”œβ”€β”€ openve3m/
β”‚
β”œβ”€β”€ location_source_video_part00.tar                # full corpus, sharded into ~5GB tars
β”œβ”€β”€ location_source_video_part01.tar
β”œβ”€β”€ location_edited_video_part00.tar
β”œβ”€β”€ ...
β”œβ”€β”€ season_*_partXX.tar
β”œβ”€β”€ time_*_partXX.tar
β”œβ”€β”€ style_*_partXX.tar
β”œβ”€β”€ openve3m_*_partXX.tar
β”‚
└── intermediate_data/                              # pipeline intermediates (described below)
    └── ...

🧾 Training Data Format

We follow the training data format of Kiwi-Edit for direct compatibility with downstream training pipelines.

Each theme's annotations live in prompts/{edit_type}_train.csv, a four-column table:

Column Description
prompt The natural-language editing instruction.
src_video Path to the source video, e.g. location/source_video/Sparkle_location_000000.mp4.
tgt_video Path to the edited video, e.g. location/edited_video/Sparkle_location_000000.mp4.
task The unique sample id, e.g. Sparkle_location_000000. Joins to the id field in the JSONL metadata.

Per-task auxiliary metadata is stored alongside in prompts/{edit_type}_train_metadata.jsonl. Each line is one sample:

{
  "id": "Sparkle_location_000000",
  "prompt": "Shift the background to a rooftop overlooking a modern city skyline at dusk, ...",
  "metadata": {
    "edit_type": "location",
    "chosen_keyword": "urban: rooftop overlooking skyline",
    "original_scene": "A cobblestone street in a historical European city, ..."
  }
}
Field Description
id Sample id, matches the task column in the CSV.
prompt Same as the prompt column in the CSV.
metadata.edit_type One of the five themes: location / season / time / style / openve3m (denoted as openve3m_background_change).
metadata.chosen_keyword The subtheme: scene label (e.g. "urban: rooftop overlooking skyline"). Not available for the openve3m theme.
metadata.original_scene A description of the source video's first frame.

πŸ‘€ Online Preview

The first 100 samples of every theme are stored as uncompressed .mp4 files under {edit_type}/source_video/ and {edit_type}/edited_video/, and can be played directly in the browser without downloading the full corpus.

For example, for the task Sparkle_location_000000 (the first row in the location theme of the dataset viewer), you can directly browse its Source Video and Edited Video.

The dataset viewer at the top of the HF page lets you scroll through all five themes and read the corresponding prompts inline.

⬇️ Downloading the Full Corpus

The full ~140K-sample corpus is sharded into ~5GB .tar archives at the repository root, named {edit_type}_{source_video|edited_video}_partXX.tar.

Step 1 β€” Download the tar shards. Download everything (recommended for full reproduction):

hf download stdKonjac/Sparkle --repo-type=dataset --local-dir ./Sparkle

Or only a single theme (e.g. location):

hf download stdKonjac/Sparkle \
  --repo-type=dataset \
  --local-dir ./Sparkle \
  --include "location_*.tar" "prompts/location_*"

Or only the source videos of a theme:

hf download stdKonjac/Sparkle \
  --repo-type=dataset \
  --local-dir ./Sparkle \
  --include "location_source_video_*.tar"

Step 2 β€” Extract the tars. Each tar is self-contained: its internal paths are {edit_type}/{source_video|edited_video}/{task}.mp4, so extracting any subset of shards in place will populate the corresponding folders correctly. There is no need to concatenate the parts before extraction.

cd ./Sparkle
for f in *.tar; do tar -xf "$f"; done

After extraction, the directory layout matches the online preview structure, and the relative paths in prompts/{edit_type}_train.csv (e.g. location/source_video/Sparkle_location_000000.mp4) will resolve directly.

πŸ§ͺ Pipeline Intermediates

To support full reproducibility, transparency, and downstream research, we additionally release every intermediate artifact produced by the 5-stage Sparkle data pipeline (see Figure 2: Data Pipeline in our paper) under intermediate_data/. The first 100 samples of every theme are uncompressed and previewable directly in the browser, mirroring the layout of the {edit_type}/ preview folders described above.

Taking Sparkle_location_000000 as a running example, the artifact layout looks like:

Sparkle/
└── intermediate_data/
    └── location/
        β”œβ”€β”€ source_frame0/                          # Stage 2 input: 0-th frame of the source video
        β”‚   └── Sparkle_location_000000.png
        β”œβ”€β”€ edited_frame0/                          # Stage 2 output: first frame after preliminary background replacement
        β”‚   └── Sparkle_location_000000.png
        β”œβ”€β”€ edited_frame0_foreground_removed/       # Stage 3 intermediate: foreground-removed clean background image
        β”‚   └── Sparkle_location_000000.png
        β”œβ”€β”€ edited_background_video/                # Stage 3 output: 81-frame pure background video (no foreground)
        β”‚   └── Sparkle_location_000000.mp4
        β”œβ”€β”€ source_video_mask/                      # Stage 4 output: BAIT-tracked foreground mask (packed bits)
        β”‚   └── Sparkle_location_000000.npz
        └── edited_video_canny/                     # Stage 5 intermediate: decoupled foreground + background Canny edges
            └── Sparkle_location_000000.mp4

For the same task Sparkle_location_000000, every artifact is browsable online:

Pipeline stage Artifact Preview
Stage 2 (in) Source first frame source_frame0/Sparkle_location_000000.png
Stage 2 (out) Preliminarily edited first frame edited_frame0/Sparkle_location_000000.png
Stage 3 (mid) Foreground-removed clean background image edited_frame0_foreground_removed/Sparkle_location_000000.png
Stage 3 (out) Pure background video (81 frames, no foreground) edited_background_video/Sparkle_location_000000.mp4
Stage 4 BAIT-tracked foreground mask source_video_mask/Sparkle_location_000000.npz
Stage 5 (mid) Decoupled foreground + background Canny edges edited_video_canny/Sparkle_location_000000.mp4

Loading the foreground mask. The masks in source_video_mask/ are bit-packed for storage efficiency. Each .npz file contains two arrays: mask (a np.uint8 array of bits) and shape (the original (T, H, W) mask shape, where T ≀ 81). Unpack with:

import numpy as np

def load_mask(mask_path: str) -> np.ndarray:
    data = np.load(mask_path)
    packed_mask = data["mask"]
    shape = tuple(int(s) for s in data["shape"])
    total = shape[0] * shape[1] * shape[2]
    video_mask = np.unpackbits(packed_mask)[:total].reshape(shape).astype(bool)
    return video_mask  # boolean array of shape (T, H, W)

Downloading the full intermediates. Like the main corpus, the full intermediates for every theme are sharded into ~5GB .tar archives, stored under intermediate_data/ and named {edit_type}_{subdir}_partXX.tar where {subdir} is one of the six folder names above. Download and extract them as follows:

# Download all intermediates for a single theme (e.g. location)
hf download stdKonjac/Sparkle \
  --repo-type=dataset \
  --local-dir ./Sparkle \
  --include "intermediate_data/location_*_part*.tar"

# Extract in place; tar-internal paths are {edit_type}/{subdir}/{file},
# so the working directory must be intermediate_data/ for the layout to align.
cd ./Sparkle/intermediate_data
for f in location_*_part*.tar; do tar -xf "$f"; done

After extraction, the layout matches the online preview structure exactly, populating intermediate_data/location/{source_frame0, edited_frame0, ...}/.

πŸ“‹ Per-task Pipeline Metadata

In addition to the per-task artifacts, each theme's intermediate_data/{edit_type}/ folder also contains five .jsonl files recording metadata produced at various stages of the pipeline (e.g., quality scores, foreground grounding labels). These records are useful for reproducing our quality filtering, inspecting per-stage rejection statistics, or building stricter / looser variants of Sparkle for downstream research.

edited_frame0_score.jsonl records per-sample EditScore evaluation of the Stage 2 output (edited_frame0/{task}.png). One JSON object per line:

{
  "id": "Sparkle_location_000000",
  "prompt": "Shift the background to a rooftop overlooking a modern city skyline at dusk, ...",
  "editscore": {
    "prompt_following": 9.7,
    "consistency": 8.8,
    "perceptual_quality": 8.5,
    "overall": 8.62887857991077,
    "SC_reasoning": "The edited image perfectly follows the instruction: ...",
    "PQ_reasoning": "The image displays a realistic cityscape with convincing lighting ..."
  }
}
Field Description
id Sample id, matches the task column in the CSV.
prompt The editing instruction.
editscore.prompt_following Sub-score (0–10): how well the edit follows the instruction.
editscore.consistency Sub-score (0–10): subject and identity consistency with the source frame.
editscore.perceptual_quality Sub-score (0–10): perceptual quality of the edited image.
editscore.overall Aggregated overall score. We filter out samples with overall < 8.
editscore.SC_reasoning Free-text rationale for the consistency / instruction-following sub-scores.
editscore.PQ_reasoning Free-text rationale for the perceptual-quality sub-score.

edited_frame0_foreground_removed_score.jsonl records per-sample EditScore evaluation of the Stage 3 intermediate output (edited_frame0_foreground_removed/{task}.png), measuring the foreground-removal quality. The schema is identical to edited_frame0_score.jsonl:

{
  "id": "Sparkle_location_000000",
  "prompt": "...",
  "editscore": {
    "prompt_following": ...,
    "consistency": ...,
    "perceptual_quality": ...,
    "overall": ...,
    "SC_reasoning": "...",
    "PQ_reasoning": "..."
  }
}

At this stage we apply a stricter threshold and filter out samples with overall < 8.5 to guarantee a perfectly clean background before the I2V generation that follows.

foreground_grounding_r1.jsonl records the first-round VLM grounding result that compares the source first frame and the Stage 2 edited first frame to identify foreground objects to preserve. This is the labeling step described in Stage 3 of the pipeline. One JSON object per line:

{
  "id": "Sparkle_location_000000",
  "prompt": "Shift the background to a rooftop overlooking a modern city skyline at dusk, ...",
  "edit_type": "location",
  "round1_labels": [
    "woman in brown hat and coat",
    "clasped hands with ring",
    "striped shirt under coat",
    "brown wide-brimmed hat"
  ],
  "round1_objects": [
    {"bbox_2d": [447, 27, 765, 998], "label": "woman in brown hat and coat"},
    {"bbox_2d": [515, 800, 615, 980], "label": "clasped hands with ring"},
    {"bbox_2d": [490, 398, 615, 800], "label": "striped shirt under coat"},
    {"bbox_2d": [505, 27, 710, 258], "label": "brown wide-brimmed hat"}
  ]
}
Field Description
id Sample id, matches the task column in the CSV.
prompt The editing instruction.
edit_type The theme this sample belongs to (location / season / time / style / openve3m).
round1_labels List of foreground-object labels detected by the VLM.
round1_objects Per-object detection records; each item has a bbox_2d and a label.

The bounding boxes are detected on the source first frame (source_frame0/{task}.png). Since our pipeline preserves the foreground identity and pose during background replacement, these boxes apply equally to the corresponding edited first frame (edited_frame0/{task}.png).

The bbox_2d field follows Qwen3-VL's normalized coordinate format with values in the range [0, 1000], representing [x1, y1, x2, y2] (top-left and bottom-right corners). Convert them to absolute pixel coordinates of the real frame as follows:

def normalize_bbox(bbox, video_width: int, video_height: int):
    """Convert a Qwen3-VL [0, 1000]-normalized bbox to absolute pixel coordinates."""
    x1 = int(bbox[0] / 1000.0 * video_width)
    y1 = int(bbox[1] / 1000.0 * video_height)
    x2 = int(bbox[2] / 1000.0 * video_width)
    y2 = int(bbox[3] / 1000.0 * video_height)

    # Clamp to frame bounds and ensure x1 <= x2, y1 <= y2.
    x1 = max(0, min(min(x1, x2), video_width - 1))
    y1 = max(0, min(min(y1, y2), video_height - 1))
    x2 = max(0, min(max(x1, x2), video_width - 1))
    y2 = max(0, min(max(y1, y2), video_height - 1))
    return x1, y1, x2, y2

foreground_grounding_r2.jsonl records the second-round VLM grounding result that produces the temporal anchors for Stage 4 (BAIT Foreground Tracking). Building on the labels from foreground_grounding_r1.jsonl, Qwen3-VL is asked to re-locate every Round 1 label on frames sampled at 2 FPS from the source video, yielding per-frame bounding boxes that anchor the subsequent SAM3 multi-pass tracking. One JSON object per line:

{
  "id": "Sparkle_location_000000",
  "prompt": "Shift the background to a rooftop overlooking a modern city skyline at dusk, ...",
  "edit_type": "location",
  "round1_labels": [...],
  "round1_objects": [...],
  "frame_objects": [
    [
      {"bbox_2d": [448, 26, 765, 998], "label": "woman in brown hat and coat"},
      {"bbox_2d": [521, 795, 618, 968], "label": "clasped hands with ring"},
      {"bbox_2d": [545, 420, 625, 805], "label": "striped shirt under coat"},
      {"bbox_2d": [507, 26, 712, 270], "label": "brown wide-brimmed hat"}
    ],
    [
      {"bbox_2d": [452, 34, 764, 998], "label": "woman in brown hat and coat"},
      {"bbox_2d": [505, 784, 600, 955], "label": "clasped hands with ring"},
      ...
    ],
    ...
  ]
}

The schema extends foreground_grounding_r1.jsonl with a single new field:

Field Description
frame_objects A 2D list of grounding results, one inner list per 2 FPS-sampled frame. Each inner list mirrors the round1_objects schema (a list of {"bbox_2d": [...], "label": "..."} items), giving the per-frame bbox of every Round 1 label on that frame.

The other fields (id, prompt, edit_type, round1_labels, round1_objects) are inherited unchanged from foreground_grounding_r1.jsonl. Use the same normalize_bbox helper to convert bbox_2d values to absolute pixel coordinates.

Note. Some entries in frame_objects may have an empty bbox_2d (e.g. {"bbox_2d": [], "label": "..."}), indicating that the VLM failed to localize that particular label on that frame. Our BAIT algorithm handles these gracefully by relying on the remaining frames' anchors and a pixel-wise majority vote across SAM3 tracking passes.

edited_video_score.jsonl records per-sample EditScore evaluation of the Stage 5 final synthesized video. Following the protocol in our paper, we uniformly sample four non-first frames from each video and score them independently. One JSON object per line:

{
  "id": "Sparkle_location_000000",
  "prompt": "Shift the background to a rooftop overlooking a modern city skyline at dusk, ...",
  "frame_indices": [1, 26, 51, 76],
  "editscore": [
    {
      "SC_score": 9.0,
      "PQ_score": 8.5,
      "O_score": 8.719958110896453,
      "SC_score_reasoning": "The editing successfully changed the background to a rooftop overlooking a modern city skyline at dusk, ...",
      "PQ_score_reasoning": "The image has a mostly natural cityscape and lighting, but the person's hands appear slightly distorted ...",
      "SC_raw_output": "...",
      "PQ_raw_output": "..."
    },
    { "SC_score": 8.3, "PQ_score": 8.5, "O_score": 8.388302424289282, "...": "..." },
    { "SC_score": 9.1, "PQ_score": 7.4, "O_score": 8.143194240945185, "...": "..." },
    { "SC_score": 8.9, "PQ_score": 7.8, "O_score": 8.318623075017307, "...": "..." }
  ]
}
Field Description
id Sample id, matches the task column in the CSV.
prompt The editing instruction.
frame_indices The 4 frame indices (0-based) sampled from the synthesized video for evaluation, e.g. [1, 26, 51, 76].
editscore A length-4 list, one entry per sampled frame, in the same order as frame_indices.
editscore[i].SC_score Sub-score (0–10) for instruction-following / consistency on frame i.
editscore[i].PQ_score Sub-score (0–10) for perceptual quality on frame i.
editscore[i].O_score Aggregated overall score on frame i.
editscore[i].SC_score_reasoning Free-text rationale behind SC_score.
editscore[i].PQ_score_reasoning Free-text rationale behind PQ_score.
editscore[i].SC_raw_output Raw JSON string returned by the EditScore SC head (contains reasoning and per-criterion score array).
editscore[i].PQ_raw_output Raw JSON string returned by the EditScore PQ head.

The final filtering rule is: average O_score across all four sampled frames; discard the sample if the mean is below 8.

πŸ“œ Dataset License

The Sparkle dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Source videos in the openve3m theme are derived from OpenVE-3M and retain their original licenses; please consult the upstream source before redistribution.

🎯 Benchmark

Sparkle-Bench is the largest evaluation benchmark tailored for instruction-guided video background replacement, comprising 458 carefully curated videos across 4 themes, 21 subthemes, and 97 distinct scenes. It is fully open-sourced at πŸ€—stdKonjac/Sparkle-Bench. For evaluation methodology and our six-dimensional scoring protocol, please refer to our paper.

All source videos in the benchmark are uncompressed and previewable directly in the browser, so users can inspect any sample without downloading anything.

The benchmark is organized into four themes:

Theme Description
location Background replaced with a different physical environment (rural, nature, landmark, ...).
season Background changed across seasons (spring, summer, autumn, winter).
time Background changed across times of day (dawn, dusk, night, ...).
style Background restyled (era, mood, cinematic, ...).

πŸ—‚οΈ Repository Structure

Sparkle-Bench/
β”œβ”€β”€ README.md
β”œβ”€β”€ location_bench.csv                              # 3 columns: edited_type, prompt, original_video
β”œβ”€β”€ location_metadata.jsonl                         # per-task metadata (edit_type, subtheme, original scene)
β”œβ”€β”€ season_bench.csv
β”œβ”€β”€ season_metadata.jsonl
β”œβ”€β”€ time_bench.csv
β”œβ”€β”€ time_metadata.jsonl
β”œβ”€β”€ style_bench.csv
β”œβ”€β”€ style_metadata.jsonl
β”œβ”€β”€ source_videos/                                  # all 458 source videos, browsable online
β”‚   β”œβ”€β”€ location/
β”‚   β”‚   β”œβ”€β”€ Sparkle_location_000011.mp4
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ season/
β”‚   β”œβ”€β”€ time/
β”‚   └── style/
└── ref_images/                                     # optional reference background images (see below)
    β”œβ”€β”€ location/
    β”œβ”€β”€ season/
    β”œβ”€β”€ time/
    └── style/

🧾 Benchmark Format

We follow the format of OpenVE-Bench for direct compatibility with existing evaluation pipelines.

Each theme's evaluation prompts live in {edit_type}_bench.csv, a three-column table:

Column Description
edited_type The theme of this sample, one of location / season / time / style.
prompt The natural-language editing instruction.
original_video Path to the source video, e.g. source_videos/location/Sparkle_location_010913.mp4.

Per-task auxiliary metadata is stored alongside in {edit_type}_metadata.jsonl. Each line is one sample:

{
  "id": "Sparkle_location_004302",
  "prompt": "Put the subject against ancient stone ruins overgrown with wind-swept grass, ...",
  "metadata": {
    "edit_type": "location",
    "chosen_keyword": "landmark: ancient stone ruins with wind-swept grass",
    "original_scene": "A dimly lit indoor bar or restaurant with brick walls, framed artwork, and warm overhead lighting."
  }
}
Field Description
id Sample id, e.g. Sparkle_location_004302. Matches the basename of the corresponding original_video path.
prompt Same as the prompt column in the CSV.
metadata.edit_type The theme this sample belongs to (location / season / time / style).
metadata.chosen_keyword The subtheme: scene label (e.g. "landmark: ancient stone ruins with wind-swept grass").
metadata.original_scene A description of the source video's first frame.

πŸ‘€ Online Preview

All 458 source videos are stored as uncompressed .mp4 files under source_videos/{edit_type}/, and can be played directly in the browser without any download.

For example, the source video of task Sparkle_location_000011 (the first row in the location theme of the dataset viewer) is browsable at: Sparkle_location_000011.

The dataset viewer at the top of the HF page lets you scroll through all four themes and read the corresponding prompts inline.

⬇️ Downloading the Benchmark

Sparkle-Bench is small enough to download in one command. Pull the entire repo:

hf download stdKonjac/Sparkle-Bench --repo-type=dataset --local-dir ./Sparkle-Bench

Or download only a single theme (e.g. location):

hf download stdKonjac/Sparkle-Bench \
  --repo-type=dataset \
  --local-dir ./Sparkle-Bench \
  --include "location_*" "source_videos/location/*"

After downloading, the relative paths in {edit_type}_bench.csv (e.g. source_videos/location/Sparkle_location_010913.mp4) will resolve directly.

πŸ“Š Evaluation

We provide an end-to-end evaluation script, eval_sparkle_bench_gemini.py, that scores edited videos using Gemini-2.5-Pro under our six-dimensional rubric (see Section 3.7 in our paper). The six dimensions are: Instruction Compliance, Overall Visual Quality, Foreground Integrity, Foreground Motion Consistency, Background Dynamics, and Background Visual Quality, each scored on a 1–5 scale.

1. Prepare your inference outputs

The script expects edited videos to be organized in a specific directory tree. For every sample in Sparkle-Bench, the inference output should be saved as:

{save_dir}/{edit_type}/{subtheme}---{scene_key}/{id}_edited.mp4

where:

  • {save_dir} is your inference root (free to choose).
  • {edit_type} is one of location / season / time / style.
  • {subtheme}---{scene_key} is derived from the sample's chosen_keyword field in {edit_type}_metadata.jsonl. Specifically, splitting chosen_keyword on ": " yields subtheme: scene, then scene_key = scene.replace(" ", "_"). The triple-dash --- is the separator between the two parts.
  • {id} is the sample id, e.g. Sparkle_location_000172.

For example, the inference outputs across the four themes should look like:

{save_dir}/
β”œβ”€β”€ location/
β”‚   └── landmark---ancient_stone_ruins_with_wind-swept_grass/
β”‚       └── Sparkle_location_000172_edited.mp4
β”œβ”€β”€ season/
β”‚   └── {subtheme}---{scene_key}/
β”‚       └── Sparkle_season_xxxxxx_edited.mp4
β”œβ”€β”€ time/
β”‚   └── {subtheme}---{scene_key}/
β”‚       └── Sparkle_time_xxxxxx_edited.mp4
└── style/
    └── {subtheme}---{scene_key}/
        └── Sparkle_style_xxxxxx_edited.mp4

2. Configure the Gemini API

By default the script uses Azure-hosted Gemini via the OpenAI-compatible API for convenient concurrency. Export two environment variables before running:

export AZURE_ENDPOINT="https://your-azure-endpoint"
export GEMINI_API_KEY="your-api-key"

If you have direct access to the Gemini API, you can swap the GEMINI_API client at the top of the script for the native google-genai SDK. The request payload only needs (system prompt, source video, edited video), so the adaptation is straightforward. Just keep the temperature=0 / seed=42 settings for reproducibility.

3. Run the evaluation

Assuming Sparkle-Bench has been downloaded to data/Sparkle-Bench/ (the default --bench_root):

python3 eval_sparkle_bench_gemini.py \
    --video_paths /path/to/sparkle_bench_results/

For multiple checkpoints in one run:

python3 eval_sparkle_bench_gemini.py \
    --video_paths /path/to/ckpt_a/sparkle_bench/ \
                  /path/to/ckpt_b/sparkle_bench/ \
                  /path/to/ckpt_c/sparkle_bench/

By default the script evaluates all four themes (location, season, time, style); pass --edit_types to restrict to a subset. Concurrency is controlled inside the script (default 20 workers).

4. Read the output

For each (save_dir, edit_type) pair, the script writes:

{save_dir}/{edit_type}_gemini-2.5-pro_sparkle_score.jsonl

Each line is a per-sample record containing the six-dim scores plus the original Gemini reasoning:

{
  "id": "Sparkle_location_000172",
  "prompt": "Put the subject against ancient stone ruins overgrown with wind-swept grass, ...",
  "edit_type": "location",
  "subtheme": "landmark",
  "scene": "ancient stone ruins with wind-swept grass",
  "scores": [5, 5, 5, 5, 5, 5],
  "result": "Brief reasoning: The edited background perfectly matches every detail of the prompt, ...\nInstruction Compliance: 5\nOverall Visual Quality: 5\nForeground Integrity: 5\nForeground Motion Consistency: 5\nBackground Dynamics: 5\nBackground Visual Quality: 5"
}

The scores array follows this fixed order: [Instruction Compliance, Overall Visual Quality, Foreground Integrity, Foreground Motion Consistency, Background Dynamics, Background Visual Quality]. Following the OpenVE-Bench protocol, the script automatically caps dimensions 2–6 at the Instruction Compliance score to prevent score hacking.

After scoring, the script aggregates per-theme and macro averages and prints a summary table to stdout. The evaluation is deterministic by design (temperature=0, fixed seed=42) for reproducibility.

πŸ–ΌοΈ Reference Images (Optional, Use with Caution)

By construction, every Sparkle-Bench sample is a video that passed the first four stages of our pipeline but failed the final synthesis quality check in Stage 5 (see Section 3.7 of our paper). As a free byproduct, this means each sample comes with a pure background image generated by Stage 3 (Individual Background Generation), where the foreground has been removed from the preliminarily edited first frame.

We release these images under ref_images/{edit_type}/{id}.png, alongside the CSV/JSONL annotations. These images may be useful for reference-based background-replacement experiments (e.g., feeding the clean background as an extra visual condition to the editing model).

⚠️ Disclaimer. Our paper neither trains any reference-based model nor includes any reference-image-based evaluation. We release ref_images/ purely to facilitate future research in this direction. The images are not curated and may contain noise such as low-quality edits or imperfect foreground removal. Please use them with caution. We make no quality guarantees about this auxiliary asset.

πŸ“œ Benchmark License

The Sparkle-Bench is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Source videos are derived from OpenVE-3M and retain their original licenses; please consult the upstream source before redistribution.

🧠 Model

We release Kiwi-Sparkle, a video background-replacement model fine-tuned on the Sparkle dataset for 10K steps with a batch size of 128, starting from a Kiwi-Edit base. Since we apply no architectural modifications to Kiwi-Edit, Kiwi-Sparkle's weights are fully compatible with the Kiwi-Edit weights structure. Any inference, training, or deployment pipeline that runs Kiwi-Edit can run Kiwi-Sparkle as a drop-in replacement.

The model is open-sourced at πŸ€—stdKonjac/Kiwi-Sparkle-720P-81F and supports 720P resolution with up to 81-frame outputs.

Setting Value
Foundation model Kiwi-Edit-Stage2 (Image + Video)
Resolution 720 Γ— 1280
Max output frames 81
Fine-tuning steps 10,000
Batch size 128
Architectural changes None. Drop-in compatible with Kiwi-Edit.

πŸš€ Training

Kiwi-Sparkle is trained using the official Kiwi-Edit recipe in this script with no modifications. Two common entry points are supported:

Train from the Kiwi-Edit base on a Sparkle theme. Point --vid_dataset_metadata_path to the corresponding Sparkle training CSV, and load the foundation Kiwi-Edit-Stage2 checkpoint:

--vid_dataset_metadata_path /path/to/Sparkle/prompts/{edit_type}_train.csv
--checkpoint /path/to/Kiwi-Edit-Stage2/model.safetensors

where {edit_type} is one of location / season / time / style / openve3m. The five training CSVs are hosted here.

Continue training from our Kiwi-Sparkle checkpoint. Replace the --checkpoint argument:

--checkpoint /path/to/Kiwi-Sparkle-720P-81F/model.safetensors

The rest of the script stays exactly as in the official Kiwi-Edit setup.

🎬 Inference

OpenVE-Bench

Since Kiwi-Sparkle is architecturally identical to Kiwi-Edit, you can simply follow the official OpenVE-Bench evaluation pipeline of Kiwi-Edit and swap the checkpoint to Kiwi-Sparkle. For example:

python3 test_benchmark.py \
  --ckpt_path /path/to/Kiwi-Sparkle-720P-81F/model.safetensors \
  --bench openve \
  --max_frame 81 \
  --max_pixels 921600 \
  --save_dir ./infer_results/

Sparkle-Bench

We provide a dedicated launch pair, test_benchmark_sparkle_bench.py and test_benchmark_sparkle_bench.sh, that mirror Kiwi-Edit's existing benchmarking layout.

Step 1. Clone the Kiwi-Edit repository and copy our two scripts into the Kiwi-Edit repo root, alongside the official test_benchmark.py.

Step 2. Edit the shell script to point at your Kiwi-Sparkle checkpoint, then launch (defaults to 8 GPUs):

bash test_benchmark_sparkle_bench.sh

The script writes inference outputs to infer_results/Kiwi-Sparkle-720P-81F/sparkle_bench/{edit_type}/{subtheme}---{scene_key}/{id}_edited.mp4. Re-run it with a different EDIT_TYPE to cover all four themes.

Step 3. Score the outputs with our Gemini-based evaluator:

python3 eval_sparkle_bench_gemini.py \
    --video_paths infer_results/Kiwi-Sparkle-720P-81F/sparkle_bench/

See the Evaluation section above for details on environment setup, output format, and the six-dimensional scoring rubric.

πŸ“œ Model License

Kiwi-Sparkle is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

πŸ™ Acknowledgements

This project is built on top of a number of excellent open-source projects. We thank the authors of Kiwi-Edit, FLUX.2-klein-9B, Qwen3-VL-32B, Wan2.2-I2V-A14B, LightX2V, and VideoX-Fun for releasing the infrastructure that made this work possible.

πŸ“ Citation

If you find Sparkle useful for your research, please consider citing our paper:

@misc{zeng2026sparkle,
  title         = {Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance},
  author        = {Zeng, Ziyun and Lin, Yiqi and Liang, Guoqiang and Shou, Mike Zheng},
  year          = {2026},
  eprint        = {2605.06535},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2605.06535}
}

About

Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors