🎬 Your personal editor for turning hours of footage into cinematic montages.
CutClaw is an end-to-end editing system for long-form footage + music.
It first deconstructs raw video/audio into structured captions, then uses a multi-agent pipeline to plan shots (shot_plan), select clip timestamps (shot_point), and validate final quality before rendering.
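The plan, select, and validate stages described above can be pictured as a small loop. The following is only a sketch under stated assumptions: the `Shot` fields, the `run_pipeline` function, and the three callables are hypothetical names for illustration and do not reflect CutClaw's internal API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Shot:
    """Hypothetical record: a planned shot plus its selected timestamps."""
    description: str      # what the plan asks for (a shot_plan-style entry)
    start: float = 0.0    # clip start in seconds (a shot_point-style value)
    end: float = 0.0      # clip end in seconds


def run_pipeline(
    plan: Callable[[], List[Shot]],
    select: Callable[[Shot], Shot],
    review: Callable[[Shot], bool],
    max_rounds: int = 3,
) -> List[Shot]:
    """Plan shots, pick timestamps, and re-select any shot the reviewer rejects."""
    shots = [select(s) for s in plan()]
    for _ in range(max_rounds):
        rejected = [i for i, s in enumerate(shots) if not review(s)]
        if not rejected:
            return shots
        for i in rejected:
            shots[i] = select(shots[i])  # give rejected shots another attempt
    if all(review(s) for s in shots):
        return shots
    raise RuntimeError("review did not pass within max_rounds")
```

In this sketch only rejected shots are re-selected, so accepted clips stay stable between review rounds.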
We warmly welcome new issues and ideas from the community. If you have suggestions, please open an issue. Your feedback will help shape our future plans and be the fuel that helps this project take off. 🔥
What we're building next for faster, cheaper, and more expressive video editing.
- 🧩 ARC-Chapter Integration
  Bring in ARC-Chapter to reduce the cost of long-form footage deconstruction.
- 💸 Low-Cost Mode
  Add a budget-friendly mode that proactively reads only relevant footage instead of fully processing all source material.
- 🎙️ Talking-Head + Visual Mixing
  Introduce hybrid editing logic that coordinates narration-driven clips with supporting visual footage.
Broader product and ecosystem directions for the next stage of CutClaw.
- ✍️ Playwriter Upgrade
  Expand the Playwriter with richer editing patterns and more diverse visual storytelling methods.
- 🔌 Claude Code MCP Support
  Adapt CutClaw to work smoothly within Claude Code MCP workflows.
- 🌐 Online Service Interface
  Build a web-based service interface for easier access and deployment.
git clone https://github.com/GVCLab/CutClaw.git
cd CutClaw
conda create -n CutClaw python=3.12
conda activate CutClaw
pip install -r requirements.txt
We strongly recommend the GPU-accelerated Decord/NVDEC build for faster video decoding; build it from source.
resource/
├── video/ ← put your .mp4 / .mkv here
├── audio/ ← put your .mp3 / .wav here
└── subtitle/ ← optional .srt (skips ASR, saves time)
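A quick way to confirm the layout above before launching is a small check script. This is a minimal sketch: the `scan_resources` helper and `MEDIA_DIRS` mapping are hypothetical names, not part of CutClaw, and the extension sets only mirror the ones listed in the tree (other containers may also work).

```python
from pathlib import Path

# Folder name -> extensions mentioned in the layout above (not an exhaustive list).
MEDIA_DIRS = {
    "video": {".mp4", ".mkv"},
    "audio": {".mp3", ".wav"},
    "subtitle": {".srt"},
}


def scan_resources(root="resource"):
    """Create the expected folders if missing and list matching files in each."""
    found = {}
    for name, extensions in MEDIA_DIRS.items():
        folder = Path(root) / name
        folder.mkdir(parents=True, exist_ok=True)
        found[name] = sorted(
            p.name
            for p in folder.iterdir()
            if p.is_file() and p.suffix.lower() in extensions
        )
    return found


if __name__ == "__main__":
    for folder, files in scan_resources().items():
        print(f"{folder}: {files or 'empty'}")
```

Run it from the project root; an empty `subtitle` folder is fine because subtitles are optional.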
UI (recommended)
streamlit run app.py
Then open http://localhost:8501 in your browser. (If http://localhost:8501 does not work, try http://127.0.0.1:8501.)
Place your footage in the paths above; you can then select those files directly in the UI.
Model selection guidance:
- Video model
  - Role: shot/scene understanding and visual captioning.
  - Recommended: Gemini-3, Qwen3.5, GPT-5.3
- Audio model
  - Role: ASR plus music-structure parsing (beat/downbeat, pitch, energy) for music-aware segmentation.
  - Recommended: Gemini-3
- Agent model
  - Role: drives the Screenwriter + Editor + Reviewer loop to generate shot_plan and shot_point.
  - Recommended: MiniMax-2.7, Kimi-2.5, Claude-4.5
We use LiteLLM as the API gateway. A typical model name is 'openai/MiniMax-2.7', which means the given model is called through the OpenAI protocol. See the LiteLLM documentation for more information.
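To make the naming convention concrete, the sketch below splits such a string into its protocol prefix and model name. It only illustrates the convention described above; `split_model_name` is a hypothetical helper and is not how LiteLLM parses names internally.

```python
def split_model_name(model: str):
    """Split 'provider/model' into (provider, model) at the first slash."""
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model', got {model!r}")
    return provider, name


if __name__ == "__main__":
    provider, name = split_model_name("openai/MiniMax-2.7")
    print(f"protocol={provider} model={name}")
```

Splitting at the first slash keeps any later slashes inside the model name.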
CLI (advanced)
python local_run.py \
--Video_Path "resource/video/xxxx.mp4" \
--Audio_Path "resource/audio/xxxx.mp3" \
--Instruction "xxxx"
Common config overrides
Any src/config.py parameter can be overridden with --config.PARAM_NAME VALUE.
| Parameter | Default | Effect |
|---|---|---|
| `VIDEO_PATH` | `"resource/video/The_Dark_Knight.mkv"` | Default input video path used by UI remembered inputs |
| `AUDIO_PATH` | `"resource/audio/Way_Down_We_Go.mp3"` | Default input audio path used by UI remembered inputs |
| `INSTRUCTION` | `"Joker's crazy that want to change the world."` | Default editing instruction prompt |
| `ASR_BACKEND` | `"litellm"` | ASR engine (`litellm` cloud or `whisper_cpp` local) |
| `VIDEO_FPS` | `2` | Sampling FPS for preprocessing |
| `MAIN_CHARACTER_NAME` | `"Joker"` | Protagonist name for character-focused edits |
| `AUDIO_MIN_SEGMENT_DURATION` | `3.0` | Minimum beat segment duration (seconds) |
| `AUDIO_MAX_SEGMENT_DURATION` | `5.0` | Maximum beat segment duration (seconds) |
| `AUDIO_DETECTION_METHODS` | `["downbeat", "pitch", "mel_energy"]` | Audio keypoint detection methods |
| `PARALLEL_SHOT_MAX_WORKERS` | `4` | Parallel shot selection workers |
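The `--config.PARAM_NAME VALUE` convention can be pictured with a small parser. This is a sketch under assumptions, not CutClaw's actual argument handling: `parse_config_overrides` is a hypothetical helper, and interpreting values with `ast.literal_eval` is one possible choice for turning `2` or `1.8` into numbers while leaving plain words as strings.

```python
import ast


def parse_config_overrides(argv):
    """Collect '--config.NAME VALUE' pairs from an argument list into a dict."""
    prefix = "--config."
    overrides = {}
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg.startswith(prefix) and i + 1 < len(argv):
            raw = argv[i + 1]
            try:
                # Numbers, lists, and quoted strings become Python values.
                value = ast.literal_eval(raw)
            except (ValueError, SyntaxError):
                # Anything else (for example a bare name) stays a string.
                value = raw
            overrides[arg[len(prefix):]] = value
            i += 2
        else:
            i += 1
    return overrides


if __name__ == "__main__":
    import sys
    print(parse_config_overrides(sys.argv[1:]))
```

Arguments that do not start with `--config.` are skipped, so the regular flags can share the same command line.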
Example:
python local_run.py \
--Video_Path "resource/video/xxxx.mp4" \
--Audio_Path "resource/audio/xxxx.mp3" \
--Instruction "xxxx" \
--config.MAIN_CHARACTER_NAME "Batman" \
--config.VIDEO_FPS 2 \
--config.AUDIO_TOTAL_SHOTS 50
Then render manually:
python render/render_video.py \
--shot-plan "Output/<video_audio>/shot_plan_*.json" \
--shot-json "Output/<video_audio>/shot_point_*.json" \
--video "resource/video/xxxx.mp4" \
--audio "resource/audio/xxxx.mp3" \
--output "output/final.mp4" \
--crop-ratio "9:16" \
--no-labels --render-hook-dialogue
All commands must be run from the CutClaw project directory with the correct conda environment:
cd ~/Develop/CutClaw
conda activate CutClaw
Analyze video without BGM (used when BGM is not yet selected):
python local_run.py \
--Video_Path "resource/video/sample.MOV" \
--Instruction "video analysis only" \
--type vlog \
--preprocess-only
Output: Output/Video/{VIDEO_ID}/captions/scene_summaries_video/ + shot_scenes.txt
Analyze BGM structure (after the content strategist has downloaded the BGM):
python -c "
from src.audio.audio_caption_madmom import caption_audio_with_madmom_segments
caption_audio_with_madmom_segments(
audio_path='resource/audio/bgm.mp3',
output_path='Output/Audio/{BGM_ID}/captions/captions.json',
)
"
Output: Output/Audio/{BGM_ID}/captions/captions.json (BPM, structure segments, keypoints)
Run both video and BGM analysis together (preprocess only, no creative generation):
python local_run.py \
--Video_Path "resource/video/sample.MOV" \
--Audio_Path "resource/audio/bgm.mp3" \
--Instruction "preprocess only" \
--type vlog \
--preprocess-only
Search and download BGM from Pixabay (free commercial use, no API key required):
# Search
python3 ~/.openclaw/skills/pixabay-music-skill/scripts/pixabay_music.py \
search "upbeat travel vlog" --max-duration 120
# Download
python3 ~/.openclaw/skills/pixabay-music-skill/scripts/pixabay_music.py \
download "upbeat travel vlog" \
-o ~/Develop/CutClaw/resource/audio/bgm.mp3
Based on scene analysis + BGM structure, the content strategist generates a shot plan:
python src/planner_agent.py \
--video "resource/video/sample.MOV" \
--scene-summaries "Output/Video/{VIDEO_ID}/captions/scene_summaries_video" \
--audio-captions "Output/Audio/{BGM_ID}/captions/captions.json" \
--subtitle "Output/Video/{VIDEO_ID}/subtitles_with_characters.srt" \
--bgm-name "bgm.mp3" \
--output-dir "Output/Output/{VIDEO_ID}_{BGM_ID}" \
--strategy "fast cuts in first 4s, warm interaction in middle 6s, emotional climax in last 5s" \
--action shot_plan
Generate precise clip timestamps from the confirmed shot plan:
python src/short_video_editor.py \
--video "resource/video/sample.MOV" \
--shot-plan "Output/Output/{VIDEO_ID}_{BGM_ID}/shot_plan_xxx.json" \
--scene-summaries "Output/Video/{VIDEO_ID}/captions/scene_summaries_video" \
--audio-captions "Output/Audio/{BGM_ID}/captions/captions.json" \
--scene-cuts "Output/Video/{VIDEO_ID}/frames/shot_scenes.txt" \
--instruction "warm family outing, 15s beat-sync short video" \
--shot-point-context "prioritize shots with children laughing" \
--action shot_point
Preview the generated composition without rendering:
python src/short_video_editor.py ... --action dry_run
Once shot points are confirmed, render the final video:
python src/short_video_editor.py \
--video "resource/video/sample.MOV" \
--shot-plan "Output/Output/{VIDEO_ID}_{BGM_ID}/shot_plan_xxx.json" \
--scene-summaries "Output/Video/{VIDEO_ID}/captions/scene_summaries_video" \
--audio-captions "Output/Audio/{BGM_ID}/captions/captions.json" \
--action render
Common runtime configuration overrides:
python local_run.py ... \
--config.VIDEO_FPS 2 \
--config.AUDIO_TOTAL_SHOTS 50 \
--config.MAIN_CHARACTER_NAME "Tree" \
--config.MIN_PROTAGONIST_RATIO 0.7 \
--config.AUDIO_MIN_SEGMENT_DURATION 1.8 \
--config.AUDIO_MAX_SEGMENT_DURATION 3.8

| Operation | Output Path | Description |
|---|---|---|
| Video Analysis | Output/Video/{ID}/captions/scene_summaries_video/ | Per-scene descriptions |
| Scene Cuts | Output/Video/{ID}/frames/shot_scenes.txt | Shot boundaries |
| BGM Analysis | Output/Audio/{ID}/captions/captions.json | Rhythm structure + captions |
| ASR Subtitles | Output/Video/{ID}/subtitles.srt | Speech-to-text |
| Shot Plan | Output/Output/{ID}/{BGM}/shot_plan_xxx.json | Creative plan |
| Shot Point | Output/Output/{ID}/{BGM}/shot_point_xxx.json | Precise timestamps |
| Final Video | Output/Output/{ID}/{BGM}/output_9x16.mp4 | Rendered video |
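Before moving on to planning, it can help to check that the preprocessing outputs listed above exist. The sketch below builds the analysis paths from the table; `analysis_paths` and `missing_outputs` are hypothetical helpers (not part of CutClaw), and the shot plan, shot point, and final video paths are left out because their file names contain run-specific parts.

```python
from pathlib import Path


def analysis_paths(video_id, bgm_id, root="Output"):
    """Return the preprocessing output paths for a video ID and a BGM ID."""
    root = Path(root)
    video_dir = root / "Video" / video_id
    return {
        "scene_summaries": video_dir / "captions" / "scene_summaries_video",
        "scene_cuts": video_dir / "frames" / "shot_scenes.txt",
        "subtitles": video_dir / "subtitles.srt",
        "audio_captions": root / "Audio" / bgm_id / "captions" / "captions.json",
    }


def missing_outputs(video_id, bgm_id, root="Output"):
    """List the names of expected outputs that are not on disk yet."""
    paths = analysis_paths(video_id, bgm_id, root)
    return [name for name, path in paths.items() if not path.exists()]


if __name__ == "__main__":
    # "sample" and "bgm" are placeholder IDs for illustration only.
    print(missing_outputs("sample", "bgm"))
```

An empty list suggests that preprocessing has already produced the cached results for that footage and track.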
Very slow runtime
- API latency: the pipeline sends a large number of concurrent requests to vision/language APIs. Speed is heavily dependent on your API provider's response time and rate limits.
- First-run Footage Deconstruction: the first time you process a video, shot detection, captioning, ASR, and scene analysis all run from scratch. This is a one-time cost per video; subsequent edits with the same footage reuse the cached results and are much faster.
- GPU acceleration: a CUDA-capable GPU significantly speeds up video decoding and encoding. We recommend building Decord with NVDEC support (see Install section).
- Video codec compatibility: if the pipeline appears to hang during video-related steps, the source video's encoding may be the cause. In our testing, videos encoded with libx264 worked reliably.
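If the codec is a suspect, one option is to inspect the source and re-encode it with libx264 before running the pipeline. The sketch below only builds the command lines; it assumes `ffprobe` and `ffmpeg` are installed and on your PATH, and the helper names are hypothetical. Copying the audio stream may not work for every source and container combination, so treat the flags as a starting point.

```python
import subprocess


def probe_codec_cmd(path):
    """Build an ffprobe command that prints the first video stream's codec name."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-show_entries", "stream=codec_name",
        "-of", "default=noprint_wrappers=1:nokey=1",
        path,
    ]


def reencode_h264_cmd(src, dst):
    """Build an ffmpeg command that re-encodes video with libx264 and copies audio."""
    return ["ffmpeg", "-i", src, "-c:v", "libx264", "-c:a", "copy", dst]


if __name__ == "__main__":
    # Placeholder file names for illustration only.
    src, dst = "resource/video/input.mkv", "resource/video/input_h264.mp4"
    result = subprocess.run(probe_codec_cmd(src), capture_output=True, text=True)
    if result.stdout.strip() != "h264":
        subprocess.run(reencode_h264_cmd(src, dst), check=True)
```

Re-encoding is lossy and can take a while on long footage, so it is worth checking the codec first.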
If you find CutClaw useful for your research, please cite our work using the following BibTeX:
@article{cutclaw,
title={CutClaw: Agentic Hours-Long Video Editing via Music Synchronization},
author={Shifang Zhao and Yihan Hu and Ying Shan and Yunchao Wei and Xiaodong Cun},
journal={arXiv preprint arXiv:2603.29664},
year={2026}
}
This project is a derivative work of GVCLab/CutClaw, the original academic research project by Shifang Zhao, Yihan Hu, Ying Shan, Yunchao Wei, and Xiaodong Cun from Beijing Jiaotong University, Great Bay University, and Tencent ARC Lab.
- The original codebase and research are (c) GVCLab and its authors.
- New features, modifications, and extensions by @treesan are released under the MIT License (see LICENSE).
- Please cite the original CutClaw paper if you use this work in your research.

