Production-grade pipeline for generating 15-20 minute AI videos from text prompts using open-source models.
This pipeline transforms a single text prompt into a complete long-form video by:
- Splitting the concept into scenes using an LLM
- Planning individual camera shots for each scene
- Generating 3-12 second video clips using AI models
- Maintaining continuity between clips via frame analysis
- Creating audio (narration + background music)
- Stitching everything with FFmpeg into a final MP4
User Prompt → Scene Splitter → Shot Planner → Clip Generator → Continuity Manager
                                                                      ↓
                              Audio Pipeline → FFmpeg Stitcher → Final Video
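To get a sense of the scale a run implies, the snippet below estimates how many individual clips the planner has to produce for a given target length, using the 8-second default clip duration from the `PipelineConfig` example further down (the real count depends on the shot planner's output):

```python
# Back-of-the-envelope clip count for a target duration at the default 8 s clip length.
import math

def clips_needed(target_minutes: float, clip_seconds: float = 8.0) -> int:
    return math.ceil(target_minutes * 60 / clip_seconds)

print(clips_needed(1.5))   # 90-second test video   -> 12 clips
print(clips_needed(12))    # 12-minute documentary  -> 90 clips
print(clips_needed(20))    # 20-minute upper target -> 150 clips
```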
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 16GB | 64GB |
| GPU VRAM | None (cloud) | 24GB+ (local) |
| Storage | 50GB | 200GB |
| CPU | 8 cores | 16+ cores |
Note: This pipeline is designed for cloud-first execution (Replicate, HuggingFace). Local video generation requires 12GB+ VRAM.
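If you do want to try local generation, a simple runtime check of available VRAM can decide whether to fall back to a cloud backend. This is an illustrative helper, not part of the pipeline, and it assumes PyTorch is installed:

```python
# Illustrative backend selection based on available VRAM (assumes PyTorch is installed).
import torch

def pick_backend(min_vram_gb: float = 12.0) -> str:
    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        if vram_gb >= min_vram_gb:
            return "local"
    return "replicate"  # cloud-first default

print(pick_backend())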
cd AI_VIDEO
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt

# For video generation
export REPLICATE_API_TOKEN="your_token_here"
# For scene splitting and HuggingFace models
export HF_TOKEN="your_token_here"

Get free tokens:
- Replicate: https://replicate.com (50 free predictions/month)
- HuggingFace: https://huggingface.co/settings/tokens (free tier available)
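Before kicking off a long run, it is worth confirming the tokens are actually visible to Python; a minimal check using the same variable names as the exports above:

```python
# Quick sanity check that the API tokens are set in the current environment.
import os

for name in ("REPLICATE_API_TOKEN", "HF_TOKEN"):
    value = os.environ.get(name)
    print(f"{name}: {'set' if value else 'MISSING'}")
```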
# Dry run to test setup
python -m pipeline.orchestrator --dry-run --example
# Generate a 90-second test video
python -m pipeline.orchestrator "A cinematic sunset over mountains" --duration 1.5
# Full 12-minute documentary
python -m pipeline.orchestrator "Your video concept here" --duration 12

AI_VIDEO/
├── pipeline/                  # Core pipeline modules
│   ├── __init__.py
│   ├── orchestrator.py        # Main coordinator
│   ├── scene_splitter.py      # Prompt → Scenes (LLM)
│   ├── shot_planner.py        # Scenes → Shots
│   ├── prompt_templates.py    # Jinja2 templates
│   ├── clip_worker.py         # Video generation
│   ├── continuity_manager.py  # Visual continuity
│   └── audio_pipeline.py      # TTS + Music
├── scripts/
│   └── ffmpeg_tools.sh        # FFmpeg commands
├── examples/
│   └── salt_flats_shots.json  # Sample shot sequence
├── tests/
│   └── test_prompts.py        # Unit tests
├── config.yaml                # Configuration
├── requirements.txt           # Dependencies
├── sample_run.sh              # Example script
└── manifest.json              # Project manifest
from pipeline import VideoPipeline
pipeline = VideoPipeline()
result = pipeline.run(
user_prompt="A serene Japanese garden in autumn, falling maple leaves, koi pond",
target_duration=5.0 # minutes
)
print(f"Final video: {result['final_video']}")from pipeline.orchestrator import PipelineConfig, VideoPipeline
config = PipelineConfig(
resolution=1080,
default_clip_duration=8.0,
primary_backend="replicate",
tts_engine="bark",
skip_music=False
)
pipeline = VideoPipeline(config)

python -m pipeline.orchestrator "Your prompt" \
--duration 10 \ # Target duration in minutes
--resolution 1080 \ # 720 or 1080
--backend replicate \ # replicate, huggingface, local
--skip-audio \ # Skip audio generation
--dry-run # Test without generation

The pipeline includes a complete example for a 12-minute documentary:
# View the sample shot breakdown
cat examples/salt_flats_shots.json
# Generate using the example
python -m pipeline.orchestrator --example --duration 2

Sample Scenes:
- Opening - Empty Horizon (vast salt flats, dusk)
- First Glimpse - The Traveler (silhouette in green coat)
- Lost Machine - The Plane (rusted aircraft in salt)
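You can also inspect the sample shot sequence programmatically. The key and field names below (`shots`, `scene`, `prompt`, `duration`) are assumptions for illustration; check the JSON file itself for the actual schema.

```python
# Inspect the bundled example shot list; key and field names here are assumptions,
# so adjust them to the file's actual schema.
import json

with open("examples/salt_flats_shots.json") as f:
    data = json.load(f)

shots = data.get("shots", data) if isinstance(data, dict) else data  # assumed top-level layout
print(f"{len(shots)} shots in the example")
for shot in shots[:3]:
    print(shot.get("scene"), "|", shot.get("prompt"), f"({shot.get('duration')}s)")
```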
Edit config.yaml to customize:
video:
  resolution: 720
  fps: 24
  default_clip_duration: 6

backend:
  primary: "replicate"
  fallback: "huggingface"

audio:
  tts:
    engine: "bark"
    voice: "v2/en_speaker_6"
  music:
engine: "musicgen"# Run unit tests
# Run unit tests
python -m pytest tests/ -v
# Validate FFmpeg installation
bash scripts/ffmpeg_tools.sh --validate
# Test module imports
python -c "from pipeline import VideoPipeline; print('OK')"The pipeline tracks:
| Metric | Description | Target |
|---|---|---|
| Continuity Score | Visual consistency between clips | >0.8 |
| Audio Sync | Narration/video alignment | ±0.5s |
| Color Match | Cross-clip color consistency | ΔE < 10 |
| Generation Success | Clips generated vs planned | >95% |
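The real scoring lives in `continuity_manager.py`. As a rough illustration of the idea behind the continuity score, you can compare the last frame of one clip against the first frame of the next, for example with normalized color histograms; this is a simplified stand-in, not the pipeline's actual metric, and assumes Pillow and NumPy:

```python
# Simplified stand-in for a continuity check: histogram similarity between two frames.
# Not the pipeline's actual metric; assumes Pillow and NumPy are installed.
import numpy as np
from PIL import Image

def frame_similarity(frame_a: str, frame_b: str, bins: int = 32) -> float:
    """Return a 0-1 similarity score based on per-channel color histograms."""
    def hist(path: str) -> np.ndarray:
        img = np.asarray(Image.open(path).convert("RGB").resize((256, 256)))
        channels = [np.histogram(img[..., c], bins=bins, range=(0, 255))[0] for c in range(3)]
        h = np.concatenate(channels).astype(float)
        return h / h.sum()

    a, b = hist(frame_a), hist(frame_b)
    return float(np.minimum(a, b).sum())  # histogram intersection: 1.0 = identical distributions

# Example: flag a cut that likely breaks visual continuity
# if frame_similarity("clip_03_last.png", "clip_04_first.png") < 0.8: ...
```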
flowchart TD
A[User Prompt] --> B[Scene Splitter]
B --> C[Shot Planner]
C --> D{For Each Shot}
D --> E[Get Continuity Context]
E --> F[Generate Clip]
F --> G[Extract Last Frame]
G --> H[Analyze for Continuity]
H --> D
D --> I[Audio Pipeline]
I --> J[TTS Narration]
I --> K[Music Generation]
J --> L[FFmpeg Stitch]
K --> L
F --> L
L --> M[Final MP4]
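The "Extract Last Frame" and stitch steps map onto standard FFmpeg invocations. The calls below show one way to drive them from Python and are only a sketch; the commands that ship with the repo are in `scripts/ffmpeg_tools.sh`.

```python
# Sketch of the FFmpeg steps from Python; the repo's own commands live in scripts/ffmpeg_tools.sh.
import subprocess

def extract_last_frame(clip: str, out_png: str) -> None:
    # Seek to ~0.5 s before the end of the clip and grab a single frame for continuity analysis.
    subprocess.run(
        ["ffmpeg", "-y", "-sseof", "-0.5", "-i", clip, "-update", "1", "-frames:v", "1", out_png],
        check=True,
    )

def concat_clips(clips: list[str], out_mp4: str) -> None:
    # Concat demuxer: write a file list, then stream-copy the clips into one MP4.
    with open("clips.txt", "w") as f:
        f.writelines(f"file '{c}'\n" for c in clips)
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", "clips.txt", "-c", "copy", out_mp4],
        check=True,
    )
```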
- Batch Processing: Generate clips overnight to work within free tier limits (see the sketch after this list)
- Checkpointing: Pipeline saves progress; resume with `--continue-from`
- Local Testing: Use `--backend local` for fast pipeline testing (placeholders only)
- Resolution: Start with 720p; upscale final video if needed
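For the batch-processing tip, the documented Python API is enough to queue several prompts overnight. This sketch relies only on `VideoPipeline.run` as shown earlier; the prompt list is hypothetical.

```python
# Overnight batch run over several prompts using the documented VideoPipeline API.
from pipeline import VideoPipeline

prompts = [  # hypothetical list of concepts
    "A cinematic sunset over mountains",
    "A serene Japanese garden in autumn, falling maple leaves, koi pond",
]

pipeline = VideoPipeline()
for prompt in prompts:
    try:
        result = pipeline.run(user_prompt=prompt, target_duration=2.0)  # short runs to stay within free tiers
        print("done:", result["final_video"])
    except Exception as exc:  # keep going if one generation fails; rerun later from checkpoints
        print("failed:", prompt, exc)
```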
export REPLICATE_API_TOKEN="your_token"
# or
export HF_TOKEN="your_token"Install FFmpeg: https://ffmpeg.org/download.html
Use cloud backends instead:
python -m pipeline.orchestrator "prompt" --backend replicateWait and retry, or use checkpoints to resume later.
MIT License - see LICENSE file.
- Video Models: Stability AI, Replicate, HuggingFace
- Audio: Bark TTS, MusicGen
- Processing: FFmpeg