Veta is an autonomous agent designed to mass-produce high-quality, viral-style short videos (YouTube Shorts, TikTok, Reels) from structured data. It combines state-of-the-art TTS, generative AI for logic/vision, and dynamic video editing into a single pipeline.
- Input: Accepts a simple JSON file (`config/input.json` by default) defining multiple topics.
- Workflow: Processes topics sequentially, managing detailed asset generation for each.
- Engine: ComfyUI with Flux.1 Schnell.
- Generative Workflow: Creates high-quality, custom images for each segment based on the script context.
- Workflow File: Uses `config/flux_schnell_workflow.json`.
- Engine: Powered by Kokoro TTS (v1.0).
- Style: Uses the `af_heart` voice profile at 1.0x speed for natural, high-energy narration.
- Audio Processing: Normalizes audio levels and ensures clean segment transitions.
- Resolution: Native 1080x1920 (9:16) Vertical Video.
- Supersampling: Renders internally at 2160x3840 (2x) before downscaling to eliminate shimmer/aliasing during zooms.
- Vision-Enhanced Refinement: Uses Llama 3.2 Vision to review generated images against the script and intelligently refine prompts if quality is low or if the user rejects them.
- Effects:
- Ken Burns: Randomized smooth pans and zooms for every static image.
- Stabilization: Applies `deshake` filters to ensure smooth motion.
- Solid Backgrounds: Uses professional solid black backgrounds for letterboxing.
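The Ken Burns and supersampling steps can be sketched as an ffmpeg `zoompan` filter chain. The helper below is illustrative only (not Veta's actual renderer code): it builds a filter string that zooms in at a small randomized per-frame rate on a 2x supersampled canvas, then downscales to the final 1080x1920 frame to suppress shimmer.

```python
import random

def build_ken_burns_filter(duration_s: float, fps: int = 30,
                           out_w: int = 1080, out_h: int = 1920) -> str:
    """Build an ffmpeg zoompan filter string for a randomized Ken Burns zoom.

    Renders on a 2x supersampled canvas and downscales afterwards,
    mirroring the anti-aliasing approach described above.
    """
    frames = int(duration_s * fps)
    zoom_rate = random.uniform(0.0008, 0.0020)  # per-frame zoom increment
    zoompan = (
        f"zoompan=z='min(zoom+{zoom_rate:.4f},1.3)'"
        f":x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)'"  # keep the crop centered
        f":d={frames}:s={out_w * 2}x{out_h * 2}:fps={fps}"
    )
    # Downscale from the 2x supersampled frame to the final resolution.
    return f"{zoompan},scale={out_w}:{out_h}"
```

The resulting string would be passed to ffmpeg via `-vf`; randomizing the zoom rate per segment keeps consecutive scenes from feeling identical.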
- Transcription: Uses OpenAI Whisper (base model) for accurate word-level timestamps.
- Styling: Generates .ass subtitles with "Influencer" styling (Montserrat Black font, Karaoke effects).
- Burn-in: Hardcodes subtitles into the final video using `ffmpeg`.
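The karaoke effect relies on ASS `{\kNN}` timing tags, where `NN` is the highlight duration in centiseconds. A minimal sketch of turning Whisper-style word timestamps into a karaoke subtitle line (`to_karaoke_line` is a hypothetical helper, not Veta's actual code):

```python
def to_karaoke_line(words: list[dict]) -> str:
    """Convert Whisper-style word timestamps into an ASS karaoke line.

    Each word gets a {\kNN} tag holding its duration in centiseconds,
    so the burned-in subtitle highlights word by word.
    """
    parts = []
    for w in words:
        dur_cs = round((w["end"] - w["start"]) * 100)  # seconds -> centiseconds
        parts.append(f"{{\\k{dur_cs}}}{w['word'].strip()}")
    return " ".join(parts)
```

A line like this would then be placed in an `.ass` event styled with the Montserrat Black "Influencer" look before being burned in with `ffmpeg`.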
Here is a sample generated by Veta:
This guide assumes you are running a modern Linux distribution (Ubuntu 22.04+ or similar).
- System Tools:

  ```bash
  sudo apt update
  sudo apt install git python3 python3-venv python3-pip ffmpeg
  ```
- Optional but Recommended: Install uv (Fast Package Manager)

  ```bash
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```
- Ollama (LLM Engine): Install Ollama to run the Llama 3.1 model locally.

  ```bash
  curl -fsSL https://ollama.com/install.sh | sh
  ollama pull llama3.1:8b
  ollama pull gemma3:4b
  ```
- ComfyUI (Image Generation Engine): You need a local instance of ComfyUI running.
  - Follow the ComfyUI Installation Guide.
  - Model: Download the Flux.1 Schnell checkpoint and place it in `ComfyUI/models/checkpoints/`.
  - Running: Start ComfyUI (usually `python main.py`). It typically runs at `http://127.0.0.1:8188`.
- Clone the Repository:

  ```bash
  git clone https://github.com/your-username/veta.git
  cd veta
  ```
- Set Up Virtual Environment: It is highly recommended to use a virtual environment to avoid conflicts.

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  ```
- Install Python Dependencies:

  ```bash
  # Using standard pip
  pip install -r requirements.txt
  # OR using uv (faster)
  uv pip install -r requirements.txt
  ```

  Note: If you encounter issues with `whisper`, ensure you have `openai-whisper` installed, not the package named `whisper`.
- Configuration: Create a `.env` file in the root directory:

  ```bash
  touch .env
  ```

  Add the following content to `.env`:

  ```env
  # ComfyUI Configuration
  COMFYUI_URL=127.0.0.1:8188

  # Optional: Pixabay API Key for Stock Images (Fallback)
  PIXABAY_API_KEY=your_pixabay_api_key_here

  # Optional: Ollama Vision Model
  OLLAMA_VISION_MODEL=llama3.2-vision
  ```
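A minimal sketch of how these settings might be consumed at runtime. This is illustrative only: the real project may load the `.env` file via python-dotenv, and `load_settings` is a hypothetical name.

```python
import os

def load_settings() -> dict:
    """Read Veta's configuration from the environment, with sane defaults.

    Assumes the .env file has already been loaded into the process
    environment (e.g. by python-dotenv).
    """
    return {
        "comfyui_url": os.environ.get("COMFYUI_URL", "127.0.0.1:8188"),
        "pixabay_api_key": os.environ.get("PIXABAY_API_KEY"),  # optional fallback
        "vision_model": os.environ.get("OLLAMA_VISION_MODEL", "llama3.2-vision"),
    }
```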
Create or edit `config/input.json`. This file controls what videos are generated.
Format:
```json
[
  {
    "topic": "The Search Engine Shift",
    "hook": "Stop Googling Everything",
    "script": "Stop Googling everything. Seriously. For twenty years we have been using search engines the same way..."
  },
  {
    "topic": "The Dead Internet Theory",
    "script": "Have you ever felt like the internet is empty? Like you are the only real person left..."
  }
]
```

- topic: Unique identifier for the video (used for folder naming).
- hook: The headline displayed on the video (optional, defaults to topic).
- script: Full voiceover text. The AI will automatically segment this into scenes.
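These constraints can be checked when the file is loaded. A hedged sketch (the function name and error handling are assumptions, not Veta's actual code):

```python
import json

def load_topics(path: str) -> list[dict]:
    """Load the topic list, validate required fields, and default hook to topic."""
    with open(path, encoding="utf-8") as f:
        topics = json.load(f)
    for entry in topics:
        if "topic" not in entry or "script" not in entry:
            raise ValueError(f"Each entry needs 'topic' and 'script': {entry}")
        entry.setdefault("hook", entry["topic"])  # hook is optional
    return topics
```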
With your virtual environment activated:
```bash
python3 main.py --input_file config/input.json
```

The agent runs in an interactive mode:
- Script Generation: It segments your script.
- Prompt Generation: It creates image prompts.
- Review: It will generate images and ask for your approval.
- [a] Approve: Keeps the image.
- [r] Reject: Prompts you for feedback to regenerate.
- [s] Skip: Auto-approves the image and continues.
If you stop the process (Ctrl+C) or if it crashes, Veta saves your progress automatically. When you run the same command again:
```bash
python3 main.py --input_file config/input.json
```

It will detect the existing checkpoint and ask:
- [r] Resume: Continues exactly where it left off.
- [n] New: Deletes the checkpoint and starts fresh.
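The resume behavior can be implemented with a small JSON state file. The sketch below is an assumption about how it might work; the checkpoint path and schema are hypothetical:

```python
import json
from pathlib import Path

def save_checkpoint(state: dict, path: Path) -> None:
    """Persist progress (e.g. current topic and segment index) to disk."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state), encoding="utf-8")

def load_checkpoint(path: Path):
    """Return saved progress as a dict, or None when starting fresh."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return None
```

On startup, a `None` result means no checkpoint exists; otherwise the agent can offer the Resume/New prompt described above.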
- Final Videos: `output/{Topic_Name}/final_video_captioned.mp4`
- Temporary Files: `output/temp/{Topic_Name}/`
  - Contains raw audio, generated images, and segment videos.
  - These are auto-deleted after successful generation to save space.
```mermaid
graph TD
    JSON[input.json] --> Main[main.py]
    Main --> Graph[LangGraph Workflow]

    subgraph "Agents & Tools"
        Graph --> Script["Script Writer\n(Ollama Llama 3.1)"]
        Script --> VisualDir["Visual Director\n(Llama 3.1 + Vision)"]
        VisualDir --> Audio["Audio Gen\n(Kokoro TTS)"]
        VisualDir --> Visual["Visual Gen\n(ComfyUI Flux)"]
        Audio --> Wav["Segment.wav"]
        Visual --> Img["Segment.jpg"]
        Wav & Img --> Render["Renderer\n(ffmpeg)"]
        Render --> Caps["Caption Engine\n(Whisper + pysubs2)"]
    end

    Caps --> Final[Final Video]
```
Currently, the agents are hardcoded to use `llama3.1:8b`. To use a different local model (e.g., `gemma2` or `mistral`):
- Pull the model: `ollama pull <model_name>`
- Edit Files: Update the model string in:
  - `src/agents/script_writer.py`
  - `src/agents/visual_director.py`
The review capability uses `llama3.2-vision` by default. To change this:
- Pull the model: `ollama pull <model_name>`
- Update `.env`: `OLLAMA_VISION_MODEL=llava`
To use a different ComfyUI workflow (e.g., for SDXL or a Realism LoRA):
- Save your workflow: Export it as API Format (JSON) from ComfyUI.
- Replace File: Overwrite `config/flux_schnell_workflow.json` OR update the path in `src/tools/image_tools.py`.
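Once a workflow is exported in API format, it can be submitted by POSTing `{"prompt": <workflow>}` to ComfyUI's `/prompt` endpoint. A stdlib-only sketch; the helper names are illustrative, not Veta's actual image tool code:

```python
import json
import urllib.request

def build_prompt_request(workflow: dict, host: str = "127.0.0.1:8188"):
    """Build the POST request that queues an API-format workflow on ComfyUI."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def queue_workflow(workflow: dict, host: str = "127.0.0.1:8188") -> bytes:
    """Send the workflow to a running ComfyUI instance and return its response."""
    with urllib.request.urlopen(build_prompt_request(workflow, host)) as resp:
        return resp.read()
```

Note that the regular "Save" in the ComfyUI menu produces UI-format JSON; only the API-format export will work with `/prompt`.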
1. `AttributeError: module 'whisper' has no attribute 'load_model'`
This means you installed the wrong whisper package.
Fix:

```bash
pip uninstall whisper
pip install openai-whisper
```

2. ffmpeg not found
Ensure ffmpeg is installed system-wide.
```bash
sudo apt install ffmpeg
ffmpeg -version  # Verify installation
```

3. ComfyUI Connection Refused
Ensure ComfyUI is running in a separate terminal window and verify the URL in your `.env` file matches its output (default `127.0.0.1:8188`).
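A quick way to diagnose this is a plain TCP check against the address in `COMFYUI_URL`. This helper is illustrative, not part of Veta:

```python
import socket

def comfyui_reachable(url: str = "127.0.0.1:8188", timeout: float = 2.0) -> bool:
    """Return True if something is listening where COMFYUI_URL points."""
    host, _, port = url.partition(":")
    try:
        with socket.create_connection((host, int(port or 8188)), timeout=timeout):
            return True
    except OSError:  # connection refused, timeout, bad host, etc.
        return False
```

If this returns False, start ComfyUI first; if it returns True but Veta still fails, double-check that `COMFYUI_URL` has no `http://` prefix mismatch with how the code builds its request URLs.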