
Veta: AI Viral Shorts Generator

Veta is an autonomous agent designed to mass-produce high-quality, viral-style short videos (YouTube Shorts, TikTok, Reels) from structured data. It combines state-of-the-art TTS, generative AI for logic/vision, and dynamic video editing into a single pipeline.

🌟 Key Features

1. πŸ“¦ Bulk Processing Pipeline

  • Input: Accepts a simple JSON file (config/input.json by default) defining multiple topics.
  • Workflow: Processes topics sequentially, managing detailed asset generation for each.
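The bulk workflow amounts to loading the JSON file and iterating over its entries. A minimal sketch of what that loader might look like (the actual loader and validation in `main.py` may differ; the function name here is illustrative):

```python
import json

def load_topics(path="config/input.json"):
    """Load and validate the topic list. Only 'topic' and 'script' are
    required; 'hook' is optional and defaults to the topic."""
    with open(path) as f:
        topics = json.load(f)
    for entry in topics:
        missing = {"topic", "script"} - entry.keys()
        if missing:
            raise ValueError(f"Entry missing fields {missing}: {entry}")
    return topics
```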

2. πŸ¦… Asset Supply Agent (Generative AI)

  • Engine: ComfyUI with Flux.1 Schnell.
  • Generative Workflow: Creates high-quality, custom images for each segment based on the script context.
  • Workflow File: Uses config/flux_schnell_workflow.json.
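Under the hood, generating an image means loading the API-format workflow JSON, patching the prompt text into the right node, and POSTing it to ComfyUI's `/prompt` endpoint. A sketch of the payload-building step (the node id `"6"` is a placeholder; inspect your exported workflow to find the CLIPTextEncode node that holds the positive prompt):

```python
import json
import uuid

def build_comfy_request(workflow_path, prompt_text, prompt_node="6"):
    """Patch the positive prompt into an API-format workflow and wrap it
    in the payload shape ComfyUI's POST /prompt endpoint expects."""
    with open(workflow_path) as f:
        workflow = json.load(f)
    workflow[prompt_node]["inputs"]["text"] = prompt_text
    # ComfyUI queues one generation per {"prompt": ..., "client_id": ...} POST
    return {"prompt": workflow, "client_id": str(uuid.uuid4())}
```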

3. πŸŽ™οΈ Cinematic Audio

  • Engine: Powered by Kokoro TTS (v1.0).
  • Style: Uses the af_heart voice profile at 1.0x speed for natural, high-energy narration.
  • Audio Processing: Normalizes audio levels and ensures clean segment transitions.
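The "normalizes audio levels" step can be illustrated with simple peak normalization, applied to each segment before concatenation. This is a sketch of the idea, not the project's actual processing chain (which runs on Kokoro's output and may use loudness rather than peak normalization):

```python
import numpy as np

def normalize_peak(samples, target_db=-1.0):
    """Scale a float audio buffer so its peak sits at target_db dBFS.
    Silent input is returned unchanged."""
    peak = np.max(np.abs(samples))
    if peak == 0:
        return samples
    target = 10 ** (target_db / 20.0)  # dBFS -> linear amplitude
    return samples * (target / peak)
```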

4. 🎬 Dynamic Video Engine

  • Resolution: Native 1080x1920 (9:16) Vertical Video.
  • Supersampling: Renders internally at 2160x3840 (2x) before downscaling to eliminate shimmer/aliasing during zooms.
  • Vision-Enhanced Refinement: Uses Llama 3.2 Vision to review generated images against the script and intelligently refine prompts if quality is low or if the user rejects them.
  • Effects:
    • Ken Burns: Randomized smooth pans and zooms for every static image.
    • Stabilization: Applies deshake filters to ensure smooth motion.
    • Solid Backgrounds: Uses professional solid black backgrounds for letterboxing.
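The Ken Burns and supersampling steps combine naturally in a single ffmpeg filter chain: zoom at 2x resolution, then downscale. A sketch of building such a chain (the rate range and filter layout are illustrative, not the project's actual settings; pans and zoom-outs need additional x/y/z expressions):

```python
import random

def ken_burns_filter(duration_s, fps=30):
    """Build an ffmpeg filter chain for a slow randomized zoom-in on a still.
    zoompan runs at 2160x3840, then the result is downscaled to 1080x1920
    to suppress shimmer/aliasing."""
    frames = int(duration_s * fps)
    rate = random.uniform(0.0005, 0.0015)  # zoom increment per frame
    return (
        "scale=2160:3840,"
        f"zoompan=z='zoom+{rate:.4f}':d={frames}:s=2160x3840:fps={fps},"
        "scale=1080:1920"
    )
```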

5. πŸ“ Automatic Captions

  • Transcription: Uses OpenAI Whisper (base model) for accurate word-level timestamps.
  • Styling: Generates .ass subtitles with "Influencer" styling (Montserrat Black font, Karaoke effects).
  • Burn-in: Hardcodes subtitles into the final video using ffmpeg.
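The karaoke effect works by converting Whisper's per-word timestamps into `.ass` `\k` tags, whose argument is the word's duration in centiseconds. The real pipeline uses pysubs2 for this; the hand-rolled sketch below just shows the timing math (field layout assumes a style named "Influencer" is defined in the file header):

```python
def ass_karaoke_line(words):
    """Turn whisper-style word timestamps (dicts with 'word', 'start', 'end'
    in seconds, as from transcribe(word_timestamps=True)) into one .ass
    Dialogue line with karaoke (\\k) tags."""
    def ts(sec):
        h, rem = divmod(sec, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h)}:{int(m):02d}:{s:05.2f}"
    start, end = words[0]["start"], words[-1]["end"]
    # \k takes the highlight duration in centiseconds
    body = "".join(
        f"{{\\k{round((w['end'] - w['start']) * 100)}}}{w['word']}" for w in words
    )
    return f"Dialogue: 0,{ts(start)},{ts(end)},Influencer,,0,0,0,,{body}"
```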

πŸŽ₯ Example Output

Here is a sample generated by Veta:

Watch the video

πŸ› οΈ Installation & Setup (Linux Guide)

This guide assumes you are running a modern Linux distribution (Ubuntu 22.04+ or similar).

Prerequisites

  1. System Tools:

    sudo apt update
    sudo apt install git python3 python3-venv python3-pip ffmpeg

    Optional but Recommended: Install uv (Fast Package Manager)

    curl -LsSf https://astral.sh/uv/install.sh | sh
  3. Ollama (LLM Engine): Install Ollama and pull the local models the pipeline uses.

    curl -fsSL https://ollama.com/install.sh | sh
    ollama pull llama3.1:8b
    ollama pull gemma3:4b
  3. ComfyUI (Image Generation Engine): You need a local instance of ComfyUI running.

    • Follow the ComfyUI Installation Guide.
    • Model: Download the Flux.1 Schnell checkpoint and place it in ComfyUI/models/checkpoints/.
    • Running: Start ComfyUI (usually python main.py). It typically runs at http://127.0.0.1:8188.

Installation

  1. Clone the Repository:

    git clone https://github.com/your-username/veta.git
    cd veta
  2. Set Up Virtual Environment: It is highly recommended to use a virtual environment to avoid conflicts.

    python3 -m venv .venv
    source .venv/bin/activate
  3. Install Python Dependencies:

    # Using standard pip
    pip install -r requirements.txt
    
    # OR using uv (faster)
    uv pip install -r requirements.txt

    Note: If you encounter issues with whisper, ensure you have openai-whisper installed, not the package named whisper.

  4. Configuration: Create a .env file in the root directory:

    touch .env

    Add the following content to .env:

    # ComfyUI Configuration
    COMFYUI_URL=127.0.0.1:8188
    
    # Optional: Pixabay API Key for Stock Images (Fallback)
    PIXABAY_API_KEY=your_pixabay_api_key_here
    
    # Optional: Ollama Vision Model
    OLLAMA_VISION_MODEL=llama3.2-vision
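A `.env` file is just `KEY=VALUE` lines loaded into the environment at startup. The project most likely uses python-dotenv for this; the minimal parser below (no quoting or escaping rules) just shows the effect, with defaults matching the values above:

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: put KEY=VALUE lines into os.environ,
    skipping comments and blanks, without overriding existing values."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

COMFYUI_URL_DEFAULT = "127.0.0.1:8188"
```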

πŸš€ Usage

1. Prepare config/input.json

Create or edit config/input.json. This file controls what videos are generated.

Format:

[
  {
    "topic": "The Search Engine Shift",
    "hook": "Stop Googling Everything",
    "script": "Stop Googling everything. Seriously. For twenty years we have been using search engines the same way..."
  },
  {
     "topic": "The Dead Internet Theory",
     "script": "Have you ever felt like the internet is empty? Like you are the only real person left..."
  }
]
  • topic: Unique identifier for the video (used for folder naming).
  • hook: The headline displayed on the video (optional, defaults to topic).
  • script: Full voiceover text. The AI will automatically segment this into scenes.

2. Run the Agent

With your virtual environment activated:

python3 main.py --input_file config/input.json

3. Review Process

The agent runs in an interactive mode:

  1. Script Generation: It segments your script.
  2. Prompt Generation: It creates image prompts.
  3. Review: It will generate images and ask for your approval.
    • [a] Approve: Keeps the image.
    • [r] Reject: Prompts you for feedback to regenerate.
    • [s] Skip: Skips review; the image is auto-approved.

4. Interrupt & Resume

If you stop the process (Ctrl+C) or if it crashes, Veta saves your progress automatically. When you run the same command again:

python3 main.py --input_file config/input.json

It will detect the existing checkpoint and ask:

  • [r] Resume: Continues exactly where it left off.
  • [n] New: Deletes the checkpoint and starts fresh.
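Conceptually, checkpointing is just serializing the pipeline state to disk after each step and reading it back on startup. A sketch of the mechanism (the actual checkpoint path, state shape, and format in Veta may differ):

```python
import json
import os

def save_checkpoint(state, path):
    """Persist pipeline state (e.g. topic name and last finished segment)."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(state, f)

def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return None
```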

πŸ“‚ Output Structure

  • Final Videos: output/{Topic_Name}/final_video_captioned.mp4
  • Temporary Files: output/temp/{Topic_Name}/
    • Contains raw audio, generated images, and segment videos.
    • These are auto-deleted after successful generation to save space.

🧩 Architecture

graph TD
    JSON[input.json] --> Main[main.py]
    Main --> Graph[LangGraph Workflow]
    
    subgraph "Agents & Tools"
        Graph --> Script["Script Writer\n(Ollama Llama 3.1)"]
        Script --> VisualDir["Visual Director\n(Llama 3.1 + Vision)"]
        
        VisualDir --> Audio["Audio Gen\n(Kokoro TTS)"]
        VisualDir --> Visual["Visual Gen\n(ComfyUI Flux)"]
        
        Audio --> Wav["Segment.wav"]
        Visual --> Img["Segment.jpg"]
        
        Wav & Img --> Render["Renderer\n(ffmpeg)"]
        Render --> Caps["Caption Engine\n(Whisper + pysubs2)"]
    end
    
    Caps --> Final[Final Video]

βš™οΈ Customization

1. Changing the LLM (Script & Visuals)

Currently, the agents are hardcoded to use llama3.1:8b. To use a different local model (e.g., gemma2 or mistral):

  1. Pull the model: ollama pull <model_name>
  2. Edit Files: Update the model string in:
    • src/agents/script_writer.py
    • src/agents/visual_director.py

2. Changing the Vision Model

The review capability uses llama3.2-vision by default. To change this:

  1. Pull the model: ollama pull <model_name>
  2. Update .env:
    OLLAMA_VISION_MODEL=llava

3. Customizing Image Generation

To use a different ComfyUI workflow (e.g., for SDXL or a Realism LoRA):

  1. Save your workflow: Export it as API Format (JSON) from ComfyUI.
  2. Replace File: Overwrite config/flux_schnell_workflow.json OR update the path in src/tools/image_tools.py.

⚠️ Troubleshooting

1. AttributeError: module 'whisper' has no attribute 'load_model' This means you installed the wrong whisper package. Fix:

pip uninstall whisper
pip install openai-whisper

2. ffmpeg not found Ensure ffmpeg is installed system-wide.

sudo apt install ffmpeg
ffmpeg -version  # Verify installation

3. ComfyUI Connection Refused Ensure ComfyUI is running in a separate terminal window and verify the URL in your .env file matches its output (default 127.0.0.1:8188).
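A quick way to verify connectivity from Python before launching a run (ComfyUI exposes a simple HTTP API; `/system_stats` is one of its standard GET endpoints):

```python
import urllib.request

def comfy_reachable(url="127.0.0.1:8188", timeout=3):
    """Return True if a ComfyUI instance answers on its HTTP API."""
    try:
        with urllib.request.urlopen(f"http://{url}/system_stats",
                                    timeout=timeout):
            return True
    except OSError:  # covers connection refused, timeouts, DNS failures
        return False
```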
