# 🎬 Social Video Engine

**AI-powered social video generator with Qwen3-TTS + Remotion**

Generate professional animated short-form videos (Reels/TikTok/Shorts) with natural AI voiceover.

**Pipeline:**
```
Story Script → Qwen3-TTS (voiceover) → Remotion (React animations) → FFmpeg (merge) → Final MP4
```

**Cost: $0.00 per video** (runs entirely on Colab's free GPU)

## 1️⃣ Setup Environment
Installs Node.js, Chromium, and Qwen3-TTS, then clones the video engine repo.

⏱️ Takes about 5-7 minutes on first run.

In [None]:
%%bash
echo "🔧 Installing Node.js 20..."
curl -fsSL https://deb.nodesource.com/setup_20.x | bash - > /dev/null 2>&1
apt-get install -y nodejs > /dev/null 2>&1
echo "  Node: $(node -v)"
echo "  npm: $(npm -v)"

echo ""
echo "🔧 Installing Chromium for Remotion..."
apt-get install -y chromium-browser > /dev/null 2>&1 || apt-get install -y chromium > /dev/null 2>&1
echo "  Chromium: $(chromium-browser --version 2>/dev/null || chromium --version 2>/dev/null || echo 'using bundled')"

echo ""
echo "🔧 FFmpeg check..."
echo "  FFmpeg: $(ffmpeg -version 2>&1 | head -1)"

echo ""
echo "✅ System dependencies ready!"

In [None]:
# Clone the Social Video Engine repo
import os

REPO_DIR = "/content/social-video-engine"

if not os.path.exists(REPO_DIR):
    !git clone https://github.com/redwanJemal/social-video-engine.git {REPO_DIR}
    !cd {REPO_DIR} && npm install --legacy-peer-deps 2>&1 | tail -3
else:
    !cd {REPO_DIR} && git pull
    print("Repo already cloned")

print(f"\n✅ Video engine ready at {REPO_DIR}")

In [None]:
# Install Qwen3-TTS
!pip install -U qwen-tts soundfile numpy > /dev/null 2>&1

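# Try installing flash-attention (optional; speeds up inference).
# A failing `!pip` command does not raise a Python exception,
# so run pip through subprocess and check the exit code instead.
import os
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "-m", "pip", "install", "flash-attn", "--no-build-isolation"],
    env={**os.environ, "MAX_JOBS": "2"},
    capture_output=True,
)
if result.returncode == 0:
    print("✅ flash-attn installed")
else:
    print("⚠️ flash-attn failed (will use default attention, which still works fine)")

import torch
print(f"\n🖥️ GPU: {torch.cuda.get_device_name(0)}")
print(f"   VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
print("\n✅ Qwen3-TTS ready!")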

## 2️⃣ Load TTS Model

Choose your model:
- **CustomVoice** (recommended): 9 premium voices with mood/style control
- **VoiceDesign**: describe any voice you want in text
- **Base**: clone any voice from a 3-second audio sample

In [None]:
import torch
import soundfile as sf
import numpy as np
from qwen_tts import Qwen3TTSModel
from IPython.display import Audio, display

# ===== CHOOSE MODEL =====
MODEL_NAME = "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"
# MODEL_NAME = "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign"
# MODEL_NAME = "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice"  # Lighter, if T4 runs out of VRAM

print(f"Loading {MODEL_NAME}...")

# Try flash_attention_2, fall back to default
try:
    model = Qwen3TTSModel.from_pretrained(
        MODEL_NAME,
        device_map="cuda:0",
        dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
except Exception:
    model = Qwen3TTSModel.from_pretrained(
        MODEL_NAME,
        device_map="cuda:0",
        dtype=torch.bfloat16,
    )

print("✅ Model loaded!")
if "CustomVoice" in MODEL_NAME:
    print(f"🎤 Speakers: {model.get_supported_speakers()}")
    print(f"🌍 Languages: {model.get_supported_languages()}")

## 3️⃣ Define Your Video

### Available Remotion Templates
| Template | Description |
|----------|-------------|
| `intro` | Dramatic hook with ring animation |
| `kinetic-text` | Words flying in with spring physics |
| `stat-card` | Animated number counters |
| `list-reveal` | Items appearing one by one |
| `quote-card` | Testimonial with quotation marks |
| `cta` | Pulsing call-to-action button |

### Available TTS Speakers
| Speaker | Voice | Best For |
|---------|-------|----------|
| **Ryan** | Dynamic male, strong rhythm | Narration, energy |
| **Aiden** | Sunny American male | Friendly, casual |
| **Vivian** | Bright young female | Engaging, punchy |
| **Serena** | Warm gentle female | Calm, storytelling |

### Available Themes
`midnight` `ocean` `sunset` `forest` `noir` `fire`
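
A typo in a config (an unknown template name, a missing `tts` text) otherwise only surfaces later, during TTS or rendering. Below is a minimal sketch of an up-front check, assuming the scene shape used in this notebook (`type`, `duration`, `props`, `tts`). `validate_config` and `TEMPLATES` are hypothetical helpers and are not part of the repo.

```python
# Hypothetical helper: sanity-check a video config before generating audio.
# The template names are the ones listed in the table above.
TEMPLATES = {"intro", "kinetic-text", "stat-card", "list-reveal", "quote-card", "cta"}


def validate_config(config, speakers=None):
    """Return a list of human-readable problems (empty if the config looks fine)."""
    problems = []
    for i, scene in enumerate(config.get("scenes", []), start=1):
        if scene.get("type") not in TEMPLATES:
            problems.append(f"scene {i}: unknown template {scene.get('type')!r}")
        duration = scene.get("duration")
        if not isinstance(duration, int) or duration <= 0:
            problems.append(f"scene {i}: duration must be a positive frame count")
        tts = scene.get("tts") or {}
        if not tts.get("text"):
            problems.append(f"scene {i}: missing tts text")
        # Speaker names are only checked when a set of valid names is supplied.
        if speakers is not None and tts.get("speaker") not in speakers:
            problems.append(f"scene {i}: unknown speaker {tts.get('speaker')!r}")
    if not config.get("scenes"):
        problems.append("config has no scenes")
    return problems
```

Passing the result of `model.get_supported_speakers()` as `speakers` would check names against the loaded model, assuming the returned names use the same casing as the config.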

In [None]:
# ===== DEFINE YOUR VIDEO =====
# Each scene has: template props + TTS narration

VIDEO_CONFIG = {
    "theme": {
        "name": "Midnight",
        "bgGradient": ["#0f0c29", "#302b63"],
        "textColor": "#ffffff",
        "accentColor": "#f5576c",
        "fontFamily": "sans-serif"
    },
    "scenes": [
        {
            "type": "intro",
            "duration": 90,  # frames (30fps) = 3 seconds
            "props": {
                "hook": "Stop scrolling.",
                "subtitle": "This changes everything"
            },
            "tts": {
                "text": "Stop scrolling. This is going to change everything you know about productivity.",
                "speaker": "Ryan",
                "instruct": "Dramatic, attention-grabbing, confident."
            }
        },
        {
            "type": "kinetic-text",
            "duration": 120,  # 4 seconds
            "props": {
                "lines": ["Most people waste", "3 HOURS a day", "on tasks AI can do", "in 3 MINUTES"],
                "accentLineIndex": 1,
                "animation": "slide-up"
            },
            "tts": {
                "text": "Most people waste three hours every single day on tasks that AI can finish in just three minutes.",
                "speaker": "Ryan",
                "instruct": "Building intensity, emphasize the contrast between hours and minutes."
            }
        },
        {
            "type": "stat-card",
            "duration": 120,  # 4 seconds
            "props": {
                "title": "The Numbers Don't Lie",
                "stats": [
                    {"value": "87%", "label": "Time Saved"},
                    {"value": "10x", "label": "More Output"},
                    {"value": "$0", "label": "Extra Cost"}
                ]
            },
            "tts": {
                "text": "Eighty-seven percent time saved. Ten times more output. And it costs you absolutely nothing extra.",
                "speaker": "Ryan",
                "instruct": "Confident, data-driven, impressive."
            }
        },
        {
            "type": "list-reveal",
            "duration": 150,  # 5 seconds
            "props": {
                "title": "Top 3 AI Tools",
                "items": ["ChatGPT for writing", "Midjourney for design", "Cursor for coding"],
                "icon": "🚀",
                "numbered": True
            },
            "tts": {
                "text": "Here are the top three AI tools you need right now. Number one: ChatGPT for writing. Number two: Midjourney for design. And number three: Cursor for coding.",
                "speaker": "Ryan",
                "instruct": "Enthusiastic, listing items with clear pauses between each."
            }
        },
        {
            "type": "quote-card",
            "duration": 120,  # 4 seconds
            "props": {
                "quote": "AI won't replace you. But someone using AI will.",
                "author": "Tech Industry",
                "role": "Common saying"
            },
            "tts": {
                "text": "Remember this: AI won't replace you. But someone using AI, definitely will.",
                "speaker": "Ryan",
                "instruct": "Thoughtful pause before the punchline, serious tone."
            }
        },
        {
            "type": "cta",
            "duration": 90,  # 3 seconds
            "props": {
                "headline": "Start Today",
                "subtext": "Follow for more AI tips",
                "buttonText": "Follow →"
            },
            "tts": {
                "text": "Follow for more AI tips that actually save you time. See you in the next one.",
                "speaker": "Ryan",
                "instruct": "Warm, inviting, friendly call to action."
            }
        }
    ]
}

total_frames = sum(s["duration"] for s in VIDEO_CONFIG["scenes"])
print(f"📋 Video: {len(VIDEO_CONFIG['scenes'])} scenes, {total_frames} frames ({total_frames/30:.1f}s)")
print(f"🎨 Theme: {VIDEO_CONFIG['theme']['name']}")
for i, s in enumerate(VIDEO_CONFIG["scenes"]):
    print(f"   {i+1}. [{s['type']}] {s['duration']/30:.1f}s - {s['tts']['speaker']}: \"{s['tts']['text'][:50]}...\"")

## 4️⃣ Generate Voiceover
Generates TTS audio for each scene using Qwen3-TTS.
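
Each scene's audio length is the sample count divided by the sample rate. As a minimal sketch, the same figure can be read back from a saved file with only the standard library. `wav_duration_seconds` is a hypothetical helper, and it assumes the file is plain PCM WAV.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file: number of frames divided by the sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

This can be handy for re-deriving scene durations from files on disk after a runtime restart, when the in-memory `audio_duration` values are gone.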

In [None]:
import os
import json

AUDIO_DIR = "/content/social-video-engine/public/audio"
os.makedirs(AUDIO_DIR, exist_ok=True)

print("🎙️ Generating voiceover for each scene...\n")

for i, scene in enumerate(VIDEO_CONFIG["scenes"]):
    tts = scene["tts"]
    print(f"  Scene {i+1}/{len(VIDEO_CONFIG['scenes'])}: [{scene['type']}] {tts['speaker']}")
    print(f"    \"{tts['text'][:70]}...\"")
    
    wavs, sr = model.generate_custom_voice(
        text=tts["text"],
        language="English",
        speaker=tts["speaker"],
        instruct=tts.get("instruct", ""),
    )
    
    audio_file = f"{AUDIO_DIR}/scene_{i:03d}.wav"
    sf.write(audio_file, wavs[0], sr)
    duration_s = len(wavs[0]) / sr
    print(f"    ✅ {duration_s:.1f}s → {audio_file}")
    
    # Store audio duration to adjust video scene length
    scene["audio_duration"] = duration_s
    scene["audio_file"] = f"audio/scene_{i:03d}.wav"
    print()

# Preview the last generated audio
print("\n🔊 Preview (last scene):")
display(Audio(wavs[0], rate=sr))

print("\n✅ All voiceovers generated!")

## 5️⃣ Adjust Scene Durations to Match Audio
Auto-adjusts video scene lengths to match the TTS audio durations.
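
The rule is: convert the audio length to frames, add a fixed padding, and never shrink a scene below its configured length. Below is a minimal worked sketch using the same constants as this notebook (30 fps, 15 frames of padding). `frames_for_scene` is a hypothetical name used only for illustration.

```python
def frames_for_scene(audio_seconds, min_frames, fps=30, padding_frames=15):
    """Scene length in frames: audio length plus padding, never below min_frames."""
    audio_frames = int(audio_seconds * fps) + padding_frames
    return max(audio_frames, min_frames)


# A 4.5 s voiceover on a scene configured for 90 frames (3 s):
# int(4.5 * 30) + 15 = 150 frames, so the scene stretches to 5 s.
print(frames_for_scene(4.5, min_frames=90))   # 150
# A 1.0 s voiceover needs only 45 frames, so the configured 90 frames are kept.
print(frames_for_scene(1.0, min_frames=90))   # 90
```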

In [None]:
FPS = 30
PADDING_FRAMES = 15  # 0.5s padding per scene

print("⏱️ Adjusting scene durations to match audio:\n")

for i, scene in enumerate(VIDEO_CONFIG["scenes"]):
    audio_dur = scene.get("audio_duration", scene["duration"] / FPS)
    audio_frames = int(audio_dur * FPS) + PADDING_FRAMES
    old_dur = scene["duration"]
    scene["duration"] = max(audio_frames, old_dur)  # Use whichever is longer
    print(f"  Scene {i+1} [{scene['type']}]: {old_dur/FPS:.1f}s → {scene['duration']/FPS:.1f}s (audio: {audio_dur:.1f}s)")

total_frames = sum(s["duration"] for s in VIDEO_CONFIG["scenes"])
print(f"\n📊 Total video: {total_frames} frames = {total_frames/FPS:.1f}s")

## 6️⃣ Render Video with Remotion
Renders the React animations to MP4 using Remotion + Chromium.

In [None]:
import json

REPO_DIR = "/content/social-video-engine"
CONFIG_FILE = f"{REPO_DIR}/render-config.json"
VIDEO_OUTPUT = f"{REPO_DIR}/out/video-no-audio.mp4"

# Write config for Remotion (strip TTS-specific fields)
remotion_config = {
    "theme": VIDEO_CONFIG["theme"],
    "scenes": [
        {"type": s["type"], "duration": s["duration"], "props": s["props"]}
        for s in VIDEO_CONFIG["scenes"]
    ]
}

with open(CONFIG_FILE, "w") as f:
    json.dump(remotion_config, f, indent=2)

print(f"📝 Config written to {CONFIG_FILE}")
print(f"🎬 Rendering {sum(s['duration'] for s in remotion_config['scenes'])} frames...\n")

# Render with Remotion
!cd {REPO_DIR} && node render.mjs --config render-config.json --output {VIDEO_OUTPUT}

import os
if os.path.exists(VIDEO_OUTPUT):
    size_mb = os.path.getsize(VIDEO_OUTPUT) / 1e6
    print(f"\n✅ Video rendered: {VIDEO_OUTPUT} ({size_mb:.1f} MB)")
else:
    print("\n❌ Render failed. Check the output above.")

## 7️⃣ Merge Audio + Video
Combines the Remotion video with the Qwen3-TTS voiceover using FFmpeg.

In [None]:
import subprocess

AUDIO_DIR = "/content/social-video-engine/public/audio"
FINAL_OUTPUT = "/content/social-video-engine/out/final-video.mp4"
FPS = 30

# Step 1: Concatenate all scene audio with proper gaps
print("🔊 Building audio track...")

# Fallback sample rate for scenes without audio (assumes 24 kHz output).
# It is overwritten by the rate of each audio file that is read.
sr = 24000

# Create silence-padded audio segments matching scene durations
audio_segments = []
for i, scene in enumerate(VIDEO_CONFIG["scenes"]):
    scene_audio = f"{AUDIO_DIR}/scene_{i:03d}.wav"
    scene_dur = scene["duration"] / FPS  # target duration in seconds

    if os.path.exists(scene_audio):
        # Read audio
        data, sr = sf.read(scene_audio)
        audio_dur = len(data) / sr

        # Pad with silence (or trim) to match the scene duration
        target_samples = int(scene_dur * sr)
        if len(data) < target_samples:
            padding = np.zeros(target_samples - len(data), dtype=data.dtype)
            data = np.concatenate([data, padding])
        else:
            data = data[:target_samples]

        audio_segments.append(data)
        print(f"  Scene {i+1}: {audio_dur:.1f}s audio → padded to {scene_dur:.1f}s")
    else:
        # No audio: pure silence at the current sample rate
        silence = np.zeros(int(scene_dur * sr), dtype=np.float32)
        audio_segments.append(silence)
        print(f"  Scene {i+1}: silence ({scene_dur:.1f}s)")

# Concatenate all
full_audio = np.concatenate(audio_segments)
full_audio_path = f"{AUDIO_DIR}/full_narration.wav"
sf.write(full_audio_path, full_audio, sr)
print(f"\n  Full audio: {len(full_audio)/sr:.1f}s → {full_audio_path}")

# Step 2: Merge video + audio with FFmpeg
print("\n🎬 Merging video + audio...")
cmd = [
    "ffmpeg", "-y",
    "-i", VIDEO_OUTPUT,
    "-i", full_audio_path,
    "-c:v", "copy",
    "-c:a", "aac",
    "-b:a", "192k",
    "-shortest",
    FINAL_OUTPUT
]
result = subprocess.run(cmd, capture_output=True, text=True)

# Check the exit code too: a file left over from an earlier run
# would otherwise be reported as a success.
if result.returncode == 0 and os.path.exists(FINAL_OUTPUT):
    size_mb = os.path.getsize(FINAL_OUTPUT) / 1e6
    print(f"\n✅ Final video: {FINAL_OUTPUT} ({size_mb:.1f} MB)")
else:
    print(f"\n❌ FFmpeg failed:\n{result.stderr}")

## 8️⃣ Preview & Download

In [None]:
from IPython.display import HTML
from base64 import b64encode

# Preview in notebook
if os.path.exists(FINAL_OUTPUT):
    with open(FINAL_OUTPUT, 'rb') as f:
        mp4 = f.read()
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
    display(HTML(f"""
    <video width="360" height="640" controls autoplay>
        <source src="{data_url}" type="video/mp4">
    </video>
    """))
else:
    print("No video to preview")

In [None]:
# Download final video
from google.colab import files

if os.path.exists(FINAL_OUTPUT):
    files.download(FINAL_OUTPUT)
    print("📥 Downloading...")
else:
    print("No video to download")

---

## 🔄 Quick Re-render
Edit `VIDEO_CONFIG` in section 3, then rerun sections 4 → 8.

## 💡 Tips
- **Change theme**: Edit `VIDEO_CONFIG["theme"]`; try `ocean`, `sunset`, `forest`, `noir`, `fire`
- **Change voice**: Swap `speaker` in any scene's `tts` config
- **Add scenes**: Add more entries to the `scenes` list
- **Mood control**: The `instruct` field controls speaking style
- **Voice clone**: Switch to the `Base` model and use `generate_voice_clone()` with a 3-second audio sample
- **Batch videos**: Loop over multiple `VIDEO_CONFIG`s in a `for` loop
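
The batch tip can be sketched as a loop that hands each config to a render function and collects the output paths. `render_video` here is a hypothetical callable standing in for sections 4 through 7 of this notebook; it is not defined in the repo, and `render_batch` is likewise illustrative.

```python
def render_batch(configs, render_video, out_dir="out"):
    """Render each named config and return a {name: output_path} mapping.

    `configs` maps a video name to a VIDEO_CONFIG-style dict.
    `render_video(config, output_path)` is a hypothetical callable that runs
    the TTS, render, and merge steps for one config.
    """
    outputs = {}
    for name, config in configs.items():
        output_path = f"{out_dir}/{name}.mp4"
        try:
            render_video(config, output_path)
            outputs[name] = output_path
        except Exception as exc:
            # Keep going so one bad config does not stop the whole batch.
            print(f"{name}: failed ({exc})")
    return outputs
```

Giving every video its own output path avoids one render overwriting another, which would happen if each run wrote to the same `final-video.mp4`.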

## 🎤 All Speakers
| Speaker | Voice | Native |
|---------|-------|--------|
| Ryan | Dynamic male | English |
| Aiden | Sunny American male | English |
| Vivian | Bright young female | Chinese |
| Serena | Warm gentle female | Chinese |
| Uncle_Fu | Seasoned male, low | Chinese |
| Dylan | Youthful Beijing male | Chinese |
| Eric | Lively Chengdu male | Chinese |
| Ono_Anna | Playful female | Japanese |
| Sohee | Warm female | Korean |