# Video Generation Pipeline

This notebook generates a video from a given topic or text using Ollama for scripting, Stable Diffusion (Diffusers) for images, Edge-TTS for audio, and MoviePy for assembly.

In [24]:
import torch
print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available())


2.5.1+cu121
12.1
True


In [25]:
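# Imports used by the cells below (MoviePy 2.x import style, matching the with_* calls used later)
import json
import os
import textwrap

import edge_tts
import requests
from PIL import Image, ImageDraw, ImageFont
from diffusers import StableDiffusionPipeline, LCMScheduler
from moviepy import AudioFileClip, ImageClip, VideoFileClip, concatenate_videoclips

# Configuration
OLLAMA_API_URL = "http://localhost:11434/api/generate"
# Any model installed locally in Ollama works here (for example 'llama3')
OLLAMA_MODEL = "phi3:mini"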

OUTPUT_DIR = "output"
SCENE_DIR = os.path.join(OUTPUT_DIR, "scenes")
AUDIO_DIR = os.path.join(OUTPUT_DIR, "audio")
FINAL_VIDEO_DIR = os.path.join(OUTPUT_DIR, "video")
SCRIPTS_DIR = os.path.join(OUTPUT_DIR, "scripts")

os.makedirs(SCENE_DIR, exist_ok=True)
os.makedirs(AUDIO_DIR, exist_ok=True)
os.makedirs(FINAL_VIDEO_DIR, exist_ok=True)
os.makedirs(SCRIPTS_DIR, exist_ok=True)

## Step 0: Initialize Stable Diffusion Model

In [26]:
# ===== SETTINGS =====
model_id = "runwayml/stable-diffusion-v1-5"
lora_id = "latent-consistency/lcm-lora-sdv1-5"

device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Loading Stable Diffusion on {device}...")

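# ===== LOAD PIPELINE =====
pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    safety_checker=None  # disables the built-in image filter; see the license warning in the output
)

# Enable memory optimizations
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
if device == "cuda":
    # CPU offload manages device placement itself, so no explicit .to("cuda") is needed
    pipe.enable_model_cpu_offload()
else:
    pipe = pipe.to(device)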

# ===== LOAD LCM LoRA =====
pipe.load_lora_weights(lora_id)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

def generate_image_sd(prompt_text, output_path):
    print(f"Generating image for: {prompt_text}...")
    prompt = f"""
    simple flat illustration of {prompt_text},
    minimal design,
    clean white background,
    educational graphic,
    vector style,
    no text
    """
    
    try:
        image = pipe(
            prompt=prompt,
            num_inference_steps=6,      # LCM-LoRA needs only a few steps, which keeps generation fast
            guidance_scale=1.5,         # LCM-LoRA works best with low guidance (about 1.0 to 2.0)
            height=512,
            width=512
        ).images[0]
        
        image.save(output_path)
        return output_path
    except Exception as e:
        print(f"Error generating image: {e}")
        return None

Loading Stable Diffusion on cuda...


Loading weights: 100%|██████████| 196/196 [00:00<00:00, 313.06it/s, Materializing param=text_model.final_layer_norm.weight]
CLIPTextModel LOAD REPORT from: C:\Users\navee\.cache\huggingface\hub\models--runwayml--stable-diffusion-v1-5\snapshots\451f4fe16113bff5a5d2269ed5ad43b0592e9a14\text_encoder
Key                                | Status     |  | 
-----------------------------------+------------+--+-
text_model.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Loading pipeline components...: 100%|██████████| 6/6 [00:08<00:00,  1.36s/it]
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results 

## Step 1: Script Generation with Ollama

In [27]:
def generate_script(topic):
    """
    Generate a fresh video script from Ollama based on the given topic.
    Always creates NEW content - never uses cached/old scripts.
    """
    prompt = f"""
    Convert this topic into a structured video plan.
    Topic: {topic}
    Return JSON only:
    {{
      "scenes": [
        {{
          "title": "",
          "bullets": [],
          "narration": "",
          "image_prompt": "visual description for illustration"
        }}
      ]
    }}
    """
    
    print(f"🎬 Generating NEW script for: {topic}...")
    print("⚙️  Connecting to Ollama...")
    
    # Generate fresh content from Ollama (REQUIRED)
    try:
        response = requests.post(OLLAMA_API_URL, json={
            "model": OLLAMA_MODEL,
            "prompt": prompt,
            "format": "json",
            "stream": False
        }, timeout=300)
        response.raise_for_status()
        script_data = json.loads(response.json()['response'])
        
        # Create a safe filename from the topic
        safe_topic = "".join(c if c.isalnum() or c in (' ', '-', '_') else '_' for c in topic)
        safe_topic = safe_topic.replace(' ', '_')[:50]  # Limit length
        
        # Save the fresh plan to the scripts directory under a topic-based, timestamped name
        from datetime import datetime  # local import keeps this cell self-contained
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        plan_filename = f"{safe_topic}_{timestamp}.json"
        plan_file = os.path.join(SCRIPTS_DIR, plan_filename)
        
        with open(plan_file, "w", encoding="utf-8") as f:
            json.dump(script_data, f, indent=2)
        
        print(f"✅ Fresh script generated and saved to: {plan_file}")
        return script_data
        
    except requests.exceptions.RequestException as e:
        print(f"❌ Cannot connect to Ollama: {e}")
        print("📋 Please ensure Ollama is running:")
        print("   1. Open terminal and run: ollama serve")
        print("   2. Or ensure Ollama service is active")
        return None
        
    except json.JSONDecodeError as e:
        print(f"❌ Error decoding JSON from Ollama: {e}")
        print("   The model may have returned invalid JSON format")
        return None
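
Small models such as `phi3:mini` do not always follow the requested JSON shape exactly, and the later steps assume that `bullets` is a list of strings and that `narration` is present. The sketch below shows one possible way to clean the parsed script before using it. The function name `normalize_script` and its rules (drop scenes without narration, wrap a single bullet string in a list) are assumptions for illustration, not part of the original notebook.

```python
from typing import Any


def normalize_script(script_data: Any) -> list[dict]:
    """Return a cleaned list of scenes; drop entries that cannot be narrated.

    Hypothetical helper: the cleaning rules are one reasonable choice, not the only one.
    """
    if not isinstance(script_data, dict):
        return []
    raw_scenes = script_data.get("scenes")
    if not isinstance(raw_scenes, list):
        return []

    scenes = []
    for raw in raw_scenes:
        if not isinstance(raw, dict):
            continue  # skip stray strings or numbers in the scene list
        narration = str(raw.get("narration") or "").strip()
        if not narration:
            continue  # a scene without narration cannot produce audio
        bullets = raw.get("bullets")
        if isinstance(bullets, str):
            bullets = [bullets]  # a single string instead of a list
        elif not isinstance(bullets, list):
            bullets = []
        scenes.append({
            "title": str(raw.get("title") or f"Scene {len(scenes) + 1}").strip(),
            "bullets": [str(b).strip() for b in bullets
                        if b is not None and str(b).strip()],
            "narration": narration,
            "image_prompt": str(raw.get("image_prompt") or "").strip(),
        })
    return scenes
```

A caller could then use `scenes = normalize_script(script_data)` in place of `script_data.get('scenes', [])` and stop early when the list is empty.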

## Step 2: Slide Generation (PIL)

In [28]:
def create_slide(scene, index, image_path=None):
    width, height = 1280, 720
    img = Image.new('RGB', (width, height), color='white')
    draw = ImageDraw.Draw(img)
    
    # Fonts
    try:
        title_font = ImageFont.truetype("arial.ttf", 60)
        text_font = ImageFont.truetype("arial.ttf", 35)
    except OSError:
        # Arial is not available on this system; fall back to PIL's built-in font
        title_font = ImageFont.load_default()
        text_font = ImageFont.load_default()
    
    # Layout Configuration
    margin = 50
    content_width = width - (2 * margin)
    
    # If image exists, we use split layout: Text (Left) | Image (Right)
    if image_path and os.path.exists(image_path):
        try:
            sd_img = Image.open(image_path)
            # Scale the image to a height of 600 px, keeping its aspect ratio,
            # then center it in the right half of the slide (which starts at x = 640)
            target_ih = 600
            aspect = sd_img.width / sd_img.height
            target_iw = int(target_ih * aspect)
            
            sd_img = sd_img.resize((target_iw, target_ih), Image.Resampling.LANCZOS)
            
            # Position on right side
            img_x = 640 + (640 - target_iw) // 2
            img_y = (720 - target_ih) // 2
            
            img.paste(sd_img, (img_x, img_y))
            
            # Constrain text to left half
            content_width = 580 # 640 - margin - padding
        except Exception as e:
            print(f"Error placing image: {e}")

    # Draw Title
    title_text = scene.get('title', f"Scene {index}")
    # Wrap title if needed
    title_lines = textwrap.wrap(title_text, width=20 if content_width < 600 else 40)
    ty = 50
    for line in title_lines:
        draw.text((margin, ty), line, fill='black', font=title_font)
        ty += 70
    
    # Draw Bullets
    y = ty + 30
    bullets = scene.get('bullets', [])
    for bullet in bullets:
        lines = textwrap.wrap(bullet, width=30 if content_width < 600 else 50)
        for line in lines:
            draw.text((margin + 30, y), f"• {line}", fill='black', font=text_font)
            y += 45
            
    filename = os.path.join(SCENE_DIR, f"scene_{index}.png")
    img.save(filename)
    return filename
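
The layout in `create_slide` scales the image to a fixed height of 600 px. That works for the 512x512 images this notebook generates, but a wide image (for example 1024x512) would become 1200 px wide and spill into the text column. The sketch below shows one possible way to compute a size that fits the right half in both dimensions. The helper names `fit_within` and `place_in_right_half` are hypothetical, and the default canvas values mirror the 1280x720 slide above.

```python
def fit_within(src_w: int, src_h: int, box_w: int, box_h: int) -> tuple[int, int]:
    """Scale (src_w, src_h) to the largest size that fits the box, keeping aspect ratio."""
    if min(src_w, src_h, box_w, box_h) <= 0:
        raise ValueError("all dimensions must be positive")
    scale = min(box_w / src_w, box_h / src_h)
    return max(1, int(src_w * scale)), max(1, int(src_h * scale))


def place_in_right_half(src_w, src_h, canvas_w=1280, canvas_h=720, max_h=600):
    """Return (width, height, x, y) for an image centered in the right half of the canvas."""
    half_w = canvas_w // 2
    new_w, new_h = fit_within(src_w, src_h, half_w, max_h)
    x = half_w + (half_w - new_w) // 2   # center horizontally inside the right half
    y = (canvas_h - new_h) // 2          # center vertically on the slide
    return new_w, new_h, x, y


# Square input reproduces the original placement; wide input is limited by width instead
print(place_in_right_half(512, 512))    # (600, 600, 660, 60)
print(place_in_right_half(1024, 512))   # (640, 320, 640, 200)
```

The returned width and height could be passed to `sd_img.resize(...)` and the offsets to `img.paste(...)`.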

## Step 3: Audio Generation (Edge-TTS)

In [29]:
async def generate_audio(text, index):
    voice = "en-US-ChristopherNeural"
    output_file = os.path.join(AUDIO_DIR, f"scene_{index}.mp3")
    
    print(f"Generating audio for scene {index}...")
    try:
        communicate = edge_tts.Communicate(text, voice)
        await communicate.save(output_file)
        return output_file
    except Exception as e:
        print(f"Error generating audio: {e}")
        return None
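
Edge-TTS calls an online service, so a single request can fail for transient network reasons, and `generate_audio` reports failure by returning `None`. The sketch below shows one possible retry wrapper with exponential backoff. The name `retry_async`, the attempt count, and the delay values are assumptions chosen for illustration.

```python
import asyncio


async def retry_async(make_call, attempts=3, base_delay=1.0):
    """Await make_call() until it returns a non-None value or attempts run out.

    make_call is a zero-argument callable that returns a fresh coroutine each time.
    """
    for attempt in range(1, attempts + 1):
        result = await make_call()
        if result is not None:
            return result
        if attempt < attempts:
            # Wait 1x, 2x, 4x ... the base delay between attempts
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    return None
```

Inside `main`, the call could then look like `audio_path = await retry_async(lambda: generate_audio(narration, i))`. The lambda matters because a coroutine object can only be awaited once, so each attempt needs a new one.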

## Step 4: Video Assembly (MoviePy)

In [30]:
def create_video_clip(image_path, audio_path, index):
    output_path = os.path.join(FINAL_VIDEO_DIR, f"scene_{index}.mp4")
    
    print(f"Creating video clip for scene {index} using MoviePy...")
    try:
        # Load audio first to get duration
        audio_clip = AudioFileClip(audio_path)
        
        # Create video clip with proper FPS setting before adding audio
        video_clip = (ImageClip(image_path)
                     .with_duration(audio_clip.duration)
                     .with_fps(24)  # Must set FPS before adding audio
                     .with_audio(audio_clip))
        
        # Write with explicit audio settings for better compatibility
        video_clip.write_videofile(
            output_path, 
            fps=24, 
            codec='libx264',
            audio_codec='aac',
            audio_bitrate='192k',
            preset='medium',
            threads=4,
            logger=None  # Suppress verbose output
        )
        
        # Close clips to ensure file writes are complete
        video_clip.close()
        audio_clip.close()
        
        # Small delay to ensure file is fully written
        import time
        time.sleep(0.5)
        
        # Verify the output file has audio
        test_clip = VideoFileClip(output_path)
        has_audio = test_clip.audio is not None
        test_clip.close()
        
        if has_audio:
            print(f"✓ Audio verified in scene {index}")
        else:
            print(f"⚠️  Warning: Audio missing from scene {index}")
        
        return output_path
    except Exception as e:
        print(f"MoviePy failed for scene {index}: {e}")
        import traceback
        traceback.print_exc()
        return None

## Step 5: Merge All Scenes

In [31]:
def merge_scenes(video_files):
    output_filename = os.path.join(FINAL_VIDEO_DIR, "final_video.mp4")
    
    print("Merging all scenes into final video...")
    try:
        clips = []
        for f in video_files:
            print(f"Loading {os.path.basename(f)}...")
            clip = VideoFileClip(f)
            
            # Verify audio is present
            if clip.audio is None:
                print(f"⚠️  Warning: No audio in {f}")
            else:
                print(f"✓ Audio loaded: {clip.audio.duration:.2f}s")
            
            clips.append(clip)
        
        print(f"\nConcatenating {len(clips)} clips...")
        # Use default method (chain) which preserves audio better
        final_clip = concatenate_videoclips(clips)
        
        print("Writing final video with audio...")
        final_clip.write_videofile(
            output_filename, 
            fps=24, 
            codec='libx264',
            audio_codec='aac',
            audio_bitrate='192k',
            preset='medium',
            threads=4,
            logger=None  # Suppress verbose output
        )
        
        # Close all clips to free resources
        final_clip.close()
        for clip in clips:
            clip.close()
        
        print(f"✅ Done! Output: {output_filename}")
        return output_filename
    except Exception as e:
        print(f"❌ Error merging scenes: {e}")
        import traceback
        traceback.print_exc()
        return None

## Execution Pipeline

In [32]:
async def main(topic):
    """
    Main video generation pipeline.
    Always generates fresh content based on the given topic.
    """
    # 1. Generate Script (FRESH - never uses cache)
    script_data = generate_script(topic)
    if not script_data:
        print("❌ Cannot proceed without a valid script. Exiting...")
        return
    
    scenes = script_data.get('scenes', [])
    if not scenes:
        print("❌ No scenes found in script. Exiting...")
        return
    
    video_clips = []
    
    print(f"\n🎥 Starting video generation for {len(scenes)} scenes...\n")
    
    for i, scene in enumerate(scenes, 1):
        title = scene.get('title')
        print(f"📍 Processing Scene {i}/{len(scenes)}: {title}")
        
        # 1.5 Generate Image (SD)
        image_prompt = scene.get('image_prompt')
        generated_img_path = None
        if image_prompt:
            # Save the raw SD output next to the rendered slides
            raw_img_path = os.path.join(SCENE_DIR, f"scene_{i}_raw.png")
            generated_img_path = generate_image_sd(image_prompt, raw_img_path)
        
        if not generated_img_path:
            # Generation failed or no prompt was given; create_slide handles None
            print("⚠️  No image generated, using text-only layout.")

        # 2. Create Slide with Text + Image
        img_path = create_slide(scene, i, image_path=generated_img_path)
        
        # 3. Narration -> Audio
        narration = scene.get('narration', '')
        if not narration:
            print(f"⚠️  Warning: No narration for scene {i}, skipping...")
            continue
            
        audio_path = await generate_audio(narration, i)
        if not audio_path:
            continue
            
        # 4. Combine -> Video Clip
        clip_path = create_video_clip(img_path, audio_path, i)
        if clip_path:
            video_clips.append(clip_path)
            print(f"✅ Scene {i} completed\n")
            
    # 5. Merge All Scenes
    if video_clips:
        print(f"\n🎬 Merging {len(video_clips)} scenes into final video...")
        final_path = merge_scenes(video_clips)
        if final_path:
            print(f"\n🎉 SUCCESS! Your video is ready: {final_path}")
    else:
        print("❌ No video clips were created. Cannot generate final video.")

## Example Usage

In [33]:
await main("Model context protocol")

🎬 Generating NEW script for: Model context protocol...
⚙️  Connecting to Ollama...
✅ Fresh script generated and saved to: output\scripts\Model_context_protocol_20260215_211021.json

🎥 Starting video generation for 3 scenes...

📍 Processing Scene 1/3: Understanding Model Context Protocol
Generating image for: A flowchart illustrating the key components involved when deploying a ML model using MCP...

100%|██████████| 6/6 [00:02<00:00,  2.21it/s]

Generating audio for scene 1...
Creating video clip for scene 1 using MoviePy...
✓ Audio verified in scene 1
✅ Scene 1 completed

📍 Processing Scene 2/3: Input, Output and Environment Interaction
Generating image for: Side-by-side visual representation of input data capture process versus output interpretation in MCP...

100%|██████████| 6/6 [00:02<00:00,  2.25it/s]

Generating audio for scene 2...
Creating video clip for scene 2 using MoviePy...
✓ Audio verified in scene 2
✅ Scene 2 completed

📍 Processing Scene 3/3: Handling Environment Interaction
Generating image for: Visual depiction of environment-model interaction using MCP framework...

100%|██████████| 6/6 [00:02<00:00,  2.35it/s]

Generating audio for scene 3...
Creating video clip for scene 3 using MoviePy...
✓ Audio verified in scene 3
✅ Scene 3 completed


🎬 Merging 3 scenes into final video...
Merging all scenes into final video...
Loading scene_1.mp4...
✓ Audio loaded: 15.10s
Loading scene_2.mp4...
✓ Audio loaded: 12.43s
Loading scene_3.mp4...
✓ Audio loaded: 13.51s

Concatenating 3 clips...
Writing final video with audio...

✅ Done! Output: output\video\final_video.mp4

🎉 SUCCESS! Your video is ready: output\video\final_video.mp4
