# Zero-Shot Attention-Keyword Alignment Manipulation (GenAI v3)

**No training required!** This notebook demonstrates how to manipulate attention-keyword alignment using pre-trained generative models.

## Key Features
- ✅ **Zero training** - uses pre-trained models (InstructPix2Pix, Inpainting)
- ✅ **Fast** - roughly 2-3 seconds per frame (hardware-dependent)
- ✅ **Simple** - no complex training loops or loss functions
- ✅ **Flexible** - easy to adjust via text instructions

## Workflow
1. Load pre-trained models
2. Load scene images and keyword masks
3. Generate variants by editing frames
4. Apply temporal smoothing
5. Save results

## Step 1: Setup

In [None]:
import os
import sys
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from PIL import Image
from tqdm.auto import tqdm

# Add parent directory to path
sys.path.insert(0, str(Path.cwd().parent))

# Import GenAI v3 modules
from GenAI_v3.zero_shot_manipulator import (
    ZeroShotAlignmentManipulator,
    load_scenes_from_paths,
    load_masks_from_paths,
    save_scenes,
)
from GenAI_v3.temporal_smoother import (
    TemporalSmoother,
    apply_temporal_consistency_filter,
)

# Check GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## Step 2: Load Data

We use the same data structure as before:
- Scene images: `data/video_scene_cuts/{video_id}/{video_id}-Scene-0xx-01.jpg`
- Keyword masks: `data/keyword_masks/{video_id}/scene_{x}.png`
- Valid scenes: `data/valid_scenes.csv`
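
The cells in this notebook index `scenes_df` by several column names (`video_id`, `scene_number`, `keyword`, `scene_image_path`, `keyword_mask_path`). A minimal sketch of a check that reports missing columns up front is shown here; the helper name `missing_columns` is hypothetical and not part of the GenAI v3 modules.

```python
# Column names taken from the cells in this notebook that index scenes_df.
REQUIRED_COLUMNS = (
    "video_id",
    "scene_number",
    "keyword",
    "scene_image_path",
    "keyword_mask_path",
)


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in declaration order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Example with a plain list of header names
print(missing_columns(["video_id", "keyword", "scene_image_path"]))
# ['scene_number', 'keyword_mask_path']
```

Once `scenes_df` is loaded, `missing_columns(scenes_df.columns)` should return an empty list before the later steps are run.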

In [None]:
# Load valid scenes
VALID_SCENES_FILE = '../data/valid_scenes.csv'

if not os.path.exists(VALID_SCENES_FILE):
    # Stop here: every later cell depends on scenes_df
    raise FileNotFoundError(
        f"{VALID_SCENES_FILE} not found. Please run data_validation.ipynb first!"
    )

scenes_df = pd.read_csv(VALID_SCENES_FILE)
print(f"✓ Loaded {len(scenes_df)} valid scenes from {scenes_df['video_id'].nunique()} videos")
print("\nFirst few rows:")
display(scenes_df.head())

## Step 3: Initialize Models

Choose your method:
- **InstructPix2Pix**: Simpler, text-guided editing
- **Inpainting**: More precise, edits only keyword region

In [None]:
# Choose method: "instruct_pix2pix" or "inpainting"
METHOD = "instruct_pix2pix"  # Recommended for simplicity

# Initialize manipulator
manipulator = ZeroShotAlignmentManipulator(
    method=METHOD,
    device=device,
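    # FP16 for speed on GPU; fall back to FP32 on CPU, where half precision is poorly supported
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,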
)

# Initialize temporal smoother
smoother = TemporalSmoother(method="simple_blend")  # Fast and effective

print("\n✓ Models initialized!")
print(f"  Method: {METHOD}")
print(f"  Temporal smoothing: simple_blend")
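
The internals of `TemporalSmoother` are not shown in this notebook. One plausible reading of `simple_blend` with a `blend_strength` parameter is a per-pixel linear mix of the edited and original frame. The sketch below illustrates that idea only; the function name and the assumption that `blend_strength` is the weight of the edited frame are mine, not confirmed by the GenAI v3 code.

```python
import numpy as np


def simple_blend(original, edited, blend_strength=0.7):
    """Linear blend: blend_strength * edited + (1 - blend_strength) * original.

    Assumption: blend_strength is the weight given to the edited frame.
    """
    if not 0.0 <= blend_strength <= 1.0:
        raise ValueError("blend_strength must be in [0, 1]")
    orig = np.asarray(original, dtype=np.float32)
    edit = np.asarray(edited, dtype=np.float32)
    if orig.shape != edit.shape:
        raise ValueError("frames must have the same shape")
    mixed = blend_strength * edit + (1.0 - blend_strength) * orig
    # Round and clip back to the 8-bit range used for images
    return np.clip(np.rint(mixed), 0, 255).astype(np.uint8)


# Tiny 1x2 RGB frames: black original, uniform gray edit
original = np.zeros((1, 2, 3), dtype=np.uint8)
edited = np.full((1, 2, 3), 200, dtype=np.uint8)
blended = simple_blend(original, edited, blend_strength=0.7)
print(blended[0, 0])
```

Under this reading, a value of 0.7 keeps 70% of the edit and 30% of the original pixel values, which softens abrupt changes between edited and unedited scenes.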

## Step 4: Test on Single Video

Let's test the pipeline on a single video first.

In [None]:
# Select a test video
test_video_id = scenes_df['video_id'].iloc[0]
print(f"Test video ID: {test_video_id}")

# Get all scenes for this video
video_scenes_df = scenes_df[scenes_df['video_id'] == test_video_id].sort_values('scene_number')
print(f"Number of scenes: {len(video_scenes_df)}")

# Get keyword
keyword = video_scenes_df.iloc[0]['keyword']
print(f"Keyword: {keyword}")

# Load scenes and masks
scene_paths = video_scenes_df['scene_image_path'].tolist()
mask_paths = video_scenes_df['keyword_mask_path'].tolist()

scenes = load_scenes_from_paths(scene_paths)
masks = load_masks_from_paths(mask_paths) if METHOD == "inpainting" else None

print(f"✓ Loaded {len(scenes)} scenes")

### Visualize Original Scenes

In [None]:
# Show first 6 scenes
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for i, ax in enumerate(axes):
    ax.axis('off')  # also hides unused subplots when there are fewer than 6 scenes
    if i < len(scenes):
        ax.imshow(scenes[i])
        ax.set_title(f"Scene {i+1}")

plt.tight_layout()
plt.show()

## Step 5: Create Variants

Generate all 7 experimental variants:
1. baseline
2. early_boost
3. middle_boost
4. late_boost
5. full_boost
6. reduction
7. placebo
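
With InstructPix2Pix, each variant ultimately comes down to a text instruction applied to the selected frames. The actual instructions live inside `ZeroShotAlignmentManipulator` and are not shown here, so the wording below is purely hypothetical; the sketch only illustrates how a variant name and keyword could be turned into an edit instruction.

```python
def build_instruction(keyword, variant_type):
    """Return a hypothetical edit instruction, or None when no keyword edit applies."""
    if variant_type.endswith("_boost"):
        return f"make the {keyword} more prominent and eye-catching"
    if variant_type == "reduction":
        return f"make the {keyword} less noticeable"
    # baseline, placebo, or an unknown variant: this notebook does not
    # specify an instruction for these, so none is generated here
    return None


print(build_instruction("coffee cup", "early_boost"))
# make the coffee cup more prominent and eye-catching
```

Adjusting behavior then amounts to editing the instruction strings rather than retraining anything.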

In [None]:
# Create output directory
output_dir = Path(f'../outputs/genai_v3/{test_video_id}')
output_dir.mkdir(parents=True, exist_ok=True)

# Variant types
VARIANT_TYPES = [
    "baseline",
    "early_boost",
    "middle_boost",
    "late_boost",
    "full_boost",
    "reduction",
    "placebo",
]

# Generate each variant
variants = {}

for variant_type in VARIANT_TYPES:
    print(f"\n{'='*60}")
    print(f"Creating variant: {variant_type}")
    print(f"{'='*60}")
    
    # Create variant (edit frames)
    edited_scenes = manipulator.create_variant(
        scenes=scenes,
        keyword=keyword,
        variant_type=variant_type,
        keyword_masks=masks,
        num_inference_steps=20,  # Adjust for speed/quality trade-off
    )
    
    # Apply temporal smoothing (if not baseline/placebo)
    if variant_type not in ["baseline", "placebo"]:
        print("Applying temporal smoothing...")
        
        # Determine which indices were edited.
        # This must mirror the split used inside create_variant.
        num_scenes = len(scenes)
        third = num_scenes // 3
        
        if variant_type == "early_boost":
            edited_indices = set(range(third))
        elif variant_type in ("middle_boost", "reduction"):
            edited_indices = set(range(third, 2*third))
        elif variant_type == "late_boost":
            edited_indices = set(range(2*third, num_scenes))
        elif variant_type == "full_boost":
            edited_indices = set(range(num_scenes))
        else:
            edited_indices = set()
        
        smoothed_scenes = smoother.smooth(
            original_frames=scenes,
            edited_frames=edited_scenes,
            edited_indices=edited_indices,
            blend_strength=0.7,
        )
    else:
        smoothed_scenes = edited_scenes
    
    # Save variant
    variant_dir = output_dir / variant_type
    save_scenes(smoothed_scenes, variant_dir, prefix="scene")
    
    variants[variant_type] = smoothed_scenes

print(f"\n✓ All variants created!")
print(f"Output directory: {output_dir}")
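
The if/elif chain in the cell above can be factored into a small helper so the same split is reusable elsewhere (for example in the multi-video loop). This is a sketch that reproduces the thirds-based split from this notebook; the function name `edited_indices_for` is hypothetical, and it assumes `create_variant` uses the same split internally.

```python
def edited_indices_for(variant_type, num_scenes):
    """Return the set of scene indices assumed to be edited for a variant."""
    third = num_scenes // 3
    segments = {
        "early_boost": range(third),
        "middle_boost": range(third, 2 * third),
        "reduction": range(third, 2 * third),
        "late_boost": range(2 * third, num_scenes),
        "full_boost": range(num_scenes),
    }
    # baseline, placebo, and unknown names map to "nothing edited"
    return set(segments.get(variant_type, range(0)))


print(sorted(edited_indices_for("late_boost", 10)))
# [6, 7, 8, 9]
```

Because of the integer division, the late segment absorbs any remainder, and videos with fewer than 3 scenes get empty early and middle segments.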

## Step 6: Visualize Results

Compare variants side-by-side.

In [None]:
# Compare a specific scene across variants
scene_idx = len(scenes) // 2  # Middle scene

fig, axes = plt.subplots(2, 4, figsize=(20, 10))
axes = axes.flatten()

for i, variant_type in enumerate(VARIANT_TYPES):
    axes[i].imshow(variants[variant_type][scene_idx])
    axes[i].set_title(f"{variant_type}\n(Scene {scene_idx+1})", fontsize=14, fontweight='bold')
    axes[i].axis('off')

# Hide last empty subplot
axes[-1].axis('off')

plt.suptitle(f"Comparison: {keyword}", fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

### Compare Temporal Evolution

In [None]:
# Show how a variant changes over time
variant_to_show = "middle_boost"

# Select evenly spaced scenes (capped so short videos do not repeat a scene)
num_to_show = min(6, len(scenes))
indices = np.linspace(0, len(scenes)-1, num_to_show, dtype=int)

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

for ax in axes:
    ax.axis('off')  # hide any unused subplots

for i, idx in enumerate(indices):
    axes[i].imshow(variants[variant_to_show][idx])
    axes[i].set_title(f"Scene {idx+1}/{len(scenes)}", fontsize=12, fontweight='bold')

plt.suptitle(f"Temporal Evolution: {variant_to_show}", fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

## Step 7: Process Multiple Videos

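Scale up to process multiple videos. The cell below is capped at the first 5 videos for testing; remove the slice to process all of them. Unlike Step 5, this loop saves the edited frames directly, without temporal smoothing.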

In [None]:
# Get unique video IDs
video_ids = scenes_df['video_id'].unique()[:5]  # Limit to first 5 for testing

print(f"Processing {len(video_ids)} videos...")

for video_id in tqdm(video_ids, desc="Videos"):
    print(f"\nProcessing video: {video_id}")
    
    # Get scenes for this video
    video_scenes_df = scenes_df[scenes_df['video_id'] == video_id].sort_values('scene_number')
    
    # Get keyword
    keyword = video_scenes_df.iloc[0]['keyword']
    
    # Load scenes and masks
    scene_paths = video_scenes_df['scene_image_path'].tolist()
    mask_paths = video_scenes_df['keyword_mask_path'].tolist()
    
    scenes = load_scenes_from_paths(scene_paths)
    masks = load_masks_from_paths(mask_paths) if METHOD == "inpainting" else None
    
    # Create output directory
    output_dir = Path(f'../outputs/genai_v3/{video_id}')
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Generate variants
    for variant_type in VARIANT_TYPES:
        edited_scenes = manipulator.create_variant(
            scenes=scenes,
            keyword=keyword,
            variant_type=variant_type,
            keyword_masks=masks,
            num_inference_steps=20,
        )
        
        # Save
        variant_dir = output_dir / variant_type
        save_scenes(edited_scenes, variant_dir, prefix="scene")

print("\n✓ All videos processed!")
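
Before running the loop above on a larger set, it can help to estimate the total runtime from the 2-3 seconds per frame quoted earlier. The sketch below is a rough upper bound: it assumes every frame of every variant goes through the model, which overstates the cost for variants that edit only a third of the scenes. The function name and the 30 scenes per video are hypothetical.

```python
def estimate_runtime_seconds(num_videos, scenes_per_video, num_variants=7,
                             seconds_per_frame=(2.0, 3.0)):
    """Upper-bound estimate: assumes every frame of every variant is run through the model."""
    frames = num_videos * scenes_per_video * num_variants
    low, high = seconds_per_frame
    return frames * low, frames * high


# Hypothetical workload: 5 videos with 30 scenes each
low, high = estimate_runtime_seconds(num_videos=5, scenes_per_video=30)
print(f"{low / 60:.1f} to {high / 60:.1f} minutes")
# 35.0 to 52.5 minutes
```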

## Step 8: Video Assembly (Optional)

Reassemble edited scenes into videos.

In [None]:
import cv2

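def assemble_video(scenes: list, output_path: str, fps: int = 30):
    """
    Assemble scenes into a video file.
    
    Args:
        scenes: List of PIL Images (frames are resized to the first scene's size)
        output_path: Output video path (.mp4)
        fps: Frames per second
    """
    if not scenes:
        raise ValueError("No scenes to assemble")
    
    # Get dimensions from first scene
    width, height = scenes[0].size
    
    # Create video writer
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    writer = cv2.VideoWriter(output_path, fourcc, fps, (width, height))
    if not writer.isOpened():
        raise RuntimeError(f"Could not open video writer for: {output_path}")
    
    # Write frames
    for scene in scenes:
        # Force 3-channel RGB, then convert PIL to CV2 format (RGB -> BGR)
        frame = cv2.cvtColor(np.array(scene.convert("RGB")), cv2.COLOR_RGB2BGR)
        # The writer expects every frame to match the size it was opened with
        if frame.shape[:2] != (height, width):
            frame = cv2.resize(frame, (width, height))
        writer.write(frame)
    
    writer.release()
    print(f"✓ Video saved to: {output_path}")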

# Example: assemble two variants
# Note: each scene is written as a single frame, so playback is very short at 30 fps
video_output_dir = Path('../outputs/genai_v3/videos')
video_output_dir.mkdir(parents=True, exist_ok=True)

for variant_type in ["baseline", "full_boost"]:
    output_path = str(video_output_dir / f"{test_video_id}_{variant_type}.mp4")
    assemble_video(variants[variant_type], output_path, fps=30)

print("\n✓ Videos assembled!")
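
Since `assemble_video` writes one frame per scene, a video built at 30 fps shows each scene for only 1/30 of a second. A simple workaround is to repeat each scene before writing. This is a sketch; the helper name `hold_frames` and the `seconds_per_scene` parameter are hypothetical, and real scene durations would ideally come from the original video's cut timings.

```python
def hold_frames(scenes, fps=30, seconds_per_scene=2.0):
    """Repeat each scene so it stays on screen for about seconds_per_scene."""
    repeats = max(1, round(fps * seconds_per_scene))
    return [scene for scene in scenes for _ in range(repeats)]


# Works on any list; strings stand in for PIL images here
frames = hold_frames(["a", "b"], fps=30, seconds_per_scene=0.1)
print(frames)
# ['a', 'a', 'a', 'b', 'b', 'b']
```

The result can be passed straight to the writer, for example `assemble_video(hold_frames(variants["full_boost"]), output_path, fps=30)`.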

## Summary

### What We Did
1. ✅ Loaded pre-trained models (no training!)
2. ✅ Loaded scene images and keyword masks
3. ✅ Generated 7 experimental variants
4. ✅ Applied temporal smoothing
5. ✅ Saved results

### Performance
- **Speed**: ~2-3 seconds per frame (hardware-dependent)
- **Quality**: Natural-looking edits (judged visually; no quality metric is computed in this notebook)
- **Training**: Zero!

### Next Steps
1. Deploy variants for A/B testing
2. Collect engagement metrics (CTR, CVR, watch time)
3. Analyze which variants perform best
4. Iterate on text instructions for better control
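
For the analysis step, one common way to compare click-through rates between two variants is a two-proportion z-test. The sketch below uses only the standard library; the function name and the counts are made up for illustration and are not results from this project.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in CTR between B and A."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    if std_err == 0:
        # Both rates are exactly 0 or exactly 1: no detectable difference
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts, for illustration only
z, p = two_proportion_z_test(clicks_a=100, views_a=1000, clicks_b=150, views_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With 7 variants compared against the baseline, a multiple-comparison correction (such as Bonferroni) would be worth applying before drawing conclusions.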

### Advantages Over ControlNet Training
- ✅ No training loop - immediate results
- ✅ No training-time GPU memory issues (inference still needs enough VRAM to hold the model)
- ✅ No debugging complex architectures
- ✅ Easy to adjust behavior via text
- ✅ Can switch methods instantly

### Trade-offs
- ⚠️ Slightly slower per-frame than trained ControlNet
- ⚠️ Less precise numerical control
- ⚠️ May edit unintended regions slightly (InstructPix2Pix)