# ControlNet Fine-Tuning and Inference for Video Ad Manipulation (Updated)

This notebook demonstrates:
1. **Data format requirements** - Working with `alignment_score.csv` and `keywords.csv`
2. **Data preprocessing** - Generating keyword masks with CLIPSeg
3. **Fine-tuning process** - Training the ControlNet adapter
4. **Inference** - Generating specifications for 7 experimental video variants

---

## Table of Contents
1. [Setup](#setup)
2. [Data Format Specification](#data-format)
3. [Preprocessing: Generate Keyword Masks](#preprocessing)
4. [Data Preparation](#data-prep)
5. [Model Fine-Tuning](#training)
6. [Inference: Generate 7 Variants](#inference)
7. [Visualization](#visualization)

---
## 1. Setup <a name="setup"></a>

In [None]:
import os
import sys
import json
import yaml
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from PIL import Image
from tqdm.auto import tqdm

# Add the project root (current working directory) to the path so `src` is importable
sys.path.insert(0, str(Path.cwd()))

# Import framework modules
from src.models import StableDiffusionControlNetWrapper
from src.training import ControlNetTrainer
from src.training.dataset_v2 import VideoSceneDataModule
from src.data_preparation import ControlTensorBuilder
from src.video_editing.experimental_variants_v2 import VideoVariantGenerator, visualize_variant_comparison

# Check GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

---
## 2. Data Format Specification <a name="data-format"></a>

### Current Directory Structure

```
data/
├── alignment_score.csv          # Pre-computed alignment scores
│   Columns:
│   - video id: Video identifier
│   - Scene Number: Scene index within video
│   - attention_proportion: Alignment score [0-1]
│   - start_time, end_time: Scene timing
│   - CTR_mean, CVR_mean: Engagement metrics
│   - contrast, brightness: Visual features
│   - industry: Product category
│
├── keywords.csv                 # Video keywords
│   Columns:
│   - _id: Video identifier
│   - keyword_list[0]: Product/keyword text
│
├── video_scene_cuts/            # RGB images per scene (not in repo)
│   ├── {video_id}/
│   │   ├── {video_id}-Scene-001-01.jpg
│   │   ├── {video_id}-Scene-002-01.jpg
│   │   └── ...
│
├── videos_tiktok/               # Original video files (not in repo)
│   └── {video_id}.mp4
│
└── keyword_masks/               # Generated by preprocessing (see below)
    ├── {video_id}/
    │   ├── scene_1.png
    │   ├── scene_2.png
    │   └── ...
```
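
Screenshots and masks follow different naming conventions in the listing above (`Scene-001-01` is zero-padded, `scene_1` is not). A minimal sketch of a path helper is shown below. `scene_paths` is a hypothetical name that is not part of the framework, and the sketch assumes the `-01` suffix is constant and that scene numbers are padded to three digits, as in the listing.

```python
from pathlib import Path

def scene_paths(video_id, scene_number, data_root="data"):
    """Return (screenshot_path, mask_path) for one scene, following the layout above."""
    root = Path(data_root)
    vid = str(video_id)
    n = int(scene_number)
    # Assumption: screenshots use a 3-digit scene index and a fixed "-01" suffix
    screenshot = root / "video_scene_cuts" / vid / f"{vid}-Scene-{n:03d}-01.jpg"
    # Masks use the unpadded scene number, matching scene_1.png, scene_2.png, ...
    mask = root / "keyword_masks" / vid / f"scene_{n}.png"
    return screenshot, mask
```

For example, `scene_paths("abc", 2)` yields `data/video_scene_cuts/abc/abc-Scene-002-01.jpg` and `data/keyword_masks/abc/scene_2.png`.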

### Key Differences from Original Design

1. **No raw attention heatmaps** - Only pre-computed `attention_proportion` scores are available
2. **No raw keyword heatmaps** - Keyword masks are generated with CLIPSeg instead
3. **Scene-based instead of frame-based** - Data is organized by scenes, not individual frames
4. **Alignment scores are scalars** - One value per scene, not a spatial map
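
Because the data is scene-based, each sample is effectively one row of `alignment_score.csv` paired with its video's keyword. The two files use different key columns (`video id` and `_id`), so a join might look like the sketch below. `join_scenes_with_keywords` is a hypothetical helper, not the framework's loader, and casting both keys to strings is an assumption made to avoid int/str mismatches.

```python
import pandas as pd

def join_scenes_with_keywords(alignment_df, keywords_df):
    """Attach each video's keyword to its scene rows; scenes without a keyword are dropped."""
    scenes = alignment_df.copy()
    scenes["video id"] = scenes["video id"].astype(str)

    kw = keywords_df.rename(columns={"_id": "video id", "keyword_list[0]": "keyword"}).copy()
    kw["video id"] = kw["video id"].astype(str)
    # Drop missing and whitespace-only keywords, and keep one keyword per video
    kw = kw.dropna(subset=["keyword"])
    kw = kw[kw["keyword"].astype(str).str.strip() != ""]
    kw = kw.drop_duplicates(subset="video id")

    return scenes.merge(kw[["video id", "keyword"]], on="video id", how="inner")
```

The inner join keeps only videos present in both files, which mirrors the "valid keywords" filter used when inspecting the data.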

### Load and Inspect Data

In [None]:
# Load alignment scores
alignment_df = pd.read_csv('data/alignment_score.csv')
print("Alignment Score Data:")
print(alignment_df.head())
print(f"\nShape: {alignment_df.shape}")
print(f"Unique videos: {alignment_df['video id'].nunique()}")

In [None]:
# Load keywords
keywords_df = pd.read_csv('data/keywords.csv')
print("Keywords Data:")
print(keywords_df.head(10))
print(f"\nShape: {keywords_df.shape}")

# Filter out empty keywords
keywords_df_clean = keywords_df[keywords_df['keyword_list[0]'].notna() & (keywords_df['keyword_list[0]'] != '')]
print(f"Videos with valid keywords: {len(keywords_df_clean)}")

---
## 3. Preprocessing: Generate Keyword Masks <a name="preprocessing"></a>

Since we don't have keyword heatmaps, we use **CLIPSeg** to generate spatial masks indicating where the product appears in each scene.

### Run Keyword Mask Generation Script

**Note:** This step requires the `video_scene_cuts` directory to be present locally.

```bash
python scripts/generate_keyword_masks.py \
    --screenshots_dir data/video_scene_cuts \
    --keywords_file data/keywords.csv \
    --output_dir data/keyword_masks \
    --device cuda \
    --threshold 0.4
```

This will:
1. Load CLIPSeg model
2. For each video with a keyword:
   - Load all scene screenshots
   - Run CLIPSeg to segment the product/keyword
   - Save binary masks to `data/keyword_masks/`
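
The steps above can be sketched with the Hugging Face `transformers` CLIPSeg classes. This is not the contents of `scripts/generate_keyword_masks.py`: the checkpoint name, the sigmoid-then-threshold binarization, and the helper names are all assumptions made for illustration.

```python
import numpy as np

def logits_to_binary_mask(logits, threshold=0.4):
    """Apply a sigmoid to raw logits, then threshold into a 0/255 uint8 mask."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=np.float64)))
    return (probs > threshold).astype(np.uint8) * 255

def load_clipseg(device="cpu"):
    """Load the processor and model once and reuse them for every scene."""
    from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation
    name = "CIDAS/clipseg-rd64-refined"  # assumed checkpoint
    processor = CLIPSegProcessor.from_pretrained(name)
    model = CLIPSegForImageSegmentation.from_pretrained(name).to(device).eval()
    return processor, model

def segment_keyword(image, keyword, processor, model, threshold=0.4):
    """Return a binary mask (PIL image, same size as `image`) for `keyword`."""
    import torch
    from PIL import Image
    inputs = processor(text=[keyword], images=[image], padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits  # low-resolution segmentation logits
    mask = logits_to_binary_mask(logits.squeeze().cpu().numpy(), threshold)
    # CLIPSeg predicts at a fixed resolution, so resize back to the screenshot size
    return Image.fromarray(mask).resize(image.size, Image.NEAREST)
```

Loading the model once in `load_clipseg` and passing it into `segment_keyword` avoids reloading weights for every screenshot.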

In [None]:
# Check if keyword masks exist
keyword_masks_dir = 'data/keyword_masks'

if os.path.exists(keyword_masks_dir):
    mask_videos = os.listdir(keyword_masks_dir)
    print(f"✓ Keyword masks found for {len(mask_videos)} videos")
    print(f"  Directory: {keyword_masks_dir}")
else:
    print("✗ Keyword masks not found")
    print("  Please run: python scripts/generate_keyword_masks.py")
    print("  OR create dummy masks for testing (shown below)")

### Create Dummy Masks for Testing (Optional)

If you don't have the screenshots, you can create dummy masks for testing the framework:

In [None]:
def create_dummy_masks_for_testing():
    """Create dummy keyword masks for testing when screenshots aren't available."""
    os.makedirs('data/keyword_masks', exist_ok=True)
    
    # Get unique video IDs from alignment_score.csv
    video_ids = alignment_df['video id'].unique()[:5]  # Just first 5 for testing
    
    for video_id in video_ids:
        # Get scenes for this video
        scenes = alignment_df[alignment_df['video id'] == video_id]['Scene Number'].values
        
        # Create directory
        video_mask_dir = os.path.join('data/keyword_masks', str(video_id))
        os.makedirs(video_mask_dir, exist_ok=True)
        
        for scene_num in scenes:
            # Create a dummy mask (200x200 square centered in a 512x512 image)
            mask = np.zeros((512, 512), dtype=np.uint8)
            mask[156:356, 156:356] = 255
            
            # Save mask
            mask_path = os.path.join(video_mask_dir, f"scene_{scene_num}.png")
            Image.fromarray(mask).save(mask_path)
    
    print(f"✓ Created dummy masks for {len(video_ids)} videos")

# Uncomment to create dummy masks
# create_dummy_masks_for_testing()

---
## 4. Data Preparation <a name="data-prep"></a>

### Build Control Tensors

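Control tensors are constructed from:
- **keyword_mask** (M_t): Binary mask from CLIPSeg segmentation
- **alignment_score**: Scalar per scene, from the `attention_proportion` column of `alignment_score.csv`

**C_t = [M_t, S_t]** where:
- **M_t**: Keyword mask (binary)
- **S_t**: Alignment map = M_t × alignment_score, i.e. the scalar score inside the mask and 0 elsewhere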
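
The definition above can be illustrated with a few lines of NumPy. This is a sketch of the formula only, not the framework's `ControlTensorBuilder`; `build_control_tensor` is a hypothetical name, and treating any nonzero mask pixel as foreground is an assumption.

```python
import numpy as np

def build_control_tensor(keyword_mask, alignment_score):
    """Stack M_t and S_t = M_t * alignment_score into a [2, H, W] float32 array."""
    # Binarize to {0, 1}: any nonzero pixel counts as keyword region (assumption)
    m_t = (np.asarray(keyword_mask) > 0).astype(np.float32)
    # Broadcast the scalar score over the mask
    s_t = m_t * np.float32(alignment_score)
    return np.stack([m_t, s_t], axis=0)
```

With a mask loaded from `scene_1.png` and a score of 0.25, channel 0 holds 1.0 inside the keyword region and channel 1 holds 0.25 there, with both channels 0 elsewhere.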

In [None]:
# Configuration
CONFIG = {
    'data': {
        'alignment_score_file': 'data/alignment_score.csv',
        'keywords_file': 'data/keywords.csv',
        'screenshots_dir': 'data/video_scene_cuts',
        'keyword_masks_dir': 'data/keyword_masks',
        'image_size': 512,
    },
    'model': {
        'sd_model_name': 'runwayml/stable-diffusion-v1-5',
        'controlnet': {
            'control_channels': 2,  # [M_t, S_t]
            'base_channels': 64,
        },
        'use_lora': False,
    },
    'training': {
        'batch_size': 4,
        'num_workers': 4,
        'learning_rate': 1e-4,
        'num_epochs': 10,
        'lambda_recon': 1.0,
        'lambda_lpips': 1.0,
        'lambda_bg': 0.5,
        'use_recon_loss': True,
        'gradient_accumulation_steps': 1,
        'mixed_precision': True,
        'log_wandb': False,
        'project_name': 'video-ad-manipulation',
        'output_dir': 'outputs/training',
    },
}

# Create output directory
os.makedirs(CONFIG['training']['output_dir'], exist_ok=True)

# Save config
config_save_path = os.path.join(CONFIG['training']['output_dir'], 'config.yaml')
with open(config_save_path, 'w') as f:
    yaml.dump(CONFIG, f, default_flow_style=False)
print(f"Configuration saved to: {config_save_path}")

### Setup Data Loaders

In [None]:
# Get valid video IDs (only videos with both keywords and screenshots)
valid_videos = VideoSceneDataModule.get_valid_video_ids(
    alignment_score_file=CONFIG['data']['alignment_score_file'],
    keywords_file=CONFIG['data']['keywords_file'],
    screenshots_dir=CONFIG['data']['screenshots_dir'],
)

print(f"\n✓ Found {len(valid_videos)} valid videos")
print(f"  (videos with both keywords and screenshots)")

# Split by video in list order (no shuffling): first 80% train, last 20% validation
split_idx = int(0.8 * len(valid_videos))
train_videos = valid_videos[:split_idx]
val_videos = valid_videos[split_idx:]

print(f"\nTrain/Val Split:")
print(f"  Training videos: {len(train_videos)}")
print(f"  Validation videos: {len(val_videos)}")
print(f"\nFirst 5 training videos: {train_videos[:5]}")
print(f"First 5 validation videos: {val_videos[:5]}")

In [None]:
# Create data module
data_module = VideoSceneDataModule(
    alignment_score_file=CONFIG['data']['alignment_score_file'],
    keywords_file=CONFIG['data']['keywords_file'],
    train_videos=train_videos,
    val_videos=val_videos,
    screenshots_dir=CONFIG['data']['screenshots_dir'],
    keyword_masks_dir=CONFIG['data']['keyword_masks_dir'],
    batch_size=CONFIG['training']['batch_size'],
    num_workers=CONFIG['training']['num_workers'],
    image_size=(CONFIG['data']['image_size'], CONFIG['data']['image_size']),
)

train_loader = data_module.train_dataloader()
val_loader = data_module.val_dataloader()

print(f"\nDataset Statistics:")
print(f"  Training samples: {len(data_module.train_dataset)}")
print(f"  Validation samples: {len(data_module.val_dataset)}")
print(f"  Batches per epoch: {len(train_loader)}")

### Inspect a Training Batch

In [None]:
# Get one batch
batch = next(iter(train_loader))

print("Training Batch Contents:")
print(f"  'image' shape: {batch['image'].shape}")  # [B, 3, 512, 512]
print(f"  'control' shape: {batch['control'].shape}")  # [B, 2, 512, 512]
print(f"  'keyword_mask' shape: {batch['keyword_mask'].shape}")  # [B, 1, 512, 512]
print(f"  'alignment_score': {batch['alignment_score'][:3].tolist()}...")  # Scalars
print(f"  'keyword' (text prompts): {batch['keyword'][:2]}...")
print(f"  'video_id': {batch['video_id'][:2]}...")
print(f"  'scene_number': {batch['scene_number'][:3].tolist()}...")

In [None]:
# Visualize first sample in batch
sample_idx = 0
image = batch['image'][sample_idx].permute(1, 2, 0).numpy()
image = (image * 0.5 + 0.5).clip(0, 1)  # Denormalize from [-1, 1] to [0, 1]

keyword_mask = batch['control'][sample_idx, 0].numpy()  # M_t
alignment_map = batch['control'][sample_idx, 1].numpy()  # S_t
alignment_score = batch['alignment_score'][sample_idx].item()

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].imshow(image)
axes[0].set_title(f"Scene Image\nVideo: {batch['video_id'][sample_idx]}\nScene: {batch['scene_number'][sample_idx]}\nKeyword: '{batch['keyword'][sample_idx]}'")
axes[0].axis('off')

axes[1].imshow(keyword_mask, cmap='gray')
axes[1].set_title("Control Channel 0 (M_t)\nKeyword Mask")
axes[1].axis('off')

axes[2].imshow(alignment_map, cmap='hot')
axes[2].set_title(f"Control Channel 1 (S_t)\nAlignment Map\nScore: {alignment_score:.4f}")
axes[2].axis('off')

plt.tight_layout()
plt.show()

---
## 5. Model Fine-Tuning <a name="training"></a>

### Initialize Model

In [None]:
print("Initializing Stable Diffusion + ControlNet model...")
print("This may take a few minutes on first run (downloading pretrained weights)\n")

model = StableDiffusionControlNetWrapper(
    sd_model_name=CONFIG['model']['sd_model_name'],
    controlnet_config=CONFIG['model']['controlnet'],
    device=device,
    use_lora=CONFIG['model']['use_lora'],
)

print("✓ Model initialized successfully\n")
print(f"Model configuration:")
print(f"  SD backbone: {CONFIG['model']['sd_model_name']}")
print(f"  ControlNet input channels: {CONFIG['model']['controlnet']['control_channels']}")
print(f"  Using LoRA: {CONFIG['model']['use_lora']}")

### Training Loop

In [None]:
# Initialize trainer
trainer = ControlNetTrainer(
    model=model,
    train_dataloader=train_loader,
    val_dataloader=val_loader,
    learning_rate=CONFIG['training']['learning_rate'],
    num_epochs=CONFIG['training']['num_epochs'],
    device=device,
    output_dir=CONFIG['training']['output_dir'],
    lambda_recon=CONFIG['training']['lambda_recon'],
    lambda_lpips=CONFIG['training']['lambda_lpips'],
    lambda_bg=CONFIG['training']['lambda_bg'],
    use_recon_loss=CONFIG['training']['use_recon_loss'],
    gradient_accumulation_steps=CONFIG['training']['gradient_accumulation_steps'],
    mixed_precision=CONFIG['training']['mixed_precision'],
    log_wandb=CONFIG['training']['log_wandb'],
    project_name=CONFIG['training']['project_name'],
)

RUN_TRAINING = False  # Set to True to start training

if RUN_TRAINING:
    print(f"Starting training for {CONFIG['training']['num_epochs']} epochs...\n")
    trainer.train()

    print("\n" + "="*50)
    print("Training completed!")
    print(f"Best model saved to: {os.path.join(CONFIG['training']['output_dir'], 'best_model.pt')}")
    print("="*50)
else:
    print("Training skipped. Set RUN_TRAINING = True to start training.")

---
## 6. Inference: Generate 7 Experimental Variants <a name="inference"></a>

### Define Experimental Variants

We generate **7 variants** for each video:

1. **baseline**: Original alignment scores (control condition)
2. **early_boost**: Boost alignment in first 33% of scenes (×1.5)
3. **middle_boost**: Boost alignment in middle 33% of scenes (×1.5)
4. **late_boost**: Boost alignment in last 33% of scenes (×1.5)
5. **full_boost**: Boost alignment in all scenes (×1.5)
6. **reduction**: Reduce alignment in middle 33% of scenes (×0.5)
7. **placebo**: Modify non-keyword regions only
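
As a rough illustration of the list above, the per-scene scores for each variant could be derived as follows. This is not the `VideoVariantGenerator` implementation. The third boundaries (`n // 3` and `2 * n // 3`), the clipping of boosted scores to 1.0, and the name `make_variants` are assumptions. The placebo keeps the scores unchanged here, because its edit targets non-keyword image regions, which this sketch does not model.

```python
def make_variants(scores, boost=1.5, reduction=0.5):
    """Map variant name -> list of per-scene alignment scores."""
    n = len(scores)
    first, second = n // 3, (2 * n) // 3  # assumed boundaries of the thirds

    def scale(lo, hi, factor):
        # Scale scenes with index in [lo, hi); clip to 1.0 (assumption)
        return [min(1.0, s * factor) if lo <= i < hi else s
                for i, s in enumerate(scores)]

    return {
        "baseline": list(scores),
        "early_boost": scale(0, first, boost),
        "middle_boost": scale(first, second, boost),
        "late_boost": scale(second, n, boost),
        "full_boost": scale(0, n, boost),
        "reduction": scale(first, second, reduction),
        "placebo": list(scores),  # scores untouched; image-level edit not shown
    }
```

For a six-scene video, `early_boost` scales scenes 0 and 1, `middle_boost` and `reduction` scale scenes 2 and 3, and `late_boost` scales scenes 4 and 5.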

### Generate Variant Specifications

In [None]:
# Initialize variant generator
variant_generator = VideoVariantGenerator(
    alignment_score_file=CONFIG['data']['alignment_score_file'],
    keywords_file=CONFIG['data']['keywords_file'],
    boost_alpha=1.5,
    reduction_alpha=0.5,
)

print("Variant generator initialized")
print(f"  Boost alpha: 1.5")
print(f"  Reduction alpha: 0.5")

In [None]:
# Generate variants for a valid video (example)
# Use one of the valid videos from our dataset
if len(valid_videos) > 0:
    example_video_id = valid_videos[0]
    print(f"Generating variants for video: {example_video_id}")

    variants = variant_generator.create_all_variants_for_video(example_video_id)

    print(f"\n✓ Generated {len(variants)} variants:")
    for variant_name in variants.keys():
        print(f"  - {variant_name}")
else:
    print("⚠️ No valid videos found. Please check your data.")

In [None]:
# Compute and display statistics
stats = variant_generator.compute_variant_statistics(variants)

print("\nVariant Statistics:\n")
for variant_name, stat in stats.items():
    print(f"{variant_name:15s}: mean={stat['mean_alignment']:.4f}, std={stat['std_alignment']:.4f}, scenes={stat['num_scenes']}")

---
## 7. Visualization <a name="visualization"></a>

### Visualize Variant Comparison

In [None]:
# Get keyword for this video
keyword = variant_generator.keywords.get(str(example_video_id), "unknown")

# Visualize alignment profiles
visualize_variant_comparison(variants, example_video_id, keyword)

### Generate Variants for All Videos

In [None]:
# Generate variants for all videos
all_variants = variant_generator.generate_variants_for_all_videos(
    output_dir='outputs/variants'
)

# Save manifest
variant_generator.save_variant_manifest(
    all_variants,
    output_path='outputs/variants/manifest.json'
)

print(f"\n✓ Generated variants for {len(all_variants)} videos")
print(f"  Output directory: outputs/variants/")
print(f"  Manifest: outputs/variants/manifest.json")

### Inspect Variant Files

In [None]:
# Load manifest
with open('outputs/variants/manifest.json', 'r') as f:
    manifest = json.load(f)

print("Variant Generation Manifest:")
print(f"  Total videos: {manifest['num_videos']}")
print(f"  Variants per video: {manifest['num_variants_per_video']}")
print(f"  Variant types: {manifest['variant_types']}")
print(f"\nFirst 3 videos:")
for video_id, info in list(manifest['videos'].items())[:3]:
    print(f"  {video_id}: {info['num_scenes']} scenes, keyword='{info['keyword']}'")

---
## Summary

### Updated Workflow

**Data Preparation:**
1. Load `alignment_score.csv` with pre-computed alignment scores
2. Load `keywords.csv` with product keywords
3. Use CLIPSeg to generate keyword masks from screenshots
4. Build control tensors: `C_t = [M_t, S_t]`
   - `M_t` = keyword mask from CLIPSeg
   - `S_t` = M_t × alignment_score

**Training:**
- Fine-tune ControlNet adapter
- Freeze Stable Diffusion backbone
- Train to manipulate scenes based on keyword masks and alignment scores

**Inference (7 Variants):**
For each video, generate 7 variants by modulating alignment scores:
1. **baseline**: No change
2. **early_boost**: Boost first 33% of scenes
3. **middle_boost**: Boost middle 33%
4. **late_boost**: Boost last 33%
5. **full_boost**: Boost all scenes
6. **reduction**: Reduce middle 33%
7. **placebo**: Modify non-keyword regions

**Output:**
- Variant specifications saved to `outputs/variants/{video_id}/{variant_name}.csv`
- Statistics saved to `outputs/variants/{video_id}/statistics.json`
- Manifest saved to `outputs/variants/manifest.json`

### Next Steps

1. Run preprocessing to generate keyword masks
2. Train ControlNet model
3. Run inference to generate edited scenes
4. Reassemble scenes into video files
5. Deploy variants for A/B testing