# Multi-Modal Models with MetaGen

This notebook loads MetaGen specs for several modalities and synthesizes models from a selection of them:
- Text (LLMs)
- Image (ViT, Diffusion)
- Audio (MusicGen-style)
- Video (Generation)
- Multi-modal (CLIP-style)

In [None]:
# Imports
from pathlib import Path

import yaml

from metagen.specs.loader import load_spec
from metagen.synth.engine import synthesize
from metagen.synth.modalities import get_handler

## 1. Text Modality

Standard transformer-based LLM:

In [None]:
# Load text spec
text_spec, text_seed = load_spec(Path("../specs/text/text_llm_8b.yaml"))

print(f"Text Model: {text_spec.name}")
print(f"  Inputs: {text_spec.modality.inputs}")
print(f"  Outputs: {text_spec.modality.outputs}")
print(f"  Architecture: {text_spec.architecture.family}")
print(f"  Objective: {text_spec.training.objective}")

# Get handler
handler = get_handler(text_spec)
print(f"  Handler: {handler.__class__.__name__}")

## 2. Image Modality - Vision Transformer

ViT for image classification/encoding:

In [None]:
# Load ViT spec
vit_spec, vit_seed = load_spec(Path("../specs/image/image_vit_base.yaml"))

print(f"Vision Transformer: {vit_spec.name}")
print(f"  Inputs: {vit_spec.modality.inputs}")
print(f"  Outputs: {vit_spec.modality.outputs}")
print(f"  Architecture: {vit_spec.architecture.family}")
print(f"  Parameter budget: {vit_spec.constraints.parameter_budget.max}")

# Get handler
handler = get_handler(vit_spec)
print(f"  Handler: {handler.__class__.__name__}")

## 3. Image Modality - Diffusion Model

U-Net-based diffusion for image generation:

In [None]:
# Load diffusion spec
diff_spec, diff_seed = load_spec(Path("../specs/image/image_diffusion_sdxl_like.yaml"))

print(f"Diffusion Model: {diff_spec.name}")
print(f"  Inputs: {diff_spec.modality.inputs}")
print(f"  Outputs: {diff_spec.modality.outputs}")
print(f"  Architecture: {diff_spec.architecture.family}")
print(f"  Objective: {diff_spec.training.objective}")
print(f"  Parameter budget: {diff_spec.constraints.parameter_budget.max}")

## 4. Audio Modality

MusicGen-style audio generation:

In [None]:
# Load audio spec
audio_spec, audio_seed = load_spec(Path("../specs/audio/audio_musicgen.yaml"))

print(f"Audio Model: {audio_spec.name}")
print(f"  Inputs: {audio_spec.modality.inputs}")
print(f"  Outputs: {audio_spec.modality.outputs}")
print(f"  Architecture: {audio_spec.architecture.family}")
print(f"  Context window: {audio_spec.constraints.context_window}")

# Get handler
handler = get_handler(audio_spec)
print(f"  Handler: {handler.__class__.__name__}")

## 5. Video Modality

Video generation model:

In [None]:
# Load video spec
video_spec, video_seed = load_spec(Path("../specs/video/video_generation.yaml"))

print(f"Video Model: {video_spec.name}")
print(f"  Inputs: {video_spec.modality.inputs}")
print(f"  Outputs: {video_spec.modality.outputs}")
print(f"  Architecture: {video_spec.architecture.family}")

# Get handler
handler = get_handler(video_spec)
print(f"  Handler: {handler.__class__.__name__}")

## 6. Multi-Modal - CLIP Style

Contrastive learning across text and images:

In [None]:
# Load multimodal spec
clip_spec, clip_seed = load_spec(Path("../specs/multimodal/multimodal_clip.yaml"))

print(f"Multi-Modal Model: {clip_spec.name}")
print(f"  Inputs: {clip_spec.modality.inputs}")
print(f"  Outputs: {clip_spec.modality.outputs}")
print(f"  Architecture: {clip_spec.architecture.family}")
print(f"  Objective: {clip_spec.training.objective}")

# Get handler
handler = get_handler(clip_spec)
print(f"  Handler: {handler.__class__.__name__}")
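
Sections 1 through 6 repeat the same pattern: load a spec, print a few nested fields, and report the handler class. The sketch below shows one way to factor the printing into a helper that tolerates fields a given spec does not define. The names `get_nested` and `describe_spec` are hypothetical and not part of MetaGen, and the `SimpleNamespace` object only stands in for a loaded spec so the example runs on its own.

```python
from types import SimpleNamespace


def get_nested(obj, dotted_path, default="N/A"):
    """Follow a dotted attribute path, returning `default` if any step is missing."""
    current = obj
    for part in dotted_path.split("."):
        current = getattr(current, part, None)
        if current is None:
            return default
    return current


def describe_spec(label, spec, fields):
    """Build a printable summary from (title, dotted_path) pairs."""
    lines = [f"{label}: {get_nested(spec, 'name')}"]
    for title, path in fields:
        lines.append(f"  {title}: {get_nested(spec, path)}")
    return "\n".join(lines)


# Stand-in object for illustration only; a real spec would come from load_spec.
demo_spec = SimpleNamespace(
    name="demo_model",
    modality=SimpleNamespace(inputs=["text"], outputs=["text"]),
    architecture=SimpleNamespace(family="transformer"),
)

print(
    describe_spec(
        "Text Model",
        demo_spec,
        [
            ("Inputs", "modality.inputs"),
            ("Architecture", "architecture.family"),
            ("Parameter budget", "constraints.parameter_budget.max"),
        ],
    )
)
```

Because missing attributes fall back to `"N/A"`, the same field list can be reused across specs that expose different constraints.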

## 7. Synthesize Different Modalities

Let's synthesize a selection of models, one each for text, image, and multi-modal:

In [None]:
# Synthesize a selection of models
specs_to_synth = [
    ("../specs/text/text_llm_tiny.yaml", "text_tiny"),
    ("../specs/image/image_vit_base.yaml", "image_vit"),
    ("../specs/multimodal/multimodal_clip.yaml", "multimodal_clip"),
]

results = {}
for spec_path, name in specs_to_synth:
    spec, seed = load_spec(Path(spec_path))
    output_dir = Path(f"./outputs/multi_modal/{name}")

    result = synthesize(spec=spec, output_dir=output_dir, seed=seed)
    results[name] = result

    print(f"Synthesized: {name}")
    print(f"  Output: {result['output_dir']}")
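
The later cells in this notebook read two files from each output directory: `blueprint/architecture.yaml` and `docs/model_card.md`. A quick check that both exist can catch an incomplete run before those cells fail. This is a minimal sketch using only `pathlib`; `EXPECTED_ARTIFACTS` and `missing_artifacts` are hypothetical names, and the list covers only the two paths this notebook uses, not everything MetaGen may write.

```python
from pathlib import Path

# Relative paths read by the comparison and model-card cells in this notebook.
EXPECTED_ARTIFACTS = (
    Path("blueprint") / "architecture.yaml",
    Path("docs") / "model_card.md",
)


def missing_artifacts(output_dir):
    """Return the expected relative paths that are not files under output_dir."""
    root = Path(output_dir)
    return [rel for rel in EXPECTED_ARTIFACTS if not (root / rel).is_file()]
```

Calling `missing_artifacts(result["output_dir"])` for each entry in `results` returns an empty list when both files are present.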

## 8. Compare Architectures

Let's compare the hidden size, layer count, and head count recorded in each generated `blueprint/architecture.yaml`:

In [None]:
print(f"{'Model':<20} {'Hidden':<10} {'Layers':<10} {'Heads':<10}")
print("-" * 50)

for name, result in results.items():
    arch_path = Path(result["output_dir"]) / "blueprint" / "architecture.yaml"
    with open(arch_path, encoding="utf-8") as f:
        arch = yaml.safe_load(f)

    hidden = arch.get("hidden_size", "N/A")
    layers = arch.get("num_layers", "N/A")
    heads = arch.get("num_heads", "N/A")

    print(f"{name:<20} {str(hidden):<10} {str(layers):<10} {str(heads):<10}")

## 9. Examine Model Card Differences

Each synthesized model gets its own model card at `docs/model_card.md`. Print the first 20 lines of each to compare them:

In [None]:
for name, result in results.items():
    model_card_path = Path(result["output_dir"]) / "docs" / "model_card.md"
    with open(model_card_path, encoding="utf-8") as f:
        content = f.read()

    # Show the first 20 lines
    lines = content.splitlines()[:20]

    print(f"\n{'=' * 60}")
    print(f"Model Card: {name}")
    print(f"{'=' * 60}")
    for line in lines:
        print(line)

## 10. Create a Custom Multi-Modal Spec

Let's create a custom vision-language model:

In [None]:
from metagen.specs.schema import (
    Architecture,
    Constraints,
    Modality,
    ModelSpec,
    ParameterBudget,
    Training,
)

# Create custom spec programmatically
custom_spec = ModelSpec(
    metagen_version="1.0",
    name="custom_vlm",
    description="Custom vision-language model for image captioning.",
    modality=Modality(inputs=["text", "image"], outputs=["text"]),
    constraints=Constraints(
        parameter_budget=ParameterBudget(max="3B"), latency="near-real-time", device="consumer_gpu"
    ),
    training=Training(objective=["autoregressive"]),
    architecture=Architecture(family="transformer"),
)

print(f"Custom spec: {custom_spec.name}")
print(f"  Modality: {custom_spec.modality.inputs} → {custom_spec.modality.outputs}")

# Synthesize
custom_result = synthesize(
    spec=custom_spec, output_dir=Path("./outputs/multi_modal/custom_vlm"), seed=42
)

print(f"  Output: {custom_result['output_dir']}")
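
The spec above gives the parameter budget as the string `"3B"`. How MetaGen interprets that string is not shown in this notebook. The sketch below assumes the common convention that `K`, `M`, `B`, and `T` mean thousand, million, billion, and trillion, and converts such a string to an integer for use in your own comparisons. `parse_parameter_budget` is a hypothetical helper, not a MetaGen function.

```python
import re

# Assumed suffix convention; MetaGen's own parsing rules may differ.
_SUFFIX_MULTIPLIERS = {"K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}


def parse_parameter_budget(text):
    """Convert a string such as '3B' or '350M' to an integer parameter count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMBT]?)\s*", text.upper())
    if match is None:
        raise ValueError(f"Unrecognized parameter budget: {text!r}")
    number, suffix = match.groups()
    # An empty suffix means the number is already a raw count.
    return round(float(number) * _SUFFIX_MULTIPLIERS.get(suffix, 1))
```

With this helper, two budgets can be compared numerically, for example `parse_parameter_budget("3B") > parse_parameter_budget("350M")`.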

## Available Modality Handlers

Summary of MetaGen's modality support:

| Handler | Modality | Architectures |
|---------|----------|---------------|
| TextModalityHandler | text | transformer |
| ImageModalityHandler | image | transformer, cnn, diffusion, hybrid |
| AudioModalityHandler | audio | transformer |
| VideoModalityHandler | video | transformer, diffusion |
| MultimodalHandler | text+image | transformer, hybrid |
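
The table can also be written as a small lookup for checking a modality and architecture pairing before writing a spec. This is a sketch that copies the table's contents by hand; `SUPPORTED_FAMILIES`, `is_supported`, and `modalities_for` are hypothetical names, and MetaGen's own validation is not shown here.

```python
# Modality -> architecture families, transcribed from the table above.
SUPPORTED_FAMILIES = {
    "text": {"transformer"},
    "image": {"transformer", "cnn", "diffusion", "hybrid"},
    "audio": {"transformer"},
    "video": {"transformer", "diffusion"},
    "text+image": {"transformer", "hybrid"},
}


def is_supported(modality, family):
    """Return True if the table lists `family` for `modality`."""
    return family in SUPPORTED_FAMILIES.get(modality, set())


def modalities_for(family):
    """Return the modalities whose table row lists `family`, sorted by name."""
    return sorted(m for m, fams in SUPPORTED_FAMILIES.items() if family in fams)


print(is_supported("audio", "diffusion"))
print(modalities_for("diffusion"))
```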

## Further Reading

- [Multi-Modal Guide](../../docs/user-guide/multi_modal.md) - Complete reference
- [Spec Language](../../docs/user-guide/spec_language.md) - All spec options