# Lab 4: Image Generation and Multimodal AI

In this lab, you'll learn how to generate images using **Stable Diffusion** and understand images using **vision models** - all running locally.

## What You'll Learn
- Generate images from text prompts
- Prompt engineering and style modifiers
- Vision models for image understanding
- Image captioning and tagging pipelines

## Requirements
- **GPU strongly recommended** for image generation
- 8GB+ VRAM for Stable Diffusion
- CPU works but is very slow (10+ minutes per image)
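
The VRAM requirement above is driven largely by the size of the model weights, which scales with parameter count and numeric precision. A rough sketch of that arithmetic follows. The helper `weights_gib` and the parameter count are illustrative placeholders, not measured figures for any particular checkpoint, and real usage is higher because activations and intermediate tensors also need memory.

```python
def weights_gib(num_params: int, bytes_per_param: int) -> float:
    """Approximate memory needed just to hold model weights, in GiB."""
    return num_params * bytes_per_param / 2**30

# Hypothetical model with 1 billion parameters (placeholder, not a real checkpoint)
example_params = 1_000_000_000
fp16 = weights_gib(example_params, 2)  # float16: 2 bytes per parameter
fp32 = weights_gib(example_params, 4)  # float32: 4 bytes per parameter
print(f"fp16: {fp16:.2f} GiB, fp32: {fp32:.2f} GiB")
```

This is one reason the loading cell later in the lab uses half precision on CUDA: the same weights take half the memory.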

## 1. Setup

In [None]:
!pip install diffusers transformers accelerate torch Pillow matplotlib -q

In [None]:
import torch
from diffusers import DiffusionPipeline
from PIL import Image
import matplotlib.pyplot as plt

# Check device
if torch.cuda.is_available():
    device = "cuda"
    print(f"Using CUDA: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    device = "mps"
    print("Using Apple Silicon (MPS)")
else:
    device = "cpu"
    print("Using CPU (this will be slow!)")

## 2. Load Stable Diffusion

In [None]:
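# Load Stable Diffusion XL Turbo (fast, good quality)
# For older GPUs or less VRAM, use "stabilityai/sd-turbo" instead

model_id = "stabilityai/sdxl-turbo"  # Distilled for 1-4 inference steps
# Alternative: "runwayml/stable-diffusion-v1-5"  # Classic, more compatible; needs ~20-50 steps
# (if that repo is unavailable, try the mirror "stable-diffusion-v1-5/stable-diffusion-v1-5")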

print(f"Loading {model_id}...")

if device == "cuda":
    pipe = DiffusionPipeline.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        variant="fp16"
    )
    pipe = pipe.to(device)
else:
    # For CPU/MPS
    pipe = DiffusionPipeline.from_pretrained(
        model_id,
        torch_dtype=torch.float32
    )
    if device == "mps":
        pipe = pipe.to(device)

print("Model loaded!")
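
The two branches in the loading cell differ only in precision and weight variant. One way to make that choice explicit is a small pure function, sketched below. `load_options` and the `dtype_name` key are hypothetical names introduced for illustration; the dtype is returned by name so the helper can be read and tested without a GPU or torch installed.

```python
def load_options(device: str) -> dict:
    """Pick precision settings for from_pretrained() based on the device."""
    if device == "cuda":
        # Half precision halves weight memory; variant="fp16" asks for the
        # half-precision weight files when the repo provides them.
        return {"dtype_name": "float16", "variant": "fp16"}
    # CPU and MPS stay on float32, matching the cell above
    return {"dtype_name": "float32", "variant": None}

for name in ("cuda", "mps", "cpu"):
    print(name, load_options(name))

# In the notebook this could be used as (assumes torch, model_id, device exist):
#   opts = load_options(device)
#   pipe = DiffusionPipeline.from_pretrained(
#       model_id,
#       torch_dtype=getattr(torch, opts["dtype_name"]),
#       variant=opts["variant"],
#   ).to(device)
```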

## 3. Basic Image Generation

In [None]:
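from typing import Optional

def generate_image(prompt: str, negative_prompt: Optional[str] = None, steps: int = 4):
    """
    Generate an image from a text prompt.

    Args:
        prompt: What you want to see
        negative_prompt: What you don't want to see. Ignored by Turbo models,
            which run with guidance disabled (guidance_scale=0.0).
        steps: Number of diffusion steps. Turbo models are built for 1-4 steps;
            classic models such as SD 1.5 typically need 20-50.
    """
    is_turbo = "turbo" in model_id
    image = pipe(
        prompt=prompt,
        # Negative prompts only take effect when classifier-free guidance is on
        negative_prompt=None if is_turbo else negative_prompt,
        num_inference_steps=steps,
        guidance_scale=0.0 if is_turbo else 7.5,  # Turbo is trained to run without guidance
    ).images[0]

    return image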

# Display helper
def show_image(image, title="Generated Image"):
    plt.figure(figsize=(8, 8))
    plt.imshow(image)
    plt.title(title)
    plt.axis('off')
    plt.show()
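
`generate_image` picks `guidance_scale` from the model name. Negative prompts act through classifier-free guidance, so they have no effect when guidance is switched off for Turbo models. The sketch below separates that argument selection into a pure function so the behavior can be inspected without loading a model. `build_call_kwargs` is a hypothetical helper, not part of diffusers; the keys it returns are the pipeline arguments already used in this lab.

```python
from typing import Any, Dict, Optional

def build_call_kwargs(
    prompt: str,
    model_id: str,
    steps: int = 4,
    negative_prompt: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for a pipeline call."""
    is_turbo = "turbo" in model_id
    kwargs: Dict[str, Any] = {
        "prompt": prompt,
        "num_inference_steps": steps,
        # Turbo models run without guidance; 7.5 is the value used in this lab otherwise
        "guidance_scale": 0.0 if is_turbo else 7.5,
    }
    # Only pass a negative prompt when guidance is on, since it is ignored otherwise
    if negative_prompt and not is_turbo:
        kwargs["negative_prompt"] = negative_prompt
    return kwargs

print(build_call_kwargs("a cat", "stabilityai/sdxl-turbo", negative_prompt="blurry"))
print(build_call_kwargs("a cat", "runwayml/stable-diffusion-v1-5", steps=30, negative_prompt="blurry"))
```

In the notebook the result would be unpacked into the call, as in `pipe(**build_call_kwargs(prompt, model_id)).images[0]`.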

In [None]:
# Generate your first image!
prompt = "A cozy coffee shop interior with warm lighting, plants, and wooden furniture"

print(f"Generating: {prompt}")
image = generate_image(prompt)
show_image(image)

In [None]:
# Try different prompts
prompts = [
    "A futuristic city skyline at sunset, cyberpunk style",
    "A cute robot reading a book in a library",
    "Abstract art with vibrant colors and geometric shapes"
]

for prompt in prompts:
    print(f"\nGenerating: {prompt}")
    image = generate_image(prompt)
    # Only add "..." when the prompt was actually truncated
    show_image(image, prompt if len(prompt) <= 50 else prompt[:50] + "...")

## 4. Prompt Engineering for Better Results

In [None]:
# Good prompts include:
# - Subject description
# - Style/medium
# - Lighting
# - Quality tags

detailed_prompt = """
A majestic mountain landscape at golden hour,
snow-capped peaks reflecting sunset colors,
crystal clear lake in foreground,
professional photography, 8k, highly detailed,
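cinematic lighting, dramatic sky
""".replace("\n", " ").strip()

# Note: negative prompts are ignored when guidance_scale is 0.0, which is the
# setting used for Turbo models. Switch to a non-Turbo model to see their effect.
negative_prompt = "blurry, low quality, distorted, ugly, bad anatomy"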

image = generate_image(detailed_prompt, negative_prompt)
show_image(image, "Detailed Prompt Example")
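
The checklist in the cell above (subject, style/medium, lighting, quality tags) can also be assembled programmatically, which makes it easier to vary one component at a time. A minimal sketch follows; `build_prompt` is a hypothetical helper and the comma-separated layout is just the convention already used in this lab.

```python
from typing import Iterable, Optional

def build_prompt(
    subject: str,
    style: Optional[str] = None,
    lighting: Optional[str] = None,
    quality_tags: Iterable[str] = (),
) -> str:
    """Join the prompt components with commas, skipping any that are empty."""
    parts = [subject, style, lighting, *quality_tags]
    return ", ".join(p.strip() for p in parts if p and p.strip())

prompt = build_prompt(
    "A majestic mountain landscape at golden hour",
    style="professional photography",
    lighting="cinematic lighting",
    quality_tags=["8k", "highly detailed"],
)
print(prompt)
```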

In [None]:
# Style modifiers
base_prompt = "A cat sitting on a windowsill"

styles = [
    "oil painting style",
    "anime style",
    "pixel art style",
    "watercolor style"
]

fig, axes = plt.subplots(2, 2, figsize=(12, 12))
axes = axes.flatten()

for ax, style in zip(axes, styles):
    prompt = f"{base_prompt}, {style}"
    print(f"Generating: {style}")
    image = generate_image(prompt)
    ax.imshow(image)
    ax.set_title(style)
    ax.axis('off')

plt.tight_layout()
plt.show()
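
The grid above is hard-coded to 2x2, which only fits exactly four styles. If you add or remove styles, the layout can be computed instead. The sketch below uses a hypothetical helper `grid_shape`; the column count is an arbitrary default.

```python
import math

def grid_shape(n_items: int, cols: int = 2) -> tuple:
    """Return (rows, cols) for a grid large enough to hold n_items."""
    if n_items < 1 or cols < 1:
        raise ValueError("n_items and cols must be positive")
    cols = min(cols, n_items)          # never more columns than items
    rows = math.ceil(n_items / cols)   # round up so the last row can be partial
    return rows, cols

print(grid_shape(4))     # four styles, two columns
print(grid_shape(5, 3))  # five styles, three columns

# In the notebook (assumes matplotlib is imported as plt):
#   rows, cols = grid_shape(len(styles))
#   fig, axes = plt.subplots(rows, cols, figsize=(6 * cols, 6 * rows), squeeze=False)
#   axes = axes.flatten()
#   for ax in axes[len(styles):]:
#       ax.axis('off')  # hide unused cells in a partial last row
```

Passing `squeeze=False` keeps `axes` a 2D array even for a single row or column, so `flatten()` always works.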

## 5. Save Generated Images

In [None]:
import os
from datetime import datetime

# Create output directory
output_dir = "./generated_images"
os.makedirs(output_dir, exist_ok=True)

# Generate and save
prompt = "A beautiful sunset over the ocean, dramatic clouds"
image = generate_image(prompt)

# Save with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"{output_dir}/image_{timestamp}.png"
image.save(filename)

print(f"Saved to: {filename}")
show_image(image)
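
A timestamp with one-second resolution can collide when several images are generated in quick succession, and it says nothing about what the image shows. One option, sketched below, is to add microseconds and a short slug of the prompt to the filename. `slugify` and `output_path` are hypothetical helpers, and the 40-character limit is an arbitrary choice.

```python
import re
from datetime import datetime
from pathlib import Path

def slugify(text: str, max_len: int = 40) -> str:
    """Lowercase the text and replace runs of non-alphanumerics with '-'."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug[:max_len].rstrip("-")

def output_path(output_dir: str, prompt: str, now: datetime) -> Path:
    """Build a filename from a timestamp (with microseconds) and the prompt."""
    stamp = now.strftime("%Y%m%d_%H%M%S_%f")
    return Path(output_dir) / f"{stamp}_{slugify(prompt)}.png"

path = output_path(
    "./generated_images",
    "A beautiful sunset over the ocean, dramatic clouds",
    datetime.now(),
)
print(path)
# In the notebook: image.save(path)
```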

## 6. Vision Models with Ollama

Now let's use vision models to understand images!

In [None]:
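# Prerequisites: install Ollama (https://ollama.com) and make sure its server is running,
# then pull a vision model from a terminal: ollama pull llava
# The pip package below is only the Python client; it does not include the server.

!pip install ollama -q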
import ollama
import base64
from io import BytesIO
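
Calls to `ollama.chat` fail if the Ollama server is not running, so a quick check before the vision cells can save some confusion. The sketch below assumes the server listens on its default address, `http://localhost:11434`, and queries the `/api/tags` endpoint that lists local models. `ollama_is_running` is a hypothetical helper that uses only the standard library.

```python
import urllib.request

def ollama_is_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and urllib's URLError/HTTPError
        return False

if ollama_is_running():
    print("Ollama server is reachable.")
else:
    print("Ollama server not reachable. Start it, then run: ollama pull llava")
```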

In [None]:
def image_to_base64(image: Image.Image) -> str:
    """Convert PIL Image to base64 string."""
    buffered = BytesIO()
    image.save(buffered, format="PNG")
    return base64.b64encode(buffered.getvalue()).decode()

def analyze_image(image: Image.Image, question: str = "Describe this image in detail."):
    """Analyze an image using LLaVA vision model."""
    
    response = ollama.chat(
        model='llava',
        messages=[{
            'role': 'user',
            'content': question,
            'images': [image_to_base64(image)]
        }]
    )
    
    return response['message']['content']

In [None]:
# Generate an image and then analyze it
prompt = "A robot chef cooking in a modern kitchen"
image = generate_image(prompt)
show_image(image, "Generated Image")

# Analyze it
print("\nVision model analysis:")
description = analyze_image(image)
print(description)

In [None]:
# Ask specific questions about images
questions = [
    "What objects can you see in this image?",
    "What is the mood or atmosphere of this image?",
    "What artistic style does this image use?"
]

for question in questions:
    print(f"\nQ: {question}")
    answer = analyze_image(image, question)
    print(f"A: {answer}")

## 7. Analyze External Images

In [None]:
# Load and analyze an image from file
def analyze_image_file(filepath: str, question: str = "Describe this image."):
    """Analyze an image file."""
    image = Image.open(filepath)
    show_image(image, filepath)
    
    print(f"\nQuestion: {question}")
    answer = analyze_image(image, question)
    print(f"Answer: {answer}")
    return answer

# Example: Analyze one of our saved images
# analyze_image_file("./generated_images/your_image.png")
print("Uncomment above to analyze your own images!")
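
To analyze every saved image instead of one file at a time, the output directory can be scanned for image files first. A minimal sketch follows; `find_images` and the set of suffixes are illustrative assumptions, and the directory name matches the one used earlier in this lab.

```python
from pathlib import Path

# Suffixes treated as images here; extend as needed
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}

def find_images(directory: str) -> list:
    """Return image files directly inside directory, sorted by path."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )

for path in find_images("./generated_images"):
    print(path)
    # In the notebook: analyze_image_file(str(path))
```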

## 8. Image Captioning Pipeline

In [None]:
def generate_caption(image: Image.Image) -> str:
    """Generate a concise caption for an image."""
    prompt = "Generate a short, descriptive caption for this image (1-2 sentences)."
    return analyze_image(image, prompt)

def generate_tags(image: Image.Image) -> str:
    """Generate tags for an image."""
    prompt = "List 5-10 relevant tags for this image, separated by commas."
    return analyze_image(image, prompt)

# Test on generated image
prompt = "A scientist in a lab coat examining colorful test tubes"
image = generate_image(prompt)
show_image(image)

print("Caption:", generate_caption(image))
print("\nTags:", generate_tags(image))
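
`generate_tags` returns free-form text, so a pipeline that stores or searches tags usually needs to turn it into a clean list. The sketch below is one possible post-processing step. `parse_tags` is a hypothetical helper; it handles comma-separated and line-per-tag replies but does not strip conversational preambles a model might add.

```python
import re

def parse_tags(raw: str) -> list:
    """Split a model reply into lowercase, de-duplicated tags."""
    tags = []
    for piece in re.split(r"[,\n]", raw):
        # Drop list markers such as "1.", "2)", "-" or "*" at the start
        tag = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", piece)
        tag = tag.strip(" .\t").lower()
        if tag and tag not in tags:
            tags.append(tag)
    return tags

print(parse_tags("Scientist, lab coat, Test Tubes, scientist."))
print(parse_tags("1. robot\n2. kitchen\n- chef"))
```

In the notebook this would wrap the existing call, as in `parse_tags(generate_tags(image))`.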

## Summary

In this lab, you learned how to:
- Generate images from text using Stable Diffusion
- Write effective prompts for image generation
- Apply different artistic styles
- Use vision models (LLaVA) to analyze images
- Build captioning and tagging pipelines

**Key takeaways:**
- Stable Diffusion runs completely locally once the model weights have been downloaded
- Prompt engineering greatly affects results
- Vision models can understand and describe images
- Combine generation + vision for powerful workflows

**Next Lab:** Building AI Agents