# 078: Multimodal Large Language Models

## 🎯 Learning Objectives

By the end of this notebook, you will:
- **Understand** vision-language models and cross-modal alignment
- **Implement** CLIP for zero-shot image classification
- **Build** image captioning systems (encoder-decoder architecture)
- **Create** Visual Question Answering (VQA) systems
- **Apply** multimodal models to semiconductor wafer map analysis
- **Evaluate** multimodal systems using BLEU, ROUGE, and visual metrics

## 🖼️ What are Multimodal LLMs?

**Multimodal LLMs** process and generate content across multiple modalities:
- 👁️ **Vision** - Images, videos, diagrams
- 📝 **Text** - Natural language, code
- 🔊 **Audio** - Speech, sounds (not covered in this notebook)

**Key capabilities:**
- Image captioning ("A dog playing in the park")
- Visual question answering ("What color is the car?")
- Image-text retrieval (search images with text descriptions)
- Vision-grounded text generation (stories from images)

## 🏭 Post-Silicon Validation Use Cases

**Automated Failure Analysis Reports**
- Input: Wafer map image + test data
- Output: Natural language failure report with root cause analysis
- Value: Reduce engineer time from 2 hours → 15 minutes per failure

**Visual Test Documentation Search**
- Input: Text query "show me ring defect patterns"
- Output: Retrieve similar wafer maps from historical database
- Value: 10× faster pattern matching vs manual search

**Defect Classification with Context**
- Input: Wafer map + question "What type of defect is this?"
- Output: "Ring pattern indicating chamber conditioning issue"
- Value: Standardize defect classification across teams

**Parametric Correlation Explanation**
- Input: Scatter plot + "Why do these parameters correlate?"
- Output: Technical explanation grounded in test physics
- Value: Enable non-experts to interpret complex data

## 🔄 Multimodal Architecture Workflow

```mermaid
graph LR
    A[Image] --> B[Vision Encoder]
    B --> C[Visual Features]
    
    D[Text] --> E[Language Model]
    E --> F[Text Features]
    
    C --> G[Cross-Modal Fusion]
    F --> G
    
    G --> H[Multimodal Understanding]
    H --> I[Generated Response]
    
    style B fill:#e1f5ff
    style E fill:#fff4e1
    style G fill:#f0e1ff
    style I fill:#e1ffe1
```

## 📊 Learning Path Context

**Prerequisites:**
- 072: GPT & Large Language Models (language generation)
- 073: Vision Transformers (image encoding)
- 058: Transformers & Self-Attention (attention mechanism)

**Next Steps:**
- 079: RAG Fundamentals (retrieval-augmented generation)
- 083: AI Agents (multimodal agents)
- 085: Vector Databases (image embedding search)

---

## 📚 Required Libraries

### Core Dependencies
```text
# Deep Learning
torch>=2.0.0
torchvision>=0.15.0
transformers>=4.30.0

# Vision & NLP
pillow>=9.0.0
opencv-python>=4.7.0
nltk>=3.8.0
datasets>=2.12.0

# CLIP & Multimodal
clip @ git+https://github.com/openai/CLIP.git
sentence-transformers>=2.2.0

# Evaluation & Metrics
pycocotools>=2.0.0
torchmetrics>=0.11.0

# Utilities
numpy>=1.24.0
matplotlib>=3.7.0
seaborn>=0.12.0
tqdm>=4.65.0
```

### Installation
```bash
pip install torch torchvision transformers pillow opencv-python
pip install nltk datasets sentence-transformers pycocotools torchmetrics
pip install numpy matplotlib seaborn tqdm
pip install git+https://github.com/openai/CLIP.git
```

---

Let's build powerful multimodal AI systems! 🚀

---

## 📊 Mathematical Foundation

### Vision-Language Alignment (CLIP)

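**Contrastive Learning Objective:**

CLIP averages an image-to-text term and a mirrored text-to-image term over the batch:

$$
\mathcal{L}_{\text{contrastive}} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(\text{sim}(v_i, t_i) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(v_i, t_j) / \tau)} + \log \frac{\exp(\text{sim}(v_i, t_i) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(v_j, t_i) / \tau)} \right]
$$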

Where:
- $v_i$ = Visual embedding for image $i$
- $t_i$ = Text embedding for caption $i$
- $\text{sim}(v, t) = \frac{v \cdot t}{\|v\| \|t\|}$ = Cosine similarity
- $\tau$ = Temperature parameter (controls distribution sharpness; CLIP learns it during training)
- $N$ = Batch size

**Intuition:** Match image-text pairs while separating non-matching pairs.
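
A minimal NumPy sketch of this objective may help make the indices concrete. It computes the image-to-text term and the mirrored text-to-image term and averages them, which is how CLIP's training loss is usually described. The function names, the toy embeddings, and the default `tau=0.07` are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, tau=0.07):
    """Symmetric contrastive loss for N matched (image, text) embedding pairs."""
    # L2-normalize so the dot product equals cosine similarity
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (v @ t.T) / tau             # (N, N); row i = image i vs. every caption
    diag = np.arange(len(logits))        # matching pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
images = rng.normal(size=(4, 8))         # toy embeddings (assumed sizes)
aligned_loss = clip_contrastive_loss(images, images)                 # pairs match
shuffled_loss = clip_contrastive_loss(images, images[[1, 2, 3, 0]])  # pairs mismatched
print(f"aligned: {aligned_loss:.4f}  shuffled: {shuffled_loss:.4f}")
```

The aligned batch should score a much lower loss than the shuffled one, because every matching pair sits on the diagonal of the similarity matrix.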

### Visual Question Answering (VQA)

**Attention-based Fusion:**

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

Where:
- $Q$ = Query (from text)
- $K, V$ = Key, Value (from image features)
- $d_k$ = Key dimension
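
The attention formula above can be sketched in a few lines of NumPy. The shapes (5 question tokens, a 7x7 grid of 49 image regions) are hypothetical and chosen only to show which axis the softmax runs over.

```python
import numpy as np

def cross_attention(Q, K, V):
    """Scaled dot-product attention; returns (output, attention weights)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (n_text, n_regions)
    scores = scores - scores.max(axis=-1, keepdims=True)   # stabilize exp
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over image regions
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 16))     # queries from 5 question tokens
K = rng.normal(size=(49, 16))    # keys from 49 image regions
V = rng.normal(size=(49, 32))    # values from the same regions
out, weights = cross_attention(Q, K, V)
print(out.shape, weights.shape)  # (5, 32) (5, 49)
```

Each row of `weights` sums to 1, so every question token receives a weighted average of the region values.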

**Multimodal Fusion:**

$$
h_{\text{multi}} = \text{FFN}([h_{\text{text}}; h_{\text{vision}}; h_{\text{cross}}])
$$

Where:
- $h_{\text{text}}$ = Text-only features
- $h_{\text{vision}}$ = Vision-only features
- $h_{\text{cross}}$ = Cross-modal attention output
- $[;]$ = Concatenation
- $\text{FFN}$ = Feed-forward network

### Loss Functions

**Image Captioning (Autoregressive):**

$$
\mathcal{L}_{\text{caption}} = -\sum_{t=1}^{T} \log P(w_t | w_{<t}, I)
$$

Where:
- $w_t$ = Token at position $t$
- $w_{<t}$ = Previous tokens
- $I$ = Image features
- $T$ = Sequence length
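
As a worked example of the captioning loss, the sketch below sums the negative log-probabilities of the ground-truth tokens. The three-token vocabulary and the probabilities are made up for illustration.

```python
import math

def caption_nll(step_probs, target_ids):
    """Sum of -log P(w_t | w_<t, I) over the caption.

    step_probs[t] is the model's distribution over the vocabulary at step t,
    already conditioned on the image and the previous ground-truth tokens.
    """
    return -sum(math.log(probs[w]) for probs, w in zip(step_probs, target_ids))

# Toy 3-token vocabulary; the numbers are illustrative only
step_probs = [
    [0.50, 0.25, 0.25],   # step 1
    [0.25, 0.50, 0.25],   # step 2
    [0.25, 0.25, 0.50],   # step 3
]
target_ids = [0, 1, 2]
loss = caption_nll(step_probs, target_ids)
print(f"{loss:.4f}")  # 3 * ln(2) = 2.0794
```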

**VQA (Classification):**

$$
\mathcal{L}_{\text{VQA}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c
$$

Where:
- $C$ = Number of answer classes
- $y_c$ = Ground truth (one-hot)
- $\hat{y}_c$ = Predicted probability for class $c$

---

## 🏗️ Multimodal Architecture Components

```mermaid
graph TB
    subgraph "Vision Encoder"
        A[Image Input] --> B[CNN/ViT]
        B --> C[Visual Features]
    end
    
    subgraph "Language Model"
        D[Text Input] --> E[Token Embeddings]
        E --> F[Transformer Decoder]
    end
    
    subgraph "Fusion Layer"
        C --> G[Cross-Attention]
        E --> G
        G --> H[Multimodal Features]
    end
    
    H --> F
    F --> I[Generated Text]
    
    style G fill:#4ecdc4,stroke:#0d7377,stroke-width:2px
    style H fill:#ffe66d,stroke:#ff6b6b,stroke-width:2px
```

### Key Components

1. **Vision Encoder** (e.g., CLIP ViT, ResNet)
   - Extracts visual features from images
   - Pre-trained on large image datasets
   - Output: High-dimensional feature vectors

2. **Language Model** (e.g., GPT, LLaMA)
   - Processes and generates text
   - Pre-trained on large text corpora
   - Output: Token-level predictions

3. **Fusion Mechanism**
   - **Early Fusion**: Concatenate features early
   - **Late Fusion**: Combine predictions at output
   - **Cross-Attention**: Let text attend to image features
   - **Adapter Layers**: Learnable bridges between modalities
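
The difference between early and late fusion is easiest to see side by side. The following NumPy sketch uses random weights and toy feature sizes (all assumptions), so only the data flow is meaningful, not the predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
img_feat = rng.normal(size=(2, 6))    # batch of 2, 6-dim image features (toy sizes)
txt_feat = rng.normal(size=(2, 4))    # batch of 2, 4-dim text features
num_classes = 3

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Early fusion: concatenate features, then apply ONE classifier to the joint vector
W_joint = rng.normal(size=(6 + 4, num_classes))
early_probs = softmax(np.concatenate([img_feat, txt_feat], axis=-1) @ W_joint)

# Late fusion: each modality has its own classifier; average their predictions
W_img = rng.normal(size=(6, num_classes))
W_txt = rng.normal(size=(4, num_classes))
late_probs = 0.5 * (softmax(img_feat @ W_img) + softmax(txt_feat @ W_txt))

print(early_probs.shape, late_probs.shape)  # (2, 3) (2, 3)
```

Early fusion lets the classifier model interactions between modalities, while late fusion keeps the two branches independent until the final averaging step.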

### 📝 What's Happening in This Code?

**Purpose:** Set up the environment and import all necessary libraries for multimodal LLM implementations

**Key Points:**
- **PyTorch & Transformers**: Core deep learning framework and HuggingFace ecosystem for pre-trained models
- **Vision Libraries**: PIL for image handling, torchvision for computer vision operations
- **CLIP**: OpenAI's contrastive vision-language model for zero-shot classification and embeddings
- **Evaluation Tools**: Metrics for assessing captioning and VQA performance (installed with the dependencies above; not imported in this cell)

**Why This Matters:** Proper library setup ensures we have all tools needed for vision-language tasks, from loading pre-trained models to evaluating their performance.

In [None]:
# ===================================================================
# PART 1: LIBRARY IMPORTS & SETUP
# ===================================================================

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as transforms
from torchvision.models import resnet50
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import warnings
warnings.filterwarnings('ignore')

# HuggingFace Transformers
from transformers import (
    CLIPProcessor, CLIPModel, CLIPTokenizer,
    AutoTokenizer, AutoModel,
    VisionEncoderDecoderModel,
    ViTImageProcessor, ViTModel,
    GPT2Tokenizer, GPT2LMHeadModel
)

# Check device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print(f"PyTorch version: {torch.__version__}")

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)

print("\n" + "="*70)
print("MULTIMODAL LLM ENVIRONMENT READY")
print("="*70)

### 📝 What's Happening in This Code?

**Purpose:** Implement CLIP-based zero-shot image classification using contrastive vision-language embeddings

**Key Points:**
- **CLIP Model Loading**: Pre-trained on 400M image-text pairs, learns shared embedding space for vision and language
- **Zero-Shot Classification**: Classify images without task-specific training by comparing image embeddings to text prompt embeddings
- **Cosine Similarity**: Measures alignment between image and text in the shared embedding space (higher = better match)
- **Post-Silicon Application**: Can classify defect types ("scratched die", "good die", "contamination") without labeled training data

**Why This Matters:** Zero-shot classification eliminates the need for large labeled datasets, making it ideal for rare defect types in semiconductor testing where labeled examples are scarce.
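
To illustrate how similarity scores become class probabilities, here is a small standalone sketch that needs no model download. The similarity values are made up, and `logit_scale=100.0` is an assumption that roughly matches the learned scale of released CLIP checkpoints.

```python
import math

def zero_shot_probs(similarities, logit_scale=100.0):
    """Convert cosine similarities into class probabilities with a scaled softmax."""
    logits = [logit_scale * s for s in similarities]
    m = max(logits)                               # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["good die", "scratched die", "contaminated die"]
similarities = [0.21, 0.27, 0.23]   # made-up image-to-prompt cosine similarities
probs = zero_shot_probs(similarities)
best = labels[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```

Small gaps in cosine similarity turn into confident probabilities because of the large scale factor, so it is worth inspecting the raw similarities as well as the softmax output.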

In [None]:
# ===================================================================
# PART 2: CLIP ZERO-SHOT IMAGE CLASSIFIER
# ===================================================================

class CLIPZeroShotClassifier:
    """
    Zero-shot image classification using CLIP embeddings
    """
    def __init__(self, model_name="openai/clip-vit-base-patch32"):
        print(f"Loading CLIP model: {model_name}...")
        self.model = CLIPModel.from_pretrained(model_name).to(device)
        self.processor = CLIPProcessor.from_pretrained(model_name)
        self.model.eval()
        print("✓ CLIP model loaded successfully")
    
    def classify(self, image, class_labels, return_scores=False):
        """
        Classify image using text prompts
        
        Args:
            image: PIL Image or path to image
            class_labels: List of text labels (e.g., ["cat", "dog", "bird"])
            return_scores: If True, return all scores
        
        Returns:
            predicted_label or (predicted_label, scores_dict)
        """
        # Load image if path provided
        if isinstance(image, str):
            image = Image.open(image).convert('RGB')
        
        # Prepare inputs
        inputs = self.processor(
            text=class_labels,
            images=image,
            return_tensors="pt",
            padding=True
        ).to(device)
        
        # Get embeddings
        with torch.no_grad():
            outputs = self.model(**inputs)
            logits_per_image = outputs.logits_per_image  # (1, num_classes)
            probs = logits_per_image.softmax(dim=1)
        
        # Get prediction
        pred_idx = probs.argmax().item()
        predicted_label = class_labels[pred_idx]
        
        if return_scores:
            scores = {label: prob.item() for label, prob in zip(class_labels, probs[0])}
            return predicted_label, scores
        
        return predicted_label

# Initialize classifier
print("\n" + "="*70)
print("CLIP ZERO-SHOT CLASSIFIER")
print("="*70)

clip_classifier = CLIPZeroShotClassifier()

# Example: Classify semiconductor defect types
defect_classes = [
    "a photograph of a good semiconductor die",
    "a photograph of a scratched semiconductor die",
    "a photograph of a contaminated semiconductor die",
    "a photograph of a cracked semiconductor die"
]

print(f"\nDefect classification categories ({len(defect_classes)}):")
for i, label in enumerate(defect_classes, 1):
    print(f"  {i}. {label}")

### 📝 What's Happening in This Code?

**Purpose:** Visualize CLIP's vision-language embedding space and demonstrate similarity computations

**Key Points:**
- **Embedding Extraction**: Separate image and text encoders produce normalized feature vectors in shared 512-dimensional space
- **Cosine Similarity Matrix**: Quantifies alignment between all image-text pairs (ranges from -1 to +1, higher = better match)
- **Heatmap Visualization**: Shows which text descriptions best match which images, revealing CLIP's understanding
- **Cross-Modal Retrieval**: Foundation for image search by text or text search by image

**Why This Matters:** Understanding the embedding space is crucial for debugging model behavior, optimizing prompts, and building retrieval systems for test documentation or defect databases.
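
Cross-modal retrieval reduces to a nearest-neighbour search over normalized embeddings. The sketch below uses hypothetical 3-dimensional vectors in place of CLIP's 512-dimensional ones; `top_k_matches` is an illustrative helper, not a library function.

```python
import numpy as np

def top_k_matches(query_emb, gallery_emb, k=2):
    """Return indices and cosine scores of the k gallery items closest to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    scores = g @ q                        # cosine similarity per gallery item
    order = np.argsort(-scores)[:k]       # highest similarity first
    return order, scores[order]

# Hypothetical image embeddings (one row per stored wafer map)
gallery = np.array([
    [1.0, 0.0, 0.0],   # image 0
    [0.9, 0.1, 0.0],   # image 1
    [0.0, 1.0, 0.0],   # image 2
    [0.0, 0.0, 1.0],   # image 3
])
query = np.array([1.0, 0.2, 0.0])         # embedding of a text query
idx, scores = top_k_matches(query, gallery, k=2)
print(idx.tolist())  # [1, 0]
```

For a large defect database, the same dot product would typically be delegated to a vector index rather than computed against every row.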

In [None]:
# ===================================================================
# PART 3: CLIP EMBEDDING SPACE VISUALIZATION
# ===================================================================

def visualize_clip_embeddings(images, text_descriptions, model, processor):
    """
    Visualize CLIP embedding similarities between images and texts
    
    Args:
        images: List of PIL Images
        text_descriptions: List of text descriptions
        model: CLIP model
        processor: CLIP processor
    """
    # Get image embeddings
    image_inputs = processor(images=images, return_tensors="pt").to(device)
    with torch.no_grad():
        image_features = model.get_image_features(**image_inputs)
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    
    # Get text embeddings
    text_inputs = processor(text=text_descriptions, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        text_features = model.get_text_features(**text_inputs)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    
    # Compute similarity matrix
    similarity = (image_features @ text_features.T).cpu().numpy()
    
    # Visualize
    plt.figure(figsize=(12, 8))
    
    # Heatmap
    plt.subplot(1, 2, 1)
    sns.heatmap(
        similarity,
        annot=True,
        fmt='.3f',
        cmap='RdYlGn',
        xticklabels=[f"Text {i+1}" for i in range(len(text_descriptions))],
        yticklabels=[f"Img {i+1}" for i in range(len(images))],
        cbar_kws={'label': 'Cosine Similarity'}
    )
    plt.title('Image-Text Similarity Matrix\n(CLIP Embeddings)', fontsize=12, fontweight='bold')
    plt.xlabel('Text Descriptions')
    plt.ylabel('Images')
    
    # Bar chart for best matches
    plt.subplot(1, 2, 2)
    best_matches = similarity.max(axis=1)
    colors = ['green' if s > 0.3 else 'orange' if s > 0.2 else 'red' for s in best_matches]
    plt.barh(range(len(images)), best_matches, color=colors, alpha=0.7)
    plt.xlabel('Max Similarity Score')
    plt.ylabel('Image Index')
    plt.title('Best Text Match per Image', fontsize=12, fontweight='bold')
    plt.xlim([0, 1])
    plt.grid(axis='x', alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return similarity

print("\n" + "="*70)
print("CLIP EMBEDDING SPACE ANALYSIS")
print("="*70)
print("\nThis visualization shows:")
print("  • How well CLIP aligns image and text embeddings")
print("  • Cosine similarity scores (higher = better match)")
print("  • Which descriptions best match which images")
print("\nUse case: Retrieve similar defect images by text query")

### 📝 What's Happening in This Code?

**Purpose:** Build a simple image captioning model from scratch using an encoder-decoder architecture

**Key Points:**
- **CNN Encoder**: ResNet extracts visual features from images; frozen pre-trained weights preserve learned representations
- **RNN Decoder**: LSTM generates captions word-by-word, conditioned on image features
- **Attention Mechanism (not implemented below)**: An attention decoder would focus on relevant image regions when generating each word (like reading different parts of a wafer map); this cell instead feeds one pooled image vector to the LSTM as its first input
- **Teacher Forcing**: During training, use ground truth previous words to stabilize learning

**Why This Matters:** Image captioning is foundational for automated test report generation in post-silicon validation: converting wafer maps and test plots into natural language descriptions for engineers and management.
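
Teacher forcing comes down to shifting the caption by one position. The sketch below shows the generic input/target split with a hypothetical five-word vocabulary; a decoder that prepends the image feature as an extra first input sees one additional step on top of this.

```python
def teacher_forcing_pairs(caption_ids):
    """Split one tokenized caption into decoder inputs and next-token targets."""
    inputs = caption_ids[:-1]    # ground-truth tokens the decoder sees
    targets = caption_ids[1:]    # token the decoder should predict at each step
    return inputs, targets

# Hypothetical vocabulary for illustration
vocab = {"<start>": 0, "<end>": 1, "ring": 2, "defect": 3, "pattern": 4}
caption = [vocab[w] for w in ["<start>", "ring", "defect", "pattern", "<end>"]]
inputs, targets = teacher_forcing_pairs(caption)
print(inputs)   # [0, 2, 3, 4]
print(targets)  # [2, 3, 4, 1]
```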

In [None]:
# ===================================================================
# PART 4: IMAGE CAPTIONING MODEL (ENCODER-DECODER)
# ===================================================================

class ImageEncoder(nn.Module):
    """
    CNN-based image encoder using pre-trained ResNet
    """
    def __init__(self, embed_size=256):
        super(ImageEncoder, self).__init__()
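        # `pretrained=True` is deprecated in torchvision >= 0.13; this loads the same ImageNet weights
        resnet = resnet50(weights="IMAGENET1K_V1")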
        # Remove final FC layer
        modules = list(resnet.children())[:-1]
        self.resnet = nn.Sequential(*modules)
        # Freeze ResNet parameters
        for param in self.resnet.parameters():
            param.requires_grad = False
        # Linear layer to project ResNet features
        self.linear = nn.Linear(resnet.fc.in_features, embed_size)
        self.bn = nn.BatchNorm1d(embed_size)
    
    def forward(self, images):
        """
        Extract image features
        
        Args:
            images: (batch_size, 3, 224, 224)
        Returns:
            features: (batch_size, embed_size)
        """
        with torch.no_grad():
            features = self.resnet(images)  # (batch, 2048, 1, 1)
        features = features.reshape(features.size(0), -1)  # (batch, 2048)
        features = self.bn(self.linear(features))  # (batch, embed_size)
        return features


class CaptionDecoder(nn.Module):
    """
    LSTM-based caption decoder (image feature is fed as the first input step; no attention)
    """
    def __init__(self, embed_size=256, hidden_size=512, vocab_size=10000, num_layers=1):
        super(CaptionDecoder, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)
        self.dropout = nn.Dropout(0.5)
    
    def forward(self, features, captions):
        """
        Generate caption probabilities
        
        Args:
            features: (batch_size, embed_size) from encoder
            captions: (batch_size, max_length) token IDs
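        Returns:
            outputs: (batch_size, max_length + 1, vocab_size); the extra step
                comes from the image feature prepended to the sequence. For
                training, pass captions[:, :-1] so outputs align with captions.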
        """
        embeddings = self.embed(captions)  # (batch, max_len, embed_size)
        embeddings = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        hiddens, _ = self.lstm(embeddings)
        outputs = self.linear(self.dropout(hiddens))
        return outputs

print("\n" + "="*70)
print("IMAGE CAPTIONING MODEL ARCHITECTURE")
print("="*70)

# Initialize models
embed_size = 256
hidden_size = 512
vocab_size = 10000

encoder = ImageEncoder(embed_size).to(device)
decoder = CaptionDecoder(embed_size, hidden_size, vocab_size).to(device)

print(f"\nEncoder:")
print(f"  • Base: ResNet-50 (pre-trained, frozen)")
print(f"  • Output: {embed_size}-dim image embeddings")
print(f"  • Trainable parameters: {sum(p.numel() for p in encoder.parameters() if p.requires_grad):,}")

print(f"\nDecoder:")
# `num_layers` is not defined at module level; read it from the LSTM instead
print(f"  • Type: LSTM with {decoder.lstm.num_layers} layer(s)")
print(f"  • Hidden size: {hidden_size}")
print(f"  • Vocabulary: {vocab_size:,} tokens")
print(f"  • Parameters: {sum(p.numel() for p in decoder.parameters()):,}")

### 📝 What's Happening in This Code?

**Purpose:** Implement caption generation with beam search to find higher-probability captions than greedy decoding

**Key Points:**
- **Greedy Decoding**: Select most probable word at each step (fast but may miss better overall sequences)
- **Beam Search**: Maintain top-k hypotheses at each step, exploring multiple caption possibilities simultaneously
- **Start/End Tokens**: Special tokens mark caption boundaries (`<start>` and `<end>`)
- **Temperature Sampling (not implemented below)**: An alternative that controls randomness in word selection (lower = more conservative, higher = more creative)

**Why This Matters:** Beam search often generates better captions than greedy decoding for test reports and failure analysis: exploring multiple phrasings helps find a more accurate and informative description of defects or test patterns.
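
A toy example makes the gap between greedy decoding and beam search visible without any trained model. The next-token table below is invented so that the most probable first token does not lie on the most probable two-token sequence; greedy decoding is simulated as beam search with width 1.

```python
import math

# Toy next-token model: maps a prefix (tuple of tokens) to {token: probability}.
# The probabilities are made up so that the best first token is not on the best path.
MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.4, "y": 0.3, "z": 0.3},
    ("B",): {"x": 0.9, "y": 0.05, "z": 0.05},
}

def beam_search(model, steps, beam_width):
    """Return (sequence, log-probability) of the best sequence found."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (prefix + (tok,), logp + math.log(p))
            for prefix, logp in beams
            for tok, p in model[prefix].items()
        ]
        # Keep only the beam_width highest-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]

greedy_seq, greedy_logp = beam_search(MODEL, steps=2, beam_width=1)  # greedy = width 1
beam_seq, beam_logp = beam_search(MODEL, steps=2, beam_width=2)
print(greedy_seq, round(math.exp(greedy_logp), 2))  # ('A', 'x') 0.24
print(beam_seq, round(math.exp(beam_logp), 2))      # ('B', 'x') 0.36
```

Greedy decoding commits to `A` and ends at probability 0.24, while a beam of width 2 keeps `B` alive long enough to find the 0.36 sequence.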

In [None]:
# ===================================================================
# PART 5: CAPTION GENERATION WITH BEAM SEARCH
# ===================================================================

def generate_caption(encoder, decoder, image, vocab, max_length=20, beam_width=3):
    """
    Generate caption using beam search
    
    Args:
        encoder: Image encoder model
        decoder: Caption decoder model
        image: Input image tensor (1, 3, H, W)
        vocab: Dict mapping token string -> token ID (must contain '<start>' and '<end>')
        max_length: Maximum caption length
        beam_width: Number of beams to maintain
    
    Returns:
        caption: Generated caption string
    """
    encoder.eval()
    decoder.eval()
    
    # Extract image features
    with torch.no_grad():
        features = encoder(image.to(device))  # (1, embed_size)
    
    # Initialize beams: (sequence, score)
    start_token = vocab['<start>']
    end_token = vocab['<end>']
    
    beams = [([start_token], 0.0)]  # (sequence, score)
    completed = []
    
    for _ in range(max_length):
        candidates = []
        
        for seq, score in beams:
            if seq[-1] == end_token:
                completed.append((seq, score))
                continue
            
            # Prepare input
            input_seq = torch.tensor([seq], dtype=torch.long, device=device)
            
            # Get predictions
            with torch.no_grad():
                outputs = decoder(features, input_seq)
                logits = outputs[0, -1, :]  # Last timestep
                # log_softmax avoids log(0) = -inf when a probability underflows
                log_probs = F.log_softmax(logits, dim=0)
            
            # Get top-k candidates
            topk_log_probs, topk_indices = torch.topk(log_probs, beam_width)
            
            for log_prob, idx in zip(topk_log_probs, topk_indices):
                new_seq = seq + [idx.item()]
                new_score = score - log_prob.item()  # Cumulative negative log-likelihood
                candidates.append((new_seq, new_score))
        
        # Select top beams
        candidates = sorted(candidates, key=lambda x: x[1])[:beam_width]
        beams = candidates
        
        # Early stopping if all beams completed
        if len(completed) >= beam_width:
            break
    
    # Add remaining beams to completed
    completed.extend(beams)
    
    # Select best caption
    best_seq, _ = min(completed, key=lambda x: x[1] / len(x[0]))  # Normalize by length
    
    # Convert to words
    id_to_token = {v: k for k, v in vocab.items()}
    caption = ' '.join([id_to_token.get(idx, '<unk>') for idx in best_seq 
                       if idx not in [start_token, end_token]])
    
    return caption

print("\n" + "="*70)
print("CAPTION GENERATION WITH BEAM SEARCH")
print("="*70)
print("\nBeam search algorithm:")
print("  1. Start with <start> token")
print("  2. At each step, expand top-k hypotheses")
print("  3. Keep top-k sequences based on cumulative probability")
print("  4. Stop when <end> token generated or max length reached")
print("\nAdvantages over greedy decoding:")
print("  ✓ Explores multiple caption possibilities")
print("  ✓ Often finds higher-probability sequences (approximate, not a global optimum)")
print("  ✓ Less prone to committing early to a poor first word")

### 📝 What's Happening in This Code?

**Purpose:** Use pre-trained HuggingFace models for production-ready image captioning

**Key Points:**
- **Vision Encoder Decoder**: Combines ViT (Vision Transformer) image encoder with GPT-2 text decoder
- **Pre-trained Weights**: Checkpoint already fine-tuned on image-caption pairs (COCO Captions, >100K images)
- **Automatic Tokenization**: Handles text preprocessing, vocabulary mapping, and special tokens automatically
- **Inference Pipeline**: Simple API for generating captions from images without manual preprocessing

**Why This Matters:** Production systems need reliable, well-tested models. HuggingFace provides pre-trained models that can be fine-tuned on domain-specific data (like semiconductor test images) with minimal code.

In [None]:
# ===================================================================
# PART 6: PRODUCTION IMAGE CAPTIONING (HUGGINGFACE)
# ===================================================================

class ProductionImageCaptioner:
    """
    Production-ready image captioning using HuggingFace models
    """
    def __init__(self, model_name="nlpconnect/vit-gpt2-image-captioning"):
        print(f"Loading model: {model_name}...")
        self.model = VisionEncoderDecoderModel.from_pretrained(model_name).to(device)
        self.feature_extractor = ViTImageProcessor.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model.eval()
        print("✓ Model loaded successfully")
    
    def caption_image(self, image, max_length=50, num_beams=4):
        """
        Generate caption for image
        
        Args:
            image: PIL Image or path
            max_length: Maximum caption length
            num_beams: Beam search width
        
        Returns:
            caption: Generated caption string
        """
        # Load image if path
        if isinstance(image, str):
            image = Image.open(image).convert('RGB')
        
        # Preprocess
        pixel_values = self.feature_extractor(
            images=image,
            return_tensors="pt"
        ).pixel_values.to(device)
        
        # Generate caption
        with torch.no_grad():
            output_ids = self.model.generate(
                pixel_values,
                max_length=max_length,
                num_beams=num_beams,
                early_stopping=True
            )
        
        # Decode
        caption = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        return caption
    
    def batch_caption(self, images, max_length=50, num_beams=4):
        """
        Caption multiple images in batch
        
        Args:
            images: List of PIL Images
            max_length: Maximum caption length
            num_beams: Beam search width
        
        Returns:
            captions: List of caption strings
        """
        # Preprocess batch
        pixel_values = self.feature_extractor(
            images=images,
            return_tensors="pt"
        ).pixel_values.to(device)
        
        # Generate captions
        with torch.no_grad():
            output_ids = self.model.generate(
                pixel_values,
                max_length=max_length,
                num_beams=num_beams,
                early_stopping=True
            )
        
        # Decode
        captions = [self.tokenizer.decode(ids, skip_special_tokens=True) 
                   for ids in output_ids]
        return captions

print("\n" + "="*70)
print("PRODUCTION IMAGE CAPTIONING SYSTEM")
print("="*70)

# Initialize captioner
captioner = ProductionImageCaptioner()

print("\nModel details:")
print("  • Vision Encoder: Vision Transformer (ViT)")
print("  • Text Decoder: GPT-2")
print("  • Training data: COCO Captions (>100K images)")
print("  • Inference: Beam search for quality captions")
print("\nReady for:")
print("  ✓ Single image captioning")
print("  ✓ Batch processing")
print("  ✓ Fine-tuning on custom datasets")

### 📝 What's Happening in This Code?

**Purpose:** Implement a Visual Question Answering (VQA) model that answers questions about images

**Key Points:**
- **Multimodal Fusion**: Combines image features (from CNN/ViT) with question embeddings (from text encoder)
- **Cross-Modal Attention**: Question attends to relevant image regions to find the answer
- **Classification Head**: Predicts answer from fixed vocabulary (common answers like yes/no, numbers, objects)
- **Pre-training Strategy**: Model learns image-text alignment before fine-tuning on the VQA task; the classification head defined below is untrained, so the demo scores answer candidates with CLIP similarity instead

**Why This Matters:** VQA enables automated analysis of test data: engineers can ask "How many dies failed?" or "Is there a spatial pattern?" about wafer maps, getting instant answers without manual inspection.

In [None]:
# ===================================================================
# PART 7: VISUAL QUESTION ANSWERING (VQA) MODEL
# ===================================================================

class VQAModel(nn.Module):
    """
    Visual Question Answering model with cross-modal attention
    """
    def __init__(self, vision_dim=2048, text_dim=768, hidden_dim=512, num_answers=3000):
        super(VQAModel, self).__init__()
        
        # Vision encoder (ResNet features)
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        
        # Text encoder projection
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        
        # Cross-modal attention
        self.attention = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        
        # Fusion and classification
        self.fusion = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, num_answers)
        )
    
    def forward(self, image_features, text_features):
        """
        Args:
            image_features: (batch, vision_dim)
            text_features: (batch, text_dim)
        Returns:
            logits: (batch, num_answers)
        """
        # Project to common space
        v = self.vision_proj(image_features)  # (batch, hidden)
        t = self.text_proj(text_features)     # (batch, hidden)
        
        # Add sequence dimension for attention
        v = v.unsqueeze(1)  # (batch, 1, hidden)
        t = t.unsqueeze(1)  # (batch, 1, hidden)
        
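        # Cross-modal attention (text attends to image).
        # Note: with a single pooled image vector there is only one key, so the
        # softmax weight is always 1. Pass region/patch features shaped
        # (batch, num_regions, hidden) for the attention to be selective.
        attn_out, _ = self.attention(t, v, v)  # (batch, 1, hidden)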
        attn_out = attn_out.squeeze(1)  # (batch, hidden)
        
        # Fuse features
        combined = torch.cat([attn_out, t.squeeze(1)], dim=1)  # (batch, hidden*2)
        
        # Classify
        logits = self.fusion(combined)  # (batch, num_answers)
        
        return logits


class SimpleVQASystem:
    """
    Complete VQA system with vision and text encoders
    """
    def __init__(self):
        print("Initializing VQA system...")
        
        # Load pre-trained encoders
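        # `pretrained=True` is deprecated in torchvision >= 0.13; this loads the same ImageNet weights
        self.vision_encoder = resnet50(weights="IMAGENET1K_V1").to(device)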
        self.vision_encoder.fc = nn.Identity()  # Remove classification head
        self.vision_encoder.eval()
        
        # For text: using simple CLIP text encoder
        self.clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
        self.clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
        
        # VQA model
        self.vqa_model = VQAModel(vision_dim=2048, text_dim=512).to(device)
        
        print("✓ VQA system initialized")
    
    def answer_question(self, image, question, answer_candidates):
        """
        Answer question about image
        
        Args:
            image: PIL Image
            question: Question string
            answer_candidates: List of possible answers
        
        Returns:
            answer: Most likely answer
        """
        self.vision_encoder.eval()
        self.vqa_model.eval()
        
        # Get image features (using ResNet)
        transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ])
        img_tensor = transform(image).unsqueeze(0).to(device)
        
        with torch.no_grad():
            # ResNet features are the input for the VQA head once it is trained;
            # the zero-shot scoring below does not use them.
            img_features = self.vision_encoder(img_tensor)  # (1, 2048)
        
        # Score answer candidates with CLIP image-text similarity.
        # (In production, would use the trained VQA classification head.)
        # Pairing each candidate with the question lets the image influence the
        # answer; comparing question text to answer text alone would ignore it.
        prompts = [f"{question} {answer}" for answer in answer_candidates]
        clip_inputs = self.clip_processor(
            text=prompts, images=image, return_tensors="pt", padding=True
        ).to(device)
        with torch.no_grad():
            outputs = self.clip_model(**clip_inputs)
            similarities = outputs.logits_per_image.squeeze(0)  # (num_candidates,)
        
        # Get best answer
        best_idx = similarities.argmax().item()
        return answer_candidates[best_idx]

print("\n" + "="*70)
print("VISUAL QUESTION ANSWERING SYSTEM")
print("="*70)

# Initialize VQA system
vqa_system = SimpleVQASystem()

print("\nVQA Components:")
print("  • Vision: ResNet-50 (extracts image features)")
print("  • Text: CLIP text encoder (processes questions)")
print("  • Fusion: Cross-modal attention + classification")
print("\nExample questions:")
print("  • 'How many objects are in the image?'")
print("  • 'What color is the object?'")
print("  • 'Is there a defect visible?'")

### 📝 What's Happening in This Code?

**Purpose:** Demonstrate LLaVA-style visual instruction following: a visual assistant that can hold conversations about images

**Key Points:**
- **Visual Adapter**: Lightweight projection layer connects a frozen vision encoder to the language model
- **Instruction Tuning**: Model is trained to follow diverse instructions such as "describe this image" or "what's unusual here?"
- **Two-Stage Training**: (1) Pre-train the adapter on image-caption pairs, (2) fine-tune on instruction-following data (LLaVA also updates the language model in this stage; this demo keeps it frozen)
- **Efficient Design**: Only the adapter is trainable here, while the CLIP encoder and GPT-2 stay frozen. At LLaVA scale the same idea pairs a projector of millions to tens of millions of parameters with a ~7B language model

**Why This Matters:** Visual assistants enable natural language interaction with test data: engineers can ask follow-up questions, request specific details, or get explanations in plain English rather than interpreting raw data.
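
To make the trainable share concrete, here is a back-of-the-envelope sketch that counts the parameters of a two-layer projection with the `VisualAdapter` default sizes (768 → 512 → 768). The helper name `mlp_param_count` is illustrative and not part of any library.

```python
def mlp_param_count(layer_dims):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    for fan_in, fan_out in zip(layer_dims[:-1], layer_dims[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total

# Dimensions assumed to match the VisualAdapter defaults: 768 -> 512 -> 768
adapter_params = mlp_param_count([768, 512, 768])
print(f"Adapter parameters: {adapter_params:,}")  # 787,712
```

At this size the adapter stays under a million parameters, a small fraction of the frozen vision encoder and language model it connects.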

In [None]:
# ===================================================================
# PART 8: LLaVA-STYLE VISUAL ASSISTANT (SIMPLIFIED)
# ===================================================================

class VisualAdapter(nn.Module):
    """
    Lightweight adapter to connect vision encoder to language model
    Similar to LLaVA's projection layer
    """
    def __init__(self, vision_dim=768, lm_dim=768, hidden_dim=512):
        super(VisualAdapter, self).__init__()
        self.projection = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, lm_dim)
        )
    
    def forward(self, vision_features):
        """
        Project vision features to language model space
        
        Args:
            vision_features: (batch, seq_len, vision_dim)
        Returns:
            projected: (batch, seq_len, lm_dim)
        """
        return self.projection(vision_features)


class LLaVAStyleAssistant:
    """
    Visual instruction-following assistant (simplified LLaVA architecture)
    """
    def __init__(self):
        print("Building visual assistant...")
        
        # Vision encoder (CLIP ViT)
        self.vision_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
        self.vision_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
        
        # Freeze vision encoder
        for param in self.vision_model.parameters():
            param.requires_grad = False
        
        # Visual adapter (trainable)
        # vision_dim must match the ViT hidden size (768 for ViT-B/32), not the 512-d CLIP projection
        self.adapter = VisualAdapter(vision_dim=768, lm_dim=768).to(device)
        
        # Language model (GPT-2 small for demo)
        self.lm = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
        self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        self.tokenizer.pad_token = self.tokenizer.eos_token
        
        # Freeze language model (in practice, might fine-tune)
        for param in self.lm.parameters():
            param.requires_grad = False
        
        print("✓ Visual assistant ready")
        print(f"  • Trainable params: {sum(p.numel() for p in self.adapter.parameters()):,}")
        print(f"  • Total params: {sum(p.numel() for p in self.parameters()):,}")
    
    def parameters(self):
        """Return all parameters: trainable adapter plus frozen vision encoder and LM"""
        return (list(self.adapter.parameters())
                + list(self.vision_model.parameters())
                + list(self.lm.parameters()))
    
    def generate_response(self, image, instruction, max_length=50):
        """
        Generate response to instruction about image
        
        Args:
            image: PIL Image
            instruction: Instruction text
            max_length: Maximum response length
        
        Returns:
            response: Generated text
        """
        self.vision_model.eval()
        self.adapter.eval()
        self.lm.eval()
        
        # Get image features
        img_inputs = self.vision_processor(images=image, return_tensors="pt").to(device)
        with torch.no_grad():
            vision_outputs = self.vision_model.vision_model(**img_inputs)
            vision_features = vision_outputs.last_hidden_state  # (1, 50, 768)
        
        # Project to LM space
        with torch.no_grad():
            adapted_features = self.adapter(vision_features)  # (1, 50, 768)
        
        # Prepare text prompt
        prompt = f"<image> {instruction}\nAssistant:"
        inputs = self.tokenizer(prompt, return_tensors="pt").to(device)
        
        # Generate (simplified: adapted_features is not fed to the LM here; a full
        # implementation would splice the projected visual tokens into the input embeddings)
        with torch.no_grad():
            output_ids = self.lm.generate(
                inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
                max_new_tokens=max_length,  # counts generated tokens only, not the prompt
                num_beams=3,
                early_stopping=True,
                pad_token_id=self.tokenizer.eos_token_id
            )
        
        # Decode
        response = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        response = response.split("Assistant:")[-1].strip()
        
        return response

print("\n" + "="*70)
print("LLaVA-STYLE VISUAL ASSISTANT")
print("="*70)

# Initialize assistant
visual_assistant = LLaVAStyleAssistant()

print("\nArchitecture:")
print("  1. Vision Encoder: CLIP ViT (frozen)")
print("  2. Visual Adapter: Projection layer (trainable)")
print("  3. Language Model: GPT-2 (frozen)")
print("\nTraining strategy:")
print("  • Stage 1: Train adapter on image-caption pairs")
print("  • Stage 2: Instruction-tune on visual QA data")
print("\nCapabilities:")
print("  ✓ Describe images in detail")
print("  ✓ Answer questions about image content")
print("  ✓ Follow multi-step instructions")
print("  ✓ Explain visual reasoning")

### 📝 What's Happening in This Code?

**Purpose:** Create synthetic wafer map and test data visualization for demonstrating multimodal LLM applications in semiconductor testing

**Key Points:**
- **Wafer Map Generation**: Creates synthetic spatial failure patterns (edge failures, clusters, random defects, scratches)
- **Parametric Data Simulation**: Generates test parameters (voltage, current, frequency) with realistic correlations
- **Visualization**: Heatmaps and scatter plots that mimic actual ATE test reports
- **Ground Truth Labels**: Known defect types and patterns for testing VQA and captioning models

**Why This Matters:** Real semiconductor test data is proprietary and sensitive. Synthetic data allows us to demonstrate multimodal LLM capabilities on realistic scenarios without exposing confidential manufacturing information.

In [None]:
# ===================================================================
# PART 9: POST-SILICON VALIDATION - WAFER MAP GENERATOR
# ===================================================================

def generate_wafer_map(wafer_size=30, defect_type='edge_failure', severity=0.3):
    """
    Generate synthetic wafer map with defect patterns
    
    Args:
        wafer_size: Diameter in dies
        defect_type: 'edge_failure', 'cluster', 'random', 'scratch'
        severity: Defect density (0-1)
    
    Returns:
        wafer_map: 2D array (pass=1, fail=0)
        description: Text description of pattern
    """
    # Create circular wafer
    center = wafer_size // 2
    y, x = np.ogrid[-center:wafer_size-center, -center:wafer_size-center]
    mask = x**2 + y**2 <= center**2
    wafer = np.ones((wafer_size, wafer_size))
    wafer[~mask] = np.nan
    
    # Apply defect pattern
    if defect_type == 'edge_failure':
        # Each die in the outer ring fails independently with probability `severity`
        distance = np.sqrt(x**2 + y**2)
        edge_threshold = center * 0.8
        edge_zone = (distance > edge_threshold) & mask
        fails = edge_zone & (np.random.rand(wafer_size, wafer_size) < severity)
        wafer[fails] = 0
        description = f"Edge failure pattern, {np.sum(fails)} dies affected at wafer periphery"
        
    elif defect_type == 'cluster':
        # Clustered failures: dies near each center fail with probability `severity`
        cluster_centers = np.random.randint(0, wafer_size, (3, 2))
        for cx, cy in cluster_centers:
            dist = np.sqrt((x + center - cx)**2 + (y + center - cy)**2)
            cluster = (dist < 5) & mask
            wafer[cluster & (np.random.rand(wafer_size, wafer_size) < severity)] = 0
        description = f"Clustered defects at {len(cluster_centers)} locations, indicating localized contamination"
        
    elif defect_type == 'random':
        # Random failures: a fraction `severity` of all dies, chosen uniformly
        num_fails = int(np.sum(mask) * severity)
        valid_indices = np.where(mask)
        fail_idx = np.random.choice(len(valid_indices[0]), num_fails, replace=False)
        wafer[valid_indices[0][fail_idx], valid_indices[1][fail_idx]] = 0
        description = f"Random failures: {num_fails} dies, suggests process variation or handling issues"
        
    elif defect_type == 'scratch':
        # Linear scratch through the wafer center (severity is not used for this pattern)
        angle = np.random.rand() * np.pi
        scratch_line = np.abs(np.sin(angle) * x - np.cos(angle) * y) < 1.5
        wafer[scratch_line & mask] = 0
        description = f"Linear scratch pattern at {np.degrees(angle):.1f}° angle, mechanical damage during handling"
    
    else:
        raise ValueError(f"Unknown defect_type: {defect_type!r}")
    
    return wafer, description


def plot_wafer_map(wafer, title="Wafer Map", save_path=None):
    """
    Visualize wafer map with failures
    """
    plt.figure(figsize=(10, 8))
    
    # Custom colormap (0 = fail: red, 1 = pass: light green)
    from matplotlib.colors import ListedColormap, BoundaryNorm
    cmap = ListedColormap(['red', 'lightgreen'])
    bounds = [-0.5, 0.5, 1.5]
    norm = BoundaryNorm(bounds, cmap.N)
    
    plt.imshow(wafer, cmap=cmap, norm=norm, interpolation='nearest')
    plt.colorbar(ticks=[0, 1], label='Status', shrink=0.8)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.xlabel('Die X Position')
    plt.ylabel('Die Y Position')
    
    # Add statistics
    valid_dies = ~np.isnan(wafer)
    total_dies = np.sum(valid_dies)
    failed_dies = np.sum(wafer[valid_dies] == 0)
    yield_pct = (1 - failed_dies/total_dies) * 100
    
    stats_text = f"Total Dies: {total_dies}\nFailed: {failed_dies}\nYield: {yield_pct:.1f}%"
    plt.text(0.02, 0.98, stats_text, transform=plt.gca().transAxes,
             bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8),
             verticalalignment='top', fontsize=10)
    
    plt.tight_layout()
    
    if save_path:
        plt.savefig(save_path, dpi=150, bbox_inches='tight')
    
    return plt.gcf()

print("\n" + "="*70)
print("POST-SILICON VALIDATION: WAFER MAP GENERATOR")
print("="*70)

# Generate example wafer maps
print("\nGenerating synthetic wafer maps...")

wafer1, desc1 = generate_wafer_map(wafer_size=30, defect_type='edge_failure', severity=0.4)
print(f"\n1. Edge Failure: {desc1}")

wafer2, desc2 = generate_wafer_map(wafer_size=30, defect_type='cluster', severity=0.3)
print(f"\n2. Cluster: {desc2}")

print("\nThese wafer maps can be analyzed by multimodal LLMs for:")
print("  • Automated defect classification")
print("  • Root cause analysis text generation")
print("  • Visual question answering ('Where are the failures?')")
print("  • Yield prediction and reporting")

### 📝 What's Happening in This Code?

**Purpose:** Apply multimodal LLMs to analyze synthetic wafer maps, demonstrating automated defect analysis and reporting

**Key Points:**
- **CLIP Zero-Shot Classification**: Identify defect types from wafer map images without task-specific training
- **Image Captioning**: Generate natural language descriptions of failure patterns automatically
- **VQA Application**: Answer specific questions about defect locations, counts, and patterns
- **Automated Reporting**: Transform visual test data into executive summaries for management

**Why This Matters:** This pipeline sketches end-to-end automation of post-silicon validation reporting, from raw wafer maps to natural language insights, with the aim of cutting manual analysis time from hours to seconds. Accuracy depends on how well the pre-trained models transfer to wafer maps, which the evaluation metrics in Part 11 help check.
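
The zero-shot classification step boils down to turning image-text similarities into a probability distribution over prompts. Below is a minimal sketch with made-up similarity values. The scale of 100 mirrors the logit scale commonly quoted for CLIP and is an assumption here, not a value read from the model used in this notebook.

```python
import math

def zero_shot_probs(similarities, scale=100.0):
    """Softmax over scaled cosine similarities (one value per text prompt)."""
    logits = [scale * s for s in similarities]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["edge failures", "clustered defects", "random failures", "scratch defects"]
sims = [0.27, 0.24, 0.22, 0.23]  # hypothetical cosine similarities
probs = zero_shot_probs(sims)
best = labels[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```

Small gaps in cosine similarity become large gaps in probability after scaling, which is one reason CLIP confidence scores react strongly to prompt wording.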

In [None]:
# ===================================================================
# PART 10: WAFER MAP ANALYSIS WITH MULTIMODAL LLMS
# ===================================================================

def analyze_wafer_with_multimodal_llm(wafer, description, clip_model, captioner):
    """
    Complete analysis pipeline: classification, captioning, and yield statistics
    
    Args:
        wafer: Wafer map array
        description: Ground truth description
        clip_model: CLIP classifier
        captioner: Image captioning model
    
    Returns:
        analysis: Dictionary with results
        wafer_image: PIL Image of the rendered wafer map
    """
    # Generate wafer map image
    fig = plot_wafer_map(wafer, title="Test Wafer")
    
    # Convert matplotlib figure to PIL Image
    # (buffer_rgba replaces tostring_rgb, which newer Matplotlib releases no longer provide)
    fig.canvas.draw()
    img_array = np.asarray(fig.canvas.buffer_rgba())
    wafer_image = Image.fromarray(img_array).convert("RGB")
    plt.close(fig)
    
    # 1. Defect Classification (CLIP zero-shot)
    defect_classes = [
        "a wafer map showing edge failures",
        "a wafer map showing clustered defects",
        "a wafer map showing random failures",
        "a wafer map showing scratch defects"
    ]
    
    predicted_class, scores = clip_model.classify(
        wafer_image, 
        defect_classes, 
        return_scores=True
    )
    
    # 2. Caption Generation
    generated_caption = captioner.caption_image(wafer_image, max_length=50, num_beams=4)
    
    # 3. Calculate statistics
    valid_dies = ~np.isnan(wafer)
    total_dies = np.sum(valid_dies)
    failed_dies = np.sum(wafer[valid_dies] == 0)
    yield_pct = (1 - failed_dies/total_dies) * 100
    
    # Compile analysis
    analysis = {
        'classification': predicted_class,
        'confidence_scores': scores,
        'generated_caption': generated_caption,
        'ground_truth': description,
        'statistics': {
            'total_dies': int(total_dies),
            'failed_dies': int(failed_dies),
            'yield_percent': round(yield_pct, 2)
        }
    }
    
    return analysis, wafer_image


print("\n" + "="*70)
print("MULTIMODAL LLM WAFER ANALYSIS PIPELINE")
print("="*70)

# Generate test wafers
print("\n📊 Generating test wafers...")
wafers_to_test = [
    ('edge_failure', 0.4),
    ('cluster', 0.3),
    ('random', 0.2),
    ('scratch', 0.35)
]

analyses = []

for defect_type, severity in wafers_to_test:
    print(f"\n{'='*70}")
    print(f"Analyzing: {defect_type.upper()} (severity={severity})")
    print('='*70)
    
    # Generate wafer
    wafer, description = generate_wafer_map(
        wafer_size=30, 
        defect_type=defect_type, 
        severity=severity
    )
    
    # Analyze with multimodal LLM
    analysis, wafer_img = analyze_wafer_with_multimodal_llm(
        wafer, description, clip_classifier, captioner
    )
    
    # Display results
    print(f"\n🎯 Classification: {analysis['classification']}")
    print(f"\n📊 Confidence Scores:")
    for label, score in analysis['confidence_scores'].items():
        print(f"   {label}: {score:.3f}")
    
    print(f"\n📝 Generated Caption: {analysis['generated_caption']}")
    print(f"\n✓ Ground Truth: {analysis['ground_truth']}")
    
    print(f"\n📈 Statistics:")
    stats = analysis['statistics']
    print(f"   Total Dies: {stats['total_dies']}")
    print(f"   Failed Dies: {stats['failed_dies']}")
    print(f"   Yield: {stats['yield_percent']}%")
    
    analyses.append((defect_type, analysis))

print(f"\n{'='*70}")
print("✅ Analysis Complete!")
print(f"{'='*70}")
print(f"\nProcessed {len(analyses)} wafer maps")
print("Multimodal LLM capabilities demonstrated:")
print("  ✓ Zero-shot defect classification")
print("  ✓ Automated caption generation")
print("  ✓ Statistical analysis")
print("  ✓ Natural language reporting")

### 📝 What's Happening in This Code?

**Purpose:** Evaluate multimodal LLM performance using standard metrics for captioning and VQA tasks

**Key Points:**
- **BLEU Score**: Measures n-gram overlap between generated and reference captions (higher = better word-level match)
- **ROUGE Score**: Evaluates recall of reference words in generated text (focuses on completeness)
- **CIDEr Score**: Consensus-based metric that uses TF-IDF to weight rare n-grams higher (better for descriptive quality; not computed in the code below)
- **METEOR**: Considers synonyms and stemming (captures more semantic similarity than BLEU; not computed in the code below)

**Why This Matters:** Objective metrics are essential for comparing models, tracking improvements during fine-tuning, and validating that automated reports meet quality standards before deployment in production environments.
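
ROUGE comes in several variants. ROUGE-L scores the longest common subsequence (LCS) between reference and candidate, so it rewards words that appear in the same order even when they are not adjacent. The sketch below assumes whitespace tokenization and a plain F1 (the original ROUGE-L definition uses a recall-weighted F-measure); `lcs_length` and `rouge_l` are illustrative helpers, not library functions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if tok_a == tok_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, candidate):
    """ROUGE-L precision, recall, and F1 for two whitespace-tokenized strings."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    lcs = lcs_length(ref, cand)
    p, r = lcs / len(cand), lcs / len(ref)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "f1": f1}

scores = rouge_l("edge failure pattern at wafer periphery",
                 "edge failure at the wafer edge")
print(scores)
```

Here the common subsequence is "edge failure at wafer" (4 tokens out of 6 on each side), so precision and recall are both 4/6.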

In [None]:
# ===================================================================
# PART 11: EVALUATION METRICS FOR MULTIMODAL LLMS
# ===================================================================

def compute_bleu_score(reference, candidate):
    """
    Compute BLEU score (n-gram overlap)
    
    Args:
        reference: List of reference sentences
        candidate: Generated sentence
    
    Returns:
        bleu: BLEU-4 score (0-1)
    """
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    
    # Tokenize
    reference_tokens = [ref.lower().split() for ref in reference]
    candidate_tokens = candidate.lower().split()
    
    # Compute BLEU with smoothing
    smooth = SmoothingFunction()
    bleu = sentence_bleu(
        reference_tokens, 
        candidate_tokens,
        smoothing_function=smooth.method1
    )
    
    return bleu


def compute_rouge_score(reference, candidate):
    """
    Compute ROUGE-1 scores (unigram overlap; ROUGE is recall-oriented)
    
    Returns:
        rouge_scores: Dict with ROUGE-1 precision, recall, and F1
    """
    from collections import Counter
    
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    
    # ROUGE-1 (unigram overlap)
    ref_unigrams = Counter(ref_tokens)
    cand_unigrams = Counter(cand_tokens)
    overlap = sum((ref_unigrams & cand_unigrams).values())
    
    rouge_1_precision = overlap / len(cand_tokens) if cand_tokens else 0
    rouge_1_recall = overlap / len(ref_tokens) if ref_tokens else 0
    rouge_1_f1 = (2 * rouge_1_precision * rouge_1_recall / 
                  (rouge_1_precision + rouge_1_recall)) if (rouge_1_precision + rouge_1_recall) > 0 else 0
    
    return {
        'rouge-1': {'precision': rouge_1_precision, 'recall': rouge_1_recall, 'f1': rouge_1_f1}
    }


def evaluate_caption_quality(generated_caption, reference_caption):
    """
    Comprehensive caption evaluation
    
    Args:
        generated_caption: Model output
        reference_caption: Ground truth
    
    Returns:
        metrics: Dict with multiple scores
    """
    # BLEU score
    bleu = compute_bleu_score([reference_caption], generated_caption)
    
    # ROUGE score
    rouge = compute_rouge_score(reference_caption, generated_caption)
    
    # Exact match (strict)
    exact_match = int(generated_caption.lower().strip() == reference_caption.lower().strip())
    
    # Word overlap ratio
    ref_words = set(reference_caption.lower().split())
    gen_words = set(generated_caption.lower().split())
    overlap_ratio = len(ref_words & gen_words) / len(ref_words) if ref_words else 0
    
    metrics = {
        'bleu': round(bleu, 4),
        'rouge_1_f1': round(rouge['rouge-1']['f1'], 4),
        'exact_match': exact_match,
        'word_overlap': round(overlap_ratio, 4)
    }
    
    return metrics


print("\n" + "="*70)
print("MULTIMODAL LLM EVALUATION METRICS")
print("="*70)

# Example evaluation
print("\nMetrics for caption quality assessment:")
print("\n1. BLEU (0-1):")
print("   • Measures n-gram precision")
print("   • Higher = better word-level match")
print("   • BLEU-4 considers up to 4-word sequences")

print("\n2. ROUGE (0-1):")
print("   • Measures recall of reference words")
print("   • ROUGE-1: Unigram overlap")
print("   • ROUGE-L: Longest common subsequence")

print("\n3. Word Overlap (0-1):")
print("   • Simple ratio of shared words")
print("   • Fast to compute, interpretable")

print("\n4. Exact Match:")
print("   • Binary: 1 if identical, 0 otherwise")
print("   • Strict but useful for specific phrases")

# Evaluate synthetic examples
print("\n" + "="*70)
print("EXAMPLE EVALUATION")
print("="*70)

reference = "edge failure pattern with 45 dies affected at wafer periphery"
candidates = [
    "edge failure pattern with dies affected at wafer periphery",  # Good match
    "wafer shows failures at the edge region",  # Moderate match
    "random defects across the wafer"  # Poor match
]

for i, candidate in enumerate(candidates, 1):
    print(f"\nCandidate {i}: '{candidate}'")
    metrics = evaluate_caption_quality(candidate, reference)
    print(f"  BLEU: {metrics['bleu']}")
    print(f"  ROUGE-1 F1: {metrics['rouge_1_f1']}")
    print(f"  Word Overlap: {metrics['word_overlap']}")
    print(f"  Exact Match: {metrics['exact_match']}")

---

## 🎯 Real-World Project Ideas

### Post-Silicon Validation Projects

#### 1. **Automated Failure Analysis Report Generator**
**Objective:** Convert wafer maps and test plots into executive summary reports

**Features:**
- Input: Wafer maps, parametric plots, STDF data
- Processing: CLIP classification + LLaVA captioning
- Output: Natural language reports with insights
- Metrics: Report accuracy, time savings vs manual

**Business Value:** Reduce failure analysis time from 2-4 hours to 5 minutes per lot

**Implementation:**
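```python
# Skeleton: the classifier, captioner, and STDF summarizer are passed in as
# callables because their concrete implementations are project-specific.
def generate_failure_report(wafer_images, test_data, classify, caption, summarize_stdf):
    sections = []
    for idx, image in enumerate(wafer_images, 1):
        pattern = classify(image)      # 1. Classify defect pattern (e.g. CLIP)
        text = caption(image)          # 2. Generate caption (e.g. LLaVA)
        sections.append(f"Wafer {idx}: {pattern}. {text}")
    stats = summarize_stdf(test_data)  # 3. Extract statistics from STDF
    # 4. Compile into a plain-text report
    # 5. Return it (swap in a PDF/HTML template as needed)
    return "\n".join(sections + [f"Statistics: {stats}"])
```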

---

#### 2. **Visual Test Documentation Search Engine**
**Objective:** Search historical test data using natural language queries

**Features:**
- Index: 10K+ wafer maps with embeddings
- Query: "Show me edge failures from Q3 2024"
- Retrieval: CLIP image-text similarity
- Result: Ranked wafer maps with descriptions

**Business Value:** Engineers find similar failures 10x faster, improving root cause analysis

**Tech Stack:**
- CLIP for embeddings
- Vector database (FAISS/Pinecone)
- Streamlit UI
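
The retrieval step can be sketched without any model: given precomputed embeddings, rank stored wafer maps by cosine similarity to the query embedding. The 3-dimensional vectors and wafer IDs below are made up for illustration. In the project they would come from CLIP, and a vector database would replace the linear scan.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search(query_vec, index, top_k=2):
    """Return the top_k (item_id, score) pairs, best match first."""
    scored = [(item_id, cosine(query_vec, vec)) for item_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Hypothetical embeddings keyed by wafer ID
index = {
    "W001_edge": [0.9, 0.1, 0.0],
    "W002_cluster": [0.1, 0.9, 0.1],
    "W003_edge": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # stands in for the embedding of "edge failures"
results = search(query, index)
print(results)
```

Because the image embeddings are computed once at indexing time, each query only costs one text-encoder call plus the similarity search.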

---

#### 3. **Defect Detection with VQA Interface**
**Objective:** Interactive defect inspection tool with natural language

**Features:**
- Upload die photo
- Ask: "Are there scratches?", "What's the defect type?"
- Model: Fine-tuned LLaVA on semiconductor images
- Output: Answer + confidence + highlighted regions

**Business Value:** Non-experts can inspect defects without training on defect taxonomy

**Dataset:** Fine-tune on 5K+ labeled die photos

---

#### 4. **Parametric Correlation Explainer**
**Objective:** Explain correlations between test parameters using multimodal AI

**Features:**
- Input: Scatter plots (Vdd vs Idd, Freq vs Power)
- Question: "Why does Idd increase with voltage?"
- Model: GPT-4V or fine-tuned LLaVA
- Output: Physics-based explanation in plain English

**Business Value:** Accelerates debug for junior engineers, documents insights automatically

---

### General AI/ML Projects

#### 5. **Medical Image Question Answering**
**Objective:** Assist radiologists with diagnostic questions

**Features:**
- Input: X-rays, MRIs, CT scans
- Questions: "Is there a fracture?", "Where is the abnormality?"
- Model: Fine-tuned BLIP-2 or LLaVA
- Output: Answer + attention map

**Dataset:** CheXpert, MIMIC-CXR (chest X-rays)

---

#### 6. **Visual Shopping Assistant**
**Objective:** Answer product questions from images

**Features:**
- Input: Product photos
- Questions: "What size is this?", "Is it waterproof?"
- Model: CLIP + GPT-3.5 retrieval
- Output: Structured answers from image + text

**Business Value:** Improve conversion rates, reduce customer service load

---

#### 7. **Accessibility Alt-Text Generator**
**Objective:** Auto-generate descriptive alt-text for web images

**Features:**
- Batch process website images
- Generate detailed captions (ViT-GPT2)
- Validate quality with BLEU/CIDEr
- Export to HTML alt tags

**Impact:** Make web content accessible to visually impaired users

---

#### 8. **Video Surveillance Event Summarizer**
**Objective:** Summarize security footage with natural language

**Features:**
- Input: Video frames (sampled every 5 sec)
- Processing: CLIP classification per frame
- Output: Timeline with event descriptions
- Alerts: Anomaly detection + notification

**Tech:** CLIP + frame sampling + timeline generation
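
The frame sampling and timeline steps can be sketched independently of the classifier. The per-frame labels below are hypothetical stand-ins for CLIP predictions, and both function names are illustrative.

```python
def sample_frame_indices(total_frames, fps, every_sec=5):
    """Indices of frames taken once every `every_sec` seconds."""
    step = max(1, int(round(fps * every_sec)))
    return list(range(0, total_frames, step))

def build_timeline(labels, every_sec=5):
    """Merge consecutive identical labels into (start_sec, end_sec, label) events."""
    events = []
    for i, label in enumerate(labels):
        start = i * every_sec
        if events and events[-1][2] == label:
            # Same label as the previous sample: extend the current event
            events[-1] = (events[-1][0], start + every_sec, label)
        else:
            events.append((start, start + every_sec, label))
    return events

indices = sample_frame_indices(total_frames=900, fps=30)  # 30 s of video
labels = ["empty", "empty", "person", "person", "person", "empty"]  # hypothetical
timeline = build_timeline(labels)
print(indices)
print(timeline)
```

Merging runs of identical labels keeps the timeline short, so a reviewer sees one "person" event instead of three consecutive frame-level entries.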

---

## 🔍 Diagnostic & Validation

### Common Issues and Solutions

#### 1. **Poor Caption Quality**

**Symptoms:**
- Generic captions ("a picture of something")
- Repetitive phrases
- Missing key details

**Debugging:**
```python
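# Check model outputs (assumes decoder, features, and captions are already defined)
outputs = decoder(features, captions)
print("Logits shape:", outputs.shape)
# Logits are unnormalized; apply softmax before reading them as probabilities
probs = outputs.softmax(dim=-1)
print("Max probability:", probs.max().item())

# Visualize attention weights
attention_weights = attention_layer.attention_weights
plt.imshow(attention_weights[0].cpu().detach().numpy())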
```

**Solutions:**
- Increase beam width (3 → 5)
- Fine-tune on domain-specific data
- Add length penalty to avoid short captions
- Use better pre-trained models (BLIP-2)

---

#### 2. **CLIP Misclassification**

**Symptoms:**
- Low confidence scores (<0.3)
- Wrong class predictions
- Sensitivity to prompt wording

**Debugging:**
```python
# Test different prompts
# ("class" is a reserved word in Python, so the placeholder is named "label")
prompts = [
    "a photo of {label}",
    "a {label}",
    "{label} in an image"
]

for template in prompts:
    classes = [template.format(label=c) for c in class_names]
    pred, scores = classifier.classify(image, classes, return_scores=True)
    print(f"Template: {template} → {pred} ({scores[pred]:.3f})")
```

**Solutions:**
- Engineer better text prompts
- Ensemble multiple prompt templates
- Fine-tune CLIP on domain data
- Use larger CLIP models (ViT-L/14)
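
Template ensembling can be sketched as averaging the unit-length text embeddings of several prompts for the same class and re-normalizing the result. The `embed_text` callable is a stand-in for a CLIP text encoder, and a toy embedding is used below so the sketch runs on its own.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def ensemble_class_embedding(class_name, templates, embed_text):
    """Average unit-length prompt embeddings for one class, then re-normalize."""
    embeddings = [normalize(embed_text(t.format(label=class_name))) for t in templates]
    mean = [sum(dim) / len(embeddings) for dim in zip(*embeddings)]
    return normalize(mean)

def toy_embed(text):
    """Toy stand-in for a text encoder: character, word, and vowel counts."""
    return [float(len(text)), float(len(text.split())), float(sum(c in "aeiou" for c in text))]

templates = ["a photo of {label}", "a {label}", "{label} in an image"]
vec = ensemble_class_embedding("ring defect", templates, toy_embed)
print([round(x, 3) for x in vec])
```

The averaged vector replaces the single-prompt embedding when computing image-text similarity, which dampens the effect of any one wording.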

---

#### 3. **VQA Model Overfitting**

**Symptoms:**
- High train accuracy, low test accuracy
- Model memorizes training answers
- Poor generalization to new images

**Solutions:**
- Increase dropout (0.5 → 0.7)
- Add data augmentation (random crops, flips)
- Use more diverse training data
- Regularize with weight decay
- Implement early stopping
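
Early stopping is simple enough to sketch in full: track the best validation loss and stop once it has failed to improve for a set number of epochs. The class below is an illustrative helper, and the loss values are made up.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""
    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
val_losses = [0.90, 0.75, 0.70, 0.72, 0.71, 0.69]  # hypothetical per-epoch values
stopped_at = None
for epoch, loss in enumerate(val_losses, 1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(f"Stopped at epoch {stopped_at}, best val loss {stopper.best}")
```

With `patience=2` the loop stops at epoch 5 and never sees the later improvement to 0.69, which illustrates the trade-off: a larger patience tolerates noisy validation curves at the cost of extra epochs.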

---

#### 4. **Memory Issues with Large Models**

**Symptoms:**
- CUDA out of memory errors
- Slow inference times

**Solutions:**
```python
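# Mixed precision (torch.autocast supersedes the deprecated torch.cuda.amp.autocast)
import torch
import torch.nn as nn

with torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(images, text)

# Gradient accumulation: step the optimizer once every `accumulation_steps` batches
accumulation_steps = 4  # example value
for i, batch in enumerate(dataloader):
    loss = model(batch) / accumulation_steps
    loss.backward()
    
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Dynamic quantization (CPU inference; shrinks the model rather than GPU memory use)
model_quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)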
```

---

### Validation Checklist

✅ **Data Quality**
- [ ] Images are properly preprocessed (resize, normalize)
- [ ] Text is tokenized correctly
- [ ] No data leakage between train/test
- [ ] Balanced class distribution

✅ **Model Architecture**
- [ ] Vision and text encoders are compatible
- [ ] Fusion layer has sufficient capacity
- [ ] Gradient flow is healthy (no vanishing/exploding)
- [ ] Output dimensions match task requirements

✅ **Training Process**
- [ ] Learning rate is appropriate (1e-5 to 1e-3)
- [ ] Loss is decreasing steadily
- [ ] Validation metrics improve
- [ ] No overfitting (train-val gap < 10%)

✅ **Inference**
- [ ] Model is in eval mode (`model.eval()`)
- [ ] Gradients are disabled (`with torch.no_grad()`)
- [ ] Beam search parameters are tuned
- [ ] Output is post-processed correctly

✅ **Performance**
- [ ] BLEU > 0.2 (for captioning)
- [ ] Accuracy > 70% (for classification)
- [ ] Inference time < 1 sec per image
- [ ] Qualitative results make sense

---

## 🎓 Key Takeaways

### When to Use Multimodal LLMs

✅ **Use When:**
- Need to understand both visual and textual information
- Want to generate descriptions of images automatically
- Building conversational AI that discusses images
- Creating accessibility tools (alt-text, visual assistance)
- Analyzing visual data at scale (defect detection, medical imaging)
- Enabling natural language interfaces to visual systems

❌ **Avoid When:**
- Pure text or pure vision tasks (use specialized models)
- Real-time critical systems (inference can be slow)
- Privacy-sensitive applications without careful data handling
- Limited compute resources (models are large)

---

### Architecture Trade-offs

| Approach | Pros | Cons | Best For |
|----------|------|------|----------|
| **CLIP** | Fast, zero-shot, well-aligned | No text generation | Classification, retrieval |
| **Encoder-Decoder** | Simple, interpretable | Requires training data | Basic captioning |
| **LLaVA** | Instruction following, conversational | Large, needs GPU | Complex reasoning, QA |
| **BLIP-2** | Efficient (frozen encoders) | Complex architecture | Production systems |
| **GPT-4V** | Best quality | Expensive, proprietary | High-stakes applications |

---

### Training Strategies

**Pre-training:**
1. **Contrastive Learning** (CLIP-style)
   - Large-scale image-text pairs (millions)
   - Learn aligned embedding space
   - Enables zero-shot transfer

2. **Generative Pre-training**
   - Image captioning datasets (COCO, Flickr)
   - Autoregressive text generation
   - Builds fluency
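
The contrastive objective can be sketched as a symmetric cross-entropy over an image-text similarity matrix, where matching pairs sit on the diagonal. This is a simplified pure-Python illustration with made-up 3×3 similarity matrices and an assumed temperature of 0.07, not the training code of any specific model.

```python
import math

def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct column for row i is i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        loss += log_sum_exp - row[i]
    return loss / len(logits)

def contrastive_loss(similarity, temperature=0.07):
    """Symmetric InfoNCE-style loss: image-to-text plus text-to-image."""
    logits = [[s / temperature for s in row] for row in similarity]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))

# Rows are images, columns are texts (hypothetical cosine similarities)
aligned = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]   # matched pairs score highest
shuffled = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.2], [0.0, 0.2, 0.9]]  # first two pairs swapped
print(round(contrastive_loss(aligned), 4), round(contrastive_loss(shuffled), 4))
```

The loss is low when each image is most similar to its own caption and rises sharply when pairs are mismatched, which is what pushes the two encoders toward a shared embedding space.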

**Fine-tuning:**
1. **Task-Specific**
   - Adapt to VQA, captioning, etc.
   - Smaller datasets (10K-100K)
   - Tune only top layers

2. **Instruction Tuning**
   - Diverse visual instructions
   - Improves generalization
   - Better user interaction

**Parameter-Efficient:**
- LoRA, adapters (1-5% params)
- Faster training, less memory
- Good for domain adaptation
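
How small the trainable share is depends on the adapter rank. As a rough sketch for LoRA, a rank-r update to a d×k weight matrix trains r·(d + k) parameters instead of d·k. The 4096×4096 layer size and the ranks below are illustrative assumptions.

```python
def lora_fraction(d, k, rank):
    """Fraction of a d x k weight matrix's parameter count that LoRA trains."""
    full = d * k
    lora = rank * (d + k)  # low-rank factors: (d x rank) and (rank x k)
    return lora / full

for rank in (8, 64):
    pct = 100 * lora_fraction(4096, 4096, rank)
    print(f"rank={rank}: {pct:.2f}% of the layer's parameters")
```

The fraction grows linearly with rank, so the rank is the main knob for trading adapter capacity against memory.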

---

### Performance Optimization

**Inference Speed:**
```python
# 1. Batch processing
images = load_batch(paths)
captions = model.batch_caption(images)

# 2. Model quantization
model_int8 = torch.quantization.quantize_dynamic(model)

# 3. ONNX export
torch.onnx.export(model, dummy_input, "model.onnx")

# 4. Reduce beam width
captions = model.caption(image, num_beams=3)  # vs 5
```

**Memory Efficiency:**
```python
# 1. Gradient checkpointing
model.gradient_checkpointing_enable()

# 2. Mixed precision (torch.autocast supersedes torch.cuda.amp.autocast)
with torch.autocast(device_type="cuda"):
    outputs = model(inputs)

# 3. Freeze encoders
for param in encoder.parameters():
    param.requires_grad = False
```

---

### Common Pitfalls

1. **Hallucination**: Models generate plausible but incorrect details
   - **Solution**: Use beam search, temperature control, post-processing validation

2. **Prompt Sensitivity**: Small wording changes affect CLIP dramatically
   - **Solution**: Template ensembling, prompt engineering

3. **Bias**: Models inherit biases from training data
   - **Solution**: Diverse training data, bias detection, human review

4. **Limited Context**: A fixed input resolution discards fine detail
   - **Solution**: Multi-scale processing, region proposals

5. **Caption Repetition**: Models repeat the same phrases
   - **Solution**: Diversity or repetition penalties, nucleus sampling
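
Template ensembling (pitfall 2) is easy to sketch: embed each class name under several prompt templates, average the unit-normalized embeddings, and renormalize to get one vector per class. The example below is a minimal illustration under loud assumptions: `toy_text_embed` is a hashed bag-of-words stand-in for a real text encoder such as CLIP's, and the templates and class names are hypothetical. Only the averaging logic is the point.

```python
import hashlib
import math

DIM = 64  # toy embedding size (assumption)

def toy_text_embed(text):
    """Stand-in for a real text encoder: hashed bag-of-words vector."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        idx = int(hashlib.sha256(word.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    return vec

def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def ensemble_class_embedding(class_name, templates, embed=toy_text_embed):
    """Average unit-norm embeddings over all templates, then renormalize."""
    embs = [normalize(embed(t.format(class_name))) for t in templates]
    mean = [sum(col) / len(embs) for col in zip(*embs)]
    return normalize(mean)

def classify(query_emb, class_embs):
    """Return the class whose ensembled embedding has the highest cosine similarity."""
    q = normalize(query_emb)
    scores = {name: sum(a * b for a, b in zip(q, emb))
              for name, emb in class_embs.items()}
    return max(scores, key=scores.get)

# Hypothetical templates and class names for wafer-map classification
templates = [
    "a wafer map showing a {} pattern",
    "a photo of a {} defect",
    "an image of a wafer with a {}",
]
classes = ["ring", "scratch", "cluster"]
class_embs = {c: ensemble_class_embedding(c, templates) for c in classes}
```

With a real model, `query_emb` would be the image embedding, and the per-class vectors can be computed once and reused for every image.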

---

### Best Practices

**Development:**
- Start with pre-trained models (HuggingFace)
- Validate on small dataset before scaling
- Use diverse evaluation metrics (BLEU + human eval)
- Monitor for bias and fairness issues
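
As a reminder of what BLEU measures, the sketch below computes clipped n-gram precision and combines it with a brevity penalty. It is a deliberately minimal version (single reference, no smoothing, bigrams by default) with made-up example captions; for reported results, an established implementation is the safer choice.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped precision: a candidate n-gram counts at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of 1..max_n clipped precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing: any zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)

reference = "a ring defect near the wafer edge".split()
good = "a ring defect at the wafer edge".split()
repetitive = "ring ring ring ring ring ring ring".split()

score_good = simple_bleu(good, reference)
score_repetitive = simple_bleu(repetitive, reference)
print(f"good: {score_good:.3f}, repetitive: {score_repetitive:.3f}")
```

Clipping is what stops the repetitive caption from scoring well on unigrams alone, but BLEU still cannot tell whether a fluent caption is factually right, which is why pairing it with human evaluation matters.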

**Deployment:**
- Cache frequent queries (CLIP embeddings)
- Use async processing for batch jobs
- Implement fallback mechanisms
- Log outputs for continuous improvement
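
Caching repeated queries can be as simple as memoizing the embedding call. The sketch below uses `functools.lru_cache` from the standard library; `expensive_text_embed` is a hypothetical stand-in for a real encoder forward pass, and the call counter exists only to show how many times the encoder actually runs.

```python
from functools import lru_cache

CALLS = {"count": 0}  # tracks how often the "encoder" is really invoked

def expensive_text_embed(text):
    """Hypothetical stand-in for a real encoder call (e.g. a CLIP text forward pass)."""
    CALLS["count"] += 1
    return tuple(float(ord(ch)) for ch in text[:8])

@lru_cache(maxsize=10_000)
def cached_text_embed(text):
    """Memoize embeddings by query string; tuples keep cached values immutable."""
    return expensive_text_embed(text)

for query in ["ring defect", "edge scratch", "ring defect", "ring defect"]:
    cached_text_embed(query)

info = cached_text_embed.cache_info()
print(f"encoder calls: {CALLS['count']}, hits: {info.hits}, misses: {info.misses}")
```

Normalizing queries before the lookup (trimming whitespace, lowercasing) raises the hit rate, and a shared store is the usual next step once several worker processes need the same cache.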

**Fine-tuning:**
- Start with small learning rates (e.g., 1e-5)
- Use domain-specific validation data
- Monitor train-val gap for overfitting
- Save checkpoints frequently
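
Monitoring the train-val gap and checkpointing fit naturally into one small helper. The sketch below is a framework-agnostic early-stopping class; the class name, the `patience` value, and the loss curves are illustrative assumptions (the numbers are made up to show validation loss turning upward while training loss keeps falling).

```python
class EarlyStopping:
    """Stop when validation loss has not improved by min_delta for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.best_epoch = epoch  # a checkpoint would be saved here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Illustrative loss curves (made-up numbers)
train_losses = [2.0, 1.5, 1.1, 0.8, 0.6, 0.4, 0.3]
val_losses = [2.1, 1.7, 1.5, 1.45, 1.5, 1.6, 1.7]

stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, (train, val) in enumerate(zip(train_losses, val_losses)):
    gap = val - train  # a widening gap is the overfitting signal
    print(f"epoch {epoch}: train={train:.2f} val={val:.2f} gap={gap:.2f}")
    if stopper.step(epoch, val):
        stopped_at = epoch
        break
```

Saving a checkpoint only when validation loss improves means the best epoch can be restored after stopping, rather than the last one.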

---

### Next Steps

**Continue Learning:**
- **079_RAG_Fundamentals.ipynb** - Combine multimodal LLMs with retrieval
- **080_Advanced_RAG_Techniques.ipynb** - Multimodal RAG systems
- **074_LLM_Fine_Tuning.ipynb** - Fine-tune for specific domains

**Practice Projects:**
1. Build a visual search engine for your photo library
2. Create an automated report generator for data visualizations
3. Fine-tune LLaVA on semiconductor defect images
4. Deploy a CLIP-based image classifier API

**Research Directions:**
- Video understanding (extend to temporal dimension)
- 3D vision-language models
- Multilingual multimodal models
- Efficient architectures for edge deployment

---

## 📚 References & Resources

**Papers:**
- CLIP: "Learning Transferable Visual Models From Natural Language Supervision" (Radford et al., 2021)
- LLaVA: "Visual Instruction Tuning" (Liu et al., 2023)
- BLIP-2: "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models" (Li et al., 2023)
- Flamingo: "Flamingo: a Visual Language Model for Few-Shot Learning" (Alayrac et al., 2022)

**Code & Models:**
- HuggingFace Transformers: https://github.com/huggingface/transformers
- OpenAI CLIP: https://github.com/openai/CLIP
- LLaVA: https://github.com/haotian-liu/LLaVA
- BLIP-2: https://github.com/salesforce/LAVIS

**Datasets:**
- COCO Captions: https://cocodataset.org/
- VQAv2: https://visualqa.org/
- Flickr30K: https://shannon.cs.illinois.edu/DenotationGraph/
- Conceptual Captions: https://ai.google.com/research/ConceptualCaptions/

**Community:**
- HuggingFace Forums: https://discuss.huggingface.co/
- Papers With Code: https://paperswithcode.com/task/image-captioning
- Reddit r/MachineLearning: Multimodal AI discussions

---

**🎉 Congratulations!** You've completed the Multimodal LLMs notebook. You now understand how to build and deploy vision-language models for real-world applications, from zero-shot classification to visual question answering and instruction following.