[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic5.6/semantic-search.ipynb)
[![Open with SageMaker](https://img.shields.io/badge/Open%20with-SageMaker-orange?logo=amazonaws)](https://studiolab.sagemaker.aws/import/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic5.6/semantic-search.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/hf-transformer-trove/blob/main/examples/basic5.6/semantic-search.ipynb)

# Semantic Search with FAISS and HuggingFace

## 🎯 Learning Objectives
By the end of this notebook, you will understand:
- How semantic search differs from traditional keyword-based search
- How to generate text embeddings using transformer models
- How to use FAISS for efficient similarity search
- How to build a practical semantic search system
- How to evaluate and interpret semantic search results

## 📋 Prerequisites
- Basic understanding of machine learning concepts
- Familiarity with Python and PyTorch
- Knowledge of NLP fundamentals (refer to [NLP Learning Journey](https://github.com/vuhung16au/nlp-learning-journey))
- Understanding of transformers and embeddings

## 📚 What We'll Cover
1. **Introduction to Semantic Search**: Understanding context-aware search
2. **Environment Setup**: Installing dependencies and configuring device
3. **Loading and Preparing the Dataset**: Working with the GitHub issues dataset
4. **Creating Text Embeddings**: Using sentence transformers for encoding
5. **Building a FAISS Index**: Setting up efficient similarity search
6. **Performing Semantic Search**: Querying the index with natural language
7. **Visualizing Search Quality**: Analyzing similarity scores and comparing with keyword search
8. **Advanced FAISS Index Types**: Exact versus approximate indices
9. **Saving and Loading the FAISS Index**: Persisting the index, embeddings, and dataset
10. **Best Practices and Performance Tips**: Key takeaways and optimization tips

## 💡 Reference
This notebook is based on the [HuggingFace LLM Course Chapter 5.6](https://huggingface.co/learn/llm-course/chapter5/6?fw=pt) about semantic search with FAISS.

## Part 1: Introduction to Semantic Search

### What is Semantic Search?

**Semantic search** uses deep learning models to understand the *meaning* of text rather than just matching keywords. This enables:

- **Context-aware retrieval**: Find documents with similar meaning even with different wording
- **Better user experience**: Users can search naturally without exact keyword matching
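- **Cross-lingual search**: Find relevant content across languages (requires a multilingual embedding model; the model used in this notebook is trained mainly on English text)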
- **Question answering**: Retrieve passages that answer questions, not just contain keywords

### How It Works

1. **Encode corpus**: Convert all documents to dense vector representations (embeddings)
2. **Index vectors**: Store embeddings in FAISS for fast similarity search
3. **Encode query**: Transform user query into same embedding space
4. **Find similar**: Use FAISS to retrieve most similar documents based on vector similarity
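
The four steps above can be sketched without any model or FAISS at all. The snippet below is a minimal illustration, assuming made-up 3-dimensional vectors in place of real embeddings and a plain NumPy matrix in place of a FAISS index; the documents, vectors, and the `l2_normalize` helper are hypothetical.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings"; a real model would produce these.
corpus = [
    "catch exceptions in Python",
    "bake sourdough bread",
    "debug a Python traceback",
]
corpus_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9],
    [0.8, 0.3, 0.1],
], dtype="float32")

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so the dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Steps 1-2: encode the corpus and "index" it (here the index is just a matrix).
index_matrix = l2_normalize(corpus_vecs)

# Step 3: encode the query into the same space (again a made-up vector).
query_vec = l2_normalize(np.array([[1.0, 0.2, 0.0]], dtype="float32"))

# Step 4: score every document against the query and keep the top 2.
scores = index_matrix @ query_vec[0]
top_k = np.argsort(-scores)[:2]
for rank, doc_id in enumerate(top_k, 1):
    print(rank, corpus[doc_id], round(float(scores[doc_id]), 3))
```

The two Python-related documents rank first because their vectors point in nearly the same direction as the query vector. The rest of the notebook replaces the toy vectors with transformer embeddings and the matrix product with a FAISS index.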

### Traditional vs Semantic Search

**Traditional (Keyword-based)**:
- Query: "python error handling"
- Matches: Documents containing exact words "python", "error", "handling"

**Semantic Search**:
- Query: "python error handling"
- Matches: Documents about exception management, try/except blocks, debugging in Python
- Can find: "How to catch exceptions in Python", "Debugging Python code", etc.
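
To make the limitation of keyword matching concrete, here is a small sketch of a strict whole-word matcher. The `keyword_match` function is a hypothetical helper written for this illustration, not part of any library.

```python
def keyword_match(query: str, document: str) -> bool:
    """Return True only if every query word appears as a whole word in the document."""
    doc_words = set(document.lower().split())
    return all(word in doc_words for word in query.lower().split())

query = "python error handling"
docs = [
    "Python error handling best practices",
    "How to catch exceptions in Python",
]
matches = [keyword_match(query, d) for d in docs]
print(matches)  # [True, False]
```

The second document is about the same topic but shares only the word "python" with the query, so the keyword matcher rejects it. This is the gap that embedding-based search is meant to close.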

## Part 2: Environment Setup

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import random
from typing import List, Dict, Optional, Tuple
from tqdm.auto import tqdm

# HuggingFace libraries
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModel

# Set random seeds for reproducibility (repository standard: seed=16)
random.seed(16)
np.random.seed(16)
torch.manual_seed(16)
if torch.cuda.is_available():
    torch.cuda.manual_seed(16)
    torch.cuda.manual_seed_all(16)

print("🔢 Random seed set to 16 for reproducibility (repository standard)")

# Configure visualization style (repository standard)
sns.set_style('darkgrid')  # Better readability with gridlines
sns.set_palette("husl")     # Consistent, accessible colors
print("📊 Visualization style configured: darkgrid with husl palette")

In [None]:
# Device detection with TPU support for Google Colab
try:
    from google.colab import userdata
    import torch_xla.core.xla_model as xm
    COLAB_AVAILABLE = True
    TPU_AVAILABLE = True
except ImportError:
    COLAB_AVAILABLE = False
    TPU_AVAILABLE = False

def get_device() -> torch.device:
    """
    Get the best available device for PyTorch operations.
    
    Device Priority:
    - Google Colab with torch_xla installed: TPU first
    - Otherwise, or if TPU initialization fails: CUDA GPU > MPS (Apple Silicon) > CPU
    
    Returns:
        torch.device: The optimal device for current hardware
    """
    # Google Colab: Always prefer TPU when available
    if COLAB_AVAILABLE and TPU_AVAILABLE:
        try:
            device = xm.xla_device()
            print("🔥 Using Google Colab TPU for optimal performance")
            print("💡 TPU is preferred in Colab for training and inference")
            return device
        except Exception as e:
            print(f"⚠️ TPU initialization failed: {e}")
            print("Falling back to GPU/CPU detection")
    
    # Standard device detection for other environments
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print(f"🚀 Using CUDA GPU: {torch.cuda.get_device_name()}")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        print("🍎 Using Apple MPS (Apple Silicon)")
    else:
        device = torch.device("cpu")
        print("💻 Using CPU (consider GPU/TPU for better performance)")
    
    return device

device = get_device()

In [None]:
# Check and install required packages
print("📦 Checking required packages...")

# Check if FAISS is installed
try:
    import faiss
    print("✅ FAISS is already installed")
    FAISS_AVAILABLE = True
except ImportError:
    print("⚠️ FAISS not found. Install it with one of the commands below, then re-run this cell.")
    print("💡 Run: pip install faiss-cpu")
    print("💡 Or for GPU: pip install faiss-gpu")
    FAISS_AVAILABLE = False

if FAISS_AVAILABLE:
    print(f"📊 FAISS version: {faiss.__version__ if hasattr(faiss, '__version__') else 'Unknown'}")
    print(f"🔧 FAISS available indices: IndexFlatL2, IndexFlatIP, IndexIVFFlat, etc.")

## Part 3: Loading and Preparing the Dataset

We'll use the `lewtun/github-issues` dataset, which contains issues and pull requests from the 🤗 Datasets GitHub repository. It suits a semantic search demo because users often look for issues with natural language descriptions rather than exact titles.

In [None]:
# Load the GitHub issues dataset
print("📥 Loading GitHub issues dataset...")
print("📍 Dataset: lewtun/github-issues")
print("💡 This dataset contains issues and pull requests from the 🤗 Datasets GitHub repository\n")

try:
    # Load the full dataset
    github_issues = load_dataset("lewtun/github-issues", split="train")
    
    print(f"✅ Dataset loaded successfully!")
    print(f"📊 Total examples: {len(github_issues):,}")
    print(f"📋 Features: {list(github_issues.features.keys())}")
    
except Exception as e:
    print(f"❌ Error loading dataset: {e}")
    print("💡 Make sure you have an internet connection and the datasets library is installed")
    raise  # Later cells need github_issues, so stop here instead of failing with a NameError

In [None]:
# Explore the dataset structure
print("🔍 Dataset Structure Analysis")
print("=" * 60)

# Convert to pandas for easier exploration
df = github_issues.to_pandas()

print(f"\n📊 Dataset shape: {df.shape}")
print(f"\n📋 Columns: {list(df.columns)}")
print(f"\n🔢 Data types:\n{df.dtypes}")

# Display first few examples
print(f"\n📝 Sample Issue:")
print("=" * 60)
sample_idx = 0
sample = df.iloc[sample_idx]
print(f"Title: {sample['title'] if 'title' in df.columns else 'N/A'}")
print(f"\nBody: {sample['body'][:300] if 'body' in df.columns and pd.notna(sample['body']) else 'N/A'}...")
print(f"\nComments: {sample['comments'] if 'comments' in df.columns else 'N/A'}")

In [None]:
# Prepare the dataset for semantic search
# We'll combine title and body for better context
print("🔧 Preparing dataset for semantic search...")

def prepare_text_for_search(example):
    """
    Combine title and body for semantic search.
    
    Args:
        example: Dataset example with 'title' and 'body' fields
        
    Returns:
        Example with added 'text' field
    """
    # Get title
    title = example.get('title', '')
    if title is None:
        title = ''
    
    # Get body
    body = example.get('body', '')
    if body is None:
        body = ''
    
    # Combine title and body
    # Put the title first so it is kept when long bodies are truncated at the model's max length
    example['text'] = f"{title}\n{body}"
    
    return example

# Apply preprocessing
github_issues = github_issues.map(prepare_text_for_search)

print(f"✅ Text field created by combining title and body")
print(f"📊 Sample combined text:")
print(f"{github_issues[0]['text'][:400]}...")

In [None]:
# For demonstration purposes, we'll use a subset of the data
# This makes the notebook run faster while still showing the concepts
print("🎲 Sampling dataset for faster processing...")
print("💡 Using repository standard seed=16 for reproducible sampling\n")

# Sample 1000 issues for this demo (shuffle with seed=16)
SAMPLE_SIZE = 1000
github_issues_sample = github_issues.shuffle(seed=16).select(range(min(SAMPLE_SIZE, len(github_issues))))

print(f"✅ Sampled {len(github_issues_sample):,} issues for demonstration")
print(f"🔢 Using seed=16 ensures reproducible results across runs")

## Part 4: Creating Text Embeddings

Now we'll convert the text into dense vector embeddings using a sentence transformer model. These embeddings capture the semantic meaning of the text in a high-dimensional vector space.

In [None]:
# Load a sentence transformer model for embeddings
print("📥 Loading sentence transformer model...")
print("🤖 Model: sentence-transformers/all-MiniLM-L6-v2")
print("💡 This model is optimized for semantic similarity tasks\n")

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Move model to optimal device
model = model.to(device)
model.eval()  # Set to evaluation mode

print(f"✅ Model loaded successfully")
print(f"📱 Model on device: {device}")
print(f"📊 Model max sequence length: {tokenizer.model_max_length}")

In [None]:
def mean_pooling(model_output, attention_mask):
    """
    Perform mean pooling on model output to get sentence embeddings.
    
    Mean pooling averages token embeddings, weighted by attention mask,
    to produce a single vector representing the entire sentence.
    
    Args:
        model_output: Output from transformer model
        attention_mask: Attention mask indicating valid tokens
        
    Returns:
        Sentence embeddings (batch_size x embedding_dim)
    """
    # Extract token embeddings from model output
    token_embeddings = model_output[0]  # Shape: (batch_size, seq_length, hidden_dim)
    
    # Expand attention mask to match token embeddings shape
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    
    # Sum token embeddings, weighted by attention mask
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, dim=1)
    
    # Calculate mean by dividing by number of valid tokens
    sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)
    mean_embeddings = sum_embeddings / sum_mask
    
    return mean_embeddings

def encode_texts(texts: List[str], batch_size: int = 32) -> np.ndarray:
    """
    Encode a list of texts into embeddings.
    
    Args:
        texts: List of text strings to encode
        batch_size: Batch size for processing (larger = faster but more memory)
        
    Returns:
        numpy array of embeddings (num_texts x embedding_dim)
    """
    all_embeddings = []
    
    # Process in batches for memory efficiency
    for i in tqdm(range(0, len(texts), batch_size), desc="Encoding texts"):
        batch_texts = texts[i:i + batch_size]
        
        # Tokenize batch
        encoded_input = tokenizer(
            batch_texts,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors='pt'
        )
        
        # Move to device
        encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
        
        # Generate embeddings
        with torch.no_grad():
            model_output = model(**encoded_input)
            # Apply mean pooling
            embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
            # Normalize embeddings for cosine similarity
            embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
        
        all_embeddings.append(embeddings.cpu().numpy())
    
    # Concatenate all batches
    return np.vstack(all_embeddings)

print("✅ Embedding functions defined")
print("📊 Functions: mean_pooling() and encode_texts()")
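
To see what the attention mask does during pooling, here is a small worked example. It is a NumPy re-implementation of the same idea, written only for illustration; `mean_pooling_np` is a hypothetical name and the input values are made up.

```python
import numpy as np

def mean_pooling_np(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring positions where the mask is 0."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # (batch, 1)
    return summed / counts

# One sequence with three token slots; the last slot is padding.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = mean_pooling_np(tokens, mask)
print(pooled)  # [[2. 3.]]
```

The padding vector `[100, 100]` has no effect on the result: only the two real tokens are averaged. Without the mask, padded batches would produce embeddings that depend on how much padding each sentence received.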

In [None]:
# Generate embeddings for all issues in our sample
print("🔄 Generating embeddings for all issues...")
print(f"📊 Processing {len(github_issues_sample):,} issues")
print("⏱️ This may take a few minutes depending on your hardware\n")

# Extract texts from dataset
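# Materialize as a plain list of strings; newer versions of datasets may return a lazy column object
texts = list(github_issues_sample['text'])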

# Generate embeddings
embeddings = encode_texts(texts, batch_size=32)

print(f"\n✅ Embeddings generated successfully!")
print(f"📊 Embeddings shape: {embeddings.shape}")
print(f"📏 Embedding dimension: {embeddings.shape[1]}")
print(f"💾 Memory usage: {embeddings.nbytes / 1024 / 1024:.2f} MB")
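
Because `encode_texts()` truncates each input at `max_length=512` tokens, anything past that point in a long issue body is ignored. One possible workaround is to split long texts into overlapping chunks and embed each chunk separately. The sketch below assumes whitespace-separated words as a rough stand-in for model tokens; `sliding_window_chunks` and its default `window` and `stride` values are hypothetical and would need tuning against the real tokenizer.

```python
from typing import List

def sliding_window_chunks(text: str, window: int = 200, stride: int = 150) -> List[str]:
    """Split text into overlapping chunks of `window` words, advancing `stride` words per step."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    words = text.split()
    if len(words) <= window:
        return [" ".join(words)] if words else []
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # This window already reached the end of the text
    return chunks

demo = " ".join(f"w{i}" for i in range(10))
chunks = sliding_window_chunks(demo, window=4, stride=3)
print(chunks)  # ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Each chunk would then get its own row in the index, along with a mapping back to the issue it came from, so a hit on any chunk can be reported as a hit on the parent issue.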

## Part 5: Building FAISS Index for Efficient Search

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search. We'll create an index that allows us to quickly find the most similar embeddings to a query.

In [None]:
# Check if FAISS is available, if not provide instructions
if not FAISS_AVAILABLE:
    print("⚠️ FAISS is not installed. Please install it to continue:")
    print("💡 Run: !pip install faiss-cpu")
    print("💡 Or for GPU: !pip install faiss-gpu")
else:
    import faiss
    
    print("🏗️ Building FAISS index...")
    print("📊 Index type: IndexFlatIP (Inner Product for cosine similarity)")
    print("💡 IndexFlatIP is exact search - guarantees finding true nearest neighbors\n")
    
    # Get embedding dimension
    embedding_dim = embeddings.shape[1]
    
    # Create FAISS index
    # IndexFlatIP uses inner product (dot product) for similarity
    # Since our embeddings are normalized, inner product = cosine similarity
    index = faiss.IndexFlatIP(embedding_dim)
    
    # Add embeddings to index
    # FAISS requires float32 format
    embeddings_float32 = embeddings.astype('float32')
    index.add(embeddings_float32)
    
    print(f"✅ FAISS index built successfully!")
    print(f"📊 Index contains {index.ntotal:,} vectors")
    print(f"📏 Vector dimension: {embedding_dim}")
    print(f"🔍 Index is ready for semantic search!")
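
The claim that inner product equals cosine similarity for normalized vectors can be checked numerically. The sketch below uses random vectors rather than real embeddings. It also checks a related identity: for unit vectors, squared L2 distance is `2 - 2 * cosine`, so an L2 index and an inner-product index rank normalized vectors in the same order.

```python
import numpy as np

rng = np.random.default_rng(16)
a = rng.normal(size=384)
b = rng.normal(size=384)

# Cosine similarity computed directly from the unnormalized vectors
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Inner product and squared L2 distance after L2 normalization
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner_product = float(a_unit @ b_unit)
squared_l2 = float(np.sum((a_unit - b_unit) ** 2))

print(np.isclose(cosine, inner_product))       # True
print(np.isclose(squared_l2, 2 - 2 * cosine))  # True
```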

## Part 6: Performing Semantic Search

Now we can perform semantic search! We'll encode a query and find the most similar issues in our dataset.

In [None]:
def semantic_search(query: str, k: int = 5) -> List[Dict]:
    """
    Perform semantic search to find most relevant issues.
    
    Args:
        query: Natural language search query
        k: Number of results to return
        
    Returns:
        List of dictionaries with search results and scores
    """
    # Encode the query
    query_embedding = encode_texts([query], batch_size=1)
    query_embedding_float32 = query_embedding.astype('float32')
    
    # Search the index
    # Returns: scores (similarities) and indices of nearest neighbors
    scores, indices = index.search(query_embedding_float32, k)
    
    # Prepare results
    results = []
    for score, idx in zip(scores[0], indices[0]):
        if idx < 0:
            # FAISS pads results with -1 when fewer than k vectors are available
            continue
        row = github_issues_sample[int(idx)]  # Fetch the row once instead of three times
        result = {
            'score': float(score),
            'index': int(idx),
            'title': row.get('title', 'N/A'),
            'body': row.get('body', 'N/A'),
            'text': row['text']
        }
        results.append(result)
    
    return results

def display_search_results(query: str, results: List[Dict]):
    """
    Display search results in a readable format.
    
    Args:
        query: The search query
        results: List of search results
    """
    print("🔍 SEMANTIC SEARCH RESULTS")
    print("=" * 80)
    print(f"\n📝 Query: '{query}'")
    print(f"\n🎯 Found {len(results)} most relevant issues:\n")
    
    for i, result in enumerate(results, 1):
        print(f"{'─' * 80}")
        print(f"\n{i}. Similarity Score: {result['score']:.4f}")
        print(f"   Title: {result['title']}")
        
        # Show first 200 characters of body
        body = result['body'] if result['body'] else ''
        body_preview = body[:200] + '...' if len(body) > 200 else body
        print(f"   Body: {body_preview}")
        print()

print("✅ Search functions defined")
print("📊 Functions: semantic_search() and display_search_results()")

In [None]:
# Example 1: Search for training-related issues
if FAISS_AVAILABLE:
    query1 = "How to train a model on custom dataset?"
    results1 = semantic_search(query1, k=5)
    display_search_results(query1, results1)

In [None]:
# Example 2: Search for tokenization issues
if FAISS_AVAILABLE:
    query2 = "Problems with tokenizer padding and truncation"
    results2 = semantic_search(query2, k=5)
    display_search_results(query2, results2)

In [None]:
# Example 3: Search for GPU/memory issues
if FAISS_AVAILABLE:
    query3 = "Out of memory error when using GPU"
    results3 = semantic_search(query3, k=5)
    display_search_results(query3, results3)

In [None]:
# Interactive search - Try your own queries!
if FAISS_AVAILABLE:
    print("🎮 Interactive Semantic Search")
    print("=" * 80)
    print("\n💡 Try searching with natural language queries!")
    print("💡 Examples:")
    print("   - 'How to save and load a trained model?'")
    print("   - 'Fine-tuning BERT for classification'")
    print("   - 'Error loading pretrained weights'")
    print("\n📝 Uncomment the code below and add your query:\n")
    
    # Uncomment and modify the query below to search:
    # custom_query = "Your search query here"
    # custom_results = semantic_search(custom_query, k=3)
    # display_search_results(custom_query, custom_results)

## Part 7: Visualizing Search Quality

Let's analyze the quality of our semantic search by examining similarity scores and comparing results.

In [None]:
# Analyze similarity score distribution
if FAISS_AVAILABLE:
    print("📊 Analyzing semantic search quality...\n")
    
    # Collect similarity scores from multiple queries
    test_queries = [
        "How to train a model?",
        "Tokenization problems",
        "GPU out of memory",
        "Model loading error",
        "Fine-tuning transformers"
    ]
    
    all_scores = []
    for query in test_queries:
        results = semantic_search(query, k=10)
        scores = [r['score'] for r in results]
        all_scores.extend(scores)
    
    # Plot similarity score distribution
    plt.figure(figsize=(12, 5))
    
    # Histogram
    plt.subplot(1, 2, 1)
    plt.hist(all_scores, bins=30, edgecolor='black', alpha=0.7)
    plt.xlabel('Similarity Score', fontsize=12)
    plt.ylabel('Frequency', fontsize=12)
    plt.title('Distribution of Similarity Scores', fontsize=14, fontweight='bold')
    plt.axvline(np.mean(all_scores), color='red', linestyle='--', linewidth=2, label=f'Mean: {np.mean(all_scores):.3f}')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # Box plot
    plt.subplot(1, 2, 2)
    plt.boxplot(all_scores, vert=True)
    plt.ylabel('Similarity Score', fontsize=12)
    plt.title('Similarity Score Statistics', fontsize=14, fontweight='bold')
    plt.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    
    # Print statistics
    print("\n📈 Similarity Score Statistics:")
    print(f"   Mean: {np.mean(all_scores):.4f}")
    print(f"   Median: {np.median(all_scores):.4f}")
    print(f"   Std Dev: {np.std(all_scores):.4f}")
    print(f"   Min: {np.min(all_scores):.4f}")
    print(f"   Max: {np.max(all_scores):.4f}")
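
Similarity scores describe how close the vectors are, not whether the results are actually relevant. If you have relevance labels for a few queries, precision@k and recall@k give a more direct measure. The sketch below assumes hand-labeled sets of relevant issue indices; the ids shown are made up and both helper functions are hypothetical.

```python
from typing import Sequence, Set

def precision_at_k(retrieved: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of the top-k retrieved ids that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(retrieved)[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k

def recall_at_k(retrieved: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of all relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    top_k = set(list(retrieved)[:k])
    return len(top_k & relevant) / len(relevant)

# Made-up example: ids returned by a search, and ids a human marked as relevant
retrieved_ids = [12, 7, 3, 44, 9]
relevant_ids = {7, 9, 100}

p = precision_at_k(retrieved_ids, relevant_ids, k=5)
r = recall_at_k(retrieved_ids, relevant_ids, k=5)
print(f"precision@5 = {p:.2f}, recall@5 = {r:.2f}")  # precision@5 = 0.40, recall@5 = 0.67
```

With the search function from Part 6, `retrieved_ids` could be built as `[r['index'] for r in semantic_search(query, k=5)]`.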

In [None]:
# Compare keyword vs semantic search
if FAISS_AVAILABLE:
    print("🔍 Comparing Keyword vs Semantic Search")
    print("=" * 80)
    
    comparison_query = "model training fails"
    
    print(f"\n📝 Query: '{comparison_query}'\n")
    
    # Semantic search results
    print("🧠 SEMANTIC SEARCH (Understanding Meaning):")
    print("─" * 80)
    semantic_results = semantic_search(comparison_query, k=3)
    for i, result in enumerate(semantic_results, 1):
        print(f"\n{i}. Score: {result['score']:.4f}")
        print(f"   Title: {result['title']}")
    
    # Simple keyword matching (for comparison)
    print("\n\n🔤 KEYWORD SEARCH (Exact Matching):")
    print("─" * 80)
    keywords = comparison_query.lower().split()
    keyword_matches = []
    
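    # Scan the full sample (the same documents the FAISS index covers) so the comparison is fair
    for idx, text in enumerate(github_issues_sample['text']):
        text_lower = text.lower()
        # Substring test: counts how many query words occur anywhere in the text
        matches = sum(1 for kw in keywords if kw in text_lower)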
        if matches > 0:
            keyword_matches.append((matches, idx))
    
    keyword_matches.sort(reverse=True)
    
    for i, (matches, idx) in enumerate(keyword_matches[:3], 1):
        print(f"\n{i}. Keyword matches: {matches}")
        print(f"   Title: {github_issues_sample[idx].get('title', 'N/A')}")
    
    print("\n\n💡 Key Difference:")
    print("   • Semantic search understands MEANING and context")
    print("   • Keyword search only finds EXACT word matches")
    print("   • Semantic search finds relevant results even with different wording")

## Part 8: Advanced FAISS Index Types

FAISS offers different index types optimized for different use cases. Let's explore some alternatives to `IndexFlatIP`.

In [None]:
if FAISS_AVAILABLE:
    print("🏗️ Comparing Different FAISS Index Types")
    print("=" * 80)
    
    import time
    
    # IndexFlatIP (Inner Product - Exact Search)
    print("\n1️⃣ IndexFlatIP (Current Index - Exact Search)")
    print("   • Uses inner product for similarity (cosine with normalized vectors)")
    print("   • Exact search: guarantees finding true nearest neighbors")
    print("   • Best for: Smaller datasets where exact results matter (search cost grows linearly with index size)")
    print(f"   • Current index size: {index.ntotal:,} vectors")
    
    # Test search speed
    test_query = "How to train a model?"
    query_embedding = encode_texts([test_query], batch_size=1).astype('float32')
    
    start = time.time()
    scores, indices = index.search(query_embedding, 5)
    flat_time = time.time() - start
    print(f"   • Search time: {flat_time*1000:.2f} ms")
    
    # IndexIVFFlat (Inverted File Index - Approximate Search)
    print("\n2️⃣ IndexIVFFlat (Approximate Search with Clustering)")
    print("   • Uses clustering to partition the vector space")
    print("   • Approximate search: trades accuracy for speed")
    print("   • Best for: Larger datasets where a small loss in recall is acceptable")
    
    # Create IVF index (only if dataset is large enough)
    if len(github_issues_sample) >= 100:
        nlist = min(100, len(github_issues_sample) // 10)  # Number of clusters
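        quantizer = faiss.IndexFlatIP(embedding_dim)
        # IndexIVFFlat defaults to L2 distance; request inner product to match IndexFlatIP
        index_ivf = faiss.IndexIVFFlat(quantizer, embedding_dim, nlist, faiss.METRIC_INNER_PRODUCT)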
        
        # Train the index (IVF requires training)
        index_ivf.train(embeddings_float32)
        index_ivf.add(embeddings_float32)
        index_ivf.nprobe = 10  # Number of clusters to search
        
        start = time.time()
        scores_ivf, indices_ivf = index_ivf.search(query_embedding, 5)
        ivf_time = time.time() - start
        
        print(f"   • Number of clusters: {nlist}")
        print(f"   • Search time: {ivf_time*1000:.2f} ms")
        print(f"   • Flat/IVF time ratio: {flat_time / max(ivf_time, 1e-9):.2f}x (values below 1 mean IVF was slower)")
    
    print("\n💡 Choosing the Right Index:")
    print("   • <10K vectors: IndexFlatIP (exact, fast enough)")
    print("   • 10K-1M vectors: IndexIVFFlat (good speed/accuracy balance)")
    print("   • >1M vectors: IndexIVFPQ (with quantization for compression)")
    print("   • GPU available: Use GPU indices for massive speedup")
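
Speed is only half of the trade-off: an approximate index can miss some of the true nearest neighbors. A simple way to quantify this is to compare the ids it returns against those from the exact index. The sketch below assumes two id arrays of shape `(num_queries, k)` as returned by `index.search()`; `neighbor_recall` is a hypothetical helper and the arrays are made up.

```python
import numpy as np

def neighbor_recall(exact_ids: np.ndarray, approx_ids: np.ndarray) -> float:
    """Average fraction of exact top-k neighbors that the approximate search also returned."""
    if exact_ids.shape != approx_ids.shape:
        raise ValueError("both id arrays must have the same shape")
    hits = 0
    for exact_row, approx_row in zip(exact_ids, approx_ids):
        # Drop -1 entries, which FAISS uses to pad missing results
        exact_set = set(exact_row.tolist()) - {-1}
        approx_set = set(approx_row.tolist()) - {-1}
        hits += len(exact_set & approx_set)
    return hits / exact_ids.size

# Made-up ids for one query with k=5: the approximate search missed two neighbors
exact = np.array([[1, 2, 3, 4, 5]])
approx = np.array([[1, 2, 3, 9, -1]])
print(neighbor_recall(exact, approx))  # 0.6
```

In this notebook the inputs would be `indices` from the flat index and `indices_ivf` from the IVF index. Raising `nprobe` typically increases recall at the cost of search time.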

## Part 9: Saving and Loading FAISS Index

For production use, you'll want to save your index and embeddings to avoid recomputing them.

In [None]:
if FAISS_AVAILABLE:
    import os
    import tempfile
    
    print("💾 Saving FAISS Index and Metadata")
    print("=" * 80)
    
    # Create temporary directory for saving
    temp_dir = tempfile.mkdtemp()
    index_path = os.path.join(temp_dir, "semantic_search.index")
    embeddings_path = os.path.join(temp_dir, "embeddings.npy")
    dataset_path = os.path.join(temp_dir, "dataset_sample")
    
    # Save FAISS index
    faiss.write_index(index, index_path)
    print(f"✅ FAISS index saved to: {index_path}")
    
    # Save embeddings
    np.save(embeddings_path, embeddings)
    print(f"✅ Embeddings saved to: {embeddings_path}")
    
    # Save dataset sample
    github_issues_sample.save_to_disk(dataset_path)
    print(f"✅ Dataset saved to: {dataset_path}")
    
    print(f"\n📊 Saved files:")
    print(f"   • Index size: {os.path.getsize(index_path) / 1024 / 1024:.2f} MB")
    print(f"   • Embeddings size: {os.path.getsize(embeddings_path) / 1024 / 1024:.2f} MB")
    
    # Demonstrate loading
    print("\n📥 Loading FAISS Index and Metadata")
    print("=" * 80)
    
    # Load index
    loaded_index = faiss.read_index(index_path)
    print(f"✅ Index loaded: {loaded_index.ntotal:,} vectors")
    
    # Load embeddings
    loaded_embeddings = np.load(embeddings_path)
    print(f"✅ Embeddings loaded: {loaded_embeddings.shape}")
    
    # Load dataset
    from datasets import load_from_disk
    loaded_dataset = load_from_disk(dataset_path)
    print(f"✅ Dataset loaded: {len(loaded_dataset):,} examples")
    
    # Verify loaded index works
    print("\n🔍 Testing loaded index...")
    test_query = "model training issues"
    query_emb = encode_texts([test_query], batch_size=1).astype('float32')
    scores, indices = loaded_index.search(query_emb, 3)
    
    print(f"✅ Loaded index is functional!")
    print(f"📊 Top result: {loaded_dataset[int(indices[0][0])].get('title', 'N/A')}")
    
    print("\n💡 In production:")
    print("   • Save index after initial embedding generation")
    print("   • Load index at startup for instant search capability")
    print("   • Update index periodically as new data arrives")
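
A saved index is only valid for queries encoded with the same model and the same normalization. One lightweight way to guard against mismatches is to store a small manifest next to the index. The sketch below is one possible layout, not a FAISS feature; the file name, field names, and both helper functions are hypothetical.

```python
import json
import os
import tempfile
from typing import Any, Dict

def write_manifest(path: str, model_name: str, embedding_dim: int,
                   num_vectors: int, normalized: bool) -> Dict[str, Any]:
    """Write a small JSON file describing how the saved index was built."""
    manifest = {
        "model_name": model_name,
        "embedding_dim": embedding_dim,
        "num_vectors": num_vectors,
        "normalized": normalized,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)
    return manifest

def check_manifest(path: str, expected_model: str, expected_dim: int) -> Dict[str, Any]:
    """Load the manifest and fail early if it does not match the query-time model."""
    with open(path, encoding="utf-8") as f:
        manifest = json.load(f)
    if manifest["model_name"] != expected_model:
        raise ValueError(f"Index was built with {manifest['model_name']}, not {expected_model}")
    if manifest["embedding_dim"] != expected_dim:
        raise ValueError(f"Index dimension is {manifest['embedding_dim']}, not {expected_dim}")
    return manifest

manifest_dir = tempfile.mkdtemp()
manifest_path = os.path.join(manifest_dir, "manifest.json")
write_manifest(manifest_path, "sentence-transformers/all-MiniLM-L6-v2", 384, 1000, True)

loaded = check_manifest(manifest_path, "sentence-transformers/all-MiniLM-L6-v2", 384)
print(loaded["num_vectors"])  # 1000
```

Calling `check_manifest()` at startup, before the first query, turns a silent quality problem (searching with the wrong model) into an immediate error.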

## Part 10: Best Practices and Performance Tips

Let's discuss best practices for building production-ready semantic search systems.

In [None]:
print("🎯 Semantic Search Best Practices")
print("=" * 80)

print("""
📚 1. MODEL SELECTION
   ✅ Use sentence-transformers models for semantic similarity
      • all-MiniLM-L6-v2: Fast, good quality (384 dim)
      • all-mpnet-base-v2: Best quality (768 dim)
      • multi-qa-MiniLM-L6-cos-v1: Optimized for Q&A
   
   ✅ Consider domain-specific models
      • Use models fine-tuned for your domain (legal, medical, etc.)
      • Test multiple models and evaluate on your data

🔧 2. EMBEDDING GENERATION
   ✅ Normalize embeddings for cosine similarity
      • Use L2 normalization: F.normalize(embeddings, p=2, dim=1)
      • Allows using IndexFlatIP for cosine similarity
   
   ✅ Batch processing for efficiency
      • Process in batches to utilize GPU effectively
      • Balance batch size with memory constraints
   
   ✅ Handle long texts properly
      • Truncate or chunk texts longer than model max length
      • Consider using sliding window for very long documents

🏗️ 3. INDEX SELECTION
   ✅ Choose index based on scale
      • <10K: IndexFlatIP (exact, fast)
      • 10K-1M: IndexIVFFlat (approximate, faster)
      • >1M: IndexIVFPQ (with compression)
   
   ✅ Consider GPU acceleration
      • Use faiss-gpu for large-scale applications
      • Speedups can be large but depend on hardware, index type, and batch size; benchmark on your own data

💾 4. DATA MANAGEMENT
   ✅ Save and version your indices
      • Avoid recomputing embeddings
      • Track which model version created embeddings
   
   ✅ Implement incremental updates
      • Add new vectors without rebuilding entire index
      • Periodic full rebuilds for optimization

🔍 5. SEARCH OPTIMIZATION
   ✅ Tune number of results (k)
      • Return more candidates than needed
      • Apply post-filtering and re-ranking
   
   ✅ Implement result filtering
      • Filter by metadata (date, category, etc.)
      • Use hybrid search (combine with keyword search)
   
   ✅ Add query preprocessing
      • Clean and normalize queries
      • Handle typos and variants

📊 6. EVALUATION AND MONITORING
   ✅ Measure search quality
      • Track precision@k and recall@k
      • Collect user feedback on results
   
   ✅ Monitor performance
      • Track search latency
      • Monitor memory usage
   
   ✅ A/B test improvements
      • Test different models and parameters
      • Measure impact on user satisfaction

🚀 7. PRODUCTION DEPLOYMENT
   ✅ Implement caching
      • Cache frequent queries
      • Cache embeddings for common items
   
   ✅ Scale horizontally
      • Shard index across multiple servers
      • Use load balancing for queries
   
   ✅ Handle edge cases
      • Empty queries
      • Very short or very long queries
      • Non-English text (if applicable)

💡 8. ADVANCED TECHNIQUES
   ✅ Hybrid search
      • Combine semantic and keyword search
      • Use BM25 + semantic for best results
   
   ✅ Re-ranking
      • Use cross-encoder for final ranking
      • Apply business logic and rules
   
   ✅ Query expansion
      • Generate query variations
      • Use synonyms and related terms
""")

print("✅ Best practices overview complete!")
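
Hybrid search needs some way to merge a keyword ranking with a semantic ranking. Reciprocal rank fusion (RRF) is one commonly used option because it works on ranks alone and so avoids comparing BM25 scores with cosine similarities. The sketch below is a minimal version; the function name is hypothetical, the rankings are made up, and `k=60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[int]], k: int = 60) -> List[int]:
    """Merge several ranked id lists; each appearance adds 1 / (k + rank) to an id's score."""
    scores: Dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a deterministic order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Made-up rankings of document ids from two retrievers
semantic_ranking = [3, 1, 2]
keyword_ranking = [1, 4, 3]

fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(fused)  # [1, 3, 4, 2]
```

Documents 1 and 3 appear in both lists and rise to the top, while documents found by only one retriever follow. The fused list could then be passed to a cross-encoder for final re-ranking.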

---

## 📋 Summary

### 🔑 Key Concepts Mastered
- **Semantic Search**: Using embeddings to find semantically similar content, not just keyword matches
- **Text Embeddings**: Converting text to dense vectors that capture semantic meaning using transformer models
- **FAISS Indexing**: Building efficient similarity search indices for fast retrieval at scale
- **Mean Pooling**: Aggregating token embeddings into sentence-level representations
- **Cosine Similarity**: Measuring similarity between normalized embeddings using inner product

### 📈 Best Practices Learned
- Always normalize embeddings when using cosine similarity for search
- Choose appropriate FAISS index type based on dataset size and accuracy requirements
- Process texts in batches for efficient GPU utilization
- Save and version indices to avoid recomputing embeddings
- Use sentence-transformer models optimized for semantic similarity tasks
- Set seed=16 for reproducible results (repository standard)
- Configure visualization style with darkgrid for better readability

### 🚀 Next Steps
- **Advanced Indexing**: Explore HNSW and product-quantized (IVFPQ) indices for larger datasets
- **Hybrid Search**: Combine semantic search with keyword-based BM25
- **Cross-Encoders**: Use re-ranking models for improved search quality
- **Domain Adaptation**: Fine-tune embedding models for specific domains
- **Multi-modal Search**: Extend to images, videos, and other modalities

### 📚 Further Reading
- [FAISS Documentation](https://faiss.ai/)
- [Sentence Transformers](https://www.sbert.net/)
- [HuggingFace Course - Semantic Search](https://huggingface.co/learn/llm-course/chapter5/6)
- [Dense Passage Retrieval (DPR)](https://arxiv.org/abs/2004.04906)

---

## About the Author

**Vu Hung Nguyen** - AI Engineer & Researcher

Connect with me:
- 🌐 **Website**: [vuhung16au.github.io](https://vuhung16au.github.io/)
- 💼 **LinkedIn**: [linkedin.com/in/nguyenvuhung](https://www.linkedin.com/in/nguyenvuhung/)
- 💻 **GitHub**: [github.com/vuhung16au](https://github.com/vuhung16au/)

*This notebook is part of the [HF Transformer Trove](https://github.com/vuhung16au/hf-transformer-trove) educational series.*