# Lab 3: Semantic Search & Data Catalogue RAG

**Data Discovery: Harnessing AI, AGI & Vector Databases - Day 2**

| Duration | Difficulty | Libraries | Exercises |
|---|---|---|---|
| 90 min | Intermediate | pandas, sentence-transformers, chromadb, rank_bm25, scikit-learn | 6 |

In this lab, you'll practice:
- Comparing embedding models for domain-specific retrieval
- Building a BM25 keyword search index
- Combining keyword and semantic search with hybrid fusion
- Evaluating search quality with precision@k
- Building a RAG pipeline over a data catalogue
- Expanding queries with embedding similarity

---

## Student Notes & Background

### Why Semantic Search Matters for Data Discovery

Traditional data catalogues rely on **keyword matching** — if you search for "salary data," you only find assets whose metadata literally contains the word "salary." But what about assets described as "compensation benchmarking" or "payroll deductions"? These are clearly relevant, yet a keyword search misses them entirely.

**Semantic search** solves this by converting text into dense numerical vectors (embeddings) that capture *meaning*, not just surface words. Two descriptions that are conceptually similar will have vectors that are close together in embedding space, even if they share no words in common.

### Key Concepts

#### 1. Embedding Models
An **embedding model** (e.g., `all-MiniLM-L6-v2` from Sentence-Transformers) takes a piece of text and maps it to a fixed-length vector, typically 384 or 768 dimensions (384 for `all-MiniLM-L6-v2`, 768 for `all-mpnet-base-v2`). These models are pre-trained on large text corpora and fine-tuned so that semantically similar texts produce similar vectors. Different models have different strengths: some excel at short queries, others at long documents, and some are tuned for specific domains.

**Cosine similarity** is the standard metric for comparing embeddings. It measures the angle between two vectors:
- **1.0** = identical direction (maximum similarity)
- **0.0** = orthogonal (no similarity)
- **-1.0** = opposite direction (though rare with modern embeddings)
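
The three reference points above can be checked on toy vectors. This is a minimal sketch using only the standard library; the helper name `cosine` and the example vectors are illustrative assumptions, and the lab itself relies on `cosine_similarity` from scikit-learn.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-D vectors (made up for illustration, not real embeddings)
same_direction = cosine([1.0, 2.0], [2.0, 4.0])    # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])        # perpendicular vectors
opposite = cosine([1.0, 2.0], [-1.0, -2.0])        # anti-parallel vectors

print(round(same_direction, 3), round(orthogonal, 3), round(opposite, 3))
```

Note that the magnitude of a vector does not matter: `[1, 2]` and `[2, 4]` point the same way, so their similarity is 1.0 up to floating-point rounding.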

#### 2. BM25 Keyword Search
**BM25 (Best Matching 25)** is a classical information retrieval algorithm that improves on simple TF-IDF. It scores documents by:
- **Term Frequency (TF):** How often the query term appears in the document (with diminishing returns)
- **Inverse Document Frequency (IDF):** Terms that appear in fewer documents are weighted higher
- **Document length normalisation:** Term frequency is scaled by document length relative to the corpus average, so long documents do not rank higher simply because they contain more words

BM25 excels at finding exact keyword matches and is extremely fast, but it cannot understand synonyms, paraphrases, or conceptual relationships.
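
To make the three ingredients concrete, here is a small from-scratch sketch of BM25 scoring. It is not the `rank_bm25` implementation used in the exercises: the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant are assumptions chosen for illustration, and library scores will differ in detail.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            # Smoothed IDF (assumed variant): rarer terms get larger weights
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Saturating term frequency with document length normalisation
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    'Payroll data with salary deductions and tax withholdings',
    'Compensation benchmarking data across industry roles',
    'Application server logs with error traces and stack dumps',
]
scores = bm25_scores('salary payroll', docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

The compensation description scores exactly zero because it shares no token with the query, which is the vocabulary gap that semantic search is meant to close.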

#### 3. Hybrid Search (Fusion)
Neither keyword nor semantic search is universally superior. **Hybrid search** combines both using weighted fusion:

```
final_score = α × BM25_normalised + (1 - α) × cosine_similarity
```

- **α = 0.0:** Pure semantic search
- **α = 0.5:** Equal weight (good default)
- **α = 1.0:** Pure keyword search

The optimal α depends on your data and use case. Hybrid search often outperforms either strategy alone, but the gain is not guaranteed, so measure it on your own queries (Exercise 2.2 does exactly this).
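
The fusion formula can be sketched on a handful of made-up scores. Min-max scaling is assumed here as the way to bring BM25 scores into [0, 1]; the helper names `min_max` and `fuse` and all numbers are hypothetical.

```python
def min_max(scores):
    """Scale scores to [0, 1]; return all zeros if every score is equal."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(keyword_scores, semantic_scores, alpha=0.5):
    """final_score = alpha * bm25_normalised + (1 - alpha) * cosine_similarity"""
    keyword_norm = min_max(keyword_scores)
    return [alpha * k + (1 - alpha) * s
            for k, s in zip(keyword_norm, semantic_scores)]

# Hypothetical scores for three documents (not produced by a real index)
keyword_raw = [4.2, 0.0, 1.1]     # doc 1 shares no keyword with the query
semantic = [0.55, 0.70, 0.10]     # doc 1 is the closest in meaning

best_by_alpha = {}
for alpha in (0.0, 0.5, 1.0):
    fused = fuse(keyword_raw, semantic, alpha)
    best_by_alpha[alpha] = max(range(len(fused)), key=fused.__getitem__)
    print(f"alpha={alpha}: scores={[round(f, 3) for f in fused]}")

print(best_by_alpha)
```

With these numbers, document 1 ranks first only under pure semantic search (α = 0.0); as soon as keyword evidence carries weight, document 0 takes over.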

#### 4. Precision@k
**Precision@k** measures what fraction of the top-k results are truly relevant:

```
Precision@k = (number of relevant results in top k) / k
```

For example, if you search for "employee data" and 3 of your top-5 results are from the HR category, your Precision@5 = 3/5 = 0.60.
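
The same calculation as a short function, reproducing the worked example above. The function name and the list of retrieved categories are illustrative assumptions.

```python
def precision_at_k(retrieved_categories, expected_category, k=5):
    """Fraction of the top-k results whose category matches the expected one."""
    top_k = retrieved_categories[:k]
    hits = sum(1 for cat in top_k if cat == expected_category)
    # Dividing by k (not len(top_k)) penalises searches that return too few results
    return hits / k

# Categories of the top-5 results for a hypothetical "employee data" query
retrieved = ['HR', 'Finance', 'HR', 'Legal', 'HR']
p_at_5 = precision_at_k(retrieved, 'HR', k=5)
print(p_at_5)  # 0.6
```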

#### 5. Retrieval-Augmented Generation (RAG)
**RAG** is a pattern that combines retrieval with language model generation:
1. **Retrieve** relevant documents using search (keyword, semantic, or hybrid)
2. **Augment** a prompt with the retrieved context
3. **Generate** an answer grounded in the retrieved facts

In this lab, we simulate the generation step with structured extraction since we don't have a live LLM, but the retrieval pipeline is identical to what you'd use in production.
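
The augment and simulated-generate steps might look like the sketch below. The retrieved records are hard-coded stand-ins for the output of a search step, and the helper names `build_prompt` and `summarise` as well as the prompt wording are assumptions, not part of the lab's required interface.

```python
from collections import Counter

def build_prompt(question, retrieved):
    """Augment: place the retrieved catalogue entries in front of the question."""
    context = "\n".join(
        f"- [{r['category']}] {r['description']} (sensitivity: {r['sensitivity']})"
        for r in retrieved
    )
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

def summarise(retrieved):
    """Simulated generation: a structured summary instead of LLM output."""
    return {
        'n_assets': len(retrieved),
        'categories': dict(Counter(r['category'] for r in retrieved)),
        'sensitivity': dict(Counter(r['sensitivity'] for r in retrieved)),
    }

# Stand-in for the retrieval step (hypothetical search results)
retrieved = [
    {'category': 'HR', 'sensitivity': 'Confidential',
     'description': 'Employee personal records including name address and date of birth'},
    {'category': 'HR', 'sensitivity': 'Restricted',
     'description': 'Payroll data with salary deductions and tax withholdings'},
]

prompt = build_prompt("What employee data assets contain personal information?", retrieved)
summary = summarise(retrieved)
print(prompt)
print(summary)
```

In a production pipeline the prompt string would be sent to a language model; here the structured summary plays the role of the generated answer.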

#### 6. Query Expansion
**Query expansion** improves recall by adding semantically related terms to the original query before searching. For example, expanding "salary" might add "compensation," "payroll," "remuneration," and "wages." This helps bridge vocabulary gaps between the user's query and the catalogue descriptions.
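
The mechanism can be illustrated with hand-made 2-D vectors standing in for real term embeddings. Everything here is a toy assumption: the vectors, the vocabulary, and the helper name `expand`. In the exercise, the vectors would come from the embedding model instead.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def expand(query, query_vector, term_vectors, n_expand=2):
    """Append the n_expand terms most similar to the query vector."""
    query_words = set(query.lower().split())
    candidates = [
        (cosine(query_vector, vec), term)
        for term, vec in term_vectors.items()
        if term not in query_words       # never re-add a word already in the query
    ]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    extra_terms = [term for _, term in candidates[:n_expand]]
    return " ".join([query] + extra_terms)

# Made-up vectors: first axis ~ "pay", second axis ~ "infrastructure"
toy_vectors = {
    'salary': [0.90, 0.10],
    'payroll': [0.80, 0.20],
    'compensation': [0.85, 0.15],
    'server': [0.10, 0.90],
    'logs': [0.05, 0.95],
}

expanded = expand('salary', toy_vectors['salary'], toy_vectors)
print(expanded)  # salary compensation payroll
```

The expanded string can then be passed to a keyword search such as BM25, which now has a chance of matching descriptions that never mention "salary".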

### What You'll Build

In this lab, you will:
1. **Compare** two embedding models (`all-MiniLM-L6-v2` vs `all-mpnet-base-v2`) by measuring how well each clusters descriptions from the same department
2. **Build** a BM25 keyword search index and test it with domain queries
3. **Implement** hybrid search that fuses BM25 and semantic scores with a tuneable α parameter
4. **Evaluate** search quality using Precision@k against known ground-truth category labels
5. **Build** a RAG-style pipeline that retrieves relevant catalogue entries and produces structured answers
6. **Implement** query expansion using embedding similarity over the corpus vocabulary

### Prerequisites
- Familiarity with pandas DataFrames and numpy arrays
- Basic understanding of cosine similarity (from Lab 1)
- Concepts from Lab 1: TF-IDF vectorisation, ChromaDB basics

### Tips
- The synthetic data uses a fixed random seed (`np.random.seed(42)`), so your results should be reproducible
- When comparing search strategies, pay attention to which categories each strategy retrieves — this reveals their strengths and weaknesses
- The α parameter shifts the balance between keyword and semantic evidence; try several values and watch how the top results change

---

## Setup

First, let's import the necessary libraries.

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter

# Embeddings & vector DB
from sentence_transformers import SentenceTransformer
import chromadb
from sklearn.metrics.pairwise import cosine_similarity

# Keyword search
from rank_bm25 import BM25Okapi

# Settings
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')

print("Libraries loaded successfully!")

## Part 1: Generate Synthetic Data Catalogue

We'll reuse the same 500-asset catalogue from Lab 1 so results are directly comparable.

In [None]:
np.random.seed(42)

categories = ['HR', 'Finance', 'Marketing', 'Engineering', 'Legal']
sources = ['PostgreSQL', 'S3 Bucket', 'SharePoint', 'Salesforce', 'MongoDB']
data_types = ['Table', 'Document', 'Spreadsheet', 'Log File', 'Report']
sensitivity_levels = ['Public', 'Internal', 'Confidential', 'Restricted']

descriptions_pool = {
    'HR': [
        'Employee personal records including name address and date of birth',
        'Annual performance review scores and manager feedback',
        'Payroll data with salary deductions and tax withholdings',
        'Recruitment pipeline tracking applicant status and interview notes',
        'Benefits enrollment records for health dental and vision plans',
        'Employee onboarding documentation and training completion',
        'Workforce diversity and inclusion metrics by department',
        'Time and attendance records with overtime calculations',
        'Employee termination records and exit interview summaries',
        'Compensation benchmarking data across industry roles',
    ],
    'Finance': [
        'Quarterly revenue reports broken down by business unit',
        'Accounts payable invoices and payment processing records',
        'Annual budget forecasts with departmental allocations',
        'Customer billing records including credit card transactions',
        'Expense reimbursement claims with receipt attachments',
        'General ledger entries and journal adjustments',
        'Tax filing documents and regulatory compliance records',
        'Cash flow projections and working capital analysis',
        'Vendor payment terms and contract financial summaries',
        'Audit trail logs for financial transaction approvals',
    ],
    'Marketing': [
        'Campaign performance metrics including click rates and conversions',
        'Customer segmentation profiles based on purchase behaviour',
        'Social media analytics with engagement and reach data',
        'Email marketing subscriber lists with opt-in preferences',
        'Brand sentiment analysis from customer reviews and surveys',
        'Website traffic analytics and user journey tracking',
        'Lead scoring models and marketing qualified lead reports',
        'Content calendar and editorial planning documents',
        'Competitive intelligence reports and market research data',
        'Event registration lists with attendee contact information',
    ],
    'Engineering': [
        'Application server logs with error traces and stack dumps',
        'CI/CD pipeline metrics including build times and failure rates',
        'Infrastructure monitoring data from cloud resources',
        'API usage statistics and rate limiting configurations',
        'Database schema documentation and migration scripts',
        'Code repository commit history and pull request reviews',
        'Load testing results and performance benchmarks',
        'Security vulnerability scan reports and remediation tracking',
        'Microservice dependency maps and architecture diagrams',
        'Incident response logs and post-mortem analysis documents',
    ],
    'Legal': [
        'Active contract repository with vendor agreements and SLAs',
        'Intellectual property filings including patents and trademarks',
        'Regulatory compliance audit findings and remediation plans',
        'Data processing agreements under GDPR Article 28',
        'Litigation case files and legal correspondence records',
        'Corporate governance meeting minutes and board resolutions',
        'Privacy impact assessments for new data processing activities',
        'Non-disclosure agreement tracking and expiration dates',
        'Employment law compliance documentation by jurisdiction',
        'Insurance policy records and claims history',
    ],
}

n_assets = 500
records = []

for i in range(n_assets):
    cat = np.random.choice(categories)
    desc = np.random.choice(descriptions_pool[cat])
    if np.random.random() < 0.3:
        desc += ' updated ' + np.random.choice(['weekly', 'monthly', 'quarterly', 'annually'])
    records.append({
        'asset_id': f'ASSET-{i+1:04d}',
        'name': f'{cat.lower()}_{np.random.choice(["report", "dataset", "log", "file", "table"])}_{i+1:04d}',
        'description': desc,
        'category': cat,
        'source': np.random.choice(sources),
        'data_type': np.random.choice(data_types),
        'sensitivity': np.random.choice(sensitivity_levels, p=[0.15, 0.35, 0.30, 0.20]),
        'owner': np.random.choice(['alice', 'bob', 'carol', 'dave', 'eve', None], p=[0.2, 0.2, 0.2, 0.2, 0.15, 0.05]),
        'row_count': np.random.randint(100, 1_000_000) if np.random.random() > 0.2 else None,
        'last_updated': pd.Timestamp('2023-01-01') + pd.Timedelta(days=int(np.random.randint(0, 730))),
    })

df = pd.DataFrame(records)
print(f"Generated {len(df)} data asset records")
df.head(10)

## Exercise 1.1: Multi-Strategy Embeddings

Different embedding models capture different aspects of text similarity. Compare two models on our data catalogue descriptions to see which produces better intra-category clustering.

**Your Task:** Load two SentenceTransformer models, encode all descriptions, and compare their intra-class similarity scores.
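
As a hint for the intra-class score, the sketch below computes the mean pairwise cosine similarity within each group of toy vectors. The vectors and helper names are hypothetical; in the exercise the vectors are the model's description embeddings, and a vectorised approach with scikit-learn's `cosine_similarity` would be the more practical route.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all unordered pairs (needs >= 2 vectors)."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy "embeddings" grouped by category (made up for illustration)
toy_embeddings = {
    'HR': [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]],       # tightly clustered
    'Legal': [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],    # spread out
}

intra = {cat: mean_pairwise_similarity(vecs) for cat, vecs in toy_embeddings.items()}
for cat, score in intra.items():
    print(f"{cat}: {score:.3f}")
```

A higher score means descriptions from the same category sit closer together, which is the property the exercise uses to compare the two models.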

In [None]:
def compare_embedding_models(df, model_names=('all-MiniLM-L6-v2', 'all-mpnet-base-v2')):
    """Compare embedding models by measuring intra-category cosine similarity.
    
    Steps:
    1. Load each SentenceTransformer model
    2. Encode all descriptions
    3. For each category, compute the mean cosine similarity between all pairs
    4. Plot a grouped bar chart comparing models across categories
    
    Returns: dict of {model_name: {category: mean_similarity}}
    """
    # YOUR CODE HERE
    pass

results = compare_embedding_models(df)

## Exercise 1.2: BM25 Keyword Search

BM25 is a classical information retrieval algorithm that ranks documents by term frequency and inverse document frequency. Build a BM25 index over the data catalogue.

**Your Task:** Tokenize descriptions, build a BM25 index, and implement a search function.

In [None]:
def build_bm25_index(df):
    """Build a BM25 index over data asset descriptions.
    
    Steps:
    1. Tokenize each description (lowercase, split on whitespace)
    2. Build BM25Okapi index from tokenized corpus
    
    Returns: (bm25, tokenized_corpus)
    """
    # YOUR CODE HERE
    pass

def bm25_search(query, bm25, df, top_k=5):
    """Search the BM25 index and return top-k results.
    
    Steps:
    1. Tokenize the query
    2. Get BM25 scores for all documents
    3. Return top-k results with asset_id, description, category, and score
    
    Returns: list of (asset_id, description, category, score) tuples
    """
    # YOUR CODE HERE
    pass

bm25_result = build_bm25_index(df)

# Test queries
test_queries = [
    "employee salary payroll",
    "customer billing credit card",
    "server logs monitoring",
]

if bm25_result:
    bm25, tokenized_corpus = bm25_result
    for q in test_queries:
        print(f"\nQuery: '{q}'")
        results = bm25_search(q, bm25, df)
        if results:
            for asset_id, desc, cat, score in results:
                print(f"  [{cat}] {asset_id}: {desc[:80]}... (score: {score:.3f})")

## Exercise 2.1: Hybrid Search

Combine BM25 keyword scores with semantic cosine similarity using weighted fusion. This hybrid approach captures both exact keyword matches and semantic meaning.

**Your Task:** Implement a hybrid search function that fuses BM25 and semantic scores.

In [None]:
def build_semantic_index(df, model_name='all-MiniLM-L6-v2'):
    """Build a semantic search index.
    
    Steps:
    1. Load SentenceTransformer model
    2. Encode all descriptions
    
    Returns: (model, embeddings)
    """
    # YOUR CODE HERE
    pass

def hybrid_search(query, bm25, model, embeddings, df, alpha=0.5, top_k=5):
    """Perform hybrid search combining BM25 and semantic similarity.
    
    Steps:
    1. Get BM25 scores and normalize to [0, 1]
    2. Get cosine similarity scores between query embedding and all doc embeddings
    3. Combine: final_score = alpha * bm25_norm + (1 - alpha) * cosine_sim
    4. Return top-k results
    
    Args:
        alpha: Weight for BM25 scores (0 = pure semantic, 1 = pure keyword)
    
    Returns: list of (asset_id, description, category, score) tuples
    """
    # YOUR CODE HERE
    pass

semantic_result = build_semantic_index(df)

# Test hybrid search with different alpha values
if bm25_result and semantic_result:
    model, embeddings = semantic_result
    query = "employee compensation and benefits"
    
    for alpha in [0.0, 0.3, 0.5, 0.7, 1.0]:
        print(f"\n--- alpha={alpha} ({'pure semantic' if alpha == 0 else 'pure BM25' if alpha == 1 else 'hybrid'}) ---")
        results = hybrid_search(query, bm25, model, embeddings, df, alpha=alpha)
        if results:
            for asset_id, desc, cat, score in results:
                print(f"  [{cat}] {desc[:70]}... (score: {score:.3f})")

## Exercise 2.2: Search Quality Evaluation

Measure search quality using precision@k, treating a result as relevant when its category matches the category expected for the query.

**Your Task:** Define ground truth mappings, compute precision@k for each search strategy, and compare them.

In [None]:
def evaluate_search_quality(bm25, model, embeddings, df, top_k=5):
    """Evaluate BM25, semantic, and hybrid search using precision@k.
    
    Ground truth queries (query -> expected category):
    - 'employee salary payroll tax' -> HR
    - 'revenue budget financial reports' -> Finance
    - 'campaign marketing customer engagement' -> Marketing
    - 'server logs CI/CD pipeline deployment' -> Engineering
    - 'contract compliance legal agreement' -> Legal
    
    Steps:
    1. For each query, run BM25, semantic, and hybrid search
    2. Compute precision@k = (# results in expected category) / k
    3. Plot grouped bar chart comparing strategies
    
    Returns: dict of {strategy: {query: precision}}
    """
    # YOUR CODE HERE
    pass

if bm25_result and semantic_result:
    eval_results = evaluate_search_quality(bm25, model, embeddings, df)

## Exercise 3.1: Build a Data Catalogue RAG Pipeline

Use retrieved context from hybrid search to answer natural language questions about the data catalogue. Since we don't have a live LLM, we'll build a structured extraction pipeline that summarizes the retrieved assets.

**Your Task:** Implement a RAG-style query function that retrieves relevant assets and produces a structured answer.

In [None]:
def rag_query(question, bm25, model, embeddings, df, top_k=5):
    """Answer a natural language question using retrieved data catalogue context.
    
    Steps:
    1. Run hybrid_search to retrieve top-k relevant assets
    2. Extract structured information from results:
       - Categories represented
       - Sensitivity levels
       - Data sources
       - Owners
    3. Format a structured answer summarizing findings
    
    Returns: formatted string answer
    """
    # YOUR CODE HERE
    pass

# Test RAG queries
test_questions = [
    "What employee data assets contain personal information?",
    "Which financial assets involve customer payment data?",
    "What engineering assets are available for monitoring?",
    "Are there any legal compliance documents in the catalogue?",
]

if bm25_result and semantic_result:
    for q in test_questions:
        print(f"\nQ: {q}")
        answer = rag_query(q, bm25, model, embeddings, df)
        if answer:
            print(answer)
        print("-" * 60)

## Exercise 3.2: Query Expansion

Improve retrieval by expanding the original query with semantically related terms from the corpus vocabulary.

**Your Task:** Implement query expansion using embedding similarity and show improved results.

In [None]:
def expand_query(query, model, corpus_terms, term_embeddings, n_expand=5):
    """Expand a query with semantically related terms from the corpus.
    
    Steps:
    1. Encode the query
    2. Compute cosine similarity between query and all corpus term embeddings
    3. Select top-n most similar terms not already in the query
    4. Append to the original query
    
    Returns: expanded query string
    """
    # YOUR CODE HERE
    pass

def build_term_index(df, model):
    """Build an index of unique terms and their embeddings.
    
    Steps:
    1. Extract all unique words from descriptions (lowercase, len > 3)
    2. Encode each term with the model
    
    Returns: (terms_list, term_embeddings)
    """
    # YOUR CODE HERE
    pass

# Build term index and test expansion
if semantic_result:
    term_result = build_term_index(df, model)
    if term_result:
        corpus_terms, term_embeddings = term_result
        
        test_queries = ["salary", "server monitoring", "contract"]
        for q in test_queries:
            expanded = expand_query(q, model, corpus_terms, term_embeddings)
            print(f"Original:  '{q}'")
            print(f"Expanded:  '{expanded}'")
            print()

## Summary

In this lab, you learned how to:

1. **Compare** embedding models to find the best fit for your domain
2. **Build** a BM25 keyword search index for exact term matching
3. **Combine** keyword and semantic search with weighted hybrid fusion
4. **Evaluate** search quality using precision@k against ground truth
5. **Build** a RAG pipeline that retrieves context to answer catalogue questions
6. **Expand** queries with semantically related terms for better recall

---

*Data Discovery: Harnessing AI, AGI & Vector Databases | AI Elevate*