# Lab 3: Semantic Search & Data Catalogue RAG (Solutions)

**Data Discovery: Harnessing AI, AGI & Vector Databases - Day 2**

| Duration | Difficulty | Framework | Exercises |
|---|---|---|---|
| 90 min | Intermediate | pandas, numpy, sentence-transformers, chromadb, scikit-learn, rank_bm25, matplotlib | 6 |

## Student Notes & Background

### Why Semantic Search Matters for Data Discovery

Traditional data catalogues rely on **keyword matching** — if you search for "salary data," you only find assets whose metadata literally contains the word "salary." But what about assets described as "compensation benchmarking" or "payroll deductions"? These are clearly relevant, yet a keyword search misses them entirely.

**Semantic search** solves this by converting text into dense numerical vectors (embeddings) that capture *meaning*, not just surface words. Two descriptions that are conceptually similar will have vectors that are close together in embedding space, even if they share no words in common.

### Key Concepts

#### 1. Embedding Models
An **embedding model** (e.g., `all-MiniLM-L6-v2` from Sentence-Transformers) takes a piece of text and maps it to a fixed-length vector, typically 384 or 768 dimensions. These models are pre-trained on large text corpora and fine-tuned so that semantically similar texts produce similar vectors. Different models have different strengths — some excel at short queries, others at long documents, and some are tuned for specific domains.

**Cosine similarity** is the standard metric for comparing embeddings. It measures the angle between two vectors:
- **1.0** = identical direction (maximum similarity)
- **0.0** = orthogonal (no similarity)
- **-1.0** = opposite direction (though rare with modern embeddings)
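
The three reference values above can be checked with a few lines of code. This is a minimal, dependency-free sketch; the function name `cosine` and the 2-D vectors are made up for illustration and are not real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-D vectors chosen by hand (illustrative only, not real embeddings)
same_direction = cosine([1.0, 2.0], [2.0, 4.0])    # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])        # perpendicular vectors
opposite = cosine([1.0, 2.0], [-1.0, -2.0])        # anti-parallel vectors

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Note that vector length does not matter: `[1, 2]` and `[2, 4]` score as identical because only the direction is compared.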

#### 2. BM25 Keyword Search
**BM25 (Best Matching 25)** is a classical information retrieval algorithm that improves on simple TF-IDF. It scores documents by:
- **Term Frequency (TF):** How often the query term appears in the document (with diminishing returns)
- **Inverse Document Frequency (IDF):** Terms that appear in fewer documents are weighted higher
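- **Document length normalisation:** Term frequency is scaled by the document's length relative to the corpus average, so long documents do not rank highly simply because they contain more words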

BM25 excels at finding exact keyword matches and is extremely fast, but it cannot understand synonyms, paraphrases, or conceptual relationships.
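
To make the three ingredients concrete, here is a from-scratch sketch of BM25 scoring. It is an illustration under stated assumptions, not the code inside `rank_bm25`: the function name `bm25_scores_sketch`, the parameter defaults `k1=1.5` and `b=0.75`, and the "+1" smoothed IDF (which keeps every term weight non-negative) are choices made here, and the library's exact IDF handling may differ.

```python
import math
from collections import Counter

def bm25_scores_sketch(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenised document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        # Length normalisation: above 1 for longer-than-average documents
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue  # term absent: contributes nothing
            # Smoothed IDF (assumed variant): rarer terms get larger weights
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Saturating TF: repeated occurrences give diminishing returns
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

toy_corpus = [
    "payroll data with salary deductions".split(),
    "compensation benchmarking data across industry roles".split(),
    "application server logs with error traces".split(),
]
toy_scores = bm25_scores_sketch("salary data".split(), toy_corpus)
print([round(s, 3) for s in toy_scores])
```

The second document is about compensation, yet it scores only through the shared word "data". That is the synonym gap described above.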

#### 3. Hybrid Search (Fusion)
Neither keyword nor semantic search is universally superior. **Hybrid search** combines both using weighted fusion:

```
final_score = α × semantic_normalised + (1 - α) × BM25_normalised
```

- **α = 0.0:** Pure keyword (BM25) search
- **α = 0.5:** Equal weight (a reasonable default)
- **α = 1.0:** Pure semantic search

This is the convention used by `hybrid_search` in Exercise 2.1: α is the weight on the semantic score. The optimal α depends on your data and use case. Hybrid search often matches or outperforms either strategy alone, but this is worth verifying on your own queries, as Exercise 2.2 does.
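
A small worked example may help build intuition for the fusion weight. The sketch below follows the convention of the `hybrid_search` solution in Exercise 2.1, where `alpha` is the weight on the semantic score. The helper names `min_max` and `fuse` and the raw scores are invented for illustration. As a simplifying assumption, both score lists are min-max scaled here, whereas the lab's `hybrid_search` divides BM25 scores by their maximum.

```python
def min_max(values):
    """Rescale a list of scores to [0, 1]; return zeros if all scores are equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def fuse(bm25_scores, semantic_scores, alpha):
    """Weighted fusion; alpha is the weight on the semantic score."""
    bm25_norm = min_max(bm25_scores)
    sem_norm = min_max(semantic_scores)
    return [alpha * s + (1 - alpha) * k for k, s in zip(bm25_norm, sem_norm)]

# Made-up raw scores for three documents (illustrative only)
raw_bm25 = [4.0, 0.0, 2.0]          # document 0 is the best keyword match
raw_semantic = [0.40, 0.70, 0.60]   # document 1 is the best semantic match

pure_keyword = fuse(raw_bm25, raw_semantic, alpha=0.0)
balanced = fuse(raw_bm25, raw_semantic, alpha=0.5)
pure_semantic = fuse(raw_bm25, raw_semantic, alpha=1.0)

for name, scores in [("alpha=0.0", pure_keyword), ("alpha=0.5", balanced), ("alpha=1.0", pure_semantic)]:
    best = scores.index(max(scores))
    print(name, [round(s, 3) for s in scores], "-> top document:", best)
```

With these numbers, document 2 is second-best under both individual strategies but ranks first at `alpha=0.5`, which is the kind of result fusion is meant to surface.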

#### 4. Precision@k
**Precision@k** measures what fraction of the top-k results are truly relevant:

```
Precision@k = (number of relevant results in top k) / k
```

For example, if you search for "employee data" and 3 of your top-5 results are from the HR category, your Precision@5 = 3/5 = 0.60.
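
The same arithmetic can be written as a short function. This is a simplified sketch that takes a plain list of category labels; the name `precision_at_k_labels` is hypothetical, and the solution in Exercise 2.2 works on result dictionaries instead.

```python
def precision_at_k_labels(retrieved_labels, relevant_label, k):
    """Fraction of the first k retrieved labels that equal relevant_label."""
    top_k = retrieved_labels[:k]
    return sum(1 for label in top_k if label == relevant_label) / k

# Categories of the top-5 results for the query "employee data" (made-up ranking)
top5 = ["HR", "Finance", "HR", "HR", "Legal"]
p5 = precision_at_k_labels(top5, "HR", 5)
print(p5)  # 0.6
```

Because the denominator is always `k`, a search that returns fewer than `k` results is penalised for the missing slots.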

#### 5. Retrieval-Augmented Generation (RAG)
**RAG** is a pattern that combines retrieval with language model generation:
1. **Retrieve** relevant documents using search (keyword, semantic, or hybrid)
2. **Augment** a prompt with the retrieved context
3. **Generate** an answer grounded in the retrieved facts

In this lab, we simulate the generation step with structured extraction since we don't have a live LLM, but the retrieval pipeline is identical to what you'd use in production.
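
The "augment" step is mostly string assembly. The sketch below shows one way to build a prompt from retrieved descriptions, using the same `Context / Question / Answer` layout that appears in the `generate_answer` docstring later in this lab. The function name `build_prompt` and the sample question are assumptions made for illustration, and no LLM is called.

```python
def build_prompt(question, retrieved_docs):
    """Assemble a prompt that grounds the question in retrieved context."""
    context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Two descriptions taken from the lab's catalogue, standing in for search results
retrieved = [
    "Payroll data with salary deductions and tax withholdings",
    "Compensation benchmarking data across industry roles",
]
prompt = build_prompt("Which assets hold salary information?", retrieved)
print(prompt)
```

In a production pipeline, this string would be sent to the language model, and the quality of the answer would depend heavily on what the retrieval step placed in the context.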

#### 6. Query Expansion
**Query expansion** improves recall by adding semantically related terms to the original query before searching. For example, expanding "salary" might add "compensation," "payroll," "remuneration," and "wages." This helps bridge vocabulary gaps between the user's query and the catalogue descriptions.
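
The idea can be sketched with a toy vocabulary. The 2-D vectors below are hand-made stand-ins for embeddings (hypothetical values, chosen so that pay-related words point one way and infrastructure words another), and the names `toy_vocab` and `expand` are illustrative. Exercise 3.2 applies the same ranking step to real model embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hand-made 2-D "embeddings" (hypothetical values, for illustration only)
toy_vocab = {
    "salary": [0.90, 0.10],
    "compensation": [0.85, 0.20],
    "payroll": [0.80, 0.30],
    "server": [0.10, 0.90],
    "logs": [0.05, 0.95],
}

def expand(query_term, vocab, n_terms=2):
    """Return the n_terms vocabulary words most similar to query_term."""
    query_vec = vocab[query_term]
    candidates = [
        (term, cosine(query_vec, vec))
        for term, vec in vocab.items()
        if term != query_term  # never add the query word itself
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in candidates[:n_terms]]

expanded_terms = expand("salary", toy_vocab)
expanded_query = "salary " + " ".join(expanded_terms)
print(expanded_query)
```

The expanded query now contains words that a keyword index can match directly, which is why expansion tends to help the BM25 side of a hybrid search most.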

### What You'll Build

In this lab, you will:
1. **Compare** two embedding models (`all-MiniLM-L6-v2` vs `all-mpnet-base-v2`) by measuring how tightly each clusters descriptions from the same category
2. **Build** a BM25 keyword search index and test it with domain queries
3. **Implement** hybrid search that fuses BM25 and semantic scores with a tuneable α parameter
4. **Evaluate** search quality using Precision@k against known ground-truth category labels
5. **Build** a RAG-style pipeline that retrieves relevant catalogue entries and produces structured answers
6. **Implement** query expansion using embedding similarity over the corpus vocabulary

### Prerequisites
- Familiarity with pandas DataFrames and numpy arrays
- Basic understanding of cosine similarity (from Lab 1)
- Concepts from Lab 1: TF-IDF vectorisation, ChromaDB basics

### Tips
- The synthetic data uses a fixed random seed (`np.random.seed(42)`), so the generated catalogue is reproducible; embedding-based scores may still differ slightly across library versions and hardware
- When comparing search strategies, pay attention to which categories each strategy retrieves — this reveals their strengths and weaknesses
- The α parameter in hybrid search is powerful — experiment with different values to build intuition

---

## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from sentence_transformers import SentenceTransformer
import chromadb
from sklearn.metrics.pairwise import cosine_similarity
from rank_bm25 import BM25Okapi
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
print("Libraries loaded successfully!")

## Part 1: Generate Synthetic Data Catalogue

We reuse the **exact same 500-asset catalogue** from Lab 1 so that results are
directly comparable.  The random seed, categories, description pools, and
generation logic are identical.

In [None]:
np.random.seed(42)

categories = ['HR', 'Finance', 'Marketing', 'Engineering', 'Legal']
sources = ['PostgreSQL', 'S3 Bucket', 'SharePoint', 'Salesforce', 'MongoDB']
data_types = ['Table', 'Document', 'Spreadsheet', 'Log File', 'Report']
sensitivity_levels = ['Public', 'Internal', 'Confidential', 'Restricted']

descriptions_pool = {
    'HR': [
        'Employee personal records including name address and date of birth',
        'Annual performance review scores and manager feedback',
        'Payroll data with salary deductions and tax withholdings',
        'Recruitment pipeline tracking applicant status and interview notes',
        'Benefits enrollment records for health dental and vision plans',
        'Employee onboarding documentation and training completion',
        'Workforce diversity and inclusion metrics by department',
        'Time and attendance records with overtime calculations',
        'Employee termination records and exit interview summaries',
        'Compensation benchmarking data across industry roles',
    ],
    'Finance': [
        'Quarterly revenue reports broken down by business unit',
        'Accounts payable invoices and payment processing records',
        'Annual budget forecasts with departmental allocations',
        'Customer billing records including credit card transactions',
        'Expense reimbursement claims with receipt attachments',
        'General ledger entries and journal adjustments',
        'Tax filing documents and regulatory compliance records',
        'Cash flow projections and working capital analysis',
        'Vendor payment terms and contract financial summaries',
        'Audit trail logs for financial transaction approvals',
    ],
    'Marketing': [
        'Campaign performance metrics including click rates and conversions',
        'Customer segmentation profiles based on purchase behaviour',
        'Social media analytics with engagement and reach data',
        'Email marketing subscriber lists with opt-in preferences',
        'Brand sentiment analysis from customer reviews and surveys',
        'Website traffic analytics and user journey tracking',
        'Lead scoring models and marketing qualified lead reports',
        'Content calendar and editorial planning documents',
        'Competitive intelligence reports and market research data',
        'Event registration lists with attendee contact information',
    ],
    'Engineering': [
        'Application server logs with error traces and stack dumps',
        'CI/CD pipeline metrics including build times and failure rates',
        'Infrastructure monitoring data from cloud resources',
        'API usage statistics and rate limiting configurations',
        'Database schema documentation and migration scripts',
        'Code repository commit history and pull request reviews',
        'Load testing results and performance benchmarks',
        'Security vulnerability scan reports and remediation tracking',
        'Microservice dependency maps and architecture diagrams',
        'Incident response logs and post-mortem analysis documents',
    ],
    'Legal': [
        'Active contract repository with vendor agreements and SLAs',
        'Intellectual property filings including patents and trademarks',
        'Regulatory compliance audit findings and remediation plans',
        'Data processing agreements under GDPR Article 28',
        'Litigation case files and legal correspondence records',
        'Corporate governance meeting minutes and board resolutions',
        'Privacy impact assessments for new data processing activities',
        'Non-disclosure agreement tracking and expiration dates',
        'Employment law compliance documentation by jurisdiction',
        'Insurance policy records and claims history',
    ],
}

n_assets = 500
records = []

for i in range(n_assets):
    cat = np.random.choice(categories)
    desc = np.random.choice(descriptions_pool[cat])
    if np.random.random() < 0.3:
        desc += ' updated ' + np.random.choice(['weekly', 'monthly', 'quarterly', 'annually'])
    records.append({
        'asset_id': f'ASSET-{i+1:04d}',
        'name': f'{cat.lower()}_{np.random.choice(["report", "dataset", "log", "file", "table"])}_{i+1:04d}',
        'description': desc,
        'category': cat,
        'source': np.random.choice(sources),
        'data_type': np.random.choice(data_types),
        'sensitivity': np.random.choice(sensitivity_levels, p=[0.15, 0.35, 0.30, 0.20]),
        'owner': np.random.choice(['alice', 'bob', 'carol', 'dave', 'eve', None], p=[0.2, 0.2, 0.2, 0.2, 0.15, 0.05]),
        'row_count': np.random.randint(100, 1_000_000) if np.random.random() > 0.2 else None,
        'last_updated': pd.Timestamp('2023-01-01') + pd.Timedelta(days=int(np.random.randint(0, 730))),
    })

df = pd.DataFrame(records)
print(f"Generated {len(df)} data asset records")
print(f"\nCategory distribution:")
print(df['category'].value_counts())
df.head(10)

## Exercise 1.1: Multi-Strategy Embeddings - SOLUTION

Compare two sentence-transformer models on our catalogue descriptions:
- **all-MiniLM-L6-v2** (lightweight, 384-dim)
- **all-mpnet-base-v2** (higher quality, 768-dim)

We measure **intra-class cosine similarity** (how close descriptions within the
same category are to each other) to see which model produces tighter clusters.

In [None]:
# Load both embedding models
model_mini = SentenceTransformer('all-MiniLM-L6-v2')
model_mpnet = SentenceTransformer('all-mpnet-base-v2')

descriptions = df['description'].tolist()

print("Encoding descriptions with all-MiniLM-L6-v2...")
emb_mini = model_mini.encode(descriptions, show_progress_bar=True)

print("Encoding descriptions with all-mpnet-base-v2...")
emb_mpnet = model_mpnet.encode(descriptions, show_progress_bar=True)

print(f"\nMiniLM embeddings shape:  {emb_mini.shape}")
print(f"MPNet embeddings shape:   {emb_mpnet.shape}")

# Compute intra-class similarity for each model
def compute_intra_class_similarity(embeddings, labels):
    """Compute average cosine similarity among items in the same class."""
    unique_labels = sorted(set(labels))
    results = {}
    for label in unique_labels:
        mask = [i for i, l in enumerate(labels) if l == label]
        class_emb = embeddings[mask]
        sim_matrix = cosine_similarity(class_emb)
        # Take upper triangle (exclude diagonal)
        n = len(mask)
        upper_tri = sim_matrix[np.triu_indices(n, k=1)]
        results[label] = float(np.mean(upper_tri))
    return results

labels = df['category'].tolist()
sim_mini = compute_intra_class_similarity(emb_mini, labels)
sim_mpnet = compute_intra_class_similarity(emb_mpnet, labels)

print("\nIntra-class cosine similarity (higher = tighter clusters):")
print(f"{'Category':<15} {'MiniLM':>10} {'MPNet':>10}")
print("-" * 37)
for cat in categories:
    print(f"{cat:<15} {sim_mini[cat]:>10.4f} {sim_mpnet[cat]:>10.4f}")
print(f"{'Average':<15} {np.mean(list(sim_mini.values())):>10.4f} {np.mean(list(sim_mpnet.values())):>10.4f}")

# Bar chart comparison
fig, ax = plt.subplots(figsize=(10, 6))
x = np.arange(len(categories))
width = 0.35
bars1 = ax.bar(x - width/2, [sim_mini[c] for c in categories], width,
               label='all-MiniLM-L6-v2', color='#3b82f6')
bars2 = ax.bar(x + width/2, [sim_mpnet[c] for c in categories], width,
               label='all-mpnet-base-v2', color='#10b981')
ax.set_xlabel('Category')
ax.set_ylabel('Intra-class Cosine Similarity')
ax.set_title('Embedding Model Comparison: Intra-class Similarity')
ax.set_xticks(x)
ax.set_xticklabels(categories)
ax.legend()
ax.set_ylim(0, 1)
plt.tight_layout()
plt.show()

## Exercise 1.2: BM25 Keyword Search - SOLUTION

Build a **BM25Okapi** index over the catalogue descriptions.  BM25 is a
classic term-frequency-based ranking function that excels at exact keyword
matches and is complementary to dense semantic search.

In [None]:
# Tokenize descriptions for BM25
tokenized_corpus = [desc.lower().split() for desc in descriptions]

# Build BM25 index
bm25 = BM25Okapi(tokenized_corpus)
print(f"BM25 index built over {len(tokenized_corpus)} documents")
print(f"Average document length: {np.mean([len(d) for d in tokenized_corpus]):.1f} tokens")

def bm25_search(query, top_k=10):
    """Search the catalogue using BM25 keyword matching."""
    tokenized_query = query.lower().split()
    scores = bm25.get_scores(tokenized_query)
    top_indices = np.argsort(scores)[::-1][:top_k]
    results = []
    for idx in top_indices:
        results.append({
            'rank': len(results) + 1,
            'asset_id': df.iloc[idx]['asset_id'],
            'category': df.iloc[idx]['category'],
            'description': df.iloc[idx]['description'],
            'score': scores[idx],
        })
    return results

# Test queries
test_queries = [
    "employee salary payroll",
    "revenue budget financial",
    "server logs monitoring",
    "contract compliance GDPR",
    "campaign click conversion",
]

for query in test_queries:
    results = bm25_search(query, top_k=5)
    print(f"\nQuery: '{query}'")
    print("-" * 70)
    for r in results:
        print(f"  {r['rank']}. [{r['category']:<12}] {r['description'][:65]}... (score: {r['score']:.3f})")

## Exercise 2.1: Hybrid Search - SOLUTION

Combine **BM25 scores** (keyword relevance) with **cosine similarity** (semantic
relevance) using a weighted fusion parameter `alpha`:

$$\text{hybrid\_score} = \alpha \cdot \text{norm\_semantic} + (1 - \alpha) \cdot \text{norm\_bm25}$$

where `alpha=1.0` is pure semantic and `alpha=0.0` is pure BM25.

In [None]:
# Use MiniLM embeddings for the hybrid search (faster, good quality)
catalogue_embeddings = emb_mini

def hybrid_search(query, alpha=0.5, top_k=10):
    """Combine BM25 keyword scores with semantic cosine similarity.
    
    Args:
        query: Natural language search query
        alpha: Weight for semantic score (0=pure BM25, 1=pure semantic)
        top_k: Number of results to return
    
    Returns:
        List of result dicts with hybrid scores
    """
    # BM25 scores
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    
    # Normalise BM25 scores to [0, 1]
    bm25_max = bm25_scores.max()
    if bm25_max > 0:
        bm25_norm = bm25_scores / bm25_max
    else:
        bm25_norm = bm25_scores
    
    # Semantic cosine similarity scores
    query_emb = model_mini.encode([query])
    semantic_scores = cosine_similarity(query_emb, catalogue_embeddings)[0]
    
    # Normalise semantic scores to [0, 1]
    sem_min = semantic_scores.min()
    sem_max = semantic_scores.max()
    if sem_max > sem_min:
        semantic_norm = (semantic_scores - sem_min) / (sem_max - sem_min)
    else:
        semantic_norm = semantic_scores
    
    # Weighted fusion
    hybrid_scores = alpha * semantic_norm + (1 - alpha) * bm25_norm
    
    # Rank by hybrid score
    top_indices = np.argsort(hybrid_scores)[::-1][:top_k]
    results = []
    for idx in top_indices:
        results.append({
            'rank': len(results) + 1,
            'asset_id': df.iloc[idx]['asset_id'],
            'category': df.iloc[idx]['category'],
            'description': df.iloc[idx]['description'],
            'hybrid_score': hybrid_scores[idx],
            'semantic_score': semantic_scores[idx],
            'bm25_score': bm25_scores[idx],
        })
    return results

# Demonstrate hybrid search with different alpha values
demo_query = "employee salary compensation data"

for alpha in [0.0, 0.5, 1.0]:
    label = {0.0: 'Pure BM25', 0.5: 'Hybrid (50/50)', 1.0: 'Pure Semantic'}[alpha]
    results = hybrid_search(demo_query, alpha=alpha, top_k=5)
    print(f"\n{'='*70}")
    print(f"Query: '{demo_query}' | Strategy: {label} (alpha={alpha})")
    print(f"{'='*70}")
    for r in results:
        print(f"  {r['rank']}. [{r['category']:<12}] {r['description'][:55]}...")
        print(f"     hybrid={r['hybrid_score']:.3f}  sem={r['semantic_score']:.3f}  bm25={r['bm25_score']:.3f}")

## Exercise 2.2: Search Quality Evaluation - SOLUTION

Measure **precision@k** for each search strategy.  We define a ground-truth
mapping from queries to expected categories and check how many of the top-k
results belong to the correct category.

In [None]:
# Ground truth: queries mapped to their expected relevant category
ground_truth = {
    "employee salary payroll": "HR",
    "performance review feedback": "HR",
    "recruitment hiring interview": "HR",
    "revenue budget quarterly": "Finance",
    "invoice payment accounts": "Finance",
    "tax filing compliance": "Finance",
    "campaign click conversion": "Marketing",
    "customer segmentation purchase": "Marketing",
    "social media engagement": "Marketing",
    "server logs error monitoring": "Engineering",
    "CI/CD pipeline build deployment": "Engineering",
    "API usage rate limiting": "Engineering",
    "contract vendor agreement SLA": "Legal",
    "GDPR data processing privacy": "Legal",
    "patent trademark intellectual property": "Legal",
}

def precision_at_k(results, relevant_category, k):
    """Compute precision@k: fraction of top-k results in the relevant category."""
    top_k_results = results[:k]
    relevant_count = sum(1 for r in top_k_results if r['category'] == relevant_category)
    return relevant_count / k

def semantic_search_as_results(query, top_k=10):
    """Pure semantic search returning results in the same format."""
    query_emb = model_mini.encode([query])
    scores = cosine_similarity(query_emb, catalogue_embeddings)[0]
    top_indices = np.argsort(scores)[::-1][:top_k]
    results = []
    for idx in top_indices:
        results.append({
            'rank': len(results) + 1,
            'asset_id': df.iloc[idx]['asset_id'],
            'category': df.iloc[idx]['category'],
            'description': df.iloc[idx]['description'],
            'score': scores[idx],
        })
    return results

# Evaluate all three strategies
strategies = {
    'BM25': lambda q, k: bm25_search(q, top_k=k),
    'Semantic': lambda q, k: semantic_search_as_results(q, top_k=k),
    'Hybrid': lambda q, k: hybrid_search(q, alpha=0.5, top_k=k),
}

k_values = [5, 10]
eval_results = {strat: {f'P@{k}': [] for k in k_values} for strat in strategies}

for query, relevant_cat in ground_truth.items():
    for strat_name, search_fn in strategies.items():
        results = search_fn(query, max(k_values))
        for k in k_values:
            p_at_k = precision_at_k(results, relevant_cat, k)
            eval_results[strat_name][f'P@{k}'].append(p_at_k)

# Compute averages
print(f"Search Quality Evaluation (averaged over {len(ground_truth)} queries)")
print("=" * 50)
print(f"{'Strategy':<12} {'P@5':>8} {'P@10':>8}")
print("-" * 30)
avg_scores = {}
for strat_name in strategies:
    p5 = np.mean(eval_results[strat_name]['P@5'])
    p10 = np.mean(eval_results[strat_name]['P@10'])
    avg_scores[strat_name] = {'P@5': p5, 'P@10': p10}
    print(f"{strat_name:<12} {p5:>8.3f} {p10:>8.3f}")

# Plot comparison
fig, ax = plt.subplots(figsize=(10, 6))
x = np.arange(len(strategies))
width = 0.35
strat_names = list(strategies.keys())
bars1 = ax.bar(x - width/2, [avg_scores[s]['P@5'] for s in strat_names], width,
               label='Precision@5', color='#3b82f6')
bars2 = ax.bar(x + width/2, [avg_scores[s]['P@10'] for s in strat_names], width,
               label='Precision@10', color='#10b981')

ax.set_xlabel('Search Strategy')
ax.set_ylabel('Precision')
ax.set_title('Search Quality: Precision@K Comparison')
ax.set_xticks(x)
ax.set_xticklabels(strat_names)
ax.legend()
ax.set_ylim(0, 1.05)

# Add value labels on bars
for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f'{height:.2f}',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 3), textcoords="offset points",
                    ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

## Exercise 3.1: Build a Data Catalogue RAG Pipeline - SOLUTION

Implement a **Retrieval-Augmented Generation (RAG)** pipeline that:
1. Accepts a natural-language question about the data catalogue
2. Retrieves the top-5 most relevant assets via hybrid search
3. Formats them as context
4. Generates a structured answer summarising the findings

> **Note:** Since we do not have an LLM endpoint in this lab environment, we
> simulate the generation step by programmatically extracting and summarising
> key information from the retrieved documents. In production, you would pass
> the context to an LLM (e.g. GPT-4, Claude, Llama) for free-form answering.

In [None]:
def format_context(results):
    """Format retrieved results into a context string for the RAG pipeline."""
    context_parts = []
    for r in results:
        context_parts.append(
            f"[{r['asset_id']}] Category: {r['category']} | "
            f"Description: {r['description']}"
        )
    return "\n".join(context_parts)


def generate_answer(question, results):
    """Simulate LLM generation by summarising retrieved documents.
    
    In production, replace this with an actual LLM call:
        prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        answer = llm.generate(prompt)
    """
    # Extract statistics from retrieved results
    categories_found = Counter(r['category'] for r in results)
    unique_descriptions = list(dict.fromkeys(r['description'] for r in results))
    asset_ids = [r['asset_id'] for r in results]
    avg_score = np.mean([r.get('hybrid_score', 0) for r in results])
    
    # Build structured answer
    answer_parts = [
        f"Based on {len(results)} retrieved data assets, here is a summary:\n",
        f"**Relevant Categories:** {', '.join(f'{cat} ({cnt})' for cat, cnt in categories_found.most_common())}\n",
        f"**Matching Assets:** {', '.join(asset_ids)}\n",
        f"**Key Findings:**",
    ]
    for i, desc in enumerate(unique_descriptions, 1):
        answer_parts.append(f"  {i}. {desc}")
    
    answer_parts.append(f"\n**Average Relevance Score:** {avg_score:.3f}")
    answer_parts.append(f"**Recommendation:** Focus on the {categories_found.most_common(1)[0][0]} "
                        f"category assets which dominate the results for this query.")
    
    return "\n".join(answer_parts)


def rag_query(question, alpha=0.5, top_k=5):
    """Full RAG pipeline: retrieve -> format context -> generate answer."""
    # Step 1: Retrieve
    results = hybrid_search(question, alpha=alpha, top_k=top_k)
    
    # Step 2: Format context
    context = format_context(results)
    
    # Step 3: Generate answer
    answer = generate_answer(question, results)
    
    return {
        'question': question,
        'context': context,
        'answer': answer,
        'results': results,
    }


# Test the RAG pipeline with several questions
questions = [
    "What data assets contain employee salary information?",
    "Which datasets track marketing campaign performance?",
    "Are there any assets related to regulatory compliance or auditing?",
    "What engineering monitoring and observability data do we have?",
]

for question in questions:
    result = rag_query(question)
    print(f"\n{'='*80}")
    print(f"QUESTION: {result['question']}")
    print(f"{'='*80}")
    print(f"\n--- Retrieved Context ---")
    print(result['context'])
    print(f"\n--- Generated Answer ---")
    print(result['answer'])
    print()

## Exercise 3.2: Query Expansion - SOLUTION

Improve retrieval by **expanding** the original query with semantically related
terms drawn from the catalogue vocabulary.  We:
1. Build a vocabulary from all unique words in descriptions
2. Embed the vocabulary with the same model
3. Find words most similar to the query to generate expansion terms
4. Append expansion terms to the original query and re-search

In [None]:
# Build vocabulary from corpus
all_words = set()
for desc in descriptions:
    for word in desc.lower().split():
        # Keep only alphabetic tokens >= 3 chars
        cleaned = ''.join(c for c in word if c.isalpha())
        if len(cleaned) >= 3:
            all_words.add(cleaned)

vocab_list = sorted(all_words)
print(f"Vocabulary size: {len(vocab_list)} unique terms")

# Embed the vocabulary
print("Encoding vocabulary...")
vocab_embeddings = model_mini.encode(vocab_list, show_progress_bar=True)
print(f"Vocabulary embeddings shape: {vocab_embeddings.shape}")


def expand_query(query, n_expansion_terms=5):
    """Expand a query with semantically related terms from the corpus vocabulary."""
    # Encode the query
    query_emb = model_mini.encode([query])
    
    # Compute similarity to all vocab terms
    similarities = cosine_similarity(query_emb, vocab_embeddings)[0]
    
    # Get query words to exclude them from expansion, cleaned the same way as the vocabulary
    query_words = {''.join(c for c in w if c.isalpha()) for w in query.lower().split()}
    
    # Rank vocab by similarity, skip words already in query
    ranked_indices = np.argsort(similarities)[::-1]
    expansion_terms = []
    for idx in ranked_indices:
        term = vocab_list[idx]
        if term not in query_words and similarities[idx] > 0.1:
            expansion_terms.append((term, similarities[idx]))
        if len(expansion_terms) >= n_expansion_terms:
            break
    
    return expansion_terms


def expanded_hybrid_search(query, alpha=0.5, top_k=10, n_expansion=5):
    """Perform hybrid search with query expansion."""
    # Get expansion terms
    expansion = expand_query(query, n_expansion_terms=n_expansion)
    expansion_words = [term for term, score in expansion]
    expanded_query = query + " " + " ".join(expansion_words)
    
    return expanded_query, expansion, hybrid_search(expanded_query, alpha=alpha, top_k=top_k)


# Demonstrate query expansion and improved retrieval
demo_queries = [
    "employee records",
    "financial audit",
    "cloud infrastructure",
]

for query in demo_queries:
    print(f"\n{'='*80}")
    print(f"Original query: '{query}'")
    
    # Expand once and reuse the returned terms (avoids encoding the query twice)
    expanded_query, expansion, expanded_results = expanded_hybrid_search(query, top_k=5)
    print(f"Expansion terms: {', '.join(f'{t} ({s:.3f})' for t, s in expansion)}")
    print(f"Expanded query:  '{expanded_query}'")
    
    # Compare original vs expanded results
    original_results = hybrid_search(query, alpha=0.5, top_k=5)
    
    print(f"\n  {'Original Results':<40} | {'Expanded Results':<40}")
    print(f"  {'-'*40} | {'-'*40}")
    for orig, exp in zip(original_results, expanded_results):
        orig_str = f"[{orig['category']:<10}] {orig['description'][:25]}..."
        exp_str = f"[{exp['category']:<10}] {exp['description'][:25]}..."
        print(f"  {orig_str:<40} | {exp_str:<40}")

# Quantitative evaluation: compare precision with and without expansion
print(f"\n\n{'='*80}")
print("Precision@5 Comparison: Standard vs Query-Expanded Hybrid Search")
print(f"{'='*80}")

p5_standard = []
p5_expanded = []

for query, relevant_cat in ground_truth.items():
    # Standard hybrid
    std_results = hybrid_search(query, alpha=0.5, top_k=5)
    p5_std = precision_at_k(std_results, relevant_cat, 5)
    p5_standard.append(p5_std)
    
    # Expanded hybrid
    _, _, exp_results = expanded_hybrid_search(query, alpha=0.5, top_k=5)
    p5_exp = precision_at_k(exp_results, relevant_cat, 5)
    p5_expanded.append(p5_exp)

print(f"\nAverage P@5 (Standard Hybrid): {np.mean(p5_standard):.3f}")
print(f"Average P@5 (Expanded Hybrid): {np.mean(p5_expanded):.3f}")
improvement = np.mean(p5_expanded) - np.mean(p5_standard)
print(f"Improvement: {improvement:+.3f} ({improvement/max(np.mean(p5_standard), 1e-9)*100:+.1f}%)")

## Summary

In this lab, you learned how to:

1. **Compare** embedding models (MiniLM vs MPNet) using intra-class similarity
2. **Build** a BM25 keyword index for fast term-based retrieval
3. **Fuse** BM25 and semantic scores into a hybrid search with tuneable alpha
4. **Evaluate** search quality with precision@k on ground-truth queries
5. **Implement** a RAG pipeline that retrieves catalogue context and generates structured answers
6. **Expand** queries with semantically related vocabulary for improved recall

---

*Data Discovery: Harnessing AI, AGI & Vector Databases | AI Elevate*