# Week 2 — Local Retrieval Pipeline (FAISS)

This notebook implements a local retrieval pipeline:
- chunk documents with multiple chunk sizes/overlaps
- build a FAISS vector index
- compare two embedding models (MiniLM vs MPNet)
- evaluate retrieval quality with hit@k and MRR
- do manual inspection and summarize findings

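**Important note:** If you see MRR≈1 and hit@k≈1 for all queries, the usual cause is an easy evaluation setup
(homogeneous corpus + direct queries), not a perfect retriever. The metrics in this notebook are also computed at topic level:
a query counts as a hit when any retrieved chunk carries the expected topic label, which makes high scores easier to reach.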
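
As a rough illustration of how easy the setup can be, the sketch below computes the chance that a purely random top-k already contains the expected topic. It assumes one retrievable unit per document (8 units, as in the corpus used in this notebook) with either 1 or 2 units sharing the expected topic; `chance_hit_at_k` is a hypothetical helper, not part of the pipeline.

```python
from math import comb

def chance_hit_at_k(n_items, n_relevant, k):
    """Probability that k items drawn at random (without replacement)
    include at least one of the n_relevant items."""
    k = min(k, n_items)
    return 1.0 - comb(n_items - n_relevant, k) / comb(n_items, k)

# Assumption: 8 retrievable units, one per document.
for n_relevant in (1, 2):
    p = chance_hit_at_k(n_items=8, n_relevant=n_relevant, k=5)
    print(f"relevant={n_relevant}: random hit@5 = {p:.3f}")
# relevant=1: random hit@5 = 0.625
# relevant=2: random hit@5 = 0.893
```

Under these assumptions even a random retriever reaches a hit@5 between roughly 0.63 and 0.89, so hit@1 and MRR are the more informative numbers on a corpus this small.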

In [1]:
import numpy as np
import pandas as pd
import faiss
from sentence_transformers import SentenceTransformer



## Load Documents (GenAI & RAG domain only)

Replace `docs` with your own texts.
Each doc is a dict with three keys: `id`, `topic`, and `text`.
All topics stay within the GenAI/RAG domain (RAG, Chunking, Embeddings, VectorDB, HyDE, Evaluation).
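
When swapping in your own texts, a small sanity check on the schema can catch problems before indexing. The following is a minimal sketch; `validate_docs` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration, and the sample documents are made up.

```python
REQUIRED_KEYS = ("id", "topic", "text")

def validate_docs(docs):
    """Return a list of problems found; an empty list means the docs look usable."""
    problems = []
    seen_ids = set()
    for pos, doc in enumerate(docs):
        # A key counts as missing when it is absent or holds an empty value.
        missing = [k for k in REQUIRED_KEYS if not doc.get(k)]
        if missing:
            problems.append(f"doc #{pos}: missing or empty {missing}")
        doc_id = doc.get("id")
        if doc_id is not None:
            if doc_id in seen_ids:
                problems.append(f"doc #{pos}: duplicate id {doc_id!r}")
            seen_ids.add(doc_id)
    return problems

# Made-up sample: the second doc reuses an id and has empty text.
sample_docs = [
    {"id": "rag_01", "topic": "RAG", "text": "RAG combines retrieval with generation."},
    {"id": "rag_01", "topic": "RAG", "text": ""},
]
problems = validate_docs(sample_docs)
print(problems)
```

Duplicate ids matter here because chunk ids are derived from the document id.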

In [2]:
RAG_KNOWLEDGE_BASE = [
    {
        "id": "rag_01",
        "topic": "RAG",
        "text": """Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with text generation. Instead of relying solely on the knowledge stored in model parameters, RAG systems retrieve relevant documents from an external knowledge base and use them as context for generating responses. This approach reduces hallucinations by grounding the model's output in factual, retrieved information. The typical RAG pipeline consists of three stages: indexing documents into a searchable format, retrieving relevant passages given a query, and generating a response using the retrieved context. RAG is particularly useful for knowledge-intensive tasks where the model needs access to up-to-date or domain-specific information that wasn't part of its training data."""
    },
    {
        "id": "rag_02",
        "topic": "RAG",
        "text": """The key advantage of RAG over fine-tuning is that the knowledge base can be updated without retraining the model. This makes RAG cost-effective and flexible for enterprise applications. RAG systems also provide transparency since retrieved sources can be cited, allowing users to verify the information. Common challenges in RAG include retrieval quality, context window limitations, and handling conflicting information from multiple sources. Advanced RAG architectures may include query rewriting, multi-hop retrieval, and fusion techniques to improve answer quality."""
    },
    {
        "id": "chunking_01",
        "topic": "Chunking",
        "text": """Chunking is the process of splitting documents into smaller pieces for indexing and retrieval. The chunk size significantly impacts retrieval quality. Small chunks (100-200 tokens) provide precise matches but may lose context. Large chunks (500-1000 tokens) preserve context but may include irrelevant information. Overlap between chunks helps maintain continuity across chunk boundaries. Common chunking strategies include fixed-size chunking, sentence-based splitting, and semantic chunking based on topic boundaries. The optimal chunk size depends on the embedding model's context window, the nature of queries, and the document structure."""
    },
    {
        "id": "chunking_02",
        "topic": "Chunking",
        "text": """Recursive character text splitting is a popular chunking method that tries to split on natural boundaries like paragraphs and sentences before falling back to character-level splits. This preserves semantic coherence within chunks. Chunk overlap (typically 10-20% of chunk size) ensures that information spanning chunk boundaries is not lost. For structured documents like code or markdown, specialized splitters that respect syntax boundaries produce better results. Parent-child chunking stores both small chunks for precise retrieval and their parent documents for expanded context."""
    },
    {
        "id": "embeddings_01",
        "topic": "Embeddings",
        "text": """Text embeddings are dense vector representations that capture semantic meaning. In RAG systems, both documents and queries are converted to embeddings, and similarity search finds the most relevant documents. Popular embedding models include OpenAI's text-embedding-ada-002, Sentence Transformers (like all-MiniLM-L6-v2), and BGE models. The embedding dimension affects storage requirements and search speed. Normalized embeddings allow using inner product as a similarity metric, which is computationally efficient. The choice of embedding model significantly impacts retrieval quality and should match the domain of your documents."""
    },
    {
        "id": "vectordb_01",
        "topic": "VectorDB",
        "text": """Vector databases store embeddings and enable fast similarity search at scale. FAISS (Facebook AI Similarity Search) is a popular open-source library for efficient similarity search. It supports various index types: IndexFlatIP for exact search, IndexIVF for approximate search with clustering, and IndexHNSW for graph-based approximate search. For production systems, managed vector databases like Pinecone, Weaviate, Milvus, and Qdrant provide additional features like filtering, hybrid search, and automatic scaling. The choice between exact and approximate nearest neighbor search depends on the dataset size and latency requirements."""
    },
    {
        "id": "hyde_01",
        "topic": "HyDE",
        "text": """HyDE (Hypothetical Document Embeddings) is an advanced retrieval technique that improves search quality by generating a hypothetical answer before retrieval. Instead of embedding the query directly, HyDE uses an LLM to generate a hypothetical document that would answer the query, then embeds this generated document for retrieval. This bridges the semantic gap between short queries and longer document passages. HyDE is particularly effective when queries are vague or use different terminology than the indexed documents. The technique adds latency due to the generation step but often significantly improves retrieval accuracy."""
    },
    {
        "id": "evaluation_01",
        "topic": "Evaluation",
        "text": """Evaluating RAG systems requires measuring both retrieval quality and generation quality. Retrieval metrics include Hit@K (whether relevant documents appear in top K results), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG). Generation quality can be measured using BLEU, ROUGE, or model-based metrics like faithfulness and relevance scores. End-to-end evaluation often uses human judgment or LLM-as-judge approaches. Important aspects to evaluate include factual accuracy, completeness, relevance to the query, and proper attribution of sources. A/B testing with real users provides the most reliable quality signal."""
    },
]

docs = RAG_KNOWLEDGE_BASE
print(f"Loaded {len(docs)} documents")
print("Topics:", [d["topic"] for d in docs])

Loaded 8 documents
Topics: ['RAG', 'RAG', 'Chunking', 'Chunking', 'Embeddings', 'VectorDB', 'HyDE', 'Evaluation']


## Chunking (3 configs)

We test three chunk configurations and compare retrieval quality. `chunk_size` and `overlap` are measured in characters, not tokens.
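
To make the windowing concrete, here is a minimal sketch of a character-based sliding window on a short string. `sliding_chunks` is a hypothetical stand-in that mirrors the windowing idea of the notebook's chunker without the whitespace stripping; the sizes are chosen only for readability.

```python
def sliding_chunks(text, chunk_size, overlap):
    """Character-based sliding window; consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

demo = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
parts = sliding_chunks(demo, chunk_size=10, overlap=3)
for p in parts:
    print(p)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk starts with the last three characters of the previous one, and the final chunk is shorter because the text runs out.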

In [3]:
chunk_configs = [
    {"chunk_size": 200, "overlap": 20},
    {"chunk_size": 500, "overlap": 50},
    {"chunk_size": 800, "overlap": 80},
]

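def chunk_text(text, chunk_size, overlap):
    """Split text into character windows of chunk_size that share overlap characters."""
    if not 0 <= overlap < chunk_size:
        # Without this guard, overlap >= chunk_size would never advance `start`.
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        start = end - overlap  # step back so adjacent chunks overlap
    return [c for c in chunks if c]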

def make_chunks(docs, cfg):
    chunks = []
    for d in docs:
        parts = chunk_text(d["text"], cfg["chunk_size"], cfg["overlap"])
        for i, p in enumerate(parts):
            chunks.append({
                "doc_id": d["id"],
                "topic": d["topic"],
                "chunk_id": f"{d['id']}::c{i}",
                "text": p
            })
    return chunks


## Build FAISS Retriever (cosine similarity)

Cosine similarity is implemented in two steps:
- L2-normalize the embeddings
- search with an inner-product index (`IndexFlatIP`)
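
The two steps above rely on the fact that the inner product of unit-length vectors equals their cosine similarity. A minimal pure-Python check on two made-up vectors (the helper names are illustrative only):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]  # length 5
b = [1.0, 2.0, 2.0]  # length 3

cos_ab = cosine(a, b)                                  # 11 / 15
ip_normalized = dot(l2_normalize(a), l2_normalize(b))  # same value
print(round(cos_ab, 6), round(ip_normalized, 6))
```

This is why the scores returned by the index can be read as cosine similarities, provided both documents and queries are normalized.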

In [4]:
def build_index(texts, model_name):
    model = SentenceTransformer(model_name)
    emb = model.encode(texts, normalize_embeddings=True)
    emb = np.array(emb, dtype="float32")

    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)
    return model, index

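def retrieve(model, index, query, top_k):
    q = model.encode([query], normalize_embeddings=True)
    q = np.array(q, dtype="float32")
    # FAISS pads results with index -1 when top_k exceeds the number of stored vectors.
    top_k = min(top_k, index.ntotal)
    scores, idxs = index.search(q, top_k)
    return scores[0], idxs[0]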


## Evaluation Queries

We include:
- direct queries (easy)
- paraphrases (harder)
- ambiguous/trap queries (should sometimes fail)
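
If each query were also tagged with its category, results could be broken down per category instead of only per config. The sketch below assumes a hypothetical `kind` field and uses placeholder outcomes that are not measurements from this notebook.

```python
from collections import defaultdict

# Placeholder results: "kind" is an assumed label, "hit1" is 1.0 or 0.0.
labelled_results = [
    {"kind": "direct", "hit1": 1.0},
    {"kind": "direct", "hit1": 1.0},
    {"kind": "paraphrase", "hit1": 1.0},
    {"kind": "paraphrase", "hit1": 0.0},
    {"kind": "trap", "hit1": 0.0},
]

def hit_rate_by_kind(results):
    """Average the hit@1 outcomes separately for each query category."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["kind"]].append(r["hit1"])
    return {kind: sum(v) / len(v) for kind, v in buckets.items()}

by_kind = hit_rate_by_kind(labelled_results)
print(by_kind)
```

A breakdown like this would show whether failures concentrate in the harder categories, which a single averaged score hides.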


In [5]:
eval_queries = [
    # RAG topic
    {"query": "Explain retrieval augmented generation in simple terms", "expected_topic": "RAG"},
    {"query": "How does retrieval reduce hallucinations?", "expected_topic": "RAG"},
    {"query": "What is the advantage of RAG over fine-tuning?", "expected_topic": "RAG"},
    
    # Chunking topic
    {"query": "What is chunking and why do we use overlap?", "expected_topic": "Chunking"},
    {"query": "How does chunk size affect retrieval quality?", "expected_topic": "Chunking"},
    {"query": "What is recursive character text splitting?", "expected_topic": "Chunking"},
    
    # Embeddings topic
    {"query": "What are text embeddings and how do they work?", "expected_topic": "Embeddings"},
    {"query": "Which embedding models are popular for RAG?", "expected_topic": "Embeddings"},
    
    # VectorDB topic
    {"query": "What is FAISS and how does it work?", "expected_topic": "VectorDB"},
    {"query": "What are the different FAISS index types?", "expected_topic": "VectorDB"},
    
    # HyDE topic
    {"query": "What is HyDE in retrieval?", "expected_topic": "HyDE"},
    {"query": "How does hypothetical document embedding improve search?", "expected_topic": "HyDE"},
    
    # Evaluation topic
    {"query": "How do we evaluate retrieval quality?", "expected_topic": "Evaluation"},
    {"query": "What is Mean Reciprocal Rank (MRR)?", "expected_topic": "Evaluation"},
    
    # Harder / cross-topic queries
    {"query": "Why does adding retrieved context help LLM answers stay accurate?", "expected_topic": "RAG"},
    {"query": "Does overlap always improve retrieval precision?", "expected_topic": "Chunking"},
]

print(f"Total queries: {len(eval_queries)}")
print("Expected topics:", set(q["expected_topic"] for q in eval_queries))

Total queries: 16
Expected topics: {'Evaluation', 'Chunking', 'VectorDB', 'HyDE', 'RAG', 'Embeddings'}


## Run Experiments: chunk configs × embedding models

We run every combination of embedding model (MiniLM vs MPNet) and chunk config,
and compute hit@1, hit@3, hit@5, and MRR for each query.
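
As a small worked example of the two metrics, the sketch below scores three toy rankings (assumed for illustration): the expected topic at rank 1, at rank 2, and missing. The function names are local to this example.

```python
def hit_at_k(expected, retrieved, k):
    """1.0 if the expected topic appears among the first k results, else 0.0."""
    return 1.0 if expected in retrieved[:k] else 0.0

def reciprocal_rank(expected, retrieved):
    """1 / rank of the first matching result, or 0.0 if there is none."""
    for rank, topic in enumerate(retrieved, start=1):
        if topic == expected:
            return 1.0 / rank
    return 0.0

# Toy rankings: (expected topic, retrieved topics in rank order).
cases = [
    ("RAG", ["RAG", "HyDE", "HyDE"]),            # found at rank 1
    ("Chunking", ["HyDE", "Chunking", "RAG"]),   # found at rank 2
    ("Evaluation", ["RAG", "HyDE", "Chunking"]), # not found
]

rr = [reciprocal_rank(e, r) for e, r in cases]
mrr_value = sum(rr) / len(rr)
hit1 = sum(hit_at_k(e, r, 1) for e, r in cases) / len(cases)
hit3 = sum(hit_at_k(e, r, 3) for e, r in cases) / len(cases)
print(rr, mrr_value, hit1, hit3)
```

The reciprocal ranks are 1.0, 0.5, and 0.0, so MRR is 0.5, while hit@1 is 1/3 and hit@3 is 2/3.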


In [7]:
models = {
    "MiniLM": "all-MiniLM-L6-v2",
    "MPNet": "all-mpnet-base-v2"
}

def hit_at_k(expected, retrieved_topics, k):
    return 1.0 if expected in retrieved_topics[:k] else 0.0

def mrr(expected, retrieved_topics):
    for i, t in enumerate(retrieved_topics, start=1):
        if t == expected:
            return 1.0 / i
    return 0.0

rows = []
debug = []

for cfg in chunk_configs:
    chunks = make_chunks(docs, cfg)
    texts = [c["text"] for c in chunks]
    topics = [c["topic"] for c in chunks]

    for model_label, model_name in models.items():
        model, index = build_index(texts, model_name)

        for q in eval_queries:
            scores, idxs = retrieve(model, index, q["query"], top_k=5)

            retrieved_topics = [topics[i] for i in idxs]
            gap_1_2 = float(scores[0] - scores[1]) if len(scores) > 1 else None

            rows.append({
                "chunk_size": cfg["chunk_size"],
                "overlap": cfg["overlap"],
                "model": model_label,
                "query": q["query"],
                "expected": q["expected_topic"],
                "hit@1": hit_at_k(q["expected_topic"], retrieved_topics, 1),
                "hit@3": hit_at_k(q["expected_topic"], retrieved_topics, 3),
                "hit@5": hit_at_k(q["expected_topic"], retrieved_topics, 5),
                "mrr": mrr(q["expected_topic"], retrieved_topics),
                "top_topics": retrieved_topics,
                "top1_score": float(scores[0]),
                "gap_1_2": gap_1_2
            })

            # debug: store top3 chunks for manual check
            for rank in range(3):
                debug.append({
                    "chunk_size": cfg["chunk_size"],
                    "overlap": cfg["overlap"],
                    "model": model_label,
                    "query": q["query"],
                    "rank": rank+1,
                    "score": float(scores[rank]),
                    "topic": topics[idxs[rank]],
                    "chunk_id": chunks[idxs[rank]]["chunk_id"],
                    "text_preview": chunks[idxs[rank]]["text"][:250].replace("\n", " ")
                })

df = pd.DataFrame(rows)
df_debug = pd.DataFrame(debug)

df.head()



| index | chunk_size | overlap | model | query | expected | hit@1 | hit@3 | hit@5 | mrr | top_topics | top1_score | gap_1_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200 | 20 | MiniLM | Explain retrieval augmented generation in simp... | RAG | 1.0 | 1.0 | 1.0 | 1.0 | [RAG, HyDE, HyDE, RAG, RAG] | 0.740879 | 0.257921 |
| 1 | 200 | 20 | MiniLM | How does retrieval reduce hallucinations? | RAG | 1.0 | 1.0 | 1.0 | 1.0 | [RAG, HyDE, Embeddings, RAG, RAG] | 0.552341 | 0.242185 |
| 2 | 200 | 20 | MiniLM | What is the advantage of RAG over fine-tuning? | RAG | 1.0 | 1.0 | 1.0 | 1.0 | [RAG, Evaluation, RAG, RAG, RAG] | 0.782866 | 0.226946 |
| 3 | 200 | 20 | MiniLM | What is chunking and why do we use overlap? | Chunking | 1.0 | 1.0 | 1.0 | 1.0 | [Chunking, Chunking, Chunking, Chunking, Chunk... | 0.759435 | 0.134919 |
| 4 | 200 | 20 | MiniLM | How does chunk size affect retrieval quality? | Chunking | 1.0 | 1.0 | 1.0 | 1.0 | [Chunking, Chunking, Chunking, Chunking, Embed... | 0.728739 | 0.073879 |


## Results Summary (compare configs and models)

In [8]:
summary = (
    df.groupby(["chunk_size", "overlap", "model"])
      .agg({"hit@1":"mean","hit@3":"mean","hit@5":"mean","mrr":"mean","top1_score":"mean","gap_1_2":"mean"})
      .reset_index()
      .sort_values(["mrr","hit@1"], ascending=False)
)
summary

| index | chunk_size | overlap | model | hit@1 | hit@3 | hit@5 | mrr | top1_score | gap_1_2 |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 800 | 80 | MiniLM | 0.9375 | 1.0 | 1.0 | 0.96875 | 0.533364 | 0.138228 |
| 4 | 800 | 80 | MPNet | 0.9375 | 1.0 | 1.0 | 0.958333 | 0.474881 | 0.122695 |
| 1 | 200 | 20 | MiniLM | 0.875 | 1.0 | 1.0 | 0.9375 | 0.603624 | 0.114243 |
| 2 | 500 | 50 | MPNet | 0.8125 | 1.0 | 1.0 | 0.90625 | 0.498471 | 0.068708 |
| 3 | 500 | 50 | MiniLM | 0.8125 | 1.0 | 1.0 | 0.895833 | 0.544313 | 0.115198 |
| 0 | 200 | 20 | MPNet | 0.8125 | 0.9375 | 0.9375 | 0.864583 | 0.568848 | 0.104315 |
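
With only 16 queries, each query moves hit@1 by 0.0625, so it can help to read the means as counts. The sketch below converts the hit@1 column of the summary back to whole queries; the values are copied from the table above, and `as_count` is a hypothetical helper.

```python
N_QUERIES = 16  # from the "Total queries: 16" output

# hit@1 means copied from the summary table, keyed by (chunk_size/overlap, model).
hit1_means = {
    ("800/80", "MiniLM"): 0.9375,
    ("800/80", "MPNet"): 0.9375,
    ("200/20", "MiniLM"): 0.875,
    ("500/50", "MPNet"): 0.8125,
    ("500/50", "MiniLM"): 0.8125,
    ("200/20", "MPNet"): 0.8125,
}

def as_count(mean, n=N_QUERIES):
    """Convert a mean of 0/1 outcomes back to a whole number of hits."""
    return round(mean * n)

hits = {cfg: as_count(m) for cfg, m in hit1_means.items()}
for cfg, count in hits.items():
    print(cfg, f"{count}/{N_QUERIES} queries correct at rank 1")
```

The best and worst configurations differ by two queries (15 vs 13 out of 16), so the ranking of configs should be read as tentative rather than conclusive.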


## Manual Retrieval Evaluation

Each of the top-3 retrieved chunks is labelled with one of:
- ✅ relevant
- ⚠️ partially relevant
- ❌ irrelevant

In [14]:
def show_debug(df_debug, chunk_size, overlap, model, query):
    view = df_debug[
        (df_debug["chunk_size"]==chunk_size) &
        (df_debug["overlap"]==overlap) &
        (df_debug["model"]==model) &
        (df_debug["query"]==query)
    ].sort_values("rank")
    return view[["rank","score","topic","chunk_id","text_preview"]]

show_debug(df_debug, 500, 50, "MiniLM", "Does overlap always improve retrieval precision?")


| index | rank | score | topic | chunk_id | text_preview |
|---|---|---|---|---|---|
| 141 | 1 | 0.449827 | HyDE | hyde_01::c1 | queries are vague or use different terminology... |
| 142 | 2 | 0.393171 | Chunking | chunking_01::c0 | Chunking is the process of splitting documents... |
| 143 | 3 | 0.377681 | RAG | rag_02::c0 | The key advantage of RAG over fine-tuning is t... |


✅ Manual Retrieval Evaluation — Case: RAG (Baseline Definition)
Query

Explain retrieval augmented generation in simple terms

Expected topic: RAG

Top-1 result

Rank: 1
Score: 0.728
Topic: RAG
Chunk: rag_01::c0

Verdict: ✅ Relevant

Reason:
The retrieved chunk provides a clear and concise definition of Retrieval-Augmented Generation, explaining its purpose and how it combines retrieval with generation. The explanation is suitable for a “simple terms” request and directly answers the question.

Top-2 result

Rank: 2
Score: 0.584
Topic: HyDE
Chunk: hyde_01::c1

Verdict: ⚠️ Partially relevant

Reason:
This chunk discusses challenges related to query–document semantic mismatch and advanced retrieval techniques. While conceptually related to retrieval, it does not explain RAG itself and is too specific for an introductory explanation.

Top-3 result

Rank: 3
Score: 0.520
Topic: RAG
Chunk: rag_02::c1

Verdict: ⚠️ Partially relevant

Reason:
The chunk refers to advanced RAG architectures and extensions, which are related to the topic but assume prior knowledge and do not focus on a simple, high-level explanation.

Observation

This query represents a direct definition-style baseline.
The retriever correctly ranks a fully relevant RAG definition at top-1, while semantically related but more advanced concepts (HyDE and advanced RAG variants) appear in lower ranks. This suggests the ranking is driven by semantic similarity rather than strict keyword matching.

In [15]:
show_debug(df_debug, 500, 50, "MiniLM", "Explain retrieval augmented generation in simple terms")


| index | rank | score | topic | chunk_id | text_preview |
|---|---|---|---|---|---|
| 96 | 1 | 0.728413 | RAG | rag_01::c0 | Retrieval-Augmented Generation (RAG) is a tech... |
| 97 | 2 | 0.584476 | HyDE | hyde_01::c1 | queries are vague or use different terminology... |
| 98 | 3 | 0.519525 | RAG | rag_02::c1 | ed RAG architectures may include query rewriti... |


✅ Manual Retrieval Evaluation — Case: RAG Definition (Confirmed)
Query

Explain retrieval augmented generation in simple terms

Top-1 result

Rank: 1
Score: 0.728
Topic: RAG
Chunk: rag_01::c0

Verdict: ✅ Relevant

Reason:
The retrieved chunk explicitly defines Retrieval-Augmented Generation, describing its purpose, how it combines retrieval with generation, and why it is useful. The explanation aligns well with the request for a simple, introductory description.

Top-2 result

Rank: 2
Score: 0.584
Topic: HyDE
Chunk: hyde_01::c1

Verdict: ⚠️ Partially relevant

Reason:
This chunk discusses issues related to vague queries and terminology mismatch, which are relevant to retrieval in general, but it does not explain RAG itself. It provides contextual background but does not directly answer the question.

Top-3 result

Rank: 3
Score: 0.520
Topic: RAG
Chunk: rag_02::c1

Verdict: ⚠️ Partially relevant

Reason:
The chunk mentions advanced RAG architectures and techniques but assumes prior knowledge and focuses on extensions rather than a simple explanation.

Observation

This query represents a direct, definition-style baseline.
The retriever correctly ranks a highly relevant RAG definition at top-1, while related but more advanced retrieval concepts (HyDE, advanced RAG architectures) appear in lower ranks, which suggests semantic rather than keyword-based matching.

In [16]:
show_debug(df_debug, 500, 50, "MiniLM", "How does retrieval reduce hallucinations?")


| index | rank | score | topic | chunk_id | text_preview |
|---|---|---|---|---|---|
| 99 | 1 | 0.346955 | RAG | rag_01::c0 | Retrieval-Augmented Generation (RAG) is a tech... |
| 100 | 2 | 0.23694 | HyDE | hyde_01::c1 | queries are vague or use different terminology... |
| 101 | 3 | 0.212321 | RAG | rag_01::c1 | stages: indexing documents into a searchable f... |


✅ Manual Retrieval Evaluation — Case: Hallucinations (RAG Grounding)
Query

How does retrieval reduce hallucinations?

Expected topic: RAG

Top-1 result

Rank: 1
Score: 0.347
Topic: RAG
Chunk: rag_01::c0

Verdict: ⚠️ Partially relevant

Reason:
The chunk explains what RAG is and mentions grounding outputs in retrieved information, which is related to hallucination reduction. However, it does not directly explain how retrieval reduces hallucinations (e.g., by constraining generation to retrieved evidence and reducing reliance on parametric memory). The answer would still require an explicit mechanism-focused explanation.

Top-2 result

Rank: 2
Score: 0.237
Topic: HyDE
Chunk: hyde_01::c1

Verdict: ❌ Irrelevant

Reason:
This chunk discusses vague queries and terminology mismatch in retrieval, which is not directly about hallucinations or grounding. It does not answer the question.

Top-3 result

Rank: 3
Score: 0.212
Topic: RAG
Chunk: rag_01::c1

Verdict: ⚠️ Partially relevant

Reason:
This chunk lists stages of the RAG pipeline (indexing/retrieval/generation) but does not explicitly connect retrieval to hallucination reduction. It is contextually related but not a direct answer.

Observation

This query reveals a realistic retrieval limitation: the retriever selects generally relevant RAG definition/pipeline chunks, but not a chunk that explicitly explains the mechanism of hallucination reduction. The relatively low top-1 score (0.347) indicates weaker semantic alignment compared to direct definition questions.

In [17]:
show_debug(df_debug, 500, 50, "MiniLM", "What is HyDE in retrieval?")

| index | rank | score | topic | chunk_id | text_preview |
|---|---|---|---|---|---|
| 126 | 1 | 0.62968 | HyDE | hyde_01::c0 | HyDE (Hypothetical Document Embeddings) is an ... |
| 127 | 2 | 0.297483 | RAG | rag_01::c1 | stages: indexing documents into a searchable f... |
| 128 | 3 | 0.265558 | RAG | rag_01::c0 | Retrieval-Augmented Generation (RAG) is a tech... |


✅ Manual Retrieval Evaluation — Case: HyDE Definition
Query

What is HyDE in retrieval?

Expected topic: HyDE

Top-1 result

Rank: 1
Score: 0.630
Topic: HyDE
Chunk: hyde_01::c0

Verdict: ✅ Relevant

Reason:
The retrieved chunk explicitly explains HyDE (Hypothetical Document Embeddings), including the idea of generating a hypothetical document before retrieval and using it for embedding-based search. This directly answers the question and provides the correct definition.

Top-2 result

Rank: 2
Score: 0.297
Topic: RAG
Chunk: rag_01::c1

Verdict: ⚠️ Partially relevant

Reason:
This chunk describes the general RAG pipeline stages (indexing, retrieval, generation) but does not mention HyDE specifically. It provides related retrieval context but does not answer the question.

Top-3 result

Rank: 3
Score: 0.266
Topic: RAG
Chunk: rag_01::c0

Verdict: ❌ Irrelevant

Reason:
The chunk explains Retrieval-Augmented Generation at a high level without referencing HyDE or hypothetical document embeddings. It does not contribute to answering the question.

Observation

For a direct definition-style query, the retriever correctly places the HyDE definition at rank 1 with a clear score margin. Related RAG content appears in lower ranks due to semantic proximity between advanced retrieval techniques, but does not interfere with the correct top-1 result.

## Week 2 Summary (Final)

✅ Implemented FAISS local vector store  
✅ Implemented chunking with multiple configs (chunk_size/overlap)  
✅ Compared MiniLM vs MPNet retrievers  
✅ Evaluated retrieval with hit@k and MRR  
✅ Performed manual retrieval inspection and summarized findings  



In [18]:
# NO_ANSWER detection: out-of-domain queries should have low scores
T_SCORE = 0.45

no_answer_queries = [
    "How to create a VM in GCP?",
    "How do I merge a git branch safely?",
    "What is the best recipe for pasta carbonara?",
    "How to fix a flat tire on a bicycle?",
]

cfg = {"chunk_size": 500, "overlap": 50}
chunks = make_chunks(docs, cfg)
texts = [c["text"] for c in chunks]

model, index = build_index(texts, models["MiniLM"])

print("NO_ANSWER Detection Test")
print("=" * 50)
for q in no_answer_queries:
    scores, idxs = retrieve(model, index, q, top_k=1)
    if scores[0] < T_SCORE:
        print(f"✅ '{q[:40]}...' -> NO_ANSWER (score={scores[0]:.3f})")
    else:
        print(f"⚠️ '{q[:40]}...' -> FALSE POSITIVE (score={scores[0]:.3f})")



NO_ANSWER Detection Test
✅ 'How to create a VM in GCP?...' -> NO_ANSWER (score=0.037)
✅ 'How do I merge a git branch safely?...' -> NO_ANSWER (score=0.127)
✅ 'What is the best recipe for pasta carbon...' -> NO_ANSWER (score=0.147)
✅ 'How to fix a flat tire on a bicycle?...' -> NO_ANSWER (score=0.024)
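
The test above only checks out-of-domain queries. A threshold also needs to let in-domain queries through, so the sketch below compares `T_SCORE` against the in-domain top-1 scores shown in the manual inspection outputs of this notebook (MiniLM, chunk_size=500, overlap=50). The scores are copied from those outputs; `rejected` is a hypothetical helper.

```python
T_SCORE = 0.45  # threshold used in the NO_ANSWER cell

# Top-1 scores copied from the show_debug outputs (in-domain queries).
in_domain = {
    "Explain retrieval augmented generation in simple terms": 0.728413,
    "How does retrieval reduce hallucinations?": 0.346955,
    "What is HyDE in retrieval?": 0.62968,
    "Does overlap always improve retrieval precision?": 0.449827,
}

# Top-1 scores copied from the NO_ANSWER output (out-of-domain queries).
out_of_domain = {
    "How to create a VM in GCP?": 0.037,
    "How do I merge a git branch safely?": 0.127,
    "What is the best recipe for pasta carbonara?": 0.147,
    "How to fix a flat tire on a bicycle?": 0.024,
}

def rejected(scores, threshold):
    """Queries whose top-1 score falls below the threshold."""
    return [q for q, s in scores.items() if s < threshold]

wrongly_rejected = rejected(in_domain, T_SCORE)
lowest_in = min(in_domain.values())
highest_out = max(out_of_domain.values())

print("in-domain queries rejected at T_SCORE:", wrongly_rejected)
print(f"separating range on this sample: ({highest_out:.3f}, {lowest_in:.3f}]")
```

On these eight scores, a threshold of 0.45 would also reject two in-domain queries, while any value above 0.147 and up to about 0.347 separates the two groups. One of the two rejected queries had a wrong-topic chunk at rank 1, so rejecting it may be acceptable; the sample is too small to fix a threshold with confidence.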
