# Part 5: Retrieval & RAG Pipeline

**Phase:** Retrieval-Augmented Generation (RAG): from query to answer 💡📌🧠🔍

This notebook covers retrieval fundamentals and shows how to build a simple RAG pipeline with FAISS and Chroma. Hands-on demos compare retrieval quality and performance between the two. It is written for beginners and runs in Google Colab.

## Learning Guide

**What you'll learn**

- What retrieval (semantic search) is and why it's essential for RAG
- Key retrieval parameters (k, score thresholds, metadata filters)
- How to construct a simple RAG pipeline (Ingest → Split → Embed → Store → Retrieve → Answer)
- Hands-on demos using FAISS and Chroma to measure retrieval quality

**Why it matters**

Retrieval brings relevant context to LLMs so they can produce accurate, grounded answers without holding all knowledge in the model weights.

**Hands-on steps**

1. Load sample documents
2. Chunk them into meaningful pieces
3. Embed chunks (local or Gemini)
4. Index chunks into FAISS and Chroma
5. Query and compare results
6. Send top context to an LLM (example pattern included)

In [None]:
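# `secrete_key` is a local helper module that returns your Gemini key.
# If it is missing, fall back to the GOOGLE_API_KEY environment variable.
import os
try:
    from secrete_key import my_gemini_api_key
    API_KEY = my_gemini_api_key()
except ImportError:
    API_KEY = os.environ.get('GOOGLE_API_KEY')
print('API_KEY loaded (hidden)' if API_KEY else 'No API key found; Gemini cells will fall back to local models')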

## 5.1 What is Retrieval?

- **Query → embed → search vectors → return top-k chunks**

Retrieval finds the most semantically relevant pieces of text (chunks) to use as context for an LLM. Grounding the model in retrieved text helps reduce hallucination and enables question answering over up-to-date or private documents.
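The four steps can be sketched in pure Python. This is a toy under a stated assumption: a bag-of-words count vector stands in for a real embedding model, and `embed`, `cosine`, and `retrieve` are hypothetical helper names. The cells below use real Gemini or sentence-transformers vectors instead.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=3):
    """Query -> embed -> score every chunk -> return the top-k (chunk, score) pairs."""
    q = embed(query)
    scored = [(c, cosine(q, embed(c))) for c in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

docs = [
    "Our office hours are 9 AM to 5 PM, Monday to Friday.",
    "We offer a 30-day return policy on all products.",
    "Shipping takes 3-5 business days within the US.",
]
for chunk, score in retrieve("What are the office hours?", docs, k=2):
    print(f"{score:.3f}  {chunk}")
```

Word overlap is enough to rank the office-hours sentence first here. Real embeddings also match paraphrases that share no words, which is why the notebook uses them.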

## 5.2 Key Parameters

- `k`: number of top results to return (top-k; Chroma calls this `n_results`)
- `score_threshold`: minimum similarity required to accept a result
- `metadata filters`: restrict search by tags, source, date, etc. (Chroma exposes this as the `where` argument)

Tips:
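- Start with k between 3 and 5 for most tasks
- Use a score threshold to avoid returning unrelated context; the right value depends on the embedding model and similarity metric, so tune it on real queries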
- Store and query metadata to narrow search scope
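A small sketch of how the three parameters interact, applied to already-scored `(text, metadata, score)` tuples where a higher score means more similar. The function name `select_results` and the scores in the demo are hypothetical and only illustrative. Real vector stores usually apply the metadata filter during the search itself rather than afterwards.

```python
def select_results(scored, k=3, score_threshold=None, where=None):
    """Apply metadata filter, score threshold, and top-k to (text, metadata, score) tuples."""
    # 1) metadata filter: keep results whose metadata matches every key/value in `where`
    if where:
        scored = [r for r in scored if all(r[1].get(key) == val for key, val in where.items())]
    # 2) score threshold: drop weak matches (higher score = more similar)
    if score_threshold is not None:
        scored = [r for r in scored if r[2] >= score_threshold]
    # 3) top-k: sort best first, then cut
    return sorted(scored, key=lambda r: r[2], reverse=True)[:k]

# Illustrative scores, not produced by a real model
candidates = [
    ("We offer a 30-day return policy on all products.", {"source_doc": 1}, 0.82),
    ("Refunds are processed within 7 business days after receiving the returned item.", {"source_doc": 4}, 0.64),
    ("Shipping takes 3-5 business days within the US.", {"source_doc": 3}, 0.21),
]
print(select_results(candidates, k=2, score_threshold=0.3))
print(select_results(candidates, where={"source_doc": 4}))
```

A higher threshold favours precision. A larger `k` with a lower threshold favours recall.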

In [None]:
# Uncomment in Colab if you need packages
# !pip install --quiet faiss-cpu sentence-transformers chromadb langchain-google-genai numpy scikit-learn
print('Install required packages if missing.')

In [None]:
# Sample documents and simple splitter
sample_docs = [
    "Our office hours are 9 AM to 5 PM, Monday to Friday.",
    "We offer a 30-day return policy on all products.",
    "Customer support: support@company.com or call 1-800-HELP.",
    "Shipping takes 3-5 business days within the US.",
    "Refunds are processed within 7 business days after receiving the returned item.",
]

import re

def simple_split(doc):
    # very small splitter (sentence-level)
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', doc) if s.strip()]

# Build chunks
chunks = []
metadatas = []
for i, d in enumerate(sample_docs):
    sents = simple_split(d)
    for j, s in enumerate(sents):
        chunks.append(s)
        # keep provenance (document and sentence index) for filtering and citations
        metadatas.append({'source_doc': i, 'sentence': j})

print('Prepared', len(chunks), 'chunks')
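The sentence splitter works for these one-line documents. Longer documents are often split into fixed-size windows that overlap, so that a fact straddling a boundary still appears whole in at least one chunk. Below is a sketch of that alternative. The name `window_split` is hypothetical, and word counts are used as a rough proxy for tokens.

```python
def window_split(text, chunk_size=50, overlap=10):
    """Split text into chunks of up to `chunk_size` words; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("need 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    out = []
    for start in range(0, len(words), step):
        out.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # this window already reached the end
            break
    return out

text = " ".join(f"w{i}" for i in range(12))
for c in window_split(text, chunk_size=5, overlap=2):
    print(c)
```

The demo yields four windows: `w0..w4`, `w3..w7`, `w6..w10`, and `w9..w11`. Each window starts 3 words after the previous one and repeats its last 2 words.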

In [None]:
# Embedding function: try Gemini via LangChain, else fall back to local sentence-transformers
try:
    from langchain_google_genai import GoogleGenerativeAIEmbeddings
    emb = GoogleGenerativeAIEmbeddings(model='models/embedding-001', google_api_key=API_KEY)
    # Creating the client may succeed even with a bad key; this probe makes a real API call
    emb.embed_query('connectivity check')
    def get_embeddings(texts):
        return emb.embed_documents(texts)
    print('Using Gemini embeddings via LangChain')
except Exception as e:
    print('Gemini embeddings not available:', e)
    from sentence_transformers import SentenceTransformer
    st = SentenceTransformer('all-MiniLM-L6-v2')
    def get_embeddings(texts):
        # normalize_embeddings=True returns unit-length vectors, so inner product == cosine similarity
        arr = st.encode(texts, convert_to_numpy=True, normalize_embeddings=True)
        return arr.tolist()
    print('Using local sentence-transformers embeddings (all-MiniLM-L6-v2)')

In [None]:
# Build FAISS index from chunks
try:
    import faiss
    import numpy as np
    vectors = get_embeddings(chunks)
    vecs = np.array(vectors, dtype='float32')
    # L2-normalize in place so inner product equals cosine similarity
    # (Gemini vectors may not be unit length; normalizing already-unit vectors is harmless)
    faiss.normalize_L2(vecs)
    dim = vecs.shape[1]
    index = faiss.IndexFlatIP(dim)  # exact (brute-force) inner-product search
    index.add(vecs)
    print('FAISS index built, ntotal =', index.ntotal)
except Exception as e:
    print('Skipping FAISS build:', e)
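Why does `IndexFlatIP` need normalized vectors to behave like cosine similarity? The raw inner product also rewards vector length. This pure-Python sketch uses two hypothetical 2-D vectors to show the effect. `faiss.normalize_L2` performs the same normalization in place on a float32 NumPy array.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query = [1.0, 0.0]
short_match = [0.9, 0.1]    # points almost the same direction as the query
long_offtopic = [3.0, 4.0]  # different direction, but a much larger magnitude

# Raw inner product rewards magnitude: the off-topic vector wins
print(dot(query, short_match), dot(query, long_offtopic))  # 0.9 3.0
# After L2 normalization the inner product is the cosine similarity, so direction wins
qn = l2_normalize(query)
print(dot(qn, l2_normalize(short_match)), dot(qn, l2_normalize(long_offtopic)))  # about 0.994 vs 0.6
```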

In [None]:
# FAISS query function (returns (chunk, metadata, score)); higher score = more similar
try:
    import faiss
    import numpy as np
    def faiss_query(query, k=3, score_threshold=None):
        qv = np.array(get_embeddings([query]), dtype='float32')
        faiss.normalize_L2(qv)  # normalize the query the same way as the indexed vectors
        D, I = index.search(qv, k)
        results = []
        for score, idx in zip(D[0], I[0]):
            if idx == -1:  # FAISS pads with -1 when k exceeds the number of indexed vectors
                continue
            if score_threshold is not None and score < score_threshold:
                continue
            results.append((chunks[idx], metadatas[idx], float(score)))
        return results
    print('FAISS query helper ready')
except Exception as e:
    print('FAISS query helper not available:', e)

In [None]:
# Build Chroma collection
try:
    import chromadb
    # chromadb >= 0.4: PersistentClient replaces the removed Settings(chroma_db_impl='duckdb+parquet', ...) config
    client = chromadb.PersistentClient(path='.chromadb_rag')
    # Cosine distance makes results comparable to FAISS cosine scores (Chroma's default space is squared L2)
    collection = client.get_or_create_collection(name='rag_collection', metadata={'hnsw:space': 'cosine'})
    embeddings = get_embeddings(chunks)
    ids = [f'chunk{i}' for i in range(len(chunks))]
    # upsert (instead of add) lets you re-run this cell without duplicate-ID problems
    collection.upsert(ids=ids, documents=chunks, metadatas=metadatas, embeddings=embeddings)
    print('Chroma collection created, count =', collection.count())
except Exception as e:
    print('Skipping Chroma build:', e)

In [None]:
# Chroma query helper (returns (chunk, metadata, distance)); LOWER distance = more similar
try:
    def chroma_query(query, k=3, where=None):
        q_emb = get_embeddings([query])[0]
        res = collection.query(query_embeddings=[q_emb], n_results=k, where=where)
        out = []
        for doc, meta, dist in zip(res['documents'][0], res['metadatas'][0], res['distances'][0]):
            out.append((doc, meta, float(dist)))
        return out
    print('Chroma query helper ready')
except Exception as e:
    print('Chroma query helper not available:', e)
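FAISS `IndexFlatIP` returns similarities, where higher is better. Chroma returns distances, where lower is better. To put both on one scale, you can convert Chroma distances back to cosine similarity. The sketch below assumes unit-length embeddings. With `hnsw:space` set to `cosine`, Chroma reports `1 - cosine_similarity`. With the default `l2` space it reports squared Euclidean distance, which equals `2 - 2 * cosine` for unit vectors. The helper names and demo distances are hypothetical.

```python
def to_similarity(distance, space="cosine"):
    """Convert a Chroma distance to cosine similarity, assuming unit-length embeddings."""
    if space == "cosine":
        return 1.0 - distance          # cosine distance = 1 - cos
    if space == "l2":
        return 1.0 - distance / 2.0    # squared L2 on unit vectors = 2 - 2*cos
    raise ValueError(f"unsupported space: {space}")

def distances_to_similarities(results, space="cosine"):
    """Map (doc, meta, distance) tuples to (doc, meta, similarity), higher = more similar."""
    return [(doc, meta, to_similarity(dist, space)) for doc, meta, dist in results]

# Illustrative distances, shaped like chroma_query output
chroma_like = [
    ("We offer a 30-day return policy on all products.", {"source_doc": 1}, 0.25),
    ("Refunds are processed within 7 business days after receiving the returned item.", {"source_doc": 4}, 0.5),
]
print(distances_to_similarities(chroma_like))
```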

In [None]:
# Compare FAISS vs Chroma on example queries
queries = [
    'How do I return a product?',
    'What are the office hours?',
    'How long for refunds?'
]

for q in queries:
    print('\n=== Query:', q, '===')
    try:
        print('\nFAISS results (similarity, higher is better):')
        fr = faiss_query(q, k=3, score_threshold=0.1)
        for r in fr:
            print('-', r)
    except Exception as e:
        print('FAISS not available:', e)
    try:
        print('\nChroma results (distance, lower is better):')
        cr = chroma_query(q, k=3)
        for r in cr:
            print('-', r)
    except Exception as e:
        print('Chroma not available:', e)
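Printing results shows what each store returned, but it does not measure retrieval quality. One simple measure is the hit rate at k: the fraction of queries whose expected source document appears in the top-k results. In the sketch below, `hit_rate_at_k` and `fake_retrieve` are hypothetical, and the query-to-document labels are hand-assigned for the sample documents. Any function with the same `(query, k)` call shape and `(text, metadata, score)` output works, including `faiss_query` and `chroma_query`.

```python
def hit_rate_at_k(retrieve_fn, labeled_queries, k=3):
    """Fraction of queries whose expected source_doc appears in retrieve_fn(query, k)."""
    if not labeled_queries:
        return 0.0
    hits = 0
    for query, expected_doc in labeled_queries.items():
        results = retrieve_fn(query, k)
        if any(meta.get("source_doc") == expected_doc for _, meta, _ in results):
            hits += 1
    return hits / len(labeled_queries)

def fake_retrieve(query, k):
    # Stand-in retriever for a self-contained demo: always returns docs 0, 1, 2 in that order
    return [(f"chunk {i}", {"source_doc": i}, 1.0 - 0.1 * i) for i in range(3)][:k]

# Hand-assigned labels for the sample documents (query -> index in sample_docs)
labels = {
    "How do I return a product?": 1,
    "What are the office hours?": 0,
    "How long for refunds?": 4,
}
print(hit_rate_at_k(fake_retrieve, labels, k=3))
print(hit_rate_at_k(fake_retrieve, labels, k=1))
```

In the notebook, you could compare `hit_rate_at_k(faiss_query, labels)` with `hit_rate_at_k(chroma_query, labels)`.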


## 5.3 Building a Simple RAG Pipeline (Pattern)

Pattern to answer a user query:

1. Embed the query
2. Retrieve top-k chunks from your vector store
3. Concatenate top chunks into a context string (keep token budget in mind)
4. Send a prompt to the LLM with the context and the user question, asking the LLM to answer using only the provided context
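Steps 3 and 4 can be sketched as a helper that packs the best chunks into a context string under a budget, then wraps them in a grounded prompt. The name `build_prompt`, the `[source N]` tags, and the prompt wording are assumptions. The budget is counted in characters as a rough proxy for tokens; a real tokenizer would be more accurate.

```python
def build_prompt(question, retrieved, max_context_chars=2000):
    """Pack best-first (text, metadata, score) tuples into a context under a character
    budget, then wrap the context and question in a grounded prompt."""
    parts, used = [], 0
    for text, meta, _score in retrieved:
        piece = f"[source {meta.get('source_doc', '?')}] {text}"
        sep = 2 if parts else 0  # the "\n\n" separator between chunks
        if used + sep + len(piece) > max_context_chars:
            break  # stop at the first chunk that would overflow the budget
        parts.append(piece)
        used += sep + len(piece)
    context = "\n\n".join(parts)
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

retrieved = [
    ("We offer a 30-day return policy on all products.", {"source_doc": 1}, 0.8),
    ("Refunds are processed within 7 business days after receiving the returned item.", {"source_doc": 4}, 0.7),
]
print(build_prompt("How do I return a product?", retrieved, max_context_chars=80))
```

With an 80-character budget, only the first chunk fits. Tagging each chunk with its source lets the model, or you, cite where an answer came from.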

The next cell implements this pattern with Gemini (via LangChain's `ChatGoogleGenerativeAI`). If Gemini is unavailable, it prints the assembled prompt instead.

In [None]:
# Example: retrieve context, assemble a grounded prompt, then call Gemini (if available)
# The query matches the sample customer-support documents indexed above
query = 'How do I return a product, and how long do refunds take?'
prompt = None
try:
    try:
        top = faiss_query(query, k=3)
    except Exception:
        top = chroma_query(query, k=3)  # fall back to Chroma if FAISS is unavailable
    context = '\n\n'.join(t[0] for t in top)
    prompt = (
        "You are a customer-support assistant. Use ONLY the context below (do not guess).\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer concisely in up to 3 bullets."
    )
except Exception as e:
    print('No prompt assembled because retrieval failed:', e)

if prompt is not None:
    try:
        from langchain_google_genai import ChatGoogleGenerativeAI
        model = ChatGoogleGenerativeAI(model='gemini-2.5-flash', google_api_key=API_KEY)
        print('Sending prompt to Gemini...')
        resp = model.invoke(prompt)
        print('\n=== Gemini response ===\n')
        print(resp.content)  # invoke() returns an AIMessage; .content holds the text
    except Exception as e:
        print('Could not call Gemini:', e)
        print('\nAssembled prompt (for manual use):\n')
        print(prompt)

## 5.4 Hands-On Demo & Exercises

Exercises:

1. Increase dataset size to ~100 documents and compare FAISS vs Chroma query latency.
2. Add metadata and filter by it in Chroma queries.
3. Tune `k` and `score_threshold` for precision vs recall.
4. Implement a token-budget step to ensure the context fits your LLM.
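For exercise 1, a small timing harness is sketched below. The name `time_queries` is hypothetical. `faiss_query` and `chroma_query` both embed the query first, so their timings include embedding time, which dominates when the embedding is a remote Gemini call. To compare only the index search, time `index.search` and `collection.query` with a precomputed query vector.

```python
import statistics
import time

def time_queries(query_fn, queries, repeats=5):
    """Call query_fn(q) for every query, `repeats` times; report latency in milliseconds."""
    timings = []
    for _ in range(repeats):
        for q in queries:
            start = time.perf_counter()
            query_fn(q)
            timings.append((time.perf_counter() - start) * 1000.0)
    if not timings:
        raise ValueError("need at least one query and one repeat")
    return {"median_ms": statistics.median(timings), "max_ms": max(timings), "calls": len(timings)}

# Self-contained demo with a dummy function; in the notebook pass faiss_query or chroma_query
demo = time_queries(lambda q: sorted(q), ["a", "bb", "ccc"], repeats=3)
print(demo)
```

Reporting the median rather than the mean keeps a single slow first call (model warm-up, cold cache) from skewing the comparison.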

Use the helper functions included to iterate quickly.