# Vector Stores & Retrievers (Embeddings + Retrieval)

What are Vector Stores and Retrievers?

- Embeddings turn text into numeric vectors.
- A Vector Store indexes those vectors for fast similarity search.
- A Retriever is a clean interface that, given a user query, returns the most relevant Document chunks.

What we'll cover:

- Building embeddings and a FAISS vector store
- Basic similarity search & the Retriever interface
- Score thresholds, Maximal Marginal Relevance (MMR)
- Metadata filtering
- Lightweight retrieval compression
- A tiny end-to-end RAG chain

Run this bootstrap cell before running any of the cells below.

In [11]:
# Environment & imports
import os, json
from dotenv import load_dotenv
load_dotenv()

if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("Missing OPENAI_API_KEY in .env")

# Embeddings + vector stores
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Prompts & model for RAG demo
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Documents, loading & splitting (we load a PDF below)
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

print("✅ Ready: Embeddings + FAISS + Chat model")

✅ Ready: Embeddings + FAISS + Chat model


Let's create small chunks using the `RecursiveCharacterTextSplitter` we saw earlier.

In [12]:
pdf_loader = PyPDFLoader("rag_info.pdf")
pdf_docs = pdf_loader.load() 

splitter = RecursiveCharacterTextSplitter(chunk_size=120, chunk_overlap=20)
docs = splitter.split_documents(pdf_docs)
len(docs), docs[:2]

(912,
 [Document(metadata={'producer': 'Adobe PDF Library 24.5.96', 'creator': 'Acrobat PDFMaker 24 for Word', 'creationdate': '2025-01-27T08:44:50-06:00', 'author': 'Joel Youvan', 'comments': '', 'company': '', 'keywords': '', 'moddate': '2025-01-27T08:44:55-06:00', 'sourcemodified': 'D:20250127144434', 'subject': '', 'title': '', 'rgid': 'PB:388414789_AS:11431281305729682@1737989125997', 'source': 'rag_info.pdf', 'total_pages': 58, 'page': 0, 'page_label': '1'}, page_content='See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/388414789'),
  Document(metadata={'producer': 'Adobe PDF Library 24.5.96', 'creator': 'Acrobat PDFMaker 24 for Word', 'creationdate': '2025-01-27T08:44:50-06:00', 'author': 'Joel Youvan', 'comments': '', 'company': '', 'keywords': '', 'moddate': '2025-01-27T08:44:55-06:00', 'sourcemodified': 'D:20250127144434', 'subject': '', 'title': '', 'rgid': 'PB:388414789_AS:11431281305729682@1737989125997', 'source

## Build embeddings + FAISS vector store

Embedding converts each chunk into a vector that captures the chunk's semantic content.

In [13]:
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # fast & inexpensive for learning
vs = FAISS.from_documents(docs, embedding=embeddings)

print("Indexed vectors:", vs.index.ntotal)

Indexed vectors: 912


### How it works

1. Each chunk is embedded into a vector.
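2. FAISS stores the vectors in an index. LangChain's wrapper builds a flat (exact) index by default and searches it by L2 (Euclidean) distance, not cosine similarity.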
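The two steps above can be sketched in plain Python. This is a minimal illustration, not how FAISS is implemented: `toy_embed`, `VOCAB`, and `ToyFlatIndex` are hypothetical names, and the word-count "embedding" is only a stand-in for a real embedding model.

```python
from typing import List, Tuple

# Hypothetical 4-word vocabulary; a real embedding model produces dense learned vectors.
VOCAB = ["vector", "search", "index", "llm"]


def toy_embed(text: str) -> List[float]:
    """Count how often each vocabulary word appears (toy stand-in for an embedding model)."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


class ToyFlatIndex:
    """Keeps vectors in a list and searches them by brute-force squared L2 distance."""

    def __init__(self) -> None:
        self.vectors: List[List[float]] = []

    def add(self, vector: List[float]) -> None:
        self.vectors.append(vector)

    @property
    def ntotal(self) -> int:
        return len(self.vectors)

    def search(self, query: List[float], k: int) -> List[Tuple[int, float]]:
        """Return (position, squared L2 distance) for the k nearest stored vectors."""
        scored = [
            (i, sum((a - b) ** 2 for a, b in zip(vec, query)))
            for i, vec in enumerate(self.vectors)
        ]
        scored.sort(key=lambda pair: pair[1])  # smallest distance first
        return scored[:k]


chunks = [
    "vector search with an index",
    "the llm writes the answer",
    "index the vector then search the index",
]

# Step 1: embed each chunk. Step 2: store the vectors in the index.
index = ToyFlatIndex()
for chunk in chunks:
    index.add(toy_embed(chunk))

results = index.search(toy_embed("vector index search"), k=2)
for position, distance in results:
    print(f"distance={distance:.1f} :: {chunks[position]}")
```

The index only stores positions and vectors; mapping a position back to its chunk text is the job of the surrounding vector store.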

## Basic Similarity Search and k-size

Top-k (the `k` argument) is the number of chunks returned. With `k=3`, the three chunks closest to the user query are returned.

A basic similarity search finds the items in a collection that are closest in meaning to the query by comparing their vector embeddings. LangChain's FAISS wrapper ranks by L2 (Euclidean) distance by default; because OpenAI embeddings are normalized to unit length, this yields the same ranking as cosine similarity.
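To make the ranking concrete, here is a small sketch with hand-made 2-D vectors (the vectors and helper names are illustrative assumptions, not real embeddings). For unit-length vectors, sorting by highest cosine similarity and sorting by lowest squared L2 distance select the same top-k.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance: lower means closer."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy 2-D "embeddings", normalized to unit length
vectors = [normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [2.0, 1.0])]
query = normalize([1.0, 0.2])
k = 3

positions = range(len(vectors))
by_cosine = sorted(positions, key=lambda i: cosine(query, vectors[i]), reverse=True)[:k]
by_l2 = sorted(positions, key=lambda i: squared_l2(query, vectors[i]))[:k]

print("top-k by cosine similarity:  ", by_cosine)
print("top-k by squared L2 distance:", by_l2)
```

Both lists come out identical here; if the vectors were not normalized, the two rankings could differ.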

In [14]:
query = "How do I search relevant chunks efficiently?"
hits = vs.similarity_search(query, k=3)  # returns Documents
for i, d in enumerate(hits, 1):
    print(f"\n--- HIT {i} ---")
    print(d.page_content)
    print("metadata:", d.metadata)


--- HIT 1 ---
searching and retrieving the most relevant information grows,
metadata: {'producer': 'Adobe PDF Library 24.5.96', 'creator': 'Acrobat PDFMaker 24 for Word', 'creationdate': '2025-01-27T08:44:50-06:00', 'author': 'Joel Youvan', 'comments': '', 'company': '', 'keywords': '', 'moddate': '2025-01-27T08:44:55-06:00', 'sourcemodified': 'D:20250127144434', 'subject': '', 'title': '', 'rgid': 'PB:388414789_AS:11431281305729682@1737989125997', 'source': 'rag_info.pdf', 'total_pages': 58, 'page': 24, 'page_label': '25'}

--- HIT 2 ---
queries. 
• Optimizing indexing techniques (e.g., approximate nearest neighbor search) 
for faster lookups.
metadata: {'producer': 'Adobe PDF Library 24.5.96', 'creator': 'Acrobat PDFMaker 24 for Word', 'creationdate': '2025-01-27T08:44:50-06:00', 'author': 'Joel Youvan', 'comments': '', 'company': '', 'keywords': '', 'moddate': '2025-01-27T08:44:55-06:00', 'sourcemodified': 'D:20250127144434', 'subject': '', 'title': '', 'rgid': 'PB:388414789_AS:114

## Similarity Search with scores

Sometimes you want both the documents and their similarity scores, and to filter weak matches.

In [17]:
query = "What are the advantages of RAG?"
scored = vs.similarity_search_with_score(query, k=5)  # list of (Document, score)
# Lower score means closer: the FAISS wrapper returns the raw distance (L2 by default), not a similarity
threshold = 0.6  # tune this per model/index; smaller is stricter if using L2 distance
kept = [(d, s) for d, s in scored if s <= threshold]

print(f"Returned {len(scored)}; kept {len(kept)} under threshold {threshold}\n")

print(type(scored[0]))

for d, s in kept:
    print(f"score={s:.3f} :: {d.page_content[:100]}...")

Returned 5; kept 5 under threshold 0.6

<class 'tuple'>
score=0.461 :: Key Advantages of RAG Over Traditional Models:...
score=0.554 :: paper explores the fundamentals of RAG, its technical implementation, key...
score=0.565 :: 6 
 
Core Components of RAG 
At a high level, RAG consists of two primary components that work in ta...
score=0.578 :: reliable and domain-specific insights. These advantages make RAG particularly...
score=0.590 :: Challenges of RAG Compared to Traditional Models:...


### How it works

`similarity_search_with_score` returns a list of tuples. In each tuple the first value is the `Document` object and the second is its score. With the default FAISS index the score is an L2 distance, so lower means closer.
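The sketch below shows one way to work with such tuples, using plain strings in place of `Document` objects and the scores from the sample output above. `keep_close` and `squared_l2_to_cosine` are hypothetical helpers. The conversion assumes the embeddings are unit length and that the index reports squared L2 distances (which FAISS flat L2 indexes are documented to do); treat it as a rough guide and verify against your own setup.

```python
from typing import List, Tuple

# (text, distance) pairs shaped like the (Document, score) tuples;
# scores copied from the sample output, texts shortened.
scored: List[Tuple[str, float]] = [
    ("Key Advantages of RAG Over Traditional Models:", 0.461),
    ("paper explores the fundamentals of RAG", 0.554),
    ("Core Components of RAG", 0.565),
    ("These advantages make RAG particularly", 0.578),
    ("Challenges of RAG Compared to Traditional Models:", 0.590),
]


def keep_close(pairs: List[Tuple[str, float]], max_distance: float) -> List[Tuple[str, float]]:
    """Keep pairs whose distance is at or below max_distance (lower = closer)."""
    return [(text, score) for text, score in pairs if score <= max_distance]


def squared_l2_to_cosine(distance: float) -> float:
    """For unit-length vectors, squared L2 distance d and cosine similarity c satisfy d = 2 - 2c."""
    return 1.0 - distance / 2.0


kept = keep_close(scored, max_distance=0.56)
for text, score in kept:
    print(f"distance={score:.3f}  cosine~{squared_l2_to_cosine(score):.3f} :: {text}")
```

A stricter threshold such as 0.56 drops the three weakest matches, whereas the 0.6 used above keeps all five.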

## Maximal Marginal Relevance (MMR)

Maximal Marginal Relevance (MMR) is a method used to retrieve a list of items that are both relevant to a query and diverse from one another.

In a plain similarity search, the top retrieved documents tend to resemble one another. If your document repeats similar wording about a topic, a similarity search may return only those near-duplicate passages, which makes the retrieved chunks too similar in meaning.

MMR diversifies these results:

1. First, it selects the single most relevant item for your query.
2. Then, for every subsequent selection, it picks the item that offers the best trade-off between being relevant to the query and being different from the items already selected.

In [18]:
query = "Explain retrieval and vector stores."
mmr_hits = vs.max_marginal_relevance_search(query, k=4, fetch_k=10, lambda_mult=0.5)
for i, d in enumerate(mmr_hits, 1):
    print(f"\n--- MMR {i} ---")
    print(d.page_content)


--- MMR 1 ---
o Example: Elasticsearch, Apache Solr 
• Vector Indexing (for dense retrieval):

--- MMR 2 ---
2. Fundamentals of Retrieval-Augmented Generation (RAG)

--- MMR 3 ---
o Stores document embeddings in a high-dimensional space. 
o Example: FAISS, Annoy 
• Sharding and Partitioning:

--- MMR 4 ---
Retrieval (DPR) encode text into embeddings that can capture 
contextual meaning. 
o Advantages:


### How it works

- `fetch_k` is the size of the candidate pool: how many documents are first retrieved by plain similarity search before MMR's diversification logic is applied to them.
    - If `fetch_k` is too small, there is little room for diversity in the results.
- `lambda_mult` controls the trade-off between relevance (similarity to the query) and diversity (dissimilarity from already selected documents).
    - At $1.0$, there is no diversification: the similarity search results are returned without any additional MMR logic.
    - At $0.0$, diversity is maximized, which can return results that are very different from each other but not very relevant to the query.
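To see how `fetch_k` and `lambda_mult` interact, here is a minimal pure-Python sketch of the MMR selection loop. It is an illustration under assumptions, not LangChain's actual code: the `mmr` and `unit` helpers are hypothetical, and the "embeddings" are 2-D unit vectors defined by an angle so the similarities are easy to reason about.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query: Sequence[float], vectors: List[Sequence[float]],
        k: int, fetch_k: int, lambda_mult: float) -> List[int]:
    """Return positions of k vectors chosen by Maximal Marginal Relevance."""
    # Candidate pool: the fetch_k vectors most similar to the query.
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(query, vectors[i]), reverse=True)
    candidates = ranked[:fetch_k]
    selected: List[int] = []
    while candidates and len(selected) < k:
        if not selected:
            # First pick: the single most relevant candidate.
            best = candidates[0]
        else:
            def score(i: int) -> float:
                relevance = cosine(query, vectors[i])
                # Redundancy: similarity to the closest already-selected vector.
                redundancy = max(cosine(vectors[i], vectors[j]) for j in selected)
                return lambda_mult * relevance - (1.0 - lambda_mult) * redundancy
            best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


def unit(angle_deg: float) -> List[float]:
    """2-D unit vector at the given angle; the query sits at 0 degrees."""
    r = math.radians(angle_deg)
    return [math.cos(r), math.sin(r)]


query = unit(0)
# Three near-duplicates (5, 8, 10 degrees) plus two more distinct vectors.
vectors = [unit(a) for a in (5, 8, 10, -30, 60)]

for lam in (1.0, 0.5, 0.0):
    print(f"lambda_mult={lam}: {mmr(query, vectors, k=3, fetch_k=5, lambda_mult=lam)}")
print("small pool:", mmr(query, vectors, k=3, fetch_k=3, lambda_mult=0.5))
```

With `lambda_mult=1.0` the three near-duplicates are returned; lowering it pulls in the more distinct vectors, and with `fetch_k=3` the pool contains only near-duplicates, so no diversification is possible.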

## Vector store as Retriever

`as_retriever()` wraps the vector store in a Retriever. By default it uses similarity search to find documents, and `search_kwargs` (such as `k`) are passed through to that search.
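Conceptually, a retriever is a thin wrapper that remembers the search settings and exposes a single "query in, documents out" method. The sketch below imitates that shape with hypothetical `ToyDoc`, `ToyStore`, and `ToyRetriever` classes and a keyword-overlap score instead of embeddings. It also includes a simple metadata equality filter; LangChain's FAISS wrapper accepts a similar `filter` entry in `search_kwargs`, but check the documentation for your version before relying on it.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToyDoc:
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class ToyStore:
    """Stand-in for a vector store: ranks docs by words shared with the query."""

    def __init__(self, docs: List[ToyDoc]) -> None:
        self.docs = docs

    def similarity_search(self, query: str, k: int = 4,
                          filter: Optional[Dict[str, Any]] = None) -> List[ToyDoc]:
        # `filter` mirrors the keyword name used in search_kwargs (assumption noted above).
        wanted = filter or {}
        pool = [d for d in self.docs
                if all(d.metadata.get(key) == value for key, value in wanted.items())]
        query_words = set(query.lower().split())
        ranked = sorted(
            pool,
            key=lambda d: len(query_words & set(d.page_content.lower().split())),
            reverse=True,
        )
        return ranked[:k]

    def as_retriever(self, search_kwargs: Optional[Dict[str, Any]] = None) -> "ToyRetriever":
        return ToyRetriever(self, search_kwargs or {})


class ToyRetriever:
    """Remembers the search settings; callers only supply the query."""

    def __init__(self, store: ToyStore, search_kwargs: Dict[str, Any]) -> None:
        self.store = store
        self.search_kwargs = search_kwargs

    def invoke(self, query: str) -> List[ToyDoc]:
        return self.store.similarity_search(query, **self.search_kwargs)


store = ToyStore([
    ToyDoc("a retriever returns relevant documents for a query", {"page": 1}),
    ToyDoc("faiss stores embeddings in an index", {"page": 2}),
    ToyDoc("the retriever interface wraps a vector store", {"page": 2}),
])

question = "which interface returns relevant documents"
top = store.as_retriever(search_kwargs={"k": 2}).invoke(question)
page_two = store.as_retriever(search_kwargs={"k": 2, "filter": {"page": 2}}).invoke(question)

for d in top:
    print("-", d.page_content)
for d in page_two:
    print("- [page 2]", d.page_content)
```

Because the settings live in the retriever, downstream code (such as a RAG chain) can call `invoke` without knowing which store or search strategy sits behind it.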

In [7]:
# Basic retriever
retriever = vs.as_retriever(search_kwargs={"k": 3})

# Try it
for d in retriever.invoke("What interface returns relevant docs for a query?"):  # invoke() replaces the deprecated get_relevant_documents()
    print("-", d.page_content)


- The retrieval system relies on various methods to identify the most relevant data 
based on the input query:
- queries to ensure more relevant and precise results.
- Key Advancements: 
1. Interactive Query Refinement:


## ContextualCompressionRetriever

`ContextualCompressionRetriever` is a LangChain retriever that "compresses" retrieved documents before they reach the Large Language Model (LLM), which can improve response quality in a RAG (Retrieval-Augmented Generation) system.

It removes context that is irrelevant to the user's query from the retrieved documents, reducing the number of tokens processed per query.

In [8]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import EmbeddingsFilter

# Simple embeddings-based filter: drops whole documents below the similarity threshold
compressor = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.6)
compressed_retriever = ContextualCompressionRetriever(
    base_retriever=retriever,
    base_compressor=compressor
)

docs_compressed = compressed_retriever.invoke("What is a Vector embedding and why is it needed?")
for i, d in enumerate(docs_compressed, 1):
    print(f"\n--- COMPRESSED {i} ---")
    print(d.page_content)

### How it works

1. Initial Retrieval: The `ContextualCompressionRetriever` first uses its `base_retriever` to fetch a set of documents based on the user's query. This step typically focuses on recall (getting a broad set of potentially relevant documents), which can often include a lot of irrelevant "fluff."
2. Compression: The retrieved documents are then passed to the `base_compressor`. This component processes the documents, filtering out irrelevant content and keeping only the information most pertinent to the query.

The compression can be done in a few ways:

- **LLM-based compression:** A document compressor like `LLMChainExtractor` uses an LLM to read through each document and extract only the sentences or passages that are highly relevant to the query.

- **Embedding-based filtering:** A compressor like `EmbeddingsFilter` uses embeddings to filter out entire documents if their similarity to the query falls below a certain threshold. _(This is the one we use in the above example)_
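The embedding-based variant can be sketched in a few lines of plain Python. This is an illustration of the idea only: `embeddings_filter` and `compressed_retrieve` are hypothetical helpers, the 2-D vectors are made up, and the base retriever is a stub that returns every document.

```python
import math
from typing import Callable, List, Sequence, Tuple

DocWithVector = Tuple[str, Sequence[float]]  # (text, embedding)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def embeddings_filter(query_vec: Sequence[float], docs: List[DocWithVector],
                      similarity_threshold: float) -> List[str]:
    """Keep whole documents whose cosine similarity to the query meets the threshold."""
    return [text for text, vec in docs if cosine(query_vec, vec) >= similarity_threshold]


def compressed_retrieve(query_vec: Sequence[float],
                        base_retrieve: Callable[[Sequence[float]], List[DocWithVector]],
                        similarity_threshold: float) -> List[str]:
    # Step 1: broad initial retrieval. Step 2: filter out weak matches.
    candidates = base_retrieve(query_vec)
    return embeddings_filter(query_vec, candidates, similarity_threshold)


corpus: List[DocWithVector] = [
    ("vector embeddings encode meaning", [0.9, 0.1]),
    ("office hours are 9 to 5", [0.1, 0.9]),
    ("embeddings enable similarity search", [0.8, 0.3]),
]


def base_retrieve(query_vec: Sequence[float]) -> List[DocWithVector]:
    """Stub base retriever: returns the whole toy corpus (recall-oriented)."""
    return corpus


kept = compressed_retrieve([1.0, 0.0], base_retrieve, similarity_threshold=0.6)
for text in kept:
    print("-", text)
```

Note that this kind of filter never shortens a document; it only decides whether each retrieved document is kept or dropped.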

## A tiny RAG

Let's implement a tiny RAG to get to know more about Embeddings and Retrieval.

In [20]:
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)

rag_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using ONLY the provided context. If unsure, say you don't know.\n\nContext:\n{context}"),
    ("user", "{question}")
])

def stuff_context(question: str, retriever):
    docs = retriever.invoke(question)  # invoke() replaces the deprecated get_relevant_documents()
    context = "\n\n".join(d.page_content for d in docs)
    return {"context": context, "question": question}

rag_chain = (
    # adapter that builds the input dict expected by the prompt
    (lambda x: stuff_context(x["question"], retriever))
    | rag_prompt
    | llm
    | StrOutputParser()
)

print("\n--- RAG ANSWER ---\n")
print(rag_chain.invoke({"question": "How RAG reduces hallucinations?"}))


--- RAG ANSWER ---

RAG reduces hallucinations by retrieving data to check outputs, which reduces the likelihood of hallucinations and enhances factual correctness.


## Exercise

Try to implement MMR in the above example.