<a href="https://colab.research.google.com/github/singhraj00/langchain-tutorial/blob/main/RAG_Retrievers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## What are Retrievers?

## **A retriever is a component in LangChain that fetches relevant documents from a data source in response to a user's query.**

![](https://k21academy.com/wp-content/uploads/2024/12/Understanding-RAG-with-LangChain-blog-image-4.png)

- LangChain provides many retriever implementations, which differ in data source and search strategy
- All retrievers in LangChain are runnables

## Types of Retrievers

## Based On Data-Sources
- Wikipedia Retriever
- Vector Store Retriever
- Arxiv Retriever

## Based on Search Strategy
- MMR (Maximal Marginal Relevance)
- Multi-Query Retriever
- Contextual Compression Retriever

## Based On Data-Sources

## 1. Wikipedia Retriever

### **A Wikipedia retriever is a retriever that queries the Wikipedia API to fetch relevant content for a given query.**

### How It Works
- You give it a query (e.g. "Albert Einstein")
- It sends the query to Wikipedia's API
- It retrieves the most relevant articles
- It returns them as LangChain `Document` objects.

In [3]:
!pip install langchain langchain-core langchain-community langchain-huggingface langchain-groq chromadb faiss-cpu wikipedia


## Code Example

In [9]:
from langchain_community.retrievers import WikipediaRetriever

# Return at most 2 English-language results per query
retriever = WikipediaRetriever(top_k_results=2, lang='en')

# Define your query
query = "Future of Agentic AI"

# Retrievers are runnables, so call them with the invoke method
docs = retriever.invoke(query)

# Print the retrieved Wikipedia documents
for i, doc in enumerate(docs):
  print(f'\n ----- Result {i}    ---- \n')
  print(f'content: \n {doc.page_content}')



 ----- Result 0    ---- 

content: 
 Agentic AI is a class of artificial intelligence that focuses on autonomous systems that can make decisions and perform tasks without human intervention. The independent systems automatically respond to conditions, to produce process results. The field is closely linked to agentic automation, also known as agent-based process management systems, when applied to process automation. Applications include software development, customer support, cybersecurity and business intelligence. 


== Core concept ==
The core concept of agentic AI is the use of AI agents to perform automated tasks but without human intervention. While robotic process automation (RPA) and AI agents can be programmed to automate specific tasks or support rule-based decisions, the rules are usually fixed. Agentic AI operates independently, making decisions through continuous learning and analysis of external data and complex data sets. Functioning agents can require various AI techn

## Vector Store Retriever

### A vector store retriever is the most common type of retriever in LangChain. It lets you search and fetch documents from a vector store based on the similarity of their vector embeddings.

## ⚙️How It Works

![](https://miro.medium.com/v2/resize:fit:1400/1*lN7sEOHJEjMpyNH1ahAvbQ.png)

- You store your documents in a vector store (like FAISS, Chroma, etc.)
- Each document is converted into a **dense vector** using an **embedding model**
- When a query arrives, it is also turned into a vector with the same embedding model
- The retriever compares the query vector with the stored vectors
- It retrieves the top-k most similar ones.
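
The comparison step above can be sketched in plain Python. This is only a minimal illustration, assuming cosine similarity as the metric and made-up 3-dimensional vectors in place of real embeddings. The helper names `cosine_similarity` and `top_k` are hypothetical and not part of LangChain, and a real vector store such as FAISS or Chroma uses optimized indexes rather than a full sort.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k stored vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy "embeddings"; real models produce hundreds of dimensions.
doc_vecs = [
    [0.9, 0.1, 0.0],  # doc 0: points almost the same way as the query
    [0.0, 1.0, 0.0],  # doc 1: orthogonal to the query
    [0.7, 0.3, 0.1],  # doc 2: fairly close to the query
]
query_vec = [1.0, 0.0, 0.0]

print(top_k(query_vec, doc_vecs, k=2))  # indices of the two closest docs
```

Here the query is closest to doc 0, then doc 2, so those two indices are returned in that order.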

## Code Example

In [10]:
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

In [17]:
documents = [
    Document(page_content="Artificial intelligence (AI) is a technology that allows machines to perform tasks that typically require human intelligence. AI can learn, reason, solve problems, and make decisions." ),
    Document(page_content="Generative AI (GenAI) is a type of artificial intelligence (AI) that can create new content like text, images, videos, and audio." ),
    Document(page_content="Agentic AI refers to AI systems that can act autonomously, making decisions and taking actions to achieve specific goals, with minimal human intervention." ),
    Document(page_content="LangChain is an open-source framework that simplifies the development of applications powered by large language models (LLMs)." ),
    Document(page_content="RAG stands for Retrieval-Augmented Generation. It's an AI technique that enhances the output of large language models (LLMs) by incorporating external knowledge from various sources. RAG helps LLMs generate more accurate, relevant, and up-to-date responses by leveraging their generative capabilities while also accessing and integrating information from external databases, documents, or web sources. ")
]

In [14]:
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-large-en-v1.5"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True}

embedding_function = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)


In [19]:
pip install chromadb


In [20]:
from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents, embedding_function, collection_name="my_collection")

In [25]:
## create retriever object
retriever = vectorstore.as_retriever(search_kwargs={"k":1})

In [26]:
query = "what is langchain?"

result = retriever.invoke(query)

for i,doc in enumerate(result):
  print(f'\n === result {i+1} ===')
  print(doc.page_content)


 === result 1 ===
LangChain is an open-source framework that simplifies the development of applications powered by large language models (LLMs).


## Maximal Marginal Relevance (MMR)

``"how can we pick results that are not only relevant to the query but also different from each other?"``

### **MMR is an information retrieval algorithm designed to reduce redundancy in the retrieved results while maintaining high relevance to the query.**

## ❓Why MMR Retriever?

In regular similarity search, you may get documents that are:
- All very similar to each other
- Repeating the same info
- Lacking diverse perspectives

✅ MMR Retriever avoids that by:
- Picking the **most relevant document** first
- Then picking the next document that is most relevant to the query and **least similar** to the already selected docs
- And so on, until `k` documents are selected

🔰 This helps especially in RAG pipelines where:
- You want your context window to contain **diverse but still relevant information.**
- Your documents are semantically overlapping.
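
The greedy selection described above can be sketched in plain Python. This is an illustrative sketch, not LangChain's internal code. It assumes cosine similarity and scores each candidate as `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the similarity to the closest already selected document. The vectors are invented, and the helper names `cosine` and `mmr` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedy MMR: return indices of k docs balancing relevance and diversity."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical 2-dimensional vectors chosen so that doc 1 nearly duplicates doc 0.
doc_vecs = [
    [0.98, 0.17],   # doc 0: most relevant
    [0.97, 0.21],   # doc 1: near-duplicate of doc 0
    [0.77, -0.64],  # doc 2: less relevant, but different from doc 0
]
query_vec = [1.0, 0.0]

print(mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5))  # balanced
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # relevance only
```

With `lambda_mult=0.5` the near-duplicate is skipped in favor of the more distinct doc 2. With `lambda_mult=1.0` the redundancy term vanishes and the result is the same as plain similarity search.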

## Code Example

In [None]:
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import FAISS


# Sample documents
docs = [
    Document(page_content="LangChain makes it easy to work with LLMs."),
    Document(page_content="LangChain is used to build LLM based applications."),
    Document(page_content="Chroma is used to store and search document embeddings."),
    Document(page_content="Embeddings are vector representations of text."),
    Document(page_content="MMR helps you get diverse results when doing similarity search."),
    Document(page_content="LangChain supports Chroma, FAISS, Pinecone, and more."),
]

## embedding model
model_name = "BAAI/bge-large-en-v1.5"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True}

embedding_function = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

## load vector store
vectorstore = FAISS.from_documents(docs, embedding_function)

## Enable MMR in the retriever
retriever = vectorstore.as_retriever(search_type="mmr",  ## --- this enables MMR
                                     search_kwargs={'k':3,'lambda_mult':0.5}
                                     ## k = top results , lambda_mult = relevance-diversity balance
                                     )

## define query
query = "what is langchain?"

## retrieve relevant information
results = retriever.invoke(query)

## print relevant info
for i,doc in enumerate(results):
  print(f'\n === result {i+1} ===')
  print(doc.page_content)


## Multi-Query Retriever

``"Sometimes a single query might not capture all the ways information is phrased in your documents."``

## ❓**Query:** ``How can I stay healthy?``

✅ Could mean:
- What should I eat?
- How often should I exercise?
- How can I manage stress?

#### A simple similarity search might **miss documents** that talk about those things but don't use the word "healthy".

## ⚙️How It Works

![](https://miro.medium.com/v2/resize:fit:1200/1*wa27F8CbKqYtaRRGyRKy0Q.png)

- Takes your original query
- Uses an LLM (e.g. GPT-4o) to generate multiple semantically different versions of that query
- Performs retrieval for each sub-query
- Combines and deduplicates the results
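
The flow above can be sketched without an LLM or a vector store. In this minimal sketch, `fake_generate_queries`, `fake_retrieve`, and `FAKE_INDEX` are hypothetical stand-ins for the LLM call and the base retriever, and the documents are plain strings rather than `Document` objects. The point is the final merge, which keeps only the first occurrence of each document.

```python
def unique_union(result_lists):
    """Flatten several result lists, keeping the first occurrence of each document."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

def multi_query_retrieve(query, generate_queries, retrieve):
    """Run retrieval once per generated query variant and merge the results."""
    variants = generate_queries(query)
    return unique_union([retrieve(q) for q in variants])

# Hypothetical stand-in for the LLM: always returns three fixed rewrites.
def fake_generate_queries(query):
    return [
        "What should I eat?",
        "How often should I exercise?",
        "How can I manage stress?",
    ]

# Hypothetical stand-in for a vector store lookup.
FAKE_INDEX = {
    "What should I eat?": ["Eat leafy greens.", "Drink water."],
    "How often should I exercise?": ["Walk daily.", "Drink water."],
    "How can I manage stress?": ["Practice mindfulness.", "Walk daily."],
}

def fake_retrieve(query):
    return FAKE_INDEX.get(query, [])

results = multi_query_retrieve(
    "How can I stay healthy?", fake_generate_queries, fake_retrieve
)
print(results)
```

Six documents come back across the three sub-queries, but only four distinct ones remain after the merge.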

## Code Example

In [55]:
from google.colab import userdata
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_groq import ChatGroq

model_name = "BAAI/bge-large-en-v1.5"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True}

embedding_function = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)


API_KEY = userdata.get('GROQ_API_KEY')

llm = ChatGroq(
      model="llama-3.1-8b-instant",
      temperature=0.7,
      groq_api_key=API_KEY,
)

# Relevant health & wellness documents
all_docs = [
    Document(page_content="Regular walking boosts heart health and can reduce symptoms of depression.", metadata={"source": "H1"}),
    Document(page_content="Consuming leafy greens and fruits helps detox the body and improve longevity.", metadata={"source": "H2"}),
    Document(page_content="Deep sleep is crucial for cellular repair and emotional regulation.", metadata={"source": "H3"}),
    Document(page_content="Mindfulness and controlled breathing lower cortisol and improve mental clarity.", metadata={"source": "H4"}),
    Document(page_content="Drinking sufficient water throughout the day helps maintain metabolism and energy.", metadata={"source": "H5"}),
    Document(page_content="The solar energy system in modern homes helps balance electricity demand.", metadata={"source": "I1"}),
    Document(page_content="Python balances readability with power, making it a popular system design language.", metadata={"source": "I2"}),
    Document(page_content="Photosynthesis enables plants to produce energy by converting sunlight.", metadata={"source": "I3"}),
    Document(page_content="The 2022 FIFA World Cup was held in Qatar and drew global energy and excitement.", metadata={"source": "I4"}),
    Document(page_content="Black holes bend spacetime and store immense gravitational energy.", metadata={"source": "I5"}),
]


vectorstores = FAISS.from_documents(all_docs,embedding_function)

similarity_retriever = vectorstores.as_retriever(search_type='similarity',search_kwargs={"k":5})

multiquery_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstores.as_retriever(search_kwargs={"k":5}),
    llm=llm,
)

## define your query
query = "How to improve energy levels and maintain balance?"

## retrieve results

similarity_results = similarity_retriever.invoke(query)
multiquery_results = multiquery_retriever.invoke(query)

## compare results
for i, doc in enumerate(similarity_results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content)

print("*"*150)

for i, doc in enumerate(multiquery_results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content)




--- Result 1 ---
Drinking sufficient water throughout the day helps maintain metabolism and energy.

--- Result 2 ---
Mindfulness and controlled breathing lower cortisol and improve mental clarity.

--- Result 3 ---
Consuming leafy greens and fruits helps detox the body and improve longevity.

--- Result 4 ---
Regular walking boosts heart health and can reduce symptoms of depression.

--- Result 5 ---
Deep sleep is crucial for cellular repair and emotional regulation.
******************************************************************************************************************************************************

--- Result 1 ---
Photosynthesis enables plants to produce energy by converting sunlight.

--- Result 2 ---
Drinking sufficient water throughout the day helps maintain metabolism and energy.

--- Result 3 ---
Black holes bend spacetime and store immense gravitational energy.

--- Result 4 ---
Mindfulness and controlled breathing lower cortisol and improve mental clarity.



## Contextual Compression Retriever

### **The Contextual Compression Retriever in LangChain is an advanced retriever that improves retrieval quality by compressing the retrieved documents, keeping only the content relevant to the user's query.**

## ❓ Query:

### `What is photosynthesis?`

## Retrieved Document (by a traditional retriever)

### `The Grand Canyon is a famous natural site. Photosynthesis is how plants convert light into energy. Many tourists visit every year.`

## ❌ Problem:
- The retriever returns the **entire paragraph.**
- Only **one sentence** is actually relevant to the query.
- The rest is **irrelevant noise** that wastes the context window and may distract the LLM.


## ✅ What Contextual Compression Does:

**Returns only the relevant part**, e.g.

#### `Photosynthesis is how plants convert light into energy.`

## ⚙️How It Works

![](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5897dad-1498-4ef7-8c50-c4db976e9656_1009x709.png)

1. A base retriever (e.g. FAISS, Chroma) retrieves N documents.
2. A **compressor** (usually an LLM) is applied to each document.
3. The compressor keeps **only the parts relevant to the query.**
4. Irrelevant content is discarded.
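
The four steps can be sketched with a simple keyword filter standing in for the LLM compressor. This is only an illustration of the data flow. The functions `split_sentences`, `keyword_compressor`, and `compress_documents` are hypothetical helpers, the documents are plain strings, and an LLM-based compressor judges relevance semantically rather than by substring match.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def keyword_compressor(text, query_terms):
    """Keep only sentences mentioning a query term (stand-in for an LLM compressor)."""
    kept = [
        s for s in split_sentences(text)
        if any(term.lower() in s.lower() for term in query_terms)
    ]
    return " ".join(kept)

def compress_documents(docs, query_terms):
    """Compress each retrieved document and drop those with nothing relevant left."""
    compressed = []
    for text in docs:
        extract = keyword_compressor(text, query_terms)
        if extract:
            compressed.append(extract)
    return compressed

# Step 1 (assumed already done): the base retriever returned these two documents.
retrieved = [
    "The Grand Canyon is a famous natural site. "
    "Photosynthesis is how plants convert light into energy. "
    "Many tourists visit every year.",
    "Knights wore armor made of metal. Castles were built for defense.",
]

print(compress_documents(retrieved, ["photosynthesis"]))
```

Only the photosynthesis sentence survives from the first document, and the second document is dropped entirely because nothing in it matches.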

## ✅ When to Use
- Your documents are **long and contain mixed information**
- You want to **reduce context length** for the LLM.
- You need to **improve answer accuracy in RAG pipelines.**

## Code Example

In [51]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS

In [52]:
# Recreate the document objects from the previous data
docs = [
    Document(page_content=(
        """The Grand Canyon is one of the most visited natural wonders in the world.
        Photosynthesis is the process by which green plants convert sunlight into energy.
        Millions of tourists travel to see it every year. The rocks date back millions of years."""
    ), metadata={"source": "Doc1"}),

    Document(page_content=(
        """In medieval Europe, castles were built primarily for defense.
        The chlorophyll in plant cells captures sunlight during photosynthesis.
        Knights wore armor made of metal. Siege weapons were often used to breach castle walls."""
    ), metadata={"source": "Doc2"}),

    Document(page_content=(
        """Basketball was invented by Dr. James Naismith in the late 19th century.
        It was originally played with a soccer ball and peach baskets. NBA is now a global league."""
    ), metadata={"source": "Doc3"}),

    Document(page_content=(
        """The history of cinema began in the late 1800s. Silent films were the earliest form.
        Thomas Edison was among the pioneers. Photosynthesis does not occur in animal cells.
        Modern filmmaking involves complex CGI and sound design."""
    ), metadata={"source": "Doc4"})
]

In [54]:
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-large-en-v1.5"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True}

embedding_function = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

vectorstore = FAISS.from_documents(docs, embedding_function)

# ChatGroq and API_KEY are defined in the Multi-Query Retriever cell above
llm = ChatGroq(
      model="llama-3.1-8b-instant",
      temperature=0.7,
      groq_api_key=API_KEY,
)

compressor = LLMChainExtractor.from_llm(llm)

## create the contextual compression retriever

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 5})
)

## define query
query = "what is photosynthesis?"

compressed_results = compression_retriever.invoke(query)

for i,doc in enumerate(compressed_results):
  print(f'\n ===== Result {i+1} =====')
  print(doc.page_content)


 ===== Result 1 =====
Photosynthesis is the process by which green plants convert sunlight into energy.

 ===== Result 2 =====
The chlorophyll in plant cells captures sunlight during photosynthesis.

 ===== Result 3 =====
Photosynthesis does not occur in animal cells.


## More Retrievers
- BM25Retriever
- ParentDocumentRetriever
- SelfQueryRetriever
- TimeWeightedVectorRetriever
- MultiVectorRetriever
- EnsembleRetriever
- ArxivRetriever