Imagine you are working on a project that involves processing a large collection of text documents, such as research papers, legal documents, or customer service logs. Your task is to develop a system that can quickly retrieve the most relevant segments of text based on a user's query. Traditional keyword-based search methods might not be sufficient, as they often fail to capture the nuanced meanings and contexts within the documents. To address this challenge, you can use the different types of retrievers that LangChain provides.

Using retrievers is crucial for several reasons:

- Efficiency: Retrievers enable fast and efficient retrieval of relevant information from large datasets, saving time and computational resources.
- Accuracy: By leveraging advanced retrieval techniques, these tools can provide more accurate and contextually relevant results compared to traditional search methods.
- Versatility: Different retrievers can be tailored to specific use cases, making them adaptable to various types of text data and query requirements.
- Context awareness: Some retrievers, like the Parent Document Retriever, can consider the broader context of the document, enhancing the relevance of the retrieved segments.


We will learn about four types of retrievers: `Vector Store-backed Retriever`, `Multi-Query Retriever`, `Self-Querying Retriever`, and `Parent Document Retriever`. We will also learn the differences between these retrievers and understand the appropriate situations in which to use each one. By the end of this lab, you will be equipped with the skills to implement and utilize these retrievers in your projects.

In [1]:
!pip install --user "chromadb==0.4.24" | tail -n 1




In [2]:
pip install numpy==1.26.4




In [3]:
pip install -U langchain-community



In [4]:
# You can use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

The following building blocks are prerequisites for this project on retrievers:

- Building LLMs
- Splitting documents into chunks
- Building an embedding model

The relevant knowledge and details of these functions have been covered in previous lessons.


In [5]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

def llm():
    model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # Openly available 7B instruction-tuned model (a separate, smaller model than Mixtral)

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Use pipeline for easier generation
    text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

    def generate(prompt):
        output = text_generator(
            prompt,
            max_new_tokens=256,
            temperature=0.5,
            do_sample=True,
            top_p=0.95,
            top_k=50
        )
        return output[0]["generated_text"]

    return generate


### Text Splitter

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def text_splitter(data, chunk_size, chunk_overlap):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
    )
    chunks = text_splitter.split_documents(data)
    return chunks

### Embedding Model



In [7]:
from langchain_community.embeddings import HuggingFaceEmbeddings

def huggingface_embedding():
    model_name = "sentence-transformers/all-MiniLM-L6-v2"  # Fast, light, and effective

    embedding = HuggingFaceEmbeddings(
        model_name=model_name,
        model_kwargs={"device": "cpu"}  # or "cuda" if GPU is available
    )
    return embedding


## Retrievers

A retriever is an interface designed to return documents based on an unstructured query. Unlike a vector store, a retriever does not need to store documents; it only needs to find and return the relevant ones. While vector stores can serve as the backbone of a retriever, there are various other types of retrievers that can be used as well.

Retrievers take a string `query` as input and return a list of `Document` objects.
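To make the "string in, list of documents out" contract concrete, here is a minimal sketch in plain Python. The `Document` dataclass and `KeywordRetriever` class are hypothetical stand-ins written for illustration; they are not LangChain classes, and the word-overlap scoring is only an assumption chosen to keep the example dependency-free.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Toy stand-in for a document object (hypothetical, illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class KeywordRetriever:
    """Hypothetical retriever: ranks documents by shared query words."""

    def __init__(self, documents, k=2):
        self.documents = documents
        self.k = k

    def invoke(self, query):
        """Take a query string and return a list of at most k Documents."""
        words = set(query.lower().split())
        scored = []
        for doc in self.documents:
            overlap = len(words & set(doc.page_content.lower().split()))
            if overlap > 0:
                scored.append((overlap, doc))
        # Highest overlap first; the sort is stable, so ties keep input order
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[: self.k]]


toy_docs = [
    Document("Internet and Email Policy", {"source": "toy"}),
    Document("Code of Conduct", {"source": "toy"}),
    Document("Email must be encrypted", {"source": "toy"}),
]
toy_retriever = KeywordRetriever(toy_docs, k=2)
toy_results = toy_retriever.invoke("email policy")
print([doc.page_content for doc in toy_results])
```

Any object that exposes the same query-to-documents behavior can play the role of a retriever, regardless of how it finds the documents internally.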

### Vector Store-Backed Retriever

A vector store retriever is a type of retriever that uses a vector store to fetch documents. It acts as a lightweight wrapper around the vector store class, enabling it to conform to the retriever interface. This retriever relies on the search methods implemented by the vector store, such as similarity search and Maximal Marginal Relevance (MMR), to query the texts stored within it.



In [8]:
!wget "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/MZ9z1lm-Ui3YBp3SYWLTAQ/companypolicies.txt"




In [9]:
from langchain_community.document_loaders import TextLoader
loader = TextLoader("companypolicies.txt")
txt_data = loader.load()

In [10]:
txt_data



Split `txt_data` into chunks with `chunk_size = 200` and `chunk_overlap = 20`.


In [11]:
chunks_txt = text_splitter(txt_data, 200, 20)
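To illustrate what `chunk_size` and `chunk_overlap` control, the sketch below uses a plain fixed-width sliding window. This is a deliberate simplification: `RecursiveCharacterTextSplitter` also tries to break on natural separators, which this sketch ignores. The function name `sliding_window_chunks` and the toy text are hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share
    chunk_overlap characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_window_chunks(toy_text, chunk_size=10, chunk_overlap=3)
print(toy_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, which is the same idea that keeps a sentence fragment shared between neighboring chunks when a document is split.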

Store the chunk embeddings in a Chroma vector database.


In [12]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Step 1: Define the embedding model
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"}  # change to "cuda" if GPU available
)

# Step 2: Create Chroma vectorstore
vectordb = Chroma.from_documents(chunks_txt, embedding_model)


In [29]:
chunks_txt

[Document(metadata={'source': 'companypolicies.txt'}, page_content='1.\tCode of Conduct'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Our Code of Conduct outlines the fundamental principles and ethical standards that guide every member of our organization. We are committed to maintaining a workplace that is built on integrity,'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='built on integrity, respect, and accountability.'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Integrity: We hold ourselves to the highest ethical standards. This means acting honestly and transparently in all our interactions, whether with colleagues, clients, or the broader community. We'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='community. We respect and protect sensitive information, and we avoid conflicts of interest.'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content="Respect: We embrace dive

#### Simple similarity search

Here is an example of a simple similarity search based on the vector database.

For this demonstration, the query has been set to "email policy".


In [14]:
query = "email policy"
retriever = vectordb.as_retriever()
docs = retriever.invoke(query)
docs

[Document(metadata={'source': 'companypolicies.txt'}, page_content='3.\tInternet and Email Policy'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Our Internet and Email Policy aims to promote safe, responsible usage of digital communication tools that align with our values and legal obligations. Each employee is expected to understand and'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Our Internet and Email Policy is established to guide the responsible and secure use of these essential tools within our organization. We recognize their significance in daily business operations and'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Confidentiality: Reserve email for the transmission of confidential information, trade secrets, and sensitive customer data only when encryption is applied. Exercise discretion when discussing')]

You can also pass `search_kwargs`, such as `k`, to limit the number of retrieved results.


In [15]:
retriever = vectordb.as_retriever(search_kwargs={"k": 1})
docs = retriever.invoke(query)
docs

[Document(metadata={'source': 'companypolicies.txt'}, page_content='3.\tInternet and Email Policy')]
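Conceptually, a similarity search scores every stored vector against the query vector and keeps the `k` best matches. The sketch below shows that idea with tiny hand-written vectors. Cosine similarity is an assumption made for illustration (the actual metric depends on how the vector store is configured), and `cosine_similarity`, `top_k`, and the toy vectors are hypothetical names and data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to the query."""
    scores = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scores[:k]]


# Toy 2-dimensional "embeddings"; real embeddings have hundreds of dimensions
toy_vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
toy_query = [1.0, 0.1]

print(top_k(toy_query, toy_vectors, k=2))  # [0, 1]
print(top_k(toy_query, toy_vectors, k=1))  # [0]
```

Lowering `k` simply truncates the ranked list, which is why the `k = 1` result above is the first entry of the longer result.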

#### MMR retrieval

MMR in vector stores is a technique used to balance the relevance and diversity of retrieved results. It selects documents that are both highly relevant to the query and minimally similar to previously selected documents. This approach helps to avoid redundancy and ensures a more comprehensive coverage of different aspects of the query.

The following code shows how to run an MMR search against a vector database. You only need to specify `search_type="mmr"`.


In [16]:
retriever = vectordb.as_retriever(search_type="mmr")
docs = retriever.invoke(query)
docs

[Document(metadata={'source': 'companypolicies.txt'}, page_content='3.\tInternet and Email Policy'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Confidentiality: Reserve email for the transmission of confidential information, trade secrets, and sensitive customer data only when encryption is applied. Exercise discretion when discussing'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Review of Policy: This policy will be reviewed periodically to ensure its alignment with evolving legal requirements and best practices for maintaining a healthy and safe workplace.'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='individual found to be in violation of this policy.')]
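The trade-off between relevance and diversity can be sketched as a greedy selection loop. The version below follows one common formulation of MMR, where each candidate is scored as `lambda_mult * relevance - (1 - lambda_mult) * redundancy`. It is an illustrative sketch, not the code the vector store runs; the function names, the parameter name `lambda_mult`, and the toy vectors are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily pick k indices, balancing relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine_similarity(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already selected vector
            redundancy = max(
                (cosine_similarity(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Vectors 0 and 1 are near-duplicates; vector 2 points in a different direction
toy_vectors = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0]]
toy_query = [0.9, 0.45]

balanced = mmr_select(toy_query, toy_vectors, k=2, lambda_mult=0.5)
relevance_only = mmr_select(toy_query, toy_vectors, k=2, lambda_mult=1.0)
print(balanced)        # [1, 2]
print(relevance_only)  # [1, 0]
```

With `lambda_mult=1.0` the loop reduces to a plain similarity ranking and returns both near-duplicates. With `lambda_mult=0.5` the second pick is the less similar but non-redundant vector, which mirrors how the MMR output above contains more varied chunks than the plain similarity search.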

#### Similarity score threshold retrieval

You can also set a retrieval method that defines a similarity score threshold, returning only documents with a score above that threshold.


In [17]:
retriever = vectordb.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.4}
)
docs = retriever.invoke(query)
docs

[Document(metadata={'source': 'companypolicies.txt'}, page_content='3.\tInternet and Email Policy'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Our Internet and Email Policy aims to promote safe, responsible usage of digital communication tools that align with our values and legal obligations. Each employee is expected to understand and'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Our Internet and Email Policy is established to guide the responsible and secure use of these essential tools within our organization. We recognize their significance in daily business operations and'),
 Document(metadata={'source': 'companypolicies.txt'}, page_content='Confidentiality: Reserve email for the transmission of confidential information, trade secrets, and sensitive customer data only when encryption is applied. Exercise discretion when discussing')]
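Threshold-based retrieval can be pictured as a filter applied to scored results. In the sketch below, the scores are made-up numbers for illustration and are not the scores Chroma computed for this query. The helper name `filter_by_score` is hypothetical, and treating a score exactly equal to the threshold as a match is an assumption.

```python
def filter_by_score(scored_docs, score_threshold):
    """Keep (text, score) pairs at or above the threshold, best first."""
    kept = [(text, score) for text, score in scored_docs if score >= score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up relevance scores in the range 0 to 1 (higher means more similar)
toy_scored = [
    ("Internet and Email Policy", 0.71),
    ("Code of Conduct", 0.22),
    ("Confidentiality", 0.45),
]

print(filter_by_score(toy_scored, score_threshold=0.4))
```

Unlike `k`, a threshold does not fix the number of results: a strict threshold may return nothing, while a loose one may return every document.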

### Multi-Query Retriever

Distance-based vector database retrieval represents queries in high-dimensional space and finds similar embedded documents based on "distance". However, retrieval results may vary with subtle changes in query wording or if the embeddings do not accurately capture the data's semantics.

The `MultiQueryRetriever` addresses this by using an LLM to generate multiple queries from different perspectives for a given user input query. For each query, it retrieves a set of relevant documents and then takes the unique union of these results to form a larger set of potentially relevant documents. By generating multiple perspectives on the same question, the `MultiQueryRetriever` can potentially overcome some limitations of distance-based retrieval, resulting in a richer and more diverse set of results.
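The "unique union" step described above can be sketched without an LLM or a vector store. In the example below, `generate_queries` and `retrieve` are hypothetical stubs: a real setup would call an LLM to rephrase the question and a vector store retriever to fetch documents. Only the merging logic is the point of the sketch.

```python
def unique_union(result_lists):
    """Merge several result lists, dropping duplicates and keeping first-seen order."""
    seen = set()
    merged = []
    for results in result_lists:
        for text in results:
            if text not in seen:
                seen.add(text)
                merged.append(text)
    return merged


def multi_query_retrieve(question, generate_queries, retrieve):
    """Retrieve for every generated query variant and merge the results."""
    queries = generate_queries(question)
    return unique_union([retrieve(query) for query in queries])


# Hypothetical stand-in for an LLM that rewrites the question
def generate_queries(question):
    return [question, "What is LangChain used for?", "Which tasks does LangChain support?"]


# Hypothetical stand-in for a vector store retriever
toy_index = {
    "What does the paper say about LangChain?": ["chunk A", "chunk B"],
    "What is LangChain used for?": ["chunk B", "chunk C"],
    "Which tasks does LangChain support?": ["chunk C", "chunk D"],
}


def retrieve(query):
    return toy_index.get(query, [])


merged = multi_query_retrieve("What does the paper say about LangChain?", generate_queries, retrieve)
print(merged)  # ['chunk A', 'chunk B', 'chunk C', 'chunk D']
```

A single query would have returned only two chunks here; the rephrased queries surface two more, and the overlap between result sets is removed.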

A PDF document has been prepared to demonstrate this Multi-Query Retriever.


In [19]:
pip install pypdf


In [25]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/ioch1wsxkfqgfLLgmd-6Rw/langchain-paper.pdf")
pdf_data = loader.load()
pdf_data[1]

Document(metadata={'producer': 'PyPDF', 'creator': 'Microsoft Word', 'creationdate': '2023-12-31T03:50:13+00:00', 'author': 'IEEE', 'moddate': '2023-12-31T03:52:06+00:00', 'title': 's8329 final', 'source': 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/ioch1wsxkfqgfLLgmd-6Rw/langchain-paper.pdf', 'total_pages': 6, 'page': 1, 'page_label': '2'}, page_content='LangChain helps us to unlock the ability to harness the \nLLM’s immense potential in tasks such as document analysis, \nchatbot development, code analysis, and countless other \napplications. Whether your desire is to unlock deeper natural \nlanguage understanding , enhance data, or circumvent \nlanguage barriers through translation, LangChain is ready to \nprovide the tools and programming support you need to do \nwithout it that it is not only difficult but also fresh for you. Its \ncore functionalities encompass: \n1. Context-Aware Capabilities: LangChain facilitates the \ndevelopment of applications that ar

Split the document and store the embeddings in a vector database.


In [26]:
vectordb.get()["ids"]

[]

In [23]:
# Split
chunks_pdf = text_splitter(pdf_data, 500, 20)

# VectorDB
ids = vectordb.get()["ids"]
# Calling delete() with an empty list of ids fails with
# "ValueError: You must provide either ids, where, or where_document to delete",
# which happens when this cell is re-run on an already empty collection.
# Only delete when there is something to remove.
if ids:
    vectordb.delete(ids=ids)
vectordb = Chroma.from_documents(documents=chunks_pdf, embedding=embedding_model)