# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up QDrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [7]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [8]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the structured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [21]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path="./data/complaints.csv",
    metadata_columns=[
      "Date received", 
      "Product", 
      "Sub-product", 
      "Issue", 
      "Sub-issue", 
      "Consumer complaint narrative", 
      "Company public response", 
      "Company", 
      "State", 
      "ZIP code", 
      "Tags", 
      "Consumer consent provided?", 
      "Submitted via", 
      "Date sent to company", 
      "Company response to consumer", 
      "Timely response?", 
      "Consumer disputed?", 
      "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

Let's look at an example document to see if everything worked as expected!

In [22]:
loan_complaint_data[0]

Document(metadata={'source': './data/complaints.csv', 'row': 0, 'Date received': '03/27/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Trouble with how payments are being handled', 'Consumer complaint narrative': "The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new payment amount that is effective starting with the XX/XX/XXXX payment will nearly double my payment from {$180.00} per month to {$360.00} per month. I'm fortunate that my current financial position allows me to be able to handle the increased payment amount, but I am sure there are likely many borrowers who are not in the same position. The re-amortization should have occurred once the forbearance ended to reduce the impact to borrowers.", 'Company public response': 'None', 'Company'

## Task 3: Setting up QDrant!

Now that we have our documents, let's create a QDrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [23]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_complaint_data,
    embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply treat each complaint as a document, and use cosine similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [24]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})
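
To make the "cosine similarity, top `k`" idea concrete, here is a minimal pure-Python sketch. It is not how QDrant implements search; the helper names (`cosine_similarity`, `top_k`) and the toy 2-d vectors are made up for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy 2-d "embeddings" (hypothetical values, not real model output).
toy_docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
toy_query = [0.9, 0.1]
nearest = top_k(toy_query, toy_docs, k=2)
```

Here `nearest` holds the indices of the two toy vectors that point in the direction closest to the query. A real vector store does the same kind of ranking over 1536-dimensional embeddings, usually with an approximate index rather than a full scan.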

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [25]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [26]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [27]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [10]:
naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided data, the most common issues with loans tend to revolve around problems with how payments are being handled and dealing with lenders or servicers. Specific issues frequently include:\n\n- Errors in loan balances, misapplied payments, or wrongful denials of payment plans.\n- Trouble with how payments are applied, such as additional funds only being directed to interest, or difficulty in paying down the principal.\n- Discrepancies in account status, such as incorrect reporting of delinquencies or statuses being changed without proper notification.\n- Issues with loan transfer and lack of proper notification about changes in servicers.\n- Incorrect or bad information about loan details, including balances, interest rates, or account status.\n- Disputes over fees or account inaccuracies, and mishandling of personal information.\n\nIn summary, the most common issue appears to be problems related to the handling and processing of payments and account information by loa

In [11]:
naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Yes, some complaints did not get handled in a timely manner. Specifically, there are at least two complaints where the response time was marked as "No" for being timely:\n\n1. Complaint from 03/28/25 (Complaint ID: 12709087) submitted to MOHELA, where the response was not timely.\n2. Complaint from 04/21/25 (Complaint ID: 13091395) submitted to Maximus Federal Services, which was handled in a timely manner.\n\nAdditionally, there are instances where consumers reported that their issues remained unresolved for extended periods, such as over a year or nearly 18 months, indicating delays or lack of resolution.\n\nIf you need detailed information on specific complaints, I can provide that.'

In [12]:
naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for several reasons, including:\n\n1. **Interest Accumulation and Unmanageable Debt:** Many borrowers reported that interest continued to accrue even during deferment or forbearance periods, which increased total debt and made repayment more difficult over time.\n\n2. **Financial Hardship and Income Insufficiency:** Borrowers faced financial difficulties, low wages, or stagnant incomes, making it difficult to afford minimum payments without sacrificing basic needs.\n\n3. **Lack of Clear or Adequate Communication:** Some borrowers were not properly informed about repayment schedules, loan transfer processes, or their account status, leading to missed payments or delinquency.\n\n4. **Systemic Issues and Loan Management Problems:** Issues such as unresponsive servicers, incorrect reporting of delinquency, or loan mismanagement (including unauthorized transfers and errors in account information) contributed to repayment difficulties.\n\n5. **Inadequat

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model), which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.
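
As a rough illustration of that word-overlap scoring, here is a small pure-Python sketch of one common BM25 variant. The whitespace tokenizer, the `k1` and `b` defaults, and the smoothed IDF formula are assumptions of this sketch and may differ from what `BM25Retriever` uses internally.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a simple BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # only exact term matches contribute
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores

# Hypothetical mini-corpus, for illustration only.
toy_corpus = [
    "my servicer violated FERPA when sharing my records",
    "my payment was applied to interest instead of principal",
    "the servicer never answered my calls",
]
toy_scores = bm25_scores("FERPA violation", toy_corpus)
```

Only the first toy document gets a non-zero score, and only because of the exact token `ferpa`. The query word "violation" does not match "violated", which shows both the strength (exact keywords) and the weakness (no notion of meaning) of a sparse method.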

This retriever is very straightforward to set up! Let's see it happen down below!


In [28]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(loan_complaint_data)

We'll construct the same chain - only changing the retriever.

In [29]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [15]:
bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issue with loans appears to be problems related to dealing with lenders or servicers, such as disputes over fees, difficulties with payment application, and receiving inaccurate or bad information about the loan. Multiple complaints highlight issues like not agreeing with charged fees, trouble with how payments are processed (particularly regarding interest and principal), and receiving incorrect or confusing loan information.\n\nTherefore, the most common issue with loans, as reflected in the complaints, is related to challenges in dealing with lenders or servicers, including disputes over fees, payment processing, and incorrect information.'

In [16]:
bm25_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, all the complaints listed in the context received responses marked as "Closed with explanation," and the responses were indicated as "Yes" for being timely. There is no record of any complaints that were not handled in a timely manner. Therefore, the answer is:\n\nNo, there are no complaints that did not get handled in a timely manner.'

In [17]:
bm25_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including being misled or improperly guided on repayment options, lack of communication from lenders or servicers, technical issues with payments, or difficulties obtaining the correct information about their loan status. For example, some borrowers experienced problems with payment reversals or being incorrectly classified as overdue, while others were not properly notified about transfers or changes in their loan account or repayment plans. Additionally, some faced issues with forbearances and deferments, either not receiving responses or having their applications ignored. These issues often led to negative impacts on credit scores and feelings of deception or unfair treatment. Overall, inadequate communication, administrative errors, and improper handling of loan information contributed to the difficulty borrowers faced in repaying their loans.'

It's not clear whether this is better or worse. If only we had a way to test this! (SPOILERS: We do, and the second half of the notebook will cover it.)

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

</div>

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

#### Answer:

- Example query: "Find any complaints that describe a suspected or alleged violation of FERPA (the Family Educational Rights and Privacy Act), the Privacy Act of 1974, or HIPAA (the Health Insurance Portability and Accountability Act)."
- Best-Matching 25 (BM25) outshines embeddings in scenarios that require precise keyword or exact-term matching, such as the acronyms and statute names above, and where the query's meaning is literal and specific.

</div>

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.
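
The two steps above can be sketched in a few lines of plain Python. The `word_overlap` scorer below is a deliberately crude, hypothetical stand-in for a real reranking model (it just counts shared words), so treat this as an illustration of the retrieve-then-compress shape rather than of how Cohere's Rerank works.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Keep only the top_n candidates, ordered by a relevance score."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

def word_overlap(query, doc):
    """Hypothetical stand-in for a reranking model: shared-word count."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

# Pretend these came back from a first-pass retriever with a large k.
candidates = [
    "interest kept accruing during forbearance",
    "the servicer did not answer my calls",
    "my payment was late because the servicer lost it",
    "late fees were charged after a late payment",
]
top_docs = rerank("late payment fees", candidates, word_overlap, top_n=2)
```

The first pass casts a wide net, and the second pass spends more effort per document to decide which few actually go into the prompt.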

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [30]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [31]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [20]:
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided complaints, the most common issue with loans appears to be problems related to dealing with lenders or servicers, such as errors in loan balances, misapplied payments, wrongful denials of payment plans, and mishandling of account information. A frequent concern includes the accumulation of interest during forbearance or deferment and the lack of clear or accurate information provided to borrowers.'

In [21]:
contextual_compression_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, it appears that at least one complaint (the one related to EdFinancial Services) was handled in a timely manner, since the response was marked as "Yes" for timely response. However, there is also evidence that some complaints have been open for over a year without resolution, such as the complaint about loan account review issues and violations, which have remained unresolved for nearly 18 months.\n\nTherefore, while some complaints were handled promptly, others did not get addressed in a timely manner.'

In [22]:
contextual_compression_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People often failed to pay back their loans due to a combination of factors including lack of proper information about the repayment obligations, unexpected or uncommunicated transfers of loan control, accumulating interest even during forbearance or deferment, and confusion or difficulty in navigating complex loan management systems. Additionally, some borrowers were unaware that they needed to repay their loans at all, or were not adequately notified about payment due dates, which led to missed payments and accumulation of debt. The accumulation of interest, especially when payments are deferred or when only small payments are made, can cause the total debt to grow over time, making it more difficult to pay off. In some cases, borrowers felt misled or lacked transparency regarding their loan balances and interest calculations, further complicating their ability to repay effectively.'

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context
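
Here is a minimal sketch of those three steps with the LLM and the retriever replaced by hypothetical stand-ins (`fake_generate` and `fake_retrieve`), so it runs without any API keys. The real `MultiQueryRetriever` may differ in details such as how it deduplicates documents.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Retrieve for every generated query and keep each document once."""
    seen = set()
    unique_docs = []
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs

# Hypothetical reformulations and their retrieval results.
fake_index = {
    "why were payments missed": ["doc_a", "doc_b"],
    "reasons for loan default": ["doc_b", "doc_c"],
}

def fake_generate(question):
    """Stand-in for the LLM that rewrites the user question."""
    return list(fake_index)

def fake_retrieve(query):
    """Stand-in for the underlying retriever."""
    return fake_index.get(query, [])

merged = multi_query_retrieve(
    "Why did people fail to pay back their loans?", fake_generate, fake_retrieve
)
```

Each reformulation surfaces a slightly different set of documents, and the union (`merged`) is larger than what either query found on its own.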

So, how hard is it to set up? Not bad! Let's see it down below!



In [32]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [33]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [25]:
multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issue with loans appears to be problems related to the handling and management of student loans, including:\n\n- Trouble with how payments are being handled (e.g., inability to apply extra funds to principal, payments applying mostly to interest)\n- Issues with payment plans and difficulty adjusting or understanding payment obligations\n- Problems with loan balances, interest calculation, and misapplied payments\n- Lack of transparency and communication from servicers\n- Errors or discrepancies in loan balance or account information\n- Servicing failures leading to credit report errors or unauthorized reporting\n- Unauthorized transfers or changes in loan management without proper notice\n- Difficulties with loan discharge, forgiveness, or discharge disputes\n- Unfair or predatory practices, such as pressure to consolidate or misleading information about loans\n\nIn summary, the most common issue tends to be mismanagement and lack of tran

In [26]:
multi_query_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, there are multiple instances indicating that complaints were not handled in a timely manner. Specifically:\n\n- A complaint regarding the processing of a graduated loan application (Complaint ID: 12709087) was not responded to within the expected timeframe, with the report stating "The original complaint did not address that main issue" and being unresolved after multiple follow-ups.\n- Several complaints, including those related to credit report inaccuracies (Complaint IDs: 13062402 and 13140511), mention that the complaints were not resolved promptly despite being filed weeks or months earlier. For example, one complaint was submitted over 18 months ago with no resolution, and others have been waiting for months without response.\n- A complaint about the complaint not being addressed initially (Complaint ID: 12914633) indicates ongoing delays in response.\n\nOverall, yes, multiple complaints did not get handled in a timely manner, with delays rangin

In [27]:
multi_query_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for several reasons based on the complaints and issues described:\n\n1. **Miscommunication and Lack of Clear Information:** Several borrowers were not properly informed about their repayment obligations, the status of their loans, or changes in the loan servicing process. For example, a borrower was unaware that their auto-debit payments had been suspended during the COVID-19 forbearance period, leading to missed payments.\n\n2. **Errors and Delays in Servicing and Reporting:** Many complaints involved errors such as incorrect account statuses, reports of delinquency without proper notification, or misapplication of payments. For instance, some borrowers were wrongly reported as delinquent or 90 days late, which impacted their credit scores.\n\n3. **Systemic Failures and Systemic Deception:** There are systemic issues like servicers steering borrowers into forbearances instead of income-driven repayment plans, or not informing them of options that

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

</div>

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

#### Answer:

- Generating multiple reformulations of a user query can improve recall by increasing the likelihood that the retriever identifies all relevant documents. This is particularly important because those documents may use different vocabulary, phrasing, or structure compared to the original query.
- A single user query might miss certain documents, as users may use different terms or expressions than those found in the relevant documents.

</div>

## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
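
A toy version of "search small, return big" might look like the sketch below. The sentence-level splitter and the word-overlap scorer are simplifications assumed for this example; the notebook itself uses a `RecursiveCharacterTextSplitter` and embedding similarity.

```python
def split_into_children(text):
    """Naive stand-in for a text splitter: one child chunk per sentence."""
    return [part.strip() for part in text.split(".") if part.strip()]

def word_overlap(query, chunk):
    """Hypothetical stand-in for vector similarity: shared-word count."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def build_small_to_big(parents):
    """Store parents by id and index every child chunk with its parent id."""
    docstore = dict(enumerate(parents))
    child_index = []
    for parent_id, text in docstore.items():
        for child in split_into_children(text):
            child_index.append((child, parent_id))
    return docstore, child_index

def retrieve_parents(query, docstore, child_index, k=1):
    """Search over child chunks, but return their (deduplicated) parents."""
    ranked = sorted(child_index, key=lambda item: word_overlap(query, item[0]), reverse=True)
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[parent_id] for parent_id in parent_ids]

# Hypothetical parent documents.
parents = [
    "My loan was transferred without notice. Later my autopay stopped working.",
    "Interest accrued during forbearance. Nobody explained the new balance.",
]
docstore, child_index = build_small_to_big(parents)
result = retrieve_parents("autopay stopped", docstore, child_index, k=1)
```

The query only matches the second sentence of the first complaint, yet `result` contains the whole complaint, including the sentence about the transfer that gives the match its context.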

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [34]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = loan_complaint_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new QDrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [35]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", embedding=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [36]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [37]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [38]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [33]:
parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided context, appears to involve problems with the handling and management of federal student loans. Specific recurring issues include:\n\n- Struggling to repay the loans due to misinformation, inadequate counseling, or circumstances such as school closure and job prospects.\n- Problems with loan servicing, including improper handling of loan consolidation, lack of disclosure about terms, interest rate increases, and miscommunication.\n- Discrepancies in loan balances, interest rates, and reporting, leading to credit score impacts and confusion.\n- Issues related to loan forgiveness, cancellation, or discharge, particularly when borrowers face hardship or misconduct by institutions.\n\nOverall, many complaints point to systemic breakdowns, mismanagement, lack of transparency, and insufficient guidance from loan servicers and agencies, which complicate repayment and exacerbate financial hardship for borrowers.'

In [34]:
parent_document_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, all of them explicitly mention that responses from the companies were not handled in a timely manner. For example, the complaint with ID 12709087 states the response was "No" for timely response, and multiple other complaints mention long wait times, unanswered calls, or delays in resolution. Therefore, yes, some complaints did not get handled in a timely manner.'

In [35]:
parent_document_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily due to financial hardships caused by circumstances beyond their control, such as inability to secure employment or real career opportunities after attending certain schools, as well as misrepresentations about the value of their education and the manageability of their loans. Additionally, issues with loan servicing, such as improper notification about payment commencement, inability to reevaluate payments during grace periods, or administrative errors like reporting late payments or failed communication, also contributed to difficulties in repayment.'

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to more effectively answer the question about the performance.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes two or more retrievers and combines their retrieved documents based on a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.
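
For intuition, here is a small sketch of weighted Reciprocal Rank Fusion: each document earns `weight / (c + rank)` from every ranked list it appears in, and the totals decide the final order. The constant `c = 60` follows the linked paper; the function name and toy rankings are made up for this example, and the `EnsembleRetriever` internals may differ in detail.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, weights=None, c=60):
    """Fuse several ranked lists of document ids into a single ranking."""
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += weight / (c + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from two retrievers.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

In this toy case `doc_b` comes out on top because it ranks well in both lists, even though it is first in only one of them. Only ranks matter, so retrievers with incomparable score scales (BM25 scores vs. cosine similarities) can be combined without any normalization.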

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [39]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [40]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [38]:
ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

"Based on the provided data, the most common issues with loans tend to revolve around mishandling by servicers, including errors in loan balances, misapplied payments, wrongful denials of repayment plans, and inaccurate or delayed reporting to credit bureaus. Many complaints highlight problems such as:\n\n- Errors in loan balances and interest calculations\n- Misapplication of payments primarily to interest rather than principal\n- Lack of transparency or incorrect information about loan status\n- Failure to process income-driven repayment plans or forgiveness applications\n- Unexpected or unauthorized transfer of loans without proper notification\n- Negative credit reporting without proper notice or opportunity to remedy\n\nOverall, a prevalent theme is the dissatisfaction with loan servicers' handling of accounts, which often leads to financial hardship, credit score impacts, or lack of clarity about loan terms. This suggests that mismanagement and lack of transparency by servicers a

In [39]:
ensemble_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, yes, some complaints indicate that complaints did not get handled in a timely manner. For example:\n\n- Complaint ID 12935889 (from 04/11/25, EdFinancial Services, NJ): The response was "Closed with explanation" and the response to complaint was marked as "No" for timely response.\n- Complaint ID 12654977 (from 03/25/25, MOHELA, MD): The response was "Closed with explanation" and the response was marked as "No" for timely response.\n- Complaint ID 12739706 (from 04/01/25, MOHELA, NJ): Also marked as "No" for timely response.\n- Complaint ID 12973003 (from 04/14/25, EdFinancial Services, TX): Marked as "Yes" for timely response but the complaint indicates ongoing issues with no resolution.\n\nAdditionally, several complaints reference delays or failures to respond promptly, often with wait times of hours or unacknowledged communication attempts.\n\nIn conclusion, yes, some complaints did not get handled in a timely manner, as indicated by the complaint

In [40]:
ensemble_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including:\n\n- Lack of proper information about repayment options and consequences, such as interest accumulation during forbearance or deferment.\n- Financial hardships like unemployment, low income, or personal crises, making it difficult to afford payments.\n- Mismanagement by loan servicers, such as pushing borrowers into long-term forbearances or not informing them about income-driven repayment plans and other assistance programs.\n- Errors or delays in communication, leading to borrowers being unaware of the start or resumption of payments.\n- Unfair or misleading practices by servicers, including misapplied payments, incorrect reporting, or mishandling of account information.\n- Systemic issues like transfer of loans without proper notification, incorrect credit reporting, or loss of documentation.\n- Predatory lending or high-interest rates that cause the debt to balloon over time, even with regular payments.\n- In so

## Task 10: Semantic Chunking

While this is not a retrieval method, it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example, which will:

Calculate the distances between consecutive sentences, then split the text wherever a distance exceeds a given percentile of all distances.
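To make the percentile rule concrete, here is a minimal standalone sketch. It is a simplification, not `SemanticChunker`'s actual implementation: the 2-D vectors are hand-made stand-ins for real embeddings, and the helper names (`cosine_distance`, `percentile`, `percentile_chunks`) are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values, pct):
    """Percentile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    rank = (pct / 100.0) * (len(ordered) - 1)
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def percentile_chunks(sentences, vectors, pct):
    """Split wherever the distance to the next sentence exceeds the percentile."""
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # Large semantic jump: close the current chunk and start a new one.
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy data (assumption): two sentences per "topic", with made-up 2-D embeddings.
sentences = [
    "My payment was misapplied.",
    "The servicer applied it to interest.",
    "My credit report is wrong.",
    "The bureau shows a late payment.",
]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

chunks = percentile_chunks(sentences, vectors, pct=75)
for chunk in chunks:
    print(chunk)
```

With these toy vectors, only the jump between the second and third sentence exceeds the threshold, so the text splits into two chunks of two sentences each.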

In [41]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [42]:
semantic_documents = semantic_chunker.split_documents(loan_complaint_data[:20])

Let's create a new vector store.

In [43]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Loan_Complaint_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [44]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [45]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [46]:
semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be problems related to mismanagement or mishandling by loan servicers, including issues with payment processing, incorrect information, and lack of transparency. Many complaints involve delays, incorrect reporting, and failure to properly handle borrower requests or documentation.'

In [47]:
semantic_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, it appears that several complaints were marked as "Closed with explanation" and responded to in a timely manner ("Yes" under the "Timely response?" field). For example, complaints involving Nelnet and Maximus Federal Services, Inc. indicate that responses were given within the expected timeframe.\n\nHowever, the specific complaint about bad communication, errors, and violations transferred to Nelnet (Complaint ID: 13331376) was also responded to and marked "Closed with explanation." Similarly, complaints about payment issues and disputes regarding correct handling were responded to in a timely manner.\n\nSince all complaints included in the data set indicate timely responses from the companies, there is no evidence from this data that any complaints did not get handled in a timely manner.\n\nTherefore, based on the available information, **it does not appear that any complaints were left unhandled or responded to outside the expected timeframe.**\n\nI

In [48]:
semantic_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans for various reasons indicated in the complaints, including:\n\n1. Lack of clear communication and bad information from lenders or servicers, leading to confusion about loan status or repayment terms.\n2. Delays or issues with re-amortization or adjusting payment plans after periods of forbearance, causing increased payments that some borrowers could not handle.\n3. Problems with the handling and processing of payments, including payment rejection due to bank or technical issues.\n4. Disputes over the legitimacy and verification of debt, especially when loans are reported as in default or unpaid despite evidence of payments.\n5. Administrative errors, misreporting, or inability to access or correct loan information, leading to defaults or negative credit impacts.\n6. Alleged illegal reporting, collection practices, or data breaches, which may have impacted the borrowers' ability to manage or verify their loans.\n   \nOverall, a combination of admin

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

</div>

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

### Answer:
* Semantic chunking can sometimes produce unexpected results. One potential issue is under-chunking, where multiple distinct question-and-answer pairs are grouped into a single large chunk. This tends to happen when the semantic similarity between the end of one answer and the beginning of the next question is high, as they all relate to the same general topic. 

* On the other hand, over-chunking, or fragmentation, can also be problematic. The chunker splits text wherever it detects a topic change, but in short, repetitive text such as FAQs, nearly every sentence can look like one. This can create tiny chunks, sometimes as small as one sentence, which hurts retrieval because the embeddings have less context to work with. Additionally, if many chunks contain nearly identical sentences, storage and compute are wasted on duplicate vectors.
</div>
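One possible adjustment for the fragmentation case is a post-processing pass that merges undersized chunks into their neighbours. The sketch below is only an illustration: `merge_small_chunks` and the `min_chars` value are hypothetical and not part of LangChain. For the under-chunking case, a lower `breakpoint_threshold_amount` on `SemanticChunker` would be the first thing to try.

```python
def merge_small_chunks(chunks, min_chars=200):
    """Greedily merge consecutive chunks until each has at least min_chars characters."""
    merged = []
    buffer = ""
    for chunk in chunks:
        buffer = f"{buffer} {chunk}".strip()
        if len(buffer) >= min_chars:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # Leftover text that never reached min_chars: attach it to the last chunk.
        if merged:
            merged[-1] = f"{merged[-1]} {buffer}"
        else:
            merged.append(buffer)
    return merged


# Toy FAQ fragments (assumption): each sentence ended up as its own chunk.
faq_chunks = [
    "Q: How do I reset my password?",
    "A: Use the reset link.",
    "Q: How do I change my email?",
    "A: Open account settings.",
]

merged_chunks = merge_small_chunks(faq_chunks, min_chars=40)
for chunk in merged_chunks:
    print(chunk)
```

In this toy case the threshold happens to reunite each question with its answer. On real FAQ data, splitting on the question/answer structure directly may be more reliable than a character count.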

# 🤝 Breakout Room Part #2

<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other. 
You can use the loans or bills dataset.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

</div>

##### HINTS:

- LangSmith provides detailed information about latency and cost.
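If you want a quick local latency number alongside the LangSmith traces, a rough wall-clock timer is enough. This is a sketch under assumptions: `time_retriever` is a hypothetical helper, and `DummyRetriever` stands in for any object with an `.invoke(query)` method, such as the retrievers built earlier.

```python
import statistics
import time


def time_retriever(retriever, questions, repeats=3):
    """Measure wall-clock latency of retriever.invoke over a list of questions."""
    timings = []
    for _ in range(repeats):
        for question in questions:
            start = time.perf_counter()
            retriever.invoke(question)
            timings.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(timings),
        "max_s": max(timings),
        "calls": len(timings),
    }


class DummyRetriever:
    """Stand-in retriever (assumption) so the sketch runs without any API keys."""

    def invoke(self, query):
        return [query.upper()]


latency = time_retriever(DummyRetriever(), ["question a", "question b"], repeats=2)
print(latency)
```

Local timings include network variance for API-backed retrievers, so treat them as a sanity check next to the per-run figures in LangSmith.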

## Create "Golden Dataset" with Synthetic Data Generation

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

In [4]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader


path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

In [9]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [12]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs[:20], testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/17 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/20 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/31 [00:00<?, ?it/s]

Property 'summary' already exists in node 'f2358d'. Skipping!
Property 'summary' already exists in node '1ea2d9'. Skipping!
Property 'summary' already exists in node 'e03cd6'. Skipping!
Property 'summary' already exists in node 'e8a962'. Skipping!
Property 'summary' already exists in node '856b3e'. Skipping!
Property 'summary' already exists in node '65e8ed'. Skipping!
Property 'summary' already exists in node '9f8107'. Skipping!
Property 'summary' already exists in node '9ffa4b'. Skipping!
Property 'summary' already exists in node '24ac72'. Skipping!
Property 'summary' already exists in node '3594e0'. Skipping!
Property 'summary' already exists in node 'b93910'. Skipping!
Property 'summary' already exists in node 'b41dba'. Skipping!
Property 'summary' already exists in node 'b74262'. Skipping!
Property 'summary' already exists in node '98f0ba'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/6 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/41 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node '9f8107'. Skipping!
Property 'summary_embedding' already exists in node 'b41dba'. Skipping!
Property 'summary_embedding' already exists in node '65e8ed'. Skipping!
Property 'summary_embedding' already exists in node 'b93910'. Skipping!
Property 'summary_embedding' already exists in node 'e8a962'. Skipping!
Property 'summary_embedding' already exists in node 'b74262'. Skipping!
Property 'summary_embedding' already exists in node 'f2358d'. Skipping!
Property 'summary_embedding' already exists in node 'e03cd6'. Skipping!
Property 'summary_embedding' already exists in node '24ac72'. Skipping!
Property 'summary_embedding' already exists in node '9ffa4b'. Skipping!
Property 'summary_embedding' already exists in node '1ea2d9'. Skipping!
Property 'summary_embedding' already exists in node '856b3e'. Skipping!
Property 'summary_embedding' already exists in node '98f0ba'. Skipping!
Property 'summary_embedding' already exists in node '3594e0'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [13]:
dataset.to_pandas()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What is the academic year and how does it work... | [Chapter 1 Academic Years, Academic Calendars,... | Every eligible program must have a defined aca... | single_hop_specifc_query_synthesizer |
| 1 | What is a practicum in the context of academic... | [Inclusion of Clinical Work in a Standard Term... | Inclusion of clinical work in a standard term ... | single_hop_specifc_query_synthesizer |
| 2 | What is the significance of Title IV in relati... | [Non-Term Characteristics A program that measu... | The payment period is applicable to all Title ... | single_hop_specifc_query_synthesizer |
| 3 | What is a Pell Grant? | [both the credit or clock hours and the weeks ... | The context does not explicitly define Pell Gr... | single_hop_specifc_query_synthesizer |
| 4 | How does accelerated progression impact disbur... | [<1-hop>\n\nboth the credit or clock hours and... | Accelerated progression in clock-hour or non-t... | multi_hop_abstract_query_synthesizer |
| 5 | How does credit hour allocation for clinical w... | [<1-hop>\n\nInclusion of Clinical Work in a St... | Credit hours associated with clinical work mus... | multi_hop_abstract_query_synthesizer |
| 6 | How do the requirements for disbursement of Ti... | [<1-hop>\n\nboth the credit or clock hours and... | The requirements for disbursement of Title IV ... | multi_hop_abstract_query_synthesizer |
| 7 | How do separate academic years for different p... | [<1-hop>\n\nChapter 1 Academic Years, Academic... | The context indicates that a school may define... | multi_hop_abstract_query_synthesizer |
| 8 | How do the disbursement timing rules in Volume... | [<1-hop>\n\nDisbursement Timing in Subscriptio... | The disbursement timing rules in Volume 2 spec... | multi_hop_specific_query_synthesizer |
| 9 | How do the disbursement timing rules in Volume... | [<1-hop>\n\nDisbursement Timing in Subscriptio... | The disbursement timing rules in Volume 2 spec... | multi_hop_specific_query_synthesizer |


In [14]:
from langsmith import Client

client = Client()

dataset_name = "Loans Dataset"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Loans dataset"
)

In [15]:
for data_row in dataset.to_pandas().iterrows():
  client.create_example(
      inputs={
          "question": data_row[1]["user_input"]
      },
      outputs={
          "answer": data_row[1]["reference"]
      },
      metadata={
          "context": data_row[1]["reference_contexts"]
      },
      dataset_id=langsmith_dataset.id
  )

In [16]:
eval_llm = ChatOpenAI(model="gpt-4.1")

## Evaluate Each Retriever With Various Metrics

In [49]:
def evaluate_retriever_ragas(retriever, dataset_df, retriever_name):
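    """
    Retrieve contexts for each question with `retriever` and score them with Ragas.
    Returns the Ragas result, or None if evaluation fails.
    """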
    from ragas import evaluate
    from ragas.metrics import (
        context_recall,
        context_precision,
        faithfulness,
        answer_relevancy
    )

    from datasets import Dataset
    import nest_asyncio
    
    nest_asyncio.apply()
    
    print(f"Evaluating {retriever_name}...")
    
    # Prepare data
    questions = dataset_df['user_input'].tolist()
    ground_truths = dataset_df['reference'].tolist()
    
    # Get retrieved contexts and generate answers
    retrieved_contexts = []
    generated_answers = []
    
    for question, ground_truth in zip(questions, ground_truths):
        try:
            docs = retriever.invoke(question)
            contexts = [doc.page_content for doc in docs]
            retrieved_contexts.append(contexts)
            
            # NOTE: no answer is generated here; the reference answer is reused as the "answer",
            # so faithfulness and answer_relevancy do not score a model response.
            generated_answers.append(ground_truth)
            
        except Exception as e:
            print(f"Error processing question: {e}")
            retrieved_contexts.append([])
            generated_answers.append("")
    
    # Create dataset for evaluation
    eval_dataset = Dataset.from_dict({
        "question": questions,
        "answer": generated_answers,
        "contexts": retrieved_contexts,
        "ground_truth": ground_truths,
    })
    
    # Define metrics
    metrics = [
        context_recall,
        context_precision,
        faithfulness,
        answer_relevancy
    ]
    
    try:
        # Evaluate
        result = evaluate(eval_dataset, metrics)
        
        print(f"\n=== {retriever_name} Results ===")
        print(result)
        
        return result
        
    except Exception as e:
        print(f"Error in evaluation: {e}")
        return None

print("Evaluating retrievers...")
test_result = evaluate_retriever_ragas(
    naive_retriever, 
    dataset.to_pandas().head(5),  # Test with just 5 samples
    "Naive Retriever"
)

test_result = evaluate_retriever_ragas(
    bm25_retriever,
    dataset.to_pandas().head(5),  # Test with just 5 samples
    "BM25 Retriever"
)

test_result = evaluate_retriever_ragas(
    compression_retriever,
    dataset.to_pandas().head(5),  # Test with just 5 samples
    "Compression Retriever"
)

test_result = evaluate_retriever_ragas(
    multi_query_retriever,
    dataset.to_pandas().head(5),  # Test with just 5 samples
    "Multi-Query Retriever"
)

test_result = evaluate_retriever_ragas(
    parent_document_retriever,
    dataset.to_pandas().head(5),  # Test with just 5 samples
    "Parent Document Retriever"
)

test_result = evaluate_retriever_ragas(
    ensemble_retriever,
    dataset.to_pandas().head(5),  # Test with just 5 samples
    "Ensemble Retriever"
)

Evaluating retrievers...
Evaluating Naive Retriever...


Evaluating:   0%|          | 0/20 [00:00<?, ?it/s]


=== Naive Retriever Results ===
{'context_recall': 0.8500, 'context_precision': 0.2394, 'faithfulness': 0.1200, 'answer_relevancy': 0.7066}
Evaluating BM25 Retriever...


Evaluating:   0%|          | 0/20 [00:00<?, ?it/s]


=== BM25 Retriever Results ===
{'context_recall': 0.0000, 'context_precision': 0.0833, 'faithfulness': 0.0400, 'answer_relevancy': 0.7067}
Evaluating Compression Retriever...


Evaluating:   0%|          | 0/20 [00:00<?, ?it/s]


=== Compression Retriever Results ===
{'context_recall': 0.3000, 'context_precision': 0.3667, 'faithfulness': 0.1200, 'answer_relevancy': 0.7195}
Evaluating Multi-Query Retriever...


Evaluating:   0%|          | 0/20 [00:00<?, ?it/s]


=== Multi-Query Retriever Results ===
{'context_recall': 0.9200, 'context_precision': 0.2739, 'faithfulness': 0.0400, 'answer_relevancy': 0.7105}
Evaluating Parent Document Retriever...


Evaluating:   0%|          | 0/20 [00:00<?, ?it/s]


=== Parent Document Retriever Results ===
{'context_recall': 0.4000, 'context_precision': 0.2500, 'faithfulness': 0.0400, 'answer_relevancy': 0.7194}
Evaluating Ensemble Retriever...


Evaluating:   0%|          | 0/20 [00:00<?, ?it/s]


=== Ensemble Retriever Results ===
{'context_recall': 0.9200, 'context_precision': 0.4133, 'faithfulness': 0.1800, 'answer_relevancy': 0.7152}
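To compile the scores for comparison, the printed results can be collected into one structure and the top retriever(s) listed per metric. This is a minimal sketch: the numbers are copied from the outputs above, while the `results` dict and the `top_by_metric` helper are hypothetical names introduced for illustration.

```python
# Scores copied from the evaluation output above (5 test samples per retriever).
results = {
    "Naive": {"context_recall": 0.8500, "context_precision": 0.2394,
              "faithfulness": 0.1200, "answer_relevancy": 0.7066},
    "BM25": {"context_recall": 0.0000, "context_precision": 0.0833,
             "faithfulness": 0.0400, "answer_relevancy": 0.7067},
    "Compression": {"context_recall": 0.3000, "context_precision": 0.3667,
                    "faithfulness": 0.1200, "answer_relevancy": 0.7195},
    "Multi-Query": {"context_recall": 0.9200, "context_precision": 0.2739,
                    "faithfulness": 0.0400, "answer_relevancy": 0.7105},
    "Parent Document": {"context_recall": 0.4000, "context_precision": 0.2500,
                        "faithfulness": 0.0400, "answer_relevancy": 0.7194},
    "Ensemble": {"context_recall": 0.9200, "context_precision": 0.4133,
                 "faithfulness": 0.1800, "answer_relevancy": 0.7152},
}

metrics = ["context_recall", "context_precision", "faithfulness", "answer_relevancy"]


def top_by_metric(scores, metric):
    """Return every retriever that ties for the highest value of `metric`."""
    best = max(row[metric] for row in scores.values())
    return [name for name, row in scores.items() if row[metric] == best]


top = {metric: top_by_metric(results, metric) for metric in metrics}

# Print a simple aligned comparison table, then the leaders per metric.
print(f"{'retriever':<16}" + "".join(f"{m:>19}" for m in metrics))
for name, row in results.items():
    print(f"{name:<16}" + "".join(f"{row[m]:>19.4f}" for m in metrics))
for metric, names in top.items():
    print(f"best {metric}: {', '.join(names)}")
```

With only five samples per retriever, small differences between scores should be read as indicative rather than conclusive.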


<div style="background-color: #204B8E; color: white; padding: 10px; border-radius: 5px;">

### Analysis & Observations:

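* The multi-query and ensemble retrievers excel in context recall (0.92 each), while the ensemble retriever achieved the highest context precision (0.4133) and BM25 the lowest (0.0833).
* In terms of faithfulness, the ensemble retriever scored the highest (0.18), followed by the naive and compression retrievers (0.12 each), while there is no significant difference among the retrievers in answer relevancy (roughly 0.71 to 0.72). Because the evaluation code reuses the reference answer as the answer, these two metrics do not reflect a generated response.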
* The ensemble retriever has the highest latency and cost compared to other options.

#### LangSmith Trace:

![LangSmith Trace](LangSmith-trace.png)

</div>