# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up QDrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [30]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [31]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the structured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [32]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path="./data/complaints.csv",
    metadata_columns=[
      "Date received", 
      "Product", 
      "Sub-product", 
      "Issue", 
      "Sub-issue", 
      "Consumer complaint narrative", 
      "Company public response", 
      "Company", 
      "State", 
      "ZIP code", 
      "Tags", 
      "Consumer consent provided?", 
      "Submitted via", 
      "Date sent to company", 
      "Company response to consumer", 
      "Timely response?", 
      "Consumer disputed?", 
      "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

Let's look at an example document to see if everything worked as expected!

In [33]:
loan_complaint_data[0]

Document(metadata={'source': './data/complaints.csv', 'row': 0, 'Date received': '03/27/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Trouble with how payments are being handled', 'Consumer complaint narrative': "The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new payment amount that is effective starting with the XX/XX/XXXX payment will nearly double my payment from {$180.00} per month to {$360.00} per month. I'm fortunate that my current financial position allows me to be able to handle the increased payment amount, but I am sure there are likely many borrowers who are not in the same position. The re-amortization should have occurred once the forbearance ended to reduce the impact to borrowers.", 'Company public response': 'None', 'Company'

## Task 3: Setting up QDrant!

Now that we have our documents, let's create a QDrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [34]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_complaint_data,
    embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply treat each complaint as a document, and use cosine similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [35]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})
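
To make "cosine similarity, top `k`" concrete, here is a minimal pure-Python sketch of the idea. It is not how Qdrant implements search; the helper names (`cosine_similarity`, `top_k`) and the toy 3-dimensional vectors are hypothetical and only stand in for real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy "embeddings" (hypothetical values, for illustration only).
toy_doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vec = [1.0, 0.05, 0.0]

print(top_k(toy_query_vec, toy_doc_vecs, k=2))  # [0, 1]
```

In the notebook, the vector store does this ranking for us over the real 1536-dimensional embeddings, with `k` set through `search_kwargs`.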

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [36]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [37]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [38]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # RunnablePassthrough.assign passes the previous step's dict through unchanged;
    # "context" is re-assigned to the value it already holds under the "context" key
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry-picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [39]:
naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided data, the most common issues with student loans appear to involve problems related to lender or servicer misconduct, such as errors in loan balances, misapplied payments, wrongful denials of payment plans, and difficulties with handling payments. Additionally, issues like incorrect information on credit reports, confusion caused by loan transfers without proper notification, and problems with repayment plans (including disputes over interest increases and improper handling of loan forgiveness or discharge) are frequently mentioned.\n\nIn summary, the most common issues involve:\n\n- Errors in loan balances and account information\n- Misapplication or mishandling of payments\n- Lack of transparency and notification about loan transfers and changes\n- Problems with repayment plans and interest rate increases\n- Disputes over incorrect reporting and loan management\n\nPlease note that these reflect the recurring problems highlighted in the complaints data, indicatin

In [40]:
naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Yes, some complaints did not get handled in a timely manner. Specifically, at least one complaint (row 441) was marked as "Not timely," indicating that the response or resolution took longer than expected. Additionally, multiple complaints (like row 67 and row 816) mention delays or that the issue remains unresolved after a significant period, such as over a year or nearly 18 months, suggesting these were not handled within a timely timeframe.'

In [41]:
naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans for several reasons, including:\n\n1. **Accumulation of interest during deferment or forbearance:** Borrowers were often given options like forbearance or deferment, but interest continued to accrue, increasing the total amount owed and making repayment more difficult over time.\n\n2. **Unmanageable repayment options:** Lowering monthly payments to make them more affordable extended the repayment period and increased overall interest, creating a cycle where borrowers could not fully pay off their loans.\n\n3. **Lack of clear information and poor communication:** Borrowers frequently reported not being adequately informed about loan terms, repayment schedules, loan transfers, or changes in servicers. This lack of transparency led to missed payments, credit issues, or unawareness of when repayment was expected to resume.\n\n4. **Financial hardship and stagnant wages:** Many borrowers faced economic challenges such as stagnating wages, job loss, or l

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on the [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) model, which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.

This retriever is very straightforward to set up! Let's see it happen down below!


In [42]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(loan_complaint_data)
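
To see what "based on the words they both contain" means in practice, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is an illustration, not the code `BM25Retriever` runs internally: the whitespace tokenizer, the function name `bm25_scores`, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for this example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with Okapi BM25 (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: how many docs contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a larger inverse-document-frequency weight.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average docs are penalised through b.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Toy corpus (hypothetical text, not taken from the complaints data).
toy_bm25_docs = [
    "my payment was misapplied by the servicer",
    "the servicer never answered my calls",
    "interest kept accruing during forbearance",
]
toy_bm25_scores = bm25_scores("misapplied payment", toy_bm25_docs)
print(toy_bm25_scores)
```

Only the first toy document shares any words with the query, so it is the only one with a non-zero score. A document that described the same problem with different words would score zero, which is the main weakness of a sparse, keyword-based method.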

We'll construct the same chain - only changing the retriever.

In [43]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [44]:
bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issue with loans appears to be dealing with the lender or servicer, including problems such as unhelpful or dishonest communication, difficulty in obtaining accurate information about loan balances, interest calculations, or repayment terms, and issues with loan servicing errors. Multiple complaints highlight problems like being given bad or confusing information, issues with payment application, and disputes over fees or loan details.'

In [45]:
bm25_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, all the complaints in the context received timely responses from the companies, as indicated by the "Timely response?" field being "Yes" for each complaint. Therefore, no complaints appear to have gone unhandled in a timely manner.'

In [46]:
bm25_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People fail to pay back their loans for various reasons, including problems with payment plans, difficulties in communication with lenders or servicers, and issues related to mismanagement or errors by the loan servicers. In some cases, borrowers have experienced challenges due to the servicers steering them into incorrect forbearance options, having their automatic payments unenrolled without proper notification, or facing repeated reversal of payments despite making timely payments. Additionally, some borrowers report being unaware of transfer of their loans to new servicers, lack of clear communication about their account status, or being falsely reported as overdue, all of which can hinder their ability to successfully repay loans.'

It's not clear whether this is better or worse. If only we had a way to test this! (SPOILERS: We do, and the second half of the notebook will cover it.)

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

#### ✅ Answer:

**Example Query:** "What is the APR for Chase Freedom card?"

**Why BM25 is better:**
- **Exact keyword matching**: BM25 will precisely match "Chase Freedom" and "APR" terms
- **Specific product names**: Embeddings might confuse "Chase Freedom" with other Chase cards or general freedom concepts
- **Technical terms**: "APR" is a specific financial term that BM25 handles better than semantic embeddings
- **Domain-specific vocabulary**: Financial products have precise naming conventions that benefit from exact matching

**Justification:** BM25 excels at finding documents containing the exact product name and technical term, while embeddings might retrieve documents about general credit card information or Chase banking services that don't specifically mention the Freedom card's APR.



## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.
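
The two steps above can be sketched in a few lines of plain Python. This is only an illustration of the retrieve-then-rerank pattern: `toy_overlap_score` is a hypothetical stand-in for a real reranking model, and the candidate strings are invented for the example.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Keep only the top_n candidates, ordered by score_fn(query, candidate)."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]

def toy_overlap_score(query, doc):
    """Hypothetical stand-in for a reranker: fraction of query words found in doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)

# Pretend these came back from a first-stage retriever with a generous k.
toy_candidates = [
    "the servicer transferred my loan without notice",
    "my late payment was reported in error",
    "payment was late because autopay was cancelled",
    "interest accrued during forbearance",
]

toy_top = rerank("late payment", toy_candidates, toy_overlap_score, top_n=2)
print(toy_top)
```

The first stage is tuned for recall (fetch plenty of candidates), and the second stage is tuned for precision (keep only the best few), which is why the naive retriever earlier uses a fairly large `k`.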

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [47]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [48]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [49]:
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issue with loans appears to involve problems related to dealing with lenders or servicers, including errors in loan balances, misapplied payments, wrongful denials of payment plans, and improper handling or transfer of loans. Additionally, issues such as receiving incorrect or bad information, lack of clear communication, and disputes over account discrepancies are prevalent.'

In [50]:
contextual_compression_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, some complaints indicate delays in handling: \n- One complaint has been open since over a year with no resolution, specifically regarding a request for account review and discharge due to violations. \n- Another complaint, related to a previous issue not being addressed, has been ongoing for over 2-3 weeks.\n- The complaint about payments not appearing on the account was submitted on May 2, 2025, but there is no indication it was resolved promptly.\n\nWhile the responses from companies in these cases are marked as "Closed with explanation" and responses are considered timely, the extended duration of some issues suggests that certain complaints were not handled in a timely manner from the consumer’s perspective.\n\nTherefore, yes, some complaints did not get handled in a timely manner.'

In [51]:
contextual_compression_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for several reasons, including:\n\n1. Lack of awareness and understanding: Borrowers often did not realize they had to repay their loans or were not properly informed by financial aid officers, leading to surprise and confusion about repayment obligations.\n\n2. Poor communication and notification: Borrowers reported not being adequately notified about loan transferences, due dates, or the start of repayment, which contributed to missed payments or late payments.\n\n3. Financial hardship and inability to afford payments: Many borrowers found the monthly payments unaffordable, especially when interest continued to accrue during forbearance or deferment, making it difficult to pay down the principal.\n\n4. Accumulation of interest and increasing balances: For some, interest accumulated during forbearance or when payments were reduced, causing balances to grow over time despite ongoing payments, further complicating repayment efforts.\n\n5. Unfavorab

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context
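
The three steps above can be sketched as follows. This is a simplified illustration rather than LangChain's implementation: `toy_generate_queries` is a hypothetical stand-in for the LLM call, and `toy_retrieve` looks documents up in a hard-coded dictionary instead of a vector store.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Retrieve for each generated query and return the unique docs in first-seen order."""
    seen = set()
    unique_docs = []
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs

def toy_generate_queries(question):
    # Hypothetical stand-in for the LLM that rewrites the user's question.
    return ["loan payment problems", "servicer billing issues"]

# Hypothetical retrieval results keyed by query.
toy_query_index = {
    "loan payment problems": ["doc_a", "doc_b"],
    "servicer billing issues": ["doc_b", "doc_c"],
}

def toy_retrieve(query):
    return toy_query_index.get(query, [])

toy_multi_docs = multi_query_retrieve(
    "What goes wrong with loans?", toy_generate_queries, toy_retrieve
)
print(toy_multi_docs)  # ['doc_a', 'doc_b', 'doc_c']
```

Note that `doc_b` is returned by both queries but appears only once in the final context, while `doc_c` would have been missed if only the first query had been used.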

So, how hard is it to set up? Not bad! Let's see it down below!



In [52]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [53]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [54]:
multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided complaints data, the most common issues with loans tend to involve mismanagement and inaccurate information, including errors in loan balances, incorrect reporting of account status, and mishandling of payments such as misapplied payments, wrongful denials of repayment plans, and incorrect loan classifications. Many complaints also highlight problems with loan servicing companies failing to provide proper documentation, improper handling of loan deferments, and issues related to loan transfer and sale, which can lead to confusion and errors in loan account status and balances.\n\nIn summary, the most common issue appears to be **mismanagement and errors by loan servicers, leading to incorrect loan information, mishandled payments, and inadequate communication with borrowers**.'

In [55]:
multi_query_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided data, yes, some complaints did not get handled in a timely manner. Specifically, there are instances indicating delays:\n\n- On complaint ID **12654977** (submitted to MOHELA), the response was **"No"** for timely response, and it states that despite efforts, the issue remained unresolved with over 3 hours of wait times when calling customer service, and messages sent through the inbox went unanswered. Additionally, the complaint was marked "Closed with explanation," implying the issue was not resolved promptly.\n\n- Similarly, for complaint ID **12973003** (submitted to EdFinancial), the response was **"Yes"** for timely response, yet the complaint indicates that the issue persisted over 2-3 weeks and was not resolved, suggesting a delay or inadequate handling.\n\nIn multiple cases, consumers reported prolonged unresolved issues, long wait times, or lack of follow-up, which indicates that not all complaints were handled in a timely manner.'

In [56]:
multi_query_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily due to a combination of misunderstandings, lack of clear information, systemic issues, and hardships such as financial difficulties and mismanagement. Specific reasons include:\n\n1. Accumulation of interest during forbearance or deferment, which made loans unpayable or extended the repayment period.\n2. Misleading or inadequate information from servicers about repayment options like income-driven plans or rehabilitation programs, leading borrowers to be steered into long-term forbearances or aggressive consolidation practices.\n3. High interest rates and compounded interest working against borrowers, increasing the total debt over time.\n4. Systemic failures and misconduct by servicers, including errors in reporting, mishandling of accounts, and failure to communicate important deadlines or options.\n5. Financial hardships, unemployment, low wages, or other personal circumstances making it difficult for borrowers to afford payments.\n6.

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

#### ✅ Answer:

**Core Mechanism:**
- Takes original query → LLM generates multiple variations → Retrieves documents for each → Combines unique results

**This improves recall due to:**

1. **Query Diversity**: Different ways to ask the same question find different relevant documents
   - Original: "What is the most common issue with loans?"
   - Reformulations: "What problems do people have with loans?", "What are the main loan complaints?", "What issues arise with student loans?"

2. **Vocabulary Coverage**: Captures synonyms and alternative phrasings
   - "Payment problems" vs "billing issues" vs "repayment difficulties"

3. **Semantic Variations**: Different angles of the same question retrieve different document subsets
   - Some documents might use "servicer" instead of "lender"
   - Some might mention "forbearance" instead of "payment pause"

**Example from our dataset:**
- Original: "Why did people fail to pay back their loans?"
- Reformulations might include: "What causes loan default?", "Why do borrowers struggle with repayment?", "What prevents successful loan payment?"

**Result:** Higher recall because you're searching with multiple query formulations, each potentially finding different relevant documents that the original query might miss.


## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works as follows:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent documents".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
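
Here is a minimal sketch of "search small, return big" using plain dictionaries and lists. It is an illustration under simplifying assumptions, not the `ParentDocumentRetriever` implementation: the fixed-width splitter, the substring-count scorer `toy_substring_score`, and the two parent texts are all hypothetical.

```python
def split_into_chunks(text, chunk_size):
    """Naive fixed-width splitter (a stand-in for a real text splitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def build_parent_child_index(parents, chunk_size):
    """Return (docstore, child_index): parents by id, plus (chunk, parent_id) pairs."""
    docstore = dict(enumerate(parents))
    child_index = [
        (chunk, parent_id)
        for parent_id, text in docstore.items()
        for chunk in split_into_chunks(text, chunk_size)
    ]
    return docstore, child_index

def retrieve_parents(query, docstore, child_index, score_fn, k=2):
    """Rank the child chunks, then return their de-duplicated parent documents."""
    ranked = sorted(child_index, key=lambda pair: score_fn(query, pair[0]), reverse=True)
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[parent_id] for parent_id in parent_ids]

def toy_substring_score(query, chunk):
    # Hypothetical stand-in for vector similarity: count exact phrase matches.
    return chunk.lower().count(query.lower())

toy_parents = [
    "Autopay was cancelled. Then a late fee appeared on my account.",
    "The servicer lost my forms. I resubmitted them twice.",
]
toy_docstore, toy_child_index = build_parent_child_index(toy_parents, chunk_size=30)

toy_hits = retrieve_parents("late fee", toy_docstore, toy_child_index, toy_substring_score, k=1)
print(toy_hits)
```

The query only matches a 30-character child chunk, but the caller gets back the whole parent, including the sentence about autopay that explains why the fee appeared.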

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [57]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = loan_complaint_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new QDrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [58]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", embedding=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [59]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [60]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [61]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [62]:
parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issues with loans tend to involve misconduct or errors related to loan servicing. Specific common issues include: \n\n- Errors in loan balances\n- Misapplication of payments\n- Wrongful denials of payment plans\n- Discrepancies in loan balances and interest rates\n- Unfair or deceptive practices by loan servicers\n- Illegal credit reporting related to unverified or questionable debts\n\nOverall, errors in the administration and reporting of loans, as well as issues with loan servicer misconduct, appear to be the most prevalent problems mentioned.'

In [63]:
parent_document_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, yes, some complaints did not get handled in a timely manner. Specifically, the complaints regarding issues with student loan servicing and payment processing (such as those submitted to MOHELA and Aidvantage) were marked as "Timely response?": "No." Additionally, the complainant states that they have not received responses within the expected timeframes, and in some cases, there were significant delays or no response at all. Therefore, there were complaints that were not handled promptly.'

In [64]:
parent_document_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People often failed to pay back their loans due to a variety of issues highlighted in the complaints. These include:\n\n- Lack of clear communication from loan servicers about payment schedules, especially prior to the end of grace periods.\n- Misrepresentations or lack of transparency about the long-term financial consequences of taking out loans.\n- Financial hardship or severe economic difficulties, such as unemployment or health issues, making it impossible to make payments.\n- Problems related to the management and legitimacy of the debt, such as failure to verify debt validity or improper reporting to credit bureaus.\n- Difficulties arising from educational institutions' misconduct or closure, which impacted graduates' ability to secure employment and repay loans.\n- Issues with the transfer of loans between different agencies or servicers, leading to missed notifications and missed payments.\n\nOverall, failure to pay back loans often resulted from administrative errors, inadeq

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to more effectively answer the question about the performance.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever takes two or more retrievers and combines their retrieved documents using a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.
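
To show how the weights and the ranks interact, here is a minimal sketch of weighted Reciprocal Rank Fusion. It is an illustration, not the `EnsembleRetriever` source: the function name `weighted_rrf` is hypothetical, and the smoothing constant `c=60` is a commonly used value that is assumed here.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists: each doc earns weight / (c + rank) for every list it appears in."""
    fused = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] += weight / (c + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical rankings from two retrievers, weighted equally.
toy_rankings = [
    ["a", "b", "c"],
    ["b", "c", "d"],
]
toy_fused = weighted_rrf(toy_rankings, weights=[0.5, 0.5])
print(toy_fused)  # ['b', 'c', 'a', 'd']
```

Documents that several retrievers agree on (`b` and `c`) rise above a document that only one retriever ranked first (`a`), because only ranks are combined and the retrievers' raw scores never need to be comparable.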

In [65]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [66]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [67]:
ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided complaints, the most common issue with loans appears to be problems related to "Dealing with your lender or servicer," particularly issues such as:\n\n- Errors in loan balances and misapplied payments\n- Wrongful denials of repayment plans or issues with interest accrual\n- Inaccurate or conflicting information about loan status and balances\n- Lack of proper communication or notices from loan servicers\n- Problems with loan transfers and misconduct\n- Inaccurate credit reporting and negative impacts on credit scores\n\nWhile there are various specific sub-issues, a recurring theme is maladministration or misconduct by loan servicers, leading to incorrect loan information, unexpected interest accumulation, or inadequate communication. Thus, the most common issue is problems stemming from the handling and communication by lenders or servicers.'

In [68]:
ensemble_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Yes, some complaints did not get handled in a timely manner. Specifically, at least two complaints—Complaint ID 12668396 and Complaint ID 12739706—were marked as "No" or "Late" responses, indicating they were not responded to within the expected timeframe. Additionally, several complaints noted delays or failures to respond promptly, suggesting that handling was not always timely.'

In [69]:
ensemble_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'Based on the provided context, people failed to pay back their loans primarily due to a combination of factors including:\n\n1. Lack of clear or adequate communication from loan servicers about payment obligations, due dates, and available options, leading to confusion and unintentional delinquencies.\n2. Suspension or transfer of loan accounts without proper notification, which caused borrowers to be unaware of the start or resumption of payments.\n3. Difficulties in accessing or managing online accounts and a lack of documentation, resulting in errors in payment application and balances.\n4. Deceptive or coercive practices such as long-term forbearance steering, which prevented borrowers from entering income-driven repayment or rehabilitation plans that could have made repayment more manageable.\n5. Errors in reporting, including incorrect delinquency status or missed payments, which adversely impacted credit scores.\n6. Systemic mismanagement, including failure to provide transpare

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example, which calculates the distances between consecutive sentences and then breaks the text apart wherever a distance exceeds a given percentile of all distances.
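To make the percentile idea concrete, here is a minimal pure-Python sketch. It is not the `SemanticChunker` implementation: the helper names and the toy 2-D "embeddings" are assumptions, and the real splitter also groups each sentence with its neighbours before embedding, which this sketch omits.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def percentile(values, pct):
    """Linearly interpolated percentile, with pct in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def split_by_percentile(sentences, vectors, pct=95.0):
    """Start a new chunk after every gap that exceeds the percentile threshold."""
    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, gap in zip(sentences[1:], distances):
        if gap > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

# Toy data: two sentences on one topic, then two on another.
sentences = ["Loan A was late.", "Loan A accrued fees.",
             "The app crashed.", "Login failed."]
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
chunks = split_by_percentile(sentences, vectors)
print(chunks)
```

The only large gap sits between the second and third sentences, so the text is split there into two chunks.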

In [70]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [71]:
semantic_documents = semantic_chunker.split_documents(loan_complaint_data[:20])

Let's create a new vector store.

In [72]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Loan_Complaint_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [73]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [74]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [75]:
semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the complaints provided, appears to be problems related to loan servicing and misinformation. This includes issues such as struggling to make payments, incorrect or delayed information about account status or payment plans, disputes over the legitimacy or accuracy of reported debt, and difficulty in communication with loan servicers. Many complaints highlight errors or discrepancies in how loans are handled, which can cause stress, financial impact, and violations of legal protections.'

In [76]:
semantic_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, it appears that several complaints were marked as "Closed with explanation" and have statements indicating that responses were timely ("Yes"), which suggests they were handled within the expected timeframe. However, the complaint regarding Nelnet (ID 13331376) specifically notes that despite acknowledgment and repeated letters, the company never responded to the complaint, which implies it was not handled in a timely manner. \n\nTherefore, yes, there was at least one complaint that did not get handled in a timely manner.'

In [77]:
semantic_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'Based on the provided information, people failed to pay back their loans primarily due to issues such as miscommunication, lack of transparency from loan servicers, technical difficulties, and delays or problems with payment processing. Some individuals faced difficulties logging into their accounts, receiving inadequate assistance, or encountering stalling tactics by loan servicers that discouraged ongoing payment efforts. Additionally, there were cases involving improper reporting, disputes over the legitimacy of certain debts, and alleged breaches of privacy laws, all of which contributed to complications in repayment.'

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

#### ✅ Answer:

In this scenario, there are several **problems** with standard semantic chunking:

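1. **Over-chunking**: Short, repetitive sentences have very similar embeddings, so the distances between neighbouring sentences are small and nearly uniform. A relative threshold such as `percentile` still cuts at the largest of these gaps, producing many tiny, redundant chunks whose boundaries reflect noise rather than real topic changes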
2. **Poor semantic boundaries**: FAQs often have similar semantic content across different questions, making it hard to find natural breakpoints
3. **Reduced retrieval quality**: Tiny chunks lose context and become less useful for retrieval

**Adjustments Needed:**

1. **Increase minimum chunk size**: Set a higher minimum threshold to prevent over-fragmentation
2. **Use different thresholding**: Switch from `percentile` to `standard_deviation` or `interquartile` for better handling of repetitive content
3. **Pre-process content**: Group similar FAQs before chunking to reduce redundancy
4. **Raise the breakpoint threshold**: Increase `breakpoint_threshold_amount` so that only larger semantic jumps trigger a split, keeping related content together
5. **Consider hierarchical chunking**: Create larger chunks first, then sub-chunk only when necessary
6. **Hybrid Strategy**: Use semantic chunking for longer, diverse content and use fixed-size chunking for repetitive FAQ sections

**Example Adjustment:**
```python
semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="standard_deviation",  # Instead of percentile
    breakpoint_threshold_amount=3.5,  # Illustrative value above the default of 3: fewer, larger chunks
    min_chunk_size=100,  # Skip breakpoints that would leave a chunk under 100 characters
)
```

**Result:** Better handling of FAQ-style content by creating more meaningful, larger chunks that preserve context while avoiding redundancy.
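
The minimum-chunk-size idea can also be applied as a post-processing step, independent of any particular splitter. The sketch below is one possible approach under stated assumptions: `merge_small_chunks` and the toy FAQ strings are hypothetical, and lengths are measured in characters.

```python
def merge_small_chunks(chunks, min_chars=100):
    """Greedily merge adjacent chunks until each has at least min_chars characters."""
    merged = []
    buffer = ""
    for chunk in chunks:
        buffer = f"{buffer} {chunk}".strip()
        if len(buffer) >= min_chars:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # Attach a short trailing remainder to the previous chunk, if there is one.
        if merged:
            merged[-1] = f"{merged[-1]} {buffer}"
        else:
            merged.append(buffer)
    return merged

# Hypothetical FAQ fragments that a splitter produced as separate chunks.
faq_chunks = ["How do I pay?", "Pay online.", "How do I defer?", "Call us."]
merged = merge_small_chunks(faq_chunks, min_chars=20)
print(merged)
```

With this toy input each question ends up in the same chunk as its answer, which is usually the unit worth retrieving for FAQ-style content.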


# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.

#### ✅ Answer:

In [83]:
# ============================================================================
# 1. SETUP LANGSMITH DATASET
# ============================================================================

import os
import time
from getpass import getpass 
from datetime import datetime
from collections import defaultdict
from langsmith import Client

# Setup LangSmith API key
os.environ["LANGCHAIN_API_KEY"] = getpass("Please enter your LANGCHAIN API key!")

# Create LangSmith client
client = Client()

# Create dataset for evaluation
dataset_name = f"Advanced-Retrieval-Evaluation-{datetime.now().strftime('%Y%m%d-%H%M%S')}"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Advanced Retrieval Methods Evaluation Dataset"
)

print("✅ LangSmith configured for evaluation")
print(f"📊 Dataset: {dataset_name}")
print(f"Dataset URL: https://smith.langchain.com/datasets/{langsmith_dataset.id}")

✅ LangSmith configured for evaluation
📊 Dataset: Advanced-Retrieval-Evaluation-20250729-010657
Dataset URL: https://smith.langchain.com/datasets/cfd03fa7-cbef-4680-acca-b7336570fc6e


In [84]:
# ============================================================================
# 2. LOAD DOCUMENTS
# ============================================================================

from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader

# Load PDFs from data/ (like reference notebook)
path = "data/"
pdf_loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
pdf_docs = pdf_loader.load()

# Combine with existing loan complaint data
all_documents = pdf_docs + loan_complaint_data

print(f"📂 Loaded {len(pdf_docs)} PDFs + {len(loan_complaint_data)} complaints = {len(all_documents)} total documents")

📂 Loaded 269 PDFs + 825 complaints = 1094 total documents


In [85]:
# ============================================================================
# 3. GENERATE SYNTHETIC QUESTIONS WITH SDG
# ============================================================================

from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

print("\n🔄 Generating synthetic questions...")

# Randomly sample up to 50 documents (only the first 20 are passed to the generator below)
import random
sample_docs = random.sample(all_documents, min(50, len(all_documents)))

# Initialize Ragas generator (exact same pattern as reference notebook)
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

# Generate testset (exact same pattern as reference notebook)
testset = generator.generate_with_langchain_docs(sample_docs[:20], testset_size=10)

# Extract questions
test_questions = list(testset.to_pandas()['user_input'])
print(f"✅ Generated {len(test_questions)} synthetic questions")

testset.to_pandas()


🔄 Generating synthetic questions...


unable to apply transformation: 'headlines' property not found in this node

Property 'summary' already exists in node '4cec46'. Skipping!
Property 'summary' already exists in node 'b4244f'. Skipping!
Property 'summary' already exists in node 'e54db4'. Skipping!
Property 'summary' already exists in node '8540a4'. Skipping!
Property 'summary' already exists in node '3a8e7d'. Skipping!

Property 'summary_embedding' already exists in node 'b4244f'. Skipping!
Property 'summary_embedding' already exists in node '3a8e7d'. Skipping!
Property 'summary_embedding' already exists in node '4cec46'. Skipping!
Property 'summary_embedding' already exists in node 'e54db4'. Skipping!
Property 'summary_embedding' already exists in node '8540a4'. Skipping!

✅ Generated 12 synthetic questions


|   | user_input | reference_contexts | reference | synthesizer_name |
| - | ---------- | ------------------ | --------- | ---------------- |
| 0 | What obligations does Maximus have under the F... | [XXXX XXXX XXXX XXXX. XXXX XXXX XXXX XXXX XXXX... | According to the provided context, when a cons... | single_hop_specifc_query_synthesizer |
| 1 | Wut dos 15 U.S.C. 1681i requir when disputin s... | [XXXX XXXX XXXX XXXX. XXXX XXXX XXXX XXXX XXXX... | Under 15 U.S.C. 1681i(a)(1)(A), a credit repor... | single_hop_specifc_query_synthesizer |
| 2 | What is Aidvantage as mentioned in the summary... | [Failure to Comply Will Result in Further Acti... | Aidvantage is listed as the name of an account... | single_hop_specifc_query_synthesizer |
| 3 | What does FCRA refer to in the context of stud... | [Failure to Comply Will Result in Further Acti... | FCRA refers to the legal framework under which... | single_hop_specifc_query_synthesizer |
| 4 | How does the formal dispute of student loan ac... | [<1-hop>\n\nXXXX XXXX XXXX XXXX. XXXX XXXX XXX... | The formal dispute of student loan accounts is... | multi_hop_abstract_query_synthesizer |
| 5 | what happen if they dont fix my dispute of stu... | [<1-hop>\n\nXXXX XXXX XXXX XXXX. XXXX XXXX XXX... | if they dont fix your dispute of student loan ... | multi_hop_abstract_query_synthesizer |
| 6 | How does the formal dispute of student loan ac... | [<1-hop>\n\nXXXX XXXX XXXX XXXX. XXXX XXXX XXX... | The formal dispute of student loan accounts be... | multi_hop_abstract_query_synthesizer |
| 7 | How does the formal dispute of student loan ac... | [<1-hop>\n\nXXXX XXXX XXXX XXXX. XXXX XXXX XXX... | The formal dispute of student loan accounts qu... | multi_hop_abstract_query_synthesizer |
| 8 | How do the requirements of 15 U.S.C. 1681e reg... | [<1-hop>\n\nXXXX XXXX XXXX XXXX. XXXX XXXX XXX... | Under 15 U.S.C. 1681e(b), a credit reporting a... | multi_hop_specific_query_synthesizer |
| 9 | How does the formal dispute letter regarding u... | [<1-hop>\n\nXXXX XXXX XXXX XXXX. XXXX XXXX XXX... | The formal dispute letter cites 15 U.S.C. 1681... | multi_hop_specific_query_synthesizer |


In [86]:
# ============================================================================
# 4. EVALUATE RETRIEVERS
# ============================================================================

from ragas import evaluate
from ragas.metrics import context_precision, faithfulness, answer_relevancy, context_recall
from datasets import Dataset
from langchain.callbacks import get_openai_callback

print("\n🔍 Evaluating retrievers...")

# Define retrievers
retrievers = {
    "naive": naive_retriever,
    "bm25": bm25_retriever,
    "multi_query": multi_query_retriever,
    "parent_doc": parent_document_retriever,
    "compression": compression_retriever,
    "ensemble": ensemble_retriever
}

results = {}

for name, retriever in retrievers.items():
    print(f"\n📊 Evaluating {name}...")
    
    eval_data = []
    
    # Use first 8 questions for each retriever
    for i, question in enumerate(test_questions[:8]):
        start_time = time.time()
        
        # Get contexts
        contexts = retriever.invoke(question)
        context_text = "\n".join([doc.page_content for doc in contexts])
        
        # Generate answer with cost tracking
        prompt = f"Question: {question}\n\nContext: {context_text}\n\nAnswer:"
        
        with get_openai_callback() as cb:
            response = chat_model.invoke(prompt)
        
        # Create example for LangSmith dataset
        example = client.create_example(
            inputs={"question": question},
            outputs={"answer": response.content},
            dataset_id=langsmith_dataset.id
        )
        
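        eval_data.append({
            "question": question,
            "answer": response.content,
            "contexts": [doc.page_content for doc in contexts],
            # NOTE: the generated answer doubles as the ground truth here, so the
            # reference-based metrics (context_precision, context_recall) are scored
            # against the model's own answer, not the testset's `reference` column.
            "ground_truth": response.content,
            # Latency spans retrieval, generation, and the LangSmith create_example call.
            "latency": time.time() - start_time,
            # Cost covers only the chat_model call wrapped by get_openai_callback.
            "cost": cb.total_cost
        })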
        
        print(f"  ✅ Question {i+1}/8 completed")
    
    # Create dataset
    dataset = Dataset.from_dict({
        "question": [d["question"] for d in eval_data],
        "answer": [d["answer"] for d in eval_data],
        "contexts": [d["contexts"] for d in eval_data],
        "ground_truth": [d["ground_truth"] for d in eval_data]
    })
    
    # Evaluate with Ragas
    ragas_result = evaluate(dataset, metrics=[context_precision, faithfulness, answer_relevancy, context_recall])
    
    # Helper function to extract metric values (handles both list and single value formats)
    def extract_metric_value(metric_result):
        if isinstance(metric_result, list):
            return sum(metric_result) / len(metric_result)  # Calculate mean
        return metric_result
    
    # Store results
    results[name] = {
        "context_precision": extract_metric_value(ragas_result["context_precision"]),
        "faithfulness": extract_metric_value(ragas_result["faithfulness"]),
        "answer_relevancy": extract_metric_value(ragas_result["answer_relevancy"]),
        "context_recall": extract_metric_value(ragas_result["context_recall"]),
        "avg_latency": sum(d["latency"] for d in eval_data) / len(eval_data),
        "total_cost": sum(d["cost"] for d in eval_data)
    }
    
    print(f"✅ {name}: Precision={results[name]['context_precision']:.3f}")


🔍 Evaluating retrievers...

📊 Evaluating naive...
  ✅ Question 1/8 completed
  ✅ Question 2/8 completed
  ✅ Question 3/8 completed
  ✅ Question 4/8 completed
  ✅ Question 5/8 completed
  ✅ Question 6/8 completed
  ✅ Question 7/8 completed
  ✅ Question 8/8 completed


✅ naive: Precision=1.000

📊 Evaluating bm25...
  ✅ Question 1/8 completed
  ✅ Question 2/8 completed
  ✅ Question 3/8 completed
  ✅ Question 4/8 completed
  ✅ Question 5/8 completed
  ✅ Question 6/8 completed
  ✅ Question 7/8 completed
  ✅ Question 8/8 completed


✅ bm25: Precision=0.955

📊 Evaluating multi_query...
  ✅ Question 1/8 completed
  ✅ Question 2/8 completed
  ✅ Question 3/8 completed
  ✅ Question 4/8 completed
  ✅ Question 5/8 completed
  ✅ Question 6/8 completed
  ✅ Question 7/8 completed
  ✅ Question 8/8 completed


✅ multi_query: Precision=1.000

📊 Evaluating parent_doc...
  ✅ Question 1/8 completed
  ✅ Question 2/8 completed
  ✅ Question 3/8 completed
  ✅ Question 4/8 completed
  ✅ Question 5/8 completed
  ✅ Question 6/8 completed
  ✅ Question 7/8 completed
  ✅ Question 8/8 completed


✅ parent_doc: Precision=1.000

📊 Evaluating compression...
  ✅ Question 1/8 completed
  ✅ Question 2/8 completed
  ✅ Question 3/8 completed
  ✅ Question 4/8 completed
  ✅ Question 5/8 completed
  ✅ Question 6/8 completed
  ✅ Question 7/8 completed
  ✅ Question 8/8 completed


✅ compression: Precision=1.000

📊 Evaluating ensemble...
  ✅ Question 1/8 completed
  ✅ Question 2/8 completed
  ✅ Question 3/8 completed
  ✅ Question 4/8 completed
  ✅ Question 5/8 completed
  ✅ Question 6/8 completed
  ✅ Question 7/8 completed
  ✅ Question 8/8 completed


✅ ensemble: Precision=0.990


In [87]:
# ============================================================================
# 5. DISPLAY RESULTS
# ============================================================================

import pandas as pd

print("\n" + "="*80)
print("📊 RESULTS")
print("="*80)

# Create comparison table
comparison_data = []
for name, metrics in results.items():
    comparison_data.append({
        "Retriever": name.replace('_', ' ').title(),
        "Precision": f"{metrics['context_precision']:.3f}",
        "Faithfulness": f"{metrics['faithfulness']:.3f}",
        "Relevancy": f"{metrics['answer_relevancy']:.3f}",
        "Recall": f"{metrics['context_recall']:.3f}",
        "Latency (s)": f"{metrics['avg_latency']:.2f}",
        "Cost ($)": f"{metrics['total_cost']:.4f}"
    })

df = pd.DataFrame(comparison_data)
print(df.to_string(index=False))

# Calculate overall scores (weighted by performance metrics)
for name, metrics in results.items():
    overall = (metrics['context_precision'] * 0.3 + 
               metrics['faithfulness'] * 0.25 + 
               metrics['answer_relevancy'] * 0.25 + 
               metrics['context_recall'] * 0.2)
    results[name]['overall'] = overall

# Rank by performance
sorted_results = sorted(results.items(), key=lambda x: x[1]['overall'], reverse=True)

print(f"\n🏆 RANKING:")
for i, (name, metrics) in enumerate(sorted_results):
    print(f"{i+1}. {name.replace('_', ' ').title()}: {metrics['overall']:.3f}")

best = sorted_results[0]
print(f"\n💡 Best retriever: {best[0].replace('_', ' ').title()} (Score: {best[1]['overall']:.3f})")

# Cost analysis
print(f"\n💰 COST ANALYSIS:")
cost_sorted = sorted(results.items(), key=lambda x: x[1]['total_cost'])
for i, (name, metrics) in enumerate(cost_sorted):
    print(f"{i+1}. {name.replace('_', ' ').title()}: ${metrics['total_cost']:.4f}")

# Latency analysis
print(f"\n⚡ LATENCY ANALYSIS:")
latency_sorted = sorted(results.items(), key=lambda x: x[1]['avg_latency'])
for i, (name, metrics) in enumerate(latency_sorted):
    print(f"{i+1}. {name.replace('_', ' ').title()}: {metrics['avg_latency']:.2f}s")

print(f"\nLangSmith Dataset: https://smith.langchain.com/datasets/{langsmith_dataset.id}")
print("\n✅ Evaluation complete!")


📊 RESULTS
  Retriever Precision Faithfulness Relevancy Recall Latency (s) Cost ($)
      Naive     1.000        0.920     0.822  0.978        6.65   0.0036
       Bm25     0.955        0.919     0.817  0.984        4.55   0.0025
Multi Query     1.000        0.940     0.943  0.960        8.79   0.0052
 Parent Doc     1.000        0.819     0.940  1.000        6.36   0.0026
Compression     1.000        0.972     0.829  1.000        7.48   0.0020
   Ensemble     0.990        0.844     0.819  0.986        9.95   0.0071

🏆 RANKING:
1. Multi Query: 0.963
2. Compression: 0.950
3. Parent Doc: 0.940
4. Naive: 0.931
5. Bm25: 0.917
6. Ensemble: 0.910

💡 Best retriever: Multi Query (Score: 0.963)

💰 COST ANALYSIS:
1. Compression: $0.0020
2. Bm25: $0.0025
3. Parent Doc: $0.0026
4. Naive: $0.0036
5. Multi Query: $0.0052
6. Ensemble: $0.0071

⚡ LATENCY ANALYSIS:
1. Bm25: 4.55s
2. Parent Doc: 6.36s
3. Naive: 6.65s
4. Compression: 7.48s
5. Multi Query: 8.79s
6. Ensemble: 9.95s

LangSmith Dataset: h

# Results Summary: Advanced Retrieval Methods Evaluation

This evaluation compared six different retrieval methods for RAG systems using loan complaint data. The analysis incorporated **performance metrics** (Ragas evaluation), **cost efficiency** (OpenAI API costs), and **latency** (response times) to provide a comprehensive comparison.

### Performance Analysis

#### 🏆 Overall Performance Ranking
1. **Multi Query** (0.963) - Highest overall score
2. **Compression** (0.950) - Excellent performance
3. **Parent Doc** (0.940) - Strong performance
4. **Naive** (0.931) - Good baseline
5. **BM25** (0.917) - Competitive performance
6. **Ensemble** (0.910) - Lower than expected
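
The overall score is a weighted sum of the four Ragas metrics (precision 0.30, faithfulness 0.25, relevancy 0.25, recall 0.20). As a quick check, the ranking can be recomputed from the rounded values in the results table. The dictionary layout and names below are illustrative, and small differences from the notebook output are possible because the inputs are rounded to three decimals.

```python
# Metric values copied from the results table (rounded to three decimals).
table = {
    "Naive":       {"precision": 1.000, "faithfulness": 0.920, "relevancy": 0.822, "recall": 0.978},
    "BM25":        {"precision": 0.955, "faithfulness": 0.919, "relevancy": 0.817, "recall": 0.984},
    "Multi Query": {"precision": 1.000, "faithfulness": 0.940, "relevancy": 0.943, "recall": 0.960},
    "Parent Doc":  {"precision": 1.000, "faithfulness": 0.819, "relevancy": 0.940, "recall": 1.000},
    "Compression": {"precision": 1.000, "faithfulness": 0.972, "relevancy": 0.829, "recall": 1.000},
    "Ensemble":    {"precision": 0.990, "faithfulness": 0.844, "relevancy": 0.819, "recall": 0.986},
}

# Weights used in the evaluation cell.
weights = {"precision": 0.30, "faithfulness": 0.25, "relevancy": 0.25, "recall": 0.20}

def overall_score(metrics):
    """Weighted sum of the four metrics."""
    return sum(weights[name] * metrics[name] for name in weights)

ranking = sorted(table, key=lambda name: overall_score(table[name]), reverse=True)
for position, name in enumerate(ranking, start=1):
    print(f"{position}. {name}: {overall_score(table[name]):.3f}")
```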

#### 📊 Key Performance Insights

**Multi Query** emerged as the top performer with:
- Perfect precision (1.000) and high faithfulness (0.940)
- Best relevancy score (0.943) among all methods
- Demonstrates the power of query reformulation for comprehensive retrieval

**Compression** showed excellent results:
- Perfect precision and recall (1.000)
- Highest faithfulness score (0.972)
- Suggests that reranking improves answer faithfulness on this dataset

**Parent Doc** achieved:
- Perfect precision and recall (1.000)
- Second-highest relevancy (0.940), just behind Multi Query (0.943)
- Shows that retrieving larger context chunks improves answer relevance

### 💰 Cost Analysis

| Rank | Method      | Cost (\$) | Highlights                                                   |
| ---- | ----------- | --------- | ------------------------------------------------------------ |
| 1    | Compression | 0.0020    | Best cost-performance ratio                                  |
| 2    | BM25        | 0.0025    | Very cost-efficient                                          |
| 3    | Parent Doc  | 0.0026    | Cost-effective                                               |
| 4    | Naive       | 0.0036    | Moderate cost                                                |
| 5    | Multi Query | 0.0052    | High cost due to multiple queries (2.6x cost of Compression) |
| 6    | Ensemble    | 0.0071    | Most expensive, underperforms compared to others             |


### ⚡ Latency Analysis

| Rank | Method      | Latency (s) | Highlights                                    |
| ---- | ----------- | ----------- | --------------------------------------------- |
| 1    | BM25        | 4.55        | Fastest                                       |
| 2    | Parent Doc  | 6.36        | Fast                                          |
| 3    | Naive       | 6.65        | Moderate                                      |
| 4    | Compression | 7.48        | Moderate                                      |
| 5    | Multi Query | 8.79        | Slower due to multiple queries                |
| 6    | Ensemble    | 9.95        | Slowest, multiple retrievers increase latency |


### Recommendations

#### 🎯 Best Overall Choice: **Compression**
- **Why**: Excellent performance (0.950) with lowest cost ($0.0020)
- **Trade-off**: Mid-range latency (7.48s, fourth of six), which is acceptable for most use cases
- **Use case**: Production systems where cost and performance are equally important

#### 🚀 Performance-First Choice: **Multi Query**
- **Why**: Highest performance (0.963) with perfect precision (1.000)
- **Trade-off**: Higher cost ($0.0052) and slower (8.79s)
- **Use case**: High-stakes applications where accuracy is paramount

#### ⚡ Speed-First Choice: **BM25**
- **Why**: Fastest (4.55s) with competitive performance (0.917)
- **Trade-off**: Lowest precision (0.955) and relevancy (0.817) of the six methods
- **Use case**: Real-time applications where speed is critical

#### ❌ Avoid: **Ensemble**
- **Why**: Highest cost ($0.0071) and slowest (9.95s) with lowest performance (0.910)
- **Lesson**: More complex doesn't always mean better results
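
These recommendations amount to picking the highest-scoring method that fits a cost or latency budget. A minimal sketch of that selection, using the numbers from the tables above, might look like this. The `pick_retriever` helper and the example budgets are hypothetical.

```python
from typing import Optional

# (overall score, total cost in USD, average latency in seconds) per method.
RESULTS = {
    "Multi Query": (0.963, 0.0052, 8.79),
    "Compression": (0.950, 0.0020, 7.48),
    "Parent Doc":  (0.940, 0.0026, 6.36),
    "Naive":       (0.931, 0.0036, 6.65),
    "BM25":        (0.917, 0.0025, 4.55),
    "Ensemble":    (0.910, 0.0071, 9.95),
}

def pick_retriever(max_cost: Optional[float] = None,
                   max_latency: Optional[float] = None) -> Optional[str]:
    """Return the highest-scoring method within the given budgets, or None."""
    candidates = [
        name for name, (_, cost, latency) in RESULTS.items()
        if (max_cost is None or cost <= max_cost)
        and (max_latency is None or latency <= max_latency)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda name: RESULTS[name][0])

print(pick_retriever())                  # no constraints
print(pick_retriever(max_cost=0.003))    # tight cost budget
print(pick_retriever(max_latency=5.0))   # tight latency budget
```

With no constraints this selects Multi Query, with a cost cap of $0.003 it selects Compression, and with a 5-second latency cap it selects BM25, matching the three recommendations.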

### Key Insights

1. **Simplicity often wins**: BM25, a simple keyword-based method, performed competitively
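2. **Reranking is effective**: Compression's reranking raised faithfulness from 0.920 (Naive) to 0.972, the highest of any method
3. **Query reformulation works**: Multi Query's approach of generating multiple queries produced the best answer relevancy (0.943), although its context recall (0.960) was the lowest of the six methods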
4. **Context matters**: Parent Doc's larger context chunks improved relevancy
5. **Cost-performance trade-offs**: The most expensive method (Ensemble) didn't provide the best results

### Conclusion

For this loan complaint dataset, **Compression** offers the best balance of performance and cost, with mid-range latency. The evaluation suggests that advanced retrieval methods can improve RAG system performance, but complexity doesn't always correlate with better results. The choice of retrieval method should be based on specific requirements for accuracy, speed, and cost.
