# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up QDrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

In [5]:
# !pip install -qU langchain langchain-openai langchain-cohere

In [8]:
# !pip install -qU rank_bm25

We're also going to be leveraging [Qdrant's](https://qdrant.tech/documentation/frameworks/langchain/) (pronounced "Quadrant") VectorDB in "memory" mode (so we can leverage it locally in our colab environment).

In [9]:
!pip install -qU qdrant-client

We'll also provide our OpenAI key, as well as our Cohere API key and LangSmith key

In [2]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [2]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

In [3]:
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

## Task 2: Data Collection and Preparation

We'll be using some reviews from the 4 movies in the John Wick franchise today to explore the different retrieval strategies.

These were obtained from IMDB, and are available in the [AIM Data Repository](https://github.com/AI-Maker-Space/DataRepository).

### Data Collection

We can simply `wget` these from GitHub.

You could use any review data you wanted in this step - just be careful to make sure your metadata is aligned with your choice.

In [None]:
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv -O john_wick_1.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv -O john_wick_2.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw3.csv -O john_wick_3.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw4.csv -O john_wick_4.csv

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

- Self-Query: Wants as much metadata as we can provide
- Time-weighted: Wants temporal data

> NOTE: While we're creating a temporal relationship based on when these movies came out for illustrative purposes, it needs to be clear that the "time-weighting" in the Time-weighted Retriever is based on when the document was *accessed* last - not when it was created.

In [6]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

documents = []

for i in range(1, 5):
  loader = CSVLoader(
      file_path=f"john_wick_{i}.csv",
      metadata_columns=["Review_Date", "Review_Title", "Review_Url", "Author", "Rating"]
  )

  movie_docs = loader.load()
  for doc in movie_docs:

    # Add the "Movie Title" (John Wick 1, 2, ...)
    doc.metadata["Movie_Title"] = f"John Wick {i}"

    # convert "Rating" to an `int`, if no rating is provided - assume 0 rating
    doc.metadata["Rating"] = int(doc.metadata["Rating"]) if doc.metadata["Rating"] else 0

    # newer movies have a more recent "last_accessed_at"
    doc.metadata["last_accessed_at"] = datetime.now() - timedelta(days=4-i)

  documents.extend(movie_docs)

Let's look at an example document to see if everything worked as expected!

In [6]:
documents[0]

Document(metadata={'source': 'john_wick_1.csv', 'row': 0, 'Review_Date': '6 May 2015', 'Review_Title': ' Kinetic, concise, and stylish; John Wick kicks ass.\n', 'Review_Url': '/review/rw3233896/?ref_=tt_urv', 'Author': 'lnvicta', 'Rating': 8, 'Movie_Title': 'John Wick 1', 'last_accessed_at': datetime.datetime(2024, 9, 26, 21, 26, 25, 456154)}, page_content=": 0\nReview: The best way I can describe John Wick is to picture Taken but instead of Liam Neeson it's Keanu Reeves and instead of his daughter it's his dog. That's essentially the plot of the movie. John Wick (Reeves) is out to seek revenge on the people who took something he loved from him. It's a beautifully simple premise for an action movie - when action movies get convoluted, they get bad i.e. A Good Day to Die Hard. John Wick gives the viewers what they want: Awesome action, stylish stunts, kinetic chaos, and a relatable hero to tie it all together. John Wick succeeds in its simplicity.")

## Task 3: Setting up QDrant!

Now that we have our documents, let's create a QDrant VectorStore with the collection name "JohnWick".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [65]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWick"
)

In [66]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})
top_k = 1
retrieved_documents = retriever.invoke("WHow many people killed in John Wick?")
for doc in retrieved_documents:
    print(f"content: {doc.page_content}")
    print(f"metadata: {doc.metadata}")
    print("---")

results = vectorstore.similarity_search(
    "What weapons were used in the third movie", k=2
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")


content: : 5
Review: Ultra-violent first entry with lots of killings, thrills , noisy action , suspense , and crossfire . In this original John Wick (2014) , an ex-hit-man comes out of retirement to track down the gangsters that killed his dog and took everything from him . With the untimely death of his beloved wife still bitter in his mouth he seeks for vengeance . But when an arrogant Russian mob prince and hoodlums steal his car and kill his dog , they are fully aware of his lethal capacity. The Bogeyman will find himself dragged into an impossible task as every killer in the business dreams of cornering the legendary Wick who now has an enormous price on his head . In this first installment John Wick , blind with revenge, and for his salvation John will immediately unleash a carefully orchestrated maelstrom of destruction against those attempt to chase him and with a price tag on his head, as he is the target of hit men : an army of bounty-hunting killers on his trail and a murder

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply look at each review as a document, and use cosine-similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [67]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [68]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-3.5-turbo` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [69]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI()

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [70]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [13]:
naive_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, people generally liked John Wick based on the reviews provided.'

In [13]:
naive_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 3". Here is the URL to that review: \'/review/rw4854296/?ref_=tt_urv\''

In [13]:
naive_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In John Wick, an ex-hit-man comes out of retirement to seek revenge on the gangsters who killed his dog and took everything from him. This leads to a series of violent and action-packed confrontations as he faces off against various enemies.'

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.

This retriever is very straightforward to set-up! Let's see it happen down below!


In [24]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(documents)

We'll construct the same chain - only changing the retriever.

In [25]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [26]:
bm25_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Some people liked John Wick, while others did not. It seems to have mixed reviews.'

In [27]:
bm25_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 4" by the author jtindahouse. The URL to that review is: \'/review/rw8946038/?ref_=tt_urv\'.'

In [28]:
bm25_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In John Wick, the action is beautifully choreographed, the setup is surprisingly emotional for an action flick, and Keanu Reeves delivers a great performance. If you love action movies, you will enjoy John Wick.'

It's not clear that this is better or worse - but the `I don't know` isn't great!

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

#### Note

I could not get this to work within the confines of the versions working with the other methods

I did work with this in another notebook and failed to find the correct combination of versions

I elected to ignore this retrieval method to ensure completion of the assignment

In [91]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

RuntimeError: no validator found for <class 'pydantic.types.SecretStr'>, see `arbitrary_types_allowed` in Config

In [109]:
!pip show langchain langchain-cohere


Name: langchain
Version: 0.2.16
Summary: Building applications with LLMs through composability
Home-page: https://github.com/langchain-ai/langchain
Author: 
Author-email: 
License: MIT
Location: /home/rchrdgwr/anaconda3/envs/jupyter_2/lib/python3.11/site-packages
Requires: aiohttp, langchain-core, langchain-text-splitters, langsmith, numpy, pydantic, PyYAML, requests, SQLAlchemy, tenacity
Required-by: langchain-community, ragas
---
Name: langchain-cohere
Version: 0.3.0
Summary: An integration package connecting Cohere and LangChain
Home-page: https://github.com/langchain-ai/langchain-cohere
Author: 
Author-email: 
License: MIT
Location: /home/rchrdgwr/anaconda3/envs/jupyter_2/lib/python3.11/site-packages
Requires: cohere, langchain-core, langchain-experimental, pandas, pydantic, tabulate
Required-by: 


Let's create our chain again, and see how this does!

In [30]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [31]:
contextual_compression_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, people generally liked John Wick. It was described as the coolest action film of the year, slick, violent fun, and a must-see for action fans.'

In [32]:
contextual_compression_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 3". Here is the URL to that review: /review/rw4854296/?ref_=tt_urv'

In [33]:
contextual_compression_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick, the main character, played by Keanu Reeves, is forced back into the world of crime and assassination after the mobster Santino D'Antonio seeks his help and blows up his house when he refuses. Wick is given a contract to kill Santino's sister in Rome, leading to a series of events where he becomes the target of professional killers. Ultimately, Wick seeks revenge on Santino for betraying him."

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had....more than one query!

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context

So, how is it to set-up? Not bad! Let's see it down below!



In [26]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [27]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [36]:
multi_query_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, people generally liked John Wick based on the reviews provided.'

In [37]:
multi_query_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

"Yes, there is a review with a rating of 10. Here is the URL to that review: '/review/rw4854296/?ref_=tt_urv'"

In [38]:
multi_query_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In "John Wick," the main character, John Wick, seeks revenge after his wife dies and his dog is killed by Russian mobsters who also steal his car. Wick, revealed to be a super-assassin, goes on a mission to take down the mobster\'s gang and seek justice for his losses. The film is known for its intense action sequences and stylish stunts.'

## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [28]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

We'll need to set up a new QDrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [29]:
client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = Qdrant(
    collection_name="full_documents", embeddings=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

  parent_document_vectorstore = Qdrant(


Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [30]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [31]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [32]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [44]:
parent_document_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"People's opinions on John Wick seem to vary. Some individuals like the series and find it consistent and well-received, while others have negative opinions about it. It ultimately depends on personal preferences."

In [45]:
parent_document_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10. Here is the URL to that review: /review/rw4854296/?ref_=tt_urv'

In [46]:
parent_document_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick, the main character, played by Keanu Reeves, is a retired assassin who comes out of retirement after someone kills his dog and steals his car. He is then called on to pay off an old debt by helping Ian McShane take over the Assassin's Guild by traveling to Italy, Canada, and Manhattan and killing numerous assassins."

Overall, the performance *seems* largely the same. We can leverage a tool like [Ragas]() to more effectively answer the question about the performance.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes 2, or more, retrievers and combines their retrieved documents based on a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [33]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever,  multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [34]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [49]:
ensemble_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"Based on the reviews provided, it seems that generally, people liked John Wick. The majority of reviews praise its action sequences, Keanu Reeves' performance, stylishness, and fun factor. Overall, it appears that John Wick was well-received by fans of action films."

In [50]:
ensemble_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 3." Here is the URL to that review:\n/review/rw4854296/?ref_=tt_urv'

In [51]:
ensemble_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick, an ex-hitman comes out of retirement to seek vengeance against the gangsters who killed his dog and took everything from him. The movie is filled with violent action, shootouts, and breathtaking fights as John Wick unleashes a maelstrom of destruction against those who try to chase him. The plot revolves around John Wick's quest for revenge and the consequences of his actions."

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences based on their semantic similarity based on a number of [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

In [52]:
# !pip install -qU langchain_experimental

We'll use the `percentile` thresholding method for this example which will:

Calculate all distances between sentences, and then break apart sequences of setences that exceed a given percentile among all distances.

In [57]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [58]:
semantic_documents = semantic_chunker.split_documents(documents)

Let's create a new vector store.

In [59]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWickSemantic"
)

We'll use naive retrieval for this example.

In [60]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [61]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [59]:
ensemble_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, people generally liked John Wick based on the reviews provided.'

In [60]:
ensemble_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 3". Here is the URL to that review: \'/review/rw4854296/?ref_=tt_urv\'.'

In [61]:
ensemble_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In "John Wick," an ex-hitman comes out of retirement to seek vengeance against the gangsters who killed his dog and took everything from him. The story follows John Wick as he unleashes a maelstrom of destruction against those who come after him, leading to a relentless vendetta. The movie is filled with action, suspense, shootouts, and breathtaking fights as John Wick navigates through a world where every action has consequences.'

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against eachother.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparision between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.

# Assignment

#### LangSmith for Tracing

We will set up LangSmith to trace our LLM calls to allow us to analyze latency and costs

Set the project name

Originally I could not get tracing to appear in LangSmith I moved on to at least accomplish something

However later after deleting all LangSmith keys and creating a new one I did get this to work

In [45]:
import os
from uuid import uuid4

os.environ["LANGCHAIN_TRACING_V2"] = "false"

os.environ["LANGCHAIN_PROJECT"] = f"AIE4 - A14 - BM25 {uuid4().hex[0:8]}"

print(f"Your langsmith key is: {os.environ['LANGCHAIN_PROJECT']}")

Your langsmith key is: AIE4 - A14 - BM25 c1520bab


In [46]:
response = naive_retrieval_chain.invoke({"question" : "Who dies in John Wick?"})['response']
print(response)
print(response.content)

content="In John Wick, the character who dies is John Wick's dog." additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 14, 'prompt_tokens': 3905, 'total_tokens': 3919, 'completion_tokens_details': {'reasoning_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None} id='run-9110893c-b297-47dc-ae96-7b7d80bd190c-0' usage_metadata={'input_tokens': 3905, 'output_tokens': 14, 'total_tokens': 3919}
In John Wick, the character who dies is John Wick's dog.


Lets look at the documents - kind of curious how big they are, how variable in length

In [16]:
documents[0]
print(len(documents))
content_1 = documents[0].page_content
print(len(content_1))
lengths = [len(doc.page_content) for doc in documents]
average_length = sum(lengths) / len(lengths) if lengths else 0
min_length = min(lengths) if lengths else 0
max_length = max(lengths) if lengths else 0

print(f"Average Content Length: {average_length}")
print(f"Minimum Content Length: {min_length}")
print(f"Maximum Content Length: {max_length}")

100
599
Average Content Length: 533.9
Minimum Content Length: 29
Maximum Content Length: 2440


#### Question Generation

We will use RAGAS to create the questions from the context

Set up the RAGAS generator

Use gpt-3.5-turbo for the generator - reasonable powerful LLM

Use gpt-4o-mini as the critic - strong, has some reasoning, cheap

In [17]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = ChatOpenAI(model="gpt-3.5-turbo")
critic_llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

Functions to create the questions using RAGAS, save them to file and restore them from file.

Saved offline to be used for subsequent runs since we want consistent questions


In [18]:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
import pickle

chunk_size = 1000
chunk_overlap = 100
file_path = 'ragas_testset.pkl'

# load an existing ragas testset
def load_ragas_testset_if_exists():
    if os.path.exists(file_path):
        try:
            with open(file_path, 'rb') as f:
                ragas_state = pickle.load(f)
            print(f"Ragas testset loaded from {file_path}")
            return ragas_state
        except Exception as e:
            print(f"Error loading ragas testset: {e}")
            return None
    else:
        print(f"No existing ragas tesetset found at {file_path}")
        return None

# Save the ragas testset
def save_ragas_testset(testset):

    try:
        with open(file_path, 'wb') as f:
            pickle.dump(testset, f)
        print(f"Ragas testset saved to {file_path}")
    except Exception as e:
        print(f"Error saving ragas testset: {e}")


# create questions
def create_questions_for_ragas(documents, num_questions=1):
    generator_llm = ChatOpenAI(model="gpt-3.5-turbo")
    critic_llm = ChatOpenAI(model="gpt-4o-mini")
    embeddings = OpenAIEmbeddings()

    generator = TestsetGenerator.from_langchain(
        generator_llm,
        critic_llm,
        embeddings
    )
    distributions = {
        simple: 0.5,
        multi_context: 0.4,
        reasoning: 0.1
    }
    
    testset = generator.generate_with_langchain_docs(documents, num_questions, distributions, with_debugging_logs=False)
    save_ragas_testset(testset)
    return testset

Routine to get the questions or create them

Keep create_questions = False unless we need to recreate the questions.

In [19]:
create_questions = False
ragas_testset = None
num_questions = 40
if create_questions:
    ragas_testset = create_questions_for_ragas(documents, num_questions)
else:
    ragas_testset = load_ragas_testset_if_exists()
if ragas_testset:
    ragas_testset.to_pandas()
else:
    print("No RAGAS testset found - need to create questions")

Ragas testset loaded from ragas_testset.pkl


Look at some of the questions pulled in from offline.

In [20]:
ragas_testset.to_pandas().head()

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,What makes John Wick stand out as a favorite r...,[: 22\nReview: John Wick is one of my favourit...,John Wick stands out as a favorite recent year...,simple,"[{'source': 'john_wick_2.csv', 'row': 22, 'Rev...",True
1,What are some examples of classic internationa...,[: 11\nReview: JOHN WICK is a rare example of ...,Some examples of classic international films s...,simple,"[{'source': 'john_wick_1.csv', 'row': 11, 'Rev...",True
2,What was the surprise hit movie starring Keanu...,"[: 6\nReview: In 2014, a Keanu Reeves revenge ...",John Wick,simple,"[{'source': 'john_wick_2.csv', 'row': 6, 'Revi...",True
3,Who does John Wick face as he is called upon b...,[: 5\nReview: Iosef's uncle still has John Wic...,"John Wick faces deadly assassins, numerous kil...",simple,"[{'source': 'john_wick_2.csv', 'row': 5, 'Revi...",True
4,"What is the genre and main actor of the film ""...",[: 21\nReview: John Wick is an action film wit...,"The genre of the film 'John Wick' is action, a...",simple,"[{'source': 'john_wick_1.csv', 'row': 21, 'Rev...",True


#### Answer Generation

Routine to generate answers by invoking the passed in chain

Use the previously generated questions to determine answers. We then have the questions, the answer, the context and the ground truth.

Also lets calculate latency (time for the invoke to run) and number of tokens used. This will allow us to compare the different chains.

In [72]:
import time
from datasets import Dataset
from langsmith import traceable

@traceable(
        run_type="llm",
        name="generate_answers",
        project_name="Random"
)
def generate_answers(name, chain, testset):
    answers = []
    contexts = []
    questions = testset.to_pandas()["question"].values.tolist()
    ground_truths = testset.to_pandas()["ground_truth"].values.tolist()
    latencies = []
    tokens = []

    for question in questions:
        start_time = time.time()  
        answer = chain.invoke({"question" : question})
        latency = time.time() - start_time
        answers.append(answer["response"].content)
        contexts.append([context.page_content for context in answer["context"]])
        latencies.append(latency)
        total_tokens = answer["response"].response_metadata["token_usage"]["total_tokens"]
        tokens.append(total_tokens)
    print("Answers generated")
    return Dataset.from_dict({
        "question" : questions,
        "answer" : answers,
        "contexts" : contexts,
        "ground_truth" : ground_truths,
        "latency": latencies,
        "tokens": tokens,
    })


Generate answers for the naive retrieval chain

In [73]:
naive_retrieval_chain_dataset = generate_answers("Naive_retrieval", naive_retrieval_chain, ragas_testset)

Answers generated


Lets look at some of the responses we get

In [74]:
naive_retrieval_chain_dataset.to_pandas().head()

Unnamed: 0,question,answer,contexts,ground_truth,latency,tokens
0,What makes John Wick stand out as a favorite r...,John Wick stands out as a favorite recent year...,"[: 9\nReview: At first glance, John Wick sound...",John Wick stands out as a favorite recent year...,3.196085,3571
1,What are some examples of classic internationa...,Some classic international films similar to JO...,[: 11\nReview: JOHN WICK is a rare example of ...,Some examples of classic international films s...,0.953689,3875
2,What was the surprise hit movie starring Keanu...,The surprise hit movie starring Keanu Reeves i...,"[: 6\nReview: In 2014, a Keanu Reeves revenge ...",John Wick,0.769292,3015
3,Who does John Wick face as he is called upon b...,"John Wick faces deadly assassins, numerous kil...",[: 20\nReview: After resolving his issues with...,"John Wick faces deadly assassins, numerous kil...",0.967237,3763
4,"What is the genre and main actor of the film ""...","The genre of the film ""John Wick"" is action, a...","[: 9\nReview: At first glance, John Wick sound...","The genre of the film 'John Wick' is action, a...",0.790676,3904


#### Evaluation

Use RAGAS to determine context precision and context recall

In [76]:
from ragas.metrics import (

    context_recall,
    context_precision,
)
from ragas import evaluate
def evaluate_ragas_results(dataset):
    results = evaluate(
        dataset,
        metrics=[
        context_precision,
        context_recall,
        ],
    )
    print("Evaluation complete")
    return results

Run the evaluation for the naive retrieval chain

In [77]:
naive_retrieval_chain_evaluation = evaluate_ragas_results(naive_retrieval_chain_dataset)

Evaluating:   0%|          | 0/74 [00:00<?, ?it/s]

Evaluation complete


#### Data Merge and Offline Storage

Merge the answer dataset and the evaluation dataset to combine the latency, the tokens, the context precision and the context recall

Save this combined dataset offline for later recall in case we have program failures due to installing software that causes conflicts - yes this has happened way too many times.

In [78]:
import pandas as pd
import os
def save_datasets(name, dataset, evaluation):
    os.makedirs("datasets", exist_ok=True)
    d1 = dataset.to_pandas()
    d2 = evaluation.to_pandas()
    d2_d1 = pd.merge(d2, d1[["question","latency", "tokens"]], on="question", how="outer")
    file_name = f"datasets/{name}.csv"
    d2_d1.to_csv(file_name, index=False, encoding='utf-8')
    print("Datasets merged and saved offline")
    return d2_d1
def read_dataset(name):
    file_name = f"datasets/{name}.csv"
    dataset = pd.read_csv(file_name)
    return dataset

Merge and save the naive retrieval dataset

In [79]:
naive_retrieval_dataset = save_datasets("Naive_retrieval",naive_retrieval_chain_dataset,naive_retrieval_chain_evaluation )

Datasets merged and saved offline


Get the naive retrieval dataset to ensure the process is working

In [80]:
naive_retrieval_dataset = read_dataset("Naive_retrieval")


#### Calculate and Display Metrics

Display the averges for:
- context precision
- context recall
- latency
- tokens

In [81]:
def get_averages(name, df):
    averages = {
        "Average Context Precision": df["context_precision"].mean(),
        "Average Context Recall": df["context_recall"].mean(),
        "Average Latency (ms)": df["latency"].mean(),
        "Average Tokens": df["tokens"].mean()
    }
    averages_df = pd.DataFrame(list(averages.items()), columns=["Metric", name])
    print(averages_df)
    return averages_df

Display the metrics for the naive retrieval chain

In [82]:
get_averages("Naive_retrieval", naive_retrieval_dataset)

                      Metric  Naive_retrieval
0  Average Context Precision         0.733476
1     Average Context Recall         0.936937
2       Average Latency (ms)         1.486473
3             Average Tokens      3620.405405


Unnamed: 0,Metric,Naive_retrieval
0,Average Context Precision,0.733476
1,Average Context Recall,0.936937
2,Average Latency (ms),1.486473
3,Average Tokens,3620.405405


#### BM25

Lets check the BM25 retrieval chain through all of the routines and display the metrics

In [48]:
bm25_retrieval_chain_dataset = generate_answers("BM25_retrieval", bm25_retrieval_chain, ragas_testset)
bm25_retrieval_chain_evaluation = evaluate_ragas_results(bm25_retrieval_chain_dataset)
bm25_retrieval_dataset = save_datasets("BM25_retrieval",bm25_retrieval_chain_dataset,bm25_retrieval_chain_evaluation )
get_averages("BM25_retrieval", bm25_retrieval_dataset)

Answers generated


Evaluating:   0%|          | 0/74 [00:00<?, ?it/s]

Evaluation complete
Datasets merged and saved offline
                      Metric  BM25_retrieval
0  Average Context Precision        0.656907
1     Average Context Recall        0.783784
2       Average Latency (ms)        1.033972
3             Average Tokens     1275.081081


Unnamed: 0,Metric,BM25_retrieval
0,Average Context Precision,0.656907
1,Average Context Recall,0.783784
2,Average Latency (ms),1.033972
3,Average Tokens,1275.081081


#### Multi Query Retrieval

Lets check the multi query retrieval chain and display the metrics

Note - we took some rate limits reached on some of the LLM calls

In [52]:
multi_query_retrieval_chain_dataset = generate_answers("MultiQuery_retrieval", multi_query_retrieval_chain, ragas_testset)
multi_query_retrieval_chain_evaluation = evaluate_ragas_results(multi_query_retrieval_chain_dataset)
multi_query_retrieval_dataset = save_datasets("MultiQuery_retrieval",multi_query_retrieval_chain_dataset,multi_query_retrieval_chain_evaluation )
get_averages("MultiQuery_retrieval", multi_query_retrieval_dataset)

Answers generated


Evaluating:   0%|          | 0/74 [00:00<?, ?it/s]

Evaluation complete
Datasets merged and saved offline
                      Metric  MultiQuery_retrieval
0  Average Context Precision              0.747101
1     Average Context Recall              0.959459
2       Average Latency (ms)              3.323619
3             Average Tokens           4465.189189


Unnamed: 0,Metric,MultiQuery_retrieval
0,Average Context Precision,0.747101
1,Average Context Recall,0.959459
2,Average Latency (ms),3.323619
3,Average Tokens,4465.189189


#### Parent Document Retrieval

Lets check the parent document retrieval chain and display the metrics

In [54]:
parent_document_retrieval_chain_dataset = generate_answers("ParentDocument_retrieval", parent_document_retrieval_chain, ragas_testset)
parent_document_retrieval_chain_evaluation = evaluate_ragas_results(parent_document_retrieval_chain_dataset)
parent_document_retrieval_dataset = save_datasets("ParentDocument_retrieval",parent_document_retrieval_chain_dataset,parent_document_retrieval_chain_evaluation )
get_averages("ParentDocument_retrieval", parent_document_retrieval_dataset)

Answers generated


Evaluating:   0%|          | 0/74 [00:00<?, ?it/s]

Evaluation complete
Datasets merged and saved offline
                      Metric  ParentDocument_retrieval
0  Average Context Precision                  0.756006
1     Average Context Recall                  0.765766
2       Average Latency (ms)                  1.311893
3             Average Tokens                648.513514


Unnamed: 0,Metric,ParentDocument_retrieval
0,Average Context Precision,0.756006
1,Average Context Recall,0.765766
2,Average Latency (ms),1.311893
3,Average Tokens,648.513514


#### Ensemble Retrieval

Lets check the ensemble retrieval chain and display the metrics

In [56]:
ensemble_retrieval_chain_dataset = generate_answers("Ensemble_retrieval", ensemble_retrieval_chain, ragas_testset)
ensemble_retrieval_chain_evaluation = evaluate_ragas_results(ensemble_retrieval_chain_dataset)
ensemble_retrieval_dataset = save_datasets("Ensemble_retrieval",ensemble_retrieval_chain_dataset,ensemble_retrieval_chain_evaluation )
get_averages("Ensemble_retrieval", ensemble_retrieval_dataset)

Answers generated


Evaluating:   0%|          | 0/74 [00:00<?, ?it/s]

Exception raised in Job[36]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-dm9dvvnDgfJGEGv0fE2Q952w on tokens per min (TPM): Limit 200000, Used 199271, Requested 1729. Please try again in 300ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[11]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-dm9dvvnDgfJGEGv0fE2Q952w on tokens per min (TPM): Limit 200000, Used 197724, Requested 4357. Please try again in 624ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[13]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o-mini in organization org-dm9dvvnDgfJGEGv0fE2Q952w on tokens per min (TPM): Limit 200000, Us

Evaluation complete
Datasets merged and saved offline
                      Metric  Ensemble_retrieval
0  Average Context Precision            0.768539
1     Average Context Recall            0.931373
2       Average Latency (ms)            3.769606
3             Average Tokens         5328.081081


Unnamed: 0,Metric,Ensemble_retrieval
0,Average Context Precision,0.768539
1,Average Context Recall,0.931373
2,Average Latency (ms),3.769606
3,Average Tokens,5328.081081


#### Semantic Chunking

In [63]:

semantic_chunking_retrieval_chain_dataset = generate_answers("ParentDocument_retrieval", semantic_retrieval_chain, ragas_testset)
semantic_chunking_retrieval_chain_evaluation = evaluate_ragas_results(semantic_chunking_retrieval_chain_dataset)
semantic_chunking_retrieval_dataset = save_datasets("ParentDocument_retrieval",semantic_chunking_retrieval_chain_dataset,semantic_chunking_retrieval_chain_evaluation )
get_averages("ParentDocument_retrieval", semantic_chunking_retrieval_dataset)

Answers generated


Evaluating:   0%|          | 0/74 [00:00<?, ?it/s]

Evaluation complete
Datasets merged and saved offline
                      Metric  ParentDocument_retrieval
0  Average Context Precision                  0.716682
1     Average Context Recall                  0.972973
2       Average Latency (ms)                  1.400724
3             Average Tokens               2801.702703


Unnamed: 0,Metric,ParentDocument_retrieval
0,Average Context Precision,0.716682
1,Average Context Recall,0.972973
2,Average Latency (ms),1.400724
3,Average Tokens,2801.702703


Lets summarize all of the scores

In [83]:
import pandas as pd

def get_averages(df):
    averages = {
        "Average Context Precision": df["context_precision"].mean(),
        "Average Context Recall": df["context_recall"].mean(),
        "Average Latency (ms)": df["latency"].mean(),
        "Average Tokens": df["tokens"].mean()
    }
    return averages

datasets = [
    (naive_retrieval_dataset, "Naive Retrieval"),
    (bm25_retrieval_dataset, "BM25 Retrieval"),
    (multi_query_retrieval_dataset, "Multi Query Retrieval"),
    (parent_document_retrieval_dataset, "Parent Document Retrieval"),
    (ensemble_retrieval_dataset, "Ensemble Retrieval"),
    (semantic_chunking_retrieval_dataset, "Semantic Chunking")
]
results = []
for dataset, name in datasets:
    avg_values = get_averages(dataset)
    avg_values["Dataset"] = name 
    results.append(avg_values)
summary_df = pd.DataFrame(results)
summary_df = summary_df[["Dataset", "Average Context Precision", "Average Context Recall", "Average Latency (ms)", "Average Tokens"]]

print(summary_df.to_string(index=False))

summary_df.to_csv("summary_averages.csv")

                  Dataset  Average Context Precision  Average Context Recall  Average Latency (ms)  Average Tokens
          Naive Retrieval                   0.733476                0.936937              1.486473     3620.405405
           BM25 Retrieval                   0.656907                0.783784              1.033972     1275.081081
    Multi Query Retrieval                   0.747101                0.959459              3.323619     4465.189189
Parent Document Retrieval                   0.756006                0.765766              1.311893      648.513514
       Ensemble Retrieval                   0.768539                0.931373              3.769606     5328.081081
        Semantic Chunking                   0.716682                0.972973              1.400724     2801.702703


 #### Results and Analysis                 

Average scores:  

| Dataset                       | Average Context Precision | Average Context Recall | Average Latency (s) | Average Tokens | Total Latency | Total Tokens | LS Cost |
|-------------------------------|---------------------------|------------------------|-----------------------|----------------|-----------|-----------|---------|
| Naive Retrieval               | 0.733476                  | 0.936937               | 1.486473              | 3620.405405    | 198.13    | 133955    | 0.070   |
| BM25 Retrieval                | 0.656907                  | 0.783784               | 1.033972              | 1275.08108     | 38.28     | 47178     | 0.026   |
| Multi Query Retrieval         | 0.747101                  | 0.959459               | 3.323619              | 4465.189189    | 122.99    | 171764    | 0.091   |
| Parent Document Retrieval     | 0.756006                  | 0.765766               | 1.311893              | 648.513514     | 48.55     | 23995     | 0.014   |
| Ensemble Retrieval            | 0.768539                  | 0.931373               | 3.769606              | 5328.081081    | 139.49    | 203677    | 0.107   |
| Semantic Chunking             | 0.716682                  | 0.972973               | 1.400724              | 2801.702703    | 51.84     | 103663    | 0.054   |

Ranking 1 (best) - 6 (worst)

| | Dataset                    | Precision Score | Recall Score | Latency Score | Tokens Score | Total Latency | Total Tokens | Cost $ |
|-|----------------------------|-----------------|--------------|----------------|--------------|--------------|--------------|--------|
| | Naive Retrieval            | 4               | 3            | 4              | 4            | 6            | 4            | 4      |
| | BM25 Retrieval             | 6               | 5            | 1              | 2            | 1            | 2            | 2      |
| | Multi Query Retrieval      | 3               | 2            | 4              | 5            | 4            | 5            | 5      |
| | Parent Document Retrieval  | 2               | 6            | 2              | 1            | 2            | 1            | 1      |
| | Ensemble Retrieval         | 1               | 4            | 6              | 6            | 5            | 6            | 6      |
| | Semantic Chunking          | 5               | 1            | 3              | 3            | 3            | 3            | 3      |
 

- Highest in Context Precision is Ensemble retrieval followed by Parent Document and Multi Query
- Highest in Context Recall is Semantic Chunking followed closely by Multi Query and Naive
- Lowest in Latency is BM25 followed by Parent Document and Semantic Chunking
- Lowest in Average Tokens is Parent Document followed by BM25 and Semantic Chunking
- Lowest Total Latency is BM25 followef by Parent Document and Semantic Chunking
- Lowest Total Tokens is Parent Document followed by BM24 and Semantic Chunking
- Lowest Total Cost is Parent Document followed by BM25 and Semantic Chunking

Ensemble retrieval has the highest average context retrieval and a high context recall indicating it is effective at returning both relevant results and a good range of relevant documents.

Semantic Chunking has the highest Context recall and a somewhat lower context precision indicating its ability to get a wide range of relevant documents but not so good at getting the most relevant results. It is in the middle of the results for latency and cost.

BM25 has the lowest latency making it the fastest retrieval. Not surprising considering it is an efficient ranking algorithm. However BM25 does the worst in precision and second worst in recall. Its speed could be an asset in real time applications

Multi Query has the second highest latency due to its strategy of using multiple queries. It is also the second most costliest. These higher latency and costs do help it somewhat in precision and recall.

Parent Document retrieval has the second highest context precision but the lowest recall score. It will retrieve hightly relevant documents but will also miss some relevant ones. It has the second lowest latency and the lowest cost in tokens.

Ensemble Retrieval has the highest latency (slowest) and highest cost in tokens. Since it uses the other techniques this is not surprising. What is surprising is that even using all of these different retrieval methods it does not do so well on precision and recall. 

If precision is a priority (returning the most relevant documents), then Ensemble or Parent Document retrieval may be best. Parent Documemt is also the cheapest retrieval mechanism and pretty quick. Ensemble on the other hand is slower and most expensive.

If speed and cost are priorities then BM25 or Parent Document could be a good choice. BM25 comes with poor scores in precision or recall. Parent Document does well in precision.

Summary

- Ensemble Retrieval provides the best overall performance but at a high cost and latency.
- BM25 is the most efficient method providing low costs and high speed but at the cost of precision and accuracy


For this application, providing users the opportunity to ask questions about reviews of the John Wick movies, The Semantic Chunking method may be the best choice. It has the best recall score although a slightly lower precision and latency and cost in the lower half. This means it can provide relevant information both quickly and accurately. Its low cost adds to the benefits this retrieval method provides. If cost is not a concern then Multi Query may be a good option

#### Winner ==> Semantic Chunking

#### Runner Up ==> Multi Query
