# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up Qdrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

In [1]:
!pip install -qU langchain langchain-openai langchain-cohere rank_bm25

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
chainlit 1.1.306 requires packaging<24.0,>=23.1, but you have packaging 24.1 which is incompatible.
ragas 0.1.20 requires langchain-core<0.3, but you have langchain-core 0.3.6 which is incompatible.

We're also going to be leveraging [Qdrant's](https://qdrant.tech/documentation/frameworks/langchain/) (pronounced "Quadrant") VectorDB in "memory" mode (so we can leverage it locally in our colab environment).

In [2]:
!pip install -qU qdrant-client

We'll also provide our OpenAI key, as well as our Cohere API key.

In [4]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [5]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using some reviews from the 4 movies in the John Wick franchise today to explore the different retrieval strategies.

These were obtained from IMDB, and are available in the [AIM Data Repository](https://github.com/AI-Maker-Space/DataRepository).

### Data Collection

We can simply `wget` these from GitHub.

You could use any review data you wanted in this step - just be careful to make sure your metadata is aligned with your choice.

In [6]:
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv -O john_wick_1.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv -O john_wick_2.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw3.csv -O john_wick_3.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw4.csv -O john_wick_4.csv

--2024-09-29 17:32:54--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8001::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19628 (19K) [text/plain]
Saving to: ‘john_wick_1.csv’


2024-09-29 17:32:54 (15.2 MB/s) - ‘john_wick_1.csv’ saved [19628/19628]

--2024-09-29 17:32:54--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8000::154, 2606:50c0:8002::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14747 (14K) [text/plain]
Saving to: ‘john_wick_2.csv’


2024-09-29 17:32:55

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

- Self-Query: Wants as much metadata as we can provide
- Time-weighted: Wants temporal data

> NOTE: While we're creating a temporal relationship based on when these movies came out for illustrative purposes, it needs to be clear that the "time-weighting" in the Time-weighted Retriever is based on when the document was *accessed* last - not when it was created.

In [130]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

documents = []

for i in range(1, 5):
  loader = CSVLoader(
      file_path=f"john_wick_{i}.csv",
      metadata_columns=["Review_Date", "Review_Title", "Review_Url", "Author", "Rating"]
  )

  movie_docs = loader.load()
  for doc in movie_docs:

    # Add the "Movie Title" (John Wick 1, 2, ...)
    doc.metadata["Movie_Title"] = f"John Wick {i}"

    # convert "Rating" to an `int`, if no rating is provided - assume 0 rating
    doc.metadata["Rating"] = int(doc.metadata["Rating"]) if doc.metadata["Rating"] else 0

    # newer movies have a more recent "last_accessed_at"
    doc.metadata["last_accessed_at"] = datetime.now() - timedelta(days=4-i)

  documents.extend(movie_docs)

Let's look at an example document to see if everything worked as expected!

In [131]:
documents[0]

Document(metadata={'source': 'john_wick_1.csv', 'row': 0, 'Review_Date': '6 May 2015', 'Review_Title': ' Kinetic, concise, and stylish; John Wick kicks ass.\n', 'Review_Url': '/review/rw3233896/?ref_=tt_urv', 'Author': 'lnvicta', 'Rating': 8, 'Movie_Title': 'John Wick 1', 'last_accessed_at': datetime.datetime(2024, 9, 27, 20, 23, 0, 196035)}, page_content=": 0\nReview: The best way I can describe John Wick is to picture Taken but instead of Liam Neeson it's Keanu Reeves and instead of his daughter it's his dog. That's essentially the plot of the movie. John Wick (Reeves) is out to seek revenge on the people who took something he loved from him. It's a beautifully simple premise for an action movie - when action movies get convoluted, they get bad i.e. A Good Day to Die Hard. John Wick gives the viewers what they want: Awesome action, stylish stunts, kinetic chaos, and a relatable hero to tie it all together. John Wick succeeds in its simplicity.")

In [132]:
len(documents)

100

## Task 3: Setting up Qdrant!

Now that we have our documents, let's create a Qdrant VectorStore with the collection name "JohnWick".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [133]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWick"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply look at each review as a document, and use cosine-similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later
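To make "use cosine-similarity to fetch the most relevant documents" concrete, here is a minimal sketch of the ranking step in plain Python. The 2-D vectors are made-up toy values (not real embedding output), and the helper names `cosine_similarity` and `top_k` are ours for illustration - a real vector store handles this for us.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy 2-D "embeddings" (hypothetical values, not real model output).
toy_doc_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
toy_query_vec = [0.9, 0.1]

toy_top_two = top_k(toy_query_vec, toy_doc_vecs, k=2)
print(toy_top_two)
```

The query points mostly along the first axis, so the first toy document ranks highest and the diagonal one comes second.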

In [134]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [135]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-3.5-turbo` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [136]:
from langchain_openai import ChatOpenAI

# Name the model explicitly so the code matches the text above
chat_model = ChatOpenAI(model="gpt-3.5-turbo")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [137]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # Passes "question" through unchanged and re-assigns "context" to the value of the
    # "context" key from the previous step (effectively a no-op, kept for readability)
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [15]:
naive_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, people generally liked John Wick based on the reviews provided.'

In [16]:
naive_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 3". Here is the URL to that review: \'/review/rw4854296/?ref_=tt_urv\'.'

In [17]:
naive_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In John Wick, an ex-hitman comes out of retirement to seek vengeance against the gangsters who killed his dog and took everything from him. The story is filled with violent action, shootouts, and breathtaking fights as John Wick unleashes a maelstrom of destruction against those who come after him. The plot revolves around his relentless vendetta and the consequences of his actions in the criminal underworld.'

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model), which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.
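As a rough illustration of that word-overlap idea, here is a small Okapi-BM25-style scorer in plain Python. It's a sketch under assumptions: whitespace tokenization, `k1=1.5` and `b=0.75` as illustrative parameter values, and a smoothed IDF. It is not the exact implementation used by `rank_bm25` or by LangChain's `BM25Retriever`.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with an Okapi-BM25-style formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF that stays non-negative: rarer terms count for more.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalized via length normalization.
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Made-up toy reviews, purely for illustration.
toy_reviews = [
    "keanu reeves is great in this action movie",
    "the dog scene made me cry",
    "great action great stunts great movie",
]
toy_scores = bm25_scores("great action", toy_reviews)
print(toy_scores)
```

The second toy review shares no words with the query and scores zero, while the third repeats "great" in a short document and scores highest. Note that there is no notion of meaning here - only shared words.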

This retriever is very straightforward to set up! Let's see it happen down below!


In [139]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(documents)

We'll construct the same chain - only changing the retriever.

In [140]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [20]:
bm25_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"Opinions on John Wick seem to vary among viewers. Some people really enjoyed it and praised the action sequences and Keanu Reeves' performance, while others found it lacking in plot and substance. It ultimately depends on individual preferences."

In [21]:
bm25_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'No reviews have a rating of 10.'

In [22]:
bm25_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In John Wick, the action is beautifully choreographed, the setup is surprisingly emotional for an action flick, and Keanu Reeves delivers a fantastic performance. It is highly recommended for action movie fans.'

It's not clear that this is better or worse - but the incorrect `No reviews have a rating of 10.` answer isn't great!

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.
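The two steps above can be sketched as a generic "retrieve, then rerank" function. Everything here is a toy stand-in: `word_overlap` plays the role of the first-stage retriever and `overlap_density` plays the role of the reranker. Both are hypothetical scoring functions for illustration and have nothing to do with how Cohere's Rerank model actually scores documents.

```python
def retrieve_then_rerank(query, docs, first_stage_score, rerank_score, fetch_k=10, top_n=3):
    """Fetch `fetch_k` candidates cheaply, then keep the `top_n` best under a second scorer."""
    candidates = sorted(docs, key=lambda d: first_stage_score(query, d), reverse=True)[:fetch_k]
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:top_n]

# Hypothetical stand-ins: a real pipeline would use vector similarity for the
# first stage and a trained reranking model for the second.
def word_overlap(query, doc):
    """Number of distinct words shared by the query and the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def overlap_density(query, doc):
    """Shared-word count divided by document length, favoring focused documents."""
    return word_overlap(query, doc) / len(doc.lower().split())

toy_docs = [
    "john wick action is great",
    "john wick has a dog and a car and lots of action",
    "john wick",
    "a romantic comedy",
]
toy_compressed = retrieve_then_rerank(
    "john wick action", toy_docs, word_overlap, overlap_density, fetch_k=3, top_n=2
)
print(toy_compressed)
```

The first stage casts a wide net (`fetch_k`), and the second stage narrows it down to `top_n`. This is why we asked our naive retriever for 10 documents earlier: the reranker needs a pool to choose from.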

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [23]:
pip install --upgrade langchain_cohere

Note: you may need to restart the kernel to use updated packages.


In [24]:
pip install --upgrade langchain

Note: you may need to restart the kernel to use updated packages.


In [34]:
pip install pydantic==1.10.7

Note: you may need to restart the kernel to use updated packages.


In [25]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

In [162]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import OpenAI

llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [163]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [141]:
contextual_compression_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, people generally liked John Wick.'

In [28]:
contextual_compression_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

"Yes, there is a review with a rating of 10. Here is the URL to that review: '/review/rw4854296/?ref_=tt_urv'"

In [29]:
contextual_compression_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick 2, after resolving his issues with the Russian mafia, John Wick returns home but is visited by the mobster Santino D'Antonio who asks him to kill his sister Gianna D'Antonio in Rome. When John accomplishes this task, Santino puts a seven-million dollar contract on him, leading to professional killers coming after him. Wick promises to kill Santino who is not protected by his marker anymore."

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context
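The three steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's `MultiQueryRetriever` implementation: `fake_generate_queries` stands in for the LLM call, `fake_retrieve` stands in for the base retriever, and documents are represented by plain string ids. All of these are hypothetical.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Retrieve documents for each generated query and return the unique union."""
    seen = set()
    unique_docs = []
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:  # de-duplicate while preserving first-seen order
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs

# Hypothetical stand-in for the LLM that rewrites the user's question.
def fake_generate_queries(question):
    return ["was john wick well received", "john wick audience opinion"]

# Hypothetical stand-in for a retriever: a fixed query -> documents lookup.
fake_index = {
    "was john wick well received": ["review_a", "review_b"],
    "john wick audience opinion": ["review_b", "review_c"],
}

def fake_retrieve(query):
    return fake_index.get(query, [])

toy_merged = multi_query_retrieve(
    "Did people generally like John Wick?", fake_generate_queries, fake_retrieve
)
print(toy_merged)
```

Notice that `review_b` is returned by both queries but only appears once in the final context. Whether to also retrieve for the original question is a design choice this sketch leaves out.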

So, how hard is it to set up? Not bad! Let's see it down below!



In [30]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [31]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [32]:
multi_query_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"Yes, people generally liked John Wick based on the reviews provided, which praised the film for its action sequences, Keanu Reeves' performance, and overall entertainment value."

In [33]:
multi_query_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

"I'm sorry, there are no reviews with a rating of 10 in the provided context."

In [34]:
multi_query_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick, a retired assassin named John Wick comes out of retirement when someone kills his dog and steals his car. He goes on a rampage of carnage to seek revenge and settle old debts by helping take over the Assassin's Guild. The movie involves John Wick flying around to Italy, Canada, and Manhattan, killing numerous assassins along the way."

## Task 8: Parent Document Retriever

The Parent Document Retriever follows a "small-to-big" strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
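Here is a minimal sketch of the "search small, return big" idea in plain Python. It makes several simplifying assumptions: fixed-size character chunks instead of a `RecursiveCharacterTextSplitter`, a dictionary as the parent store, and a hypothetical substring-count `keyword_score` in place of embedding similarity.

```python
def split_into_chunks(text, chunk_size):
    """Split text into fixed-size character chunks (a simplification of a real splitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def build_child_index(parents, chunk_size):
    """Return (child_chunk, parent_id) pairs; parent_id is the key in `parents`."""
    return [
        (chunk, parent_id)
        for parent_id, text in parents.items()
        for chunk in split_into_chunks(text, chunk_size)
    ]

def small_to_big(query, parents, child_index, score, k=2):
    """Rank child chunks by `score`, then return the unique parents of the top-k children."""
    ranked = sorted(child_index, key=lambda pair: score(query, pair[0]), reverse=True)[:k]
    parent_ids = []
    for _, parent_id in ranked:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[pid] for pid in parent_ids]

# Hypothetical scorer: counts occurrences of each query word in the chunk.
def keyword_score(query, chunk):
    return sum(chunk.lower().count(word) for word in query.lower().split())

# Made-up parent documents keyed by id (this dict plays the role of the docstore).
toy_parents = {
    "p1": "John Wick avenges his dog. The action is relentless and stylish.",
    "p2": "A quiet drama about farming. Nothing explodes at all.",
}
toy_children = build_child_index(toy_parents, chunk_size=30)
toy_result = small_to_big("dog", toy_parents, toy_children, keyword_score, k=1)
print(toy_result)
```

The match happens on a 30-character child chunk, but the full parent review comes back, including the sentence about the action that the matching chunk itself doesn't contain.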

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [35]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

We'll need to set up a new Qdrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [36]:
client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = Qdrant(
    collection_name="full_documents", embeddings=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)


Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [37]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [38]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [39]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [40]:
parent_document_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"People's opinions about John Wick seem to vary. Some really enjoy the series, while others have criticized certain aspects of the movies."

In [41]:
parent_document_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 3". The URL to that review is: \'/review/rw4854296/?ref_=tt_urv\''

In [42]:
parent_document_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick 2, John Wick, a retired assassin, is forced back into action when someone steals his car, leading to a lot of carnage. He is then called on to pay off an old debt by helping Ian McShane take over the Assassin's Guild by traveling to Italy, Canada, and Manhattan and killing many assassins."

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to answer the performance question more rigorously.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes two or more retrievers and combines their retrieved documents based on a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.
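To see how rank fusion combines several ranked lists, here is a minimal sketch of weighted Reciprocal Rank Fusion. Each document earns `weight / (c + rank)` from every list it appears in, and the totals decide the final order. The constant `c=60` is the value suggested in the RRF paper and is an assumption here, as are the toy document ids. This is an illustration, not the `EnsembleRetriever` source.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document ids with weighted Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher-ranked documents (small rank) contribute more to the total.
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Made-up rankings from two hypothetical retrievers.
toy_bm25_ranking = ["doc_a", "doc_b", "doc_c"]
toy_dense_ranking = ["doc_b", "doc_d", "doc_a"]

toy_fused = weighted_rrf([toy_bm25_ranking, toy_dense_ranking], weights=[0.5, 0.5])
print(toy_fused)
```

`doc_b` wins because it ranks near the top of both lists, even though it is first in only one of them. Only ranks are used, so the retrievers' raw scores never need to be comparable.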

In [199]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [200]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [45]:
ensemble_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"People generally liked John Wick based on the reviews provided. The action, choreography, and Keanu Reeves' performance were highly praised. The film was described as fun, stylish, and engaging, with many reviewers recommending it to action movie fans. Some reviews mentioned that the first film in the series was special and different from typical action films, while others appreciated the unique world-building and intense action sequences. Overall, the majority of reviews were positive, highlighting John Wick's appeal to action fans."

In [46]:
ensemble_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

"Yes, there is a review with a rating of 10. Here is the URL to that review: '/review/rw4854296/?ref_=tt_urv'"

In [47]:
ensemble_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In John Wick, an ex-hitman comes out of retirement to seek vengeance on the gangsters who killed his dog and took everything from him. The story is filled with violent action, shootouts, and breathtaking fights as John Wick unleashes destruction against those who come after him.'

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of a number of [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

In [48]:
!pip install -qU langchain_experimental

We'll use the `percentile` thresholding method for this example, which calculates the distances between adjacent sentences and then splits the sequence wherever a distance exceeds a given percentile of all distances.
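The percentile method can be sketched in plain Python as follows. This is a simplified illustration rather than the `SemanticChunker` implementation: the 2-D vectors are made-up toy "embeddings", the `pct=95` cutoff is an assumed value, and the helper names are ours.

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list (pct in [0, 100])."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100
    lower, upper = math.floor(position), math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)

def semantic_chunks(sentences, vectors, pct=95):
    """Group consecutive sentences, starting a new chunk where the distance exceeds the percentile."""
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # large jump in meaning: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

toy_sentences = [
    "The action is superb.",
    "The fight scenes are stunning.",
    "The plot is thin.",
    "The story barely exists.",
]
# Hypothetical embeddings: the first two point one way, the last two another.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

toy_chunks = semantic_chunks(toy_sentences, toy_vectors)
print(toy_chunks)
```

The only large distance sits between the second and third sentences, so that's where the split happens: the two "action" sentences end up in one chunk and the two "plot" sentences in another.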

In [49]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [50]:
semantic_documents = semantic_chunker.split_documents(documents)

Let's create a new vector store.

In [51]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWickSemantic"
)

We'll use naive retrieval for this example.

In [52]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [53]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [54]:
semantic_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'People generally liked John Wick based on the positive reviews provided.'

In [55]:
semantic_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

"Yes, there are reviews with a rating of 10. Here are the URLs to those reviews:\n1. '/review/rw4854296/?ref_=tt_urv' - A Masterpiece & Brilliant Sequel\n2. '/review/rw8944843/?ref_=tt_urv' - How Can Anyone Choose to Watch Marvel Over This?"

In [56]:
semantic_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In John Wick, the main character seeks revenge on the people who took something he loved from him. Specifically, in the first movie, they killed his dog and stole his car, leading him to unleash a carefully orchestrated maelstrom of destruction against those who wronged him.'

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.
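If you'd also like a quick local view of latency, one option is to time each call yourself. The sketch below is only an illustration: `time_invocations` is a helper name we made up, and `fake_invoke` is a hypothetical stand-in for something like a chain's `.invoke` method.

```python
import time

def time_invocations(invoke, questions):
    """Call `invoke` once per question and return (answers, per-call latencies in seconds)."""
    answers, latencies = [], []
    for question in questions:
        start = time.perf_counter()
        answers.append(invoke({"question": question}))
        latencies.append(time.perf_counter() - start)
    return answers, latencies

# Hypothetical stand-in for a retrieval chain's `.invoke` method.
def fake_invoke(inputs):
    return f"answer to: {inputs['question']}"

toy_answers, toy_latencies = time_invocations(fake_invoke, ["q1", "q2"])
toy_mean_latency = sum(toy_latencies) / len(toy_latencies)
print(f"mean latency: {toy_mean_latency:.6f}s")
```

Wall-clock timings like this include network and API variability, so compare retrievers over the same question set and ideally over more than one run.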

In [None]:
### YOUR CODE HERE

In [64]:
import pandas as pd

test_df = pd.read_csv("/Users/xico/AIE4/Week 7/Day 2/synthetic_jw_dataset.csv")

In [None]:
test_df

In [145]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

In [124]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"Retriever Evaluation - {uuid4().hex[0:8]}"

In [148]:
from uuid import uuid4

unique_id = uuid4().hex[0:8]

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"LangSmith AIM Retriever Eval - {unique_id}"

In [127]:
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass('Enter your LangSmith API key: ')

In [128]:
ensemble_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

"Yes, there is a review with a rating of 10. Here is the URL to that review: '/review/rw4854296/?ref_=tt_urv'"

In [90]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

In [170]:
from ragas import evaluate
from ragas.metrics import (
    context_recall,
    context_precision,
    answer_correctness
)

metrics = [
    context_recall,
    context_precision,
    answer_correctness
]

In [146]:
from datasets import Dataset
from langsmith import traceable

def run_evaluation(retrieval_chain):
    """Run every test question through `retrieval_chain` and score the results with Ragas."""
    answers = []
    contexts = []

    for question in test_questions:
        # Use the chain that was passed in, so the same helper works for every retriever
        response = retrieval_chain.invoke({"question" : question})
        answers.append(response["response"].content)
        contexts.append([context.page_content for context in response["context"]])
    response_dataset = Dataset.from_dict({
        "question" : test_questions,
        "answer" : answers,
        "contexts" : contexts,
        "ground_truth" : test_groundtruths
    })
    return evaluate(response_dataset, metrics)

In [151]:
answers = []
contexts = []

for question in test_questions:
  response = naive_retrieval_chain.invoke({"question" : question})
  answers.append(response['response'].content)
  contexts.append([context.page_content for context in response["context"]])

In [152]:
from datasets import Dataset

baseline_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [153]:
baseline_dataset[0]

{'question': 'What makes the set pieces in John Wick: Chapter 4 stand out compared to the previous movies in the franchise?',
 'answer': 'The set pieces in John Wick: Chapter 4 stand out compared to the previous movies in the franchise due to their infusion of creativity that had not been seen before in the series.',
 'contexts': [": 18\nReview: Ever since the original John Wick, the franchise has set a standard of what action in Hollywood should be. Thanks to Chad Stahelski and Keanu Reeve's knowledge of the technical aspects of shooting action, they've been able to deliver expertly choreographed, shot, and edited action films that are now the go to as examples of great action filmmaking. And so, the expectations for the fourth film were fairly high, especially as it became more apparent this was not only the culmination of everything before it, but a whopping 169 minutes long. Rest assured, however, that it delivers in spades. Everything we have come to know and love is here, but wit

In [171]:
baseline_results = evaluate(baseline_dataset, metrics)

Evaluating:   0%|          | 0/102 [00:00<?, ?it/s]

In [172]:
baseline_results

{'context_recall': 0.9265, 'context_precision': 0.7399, 'answer_correctness': 0.6463}

In [173]:
baseline_results_df = baseline_results.to_pandas()
baseline_results_df.head()

Unnamed: 0,question,contexts,answer,ground_truth,context_recall,context_precision,answer_correctness
0,What makes the set pieces in John Wick: Chapte...,[: 18\nReview: Ever since the original John Wi...,The set pieces in John Wick: Chapter 4 stand o...,The answer to given question is not present in...,1.0,0.0,0.177896
1,What sets the action sequences in the JOHN WIC...,[: 3\nReview: John wick has a very simple reve...,The action sequences in the JOHN WICK franchis...,The action sequences in the JOHN WICK franchis...,1.0,0.982143,0.589875
2,How does the level of violence in the third Jo...,[: 0\nReview: It is 5 years since the first Jo...,I don't know.,The level of violence in the third John Wick f...,1.0,0.488889,0.180156
3,Who plays the character John Wick in the movie...,"[: 9\nReview: At first glance, John Wick sound...",Keanu Reeves plays the character John Wick in ...,Keanu Reeves plays the character John Wick in ...,1.0,0.703571,1.0
4,What sets the Mission Impossible franchise apa...,[: 10\nReview: Most American action flicks rel...,The Mission Impossible franchise sets itself a...,The Mission Impossible franchise is one of the...,1.0,0.611111,0.473502


In [253]:
avg_length = round(baseline_results_df['contexts'].apply(lambda x: len(str(x))).mean())

baseline_avg_token_df = pd.DataFrame({'BaslineAvgToken': [avg_length]})

In [254]:
baseline_avg_token_df

Unnamed: 0,BaslineAvgToken
0,9022
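
Note that `len(str(x))` counts the characters of the list's string representation, including brackets, quotes, and separators, so the figure above is a character count rather than a token count. The sketch below shows one way to separate the two. The helper names are hypothetical, and the default of 4 characters per token is only an assumed rule of thumb for English text; an actual tokenizer would give exact counts.

```python
def context_chars(contexts):
    """Total characters across the retrieved passages (no list punctuation)."""
    return sum(len(passage) for passage in contexts)


def approx_tokens(contexts, chars_per_token=4):
    """Rough token estimate; chars_per_token=4 is an assumed rule of thumb."""
    return context_chars(contexts) // chars_per_token


sample = ["Keanu Reeves plays John Wick.", "The action is expertly choreographed."]
print(len(str(sample)))       # what the notebook measures: characters of the list repr
print(context_chars(sample))  # characters of the passages themselves
print(approx_tokens(sample))  # rough token estimate under the assumption above
```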


In [174]:
baseline_metrics_df = pd.DataFrame(list(baseline_results.items()), columns=['Metric', 'Baseline'])

In [175]:
baseline_metrics_df

Unnamed: 0,Metric,Baseline
0,context_recall,0.926471
1,context_precision,0.739921
2,answer_correctness,0.646322


In [164]:
answers = []
contexts = []

for question in test_questions:
  response = contextual_compression_retrieval_chain.invoke({"question" : question})
  answers.append(response['response'].content)
  contexts.append([context.page_content for context in response["context"]])

In [165]:
contextual_compression_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [176]:
contextual_compression_results = evaluate(contextual_compression_dataset, metrics)

Evaluating:   0%|          | 0/102 [00:00<?, ?it/s]

In [177]:
contextual_compression_results

{'context_recall': 0.8186, 'context_precision': 0.6217, 'answer_correctness': 0.6331}

In [179]:
contextual_compression_results_df = contextual_compression_results.to_pandas()
contextual_compression_results_df.head()

Unnamed: 0,question,contexts,answer,ground_truth,context_recall,context_precision,answer_correctness
0,What makes the set pieces in John Wick: Chapte...,[- the original John Wick\n- the franchise\n- ...,The set pieces in John Wick: Chapter 4 stand o...,The answer to given question is not present in...,1.0,0.0,0.929085
1,What sets the action sequences in the JOHN WIC...,[- Directed by Chad Stahelski who's a stunt sp...,The action sequences in the JOHN WICK franchis...,The action sequences in the JOHN WICK franchis...,1.0,0.741667,0.501298
2,How does the level of violence in the third Jo...,[- the third John Wick film\n- level of violen...,I don't know the specific impact of the level ...,The level of violence in the third John Wick f...,1.0,0.418889,0.225291
3,Who plays the character John Wick in the movie...,"[Keanu Reeves, John Wick, action film, action ...",Keanu Reeves plays the character John Wick in ...,Keanu Reeves plays the character John Wick in ...,1.0,0.797222,1.0
4,What sets the Mission Impossible franchise apa...,"[Mission Impossible, - ""Mission Impossible fra...","I'm sorry, I don't have specific information a...",The Mission Impossible franchise is one of the...,0.333333,0.2,0.219323


In [251]:
avg_length = round(contextual_compression_results_df['contexts'].apply(lambda x: len(str(x))).mean())

contextual_compresion_avg_token_df = pd.DataFrame({'CCAvgToken': [avg_length]})

In [252]:
contextual_compresion_avg_token_df

Unnamed: 0,CCAvgToken
0,3898


In [207]:
contextual_compression_metrics_df = pd.DataFrame(list(contextual_compression_results.items()), columns=['Metric', 'ContextualCompression'])

In [208]:
contextual_compression_metrics_df

Unnamed: 0,Metric,ContextualCompression
0,context_recall,0.818627
1,context_precision,0.621724
2,answer_correctness,0.633126


In [182]:
answers = []
contexts = []

for question in test_questions:
  response = multi_query_retrieval_chain.invoke({"question" : question})
  answers.append(response['response'].content)
  contexts.append([context.page_content for context in response["context"]])

In [183]:
multi_query_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [184]:
multi_query_results = evaluate(multi_query_dataset, metrics)

Evaluating:   0%|          | 0/102 [00:00<?, ?it/s]

In [185]:
multi_query_results

{'context_recall': 0.9118, 'context_precision': 0.6728, 'answer_correctness': 0.6522}

In [186]:
multi_query_results_df = multi_query_results.to_pandas()
multi_query_results_df.head()

Unnamed: 0,question,contexts,answer,ground_truth,context_recall,context_precision,answer_correctness
0,What makes the set pieces in John Wick: Chapte...,[: 19\nReview: John Wick: Chapter 4 picks up w...,The set pieces in John Wick: Chapter 4 stand o...,The answer to given question is not present in...,1.0,0.0,0.924665
1,What sets the action sequences in the JOHN WIC...,[: 3\nReview: John wick has a very simple reve...,The action sequences in the JOHN WICK franchis...,The action sequences in the JOHN WICK franchis...,1.0,0.84747,0.66475
2,How does the level of violence in the third Jo...,[: 2\nReview: The first three John Wick films ...,I don't know the specific impact of the level ...,The level of violence in the third John Wick f...,1.0,0.607384,0.232388
3,Who plays the character John Wick in the movie...,"[: 9\nReview: At first glance, John Wick sound...",Keanu Reeves plays the character John Wick in ...,Keanu Reeves plays the character John Wick in ...,1.0,0.806602,1.0
4,What sets the Mission Impossible franchise apa...,[: 10\nReview: Most American action flicks rel...,What sets the Mission Impossible franchise apa...,The Mission Impossible franchise is one of the...,1.0,0.791667,0.361774


In [249]:
avg_length = round(multi_query_results_df['contexts'].apply(lambda x: len(str(x))).mean())

multi_query_avg_token_df = pd.DataFrame({'MQAvgToken': [avg_length]})

In [250]:
multi_query_avg_token_df

Unnamed: 0,MQAvgToken
0,10951


In [209]:
multi_query_metrics_df = pd.DataFrame(list(multi_query_results.items()), columns=['Metric', 'MultiQuery'])

In [210]:
multi_query_metrics_df

Unnamed: 0,Metric,MultiQuery
0,context_recall,0.911765
1,context_precision,0.672826
2,answer_correctness,0.652217


In [187]:
answers = []
contexts = []

for question in test_questions:
  response = parent_document_retrieval_chain.invoke({"question" : question})
  answers.append(response['response'].content)
  contexts.append([context.page_content for context in response["context"]])

In [190]:
parent_document_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [191]:
parent_document_results = evaluate(parent_document_dataset, metrics)

Evaluating:   0%|          | 0/102 [00:00<?, ?it/s]

In [192]:
parent_document_results

{'context_recall': 0.7770, 'context_precision': 0.7475, 'answer_correctness': 0.6293}

In [193]:
parent_document_results_df = parent_document_results.to_pandas()
parent_document_results_df.head()

Unnamed: 0,question,contexts,answer,ground_truth,context_recall,context_precision,answer_correctness
0,What makes the set pieces in John Wick: Chapte...,[: 14\nReview: By now you know what to expect ...,The set pieces in John Wick: Chapter 4 stand o...,The answer to given question is not present in...,1.0,0.0,0.178577
1,What sets the action sequences in the JOHN WIC...,[: 1\nReview: I'm a fan of the John Wick films...,What sets the action sequences in the JOHN WIC...,The action sequences in the JOHN WICK franchis...,0.333333,1.0,0.347527
2,How does the level of violence in the third Jo...,[: 14\nReview: By now you know what to expect ...,"Based on the context provided, the level of vi...",The level of violence in the third John Wick f...,1.0,0.583333,0.792764
3,Who plays the character John Wick in the movie...,[: 0\nReview: The best way I can describe John...,Keanu Reeves plays the character John Wick in ...,Keanu Reeves plays the character John Wick in ...,1.0,1.0,1.0
4,What sets the Mission Impossible franchise apa...,[: 10\nReview: Most American action flicks rel...,The Mission Impossible franchise sets itself a...,The Mission Impossible franchise is one of the...,1.0,1.0,0.68676


In [247]:
avg_length = round(parent_document_results_df['contexts'].apply(lambda x: len(str(x))).mean())

parent_doc_avg_token_df = pd.DataFrame({'PDAvgToken': [avg_length]})

In [248]:
parent_doc_avg_token_df

Unnamed: 0,PDAvgToken
0,1350


In [211]:
parent_document_metrics_df = pd.DataFrame(list(parent_document_results.items()), columns=['Metric', 'ParentDoc'])

In [212]:
parent_document_metrics_df

Unnamed: 0,Metric,ParentDoc
0,context_recall,0.776961
1,context_precision,0.747549
2,answer_correctness,0.629344


In [201]:
answers = []
contexts = []

for question in test_questions:
  response = ensemble_retrieval_chain.invoke({"question" : question})
  answers.append(response['response'].content)
  contexts.append([context.page_content for context in response["context"]])
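
Since latency is one of the factors compared across retrievers, it can help to record the time per question while running a loop like the one above. The following is a minimal sketch using only the standard library. `timed_calls` and `fake_invoke` are hypothetical names; in the notebook, the real chain's `invoke` method would take the place of `fake_invoke`.

```python
import time


def timed_calls(fn, inputs):
    """Call fn on each input; return the results and the seconds each call took."""
    results, durations = [], []
    for item in inputs:
        start = time.perf_counter()
        results.append(fn(item))
        durations.append(time.perf_counter() - start)
    return results, durations


def fake_invoke(inputs):
    """Hypothetical stand-in for `chain.invoke`; a real chain call would go here."""
    return {"response": f"answer to: {inputs['question']}"}


questions = [{"question": "Who plays John Wick?"}, {"question": "Who directed it?"}]
results, durations = timed_calls(fake_invoke, questions)
print(f"mean seconds per question: {sum(durations) / len(durations):.6f}")
```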

In [202]:
ensemble_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [204]:
ensemble_results = evaluate(ensemble_dataset, metrics)

Evaluating:   0%|          | 0/102 [00:00<?, ?it/s]

In [205]:
ensemble_results

{'context_recall': 0.9167, 'context_precision': 0.6061, 'answer_correctness': 0.6883}

In [206]:
ensemble_results_df = ensemble_results.to_pandas()
ensemble_results_df.head()

Unnamed: 0,question,contexts,answer,ground_truth,context_recall,context_precision,answer_correctness
0,What makes the set pieces in John Wick: Chapte...,[: 19\nReview: John Wick: Chapter 4 picks up w...,The set pieces in John Wick: Chapter 4 stand o...,The answer to given question is not present in...,1.0,0.0,0.176822
1,What sets the action sequences in the JOHN WIC...,[: 1\nReview: I'm a fan of the John Wick films...,The action sequences in the JOHN WICK franchis...,The action sequences in the JOHN WICK franchis...,1.0,0.780953,0.880688
2,How does the level of violence in the third Jo...,"[: 9\nReview: At first glance, John Wick sound...",The level of violence in the third John Wick f...,The level of violence in the third John Wick f...,1.0,0.523265,0.638225
3,Who plays the character John Wick in the movie...,"[Keanu Reeves, : 0\nReview: The best way I can...",Keanu Reeves plays the character John Wick in ...,Keanu Reeves plays the character John Wick in ...,1.0,0.756942,1.0
4,What sets the Mission Impossible franchise apa...,[: 10\nReview: Most American action flicks rel...,I don't know.,The Mission Impossible franchise is one of the...,1.0,0.53125,0.18178


In [238]:
len(str(ensemble_results_df.iloc[0].contexts))

18533

In [245]:
avg_length = round(ensemble_results_df['contexts'].apply(lambda x: len(str(x))).mean())

ensemble_avg_token_df = pd.DataFrame({'EnsembleAvgToken': [avg_length]})

In [246]:
ensemble_avg_token_df

Unnamed: 0,EnsembleAvgToken
0,16960


In [213]:
ensemble_metrics_df = pd.DataFrame(list(ensemble_results.items()), columns=['Metric', 'Ensemble'])

In [214]:
ensemble_metrics_df

Unnamed: 0,Metric,Ensemble
0,context_recall,0.916667
1,context_precision,0.606057
2,answer_correctness,0.688328


In [215]:
answers = []
contexts = []

for question in test_questions:
  response = semantic_retrieval_chain.invoke({"question" : question})
  answers.append(response['response'].content)
  contexts.append([context.page_content for context in response["context"]])

In [216]:
semantic_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [217]:
semantic_results = evaluate(semantic_dataset, metrics)

Evaluating:   0%|          | 0/102 [00:00<?, ?it/s]

In [218]:
semantic_results

{'context_recall': 0.8284, 'context_precision': 0.6835, 'answer_correctness': 0.6140}

In [220]:
semantic_results_df = semantic_results.to_pandas()
semantic_results_df.head()

Unnamed: 0,question,contexts,answer,ground_truth,context_recall,context_precision,answer_correctness
0,What makes the set pieces in John Wick: Chapte...,[: 18\nReview: Ever since the original John Wi...,The set pieces in John Wick: Chapter 4 stand o...,The answer to given question is not present in...,1.0,0.0,0.924168
1,What sets the action sequences in the JOHN WIC...,[This is EXACTLY what you want out of an actio...,The action sequences in the JOHN WICK franchis...,The action sequences in the JOHN WICK franchis...,1.0,0.916667,0.327083
2,How does the level of violence in the third Jo...,[: 14\nReview: By now you know what to expect ...,"Based on the reviews provided, the level of vi...",The level of violence in the third John Wick f...,1.0,0.532738,0.650598
3,Who plays the character John Wick in the movie...,[John Wick (Reeves) is out to seek revenge on ...,Keanu Reeves plays the character John Wick in ...,Keanu Reeves plays the character John Wick in ...,1.0,0.415476,1.0
4,What sets the Mission Impossible franchise apa...,[: 10\nReview: Most American action flicks rel...,I don't know.,The Mission Impossible franchise is one of the...,0.5,0.784722,0.18178


In [255]:
avg_length = round(semantic_results_df['contexts'].apply(lambda x: len(str(x))).mean())

semantic_avg_token_df = pd.DataFrame({'SemanticAvgToken': [avg_length]})

In [256]:
semantic_avg_token_df

Unnamed: 0,SemanticAvgToken
0,6488


In [221]:
semantic_metrics_df = pd.DataFrame(list(semantic_results.items()), columns=['Metric', 'Semantic'])

In [222]:
semantic_metrics_df

Unnamed: 0,Metric,Semantic
0,context_recall,0.828431
1,context_precision,0.683538
2,answer_correctness,0.614006


In [227]:
dataframes = [baseline_metrics_df, contextual_compression_metrics_df, multi_query_metrics_df, parent_document_metrics_df, ensemble_metrics_df, semantic_metrics_df]

# Join the per-method metric tables on the shared 'Metric' column
merged_df = dataframes[0]

for df in dataframes[1:]:
    merged_df = pd.merge(merged_df, df, on='Metric', how='inner')

In [224]:
merged_df

Unnamed: 0,Metric,Baseline,ContextualCompression,MultiQuery,ParentDoc,Ensemble,Semantic
0,context_recall,0.926471,0.818627,0.911765,0.776961,0.916667,0.828431
1,context_precision,0.739921,0.621724,0.672826,0.747549,0.606057,0.683538
2,answer_correctness,0.646322,0.633126,0.652217,0.629344,0.688328,0.614006


In [226]:
method_columns = ['Baseline', 'ContextualCompression', 'MultiQuery', 'ParentDoc', 'Ensemble', 'Semantic']

# Best score per metric and the retrieval method that achieved it
merged_df['MaxValue'] = merged_df[method_columns].max(axis=1)
merged_df['MaxMetric'] = merged_df[method_columns].idxmax(axis=1)

merged_df['HighestValue'] = merged_df['MaxValue'].round(2).astype(str) + ' (' + merged_df['MaxMetric'] + ')'

merged_df = merged_df.drop(columns=['MaxValue', 'MaxMetric'])

merged_df

Unnamed: 0,Metric,Baseline,ContextualCompression,MultiQuery,ParentDoc,Ensemble,Semantic,HighestValue
0,context_recall,0.926471,0.818627,0.911765,0.776961,0.916667,0.828431,0.93 (Baseline)
1,context_precision,0.739921,0.621724,0.672826,0.747549,0.606057,0.683538,0.75 (ParentDoc)
2,answer_correctness,0.646322,0.633126,0.652217,0.629344,0.688328,0.614006,0.69 (Ensemble)


In [264]:
token_dfs = pd.concat([baseline_avg_token_df, contextual_compresion_avg_token_df, multi_query_avg_token_df, parent_doc_avg_token_df, ensemble_avg_token_df, semantic_avg_token_df], axis=1)

In [265]:
token_dfs

Unnamed: 0,BaslineAvgToken,CCAvgToken,MQAvgToken,PDAvgToken,EnsembleAvgToken,SemanticAvgToken
0,9022,3898,10951,1350,16960,6488


Based on these evaluations, the average length of the retrieved context, and the time needed to loop through `test_questions`, several strategies could be considered the 'best' for this dataset. Note that the 'AvgToken' figures above are character counts of the stringified context list (`len(str(x))`), so they are a proxy for input tokens rather than true token counts.

Looking purely at answer_correctness, the Ensemble retrieval method is best (0.69). However, it also has the largest average context length (16,960), which increases the cost of each query. It was slow as well: looping through the test questions took about 9 minutes in total, or ~16 seconds per question. Given the cost and time, Ensemble may not be the best choice despite having the highest answer_correctness and a high context_recall (0.92).

Weighing the RAGAS metric scores together with average context length and time per query, MultiQuery and the Baseline (Naive) retrieval methods are the two that stand out. Both have a context_recall of roughly 0.91 to 0.93, much higher than Contextual Compression, ParentDoc, and Semantic, and both have higher answer_correctness scores than those three as well. Both also retrieve longer contexts (10,951 and 9,022 respectively), but given the RAGAS metrics, the additional cost could be worth it.
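
One way to make this trade-off concrete is to express each method relative to the Baseline: how much answer_correctness changes, and how much longer or shorter the retrieved context is. The sketch below does that with values copied from the two summary tables above. The function name `relative_to_baseline` is hypothetical, and the lengths are the character-based figures from the notebook, not true token counts.

```python
# Values copied from the merged metrics table and the average-length table above.
answer_correctness = {
    "Baseline": 0.646322,
    "ContextualCompression": 0.633126,
    "MultiQuery": 0.652217,
    "ParentDoc": 0.629344,
    "Ensemble": 0.688328,
    "Semantic": 0.614006,
}
avg_context_length = {
    "Baseline": 9022,
    "ContextualCompression": 3898,
    "MultiQuery": 10951,
    "ParentDoc": 1350,
    "Ensemble": 16960,
    "Semantic": 6488,
}


def relative_to_baseline(method):
    """Change in answer_correctness and context-length ratio versus Baseline."""
    return {
        "correctness_delta": answer_correctness[method] - answer_correctness["Baseline"],
        "length_ratio": avg_context_length[method] / avg_context_length["Baseline"],
    }


for method in answer_correctness:
    row = relative_to_baseline(method)
    print(f"{method:<22} {row['correctness_delta']:+.3f}  {row['length_ratio']:.2f}x")
```

On these numbers, Ensemble gains about 0.04 in answer_correctness over the Baseline while retrieving roughly 1.9 times as much context.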

